The Genomics of Emerging Infectious Disease
www.plos.org
A collection of essays, perspectives, and reviews from six PLoS Journals about how genomics can revolutionize our understanding of emerging infectious disease.
Produced with support from Google.org. The PLoS Journal editors have sole responsibility for the content of this collection.
Image credits: Brindley et al., PLoS Neglected Tropical Diseases 3(10) e538. McHardy et al., PLoS Pathogens 5(10) e1000566. Salama et al., PLoS Pathogens 5(10) e1000544.
Editorial
Genomics of Emerging Infectious Disease: A PLoS Collection
Jonathan A. Eisen1*, Catriona J. MacCallum2*
1 University of California Davis, Davis, California, United States of America, 2 Public Library of Science, Cambridge, United Kingdom
Today, the Public Library of Science publishes a collection of
essays, perspectives, and reviews about how genomics, with all its
associated tools and techniques, can provide insights into our
understanding of emerging infectious disease
(http://ploscollections.org/emerginginfectiousdisease/) [1–13]. This collection, focused on
human disease, is particularly timely as pandemic H1N1 2009
influenza (commonly referred to as swine flu) spreads around the
globe, and government officials, the public, journalists, bloggers, and
tweeters strive to find out more. People want to know if this flu poses
more of a threat than other seasonal flu strains, how fast it’s
spreading (and where), and what can be done to contain it. As this
collection illustrates, the increasing speed at which complete genome
sequences and other genome-scale data can be generated for
individual isolates and strains of a pathogen provides tremendous
opportunities to identify the molecular changes in these disease
agents that will enable us to track their spread and evolution through
time (e.g., [3,7,8]) and generate the vaccines and drugs necessary to
combat them (e.g., [5–7]). The collection also shines a spotlight on
specific pathogens, some familiar and widespread, such as the
influenza A virus (e.g., [9]); some “reemerging,” such as the
Mycobacterium tuberculosis complex that causes tuberculosis [10]; and
some identified only recently, as with the bacterium Helicobacter pylori
(which causes peptic ulcers and gastric cancer [11]).
There is no simple definition of an emerging disease, but it can
be loosely described as a disease that is novel in some way—for
example, one that displays a change in geographic location,
genetics, or function. Emerging infectious diseases are caused by a
wide range of organisms, but they are perhaps best typified by
zoonotic viral diseases that cross from animal to human hosts and
can have a devastating impact on human health, causing a high
disease burden and mortality [8]. These zoonotic diseases include
monkeypox, Hendra virus, Nipah virus, and severe acute
respiratory syndrome coronavirus (SARS-CoV), in addition to
influenza A and the lentiviruses that cause AIDS. The apparently
increased transmission of pathogens from animals to humans over
the recent decades has been attributed to the unintended
consequences of globalization as well as environmental factors
and changes in agricultural practices [8]. Generally, the burden of
these diseases is most strongly felt by those in developing countries.
Brindley et al. [12] point to the debilitating effects of the most
common human infectious agent in such areas—helminths
(parasitic worms)—and the role that genomics plays in advancing
our understanding of molecular and medical helminthology.
Compounding the problem of emerging infectious diseases in
developing countries is the reality that researchers in developing
countries have often been unable to participate fully in genomics
research, because of their technological isolation and limited
resources. As Coloma and Harris emphasize [13], “collaborations—
starting with capacity building in genomics research—need to be
fostered so that countries that are currently excluded from the
genomics revolution find an entry point for participation.”
This collection is a collaborative effort that combines financial
support from Google.org (which has also sponsored research on
emerging infectious disease through its Predict and Prevent
initiative [14]) with PLoS’s editorial independence and rigor.
Gupta et al. [1] provide Google.org’s perspective and vision for
how systematic application of genomics, proteomics, and bioinformatics
to infectious diseases could predict and prevent the next
pandemic. To realize this vision, they urge the community to unite
under an “Infectious Disease Genomics Project,” analogous to the
Human Genome Project. This is, as the authors admit, a
potentially “grandiose” and difficult proposition. Some researchers
might justifiably argue that much is already being achieved—as
demonstrated by this collection—and that the vision is naïve.
However, as every article in the collection also points out,
tremendous challenges remain if the potential of genomics in this
field is to be realized.
One problem is that, despite the fact that sequencing is now the
method of choice for characterizing new disease agents, and new
substantially faster and cheaper sequencing methods are continually
being produced, we still lack the range of computational tools
necessary to analyze these sequences in sufficient detail [4]. It is
possible to sequence the entire assemblage of viruses in a particular
tissue type or host species [3] and to obtain complete or nearly
complete genome sequences for large samples of bacteria [7]. Yet
we remain in the early, albeit essential, stages of pathogen
discovery (Box 1). These sequences can be interpreted fully only
when integrated with relevant environmental, epidemiological,
and clinical data (e.g., [3,4,8]). And, despite the increased
sequencing, really comprehensive genome data are still only
available for a few key pathogens, which further limits our
understanding. For example, a full quantitative understanding of
the processes that shape the epidemiology and evolution—the
phylodynamics—of RNA viruses is currently possible only for HIV
and influenza A virus [3].
In this collection, you will find not only the views of leading
researchers from several different disciplines, and a provocative
vision from a funding agency, but also the contributions of six
different PLoS journals (PLoS Biology, PLoS Medicine, PLoS
Computational Biology, PLoS Genetics, PLoS Neglected Tropical Diseases,
and PLoS Pathogens). The PLoS open-access model of publishing
makes possible such a large multidisciplinary cross-journal
collection, in which all articles are simultaneously available online
for unrestricted reuse, regardless of venue (see also the podcast that
accompanies the collection;
http://ploscollections.org/podcast/emerginginfectiousdisease.mp3).
Citation: Eisen J, MacCallum CJ (2009) Genomics of Emerging Infectious Disease: A PLoS Collection. PLoS Biol 7(10): e1000224. doi:10.1371/journal.pbio.1000224
Published October 26, 2009
Copyright: © 2009 Eisen, MacCallum. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Competing Interests: The authors have declared that no competing interests exist.
* E-mail: [email protected] (JAE); [email protected] (CJM)
This article is part of the “Genomics of Emerging Infectious Disease” PLoS Journal collection (http://ploscollections.org/emerginginfectiousdisease/).
PLoS Biology | www.plosbiology.org | October 2009 | Volume 7 | Issue 10 | e1000224
Our aim is that this collection will add to other “open science”
activities that have helped provide insights into infectious disease
more quickly than would have been thought feasible only a few
years ago. This accelerated availability of research findings is
exemplified by the recent response to the flu pandemic. Consider,
for example, data access. Traditionally, scientists have released
data after publishing a study. Fortunately, in part due to
experience from genome sequencing projects, prepublication flu
sequence data have been released in a relatively unrestricted
manner to the community [15]. This has in turn enabled
anyone—not just those who collected the data—to carry out
analyses while the epidemic is occurring (when in principle there is
still time to save lives) rather than being forced to provide a
posthumous account of the spread of infection. Such a response
highlights both the importance of early data access and the
removal of restrictions in the use of data (e.g., in many past cases
data might be released but use of the data in presentations and
publications would be limited).
Box 1. A Field Guide to Microbes?
When an American robin (Turdus migratorius) showed up in London a few years ago, birders were rapidly all atwitter and many came flocking to town [22]. Why had this one bird created such a stir? For one main reason—it was out of place. This species is normally found in North America and only very rarely shows up on the other side of the “pond.” Amazingly, this rapid, collective response is not that unusual in the world of birding. When a bird is out of place, people notice quickly.
This story of the errant robin gets to the heart of the subject of this collection because being out of place in a metaphorical way is what defines an emerging infectious disease. Sometimes we have never seen anything quite like the organism or the disease before (e.g., SARS, Legionella). Or perhaps, as with many opportunistic pathogens, we have seen the organism before but it was not previously known to cause disease. In other cases, such as with pandemic H1N1 2009 or E. coli O157:H7, we have seen the organism cause disease before but a new form is causing far more trouble. And of course organisms can be literally out of place, by showing up in a location not expected (e.g., consider the anthrax letters [2]).
Historically, despite the metaphorical similarities with the robin case, the response to emerging infectious disease is almost always much slower. Clearly, there are many reasons for these differences, which we believe are instructive to consider. At least four factors are required for birders’ rapid responses to the arrival of a vagrant bird: (1) knowledge of the natural “fauna” in a particular place, (2) recognition that a specific bird may be out of place, (3) positive identification of the possibly out-of-place bird, and (4) examination of the “normal” place for relatives of the identified bird.
How are these requirements achieved? Mostly through the existence of high-quality field guides that allow one to place an organism such as a bird into the context of what is known about its relatives. This placement in turn is possible because of two key components of field guides. First, such guides contain information about the biological diversity of a group of organisms. This usually includes features such as a taxonomically organized list of species with details for each species on biogeography (distribution patterns across space and time, niche preferences, relative abundance), biological properties (e.g., behavior, size, shape, etc.), and genetic variation within the species (e.g., presence of subspecies). Second, a good field guide provides information on how to identify particular types (e.g., species) of those organisms. With such information, and with a network of interested observers, an out-of-place bird can be detected with relative ease.
In much the same way, a field guide to microbes would be valuable in the study of emerging infectious diseases. The articles in this collection describe what can be considered the beginnings of species-specific field guides for the microbial agents of emerging diseases. If we want to truly gain the benefits that can come from good field guides it will be necessary to expand current efforts to include more organisms, more systematic biogeographical sampling, and more epidemiological and clinical data. But the current efforts are a great start.
Figure: The American Robin (Turdus migratorius). (Photo Credit: NASA). doi:10.1371/journal.pbio.1000224.g001
The value of open access to sequence data is helping to put
pressure both on private organizations to release their sequence
data [16,17] and on all agencies to release other information (e.g.,
metadata about strains) more rapidly. This pressure is not being
brought to bear only on flu data—in this collection Van Voorhis
et al. [5] call on pharmaceutical companies to deposit the
structural coordinates of drug targets from all globally important
infectious disease organisms in public databases.
Of course, data about any infectious disease are not very useful
unless placed in the scientific context of past studies (i.e.,
publications) specifically about the disease or about methods to
analyze such data. It is also important to have access to
information about other diseases and other organisms that might
impact its spread or evolution. Perhaps the most intriguing aspect
of open science in response to flu has been the move toward pre-
journal publication release of findings. Many flu researchers took
the available data, analyzed it, and posted results on blogs [18,19],
wikis [20], and other sites. Although some view this
“non peer-reviewed” release as unseemly, it is clear that it has helped
accelerate the science in the study of pandemic H1N1 2009 and
led to some important journal papers [17]. Indeed, such advances
helped provide one of the stimuli for PLoS’s most recent initiative,
PLoS Currents: Influenza, a Google “Knol,” for the rapid
communication of research results and ideas about flu vetted by
expert moderators [21].
This is not to say there are no possible risks or drawbacks from
more openness. For example, some governments may avoid
releasing data because of fears about discrimination (as was seen in
many aspects of the flu in Mexico). Others worry that complete
openness might foster the spread of misinformation. However, as
Fricke et al. argue in their article on the relationship between
genomics and biopreparedness [2], open source genomic resources
are actually of immense benefit to those in charge of our public
health and biosecurity.
It is clear that ‘‘for all stages of combating emerging infections,
from the early identification of the pathogen to the development
and design of vaccines, application of sophisticated genomics tools
is fundamental to success’’ [8]. It is equally clear that open science
and open access to publications and data will be key to that
success. Whatever one’s position has been on the various open
science initiatives, there is no doubt that the “esoteric” label on
some open science initiatives has largely been eliminated by the
emergence of the H1N1 flu epidemic.
The faster, cheaper, and more openly we can distribute the
discoveries of science, the better for scientific progress and public
health. As this collection emphasizes, managing the threat of
novel, re-emerging, and longstanding infectious diseases is
challenging enough even without barriers to scientific research.
We encourage you to make the most of this collection by sharing,
rating, and annotating the articles using our online commenting
tools. Better yet, join the discussion by providing your own vision
to prevent the emergence and spread of the next rogue pathogen.
References
1. Gupta R, Michalski MH, Rijsberman FR (2009) Can an Infectious Diseases
Genomics Project predict and prevent the next pandemic? PLoS Biol 7:
e1000219. doi:10.1371/journal.pbio.1000219.
2. Fricke WF, Rasko DA, Ravel J (2009) The role of genomics in the identification,
prediction, and prevention of biological threats. PLoS Biol 7: e1000217.
doi:10.1371/journal.pbio.1000217.
3. Holmes EC, Grenfell BT (2009) Discovering the phylodynamics of RNA viruses.
PLoS Comput Biol 5: e1000505. doi:10.1371/journal.pcbi.1000505.
4. Berglund EC, Nystedt B, Andersson SGE (2009) Computational resources in
infectious disease: Limitations and challenges. PLoS Comput Biol 5: e1000481.
doi:10.1371/journal.pcbi.1000481.
5. Van Voorhis WC, Hol WGJ, Myler PJ, Stewart LJ (2009) The role of medical
structural genomics in discovering new drugs for infectious diseases. PLoS Comput
Biol 5: e1000530. doi:10.1371/journal.pcbi.1000530.
6. Seib KL, Dougan G, Rappuoli R (2009) The key role of genomics in modern
vaccine and drug design for emerging infectious diseases. PLoS Genet 5:
e1000612. doi:10.1371/journal.pgen.1000612.
7. Falush D (2009) Toward the use of genomics to study microevolutionary change
in bacteria. PLoS Genet 5: e1000627. doi:10.1371/journal.pgen.1000627.
8. Haagmans BL, Andeweg AC, Osterhaus ADME (2009) The application of
genomics to emerging zoonotic viral diseases. PLoS Pathog 5: e1000557.
doi:10.1371/journal.ppat.1000557.
9. McHardy AC, Adams B (2009) The role of genomics in tracking the evolution of
influenza A virus. PLoS Pathog 5: e1000566. doi:10.1371/journal.ppat.1000566.
10. Comas I, Gagneux S (2009) The past and future of tuberculosis research. PLoS
Pathog 5: e1000600. doi:10.1371/journal.ppat.1000600.
11. Dorer MS, Talarico S, Salama NR (2009) Helicobacter pylori’s unconventional role in health and disease. PLoS Pathog 5: e1000544. doi:10.1371/journal.ppat.1000544.
12. Brindley PJ, Mitreva M, Ghedin E, Lustigman S (2009) Helminth genomics: The implications for human health. PLoS Negl Trop Dis 3: e538. doi:10.1371/journal.pntd.0000538.
13. Coloma J, Harris E (2009) Molecular genomic approaches to infectious diseases in resource-limited settings. PLoS Med 6: e1000142. doi:10.1371/journal.pmed.1000142.
14. Google.org (2008) Predict and Prevent Initiative homepage. Available: http://www.google.org/predict.html. Accessed 16 September 2009.
15. National Center for Biotechnology Information (2009) Influenza Virus Resource. Available: http://www.ncbi.nlm.nih.gov/genomes/FLU/SwineFlu.html. Accessed 11 September 2009.
16. Butler D (2005) Flu researchers slam US agency for hoarding data. Nature 437: 458–459.
17. Smith GJD, Vijaykrishna D, Bahl J, Lycett SJ, Worobey M, et al. (2009) Origins and evolutionary genomics of the 2009 swine-origin H1N1 influenza A epidemic. Nature 459: 1122–1125.
18. Porter S (2009) Did the California H1N1 swine flu come from Ohio? Discovering Biology in a Digital World blog. Available: http://scienceblogs.com/digitalbio/2009/04/did_the_california_h1n1_swine.php. Accessed 11 September 2009.
19. Koppstein D (2009) Swine flu phylogeny, part II. Koppology blog. Available: http://koppology.blogspot.com/2009/04/swine-flu-phylogeny-part-ii.html. Accessed 11 September 2009.
20. Rambaut A (2009) Human/Swine A/H1N1 Influenza Origins and Evolution. Available: http://tree.bio.ed.ac.uk/groups/influenza/. Accessed 11 September 2009.
21. Allen L (2009) Welcome to PLoS Currents: Influenza. PLoS Blog. Available: http://www.plos.org/cms/node/481. Accessed 8 September 2009.
22. Evans I (29 March 2009) American Robin Spotted in South London. Foxnews.com. Available at: http://www.foxnews.com/story/0,2933,189510,00.html. Accessed 14 September 2009.
Essay
Molecular Genomic Approaches to Infectious Diseases in Resource-Limited Settings
Josefina Coloma1,2, Eva Harris1,2*
1 Division of Infectious Diseases and Vaccinology, School of Public Health, University of California Berkeley, Berkeley, California, United States of America, 2 Sustainable
Sciences Institute, San Francisco, California, United States of America
Only half a century after the landmark discovery of the double helix structure of DNA, the human genome was sequenced and a new era of biomedical research was ushered in [1]. Parallel advances in comparative genomics, genetics, high-throughput biochemical techniques, and bioinformatics have provided researchers in wealthy nations with a repertoire of tools to analyze the sequence and functions of organisms at an unprecedented pace and level of detail. Since the beginning of the genomics era [2,3], however, it has been evident that researchers in many developing countries will not be participating fully in genomics research, mainly because of their technological isolation and their limited resources and capacity for genomics research combined with the urgency of many other health priorities. To share the benefits of this technology equitably worldwide, some have advocated that developed and developing countries alike should participate in genomics research to prevent widening of the already large gap in global health resources [4]. As most of the funding that has fueled the rapid advance of the field comes from developed country governments, private initiatives, and industry, however, not much has been done to enable poorer countries to participate as equals in genomics research. Developing countries that are not directly participating in a genomics initiative can, nonetheless, gain from the discoveries of this field in a number of ways, as detailed below. It remains to be seen, however, how the developing world will specifically benefit from the refined genetic information and the drugs and vaccines produced as a result of genomics initiatives. Information exchange and translation of knowledge must be carried out continually through fora accessible to researchers in developing countries. “North–South” collaborations—starting with capacity building in genomics research—need to be fostered so that countries that are currently excluded from the genomics revolution find an entry point for participation. “South–South” collaborations must be encouraged to allow countries with limited resources to pool their human and financial capital, learn from each other’s experience, and share in the benefits of genomics. Ensuring that the benefits of genomics-based medicine are shared by developing countries involves their inclusion in the discussion of ethical, legal, social, economic, and sovereignty issues (Box 1).
Initiatives in the Developing World
In the developing world, the link between human genomics and infectious disease is particularly important. The influence of host genes on the differential susceptibility of individuals or populations to infection and the evolutionary influence of pathogens on the genetic composition of populations by selecting for resistant individuals through coevolution can now be dissected in more detail with genomics. An array of host–pathogen interactions are associated with particular human genes and loci, as best illustrated by the relationship of the malaria pathogen with host genetic evolution. As genetic information about larger populations becomes increasingly available, it is important to disseminate information relating genomics to disease as well as to devise intervention strategies for at-risk populations worldwide [5].
The Essay section contains opinion pieces on topics of broad interest to a general medical audience.
Citation: Coloma J, Harris E (2009) Molecular Genomic Approaches to Infectious Diseases in Resource-Limited Settings. PLoS Med 6(10): e1000142. doi:10.1371/journal.pmed.1000142
Published October 26, 2009
Copyright: © 2009 Coloma, Harris. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Competing Interests: The authors have declared that no competing interests exist.
* E-mail: [email protected]
Provenance: Commissioned; externally peer-reviewed.
This article is part of the “Genomics of Emerging Infectious Disease” PLoS Journal collection (http://ploscollections.org/emerginginfectiousdisease/).
Funding: No specific funding was received for this study/essay.
Summary Points
• Researchers in most developing countries lack the technology, resources, and capacity to participate fully in genomics research.
• Information exchange and knowledge translation must be carried out continually through “North–South” collaborations, starting with capacity building in genomics research; “South–South” collaborations must be encouraged to allow countries with limited resources to pool their human and financial capital and share in the benefits of genomics.
• Several emerging countries have made significant progress in the past decade by sequencing the genomes of organisms with little economic value in the developed world but of great local relevance.
• Molecular diagnostics and molecular epidemiology are the first frontier of genomics, with accessible tools that can be applied in resource-limited settings.
• Developing countries entering the genomics era should start by establishing their priorities and enacting appropriate legislation before embarking on large-scale projects.
• Access to training and capacity building of human resources in bioinformatics and data mining are crucial in the developing world.
PLoS Medicine | www.plosmedicine.org | October 2009 | Volume 6 | Issue 10 | e1000142
Because science and technology are increasingly recognized as vital components for national development, emerging economies and some developing countries are building their infrastructures to promote local innovation and to retain the value of their human, plant, and microbial genomic diversity and research. India, Thailand, South Africa, Indonesia, Brazil, and Mexico, for example, have devoted considerable resources to large-scale population genotyping projects that explore human genetic variation. The Institute for Genomic Medicine (INMEGEN) initiative in Mexico is the largest and most comprehensive, with a broad strategy for incorporating genomics into health care that includes infrastructure, strategic public–private partnerships, research and development in genomics relevant to local health problems, capacity building, and bioethics policy making [6,7]. Although it is unclear how Mexico will make the transition from early-phase investment to translation of knowledge into products and services with health and economic impacts, the country is taking important steps to address the challenges it and other emerging economies face, such as the shortage of trained professionals and the ability to retain local talent. For example, the National Council for Science and Technology (CONACYT) is making efforts to engage the Mexican scientific diaspora with expertise in genomics by offering repatriation packages tied to jobs at universities and research institutes, an approach that is also being adopted by Brazil.
Brazil’s Foundation for Research Support in São Paulo (FAPESP) genomics initiative is also considered a political and scientific achievement. Key to its success has been early investment in training young scientists by sponsoring scholarships abroad in areas related to genomics in which Brazil lacks expertise. To avoid brain drain, beneficiaries are required to return to Brazil for at least four years and must have a committed teaching position at a local university before they leave. One important principle of Brazil’s genomics initiative is that the projects are relevant to Brazil and the rest of the developing world but are low on the list of priorities of the US and Europe, thus providing both an important contribution to genomics and a benefit to Brazil’s economy and scientific endeavor [8]. FAPESP is in the process of sequencing the genes of the parasite that causes schistosomiasis, a disease that afflicts millions in Brazil. Another example in Brazil is the government-funded consortium Organization for Nucleotide Sequencing and Analysis (ONSA), formed to sequence and analyze the genome of the plant pathogen Xylella, which infects orange trees and has great economic impact [9]. This effort led to additional genomics projects on vectors of pathogens that cause major public health problems in Brazil, such as the sandfly Lutzomyia longipalpis, which transmits Leishmania spp., and the Triatominae bug species, which are vectors of Trypanosoma cruzi [10].
The impact of genomics on the developing world is also illustrated by multinational initiatives such as the one funded by the US National Institutes of Health (NIH), the UK’s Wellcome Trust, and private and public institutes in the US and Europe in collaboration with research centers in Brazil, Argentina, Venezuela, and Singapore to sequence the genomes of the parasites T. brucei, T. cruzi, and Leishmania major, which cause the deadly insect-borne diseases African sleeping sickness, Chagas disease, and leishmaniasis, respectively [11–13]. The potential new drug targets identified by these initiatives have great relevance in over 100 developing countries where the diseases take a significant toll on the economy and the quality of life of their citizens. Similar initiatives have resulted in sequencing of other pathogens important to medicine and agriculture. The data from these projects are usually freely available online for data mining and for bioinformatics analysis at remote locations, as most researchers follow the recommendation set by the Bermuda Accord to make DNA sequences (especially human) freely and openly available without delay [14].
Resource-limited countries can enter
the genomics era by creating partnerships
and regional centers for technology and
resources [15]. For example, DNA se-
quencing technology, still unaffordable for
many researchers and public laboratories
because of low-use volume and high costs
of equipment, reagents, and maintenance,
can be affordable if a regional center
provides services to a pool of laboratories
and researchers within a country or
geographical region. As an illustration,
using Brazilian infrastructure, Peru and
Chile joined the global potato sequencing
consortium, which will sequence different
varieties of this important agricultural
species [16]. Brazil has also generated
several open-source bioinformatics tools
for the annotation of bacterial and proto-
zoan genomes that can be used by any
researcher worldwide [17]. In Africa, the
Center for Training in Functional Geno-
mics of Insect Vectors of Human Disease
(AFRO VECTGEN) was initiated by
TDR (Special Programme in Research
and Training in Tropical Diseases) at the
World Health Organization (WHO) and
the Department of Medical Entomology
and Vector Ecology of the Malaria
Research and Training Center in Mali to
train young scientists in functional geno-
mics who will ultimately use genome
sequence data for research on insect
vectors of human disease. The program
triggers collaborative research with neigh-
boring nations and the vector biology
network in Mali, which was built around
research grants funded by the US NIH
and TDR/WHO [18]. The Malaria
Genomic Epidemiology Network (Malar-
iaGEN) uses a consortial approach that
brings together researchers from 21 coun-
tries to overcome scientific, ethical, and
practical challenges to conducting large-
scale studies of genomic variation that
could assist efforts in the fight against
malaria [19]. Successful ‘‘North–South’’
partnerships that help scientists bridge
the genomic gap usually involve a project
of mutual interest. An example is the
Box 1. Societal and Ethical Issues in Genomics to Be Discussed with Full Participation of All Nations
• Issues of confidentiality, stigmatization, discrimination, and misuse of genetic information
• Dangers of a reductionist approach to health issues based only on genetic information that ignores multifactorial determinants
• Issues about intellectual property rights associated with the patentability of DNA sequences, the applications derived from them, and the implications for developing countries [45]
• The potential exploitation of developing-country populations by creating genetic databases for a price [46]
• The potential risk of breeding human beings by design [47]
• Issues about informed consent, standard of care, and availability and pricing of new drugs and vaccines being tested in developing countries [48]
PLoS Medicine | www.plosmedicine.org 2 October 2009 | Volume 6 | Issue 10 | e1000142
common effort of the International Live-
stock Research Institute (ILRI) in Nairobi
and The Institute for Genomic Research
(TIGR; now the J. Craig Venter Institute)
to sequence and annotate the genome
of Theileria parva, a cattle parasite that
causes important economic losses to small
farmers in Africa and elsewhere [20]. This
effort has generated local human resources
in genomics and infrastructure for the
future.
Application of Molecular, Genetic, and Genomic Tools with Limited Resources
Although the genomics initiatives de-
scribed above challenge the notion that
developing countries must wait to import
advances in science and technology that
emerge from the developed world, poorer
developing countries still do not have the
resources to develop their own genomic
projects on a large scale. However,
implementing simpler molecular genetic
approaches to solve health problems is
very feasible in resource-limited settings.
The decades preceding the human and
microbial genome initiatives were high-
lighted by important developments in
molecular and genetic methods applied
to infectious diseases. These developments
were enabled by increasingly available
genetic information about many patho-
gens and their vectors and by molecular
tools such as PCR and powerful sequenc-
ing technologies, which permitted rapid
advances that were successfully introduced
into the developing world with little delay.
Molecular tools for diagnosis have
gained a ready foothold because many
poor countries do not have the facilities for
traditional diagnosis and surveillance.
Thus, diagnosis often relies on clinical
observations or requires that a sample be
sent out to foreign agencies such as the US
Centers for Disease Control and Preven-
tion (CDC) for confirmation. In addition,
even when available, classic techniques
based on serological, microscopic, and
culture-based methods are often lengthy,
of only moderate sensitivity, and not
highly discriminatory at the level of species,
subtype, or strain. By adapting DNA
technologies to the existing infrastructure,
using home-grown solutions to reduce
their cost, and applying them to solve
local health problems, molecular ap-
proaches to detect and type infectious
agents on-site offer real value [21]. Fos-
tering appropriate technology transfer and
capacity-building in the ‘‘South’’ enables
public health laboratories and research
groups in less scientifically developed
countries to participate in global genomics
by contributing their findings and sharing
their expertise with their peers [22,23].
For example, we and others adapted PCR-based
molecular diagnostic techniques for
infectious diseases such as leishmaniasis
and dengue for cost-effective application
in laboratories with minimum infrastructure
and basic technical expertise; these techniques
are now fully validated and used routinely
throughout Latin America [21,24–30].
This approach relies on understanding
the principles of the technologies, decon-
structing them into their basic compo-
nents, and rebuilding them on-site [21].
Another area where molecular tools have
demonstrated their utility in resource-poor
settings is in detecting drug resistance in a
variety of pathogens. This has been facili-
tated in large part by successful ‘‘North–
South’’ partnerships that have served to
train scientists in developing countries in the
use, implementation, and interpretation of
modern molecular methods applied to
emerging drug resistance (see [31]). This
approach has been particularly successful
with certain diseases, such as malaria, HIV/
AIDS, tuberculosis, and drug-resistant bac-
terial infections (both nosocomial and com-
munity-based). Unfortunately, most studies
of drug-resistant pathogens are performed
independently of one another, so data on the
prevalence of resistance markers are scattered
in disparate databases or in unpublished
studies without links to clinical, laboratory,
and pharmacokinetic data needed to relate
the genetic information to relevant pheno-
types. To enable molecular markers of
malaria drug resistance to realize their
potential as public health tools, the World
Antimalarial Resistance Network (WARN)
database is being created with the dual goals
of improving treatment of malaria by
informed drug selection and use and
providing a prompt warning when treat-
ment protocols need to be changed [32,33].
By accelerating the identification and vali-
dation of markers for resistance to combi-
nation therapies, this global database should
help prolong the useful therapeutic lives of
important new drugs.
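The idea of pooling resistance-marker data so that it can trigger a warning can be illustrated with a minimal sketch. The records, site names, and the 0.5 review threshold below are invented for illustration and are not taken from the WARN database or any real surveillance system; the sketch only shows how marker prevalence per site might be computed once scattered results are gathered in one place.

```python
from collections import defaultdict

# Hypothetical surveillance records: (site, sample_id, carries_resistance_marker).
# Sites, sample IDs, and values are invented for illustration only.
RECORDS = [
    ("site_A", "s1", True),
    ("site_A", "s2", False),
    ("site_A", "s3", True),
    ("site_A", "s4", True),
    ("site_B", "s5", False),
    ("site_B", "s6", False),
    ("site_B", "s7", True),
    ("site_B", "s8", False),
]


def marker_prevalence(records):
    """Return {site: fraction of samples carrying the marker}."""
    totals = defaultdict(int)
    positives = defaultdict(int)
    for site, _sample_id, has_marker in records:
        totals[site] += 1
        if has_marker:
            positives[site] += 1
    return {site: positives[site] / totals[site] for site in totals}


def sites_needing_review(prevalence, threshold=0.5):
    """List sites whose prevalence meets or exceeds the assumed threshold."""
    return sorted(site for site, p in prevalence.items() if p >= threshold)


prevalence = marker_prevalence(RECORDS)
flagged = sites_needing_review(prevalence)
print(prevalence, flagged)
```

In practice, as the text notes, such counts only become useful when linked to the clinical, laboratory, and pharmacokinetic data needed to relate a marker to a relevant phenotype; the threshold here is a placeholder, not a recommended value.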
The ultimate power of genetic tools in
resource-limited settings is evident in the
field of molecular epidemiology, where
genetic information about the host or
infectious agent is analyzed together with
clinical and epidemiological data to derive
and implement appropriate interventions.
For example, molecular tools based on
limited sequence information, such as
molecular fingerprinting of a polymorphic
marker, have made important contributions
to strengthening control of tuberculosis in
both developed and developing countries by
enabling analysis of transmission patterns,
helping identify phenotypic variation
among strains, and facilitating evaluation
of the global distribution, relative transmis-
sibility, virulence, and immunogenicity of
different lineages of M. tuberculosis [34–38].
Bacterial infections, food-borne outbreaks,
and viral infections in developing countries,
including the recent H1N1 influenza pan-
demic, are monitored using similar typing
methodologies [39–41]. Molecular tools
permit a refined case definition and thus
have tremendous potential for decision-
making support and informing targeted
public health interventions in countries with
high burdens of disease and limited tech-
nological capabilities and resources.
The trend to move beyond genetic
marker analysis to full genome sequencing
is growing, as complete genome data can
provide a wealth of information about
etiologic agents of disease that was previ-
ously unknown. Full-genome approaches
are not always necessary, however. In
molecular epidemiology of infectious dis-
eases, nucleic acid fingerprinting can
provide enough answers to important
epidemiological questions to allow critical
interventions to be designed (see above). In
fact, too much genetic information, in
some instances, can obscure the picture, as
several closely related pathogenic variants
might coexist in one individual or one
outbreak that differ by only a few
nucleotides but that nonetheless belong
to the same strain or subtype, complicating
the interpretation of results [42].
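The point about closely related variants can be made concrete with a small sketch. Assuming a set of aligned marker sequences of equal length (the sequences, isolate names, and the two-nucleotide cutoff below are hypothetical), isolates that differ by only a few positions can be grouped together rather than treated as distinct strains.

```python
from itertools import combinations

# Hypothetical aligned marker sequences of equal length; not real isolate data.
ISOLATES = {
    "iso1": "ACGTACGTAC",
    "iso2": "ACGTACGTAT",  # 1 difference from iso1
    "iso3": "ACGAACGTAT",  # 1 difference from iso2, 2 from iso1
    "iso4": "TTGAACCTGG",  # distant from the others
}


def hamming(a, b):
    """Count positions at which two aligned sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b))


def group_isolates(isolates, max_diff=2):
    """Single-linkage grouping: link any pair differing by <= max_diff."""
    parent = {name: name for name in isolates}

    def find(name):
        # Follow parent links to the group representative.
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    for a, b in combinations(isolates, 2):
        if hamming(isolates[a], isolates[b]) <= max_diff:
            parent[find(a)] = find(b)

    groups = {}
    for name in isolates:
        groups.setdefault(find(name), []).append(name)
    return sorted(sorted(g) for g in groups.values())


groups = group_isolates(ISOLATES)
print(groups)
```

The cutoff is an assumption chosen for the example; an appropriate value depends on the pathogen, the marker, and the epidemiological question being asked.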
The relatively rapid transfer of DNA
technology from developed to developing
countries is an excellent example of what
can be done by forging strong relation-
ships between universities and research
groups and public-health laboratories
across the world. The validity of adapting
these technologies relies on links with
epidemiological data and translation into
local public health interventions.
Setting Priorities
General international ethical and scien-
tific guidelines for genomics have been
created and are being adapted by nations
participating in the field as it evolves.
Governments and regulatory agencies in
the ‘‘North’’ have prepared for the
eventual implementation of genomics-
based medicine in their respective coun-
tries. A critical problem faced by develop-
ing countries is the lack of national
guidelines for genomics research and its
ethical ramifications. Thus, a priority to be
set by countries in the early steps of
genomic applications is to draw up the
necessary rules and legislation on geno-
mics and to generate procedures for
implementation. Creating the necessary
communication channels between re-
searchers, social scientists, policy makers,
and civil society organizations is also a
critical step. Other key challenges facing
emerging genomics researchers include
proper informed consent and privacy
protocols for research participants, pro-
tecting them against the potential discrim-
ination that might emerge from genetic
information and ensuring that any benefit
that comes to fruition from the research
reaches them. In parallel, capacity build-
ing of scientists in clinical research and of
ethics committees in these issues is essen-
tial. Past experience with ‘‘safari research’’
in which biological samples are taken out-
of-country for research that does not
benefit local populations has prompted
countries such as Mexico, India, and
Brazil to draw up legislation governing
‘‘sovereignty’’ over genomics material and
data that restricts the export of biological
materials for studies abroad and prioritizes
national interests. Poorer countries cur-
rently lacking their own genomics initia-
tives could benefit from similar legislation
balancing the protection of ‘‘genomic
sovereignty’’ while fostering international
collaborations that bring much-needed
resources and increase local scientific
capacity. Beyond the improvement of their
basic genomics research capabilities, gov-
ernments should engage their relevant
ministries to develop a plan to integrate
genetic and genomics products (including
diagnostics, vaccines, therapies, and others)
within the health system and public
health programs with emphasis on acces-
sibility and equity to improve health for
all. A good example of priority setting in
genomics is Mexico’s national genomics
program over the last 15 years (see Box 2).
Sharing Know-How
To strengthen genomics globally, the
tools necessary for analysis of genomics
data are urgently needed in developing
countries, where they are currently under-
utilized [43]. A problem with genomics is
that much of the advanced knowledge is
concentrated in individuals and a few
research centers and companies rather
than in textbooks or academia, restricting
dissemination even though massive
amounts of genomic data and software
are openly accessible through the Internet.
A conscious effort on the part of developed
nations to transfer their knowledge of the
use and analysis of genomic databases
needs to be encouraged to help developing
countries manage their own specific data
on indigenous biological species, local
epidemiology and infectious diseases, bio-
diversity, and other issues. Some successful
programs and initiatives include the Well-
come Trust Sanger Institute training
courses on bioinformatics and genomic
analysis, the Sustainable Sciences Insti-
tute–Broad Institute bioinformatics work-
shops (Figure 1), and the TDR/WHO-
South African Bioinformatics Institute
(SANBI) regional training center. Online
training like the S-star alliance bioinfor-
matics courses and conferences such as the
African Bioinformatics Conference
(Afbix’09) with remote participation are
becoming more widespread and are an
excellent option for countries with limited
resources. GARSA (Genomic Analysis
Resources for Sequence Annotation) is a
flexible Web-based system designed to
analyze genomic data in the context of a
data analysis pipeline. Hosted in Brazil,
this free system aims to facilitate the
analysis, integration, and presentation of
genomic information, concatenating sev-
eral bioinformatics tools and sequence
databases with a simple user interface
[44]. An alternative to on-site sequencing
is to partner with colleagues in more-
developed countries to have samples
processed abroad in sequencing centers.
This is possible only if local legislation
allows for export of biological samples,
and if true partnership and trust exist with
a colleague(s) in the developed country.
Challenges for the Future
As developing countries reevaluate their
role in the genomics era, they will continue
to explore the unique opportunities that
arise from the vast natural and genomic
diversity that they embody. As exemplified
by the successes in Brazil, Mexico, and
several African countries, it is possible to
turn challenges and problems such as
emerging and endemic infectious diseases
into opportunities for unique scientific and
economic growth. Access to sequencing
facilities, open-source databases, and harmonized
methodologies for genomic analysis
is essential for the future of genomics in
the developing world. However, unless a
more concerted effort is made to include
countries with limited scientific development
and resources, it is unlikely that they will
fully participate in genomics projects or use
the technologies available other than by
allowing their genetic material to be acces-
sible to others. As emerging countries set
their own priorities for genomics research
and take ownership of its results, the main
challenge across developing nations remains
access to training and knowledge translation.
Human resources and local capacity in
genomics are thus central to development,
as countries with these skills could partici-
pate in the potential benefits of the field with
respect to health, food security, natural
resource management, and other critical
areas. ‘‘North–South’’ and ‘‘South–South’’
collaborations are a viable and extremely
rewarding way to increase the capacities of
developing countries to access genomic tools
to address unique problems considered of
little economic value outside these countries
but of tremendous importance to the
majority of the world’s population.
Author Contributions
ICMJE criteria for authorship read and met: JC
EH. Wrote the first draft of the paper: JC.
Contributed to the writing of the paper: JC EH.
Box 2. Building a Road toward Genomics: The Mexican Experience 1995–2009 [7]
• Increases in investment in science and technology (S&T) from 0.35% to 0.43% of the GNP and creation of national S&T legislation to increase regional funding
• Four-fold increase in number of students registered for doctoral-level programs
• Participation in international genomics efforts
• Creation of sequencing initiatives of organisms with local agricultural and health relevance
• Creation of a Genomics Sciences degree and two scientific societies in genomics
• Creation of the National Institute of Genomic Medicine (INMEGEN, 2004) with seed funding for modern infrastructure; a strategy for development that includes country-wide strategic alliances; high-level research and academic programs; ethical, legal, and social implications of genomic medicine; and translation of the scientific knowledge into public goods
• Establishment of genomics research priorities based on most prevalent local diseases
• Plans for creation of public–private partnerships to guarantee sustainability
References
1. Venter JC (2003) A part of the human genome sequence. Science 299: 1183–1184.
2. Singer PA, Daar AS (2001) Harnessing genomics and biotechnology to improve global health equity. Science 294: 87–89.
3. Calva E, Cardosa MJ, Gavilondo JV (2002) Avoiding the genomics divide. Trends Biotechnol 20: 368–370.
4. Acharya T, Daar AS, Thorsteinsdottir H, Dowdeswell E, Singer PA (2004) Strengthening the role of genomics in global health. PLoS Med 1: e40. doi:10.1371/journal.pmed.0010040.
5. Manolio TA, Rodriguez LL, Brooks L, Abecasis G, Ballinger D, et al. (2007) New models of collaboration in genome-wide association studies: The Genetic Association Information Network. Nat Genet 39: 1045–1051.
6. Seguin B, Hardy BJ, Singer PA, Daar AS (2008) Genomics, public health and developing countries: The case of the Mexican National Institute of Genomic Medicine (INMEGEN). Nat Rev Genet 9 (Suppl 1): S5–9.
7. Jimenez-Sanchez G, Silva-Zolezzi I, Hidalgo A, March S (2008) Genomic medicine in Mexico: Initial steps and the road ahead. Genome Res 18: 1191–1198.
8. Castilla EE, Luquetti DV (2008) Brazil: Public Health Genomics. Public Health Genomics. E-pub ahead of print (3 Sept). doi:10.1159/000153424.
9. Simpson AJ, Reinach FC, Arruda P, Abreu FA, Acencio M, et al. (2000) The genome sequence of the plant pathogen Xylella fastidiosa. The Xylella fastidiosa Consortium of the Organization for Nucleotide Sequencing and Analysis. Nature 406: 151–159.
10. Davila AM, Majiwa PA, Grisard EC, Aksoy S, Melville SE (2003) Comparative genomics to uncover the secrets of tsetse and livestock-infective trypanosomes. Trends Parasitol 19: 436–439.
11. Berriman M, Ghedin E, Hertz-Fowler C, Blandin G, Renauld H, et al. (2005) The genome of the African trypanosome Trypanosoma brucei. Science 309: 416–422.
12. El-Sayed NM, Myler PJ, Bartholomeu DC, Nilsson D, Aggarwal G, et al. (2005) The genome sequence of Trypanosoma cruzi, etiologic agent of Chagas disease. Science 309: 409–415.
13. Ivens AC, Peacock CS, Worthey EA, Murphy L, Aggarwal G, et al. (2005) The genome of the kinetoplastid parasite, Leishmania major. Science 309: 436–442.
14. Bentley DR (1996) Genomic sequence information should be released immediately and freely in the public domain. Science 274: 533–534.
15. Rabinowicz PD (2001) Genomics in Latin America: Reaching the frontiers. Genome Res 11: 319–322.
16. Potato Genome Sequencing Consortium. Available: http://www.potatogenome.net. Accessed 19 July 2009.
17. Almeida LG, Paixao R, Souza RC, Costa GC, Almeida DF, et al. (2004) A new set of bioinformatics tools for genome projects. Genet Mol Res 3: 26–52.
18. Doumbia S, Chouong H, Traore SF, Dolo G, Toure AM, et al. (2007) Establishing an insect disease vector functional genomics training center in Africa. Afr J Med Med Sci 36 (Suppl): 31–33.
19. Malaria Genomic Epidemiology Network (2008) A global network for investigating the genomic epidemiology of malaria. Nature 456: 732–737.
20. Gardner MJ, Bishop R, Shah T, de Villiers EP, Carlton JM, et al. (2005) Genome sequence of Theileria parva, a bovine pathogen that transforms lymphocytes. Science 309: 134–137.
21. Harris E (1998) A low-cost approach to PCR: Appropriate transfer of biomolecular techniques. New York: Oxford University Press.
22. Coloma MJ, Harris E (2004) Innovative low cost technologies for biomedical research and diagnosis in developing countries. BMJ 329: 1160–1162.
Figure 1. Participants in a Bioinformatics/Genomics Analysis workshop in Managua, Nicaragua, in June 2008 (conducted by the Sustainable Sciences Institute and the Broad Institute). Photograph by Eva Harris. doi:10.1371/journal.pmed.1000142.g001
23. Harris E (2004) Scientific capacity building in developing countries. EMBO Rep 5: 7–11.
24. Harris E, Tanner M (2000) Health technology transfer. BMJ 321: 817–820.
25. Aviles H, Belli A, Armijos R, Monroy FP, Harris E (1999) PCR detection and identification of Leishmania parasites in clinical specimens in Ecuador: A comparison with classical diagnostic methods. J Parasitol 85: 181–187.
26. Harris E, Kropp G, Belli A, Rodriguez B, Agabian N (1998) Single-step multiplex PCR assay for characterization of New World Leishmania complexes. J Clin Microbiol 36: 1989–1995.
27. Belli A, Rodriguez B, Aviles H, Harris E (1998) Simplified polymerase chain reaction detection of new world Leishmania in clinical specimens of cutaneous leishmaniasis. Am J Trop Med Hyg 58: 102–109.
28. Coloma J, Harris E (2008) Sustainable transfer of biotechnology to developing countries: fighting poverty by bringing scientific tools to developing-country partners. Ann N Y Acad Sci 1136: 358–368.
29. Miagostovich MP, Sequeira PC, Dos Santos FB, Maia A, Nogueira RM, et al. (2003) Molecular typing of dengue virus type 2 in Brazil. Rev Inst Med Trop Sao Paulo 45: 17–21.
30. Schriefer A, Schriefer AL, Goes-Neto A, Guimaraes LH, Carvalho LP, et al. (2004) Multiclonal Leishmania braziliensis population structure and its clinical implication in a region of endemicity for American tegumentary leishmaniasis. Infect Immun 72: 508–514.
31. Falush D (2009) Toward the use of genomics to study microevolutionary change in bacteria. PLoS Gen 5: e1000627. doi:10.1371/journal.pgen.1000627.
32. Plowe CV, Roper C, Barnwell JW, Happi CT, Joshi HH, et al. (2007) World Antimalarial Resistance Network (WARN) III: Molecular markers for drug resistant malaria. Malar J 6: 121.
33. Sibley CH, Barnes KI, Watkins WM, Plowe CV (2008) A network to monitor antimalarial drug resistance: a plan for moving forward. Trends Parasitol 24: 43–48.
34. Bifani PJ, Mathema B, Kurepina NE, Kreiswirth BN (2002) Global dissemination of the Mycobacterium tuberculosis W-Beijing family strains. Trends Microbiol 10: 45–52.
35. Filliol I, Driscoll JR, van Soolingen D, Kreiswirth BN, Kremer K, et al. (2003) Snapshot of moving and expanding clones of Mycobacterium tuberculosis and their global distribution assessed by spoligotyping in an international study. J Clin Microbiol 41: 1963–1970.
36. Manca C, Reed MB, Freeman S, Mathema B, Kreiswirth B, et al. (2004) Differential monocyte activation underlies strain-specific Mycobacterium tuberculosis pathogenesis. Infect Immun 72: 5511–5514.
37. Valway SE, Sanchez MP, Shinnick TF, Orme I, Agerton T, et al. (1998) An outbreak involving extensive transmission of a virulent strain of Mycobacterium tuberculosis. N Engl J Med 338: 633–639.
38. Gagneux S, Comas I (2009) The past and future of tuberculosis research. PLoS Path 5(10): e1000600. doi:10.1371/journal.ppat.1000600.
39. Poon LL, Chan KH, Smith GJ, Leung CS, Guan Y, et al. (2009) Molecular detection of a novel human influenza (H1N1) of pandemic potential by conventional and real-time quantitative RT-PCR assays. Clin Chem 55: 1555–1558.
40. Reis JN, Palma T, Ribeiro GS, Pinheiro RM, Ribeiro CT, et al. (2008) Transmission of Streptococcus pneumoniae in an urban slum community. J Infect 57: 204–213.
41. Vieira N, Bates SJ, Solberg OD, Ponce K, Howsmon R, et al. (2007) High prevalence of enteroinvasive Escherichia coli isolated in a remote region of northern coastal Ecuador. Am J Trop Med Hyg 76: 528–533.
42. Riley LW (2004) Molecular epidemiology of infectious diseases: Principles and practices. Herndon (Virginia): ASM Press.
43. Teufel A, Krupp M, Weinmann A, Galle PR (2006) Current bioinformatics tools in genomic biomedical research. Int J Mol Med 17: 967–973.
44. Davila AM, Lorenzini DM, Mendes PN, Satake TS, Sousa GR, et al. (2005) GARSA: Genomic analysis resources for sequence annotation. Bioinformatics 21: 4302–4303.
45. Cook-Deegan RM, McCormack SJ (2001) Intellectual property. Patents, secrecy, and DNA. Science 293: 217.
46. Burton B (2002) Proposed genetic database on Tongans opposed. BMJ 324: 443.
47. Pang T (2002) The impact of genomics on global health. Am J Public Health 92: 1077–1079.
48. Chokshi DA, Thera MA, Parker M, Diakite M, Makani J, et al. (2007) Valid consent for genomic epidemiology in developing countries. PLoS Med 4: e95. doi:10.1371/journal.pmed.0040095.
Perspective
Can an Infectious Disease Genomics Project Predict and Prevent the Next Pandemic?
Rajesh Gupta¤*, Mark H. Michalski¤, Frank R. Rijsberman
Google.org, Mountain View, California, United States of America
We believe that there is great potential
in the systematic application of genomics,
proteomics, and bioinformatics to infec-
tious diseases, and that this potential has
yet to be fully realized. We suggest that the
international community unite under an
Infectious Disease Genomics Project, anal-
ogous to the Human Genome Project,
with a goal of a comprehensive, open-
access system of genomic information to
accelerate scientific understanding and
product development in the very settings
where diseases have the highest probabil-
ity of emerging. If properly structured,
such an approach could shift fundamen-
tally the global response to emerging
infectious diseases.
Genomics Is Systematically Transforming Medicine
The ‘‘Genomic Revolution’’ has trans-
formed our vision and understanding of
how living organisms and systems interact
with each other and with the environment
[1]. Increasingly, the science of genomics
serves as the foundation for translational
research for advancing the management of
many important diseases [2–7]. Decreasing
costs and increasing throughput of new
technologies have made possible multinational
collaboration on large-scale projects
such as the Human Microbiome Project
and the 1000 Genomes Project [8–10].
Infectious disease management is also
transforming thanks to molecular technol-
ogies as seen in HIV [11,12], tuberculosis
[13,14], malaria [15,16], and other neglected
tropical diseases [17,18]. Discovering
novel pathogens and elucidating the
implications of genetic variation among
existing pathogens [19,20] are critical for
rapidly mitigating pandemic threats, as
demonstrated recently with severe acute
respiratory syndrome (SARS) [21,22] and
avian (H5N1) and pandemic H1N1 2009
influenza (commonly referred to as ‘‘swine
flu’’) [23–26].
To fully harness the benefit of genomics
in infectious diseases, a chain of overarch-
ing activities must occur. First, under-
standing the dynamics of infectious diseas-
es through the genomics lens requires a
tremendous amount of integrated com-
parative sequence, expression, epigenetic,
and proteomic data from a variety of
pathogens (bacteria, viruses, protozoa, fungi),
vectors (arthropod and avian sources),
reservoirs (non-human mammals, environ-
ment) and human hosts. Second, generat-
ing, collating, organizing, and curating
these data is an essential public health task.
Third, translating this information to tools
to improve surveillance and response
mechanisms is critical to effectively impact
disease management.
If this bench-to-bedside chain of activities
were optimized, we envision that the
following could occur:
• Fully annotated genomes of all known
pathogens, vectors, non-human hosts,
and reservoir species, as well as a large
number of candidate microbes in
families that have a high risk of
generating future pathogens, are held
in public open-access databases such as
GenBank.
• A ‘‘Genomic search’’ of all available
contextual information, from sample
origins through to published analyses,
is as simple as a Google search.
• Sequencing and other molecular technologies
are everyday tools-of-the-trade
in every district hospital and
laboratory in hotspots of emerging
infectious disease, such as southeast
Asia and sub-Saharan Africa.
• Automated molecular diagnostic assays
are low-cost, reduced at least to
the size of a smart mobile phone, and
can return definitive diagnoses of a
range of specialized known pathogen
panels at the point of care.
• A range of products that use infectious
disease genomic information routinely
(such as vector maps, early warning
systems, diagnostics, vaccines, and
drugs) contribute to the prediction
and prevention of epidemics.
While progress is occurring in each of
these areas, the outputs—which are need-
ed today—are far from complete.
Creating an Infectious Disease Genomics Project (IDGP)
We believe that accelerated advances in
the area of infectious diseases can occur
under a global collaborative framework
composed of discrete and delineated
activities between the public and private
sectors among resource-wealthy and re-
source-limited settings. The Human Ge-
nome Project (HGP) was a pioneering
international effort that helped unlock the
power of genomics for human health
The Perspective section provides experts with a forum to comment on topical or controversial issues of broad interest. This article is part of the ‘‘Genomics of Emerging Infectious Disease’’ PLoS Journal collection (http://ploscollections.org/emerginginfectiousdisease/).
Citation: Gupta R, Michalski MH, Rijsberman FR (2009) Can an Infectious Disease Genomics Project Predict and Prevent the Next Pandemic? PLoS Biol 7(10): e1000219. doi:10.1371/journal.pbio.1000219
Published October 26, 2009
Copyright: © 2009 Gupta et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Funding: Google.org is financially supported through its parent company, Google.com. At the time this manuscript was developed, RG was an employee of Google.org and MM was a consultant to Google.org. The funder had no role in the decision to publish or preparation of the manuscript.
Competing Interests: The authors have declared that no competing interests exist.
* E-mail: [email protected]
¤ Current address: Stanford University, Stanford, California, United States of America
PLoS Biology | www.plosbiology.org 1 October 2009 | Volume 7 | Issue 10 | e1000219
[27,28]. This effort generated important
information in part by having clear,
targeted outcomes and by implementing
a standard methodology across all partic-
ipants. The HGP was a great impetus for
progress seen thus far in genomics and
health. Moreover, the HGP recognized
that sequencing was just the first step in
a much bigger process [26]. A similar
effort for infectious diseases could, in our
view, help predict and prevent the next
pandemic.
To capitalize on existing successful
efforts in the area of genomics and
infectious diseases such as those by the
Broad Institute, Genomic Standards Consortium,
J. Craig Venter Institute, the
National Institute of Allergy and Infectious
Diseases, and the Wellcome Trust Sanger
Institute (to name a few), we urge the
international community to unite its nu-
merous activities under an Infectious
Disease Genomics Project (IDGP), a
coordinated, large-scale, international ef-
fort focused on the genomes of pathogens,
vectors, hosts, and reservoirs and linked to
end-point surveillance and response sys-
tems. Such a project could coordinate
activities in four specific areas: generating
data, linking data, analyzing data, and
applying data (Figure 1).
Generating Data
At the outset, the IDGP would need to
determine what the world requires in
terms of genomic information. A standard
approach to generating depth and diver-
sity in genomic data is essential; beyond
this, continuous real-time surveillance and
characterization of evolving pathogens can
help effectively forestall future epidemics/
pandemics. Frontline work by consor-
tiums, genome research centers, and
individual laboratories has yielded baseline
approaches in this area and a wealth of
critical genomic information for many
important infectious agents [29–34].
While each actor in the genomics field
brings its own priority for targeting
particular pathogens or diseases, a clear
roadmap to generating a complete geno-
mic picture of all infectious agents, emerg-
ing threats, hosts, and reservoirs, incorpo-
rating a broad range of investigators with
varied technological capacity, would en-
hance both data generation and applica-
tion. Such a process allows for communi-
ty-level priority setting, thereby enabling
smaller-scale laboratories to tailor projects
to fit the needs of local communities while
contributing to global efforts.
Linking Data
The data collected must be connected
to all relevant information and analytical
tools in a single, easy-to-use, open-source,
real-time interface. Such a system would
improve on current systems by: gathering
data across the public domain and work-
ing with companies/institutions to harness
information in the private domain; linking
accurate, annotated sequencing informa-
tion to functional genomic and proteo-
mic/functional proteomic information;
attaching scientific literature associated
with all levels of information; and includ-
ing a self-sustaining financial mechanism
potentially based on royalties from com-
mercial products generated from the use of
this system.
Analyzing Data
The data need to be linked via large-scale,
dynamic databases held in virtual
servers allowing for collaboration and
sharing while maintaining originating
information for data rights and sovereign-
ty. Concurrently, these data should be
associated with a centralized collection of
open-source bioinformatics tools capable
of real-time operation in low- and high-
speed computers and varying levels of
internet connectivity. A single interface
also would bring various sample collec-
tions together in formally structured bio-
banks that capture geospatial and context
data to allow efficient scientific collabora-
tion to take place. Centralizing the entire
spectrum of information and analytic tools
also allows researchers in resource-limited
settings to participate in the genomics
revolution without prohibitively costly
machines, laboratories, and sample accessibility.
Although we fully acknowledge
that internet connectivity is a requirement
that is not currently available to all, rapid
technical innovation and investment from
cheap netbook computers to new fiber
optic cables in Africa are changing that
equation. This system could be facilitated
by virtual community collaboration or
crowd-sourcing, taking full advantage of
networking tools such as Wikipedia,
Facebook, Twitter, FusionTables, and PLoS.

Figure 1. A coordinated Infectious Disease Genome Project (IDGP) could unify sequencing efforts, enhance data usability, and lead to essential tools for infectious disease management.
doi:10.1371/journal.pbio.1000219.g001

Author Summary
The world of genomics is transforming medicine, and is likely to influence the future development of new drugs, diagnostics, and vaccines. To date, the greater focus of genomics and medicine has been on conditions affecting resource-wealthy settings, primarily involving scientists and companies in those settings. However, we believe that it is possible to expand genomics into a more global technology that can also focus on diseases of resource-limited settings. This goal can be achieved if genomics is made a global priority. We feel one way to move in this direction is through a comprehensive approach to infectious diseases, i.e., an Infectious Disease Genomics Project, that would mirror the Human Genome Project. Without an active, unified effort specifically focused on allowing actors at any level to participate in the genomics revolution, infectious diseases that primarily affect the poor will likely not achieve the same level of scientific advancement as diseases affecting the wealthy.
Applying Data
Technological advances for basic scientific
discovery (such as next-generation
sequencers, microarrays, mass spectrome-
ters, cell-based assay methods, and other
tools for transcriptome, metabolome, and
proteome discovery), novel techniques to
increase throughput and/or decrease the
cost of analysis, and applied clinical
decision-making and surveillance tools
(point-of-care diagnostics, rapid multi-
pathogen assays) are in progress and
should be supported actively. The IDGP
should be informed by and incorporate
emerging technology platforms to rapidly
develop more accurate field diagnostics
and to identify new opportunities for
vaccine and drug development.
Moving beyond Discourse into Action
An IDGP is attainable if others share
this vision, show leadership, and see the
added value resulting from a coordinated
effort. The HGP certainly was a more
targeted effort and we acknowledge that
an IDGP will have additional obstacles to
overcome. Scientific disagreement over
targets is bound to occur. Complications
resulting from the proposed level of data
sharing should not be underestimated, and
care must be taken to ensure proprietary
rights and acknowledgement when war-
ranted. Adapting molecular genetic tech-
nologies to resource-limited settings is a
significant challenge, but is occurring with
some success. Bringing together a com-
munity of scientists and donors, each with
their own objectives and goals, to work
under a single framework, is a difficult
proposition. Finally, there will be many
who will find this perspective simply too
grandiose. Leaps of progress also require
big visions, however, and it may just be
possible that the 2009 H1N1 influenza
pandemic is enough of a reminder of
what is at stake to provide a catalyst for
action.
Google.org has supported global public
health through its ‘‘Predict and Prevent’’
initiative with the aim of using the power
of information and technology to address
emerging infectious diseases by helping the
world to know where to look for these
diseases, find the threats earlier, and
respond to them faster [35]. Google.org
has focused its support on sequencing and
pathogen discovery activities, bringing
genomic technologies to resource-limited
settings in East Africa, improving surveil-
lance networks and systems, and exploring
how our core competence in internet
search can assist the infectious diseases
community [36].
As firm supporters of the open access
model for scientific publication [37],
Google.org is pleased to support this series
of essays, The Genomics of Emerging
Infectious Disease, in partnership with the
Public Library of Science (PLoS) journals
(PLoS Biology, PLoS Computational Biology,
PLoS Genetics, PLoS Medicine, PLoS Neglected
Tropical Diseases, and PLoS Pathogens), not
only to help define the current state of the
art in pathogen genomics, but also, we
hope, to stimulate debate on priorities for
research and technology development.
References
1. Yudell M, DeSalle R (2002) The genomic
revolution: Unveiling the unity of life. Washing-
ton (D. C.): Joseph Henry Press. 272 p.
2. Langston AA, Malone KE, Thompson JD,
Daling JR, Ostrander EA (1996) BRCA1 mutations
in a population-based sample of young women with
breast cancer. N Engl J Med 334: 137–142.
3. Futreal P, Liu Q, Shattuck-Eidens D, Cochran C,
Harshman K, et al. (1994) BRCA1 mutations in
primary breast and ovarian carcinomas. Science
266: 120–122.
4. Helgadottir A, Manolescu A, Thorleifsson G,
Gretarsdottir S, Jonsdottir H, et al. (2004) The
gene encoding 5-lipoxygenase activating protein
confers risk of myocardial infarction and stroke.
Nature Genetics 36: 233–239.
5. Wellcome Trust Case Control Consortium (2007) Genome-wide
association study of 14,000 cases of seven common
diseases and 3,000 shared controls. Nature 447:
661–678.
6. Consortium G (2007) New models of collabora-
tion in genome-wide association studies: The
Genetic Association Information Network. Nat
Genet 39: 1045–1051.
7. Vigneri P, Wang J (2001) Induction of apoptosis
in chronic myelogenous leukemia cells through
nuclear entrapment of BCR-ABL tyrosine kinase.
Nat Med 7: 228–234.
8. Gresham D, Kruglyak L (2008) Rise of the
machines. PLoS Genet 4: e1000134.
doi:10.1371/journal.pgen.1000134.
9. Spencer G (2008) Researchers establish interna-
tional human microbiome consortium. NIH
News. Available: http://www.nih.gov/news/
health/oct2008/nhgri-16.htm. Accessed 19 Sep-
tember 2009.
10. Spencer G (2008) International consortium an-
nounces the 1000 Genomes Project. NIH News.
Available: http://www.nih.gov/news/health/
jan2008/nhgri-22.htm. Accessed 19 September
2009.
11. Martinez-Cajas JL, Wainberg MA (2008) Anti-
retroviral therapy: Optimal sequencing of therapy
to avoid resistance. Drugs 68: 43–72.
12. Wilkinson KA, Gorelick RJ, Vasa SM, Guex N,
Rein A, et al. (2008) High-throughput SHAPE
analysis reveals structures in HIV-1 Genomic
RNA strongly conserved across distinct biological
states. PLoS Biol 6: e96. doi:10.1371/journal.
pbio.0060096.
13. Smith CV, Sacchettini JC (2003) Mycobacterium
tuberculosis: A model system for structural geno-
mics. Curr Opin Struct Biol 13: 658–664.
14. Cockle PJ, Gordon SV, Lalvani A, Buddle BM,
Hewinson RG, et al. (2002) Identification of novel
Mycobacterium tuberculosis antigens with potential as
diagnostic reagents or subunit vaccine candidates
by comparative genomics. Infect Immun 70:
6996–7003.
15. Gonzales JM, Patel JJ, Ponmee N, Jiang L, Tan A,
et al. (2008) Regulatory hotspots in the malaria
parasite genome dictate transcriptional variation.
PLoS Biol 6: e238. doi:10.1371/journal.
pbio.0060238.
16. Ekland EH, Fidock DA (2007) Advances in
understanding the genetic basis of antimalarial
drug resistance. Curr Opin Microbiol 10:
363–370.
17. Beaty BJ, Prager DJ, James AA, Jacobs-Lorena M,
Miller LH, et al. (2009) From Tucson to genomics
and transgenics: The Vector Biology Network
and the emergence of modern vector biology.
PLoS Negl Trop Dis 3: e343. doi:10.1371/
journal.pntd.0000343.
18. Hertz-Fowler C, Figueiredo LM, Quail MA,
Becker M, Jackson A, et al. (2008) Telomeric
expression sites are highly conserved in Trypano-
soma brucei. PLoS ONE 3: e3527. doi:10.1371/
journal.pone.0003527.
19. Wolfe N, Heneine W, Carr J, Garcia A,
Shanmugam V, et al. (2005) Emergence of
unique primate T-lymphotropic viruses among
central African bushmeat hunters. Proc Natl
Acad Sci U S A 102: 7994–7999.
20. Palacios G, Druce J, Du L, Tran T, Birch C, et al.
(2008) A new arenavirus in a cluster of fatal
transplant-associated diseases. N Engl J Med 358:
991–998.
21. Grant P, Garson J, Tedder R, Chan P, Tam J,
et al. (2003) Detection of SARS coronavirus in
plasma by real-time RT-PCR. N Engl J Med 349:
2468.
22. Marra M, Jones S, Astell C, Holt R, Brooks-
Wilson A, et al. (2003) The genome sequence of
the SARS-associated coronavirus. Science 300:
1399–1404.
23. Gu J, Xie Z, Gao Z, Liu J, Korteweg C, et al.
(2007) H5N1 infection of the respiratory tract and
beyond: A molecular pathology study. Lancet
370: 1137–1145.
24. Zhao Z-M, Shortridge KF, Garcia M, Guan Y,
Wan X-F (2008) Genotypic diversity of H5N1
highly pathogenic avian influenza viruses. J Gen
Virol 89: 2182–2193.
25. Garten RJ, Davis CT, Russell CA, Shu B,
Lindstrom S, et al. (2009) Antigenic and genetic
characteristics of swine-origin 2009 A(H1N1)
influenza viruses circulating in humans. Science
325: 197–201.
26. Shinde V, Bridges CB, Uyeki TM, Shu B,
Balish A, et al. (2009) Triple-reassortant swine
influenza A (H1) in humans in the United States,
2005–2009. N Engl J Med 360: 2616–2625.
27. International Human Genome Sequencing Consortium (2001) Initial sequencing and
analysis of the human genome. Nature 409:
860–921.
28. Collins FS, Morgan M, Patrinos A (2003) The
Human Genome Project: Lessons from large-
scale biology. Science 300: 286–290.
29. Wellcome Trust Sanger Institute (2009) Pathogen
genomics [Web site]. Available: http://www.
sanger.ac.uk/Projects/Pathogens/. Accessed 11
August 2009.
30. National Institute of Allergy and Infectious
Disease (2009) Microbial Genome Sequencing
Centers: Completed NIAID-Supported Sequenc-
ing Projects. Available: http://www3.niaid.nih.
gov/research/resources/mscs/completed.htm.
Accessed 11 August 2009.
31. Cole ST, Brosch R, Parkhill J, Garnier T,
Churcher C, et al. (1998) Deciphering the biology
of Mycobacterium tuberculosis from the complete
genome sequence. Nature 393: 537–544.
32. Gardner MJ, Hall N, Fung E, White O,
Berriman M, et al. (2002) Genome sequence of
the human malaria parasite Plasmodium falciparum.
Nature 419: 498–511.
33. Greene JM, Collins F, Lefkowitz EJ, Roos D,
Scheuermann RH, et al. (2007) National Institute
of Allergy and Infectious Diseases bioinformatics
resource centers: New assets for pathogen
informatics. Infect Immun 75: 3212–3219.
34. Field D, Garrity G, Gray T, Morrison N,
Selengut J (2008) The minimum information
about a genome sequence (MIGS) specification.
Nat Biotechnol 26: 541–547.
35. Google.org (2008) Predict and Prevent initiative.
Available: http://www.google.org/predict.html.
Accessed 19 September 2009.
36. Ginsberg J, Mohebbi MH, Patel RS, Brammer L,
Smolinski MS, et al. (2009) Detecting influenza
epidemics using search engine query data. Nature
457: 1012–1014.
37. Gass A (2004) Open access as public policy. PLoS
Biol 2: e353. doi:10.1371/journal.pbio.0020353.
Perspective
The Role of Genomics in the Identification, Prediction, and Prevention of Biological Threats
W. Florian Fricke, David A. Rasko, Jacques Ravel*
Institute for Genome Sciences (IGS), University of Maryland School of Medicine, Baltimore, Maryland, United States of America
Since the publication in 1995 of the first
complete genome sequence of a free-living
organism, the bacterium Haemophilus influ-
enzae [1], more than 1,000 genomes of
species from all three domains of life—
Bacteria, Archaea, and Eukarya—have
been completed and a staggering 4,300
are in progress (not including an even
larger number of viral genome projects)
(GOLD, Genomes Online Database v.
2.0; http://www.genomesonline.org/gold.
cgi, as of August 2009). Whole-genome
shotgun sequencing remains the standard
in biomedical, biotechnological, environ-
mental, agricultural, and evolution-
ary genomics (http://genomesonline.org/
gold_statistics.htm#aname). While next-
generation sequencing technology is
changing the field, this approach will
continue to be used and lead to a
previously unimaginable number of ge-
nome sequences, providing opportunities
that could not have been thought of a few
years ago. These opportunities include
studying genomes in real-time to under-
stand the evolution of known pathogens
and predict the emergence of new infec-
tious agents (Box 1). With the introduction
of next-generation sequencing platforms,
cost has decreased dramatically, resulting
in genomics no longer being an indepen-
dent discipline, but becoming a tool
routinely used in laboratories around the
world to address scientific questions. This
global sequencing effort has been focusing
primarily on pathogenic organisms, which
today are still the subject of the majority of
genome projects [2]. Sequencing two to
five strains of the same pathogen has, in
recent years, afforded us a better
understanding of evolution, virulence, and
biology in general [3]. Taken to the
next level (hundreds or thousands of
strains), it will enable even more accurate
diagnostics to support epidemiological
studies, food safety improvements, public
health protection, and forensic
investigations, among others.
Biodefense Funding for Genomic Research
Since the anthrax letter attacks of 2001,
when letters containing anthrax spores
were mailed to several news media offices
and two Democratic senators in the
United States, killing five people and
infecting 17 others, funding agencies in
the US and other countries have priori-
tized research projects on organisms that
might potentially challenge our security
and economy should they be used as
biological weapons. This has resulted in
large amounts of funding dedicated to so-
called ‘‘biodefense’’ research, totaling close
to $50 billion between 2001 and 2009 [4].
Genomics has benefited greatly from this
influx of research dollars and as a result,
representatives of most major animal, plant,
and human pathogens have been sequenced
(http://www.pathogenportal.org/). Support-
ed by federal funds from the National
Institutes of Health (NIH), the National
Institute of Allergy and Infectious Diseases
(NIAID), and the US Department of De-
fense, research programs, such as the Micro-
bial Sequencing Centers and the Bioinfor-
matics Resource Centers (http://www3.
niaid.nih.gov/topics/pathogenGenomics/
PDF/genomicsinitiatives.htm), have been
established that carry out genomics re-
search on pathogenic organisms and have
spearheaded a new phase of the genomics
revolution. Similar programs were started
in Europe, such as those at the Wellcome
Trust Sanger Institute in the United
Kingdom, and the multinational European
effort, The Network of Excellence Euro-
PathoGenomics (http://www.noe-epg.
uni-wuerzburg.de/epg_general.htm). As
an example of the success of these types
of programs, the genome sequences of over
90,000 influenza viruses were rapidly
generated and are now deposited in
GenBank (http://www.ncbi.nlm.nih.gov/
genomes/FLU/aboutdatabase.html). Be-
cause of the availability of large sequencing
capacity and the large amount of informa-
tion, the response to the 2009 H1N1
influenza pandemic was rapid and efficient
(Box 2): Genomics information was gener-
ated within days and validated diagnostic
tools were approved within weeks [5,6]. A
global response was made possible through
tremendous research efforts enabled by
genomic research.
Access to and Documentation of Sequence Data
Open access to genomics resources (i.e.,
raw sequence data and associated publi-
cations) is an essential component of the
nation's preparedness for biological threats
(biopreparedness), whether intentionally
delivered or not. Although some consider
open-source genomic resources a threat to
security [7] because they make publicly
available information that could facilitate
the construction of dangerous infectious
agents, we strongly disagree with this point
of view. Rather, we and others [8] believe
that it is an enabling tool more useful to
those in charge of our public health and
biosecurity than to those with ill inten-
tions. Genomic sequence data can provide
a starting point for the development of
new vaccines, drugs, and diagnostic tests
[9], hence improving public health capa-
bilities and increasing our bioprepared-
ness. Access to the organisms from which
the sequences are derived should be
restricted, not their genome sequences.
The Perspective section provides experts with a forum to comment on topical or controversial issues of broad interest.
Citation: Fricke WF, Rasko DA, Ravel J (2009) The Role of Genomics in the Identification, Prediction, and Prevention of Biological Threats. PLoS Biol 7(10): e1000217. doi:10.1371/journal.pbio.1000217
Published October 26, 2009
Copyright: © 2009 Fricke et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Competing Interests: The authors have declared that no competing interests exist.
* E-mail: [email protected]
This article is part of the ‘‘Genomics of Emerging Infectious Disease’’ PLoS Journal collection (http://ploscollections.org/emerginginfectiousdisease/).
PLoS Biology | www.plosbiology.org 1 October 2009 | Volume 7 | Issue 10 | e1000217
Now that genomics technologies are
broadly available, there is the potential
for commercial interests to hamper the
release of genomic data in the public
domain. Thus it is important that federally
funded large-scale genome sequencing
efforts have enforceable rapid release
policies. This accessibility could afford
further opportunities to capitalize on
investments in genome sequencing by
providing the necessary resources to bio-
preparedness.
Whereas genome projects aimed at
sequencing one, two, or three isolates of
a pathogen seemed adequate a few years
ago, it is now possible to sequence rapidly
hundreds of individual genomes for each
species. Access to relevant, well-curated
culture collections [10] and DNA prepa-
rations suitable for sequencing may be-
come a bottleneck in the future when
sequencing resources are no longer limit-
ing. More importantly, the impact of large
genomic sequence datasets from clinical
isolates will be limited without key clinical
metadata that characterize these isolates,
such as patients’ medical information,
date of isolation, and the number of
culture passages in the laboratory. Open
access to large numbers of sequences and
associated metadata allows for powerful
comparative genomic analyses and thus
provides major insights into the charac-
teristics of a pathogen. Standardized
vocabulary should be developed to de-
scribe these isolates and the genes they
contain. Such efforts have already started,
for example through the open-access
journal Standards in Genomic Sciences
(SIGS) (http://standardsingenomics.org/
index.php/sigen), but the dedicated re-
sources are not adequate and highlight the
lack of understanding of the importance of
metadata in genomics. Initiatives such as
those of the Genomic Standards Consortium
have made great strides [11,12], but
still need widespread implementation
from the ever-expanding genomic com-
munity. Open access to the genomic DNA
that has been sequenced or the culture
from which the DNA was extracted and to
the associated metadata is key to success-
ful genome sequencing projects, whether
on single or several hundred genomes or
metagenomes. Well-documented genome
sequence data will form a key growing
resource for biodefense and other re-
search fields.
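To make the point about clinical metadata concrete, the following is a minimal sketch of a per-isolate record that carries the three kinds of metadata named above (patient medical information, date of isolation, and number of culture passages) and reports which of them are missing. The class name, field names, and example values are illustrative assumptions and do not correspond to any published standard or checklist.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Dict, List, Optional


@dataclass
class IsolateRecord:
    """Hypothetical minimal metadata record for one sequenced clinical isolate."""
    isolate_id: str
    species: str
    isolation_date: Optional[date] = None      # date the isolate was obtained
    culture_passages: Optional[int] = None     # laboratory passages before sequencing
    clinical_info: Dict[str, str] = field(default_factory=dict)  # de-identified patient data

    def missing_fields(self) -> List[str]:
        """Return the names of the metadata fields that have not been filled in."""
        missing = []
        if self.isolation_date is None:
            missing.append("isolation_date")
        if self.culture_passages is None:
            missing.append("culture_passages")
        if not self.clinical_info:
            missing.append("clinical_info")
        return missing


# Example with invented values: one complete record and one sequence-only record.
complete = IsolateRecord(
    "isolate-001", "Escherichia coli O157:H7",
    isolation_date=date(2006, 9, 1), culture_passages=2,
    clinical_info={"outcome": "recovered"},
)
bare = IsolateRecord("isolate-002", "Escherichia coli O157:H7")
print(complete.missing_fields())
print(bare.missing_fields())
```

A check of this kind could be run at submission time, so that sequences deposited without their accompanying metadata are flagged before they enter a shared database.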
Emerging New Bioinformatics Resources
As we enter a new era of modern
genomics, the ever-expanding sequence
datasets are becoming more challenging to
analyze. Future analysts will require powerful
new bioinformatics tools in conjunction with
new computer systems engineered with
genomic analysis in mind. New open-source
bioinformatics software tools are being
developed that exploit Web-based services
and the increasing computing power provid-
ed by academic and commercial ‘‘cloud
computing networks’’ (large computing re-
sources provided as a service over the
Internet). For example, ‘‘Science Clouds’’
(http://workspace.globus.org/clouds/) allow
members of the scientific community to lease
cloud computing resources free of charge.
To leverage these capabilities, novel cloud-
optimized bioinformatics tools are being
developed, such as the genome sequence
read mapper CloudBurst [13]. In addition,
novel resources are currently under devel-
opment to increase the availability of open-
source bioinformatics tools for cloud com-
puting (http://www.nsf.gov/awardsearch/
showAward.do?AwardNumber=0949201;
http://www.nsf.gov/awardsearch/showAward.
do?AwardNumber=0844494). These emerging
tools make access to the World Wide Web the
only requirement to join the genomic revolution
and achieve large-scale bioinformatics analyses
that would not be possible on local servers. As a
consequence, it is conceivable that in the future
genomic research will increasingly move away
from the large sequencing centers toward a
more decentralized organization. Decentralized
rapid genome sequencing and bioinformatic
analysis of infectious agents will enable near-real-
time global surveillance and detection of new
pathogens, new virulence factors, antimicrobial
resistance determinants, or engineered
organisms.

Author Summary
In all likelihood, it is only a matter of time before our public health system will face a major biological threat, whether intentionally dispersed or originating from a known or newly emerging infectious disease. It is necessary not only to increase our reactive ‘‘biodefense,’’ but also to be proactive and increase our preparedness. To achieve this goal, it is essential that the scientific and public health communities fully embrace the genomic revolution, and that novel bioinformatic and computing tools necessary to make great strides in our understanding of these novel and emerging threats be developed. Genomics has graduated from a specialized field of science to a research tool that soon will be routine in research laboratories and clinical settings. Because the technology is becoming more affordable, genomics can and should be used proactively to build our preparedness and responsiveness to biological threats. All pieces, including major continued funding, advances in next-generation sequencing technologies, bioinformatics infrastructures, and open access to data and metadata, are being set in place for genomics to play a central role in our public health system.

Box 1. Hot Spots for the Emergence of Infectious Disease
Can we define ‘‘hot spots’’ of microbial populations where new infectious diseases are more likely to evolve? Human contact with new types of infectious agents precedes the emergence of infectious diseases. Infectious agents can be new in the sense of not having previously infected humans or new in the sense that a combination of preexisting genetic factors (for example, mobile elements or regulatory elements) has reassembled to give rise to an infectious agent with a substantially altered genome. The Ebola virus, which first emerged by infecting humans in 1976 in Zaire [21], is an example of the former, whereas the acquisition of antimicrobial resistance by Acinetobacter baumannii [22] is an example of the latter. In both cases, a change in the selective pressure on an infectious agent allows its emergence from a specific setting. This selective pressure may be, for example, the new niche that the human host provides to the pathogen or the antimicrobial selection on a pathogen. Since both events rely on preexisting genetic resources and not on the de novo evolution of virulence factors, the potential of a setting to serve as a hot spot or reservoir for an emerging infectious disease is theoretically predictable from the examination of the total metagenome. In this scenario, traditional microbiological approaches that focus on single isolates of bacteria or viruses are limited in their predictive power since they lack a view of the complete genetic landscape. The potential infectious disease agent could, however, arise from an environment that only contains pieces of a ‘‘virulence puzzle,’’ i.e., individual virulence factors encoded within the genomes of different organisms (the metagenomic ‘‘gene soup’’). These pieces would have to be assembled in one species for the new pathogen to emerge as an infectious agent.
Population Genomics Applied to Single Cultures
Because the resources for affordable
high-throughput sequencing, data pro-
cessing, and analysis are available, the
time is right to think about microbial
population genomics and large-scale mi-
crobial metagenomics in the context of
biodefense research (Box 3). Traditional-
ly, the concept of population genomics
has applied to variation within a species.
However, a bacterial culture, even if
derived from a single clone, is composed
of millions of cells that are not necessarily
identical at the genome sequence level,
hence forming a population of genomes.
Therefore we propose to apply the
concept of population genomics to mi-
crobial cultures. The assemblage of
genotypes defines what is called a ‘‘cul-
ture,’’ ‘‘culture stock,’’ or ‘‘reference
strain.’’ Population genomics addresses
the genomic diversity within these assem-
blages and has significant implications for
many fields of research but, most impor-
tantly, for pathogen evolution, diagnos-
tics, epidemiology, and microbial foren-
sics. For example, following the anthrax
mail attacks of 2001, microbiologists and
genomicists joined forces to characterize
the unique genetic traits of the Bacillus
anthracis spores recovered from the enve-
lopes, which were quickly identified as
the B. anthracis Ames strain (DAAR et al.,
unpublished data). Sequencing the ge-
nome of several single colonies obtained
from the spores revealed that the entire
chromosome and its associated plasmids
were 100% identical to the genome
sequence of the ancestral B. anthracis
Ames strain that was stored for over 20
years in a military laboratory in Freder-
ick, Maryland. The only genotypic dif-
ferences were found in a small, pheno-
typically and genetically distinct portion
of cells grown from the spores used in the
attacks. Genomic characterization of
these phenotypic variants revealed a
number of unique genetic alterations that
together provided a characteristic DNA
fingerprint of the spore population that
could be unequivocally matched to the
spore sample used in the attacks. Using
this fingerprint, a genetic assay was
developed to screen a B. anthracis spore
repository, which identified the origin of
the spores as a single spore stock of B.
anthracis Ames. This stock was stored at
the US Army Medical Research Institute
of Infectious Diseases at Fort Detrick,
Maryland, narrowing the pool of suspects
to a manageable number (those who had
access to the spore stock) for the investi-
gative team. The police investigation that
followed identified a potential suspect as
the custodian of the spore stock. This was
the first use of microbial genomics as an
essential tool in a forensic investigation.
In the course of the investigation, scien-
tists had to establish culture repositories
from strains used in research in the US
and build databases of genome sequences
of all B. anthracis isolates. This work took
several years and delayed the investiga-
tion significantly. A lesson to be learned
from this investigation should therefore
be that there is a need for comprehensive
databases of unique DNA fingerprints of
stocks of potentially threatening patho-
gens. In the event that another bioterror
attack were to take place such genomic
databases would be key in quickly
establishing the source of the biological
material.
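The database lookup argued for above can be illustrated with a short sketch. It assumes that each stock's fingerprint is stored as a set of marker labels and that a stock is a candidate source when it carries every marker observed in the sample. The function name, the marker labels, and the stock names are invented for illustration and do not describe the assay used in the anthrax investigation.

```python
from typing import Dict, Iterable, List


def matching_stocks(sample_markers: Iterable[str],
                    stock_db: Dict[str, Iterable[str]]) -> List[str]:
    """Return, in sorted order, the stocks that carry every marker seen in the sample.

    An empty sample matches nothing, because it provides no evidence.
    """
    sample = frozenset(sample_markers)
    if not sample:
        return []
    return sorted(
        name for name, markers in stock_db.items()
        if sample <= frozenset(markers)   # subset test: all sample markers present
    )


# Hypothetical repository: stock name -> set of genetic markers found in that stock.
repository = {
    "stock_A": {"variant_1", "variant_2", "variant_3", "variant_4"},
    "stock_B": {"variant_1", "variant_3"},
    "stock_C": {"variant_5"},
}

# Hypothetical markers recovered from an evidence sample.
evidence = {"variant_1", "variant_2", "variant_3"}
print(matching_stocks(evidence, repository))
```

The subset test is a deliberate simplification: a real comparison would also have to account for sequencing error, for markers present at low frequency within a culture, and for stocks that share a recent common origin.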
The concept of population genomics also
applies to epidemiological studies of outbreaks
of infectious diseases, for example those
caused by food-borne or zoonotic pathogens
such as Salmonella spp. Traditionally,
epidemiologists and pathologists have used
low-resolution methods such as pulsed-field
gel electrophoresis (PFGE), multi-locus
sequence typing (MLST), or multi-locus
variable number tandem repeats analysis
(MLVA) to trace an individual isolate from
a patient back to a potentially infected food
source or to isolates from other patients
[14–17]. In 2006, for example, during an
outbreak of pathogenic Escherichia coli
O157:H7 infections in 26 states of the
US, which was caused by contaminated
spinach, isolates of the pathogen were
recovered from cows and wild pigs (the
zoonotic reservoirs), bags of spinach (the
vehicle of transmission), and ill patients
(http://www.cdc.gov/mmwr/preview/
mmwrhtml/mm55d926a1.htm). One
of these isolates was designated as the
reference for the outbreak based on
conserved PFGE patterns. Genome
sequencing of several isolates from the
same outbreak performed in our labo-
ratory, however, revealed genomic
variations that questioned a direct
evolutionary link between all out-
break-associated isolates (Eppinger
et al., unpublished data). Comparative
genomics followed by whole-genome
phylogenetic analyses based on single
nucleotide polymorphisms demonstrat-
ed that these isolates were indeed
closely related to one another and only
distantly related to other E. coli
O157:H7 isolates, hence linking all
isolates to the same outbreak, some-
thing that was not possible using PFGE
patterns. In this case, phylogenetic
analyses suggest that several highly
related genotypes were at the source
of the outbreak, thus challenging the
utility of assigning a single reference
strain to a specific outbreak. Instead,
collecting and sequencing tens or
hundreds of isolates from each source
or patient linked to an outbreak would
provide a better basis for understanding
the genomic diversity within the
outbreak population and would aid in
defining the population dynamics of an
outbreak.

Box 2. Pandemic H1N1 2009 Influenza: A Recent Example of the Impact of Genomics on Biopreparedness
Genomics can be readily applied to follow outbreaks of infectious diseases. This is clearly illustrated by the severe acute respiratory syndrome (SARS) outbreak in 2002–2003 and the emergence and worldwide spread of the pandemic H1N1 2009 influenza virus this year. In both cases, genomics played a key role in the immediate response to the outbreak. Initially, very little was known about the virus responsible for the SARS outbreak. Pangenomic virus microarrays identified it as a coronavirus [23]; however, it was only through detailed sequencing that the specific genotype of this virus could be determined [24]. Comparative sequence analysis identified the SARS virus as distinct from other coronaviruses in terms of its encoded proteins responsible for antigen presentation. This finding ultimately led to the development of diagnostics [25] and potential therapeutics [26]. This example of a sequencing approach as a rapid response to a virus outbreak demonstrates that genomics can be a useful and important, if not essential, epidemiological tool. In the ongoing H1N1 influenza outbreak, the National Center for Biotechnology Information (NCBI) established the Influenza Virus Resource (a database and tool for flu sequence analysis, annotation, and submission to GenBank; http://www.ncbi.nlm.nih.gov/genomes/FLU/SwineFlu.html), containing 462 complete viral genome sequences from worldwide viral samples (as of September 2009). Some of the genomic data was completed, compared, and released to the public within two weeks of isolation of the DNA. The rapid generation of genome sequence data is providing a paradigm shift in the analysis of infectious disease outbreaks, from more classical methods of isolation to the rapid molecular examination of the pathogen in question.
A New Concept: Contrabiotics
Insufficient attention has been paid to
the human microbiome (i.e., the consor-
tium of microbes that inhabit the human
body) as it relates to our efforts to
increase biopreparedness. New analyses
of the diversity and composition of the
human microbiome are making it in-
creasingly clear that human health
depends on a delicate equilibrium be-
tween the microbial inhabitants and the
human host [18,19]. Severe effects on
health could be caused not only by the
introduction of true pathogens in the
traditional sense into these human-asso-
ciated microbial communities (e.g., Vib-
rio cholerae, the etiologic agent of cholera)
but potentially also by slight shifts in the
proportions of different populations wi-
thin the community that give an other-
wise harmless species or strain an un-
desirable advantage over others, a sim-
ilar situation to what is observed in
bacterial vaginosis [20]. Probiotic dietary
supplements of live microorganisms
deliver beneficial bacteria that promote
a healthy state of the targeted microbiota.
As a purely hypothetical possibility,
the opposite is also plausible: the
healthy microbiota (skin, gut, or upper
respiratory tract, among others) might be
disturbed by introducing large amounts of
“contrabiotics,” i.e., living nonpathogenic
bacteria that would shift the microbiota
away from a healthy state. A better
understanding of the ecological princi-
ples that shape the composition of our
microbiome might contribute to our
biopreparedness for such a threat to
public health.
Challenges for the Future
The field of biodefense has thoroughly
embraced genomics and made it a
keystone for developing better identifica-
tion technologies, diagnostic tools, and
vaccines and improving our understand-
ing of pathogen virulence and evolution.
Enabling technologies and bioinfor-
matics tools have shifted genomics from
a separate research discipline to a tool so
powerful that it can provide novel
insights that were not imaginable a few
years ago, including for example redefin-
ing the notion of strains or cultures in the
context of biopreparedness or microbial
forensics. Challenges remain, though,
mostly because the large amounts of
data that are being generated, and will
continue to be generated in the future,
are becoming difficult to manage.
Better bioinformatic algorithms, access
to faster computing capabilities, larger
or novel and more efficient data storage
devices, and better training in genomics
are all in critical demand, and will be
required to fully embrace the genomic
revolution. Our nation’s preparedness
for biological threats, whether
they are deliberate or not, and our public
health system would benefit greatly by
leveraging these capabilities into better
real-time diagnostics (in the environment
as well as at the bedside), vaccines, a
greater understanding of the evolution-
ary process that makes a friendly microbe
become a pathogen (Box 3) (hence to
better predict what microbial foes will be
facing us in the near future), and better
forensics and epidemiological tools. The
time is right to be bold and capitalize on
these enabling technological advances to
sequence microbial species or complex
microbial communities to the greatest
level possible—that is, hundreds of ge-
nomes per species or samples—but let us
not forget that informatics and comput-
ing resources are now becoming the
bottleneck to actually making major
progress in this field.
Box 3. Simple Genomics, Population Genomics, and Metagenomics
It is now technically possible and scientifically desirable to combine sequencing projects on single genomes, genome populations, and metagenomes to study genome evolution. Single-genome projects provide the greatest resolution for identifying genetic factors responsible for specific virulence phenotypes and provide answers to many important questions, such as: What is the minimal gene set in a pathogen required to cause a specific disease phenotype? What does the genetic context of virulence or antibiotic resistance factors tell us about their evolutionary origin or the mobility between different microbial species or even genera? Population-level genome sequencing projects provide us with information about the pangenomic gene pool and the potential of a species to evolve into a novel pathogen. Are certain bacterial species or strains more likely than others to evolve pathogenic traits? What distinguishes a commensal from a pathogenic isolate? What provides the trigger or ability to convert a commensal or opportunistic strain into a pathogen? What role does horizontal gene transfer play in species evolution? Is an infection always caused by an individual isolate or might infection be caused by a combination of individuals in a population that all have different attenuated infectious potentials? Metagenomics projects sample the genetic reservoir (the set of genes carried by all members of a community) within a specific environment or sample. This “gene soup” reflects the maximum genetic potential accessible to individual isolates by horizontal gene transfer.
References
1. Fleischmann RD, Adams MD, White O, Clayton RA, Kirkness EF, et al. (1995) Whole-genome random sequencing and assembly of Haemophilus influenzae Rd. Science 269: 496–512.
2. Guzman E, Romeu A, Garcia-Vallve S (2008) Completely sequenced genomes of pathogenic bacteria: A review. Enferm Infecc Microbiol Clin 26: 88–98.
3. Binnewies TT, Motro Y, Hallin PF, Lund O, Dunn D, et al. (2006) Ten years of bacterial genome sequencing: Comparative-genomics-based discoveries. Funct Integr Genomics 6: 165–185.
4. Franco C (2008) Billions for biodefense: Federal agency biodefense funding, FY2008-FY2009. Biosecur Bioterror 6: 131–146.
5. Rowe T, Abernathy RA, Hu-Primmer J, Thompson WW, Lu X, et al. (1999) Detection of antibody to avian influenza A (H5N1) virus in human serum by using a combination of serologic assays. J Clin Microbiol 37: 937–943.
6. Maurer-Stroh S, Ma J, Lee RT, Sirota FL, Eisenhaber F (2009) Mapping the sequence mutations of the 2009 H1N1 influenza A virus neuraminidase relative to drug and antibody binding sites. Biol Direct 4: 18.
7. Aldhous P (2001) Biologists urged to address risk of data aiding bioweapon design. Nature 414: 237–238.
8. Read TD, Parkhill J (2002) Restricting genome data won’t stop bioterrorism. Nature 417: 379.
9. Bambini S, Rappuoli R (2009) The use of genomics in microbial vaccine development. Drug Discov Today 14: 252–260.
10. Tindall BJ, Garrity GM (2008) Proposals to clarify how type strains are deposited and made available to the scientific community for the purpose of systematic research. Int J Syst Evol Microbiol 58: 1987–1990.
11. Garrity GM, Field D, Kyrpides N, Hirschman L, Sansone SA, et al. (2008) Toward a standards-compliant genomic and metagenomic publication record. OMICS 12: 157–160.
12. Field D, Garrity GM, Sansone SA, Sterk P, Gray T, et al. (2008) Meeting report: The fifth Genomic Standards Consortium (GSC) workshop. OMICS 12: 109–113.
13. Schatz MC (2009) CloudBurst: Highly sensitive read mapping with MapReduce. Bioinformatics 25: 1363–1369.
14. Gerner-Smidt P, Hise K, Kincaid J, Hunter S, Rolando S, et al. (2006) PulseNet USA: A five-year update. Foodborne Pathog Dis 3: 9–19.
15. Urwin R, Maiden MC (2003) Multi-locus sequence typing: A tool for global epidemiology. Trends Microbiol 11: 479–487.
16. Keim P, Price LB, Klevytska AM, Smith KL, Schupp JM, et al. (2000) Multiple-locus variable-number tandem repeat analysis reveals genetic relationships within Bacillus anthracis. J Bacteriol 182: 2928–2936.
17. Boxrud D, Pederson-Gulrud K, Wotton J, Medus C, Lyszkowicz E, et al. (2007) Comparison of multiple-locus variable-number tandem repeat analysis, pulsed-field gel electrophoresis, and phage typing for subtype analysis of Salmonella enterica serotype Enteritidis. J Clin Microbiol 45: 536–543.
18. Gao Z, Tseng CH, Strober BE, Pei Z, Blaser MJ (2008) Substantial alterations of the cutaneous bacterial biota in psoriatic lesions. PLoS One 3: e2719.
19. Turnbaugh PJ, Ley RE, Mahowald MA, Magrini V, Mardis ER, et al. (2006) An obesity-associated gut microbiome with increased capacity for energy harvest. Nature 444: 1027–1031.
20. Srinivasan S, Fredricks DN (2008) The human vaginal bacterial biota and bacterial vaginosis. Interdiscip Perspect Infect Dis 2008: 750479.
21. Pourrut X, Kumulungui B, Wittmann T, Moussavou G, Delicat A, et al. (2005) The natural history of Ebola virus in Africa. Microbes Infect 7: 1005–1014.
22. Peleg AY, Seifert H, Paterson DL (2008) Acinetobacter baumannii: Emergence of a successful pathogen. Clin Microbiol Rev 21: 538–582.
23. Wang D, Urisman A, Liu YT, Springer M, Ksiazek TG, et al. (2003) Viral discovery and sequence recovery using DNA microarrays. PLoS Biol 1: e2. doi:10.1371/journal.pbio.0000002.
24. Marra MA, Jones SJ, Astell CR, Holt RA, Brooks-Wilson A, et al. (2003) The genome sequence of the SARS-associated coronavirus. Science 300: 1399–1404.
25. Zhu M (2004) SARS immunity and vaccination. Cell Mol Immunol 1: 193–198.
26. Haagmans BL, Osterhaus AD (2006) Coronaviruses and their therapy. Antiviral Res 71: 397–403.
Perspective
Discovering the Phylodynamics of RNA Viruses
Edward C. Holmes1,2*, Bryan T. Grenfell2,3
1 Center for Infectious Disease Dynamics, Department of Biology, The Pennsylvania State University, Mueller Laboratory, University Park, Pennsylvania, United States of
America, 2 Fogarty International Center, National Institutes of Health, Bethesda, Maryland, United States of America, 3 Department of Ecology and Evolutionary Biology
and Woodrow Wilson School, Princeton University, Princeton, New Jersey, United States of America
Phylodynamics: The Discovery Phase
The advent of extremely high through-
put DNA sequencing ensures that genomic
data from microbial organisms can be
acquired in unprecedented quantities and
with remarkable rapidity. Although this
genomic revolution will affect all microbes
alike, our focus here is on RNA viruses, as
the rapidity of their evolution, which is
observable over the time scale of human
observation, allows phylodynamic infer-
ences to be made with great precision. In
the foreseeable future it is likely that
complete genome sequencing will become
the standard method of viral characteriza-
tion, providing the highest possible reso-
lution for phylogenetic studies. The rapid-
ity with which genome sequence data were
generated from the ongoing epidemic of
swine-origin H1N1 influenza A virus [1] is
testament to the power of this technology.
Understandably, pathogen discovery is
a major focus of this new-scale genome
sequencing [2]. It is now possible to
sequence the entire assemblage of viruses
in a particular tissue type or host species
[3–5], as well as all those viruses that are
associated with specific disease syndromes
[6,7]. In essence, this new era of metage-
nomics constitutes a crucial taxonomic
discovery phase in virology and epidemi-
ology that allows the genetic characteriza-
tion of new viruses within hours of their
isolation.
Assembling an inventory of viruses that
may emerge in human populations is of
major importance to public health and to
students of biodiversity. However, it is only
the first step in developing a full quanti-
tative understanding of the processes that
shape the epidemiology and evolution—
the phylodynamics—of RNA virus infec-
tions [8]. To achieve this goal, we argue
here that the field of viral phylodynamics
requires its own discovery phase; that is, a
comprehensive and quantitative analysis
of the interaction between the ecological
and evolutionary dynamics of all circulat-
ing RNA viruses from the molecular to the
global scale. Such a marriage of phyloge-
netic and epidemiological dynamics is
currently only potentially possible for the
select few human viruses for which large
genome sequence datasets have been
acquired, such as HIV and influenza A
virus, and even here fundamental gaps in
our knowledge remain (see below). Indeed,
it is striking that so few complete genome
sequences are currently available for
viruses whose epidemiological dynamics
are known in exquisite detail, such as
measles [9,10]; these sequences have been
so sparsely sampled in both time and space
that a full phylodynamic perspective has
not yet been achieved. We contend that a
better understanding of RNA virus phylo-
dynamics will allow more directed at-
tempts at pathogen surveillance, facilitate
more accurate predictions of the epidemi-
ological impact of newly emerged viruses,
and assist in the control of those viruses
that exhibit complex patterns of antigenic
variation such as dengue and influenza.
Just as PCR and first-generation DNA
sequencing ushered in the science of
molecular epidemiology, so next-genera-
tion sequencing may herald the age of
phylodynamics. Box 1 lists a number of
key questions that can be addressed within
this phylodynamics research program.
A number of important advances are
needed to meet our goal of a comprehensive
catalog of the diversity of phylodynamic
patterns in RNA viruses. Because
answers to many of the most interesting
research questions depend on sufficiently
large sample sizes, we require large
numbers of sequences that have been
rigorously sampled according to strict
temporal, spatial, and clinical criteria,
and as many of these data as possible
should be publicly accessible. A phylodynamic
analysis has little value unless viral ge-
nomes are sampled on the same scale as
the epidemiological processes under inves-
tigation.
The only acute virus for which a suitably
expansive genome dataset currently exists is
influenza. In this case, the >4,000 complete
genomes generated under the Influenza
Genome Sequencing Project [11]
have provided important new insights into
the evolution and epidemiology of this
major human pathogen [12]. To highlight
one key insight here, these genome se-
quence data have revealed that multiple
lineages of influenza virus are imported and
circulate within specific geographic locali-
ties (even within relatively isolated popula-
tions), generating both frequent mixed
infections [13] and reassortment events
[14]. Even so, the sampling of these
genome sequences (and associated epide-
miological covariates) may not be dense
enough to fully capture spatial dynamics
[15]. There is also a marked absence of
samples from asymptomatically infected
patients (or those with mild disease), so it
is impossible to link genetic variation to
clinical syndrome. Such a bias against
viruses sampled from individuals with
asymptomatic infections is a common
problem in molecular epidemiology.
Epidemiological Factors
It is also clear that for many RNA
viruses we need to better understand a
This article is part of the ‘‘Genomics of Emerging Infectious Disease’’ PLoS Journal collection (http://ploscollections.org/emerginginfectiousdisease/).
Citation: Holmes EC, Grenfell BT (2009) Discovering the Phylodynamics of RNA Viruses. PLoS Comput Biol 5(10): e1000505. doi:10.1371/journal.pcbi.1000505
Editor: Ernest Fraenkel, Massachusetts Institute of Technology, United States of America
Published October 26, 2009
Copyright: © 2009 Holmes, Grenfell. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Funding: BTG was supported by the RAPIDD program of the Science & Technology Directorate of the Department of Homeland Security and the National Institutes of Health (NIH), and National Science Foundation grant 0742373. ECH was supported by the NIH (grant GM080533). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Competing Interests: The authors have declared that no competing interests exist.
* E-mail: [email protected]
PLoS Computational Biology | www.ploscompbiol.org 1 October 2009 | Volume 5 | Issue 10 | e1000505
number of key epidemiological factors,
such as the interaction between local
persistence, epidemic dynamics in both
time and space, the impact of measures to
control the spread of infection, and the
consequences of adaptive evolution in
those viral genes that interact most
intimately with the host immune response.
It is instructive to imagine the ideal
database for addressing these issues. In
the case of acute infections, the goal would
be to collect four parallel datasets on the
appropriate scale of interest during out-
breaks (Figure 1). This database would
comprise, first, epidemic dynamics in time and
space, ideally at a comparable or higher
frequency than the generation time of
individual infections. Second, and in
parallel, our ideal study would collect viral
genome sequence data at these time points,
sampling both within and among infected
hosts. Both disease incidence data (bol-
stered by contact tracing) and viral
sequence data furnish information on the
transmission network traced by an out-
break. Third, we would need to know the
underlying contact network of susceptible
individuals, which serves as fuel for the
epidemic. This is a difficult structure to
measure directly, although novel measure-
ments of human interactions are increas-
ingly shedding light on the problem [16].
Finally, measurements of the immunity
structure of our contact network [17]—
reflecting the past history of the virus in
the population—are key for understanding
both the dynamics of epidemic spread and
the evolutionary pressures that shape virus
diversity.
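As an illustration only, the four parallel datasets of this ideal database could be held in a single structure along the following lines. This is a minimal sketch: every class, field, and method name is hypothetical, and the summary statistic at the end (the fraction of reported cases with a sequenced genome) is merely one assumed way of checking whether genomes are sampled on the same scale as the epidemic.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import FrozenSet, List, Set


@dataclass(frozen=True)
class IncidenceRecord:
    """Dataset 1: case counts in time and space (hypothetical layout)."""
    location: str
    day: date
    cases: int


@dataclass(frozen=True)
class SequenceRecord:
    """Dataset 2: one viral genome, tagged with host, date, and place."""
    host_id: str
    day: date
    location: str
    genome: str


@dataclass
class OutbreakDatabase:
    incidence: List[IncidenceRecord] = field(default_factory=list)
    sequences: List[SequenceRecord] = field(default_factory=list)
    # Dataset 3: contact network, stored as unordered pairs of host IDs.
    contacts: Set[FrozenSet[str]] = field(default_factory=set)
    # Dataset 4: immunity structure, here reduced to a set of immune hosts.
    immune_hosts: Set[str] = field(default_factory=set)

    def sequenced_fraction(self) -> float:
        """Distinct sequenced hosts divided by total reported cases."""
        total_cases = sum(record.cases for record in self.incidence)
        sequenced = len({record.host_id for record in self.sequences})
        return sequenced / total_cases if total_cases else 0.0
```

A real study would need far richer records (within-host time series, uncertainty in contact tracing, graded rather than binary immunity); the sketch only shows how the four components sit side by side.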
The outbreak of foot-and-mouth disease
(FMD, an RNA virus infection of cattle) in
the UK in 2001 resulted in a database that
is arguably closest to our ideal on the
epidemiological scale [18,19]. Notwith-
standing a variety of gaps in data from
the epidemic [20], it is one of the most
well-documented large outbreaks in terms
of the availability of spatiotemporal inci-
dence data in parallel with contact tracing
and the underlying spatial pattern of the
susceptible farms as a measure of the
contact network. In addition, analyses of
viral sequences from relatively small sam-
ples of farms have drawn important
conclusions about epidemic spread and
allowed the testing of new methods to
recover the spatiotemporal patterns written
into sequence data [18,20]. Importantly,
samples exist from over half the
~2,000 confirmed infected premises in
2001: sequencing whole FMD virus ge-
nomes from these samples would provide a
vast resource for basic and applied developments
in integrating epidemiological
and phylogenetic information to dissect
spatiotemporal spread. We suggest that
achieving this task would be a huge
contribution to understanding the phylodynamics
of acute viruses. Another virtue
of animal infections like FMD is that the
relationship between the determinants of
viral variability within and between hosts
can also be dissected by experimental
infections (see [21] for another example).
Box 1. Key Research Questions in RNA Virus Phylodynamics
(1) What is the range of phylodynamic patterns observed in RNA viruses? Can they
be categorized into specific groups? How do these patterns relate to other “life
history” variables exhibited by RNA viruses?
(2) What epidemiological and evolutionary processes give rise to these phylodynamic
patterns? What generalities can be drawn?
(3) How commonly does natural selection (compared to neutral evolutionary
processes) determine the population dynamics of pathogens? On what scale does
natural selection act? How does viral immune escape reduce herd immunity at the
population level and allow the persistence of viral lineages in epidemic troughs?
(4) What is the range of spatial patterns exhibited by RNA viruses? What
epidemiological factors are responsible for these patterns?
(5) How do different viral species (various respiratory viruses, for example) interact
in host immunity?
Figure 1. Sampling scales for acute RNA viruses and the associated phylodynamic processes that viral genome sequence data and host sampling can elucidate. doi:10.1371/journal.pcbi.1000505.g001
A parallel limitation of many phyloge-
netic approaches to viral epidemiology is
that they have often proceeded in the
absence of the necessary metadata, such as
the precise time and place of sampling or
those that relate to clinical syndrome [22].
A perhaps more challenging goal for
phylodynamics is therefore to integrate
phylogenetic patterns with other biological
variables, such as the nature of antigenic
variation, the capacity for drug resistance,
or the clinical syndrome of the host, as well
as the spatial host network data outlined
above. Cohort studies may be the most
productive way to link genomics with
epidemiological variables.
The lack of a synthesis of phylogenetic
and phenotypic/epidemiological data is
reflected in the current debate over the
mode of antigenic evolution in human
influenza A virus. Although it has long
been known that the hemagglutinin (HA)
and neuraminidase (NA) proteins of hu-
man influenza A virus evolve by strong
natural selection to evade the host immune
response—a process commonly called
antigenic drift [23,24]—the precise mech-
anisms by which such drift occurs are
uncertain. From a phylodynamics perspec-
tive, the key observation is that over long
time periods a single lineage of HA
sequences from subtype A/H3N2 influen-
za viruses links epidemic to epidemic [23],
although intensive sampling has revealed
that single populations may harbor far
higher levels of genetic diversity [25].
Rather different phylodynamic patterns
are seen in other influenza viruses, includ-
ing those sampled from birds (Figure 2).
Three models have been proposed to
explain the distinctive phylodynamic pat-
tern observed in human A/H3N2 viruses:
(i) that there is short-lived cross-immunity
among viral strains [26], (ii) that the HA
evolves in a punctuated manner among
antigenic types that are linked by a
network of neutrally evolving sites [27],
and (iii) that the virus continually reuses a
limited number of antigenic combinations
[28].
To determine which combination of
these models best explains influenza phy-
lodynamics will require more expansive
genome sequence data, as well as focused
sampling and epidemiological surveillance
in Southeast Asia, which is likely the global
source population for the virus [29]. More
importantly, it is also crucial that these
phylogenetic data are combined with
detailed, spatiotemporally disaggregated
antigenic information. Indeed, it is remarkable
that despite the abundance of
information on the antigenic characteristics
of individual influenza viruses, most
notably through the use of the hemagglutination
inhibition (HI) assay [17], these data
have not been routinely linked to phylo-
genetic information. It is clear that both
antigenic and phylogenetic analyses would
greatly benefit from each other.
New-Generation Computational Tools
Another important challenge for phylo-
dynamics is to match the remarkable
ongoing developments in genome se-
quencing technology to the increase in
the power of the computational tools
available to analyze these sequence data.
Crucially, in phylogenetics, the size of the
space of possible trees increases faster than
exponentially with the number of sequenc-
es, such that the availability of datasets
comprising thousands of complete ge-
nomes [30] presents a major combinato-
rial problem. This problem creates a
growing discrepancy between our ability
to generate genome sequence data and our
capacity to analyze them using the most
sophisticated methods. Redressing this
balance should be the major goal of
bioinformatics in the future; and in fact
some progress has been made recently
[31].
Figure 2. Phylodynamic patterns of human and avian influenza viruses. The left diagram shows the phylogeny of the hemagglutinin (HA) gene of human H3N2 influenza A viruses sampled between 1985 and 2005, revealing the “ladder-like” branching structure indicative of antigenic drift. By comparison, the phylogeny of the HA gene of human influenza B virus sampled over the same interval (center diagram) shows the co-circulation of the antigenically distinct “Victoria 1987” and “Yamagata 1988” lineages, as well as a shorter length from root to tip, reflecting a lower rate of evolutionary change. Finally, the phylogeny for the HA gene of H4 avian influenza virus (right diagram) reveals the deep geographic division between the Eurasian and Australian versus North American lineages of this virus. doi:10.1371/journal.pcbi.1000505.g002
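To make the combinatorial problem concrete, the number of distinct strictly bifurcating tree topologies for n labeled sequences is given by the standard double-factorial counts, (2n-5)!! for unrooted and (2n-3)!! for rooted trees. The short sketch below evaluates them; the function names are illustrative and not taken from any phylogenetics package.

```python
from math import prod


def num_unrooted_trees(n: int) -> int:
    """Unrooted, strictly bifurcating, labeled topologies: (2n - 5)!!."""
    if n < 3:
        return 1
    # Product of the odd numbers 3, 5, ..., 2n - 5 (empty product is 1).
    return prod(range(3, 2 * n - 4, 2))


def num_rooted_trees(n: int) -> int:
    """Rooted, strictly bifurcating, labeled topologies: (2n - 3)!!."""
    if n < 2:
        return 1
    # Product of the odd numbers 3, 5, ..., 2n - 3.
    return prod(range(3, 2 * n - 2, 2))


if __name__ == "__main__":
    for n in (4, 10, 20, 50):
        print(n, num_unrooted_trees(n))
```

Ten sequences already admit 2,027,025 unrooted topologies, so exhaustive evaluation is out of reach long before datasets contain thousands of genomes, which is why heuristic and sampling-based tree searches are used in practice.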
It is also clear that improvements need
to be made to the methods that are
available to analyze genome sequence
data. A powerful set of research tools in
this area comprises those based on coales-
cent theory, as this provides a natural link
between the analysis of epidemiological
and phylogenetic patterns [8,32]. In par-
ticular, the coalescent allows the demo-
graphic characteristics of viral populations
(particularly population size and growth
rate) to be inferred directly from gene
sequence data. Coalescent analyses are
especially powerful in the case of RNA
viruses, because their rapid evolution
means that temporal and spatial dynamics
are discernable over the period of human
observation [33] and can in theory be
combined with time series epidemiological
data. However, currently available coales-
cent methods are restricted by the limited
scope of demographic models and their
inability to fully incorporate spatial infor-
mation. In particular, most acute RNA
viruses have complex population dynamics
that combine distinct periods of growth
and decline. The most commonly used
phylodynamic tool available in such cases
is the Bayesian skyline plot (and the related
Bayesian ‘‘skyride’’ [34]), which represents
a piecewise graphical depiction of changes
in genetic diversity through time [32]. In
the case of neutral evolution, such changes
in genetic diversity also reflect underlying
changes in the number of infected hosts.
Although the Bayesian skyline plot can
reveal unique features of epidemic dynam-
ics (Figure 3) [30], precise estimates of
parameters such as population growth rate
are not yet possible.
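The logic that links a genealogy to population size can be illustrated with the classic, non-Bayesian skyline estimator, a much simpler relative of the Bayesian skyline plot discussed above. Under the neutral coalescent, the expected waiting time while k lineages persist is the product of effective population size and generation time divided by k(k-1)/2, so each observed interval yields a point estimate of that product. The sketch assumes that all sequences are sampled at the same time and that coalescent times are known without error; neither assumption holds for real, serially sampled viral data, and the function name is hypothetical.

```python
from math import comb


def classic_skyline(coalescent_times):
    """Piecewise estimates of (effective population size x generation time).

    coalescent_times: times before the present of the n - 1 coalescent
    events in a genealogy of n tips that were all sampled at time 0.
    Returns a list of (interval_start, interval_end, estimate) tuples.
    """
    times = sorted(coalescent_times)
    lineages = len(times) + 1          # n tips give n - 1 coalescent events
    previous = 0.0
    estimates = []
    for t in times:
        width = t - previous
        # Interval length scaled by the number of lineage pairs, C(k, 2).
        estimates.append((previous, t, width * comb(lineages, 2)))
        previous = t
        lineages -= 1                  # each coalescence merges two lineages
    return estimates
```

Because each estimate rests on a single interval, the resulting curve is noisy; Bayesian methods smooth over intervals and average over uncertainty in the genealogy itself.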
The coalescent methods commonly
used to study RNA virus evolution focus
largely on temporal dynamics (a natural
function of the rapidity of viral evolution),
with little consideration of patterns of
spatial diffusion. Although these phylogeo-
graphic patterns are becoming increasing-
ly well described for RNA viruses [35], few
methods effectively recover the spatial
component in genome sequence data.
For example, commonly used parsimony-
based approaches consider a single phylo-
genetic tree without an explicit spatial
model (see, for example, [36]). In addition,
these methods usually describe the place of
origin and direction of spread of viral
lineages without formal tests of competing
spatial hypotheses. As a specific case in
point, although gravity models (in which
patterns of viral transmission reflect the
size of and distance between population
centers) have been applied successfully to
morbidity and mortality data from human
influenza A virus to describe its spread
across the United States [37], they have
yet to be interpreted within a phylogenetic
setting. A clear push for the future should
therefore be the development of coalescent
tools that integrate the analysis of spatial
and temporal dynamics within a single
framework, with a focus on those that
combine phylogenetic data and informa-
tion on the dynamics of the host contact
network of susceptible, infected, and
immune individuals.
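For readers unfamiliar with gravity models, the following minimal sketch shows the basic form, in which the coupling between two population centers rises with their sizes and falls with the distance between them. The function name, the parameter names, and the default exponents are assumptions chosen for illustration; in practice the exponents are estimated from data.

```python
def gravity_coupling(pop_i, pop_j, distance_km,
                     alpha=1.0, beta=1.0, gamma=2.0):
    """Relative transmission coupling between two population centers.

    coupling = pop_i**alpha * pop_j**beta / distance_km**gamma
    The exponents are illustrative defaults, not fitted values.
    """
    if distance_km <= 0:
        raise ValueError("distance must be positive")
    return (pop_i ** alpha) * (pop_j ** beta) / (distance_km ** gamma)


if __name__ == "__main__":
    # Two large, distant cities versus a large city and a nearby town.
    print(gravity_coupling(8_000_000, 4_000_000, 4000.0))
    print(gravity_coupling(8_000_000, 50_000, 40.0))
```

A phylogenetic test of such a model would ask whether inferred lineage movements between locations occur at rates proportional to these couplings, which is the kind of integration the text argues is still missing.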
Looking beyond the Consensus Sequence
The vast majority of studies of RNA
virus evolution undertaken to date, partic-
ularly of those viruses that cause acute
infections, rely on the analysis of consensus
sequences in which the nucleotide shown
for any given site is the most common
among all the genomes within a patient.
Although the use of consensus sequences is
adequate for many aspects of molecular
epidemiology, in which complete genomes
may suffice to determine even tight
transmission chains [20], there is growing
evidence that key evolutionary processes
occur beyond the consensus. In particular,
extensive intra-host gene sequencing has
revealed the existence of minor viral
subpopulations within individual hosts that
are not detected by consensus sequencing
and that are sometimes of great pheno-
typic importance [38,39]. Given the in-
trinsically high mutation rates of RNA
viruses, as well as the immense size of
intra-host populations, such extensive ge-
netic and phenotypic diversity is only to be
expected.
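The distinction between a consensus sequence and the minor variants it conceals can be shown with a small sketch. It assumes a set of equal-length, already aligned sequences from a single host; the function names and the 5% reporting threshold are hypothetical, and real pipelines must also handle sequencing error, gaps, and uneven coverage.

```python
from collections import Counter


def site_counts(reads):
    """Per-site nucleotide counts for equal-length, aligned sequences."""
    if not reads:
        raise ValueError("at least one sequence is required")
    length = len(reads[0])
    if any(len(read) != length for read in reads):
        raise ValueError("sequences must be aligned to the same length")
    return [Counter(column) for column in zip(*reads)]


def consensus(reads):
    """Most common nucleotide at each site across all sequences."""
    return "".join(c.most_common(1)[0][0] for c in site_counts(reads))


def minor_variants(reads, min_freq=0.05):
    """(site, nucleotide, frequency) for non-consensus nucleotides."""
    total = len(reads)
    found = []
    for site, counts in enumerate(site_counts(reads)):
        major = counts.most_common(1)[0][0]
        for base, count in counts.items():
            if base != major and count / total >= min_freq:
                found.append((site, base, count / total))
    return found
```

In this representation, two hosts can share an identical consensus while differing in their minor variants, which is precisely the information lost when only consensus sequences are deposited.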
Figure 3. Fluctuating genetic diversity of influenza A virus. The figure shows a Bayesian skyline plot of changing levels of genetic diversity through time for the HA gene (165 sequences) of A/H3N2 virus sampled from the state of New York, US, during the period 2001–2003. The y-axes depict relative genetic diversity (N_e t, where N_e is the effective population size, and t the generation time from infected host to infected host), which can be considered a measure of effective population size under strictly neutral evolution. Peaks of genetic diversity, reflecting the seasonal occurrence of influenza, are clearly visible. See [30] for a more detailed analysis. doi:10.1371/journal.pcbi.1000505.g003
A full description of the extent and
structure of intra-host viral genetic varia-
tion is critical for understanding evolu-
tionary dynamics, informing on such issues
as the frequency of mixed infection, and
hence the degree and extent of cross-
immunity; the frequency with which
antigenic variants are produced and
whether antigenic evolution can occur on
the time scale of individual infections; and
the size of the population bottleneck that
might accompany inter-host transmission.
As a case in point, it is commonly assumed
that viruses experience a severe population
bottleneck as they are transmitted to new
hosts, a phenomenon that greatly restricts
the power of natural selection to fix
advantageous mutations. Although this
assumption appears to be true in some
cases [40], whether this is a general
property of RNA viruses is unclear; the
evidence that multiple viral lineages can
be transmitted among hosts argues against
a narrow bottleneck in all cases [41]. To
more accurately determine the size of the
transmission bottleneck, analyses of intra-
host genetic diversity along known trans-
mission chains will be essential. On a
larger scale, it is unclear whether phylo-
dynamic patterns differ within and among
hosts, and whether any differences among
these scales of analysis are qualitative or
quantitative.
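A simple calculation illustrates why the size of the transmission bottleneck matters for intra-host diversity. If, purely as a simplifying assumption, the founding population in the recipient is a random sample of N genomes drawn independently from the donor, then a variant at frequency f in the donor is transmitted with probability 1 - (1 - f)^N. The function name below is hypothetical, and the model ignores selection during transmission.

```python
def prob_variant_transmitted(freq, bottleneck_size):
    """Chance that at least one of N founding genomes carries the variant.

    Assumes the N founders are drawn independently and at random from
    the donor population, in which the variant has frequency `freq`.
    """
    if not 0.0 <= freq <= 1.0:
        raise ValueError("freq must lie between 0 and 1")
    if bottleneck_size < 1:
        raise ValueError("bottleneck_size must be at least 1")
    return 1.0 - (1.0 - freq) ** bottleneck_size


if __name__ == "__main__":
    for n in (1, 10, 100):
        print(n, prob_variant_transmitted(0.01, n))
```

Under this assumption a variant present at 1% in the donor is almost never transmitted through a bottleneck of one genome but is transmitted more often than not through a bottleneck of one hundred, so comparing minor-variant sharing along known transmission chains constrains N.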
Intra-host sequence data are also essen-
tial for understanding the process of cross-
species virus transmission and emergence.
Key parameters in determining whether a
virus will adapt successfully to a new host
species include the extent of intra-host
genetic diversity, the fitness distribution of
the mutations produced, and how many of
these mutations will assist adaptation to
new host species [41–43]. No such data
are available for any acute RNA virus, so
testing models for viral emergence is
difficult. We believe, however, that under-
standing the mechanics of this adaptive
process is at least as important as surveying
for new emerging viruses.
Challenges for the Future
Our discussion has highlighted a num-
ber of key challenges for a successful
phylodynamic research agenda. These
challenges comprise data, theory, and
methodological issues, and are briefly
summarized as follows. First, with respect
to data, it is clear that more genome
sequences must be acquired and with
increased temporal and spatial precision.
For example, wherever possible, GenBank
records should contain the exact day and
precise latitude and longitude of sampling.
In addition, it is essential that these
sequence data be linked with the relevant
metadata, such as the associated clinical
syndrome and (if applicable) measure of
antigenicity. Similarly, it is essential that
equivalent genome sequence data be
acquired from multiple time points within
individual hosts. Second, in terms of
theory, it is crucial that we fully integrate
patterns of viral evolution across multiple
epidemiological scales, from within hosts,
to local outbreaks, and on to global
pandemics. Although the coalescent is
hugely useful in this respect, it is essential
that its theoretical framework be extended
to incorporate models of population
growth and decline that most accurately
reflect the population dynamics of acute
RNA viruses, in particular the dynamics of
the susceptible “denominator” that fuels
epidemics. Sequencing of all available
samples from the UK 2001 FMD epidem-
ic would yield great scientific dividends
here. Third and finally, with respect to
methodology, new computational tools are
needed to rapidly make phylodynamic
inferences from genomic datasets that
may contain thousands of sequences and
that efficiently integrate genomic with
other forms of biological data. We hope
this review will stimulate research in all
these areas.
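The data requirements listed above (exact sampling day, precise coordinates, linked clinical and antigenic metadata, and serial samples from individual hosts) could be captured in a record such as the one sketched here. The field names are hypothetical and do not correspond to an actual GenBank schema.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class SampleRecord:
    accession: str                  # placeholder identifier, not a real accession
    collection_date: date           # exact day of sampling
    latitude: float                 # decimal degrees, -90 to 90
    longitude: float                # decimal degrees, -180 to 180
    host_id: Optional[str] = None   # links serial samples from the same host
    clinical_syndrome: Optional[str] = None
    antigenic_measure: Optional[float] = None

    def __post_init__(self):
        if not -90.0 <= self.latitude <= 90.0:
            raise ValueError("latitude out of range")
        if not -180.0 <= self.longitude <= 180.0:
            raise ValueError("longitude out of range")


def group_by_host(records):
    """Return {host_id: records sorted by date} for within-host time series."""
    grouped = {}
    for record in records:
        if record.host_id is not None:
            grouped.setdefault(record.host_id, []).append(record)
    for series in grouped.values():
        series.sort(key=lambda r: r.collection_date)
    return grouped


# Hypothetical records: two time points from one host
records = [
    SampleRecord("EXAMPLE0002", date(2003, 1, 9), 42.65, -73.75, host_id="host1"),
    SampleRecord("EXAMPLE0001", date(2003, 1, 5), 42.65, -73.75, host_id="host1"),
]
print([r.accession for r in group_by_host(records)["host1"]])
```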
References
1. Novel Swine-Origin Influenza A (H1N1) Virus
Investigation Team, Dawood FS, Jain S, Finelli L,
Shaw MW, et al. (2009) Emergence of a novel
swine-origin influenza A (H1N1) virus in humans.
N Engl J Med 360: 2605–2615.
2. Lipkin WI (2009) Microbe hunting in the 21st
century. Proc Natl Acad Sci U S A 106: 6–7.
3. Cox-Foster DL, Conlan S, Holmes EC,
Palacios G, Evans JD, et al. (2007) A metage-
nomic survey of microbes in honey bee colony
collapse disorder. Science 318: 283–287.
4. Finkbeiner SR, Allred AF, Tarr PI, Klein EJ,
Kirkwood CD, et al. (2008) Metagenomic analysis
of human diarrhea: viral detection and discovery.
PLoS Pathog 4(2): e1000011. doi:10.1371/journal.
ppat.1000011.
5. Zhang T, Breitbart M, Lee WH, Run JQ, Wei CL,
et al. (2005) RNA viral community in human feces:
Prevalence of plant pathogenic viruses. PLoS Biol
4(1): e3. doi:10.1371/journal.pbio.0040003.
6. Palacios G, Druce J, Du L, Tran T, Birch C, et al.
(2008) A new arenavirus in a cluster of fatal
transplant-associated diseases. N Engl J Med 358:
991–998.
7. Palmenberg AC, Spiro D, Kuzmickas R,
Wang S, Djikeng A, et al. (2009) Sequencing
and analyses of all known human rhinovirus
genomes reveals structure and evolution. Sci-
ence 324: 55–59.
8. Grenfell BT, Pybus OG, Gog JR, Wood JLN,
Daly JM, et al. (2004) Unifying the epidemiolog-
ical and evolutionary dynamics of pathogens.
Science 303: 327–332.
9. Bjørnstad ON, Finkenstadt B, Grenfell BT (2002)
Dynamics of measles epidemics. I. estimating
scaling of transmission rates using a time series
SIR model. Ecol Monogr 72: 169–184.
10. Grenfell BT, Bjornstad ON, Finkenstadt BF
(2002) Dynamics of measles epidemics. II. Scaling
noise, determinism and predictability with the
time series SIR model. Ecol Monogr 72:
185–202.
11. Ghedin E, Sengamalay NA, Shumway M,
Zaborsky J, Feldblyum T, et al. (2005) Large-
scale sequencing of human influenza reveals the
dynamic nature of viral genome evolution.
Nature 437: 1162–1166.
12. Nelson MI, Holmes EC (2007) The evolution of
epidemic influenza. Nat Rev Genet 8: 196–205.
13. Ghedin E, Fitch A, Boyne A, DePasse J, Bera J,
et al. (2009) Mixed infection and the genesis of
influenza diversity. J Virol 83: 8832–8841.
14. Nelson MI, Simonsen L, Viboud C, Miller MA,
Taylor J, et al. (2006) Stochastic processes are key
determinants of the short-term evolution of
influenza A virus. PLoS Pathog 2: e125.
doi:10.1371/journal.ppat.0020125.
15. Nelson MI, Edelman L, Spiro DJ, Boyne AR,
Bera J, et al. (2008) Molecular epidemiology of
A/H3N2 and A/H1N1 influenza virus during a
single epidemic season in the United States. PLoS
Pathog 4(8): e1000133. doi:10.1371/journal.
ppat.1000133.
16. Gonzalez MC, Hidalgo CA, Barabasi AL (2008)
Understanding individual human mobility pat-
terns. Nature 453: 779–782.
17. Smith DJ, Lapedes AS, de Jong JC,
Bestebroer TM, Rimmelzwaan GF, et al.
(2004) Mapping the antigenic and genetic
evolution of influenza virus. Science 305:
371–376.
18. Cottam EM, Haydon DT, Paton DJ, Gloster J,
Wilesmith JW, et al. (2006) Molecular epidemi-
ology of the foot-and-mouth disease virus out-
break in the United Kingdom in 2001. J Virol 80:
11274–11282.
19. Keeling MJ, Woolhouse MEJ, Shaw DJ,
Matthews L, Chase-Topping M, et al. (2001)
Dynamics of the 2001 UK foot and mouth
epidemic: stochastic dispersal in a heterogeneous
landscape. Science 294: 813–817.
20. Cottam EM, Wadsworth J, Shaw AE,
Rowlands RJ, Goatley L, et al. (2008) Transmis-
sion pathways of foot-and-mouth disease virus in
the United Kingdom in 2007. PLoS Pathog 4(4):
e1000050. doi:10.1371/journal.ppat.1000050.
21. Hoelzer K, Shackelton LA, Holmes EC,
Parrish CR (2008) Within-host genetic diversity
of endemic and emerging parvoviruses of cats
and dogs. J Virol 82: 11096–11105.
22. Holmes EC (2007) Viral evolution in the genomic
age. PLoS Biol 5(10): e278. doi:10.1371/journal.
pbio.0050278.
23. Fitch WM, Leiter JME, Li X, Palese P (1991)
Positive Darwinian evolution in human influenza
A viruses. Proc Natl Acad Sci U S A 88:
4270–4274.
24. Webster RG, Laver WG, Air GM, Schild GC
(1982) Molecular mechanisms of variation in
influenza viruses. Nature 296: 115–121.
25. Holmes EC, Ghedin E, Miller N, Taylor J, Bao Y,
et al. (2005) Whole genome analysis of human
influenza A virus reveals multiple persistent
lineages and reassortment among recent H3N2
viruses. PLoS Biol 3(9): e300. doi:10.1371/
journal.pbio.0030300.
26. Ferguson NM, Galvani AP, Bush RM (2003)
Ecological and immunological determinants of
influenza evolution. Nature 422: 428–433.
27. Koelle K, Cobey S, Grenfell B, Pascual M (2006)
Epochal evolution shapes the phylodynamics of
interpandemic influenza A (H3N2) in humans.
Science 314: 1898–1903.
28. Recker M, Pybus OG, Nee S, Gupta S (2007)
The generation of influenza outbreaks by a
network of host immune responses against a
limited set of antigenic types. Proc Natl Acad
Sci U S A 104: 7711–7716.
29. Russell CA, Jones TC, Barr IG, Cox NJ,
Garten RJ, et al. (2008) The global circulation
of seasonal influenza A (H3N2) viruses. Science
320: 340–346.
30. Rambaut A, Pybus OG, Nelson MI, Viboud C,
Taubenberger JK, et al. (2008) The genomic and
epidemiological dynamics of human influenza A
virus. Nature 453: 615–619.
31. Suchard MA, Rambaut A (2009) Many-core
algorithms for statistical phylogenetics. Bioinfor-
matics 25: 1370–1376.
32. Drummond AJ, Rambaut A, Shapiro B,
Pybus OG (2005) Bayesian coalescent inference
of past population dynamics from molecular
sequences. Mol Biol Evol 22: 1185–1192.
33. Drummond AJ, Pybus OG, Rambaut A,
Forsberg R, Rodrigo AG (2003) Measurably
evolving populations. Trends Ecol Evol 18:
481–488.
34. Minin VN, Bloomquist EW, Suchard MA (2008)
Smooth skyride through a rough skyline: Bayesian
coalescent-based inference of population dynamics.
Mol Biol Evol 25: 1459–1471.
35. Holmes EC (2008) The evolutionary history and
phylogeography of human viruses. Annu Rev
Microbiol 62: 307–328.
36. Wallace RG, Hodac H, Lathrop RH, Fitch WM
(2007) A statistical phylogeography of influenza A
H5N1. Proc Natl Acad Sci U S A 104: 4473–4478.
37. Viboud C, Bjornstad ON, Smith DL, Simonsen L,
Miller MA, et al. (2006) Synchrony, waves, and
spatial hierarchies in the spread of influenza.
Science 312: 447–451.
38. Aaskov J, Buzacott K, Thu HM, Lowry K,
Holmes EC (2006) Long-term transmission of
defective RNA viruses in humans and Aedes
mosquitoes. Science 311: 236–238.
39. Jerzak G, Bernard KA, Kramer LD, Ebel GD
(2005) Genetic variation in West Nile virus from
naturally infected mosquitoes and birds suggests
quasispecies structure and strong purifying selection.
J Gen Virol 86: 2175–2183.
40. Keele BF, Giorgi EE, Salazar-Gonzalez JF,
Decker JM, Pham KT, et al. (2008) Identification
and characterization of transmitted and early
founder virus envelopes in primary HIV-1
infection. Proc Natl Acad Sci U S A 105:
7552–7557.
41. Holmes EC (2009) The evolution and emergence
of RNA viruses. Oxford Series in Ecology and
Evolution. Harvey PH, May RM, eds. Oxford:
Oxford University Press.
42. Kuiken T, Holmes EC, McCauley J,
Rimmelzwaan GF, Williams CS, et al. (2006)
Host species barriers to influenza virus infections.
Science 312: 394–397.
43. Parrish CR, Holmes EC, Morens DM, Park EC,
Burke DS, et al. (2008) Cross-species viral
transmission and the emergence of new epidemic
diseases. Microbiol Mol Biol Rev 72: 457–470.
Perspective
Computational Resources in Infectious Disease: Limitations and Challenges
Eva C. Berglund, Bjorn Nystedt, Siv G. E. Andersson*
Department of Molecular Evolution, Uppsala University, Uppsala, Sweden
Infectious diseases continue to be a
major cause of death in the human
population, with tuberculosis and malaria
affecting 500 million people and causing
1–2 million deaths annually [1]. The
situation is aggravated by the increasing
prevalence of antibiotic-resistant bacteria
and the risk that terrorists might use
infectious organisms to attack target
populations. During the past decade, we
have also witnessed the emergence of
many new pathogens not previously de-
tected in humans, such as avian
influenza virus, the severe acute respiratory
syndrome (SARS) coronavirus, and Ebola virus. The ap-
pearance of these novel agents and the
reemergence of previously eradicated
pathogens may be associated with the
growing human population, flooding, and
other environmental perturbations; global
travel and migration; and animal trade
and domestic animal husbandry practices.
Simultaneously, we have seen an explosion
of genome sequence data. Sequencing is
now the method of choice for character-
ization of new disease agents, as exempli-
fied by the rapid sequencing of the
genome of the SARS virus, which was
made available within a month of identi-
fication of the virus [2,3]. Like SARS,
most newly emerging disease agents orig-
inate in animals and have been transmit-
ted to humans recently at food markets, by
insect bites, or through hunting [1].
The new sequencing technologies enable
small academic research groups to create
huge genome datasets at low cost. As a
result, scientists with expertise in other
fields of research, such as clinical microbi-
ology and ecology, are just beginning to
face the challenge of handling, comparing,
and extracting useful information from
millions of sequences. Here, we discuss
the limitations of publicly available resourc-
es in the field of genomics of emerging
bacterial pathogens, emphasizing areas
where increased efforts in computational
biology are urgently needed.
Genome Evolution in Emerging Bacterial Pathogens
A natural ecosystem of a bacterial
population that incidentally infects hu-
mans provides a high-risk microenviron-
ment for the establishment of this patho-
gen in the human population (Box 1;
Figure 1). Comparative studies of the
genomes of well-recognized human path-
ogens, incidental pathogens, and their
closely related nonpathogenic species [4–
11] are valuable for efforts to predict the
propensity for host shifts and their conse-
quences for human health.
A successful infectious bacterium,
whether it causes disease or not, must
possess mechanisms for interacting with
the host and evading the host immune
system. The key players in these processes
are often proteins on the surface of the
bacterium, including secretion systems
that release effector proteins into the
surrounding medium or directly into the
host cells. These host-interaction factors
are often members of large protein families
with many paralogs and often encoded by
long genes with internal repeats. Fluctua-
tions in gene length and copy number
occur through homologous recombination
over these repeats [12–15].
Adding to the variability of the host-
interaction genes is that they are often
located on mobile elements such as
plasmids or bacteriophages, which are
easily gained and lost. Rapid sequence
evolution of these genes may be driven by
selection, because it often increases bacte-
rial fitness by escaping the host immune
system, creating a diverse set of binding
structures or tuning effector proteins to a
new host. As a consequence, host-interac-
tion genes typically show extreme plastic-
ity in both sequence and copy number,
partly because they are under strong
evolutionary pressure and partly because
they are mechanistically prone to drastic
mutational changes. Understanding these
complex dynamics poses major challenges
in many areas of computational biology,
ranging from sequence assembly to epi-
demic risk assessment.
Complete Genome Assembly Remains Difficult
Despite the ease with which shotgun
sequence data can be generated, assem-
bling these data into a single genomic
contig remains labor-intensive and time-
consuming. This obstacle is primarily due
to the difficulty of assembling repeated
sequences. Hence, resequencing ap-
proaches—where short sequence reads
are directly mapped to an already com-
pleted reference genome—have become
increasingly popular. Resequencing read-
ily detects SNPs (single nucleotide poly-
morphisms) in single-copy genes, but
performs very poorly in repeated and
highly divergent regions of the genome.
Genes involved in infection processes, with
their complex repeat structures, high
duplication frequency, and rapid evolu-
tion, are thus often left unresolved.
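The difficulty that resequencing has with repeats can be illustrated with a toy calculation: a read that falls entirely within a repeated segment matches the reference at more than one position and cannot be placed unambiguously. The sketch below uses exact string matching on a made-up reference, which is a deliberate simplification (real read mappers tolerate mismatches and gaps); the sequences and function names are hypothetical.

```python
def find_all(reference, read):
    """Return every start position at which the read matches exactly."""
    positions = []
    start = reference.find(read)
    while start != -1:
        positions.append(start)
        start = reference.find(read, start + 1)
    return positions


def ambiguous_fraction(reference, read_length):
    """Fraction of all possible reads of this length that map to more than one position."""
    n_reads = len(reference) - read_length + 1
    ambiguous = sum(
        1
        for i in range(n_reads)
        if len(find_all(reference, reference[i:i + read_length])) > 1
    )
    return ambiguous / n_reads


# Toy reference: unique flanking sequence around two copies of a 12-bp repeat
repeat = "ACGTTGCAAGCT"
reference = "GGATCCATAG" + repeat + "TTTACGGA" + repeat + "CCGTAGATCA"

for length in (8, 12, 16):
    print(length, round(ambiguous_fraction(reference, length), 2))
```

In this toy case, reads longer than the repeat always include some unique flanking sequence and therefore map uniquely, which is the intuition behind the benefit expected from longer reads.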
Perhaps the most pressing need is not
for improved assembly algorithms but for
Citation: Berglund EC, Nystedt B, Andersson SGE (2009) Computational Resources in Infectious Disease: Limitations and Challenges. PLoS Comput Biol 5(10): e1000481. doi:10.1371/journal.pcbi.1000481
Editor: Ernest Fraenkel, Massachusetts Institute of Technology, United States of America
Published October 26, 2009
Copyright: © 2009 Berglund et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Funding: The authors are supported by grants to SGEA from the European Union (QLK3-CT2000-01079, EUWOL and EuroPathogenomics), the Swedish Research Council (http://www.vr.se/), the Goran Gustafsson Foundation (http://www.gustafssonsstiftelse.se/), the Swedish Foundation for Strategic Research (http://www.stratresearch.se/) and the Knut and Alice Wallenberg Foundation (http://www.wallenberg.com/kaw/). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Competing Interests: The authors have declared that no competing interests exist.
* E-mail: [email protected]
This article is part of the ‘‘Genomics of Emerging Infectious Disease’’ PLoS Journal collection (http://ploscollections.org/emerginginfectiousdisease/).
PLoS Computational Biology | www.ploscompbiol.org 1 October 2009 | Volume 5 | Issue 10 | e1000481
better ways to integrate data from diverse
sources, including shotgun sequencing,
paired-end sequencing, PCR experiments,
fosmid and BAC (bacterial artificial chro-
mosome) clone sequencing, physical map-
ping, and restriction fragment data. A
program integrating these different data
should not only accurately assemble as
much of the genome as possible, but also
assist the researcher in designing addition-
al experiments to resolve the remaining
regions. Given the rapidly increasing
number of incomplete genome sequences
available, it would also be valuable to have a
quality-scoring standard that not only
provides quality scores at individual sites
under the assumption that the assembly is
correct, but also reflects the uncertainty of
the actual assembly over specific regions.
While assembly software development is
struggling to keep up, the sequencing
revolution shows no signs of slowing down.
Perhaps the most important new develop-
ment is real-time single molecule detection
platforms with ultra-long sequencing reads
[16]. Within the next few years, we can
expect to see read lengths of 20 kb, which
will help resolve many of the complex
genomic features underlying host adapta-
tion and pathogenicity.
Functional Annotation of Virulence and Host-Interaction Genes
Annotation is the process of assigning
meaningful information, such as the loca-
tion or function of genes, to raw sequence
data. Reliable and consistent annotations
are thus fundamental for analysis and
interpretation of genome data. Since
annotation of new genomes is usually
based on homology searches (e.g., BLAST
hits), errors and inconsistencies tend to
propagate. One way to reduce error
propagation is to functionally annotate a
set of reference genomes based on exper-
imentally determined information. Anno-
tation of new genomes could then start
with searches in this database, which
would allow high-quality annotation of
all well-conserved genes. The Gene On-
tology’s Reference Genome Project [17]
and BioCyc [18] represent developments
in this direction. However, the number of
species included is still limited, and a
broader taxonomic breadth of bacteria,
with one reference species per genus,
would be desirable.
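One way to picture the reference-based strategy described above is to record the evidence behind every annotation and to allow transfer only from experimentally supported reference entries, so that one inferred annotation is never used as the basis for another. The sketch below assumes that homology hits have already been produced by an external search; the evidence labels, the identity threshold, and all names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Annotation:
    gene_id: str
    function: str
    evidence: str      # "experimental" or "inferred"
    source: str = ""   # reference gene the function was transferred from


def transfer_annotations(hits, reference, min_identity=70.0):
    """Annotate query genes from their best experimentally supported hit.

    hits: iterable of (query_id, reference_id, percent_identity) tuples.
    reference: dict mapping reference_id to Annotation.
    min_identity: arbitrary illustrative cutoff, not a recommended value.
    """
    best = {}
    for query_id, ref_id, identity in hits:
        ref = reference.get(ref_id)
        if ref is None or ref.evidence != "experimental":
            continue  # do not chain one inferred annotation onto another
        if identity < min_identity:
            continue
        if query_id not in best or identity > best[query_id][0]:
            best[query_id] = (identity, ref)
    return {
        query_id: Annotation(query_id, ref.function, "inferred", source=ref.gene_id)
        for query_id, (identity, ref) in best.items()
    }


# Hypothetical reference set and search results
reference = {
    "refA": Annotation("refA", "adhesin", "experimental"),
    "refB": Annotation("refB", "effector protein", "inferred", source="refX"),
}
hits = [("q1", "refA", 88.0), ("q2", "refB", 95.0), ("q3", "refA", 40.0)]
print(transfer_annotations(hits, reference))
```

Only q1 receives an annotation here: q2 matches a reference entry that is itself inferred, and q3 falls below the cutoff.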
Functional annotation of pathogen ge-
nomes is particularly important, because
genes involved in host-interaction process-
es are among the most difficult to
annotate. One problem is that different
Box 1. Genomic Changes Associated with Host Shifts
The movement of a bacterial species from abundant animal hosts such as rodents, which are a major reservoir of infectious disease agents, to the relatively small human population is typically associated with decreased genome size and loss/alteration of the mobile gene pool [4,32–34]. One illustrative example can be found in the genus Mycobacterium, which contains several severe human pathogens, including the agents of tuberculosis (M. tuberculosis) and leprosy (M. leprae) and also the recently emerged M. ulcerans. M. ulcerans causes severe skin lesions; this disease, known as Buruli ulcer, is becoming a serious public health problem in West and Central Africa as well as in other parts of the tropics.
Like many other recently emerged human pathogens [4,34–36], M. ulcerans appears to have switched from a generalist to a specialist lifestyle, starting with a progenitor very similar to the aquatic M. marinum. While M. marinum has been found both free-living and as an intracellular pathogen of fish and other species, M. ulcerans is thought to have a restricted host range and to be transmitted by insects (Figure 1A). The host switch was likely initiated by the uptake of a virulence plasmid, and proceeded through a series of “bottleneck events” (severe reductions in population size due to environmental circumstances). This process resulted in loss of about 1 Mb of the genome, major genomic rearrangements, extensive proliferation of insertion sequences, and a massive increase in number of pseudogenes [37–39]. In particular, there was a massive reduction in the size of the two major surface protein gene families (a decrease of more than 250 genes compared to M. marinum). This gene loss is thought to have been crucial for the organism to evade the human immune system, by limiting the number of antigens on the bacterial surface [40].
The uptake of a new virulence plasmid producing an immunosuppressive substance called mycolactone is also thought to have played a key role in the evolution and host switch of M. ulcerans. This plasmid consists mainly of three unusually large and internally repeated genes (over 100 kb in total), and thus illustrates the concept of long and repeated virulence genes (Figure 1B) [41]. These genes appear to evolve rapidly by recombination and gene conversion, and new variants can be directly connected to variations in the chemical structure of mycolactone [42], which might be important for host specificity, immunosuppressive potency, and drug design.
Figure 1. Evolution of a new infectious disease agent. (A) Recent evolution of the specialist human pathogen M. ulcerans from the aquatic generalist pathogen M. marinum. (B) Arrangement of the three M. ulcerans plasmid–encoded repeated virulence genes (arrows from left to right: mlsA1 [51 kb], mlsA2 [7.6 kb], mlsB [43 kb]) coding for three polyketide synthases. The loading modules (labeled LM) and the 16 repeated modules depicted in purple (labeled 1–9 for mlsA1 and mlsA2, and 1–7 for mlsB) enable the serial buildup of the backbone carbon chain of the complex immunosuppressive substance mycolactone. doi:10.1371/journal.pcbi.1000481.g001
research groups often have studied homol-
ogous genes in various species, and given
them different names that are not always
logical or reflective of similarities in
sequence and function. A manually curat-
ed database of protein families involved in
host interactions that incorporates cur-
rently used gene names, sequence motifs,
gene functions, and experimental results
would substantially improve the situation.
Much improved guidelines for how to
annotate genes in large families with
different combinations of sequence motifs
would also be valuable.
Comparative studies of very closely
related genomes can help to distinguish
functional genes from spurious ORFs
(open reading frames) and pseudogenes,
and thereby improve gene prediction. To
this end, a tool to visualize all the fine
details in comparisons of multiple closely
related genomes is crucial. Such a tool was
developed recently for genomes with a
conserved order of genes, and it has been
applied to analyze sequence deterioration
in the typhus pathogen Rickettsia prowazekii
and its closest relatives [10]. Future
studies, however, will require software that
can also handle multiple genome compar-
isons from highly rearranged genomes.
Another limitation of currently available
visualization tools is that, although multi-
ple genomes can be included, only serial
pairwise comparisons can be made. This
limitation can be overcome by visualiza-
tion of genome comparisons in ‘‘three
dimensions’’ (3D visualization), enabling
all-against-all comparisons to be viewed
simultaneously (Figure 2). Just as 3D
visualizations revolutionized the field of
structural biology over the past decades,
such developments might well revolution-
ize the field of comparative genomics in
the years to come.
Molecular Diagnostics and Vaccine Development
Classification of infectious disease
agents is typically based on multilocus
sequence-typing (MLST) systems, by
which new bacterial isolates are analyzed
by sequencing five to seven predefined
core genes [19]. With the increasing
number of complete genome sequences
of pathogenic and nonpathogenic strains,
it will be possible to concatenate a much
larger number of conserved genes and use
this dataset to infer a tree to represent the
underlying population structure [20].
However, while genotyping systems based
on conserved genes can be useful for
monitoring the spread of strains, they do
not necessarily correlate with genomotypes
defined by virulence properties [21]. This
is because genes contributing to virulence
are prone to horizontal gene transfer, gene
duplications, and gene loss. Further com-
plicating the development of molecular
diagnostic methods is that homologs of
virulence genes are often also present in
nonpathogenic species, making it difficult
to recognize pathogens solely from the
gene content. Hence, classification and
risk assessments for the emergence of
novel infectious strains ultimately should
be based on a combination of strain
typing, gene content, and identification
of virulence genes.
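The logic of an MLST scheme can be sketched in a few lines: each distinct sequence at a locus receives an allele number, and each distinct combination of allele numbers defines a sequence type. The locus names, the toy sequences, and the in-memory dictionaries standing in for a curated allele database are hypothetical.

```python
def assign_alleles(isolate_loci, allele_db):
    """Map each locus sequence to an allele number, registering new alleles.

    isolate_loci: dict mapping locus name to sequence for one isolate.
    allele_db: dict mapping locus name to {sequence: allele number}.
    """
    profile = []
    for locus in sorted(isolate_loci):
        known = allele_db.setdefault(locus, {})
        sequence = isolate_loci[locus]
        if sequence not in known:
            known[sequence] = len(known) + 1
        profile.append(known[sequence])
    return tuple(profile)


def sequence_type(profile, st_db):
    """Return the sequence type for an allelic profile, registering new types."""
    if profile not in st_db:
        st_db[profile] = len(st_db) + 1
    return st_db[profile]


# Hypothetical isolates typed at three loci (real schemes use five to seven)
allele_db, st_db = {}, {}
isolates = {
    "isolate1": {"locusA": "ACGT", "locusB": "GGCC", "locusC": "TTAA"},
    "isolate2": {"locusA": "ACGT", "locusB": "GGCC", "locusC": "TTAA"},
    "isolate3": {"locusA": "ACGT", "locusB": "GGCA", "locusC": "TTAA"},
}
types = {
    name: sequence_type(assign_alleles(loci, allele_db), st_db)
    for name, loci in isolates.items()
}
print(types)
```

Because the profile is built only from core genes, two isolates with the same sequence type may still differ in virulence gene content, which is the limitation noted above.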
Understanding the evolutionary dynam-
ics of host-interaction genes in terms of
both mechanisms and selective forces is also
important in order to design drugs that will
be effective in the long term. What good
would the development of a new
antibiotic or vaccine be if the intended target
protein evolves beyond recognition before
the drug reaches the market? One solution
to this problem is to characterize the
selective pressures on candidate vaccine
targets, and then exclude genes or parts of
genes based on their evolutionary dynamics
[22]. However, current tools for measuring
positive or diversifying selection are severe-
ly limited in that they assume that single-
base mutations are the only underlying
mechanism of sequence change. For reli-
able analyses of genes with a complex
evolution, a new generation of evolutionary
tests needs to be developed that acknowl-
edge the importance of mutation by
recombination (Figure 3) and multiple-base
insertion/deletion events as well as point
mutations. With the expected huge increase
of complete and draft genomes for many
strains of a species, there is a need for
programs capable of screening a large set of
alignments for recombination signals, with
novel statistical and visualization tools to
analyze the full set of results.
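A very simple screen for recombination signals among three aligned sequences, in the spirit of the plot described for Figure 3, is to record at each informative site which strain differs from the other two and then look for changes in that pattern along the alignment. The sketch below is a descriptive heuristic on toy sequences, not a statistical test for recombination; the function names and sequences are hypothetical.

```python
def odd_one_out(a, b, c):
    """For each informative column, return (position, index of the odd strain).

    Columns where all three strains agree, or where all three differ,
    are treated as noninformative and skipped.
    """
    if not (len(a) == len(b) == len(c)):
        raise ValueError("sequences must be aligned to the same length")
    sites = []
    for pos, (x, y, z) in enumerate(zip(a, b, c)):
        if x == y != z:
            sites.append((pos, 2))
        elif x == z != y:
            sites.append((pos, 1))
        elif y == z != x:
            sites.append((pos, 0))
    return sites


def pattern_switches(sites):
    """Count changes of the odd strain between consecutive informative sites."""
    return sum(1 for (_, s1), (_, s2) in zip(sites, sites[1:]) if s1 != s2)


# Toy alignment: strain 2 is the outlier on the left, strain 0 on the right
strain0 = "ACGTACGTACGT"
strain1 = "ACGTACGAACCT"
strain2 = "ATGTCCGAACCT"
sites = odd_one_out(strain0, strain1, strain2)
print(sites, pattern_switches(sites))
```

A single consistent outlier along the whole gene is what a tree-like history would produce; blocks in which the outlier changes are candidate recombination segments that would then need proper statistical assessment.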
Predicting Risk for Disease Outbreaks
The next challenge is to place the
genomic data within its ecological context,
which has led to a new research field
called molecular ecosystems biology [23].
This field focuses on dissecting the many
complex molecular interactions between
the bacterial population and its environ-
ment. This environment can be highly
specialized, as in the case of bacteria
adapted to a single host species, or very
complex as for soil-, water-, or airborne
bacteria. The behavior of a pathogen thus
depends on many ecological factors, such
as seasonal fluctuations in temperature
and nutrient availability, species rich-
ness, and host population density.
To be able to integrate and evaluate
these data, new software is needed. Imagine
a program that can read sequence data
from hundreds of bacterial isolates, infer
the underlying population structure, and
combine it with gene expression data,
Figure 2. New visualization tools for genome comparisons. Comparison of the genes in multiple genomes can be represented visually by using a 3D program. Each arrow represents one gene, and the grey shading between genes indicates homology. Red indicates genes that are unique to one genome. The difference between this approach and existing programs is that all genomes can be compared to each other simultaneously, rather than by pairwise comparisons. With multiple genomes, and with zooming, flipping, and selecting options, even this rudimentary 3D program would be of great help in genome analysis. doi:10.1371/journal.pcbi.1000481.g002
Figure 3. New methods for analyzing evolution by recombination. Improved models and visualization tools are needed to analyze recombination. Virulence genes, here exemplified by the acfD gene in the Vibrio cholerae pathogenicity island [43], often display complex recombination patterns. The aligned acfD genes (arrows) from three V. cholerae strains (M2140, M1567, and M1118) are plotted separately; a line connects each site where the nucleotides in two strains differ from the third strain. Noninformative sites were removed before plotting. doi:10.1371/journal.pcbi.1000481.g003
ecological factors, and clinical data such as
the number of disease cases reported in
various geographic areas. It should be
possible to visualize global patterns in the
data, such as abundance of particular
strains and sequence variants and migra-
tion of infected hosts and vectors over
geographic areas and seasons. Changes in
taxonomic profiles, virulence genes, and
metabolic pathways should be visualized in
real time. This program could also be
linked to a Web site where researchers can
post daily updates of clinical cases, spread
of virulence genes, appearance of new
strains and new mutations, migration
patterns, and news about genome and
functional data. This site would be useful
for estimating the risk for new epidemics to
emerge in the human population.
Analyzing Microbial Communities
Analyzing the behavior of complete
pathogen ecosystems is an immediate
priority. Random shotgun sequencing
projects of bacterial DNA from diverse
environments count in the hundreds, and
the amount of metagenomic sequence
data already exceeds the available geno-
mic sequences in public databases [24,25]
(http://www.genomesonline.org). Several
multinational projects on the human
microbiome have been launched, which,
together with studies of 16S rRNA ampli-
cons, have provided new insights into the
human intestinal [26–28], oral [29], and
vaginal flora [30]. Comparison of the
microbial flora in healthy and diseased
people can be a powerful diagnostic tool
and enable the discovery of both emerging
pathogens and novel virulence factors,
such as antibiotic resistance plasmids. An
important technical development that
holds great promise for associating the
functional adaptation of the community as
a whole with the metabolic pathways
present in the individual strains is single-
cell isolation followed by whole-genome
amplification. Community sequencing also
provides an excellent tool for epidemic
surveillance of pathogenic strains and
virulence genes in environments from
which they may further spread to humans.
The massive amount of data created by
microbial community sequencing poses
new challenges and will require extensive
bioinformatics development [24]. Al-
though the advent of longer sequence
reads will have a large impact on the
assembly of community data, the presence
of many closely related species or strains in
the same sample, along with horizontal
gene transfer, will remain a daunting
challenge. A whole new field of compar-
ative algorithms needs to be developed, for
example to provide meaningful compari-
sons between taxonomic profiles. New
sequence databases will be essential for
rapid access to both raw and processed
data. Also, for fair comparisons between
datasets, a certain level of standardization
of sampling, experimental work, and
statistics will be crucial [31]. Bioinfor-
matics skills combined with a deep biolog-
ical understanding of the system under
study are needed to use these complex
sequence datasets to answer such questions
as: Who is there? What are they doing?
How are they communicating? And what
is the risk for disease?
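As one example of what a comparison between taxonomic profiles might look like, the sketch below computes the Bray–Curtis dissimilarity between two samples. The choice of this particular measure is an assumption made for illustration (the text does not name one), and the taxon labels and read counts are made up.

```python
def bray_curtis(profile_a, profile_b):
    """Bray-Curtis dissimilarity between two taxon-count profiles.

    Returns 0.0 for identical profiles and 1.0 for profiles that share
    no taxa. Profiles are dicts mapping taxon name to a count.
    """
    taxa = set(profile_a) | set(profile_b)
    total = sum(profile_a.values()) + sum(profile_b.values())
    if total == 0:
        raise ValueError("both profiles are empty")
    difference = sum(
        abs(profile_a.get(t, 0) - profile_b.get(t, 0)) for t in taxa
    )
    return difference / total


# Hypothetical read counts per taxon in two samples
healthy = {"taxonA": 6, "taxonB": 4}
diseased = {"taxonA": 2, "taxonB": 8}
print(bray_curtis(healthy, diseased))
```

Raw counts depend on sequencing depth and sample handling, so a comparison like this is only meaningful once sampling and processing have been standardized, as argued above.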
Challenges for the Future
The priority goals for the next decade
within the area of emerging infectious
diseases should be the study of complete
pathogen ecosystems and the dissection of
host–pathogen interaction communication
pathways directly in the natural environ-
ment. To achieve these goals, investments
in user-friendly software and improved
visualization tools, along with excellent
expertise in computational biology, will be
of utmost importance. Unfortunately, too
few undergraduate students in clinical
microbiology and microbial ecology are
trained in computational skills, and nation-
al governments and universities need to
take action to address this deficiency to
meet the demands of the near future. Often
neglected by public and private funding is
the monumental need for stable and
standardized infrastructure at all levels,
from the individual research group to the
intergovernmental organization. Only with
proper investments in everything from
hardware and personnel for data handling,
to the development of sensible and stan-
dardized file formats, can we ensure that
the current developments can be fully
exploited to more efficiently battle emerg-
ing infectious diseases.
Currently, the slow transition from a
scientific in-house program to the distribu-
tion of a stable and efficient software
package is a major bottleneck in scientific
knowledge sharing, preventing efficient
progress in all areas of computational
biology. Efforts to design, share, and
improve software must receive increased
funding, practical support, and, not
least, scientific recognition. Since microorganisms
do not follow national borders, such
initiatives are probably best started from
intergovernmental organizations with close
links to national centers with established
communication networks to distribute
know-how and advances further within
the country, and vice versa, to facilitate
the spread of new concepts and software to
all members of the organization. Eventual-
ly, many of these initiatives may become
community-driven. The example of Wiki-
pedia, with more than 10 million entries
written since the launch in 2001 and a
current growth rate of thousands of articles
daily (http://www.wikipedia.org), demon-
strates the power of user-contributed ini-
tiatives.
Acknowledgments
We thank Eddie Persson for graphical work.
References
1. Rappuoli R (2004) From Pasteur to genomics:
Progress and challenges in infectious diseases. Nat
Med 10: 1177–1185.
2. Marra MA, Jones SJ, Astell CR, Holt RA,
Brooks-Wilson A, et al. (2003) The genome
sequence of the SARS-associated coronavirus.
Science 300: 1399–1404.
3. Rota PA, Oberste MS, Monroe SS, Nix WA,
Campagnoli R, et al. (2003) Characterization of a
novel coronavirus associated with severe acute
respiratory syndrome. Science 300: 1394–1399.
4. Parkhill J, Wren BW, Thomson NR, Titball RW,
Holden MT, et al. (2001) Genome sequence of
Yersinia pestis, the causative agent of plague.
Nature 413: 523–527.
5. Welch RA, Burland V, Plunkett G 3rd,
Redford P, Roesch P, et al. (2002) Extensive
mosaic structure revealed by the complete
genome sequence of uropathogenic Escherichia
coli. Proc Natl Acad Sci U S A 99: 17020–17024.
6. Dziejman M, Balon E, Boyd D, Fraser CM,
Heidelberg JF, et al. (2002) Comparative genomic
analysis of Vibrio cholerae: genes that correlate with
cholera endemic and pandemic disease. Proc Natl
Acad Sci U S A 99: 1556–1561.
7. Wolfgang MC, Kulasekara BR, Liang X, Boyd D,
Wu K, et al. (2003) Conservation of genome
content and virulence determinants among clin-
ical and environmental isolates of Pseudomonas
aeruginosa. Proc Natl Acad Sci U S A 100:
8484–8489.
8. Seshadri R, Myers GS, Tettelin H, Eisen JA,
Heidelberg JF, et al. (2004) Comparison of the
genome of the oral pathogen Treponema denticola
with other spirochete genomes. Proc Natl Acad
Sci U S A 101: 5646–5651.
9. Gill SR, Fouts DE, Archer GL, Mongodin EF,
Deboy RT, et al. (2005) Insights on evolution of
virulence and resistance from the complete
genome analysis of an early methicillin-resistant
Staphylococcus aureus strain and a biofilm-producing
methicillin-resistant Staphylococcus epidermidis strain.
J Bacteriol 187: 2426–2438.
10. Fuxelius HH, Darby AC, Cho NH, Andersson SG
(2008) Visualization of pseudogenes in intracellu-
lar bacteria reveals the different tracks to gene
destruction. Genome Biol 9: R42.
11. Berglund EC, Frank AC, Calteau A, Vinnere
Pettersson O, Granberg F, et al. (2009) Run-
off replication of host-adaptability genes is
associated with gene transfer agents in the
genome of mouse-infecting Bartonella grahamii.
PLoS Genet 5: e1000546. doi:10.1371/journal.
pgen.1000546.
PLoS Computational Biology | www.ploscompbiol.org 4 October 2009 | Volume 5 | Issue 10 | e1000481
12. Deitsch KW, Moxon ER, Wellems TE (1997)
Shared themes of antigenic variation and viru-
lence in bacterial, protozoal, and fungal infec-
tions. Microbiol Mol Biol Rev 61: 281–293.
13. Brayton KA, Knowles DP, McGuire TC,
Palmer GH (2001) Efficient use of a small
genome to generate antigenic diversity in tick-
borne ehrlichial pathogens. Proc Natl Acad
Sci U S A 98: 4130–4135.
14. Nystedt B, Frank AC, Thollesson M,
Andersson SG (2008) Diversifying selection and
concerted evolution of a type IV secretion system
in Bartonella. Mol Biol Evol 25: 287–300.
15. Bilek N, Ison CA, Spratt BG (2009) Relative
contributions of recombination and mutation to
the diversification of the opa gene repertoire of
Neisseria gonorrhoeae. J Bacteriol 191: 1878–1890.
16. Gupta PK (2008) Single-molecule DNA sequenc-
ing technologies for future genomics research.
Trends Biotechnol 26: 602–611.
17. The Gene Ontology’s Reference Genome Pro-
ject: A unified framework for functional annota-
tion across species. PLoS Comput Biol 5:
e1000431.
18. Karp PD, Ouzounis CA, Moore-Kochlacs C,
Goldovsky L, Kaipa P, et al. (2005) Expansion of
the BioCyc collection of pathway/genome data-
bases to 160 genomes. Nucleic Acids Res 33:
6083–6089.
19. Maiden MC, Bygraves JA, Feil E, Morelli G,
Russell JE, et al. (1998) Multilocus sequence
typing: A portable approach to the identification
of clones within populations of pathogenic
microorganisms. Proc Natl Acad Sci U S A 95:
3140–3145.
20. Ciccarelli FD, Doerks T, von Mering C,
Creevey CJ, Snel B, et al. (2006) Toward
automatic reconstruction of a highly resolved
tree of life. Science 311: 1283–1287.
21. Turner KM, Feil EJ (2007) The secret life of the
multilocus sequence type. Int J Antimicrob
Agents 29: 129–135.
22. Bambini S, Rappuoli R (2009) The use of
genomics in microbial vaccine development.
Drug Discov Today 14: 252–260.
23. Raes J, Bork P (2008) Molecular eco-systems
biology: Towards an understanding of community
function. Nat Rev Microbiol 6: 693–699.
24. Kunin V, Copeland A, Lapidus A, Mavromatis K,
Hugenholtz P (2008) A bioinformatician’s guide
to metagenomics. Microbiol Mol Biol Rev 72:
557–578.
25. Liolios K, Mavromatis K, Tavernarakis N,
Kyrpides NC (2008) The Genomes On Line
Database (GOLD) in 2007: Status of genomic
and metagenomic projects and their associated
metadata. Nucleic Acids Res 36: D475–479.
26. Dethlefsen L, Huse S, Sogin ML, Relman DA
(2008) The pervasive effects of an antibiotic on
the human gut microbiota, as revealed by deep
16S rRNA sequencing. PLoS Biol 6: e280.
doi:10.1371/journal.pbio.0060280.
27. Turnbaugh PJ, Hamady M, Yatsunenko T,
Cantarel BL, Duncan A, et al. (2009) A core
gut microbiome in obese and lean twins. Nature
457: 480–484.
28. Mahowald MA, Rey FE, Seedorf H,
Turnbaugh PJ, Fulton RS, et al. (2009)
Characterizing a model human gut microbiota
composed of members of its two dominant bacterial
phyla. Proc Natl Acad Sci U S A 106: 5859–5864.
29. Keijser BJ, Zaura E, Huse SM, van der
Vossen JM, Schuren FH, et al. (2008)
Pyrosequencing analysis of the oral microflora of
healthy adults. J Dent Res 87: 1016–1020.
30. Spear GT, Sikaroodi M, Zariffard MR,
Landay AL, French AL, et al. (2008) Comparison
of the diversity of the vaginal microbiota in
HIV-infected and HIV-uninfected women with or
without bacterial vaginosis. J Infect Dis 198:
1131–1140.
31. Raes J, Foerstner KU, Bork P (2007) Get the most
out of your metagenome: Computational analysis
of environmental sequence data. Curr Opin
Microbiol 10: 490–498.
32. Andersson SG, Kurland CG (1998) Reductive
evolution of resident genomes. Trends Microbiol
6: 263–268.
33. Cole ST, Eiglmeier K, Parkhill J, James KD,
Thomson NR, et al. (2001) Massive gene decay in
the leprosy bacillus. Nature 409: 1007–1011.
34. Alsmark CM, Frank AC, Karlberg EO,
Legault BA, Ardell DH, et al. (2004) The
louse-borne human pathogen Bartonella quintana is a
genomic derivative of the zoonotic agent
Bartonella henselae. Proc Natl Acad Sci U S A 101:
9716–9721.
35. Cole ST, Brosch R, Parkhill J, Garnier T,
Churcher C, et al. (1998) Deciphering the biology
of Mycobacterium tuberculosis from the complete
genome sequence. Nature 393: 537–544.
36. Parkhill J, Sebaihia M, Preston A, Murphy LD,
Thomson N, et al. (2003) Comparative analysis of
the genome sequences of Bordetella pertussis,
Bordetella parapertussis and Bordetella bronchiseptica.
Nat Genet 35: 32–40.
37. Yip MJ, Porter JL, Fyfe JA, Lavender CJ,
Portaels F, et al. (2007) Evolution of Mycobacterium
ulcerans and other mycolactone-producing
mycobacteria from a common Mycobacterium marinum
progenitor. J Bacteriol 189: 2021–2029.
38. Rondini S, Kaser M, Stinear T, Tessier M,
Mangold C, et al. (2007) Ongoing genome
reduction in Mycobacterium ulcerans. Emerg Infect
Dis 13: 1008–1015.
39. Stinear TP, Seemann T, Pidot S, Frigui W,
Reysset G, et al. (2007) Reductive evolution and
niche adaptation inferred from the genome of
Mycobacterium ulcerans, the causative agent of Buruli
ulcer. Genome Res 17: 192–200.
40. Huber CA, Ruf MT, Pluschke G, Kaser M (2008)
Independent loss of immunogenic proteins in
Mycobacterium ulcerans suggests immune evasion.
Clin Vaccine Immunol 15: 598–606.
41. Stinear TP, Mve-Obiang A, Small PL, Frigui W,
Pryor MJ, et al. (2004) Giant plasmid-encoded
polyketide synthases produce the macrolide toxin
of Mycobacterium ulcerans. Proc Natl Acad Sci U S A
101: 1345–1349.
42. Pidot SJ, Hong H, Seemann T, Porter JL, Yip MJ,
et al. (2008) Deciphering the genetic basis for
polyketide variation among mycobacteria
producing mycolactones. BMC Genomics 9: 462.
43. Tay CY, Reeves PR, Lan R (2008) Importation of
the major pilin TcpA gene and frequent
recombination drive the divergence of the Vibrio
pathogenicity island in Vibrio cholerae. FEMS
Microbiol Lett 289: 210–218.
Perspective
The Role of Medical Structural Genomics in Discovering New Drugs for Infectious Diseases
Wesley C. Van Voorhis1, Wim G. J. Hol2, Peter J. Myler3,4,5*, Lance J. Stewart6*
1 Department of Medicine, University of Washington, Seattle, Washington, United States of America, 2 Department of Biochemistry, University of Washington, Seattle,
Washington, United States of America, 3 Seattle Biomedical Research Institute, Seattle, Washington, United States of America, 4 Department of Global Health, University of
Washington, Seattle, Washington, United States of America, 5 Department of Medical Education and Biomedical Informatics, University of Washington, Seattle,
Washington, United States of America, 6 deCODE biostructures, Bainbridge Island, Washington, United States of America
Introduction
Whether we think of Alzheimer’s dis-
ease, microbial infection, or any other
modern-day disease, new medicines are
urgently needed. The number of new
drugs registered since the advent of
genomics, however, has not lived up to
expectations. One recent review revealed
that over 70 high-throughput biochemical
screens against genetically validated drug
targets in bacteria failed to yield a single
candidate that could be tested in the clinic
[1]. The reasons for the failure of high-
throughput biochemical screens are not
completely clear, but it could reflect the
limited diversity of chemical libraries used
and/or the absence of structural informa-
tion for many of the targets. Indeed,
structure-based drug design is playing a
growing role in modern drug discovery,
with numerous approved drugs tracing
their origins, at least in part, to the use of
structural information from X-ray crystal-
lography or nuclear magnetic resonance
(NMR) analysis of protein targets and their
ligand-bound complexes. Although it is
beyond the scope of this brief overview to
present a comprehensive list of structures
that have led to useful drugs, Table 1 lists
some examples in which protein structure
information has provided insights to the
design and development of new therapeu-
tic entities. These cases include both novel
drug design based on native and ligand-
bound structures and optimization of
inhibitors based on the binding mode
revealed by the structures of inhibitor–
target complexes. These approaches have
allowed increased affinity for the target
and/or improvement of pharmacological
properties while maintaining target
affinity.
With the increasing availability of
complete human and pathogen genome
sequences and the substantial progress in
structure determination methods, it is no
surprise that the field of “structural
genomics” has emerged recently. Its aim
is to solve as many useful protein struc-
tures as possible from the entire genome of
a single organism or group of related
organisms. Over the past ten years, over
20 structural genomics initiatives have
begun around the world (Table 2). The
impact of these efforts on structural
biology has been substantial, both in the
sheer number of new structures and,
perhaps even more importantly, in the
development of new methodologies, espe-
cially the use of robotics and informatics to
generate and capture data in a systematic
way [2]. Over the next five years,
thousands of new protein structures, many
bound to their ligands, will be elucidated,
laying the groundwork for structure-based
design and development of new
and improved chemotherapeutic agents
against pathogen proteins. Here, we will
focus on the intersection of structural
biology with chemistry and biology, a
field called “medical structural genomics,”
and particularly on how the structures
of medically relevant drug targets in
pathogens can serve as a starting point
for inhibitor design and drug develop-
ment. We argue that the pharmaceutical
industry should be persuaded to comple-
ment the publicly funded structural geno-
mics initiatives by making public the
structural coordinates of their drug targets
for important infectious disease organisms
in a timely fashion and by developing
public–private partnerships to provide the
maximal synergy between target valida-
tion, structure determination, and hit-to-
lead development.
Target Selection
A prerequisite of medical structural
genomics is that the proteins whose
structures are determined must be well-
validated as good drug targets. The term
“drugability” is often used to loosely
describe how tractable any given target is
for the development of a drug candidate.
For infectious organisms, one key factor in
defining drugability is that the target
protein be essential for survival of the
microbe. While essentiality has tradition-
ally been defined using techniques such as
“gene knockout” and RNA interference,
these are not always feasible and should be
complemented by chemical biology ap-
proaches (see below). Furthermore, the
meaningfulness of these experiments can
often be difficult to assess, since the
interplay of host and pathogen is complex
and full of surprises. For example, tre-
mendous effort has been devoted recently
to the development of antagonists for
targets in the fatty acid biosynthesis
pathway of bacteria [3]. Potent drug-like
molecules with high bioavailability have
been developed that can effectively shut
down bacterial replication in vitro. These
compounds were found to be ineffective in
subsequent animal testing, however, because
fatty acids are quite abundant in
vertebrates, so bacteria can secure these
host molecules for their survival and
growth even if their own fatty acid
biosynthesis pathways are blocked [4].
Thus, to improve target selection for
medical structural genomics, it will be
important to collaborate with chemical
biology groups to undertake screening
campaigns to identify compounds that
cause the death of a pathogen under the
appropriate assay conditions [5].
This article is part of the “Genomics of Emerging Infectious Disease” PLoS Journal collection (http://ploscollections.org/emerginginfectiousdisease/).
Citation: Van Voorhis WC, Hol WGJ, Myler PJ, Stewart LJ (2009) The Role of Medical Structural Genomics in Discovering New Drugs for Infectious Diseases. PLoS Comput Biol 5(10): e1000530. doi:10.1371/journal.pcbi.1000530
Editor: Ernest Fraenkel, Massachusetts Institute of Technology, United States of America
Published October 26, 2009
Copyright: © 2009 Van Voorhis et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Funding: This work was supported by the NIAID funding to the Seattle Structural Genomics Center for Infectious Disease (SSGCID) contract HHSN266200700057C, the Medical Structural Genomics of Protozoan Pathogens (MSGPP) contract P01 AI067921 and to WCVV, grant 1R01AI080625. The funders had no role in preparation of the article.
Competing Interests: Co-author Lance Stewart is an employee of deCODE biostructures, which developed the Fragments-of-Life library presented in Figure 1 and discussed in sections titled ‘Fragment-based drug discovery’ and ‘Targeting oligomeric enzymes’. Fragments-of-Life™ is a technology trademarked by deCODE biostructures and chemistry (http://www.decodechembio.com/Capabilities/StructuralBiology/FragmentsofLife.aspx).
* E-mail: [email protected] (PJM); [email protected] (LJS)
PLoS Computational Biology | www.ploscompbiol.org 1 October 2009 | Volume 5 | Issue 10 | e1000530
If the target protein of a drug is known,
medical structural genomics offers a rapid
and efficient way to obtain ligand-bound
structures by using high-throughput X-ray
crystallography and/or NMR. Converse-
ly, when the target of a cell-active
compound is unknown, medical structural
genomics efforts provide purified protein
for many potential drug targets that can be
screened for interaction with the active
compound by a number of biophysical
methods (such as thermal stability [6]).
The Medical Structural Genomics of
Protozoan Pathogens (MSGPP,
http://www.msgpp.org/) initiative has already
begun such an effort by screening thou-
sands of anti-malaria compounds against
67 potential Plasmodium falciparum targets
expressed in bacteria (WC Van Voorhis,
unpublished data). These approaches aim
to generate knowledge about the biological
effect of a small molecule on a target
protein. Follow-up experiments are then
needed to test the activity of this com-
pound in live organisms in order to
validate the target; this valuable “chemical
validation” makes the target much more
likely to be drugable, and thus worthy of
more intensive effort. The future will likely
see more medical structural genomics
centers working with chemical biology
groups that have collections of
“phenotype-defined” compounds (i.e., those with
known anti-pathogen activity). The result
will be synergistic target validation and
hit-to-lead development using structure-
based drug design.
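The thermal stability screening mentioned above can be illustrated with a small ranking step: ligand binding typically raises a protein's melting temperature, so targets whose melting temperature shifts upward in the presence of the compound are candidate binders. The sketch below is only an illustration under stated assumptions; the target names, temperatures, and the 2 °C cutoff are hypothetical and are not taken from the MSGPP screen.

```python
def thermal_shift_hits(tm_apo, tm_with_compound, min_shift=2.0):
    """Return (target, delta_tm) pairs whose melting temperature rises
    by at least min_shift degrees when the compound is present,
    sorted from largest to smallest shift."""
    hits = []
    for target, apo in tm_apo.items():
        if target not in tm_with_compound:
            continue  # no measurement with compound; skip this target
        delta = tm_with_compound[target] - apo
        if delta >= min_shift:
            hits.append((target, round(delta, 2)))
    return sorted(hits, key=lambda pair: pair[1], reverse=True)


# Hypothetical melting temperatures in degrees Celsius (illustrative only).
tm_apo = {"target_A": 48.0, "target_B": 55.5, "target_C": 61.0}
tm_with = {"target_A": 53.5, "target_B": 55.9, "target_C": 63.5}

print(thermal_shift_hits(tm_apo, tm_with))
```

A hit in such a screen only suggests binding; as the text notes, follow-up experiments in live organisms are still needed to validate the target.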
Fragment-Based Drug Discovery
Fragment-based drug discovery has rapid-
ly gained interest within the pharmaceutical
industry (reviewed in [7] with roots of 128-
compound cocktails in [8]), as an alternative
to expensive and sometimes inefficient high-
throughput screening methods for hit identi-
fication and optimization [9]. The general
concept of fragment-based drug discovery
involves screening libraries of “rule-of-three”
compounds [10] against target macromolecules
by using a variety of methods including
X-ray crystallography, NMR, surface plasmon
resonance, differential thermal denaturation,
fluorescence polarization, and other
techniques [7,11–14]. The rule of three
consists of molecular weight <300 daltons,
≤3 rotatable bonds, ≤3 hydrogen bond
donors/acceptors, and ClogP (calculated log
of octanol/water partition coefficient) <3.
These compounds generally include fragments
or “building blocks” of available drugs,
on the assumption that these fragments are
more likely to be “drug-like.” Fragment-based
drug discovery has been used by
commercial and academic groups, including
our own, and has led to a number of leads for
further drug development [15]. At deCODE
biostructures, a partner in the Seattle Struc-
tural Genomics Center for Infectious Disease
(SSGCID, http://www.ssgcid.org/) consor-
tium, the approach to assembling a fragment
library has been somewhat different. The
Fragments of Life (FOL) library (Figure 1) is a
collection of approximately 1,400 structurally
diverse small molecules found in the cellular
environment, metabolites, natural products,
and their derivatives or isosteres (molecules of
similar size containing the same number and
types of atoms). Also included in the FOL
library are a series of biaryl small molecules
(which contain two tethered five- or six-membered
ring structures) that mimic protein
secondary structure elements (e.g., α-helical
turns). Thus, this fragment set is useful for
targeting both the active sites of enzymes and
more complex protein surfaces including
allosteric small molecule binding sites and
protein–protein interfaces [16].
Table 1. Examples of how target protein structure can assist drug discovery and development.
Source | Target Protein | Approach | Reference(s)
HIV | gp41 | Structure led to strategies that target viral entry. | [43–45]
HIV | Protease | Protease–inhibitor complexes allowed lead optimization. | [46–52]
HIV | Reverse transcriptase | Non-nucleoside inhibitor complexes led to drug design that targets pockets outside the enzyme’s active site. | [53–55]
Influenza virus | Neuraminidase | Complex with a transition state analog led to inhalable and orally active neuraminidase inhibitors. | [56–59]
Rhinovirus | Coat protein | Small fatty acid molecules bound in hydrophobic pocket led to new strategies of antiviral drug design. | [60]
Vibrio | Cholera toxin | Five receptor-binding sites provided inspiration for design of novel multivalent inhibitors. | [61]
Bacteria | Peptide deformylase | Protein–inhibitor complexes led to macrocyclic compounds with improved potency, selectivity and metabolic stability. | [62]
Trypanosoma | GAPDH | Novel adenosine analogs showed enhanced selectivity towards the parasite target versus human protein. | [63,64]
Human | Cyclophilin and calcineurin | A ternary complex with cyclosporine A led to insights into its immunosuppressive activity. | [65]
Human | Renin | The ligand-bound structure allowed design and improvement of orally active non-peptide inhibitors to regulate blood pressure. | [66]
Human | Coagulation factor Xa | Structure-based design led to improved pharmacological anticoagulant properties in a primate model. | [67]
Human | Adenosine deaminase | Optimization of a non-nucleoside inhibitor led to an orally active anti-inflammatory compound in a rat model. | [68]
Human | Kinases | Structures of kinases provided a basis to improve and design new therapeutics for various human diseases including cancer. | [69]
doi:10.1371/journal.pcbi.1000530.t001
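The rule-of-three criteria quoted in this section can be written as a simple filter over precomputed molecular descriptors. The sketch below is a minimal illustration, not the procedure used to assemble any particular library: the `Fragment` class, the reading of "donors/acceptors" as two separate limits, and the descriptor values for the two example compounds are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Fragment:
    """Precomputed descriptors for one candidate fragment (hypothetical record)."""
    name: str
    mol_weight: float       # daltons
    rotatable_bonds: int
    h_bond_donors: int
    h_bond_acceptors: int
    clogp: float            # calculated log of octanol/water partition coefficient


def passes_rule_of_three(f):
    """Apply the thresholds as stated in the text: molecular weight < 300 Da,
    at most 3 rotatable bonds, at most 3 H-bond donors and at most 3
    acceptors (assumed to be separate limits), and ClogP < 3."""
    return (
        f.mol_weight < 300
        and f.rotatable_bonds <= 3
        and f.h_bond_donors <= 3
        and f.h_bond_acceptors <= 3
        and f.clogp < 3
    )


# Illustrative descriptor values only; not measured or curated data.
library = [
    Fragment("small_indole_like", 117.2, 0, 1, 1, 2.1),
    Fragment("large_flexible_compound", 452.6, 9, 4, 7, 4.3),
]

selected = [f.name for f in library if passes_rule_of_three(f)]
print(selected)
```

In practice the descriptors would come from a cheminformatics toolkit rather than being typed in by hand; the filter itself stays this simple.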
Targeting Oligomeric Enzymes
Protein–protein interaction and assem-
blies, ranging from simple dimers to
extremely complex arrangements as seen
in the ribosome or the nuclear pore
complex, form the basis of most biological
processes, and there are usually numerous
points of contact between the macromol-
ecules involved. Yet the protein–protein
interfaces formed by oligomerization are
not necessarily accompanied by a large
gain in free energy, and small molecules
have been shown to prevent critical
protein–protein interactions [17]. These
findings have prompted recent discussion
of a structure-based approach aimed at
developing novel small-molecule antibiotics
that modulate protein activity by
binding to an interface between subunits
within multi-protein complexes [18]. The
bacterial enzyme inorganic pyrophosphatase
may serve as an example for this
approach, since it exists in a hexameric
state that requires conformational flexibility
for its essential role in converting
inorganic pyrophosphate into phosphate
[19–21]. Moreover, whereas all bacterial
inorganic pyrophosphatases function as a
homohexamer, the eukaryotic cytosolic
and mitochondrial inorganic pyrophosphatases
function as homodimers [21].
Hence eukaryotic inorganic pyrophosphatases
have different oligomeric interfaces
than those of bacterial enzymes. This
suggests that it may be possible to inhibit
the bacterial inorganic pyrophosphatase
safely by targeting its oligomeric state
rather than its highly conserved active
site. A similar approach has recently been
used to identify species-specific modulators
of porphobilinogen synthase (PBGS)
activity [22]. SSGCID has solved the
high-resolution X-ray crystal structure of
inorganic pyrophosphatase from the pathogenic
bacterium Burkholderia pseudomallei,
and a subsequent FOL screen of this target
identified several fragments that specifically
bind at multiple oligomerization pockets
in a molecular interface between the two
trimers of the homohexamer (Figure 2).
While these fragments remain to be
validated in terms of their species-specific
inhibition of inorganic pyrophosphatase
activity, they represent potential starting
points for the development of novel
antibiotics.
Table 2. Structural genomics projects worldwide submitting to the Protein Data Bank.
Name | URL | Target Focus
Berkeley Structural Genomics Center (BSGC) | http://www.strgen.org/ | Near complete coverage of Mycoplasma genome
Center for Eukaryotic Structural Genomics (CESG) | http://www.uwstructuralgenomics.org/ | PSI Center: Eukaryotic bottlenecks, specifically solubility
Center for Structural Genomics of Infectious Disease (CSGID) | http://csgid.org/csgid/ | Medically relevant infectious disease targets
Center for Structure of Membrane Proteins (CSMP) | http://csmp.ucsf.edu/index.htm | PSI Center: Bacterial and human membrane proteins
Integrated Center for Structure and Function Innovation (ISFI) | http://techcenter.mbi.ucla.edu/ | PSI Center: Protein solubility and crystallization improvement
Israel Structural Proteomics Center | http://www.weizmann.ac.il/ISPC/ | Member of Structural Proteomics in Europe (see below)
Joint Center for Structural Genomics (JCSG) | http://www.jcsg.org/ | PSI Center: High-throughput pipeline development and operation
Marseilles Structural Genomics Program | http://www.afmb.univ-mrs.fr/rubrique93.html | Human health
Medical Structural Genomics of Pathogenic Protozoa (MSGPP) | http://www.msgpp.org/ | Structural and functional genomics of ten species of pathogenic protozoa
Montreal-Kingston Bacterial Structural Genomics Initiative (BSGI) | http://euler.bri.nrc.ca/brimsg/bsgi.html | ORFs from pathogenic and nonpathogenic bacterial strains
Mycobacterium Tuberculosis Structural Genomics Consortium (TBsgc) | http://www.doe-mbi.ucla.edu/TB/ | Mycobacterium tuberculosis: to understand pathogenesis and for structure-based drug design
Mycobacterium Tuberculosis Structural Proteomics Project (X-MTB) | http://webclu.bio.wzw.tum.de/binfo/proj/mtb/ | 35 Mycobacterium tuberculosis targets to identify five for drug development
New York SGX Research Center for Structural Genomics (NYSGXRC) | http://www.nysgrc.org/nysgrc/ | PSI Center: High-throughput pipeline development and operation
Ontario Center for Structural Proteomics (OCSP) | http://www.uhnres.utoronto.ca/centres/proteomics/ | Enzymatic activity characterization
Oxford Protein Production Facility | http://www.oppf.ox.ac.uk/OPPF/ | Human and pathogen targets of biomedical relevance
RIKEN Structural Genomics/Proteomics Initiative | http://www.rsgi.riken.jp/rsgi_e/ | Protein functional networks
Seattle Structural Genomics Center for Infectious Disease (SSGCID) | http://www.ssgcid.org/ | Medically relevant infectious disease targets
Southeast Collaboratory for Structural Genomics | http://www.secsg.org/ | High-throughput eukaryotic genome-scan methods development
Structural Genomics of Pathogenic Protozoa | http://www.sgpp.org/ | PSI Center: Three-dimensional structures of proteins from four major pathogenic protozoa
Structural Proteomics in Europe (SPINE) | http://www.spineurope.org/ | Structures of medically relevant proteins and protein complexes
Structural Proteomics in Europe 2-Complexes (SPINE2 - Complexes) | http://www.spine2.eu/SPINE2/ | Structures of protein complexes from medically relevant signaling pathways
Structural Genomics Consortium | http://www.thesgc.org/ | Medically relevant human and pathogen proteins
Structure 2 Function Project | http://s2f.umbi.umd.edu/ | Poorly characterized and hypothetical protein targets
The Accelerated Technologies Center for Gene to 3D Structure | http://atcg3d.org/default.aspx | PSI Center: Technologies development of X-ray source, synthetic gene design, and microfluidic crystallization
The Midwest Center for Structural Genomics (MCSG) | http://www.mcsg.anl.gov/ | PSI Center: High-throughput methods development and operation
The Northeast Structural Genomics Consortium (NESG) | http://www.nesg.org/ | PSI Center: Protein domains, network families, biomedical relevance
Note: Some centers with fewer than ten released structures in the PDB (www.rcsb.org/pdb/) are not shown. PSI, Protein Structure Initiative.
doi:10.1371/journal.pcbi.1000530.t002
Industry-Generated Structures and the Protein Data Bank
As we have seen above, protein struc-
ture information is the bread and butter of
structure-based drug discovery. Structural
genomics projects (Table 2) have substan-
tially increased the number of protein
structures solved and have made this
information freely and openly available
(i.e., at no cost and without restriction by
copyright or other constraints) by depos-
iting it in the Protein Data Bank (PDB)
[23]. Most publishers have policies that
require authors to deposit structural data
in the PDB at the time of publication, so
structures determined by academic re-
searchers worldwide are, for the most
part, well disseminated. By contrast, the
pharmaceutical industry is sitting on a
mountain of structural data for protein–
ligand complexes from globally important
pathogens, which is not available to the
wider scientific community. The secrecy
engendered by the current economic
incentives driving drug discovery in the
commercial sector has led to a substantial
waste of precious resources through dupli-
cation of effort and inability to learn from
others’ successes and failures. The situa-
tion is unlikely to change without a
concerted effort to find ways to overcome
the financial and intellectual property
barriers that prevent dissemination of this
information. A recent publication suggest-
ed that open access industry–academia
partnerships may provide one possible
model [24]. We propose that the United
States National Institutes of Health, along
with other national and international
research-funding agencies, issue calls for
proposals that will fund the transfer of the
highly valuable structural information
from corporate databases into the PDB.
Such an effort would obviously require
discussion with industrial parties to nego-
tiate mutually acceptable policies and
mechanisms for the deposition of these
structures in the public databases. These
might include relaxation of release stan-
dards for industrial entities, such that
structural information could be safely
deposited in PDB at the time of structure
determination and released only at a later
date more appropriate for protection of
intellectual property.
Figure 1. Conceptual organization of the deCODE biostructures Fragments of Life library. The current ~1,400-compound library contains chemically tractable natural small molecule metabolites (FOL-Nat), metabolite-like compounds and their bioisosteres (FOL-NatD), and biaryl mimetics of protein architecture (FOL-Biaryl). The FOL-Nat members include any natural molecule of molecular weight <350 daltons that exists as a substrate, natural product, or allosteric regulator of any metabolic pathway in any cell type, such as the biosynthetic pathways for the neurotransmitter serotonin (1) and the plant hormone auxin (2). The FOL-Nat members also include secondary metabolites such as bestatin (3), a secondary metabolite of Streptomyces olivoreticuli [38]. FOL-NatD fragments are defined as heteroatom-containing derivatives, isosteres, or analogs of any FOL-Nat molecule. For example, fragments 4–7 contain the indole scaffold, which is known to be a privileged building block for drug molecules [39]. To emulate protein architecture, the FOL-Biaryl fragments were selected from a variety of biaryl compounds that are potential mimics of protein α, β, or γ turns [40–42]. These include a compound (8) whose structure in an energy-minimized state can be seen to mimic the architecture on an α-turn of a protein structure (here, residues Ser65-Ile66-Leu67-Lys68 of PDB ID:1RTP) and, similarly, a compound (9) whose structure mimics the β-turn of a protein structure (residues Ala20-Ala21-Asp22-Ser23).
doi:10.1371/journal.pcbi.1000530.g001
Challenges for the Future
We are currently witnessing an explo-
sion in technological and computational
advances in structural genomics, with
protein structures of hundreds or thou-
sands of medically relevant targets from
infectious disease organisms likely to be
available over the next few years. This new
information provides both academic and
for-profit scientists with an unprecedented
opportunity to accelerate the development
of new and improved chemotherapeutic
agents against these pathogens. One major
challenge will be the adaptation of existing
fragment-based drug design methods to
match the scale of the structural genomics
era. New high-throughput methods need
to be developed for fragment-screening to
enhance the success rate for protein–
ligand structure determination.
Major attention must also be given to the
development of fully automated, very high
throughput crystal growth screening meth-
ods to elucidate the binding of well-
selected compounds to medically relevant
targets. These screens need to cover many
(up to 100) protein variants [25,26],
1,000–10,000 different small molecule
compounds, and approximately 1,000
different crystal growth conditions [27],
resulting in 10^8 to 10^9 conditions to be
tested for a single drug target. Obviously,
this will require development of even
smaller volume assays than those currently
in use [28–31]—down to the low pico-
liters—and automated detection of crystals
in the millions of crystallization chambers
[32–34]. Further development of automat-
ed capillary crystallization methods [35]
might provide another way to achieve the
very high throughput crystal screening
required for reaching the full power of
medical structural genomics in the future.
Cryoprotection of the crystals is a specific
hurdle, although it might be possible to
routinely collect and merge partial datasets
from multiple crystals under non-cryo
conditions. Alternatively, the use of micro-
meshes [36,37] and further miniaturiza-
tion of trays and other crystal screening
tools may allow cryoprotection of many
crystals simultaneously.
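The size of the screening space quoted above follows from multiplying the three ranges together. The short sketch below works through that arithmetic; the function name and the two drop volumes at the end are illustrative assumptions rather than values taken from the text.

```python
# Back-of-the-envelope size of the co-crystallization screening space.
# The three ranges come from the paragraph above; all names are illustrative.

def screening_space(protein_variants, compounds, crystal_conditions):
    """Number of distinct (variant, compound, condition) experiments."""
    return protein_variants * compounds * crystal_conditions

# Up to 100 variants, 1,000 to 10,000 compounds, about 1,000 conditions.
low = screening_space(100, 1_000, 1_000)
high = screening_space(100, 10_000, 1_000)

print(f"lower bound: {low:.0e} experiments")
print(f"upper bound: {high:.0e} experiments")

# Hypothetical drop sizes, chosen only to show why smaller assay volumes
# matter: total sample volume if every experiment used one such drop.
for label, litres_per_drop in [("100 nL", 100e-9), ("10 pL", 10e-12)]:
    total_litres = low * litres_per_drop
    print(f"{label} drops at the lower bound: {total_litres:.3g} L in total")
```

Even at the lower bound, the total volume scales linearly with drop size, which is one way to see why the authors call for assays in the low picoliter range.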
In addition, existing databases will need
to be modified to allow easy dissemination
of the results from these fragment screens,
and a serious effort should be made to
persuade small and big pharma to release
coordinates of drug targets from globally
important infectious disease organisms. It
will also be critical (but challenging) for
structural biologists to collaborate with
medicinal chemists and molecular biolo-
gists to turn these fragments from promising
leads into effective drugs. Together,
these steps should begin to release a flood
of structures that provide a tremendous
resource for improving health in rich and
poor countries alike.
Acknowledgments
The authors wish to thank all the individuals
who have dedicated themselves to the SSGCID
and MSGPP projects. In particular, we thank
Robin Stacy, Bart Staker, Alberto Napuli,
Frank E. Zucker, Erkang Fan, Christophe
Verlinde, Ethan Merritt, and Frederick Buck-
ner, to name but a few.
Figure 2. B. pseudomallei inorganic pyrophosphatase with bound ligand at an oligomeric interface. Homo-hexameric bacterial inorganic pyrophosphatase is a dimer of trimers (blue and green). The illustration shows the hexamer structure in a complex with three ligand fragment molecules (red spheres and stick structures represent fragment FOL 110), each of which is located at one of three “dimer of trimer” interfaces (1.5 ligands per monomer) (PDB ID: 3EJ0). The location of one pyrophosphate substrate (cyan spheres) at the active site of one of the monomers is indicated here based on the superimposed structure of the hexamer with pyrophosphate bound in the active site (PDB ID: 3EIY). The binding sites of the ligands (red) are clearly seen in a pocket formed by the homo-oligomeric assemblage, which is distant from the active site where pyrophosphate (cyan) binds. doi:10.1371/journal.pcbi.1000530.g002
References
1. Payne DJ, Gwynn MN, Holmes DJ, Pompliano DL (2007) Drugs for bad bugs: Confronting the challenges of antibacterial discovery. Nat Rev Drug Discov 6: 29–40.
2. Haquin S, Oeuillet E, Pajon A, Harris M, Jones AT, et al. (2008) Data management in structural genomics: An overview. Methods Mol Biol 426: 49–79.
3. Wright HT, Reynolds KA (2007) Antibacterial targets in fatty acid biosynthesis. Curr Opin Microbiol 10: 447–453.
4. Brinster S, Lamberet G, Staels B, Trieu-Cuot P, Gruss A, et al. (2009) Type II fatty acid synthesis is not a suitable antibiotic target for gram-positive pathogens. Nature 458: 83–86.
5. Hoon S, Smith AM, Wallace IM, Suresh S, Miranda M, et al. (2008) An integrated platform of genomic assays reveals small-molecule bioactivities. Nat Chem Biol 4: 498–506.
6. Ericsson UB, Hallberg BM, Detitta GT, Dekker N, Nordlund P (2006) Thermofluor-based high-throughput stability optimization of proteins for structural studies. Anal Biochem 357: 289–298.
7. Congreve M, Chessari G, Tisi D, Woodhead AJ (2008) Recent developments in fragment-based drug discovery. J Med Chem 51: 3661–3689.
8. Verlinde CLMJ, Kim H, Bernstein BE, Mande SC, Hol WG (1997) Antitrypanosomiasis drug development based on structures of glycolytic enzymes. In: Veerapandian P, ed. Structure-based drug design. New York: Marcel Dekker. pp 365–394.
9. Rees DC, Congreve M, Murray CW, Carr R (2004) Fragment-based lead discovery. Nat Rev Drug Discov 3: 660–672.
10. Congreve M, Carr R, Murray C, Jhoti H (2003) A “rule of three” for fragment-based lead discovery? Drug Discov Today 8: 876–877.
11. Nienaber VL, Greer J (2000) Discovering novel ligands for macromolecules using X-ray crystallographic screening. Nature Biotechnol 18: 1105–1108.
12. Neumann T, Junker HD, Schmidt K, Sekul R (2007) SPR-based fragment screening: Advantages and applications. Curr Top Med Chem 7: 1630–1642.
13. Jhoti H, Cleasby A, Verdonk M, Williams G (2007) Fragment-based screening using X-ray crystallography and NMR spectroscopy. Curr Opin Chem Biol 11: 485–493.
14. Erlanson DA (2006) Fragment-based lead discovery: A chemical update. Curr Opin Biotechnol 17: 643–652.
15. Bosch J, Robien MA, Mehlin C, Boni E, Riechers A, et al. (2006) Using fragment cocktail crystallography to assist inhibitor design of Trypanosoma brucei nucleoside 2-deoxyribosyltransferase. J Med Chem 49: 5939–5946.
16. Davies DR, Mamat B, Magnusson OT, Christensen J, Haraldsson MH, et al. (2009) Discovery of leukotriene A4 hydrolase inhibitors using metabolomics biased fragment crystallography. J Med Chem 52: 4694–4715.
17. Liuzzi M, Deziel R, Moss N, Beaulieu P, Bonneau AM, et al. (1994) A potent peptidomimetic inhibitor of HSV ribonucleotide reductase with antiviral activity in vivo. Nature 372: 695–698.
18. Wells JA, McClendon CL (2007) Reaching for high-hanging fruit in drug discovery at protein-protein interfaces. Nature 450: 1001–1009.
19. Kankare J, Salminen T, Lahti R, Cooperman BS, Baykov AA, et al. (1996) Structure of Escherichia coli inorganic pyrophosphatase at 2.2 Å resolution. Acta Crystallogr D Biol Crystallogr 52: 551–563.
20. Oksanen E, Ahonen AK, Tuominen H, Tuominen V, Lahti R, et al. (2007) A complete structural description of the catalytic cycle of yeast pyrophosphatase. Biochemistry 46: 1228–1239.
21. Sivula T, Salminen A, Parfenyev AN, Pohjanjoki P, Goldman A, et al. (1999) Evolutionary aspects of inorganic pyrophosphatase. FEBS Lett 454: 75–80.
22. Lawrence SH, Ramirez UD, Tang L, Fazliyez F, Kundrat L, et al. (2008) Shape shifting leads to small-molecule allosteric drug discovery. Chem Biol 15: 586–596.
23. Berman H, Henrick K, Nakamura H, Markley JL (2007) The worldwide Protein Data Bank (wwPDB): Ensuring a single, uniform archive of PDB data. Nucleic Acids Res 35: D301–303.
24. Edwards AM, Bountra C, Kerr DJ, Willson TM (2009) Open access chemical and clinical probes to support drug discovery. Nat Chem Biol 5: 436–440.
25. Choi KH, Groarke JM, Young DC, Rossmann MG, Pevear DC, et al. (2004) Design, expression, and purification of a Flaviviridae polymerase using a high-throughput approach to facilitate crystal structure determination. Protein Sci 13: 2685–2692.
26. Graslund S, Sagemark J, Berglund H, Dahlgren LG, Flores A, et al. (2008) The use of systematic N- and C-terminal deletions to promote production and structural studies of recombinant proteins. Protein Expr Purif 58: 210–221.
27. Luft JR, Collins RJ, Fehrman NA, Lauricella AM, Veatch CK, et al. (2003) A deliberate approach to screening for initial crystallization conditions of biological macromolecules. J Struct Biol 142: 170–179.
28. Santarsiero BDYD, Lee CC, Spraggon G, Gu J, Scheibe D, Uber EC, Cornell EW, Nordmeyer RA, Kolbe WF, Jin J, Jones AL, Jaklevic JM, Schultz PG, Stevens RC (2002) An approach to rapid protein crystallization using nanodroplets. J Appl Crystallogr 35: 278–281.
29. Hansen CL, Skordalakes E, Berger JM, Quake SR (2002) A robust and scalable microfluidic metering method that allows protein crystal growth by free interface diffusion. Proc Natl Acad Sci U S A 99: 16531–16536.
30. Zheng B, Roach LS, Ismagilov RF (2003) Screening of protein crystallization conditions on a microfluidic chip using nanoliter-size droplets. J Am Chem Soc 125: 11170–11171.
31. Gerdts CJ, Elliott M, Lovell S, Mixon MB, Napuli AJ, et al. (2008) The plug-based nanovolume Microcapillary Protein Crystallization System (MPCS). Acta Crystallogr D Biol Crystallogr 64: 1116–1122.
32. Wilson J (2002) Towards the automated evaluation of crystallization trials. Acta Crystallogr D Biol Crystallogr 58: 1907–1914.
33. Pan S, Shavit G, Penas-Centeno M, Xu DH, Shapiro L, et al. (2006) Automated classification of protein crystallization images using support vector machines with scale-invariant texture and Gabor features. Acta Crystallogr D Biol Crystallogr 62: 271–279.
34. Liu R, Freund Y, Spraggon G (2008) Image-based crystal detection: A machine-learning approach. Acta Crystallogr D Biol Crystallogr 64: 1187–1195.
35. Fan E, Baker D, Fields S, Gelb MH, Buckner FS, et al. (2008) Structural genomics of pathogenic protozoa: An overview. Methods Mol Biol 426: 497–513.
36. Wagner A, Diez J, Schulze-Briese C, Schluckebier G (2009) Crystal structure of ultralente: A microcrystalline insulin suspension. Proteins 74: 1018–1027.
37. Thorne RESZ, Kmetko J, O’Niell J, Gillilan R (2003) Microfabricated mounts for high-throughput macromolecular cryocrystallography. J Applied Crystallography 36: 1455–1460.
38. Schorlemmer HU, Bosslet K, Dickneite G, Luben G, Sedlacek HH (1984) Studies on the mechanisms of action of the immunomodulator Bestatin in various screening test systems. Behring Inst Mitt: 157–173.
39. Costantino L, Barlocco D (2006) Privileged structures as leads in medicinal chemistry. Curr Med Chem 13: 65–85.
40. Biros SM, Moisan L, Mann E, Carella A, Zhai D, et al. (2007) Heterocyclic alpha-helix mimetics for targeting protein-protein interactions. Bioorg Med Chem Lett 17: 4641–4645.
41. Robinson JA (2008) Beta-hairpin peptidomimetics: design, structures and biological activities. Acc Chem Res 41: 1278–1288.
42. Saraogi I, Hamilton AD (2008) alpha-Helix mimetics as inhibitors of protein-protein interactions. Biochem Soc Trans 36: 1414–1417.
43. Root MJ, Steger HK (2004) HIV-1 gp41 as a target for viral entry inhibition. Curr Pharm Des 10: 1805–1825.
44. Weissenhorn W, Dessen A, Harrison SC, Skehel JJ, Wiley DC (1997) Atomic structure of the ectodomain from HIV-1 gp41. Nature 387: 426–430.
45. Ferrer M, Kapoor TM, Strassmaier T, Weissenhorn W, Skehel JJ, et al. (1999) Selection of gp41-mediated HIV-1 cell entry inhibitors from biased combinatorial libraries of non-natural binding elements. Nat Struct Biol 6: 953–960.
46. Lapatto R, Blundell T, Hemmings A, Overington J, Wilderspin A, et al. (1989) X-ray analysis of HIV-1 proteinase at 2.7 Å resolution confirms structural homology among retroviral enzymes. Nature 342: 299–302.
47. Miller M, Schneider J, Sathyanarayana BK, Toth MV, Marshall GR, et al. (1989) Structure of complex of synthetic HIV-1 protease with a substrate-based inhibitor at 2.3 Å resolution. Science 246: 1149–1152.
48. Navia MA, Fitzgerald PM, McKeever BM, Leu CT, Heimbach JC, et al. (1989) Three-dimensional structure of aspartyl protease from human immunodeficiency virus HIV-1. Nature 337: 615–620.
49. Wlodawer A, Miller M, Jaskolski M, Sathyanarayana BK, Baldwin E, et al. (1989) Conserved folding in retroviral proteases: Crystal structure of a synthetic HIV-1 protease. Science 245: 616–621.
50. Wlodawer A, Vondrasek J (1998) Inhibitors of HIV-1 protease: A major success of structure-assisted drug design. Annu Rev Biophys Biomol Struct 27: 249–284.
51. Abdel-Rahman HM, Al-karamany GS, El-Koussi NA, Youssef AF, Kiso Y (2002) HIV protease inhibitors: Peptidomimetic drugs and future perspectives. Curr Med Chem 9: 1905–1922.
52. Chrusciel RA, Strohbach JW (2004) Non-peptidic HIV protease inhibitors. Curr Top Med Chem 4: 1097–1114.
53. Das K, Lewi PJ, Hughes SH, Arnold E (2005) Crystallography and the design of anti-AIDS drugs: Conformational flexibility and positional adaptability are important in the design of non-nucleoside HIV-1 reverse transcriptase inhibitors. Prog Biophys Mol Biol 88: 209–231.
54. Kohlstaedt LA, Wang J, Friedman JM, Rice PA, Steitz TA (1992) Crystal structure at 3.5 Å resolution of HIV-1 reverse transcriptase complexed with an inhibitor. Science 256: 1783–1790.
55. Smerdon SJ, Jager J, Wang J, Kohlstaedt LA, Chirino AJ, et al. (1994) Structure of the binding site for nonnucleoside inhibitors of the reverse transcriptase of human immunodeficiency virus type 1. Proc Natl Acad Sci U S A 91: 3911–3915.
56. Babu YS, Chand P, Bantia S, Kotian P, Dehghani A, et al. (2000) BCX-1812 (RWJ-270201): Discovery of a novel, highly potent, orally active, and selective influenza neuraminidase inhibitor through structure-based drug design. J Med Chem 43: 3482–3486.
57. Bossart-Whitaker P, Carson M, Babu YS, Smith CD, Laver WG, et al. (1993) Three-dimensional structure of influenza A N9 neuraminidase and its complex with the inhibitor 2-deoxy 2,3-dehydro-N-acetyl neuraminic acid. J Mol Biol 232: 1069–1083.
58. Kim CU, Lew W, Williams MA, Liu H, Zhang L, et al. (1997) Influenza neuraminidase inhibitors possessing a novel hydrophobic interaction in the enzyme active site: Design, synthesis, and structural analysis of carbocyclic sialic acid analogues with potent anti-influenza activity. J Am Chem Soc 119: 681–690.
59. von Itzstein M, Wu WY, Kok GB, Pegg MS, Dyason JC, et al. (1993) Rational design of potent sialidase-based inhibitors of influenza virus replication. Nature 363: 418–423.
60. Hadfield AT, Lee W, Zhao R, Oliveira MA, Minor I, et al. (1997) The refined structure of human rhinovirus 16 at 2.15 Å resolution: Implications for the viral life cycle. Structure 5: 427–441.
61. Merritt EA, Zhang Z, Pickens JC, Ahn M, Hol WG, et al. (2002) Characterization and crystal structure of a high-affinity pentavalent receptor-binding inhibitor for cholera toxin and E. coli heat-labile enterotoxin. J Am Chem Soc 124: 8818–8824.
62. Hu X, Nguyen KT, Jiang VC, Lofland D, Moser HE, et al. (2004) Macrocyclic inhibitors for peptide deformylase: A structure-activity relationship study of the ring size. J Med Chem 47: 4941–4949.
63. Aronov AM, Verlinde CL, Hol WG, Gelb MH (1998) Selective tight binding inhibitors of trypanosomal glyceraldehyde-3-phosphate dehydrogenase via structure-based drug design. J Med Chem 41: 4790–4799.
64. Bressi JC, Choe J, Hough MT, Buckner FS, Van Voorhis WC, et al. (2000) Adenosine analogues as inhibitors of Trypanosoma brucei phosphoglycerate kinase: Elucidation of a novel binding mode for a 2-amino-N(6)-substituted adenosine. J Med Chem 43: 4135–4150.
65. Jin L, Harrison SC (2002) Crystal structure of human calcineurin complexed with cyclosporin A and human cyclophilin. Proc Natl Acad Sci U S A 99: 13522–13526.
66. Rahuel J, Rasetti V, Maibaum J, Rueger H, Goschke R, et al. (2000) Structure-based drug design: The discovery of novel nonpeptide orally active inhibitors of human renin. Chem Biol 7: 493–504.
67. Lam PY, Clark CG, Li R, Pinto DJ, Orwat MJ, et al. (2003) Structure-based design of novel guanidine/benzamidine mimics: Potent and orally bioavailable factor Xa inhibitors as novel anticoagulants. J Med Chem 46: 4405–4418.
68. Terasaka T, Kinoshita T, Kuno M, Seki N, Tanaka K, et al. (2004) Structure-based design, synthesis, and structure-activity relationship studies of novel non-nucleoside adenosine deaminase inhibitors. J Med Chem 47: 3730–3743.
69. Noble ME, Endicott JA, Johnson LN (2004) Protein kinase inhibitors: Insights into drug design from structure. Science 303: 1800–1805.
Review
The Key Role of Genomics in Modern Vaccine and Drug Design for Emerging Infectious Diseases
Kate L. Seib1, Gordon Dougan2, Rino Rappuoli1*
1 Novartis Vaccines and Diagnostics, Siena, Italy, 2 The Wellcome Trust Sanger Institute, The Wellcome Trust Genome Campus, Hinxton, Cambridge, United Kingdom
Abstract: It can be argued that the arrival of the “genomics era” has significantly shifted the paradigm of vaccine and therapeutics development from microbiological to sequence-based approaches. Genome sequences provide a previously unattainable route to investigate the mechanisms that underpin pathogenesis. Genomics, transcriptomics, metabolomics, structural genomics, proteomics, and immunomics are being exploited to perfect the identification of targets, to design new vaccines and drugs, and to predict their effects in patients. Furthermore, human genomics and related studies are providing insights into aspects of host biology that are important in infectious disease. This ever-growing body of genomic data and new genome-based approaches will play a critical role in the future to enable timely development of vaccines and therapeutics to control emerging infectious diseases.
By controlling debilitating and often-lethal infectious diseases,
vaccines and antibiotics have had an enormous impact on world
health. Now, with the arrival of the “genomics era,” a paradigm
shift is occurring in the development of vaccines—and potentially
also in the development of antibiotics—that is providing fresh
impetus to this field. The world is still faced with a huge burden of
infection, however, by classic pathogens (e.g., typhoid, measles),
recently discovered causes of disease (e.g., Helicobacter pylori and
hepatitis C virus [HCV]), and emerging infectious diseases (EIDs,
e.g., H1N1 swine flu and severe acute respiratory syndrome
coronavirus [SARS-CoV]). In addition, variant forms of previ-
ously identified infectious diseases are reemerging (e.g., Streptococcus
pyogenes, also known as group A streptococcus [GAS], and dengue
fever), along with antibiotic-resistant forms of microbes (e.g.,
methicillin-resistant Staphylococcus aureus [MRSA] and Mycobacterium
tuberculosis) [1,2] (for a list of EIDs see http://www3.niaid.nih.gov/
topics/emerging/list.htm). The World Health Organization
(WHO) estimates that we can expect at least one such new
pathogen to appear every year.
The fact that an infectious disease has emerged or reemerged
indicates immune naïvety in the infected population, or altered
virulence potential or an increase in antibiotic/antiviral resistance
in the pathogen population. The rapid development of vaccines
and therapeutics that target these pathogens is therefore essential
to limit their spread. Traditional empirical approaches that screen
for vaccines or drugs a few candidates at a time are time-
consuming and have often proven insufficient to control many
EIDs, particularly when the causative pathogens are antigenically
diverse (e.g., HIV), cannot be cultivated in the laboratory (e.g.,
HCV), lack suitable animal models of infection (e.g., Neisseria spp.),
have complex mechanisms of pathogenesis (e.g., retroviruses),
and/or are controlled by mucosal or T cell–dependent immune
responses rather than humoral immune responses (e.g., Shigella
spp., M. tuberculosis) [3]. For many EIDs, the wealth of information
emerging in the genome era has already had a significant impact
on the way we approach vaccine and therapeutic development.
For EIDs that appear in the near future, genomics will be in the
first line of defense in terms of antigen identification, diagnostic
development, and functional characterization.
Since the completion of the genome sequence of Haemophilus
influenzae—the first finished bacterial genome sequence—in 1995 [4],
advances in sequencing technology and bioinformatics have
produced an exponential growth of genome sequence information.
At least one genome sequence is now available for each major
human pathogen. As of October 2009, over 1,000 bacterial genomes
were “completed” (i.e., closed genomes and whole genome shotgun
sequences) and more than 1,000 were ongoing; over 3,000 viral
genomes were completed (http://www.genomesonline.org/gold.cgi,
http://www.ncbi.nlm.nih.gov/genomes/MICROBES/microbial_
taxtree.html, http://cmr.jcvi.org/tigr-scripts/CMR/shared/
Genomes.cgi). For a bacterial pathogen, which may have more
than 4,000 genes, the genome sequence provides the complete
genetic repertoire of antigens or drug targets from which novel
candidates can be identified. For viral pathogens that may possess
fewer than 10 genes, genomics can be used to define the variability
that may exist between isolates. Host genetic factors also play a role
in infectious disease [5,6], however, and the availability of
“complete” human genome sequences, as well as large-scale human
genome projects (see http://www.1000genomes.org/), are valuable
resources. Hence, the sequences of both pathogen and host genomes
can facilitate identification of a growing number of potential vaccine
and drug targets (Figure 1). It is estimated that 10–100 times more
candidates can be identified in one to two years using genomics-
based approaches than can be identified by conventional methods
in the same time frame. Furthermore, genomics-based vaccine
projects have substantially increased our understanding of microbial
physiology, epidemiology, pathogenesis, and protein functions (see
Box 1).
This article is part of the “Genomics of Emerging Infectious Disease” PLoS Journal collection (http://ploscollections.org/emerginginfectiousdisease/).
Citation: Seib KL, Dougan G, Rappuoli R (2009) The Key Role of Genomics in Modern Vaccine and Drug Design for Emerging Infectious Diseases. PLoS Genet 5(10): e1000612. doi:10.1371/journal.pgen.1000612
Editor: Nicholas J. Schork, University of California San Diego and The Scripps Research Institute, United States of America
Published October 26, 2009
Copyright: © 2009 Seib et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Funding: KLS is the recipient of an Australian NHMRC CJ Martin Fellowship. GD is supported by The Wellcome Trust. KLS and RR are employed by Novartis Vaccines. The funders had no role in the preparation of the article.
Competing Interests: KLS and RR are employed by Novartis Vaccines.
* E-mail: [email protected]
PLoS Genetics | www.plosgenetics.org 1 October 2009 | Volume 5 | Issue 10 | e1000612
Figure 1. Genomics-based approaches used in the control of EIDs from the outbreak of a disease to the development of a vaccine or drug. (A) The causative agent of a disease may first be identified from patient samples by using metagenomics. (B) Vaccine and therapeutic targets can be identified from the pathogen genome using a variety of screening approaches that focus on the genome, transcriptome, proteome, immunome or structural genome. (C) The human genome can be screened to avoid homologies or similarities with pathogen vaccine and therapeutic targets, or to identify new targets. (D) Once candidate vaccine and therapeutic targets have been identified they must be shown to provide protection against disease and to be safe for use in patients. (E) The clinically tested vaccine or therapeutic can then be licensed for use. The clinical responses of a vaccine and/or therapeutic can be analyzed using human genome based studies (dotted arrows). The pathogen genome can also be used to analyze mutants that are able to evade the immune system in vaccinated subjects or organisms that develop antibiotic resistance. Examples of the approaches indicated are given in Table 1. doi:10.1371/journal.pgen.1000612.g001
From the outbreak of a disease, metagenomics (the study of all the
genetic material recovered directly from a sample) can be applied to
diseased human samples to aid the rapid identification of the
causative agent [7,8]. Once the complete genome sequence of the
organism is available, high-throughput approaches can be used to
screen for target molecules, as outlined below and in Table 1 [9,10].
Screening approaches vary depending on the nature of the pathogen
but are based on several accepted principles and key requirements of
vaccines and therapeutics, including the need for targets to be (i)
expressed and accessible to the host immune system, or to a
therapeutic agent, during human disease; (ii) genetically conserved;
(iii) important for survival or pathogenesis; and (iv) free of measurable
homology or similarity to host factors. Although many of the
approaches described here focus on vaccine development, which
involves screening of candidates for immunogenicity, they are largely
applicable to drug development by altering the selection criteria used
and screening candidates against compound libraries [11–13].
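The four selection principles listed above can be read as a sequence of filters applied to every predicted protein. The following is a minimal sketch of that idea, assuming each candidate has already been annotated; the `Candidate` fields, the example records, and both numeric thresholds are hypothetical and are not taken from the article.

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    name: str
    accessible_in_disease: bool  # (i) expressed and accessible during human disease
    conservation: float          # (ii) fraction of isolates carrying the gene, 0 to 1
    needed_for_fitness: bool     # (iii) important for survival or pathogenesis
    max_host_identity: float     # (iv) best sequence identity to any host protein, 0 to 1

def passes_screen(c, min_conservation=0.9, max_host_identity=0.3):
    """Apply criteria (i) to (iv); both thresholds are illustrative assumptions."""
    return (
        c.accessible_in_disease
        and c.conservation >= min_conservation
        and c.needed_for_fitness
        and c.max_host_identity <= max_host_identity
    )

# Made-up annotations for four hypothetical proteins.
candidates = [
    Candidate("antigen_A", True, 0.98, True, 0.10),   # meets all four criteria
    Candidate("antigen_B", False, 0.99, True, 0.05),  # fails (i): not accessible
    Candidate("antigen_C", True, 0.60, True, 0.08),   # fails (ii): poorly conserved
    Candidate("antigen_D", True, 0.95, True, 0.45),   # fails (iv): resembles a host protein
]

shortlist = [c.name for c in candidates if passes_screen(c)]
print(shortlist)
```

For drug rather than vaccine discovery, the same structure would apply with different selection criteria, in line with the remark above that the approaches carry over once the criteria are altered.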
Reverse Vaccinology, Pan-genomics, and Comparative Genomics
The idea behind reverse vaccinology is to screen an entire
pathogen genome to find genes that encode proteins with the
attributes of good vaccine targets, such as, for example, bacterial
surface associated proteins [14]. These proteins can then undergo
normal laboratory evaluation for immunogenicity. The Neisseria
meningitidis serogroup B (MenB) reverse vaccinology project provides
the “proof of concept” for this type of approach. This project
identified more novel vaccine candidates in 18 months than had
been discovered in 40 years of conventional vaccinology [15].
Analysis of the genome sequence of the virulent MenB strain MC58
found 2,158 predicted open reading frames (ORFs); these were
screened using bioinformatics tools to identify 570 ORFs that were
predicted to encode surface-exposed or secreted proteins that might
be accessible to the immune system [15]. Antigen screening
continued on the basis of several criteria: the ability of antigens to be
expressed in Escherichia coli as recombinant proteins (350 candidates);
confirmation by ELISA and flow cytometry that the antigen is
exposed on the cell surface (91 candidates); the ability of induced
antibodies to elicit killing, as measured by serum bactericidal assay
and/or passive protection in infant rat assays (28 candidates); and
screening of a panel of diverse meningococcal isolates to determine
whether the antigens are conserved. This approach resulted in the
development of a multi-component recombinant MenB vaccine
that entered Phase III clinical trials in 2008 [16,17].
Box 1: Reverse Vaccinology Drives the Discovery of New Protein Functions
Reverse vaccinology involves the in silico screening of the entire genome of a pathogen to find genes that encode proteins with the attributes of good vaccine targets, using either the genome of a single pathogenic isolate or the pan-genome (the genomic information from several isolates) of a pathogenic species.
Pili in pathogenic streptococci play a key role in virulence and are promising vaccine candidates
The identification of pili (long filamentous structures that extend from the bacterial surface) in the main pathogenic strains of streptococci is a good example of how genomics can lead to the discovery of protein functions and increased understanding of host–pathogen interactions. The pili of gram-negative bacteria are well-described virulence factors. Little was known, however, about pili in gram-positive bacteria before the sequencing and analysis of the genomes of S. pyogenes, S. agalactiae, and S. pneumoniae (reviewed in [72]).
During analysis of eight S. agalactiae genome sequences, three protective antigens identified by pan-genomic reverse vaccinology [20] were found to contain LPXTG motifs typical of cell wall-anchored proteins and seen to assemble into pili [73]. Further bioinformatics analysis revealed three independent loci that encode structurally distinct pilus types, each of which contains two surface-exposed antigens capable of eliciting protective immunity in mice [75]. Because of the limited variability of S. agalactiae pili, it has been suggested that a combination of only three pilin subunits could lead to broad protective immunity [74].
Following the identification of S. agalactiae pili, typical pilus regions were identified in the available S. pyogenes genomes based on the presence of genes encoding LPXTG-containing proteins. In addition, a combination of recombinant pilus proteins was shown to confer protection in mice against mucosal challenge with virulent S. pyogenes isolates [75]. Falugi and colleagues have since found that S. pyogenes pili are encoded by nine different gene clusters, and they estimate that a vaccine comprising a combination of 12 backbone variants could provide protection against over 90% of circulating S. pyogenes strains [76].
The availability of multiple complete genome sequences for S. pneumoniae, and the increased understanding of pilus proteins in other pathogenic streptococci, led to the discovery of two pilus “islands” that encode proteins that play a role in adherence to lung epithelial cells and colonization in a murine model of infection, where they elicit host inflammatory responses [77,78]. In addition, the pilus subunits confer protection in passive and active immunization models [79]. The presence of pili that contain protective antigens in all three principal streptococcal pathogens indicates that these structures play an important role in virulence.
Reverse vaccinology leads to identification of the fHBP and its role in meningococcal species specificity
Serogroup B N. meningitidis (MenB) strains are responsible for the majority of meningococcal disease in the developed world, yet there is no comprehensive MenB vaccine available. Screening of the MenB genome for vaccine candidates by using reverse vaccinology led to the discovery of the meningococcal factor H-binding protein (fHBP) [15], which was recently suggested to play an important role in the species specificity of N. meningitidis [80]. fHBP is a component of the Novartis multivalent MenB vaccine that entered Phase III clinical testing in 2008 [16,17] and is also under investigation by Wyeth Vaccines (designated LP2086) [81] and other groups [82]. Initially identified as the genome-derived Neisseria antigen 1870 (GNA1870), a Neisseria-specific putative surface lipoprotein of unknown function, fHBP was renamed because of its ability to bind complement factor H (fH), a molecule that down-regulates activation of the complement alternative pathway. Hence, binding of fH to the surface of Neisseria allows the pathogen to evade complement-mediated killing by the innate immune system [83]. fHBP is expressed by all N. meningitidis strains studied [84]. It induces high levels of bactericidal antibodies in mice [16] and is important for survival of bacteria in human serum and blood [83,85,86]. The discovery that binding of fH to N. meningitidis is specific for human fH, and that human fH alone is able to down-regulate complement activation and bactericidal activity leading to increased bacterial survival has significant implications for the study of this organism [80]. The administration of human fH to infant rats challenged with MenB led to a greater than 10-fold increase in survival of bacteria [80], providing an important insight into host–pathogen interactions that may lead to the development of new animal models of infection.
As multiple genome sequences become available for a single
species, the concept of pan-genomic reverse vaccinology is
Table 1. Approaches to identify vaccine and/or drug targets against EIDs in the genomic era.
Columns: Approach | Methods Used | Limitations of Method | Example Organism | Example Disease
Genomics/reverse vaccinology: Analysis of the genetic material of an organism in order to identify the repertoire of protein antigens/drug targets the organism has the potential to express.
  Methods used: Bioinformatics screening of the genome sequence to identify ORFs predicted to be exposed on the surface of the pathogen or secreted, expression of recombinant proteins, generation of antibodies in mice to confirm surface exposure, and bactericidal activity [14].
  Limitations of method: Prediction algorithms need to be validated. Non-protein antigens including polysaccharides or glycolipids, and post-translational modifications cannot be identified. High-throughput cloning and protein expression is required.
  Example organism: Serogroup B N. meningitidis [15,16]
  Example disease: Major cause of septicemia and meningitis in the developed world.
Pan-genomics: Analysis of the genetic material of several organisms of a single species to identify conserved antigens/targets and ensure the chosen target covers the diversity of the organism.
  Methods used: Similar to above, but ORFs are chosen by screening of multiple genomes with either direct sequencing or comparative genome hybridization [18].
  Limitations of method: Sequences of multiple isolates of a species are required. Similar limitations as described above.
  Example organism: S. agalactiae [20]
  Example disease: Leading cause of neonatal bacterial sepsis, pneumonia, and meningitis in the US and Europe.
Comparative genomics: Analysis of the genetic material of several individuals of a single species, to identify antigens/targets that are present in pathogenic strains but absent in commensal strains, and thus important for disease.
  Methods used: Similar to pan-genomics, but ORFs are chosen by screening of genomes from multiple strains of pathogenic and commensal strains of a species [18,21].
  Limitations of method: Similar limitations as for the above two approaches.
  Example organism: E. coli [22]
  Example disease: Major cause of mild to severe diarrhea, hemolytic-uremic syndrome, and urinary tract infections.
Transcriptomics: Analysis of the set of RNA transcripts expressed by an organism under a specified condition.
  Methods used: Gene expression is evaluated in vitro or in vivo using DNA microarrays or cDNA sequencing [24].
  Limitations of method: There is no direct correlation between the levels of mRNA and protein. In vivo studies require relatively large amounts of mRNA.
  Example organism: V. cholerae [26]
  Example disease: Causes diseases ranging from self-limiting to severe, life-threatening diarrhea, wound infections, and sepsis.
Functional genomics: Analysis of the role of genes and proteins in order to identify genes required for survival under specific conditions.
  Methods used: Genes that are functionally essential in specific conditions in vitro or in vivo are determined by gene inhibition followed by screening of mutants in animal models or cell culture to identify attenuated clones [87].
  Limitations of method: Genetic tools, acceptance of transposons, and natural competence of the pathogen are required.
  Example organism: H. pylori [32]
  Example disease: Major cause of duodenal and gastric ulcers and stomach cancer as a result of chronic low-level inflammation of the stomach lining.
Proteomics: Analysis of the set of proteins expressed by an organism under a specified condition and/or in specific cellular locations (e.g., on the cell surface).
  Methods used: 2D-PAGE, MS, and chromatographic techniques to identify proteins from whole cells, fractionated samples, or the cell surface [34].
  Limitations of method: Proteins with low abundance and/or solubility and proteins that are only expressed in vivo may not be identified.
  Example organism: S. pyogenes [36]
  Example disease: Cause of a range of diseases from mild pharyngitis to severe toxic shock syndrome, necrotizing fasciitis, and rheumatic fever.
Immunomics: Analysis of the subset of proteins/epitopes that interact with the host immune system.
  Methods used: Analysis of seroreactive proteins, using 2D-PAGE, phage display libraries, or protein microarrays, probed with host sera [38]. Bioinformatics prediction of B cell and T cell epitopes [37].
  Limitations of method: Potential bias against sequences that cannot be displayed. Large conformational epitopes made up of noncontiguous amino acids may not be detected. Prediction of B cell epitopes is difficult due to the need to identify conformational epitopes.
  Example organism: S. aureus [39]
  Example disease: Cause of wound infections. Has emerged as a significant opportunistic pathogen due to antibiotic resistance.
Structural genomics: Analysis of the three-dimensional structure of an organism’s proteins and how they interact with antibodies or therapeutics.
  Methods used: NMR or crystallography to determine the structure of proteins in the presence/absence of antibodies or therapeutics [51].
  Limitations of method: Poor understanding of determinants of immunogenicity, immunodominance, and structure-function relationships.
  Example organism: HIV [53]
  Example disease: Causative agent of AIDS.
Vaccinomics/immunogenetics/pharmacogenetics: Analysis of how the human immune system responds to a vaccine or drug.
  Methods used: Investigation of genetic heterogeneity/polymorphisms in the host, at the individual or population level, that may alter immune responses to vaccines [68] or metabolism of therapeutics [71].
  Limitations of method: Ethical issues of “personalized” medicine. Immense diversity of the human genome and, in particular, in the human immune response.
  Example organism: Mumps virus [69]
  Example disease: Cause of disease ranging from self-limiting parotid inflammation to epididymo-orchitis, meningitis, and encephalitis.
doi:10.1371/journal.pgen.1000612.t001
emerging as a powerful tool to identify vaccine candidates in
antigenically diverse species [18]. Pan-genomics aims to identify
the full complement of genes in a species, based on the superset of
genes in several strains of the same species. Analysis of the genome
sequences of eight Streptococcus agalactiae (also known as group B
streptococcus) strains revealed substantial genetic heterogeneity
and the extended gene repertoire of the species [19]. Screening
found a total of 589 genes predicted to encode surface-exposed or
secreted proteins in the S. agalactiae pan-genome: 396 from the
“core genome” (genes conserved in all strains) and 193 from the
“dispensable genome” (genes that are present in two or more
strains but not in all, and are hence considered dispensable for survival). Based
on further screening of this pool of candidates, including the ability
of recombinant proteins to provide protection when used to
immunize animals, a combination of four antigens—only one of
which is in the core genome—was selected and shown to confer
protection against a panel of S. agalactiae strains [20].
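The partition of a pan-genome into core and dispensable genes can be illustrated with a few set operations. The Python sketch below is a toy illustration only, not the pipeline used in [19,20]. The function name `partition_pan_genome`, the strain names, and the gene identifiers are hypothetical, and the sketch assumes that orthologous genes have already been clustered under shared identifiers across strains. Genes found in a single strain are reported separately as strain-specific.

```python
from collections import Counter


def partition_pan_genome(strain_genes):
    """Split a pan-genome into core, shared dispensable, and strain-specific genes.

    strain_genes maps a strain name to the set of gene identifiers found in it
    (identifiers are assumed to be comparable across strains).
    """
    n_strains = len(strain_genes)
    # Count in how many strains each gene occurs.
    counts = Counter(g for genes in strain_genes.values() for g in set(genes))
    core = {g for g, c in counts.items() if c == n_strains}
    shared = {g for g, c in counts.items() if 1 < c < n_strains}
    strain_specific = {g for g, c in counts.items() if c == 1 and c < n_strains}
    return core, shared, strain_specific


# Hypothetical toy data: three strains, five gene families.
toy_strains = {
    "strain_A": {"g1", "g2", "g3"},
    "strain_B": {"g1", "g2", "g4"},
    "strain_C": {"g1", "g3", "g4", "g5"},
}
core, shared, strain_specific = partition_pan_genome(toy_strains)
print("core:", sorted(core))
print("shared dispensable:", sorted(shared))
print("strain-specific:", sorted(strain_specific))
```

In a real screen, each of these sets would then be filtered for predicted surface-exposed or secreted proteins before any candidate is tested experimentally.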
Whereas genome sequencing projects have typically focused on
pathogenic organisms, comparison of the genomes of pathogenic and
nonpathogenic strains allows vaccine and drug targets to be identified
on the basis of proteins that are specifically involved in pathogenesis
[21]. Comparative studies of up to 17 commensal and pathogenic E.
coli genomes identified genes unique to certain pathogenic strains that
are largely absent in commensal strains. This filter decreases the pool of
targets to be screened and potentially limits any detrimental effects of
therapeutics on the composition of the commensal flora [22].
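As a rough illustration of this filter, the following Python sketch selects genes that occur in pathogenic strains but are absent, or nearly absent, from commensal strains. It is a simplified stand-in for the comparative analyses cited above, not a reproduction of them. The function name `pathogen_enriched_genes`, the `max_commensal_fraction` parameter, and all strain and gene names are hypothetical.

```python
def pathogen_enriched_genes(pathogenic, commensal, max_commensal_fraction=0.0):
    """Return genes seen in any pathogenic strain and rare in commensal strains.

    pathogenic / commensal map strain names to sets of gene identifiers.
    max_commensal_fraction is the largest fraction of commensal strains that
    may carry a gene for it still to be selected (0.0 = absent from all).
    """
    pathogenic_pool = set().union(*pathogenic.values())
    n_commensal = len(commensal)
    selected = set()
    for gene in pathogenic_pool:
        hits = sum(gene in genes for genes in commensal.values())
        if n_commensal == 0 or hits / n_commensal <= max_commensal_fraction:
            selected.add(gene)
    return selected


# Hypothetical toy data.
pathogenic_strains = {"P1": {"a", "b", "tox"}, "P2": {"a", "tox", "adh"}}
commensal_strains = {"C1": {"a", "b"}, "C2": {"a"}}

strict = pathogen_enriched_genes(pathogenic_strains, commensal_strains)
relaxed = pathogen_enriched_genes(pathogenic_strains, commensal_strains, 0.5)
print(sorted(strict))
print(sorted(relaxed))
```

A non-zero threshold reflects the observation that such genes are largely, rather than completely, absent in commensal strains.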
New sequencing technologies will also open up opportunities for
monitoring pathogen vaccine escape by screening for evidence of
immune selection in the genomes of pathogen populations before
and after vaccine selection. By deep-sequencing of bacterial and
viral populations it will be possible to identify antigens under
immune selection by monitoring the clustering of single nucleotide
polymorphisms (SNPs) and other mutations that affect protein
sequence. This approach has already been used to search for
evidence of antigenic variation/selection in populations of
Salmonella enterica serovar Typhi [23], where variation is extremely
limited. Similar sequencing strategies could be applied to
populations of bacteria taken before or after a vaccine trial in a
particular geographical region.
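One simple way to look for clustering of protein-altering mutations is a sliding-window count along a gene. The Python sketch below is an assumption-laden illustration rather than the method of [23]: the function name `snp_dense_windows`, the window size, the step, and the threshold are all hypothetical choices, and positions are taken to be 1-based coordinates of nonsynonymous changes pooled across the sequenced population.

```python
def snp_dense_windows(positions, gene_length, window=30, step=10, min_snps=3):
    """Flag windows along a gene that contain at least min_snps variant positions.

    positions: 1-based coordinates of nonsynonymous changes.
    Returns a list of (start, end, count) tuples for the flagged windows.
    """
    flagged = []
    last_start = max(gene_length - window + 1, 1)
    for start in range(1, last_start + 1, step):
        end = min(start + window - 1, gene_length)
        count = sum(start <= p <= end for p in positions)
        if count >= min_snps:
            flagged.append((start, end, count))
    return flagged


# Hypothetical example: four changes near the start of a 100-residue protein.
dense = snp_dense_windows([5, 12, 14, 18, 90], gene_length=100)
print(dense)
```

A flagged window is only a starting point. Whether the clustering reflects immune selection would need a proper statistical test against the background mutation rate and follow-up experiments.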
Beyond Genomics: Other -Omics Approaches to Study Pathogens
Pathogen genes that are up-regulated during infection and/or
essential for microorganism survival or pathogenesis can be
identified by using transcriptomics, i.e., the analysis of a near
complete set of RNA transcripts expressed by the pathogen under
a specified condition. Comprehensive DNA-based microarray
chips (probed with cDNA generated from RNA by reverse
transcription) [24] and ultra-high-throughput sequencing technol-
ogies that allow rapid sequencing and direct quantification of
cDNA [25] enable the transcriptome of a pathogen to be
characterized and particular types of gene product to be identified.
For example, genes involved in the hyperinfectious state of Vibrio
cholerae, which appears after passage through the human
gastrointestinal tract, were identified through a comparison of
the transcriptome of bacteria isolated directly from stool samples
of cholera patients with that of V. cholerae grown in vitro [26].
Similarly, analysis of the transcription profile of M. tuberculosis
during early infection in immune-competent (BALB/c) and severe
combined immunodeficient (SCID) mice revealed a set of 67 genes
activated exclusively in response to the host immune system [27].
Functional genomics—linking genotype, through transcrip-
tomics and proteomics, to phenotype—has been applied to many
pathogens to identify genes essential to survival or virulence that
may be valid vaccine candidates. DNA microarrays can be used to
screen comprehensive libraries of pathogen mutants, by compar-
ing bacterial isolates from before and after passage through animal
models or exposure to compound libraries to identify attenuated
clones [28–30]. For example, these methods have been used to
identify 65 novel MenB genes that are required for the pathogen to
cause septicemia in infant rats [31], 47 genes essential for H. pylori
gastric colonization of the gerbil [32], and genes contributing to
M. tuberculosis persistence in the host [33].
Analysis of a pathogen’s proteome (the near complete set of
proteins expressed under a specified condition) to reveal potential
vaccine and drug candidates can add significant value to in silico
approaches [34]. High-throughput proteomic analyses can be
performed by using mass spectrometry (MS), chromatographic
techniques, and protein microarrays [35]. A novel proteome-based
approach has been applied to identify the surface proteins of GAS
by making use of proteolytic enzymes to “shave” the bacterial
surface, releasing exposed proteins and partially exposed peptides.
Seventeen surface proteins of a virulent GAS strain were identified
in this way by using MS and genome sequence analysis. Their
location on the pathogen surface was confirmed by flow
cytometry, and one of them provided protective immunity in a
mouse model of the disease [36].
The proteome of a pathogen can also be screened to identify the
immunome (the near complete set of pathogen proteins or
epitopes that interact with the host immune system) using in vitro
or in silico techniques [37,38]. In vitro identification and screening
of the immunome are based on the idea that antibodies present in
serum from a host, which has been exposed to a pathogen,
represent a molecular “imprint” of the pathogen’s immunogenic
proteins and can be used to identify vaccine candidates. As such,
several techniques have been developed to allow the high-
throughput display of pathogen proteins, and the subsequent
screening for proteins that interact with antibodies in sera.
Immunogenic surface proteins of several organisms have been
identified, including S. aureus using 2D-PAGE, membrane blotting,
and MS [39]; S. agalactiae, S. pyogenes, and Streptococcus pneumoniae
using phage- or E. coli-based comprehensive genomic peptide
expression libraries [38,40]; and Francisella tularensis (the causative
agent of tularemia or rabbit fever) [41] and V. cholerae using protein
microarray chips [42]. Protein microarrays, in which proteins
from the pathogen are spotted onto a microarray chip, can also
be used to characterize protein–drug interactions, as well as
other protein–protein, protein–nucleic acid, ligand–receptor, and
enzyme–substrate interactions [43].
The ability to predict in silico which pathogen epitopes will be
recognized by B cells or T cells has greatly improved in recent
years [44]. Large-scale screening of pathogens including HIV,
Bacillus anthracis, M. tuberculosis, F. tularensis, Yersinia pestis (the
causative agent of bubonic plague), flaviviruses, and influenza for
B cell and T cell epitopes is currently underway [45,46]. Although
epitope prediction is not foolproof, it can serve as a guide for
further biological evaluation. T cell epitopes are presented by
MHC/HLA proteins on the surface of antigen-presenting cells,
which vary considerably between hosts, complicating the task of
functional epitope prediction. Additionally, B cell epitopes can be
both linear and conformational. The ultimate aim of researchers
in this field of study would be to engineer a single peptide that
represents defined epitope combinations from a protein or
organism, enabling the genetic variability of both pathogen and
host to be overcome [44].
Structural genomics—the study of the three-dimensional
structures of the proteins produced by a species—is increasingly
being applied to vaccine and drug development as a result of the
explosion of genome and proteome data, and continuing
improvements in the fields of protein expression, purification,
and structural determination [47]. The structure-based design of
antiviral therapeutics has led to the development of drugs directed
at the active sites of the HIV-1 protease [48] and influenza
neuraminidase [49]. More than 45,000 high-resolution protein
structures are available in public databases (see http://www.
wwpdb.org/stats.html), and several initiatives have been estab-
lished to pursue high-throughput characterization of protein
structures on a genome-wide scale [50], focusing on determining
and understanding the structural basis of immune-dominant and
immune-recessive antigens as well as protein active sites and
potential drug-binding sites [51,52]. For example, structural
characterization of the HIV envelope proteins gp120 and gp41
has revealed mechanisms used by the virus to evade host antibody
responses, many of which involve hypervariability in immunodo-
minant epitopes [53,54]. Based on this information, immune
refocusing (e.g., by retargeted glycosylation, deletion, and/or
substitution of amino acids) has been used to dampen the response
to variable immunodominant epitopes of the envelope glycopro-
tein gp160, enabling the host to respond to previously subdom-
inant epitopes [55]. High-throughput modification of proteins and
their screening for immunogenicity and interaction with antimi-
crobials is predicted to become more common as techniques
evolve [51].
The Contribution of Human Genomics
When designing new vaccines, one important consideration is
the risk that the vaccine might generate “self” immune reactions
against host epitopes; immune responses against a pathogen
antigen can cross-react with host antigens if homologies exist in the
primary amino acid sequence or structure, potentially leading to
damage to the host tissue [56]. Drugs aimed at pathogen targets
could also theoretically target similar host molecules. The
availability of the human genome sequence combined with
methods for predicting B cell and T cell epitopes will facilitate
screening for the presence of homologies between candidate
microbial vaccine antigens and proteins in humans, enabling issues
of autoimmunity and cross-reactivity to be tackled [57]. As such,
vaccine or drug targets identified using methods based on
pathogen genomics should be screened for homology or similarity
to human proteins in silico, using programs such as BLAST (Basic
Local Alignment Search Tool; http://blast.ncbi.nlm.nih.gov/
Blast.cgi) to query human genome databases. Interestingly,
analysis of 30 viral genomes revealed that around 90% of viral
pentapeptides, which could be components of epitopes, are
identical to human peptides [58]. There is little homology,
however, between validated immunogenic disease-associated
peptides/epitopes and host peptides [57,59], suggesting that
screening approaches that include prediction of immunogenicity
could improve the pool of target candidates.
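The pentapeptide comparison mentioned above can be sketched as an exact k-mer overlap check. The Python example below is a much cruder screen than a BLAST search and is offered only as an illustration. The function names `kmers` and `shared_kmer_fraction` and the short sequences are hypothetical; a real screen would run against the full human proteome.

```python
def kmers(sequence, k=5):
    """Return the set of all overlapping peptides of length k in a sequence."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def shared_kmer_fraction(candidate, host_proteins, k=5):
    """Fraction of a candidate antigen's k-mers that also occur in host proteins.

    Returns (fraction, shared_kmers). Exact matches only; no substitutions
    or gaps are tolerated, unlike an alignment-based search.
    """
    candidate_kmers = kmers(candidate, k)
    if not candidate_kmers:
        return 0.0, set()
    host_kmers = set()
    for protein in host_proteins:
        host_kmers |= kmers(protein, k)
    shared = candidate_kmers & host_kmers
    return len(shared) / len(candidate_kmers), shared


# Hypothetical sequences, chosen only to show the mechanics.
fraction, shared = shared_kmer_fraction("MKTAYIAKQR", ["GGTAYIAKGG"])
print(round(fraction, 3), sorted(shared))
```

As the text notes, a high level of pentapeptide sharing does not by itself predict cross-reactivity, so a check of this kind is best combined with a prediction of immunogenicity.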
It is important to keep in mind that we do not fully understand
how self-tolerance is broken, so we currently have no perfect way
of predicting all potential autoimmune triggers that could be
associated with vaccination. While many links have been made
between autoimmune disease and vaccination, they have been
confirmed in only a small number of cases (reviewed in [60]). For
example, treatment-resistant Lyme arthritis is associated in certain
patients with immune reactivity to the outer surface protein A
(OspA) of the causative agent of Lyme disease, Borrelia burgdorferi,
and an OspA epitope (OspA165–173) has homology to the human
lymphocyte function-associated antigen (hLFA)-1αL [61]. As a
result, the OspA-based Lyme disease vaccine (LYMErix) was taken
off the market in 2002, but a recombinant OspA lacking the
potentially autoreactive T cell epitope has been proposed as a
replacement vaccine [62].
Rather than targeting drugs to pathogen enzymes, an
alternative approach has focused on targeting the host-cell proteins
that are exploited by pathogens for replication and survival. The
use of techniques including microarray-based analysis of virus-
induced host gene expression has revealed several possible targets
[63,64]. The cholesterol-lowering drugs statins, for example, have
an anti-HIV effect that is believed to be mediated by preventing
activation of the host protein Rho, which is activated by the HIV
envelope protein and required for virus entry to the cell [65].
Furthermore, such studies can improve our understanding of the
host immune responses that protect against a pathogen (i.e.,
innate, antibody, Th1, or Th2 responses), which will aid the
selection of appropriate vaccine adjuvants. For example, induction
of interferon signaling early in infection may be critical to confer
protection against SARS-CoV, as determined from functional
genomic studies of early host responses to SARS-CoV infection in
the lungs of macaques [66].
Many of the genes of the human immune system are highly
polymorphic, which enables the population as a whole to generate
sufficient immunological diversity to combat EIDs. This variation
also affects the outcome of vaccination and treatment. The
International HapMap Project has identified over 3.1 million
SNPs in 270 individuals [67] and the 1000 Genomes Project aims
to identify even more genetic variants. The field of vaccinomics
(also called immunogenetics) investigates heterogeneity in host
genetic markers that results in variations in vaccine-induced
immune responses, with the aim of predicting and minimizing
vaccine failures or adverse events [68]. For example, polymor-
phisms of HLA and immunoregulatory cytokine receptor genes
are associated with variable outcomes of vaccination against
mumps [69]. Similarly, pharmacogenetics, which investigates
genetic differences in the way individuals metabolize therapeutics,
has found that human variability in the speed of metabolism of the
common first-line tuberculosis drug isoniazid is associated with
genetic variants, including SNPs, in the gene encoding arylamine
N-acetyltransferase (NAT2) [70,71]. The ability to predict an
individual’s response to a vaccine or drug may eventually allow
physicians to determine whether a patient is genetically susceptible
to a disease, the possible adverse effects of a vaccine or drug, and
the appropriate schedule or dose to use.
Challenges for the Future
We predict that genomics will greatly aid the control of EIDs
because of the increased efficiency with which vaccine and
therapeutic targets can be identified using the genome-based
approaches described above. Furthermore, we anticipate the
continual refinement and development of novel genome-based
approaches as sequencing becomes faster and more affordable.
Several challenges remain, however, in the identification of these
targets and in the processes needed to bring a new vaccine or drug
to the market. Understanding the molecular nature of epitopes,
the mechanisms of action of adjuvants, and T cell and mucosal
immunity are key priorities to be tackled in the coming years [3].
These issues can be addressed by improved structural studies of
antigen epitopes and the compilation of databases containing
information on structure, immunogenicity, and in silico B cell and
T cell epitope predictions. Genome-based development of effective
vaccines and therapeutics is still largely dependent on the
availability of valid models to measure efficacy and protection
against disease; however, the increased understanding of microbial
pathogenesis that is emerging from genomics should greatly aid in
this respect. Likewise, the continued development of animal
models with knockout and allele-specific mutations in key
components of the immune response will greatly increase
understanding of the type of immune response needed to control
disease and the ways in which the immune system can be
programmed to protect the host against disease. Unfortunately,
the stepwise series of prelicensure clinical trials (Phase I, II, and III)
that are required to document the safety, immunogenicity, and
efficacy of a vaccine are still highly time-consuming and costly. We
can only hope that the increasingly “smart” identification and
design of targets, and the fresh impetus given to the fields of
vaccine and drug development by the arrival of genomics, will
enable increased success of those vaccines and drugs that do make
it into clinical development.
References
1. Dong J, Olano JP, McBride JW, Walker DH (2008) Emerging pathogens: Challenges and successes of molecular diagnostics. J Mol Diagn 10: 185–197.
2. Yang X, Yang H, Zhou G, Zhao GP (2008) Infectious disease in the genomic era. Annu Rev Genomics Hum Genet 9: 21–48.
3. Rappuoli R (2007) Bridging the knowledge gaps in vaccine design. Nat
Biotechnol 25: 1361–1366.
4. Fleischmann RD, Adams MD, White O, Clayton RA, Kirkness EF, et al. (1995)
Whole-genome random sequencing and assembly of Haemophilus influenzae Rd. Science 269: 496–512.
5. Casanova JL, Abel L (2007) Human genetics of infectious diseases: A unified theory. EMBO J 26: 915–922.
6. Burgner D, Jamieson SE, Blackwell JM (2006) Genetic susceptibility to infectious
diseases: Big is beautiful, but will bigger be even better? Lancet Infect Dis 6: 653–663.
7. Nakamura S, Yang CS, Sakon N, Ueda M, Tougan T, et al. (2009) Direct metagenomic detection of viral pathogens in nasal and fecal specimens using an
unbiased high-throughput sequencing approach. PLoS ONE 4: e4219.
doi:10.1371/journal.pone.0004219.
8. Bittar F, Richet H, Dubus JC, Reynaud-Gaubert M, Stremler N, et al. (2008)
Molecular detection of multiple emerging pathogens in sputa from cystic fibrosis patients. PLoS ONE 3: e2908. doi:10.1371/journal.pone.0002908.
9. Rinaudo CD, Telford JL, Rappuoli R, Seib KL (2009) Vaccinology in the genome era. J Clin Invest 119: 2515–2525.
10. Kaushik DK, Sehgal D (2008) Developing antibacterial vaccines in genomics
and proteomics era. Scand J Immunol 67: 544–552.
11. Pucci MJ (2007) Novel genetic techniques and approaches in the microbial
genomics era: identification and/or validation of targets for the discovery of new antibacterial agents. Drugs R D 8: 201–212.
12. Mills SD (2006) When will the genomics investment pay off for antibacterial discovery? Biochem Pharmacol 71: 1096–1102.
13. Van Voorhis WC, Hol WGJ, Myler PJ, Stewart LJ (2009) The role of medical
structural genomics in discovering new drugs for infectious diseases. PLoS Comput Biol 5(10): e1000530. doi:10.1371/journal.pcbi.1000530.
14. Masignani V, Rappuoli R, Pizza M (2002) Reverse vaccinology: A genome-based approach for vaccine development. Expert Opin Biol Ther 2: 895–905.
15. Pizza M, Scarlato V, Masignani V, Giuliani MM, Arico B, et al. (2000) Identification of vaccine candidates against serogroup B meningococcus by
whole-genome sequencing. Science 287: 1816–1820.
16. Giuliani MM, Adu-Bobie J, Comanducci M, Arico B, Savino S, et al. (2006) A universal vaccine for serogroup B meningococcus. Proc Natl Acad Sci U S A
103: 10834–10839.
17. Rappuoli R (2008) The application of reverse vaccinology, Novartis MenB
vaccine developed by design. 16th International Pathogenic Neisseria Conference,
Rotterdam, The Netherlands: http://www.IPNC2008.org. Abstr. 81 p.
18. Muzzi A, Masignani V, Rappuoli R (2007) The pan-genome: Towards a
knowledge-based discovery of novel targets for vaccines and antibacterials. Drug Discov Today 12: 429–439.
19. Tettelin H, Masignani V, Cieslewicz MJ, Donati C, Medini D, et al. (2005) Genome analysis of multiple pathogenic isolates of Streptococcus agalactiae: implications for the
microbial “pan-genome.” Proc Natl Acad Sci U S A 102: 13950–13955.
20. Maione D, Margarit I, Rinaudo CD, Masignani V, Mora M, et al. (2005) Identification of a universal Group B streptococcus vaccine by multiple genome
screen. Science 309: 148–150.
21. Bhagwat AA, Bhagwat M (2008) Methods and tools for comparative genomics of
foodborne pathogens. Foodborne Pathog Dis 5: 487–497.
22. Rasko DA, Rosovitz MJ, Myers GS, Mongodin EF, Fricke WF, et al. (2008) The
pangenome structure of Escherichia coli: Comparative genomic analysis of E. coli
commensal and pathogenic isolates. J Bacteriol 190: 6881–6893.
23. Holt KE, Parkhill J, Mazzoni CJ, Roumagnac P, Weill FX, et al. (2008)
High-throughput sequencing provides insights into genome variation and evolution in Salmonella typhi. Nat Genet 40: 987–993.
24. Dhiman N, Bonilla R, O’Kane DJ, Poland GA (2001) Gene expression microarrays: A 21st century tool for directed vaccine design. Vaccine 20: 22–30.
25. Morozova O, Marra MA (2008) Applications of next-generation sequencing
technologies in functional genomics. Genomics 92: 255–264.
26. Merrell DS, Butler SM, Qadri F, Dolganov NA, Alam A, et al. (2002) Host-
induced epidemic spread of the cholera bacterium. Nature 417: 642–645.
27. Talaat AM, Lyons R, Howard ST, Johnston SA (2004) The temporal expression
profile of Mycobacterium tuberculosis infection in mice. Proc Natl Acad Sci U S A
101: 4602–4607.
28. Scarselli M, Giuliani MM, Adu-Bobie J, Pizza M, Rappuoli R (2005) The
impact of genomics on vaccine design. Trends Biotechnol 23: 84–91.
29. Saenz HL, Dehio C (2005) Signature-tagged mutagenesis: technical advances in
a negative selection method for virulence gene identification. Curr Opin
Microbiol 8: 612–619.
30. Sakata T, Winzeler EA (2007) Genomics, systems biology and drug development
for infectious diseases. Mol Biosyst 3: 841–848.
31. Sun YH, Bakshi S, Chalmers R, Tang CM (2000) Functional genomics of
Neisseria meningitidis pathogenesis. Nat Med 6: 1269–1273.
32. Kavermann H, Burns BP, Angermuller K, Odenbreit S, Fischer W, et al. (2003)
Identification and characterization of Helicobacter pylori genes essential for gastric
colonization. J Exp Med 197: 813–822.
33. Sassetti CM, Boyd DH, Rubin EJ (2003) Genes required for mycobacterial
growth defined by high density mutagenesis. Mol Microbiol 48: 77–84.
34. Zhu H, Bilgin M, Snyder M (2003) Proteomics. Annu Rev Biochem 72:
783–812.
35. Grandi G (2006) Genomics and proteomics in reverse vaccines. Methods
Biochem Anal 49: 379–393.
36. Rodriguez-Ortega MJ, Norais N, Bensi G, Liberatori S, Capo S, et al. (2006)
Characterization and identification of vaccine candidate proteins through
analysis of the group A Streptococcus surface proteome. Nat Biotechnol 24:
191–197.
37. De Groot AS, McMurry J, Moise L (2008) Prediction of immunogenicity: in
silico paradigms, ex vivo and in vivo correlates. Curr Opin Pharmacol 8:
620–626.
38. Meinke A, Henics T, Hanner M, Minh DB, Nagy E (2005) Antigenome
technology: A novel approach for the selection of bacterial vaccine candidate
antigens. Vaccine 23: 2035–2041.
39. Vytvytska O, Nagy E, Bluggel M, Meyer HE, Kurzbauer R, et al. (2002)
Identification of vaccine candidate antigens of Staphylococcus aureus by serological
proteome analysis. Proteomics 2: 580–590.
40. Giefing C, Meinke AL, Hanner M, Henics T, Bui MD, et al. (2008) Discovery of
a novel class of highly conserved vaccine antigens using genomic scale antigenic
fingerprinting of pneumococcus with human antibodies. J Exp Med 205:
117–131.
41. Eyles JE, Unal B, Hartley MG, Newstead SL, Flick-Smith H, et al. (2007)
Immunodominant Francisella tularensis antigens identified using proteome
microarray. Proteomics 7: 2172–2183.
42. Rolfs A, Montor WR, Yoon SS, Hu Y, Bhullar B, et al. (2008) Production and
sequence validation of a complete full length ORF collection for the pathogenic
bacterium Vibrio cholerae. Proc Natl Acad Sci U S A 105: 4364–4369.
43. Stoevesandt O, Taussig MJ, He M (2009) Protein microarrays: high-throughput
tools for proteomics. Expert Rev Proteomics 6: 145–157.
44. De Groot AS, Moise L, McMurry JA, Martin W (2008) Epitope-based immunome-derived
vaccines: a strategy for improved design and safety. In: Falus A, ed. Clinical
Applications of Immunomics. New York: Springer. pp 39–69.
45. Sette A, Fleri W, Peters B, Sathiamurthy M, Bui HH, et al. (2005) A roadmap
for the immunomics of category A-C pathogens. Immunity 22: 155–161.
46. De Groot AS, Rivera DS, McMurry JA, Buus S, Martin W (2008) Identification
of immunogenic HLA-B7 “Achilles’ heel” epitopes within highly conserved
regions of HIV. Vaccine 26: 3059–3071.
47. Lundstrom K (2007) Structural genomics and drug discovery. J Cell Mol Med
11: 224–238.
48. Kaldor SW, Kalish VJ, Davies JF, 2nd, Shetty BV, Fritz JE, et al. (1997)
Viracept (nelfinavir mesylate, AG1343): A potent, orally bioavailable inhibitor of
HIV-1 protease. J Med Chem 40: 3979–3985.
49. Kim CU, Lew W, Williams MA, Liu H, Zhang L, et al. (1997) Influenza
neuraminidase inhibitors possessing a novel hydrophobic interaction in the
enzyme active site: Design, synthesis, and structural analysis of carbocyclic sialic
acid analogues with potent anti-influenza activity. J Am Chem Soc 119: 681–690.
50. Todd AE, Marsden RL, Thornton JM, Orengo CA (2005) Progress of structural
genomics initiatives: An analysis of solved target structures. J Mol Biol 348:
1235–1260.
51. Dormitzer PR, Ulmer JB, Rappuoli R (2008) Structure-based antigen design: A
strategy for next generation vaccines. Trends Biotechnol 26: 659–667.
52. Nicola G, Abagyan R (2009) Structure-based approaches to antibiotic drug
discovery. Curr Protoc Microbiol Chapter 17: Unit 17.2.
53. Zhou T, Xu L, Dey B, Hessell AJ, Van Ryk D, et al. (2007) Structural definition
of a conserved neutralization epitope on HIV-1 gp120. Nature 445: 732–737.
54. Prabakaran P, Dimitrov AS, Fouts TR, Dimitrov DS, KuanTeh J (2007)
Structure and function of the HIV envelope glycoprotein as entry mediator,
vaccine immunogen, and target for inhibitors. In: Advances in Pharmacology.
Academic Press. pp 33–97.
55. Tobin GJ, Trujillo JD, Bushnell RV, Lin G, Chaudhuri AR, et al. (2008)
Deceptive imprinting and immune refocusing in vaccine design. Vaccine 26:
6189–6199.
56. Ercolini AM, Miller SD (2009) The role of infections in autoimmune disease.
Clin Exp Immunol 155: 1–15.
57. Amela I, Cedano J, Querol E (2007) Pathogen proteins eliciting antibodies do
not share epitopes with host proteins: A bioinformatics approach. PLoS ONE 2:
e512. doi:10.1371/journal.pone.0000512.
58. Kanduc D, Stufano A, Lucchese G, Kusalik A (2008) Massive peptide sharing
between viral and human proteomes. Peptides 29: 1755–1766.
59. Kanduc D, Lucchese A, Mittelman A (2007) Non-redundant peptidomes from
DAPs: Towards ‘‘the vaccine’’? Autoimmun Rev 6: 290–294.
60. Wraith DC, Goldman M, Lambert PH (2003) Vaccination and autoimmune
disease: What is the evidence? Lancet 362: 1659–1666.
61. Gross DM, Forsthuber T, Tary-Lehmann M, Etling C, Ito K, et al. (1998)
Identification of LFA-1 as a candidate autoantigen in treatment-resistant Lyme
arthritis. Science 281: 703–706.
62. Willett TA, Meyer AL, Brown EL, Huber BT (2004) An effective second-
generation outer surface protein A-derived Lyme vaccine that eliminates a
potentially autoreactive T cell epitope. Proc Natl Acad Sci U S A 101: 1303–1308.
63. Kellam P (2006) Attacking pathogens through their hosts. Genome Biol 7: 201.
64. Andeweg AC, Haagmans BL, Osterhaus AD (2008) Virogenomics: the virus-
host interaction revisited. Curr Opin Microbiol 11: 461–466.
65. del Real G, Jimenez-Baranda S, Mira E, Lacalle RA, Lucas P, et al. (2004) Statins
inhibit HIV-1 infection by down-regulating Rho activity. J Exp Med 200: 541–547.
66. de Lang A, Baas T, Teal T, Leijten LM, Rain B, et al. (2007) Functional
genomics highlights differential induction of antiviral pathways in the lungs of
SARS-CoV-infected macaques. PLoS Pathog 3: e112. doi:10.1371/journal.ppat.0030112.
67. International HapMap Consortium (2007) A second generation human
haplotype map of over 3.1 million SNPs. Nature 449: 851–861.
68. Poland GA, Ovsyannikova IG, Jacobson RM (2009) Application of
pharmacogenomics to vaccines. Pharmacogenomics 10: 837–852.
69. Ovsyannikova IG, Jacobson RM, Dhiman N, Vierkant RA, Pankratz VS, et al.
(2008) Human leukocyte antigen and cytokine receptor gene polymorphisms
associated with heterogeneous immune responses to mumps viral vaccine.
Pediatrics 121: e1091–1099.
70. Sim E, Lack N, Wang CJ, Long H, Westwood I, et al. (2008) Arylamine
N-acetyltransferases: Structural and functional implications of polymorphisms.
Toxicology 254: 170–183.
71. Baudhuin LM, Langman LJ, O’Kane DJ (2007) Translation of pharmacogenetics
into clinically relevant testing modalities. Clin Pharmacol Ther 82: 373–376.
72. Telford JL, Barocchi MA, Margarit I, Rappuoli R, Grandi G (2006) Pili in
gram-positive pathogens. Nat Rev Microbiol 4: 509–519.
73. Lauer P, Rinaudo CD, Soriani M, Margarit I, Maione D, et al. (2005) Genome
analysis reveals pili in Group B Streptococcus. Science 309: 105.
74. Margarit I, Rinaudo CD, Galeotti CL, Maione D, Ghezzo C, et al. (2009)
Preventing bacterial infections with pilus-based vaccines: The group B
streptococcus paradigm. J Infect Dis 199: 108–115.
75. Mora M, Bensi G, Capo S, Falugi F, Zingaretti C, et al. (2005) Group A
Streptococcus produce pilus-like structures containing protective antigens and
Lancefield T antigens. Proc Natl Acad Sci U S A 102: 15641–15646.
76. Falugi F, Zingaretti C, Pinto V, Mariani M, Amodeo L, et al. (2008) Sequence
variation in Group A Streptococcus pili and association of pilus backbone types
with Lancefield T serotypes. J Infect Dis 198: 1834–1841.
77. Barocchi MA, Ries J, Zogaj X, Hemsley C, Albiger B, et al. (2006) A
pneumococcal pilus influences virulence and host inflammatory responses. Proc
Natl Acad Sci U S A 103: 2857–2862.
78. Bagnoli F, Moschioni M, Donati C, Dimitrovska V, Ferlenghi I, et al. (2008) A
second pilus type in Streptococcus pneumoniae is prevalent in emerging serotypes and
mediates adhesion to host cells. J Bacteriol 190: 5480–5492.
79. Gianfaldoni C, Censini S, Hilleringmann M, Moschioni M, Facciotti C, et al.
(2007) Streptococcus pneumoniae pilus subunits protect mice against lethal challenge.
Infect Immun 75: 1059–1062.
80. Granoff DM, Welsch JA, Ram S (2009) Binding of complement factor H (fH) to
Neisseria meningitidis is specific for human fH and inhibits complement activation
by rat and rabbit sera. Infect Immun 77: 764–769.
81. McNeil LK, Murphy E, Zhao XJ, Guttmann S, Harris S, et al. (2009) Detection
of LP2086 on the cell surface of Neisseria meningitidis and its accessibility in the
presence of serogroup B capsular polysaccharide. Vaccine 27: 3417–3421.
82. Koeberling O, Seubert A, Granoff DM (2008) Bactericidal antibody responses
elicited by a meningococcal outer membrane vesicle vaccine with overexpressed
factor H-binding protein and genetically attenuated endotoxin. J Infect Dis 198:
262–270.
83. Madico G, Welsch JA, Lewis LA, McNaughton A, Perlman DH, et al. (2006)
The meningococcal vaccine candidate GNA1870 binds the complement
regulatory protein factor H and enhances serum resistance. J Immunol 177:
501–510.
84. Masignani V, Comanducci M, Giuliani MM, Bambini S, Adu-Bobie J, et al.
(2003) Vaccination against Neisseria meningitidis using three variants of the
lipoprotein GNA1870. J Exp Med 197: 789–799.
85. Welsch JA, Ram S, Koeberling O, Granoff DM (2008) Complement-dependent
synergistic bactericidal activity of antibodies against factor H-binding protein, a
sparsely distributed meningococcal vaccine antigen. J Infect Dis 197:
1053–1061.
86. Seib KL, Serruto D, Oriente F, Delany I, Adu-Bobie J, et al. (2009) Factor H-
binding protein is important for meningococcal survival in human whole blood
and serum and in the presence of the antimicrobial peptide LL-37. Infect
Immun 77: 292–299.
87. Mazurkiewicz P, Tang CM, Boone C, Holden DW (2006) Signature-tagged
mutagenesis: Barcoding mutants for genome-wide screens. Nat Rev Genet 7:
929–939.
Review
Toward the Use of Genomics to Study Microevolutionary Change in Bacteria
Daniel Falush*
Department of Microbiology, University College Cork, Environmental Research Institute, Lee Road, Cork, Ireland
Abstract: Bacteria evolve rapidly in response to the environment they encounter. Some environmental changes are experienced numerous times by bacteria from the same population, providing an opportunity to dissect the genetic basis of adaptive evolution. Here I discuss two examples in which the patterns of rapid change provide insight into medically important bacterial phenotypes, namely immune escape by Neisseria meningitidis and host specificity of Campylobacter jejuni. Genomic analysis of populations of bacteria from these species holds great promise but requires appropriate concepts and statistical tools.
Bacteria lack a natural reproductive system, comparable to
meiosis in eukaryotes, that segregates genes randomly. Instead,
they evolve progressively through mostly small genetic changes, a
proportion of which have noteworthy phenotypic effects. Some
phenotypes are intrinsically difficult to study in the laboratory:
virulence in humans or adaptation to particular ecological niches,
for example. For these traits in particular, a promising avenue for
scientific investigation is to identify the genetic changes that have
provided the basis for their evolution in natural populations.
Most human phenotypes are hard to study in vitro and,
consequently, methods for relating differences amongst humans to
natural genetic variation are well developed. Association studies
were proposed as an effective way of identifying genes with small
phenotypic effects more than a decade ago [1] and, although
initially controversial [2], the recent development of arrays for
genotyping hundreds of thousands of single nucleotide polymor-
phisms (SNPs) scattered across the whole genome has allowed the
approach to be successfully applied to many different human
diseases and other phenotypes [3]. This success should inspire the
development of equivalent protocols within bacteriology.
One challenge in developing generally applicable protocols for
mapping phenotypic traits in bacteria is that processes by which
microevolution occurs vary tremendously between species. For
example, the human pathogen Mycobacterium tuberculosis, the causal
agent of tuberculosis (TB), diverged recently from an obscure
organism occasionally isolated from humans in Africa called
Mycobacterium canettii [4]. M. tuberculosis shows very little variation
and there is no evidence of strains acquiring DNA by import from
other M. tuberculosis strains or indeed from any other organism, so
that individuals are clones of each other, distinguished only by rare
mutations or other small changes. By contrast, individual
Helicobacter pylori, a cause of gastric cancer, acquire DNA from
other members of the species at an extremely high rate.
Consequently, as well as varying in gene content [5], strains
isolated from different host individuals in the same ethnic group
typically differ from each other at approximately 3% of
nucleotides in core genes, and this diversity segregates nearly
randomly [6]. The majority of bacterial species fall between these
extremes, with their genomes showing signs of both clonal descent
and DNA import from other strains.
In this essay, I will argue that the clonal mode of reproduction
shared by all bacteria and Archaea, in which replication occurs by
binary fission, in fact provides an extremely powerful context for
association studies. These studies will require both appropriate
technologies for genotyping and evolutionary analysis and
judiciously chosen strain collections. I will here concentrate on
two examples in which placing evolutionary changes in their
clonal context provides the power to relate phenotype to genotype.
Population-scale genome sequencing promises to allow a full and
unbiased catalogue of variation within the same clonal context.
This reconstruction will facilitate identification of loci that show
correlations with phenotype or anomalous patterns that indicate
natural selection, with minimal assumptions about the mecha-
nisms by which phenotypes change.
Example 1: Immune Escape during Clonal Spread of Neisseria meningitidis
Neisseria meningitidis lives in the human nasopharynx and is best
known for its role in meningitis and other forms of meningococcal
disease. N. meningitidis is a major cause of morbidity and mortality
in childhood in industrialised countries and is responsible for
epidemics, principally in Africa and Asia. Many lineages persist
stably within human populations, causing little disease. There are
a handful of ‘‘hyperinvasive’’ lineages, however, that have a
distinct epidemiology, spreading rapidly from location to location
and causing clusters of disease cases but not persisting in any one
place.
Mark Achtman and colleagues examined variation within a
single hyperinvasive lineage of N. meningitidis, designated subgroup
III, over a period of three decades [7]. The strains within subgroup
III showed little diversity in most of their housekeeping and other
genes surveyed. A few loci were identified that did show variation,
however, allowing clonal relationships to be partially reconstructed.
This reconstruction demonstrated that there were strong
bottlenecks during geographical spread, with a single ancestor for
each major wave of infection. It also showed that, notwithstanding
the low overall level of variation, certain genes encoding specific
antigens changed repeatedly in different countries and pandemic
waves.
Citation: Falush D (2009) Toward the Use of Genomics to Study Microevolutionary Change in Bacteria. PLoS Genet 5(10): e1000627. doi:10.1371/journal.pgen.1000627
Editor: David S. Guttman, University of Toronto, Canada
Published October 26, 2009
Copyright: © 2009 Daniel Falush. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Funding: The author is funded by Science Foundation of Ireland grant number 05/FE1/B882. The funders had no role in the preparation of the article.
Competing Interests: The author has declared that no competing interests exist.
* E-mail: [email protected]
This article is part of the ‘‘Genomics of Emerging Infectious Disease’’ PLoS Journal collection (http://ploscollections.org/emerginginfectiousdisease/).
PLoS Genetics | www.plosgenetics.org 1 October 2009 | Volume 5 | Issue 10 | e1000627
The most remarkable variation was found in the transferrin-
binding protein B gene (tbpB), which encodes a protein responsible
for iron uptake that is expressed on the surface of the bacterium.
This gene had evolved on three occasions by nonsynonymous
point mutations that altered the structure of the protein and on 21
occasions by import of different versions of the protein from a
variety of sources, including from N. lactamica, a closely related and
entirely noninvasive species that also colonizes humans (Figure 1).
The import events vary: analysis of similar tbpB changes in a
closely related lineage showed that between 2 kb and 10 kb of
sequence was transferred, which often altered the sequence of the
flanking genes as well as tbpB [8]. In each case, however, an effect
of the imported DNA was to change the externally exposed part of
the protein from the usual version (called the family 4 version) to
one of two antigenically highly distinct versions (family 1 and
family 3).
The fact that functionally equivalent changes to tbpB are
achieved by heterogeneous genetic events shows that the large
number of imports is not caused by a recombination mechanism
that is specific to the locus. Instead it reflects the amplifying effect
of natural selection within the large number of bacteria that
circulate during epidemics. Imports happen at a low rate
throughout the genome, but those that cause an antigenic change
at the tbpB locus have a selective advantage, meaning that they are
observed at a much higher rate than imports elsewhere in the
genome.
High diversity at a particular antigen locus is usually explained
by invoking a mechanism called negative frequency-dependent
selection [9]. Hosts who have been exposed to a particular variant
develop immune responses against this variant. Bacteria with
antigenically distinct variants escape this response, giving them an
advantage in colonizing that host. At the population level, this
selection should lead to the persistence of multiple variants. Yet,
despite this selection for rare variants within individual epidemics,
the antigenic diversity of subgroup III did not increase
progressively over time but was instead reset at the beginning of
each new epidemic, which was started by a strain with a family 4
allele.
The continuous generation of subgroup III strains with family 1
and 3 tbpB alleles is better explained by a mechanism called
source–sink dynamics [10]. The source consists of an environment
within which transmission of the bacterium is self-sustaining. Sinks
consist of environments that bacteria can colonize effectively
(perhaps by undergoing genetic modification) but from which
onward transmission is ineffective. Here, the sink environment
consists of individuals with acquired immunity to subgroup III
strains that carry family 4 alleles, while the source is the remainder
of the human population. The fact that the variant genotypes
capable of colonizing the sink do not spread geographically but
instead are repeatedly regenerated locally suggests that these
strains have reduced overall transmission fitness in naïve hosts,
which comprise the majority of individuals in populations where
an epidemic has not occurred recently.
Two other examples of sink environments are the lungs of
immunocompromised patients for Pseudomonas aeruginosa, and the
human urinary tract for Escherichia coli [10]; as for the N. meningitidis
example, specific genetic changes have been identified that adapt
strains of these bacteria to those environments but at the expense
of overall transmission fitness, with the result that infections
generally occur sporadically.
Example 2: Host Specificity in Campylobacter jejuni
Campylobacter jejuni is a gram-negative bacterium commonly
found in animal feces. It is often associated with poultry and
naturally colonises the GI tract of many bird species. C. jejuni is one
of the most common causes of human gastroenteritis in the world.
Infection caused by Campylobacter species can be severely
debilitating but is rarely life-threatening. Human infection is
sporadic and, although poorly prepared food is often thought to be
implicated, it is generally difficult to track the source. There has
therefore been a substantial effort to isolate bacteria from a wide
variety of reservoirs and to genotype them using multilocus
sequence typing (MLST), which involves obtaining the DNA
sequence for each isolate at a standardized panel of genes (seven
for Campylobacter) that are chosen because they have an essential
function and are present in the vast majority of isolates in the
species [11].
Figure 1. Acquisition of new tbpB genes by subgroup III Neisseria meningitidis during epidemic spread. Colours indicate the family of each tbpB allele, with red corresponding to family 4, green corresponding to family 1, and blue corresponding to family 3. The bars highlight the time frame, most common tbpB type, and geographical extent of each epidemic (in 1987, pilgrims from the Hajj briefly distributed the lineage worldwide). The circles correspond to variant genotypes. Small circles indicate that the variant allele was found in only one strain; large circles indicate that it was found in between two and four strains. doi:10.1371/journal.pgen.1000627.g001
The C. jejuni strains acquired by chickens are distinct from those
of the wild birds around them, even when the poultry are kept
outdoors [12]. Within farm animals, certain lineages are found
with very different frequencies in chickens and cattle, whereas
several genotypes are found at high frequency in both (strains with
the MLST type ST-21, for example) [13]. Strains from different
farm animals are more similar to each other than they are to
strains found, for example, in starlings (a native European bird
that is also common in many other countries, including the US)
[14].
The digestive system of chickens differs from that of cattle in
multiple aspects, and their body temperature is several degrees
higher than that of cattle. This raises the question of how some
lineages are able to compete successfully in both hosts.
Mechanisms facilitating rapid phenotypic adaptation include: (1)
inbuilt regulatory mechanisms that allow individual bacteria to
alter gene expression in response to new environments [15], (2)
‘‘contingency loci’’ that mutate rapidly, creating phenotypic
variation amongst bacteria that are otherwise genetically identical
[16], and (3) import of DNA from other strains that are already
adapted to the current environment.
A first step toward understanding the evolution of host
specificity is to establish whether it is possible to predict the host
origin of strains based on their genome sequence. One approach
to doing this uses phylogenetic relationships. For example, the
program AdaptML (http://almlab.mit.edu/ALME/Software/
Software.html) attempts to assign branches of the phylogenetic
tree to preferred habitats based on where the strains on that
branch were isolated [17]. For C. jejuni, habitat can, for example,
be equated to host species. The observation of a group of
phylogenetically related strains in a single host species might reflect
the common ancestor of those strains acquiring the traits required
to survive in that species.
Since C. jejuni recombines frequently, the genome composition
of each strain is determined by the sources from which it has
imported DNA, as well as by which strains it is phylogenetically
related to. For example, ST-21, together with its variants, is a
lineage analogous to subgroup III of N. meningitidis. Like subgroup
III, the lineage has imported DNA from other strains on numerous
occasions during its spread, with the result that many isolates have
variant genotypes that differ from ST-21 at one or two of the seven
MLST fragments. By convention, these strains are grouped with
ST-21 into the ST-21 clonal complex.
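The grouping convention just described can be sketched in a few lines of Python. This is an illustration only: the seven-number allelic profiles, the founder profile, and the function names are hypothetical, and the cut-off of two differing loci simply follows the "one or two of the seven MLST fragments" wording above rather than the rules of any particular typing database.

```python
from typing import List, Sequence

# An MLST profile is taken here to be the allele numbers at the seven loci,
# always listed in the same order. All values below are made up for illustration.
Profile = Sequence[int]


def locus_differences(a: Profile, b: Profile) -> int:
    """Number of loci at which two profiles carry different alleles."""
    if len(a) != len(b):
        raise ValueError("profiles must cover the same loci")
    return sum(x != y for x, y in zip(a, b))


def variant_loci(profile: Profile, founder: Profile) -> List[int]:
    """Zero-based positions of the loci at which `profile` departs from `founder`."""
    return [i for i, (x, y) in enumerate(zip(profile, founder)) if x != y]


def in_clonal_complex(profile: Profile, founder: Profile, max_diff: int = 2) -> bool:
    """True if `profile` differs from the founder at no more than `max_diff` loci."""
    return locus_differences(profile, founder) <= max_diff


# Hypothetical founder profile and one hypothetical single-locus variant.
founder = (2, 1, 1, 3, 2, 1, 5)
isolate = (2, 1, 1, 3, 2, 1, 17)
grouped = in_clonal_complex(isolate, founder)   # differs at one locus only
changed = variant_loci(isolate, founder)        # the locus that was replaced
```

The positions returned by `variant_loci` are the fragments that would be examined as candidate imports in the analysis discussed next.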
ST-21 itself has been found at high frequency in several
agricultural species and elsewhere. Therefore, if a new strain is
found to be ST-21, this provides little information on where it
might have originated. However, for the variants of ST-21, Noel
McCarthy and colleagues obtained significantly better than
random assignment by predicting hosts based on the frequency
with which the variant allele was found in chicken or cattle [13]. A
useful signal of host-of-origin is thus provided by the DNA that
each isolate has acquired (Figure 2). Furthermore, the high rate of
recombination within particular hosts represents a mechanism by
which complex adaptations to a particular host species can be
acquired quickly subsequent to a host switch.
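The idea of predicting host from the frequency of a variant allele in each host's gene pool can be illustrated with a minimal Python sketch. It is not the procedure used in [13]; the reference isolates, allele labels, and function names are hypothetical, and the rule (pick the host in which the allele is relatively most common, and decline to predict on a tie) is an assumed simplification.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, Optional, Tuple


def build_gene_pools(reference: Iterable[Tuple[str, str]]) -> Dict[str, Counter]:
    """Count how often each allele was seen in each host.

    `reference` holds (host, allele) pairs from isolates of known origin.
    """
    pools: Dict[str, Counter] = defaultdict(Counter)
    for host, allele in reference:
        pools[host][allele] += 1
    return dict(pools)


def predict_host(allele: str, pools: Dict[str, Counter]) -> Optional[str]:
    """Host in which `allele` has the highest relative frequency, or None if tied."""
    if not pools:
        return None
    freqs = {
        host: counts[allele] / sum(counts.values())
        for host, counts in pools.items()
    }
    best = max(freqs.values())
    winners = [host for host, f in freqs.items() if f == best]
    # An allele that is equally common (or equally absent) everywhere is uninformative.
    return winners[0] if len(winners) == 1 else None


# Hypothetical reference collection: allele "a1" is common in chickens,
# allele "a2" is common in cattle.
reference = (
    [("chicken", "a1")] * 8 + [("chicken", "a2")] * 2
    + [("cattle", "a2")] * 9 + [("cattle", "a1")] * 1
)
pools = build_gene_pools(reference)
guess = predict_host("a1", pools)
```

Alleles with similar frequencies in both hosts carry little signal under this rule, which mirrors the distinction between host-characteristic and uninformative alleles drawn in Figure 2.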
The Power of Bacterial Genomics
Studies in bacteria have two major advantages over those in
humans or other mammals when it comes to relating phenotype to
genotype based on natural variation. The first is the magnifying
effect of natural selection in enormous bacterial populations. This
selection acts to rapidly increase the frequency of genotypes that
give small fitness advantages in a particular environment, even if
these genotypes are generated only rarely. Adaptation in bacteria
is likely to be more frequent and to leave more distinctive genetic
signatures than in species such as humans where signals of
adaptation to local environments have proved to be remarkably
subtle [18]. The second is the fact that evolution occurs in the
context of progressively changing clonal backgrounds. This
property can make it possible to identify strains that have
extremely similar genomes but nevertheless differ phenotypically
[19]. These strains represent the natural equivalent of an isogenic
line and can allow precise inferences about the effects of natural
variation and how different changes interact with each other.
In order to fully exploit the advantages of bacteria for detecting
phenotypic associations, it is necessary to develop a conceptual
and analytical framework within which rapid evolutionary change
can be interpreted. One such framework is source–sink dynamics
[10]. The Neisseria example illustrates the power of microevolu-
tionary analysis in a source–sink ecological context to identify first
the sink (hosts with immune responses to tbpB family 4 alleles) and
second the loci under an immediate selective pressure to change
within that sink (the tbpB gene).
Source–sink dynamics cannot be applied to investigate host
specificity within Campylobacter, because individual host species,
e.g., chicken, cattle, and individual species of wild birds, each
harbour large, viable populations of bacteria with high rates of
within-species transmission and do not represent sinks. Nevertheless,
there is a key similarity between the Neisseria and Campylobacter
examples, namely that the strains are repeatedly challenged by an
environment that is novel in the recent history of the strain. In the
Neisseria example, this challenge is repeatedly met by genetic
changes at particular antigenic loci, which consequently have
extremely atypical patterns of variation. In Campylobacter this
challenge is met in the context of a high rate of import of DNA
across the genome from other Campylobacter strains that already
colonize the new host.
Figure 2. A schematic illustration of the evolution of the C. jejuni ST-21 clonal complex in cattle and chickens. The common ancestor of the complex occurred in chickens (red). During evolution, the lineage occasionally switched to a cattle host (indicated by a blue branch) and sometimes back to chicken. The bacteria acquired DNA by homologous recombination from other C. jejuni in the same host. Since recombination is assumed to occur from donors within the same host, the gene pool is determined by the genomic composition of the strains that colonize each host. The gene pools are illustrated for two separate loci (right- and left-facing arrows) in chickens and cattle. The gene pools contain alleles that occur at much higher frequency in one host than in the other (shown in colour) and others that do not (shown in black). The former are informative about the host in which the recombination event occurred, while the latter are not. The recombination event labelled a introduces the left-facing black arrow gene from the cattle gene pool and is phylogenetically informative because it defines a lineage that is largely restricted to cattle. The five recombination events labelled b are not phylogenetically informative, since they only affect a single strain in the sample. These events are nevertheless informative because they introduce alleles that are characteristic of the host species. The event labelled c is both phylogenetically informative and characteristic of host. The event labelled d is noninformative. doi:10.1371/journal.pgen.1000627.g002
The availability of full genome sequences promises to enhance
our understanding of the bacterial responses to new environments
in a number of ways. First, phylogenetic relationships will be better
resolved. In the Neisseria example, a well-resolved tree will
elucidate patterns of transmission within epidemics and, for
example, whether tbpB imports take place at the later stages of
each wave and if strains with such imports ever reacquire family 4
alleles and seed later epidemics. In the Campylobacter example this
will allow estimates of the number of occasions that the ST-21
lineage has jumped between host species and establish whether
there are sublineages that are becoming progressively more
adapted to single-host transmission.
Second, genomics will provide a complete catalogue of loci
whose pattern of descent is atypical of the genome as a whole and
therefore either associated with a particular phenotype or
putatively affected by selection. In the Neisseria example, an
elevated rate of change at particular loci and consistency in the
nature of those changes would provide signs of selection. In the
Campylobacter example, loci that are imported at very high
frequency and/or that are highly differentiated between host
species may be involved in adaptation to a new host. An isolate-by-
isolate analysis of the patterns of import should establish whether
the multi-host lifestyle of ST-21 and, by extension, of C. jejuni as a
whole is facilitated by import of DNA from locally adapted strains.
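One crude way to picture "an elevated rate of change at particular loci" is to compare the number of independent change events at each locus with a genome-wide average. The Python sketch below does this with a Poisson upper-tail probability and a Bonferroni-style threshold. Everything in it is an assumption made for illustration: the per-locus counts are invented (apart from echoing the 3 point mutations plus 21 imports reported for tbpB above), and a real analysis would need to account for gene length, shared ancestry, and the outlier's own contribution to the average.

```python
import math
from typing import Dict, List


def poisson_upper_tail(k: int, lam: float) -> float:
    """P(X >= k) for X ~ Poisson(lam)."""
    if k <= 0:
        return 1.0
    cdf = sum(math.exp(-lam) * lam ** i / math.factorial(i) for i in range(k))
    return max(0.0, 1.0 - cdf)


def atypical_loci(changes: Dict[str, int], alpha: float = 0.05) -> List[str]:
    """Loci whose change count is improbably high given the genome-wide mean.

    `changes` maps locus name -> number of independent change events observed.
    The threshold alpha / (number of loci) is a simple multiple-testing correction.
    """
    n = len(changes)
    if n == 0:
        return []
    lam = sum(changes.values()) / n
    threshold = alpha / n
    return sorted(
        locus for locus, k in changes.items()
        if poisson_upper_tail(k, lam) < threshold
    )


# Hypothetical survey: 42 loci with zero or one change each, plus one locus with 24.
changes = {"gene%d" % i: 0 for i in range(40)}
changes.update({"geneA": 1, "geneB": 1, "tbpB": 24})
flagged = atypical_loci(changes)
```

Consistency in the nature of the changes (for example, repeated switches to the same antigenic family) is a separate line of evidence that this count-based sketch does not capture.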
Third, genomics will allow detection of epistasis between loci.
Epistasis occurs when the fitness effects of alleles at one gene are
modified by the genotype at one or more additional genes. In
outbreeding diploids, such as mammals, each allele has its fitness
tested on a new genetic background in every generation, with the
result that epistasis does not leave a distinctive signature in the
frequency of particular combinations of alleles unless the loci are
closely linked on the same chromosome or selection is very strong.
In bacteria, combinations of alleles remain together for many
generations wherever they occur in the genome, providing ample
opportunity for epistasis to bring particular combinations of alleles
to high frequency. For example, subgroup III strains that have
imported variant tbpB alleles can potentially enhance their fitness
by importing other parts of the genome that adapt other strains in
the Neisseria population to having high fitness when carrying family
1 or family 3 alleles. These parts of the genome could be detected
by identifying parallel changes that have occurred on the 21
occasions that a variant tbpB allele was imported during the spread
of subgroup III strains. Fitness interactions establish functional
relationships between loci and represent a central part of the
evolutionary landscape, for example triggering the origin of species
[20]. Genome sequencing of bacteria should provide key insights
on the nature of these interactions in natural populations.
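The search for parallel changes accompanying tbpB imports can be framed, very roughly, as a test of association in a 2×2 table: lineages with or without a variant tbpB allele, cross-classified by whether a candidate second locus has also changed. The Python sketch below computes a one-sided Fisher exact probability for such a table. The counts are hypothetical, the function name is an assumption, and treating lineages as independent observations is a strong simplification given their shared clonal ancestry.

```python
from math import comb


def fisher_one_sided(a: int, b: int, c: int, d: int) -> float:
    """One-sided Fisher exact probability for the 2x2 table [[a, b], [c, d]].

    a: tbpB import and change at the candidate locus
    b: tbpB import, no change at the candidate locus
    c: no tbpB import, change at the candidate locus
    d: no tbpB import, no change at the candidate locus

    Returns the probability of seeing `a` or more co-occurrences when the
    row and column totals are held fixed and the two events are unrelated.
    """
    row1 = a + b          # lineages with a tbpB import
    col1 = a + c          # lineages with a change at the candidate locus
    n = a + b + c + d
    upper = min(row1, col1)
    numerator = sum(
        comb(row1, k) * comb(n - row1, col1 - k) for k in range(a, upper + 1)
    )
    return numerator / comb(n, col1)


# Hypothetical counts: 21 import lineages, 12 of which also changed at the
# candidate locus, against 60 lineages without an import, 3 of which changed.
p_value = fisher_one_sided(12, 9, 3, 57)
```

A small probability for a locus would mark it as a candidate for an epistatic interaction with tbpB, to be followed up with phylogenetically aware methods.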
In C. jejuni and other zoonoses, genomic analyses will facilitate a
qualitative advance in our understanding of the epidemiology,
ecology, and molecular biology of host switches. These develop-
ments will allow accurate delineation of the sources of human
infection and an understanding of the factors promoting successful
and pathogenic colonization of humans. In N. meningitidis and
similar bacteria, we will gain a much better understanding of the
genetic differences between invasive and noninvasive strains and
the particular adaptive strategies that cause lineages to become
invasive. These advances will together allow the design of targeted
interventions that reduce the burden of human disease.
Challenges for the Future
Advances in sequencing technology mean that it is becoming
economically feasible to obtain complete or nearly complete
genome sequences for large samples of bacteria. To better exploit
this technology to understand bacterial phenotypes, the field
should emulate the research program of human genetics and (1)
develop statistical tools that use sequence variation to infer
mechanisms of evolution [21] and patterns of genetic relationship
[22]; (2) collect and sequence samples of isolates in which bacteria
that differ in phenotypes of interest are matched as far as possible
in time and space [23]; and (3) design statistical tools for detecting
phenotypic associations [24] and natural selection [25] by
identifying patterns of relationship at particular loci that are
atypical of the genome as a whole.
Acknowledgments
Mark Achtman, Jim Bull, Jana Haase, Riikka Haukkanen, and Daniel
Stoebel provided insightful discussions and comments on the manuscript.
References
1. Risch N, Merikangas K (1996) The future of genetic studies of complex humandiseases. Science 273: 1516–1517.
2. Weiss KM, Terwilliger JD (2000) How many diseases does it take to map a genewith SNPs? Nat Genet 26: 151–157.
3. Hardy J, Singleton A (2009) Genomewide association studies and humandisease. N Engl J Med 360: 1759–1768.
4. Fabre M, Koeck JL, Le Fleche P, Simon F, Herve V, V, et al. (2004) Highgenetic diversity revealed by variable-number tandem repeat genotyping and
analysis of hsp65 gene polymorphism in a large collection of ‘‘Mycobacterium
canettii’’ strains indicates that the M. tuberculosis complex is a recently emergedclone of ‘‘M. canettii’’. J Clin Microbiol 42: 3248–3255.
5. Gressmann H, Linz B, Ghai R, Pleissner KP, Schlapbach R, et al. (2005) Gainand loss of multiple genes during the evolution of Helicobacter pylori. PLoS Genet
1: e43. doi:10.1371/journal.pgen.0010043.
6. Suerbaum S, Maynard Smith J, Bapumia K, Morelli G, Smith NH, et al. (1998)
Free recombination within Helicobacter pylori. Proc Natl Acad Sci U S A 95:12619–12624.
7. Zhu P, van Der EA, Falush D, Brieske N, Morelli G, et al. (2001) Fit genotypesand escape variants of subgroup III Neisseria meningitidis during three pandemics
of epidemic meningitis. Proc Natl Acad Sci U S A 98: 5234–5239.
8. Linz B, Schenker M, Zhu P, Achtman M (2000) Frequent interspecific genetic
exchange between commensal Neisseriae and Neisseria meningitidis. Mol Microbiol
36: 1049–1058.
9. Brisson D, Dykhuizen DE (2004) ospC diversity in Borrelia burgdorferi: Different
hosts are different niches. Genetics 168: 713–722.
10. Sokurenko EV, Gomulkiewicz R, Dykhuizen DE (2006) Source-sink dynamics of virulence evolution. Nat Rev Microbiol 4: 548–555.
11. Maiden MCJ, Bygraves JA, Feil E, Morelli G, Russell JE, et al. (1998) Multilocus
sequence typing: A portable approach to the identification of clones within populations of pathogenic microorganisms. Proc Natl Acad Sci U S A 95:
3140–3145.
12. Colles FM, Jones TA, McCarthy ND, Sheppard SK, Cody AJ, et al. (2008) Campylobacter infection of broiler chickens in a free-range environment.
Environ Microbiol 10: 2042–2050.
13. McCarthy ND, Colles FM, Dingle KE, Bagnall MC, Manning G, et al. (2007) Host-associated
genetic import in Campylobacter jejuni. Emerg Infect Dis 13: 267–272.
14. Colles FM, McCarthy ND, Howe JC, Devereux CL, Gosler AG, et al. (2009) Dynamics of Campylobacter colonization of a natural host, Sturnus vulgaris
(European starling). Environ Microbiol 11: 258–267.
15. Coulson RM, Ouzounis CA (2003) The phylogenetic diversity of eukaryotic transcription. Nucleic Acids Res 31: 653–660.
16. Moxon R, Bayliss C, Hood D (2006) Bacterial contingency loci: The role of simple
sequence DNA repeats in bacterial adaptation. Annu Rev Genet 40: 307–333.
17. Hunt DE, David LA, Gevers D, Preheim SP, Alm EJ, et al. (2008) Resource
partitioning and sympatric differentiation among closely related bacterioplankton.
Science 320: 1081–1085.
18. Coop G, Pickrell JK, Novembre J, Kudaravalli S, Li J, et al. (2009) The role of
geography in human adaptation. PLoS Genet 5: e1000500. doi:10.1371/journal.pgen.1000500.
19. Beres SB, Richter EW, Nagiec MJ, Sumby P, Porcella SF, et al. (2006)
Molecular genetic anatomy of inter- and intraserotype variation in the human bacterial pathogen group A Streptococcus. Proc Natl Acad Sci U S A 103:
7059–7064.
20. Coyne JA, Orr HA (2004) Speciation. Sunderland (MA): Sinauer Associates.
21. McVean GA, Myers SR, Hunt S, Deloukas P, Bentley DR, et al. (2004) The
fine-scale structure of recombination rate variation in the human genome. Science 304: 581–584.
22. Rosenberg NA, Pritchard JK, Weber JL, Cann HM, Kidd KK, et al. (2002)
Genetic structure of human populations. Science 298: 2381–2385.
23. The Wellcome Trust Case Control Consortium (2007) Genome-wide association
study of 14,000 cases of seven common diseases and 3,000 shared controls.
Nature 447: 661–678.
24. Marchini J, Howie B, Myers S, McVean G, Donnelly P (2007) A new multipoint
method for genome-wide association studies by imputation of genotypes. Nat Genet 39: 906–913.
25. Sabeti PC, Varilly P, Fry B, Lohmueller J, Hostetter E, et al. (2007) Genome-wide
detection and characterization of positive selection in human populations. Nature 449: 913–918.
Review
The Application of Genomics to Emerging Zoonotic Viral Diseases
Bart L. Haagmans, Arno C. Andeweg, Albert D. M. E. Osterhaus*
Department of Virology, Erasmus Medical Center, Rotterdam, The Netherlands
Abstract: Interspecies transmission of pathogens may result in the emergence of new infectious diseases in humans as well as in domestic and wild animals. Genomics tools such as high-throughput sequencing, mRNA expression profiling, and microarray-based analysis of single nucleotide polymorphisms are providing unprecedented ways to analyze the diversity of the genomes of emerging pathogens as well as the molecular basis of the host response to them. By comparing and contrasting the outcomes of an emerging infection with those of closely related pathogens in different but related host species, we can further delineate the various host pathways determining the outcome of zoonotic transmission and adaptation to the newly invaded species. The ultimate challenge is to link pathogen and host genomics data with biological outcomes of zoonotic transmission and to translate the integrated data into novel intervention strategies that eventually will allow the effective control of newly emerging infectious diseases.
Emerging Zoonotic Viruses
Most of the well-known human viruses persist in the population
for a relatively long time, and coevolution of the virus and its
human host has resulted in an equilibrium characterized by
coexistence, often in the absence of a measurable disease burden.
When pathogens cross a species barrier, however, the infection
can be devastating, causing a high disease burden and mortality.
In recent years, several outbreaks of infectious diseases in humans
linked to such an initial zoonotic transmission (from animal to
human host) have highlighted this problem. Factors related to our
increasingly globalized society have contributed to the apparently
increased transmission of pathogens from animals to humans over
the past decades; these include changes in human factors such as
increased mobility, demographic changes, and exploitation of the
environment (for a review see Osterhaus [1] and Kuiken et al. [2]).
Environmental factors also play a direct role, and many examples
exist. The recently increased distribution of the arthropod
(mosquito) vector Aedes aegypti, for example, has led to massive
outbreaks of dengue fever in South America and Southeast Asia.
Intense pig farming in areas where frugivorous bats are common is
probably the direct cause of the introduction of Nipah virus into
pig populations in Malaysia, with subsequent transmission to
humans. Bats are an important reservoir for a plethora of zoonotic
pathogens: two closely related paramyxoviruses—Hendra virus
and Nipah virus—cause persistent infections in frugivorous bats
and have spread to horses and pigs, respectively [3].
The similarity between human and nonhuman primates permits
many viruses to cross the species barrier between different primate
species. The introduction into humans of HIV-1 and HIV-2 (the
lentiviruses that cause AIDS), as well as other primate viruses, such
as monkeypox virus and Herpesvirus simiae, provides dramatic
examples of this type of transmission. Other viruses, such as
influenza A viruses and severe acute respiratory syndrome
coronavirus (SARS-CoV), may need multiple genetic changes to
adapt successfully to humans as a new host species; these changes
might include differential receptor usage, enhanced replication,
evasion of innate and adaptive host immune defenses, and/or
increased efficiency of transmission. Understanding the complex
interactions between the invading pathogen on the one hand and
the new host on the other as they progress toward a new host–
pathogen equilibrium is a major challenge that differs substantially
for each successful interspecies transmission and subsequent
spread of the virus.
Genomics of Zoonotic Viruses and Their Hosts
New molecular techniques such as high-throughput sequencing,
mRNA expression profiling, and array-based single nucleotide
polymorphism (SNP) analysis provide ways to rapidly identify
emerging pathogens (Nipah virus and SARS-CoV, for example)
and to analyze the diversity of their genomes as well as the host
responses against them. Essential to the process of identification
and characterization of genome sequences is the exploitation of
extensive databases that allow the alignment of viral genome
sequences and the linkage of these genomics data to those obtained
by classical viral culture and serological techniques, and
epidemiological, clinical, and pathological studies [4]. Extensive
genetic analysis of HIV-1, for example, has provided clues to the
geography and time scale of the early diversification of HIV-1
strains when the virus emerged in humans. HIV-1 strains are
divided into multiple clades, each of which has independently
evolved from a simian immunodeficiency virus (SIV) that naturally
infects chimpanzees in West and Central Africa. Current estimates
date the common ancestor of HIV-1 to the beginning of the
twentieth century [5].
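The dating logic behind such estimates can be illustrated with a strict molecular clock. The following is a minimal sketch, not the method used in [5]: it assumes that genetic distance from the common ancestor grows linearly with sampling year, and published analyses rely on considerably more elaborate statistical models. The sampling years and distances are synthetic values constructed to lie on a straight line, and the function name `estimate_tmrca` is hypothetical.

```python
def estimate_tmrca(years, distances):
    """Least-squares fit of distance = rate * (year - t0).

    Returns (rate, t0), where t0 is the year at which the fitted
    line crosses zero distance, i.e. the inferred common ancestor.
    """
    n = len(years)
    mean_x = sum(years) / n
    mean_y = sum(distances) / n
    sxx = sum((x - mean_x) ** 2 for x in years)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(years, distances))
    rate = sxy / sxx                      # substitutions per site per year
    intercept = mean_y - rate * mean_x    # fitted distance at year 0
    t0 = -intercept / rate                # year where fitted distance is 0
    return rate, t0


# Synthetic, illustrative data only (not real HIV-1 measurements).
sample_years = [1960, 1980, 2000]
root_to_tip = [0.12, 0.16, 0.20]

rate, tmrca = estimate_tmrca(sample_years, root_to_tip)
print(f"rate = {rate:.4f} substitutions/site/year, common ancestor ~ {tmrca:.0f}")
```

With real data the points scatter around the line, so the intercept carries an uncertainty that this sketch does not quantify.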
Citation: Haagmans BL, Andeweg AC, Osterhaus ADME (2009) The Application of Genomics to Emerging Zoonotic Viral Diseases. PLoS Pathog 5(10): e1000557. doi:10.1371/journal.ppat.1000557
Editor: Marianne Manchester, The Scripps Research Institute, United States of America
Published October 26, 2009
Copyright: © 2009 Haagmans et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Funding: Supported by the VIRGO consortium, an Innovative Cluster approved by the Netherlands Genomics Initiative and partially funded by the Dutch Government (BSIK 03012), The Netherlands and the US National Institutes of Health, RO1 grant HL080621-O1A1. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Competing Interests: The authors have declared that no competing interests exist.
* E-mail: [email protected]
This article is part of the “Genomics of Emerging Infectious Disease” PLoS Journal collection (http://ploscollections.org/emerginginfectiousdisease/).
PLoS Pathogens | www.plospathogens.org 1 October 2009 | Volume 5 | Issue 10 | e1000557
Because zoonotic pathogens may cause variable
clinical outcomes in human hosts that differ in age, nutritional
status, genetic background, and immunological condition, deciphering
the complex interactions between evolving pathogens and
their hosts is a great challenge. The genome sequences of many
host species have become available in the last decade, and with them
a range of novel tools to study virus–host interactions
at the molecular level. This progress, together with advances in
high-throughput sequencing technology and, not least, in
(bio)informatics and statistics, allows us to analyze the ‘‘genome-
wide’’ networks of gene interactions that control the host response
to pathogens. By comparing and contrasting the outcomes of
infection with closely related pathogens in different but related
host species, we can further delineate the various host pathways
involved in the different outcomes. The power of this approach
was nicely demonstrated for SIV infection of various primate host
species. Natural reservoir hosts of SIV do not develop AIDS upon
infection, whereas non-natural hosts, such as rhesus macaques and
pig-tailed macaques, when infected experimentally with SIV,
develop AIDS in a similar manner to HIV-infected humans.
Transcriptional profiling indicates that SIV infection of these
species produces a distinctive host response [6]. SIV-infected
primates with symptoms of AIDS have a high viral load, immune
activation, and loss of certain types of T cells, whereas SIV-
infected sooty mangabeys (the species from which HIV-2 is
thought to have originated) have substantially lower levels of
innate immune activation than the symptomatic primates, partly
due to the production of less interferon-α by plasmacytoid
dendritic cells in response to SIV and other Toll-like receptor
ligands [7]. Identification of host factors that restrict HIV infection
may aid the development of effective intervention strategies.
Below, we elaborate on two other examples of recent important
zoonotic events that led to sustained virus transmission in the
human host, and the role that genomics has played in the
elucidation of their pathogenesis thus far.
Influenza Virus
Influenza is caused by RNA viruses of the Orthomyxoviridae
family. Whereas fever and coughs are the most frequent
symptoms, in more serious cases a fatal pneumonia can develop,
particularly in the young and the elderly. Typically, influenza is
transmitted through the air by coughs or sneezes, creating aerosols
containing the virus; but influenza can also be transmitted by bird
droppings, saliva, feces, and blood. Birds and pigs play an
important role in the emergence of new influenza viruses in
humans. Fecal sampling of migratory birds has revealed that they
harbor a large range of different subtypes of influenza A viruses
[8]. Some wild duck species, particularly mallards, are potential
long-distance vectors of highly pathogenic avian influenza virus
(H5N1), whereas others, including diving ducks, are more likely to
act as ‘‘sentinel’’ species that die upon infection [9]. Following the
introduction of a new pandemic influenza A virus subtype from an
avian reservoir, either directly or via another mammalian species
such as the pig, the virus may continue to circulate in humans in
subsequent years as a seasonal influenza virus. In the past century,
three major influenza epidemics resulted in the loss of many
millions of lives. Spanish flu alone caused the deaths of more than
50 million people by the end of World War I in 1918. The 2009
outbreak of a new H1N1 virus (causing ‘‘swine flu’’) that started in
Mexico further illustrates the pandemic potential of influenza A
viruses.
After introduction of a new influenza A virus from an avian or
porcine reservoir into the human species, viral genomics studies
are essential to identify critical mutations that enable the
circulating virus to spread efficiently, interact with different
receptors, and cause disease in the new host. For example, the
importance of residue 627 of the PB2 protein of the viral
polymerase in determining species restriction has been demon-
strated through these kinds of approaches [10]. Furthermore,
changes in the hemagglutinin molecules may allow influenza A
viruses to switch receptor specificity. The hemagglutinin of avian
H5N1 influenza viruses preferentially binds to oligosaccharides
that terminate with a sialic acid–α-2,3-Gal disaccharide, whereas
the hemagglutinins of mammalian influenza A viruses prefer
oligosaccharides that terminate with sialic acid–α-2,6-Gal
(Figure 1). Fatal viral pneumonia in humans infected with avian
H5N1 viruses is partly due to the ability of these viruses to attach
to and replicate in the cells of the lower respiratory tract, which
have oligosaccharides that terminate in sialic acid–α-2,3-Gal
disaccharide [11,12]. The sequence of the hemagglutinin protein
may also affect its binding affinity for neutralizing antibodies.
Understanding the relationship between genetic diversity and
antigenic properties of these viruses [13] may help to predict the
emergence of influenza viruses and to develop effective vaccines.
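Screening sequences for a known host-range marker such as PB2 residue 627 reduces to a lookup at one position. The sketch below is illustrative only. It assumes ungapped, full-length PB2 protein sequences. It also uses the widely reported association of glutamate (E) with avian viruses and lysine (K) with mammalian adaptation, which is background knowledge not stated in this review. The sequences are placeholders rather than real isolates, and the names `PB2_627_MARKERS`, `pb2_627_call`, and `make_placeholder` are hypothetical.

```python
# Assumed residue-to-host association at PB2 position 627 (background
# knowledge, not taken from this review).
PB2_627_MARKERS = {"E": "avian-like", "K": "mammalian-adapted"}


def pb2_627_call(protein_seq, position=627):
    """Classify a PB2 protein sequence by the residue at a 1-based position."""
    if len(protein_seq) < position:
        return "sequence too short"
    residue = protein_seq[position - 1].upper()
    return PB2_627_MARKERS.get(residue, f"other ({residue})")


def make_placeholder(residue, position=627, length=759):
    """Build a dummy sequence carrying `residue` at `position` (not real data)."""
    return "M" + "A" * (position - 2) + residue + "A" * (length - position)


# Placeholder isolates for illustration only.
isolates = {
    "toy_avian": make_placeholder("E"),
    "toy_human": make_placeholder("K"),
}

calls = {name: pb2_627_call(seq) for name, seq in isolates.items()}
for name, call in calls.items():
    print(f"{name}: PB2-627 is {call}")
```

In practice the position would be read from a multiple sequence alignment, because insertions or deletions shift residue numbering.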
Microarray-assisted mRNA expression profiling of emerging
zoonotic viral infections, including influenza A virus, is used to
phenotype the host response in great detail. By comparing mRNA
expression in individuals infected with an emerging virus to
expression in individuals infected with a related established virus,
researchers can generate a ‘‘molecular fingerprint’’ of the host
response genes or pathways specifically involved in the often-
exuberant host responses to the emerging virus. By using
genetically engineered influenza A viruses, a role for the
nonstructural NS1 viral protein in evasion of the innate host
response has been demonstrated [14]. Interestingly, the NS1
protein derived from the 1918 Spanish H1N1 pandemic influenza
virus blocked expression of interferon-regulated genes more
efficiently than did the NS1 protein from established seasonal
influenza viruses [14]. Other genomics studies of genetically
engineered influenza A viruses containing some or all of the gene
segments from either the 1918 H1N1 virus or the highly
pathogenic avian influenza A virus (H5N1), suggest that these
highly pathogenic influenza viruses induce severe disease in mice
and macaques through aberrant and persistent activation of
proinflammatory cytokine and chemokine responses [15–18].
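The comparison behind such a “molecular fingerprint” can be sketched as a per-gene log2 fold change between hosts infected with an emerging virus and hosts infected with an established one. This is a simplified illustration under stated assumptions, not the analysis pipeline of the cited studies: real microarray work adds normalization, replicate statistics, and multiple-testing correction. The expression values are synthetic, the gene set is a toy selection (ACTB is added here only as an unchanged reference), and the function names, pseudocount, and threshold are hypothetical choices.

```python
import math


def log2_fold_changes(emerging, established, pseudocount=1.0):
    """Per-gene log2 ratio of mean expression (emerging vs. established)."""
    changes = {}
    for gene in emerging.keys() & established.keys():
        mean_e = sum(emerging[gene]) / len(emerging[gene])
        mean_s = sum(established[gene]) / len(established[gene])
        # The pseudocount avoids division by zero for unexpressed genes.
        changes[gene] = math.log2((mean_e + pseudocount) / (mean_s + pseudocount))
    return changes


def fingerprint(changes, threshold=1.0):
    """Genes whose expression differs at least 2-fold in either direction."""
    return sorted(g for g, lfc in changes.items() if abs(lfc) >= threshold)


# Synthetic expression values for illustration only.
emerging = {"CXCL10": [799.0, 799.0], "IL6": [255.0, 255.0], "ACTB": [999.0, 999.0]}
established = {"CXCL10": [99.0, 99.0], "IL6": [63.0, 63.0], "ACTB": [999.0, 999.0]}

changes = log2_fold_changes(emerging, established)
print(fingerprint(changes))
```

Genes that pass the threshold form the candidate fingerprint, and pathway-level interpretation would follow as a separate step.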
Application of genomics tools not only supports the elucidation
of mechanisms underlying pathogenesis but may also help to
identify leads for therapeutic intervention. In ferrets, H5N1
infection induced severe disease that was associated with strong
expression of interferon response genes including the interferon-γ-induced
cytokine CXCL10. Treatment of H5N1-infected ferrets
with an antagonist of the CXCL10 receptor (CXCR3) reduced the
severity of the flu symptoms and the viral titers compared to the
controls [19], clearly demonstrating the potential of biological
response modifiers for the clinical management of viral infections.
Evasion of the host response and evolution of influenza virus are further
discussed in [20].
SARS-CoV
Coronaviruses (CoVs) primarily infect the upper respiratory and
gastrointestinal tract of mammals and birds. Five different
currently known CoVs infect humans and are believed to cause
a significant percentage of all common colds in human adults.
Surprisingly, recent studies revealed that approximately 6% of bats
sampled in China were positive for CoVs [21]. Subsequent
phylogenetic studies revealed that bat CoVs that resembled
human SARS-CoV clustered in a putative group comprising one
subgroup of bat CoVs and another of SARS-CoVs from humans
and other mammalian hosts. According to the current hypothesis,
SARS-CoV arose by recombination between two bat viruses.
Phylogenetic analysis of SARS-CoV isolates from animals indicates
that the resulting bat virus was transmitted first to palm civets
(Paguma larvata), a wild cat-like animal hunted for its meat, and
subsequently to humans at live animal markets in southern China
[22].
Genome analyses have provided evidence that genetic variation
in the spike gene of these viruses from civets is associated with
increased transmission of the virus [21]. In addition, species-to-
species variation in the sequence of the gene angiotensin-converting
enzyme 2 (ACE2), which encodes the SARS-CoV receptor, also
affects the efficiency by which the virus can enter cells [23]. By a
combination of phylogenetic and bioinformatics analyses, chimeric
gene design, and reverse genetics–aided generation of viruses that
encode spike proteins of diverse isolates, researchers have
reconstructed the events that led to the emergence of a virus able
to spread efficiently in humans [24]. Structural modeling predicted
that the SARS-CoV that caused the epidemic had an increased
affinity for both civet and human ACE2 receptors due to
adaptation (Figure 2). Subsequent functional genomics studies of
these viruses in diverse species provided further insight into the
role of specific host genes involved in the pathogenic response
[25,26]. The pathological changes observed in the lungs are
initiated by a disproportionate innate immune response, illustrated
by elevated levels of inflammatory cytokines and chemokines, such
as CXCL10 (IP-10), CCL2 (MCP-1), interleukin (IL)-6, IL-8, IL-12,
IL-1β, and interferon-γ [27].
Figure 1. Zoonotic transmission of influenza A virus. The hemagglutinin of avian influenza A viruses (blue) preferentially binds to oligosaccharides that terminate in sialic acid–α-2,3-Gal (red), whereas the hemagglutinin on human influenza A viruses (green) prefers oligosaccharides that terminate in sialic acid–α-2,6-Gal (orange). Fatal viral pneumonia in humans infected with the H5N1 subtype of avian influenza A viruses is likely due to the ability of these viruses to attach to and replicate in the lower respiratory tract cells, which have sialic acid–α-2,3-Gal terminated saccharides. The horizontal arrows indicate interspecies transmission, including the transmission from an avian or porcine reservoir into the human species. Image credit: Bart Haagmans, Erasmus MC. Original images (left to right, from top to bottom) by Roman Kohler, Alvesgaspar, Anton Holmquist, Joshua Lutz, and CDC. doi:10.1371/journal.ppat.1000557.g001
These clinical data were
confirmed experimentally by demonstrating that SARS-CoV
infection of diverse cell types induces a range of cytokines and
chemokines, thus providing a conceptual framework for SARS-
CoV pathogenesis. Host genome expression analyses of various
animal hosts and humans with different outcomes of infection
indicated differential activation of innate immune genes in, for
example, aged subjects compared to young subjects. Importantly,
treatment of aged macaques with pegylated interferon-α (i.e.,
interferon-α covalently modified with polyethylene glycol polymer
chains, to enhance its bioavailability) reduced SARS-CoV
replication and pathogenic responses [28]. Thus, host genomics
analysis may provide markers of pathogenesis and leads for
therapeutic intervention, as in this example of SARS-CoV
infection.
Figure 2. Zoonotic transmission of SARS-CoV. Genomic analyses provided evidence that genetic changes in the spike gene of SARS-CoV from bats (left) and civet cats (center) are essential for the animal-to-human transmission (horizontal arrows). Species-to-species genetic variation in the (thus far unidentified) viral receptor in bats and in the angiotensin converting enzyme 2 (ACE2) gene, encoding the SARS-CoV receptor in civet cats and humans, also affects the efficiency with which the virus can enter cells (vertical arrows). The SARS-CoV that caused the epidemic evolved a high affinity for both civet (center) and human (right) ACE2 receptors (indicated by the single diagonal and the right side vertical arrow). Image credit: Bart Haagmans, Erasmus MC. Original images (left to right) by Dodoni, Paul Hilton, and Hoang Dinh Nam. doi:10.1371/journal.ppat.1000557.g002
Challenges for the Future
Rapid identification of newly emerging viruses through the use
of genomics tools is one of the major challenges for the near future.
In addition, the identification of critical mutations that enable
viruses to spread efficiently, interact with different receptors, and
cause disease in diverse hosts through, for instance, enhanced viral
replication or circumvention of the innate and adaptive immune
responses, needs to be further expanded. Although microarray-assisted
transcriptional profiling can provide us with a wealth of
information regarding host genes and gene-interacting networks in
virus–host interactions, future research should focus on combining
data obtained in different experimental settings. Therefore, the
careful design of complementary sets of experiments using
different formats of virus–host interactions is absolutely needed
for successful genomics studies [29]. Special attention should be
paid to the comparative analysis of the host response in
diverse animal species. Thus far a limited number of laboratory
animal species has been studied, but the recent elucidation of the
genome of several other animal species will provide tools to
decipher the virus–host interactions in the more relevant natural
host. Recent developments in the sequencing of the RNA
transcriptome may aid this development. Ultimately, microarray
technology may also extend to genotyping of the human host by
SNP analysis to identify markers of host susceptibility and severity
of disease that can be used in tailor-made clinical management of
disease caused by emerging infections. Comparative analysis of
host responses to emerging viruses may also point toward a similar
dysregulated host response to a range of emerging virus infections,
enabling the rational design of multipotent biological response
modifiers to combat a variety of emerging viral infections. By
focusing on broad-acting intervention strategies rather than on the
discovery of a newly emerging pathogen that is not characterized
yet, we may be able to protect ourselves from several unexpectedly
emerging infections with the same clinical manifestations. This
approach may reduce the burden of disease and gain the time
needed to design preventive, pathogen-specific intervention
strategies such as antiviral therapy or vaccination. Clearly, for all
stages of combating emerging infections, from the early identifi-
cation of the pathogen to the development and design of vaccines,
application of sophisticated genomics tools is fundamental to
success.
References
1. Osterhaus A (2001) Catastrophes after crossing species barriers. Philos Trans Soc
Lond B Biol Sci 356: 791–793.
2. Kuiken T, Leighton FA, Fouchier RA, LeDuc JW, Peiris JS, et al. (2005) Public
health. Pathogen surveillance in animals. Science 309: 1680–1681.
3. Field HE, Mackenzie JS, Daszak P (2007) Henipaviruses: Emerging paramyxoviruses
associated with fruit bats. Curr Top Microbiol Immunol 315: 133–159.
4. Rivers TM (1937) Viruses and Koch’s postulates. J Bacteriol 33: 1–12.
5. Worobey M, Gemmel M, Teuwen DE, Haselkorn T, Kunstman K, et al. (2008)
Direct evidence of extensive diversity of HIV-1 in Kinshasa by 1960. Nature
455: 661–664.
6. Lederer S, Favre D, Walters KA, Proll S, Kanwar B, et al. (2009)
Transcriptional profiling in pathogenic and non-pathogenic SIV infections
reveals significant distinctions in kinetics and tissue compartmentalization. PLoS
Pathog 5: e1000296. doi:10.1371/journal.ppat.1000296.
7. Mandl JN, Barry AP, Vanderford TH, Kozyr N, Chavan R, et al. (2008)
Divergent TLR7 and TLR9 signaling and type I interferon production
distinguish pathogenic and nonpathogenic AIDS virus infections. Nat Med 14:
1077–1087.
8. Munster VJ, Baas C, Lexmond P, Waldenstrom J, Wallensten A, et al. (2007)
Spatial, temporal, and species variation in prevalence of influenza A viruses in
wild migratory birds. PLoS Pathog 3: e61. doi:10.1371/journal.ppat.0030061.
9. Keawcharoen J, van Riel D, van Amerongen G, Bestebroer T, Beyer WE, et al.
(2008) Wild ducks as long-distance vectors of highly pathogenic avian influenza
virus (H5N1). Emerg Infect Dis 4: 600–607.
10. Hatta M, Gao P, Halfmann P, Kawaoka Y (2001) Molecular basis for high
virulence of Hong Kong H5N1 influenza A viruses. Science 293: 1840–1842.
11. van Riel D, Munster VJ, de Wit E, Rimmelzwaan GF, Fouchier RA, et al. (2006)
H5N1 virus attachment to lower respiratory tract. Science 312: 399.
12. Yamada S, Suzuki Y, Suzuki T, Le MQ, Nidom CA, et al. (2006)
Haemagglutinin mutations responsible for the binding of H5N1 influenza A
viruses to human-type receptors. Nature 444: 378–382.
13. Smith DJ, Lapedes AS, de Jong JC, Bestebroer TM, Rimmelzwaan GF, et al.
(2004) Mapping the antigenic and genetic evolution of influenza virus. Science
305: 371–376.
14. Geiss GK, Salvatore M, Tumpey TM, Carter VS, Wang X, et al. (2002) Cellular
transcriptional profiling in influenza A virus-infected lung epithelial cells: The
role of the nonstructural NS1 protein in the evasion of the host innate defense
and its potential contribution to pandemic influenza. Proc Natl Acad Sci U S A
99: 10736–10741.
15. Kobasa D, Jones SM, Shinya K, Kash JC, Copps J, et al. (2007) Aberrant innate
immune response in lethal infection of macaques with the 1918 influenza virus.
Nature 445: 319–323.
16. Baskin CR, Bielefeldt-Ohmann H, Tumpey TM, Sabourin PJ, Long JP, et al.
(2009) Early and sustained innate immune response defines pathology and death
in nonhuman primates infected by highly pathogenic influenza virus. Proc Natl
Acad Sci U S A 106: 3455–3460.
17. Kash JC, Tumpey TM, Proll SC, Carter V, Perwitasari O, et al. (2006) Genomic
analysis of increased host immune and cell death responses induced by 1918
influenza virus. Nature 443: 578–581.
18. Kash JC, Basler CF, García-Sastre A, Carter V, Billharz R, et al. (2004) Global
host immune response: Pathogenesis and transcriptional profiling of type A
influenza viruses expressing the hemagglutinin and neuraminidase genes from
the 1918 pandemic virus. J Virol 78: 9499–9511.
19. Cameron CM, Cameron MJ, Bermejo-Martin JF, Ran L, Xu L, et al. (2008)
Gene expression analysis of host innate immune responses during lethal H5N1
infection in ferrets. J Virol 82: 11308–11317.
20. McHardy AC, Adams B (2009) The role of genomics in tracking the evolution
of influenza A virus. PLoS Pathog 5: e1000566. doi:10.1371/journal.ppat.1000566.
21. Tang XC, Zhang JX, Zhang SY, Wang P, Fan XH, et al. (2006) Prevalence and
genetic diversity of coronaviruses in bats from China. J Virol 80: 7481–7490.
22. Song HD, Tu CC, Zhang GW, Wang SY, Zheng K, et al. (2005) Cross-host
evolution of severe acute respiratory syndrome coronavirus in palm civet and
human. Proc Natl Acad Sci U S A 102: 2430–2435.
23. Li W, Zhang C, Sui J, Kuhn JH, Moore MJ, et al. (2005) Receptor and viral
determinants of SARS-coronavirus adaptation to human ACE2. EMBO J 24:
1634–1643.
24. Sheahan T, Rockx B, Donaldson E, Sims A, Pickles R, et al. (2008) Mechanisms
of zoonotic severe acute respiratory syndrome coronavirus host range expansion
in human airway epithelium. J Virol 82: 2274–2285.
25. Rockx B, Baas T, Zornetzer GA, Haagmans B, Sheahan T, et al. (2009) Early
upregulation of acute respiratory distress syndrome-associated cytokines
promotes lethal disease in an aged-mouse model of severe acute respiratory
syndrome coronavirus infection. J Virol 83: 7062–7074.
26. de Lang A, Baas T, Teal T, Leijten LM, Rain B, et al. (2007) Functional
genomics highlights differential induction of antiviral pathways in the lungs of
SARS-CoV-infected macaques. PLoS Pathog 3: e112. doi:10.1371/journal.
ppat.0030112.
27. Baas T, Roberts A, Teal TH, Vogel L, Chen J, et al. (2008) Genomic analysis
reveals age-dependent innate immune responses to severe acute respiratory
syndrome coronavirus. J Virol 82: 9465–9476.
28. Haagmans BL, Kuiken T, Martina BE, Fouchier RA, Rimmelzwaan GF, et al.
(2004) Pegylated interferon-alpha protects type 1 pneumocytes against SARS
coronavirus infection in macaques. Nat Med 10: 290–293.
29. Andeweg AC, Haagmans BL, Osterhaus ADME (2008) Virogenomics: The
virus–host interaction revisited. Curr Opin Microbiol 11: 1–6.
Review
The Role of Genomics in Tracking the Evolution of Influenza A Virus
Alice Carolyn McHardy1*, Ben Adams2
1 Computational Genomics and Epidemiology, Max Planck Institute for Informatics, Saarbruecken, Germany, 2 Department of Mathematical Sciences, University of Bath,
United Kingdom
Abstract: Influenza A virus causes annual epidemics and occasional pandemics of short-term respiratory infections associated with considerable morbidity and mortality. The pandemics occur when new human-transmissible viruses that have the major surface protein of influenza A viruses from other host species are introduced into the human population. Between such rare events, the evolution of influenza is shaped by antigenic drift: the accumulation of mutations that result in changes in exposed regions of the viral surface proteins. Antigenic drift makes the virus less susceptible to immediate neutralization by the immune system in individuals who have had a previous influenza infection or vaccination. Because of this immune escape, a biannual reevaluation of the vaccine composition is essential to maintain its effectiveness. The study of influenza genomes is key to this endeavor, increasing our understanding of antigenic drift and enhancing the accuracy of vaccine strain selection. Recent large-scale genome sequencing and antigenic typing have considerably improved our understanding of influenza evolution: epidemics around the globe are seeded from a reservoir in East-Southeast Asia with year-round prevalence of influenza viruses; antigenically similar strains predominate in epidemics worldwide for several years before being replaced by a new antigenic cluster of strains. Future in-depth studies of the influenza reservoir, along with large-scale data mining of genomic resources and the integration of epidemiological, genomic, and antigenic data, should enhance our understanding of antigenic drift and improve the detection and control of antigenically novel emerging strains.
Influenza is a single-stranded, negative-sense RNA virus that
causes acute respiratory illness in humans. In temperate regions,
winter influenza epidemics result in 250,000–500,000 deaths per
year; in tropical regions, the burden is similar [1,2]. Influenza
viruses of three genera or types (A, B, and C) circulate in the
human population. Influenza viruses of the types B and C evolve
slowly and circulate at low levels. Type A evolves rapidly and can
evade neutralization by antibodies in individuals who have been
previously infected with, or vaccinated against, the virus. As a
result it regularly causes large epidemics. Furthermore, distinct
reservoirs of influenza A exist in other mammals and in birds. Four
times in the last hundred years these reservoirs have provided
genetic material for novel viruses that have caused global
pandemics [3–8].
The genome of influenza A viruses comprises eight RNA
segments of 0.9–2.3 kb that together span approximately 13.5 kb
and encode 11 proteins [9]. Segment 4 encodes the major surface
glycoprotein called hemagglutinin (H), which is responsible for
attaching the virus to sialic acid residues on the host cell surface and
fusing the virus membrane envelope with the host cell membrane,
thus delivering the viral genome into the cell (Figure 1). Segment 6
encodes another surface glycoprotein called neuraminidase (N),
which cleaves terminal sialic acid residues from glycoproteins and
glycolipids on the host cell surface, thus releasing budding viral
particles from an infected cell [10]. Influenza A viruses are further
classified into distinct subtypes based on the genetic and antigenic
characteristics of these two surface glycoproteins. Sixteen hemag-
glutinin (H1–16) and nine neuraminidase subtypes (N1–9) are
known to exist, and they occur in various combinations in influenza
viruses endemic in aquatic birds [10,11]. Viruses with the subtype
composition H1N1 and H3N2 have been circulating in the human
population for several decades. Of these two subtypes, H3N2
evolves more rapidly, and has until recently caused the majority of
infections [1,12,13]. In the spring of 2009, however, a new H1N1
virus originating from swine influenza A viruses, and only distantly
related to the H1N1 already circulating, gained hold in the human
population. The emergence of this virus has initiated the first
influenza pandemic of the twenty-first century [7,14,15].
Hemagglutinin is about five times more abundant than
neuraminidase in the viral membrane and is the major target of
the host immune response [16–18]. Following exposure to the
virus, whether by infection or vaccination, the host immune system
acquires the capacity to produce neutralizing antibodies against
the viral surface glycoproteins. These antibodies participate in
clearing an infection and may protect an individual from future
infections for many decades [19]. Five exposed regions on the
surface of hemagglutinin, called epitope sites, are predominantly
recognized by such antibodies [16,17]. However, the human
subtypes of influenza A continuously evolve and acquire genetic
mutations that result in amino acid changes in the epitopes. These
changes reduce the protective effect of antibodies raised against
previously circulating viral variants. This ‘‘antigenic drift’’
necessitates frequent modification and readministration of the
influenza vaccine to ensure efficient protection (Box 1).
This article is part of the ‘‘Genomics of Emerging Infectious Disease’’ PLoS Journal collection (http://ploscollections.org/emerginginfectiousdisease/).
Citation: McHardy AC, Adams B (2009) The Role of Genomics in Tracking the Evolution of Influenza A Virus. PLoS Pathog 5(10): e1000566. doi:10.1371/journal.ppat.1000566
Editor: Marianne Manchester, The Scripps Research Institute, United States of America
Published October 26, 2009
Copyright: © 2009 McHardy et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Funding: The authors received no specific funding for this work.
Competing Interests: The authors have declared that no competing interests exist.
* E-mail: [email protected]
PLoS Pathogens | www.plospathogens.org 1 October 2009 | Volume 5 | Issue 10 | e1000566
To monitor for novel emerging strains, the World Health
Organization (WHO) maintains a global surveillance program. A
panel of experts meets twice a year to review antigenic, genetic, and
epidemiological data and decides on the vaccine composition for the
next winter season in the northern or southern hemisphere [20]. If
an emerging antigenic variant is detected and judged likely to
become predominant, an update of the vaccine strain is recom-
mended. This ‘‘predict and produce’’ approach mostly results in
efficient vaccines that substantially limit the morbidity and mortality
of seasonal epidemics [21]. The recommendation has to be made
almost a year before the season in which the vaccine is used,
however, because of the time required to produce and distribute a
new vaccine. Problems arise when an emerging variant is not
identified early enough for an update of the vaccine composition
[22–24]. Thus, gaining a detailed understanding of the evolution
and epidemiology of the virus is of the utmost importance, as it may
lead to earlier identification of novel emerging variants [20].
The development of high-throughput sequencing has recently
provided large datasets of high-quality, complete genome
sequences for viral isolates collected in a relatively unbiased
manner, regardless of virulence or other unusual characteristics
[9,25]. Analyses of the genome sequence data combined with
large-scale antigenic typing [26,27] have given insights into the
pattern of global spread, the genetic diversity during seasonal
epidemics, and the dynamics of subtype evolution. Influenza data
repositories such as the NCBI Influenza Virus Resource (http://
www.ncbi.nlm.nih.gov/genomes/FLU/FLU.html) [28] and the
Global Initiative on Sharing All Influenza Data (GISAID; http://
platform.gisaid.org/) database [29] make the genomic information
publicly available, together with epidemiological data for the
sequenced isolates. The GISAID model for data sharing requires
users to agree to collaborate with, and appropriately credit, all
data contributors. A notable success of this initiative has been the
contribution of countries, such as Indonesia and China, which
Figure 1. Schematic representation of an influenza A virion. Three proteins, hemagglutinin (HA, a trimer of three identical subunits), neuraminidase (NA, a tetramer of four identical subunits), and the M2 transmembrane proton channel (a tetramer of four identical subunits), are anchored in the viral membrane, which is composed of a lipid bilayer. The large, external domains of hemagglutinin and neuraminidase are the major targets for neutralizing antibodies of the host immune response. The M1 matrix protein is located below the membrane. The genome of the influenza A virus is composed of eight individual RNA segments (conventionally ordered by decreasing length, bottom row), which each encode one or two proteins. Inside the virion, the eight RNA segments are packaged in a complex with nucleoprotein (NP) and the viral polymerase complex, consisting of the PA, PB1, and PB2 proteins. doi:10.1371/journal.ppat.1000566.g001
have previously been reticent about placing data in the public
domain. The WHO also supports the endeavor of rapid
publication of all available sequences for influenza viruses and
there is hope that comprehensive submission to public databases
will soon become a reality [24,30]. In the future, mining these
resources and establishing a statistical framework based on
epidemiological, antigenic, and genetic information could provide
further insights into the rules that govern the emergence and
establishment of antigenically novel variants and improve the
potential for influenza prevention and control.
Host Immune Evasion by Antigenic Drift and Shift
Influenza viruses can rapidly acquire genetic diversity because
of high replication rates in infected hosts, an error-prone RNA
polymerase (which introduces mutations during genome replica-
tion), and segment reassortment (Figure 2). Mutations that change
amino acid residues appear significantly more often than silent
mutations in the evolution of the hemagglutinin gene of human
influenza A, particularly in the protein epitopes [31–34]. This
observation indicates that selection for antigenic change of the
virus is the driving force in the evolutionary ‘‘arms race’’ between
the virus and the immunity of the human population [35].
Reassortment of the eight genome segments between two distinct
viruses present simultaneously in a host cell can result in hybrid
viruses with genome segments from two different progenitors.
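The comparison of amino-acid-changing and silent mutations described above can be illustrated with a minimal Python sketch. It assumes two aligned, in-frame coding sequences written in uppercase DNA letters; the toy sequences are invented for illustration and are not real hemagglutinin data. The function only counts codons that differ with or without a change in the encoded amino acid. Formal tests of selection typically normalize such counts by the numbers of synonymous and nonsynonymous sites, a step that is omitted here.

```python
from itertools import product

# Standard genetic code, built in the conventional TCAG codon order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def count_substitutions(seq_a, seq_b):
    """Return (silent, replacement) counts of differing codons.

    Both sequences must be aligned, in frame, and of equal length.
    """
    if len(seq_a) != len(seq_b) or len(seq_a) % 3 != 0:
        raise ValueError("sequences must be aligned and a multiple of 3 long")
    silent = replacement = 0
    for i in range(0, len(seq_a), 3):
        codon_a, codon_b = seq_a[i:i + 3], seq_b[i:i + 3]
        if codon_a == codon_b:
            continue
        if CODON_TABLE[codon_a] == CODON_TABLE[codon_b]:
            silent += 1        # same amino acid: silent change
        else:
            replacement += 1   # different amino acid: replacement change
    return silent, replacement


# Hypothetical toy sequences (Met-Lys-Thr-Ile versus Met-Asn-Thr-Ile).
ancestor = "ATGAAGACTATC"
descendant = "ATGAACACCATC"
silent, replacement = count_substitutions(ancestor, descendant)
print(f"silent={silent} replacement={replacement}")
```

An excess of replacement over silent changes in the epitope codons, relative to the rest of the gene, is the kind of signal that the studies cited above interpret as selection for antigenic change.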
Antigenic mapping allows researchers to generate a quantita-
tive, two-dimensional representation of antigenic distances be-
tween genetically divergent strains [26]. This technique has
revealed that the relationship between antigenic change and
genetic change is nonlinear for the hemagglutinin of influenza A/
H3N2. The rate of genetic change of the virus is almost constant
over time, but some mutations exert a disproportionately large
effect on the antigenic type, whereas others are ‘‘hitchhikers’’ with
no phenotypic effect. Elucidating the effects of different mutations
at individual sites on the antigenic type will improve our
understanding of the overall genotype-to-phenotype mapping for
antigenic drift. Furthermore, the antigenic drift of H3N2 is not
continuous but punctuated: antigenically homogeneous clusters of
strains predominate for an average of 3 years before being
replaced by a new cluster. In accordance with the punctuated
nature of antigenic drift, periods of predominantly neutral
evolution alternate with periods of strong selection for antigenic
change [13,36]. Phylogenetic trees illustrating the evolution of the
hemagglutinin gene of H3N2 have a cactus-like shape with a
strong temporal structure in which the trunk represents the
succession of surviving viral lineages over time. Short side
branches indicate that most strains are driven to extinction and
that viral diversity at any given time is limited [31,34]. The
underlying causes of this punctuated antigenic drift and limited
viral diversity at a given point in time have been investigated in
phylodynamic modeling studies (Box 2).
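As a rough illustration of what an antigenic distance can look like in practice, the following Python sketch converts a small table of serological titers into distances. It rests on assumptions that are not stated in the text above: that the underlying data are hemagglutination inhibition (HI) titers, and that the distance between a strain and an antiserum is the log2 fold-drop from the highest titer observed for that antiserum. The strain names, serum names, and titers are hypothetical. Producing the two-dimensional map itself requires a further step (placing strains and sera so that map distances best match these target distances), which is not shown.

```python
import math


def antigenic_distances(titers):
    """Convert titers[strain][serum] into log2 fold-drop distances.

    One unit corresponds to a twofold dilution relative to the highest
    titer recorded for the same serum.
    """
    sera = {serum for row in titers.values() for serum in row}
    best = {
        serum: max(row[serum] for row in titers.values() if serum in row)
        for serum in sera
    }
    return {
        strain: {serum: math.log2(best[serum] / t) for serum, t in row.items()}
        for strain, row in titers.items()
    }


# Hypothetical HI titers for two strains against two antisera.
toy_titers = {
    "strainA": {"serumA": 1280, "serumB": 160},
    "strainB": {"serumA": 320, "serumB": 640},
}
for strain, row in antigenic_distances(toy_titers).items():
    print(strain, row)
```

In this toy table each strain lies two units (a fourfold titer drop) from the antiserum raised against the other strain, and zero units from its own.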
Major changes in antigenicity (antigenic shift) are associated
with the introduction of novel viruses into the human population
that have a hemagglutinin segment of an influenza A virus from
another host species and can be transmitted efficiently among
humans [5]. Such viruses may arise by segment reassortment
between a human influenza A virus and an influenza A virus from
another host species. Alternatively, an entire virus from another
host species may cross into the human population. The
appearance of such viruses is rare, as it requires the viral genes
encoded by the different segments to be compatible with each
other and the virus to be capable of replication and transmission in
the human population, which is also thought to be a polygenic trait
[6,7,10,37,38]. Antigenic shift can have grave consequences
because neutralizing antibodies against the viral surface proteins
offer limited or no cross-protection across subtypes. Cross-
protection can also be very limited between viruses of the same
subtype that have evolved independently in different hosts for long
periods of time [14]. Thus, a larger part of the population is
susceptible to infection with such viruses than to infection with
endemic viruses [10,14]. Antigenic shift caused three global
pandemics in the twentieth century: the 1918 H1N1 pandemic,
the 1957 H2N2 pandemic, and the 1968 H3N2 pandemic (reviewed
in [3–5,8]). The 1918 pandemic had the most devastating impact,
with an estimated 20–50 million deaths worldwide [39]. There is
some uncertainty concerning the origin of the 1918 virus due to
the lack of data from this time [6,40–43]. A recent phylogenetic
study suggests that this virus may have been generated by
reassortment of avian viruses with already circulating viruses in a
mammalian host such as human or swine [44]. The H2N2 virus
that caused the 1957 pandemic was a reassortant of five human
H1N1 segments and avian segments encoding the viral surface
proteins and the PB1 protein. Similarly, the reassortant H3N2
virus of the 1968 pandemic featured avian segments encoding
hemagglutinin and PB1. H3N2 still circulates today, together with
an H1N1 lineage introduced in 1977, which is similar to the H1N1
viruses circulating in the 1950s [4].
The first pandemic virus of the twenty-first century probably
entered the human population in January or February of 2009
[15]. Phylogenetic analyses of the viral genome determined that
the virus has a complex reassortment history with segments of
‘‘avian-like’’ Eurasian swine influenza A viruses (NA and M) that
were first observed in Eurasian swine in 1979, and of a triple
reassortant virus identified in North American swine after 1998.
The segments derived from the triple reassortant stem themselves
from human H3N2 (PB1), an avian influenza A virus (PA, PB2),
and classical North American swine influenza A viruses (HA, NP,
NS), which have a common ancestry with the 1918 H1N1 virus
[14,45]. Experiments have shown that the new H1N1 virus
replicates efficiently in mammalian model organisms such as
Box 1. Broadly Protective Vaccines
Current influenza vaccines are based on detergent-inactivated viruses. They elicit antibodies with a narrow range of protection that target predominantly the variable regions of the hemagglutinin protein. Accordingly, the seasonal influenza vaccine includes one strain with segments of the surface proteins for each of the A/H1N1, A/H3N2 and B viruses, and it is updated every 1–3 years to match the predominant variants of influenza. Research into vaccines that offer broader protection across diverse subtypes and antigenic drift variants is ongoing [21,59–61]. This research is particularly important with respect to the emergence of novel viruses with pandemic potential, such as the 2009 H1N1 virus. In such an event, the time period between the detection of the virus and the onset of a pandemic is too short to produce a specific vaccine for immediate vaccination of the population. Work in this area is focused on developing vaccines that elicit antibodies against conserved viral components, such as certain regions of hemagglutinin, neuraminidase, and the M2 proton channel in the viral membrane [60]. Other types of vaccines based on live attenuated viruses or plasmid DNA expression vectors, or supplemented with adjuvants, show promise in inducing a more broadly protective immune response [61].
ferrets, mice, and cynomolgus macaques and is likely to be capable
of long-term circulation in the human population, particularly in
the event of further adaptive changes through mutation or
reassortment [46–48]. The novel H1N1 appears, so far, to cause
relatively mild human infections in comparison to other viruses
such as the highly pathogenic H5N1 avian influenza A viruses
that, since 1997, have repeatedly been transmitted to humans and
caused severe disease but so far have not been capable of sustained
transmission between humans. The emergence of a novel
pandemic virus, which may have been circulating undetected in
swine for a decade [14,45], has highlighted the need for increased
genomic surveillance of the viral populations in mammalian hosts
such as swine. These hosts could be a vessel for mammalian
adaptation of avian viruses, either by reassortment with human or
swine viruses or through adaptive changes [8], but have been
monitored less intensely than avian populations. The latest
emergence of a pandemic H1N1 virus has also underscored the
vital importance of further research into the molecular factors that
determine the host range and capacity for sustained human-to-
human transmission of influenza A viruses.
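The reassortment history of the 2009 H1N1 virus described above can be summarized as a simple lookup table. The Python sketch below encodes the segment origins exactly as given in the text [14,45]; the short labels for the source lineages and the helper function name are choices made here for illustration only.

```python
from collections import defaultdict

# Segment -> (source lineage, immediate swine progenitor), as described
# in the text. The labels are informal shorthand, not official names.
SEGMENT_ORIGIN = {
    "PB2": ("avian", "North American triple reassortant"),
    "PB1": ("human H3N2", "North American triple reassortant"),
    "PA": ("avian", "North American triple reassortant"),
    "HA": ("classical swine", "North American triple reassortant"),
    "NP": ("classical swine", "North American triple reassortant"),
    "NA": ("Eurasian swine", "Eurasian avian-like swine"),
    "M": ("Eurasian swine", "Eurasian avian-like swine"),
    "NS": ("classical swine", "North American triple reassortant"),
}


def group_by_origin(origins):
    """Group segment names by their source lineage."""
    groups = defaultdict(list)
    for segment, (source, _progenitor) in origins.items():
        groups[source].append(segment)
    return dict(groups)


for source, segments in group_by_origin(SEGMENT_ORIGIN).items():
    print(f"{source}: {', '.join(segments)}")
```

Grouping the eight segments this way makes the four distinct source lineages of the pandemic virus explicit.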
Reassortment in Subtype Evolution
Whole-genome studies have revealed that segment reassortment
between different viruses of the same subtype is an important
mechanism in the evolution of human-adapted subtypes and
generates extensive genome-wide diversity [34,36,49–51]. Periodic
selective sweeps caused by a novel antigenic drift variant rising to
predominance reduce the genomic diversity of the circulating viral
population, either genome-wide or for the hemagglutinin segment
only [12]. Reassortment results in substantial differences in the
evolutionary histories of individual segments. However, similarities
in the histories of some segments indicate that besides the antigenic
characteristics of hemagglutinin, the genomic context and compat-
ibility of certain segment combinations might be an important
contributor to viral fitness [12,51]. A case in point is the
antigenically novel ‘‘Fujian’’ strain which became predominant in
the 2003–2004 season, following a reassortment event that placed a
hemagglutinin segment from a lineage that had been circulating at
low levels for several years into a new genomic context [49]. The
importance of other segments in the adaptive evolution of the virus
is further supported by the observation that a number of other
segments, including the one encoding neuraminidase, evolve at
similar rates to the segment encoding hemagglutinin [12].
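A minimal Python sketch of how intra-subtype reassortants might be flagged is given below. It assumes that each of the eight segments of every isolate has already been assigned to a lineage, for example from per-segment phylogenetic trees; the studies cited above compare the tree histories of segments directly, which is more involved. Isolate names and lineage labels are hypothetical.

```python
from collections import Counter

SEGMENTS = ("PB2", "PB1", "PA", "HA", "NP", "NA", "M", "NS")


def minority_segments(calls):
    """Return the segments whose lineage differs from the isolate's
    most common lineage. An empty list means no reassortment signal."""
    counts = Counter(calls[s] for s in SEGMENTS)
    majority, _ = counts.most_common(1)[0]
    return [s for s in SEGMENTS if calls[s] != majority]


# Hypothetical lineage calls for two isolates.
isolates = {
    "isolate_X": {s: "lineage1" for s in SEGMENTS},
    "isolate_Y": {**{s: "lineage1" for s in SEGMENTS}, "HA": "lineage2"},
}
for name, calls in isolates.items():
    odd = minority_segments(calls)
    status = f"possible reassortant ({', '.join(odd)})" if odd else "no signal"
    print(f"{name}: {status}")
```

Here the second toy isolate carries a hemagglutinin segment from a different lineage than its other seven segments, loosely mirroring the kind of event described for the ‘‘Fujian’’ strain.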
Geographic Spread
Genomic analysis has led to profound insights into the global
patterns of circulation and evolution of influenza A. Over the
course of seasonal epidemics in temperate regions, little evidence
has been found for selection for amino acid change and adaptive
evolution in the antigenic regions of the surface proteins [36].
There is, however, substantial genetic diversity due to multiple
introductions of distinct strains, wide spatial spread, and frequent
Figure 2. Generation of genetic diversity and antigenic drift in the evolution of human influenza A viruses. Blue and yellow viruses depict two antigenically similar strains of the same subtype circulating in the human population. The genetic diversity of the circulating viral population increases through mutation and reassortment. Single white arrows indicate relationships between ancestral and descendant viruses. White marks on the segments indicate neutral mutations and red marks indicate mutations that affect the antigenic regions of the surface proteins. Incoming pairs of orange arrows indicate the generation of reassortants with segments from two different ancestral viruses. As these viruses continue to circulate, immunity against them builds up in the host population, represented here by the narrowing of the bottleneck. In parallel, viruses with mutations affecting the antigenic regions of the surface proteins accumulate in the viral population. At some point a novel antigenic drift variant, indicated by a red colored virus, which is less affected by immunity in the human population, is generated. This variant is able to cause widespread infection and founds a new cluster of antigenically similar strains. doi:10.1371/journal.ppat.1000566.g002
segment reassortment in seasonal epidemics [9,12,36,49,50]. The
viral population circulating in one season does not directly seed the
epidemic in the following one. Instead, gene flow and viral spread
are global, with similar strains appearing in northern and southern
hemisphere epidemics across several seasons. There is a global
reservoir of viral diversity from which seasonal epidemics in
temperate regions are seeded [12,27,52]. This reservoir is located
in East-Southeast Asia, where a region-wide network of temporally
overlapping epidemics maintains infection incidence throughout
the year [27]. Novel strains appear in this region on average 6–9
months before they emerge in Oceania, Europe, and North
America and 12–18 months before they reach South America.
Challenges for the Future
A key objective for research into the antigenic drift of influenza A is
to improve the accuracy of vaccine strain choice, in particular for
seasons preceding the establishment of novel antigenic drift variants.
More intensive surveillance and sampling, particularly in East-
Southeast Asia, could facilitate the early detection of novel emerging
drift variants and alleviate problems related to the time required for
vaccine production. A better understanding of the evolutionary and
epidemiological rules governing antigenic drift, viral fitness, the role
of the source region, and establishment of predominance would be
particularly helpful for the selection of vaccine strains when
considerable variation among antigenically novel strains is observed
and it is unclear which, if any, will become predominant. Such
insights are likely to come both from phylodynamic modeling studies
and by mining genomic resources for genome-wide properties
associated with viral fitness and predominance. Some molecular
properties of hemagglutinin with predictive value for this task have
already been identified [53–56], such as the number of changes at
sites under positive selection or in the most extensively altered
epitope, although the sites under selection might change over time
[26]. It is notable that the lack of antigenic information for sequenced
viral isolates in public repositories currently restricts the direct analysis
of genetic determinants in antigenic drift [24]. If the World Health
Organization were to establish similar policies for the deposition of
antigenic information into public databases as exist for sequence data,
this could create a valuable resource for research in this area. As
existing databases grow, new statistical and computational techniques
are being developed for interpretation of these large-scale, popula-
tion-level genomic datasets in combination with epidemiological and
phenotypic information [57]. Ultimately, the expert analysis of the
WHO in the detection and control of antigenically novel emerging
strains could be extensively supported by the development of a
suitable predictive framework based on statistical learning that takes
into consideration the population-level phylodynamics of antigenic
change [57,58]. Such a framework could utilize epidemiological,
genomic, and antigenic information and detailed knowledge of the
genetic and epidemiological characteristics of antigenic drift to assess
the likelihood of strains rising to predominance.
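One of the molecular properties mentioned above, the number of changes in the most extensively altered epitope, can be sketched in a few lines of Python. The example assumes aligned amino acid sequences of equal length for a reference (for instance, vaccine) strain and a candidate strain. The sequences and the epitope site lists are invented placeholders and do not correspond to the real epitope definitions of H3 hemagglutinin.

```python
def epitope_changes(reference, candidate, epitopes):
    """Count amino acid differences per epitope.

    `epitopes` maps an epitope name to a list of 0-based positions in
    the aligned sequences.
    """
    if len(reference) != len(candidate):
        raise ValueError("sequences must be aligned to the same length")
    return {
        name: sum(reference[p] != candidate[p] for p in positions)
        for name, positions in epitopes.items()
    }


def most_altered_epitope(reference, candidate, epitopes):
    """Return (epitope name, number of changes) for the epitope with
    the most differences."""
    counts = epitope_changes(reference, candidate, epitopes)
    name = max(counts, key=counts.get)
    return name, counts[name]


# Hypothetical toy data: sequences and site lists are illustrative only.
reference = "MKTIIALSYIFCLV"
candidate = "MKTLIALSHIFCQV"
toy_epitopes = {"A": [2, 3, 4], "B": [8, 9, 12], "C": [0, 13]}

print(epitope_changes(reference, candidate, toy_epitopes))
print(most_altered_epitope(reference, candidate, toy_epitopes))
```

A predictive framework of the kind envisaged above would combine features like this with epidemiological and antigenic data rather than rely on any single count.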
Acknowledgments
We thank Linus Roune for his help creating the figures.
References
1. WHO (2003) Fact sheet number 211. Available: http://www.who.int/
mediacentre/factsheets/fs211/en/. Accessed 13 August 2009.
2. Viboud C, Alonso WJ, Simonsen L (2006) Influenza in tropical regions. PLoS
Med 3: e89. doi:10.1371/journal.pmed.0030089.
3. Palese P (2004) Influenza: old and new threats. Nat Med 10: S82–87.
4. Kilbourne ED (2006) Influenza pandemics of the 20th century. Emerg Infect Dis
12: 9–14.
5. Cox NJ, Subbarao K (2000) Global epidemiology of influenza: past and present.
Annu Rev Med 51: 407–421.
6. Morens DM, Taubenberger JK, Fauci AS (2009) The persistent legacy of the
1918 influenza virus. N Engl J Med 361: 225–229.
7. Neumann G, Noda T, Kawaoka Y (2009) Emergence and pandemic potential of
swine-origin H1N1 influenza virus. Nature 459: 931–939.
8. Horimoto T, Kawaoka Y (2005) Influenza: Lessons from past pandemics,
warnings from current incidents. Nat Rev Microbiol 3: 591–600.
9. Ghedin E, Sengamalay NA, Shumway M, Zaborsky J, Feldblyum T, et al. (2005)
Large-scale sequencing of human influenza reveals the dynamic nature of viral
genome evolution. Nature 437: 1162–1166.
10. Webster RG, Bean WJ, Gorman OT, Chambers TM, Kawaoka Y (1992)
Evolution and ecology of influenza A viruses. Microbiol Rev 56: 152–179.
11. Fouchier RA, Munster V, Wallensten A, Bestebroer TM, Herfst S, et al. (2005)
Characterization of a novel influenza A virus hemagglutinin subtype (H16)
obtained from black-headed gulls. J Virol 79: 2814–2822.
12. Rambaut A, Pybus OG, Nelson MI, Viboud C, Taubenberger JK, et al. (2008)
The genomic and epidemiological dynamics of human influenza A virus. Nature
453: 615–619.
13. Wolf YI, Viboud C, Holmes EC, Koonin EV, Lipman DJ (2006) Long intervals
of stasis punctuated by bursts of positive selection in the seasonal evolution of
influenza A virus. Biol Direct 1: 34.
14. Garten RJ, Davis CT, Russell CA, Shu B, Lindstrom S, et al. (2009) Antigenic
and genetic characteristics of swine-origin 2009 A(H1N1) influenza viruses
circulating in humans. Science 325: 197–201.
15. Fraser C, Donnelly CA, Cauchemez S, Hanage WP, Van Kerkhove MD, et al.
(2009) Pandemic potential of a strain of influenza A (H1N1): Early findings.
Science 324: 1557–1561.
16. Wiley DC, Wilson IA, Skehel JJ (1981) Structural identification of the antibody-
binding sites of Hong Kong influenza haemagglutinin and their involvement in
antigenic variation. Nature 289: 373–378.
17. Wilson IA, Cox NJ (1990) Structural basis of immune recognition of influenza
virus hemagglutinin. Annu Rev Immunol 8: 737–771.
Box 2. Modeling Antigenic Evolution
There is a long history of the use of mathematical models to study epidemiological and evolutionary systems [63]. For rapidly evolving RNA viruses such as influenza the dynamics of these systems are densely interwoven, and recent work has sought to develop unified ‘‘phylodynamic’’ models to examine the processes underlying the observed epidemiological and evolutionary patterns (reviewed in [35]). A better understanding of the mechanisms driving viral evolution will enhance our capacity to accurately identify novel emerging strains. For influenza, phylodynamic models have been developed to probe the complex processes relating to viral persistence in the human population, antigenic turnover, and the limited genetic diversity at any given point in time. The first models predicted that diversity increases exponentially unless long-term, partial cross-immunity between strains is supplemented by temporary broad immunity that lasts for several months and protects against all infections, regardless of the genetic or antigenic similarity of strains [64,65]. Subsequently, it has been proposed that a genotype-to-phenotype mapping defined by neutral networks underlies influenza evolution [66]. A neutral network is a set of genotypes linked by single mutations that all map to the same phenotype, in this case the antigenic characteristics of a virus. Hence, genetic divergence is not accompanied by antigenic divergence as long as the genotype remains in the same network. In certain genetic contexts, however, mutations can move a genotype onto an adjacent network, resulting in a significant change in the antigenic phenotype. Incorporating this evolutionary framework into an epidemiological model leads to both epidemiological and evolutionary patterns characteristic of human influenza A/H3N2.
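The neutral network idea can be made concrete with a toy Python sketch. Everything in it is hypothetical: genotypes are three-letter strings over a two-letter alphabet, and the genotype-to-phenotype map is hand-written so that the same mutation is neutral in one genetic context but changes the antigenic cluster in another. It is not the model of [66], only an illustration of the concept.

```python
def neighbors(genotype, alphabet="AB"):
    """Yield all genotypes reachable by a single mutation."""
    for i, current in enumerate(genotype):
        for alt in alphabet:
            if alt != current:
                yield genotype[:i] + alt + genotype[i + 1:]


# Hypothetical map from genotype to antigenic phenotype (cluster label).
# Each cluster forms one neutral network connected by single mutations.
PHENOTYPE = {
    "AAA": "cluster1", "AAB": "cluster1", "ABA": "cluster1",
    "ABB": "cluster1", "BAA": "cluster1",
    "BAB": "cluster2", "BBA": "cluster2", "BBB": "cluster2",
}


def classify_mutations(genotype):
    """Split single-mutation neighbors into neutral ones (same cluster)
    and antigenic ones (different cluster)."""
    neutral, antigenic = [], []
    for mutant in neighbors(genotype):
        same = PHENOTYPE[mutant] == PHENOTYPE[genotype]
        (neutral if same else antigenic).append(mutant)
    return neutral, antigenic


for g in ("AAA", "AAB"):
    neutral, antigenic = classify_mutations(g)
    print(f"{g}: neutral={neutral} antigenic={antigenic}")
```

In this toy map, changing the first site from A to B is neutral in the context AAA but moves the genotype AAB onto the adjacent network, which is the context dependence described above.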
18. Wilson IA, Skehel JJ, Wiley DC (1981) Structure of the haemagglutinin
membrane glycoprotein of influenza virus at 3 Å resolution. Nature 289: 366–373.
19. Yu X, Tsibane T, McGraw PA, House FS, Keefer CJ, et al. (2008) Neutralizing
antibodies derived from the B cells of 1918 influenza pandemic survivors. Nature
455: 532–536.
20. Russell CA, Jones TC, Barr IG, Cox NJ, Garten RJ, et al. (2008) Influenza
vaccine strain selection and recent studies on the global migration of seasonal
influenza viruses. Vaccine 26(Suppl 4): D31–34.
21. Karlsson Hedestam GB, Fouchier RA, Phogat S, Burton DR, Sodroski J, et al.
(2008) The challenges of eliciting neutralizing antibodies to HIV-1 and to
influenza virus. Nat Rev Microbiol 6: 143–155.
22. de Jong JC, Beyer WE, Palache AM, Rimmelzwaan GF, Osterhaus AD (2000)
Mismatch between the 1997/1998 influenza vaccine and the major epidemic
A(H3N2) virus strain as the cause of an inadequate vaccine-induced antibody
response to this strain in the elderly. J Med Virol 61: 94–99.
23. CDC (2004) Preliminary assessment of the effectiveness of the 2003–04
inactivated influenza vaccine—Colorado, December 2003. MMWR Morb
Mortal Wkly Rep 53: 8–11.
24. Salzberg S (2008) The contents of the syringe. Nature 454: 160–161.
25. Obenauer JC, Denson J, Mehta PK, Su X, Mukatira S, et al. (2006) Large-scale
sequence analysis of avian influenza isolates. Science 311: 1576–1580.
26. Smith DJ, Lapedes AS, de Jong JC, Bestebroer TM, Rimmelzwaan GF, et al.
(2004) Mapping the antigenic and genetic evolution of influenza virus. Science
305: 371–376.
27. Russell CA, Jones TC, Barr IG, Cox NJ, Garten RJ, et al. (2008) The global
circulation of seasonal influenza A (H3N2) viruses. Science 320: 340–346.
28. Bao Y, Bolotov P, Dernovoy D, Kiryutin B, Zaslavsky L, et al. (2008) The
influenza virus resource at the National Center for Biotechnology Information.
J Virol 82: 596–601.
29. Enserink M (2007) Data sharing. New Swiss influenza database to test promises
of access. Science 315: 923.
30. Bogner P, Capua I, Lipman DJ, Cox NJ, et al. (2006) A global initiative on
sharing avian flu data. Nature 442: 981.
31. Fitch WM, Leiter JM, Li XQ, Palese P (1991) Positive Darwinian evolution in
human influenza A viruses. Proc Natl Acad Sci U S A 88: 4270–4274.
32. Fitch WM, Bush RM, Bender CA, Cox NJ (1997) Long term trends in the
evolution of H(3) HA1 human influenza type A. Proc Natl Acad Sci U S A 94:
7712–7718.
33. Bush RM, Fitch WM, Bender CA, Cox NJ (1999) Positive selection on the H3
hemagglutinin gene of human influenza virus A. Mol Biol Evol 16: 1457–1465.
34. Nelson MI, Holmes EC (2007) The evolution of epidemic influenza. Nat Rev
Genet 8: 196–205.
35. Grenfell BT, Pybus OG, Gog JR, Wood JL, Daly JM, et al. (2004) Unifying the
epidemiological and evolutionary dynamics of pathogens. Science 303: 327–332.
36. Nelson MI, Simonsen L, Viboud C, Miller MA, Taylor J, et al. (2006) Stochastic
processes are key determinants of short-term evolution in influenza A virus.
PLoS Pathog 2: e125. doi:10.1371/journal.ppat.0020125.
37. Lowen AC, Palese P (2007) Influenza virus transmission: Basic science and
implications for the use of antiviral drugs during a pandemic. Infect Disord Drug
Targets 7: 318–328.
38. Kuiken T, Holmes EC, McCauley J, Rimmelzwaan GF, Williams CS, et al.
(2006) Host species barriers to influenza virus infections. Science 312: 394–397.
39. Johnson NP, Mueller J (2002) Updating the accounts: Global mortality of the
1918–1920 ‘‘Spanish’’ influenza pandemic. Bull Hist Med 76: 105–115.
40. Taubenberger JK, Reid AH, Lourens RM, Wang R, Jin G, et al. (2005)
Characterization of the 1918 influenza virus polymerase genes. Nature 437:
889–893.
41. Reid AH, Taubenberger JK, Fanning TG (2004) Evidence of an absence: The
genetic origins of the 1918 pandemic influenza virus. Nat Rev Microbiol 2:
909–914.
42. Antonovics J, Hood ME, Baker CH (2006) Molecular virology: Was the 1918 flu
avian in origin? Nature 440: E9; discussion E9–10.
43. Taubenberger JK (2006) The origin and virulence of the 1918 ‘‘Spanish’’
influenza virus. Proc Am Philos Soc 150: 86–112.
44. Smith GJ, Bahl J, Vijaykrishna D, Zhang J, Poon LL, et al. (2009) Dating the
emergence of pandemic influenza viruses. Proc Natl Acad Sci U S A 106:
11709–11712.
45. Smith GJ, Vijaykrishna D, Bahl J, Lycett SJ, Worobey M, et al. (2009) Origins
and evolutionary genomics of the 2009 swine-origin H1N1 influenza A
epidemic. Nature 459: 1122–1125.
46. Maines TR, Jayaraman A, Belser JA, Wadford DA, Pappas C, et al. (2009)
Transmission and pathogenesis of swine-origin 2009 A(H1N1) influenza viruses
in ferrets and mice. Science 325: 484–487.
47. Munster VJ, de Wit E, van den Brand JM, Herfst S, Schrauwen EJ, et al. (2009)
Pathogenesis and transmission of swine-origin 2009 A(H1N1) influenza virus in
ferrets. Science 325: 481–483.
48. Itoh Y, Shinya K, Kiso M, Watanabe T, Sakoda Y, et al. (2009) In vitro and in
vivo characterization of new swine-origin H1N1 influenza viruses. Nature; E-pub
ahead of print. doi:10.1038/nature08260.
49. Holmes EC, Ghedin E, Miller N, Taylor J, Bao Y, et al. (2005) Whole-genome
analysis of human influenza A virus reveals multiple persistent lineages and
reassortment among recent H3N2 viruses. PLoS Biol 3: e300. doi:10.1371/
journal.pbio.0030300.
50. Nelson MI, Edelman L, Spiro DJ, Boyne AR, Bera J, et al. (2008) Molecular
epidemiology of A/H3N2 and A/H1N1 influenza virus during a single epidemic
season in the United States. PLoS Pathog 4: e1000133.
doi:10.1371/journal.ppat.1000133.
51. Nelson MI, Viboud C, Simonsen L, Bennett RT, Griesemer SB, et al. (2008)
Multiple reassortment events in the evolutionary history of H1N1 influenza A
virus since 1918. PLoS Pathog 4: e1000012. doi:10.1371/journal.ppat.1000012.
52. Nelson MI, Simonsen L, Viboud C, Miller MA, Holmes EC (2007) Phylogenetic
analysis reveals the global migration of seasonal influenza A viruses. PLoS
Pathog 3: e131. doi:10.1371/journal.ppat.0030131.
53. Fitch WM, Bush RM, Bender CA, Subbarao K, Cox NJ (2000) The Wilhelmine
E. Key 1999 Invitational lecture. Predicting the evolution of human influenza A.
J Hered 91: 183–185.
54. Gupta V, Earl DJ, Deem MW (2006) Quantifying influenza vaccine efficacy and
antigenic distance. Vaccine 24: 3881–3888.
55. Blackburne BP, Hay AJ, Goldstein RA (2008) Changing selective pressure
during antigenic changes in human influenza H3. PLoS Pathog 4: e1000058.
doi:10.1371/journal.ppat.1000058.
56. Kryazhimskiy S, Bazykin GA, Plotkin J, Dushoff J (2008) Directionality in the
evolution of influenza A haemagglutinin. Proc Biol Sci 275: 2455–2464.
57. Pybus OG, Rambaut A (2009) Modelling: Evolutionary analysis of the dynamics
of viral infectious disease. Nat Rev Genet 10: 540–550.
58. Bishop CM (2006) Pattern recognition and machine learning. In: Jordan M,
Kleinberg J, Schoellkopf B, eds. Singapore: Springer.
59. Sui J, Hwang WC, Perez S, Wei G, Aird D, et al. (2009) Structural and
functional bases for broad-spectrum neutralization of avian and human
influenza A viruses. Nat Struct Mol Biol 16: 265–273.
60. Gerhard W, Mozdzanowska K, Zharikova D (2006) Prospects for universal
influenza virus vaccine. Emerg Infect Dis 12: 569–574.
61. Carrat F, Flahault A (2007) Influenza vaccine: The challenge of antigenic drift.
Vaccine 25: 6852–6862.
62. Fisher RA (1999) The genetical theory of natural selection. Oxford (UK):
Oxford University Press. pp 318.
63. Ross R (1910) The prevention of malaria. New York: E.P. Dutton. pp 669.
64. Ferguson NM, Galvani AP, Bush RM (2003) Ecological and immunological
determinants of influenza evolution. Nature 422: 428–433.
65. Tria F, Lassig M, Peliti L, Franz S (2005) A minimal stochastic model for
influenza evolution. J Stat Mech;doi:10.1088/1742-5468/2005/07/P07008.
66. Koelle K, Cobey S, Grenfell B, Pascual M (2006) Epochal evolution shapes the
phylodynamics of interpandemic influenza A (H3N2) in humans. Science 314:
1898–1903.
PLoS Pathogens | www.plospathogens.org 6 October 2009 | Volume 5 | Issue 10 | e1000566
Review
The Past and Future of Tuberculosis Research
Inaki Comas, Sebastien Gagneux*
Division of Mycobacterial Research, MRC National Institute for Medical Research, London, United Kingdom
Abstract: Renewed efforts in tuberculosis (TB) research have led to important new insights into the biology and epidemiology of this devastating disease. Yet, in the face of the modern epidemics of HIV/AIDS, diabetes, and multidrug resistance (all of which contribute to susceptibility to TB), global control of the disease will remain a formidable challenge for years to come. New high-throughput genomics technologies are already contributing to studies of TB’s epidemiology, comparative genomics, evolution, and host–pathogen interaction. We argue here, however, that new multidisciplinary approaches, especially the integration of epidemiology with systems biology in what we call “systems epidemiology,” will be required to eliminate TB.
Introduction
Tuberculosis (TB) remains an important public health problem
[1]. With close to 10 million new cases per year, and a pool of two
billion latently infected individuals, control efforts are struggling in
many parts of the world (Figure 1). Nevertheless, the renewed
interest in research and improved funding for TB give reasons for
optimism. Recently, the Stop TB Partnership, a network of
concerned governments, organizations, and donors led by the
WHO (http://www.stoptb.org/stop_tb_initiative/), outlined a
global plan to halve TB prevalence and mortality by 2015 and
eliminate the disease as a public health problem by 2050 [2].
Attaining these goals will depend on both strong government
commitment and increased interdisciplinary research and devel-
opment. As existing diagnostics, drugs, and vaccines will be
insufficient to achieve these objectives, a substantial effort in both
basic science and epidemiology will be necessary to develop better
tools and strategies to control TB [3]. Here we review the recent
history of TB research and some of the latest insights into the
evolutionary history of the disease. We then discuss ways in which
we could benefit from a more comprehensive systems approach to
control TB in the future.
Recent History of the Field
TB is caused by several species of gram-positive bacteria known
as tubercle bacilli or Mycobacterium tuberculosis complex (MTBC).
MTBC includes obligate human pathogens such as Mycobacterium
tuberculosis and Mycobacterium africanum, as well as organisms adapted
to various other species of mammal. In the developed world, TB
incidence declined steadily during the second half of the 20th
century and so funds available for research and control of TB
decreased substantially during that time [4]. When TB started to
reemerge in the early 1990s, fuelled by the growing pandemic of
HIV/AIDS (Box 1), scientists and public health officials were
caught off-guard; billions of dollars of emergency funds were
necessary to control TB outbreaks [5]. Moreover, long-term
neglect of basic TB research and product development meant that
global TB control relied on a 100-year-old diagnostic method (i.e.
sputum smear microscopy) of poor sensitivity, an 80-year-old and
largely ineffective vaccine (Bacille Calmette-Guérin [BCG]), and
just a few drugs that were decades old (streptomycin, rifampicin,
isoniazid, ethambutol, pyrazinamide) [3]. Tragically, these are the
tools still in use today in most parts of the world where TB remains
one of the most important public health problems (Figure 1).
In addition to the lack of appropriate tools to control TB
globally, much about the disease was unknown in the early 1990s
and many dogmas were guiding the field at the time. These
included the view that differences in the clinical manifestation of
TB were primarily driven by host variables and the environment
as opposed to bacterial factors, a notion reinforced by early DNA
sequencing studies that reported very limited genetic diversity in
MTBC compared with other bacterial pathogens [6]. According
to other dogmas, TB was mainly a consequence of reactivation of
latent infections rather than ongoing disease transmission, and
mixed infections and exogenous reinfections with different strains
were very unlikely.
The development of molecular techniques to differentiate
between strains of MTBC made it possible to readdress some of
these points. One of these methods, a DNA fingerprinting protocol
based on the Mycobacterium insertion sequence IS6110, quickly
evolved into the first international gold standard for genotyping of
MTBC [7]. It also became a key component of pragmatic public
health efforts, such as detecting disease outbreaks and ongoing TB
transmission [8], and allowed differentiation between patients who
relapsed due to treatment failure and those reinfected with a
different strain [9]. This latter finding demonstrated for the first
time that previous exposure to MTBC does not protect against
subsequent exogenous reinfection and TB disease, which is a
phenomenon with implications for vaccine design. Many other
new insights were gained through these molecular epidemiological
studies [10], which, for the most part, were performed in wealthy
countries; corresponding data from most high-burden areas
remained limited because of poor infrastructure and lack of
funding.
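The logic of separating relapse from reinfection by fingerprint comparison can be sketched in a few lines of Python. This is a minimal illustration, not the standardized IS6110 protocol: it assumes each isolate's pattern has already been reduced to a list of band sizes, and it treats two isolates as the same strain only when the band sets match exactly (real analyses match bands within a tolerance). The function names and the band sizes are hypothetical.

```python
from collections import defaultdict


def same_fingerprint(bands_a, bands_b):
    """Treat two isolates as the same strain when their band sets match exactly."""
    return frozenset(bands_a) == frozenset(bands_b)


def classify_recurrence(first_episode, second_episode):
    """Label a recurrent case from the fingerprints of its two isolates."""
    if same_fingerprint(first_episode, second_episode):
        return "relapse"       # same pattern in both episodes
    return "reinfection"       # a different strain caused the second episode


def cluster_isolates(fingerprints):
    """Group isolate IDs that share an identical fingerprint."""
    groups = defaultdict(list)
    for isolate_id, bands in fingerprints.items():
        groups[frozenset(bands)].append(isolate_id)
    # Clusters of two or more isolates hint at recent transmission.
    return [sorted(ids) for ids in groups.values() if len(ids) > 1]


# Hypothetical band sizes (in kb), for illustration only.
example = {"p1": [1.2, 3.4], "p2": [3.4, 1.2], "p3": [2.2]}
print(cluster_isolates(example))
print(classify_recurrence([1.2, 3.4], [1.2, 4.1]))
```

An identical pattern is only suggestive of relapse, since a patient could in principle be reinfected with the same circulating strain.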
Citation: Comas I, Gagneux S (2009) The Past and Future of Tuberculosis Research. PLoS Pathog 5(10): e1000600. doi:10.1371/journal.ppat.1000600
Editor: Marianne Manchester, The Scripps Research Institute, United States of America
Published October 26, 2009
Copyright: © 2009 Comas, Gagneux. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Funding: Work in our laboratory is supported by the Medical Research Council, UK, and the US National Institutes of Health grants HHSN266200700022C and AI034238. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Competing Interests: The authors have declared that no competing interests exist.
* E-mail: [email protected]
This article is part of the “Genomics of Emerging Infectious Disease” PLoS Journal collection (http://ploscollections.org/emerginginfectiousdisease/).
PLoS Pathogens | www.plospathogens.org 1 October 2009 | Volume 5 | Issue 10 | e1000600
Routine genotyping of MTBC for public health purposes also
revived discussions about the role of pathogen variation in the
outcome of infection and disease. Some strains of MTBC
appeared over-represented in particular patient populations,
which suggested that strain diversity may have epidemiological
implications. The completion of the first whole genome sequence
of M. tuberculosis in 1998 [11] and the development of DNA
microarrays offered a new opportunity to address this question by
interrogating the entire genome of multiple clinical strains of
MTBC. These comparative genomics studies revealed that
genomic deletions, also known as large sequence polymorphisms
(LSPs), are an important source of genome plasticity in MTBC
[12]. Furthermore, statistical analyses of patient data suggested
possible associations between strain genomic content and disease
severity in humans [13]. Clinical phenotypes in TB are difficult to
standardize, however, and whether MTBC genotype plays a
meaningful role in TB severity remains controversial [14].
Comparative genomics of MTBC also yielded interesting insights
into the evolution and geographic distribution of the organism.
Because MTBC has essentially no detectable horizontal gene transfer
[15,16], LSPs can be used as phylogenetic markers to trace the
evolutionary relationships of different strain families. Following such
an approach, studies have shown that humans did not, as previously
believed, acquire MTBC from animals during the initiation of animal
domestication; rather, the human- and animal-adapted members of
MTBC share a common ancestor, which might have infected humans
even before the Neolithic transition [17,18]. LSPs also allowed
researchers to define several discrete strain lineages within the human-
adapted members of MTBC, which are associated with different
human populations and geographical regions (Figures 2 and 3)
[15,19,20]. Because of the lack of horizontal gene exchange in
MTBC, phylogenetic trees derived using various molecular markers
define the same phylogenetic groupings [21], and several studies based
on single nucleotide polymorphisms (SNPs) and other molecular
markers have gathered additional support for the highly
phylogeographical population structure of MTBC [22–25].
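Why the absence of horizontal gene transfer makes deletions usable as phylogenetic markers can be illustrated with a small consistency check. Under strictly clonal descent with irreversible deletions, the sets of strains carrying any two deletions must be either nested or disjoint; any other overlap would imply recombination or an independent recurrence of the same deletion. The sketch below is a hypothetical illustration of that test, not the method used in the cited studies, and the strain and deletion labels are invented.

```python
from itertools import combinations


def carriers(profiles, lsp):
    """Return the set of strains whose genome carries the given deletion."""
    return {strain for strain, deletions in profiles.items() if lsp in deletions}


def is_tree_compatible(profiles):
    """True if every pair of deletions has nested or disjoint carrier sets."""
    all_lsps = sorted(set().union(*profiles.values()))
    for lsp_a, lsp_b in combinations(all_lsps, 2):
        set_a, set_b = carriers(profiles, lsp_a), carriers(profiles, lsp_b)
        overlap = set_a & set_b
        if overlap and not (set_a <= set_b or set_b <= set_a):
            return False  # partial overlap cannot be placed on a single tree
    return True


# Hypothetical deletion profiles: strain -> set of deletions observed.
clonal = {
    "s1": {"LSP_A", "LSP_B"},   # LSP_B arose within the LSP_A lineage
    "s2": {"LSP_A"},
    "s3": {"LSP_C"},
    "s4": set(),
}
print(is_tree_compatible(clonal))
```

When the check passes, each deletion defines a clade, which is what allows LSPs to serve as lineage markers.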
Ancient History of the Pathogen
Although LSPs have proven very useful for defining different
lineages within MTBC, these markers do not reflect actual genetic
distances, and the mode of molecular evolution in MTBC cannot
be easily inferred from them [21]. By contrast, DNA sequence-
based methods can provide important clues about the evolutionary
forces shaping bacterial populations. Multilocus sequence typing
(MLST), in which fragments of seven structural genes are
Figure 1. The global incidence of TB. The number of new TB cases per 100,000 population for the year 2007 according to WHO estimates (adapted from [1]).
doi:10.1371/journal.ppat.1000600.g001
Box 1. The Influence of Modern Epidemics on TB Incidence
HIV/AIDS and diabetes are important comorbidities that dramatically increase the susceptibility to TB. The synergy between TB and HIV/AIDS is a particular problem in sub-Saharan Africa, while the impact of diabetes on TB is increasing in many rapidly growing world economies; it may already be a more important risk factor for TB than HIV/AIDS in places like India and Mexico. The emergence of multidrug-resistant strains represents an additional threat to global TB control. The strong association between HIV/AIDS and drug-resistant TB has been well established, but whether similar interactions exist between drug-resistant TB and diabetes needs to be explored further.
sequenced for each strain [26], has been used very successfully to
define the genetic population structure of many bacterial species
[27]. Because of the low degree of sequence polymorphisms in
MTBC, however, standard MLST is uninformative [28]. A recent
study of MTBC extended the traditional MLST scheme by
sequencing 89 complete genes in 108 strains, covering 1.5% of the
genome of each strain [29]. Phylogenetic analysis of this extended
multilocus sequence dataset resulted in a tree that was highly
congruent with that generated previously using LSPs (Figure 3).
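The bookkeeping behind MLST, and the reason it loses resolution in a genetically monomorphic organism, can be sketched as follows. Each distinct sequence at a locus receives an allele number, and each distinct combination of allele numbers defines a sequence type (ST). This is a minimal sketch under simplifying assumptions: allele numbers are assigned in order of first appearance rather than looked up in a curated database, and the locus names and sequences are invented.

```python
def allele_profiles(strains, loci):
    """Map each strain to a tuple of allele numbers, one per locus."""
    allele_ids = {locus: {} for locus in loci}
    profiles = {}
    for name, sequences in strains.items():
        profile = []
        for locus in loci:
            table = allele_ids[locus]
            seq = sequences[locus]
            if seq not in table:
                table[seq] = len(table) + 1  # next unused allele number
            profile.append(table[seq])
        profiles[name] = tuple(profile)
    return profiles


def sequence_types(profiles):
    """Assign one sequence type number per distinct allele profile."""
    st_ids = {}
    result = {}
    for name, profile in profiles.items():
        if profile not in st_ids:
            st_ids[profile] = len(st_ids) + 1
        result[name] = st_ids[profile]
    return result


# Hypothetical two-locus example; a real scheme uses about seven loci.
loci = ["geneA", "geneB"]
strains = {
    "s1": {"geneA": "ACGT", "geneB": "GGCC"},
    "s2": {"geneA": "ACGT", "geneB": "GGCC"},
    "s3": {"geneA": "ACGA", "geneB": "GGCC"},
}
print(sequence_types(allele_profiles(strains, loci)))
```

If every strain carries identical fragments at all loci, every strain falls into the same ST, which is why sequencing many more genes was needed to resolve MTBC.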
The new sequence-based data also revealed that the MTBC
strains that are adapted to various animal species represent just a
subset of the global genetic diversity of MTBC that affects different
human populations [29]. Furthermore, by comparing the
geographical distribution of various human MTBC strains with
their position on the phylogenetic tree, it became evident that
MTBC most likely originated in Africa and that human MTBC
originally spread out of Africa together with ancient human
migrations along land routes. This view is further supported by the
fact that the so-called “smooth tubercle bacilli,” which are the
closest relatives of the human MTBC, are highly restricted to East
Africa [30]. The multilocus sequence data reported by Hershberg
et al. [29] further suggested a scenario in which the three
“modern” lineages of MTBC (purple, blue, and red in Figure 3)
seeded Eurasia, which experienced dramatic human population
expansion in more recent times. These three lineages then spread
globally out of Europe, India, and China, respectively, accompa-
nying waves of colonization, trade and conquest. In contrast to the
ancient human migrations, however, this more recent dispersal of
human MTBC occurred primarily along water routes [29].
The availability of comprehensive DNA sequence data has also
allowed researchers to address questions about the molecular
evolution of MTBC. In-depth population genetic analyses by
Hershberg et al. highlight the fact that purifying selection against
slightly deleterious mutations in this organism is strongly reduced
compared to other bacteria [29]. As a consequence, nonsynon-
ymous SNPs tend to accumulate in MTBC, leading to a high ratio
of nonsynonymous to synonymous mutations (also known as dN/
dS). The authors hypothesized that the high dN/dS in MTBC
compared to most other bacteria might indicate increased random
genetic drift associated with serial population bottlenecks during
past human migrations and patient-to-patient transmission. If
confirmed, this would indicate that “chance,” not just natural
selection, has been driving the evolution of MTBC. Although these
kinds of fundamental evolutionary questions are often underap-
preciated by clinicians and biomedical researchers, studying the
evolution of a pathogen ultimately allows for better epidemiolog-
ical predictions by contributing to our understanding of basic
biology, particularly with respect to antibiotic resistance.
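To make the dN/dS ratio concrete, the following Python sketch compares two aligned coding sequences codon by codon. It is a simplified variant of the Nei–Gojobori counting approach and not the analysis used in [29]: it assumes in-frame, gap-free, uppercase sequences, uses the standard genetic code, counts changes to stop codons as nonsynonymous, skips codons that differ at more than one position, and reports uncorrected proportions (pN and pS) without a multiple-hit correction. All names are hypothetical.

```python
from itertools import product

BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code, with codons enumerated in TCAG order.
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def synonymous_sites(codon):
    """Expected number of synonymous sites in one codon (between 0 and 3)."""
    count = 0
    for pos in range(3):
        for base in BASES:
            if base == codon[pos]:
                continue
            mutant = codon[:pos] + base + codon[pos + 1:]
            if CODON_TABLE[mutant] == CODON_TABLE[codon]:
                count += 1
    return count / 3.0


def dn_ds(seq1, seq2):
    """Return (pN, pS, pN/pS) for two aligned, in-frame coding sequences."""
    syn_sites = nonsyn_sites = syn_diffs = nonsyn_diffs = 0.0
    for start in range(0, min(len(seq1), len(seq2)) - 2, 3):
        c1, c2 = seq1[start:start + 3], seq2[start:start + 3]
        n_diff = sum(a != b for a, b in zip(c1, c2))
        if n_diff > 1:
            continue  # simplification: ignore codons with several changes
        s = (synonymous_sites(c1) + synonymous_sites(c2)) / 2
        syn_sites += s
        nonsyn_sites += 3 - s
        if n_diff == 1:
            if CODON_TABLE[c1] == CODON_TABLE[c2]:
                syn_diffs += 1
            else:
                nonsyn_diffs += 1
    p_n = nonsyn_diffs / nonsyn_sites if nonsyn_sites else float("nan")
    p_s = syn_diffs / syn_sites if syn_sites else float("nan")
    ratio = p_n / p_s if p_s and p_s == p_s else float("nan")
    return p_n, p_s, ratio


# Invented sequences: one synonymous (GCT>GCC) and one nonsynonymous (GGT>GAT) change.
print(dn_ds("ATGGCTAAAGGT", "ATGGCCAAAGAT"))
```

Normalizing by the number of available sites matters because far more nonsynonymous than synonymous changes are possible by chance; a raw count ratio would overstate dN/dS.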
A Vision for the Future
Thanks to recent increases in research funding for TB [4],
substantial progress has been made in our understanding of the
basic biology and epidemiology of the disease. Unfortunately, this
increased knowledge has not yet had any noticeable impact on the
current global trends of TB (Figure 1). While TB incidence
appears to have stabilized in many countries, the total number of
cases is still increasing as a function of global human population
growth [1]. Of particular concern are the ongoing epidemics of
multidrug-resistant TB [31], as well as the synergies between TB
and the ongoing epidemics of HIV/AIDS and other comorbidities
such as diabetes (Box 1).
Figure 2. Global distribution of the six main lineages of human MTBC. Each dot represents the most frequent lineage(s) circulating in a country. Colours correspond to the lineages defined in Figure 3 (adapted from [20]).
doi:10.1371/journal.ppat.1000600.g002
As our understanding of TB improves, we would like to be able to
make better predictions about the future trajectory of the disease
and to develop new tools to control the disease better and ultimately
reverse global trends. For this to be feasible, TB epidemiology needs
to evolve into a more predictive, interdisciplinary endeavour, a
discipline we might refer to as “systems epidemiology” (Figure 4).
Systems biology is already a rapidly emerging field, in which cycles
of mathematical modelling and experiments using various large-
scale “-omics” datasets are integrated in an iterative manner [32].
Novel biological processes are being discovered through these
systems approaches, which might not have been possible using more
traditional methods [33–35].
Last year, Young et al. argued that systems biology approaches
will be necessary to elucidate some of the key aspects of host–
pathogen interactions in TB [36] and to develop new drugs,
vaccines, and biomarkers to evaluate new interventions [3]. For
example, according to another dogma in the TB field, latent TB
infections are caused by physiologically dormant bacilli and can
thus be differentiated from active disease where MTBC is actively
growing and dividing [37]. In reality, however, the phenomenon
of TB latency most likely reflects a whole spectrum of responses to
TB infection, involving phenotypically distinct bacterial subpop-
ulations and spanning various degrees of bacterial burden and
associated host immune responses [38]. We agree with Young
et al. [36] that TB latency and similar biological complexities will
only be adequately addressed using systems approaches, and we
argue further that to comprehend the current TB epidemic as a
whole, and to better predict its future trajectory, a complementary
systems epidemiology approach will be necessary (Figure 4).
Mathematical models are already being used extensively to study
the epidemiology of TB and to guide control policies [39]. Recent
applications have shown that socioeconomic factors are key drivers
Figure 3. The global phylogeny of Mycobacterium tuberculosis complex (MTBC). The phylogenetic relationships between various human- and animal-adapted strains and species are largely consistent when defined by using either (A) large sequence polymorphisms (LSPs) or (B) single nucleotide polymorphisms (SNPs) identified by sequencing 89 genes in 108 MTBC strains. Numbers inside the squares in (A) refer to specific lineage-defining LSPs. Colors indicate congruent lineages (adapted from [20] and [29]).
doi:10.1371/journal.ppat.1000600.g003
of today’s TB epidemic [40]. In addition, much theoretical emphasis
has been placed on trying to define the impact that drug resistance
will have on the global TB epidemic [41]. Some of this theoretical
work has become more complex by incorporating new biological
insights obtained empirically and through targeted experimental
studies. Early theoretical studies on the spread of drug-resistant
MTBC were based on the assumption that all drug-resistant
bacteria had an inherent fitness disadvantage compared to drug-
susceptible strains [42]; however, as is becoming clear from
experimental and molecular epidemiological investigation, substan-
tial heterogeneity exists with respect to the reproductive success of
drug-resistant strains [43–46]. Newer mathematical models account
for some of this heterogeneity [47–49].
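The difference between assuming a uniform fitness cost and allowing heterogeneity can be shown with a deliberately simple competition model. This is an illustrative caricature and not one of the published models [42,47–49]: strain frequencies are updated once per transmission generation in proportion to a fixed relative fitness, with no acquired resistance, treatment, or host structure. The strain labels and fitness values are assumptions chosen for illustration.

```python
def next_generation(freqs, fitness):
    """Reweight strain frequencies by relative fitness and renormalize."""
    weighted = {strain: freqs[strain] * fitness[strain] for strain in freqs}
    total = sum(weighted.values())
    return {strain: w / total for strain, w in weighted.items()}


def simulate(freqs, fitness, generations):
    """Return the list of frequency dictionaries, starting with the input."""
    history = [dict(freqs)]
    for _ in range(generations):
        freqs = next_generation(freqs, fitness)
        history.append(freqs)
    return history


# Assumed starting frequencies and relative fitness values (illustrative only).
start = {"susceptible": 0.90, "resistant_costly": 0.05, "resistant_no_cost": 0.05}
fitness = {"susceptible": 1.0, "resistant_costly": 0.7, "resistant_no_cost": 1.0}

final = simulate(start, fitness, generations=20)[-1]
for strain, freq in final.items():
    print(f"{strain}: {freq:.4f}")
```

In this sketch the resistant strain with a fitness cost dwindles, whereas the resistant strain without a cost holds its share, which is the qualitative reason heterogeneity changes the predicted trajectory of drug resistance.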
One could imagine an expansion of such mathematical
approaches—much as systems biology operates—in which epide-
miological modelling is combined with more comprehensive
biological data related to the host, the pathogen, and their
interactions (Figure 4). Of course, environmental and sociological
data would also need to be considered [40]. As mathematical
models become more finely tuned, they could in turn inform
future experimental work to test some of the specific predictions.
The genomics revolution now offers the opportunity to study host–
pathogen interactions at an unprecedented depth. To be able to
make sense out of the current and upcoming deluge of -omics data,
however, scientists will have to rely on a mathematically and
statistically robust analytical framework. Ideally, some of these
theoretical approaches will be able to accommodate increasingly
diverse sets of data in order to capture the various biological,
environmental, and social aspects of TB.
Among the newly emerging technologies, we believe that next-
generation DNA sequencing will play an important role in
improving our understanding of TB [50]. Whole-genome
sequencing could potentially become the new gold standard for
strain typing in routine molecular epidemiology [51]. For host
genetics and TB susceptibility, too, approaches based on de novo DNA
sequencing could have advantages over traditional SNP
typing [52]. For example, many of the human populations
carrying the largest proportion of the global TB burden have
not been sufficiently characterised genetically (Figure 1) [53,54],
and screening for currently limited human SNP collections might
have little relevance for these populations [55]. Furthermore,
comprehensive DNA sequencing of TB patients and controls in
various human populations could help unveil rare but biologically
relevant mutations [56]. Another approach increasingly being
Figure 4. A systems epidemiology approach to TB research. The spread of TB is influenced by social and biological factors. On the one hand, the new discipline of systems biology integrates approaches that address the host, the pathogen, and interactions between the two. On the other hand, epidemiology addresses the burden of the disease and the social, economic, and ecological causes of its frequency and distribution. There is little crosstalk between these two disciplines at the moment. “Systems epidemiology” is an attempt to take into account the interactions between these various fields of research.
doi:10.1371/journal.ppat.1000600.g004
used to study both the host and the pathogen is sequence-based
transcriptomics, in which gene expression is measured by
genome-wide sequencing of RNA transcripts, a method referred to as
RNA-seq [57]. One of the advantages of this approach over
existing microarray-based methods is that changes in the
expression of noncoding RNAs and other novel transcripts can
be easily detected. RNA-seq is particularly useful for genome-wide
studies of small regulatory RNAs, as such studies are more difficult
to perform using standard DNA microarrays. Recent studies, for
example, have reported a role for small regulatory RNAs in M.
tuberculosis [58], and there is little doubt more regulatory RNAs will
soon be identified by RNA-seq [57].
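How sequencing reads become an expression measurement can be illustrated with one common normalization, transcripts per million (TPM). The text does not specify a normalization, so this choice is an assumption: read counts are divided by transcript length and then rescaled so that the values sum to one million. The gene names, counts, and lengths below are invented.

```python
def tpm(read_counts, lengths):
    """Convert raw read counts to transcripts per million (TPM)."""
    # Reads per base corrects for longer transcripts attracting more reads.
    rates = {gene: read_counts[gene] / lengths[gene] for gene in read_counts}
    total = sum(rates.values())
    if total == 0:
        return {gene: 0.0 for gene in rates}
    return {gene: rate / total * 1e6 for gene, rate in rates.items()}


# Hypothetical data: a 1,000-base mRNA and a 100-base small regulatory RNA.
counts = {"geneA": 100, "sRNA1": 10}
lengths = {"geneA": 1000, "sRNA1": 100}
print(tpm(counts, lengths))
```

In this example the short regulatory RNA yields ten times fewer reads than the mRNA yet has the same length-normalized abundance, which shows why raw counts alone would understate the expression of small transcripts.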
Challenges for the Future
Advances in TB research are hampered by the fact that MTBC
is a Biosafety Level 3 pathogen with a long generation time,
making it slow and complex to culture. Moreover, TB is a chronic
disease that can develop over many years, and is characterised by
extended periods of latency during which MTBC cannot be
isolated from infected individuals. All of these factors complicate
and prolong the development of new interventions and their
assessment in clinical trials. As we have already mentioned, the
field has been marked by a number of dogmas that, in some cases,
might have contributed to the slow progress in TB research. New
insights are now questioning some of these views, but at the same
time, new opinions could well evolve into new dogmas. For
example, we and others have spent much of our scientific careers
seeking convincing evidence for the role of MTBC strain diversity
in human disease. Although some pieces of evidence have recently
started to emerge [59–61], the subject needs more work. One of
the problems has been that the macrophage and mouse infection
models used in these studies relied on poorly characterised strains,
and finding relevant links to human disease has been all but
impossible [14,21].
In TB control, too, potential new dogmas might emerge to limit
future progress. A strong T cell–derived interferon gamma (IFN-γ)
response appears to be crucial for the immunological control of
TB, and many MTBC antigens have been identified based on
their capacity to elicit IFN-γ responses in TB patients or their
infected contacts [62]. Some of these antigens are being developed
into new TB diagnostics and vaccines, but the potential impact of
MTBC diversity on immune responses is not generally being
considered [21]. A recent study in The Gambia showed that IFN-γ
responses to one of the key MTBC antigens differed in an MTBC
lineage–specific manner [63]. Developing a universally effective
vaccine might be the only way to eliminate TB in the future [3].
This is particularly true given the large reservoir of latently
infected individuals in the world, which would be impossible to
eliminate through prophylactic drug treatment. Considering that
natural TB infection does not protect against exogenous
reinfection and disease, however, mimicking natural infection
using attenuated strains or a cocktail of traditional IFN-γ-inducing
antigens might not necessarily be the most promising vaccine
strategy. Indeed, the largely unsuccessful implementation of BCG
vaccination might serve as a warning [64].
Acknowledgments
We thank Peter Small and Douglas Young for comments on the
manuscript.
References
1. World Health Organization (2009) Global tuberculosis control - surveillance,
planning, financing. Geneva, Switzerland: WHO.
2. Stop TB Partnership (2006) The global plan to stop TB 2006–2015. Geneva:
WHO.
3. Young DB, Perkins MD, Duncan K, Barry CE (2008) Confronting the scientific
obstacles to global control of tuberculosis. J Clin Invest 118: 1255–1265.
4. Kaufmann SH, Parida SK (2007) Changing funding patterns in tuberculosis.
Nat Med 13: 299–303.
5. Frieden TR, Fujiwara PI, Washko RM, Hamburg MA (1995) Tuberculosis in
New York City–turning the tide. N Engl J Med 333: 229–233.
6. Sreevatsan S, Pan X, Stockbauer KE, Connell ND, Kreiswirth BN, et al. (1997)
Restricted structural gene polymorphism in the Mycobacterium tuberculosis complex
indicates evolutionarily recent global dissemination. Proc Natl Acad Sci U S A
94: 9869–9874.
7. van Embden JD, Cave MD, Crawford JT, Dale JW, Eisenach KD, et al. (1993)
Strain identification of Mycobacterium tuberculosis by DNA fingerprinting:
recommendations for a standardized methodology. J Clin Microbiol 31:
406–409.
8. Small PM, Hopewell PC, Singh SP, Paz A, Parsonnet J, et al. (1994) The
epidemiology of tuberculosis in San Francisco. A population-based study using
conventional and molecular methods. N Engl J Med 330: 1703–1709.
9. Small PM, Shafer RW, Hopewell PC, Singh SP, Murphy MJ, et al. (1993)
Exogenous reinfection with multidrug-resistant Mycobacterium tuberculosis in
patients with advanced HIV infection. N Engl J Med 328: 1137–1144.
10. Mathema B, Kurepina NE, Bifani PJ, Kreiswirth BN (2006) Molecular
epidemiology of tuberculosis: current insights. Clin Microbiol Rev 19: 658–685.
11. Cole ST, Brosch R, Parkhill J, Garnier T, Churcher C, et al. (1998) Deciphering
the biology of Mycobacterium tuberculosis from the complete genome sequence.
Nature 393: 537–544.
12. Tsolaki AG, Hirsh AE, DeRiemer K, Enciso JA, Wong MZ, et al. (2004)
Functional and evolutionary genomics of Mycobacterium tuberculosis: insights from
genomic deletions in 100 strains. Proc Natl Acad Sci U S A 101: 4865–4870.
13. Kato-Maeda M, Rhee JT, Gingeras TR, Salamon H, Drenkow J, et al. (2001)
Comparing genomes within the species Mycobacterium tuberculosis. Genome Res
11: 547–554.
14. Nicol MP, Wilkinson RJ (2008) The clinical consequences of strain diversity in
Mycobacterium tuberculosis. Trans R Soc Trop Med Hyg 102: 955–65.
15. Hirsh AE, Tsolaki AG, DeRiemer K, Feldman MW, Small PM (2004) Stable
association between strains of Mycobacterium tuberculosis and their human host
populations. Proc Natl Acad Sci U S A 101: 4871–4876.
16. Supply P, Warren RM, Banuls AL, Lesjean S, Van Der Spuy GD, et al. (2003)
Linkage disequilibrium between minisatellite loci supports clonal evolution of
Mycobacterium tuberculosis in a high tuberculosis incidence area. Mol Microbiol 47:
529–538.
17. Brosch R, Gordon SV, Marmiesse M, Brodin P, Buchrieser C, et al. (2002) A
new evolutionary scenario for the Mycobacterium tuberculosis complex. Proc Natl
Acad Sci U S A 99: 3684–3689.
18. Mostowy S, Cousins D, Brinkman J, Aranaz A, Behr MA (2002) Genomic
deletions suggest a phylogeny for the Mycobacterium tuberculosis complex. J Infect
Dis 186: 74–80.
19. Reed MB, Pichler VK, McIntosh F, Mattia A, Fallow A, et al. (2009) Major
Mycobacterium tuberculosis lineages associate with patient country of origin. J Clin
Microbiol 47: 1119–28.
20. Gagneux S, Deriemer K, Van T, Kato-Maeda M, de Jong BC, et al. (2006)
Variable host-pathogen compatibility in Mycobacterium tuberculosis. Proc Natl Acad
Sci U S A 103: 2869–2873.
21. Gagneux S, Small PM (2007) Global phylogeography of Mycobacterium tuberculosis
and implications for tuberculosis product development. Lancet Infect Dis 7:
328–337.
22. Baker L, Brown T, Maiden MC, Drobniewski F (2004) Silent nucleotide
polymorphisms and a phylogeny for Mycobacterium tuberculosis. Emerg Infect Dis
10: 1568–1577.
23. Gutacker MM, Mathema B, Soini H, Shashkina E, Kreiswirth BN, et al. (2006)
Single-nucleotide polymorphism-based population genetic analysis of Mycobacterium
tuberculosis strains from 4 geographic sites. J Infect Dis 193: 121–128.
24. Filliol I, Motiwala AS, Cavatore M, Qi W, Hernando Hazbon M, et al. (2006)
Global phylogeny of Mycobacterium tuberculosis based on single nucleotide
polymorphism (SNP) analysis: insights into tuberculosis evolution, phylogenetic
accuracy of other DNA fingerprinting systems, and recommendations for a
minimal standard SNP set. J Bacteriol 188: 759–772.
25. Brudey K, Driscoll JR, Rigouts L, Prodinger WM, Gori A, et al. (2006)
Mycobacterium tuberculosis complex genetic diversity: mining the fourth international
spoligotyping database (SpolDB4) for classification, population genetics
and epidemiology. BMC Microbiol 6: 23.
26. Maiden MC, Bygraves JA, Feil E, Morelli G, Russell JE, et al. (1998) Multilocus
sequence typing: a portable approach to the identification of clones within
populations of pathogenic microorganisms. Proc Natl Acad Sci U S A 95:
3140–3145.
27. Maiden MC (2006) Multilocus sequence typing of bacteria. Annu Rev Microbiol
60: 561–588.
28. Achtman M (2008) Evolution, population structure, and phylogeography of
genetically monomorphic bacterial pathogens. Annu Rev Microbiol 62: 53–70.
29. Hershberg R, Lipatov M, Small PM, Sheffer H, Niemann S, et al. (2008) High
functional diversity in Mycobacterium tuberculosis driven by genetic drift and human
demography. PLoS Biol 6: e311.
30. Gutierrez C, Brisse S, Brosch R, Fabre M, Omais B, et al. (2005) Ancient origin
and gene mosaicism of the progenitor of Mycobacterium tuberculosis. PLoS
Pathogens 1: 1–7.
31. World Health Organization (2008) Anti-tuberculosis drug resistance in the world
report no. 4. Geneva, Switzerland: WHO.
32. Zak DE, Aderem A (2009) Systems biology of innate immunity. Immunol Rev
227: 264–282.
33. Gilchrist M, Thorsson V, Li B, Rust AG, Korb M, et al. (2006) Systems biology
approaches identify ATF3 as a negative regulator of Toll-like receptor 4. Nature
441: 173–178.
34. Querec TD, Akondy RS, Lee EK, Cao W, Nakaya HI, et al. (2009) Systems
biology approach predicts immunogenicity of the yellow fever vaccine in
humans. Nat Immunol 10: 116–125.
35. Stuart LM, Boulais J, Charriere GM, Hennessy EJ, Brunet S, et al. (2007) A
systems biology analysis of the Drosophila phagosome. Nature 445: 95–101.
36. Young D, Stark J, Kirschner D (2008) Systems biology of persistent infection:
tuberculosis as a case study. Nat Rev Microbiol 6: 520–8.
37. Gill WP, Harik NS, Whiddon MR, Liao RP, Mittler JE, et al. (2009) A
replication clock for Mycobacterium tuberculosis. Nat Med 15: 211–4.
38. Young DB, Gideon HP, Wilkinson RJ (2009) Eliminating latent tuberculosis.
Trends Microbiol 17: 183–188.
39. Cohen T, Dye C, Colijn C, Murray M (2009) Mathematical models of the
epidemiology and control of drug-resistant TB. Expert Rev Resp Med in press.
40. Lonnroth K, Jaramillo E, Williams BG, Dye C, Raviglione M (2009) Drivers of
tuberculosis epidemics: The role of risk factors and social determinants. Soc Sci
Med 68: 2240–6.
41. Borrell S, Gagneux S (2009) Infectiousness, reproductive fitness, and evolution of
drug-resistant Mycobacterium tuberculosis. Int J Tuberc Lung Dis in press.
42. Dye C, Williams BG, Espinal MA, Raviglione MC (2002) Erasing the world’s
slow stain: strategies to beat multidrug-resistant tuberculosis. Science 295:
2042–2046.
43. Bottger EC, Springer B, Pletschette M, Sander P (1998) Fitness of antibiotic-
resistant microorganisms and compensatory mutations. Nat Med 4: 1343–1344.
44. Gagneux S, Burgos MV, DeRiemer K, Encisco A, Munoz S, et al. (2006) Impact
of bacterial genetics on the transmission of isoniazid-resistant Mycobacterium
tuberculosis. PLoS Pathog 2: e61.
45. Gagneux S, Long CD, Small PM, Van T, Schoolnik GK, et al. (2006) The
competitive cost of antibiotic resistance in Mycobacterium tuberculosis. Science 312:
1944–1946.
46. van Soolingen D, de Haas PE, van Doorn HR, Kuijper E, Rinder H, et al.
(2000) Mutations at amino acid position 315 of the katG gene are associated with
high-level resistance to isoniazid, other drug resistance, and successful
transmission of Mycobacterium tuberculosis in the Netherlands. J Infect Dis 182:
1788–1790.
47. Cohen T, Murray M (2004) Modeling epidemics of multidrug-resistant M.
tuberculosis of heterogeneous fitness. Nat Med 10: 1117–1121.
48. Blower SM, Chou T (2004) Modeling the emergence of the ‘hot zones’:
tuberculosis and the amplification dynamics of drug resistance. Nat Med 10:
1111–1116.
49. Dye C (2009) Doomsday postponed? Preventing and reversing epidemics of
drug-resistant tuberculosis. Nat Rev Microbiol 7: 81–87.
50. Mardis ER (2008) Next-generation DNA sequencing methods. Annu Rev
Genomics Hum Genet 9: 387–402.
51. MacLean D, Jones JD, Studholme DJ (2009) Application of ‘next-generation’
sequencing technologies to microbial genetics. Nat Rev Microbiol 7: 287–296.
52. Hardy J, Singleton A (2009) Genomewide association studies and human
disease. N Engl J Med 360: 1759–1768.
53. Tishkoff SA, Reed FA, Friedlaender FR, Ehret C, Ranciaro A, et al. (2009) The
genetic structure and history of Africans and African Americans. Science 324:
1035–44.
54. Basu A, Mukherjee N, Roy S, Sengupta S, Banerjee S, et al. (2003) Ethnic India:
a genomic view, with special reference to peopling and structure. Genome Res
13: 2277–2290.
55. Campbell MC, Tishkoff SA (2008) African Genetic Diversity: Implications for
human demographic history, modern human origins, and complex disease
mapping. Annu Rev Genomics Hum Genet 9: 403–33.
56. Goldstein DB (2009) Common genetic variation and human traits. N Engl J Med
360: 1696–1698.
57. Wang Z, Gerstein M, Snyder M (2009) RNA-Seq: a revolutionary tool for
transcriptomics. Nat Rev Genet 10: 57–63.
58. Arnvig KB, Young DB (2009) Identification of small RNAs in Mycobacterium
tuberculosis. Mol Microbiol 73: 397–408.
59. de Jong BC, Hill PC, Aiken A, Awine T, Antonio M, et al. (2008) Progression to
active tuberculosis, but not transmission, varies by Mycobacterium tuberculosis
lineage in the Gambia. J Infect Dis 198: 1037–43.
60. Caws M, Thwaites G, Dunstan S, Hawn TR, Thi Ngoc Lan N, et al. (2008) The
influence of host and bacterial genotype on the development of disseminated
disease with Mycobacterium tuberculosis. PLoS Pathog 4: e1000034.
61. Thwaites G, Caws M, Chau TT, D’Sa A, Lan NT, et al. (2008) The relationship
between Mycobacterium tuberculosis genotype and the clinical phenotype of
pulmonary and meningeal tuberculosis. J Clin Microbiol 46: 1363–8.
62. Ernst JD, Lewinsohn DM, Behar S, Blythe M, Schlesinger LS, et al. (2007)
Meeting report: NIH workshop on the Tuberculosis Immune Epitope Database.
Tuberculosis (Edinb) 88: 366–70.
63. de Jong BC, Hill PC, Brookes RH, Gagneux S, Jeffries DJ, et al. (2006)
Mycobacterium africanum elicits an attenuated T Cell response to Early Secreted
Antigenic Target, 6 kDa, in patients with tuberculosis and their household
contacts. J Infect Dis 193: 1279–1286.
64. Andersen P, Doherty TM (2005) Opinion: The success and failure of BCG -
implications for a novel tuberculosis vaccine. Nat Rev Microbiol 3: 656–62.
Review
Helicobacter pylori’s Unconventional Role in Health and Disease
Marion S. Dorer, Sarah Talarico, Nina R. Salama*
Division of Human Biology, Fred Hutchinson Cancer Research Center, Seattle, Washington, United States of America
Abstract: The discovery of a bacterium, Helicobacter pylori, that is resident in the human stomach and causes chronic disease (peptic ulcer and gastric cancer) was radical on many levels. Whereas the mouth and the colon were both known to host a large number of microorganisms, collectively referred to as the microbiome, the stomach was thought to be a virtual Sahara desert for microbes because of its high acidity. We now know that H. pylori is one of many species of bacteria that live in the stomach, although H. pylori seems to dominate this community. H. pylori does not behave as a classical bacterial pathogen: disease is not solely mediated by production of toxins, although certain H. pylori genes, including those that encode exotoxins, increase the risk of disease development. Instead, disease seems to result from a complex interaction between the bacterium, the host, and the environment. Furthermore, H. pylori was the first bacterium observed to behave as a carcinogen. The innate and adaptive immune defenses of the host, combined with factors in the environment of the stomach, apparently drive a continuously high rate of genomic variation in H. pylori. Studies of this genetic diversity in strains isolated from various locations across the globe show that H. pylori has coevolved with humans throughout our history. This long association has given rise not only to disease, but also to possible protective effects, particularly with respect to diseases of the esophagus. Given this complex relationship with human health, eradication of H. pylori in nonsymptomatic individuals may not be the best course of action. The story of H. pylori teaches us to look more deeply at our resident microbiome and the complexity of its interactions, both in this complex population and within our own tissues, to gain a better understanding of health and disease.
Common wisdom circa 1980 suggested that the stomach, with
its low pH, was a sterile environment. Then, endoscopy of the
stomach became common and, in 1984, pathologist Robin
Warren and gastroenterologist Barry Marshall saw an extracellu-
lar, curved bacillus, often in dense sheets, lining the stomach
epithelium of patients with gastritis (inflammation of the stomach)
and ulcer disease [1]. Soon, the medical community understood
that the gram-negative bacterium Helicobacter pylori, not stress, is
the major cause of stomach inflammation, which, in some infected
individuals, precedes peptic ulcer disease (10%–20%), distal gastric
adenocarcinoma (1%–2%), and gastric mucosa-associated lymphoid
tissue (MALT) lymphoma (<1%) [2–5]. Thus, H. pylori
gained distinction as the only known bacterial carcinogen [6]. It is
believed that half of the world’s population is infected with H.
pylori; however, the burden of disease falls disproportionately on
less-developed countries. The incidence of infection in developed
countries has fallen dramatically, for unknown reasons, with a
corresponding decrease in gastric cancer [7]. This public health
success is tempered by the recent demonstration of an inverse
relationship between H. pylori infection and esophageal adenocar-
cinoma, Barrett’s esophagus, and reflux esophagitis [8]. H. pylori
has been with humans since our earliest days; thus, it is not
surprising that its relationship is that of both a commensal
bacterium and a pathogen, causing some diseases and possibly
protecting against others. In addition, it is genetically diverse,
likely as a result of constant exposure to both environmental and
immunological selection, suggesting that genetic diversification is a
strategy for long-term colonization.
The Role of Infection in Disease Risk
H. pylori infection is generally acquired during childhood and,
without specific antibiotic treatment, can persist for the lifetime of
the host. Disease often does not develop until adulthood, after
decades of infection, and H. pylori induces variable pathologies in
the stomach. Duodenal ulcer disease is characterized by gastritis
that is largely confined to the antrum (the distal compartment of
the stomach), relatively low inflammation of the corpus (the
middle, acid-secreting compartment), and high levels of stomach
acid secretion (Figure 1A). Those with gastric ulcer or stomach
cancer have high levels of inflammation of the corpus, multifocal
gastric atrophy, and low levels of stomach acid secretion, due to
the destruction of stomach acid–secreting parietal cells (Figure 1B)
[9,10]. Some of this inflammatory response is controlled by the
cytokine IL-1β, which is induced by H. pylori infection [11] and
both elicits a proinflammatory response and inhibits secretion of
gastric acid [12]. Polymorphisms in the interleukin gene cluster,
including IL-1β, are risk factors for H. pylori–associated gastric
cancer [13,14], and studies of the transcriptional response of both
human and model hosts to H. pylori confirm induction of
transcriptional regulators of proinflammatory programs. In
This article is part of the “Genomics of Emerging Infectious Disease” PLoS Journal collection (http://ploscollections.org/emerginginfectiousdisease/).
Citation: Dorer MS, Talarico S, Salama NR (2009) Helicobacter pylori’s Unconventional Role in Health and Disease. PLoS Pathog 5(10): e1000544. doi:10.1371/journal.ppat.1000544
Editor: Marianne Manchester, The Scripps Research Institute, United States of America
Published October 26, 2009
Copyright: © 2009 Dorer et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Funding: Work in the Salama lab is supported by National Institutes of Health grant AI054423. The funder had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Competing Interests: The authors have declared that no competing interests exist.
* E-mail: [email protected]
PLoS Pathogens | www.plospathogens.org 1 October 2009 | Volume 5 | Issue 10 | e1000544
addition, transcription profiles reveal induction of several
chemokines and cytokines including those produced by nonlym-
phoid cells, and robust induction of innate immune defenses
including iron sequestration proteins and antimicrobial peptides
[15]. These studies suggest it would be wise to explore diverse
functional classes of genes for host genetic variant associations with
H. pylori disease progression. To this end, H. pylori researchers are
eagerly awaiting an unbiased genome-wide association study of
risk factors associated with progression to intestinal-type gastric
cancer or peptic ulcer disease in patients infected with H. pylori.
Such a study has been completed for sporadic diffuse-type gastric
cancer, which can be associated with H. pylori infection, revealing
two candidate loci, one that encodes a likely tumor suppressor
(prostate stem cell antigen [PSCA]) [16]. Genomic studies of this
sort will help elucidate host factors that synergize with H. pylori
infection to cause disease.
The association of H. pylori infection with gastric cancer raises
the interesting question of whether H. pylori encodes one or more
oncogenes. Oncogenic viruses initiate and promote cellular
transformation by integrating virally encoded oncogenes into the
host genome [17,18]. By contrast, H. pylori remains primarily
extracellular and does not integrate its genome into the host DNA.
The bacterium can still affect the function of host cells, however,
by translocating a bacterial protein, CagA, into host cells via a
specialized secretion system called the cag Type IV secretion
system (T4SS) [19,20]. In host cells, CagA interacts with a number
of cellular complexes implicated in oncogenesis [21,22]. Despite
elucidation of potentially transforming activities, transgenic
expression of CagA in the mouse stomach is only weakly
oncogenic [23]. As the cag T4SS also induces proinflammatory
cytokines via the intracellular bacterial peptidoglycan recognition
molecule Nod1, cancer progression may occur through synergy
with the host inflammatory response [24]. While CagA may not
promote cancer itself, exposure to CagA and inflammatory insults
may select for heritable host cell changes (genetic or epigenetic)
that together contribute to cancer progression.
H. pylori expands our view of how microbes survive at high levels
while activating inflammatory responses and shows us that microbes
may be underappreciated as an important factor in chronic disease
pathogenesis. In the case of pathogens that cause acute infections,
there is a massive inflammatory response, which often supports
bacterial replication and transmission. Alternatively, some patho-
gens, such as Mycobacterium tuberculosis, persist in the host by
manipulating the immune response to create a protected compart-
ment. H. pylori introduces a third strategy; it actively replicates and
maintains a continuous balance with the inflammatory response
over years of infection with little evidence for increased H. pylori–
related disease upon immune suppression [25]. As the role of
chronic inflammation in many diseases including cardiovascular
disease, diabetes mellitus, Alzheimer’s disease, and others is
increasingly recognized, researchers are focusing on infectious
agents as one possible source of this chronic inflammation.
Genomic Insights into the Biology of H. pylori
The study of H. pylori is strongly influenced by the genomic age.
The sequencing of its genome was completed in 1997 [26], just 13
years after Marshall and Warren reported their discovery.
However, almost a quarter (24%) of H. pylori genes have no
sequence similarity with genes available in public databases [27],
suggesting that lessons learned from well-studied bacteria like
Escherichia coli would not necessarily apply to this evolutionarily
distinct epsilonproteobacterium. By using more advanced bioinformatic
approaches, researchers are now identifying some pathways
first thought absent in H. pylori. For example, H. pylori appeared to
lack the E. coli recBCD pathway, which is involved in homologous
recombination and DNA double-strand break repair. More careful
examination of conserved domains and motifs, however, identified
the H. pylori addA and addB genes, which are present in most gram-
positive and many gram-negative bacteria and whose protein
products have enzymatic functions similar to those of the recBCD
pathway [28].
By 1999, H. pylori was the first species to have complete genomes
sequenced from two different strains—an important milestone,
given its genetic diversity. Comparison of the two genomes
revealed that 6%–7% of the genes were present in one strain but
not in the other. There was also a high level of nucleotide diversity
between the two strains, with only eight genes sharing at least 98%
nucleotide identity; however, most nucleotide differences were
synonymous changes [27]. Microarrays based on these
sequences were then used for comparative genomic hybridization
of H. pylori strains isolated from different ethnic groups and
geographic areas [29,30]. These studies found that 25% of H. pylori
genes are variably present among strains. Such genome-wide
analyses have played an important role in dividing H. pylori genes
into two classes: variable genes that are absent in some strains and
core genes that are present in all strains analyzed. The variable
genes are likely adaptive for different environmental niches, which
for the human stomach–restricted H. pylori comprise genetically
distinct hosts. The largest annotated class of variable genes encodes
proteins expressed on or that modify the bacterial cell surface
(outer membrane proteins and proteins involved in lipopolysac-
charide synthesis) [30], consistent with a function at the interface
of the bacteria and host. The core genes have diverse functions.
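The division into core and variable genes can be illustrated with a short sketch. It assumes a presence/absence table of the kind that comparative genomic hybridization yields; the gene names, strain names, and values below are hypothetical placeholders, not data from the studies cited.

```python
def classify_genes(presence, strains):
    """Split genes into core (detected in every strain) and variable.

    `presence` maps a gene name to the set of strains in which the gene
    was detected; `strains` is the full set of strains analyzed.
    All names used here are hypothetical placeholders.
    """
    all_strains = set(strains)
    core = sorted(g for g, found in presence.items() if found >= all_strains)
    variable = sorted(g for g, found in presence.items() if not found >= all_strains)
    return core, variable


# Toy input: three strains and four genes (illustrative only).
toy_strains = {"strain1", "strain2", "strain3"}
toy_presence = {
    "geneA": {"strain1", "strain2", "strain3"},
    "geneB": {"strain1", "strain3"},
    "geneC": {"strain1", "strain2", "strain3"},
    "geneD": {"strain2"},
}

core_genes, variable_genes = classify_genes(toy_presence, toy_strains)
# Fraction of genes that are variably present in this toy table.
variable_fraction = len(variable_genes) / len(toy_presence)
print(core_genes, variable_genes, variable_fraction)
```

Note that "core" in such an analysis is always relative to the strains sampled: adding a strain can only move genes from the core class to the variable class, never the reverse.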
Some core genes are required for viability in culture. A genomic
study that utilized microarray-based mapping of a genome-
saturating transposon library (a collection of H. pylori strains that
includes transposon mutants randomly distributed throughout the
genome) revealed that 23% of the genome is required for viability
in culture because these genes could not tolerate transposon
Figure 1. Distinct pathologies of H. pylori–induced disease. (A) Duodenal ulcer disease correlates with high inflammation in the antrum (red bursts), lower levels of inflammation in the corpus, and high acid secretion (+). (B) Gastric ulcer or adenocarcinoma correlates with increased inflammation in the corpus, low acid secretion, and multifocal atrophy (wavy lines). doi:10.1371/journal.ppat.1000544.g001
Learning about Disease from H. pylori
insertion [31]. Additional core genes are essential only in the
context of host infection and several groups have completed
screens for transposon mutants that fail to colonize animal models
of infection [32,33]. An example of such a colonization core gene
is addA, which is required for recombinational repair of DNA
double-strand breaks, presumably caused by the host inflamma-
tory response [28].
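The logic of inferring candidate essential genes from a saturating transposon library can be sketched as follows. This is a minimal illustration, not the microarray-based mapping procedure of the cited study: the gene coordinates and insertion positions are invented, and in practice a gene lacking insertions is only a candidate, since short genes can escape insertion by chance.

```python
from bisect import bisect_left


def genes_without_insertions(genes, insertion_sites):
    """Return the names of genes that contain no mapped insertion.

    `genes` maps a gene name to inclusive (start, end) coordinates.
    `insertion_sites` is an iterable of genome positions.
    All values used below are hypothetical.
    """
    sites = sorted(insertion_sites)
    candidates = []
    for name, (start, end) in genes.items():
        # Find the first insertion at or after the gene start ...
        i = bisect_left(sites, start)
        # ... and check whether it falls before the gene end.
        hit = i < len(sites) and sites[i] <= end
        if not hit:
            candidates.append(name)
    return sorted(candidates)


# Toy genome: three genes and three mapped insertions (illustrative only).
toy_genes = {"geneA": (1, 100), "geneB": (150, 300), "geneC": (350, 500)}
toy_insertions = [42, 97, 410]

candidate_essential = genes_without_insertions(toy_genes, toy_insertions)
print(candidate_essential)
```

Applying the same comparison to the mutants recovered after passage through an animal model, rather than after growth in culture, would flag genes needed only for colonization.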
The nucleotide sequence diversity in H. pylori’s core genes can
distinguish between different ethnic and geographic human
populations, demonstrating that passage of H. pylori between
closely related humans has continued uninterrupted over tens of
thousands of years (see Box 1). Different geographic and ethnic
groups that have similar infection rates have quite varied relative
risks of H. pylori–associated diseases such as gastric cancer [34].
Thus, in addition to host genetic and environmental exposures,
differences among strains likely contribute to variation in disease
risk. Consequently, studies of pathogenesis need to be reproduced
in representative strain backgrounds to ensure that discoveries in
one strain apply in strain populations with a diverse evolutionary
history.
H. pylori Diversification during Persistent Infection
Genetic diversification can aid in the persistence of organisms
that continue to replicate during chronic infection, allowing them
to sample adaptive variants. HIV, for example, has a flexible
reverse transcriptase that makes point mutations, insertions,
deletions, transversions, and duplications that produce variants
that may have a selective advantage [35]. Genetic variation in a
microbe indicates constant selection by a dynamic environment,
and H. pylori is a very genetically diverse species of bacteria [36–
38]. Genetic diversification may help H. pylori to adapt to a new
host after transmission, to different micro-niches within a single
host, and to changing conditions in the host over time—for
example, by avoiding clearance by host defenses.
Genetic diversity arises from within-genome diversification as
well as from reassortment by recombination with DNA from other
infecting H. pylori, generating novel clones within the stomach
(Figure 2). Within-genome diversification can include point
mutations, intragenomic recombination, and slipped-strand mis-
pairing during DNA replication within repetitive sequences.
Reassortment can occur by recombination with either DNA from
a superinfecting H. pylori strain or a variant clone of the same
strain. Central to this reassortment is H. pylori’s natural
competence—the ability to take up exogenous DNA and
incorporate it into its genome. Evidence from our lab shows that
natural competence is induced by DNA damage, suggesting that
H. pylori responds to stress by diversifying its genome (MSD and
NRS, unpublished data). However, there are controls on this
rampant genetic exchange: restriction-modification systems, which
include a restriction endonuclease that cleaves a specific DNA
sequence and a DNA methyltransferase that protects the
bacterium’s own DNA from being cleaved by methylating the
target DNA sequence. Genes that encode restriction-modification
systems compose the second largest class of variably present genes
with known function, so the complement of available restriction-
modification systems varies between strains, giving a methylation
code to the DNA from each strain. This mechanism serves to limit
or prevent recombination between H. pylori strains as well as
between H. pylori and other bacteria or eukaryotic cells [39].
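How strain-specific restriction-modification systems can act as a barrier to exchange is shown in the toy model below. It makes strong simplifying assumptions that are ours, not the source's: each system is reduced to a single recognition motif, a donor methylates every occurrence of its own motifs, and methylation fully blocks cleavage. The motifs and the DNA fragment are arbitrary examples.

```python
def is_cleaved(dna, donor_motifs, recipient_motifs):
    """Return True if the recipient's restriction enzymes would cut `dna`.

    Simplified model: a motif is protected only if the donor strain also
    carries a system recognizing it (and has therefore methylated it).
    """
    unprotected = set(recipient_motifs) - set(donor_motifs)
    return any(motif in dna for motif in unprotected)


# Arbitrary example fragment containing the motifs GATC and CATG.
fragment = "TTGATCAACATGCC"

# Donor lacks one of the recipient's systems: the fragment is cut.
cut_between_strains = is_cleaved(fragment, {"GATC"}, {"GATC", "CATG"})

# Donor is a variant clone with the same systems: the fragment survives.
cut_within_strain = is_cleaved(fragment, {"GATC", "CATG"}, {"GATC", "CATG"})

print(cut_between_strains, cut_within_strain)
```

Under these assumptions, DNA from a clone of the same strain always passes, whereas DNA from a strain with a different complement of systems is cut whenever it carries an unprotected motif, which is the sense in which the methylation pattern works as a strain-specific code.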
The H. pylori genome encodes relatively few proteins that
regulate transcription. Instead, some of the same processes that
govern the generation of genetic diversity (i.e., slipped-strand
mispairing, methyltransferase activity, and recombination) also
play an important role in varying gene expression in response to
environmental cues. There are 46 H. pylori genes that have long
repeats of one or two nucleotides that are prone to slipped-strand
mispairing during replication [26,27,40]. These genes are phase-
variable because changes in the number of repeats can shift the
reading frame of the gene, switching gene expression on or off
(Figure 2). In addition, many H. pylori promoters have mononu-
cleotide repeats that regulate gene expression by changing the
spacing between important regulatory sites in these promoters.
Orphan methyltransferases, which have lost their corresponding
restriction enzyme, may also regulate gene expression by
methylating sequences in the promoter region of genes, and some
of the methyltransferase genes are themselves subject to phase-
variable expression. Recombination regulates gene expression
through deletions and duplications that occur during gene
conversion and locus switching. These mechanisms suggest that
H. pylori survives by constantly generating variants that adapt its
physiology to new environments.
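Phase variation by slipped-strand mispairing can be made concrete with a small sketch. It assumes a hypothetical gene in which a homopolymeric tract lies inside the coding sequence; the sequences are invented for illustration, and the gene is scored as "on" only when the reading frame runs through to the stop codon that ends the gene.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def first_stop_index(seq):
    """Return the index of the first in-frame stop codon, or None."""
    for i in range(0, len(seq) - 2, 3):
        if seq[i:i + 3] in STOP_CODONS:
            return i
    return None


def is_in_phase(prefix, repeat_unit, n_repeats, suffix):
    """True if translation from the start of `prefix` ends exactly at the
    stop codon that terminates `suffix` (the gene is switched on)."""
    seq = prefix + repeat_unit * n_repeats + suffix
    return first_stop_index(seq) == len(seq) - 3


# Hypothetical gene: start codon plus one codon, a poly-C tract, then the
# remainder of the open reading frame ending in a TAA stop codon.
toy_prefix = "ATGGCA"
toy_suffix = "GAAGTTCTGTAA"

# Expression state for tract lengths of 8 to 12 cytosines.
states = {n: is_in_phase(toy_prefix, "C", n, toy_suffix) for n in range(8, 13)}
print(states)
```

In this toy gene only tract lengths that are multiples of three keep the frame intact, so gaining or losing a single repeat during replication switches expression off, and a second slip can switch it back on.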
One example of how H. pylori’s genetic variability helps it adapt
to new environments involves its adhesin genes, which encode
proteins that bind to the Lewis human blood group antigens,
which are carbohydrate-based epitopes [41]. The protein encoded
by one of these adhesin genes, BabA, binds the Lewis-b antigen on
the gastric mucosa, helping the bacterium adhere to the mucosa.
The babA gene is silent in some H. pylori strains but can be
Box 1. Tracking Human Genealogy with H. pylori Genomics
Currently, a number of companies propose to predict your “genetic genealogy” from the DNA in a cheek swab. They do this by analyzing informatively variable parts of our genomes (such as the Y chromosome or mitochondrial DNA) that show characteristic differences between ethnic and geographic populations; thus, they can tell if you may be distantly related to Genghis Khan, for example. Unfortunately, population bottlenecks [51], small population sizes, and long generation times have limited the amount of genetic diversity in the human population that can be used for these analyses. It turns out, however, that genomic sequencing of the H. pylori strain harbored by an individual does a better job in resolving ancestry than the usual human genomic markers [52]. This is because of high genetic diversity among H. pylori strains [53], a restricted mode of transmission (primarily within families or households [54]), and the association of H. pylori with humans throughout our evolution [55]. A major source of H. pylori’s genetic diversity is recombination between strains [38], which blurs signatures of descent. Despite this confounding factor, Achtman and colleagues [53] identified evolutionary signatures in strain sequences from diverse geographic sources. These signatures, combined with new statistical tools that take into account admixture and recombination [55], have tracked ancient human migrations, such as our emergence from Africa [55], and more recent events such as colonization of the Pacific islands [56]. H. pylori gene sequences can even distinguish between the Buddhist and Muslim ethnic groups that have coexisted for at least 1,000 years in Ladakh [52]. The fact that H. pylori has maintained evolutionarily distinct strain signatures during many generations of contact suggests either that interracial interactions that promote transmission are very limited or that additional mechanisms prevent strains from one ethnic population from establishing a foothold in hosts of another ethnic population.
expressed if it recombines with the babB gene, an event mediated
by homologous sequences at the 5′ and 3′ ends of the two genes
[42]. Thus, recombination can help H. pylori alter its adherence
properties to adapt to selective pressures in the host. These
selective pressures may include variation in the host receptors
present or in conditions that favor a shift in the ratio of bacteria
adherent to the gastric cell epithelium over those swimming freely
in the mucus.
Genetic variation may also be important for the ability of H.
pylori to evade the host immune system. H. pylori further exploits
the Lewis antigen system by ‘‘camouflaging’’ its surface lipopoly-
saccharide with its own Lewis-type antigen, which mimics that of
the individual host. The bacterium adapts the spectrum of Lewis
antigens it expresses by phase variation of the genes involved in
their biosynthesis [43]. Furthermore, recombination among the
many members of the large outer membrane protein (omp) gene
family has the potential to create mosaic omp genes, generating
antigenic variation that may keep H. pylori ahead of the ability of
the host’s immune system to recognize these cell surface exposed
epitopes.
H. pylori’s Interaction with the Microbiome
H. pylori shares its niche with the stomach microbiome, one part of the
collection of microorganisms living on and in us. Study of
microorganisms was once limited to only those microbes that could
be cultured in the laboratory. Advances in sequencing technology
now allow us to study the collection of genes encoded by any
group of organisms—so-called metagenomics—making it possible
to also characterize the microbes that cannot be cultured but
nevertheless affect our health. Given that H. pylori engages in DNA
exchange, the metagenome may serve as a repository for novel
traits. When present, H. pylori dominates the microbiome in the
stomach [44,45], although the effect of this dominance is not
known. Perhaps H. pylori infection changes the composition of the
stomach microbiome, with unknown consequences.
Challenges for the Future
H. pylori is considered pathogenic, even carcinogenic. With this
simple view, eradication seems an obvious choice. In reality,
however, the relationship between H. pylori and disease is more
Figure 2. Mechanisms that create genetic diversity in H. pylori. Colored arrows represent different genes, and the correspondingly colored triangles, rectangles, and circles represent the proteins encoded by these genes. Diversification mechanisms (right side of figure) include spontaneous point mutations, slipped-strand mispairing, and intragenomic recombination. Allelic changes involving nonsynonymous point mutations and mosaic genes resulting from intragenomic recombination can alter the function and/or the antigenic epitopes of the encoded protein. Gene expression can also be regulated by gene conversion resulting from intragenomic recombination, and phase variation mediated by slipped-strand mispairing. Reassortment of genes (left side of figure) by natural transformation with exogenous DNA also contributes to genetic diversity. Natural transformation with DNA from a superinfecting strain, for example, can introduce new genes and new alleles of already present genes (horizontal gene transfer). Similarly, natural transformation with DNA from a variant clone of the same strain can further propagate an advantageous allele acquired by within-genome diversification. doi:10.1371/journal.ppat.1000544.g002
nuanced. A recent trial showed that, like the cancer risk associated with
smoking, the cancer risk from H. pylori diminished
measurably only 12 years after eradication of the infection [46].
Some studies suggest that infection may prevent diseases of the
esophagus, and there is a debate in the literature concerning a
relationship between H. pylori and childhood asthma [8,47,48].
There is clear consensus that H. pylori should be eliminated in cases
of peptic ulcer disease, gastric MALT lymphoma, early gastric
cancer, first-degree relatives of gastric cancer patients, and
uninvestigated dyspepsia in high-prevalence populations. Despite
its potential to prevent ulcer and cancer, universal eradication of
H. pylori infection has not gained wide support, because of the
mixture of positive and negative disease associations with infection,
the lack of a definitive bacterial or host molecule accounting for
disease causation, and poor success rates of treating non-ulcer
dyspepsia by clearing H. pylori infection [49,50]. Thus a more
detailed picture of this host–pathogen interaction is needed and
likely will depend upon further advances in both endoscopy and
genomics.
We have a poor understanding of the immune responses to H.
pylori and the reasons that most hosts fail to clear infection. The
host restriction of H. pylori to humans and some nonhuman
primates has hampered development of robust animal models to
study the disease process. Thus progress will require improvements
in animal models and improved access to patient samples.
Endoscopy of the upper gastrointestinal tract is an invasive
procedure, so a major limitation to research is collection of
bacterial and human tissue samples from infected people.
Available samples are biased toward patients with severe
dyspepsia, ulcer symptoms, and gastric cancer, and only a small
fraction of the stomach can be sampled. Advances in less-invasive
methods, such as capsule endoscopy, may allow increased
sampling to monitor bacterial and tissue changes during chronic
colonization, including isolation and phenotypic analysis of
immune effector cells in infected tissue. Less-invasive methods
would also provide an opportunity to study infection in
asymptomatic individuals and transmission of H. pylori infection,
conditions in which the selective pressures that drive the observed
H. pylori genetic diversification likely operate.
A major opportunity to increase our understanding of how H.
pylori causes or prevents disease arises from recent advances in
high-throughput sequencing technologies. Currently, several
platforms allow researchers to accomplish in a single experiment
sequencing or resequencing of tens of H. pylori genomes,
characterization of host immune and epithelial cell types that
change during infection with highly sensitive digital expression tag
analysis, or analysis of the microbiome present in the stomach and
esophagus through metagenomic sequencing or targeted bacterial
or fungal small ribosomal subunit DNA sequencing. The sequence
data generated by such experiments will address several important
mysteries of H. pylori biology, including the timing and extent of H.
pylori genetic diversification. While strains from unrelated
individuals show dramatic variation in gene content and gene
sequence, the extent of sequence variation among clones during
persistent infection of a single host or upon transmission has not
been adequately sampled. Whole-genome sequencing of multiple
isolates from individual patients with dense spatial and temporal
sampling would definitively establish when, where, and by what
mechanisms genetic diversity is generated. This information will
inform efforts to combat resistance to current antibiotics, to
develop vaccines, and to understand H. pylori’s coevolution with
humans. Exploration of the influence of H. pylori on the
microbiome will identify organisms that collaborate with or can
be antagonized by H. pylori. Such organisms may mediate some of
the disease risks that have been associated with H. pylori presence
and absence. Finally, the rapid pace of resequencing of H. pylori’s
human host will provide a deeper understanding of genetic
variation in the human population that may influence risk for H.
pylori–associated pathologies and which, by association, could
provide clues to the cellular pathways disrupted in disease. Thus,
genomic approaches to study host response, the human microbiome,
bacterial genetic variation, and, perhaps most importantly,
the intersections among these components, will help researchers
determine whether eradication is appropriate for all individuals in
all populations.
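As a rough illustration of the within-host comparison proposed above, the following sketch counts pairwise nucleotide differences among aligned sequences from several isolates. The isolate names and sequences are invented toy data, and the function name `pairwise_differences` is hypothetical; a real analysis would start from whole-genome alignments and would need to account for recombination, which this sketch ignores.

```python
from itertools import combinations


def pairwise_differences(aligned):
    """Count mismatched positions for every pair of aligned sequences.

    Positions where either sequence has a gap ('-') or an ambiguous
    base ('N') are skipped rather than counted as differences.
    """
    lengths = {len(seq) for seq in aligned.values()}
    if len(lengths) > 1:
        raise ValueError("sequences must be aligned to the same length")
    result = {}
    for (name_a, seq_a), (name_b, seq_b) in combinations(sorted(aligned.items()), 2):
        diffs = sum(
            1
            for a, b in zip(seq_a.upper(), seq_b.upper())
            if a != b and a not in "-N" and b not in "-N"
        )
        result[(name_a, name_b)] = diffs
    return result


# Toy data (hypothetical): two stomach sites at one time point, one site later.
isolates = {
    "antrum_t0": "ATGCGTACGTTAGC",
    "corpus_t0": "ATGCGTACGTTAGC",
    "antrum_t1": "ATGCGAACGTTGGC",
}
distances = pairwise_differences(isolates)
for pair, n_diff in sorted(distances.items()):
    print(pair, n_diff)
```

With dense spatial and temporal sampling, a table of this kind is the starting point for asking whether diversity accumulates over time within one site or differs between sites at a single time point.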
Acknowledgments
We thank Olivier Humbert and Laura Sycuro for their critical comments
on the manuscript and Laura Sycuro for providing H. pylori images.
References
1. Marshall BJ, Warren JR (1984) Unidentified curved bacilli in the stomach of patients with gastritis and peptic ulceration. Lancet 1: 1311–1315.
2. Nomura A, Stemmermann GN, Chyou P, Kato I, Perez-Perez G, et al. (1991)
Helicobacter pylori infection and gastric carcinoma among Japanese Americans in
Hawaii. N Engl J Med 325: 1132–1136.
3. Parsonnet J, Friedman GD, Vandersteen DP, Chang Y, Vogelman JH, et al.
(1991) Helicobacter pylori infection and the risk of gastric carcinoma. N Engl J Med
325: 1127–1131.
4. Parsonnet J, Hansen S, Rodriguez L, Gelb AB, Warnke RA, et al. (1994) Helicobacter pylori infection and gastric lymphoma. N Engl J Med 330: 1267–1271.
5. Kusters JG, van Vliet AH, Kuipers EJ (2006) Pathogenesis of Helicobacter pylori
infection. Clin Microbiol Rev 19: 449–490.
6. WHO (2006) Fact sheet No. 297, Cancer. World Health Organization.
7. Peek RM Jr, Blaser MJ (2002) Helicobacter pylori and gastrointestinal tract adenocarcinomas. Nat Rev Cancer 2: 28–37.
8. Anderson LA, Murphy SJ, Johnston BT, Watson RG, Ferguson HR, et al.
(2008) Relationship between Helicobacter pylori infection and gastric atrophy and the stages of the oesophageal inflammation, metaplasia, adenocarcinoma
sequence: Results from the FINBAR case-control study. Gut 57: 734–739.
9. Amieva MR, El-Omar EM (2008) Host-bacterial interactions in Helicobacter pylori
infection. Gastroenterology 134: 306–323.
10. Rubin CE (1997) Are there three types of Helicobacter pylori gastritis? Gastroenterology 112: 2108–2110.
11. Basso D, Scrigner M, Toma A, Navaglia F, Di Mario F, et al. (1996) Helicobacter
pylori infection enhances mucosal interleukin-1 beta, interleukin-6, and the
soluble receptor of interleukin-2. Int J Clin Lab Res 26: 207–210.
12. El-Omar EM (2001) The importance of interleukin 1beta in Helicobacter pylori
associated disease. Gut 48: 743–747.
13. El-Omar EM, Carrington M, Chow WH, McColl KE, Bream JH, et al. (2000) Interleukin-1 polymorphisms associated with increased risk of gastric cancer.
Nature 404: 398–402.
14. Figueiredo C, Machado JC, Pharoah P, Seruca R, Sousa S, et al. (2002)
Helicobacter pylori and interleukin 1 genotyping: an opportunity to identify high-
risk individuals for gastric carcinoma. J Natl Cancer Inst 94: 1680–1687.
15. Humbert O, Pinto-Santini DM, Salama NR (2008) Genomotyping of Helicobacter
pylori and its host: microarray-based insights on gene variation, expression and
function. In: Yamaoka Y, ed. Helicobacter pylori Molecular Genetics and Cellular
Biology. Norfolk, UK: Caister Academic Press. pp 205–244.
16. Sakamoto H, Yoshimura K, Saeki N, Katai H, Shimoda T, et al. (2008) Genetic variation in PSCA is associated with susceptibility to diffuse-type gastric cancer.
Nat Genet 40: 730–740.
17. Maeda N, Fan H, Yoshikai Y (2008) Oncogenesis by retroviruses: Old and new
paradigms. Rev Med Virol 18: 387–405.
18. Howley PM, Livingston DM (2009) Small DNA tumor viruses: Large
contributors to biomedical sciences. Virology 384: 256–259.
19. Segal ED, Cha J, Lo J, Falkow S, Tompkins LS (1999) Altered states:
Involvement of phosphorylated CagA in the induction of host cellular growth changes by Helicobacter pylori. Proc Natl Acad Sci U S A 96: 14559–14564.
20. Stein M, Rappuoli R, Covacci A (2000) Tyrosine phosphorylation of the Helicobacter pylori CagA antigen after cag-driven host cell translocation. Proc Natl
Acad Sci U S A 97: 1263–1268.
21. Bourzac KM, Guillemin K (2005) Helicobacter pylori-host cell interactions
mediated by type IV secretion. Cell Microbiol 7: 911–919.
22. Hatakeyama M (2006) Helicobacter pylori CagA — A bacterial intruder conspiring
gastric carcinogenesis. Int J Cancer 119: 1217–1223.
23. Ohnishi N, Yuasa H, Tanaka S, Sawa H, Miura M, et al. (2008) Transgenic
expression of Helicobacter pylori CagA induces gastrointestinal and hematopoietic
neoplasms in mouse. Proc Natl Acad Sci U S A 105: 1003–1008.
24. Viala J, Chaput C, Boneca IG, Cardona A, Girardin SE, et al. (2004) Nod1
responds to peptidoglycan delivered by the Helicobacter pylori cag pathogenicity
island. Nat Immunol 5: 1166–1174.
25. Romanelli F, Smith KM, Murphy BS (2007) Does HIV infection alter the
incidence or pathology of Helicobacter pylori infection? AIDS Patient Care STDS
21: 908–919.
26. Tomb JF, White O, Kerlavage AR, Clayton RA, Sutton GG, et al. (1997) The
complete genome sequence of the gastric pathogen Helicobacter pylori [published
erratum appears in Nature 1997 Sep 25;389(6649):412]. Nature 388: 539–547.
27. Alm RA, Ling LS, Moir DT, King BL, Brown ED, et al. (1999) Genomic-
sequence comparison of two unrelated isolates of the human gastric pathogen
Helicobacter pylori. Nature 397: 176–180.
28. Amundsen SK, Fero J, Hansen LM, Cromie GA, Solnick JV, et al. (2008)
Helicobacter pylori AddAB helicase-nuclease and RecA promote recombination-
related DNA repair and survival during stomach colonization. Mol Microbiol
69: 994–1007.
29. Gressmann H, Linz B, Ghai R, Pleissner KP, Schlapbach R, et al. (2005) Gain
and loss of multiple genes during the evolution of Helicobacter pylori. PLoS Genet
1: e43. doi:10.1371/journal.pgen.0010043.
30. Salama N, Guillemin K, McDaniel TK, Sherlock G, Tompkins L, et al. (2000) A
whole-genome microarray reveals genetic diversity among Helicobacter pylori
strains. Proc Natl Acad Sci U S A 97: 14668–14673.
31. Salama NR, Shepherd B, Falkow S (2004) Global transposon mutagenesis and
essential gene analysis of Helicobacter pylori. J Bacteriol 186: 7926–7935.
32. Baldwin DN, Shepherd B, Kraemer P, Hall MK, Sycuro LK, et al. (2007)
Identification of Helicobacter pylori genes that contribute to stomach colonization.
Infect Immun 75: 1005–1016.
33. Kavermann H, Burns BP, Angermuller K, Odenbreit S, Fischer W, et al. (2003)
Identification and characterization of Helicobacter pylori genes essential for gastric
colonization. J Exp Med 197: 813–822.
34. Yamaguchi N, Kakizoe T (2001) Synergistic interaction between Helicobacter
pylori gastritis and diet in gastric cancer. Lancet Oncol 2: 88–94.
35. Johnson WE, Desrosiers RC (2002) Viral persistence: HIV’s strategies of
immune system evasion. Annu Rev Med 53: 499–518.
36. Israel DA, Salama N, Krishna U, Rieger UM, Atherton JC, et al. (2001)
Helicobacter pylori genetic diversity within the gastric niche of a single human host.
Proc Natl Acad Sci U S A 98: 14625–14630.
37. Salama NR, Gonzalez-Valencia G, Deatherage B, Aviles-Jimenez F,
Atherton JC, et al. (2007) Genetic analysis of Helicobacter pylori strain populations
colonizing the stomach at different times postinfection. J Bacteriol 189:
3834–3845.
38. Suerbaum S, Smith JM, Bapumia K, Morelli G, Smith NH, et al. (1998) Free
recombination within Helicobacter pylori. Proc Natl Acad Sci U S A 95:
12619–12624.
39. Humbert O, Salama NR (2008) The Helicobacter pylori HpyAXII restriction-
modification system limits exogenous DNA uptake by targeting GTAC sites but shows asymmetric conservation of the DNA methyltransferase and restriction
endonuclease components. Nucleic Acids Res 36: 6893–6906.
40. Salaun L, Linz B, Suerbaum S, Saunders NJ (2004) The diversity within an expanded and redefined repertoire of phase-variable genes in Helicobacter pylori.
Microbiology 150: 817–830.
41. Lloyd KO (2000) The chemistry and immunochemistry of blood group A, B, H,
and Lewis antigens: Past, present and future. Glycoconj J 17: 531–541.
42. Backstrom A, Lundberg C, Kersulyte D, Berg DE, Boren T, et al. (2004) Metastability of Helicobacter pylori bab adhesin genes and dynamics in Lewis b
antigen binding. Proc Natl Acad Sci U S A 101: 16923–16928.
43. Wirth HP, Yang M, Peek RM Jr, Tham KT, Blaser MJ (1997) Helicobacter pylori
Lewis expression is related to the host Lewis phenotype. Gastroenterology 113: 1091–1098.
44. Bik EM, Eckburg PB, Gill SR, Nelson KE, Purdom EA, et al. (2006) Molecular
analysis of the bacterial microbiota in the human stomach. Proc Natl Acad Sci U S A 103: 732–737.
45. Andersson AF, Lindberg M, Jakobsson H, Backhed F, Nyren P, et al. (2008) Comparative analysis of human gut microbiota by barcoded pyrosequencing.
PLoS ONE 3: e2836. doi:10.1371/journal.pone.0002836.
46. Mera R, Fontham ET, Bravo LE, Bravo JC, Piazuelo MB, et al. (2005) Long term follow up of patients treated for Helicobacter pylori infection. Gut 54:
1536–1540.
47. Raj SM, Choo KE, Noorizan AM, Lee YY, Graham DY (2009) Evidence
against Helicobacter pylori being related to childhood asthma. J Infect Dis 199: 914–915; author reply 915–916.
48. Chen Y, Blaser MJ (2008) Helicobacter pylori colonization is inversely associated
with childhood asthma. J Infect Dis 198: 553–560.
49. Chey WD, Wong BC (2007) American College of Gastroenterology guideline on
the management of Helicobacter pylori infection. Am J Gastroenterol 102: 1808–1825.
50. Malfertheiner P, Megraud F, O’Morain C, Bazzoli F, El-Omar E, et al. (2007)
Current concepts in the management of Helicobacter pylori infection: The Maastricht III Consensus Report. Gut 56: 772–781.
51. Cann RL, Stoneking M, Wilson AC (1987) Mitochondrial DNA and human evolution. Nature 325: 31–36.
52. Wirth T, Wang X, Linz B, Novick RP, Lum JK, et al. (2004) Distinguishing human ethnic groups by means of sequences from Helicobacter pylori: Lessons from
Ladakh. Proc Natl Acad Sci U S A 101: 4746–4751.
53. Achtman M, Azuma T, Berg DE, Ito Y, Morelli G, et al. (1999) Recombination and clonal groupings within Helicobacter pylori from different geographical regions.
Mol Microbiol 32: 459–470.
54. Schwarz S, Morelli G, Kusecek B, Manica A, Balloux F, et al. (2008) Horizontal
versus familial transmission of Helicobacter pylori. PLoS Pathog 4: e1000180.
doi:10.1371/journal.ppat.1000180.
55. Falush D, Wirth T, Linz B, Pritchard JK, Stephens M, et al. (2003) Traces of
human migrations in Helicobacter pylori populations. Science 299: 1582–1585.
56. Moodley Y, Linz B, Yamaoka Y, Windsor HM, Breurec S, et al. (2009) The
peopling of the Pacific from a bacterial perspective. Science 323: 527–530.
Review
Helminth Genomics: The Implications for Human Health
Paul J. Brindley1*, Makedonka Mitreva2, Elodie Ghedin3, Sara Lustigman4
1 Department of Microbiology, Immunology, and Tropical Medicine, George Washington University Medical Center, Washington, D. C., United States of America,
2 Genome Centre and Department of Genetics, Washington University School of Medicine, St. Louis, Missouri, United States of America, 3 Division of Infectious Diseases,
University of Pittsburgh School of Medicine, Pittsburgh, Pennsylvania, United States of America, 4 New York Blood Center, Laboratory of Molecular Parasitology, New York,
New York, United States of America
Abstract: More than two billion people (one-third of humanity) are infected with parasitic roundworms or flatworms, collectively known as helminth parasites. These infections cause diseases that are responsible for enormous levels of morbidity and mortality, delays in the physical development of children, loss of productivity among the workforce, and maintenance of poverty. Genomes of the major helminth species that affect humans, and many others of agricultural and veterinary significance, are now the subject of intensive genome sequencing and annotation. Draft genome sequences of the filarial worm Brugia malayi and two of the human schistosomes, Schistosoma japonicum and S. mansoni, are now available, among others. These genome data will provide the basis for a comprehensive understanding of the molecular mechanisms involved in helminth nutrition and metabolism, host-dependent development and maturation, immune evasion, and evolution. They are likely also to predict new potential vaccine candidates and drug targets. In this review, we present an overview of these efforts and emphasize the potential impact and importance of these new findings.
Helminth Infections: The Great Neglected Tropical Diseases
Helminth parasites are parasitic worms from the phyla
Nematoda (roundworms) and Platyhelminthes (flatworms)
(Figures 1 and 2); together, they comprise the most common
infectious agents of humans in developing countries. The collective
burden of the common helminth diseases (which range from the
dramatic sequelae of elephantiasis and blindness to the more
subtle but widespread effects on child development, pregnancy,
and productivity) rivals that of the main high-mortality conditions
such as HIV/AIDS or malaria [1]. For example, based on a
recent analysis [2], 85% of the neglected tropical disease (NTD)
burden for the poorest 500 million people living in sub-Saharan
Africa (SSA) results from helminth infections. Hookworm infection
occurs in almost half of the poorest people in SSA, including 40–50 million
school-aged children and 7 million pregnant women, in
whom it is a leading cause of anemia. Schistosomiasis (192 million
cases) is the second most prevalent NTD in SSA after hookworm infection;
these cases account for 93% of the world’s schistosomiasis cases, and the
disease is possibly associated with increased horizontal
transmission of HIV/AIDS. Lymphatic filariasis (46–51 million
cases) and onchocerciasis (37 million cases) are also widespread in
SSA, each disease representing a significant cause of disability and
reduction in the region’s agricultural productivity. The disease
burden estimate in disability-adjusted life years (DALYs) for total
helminth infections in SSA is 5.4–18.3 million in comparison to
40.9 million DALYs for malaria and 9.3 million DALYs for
tuberculosis. Yet, research into helminth infections has not
received nearly the same level of support. This is partly because
helminthiases are diseases of the poorest people in the poorest
regions, but also because these pathogens are difficult to study in
the laboratory by comparison to most model eukaryotes and many
other pathogens. Standard tools and approaches, including cell
lines, culture in vitro, and animal models, are generally lacking. In
addition, the genomes of helminths are generally much more
complex than those of model organisms like yeast and fruit flies
[2].
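To make the burden comparison above concrete, the short sketch below expresses the helminth DALY range as a fraction of the malaria and tuberculosis figures. The numbers are those quoted in the text; the function name `burden_ratio_range` is hypothetical and introduced only for illustration.

```python
# DALY estimates (millions) for sub-Saharan Africa, as quoted in the text above.
helminth_low, helminth_high = 5.4, 18.3
comparators = {"malaria": 40.9, "tuberculosis": 9.3}


def burden_ratio_range(low, high, reference):
    """Return the (low, high) helminth burden as a fraction of a reference burden."""
    return low / reference, high / reference


ratios = {
    disease: burden_ratio_range(helminth_low, helminth_high, dalys)
    for disease, dalys in comparators.items()
}
for disease, (low_ratio, high_ratio) in ratios.items():
    print(f"helminths vs {disease}: {low_ratio:.0%} to {high_ratio:.0%}")
```

On these figures the helminth burden is roughly 13% to 45% of the malaria burden and roughly 58% to 197% of the tuberculosis burden, which is the sense in which the two are comparable in scale.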
Whereas helminth diseases are ancient scourges of humanity,
with some known from biblical times, most can also be considered
as re-emerging diseases in the sense that new outbreaks are
reported routinely in response to environmental and sociopolitical
changes [3]. For example, schistosomiasis has reemerged repeatedly
in Africa in recent times in response to hydrological changes,
such as the construction of dams, irrigation canals, and reservoirs, that
establish suitable new environments for the intermediate host
snails that transmit the parasites. Schistosomiasis has also
reemerged in mountainous and hilly regions in Sichuan, China,
where it had been controlled previously by intensive interventions
[4]. Furthermore, new strains of schistosomes are indeed emerging
through natural hybridizations between human and cattle species
of schistosomes [5].
Despite the difficulties with investigation of helminth parasites,
new insights into fundamental helminth biology are accumulating
through genome projects and the application of genome
manipulation technologies including RNA interference and
transgenesis (Figure 3). What’s more, research on immunology
of helminth infections has contributed enormously to our
understanding of Th2 immune responses, the function of
regulatory T cells, generation of alternatively activated macrophages,
and the transmission dynamics of infectious agents. It is
hoped that this progress can be translated into new and robust
drugs, diagnostics, and vaccines for the helminth diseases
of humanity and those of our livestock and companion species
[1,6–10].
This article is part of the “Genomics of Emerging Infectious Disease” PLoS Journal collection (http://ploscollections.org/emerginginfectiousdisease/).
Citation: Brindley PJ, Mitreva M, Ghedin E, Lustigman S (2009) Helminth Genomics: The Implications for Human Health. PLoS Negl Trop Dis 3(10): e538. doi:10.1371/journal.pntd.0000538
Editor: Matty Knight, Biomedical Research Institute, United States of America
Published October 26, 2009
Copyright: © 2009 Brindley et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Funding: Support from the NIH-NIAID, award numbers R01AI072773 (to PJB) and R01AI081803 (to MM) is gratefully acknowledged. The funder had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Competing Interests: The authors have declared that no competing interests exist.
* E-mail: [email protected]
www.plosntds.org 1 October 2009 | Volume 3 | Issue 10 | e538
Genomics Approaches to Investigating Helminths
Over the past decade, increasing numbers of helminth-specific
genome sequences have become available due to ever-improving
techniques for obtaining biological material, extracting RNA and
DNA, constructing complementary DNA (cDNA)/whole genome
shotgun libraries, and, especially, major advances in the chemistry
and instrumentation for DNA sequencing and its concomitant
decreased cost. Helminth genomics began with the generation and
analysis of transcribed sequences (expressed sequence tags [ESTs]
[11]), which has proved to be a rapid and cost-effective route to
discover genes in other eukaryotes. In April 2009, there were
approximately 550,000 nematode and 450,000 platyhelminth ESTs in the
dbEST division of GenBank, excluding those from the model
nematode Caenorhabditis elegans. Of these, 60% were from
parasites of humans and closely related animal pathogens used
to study human infections (Table 1). These ESTs have many
applications. They can be used to annotate helminth genomes (see
below) to determine alternative splicing, verify open reading
frames, and confirm exon/intron and gene boundaries. They are
valuable also, for example, in functional genomics to design probes
for expression microarrays (e.g., [12]) and to provide putative
protein sequence information for proteomics methods (e.g., [13]),
to name but a few applications. Quantitative analysis of ESTs
(transcriptomics), including serial analysis of gene expression (SAGE), can
identify transcripts that are either over- or under-represented by
comparison to other transcripts in various helminth life cycle
stages or tissues (e.g., [14,15]), and evaluating the resulting subset of genes
with gene ontology programs provides insights into cellular and
metabolic pathway functioning in the parasite (e.g., [16]).
Furthermore, one can identify potential targets for interventions
by applying a hierarchy of considerations including a matrix of
biological, expression, and phenotypic data [17] or by performing
a pan-phylum analysis to identify conserved parasite-specific genes
whose selective targeting will have low or no toxicity to the host
[18,19] or genes that have diverged enough from the host
counterpart, resulting in altered or absent functions [20].
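The quantitative comparison of ESTs between life cycle stages described above can be sketched as follows. This is a minimal illustration, not the method used in the cited studies: the cluster names and counts are invented, the function name `log2_fold_changes` is hypothetical, and the twofold cutoff is an arbitrary assumption. A real analysis would also apply a statistical test to each cluster.

```python
import math


def log2_fold_changes(counts_a, counts_b, pseudocount=0.5):
    """Compare EST counts per cluster between two libraries.

    Counts are divided by library size so that libraries of different
    depth are comparable; the pseudocount avoids division by zero for
    clusters that are absent from one library.
    """
    total_a = sum(counts_a.values())
    total_b = sum(counts_b.values())
    changes = {}
    for cluster in sorted(set(counts_a) | set(counts_b)):
        freq_a = (counts_a.get(cluster, 0) + pseudocount) / total_a
        freq_b = (counts_b.get(cluster, 0) + pseudocount) / total_b
        changes[cluster] = math.log2(freq_a / freq_b)
    return changes


# Hypothetical EST counts per transcript cluster in two stage-specific libraries.
larva = {"cluster_1": 40, "cluster_2": 5, "cluster_3": 55}
adult = {"cluster_1": 10, "cluster_2": 20, "cluster_3": 70}

changes = log2_fold_changes(larva, adult)
# Assumed cutoff: at least a twofold difference in either direction.
over = [c for c, lfc in changes.items() if lfc >= 1.0]
under = [c for c, lfc in changes.items() if lfc <= -1.0]
print("over-represented in larva:", over)
print("under-represented in larva:", under)
```

Clusters flagged in this way are the candidates that would then be passed to gene ontology tools or to the target prioritization schemes mentioned in the paragraph above.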
The first genome of a multicellular organism to be sequenced was that of the
free-living roundworm C. elegans [21]; reported in 1998, it is still the
only metazoan for which the sequence of every nucleotide is
known with high confidence. Today, the genome sequences of 22
species of helminths that either infect humans or are closely related
parasites are completed or underway (Table 1). A comprehensive
genome analysis has been published for several of them, including
the lymphatic filarial nematode Brugia malayi [22], the dog
hookworm Ancylostoma caninum [23], and the blood flukes
Schistosoma japonicum and S. mansoni [24,25] (Figure 1; Table 1).
Figure 1. Montage of some of the major human helminth parasites, their developmental stages, and disease pathology. (A) Microfilaria of Brugia malayi in a thick blood smear, stained with Giemsa (http://www.dpd.cdc.gov/dpdx/html/frames/a-f/filariasis/body_Filariasis_mic1.htm); the microfilaria is about 250 µm in length. (B) Patient with lymphedema of the left leg due to lymphatic filariasis (http://www.cdc.gov/ncidod/dpd/parasites/lymphaticfilariasis/index.htm). (C) Hookworm egg passed in the stool of an infected person; the microscopic egg, barrel-shaped with a thin wall, is about 70 × 40 µm in dimension. (D) Longitudinal section through an adult hookworm attached to the wall of the small intestine, ingesting host blood and mucosal wall. The parasite is about 1 cm in length. (E) Eggs of Schistosoma mansoni. The egg is about 150 × 50 µm in dimension; the lateral spine is diagnostic for S. mansoni in comparison to the other human schistosome species. Fibrotic responses to schistosome eggs trapped in the intestines, liver, and other organs of the infected person are the cause of the schistosomiasis pathology and morbidity. (F) A pair of adult worms of the blood fluke Schistosoma mansoni; the more slender female worm resides in the gynecophoral canal of the thicker male. The worms are about 1.5 cm in length and live for many years (http://www.dpd.cdc.gov/dpdx/HTML/ImageLibrary/Schistosomiasis_il.htm).
doi:10.1371/journal.pntd.0000538.g001
Some of the main obstacles to research on human parasites are
their life cycle complexity, tissue complexity, and the paucity of
genetic and transgenic methods for manipulating genes of interest.
Comparative genome analyses have also provided insights into the
adaptations of various parasites to niches in their human (and
vector) hosts as well as insights into the molecular basis of the
mutualistic relationship between the filarial nematode B. malayi
and its endosymbiont Wolbachia (see below).
The genomes of the schistosomes S. japonicum and S. mansoni are
the first complete genomes reported for members of the
Lophotrochozoa [24,25], a large taxon that includes about 50%
of all metazoan phyla including the mollusks, annelids, brachiopods,
nemerteans, bryozoans, platyhelminths, and others [26].
These schistosome genome sequences revealed remarkable
features of the host–parasite relationship. Among these, the
schistosome genome has lost numerous protein-encoding domains.
Whereas the total number (approximately 6,000) of protein families is broadly
similar among schistosomes, humans, C. elegans, and fruit fly, about
1,000 protein domains have been abandoned by S. japonicum,
including some involved in basic metabolic pathways and defense,
implying that loss of these domains could be a consequence of the
adoption of a parasitic way of life. If so, the remaining molecular
repertoire must have evolved in parallel with this extensive domain
loss to permit the pathogen to locate and infect humans efficiently,
nourish itself, and interact with the external environment as well as
with the host. On the other hand, despite extensive gene and
domain loss, a number of schistosome gene families have
expanded and these provide insights into the requirements for a
parasitic lifestyle. Among the expanded gene families, a metalloprotease
called invadolysin (or leishmanolysin) has at least 12
putative family members in schistosomes compared to a single
orthologue in the human, fruit fly, and C. elegans genomes and only
three in the free-living flatworm Schmidtea mediterranea. This protease
family may facilitate skin penetration and tissue invasion by the
cercaria, the infective-stage larva of the schistosome [24,25].
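The invadolysin comparison above amounts to a simple fold-expansion calculation, sketched below. The family sizes are those quoted in the text, with "at least 12" entered as 12; averaging over the reference species is an assumption made for illustration, and the function name `expansion_fold` is hypothetical.

```python
# Invadolysin family sizes as quoted in the text; "at least 12" is entered as 12.
family_sizes = {
    "schistosome": 12,
    "human": 1,
    "fruit fly": 1,
    "C. elegans": 1,
    "free-living flatworm": 3,
}


def expansion_fold(sizes, focal, references):
    """Return the focal family size divided by the mean size in the reference species."""
    mean_reference = sum(sizes[name] for name in references) / len(references)
    return sizes[focal] / mean_reference


references = [name for name in family_sizes if name != "schistosome"]
fold = expansion_fold(family_sizes, "schistosome", references)
print(f"invadolysin family: {fold:.1f}-fold larger than the reference mean")
```

Applied across all gene families in a genome, a ratio of this kind gives a first-pass list of expanded families, although a proper analysis would account for phylogeny and annotation quality.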
Publication of genome sequences of filaria and schistosomes has
underscored the pressing need to develop functional genomics
approaches for these significant pathogens. Functional analyses—
which use approaches such as RNA interference (RNAi) and
translational studies—are essential to resolve uncertainties in the
molecular physiology of helminths and to illuminate mechanisms
of pathogenesis that may lead to development of new interventions
to control and eliminate these parasites or the diseases. Progress in
the functional genomics of helminths was reviewed recently
[6,27,28]. In brief, RNAi has been used to inactivate the RNA
products of several genes in schistosomes (e.g., [29–32]) and
nematodes (e.g., [33]; reviewed in [8]). In addition, the recent
genome sequences of S. mansoni and S. japonicum now make feasible
genome-scale investigation of transgene integration into schistosome
chromosomes. Gene therapy–like approaches to transform
schistosomes include the use of the piggyBac transposon and
pseudotyped murine leukemia retrovirus as transgene vectors
[34–36] (Figure 3A), both of which offer a means to establish
transgenic lines of schistosomes, to elucidate schistosome gene
function and expression, and to advance functional genomics
approaches for these parasites. Notably, progress is also being
made to express reporter transgenes in parasitic nematodes
including Strongyloides stercoralis [37], in which transgene approaches
developed for use in C. elegans have recently been used to
demonstrate that morphogenesis of infective larvae requires the
DAF-16 orthologue FKTF-1 (Figure 3B) [38]. Progress is also
being made with systems for analysis of promoter sequences of
genes of parasitic nematodes (e.g., [39]).
Figure 2. Phylogeny of the major taxa of human helminths (nematodes and platyhelminths) as established by maximum likelihood (ML) analysis of 18S ribosomal RNA from 18 helminth species. Sequences were aligned using ClustalX [93]. The topology of the tree was derived from a consensus tree by neighbor-joining–based bootstrapping, its branch lengths were computed using a ML-based method, and it was rooted with the orthologue from the brewer’s yeast, Saccharomyces cerevisiae. The branch lengths of the phylogenetic tree were computed using DNAML (PHYLIP package [94]) by allowing rate variation among sites. The headings Chromadorea, Enoplea, Trematoda, and Cestoda are major classes of the phyla Nematoda and Platyhelminthes. The GenBank accession numbers of aligned sequences are DQ118536.1 (Trichuris trichiura), AY851265.1 (Trichuris suis), AF036637.1 (Trichuris muris), AY497012.1 (Trichinella spiralis), U94366.1 (Ascaris lumbricoides), AF036587.1 (Ascaris suum), AF036588.1 (Brugia malayi), AJ920348.1 (Necator americanus), AJ920347.2 (Ancylostoma caninum), AF036597.1 (Nippostrongylus brasiliensis), X03680.1 (Caenorhabditis elegans), AF036605.1 (Strongyloides ratti), U81581.1 (Strongyloides ratti), AB453329.1 (Strongyloides ratti), AF279916.2 (Strongyloides stercoralis), AB453315.1 (Strongyloides stercoralis), M84229.1 (Strongyloides stercoralis), EU011664.1 (Saccharomyces cerevisiae), U27015.1 (Saccharomyces cerevisiae), DQ157224.1 (Taenia solium), AF229852.1 (Clonorchis sinensis), Z11590.1 (Schistosoma japonicum), Z11976.1 (Schistosoma haematobium), U65657.1 (Schistosoma mansoni).
doi:10.1371/journal.pntd.0000538.g002
Many future discoveries resulting from parasitic helminth
genome information can be expected to emanate from the broader
scientific community rather than from the laboratory that originated a
genome sequence project. For the specialized genome sequence
labs, dissemination of sequence information in a way that is useful,
consistent, centralized, and lasting has therefore been a key goal.
Efforts have gone well beyond depositing raw data in public
databases. Currently, helminthologists have available a number of
specialized sites for sequence analysis. C. elegans information is
easily accessible at http://www.wormbase.org [40]. Useful
information about the organism includes genome sequence,
genetic and physical maps, transcript data (EST, mRNA, SAGE,
TEC-RED, ORFeome, expression patterns from reporter gene
fusions, and microarrays), the developmental lineage of all cells,
connectivity of the nervous system, mutant phenotypes and genetic
markers, gene expression described at the level of single cells, 3D
protein structures, NCBI Clusters of Orthologous Groups, and
apoptosis and aging information. It also contains extensive
information from large-scale genomics analyses, including precomputed
sequence similarity searches, protein motif analyses,
protein–protein interactions, findings from systematic RNAi
screens, single nucleotide polymorphisms (SNPs), orthologous
and paralogous relationships, and the assignment of Gene
Ontology (GO) terms to gene products. These resources greatly
aid in the interpretation of much of the sequence data emerging
from parasitic helminths.
However, accumulating evidence suggests that C. elegans is not a
good model for all parasitic helminths, especially for the ones that
are phylogenetically very distant such as the basal nematode and
zoonotic parasite Trichinella spiralis (e.g., [41]). The other
specialized site is Nematode.net (http://www.nematode.net)
[42], developed with a primary aim to disseminate the diverse
collection of information for parasitic nematodes to the broader
scientific community in a way that is useful, consistent, centralized,
and enduring. In addition to sequence data, the site hosts
assembled NemaGene clusters in GBrowse views, characterizing
composition and protein homology, functional Gene Ontology
annotations presented via the AmiGO browser, KEGG-based
graphical display of NemaGene clusters mapped to metabolic
pathways, codon usage tables, NemFam protein families (which
represent conserved nematode-restricted coding sequences not
found in public protein databases), and a Web-based WU-BLAST
search tool that allows complex querying and other assorted
resources. Furthermore, Nematode.net, by connecting data across
the entire phylum Nematoda, has made a substantial contribution
toward integrating the historically separate fields of C. elegans,
vertebrate parasitology, and plant parasitology research. Finally,
Figure 3. Some recent approaches to expressing transgenes in human helminths. (A) Luciferase activity in Schistosoma mansoni larvae (schistosomules) after transduction with a pseudotyped retrovirus that expresses the luciferase reporter gene. Anti-luciferase antibody staining of schistosomules three days after exposure to pseudotyped lentivirus carrying the firefly luciferase transgene. Schistosomules examined by confocal laser microscopy; (i) bright field, (ii) fluorescence red channel, (iii) merged images. Control non-transformed worms showed only background levels of fluorescence (not shown; see [34–36] for relevant hypotheses and experimental methods). (B) Recent studies on transgenic Strongyloides stercoralis indicated that morphogenesis of the infective L3 stage larva requires the DAF-16 orthologue FKTF-1 [38]. L3s of this parasitic nematode were transfected with plasmids carrying the transgene fktf-1b::gfp::fktf-1b and examined by fluorescence microscopy. (i, ii) Transgenic first-stage larvae express green fluorescent protein (GFP) in the procorpus (arrow) of the pharynx. (iii, iv) A first-stage larva (L1) expresses the GFP::FKTF-1b(wt) transgene in the hypodermis. (v, vi) An infective L3 expresses the GFP::FKTF-1b(wt) fusion protein in the hypodermis and in a narrow band in the pharynx (arrow). Scale bars, 10 µm. Adapted from [38]. doi:10.1371/journal.pntd.0000538.g003
Table 1. Human parasitic helminths (and their close relatives) with genome sequencing projects completed or underway.

| Phylum or Class | Species | Common Name/Disease | Primary host | Genome size, Mb | GenBank Project ID | cDNAs (3730 ABI), 1,000s | Genome Sequencing Status | Sequencing Institute^a |
|---|---|---|---|---|---|---|---|---|
| Nematoda (roundworms): Clade V^b | Necator americanus | Hookworm/necatoriasis | Human | — | 20369 | 5 | In progress | WUGC |
| | Ancylostoma caninum | Model hookworm | Dog | 344 | 12841 | 81 | Improving draft | WUGC |
| | Nippostrongylus brasiliensis | Model hookworm | Rat | — | 20445 | 14.7 | In progress | SI |
| Nematoda: Clade IV | Strongyloides stercoralis | Threadworm/strongyloidiasis | Human | — | — | 11.4 | In progress | SI |
| | S. ratti | Model threadworm | Rat | — | — | 27.4 | In progress | SI/WUGC |
| Nematoda: Clade III | Ascaris lumbricoides | Large roundworm/ascariasis | Human | 230 | — | 1.8 | In progress | SI |
| | A. suum | Model large roundworm | Pig | 230 | — | 55.7 | Improving draft | WUGC/SI |
| | Brugia malayi | Filaria/lymphatic filariasis | Human | 96 | 9549 | 26.2 | Improving draft | TIGR/University of Pittsburgh |
| | Loa loa | Filaria/loaisis (cutaneous filariasis)/African eye worm | Human | — | — | 3.3 | In progress | BI |
| | Onchocerca volvulus | Filaria/river blindness | Human | 150 | — | 15 | In progress | SI |
| | Acanthocheilonema viteae | Model filaria | Rodent | — | 33239 | 0 | In progress | UMIGS |
| Nematoda: Clade I | Trichinella spiralis | Trichina worm/trichinosis | Pig to human | 71 | 12605 | 25.3 | Draft completed | WUGC |
| | Trichuris trichiura | Whipworm/trichuriasis | Human | — | — | 0 | In progress | SI |
| | T. muris | Model whipworm | Mouse | 96 | — | 7 | In progress | SI |
| | T. suis | Model whipworm | Pig | — | — | 0 | In progress | WUGC |
| Cestoda (tapeworms) | Echinococcus multilocularis | Tapeworm/alveolar hydatidosis | Rodent; larva infects humans | 150 | — | 1 | In progress | SI |
| | E. granulosus | Tapeworm/unilocular hydatidosis | Canids; larva infects humans | 150 | 12620 | 10 | In progress | SI |
| | Taenia solium | Pork tapeworm/taeniasis/cysticercosis | Human | 270 | 17815 | 25 | Draft completed | Mexico City |
| Trematoda (flukes) | Schistosoma mansoni | Blood fluke/intestinal schistosomiasis | Human | 390 | 12599 | 206 | Draft completed | SI/TIGR |
| | S. haematobium | Blood fluke/urinary schistosomiasis | Human | — | 12616 | 0 | In progress | SI |
| | S. japonicum | Blood fluke/intestinal schistosomiasis | Human | 400 | 29491 | 104 | Draft completed | CNHGC |
| | Clonorchis sinensis | Liver fluke/clonorchiasis | Human | — | 17975 | 3 | In progress | SNUCM |

^a WUGC, Washington University’s Genome Center.
^b Phylogeny based on Blaxter et al. [47].
BI, Broad Institute; CNHGC, Chinese National HGC; SI, Sanger Institute; SNUCM, Seoul National University College of Medicine; TIGR, The Institute for Genomic Research (now JCVI).
doi:10.1371/journal.pntd.0000538.t001
Nembase (http://www.nematodes.org [43]) also offers access to
parasite sequence and tools such as visualization of clusters by
stage of expression.
While each of these databases has been challenged by the
requirement to support the influx of new genomes and related
data, they nonetheless provide user communities with innovative
features and tools suited to their needs that are beyond the scope of
the large sequence repositories. For flatworms (Figure 2), it is
notable that public genome annotation and analysis tools are
already in place, including SchistoDB (http://schistoDB.net/), a
genomic database for S. mansoni that incorporates sequences and
annotation [44], and SjTPdb (http://function.chgc.sh.cn/sj-proteome/index.htm),
an integrated transcriptome and proteome
database and analysis platform for S. japonicum [45]. The genome
database for the planarian Schmidtea mediterranea, a model free-living
platyhelminth, can be expected to be advantageous to comparative
genome projects and specific research problems for the growing
number of parasitic flatworms that now are or will be subjects of
genome sequence analysis. In addition, because of the phyloge-
netic position of planarians as early bilaterian metazoans,
SmedGD (http://smedgd.neuro.utah.edu) will prove useful not
only to planarian research, but also to investigations on
developmental and evolutionary biology, comparative genomics
(specifically with parasitic flatworms including flukes and
tapeworms), stem cell research, and regeneration [46].
Evolution of Parasitism in Helminths
Genomics research has helped our understanding of the
evolution of helminths of humans and other hosts, certainly with
regard to roundworms of the phylum Nematoda. The first
comprehensive study of the molecular evolution of helminths
was a phylogenetic analysis of the small subunit ribosomal DNA (ss
rDNA) sequences from 53 roundworms [47]. This study included
both major parasitic and free-living taxonomic groups. It identified
five major clades within the Nematoda and suggested that
parasitism of animals and plants arose independently multiple
times. A more recent study included 339 nearly full-length ss
rDNAs and proposed subdivision of the phylum into 12 clades
[48]. This revealed that nematodes that feed on fungi occupy a
basal position compared to their plant parasite relatives,
confirming that the parasitic nematodes of plants arose from
fungivorous ancestors. Phylogenetic methods are also being used
to study evolution of parasitism-related protein-coding genes (such
as the enzymes that degrade the plant cell wall in nematode
parasites of plants [cellulases, pectate lyases, etc.]) to understand
better the mechanisms underlying the evolution of parasitism
(reviewed in [49]). Recent genome-wide analysis of two plant
parasitic nematodes [50,51] provided a more complete picture of
the acquisition of these cellulase genes, apparently by horizontal
gene transfer (HGT) from prokaryotes. The subsequent expansion
and diversification of HGT genes in these nematodes allow
inferences about the evolutionary history of these parasites, and in
addition present potential targets for anti-nematode drugs. When
the genome of the necromenic nematode Pristionchus pacificus was
reported recently, it became clear that cellulases were not
restricted to plant parasitic nematodes; their presence in this
species indicated preadaptation for parasitism of animals [52],
consistent with the intermediate evolutionary position of Pris-
tionchus between the microbivorous C. elegans and the animal
parasitic nematodes. In like fashion to evolution of parasitism
among nematodes, we can predict that additional analyses of
parasitic and free-living flatworm genomes will provide deeper
insights into how and when parasitism evolved in the phylum
Platyhelminthes, particularly in comparison to the fresh-water
planarian S. mediterranea, a non-parasitic flatworm for which a draft
genome is available [53]. In addition to evolution of parasitism of
humans and other vertebrate hosts, helminth parasite genome
sequences will also facilitate evolutionary studies on the role of
intermediate hosts/vectors such as the snail in schistosome
infections and the mosquito in filarial infections in this evolution.
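As an illustrative aside on the ss rDNA phylogenies discussed above [47,48], the following minimal Python sketch computes pairwise p-distances (the proportion of differing sites) between aligned sequences, a common first step in distance-based tree building. It is not the method used in the cited studies; the taxon labels, toy sequences, and function names are hypothetical placeholders, and a real analysis would start from a curated alignment and use dedicated phylogenetics software.

```python
from itertools import combinations
from typing import Dict, Tuple


def p_distance(seq_a: str, seq_b: str) -> float:
    """Proportion of differing sites between two aligned sequences.

    Columns where either sequence has a gap ('-') are skipped.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    differing = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a == "-" or b == "-":
            continue  # ignore gap columns
        compared += 1
        if a != b:
            differing += 1
    if compared == 0:
        raise ValueError("no comparable sites")
    return differing / compared


def distance_matrix(alignment: Dict[str, str]) -> Dict[Tuple[str, str], float]:
    """Pairwise p-distances for every pair of taxa (keys sorted by name)."""
    return {
        (x, y): p_distance(alignment[x], alignment[y])
        for x, y in combinations(sorted(alignment), 2)
    }


# Hypothetical toy alignment; these are not real ss rDNA sequences.
toy_alignment = {
    "taxon_A": "ACGTACGTAC",
    "taxon_B": "ACGTACGTTC",
    "taxon_C": "ACG-ACGATC",
}

distances = distance_matrix(toy_alignment)
for pair, dist in distances.items():
    print(pair, round(dist, 3))
```

A matrix of this kind could then be passed to a clustering method such as neighbor joining; the choice of distance measure and tree-building method is a separate modeling decision that this sketch does not address.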
Host–Parasite Relationships
Investigations of regulatory networks involved in the embryonic
development, organogenesis, and reproduction of
helminths based on newly available genome sequences have
revealed the presence of well-characterized signaling pathways,
including those for Wnt, Notch, Hedgehog, and transforming
growth factor β (TGF-β). These pathways can be recognized in the
B. malayi and schistosome genomes [22,24,25]. The genomes also
encode endogenous hormones, including epidermal growth factor (EGF)-like
and fibroblast growth factor (FGF)-like peptides. Predicted
components of the Ras–Raf–MAPK and TGF-β–SMAD signaling
pathways encoded by these genomes (including FGF and EGF
receptors), for example, share high
sequence identity with their mammalian orthologs, implying that
schistosomes or filarial worms, in addition to utilizing their own
pathways, might exploit host growth factors as developmental
signals.
Immune regulation by helminth parasites includes suppression,
diversion, and alteration of the host immune response, resulting in
an anti-inflammatory environment that is favorable to parasite
survival. For example, chronic infections induce key changes in
host immune cell populations including dominance of the T-helper
2 (Th2) cells and selective loss of effector T cell activity, against a
background of regulatory T cells, alternatively activated macro-
phages, and Th2-inducing dendritic cells [54,55]. With advances
in genomics, numerous parasite-derived proteins, including
cytokine homologs, protease inhibitors, and an intriguing set of
novel products, as well as glycoconjugates and small lipid moieties,
have been discovered with known or hypothesized roles in
immune interference [56–61]. These studies suggest that secreted
parasite products interfere with different arms of the immune
system by influencing the cytokine network and signal transduc-
tion pathways or by inhibiting essential enzymes. Using bioinfor-
matics to compare the predicted proteome of B. malayi to proteins
implicated in the immune response (interleukins, chemokines, and
other signaling molecules), potential immune modulators produced
by the filarial parasite have been identified, including genes
encoding the macrophage migration inhibition family of signaling
proteins [62]. Furthermore, the genome of the blood fluke S.
mansoni encodes a large array of paralogues of fucosyl and
xylosyltransferases [25] that are involved in generating novel
glycans at the host–parasite interface and could have an important
role in subverting the host immune system. A recent
comprehensive review summarizes our current understanding of
the growing number of individual helminth mediators that target
key receptors or pathways in the mammalian immune system [63].
Helminth infection can have a broad impact on the entire
immune system. Infection with trematode and nematode parasites,
for example, correlates with a reduced incidence of atopic,
allergic-type disorders [64]. Thus, helminth infection might
potentially be useful as a novel therapy for allergic or autoimmune
diseases [65]. Recently, worms, eggs, or purified nematode
parasite protein have been used in preclinical and clinical trials
to protect humans from allergy and autoimmunity (reviewed in
[66–70]), including Crohn’s disease and ulcerative colitis [71,72].
Other studies have shown that substances produced by helminths,
for example Ascaris suum, Nippostrongylus brasiliensis, and Acanthochei-
lonema viteae, can directly interfere with allergic responses or with
development of allergen-specific Th2 responses [73–75]. ES-62, a
molecule secreted by the filarial nematode A. viteae, directly inhibits
the FcεRI-induced release of mediators from mast cells, protects
against mast-cell–dependent hypersensitivity in skin and lungs [76]
and inhibits collagen-induced arthritis [77]. Research is underway
to develop molecules that mimic the activity of ES-62 as drugs for
allergic and autoimmune diseases [66]. Other helminth-derived
products have the potential to reduce allergic responses. These
products include schistosomal lysophosphatidylserine (lyso-PS)
[61] and thioredoxin peroxidase from the liver fluke Fasciola
hepatica [78]. These findings demonstrated that helminths produce
products that can interfere with both the development of allergic
responses and the workings of host effector mechanisms.
The “Dependent” Helminth
As a consequence of evolution of an obligatory parasitic
existence, helminth parasites are dependent upon their interme-
diate and definitive hosts for many necessities including nutrients
such as amino acids; filariae are dependent on insect vectors to
transport them to the host. The newly available genome sequences
for schistosomes and B. malayi have confirmed earlier biochemical
studies that had revealed aspects of physiological/biochemical
dependence of these parasites on the host. For example,
schistosomes cannot synthesize fatty acids, sterols,
purines, or nine human essential amino acids plus arginine or
tyrosine de novo, and must catabolize complex precursors obtained from
their hosts. Loss or degeneracy of fatty acid, sterol, and purine
synthesis pathways in schistosomes likely relates to the adoption of
a parasitic lifestyle; it is notable that genes encoding all the key
enzymes for both the de novo fatty acid and purine syntheses are
complete in the (free-living) planarian S. mediterranea. To obtain
essential lipid nutrients, the schistosome genome encodes trans-
porters, including apolipoproteins, low-density lipoprotein recep-
tor, scavenger receptor, fatty-acid-binding protein, ATP-binding-
cassette transporters and cholesterol esterase, to exploit fatty acids
and cholesterol from host blood [25,79].
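The kind of pathway comparison described above can be sketched, under strong simplifying assumptions, as a set comparison between the steps a pathway requires and the functions annotated in a genome. The step names, annotation sets, and the function name `missing_steps` below are hypothetical placeholders introduced only for illustration; they do not represent real gene content for any species.

```python
from typing import Dict, Iterable, List


def missing_steps(pathway_steps: Iterable[str],
                  annotated_functions: Iterable[str]) -> List[str]:
    """Return the pathway steps with no matching annotation, in pathway order."""
    annotated = {name.strip().lower() for name in annotated_functions}
    return [step for step in pathway_steps
            if step.strip().lower() not in annotated]


# Hypothetical pathway definition and annotation sets (placeholders, not real data).
pathway = ["step_1", "step_2", "step_3", "step_4"]
annotations: Dict[str, List[str]] = {
    "free_living_genome": ["step_1", "step_2", "step_3", "step_4", "other_function"],
    "parasite_genome": ["step_2", "other_function"],
}

# For each genome, list which pathway steps lack an annotated function.
gaps = {genome: missing_steps(pathway, functions)
        for genome, functions in annotations.items()}

for genome, absent in gaps.items():
    status = "complete" if not absent else "missing " + ", ".join(absent)
    print(f"{genome}: {status}")
```

An absent annotation is only weak evidence of an absent pathway, since draft assemblies and annotation gaps can hide genes; in practice such a result would be a prompt for closer inspection rather than a conclusion.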
Many species of filarial nematodes are themselves infected by
the endosymbiotic bacterium Wolbachia. The genome sequence of
the Wolbachia species that infects the filarial nematode B.
malayi (wBm) [80] helped establish which metabolites the
bacterium potentially provides to the nematode (riboflavin, flavin
adenine dinucleotide, heme, and nucleotides, for example) and
which are provided by the nematode to the endobacterium
(notably, amino acids). This type of information has opened up the
exciting possibility that drugs already registered for human use
might inhibit key biochemical pathways in Wolbachia that could
sterilize or kill the adult worms. Although the Wolbachia genome is
even more degenerate than that of the related pathogen Rickettsia,
it has retained more intact metabolic pathways than Rickettsia. This
may be important in its biochemical contribution to host (i.e.,
filarial) viability and fecundity.
The wBm genome encodes many more proteases and
peptidases than Rickettsia, which likely degrade host proteins in
the extracellular environment. Other proteins encoded by wBm
include a common type IV secretion system, as used by some
pathogenic gram-negative bacteria to transfer plasmids and
proteins into surrounding host cells, and an abundance of ankyrin
domain-containing proteins, which might regulate host gene
expression, as suggested for Ehrlichia phagocytophila AnkA [81], as
well as several proteins predicted to localize on the cell surface.
Ankyrin domain–containing proteins are noteworthy because of
their roles in protein–protein interactions in a variety of cellular
processes. A number of other wBm molecules are of interest as
potential drug targets. For example, glutathione biosynthesis genes
may provide glutathione for the protection of the filaria from
oxidative stress or human immunological effector molecules.
Heme produced from wBm (all five synthesis genes are present)
could be vital to worm embryogenesis, as there is evidence that
molting and reproduction are controlled by ecdysteroid-like
hormones [82], synthesis of which requires heme. Depletion of
Wolbachia might therefore halt production of these hormones and
block molting and/or embryogenesis in B. malayi. Most, if not all,
nematodes, including B. malayi, appear to be unable to synthesize
heme, but must obtain it from extraneous sources, such as the host,
the food supply, or perhaps from endosymbionts.
Challenges for the Future
The filarial and schistosome genome sequences now available
provide the vanguard for assembly of a genome sequence catalog
of the numerous other neglected helminth parasites (Table 1).
Comparative genomics will likely be a dominant approach to
organize, interpret, and utilize the vast amounts of genomic
information anticipated from the genomes of these parasites (e.g.
[83,84]). In terms of sequencing tools, the new generation of
“massively parallel” sequencing platforms commercially available
today (such as the Roche/454 pyrosequencer [85], Illumina/
Solexa [86], and SOLiD [87]) offer on the order of 100- to 1,000-fold
increases in throughput over the Sanger sequencing
technology [88] on capillary electrophoresis instruments. This
rapid change to producing millions of DNA sequence reads in a
short time will have a huge impact on research on NTDs. Each
platform has a specific application: while the Roche/454 is
optimal for in-depth analysis of whole transcriptomes and de novo
sequencing of bacterial and small eukaryotic genomes, the
Illumina and SOLiD systems are highly attractive for resequencing
projects aimed at identifying genetic variants (mutations, inser-
tions, deletions), profiling and discovering noncoding RNAs
(ncRNAs), and studying epigenetic modifications of histones and
DNA. With the increased read length and improved error rate of
massively parallel pyrosequencing technology, de novo sequencing
of helminth genomes has become possible at a fraction of earlier
costs. In the next five years, projects at Washington
University’s Genome Center (http://www.genome.gov/10002154)
and the Wellcome Trust Sanger Institute
(http://www.sanger.ac.uk/Projects/Helminths/) should increase the
available sequence data on human helminths and their close
relatives by an order of magnitude, adding more than 20 draft
genomes to those listed in Table 1.
Once these reference genomes become available, sequencing of
clinical isolates is expected to accelerate. Sequencing of the clinical
strains and strain-to-reference comparisons can be performed
using platforms such as Illumina/Solexa and SOLiD to investigate
genome-wide polymorphism and provide a comprehensive picture
of natural helminth genome variation. These approaches should
also be valuable for exploring genetic changes involved in
resistance to anti-worm drugs and understanding the potential
mechanisms of drug resistance in human parasites, and can be
expected to facilitate development of genetic markers to monitor
and manage any future appearance and spread of drug resistance.
These phenomena are of tremendous importance, particularly
since some major neglected helminth diseases are being targeted in
mass drug treatment campaigns [89]. In addition, the new
generation of sequencing technologies has also provided
unprecedented opportunities for high-throughput functional genomic
research (reviewed in [90]) that awaits application to helminth
research.
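To make the idea of a strain-to-reference comparison concrete, the following minimal Python sketch lists single-base substitutions between a reference sequence and an isolate sequence that have already been aligned to the same length. It is a simplified illustration only: the sequences and the function name `find_substitutions` are hypothetical, and real resequencing pipelines work from short reads with base-quality filtering, read mapping, and statistical variant calling.

```python
from typing import List, Tuple


def find_substitutions(reference: str, isolate: str) -> List[Tuple[int, str, str]]:
    """List (1-based position, reference base, isolate base) for each substitution.

    Positions where either sequence has an ambiguous base ('N') or a gap ('-')
    are skipped rather than reported.
    """
    if len(reference) != len(isolate):
        raise ValueError("sequences must be aligned to the same length")
    substitutions = []
    pairs = zip(reference.upper(), isolate.upper())
    for position, (ref_base, iso_base) in enumerate(pairs, start=1):
        if "N" in (ref_base, iso_base) or "-" in (ref_base, iso_base):
            continue  # uninformative site
        if ref_base != iso_base:
            substitutions.append((position, ref_base, iso_base))
    return substitutions


# Hypothetical toy sequences; not taken from any helminth genome.
toy_reference = "ATGGCCATTG"
toy_isolate = "ATGACCATNG"

variants = find_substitutions(toy_reference, toy_isolate)
print(variants)
```

Applied across many isolates, tables of such positions are the raw material for the polymorphism surveys and drug resistance markers discussed above.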
Although some details of immunomodulation by helminth
components have been characterized, we are just beginning to
understand how these parasite products act on immune responses
and to assemble fragmentary information on individual compo-
nents into a comprehensive picture. Comparisons of helminth
molecules with orthologues/paralogues from free-living relatives
will strengthen efforts to decipher the strategies adopted by
helminth parasites to evade and subvert their host immune
responses. This information will be exploitable for development of
drugs and vaccines against the parasites and potentially also novel
therapeutic biologics for use in humans. Future studies may
determine whether helminth proteins of unknown function
are the source of the intriguing regulatory effects that helminth
infections have on the host immune response.
Treatment for helminthic infections, responsible for hundreds of
thousands of deaths each year, depends almost exclusively on just
two or three drugs: praziquantel, the benzimidazoles, and
ivermectin. Vaccines and new drugs are needed, not least because
resistance to these compounds in human helminth parasites such as
schistosomes, whipworms, and filariae would present a
major problem for current treatment and control strategies.
Pharmacogenomics with the new helminth genomes represents a
practicable route toward new drugs. For example,
chemogenomics screening of the genome sequence of S. mansoni
identified >20 parasite proteins for which potential drugs,
already approved for other human ailments, are available [25]. Indeed,
in the case of the schistosome thioredoxin glutathione
reductase, auranofin (an anti-arthritis medication) was recently
shown to exhibit potent anti-schistosomal activity [91]. Finally,
the vast new sequence information will undoubtedly allow revision
of our understanding of the host–parasite relationship, its
evolution, vector–pathogen and helminth–symbiont interactions,
unique aspects of cell biology and biochemistry, phylogenetic
relationships, intervention targets, research approaches (e.g. [92]),
and so forth.
Acknowledgments
We thank Victoria Mann, Geoffrey Gobert and Gabriel Rinaldi for access
to their unpublished findings on schistosomes transduced with pseudotyped
virions.
References
1. Hotez PJ, Brindley PJ, Bethony JM, King CH, Pearce EJ, et al. (2008) Helminth
infections: The great neglected tropical diseases. J Clin Invest 118: 1311–1321.
2. Hotez PJ, Kamath A (2009) Neglected tropical diseases in sub-Saharan Africa: Review of their prevalence, distribution, and disease burden. PLoS Negl Trop
Dis 3: e412.
3. Patz JA, Graczyk TK, Geller N, Vittor AY (2000) Effects of environmental change on emerging parasitic diseases. Int J Parasitol 30: 1395–1405.
4. Liang S, Yang C, Zhong B, Qiu D (2006) Re-emerging schistosomiasis in hilly
and mountainous areas of Sichuan, China. Bull WHO 84: 139–144.
5. Huyse T, Webster BL, Geldof S, Stothard JR, Diaw OT, et al. (2009)
Bidirectional introgressive hybridization between a cattle and human schistosome species. PLoS Pathog 5: e1000571. doi:10.1371/journal.ppat.1000571.
6. Kalinna BH, Brindley PJ (2007) Manipulating the manipulators: Advances in
parasitic helminth transgenesis. Trends Parasitol 23: 197–204.
7. Krasky A, Rohwer A, Schroeder J, Selzer PM (2007) A combined bioinformatics and chemoinformatics approach for the development of new antiparasitic drugs.
Genomics 89: 36–43.
8. Mitreva M, Zarlenga DS, McCarter JP, Jasmer DP (2007) Parasitic nematodes -
From genomes to control. Vet Parasitol 148: 31–42.
9. Berriman M, Lustigman S, McCarter JP (2007) Helminth initiative for drug discovery – Report of the informal consultation, genomics and emerging drug
discovery technologies. Expert Opin Drug Discovery 2: S83–S89.
10. Lustigman S, Ford S, Crawford MJ (2008) RNA Interference: from functional genomics to validation of drug targets in helminths. In: RNA interference
research progress. Lyland RT, Browning IB, eds. Nova Publishers. pp
135–162.
11. Franco GR, Adams MD, Soares MB, Simpson AJG, Venter JC, et al. (1995) Identification of new Schistosoma mansoni genes by the EST strategy using a
directional cDNA library. Gene 152: 141–147.
12. Gobert GN, Moertel L, Brindley PJ, McManus DP (2009) Developmental gene expression profiles of the human pathogen Schistosoma japonicum. BMC Genomics
10: 128.
13. Robinson MW, Connolly B (2005) Proteomic analysis of the excretory-secretory proteins of the Trichinella spiralis L1 larva, a nematode parasite of skeletal
muscle. Proteomics 5: 4525–4532.
14. Mitreva M, McCarter JP, Martin J, Dante M, Wylie T, et al. (2004)
Comparative genomics of gene expression in the parasitic and free-living nematodes Strongyloides stercoralis and Caenorhabditis elegans. Genome Res 14:
209–220.
15. Taft AS, Vermeire JJ, Bernier J, Birkeland SR, Cipriano MJ, et al. (2009) Transcriptome analysis of Schistosoma mansoni larval development using serial
analysis of gene expression (SAGE). Parasitology 136: 469–485.
16. Mitreva M, McCarter JP, Arasu P, Hawdon J, Martin J, et al. (2005)
Investigating hookworm genomes by comparative analysis of two Ancylostoma
species. BMC Genomics 6: 58.
17. McCarter JP (2004) Genomic filtering: An approach to discovering novel
antiparasitics. Trends Parasitol 20: 462–468.
18. Wasmuth J, Schmid R, Hedley A, Blaxter M (2008) On the extent and origins of genic novelty in the phylum Nematoda. PLoS Negl Trop Dis 2: e258.
doi:10.1371/journal.pntd.0000258.
19. Yin Y, Martin J, Abubucker S, Wang Z, Wyrwicz L, et al. (2009) Molecular
determinants archetypical to the phylum Nematoda. BMC Genomics 10: 114.
20. Wang Z, Martin J, Abubucker S, Yin Y, Gasser R, et al. (2009) Systematic
analysis of insertions and deletions specific to nematode proteins and their
proposed functional and evolutionary relevance. BMC Evol Biol 9: 23.
21. The C. elegans Sequencing Consortium (1998) Genome sequence of the
nematode C. elegans: A platform for investigating biology. Science 282:
2012–2018.
22. Ghedin E, Wang S, Spiro D, Caler E, Zhao Q, et al. (2007) Draft genome of the
filarial nematode parasite Brugia malayi. Science 317: 1756–1760.
23. Abubucker S, Martin J, Yin Y, Fulton L, Yang S-P, et al. (2008) The canine
hookworm genome: Analysis and classification of Ancylostoma caninum survey
sequences. Mol Biochem Parasitol 157: 187–192.
24. Schistosoma japonicum Genome Sequencing and Functional Analysis Consortium,
Liu F, Zhou Y, Wang ZQ, Lu G, et al. (2009) The Schistosoma japonicum genome
reveals features of host-parasite interplay. Nature 460: 345–351.
25. Berriman M, Haas BJ, LoVerde PT, Wilson RA, Dillon GP, et al. (2009) The
genome of the blood fluke Schistosoma mansoni. Nature 460: 352–358.
26. Dunn CW, Hejnol A, Matus DQ, Pang K, Browne WE, et al. (2008) Broad
phylogenomic sampling improves resolution of the animal tree of life. Nature
452: 745–749.
27. Krautz-Peterson G, Bhardwaj R, Faghiri Z, Tararam C, Skelly PJ (2009) RNA
interference in schistosomes: machinery and methodology. Parasitology; E-pub
ahead of print. doi:10.1017/S0031182009991168.
28. Mann VH, Morales ME, Kines KJ, Brindley PJ (2008) Transgenesis of
schistosomes: approaches using mobile genetic elements. Parasitology 134: 1–13.
29. Freitas TC, Jung E, Pearce EJ (2007) TGF-beta signaling controls embryo
development in the parasitic flatworm Schistosoma mansoni. PLoS Pathog 3: e52.
doi:10.1371/journal.ppat.0030052.
30. Morales ME, Rinaldi G, Kines KJ, Gobert GN, Tort JF, et al. (2008) RNA
interference targeting Schistosoma mansoni cathepsin D, the apical enzyme of the
hemoglobin proteolysis cascade. Mol Biochem Parasitol 157: 160–168.
31. Rinaldi G, Morales ME, Alrefaei YN, Cancela M, Castillo E, et al. (2009) RNA
interference targeting leucine aminopeptidases inhibits hatching of eggs of the
human blood fluke, Schistosoma mansoni. Mol Biochem Parasitol 167: 118–126.
32. Faghiri Z, Skelly PJ (2009) The role of tegumental aquaporin from the human
parasitic worm, Schistosoma mansoni, in osmoregulation and drug uptake. FASEB J
23: 2780–2789.
33. Ford L, Zhang J, Liu J, Hashmi S, Fuhrman JA, et al. (2009) Functional analysis
of the cathepsin-like cysteine protease genes in adult Brugia malayi using RNA
interference. PLoS Negl Trop Dis 3: e377. doi: 10.1371/journal.pntd.0000377.
34. Morales ME, Mann VH, Kines KJ, Gobert GN, Kalinna BH, et al. (2007)
piggyBac transposon mediated transgenesis of the human blood fluke, Schistosoma
mansoni. FASEB J 21: 3479–3489.
35. Kines KJ, Mann VH, Morales ME, Shelby BD, Kalinna BH, et al. (2006)
Transduction of Schistosoma mansoni by vesicular stomatitis virus glycoprotein-
pseudotyped Moloney murine leukemia retrovirus. Exp Parasitol 112: 209–220.
36. Kines KJ, Morales ME, Mann VH, Gobert GN, Brindley PJ (2008) Integration
of reporter transgenes into Schistosoma mansoni chromosomes mediated by
pseudotyped murine leukemia virus. FASEB J 22: 2936–2948.
37. Li X, Massey HC, Jr., Nolan TJ, Schad GA, Kraus K, et al. (2006) Successful
transgenesis of the parasitic nematode Strongyloides stercoralis requires endogenous
non-coding control elements. Int J Parasitol 36: 671–679.
38. Castelletto ML, Massey HC, Jr., Lok JB (2009) Morphogenesis of Strongyloides
stercoralis infective larvae requires the DAF-16 ortholog FKTF-1. PLoS Pathog 5: e1000370. doi: 10.1371/journal.ppat.1000370.
39. de Oliveira A, Katholi CR, Unnasch TR (2008) Characterization of the
promoter of the Brugia malayi 12 kDa small subunit ribosomal protein (RPS12)
gene. Int J Parasitol 38: 1111–1119.
40. Rogers A, Antoshechkin I, Bieri T, Blasiar D, Bastiani C, et al. (2008) Wormbase 2007. Nucleic Acids Res 36(Database issue). pp D612–617.
41. Mitreva N, Appleton J, McCarter JP, Jasmer DP (2005) Expressed sequence tags
from life cycle stages of Trichinella spiralis: Application to biology and parasite control. Vet Parasitol 132: 13–17.
42. Martin J, Abubucker S, Wylie T, Yin Y, Mitreva M (2009) Nematode.net update
2008: Improvements enabling more efficient data mining and comparative nematode genomics. Nucleic Acids Res 37(Database issue): D571–578.
43. Parkinson J, Whitton C, Schmid R, Thomson M, Blaxter M (2004) NEMBASE:
A resource for parasitic nematode ESTs. Nucleic Acids Res 32: D427–430.
44. Zerlotini A, Heiges M, Wang H, Moraes RL, Dominitini AJ, et al. (2009)
SchistoDB: A Schistosoma mansoni genome resource. Nucleic Acids Res 37(Database issue): D579–582.
45. Liu F, Chen P, Cui SJ, Wang ZQ, Han ZG (2008) SjTPdb: Integrated
transcriptome and proteome database and analysis platform for Schistosoma
japonicum. BMC Genomics 9: 304.
46. Robb SMC, Ross E, Sanchez Alvarado A (2008) SmedGD: The Schmidtea
mediterranea genome database. Nucleic Acids Res 36(Database issue). pp D599–D606.
47. Blaxter ML, De Ley P, Garey JR, Liu LX, Scheldeman P, et al. (1998) A
molecular evolutionary framework for the phylum Nematoda. Nature 392:
71–75.
48. Holterman M, van der Wurff A, van den Elsen S, van Megen H, Bongers T, et al. (2006) Phylum-wide analysis of SSU rDNA reveals deep phylogenetic
relationships among nematodes and accelerated evolution toward crown clades. Mol Biol Evol 23: 1792–1800.
49. Mitreva M, Smant G, Helder J (2009) Role of horizontal gene transfer in the
evolution of plant parasitism among nematodes. In: Horizontal Gene Transfer.
Methods Mol Biol 532: 517–535.
50. Abad P, Gouzy J, Aury J-M, Castagnone-Sereno P, Danchin EG, et al. (2008) Genome sequence of the metazoan plant-parasitic nematode Meloidogyne incognita.
Nat Biotech 26: 909–915.
51. Opperman CH, Bird DM, Williamson VM, Rokhsar DS, Burke M, et al. (2008) Sequence and genetic map of Meloidogyne hapla: A compact nematode genome for
plant parasitism. Proc Natl Acad Sci U S A 105: 14802–14807.
52. Dieterich C, Clifton SW, Schuster LN, Chinwalla A, Delehaunty K, et al. (2008) The Pristionchus pacificus genome provides a unique perspective on nematode
lifestyle and parasitism. Nat Genet 40: 1193–1198.
53. Robb SM, Ross E, Sanchez Alvarado A (2008) SmedGD: The Schmidtea
mediterranea genome database. Nucleic Acids Res 36: D599–D606.
54. Maizels RM, Balic A, Gomez-Escobar N, Nair M, Taylor MD, et al. (2004) Helminth parasites–Masters of regulation. Immunol Rev 201: 89–116.
55. Ohnmacht C, Voehringer D (2009) Basophil effector function and homeostasis
during helminth infection. Blood 113: 2816–2825.
56. Hartmann S, Kyewski B, Sonnenburg B, Lucius R (1997) A filarial cysteine protease inhibitor down-regulates T cell proliferation and enhances
interleukin-10 production. Eur J Immunol 27: 2253–2260.
57. Hartmann S, Lucius R (2003) Modulation of host immune responses by nematode cystatins. Int J Parasitol 33: 1291–1302.
58. Harnett W, McInnes IB, Harnett MM (2004) ES-62, a filarial nematode-derived
immunomodulator with anti-inflammatory potential. Immunol Lett 94: 27–33.
59. Gomez-Escobar N, Lewis E, Maizels RM (1998) A novel member of the
transforming growth factor-beta (TGF-beta) superfamily from the filarial nematodes Brugia malayi and B. pahangi. Exp Parasitol 88: 200–209.
60. Gomez-Escobar N, Gregory WF, Maizels RM (2000) Identification of tgh-2, a
filarial nematode homolog of Caenorhabditis elegans daf-7 and human transforming growth factor beta, expressed in microfilarial and adult stages of Brugia malayi.
Infect Immun 68: 6402–6410.
61. van der Kleij D, Latz E, Brouwers JF, Kruize JC, Schmitz M, et al. (2002) A novel host-parasite lipid cross-talk. Schistosomal lyso-phosphatidylserine
activates toll-like receptor 2 and affects immune polarization. J Biol Chem 277:
48122–48129.
62. Pastrana DV, Raghavan N, FitzGerald P, Eisinger SW, Metz C, et al. (1998) Filarial nematode parasites secrete a homologue of the human cytokine
macrophage migration inhibitory factor. Infect Immun 66: 5955–5963.
63. Hewitson JP, Grainger JR, Maizels RM (2009) Helminth immunoregulation: The role of parasite secreted proteins in modulating host immunity. Mol
Biochem Parasitol 167: 1–11.
64. Yazdanbakhsh M, van den Biggelaar A, Maizels RM (2001) Th2 responses
without atopy: Immunoregulation in chronic helminth infections and reduced allergic disease. Trends Immunol 22: 372–377.
65. Imai S, Fujita K (2004) Molecules of parasites as immunomodulatory drugs.
Curr Top Med Chem 4: 539–552.
66. Harnett W, Harnett MM (2008) Therapeutic immunomodulators from
nematode parasites. Expert Rev Mol Med 10: e18.
67. Harnett W, Harnett MM (2008) Parasitic nematode modulation of allergic disease. Curr Allergy Asthma Rep 8: 392–397.
68. Johnston MJ, MacDonald JA, McKay DM (2009) Parasitic helminths: A pharmacopeia of anti-inflammatory molecules. Parasitology 136: 125–147.
69. McKay DM (2009) The therapeutic helminth? Trends Parasitol 25: 109–114.
70. Erb KJ (2009) Can helminths or helminth-derived products be used in humans to prevent or treat allergic diseases? Trends Immunol 30: 75–82.
71. Summers RW, Elliott DE, Urban JF, Jr., Thompson R, Weinstock JV (2005) Trichuris suis therapy in Crohn’s disease. Gut 54: 87–90.
72. Summers RW, Elliott DE, Urban JF, Jr., Thompson RA, Weinstock JV (2005) Trichuris suis therapy for active ulcerative colitis: A randomized controlled trial.
Gastroenterology 128: 825–832.
73. Lima C, Perini A, Garcia ML, Martins MA, Teixeira MM, et al. (2002) Eosinophilic inflammation and airway hyper-responsiveness are profoundly
inhibited by a helminth (Ascaris suum) extract in a murine model of asthma. Clin Exp Allergy 32: 1659–1666.
74. Schnoeller C, Rausch S, Pillai S, Avagyan A, Wittig BM, et al. (2008) A
helminth immunomodulator reduces allergic and inflammatory responses by induction of IL-10-producing macrophages. J Immunol 180: 4265–4272.
75. Melendez AJ, Harnett MM, Pushparaj PN, Wong WS, Tay HK, et al. (2007) Inhibition of Fc epsilon RI-mediated mast cell responses by ES-62, a product of
parasitic filarial nematodes. Nat Med 13: 1375–1381.
76. McInnes IB, Leung BP, Harnett M, Gracie JA, Liew FY, et al. (2003) A novel
therapeutic approach targeting articular inflammation using the filarial
nematode-derived phosphorylcholine-containing glycoprotein ES-62. J Immunol 171: 2127–2133.
77. Donnelly S, O’Neill SM, Sekiya M, Mulcahy G, Dalton JP (2005) Thioredoxin peroxidase secreted by Fasciola hepatica induces the alternative activation of
macrophages. Infect Immun 73: 166–173.
78. Holland MJ, Harcus YM, Riches PL, Maizels RM (2000) Proteins secreted by the parasitic nematode Nippostrongylus brasiliensis act as adjuvants for Th2
responses. Eur J Immunol 30: 1977–1987.
79. Han ZG, Brindley PJ, Wang S, Chen Z (2009) Schistosome genomics: New
perspectives on schistosome biology and host parasite interaction. Annu Rev Genomics Hum Genet 10: 211–240.
80. Foster J, Ganatra M, Kamal I, Ware J, Makarova K, et al. (2005) The Wolbachia
genome of Brugia malayi: endosymbiont evolution within a human pathogenic nematode. PLoS Biol 3: e121. doi:10.1371/journal.pbio.0030121.
81. Park J, Kim KJ, Choi K-S, Grab DJ, Dumler JS (2004) Anaplasma phagocytophilum
AnkA binds to granulocyte DNA and nuclear proteins. Cell Microbiol 6:
743–751.
82. Warbrick EV, Barker GC, Rees HH, Howells RE (1993) The effect of invertebrate hormones and potential hormone inhibitors on the third larval
moult of the filarial nematode, Dirofilaria immitis, in vitro. Parasitology 107: 459–463.
83. Nisbet AJ, Cottee PA, Gasser RB (2008) Genomics of reproduction in nematodes: prospects for parasite intervention? Trends Parasitol 24: 89–95.
84. Dieterich C, Sommer RJ (2009) How to become a parasite - Lessons from the
genomes of nematodes. Trends Genet 25: 203–209.
85. Margulies M, Egholm M, Altman WE, Attiya S, Bader JS, et al. (2005) Genome
sequencing in microfabricated high-density picolitre reactors. Nature 437: 376–380.
86. Bennett S (2004) Solexa Ltd. Pharmacogenomics 5: 433–438.
87. Shendure J, Porreca GJ, Reppas NB, Lin X, McCutcheon JP, et al. (2005) Accurate multiplex polony sequencing of an evolved bacterial genome. Science
309: 1728–1732.
88. Sanger F, Nicklen S, Coulson AR (1977) DNA sequencing with chain-terminating
inhibitors. Proc Natl Acad Sci U S A 74: 5463–5467.
89. Fenwick A (2009) Host-parasite relations and implications for control. Adv Parasitol 68: 247–261.
90. Morozova O, Marra MA (2008) Applications of next-generation sequencing technologies in functional genomics. Genomics 92: 255–264.
91. Kuntz AN, Davioud-Charvet E, Sayed AA, Califf LL, Dessolin J, et al. (2007) Thioredoxin glutathione reductase from Schistosoma mansoni: An essential parasite
enzyme and a key drug target. PLoS Med 4: e206. Erratum in: PLoS Med 2007,
4: e264.
92. Cosseau C, Azzi AH, Smith K, Freitag M, Mitta G, et al. (2009) Native
chromatin immunoprecipitation (N-ChIP) and ChIP-Seq of Schistosoma mansoni: Critical experimental parameters. Mol Biochem Parasitol 166: 70–76.
93. Chenna R, Sugawara H, Koike T, Lopez R, Gibson TJ, et al. (2003) Multiple
sequence alignment with the Clustal series of programs. Nucleic Acids Res 31: 3497–3500.
94. Felsenstein J (1988) Phylogenies from molecular sequences: Inference and reliability. Annu Rev Genet 22: 521–565.