
The Genomics of Emerging Infectious Disease

www.plos.org

A collection of essays, perspectives, and reviews from six PLoS Journals about how genomics can revolutionize our understanding of emerging infectious disease.

Produced with support from Google.org. The PLoS Journal editors have sole responsibility for the content of this collection.

Image credits: Brindley et al., PLoS Neglected Tropical Diseases 3(10) e538. McHardy et al., PLoS Pathogens 5(10) e1000566. Salama et al., PLoS Pathogens 5(10) e1000544.

Editorial

Genomics of Emerging Infectious Disease: A PLoS Collection

Jonathan A. Eisen1*, Catriona J. MacCallum2*

1 University of California Davis, Davis, California, United States of America, 2 Public Library of Science, Cambridge, United Kingdom

Today, the Public Library of Science publishes a collection of

essays, perspectives, and reviews about how genomics, with all its

associated tools and techniques, can provide insights into our

understanding of emerging infectious disease (http://ploscollections.org/emerginginfectiousdisease/)

[1–13]. This collection, focused on

human disease, is particularly timely as pandemic H1N1 2009

influenza (commonly referred to as swine flu) spreads around the

globe, and government officials, the public, journalists, bloggers, and

tweeters strive to find out more. People want to know if this flu poses

more of a threat than other seasonal flu strains, how fast it’s

spreading (and where), and what can be done to contain it. As this

collection illustrates, the increasing speed at which complete genome

sequences and other genome-scale data can be generated for

individual isolates and strains of a pathogen provides tremendous

opportunities to identify the molecular changes in these disease

agents that will enable us to track their spread and evolution through

time (e.g., [3,7,8]) and generate the vaccines and drugs necessary to

combat them (e.g., [5–7]). The collection also shines a spotlight on

specific pathogens, some familiar and widespread, such as the

influenza A virus (e.g., [9]); some “reemerging,” such as the

Mycobacterium tuberculosis complex that causes tuberculosis [10]; and

some identified only recently, as with the bacterium Helicobacter pylori

(which causes peptic ulcers and gastric cancer [11]).

There is no simple definition of an emerging disease, but it can

be loosely described as a disease that is novel in some way—for

example, one that displays a change in geographic location,

genetics, or function. Emerging infectious diseases are caused by a

wide range of organisms, but they are perhaps best typified by

zoonotic viral diseases that cross from animal to human hosts and

can have a devastating impact on human health, causing a high

disease burden and mortality [8]. These zoonotic diseases include

monkeypox, Hendra virus, Nipah virus, and severe acute

respiratory syndrome coronavirus (SARS-CoV), in addition to

influenza A and the lentiviruses that cause AIDS. The apparently

increased transmission of pathogens from animals to humans over

the recent decades has been attributed to the unintended

consequences of globalization as well as environmental factors

and changes in agricultural practices [8]. Generally, the burden of

these diseases is most strongly felt by those in developing countries.

Brindley et al. [12] point to the debilitating effects of the most

common human infectious agent in such areas—helminths

(parasitic worms)—and the role that genomics plays in advancing

our understanding of molecular and medical helminthology.

Compounding the problem of emerging infectious diseases in

developing countries is the reality that researchers in developing

countries have often been unable to participate fully in genomics

research, because of their technological isolation and limited

resources. As Coloma and Harris emphasize [13], “collaborations—

starting with capacity building in genomics research—need to be

fostered so that countries that are currently excluded from the

genomics revolution find an entry point for participation.”

This collection is a collaborative effort that combines financial

support from Google.org (which has also sponsored research on

emerging infectious disease through its Predict and Prevent

initiative [14]) with PLoS’s editorial independence and rigor.

Gupta et al. [1] provide Google.org’s perspective and vision for

how systematic application of genomics, proteomics, and bioinformatics

to infectious diseases could predict and prevent the next

pandemic. To realize this vision, they urge the community to unite

under an “Infectious Disease Genomics Project,” analogous to the

Human Genome Project. This is, as the authors admit, a

potentially “grandiose” and difficult proposition. Some researchers

might justifiably argue that much is already being achieved—as

demonstrated by this collection—and that the vision is naïve.

However, as every article in the collection also points out,

tremendous challenges remain if the potential of genomics in this

field is to be realized.

One problem is that, despite the fact that sequencing is now the

method of choice for characterizing new disease agents, and new

substantially faster and cheaper sequencing methods are continually

being produced, we still lack the range of computational tools

necessary to analyze these sequences in sufficient detail [4]. It is

possible to sequence the entire assemblage of viruses in a particular

tissue type or host species [3] and to obtain complete or nearly

complete genome sequences for large samples of bacteria [7]. Yet

we remain in the early, albeit essential, stages of pathogen

discovery (Box 1). These sequences can be interpreted fully only

when integrated with relevant environmental, epidemiological,

and clinical data (e.g., [3,4,8]). And, despite the increased

sequencing, really comprehensive genome data are still only

available for a few key pathogens, which further limits our

understanding. For example, a full quantitative understanding of

the processes that shape the epidemiology and evolution—the

phylodynamics—of RNA viruses is currently possible only for HIV

and influenza A virus [3].

In this collection, you will find not only the views of leading

researchers from several different disciplines, and a provocative

vision from a funding agency, but also the contributions of six

different PLoS journals (PLoS Biology, PLoS Medicine, PLoS

Computational Biology, PLoS Genetics, PLoS Neglected Tropical Diseases,

and PLoS Pathogens). The PLoS open-access model of publishing

makes possible such a large multidisciplinary cross-journal

collection, in which all articles are simultaneously available online

Citation: Eisen J, MacCallum CJ (2009) Genomics of Emerging Infectious Disease: A PLoS Collection. PLoS Biol 7(10): e1000224. doi:10.1371/journal.pbio.1000224

Published October 26, 2009

Copyright: © 2009 Eisen, MacCallum. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Competing Interests: The authors have declared that no competing interests exist.

* E-mail: [email protected] (JAE); [email protected] (CJM)

This article is part of the “Genomics of Emerging Infectious Disease” PLoS Journal collection (http://ploscollections.org/emerginginfectiousdisease/).

PLoS Biology | www.plosbiology.org 1 October 2009 | Volume 7 | Issue 10 | e1000224

for unrestricted reuse, regardless of venue (see also the podcast that

accompanies the collection; http://ploscollections.org/podcast/emerginginfectiousdisease.mp3).

Our aim is that this collection will add to other “open science”

activities that have helped provide insights into infectious disease

more quickly than would have been thought feasible only a few

years ago. This accelerated availability of research findings is

exemplified by the recent response to the flu pandemic. Consider,

for example, data access. Traditionally, scientists have released

data after publishing a study. Fortunately, in part due to

experience from genome sequencing projects, prepublication flu

sequence data have been released in a relatively unrestricted

manner to the community [15]. This has in turn enabled

anyone—not just those who collected the data—to carry out

analyses while the epidemic is occurring (when in principle there is

still time to save lives) rather than being forced to provide a

posthumous account of the spread of infection. Such a response

highlights both the importance of early data access and the

removal of restrictions in the use of data (e.g., in many past cases

data might be released but use of the data in presentations and

publications would be limited).

Box 1. A Field Guide to Microbes?

When an American robin (Turdus migratorius) showed up in London a few years ago, birders were rapidly all atwitter and many came flocking to town [22]. Why had this one bird created such a stir? For one main reason: it was out of place. This species is normally found in North America and only very rarely shows up on the other side of the “pond.” Amazingly, this rapid, collective response is not that unusual in the world of birding. When a bird is out of place, people notice quickly.

This story of the errant robin gets to the heart of the subject of this collection because being out of place in a metaphorical way is what defines an emerging infectious disease. Sometimes we have never seen anything quite like the organism or the disease before (e.g., SARS, Legionella). Or perhaps, as with many opportunistic pathogens, we have seen the organism before but it was not previously known to cause disease. In other cases, such as with pandemic H1N1 2009 or E. coli O157:H7, we have seen the organism cause disease before but a new form is causing far more trouble. And of course organisms can be literally out of place, by showing up in a location not expected (e.g., consider the anthrax letters [2]).

Historically, despite the metaphorical similarities with the robin case, the response to emerging infectious disease is almost always much slower. Clearly, there are many reasons for these differences, which we believe are instructive to consider. At least four factors are required for birders’ rapid responses to the arrival of a vagrant bird: (1) knowledge of the natural “fauna” in a particular place, (2) recognition that a specific bird may be out of place, (3) positive identification of the possibly out-of-place bird, and (4) examination of the “normal” place for relatives of the identified bird.

How are these requirements achieved? Mostly through the existence of high-quality field guides that allow one to place an organism such as a bird into the context of what is known about its relatives. This placement in turn is possible because of two key components of field guides. First, such guides contain information about the biological diversity of a group of organisms. This usually includes features such as a taxonomically organized list of species with details for each species on biogeography (distribution patterns across space and time, niche preferences, relative abundance), biological properties (e.g., behavior, size, shape, etc.), and genetic variation within the species (e.g., presence of subspecies). Second, a good field guide provides information on how to identify particular types (e.g., species) of those organisms. With such information, and with a network of interested observers, an out-of-place bird can be detected with relative ease.

In much the same way, a field guide to microbes would be valuable in the study of emerging infectious diseases. The articles in this collection describe what can be considered the beginnings of species-specific field guides for the microbial agents of emerging diseases. If we want to truly gain the benefits that can come from good field guides it will be necessary to expand current efforts to include more organisms, more systematic biogeographical sampling, and more epidemiological and clinical data. But the current efforts are a great start.

Figure: The American Robin (Turdus migratorius). (Photo Credit: NASA). doi:10.1371/journal.pbio.1000224.g001

The value of open access to sequence data is helping to put

pressure both on private organizations to release their sequence

data [16,17] and on all agencies to release other information (e.g.,

metadata about strains) more rapidly. This pressure is not being

brought to bear only on flu data—in this collection Van Voorhis

et al. [5] call on pharmaceutical companies to deposit the

structural coordinates of drug targets from all globally important

infectious disease organisms in public databases.

Of course, data about any infectious disease are not very useful

unless placed in the scientific context of past studies (i.e.,

publications) specifically about the disease or about methods to

analyze such data. It is also important to have access to

information about other diseases and other organisms that might

impact its spread or evolution. Perhaps the most intriguing aspect

of open science in response to flu has been the move toward

pre-journal publication release of findings. Many flu researchers took

the available data, analyzed it, and posted results on blogs [18,19],

wikis [20], and other sites. Although some view this “non-peer-reviewed”

release as unseemly, it is clear that it has helped

accelerate the science in the study of pandemic H1N1 2009 and

led to some important journal papers [17]. Indeed, such advances

helped provide one of the stimuli for PLoS’s most recent initiative,

PLoS Currents: Influenza, a Google “Knol,” for the rapid

communication of research results and ideas about flu vetted by

expert moderators [21].

This is not to say there are no possible risks or drawbacks from

more openness. For example, some governments may avoid

releasing data because of fears about discrimination (as was seen in

many aspects of the flu in Mexico). Others worry that complete

openness might foster the spread of misinformation. However, as

Fricke et al. argue in their article on the relationship between

genomics and biopreparedness [2], open source genomic resources

are actually of immense benefit to those in charge of our public

health and biosecurity.

It is clear that “for all stages of combating emerging infections,

from the early identification of the pathogen to the development

and design of vaccines, application of sophisticated genomics tools

is fundamental to success” [8]. It is equally clear that open science

and open access to publications and data will be key to that

success. Whatever one’s position has been on the various open

science initiatives, there is no doubt that the “esoteric” label on

some open science initiatives has largely been eliminated by the

emergence of the H1N1 flu epidemic.

The faster, cheaper, and more openly we can distribute the

discoveries of science, the better for scientific progress and public

health. As this collection emphasizes, managing the threat of

novel, re-emerging, and longstanding infectious diseases is

challenging enough even without barriers to scientific research.

We encourage you to make the most of this collection by sharing,

rating, and annotating the articles using our online commenting

tools. Better yet, join the discussion by providing your own vision

to prevent the emergence and spread of the next rogue pathogen.

References

1. Gupta R, Michalski MH, Rijsberman FR (2009) Can an Infectious Diseases

Genomics Project predict and prevent the next pandemic? PLoS Biol 7:

e1000219. doi:10.1371/journal.pbio.1000219.

2. Fricke WF, Rasko DA, Ravel J (2009) The role of genomics in the identification,

prediction, and prevention of biological threats. PLoS Biol e1000217. doi:10.1371/journal.pbio.1000217.

3. Holmes EC, Grenfell BT (2009) Discovering the phylodynamics of RNA viruses.

PLoS Comput Biol 5: e1000505. doi:10.1371/journal.pcbi.1000505.

4. Berglund EC, Nystedt B, Andersson SGE (2009) Computational resources in

infectious disease: Limitations and challenges. PLoS Comput Biol 5: e1000481.

doi:10.1371/journal.pcbi.1000481.

5. Van Voorhis WC, Hol WGJ, Myler PJ, Stewart LJ (2009) The role of medical

structural genomics in discovering new drugs for infectious diseases. PLoS Comput

Biol 5: e1000530. doi:10.1371/journal.pcbi.1000530.

6. Seib KL, Dougan G, Rappuoli R (2009) The key role of genomics in modern

vaccine and drug design for emerging infectious diseases. PLoS Genet 5:

e1000612. doi:10.1371/journal.pgen.1000612.

7. Falush D (2009) Toward the use of genomics to study microevolutionary change

in bacteria. PLoS Genet 5: e1000627. doi:10.1371/journal.pgen.1000627.

8. Haagmans BL, Andeweg AC, Osterhaus ADME (2009) The application of

genomics to emerging zoonotic viral diseases. PLoS Pathog 5: e1000557.

doi:10.1371/journal.ppat.1000557.

9. McHardy AC, Adams B (2009) The role of genomics in tracking the evolution of

influenza A virus. PLoS Pathog 5: e1000566. doi:10.1371/journal.ppat.1000566.

10. Comas I, Gagneux S (2009) The past and future of tuberculosis research. PLoS

Pathog 5: e1000600. doi:10.1371/journal.ppat.1000600.

11. Dorer MS, Talarico S, Salama NR (2009) Helicobacter pylori’s unconventional

role in health and disease. PLoS Pathog 5: e1000544. doi:10.1371/journal.ppat.1000544.

12. Brindley PJ, Mitreva M, Ghedin E, Lustigman S (2009) Helminth genomics:

The implications for human health. PLoS Negl Trop Dis 3: e538. doi:10.1371/journal.pntd.0000538.

13. Coloma J, Harris E (2009) Molecular genomic approaches to infectious diseases in resource-limited settings. PLoS Med 6: e1000142. doi:10.1371/journal.pmed.1000142.

14. Google.org (2008) Predict and Prevent Initiative homepage. Available: http://www.google.org/predict.html. Accessed 16 September 2009.

15. National Center for Biotechnology Information (2009) Influenza Virus Resource. Available: http://www.ncbi.nlm.nih.gov/genomes/FLU/SwineFlu.html. Accessed 11 September 2009.

16. Butler D (2005) Flu researchers slam US agency for hoarding data. Nature 437: 458–459.

17. Smith GJD, Vijaykrishna D, Bahl J, Lycett SJ, Worobey M, et al. (2009) Origins and evolutionary genomics of the 2009 swine-origin H1N1 influenza A epidemic. Nature 459: 1122–1125.

18. Porter S (2009) Did the California H1N1 swine flu come from Ohio? Discovering Biology in a Digital World blog. Available: http://scienceblogs.com/digitalbio/2009/04/did_the_california_h1n1_swine.php. Accessed 11 September 2009.

19. Koppstein D (2009) Swine flu phylogeny, part II. Koppology blog. Available: http://koppology.blogspot.com/2009/04/swine-flu-phylogeny-part-ii.html. Accessed 11 September 2009.

20. Rambaut A (2009) Human/Swine A/H1N1 Influenza Origins and Evolution. Available: http://tree.bio.ed.ac.uk/groups/influenza/. Accessed 11 September 2009.

21. Allen L (2009) Welcome to PLoS Currents: Influenza. PLoS Blog. Available: http://www.plos.org/cms/node/481. Accessed 8 September 2009.

22. Evans I (29 March 2009) American Robin Spotted in South London. Foxnews.com. Available: http://www.foxnews.com/story/0,2933,189510,00.html. Accessed 14 September 2009.


Essay

Molecular Genomic Approaches to Infectious Diseases in Resource-Limited Settings

Josefina Coloma1,2, Eva Harris1,2*

1 Division of Infectious Diseases and Vaccinology, School of Public Health, University of California Berkeley, Berkeley, California, United States of America, 2 Sustainable

Sciences Institute, San Francisco, California, United States of America

Only half a century after the landmark discovery of the double helix structure of DNA, the human genome was sequenced and a new era of biomedical research was ushered in [1]. Parallel advances in comparative genomics, genetics, high-throughput biochemical techniques, and bioinformatics have provided researchers in wealthy nations with a repertoire of tools to analyze the sequence and functions of organisms at an unprecedented pace and level of detail. Since the beginning of the genomics era [2,3], however, it has been evident that researchers in many developing countries will not be participating fully in genomics research, mainly because of their technological isolation and their limited resources and capacity for genomics research combined with the urgency of many other health priorities. To share the benefits of this technology equitably worldwide, some have advocated that developed and developing countries alike should participate in genomics research to prevent widening of the already large gap in global health resources [4]. As most of the funding that has fueled the rapid advance of the field comes from developed country governments, private initiatives, and industry, however, not much has been done to enable poorer countries to participate as equals in genomics research. Developing countries that are not directly participating in a genomics initiative can, nonetheless, gain from the discoveries of this field in a number of ways, as detailed below. It remains to be seen, however, how the developing world will specifically benefit from the refined genetic information and the drugs and vaccines produced as a result of genomics initiatives. Information exchange and translation of knowledge must be carried out continually through fora accessible to researchers in developing countries. “North–South” collaborations—starting with capacity building in genomics research—need to be fostered so that countries that are currently excluded from the genomics revolution find an entry point for participation. “South–South” collaborations must be encouraged to allow countries with limited resources to pool their human and financial capital, learn from each other’s experience, and share in the benefits of genomics. Ensuring that the benefits of genomics-based medicine are shared by developing countries involves their inclusion in the discussion of ethical, legal, social, economic, and sovereignty issues (Box 1).

Initiatives in the Developing World

In the developing world, the link between human genomics and infectious disease is particularly important. The influence of host genes on the differential susceptibility of individuals or populations to infection and the evolutionary influence of pathogens on the genetic composition of populations by selecting for resistant individuals through coevolution can now be dissected in more detail with genomics. An array of host–pathogen interactions are associated with particular human genes and loci, as best illustrated by the relationship of the malaria pathogen with host genetic evolution. As genetic information about larger populations becomes increasingly available, it is important to disseminate information relating genomics to disease as well as to devise intervention strategies for at-risk populations worldwide [5].

The Essay section contains opinion pieces on topics of broad interest to a general medical audience.

Citation: Coloma J, Harris E (2009) Molecular Genomic Approaches to Infectious Diseases in Resource-Limited Settings. PLoS Med 6(10): e1000142. doi:10.1371/journal.pmed.1000142

Copyright: © 2009 Coloma, Harris. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Competing Interests: The authors have declared that no competing interests exist.

* E-mail: [email protected]

Provenance: Commissioned; externally peer-reviewed.

This article is part of the “Genomics of Emerging Infectious Disease” PLoS Journal collection (http://ploscollections.org/emerginginfectiousdisease/).

Funding: No specific funding was received for this study/essay.

Summary Points

• Researchers in most developing countries lack the technology, resources, and capacity to participate fully in genomics research.

• Information exchange and knowledge translation must be carried out continually through “North–South” collaborations, starting with capacity building in genomics research; “South–South” collaborations must be encouraged to allow countries with limited resources to pool their human and financial capital and share in the benefits of genomics.

• Several emerging countries have made significant progress in the past decade by sequencing the genomes of organisms with little economic value in the developed world but of great local relevance.

• Molecular diagnostics and molecular epidemiology are the first frontier of genomics, with accessible tools that can be applied in resource-limited settings.

• Developing countries entering the genomics era should start by establishing their priorities and enacting appropriate legislation before embarking on large-scale projects.

• Access to training and capacity building of human resources in bioinformatics and data mining are crucial in the developing world.

PLoS Medicine | www.plosmedicine.org 1 October 2009 | Volume 6 | Issue 10 | e1000142

Because science and technology are increasingly recognized as vital components for national development, emerging economies and some developing countries are building their infrastructures to promote local innovation and to retain the value of their human, plant, and microbial genomic diversity and research. India, Thailand, South Africa, Indonesia, Brazil, and Mexico, for example, have devoted considerable resources to large-scale population genotyping projects that explore human genetic variation. The Institute for Genomic Medicine (INMEGEN) initiative in Mexico is the largest and most comprehensive, with a broad strategy for incorporating genomics into health care that includes infrastructure, strategic public–private partnerships, research and development in genomics relevant to local health problems, capacity building, and bioethics policy making [6,7]. Although it is unclear how Mexico will make the transition from early-phase investment to translation of knowledge into products and services with health and economic impacts, the country is taking important steps to address the challenges it and other emerging economies face, such as the shortage of trained professionals and the difficulty of retaining local talent. For example, the National Council for Science and Technology (CONACYT) is making efforts to engage the Mexican scientific diaspora with expertise in genomics by offering repatriation packages tied to jobs at universities and research institutes, an approach that is also being adopted by Brazil.

Brazil’s Foundation for Research Support in São Paulo (FAPESP) genomics initiative is also considered a political and scientific achievement. Key to its success has been early investment in training young scientists by sponsoring scholarships abroad in areas related to genomics in which Brazil lacks expertise. To avoid brain drain, beneficiaries are required to return to Brazil for at least four years and must have a committed teaching position at a local university before they leave. One important principle of Brazil’s genomics initiative is that the projects are relevant to Brazil and the rest of the developing world but are low on the list of priorities of the US and Europe, thus providing both an important contribution to genomics and a benefit to Brazil’s economy and scientific endeavor [8]. FAPESP is in the process of sequencing the genes of the parasite that causes schistosomiasis, a disease that afflicts millions in Brazil. Another example in Brazil is the government-funded consortium Organization for Nucleotide Sequencing and Analysis (ONSA), formed to sequence and analyze the genome of the plant pathogen Xylella, which infects orange trees and has great economic impact [9]. This effort led to additional genomics projects on vectors of pathogens that cause major public health problems in Brazil, such as the sandfly Lutzomyia longipalpis, which transmits Leishmania spp., and the Triatominae bug species, which are vectors of Trypanosoma cruzi [10].

The impact of genomics on the developing world is also illustrated by multinational initiatives such as the one funded by the US National Institutes of Health (NIH), the UK’s Wellcome Trust, and private and public institutes in the US and Europe in collaboration with research centers in Brazil, Argentina, Venezuela, and Singapore to sequence the genomes of the parasites T. brucei, T. cruzi, and Leishmania major, which cause the deadly insect-borne diseases African sleeping sickness, Chagas disease, and leishmaniasis, respectively [11–13]. The potential new drug targets identified by these initiatives have great relevance in over 100 developing countries where the diseases take a significant toll on the economy and the quality of life of their citizens. Similar initiatives have resulted in sequencing of other pathogens important to medicine and agriculture. The data from these projects are usually freely available online for data mining and for bioinformatics analysis at remote locations, as most researchers follow the recommendation set by the Bermuda Accord to make DNA sequences (especially human) freely and openly available without delay [14].

Resource-limited countries can enter

the genomics era by creating partnerships

and regional centers for technology and

resources [15]. For example, DNA se-

quencing technology, still unaffordable for

many researchers and public laboratories

because of low-use volume and high costs

of equipment, reagents, and maintenance,

can be affordable if a regional center

provides services to a pool of laboratories

and researchers within a country or

geographical region. As an illustration,

using Brazilian infrastructure, Peru and

Chile joined the global potato sequencing

consortium, which will sequence different

varieties of this important agricultural

species [16]. Brazil has also generated

several open-source bioinformatics tools

for the annotation of bacterial and proto-

zoan genomes that can be used by any

researcher worldwide [17]. In Africa, the

Center for Training in Functional Geno-

mics of Insect Vectors of Human Disease

(AFRO VECTGEN) was initiated by

TDR (Special Programme in Research

and Training in Tropical Diseases) at the

World Health Organization (WHO) and

the Department of Medical Entomology

and Vector Ecology of the Malaria

Research and Training Center in Mali to

train young scientists in functional geno-

mics who will ultimately use genome

sequence data for research on insect

vectors of human disease. The program

triggers collaborative research with neigh-

boring nations and the vector biology

network in Mali, which was built around

research grants funded by the US NIH

and TDR/WHO [18]. The Malaria

Genomic Epidemiology Network (Malar-

iaGEN) uses a consortial approach that

brings together researchers from 21 coun-

tries to overcome scientific, ethical, and

practical challenges to conducting large-

scale studies of genomic variation that

could assist efforts in the fight against

malaria [19]. Successful ‘‘North–South’’

partnerships that help scientists bridge

the genomic gap usually involve a project

of mutual interest. An example is the

Box 1. Societal and Ethical Issues in Genomics to Be Discussed with Full Participation of All Nations

• Issues of confidentiality, stigmatization, discrimination, and misuse of genetic information

• Dangers of a reductionist approach to health issues based only on genetic information that ignores multifactorial determinants

• Issues about intellectual property rights associated with the patentability of DNA sequences, the applications derived from them, and the implications for developing countries [45]

• The potential exploitation of developing-country populations by creating genetic databases for a price [46]

• The potential risk of breeding human beings by design [47]

• Issues about informed consent, standard of care, and availability and pricing of new drugs and vaccines being tested in developing countries [48]


common effort of the International Live-

stock Research Institute (ILRI) in Nairobi

and The Institute for Genomic Research

(TIGR; now the J. Craig Venter Institute) to sequence and annotate the genome

of Theileria parva, a cattle parasite that

causes important economic losses to small

farmers in Africa and elsewhere [20]. This

effort has generated local human resources

in genomics and infrastructure for the

future.

Application of Molecular, Genetic, and Genomic Tools with Limited Resources

Although the genomics initiatives de-

scribed above challenge the notion that

developing countries must wait to import

advances in science and technology that

emerge from the developed world, poorer

developing countries still do not have the

resources to develop their own genomic

projects on a large scale. However,

implementing simpler molecular genetic

approaches to solve health problems is

very feasible in resource-limited settings.

The decades preceding the human and

microbial genome initiatives were high-

lighted by important developments in

molecular and genetic methods applied

to infectious diseases. These developments

were enabled by increasingly available

genetic information about many patho-

gens and their vectors and by molecular

tools such as PCR and powerful sequenc-

ing technologies, which permitted rapid

advances that were successfully introduced

into the developing world with little delay.

Molecular tools for diagnosis have

gained a ready foothold because many

poor countries do not have the facilities for

traditional diagnosis and surveillance.

Thus, diagnosis often relies on clinical

observations or requires that a sample be

sent out to foreign agencies such as the US

Centers for Disease Control and Preven-

tion (CDC) for confirmation. In addition,

even when available, classic techniques

based on serological, microscopic, and

culture-based methods are often lengthy,

of only moderate sensitivity, and not

highly discriminatory at the level of species

subtype or strain. By adapting DNA

technologies to the existing infrastructure,

using home-grown solutions to reduce

their cost, and applying them to solve

local health problems, molecular ap-

proaches to detect and type infectious

agents on-site offer real value [21]. Fos-

tering appropriate technology transfer and

capacity-building in the ‘‘South’’ enables

public health laboratories and research

groups in less scientifically developed

countries to participate in global genomics

by contributing their findings and sharing

their expertise with their peers [22,23].

For example, we and others adapted PCR-based molecular diagnostic techniques for

infectious diseases such as leishmaniasis and dengue for cost-effective application

in laboratories with minimal infrastructure and basic technical expertise; these

techniques are now fully validated and used routinely throughout Latin America [21,24–30].

This approach relies on understanding

the principles of the technologies, decon-

structing them into their basic compo-

nents, and rebuilding them on-site [21].

Another area where molecular tools have

demonstrated their utility in resource-poor

settings is in detecting drug resistance in a

variety of pathogens. This has been facili-

tated in large part by successful ‘‘North–

South’’ partnerships that have served to

train scientists in developing countries in the

use, implementation, and interpretation of

modern molecular methods applied to

emerging drug resistance (see [31]). This

approach has been particularly successful

with certain diseases, such as malaria, HIV/

AIDS, tuberculosis, and drug-resistant bac-

terial infections (both nosocomial and com-

munity-based). Unfortunately, most studies

of drug-resistant pathogens are performed

independently of one another, so data on the

prevalence of resistance markers is scattered

in disparate databases or in unpublished

studies without links to clinical, laboratory,

and pharmacokinetic data needed to relate

the genetic information to relevant pheno-

types. To enable molecular markers of

malaria drug resistance to realize their

potential as public health tools, the World

Antimalarial Resistance Network (WARN)

database is being created with the dual goals

of improving treatment of malaria by

informed drug selection and use and

providing a prompt warning when treat-

ment protocols need to be changed [32,33].

By accelerating the identification and vali-

dation of markers for resistance to combi-

nation therapies, this global database should

help prolong the useful therapeutic lives of

important new drugs.
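As a minimal sketch of how pooled marker data might support such a warning, the Python example below computes the prevalence of one resistance marker per site and lists the sites at or above a threshold. The record fields, site labels, marker name, and the 10% threshold are all hypothetical illustrations and are not taken from the WARN database or its protocols.

```python
from collections import defaultdict


def marker_prevalence(records, marker):
    """Return {site: fraction of samples at that site carrying `marker`}."""
    carried = defaultdict(int)
    total = defaultdict(int)
    for rec in records:
        total[rec["site"]] += 1
        if marker in rec["markers"]:
            carried[rec["site"]] += 1
    return {site: carried[site] / total[site] for site in total}


def sites_over_threshold(prevalence, threshold):
    """Return the sorted list of sites whose prevalence is >= threshold."""
    return sorted(site for site, p in prevalence.items() if p >= threshold)


# Hypothetical pooled records; "marker_x" stands in for any resistance marker.
records = [
    {"site": "A", "markers": {"marker_x"}},
    {"site": "A", "markers": set()},
    {"site": "A", "markers": {"marker_x"}},
    {"site": "A", "markers": set()},
    {"site": "B", "markers": set()},
    {"site": "B", "markers": set()},
]

prevalence = marker_prevalence(records, "marker_x")
flagged = sites_over_threshold(prevalence, threshold=0.10)
```

In practice such counts would only be meaningful when linked to the clinical, laboratory, and pharmacokinetic data described above; the sketch shows only the pooling step.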

The ultimate power of genetic tools in

resource-limited settings is evident in the

field of molecular epidemiology, where

genetic information about the host or

infectious agent is analyzed together with

clinical and epidemiological data to derive

and implement appropriate interventions.

For example, molecular tools based on

limited sequence information, such as

molecular fingerprinting of a polymorphic

marker, have made important contributions

to strengthening control of tuberculosis in

both developed and developing countries by

enabling analysis of transmission patterns,

helping identify phenotypic variation

among strains, and facilitating evaluation

of the global distribution, relative transmis-

sibility, virulence, and immunogenicity of

different lineages of M. tuberculosis [34–38].

Bacterial infections, food-borne outbreaks,

and viral infections in developing countries,

including the recent H1N1 influenza pan-

demic, are monitored using similar typing

methodologies [39–41]. Molecular tools

permit a refined case definition and thus

have tremendous potential for decision-

making support and informing targeted

public health interventions in countries with

high burdens of disease and limited tech-

nological capabilities and resources.
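To illustrate how limited fingerprint data can point to possible transmission links, the following Python sketch groups isolates that share an identical fingerprint pattern. The isolate identifiers and band patterns are invented for illustration, and treating identical patterns as a candidate cluster is a simplifying assumption rather than a validated typing rule.

```python
from collections import defaultdict


def cluster_by_fingerprint(isolates):
    """Group isolate IDs by identical fingerprint pattern.

    `isolates` is an iterable of (isolate_id, pattern) pairs. Only patterns
    shared by two or more isolates are returned, since a unique pattern
    gives no evidence of a shared source.
    """
    groups = defaultdict(list)
    for isolate_id, pattern in isolates:
        groups[pattern].append(isolate_id)
    return {p: ids for p, ids in groups.items() if len(ids) >= 2}


# Hypothetical isolates with presence/absence band patterns.
isolates = [
    ("iso1", "1011001"),
    ("iso2", "1011001"),
    ("iso3", "0011101"),
    ("iso4", "1011001"),
]

clusters = cluster_by_fingerprint(isolates)
```

A cluster found this way is only a starting point; it would still need to be checked against clinical and epidemiological data before guiding an intervention.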

The trend to move beyond genetic

marker analysis to full genome sequencing

is growing, as complete genome data can

provide a wealth of information about

etiologic agents of disease that was previ-

ously unknown. Full-genome approaches

are not always necessary, however. In

molecular epidemiology of infectious dis-

eases, nucleic acid fingerprinting can

provide enough answers to important

epidemiological questions to allow critical

interventions to be designed (see above). In

fact, too much genetic information, in

some instances, can obscure the picture, as

several closely related pathogenic variants

might coexist in one individual or one

outbreak that differ by only a few

nucleotides but that nonetheless belong

to the same strain or subtype, complicating

the interpretation of results [42].
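The point about variants that differ by only a few nucleotides can be sketched as a simple pairwise comparison. The Python example below counts mismatches between two aligned sequences of equal length and applies a cutoff for calling them the same strain. The sequences and the cutoff of two differences are hypothetical; a real cutoff would depend on the pathogen and the marker.

```python
def nucleotide_differences(seq_a, seq_b):
    """Count positions at which two aligned, equal-length sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(1 for a, b in zip(seq_a, seq_b) if a != b)


def same_strain(seq_a, seq_b, max_differences=2):
    """Treat two variants as one strain if they differ at few positions.

    The default cutoff is an illustrative assumption, not a standard.
    """
    return nucleotide_differences(seq_a, seq_b) <= max_differences


# Hypothetical aligned fragments from three isolates.
variant_1 = "ACGTACGTAC"
variant_2 = "ACGTACGTAT"  # one mismatch relative to variant_1
variant_3 = "TCGAACGGAT"  # four mismatches relative to variant_1

close_pair = same_strain(variant_1, variant_2)
distant_pair = same_strain(variant_1, variant_3)
```

Under a fingerprinting approach, variant_1 and variant_2 would likely appear identical, which is often the level of resolution an epidemiological question requires.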

The relatively rapid transfer of DNA

technology from developed to developing

countries is an excellent example of what

can be done by forging strong relation-

ships between universities and research

groups and public-health laboratories

across the world. The validity of adapting

these technologies relies on links with

epidemiological data and translation into

local public health interventions.

Setting Priorities

General international ethical and scien-

tific guidelines for genomics have been

created and are being adapted by nations

participating in the field as it evolves.

Governments and regulatory agencies in

the ‘‘North’’ have prepared for the

eventual implementation of genomics-

based medicine in their respective coun-

tries. A critical problem faced by develop-

ing countries is the lack of national

guidelines for genomics research and its

ethical ramifications. Thus, a priority to be

set by countries in the early steps of

genomic applications is to draw up the


necessary rules and legislation on geno-

mics and to generate procedures for

implementation. Creating the necessary

communication channels between re-

searchers, social scientists, policy makers,

and civil society organizations is also a

critical step. Other key challenges facing

emerging genomics researchers include

proper informed consent and privacy

protocols for research participants, pro-

tecting them against the potential discrim-

ination that might emerge from genetic

information, and ensuring that any benefit

that comes to fruition from the research

reaches them. In parallel, capacity build-

ing of scientists in clinical research and of

ethics committees in these issues is essen-

tial. Past experience with ‘‘safari research’’

in which biological samples are taken out-

of-country for research that does not

benefit local populations has prompted

countries such as Mexico, India, and

Brazil to draw up legislation governing

‘‘sovereignty’’ over genomics material and

data that restricts the export of biological

materials for studies abroad and prioritizes

national interests. Poorer countries cur-

rently lacking their own genomics initia-

tives could benefit from similar legislation

balancing the protection of ‘‘genomic

sovereignty’’ while fostering international

collaborations that bring much-needed

resources and increase local scientific

capacity. Beyond the improvement of their

basic genomics research capabilities, gov-

ernments should engage their relevant

ministries to develop a plan to integrate

genetic and genomics products (including

diagnostics, vaccines, therapies, and oth-

ers), within the health system and public

health programs with emphasis on acces-

sibility and equity to improve health for

all. A good example of priority setting in

genomics is Mexico’s national genomics

program over the last 15 years (see Box 2).

Sharing Know-How

To strengthen genomics globally, the

tools necessary for analysis of genomics

data are urgently needed in developing

countries, where they are currently under-

utilized [43]. A problem with genomics is

that much of the advanced knowledge is

concentrated in individuals and a few

research centers and companies rather

than in textbooks or academia, restricting

dissemination even though massive

amounts of genomic data and software

are openly accessible through the Internet.

A conscious effort on the part of developed

nations to transfer their knowledge of the

use and analysis of genomic databases

needs to be encouraged to help developing

countries manage their own specific data

on indigenous biological species, local

epidemiology and infectious diseases, bio-

diversity, and other issues. Some successful

programs and initiatives include the Well-

come Trust Sanger Institute training

courses on bioinformatics and genomic

analysis, the Sustainable Sciences Insti-

tute–Broad Institute bioinformatics work-

shops (Figure 1), and the TDR/WHO-

South African Bioinformatics Institute

(SANBI) regional training center. Online

training like the S-star alliance bioinfor-

matics courses and conferences such as the

African Bioinformatics Conference (Af-

bix’09) with remote participation are

becoming more widespread and are an

excellent option for countries with limited

resources. GARSA (Genomic Analysis

Resources for Sequence Annotation) is a

flexible Web-based system designed to

analyze genomic data in the context of a

data analysis pipeline. Hosted in Brazil,

this free system aims to facilitate the

analysis, integration, and presentation of

genomic information, concatenating sev-

eral bioinformatics tools and sequence

databases with a simple user interface

[44]. An alternative to on-site sequencing

is to partner with colleagues in more-

developed countries to have samples

processed abroad in sequencing centers.

This is possible only if local legislation

allows for export of biological samples,

and if true partnership and trust exist with

colleagues in the developed country.

Challenges for the Future

As developing countries reevaluate their

role in the genomics era, they will continue

to explore the unique opportunities that

arise from the vast natural and genomic

diversity that they embody. As exemplified

by the successes in Brazil, Mexico, and

several African countries, it is possible to

turn challenges and problems such as

emerging and endemic infectious diseases

into opportunities for unique scientific and

economic growth. Access to sequencing

facilities, open-source databases, and har-

monized methodologies for genomic analy-

sis are essential for the future of genomics in

the developing world. However, unless a

more concerted effort is made to include

countries with limited scientific development

and resources, it is unlikely that they will

fully participate in genomics projects or use

the technologies available other than by

allowing their genetic material to be acces-

sible to others. As emerging countries set

their own priorities for genomics research

and take ownership of its results, the main

challenge across developing nations remains

access to training and knowledge translation.

Human resources and local capacity in

genomics are thus central to development,

as countries with these skills could partici-

pate in the potential benefits of the field with

respect to health, food security, natural

resource management, and other critical

areas. ‘‘North–South’’ and ‘‘South–South’’

collaborations are a viable and extremely

rewarding way to increase the capacities of

developing countries to access genomic tools

to address unique problems considered of

little economic value outside these countries

but of tremendous importance to the

majority of the world’s population.

Author Contributions

ICMJE criteria for authorship read and met: JC

EH. Wrote the first draft of the paper: JC.

Contributed to the writing of the paper: JC EH.

Box 2. Building a Road toward Genomics: The Mexican Experience 1995–2009 [7]

• Increases in investment in science and technology (S&T) from 0.35% to 0.43% of the GNP and creation of national S&T legislation to increase regional funding

• Four-fold increase in number of students registered for doctoral-level programs

• Participation in international genomics efforts

• Creation of sequencing initiatives of organisms with local agricultural and health relevance

• Creation of a Genomics Sciences degree and two scientific societies in genomics

• Creation of the National Institute of Genetic Medicine (2004-INMEGEN) with seed funding for modern infrastructure; a strategy for development that includes country-wide strategic alliances; high-level research and academic programs; ethical, legal, and social implications of genomic medicine; and translation of the scientific knowledge into public goods

• Establishment of genomics research priorities based on most prevalent local diseases

• Plans for creation of public–private partnerships to guarantee sustainability


References

1. Venter JC (2003) A part of the human genome sequence. Science 299: 1183–1184.

2. Singer PA, Daar AS (2001) Harnessing genomics and biotechnology to improve global health equity. Science 294: 87–89.

3. Calva E, Cardosa MJ, Gavilondo JV (2002) Avoiding the genomics divide. Trends Biotechnol 20: 368–370.

4. Acharya T, Daar AS, Thorsteinsdottir H, Dowdeswell E, Singer PA (2004) Strengthening the role of genomics in global health. PLoS Med 1: e40. doi:10.1371/journal.pmed.0010040.

5. Manolio TA, Rodriguez LL, Brooks L, Abecasis G, Ballinger D, et al. (2007) New models of collaboration in genome-wide association studies: The Genetic Association Information Network. Nat Genet 39: 1045–1051.

6. Seguin B, Hardy BJ, Singer PA, Daar AS (2008) Genomics, public health and developing countries: The case of the Mexican National Institute of Genomic Medicine (INMEGEN). Nat Rev Genet 9 (Suppl 1): S5–9.

7. Jimenez-Sanchez G, Silva-Zolezzi I, Hidalgo A, March S (2008) Genomic medicine in Mexico: Initial steps and the road ahead. Genome Res 18: 1191–1198.

8. Castilla EE, Luquetti DV (2008) Brazil: Public Health Genomics. Public Health Genomics. E-pub ahead of print (3 Sept). doi:10.1159/000153424.

9. Simpson AJ, Reinach FC, Arruda P, Abreu FA, Acencio M, et al. (2000) The genome sequence of the plant pathogen Xylella fastidiosa. The Xylella fastidiosa Consortium of the Organization for Nucleotide Sequencing and Analysis. Nature 406: 151–159.

10. Davila AM, Majiwa PA, Grisard EC, Aksoy S, Melville SE (2003) Comparative genomics to uncover the secrets of tsetse and livestock-infective trypanosomes. Trends Parasitol 19: 436–439.

11. Berriman M, Ghedin E, Hertz-Fowler C, Blandin G, Renauld H, et al. (2005) The genome of the African trypanosome Trypanosoma brucei. Science 309: 416–422.

12. El-Sayed NM, Myler PJ, Bartholomeu DC, Nilsson D, Aggarwal G, et al. (2005) The genome sequence of Trypanosoma cruzi, etiologic agent of Chagas disease. Science 309: 409–415.

13. Ivens AC, Peacock CS, Worthey EA, Murphy L, Aggarwal G, et al. (2005) The genome of the kinetoplastid parasite, Leishmania major. Science 309: 436–442.

14. Bentley DR (1996) Genomic sequence information should be released immediately and freely in the public domain. Science 274: 533–534.

15. Rabinowicz PD (2001) Genomics in Latin America: Reaching the frontiers. Genome Res 11: 319–322.

16. Potato Genome Sequencing Consortium. Available: http://www.potatogenome.net. Accessed 19 July 2009.

17. Almeida LG, Paixao R, Souza RC, Costa GC, Almeida DF, et al. (2004) A new set of bioinformatics tools for genome projects. Genet Mol Res 3: 26–52.

18. Doumbia S, Chouong H, Traore SF, Dolo G, Toure AM, et al. (2007) Establishing an insect disease vector functional genomics training center in Africa. Afr J Med Med Sci 36 (Suppl): 31–33.

19. Malaria Genomic Epidemiology Network (2008) A global network for investigating the genomic epidemiology of malaria. Nature 456: 732–737.

20. Gardner MJ, Bishop R, Shah T, de Villiers EP, Carlton JM, et al. (2005) Genome sequence of Theileria parva, a bovine pathogen that transforms lymphocytes. Science 309: 134–137.

21. Harris E (1998) A low-cost approach to PCR: Appropriate transfer of biomolecular techniques. New York: Oxford University Press.

22. Coloma MJ, Harris E (2004) Innovative low cost technologies for biomedical research and diagnosis in developing countries. BMJ 329: 1160–1162.

Figure 1. Participants in a Bioinformatics/Genomics Analysis workshop in Managua, Nicaragua, in June 2008 (conducted by the Sustainable Sciences Institute and the Broad Institute). Photograph by Eva Harris. doi:10.1371/journal.pmed.1000142.g001


23. Harris E (2004) Scientific capacity building in developing countries. EMBO Rep 5: 7–11.

24. Harris E, Tanner M (2000) Health technology transfer. BMJ 321: 817–820.

25. Aviles H, Belli A, Armijos R, Monroy FP, Harris E (1999) PCR detection and identification of Leishmania parasites in clinical specimens in Ecuador: A comparison with classical diagnostic methods. J Parasitol 85: 181–187.

26. Harris E, Kropp G, Belli A, Rodriguez B, Agabian N (1998) Single-step multiplex PCR assay for characterization of New World Leishmania complexes. J Clin Microbiol 36: 1989–1995.

27. Belli A, Rodriguez B, Aviles H, Harris E (1998) Simplified polymerase chain reaction detection of new world Leishmania in clinical specimens of cutaneous leishmaniasis. Am J Trop Med Hyg 58: 102–109.

28. Coloma J, Harris E (2008) Sustainable transfer of biotechnology to developing countries: fighting poverty by bringing scientific tools to developing-country partners. Ann N Y Acad Sci 1136: 358–368.

29. Miagostovich MP, Sequeira PC, Dos Santos FB, Maia A, Nogueira RM, et al. (2003) Molecular typing of dengue virus type 2 in Brazil. Rev Inst Med Trop Sao Paulo 45: 17–21.

30. Schriefer A, Schriefer AL, Goes-Neto A, Guimaraes LH, Carvalho LP, et al. (2004) Multiclonal Leishmania braziliensis population structure and its clinical implication in a region of endemicity for American tegumentary leishmaniasis. Infect Immun 72: 508–514.

31. Falush D (2009) Toward the use of genomics to study microevolutionary change in bacteria. PLoS Gen 5: e1000627. doi:10.1371/journal.pgen.1000627.

32. Plowe CV, Roper C, Barnwell JW, Happi CT, Joshi HH, et al. (2007) World Antimalarial Resistance Network (WARN) III: Molecular markers for drug resistant malaria. Malar J 6: 121.

33. Sibley CH, Barnes KI, Watkins WM, Plowe CV (2008) A network to monitor antimalarial drug resistance: a plan for moving forward. Trends Parasitol 24: 43–48.

34. Bifani PJ, Mathema B, Kurepina NE, Kreiswirth BN (2002) Global dissemination of the Mycobacterium tuberculosis W-Beijing family strains. Trends Microbiol 10: 45–52.

35. Filliol I, Driscoll JR, van Soolingen D, Kreiswirth BN, Kremer K, et al. (2003) Snapshot of moving and expanding clones of Mycobacterium tuberculosis and their global distribution assessed by spoligotyping in an international study. J Clin Microbiol 41: 1963–1970.

36. Manca C, Reed MB, Freeman S, Mathema B, Kreiswirth B, et al. (2004) Differential monocyte activation underlies strain-specific Mycobacterium tuberculosis pathogenesis. Infect Immun 72: 5511–5514.

37. Valway SE, Sanchez MP, Shinnick TF, Orme I, Agerton T, et al. (1998) An outbreak involving extensive transmission of a virulent strain of Mycobacterium tuberculosis. N Engl J Med 338: 633–639.

38. Gagneux S, Comas I (2009) The past and future of tuberculosis research. PLoS Path 5(10): e600. doi:10.1371/journal.ppat.1000600.

39. Poon LL, Chan KH, Smith GJ, Leung CS, Guan Y, et al. (2009) Molecular detection of a novel human influenza (H1N1) of pandemic potential by conventional and real-time quantitative RT-PCR assays. Clin Chem 55: 1555–1558.

40. Reis JN, Palma T, Ribeiro GS, Pinheiro RM, Ribeiro CT, et al. (2008) Transmission of Streptococcus pneumoniae in an urban slum community. J Infect 57: 204–213.

41. Vieira N, Bates SJ, Solberg OD, Ponce K, Howsmon R, et al. (2007) High prevalence of enteroinvasive Escherichia coli isolated in a remote region of northern coastal Ecuador. Am J Trop Med Hyg 76: 528–533.

42. Riley LW (2004) Molecular epidemiology of infectious diseases: Principles and practices. Herndon (Virginia): ASM Press.

43. Teufel A, Krupp M, Weinmann A, Galle PR (2006) Current bioinformatics tools in genomic biomedical research. Int J Mol Med 17: 967–973.

44. Davila AM, Lorenzini DM, Mendes PN, Satake TS, Sousa GR, et al. (2005) GARSA: Genomic analysis resources for sequence annotation. Bioinformatics 21: 4302–4303.

45. Cook-Deegan RM, McCormack SJ (2001) Intellectual property. Patents, secrecy, and DNA. Science 293: 217.

46. Burton B (2002) Proposed genetic database on Tongans opposed. BMJ 324: 443.

47. Pang T (2002) The impact of genomics on global health. Am J Public Health 92: 1077–1079.

48. Chokshi DA, Thera MA, Parker M, Diakite M, Makani J, et al. (2007) Valid consent for genomic epidemiology in developing countries. PLoS Med 4: e95. doi:10.1371/journal.pmed.0040095.


Perspective

Can an Infectious Disease Genomics Project Predict and Prevent the Next Pandemic?

Rajesh Gupta¤*, Mark H. Michalski¤, Frank R. Rijsberman

Google.org, Mountain View, California, United States of America

We believe that there is great potential

in the systematic application of genomics,

proteomics, and bioinformatics to infec-

tious diseases, and that this potential has

yet to be fully realized. We suggest that the

international community unite under an

Infectious Disease Genomics Project, anal-

ogous to the Human Genome Project,

with a goal of a comprehensive, open-

access system of genomic information to

accelerate scientific understanding and

product development in the very settings

where diseases have the highest probabil-

ity of emerging. If properly structured,

such an approach could shift fundamen-

tally the global response to emerging

infectious diseases.

Genomics Is Systematically Transforming Medicine

The ‘‘Genomic Revolution’’ has trans-

formed our vision and understanding of

how living organisms and systems interact

with each other and with the environment

[1]. Increasingly, the science of genomics

serves as the foundation for translational

research for advancing the management of

many important diseases [2–7]. Decreas-

ing costs and increasing throughput of new

technologies have made possible multinational

collaboration on large-scale projects

such as the Human Microbiome Project

and the 1000 Genomes Project [8–10].

Infectious disease management is also

transforming thanks to molecular technol-

ogies as seen in HIV [11,12], tuberculosis

[13,14], malaria [15,16], and other ne-

glected tropical diseases [17,18]. Discov-

ering novel pathogens and elucidating the

implications of genetic variation among

existing pathogens [19,20] is critical for

rapidly mitigating pandemic threats, as

demonstrated recently with severe acute

respiratory syndrome (SARS) [21,22] and

avian (H5N1) and pandemic H1N1 2009

influenza (commonly referred to as ‘‘swine

flu’’) [23–26].

To fully harness the benefit of genomics

in infectious diseases, a chain of overarch-

ing activities must occur. First, under-

standing the dynamics of infectious diseas-

es through the genomics lens requires a

tremendous amount of integrated com-

parative sequence, expression, epigenetic,

and proteomic data from a variety of

pathogens (bacteria, viruses, protozoa, fungi),

vectors (arthropod and avian sources),

reservoirs (non-human mammals, environ-

ment) and human hosts. Second, generat-

ing, collating, organizing, and curating

these data is an essential public health task.

Third, translating this information to tools

to improve surveillance and response

mechanisms is critical to effectively impact

disease management.

If this bench-to-bedside chain of activities

were optimized, we envision that the

following could occur:

• Fully annotated genomes of all known pathogens, vectors, non-human hosts, and reservoir species, as well as a large number of candidate microbes in families that have a high risk of generating future pathogens, are held in public open-access databases such as GenBank.

• A ‘‘Genomic search’’ of all available contextual information, from sample origins through to published analyses, is as simple as a Google search.

• Sequencing and other molecular technologies are everyday tools-of-the-trade in every district hospital and laboratory in hotspots of emerging infectious disease, such as southeast Asia and sub-Saharan Africa.

• Automated molecular diagnostic assays are low-cost, reduced at least to the size of a smart mobile phone, and can return definitive diagnoses of a range of specialized known pathogen panels at the point of care.

• A range of products that use infectious disease genomic information routinely (such as vector maps, early warning systems, diagnostics, vaccines, and drugs) contribute to the prediction and prevention of epidemics.

While progress is occurring in each of

these areas, the outputs—which are need-

ed today—are far from complete.

Creating an Infectious Disease Genomics Project (IDGP)

We believe that accelerated advances in

the area of infectious diseases can occur

under a global collaborative framework

composed of discrete and delineated

activities between the public and private

sectors among resource-wealthy and re-

source-limited settings. The Human Ge-

nome Project (HGP) was a pioneering

international effort that helped unlock the

power of genomics for human health

The Perspective section provides experts with a forum to comment on topical or controversial issues of broad interest.

This article is part of the ‘‘Genomics of Emerging Infectious Disease’’ PLoS Journal collection (http://ploscollections.org/emerginginfectiousdisease/).

Citation: Gupta R, Michalski MH, Rijsberman FR (2009) Can an Infectious Disease Genomics Project Predict and Prevent the Next Pandemic? PLoS Biol 7(10): e1000219. doi:10.1371/journal.pbio.1000219

Copyright: © 2009 Gupta et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Funding: Google.org is financially supported through its parent company, Google.com. At the time this manuscript was developed, RG was an employee of Google.org and MM was a consultant to Google.org. The funder had no role in the decision to publish or preparation of the manuscript.



¤ Current address: Stanford University, Stanford, California, United States of America


[27,28]. This effort generated important

information in part by having clear,

targeted outcomes and by implementing

a standard methodology across all partic-

ipants. The HGP was a great impetus for

progress seen thus far in genomics and

health. Moreover, the HGP recognized

that sequencing was just the first step in

a much bigger process [26]. A similar

effort for infectious diseases could, in our

view, help predict and prevent the next

pandemic.

To capitalize on existing successful

efforts in the area of genomics and

infectious diseases such as those by the

Broad Institute, Genomic Standards Consortium,

J. Craig Venter Institute, the

National Institute of Allergy and Infectious

Diseases, and the Wellcome Trust Sanger

Institute (to name a few), we urge the

international community to unite its nu-

merous activities under an Infectious

Disease Genomics Project (IDGP), a

coordinated, large-scale, international ef-

fort focused on the genomes of pathogens,

vectors, hosts, and reservoirs and linked to

end-point surveillance and response sys-

tems. Such a project could coordinate

activities in four specific areas: generating

data, linking data, analyzing data, and

applying data (Figure 1).

Generating Data

At the outset, the IDGP would need to

determine what the world requires in

terms of genomic information. A standard

approach to generating depth and diver-

sity in genomic data is essential; beyond

this, continuous real-time surveillance and

characterization of evolving pathogens can

help effectively forestall future epidemics/

pandemics. Frontline work by consor-

tiums, genome research centers, and

individual laboratories has yielded baseline

approaches in this area and a wealth of

critical genomic information for many

important infectious agents [29–34].

While each actor in the genomics field

brings its own priority for targeting

particular pathogens or diseases, a clear

roadmap to generating a complete geno-

mic picture of all infectious agents, emerg-

ing threats, hosts, and reservoirs, incorpo-

rating a broad range of investigators with

varied technological capacity, would en-

hance both data generation and applica-

tion. Such a process allows for communi-

ty-level priority setting, thereby enabling

smaller-scale laboratories to tailor projects

to fit the needs of local communities while

contributing to global efforts.

Linking Data

The data collected must be connected

to all relevant information and analytical

tools in a single, easy-to-use, open-source,

real-time interface. Such a system would

improve on current systems by: gathering

data across the public domain and work-

ing with companies/institutions to harness

information in the private domain; linking

accurate, annotated sequencing informa-

tion to functional genomic and proteo-

mic/functional proteomic information;

attaching scientific literature associated

with all levels of information; and includ-

ing a self-sustaining financial mechanism

potentially based on royalties from com-

mercial products generated from the use of

this system.

Analyzing Data

The data need to be linked via large-scale,

dynamic databases held in virtual

servers allowing for collaboration and

sharing while maintaining originating

information for data rights and sovereign-

ty. Concurrently, these data should be

associated with a centralized collection of

open-source bioinformatics tools capable

of real-time operation in low- and high-

speed computers and varying levels of

internet connectivity. A single interface

also would bring various sample collec-

tions together in formally structured bio-

banks that capture geospatial and context

data to allow efficient scientific collabora-

tion to take place. Centralizing the entire

spectrum of information and analytic tools

also allows researchers in resource-limited

settings to participate in the genomics

revolution without prohibitively costly

machines, laboratories, and sample acces-

sibility. Although we fully acknowledge

Figure 1. A coordinated Infectious Disease Genome Project (IDGP) could unify sequencing efforts, enhance data usability, and lead to essential tools for infectious disease management. doi:10.1371/journal.pbio.1000219.g001

Author Summary

The world of genomics is transforming medicine, and is likely to influence the future development of new drugs, diagnostics, and vaccines. To date, the greater focus of genomics and medicine has been on conditions affecting resource-wealthy settings, primarily involving scientists and companies in those settings. However, we believe that it is possible to expand genomics into a more global technology that can also focus on diseases of resource-limited settings. This goal can be achieved if genomics is made a global priority. We feel one way to move in this direction is through a comprehensive approach to infectious diseases (i.e., an Infectious Disease Genomics Project) that would mirror the Human Genome Project. Without an active, unified effort specifically focused on allowing actors at any level to participate in the genomics revolution, infectious diseases that primarily affect the poor will likely not achieve the same level of scientific advancement as diseases affecting the wealthy.


that internet connectivity is a requirement

that is not currently available to all, rapid

technical innovation and investment from

cheap netbook computers to new fiber

optic cables in Africa are changing that

equation. This system could be facilitated

by virtual community collaboration or

crowd-sourcing, taking full advantage of

networking tools such as Wikipedia, Face-

book, Twitter, FusionTables, and PLoS.

Applying Data

Technological advances for basic scientific

discovery (such as next-generation

sequencers, microarrays, mass spectrome-

ters, cell-based assay methods, and other

tools for transcriptome, metabolome, and

proteome discovery), novel techniques to

increase throughput and/or decrease the

cost of analysis, and applied clinical

decision-making and surveillance tools

(point-of-care diagnostics, rapid multi-

pathogen assays) are in progress and

should be supported actively. The IDGP

should be informed by and incorporate

emerging technology platforms to rapidly

develop more accurate field diagnostics

and to identify new opportunities for

vaccine and drug development.

Moving beyond Discourse into Action

An IDGP is attainable if others share

this vision, show leadership, and see the

added value resulting from a coordinated

effort. The HGP certainly was a more

targeted effort and we acknowledge that

an IDGP will have additional obstacles to

overcome. Scientific disagreement over

targets is bound to occur. Complications

resulting from the proposed level of data

sharing should not be underestimated, and

care must be taken to ensure proprietary

rights and acknowledgement when war-

ranted. Adapting molecular genetic tech-

nologies to resource-limited settings is a

significant challenge, but is occurring with

some success. Bringing together a com-

munity of scientists and donors, each with

their own objectives and goals, to work

under a single framework, is a difficult

proposition. Finally, there will be many

who will find this perspective simply too

grandiose. Leaps of progress also require

big visions, however, and it may just be

possible that the 2009 H1N1 influenza

pandemic is enough of a reminder of

what is at stake to provide a catalyst for

action.

Google.org has supported global public

health through its ‘‘Predict and Prevent’’

initiative with the aim of using the power

of information and technology to address

emerging infectious diseases by helping the

world to know where to look for these

diseases, find the threats earlier, and

respond to them faster [35]. Google.org

has focused its support on sequencing and

pathogen discovery activities, bringing

genomic technologies to resource-limited

settings in East Africa, improving surveil-

lance networks and systems, and exploring

how our core competence in internet

search can assist the infectious diseases

community [36].

As firm supporters of the open access

model for scientific publication [37],

Google.org is pleased to support this series

of essays, The Genomics of Emerging

Infectious Disease, in partnership with the

Public Library of Science (PLoS) journals

(PLoS Biology, PLoS Computational Biology,

PLoS Genetics, PLoS Medicine, PLoS Neglected

Tropical Diseases, and PLoS Pathogens), not

only to help define the current state of the

art in pathogen genomics, but also, we

hope, to stimulate debate on priorities for

research and technology development.

References

1. Yudell M, DeSalle R (2002) The genomic

revolution: Unveiling the unity of life. Washing-

ton (D. C.): Joseph Henry Press. 272 p.

2. Langston AA, Malone KE, Thompson JD,

Daling JR, Ostrander EA (1996) BRCA1 mutations

in a population-based sample of young women with

breast cancer. N Engl J Med 334: 137–142.

3. Futreal P, Liu Q, Shattuck-Eidens D, Cochran C,

Harshman K, et al. (1994) BRCA1 mutations in

primary breast and ovarian carcinomas. Science

266: 120–122.

4. Helgadottir A, Manolescu A, Thorleifsson G,

Gretarsdottir S, Jonsdottir H, et al. (2004) The

gene encoding 5-lipoxygenase activating protein

confers risk of myocardial infarction and stroke.

Nature Genetics 36: 233–239.

5. Wellcome Trust Case Control Consortium (2007) Genome-wide

association study of 14,000 cases of seven common

diseases and 3,000 shared controls. Nature 447:

661–678.

6. Consortium G (2007) New models of collabora-

tion in genome-wide association studies: The

Genetic Association Information Network. Nat

Genet 39: 1045–1051.

7. Vigneri P, Wang J (2001) Induction of apoptosis

in chronic myelogenous leukemia cells through

nuclear entrapment of BCR-ABL tyrosine kinase.

Nat Med 7: 228–234.

8. Gresham D, Kruglyak L (2008) Rise of the

machines. PLoS Genet 4: e1000134.

doi:10.1371/journal.pgen.1000134.

9. Spencer G (2008) Researchers establish interna-

tional human microbiome consortium. NIH

News. Available: http://www.nih.gov/news/

health/oct2008/nhgri-16.htm. Accessed 19 Sep-

tember 2009.

10. Spencer G (2008) International consortium an-

nounces the 1000 Genomes Project. NIH News.

Available: http://www.nih.gov/news/health/

jan2008/nhgri-22.htm. Accessed 19 September

2009.

11. Martinez-Cajas JL, Wainberg MA (2008) Anti-

retroviral therapy: Optimal sequencing of therapy

to avoid resistance. Drugs 68: 43–72.

12. Wilkinson KA, Gorelick RJ, Vasa SM, Guex N,

Rein A, et al. (2008) High-throughput SHAPE

analysis reveals structures in HIV-1 Genomic

RNA strongly conserved across distinct biological

states. PLoS Biol 6: e96. doi:10.1371/journal.

pbio.0060096.

13. Smith CV, Sacchettini JC (2003) Mycobacterium

tuberculosis: A model system for structural geno-

mics. Curr Opin Struct Biol 13: 658–664.

14. Cockle PJ, Gordon SV, Lalvani A, Buddle BM,

Hewinson RG, et al. (2002) Identification of novel

Mycobacterium tuberculosis antigens with potential as

diagnostic reagents or subunit vaccine candidates

by comparative genomics. Infect Immun 70:

6996–7003.

15. Gonzales JM, Patel JJ, Ponmee N, Jiang L, Tan A,

et al. (2008) Regulatory hotspots in the malaria

parasite genome dictate transcriptional variation.

PLoS Biol 6: e238. doi:10.1371/journal.

pbio.0060238.

16. Ekland EH, Fidock DA (2007) Advances in

understanding the genetic basis of antimalarial

drug resistance. Curr Opin Microbiol 10:

363–370.

17. Beaty BJ, Prager DJ, James AA, Jacobs-Lorena M,

Miller LH, et al. (2009) From Tucson to genomics

and transgenics: The Vector Biology Network

and the emergence of modern vector biology.

PLoS Negl Trop Dis 3: e343. doi:10.1371/

journal.pntd.0000343.

18. Hertz-Fowler C, Figueiredo LM, Quail MA,

Becker M, Jackson A, et al. (2008) Telomeric

expression sites are highly conserved in Trypano-

soma brucei. PLoS ONE 3: e3527. doi:10.1371/

journal.pone.0003527.

19. Wolfe N, Heneine W, Carr J, Garcia A,

Shanmugam V, et al. (2005) Emergence of

unique primate T-lymphotropic viruses among

central African bushmeat hunters. Proc Natl

Acad Sci U S A 102: 7994–7999.

20. Palacios G, Druce J, Du L, Tran T, Birch C, et al.

(2008) A new arenavirus in a cluster of fatal

transplant-associated diseases. N Engl J Med 358:

991–998.

21. Grant P, Garson J, Tedder R, Chan P, Tam J,

et al. (2003) Detection of SARS coronavirus in

plasma by real-time RT-PCR. N Engl J Med 349:

2468.

22. Marra M, Jones S, Astell C, Holt R, Brooks-

Wilson A, et al. (2003) The genome sequence of

the SARS-associated coronavirus. Science 300:

1399–1404.

23. Gu J, Xie Z, Gao Z, Liu J, Korteweg C, et al.

(2007) H5N1 infection of the respiratory tract and

beyond: A molecular pathology study. Lancet

370: 1137–1145.

24. Zhao Z-M, Shortridge KF, Garcia M, Guan Y,

Wan X-F (2008) Genotypic diversity of H5N1

highly pathogenic avian influenza viruses. J Gen

Virol 89: 2182–2193.

25. Garten RJ, Davis CT, Russell CA, Shu B,

Lindstrom S, et al. (2009) Antigenic and genetic

characteristics of swine-origin 2009 A(H1N1)

influenza viruses circulating in humans. Science

325: 197–201.

26. Shinde V, Bridges CB, Uyeki TM, Shu B,

Balish A, et al. (2009) Triple-reassortant swine

influenza A (H1) in humans in the United States,

2005–2009. N Engl J Med 360: 2616–2625.

27. International Human Genome Sequencing Consortium (2001) Initial sequencing and

analysis of the human genome. Nature 409:

860–921.

28. Collins FS, Morgan M, Patrinos A (2003) The

Human Genome Project: Lessons from large-

scale biology. Science 300: 286–290.

29. Wellcome Trust Sanger Institute (2009) Pathogen

genomics [Web site]. Available: http://www.

sanger.ac.uk/Projects/Pathogens/. Accessed 11

August 2009.


30. National Institute of Allergy and Infectious

Disease (2009) Microbial Genome Sequencing

Centers: Completed NIAID-Supported Sequenc-

ing Projects. Available: http://www3.niaid.nih.

gov/research/resources/mscs/completed.htm.

Accessed 11 August 2009.

31. Cole ST, Brosch R, Parkhill J, Garnier T,

Churcher C, et al. (1998) Deciphering the biology

of Mycobacterium tuberculosis from the complete

genome sequence. Nature 393: 537–544.

32. Gardner MJ, Hall N, Fung E, White O, Berriman M, et al. (2002) Genome sequence of the human malaria parasite Plasmodium falciparum. Nature 419: 498–511.

33. Greene JM, Collins F, Lefkowitz EJ, Roos D, Scheuermann RH, et al. (2007) National Institute of Allergy and Infectious Diseases bioinformatics resource centers: New assets for pathogen informatics. Infect Immun 75: 3212–3219.

34. Field D, Garrity G, Gray T, Morrison N, Selengut J (2008) The minimum information about a genome sequence (MIGS) specification. Nat Biotechnol 26: 541–547.

35. Google.org (2008) Predict and Prevent initiative. Available: http://www.google.org/predict.html. Accessed 19 September 2009.

36. Ginsberg J, Mohebbi MH, Patel RS, Brammer L, Smolinski MS, et al. (2009) Detecting influenza epidemics using search engine query data. Nature 457: 1012–1014.

37. Gass A (2004) Open access as public policy. PLoS Biol 2: e353. doi:10.1371/journal.pbio.0020353.


Perspective

The Role of Genomics in the Identification, Prediction, and Prevention of Biological Threats

W. Florian Fricke, David A. Rasko, Jacques Ravel*

Institute for Genome Sciences (IGS), University of Maryland School of Medicine, Baltimore, Maryland, United States of America

Since the publication in 1995 of the first

complete genome sequence of a free-living

organism, the bacterium Haemophilus influ-

enzae [1], more than 1,000 genomes of

species from all three domains of life—

Bacteria, Archaea, and Eukarya—have

been completed and a staggering 4,300

are in progress (not including an even

larger number of viral genome projects)

(GOLD, Genomes Online Database v.

2.0; http://www.genomesonline.org/gold.

cgi, as of August 2009). Whole-genome

shotgun sequencing remains the standard

in biomedical, biotechnological, environ-

mental, agricultural, and evolution-

ary genomics (http://genomesonline.org/

gold_statistics.htm#aname). While next-

generation sequencing technology is

changing the field, this approach will

continue to be used and lead to a

previously unimaginable number of ge-

nome sequences, providing opportunities

that could not have been thought of a few

years ago. These opportunities include

studying genomes in real-time to under-

stand the evolution of known pathogens

and predict the emergence of new infec-

tious agents (Box 1). With the introduction

of next-generation sequencing platforms,

cost has decreased dramatically, resulting

in genomics no longer being an indepen-

dent discipline, but becoming a tool

routinely used in laboratories around the

world to address scientific questions. This

global sequencing effort has been focusing

primarily on pathogenic organisms, which

today are still the subject of the majority of

genome projects [2]. Sequencing two to

five strains of the same pathogen has, in

recent years, afforded us not only a better

understanding of evolution, virulence, and

biology in general [3], but, taken to the

next level (hundreds or thousands of

strains) it will enable even more accurate

diagnostics to support epidemiological

studies, food safety improvements, public

health protection, and forensics investiga-

tions, among others.

Biodefense Funding for Genomic Research

Since the anthrax letter attacks of 2001,

when letters containing anthrax spores

were mailed to several news media offices

and two Democratic senators in the

United States, killing five people and

infecting 17 others, funding agencies in

the US and other countries have priori-

tized research projects on organisms that

might potentially challenge our security

and economy should they be used as

biological weapons. This has resulted in

large amounts of funding dedicated to so-

called ‘‘biodefense’’ research, totaling close

to $50 billion between 2001 and 2009 [4].

Genomics has benefited greatly from this

influx of research dollars and as a result,

representatives of most major animal, plant,

and human pathogens have been sequenced

(http://www.pathogenportal.org/). Support-

ed by federal funds from the National

Institutes of Health (NIH), the National

Institute of Allergy and Infectious Diseases

(NIAID), and the US Department of De-

fense, research programs, such as the Micro-

bial Sequencing Centers and the Bioinfor-

matics Resource Centers (http://www3.

niaid.nih.gov/topics/pathogenGenomics/

PDF/genomicsinitiatives.htm), have been

established that carry out genomics re-

search on pathogenic organisms and have

spearheaded a new phase of the genomics

revolution. Similar programs were started

in Europe, such as those at the Wellcome

Trust Sanger Institute in the United

Kingdom, and the multinational European

effort, The Network of Excellence Euro-

PathoGenomics (http://www.noe-epg.

uni-wuerzburg.de/epg_general.htm). As

an example of the success of these types

of programs, the genome sequences of over

90,000 influenza viruses were rapidly

generated and are now deposited in

GenBank (http://www.ncbi.nlm.nih.gov/

genomes/FLU/aboutdatabase.html). Be-

cause of the availability of large sequencing

capacity and the large amount of informa-

tion, the response to the 2009 H1N1

influenza pandemic was rapid and efficient

(Box 2): Genomics information was gener-

ated within days and validated diagnostic

tools were approved within weeks [5,6]. A

global response was made possible through

tremendous research efforts enabled by

genomic research.

Access to and Documentation of Sequence Data

Open access to genomics resources (i.e.,

raw sequence data and associated publi-

cations) is an essential component of the

nation's preparedness for biological threats

(biopreparedness), whether intentionally

delivered or not. Although some consider

open-source genomic resources a threat to

security [7] because they make publicly

available information that could facilitate

the construction of dangerous infectious

agents, we strongly disagree with this point

of view. Rather, we and others [8] believe

that it is an enabling tool more useful to

those in charge of our public health and

biosecurity than to those with ill inten-

tions. Genomic sequence data can provide

a starting point for the development of

new vaccines, drugs, and diagnostic tests

[9], hence improving public health capa-

bilities and increasing our bioprepared-

ness. Access to the organisms from which

the sequences are derived should be

restricted, not their genome sequences.

The Perspective section provides experts with a forum to comment on topical or controversial issues of broad interest.

Citation: Fricke WF, Rasko DA, Ravel J (2009) The Role of Genomics in the Identification, Prediction, and Prevention of Biological Threats. PLoS Biol 7(10): e1000217. doi:10.1371/journal.pbio.1000217

Copyright: © 2009 Fricke et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.





Now that genomics technologies are

broadly available, there is the potential

for commercial interests to hamper the

release of genomic data in the public

domain. Thus it is important that federally

funded large-scale genome sequencing

efforts have enforceable rapid release

policies. This accessibility could afford

further opportunities to capitalize on

investments in genome sequencing by

providing the necessary resources to bio-

preparedness.

Whereas genome projects aimed at

sequencing one, two, or three isolates of

a pathogen seemed adequate a few years

ago, it is now possible to sequence rapidly

hundreds of individual genomes for each

species. Access to relevant, well-curated

culture collections [10] and DNA prepa-

rations suitable for sequencing may be-

come a bottleneck in the future when

sequencing resources are no longer limit-

ing. More importantly, the impact of large

genomic sequence datasets from clinical

isolates will be limited without key clinical

metadata that characterize these isolates,

such as patients’ medical information,

date of isolation, and the number of

culture passages in the laboratory. Open

access to large numbers of sequences and

associated metadata allows for powerful

comparative genomic analyses and thus

provides major insights into the charac-

teristics of a pathogen. Standardized

vocabulary should be developed to de-

scribe these isolates and the genes they

contain. Such efforts have already started,

for example through the open-access

journal Standards in Genome Sciences

(SIGS) (http://standardsingenomics.org/

index.php/sigen), but the dedicated re-

sources are not adequate and highlight the

lack of understanding of the importance of

metadata in genomics. Initiatives such as

those of the Genomics Standards Consor-

tium have made great strides [11,12], but

still need widespread implementation

from the ever-expanding genomic com-

munity. Open access to the genomic DNA

that has been sequenced or the culture

from which the DNA was extracted and to

the associated metadata is key to success-

ful genome sequencing projects, whether

on single or several hundred genomes or

metagenomes. Well-documented genome

sequence data will form a key growing

resource for biodefense and other re-

search fields.
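As a rough illustration of the kind of metadata discussed above, the sketch below models one record that could travel with a sequenced isolate. It is only an assumption about what a minimal record might hold: the class name `IsolateRecord`, its field names, and the example values are invented for illustration and are not taken from any existing standard or database.

```python
from dataclasses import dataclass, asdict
from datetime import date
from typing import List, Optional


@dataclass(frozen=True)
class IsolateRecord:
    """Hypothetical minimal metadata record for one sequenced isolate."""
    isolate_id: str
    species: str
    isolation_date: date
    culture_passages: int                     # laboratory passages before sequencing
    source: str                               # e.g. "patient", "food", "animal reservoir"
    culture_collection: Optional[str] = None  # where the culture can be requested

    def missing_fields(self) -> List[str]:
        """Return the names of fields that were left empty (None)."""
        return [name for name, value in asdict(self).items() if value is None]


# Invented example values, used only to show how the record behaves.
record = IsolateRecord(
    isolate_id="example-001",
    species="Escherichia coli O157:H7",
    isolation_date=date(2006, 1, 1),
    culture_passages=2,
    source="patient",
)
print(record.missing_fields())
```

Reporting empty fields explicitly, rather than silently accepting them, is one simple way a submission tool could nudge contributors toward complete metadata.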

Emerging New Bioinformatics Resources

As we enter a new era of modern

genomics, the ever-expanding sequence

datasets are becoming more challenging to

analyze. Future analysts will require powerful

new bioinformatics tools in conjunction with

new computer systems engineered with

genomic analysis in mind. New open-source

bioinformatics software tools are being

developed that exploit Web-based services

and the increasing computing power provid-

ed by academic and commercial ‘‘cloud

computing networks’’ (large computing re-

sources provided as a service over the

Internet). For example, ‘‘Science Clouds’’

(http://workspace.globus.org/clouds/) allow

members of the scientific community to lease

cloud computing resources free of charge.

To leverage these capabilities, novel cloud-

optimized bioinformatics tools are being

developed, such as the genome sequence

read mapper CloudBurst [13]. In addition,

novel resources are currently under devel-

opment to increase the availability of open-

source bioinformatics tools for cloud com-

puting (http://www.nsf.gov/awardsearch/

showAward.do?AwardNumber=0949201;

http://www.nsf.gov/awardsearch/showAward.

do?AwardNumber=0844494). These emerging

tools make access to the World Wide Web the

only requirement to join the genomic revolution

and achieve large-scale bioinformatics analyses

that would not be possible on local servers. As a

consequence, it is conceivable that in the future

genomic research will increasingly move away

from the large sequencing centers toward a

more decentralized organization. Decentralized

Author Summary

In all likelihood, it is only a matter of time before our public health system will face a major biological threat, whether intentionally dispersed or originating from a known or newly emerging infectious disease. It is necessary not only to increase our reactive ‘‘biodefense,’’ but also to be proactive and increase our preparedness. To achieve this goal, it is essential that the scientific and public health communities fully embrace the genomic revolution, and that novel bioinformatic and computing tools necessary to make great strides in our understanding of these novel and emerging threats be developed. Genomics has graduated from a specialized field of science to a research tool that soon will be routine in research laboratories and clinical settings. Because the technology is becoming more affordable, genomics can and should be used proactively to build our preparedness and responsiveness to biological threats. All pieces, including major continued funding, advances in next-generation sequencing technologies, bioinformatics infrastructures, and open access to data and metadata, are being set in place for genomics to play a central role in our public health system.

Box 1. Hot Spots for the Emergence of Infectious Disease

Can we define ‘‘hot spots’’ of microbial populations where new infectious diseases are more likely to evolve? Human contact with new types of infectious agents precedes the emergence of infectious diseases. Infectious agents can be new in the sense of not having previously infected humans or new in the sense that a combination of preexisting genetic factors (for example, mobile elements or regulatory elements) have reassembled to give rise to an infectious agent with a substantially altered genome. The Ebola virus, which first emerged by infecting humans in 1976 in Zaire [21], is an example of the former, whereas the acquisition of antimicrobial resistance by Acinetobacter baumannii [22] is an example of the latter. In both cases, a change in the selective pressure on an infectious agent allows its emergence from a specific setting. This selective pressure may be, for example, the new niche that the human host provides to the pathogen or the antimicrobial selection on a pathogen. Since both events rely on preexisting genetic resources and not on the de novo evolution of virulence factors, the potential of a setting to serve as a hot spot or reservoir for an emerging infectious disease is theoretically predictable from the examination of the total metagenome. In this scenario, traditional microbiological approaches that focus on single isolates of bacteria or viruses are limited in their predictive power since they lack a view of the complete genetic landscape. The potential infectious disease agent could, however, arise from an environment that only contains pieces of a ‘‘virulence puzzle,’’ i.e., individual virulence factors encoded within the genomes of different organisms (the metagenomic ‘‘gene soup’’). These pieces would have to be assembled in one species for the new pathogen to emerge as an infectious agent.


rapid genome sequencing and bioinformatic

analysis of infectious agents will enable near-real-

time global surveillance, detection of new

pathogens, new virulence factors, antimi-

crobial resistance determinants, or engineered

organisms.

Population Genomics Applied to Single Cultures

Because the resources for affordable

high-throughput sequencing, data pro-

cessing, and analysis are available, the

time is right to think about microbial

population genomics and large-scale mi-

crobial metagenomics in the context of

biodefense research (Box 3). Traditional-

ly, the concept of population genomics

has applied to variation within a species.

However, a bacterial culture, even if

derived from a single clone, is composed

of millions of cells that are not necessarily

identical at the genome sequence level,

hence forming a population of genomes.

Therefore we propose to apply the

concept of population genomics to mi-

crobial cultures. The assemblage of

genotypes defines what is called a ‘‘cul-

ture,’’ ‘‘culture stock,’’ or ‘‘reference

strain.’’ Population genomics addresses

the genomic diversity within these assem-

blages and has significant implications for

many fields of research but, most impor-

tantly, for pathogen evolution, diagnos-

tics, epidemiology, and microbial foren-

sics. For example, following the anthrax

mail attacks of 2001, microbiologists and

genomicists joined forces to characterize

the unique genetic traits of the Bacillus

anthracis spores recovered from the enve-

lopes, which were quickly identified as

the B. anthracis Ames strain (DAAR et al.,

unpublished data). Sequencing the ge-

nome of several single colonies obtained

from the spores revealed that the entire

chromosome and its associated plasmids

were 100% identical to the genome

sequence of the ancestral B. anthracis

Ames strain that was stored for over 20

years in a military laboratory in Freder-

ick, Maryland. The only genotypic dif-

ferences were found in a small, pheno-

typically and genetically distinct portion

of cells grown from the spores used in the

attacks. Genomic characterization of

these phenotypic variants revealed a

number of unique genetic alterations that

together provided a characteristic DNA

fingerprint of the spore population that

could be unequivocally matched to the

spore sample used in the attacks. Using

this fingerprint, a genetic assay was

developed to screen a B. anthracis spore

repository, which identified the origin of

the spores as a single spore stock of B.

anthracis Ames. This stock was stored at

the US Army Medical Research Institute

for Infectious Diseases in Fort Detrick,

Maryland, narrowing the pool of suspects

to a manageable number (those who had

access to the spore stock) for the investi-

gative team. The police investigation that

followed identified a potential suspect as

the custodian of the spore stock. This was

the first use of microbial genomics as an

essential tool in a forensic investigation.

In the course of the investigation, scien-

tists had to establish culture repositories

from strains used in research in the US

and build databases of genome sequences

of all B. anthracis isolates. This work took

several years and delayed the investiga-

tion significantly. A lesson to be learned

from this investigation should therefore

be that there is a need for comprehensive

databases of unique DNA fingerprints of

stocks of potentially threatening patho-

gens. In the event that another bioterror

attack were to take place, such genomic

databases would be key in quickly

establishing the source of the biological

material.
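The screening step described above can be pictured with a small sketch. It is not the assay used in the investigation, whose details are not given here; it simply assumes that a fingerprint can be represented as a set of variant labels and that a stock matches when it carries every variant seen in the evidence. The function name `screen_repository`, the variant labels, and the stock identifiers are all invented for illustration.

```python
from typing import Dict, FrozenSet, List

# A "fingerprint" is modelled as a set of variant labels (labels are invented).
Fingerprint = FrozenSet[str]


def screen_repository(evidence: Fingerprint,
                      repository: Dict[str, Fingerprint]) -> List[str]:
    """Return IDs of stocks that carry every variant in the evidence fingerprint."""
    return sorted(stock_id for stock_id, variants in repository.items()
                  if evidence <= variants)  # subset test: all evidence variants present


evidence = frozenset({"variant_A", "variant_B", "variant_C"})
repository = {
    "stock_1": frozenset({"variant_A"}),
    "stock_2": frozenset({"variant_A", "variant_B", "variant_C", "variant_D"}),
    "stock_3": frozenset(),
}
print(screen_repository(evidence, repository))
```

With a comprehensive database of such fingerprints already in place, a lookup of this kind would be a query rather than a multi-year collection effort, which is the point the text makes.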

The concept of population genomics also

applies to epidemiological studies of out-

breaks of infectious diseases such as those

caused by food-borne or zoonotic pathogens

(for example, Salmonella spp.). Traditionally,

epidemiologists and pathologists have used

low-resolution methods such as pulsed-field

gel electrophoresis (PFGE), multi-locus

sequence typing (MLST), or multi-locus

variable number tandem repeats analysis

(MLVA) to trace an individual isolate from

a patient back to a potentially infected food

source or to isolates from other patients

[14–17]. In 2006, for example, during an

outbreak of pathogenic Escherichia coli

O157:H7 infections in 26 states of the

US, which was caused by contaminated

spinach, isolates of the pathogen were

recovered from cows and wild pigs (the

zoonotic reservoirs), bags of spinach (the

vehicle of transmission), and ill patients

(http://www.cdc.gov/mmwr/preview/

mmwrhtml/mm55d926a1.htm). One

of these isolates was designated as the

reference for the outbreak based on

conserved PFGE patterns. Genome

sequencing of several isolates from the

same outbreak performed in our labo-

ratory, however, revealed genomic

variations that questioned a direct

evolutionary link between all out-

break-associated isolates (Eppinger

et al., unpublished data). Comparative

genomics followed by whole-genome

phylogenetic analyses based on single

nucleotide polymorphisms demonstrat-

ed that these isolates were indeed

closely related to one another and only

distantly related to other E. coli

O157:H7 isolates, hence linking all

isolates to the same outbreak, some-

thing that was not possible using PFGE

patterns. In this case, phylogenetic

analyses suggest that several highly

related genotypes were at the source

of the outbreak, thus challenging the

Box 2. Pandemic H1N1 2009 Influenza: A Recent Example of the Impact of Genomics on Biopreparedness

Genomics can be readily applied to follow outbreaks of infectious diseases. This is clearly illustrated by the severe acute respiratory syndrome (SARS) outbreak in 2002–2003 and the emergence and worldwide spread of the pandemic H1N1 2009 influenza virus this year. In both cases, genomics played a key role in the immediate response to the outbreak. Initially, very little was known about the virus responsible for the SARS outbreak. Pangenomic virus microarrays identified it as a coronavirus [23]; however, it was only through detailed sequencing that the specific genotype of this virus could be determined [24]. Comparative sequence analysis identified the SARS virus as distinct from other coronaviruses in terms of its encoded proteins responsible for antigen presentation. This finding ultimately led to development of diagnostics [25] and potential therapeutics [26]. This example of a sequencing approach as a rapid response to a virus outbreak demonstrates that genomics can be a useful and important, if not essential, epidemiological tool. In the ongoing H1N1 influenza outbreak, the National Center for Biotechnology Information (NCBI) established the Influenza Virus Resource (a database and tool for flu sequence analysis, annotation, and submission to GenBank; http://www.ncbi.nlm.nih.gov/genomes/FLU/SwineFlu.html), containing 462 complete viral genome sequences from worldwide viral samples (as of September, 2009). Some of the genomic data was completed, compared, and released to the public within two weeks of isolation of the DNA. The rapid generation of genome sequence data is providing a paradigm shift in the analysis of infectious disease outbreaks, from more classical methods of isolation to the rapid molecular examination of the pathogen in question.


utility of assigning a single reference

strain to a specific outbreak. Instead,

collecting and sequencing tens or

hundreds of isolates from each source

or patient linked to an outbreak would

provide a better basis for understand-

ing the genomic diversity within the

outbreak population and would aid in

defining the population dynamics of an

outbreak.

A New Concept: Contrabiotics

Insufficient attention has been paid to

the human microbiome (i.e., the consor-

tium of microbes that inhabit the human

body) as it relates to our efforts to

increase biopreparedness. New analyses

of the diversity and composition of the

human microbiome are making it in-

creasingly clear that human health

depends on a delicate equilibrium be-

tween the microbial inhabitants and the

human host [18,19]. Severe effects on

health could be caused not only by the

introduction of true pathogens in the

traditional sense into these human-asso-

ciated microbial communities (e.g., Vib-

rio cholerae, the etiologic agent of cholera)

but potentially also by slight shifts in the

proportions of different populations wi-

thin the community that give an other-

wise harmless species or strain an un-

desirable advantage over others, a sim-

ilar situation to what is observed in

bacterial vaginosis [20]. Probiotic dietary supplements of live microorganisms deliver beneficial bacteria that promote a healthy state of the targeted microbiota. Hypothetically, the opposite is also plausible: the healthy microbiota

(skin, gut, or upper respiratory tract,

among others) may be disturbed by

introducing large amounts of ‘‘contra-

biotics,’’ i.e., living nonpathogenic bac-

teria that would shift the microbiota

away from a healthy state. A better

understanding of the ecological princi-

ples that shape the composition of our

microbiome might contribute to our

biopreparedness for such a threat to

public health.


The field of biodefense has thoroughly

embraced genomics and made it a

keystone for developing better identifica-

tion technologies, diagnostic tools, and

vaccines and improving our understand-

ing of pathogen virulence and evolution.

Enabling technologies and bioinfor-

matics tools have shifted genomics from

a separate research discipline to a tool so

powerful that it can provide novel

insights that were not imaginable a few

years ago, including for example redefin-

ing the notion of strains or cultures in the

context of biopreparedness or microbial

forensics. Challenges remain, though,

mostly in the form of large amounts of

data that are being generated, and will

continue to be generated in the future,

and are becoming difficult to manage.

Better bioinformatic algorithms, access to faster computing capabilities, larger or novel and more efficient data storage devices, and better training in genomics are all in critical demand, and will be required to fully embrace the

genomic revolution. Our nation’s pre-

paredness for biological threats, whether

they are deliberate or not, and our public

health system would benefit greatly by

leveraging these capabilities into better

real-time diagnostics (in the environment

as well as at the bedside), vaccines, a

greater understanding of the evolution-

ary process that makes a friendly microbe

become a pathogen (Box 3) (hence to

better predict what microbial foes will be

facing us in the near future), and better

forensics and epidemiological tools. The

time is right to be bold and capitalize on

these enabling technological advances to

sequence microbial species or complex

microbial communities to the greatest

level possible—that is, hundreds of ge-

nomes per species or samples—but let us

not forget that informatics and comput-

ing resources are now becoming the

bottleneck to actually making major

progress in this field.

Box 3. Simple Genomics, Population Genomics, and Metagenomics

It is now technically possible and scientifically desirable to combine sequencing projects on single genomes, genome populations, and metagenomes to study genome evolution. Single-genome projects provide the greatest resolution for identifying genetic factors responsible for specific virulence phenotypes and provide answers to many important questions, such as: What is the minimal gene set in a pathogen required to cause a specific disease phenotype? What does the genetic context of virulence or antibiotic resistance factors tell us about their evolutionary origin or the mobility between different microbial species or even genera? Population-level genome sequencing projects provide us with information about the pangenomic gene pool and the potential of a species to evolve into a novel pathogen. Are certain bacterial species or strains more likely than others to evolve pathogenic traits? What distinguishes a commensal from a pathogenic isolate? What provides the trigger or ability to convert a commensal or opportunistic strain into a pathogen? What role does horizontal gene transfer play in species evolution? Is an infection always caused by an individual isolate or might infection be caused by a combination of individuals in a population that all have different attenuated infectious potentials? Metagenomics projects sample the genetic reservoir (the set of genes carried by all members of a community) within a specific environment or sample. This “gene soup” reflects the maximum genetic potential accessible to individual isolates by horizontal gene transfer.

References

1. Fleischmann RD, Adams MD, White O, Clayton RA, Kirkness EF, et al. (1995) Whole-genome random sequencing and assembly of Haemophilus influenzae Rd. Science 269: 496–512.

2. Guzman E, Romeu A, Garcia-Vallve S (2008) Completely sequenced genomes of pathogenic bacteria: A review. Enferm Infecc Microbiol Clin 26: 88–98.

3. Binnewies TT, Motro Y, Hallin PF, Lund O, Dunn D, et al. (2006) Ten years of bacterial genome sequencing: Comparative-genomics-based discoveries. Funct Integr Genomics 6: 165–185.

4. Franco C (2008) Billions for biodefense: Federal agency biodefense funding, FY2008-FY2009. Biosecur Bioterror 6: 131–146.

5. Rowe T, Abernathy RA, Hu-Primmer J, Thompson WW, Lu X, et al. (1999) Detection of antibody to avian influenza A (H5N1) virus in human serum by using a combination of serologic assays. J Clin Microbiol 37: 937–943.

6. Maurer-Stroh S, Ma J, Lee RT, Sirota FL, Eisenhaber F (2009) Mapping the sequence mutations of the 2009 H1N1 influenza A virus neuraminidase relative to drug and antibody binding sites. Biol Direct 4: 18.

7. Aldhous P (2001) Biologists urged to address risk of data aiding bioweapon design. Nature 414: 237–238.

8. Read TD, Parkhill J (2002) Restricting genome data won’t stop bioterrorism. Nature 417: 379.

9. Bambini S, Rappuoli R (2009) The use of genomics in microbial vaccine development. Drug Discov Today 14: 252–260.

10. Tindall BJ, Garrity GM (2008) Proposals to clarify how type strains are deposited and made available to the scientific community for the purpose of systematic research. Int J Syst Evol Microbiol 58: 1987–1990.

11. Garrity GM, Field D, Kyrpides N, Hirschman L, Sansone SA, et al. (2008) Toward a standards-compliant genomic and metagenomic publication record. OMICS 12: 157–160.

12. Field D, Garrity GM, Sansone SA, Sterk P, Gray T, et al. (2008) Meeting report: The fifth Genomic Standards Consortium (GSC) workshop. OMICS 12: 109–113.

13. Schatz MC (2009) CloudBurst: Highly sensitive read mapping with MapReduce. Bioinformatics 25: 1363–1369.

14. Gerner-Smidt P, Hise K, Kincaid J, Hunter S, Rolando S, et al. (2006) PulseNet USA: A five-year update. Foodborne Pathog Dis 3: 9–19.

15. Urwin R, Maiden MC (2003) Multi-locus sequence typing: A tool for global epidemiology. Trends Microbiol 11: 479–487.

16. Keim P, Price LB, Klevytska AM, Smith KL, Schupp JM, et al. (2000) Multiple-locus variable-number tandem repeat analysis reveals genetic relationships within Bacillus anthracis. J Bacteriol 182: 2928–2936.

17. Boxrud D, Pederson-Gulrud K, Wotton J, Medus C, Lyszkowicz E, et al. (2007) Comparison of multiple-locus variable-number tandem repeat analysis, pulsed-field gel electrophoresis, and phage typing for subtype analysis of Salmonella enterica serotype Enteritidis. J Clin Microbiol 45: 536–543.

18. Gao Z, Tseng CH, Strober BE, Pei Z, Blaser MJ (2008) Substantial alterations of the cutaneous bacterial biota in psoriatic lesions. PLoS One 3: e2719.

19. Turnbaugh PJ, Ley RE, Mahowald MA, Magrini V, Mardis ER, et al. (2006) An obesity-associated gut microbiome with increased capacity for energy harvest. Nature 444: 1027–1031.

20. Srinivasan S, Fredricks DN (2008) The human vaginal bacterial biota and bacterial vaginosis. Interdiscip Perspect Infect Dis 2008: 750479.

21. Pourrut X, Kumulungui B, Wittmann T, Moussavou G, Delicat A, et al. (2005) The natural history of Ebola virus in Africa. Microbes Infect 7: 1005–1014.

22. Peleg AY, Seifert H, Paterson DL (2008) Acinetobacter baumannii: Emergence of a successful pathogen. Clin Microbiol Rev 21: 538–582.

23. Wang D, Urisman A, Liu YT, Springer M, Ksiazek TG, et al. (2003) Viral discovery and sequence recovery using DNA microarrays. PLoS Biol 1: e2. doi:10.1371/journal.pbio.0000002.

24. Marra MA, Jones SJ, Astell CR, Holt RA, Brooks-Wilson A, et al. (2003) The genome sequence of the SARS-associated coronavirus. Science 300: 1399–1404.

25. Zhu M (2004) SARS immunity and vaccination. Cell Mol Immunol 1: 193–198.

26. Haagmans BL, Osterhaus AD (2006) Coronaviruses and their therapy. Antiviral Res 71: 397–403.


Perspective

Discovering the Phylodynamics of RNA Viruses

Edward C. Holmes1,2*, Bryan T. Grenfell2,3

1 Center for Infectious Disease Dynamics, Department of Biology, The Pennsylvania State University, Mueller Laboratory, University Park, Pennsylvania, United States of America, 2 Fogarty International Center, National Institutes of Health, Bethesda, Maryland, United States of America, 3 Department of Ecology and Evolutionary Biology and Woodrow Wilson School, Princeton University, Princeton, New Jersey, United States of America

Phylodynamics: The Discovery Phase

The advent of extremely high through-

put DNA sequencing ensures that genomic

data from microbial organisms can be

acquired in unprecedented quantities and

with remarkable rapidity. Although this

genomic revolution will affect all microbes

alike, our focus here is on RNA viruses, as

the rapidity of their evolution, which is

observable over the time scale of human

observation, allows phylodynamic infer-

ences to be made with great precision. In

the foreseeable future it is likely that

complete genome sequencing will become

the standard method of viral characteriza-

tion, providing the highest possible reso-

lution for phylogenetic studies. The rapid-

ity with which genome sequence data were

generated from the ongoing epidemic of

swine-origin H1N1 influenza A virus [1] is

testament to the power of this technology.

Understandably, pathogen discovery is

a major focus of this new-scale genome

sequencing [2]. It is now possible to

sequence the entire assemblage of viruses

in a particular tissue type or host species

[3–5], as well as all those viruses that are

associated with specific disease syndromes

[6,7]. In essence, this new era of metage-

nomics constitutes a crucial taxonomic

discovery phase in virology and epidemi-

ology that allows the genetic characteriza-

tion of new viruses within hours of their

isolation.

Assembling an inventory of viruses that

may emerge in human populations is of

major importance to public health and to

students of biodiversity. However, it is only

the first step in developing a full quanti-

tative understanding of the processes that

shape the epidemiology and evolution—

the phylodynamics—of RNA virus infec-

tions [8]. To achieve this goal, we argue

here that the field of viral phylodynamics

requires its own discovery phase; that is, a

comprehensive and quantitative analysis

of the interaction between the ecological

and evolutionary dynamics of all circulat-

ing RNA viruses from the molecular to the

global scale. Such a marriage of phyloge-

netic and epidemiological dynamics is

currently only potentially possible for the

select few human viruses for which large

genome sequence datasets have been

acquired, such as HIV and influenza A

virus, and even here fundamental gaps in

our knowledge remain (see below). Indeed,

it is striking that so few complete genome

sequences are currently available for

viruses whose epidemiological dynamics

are known in exquisite detail, such as

measles [9,10]; these sequences have been

so sparsely sampled in both time and space

that a full phylodynamic perspective has

not yet been achieved. We contend that a

better understanding of RNA virus phylo-

dynamics will allow more directed at-

tempts at pathogen surveillance, facilitate

more accurate predictions of the epidemi-

ological impact of newly emerged viruses,

and assist in the control of those viruses

that exhibit complex patterns of antigenic

variation such as dengue and influenza.

Just as PCR and first-generation DNA

sequencing ushered in the science of

molecular epidemiology, so next-genera-

tion sequencing may herald the age of

phylodynamics. Box 1 lists a number of

key questions that can be addressed within

this phylodynamics research program.

A number of important advances are

needed to meet our goal of a comprehen-

sive catalog of the diversity of phylody-

namic patterns in RNA viruses. Because

answers to many of the most interesting

research questions depend on sufficiently

large sample sizes, we require large

numbers of sequences that have been

rigorously sampled according to strict

temporal, spatial, and clinical criteria,

with as much of these data as possible made publicly

accessible. A phylodynamic

analysis has little value unless viral ge-

nomes are sampled on the same scale as

the epidemiological processes under inves-

tigation.

The only acute virus for which a suitably

expansive genome dataset currently exists is

influenza. In this case, the >4,000 complete genomes generated under the Influenza Genome Sequencing Project [11]

have provided important new insights into

the evolution and epidemiology of this

major human pathogen [12]. To highlight

one key insight here, these genome se-

quence data have revealed that multiple

lineages of influenza virus are imported and

circulate within specific geographic locali-

ties (even within relatively isolated popula-

tions), generating both frequent mixed

infections [13] and reassortment events

[14]. Even so, the sampling of these

genome sequences (and associated epide-

miological covariates) may not be dense

enough to fully capture spatial dynamics

[15]. There is also a marked absence of

samples from asymptomatically infected

patients (or those with mild disease), so it

is impossible to link genetic variation to

clinical syndrome. Such a bias against

viruses sampled from individuals with

asymptomatic infections is a common

problem in molecular epidemiology.

Citation: Holmes EC, Grenfell BT (2009) Discovering the Phylodynamics of RNA Viruses. PLoS Comput Biol 5(10): e1000505. doi:10.1371/journal.pcbi.1000505

Editor: Ernest Fraenkel, Massachusetts Institute of Technology, United States of America

Copyright: © 2009 Holmes, Grenfell. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Funding: BTG was supported by the RAPIDD program of the Science & Technology Directorate of the Department of Homeland Security and the National Institutes of Health (NIH), and National Science Foundation grant 0742373. ECH was supported by the NIH (grant GM080533). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

PLoS Computational Biology | www.ploscompbiol.org | October 2009 | Volume 5 | Issue 10 | e1000505

Epidemiological Factors

It is also clear that for many RNA viruses we need to better understand a number of key epidemiological factors,

such as the interaction between local

persistence, epidemic dynamics in both

time and space, the impact of measures to

control the spread of infection, and the

consequences of adaptive evolution in

those viral genes that interact most

intimately with the host immune response.

It is instructive to imagine the ideal

database for addressing these issues. In

the case of acute infections, the goal would

be to collect four parallel datasets on the

appropriate scale of interest during out-

breaks (Figure 1). This database would

comprise, first, epidemic dynamics in time and

space, ideally at a comparable or higher

frequency than the generation time of

individual infections. Second, and in

parallel, our ideal study would collect viral

genome sequence data at these time points,

sampling both within and among infected

hosts. Both disease incidence data (bol-

stered by contact tracing) and viral

sequence data furnish information on the

transmission network traced by an out-

break. Third, we would need to know the

underlying contact network of susceptible

individuals, which serves as fuel for the

epidemic. This is a difficult structure to

measure directly, although novel measure-

ments of human interactions are increas-

ingly shedding light on the problem [16].

Finally, measurements of the immunity

structure of our contact network [17]—

reflecting the past history of the virus in

the population—are key for understanding

both the dynamics of epidemic spread and

the evolutionary pressures that shape virus

diversity.

The outbreak of foot-and-mouth disease

(FMD, an RNA virus infection of cattle) in

the UK in 2001 resulted in a database that

is arguably closest to our ideal on the

epidemiological scale [18,19]. Notwith-

standing a variety of gaps in data from

the epidemic [20], it is one of the most

well-documented large outbreaks in terms

of the availability of spatiotemporal inci-

dence data in parallel with contact tracing

and the underlying spatial pattern of the

susceptible farms as a measure of the

contact network. In addition, analyses of

viral sequences from relatively small sam-

ples of farms have drawn important

conclusions about epidemic spread and

allowed the testing of new methods to

recover the spatiotemporal patterns writ-

ten into sequence data [18,20]. Impor-

tantly, samples exist from over half the

~2,000 confirmed infected premises in

2001: sequencing whole FMD virus ge-

nomes from these samples would provide a

vast resource for basic and applied developments in integrating epidemiological and phylogenetic information to dissect spatiotemporal spread. We suggest that achieving this task would be a huge contribution to understanding the phylodynamics of acute viruses. Another virtue of animal infections like FMD is that the relationship between the determinants of viral variability within and between hosts can also be dissected by experimental infections (see [21] for another example).

Box 1. Key Research Questions in RNA Virus Phylodynamics

(1) What is the range of phylodynamic patterns observed in RNA viruses? Can they be categorized into specific groups? How do these patterns relate to other “life history” variables exhibited by RNA viruses?

(2) What epidemiological and evolutionary processes give rise to these phylodynamic patterns? What generalities can be drawn?

(3) How commonly does natural selection (compared to neutral evolutionary processes) determine the population dynamics of pathogens? On what scale does natural selection act? How does viral immune escape reduce herd immunity at the population level and allow the persistence of viral lineages in epidemic troughs?

(4) What is the range of spatial patterns exhibited by RNA viruses? What epidemiological factors are responsible for these patterns?

(5) How do different viral species (various respiratory viruses, for example) interact in host immunity?

Figure 1. Sampling scales for acute RNA viruses and the associated phylodynamic processes that viral genome sequence data and host sampling can elucidate. doi:10.1371/journal.pcbi.1000505.g001

A parallel limitation of many phyloge-

netic approaches to viral epidemiology is

that they have often proceeded in the

absence of the necessary metadata, such as

the precise time and place of sampling or

those that relate to clinical syndrome [22].

A perhaps more challenging goal for

phylodynamics is therefore to integrate

phylogenetic patterns with other biological

variables, such as the nature of antigenic

variation, the capacity for drug resistance,

or the clinical syndrome of the host, as well

as the spatial host network data outlined

above. Cohort studies may be the most

productive way to link genomics with

epidemiological variables.
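As a rough picture of the metadata discussed above, the following Python sketch defines a record that ties one sequence to its sampling time, place, and clinical context. The class name, field names, and example values are all hypothetical; they do not correspond to any actual GenBank or cohort-study schema.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class SampleRecord:
    """Hypothetical record linking one viral sequence to its sampling metadata."""

    accession: str                           # sequence identifier (placeholder values only)
    collection_date: date                    # exact day of sampling
    latitude: float                          # decimal degrees, -90 to 90
    longitude: float                         # decimal degrees, -180 to 180
    clinical_syndrome: Optional[str] = None  # e.g. "asymptomatic", "severe"
    drug_resistant: Optional[bool] = None    # None means not tested

    def __post_init__(self) -> None:
        # Reject coordinates that cannot be real sampling locations.
        if not -90.0 <= self.latitude <= 90.0:
            raise ValueError("latitude must be between -90 and 90")
        if not -180.0 <= self.longitude <= 180.0:
            raise ValueError("longitude must be between -180 and 180")


# Example with made-up values.
record = SampleRecord(
    accession="HYPOTHETICAL-0001",
    collection_date=date(2009, 4, 15),
    latitude=40.80,
    longitude=-77.86,
    clinical_syndrome="mild",
)
print(record.collection_date.isoformat(), record.clinical_syndrome)
```

Making the clinical fields optional reflects the point made in the text: such metadata are often missing, and a record should be able to say so explicitly rather than omit the sequence.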

The lack of a synthesis of phylogenetic

and phenotypic/epidemiological data is

reflected in the current debate over the

mode of antigenic evolution in human

influenza A virus. Although it has long

been known that the hemagglutinin (HA)

and neuraminidase (NA) proteins of hu-

man influenza A virus evolve by strong

natural selection to evade the host immune

response—a process commonly called

antigenic drift [23,24]—the precise mech-

anisms by which such drift occurs are

uncertain. From a phylodynamics perspec-

tive, the key observation is that over long

time periods a single lineage of HA

sequences from subtype A/H3N2 influen-

za viruses links epidemic to epidemic [23],

although intensive sampling has revealed

that single populations may harbor far

higher levels of genetic diversity [25].

Rather different phylodynamic patterns

are seen in other influenza viruses, includ-

ing those sampled from birds (Figure 2).

Three models have been proposed to

explain the distinctive phylodynamic pat-

tern observed in human A/H3N2 viruses:

(i) that there is short-lived cross-immunity

among viral strains [26], (ii) that the HA

evolves in a punctuated manner among

antigenic types that are linked by a

network of neutrally evolving sites [27],

and (iii) that the virus continually reuses a

limited number of antigenic combinations

[28].

To determine which combination of

these models best explains influenza phy-

lodynamics will require more expansive

genome sequence data, as well as focused

sampling and epidemiological surveillance

in Southeast Asia, which is likely the global

source population for the virus [29]. More

importantly, it is also crucial that these

phylogenetic data are combined with

detailed, spatiotemporally disaggregated

antigenic information. Indeed, it is re-

markable that despite the abundance of

information on the antigenic characteris-

tics of individual influenza viruses, most

notably through the use of the hemagglutination

inhibition (HI) assay [17], these data

have not been routinely linked to phylo-

genetic information. It is clear that both

antigenic and phylogenetic analyses would

greatly benefit from each other.

New-Generation Computational Tools

Another important challenge for phylo-

dynamics is to match the remarkable

ongoing developments in genome se-

quencing technology to the increase in

the power of the computational tools

available to analyze these sequence data.

Crucially, in phylogenetics, the size of the

space of possible trees increases faster than

exponentially with the number of sequenc-

es, such that the availability of datasets

comprising thousands of complete ge-

nomes [30] presents a major combinato-

rial problem. This problem creates a

growing discrepancy between our ability

to generate genome sequence data and our

capacity to analyze them using the most

sophisticated methods. Redressing this imbalance should be the major goal of bioinformatics in the future; in fact, some progress has been made recently [31].

Figure 2. Phylodynamic patterns of human and avian influenza viruses. The left diagram shows the phylogeny of the hemagglutinin (HA) gene of human H3N2 influenza A viruses sampled between 1985 and 2005, revealing the “ladder-like” branching structure indicative of antigenic drift. By comparison, the phylogeny of the HA gene of human influenza B virus sampled over the same interval (center diagram) shows the co-circulation of the antigenically distinct “Victoria 1987” and “Yamagata 1988” lineages, as well as a shorter length from root to tip, reflecting a lower rate of evolutionary change. Finally, the phylogeny for the HA gene of H4 avian influenza virus (right diagram) reveals the deep geographic division between the Eurasian and Australian versus North American lineages of this virus. doi:10.1371/journal.pcbi.1000505.g002
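To make the combinatorial problem in tree space concrete, the following Python sketch counts the unrooted, strictly bifurcating tree topologies for n labelled sequences using the standard double-factorial formula (2n − 5)!!, and compares the count with 2^n. The function name is illustrative and is not taken from any phylogenetics package.

```python
def num_unrooted_trees(n_taxa: int) -> int:
    """Count unrooted, strictly bifurcating topologies for n labelled tips.

    Uses the standard result (2n - 5)!! = 3 * 5 * ... * (2n - 5) for n >= 3.
    """
    if n_taxa < 3:
        return 1  # one or two tips admit only a single (trivial) topology
    count = 1
    for odd in range(3, 2 * n_taxa - 4, 2):  # odd factors 3, 5, ..., 2n - 5
        count *= odd
    return count


for n in (5, 10, 20, 50):
    trees = num_unrooted_trees(n)
    # Compare against plain exponential growth, 2**n.
    print(f"n={n:>3}  trees={trees:.3e}  2^n={2 ** n:.3e}")
```

Even at 50 sequences the number of topologies dwarfs 2^50, which is why exhaustive search is not an option for datasets of thousands of genomes and heuristic or sampling-based methods are used instead.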

It is also clear that improvements need

to be made to the methods that are

available to analyze genome sequence

data. A powerful set of research tools in

this area comprises those based on coales-

cent theory, as this provides a natural link

between the analysis of epidemiological

and phylogenetic patterns [8,32]. In par-

ticular, the coalescent allows the demo-

graphic characteristics of viral populations

(particularly population size and growth

rate) to be inferred directly from gene

sequence data. Coalescent analyses are

especially powerful in the case of RNA

viruses, because their rapid evolution

means that temporal and spatial dynamics

are discernable over the period of human

observation [33] and can in theory be

combined with time series epidemiological

data. However, currently available coales-

cent methods are restricted by the limited

scope of demographic models and their

inability to fully incorporate spatial infor-

mation. In particular, most acute RNA

viruses have complex population dynamics

that combine distinct periods of growth

and decline. The most commonly used

phylodynamic tool available in such cases

is the Bayesian skyline plot (and the related

Bayesian ‘‘skyride’’ [34]), which represents

a piecewise graphical depiction of changes

in genetic diversity through time [32]. In

the case of neutral evolution, such changes

in genetic diversity also reflect underlying

changes in the number of infected hosts.

Although the Bayesian skyline plot can

reveal unique features of epidemic dynam-

ics (Figure 3) [30], precise estimates of

parameters such as population growth rate

are not yet possible.
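As a rough illustration of how the intervals between coalescent events carry demographic information, the sketch below implements the classic (non-Bayesian) skyline estimator. This is a simpler precursor of the Bayesian skyline plot discussed above, not that method itself. It assumes a single known genealogy with all sequences sampled at the same time. Under the standard coalescent, the expected duration of the interval with k lineages is the scaled population size divided by k(k − 1)/2, so each interval length multiplied by k(k − 1)/2 gives a per-interval estimate. The function name and input values are hypothetical.

```python
from math import comb
from typing import List, Sequence


def classic_skyline(interval_lengths: Sequence[float]) -> List[float]:
    """Per-interval estimates of scaled effective population size.

    interval_lengths[i] is the time during which the genealogy has
    (n - i) lineages, where n = len(interval_lengths) + 1 is the number
    of sampled sequences. Intervals are ordered from the tips (present)
    back toward the root.
    """
    n = len(interval_lengths) + 1
    estimates = []
    for i, gamma in enumerate(interval_lengths):
        k = n - i                               # lineages present in this interval
        estimates.append(gamma * comb(k, 2))    # gamma_k * k(k-1)/2
    return estimates


# Made-up intervals for a four-sequence genealogy (4, 3, then 2 lineages).
print(classic_skyline([0.5, 1.0, 3.0]))  # each interval yields the same estimate
```

In this toy input every interval returns the same value, as expected for a constant-size population; intervals that are longer or shorter than that expectation would show up as steps in the plot.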

The coalescent methods commonly

used to study RNA virus evolution focus

largely on temporal dynamics (a natural

function of the rapidity of viral evolution),

with little consideration of patterns of

spatial diffusion. Although these phylogeo-

graphic patterns are becoming increasing-

ly well described for RNA viruses [35], few

methods effectively recover the spatial

component in genome sequence data.

For example, commonly used parsimony-

based approaches consider a single phylo-

genetic tree without an explicit spatial

model (see, for example, [36]). In addition,

these methods usually describe the place of

origin and direction of spread of viral

lineages without formal tests of competing

spatial hypotheses. As a specific case in

point, although gravity models (in which

patterns of viral transmission reflect the

size of and distance between population

centers) have been applied successfully to

morbidity and mortality data from human

influenza A virus to describe its spread

across the United States [37], they have

yet to be interpreted within a phylogenetic

setting. A clear push for the future should

therefore be the development of coalescent

tools that integrate the analysis of spatial

and temporal dynamics within a single

framework, with a focus on those that

combine phylogenetic data and informa-

tion on the dynamics of the host contact

network of susceptible, infected, and

immune individuals.
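The gravity model mentioned above can be sketched in a few lines of Python. The version below is a generic textbook form, in which coupling between two population centers rises with their sizes and falls with the distance between them. The function name, exponent defaults, and example numbers are assumptions for illustration and are not the fitted values from [37].

```python
def gravity_coupling(
    pop_i: float,
    pop_j: float,
    distance: float,
    alpha: float = 1.0,   # hypothetical exponent on the first population
    beta: float = 1.0,    # hypothetical exponent on the second population
    gamma: float = 2.0,   # hypothetical distance-decay exponent
) -> float:
    """Generic gravity-model coupling: pop_i^alpha * pop_j^beta / distance^gamma."""
    if distance <= 0:
        raise ValueError("distance must be positive")
    return (pop_i ** alpha) * (pop_j ** beta) / (distance ** gamma)


# Made-up example: the same pair of cities at two different separations.
near = gravity_coupling(1_000, 2_000, distance=10)
far = gravity_coupling(1_000, 2_000, distance=20)
print(near, far)  # coupling drops as distance grows
```

In a phylogenetic setting, couplings of this kind could in principle serve as relative migration rates between sampling locations, which is the sort of integration the text argues has yet to be developed.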

Looking beyond the Consensus Sequence

The vast majority of studies of RNA

virus evolution undertaken to date, partic-

ularly of those viruses that cause acute

infections, rely on the analysis of consensus

sequences in which the nucleotide shown

for any given site is the most common

among all the genomes within a patient.

Although the use of consensus sequences is

adequate for many aspects of molecular

epidemiology, in which complete genomes

may suffice to determine even tight

transmission chains [20], there is growing

evidence that key evolutionary processes

occur beyond the consensus. In particular,

extensive intra-host gene sequencing has

revealed the existence of minor viral

subpopulations within individual hosts that

are not detected by consensus sequencing

and that are sometimes of great pheno-

typic importance [38,39]. Given the in-

trinsically high mutation rates of RNA

viruses, as well as the immense size of

intra-host populations, such extensive ge-

netic and phenotypic diversity is only to be

expected.
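The difference between a consensus sequence and the underlying intra-host population can be shown with a small Python sketch. It assumes a set of already aligned, equal-length sequences from one host; the function names, the 5% reporting threshold, and the toy sequences are all hypothetical.

```python
from collections import Counter
from typing import Dict, List, Sequence, Tuple


def site_frequencies(aligned: Sequence[str]) -> List[Dict[str, float]]:
    """Per-site nucleotide frequencies for equal-length aligned sequences."""
    if not aligned:
        raise ValueError("need at least one sequence")
    if len({len(seq) for seq in aligned}) != 1:
        raise ValueError("sequences must be aligned to the same length")
    freqs = []
    for column in zip(*aligned):              # iterate over alignment columns
        counts = Counter(column)
        freqs.append({base: c / len(column) for base, c in counts.items()})
    return freqs


def consensus(aligned: Sequence[str]) -> str:
    """Most common nucleotide at each site (the consensus sequence)."""
    return "".join(max(f, key=f.get) for f in site_frequencies(aligned))


def minor_variants(
    aligned: Sequence[str], min_freq: float = 0.05
) -> List[Tuple[int, str, float]]:
    """Non-consensus bases at or above min_freq, as (site, base, frequency)."""
    cons = consensus(aligned)
    found = []
    for pos, f in enumerate(site_frequencies(aligned)):
        for base, freq in f.items():
            if base != cons[pos] and freq >= min_freq:
                found.append((pos, base, freq))
    return found


# Toy intra-host sample: one of four genomes carries a T at site 1.
reads = ["ACGT", "ACGT", "ACGT", "ATGT"]
print(consensus(reads))        # the minority T is invisible here
print(minor_variants(reads))   # but it is recovered here
```

The consensus alone would report a single genotype for this host, whereas the variant list retains the subpopulation that may matter phenotypically.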

Figure 3. Fluctuating genetic diversity of influenza A virus. The figure shows a Bayesian skyline plot of changing levels of genetic diversity through time for the HA gene (165 sequences) of A/H3N2 virus sampled from the state of New York, US, during the period 2001–2003. The y-axes depict relative genetic diversity ($N_e\tau$, where $N_e$ is the effective population size and $\tau$ the generation time from infected host to infected host), which can be considered a measure of effective population size under strictly neutral evolution. Peaks of genetic diversity, reflecting the seasonal occurrence of influenza, are clearly visible. See [30] for a more detailed analysis. doi:10.1371/journal.pcbi.1000505.g003


A full description of the extent and

structure of intra-host viral genetic varia-

tion is critical for understanding evolu-

tionary dynamics, informing on such issues

as the frequency of mixed infection, and

hence the degree and extent of cross-

immunity; the frequency with which

antigenic variants are produced and

whether antigenic evolution can occur on

the time scale of individual infections; and

the size of the population bottleneck that

might accompany inter-host transmission.

As a case in point, it is commonly assumed

that viruses experience a severe population

bottleneck as they are transmitted to new

hosts, a phenomenon that greatly restricts

the power of natural selection to fix

advantageous mutations. Although this

assumption appears to be true in some

cases [40], whether this is a general

property of RNA viruses is unclear; the

evidence that multiple viral lineages can

be transmitted among hosts argues against

a narrow bottleneck in all cases [41]. To

more accurately determine the size of the

transmission bottleneck, analyses of intra-

host genetic diversity along known trans-

mission chains will be essential. On a

larger scale, it is unclear whether phylo-

dynamic patterns differ within and among

hosts, and whether any differences among

these scales of analysis are qualitative or

quantitative.

Intra-host sequence data are also essen-

tial for understanding the process of cross-

species virus transmission and emergence.

Key parameters in determining whether a

virus will adapt successfully to a new host

species include the extent of intra-host

genetic diversity, the fitness distribution of

the mutations produced, and how many of

these mutations will assist adaptation to

new host species [41–43]. No such data

are available for any acute RNA virus, so

testing models for viral emergence is

difficult. We believe, however, that under-

standing the mechanics of this adaptive

process is at least as important as surveying

for new emerging viruses.


Our discussion has highlighted a num-

ber of key challenges for a successful

phylodynamic research agenda. These

challenges comprise data, theory, and

methodological issues, and are briefly

summarized as follows. First, with respect

to data, it is clear that more genome

sequences must be acquired and with

increased temporal and spatial precision.

For example, wherever possible, GenBank

records should contain the exact day and

precise latitude and longitude of sampling.

In addition, it is essential that these

sequence data be linked with the relevant

metadata, such as the associated clinical

syndrome and (if applicable) measure of

antigenicity. Similarly, it is essential that

equivalent genome sequence data be

acquired from multiple time points within

individual hosts. Second, in terms of

theory, it is crucial that we fully integrate

patterns of viral evolution across multiple

epidemiological scales, from within hosts,

to local outbreaks, and on to global

pandemics. Although the coalescent is

hugely useful in this respect, it is essential

that its theoretical framework be extended

to incorporate models of population

growth and decline that most accurately

reflect the population dynamics of acute

RNA viruses, in particular the dynamics of

the susceptible ‘‘denominator’’ that fuels

epidemics. Sequencing of all available

samples from the UK 2001 FMD epidem-

ic would yield great scientific dividends

here. Third and finally, with respect to

methodology, new computational tools are

needed to rapidly make phylodynamic

inferences from genomic datasets that

may contain thousands of sequences and

that efficiently integrate genomic with

other forms of biological data. We hope

this review will stimulate research in all

these areas.

References

1. Novel Swine-Origin Influenza A (H1N1) Virus Investigation Team, Dawood FS, Jain S, Finelli L, Shaw MW, et al. (2009) Emergence of a novel swine-origin influenza A (H1N1) virus in humans. N Engl J Med 360: 2605–2615.

2. Lipkin WI (2009) Microbe hunting in the 21st century. Proc Natl Acad Sci U S A 106: 6–7.

3. Cox-Foster DL, Conlan S, Holmes EC, Palacios G, Evans JD, et al. (2007) A metagenomic survey of microbes in honey bee colony collapse disorder. Science 318: 283–287.

4. Finkbeiner SR, Allred AF, Tarr PI, Klein EJ, Kirkwood CD, et al. (2008) Metagenomic analysis of human diarrhea: viral detection and discovery. PLoS Pathog 4(2): e1000011. doi:10.1371/journal.ppat.1000011.

5. Zhang T, Breitbart M, Lee WH, Run JQ, Wei CL, et al. (2005) RNA viral community in human feces: Prevalence of plant pathogenic viruses. PLoS Biol 4(1): e3. doi:10.1371/journal.pbio.0040003.

6. Palacios G, Druce J, Du L, Tran T, Birch C, et al. (2008) A new arenavirus in a cluster of fatal transplant-associated diseases. N Engl J Med 358: 991–998.

7. Palmenberg AC, Spiro D, Kuzmickas R, Wang S, Djikeng A, et al. (2009) Sequencing and analyses of all known human rhinovirus genomes reveals structure and evolution. Science 324: 55–59.

8. Grenfell BT, Pybus OG, Gog JR, Wood JLN, Daly JM, et al. (2004) Unifying the epidemiological and evolutionary dynamics of pathogens. Science 303: 327–332.

9. Bjørnstad ON, Finkenstadt B, Grenfell BT (2002) Dynamics of measles epidemics. I. estimating scaling of transmission rates using a time series SIR model. Ecol Monogr 72: 169–184.

10. Grenfell BT, Bjornstad ON, Finkenstadt BF (2002) Dynamics of measles epidemics. II. Scaling noise, determinism and predictability with the time series SIR model. Ecol Monogr 72: 185–202.

11. Ghedin E, Sengamalay NA, Shumway M, Zaborsky J, Feldblyum T, et al. (2005) Large-scale sequencing of human influenza reveals the dynamic nature of viral genome evolution. Nature 437: 1162–1166.

12. Nelson MI, Holmes EC (2007) The evolution of epidemic influenza. Nat Rev Genet 8: 196–205.

13. Ghedin E, Fitch A, Boyne A, DePasse J, Bera J, et al. (2009) Mixed infection and the genesis of influenza diversity. J Virol 83: 8832–8841.

14. Nelson MI, Simonsen L, Viboud C, Miller MA, Taylor J, et al. (2006) Stochastic processes are key determinants of the short-term evolution of influenza A virus. PLoS Pathog 2: e125.

15. Nelson MI, Edelman L, Spiro DJ, Boyne AR, Bera J, et al. (2008) Molecular epidemiology of A/H3N2 and A/H1N1 influenza virus during a single epidemic season in the United States. PLoS Pathog 4(8): e1000133. doi:10.1371/journal.ppat.1000133.

16. Gonzalez MC, Hidalgo CA, Barabasi AL (2008) Understanding individual human mobility patterns. Nature 453: 779–782.

17. Smith DJ, Lapedes AS, de Jong JC, Bestebroer TM, Rimmelzwaan GF, et al. (2004) Mapping the antigenic and genetic evolution of influenza virus. Science 305: 371–376.

18. Cottam EM, Haydon DT, Paton DJ, Gloster J, Wilesmith JW, et al. (2006) Molecular epidemiology of the foot-and-mouth disease virus outbreak in the United Kingdom in 2001. J Virol 80: 11274–11282.

19. Keeling MJ, Woolhouse MEJ, Shaw DJ, Matthews L, Chase-Topping M, et al. (2001) Dynamics of the 2001 UK foot and mouth epidemic: stochastic dispersal in a heterogeneous landscape. Science 294: 813–817.

20. Cottam EM, Wadsworth J, Shaw AE, Rowlands RJ, Goatley L, et al. (2008) Transmission pathways of foot-and-mouth disease virus in the United Kingdom in 2007. PLoS Pathog 4(4): e1000050. doi:10.1371/journal.ppat.1000050.

21. Hoelzer K, Shackelton LA, Holmes EC, Parrish CR (2008) Within-host genetic diversity of endemic and emerging parvoviruses of cats and dogs. J Virol 82: 11096–11105.

22. Holmes EC (2007) Viral evolution in the genomic age. PLoS Biol 5(10): e278. doi:10.1371/journal.pbio.0050278.

23. Fitch WM, Leiter JME, Li X, Palese P (1991) Positive Darwinian evolution in human influenza A viruses. Proc Natl Acad Sci U S A 88: 4270–4274.

24. Webster RG, Laver WG, Air GM, Schild GC (1982) Molecular mechanisms of variation in influenza viruses. Nature 296: 115–121.

25. Holmes EC, Ghedin E, Miller N, Taylor J, Bao Y, et al. (2005) Whole genome analysis of human influenza A virus reveals multiple persistent lineages and reassortment among recent H3N2 viruses. PLoS Biol 3(9): e300. doi:10.1371/

26. Ferguson NM, Galvani AP, Bush RM (2003) Ecological and immunological determinants of influenza evolution. Nature 422: 428–433.

27. Koelle K, Cobey S, Grenfell B, Pascual M (2006) Epochal evolution shapes the phylodynamics of interpandemic influenza A (H3N2) in humans. Science 314: 1898–1903.

28. Recker M, Pybus OG, Nee S, Gupta S (2007) The generation of influenza outbreaks by a network of host immune responses against a limited set of antigenic types. Proc Natl Acad Sci U S A 104: 7711–7716.

29. Russell CA, Jones TC, Barr IG, Cox NJ, Garten RJ, et al. (2008) The global circulation of seasonal influenza A (H3N2) viruses. Science 320: 340–346.

30. Rambaut A, Pybus OG, Nelson MI, Viboud C, Taubenberger JK, et al. (2008) The genomic and epidemiological dynamics of human influenza A virus. Nature 453: 615–619.

31. Suchard MA, Rambaut A (2009) Many-core algorithms for statistical phylogenetics. Bioinformatics 25: 1370–1376.

32. Drummond AJ, Rambaut A, Shapiro B, Pybus OG (2005) Bayesian coalescent inference of past population dynamics from molecular sequences. Mol Biol Evol 22: 1185–1192.

33. Drummond AJ, Pybus OG, Rambaut A, Forsberg R, Rodrigo AG (2003) Measurably evolving populations. Trends Ecol Evol 18: 481–488.

34. Minin VN, Bloomquist EW, Suchard MA (2008) Smooth skyride through a rough skyline: Bayesian coalescent-based inference of population dynamics. Mol Biol Evol 25: 1459–1471.

35. Holmes EC (2008) The evolutionary history and phylogeography of human viruses. Annu Rev Microbiol 62: 307–328.

36. Wallace RG, Hodac H, Lathrop RH, Fitch WM (2007) A statistical phylogeography of influenza A H5N1. Proc Natl Acad Sci U S A 104: 4473–4478.

37. Viboud C, Bjornstad ON, Smith DL, Simonsen L, Miller MA, et al. (2006) Synchrony, waves, and spatial hierarchies in the spread of influenza. Science 312: 447–451.

38. Aaskov J, Buzacott K, Thu HM, Lowry K, Holmes EC (2006) Long-term transmission of defective RNA viruses in humans and Aedes mosquitoes. Science 311: 236–238.

39. Jerzak G, Bernard KA, Kramer LD, Ebel GD (2005) Genetic variation in West Nile virus from naturally infected mosquitoes and birds suggests quasispecies structure and strong purifying selection. J Gen Virol 86: 2175–2183.

40. Keele BF, Giorgi EE, Salazar-Gonzalez JF, Decker JM, Pham KT, et al. (2008) Identification and characterization of transmitted and early founder virus envelopes in primary HIV-1 infection. Proc Natl Acad Sci U S A 105: 7552–7557.

41. Holmes EC (2009) The evolution and emergence of RNA viruses. Oxford Series in Ecology and Evolution. Harvey PH, May RM, eds. Oxford: Oxford University Press.

42. Kuiken T, Holmes EC, McCauley J, Rimmelzwaan GF, Williams CS, et al. (2006) Host species barriers to influenza virus infections. Science 312: 394–397.

43. Parrish CR, Holmes EC, Morens DM, Park EC, Burke DS, et al. (2008) Cross-species viral transmission and the emergence of new epidemic diseases. Microbiol Mol Biol Rev 72: 457–470.


Perspective

Computational Resources in Infectious Disease: Limitations and Challenges

Eva C. Berglund, Bjorn Nystedt, Siv G. E. Andersson*

Department of Molecular Evolution, Uppsala University, Uppsala, Sweden

Infectious diseases continue to be a

major cause of death in the human

population, with tuberculosis and malaria

affecting 500 million people and causing

1–2 million deaths annually [1]. The

situation is aggravated by the increasing

prevalence of antibiotic-resistant bacteria

and the risk that terrorists might use

infectious organisms to attack target

populations. During the past decade, we

have also witnessed the emergence of

many new pathogens not previously de-

tected in humans, such as the avian

influenza virus, severe acute respiratory

syndrome (SARS), and Ebola. The ap-

pearance of these novel agents and the

reemergence of previously eradicated

pathogens may be associated with the

growing human population, flooding, and

other environmental perturbations; global

travel and migration; and animal trade

and domestic animal husbandry practices.

Simultaneously, we have seen an explosion

of genome sequence data. Sequencing is

now the method of choice for character-

ization of new disease agents, as exempli-

fied by the rapid sequencing of the

genome of the SARS virus, which was

made available within a month of identi-

fication of the virus [2,3]. Like SARS,

most newly emerging disease agents orig-

inate in animals and have been transmit-

ted to humans recently at food markets, by

insect bites, or through hunting [1].

The new sequencing technologies enable

small academic research groups to create

huge genome datasets at low cost. As a

result, scientists with expertise in other

fields of research, such as clinical microbi-

ology and ecology, are just beginning to

face the challenge of handling, comparing,

and extracting useful information from

millions of sequences. Here, we discuss

the limitations of publicly available resourc-

es in the field of genomics of emerging

bacterial pathogens, emphasizing areas

where increased efforts in computational

biology are urgently needed.

Genome Evolution in Emerging Bacterial Pathogens

A natural ecosystem of a bacterial

population that incidentally infects hu-

mans provides a high-risk microenviron-

ment for the establishment of this patho-

gen in the human population (Box 1;

Figure 1). Comparative studies of the

genomes of well-recognized human path-

ogens, incidental pathogens, and their

closely related nonpathogenic species [4–

11] are valuable for efforts to predict the

propensity for host shifts and their conse-

quences for human health.

A successful infectious bacterium,

whether it causes disease or not, must

possess mechanisms for interacting with

the host and evading the host immune

system. The key players in these processes

are often proteins on the surface of the

bacterium, including secretion systems

that release effector proteins into the

surrounding medium or directly into the

host cells. These host-interaction factors

are often members of large protein families

with many paralogs and often encoded by

long genes with internal repeats. Fluctua-

tions in gene length and copy number

occur through homologous recombination

over these repeats [12–15].

Adding to the variability of the host-

interaction genes is that they are often

located on mobile elements such as

plasmids or bacteriophages, which are

easily gained and lost. Rapid sequence

evolution of these genes may be driven by

selection, because it often increases bacte-

rial fitness by escaping the host immune

system, creating a diverse set of binding

structures or tuning effector proteins to a

new host. As a consequence, host-interac-

tion genes typically show extreme plastic-

ity in both sequence and copy number,

partly because they are under strong

evolutionary pressure and partly because

they are mechanistically prone to drastic

mutational changes. Understanding these

complex dynamics poses major challenges

in many areas of computational biology,

ranging from sequence assembly to epi-

demic risk assessment.

Complete Genome Assembly Remains Difficult

Despite the ease with which shotgun

sequence data can be generated, assem-

bling these data into a single genomic

contig remains labor-intensive and time-

consuming. This obstacle is primarily due

to the difficulty of assembling repeated

sequences. Hence, resequencing ap-

proaches—where short sequence reads

are directly mapped to an already com-

pleted reference genome—have become

increasingly popular. Resequencing read-

ily detects SNPs (single nucleotide poly-

morphisms) in single-copy genes, but

performs very poorly in repeated and

highly divergent regions of the genome.

Genes involved in infection processes, with

their complex repeat structures, high

duplication frequency, and rapid evolu-

tion, are thus often left unresolved.
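The difficulty with repeats can be made concrete with a small sketch. Assuming error-free reads of a fixed length and a toy reference string (both hypothetical simplifications), the Python below lists every read-length substring that occurs at more than one position in the reference; a read with such a sequence cannot be placed uniquely by mapping alone.

```python
from collections import defaultdict


def ambiguous_reads(reference: str, read_length: int) -> dict[str, list[int]]:
    """Return read-length substrings that occur at more than one position."""
    positions: dict[str, list[int]] = defaultdict(list)
    for start in range(len(reference) - read_length + 1):
        positions[reference[start:start + read_length]].append(start)
    return {read: starts for read, starts in positions.items() if len(starts) > 1}


# Toy reference containing two copies of the 6-base repeat GGATCC.
reference = "AAACGGATCCTTTGGATCCGAC"

print(ambiguous_reads(reference, 6))  # reads no longer than the repeat
print(ambiguous_reads(reference, 8))  # reads that span into the flanks
```

In this toy case the 6-base reads are ambiguous, whereas 8-base reads reach into the unique flanking sequence and map to a single position, which is the same reasoning that makes longer reads attractive for repeated virulence genes.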

Citation: Berglund EC, Nystedt B, Andersson SGE (2009) Computational Resources in Infectious Disease: Limitations and Challenges. PLoS Comput Biol 5(10): e1000481. doi:10.1371/journal.pcbi.1000481

Copyright: © 2009 Berglund et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Funding: The authors are supported by grants to SGEA from the European Union (QLK3-CT2000-01079, EUWOL and EuroPathogenomics), the Swedish Research Council (http://www.vr.se/), the Goran Gustafsson Foundation (http://www.gustafssonsstiftelse.se/), the Swedish Foundation for Strategic Research (http://www.stratresearch.se/) and the Knut and Alice Wallenberg Foundation (http://www.wallenberg.com/kaw/). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Perhaps the most pressing need is not for improved assembly algorithms but for

better ways to integrate data from diverse

sources, including shotgun sequencing,

paired-end sequencing, PCR experiments,

fosmid and BAC (bacterial artificial chro-

mosome) clone sequencing, physical map-

ping, and restriction fragment data. A

program integrating these different data

should not only accurately assemble as

much of the genome as possible, but also

assist the researcher in designing addition-

al experiments to resolve the remaining

regions. Given the rapidly increasing

number of incomplete genome sequences

available, it would also be valuable to have a

quality-scoring standard that not only

provides quality scores at individual sites

under the assumption that the assembly is

correct, but also reflects the uncertainty of

the actual assembly over specific regions.
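For readers unfamiliar with per-site quality scores, the widely used Phred convention relates a score Q to an error probability p by Q = -10 log10(p). The text above does not name a particular scheme, so treating Phred scores as the per-site baseline is an assumption made here for illustration. The sketch below converts between the two and shows how per-site scores combine over a region; note that this combination still assumes the assembly itself is correct, which is exactly the limitation described above.

```python
import math


def error_probability_to_quality(p_error: float) -> float:
    """Phred-style quality score for a per-site error probability."""
    if not 0 < p_error <= 1:
        raise ValueError("error probability must be in (0, 1]")
    return -10 * math.log10(p_error)


def quality_to_error_probability(quality: float) -> float:
    """Per-site error probability implied by a Phred-style quality score."""
    return 10 ** (-quality / 10)


def region_error_free_probability(qualities: list[float]) -> float:
    """Probability that every site in a region is correct, assuming sites
    are independent and the underlying assembly is right."""
    prob = 1.0
    for q in qualities:
        prob *= 1.0 - quality_to_error_probability(q)
    return prob


print(quality_to_error_probability(30))          # one error in a thousand
print(region_error_free_probability([20, 20]))   # two sites at Q20
```

A region-level score of the kind proposed would need an additional term for the chance that the reads were joined incorrectly in the first place, which per-site scores of this form do not capture.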

While assembly software development is

struggling to keep up, the sequencing

revolution shows no signs of slowing down.

Perhaps the most important new develop-

ment is real-time single molecule detection

platforms with ultra-long sequencing reads

[16]. Within the next few years, we can

expect to see read lengths of 20 kb, which

will help resolve many of the complex

genomic features underlying host adapta-

tion and pathogenicity.

Functional Annotation of Virulence and Host-Interaction Genes

Annotation is the process of assigning

meaningful information, such as the loca-

tion or function of genes, to raw sequence

data. Reliable and consistent annotations

are thus fundamental for analysis and

interpretation of genome data. Since

annotation of new genomes is usually

based on homology searches (e.g., BLAST

hits), errors and inconsistencies tend to

propagate. One way to reduce error

propagation is to functionally annotate a

set of reference genomes based on exper-

imentally determined information. Anno-

tation of new genomes could then start

with searches in this database, which

would allow high-quality annotation of

all well-conserved genes. The Gene On-

tology’s Reference Genome Project [17]

and BioCyc [18] represent developments

in this direction. However, the number of

species included is still limited, and a

broader taxonomic breadth of bacteria,

with one reference species per genus,

would be desirable.

Box 1. Genomic Changes Associated with Host Shifts

The movement of a bacterial species from abundant animal hosts such as rodents, which are a major reservoir of infectious disease agents, to the relatively small human population is typically associated with decreased genome size and loss/alteration of the mobile gene pool [4,32–34]. One illustrative example can be found in the genus Mycobacterium, which contains several severe human pathogens, including the agents of tuberculosis (M. tuberculosis) and leprosy (M. leprae) and also the recently emerged M. ulcerans. M. ulcerans causes severe skin lesions; this disease, known as Buruli ulcer, is becoming a serious public health problem in West and Central Africa as well as in other parts of the tropics.

Like many other recently emerged human pathogens [4,34–36], M. ulcerans appears to have switched from a generalist to a specialist lifestyle, starting with a progenitor very similar to the aquatic M. marinum. While M. marinum has been found both free-living and as an intracellular pathogen of fish and other species, M. ulcerans is thought to have a restricted host range and to be transmitted by insects (Figure 1A). The host switch was likely initiated by the uptake of a virulence plasmid, and proceeded through a series of ‘‘bottleneck events’’ (severe reductions in population size due to environmental circumstances). This process resulted in loss of about 1 Mb of the genome, major genomic rearrangements, extensive proliferation of insertion sequences, and a massive increase in number of pseudogenes [37–39]. In particular, there was a massive reduction in the size of the two major surface protein gene families (a decrease of more than 250 genes compared to M. marinum). This gene loss is thought to have been crucial for the organism to evade the human immune system, by limiting the number of antigens on the bacterial surface [40].

The uptake of a new virulence plasmid producing an immunosuppressive substance called mycolactone is also thought to have played a key role in the evolution and host switch of M. ulcerans. This plasmid consists mainly of three unusually large and internally repeated genes (over 100 kb in total), and thus illustrates the concept of long and repeated virulence genes (Figure 1B) [41]. These genes appear to evolve rapidly by recombination and gene conversion, and new variants can be directly connected to variations in the chemical structure of mycolactone [42], which might be important for host specificity, immunosuppressive potency, and drug design.

Figure 1. Evolution of a new infectious disease agent. (A) Recent evolution of the specialist human pathogen M. ulcerans from the aquatic generalist pathogen M. marinum. (B) Arrangement of the three M. ulcerans plasmid–encoded repeated virulence genes (arrows from left to right: mlsA1 [51 kb], mlsA2 [7.6 kb], mlsB [43 kb]) coding for three polyketide synthases. The loading modules (labeled LM) and the 16 repeated modules depicted in purple (labeled 1–9 for mlsA1 and mlsA2, and 1–7 for mlsB) enable the serial buildup of the backbone carbon chain of the complex immunosuppressive substance mycolactone. doi:10.1371/journal.pcbi.1000481.g001

Functional annotation of pathogen genomes is particularly important, because genes involved in host-interaction processes are among the most difficult to annotate. One problem is that different research groups have often studied homologous genes in various species and given

them different names that are not always

logical or reflective of similarities in

sequence and function. A manually curat-

ed database of protein families involved in

host interactions that incorporates cur-

rently used gene names, sequence motifs,

gene functions, and experimental results

would substantially improve the situation.

Much improved guidelines for how to

annotate genes in large families with

different combinations of sequence motifs

would also be valuable.

Comparative studies of very closely

related genomes can help to distinguish

functional genes from spurious ORFs

(open reading frames) and pseudogenes,

and thereby improve gene prediction. To

this end, a tool to visualize all the fine

details in comparisons of multiple closely

related genomes is crucial. Such a tool was

developed recently for genomes with a

conserved order of genes, and it has been

applied to analyze sequence deterioration

in the typhus pathogen Rickettsia prowazekii

and its closest relatives [10]. Future

studies, however, will require software that

can also handle multiple genome compar-

isons from highly rearranged genomes.

Another limitation of currently available

visualization tools is that, although multi-

ple genomes can be included, only serial

pairwise comparisons can be made. This

limitation can be overcome by visualiza-

tion of genome comparisons in ‘‘three

dimensions’’ (3D visualization), enabling

all-against-all comparisons to be viewed

simultaneously (Figure 2). Just as 3D

visualizations revolutionized the field of

structural biology over the past decades,

such developments might well revolution-

ize the field of comparative genomics in

the years to come.

Molecular Diagnostics and Vaccine Development

Classification of infectious disease

agents is typically based on multilocus

sequence-typing (MLST) systems, by

which new bacterial isolates are analyzed

by sequencing five to seven predefined

core genes [19]. With the increasing

number of complete genome sequences

of pathogenic and nonpathogenic strains,

it will be possible to concatenate a much

larger number of conserved genes and use

this dataset to infer a tree to represent the

underlying population structure [20].

However, while genotyping systems based

on conserved genes can be useful for

monitoring the spread of strains, they do

not necessarily correlate with genomotypes

defined by virulence properties [21]. This

is because genes contributing to virulence

are prone to horizontal gene transfer, gene

duplications, and gene loss. Further com-

plicating the development of molecular

diagnostic methods is that homologs of

virulence genes are often present also in

nonpathogenic species, making it difficult

to recognize pathogens solely from the

gene content. Hence, classification and

risk assessments for the emergence of

novel infectious strains ultimately should

be based on a combination of strain

typing, gene content, and identification

of virulence genes.
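The logic of MLST-style typing can be sketched in a few lines. The example below is a toy illustration under stated assumptions: it uses two hypothetical loci rather than the five to seven of a real scheme, and it numbers alleles and sequence types locally in order of first appearance, whereas real MLST schemes assign these numbers from curated reference databases.

```python
def allele_profiles(isolates: dict[str, dict[str, str]]) -> dict[str, tuple[int, ...]]:
    """Give each distinct sequence at each locus a local allele number and
    return the allele profile (one number per locus) for every isolate."""
    loci = sorted(next(iter(isolates.values())))
    known: dict[str, dict[str, int]] = {locus: {} for locus in loci}
    profiles = {}
    for name, genes in isolates.items():
        profile = []
        for locus in loci:
            alleles = known[locus]
            # A sequence not seen before at this locus gets the next number.
            alleles.setdefault(genes[locus], len(alleles) + 1)
            profile.append(alleles[genes[locus]])
        profiles[name] = tuple(profile)
    return profiles


def sequence_types(profiles: dict[str, tuple[int, ...]]) -> dict[str, int]:
    """Isolates with identical allele profiles share a sequence type."""
    st_numbers: dict[tuple[int, ...], int] = {}
    return {name: st_numbers.setdefault(profile, len(st_numbers) + 1)
            for name, profile in profiles.items()}


# Hypothetical isolates typed at two made-up loci.
isolates = {
    "isolate_1": {"locus_a": "ACGT", "locus_b": "TTGA"},
    "isolate_2": {"locus_a": "ACGT", "locus_b": "TTGA"},
    "isolate_3": {"locus_a": "ACGA", "locus_b": "TTGA"},
}
profiles = allele_profiles(isolates)
print(profiles)
print(sequence_types(profiles))
```

Because the typing depends only on the chosen core loci, two isolates with the same sequence type could still differ in plasmid-borne or phage-borne virulence genes, which is the mismatch between genotype and virulence discussed above.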

Understanding the evolutionary dynam-

ics of host-interaction genes in terms of

both mechanisms and selective forces is also

important in order to design drugs that will

be effective in the long term. What good would a new antibiotic or vaccine be if the intended target protein evolves beyond recognition before the drug reaches the market? One solution

to this problem is to characterize the

selective pressures on candidate vaccine

targets, and then exclude genes or parts of

genes based on their evolutionary dynamics

[22]. However, current tools for measuring

positive or diversifying selection are severe-

ly limited in that they assume that single-

base mutations are the only underlying

mechanism of sequence change. For reli-

able analyses of genes with a complex

evolution, a new generation of evolutionary

tests needs to be developed that acknowl-

edge the importance of mutation by

recombination (Figure 3) and multiple-base

insertion/deletion events as well as point

mutations. With the expected huge increase

of complete and draft genomes for many

strains of a species, there is a need for

programs capable of screening a large set of

alignments for recombination signals, with

novel statistical and visualization tools to

analyze the full set of results.
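A very simple screen of this kind, in the spirit of the three-strain comparison plotted in Figure 3, is to record at each informative site which of three aligned sequences differs from the other two. The sketch below is a hedged illustration with made-up sequences and a hypothetical function name; it is a descriptive summary, not a statistical test for recombination.

```python
def odd_one_out_sites(seq_a: str, seq_b: str, seq_c: str) -> list[tuple[int, str]]:
    """For three aligned sequences, list (position, label) for every site
    where exactly two sequences agree and the third differs."""
    if not len(seq_a) == len(seq_b) == len(seq_c):
        raise ValueError("sequences must be aligned to the same length")
    sites = []
    for pos, (a, b, c) in enumerate(zip(seq_a, seq_b, seq_c)):
        if a == b != c:
            sites.append((pos, "c"))
        elif a == c != b:
            sites.append((pos, "b"))
        elif b == c != a:
            sites.append((pos, "a"))
        # Sites where all three agree or all three differ are uninformative.
    return sites


# Toy mosaic: sequence c matches a in its first half and b in its second half.
seq_a = "AAAAAAAA"
seq_b = "TTTTTTTT"
seq_c = "AAAATTTT"
print(odd_one_out_sites(seq_a, seq_b, seq_c))
```

A change in which sequence is the odd one out along the alignment, as between positions 3 and 4 here, is the kind of pattern that would prompt a closer look for a recombination breakpoint.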

Predicting Risk for Disease Outbreaks

The next challenge is to place the

genomic data within its ecological context,

which has led to a new research field

called molecular ecosystems biology [23].

This field focuses on dissecting the many

complex molecular interactions between

the bacterial population and its environ-

ment. This environment can be highly

specialized, as in the case of bacteria adapted to a single host species, or very complex, as for soil-, water-, or airborne bacteria. The behavior of a pathogen thus depends on many ecological factors, such as seasonal fluctuations in temperature and nutrient availability, species richness, and host population density.

Figure 2. New visualization tools for genome comparisons. Comparison of the genes in multiple genomes can be represented visually by using a 3D program. Each arrow represents one gene, and the grey shading between genes indicates homology. Red indicates genes that are unique to one genome. The difference between this approach and existing programs is that all genomes can be compared to each other simultaneously, rather than by pairwise comparisons. With multiple genomes, and with zooming, flipping, and selecting options, even this rudimentary 3D program would be of great help in genome analysis. doi:10.1371/journal.pcbi.1000481.g002

Figure 3. New methods for analyzing evolution by recombination. Improved models and visualization tools are needed to analyze recombination. Virulence genes, here exemplified by the acfD gene in the Vibrio cholerae pathogenicity island [43], often display complex recombination patterns. The aligned acfD genes (arrows) from three V. cholerae strains (M2140, M1567, and M1118) are plotted separately; a line connects each site where the nucleotides in two strains differ from the third strain. Noninformative sites were removed before plotting. doi:10.1371/journal.pcbi.1000481.g003

To be able to integrate and evaluate these data, new software is needed. Imagine a program that can read sequence data from hundreds of bacterial isolates, infer the underlying population structure, and combine it with gene expression data,

ecological factors, and clinical data such as

the number of disease cases reported in

various geographic areas. It should be

possible to visualize global patterns in the

data, such as abundance of particular

strains and sequence variants and migra-

tion of infected hosts and vectors over

geographic areas and seasons. Changes in

taxonomic profiles, virulence genes, and

metabolic pathways should be visualized in

real time. This program could also be

linked to a Web site where researchers can

post daily updates of clinical cases, spread

of virulence genes, appearance of new

strains and new mutations, migration

patterns, and news about genome and

functional data. This site would be useful

for estimating the risk for new epidemics to

emerge in the human population.

Analyzing Microbial Communities

Analyzing the behavior of complete

pathogen ecosystems is an immediate

priority. Random shotgun sequencing

projects of bacterial DNA from diverse

environments count in the hundreds, and

the amount of metagenomic sequence

data already exceeds the available geno-

mic sequences in public databases [24,25]

(http://www.genomesonline.org). Several

multinational projects on the human

microbiome have been launched, which,

together with studies of 16S rRNA ampli-

cons, have provided new insights into the

human intestinal [26–28], oral [29], and

vaginal flora [30]. Comparison of the

microbial flora in healthy and diseased

people can be a powerful diagnostic tool

and enable the discovery of both emerging

pathogens and novel virulence factors,

such as antibiotic resistance plasmids. An

important technical development that

holds great promise for associating the

functional adaptation of the community as

a whole with the metabolic pathways

present in the individual strains is single-

cell isolation followed by whole-genome

amplification. Community sequencing also

provides an excellent tool for epidemic

surveillance of pathogenic strains and

virulence genes in environments from

which they may further spread to humans.

The massive amount of data created by

microbial community sequencing poses

new challenges and will require extensive

bioinformatics development [24]. Al-

though the advent of longer sequence

reads will have a large impact on the

assembly of community data, the presence

of many closely related species or strains in

the same sample, along with horizontal

gene transfer, will remain a daunting

challenge. A whole new field of compar-

ative algorithms needs to be developed, for

example to provide meaningful compari-

sons between taxonomic profiles. New

sequence databases will be essential for

rapid access to both raw and processed

data. Also, for fair comparisons between

datasets, a certain level of standardization

of sampling, experimental work, and

statistics will be crucial [31]. Bioinfor-

matics skills combined with a deep biolog-

ical understanding of the system under

study are needed to use these complex

sequence datasets to answer such questions

as: Who is there? What are they doing?

How are they communicating? And what

is the risk for disease?
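One existing way to compare two taxonomic profiles is the Bray-Curtis dissimilarity, a standard ecological measure that is 0 for identical profiles and 1 for profiles with no taxa in common. The text does not prescribe a particular measure, so its use here is an assumption for illustration, and the taxon names and counts below are invented.

```python
def bray_curtis(profile_a: dict[str, float], profile_b: dict[str, float]) -> float:
    """Bray-Curtis dissimilarity between two taxon-abundance profiles."""
    total = sum(profile_a.values()) + sum(profile_b.values())
    if total == 0:
        raise ValueError("at least one profile must contain counts")
    taxa = set(profile_a) | set(profile_b)
    # Abundance shared by both samples, summed over all taxa.
    shared = sum(min(profile_a.get(t, 0), profile_b.get(t, 0)) for t in taxa)
    return 1 - 2 * shared / total


# Hypothetical read counts per taxon in two samples.
healthy = {"taxon_A": 60, "taxon_B": 40}
diseased = {"taxon_A": 30, "taxon_B": 40, "taxon_C": 30}

print(f"{bray_curtis(healthy, diseased):.2f}")
```

A single number like this ignores how closely related the taxa are and is sensitive to sequencing depth, which is part of why standardized sampling and statistics matter for fair comparisons between datasets.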


The priority goals for the next decade

within the area of emerging infectious

diseases should be the study of complete

pathogen ecosystems and the dissection of

host–pathogen interaction communication

pathways directly in the natural environ-

ment. To achieve these goals, investments

in user-friendly software and improved

visualization tools, along with excellent

expertise in computational biology, will be

of utmost importance. Unfortunately, too

few undergraduate students in clinical

microbiology and microbial ecology are

trained in computational skills, and nation-

al governments and universities need to

take action to address this deficiency to

meet the demands of the near future. Often

neglected by public and private funding is

the monumental need for stable and

standardized infrastructure at all levels,

from the individual research group to the

intergovernmental organization. Only with

proper investments in everything from

hardware and personnel for data handling,

to the development of sensible and stan-

dardized file formats, can we ensure that

the current developments can be fully

exploited to more efficiently battle emerg-

ing infectious diseases.

Currently, the slow transition from a

scientific in-house program to the distribu-

tion of a stable and efficient software

package is a major bottleneck in scientific

knowledge sharing, preventing efficient

progress in all areas of computational

biology. Efforts to design, share, and

improve software must receive increased

funding, practical support, and, not

least, scientific recognition. Since microorganisms

do not follow national borders, such

initiatives are probably best started from

intergovernmental organizations with close

links to national centers with established

communication networks to distribute

know-how and advances further within

the country, and vice versa, to facilitate

the spread of new concepts and software to

all members of the organization. Eventual-

ly, many of these initiatives may become

community-driven. The example of Wiki-

pedia, with more than 10 million entries

written since the launch in 2001 and a

current growth rate of thousands of articles

daily (http://www.wikipedia.org), demon-

strates the power of user-contributed ini-

tiatives.

Acknowledgments

We thank Eddie Persson for graphical work.

References

1. Rappuoli R (2004) From Pasteur to genomics:

Progress and challenges in infectious diseases. Nat

Med 10: 1177–1185.

2. Marra MA, Jones SJ, Astell CR, Holt RA,

Brooks-Wilson A, et al. (2003) The genome

sequence of the SARS-associated coronavirus.

Science 300: 1399–1404.

3. Rota PA, Oberste MS, Monroe SS, Nix WA,

Campagnoli R, et al. (2003) Characterization of a

novel coronavirus associated with severe acute

respiratory syndrome. Science 300: 1394–1399.

4. Parkhill J, Wren BW, Thomson NR, Titball RW, Holden MT, et al. (2001) Genome sequence of

Yersinia pestis, the causative agent of plague.

Nature 413: 523–527.

5. Welch RA, Burland V, Plunkett G 3rd,

Redford P, Roesch P, et al. (2002) Extensive

mosaic structure revealed by the complete

genome sequence of uropathogenic Escherichia

coli. Proc Natl Acad Sci U S A 99: 17020–17024.

6. Dziejman M, Balon E, Boyd D, Fraser CM,

Heidelberg JF, et al. (2002) Comparative genomic

analysis of Vibrio cholerae: genes that correlate with

cholera endemic and pandemic disease. Proc Natl

Acad Sci U S A 99: 1556–1561.

7. Wolfgang MC, Kulasekara BR, Liang X, Boyd D,

Wu K, et al. (2003) Conservation of genome

content and virulence determinants among clin-

ical and environmental isolates of Pseudomonas

aeruginosa. Proc Natl Acad Sci U S A 100:

8484–8489.

8. Seshadri R, Myers GS, Tettelin H, Eisen JA,

Heidelberg JF, et al. (2004) Comparison of the

genome of the oral pathogen Treponema denticola

with other spirochete genomes. Proc Natl Acad

Sci U S A 101: 5646–5651.

9. Gill SR, Fouts DE, Archer GL, Mongodin EF,

Deboy RT, et al. (2005) Insights on evolution of

virulence and resistance from the complete

genome analysis of an early methicillin-resistant

Staphylococcus aureus strain and a biofilm-producing

methicillin-resistant Staphylococcus epidermidis strain.

J Bacteriol 187: 2426–2438.

10. Fuxelius HH, Darby AC, Cho NH, Andersson SG

(2008) Visualization of pseudogenes in intracellu-

lar bacteria reveals the different tracks to gene

destruction. Genome Biol 9: R42.

11. Berglund EC, Frank AC, Calteau A, Vinnere

Pettersson O, Granberg F, et al. (2009) Run-

off replication of host-adaptability genes is

associated with gene transfer agents in the

genome of mouse-infecting Bartonella grahamii.

PLoS Genet 5: e1000546. doi:10.1371/journal.

pgen.1000546.


12. Deitsch KW, Moxon ER, Wellems TE (1997)

Shared themes of antigenic variation and viru-

lence in bacterial, protozoal, and fungal infec-

tions. Microbiol Mol Biol Rev 61: 281–293.

13. Brayton KA, Knowles DP, McGuire TC,

Palmer GH (2001) Efficient use of a small

genome to generate antigenic diversity in tick-

borne ehrlichial pathogens. Proc Natl Acad

Sci U S A 98: 4130–4135.

14. Nystedt B, Frank AC, Thollesson M,

Andersson SG (2008) Diversifying selection and

concerted evolution of a type IV secretion system

in Bartonella. Mol Biol Evol 25: 287–300.

15. Bilek N, Ison CA, Spratt BG (2009) Relative

contributions of recombination and mutation to

the diversification of the opa gene repertoire of

Neisseria gonorrhoeae. J Bacteriol 191: 1878–1890.

16. Gupta PK (2008) Single-molecule DNA sequenc-

ing technologies for future genomics research.

Trends Biotechnol 26: 602–611.

17. The Gene Ontology’s Reference Genome Pro-

ject: A unified framework for functional annota-

tion across species. PLoS Comput Biol 5:

e1000431.

18. Karp PD, Ouzounis CA, Moore-Kochlacs C,

Goldovsky L, Kaipa P, et al. (2005) Expansion of

the BioCyc collection of pathway/genome data-

bases to 160 genomes. Nucleic Acids Res 33:

6083–6089.

19. Maiden MC, Bygraves JA, Feil E, Morelli G,

Russell JE, et al. (1998) Multilocus sequence

typing: A portable approach to the identification

of clones within populations of pathogenic

microorganisms. Proc Natl Acad Sci U S A 95:

3140–3145.

20. Ciccarelli FD, Doerks T, von Mering C,

Creevey CJ, Snel B, et al. (2006) Toward

automatic reconstruction of a highly resolved

tree of life. Science 311: 1283–1287.

21. Turner KM, Feil EJ (2007) The secret life of the

multilocus sequence type. Int J Antimicrob

Agents 29: 129–135.

22. Bambini S, Rappuoli R (2009) The use of

genomics in microbial vaccine development.

Drug Discov Today 14: 252–260.

23. Raes J, Bork P (2008) Molecular eco-systems biology: Towards an understanding of community function. Nat Rev Microbiol 6: 693–699.

24. Kunin V, Copeland A, Lapidus A, Mavromatis K, Hugenholtz P (2008) A bioinformatician’s guide to metagenomics. Microbiol Mol Biol Rev 72: 557–578.

25. Liolios K, Mavromatis K, Tavernarakis N, Kyrpides NC (2008) The Genomes On Line Database (GOLD) in 2007: Status of genomic and metagenomic projects and their associated metadata. Nucleic Acids Res 36: D475–479.

26. Dethlefsen L, Huse S, Sogin ML, Relman DA (2008) The pervasive effects of an antibiotic on the human gut microbiota, as revealed by deep 16S rRNA sequencing. PLoS Biol 6: e280. doi:10.1371/journal.pbio.0060280.

27. Turnbaugh PJ, Hamady M, Yatsunenko T, Cantarel BL, Duncan A, et al. (2009) A core gut microbiome in obese and lean twins. Nature 457: 480–484.

28. Mahowald MA, Rey FE, Seedorf H, Turnbaugh PJ, Fulton RS, et al. (2009) Characterizing a model human gut microbiota composed of members of its two dominant bacterial phyla. Proc Natl Acad Sci U S A 106: 5859–5864.

29. Keijser BJ, Zaura E, Huse SM, van der Vossen JM, Schuren FH, et al. (2008) Pyrosequencing analysis of the oral microflora of healthy adults. J Dent Res 87: 1016–1020.

30. Spear GT, Sikaroodi M, Zariffard MR, Landay AL, French AL, et al. (2008) Comparison of the diversity of the vaginal microbiota in HIV-infected and HIV-uninfected women with or without bacterial vaginosis. J Infect Dis 198: 1131–1140.

31. Raes J, Foerstner KU, Bork P (2007) Get the most out of your metagenome: Computational analysis of environmental sequence data. Curr Opin Microbiol 10: 490–498.

32. Andersson SG, Kurland CG (1998) Reductive evolution of resident genomes. Trends Microbiol 6: 263–268.

33. Cole ST, Eiglmeier K, Parkhill J, James KD, Thomson NR, et al. (2001) Massive gene decay in the leprosy bacillus. Nature 409: 1007–1011.

34. Alsmark CM, Frank AC, Karlberg EO, Legault BA, Ardell DH, et al. (2004) The louse-borne human pathogen Bartonella quintana is a genomic derivative of the zoonotic agent Bartonella henselae. Proc Natl Acad Sci U S A 101: 9716–9721.

35. Cole ST, Brosch R, Parkhill J, Garnier T, Churcher C, et al. (1998) Deciphering the biology of Mycobacterium tuberculosis from the complete genome sequence. Nature 393: 537–544.

36. Parkhill J, Sebaihia M, Preston A, Murphy LD, Thomson N, et al. (2003) Comparative analysis of the genome sequences of Bordetella pertussis, Bordetella parapertussis and Bordetella bronchiseptica. Nat Genet 35: 32–40.

37. Yip MJ, Porter JL, Fyfe JA, Lavender CJ, Portaels F, et al. (2007) Evolution of Mycobacterium ulcerans and other mycolactone-producing mycobacteria from a common Mycobacterium marinum progenitor. J Bacteriol 189: 2021–2029.

38. Rondini S, Kaser M, Stinear T, Tessier M, Mangold C, et al. (2007) Ongoing genome reduction in Mycobacterium ulcerans. Emerg Infect Dis 13: 1008–1015.

39. Stinear TP, Seemann T, Pidot S, Frigui W, Reysset G, et al. (2007) Reductive evolution and niche adaptation inferred from the genome of Mycobacterium ulcerans, the causative agent of Buruli ulcer. Genome Res 17: 192–200.

40. Huber CA, Ruf MT, Pluschke G, Kaser M (2008) Independent loss of immunogenic proteins in Mycobacterium ulcerans suggests immune evasion. Clin Vaccine Immunol 15: 598–606.

41. Stinear TP, Mve-Obiang A, Small PL, Frigui W, Pryor MJ, et al. (2004) Giant plasmid-encoded polyketide synthases produce the macrolide toxin of Mycobacterium ulcerans. Proc Natl Acad Sci U S A 101: 1345–1349.

42. Pidot SJ, Hong H, Seemann T, Porter JL, Yip MJ, et al. (2008) Deciphering the genetic basis for polyketide variation among mycobacteria producing mycolactones. BMC Genomics 9: 462.

43. Tay CY, Reeves PR, Lan R (2008) Importation of the major pilin TcpA gene and frequent recombination drive the divergence of the Vibrio pathogenicity island in Vibrio cholerae. FEMS Microbiol Lett 289: 210–218.


Perspective

The Role of Medical Structural Genomics in Discovering New Drugs for Infectious Diseases

Wesley C. Van Voorhis1, Wim G. J. Hol2, Peter J. Myler3,4,5*, Lance J. Stewart6*

1 Department of Medicine, University of Washington, Seattle, Washington, United States of America, 2 Department of Biochemistry, University of Washington, Seattle,

Washington, United States of America, 3 Seattle Biomedical Research Institute, Seattle, Washington, United States of America, 4 Department of Global Health, University of

Washington, Seattle, Washington, United States of America, 5 Department of Medical Education and Biomedical Informatics, University of Washington, Seattle,

Washington, United States of America, 6 deCODE biostructures, Bainbridge Island, Washington, United States of America

Introduction

Whether we think of Alzheimer’s dis-

ease, microbial infection, or any other

modern-day disease, new medicines are

urgently needed. The number of new

drugs registered since the advent of

genomics, however, has not lived up to

expectations. One recent review revealed

that over 70 high-throughput biochemical

screens against genetically validated drug

targets in bacteria failed to yield a single

candidate that could be tested in the clinic

[1]. The reasons for the failure of high-

throughput biochemical screens are not

completely clear, but the failure could reflect the

limited diversity of chemical libraries used

and/or the absence of structural informa-

tion for many of the targets. Indeed,

structure-based drug design is playing a

growing role in modern drug discovery,

with numerous approved drugs tracing

their origins, at least in part, to the use of

structural information from X-ray crystal-

lography or nuclear magnetic resonance

(NMR) analysis of protein targets and their

ligand-bound complexes. Although it is

beyond the scope of this brief overview to

present a comprehensive list of structures

that have led to useful drugs, Table 1 lists

some examples in which protein structure

information has provided insights into the

design and development of new therapeu-

tic entities. These cases include both novel

drug design based on native and ligand-

bound structures and optimization of

inhibitors based on the binding mode

revealed by the structures of inhibitor–

target complexes. These approaches have

allowed increased affinity for the target

and/or improvement of pharmacological

properties while maintaining target

affinity.

With the increasing availability of

complete human and pathogen genome

sequences and the substantial progress in

structure determination methods, it is no

surprise that the field of “structural

genomics” has emerged recently. Its aim

is to solve as many useful protein struc-

tures as possible from the entire genome of

a single organism or group of related

organisms. Over the past ten years, more than

20 structural genomics initiatives have

begun around the world (Table 2). The

impact of these efforts on structural

biology has been substantial, both in the

sheer number of new structures and,

perhaps even more importantly, in the

development of new methodologies, espe-

cially the use of robotics and informatics to

generate and capture data in a systematic

way [2]. Over the next five years,

thousands of new protein structures, many

bound to their ligands, will be elucidated,

laying the groundwork for structure-

based design and development of new

and improved chemotherapeutic agents

against pathogen proteins. Here, we will

focus on the intersection of structural

biology with chemistry and biology, a

field called “medical structural genomics,”

particularly on how the structures

of medically relevant drug targets in

pathogens can serve as a starting point

for inhibitor design and drug develop-

ment. We argue that the pharmaceutical

industry should be persuaded to comple-

ment the publicly funded structural geno-

mics initiatives by making public the

structural coordinates of their drug targets

for important infectious disease organisms

in a timely fashion and by developing

public–private partnerships to provide the

maximal synergy between target valida-

tion, structure determination, and hit-to-

lead development.

Target Selection

A prerequisite of medical structural

genomics is that the proteins whose

structures are determined must be well-

validated as good drug targets. The term

“drugability” is often used to loosely

describe how tractable any given target is

for the development of a drug candidate.

For infectious organisms, one key factor in

defining drugability is that the target

protein be essential for survival of the

microbe. While essentiality has tradition-

ally been defined using techniques such as

“gene knockout” and RNA interference,

these are not always feasible and should be

complemented by chemical biology ap-

proaches (see below). Furthermore, the

meaningfulness of these experiments can

often be difficult to assess, since the

interplay of host and pathogen is complex

and full of surprises. For example, tre-

mendous effort has been devoted recently

to the development of antagonists for

targets in the fatty acid biosynthesis


Citation: Van Voorhis WC, Hol WGJ, Myler PJ, Stewart LJ (2009) The Role of Medical Structural Genomics in Discovering New Drugs for Infectious Diseases. PLoS Comput Biol 5(10): e1000530. doi:10.1371/journal.pcbi.1000530

Copyright: © 2009 Van Voorhis et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Funding: This work was supported by the NIAID funding to the Seattle Structural Genomics Center for Infectious Disease (SSGCID) contract HHSN266200700057C, the Medical Structural Genomics of Protozoan Pathogens (MSGPP) contract P01 AI067921 and to WCVV, grant 1R01AI080625. The funders had no role in preparation of the article.

Competing Interests: Co-author Lance Stewart is an employee of deCODE biostructures, which developed the Fragments-of-Life library presented in Figure 1 and discussed in sections titled ‘Fragment-based drug discovery’ and ‘Targeting oligomeric enzymes’. Fragments-of-Life TM is a technology trademarked by deCODE biostructures and chemistry (http://www.decodechembio.com/Capabilities/StructuralBiology/FragmentsofLife.aspx).

* E-mail: [email protected] (PJM); [email protected] (LJS)


pathway of bacteria [3]. Potent drug-like

molecules with high bioavailability have

been developed that can effectively shut

down bacterial replication in vitro. These

compounds were found to be ineffective in

subsequent animal testing, however, be-

cause fatty acids are quite abundant in

vertebrates, so bacteria can secure these

host molecules for their survival and

growth even if their own fatty acid

biosynthesis pathways are blocked [4].

Thus, to improve target selection for

medical structural genomics, it will be

important to collaborate with chemical

biology groups to undertake screening

campaigns to identify compounds that

cause the death of a pathogen under the

appropriate assay conditions [5].

If the target protein of a drug is known,

medical structural genomics offers a rapid

and efficient way to obtain ligand-bound

structures by using high-throughput X-ray

crystallography and/or NMR. Converse-

ly, when the target of a cell-active

compound is unknown, medical structural

genomics efforts provide purified protein

for many potential drug targets that can be

screened for interaction with the active

compound by a number of biophysical

methods (such as thermal stability [6]).

The Medical Structural Genomics of

Pathogenic Protozoa (MSGPP,

http://www.msgpp.org/) initiative has already

begun such an effort by screening thou-

sands of anti-malaria compounds against

67 potential Plasmodium falciparum targets

expressed in bacteria (WC Van Voorhis,

unpublished data). These approaches aim

to generate knowledge about the biological

effect of a small molecule on a target

protein. Follow-up experiments are then

needed to test the activity of this com-

pound in live organisms in order to

validate the target; this valuable “chemical

validation” makes the target much more

likely to be drugable, and thus worthy of

more intensive effort. The future will likely

see more medical structural genomics

centers working with chemical biology

groups that have collections of

“phenotype-defined” compounds (i.e., those with

known anti-pathogen activity). The result

will be synergistic target validation and

hit-to-lead development using structure-

based drug design.

Fragment-Based Drug Discovery

Fragment-based drug discovery has rapid-

ly gained interest within the pharmaceutical

industry (reviewed in [7] with roots of 128-

compound cocktails in [8]), as an alternative

to expensive and sometimes inefficient high-

throughput screening methods for hit identi-

fication and optimization [9]. The general

concept of fragment-based drug discovery

involves screening libraries of “rule-of-three”

compounds [10] against target macromole-

cules by using a variety of methods including

X-ray crystallography, NMR, surface plas-

mon resonance, differential thermal denatur-

ation, fluorescence polarization, and other

techniques [7,11–14]. The rule of three

consists of molecular weight <300 daltons,

≤3 rotatable bonds, ≤3 hydrogen bond

donors/acceptors, and Clog P (calculated log

of octanol/water partition coefficient) <3.

These compounds generally include fragments

or “building blocks” of available drugs,

on the assumption that these fragments are

more likely to be “drug-like.” Fragment-based

drug discovery has been used by

commercial and academic groups, including

our own, and has led to a number of leads for

further drug development [15]. At deCODE

biostructures, a partner in the Seattle Struc-

tural Genomics Center for Infectious Disease

(SSGCID, http://www.ssgcid.org/) consor-

tium, the approach to assembling a fragment

library has been somewhat different. The

Fragments of Life (FOL) library (Figure 1) is a

collection of approximately 1,400 structurally

diverse small molecules found in the cellular

environment, metabolites, natural products,

and their derivatives or isosteres (molecules of

Table 1. Examples of how target protein structure can assist drug discovery and development.

Source | Target Protein | Approach | Reference(s)
HIV | gp41 | Structure led to strategies that target viral entry. | [43–45]
HIV | Protease | Protease–inhibitor complexes allowed lead optimization. | [46–52]
HIV | Reverse transcriptase | Non-nucleoside inhibitor complexes led to drug design that targets pockets outside the enzyme’s active site. | [53–55]
Influenza virus | Neuraminidase | Complex with a transition state analog led to inhalable and orally active neuraminidase inhibitors. | [56–59]
Rhinovirus | Coat protein | Small fatty acid molecules bound in hydrophobic pocket led to new strategies of antiviral drug design. | [60]
Vibrio | Cholera toxin | Five receptor-binding sites provided inspiration for design of novel multivalent inhibitors. | [61]
Bacteria | Peptide deformylase | Protein–inhibitor complexes led to macrocyclic compounds with improved potency, selectivity and metabolic stability. | [62]
Trypanosoma | GAPDH | Novel adenosine analogs showed enhanced selectivity towards the parasite target versus human protein. | [63,64]
Human | Cyclophilin and calcineurin | A ternary complex with cyclosporine A led to insights into its immunosuppressive activity. | [65]
Human | Renin | The ligand-bound structure allowed design and improvement of orally active non-peptide inhibitors to regulate blood pressure. | [66]
Human | Coagulation factor Xa | Structure-based design led to improved pharmacological anticoagulant properties in a primate model. | [67]
Human | Adenosine deaminase | Optimization of a non-nucleoside inhibitor led to an orally active anti-inflammatory compound in a rat model. | [68]
Human | Kinases | Structures of kinases provided a basis to improve and design new therapeutics for various human diseases including cancer. | [69]

doi:10.1371/journal.pcbi.1000530.t001


similar size containing the same number and

types of atoms). Also included in the FOL

library are a series of biaryl small molecules

(which contain two tethered five- or six-

membered ring structures) that mimic protein

secondary structure elements (e.g., α-helical

turns). Thus, this fragment set is useful for

targeting both the active sites of enzymes and

more complex protein surfaces including

allosteric small molecule binding sites and

protein–protein interfaces [16].
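The rule-of-three criteria described above can be expressed as a simple filter. The sketch below is a minimal illustration, not the screening pipeline of any group mentioned here: it assumes the descriptors have already been computed by some cheminformatics tool, it treats the donor and acceptor limits as separate checks (an interpretation of "donors/acceptors"), and the fragment names and descriptor values are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Fragment:
    """Precomputed descriptors for one candidate fragment (values are hypothetical)."""
    name: str
    mol_weight: float       # daltons
    rotatable_bonds: int
    h_bond_donors: int
    h_bond_acceptors: int
    clogp: float            # calculated log of octanol/water partition coefficient


def passes_rule_of_three(f: Fragment) -> bool:
    """Apply the thresholds as stated in the text: MW < 300, <= 3 rotatable bonds,
    <= 3 hydrogen bond donors, <= 3 acceptors, and Clog P < 3."""
    return (
        f.mol_weight < 300
        and f.rotatable_bonds <= 3
        and f.h_bond_donors <= 3
        and f.h_bond_acceptors <= 3
        and f.clogp < 3
    )


def filter_library(library: List[Fragment]) -> List[Fragment]:
    """Keep only the fragments that satisfy every criterion."""
    return [f for f in library if passes_rule_of_three(f)]


# Illustrative library; none of these values are measurements of real compounds.
library = [
    Fragment("fragment_A", 145.2, 1, 1, 2, 1.4),
    Fragment("fragment_B", 312.4, 2, 1, 3, 2.1),   # fails: molecular weight too high
    Fragment("fragment_C", 220.3, 5, 2, 2, 2.7),   # fails: too many rotatable bonds
]

for fragment in filter_library(library):
    print(fragment.name)
```

A filter of this kind only narrows a library by physicochemical properties; whether a fragment actually binds a target still has to be established by the experimental methods listed above.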

Targeting Oligomeric Enzymes

Protein–protein interactions and assemblies,

ranging from simple dimers to

extremely complex arrangements as seen

in the ribosome or the nuclear pore

complex, form the basis of most biological

processes, and there are usually numerous

points of contact between the macromol-

ecules involved. Yet the protein–protein

interfaces formed by oligomerization are

not necessarily accompanied by a large

gain in free energy, and small molecules

have been shown to prevent critical

protein–protein interactions [17]. These

Table 2. Structural genomics projects worldwide submitting to the Protein Data Bank.

Name | URL | Target Focus
Berkeley Structural Genomics Center (BSGC) | http://www.strgen.org/ | Near complete coverage of Mycoplasma genome
Center for Eukaryotic Structural Genomics (CESG) | http://www.uwstructuralgenomics.org/ | PSI Center: Eukaryotic bottlenecks, specifically solubility
Center for Structural Genomics of Infectious Disease (CSGID) | http://csgid.org/csgid/ | Medically relevant infectious disease targets
Center for Structure of Membrane Proteins (CSMP) | http://csmp.ucsf.edu/index.htm | PSI Center: Bacterial and human membrane proteins
Integrated Center for Structure and Function Innovation (ISFI) | http://techcenter.mbi.ucla.edu/ | PSI Center: Protein solubility and crystallization improvement
Israel Structural Proteomics Center | http://www.weizmann.ac.il/ISPC/ | Member of Structural Proteomics in Europe (see below)
Joint Center for Structural Genomics (JCSG) | http://www.jcsg.org/ | PSI Center: High-throughput pipeline development and operation
Marseilles Structural Genomics Program | http://www.afmb.univ-mrs.fr/rubrique93.html | Human health
Medical Structural Genomics of Pathogenic Protozoa (MSGPP) | http://www.msgpp.org/ | Structural and functional genomics of ten species of pathogenic protozoa
Montreal-Kingston Bacterial Structural Genomics Initiative (BSGI) | http://euler.bri.nrc.ca/brimsg/bsgi.html | ORFs from pathogenic and nonpathogenic bacterial strains
Mycobacterium Tuberculosis Structural Genomics Consortium (TBsgc) | http://www.doe-mbi.ucla.edu/TB/ | Mycobacterium tuberculosis: To understand pathogenesis and for structure-based drug design
Mycobacterium Tuberculosis Structural Proteomics Project (X-MTB) | http://webclu.bio.wzw.tum.de/binfo/proj/mtb/ | 35 Mycobacterium tuberculosis targets to identify five for drug development
New York SGX Research Center for Structural Genomics (NYSGXRC) | http://www.nysgrc.org/nysgrc/ | PSI Center: High-throughput pipeline development and operation
Ontario Center for Structural Proteomics (OCSP) | http://www.uhnres.utoronto.ca/centres/proteomics/ | Enzymatic activity characterization
Oxford Protein Production Facility | http://www.oppf.ox.ac.uk/OPPF/ | Human and pathogen targets of biomedical relevance
RIKEN Structural Genomics/Proteomics Initiative | http://www.rsgi.riken.jp/rsgi_e/ | Protein functional networks
Seattle Structural Genomics Center for Infectious Disease (SSGCID) | http://www.ssgcid.org/ | Medically relevant infectious disease targets
Southeast Collaboratory for Structural Genomics | http://www.secsg.org/ | High-throughput eukaryotic genome-scan methods development
Structural Genomics of Pathogenic Protozoa | http://www.sgpp.org/ | PSI Center: Three-dimensional structures of proteins from four major pathogenic protozoa
Structural Proteomics in Europe (SPINE) | http://www.spineurope.org/ | Structures of medically relevant proteins and protein complexes
Structural Proteomics in Europe 2-Complexes (SPINE2-Complexes) | http://www.spine2.eu/SPINE2/ | Structures of protein complexes from medically relevant signaling pathways
Structural Genomics Consortium | http://www.thesgc.org/ | Medically relevant human and pathogen proteins
Structure 2 Function Project | http://s2f.umbi.umd.edu/ | Poorly characterized and hypothetical protein targets
The Accelerated Technologies Center for Gene to 3D Structure | http://atcg3d.org/default.aspx | PSI Center: Technologies development of X-ray source, synthetic gene design, and microfluidic crystallization
The Midwest Center for Structural Genomics (MCSG) | http://www.mcsg.anl.gov/ | PSI Center: High-throughput methods development and operation
The Northeast Structural Genomics Consortium (NESG) | http://www.nesg.org/ | PSI Center: Protein domains, network families, biomedical relevance

Note: Some centers with fewer than ten released structures in the PDB (www.rcsb.org/pdb/) are not shown. PSI, Protein Structure Initiative.

doi:10.1371/journal.pcbi.1000530.t002


findings have prompted recent discussion

of a structure-based approach aimed at

developing novel small-molecule antibiot-

ics that modulate protein activity by

binding to an interface between subunits

within multi-protein complexes [18]. The

bacterial enzyme inorganic pyrophospha-

tase may serve as an example for this

approach, since it exists in a hexameric

state that requires conformational flexibil-

ity for its essential role in converting

inorganic pyrophosphate into phosphate

[19–21]. Moreover, whereas all bacterial

inorganic pyrophosphatases function as

homohexamers, the eukaryotic cytosolic

and mitochondrial inorganic pyrophos-

phatases function as homodimers [21].

Hence eukaryotic inorganic pyrophospha-

tases have different oligomeric interfaces

than those of bacterial enzymes. This

suggests that it may be possible to inhibit

the bacterial inorganic pyrophosphatase

safely by targeting its oligomeric state

rather than its highly conserved active

site. A similar approach has recently been

used to identify species-specific modulators

of porphobilinogen synthase (PBGS) ac-

tivity [22]. SSGCID has solved the high-

resolution X-ray crystal structure of inor-

ganic pyrophosphatase from the patho-

genic bacterium Burkholderia pseudomallei,

and a subsequent FOL screen of this target

identified several fragments that specifical-

ly bind at multiple oligomerization pockets

in a molecular interface between the two

trimers of the homohexamer (Figure 2).

While these fragments remain to be

validated in terms of their species-specific

inhibition of inorganic pyrophosphatase

activity, they represent potential starting

points for the development of novel

antibiotics.

Industry-Generated Structures and the Protein Data Bank

As we have seen above, protein struc-

ture information is the bread and butter of

structure-based drug discovery. Structural

genomics projects (Table 2) have substan-

tially increased the number of protein

structures solved and have made this

information freely and openly available

(i.e., at no cost and without restriction by

copyright or other constraints) by depos-

iting it in the Protein Data Bank (PDB)

[23]. Most publishers have policies that

require authors to deposit structural data

in the PDB at the time of publication, so

structures determined by academic re-

searchers worldwide are, for the most

part, well disseminated. By contrast, the

pharmaceutical industry is sitting on a

mountain of structural data for protein–

ligand complexes from globally important

pathogens, which is not available to the

wider scientific community. The secrecy

engendered by the current economic

incentives driving drug discovery in the

commercial sector has led to a substantial

waste of precious resources through dupli-

cation of effort and inability to learn from

others’ successes and failures. The situa-

tion is unlikely to change without a

concerted effort to find ways to overcome

the financial and intellectual property

barriers that prevent dissemination of this

information. A recent publication suggest-

ed that open access industry–academia

partnerships may provide one possible

model [24]. We propose that the United

States National Institutes of Health, along

with other national and international

research-funding agencies, issue calls for

proposals that will fund the transfer of the

highly valuable structural information

from corporate databases into the PDB.

Such an effort would obviously require

discussion with industrial parties to nego-

tiate mutually acceptable policies and

mechanisms for the deposition of these

structures in the public databases. These

might include relaxation of release stan-

dards for industrial entities, such that

structural information could be safely

deposited in PDB at the time of structure

Figure 1. Conceptual organization of the deCODE biostructures Fragments of Life library. The current approximately 1,400-compound library contains chemically tractable natural small molecule metabolites (FOL-Nat), metabolite-like compounds and their bioisosteres (FOL-NatD), and biaryl mimetics of protein architecture (FOL-Biaryl). The FOL-Nat members include any natural molecule of molecular weight <350 daltons that exists as a substrate, natural product, or allosteric regulator of any metabolic pathway in any cell type, such as the biosynthetic pathways for the neurotransmitter serotonin (1) and the plant hormone auxin (2). The FOL-Nat members also include secondary metabolites such as bestatin (3), a secondary metabolite of Streptomyces olivoreticuli [38]. FOL-NatD fragments are defined as heteroatom-containing derivatives, isosteres, or analogs of any FOL-Nat molecule. For example, fragments 4–7 contain the indole scaffold, which is known to be a privileged building block for drug molecules [39]. To emulate protein architecture, the FOL-Biaryl fragments were selected from a variety of biaryl compounds that are potential mimics of protein α, β, or γ turns [40–42]. These include a compound (8) whose structure in an energy-minimized state can be seen to mimic the architecture on an α-turn of a protein structure (here, residues Ser65-Ile66-Leu67-Lys68 of PDB ID:1RTP) and, similarly, a compound (9) whose structure mimics the β-turn of a protein structure (residues Ala20-Ala21-Asp22-Ser23).

doi:10.1371/journal.pcbi.1000530.g001


determination and released only at a later

date more appropriate for protection of

intellectual property.


We are currently witnessing an explosion in technological and computational advances in structural genomics, with protein structures of hundreds or thousands of medically relevant targets from infectious disease organisms likely to be available over the next few years. This new information provides both academic and for-profit scientists with an unprecedented opportunity to accelerate the development of new and improved chemotherapeutic agents against these pathogens. One major challenge will be the adaptation of existing fragment-based drug design methods to match the scale of the structural genomics era. New high-throughput methods need to be developed for fragment screening to enhance the success rate for protein–ligand structure determination.

Major attention must also be given to the development of fully automated, very high throughput crystal growth screening methods to elucidate the binding of well-selected compounds to medically relevant targets. These screens need to cover many (up to 100) protein variants [25,26], 1,000–10,000 different small molecule compounds, and approximately 1,000 different crystal growth conditions [27], resulting in 10^8 to 10^9 conditions to be tested for a single drug target. Obviously, this will require development of even smaller volume assays than those currently in use [28–31], down to the low picoliters, and automated detection of crystals in the millions of crystallization chambers [32–34]. Further development of automated capillary crystallization methods [35] might provide another way to achieve the very high throughput crystal screening required for reaching the full power of medical structural genomics in the future. Cryoprotection of the crystals is a specific hurdle, although it might be possible to routinely collect and merge partial datasets from multiple crystals under non-cryo conditions. Alternatively, the use of micromeshes [36,37] and further miniaturization of trays and other crystal screening tools may allow cryoprotection of many crystals simultaneously.
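The scale of this screening problem follows from multiplying the three ranges quoted above. The short Python sketch below reproduces that arithmetic; the drop volumes are illustrative assumptions added here (they are not taken from the text) to show why picoliter-scale assays matter for total sample consumption.

```python
# Back-of-the-envelope count of crystallization experiments per drug target,
# using the ranges quoted in the text.
protein_variants = 100           # "up to 100" protein variants
compounds = (1_000, 10_000)      # range of small molecule compounds
growth_conditions = 1_000        # approximate number of crystal growth conditions

low = protein_variants * compounds[0] * growth_conditions
high = protein_variants * compounds[1] * growth_conditions

# Hypothetical drop volumes in litres (assumptions for illustration only).
drop_volumes_litres = {"100 nL": 100e-9, "1 nL": 1e-9, "10 pL": 10e-12}


def total_volume_litres(n_experiments, drop_volume):
    """Total liquid volume consumed if every experiment uses one drop."""
    return n_experiments * drop_volume


print(f"experiments per target: {low:.0e} to {high:.0e}")
for label, volume in drop_volumes_litres.items():
    lo_total = total_volume_litres(low, volume)
    hi_total = total_volume_litres(high, volume)
    print(f"{label:>7} drops: {lo_total:.3g} L to {hi_total:.3g} L in total")
```

Under these assumed volumes, moving from 100 nL to 10 pL drops reduces the total liquid needed for one target by four orders of magnitude, which is the motivation for the smaller assays the text calls for.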

In addition, existing databases will need to be modified to allow easy dissemination of the results from these fragment screens, and a serious effort should be made to persuade small and big pharma to release coordinates of drug targets from globally important infectious disease organisms. It will also be critical (but challenging) for structural biologists to collaborate with medicinal chemists and molecular biologists to turn these fragments from promising leads into effective drugs. Together, these steps should begin to release a flood of structures that provide a tremendous resource for improving health in rich and poor countries alike.

Acknowledgments

The authors wish to thank all the individuals who have dedicated themselves to the SSGCID and MSGPP projects. In particular, we thank Robin Stacy, Bart Staker, Alberto Napuli, Frank E. Zucker, Erkang Fan, Christophe Verlinde, Ethan Merritt, and Frederick Buckner, to name but a few.

Figure 2. B. pseudomallei inorganic pyrophosphatase with bound ligand at an oligomeric interface. Homo-hexameric bacterial inorganic pyrophosphatase is a dimer of trimers (blue and green). The illustration shows the hexamer structure in a complex with three ligand fragment molecules (red spheres and stick structures represent fragment FOL 110), each of which is located at one of three “dimer of trimer” interfaces (1.5 ligands per monomer) (PDB ID: 3EJ0). The location of one pyrophosphate substrate (cyan spheres) at the active site of one of the monomers is indicated here based on the superimposed structure of the hexamer with pyrophosphate bound in the active site (PDB ID: 3EIY). The binding sites of the ligands (red) are clearly seen in a pocket formed by the homo-oligomeric assemblage, which is distant from the active site where pyrophosphate (cyan) binds. doi:10.1371/journal.pcbi.1000530.g002


References

1. Payne DJ, Gwynn MN, Holmes DJ, Pompliano DL (2007) Drugs for bad bugs: Confronting the challenges of antibacterial discovery. Nat Rev Drug Discov 6: 29–40.

2. Haquin S, Oeuillet E, Pajon A, Harris M, Jones AT, et al. (2008) Data management in structural genomics: An overview. Methods Mol Biol 426: 49–79.

3. Wright HT, Reynolds KA (2007) Antibacterial targets in fatty acid biosynthesis. Curr Opin Microbiol 10: 447–453.

4. Brinster S, Lamberet G, Staels B, Trieu-Cuot P, Gruss A, et al. (2009) Type II fatty acid synthesis is not a suitable antibiotic target for gram-positive pathogens. Nature 458: 83–86.

5. Hoon S, Smith AM, Wallace IM, Suresh S, Miranda M, et al. (2008) An integrated platform of genomic assays reveals small-molecule bioactivities. Nat Chem Biol 4: 498–506.

6. Ericsson UB, Hallberg BM, Detitta GT, Dekker N, Nordlund P (2006) Thermofluor-based high-throughput stability optimization of proteins for structural studies. Anal Biochem 357: 289–298.

7. Congreve M, Chessari G, Tisi D, Woodhead AJ (2008) Recent developments in fragment-based drug discovery. J Med Chem 51: 3661–3689.

8. Verlinde CLMJ, Kim H, Bernstein BE, Mande SC, Hol WG (1997) Antitrypanosomiasis drug development based on structures of glycolytic enzymes. In: Veerapandian P, ed. Structure-based drug design. New York: Marcel Dekker. pp 365–394.

9. Rees DC, Congreve M, Murray CW, Carr R (2004) Fragment-based lead discovery. Nat Rev Drug Discov 3: 660–672.

10. Congreve M, Carr R, Murray C, Jhoti H (2003) A “rule of three” for fragment-based lead discovery? Drug Discov Today 8: 876–877.

11. Nienaber VL, Greer J (2000) Discovering novel ligands for macromolecules using X-ray crystallographic screening. Nature Biotechnol 18: 1105–1108.

12. Neumann T, Junker HD, Schmidt K, Sekul R (2007) SPR-based fragment screening: Advantages and applications. Curr Top Med Chem 7: 1630–1642.

13. Jhoti H, Cleasby A, Verdonk M, Williams G (2007) Fragment-based screening using X-ray crystallography and NMR spectroscopy. Curr Opin Chem Biol 11: 485–493.

14. Erlanson DA (2006) Fragment-based lead discovery: A chemical update. Curr Opin Biotechnol 17: 643–652.

15. Bosch J, Robien MA, Mehlin C, Boni E, Riechers A, et al. (2006) Using fragment cocktail crystallography to assist inhibitor design of Trypanosoma brucei nucleoside 2-deoxyribosyltransferase. J Med Chem 49: 5939–5946.

16. Davies DR, Mamat B, Magnusson OT, Christensen J, Haraldsson MH, et al. (2009) Discovery of leukotriene A4 hydrolase inhibitors using metabolomics biased fragment crystallography. J Med Chem 52: 4694–4715.

17. Liuzzi M, Deziel R, Moss N, Beaulieu P, Bonneau AM, et al. (1994) A potent peptidomimetic inhibitor of HSV ribonucleotide reductase with antiviral activity in vivo. Nature 372: 695–698.

18. Wells JA, McClendon CL (2007) Reaching for high-hanging fruit in drug discovery at protein-protein interfaces. Nature 450: 1001–1009.

19. Kankare J, Salminen T, Lahti R, Cooperman BS, Baykov AA, et al. (1996) Structure of Escherichia coli inorganic pyrophosphatase at 2.2 Å resolution. Acta Crystallogr D Biol Crystallogr 52: 551–563.

20. Oksanen E, Ahonen AK, Tuominen H, Tuominen V, Lahti R, et al. (2007) A complete structural description of the catalytic cycle of yeast pyrophosphatase. Biochemistry 46: 1228–1239.

21. Sivula T, Salminen A, Parfenyev AN, Pohjanjoki P, Goldman A, et al. (1999) Evolutionary aspects of inorganic pyrophosphatase. FEBS Lett 454: 75–80.

22. Lawrence SH, Ramirez UD, Tang L, Fazliyez F, Kundrat L, et al. (2008) Shape shifting leads to small-molecule allosteric drug discovery. Chem Biol 15: 586–596.

23. Berman H, Henrick K, Nakamura H, Markley JL (2007) The worldwide Protein Data Bank (wwPDB): Ensuring a single, uniform archive of PDB data. Nucleic Acids Res 35: D301–303.

24. Edwards AM, Bountra C, Kerr DJ, Willson TM (2009) Open access chemical and clinical probes to support drug discovery. Nat Chem Biol 5: 436–440.

25. Choi KH, Groarke JM, Young DC, Rossmann MG, Pevear DC, et al. (2004) Design, expression, and purification of a Flaviviridae polymerase using a high-throughput approach to facilitate crystal structure determination. Protein Sci 13: 2685–2692.

26. Graslund S, Sagemark J, Berglund H, Dahlgren LG, Flores A, et al. (2008) The use of systematic N- and C-terminal deletions to promote production and structural studies of recombinant proteins. Protein Expr Purif 58: 210–221.

27. Luft JR, Collins RJ, Fehrman NA, Lauricella AM, Veatch CK, et al. (2003) A deliberate approach to screening for initial crystallization conditions of biological macromolecules. J Struct Biol 142: 170–179.

28. Santarsiero BDYD, Lee CC, Spraggon G, Gu J, Scheibe D, Uber EC, Cornell EW, Nordmeyer RA, Kolbe WF, Jin J, Jones AL, Jaklevic JM, Schultz PG, Stevens RC (2002) An approach to rapid protein crystallization using nanodroplets. J Appl Crystallogr 35: 278–281.

29. Hansen CL, Skordalakes E, Berger JM, Quake SR (2002) A robust and scalable microfluidic metering method that allows protein crystal growth by free interface diffusion. Proc Natl Acad Sci U S A 99: 16531–16536.

30. Zheng B, Roach LS, Ismagilov RF (2003) Screening of protein crystallization conditions on a microfluidic chip using nanoliter-size droplets. J Am Chem Soc 125: 11170–11171.

31. Gerdts CJ, Elliott M, Lovell S, Mixon MB, Napuli AJ, et al. (2008) The plug-based nanovolume Microcapillary Protein Crystallization System (MPCS). Acta Crystallogr D Biol Crystallogr 64: 1116–1122.

32. Wilson J (2002) Towards the automated evaluation of crystallization trials. Acta Crystallogr D Biol Crystallogr 58: 1907–1914.

33. Pan S, Shavit G, Penas-Centeno M, Xu DH, Shapiro L, et al. (2006) Automated classification of protein crystallization images using support vector machines with scale-invariant texture and Gabor features. Acta Crystallogr D Biol Crystallogr 62: 271–279.

34. Liu R, Freund Y, Spraggon G (2008) Image-based crystal detection: A machine-learning approach. Acta Crystallogr D Biol Crystallogr 64: 1187–1195.

35. Fan E, Baker D, Fields S, Gelb MH, Buckner FS, et al. (2008) Structural genomics of pathogenic protozoa: An overview. Methods Mol Biol 426: 497–513.

36. Wagner A, Diez J, Schulze-Briese C, Schluckebier G (2009) Crystal structure of ultralente: A microcrystalline insulin suspension. Proteins 74: 1018–1027.

37. Thorne RESZ, Kmetko J, O’Niell J, Gillilan R (2003) Microfabricated mounts for high-throughput macromolecular cryocrystallography. J Appl Crystallogr 36: 1455–1460.

38. Schorlemmer HU, Bosslet K, Dickneite G, Luben G, Sedlacek HH (1984) Studies on the mechanisms of action of the immunomodulator Bestatin in various screening test systems. Behring Inst Mitt: 157–173.

39. Costantino L, Barlocco D (2006) Privileged structures as leads in medicinal chemistry. Curr Med Chem 13: 65–85.

40. Biros SM, Moisan L, Mann E, Carella A, Zhai D, et al. (2007) Heterocyclic alpha-helix mimetics for targeting protein-protein interactions. Bioorg Med Chem Lett 17: 4641–4645.

41. Robinson JA (2008) Beta-hairpin peptidomimetics: design, structures and biological activities. Acc Chem Res 41: 1278–1288.

42. Saraogi I, Hamilton AD (2008) alpha-Helix mimetics as inhibitors of protein-protein interactions. Biochem Soc Trans 36: 1414–1417.

43. Root MJ, Steger HK (2004) HIV-1 gp41 as a target for viral entry inhibition. Curr Pharm Des 10: 1805–1825.

44. Weissenhorn W, Dessen A, Harrison SC, Skehel JJ, Wiley DC (1997) Atomic structure of the ectodomain from HIV-1 gp41. Nature 387: 426–430.

45. Ferrer M, Kapoor TM, Strassmaier T, Weissenhorn W, Skehel JJ, et al. (1999) Selection of gp41-mediated HIV-1 cell entry inhibitors from biased combinatorial libraries of non-natural binding elements. Nat Struct Biol 6: 953–960.

46. Lapatto R, Blundell T, Hemmings A, Overington J, Wilderspin A, et al. (1989) X-ray analysis of HIV-1 proteinase at 2.7 Å resolution confirms structural homology among retroviral enzymes. Nature 342: 299–302.

47. Miller M, Schneider J, Sathyanarayana BK, Toth MV, Marshall GR, et al. (1989) Structure of complex of synthetic HIV-1 protease with a substrate-based inhibitor at 2.3 Å resolution. Science 246: 1149–1152.

48. Navia MA, Fitzgerald PM, McKeever BM, Leu CT, Heimbach JC, et al. (1989) Three-dimensional structure of aspartyl protease from human immunodeficiency virus HIV-1. Nature 337: 615–620.

49. Wlodawer A, Miller M, Jaskolski M, Sathyanarayana BK, Baldwin E, et al. (1989) Conserved folding in retroviral proteases: Crystal structure of a synthetic HIV-1 protease. Science 245: 616–621.

50. Wlodawer A, Vondrasek J (1998) Inhibitors of HIV-1 protease: A major success of structure-assisted drug design. Annu Rev Biophys Biomol Struct 27: 249–284.

51. Abdel-Rahman HM, Al-karamany GS, El-Koussi NA, Youssef AF, Kiso Y (2002) HIV protease inhibitors: Peptidomimetic drugs and future perspectives. Curr Med Chem 9: 1905–1922.

52. Chrusciel RA, Strohbach JW (2004) Non-peptidic HIV protease inhibitors. Curr Top Med Chem 4: 1097–1114.

53. Das K, Lewi PJ, Hughes SH, Arnold E (2005) Crystallography and the design of anti-AIDS drugs: Conformational flexibility and positional adaptability are important in the design of non-nucleoside HIV-1 reverse transcriptase inhibitors. Prog Biophys Mol Biol 88: 209–231.

54. Kohlstaedt LA, Wang J, Friedman JM, Rice PA, Steitz TA (1992) Crystal structure at 3.5 Å resolution of HIV-1 reverse transcriptase complexed with an inhibitor. Science 256: 1783–1790.

55. Smerdon SJ, Jager J, Wang J, Kohlstaedt LA, Chirino AJ, et al. (1994) Structure of the binding site for nonnucleoside inhibitors of the reverse transcriptase of human immunodeficiency virus type 1. Proc Natl Acad Sci U S A 91: 3911–3915.

56. Babu YS, Chand P, Bantia S, Kotian P, Dehghani A, et al. (2000) BCX-1812 (RWJ-270201): Discovery of a novel, highly potent, orally active, and selective influenza neuraminidase inhibitor through structure-based drug design. J Med Chem 43: 3482–3486.

57. Bossart-Whitaker P, Carson M, Babu YS, Smith CD, Laver WG, et al. (1993) Three-dimensional structure of influenza A N9 neuraminidase and its complex with the inhibitor 2-deoxy 2,3-dehydro-N-acetyl neuraminic acid. J Mol Biol 232: 1069–1083.

58. Kim CU, Lew W, Williams MA, Liu H, Zhang L, et al. (1997) Influenza neuraminidase inhibitors possessing a novel hydrophobic interaction in the enzyme active site: Design, synthesis, and structural analysis of carbocyclic sialic acid analogues with potent anti-influenza activity. J Am Chem Soc 119: 681–690.

59. von Itzstein M, Wu WY, Kok GB, Pegg MS, Dyason JC, et al. (1993) Rational design of potent sialidase-based inhibitors of influenza virus replication. Nature 363: 418–423.

60. Hadfield AT, Lee W, Zhao R, Oliveira MA, Minor I, et al. (1997) The refined structure of human rhinovirus 16 at 2.15 Å resolution: Implications for the viral life cycle. Structure 5: 427–441.

61. Merritt EA, Zhang Z, Pickens JC, Ahn M, Hol WG, et al. (2002) Characterization and crystal structure of a high-affinity pentavalent receptor-binding inhibitor for cholera toxin and E. coli heat-labile enterotoxin. J Am Chem Soc 124: 8818–8824.

62. Hu X, Nguyen KT, Jiang VC, Lofland D, Moser HE, et al. (2004) Macrocyclic inhibitors for peptide deformylase: A structure-activity relationship study of the ring size. J Med Chem 47: 4941–4949.

63. Aronov AM, Verlinde CL, Hol WG, Gelb MH (1998) Selective tight binding inhibitors of trypanosomal glyceraldehyde-3-phosphate dehydrogenase via structure-based drug design. J Med Chem 41: 4790–4799.

64. Bressi JC, Choe J, Hough MT, Buckner FS, Van Voorhis WC, et al. (2000) Adenosine analogues as inhibitors of Trypanosoma brucei phosphoglycerate kinase: Elucidation of a novel binding mode for a 2-amino-N(6)-substituted adenosine. J Med Chem 43: 4135–4150.

65. Jin L, Harrison SC (2002) Crystal structure of human calcineurin complexed with cyclosporin A and human cyclophilin. Proc Natl Acad Sci U S A 99: 13522–13526.

66. Rahuel J, Rasetti V, Maibaum J, Rueger H, Goschke R, et al. (2000) Structure-based drug design: The discovery of novel nonpeptide orally active inhibitors of human renin. Chem Biol 7: 493–504.

67. Lam PY, Clark CG, Li R, Pinto DJ, Orwat MJ, et al. (2003) Structure-based design of novel guanidine/benzamidine mimics: Potent and orally bioavailable factor Xa inhibitors as novel anticoagulants. J Med Chem 46: 4405–4418.

68. Terasaka T, Kinoshita T, Kuno M, Seki N, Tanaka K, et al. (2004) Structure-based design, synthesis, and structure-activity relationship studies of novel non-nucleoside adenosine deaminase inhibitors. J Med Chem 47: 3730–3743.

69. Noble ME, Endicott JA, Johnson LN (2004) Protein kinase inhibitors: Insights into drug design from structure. Science 303: 1800–1805.


Review

The Key Role of Genomics in Modern Vaccine and Drug Design for Emerging Infectious Diseases

Kate L. Seib1, Gordon Dougan2, Rino Rappuoli1*

1 Novartis Vaccines and Diagnostics, Siena, Italy, 2 The Wellcome Trust Sanger Institute, The Wellcome Trust Genome Campus, Hinxton, Cambridge, United Kingdom

Abstract: It can be argued that the arrival of the “genomics era” has significantly shifted the paradigm of vaccine and therapeutics development from microbiological to sequence-based approaches. Genome sequences provide a previously unattainable route to investigate the mechanisms that underpin pathogenesis. Genomics, transcriptomics, metabolomics, structural genomics, proteomics, and immunomics are being exploited to perfect the identification of targets, to design new vaccines and drugs, and to predict their effects in patients. Furthermore, human genomics and related studies are providing insights into aspects of host biology that are important in infectious disease. This ever-growing body of genomic data and new genome-based approaches will play a critical role in the future to enable timely development of vaccines and therapeutics to control emerging infectious diseases.

By controlling debilitating and often-lethal infectious diseases, vaccines and antibiotics have had an enormous impact on world health. Now, with the arrival of the “genomics era,” a paradigm shift is occurring in the development of vaccines (and potentially also in the development of antibiotics) that is providing fresh impetus to this field. The world nevertheless still faces a huge burden of infection from classic pathogens (e.g., typhoid, measles), recently discovered causes of disease (e.g., Helicobacter pylori and hepatitis C virus [HCV]), and emerging infectious diseases (EIDs, e.g., H1N1 swine flu and severe acute respiratory syndrome coronavirus [SARS-CoV]). In addition, variant forms of previously identified infectious diseases are reemerging (e.g., Streptococcus pyogenes, also known as group A streptococcus [GAS], and dengue fever), along with antibiotic-resistant forms of microbes (e.g., methicillin-resistant Staphylococcus aureus [MRSA] and Mycobacterium tuberculosis) [1,2] (for a list of EIDs see http://www3.niaid.nih.gov/topics/emerging/list.htm). The World Health Organization (WHO) estimates that we can expect at least one such new pathogen to appear every year.

The fact that an infectious disease has emerged or reemerged

indicates immune naïvety in the infected population, or altered

virulence potential or an increase in antibiotic/antiviral resistance

in the pathogen population. The rapid development of vaccines

and therapeutics that target these pathogens is therefore essential

to limit their spread. Traditional empirical approaches that screen

for vaccines or drugs a few candidates at a time are time-

consuming and have often proven insufficient to control many

EIDs, particularly when the causative pathogens are antigenically

diverse (e.g., HIV), cannot be cultivated in the laboratory (e.g.,

HCV), lack suitable animal models of infection (e.g., Neisseria spp.),

have complex mechanisms of pathogenesis (e.g., retroviruses),

and/or are controlled by mucosal or T cell–dependent immune

responses rather than humoral immune responses (e.g., Shigella

spp., M. tuberculosis) [3]. For many EIDs, the wealth of information

emerging in the genome era has already had a significant impact

on the way we approach vaccine and therapeutic development.

For EIDs that appear in the near future, genomics will be in the

first line of defense in terms of antigen identification, diagnostic

development, and functional characterization.

Since the completion of the genome sequence of Haemophilus influenzae (the first finished bacterial genome sequence) in 1995 [4], advances in sequencing technology and bioinformatics have produced an exponential growth of genome sequence information. At least one genome sequence is now available for each major human pathogen. As of October 2009, over 1,000 bacterial genomes were “completed” (i.e., closed genomes and whole genome shotgun sequences) and more than 1,000 were ongoing; over 3,000 viral genomes were completed (http://www.genomesonline.org/gold.cgi, http://www.ncbi.nlm.nih.gov/genomes/MICROBES/microbial_taxtree.html, http://cmr.jcvi.org/tigr-scripts/CMR/shared/Genomes.cgi). For a bacterial pathogen, which may have more than 4,000 genes, the genome sequence provides the complete genetic repertoire of antigens or drug targets from which novel candidates can be identified. For viral pathogens that may possess fewer than 10 genes, genomics can be used to define the variability that may exist between isolates. Host genetic factors also play a role in infectious disease [5,6], however, and “complete” human genome sequences, as well as large-scale human genome projects (see http://www.1000genomes.org/), are valuable resources. Hence, the sequences of both pathogen and host genomes can facilitate identification of a growing number of potential vaccine and drug targets (Figure 1). It is estimated that 10–100 times more candidates can be identified in one to two years using genomics-based approaches than can be identified by conventional methods in the same time frame. Furthermore, genomics-based vaccine projects have substantially increased our understanding of microbial physiology, epidemiology, pathogenesis, and protein functions (see Box 1).


Citation: Seib KL, Dougan G, Rappuoli R (2009) The Key Role of Genomics in Modern Vaccine and Drug Design for Emerging Infectious Diseases. PLoS Genet 5(10): e1000612. doi:10.1371/journal.pgen.1000612

Editor: Nicholas J. Schork, University of California San Diego and The Scripps Research Institute, United States of America

Copyright: © 2009 Seib et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Funding: KLS is the recipient of an Australian NHMRC CJ Martin Fellowship. GD is supported by The Wellcome Trust. KLS and RR are employed by Novartis Vaccines. The funders had no role in the preparation of the article.

Competing Interests: KLS and RR are employed by Novartis Vaccines.


PLoS Genetics | www.plosgenetics.org 1 October 2009 | Volume 5 | Issue 10 | e1000612

Figure 1. Genomics-based approaches used in the control of EIDs from the outbreak of a disease to the development of a vaccine or drug. (A) The causative agent of a disease may first be identified from patient samples by using metagenomics. (B) Vaccine and therapeutic targets can be identified from the pathogen genome using a variety of screening approaches that focus on the genome, transcriptome, proteome, immunome or structural genome. (C) The human genome can be screened to avoid homologies or similarities with pathogen vaccine and therapeutic targets, or to identify new targets. (D) Once candidate vaccine and therapeutic targets have been identified they must be shown to provide protection against disease and to be safe for use in patients. (E) The clinically tested vaccine or therapeutic can then be licensed for use. The clinical responses of a vaccine and/or therapeutic can be analyzed using human genome based studies (dotted arrows). The pathogen genome can also be used to analyze mutants that are able to evade the immune system in vaccinated subjects or organisms that develop antibiotic resistance. Examples of the approaches indicated are given in Table 1. doi:10.1371/journal.pgen.1000612.g001


From the outbreak of a disease, metagenomics (the study of all the

genetic material recovered directly from a sample) can be applied to

diseased human samples to aid the rapid identification of the

causative agent [7,8]. Once the complete genome sequence of the

organism is available, high-throughput approaches can be used to

screen for target molecules, as outlined below and in Table 1 [9,10].

Screening approaches vary depending on the nature of the pathogen

but are based on several accepted principles and key requirements of

vaccines and therapeutics, including the need for targets to be (i)

expressed and accessible to the host immune system, or to a

therapeutic agent, during human disease; (ii) genetically conserved;

(iii) important for survival or pathogenesis; and (iv) free of measurable

homology or similarity to host factors. Although many of the

approaches described here focus on vaccine development, which

involves screening of candidates for immunogenicity, they are largely

applicable to drug development by altering the selection criteria used

and screening candidates against compound libraries [11–13].
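The four requirements (i) to (iv) listed above can be read as a simple filter over annotated candidates. The Python sketch below is only an illustration of that idea: the field names, the thresholds (`min_conservation`, `max_host_identity`), and the candidate records are hypothetical and are not taken from the article or from any real dataset.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str
    expressed_during_infection: bool          # criterion (i)
    conservation: float                       # criterion (ii): fraction of isolates carrying the gene
    needed_for_survival_or_virulence: bool    # criterion (iii)
    host_identity: float                      # criterion (iv): best fractional identity to a host protein


def passes_criteria(c, min_conservation=0.9, max_host_identity=0.3):
    """Return True if a candidate meets all four requirements.

    Both thresholds are arbitrary placeholders chosen for illustration.
    """
    return (
        c.expressed_during_infection
        and c.conservation >= min_conservation
        and c.needed_for_survival_or_virulence
        and c.host_identity <= max_host_identity
    )


# Made-up records, one failing each of criteria (ii), (iv), and (i).
candidates = [
    Candidate("orf_A", True, 0.98, True, 0.10),
    Candidate("orf_B", True, 0.55, True, 0.05),   # poorly conserved
    Candidate("orf_C", True, 0.99, True, 0.72),   # too similar to a host factor
    Candidate("orf_D", False, 0.97, True, 0.08),  # not expressed during disease
]

shortlist = [c.name for c in candidates if passes_criteria(c)]
print(shortlist)
```

For drug rather than vaccine development, the same structure would apply with different selection criteria, as the text notes; only the fields and thresholds would change.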

Reverse Vaccinology, Pan-genomics, and Comparative Genomics

The idea behind reverse vaccinology is to screen an entire

pathogen genome to find genes that encode proteins with the

attributes of good vaccine targets, such as bacterial

surface-associated proteins [14]. These proteins can then undergo

normal laboratory evaluation for immunogenicity. The Neisseria

meningitidis serogroup B (MenB) reverse vaccinology project provides

the “proof of concept” for this type of approach. This project

identified more novel vaccine candidates in 18 months than had

been discovered in 40 years of conventional vaccinology [15].

Analysis of the genome sequence of the virulent MenB strain MC58

found 2,158 predicted open reading frames (ORFs); these were

screened using bioinformatics tools to identify 570 ORFs that were

predicted to encode surface-exposed or secreted proteins that might

be accessible to the immune system [15]. Antigen screening

Box 1: Reverse Vaccinology Drives the Discovery of New Protein Functions

Reverse vaccinology involves the in silico screening of the entire genome of a pathogen to find genes that encode proteins with the attributes of good vaccine targets, using either the genome of a single pathogenic isolate or the pan-genome (the genomic information from several isolates) of a pathogenic species.

Pili in pathogenic streptococci play a key role in virulence and are promising vaccine candidates

The identification of pili (long filamentous structures that extend from the bacterial surface) in the main pathogenic strains of streptococci is a good example of how genomics can lead to the discovery of protein functions and increased understanding of host–pathogen interactions. The pili of gram-negative bacteria are well-described virulence factors. Little was known, however, about pili in gram-positive bacteria before the sequencing and analysis of the genomes of S. pyogenes, S. agalactiae, and S. pneumoniae (reviewed in [72]).

During analysis of eight S. agalactiae genome sequences, three protective antigens identified by pan-genomic reverse vaccinology [20] were found to contain LPXTG motifs typical of cell wall-anchored proteins and seen to assemble into pili [73]. Further bioinformatics analysis revealed three independent loci that encode structurally distinct pilus types, each of which contains two surface-exposed antigens capable of eliciting protective immunity in mice [75]. Because of the limited variability of S. agalactiae pili, it has been suggested that a combination of only three pilin subunits could lead to broad protective immunity [74].

Following the identification of S. agalactiae pili, typical pilus regions were identified in the available S. pyogenes genomes based on the presence of genes encoding LPXTG-containing proteins. In addition, a combination of recombinant pilus proteins was shown to confer protection in mice against mucosal challenge with virulent S. pyogenes isolates [75]. Falugi and colleagues have since found that S. pyogenes pili are encoded by nine different gene clusters, and they estimate that a vaccine comprising a combination of 12 backbone variants could provide protection against over 90% of circulating S. pyogenes strains [76].

The availability of multiple complete genome sequences for S. pneumoniae, and the increased understanding of pilus proteins in other pathogenic streptococci, led to the discovery of two pilus “islands” that encode proteins that play a role in adherence to lung epithelial cells and colonization in a murine model of infection, where they elicit host inflammatory responses [77,78]. In addition, the pilus subunits confer protection in passive and active immunization models [79]. The presence of pili that contain protective antigens in all three principal streptococcal pathogens indicates that these structures play an important role in virulence.

Reverse vaccinology leads to identification of the fHBP and its role in meningococcal species specificity

Serogroup B N. meningitidis (MenB) strains are responsible for the majority of meningococcal disease in the developed world, yet there is no comprehensive MenB vaccine available. Screening of the MenB genome for vaccine candidates by using reverse vaccinology led to the discovery of the meningococcal factor H-binding protein (fHBP) [15], which was recently suggested to play an important role in the species specificity of N. meningitidis [80]. fHBP is a component of the Novartis multivalent MenB vaccine that entered Phase III clinical testing in 2008 [16,17] and is also under investigation by Wyeth Vaccines (designated LP2086) [81] and other groups [82]. Initially identified as the genome-derived Neisseria antigen 1870 (GNA1870), a Neisseria-specific putative surface lipoprotein of unknown function, fHBP was renamed because of its ability to bind complement factor H (fH), a molecule that down-regulates activation of the complement alternative pathway. Hence, binding of fH to the surface of Neisseria allows the pathogen to evade complement-mediated killing by the innate immune system [83]. fHBP is expressed by all N. meningitidis strains studied [84]. It induces high levels of bactericidal antibodies in mice [16] and is important for survival of bacteria in human serum and blood [83,85,86]. The discovery that binding of fH to N. meningitidis is specific for human fH, and that human fH alone is able to down-regulate complement activation and bactericidal activity leading to increased bacterial survival has significant implications for the study of this organism [80]. The administration of human fH to infant rats challenged with MenB led to a greater than 10-fold increase in survival of bacteria [80], providing an important insight into host–pathogen interactions that may lead to the development of new animal models of infection.


continued on the basis of several criteria: the ability of antigens to be

expressed in Escherichia coli as recombinant proteins (350 candidates);

confirmation by ELISA and flow cytometry that the antigen is

exposed on the cell surface (91 candidates); the ability of induced

antibodies to elicit killing, as measured by serum bactericidal assay

and/or passive protection in infant rat assays (28 candidates); and

screening of a panel of diverse meningococcal isolates to determine

whether the antigens are conserved. This approach resulted in the

development of a multi-component recombinant MenB vaccine

that entered Phase III clinical trials in 2008 [16,17].
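The successive criteria described above can be pictured as a simple funnel. The following Python sketch is illustrative only: the `Candidate` record, its boolean fields, and the `screening_funnel` function are hypothetical names introduced here, and each field stands in for the outcome of a laboratory assay rather than anything computed from sequence.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """Hypothetical record of assay outcomes for one predicted antigen."""
    name: str
    expressed_in_ecoli: bool   # recombinant expression succeeded
    surface_exposed: bool      # confirmed by ELISA / flow cytometry
    bactericidal: bool         # serum bactericidal assay or passive protection
    conserved: bool            # present across a panel of diverse isolates


def screening_funnel(candidates):
    """Apply the filters in order; return survivors and the count after each stage."""
    stages = [
        ("expressed in E. coli", lambda c: c.expressed_in_ecoli),
        ("surface exposed", lambda c: c.surface_exposed),
        ("bactericidal or protective", lambda c: c.bactericidal),
        ("conserved across isolates", lambda c: c.conserved),
    ]
    remaining = list(candidates)
    counts = []
    for label, keep in stages:
        remaining = [c for c in remaining if keep(c)]
        counts.append((label, len(remaining)))
    return remaining, counts
```

Applied to the MenB screen, a funnel of this kind would report the shrinking pool at each stage (350, 91, and 28 candidates in the text above).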

As multiple genome sequences become available for a single

species, the concept of pan-genomic reverse vaccinology is

Table 1. Approaches to identify vaccine and/or drug targets against EIDs in the genomic era.

| Approach | Methods Used | Limitations of Method | Example Organism | Disease |
|---|---|---|---|---|
| Genomics/reverse vaccinology: Analysis of the genetic material of an organism in order to identify the repertoire of protein antigens/drug targets the organism has the potential to express. | Bioinformatics screening of the genome sequence to identify ORFs predicted to be exposed on the surface of the pathogen or secreted, expression of recombinant proteins, generation of antibodies in mice to confirm surface exposure, and bactericidal activity [14]. | Prediction algorithms need to be validated. Non-protein antigens including polysaccharides or glycolipids, and post-translational modifications cannot be identified. High-throughput cloning and protein expression is required. | Serogroup B N. meningitidis [15,16] | Major cause of septicemia and meningitis in the developed world. |
| Pan-genomics: Analysis of the genetic material of several organisms of a single species to identify conserved antigens/targets and ensure the chosen target covers the diversity of the organism. | Similar to above, but ORFs are chosen by screening of multiple genomes with either direct sequencing or comparative genome hybridization [18]. | Sequences of multiple isolates of a species are required. Similar limitations as described above. | S. agalactiae [20] | Leading cause of neonatal bacterial sepsis, pneumonia, and meningitis in the US and Europe. |
| Comparative genomics: Analysis of the genetic material of several individuals of a single species, to identify antigens/targets that are present in pathogenic strains but absent in commensal strains, and thus important for disease. | Similar to pangenomics, but ORFs are chosen by screening of genomes from multiple strains of pathogenic and commensal strains of a species [18,21]. | Similar limitations as for the above two approaches. | E. coli [22] | Major cause of mild to severe diarrhea, hemolytic-uremic syndrome, and urinary tract infections. |
| Transcriptomics: Analysis of the set of RNA transcripts expressed by an organism under a specified condition. | Gene expression is evaluated in vitro or in vivo using DNA microarrays or cDNA sequencing [24]. | There is no direct correlation between the levels of mRNA and protein. In vivo studies require relatively large amounts of mRNA. | V. cholerae [26] | Causes diseases ranging from self-limiting to severe, life-threatening diarrhea, wound infections, and sepsis. |
| Functional genomics: Analysis of the role of genes and proteins in order to identify genes required for survival under specific conditions. | Genes that are functionally essential in specific conditions in vitro or in vivo are determined by gene inhibition followed by screening of mutants in animal models or cell culture to identify attenuated clones [87]. | Genetic tools, acceptance of transposons, and natural competence of the pathogen are required. | H. pylori [32] | Major cause of duodenal and gastric ulcers and stomach cancer as a result of chronic low-level inflammation of the stomach lining. |
| Proteomics: Analysis of the set of proteins expressed by an organism under a specified condition and/or in specific cellular locations (e.g., on the cell surface). | 2D-PAGE, MS, and chromatographic techniques to identify proteins from whole cells, fractionated samples, or the cell surface [34]. | Proteins with low abundance and/or solubility and proteins that are only expressed in vivo may not be identified. | S. pyogenes [36] | Cause of a range of diseases from mild pharyngitis to severe toxic shock syndrome, necrotizing fasciitis, and rheumatic fever. |
| Immunomics: Analysis of the subset of proteins/epitopes that interact with the host immune system. | Analysis of seroreactive proteins, using 2D-PAGE, phage display libraries, or protein microarrays, probed with host sera [38]. Bioinformatics prediction of B cell and T cell epitopes [37]. | Potential bias against sequences that cannot be displayed. Large conformational epitopes made up of noncontiguous amino acids may not be detected. Prediction of B cell epitopes is difficult due to the need to identify conformational epitopes. | S. aureus [39] | Cause of wound infections. Has emerged as a significant opportunistic pathogen due to antibiotic resistance. |
| Structural genomics: Analysis of the three-dimensional structure of an organism’s proteins and how they interact with antibodies or therapeutics. | NMR or crystallography to determine the structure of proteins in the presence/absence of antibodies or therapeutics [51]. | Poor understanding of determinants of immunogenicity, immunodominance, and structure-function relationships. | HIV [53] | Causative agent of AIDS. |
| Vaccinomics/immunogenetics/pharmacogenetics: Analysis of how the human immune system responds to a vaccine or drug. | Investigation of genetic heterogeneity/polymorphisms in the host, at the individual or population level, that may alter immune responses to vaccines [68] or metabolism of therapeutics [71]. | Ethical issues of “personalized” medicine. Immense diversity of the human genome and, in particular, in the human immune response. | Mumps virus [69] | Cause of disease ranging from self-limiting parotid inflammation to epididymo-orchitis, meningitis, and encephalitis. |

doi:10.1371/journal.pgen.1000612.t001


emerging as a powerful tool to identify vaccine candidates in

antigenically diverse species [18]. Pan-genomics aims to identify

the full complement of genes in a species, based on the superset of

genes in several strains of the same species. Analysis of the genome

sequences of eight Streptococcus agalactiae (also known as group B

streptococcus) strains revealed substantial genetic heterogeneity

and the extended gene repertoire of the species [19]. Screening

found a total of 589 genes predicted to encode surface-exposed or

secreted proteins in the S. agalactiae pan-genome (396 from the

“core genome,” i.e., genes conserved in all strains, and 193 from the

“dispensable genome,” i.e., genes that are absent from one or more

strains and are hence considered dispensable for survival). Based

on further screening of this pool of candidates, including the ability

of recombinant proteins to provide protection when used to

immunize animals, a combination of four antigens—only one of

which is in the core genome—was selected and shown to confer

protection against a panel of S. agalactiae strains [20].
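The core/dispensable split can be expressed with ordinary set operations. The sketch below is a minimal illustration, not the procedure used in [19,20]: the function name and the input layout (a mapping from strain name to a set of gene-family identifiers) are assumptions introduced here, and a gene is treated as dispensable whenever it is missing from at least one strain.

```python
def partition_pan_genome(strain_genes):
    """Split the pan-genome of a species into core and dispensable gene sets.

    strain_genes: dict mapping strain name -> set of gene-family identifiers
    (hypothetical input; real pipelines first cluster orthologous genes).
    """
    gene_sets = list(strain_genes.values())
    if not gene_sets:
        raise ValueError("at least one strain is required")
    pan = set().union(*gene_sets)           # every gene seen in any strain
    core = set.intersection(*gene_sets)     # genes shared by all strains
    dispensable = pan - core                # genes missing from >= 1 strain
    return pan, core, dispensable
```

Restricting the resulting sets to genes predicted to encode surface-exposed or secreted proteins would give the kind of candidate pool described above.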

Whereas genome sequencing projects have typically focused on

pathogenic organisms, comparison of the genomes of pathogenic and

nonpathogenic strains allows vaccine and drug targets to be identified

on the basis of proteins that are specifically involved in pathogenesis

[21]. Comparative studies of up to 17 commensal and pathogenic E.

coli genomes identified genes unique to certain pathogenic strains that

are largely absent in commensal strains. This filter decreases the pool of

targets to be screened and potentially limits any detrimental effects of

therapeutics on the composition of the commensal flora [22].
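As a rough illustration of this filter, the following Python sketch keeps genes found in at least one pathogenic genome and in no more than a chosen fraction of commensal genomes. The function name, the input dictionaries, and the `max_commensal_fraction` parameter are hypothetical; the analysis in [22] is considerably more involved.

```python
def pathogen_associated_genes(pathogenic, commensal, max_commensal_fraction=0.0):
    """Return genes present in pathogenic strains but (largely) absent in commensals.

    pathogenic, commensal: dicts mapping strain name -> set of gene identifiers.
    max_commensal_fraction: tolerated fraction of commensal strains carrying a gene.
    """
    commensal_sets = list(commensal.values())
    candidates = set().union(*pathogenic.values())
    selected = set()
    for gene in candidates:
        carriers = sum(gene in genes for genes in commensal_sets)
        fraction = carriers / len(commensal_sets) if commensal_sets else 0.0
        if fraction <= max_commensal_fraction:
            selected.add(gene)
    return selected
```

Raising `max_commensal_fraction` above zero corresponds to accepting genes that are "largely" rather than completely absent from commensal strains.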

New sequencing technologies will also open up opportunities for

monitoring pathogen vaccine escape by screening for evidence of

immune selection in the genomes of pathogen populations before

and after vaccine selection. By deep-sequencing of bacterial and

viral populations it will be possible to identify antigens under

immune selection by monitoring the clustering of single nucleotide

polymorphisms (SNPs) and other mutations that affect protein

sequence. This approach has already been used to search for

evidence of antigenic variation/selection in populations of

Salmonella enterica serovar Typhi [23], where variation is extremely

limited. Similar sequencing strategies could be applied to

populations of bacteria taken before or after a vaccine trial in a

particular geographical region.
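One crude way to look for such clustering is to compare the density of protein-changing SNPs across genes. The sketch below is a heuristic under stated assumptions, not a formal test for selection and not the analysis performed in [23]: the input format, the function names, and the `fold` threshold relative to the median are all introduced here for illustration.

```python
from collections import Counter
from statistics import median


def nonsynonymous_density(snps, gene_lengths):
    """Protein-changing SNPs per kilobase for each gene.

    snps: iterable of (gene, changes_protein) pairs from a variant caller.
    gene_lengths: dict mapping gene -> length in base pairs.
    """
    counts = Counter(gene for gene, changes_protein in snps if changes_protein)
    return {gene: 1000.0 * counts[gene] / length
            for gene, length in gene_lengths.items()}


def flag_outlier_genes(density, fold=3.0):
    """Genes whose SNP density exceeds `fold` times the median density."""
    if not density:
        return []
    typical = median(density.values())
    return sorted(gene for gene, d in density.items()
                  if d > 0 and d > fold * typical)
```

Genes flagged in this way would only be candidates for follow-up; distinguishing immune selection from other causes of variation requires population-genetic analysis.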

Beyond Genomics: Other -Omics Approaches to Study Pathogens

Pathogen genes that are up-regulated during infection and/or

essential for microorganism survival or pathogenesis can be

identified by using transcriptomics, i.e., the analysis of a near

complete set of RNA transcripts expressed by the pathogen under

a specified condition. Comprehensive DNA-based microarray

chips (probed with cDNA generated from RNA by reverse

transcription) [24] and ultra-high-throughput sequencing technologies

that allow rapid sequencing and direct quantification of

cDNA [25] enable the transcriptome of a pathogen to be

characterized and particular types of gene product to be identified.

For example, genes involved in the hyperinfectious state of Vibrio

cholerae, which appears after passage through the human

gastrointestinal tract, were identified through a comparison of

the transcriptome of bacteria isolated directly from stool samples

of cholera patients with that of V. cholerae grown in vitro [26].

Similarly, analysis of the transcription profile of M. tuberculosis

during early infection in immune-competent (BALB/c) and severe

combined immunodeficient (SCID) mice revealed a set of 67 genes

activated exclusively in response to the host immune system [27].
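Comparisons of this kind usually reduce to a per-gene fold change between two conditions. The following sketch assumes normalized expression values are already available for each condition; the function names, the pseudocount, and the threshold are illustrative choices, and real analyses add replicates and statistical testing.

```python
import math


def log2_fold_changes(in_vivo, in_vitro, pseudocount=1.0):
    """log2(in vivo / in vitro) for genes measured in both conditions.

    in_vivo, in_vitro: dicts mapping gene -> normalized expression value.
    The pseudocount avoids division by zero for unexpressed genes.
    """
    shared = in_vivo.keys() & in_vitro.keys()
    return {gene: math.log2((in_vivo[gene] + pseudocount) /
                            (in_vitro[gene] + pseudocount))
            for gene in shared}


def upregulated(fold_changes, threshold=1.0):
    """Genes whose log2 fold change meets the threshold (1.0 = two-fold)."""
    return sorted(gene for gene, lfc in fold_changes.items() if lfc >= threshold)
```

A list produced this way is a starting point for biological validation, bearing in mind that mRNA levels do not map directly onto protein levels.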

Functional genomics (linking genotype, through transcriptomics

and proteomics, to phenotype) has been applied to many

pathogens to identify genes essential to survival or virulence that

may be valid vaccine candidates. DNA microarrays can be used to

screen comprehensive libraries of pathogen mutants by comparing

bacterial isolates from before and after passage through animal

models or exposure to compound libraries to identify attenuated

clones [28–30]. For example, these methods have been used to

identify 65 novel MenB genes that are required for the pathogen to

cause septicemia in infant rats [31], 47 genes essential for H. pylori

gastric colonization of the gerbil [32], and genes contributing to

M. tuberculosis persistence in the host [33].
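The before/after comparison behind such screens can be sketched as a ratio test on per-mutant signal. The code below is a simplified illustration: the function name, the signal dictionaries, and both thresholds are hypothetical and do not reproduce the scoring used in [28–33].

```python
def attenuated_mutants(input_signal, output_signal, min_input=100.0, max_ratio=0.1):
    """Mutants well represented in the inoculum but depleted after passage.

    input_signal, output_signal: dicts mapping mutant id -> array or read signal
    measured before and after passage through the animal model.
    """
    attenuated = []
    for mutant, before in input_signal.items():
        if before < min_input:
            continue  # too little signal in the inoculum to judge
        after = output_signal.get(mutant, 0.0)  # absent after passage -> 0
        if after / before <= max_ratio:
            attenuated.append(mutant)
    return sorted(attenuated)
```

The `min_input` guard reflects a practical point: a mutant that was rare in the inoculum cannot be called attenuated just because it is missing afterwards.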

Analysis of a pathogen’s proteome (the near complete set of

proteins expressed under a specified condition) to reveal potential

vaccine and drug candidates can add significant value to in silico

approaches [34]. High-throughput proteomic analyses can be

performed by using mass spectrometry (MS), chromatographic

techniques, and protein microarrays [35]. A novel proteome-based

approach has been applied to identify the surface proteins of group A Streptococcus (GAS; S. pyogenes)

by making use of proteolytic enzymes to ‘‘shave’’ the bacterial

surface, releasing exposed proteins and partially exposed peptides.

Seventeen surface proteins of a virulent GAS strain were identified

in this way by using MS and genome sequence analysis. Their

location on the pathogen surface was confirmed by flow

cytometry, and one of them provided protective immunity in a

mouse model of the disease [36].

The proteome of a pathogen can also be screened to identify the

immunome (the near complete set of pathogen proteins or

epitopes that interact with the host immune system) using in vitro

or in silico techniques [37,38]. In vitro identification and screening

of the immunome are based on the idea that antibodies present in

serum from a host, which has been exposed to a pathogen,

represent a molecular “imprint” of the pathogen’s immunogenic

proteins and can be used to identify vaccine candidates. As such,

several techniques have been developed to allow the high-throughput

display of pathogen proteins, and the subsequent

screening for proteins that interact with antibodies in sera.

Immunogenic surface proteins of several organisms have been

identified, including S. aureus using 2D-PAGE, membrane blotting,

and MS [39]; S. agalactiae, S. pyogenes, and Streptococcus pneumoniae

using phage- or E. coli-based comprehensive genomic peptide

expression libraries [38,40]; and Francisella tularensis (the causative

agent of tularemia or rabbit fever) [41] and V. cholerae using protein

microarray chips [42]. Protein microarrays, in which proteins

from the pathogen are spotted onto a microarray chip, can also

be used to characterize protein–drug interactions, as well as

other protein–protein, protein–nucleic acid, ligand–receptor, and

enzyme–substrate interactions [43].

The ability to predict in silico which pathogen epitopes will be

recognized by B cells or T cells has greatly improved in recent

years [44]. Large-scale screening of pathogens including HIV,

Bacillus anthracis, M. tuberculosis, F. tularensis, Yersinia pestis (the

causative agent of bubonic plague), flaviviruses, and influenza for

B cell and T cell epitopes is currently underway [45,46]. Although

epitope prediction is not foolproof, it can serve as a guide for

further biological evaluation. T cell epitopes are presented by

MHC/HLA proteins on the surface of antigen-presenting cells,

which vary considerably between hosts, complicating the task of

functional epitope prediction. Additionally, B cell epitopes can be

both linear and conformational. The ultimate aim of researchers

in this field of study would be to engineer a single peptide that

represents defined epitope combinations from a protein or

organism, enabling the genetic variability of both pathogen and

host to be overcome [44].

Structural genomics—the study of the three-dimensional

structures of the proteins produced by a species—is increasingly


being applied to vaccine and drug development as a result of the

explosion of genome and proteome data, and continuing

improvements in the fields of protein expression, purification,

and structural determination [47]. The structure-based design of

antiviral therapeutics has led to the development of drugs directed

at the active sites of the HIV-1 protease [48] and influenza

neuraminidase [49]. More than 45,000 high-resolution protein

structures are available in public databases (see

http://www.wwpdb.org/stats.html), and several initiatives have been established

to pursue high-throughput characterization of protein

structures on a genome-wide scale [50], focusing on determining

and understanding the structural basis of immune-dominant and

immune-recessive antigens as well as protein active sites and

potential drug-binding sites [51,52]. For example, structural

characterization of the HIV envelope proteins gp120 and gp41

has revealed mechanisms used by the virus to evade host antibody

responses, many of which involve hypervariability in immunodominant

epitopes [53,54]. Based on this information, immune

refocusing (e.g., by retargeted glycosylation, deletion, and/or

substitution of amino acids) has been used to dampen the response

to variable immunodominant epitopes of the envelope glycoprotein

gp160, enabling the host to respond to previously subdominant

epitopes [55]. High-throughput modification of proteins and

their screening for immunogenicity and interaction with antimicrobials

is predicted to become more common as techniques

evolve [51].

The Contribution of Human Genomics

When designing new vaccines, one important consideration is

the risk that the vaccine might generate “self” immune reactions

against host epitopes; immune responses against a pathogen

antigen can cross-react with host antigens if homologies exist in the

primary amino acid sequence or structure, potentially leading to

damage to the host tissue [56]. Drugs aimed at pathogen targets

could also theoretically target similar host molecules. The

availability of the human genome sequence combined with

methods for predicting B cell and T cell epitopes will facilitate

screening for the presence of homologies between candidate

microbial vaccine antigens and proteins in humans, enabling issues

of autoimmunity and cross-reactivity to be tackled [57]. As such,

vaccine or drug targets identified using methods based on

pathogen genomics should be screened for homology or similarity

to human proteins in silico, using programs such as BLAST (Basic

Local Alignment Search Tool; http://blast.ncbi.nlm.nih.gov/Blast.cgi)

to query human genome databases. Interestingly,

analysis of 30 viral genomes revealed that around 90% of viral

pentapeptides, which could be components of epitopes, are

identical to human peptides [58]. There is little homology,

however, between validated immunogenic disease-associated

peptides/epitopes and host peptides [57,59], suggesting that

screening approaches that include prediction of immunogenicity

could improve the pool of target candidates.
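The pentapeptide comparison can be illustrated with exact k-mer matching. This sketch is not a substitute for BLAST and is not the method of [58]: it only counts identical peptides of length `k`, and the function names and toy sequences are assumptions made for illustration.

```python
def kmers(sequence, k=5):
    """All overlapping peptides of length k in a protein sequence."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def shared_kmer_fraction(candidate, host_proteins, k=5):
    """Fraction of a candidate antigen's k-mers found exactly in host proteins.

    Returns (fraction, shared_kmers). Exact matching only; similar but
    non-identical peptides are not detected.
    """
    host = set()
    for protein in host_proteins:
        host |= kmers(protein, k)
    own = kmers(candidate, k)
    if not own:
        return 0.0, set()
    shared = own & host
    return len(shared) / len(own), shared
```

As noted above, widespread sharing of short peptides does not by itself predict cross-reactivity, so a score like this would be most useful alongside a prediction of immunogenicity.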

It is important to keep in mind that we do not fully understand

how self-tolerance is broken, so we currently have no perfect way

of predicting all potential autoimmune triggers that could be

associated with vaccination. While many links have been made

between autoimmune disease and vaccination, they have been

confirmed in only a small number of cases (reviewed in [60]). For

example, treatment-resistant Lyme arthritis is associated in certain

patients with immune reactivity to the outer surface protein A

(OspA) of the causative agent of Lyme disease, Borrelia burgdorferi,

and an OspA epitope (OspA165–173) has homology to the human

lymphocyte function-associated antigen (hLFA)-1αL [61]. As a

result, the OspA-based Lyme disease vaccine (LYMErix) was taken

off the market in 2002, but a recombinant OspA lacking the

potentially autoreactive T cell epitope has been proposed as a

replacement vaccine [62].

Rather than targeting drugs to pathogen enzymes, an

alternative approach has focused on targeting the host-cell proteins

that are exploited by pathogens for replication and survival. The

use of techniques including microarray-based analysis of virus-

induced host gene expression has revealed several possible targets

[63,64]. The cholesterol-lowering drugs statins, for example, have

an anti-HIV effect that is believed to be mediated by preventing

activation of the host protein Rho, which is activated by the HIV

envelope protein and required for virus entry to the cell [65].

Furthermore, such studies can improve our understanding of the

host immune responses that protect against a pathogen (i.e.,

innate, antibody, Th1, or Th2 responses), which will aid the

selection of appropriate vaccine adjuvants. For example, induction

of interferon signaling early in infection may be critical to confer

protection against SARS-CoV, as determined from functional

genomic studies of early host responses to SARS-CoV infection in

the lungs of macaques [66].

Many of the genes of the human immune system are highly

polymorphic, which enables the population as a whole to generate

sufficient immunological diversity to combat EIDs. This variation

also affects the outcome of vaccination and treatment. The

International HapMap Project has identified over 3.1 million

SNPs in 270 individuals [67] and the 1000 Genomes Project aims

to identify even more genetic variants. The field of vaccinomics

(also called immunogenetics) investigates heterogeneity in host

genetic markers that results in variations in vaccine-induced

immune responses, with the aim of predicting and minimizing

vaccine failures or adverse events [68]. For example, polymorphisms

of HLA and immunoregulatory cytokine receptor genes

are associated with variable outcomes of vaccination against

mumps [69]. Similarly, pharmacogenetics, which investigates

genetic differences in the way individuals metabolize therapeutics,

has found that human variability in the speed of metabolism of the

common first-line tuberculosis drug isoniazid is associated with

genetic variants, including SNPs, in the gene encoding arylamine

N-acetyltransferase (NAT2) [70,71]. The ability to predict an

individual’s response to a vaccine or drug may eventually allow

physicians to determine whether a patient is genetically susceptible

to a disease, the possible adverse effects of a vaccine or drug, and

the appropriate schedule or dose to use.


We predict that genomics will greatly aid the control of EIDs

because of the increased efficiency with which vaccine and

therapeutic targets can be identified using the genome-based

approaches described above. Furthermore, we anticipate the

continual refinement and development of novel genome-based

approaches as sequencing becomes faster and more affordable.

Several challenges remain, however, in the identification of these

targets and in the processes needed to bring a new vaccine or drug

to the market. Understanding the molecular nature of epitopes,

the mechanisms of action of adjuvants, and T cell and mucosal

immunity are key priorities to be tackled in the coming years [3].

These issues can be addressed by improved structural studies of

antigen epitopes and the compilation of databases containing

information on structure, immunogenicity, and in silico B cell and

T cell epitope predictions. Genome-based development of effective

vaccines and therapeutics is still largely dependent on the

availability of valid models to measure efficacy and protection


against disease; however, the increased understanding of microbial

pathogenesis that is emerging from genomics should greatly aid in

this respect. Likewise, the continued development of animal

models with knockout and allele-specific mutations in key

components of the immune response will greatly increase

understanding of the type of immune response needed to control

disease and the ways in which the immune system can be

programmed to protect the host against disease. Unfortunately,

the stepwise series of prelicensure clinical trials (Phase I, II, and III)

that are required to document the safety, immunogenicity, and

efficacy of a vaccine are still highly time-consuming and costly. We

can only hope that the increasingly “smart” identification and

design of targets, and the fresh impetus given to the fields of

vaccine and drug development by the arrival of genomics, will

enable increased success of those vaccines and drugs that do make

it into clinical development.

References

1. Dong J, Olano JP, McBride JW, Walker DH (2008) Emerging pathogens: Challenges and successes of molecular diagnostics. J Mol Diagn 10: 185–197.

2. Yang X, Yang H, Zhou G, Zhao GP (2008) Infectious disease in the genomic era. Annu Rev Genomics Hum Genet 9: 21–48.

3. Rappuoli R (2007) Bridging the knowledge gaps in vaccine design. Nat

Biotechnol 25: 1361–1366.

4. Fleischmann RD, Adams MD, White O, Clayton RA, Kirkness EF, et al. (1995)

Whole-genome random sequencing and assembly of Haemophilus influenzae Rd. Science 269: 496–512.

5. Casanova JL, Abel L (2007) Human genetics of infectious diseases: A unified theory. EMBO J 26: 915–922.

6. Burgner D, Jamieson SE, Blackwell JM (2006) Genetic susceptibility to infectious

diseases: Big is beautiful, but will bigger be even better? Lancet Infect Dis 6: 653–663.

7. Nakamura S, Yang CS, Sakon N, Ueda M, Tougan T, et al. (2009) Direct metagenomic detection of viral pathogens in nasal and fecal specimens using an

unbiased high-throughput sequencing approach. PLoS ONE 4: e4219.

doi:10.1371/journal.pone.0004219.

8. Bittar F, Richet H, Dubus JC, Reynaud-Gaubert M, Stremler N, et al. (2008)

Molecular detection of multiple emerging pathogens in sputa from cystic fibrosis patients. PLoS ONE 3: e2908. doi:10.1371/journal.pone.0002908.

9. Rinaudo CD, Telford JL, Rappuoli R, Seib KL (2009) Vaccinology in the genome era. J Clin Invest 119: 2515–2525.

10. Kaushik DK, Sehgal D (2008) Developing antibacterial vaccines in genomics

and proteomics era. Scand J Immunol 67: 544–552.

11. Pucci MJ (2007) Novel genetic techniques and approaches in the microbial

genomics era: identification and/or validation of targets for the discovery of new antibacterial agents. Drugs R D 8: 201–212.

12. Mills SD (2006) When will the genomics investment pay off for antibacterial discovery? Biochem Pharmacol 71: 1096–1102.

13. Van Voorhis WC, Hol WGJ, Myler PJ, Stewart LJ (2009) The role of medical

structural genomics in discovering new drugs for infectious diseases. PLoS Comput Biol 5(10): e530. doi:10.1371/journal.pcbi.1000530.

14. Masignani V, Rappuoli R, Pizza M (2002) Reverse vaccinology: A genome-based approach for vaccine development. Expert Opin Biol Ther 2: 895–905.

15. Pizza M, Scarlato V, Masignani V, Giuliani MM, Arico B, et al. (2000) Identification of vaccine candidates against serogroup B meningococcus by

whole-genome sequencing. Science 287: 1816–1820.

16. Giuliani MM, Adu-Bobie J, Comanducci M, Arico B, Savino S, et al. (2006) A universal vaccine for serogroup B meningococcus. Proc Natl Acad Sci U S A

103: 10834–10839.

17. Rappuoli R (2008) The application of reverse vaccinology, Novartis MenB

vaccine developed by design. 16th International Pathogenic Neisseria Conference,

Rotterdam, The Netherlands: http://www.IPNC2008.org. Abstr. 81 p.

18. Muzzi A, Masignani V, Rappuoli R (2007) The pan-genome: Towards a

knowledge-based discovery of novel targets for vaccines and antibacterials. Drug Discov Today 12: 429–439.

19. Tettelin H, Masignani V, Cieslewicz MJ, Donati C, Medini D, et al. (2005) Genome analysis of multiple pathogenic isolates of Streptococcus agalactiae: implications for the

microbial “pan-genome.” Proc Natl Acad Sci U S A 102: 13950–13955.

20. Maione D, Margarit I, Rinaudo CD, Masignani V, Mora M, et al. (2005) Identification of a universal Group B streptococcus vaccine by multiple genome

screen. Science 309: 148–150.

21. Bhagwat AA, Bhagwat M (2008) Methods and tools for comparative genomics of

foodborne pathogens. Foodborne Pathog Dis 5: 487–497.

22. Rasko DA, Rosovitz MJ, Myers GS, Mongodin EF, Fricke WF, et al. (2008) The

pangenome structure of Escherichia coli: Comparative genomic analysis of E. coli

commensal and pathogenic isolates. J Bacteriol 190: 6881–6893.

23. Holt KE, Parkhill J, Mazzoni CJ, Roumagnac P, Weill FX, et al. (2008) High-throughput

sequencing provides insights into genome variation and evolution in Salmonella typhi. Nat Genet 40: 987–993.

24. Dhiman N, Bonilla R, O’Kane DJ, Poland GA (2001) Gene expression microarrays: A 21st century tool for directed vaccine design. Vaccine 20: 22–30.

25. Morozova O, Marra MA (2008) Applications of next-generation sequencing

technologies in functional genomics. Genomics 92: 255–264.

26. Merrell DS, Butler SM, Qadri F, Dolganov NA, Alam A, et al. (2002) Host-induced

epidemic spread of the cholera bacterium. Nature 417: 642–645.

27. Talaat AM, Lyons R, Howard ST, Johnston SA (2004) The temporal expression

profile of Mycobacterium tuberculosis infection in mice. Proc Natl Acad Sci U S A

101: 4602–4607.

28. Scarselli M, Giuliani MM, Adu-Bobie J, Pizza M, Rappuoli R (2005) The

impact of genomics on vaccine design. Trends Biotechnol 23: 84–91.

29. Saenz HL, Dehio C (2005) Signature-tagged mutagenesis: technical advances in

a negative selection method for virulence gene identification. Curr Opin


30. Sakata T, Winzeler EA (2007) Genomics, systems biology and drug development

for infectious diseases. Mol Biosyst 3: 841–848.

31. Sun YH, Bakshi S, Chalmers R, Tang CM (2000) Functional genomics of

Neisseria meningitidis pathogenesis. Nat Med 6: 1269–1273.

32. Kavermann H, Burns BP, Angermuller K, Odenbreit S, Fischer W, et al. (2003)

Identification and characterization of Helicobacter pylori genes essential for gastric

colonization. J Exp Med 197: 813–822.

33. Sassetti CM, Boyd DH, Rubin EJ (2003) Genes required for mycobacterial

growth defined by high density mutagenesis. Mol Microbiol 48: 77–84.

34. Zhu H, Bilgin M, Snyder M (2003) Proteomics. Annu Rev Biochem 72:

783–812.

35. Grandi G (2006) Genomics and proteomics in reverse vaccines. Methods

Biochem Anal 49: 379–393.

36. Rodriguez-Ortega MJ, Norais N, Bensi G, Liberatori S, Capo S, et al. (2006)

Characterization and identification of vaccine candidate proteins through

analysis of the group A Streptococcus surface proteome. Nat Biotechnol 24:

191–197.

37. De Groot AS, McMurry J, Moise L (2008) Prediction of immunogenicity: in

silico paradigms, ex vivo and in vivo correlates. Curr Opin Pharmacol 8:

620–626.

38. Meinke A, Henics T, Hanner M, Minh DB, Nagy E (2005) Antigenome

technology: A novel approach for the selection of bacterial vaccine candidate

antigens. Vaccine 23: 2035–2041.

39. Vytvytska O, Nagy E, Bluggel M, Meyer HE, Kurzbauer R, et al. (2002)

Identification of vaccine candidate antigens of Staphylococcus aureus by serological

proteome analysis. Proteomics 2: 580–590.

40. Giefing C, Meinke AL, Hanner M, Henics T, Bui MD, et al. (2008) Discovery of

a novel class of highly conserved vaccine antigens using genomic scale antigenic

fingerprinting of pneumococcus with human antibodies. J Exp Med 205:

117–131.

41. Eyles JE, Unal B, Hartley MG, Newstead SL, Flick-Smith H, et al. (2007)

Immunodominant Francisella tularensis antigens identified using proteome

microarray. Proteomics 7: 2172–2183.

42. Rolfs A, Montor WR, Yoon SS, Hu Y, Bhullar B, et al. (2008) Production and

sequence validation of a complete full length ORF collection for the pathogenic

bacterium Vibrio cholerae. Proc Natl Acad Sci U S A 105: 4364–4369.

43. Stoevesandt O, Taussig MJ, He M (2009) Protein microarrays: high-throughput

tools for proteomics. Expert Rev Proteomics 6: 145–157.

44. De Groot AS, Moise L, McMurry JA, Martin W (2008) Epitope-based immunome-derived

vaccines: a strategy for improved design and safety. In: Falus A, ed. Clinical

Applications of Immunomics. New York: Springer. pp 39–69.

45. Sette A, Fleri W, Peters B, Sathiamurthy M, Bui HH, et al. (2005) A roadmap

for the immunomics of category A-C pathogens. Immunity 22: 155–161.

46. De Groot AS, Rivera DS, McMurry JA, Buus S, Martin W (2008) Identification

of immunogenic HLA-B7 “Achilles’ heel” epitopes within highly conserved

regions of HIV. Vaccine 26: 3059–3071.

47. Lundstrom K (2007) Structural genomics and drug discovery. J Cell Mol Med

11: 224–238.

48. Kaldor SW, Kalish VJ, Davies JF, 2nd, Shetty BV, Fritz JE, et al. (1997)

Viracept (nelfinavir mesylate, AG1343): A potent, orally bioavailable inhibitor of

HIV-1 protease. J Med Chem 40: 3979–3985.

49. Kim CU, Lew W, Williams MA, Liu H, Zhang L, et al. (1997) Influenza

neuraminidase inhibitors possessing a novel hydrophobic interaction in the

enzyme active site: Design, synthesis, and structural analysis of carbocyclic sialic

acid analogues with potent anti-influenza activity. J Am Chem Soc 119: 681–690.

50. Todd AE, Marsden RL, Thornton JM, Orengo CA (2005) Progress of structural

genomics initiatives: An analysis of solved target structures. J Mol Biol 348:

1235–1260.

51. Dormitzer PR, Ulmer JB, Rappuoli R (2008) Structure-based antigen design: A

strategy for next generation vaccines. Trends Biotechnol 26: 659–667.

52. Nicola G, Abagyan R (2009) Structure-based approaches to antibiotic drug

discovery. Curr Protoc Microbiol Chapter 17: Unit 17.2.

53. Zhou T, Xu L, Dey B, Hessell AJ, Van Ryk D, et al. (2007) Structural definition

of a conserved neutralization epitope on HIV-1 gp120. Nature 445: 732–737.


54. Prabakaran P, Dimitrov AS, Fouts TR, Dimitrov DS, KuanTeh J (2007)

Structure and function of the HIV envelope glycoprotein as entry mediator, vaccine immunogen, and target for inhibitors. In: Advances in Pharmacology.

Academic Press. pp 33–97.

55. Tobin GJ, Trujillo JD, Bushnell RV, Lin G, Chaudhuri AR, et al. (2008) Deceptive imprinting and immune refocusing in vaccine design. Vaccine 26:

6189–6199.

56. Ercolini AM, Miller SD (2009) The role of infections in autoimmune disease.

Clin Exp Immunol 155: 1–15.

57. Amela I, Cedano J, Querol E (2007) Pathogen proteins eliciting antibodies donot share epitopes with host proteins: A bioinformatics approach. PLoS ONE 2:

e512. doi:10.1371/journal.pone.0000512.58. Kanduc D, Stufano A, Lucchese G, Kusalik A (2008) Massive peptide sharing

between viral and human proteomes. Peptides 29: 1755–1766.59. Kanduc D, Lucchese A, Mittelman A (2007) Non-redundant peptidomes from

DAPs: Towards ‘‘the vaccine’’? Autoimmun Rev 6: 290–294.

60. Wraith DC, Goldman M, Lambert PH (2003) Vaccination and autoimmunedisease: What is the evidence? Lancet 362: 1659–1666.

61. Gross DM, Forsthuber T, Tary-Lehmann M, Etling C, Ito K, et al. (1998)Identification of LFA-1 as a candidate autoantigen in treatment-resistant Lyme

arthritis. Science 281: 703–706.

62. Willett TA, Meyer AL, Brown EL, Huber BT (2004) An effective second-generation outer surface protein A-derived Lyme vaccine that eliminates a

potentially autoreactive T cell epitope. Proc Natl Acad Sci U S A 101: 1303–1308.63. Kellam P (2006) Attacking pathogens through their hosts. Genome Biol 7: 201.

64. Andeweg AC, Haagmans BL, Osterhaus AD (2008) Virogenomics: the virus-host interaction revisited. Curr Opin Microbiol 11: 461–466.

65. del Real G, Jimenez-Baranda S, Mira E, Lacalle RA, Lucas P, et al. (2004) Statins

inhibit HIV-1 infection by down-regulating Rho activity. J Exp Med 200: 541–547.66. de Lang A, Baas T, Teal T, Leijten LM, Rain B, et al. (2007) Functional

genomics highlights differential induction of antiviral pathways in the lungs ofSARS-CoV-infected macaques. PLoS Pathog 3: e112. doi:10.1371/journal.

ppat.0030112.

67. International HapMap Consortium (2007) A second generation humanhaplotype map of over 3.1 million SNPs. Nature 449: 851–861.

68. Poland GA, Ovsyannikova IG, Jacobson RM (2009) Application of pharmaco-genomics to vaccines. Pharmacogenomics 10: 837–852.

69. Ovsyannikova IG, Jacobson RM, Dhiman N, Vierkant RA, Pankratz VS, et al.(2008) Human leukocyte antigen and cytokine receptor gene polymorphisms

associated with heterogeneous immune responses to mumps viral vaccine.

Pediatrics 121: e1091–1099.70. Sim E, Lack N, Wang CJ, Long H, Westwood I, et al. (2008) Arylamine N-

acetyltransferases: Structural and functional implications of polymorphisms.Toxicology 254: 170–183.

71. Baudhuin LM, Langman LJ, O’Kane DJ (2007) Translation of pharmacoge-

netics into clinically relevant testing modalities. Clin Pharmacol Ther 82:373–376.

72. Telford JL, Barocchi MA, Margarit I, Rappuoli R, Grandi G (2006) Pili ingram-positive pathogens. Nat Rev Microbiol 4: 509–519.

73. Lauer P, Rinaudo CD, Soriani M, Margarit I, Maione D, et al. (2005) Genome

analysis reveals pili in Group B Streptococcus. Science 309: 105.

74. Margarit I, Rinaudo CD, Galeotti CL, Maione D, Ghezzo C, et al. (2009)

Preventing bacterial infections with pilus-based vaccines: The group B

streptococcus paradigm. J Infect Dis 199: 108–115.

75. Mora M, Bensi G, Capo S, Falugi F, Zingaretti C, et al. (2005) Group A

Streptococcus produce pilus-like structures containing protective antigens and

Lancefield T antigens. Proc Natl Acad Sci U S A 102: 15641–15646.

76. Falugi F, Zingaretti C, Pinto V, Mariani M, Amodeo L, et al. (2008) Sequence

variation in Group A Streptococcus pili and association of pilus backbone types

with Lancefield T serotypes. J Infect Dis 198: 1834–1841.

77. Barocchi MA, Ries J, Zogaj X, Hemsley C, Albiger B, et al. (2006) A

pneumococcal pilus influences virulence and host inflammatory responses. Proc

Natl Acad Sci U S A 103: 2857–2862.

78. Bagnoli F, Moschioni M, Donati C, Dimitrovska V, Ferlenghi I, et al. (2008) A

second pilus type in Streptococcus pneumoniae is prevalent in emerging serotypes and

mediates adhesion to host cells. J Bacteriol 190: 5480–5492.

79. Gianfaldoni C, Censini S, Hilleringmann M, Moschioni M, Facciotti C, et al.

(2007) Streptococcus pneumoniae pilus subunits protect mice against lethal challenge.

Infect Immun 75: 1059–1062.

80. Granoff DM, Welsch JA, Ram S (2009) Binding of complement factor H (fH) to

Neisseria meningitidis is specific for human fH and inhibits complement activation

by rat and rabbit sera. Infect Immun 77: 764–769.

81. McNeil LK, Murphy E, Zhao XJ, Guttmann S, Harris S, et al. (2009) Detection

of LP2086 on the cell surface of Neisseria meningitidis and its accessibility in the

presence of serogroup B capsular polysaccharide. Vaccine 27: 3417–3421.

82. Koeberling O, Seubert A, Granoff DM (2008) Bactericidal antibody responses

elicited by a meningococcal outer membrane vesicle vaccine with overexpressed

factor H-binding protein and genetically attenuated endotoxin. J Infect Dis 198:

262–270.

83. Madico G, Welsch JA, Lewis LA, McNaughton A, Perlman DH, et al. (2006)

The meningococcal vaccine candidate GNA1870 binds the complement

regulatory protein factor H and enhances serum resistance. J Immunol 177:

501–510.

84. Masignani V, Comanducci M, Giuliani MM, Bambini S, Adu-Bobie J, et al.

(2003) Vaccination against Neisseria meningitidis using three variants of the

lipoprotein GNA1870. J Exp Med 197: 789–799.

85. Welsch JA, Ram S, Koeberling O, Granoff DM (2008) Complement-dependent

synergistic bactericidal activity of antibodies against factor H-binding protein, a

sparsely distributed meningococcal vaccine antigen. J Infect Dis 197:

1053–1061.

86. Seib KL, Serruto D, Oriente F, Delany I, Adu-Bobie J, et al. (2009) Factor H-

binding protein is important for meningococcal survival in human whole blood

and serum and in the presence of the antimicrobial peptide LL-37. Infect

Immun 77: 292–299.

87. Mazurkiewicz P, Tang CM, Boone C, Holden DW (2006) Signature-tagged

mutagenesis: Barcoding mutants for genome-wide screens. Nat Rev Genet 7:

929–939.


Review

Toward the Use of Genomics to Study Microevolutionary Change in Bacteria

Daniel Falush*

Department of Microbiology, University College Cork, Environmental Research Institute, Lee Road, Cork, Ireland

Abstract: Bacteria evolve rapidly in response to the environment they encounter. Some environmental changes are experienced numerous times by bacteria from the same population, providing an opportunity to dissect the genetic basis of adaptive evolution. Here I discuss two examples in which the patterns of rapid change provide insight into medically important bacterial phenotypes, namely immune escape by Neisseria meningitidis and host specificity of Campylobacter jejuni. Genomic analysis of populations of bacteria from these species holds great promise but requires appropriate concepts and statistical tools.

Bacteria lack a natural reproductive system, comparable to

meiosis in eukaryotes, that segregates genes randomly. Instead,

they evolve progressively through mostly small genetic changes, a

proportion of which have noteworthy phenotypic effects. Some

phenotypes are intrinsically difficult to study in the laboratory:

virulence in humans or adaptation to particular ecological niches,

for example. For these traits in particular, a promising avenue for

scientific investigation is to identify the genetic changes that have

provided the basis for their evolution in natural populations.

Most human phenotypes are hard to study in vitro and,

consequently, methods for relating differences amongst humans to

natural genetic variation are well developed. Association studies

were proposed as an effective way of identifying genes with small

phenotypic effects more than a decade ago [1] and, although

initially controversial [2], the recent development of arrays for

genotyping hundreds of thousands of single nucleotide

polymorphisms (SNPs) scattered across the whole genome has allowed the

approach to be successfully applied to many different human

diseases and other phenotypes [3]. This success should inspire the

development of equivalent protocols within bacteriology.

One challenge in developing generally applicable protocols for

mapping phenotypic traits in bacteria is that processes by which

microevolution occurs vary tremendously between species. For

example, the human pathogen Mycobacterium tuberculosis, the causal

agent of tuberculosis (TB), diverged recently from an obscure

organism occasionally isolated from humans in Africa called

Mycobacterium canettii [4]. M. tuberculosis shows very little variation

and there is no evidence of strains acquiring DNA by import from

other M. tuberculosis strains or indeed from any other organism, so

that individuals are clones of each other, distinguished only by rare

mutations or other small changes. By contrast, individual

Helicobacter pylori, a cause of gastric cancer, acquire DNA from

other members of the species at an extremely high rate.

Consequently, as well as varying in gene content [5], strains

isolated from different host individuals in the same ethnic group

typically differ from each other at approximately 3% of

nucleotides in core genes, and this diversity segregates nearly

randomly [6]. The majority of bacterial species fall between these

extremes, with their genomes showing signs of both clonal descent

and DNA import from other strains.

In this essay, I will argue that the clonal mode of reproduction

shared by all bacteria and Archaea, in which replication occurs by

binary fission, in fact provides an extremely powerful context for

association studies. These studies will require both appropriate

technologies for genotyping and evolutionary analysis and

judiciously chosen strain collections. I will here concentrate on

two examples in which placing evolutionary changes in their

clonal context provides the power to relate phenotype to genotype.

Population-scale genome sequencing promises to allow a full and

unbiased catalogue of variation within the same clonal context.

This reconstruction will facilitate identification of loci that show

correlations with phenotype or anomalous patterns that indicate

natural selection, with minimal assumptions about the

mechanisms by which phenotypes change.

Example 1: Immune Escape during Clonal Spread of Neisseria meningitidis

Neisseria meningitidis lives in the human nasopharynx and is best

known for its role in meningitis and other forms of meningococcal

disease. N. meningitidis is a major cause of morbidity and mortality

in childhood in industrialised countries and is responsible for

epidemics, principally in Africa and Asia. Many lineages persist

stably within human populations, causing little disease. There are

a handful of “hyperinvasive” lineages, however, that have a

distinct epidemiology, spreading rapidly from location to location

and causing clusters of disease cases but not persisting in any one

place.

Mark Achtman and colleagues examined variation within a

single hyperinvasive lineage of N. meningitidis, designated subgroup

III, over a period of three decades [7]. The strains within subgroup

III showed little diversity in most of their housekeeping and other

genes surveyed. A few loci were identified that did show variation,

however, allowing clonal relationships to be partially reconstructed.

This reconstruction demonstrated that there were strong

bottlenecks during geographical spread, with a single ancestor for

each major wave of infection. It also showed that, notwithstanding

the low overall level of variation, certain genes encoding specific

antigens changed repeatedly in different countries and pandemic

waves.

Citation: Falush D (2009) Toward the Use of Genomics to Study Microevolutionary Change in Bacteria. PLoS Genet 5(10): e1000627. doi:10.1371/journal.pgen.1000627

Editor: David S. Guttman, University of Toronto, Canada

Copyright: © 2009 Daniel Falush. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Funding: The author is funded by Science Foundation of Ireland grant number 05/FE1/B882. The funders had no role in the preparation of the article.

Competing Interests: The author has declared that no competing interests exist.

The most remarkable variation was found in the transferrin-

binding protein B gene (tbpB), which encodes a protein responsible

for iron uptake that is expressed on the surface of the bacterium.

This gene had evolved on three occasions by nonsynonymous

point mutations that altered the structure of the protein and on 21

occasions by import of different versions of the protein from a

variety of sources, including from N. lactamica, a closely related and

entirely noninvasive species that also colonizes humans (Figure 1).

The import events vary: analysis of similar tbpB changes in a

closely related lineage showed that between 2 kb and 10 kb of

sequence was transferred, which often altered the sequence of the

flanking genes as well as tbpB [8]. In each case, however, an effect

of the imported DNA was to change the externally exposed part of

the protein from the usual version (called the family 4 version) to

one of two antigenically highly distinct versions (family 1 and

family 3).

The fact that functionally equivalent changes to tbpB are

achieved by heterogeneous genetic events shows that the large

number of imports is not caused by a recombination mechanism

that is specific to the locus. Instead it reflects the amplifying effect

of natural selection within the large number of bacteria that

circulate during epidemics. Imports happen at a low rate

throughout the genome, but those that cause an antigenic change

at the tbpB locus have a selective advantage, meaning that they are

observed at a much higher rate than imports elsewhere in the

genome.
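The amplifying effect of selection can be made concrete with a standard haploid selection recursion, in which a variant with relative advantage s changes in frequency from p to p(1 + s)/(1 + sp) each generation. The following is a minimal sketch under that assumption; the function name and the values of p0 and s are illustrative and are not estimates from the subgroup III data.

```python
def variant_frequency(p0: float, s: float, generations: int) -> float:
    """Frequency of a variant after repeated rounds of haploid selection.

    p0          -- starting frequency of the variant (hypothetical value)
    s           -- relative fitness advantage per generation (hypothetical value)
    generations -- number of generations of selection
    """
    p = p0
    for _ in range(generations):
        # Variant fitness is 1 + s, resident fitness is 1,
        # so mean fitness is 1 + s * p.
        p = p * (1.0 + s) / (1.0 + s * p)
    return p


if __name__ == "__main__":
    # A variant generated once in a million cells, with a 10% advantage.
    for t in (0, 50, 100, 150, 200):
        print(t, variant_frequency(1e-6, 0.10, t))
```

Under this simple model the odds p/(1 − p) grow by a factor of (1 + s) per generation, so even a variant that starts at one in a million can reach appreciable frequency within a few hundred generations if s is modest.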

High diversity at a particular antigen locus is usually explained

by invoking a mechanism called negative frequency-dependent

selection [9]. Hosts who have been exposed to a particular variant

develop immune responses against this variant. Bacteria with

antigenically distinct variants escape this response, giving them an

advantage in colonizing that host. At the population level, this

selection should lead to the persistence of multiple variants. Yet,

despite this selection for rare variants within individual epidemics,

the antigenic diversity of subgroup III did not increase

progressively over time but was instead reset at the beginning of

each new epidemic, which was started by a strain with a family 4

allele.

The continuous generation of subgroup III strains with family 1

and 3 tbpB alleles is better explained by a mechanism called

source–sink dynamics [10]. The source consists of an environment

within which transmission of the bacterium is self-sustaining. Sinks

consist of environments that bacteria can colonize effectively

(perhaps by undergoing genetic modification) but from which

onward transmission is ineffective. Here, the sink environment

consists of individuals with acquired immunity to subgroup III

strains that carry family 4 alleles, while the source is the remainder

of the human population. The fact that the variant genotypes

capable of colonizing the sink do not spread geographically but

instead are repeatedly regenerated locally suggests that these

strains have reduced overall transmission fitness in naïve hosts,

which comprise the majority of individuals in populations where

an epidemic has not occurred recently.

Two other examples of sink environments are the lungs of

immunocompromised patients for Pseudomonas aeruginosa, and the

human urinary tract for Escherichia coli [10]; as for the N. meningitidis

example, specific genetic changes have been identified that adapt

strains of these bacteria to those environments but at the expense

of overall transmission fitness, with the result that infections

generally occur sporadically.

Example 2: Host Specificity in Campylobacter jejuni

Campylobacter jejuni is a gram-negative bacterium commonly

found in animal feces. It is often associated with poultry and

naturally colonises the GI tract of many bird species. C. jejuni is one

of the most common causes of human gastroenteritis in the world.

Infection caused by Campylobacter species can be severely

debilitating but is rarely life-threatening. Human infection is

sporadic and, although poorly prepared food is often thought to be

implicated, it is generally difficult to track the source. There has

therefore been a substantial effort to isolate bacteria from a wide

variety of reservoirs and to genotype them using multilocus

sequence typing (MLST), which involves obtaining the DNA

sequence for each isolate at a standardized panel of genes (seven

for Campylobacter) that are chosen because they have an essential

function and are present in the vast majority of isolates in the

species [11].

Figure 1. Acquisition of new tbpB genes by subgroup III Neisseria meningitidis during epidemic spread. Colours indicate the family of each tbpB allele, with red corresponding to family 4, green corresponding to family 1, and blue corresponding to family 3. The bars highlight the time frame, most common tbpB type, and geographical extent of each epidemic (in 1987, pilgrims from the Hajj briefly distributed the lineage worldwide). The circles correspond to variant genotypes. Small circles indicate that the variant allele was found in only one strain; large circles indicate that it was found in between two and four strains. doi:10.1371/journal.pgen.1000627.g001

The C. jejuni strains acquired by chickens are distinct from those

of the wild birds around them, even when the poultry are kept

outdoors [12]. Within farm animals, certain lineages are found

with very different frequencies in chickens and cattle, whereas

several genotypes are found at high frequency in both (strains with

the MLST type ST-21, for example) [13]. Strains from different

farm animals are more similar to each other than they are to

strains found, for example, in starlings (a native European bird

that is also common in many other countries, including the US)

[14].

The digestive system of chickens differs from that of cattle in

multiple aspects, and their body temperature is several degrees

higher than that of cattle. This raises the question of how some

lineages are able to compete successfully in both hosts.

Mechanisms facilitating rapid phenotypic adaptation include: (1)

inbuilt regulatory mechanisms that allow individual bacteria to

alter gene expression in response to new environments [15], (2)

“contingency loci” that mutate rapidly, creating phenotypic

variation amongst bacteria that are otherwise genetically identical

[16], and (3) import of DNA from other strains that are already

adapted to the current environment.

A first step toward understanding the evolution of host

specificity is to establish whether it is possible to predict the host

origin of strains based on their genome sequence. One approach

to doing this uses phylogenetic relationships. For example, the

program AdaptML (http://almlab.mit.edu/ALME/Software/

Software.html) attempts to assign branches of the phylogenetic

tree to preferred habitats based on where the strains on that

branch were isolated [17]. For C. jejuni, habitat can, for example,

be equated to host species. The observation of a group of

phylogenetically related strains in a single host species might reflect

the common ancestor of those strains acquiring the traits required

to survive in that species.

Since C. jejuni recombines frequently, the genome composition

of each strain is determined by the sources from which it has

imported DNA, as well as by the strains to which it is phylogenetically

related. For example, ST-21, together with its variants, is a

lineage analogous to subgroup III of N. meningitidis. Like subgroup

III, the lineage has imported DNA from other strains on numerous

occasions during its spread, with the result that many isolates have

variant genotypes that differ from ST-21 at one or two of the seven

MLST fragments. By convention, these strains are grouped with

ST-21 into the ST-21 clonal complex.
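The grouping convention described above can be sketched in a few lines. This is a simplified illustration that assumes each isolate is summarised as a tuple of seven allele numbers and that membership is judged only against the central genotype; the function names and all allele numbers below are invented placeholders, not real MLST profiles.

```python
from typing import Sequence


def locus_differences(profile_a: Sequence[int], profile_b: Sequence[int]) -> int:
    """Count the MLST loci at which two allelic profiles differ."""
    if len(profile_a) != len(profile_b):
        raise ValueError("profiles must cover the same loci")
    return sum(a != b for a, b in zip(profile_a, profile_b))


def in_clonal_complex(profile: Sequence[int],
                      central: Sequence[int],
                      max_differences: int = 2) -> bool:
    """True if the profile differs from the central genotype at no more
    than max_differences loci (one or two of seven in the text above)."""
    return locus_differences(profile, central) <= max_differences


if __name__ == "__main__":
    central = (10, 11, 12, 13, 14, 15, 16)      # hypothetical central genotype
    isolates = {
        "isolate_1": (10, 11, 12, 13, 14, 15, 16),  # identical
        "isolate_2": (10, 11, 99, 13, 14, 15, 16),  # one variant locus
        "isolate_3": (10, 98, 99, 13, 14, 15, 16),  # two variant loci
        "isolate_4": (97, 98, 99, 13, 14, 15, 16),  # three variant loci
    }
    for name, profile in isolates.items():
        print(name, in_clonal_complex(profile, central))
```

The variant loci identified in this way are the ones that carry information about imported DNA, which is the signal exploited in the host-assignment analysis discussed next.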

ST-21 itself has been found at high frequency in several

agricultural species and elsewhere. Therefore, if a new strain is

found to be ST-21, this provides little information on where it

might have originated. However, for the variants of ST-21, Noel

McCarthy and colleagues obtained significantly better than

random assignment by predicting hosts based on the frequency

with which the variant allele was found in chicken or cattle [13]. A

useful signal of host-of-origin is thus provided by the DNA that

each isolate has acquired (Figure 2). Furthermore, the high rate of

recombination within particular hosts represents a mechanism by

which complex adaptations to a particular host species can be

acquired quickly subsequent to a host switch.
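The idea of using allele frequencies as a host signal can be sketched as follows. This is a deliberately simplified illustration and should not be read as the procedure used in [13]: it assumes a reference panel of alleles observed at one locus in each host, assigns a variant allele to the host in which it is relatively most frequent, and declines to predict when the allele is unseen or tied. All names and allele numbers are hypothetical.

```python
from collections import Counter
from typing import Dict, Hashable, List, Optional


def allele_frequencies(alleles: List[Hashable]) -> Dict[Hashable, float]:
    """Relative frequency of each allele in a list of observations."""
    counts = Counter(alleles)
    total = sum(counts.values())
    return {allele: n / total for allele, n in counts.items()}


def predict_host(variant_allele: Hashable,
                 reference: Dict[str, List[Hashable]]) -> Optional[str]:
    """Return the host in which the variant allele is most frequent.

    reference maps a host name to the alleles observed at this locus in
    isolates from that host. Returns None if the allele is absent from
    every host or equally frequent in the best-scoring hosts.
    """
    freqs = {host: allele_frequencies(alleles).get(variant_allele, 0.0)
             for host, alleles in reference.items()}
    best = max(freqs.values())
    winners = [host for host, f in freqs.items() if f == best]
    if best == 0.0 or len(winners) > 1:
        return None
    return winners[0]


if __name__ == "__main__":
    # Invented allele observations at a single MLST locus.
    reference = {
        "chicken": [1, 1, 1, 2, 7],
        "cattle": [1, 2, 2, 2, 9],
    }
    for allele in (7, 9, 2, 42):
        print(allele, predict_host(allele, reference))
```

An allele such as the hypothetical allele 1 above, which is common in both hosts, still receives a prediction but carries less information than an allele found in only one host. A fuller treatment would weight each prediction by the ratio of the two frequencies.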

The Power of Bacterial Genomics

Studies in bacteria have two major advantages over those in

humans or other mammals when it comes to relating phenotype to

genotype based on natural variation. The first is the magnifying

effect of natural selection in enormous bacterial populations. This

selection acts to rapidly increase the frequency of genotypes that

give small fitness advantages in a particular environment, even if

these genotypes are generated only rarely. Adaptation in bacteria

is likely to be more frequent and to leave more distinctive genetic

signatures than in species such as humans where signals of

adaptation to local environments have proved to be remarkably

subtle [18]. The second is the fact that evolution occurs in the

context of progressively changing clonal backgrounds. This

property can make it possible to identify strains that have

extremely similar genomes but nevertheless differ phenotypically

[19]. These strains represent the natural equivalent of an isogenic

line and can allow precise inferences about the effects of natural

variation and how different changes interact with each other.

In order to fully exploit the advantages of bacteria for detecting

phenotypic associations, it is necessary to develop a conceptual

and analytical framework within which rapid evolutionary change

can be interpreted. One such framework is source–sink dynamics

[10]. The Neisseria example illustrates the power of microevolutionary

analysis in a source–sink ecological context to identify first

the sink (hosts with immune responses to tbpB family 4 alleles) and

second the loci under an immediate selective pressure to change

within that sink (the tbpB gene).

Source–sink dynamics cannot be applied to investigate host

specificity within Campylobacter, because individual host species,

e.g., chicken, cattle, and individual species of wild birds, each

harbour large, viable populations of bacteria with high rates of

within-species transmission and do not represent sinks. Nevertheless,

there is a key similarity between the Neisseria and Campylobacter

examples, namely that the strains are repeatedly challenged by an

environment that is novel in the recent history of the strain. In the

Neisseria example, this challenge is repeatedly met by genetic

changes at particular antigenic loci, which consequently have

extremely atypical patterns of variation. In Campylobacter this

challenge is met in the context of a high rate of import of DNA

across the genome from other Campylobacter strains that already

colonize the new host.

Figure 2. A schematic illustration of the evolution of the C. jejuni ST-21 clonal complex in cattle and chickens. The common ancestor of the complex occurred in chickens (red). During evolution, the lineage occasionally switched to a cattle host (indicated by a blue branch) and sometimes back to chicken. The bacteria acquired DNA by homologous recombination from other C. jejuni in the same host. Since recombination is assumed to occur from donors within the same host, the gene pool is determined by the genomic composition of the strains that colonize each host. The gene pools are illustrated for two separate loci (right and left facing arrows) in chickens and cattle. The gene pools contain alleles that occur at much higher frequency in one host than in the other (shown in colour) and others that do not (shown in black). The former are informative about the host in which the recombination event occurred, while the latter are not. The recombination event labelled a introduces the left facing black arrow gene from the cattle gene pool and is phylogenetically informative because it defines a lineage that is largely restricted to cattle. The five recombination events labelled b are not phylogenetically informative, since they only affect a single strain in the sample. These events are nevertheless informative because they introduce alleles that are characteristic of the host species. The event labelled c is both phylogenetically informative and characteristic of host. The event labelled d is noninformative. doi:10.1371/journal.pgen.1000627.g002

The availability of full genome sequences promises to enhance

our understanding of the bacterial responses to new environments

in a number of ways. First, phylogenetic relationships will be better

resolved. In the Neisseria example, a well-resolved tree will

elucidate patterns of transmission within epidemics and, for

example, whether tbpB imports take place at the later stages of

each wave and if strains with such imports ever reacquire family 4

alleles and seed later epidemics. In the Campylobacter example this

will allow estimates of the number of occasions that the ST-21

lineage has jumped between host-species and establish whether

there are sublineages that are becoming progressively more

adapted to single-host transmission.
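Given a resolved tree, one simple way to estimate the number of host jumps is Fitch parsimony, which returns the minimum number of host changes needed to explain the hosts observed at the tips. The sketch below assumes a strictly bifurcating tree written as nested tuples with host names at the leaves; the representation and function names are illustrative choices, and the count is a lower bound that ignores recombination and unsampled lineages.

```python
from typing import FrozenSet, Tuple, Union

Tree = Union[str, Tuple["Tree", "Tree"]]


def fitch(tree: Tree) -> Tuple[FrozenSet[str], int]:
    """Fitch's bottom-up pass on a binary tree.

    Returns the set of most-parsimonious hosts at the root of `tree`
    and the minimum number of host changes within it.
    """
    if isinstance(tree, str):          # a leaf: the host it was sampled from
        return frozenset([tree]), 0
    left, right = tree
    left_states, left_changes = fitch(left)
    right_states, right_changes = fitch(right)
    shared = left_states & right_states
    if shared:                         # children agree: no extra change
        return shared, left_changes + right_changes
    # Children disagree: one host change is required on this node's branches.
    return left_states | right_states, left_changes + right_changes + 1


def min_host_switches(tree: Tree) -> int:
    """Minimum number of host switches implied by the tip hosts."""
    return fitch(tree)[1]


if __name__ == "__main__":
    # Hypothetical five-isolate tree for a single lineage.
    tree = ((("chicken", "chicken"), "cattle"), ("cattle", "cattle"))
    print(min_host_switches(tree))
```

Sublineages in which long runs of tips share a single host with no inferred switches would be candidates for progressive adaptation to single-host transmission, although sampling bias would need to be excluded first.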

Second, genomics will provide a complete catalogue of loci

whose pattern of descent is atypical of the genome as a whole and

therefore either associated with a particular phenotype or

putatively affected by selection. In the Neisseria example, an

elevated rate of change at particular loci and consistency in the

nature of those changes would provide signs of selection. In the

Campylobacter example, loci that are imported at very high

frequency and/or that are highly differentiated between host

species may be involved in adaptation to a new host. An isolate-by-

isolate analysis of the patterns of import should establish whether

the multi-host lifestyle of ST-21 and, by extension, of C. jejuni as a

whole is facilitated by import of DNA from locally adapted strains.
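As a rough illustration of how an elevated rate of change might be flagged, the sketch below compares the number of independent changes at each locus with a Poisson expectation based on the average across loci. It assumes loci of comparable length and a single background rate, which is a strong simplification. Apart from the 24 changes at tbpB (three point mutations plus 21 imports, as described above), the locus names, counts, and the threshold alpha are invented for the example.

```python
import math
from typing import Dict, List


def poisson_tail(k: int, lam: float) -> float:
    """P(X >= k) for X ~ Poisson(lam)."""
    below = sum(math.exp(-lam) * lam ** i / math.factorial(i) for i in range(k))
    return max(0.0, 1.0 - below)


def atypical_loci(changes: Dict[str, int], alpha: float = 0.001) -> List[str]:
    """Loci whose number of changes is improbably high under a shared rate.

    changes maps a locus name to its count of independent changes.
    The shared rate is the mean count over all loci, including any
    outlier, which makes the test conservative.
    """
    lam = sum(changes.values()) / len(changes)
    return sorted(locus for locus, k in changes.items()
                  if poisson_tail(k, lam) < alpha)


if __name__ == "__main__":
    changes = {
        "tbpB": 24,      # 3 point mutations + 21 imports (from the text)
        "locus_a": 1,    # hypothetical background loci
        "locus_b": 0,
        "locus_c": 2,
        "locus_d": 1,
    }
    print(atypical_loci(changes))
```

With genome-scale data the same comparison would be run across thousands of loci, so the threshold would need correction for multiple testing, and the rate model would need to account for locus length and for imports that span several genes.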

Third, genomics will allow detection of epistasis between loci.

Epistasis occurs when the fitness effects of alleles at one gene are

modified by the genotype at one or more additional genes. In

outbreeding diploids, such as mammals, each allele has its fitness

tested on a new genetic background in every generation, with the

result that epistasis does not leave a distinctive signature in the

frequency of particular combinations of alleles unless the loci are

closely linked on the same chromosome or selection is very strong.

In bacteria, combinations of alleles remain together for many

generations wherever they occur in the genome, providing ample

opportunity for epistasis to bring particular combinations of alleles

to high frequency. For example, subgroup III strains that have

imported variant tbpB alleles can potentially enhance their fitness

by importing other parts of the genome that adapt other strains in

the Neisseria population to having high fitness when carrying family

1 or family 3 alleles. These parts of the genome could be detected

by identifying parallel changes that have occurred on the 21

occasions that a variant tbpB allele was imported during the spread

of subgroup III strains. Fitness interactions establish functional

relationships between loci and represent a central part of the

evolutionary landscape, for example triggering the origin of species

[20]. Genome sequencing of bacteria should provide key insights

on the nature of these interactions in natural populations.

In C. jejuni and other zoonoses, genomic analyses will facilitate a

qualitative advance in our understanding of the epidemiology,

ecology, and molecular biology of host switches. These

developments will allow accurate delineation of the sources of human

infection and an understanding of the factors promoting successful

and pathogenic colonization of humans. In N. meningitidis and

similar bacteria, we will gain a much better understanding of the

genetic differences between invasive and noninvasive strains and

the particular adaptive strategies that cause lineages to become

invasive. These advances will together allow the design of targeted

interventions that reduce the burden of human disease.


Advances in sequencing technology mean that it is becoming

economically feasible to obtain complete or nearly complete

genome sequences for large samples of bacteria. To better exploit

this technology to understand bacterial phenotypes, the field

should emulate the research program of human genetics and (1)

develop statistical tools that use sequence variation to infer

mechanisms of evolution [21] and patterns of genetic relationship

[22]; (2) collect and sequence samples of isolates in which bacteria

that differ in phenotypes of interest are matched as far as possible

in time and space [23]; and (3) design statistical tools for detecting

phenotypic associations [24] and natural selection [25] by

identifying patterns of relationship at particular loci that are

atypical of the genome as a whole.

Acknowledgments

Mark Achtman, Jim Bull, Jana Haase, Riikka Haukkanen, and Daniel

Stoebel provided insightful discussions and comments on the manuscript.

References

1. Risch N, Merikangas K (1996) The future of genetic studies of complex human diseases. Science 273: 1516–1517.

2. Weiss KM, Terwilliger JD (2000) How many diseases does it take to map a gene with SNPs? Nat Genet 26: 151–157.

3. Hardy J, Singleton A (2009) Genomewide association studies and human disease. N Engl J Med 360: 1759–1768.

4. Fabre M, Koeck JL, Le Fleche P, Simon F, Herve V, et al. (2004) High genetic diversity revealed by variable-number tandem repeat genotyping and analysis of hsp65 gene polymorphism in a large collection of “Mycobacterium canettii” strains indicates that the M. tuberculosis complex is a recently emerged clone of “M. canettii”. J Clin Microbiol 42: 3248–3255.

5. Gressmann H, Linz B, Ghai R, Pleissner KP, Schlapbach R, et al. (2005) Gain and loss of multiple genes during the evolution of Helicobacter pylori. PLoS Genet 1: e43. doi:10.1371/journal.pgen.0010043.

6. Suerbaum S, Maynard Smith J, Bapumia K, Morelli G, Smith NH, et al. (1998) Free recombination within Helicobacter pylori. Proc Natl Acad Sci U S A 95: 12619–12624.

7. Zhu P, van der Ende A, Falush D, Brieske N, Morelli G, et al. (2001) Fit genotypes and escape variants of subgroup III Neisseria meningitidis during three pandemics of epidemic meningitis. Proc Natl Acad Sci U S A 98: 5234–5239.

8. Linz B, Schenker M, Zhu P, Achtman M (2000) Frequent interspecific genetic exchange between commensal Neisseriae and Neisseria meningitidis. Mol Microbiol 36: 1049–1058.

9. Brisson D, Dykhuizen DE (2004) ospC diversity in Borrelia burgdorferi: Different hosts are different niches. Genetics 168: 713–722.

10. Sokurenko EV, Gomulkiewicz R, Dykhuizen DE (2006) Source-sink dynamics of virulence evolution. Nat Rev Microbiol 4: 548–555.

11. Maiden MCJ, Bygraves JA, Feil E, Morelli G, Russell JE, et al. (1998) Multilocus sequence typing: A portable approach to the identification of clones within populations of pathogenic microorganisms. Proc Natl Acad Sci U S A 95: 3140–3145.

12. Colles FM, Jones TA, McCarthy ND, Sheppard SK, Cody AJ, et al. (2008) Campylobacter infection of broiler chickens in a free-range environment. Environ Microbiol 10: 2042–2050.

13. McCarthy ND, Colles FM, Dingle KE, Bagnall MC, Manning G, et al. (2007) Host-associated genetic import in Campylobacter jejuni. Emerg Infect Dis 13: 267–272.

14. Colles FM, McCarthy ND, Howe JC, Devereux CL, Gosler AG, et al. (2009) Dynamics of Campylobacter colonization of a natural host, Sturnus vulgaris (European starling). Environ Microbiol 11: 258–267.

15. Coulson RM, Ouzounis CA (2003) The phylogenetic diversity of eukaryotic transcription. Nucleic Acids Res 31: 653–660.

16. Moxon R, Bayliss C, Hood D (2006) Bacterial contingency loci: The role of simple sequence DNA repeats in bacterial adaptation. Annu Rev Genet 40: 307–333.

17. Hunt DE, David LA, Gevers D, Preheim SP, Alm EJ, et al. (2008) Resource partitioning and sympatric differentiation among closely related bacterioplankton. Science 320: 1081–1085.

18. Coop G, Pickrell JK, Novembre J, Kudaravalli S, Li J, et al. (2009) The role of geography in human adaptation. PLoS Genet 5: e1000500. doi:10.1371/journal.pgen.1000500.

19. Beres SB, Richter EW, Nagiec MJ, Sumby P, Porcella SF, et al. (2006) Molecular genetic anatomy of inter- and intraserotype variation in the human bacterial pathogen group A Streptococcus. Proc Natl Acad Sci U S A 103: 7059–7064.

20. Coyne JA, Orr HA (2004) Speciation. Sunderland (MA): Sinauer Associates.

21. McVean GA, Myers SR, Hunt S, Deloukas P, Bentley DR, et al. (2004) The fine-scale structure of recombination rate variation in the human genome. Science 304: 581–584.

22. Rosenberg NA, Pritchard JK, Weber JL, Cann HM, Kidd KK, et al. (2002) Genetic structure of human populations. Science 298: 2381–2385.

23. The Wellcome Trust Case Control Consortium (2007) Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls. Nature 447: 661–678.

24. Marchini J, Howie B, Myers S, McVean G, Donnelly P (2007) A new multipoint method for genome-wide association studies by imputation of genotypes. Nat Genet 39: 906–913.

25. Sabeti PC, Varilly P, Fry B, Lohmueller J, Hostetter E, et al. (2007) Genome-

wide detection and characterization of positive selection in human populations.Nature 449: 913–918.


Review

The Application of Genomics to Emerging Zoonotic Viral Diseases

Bart L. Haagmans, Arno C. Andeweg, Albert D. M. E. Osterhaus*

Department of Virology, Erasmus Medical Center, Rotterdam, The Netherlands

Abstract: Interspecies transmission of pathogens may result in the emergence of new infectious diseases in humans as well as in domestic and wild animals. Genomics tools such as high-throughput sequencing, mRNA expression profiling, and microarray-based analysis of single nucleotide polymorphisms are providing unprecedented ways to analyze the diversity of the genomes of emerging pathogens as well as the molecular basis of the host response to them. By comparing and contrasting the outcomes of an emerging infection with those of closely related pathogens in different but related host species, we can further delineate the various host pathways determining the outcome of zoonotic transmission and adaptation to the newly invaded species. The ultimate challenge is to link pathogen and host genomics data with biological outcomes of zoonotic transmission and to translate the integrated data into novel intervention strategies that eventually will allow the effective control of newly emerging infectious diseases.

Emerging Zoonotic Viruses

Most of the well-known human viruses persist in the population for a relatively long time, and coevolution of the virus and its human host has resulted in an equilibrium characterized by coexistence, often in the absence of a measurable disease burden. When pathogens cross a species barrier, however, the infection can be devastating, causing a high disease burden and mortality. In recent years, several outbreaks of infectious diseases in humans linked to such an initial zoonotic transmission (from animal to human host) have highlighted this problem. Factors related to our increasingly globalized society have contributed to the apparently increased transmission of pathogens from animals to humans over the past decades; these include changes in human factors such as increased mobility, demographic changes, and exploitation of the environment (for a review see Osterhaus [1] and Kuiken et al. [2]). Environmental factors also play a direct role, and many examples exist. The recently increased distribution of the arthropod (mosquito) vector Aedes aegypti, for example, has led to massive outbreaks of dengue fever in South America and Southeast Asia. Intense pig farming in areas where frugivorous bats are common is probably the direct cause of the introduction of Nipah virus into pig populations in Malaysia, with subsequent transmission to humans. Bats are an important reservoir for a plethora of zoonotic pathogens: two closely related paramyxoviruses, Hendra virus and Nipah virus, cause persistent infections in frugivorous bats and have spread to horses and pigs, respectively [3].

The similarity between human and nonhuman primates permits many viruses to cross the species barrier between different primate species. The introduction into humans of HIV-1 and HIV-2 (the lentiviruses that cause AIDS), as well as of other primate viruses such as monkeypox virus and Herpesvirus simiae, provides dramatic examples of this type of transmission. Other viruses, such as influenza A viruses and severe acute respiratory syndrome coronavirus (SARS-CoV), may need multiple genetic changes to adapt successfully to humans as a new host species; these changes might include differential receptor usage, enhanced replication, evasion of innate and adaptive host immune defenses, and/or increased efficiency of transmission. Understanding the complex interactions between the invading pathogen on the one hand and the new host on the other as they progress toward a new host–pathogen equilibrium is a major challenge that differs substantially for each successful interspecies transmission and subsequent spread of the virus.

Genomics of Zoonotic Viruses and Their Hosts

New molecular techniques such as high-throughput sequencing, mRNA expression profiling, and array-based single nucleotide polymorphism (SNP) analysis provide ways to rapidly identify emerging pathogens (Nipah virus and SARS-CoV, for example) and to analyze the diversity of their genomes as well as the host responses against them. Essential to the process of identification and characterization of genome sequences is the exploitation of extensive databases that allow the alignment of viral genome sequences and the linkage of these genomics data to those obtained by classical viral culture and serological techniques, and epidemiological, clinical, and pathological studies [4]. Extensive genetic analysis of HIV-1, for example, has provided clues to the geography and time scale of the early diversification of HIV-1 strains when the virus emerged in humans. HIV-1 strains are divided into multiple clades, each of which has independently evolved from a simian immunodeficiency virus (SIV) that naturally infects chimpanzees in West and Central Africa. Current estimates date the common ancestor of HIV-1 to the beginning of the twentieth century [5].
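One way to build intuition for how sequence divergence can be turned into a date is the strict molecular clock, which assumes that substitutions accumulate at a constant rate along a lineage. The sketch below is only an illustration under that assumption: the function name and the numbers are hypothetical and are not taken from [5], and published estimates generally rely on more elaborate statistical models.

```python
def tmrca_year(sampling_year, root_to_tip_distance, substitution_rate):
    """Estimate the year of the most recent common ancestor (MRCA).

    Assumes a strict molecular clock (an illustrative simplification):
      root_to_tip_distance: substitutions per site between the MRCA
                            and a sequence sampled in sampling_year
      substitution_rate:    substitutions per site per year
    """
    if substitution_rate <= 0:
        raise ValueError("substitution_rate must be positive")
    if root_to_tip_distance < 0:
        raise ValueError("root_to_tip_distance cannot be negative")
    # Under a constant rate, elapsed time = accumulated divergence / rate.
    elapsed_years = root_to_tip_distance / substitution_rate
    return sampling_year - elapsed_years


# Invented numbers, chosen only to make the arithmetic easy to follow:
# 0.25 substitutions/site accumulated at 0.0025 substitutions/site/year.
example_estimate = tmrca_year(2000, 0.25, 0.0025)
print(round(example_estimate))
```

The estimate scales inversely with the assumed rate, so halving the rate doubles the elapsed time; this sensitivity is one reason why well-dated archival samples are valuable for calibrating such analyses.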

Citation: Haagmans BL, Andeweg AC, Osterhaus ADME (2009) The Application of Genomics to Emerging Zoonotic Viral Diseases. PLoS Pathog 5(10): e1000557. doi:10.1371/journal.ppat.1000557

Editor: Marianne Manchester, The Scripps Research Institute, United States of America

Copyright: © 2009 Haagmans et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Funding: Supported by the VIRGO consortium, an Innovative Cluster approved by the Netherlands Genomics Initiative and partially funded by the Dutch Government (BSIK 03012), The Netherlands and the US National Institutes of Health, RO1 grant HL080621-O1A1. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.




PLoS Pathogens | www.plospathogens.org 1 October 2009 | Volume 5 | Issue 10 | e1000557

Because zoonotic pathogens may cause variable clinical outcomes in human hosts that differ in age, nutritional status, genetic background, and immunological condition, deciphering the complex interactions between evolving pathogens and their hosts is a great challenge. The genome sequences of many host species have become available in the last decade, and with them a range of novel tools to study virus–host interactions at the molecular level. This progress, together with advances in high-throughput sequencing technology and, not least, in (bio)informatics and statistics, allows us to analyze the “genome-wide” networks of gene interactions that control the host response to pathogens. By comparing and contrasting the outcomes of infection with closely related pathogens in different but related host species, we can further delineate the various host pathways involved in the different outcomes. The power of this approach was nicely demonstrated for SIV infection of various primate host species. Natural reservoir hosts of SIV do not develop AIDS upon infection, whereas non-natural hosts, such as rhesus macaques and pig-tailed macaques, when infected experimentally with SIV, develop AIDS in a similar manner to HIV-infected humans. Transcriptional profiling indicates that SIV infection of these species produces a distinctive host response [6]. SIV-infected primates with symptoms of AIDS have a high viral load, immune activation, and loss of certain types of T cells, whereas SIV-infected sooty mangabeys (the species from which HIV-2 is thought to have originated) have substantially lower levels of innate immune activation than the symptomatic primates, partly due to the production of less interferon-α by plasmacytoid dendritic cells in response to SIV and other Toll-like receptor ligands [7]. Identification of host factors that restrict HIV infection may aid the development of effective intervention strategies. Below, we elaborate on two other examples of recent important zoonotic events that led to sustained virus transmission in the human host, and the role that genomics has played in the elucidation of their pathogenesis thus far.

Influenza Virus

Influenza is caused by RNA viruses of the Orthomyxoviridae family. Whereas fever and coughs are the most frequent symptoms, in more serious cases a fatal pneumonia can develop, particularly in the young and the elderly. Typically, influenza is transmitted through the air by coughs or sneezes, creating aerosols containing the virus; but influenza can also be transmitted by bird droppings, saliva, feces, and blood. Birds and pigs play an important role in the emergence of new influenza viruses in humans. Fecal sampling of migratory birds has revealed that they harbor a large range of different subtypes of influenza A viruses [8]. Some wild duck species, particularly mallards, are potential long-distance vectors of highly pathogenic avian influenza virus (H5N1), whereas others, including diving ducks, are more likely to act as “sentinel” species that die upon infection [9]. Following the introduction of a new pandemic influenza A virus subtype from an avian reservoir, either directly or via another mammalian species such as the pig, the virus may continue to circulate in humans in subsequent years as a seasonal influenza virus. In the past century, three major influenza epidemics resulted in the loss of many millions of lives. Spanish flu alone caused the deaths of more than 50 million people by the end of World War I in 1918. The 2009 outbreak of a new H1N1 virus (causing “swine flu”) that started in Mexico further illustrates the pandemic potential of influenza A viruses.

After introduction of a new influenza A virus from an avian or porcine reservoir into the human species, viral genomics studies are essential to identify critical mutations that enable the circulating virus to spread efficiently, interact with different receptors, and cause disease in the new host. For example, the importance of residue 627 of the PB2 protein of the viral polymerase in determining species restriction has been demonstrated through these kinds of approaches [10]. Furthermore, changes in the hemagglutinin molecules may allow influenza A viruses to switch receptor specificity. The hemagglutinin of avian H5N1 influenza viruses preferentially binds to oligosaccharides that terminate with a sialic acid–α-2,3-Gal disaccharide, whereas the hemagglutinins of mammalian influenza A viruses prefer oligosaccharides that terminate with sialic acid–α-2,6-Gal (Figure 1). Fatal viral pneumonia in humans infected with avian H5N1 viruses is partly due to the ability of these viruses to attach to and replicate in the cells of the lower respiratory tract, which have oligosaccharides that terminate in sialic acid–α-2,3-Gal disaccharide [11,12]. The sequence of the hemagglutinin protein may also affect its binding affinity for neutralizing antibodies. Understanding the relationship between genetic diversity and antigenic properties of these viruses [13] may help to predict the emergence of influenza viruses and to develop effective vaccines.

Microarray-assisted mRNA expression profiling of emerging zoonotic viral infections, including influenza A virus, is used to phenotype the host response in great detail. By comparing mRNA expression in individuals infected with an emerging virus to expression in individuals infected with a related established virus, researchers can generate a “molecular fingerprint” of the host response genes or pathways specifically involved in the often-exuberant host responses to the emerging virus. By using genetically engineered influenza A viruses, a role for the nonstructural NS1 viral protein in evasion of the innate host response has been demonstrated [14]. Interestingly, the NS1 protein derived from the 1918 Spanish H1N1 pandemic influenza virus blocked expression of interferon-regulated genes more efficiently than did the NS1 protein from established seasonal influenza viruses [14]. Other genomics studies of genetically engineered influenza A viruses containing some or all of the gene segments from either the 1918 H1N1 virus or the highly pathogenic avian influenza A virus (H5N1) suggest that these highly pathogenic influenza viruses induce severe disease in mice and macaques through aberrant and persistent activation of proinflammatory cytokine and chemokine responses [15–18].
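The comparison behind such a “molecular fingerprint” can be sketched in a few lines. The example below is a minimal illustration, not the analysis used in the cited studies: it assumes already-normalized expression values, uses a simple log2 ratio of group means with a pseudocount, and applies an arbitrary twofold cutoff. The function names, the threshold, and all expression values are invented for illustration; only the gene symbols are real.

```python
import math


def mean(values):
    """Arithmetic mean of a non-empty list of expression values."""
    if not values:
        raise ValueError("need at least one expression value")
    return sum(values) / len(values)


def log2_fold_changes(emerging, established, pseudocount=1.0):
    """Per-gene log2 ratio of mean expression (emerging vs established virus).

    Both arguments map a gene symbol to a list of normalized expression
    values, one per infected individual. Only genes measured in both
    groups are compared. The pseudocount avoids division by zero.
    """
    changes = {}
    for gene in emerging.keys() & established.keys():
        ratio = (mean(emerging[gene]) + pseudocount) / (
            mean(established[gene]) + pseudocount
        )
        changes[gene] = math.log2(ratio)
    return changes


def fingerprint(changes, threshold=1.0):
    """Genes whose expression differs at least 2**threshold-fold, sorted."""
    return sorted(g for g, lfc in changes.items() if abs(lfc) >= threshold)


# Invented expression values for two small groups of infected individuals.
emerging = {"CXCL10": [700, 900], "IL6": [400, 440], "ACTB": [1000, 1040]}
established = {"CXCL10": [90, 110], "IL6": [100, 110], "ACTB": [990, 1010]}

changes = log2_fold_changes(emerging, established)
print(fingerprint(changes))
```

A real analysis would add replicate-aware statistics and multiple-testing correction; the point here is only that the fingerprint is defined by contrast with a related, established virus rather than with uninfected controls.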

Application of genomics tools not only supports the elucidation of mechanisms underlying pathogenesis but may also help to identify leads for therapeutic intervention. In ferrets, H5N1 infection induced severe disease that was associated with strong expression of interferon response genes including the interferon-γ-induced cytokine CXCL10. Treatment of H5N1-infected ferrets with an antagonist of the CXCL10 receptor (CXCR3) reduced the severity of the flu symptoms and the viral titers compared to the controls [19], clearly demonstrating the potential of biological response modifiers for the clinical management of viral infections. The host evasion and evolution of influenza virus are further discussed in [20].

SARS-CoV

Coronaviruses (CoVs) primarily infect the upper respiratory and gastrointestinal tracts of mammals and birds. Five different currently known CoVs infect humans and are believed to cause a significant percentage of all common colds in human adults. Surprisingly, recent studies revealed that approximately 6% of bats sampled in China were positive for CoVs [21]. Subsequent phylogenetic studies revealed that bat CoVs that resembled human SARS-CoV clustered in a putative group comprising one subgroup of bat CoVs and another of SARS-CoVs from humans and other mammalian hosts. According to the current hypothesis, SARS-CoV arose by recombination between two bat viruses. Phylogenetic analysis of SARS-CoV isolates from animals indicates that the resulting bat virus was transmitted first to palm civets (Paguma larvata), a wild cat-like animal hunted for its meat, and subsequently to humans at live animal markets in southern China [22].

Genome analyses have provided evidence that genetic variation in the spike gene of these viruses from civets is associated with increased transmission of the virus [21]. In addition, species-to-species variation in the sequence of the gene angiotensin-converting enzyme 2 (ACE2), which encodes the SARS-CoV receptor, also affects the efficiency with which the virus can enter cells [23]. By a combination of phylogenetic and bioinformatics analyses, chimeric gene design, and reverse genetics–aided generation of viruses that encode spike proteins of diverse isolates, researchers have reconstructed the events that led to the emergence of a virus able to spread efficiently in humans [24]. Structural modeling predicted that the SARS-CoV that caused the epidemic had an increased affinity for both civet and human ACE2 receptors due to adaptation (Figure 2). Subsequent functional genomics studies of these viruses in diverse species provided further insight into the role of specific host genes involved in the pathogenic response [25,26]. The pathological changes observed in the lungs are initiated by a disproportionate innate immune response, illustrated by elevated levels of inflammatory cytokines and chemokines, such as CXCL10 (IP-10), CCL2 (MCP-1), interleukin (IL)-6, IL-8, IL-12, IL-1β, and interferon-γ [27]. These clinical data were confirmed experimentally by demonstrating that SARS-CoV infection of diverse cell types induces a range of cytokines and chemokines, thus providing a conceptual framework for SARS-CoV pathogenesis. Host genome expression analyses of various animal hosts and humans with different outcomes of infection indicated differential activation of innate immune genes in, for example, aged subjects compared to young subjects. Importantly, treatment of aged macaques with pegylated interferon-α (i.e., interferon-α covalently modified with polyethylene glycol polymer chains to enhance its bioavailability) reduced SARS-CoV replication and pathogenic responses [28]. Thus, host genomics analysis may provide markers of pathogenesis and leads for therapeutic intervention, as in this example of SARS-CoV infection.

Figure 1. Zoonotic transmission of influenza A virus. The hemagglutinin of avian influenza A viruses (blue) preferentially binds to oligosaccharides that terminate in sialic acid–α-2,3-Gal (red), whereas the hemagglutinin on human influenza A viruses (green) prefers oligosaccharides that terminate in sialic acid–α-2,6-Gal (orange). Fatal viral pneumonia in humans infected with the H5N1 subtype of avian influenza A viruses is likely due to the ability of these viruses to attach to and replicate in the lower respiratory tract cells, which have sialic acid–α-2,3-Gal terminated saccharides. The horizontal arrows indicate interspecies transmission, including the transmission from an avian or porcine reservoir into the human species. Image credit: Bart Haagmans, Erasmus MC. Original images (left to right, from top to bottom) by Roman Kohler, Alvesgaspar, Anton Holmquist, Joshua Lutz, and CDC. doi:10.1371/journal.ppat.1000557.g001


Rapid identification of newly emerging viruses through the use of genomics tools is one of the major challenges for the near future. In addition, the identification of critical mutations that enable viruses to spread efficiently, interact with different receptors, and cause disease in diverse hosts through, for instance, enhanced viral replication or circumvention of the innate and adaptive immune responses, needs to be further expanded. Although microarray-assisted transcriptional profiling can provide us with a wealth of information regarding host genes and gene-interacting networks in virus–host interactions, future research should focus on combining data obtained in different experimental settings. Therefore, the careful design of complementary sets of experiments using different formats of virus–host interactions is absolutely needed for successful genomics studies [29]. Special attention should be paid to the comparative analysis of the host response in diverse animal species. Thus far a limited number of laboratory animal species has been studied, but the recent elucidation of the genome of several other animal species will provide tools to decipher the virus–host interactions in the more relevant natural host. Recent developments in the sequencing of the RNA transcriptome may aid this development. Ultimately, microarray technology may also extend to genotyping of the human host by SNP analysis, to identify markers of host susceptibility and severity of disease that can be used in tailor-made clinical management of disease caused by emerging infections. Comparative analysis of host responses to emerging viruses may also point toward a similar dysregulated host response to a range of emerging virus infections, enabling the rational design of multipotent biological response modifiers to combat a variety of emerging viral infections. By focusing on broad-acting intervention strategies rather than on the discovery of a newly emerging pathogen that is not characterized yet, we may be able to protect ourselves from several unexpectedly emerging infections with the same clinical manifestations. This approach may readily reduce the burden of disease, and time will be gained to design preventive pathogen-specific intervention strategies such as antiviral therapy or vaccination. Clearly, for all stages of combating emerging infections, from the early identification of the pathogen to the development and design of vaccines, application of sophisticated genomics tools is fundamental to success.

Figure 2. Zoonotic transmission of SARS-CoV. Genomic analyses provided evidence that genetic changes in the spike gene of SARS-CoV from bats (left) and civet cats (center) are essential for the animal-to-human transmission (horizontal arrows). Species-to-species genetic variation in the (thus far unidentified) viral receptor in bats and in the angiotensin converting enzyme 2 (ACE2) gene, encoding the SARS-CoV receptor in civet cats and humans, also affects the efficiency with which the virus can enter cells (vertical arrows). The SARS-CoV that caused the epidemic evolved a high affinity for both civet (center) and human (right) ACE2 receptors (indicated by the single diagonal and the right side vertical arrow). Image credit: Bart Haagmans, Erasmus MC. Original images (left to right) by Dodoni, Paul Hilton, and Hoang Dinh Nam. doi:10.1371/journal.ppat.1000557.g002

References

1. Osterhaus A (2001) Catastrophes after crossing species barriers. Philos Trans R Soc Lond B Biol Sci 356: 791–793.

2. Kuiken T, Leighton FA, Fouchier RA, LeDuc JW, Peiris JS, et al. (2005) Public health. Pathogen surveillance in animals. Science 309: 1680–1681.

3. Field HE, Mackenzie JS, Daszak P (2007) Henipaviruses: Emerging paramyxoviruses associated with fruit bats. Curr Top Microbiol Immunol 315: 133–159.

4. Rivers TM (1937) Viruses and Koch’s postulates. J Bacteriol 33: 1–12.

5. Worobey M, Gemmel M, Teuwen DE, Haselkorn T, Kunstman K, et al. (2008) Direct evidence of extensive diversity of HIV-1 in Kinshasa by 1960. Nature 455: 661–664.

6. Lederer S, Favre D, Walters KA, Proll S, Kanwar B, et al. (2009) Transcriptional profiling in pathogenic and non-pathogenic SIV infections reveals significant distinctions in kinetics and tissue compartmentalization. PLoS

7. Mandl JN, Barry AP, Vanderford TH, Kozyr N, Chavan R, et al. (2008) Divergent TLR7 and TLR9 signaling and type I interferon production distinguish pathogenic and nonpathogenic AIDS virus infections. Nat Med 14: 1077–1087.

8. Munster VJ, Baas C, Lexmond P, Waldenstrom J, Wallensten A, et al. (2007) Spatial, temporal, and species variation in prevalence of influenza A viruses in wild migratory birds. PLoS Pathog 3: e61. doi:10.1371/journal.ppat.0030061.

9. Keawcharoen J, van Riel D, van Amerongen G, Bestebroer T, Beyer WE, et al. (2008) Wild ducks as long-distance vectors of highly pathogenic avian influenza virus (H5N1). Emerg Infect Dis 4: 600–607.

10. Hatta M, Gao P, Halfmann P, Kawaoka Y (2001) Molecular basis for high virulence of Hong Kong H5N1 influenza A viruses. Science 293: 1840–1842.

11. van Riel D, Munster VJ, de Wit E, Rimmelzwaan GF, Fouchier RA, et al. (2006) H5N1 virus attachment to lower respiratory tract. Science 312: 399.

12. Yamada S, Suzuki Y, Suzuki T, Le MQ, Nidom CA, et al. (2006) Haemagglutinin mutations responsible for the binding of H5N1 influenza A viruses to human-type receptors. Nature 444: 378–382.

13. Smith DJ, Lapedes AS, de Jong JC, Bestebroer TM, Rimmelzwaan GF, et al. (2004) Mapping the antigenic and genetic evolution of influenza virus. Science 305: 371–376.

14. Geiss GK, Salvatore M, Tumpey TM, Carter VS, Wang X, et al. (2002) Cellular transcriptional profiling in influenza A virus-infected lung epithelial cells: The role of the nonstructural NS1 protein in the evasion of the host innate defense and its potential contribution to pandemic influenza. Proc Natl Acad Sci U S A 99: 10736–10741.

15. Kobasa D, Jones SM, Shinya K, Kash JC, Copps J, et al. (2007) Aberrant innate immune response in lethal infection of macaques with the 1918 influenza virus. Nature 445: 319–323.

16. Baskin CR, Bielefeldt-Ohmann H, Tumpey TM, Sabourin PJ, Long JP, et al. (2009) Early and sustained innate immune response defines pathology and death in nonhuman primates infected by highly pathogenic influenza virus. Proc Natl Acad Sci U S A 106: 3455–3460.

17. Kash JC, Tumpey TM, Proll SC, Carter V, Perwitasari O, et al. (2006) Genomic analysis of increased host immune and cell death responses induced by 1918 influenza virus. Nature 443: 578–581.

18. Kash JC, Basler CF, García-Sastre A, Carter V, Billharz R, et al. (2004) Global host immune response: Pathogenesis and transcriptional profiling of type A influenza viruses expressing the hemagglutinin and neuraminidase genes from the 1918 pandemic virus. J Virol 78: 9499–9511.

19. Cameron CM, Cameron MJ, Bermejo-Martin JF, Ran L, Xu L, et al. (2008) Gene expression analysis of host innate immune responses during lethal H5N1 infection in ferrets. J Virol 82: 11308–11317.

20. McHardy AC, Adams B (2009) The role of genomics in tracking the evolution of influenza A virus. PLoS Pathog 5: e1000566. doi:10.1371/journal.ppat.1000566.

21. Tang XC, Zhang JX, Zhang SY, Wang P, Fan XH, et al. (2006) Prevalence and genetic diversity of coronaviruses in bats from China. J Virol 80: 7481–7490.

22. Song HD, Tu CC, Zhang GW, Wang SY, Zheng K, et al. (2005) Cross-host evolution of severe acute respiratory syndrome coronavirus in palm civet and human. Proc Natl Acad Sci U S A 102: 2430–2435.

23. Li W, Zhang C, Sui J, Kuhn JH, Moore MJ, et al. (2005) Receptor and viral determinants of SARS-coronavirus adaptation to human ACE2. EMBO J 24: 1634–1643.

24. Sheahan T, Rockx B, Donaldson E, Sims A, Pickles R, et al. (2008) Mechanisms of zoonotic severe acute respiratory syndrome coronavirus host range expansion in human airway epithelium. J Virol 82: 2274–2285.

25. Rockx B, Baas T, Zornetzer GA, Haagmans B, Sheahan T, et al. (2009) Early upregulation of acute respiratory distress syndrome-associated cytokines promotes lethal disease in an aged-mouse model of severe acute respiratory syndrome coronavirus infection. J Virol 83: 7062–7074.

26. de Lang A, Baas T, Teal T, Leijten LM, Rain B, et al. (2007) Functional genomics highlights differential induction of antiviral pathways in the lungs of SARS-CoV-infected macaques. PLoS Pathog 3: e112. doi:10.1371/journal.ppat.0030112.

27. Baas T, Roberts A, Teal TH, Vogel L, Chen J, et al. (2008) Genomic analysis reveals age-dependent innate immune responses to severe acute respiratory syndrome coronavirus. J Virol 82: 9465–9476.

28. Haagmans BL, Kuiken T, Martina BE, Fouchier RA, Rimmelzwaan GF, et al. (2004) Pegylated interferon-alpha protects type 1 pneumocytes against SARS coronavirus infection in macaques. Nat Med 10: 290–293.

29. Andeweg AC, Haagmans BL, Osterhaus ADME (2008) Virogenomics: The virus–host interaction revisited. Curr Opin Microbiol 11: 1–6.


Review

The Role of Genomics in Tracking the Evolution of Influenza A Virus

Alice Carolyn McHardy1*, Ben Adams2

1 Computational Genomics and Epidemiology, Max Planck Institute for Informatics, Saarbruecken, Germany, 2 Department of Mathematical Sciences, University of Bath, United Kingdom

Abstract: Influenza A virus causes annual epidemics and occasional pandemics of short-term respiratory infections associated with considerable morbidity and mortality. The pandemics occur when new human-transmissible viruses that have the major surface protein of influenza A viruses from other host species are introduced into the human population. Between such rare events, the evolution of influenza is shaped by antigenic drift: the accumulation of mutations that result in changes in exposed regions of the viral surface proteins. Antigenic drift makes the virus less susceptible to immediate neutralization by the immune system in individuals who have had a previous influenza infection or vaccination. A biannual reevaluation of the vaccine composition is essential to maintain its effectiveness due to this immune escape. The study of influenza genomes is key to this endeavor, increasing our understanding of antigenic drift and enhancing the accuracy of vaccine strain selection. Recent large-scale genome sequencing and antigenic typing have considerably improved our understanding of influenza evolution: epidemics around the globe are seeded from a reservoir in East-Southeast Asia with year-round prevalence of influenza viruses; antigenically similar strains predominate in epidemics worldwide for several years before being replaced by a new antigenic cluster of strains. Future in-depth studies of the influenza reservoir, along with large-scale data mining of genomic resources and the integration of epidemiological, genomic, and antigenic data, should enhance our understanding of antigenic drift and improve the detection and control of antigenically novel emerging strains.

Influenza is a single-stranded, negative-sense RNA virus that causes acute respiratory illness in humans. In temperate regions, winter influenza epidemics result in 250,000–500,000 deaths per year; in tropical regions, the burden is similar [1,2]. Influenza viruses of three genera or types (A, B, and C) circulate in the human population. Influenza viruses of the types B and C evolve slowly and circulate at low levels. Type A evolves rapidly and can evade neutralization by antibodies in individuals who have been previously infected with, or vaccinated against, the virus. As a result it regularly causes large epidemics. Furthermore, distinct reservoirs of influenza A exist in other mammals and in birds. Four times in the last hundred years these reservoirs have provided genetic material for novel viruses that have caused global pandemics [3–8].

The genome of influenza A viruses comprises eight RNA

segments of 0.9–2.3 kb that together span approximately 13.5 kb

and encode 11 proteins [9]. Segment 4 encodes the major surface

glycoprotein called hemagglutinin (H), which is responsible for

attaching the virus to sialic acid residues on the host cell surface and

fusing the virus membrane envelope with the host cell membrane,

thus delivering the viral genome into the cell (Figure 1). Segment 6

encodes another surface glycoprotein called neuraminidase (N),

which cleaves terminal sialic acid residues from glycoproteins and

glycolipids on the host cell surface, thus releasing budding viral

particles from an infected cell [10]. Influenza A viruses are further

classified into distinct subtypes based on the genetic and antigenic

characteristics of these two surface glycoproteins. Sixteen hemagglutinin

(H1–16) and nine neuraminidase subtypes (N1–9) are

known to exist, and they occur in various combinations in influenza

viruses endemic in aquatic birds [10,11]. Viruses with the subtype

composition H1N1 and H3N2 have been circulating in the human

population for several decades. Of these two subtypes, H3N2

evolves more rapidly, and has until recently caused the majority of

infections [1,12,13]. In the spring of 2009, however, a new H1N1

virus originating from swine influenza A viruses, and only distantly

related to the H1N1 already circulating, gained hold in the human

population. The emergence of this virus has initiated the first

influenza pandemic of the twenty-first century [7,14,15].

Hemagglutinin is about five times more abundant than

neuraminidase in the viral membrane and is the major target of

the host immune response [16–18]. Following exposure to the

virus, whether by infection or vaccination, the host immune system

acquires the capacity to produce neutralizing antibodies against

the viral surface glycoproteins. These antibodies participate in

clearing an infection and may protect an individual from future

infections for many decades [19]. Five exposed regions on the

surface of hemagglutinin, called epitope sites, are predominantly

recognized by such antibodies [16,17]. However, the human

subtypes of influenza A continuously evolve and acquire genetic

mutations that result in amino acid changes in the epitopes. These

changes reduce the protective effect of antibodies raised against

previously circulating viral variants. This ‘‘antigenic drift’’

necessitates frequent modification and readministration of the

influenza vaccine to ensure efficient protection (Box 1).


Citation: McHardy AC, Adams B (2009) The Role of Genomics in Tracking the Evolution of Influenza A Virus. PLoS Pathog 5(10): e1000566. doi:10.1371/journal.ppat.1000566



Copyright: © 2009 McHardy et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Funding: The authors received no specific funding for this work.




To monitor for novel emerging strains, the World Health

Organization (WHO) maintains a global surveillance program. A

panel of experts meets twice a year to review antigenic, genetic, and

epidemiological data and decides on the vaccine composition for the

next winter season in the northern or southern hemisphere [20]. If

an emerging antigenic variant is detected and judged likely to

become predominant, an update of the vaccine strain is recommended.

This “predict and produce” approach mostly results in

efficient vaccines that substantially limit the morbidity and mortality

of seasonal epidemics [21]. The recommendation has to be made

almost a year before the season in which the vaccine is used,

however, because of the time required to produce and distribute a

new vaccine. Problems arise when an emerging variant is not

identified early enough for an update of the vaccine composition

[22–24]. Thus, gaining a detailed understanding of the evolution

and epidemiology of the virus is of the utmost importance, as it may

lead to earlier identification of novel emerging variants [20].

The development of high-throughput sequencing has recently

provided large datasets of high-quality, complete genome

sequences for viral isolates collected in a relatively unbiased

manner, regardless of virulence or other unusual characteristics

[9,25]. Analyses of the genome sequence data combined with

large-scale antigenic typing [26,27] have given insights into the

pattern of global spread, the genetic diversity during seasonal

epidemics, and the dynamics of subtype evolution. Influenza data

repositories such as the NCBI Influenza Virus Resource (http://

www.ncbi.nlm.nih.gov/genomes/FLU/FLU.html) [28] and the

Global Initiative on Sharing All Influenza Data (GISAID; http://

platform.gisaid.org/) database [29] make the genomic information

publicly available, together with epidemiological data for the

sequenced isolates. The GISAID model for data sharing requires

users to agree to collaborate with, and appropriately credit, all

data contributors. A notable success of this initiative has been the

contribution of countries, such as Indonesia and China, which

Figure 1. Schematic representation of an influenza A virion. Three proteins, hemagglutinin (HA, a trimer of three identical subunits), neuraminidase (NA, a tetramer of four identical subunits), and the M2 transmembrane proton channel (a tetramer of four identical subunits), are anchored in the viral membrane, which is composed of a lipid bilayer. The large, external domains of hemagglutinin and neuraminidase are the major targets for neutralizing antibodies of the host immune response. The M1 matrix protein is located below the membrane. The genome of the influenza A virus is composed of eight individual RNA segments (conventionally ordered by decreasing length, bottom row), which each encode one or two proteins. Inside the virion, the eight RNA segments are packaged in a complex with nucleoprotein (NP) and the viral polymerase complex, consisting of the PA, PB1, and PB2 proteins. doi:10.1371/journal.ppat.1000566.g001


have previously been reticent about placing data in the public

domain. The WHO also supports the endeavor of rapid

publication of all available sequences for influenza viruses and

there is hope that comprehensive submission to public databases

will soon become a reality [24,30]. In the future, mining these

resources and establishing a statistical framework based on

epidemiological, antigenic, and genetic information could provide

further insights into the rules that govern the emergence and

establishment of antigenically novel variants and improve the

potential for influenza prevention and control.

Host Immune Evasion by Antigenic Drift and Shift

Influenza viruses can rapidly acquire genetic diversity because

of high replication rates in infected hosts, an error-prone RNA

polymerase (which introduces mutations during genome replication),

and segment reassortment (Figure 2). Mutations that change

amino acid residues appear significantly more often than silent

mutations in the evolution of the hemagglutinin gene of human

influenza A, particularly in the protein epitopes [31–34]. This

observation indicates that selection for antigenic change of the

virus is the driving force in the evolutionary ‘‘arms race’’ between

the virus and the immunity of the human population [35].

Reassortment of the eight genome segments between two distinct

viruses present simultaneously in a host cell can result in hybrid

viruses with genome segments from two different progenitors.
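The excess of amino acid changes over silent mutations described above can be tallied directly from a pair of codon-aligned coding sequences. The following Python sketch is illustrative only: the function name `classify_codon_changes` and the toy sequences are hypothetical, the standard genetic code is assumed, and the code simply counts differing codons rather than estimating the normalized substitution rates used in the cited studies [31–34].

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G base order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_codon_changes(seq_a, seq_b):
    """Count codons that differ between two aligned coding sequences.

    Returns (nonsynonymous, silent): codons whose encoded amino acid
    changes, and codons that differ but encode the same amino acid.
    """
    if len(seq_a) != len(seq_b) or len(seq_a) % 3 != 0:
        raise ValueError("sequences must be codon-aligned and of equal length")
    nonsynonymous = silent = 0
    for i in range(0, len(seq_a), 3):
        codon_a, codon_b = seq_a[i:i + 3], seq_b[i:i + 3]
        if codon_a == codon_b:
            continue
        if CODON_TABLE[codon_a] != CODON_TABLE[codon_b]:
            nonsynonymous += 1
        else:
            silent += 1
    return nonsynonymous, silent


# Hypothetical toy sequences (not real hemagglutinin data).
ancestor = "ATGAAAGGTCTG"    # M K G L
descendant = "ATGAGAGGCCTA"  # M R G L
print(classify_codon_changes(ancestor, descendant))  # (1, 2)
```

A raw count like this ignores the fact that more mutational paths lead to amino acid changes than to silent changes, so published analyses normalize by the number of available sites before comparing the two classes.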

Antigenic mapping allows researchers to generate a quantitative,

two-dimensional representation of antigenic distances between

genetically divergent strains [26]. This technique has

revealed that the relationship between antigenic change and

genetic change is nonlinear for the hemagglutinin of influenza A/

H3N2. The rate of genetic change of the virus is almost constant

over time, but some mutations exert a disproportionately large

effect on the antigenic type, whereas others are ‘‘hitchhikers’’ with

no phenotypic effect. Elucidating the effects of different mutations

at individual sites on the antigenic type will improve our

understanding of the overall genotype-to-phenotype mapping for

antigenic drift. Furthermore, the antigenic drift of H3N2 is not

continuous but punctuated: antigenically homogeneous clusters of

strains predominate for an average of 3 years before being

replaced by a new cluster. In accordance with the punctuated

nature of antigenic drift, periods of predominantly neutral

evolution alternate with periods of strong selection for antigenic

change [13,36]. Phylogenetic trees illustrating the evolution of the

hemagglutinin gene of H3N2 have a cactus-like shape with a

strong temporal structure in which the trunk represents the

succession of surviving viral lineages over time. Short side

branches indicate that most strains are driven to extinction and

that viral diversity at any given time is limited [31,34]. The

underlying causes of this punctuated antigenic drift and limited

viral diversity at a given point in time have been investigated in

phylodynamic modeling studies (Box 2).
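The contrast between steady genetic change and punctuated antigenic change can be made concrete with a toy calculation. In the Python sketch below, every mutation adds one unit of genetic change, but only a few carry an antigenic effect. The mutation labels, effect sizes, threshold, and function names are all invented for illustration and are not taken from the antigenic mapping study [26].

```python
from itertools import accumulate

# Toy lineage: (mutation label, antigenic effect in arbitrary units).
# Effects of 0.0 represent "hitchhikers"; all values are invented.
MUTATIONS = [
    ("m1", 0.0), ("m2", 0.0), ("m3", 4.0), ("m4", 0.0),
    ("m5", 0.0), ("m6", 0.0), ("m7", 3.5), ("m8", 0.0),
]


def cumulative_change(mutations):
    """Return parallel lists of cumulative genetic and antigenic change."""
    genetic = list(range(1, len(mutations) + 1))  # one step per mutation
    antigenic = list(accumulate(effect for _, effect in mutations))
    return genetic, antigenic


def cluster_transitions(mutations, threshold=2.0):
    """List mutations whose effect meets a hypothetical cluster-jump threshold."""
    return [label for label, effect in mutations if effect >= threshold]


genetic, antigenic = cumulative_change(MUTATIONS)
print(genetic)    # rises by one at every step
print(antigenic)  # flat except for jumps at m3 and m7
print(cluster_transitions(MUTATIONS))
```

Plotting `antigenic` against `genetic` for this toy lineage gives a staircase rather than a straight line, which is the qualitative pattern the text describes.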

Major changes in antigenicity (antigenic shift) are associated

with the introduction of novel viruses into the human population

that have a hemagglutinin segment of an influenza A virus from

another host species and can be transmitted efficiently among

humans [5]. Such viruses may arise by segment reassortment

between a human influenza A virus and an influenza A virus from

another host species. Alternatively, an entire virus from another

host species may cross into the human population. The

appearance of such viruses is rare, as it requires the viral genes

encoded by the different segments to be compatible with each

other and the virus to be capable of replication and transmission in

the human population, which is also thought to be a polygenic trait

[6,7,10,37,38]. Antigenic shift can have grave consequences

because neutralizing antibodies against the viral surface proteins

offer limited or no cross-protection across subtypes. Cross-

protection can also be very limited between viruses of the same

subtype that have evolved independently in different hosts for long

periods of time [14]. Thus, a larger part of the population is

susceptible to infection with such viruses than to infection with

endemic viruses [10,14]. Antigenic shift caused three global

pandemics in the twentieth century: the 1918 H1N1 pandemic,

the 1957 H2N2 pandemic, and the 1968 H3N2 pandemic (reviewed

in [3–5,8]). The 1918 pandemic had the most devastating impact,

with an estimated 20–50 million deaths worldwide [39]. There is

some uncertainty concerning the origin of the 1918 virus due to

the lack of data from this time [6,40–43]. A recent phylogenetic

study suggests that this virus may have been generated by

reassortment of avian viruses with already circulating viruses in a

mammalian host such as human or swine [44]. The H2N2 virus

that caused the 1957 pandemic was a reassortant of five human

H1N1 segments and avian segments encoding the viral surface

proteins and the PB1 protein. Similarly, the reassortant H3N2

virus of the 1968 pandemic featured avian segments encoding

hemagglutinin and PB1. H3N2 still circulates today, together with

an H1N1 lineage introduced in 1977, which is similar to the H1N1

viruses circulating in the 1950s [4].

The first pandemic virus of the twenty-first century probably

entered the human population in January or February of 2009

[15]. Phylogenetic analyses of the viral genome determined that

the virus has a complex reassortment history with segments of

‘‘avian-like’’ Eurasian swine influenza A viruses (NA and M) that

were first observed in Eurasian swine in 1979, and of a triple

reassortant virus identified in North American swine after 1998.

The segments derived from the triple reassortant stem themselves

from human H3N2 (PB1), an avian influenza A virus (PA, PB2),

and classical North American swine influenza A viruses (HA, NP,

NS), which have a common ancestry with the 1918 H1N1 virus

[14,45]. Experiments have shown that the new H1N1 virus

replicates efficiently in mammalian model organisms such as

Box 1. Broadly Protective Vaccines

Current influenza vaccines are based on detergent-inactivated viruses. They elicit antibodies with a narrow range of protection that target predominantly the variable regions of the hemagglutinin protein. Accordingly, the seasonal influenza vaccine includes one strain with segments of the surface proteins for each of the A/H1N1, A/H3N2 and B viruses, and it is updated every 1–3 years to match the predominant variants of influenza. Research into vaccines that offer broader protection across diverse subtypes and antigenic drift variants is ongoing [21,59–61]. This research is particularly important with respect to the emergence of novel viruses with pandemic potential, such as the 2009 H1N1 virus. In such an event, the time period between the detection of the virus and the onset of a pandemic is too short to produce a specific vaccine for immediate vaccination of the population. Work in this area is focused on developing vaccines that elicit antibodies against conserved viral components, such as certain regions of hemagglutinin, neuraminidase, and the M2 proton channel in the viral membrane [60]. Other types of vaccines based on live attenuated viruses or plasmid DNA expression vectors, or supplemented with adjuvants, show promise in inducing a more broadly protective immune response [61].


ferrets, mice, and cynomolgus macaques and is likely to be capable

of long-term circulation in the human population, particularly in

the event of further adaptive changes through mutation or

reassortment [46–48]. The novel H1N1 appears, so far, to cause

relatively mild human infections in comparison to other viruses

such as the highly pathogenic H5N1 avian influenza A viruses

that, since 1997, have repeatedly been transmitted to humans and

caused severe disease but so far have not been capable of sustained

transmission between humans. The emergence of a novel

pandemic virus, which may have been circulating undetected in

swine for a decade [14,45], has highlighted the need for increased

genomic surveillance of the viral populations in mammalian hosts

such as swine. These hosts could be a vessel for mammalian

adaptation of avian viruses, either by reassortment with human or

swine viruses or through adaptive changes [8], but have been

monitored less intensely than avian populations. The latest

emergence of a pandemic H1N1 virus has also underscored the

vital importance of further research into the molecular factors that

determine the host range and capacity for sustained human-to-

human transmission of influenza A viruses.

Reassortment in Subtype Evolution

Whole-genome studies have revealed that segment reassortment

between different viruses of the same subtype is an important

mechanism in the evolution of human-adapted subtypes and

generates extensive genome-wide diversity [34,36,49–51]. Periodic

selective sweeps caused by a novel antigenic drift variant rising to

predominance reduce the genomic diversity of the circulating viral

population, either genome-wide or for the hemagglutinin segment

only [12]. Reassortment results in substantial differences in the

evolutionary histories of individual segments. However, similarities

in the histories of some segments indicate that besides the antigenic

characteristics of hemagglutinin, the genomic context and compatibility

of certain segment combinations might be an important

contributor to viral fitness [12,51]. A case in point is the

antigenically novel ‘‘Fujian’’ strain which became predominant in

the 2003–2004 season, following a reassortment event that placed a

hemagglutinin segment from a lineage that had been circulating at

low levels for several years into a new genomic context [49]. The

importance of other segments in the adaptive evolution of the virus

is further supported by the observation that a number of other

segments, including the one encoding neuraminidase, evolve at

similar rates to the segment encoding hemagglutinin [12].
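To give a sense of how much genome-wide diversity reassortment can generate, the sketch below enumerates the segment combinations available when two parental viruses co-infect a cell. It is a counting exercise under the simplifying assumption that every combination is viable, which the compatibility constraints discussed above suggest is not the case. The parent labels and the function name are hypothetical.

```python
from itertools import product

# The eight influenza A genome segments, named by the proteins cited in the text.
SEGMENTS = ("PB2", "PB1", "PA", "HA", "NP", "NA", "M", "NS")


def reassortant_genotypes(parent_a="A", parent_b="B"):
    """Return every genotype that carries segments from both parents."""
    genotypes = []
    for origins in product((parent_a, parent_b), repeat=len(SEGMENTS)):
        # Skip the two purely parental genotypes.
        if len(set(origins)) == 2:
            genotypes.append(dict(zip(SEGMENTS, origins)))
    return genotypes


hybrids = reassortant_genotypes()
print(len(hybrids))  # 2**8 - 2 = 254
```

Even with only two parents, 254 hybrid genotypes are possible in principle; the observation that some segments share similar evolutionary histories suggests that far fewer of them are fit enough to persist.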

Geographic Spread

Genomic analysis has led to profound insights into the global

patterns of circulation and evolution of influenza A. Over the

course of seasonal epidemics in temperate regions, little evidence

has been found for selection for amino acid change and adaptive

evolution in the antigenic regions of the surface proteins [36].

There is, however, substantial genetic diversity due to multiple

introductions of distinct strains, wide spatial spread, and frequent

Figure 2. Generation of genetic diversity and antigenic drift in the evolution of human influenza A viruses. Blue and yellow viruses depict two antigenically similar strains of the same subtype circulating in the human population. The genetic diversity of the circulating viral population increases through mutation and reassortment. Single white arrows indicate relationships between ancestral and descendant viruses. White marks on the segments indicate neutral mutations and red marks indicate mutations that affect the antigenic regions of the surface proteins. Incoming pairs of orange arrows indicate the generation of reassortants with segments from two different ancestral viruses. As these viruses continue to circulate, immunity against them builds up in the host population, represented here by the narrowing of the bottleneck. In parallel, viruses with mutations affecting the antigenic regions of the surface proteins accumulate in the viral population. At some point a novel antigenic drift variant, indicated by a red colored virus, which is less affected by immunity in the human population, is generated. This variant is able to cause widespread infection and founds a new cluster of antigenically similar strains. doi:10.1371/journal.ppat.1000566.g002


segment reassortment in seasonal epidemics [9,12,36,49,50]. The

viral population circulating in one season does not directly seed the

epidemic in the following one. Instead, gene flow and viral spread

are global, with similar strains appearing in northern and southern

hemisphere epidemics across several seasons. There is a global

reservoir of viral diversity from which seasonal epidemics in

temperate regions are seeded [12,27,52]. This reservoir is located

in East-Southeast Asia, where a region-wide network of temporally

overlapping epidemics maintains infection incidence throughout

the year [27]. Novel strains appear in this region on average 6–9

months before they emerge in Oceania, Europe, and North

America and 12–18 months before they reach South America.


A key objective for research into the antigenic drift of influenza A is

to improve the accuracy of vaccine strain choice, in particular for

seasons preceding the establishment of novel antigenic drift variants.

More intensive surveillance and sampling, particularly in East-

Southeast Asia, could facilitate the early detection of novel emerging

drift variants and alleviate problems related to the time required for

vaccine production. A better understanding of the evolutionary and

epidemiological rules governing antigenic drift, viral fitness, the role

of the source region, and establishment of predominance would be

particularly helpful for the selection of vaccine strains when

considerable variation among antigenically novel strains is observed

and it is unclear which, if any, will become predominant. Such

insights are likely to come both from phylodynamic modeling studies

and by mining genomic resources for genome-wide properties

associated with viral fitness and predominance. Some molecular

properties of hemagglutinin with predictive value for this task have

already been identified [53–56], such as the number of changes at

sites under positive selection or in the most extensively altered

epitope, although the sites under selection might change over time

[26]. It is notable that the lack of antigenic information for sequenced

viral isolates in public repositories currently restricts the direct analysis

of genetic determinants in antigenic drift [24]. If the World Health

Organization were to establish similar policies for the deposition of

antigenic information into public databases as exist for sequence data,

this could create a valuable resource for research in this area. As

existing databases grow, new statistical and computational techniques

are being developed for interpretation of these large-scale, population-level

genomic datasets in combination with epidemiological and

phenotypic information [57]. Ultimately, the expert analysis of the

WHO in the detection and control of antigenically novel emerging

strains could be extensively supported by the development of a

suitable predictive framework based on statistical learning that takes

into consideration the population-level phylodynamics of antigenic

change [57,58]. Such a framework could utilize epidemiological,

genomic, and antigenic information and detailed knowledge of the

genetic and epidemiological characteristics of antigenic drift to assess

the likelihood of strains rising to predominance.
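One of the predictive properties mentioned above, the number of amino acid changes in the most extensively altered epitope, is simple to compute once sequences are aligned. The Python sketch below assumes five epitopes labelled A to E. The site positions, the toy sequences, and the function names are placeholders invented for the example and are not the published hemagglutinin epitope definitions.

```python
# Hypothetical epitope definitions: 0-based alignment positions per epitope.
# A real analysis would substitute published hemagglutinin epitope site lists.
EPITOPE_SITES = {
    "A": [2, 5, 7],
    "B": [1, 9],
    "C": [0, 3],
    "D": [4, 8],
    "E": [6],
}


def epitope_changes(reference, candidate, epitope_sites=EPITOPE_SITES):
    """Count amino acid differences per epitope between aligned sequences."""
    if len(reference) != len(candidate):
        raise ValueError("sequences must be aligned to the same length")
    return {
        name: sum(reference[i] != candidate[i] for i in sites)
        for name, sites in epitope_sites.items()
    }


def most_altered_epitope(reference, candidate, epitope_sites=EPITOPE_SITES):
    """Return (epitope name, change count) for the most altered epitope."""
    counts = epitope_changes(reference, candidate, epitope_sites)
    name = max(counts, key=counts.get)
    return name, counts[name]


vaccine_strain = "KTSNDQGPRL"   # toy amino acid sequences
emerging_strain = "KTANDHGSRL"
print(epitope_changes(vaccine_strain, emerging_strain))
print(most_altered_epitope(vaccine_strain, emerging_strain))  # ('A', 3)
```

A score of this kind would be only one input to the statistical framework envisaged here, alongside epidemiological and antigenic data.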

Acknowledgments

We thank Linus Roune for his help creating the figures.

References

1. WHO (2003) Fact sheet number 211. Available: http://www.who.int/

mediacentre/factsheets/fs211/en/. Accessed 13 August 2009.

2. Viboud C, Alonso WJ, Simonsen L (2006) Influenza in tropical regions. PLoS

Med 3: e89. doi:10.1371/journal.pmed.0030089.

3. Palese P (2004) Influenza: old and new threats. Nat Med 10: S82–87.

4. Kilbourne ED (2006) Influenza pandemics of the 20th century. Emerg Infect Dis

12: 9–14.

5. Cox NJ, Subbarao K (2000) Global epidemiology of influenza: past and present.

Annu Rev Med 51: 407–421.

6. Morens DM, Taubenberger JK, Fauci AS (2009) The persistent legacy of the

1918 influenza virus. N Engl J Med 361: 225–229.

7. Neumann G, Noda T, Kawaoka Y (2009) Emergence and pandemic potential of

swine-origin H1N1 influenza virus. Nature 459: 931–939.

8. Horimoto T, Kawaoka Y (2005) Influenza: Lessons from past pandemics,

warnings from current incidents. Nat Rev Microbiol 3: 591–600.

9. Ghedin E, Sengamalay NA, Shumway M, Zaborsky J, Feldblyum T, et al. (2005)

Large-scale sequencing of human influenza reveals the dynamic nature of viral

genome evolution. Nature 437: 1162–1166.

10. Webster RG, Bean WJ, Gorman OT, Chambers TM, Kawaoka Y (1992)

Evolution and ecology of influenza A viruses. Microbiol Rev 56: 152–179.

11. Fouchier RA, Munster V, Wallensten A, Bestebroer TM, Herfst S, et al. (2005)

Characterization of a novel influenza A virus hemagglutinin subtype (H16)

obtained from black-headed gulls. J Virol 79: 2814–2822.

12. Rambaut A, Pybus OG, Nelson MI, Viboud C, Taubenberger JK, et al. (2008)

The genomic and epidemiological dynamics of human influenza A virus. Nature

453: 615–619.

13. Wolf YI, Viboud C, Holmes EC, Koonin EV, Lipman DJ (2006) Long intervals

of stasis punctuated by bursts of positive selection in the seasonal evolution of

influenza A virus. Biol Direct 1: 34.

14. Garten RJ, Davis CT, Russell CA, Shu B, Lindstrom S, et al. (2009) Antigenic

and genetic characteristics of swine-origin 2009 A(H1N1) influenza viruses

circulating in humans. Science 325: 197–201.

15. Fraser C, Donnelly CA, Cauchemez S, Hanage WP, Van Kerkhove MD, et al.

(2009) Pandemic potential of a strain of influenza A (H1N1): Early findings.

Science 324: 1557–1561.

16. Wiley DC, Wilson IA, Skehel JJ (1981) Structural identification of the antibody-

binding sites of Hong Kong influenza haemagglutinin and their involvement in

antigenic variation. Nature 289: 373–378.

17. Wilson IA, Cox NJ (1990) Structural basis of immune recognition of influenza

virus hemagglutinin. Annu Rev Immunol 8: 737–771.

Box 2. Modeling Antigenic Evolution

There is a long history of the use of mathematical models to study epidemiological and evolutionary systems [63]. For rapidly evolving RNA viruses such as influenza the dynamics of these systems are densely interwoven, and recent work has sought to develop unified “phylodynamic” models to examine the processes underlying the observed epidemiological and evolutionary patterns (reviewed in [35]). A better understanding of the mechanisms driving viral evolution will enhance our capacity to accurately identify novel emerging strains. For influenza, phylodynamic models have been developed to probe the complex processes relating to viral persistence in the human population, antigenic turnover, and the limited genetic diversity at any given point in time. The first models predicted that diversity increases exponentially unless long-term, partial cross-immunity between strains is supplemented by temporary broad immunity that lasts for several months and protects against all infections, regardless of the genetic or antigenic similarity of strains [64,65]. Subsequently, it has been proposed that a genotype-to-phenotype mapping defined by neutral networks underlies influenza evolution [66]. A neutral network is a set of genotypes linked by single mutations that all map to the same phenotype, in this case the antigenic characteristics of a virus. Hence, genetic divergence is not accompanied by antigenic divergence as long as the genotype remains in the same network. In certain genetic contexts, however, mutations can move a genotype onto an adjacent network, resulting in a significant change in the antigenic phenotype. Incorporating this evolutionary framework into an epidemiological model leads to both epidemiological and evolutionary patterns characteristic of human influenza A/H3N2.



18. Wilson IA, Skehel JJ, Wiley DC (1981) Structure of the haemagglutinin membrane glycoprotein of influenza virus at 3 Å resolution. Nature 289: 366–373.

19. Yu X, Tsibane T, McGraw PA, House FS, Keefer CJ, et al. (2008) Neutralizing antibodies derived from the B cells of 1918 influenza pandemic survivors. Nature 455: 532–536.

20. Russell CA, Jones TC, Barr IG, Cox NJ, Garten RJ, et al. (2008) Influenza vaccine strain selection and recent studies on the global migration of seasonal influenza viruses. Vaccine 26(Suppl 4): D31–34.

21. Karlsson Hedestam GB, Fouchier RA, Phogat S, Burton DR, Sodroski J, et al. (2008) The challenges of eliciting neutralizing antibodies to HIV-1 and to influenza virus. Nat Rev Microbiol 6: 143–155.

22. de Jong JC, Beyer WE, Palache AM, Rimmelzwaan GF, Osterhaus AD (2000) Mismatch between the 1997/1998 influenza vaccine and the major epidemic A(H3N2) virus strain as the cause of an inadequate vaccine-induced antibody response to this strain in the elderly. J Med Virol 61: 94–99.

23. CDC (2004) Preliminary assessment of the effectiveness of the 2003–04 inactivated influenza vaccine—Colorado, December 2003. MMWR Morb Mortal Wkly Rep 53: 8–11.

24. Salzberg S (2008) The contents of the syringe. Nature 454: 160–161.

25. Obenauer JC, Denson J, Mehta PK, Su X, Mukatira S, et al. (2006) Large-scale sequence analysis of avian influenza isolates. Science 311: 1576–1580.

26. Smith DJ, Lapedes AS, de Jong JC, Bestebroer TM, Rimmelzwaan GF, et al. (2004) Mapping the antigenic and genetic evolution of influenza virus. Science 305: 371–376.

27. Russell CA, Jones TC, Barr IG, Cox NJ, Garten RJ, et al. (2008) The global circulation of seasonal influenza A (H3N2) viruses. Science 320: 340–346.

28. Bao Y, Bolotov P, Dernovoy D, Kiryutin B, Zaslavsky L, et al. (2008) The influenza virus resource at the National Center for Biotechnology Information. J Virol 82: 596–601.

29. Enserink M (2007) Data sharing. New Swiss influenza database to test promises of access. Science 315: 923.

30. Bogner P, Capua I, Lipman DJ, Cox NJ, et al. (2006) A global initiative on sharing avian flu data. Nature 442: 981.

31. Fitch WM, Leiter JM, Li XQ, Palese P (1991) Positive Darwinian evolution in human influenza A viruses. Proc Natl Acad Sci U S A 88: 4270–4274.

32. Fitch WM, Bush RM, Bender CA, Cox NJ (1997) Long term trends in the evolution of H(3) HA1 human influenza type A. Proc Natl Acad Sci U S A 94: 7712–7718.

33. Bush RM, Fitch WM, Bender CA, Cox NJ (1999) Positive selection on the H3 hemagglutinin gene of human influenza virus A. Mol Biol Evol 16: 1457–1465.

34. Nelson MI, Holmes EC (2007) The evolution of epidemic influenza. Nat Rev Genet 8: 196–205.

35. Grenfell BT, Pybus OG, Gog JR, Wood JL, Daly JM, et al. (2004) Unifying the epidemiological and evolutionary dynamics of pathogens. Science 303: 327–332.

36. Nelson MI, Simonsen L, Viboud C, Miller MA, Taylor J, et al. (2006) Stochastic processes are key determinants of short-term evolution in influenza A virus. PLoS Pathog 2: e125. doi:10.1371/journal.ppat.0020125.

37. Lowen AC, Palese P (2007) Influenza virus transmission: Basic science and implications for the use of antiviral drugs during a pandemic. Infect Disord Drug Targets 7: 318–328.

38. Kuiken T, Holmes EC, McCauley J, Rimmelzwaan GF, Williams CS, et al. (2006) Host species barriers to influenza virus infections. Science 312: 394–397.

39. Johnson NP, Mueller J (2002) Updating the accounts: Global mortality of the 1918–1920 “Spanish” influenza pandemic. Bull Hist Med 76: 105–115.

40. Taubenberger JK, Reid AH, Lourens RM, Wang R, Jin G, et al. (2005) Characterization of the 1918 influenza virus polymerase genes. Nature 437: 889–893.

41. Reid AH, Taubenberger JK, Fanning TG (2004) Evidence of an absence: The genetic origins of the 1918 pandemic influenza virus. Nat Rev Microbiol 2: 909–914.

42. Antonovics J, Hood ME, Baker CH (2006) Molecular virology: Was the 1918 flu

avian in origin? Nature 440: E9; discussion E9–10.

43. Taubenberger JK (2006) The origin and virulence of the 1918 ‘‘Spanish’’

influenza virus. Proc Am Philos Soc 150: 86–112.

44. Smith GJ, Bahl J, Vijaykrishna D, Zhang J, Poon LL, et al. (2009) Dating the

emergence of pandemic influenza viruses. Proc Natl Acad Sci U S A 106:

11709–11712.

45. Smith GJ, Vijaykrishna D, Bahl J, Lycett SJ, Worobey M, et al. (2009) Origins and evolutionary genomics of the 2009 swine-origin H1N1 influenza A epidemic. Nature 459: 1122–1125.

46. Maines TR, Jayaraman A, Belser JA, Wadford DA, Pappas C, et al. (2009) Transmission and pathogenesis of swine-origin 2009 A(H1N1) influenza viruses in ferrets and mice. Science 325: 484–487.

47. Munster VJ, de Wit E, van den Brand JM, Herfst S, Schrauwen EJ, et al. (2009) Pathogenesis and transmission of swine-origin 2009 A(H1N1) influenza virus in ferrets. Science 325: 481–483.

48. Itoh Y, Shinya K, Kiso M, Watanabe T, Sakoda Y, et al. (2009) In vitro and in vivo characterization of new swine-origin H1N1 influenza viruses. Nature; E-pub ahead of print. doi:10.1038/nature08260.

49. Holmes EC, Ghedin E, Miller N, Taylor J, Bao Y, et al. (2005) Whole-genome analysis of human influenza A virus reveals multiple persistent lineages and reassortment among recent H3N2 viruses. PLoS Biol 3: e300. doi:10.1371/


50. Nelson MI, Edelman L, Spiro DJ, Boyne AR, Bera J, et al. (2008) Molecular epidemiology of A/H3N2 and A/H1N1 influenza virus during a single epidemic season in the United States. PLoS Pathog 4: e1000133. doi:10.1371/journal.ppat.1000133.

51. Nelson MI, Viboud C, Simonsen L, Bennett RT, Griesemer SB, et al. (2008) Multiple reassortment events in the evolutionary history of H1N1 influenza A virus since 1918. PLoS Pathog 4: e1000012. doi:10.1371/journal.ppat.1000012.

52. Nelson MI, Simonsen L, Viboud C, Miller MA, Holmes EC (2007) Phylogenetic

analysis reveals the global migration of seasonal influenza A viruses. PLoS


53. Fitch WM, Bush RM, Bender CA, Subbarao K, Cox NJ (2000) The Wilhelmine E. Key 1999 Invitational lecture. Predicting the evolution of human influenza A. J Hered 91: 183–185.

54. Gupta V, Earl DJ, Deem MW (2006) Quantifying influenza vaccine efficacy and antigenic distance. Vaccine 24: 3881–3888.

55. Blackburne BP, Hay AJ, Goldstein RA (2008) Changing selective pressure during antigenic changes in human influenza H3. PLoS Pathog 4: e1000058. doi:10.1371/journal.ppat.1000058.

56. Kryazhimskiy S, Bazykin GA, Plotkin J, Dushoff J (2008) Directionality in the evolution of influenza A haemagglutinin. Proc Biol Sci 275: 2455–2464.

57. Pybus OG, Rambaut A (2009) Modelling: Evolutionary analysis of the dynamics

of viral infectious disease. Nat Rev Genet 10: 540–550.

58. Bishop CM (2006) Pattern recognition and machine learning. In: Jordan M,

Kleinberg J, Schoellkopf B, eds. , Singapore: Springer.

59. Sui J, Hwang WC, Perez S, Wei G, Aird D, et al. (2009) Structural and

functional bases for broad-spectrum neutralization of avian and human influenza A viruses. Nat Struct Mol Biol 16: 265–273.

60. Gerhard W, Mozdzanowska K, Zharikova D (2006) Prospects for universal

influenza virus vaccine. Emerg Infect Dis 12: 569–574.

61. Carrat F, Flahault A (2007) Influenza vaccine: The challenge of antigenic drift.

Vaccine 25: 6852–6862.

62. Fisher RA (1999) The genetical theory of natural selection. Oxford (UK):

Oxford University Press. pp 318.

63. Ross R (1910) The prevention of malaria. New York: E.P. Dutton. pp 669.

64. Ferguson NM, Galvani AP, Bush RM (2003) Ecological and immunological

determinants of influenza evolution. Nature 422: 428–433.

65. Tria F, Lassig M, Peliti L, Franz S (2005) A minimal stochastic model for

influenza evolution. J Stat Mech;doi:10.1088/1742-5468/2005/07/P07008.

66. Koelle K, Cobey S, Grenfell B, Pascual M (2006) Epochal evolution shapes the

phylodynamics of interpandemic influenza A (H3N2) in humans. Science 314: 1898–1903.


Review

The Past and Future of Tuberculosis Research

Inaki Comas, Sebastien Gagneux*

Division of Mycobacterial Research, MRC National Institute for Medical Research, London, United Kingdom

Abstract: Renewed efforts in tuberculosis (TB) research have led to important new insights into the biology and epidemiology of this devastating disease. Yet, in the face of the modern epidemics of HIV/AIDS, diabetes, and multidrug resistance, all of which contribute to susceptibility to TB, global control of the disease will remain a formidable challenge for years to come. New high-throughput genomics technologies are already contributing to studies of TB’s epidemiology, comparative genomics, evolution, and host–pathogen interaction. We argue here, however, that new multidisciplinary approaches, especially the integration of epidemiology with systems biology in what we call “systems epidemiology,” will be required to eliminate TB.

Introduction

Tuberculosis (TB) remains an important public health problem

[1]. With close to 10 million new cases per year, and a pool of two

billion latently infected individuals, control efforts are struggling in

many parts of the world (Figure 1). Nevertheless, the renewed

interest in research and improved funding for TB give reasons for

optimism. Recently, the Stop TB Partnership, a network of

concerned governments, organizations, and donors led by the

WHO (http://www.stoptb.org/stop_tb_initiative/), outlined a

global plan to halve TB prevalence and mortality by 2015 and

eliminate the disease as a public health problem by 2050 [2].

Attaining these goals will depend on both strong government

commitment and increased interdisciplinary research and

development. As existing diagnostics, drugs, and vaccines will be

insufficient to achieve these objectives, a substantial effort in both

basic science and epidemiology will be necessary to develop better

tools and strategies to control TB [3]. Here we review the recent

history of TB research and some of the latest insights into the

evolutionary history of the disease. We then discuss ways in which

we could benefit from a more comprehensive systems approach to

control TB in the future.

Recent History of the Field

TB is caused by several species of gram-positive bacteria known

as tubercle bacilli or Mycobacterium tuberculosis complex (MTBC).

MTBC includes obligate human pathogens such as Mycobacterium

tuberculosis and Mycobacterium africanum, as well as organisms adapted

to various other species of mammal. In the developed world, TB

incidence declined steadily during the second half of the 20th

century and so funds available for research and control of TB

decreased substantially during that time [4]. When TB started to

reemerge in the early 1990s, fuelled by the growing pandemic of

HIV/AIDS (Box 1), scientists and public health officials were

caught off-guard; billions of dollars of emergency funds were

necessary to control TB outbreaks [5]. Moreover, long-term

neglect of basic TB research and product development meant that

global TB control relied on a 100-year-old diagnostic method (i.e.,

sputum smear microscopy) of poor sensitivity, an 80-year-old and

largely ineffective vaccine (Bacille Calmette-Guerin [BCG]), and

just a few drugs that were decades old (streptomycin, rifampicin,

isoniazid, ethambutol, pyrazinamide) [3]. Tragically, these are the

tools still in use today in most parts of the world where TB remains

one of the most important public health problems (Figure 1).

In addition to the lack of appropriate tools to control TB

globally, much about the disease was unknown in the early 1990s

and many dogmas were guiding the field at the time. These

included the view that differences in the clinical manifestation of

TB were primarily driven by host variables and the environment

as opposed to bacterial factors, a notion reinforced by early DNA

sequencing studies that reported very limited genetic diversity in

MTBC compared with other bacterial pathogens [6]. According

to other dogmas, TB was mainly a consequence of reactivation of

latent infections rather than ongoing disease transmission, and

mixed infections and exogenous reinfections with different strains

were very unlikely.

The development of molecular techniques to differentiate

between strains of MTBC made it possible to readdress some of

these points. One of these methods, a DNA fingerprinting protocol

based on the Mycobacterium insertion sequence IS6110, quickly

evolved into the first international gold standard for genotyping of

MTBC [7]. It also became a key component of pragmatic public

health efforts, such as detecting disease outbreaks and ongoing TB

transmission [8], and allowed differentiation between patients who

relapsed due to treatment failure and those reinfected with a

different strain [9]. This latter finding demonstrated for the first

time that previous exposure to MTBC does not protect against

subsequent exogenous reinfection and TB disease, which is a

phenomenon with implications for vaccine design. Many other

new insights were gained through these molecular epidemiological

studies [10], which, for the most part, were performed in wealthy

countries; corresponding data from most high-burden areas

remained limited because of poor infrastructure and lack of

funding.

Routine genotyping of MTBC for public health purposes also

revived discussions about the role of pathogen variation in

Citation: Comas I, Gagneux S (2009) The Past and Future of Tuberculosis Research. PLoS Pathog 5(10): e1000600. doi:10.1371/journal.ppat.1000600

Copyright: © 2009 Comas, Gagneux. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Funding: Work in our laboratory is supported by the Medical Research Council, UK, and the US National Institutes of Health grants HHSN266200700022C and AI034238. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.





outcome of infection and disease. Some strains of MTBC

appeared over-represented in particular patient populations,

which suggested that strain diversity may have epidemiological

implications. The completion of the first whole genome sequence

of M. tuberculosis in 1998 [11] and the development of DNA

microarrays offered a new opportunity to address this question by

interrogating the entire genome of multiple clinical strains of

MTBC. These comparative genomics studies revealed that

genomic deletions, also known as large sequence polymorphisms

(LSPs), are an important source of genome plasticity in MTBC

[12]. Furthermore, statistical analyses of patient data suggested

possible associations between strain genomic content and disease

severity in humans [13]. Clinical phenotypes in TB are difficult to

standardize, however, and whether MTBC genotype plays a

meaningful role in TB severity remains controversial [14].

Comparative genomics of MTBC also yielded interesting insights

into the evolution and geographic distribution of the organism.

Because MTBC has essentially no detectable horizontal gene transfer

[15,16], LSPs can be used as phylogenetic markers to trace the

evolutionary relationships of different strain families. Following such

an approach, studies have shown that humans did not, as previously

believed, acquire MTBC from animals during the initiation of animal

domestication; rather, the human- and animal-adapted members of

MTBC share a common ancestor, which might have infected humans

even before the Neolithic transition [17,18]. LSPs also allowed

researchers to define several discrete strain lineages within the human-

adapted members of MTBC, which are associated with different

human populations and geographical regions (Figures 2 and 3)

[15,19,20]. Because of the lack of horizontal gene exchange in

MTBC, phylogenetic trees derived using various molecular markers

define the same phylogenetic groupings [21], and several studies based

on single nucleotide polymorphisms (SNPs) and other molecular

markers have gathered additional support for the highly phylogeographical

population structure of MTBC [22–25].
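The logic of using LSPs as phylogenetic markers can be sketched in a few lines of Python. In a strictly clonal organism a deletion cannot be regained, so it is inherited by all descendants of the strain in which it arose; the sets of strains carrying any two deletions should therefore be either nested or disjoint. The strain names, deletion labels, and helper functions below are hypothetical and serve only as an illustration; they are not taken from the studies cited above.

```python
from itertools import combinations

# Hypothetical deletion profiles: strain -> set of deleted regions (LSPs).
# All names are invented for illustration.
profiles = {
    "strain_A": {"del_1", "del_2"},
    "strain_B": {"del_1", "del_2"},
    "strain_C": {"del_1"},
    "strain_D": {"del_3"},
    "strain_E": set(),
}


def carriers(profiles):
    """Map each deletion to the set of strains that carry it."""
    out = {}
    for strain, deletions in profiles.items():
        for deletion in deletions:
            out.setdefault(deletion, set()).add(strain)
    return out


def is_tree_compatible(profiles):
    """True if, for every pair of deletions, the carrier sets are nested or disjoint."""
    carrier_sets = list(carriers(profiles).values())
    for a, b in combinations(carrier_sets, 2):
        if not (a <= b or b <= a or a.isdisjoint(b)):
            return False
    return True


def group_by_profile(profiles):
    """Group strains that share exactly the same set of deletions."""
    groups = {}
    for strain, deletions in profiles.items():
        groups.setdefault(frozenset(deletions), []).append(strain)
    return {key: sorted(members) for key, members in groups.items()}


print(is_tree_compatible(profiles))
for deletions, members in group_by_profile(profiles).items():
    print(sorted(deletions), members)
```

A profile that combined deletions from two otherwise separate groups (for example, a strain carrying both del_2 and del_3) would fail the compatibility check, which is the kind of conflict that horizontal gene exchange or repeated independent deletion would be expected to produce.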

Ancient History of the Pathogen

Although LSPs have proven very useful for defining different

lineages within MTBC, these markers do not reflect actual genetic

distances, and the mode of molecular evolution in MTBC cannot

be easily inferred from them [21]. By contrast, DNA sequence-

based methods can provide important clues about the evolutionary

forces shaping bacterial populations. Multilocus sequence typing

(MLST), in which fragments of seven structural genes are

Figure 1. The global incidence of TB. The number of new TB cases per 100,000 population for the year 2007 according to WHO estimates (adapted from [1]). doi:10.1371/journal.ppat.1000600.g001

Box 1. The Influence of Modern Epidemics on TB Incidence

HIV/AIDS and diabetes are important comorbidities that dramatically increase the susceptibility to TB. The synergy between TB and HIV/AIDS is a particular problem in sub-Saharan Africa, while the impact of diabetes on TB is increasing in many rapidly growing world economies; it may already be a more important risk factor for TB than HIV/AIDS in places like India and Mexico. The emergence of multidrug-resistant strains represents an additional threat to global TB control. The strong association between HIV/AIDS and drug-resistant TB has been well established, but whether similar interactions exist between drug-resistant TB and diabetes needs to be explored further.


sequenced for each strain [26], has been used very successfully to

define the genetic population structure of many bacterial species

[27]. Because of the low degree of sequence polymorphisms in

MTBC, however, standard MLST is uninformative [28]. A recent

study of MTBC extended the traditional MLST scheme by

sequencing 89 complete genes in 108 strains, covering 1.5% of the

genome of each strain [29]. Phylogenetic analysis of this extended

multilocus sequence dataset resulted in a tree that was highly

congruent with that generated previously using LSPs (Figure 3).

The new sequence-based data also revealed that the MTBC

strains that are adapted to various animal species represent just a

subset of the global genetic diversity of MTBC that affects different

human populations [29]. Furthermore, by comparing the

geographical distribution of various human MTBC strains with

their position on the phylogenetic tree, it became evident that

MTBC most likely originated in Africa and that human MTBC

originally spread out of Africa together with ancient human

migrations along land routes. This view is further supported by the

fact that the so-called ‘‘smooth tubercle bacilli,’’ which are the

closest relatives of the human MTBC, are highly restricted to East

Africa [30]. The multilocus sequence data reported by Hershberg

et al. [29] further suggested a scenario in which the three

‘‘modern’’ lineages of MTBC (purple, blue, and red in Figure 3)

seeded Eurasia, which experienced dramatic human population

expansion in more recent times. These three lineages then spread

globally out of Europe, India, and China, respectively,

accompanying waves of colonization, trade and conquest. In contrast to the

ancient human migrations, however, this more recent dispersal of

human MTBC occurred primarily along water routes [29].

The availability of comprehensive DNA sequence data has also

allowed researchers to address questions about the molecular

evolution of MTBC. In-depth population genetic analyses by

Hershberg et al. highlight the fact that purifying selection against

slightly deleterious mutations in this organism is strongly reduced

compared to other bacteria [29]. As a consequence,

nonsynonymous SNPs tend to accumulate in MTBC, leading to a high ratio

of nonsynonymous to synonymous mutations (also known as

dN/dS). The authors hypothesized that the high dN/dS in MTBC

compared to most other bacteria might indicate increased random

genetic drift associated with serial population bottlenecks during

past human migrations and patient-to-patient transmission. If

confirmed, this would indicate that ‘‘chance,’’ not just natural

selection, has been driving the evolution of MTBC. Although these

kinds of fundamental evolutionary questions are often

underappreciated by clinicians and biomedical researchers, studying the

evolution of a pathogen ultimately allows for better

epidemiological predictions by contributing to our understanding of basic

biology, particularly with respect to antibiotic resistance.
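As a rough illustration of the ratio discussed above, the following Python sketch counts synonymous and nonsynonymous differences between two aligned coding sequences and normalises each by the number of sites of that type. It is a simplified counting scheme in the spirit of the Nei-Gojobori approach and is not the analysis performed by Hershberg et al. [29]. It assumes in-frame sequences without gaps or stop codons, ignores codons that differ at more than one position, reports uncorrected proportions (pN and pS) rather than distance-corrected dN and dS, and uses invented example sequences.

```python
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

# Standard genetic code, built from the conventional TCAG codon ordering.
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def synonymous_sites(codon):
    """Expected number of synonymous sites in one codon (between 0 and 3)."""
    synonymous_changes = 0
    for pos in range(3):
        for base in BASES:
            if base == codon[pos]:
                continue
            mutant = codon[:pos] + base + codon[pos + 1:]
            if CODON_TABLE[mutant] == CODON_TABLE[codon]:
                synonymous_changes += 1
    # Each position has three possible changes, so divide the count by 3.
    return synonymous_changes / 3


def pn_ps(seq1, seq2):
    """Return (pN, pS) for two aligned, in-frame coding sequences."""
    syn_sites = nonsyn_sites = 0.0
    syn_diffs = nonsyn_diffs = 0
    for i in range(0, len(seq1) - 2, 3):
        c1, c2 = seq1[i:i + 3], seq2[i:i + 3]
        s = (synonymous_sites(c1) + synonymous_sites(c2)) / 2
        syn_sites += s
        nonsyn_sites += 3 - s
        n_diff = sum(x != y for x, y in zip(c1, c2))
        if n_diff == 1:  # codons with 2 or 3 differences are ignored here
            if CODON_TABLE[c1] == CODON_TABLE[c2]:
                syn_diffs += 1
            else:
                nonsyn_diffs += 1
    return nonsyn_diffs / nonsyn_sites, syn_diffs / syn_sites


# Invented 4-codon example: one synonymous and one nonsynonymous change.
seq_a = "ATGGCTAAAGAT"
seq_b = "ATGGCCAAAGAA"
pn, ps = pn_ps(seq_a, seq_b)
print(f"pN = {pn:.3f}, pS = {ps:.3f}, pN/pS = {pn / ps:.3f}")
```

Because most random changes in a codon alter the amino acid, there are many more nonsynonymous than synonymous sites; normalising by site counts is what makes the two rates comparable. A ratio that is high relative to other bacteria is the pattern the authors interpret as weakened purifying selection.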

A Vision for the Future

Thanks to recent increases in research funding for TB [4],

substantial progress has been made in our understanding of the

basic biology and epidemiology of the disease. Unfortunately, this

increased knowledge has not yet had any noticeable impact on the

current global trends of TB (Figure 1). While TB incidence

appears to have stabilized in many countries, the total number of

cases is still increasing as a function of global human population

growth [1]. Of particular concern are the ongoing epidemics of

multidrug-resistant TB [31], as well as the synergies between TB

and the ongoing epidemics of HIV/AIDS and other comorbidities

such as diabetes (Box 1).

Figure 2. Global distribution of the six main lineages of human MTBC. Each dot represents the most frequent lineage(s) circulating in a country. Colours correspond to the lineages defined in Figure 3 (adapted from [20]). doi:10.1371/journal.ppat.1000600.g002


As our understanding of TB improves, we would like to be able to

make better predictions about the future trajectory of the disease

and to develop new tools to control the disease better and ultimately

reverse global trends. For this to be feasible, TB epidemiology needs

to evolve into a more predictive, interdisciplinary endeavour, a

discipline we might refer to as ‘‘systems epidemiology’’ (Figure 4).

Systems biology is already a rapidly emerging field, in which cycles

of mathematical modelling and experiments using various large-

scale ‘‘-omics’’ datasets are integrated in an iterative manner [32].

Novel biological processes are being discovered through these

systems approaches, which might not have been possible using more

traditional methods [33–35].

Last year, Young et al. argued that systems biology approaches

will be necessary to elucidate some of the key aspects of host–

pathogen interactions in TB [36] and to develop new drugs,

vaccines, and biomarkers to evaluate new interventions [3]. For

example, according to another dogma in the TB field, latent TB

infections are caused by physiologically dormant bacilli and can

thus be differentiated from active disease where MTBC is actively

growing and dividing [37]. In reality, however, the phenomenon

of TB latency most likely reflects a whole spectrum of responses to

TB infection, involving phenotypically distinct bacterial

subpopulations and spanning various degrees of bacterial burden and

associated host immune responses [38]. We agree with Young

et al. [36] that TB latency and similar biological complexities will

only be adequately addressed using systems approaches, and we

argue further that to comprehend the current TB epidemic as a

whole, and to better predict its future trajectory, a complementary

systems epidemiology approach will be necessary (Figure 4).

Mathematical models are already being used extensively to study

the epidemiology of TB and to guide control policies [39]. Recent

applications have shown that socioeconomic factors are key drivers

Figure 3. The global phylogeny of Mycobacterium tuberculosis complex (MTBC). The phylogenetic relationships between various human- and animal-adapted strains and species are largely consistent when defined by using either (A) large sequence polymorphisms (LSPs) or (B) single nucleotide polymorphisms (SNPs) identified by sequencing 89 genes in 108 MTBC strains. Numbers inside the squares in (A) refer to specific lineage-defining LSPs. Colors indicate congruent lineages (adapted from [20] and [29]). doi:10.1371/journal.ppat.1000600.g003


of today’s TB epidemic [40]. In addition, much theoretical emphasis

has been placed on trying to define the impact that drug resistance

will have on the global TB epidemic [41]. Some of this theoretical

work has become more complex by incorporating new biological

insights obtained empirically and through targeted experimental

studies. Early theoretical studies on the spread of drug-resistant

MTBC were based on the assumption that all drug-resistant

bacteria had an inherent fitness disadvantage compared to drug-

susceptible strains [42]; however, as is becoming clear from

experimental and molecular epidemiological investigation,

substantial heterogeneity exists with respect to the reproductive success of

drug-resistant strains [43–46]. Newer mathematical models account

for some of this heterogeneity [47–49].
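The effect of fitness heterogeneity on such predictions can be illustrated with a deliberately minimal two-strain transmission model. The Python sketch below is not one of the published models [47–49]; it assumes a closed, well-mixed population in which drug-susceptible and drug-resistant strains compete for the same uninfected hosts, omits latency, treatment-acquired resistance, and demography, and uses arbitrary parameter values. The function name and all parameters are hypothetical.

```python
def simulate(beta, rel_fitness, gamma_s, gamma_r,
             years=200.0, dt=0.01, i_s0=0.01, i_r0=0.001):
    """Euler integration of a two-strain model in population fractions.

    beta        transmission rate of the drug-susceptible strain (per year)
    rel_fitness transmission of the resistant strain relative to beta
    gamma_s     removal (cure) rate of susceptible-strain infections
    gamma_r     removal rate of resistant-strain infections
    Returns the final fractions (uninfected, infected_s, infected_r).
    """
    s = 1.0 - i_s0 - i_r0
    i_s, i_r = i_s0, i_r0
    for _ in range(round(years / dt)):
        new_s = beta * s * i_s                # new drug-susceptible infections
        new_r = rel_fitness * beta * s * i_r  # new drug-resistant infections
        rec_s = gamma_s * i_s                 # cured hosts return to the uninfected pool
        rec_r = gamma_r * i_r
        s += dt * (rec_s + rec_r - new_s - new_r)
        i_s += dt * (new_s - rec_s)
        i_r += dt * (new_r - rec_r)
    return s, i_s, i_r


# Scenario 1: every resistant strain carries a transmission cost.
print(simulate(beta=2.0, rel_fitness=0.7, gamma_s=1.0, gamma_r=1.0))

# Scenario 2: a resistant strain with no transmission cost that is also
# removed more slowly (for example, because treatment is less effective).
print(simulate(beta=2.0, rel_fitness=1.0, gamma_s=1.0, gamma_r=0.8))
```

In this toy setting the strain with the larger ratio of transmission to removal eventually excludes the other, so the assumed fitness of resistant strains largely determines the long-term outcome. This is one way to see why replacing a single fixed fitness cost with an empirically informed distribution of fitness values can change model predictions.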

One could imagine an expansion of such mathematical

approaches, much as systems biology operates, in which

epidemiological modelling is combined with more comprehensive

biological data related to the host, the pathogen, and their

interactions (Figure 4). Of course, environmental and sociological

data would also need to be considered [40]. As mathematical

models become more finely tuned, they could in turn inform

future experimental work to test some of the specific predictions.

The genomics revolution now offers the opportunity to study host–

pathogen interactions at an unprecedented depth. To be able to

make sense out of the current and upcoming deluge of -omics data,

however, scientists will have to rely on a mathematically and

statistically robust analytical framework. Ideally, some of these

theoretical approaches will be able to accommodate increasingly

diverse sets of data in order to capture the various biological,

environmental, and social aspects of TB.

Among the newly emerging technologies, we believe that next-

generation DNA sequencing will play an important role in

improving our understanding of TB [50]. Whole-genome

sequencing could potentially become the new gold standard for

strain typing in routine molecular epidemiology [51]. For host

genetics and TB susceptibility, too, de novo DNA sequencing

based approaches could have advantages over traditional SNP

typing [52]. For example, many of the human populations

carrying the largest proportion of the global TB burden have

not been sufficiently characterised genetically (Figure 1) [53,54],

and screening for currently limited human SNP collections might

have little relevance for these populations [55]. Furthermore,

comprehensive DNA sequencing of TB patients and controls in

various human populations could help unveil rare but biologically

relevant mutations [56]. Another approach increasingly being

Figure 4. A systems epidemiology approach to TB research. The spread of TB is influenced by social and biological factors. On the one hand, the new discipline of systems biology integrates approaches that address the host, the pathogen, and interactions between the two. On the other hand, epidemiology addresses the burden of the disease and the social, economic, and ecological causes of its frequency and distribution. There is little crosstalk between these two disciplines at the moment. “Systems epidemiology” is an attempt to take into account the interactions between these various fields of research. doi:10.1371/journal.ppat.1000600.g004


used to study both the host and the pathogen is sequence-based

transcriptomics, in which gene expression is measured by whole

genome sequencing of RNA transcripts, a method referred to as

RNA-seq [57]. One of the advantages of this approach over

existing microarray-based methods is that changes in the

expression of noncoding RNAs and other novel transcripts can

be easily detected. RNA-seq is particularly useful for genome-wide

studies of small regulatory RNAs, as such studies are more difficult

to perform using standard DNA microarrays. Recent studies, for

example, have reported a role for small regulatory RNAs in M.

tuberculosis [58], and there is little doubt more regulatory RNAs will

soon be identified by RNA-seq [57].


Advances in TB research are hampered by the fact that MTBC

is a Biosafety Level 3 pathogen with a long generation time,

making it slow and complex to culture. Moreover, TB is a chronic

disease that can develop over many years, and is characterised by

extended periods of latency during which MTBC cannot be

isolated from infected individuals. All of these factors complicate

and prolong the development of new interventions and their

assessment in clinical trials. As we have already mentioned, the

field has been marked by a number of dogmas that, in some cases,

might have contributed to the slow progress in TB research. New

insights are now questioning some of these views, but at the same

time, new opinions could well evolve into new dogmas. For

example, we and others have spent much of our scientific careers

seeking convincing evidence for the role of MTBC strain diversity

in human disease. Although some pieces of evidence have recently

started to emerge [59–61], the subject needs more work. One of

the problems has been that the macrophage and mouse infection

models used in these studies relied on poorly characterised strains,

and finding relevant links to human disease has been all but

impossible [14,21].

In TB control, too, potential new dogmas might emerge to limit

future progress. A strong T cell–derived interferon gamma (IFN-γ)

response appears to be crucial for the immunological control of

TB, and many MTBC antigens have been identified based on

their capacity to elicit IFN-γ responses in TB patients or their

infected contacts [62]. Some of these antigens are being developed

into new TB diagnostics and vaccines, but the potential impact of

MTBC diversity on immune responses is not generally being

considered [21]. A recent study in The Gambia showed that IFN-γ

responses to one of the key MTBC antigens differed in an MTBC

lineage–specific manner [63]. Developing a universally effective

vaccine might be the only way to eliminate TB in the future [3].

This is particularly true given the large reservoir of latently

infected individuals in the world, which would be impossible to

eliminate through prophylactic drug treatment. Considering that

natural TB infection does not protect against exogenous

reinfection and disease, however, mimicking natural infection

using attenuated strains or a cocktail of traditional IFN-γ-inducing

antigens might not necessarily be the most promising vaccine

strategy. Indeed, the largely unsuccessful implementation of BCG

vaccination might serve as a warning [64].

Acknowledgments

We thank Peter Small and Douglas Young for comments on the

manuscript.

References

1. World Health Organization (2009) Global tuberculosis control - surveillance,

planning, financing. Geneva, Switzerland: WHO.

2. Stop TB Partnership (2006) The global plan to stop TB 2006–2015. Geneva: WHO.

3. Young DB, Perkins MD, Duncan K, Barry CE (2008) Confronting the scientific

obstacles to global control of tuberculosis. J Clin Invest 118: 1255–1265.

4. Kaufmann SH, Parida SK (2007) Changing funding patterns in tuberculosis. Nat Med 13: 299–303.

5. Frieden TR, Fujiwara PI, Washko RM, Hamburg MA (1995) Tuberculosis in

New York City–turning the tide. N Engl J Med 333: 229–233.

6. Sreevatsan S, Pan X, Stockbauer KE, Connell ND, Kreiswirth BN, et al. (1997)

Restricted structural gene polymorphism in the Mycobacterium tuberculosis complex

indicates evolutionarily recent global dissemination. Proc Natl Acad Sci U S A 94: 9869–9874.

7. van Embden JD, Cave MD, Crawford JT, Dale JW, Eisenach KD, et al. (1993)

Strain identification of Mycobacterium tuberculosis by DNA fingerprinting: recommendations for a standardized methodology. J Clin Microbiol 31:

406–409.

8. Small PM, Hopewell PC, Singh SP, Paz A, Parsonnet J, et al. (1994) The epidemiology of tuberculosis in San Francisco. A population-based study using

conventional and molecular methods. N Engl J Med 330: 1703–1709.

9. Small PM, Shafer RW, Hopewell PC, Singh SP, Murphy MJ, et al. (1993) Exogenous reinfection with multidrug-resistant Mycobacterium tuberculosis in

patients with advanced HIV infection. N Engl J Med 328: 1137–1144.

10. Mathema B, Kurepina NE, Bifani PJ, Kreiswirth BN (2006) Molecular epidemiology of tuberculosis: current insights. Clin Microbiol Rev 19: 658–685.

11. Cole ST, Brosch R, Parkhill J, Garnier T, Churcher C, et al. (1998) Deciphering

the biology of Mycobacterium tuberculosis from the complete genome sequence. Nature 393: 537–544.

12. Tsolaki AG, Hirsh AE, DeRiemer K, Enciso JA, Wong MZ, et al. (2004)

Functional and evolutionary genomics of Mycobacterium tuberculosis: insights from genomic deletions in 100 strains. Proc Natl Acad Sci U S A 101: 4865–4870.

13. Kato-Maeda M, Rhee JT, Gingeras TR, Salamon H, Drenkow J, et al. (2001)

Comparing genomes within the species Mycobacterium tuberculosis. Genome Res 11: 547–554.

14. Nicol MP, Wilkinson RJ (2008) The clinical consequences of strain diversity in

Mycobacterium tuberculosis. Trans R Soc Trop Med Hyg 102: 955–65.

15. Hirsh AE, Tsolaki AG, DeRiemer K, Feldman MW, Small PM (2004) Stable

association between strains of Mycobacterium tuberculosis and their human host

populations. Proc Natl Acad Sci U S A 101: 4871–4876.

16. Supply P, Warren RM, Banuls AL, Lesjean S, Van Der Spuy GD, et al. (2003)

Linkage disequilibrium between minisatellite loci supports clonal evolution of Mycobacterium tuberculosis in a high tuberculosis incidence area. Mol Microbiol 47:

529–538.

17. Brosch R, Gordon SV, Marmiesse M, Brodin P, Buchrieser C, et al. (2002) A new evolutionary scenario for the Mycobacterium tuberculosis complex. Proc Natl

Acad Sci U S A 99: 3684–3689.

18. Mostowy S, Cousins D, Brinkman J, Aranaz A, Behr MA (2002) Genomic deletions suggest a phylogeny for the Mycobacterium tuberculosis complex. J Infect

Dis 186: 74–80.

19. Reed MB, Pichler VK, McIntosh F, Mattia A, Fallow A, et al. (2009) Major Mycobacterium tuberculosis lineages associate with patient country of origin. J Clin


20. Gagneux S, Deriemer K, Van T, Kato-Maeda M, de Jong BC, et al. (2006) Variable host-pathogen compatibility in Mycobacterium tuberculosis. Proc Natl Acad

Sci U S A 103: 2869–2873.

21. Gagneux S, Small PM (2007) Global phylogeography of Mycobacterium tuberculosis

and implications for tuberculosis product development. Lancet Infect Dis 7: 328–337.

22. Baker L, Brown T, Maiden MC, Drobniewski F (2004) Silent nucleotide

polymorphisms and a phylogeny for Mycobacterium tuberculosis. Emerg Infect Dis 10: 1568–1577.

23. Gutacker MM, Mathema B, Soini H, Shashkina E, Kreiswirth BN, et al. (2006)

Single-nucleotide polymorphism-based population genetic analysis of

Mycobacterium tuberculosis strains from 4 geographic sites. J Infect Dis 193: 121–128.

24. Filliol I, Motiwala AS, Cavatore M, Qi W, Hernando Hazbon M, et al. (2006)

Global phylogeny of Mycobacterium tuberculosis based on single nucleotide polymorphism (SNP) analysis: insights into tuberculosis evolution, phylogenetic

accuracy of other DNA fingerprinting systems, and recommendations for a

minimal standard SNP set. J Bacteriol 188: 759–772.

25. Brudey K, Driscoll JR, Rigouts L, Prodinger WM, Gori A, et al. (2006) Mycobacterium tuberculosis complex genetic diversity: mining the fourth

international spoligotyping database (SpolDB4) for classification, population genetics and epidemiology. BMC Microbiol 6: 23.

26. Maiden MC, Bygraves JA, Feil E, Morelli G, Russell JE, et al. (1998) Multilocus

sequence typing: a portable approach to the identification of clones within populations of pathogenic microorganisms. Proc Natl Acad Sci U S A 95:

3140–3145.

27. Maiden MC (2006) Multilocus sequence typing of bacteria. Annu Rev Microbiol 60: 561–588.


28. Achtman M (2008) Evolution, population structure, and phylogeography of

genetically monomorphic bacterial pathogens. Annu Rev Microbiol 62: 53–70.

29. Hershberg R, Lipatov M, Small PM, Sheffer H, Niemann S, et al. (2008) High

functional diversity in Mycobacterium tuberculosis driven by genetic drift and human

demography. PLoS Biol 6: e311.

30. Gutierrez C, Brisse S, Brosch R, Fabre M, Omais B, et al. (2005) Ancient origin

and gene mosaicism of the progenitor of Mycobacterium tuberculosis. PLoS

Pathogens 1: 1–7.

31. World Health Organization (2008) Anti-tuberculosis drug resistance in the world

report no. 4. Geneva, Switzerland: WHO.

32. Zak DE, Aderem A (2009) Systems biology of innate immunity. Immunol Rev

227: 264–282.

33. Gilchrist M, Thorsson V, Li B, Rust AG, Korb M, et al. (2006) Systems biology

approaches identify ATF3 as a negative regulator of Toll-like receptor 4. Nature

441: 173–178.

34. Querec TD, Akondy RS, Lee EK, Cao W, Nakaya HI, et al. (2009) Systems

biology approach predicts immunogenicity of the yellow fever vaccine in

humans. Nat Immunol 10: 116–125.

35. Stuart LM, Boulais J, Charriere GM, Hennessy EJ, Brunet S, et al. (2007) A

systems biology analysis of the Drosophila phagosome. Nature 445: 95–101.

36. Young D, Stark J, Kirschner D (2008) Systems biology of persistent infection:

tuberculosis as a case study. Nat Rev Microbiol 6: 520–8.

37. Gill WP, Harik NS, Whiddon MR, Liao RP, Mittler JE, et al. (2009) A

replication clock for Mycobacterium tuberculosis. Nat Med 15: 211–4.

38. Young DB, Gideon HP, Wilkinson RJ (2009) Eliminating latent tuberculosis.

Trends Microbiol 17: 183–188.

39. Cohen T, Dye C, Colijn C, Murray M (2009) Mathematical models of the

epidemiology and control of drug-resistant TB. Expert Rev Resp Med in press.

40. Lonnroth K, Jaramillo E, Williams BG, Dye C, Raviglione M (2009) Drivers of

tuberculosis epidemics: The role of risk factors and social determinants. Soc Sci

Med 68: 2240–6.

41. Borrell S, Gagneux S (2009) Infectiousness, reproductive fitness, and evolution of

drug-resistant Mycobacterium tuberculosis. Int J Tuberc Lung Dis in press.

42. Dye C, Williams BG, Espinal MA, Raviglione MC (2002) Erasing the world’s

slow stain: strategies to beat multidrug-resistant tuberculosis. Science 295:

2042–2046.

43. Bottger EC, Springer B, Pletschette M, Sander P (1998) Fitness of antibiotic-

resistant microorganisms and compensatory mutations. Nat Med 4: 1343–1344.

44. Gagneux S, Burgos MV, DeRiemer K, Encisco A, Munoz S, et al. (2006) Impact

of bacterial genetics on the transmission of isoniazid-resistant Mycobacterium

tuberculosis. PLoS Pathog 2: e61.

45. Gagneux S, Long CD, Small PM, Van T, Schoolnik GK, et al. (2006) The

competitive cost of antibiotic resistance in Mycobacterium tuberculosis. Science 312:

1944–1946.

46. van Soolingen D, de Haas PE, van Doorn HR, Kuijper E, Rinder H, et al.

(2000) Mutations at amino acid position 315 of the katG gene are associated with

high-level resistance to isoniazid, other drug resistance, and successful

transmission of Mycobacterium tuberculosis in the Netherlands. J Infect Dis 182:

1788–1790.

47. Cohen T, Murray M (2004) Modeling epidemics of multidrug-resistant M. tuberculosis of heterogeneous fitness. Nat Med 10: 1117–1121.

48. Blower SM, Chou T (2004) Modeling the emergence of the ‘hot zones’: tuberculosis and the amplification dynamics of drug resistance. Nat Med 10: 1111–1116.

49. Dye C (2009) Doomsday postponed? Preventing and reversing epidemics of drug-resistant tuberculosis. Nat Rev Microbiol 7: 81–87.

50. Mardis ER (2008) Next-generation DNA sequencing methods. Annu Rev Genomics Hum Genet 9: 387–402.

51. MacLean D, Jones JD, Studholme DJ (2009) Application of ‘next-generation’ sequencing technologies to microbial genetics. Nat Rev Microbiol 7: 287–296.

52. Hardy J, Singleton A (2009) Genomewide association studies and human disease. N Engl J Med 360: 1759–1768.

53. Tishkoff SA, Reed FA, Friedlaender FR, Ehret C, Ranciaro A, et al. (2009) The genetic structure and history of Africans and African Americans. Science 324: 1035–44.

54. Basu A, Mukherjee N, Roy S, Sengupta S, Banerjee S, et al. (2003) Ethnic India: a genomic view, with special reference to peopling and structure. Genome Res 13: 2277–2290.

55. Campbell MC, Tishkoff SA (2008) African Genetic Diversity: Implications for human demographic history, modern human origins, and complex disease mapping. Annu Rev Genomics Hum Genet 9: 403–33.

56. Goldstein DB (2009) Common genetic variation and human traits. N Engl J Med 360: 1696–1698.

57. Wang Z, Gerstein M, Snyder M (2009) RNA-Seq: a revolutionary tool for transcriptomics. Nat Rev Genet 10: 57–63.

58. Arnvig KB, Young DB (2009) Identification of small RNAs in Mycobacterium

tuberculosis. Mol Microbiol 73: 397–408.

59. de Jong BC, Hill PC, Aiken A, Awine T, Antonio M, et al. (2008) Progression to active tuberculosis, but not transmission, varies by Mycobacterium tuberculosis lineage in the Gambia. J Infect Dis 198: 1037–43.

60. Caws M, Thwaites G, Dunstan S, Hawn TR, Thi Ngoc Lan N, et al. (2008) The influence of host and bacterial genotype on the development of disseminated disease with Mycobacterium tuberculosis. PLoS Pathog 4: e1000034.

61. Thwaites G, Caws M, Chau TT, D’Sa A, Lan NT, et al. (2008) The relationship between Mycobacterium tuberculosis genotype and the clinical phenotype of pulmonary and meningeal tuberculosis. J Clin Microbiol 46: 1363–8.

62. Ernst JD, Lewinsohn DM, Behar S, Blythe M, Schlesinger LS, et al. (2007) Meeting report: NIH workshop on the Tuberculosis Immune Epitope Database. Tuberculosis (Edinb) 88: 366–70.

63. de Jong BC, Hill PC, Brookes RH, Gagneux S, Jeffries DJ, et al. (2006) Mycobacterium africanum elicits an attenuated T Cell response to Early Secreted Antigenic Target, 6 kDa, in patients with tuberculosis and their household contacts. J Infect Dis 193: 1279–1286.

64. Andersen P, Doherty TM (2005) Opinion: The success and failure of BCG - implications for a novel tuberculosis vaccine. Nat Rev Microbiol 3: 656–62.


Review

Helicobacter pylori’s Unconventional Role in Health and Disease

Marion S. Dorer, Sarah Talarico, Nina R. Salama*

Division of Human Biology, Fred Hutchinson Cancer Research Center, Seattle, Washington, United States of America

Abstract: The discovery of a bacterium, Helicobacter pylori, that is resident in the human stomach and causes chronic disease (peptic ulcer and gastric cancer) was radical on many levels. Whereas the mouth and the colon were both known to host a large number of microorganisms, collectively referred to as the microbiome, the stomach was thought to be a virtual Sahara desert for microbes because of its high acidity. We now know that H. pylori is one of many species of bacteria that live in the stomach, although H. pylori seems to dominate this community. H. pylori does not behave as a classical bacterial pathogen: disease is not solely mediated by production of toxins, although certain H. pylori genes, including those that encode exotoxins, increase the risk of disease development. Instead, disease seems to result from a complex interaction between the bacterium, the host, and the environment. Furthermore, H. pylori was the first bacterium observed to behave as a carcinogen. The innate and adaptive immune defenses of the host, combined with factors in the environment of the stomach, apparently drive a continuously high rate of genomic variation in H. pylori. Studies of this genetic diversity in strains isolated from various locations across the globe show that H. pylori has coevolved with humans throughout our history. This long association has given rise not only to disease, but also to possible protective effects, particularly with respect to diseases of the esophagus. Given this complex relationship with human health, eradication of H. pylori in nonsymptomatic individuals may not be the best course of action. The story of H. pylori teaches us to look more deeply at our resident microbiome and the complexity of its interactions, both in this complex population and within our own tissues, to gain a better understanding of health and disease.

Common wisdom circa 1980 suggested that the stomach, with

its low pH, was a sterile environment. Then, endoscopy of the

stomach became common and, in 1984, pathologist Robin

Warren and gastroenterologist Barry Marshall saw an extracellu-

lar, curved bacillus, often in dense sheets, lining the stomach

epithelium of patients with gastritis (inflammation of the stomach)

and ulcer disease [1]. Soon, the medical community understood

that the gram-negative bacterium Helicobacter pylori, not stress, is

the major cause of stomach inflammation, which, in some infected

individuals, precedes peptic ulcer disease (10%–20%), distal gastric

adenocarcinoma (1%–2%), and gastric mucosal-associated lymphoid

tissue (MALT) lymphoma (<1%) [2–5]. Thus, H. pylori

gained distinction as the only known bacterial carcinogen [6]. It is

believed that half of the world’s population is infected with H.

pylori; however, the burden of disease falls disproportionately on

less-developed countries. The incidence of infection in developed

countries has fallen dramatically, for unknown reasons, with a

corresponding decrease in gastric cancer [7]. This public health

success is tempered by the recent demonstration of an inverse

relationship between H. pylori infection and esophageal adenocar-

cinoma, Barrett’s esophagus, and reflux esophagitis [8]. H. pylori

has been with humans since our earliest days, thus it is not

surprising that its relationship is that of both a commensal

bacterium and a pathogen, causing some diseases and possibly

protecting against others. In addition, it is genetically diverse,

likely as a result of constant exposure to both environmental and

immunological selection, suggesting that genetic diversification is a

strategy for long-term colonization.

The Role of Infection in Disease Risk

H. pylori infection is generally acquired during childhood and,

without specific antibiotic treatment, can persist for the lifetime of

the host. Disease often does not develop until adulthood, after

decades of infection, and H. pylori induces variable pathologies in

the stomach. Duodenal ulcer disease is characterized by gastritis

that is largely confined to the antrum (the distal compartment of

the stomach), relatively low inflammation of the corpus (the

middle, acid-secreting compartment), and high levels of stomach

acid secretion (Figure 1A). Those with gastric ulcer or stomach

cancer have high levels of inflammation of the corpus, multifocal

gastric atrophy, and low levels of stomach acid secretion, due to

the destruction of stomach acid–secreting parietal cells (Figure 1B)

[9,10]. Some of this inflammatory response is controlled by the

cytokine IL-1β, which is induced by H. pylori infection [11] and

both elicits a proinflammatory response and inhibits secretion of

gastric acid [12]. Polymorphisms in the interleukin gene cluster,

including IL-1β, are risk factors for H. pylori–associated gastric

cancer [13,14], and studies of the transcriptional response of both

human and model hosts to H. pylori confirm induction of

transcriptional regulators of proinflammatory programs. In


Citation: Dorer MS, Talarico S, Salama NR (2009) Helicobacter pylori’s Unconventional Role in Health and Disease. PLoS Pathog 5(10): e1000544. doi:10.1371/journal.ppat.1000544

Copyright: © 2009 Dorer et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Funding: Work in the Salama lab is supported by National Institutes of Health grant AI054423. The funder had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.




addition, transcription profiles reveal induction of several

chemokines and cytokines including those produced by nonlym-

phoid cells, and robust induction of innate immune defenses

including iron sequestration proteins and antimicrobial peptides

[15]. These studies suggest it would be wise to explore diverse

functional classes of genes for host genetic variant associations with

H. pylori disease progression. To this end, H. pylori researchers are

eagerly awaiting an unbiased genome-wide association study of

risk factors associated with progression to intestinal-type gastric

cancer or peptic ulcer disease in patients infected with H. pylori.

Such a study has been completed for sporadic diffuse-type gastric

cancer, which can be associated with H. pylori infection, revealing

two candidate loci, one that encodes a likely tumor suppressor

(prostate stem cell antigen [PSCA]) [16]. Genomic studies of this

sort will help elucidate host factors that synergize with H. pylori

infection to cause disease.

The association of H. pylori infection with gastric cancer raises

the interesting question of whether H. pylori encodes one or more

oncogenes. Oncogenic viruses initiate and promote cellular

transformation by integrating virally encoded oncogenes into the

host genome [17,18]. By contrast, H. pylori remains primarily

extracellular and does not integrate its genome into the host DNA.

The bacterium can still affect the function of host cells, however,

by translocating a bacterial protein, CagA, into host cells via a

specialized secretion system called the cag Type IV secretion

system (T4SS) [19,20]. In host cells, CagA interacts with a number

of cellular complexes implicated in oncogenesis [21,22]. Despite

elucidation of potentially transforming activities, transgenic

expression of CagA in the mouse stomach is only weakly

oncogenic [23]. As the cag T4SS also induces proinflammatory

cytokines via the intracellular bacterial peptidoglycan recognition

molecule Nod1, cancer progression may occur through synergy

with the host inflammatory response [24]. While CagA may not

promote cancer itself, exposure to CagA and inflammatory insults

may select for heritable host cell changes (genetic or epigenetic)

that together contribute to cancer progression.

H. pylori expands our view of how microbes survive at high levels

while activating inflammatory responses and shows us that microbes

may be underappreciated as an important factor in chronic disease

pathogenesis. In the case of pathogens that cause acute infections,

there is a massive inflammatory response, which often supports

bacterial replication and transmission. Alternatively, some patho-

gens, such as Mycobacterium tuberculosis, persist in the host by

manipulating the immune response to create a protected compart-

ment. H. pylori introduces a third strategy; it actively replicates and

maintains a continuous balance with the inflammatory response

over years of infection with little evidence for increased H. pylori–

related disease upon immune suppression [25]. As the role of

chronic inflammation in many diseases including cardiovascular

disease, diabetes mellitus, Alzheimer’s disease, and others is

increasingly recognized, researchers are focusing on infectious

agents as one possible source of this chronic inflammation.

Genomic Insights into the Biology of H. pylori

The study of H. pylori is strongly influenced by the genomic age.

The sequencing of its genome was completed in 1997 [26], just 13

years after Marshall and Warren reported their discovery.

However, almost a quarter (24%) of H. pylori genes have no

sequence similarity with genes available in public databases [27],

suggesting that lessons learned from well-studied bacteria like

Escherichia coli would not necessarily apply to this evolutionarily

distinct epsilonproteobacterium. By using more advanced bioinformatic

approaches, researchers are now identifying some pathways

first thought absent in H. pylori. For example, H. pylori appeared to

lack the E. coli recBCD pathway, which is involved in homologous

recombination and DNA double-strand break repair. More careful

examination of conserved domains and motifs, however, identified

the H. pylori addA and addB genes, which are present in most gram-

positive and many gram-negative bacteria and whose protein

products have enzymatic functions similar to those of the recBCD

pathway [28].

By 1999, H. pylori was the first species to have complete genomes

sequenced from two different strains—an important milestone,

given its genetic diversity. Comparison of the two genomes

revealed that 6%–7% of the genes were present in one strain but

not in the other. There was also a high level of nucleotide diversity

between the two strains, with only eight genes sharing at least 98%

nucleotide identity; however, most nucleotide differences were

synonymous changes [27]. Microarrays designed upon these

sequences were then used for comparative genomic hybridization

of H. pylori strains isolated from different ethnic groups and

geographic areas [29,30]. These studies found that 25% of H. pylori

genes are variably present among strains. Such genome-wide

analyses have played an important role in dividing H. pylori genes

into two classes: variable genes that are absent in some strains and

core genes that are present in all strains analyzed. The variable

genes are likely adaptive for different environmental niches, which

for the human stomach–restricted H. pylori comprise genetically

distinct hosts. The largest annotated class of variable genes encodes

proteins that are expressed on, or that modify, the bacterial cell surface

(outer membrane proteins and proteins involved in lipopolysaccharide

synthesis) [30], consistent with a function at the interface

of the bacteria and host. The core genes have diverse functions.

Some core genes are required for viability in culture. A genomic

study that utilized microarray-based mapping of a genome-

saturating transposon library (a collection of H. pylori strains that

includes transposon mutants randomly distributed throughout the

genome) revealed that 23% of the genome is required for viability

in culture because these genes could not tolerate transposon

Figure 1. Distinct pathologies of H. pylori–induced disease. (A) Duodenal ulcer disease correlates with high inflammation in the antrum (red bursts), lower levels of inflammation in the corpus, and high acid secretion (+). (B) Gastric ulcer or adenocarcinoma correlates with increased inflammation in the corpus, low acid secretion, and multifocal atrophy (wavy lines). doi:10.1371/journal.ppat.1000544.g001

Learning about Disease from H. pylori


insertion [31]. Additional core genes are essential only in the

context of host infection and several groups have completed

screens for transposon mutants that fail to colonize animal models

of infection [32,33]. An example of such a colonization core gene

is addA, which is required for recombinational repair of DNA

double-strand breaks, presumably caused by the host inflamma-

tory response [28].
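The division of genes into core and variable classes described above can be illustrated with a short sketch. It assumes a hypothetical presence/absence table of the kind a comparative genomic hybridization experiment might yield; the strain and gene names are invented for illustration and are not taken from the cited studies. A gene is treated as core if it is detected in every strain analyzed and as variable otherwise.

```python
# Hypothetical presence/absence calls (illustrative names only,
# not data from the cited microarray studies).
presence = {
    "strain_A": {"gene1", "gene2", "gene3", "gene4"},
    "strain_B": {"gene1", "gene2", "gene4"},
    "strain_C": {"gene1", "gene2", "gene3"},
}


def classify_genes(presence_by_strain):
    """Split genes into core (present in all strains) and variable (absent in some)."""
    gene_sets = list(presence_by_strain.values())
    all_genes = set().union(*gene_sets)      # every gene seen in any strain
    core = set.intersection(*gene_sets)      # genes shared by all strains
    variable = all_genes - core              # genes missing from at least one strain
    return core, variable


core, variable = classify_genes(presence)
variable_fraction = len(variable) / len(core | variable)
print(sorted(core), sorted(variable), variable_fraction)
```

Note that under this definition the core set can only shrink as more strains are sampled, which is why the text specifies core genes as those present in all strains analyzed.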

The nucleotide sequence diversity in H. pylori’s core genes can

distinguish between different ethnic and geographic human

populations, demonstrating that passage of H. pylori between

closely related humans has continued uninterrupted over tens of

thousands of years (see Box 1). Different geographic and ethnic

groups that have similar infection rates have quite varied relative

risks of H. pylori–associated diseases such as gastric cancer [34].

Thus, in addition to host genetic and environmental exposures,

differences among strains likely contribute to variation in disease

risk. Consequently, studies of pathogenesis need to be reproduced

in representative strain backgrounds to ensure that discoveries in

one strain apply in strain populations with a diverse evolutionary

history.

H. pylori Diversification during Persistent Infection

Genetic diversification can aid in the persistence of organisms

that continue to replicate during chronic infection, allowing them

to sample adaptive variants. HIV, for example, has a flexible

reverse transcriptase that makes point mutations, insertions,

deletions, transversions, and duplications that produce variants

that may have a selective advantage [35]. Genetic variation in a

microbe indicates constant selection by a dynamic environment,

and H. pylori is a very genetically diverse species of bacteria [36–38].

Genetic diversification may help H. pylori to adapt to a new

host after transmission, to different micro-niches within a single

host, and to changing conditions in the host over time—for

example, by avoiding clearance by host defenses.

Genetic diversity arises from within-genome diversification as

well as from reassortment by recombination with DNA from other

infecting H. pylori, generating novel clones within the stomach

(Figure 2). Within-genome diversification can include point

mutations, intragenomic recombination, and slipped-strand mis-

pairing during DNA replication within repetitive sequences.

Reassortment can occur by recombination with either DNA from

a superinfecting H. pylori strain or a variant clone of the same

strain. Central to this reassortment is H. pylori’s natural

competence—the ability to take up exogenous DNA and

incorporate it into its genome. Evidence from our lab shows that

natural competence is induced by DNA damage, suggesting that

H. pylori responds to stress by diversifying its genome (MSD and

NRS, unpublished data). However, there are controls on this

rampant genetic exchange: restriction-modification systems, which

include a restriction endonuclease that cleaves a specific DNA

sequence and a DNA methyltransferase that protects the

bacterium’s own DNA from being cleaved by methylating the

target DNA sequence. Genes that encode restriction-modification

systems compose the second largest class of variably present genes

with known function, so the complement of available restriction-

modification systems varies between strains, giving a methylation

code to the DNA from each strain. This mechanism serves to limit

or prevent recombination between H. pylori strains as well as

between H. pylori and other bacteria or eukaryotic cells [39].
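The logic of a restriction-modification system, as described above, can be sketched in a few lines. This is a simplified model under stated assumptions: the recognition sequence GATC and the DNA strings are chosen purely for illustration, methylation is represented as a set of protected site positions, and the function names are hypothetical rather than drawn from any library.

```python
def find_sites(dna, site):
    """Return the start position of every occurrence of the recognition site."""
    return [i for i in range(len(dna) - len(site) + 1) if dna[i:i + len(site)] == site]


def methylate(dna, site):
    """Model the methyltransferase: mark every recognition site as protected."""
    return set(find_sites(dna, site))


def is_cleaved(dna, site, methylated_positions):
    """Model the endonuclease: cut if any recognition site is unmethylated."""
    return any(pos not in methylated_positions for pos in find_sites(dna, site))


SITE = "GATC"                          # illustrative recognition sequence
own_dna = "AAGATCTTGATCAA"             # the bacterium's own DNA (made up)
own_marks = methylate(own_dna, SITE)   # own DNA is methylated at every site
foreign_dna = "CCGATCGG"               # incoming DNA lacking this methylation

own_cut = is_cleaved(own_dna, SITE, own_marks)
foreign_cut = is_cleaved(foreign_dna, SITE, set())
print(own_cut, foreign_cut)
```

Because each strain carries a different complement of such systems, DNA from one strain would be unprotected at the sites recognized by another strain's enzymes, which is the sense in which the text speaks of a strain-specific methylation code.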

The H. pylori genome encodes relatively few proteins that

regulate transcription. Instead, some of the same processes that

govern the generation of genetic diversity (i.e., slipped-strand

mispairing, methyltransferase activity, and recombination) also

play an important role in varying gene expression in response to

environmental cues. There are 46 H. pylori genes that have long

repeats of one or two nucleotides that are prone to slipped-strand

mispairing during replication [26,27,40]. These genes are phase-

variable because changes in the number of repeats can shift the

reading frame of the gene, switching gene expression on or off

(Figure 2). In addition, many H. pylori promoters have mononu-

cleotide repeats that regulate gene expression by changing the

spacing between important regulatory sites in these promoters.

Orphan methyltransferases, which have lost their corresponding

restriction enzyme, may also regulate gene expression by

methylating sequences in the promoter region of genes, and some

of the methyltransferase genes are themselves subject to phase-

variable expression. Recombination regulates gene expression

through deletions and duplications that occur during gene

conversion and locus switching. These mechanisms suggest that

H. pylori survives by constantly generating variants that adapt its

physiology to new environments.
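Phase variation by slipped-strand mispairing, as described above, can be illustrated with a toy model. The gene below is hypothetical: a start codon, a poly-G tract of adjustable length, and a short fixed downstream region ending in a stop codon. The sketch treats the gene as switched on only when the tract length keeps the downstream region in frame, so that translation runs to the stop codon at the very end.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def build_gene(repeat_count):
    """Hypothetical gene: start codon, poly-G tract, fixed downstream region."""
    return "ATG" + "G" * repeat_count + "CATCACGCTTAA"


def is_switched_on(gene):
    """Return True if the gene is read in frame up to a stop codon at its very end."""
    usable = len(gene) - len(gene) % 3
    codons = [gene[i:i + 3] for i in range(0, usable, 3)]
    for index, codon in enumerate(codons):
        if codon in STOP_CODONS:
            # On only if the first stop is the final codon and no bases are left over.
            return index == len(codons) - 1 and len(gene) % 3 == 0
    return False  # frameshifted: the intended stop codon is never read in frame


# Gaining or losing a single repeat unit toggles expression.
states = {n: is_switched_on(build_gene(n)) for n in range(8, 11)}
print(states)
```

In this model a tract of nine G residues leaves the gene on, while slippage to eight or ten shifts the reading frame and turns it off; a further change of the right size restores the frame, which is what makes the switch reversible.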

One example of how H. pylori’s genetic variability helps it adapt

to new environments involves its adhesin genes, which encode

proteins that bind to the Lewis human blood group antigens,

which are carbohydrate-based epitopes [41]. The protein encoded

by one of these adhesin genes, BabA, binds the Lewis-b antigen on

the gastric mucosa, helping the bacterium adhere to the mucosa.

The babA gene is silent in some H. pylori strains but can be

Box 1. Tracking Human Genealogy with H. pylori Genomics

Currently, a number of companies propose to predict your “genetic genealogy” from the DNA in a cheek swab. They do this by analyzing informatively variable parts of our genomes (such as the Y chromosome or mitochondrial DNA) that show characteristic differences between ethnic and geographic populations; thus, they can tell if you may be distantly related to Genghis Khan, for example. Unfortunately, population bottlenecks [51], small population sizes, and long generation times have limited the amount of genetic diversity in the human population that can be used for these analyses. It turns out, however, that genomic sequencing of the H. pylori strain harbored by an individual does a better job in resolving ancestry than the usual human genomic markers [52]. This is because of high genetic diversity among H. pylori strains [53], a restricted mode of transmission (primarily within families or households [54]), and the association of H. pylori with humans throughout our evolution [55]. A major source of H. pylori’s genetic diversity is recombination between strains [38], which blurs signatures of descent. Despite this confounding factor, Achtman and colleagues [53] identified evolutionary signatures in strain sequences from diverse geographic sources. These signatures, combined with new statistical tools that take into account admixture and recombination [55], have tracked ancient human migrations, such as our emergence from Africa [55], and more recent events such as colonization of the Pacific islands [56]. H. pylori gene sequences can even distinguish between the Buddhist and Muslim ethnic groups that have coexisted for at least 1,000 years in Ladakh [52]. The fact that H. pylori has maintained evolutionarily distinct strain signatures during many generations of contact suggests either that interracial interactions that promote transmission are very limited or that additional mechanisms prevent strains from one ethnic population from establishing a foothold in hosts of another ethnic population.



expressed if it recombines with the babB gene, an event mediated

by homologous sequences at the 5′ and 3′ ends of the two genes

[42]. Thus, recombination can help H. pylori alter its adherence

properties to adapt to selective pressures in the host. These

selective pressures may include variation in the host receptors

present or in conditions that favor a shift in the ratio of bacteria

adherent to the gastric cell epithelium over those swimming freely

in the mucus.

Genetic variation may also be important for the ability of H.

pylori to evade the host immune system. H. pylori further exploits

the Lewis antigen system by “camouflaging” its surface lipopolysaccharide

with its own Lewis-type antigen, which mimics that of

the individual host. The bacterium adapts the spectrum of Lewis

antigens it expresses by phase variation of the genes involved in

their biosynthesis [43]. Furthermore, recombination among the

many members of the large outer membrane protein (omp) gene

family has the potential to create mosaic omp genes, generating

antigenic variation that may keep H. pylori ahead of the ability of

the host’s immune system to recognize these cell surface exposed

epitopes.

H. pylori’s Interaction with the Microbiome

H. pylori shares its niche with the stomach microbiome, the

collection of microorganisms living on and in us. Study of

microorganisms was once limited to only those microbes that could

be cultured in the laboratory. Advances in sequencing technology

now allow us to study the collection of genes encoded by any

group of organisms—so-called metagenomics—making it possible

to characterize also the microbes that cannot be cultured but

nevertheless affect our health. Given that H. pylori engages in DNA

exchange, the metagenome may serve as a repository for novel

traits. When present, H. pylori dominates the microbiome in the

stomach [44,45], although the effect of this dominance is not

known. Perhaps H. pylori infection changes the composition of the

stomach microbiome, with unknown consequences.


H. pylori is considered pathogenic, even carcinogenic. With this

simple view, eradication seems an obvious choice. In reality,

however, the relationship between H. pylori and disease is more

Figure 2. Mechanisms that create genetic diversity in H. pylori. Colored arrows represent different genes, and the correspondingly colored triangles, rectangles, and circles represent the proteins encoded by these genes. Diversification mechanisms (right side of figure) include spontaneous point mutations, slipped-strand mispairing, and intragenomic recombination. Allelic changes involving nonsynonymous point mutations and mosaic genes resulting from intragenomic recombination can alter the function and/or the antigenic epitopes of the encoded protein. Gene expression can also be regulated by gene conversion resulting from intragenomic recombination, and phase variation mediated by slipped-strand mispairing. Reassortment of genes (left side of figure) by natural transformation with exogenous DNA also contributes to genetic diversity. Natural transformation with DNA from a superinfecting strain, for example, can introduce new genes and new alleles of already present genes (horizontal gene transfer). Similarly, natural transformation with DNA from a variant clone of the same strain can further propagate an advantageous allele acquired by within-genome diversification. doi:10.1371/journal.ppat.1000544.g002



nuanced. As with the cancer risk associated with smoking, the cancer risk

from H. pylori declines slowly: a recent trial showed a measurable

decrease only 12 years after eradication of the infection [46].

Some studies suggest that infection may prevent diseases of the

esophagus, and there is a debate in the literature concerning a

relationship between H. pylori and childhood asthma [8,47,48].

There is clear consensus that H. pylori should be eliminated in cases

of peptic ulcer disease, gastric MALT lymphoma, early gastric

cancer, first-degree relatives of gastric cancer patients, and

uninvestigated dyspepsia in high-prevalence populations. Despite

its potential to prevent ulcer and cancer, universal eradication of

H. pylori infection has not gained wide support, because of the

mixture of positive and negative disease associations with infection,

the lack of a definitive bacterial or host molecule accounting for

disease causation, and poor success rates of treating non-ulcer

dyspepsia by clearing H. pylori infection [49,50]. Thus a more

detailed picture of this host–pathogen interaction is needed and

likely will depend upon further advances in both endoscopy and

genomics.

We have a poor understanding of the immune responses to H.

pylori and the reasons that most hosts fail to clear infection. The

host restriction of H. pylori to humans and some nonhuman

primates has hampered development of robust animal models to

study the disease process. Thus progress will require improvements

in animal models and improved access to patient samples.

Endoscopy of the upper gastrointestinal tract is an invasive

procedure, so a major limitation to research is collection of

bacterial and human tissue samples from infected people.

Available samples are biased toward patients with severe

dyspepsia, ulcer symptoms, and gastric cancer, and only a small

fraction of the stomach can be sampled. Advances in less-invasive

methods, such as capsule endoscopy, may allow increased

sampling to monitor bacterial and tissue changes during chronic

colonization, including isolation and phenotypic analysis of

immune effector cells in infected tissue. Less-invasive methods

would also provide an opportunity to study infection in

asymptomatic individuals and transmission of H. pylori infection,

conditions in which the selective pressures that drive the observed

H. pylori genetic diversification likely operate.

A major opportunity to increase our understanding of how H.

pylori causes or prevents disease arises from recent advances in

high-throughput sequencing technologies. Currently, several

platforms allow researchers to accomplish in a single experiment

sequencing or resequencing of tens of H. pylori genomes,

characterization of host immune and epithelial cell types that

change during infection with highly sensitive digital expression tag

analysis, or analysis of the microbiome present in the stomach and

esophagus through metagenomic sequencing or targeted bacterial

or fungal small ribosomal subunit DNA sequencing. The sequence

data generated by such experiments will address several important

mysteries of H. pylori biology, including the timing and extent of H.

pylori genetic diversification. While strains from unrelated

individuals show dramatic variation in gene content and gene

sequence, the extent of sequence variation among clones during

persistent infection of a single host or upon transmission has not

been adequately sampled. Whole-genome sequencing of multiple

isolates of individual patients with dense spatial and temporal

sampling would definitively establish when, where, and by what

mechanisms genetic diversity is generated. This information will

inform efforts to combat resistance to current antibiotics, to

develop vaccines, and to understand H. pylori’s coevolution with

humans. Exploration of the influence of H. pylori on the

microbiome will identify organisms that collaborate with or can

be antagonized by H. pylori. Such organisms may mediate some of

the disease risks that have been associated with H. pylori presence

and absence. Finally, the rapid pace of resequencing of H. pylori’s

human host will provide a deeper understanding of genetic

variation in the human population that may influence risk for H.

pylori–associated pathologies and which, by association, could

provide clues to the cellular pathways disrupted in disease. Thus,

genomic approaches to study host response, the human microbiome,

bacterial genetic variation, and, perhaps most importantly,

the intersections among these components, will help researchers

determine whether eradication is appropriate for all individuals in

all populations.

Acknowledgments

We thank Olivier Humbert and Laura Sycuro for their critical comments

on the manuscript and Laura Sycuro for providing H. pylori images.

References

1. Marshall BJ, Warren JR (1984) Unidentified curved bacilli in the stomach of patients with gastritis and peptic ulceration. Lancet 1: 1311–1315.

2. Nomura A, Stemmermann GN, Chyou P, Kato I, Perez-Perez G, et al. (1991)

Helicobacter pylori infection and gastric carcinoma among Japanese Americans in

Hawaii. N Engl J Med 325: 1132–1136.

3. Parsonnet J, Friedman GD, Vandersteen DP, Chang Y, Vogelman JH, et al.

(1991) Helicobacter pylori infection and the risk of gastric carcinoma. N Engl J Med

325: 1127–1131.

4. Parsonnet J, Hansen S, Rodriguez L, Gelb AB, Warnke RA, et al. (1994) Helicobacter pylori infection and gastric lymphoma. N Engl J Med 330: 1267–1271.

5. Kusters JG, van Vliet AH, Kuipers EJ (2006) Pathogenesis of Helicobacter pylori

infection. Clin Microbiol Rev 19: 449–490.

6. WHO (2006) Fact sheet No. 297, Cancer. World Health Organization.

7. Peek RM Jr, Blaser MJ (2002) Helicobacter pylori and gastrointestinal tract adenocarcinomas. Nat Rev Cancer 2: 28–37.

8. Anderson LA, Murphy SJ, Johnston BT, Watson RG, Ferguson HR, et al.

(2008) Relationship between Helicobacter pylori infection and gastric atrophy and the stages of the oesophageal inflammation, metaplasia, adenocarcinoma

sequence: Results from the FINBAR case-control study. Gut 57: 734–739.

9. Amieva MR, El-Omar EM (2008) Host-bacterial interactions in Helicobacter pylori

infection. Gastroenterology 134: 306–323.

10. Rubin CE (1997) Are there three types of Helicobacter pylori gastritis? Gastroenterology 112: 2108–2110.

11. Basso D, Scrigner M, Toma A, Navaglia F, Di Mario F, et al. (1996) Helicobacter

pylori infection enhances mucosal interleukin-1 beta, interleukin-6, and the

soluble receptor of interleukin-2. Int J Clin Lab Res 26: 207–210.

12. El-Omar EM (2001) The importance of interleukin 1beta in Helicobacter pylori

associated disease. Gut 48: 743–747.

13. El-Omar EM, Carrington M, Chow WH, McColl KE, Bream JH, et al. (2000) Interleukin-1 polymorphisms associated with increased risk of gastric cancer.

Nature 404: 398–402.

14. Figueiredo C, Machado JC, Pharoah P, Seruca R, Sousa S, et al. (2002)

Helicobacter pylori and interleukin 1 genotyping: an opportunity to identify high-

risk individuals for gastric carcinoma. J Natl Cancer Inst 94: 1680–1687.

15. Humbert O, Pinto-Santini DM, Salama NR (2008) Genomotyping of Helicobacter

pylori and its host: microarray-based insights on gene variation, expression and

function. In: Yamaoka Y, ed. Helicobacter pylori Molecular Genetics and Cellular

Biology. Norfolk, UK: Caister Academic Press. pp 205–244.

16. Sakamoto H, Yoshimura K, Saeki N, Katai H, Shimoda T, et al. (2008) Genetic variation in PSCA is associated with susceptibility to diffuse-type gastric cancer.

Nat Genet 40: 730–740.

17. Maeda N, Fan H, Yoshikai Y (2008) Oncogenesis by retroviruses: Old and new

paradigms. Rev Med Virol 18: 387–405.

18. Howley PM, Livingston DM (2009) Small DNA tumor viruses: Large

contributors to biomedical sciences. Virology 384: 256–259.

19. Segal ED, Cha J, Lo J, Falkow S, Tompkins LS (1999) Altered states:

Involvement of phosphorylated CagA in the induction of host cellular growth changes by Helicobacter pylori. Proc Natl Acad Sci U S A 96: 14559–14564.

20. Stein M, Rappuoli R, Covacci A (2000) Tyrosine phosphorylation of the Helicobacter pylori CagA antigen after cag-driven host cell translocation. Proc Natl

Acad Sci U S A 97: 1263–1268.

21. Bourzac KM, Guillemin K (2005) Helicobacter pylori-host cell interactions

mediated by type IV secretion. Cell Microbiol 7: 911–919.



22. Hatakeyama M (2006) Helicobacter pylori CagA — A bacterial intruder conspiring

gastric carcinogenesis. Int J Cancer 119: 1217–1223.

23. Ohnishi N, Yuasa H, Tanaka S, Sawa H, Miura M, et al. (2008) Transgenic

expression of Helicobacter pylori CagA induces gastrointestinal and hematopoietic

neoplasms in mouse. Proc Natl Acad Sci U S A 105: 1003–1008.

24. Viala J, Chaput C, Boneca IG, Cardona A, Girardin SE, et al. (2004) Nod1

responds to peptidoglycan delivered by the Helicobacter pylori cag pathogenicity

island. Nat Immunol 5: 1166–1174.

25. Romanelli F, Smith KM, Murphy BS (2007) Does HIV infection alter the

incidence or pathology of Helicobacter pylori infection? AIDS Patient Care STDS

21: 908–919.

26. Tomb JF, White O, Kerlavage AR, Clayton RA, Sutton GG, et al. (1997) The

complete genome sequence of the gastric pathogen Helicobacter pylori [published

erratum appears in Nature 1997 Sep 25;389(6649):412]. Nature 388: 539–547.

27. Alm RA, Ling LS, Moir DT, King BL, Brown ED, et al. (1999) Genomic-

sequence comparison of two unrelated isolates of the human gastric pathogen

Helicobacter pylori. Nature 397: 176–180.

28. Amundsen SK, Fero J, Hansen LM, Cromie GA, Solnick JV, et al. (2008)

Helicobacter pylori AddAB helicase-nuclease and RecA promote recombination-

related DNA repair and survival during stomach colonization. Mol Microbiol

69: 994–1007.

29. Gressmann H, Linz B, Ghai R, Pleissner KP, Schlapbach R, et al. (2005) Gain

and loss of multiple genes during the evolution of Helicobacter pylori. PLoS Genet

1: e43. doi:10.1371/journal.pgen.0010043.

30. Salama N, Guillemin K, McDaniel TK, Sherlock G, Tompkins L, et al. (2000) A

whole-genome microarray reveals genetic diversity among Helicobacter pylori

strains. Proc Natl Acad Sci U S A 97: 14668–14673.

31. Salama NR, Shepherd B, Falkow S (2004) Global transposon mutagenesis and

essential gene analysis of Helicobacter pylori. J Bacteriol 186: 7926–7935.

32. Baldwin DN, Shepherd B, Kraemer P, Hall MK, Sycuro LK, et al. (2007)

Identification of Helicobacter pylori genes that contribute to stomach colonization.


33. Kavermann H, Burns BP, Angermuller K, Odenbreit S, Fischer W, et al. (2003)

Identification and characterization of Helicobacter pylori genes essential for gastric

colonization. J Exp Med 197: 813–822.

34. Yamaguchi N, Kakizoe T (2001) Synergistic interaction between Helicobacter

pylori gastritis and diet in gastric cancer. Lancet Oncol 2: 88–94.

35. Johnson WE, Desrosiers RC (2002) Viral persistence: HIV’s strategies of

immune system evasion. Annu Rev Med 53: 499–518.

36. Israel DA, Salama N, Krishna U, Rieger UM, Atherton JC, et al. (2001)

Helicobacter pylori genetic diversity within the gastric niche of a single human host.

Proc Natl Acad Sci U S A 98: 14625–14630.

37. Salama NR, Gonzalez-Valencia G, Deatherage B, Aviles-Jimenez F,

Atherton JC, et al. (2007) Genetic analysis of Helicobacter pylori strain populations

colonizing the stomach at different times postinfection. J Bacteriol 189:

3834–3845.

38. Suerbaum S, Smith JM, Bapumia K, Morelli G, Smith NH, et al. (1998) Free

recombination within Helicobacter pylori. Proc Natl Acad Sci U S A 95:

12619–12624.

39. Humbert O, Salama NR (2008) The Helicobacter pylori HpyAXII restriction-modification system limits exogenous DNA uptake by targeting GTAC sites but shows asymmetric conservation of the DNA methyltransferase and restriction endonuclease components. Nucleic Acids Res 36: 6893–6906.

40. Salaun L, Linz B, Suerbaum S, Saunders NJ (2004) The diversity within an expanded and redefined repertoire of phase-variable genes in Helicobacter pylori. Microbiology 150: 817–830.

41. Lloyd KO (2000) The chemistry and immunochemistry of blood group A, B, H, and Lewis antigens: Past, present and future. Glycoconj J 17: 531–541.

42. Backstrom A, Lundberg C, Kersulyte D, Berg DE, Boren T, et al. (2004) Metastability of Helicobacter pylori bab adhesin genes and dynamics in Lewis b antigen binding. Proc Natl Acad Sci U S A 101: 16923–16928.

43. Wirth HP, Yang M, Peek RM Jr, Tham KT, Blaser MJ (1997) Helicobacter pylori Lewis expression is related to the host Lewis phenotype. Gastroenterology 113: 1091–1098.

44. Bik EM, Eckburg PB, Gill SR, Nelson KE, Purdom EA, et al. (2006) Molecular analysis of the bacterial microbiota in the human stomach. Proc Natl Acad Sci U S A 103: 732–737.

45. Andersson AF, Lindberg M, Jakobsson H, Backhed F, Nyren P, et al. (2008) Comparative analysis of human gut microbiota by barcoded pyrosequencing. PLoS ONE 3: e2836. doi:10.1371/journal.pone.0002836.

46. Mera R, Fontham ET, Bravo LE, Bravo JC, Piazuelo MB, et al. (2005) Long term follow up of patients treated for Helicobacter pylori infection. Gut 54: 1536–1540.

47. Raj SM, Choo KE, Noorizan AM, Lee YY, Graham DY (2009) Evidence against Helicobacter pylori being related to childhood asthma. J Infect Dis 199: 914–915; author reply 915–916.

48. Chen Y, Blaser MJ (2008) Helicobacter pylori colonization is inversely associated with childhood asthma. J Infect Dis 198: 553–560.

49. Chey WD, Wong BC (2007) American College of Gastroenterology guideline on the management of Helicobacter pylori infection. Am J Gastroenterol 102: 1808–1825.

50. Malfertheiner P, Megraud F, O’Morain C, Bazzoli F, El-Omar E, et al. (2007) Current concepts in the management of Helicobacter pylori infection: The Maastricht III Consensus Report. Gut 56: 772–781.

51. Cann RL, Stoneking M, Wilson AC (1987) Mitochondrial DNA and human evolution. Nature 325: 31–36.

52. Wirth T, Wang X, Linz B, Novick RP, Lum JK, et al. (2004) Distinguishing human ethnic groups by means of sequences from Helicobacter pylori: Lessons from Ladakh. Proc Natl Acad Sci U S A 101: 4746–4751.

53. Achtman M, Azuma T, Berg DE, Ito Y, Morelli G, et al. (1999) Recombination and clonal groupings within Helicobacter pylori from different geographical regions. Mol Microbiol 32: 459–470.

54. Schwarz S, Morelli G, Kusecek B, Manica A, Balloux F, et al. (2008) Horizontal versus familial transmission of Helicobacter pylori. PLoS Pathog 4: e1000180. doi:10.1371/journal.ppat.1000180.

55. Falush D, Wirth T, Linz B, Pritchard JK, Stephens M, et al. (2003) Traces of human migrations in Helicobacter pylori populations. Science 299: 1582–1585.

56. Moodley Y, Linz B, Yamaoka Y, Windsor HM, Breurec S, et al. (2009) The peopling of the Pacific from a bacterial perspective. Science 323: 527–530.



Review

Helminth Genomics: The Implications for Human Health

Paul J. Brindley1*, Makedonka Mitreva2, Elodie Ghedin3, Sara Lustigman4

1 Department of Microbiology, Immunology, and Tropical Medicine, George Washington University Medical Center, Washington, D. C., United States of America,

2 Genome Centre and Department of Genetics, Washington University School of Medicine, St. Louis, Missouri, United States of America, 3 Division of Infectious Diseases,

University of Pittsburgh School of Medicine, Pittsburgh, Pennsylvania, United States of America, 4 New York Blood Center, Laboratory of Molecular Parasitology, New York,

New York, United States of America

Abstract: More than two billion people (one-third of humanity) are infected with parasitic roundworms or flatworms, collectively known as helminth parasites. These infections cause diseases that are responsible for enormous levels of morbidity and mortality, delays in the physical development of children, loss of productivity among the workforce, and maintenance of poverty. Genomes of the major helminth species that affect humans, and many others of agricultural and veterinary significance, are now the subject of intensive genome sequencing and annotation. Draft genome sequences of the filarial worm Brugia malayi and two of the human schistosomes, Schistosoma japonicum and S. mansoni, are now available, among others. These genome data will provide the basis for a comprehensive understanding of the molecular mechanisms involved in helminth nutrition and metabolism, host-dependent development and maturation, immune evasion, and evolution. They are likely also to predict new potential vaccine candidates and drug targets. In this review, we present an overview of these efforts and emphasize the potential impact and importance of these new findings.

Helminth Infections: The Great Neglected Tropical Diseases

Helminth parasites are parasitic worms from the phyla

Nematoda (roundworms) and Platyhelminthes (flatworms)

(Figures 1 and 2); together, they comprise the most common

infectious agents of humans in developing countries. The collective

burden of the common helminth diseases, which range from the

dramatic sequelae of elephantiasis and blindness to the more

subtle but widespread effects on child development, pregnancy,

and productivity, rivals that of the main high-mortality conditions

such as HIV/AIDS or malaria [1]. For example, based on a

recent analysis [2], 85% of the neglected tropical disease (NTD)

burden for the poorest 500 million people living in sub-Saharan

Africa (SSA) results from helminth infections. Hookworm infection

occurs in almost half of the poorest people in SSA, including 40–50

million school-aged children and 7 million pregnant women, in

whom it is a leading cause of anemia. Schistosomiasis (192 million

cases) is the second most prevalent NTD after hookworm; SSA

accounts for 93% of the world’s cases of schistosomiasis, and the

disease is possibly associated with increased horizontal

transmission of HIV/AIDS. Lymphatic filariasis (46–51 million

cases) and onchocerciasis (37 million cases) are also widespread in

SSA, each disease representing a significant cause of disability and

reduction in the region’s agricultural productivity. The disease

burden estimate in disability-adjusted life years (DALYs) for total

helminth infections in SSA is 5.4–18.3 million in comparison to

40.9 million DALYs for malaria and 9.3 million DALYs for

tuberculosis. Yet, research into helminth infections has not

received nearly the same level of support. This is partly because

helminthiases are diseases of the poorest people in the poorest

regions, but also because these pathogens are difficult to study in

the laboratory by comparison to most model eukaryotes and many

other pathogens. Standard tools and approaches, including cell

lines, culture in vitro, and animal models, are generally lacking. In

addition, the genomes of helminths are generally much more

complex than those of model organisms like yeast and fruit flies

[2].

Whereas helminth diseases are ancient scourges of humanity,

with some known from biblical times, most can also be considered

as re-emerging diseases in the sense that new outbreaks are

reported routinely in response to environmental and sociopolitical

changes [3]. For example, schistosomiasis has reemerged repeatedly

in Africa in recent years in response to hydrological changes

such as the construction of dams, irrigation canals, and reservoirs that

establish suitable new environments for the intermediate host

snails that transmit the parasites. Schistosomiasis has also

reemerged in mountainous and hilly regions in Sichuan, China,

where it had been controlled previously by intensive interventions

[4]. Furthermore, new strains of schistosomes are indeed emerging

through natural hybridizations between human and cattle species

of schistosomes [5].

Despite the difficulties with investigation of helminth parasites,

new insights into fundamental helminth biology are accumulating

through genome projects and the application of genome

manipulation technologies including RNA interference and

transgenesis (Figure 3). What’s more, research on immunology

of helminth infections has contributed enormously to our

understanding of Th2 immune responses, the function of

regulatory T cells, generation of alternatively activated macrophages,

and the transmission dynamics of infectious agents. It is

hoped that this progress can be translated into new and robust

drugs, diagnostics, and vaccines for the helminth diseases


Citation: Brindley PJ, Mitreva M, Ghedin E, Lustigman S (2009) Helminth Genomics: The Implications for Human Health. PLoS Negl Trop Dis 3(10): e538. doi:10.1371/journal.pntd.0000538

Editor: Matty Knight, Biomedical Research Institute, United States of America


Copyright: © 2009 Brindley et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Funding: Support from the NIH-NIAID, award numbers R01AI072773 (to PJB) and R01AI081803 (to MM) is gratefully acknowledged. The funder had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.



www.plosntds.org 1 October 2009 | Volume 3 | Issue 10 | e538

of humanity and those of our livestock and companion species

[1,6–10].

Genomics Approaches to Investigating Helminths

Over the past decade, increasing numbers of helminth-specific

genome sequences have become available due to ever-improving

techniques for obtaining biological material, extracting RNA and

DNA, constructing complementary DNA (cDNA)/whole genome

shotgun libraries, and, especially, major advances in the chemistry

and instrumentation for DNA sequencing and its concomitant

decreased cost. Helminth genomics began with the generation and

analysis of transcribed sequences (expressed sequence tags [ESTs]

[11]), which has proved to be a rapid and cost-effective route to

discover genes in other eukaryotes. In April 2009, there were

~550,000 nematode and 450,000 platyhelminth ESTs in the

dbEST division of GenBank, excluding those from the model

nematode Caenorhabditis elegans. Of these, 60% were from

parasites of humans and closely related animal pathogens used

to study human infections (Table 1). These ESTs have many

applications. They can be used to annotate helminth genomes (see

below) to determine alternative splicing, verify open reading

frames, and confirm exon/intron and gene boundaries. They are

valuable also, for example, in functional genomics to design probes

for expression microarrays (e.g., [12]) and to provide putative

protein sequence information for proteomics methods (e.g., [13]),

to name but a few applications. Quantitative analysis of ESTs

(transcriptomics), including serial analysis of gene expression (SAGE), can

identify transcripts that are either over- or under-represented by

comparison to other transcripts in various helminth life cycle

stages or tissues (e.g., [14,15]), and the subset of genes evaluated

with gene ontology programs provides insights into cellular and

metabolic pathway functioning in the parasite (e.g., [16]).
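As a rough illustration of the quantitative EST comparison described above, the following Python sketch normalizes EST counts from two libraries (for example, two life cycle stages) and reports a log2 frequency ratio for each transcript cluster. The cluster names, counts, pseudocount, and twofold threshold are hypothetical choices made only for this example; the cited studies may well have used different statistics.

```python
import math


def log2_ratios(counts_a, counts_b, pseudocount=0.5):
    """Return {transcript: log2(freq_a / freq_b)} for two EST libraries.

    A pseudocount (an assumption of this sketch) avoids division by zero
    for transcripts that are absent from one library.
    """
    transcripts = sorted(set(counts_a) | set(counts_b))
    # Library sizes, adjusted so the smoothed frequencies sum to 1.
    size_a = sum(counts_a.values()) + pseudocount * len(transcripts)
    size_b = sum(counts_b.values()) + pseudocount * len(transcripts)
    ratios = {}
    for t in transcripts:
        freq_a = (counts_a.get(t, 0) + pseudocount) / size_a
        freq_b = (counts_b.get(t, 0) + pseudocount) / size_b
        ratios[t] = math.log2(freq_a / freq_b)
    return ratios


def flag_transcripts(ratios, threshold=1.0):
    """Split transcripts into over- and under-represented lists (library A vs. B)."""
    over = sorted(t for t, r in ratios.items() if r >= threshold)
    under = sorted(t for t, r in ratios.items() if r <= -threshold)
    return over, under


# Hypothetical EST counts per transcript cluster in two stage-specific libraries.
larval_ests = {"cluster_A": 40, "cluster_B": 5, "cluster_C": 55}
adult_ests = {"cluster_A": 10, "cluster_B": 5, "cluster_C": 85, "cluster_D": 20}

ratios = log2_ratios(larval_ests, adult_ests)
over, under = flag_transcripts(ratios)
print("over-represented in larval library:", over)
print("under-represented in larval library:", under)
```

A log2 ratio of 1 corresponds to a twofold difference in normalized frequency. With real libraries, a significance test suited to count data would normally accompany such ratios, because small counts are noisy.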

Furthermore, one can identify potential targets for interventions

by applying a hierarchy of considerations including a matrix of

biological, expression, and phenotypic data [17] or by performing

a pan-phylum analysis to identify conserved parasite-specific genes

whose selective targeting will have low or no toxicity to the host

[18,19] or genes that have diverged enough from the host

counterpart, resulting in altered or absent functions [20].
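The pan-phylum filtering idea can be sketched with simple set operations: keep ortholog groups present in most or all parasite species, then discard any with a detectable host homologue. This is only a minimal sketch under stated assumptions. The ortholog group IDs, the group memberships, and the `min_fraction` parameter are hypothetical, and the approaches in [18,19] involve considerably more evidence than presence or absence.

```python
def candidate_targets(parasite_groups, host_groups, min_fraction=1.0):
    """Return ortholog groups conserved among parasites but absent from the host.

    parasite_groups: dict mapping species name -> set of ortholog group IDs
    host_groups: set of ortholog group IDs with a detectable host homologue
    min_fraction: fraction of parasite species that must carry the group
    """
    n_species = len(parasite_groups)
    occurrences = {}
    for groups in parasite_groups.values():
        for group_id in groups:
            occurrences[group_id] = occurrences.get(group_id, 0) + 1
    conserved = {g for g, n in occurrences.items() if n / n_species >= min_fraction}
    return sorted(conserved - set(host_groups))


# Hypothetical ortholog group memberships (the IDs are placeholders).
parasites = {
    "Brugia malayi": {"OG1", "OG2", "OG3"},
    "Necator americanus": {"OG1", "OG2", "OG4"},
    "Trichinella spiralis": {"OG1", "OG2", "OG3"},
}
human = {"OG2", "OG9"}

strict = candidate_targets(parasites, human)
relaxed = candidate_targets(parasites, human, min_fraction=0.6)
print(strict, relaxed)
```

Requiring presence in every species (`min_fraction=1.0`) favors broad-spectrum targets, while a lower fraction tolerates incomplete EST coverage in some species.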

The first genome of a multicellular organism to be sequenced was that of

the free-living roundworm C. elegans [21]; reported in 1998, it is still the

only metazoan for which the sequence of every nucleotide is

known with high confidence. Today, the genome sequences of 22

species of helminths that either infect humans or are closely related

parasites are completed or underway (Table 1). A comprehensive

genome analysis has been published for several of them, including

the lymphatic filarial nematode Brugia malayi [22], the dog

hookworm Ancylostoma caninum [23], and the blood flukes

Schistosoma japonicum and S. mansoni [24,25] (Figure 1; Table 1).

Figure 1. Montage of some of the major human helminth parasites, their developmental stages, and disease pathology. (A) Microfilaria of Brugia malayi in a thick blood smear, stained with Giemsa (http://www.dpd.cdc.gov/dpdx/html/frames/a-f/filariasis/body_Filariasis_mic1.htm); the microfilaria is about 250 µm in length. (B) Patient with lymphedema of the left leg due to lymphatic filariasis (http://www.cdc.gov/ncidod/dpd/parasites/lymphaticfilariasis/index.htm). (C) Hookworm egg passed in the stool of an infected person; the microscopic egg, barrel-shaped with a thin wall, is about 70 × 40 µm in dimension. (D) Longitudinal section through an adult hookworm attached to wall of small intestine, ingesting host blood and mucosal wall. The parasite is about 1 cm in length. (E) Eggs of Schistosoma mansoni. The egg is about 150 × 50 µm in dimension; the lateral spine is diagnostic for S. mansoni in comparison to the other human schistosome species. Fibrotic responses to schistosome eggs trapped in the intestines, liver, and other organs of the infected person are the cause of the schistosomiasis pathology and morbidity. (F) A pair of adult worms of the blood fluke Schistosoma mansoni; the more slender female worm resides in the gynecophoral canal of the thicker male. The worms are about 1.5 cm in length, and live for many years (http://www.dpd.cdc.gov/dpdx/HTML/ImageLibrary/Schistosomiasis_il.htm). doi:10.1371/journal.pntd.0000538.g001


Some of the main obstacles to research on human parasites are

their life cycle complexity, tissue complexity, and the paucity of

genetic and transgenic methods for manipulating genes of interest.

Comparative genome analyses have also provided insights into the

adaptations of various parasites to niches in their human (and

vector) hosts as well as insights into the molecular basis of the

mutualistic relationship between the filarial nematode B. malayi

and its endosymbiont Wolbachia (see below).

The genomes of the schistosomes S. japonicum and S. mansoni are

the first complete genomes reported for members of the

Lophotrochozoa [24,25], a large taxon that includes about 50%

of all metazoan phyla, including the mollusks, annelids, brachiopods,

nemerteans, bryozoans, platyhelminths, and others [26].

These schistosome genome sequences revealed remarkable

features of the host–parasite relationship. Among these, the

schistosome genome has lost numerous protein-encoding domains.

Whereas the total number (~6,000) of protein families is broadly

similar among schistosomes, humans, C. elegans, and fruit fly, about

1,000 protein domains have been abandoned by S. japonicum,

including some involved in basic metabolic pathways and defense,

implying that loss of these domains could be a consequence of the

adoption of a parasitic way of life. If so, the remaining molecular

repertoire must have evolved in parallel with this extensive domain

loss to permit the pathogen to locate and infect humans efficiently,

nourish itself, and interact with the external environment as well as

with the host. On the other hand, despite extensive gene and

domain loss, a number of schistosome gene families have

expanded and these provide insights into the requirements for a

parasitic lifestyle. Among the expanded gene families, a metalloprotease

called invadolysin (or leishmanolysin) has at least 12

putative family members in schistosomes compared to a single

orthologue in the human, fruit fly, and C. elegans genomes and only

three in the free-living flatworm Schmidtea mediterranea. This protease

family may facilitate skin penetration and tissue invasion by the

cercaria, the infective-stage larva of the schistosome [24,25].
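The invadolysin comparison above can be expressed as a simple fold-expansion calculation. The sketch below uses only the family sizes quoted in the text (recording "at least 12" as 12). The choice of the median of the other lineages as the baseline and the threefold cutoff are assumptions of this example, not the criteria used in [24,25].

```python
from statistics import median

# Family sizes as stated in the text; "at least 12" is recorded as 12.
invadolysin_members = {
    "schistosomes": 12,
    "human": 1,
    "fruit fly": 1,
    "C. elegans": 1,
    "free-living flatworm": 3,
}


def fold_expansion(family_sizes, focal):
    """Ratio of the focal lineage's family size to the median of the others."""
    others = [n for species, n in family_sizes.items() if species != focal]
    return family_sizes[focal] / median(others)


def is_expanded(family_sizes, focal, min_fold=3.0):
    """Flag a family as expanded; the threshold is an arbitrary choice here."""
    return fold_expansion(family_sizes, focal) >= min_fold


fold = fold_expansion(invadolysin_members, "schistosomes")
print(f"fold expansion: {fold:.1f}", is_expanded(invadolysin_members, "schistosomes"))
```

Measured against the free-living flatworm alone, the expansion is fourfold (12 versus 3), which suggests that part of the increase may predate parasitism in the flatworm lineage.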

Publication of genome sequences of filaria and schistosomes has

underscored the pressing need to develop functional genomics

approaches for these significant pathogens. Functional analyses—

which use approaches such as RNA interference (RNAi) and

translational studies—are essential to resolve uncertainties in the

molecular physiology of helminths and to illuminate mechanisms

of pathogenesis that may lead to development of new interventions

to control and eliminate these parasites or the diseases they cause. Progress in

the functional genomics of helminths was reviewed recently

[6,27,28]. In brief, RNAi has been used to inactivate the RNA

products of several genes in schistosomes (e.g., [29–32]) and

nematodes (e.g., [33]; reviewed in [8]). In addition, the recent

genome sequences of S. mansoni and S. japonicum now make feasible

genome-scale investigation of transgene integration into schistosome

chromosomes. Gene therapy–like approaches to transform

schistosomes include the use of the piggyBac transposon and

pseudotyped murine leukemia retrovirus as transgene vectors

Figure 2. Phylogeny of the major taxa of human helminths (nematodes and platyhelminths) as established by maximum likelihood (ML) analysis of 18S ribosomal RNA from 18 helminth species. Sequences were aligned using ClustalX [93]. The topology of the tree was derived from a consensus tree by neighbor-joining–based bootstrapping, its branch lengths were computed using a ML-based method, and it was rooted with the orthologue from the brewer’s yeast, Saccharomyces cerevisiae. The branch lengths of the phylogenetic tree were computed using DNAML (PHYLIP package [94]) by allowing rate variation among sites. The headings Chromadorea, Enoplea, Trematoda, and Cestoda are major classes of the phyla Nematoda and Platyhelminthes. The GenBank accession numbers of aligned sequences are DQ118536.1 (Trichuris trichiura), AY851265.1 (Trichuris suis), AF036637.1 (Trichuris muris), AY497012.1 (Trichinella spiralis), U94366.1 (Ascaris lumbricoides), AF036587.1 (Ascaris suum), AF036588.1 (Brugia malayi), AJ920348.1 (Necator americanus), AJ920347.2 (Ancylostoma caninum), AF036597.1 (Nippostrongylus brasiliensis), X03680.1 (Caenorhabditis elegans), AF036605.1 (Strongyloides ratti), U81581.1 (Strongyloides ratti), AB453329.1 (Strongyloides ratti), AF279916.2 (Strongyloides stercoralis), AB453315.1 (Strongyloides stercoralis), M84229.1 (Strongyloides stercoralis), EU011664.1 (Saccharomyces cerevisiae), U27015.1 (Saccharomyces cerevisiae), DQ157224.1 (Taenia solium), AF229852.1 (Clonorchis sinensis), Z11590.1 (Schistosoma japonicum), Z11976.1 (Schistosoma haematobium), U65657.1 (Schistosoma mansoni). doi:10.1371/journal.pntd.0000538.g002


[34–36] (Figure 3A), both of which offer a means to establish

transgenic lines of schistosomes, to elucidate schistosome gene

function and expression, and to advance functional genomics

approaches for these parasites. Notably, progress is also being

made to express reporter transgenes in parasitic nematodes

including Strongyloides stercoralis [37], in which transgene approaches

developed for use in C. elegans have recently been used to

demonstrate that morphogenesis of infective larvae requires the

DAF-16 orthologue FKTF-1 (Figure 3B) [38]. Progress is also

being made with systems for analysis of promoter sequences of

genes of parasitic nematodes (e.g., [39]).

Many future discoveries resulting from parasitic helminth

genome information can be expected to emanate from the broader

scientific community rather than from the laboratory that originated a

genome sequence project. For the specialized genome sequence

labs, dissemination of sequence information in a way that is useful,

consistent, centralized, and lasting has therefore been a key goal.

Efforts have gone well beyond depositing raw data in public

databases. Currently, helminthologists have available a number of

specialized sites for sequence analysis. C. elegans information is

easily accessible at http://www.wormbase.org [40]. Useful

information about the organism includes genome sequence,

genetic and physical maps, transcript data (EST, mRNA, SAGE,

TEC-RED, ORFeome, expression patterns from reporter gene

fusions, and microarrays), the developmental lineage of all cells,

connectivity of the nervous system, mutant phenotypes and genetic

markers, gene expression described at the level of single cells, 3D

protein structures, NCBI Clusters of Orthologous Groups, and

apoptosis and aging information. It also contains extensive

information from large-scale genomics analyses, including precomputed

sequence similarity searches, protein motif analyses,

protein–protein interactions, findings from systematic RNAi

screens, single nucleotide polymorphisms (SNPs), orthologous

and paralogous relationships, and the assignment of Gene

Ontology (GO) terms to gene products. These resources greatly

aid in the interpretation of much of the sequence data emerging

from parasitic helminths.

However, accumulating evidence suggests that C. elegans is not a

good model for all parasitic helminths, especially for the ones that

are phylogenetically very distant such as the basal nematode and

zoonotic parasite Trichinella spiralis (e.g., [41]). The other

specialized site is Nematode.net (http://www.nematode.net)

[42], developed with a primary aim to disseminate the diverse

collection of information for parasitic nematodes to the broader

scientific community in a way that is useful, consistent, centralized,

and enduring. In addition to sequence data, the site hosts

assembled NemaGene clusters in GBrowse views, characterizing

composition and protein homology, functional Gene Ontology

annotations presented via the AmiGO browser, KEGG-based

graphical display of NemaGene clusters mapped to metabolic

pathways, codon usage tables, NemFam protein families (which

represent conserved nematode-restricted coding sequences not

found in public protein databases), and a Web-based WU-BLAST

search tool that allows complex querying and other assorted

resources. Furthermore, Nematode.net, by connecting data across

the entire phylum Nematoda, has made a substantial contribution

toward integrating the historically separate fields of C. elegans,

vertebrate parasitology, and plant parasitology research. Finally,

Figure 3. Some recent approaches to expressing transgenes in human helminths. (A) Luciferase activity in Schistosoma mansoni larvae (schistosomules) after transduction with a pseudotyped retrovirus that expresses the luciferase reporter gene. Anti-luciferase antibody staining of schistosomules three days after exposure to pseudotyped lentivirus carrying the firefly luciferase transgene. Schistosomules examined by confocal laser microscopy; (i) bright field, (ii) fluorescence red channel, (iii) merged images. Control non-transformed worms showed only background levels of fluorescence (not shown; see [34–36] for relevant hypotheses and experimental methods). (B) Recent studies on transgenic Strongyloides stercoralis indicated that morphogenesis of the infective L3 stage larva requires the DAF-16 orthologue FKTF-1 [38]. L3s of this parasitic nematode were transfected with plasmids carrying the transgene fktf-1b::gfp::fktf-1b and examined by fluorescence microscopy. (i, ii) Transgenic first-stage larvae express green fluorescent protein (GFP) in the procorpus (arrow) of the pharynx. (iii, iv) A first-stage larva (L1) expresses the GFP::FKTF-1b(wt) transgene in the hypodermis. (v, vi) An infective L3 expresses the GFP::FKTF-1b(wt) fusion protein in the hypodermis and in a narrow band in the pharynx (arrow). Scale bars, 10 µm. Adapted from [38]. doi:10.1371/journal.pntd.0000538.g003


Table 1. Human parasitic helminths (and their close relatives) with genome sequencing projects completed or underway.

Phylum or Class | Species | Common Name/Disease | Primary host | Genome size, Mb | GenBank Project ID | cDNAs (3730 ABI), 1,000s | Genome Sequencing Status | Sequencing Institute (a)
Nematoda (roundworms) | | | | | | | |
Clade V (b) | Necator americanus | Hookworm/necatoriasis | Human | - | 20369 | 5 | In progress | WUGC
 | Ancylostoma caninum | Model hookworm | Dog | 344 | 12841 | 81 | Improving draft | WUGC
 | Nippostrongylus brasiliensis | Model hookworm | Rat | - | 20445 | 14.7 | In progress | SI
Clade IV | Strongyloides stercoralis | Threadworm/strongyloidiasis | Human | - | - | 11.4 | In progress | SI
 | S. ratti | Model threadworm | Rat | - | - | 27.4 | In progress | SI/WUGC
Clade III | Ascaris lumbricoides | Large roundworm/ascariasis | Human | 230 | - | 1.8 | In progress | SI
 | A. suum | Model large roundworm | Pig | 230 | - | 55.7 | Improving draft | WUGC/SI
 | Brugia malayi | Filaria/lymphatic filariasis | Human | 96 | 9549 | 26.2 | Improving draft | TIGR/University of Pittsburgh
 | Loa loa | Filaria/loaisis (cutaneous filariasis)/African eye worm | Human | - | - | 3.3 | In progress | BI
 | Onchocerca volvulus | Filaria/river blindness | Human | 150 | - | 15 | In progress | SI
 | Acanthocheilonema viteae | Model filaria | Rodent | - | 33239 | 0 | In progress | UMIGS
Clade I | Trichinella spiralis | Trichina worm/trichinosis | Pig to human | 71 | 12605 | 25.3 | Draft completed | WUGC
 | Trichuris trichiura | Whipworm/trichuriasis | Human | - | - | 0 | In progress | SI
 | T. muris | Model whipworm | Mouse | 96 | - | 7 | In progress | SI
 | T. suis | Model whipworm | Pig | - | - | 0 | In progress | WUGC
Cestoda (tapeworms) | | | | | | | |
 | Echinococcus multilocularis | Tapeworm/alveolar hydatidosis | Rodent; larva infects humans | 150 | - | 1 | In progress | SI
 | E. granulosus | Tapeworm/unilocular hydatidosis | Canids; larva infects humans | 150 | 12620 | 10 | In progress | SI

Taen

iaso

lium

Po

rkta

pe

wo

rm/t

aen

iasi

s/cy

stic

erc

osi

sH

um

an2

70

17

81

52

5D

raft

com

ple

ted

Me

xico

Cit

y

Tre

ma

tod

a(f

luk

es)

Sch

isto

som

am

an

son

iB

loo

dfl

uke

/in

test

inal

sch

isto

som

iasi

sH

um

an3

90

12

59

92

06

Dra

ftco

mp

lete

dSI

/TIG

R

S.h

aem

ato

biu

mB

loo

dfl

uke

/uri

nar

ysc

his

toso

mia

sis

Hu

man

—1

26

16

0In

pro

gre

ssSI

S.ja

po

nic

um

Blo

od

flu

ke/i

nte

stin

alsc

his

toso

mia

sis

Hu

man

40

02

94

91

10

4D

raft

com

ple

ted

CN

HG

C

Clo

no

rch

issi

nen

sis

Live

rfl

uke

/clo

no

rch

iasi

sH

um

an—

17

97

53

Inp

rog

ress

SNU

CM

aW

UG

C,

Was

hin

gto

nU

niv

ers

ity’

sG

en

om

eC

en

ter.

bP

hyl

og

en

yb

ase

do

nB

laxt

er

et

al.

[47

].B

I,B

road

Inst

itu

te;

CN

HG

C,

Ch

ine

seN

atio

nal

HG

C;

SI,

San

ge

rIn

stit

ute

;SN

UC

M,

Seo

ul

Nat

ion

alU

niv

ers

ity

Co

lleg

eo

fM

ed

icin

e;

TIG

R,

Th

eIn

stit

ute

for

Ge

no

mic

Re

sear

ch(n

ow

JCV

I).

do

i:10

.13

71

/jo

urn

al.p

ntd

.00

00

53

8.t

00

1


Nembase (http://www.nematodes.org [43]) also offers access to

parasite sequence and tools such as visualization of clusters by

stage of expression.

While each of these databases has been challenged by the

requirement to support the influx of new genomes and related

data, they nonetheless provide user communities with innovative

features and tools suited to their needs that are beyond the scope of

the large sequence repositories. For flatworms (Figure 2), it is

notable that public genome annotation and analysis tools are

already in place, including SchistoDB (http://schistoDB.net/), a

genomic database for S. mansoni that incorporates sequences and

annotation [44], and SjTPdb (http://function.chgc.sh.cn/sj-proteome/index.htm),

an integrated transcriptome and proteome

database and analysis platform for S. japonicum [45]. The genome

database for the planarian Schmidtea mediterranea, a model free-living

platyhelminth, can be expected to be advantageous to comparative

genome projects and specific research problems for the growing

number of parasitic flatworms that now are or will be subjects of

genome sequence analysis. In addition, because of the phyloge-

netic position of planarians as early bilaterian metazoans,

SmedGD (http://smedgd.neuro.utah.edu) will prove useful not

only to planarian research, but also to investigations on

developmental and evolutionary biology, comparative genomics

(specifically with parasitic flatworms including flukes and tapeworms),

stem cell research, and regeneration [46].

Evolution of Parasitism in Helminths

Genomics research has helped our understanding of the

evolution of helminths of humans and other hosts, certainly with

regard to roundworms of the phylum Nematoda. The first

comprehensive study of the molecular evolution of helminths

was a phylogenetic analysis of the small subunit ribosomal DNA (ss

rDNA) sequences from 53 roundworms [47]. This study included

both major parasitic and free-living taxonomic groups. It identified

five major clades within the Nematoda and suggested that

parasitism of animals and plants arose independently multiple

times. A more recent study included 339 nearly full-length ss

rDNAs and proposed subdivision of the phylum into 12 clades

[48]. This revealed that nematodes that feed on fungi occupy a

basal position compared to their plant parasite relatives,

confirming that the parasitic nematodes of plants arose from

fungivorous ancestors. Phylogenetic methods are also being used

to study evolution of parasitism-related protein-coding genes (such

as the enzymes that degrade the plant cell wall in nematode

parasites of plants [cellulases, pectate lyases, etc.]) to understand

better the mechanisms underlying the evolution of parasitism

(reviewed in [49]). Recent genome-wide analysis of two plant

parasitic nematodes [50,51] provided a more complete picture of

the acquisition of these cellulase genes, apparently by horizontal

gene transfer (HGT) from prokaryotes. The subsequent expansion

and diversification of HGT genes in these nematodes allow

inferences about the evolutionary history of these parasites, and in

addition present potential targets for anti-nematode drugs. When

the genome of the necromenic nematode Pristionchus pacificus was

reported recently, it became clear that cellulases were not

restricted to plant parasitic nematodes; their presence in this

species indicated preadaptation for parasitism of animals [52],

consistent with the intermediate evolutionary position of Pris-

tionchus between the microbivorous C. elegans and the animal

parasitic nematodes. In like fashion to evolution of parasitism

among nematodes, we can predict that additional analyses of

parasitic and free-living flatworm genomes will provide deeper

insights into how and when parasitism evolved in the phylum

Platyhelminthes, particularly in comparison to the fresh-water

planarian S. mediterranea, a non-parasitic flatworm for which a draft

genome is available [53]. In addition to evolution of parasitism of

humans and other vertebrate hosts, helminth parasite genome

sequences will also facilitate evolutionary studies on the role of

intermediate hosts/vectors such as the snail in schistosome

infections and the mosquito in filarial infections in this evolution.
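The ss rDNA studies cited above rest on comparing aligned sequences across taxa. As a minimal sketch of the simplest such comparison, the Python below computes pairwise p-distances (the proportion of differing sites) for a toy alignment. The taxon labels and sequences are invented for illustration and are not real ss rDNA data, and the cited analyses [47,48] relied on far richer phylogenetic methods than this.

```python
from itertools import combinations


def p_distance(seq_a: str, seq_b: str) -> float:
    """Proportion of differing sites between two aligned sequences.

    Columns where either sequence has a gap ('-') are skipped.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    differing = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a == "-" or b == "-":
            continue  # ignore gapped columns
        compared += 1
        if a != b:
            differing += 1
    if compared == 0:
        raise ValueError("no comparable sites")
    return differing / compared


def distance_matrix(alignment: dict) -> dict:
    """Pairwise p-distances keyed by (taxon, taxon) in sorted order."""
    return {
        (x, y): p_distance(alignment[x], alignment[y])
        for x, y in combinations(sorted(alignment), 2)
    }


# Hypothetical toy alignment (not real sequence data).
toy_alignment = {
    "taxon_A": "ACGTACGTAC",
    "taxon_B": "ACGTACGTAT",
    "taxon_C": "ACGAAC-TTT",
}

distances = distance_matrix(toy_alignment)
for pair, d in distances.items():
    print(pair, round(d, 3))
```

A distance matrix of this kind is only a starting point; tree-building and model-based corrections for multiple substitutions are outside the scope of the sketch.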

Host–Parasite Relationships

Investigations of regulatory networks involved in embryonic

development, organogenesis, and reproduction of helminths,

based on newly available genome sequences, have

revealed the presence of well-characterized signaling pathways,

including those for Wnt, Notch, Hedgehog, and transforming

growth factor β (TGF-β). These pathways can be recognized in the

B. malayi and schistosome genomes [22,24,25], which also encode

endogenous hormones, including epidermal growth factor (EGF)-like

and fibroblast growth factor (FGF)-like peptides. For example,

predicted components of the Ras–Raf–MAPK and TGF-β–SMAD signaling

pathways encoded by these genomes (including FGF and EGF receptors)

share high sequence identity with their mammalian orthologs,

implying that schistosomes or filarial worms, in addition to

utilizing their own pathways, might exploit host growth factors as

developmental signals.

Immune regulation by helminth parasites includes suppression,

diversion, and alteration of the host immune response, resulting in

an anti-inflammatory environment that is favorable to parasite

survival. For example, chronic infections induce key changes in

host immune cell populations including dominance of the T-helper

2 (Th2) cells and selective loss of effector T cell activity, against a

background of regulatory T cells, alternatively activated macro-

phages, and Th2-inducing dendritic cells [54,55]. With advances

in genomics, numerous parasite-derived proteins, including

cytokine homologs, protease inhibitors, and an intriguing set of

novel products, as well as glycoconjugates and small lipid moieties,

have been discovered with known or hypothesized roles in

immune interference [56–61]. These studies suggest that secreted

parasite products interfere with different arms of the immune

system by influencing the cytokine network and signal transduc-

tion pathways or by inhibiting essential enzymes. Using bioinfor-

matics to compare the predicted proteome of B. malayi to proteins

implicated in the immune response (interleukins, chemokines, and

other signaling molecules), potential immune modulators produced

by the filarial parasite have been identified, including genes

encoding the macrophage migration inhibition family of signaling

proteins [62]. Furthermore, the genome of the blood fluke S.

mansoni encodes a large array of paralogues of fucosyl- and

xylosyltransferases [25] that are involved in generating novel

glycans at the host–parasite interface and could have an important

role in subverting the host immune system. A recent

comprehensive review summarizes our current understanding of

the growing number of individual helminth mediators that target

key receptors or pathways in the mammalian immune system [63].

Helminth infection can have a broad impact on the entire

immune system. Infection with trematode and nematode parasites,

for example, correlates with a reduced incidence of atopic,

allergic-type disorders [64]. Thus, helminth infection might

be useful as a novel therapy for allergic or autoimmune

diseases [65]. Recently, worms, eggs, or purified nematode

parasite proteins have been used in preclinical and clinical trials

to protect humans from allergy and autoimmunity (reviewed in

[66–70]), including Crohn’s disease and ulcerative colitis [71,72].


Other studies have shown that substances produced by helminths,

for example Ascaris suum, Nippostrongylus brasiliensis, and Acanthochei-

lonema viteae, can directly interfere with allergic responses or with

development of allergen-specific Th2 responses [73–75]. ES-62, a

molecule secreted by the filarial nematode A. viteae, directly inhibits

the FcεRI-induced release of mediators from mast cells, protects

against mast-cell–dependent hypersensitivity in skin and lungs [76],

and inhibits collagen-induced arthritis [77]. Research is underway

to develop molecules that mimic the activity of ES-62 as drugs for

allergic and autoimmune diseases [66]. Other helminth-derived

products have the potential to reduce allergic responses. These

products include schistosomal lysophosphatidylserine (lyso-PS)

[61] and thioredoxin peroxidase from the liver fluke Fasciola

hepatica [78]. These findings demonstrate that helminths produce

products that can interfere with both the development of allergic

responses and the workings of host effector mechanisms.

The “Dependent” Helminth

As a consequence of evolution of an obligatory parasitic

existence, helminth parasites are dependent upon their interme-

diate and definitive hosts for many necessities including nutrients

such as amino acids; filariae are dependent on insect vectors to

transport them to the host. The newly available genome sequences

for schistosomes and B. malayi have confirmed earlier biochemical

studies that had revealed aspects of the physiological/biochemical

dependence of these parasites on the host. For example,

schistosomes cannot synthesize fatty acids, sterols, purines, nine

human essential amino acids, arginine, or tyrosine de novo, and must

instead catabolize complex precursors obtained from their hosts.

Loss or degeneracy of the fatty acid, sterol, and purine synthesis

pathways in schistosomes likely relates to the adoption of a

parasitic lifestyle; notably, genes encoding all the key enzymes for

de novo fatty acid and purine synthesis are present in the

free-living planarian S. mediterranea. The schistosome genome also

encodes transporters, including apolipoproteins, low-density

lipoprotein receptor, scavenger receptor, fatty-acid-binding

protein, ATP-binding-cassette transporters, and cholesterol

esterase, that allow the parasite to obtain essential lipid

nutrients by exploiting fatty acids and cholesterol from host

blood [25,79].

Many species of filarial nematodes are themselves infected by

the endosymbiotic bacterium Wolbachia. The genome sequence of

the Wolbachia species that infects the nematode B.

malayi (wBm) [80] helped establish which metabolites the

bacterium potentially provides to the nematode (riboflavin, flavin

adenine dinucleotide, heme, and nucleotides, for example) and

which are provided by the nematode to the endobacterium

(notably, amino acids). This type of information has opened up the

exciting possibility that drugs already registered for human use

might inhibit key biochemical pathways in Wolbachia that could

sterilize or kill the adult worms. Although the Wolbachia genome is

even more degenerate than that of the related pathogen Rickettsia,

it has retained more intact metabolic pathways than Rickettsia. This

may be important in its biochemical contribution to host (i.e.,

filarial) viability and fecundity.
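One way to picture how a genome sequence points to metabolic dependence is as a presence/absence table of biosynthetic capabilities that is compared between partners. The Python sketch below is a deliberate simplification: the table encodes only the exchanges named in the text (metabolites that wBm potentially provides, and amino acids provided by the nematode), and the names `CAN_SYNTHESIZE` and `candidate_supplies` are hypothetical. Real inference would require pathway-level annotation of both genomes.

```python
# Simplified, illustrative capability table built only from the exchanges
# named in the text; it is an assumption, not a real pathway annotation.
CAN_SYNTHESIZE = {
    "wBm": {
        "riboflavin",
        "flavin adenine dinucleotide",
        "heme",
        "nucleotides",
    },
    "B. malayi": {"amino acids"},
}


def candidate_supplies(donor, recipient, table=CAN_SYNTHESIZE):
    """Metabolites the donor can make that the recipient cannot.

    A set difference over the capability table: anything present for the
    donor but absent for the recipient is a candidate for exchange.
    """
    return sorted(table[donor] - table[recipient])


for donor, recipient in (("wBm", "B. malayi"), ("B. malayi", "wBm")):
    print(donor, "->", recipient, ":", candidate_supplies(donor, recipient))
```

The same set-difference view also suggests why such exchanges are attractive drug targets: removing one partner's contribution leaves the other without a source for those metabolites.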

The wBm genome encodes many more proteases and

peptidases than that of Rickettsia; these enzymes likely degrade host

proteins in the extracellular environment. Other proteins encoded by wBm

include a common type IV secretion system, as used by some

pathogenic gram-negative bacteria to transfer plasmids and

proteins into surrounding host cells, and an abundance of ankyrin

domain-containing proteins, which might regulate host gene

expression, as suggested for Ehrlichia phagocytophila AnkA [81], as

well as several proteins predicted to localize on the cell surface.

Ankyrin domain–containing proteins are noteworthy because of

their roles in protein–protein interactions in a variety of cellular

processes. A number of other wBm molecules are of interest as

potential drug targets. For example, glutathione biosynthesis genes

may provide glutathione for the protection of the filaria from

oxidative stress or human immunological effector molecules.

Heme produced by wBm (all five synthesis genes are present)

could be vital to worm embryogenesis, as there is evidence that

molting and reproduction are controlled by ecdysteroid-like

hormones [82], synthesis of which requires heme. Depletion of

Wolbachia might therefore halt production of these hormones and

block molting and/or embryogenesis in B. malayi. Most, if not all,

nematodes, including B. malayi, appear to be unable to synthesize

heme, but must obtain it from extraneous sources, such as the host,

the food supply, or perhaps from endosymbionts.


The filarial and schistosome genome sequences now available

provide the vanguard for assembly of a genome sequence catalog

of the numerous other neglected helminth parasites (Table 1).

Comparative genomics will likely be a dominant approach to

organize, interpret, and utilize the vast amounts of genomic

information anticipated from the genomes of these parasites (e.g.,

[83,84]). In terms of sequencing tools, the new generation of

“massively parallel” sequencing platforms commercially available

today (such as the Roche/454 pyrosequencer [85], Illumina/Solexa

[86], and SOLiD [87]) offers on the order of 100- to 1,000-fold

increases in throughput over Sanger sequencing

technology [88] on capillary electrophoresis instruments. This

rapid change to producing millions of DNA sequence reads in a

short time will have a huge impact on research on NTDs. Each

platform has a specific application: while the Roche/454 is

optimal for in-depth analysis of whole transcriptomes and de novo

sequencing of bacterial and small eukaryotic genomes, the

Illumina and SOLiD systems are highly attractive for resequencing

projects aimed at identifying genetic variants (mutations, inser-

tions, deletions), profiling and discovering noncoding RNAs

(ncRNAs), and studying epigenetic modifications of histones and

DNA. With the increased read length and improved error rate of

massively parallel pyrosequencing technology, de novo sequencing

of helminth genomes has become possible at a fraction of earlier

costs. In the next five years, projects at Washington

University’s Genome Center (http://www.genome.gov/10002154)

and the Wellcome Trust Sanger Institute

(http://www.sanger.ac.uk/Projects/Helminths/) should increase the

available sequence data on human helminths and their close

relatives by an order of magnitude, adding more than 20 draft

genomes to those listed in Table 1.

Once these reference genomes become available, sequencing of

clinical isolates is expected to accelerate. Sequencing of the clinical

strains and strain-to-reference comparisons can be performed

using platforms such as Illumina/Solexa and SOLiD to investigate

genome-wide polymorphism and provide a comprehensive picture

of natural helminth genome variation. These approaches should

also be valuable for exploring genetic changes involved in

resistance to anti-worm drugs and understanding the potential

mechanisms of drug resistance in human parasites, and can be

expected to facilitate development of genetic markers to monitor

and manage any future appearance and spread of drug resistance.

These phenomena are of tremendous importance, particularly

since some major neglected helminth diseases are being targeted in

mass drug treatment campaigns [89]. In addition, the new

generation of sequencing technologies has provided unprecedented

opportunities for high-throughput functional genomic

research (reviewed in [90]) that await application to helminth

research.
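A strain-to-reference comparison of the kind described above can be sketched, under strong simplifying assumptions, as a walk along a pairwise alignment that records where an isolate differs from the reference. In the Python below, the sequences, the `Variant` type, and the function `call_variants` are all hypothetical. Real resequencing pipelines work from millions of short reads with quality scores rather than from a single pre-aligned pair.

```python
from typing import List, NamedTuple


class Variant(NamedTuple):
    ref_pos: int  # 1-based reference coordinate (for insertions: the base before)
    kind: str     # "substitution", "insertion", or "deletion"
    ref: str
    alt: str


def call_variants(reference: str, isolate: str) -> List[Variant]:
    """List differences between two pre-aligned sequences ('-' marks a gap)."""
    if len(reference) != len(isolate):
        raise ValueError("sequences must be aligned to the same length")
    variants = []
    ref_pos = 0
    for ref_base, iso_base in zip(reference.upper(), isolate.upper()):
        if ref_base != "-":
            ref_pos += 1  # advance only on real reference bases
        if ref_base == iso_base:
            continue
        if ref_base == "-":
            kind = "insertion"   # extra base in the isolate
        elif iso_base == "-":
            kind = "deletion"    # base missing from the isolate
        else:
            kind = "substitution"
        variants.append(Variant(ref_pos, kind, ref_base, iso_base))
    return variants


# Hypothetical toy alignment (not real helminth sequence).
REFERENCE = "ATGGCC-TAA"
ISOLATE = "ATGACCGT-A"

for variant in call_variants(REFERENCE, ISOLATE):
    print(variant)
```

Repeating such a comparison across many isolates would give the per-site polymorphism counts from which candidate markers, for example for drug resistance, could be shortlisted.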

Although some details of immunomodulation by helminth

components have been characterized, we are just beginning to

understand how these parasite products act on immune responses

and to assemble fragmentary information on individual compo-

nents into a comprehensive picture. Comparisons of helminth

molecules with orthologues/paralogues from free-living relatives

will strengthen efforts to decipher the strategies adopted by

helminth parasites to evade and subvert their host immune

responses. This information will be exploitable for development of

drugs and vaccines against the parasites and potentially also novel

therapeutic biologics for use in humans. Future studies might

determine whether helminth proteins of unknown function

are the source of the intriguing regulatory effects that helminth

infections have on the host immune response.

Treatment for helminthic infections, responsible for hundreds of

thousands of deaths each year, depends almost exclusively on just

two or three drugs: praziquantel, the benzimidazoles, and

ivermectin. Vaccines and new drugs are needed, not least because

resistance to these compounds in human helminth parasites such as

schistosomes, whipworms, and filariae would present a

major problem for current treatment and control strategies.

Pharmacogenomics with the new helminth genomes represents a

practicable route forward toward new drugs. For example,

chemogenomics screening of the genome sequence of S. mansoni

identified >20 parasite proteins for which potential drugs, already

approved for other human ailments, are available [25]; indeed, in the

case of the schistosome thioredoxin glutathione

reductase, auranofin (an anti-arthritis medication) was recently

shown to exhibit potent anti-schistosomal activity [91]. Finally,

the vast new sequence information will undoubtedly allow revision

of our understanding of the host–parasite relationship, its

evolution, vector–pathogen and helminth–symbiont interactions,

unique aspects of cell biology and biochemistry, phylogenetic

relationships, intervention targets, research approaches (e.g., [92]),

and so forth.

Acknowledgments

We thank Victoria Mann, Geoffrey Gobert and Gabriel Rinaldi for access

to their unpublished findings on schistosomes transduced with pseudotyped

virions.

References

1. Hotez PJ, Brindley PJ, Bethony JM, King CH, Pearce EJ, et al. (2008) Helminth

infections: The great neglected tropical diseases. J Clin Invest 118: 1311–1321.

2. Hotez PJ, Kamath A (2009) Neglected tropical diseases in sub-Saharan Africa:Review of their prevalence, distribution, and disease burden. PLoS Negl Trop

Dis 3: e412.

3. Patz JA, Graczyk TK, Geller N, Vittor AY (2000) Effects of environmentalchange on emerging parasitic diseases. Int J Parasitol 30: 1395–1405.

4. Liang S, Yang C, Zhong B, Qiu D (2006) Re-emerging schistosomiasis in hilly

and mountainous areas of Sichuan, China. Bull WHO 84: 139–144.

5. Huyse T, Webster BL, Geldof S, Stothard JR, Diaw OT, et al. (2009)

Bidirectional introgressive hybridization between a cattle and human schisto-some species. PLoS Pathog 5: e1000571. doi:10.1371/journal.ppat.1000571.

6. Kalinna BH, Brindley PJ (2007) Manipulating the manipulators: Advances in

parasitic helminth transgenesis. Trends Parasitol 23: 197–204.

7. Krasky A, Rohwer A, Schroeder J, Selzer PM (2007) A combined bioinformaticsand chemoinformatics approach for the development of new antiparasitic drugs.

Genomics 89: 36–43.

8. Mitreva M, Zarlenga DS, McCarter JP, Jasmer DP (2007) Parasitic nematodes -

From genomes to control. Vet Parasitol 148: 31–42.

9. Berriman M, Lustigman S, McCarter JP (2007) Helminth initiative for drugdiscovery – Report of the informal consultation, genomics and emerging drug

discovery technologies. Expert Opin Drug Discovery 2: S83–S89.

10. Lustigman S, Ford S, Crawford MJ (2008) RNA Interference: from functionalgenomics to validation of drug targets in helminths. In: RNA interference

research progress LylandRoger T, BrowningIrving B, eds. Nova Publishers. pp

135–162.

11. Franco GR, Adams MD, Soares MB, Simpson AJG, Venter JC, et al. (1995)Identification of new Schistosoma mansoni genes by the EST strategy using a

directional cDNA library. Gene 152: 141–147.

12. Gobert GN, Moertel L, Brindley PJ, McManus DP (2009) Developmental geneexpression profiles of the human pathogen Schistosoma japonicum. BMC Genomics

10: 128.

13. Robinson MW, Connolly B (2005) Proteomic analysis of the excretory-secretoryproteins of the Trichinella spiralis L1 larva, a nematode parasite of skeletal

muscle. Proteomics 5: 4525–4532.

14. Mitreva M, McCarter JP, Martin J, Dante M, Wylie T, et al. (2004)

Comparative genomics of gene expression in the parasitic and free-livingnematodes Strongyloides stercoralis and Caenorhabditis elegans. Genome Res 14:

209–220.

15. Taft AS, Vermeire JJ, Bernier J, Birkeland SR, Cipriano MJ, et al. (2009)Transcriptome analysis of Schistosoma mansoni larval development using serial

analysis of gene expression (SAGE). Parasitology 136: 469–485.

16. Mitreva M, McCarter JP, Arasu P, Hawdon J, Martin J, et al. (2005)

Investigating hookworm genomes by comparative analysis of two Ancylostoma

species. BMC Genomics 6: 58.

17. McCarter JP (2004) Genomic filtering: An approach to discovering novel

antiparasitics. Trends Parasitol 20: 462–468.

18. Wasmuth J, Schmid R, Hedley A, Blaxter M (2008) On the extent and origins ofgenic novelty in the phylum Nematoda. PloS Negl Trop Dis 2: e258.

doi:10.1371/journal.pntd.0000258.

19. Yin Y, Martin J, Abubucker S, Wang Z, Wyrwicz L, et al. (2009) Molecular

determinants archetypical to the phylum Nematoda. BMC Genomics 10: 114.

20. Wang Z, Martin J, Abubucker S, Yin Y, Gasser R, et al. (2009) Systematic

analysis of insertions and deletions specific to nematode proteins and their

proposed functional and evolutionary relevance. BMC Evol Biol 9: 23.

21. The C. elegans Sequencing Consortium (1998) Genome sequence of the

nematode C. elegans: A platform for investigating biology. Science 282:

2012–2018.

22. Ghedin E, Wang S, Spiro D, Caler E, Zhao Q, et al. (2007) Draft genome of the

filarial nematode parasite Brugia malayi. Science 317: 1756–1760.

23. Abubucker S, Martin J, Yin Y, Fulton L, Yang S-P, et al. (2008) The canine

hookworm genome: Analysis and classification of Ancylostoma caninum survey

sequences. Mol Biochem Parasitol 157: 187–192.

24. Schistosoma japonicum Genome Sequencing and Functional Analysis Consortium,

Liu F, Zhou Y, Wang ZQ, Lu G, et al. (2009) The Schistosoma japonicum genome

reveals features of host-parasite interplay. Nature 460: 345–351.

25. Berriman M, Haas BJ, LoVerde PT, Wilson RA, Dillon GP, et al. (2009) The

genome of the blood fluke Schistosoma mansoni. Nature 460: 352–358.

26. Dunn CW, Hejnol A, Matus DQ, Pang K, Browne WE, et al. (2008) Broad

phylogenomic sampling improves resolution of the animal tree of life. Nature

452: 745–749.

27. Krautz-Peterson G, Bhardwaj R, Faghiri Z, Tararam C, Skelly PJ (2009) RNA

interference in schistosomes: machinery and methodology. Parasitology;E-pub

ahead of print. doi:10.1017/S0031182009991168.

28. Mann VH, Morales ME, Kines KJ, Brindley PJ (2008) Transgenesis of

schistosomes: approaches using mobile genetic elements. Parasitology 134: 1–13.

29. Freitas TC, Jung E, Pearce EJ (2007) TGF-beta signaling controls embryo

development in the parasitic flatworm Schistosoma mansoni. PLoS Pathog 3: e52.


30. Morales ME, Rinaldi G, Kines KJ, Gobert GN, Tort JF, et al. (2008) RNA

interference targeting Schistosoma mansoni cathepsin D, the apical enzyme of the

hemoglobin proteolysis cascade. Mol Biochem Parasitol 157: 160–168.

31. Rinaldi G, Morales ME, Alrefaei YN, Cancela M, Castillo E, et al. (2009) RNA

interference targeting leucine aminopeptidases inhibits hatching of eggs of the

human blood fluke, Schistosoma mansoni. Mol Biochem Parasitol 167: 118–126.

32. Faghiri Z, Skelly PJ (2009) The role of tegumental aquaporin from the human

parasitic worm, Schistosoma mansoni, in osmoregulation and drug uptake. FASEB J

23: 2780–2789.

33. Ford L, Zhang J, Liu J, Hashmi S, Fuhrman JA, et al. (2009) Functional analysis

of the cathepsin-like cysteine protease genes in adult Brugia malayi using RNA

interference. PLoS Negl Trop Dis 3: e377. doi: 10.1371/journal.pntd.0000377.

34. Morales ME, Mann VH, Kines KJ, Gobert GN, Kalinna BH, et al. (2007)

piggyBac transposon mediated transgenesis of the human blood fluke, Schistosoma

mansoni. FASEB J 21: 3479–3489.

35. Kines KJ, Mann VH, Morales ME, Shelby BD, Kalinna BH, et al. (2006)

Transduction of Schistosoma mansoni by vesicular stomatitis virus glycoprotein-

pseudotyped Moloney murine leukemia retrovirus. Exp Parasitol 112: 209–220.

36. Kines KJ, Morales ME, Mann VH, Gobert GN, Brindley PJ (2008) Integration

of reporter transgenes into Schistosoma mansoni chromosomes mediated by

pseudotyped murine leukemia virus. FASEB J 22: 2936–2948.

37. Li X, Massey HC, Jr., Nolan TJ, Schad GA, Kraus K, et al. (2006) Successful

transgenesis of the parasitic nematode Strongyloides stercoralis requires endogenous

non-coding control elements. Int J Parasitol 36: 671–679.


38. Castelletto ML, Massey HC, Jr., Lok JB (2009) Morphogenesis of Strongyloides

stercoralis infective larvae requires the DAF-16 ortholog FKTF-1. PLoS Pathog 5:e1000370. doi: 10.1371/journal.ppat.1000370.

39. de Oliveira A, Katholi CR, Unnasch TR (2008) Characterization of the

promoter of the Brugia malayi 12 kDa small subunit ribosomal protein (RPS12)

gene. Int J Parasitol 38: 1111–1119.

40. Rogers A, Antoshechkin I, Bieri T, Blasiar D, Bastiani C, et al. (2008) Wormbase2007. Nucleic Acids Res 36(Database issue). pp D612–617.

41. Mitreva N, Appleton J, McCarter JP, Jasmer DP (2005) Expressed sequence tags

from life cycle stages of Trichinella spiralis: Application to biology and parasitecontrol. Vet Parasitol 132: 13–17.

42. Martin J, Abubucker S, Wylie T, Yin Y, Mitreva M (2009) Nematode.net update

2008: Improvements enabling more efficient data mining and comparativenematode genomics. Nucleic Acids Res 37(Database issue): D571–578.

43. Parkinson J, Whitton C, Schmid R, Thomson M, Blaxter M (2004) NEMBASE:

A resource for parasitic nematode ESTs. Nucleic Acids Res 32: D427–430.

44. Zerlotini A, Heiges M, Wang H, Moraes RL, Dominitini AJ, et al. (2009)

SchistoDB: A Schistosoma mansoni genome resource. Nucleic Acids Res37(Database issue): D579–582.

45. Liu F, Chen P, Cui SJ, Wang ZQ, Han ZG (2008) SjTPdb: Integrated

transcriptome and proteome database and analysis platform for Schistosoma

japonicum. BMC Genomics 9: 304.

46. Robb SMC, Ross E, Sanchez Alvarado A (2008) SmedGD: The Schmidtea

mediterranea genome database. Nucleic Acids Res 36(Database issue). ppD599–D606.

47. Blaxter ML, De Ley P, Garey JR, Liu LX, Scheldeman P, et al. (1998) A

molecular evolutionary framework for the phylum Nematoda. Nature 392:

71–75.

48. Holterman M, van der Wurff A, van den Elsen S, van Megen H, Bongers T,et al. (2006) Phylum-wide analysis of SSU rDNA reveals deep phylogenetic

relationships among nematodes and accelerated evolution toward crown clades.Mol Biol Evol 23: 1792–1800.

49. Mitreva M, Smant G, Helder J (2009) Role of horizontal gene transfer in the

evolution of plant parasitism among nematodes. In: Horizontal Gene Transfer.

Methods Mol Biol 532: 517–535.

50. Abad P, Gouzy J, Aury J-M, Castagnone-Sereno P, Danchin EG, et al. (2008)Genome sequence of the metazoan plant-parasitic nematode Meloidogyne incognita.

Nat Biotech 26: 909–915.

51. Opperman CH, Bird DM, Williamson VM, Rokhsar DS, Burke M, et al. (2008)Sequence and genetic map of Meloidogyne hapla: A compact nematode genome for

plant parasitism. Proc Natl Acad Sci U S A 105: 14802–14807.

52. Dieterich C, Clifton SW, Schuster LN, Chinwalla A, Delehaunty K, et al. (2008)The Pristionchus pacificus genome provides a unique perspective on nematode

lifestyle and parasitism. Nat Genet 40: 1193–1198.

53. Robb SM, Ross E, Sanchez Alvarado A (2008) SmedGD: The Schmidtea

mediterranea genome database. Nucleic Acids Res 6: D599–D606.

54. Maizels RM, Balic A, Gomez-Escobar N, Nair M, Taylor MD, et al. (2004)Helminth parasites–Masters of regulation. Immunol Rev 201: 89–116.

55. Ohnmacht C, Voehringer D (2009) Basophil effector function and homeostasis

during helminth infection. Blood 113: 2816–2825.

56. Hartmann S, Kyewski B, Sonnenburg B, Lucius R (1997) A filarial cysteineprotease inhibitor down-regulates T cell proliferation and enhances interleukin-

10 production. Eur J Immunol 27: 2253–2260.

57. Hartmann S, Lucius R (2003) Modulation of host immune responses bynematode cystatins. Int J Parasitol 33: 1291–1302.

58. Harnett W, McInnes IB, Harnett MM (2004) ES-62, a filarial nematode-derived

immunomodulator with anti-inflammatory potential. Immunol Lett 94: 27–33.

59. Gomez-Escobar N, Lewis E, Maizels RM (1998) A novel member of the

transforming growth factor-beta (TGF-beta) superfamily from the filarialnematodes Brugia malayi and B. pahangi. Exp Parasitol 88: 200–209.

60. Gomez-Escobar N, Gregory WF, Maizels RM (2000) Identification of tgh-2, a

filarial nematode homolog of Caenorhabditis elegans daf-7 and human transforming growth factor beta, expressed in microfilarial and adult stages of Brugia malayi.


61. van der Kleij D, Latz E, Brouwers JF, Kruize JC, Schmitz M, et al. (2002) A novel host-parasite lipid cross-talk. Schistosomal lyso-phosphatidylserine activates toll-like receptor 2 and affects immune polarization. J Biol Chem 277: 48122–48129.

62. Pastrana DV, Raghavan N, FitzGerald P, Eisinger SW, Metz C, et al. (1998) Filarial nematode parasites secrete a homologue of the human cytokine

macrophage migration inhibitory factor. Infect Immun 66: 5955–5963.

63. Hewitson JP, Grainger JR, Maizels RM (2009) Helminth immunoregulation: The role of parasite secreted proteins in modulating host immunity. Mol

Biochem Parasitol 167: 1–11.

64. Yazdanbakhsh M, van den Biggelaar A, Maizels RM (2001) Th2 responses

without atopy: Immunoregulation in chronic helminth infections and reduced allergic disease. Trends Immunol 22: 372–377.

65. Imai S, Fujita K (2004) Molecules of parasites as immunomodulatory drugs.

Curr Top Med Chem 4: 539–552.

66. Harnett W, Harnett MM (2008) Therapeutic immunomodulators from

nematode parasites. Expert Rev Mol Med 10: e18.

67. Harnett W, Harnett MM (2008) Parasitic nematode modulation of allergic disease. Curr Allergy Asthma Rep 8: 392–397.

68. Johnston MJ, MacDonald JA, McKay DM (2009) Parasitic helminths: A pharmacopeia of anti-inflammatory molecules. Parasitology 136: 125–147.

69. McKay DM (2009) The therapeutic helminth? Trends Parasitol 25: 109–114.

70. Erb KJ (2009) Can helminths or helminth-derived products be used in humans to prevent or treat allergic diseases? Trends Immunol 30: 75–82.

71. Summers RW, Elliott DE, Urban JF, Jr., Thompson R, Weinstock JV (2005) Trichuris suis therapy in Crohn’s disease. Gut 54: 87–90.

72. Summers RW, Elliott DE, Urban JF, Jr., Thompson RA, Weinstock JV (2005) Trichuris suis therapy for active ulcerative colitis: A randomized controlled trial.

Gastroenterology 128: 825–832.

73. Lima C, Perini A, Garcia ML, Martins MA, Teixeira MM, et al. (2002) Eosinophilic inflammation and airway hyper-responsiveness are profoundly

inhibited by a helminth (Ascaris suum) extract in a murine model of asthma. Clin Exp Allergy 32: 1659–1666.

74. Schnoeller C, Rausch S, Pillai S, Avagyan A, Wittig BM, et al. (2008) A

helminth immunomodulator reduces allergic and inflammatory responses by induction of IL-10-producing macrophages. J Immunol 180: 4265–4272.

75. Melendez AJ, Harnett MM, Pushparaj PN, Wong WS, Tay HK, et al. (2007) Inhibition of Fc epsilon RI-mediated mast cell responses by ES-62, a product of

parasitic filarial nematodes. Nat Med 13: 1375–1381.

76. McInnes IB, Leung BP, Harnett M, Gracie JA, Liew FY, et al. (2003) A novel

therapeutic approach targeting articular inflammation using the filarial

nematode-derived phosphorylcholine-containing glycoprotein ES-62. J Immunol 171: 2127–2133.

77. Donnelly S, O’Neill SM, Sekiya M, Mulcahy G, Dalton JP (2005) Thioredoxin peroxidase secreted by Fasciola hepatica induces the alternative activation of

macrophages. Infect Immun 73: 166–173.

78. Holland MJ, Harcus YM, Riches PL, Maizels RM (2000) Proteins secreted by the parasitic nematode Nippostrongylus brasiliensis act as adjuvants for Th2

responses. Eur J Immunol 30: 1977–1987.

79. Han ZG, Brindley PJ, Wang S, Chen Z (2009) Schistosome genomics: New

perspectives on schistosome biology and host parasite interaction. Annu Rev Genomics Hum Genet 10: 211–240.

80. Foster J, Ganatra M, Kamal I, Ware J, Makarova K, et al. (2005) The Wolbachia

genome of Brugia malayi: endosymbiont evolution within a human pathogenic nematode. PLoS Biol 3: e121. doi:10.1371/journal.pbio.0030121.

81. Park J, Kim KJ, Choi K-S, Grab DJ, Dumler JS (2004) Anaplasma phagocytophilum

AnkA binds to granulocyte DNA and nuclear proteins. Cell Microbiol 6:

743–751.

82. Warbrick EV, Barker GC, Rees HH, Howells RE (1993) The effect of invertebrate hormones and potential hormone inhibitors on the third larval

moult of the filarial nematode, Dirofilaria immitis, in vitro. Parasitology 107: 459–463.

83. Nisbet AJ, Cottee PA, Gasser RB (2008) Genomics of reproduction in nematodes: prospects for parasite intervention? Trends Parasitol 24: 89–95.

84. Dieterich C, Sommer RJ (2009) How to become a parasite - Lessons from the

genomes of nematodes. Trends Genet 25: 203–209.

85. Margulies M, Egholm M, Altman WE, Attiya S, Bader JS, et al. (2005) Genome

sequencing in microfabricated high-density picolitre reactors. Nature 437: 376–380.

86. Bennett S (2004) Solexa Ltd. Pharmacogenomics 5: 433–438.

87. Shendure J, Porreca GJ, Reppas NB, Lin X, McCutcheon JP, et al. (2005) Accurate multiplex polony sequencing of an evolved bacterial genome. Science

309: 1728–1732.

88. Sanger F, Nicklen S, Coulson A (1977) DNA sequencing with chain-terminating

inhibitors. Proc Natl Acad Sci U S A 74: 5463–5467.

89. Fenwick A (2009) Host-parasite relations and implications for control. Adv Parasitol 68: 247–261.

90. Morozova O, Marra MA (2008) Applications of next-generation sequencing technologies in functional genomics. Genomics 92: 255–264.

91. Kuntz AN, Davioud-Charvet E, Sayed AA, Califf LL, Dessolin J, et al. (2007) Thioredoxin glutathione reductase from Schistosoma mansoni: An essential parasite

enzyme and a key drug target. PLoS Med 4: e206. Erratum in: PLoS Med 2007,

4: e264.

92. Cosseau C, Azzi AH, Smith K, Freitag M, Mitta G, et al. (2009) Native

chromatin immunoprecipitation (N-ChIP) and ChIP-Seq of Schistosoma mansoni: Critical experimental parameters. Mol Biochem Parasitol 166: 70–76.

93. Chenna R, Sugawara H, Koike T, Lopez R, Gibson TJ, et al. (2003) Multiple

sequence alignment with the Clustal series of programs. Nucleic Acids Res 31: 3497–3500.

94. Felsenstein J (1988) Phylogenies from molecular sequences: Inference and reliability. Ann Rev Genet 22: 521–565.

