Proc. Natl. Acad. Sci. USA
Vol. 93, pp. 11121-11125, October 1996
Microbiology

DNA repeats identify novel virulence genes in Haemophilus influenzae

(microbiology/whole genome sequencing/pathogenicity)

DEREK W. HOOD*†, MARY E. DEADMAN*, MICHAEL P. JENNINGS*‡, MARINA BISERCIC*, ROBERT D. FLEISCHMANN§, J. CRAIG VENTER§, AND E. RICHARD MOXON*

*Molecular Infectious Diseases Group, University of Oxford Department of Paediatrics, Institute of Molecular Medicine, John Radcliffe Hospital, Headington, Oxford OX3 9DU, United Kingdom; and §The Institute for Genomic Research, 9712 Medical Center Drive, Rockville, MD 20850

Communicated by Hamilton O. Smith, Johns Hopkins University, Baltimore, MD, July 19, 1996 (received for review April 1, 1996)

Abbreviation: LPS, lipopolysaccharide.
†To whom reprint requests should be addressed. e-mail: dhood@molbiol.ox.ac.uk.
‡Present address: School of Biomolecular and Biomedical Science, Faculty of Science and Technology, Griffith University, Queensland 4111, Australia.

The publication costs of this article were defrayed in part by page charge payment. This article must therefore be hereby marked "advertisement" in accordance with 18 U.S.C. §1734 solely to indicate this fact.

ABSTRACT The whole genome sequence (1.83 Mbp) of Haemophilus influenzae strain Rd was searched to identify tandem oligonucleotide repeat sequences. Loss or gain of one or more nucleotide repeats through a recombination-independent slippage mechanism is known to mediate phase variation of surface molecules of pathogenic bacteria, including H. influenzae. This facilitates evasion of host defenses and adaptation to the varying microenvironments of the host. We reasoned that iterative nucleotides could identify novel genes relevant to microbe-host interactions. Our search of the Rd genome sequence identified 9 novel loci with multiple (range 6-36, mean 22) tandem tetranucleotide repeats. All were found to be located within putative open reading frames and included homologues of hemoglobin-binding proteins of Neisseria, a glycosyltransferase (lgtC gene product) of Neisseria, and an adhesin of Yersinia. These tetranucleotide repeat sequences were also shown to be present in two other epidemiologically different H. influenzae type b strains, although the number and distribution of repeats were different. Further characterization of the lgtC gene showed that it was involved in phenotypic switching of a lipopolysaccharide epitope and that this variable expression was associated with changes in the number of tetranucleotide repeats. Mutation of lgtC resulted in attenuated virulence of H. influenzae in an infant rat model of invasive infection. These data indicate the rapidity, economy, and completeness with which whole genome sequences can be used to investigate the biology of pathogenic bacteria.

Haemophilus influenzae is an important cause of human disease worldwide. Serotype b capsular strains cause invasive bacteremic infections such as meningitis, septicemia, epiglottitis, septic arthritis, and empyema, particularly in infants. Strains lacking a capsule are a common cause of otitis media, sinusitis, conjunctivitis, and acute lower respiratory tract infections, which account for many millions of childhood deaths in the Third World (1).

A feature of pathogenic bacteria, including H. influenzae, is their propensity for varying the phenotypes of surface molecules which interact with host structures (2, 3). One genetic mechanism mediating phenotypic variation involves changes in the number of repeated nucleotides in mononucleotide (homopolymeric) tracts or tandemly iterated oligonucleotides. As one example, dinucleotide repeats of TA have been associated with phase variation of H. influenzae fimbriae, an adhesin mediating attachment to respiratory epithelia (4). The mechanism involves altered transcription of two divergently transcribed genes resulting from loss or gain of nucleotide repeats located in the region of the overlapping promoters (5). Multiple tandem DNA repeats of CAAT or GCAA have been identified within the 5' end of the translated reading frames of genes required for lipopolysaccharide (LPS) biosynthesis (6, 7). These repeat regions are also subject to loss or gain of one or more of the four-base repeats, presumably through slipped-strand mispairing (8), producing frame shifts and resulting in a high frequency of phase variation of oligosaccharide core structures. This genetic potential to generate a repertoire of variant antigens is one of the mechanisms by which pathogenic microbes can adapt to the differing microenvironments of the host and evade host immune responses (2, 3). Similar mechanisms involving pentanucleotide and mononucleotide repeats have been described in Neisseria meningitidis, Neisseria gonorrhoeae, and other pathogens (9-13).

Given the importance of repetitive DNA, we hypothesized that a search of the complete genome sequence of H. influenzae strain Rd (14) for repetitive DNA sequences might identify novel genes whose products would have functions of relevance to pathogenicity. It was our aim to characterize these DNA repeats and their associated loci in strain Rd, to extend the study to include other H. influenzae strains, and then to show the biological role of one of these novel loci, lgtC, in mediating phase variation and bacterial virulence.

MATERIALS AND METHODS

Bacterial Strains and Culture Conditions. The H. influenzae type d strain RM118 (strain KW-20) was a derivative of strain Rd obtained from H. O. Smith, the same source for the strain used in the sequencing project (14). Also used in this study were the type b strains RM153 (Eagan) and RM7004 (both disease isolates from the United States (15) and Holland (16), respectively). H. influenzae strains were grown at 37°C in brain heart infusion (BHI) broth supplemented with hemin (10 µg/ml) and NAD (2 µg/ml). Following transformation, kanamycin (10 µg/ml) was added to the growth medium.

Escherichia coli, strain DH5α, was used to propagate cloned PCR products and was grown at 37°C in Luria-Bertani (LB) broth supplemented with ampicillin (100 µg/ml).

Searching for DNA Repeats in the H. influenzae Genome Sequence. All possible combinations of mononucleotide, dinucleotide, trinucleotide, and tetranucleotide motifs were searched for in early versions of the H. influenzae strain Rd genome data base prior to completion of sequencing and annotation. Three or more repeats of tetranucleotide and trinucleotide motifs were searched for by using the program BLAST (17), and mononucleotide (homopolymeric) tracts comprising two or more nucleotides of A(T) or C(G) and dinucleotide




motifs from the possible combinations of AT(TA), AG(TC), AC(TG), or CG(GC) were searched for in the strain Rd database by using the program FINDPATTERN (17). Regions of DNA around significant numbers of repeats were isolated and searched against the combined GenBank/EMBL data banks to predict the position and function of any associated reading frames.
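The kind of tandem-repeat search described above can be sketched in a few lines. This is a minimal illustration only, not the BLAST or FINDPATTERN procedure used in the study: it swaps in a regular-expression backreference to locate runs of a tandemly repeated unit. The function name `find_tandem_repeats`, its parameters, the default threshold of five copies, and the toy sequence are all assumptions introduced here.

```python
import re


def find_tandem_repeats(seq, unit_len=4, min_copies=5):
    """Return (start, unit, copies) for each run of >= min_copies tandem units.

    Hypothetical helper for illustration; the reported unit is whichever
    rotation of the motif the scan meets first (e.g. CAAT vs. AATC).
    """
    seq = seq.upper()
    # ([ACGT]{k}) captures one unit; \1{n,} demands at least n more copies of it.
    pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (unit_len, min_copies - 1))
    hits = []
    for match in pattern.finditer(seq):
        unit = match.group(1)
        # Skip units that are themselves repeats of a shorter motif
        # (e.g. AAAA or ATAT); those belong to the mono/dinucleotide searches.
        if any(unit == unit[:k] * (unit_len // k)
               for k in range(1, unit_len) if unit_len % k == 0):
            continue
        copies = len(match.group(0)) // unit_len
        hits.append((match.start(), unit, copies))
    return hits


# Toy sequence (not genome data): 22 copies of GACA with short flanks.
toy = "ATGCC" + "GACA" * 22 + "TTT"
hits = find_tandem_repeats(toy)
```

On a real genome the same function would be applied to each strand or contig in turn; regions around each hit could then be extracted for a database search, as the text describes.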

Analysis of Tetrameric DNA Repeats. Oligonucleotide primers were designed to amplify 4-600 bp of DNA, including each tetrameric motif region of greater than four repeats from strains RM118, RM153, and RM7004 by the PCR. Amplification was for 30 cycles of 1-min denaturation at 94°C, 1-min annealing at 50°C, then 1-min elongation at 72°C using Taq DNA polymerase (Boehringer Mannheim) under conditions recommended by the supplier. PCR-amplified products were sequenced either directly, using Dynabeads (Dynal, UK) and biotinylated primers, or after insertion into plasmid pT7Blue (Novagen). The number of repeated motifs was determined in each case. The distribution of repeat motifs was confirmed by Southern analysis of restriction endonuclease-digested chromosomal DNA using oligonucleotides, comprising five copies of each tetranucleotide motif identified, as hybridization probes (18).

Construction of an lgtC Mutant. The lgtC gene was amplified from strain RM153 by PCR, then cloned in the plasmid pT7Blue. A kanamycin-resistance cassette (HincII ends from pUC4Kan, Pharmacia LKB Biotechnology) was ligated to the cut and end-filled HindIII site within the lgtC gene. Linearized plasmid containing the interrupted lgtC gene was used to transform H. influenzae by the method of Herriott et al. (19). Transformants, selected on kanamycin, were confirmed by Southern analysis of chromosomal DNA (18) and by PCR.

Analysis of LPS by Colony Immunoblotting. Midlogarithmic-phase (OD600 = 0.4) cultures of bacteria were diluted and inoculated on BHI agar plates to obtain single colonies. After overnight growth, colonies were transferred to nitrocellulose membranes, air-dried, and allowed to react with monoclonal antibodies 4C4, 5G8, 6A2, and 12D9, kindly provided by E. J. Hansen (University of Texas), as described by Roche and Moxon (20).

Analysis of Purified LPS by Electrophoresis. A 1-µl loopful of cells from fresh overnight growth of H. influenzae strains RM7004 and RM7004lgtC on BHI plates was lifted and LPS was isolated as described by Roche and Moxon (20). Then 6-µl samples of LPS after proteinase K treatment were analyzed by Tricine/sodium dodecyl sulfate/polyacrylamide gel electrophoresis (T/SDS/PAGE) using 16.5% acrylamide gels at 100 V. Gels were fixed in 25% isopropyl alcohol/10% acetic acid/65% water (vol/vol) and stained with silver (Quick Silver kit, Amersham).

Virulence Experiments. The virulence of H. influenzae strains RM153 (Eagan) and RM153lgtC was determined in infant rats by using a previously described model of H. influenzae infection (21). Five-day-old Sprague Dawley rats from litters of 12 were inoculated intraperitoneally with an inoculum (geometric mean of 108 or 146 organisms for RM153 and RM153lgtC, respectively) in 0.1 ml of phosphate-buffered saline. Bacteremia was assessed from 0.01 ml of blood sampled from all surviving rats after 48 hr by tail vein puncture as described previously (21). Mortality was assessed daily up to 5 days. The statistical significance of the data was determined by the Mann-Whitney and t tests.
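For readers unfamiliar with the Mann-Whitney test named above, the statistic it is built on can be sketched as follows. This is an illustrative computation of the U statistic only (no P value), written from the standard pairwise definition; the function name and the sample values are hypothetical and are not data from this study.

```python
def mann_whitney_u(xs, ys):
    """Return (U_x, U_y): pairwise 'wins' of each group, ties counted as 0.5."""
    u_x = 0.0
    for x in xs:
        for y in ys:
            if x > y:
                u_x += 1.0
            elif x == y:
                u_x += 0.5
    # Every (x, y) pair is a win for exactly one side (or split on a tie).
    return u_x, len(xs) * len(ys) - u_x


# Hypothetical log10 bacterial counts per ml of blood (made-up values).
wild_type = [5.9, 5.7, 6.1]
mutant = [4.5, 4.9, 5.8]
u_wt, u_mut = mann_whitney_u(wild_type, mutant)
```

The smaller of the two U values is the one usually compared against tabulated critical values or converted to a P value by a statistics package.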

RESULTS AND DISCUSSION

The Genome of H. influenzae Strain Rd Contains Multiple Tetranucleotide Repeats

Our search of the H. influenzae strain Rd genome sequence database identified 12 different loci with multiple (>4) tetranucleotide repeats (Table 1). Six of these were associated with previously unidentified candidate virulence genes, findings which provided strong encouragement for our starting hypothesis and for the utility of complete microbial genome sequences. Three of the 12 loci containing multiple repeats of CAAT were associated with previously characterized LPS biosynthetic genes (6, 22). Thirty-two copies of AGTC were located within a locus of low homology to a gene encoding a methyltransferase of Salmonella. Although this gene is not an obvious candidate virulence gene, methylation has been shown to be important for the regulation of virulence determinants in other organisms; for example, the Pap pili of E. coli (23). Two loci, containing tetranucleotides of CAAC and TTTA, were located within open reading frames which were not homologous to any sequences of known function in current databases. However, one of the most important uses of whole genome sequences may prove to be the identification of novel ("orphan") genes.

The abundance of loci containing tetranucleotide repeats in the H. influenzae genome was a striking finding. Given the size and A+T content of the genome, the finding of more than three tandem repeats for any of the 256 possible combinations of tetranucleotides would not have been expected by chance. Furthermore, the evolution of these tetranucleotide repeats in Haemophilus was in striking contrast to findings in E. coli, in which no instance of >3 tetranucleotide repeats was found in approximately 2.8 Mbp of sequence available in GenBank. Of two other completed microbial genomes, tetranucleotide repeats were not present in significant numbers in the 0.6 Mbp of Mycoplasma genitalium (24) or the 1.7 Mbp of Methanococcus jannaschii.

We sought to determine whether the tetranucleotide repeats in strain Rd were a particular feature of this organism, especially since it has been confined to the laboratory since its isolation from a human in the 1940s (25). Therefore, we investigated two epidemiologically distinct type b strains (RM153 and RM7004) which were isolated from meningitis patients from Europe and the United States, and also a derivative of strain Rd (RM118) which had been maintained over several years in our laboratory (Table 1). Strains were investigated by PCR amplification, cloning, and sequencing using locus-specific primers and by hybridizing restriction endonuclease-digested chromosomal DNA with oligonucleotide probes each having five copies of each tetranucleotide identified. A general finding in all three strains was that the tetranucleotide repeats were located at the 5' end of the putative open reading frames, thereby affording the greatest probability of disrupting translation of the polypeptide through frame shifting. There were differences in the numbers of tetranucleotide repeats present in equivalent loci, not just when comparing the genome sequence and the type b strains but also when comparing with strain RM118 (Table 1). Also, the nucleotide sequence of the tetrameric repeats associated with homologues of the methyltransferase gene (Table 1) was different; instead of the repeated AGTC tetranucleotides found in strain Rd, strains RM7004 and RM153 had multiple repeats of AGCC. For this locus, we extended our study and found the repeat AGTC associated in type d and f strains, whereas in other capsular types (a, b, c, and e) and two nontypeable strains, the tetranucleotide AGCC was present. Also, one of the four hemoglobin-binding proteins containing repeats of CAAC was not present in either of the type b strains (Table 1). This indicated the potential for differences in the distribution of repeat-associated loci between strains, as well as in the number and sequence of repeated DNA motifs at any given locus.

Table 1. Characteristics of tetrameric repeats and associated loci identified through searching the H. influenzae strain Rd genome sequence

| Tetrameric repeat | Homologue | Genus | Function | Homology* (BLAST) | No. of repeats†: Rd | RM118 | RM153 | RM7004 | Open reading frame‡ |
|---|---|---|---|---|---|---|---|---|---|
| CAAT | Lic1 | Haemophilus | LPS biosynthesis | | 17 | ND | ND | 30 | ATG...9 bp...ATG...5 bp...(CAAT)17...895 bp...TAA |
| CAAT | Lic2 | Haemophilus | LPS biosynthesis | | 22 | ND | 17 | 16 | ATG...11 bp...(CAAT)22...698 bp...TAG |
| CAAT | Lic3 | Haemophilus | LPS biosynthesis | | 32 | ND | ND | 22 | ATG...36 bp...ATG...76 bp...(CAAT)32...878 bp...TAG |
| GCAA | YadA | Yersinia | Adhesin | 7.0e-20 | 25 | 24 | 15 | 23 | ATG...15 bp...(GCAA)25...751 bp...TAA |
| GACA | LgtC | Neisseria | Glycosyltransferase | 2.9e-37 | 22 | ND | 22 | 32 | AT(GACA)22...899 bp...TAA |
| CAAC | Hemoglobin receptor | Neisseria | Iron binding | 7.4e-84 | 36 | 36 | 22 | 22 | ATG...49 bp...ATG...12 bp...(CAAC)36...3029 bp...TAA |
| CAAC | Hemoglobin receptor | Neisseria | Iron binding | 1.8e-86 | 20 | 21 | 23 | 23 | ATG...24 bp...ATG...5 bp...(CAAC)20...2841 bp...TAA |
| CAAC | Hemoglobin receptor | Neisseria | Iron binding | 5.0e-85 | 18 | 20 | § | § | AT...25 bp...(CAAC)18...2839 bp...TAA |
| CAAC | Hemoglobin receptor | Neisseria | Iron binding | 1.3e-87 | 20 | ND | ND | ND | ATG...12 bp...ATG...5 bp...(CAAC)20...3029 bp...TAG |
| CAAC | No homology | | | | 15 | 14 | 8 | 7 | ATG...16 bp...ATG...3 bp...(CAAC)15...550 bp...TAA |
| AGTC | Methyltransferase | Salmonella | Host restriction/modification | 4.7e-3 | 32 | 26 | 30¶ | 31 | ATG...49 bp...(AGTC)32...1710 bp...TGA |
| TTTA | 32.9-kDa protein | Bacillus | Unknown | 4.7e-5 | 6 | ND | ND | ND | ATG...130 bp...(TTTA)6...686 bp...TGA |

*Designation corresponds to the highest-value match after a BLAST search of the repeat-associated open reading frame against the combined GenBank/EMBL data banks.
†The number of repeats found for each associated locus in the genome sequence of strain Rd is given in the Rd column (boldface in the original); the other values are those obtained after sequencing of cloned PCR products or direct sequencing of PCR products from the culture collection strains RM118, RM153, and RM7004. ND, not determined.
‡Open reading frames are shown with the spacing between the repeat region and the nearest ATG translational start codons and the predicted translational stop codons.
§None present. Hybridization with a (CAAC)5 oligonucleotide probe showed only three associated loci in RM153 and RM7004, consistent with the lack of a PCR product for this locus.
¶Repeats were AGCC, not AGTC.

With regard to other forms of repetitive DNA in the H. influenzae genome sequence, the search for homopolymeric tracts showed the highest number of consecutive bases to be 10 for A or T and 9 for C or G nucleotides. The size and frequency of occurrence of homopolymeric tracts, from 1 to 10 nucleotides, showed little deviation from a theoretical distribution based on a random 1.8-Mbp DNA sequence with a bias of 61% A+T bases (data not shown). The frequencies of repeated dinucleotides have been determined in a similar manner, with 6 copies found to be the maximum for any possible combinations of AT/TA, AG/TC, AC/TG, and CG/GC. For both homopolymeric tracts and dinucleotides, the locations of the highest numbers of repeats were not uniquely within, or associated with, open reading frames. Trinucleotide repeats have not been reported to be associated with phase-variable loci in bacteria and, since loss or gain would not result in translational frame shifting, slippage would have a plausible function only through modulation of transcription. One locus, the heme-utilization gene hxuC, has been identified which has 9 copies of the motif AAT located within the control region across the predicted -10 to -35 sequence (26). Other triplet nucleotide repeats were found at frequencies expected for the genome size of strain Rd.
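The comparison with a random sequence can be made concrete with a back-of-the-envelope model. The sketch below is an assumption-laden illustration, not the calculation performed in the study: it treats bases as independent draws with 61% A+T (split equally between A and T, and between C and G), uses the 1.8-Mbp length quoted above, and counts the expected number of start positions at which a given unit is followed by itself for a given number of tandem copies. The function names are hypothetical.

```python
from itertools import product


def base_frequencies(at_fraction=0.61):
    """Per-base probabilities for an independent-base model with the given A+T bias."""
    return {
        "A": at_fraction / 2, "T": at_fraction / 2,
        "C": (1 - at_fraction) / 2, "G": (1 - at_fraction) / 2,
    }


def expected_runs(unit, copies, genome_len=1_800_000, at_fraction=0.61):
    """Expected number of positions where `copies` tandem copies of `unit` begin."""
    freq = base_frequencies(at_fraction)
    p_unit = 1.0
    for base in unit:
        p_unit *= freq[base]
    start_positions = genome_len - len(unit) * copies + 1
    return start_positions * p_unit ** copies


# Sum over all 256 tetranucleotides for four tandem copies ("more than three").
# The sum includes homopolymeric units and overlapping windows, so it is a
# crude upper-level estimate rather than a count of distinct loci.
total_tetra = sum(expected_runs("".join(u), 4) for u in product("ACGT", repeat=4))
```

Under these assumptions the total stays below one expected occurrence genome-wide, and each additional copy multiplies the expectation for a motif by its unit probability again, which is why runs of 6 to 36 copies stand out so sharply against the random background.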

DNA Repeats Are Located Within Homologues of Putative Virulence Genes

Three of the 12 tandem repeats of tetranucleotides have beenpreviously characterized and were already known to be asso-ciated with phase-variable oligosaccharide structures of LPS(22). On the basis of data base searches, the most strikingfindings were the four genes with homology to hemoglobin andother iron-binding proteins of Neisseria, each occupying adistinct position in the Rd genome. Iron is an essential nutrientof bacteria; the availability or the efficiency with which bac-

teria scavenge iron is known to be important for virulence (27).This apparent redundancy and potential for phase variation,although not yet formally investigated, is intriguing. Oneexplanation is that the genes mediate binding to differentsources of protein-bound iron-e.g., lactoferrin, transferrin, orhemoglobin. Indeed, N. meningitidis and N. gonorrhoeae havemultiple iron-sequestering proteins (28). Another possibility isthat, given the importance of iron acquisition and the externalpresentation of binding proteins, multiple alleles are present,each ofwhich can be switched on or off to afford a mechanismfor evading the development of specific host immune re-sponses. If antibodies to one or other protein are made by thehost, another antigenically distinct protein, when switched on,could be advantageous.A region comprising 25 repeats of GCAA was found at the

5' end of a gene encoding a homologue for YadA, an adhesinand virulence factor of Yersiniapseudotuberculosis and Yersiniapestis (29, 30). However, in strain Rd, there are translationalstops in all three reading frames, indicating that this proteinwould not be synthesized. This may relate to the long periodof time over which strain Rd has been in the laboratory, notrequiring the function of an adhesin. Further studies in recentclinical isolates of H. influenzae are needed to investigate thisgene appropriately.Formal investigations of the IgtC homologue in H. influenzae

were carried out, since this gene seemed to offer a tractableopportunity to demonstrate the principle of how informationfrom a genome sequence can yield experimental data ofbiological relevance. The proposed function of the lgtC gene inLPS biosynthesis was investigated by mutagenesis. Insertion ofa kanamycin-resistance cassette into lgtC resulted in strains

Microbiology: Hood et al.

Page 4: repeatsidentify novel virulence genesin Haemophilus influenzae

Proc. Natl. Acad. Sci. USA 93 (1996)

A (a) (b)

B

(1)

VII]

4C4mAb 5G8

6A212D9

++pv -

++pv

++pv ++pv

++pv ++pv

(11)

FIG. 1. Electrophoretic profile and monoclonal antibody reactivity of LPS from H. influenzae strains RM7004 and RM70041gtC. (A)Electrophoretic pattern of LPS after T/SDS/PAGE) of LPS purified from strain RM7004 (lane a) and strain RM70041gtC (lane b). Listedunderneath are the reactivities with the LPS-specific monoclonal antibodies (mAb) 4C4, 5G8, 6A2, and 12D9; + + indicates reactivity, - indicatesminimal or no reactivity, and pv indicates phase variation of colony reactivity. (B) Immunoblots of colonies from strains RM7004 and RM70041gtCthat had reacted with the monoclonal antibody 4C4. I shows wild-type strain RM7004 with colonies phase-varying to a nonreactive phenotype ata frequency of 0.5%. II shows the 4C4 nonreactivity of strain RM70041gtC. III shows a magnified view of RM7004 colonies from I; colonies are4C4 reactive with occasional nonreactive colonies or colony sectors as a result of phase variation

with an altered LPS electrophoretic profile and altered reac-tivity with a panel of LPS-specific monoclonal antibodies (Fig.1A). The LPS isolated from strain RM70041gtC had a simplerelectrophoretic pattern (reduced number of bands stainingwith silver) than that from RM7004, indicating that extensionof the LPS to form the higher molecular weight species isblocked in the mutant. On colony immunoblotting, the lgtCmutant had significantly reduced reactivity to the monoclonalantibodies 4C4 and 5G8, both of which recognize the terminalLPS epitope, Gala(1-4)p3Gal (31, 32). This epitope showsphase variation (Fig. 1B) at a frequency of 0.1-1% in wild-typeRM7004 (6). However, both strains RM7004 and RM70041gtCretained phase-variable reactivity with monoclonal antibodies6A2 and 12D9, which recognize alternative LPS core epitopes.By directly sequencing the GACA repeat region of 4C4-reactive and 4C4-nonreactive colonies shown in Fig. 1B, wehave shown a correlation between the number of repeats andcolony reactivity to monoclonal antibody 4C4. Colonies whichwere nonreactive with 4C4 had 32, 33, or 35 repeats of GACA,a number of repeats which would place the lgtC reading frameout of frame with its putative initiation codon. In contrast,colonies which bound monoclonal antibody 4C4 had 34 or 35GACA repeats placing lgtC either in or out of frame with itsinitiation codon. This finding that 4C4 colonies can have eithera functional or nonfunctional lgtC-encoded protein indicatesthe considerable complexity of LPS phase variation. LPSepitopes can be influenced by several gene products, and ourprevious work has identified at least one other gene (lic2) thatcan mediate phase variation of the Gala(1-4)f3Gal epitope(33). Interestingly, in Neisseria, IgtC has a homopolymeric tract

of guanosine nucleotides located within the reading framewhich may be responsible for the phase-variable expression ofa digalactoside core LPS structure (13).The pathogenicity of the RM1531gtC mutant was compared

with that of its isogenic wild-type, which has been usedextensively as a prototype strain in virulence studies in aninfant rat model. The IgtC mutant was found to be attenuatedas evidenced by reduced mortality and the degree of bactere-mia in surviving rats cultured 2 days following intraperitonealinoculation (Table 2). The results were combined from thedata collected by using two independent lgtC mutants overseveral virulence experiments.The message of this paper is to point out the power of using

whole genome sequence data in the study of pathogenicbacteria. The idea that short tandem repeats might identifycandidate virulence genes was based on the knowledge thatphase variation is a typical characteristic of these surfacemolecules. This study has identified six or more candidates,one of which, IgtC, has been shown to be involved in phasevariation of LPS and to influence virulence. We are proceed-ing with detailed investigation of the functional role of theother genes with tandem tetranucleotide repeats in Haemophi-lus and are also investigating repeats as a general phenomenonin other host-adapted mucosal pathogens.Whole genome sequences provide a major advance and, as

demonstrated here for H. influenzae, should facilitate novelapproaches to the study of a wide variety of pathogenicbacteria and other microbes, including parasites. Taking a

broader perspective, whole genome sequences provide a strat-egy to gain information on many fundamental questions of

Table 2. Effect of infection of infant rats with H. influenzae strains RM153 and RM153lgtC, wild type and LPS mutant, respectively, on bacteremia and mortality

Strain      No. of rats   No. bacteremic   Geometric mean log10(no. of bacteria per ml of blood)   No. of deaths
RM153       54            49/51            5.72                                                    25 (46.3%)
RM153lgtC   54            44/53            4.76 (P = 0.03)                                         9 (16.7%) (P = 0.001)

Five-day-old Sprague-Dawley rats were inoculated intraperitoneally with an inoculum in 0.1 ml of phosphate-buffered saline. Bacteremia was assessed from all surviving rats after 48 h by tail vein puncture. Mortality was assessed daily up to 5 days. Significance was measured by a two-tailed t test, which was performed on the results after the two sets of data had been shown to be comparable by the Levene test for equality of variance.


biology, including the elucidation of complex biosynthetic pathways, the minimal set of genes required to sustain the free-living state, and the evolutionary relationships among bacteria, archaea, and eukarya.
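The reading-frame argument made above for the GACA tract in lgtC can be illustrated with a short sketch. Because each repeat unit is 4 bp long, gaining or losing one unit shifts the downstream reading frame by one nucleotide, and the frame is restored only when the repeat count changes by a multiple of three. The function names and the choice of 34 repeats as the in-frame reference are assumptions made for illustration; the text states only that 32, 33, or 35 repeats place lgtC out of frame and that 4C4-reactive colonies carried 34 or 35 repeats.

```python
REPEAT_UNIT = "GACA"      # tetranucleotide repeat found within lgtC
IN_FRAME_REFERENCE = 34   # ASSUMPTION: 34 repeats taken as the in-frame count


def frame_offset(n_repeats, reference=IN_FRAME_REFERENCE, unit=REPEAT_UNIT):
    """Return the downstream frame shift (0, 1, or 2 nt) relative to the
    reference repeat count. Hypothetical helper, not from the paper."""
    return ((n_repeats - reference) * len(unit)) % 3


def is_in_frame(n_repeats, reference=IN_FRAME_REFERENCE):
    """True when the repeat count leaves the reading frame unchanged
    relative to the (assumed) in-frame reference count."""
    return frame_offset(n_repeats, reference) == 0


if __name__ == "__main__":
    # Repeat counts reported for the sequenced colonies.
    for n in (32, 33, 34, 35):
        status = "in frame" if is_in_frame(n) else "out of frame"
        print(f"{n} x {REPEAT_UNIT}: {status}")
```

Under this assumption, counts of 32, 33, and 35 fall out of frame, in line with the 4C4-nonreactive colonies. The sketch does not account for the 4C4-reactive colonies with 35 repeats, which is consistent with the observation that other gene products, such as that of lic2, can also influence this epitope.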

We thank E. J. Hansen for kindly providing monoclonal antibodies and M. Herbert for statistical analysis of the experimental data. We acknowledge the technical assistance of T. Allen. This work was supported by program grants from the U.K. Medical Research Council.

1. Moxon, E. R. (1995) in Principles and Practice of Infectious Diseases, eds. Mandell, G. L., Bennett, J. E. & Dolin, R. (Churchill Livingstone, New York), pp. 2039-2045.
2. Robertson, B. C. & Meyer, T. F. (1992) Trends Genet. 8, 422-427.
3. Moxon, E. R., Rainey, P. B., Nowak, M. A. & Lenski, R. E. (1994) Curr. Biol. 4, 24-33.
4. van Alphen, L., Geelen van den Broek, L., Blaas, L., van Ham, M. & Dankert, J. (1991) Infect. Immun. 59, 4473-4477.
5. van Ham, M., van Alphen, L., Mooi, F. R. & van Putten, J. P. M. (1993) Cell 73, 1187-1196.
6. Weiser, J. N., Love, J. M. & Moxon, E. R. (1989) Cell 59, 657-665.
7. Jarosik, G. P. & Hansen, E. J. (1994) Infect. Immun. 62, 4861-4867.
8. Levinson, G. & Gutman, G. A. (1987) Mol. Biol. Evol. 4, 203-221.
9. Stern, A., Brown, M., Nickel, P. & Meyer, T. F. (1986) Cell 47, 61-67.
10. Sarkari, J., Pandit, N., Moxon, E. R. & Achtman, M. (1994) Mol. Microbiol. 13, 207-217.
11. Jonsson, A.-B., Nyberg, G. & Normark, S. (1991) EMBO J. 10, 477-488.
12. Gotschlich, E. C. (1994) J. Exp. Med. 180, 2181-2190.
13. Jennings, M. P., Hood, D. W., Peak, I. R. A., Virji, M. & Moxon, E. R. (1995) Mol. Microbiol. 18, 729-740.
14. Fleischmann, R. D., Adams, M. D., White, O., Clayton, R. A., Kirkness, E. F., et al. (1995) Science 269, 496-512.
15. Anderson, P., Johnston, R. B. & Smith, D. H. (1972) J. Clin. Invest. 51, 31-38.
16. van Alphen, L., Riemens, T., Poolman, J., Hopman, C. & Zanen, H. C. (1983) J. Infect. Dis. 148, 75-81.
17. Devereux, J., Haeberli, P. & Smithies, O. (1984) Nucleic Acids Res. 12, 387-395.
18. Sambrook, J., Fritsch, E. F. & Maniatis, T. (1989) in Molecular Cloning: A Laboratory Manual (Cold Spring Harbor Lab. Press, Plainview, NY), 2nd Ed.
19. Herriott, R. M., Meyer, E. Y., Vogt, M. & Modan, M. (1970) J. Bacteriol. 101, 513-516.
20. Roche, R. J. & Moxon, E. R. (1994) FEMS Microbiol. Lett. 120, 279-284.
21. Smith, A. L., Smith, D. H., Averill, D. R. & Moxon, E. R. (1973) Infect. Immun. 8, 278-290.
22. Moxon, E. R. & Maskell, D. (1992) in Molecular Biology of Bacterial Infection: Current Status and Future Perspectives, Symposium of the Society for General Microbiology, eds. Hormaeche, C., Penn, C. W. & Smyth, C. J. (Cambridge Univ. Press, Cambridge, U.K.), Vol. 49, pp. 75-96.
23. Nou, X., Braaten, B. & Kaltenbach, L. (1995) EMBO J. 14, 5785-5797.
24. Fraser, C. M., Gocayne, J. D., White, O., Adams, M. D., Clayton, R. A., et al. (1995) Science 270, 397-403.
25. Alexander, H. E. & Leidy, G. (1951) J. Exp. Med. 93, 345-359.
26. Cope, L. D., Yogeve, R., Muller-Eberhard, U. & Hansen, E. J. (1995) J. Bacteriol. 177, 2644-2653.
27. Weinberg, E. D. (1984) Physiol. Rev. 64, 65-102.
28. Genco, C. A. & Desai, P. J. (1996) Trends Microbiol. 4, 179-184.
29. Rosqvist, R., Skurnik, M. & Wolf-Watz, H. (1988) Nature (London) 334, 522-525.
30. Tamm, A., Tarkkanen, A.-M., Korhonen, T. K., Kuusela, P., Toivanen, P. & Skurnik, M. (1993) Mol. Microbiol. 10, 995-1011.
31. Virji, M., Weiser, J. N., Lindberg, A. A. & Moxon, E. R. (1990) Microb. Pathog. 9, 441-450.
32. Gulig, P. A., Patrick, C. C., Hermanstorfer, L., McCracken, G. H., Jr., & Hansen, E. J. (1987) Infect. Immun. 55, 513-520.
33. High, N. J., Deadman, M. E. & Moxon, E. R. (1993) Mol. Microbiol. 9, 1275-1282.
