Franklin Lin - Uncovering Genes Associated With Human Infections (Poster)


Upload: amia

Post on 14-Oct-2014


Genome, Bibliome, Diseasome

Example SNOMED-CT infectious disease concepts (ID, names):

128070006  "Abdominal infection"
62479008   "Acquired immune deficiency syndrome", ...
194922003  "Acute bacterial endocarditis", ...
193857008  "Acute conjunctivitis", Conjunctivitis, ...
54839009   "Acute poliomyelitis", Polio, ...
155264006  "Acute rheumatic fever", "Rheumatic fever"
27321001   "Acute sore throat", "Acute pharyngitis"
13906002   Anaplasmosis, "Tick fever", Gallsickness
33937009   "Lyme arthritis"
186103005  "Bacillary dysentery", "Shigella dysenteriae"
128350005  "Bacterial conjunctivitis"
274080003  "Bacterial gastroenteritis"
95883001   "Bacterial meningitis"
197171003  "Bacterial peritonitis"
10001005   "Bacterial septicaemia", "Bacterial sepsis", ...
198221007  "Bacterial vaginosis"
53084003   "Bacterial pneumonia"
55604004   "Bird flu", "Avian influenza", "Avian flu"
56304002   "Bovine virus diarrhoea", ...
86500004   Campylobacteriosis
...

Uncovering genes associated with human infections in pathogenic bacteria: a novel integrative analysis linking genome, bibliome, and diseasome

Frank Lin
Centre for Health Informatics, The University of New South Wales, Sydney, Australia

References:
1. Sintchenko V, Anthony S, Phan X-H, Lin F, Coiera EW. PLoS ONE 2010; 5(3): e9535.
2. Lin F, Coiera E, Lan R, Sintchenko V. BMC Bioinformatics 2009; 10:86.
3. Lin F, Lan R, Sintchenko V, Kong F, Gilbert GL, Coiera E. PLoS ONE (in press).
4. Gross L. PLoS Biol 4(9): e314. doi:10.1371/journal.pbio.0040314
5. Connell I, Agace W, Klemm P, Schembri M, Mărild S, Svanborg C. PNAS 1996; 93(18): 9827-32.

E-mail: [email protected]
Website: http://www.chi.unsw.edu.au/
Phone: +61 (2) 9385 3437

Contact: Dr Frank Lin
Centre for Health Informatics
University of New South Wales
SYDNEY NSW 2052, Australia

Motivations

The identification of virulence genes in bacteria is an important topic in the management and study of infectious diseases (ID). We have developed a computational pipeline that integrates genomic (whole genome sequences), bibliomic (MEDLINE) and diseasomic (SNOMED-CT) databases to assist with this discovery task. Previously, we reported that the co-occurrence of disease and pathogen names in MEDLINE can be used to infer causal associations [1]. We have also observed that protein phylogenetic profiles (PP) may be used to predict pathway memberships [2] and to suggest potential virulence mechanisms [3]. We expect the tool described in this work to be useful in suggesting markers for ID biosurveillance and in assisting with the development of effective vaccines.


Summary

The preliminary results from this study are encouraging, but further analyses are required to establish the generalisability of the approach. We anticipate that this integrative approach is applicable to other pathogens and can be extended to the translation of multi-omic knowledge in other biomedical disciplines.


[Figure labels, pathogen species: Staphylococcus aureus, Escherichia coli, Streptococcus agalactiae, Klebsiella pneumoniae, Streptococcus pneumoniae, Staphylococcus epidermidis, Listeria monocytogenes, Pseudomonas aeruginosa, Haemophilus influenzae, Enterobacter cloacae, Mycoplasma hominis, ..., Chlamydia trachomatis, Campylobacter jejuni, Ureaplasma urealyticum, Neisseria meningitidis, Bacteroides fragilis, Enterococcus faecalis, Acinetobacter baumannii, Campylobacter fetus, Salmonella typhi, Streptococcus pyogenes, Neisseria gonorrhoeae, Clostridium difficile, ...]

[Figure labels, genomes: Staphylococcus aureus JH9, Staphylococcus aureus JH1, Staphylococcus aureus Newman, ..., Escherichia coli S88, Escherichia coli O103 H2 12009, Escherichia coli O157 H7 TW14359, ..., Streptococcus agalactiae A909, Streptococcus agalactiae NEM316, Streptococcus agalactiae 2603, Klebsiella pneumoniae 342, Klebsiella pneumoniae NTUH K2044, Klebsiella pneumoniae MGH 78578, Streptococcus pneumoniae R6, Streptococcus pneumoniae JJA, Streptococcus pneumoniae Hungary19A 6, Staphylococcus epidermidis RP62A, Staphylococcus epidermidis ATCC 12228, Listeria monocytogenes 4b F2365, Pseudomonas aeruginosa LESB58, Pseudomonas aeruginosa PA7, Haemophilus influenzae PittEE, Haemophilus influenzae PittGG, Enterobacter cloacae ATCC 13047 uid48363, Mycoplasma hominis, Chlamydia trachomatis A HAR-13, Chlamydia trachomatis B TZ1A828 OT, ...]

[Figure labels: Pathogens (E. coli, Klebsiella spp., Proteus spp., Pseudomonas spp.); Clinical manifestations or syndromes; Genomes]

Figure 3. A case study on the discovery of genes associated with urinary tract infection (UTI) in the Escherichia coli UTI89 genome. 327 SNOMED-CT ID concepts were used to mine gene-ID associations against all genes in 525 bacterial genomes. The discovery process is demonstrated for the SNOMED concept “urinary tract infection” (SNOMED ID 68566005). In the UTI89 genome, the fim gene cluster (boxed area in A and B) was found to be highly associated with frequent mentions of “UTI” in the literature (fimG gene, p = 4.1×10⁻²⁰). Other known virulence gene clusters were also recovered, including the sfa (sfaH, p = 2.5×10⁻¹⁸) and pap clusters (papC, p = 9.8×10⁻⁷). The gene products of the fim cluster are involved in the biosynthesis of type 1 fimbriae (panel C), an important determinant of uropathogenesis [4,5]. The fimF gene (p = 1.6×10⁻¹⁶), encoding the type 1 fimbriae minor subunit precursor (GenBank Accession YP_543949), was also found to be associated with pathogens mentioned alongside other infectious disease concepts, including endotoxic shock, gastroenteritis, and bacterial meningitis (panel D). This finding illustrates the potential role of fimbriae as a virulence factor of Gram-negative pathogens in human infections. The transmission electron micrograph in panel C was adapted from Ref. [4] under the Creative Commons license.

[Figure labels: NCBI whole genome sequences; MEDLINE; SNOMED CT. Panels: A, B, C]

Panel D.

Score    SNOMED ID   Description
50.857   371770009   Endotoxaemia, Endotoxemia
47.036   71057007    E. coli infection, Infection due to E. coli
39.170   371769008   Endotoxic shock
32.367   186091002   Enteric fever, Typhoid fever
25.127   186094005   Salmonella food poisoning/gastroenteritis
20.272   186103005   Bacillary dysentery, Shigella dysenteriae
18.908   197925009   Asymptomatic bacteriuria
18.862   154410004   Cestode/Trematode infestation, Helminthiases
17.171   154374002   Malaria
16.665   61462000    Paludism, Malaria, Plasmodiosis
16.382   74286002    TEM, Transmissible mink encephalopathy
16.361   11840006    Traveler's diarrhea, Turista
15.769   68566005    Urinary tract infection
12.481   276674008   Neonatal meningitis
12.454   111852003   Vaccinia
12.183   155862004   Acute pyelonephritis or pyonephrosis
12.111   76571007    Septicemic shock, Septic shock
12.045   36188001    Flexner's dysentery, Shigellosis
10.915   10087007    Schistosomiasis, Bilharzia, ...
9.376    11836002    Primary/spontaneous bacterial peritonitis
8.977    24043009    Malignant catarrhal fever, Snotsiekte
8.638    111909004   Amoebic infection, Amebiasis

Figure 1. The “shared” virulence gene hypothesis. This illustration exemplifies the relationships between clinical syndromes and genes shared by multiple pathogens. The genes encoding a specific bacterial virulence mechanism may be evolutionarily conserved and hence present in different species or genera (the “shared” virulence genes) [3]. Such genes are candidates for discovery by this method, which is based on a comparative, multi-genome analysis of different bacterial species.

Acknowledgments

The author thanks Drs. Vitali Sintchenko, Stephen Anthony, Xuan-Hieu Phan, and Fanrong Kong, and Profs. Ruiting Lan, Gwendolyn Gilbert and Enrico Coiera for their collaboration on the earlier works [1-3]. The author is supported by a postdoctoral fellowship of the Australian National Health and Medical Research Council (NH&MRC) program grant #568,612. The body silhouette image in Figure 1 is a public domain image retrieved from Wikimedia Commons (http://commons.wikimedia.org/wiki/File:Female_shadow_-_upper.png).

Extracting clinical concepts related to infectious disease from the ontology database. The list of infectious disease names was extracted from the Systematized Nomenclature of Medicine - Clinical Terms (SNOMED-CT) database by traversing the hierarchy of concepts beneath the concept “Infectious disease (disorder)” (SNOMED ID 40733004). A total of 327 infectious disease-related concepts were extracted.
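The traversal described above can be sketched as a breadth-first walk over parent-to-child "is-a" links. This is only a minimal illustration: the function name `descendants` and the toy edges in `TOY_CHILDREN` are assumptions made for the example (the edges are not copied from a SNOMED-CT release), and the poster does not state how the hierarchy was actually traversed.

```python
from collections import deque


def descendants(children, root):
    """Collect every concept beneath `root` in an acyclic is-a hierarchy."""
    seen = set()
    queue = deque([root])
    while queue:
        node = queue.popleft()
        for child in children.get(node, ()):
            if child not in seen:  # a concept may have more than one parent
                seen.add(child)
                queue.append(child)
    return seen


# Toy parent -> children map, for illustration only (not real SNOMED-CT edges).
TOY_CHILDREN = {
    40733004: [68566005, 95883001],  # root: "Infectious disease (disorder)"
    95883001: [276674008],
}

id_concepts = descendants(TOY_CHILDREN, 40733004)
```

On the real ontology the same walk would be run over the full relationship table, and the resulting set would correspond to the list of infectious disease concepts.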

Determining the pathogen-disease matrix. The co-occurrences of disease names with pathogen names in MEDLINE abstracts were used to infer potential causal relationships. For example, the concept “urinary tract infection” co-occurs frequently with the pathogen name “Escherichia coli”, which is the principal causal pathogen of UTI in humans and other mammals. Likewise, the concept “pneumonia” is expected to occur at higher frequencies in articles that mention its primary causal pathogens (e.g., Streptococcus pneumoniae).
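As a rough sketch of the counting step, the snippet below tallies how many abstracts mention both a disease name and a pathogen name. The function name `cooccurrence`, the whole-word matching rule and the three toy abstracts are assumptions for illustration; the poster does not describe the text-matching procedure, which is reported in Ref. [1].

```python
import re
from collections import Counter
from itertools import product


def mentions(term, text):
    """True if `term` occurs in `text` as a whole word or phrase (case-insensitive)."""
    return re.search(r"\b" + re.escape(term) + r"\b", text, re.IGNORECASE) is not None


def cooccurrence(abstracts, diseases, pathogens):
    """Count abstracts that mention each (disease, pathogen) pair together."""
    counts = Counter()
    for text in abstracts:
        found_d = [d for d in diseases if mentions(d, text)]
        found_p = [p for p in pathogens if mentions(p, text)]
        counts.update(product(found_d, found_p))
    return counts


# Made-up abstracts, for illustration only.
toy_abstracts = [
    "Escherichia coli is a frequent cause of urinary tract infection.",
    "Urinary tract infection due to Escherichia coli in children.",
    "Streptococcus pneumoniae and community-acquired pneumonia.",
]
matrix = cooccurrence(
    toy_abstracts,
    diseases=["urinary tract infection", "pneumonia"],
    pathogens=["Escherichia coli", "Streptococcus pneumoniae"],
)
```

Whole-word matching matters here: a plain substring test would count every mention of "Streptococcus pneumoniae" as a mention of "pneumonia".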

List of pathogen names. The names of bacterial species known to be associated with infectious diseases in humans were short-listed from the NCBI reference genome database. Two hundred and fifty-four bacterial species were used in the subsequent analysis.

Determining the phylogenetic profiles of each gene in the 525 genomes. The presence or absence of homologous genes in a set of reference genomes (phylogenetic profiles) was determined by performing all-against-all BLASTP with an E-value threshold of 10⁻¹⁰. The procedure is described in detail in Refs. [2] and [3].
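A phylogenetic profile can be pictured as one presence/absence vector per gene, with one position per reference genome. The sketch below builds such vectors from already-parsed BLASTP hits. The `(gene, genome, evalue)` tuple layout, the placeholder names and the treatment of a hit exactly at the threshold as "present" are assumptions for illustration; the actual procedure is the one described in Refs. [2] and [3].

```python
E_VALUE_CUTOFF = 1e-10  # threshold quoted in the text


def phylogenetic_profiles(hits, genomes):
    """Map each query gene to a 0/1 list, one entry per reference genome.

    `hits` is an iterable of (gene, genome, evalue) tuples; this layout is a
    hypothetical, pre-parsed form of BLASTP output.
    """
    index = {genome: i for i, genome in enumerate(genomes)}
    profiles = {}
    for gene, genome, evalue in hits:
        row = profiles.setdefault(gene, [0] * len(genomes))
        if genome in index and evalue <= E_VALUE_CUTOFF:
            row[index[genome]] = 1  # a homologue is present in this genome
    return profiles


# Placeholder genes and genomes, for illustration only.
toy_genomes = ["genome_A", "genome_B", "genome_C"]
toy_hits = [
    ("gene_1", "genome_A", 0.0),
    ("gene_1", "genome_B", 1e-50),
    ("gene_1", "genome_C", 1e-3),   # too weak: treated as absent
    ("gene_2", "genome_C", 1e-30),
]
profiles = phylogenetic_profiles(toy_hits, toy_genomes)
```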

Bacterial genomes used in this analysis. A total of 1,131 genomes with available whole genome sequences (WGS) were retrieved from the NCBI FTP site. Of these, the 525 WGS that belong to species on the list of pathogens (see above) were used in the subsequent analyses.

[Figure labels: Reference genomes; Genes]

A Manhattan plot showing the association of all genes in the S. pneumoniae D39 genome with literature that frequently mentions the ID concept “pneumonia”. For each gene, the relative association with an infectious disease concept (e.g., “pneumonia”) can be calculated by comparing the observed distributions of PP with the corresponding null distributions. In this example, the hyaluronidase gene (hysA, red oval) was among the highest ranked genes associated with frequent mentions of “pneumonia” in the literature.
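One way to picture a comparison between an observed phylogenetic profile and a null distribution is a permutation test. The sketch below is not the statistic used in this work, which the poster does not specify. It assumes each genome carries a hypothetical disease weight (for example, derived from literature co-occurrence counts), scores a gene by the summed weight of the genomes that contain it, and builds the null by shuffling the gene's presence/absence pattern across genomes. All names and numbers are made up for illustration.

```python
import random


def association_p_value(profile, weights, n_perm=2000, seed=0):
    """Empirical p-value for a gene's profile against a shuffled-profile null.

    profile: 0/1 presence of the gene in each genome.
    weights: hypothetical per-genome disease weights (same order as profile).
    """
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    observed = sum(p * w for p, w in zip(profile, weights))
    shuffled = list(profile)
    at_least = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if sum(p * w for p, w in zip(shuffled, weights)) >= observed:
            at_least += 1
    # Add-one correction so the estimate is never exactly zero.
    return (at_least + 1) / (n_perm + 1)


# Made-up weights: only the first two genomes are linked to the disease.
toy_weights = [5, 4, 0, 0, 0, 0, 0, 0]
p_specific = association_p_value([1, 1, 0, 0, 0, 0, 0, 0], toy_weights)
p_everywhere = association_p_value([1] * 8, toy_weights)
```

A gene found only in the disease-linked genomes receives a small p-value, whereas a gene present in every genome carries no information and scores 1.0. Plotting the negative logarithm of such p-values along the genome would give a Manhattan-style plot.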

Figure 2. The computational workflow for discovering genes associated with human infections in pathogenic bacteria.