mBRCs and MICROBIAL DATABASES INTERCONNECTION DATA

Alexander Vasilenko, Svetlana Ozerskaya, Oleg Stupar VKM, IBPM RAS

The general goals to reach

Microbial culture collections and projects such as MIRRI and GBRCN try to meet the needs of biomedicine, agriculture, biotechnology, and science. Data integration is one of the key tools for this, and the field of integration is Life Science data.

There are three big data sources in Life Science: (1) databases, (2) publications, (3) datasets. As far as we know, the main structured data holdings in this list are the databases, so this report looks mostly at the integration opportunities for them: integration of mBRC microbial data with Life Science databases, first of all with the databases used in biomedicine, pharma, agriculture, and bioremediation.

Potential help for the Life Science world (mBRC contributions):

1. Assurance of repeatability of experimental data

2. Resolving nomenclatural issues related to microorganisms

3. Strain-specific characters


General CC - life science connection data

(1) In culture collections*

708 culture collections in WDCM/CCinfo

150 online catalogues (67 in EU)

(2) Life science

2076 databases collected online, 807 with microbial data

807 = 117 (Mo) + 483 (Sp) + 207 (St)

‐ Mo - microorganisms found (bacteria, fungi, yeasts, archaea, protists, microalgae, viruses), but no species names,

‐ Sp - species names are present, but no strains,

‐ St - strains found.

The value 207 (St) is about 26% of 807. In fact this means minimal interconnection between the two sides: (1) for each database we recorded only the highest level found among Mo, Sp, St, (2) more than 50% of the strains were not in the culture collections, (3) we never found the address of the strains, (4) in Life Science databases each strain is a separate record, unlike the StrainInfo histories (next picture).

* Plus WDCM, Straininfo, CABRI and regional mBRC networks
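The breakdown of the 807 microbial databases can be checked with a few lines of Python. This is only a sketch: the counts come from the slide, while the helper `highest_level` and the ordering Mo < Sp < St are assumptions made here to illustrate the "highest level found" rule.

```python
# Counts are taken from the slide; the ordering Mo < Sp < St is an assumption.
LEVEL_ORDER = ["Mo", "Sp", "St"]


def highest_level(levels_found):
    """Return the most specific level present in a database, or None."""
    present = [lvl for lvl in LEVEL_ORDER if lvl in levels_found]
    return present[-1] if present else None


counts = {"Mo": 117, "Sp": 483, "St": 207}
total = sum(counts.values())

# Share of each level among all databases with microbial data, in percent.
shares = {lvl: round(100 * n / total) for lvl, n in counts.items()}

if __name__ == "__main__":
    print(total, shares)
    print(highest_level({"Mo", "Sp"}))
```

A database that lists species names and also mentions microorganisms in general is counted once, under Sp, which is why the three groups add up to the total.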


StrainInfo strain exchange tree


General integration schema

[Diagram: culture collections (CC1, CC2, ..., CCn), MICRO-IS, infrastructures (Infrastructure_1, Infrastructure_2, ..., Infrastructure_m), and Life Science databases (DB_1, DB_2, ..., DB_k).]

Potentially this data integration could mean the following tasks:

1. To make mBRC data visible and accessible from partner Life Science databases,

2. To make partner database records visible and accessible from the mBRC aggregated catalogue,

in the formats:

a. For human access,

b. For access by computer programs.

Life Science databases inspected

The total number of life science database names or references found in this study is more than 12 800. More than 5500 database references were inspected manually (plus a group of 7625 databases in the BioCyc system, each of which presents the metabolic pathways and operons of one bacterial strain). The total number of life science databases collected that are visible online is 2076; the number of collected databases with microbial data is 807 (plus the 7625 bacterial databases in BioCyc).

Main sources inspected:

• MB (1802 entries, http://metadatabase.org/wiki/Help:Browsing, 26.12.2015),

• Biosharing (724 databases, https://www.biosharing.org/, 26.12.2015),

• BioMedBridges (814 databases, http://wwwdev.ebi.ac.uk/fgpt/toolsui/, 27.12.2015),

• Pathguide (363 database names, http://www.pathguide.org/, 2013),

• ELIXIR list (579 entries, https://bio.tools/?q=database, 28.1.2016),

• ExPASy (85 + 665 databases, http://www.expasy.org/old_links, 12.2.2016)


Databases parameters collected (an example)

• Unique identifier: BIODBCORE-000515

• Database acronym: Pfam

• Database name: Sanger Pfam Mirror

• Database URL: http://pfam.sanger.ac.uk/

• Access level: Open

• Practical domain: health, winemaking, baking, brewing

• Microbial level: st

• Year of the last correction: 2015

• Developer/Owner: UK, EMBL-EBI

• Comment

• Orientation: bacteriophages, viruses, bacteria

• Properties: protein

• Search by: OMIM ID, PubMed ID, ...

• Ontologies list: GO

• Partner databases: CATH, CDD, Europe PMC, HGNC, InterPro, iPfam, MEROPS, NCBI Gene, NCBI Taxonomy, OMIM, PDBe, PDBj, PDBsum, PMC, PRINTS, PROSITE, PubMed, RCSB PDB, RefSeq, SCOP, SMART, SUPERFAMILY, UCSC, UniProtKB

• Program interface: WEB UI, RESTful interface
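The parameter set above maps naturally onto a small record type. The following is a minimal sketch, assuming Python dataclasses; the class name `DatabaseRecord` and its field names are illustrative choices made here, while the values are copied from the Pfam example on the slide.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class DatabaseRecord:
    """One inspected Life Science database (field names are illustrative)."""
    identifier: str
    acronym: str
    name: str
    url: str
    access_level: str
    practical_domains: List[str]
    microbial_level: str  # "mo", "sp" or "st"
    last_corrected: int
    owner: str
    comment: Optional[str] = None
    orientation: List[str] = field(default_factory=list)
    properties: List[str] = field(default_factory=list)
    search_by: List[str] = field(default_factory=list)
    ontologies: List[str] = field(default_factory=list)
    partners: List[str] = field(default_factory=list)
    interfaces: List[str] = field(default_factory=list)


pfam = DatabaseRecord(
    identifier="BIODBCORE-000515",
    acronym="Pfam",
    name="Sanger Pfam Mirror",
    url="http://pfam.sanger.ac.uk/",
    access_level="Open",
    practical_domains=["health", "winemaking", "baking", "brewing"],
    microbial_level="st",
    last_corrected=2015,
    owner="UK, EMBL-EBI",
    orientation=["bacteriophages", "viruses", "bacteria"],
    properties=["protein"],
    search_by=["OMIM ID", "PubMed ID"],
    ontologies=["GO"],
    partners=[
        "CATH", "CDD", "Europe PMC", "HGNC", "InterPro", "iPfam", "MEROPS",
        "NCBI Gene", "NCBI Taxonomy", "OMIM", "PDBe", "PDBj", "PDBsum", "PMC",
        "PRINTS", "PROSITE", "PubMed", "RCSB PDB", "RefSeq", "SCOP", "SMART",
        "SUPERFAMILY", "UCSC", "UniProtKB",
    ],
    interfaces=["WEB UI", "RESTful interface"],
)
```

Keeping the partner databases as a plain list makes it easy to count them or to compare partner lists across records.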


Databases by properties (number of databases)

genome (215), protein (145), chemistry (88), one taxon (78), pathway (36), RNA (32), biodiversity (22), enzyme (20), taxonomy (20), peptide (14), pharmacology (14), publications (13), drug (11), cell (9), image (8), ribosome (8), web-portal (8), antibody (6), antimicrobic (6), metabolite (6), molecules (6), toxicology (6)

carbohydrates (5), pathogenic (5), structure (5), biological activities of small molecules (4), lipid (4), metabolome (4), promoter (4), interactome (3), plasmid (3), structure biomolecule (3), terminology (3), toxin (3), veterinary (3), antibiotic resistance (3), bacteriophages (2), barcodes (2), biodegradation (2), biomolecules (2), collection (2), Cyanobacteria (2), immunogenetics (2)

immunology (2), map (2), mtDNA (2), patent (2), pathogenic mo (2), phylogenetica (2), phylogeny (2), stem cell (2), transport (2), virulence factors (2), acetylation (1), allergenic (1), allergenic molecules (1), antibiotics (1), ascomycetes (1), bacteriocins (1), Biocatalysis/Biodegradation (1), Carbon (1), chemical compounds (1), crop protection (1)

One Taxon section (78 total)

class: Mollicutes (1)

family: xylariaceae (1)

group: trichomycetes (1)

Genus (18): Ashbya (1), Aspergillus (2), Bacillus (2), Canadensis (1), Candida (1), Corynebacterium (1), Legionella (1), Listeria (1), Mycobacterium (2), Prochlorococcus (1), Pseudomonas (1), Saccharomyces (2), Streptococcus (2)

Species (57): Arabidopsis thaliana (1), Bacillus cereus (5), Bacillus subtilis (2), Buchnera aphidicola (1), Chlamydia trachomatis (1), Dictyostelium discoideum (1), Escherichia coli (15), Fusarium graminearum (1), Helicobacter pylori (1), Magnaporthe grisea (2), Mycobacterium leprae (1), Mycobacterium marinum (1), Mycobacterium smegmatis (1), Mycobacterium tuberculosis (1), Mycobacterium ulcerans (1), Mycoplasma genitalium (1), Mycoplasma pulmonis (1), Myxococcus xanthus (1), Neurospora crassa (4), Pyrococcus abyssi (1), Saccharomyces cerevisiae (8), Schizosaccharomyces pombe (1), Sporisorium reilianum (1), Staphylococcus aureus subsp. aureus (1), Toxoplasma gondii (1), Ustilago hordei (1), Ustilago maydis (1)

127 databases for specific organism types (they also keep microbial data):

animal (3), archaea (3), bacteria (13), drosophila (3), fungi (23), human (12), plant (10), protists (1), vertebrates (1), viruses (44), yeast (14)

Biggest database producer: BESC (BioEnergy Science Center)

BioCyc pathway/genome database: 7667 databases totally (http://www.biocyc.org/biocyc-pgdb-list.shtml)

Group 1: 7 databases: EcoCyc, MetaCyc, HumanCyc, AraCyc, YeastCyc, LeishCyc, TrypanoCyc

Group 2: 41 program-generated databases with curation done, each for one strain:

Agrobacterium fabrum C58, Anopheles gambiae, Aurantimonas manganoxydans SI85-9A1, Bacillus anthracis Ames, Bacillus subtilis 168, Bacteroides thetaiotaomicron VPI-5482, Candidatus Cardinium hertigii, Candidatus Evansia muelleri, Candidatus Portiera aleyrodidarum BT-QVLC, Caulobacter crescentus CB15, Caulobacter crescentus NA1000, Chlamydomonas reinhardtii, Clostridium saccharoperbutylacetonicum ATCC 27021, Cryptosporidium hominis TU502, Cryptosporidium parvum Iowa, Drosophila melanogaster, Escherichia coli B str REL606, Escherichia coli CFT073, Escherichia coli K-12 substr W3110, Escherichia coli O157:H7 str EDL933, Eubacterium rectale ATCC 33656, Helicobacter pylori 26695, Listeria monocytogenes 10403S, Methylosinus trichosporium OB3b, Mus musculus, Mycobacterium tuberculosis CDC1551, Mycobacterium tuberculosis H37Rv, Penicillium chrysogenum Wisconsin 54-1255, Peptoclostridium difficile 630, Plasmodium berghei ANKA, Plasmodium chabaudi, Plasmodium falciparum 3D7, Plasmodium vivax Sal-1, Plasmodium yoelii 17XNL, Schistosoma mansoni, Shigella flexneri 2a str 2457T, Streptomyces coelicolor A3(2), Synechococcus elongatus PCC 7942, Thalassiosira pseudonana CCMP1335, Toxoplasma gondii ME49, Vibrio cholerae O1 biovar El Tor str N16961

Group 3: 7625 databases, each for one bacterial strain, with no curation yet

Second-largest database producer: EMBL-EBI (98 dbs)

ArrayExpress

ASD

ASTD

ATD

BioModels

BioSamples

Cellular Phenotype Db

ChEBI

ChEMBL

CluSTr

CSA

DGVa

DNAtraffic

DrugPort

e!Ensembl

e!Ensembl S. cerevisiae

e!EnsemblBacteria

e!EnsemblCat

e!EnsemblChicken

e!EnsemblChimpanzee

e!EnsemblCow

e!EnsemblDog

e!EnsemblFugu

e!EnsemblFungi

e!EnsemblGenomes

e!EnsemblGorilla

e!EnsemblHorse

e!EnsemblMetazoa

e!EnsemblMouse

e!EnsemblPig

e!EnsemblPlants

e!EnsemblProtists

e!EnsemblRabbit

e!EnsemblZebrafish

EGA

EMBL

EMBL-EBI

EMDB

ENA

Ensembl

Enzyme Portal

Enzyme Structures

EVA

Expression Atlas

FunTree

GeneDB

GWAS Catalog

HGNC

HipSci

IGSR

IMEx

IMGT/HLA

IntAct

IntEnz

InterPro

IPD

IPD-ESTDAB

IPD-HPA

IPD-KIR

IPD-MHC

logRECOORD

MACiE

MEROPS

MetaboLights

Metal MACiE

MicroCosm

MIRIAM collection

MTBLS

NRNL1

NRNL2

NRPL1

NRPL2

OLDERADO

PANDIT

PDBe

PDBe EM Resources

PDBeChem

PDBsum

Pfam

Pfam

PhenoDigm

PICR

PomBase

PRIDE

PROCOGNATE

Reactome

RECOORD

Rfam

RNAcentral

SAS

SRS@EMBL-EBI

SureChEMBL

TreeFam

UniChem

UniProt-GOA

UniSave

VASCO

VectorBase

NCBI (71 dbs)

Assembly

BioProject

BioSample

BioSystems

Bookshelf

CCDS

CDD

ClinGen

ClinVar

Clone DB

COGs

dbEST

dbGaP

dbGSS

dbMHC

dbProbe

dbSNP

dbSTS

dbVar

Dengue virus database

ECRbase

Epigenomics

Genbank

Gene

Genetic Codes

Genome

GEO

GEO DataSets

GEO Profiles

GTR

Histone

HIV-1

Homologene

IBIS

Influenza Virus Resource

MapViewer

MedGen

MEDLINE

MeSH

MMDB

NCBI

NCBI taxonomy

NCBI Trace Archives

NLM Catalog

Nucleotide

OMIM

Organelle genomes

Plant Genome Central

PMC

PopSet

PRK

Probe

Protein

Protein Clusters

PubChem

PubChem BioAssay

PubChem Compound

PubChem Substance

PubMed

PubMed Health

RefSeq

RefSeqGene

Retroviruses

SKY/M-FISH and CGH

SRA

Structure

TPA

UniGene

UniVec

Viral genomes

Virus Variation

Biggest database partners lists: UniProtKB

See http://www.uniprot.org/database/

Allergome, ArachnoServer, Bgee, BindingDB, BioCyc, BioGrid, BioMuta, BRENDA, CAZy, CCDS, CGD, ChEMBL, ChiTaRS, CleanEx, CollecTF, COMPLUYEAST-2DPAGE, ConoServer, CTD, dbSNP, DDBJ, DEPOD, dictyBase, DIP, DisProt, DMDM, DNASU, DrugBank, EchoBASE, EcoGene, eggNOG, ENA, Ensembl, EnsemblBacteria, EnsemblFungi, EnsemblMetazoa, EnsemblPlants, EnsemblProtists, ENZYME, ESTHER, euHCVdb, EuPathDB, ET, ExpressionAtlas, FlyBase, GenAtlas, GenBank, Gene3D, GeneCards, GeneDB, GeneFarm, NCBI Gene, GeneReviews, Ensembl Genomes, Genevisible, GeneWiki, GenoList, GenomeRNAi, GPCRDB, Gramene, GuidetoPHARMACOLOGY, H-InvDB, HAMAP, HGNC, HOGENOM, HOVERGEN, HPA, HUGE, IMGT, InParanoid, IntAct, InterPro, iPTMnet, KEGG, LegioList, Leproma, MaizeGDB, MalaCards, MaxQB, MEROPS, MGI, Micado, OMIM (1), MINT, MobiDB, ModBase, MoonProt, mycoCLAP, NextBio, neXtProt, OMA, Orphanet, OrthoDB, PANTHER, PATRIC, PDBe, PDBj, PDBsum, PeptideAtlas, PeroxiBase, Pfam, PharmGKB, PhosphoSite, PhylomeDB, PIR, PIRSF, PomBase, PRIDE, PRINTS, ProDom, ProMEX, PROSITE, PMP, Proteomes, ProtoNet, PseudoCAP, RCSB-PDB, Reactome, REBASE, RefSeq, REPRODUCTION-2DPAGE, RGD, Rouge, SABIO-RK, SBKB, SGD, SignaLink, SMART, SMR, SOURCE, STRING, SUPFAM, SWISS-2DPAGE, SwissLipids, TAIR, TCDB, TIGRFAMs, TreeFam, TubercuList, UCD-2DPAGE, UCSC, UniCarbKB, UniGene, UniPathway, VectorBase, WBParaSite, World-2DPAGE, WormBase, Xenbase, ZFIN

To find an appropriate integration solution we constructed a table of partner references between Life Science databases. It has 2054 rows, each row standing for a specific Life Science database, and 805 columns, each column standing for a specific microbial database. The cell in row I, column J holds 1 if microbial database J has database I in the list of its partners, and 0 otherwise.

With this table we track two integration parameters of a specific database {A}:

1. Connection Factor - the list of database partners of {A} according to the materials of {A} (data sources, common fields, common data curation, etc.); this is the column of the table. Connection Number - the number of elements equal to 1 in this list, i.e. how many database partners {A} has.

2. Attraction Factor of database {A} - the list of the databases that refer to {A} in their Connection Factor; this is the row of the table. Attraction Number - the number of elements equal to 1 in this list, i.e. how popular {A} is in the database community.

The Attraction Number (AN) and the Connection Number (CN) both indicate the integration level of a specific database. In our research the UniProtKB database shows the best balanced values:

CN=149, AN=350.

In CC and mBRC catalogues both values are mostly 0.
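The two numbers can be illustrated on a toy table. The sketch below follows the definition given above (rows are Life Science databases, columns are microbial databases, a cell is 1 when the column database lists the row database as a partner); the database names `DB_A` to `DB_D` and their partner lists are invented for illustration and do not come from the study.

```python
from typing import Dict, List, Set


def build_table(rows: List[str], cols: List[str],
                partners: Dict[str, Set[str]]) -> List[List[int]]:
    """Cell [i][j] is 1 if column database j lists row database i as a partner."""
    return [[1 if r in partners.get(c, set()) else 0 for c in cols] for r in rows]


def connection_number(table: List[List[int]], cols: List[str], db: str) -> int:
    """CN: column sum, i.e. how many partners `db` declares."""
    j = cols.index(db)
    return sum(row[j] for row in table)


def attraction_number(table: List[List[int]], rows: List[str], db: str) -> int:
    """AN: row sum, i.e. how many databases declare `db` as their partner."""
    i = rows.index(db)
    return sum(table[i])


# Hypothetical data: four Life Science databases, three of them microbial.
rows = ["DB_A", "DB_B", "DB_C", "DB_D"]
cols = ["DB_A", "DB_B", "DB_C"]
partners = {
    "DB_A": {"DB_B", "DB_D"},
    "DB_B": {"DB_A"},
    "DB_C": {"DB_A", "DB_B", "DB_D"},
}

table = build_table(rows, cols, partners)
cn_c = connection_number(table, cols, "DB_C")
an_a = attraction_number(table, rows, "DB_A")
an_c = attraction_number(table, rows, "DB_C")
```

In this toy table `DB_C` declares three partners but nobody refers to it, which mirrors the situation described for culture collection catalogues: a database can only raise its AN when other databases start linking to it.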

Starting from scratch, we selected two candidate lists of integration contracts:

1. NCBI + UniProt integration schema + GeneCards integration schema (“NCBI list”)

2. EMBL-EBI + UniProt integration schema + GeneCards integration schema (“EMBL-EBI list”)

According to the table, if we succeed with the three contracts, MICRO-IS would have:

1. CN=163, initial value of AN=163, potential value 670, with the NCBI list, or

2. CN=146, initial value of AN=146, potential value 654, with the EMBL-EBI list.
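The slides do not spell out how the CN of MICRO-IS is derived from the three contracts. One plausible reading, used here purely as an assumption, is that MICRO-IS inherits the union of the partner lists of the contracted databases, and that the initial AN equals CN because every inherited link is taken to be reciprocal. The hub names and partner sets below are hypothetical.

```python
from typing import Dict, Iterable, Set


def combined_partners(partner_lists: Dict[str, Set[str]],
                      contracts: Iterable[str]) -> Set[str]:
    """Union of the partner lists of all contracted databases (assumed rule)."""
    merged: Set[str] = set()
    for db in contracts:
        merged |= partner_lists[db]
    return merged


# Hypothetical partner lists for three contracted hub databases.
partner_lists = {
    "Hub_1": {"X", "Y"},
    "Hub_2": {"Y", "Z"},
    "Hub_3": {"Z", "W", "Hub_1"},
}

partners = combined_partners(partner_lists, ["Hub_1", "Hub_2", "Hub_3"])
cn = len(partners)
initial_an = cn  # assumption: each inherited link is reciprocal from the start
```

Overlapping partner lists count only once, so adding a contract whose partners are already covered raises CN very little.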


Databases with more than 10 partners (CN)

149 UniProtKB

148 UniProt-GOA

84 GeneCards

61 Hits

59 SWISS-2DPAGE

54 NCBI

50 PiroplasmsDB

49 EMBL

49 ENA

46 SGD

43 e!EnsemblGenomes

42 MetaCyc

40 EcoGene

40 EMBL-EBI

40 Gene

37 PubChem

37 ViralZone

36 EcoCyc

36 GPCRs

36 NCBI taxonomy database

34 ESTHER

34 MalaCards

34 OMIM

34 OMIM (1)

33 PRODORIC

32 HOGENOM

31 ChEBI

30 Genome

30 HPIDB

30 JPGV

30 STITCH

29 EuPathDB

29 FungiDB

29 MACiE

29 T3DB

28 EcoProDB

28 InterMitoBase

28 PIR

28 Reactome

27 BioSystems

27 CAZy

27 RNAcentral

27 STRING

26 GPM

26 Guide to Pharmacology

26 IRD

26 PATRIC

26 ThaleMine

26 YeastMine

25 LPSN

25 TRRD

24 CCDB

24 dictyBase

24 DrugBank

24 GeneProf

24 Genolevures

24 TDR Targets

23 GTOP

23 Pfam

23 Pfam

22 Genetics Home Reference

22 MapViewer

22 MouseMine

22 PDBsum

22 PED

22 PomBase

21 dbSNP

21 Europe PMC

21 FlyMine

21 KEGG

21 Microbes Online

21 NextBio

21 OrthoDB

20 ConsensusPathDB

20 INstruct

20 MINT

20 Retroviruses

19 MetaboLights

19 TCDB

19 TriTrypDB

18 Ebolavirus

18 HMDB

18 InnateDB

18 MitoMiner

18 MODOMICS

18 WholeCellKB

17 ComPPI

17 EcoliWiki

17 InterPro

17 SDAP

17 ViPR

17 WikiPathways

17 YeastCyc

16 IEDB

16 PANTHER

16 Pseudomonas Genome Database

16 RCSB PDB

16 RNA Virus

16 toxoMine

15 CDD

15 DNAtraffic

15 EVA

15 HFV / Ebola Database

15 i2d

15 PhosphoGRID

15 Source

15 UniRef

15 Victors

14 APD

14 Biozon

14 DAMPD

14 GeneDB

14 HSDB

14 KEGG ORTHOLOGY

14 NPIDB

14 PANDORA

14 PLEXdb

14 PROMISCUOUS

14 Rhea

14 SRA

14 TargetDB

14 TransportDB

14 UniPathway

13 APID

13 ASAP

13 BioModels

13 BRENDA

13 e!Ensembl Saccharomyces cerevisiae

13 e!EnsemblBacteria

13 e!EnsemblFungi

13 e!EnsemblProtists

13 ECMDB

13 EPD

13 FooDB

13 GenoList

13 IMEx

13 KEGG BRITE

13 KEGG GENES

13 KEGG GENOME

13 MTBLS

13 neXtProt

13 NMPDR

13 ORENZA

13 PeptideAtlas

13 PhenomicDB

12 DBETH

12 Drug2Gene

12 Genbank

12 GoMapMan

12 IMG

12 IMGT/3Dstructure-DB

12 iRefWeb

12 MEROPS

12 ModBase

12 MOPED

12 PMP

12 P-POD

12 Proteome 2D-PAGE Database

12 PubChem BioAssay

12 PubChem Compound

12 PubChem Substance

12 Rfam

12 sRNAMap

12 TubercuList

12 YMDB

11 CCSB Interactome

11 CTD

11 CyanoBase

11 ExplorEnz

11 iHOP

11 IMP

11 INTEGRALL

11 KEGG LIGAND

11 KiMoSys

11 KinBase

11 MatrixDB

11 MPIDB

11 MycoBank

11 Nucleotide

11 PICR

11 PRGdb

11 PROSITE

11 REBASE

11 SMART


Databases with big attraction number (AN)

350 UniProtKB (+Swiss-Prot +TrEMBL)

335 PubMed

181 RCSB PDB

166 Genbank

154 NCBI taxonomy

140 Gene

133 RefSeq

131 KEGG

108 EC

104 InterPro

89 Ensembl

87 Pfam

74 SGD

68 ENA

59 OMIM

56 Nucleotide

51 IntAct

49 PROSITE

46 Reactome

45 BioGRID

45 PIR

43 HGNC

42 FlyBase

41 COGs

41 SCOP

40 UniGene

39 CAS

39 MEDLINE

39 MGI

38 PubChem

38 SMART

37 DIP

37 GEO

36 ChEBI

35 DDBJ

35 MINT

34 WormBase

33 Genome

33 STRING

31 NCBI

31 TAIR

30 DrugBank

29 BRENDA

29 HPRD

29 TIGRFAMS

28 ENZYME

27 Pfam

27 SUPERFAMILY

25 BioCyc

25 GeneCards

24 EcoCyc

24 MeSH

24 PDBe

24 PRINTS

23 CDD

23 PANTHER

23 ProDom

22 BioProject

22 CATH

22 HomoloGene

22 KEGG PATHWAY

22 MetaCyc

22 PMC

21 ChEMBL

21 dbSNP

21 wwPDB

20 PDBsum

20 PIRSF

20 PRIDE


Database partners in the NCBI list solution: addgene, Allergome, Assembly, BioCyc, BioGRID, BioProject, BioSample, BioSystems, Bookshelf, BRENDA, CAZy, CDD, CGD, ChEMBL, ClinicalTrials.gov, COGs, CollecTF, Compulyeast, CRISPRdb, CTD, dbEST, dbGSS, dbProbe, dbSNP, DDBJ, Dengue virus database, dictyBase, DNASU, DrugBank, e!EnsemblBacteria, e!EnsemblFungi, e!EnsemblGenomes, e!EnsemblProtists, EchoBASE, EcoGene, eggNOG, ENA, Ensembl, ESTHER, euGenes, euHCVdb, EuPathDB, Expression Atlas, Genbank, Gene, GeneCards, GeneDB, Genetic Codes, Genetics Home Reference, GenoList, Genome, GEO, GEO DataSets, GEO Profiles, Gramene, Guide to Pharmacology, HAMAP, Histone, HIV-1, HMDB, HOGENOM, Homologene, HOVERGEN, i2d, Influenza Virus Resource, InParanoid, IntAct, InterPro, iPTMnet, KEGG, LegioList, Leproma, LifeMap Discovery, MalaCards, MapViewer, MaxQB, MedGen, MEDLINE, MedlinePlus, MEROPS, MeSH, Micado, MINT, miRBase, miRTarBase, MMDB, MobiDB, ModBase, MoonProt, MOPED, mycoCLAP, NCBI, NCBI taxonomy, NCBI Trace Archives, NextBio, neXtProt, NONCODE, Nucleotide, OMA, OMIM, OMIM (1), Organelle genomes, OrthoDB, PANTHER, PATRIC, PaxDB, PDBe, PDBj, PDBsum, PeptideAtlas, PeroxiBase, Pfam, PharmGKB, PhylomeDB, PIR, PMC, PMP, PomBase, PopSet, PRIDE, PRK, Probe, PROSITE, Protein, Protein Clusters, Proteomes, ProteopediA, PseudoCAP, PubChem, PubChem BioAssay, PubChem Compound, PubChem Substance, PubMed, PubMed Health, RCSB PDB, Reactome, REBASE, RefSeq, Retroviruses, Rfam, SABIO-RK, SGD, SIMAP, SMART, Source, SRA, STRING, Structure, SUPERFAMILY, SWISS-2DPAGE, SWISS-MODEL, TCDB, TIGRFAMS, TubercuList, UCD 2D-PAGE, UCSC, UMLS, UniGene, UniPathway, UniProtKB, Viral genomes, Virus Variation, World-2DPAGE Repository

Database partners in the EMBL-EBI list solution: addgene, Allergome, ArrayExpress, ASTD, BioCyc, BioGRID, BioModels, BioSamples, BioSystems, Bookshelf, BRENDA, CAZy, CGD, ChEBI, ChEMBL, ClinicalTrials.gov, CollecTF, Compulyeast, CRISPRdb, CTD, dbSNP, DDBJ, dictyBase, DNASU, DNAtraffic, DrugBank, DrugPort, e!Ensembl, e!Ensembl Saccharomyces cerevisiae, e!EnsemblBacteria, e!EnsemblFungi, e!EnsemblGenomes, e!EnsemblProtists, EchoBASE, EcoGene, eggNOG, EMBL, EMBL-EBI, EMDB, ENA, Ensembl, Enzyme Structures, ESTHER, euGenes, euHCVdb, EuPathDB, EVA, Expression Atlas, Genbank, Gene, GeneCards, GeneDB, Genetics Home Reference, GenoList, Gramene, Guide to Pharmacology, HAMAP, HMDB, HOGENOM, Homologene, HOVERGEN, i2d, IMEx, InParanoid, IntAct, InterPro, iPTMnet, KEGG, LegioList, Leproma, LifeMap Discovery, MACiE, MalaCards, MaxQB, MedlinePlus, MEROPS, MeSH, MetaboLights, Micado, MINT, miRBase, miRTarBase, MobiDB, ModBase, MoonProt, MOPED, MTBLS, mycoCLAP, NCBI, NextBio, neXtProt, NONCODE, OMA, OMIM, OMIM (1), OrthoDB, PANTHER, PATRIC, PaxDB, PDBe, PDBe EM Resources, PDBj, PDBsum, PeptideAtlas, PeroxiBase, Pfam, Pfam, PharmGKB, PhylomeDB, PICR, PIR, PMP, PomBase, PRIDE, PROSITE, Proteomes, ProteopediA, PseudoCAP, PubChem, PubMed, RCSB PDB, Reactome, REBASE, RefSeq, Rfam, RNAcentral, SABIO-RK, SGD, SIMAP, SMART, Source, STRING, SUPERFAMILY, SWISS-2DPAGE, SWISS-MODEL, TCDB, TIGRFAMS, TubercuList, UCD 2D-PAGE, UCSC, UMLS, UniGene, UniPathway, UniProt-GOA, UniProtKB, World-2DPAGE Repository


Task 1a: Name processing *

* Page content from: http://www.mycobank.org/BioloMICS.aspx?Table=Mycobank&Rec=18759&Fields=All

Task 1a: Strains algorithm

WDCM 133, Centraalbureau voor Schimmelcultures, Filamentous fungi and Yeast Collection, Netherlands

WDCM 18, Food Science Australia, Ryde

WDCM 758, IBT Culture Collection of Fungi, Denmark

WDCM 214, CABI Genetic Resource Collection, UK

Task 2b tools

Based on: ELIXIR network infrastructure, BioMedBridges reports schema in the Semantic Web, LOD cloud

MICRO-IS

Ontologies in Task 2b

Number of microbial databases with ontologies: 243; total number of ontologies: 63

Ontologies used in more than one database (with the number of databases):

GO 207, KO 17, SO 9, ECO 9, PW 7, FMA 7, EFO 6, CCO 6, Pathway Tools pathway ontology 6, Pathway Tools Evidence Ontology 6, Pathway Tools compound ontology 6, PRO 5, ChEBI 5, PO 3, MultiFun 3, HPO 3, BTO 2, CAB Thesaurus 2, CL 2, DO 2, EPO 2, JPO 2, KIPO 2, KPSI MI Ontology 2

Ontologies used in just one database:

ARO, CAVEman, CELDA, Cereal plant growth stage (GRO), Dictyostelium anatomy ontology, DOID, ENVO, EC, EO, FAO, FlyBase Controlled Vocabulary, FYPO, GBIF Taxonomic checklists, GR_tax, MA, MapMan, MEO, MeGO, MetaCyc Pathway Ontology, Metathesaurus, MGED, MO, MSH, NCI Thesaurus, NCIM, ncRNA vocab, ORDO, Pathway Tools reaction ontology, PDO, Phenotype Ontology, PhiGO, Plant anatomy, Plant Development Ontology, PSI-MOD, PSO, QuickGO, The Gene categories from Monica Riley, TO, YPO

MICRO-IS catalogue data fields

If database integration is the only goal of MICRO-IS, the minimal catalogue data standard could have six fields:

- Name,

- Accession number,

- Original URL of strain passport,

- History of deposit,

- Type of microorganism,

- List of Life Science databases that present data on this strain, with links to these data
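The six fields can be sketched as a small record type. This is only an illustration under stated assumptions: the class names `StrainRecord` and `DatabaseLink`, the field names, and all example values (including the `example.org` URLs and the accession `XXX 0000`) are hypothetical placeholders, not part of any MICRO-IS specification.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class DatabaseLink:
    """A Life Science database holding data on the strain (illustrative)."""
    database: str
    url: str


@dataclass
class StrainRecord:
    """Minimal catalogue entry with the six fields listed above."""
    name: str
    accession_number: str
    passport_url: str
    deposit_history: List[str]
    organism_type: str
    database_links: List[DatabaseLink] = field(default_factory=list)


# Placeholder values only; no real strain is described here.
record = StrainRecord(
    name="Genus species",
    accession_number="XXX 0000",
    passport_url="https://example.org/catalogue/XXX-0000",
    deposit_history=["XXX 0000 <- original depositor"],
    organism_type="bacteria",
    database_links=[DatabaseLink("ENA", "https://example.org/ena-record")],
)

linked_databases = [link.database for link in record.database_links]
```

The last field is what would give a catalogue entry a non-zero Connection Number: each `DatabaseLink` is one outgoing reference to a partner database.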


Thank you


PICR database partners

Look at: http://www.ebi.ac.uk/Tools/picr/userguide.do

PICR

EMBL(), Ensembl(), Ensembl Genomes(), EPO, FlyBase, H-Inv, IPI, JPO, KIPO, PDB, PIR, PRF, RefSeq(), SEGUID, SGD, TAIR, SwissProt(), TrEMBL(), TROME(), UniMES, UniParc, USPTO, VEGA(), WormBase

EMBL databases

Species-specific Refseq releases

SwissProt variant databases

TrEMBL variant databases

Species-specific Trome releases

Species-specific Vega releases

Species-specific Ensembl releases

List of taxon specific databases

Swiss-Prot varsplic and TrEMBL varsplic in output options


ChEBI data communication

[Diagram: ChEBI data sources and generated cross-references. Databases and organisations named in the diagram: IntEnz, NIST Chemistry WebBook, KEGG COMPOUND, PDBeChem, ChEMBL, Expression Atlas, GMD, ChemIDplus, ChEBI, IUBMB, NURSA, IUPAC, JCBN, CBN, PDB, BBD, RESID, COMe, NMRShiftDB, Enzyme Portal, BRENDA, Rhea, ArrayExpress, SABIO-RK, PubChem, Reactome, BioModels, IntAct, UniProtKB, UniProt, EMBL, EuroFir, LIPID MAPS, WebElements, MolBase, KEGG GLYCAN, KEGG DRUG, Patent, DrugBank, EBI Industry Programme.]

iProClass database connections

http://pir.georgetown.edu/pirwww/dbinfo/iproclass.shtml

UniParc database partners

Look at: http://www.uniprot.org/help/uniparc

It keeps records from the databases:

EMBL-Bank/DDBJ/GenBank nucleotide sequence databases, Ensembl, EnsemblGenomes, European Patent Office (EPO), FlyBase, H-Invitational Database (H-InvDB), International Protein Index (IPI), Japan Patent Office (JPO), Korean Intellectual Property Office (KIPO), Pathosystems Resource Integration Center (PATRIC), PIR-PSD, Protein Data Bank (PDB), Protein Research Foundation (PRF), RefSeq, Saccharomyces Genome database (SGD), TAIR Arabidopsis thaliana Information Resource, The Seed (SEED), TROME, USA Patent Office (USPTO), UniProtKB/Swiss-Prot, protein isoforms, UniProtKB/TrEMBL, Vertebrate Genome Annotation database (VEGA), WormBase, WormBase ParaSite (WBParaSite)

It contains cross-references with the databases:

PIR, PIRARC, REMTREMBL, UniMES, TREMBLNEW, TrEMBL_varsplic

Pathguide: (all) database interactions

http://pathguide.org/interactions.php

URL: http://www.genome.jp/dbget/dbget.links.html (2010)

An example for the possible service

URL: http://biodb.jp/

https://github.com/micommunity/psicquic

http://www.ebi.ac.uk/Tools/webservices/psicquic/registry/registry?action=STATUS

MIRRI-IS Biomedical solution

The sources not inspected yet

• http://cgsc.biology.yale.edu/BioLinks.php - Other Biology Servers

• http://collectf.umbc.edu/browse/links/ - Other resources

• http://phossnp.biocuckoo.org/links.php - Computational resources of protein phosphorylation:

• http://cpla.biocuckoo.org/links.php - Acetylation Databases:

• http://dbptm.mbc.nctu.edu.tw/ - after the line "Databases" long list

• http://minisatellites.u-psud.fr/ - Genomes and PolyMorphismS

• http://www.hawaii.edu/abrp/biorlinks.html - Links to some Bioremediation Sites

• http://toxnet.nlm.nih.gov/

• http://ecoliwiki.net/colipedia/index.php/Category:Databases - online databases containing information related to E. coli K-12 and other genomics information

• https://rarediseases.info.nih.gov/research/9/tools-for-researchers#category29 - lists of medical resources

• http://www.genecards.org/Guide/GeneCard - list of Antibodies databases after words "This section also provides links to Antibodies from"

• http://www.hgvs.org/locus-specific-mutation-databases/ page 2+ - List of Locus-specific Databases

• http://gydb.uv.es/index.php/Main_Page - links of interest

• http://genomics.senescence.info/links.html - Links on Ageing and Computational Biology

• https://sis.nlm.nih.gov/enviro/databasedescriptions.html - toxicity

• http://world-2dpage.expasy.org/repository/ - World-2DPAGE Repository databases

• http://www.virology.net/ - All the virology on the WWW

• http://viralzone.expasy.org/all_by_species/677.html - ViralZone Links (137 contacts)

• https://scicrunch.org/scicrunch - SciCrunch - over 13000 research resources (datasets SW databases etc.) mostly biomedical


General comparison

(1) In culture collections*

708 culture collections in WDCM/CCinfo

150 online catalogues (67 in EU)

(2) Outside

2056 databases collected online, 807 with microbial data

Biomedicine 171

Agriculture 23

Pharma, biochemistry 35

Any type 243

Global / EU

Biomedicine: 53 / 21

Agriculture: 67 / 26

Pharma, biochemistry: 32 / 12

Any type: 109 / 43


* Plus WDCM, Straininfo, CABRI and regional mBRC networks


Best integration values

Life Science:

CN: 149 UniProtKB, 148 UniProt-GOA, 84 GeneCards

AN: 350 UniProtKB, 335 PubMed, 181 RCSB PDB

MICRO-IS:

1. CN=163, initial value of AN=163, potential value 670 (NCBI list),

2. CN=146, initial value of AN=146, potential value 654 (EMBL-EBI list).

Key groups and application fields

• Biggest groups: BioCyc (7667 dbs), EMBL-EBI (98 dbs), NCBI (71 dbs), SIB (65 dbs), KEGG, UniProt

• Databases with the biggest lists of partners: COL (159 dbs), PIR (158 dbs), UniProt (150 dbs)

• The biggest attractors: UniProtKB (350), PubMed (335), RCSB PDB (181)

• Application fields: Life science, Biomedicine, Pharma, Agriculture, BioRemediation

Databases in practical areas (42/243): agriculture, baking, biodegradation, biotechnology, brewing, enzyme production, food industry, pesticide residues, heavy metals, health, patent, pharma, remediation, veterinary drug, winemaking

abYsis, AgBiotechNet, AGRICOLA, Allergome, ALTBIB, Anti-HIV Compounds, APD, ApiEST-DB, ARDB, Aspergillus Genomes, BacMap, BBD, BEI, Bio Synthesis, BioGRID, BioModels, Bionemo, BioRadBase, BMRB, BRENDA, BuG@Sbase, BuruList, CADRE, CARD, CATH, CCDB, CCSB Interactome, ChEBI, ChEMBL, ClinicalTrials.gov, CLU-IN, COGEME, Colibri, ConceptWiki, Cost Estimates of Foodborne Illnesses, CPC, CTD, CTDB, DAA, DailyMed, DAnCER, DART, DBETH, dbSNP, Diseases Database, DNAtraffic, Dr.VIS, Drug2Gene, DrugBank, DrugPort, e!Ensembl, EAWAG-BBD, Ebolavirus, EcoGene, EcoliWiki, ELM, EMBL-EBI, ENA, Ensembl, EpiFlu™, EPIMHC, Espacenet, ESTHER, EuPathDB, EVA, FCP, FluKB, ForestScience Current Database, FunCoup, FungiDB, GARD, GB, GeneCards, GeneMANIA, Genetics Home Reference, GENE-TOX, GenoList, Global Atlas of Infectious Diseases, GlycomeDB, GOBASE, GoMapMan, GPCRs, Gramene, Guide to Pharmacology, HAGR, HAPPI, Hawaii Bioremediation Database, HC DPD, HCV Immunology Database, HCV sequence database, HCVIVdb, HealthMash, HFV / Ebola Database, HIV Drug Resistance Database, HIV MID, HIV mutation browser, HIV Sequence Database, HIV Structural Database, HIV Structural Database and Chem-BLAST, HIV/AIDS Clinical Trials, HIV/SIV Vaccine, HIVBrainSeqDB, HMDB, HorizonScan database, HSDB, i2d, ICD-10, ICD-9-CM, IEDB, IMID, ImmPort, IMP, Influenza Virus Resource, INTEGRALL, iPfam, IRD, iRefWeb, KEGG, KEGG BRITE, KEGG DISEASE, KEGG MEDICUS, LAMP, LEGER, LegioList, LifeMap Discovery, ListiList, LMPD, LMSD, MalaCards, MedHunt, MedicMine, MEDLINE, MedlinePlus, MeSH, MGG, microbedb, MLSTDB, MolliGen, MouseMine, MPID, MSRDB, MTBLS, MUHDB, MUMDB, MvirDB, MycoBrowser leprae, MycoBrowser marinum, MycoBrowser tuberculosis, MypuList, NAPP, NCIm, NCIt, NCPI, NDF-RT, NextBio, neXtProt, NFSD, NRSub, OMIM, OMIM (1), OMMBID, ORegAnno, OrthoDisease, PAGED, PathoPlant, PathPred, PATRIC, PC, PDRhealth, PDTD, PED, PepBank, PeroxisomeDB, PharmGKB, PhenoM, PhosphoGRID, PhytAMP, PIMRider, PINA, PLEXdb, PLOS One, PPIRA, PRGdb, 
PRIMOS, PROFESS, PROMISCUOUS, PseudoCAP, PSP, PubChem, PubMed, RCSB PDB, Reactome, Reference Strain Catalogue, RhizoBase, RIKEN, SagaList, SALAD, Scansite, ScerTF, SCMD, SCRIPDB, SDAP, SGD, SMD, SPIDer, SPPS, SubtiList, Subviral RNA Database, SuperSite, SuperTarget, SwissVar, T3DB, TDR Targets, Telomerase database, TiPs, TOXLINE, toxoMine, TriTrypDB, TubercuList, UCD 2D-PAGE, UMLS, UniProtKB, VetMed Resource, VFDB, Victors, ViPR, Virhostome Interactome Database, WikiPathways, Wiki-Pi, Wong's Virology, YDPM, Yeast Interactome Database, Yeast Resource Center, Yeast snoRNA, YeastCyc, YeastGE, YeastMine, YEASTNET, YEASTRACT, YMDB


Bioremediation **

Hawaii Bioremediation DBs, EAWAG-BBD, BioRadBase, BIOREM, ECSI, CLU-IN, OxDBase, RhizoBase, TechProfiles.org. Also: PathSearch, PathComp, PathPred, KEGG, BioCyc, ...

Dehalogenation

Four categories of databases are especially helpful in dehalogenation *:

1. Databases of sequence and structure (NCBI, EMBL, DDBJ, MBDG, CMR, ExPASy, PDB, CSD, SCOP, FSSP)

2. Databases of enzymes and metabolic pathways (BRENDA, ExplorEnz, UM-BBD, MetaCyc, WIT, KEGG, Pathway Commons)

3. Databases of molecules (PubChem NCBI, ChemDB, ZINC, Pollution Database, ECOTOX)

4. Databases of organisms (Taxonomy NCBI, BSD, CBS, PAMDB, JCM)

* R. Satpathy, V.B. Konkimalla, J. Ratha. Application of bioinformatic tools in microbial dehalogenation research (a review). 2015

** In Silico Approach for the Bioremediation of Toxic Pollutants. F. Khan, M. Sajid and S. S. Cameotra. Petroleum & Environmental Biotechnology

Aquamicrobium defluvii in EAWAG-BBD


A. defluvii pathways in BioCyc


BRIO species in Life Science databases

EMBL-EBI (301), ENA (299), Genbank (298), NCBI taxonomy (298), Nucleotide (298), RefSeq (298), PubMed (294), UniProtKB (270), PLOS One (248), ForestScience Current Database (216), CPC (175), ALTBIB (143), Animalscience (136), Espacenet (128), BRENDA (126), VetMed Resource (113), MetaCyc (99), BioCyc (96), KEGG (95), KEGG BRITE (91), Gene (83), KEGG GENOME (79), KEGG GENES (69), KEGG DISEASE (53), KEGG LIGAND (52), KEGG MEDICUS (52), KEGG MODULE (52)

KEGG Organisms (52), KEGG ORTHOLOGY (52), KEGG PATHWAY (52), PathComp (52), RCSB PDB (50), CLU-IN (47), InterPro (43), Addgene (41), AGRICOLA (41), GoMapMan (38), ACLAME (34), 5S RNA Database (31), NAPP (30), IntAct (28), ABCdb (25), BBD (25), EAWAG-BBD (25), SCOP (25), Bionemo (18), PROSITE (17), Ensembl (12), Allergome (11), ABAC (9), AFTOL (9), BioRadBase (9), RhizoBase (9), ABCISSE (7), Reference Strain Catalogue (4)

PhytAMP (3). The remaining values are 1 (four entries) and 0 (22 entries); the labels, in slide order, are: A.pernix AFLP PLEXdb PRGdb 3D RIBOSOMAL MODIFICATION MAPS abYsis AfCS PLASMID AgBiotechNet AGD AgrID AH-DB ARAMEMNON COGEME MGG MPID MSRDB MUHDB MUMDB NCPI PathoPlant PathPred PC PPIRA Reactome SGD SubtiList

Example for Task 2a access

Acronym | Species name | URL of access to this species

EMBL-EBI | Acaulospora nivalis | https://www.ebi.ac.uk/ebisearch/search.ebi?query=Acaulospor...
ENA | Acaulospora nivalis | http://www.ebi.ac.uk/ena/data/search?query=Acaulospora+nivalis
Genbank | Acaulospora nivalis | http://www.ncbi.nlm.nih.gov/nuccore/?term=Acaulospora+nivalis
Nucleotide | Acaulospora nivalis | http://www.ncbi.nlm.nih.gov/nuccore/?term=Acaulospora+nivalis
RefSeq | Acaulospora nivalis | https://www.ncbi.nlm.nih.gov/nuccore/?term=Acaulospora+nivalis
NCBI taxonomy | Acaulospora nivalis | http://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi
PubMed | Achromobacter marplantensis | http://www.ncbi.nlm.nih.gov/pubmed/?term=Achromobacter+marp...
UniProtKB | Acidovorax delafieldii | http://www.uniprot.org/uniprot/?query=Acidovorax+delafieldi...
PLOS One | Achromobacter marplantensis | http://journals.plos.org/plosone/search?q=%22Achromobacter+...
ForestScience Current Database | Acidovorax sp. | http://www.cabi.org/forestscience/search/?q=%22Acidovorax+s...
CPC | Acidovorax sp. | http://www.cabi.org/cpc/search/?q=%22Acidovorax+sp.%22
ALTBIB | Achromobacter marplantensis | http://www.ncbi.nlm.nih.gov/pubmed?cmd=Search&term=%22Acido...
Animalscience | Acidovorax delafieldii | http://www.cabi.org/animalscience/search/?q=%22Acidovorax+d...
Espacenet | Acidovorax delafieldii | https://worldwide.espacenet.com/searchResults?submitted=tru...
BRENDA | Acidovorax sp. | http://www.brenda-enzymes.org/search_result.php?quicksearch...
VetMed Resource | Acidovorax delafieldii | http://www.cabi.org/vetmedresource/search/?q=%22Acidovorax+...
MetaCyc | Acidovorax delafieldii | http://www.biocyc.org/organism-summary?object=ADEL573060
BioCyc | Acidovorax delafieldii | http://www.biocyc.org/organism-summary?object=ADEL573060
KEGG | Acidovorax sp. | http://www.genome.jp/dbget-bin/www_bfind_sub?mode=bfind&max...
Gene | Acinetobacter venetianus | http://www.ncbi.nlm.nih.gov/gene/?term=Acinetobacter+venetianus
PathComp | Pseudomonas putida | http://www.genome.jp/tools-bin/pathcomp?org_name=ppg&org_na...
RCSB PDB | Acidovorax sp. | http://www.rcsb.org/pdb/search/advSearch.do?search=new
CLU-IN | Acidovorax sp. | https://clu-in.org/search/default.cfm?search_term=Acidovora...
InterPro | Acidovorax delafieldii | http://www.ebi.ac.uk/interpro/search?q=Acidovorax+delafieldii
addgene | Acidovorax sp. | https://www.addgene.org/search/google_results?q=Acidovorax+sp.
AGRICOLA | Alcaligenes sp. | http://agricola.nal.usda.gov/cgi-bin/Pwebrecon.cgi?Search_A...
GoMapMan | Acidovorax sp. | http://www.gomapman.org/search/gmm/Acidovorax%20sp.?entity=...
ACLAME | Acidovorax sp. | http://aclame.ulb.ac.be/perl/Aclame/search.cgi?keys=Acidovo...
5S RNA Database | Bjerkandera adusta | http://biobases.ibch.poznan.pl/5SData/
NAPP | Bacillus cereus | http://napp.u-psud.fr/Niveau2.php?specie=76&Name=Bacillus_c...
IntAct | Arthrobacter sp. | http://www.ebi.ac.uk/intact/interactions?conversationContext=1
ABCdb | Acidovorax sp. | https://www-abcdb.biotoul.fr/
BBD | Alcaligenes sp. | http://eawag-bbd.ethz.ch/servlets/search
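Several of the access URLs above follow a simple pattern: a fixed search address followed by the species name with spaces encoded as `+`. A minimal sketch of generating such links is given below. The four templates are copied from the complete (non-truncated) URLs on the slide and may no longer be valid for the current versions of these services; the names `SEARCH_TEMPLATES` and `species_links` are introduced here for illustration only.

```python
from urllib.parse import quote_plus

# URL patterns as they appear on the slide; the services may have changed since.
SEARCH_TEMPLATES = {
    "ENA": "http://www.ebi.ac.uk/ena/data/search?query={q}",
    "Nucleotide": "http://www.ncbi.nlm.nih.gov/nuccore/?term={q}",
    "Gene": "http://www.ncbi.nlm.nih.gov/gene/?term={q}",
    "InterPro": "http://www.ebi.ac.uk/interpro/search?q={q}",
}


def species_links(species: str) -> dict:
    """Build one search URL per database for the given species name."""
    q = quote_plus(species)  # "Acaulospora nivalis" -> "Acaulospora+nivalis"
    return {db: template.format(q=q) for db, template in SEARCH_TEMPLATES.items()}


links = species_links("Acaulospora nivalis")

if __name__ == "__main__":
    for db, url in links.items():
        print(db, url)
```

Links of this kind cover human access (Task 2a); access for computer programs (Task 2b) would need the programmatic interfaces of each database instead of its search pages.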