D560–D565 Nucleic Acids Research, 2017, Vol. 45, Database issue
Published online 28 November 2016
doi: 10.1093/nar/gkw1103

IMG-ABC: new features for bacterial secondary metabolism analysis and targeted biosynthetic gene cluster discovery in thousands of microbial genomes

Michalis Hadjithomas1,*, I-Min A. Chen2, Ken Chu2, Jinghua Huang2, Anna Ratner2, Krishna Palaniappan2, Evan Andersen2, Victor Markowitz2, Nikos C. Kyrpides1 and Natalia N. Ivanova1,*

1Microbial Genome and Metagenome Program, Department of Energy Joint Genome Institute, Walnut Creek, CA 94598, USA and 2Biosciences Computing, Computational Research Division, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, USA

Received October 07, 2016; Editorial Decision October 26, 2016; Accepted November 08, 2016

ABSTRACT

Secondary metabolites produced by microbes have diverse biological functions, which makes them a great potential source of biotechnologically relevant compounds with antimicrobial, anti-cancer and other activities. The proteins needed to synthesize these natural products are often encoded by clusters of co-located genes called biosynthetic gene clusters (BCs). In order to advance the exploration of microbial secondary metabolism, we developed the largest publicly available database of experimentally verified and predicted BCs, the Integrated Microbial Genomes Atlas of Biosynthetic gene Clusters (IMG-ABC) (https://img.jgi.doe.gov/abc/). Here, we describe an update of IMG-ABC, which includes ClusterScout, a tool for targeted identification of custom biosynthetic gene clusters across 40 000 isolate microbial genomes, and a new search capability to query more than 700 000 BCs from isolate genomes for clusters with similar Pfam composition. Additional features enable fast exploration and analysis of BCs through two new interactive visualization features, a BC function heatmap and a BC similarity network graph. These new tools and features add to the value of IMG-ABC’s vast body of BC data, facilitating their in-depth analysis and accelerating secondary metabolite discovery.

INTRODUCTION

Microbes produce a variety of compounds, known as secondary metabolites (SMs) or natural products, which play many important physiological roles. Some SMs confer the ability to survive adverse conditions, while others function in bacterial communication (1) or are used as weapons of inter- and intra-species competition (2). These diverse functions make SMs a great source of biotechnologically relevant compounds with antimicrobial, anti-cancer and other activities (3–5). Additionally, biologically synthesized compounds achieve a chemical structure complexity unmatched by synthetic chemistry. Whereas traditionally SMs have been isolated from microbial cultures, the recent advances in DNA sequencing technologies have opened a new avenue to the discovery and characterization of SMs (6). The cornerstone of this capability lies in the observation that SM synthesis is typically encoded by the genes that are clustered together on the chromosomes or plasmids (biosynthetic gene clusters or BCs) (7). This feature of SMs has been exploited by several computational tools, which were used to predict and annotate BCs in thousands of microbial genomes (8). The result of this breakthrough was not only the discovery of previously unknown biosynthetic potential in well-studied organisms (9), but also identification of proteins with novel enzymatic activities and novel classes of SMs (10).

In order to facilitate the discovery and analysis of BCs and SMs in bacterial genomes and metagenomes, a data mart called the Atlas of Biosynthetic gene Clusters within the Integrated Microbial Genomes system (11) (IMG-ABC) was introduced in 2015 (12). BCs were predicted and annotated across all microbial isolate genomes and a set of metagenomes using a combination of ClusterFinder (10) and antiSMASH (13). After the creation of the data mart, annotation with antiSMASH became part of the standard annotation pipeline for newly imported isolate genomes. The result of this integration was the growth of the IMG-ABC database to contain more than 730 000 BCs in 40 034 isolate microbial genomes and >310 000 BCs in 2416 metagenomes. Additionally, a set of experimentally verified BCs and their associated SMs were collected from the NCBI and imported into the database. IMG-ABC also provided the means to search these large datasets based on BC and SM attributes, such as BC type, Pfam (14) and EC content (15), BC length, SM activity etc. Here, we present an update of IMG-ABC, which includes several new features improving data mining and navigation, such as a set of tools for BC visualization and comparison, and targeted BC identification across thousands of isolate bacterial genomes with ClusterScout.

*To whom correspondence should be addressed. Email: [email protected]
Correspondence may also be addressed to Natalia N. Ivanova. Tel: +1 925 296 5832; Fax: +1 925 296 5840; Email: [email protected]

© The Author(s) 2016. Published by Oxford University Press on behalf of Nucleic Acids Research.
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by-nc/4.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, please contact [email protected]

Figure 1. Similarity Search. (A) BC similarity search is accessible through the ‘Similar Clusters’ tab (i) in the Biosynthetic Cluster Detail page, in this case the streptomycin BC from Streptomyces griseus. (B) The results are presented as a table which includes the ‘Jaccard Score’ and the ‘Adjusted Jaccard Score’. As expected, two more experimentally verified BCs known to produce streptomycin are retrieved, in addition to multiple predicted BCs. Users can further analyze these BCs by (i) adding them to the BC Cart, or (ii) visualizing the neighborhoods of selected BCs to compare them with the query BCs.

NEW FEATURES AND UPDATES

Open access to browsing

IMG-ABC can be accessed at https://img.jgi.doe.gov/abc/ and can be browsed without a login requirement. However, creating a free IMG account enables users to utilize IMG-ABC’s ‘BC Set’ feature, which allows saving groups of BCs on the server side for future retrieval and sharing.

Find similar BCs function

In the initial version of IMG-ABC, users could search the BC database using various attributes, such as BC type, SM produced, or the presence of specific Pfams (14). With this update, we are introducing the ability to search for BCs with Pfam composition similar to that of a query BC. This function is available through the ‘Similar Clusters’ tab on the Biosynthetic Cluster Detail page (Figure 1A). Briefly, in order to find BCs with similar Pfam composition, Pfams assigned to the genes of the query BC are collected and searched against the IMG-ABC database for BCs containing at least three of these Pfams. In order to minimize the running time of the search, the ubiquitous ABC Transporter family (PF00005) is not considered at this step. At least 500 BCs (if possible) are collected in order of descending number of shared Pfams. Next, a pairwise comparison of all collected BCs against the query BC is performed. In the resulting Pairwise Similarity Results table two scores are reported: the first is the ‘Jaccard Score’, which is the Jaccard Index (the ratio of the number of shared distinct Pfam domains to the total number of distinct Pfams in both BCs); the second is the ‘Adjusted Jaccard’, which is based on the modified Jaccard Index introduced by Cimermancic et al. (10) and scaled by normalizing to 0.64e−1 so that the score reported is between 0 and 1. The result of the similarity search is the 100 most similar BCs, sorted by descending Adjusted Jaccard Score (Figure 1B). In addition to the two similarity scores and other BC metadata, the table contains the total number of distinct Pfams shared between the query and each reported BC.
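As an illustration only, the candidate collection and the Jaccard Score described above might be sketched in Python as follows. The function names, the dictionary-based stand-in for the database and the simple truncation of the candidate pool are assumptions made for this example and do not reflect the IMG-ABC implementation. The Adjusted Jaccard is not reproduced here, so the final ranking uses the plain Jaccard Score as a stand-in.

```python
# Hypothetical sketch; none of these names belong to IMG-ABC.

EXCLUDED_PFAMS = {"PF00005"}  # ubiquitous ABC transporter family, ignored when collecting candidates


def jaccard_score(pfams_a, pfams_b):
    """Shared distinct Pfams divided by all distinct Pfams found in either BC."""
    a, b = set(pfams_a), set(pfams_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def find_similar(query_pfams, database, min_shared=3, pool_size=500, top_n=100):
    """Return (bc_id, shared_pfam_count, jaccard) rows for the most similar BCs.

    `database` maps a BC identifier to an iterable of Pfam IDs (assumed layout).
    """
    query = set(query_pfams)
    lookup = query - EXCLUDED_PFAMS

    # Step 1: collect BCs sharing at least `min_shared` Pfams with the query.
    candidates = []
    for bc_id, pfams in database.items():
        shared = len(lookup & set(pfams))
        if shared >= min_shared:
            candidates.append((bc_id, shared))

    # Keep the pool in order of descending number of shared Pfams
    # (simplification: the pool is cut at `pool_size`, ties broken by ID).
    candidates.sort(key=lambda item: (-item[1], item[0]))
    pool = candidates[:pool_size]

    # Step 2: pairwise comparison of every pooled BC against the query.
    rows = []
    for bc_id, _ in pool:
        pfams = set(database[bc_id])
        rows.append((bc_id, len(query & pfams), jaccard_score(query, pfams)))

    # Step 3: report the most similar BCs first.
    rows.sort(key=lambda row: (-row[2], row[0]))
    return rows[:top_n]
```

Excluding PF00005 only from the candidate lookup, and not from the score itself, is one possible reading of the text; the source does not say whether the family is also left out of the reported scores.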

ClusterScout: custom identification of BCs in large datasets

The latest version of the popular BC identification tool antiSMASH (version 3) annotates >40 types of BCs based on the presence of certain key protein domains (13). However, this list is not comprehensive, as there are many BC types not currently included in the annotation logic. It should be noted, however, that the stand-alone version of antiSMASH provides the ability to implement a customized search logic based not only on Pfam annotations but also on custom protein hidden Markov model (HMM) profiles. Similar functionality in IMG-ABC is provided by ClusterScout, a tool that uses gene Pfam content for the identification of BCs across all publicly available isolate bacterial genomes in IMG.

The algorithm used by ClusterScout is summarized in Figure 2A. Briefly, a user must select a set of Pfams (‘hooks’, minimum 3) and specify the maximum distance between them (maximum 20 kb) (Figure 2A). The user must also set the minimum number of hooks required (minimum 3), and may optionally define a subset of specific Pfam hooks that need to be present in the cluster (‘essential hooks’). Additionally, the user can extend the boundaries of the cluster by up to 20 kb. When the extended boundary falls within a coding sequence, the cluster boundary is extended to the start/stop of the overlapping gene(s). Lastly, users can define the minimum distance between the identified cluster and the scaffold ends (default 1 kb), which is useful in eliminating potentially incomplete BCs from the results.
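The search logic in the preceding paragraph can be approximated by the following Python sketch for a single scaffold. It is a simplified reading of the description, not the ClusterScout code: the `Gene` record, the `scout` function and its parameter names are hypothetical, the maximum distance is assumed to apply between neighbouring hook-bearing genes, and the minimum number of hooks is assumed to count distinct hook Pfams.

```python
from dataclasses import dataclass


@dataclass
class Gene:
    start: int         # start coordinate on the scaffold (1-based)
    end: int           # end coordinate (inclusive)
    pfams: frozenset   # Pfam IDs annotated on this gene


def scout(genes, scaffold_len, hooks, max_gap=20_000, min_hooks=3,
          essential=frozenset(), extend=0, min_end_dist=1_000):
    """Return (start, end, hooks_found) for candidate clusters on one scaffold."""
    hooks = set(hooks)
    hook_genes = sorted((g for g in genes if g.pfams & hooks), key=lambda g: g.start)

    # Chain hook-bearing genes while the gap to the previous one is <= max_gap.
    chains, current = [], []
    for g in hook_genes:
        if current and g.start - current[-1].end > max_gap:
            chains.append(current)
            current = []
        current.append(g)
    if current:
        chains.append(current)

    clusters = []
    for chain in chains:
        found = set().union(*(g.pfams & hooks for g in chain))
        # Require enough distinct hooks and every essential hook.
        if len(found) < min_hooks or not set(essential) <= found:
            continue
        # Extend both boundaries, staying inside the scaffold.
        start = max(1, chain[0].start - extend)
        end = min(scaffold_len, max(g.end for g in chain) + extend)
        # One pass: a boundary that cuts a gene is moved outward to that gene's start/stop.
        for g in genes:
            if g.start < start <= g.end:
                start = g.start
            if g.start <= end < g.end:
                end = g.end
        # Drop clusters too close to a scaffold end (potentially incomplete).
        if start - 1 < min_end_dist or scaffold_len - end < min_end_dist:
            continue
        clusters.append((start, end, frozenset(found)))
    return clusters
```

Run against every scaffold of every genome, a function of this kind would yield the pool of candidates from which the reported clusters are then selected.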

Figure 2. ClusterScout. (A) User Interface with algorithm logic schematic. (B) The results from a ClusterScout search are presented in a table, and custom BCs can be added to the BC Cart for analysis.

Depending on the complexity of the query and the ubiquity of the selected Pfams, ClusterScout may take a few minutes to several hours to complete its run. For this reason, the process is run in the background and, upon completion, the user is notified by an email with a link to the results (Figure 2B). The number of reported clusters is limited to the top 500 candidates, selected first based on the number of hooks they contain, and then by their sequence length. Custom clusters identified by ClusterScout can be added to a BC Cart for further analysis (Figure 3A). The results of a ClusterScout search are saved for up to 24 hours on the IMG-ABC server, so users are encouraged to download the results or, using a registered account, save them in a BC Set.

BC cart: enabling in-depth analysis of selected BCs

The new version of IMG-ABC includes the BC Cart, a feature familiar to IMG users (11), which serves to collect the BCs of interest for further in-depth analysis (Figure 3A). BCs can be added to the cart from several entry points in the User Interface, including the Biosynthetic Cluster Detail page, various summary tables that can be accessed by browsing BCs or SMs, or the results of searches. In the BC Cart, BCs can be exported or imported through the ‘Upload & Export & Save’ tab. The BC Cart also serves as a portal for accessing IMG’s rich analysis features through the ability to add data to other types of Carts, such as the Genome, Scaffold and Gene Carts. Additionally, the Pfam contents of selected BCs can be added to a Function Cart for detailed functional analysis.

Most importantly, the BC Cart contains the features specifically designed to facilitate visualization and analysis of BCs. A user can visualize and compare the architectures of selected BCs, and the content of flanking genomic areas, by using the ‘Neighborhoods’ tab. Additionally, Pfam content and modular architecture of BCs can be explored through the ‘Function Heatmap’ tab and BC similarities can be visualized through the ‘Similarity Network’ tab. The latter two features are described in detail below.

Hierarchically clustered heatmap visualization

Calculating the similarity between BCs is an effective way to identify BCs that may encode a similar SM. However, these similarity scores may be misleading due to imprecise prediction of BC boundaries, whereby the similarity scores are influenced by the flanking genes erroneously included in the BC. Additionally, small but important differences (e.g. presence or absence of a single protein domain conferring unique enzymatic activity) between highly similar BCs may not be reflected in the similarity scores. In order to enable more detailed BC comparisons, we are introducing the capability to dynamically build BC heatmaps based on Pfam annotations. This is achieved through an implementation of the InCHlib Javascript library (16). Cells in the heatmap represent counts of Pfams (columns) found in selected BCs (rows). The heatmap display appearance is customizable and allows users to inspect functional similarities and differences between multiple BCs by assessing their Pfam content. The user can also quickly access the Pfam and BC metadata by hovering the mouse pointer over the column headers and over the columns on the right of the map that display BC taxonomic information, respectively (Figure 3B). Additionally, the data are hierarchically clustered on both the Pfam and the BC axis using an InCHlib clustering tool with the Jaccard distance and Ward linkage options. As a result of 2D clustering, BCs with similar Pfam content are clustered together, and Pfams with similar distributions across the BCs of interest are also clustered together. Clustering the data on both axes facilitates the identification of conserved function modules. These may represent the core functions of the BC or auxiliary functions participating in modification of the core SM. The number of BCs and Pfams shown in the user interface is limited to 100 BCs and the 50 most frequent Pfams in the selected dataset, with an option to download the data for up to 500 BCs and all Pfams in a format compatible with the Gene-e tool (http://www.broadinstitute.org/cancer/software/GENE-E/).
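To make the heatmap input concrete, the sketch below builds a BC-by-Pfam count matrix of the kind described above and computes the Jaccard distance between two rows. It is a hypothetical illustration: the function names are invented, ‘most frequent’ is assumed to mean the Pfams present in the largest number of selected BCs, and the hierarchical clustering step itself (Ward linkage, performed by the InCHlib tooling) is not reproduced.

```python
from collections import Counter


def pfam_count_matrix(bcs, max_bcs=100, max_pfams=50):
    """Build heatmap rows (BCs) and columns (Pfams) from Pfam hits.

    `bcs` maps a BC identifier to the list of Pfam hits on its genes;
    a Pfam occurring on several genes appears several times in the list.
    """
    rows = list(bcs)[:max_bcs]
    counts = {bc: Counter(bcs[bc]) for bc in rows}

    # Rank Pfams by the number of selected BCs containing them
    # (assumed definition of "most frequent"), ties broken by ID.
    prevalence = Counter(p for bc in rows for p in counts[bc])
    ranked = sorted(prevalence.items(), key=lambda kv: (-kv[1], kv[0]))
    columns = [p for p, _ in ranked][:max_pfams]

    # Each cell holds the copy number of one Pfam in one BC.
    matrix = [[counts[bc][p] for p in columns] for bc in rows]
    return rows, columns, matrix


def jaccard_distance(row_a, row_b):
    """1 minus the Jaccard index, computed on presence/absence of each column."""
    a = {i for i, v in enumerate(row_a) if v}
    b = {i for i, v in enumerate(row_b) if v}
    union = a | b
    return 1.0 - len(a & b) / len(union) if union else 0.0
```

The resulting matrix and pairwise distances are the kind of input a clustering routine would need to order rows and columns before display.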

Dynamic BC similarity network visualization

Figure 3. IMG-ABC case study. (A) The BC Cart is the virtual space where BC analysis can be performed. (B) The function heatmap visualization can be used to study the Pfam content of BCs. Cells are colored with hues of green based on the number of copies of the selected Pfam in the BC (darker signifies a higher copy number). Hovering the mouse pointer over a cell provides the number of copies of that Pfam in the BC. Pfams (columns) that occur in all BCs (rows) likely define the core functions of the BCs in view. Column and row metadata are found on top and to the right, respectively, of the heatmap. Hovering the mouse pointer over these metadata cells provides more detailed information. These metadata can be used for quick visual inspection and identification of patterns. For example, betaproteobacteria containing the DAPG BC are easily discoverable (asterisk). (C) The similarity network graph provides another way to summarize the data. The BCs in this example fall into three distinct groups; two groups contain gammaproteobacteria (red nodes) while one group consists of betaproteobacteria (purple nodes). The green node represents the experimentally verified BC for DAPG (from a gammaproteobacterium, Pseudomonas fluorescens) in the IMG-ABC database. Clicking on a node reveals the metadata associated with the BC (table on the right). The color of the nodes can be changed to display different metadata, such as taxonomic classification or evidence. (D) Visualization of the putative DAPG BC neighborhoods from the four betaproteobacterial BCs and one BC from each gammaproteobacterial group shows that although the flanking regions of the BCs differ, the core genes are conserved, thus it is likely that these newly discovered BCs indeed encode the necessary proteins for DAPG production.

The construction and visualization of networks based on the similarity between BCs has been a successful approach in the identification of novel BC families (10). Additionally, superimposing experimentally verified BCs on these networks can provide clues for the potential SM encoded by putative BCs. With this update of IMG-ABC, we are introducing the ability to visualize BC similarity networks (Figure 3C) by performing all-versus-all pairwise comparisons of selected BCs to create a similarity matrix. This matrix is filtered to include only the pairs with an Adjusted Jaccard Score equal to or greater than 0.5, and BCs without a single score at or above this threshold are not represented in the network graph. The visualization of this network is based on an implementation of linkurious (http://linkurio.us/), which enables users to quickly navigate and inspect the BC similarity network, and to visualize a set of relevant metadata through a selection menu. The similarity data and metadata can also be downloaded for visualization with Cytoscape (17).
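The filtering step described above can be sketched as follows. This is a minimal illustration under stated assumptions: the function names are hypothetical, and because the Adjusted Jaccard Score is not specified in enough detail here, the plain Jaccard index is used as a placeholder scoring function that a caller could replace.

```python
from itertools import combinations


def jaccard(pfams_a, pfams_b):
    """Plain Jaccard index of two Pfam sets (placeholder for the Adjusted Jaccard Score)."""
    union = pfams_a | pfams_b
    return len(pfams_a & pfams_b) / len(union) if union else 0.0


def similarity_network(bcs, score=jaccard, threshold=0.5):
    """All-versus-all comparison of selected BCs, filtered by a score threshold.

    `bcs` maps a BC identifier to its set of Pfam IDs. Returns (nodes, edges),
    where edges are (bc_a, bc_b, score) and nodes are the BCs with at least one edge.
    """
    edges = []
    for id_a, id_b in combinations(sorted(bcs), 2):
        s = score(bcs[id_a], bcs[id_b])
        if s >= threshold:
            edges.append((id_a, id_b, s))
    # BCs without any score at or above the threshold do not appear in the graph.
    nodes = sorted({n for a, b, _ in edges for n in (a, b)})
    return nodes, edges
```

An edge list of this shape is also a convenient form for import into a network viewer.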

IMG-ABC case study

To showcase the power of the new IMG-ABC features applied to a large database of annotated genomes we present a BC discovery and characterization case study using an example of 2,4-diacetylphloroglucinol (DAPG) biosynthesis, which is annotated by the current version of antiSMASH (13) as a type 3 polyketide synthase cluster without SM product prediction. DAPG is a secondary metabolite with important biocontrol properties (18). The aim of our case study is to identify all DAPG clusters in IMG genomes, analyze their structure and elucidate the taxonomic and habitat distribution of the DAPG biosynthesis pathway. The steps used in this case study are as follows:

1. We start by identifying the Pfams of enzymes participating in DAPG biosynthesis, which will then be used as the input in a ClusterScout query. These Pfams can be found by searching for experimentally verified DAPG biosynthesis clusters using the ‘Search by BC Attributes’ menu and inspecting the Pfam content of each gene in the cluster through the ‘Genes in Cluster’ tab of the Biosynthetic Cluster Detail page and IMG’s gene detail pages. In cases of genes with multiple Pfams, the Pfam with the longest model can be selected to improve the accuracy of the ClusterScout search. For the DAPG BC these are pfam08545, pfam00108 and pfam00195 for PhlA (hydroxymethylglutaryl-CoA synthase), PhlC (acetyl-CoA acetyltransferase) and PhlD (phloroglucinol synthase), respectively. Next we use the ClusterScout tool, limiting the search to genomic regions in which these Pfams are found within 5000 bp from each other, and extend the boundaries of the discovered locus by 5000 bp. We also set the minimum distance from the scaffold end to 1000 bp. This query returns 62 putative BCs (Supplementary Table SI), which are added to the BC Cart for analysis (Figure 3A).

2. Next, we use the function heatmap feature to identify co-occurring Pfams. In the Function Heatmap tab, by selecting the ‘Plot’ option a set of Pfams present in all the putative clusters (the ‘core’) can be identified (Figure 3B). By hovering the mouse pointer over the column headers of the ‘core’ Pfams we observe that in addition to the original 3 ‘hook’ Pfams, pfam16859, pfam00440 and pfam07690 appear to be part of the ‘core’ DAPG pathway. This is in agreement with experimental data for the DAPG biosynthesis pathway, as the first two of these Pfams are found in the transcription regulator (PhlF), which controls expression of this pathway, while the last one is found in the putative DAPG exporter protein (PhlE) (18).

3. Taxonomic affiliation of putative DAPG BCs, which can be explored by hovering the mouse pointer over row metadata (far right columns), identifies a distinct group of clusters found in betaproteobacteria (marked with an asterisk in Figure 3B). To the best of our knowledge, this is the first observation of the DAPG pathway in betaproteobacteria.

4. Alternatively, the diversity of putative DAPG clusters can be investigated by visualizing a BC similarity network by clicking on the ‘Similarity Network’ tab (Figure 3C). In our example, BCs fall into three major groups, and by coloring the nodes based on their taxonomy (Class of the host organism) we observe that two of these groups are found in gammaproteobacteria, in which the DAPG biosynthesis pathway has been experimentally elucidated, while the smallest group originates from betaproteobacteria.

5. Lastly, the chromosomal organization of putative DAPG clusters can be investigated using the ‘Neighborhood’ view (Figure 3D). In order to simplify visualization, we can select only one representative from each of the two large gammaproteobacterial groups, and all four of the newly discovered betaproteobacterial BCs. As expected, the organization of this cluster is very similar among the betaproteobacteria and distinct from both gammaproteobacterial clusters. However, the organization of the ‘core’ genes is similar between all clusters, suggesting that these betaproteobacterial clusters likely encode the DAPG biosynthesis pathway.
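The notion of ‘core’ Pfams used in step 2 (Pfams present in every putative cluster) amounts to a set intersection, as the following hypothetical sketch shows. The hook identifiers are taken from step 1; the two toy clusters and the placeholder ID `pfamXXXXX` are invented for illustration and are not data from Supplementary Table SI.

```python
def core_pfams(clusters):
    """Return the Pfams present in every cluster.

    `clusters` maps a cluster identifier to its set of Pfam IDs (assumed layout).
    """
    pfam_sets = [set(pfams) for pfams in clusters.values()]
    return set.intersection(*pfam_sets) if pfam_sets else set()


# Hook Pfams from step 1 (PhlA, PhlC and PhlD).
HOOKS = {"pfam08545", "pfam00108", "pfam00195"}

# Toy input: two made-up clusters; "pfamXXXXX" stands for an unrelated flanking domain.
toy_clusters = {
    "cluster_1": HOOKS | {"pfam16859", "pfam00440", "pfam07690", "pfamXXXXX"},
    "cluster_2": HOOKS | {"pfam16859", "pfam00440", "pfam07690"},
}

# Core Pfams beyond the hooks that were used to find the clusters.
extra_core = core_pfams(toy_clusters) - HOOKS
```

In the heatmap, the same information appears as columns that are filled in every row.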

To summarize the results of our case study, we have used ClusterScout to devise a custom query and search for putative DAPG BCs across >40 000 isolate genomes in the IMG database. This search and subsequent analysis using IMG-ABC tools took less than an hour, and has led to the identification of putative DAPG biosynthesis clusters in four betaproteobacteria: a bacterium identified as Pseudogulbenkiania ferrooxidans EGD-HP2 (19) and three members of the Chromobacterium genus, Chromobacterium vaccinii MWU205 (20), C. vaccinii MWU328 (20) and Chromobacterium piscinae ND17 (21). The IMG tools for analysis of biogeography and habitat specificity of these isolates indicate that all of them were collected from freshwater environments across the globe (Northeastern India, Northeastern United States of America and Malaysia). This finding is remarkable considering that so far DAPG production in bacteria has been described mostly in plant-associated bacteria and in a bacterial symbiont of the red-backed salamander (22). Interestingly, the C. vaccinii strains in which DAPG clusters were found were studied because of their biocontrol activity against mosquito larvae (20). Since DAPG is known to elicit a lethal hyperactive immune response in Drosophila melanogaster larvae (23) and has known toxicity against other organisms (24), this finding raises the possibility that DAPG may be responsible for the larvicidal activity of the C. vaccinii strains. This example demonstrates the power of IMG-ABC data and tools in supporting the analysis of genome sequences, enabling biological interpretation of sequence data and formulation of testable hypotheses.

FUTURE DIRECTIONS

The explosion in the number of sequenced isolate microbial genomes and metagenomes provides the opportunity to use bioinformatic methods for the discovery and analysis of biosynthetic gene clusters. IMG-ABC contains the largest publicly available collection of experimentally identified and predicted BCs. The value of this vast body of data increases with the addition of new features and tools expanding the ability of researchers to explore and interpret these data. Users can now design their own criteria to search for putative BCs using ClusterScout and analyze their findings using similarity networks and hierarchically clustered function heatmaps. Although these tools are currently available for isolate genomes only, in future updates we plan to expand their usage to metagenomes, as improved metagenomic assembly methods and sequencing strategies increase the length of metagenomic contigs and scaffolds. Additionally, we plan to introduce the ability to use other types of protein clusters and functional annotations, such as EC numbers (15) and COG assignments (25), as inputs for ClusterScout-powered BC discovery. Lastly, the content of the IMG-ABC database will be updated to include BC predictions from the latest version of antiSMASH, which not only includes more BC types in the identification algorithm, but also implements ClusterFinder to improve BC boundary prediction (13). IMG-ABC will continue to be provided to the users without the need for registration, making the exploration of global microbial secondary metabolism accessible to the entire scientific community.

SUPPLEMENTARY DATA

Supplementary Data are available at NAR Online.

FUNDING

The Director, Office of Science, Office of Biological and Environmental Research, Life Sciences Division, U.S. Department of Energy [DE-AC02-05CH11231]; National Energy Research Scientific Computing Center, which is supported by the Office of Science of the U.S. Department of Energy [DE-AC02-05CH11231]. Funding for open access charge: University of California.

Conflict of interest statement. None declared.

REFERENCES

1. Piel,J. (2009) Metabolites from symbiotic bacteria. Nat. Prod. Rep., 26, 338–362.

2. Thomashow,L.S. and Weller,D.M. (1988) Role of a phenazine antibiotic from Pseudomonas fluorescens in biological control of Gaeumannomyces graminis var. tritici. J. Bacteriol., 170, 3499–3508.

3. Zhang,F., Rodriguez,S. and Keasling,J.D. (2011) Metabolic engineering of microbial pathways for advanced biofuels production. Curr. Opin. Biotechnol., 22, 775–783.

4. Newman,D.J. and Cragg,G.M. (2012) Natural products as sources of new drugs over the 30 years from 1981 to 2010. J. Nat. Prod., 75, 311–335.

5. Ling,L.L., Schneider,T., Peoples,A.J., Spoering,A.L., Engels,I., Conlon,B.P., Mueller,A., Schaberle,T.F., Hughes,D.E., Epstein,S. et al. (2015) A new antibiotic kills pathogens without detectable resistance. Nature, 517, 455–459.

6. Caboche,S. (2014) Biosynthesis: bioinformatics bolster a renaissance. Nat. Chem. Biol., 10, 798–800.

7. Jensen,P.R. (2016) Natural products and the gene cluster revolution. Trends Microbiol., doi:10.1016/j.tim.2016.07.006.

8. Medema,M.H. and Fischbach,M.A. (2015) Computational approaches to natural product discovery. Nat. Chem. Biol., 11, 639–648.

9. Rutledge,P.J. and Challis,G.L. (2015) Discovery of microbial natural products by activation of silent biosynthetic gene clusters. Nat. Rev. Microbiol., 13, 509–523.

10. Cimermancic,P., Medema,M.H., Claesen,J., Kurita,K., Wieland Brown,L.C., Mavrommatis,K., Pati,A., Godfrey,P.A., Koehrsen,M., Clardy,J. et al. (2014) Insights into secondary metabolism from a global analysis of prokaryotic biosynthetic gene clusters. Cell, 158, 412–421.

11. Markowitz,V.M., Chen,I.-M.A., Chu,K., Szeto,E., Palaniappan,K., Pillay,M., Ratner,A., Huang,J., Pagani,I., Tringe,S. et al. (2014) IMG/M 4 version of the integrated metagenome comparative analysis system. Nucleic Acids Res., 42, D568–D573.

12. Hadjithomas,M., Chen,I.-M.A., Chu,K., Ratner,A., Palaniappan,K., Szeto,E., Huang,J., Reddy,T.B.K., Cimermancic,P., Fischbach,M.A. et al. (2015) IMG-ABC: a knowledge base to fuel discovery of biosynthetic gene clusters and novel secondary metabolites. mBio, 6, doi:10.1128/mBio.00932-15.

13. Weber,T., Blin,K., Duddela,S., Krug,D., Kim,H.U., Bruccoleri,R., Lee,S.Y., Fischbach,M.A., Muller,R., Wohlleben,W. et al. (2015) antiSMASH 3.0––a comprehensive resource for the genome mining of biosynthetic gene clusters. Nucleic Acids Res., 43, W237–W243.

14. Finn,R.D., Coggill,P., Eberhardt,R.Y., Eddy,S.R., Mistry,J., Mitchell,A.L., Potter,S.C., Punta,M., Qureshi,M., Sangrador-Vegas,A. et al. (2016) The Pfam protein families database: towards a more sustainable future. Nucleic Acids Res., 44, D279–D285.

15. Webb,E.C. (1992) Enzyme nomenclature 1992. Recommendations of the Nomenclature Committee of the International Union of Biochemistry and Molecular Biology on the Nomenclature and Classification of Enzymes.

16. Skuta,C., Bartunek,P. and Svozil,D. (2014) InCHlib – interactive cluster heatmap for web applications. J. Cheminformatics, 6, 44.

17. Smoot,M.E., Ono,K., Ruscheinski,J., Wang,P.-L. and Ideker,T. (2011) Cytoscape 2.8: new features for data integration and network visualization. Bioinformatics, 27, 431–432.

18. Bangera,M.G. and Thomashow,L.S. (1999) Identification and characterization of a gene cluster for synthesis of the polyketide antibiotic 2,4-diacetylphloroglucinol from Pseudomonas fluorescens Q2-87. J. Bacteriol., 181, 3155–3163.

19. Puranik,S., Talkal,R., Qureshi,A., Khardenavis,A., Kapley,A. and Purohit,H.J. (2013) Genome sequence of the pigment-producing bacterium Pseudogulbenkiania ferrooxidans, isolated from Loktak lake. Genome Announc., 1, doi:10.1128/genomeA.01115-13.

20. Voing,K., Harrison,A. and Soby,S.D. (2015) Draft genome sequence of Chromobacterium vaccinii, a potential biocontrol agent against mosquito (Aedes aegypti) larvae. Genome Announc., 3, doi:10.1128/genomeA.00477-15.

21. Chan,K.-G. and Yunos,N.Y.M. (2016) Whole-genome sequencing analysis of Chromobacterium piscinae strain ND17, a quorum-sensing bacterium. Genome Announc., 4, doi:10.1128/genomeA.00081-16.

22. Brucker,R.M., Baylor,C.M., Walters,R.L., Lauer,A., Harris,R.N. and Minbiole,K.P.C. (2007) The identification of 2,4-diacetylphloroglucinol as an antifungal metabolite produced by cutaneous bacteria of the salamander Plethodon cinereus. J. Chem. Ecol., 34, 39–43.

23. Parker,R. (2013) The effects of the bacteria, Pf-5, and its metabolite, DAPG, on the innate immune response of Drosophila melanogaster. Acad. Excell. Showc. Sched., http://digitalcommons.wou.edu/aes event/2013/biol/10.

24. Meyer,S.L.F., Halbrendt,J.M., Carta,L.K., Skantar,A.M., Liu,T., Abdelnabby,H.M.E. and Vinyard,B.T. (2009) Toxicity of 2,4-diacetylphloroglucinol (DAPG) to plant-parasitic and bacterial-feeding nematodes. J. Nematol., 41, 274.

25. Galperin,M.Y., Makarova,K.S., Wolf,Y.I. and Koonin,E.V. (2015) Expanded microbial genome coverage and improved protein family annotation in the COG database. Nucleic Acids Res., 43, D261–D269.