Sorting bioactive wheat from database chaff: Challenges of discerning correct drug structures


Upload: guide-to-pharmacology

Post on 15-Jul-2015



Christopher Southan, Helen E. Benson, Elena Faccenda, Joanna L. Sharman, Adam J. Pawson and Jamie A. Davies, IUPHAR/BPS Guide to PHARMACOLOGY, Centre for Integrative Physiology, School of Biomedical Sciences, University of Edinburgh, Hugh Robson Building, Edinburgh, EH8 9XD, UK. Available at http://www.slideshare.net/cdsouthan/sorting-bioactive-wheat-fr


Interpreting the Venn

The divergence between the drug collections can be expressed as the intersect being 29% of the union, with 52% of the union being source-unique. The intersect is only 8 structures more than in the similar 2009 comparison, which used different sources and methods (PMID 20298516). Importantly, each of the sets is compiled from a database with an established reputation and regular update papers in the NAR database issue. We thus emphasise that the analysis is not a critique of these teams. However, the results highlight the challenge of selecting drugs from the multiplexed options, and show that sources diverge in the rules they use for this selection. They also imply that the concept of a “correct” drug structure is illusory, since the consensus is only ~40% of what could be expected (and there is no agreement on total counts anyway).

Analysis of the consensus set

We can formally analyse multiplexing via the detailed chemical relationships that PubChem pre-computes for its 68.3 million compound entries (as of April 7th 2015). Using the 815-CID intersect (from Fig. 1), relationship counting operations were performed (see the PubChem Help documentation for details) as presented in the table below.

We can summarise the results (presented as averages) as follows:

• Each of the 815 CIDs was merged from 92 submissions (i.e. the SID:CID ratio). This is a direct measure of “popularity” amongst the PubChem sources, since the ratio is only 2.8 for PubChem as a whole.

• “Same connectivity” establishes that each drug is structurally related to 23 other distinct CIDs (as a measure of multiplexing).

• “Same conn isotopes” establishes that 15 of the 23 are isotopic derivatives (surprisingly, ~70% arise as virtual deuteration from patents).

• The related “no isotopes” query establishes that 7.5 of the 23 are alternative stereoisomer representations.

• Each drug is included in 68 distinct mixture CID entries.

As a specific multiplexed example we can examine atorvastatin (GtoPdb ligand 2949). In PubChem this has 102 SIDs and 51 related CIDs. Of these, 44 are isotopic (38 deuterated) and 7 are alternative stereoisomer forms. In addition, there are 295 mixtures. Tracking multiplexing (as singletons or mixtures) by year in PubChem indicates that patent extractions are the main reason for the recent increase.

Discussion

Representational multiplexing of bioactive chemistry in documents, web pages and databases has broadly confounding effects in cheminformatics, including on virtual screening and “big data” mining. Metrics defining some of the problems have been presented above. Consequently, our curatorial rules (see the GtoPdb FAQ) have been revised. We now check same connectivity, SID counts and BioAssay records to support our choice of CID as the ligand structure, and we collaborate with PubChem for QC. We also alert users to significant structural equivocality and split activity data. Our March 2015 release thus has 1105 approved drug CIDs concordant with either ChEMBL, DrugBank or TTD. The persistent discordance in approved drug database records is of concern, but efforts to produce definitive sets will require more inter-source collaboration. In addition, regulatory bodies and pharmaceutical companies need to engage directly with the provenance of public database structures.
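The "same connectivity" relationship used in the consensus-set analysis above can also be queried programmatically. The poster does not say which PubChem interface was used, so the following is only a minimal sketch, assuming the PUG REST `fastidentity` operation with `identity_type=same_connectivity` returns a comparable set of related CIDs. The function names and the example payload are hypothetical, and live counts will differ from the April 2015 figures.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

PUG_REST = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def same_connectivity_url(cid):
    """Build a PUG REST fastidentity URL listing CIDs with the same connectivity."""
    query = urlencode({"identity_type": "same_connectivity"})
    return f"{PUG_REST}/compound/fastidentity/cid/{cid}/cids/JSON?{query}"


def related_cids(payload, query_cid):
    """Extract the related CIDs from an IdentifierList response, excluding the query CID."""
    cids = payload.get("IdentifierList", {}).get("CID", [])
    return sorted(c for c in set(cids) if c != query_cid)


def fetch_related_cids(cid, timeout=30.0):
    """Fetch same-connectivity CIDs from PubChem (needs network access)."""
    with urlopen(same_connectivity_url(cid), timeout=timeout) as response:
        return related_cids(json.load(response), cid)


# Offline illustration with a hand-made payload (the CIDs are hypothetical).
example_payload = {"IdentifierList": {"CID": [100, 101, 102, 100]}}
example_related = related_cids(example_payload, query_cid=100)
print(len(example_related), "other CIDs share the connectivity of CID 100")
```

Averaging `len(fetch_related_cids(cid))` over a list of consensus CIDs would give a figure analogous to the "related to 23 other distinct CIDs" average. Splitting that figure into isotopic and stereoisomer variants would need further structure comparison that this sketch does not attempt.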

The structural multiplexing issue

Since 2009 the Guide to PHARMACOLOGY database (GtoPdb) team have curated 7586 ligands from papers, including approved drugs, clinical candidates, research compounds, peptides and clinical antibodies (PMID 24234439). As PubChem pushes towards 70 million compound identifiers (CIDs), we have noticed the problem of “multiplexing” during the curation of 5713 small molecules as CIDs: we encountered many representations (i.e. different CIDs) of the same pharmacological entities. Three types of variation dominate: stereochemistry, mixtures and isotopic analogues. These are known constitutive issues for chemical databases, but in recent years we observed that this multiplexing was reaching problematic proportions (i.e. more chaff), especially for clinically used drugs (i.e. proportionally less wheat).

How many approved drugs are there?

Given that they represent the crown jewels of over five decades of drug development, it is surprising that counts of approved small molecules span a range from 1216 in the FDA Maximum Daily Dose Database (PubChem Assay ID 1195) up to 2750 for the NCGC Pharmaceutical Collection (PMID 21525397). This was reflected in a 2009 comparison of three curated drug collections that recorded only 807 structures in common (PMID 20298516). The challenge faced by GtoPdb in 2015 is the choice of which drug structures to activity-map against which targets. For this reason we revisited the comparison outlined in PMID 20298516, but within PubChem, using its advanced chemical relationship mapping functionality.

Selection of approved drug sources

We chose three sources that a) submit to PubChem, b) capture approved drugs, c) have been updated within the last two years and d) had previously been compared in toto (PMID 24533037). For DrugBank (DrugB), approved drugs were selected as 1504 CIDs. For ChEMBL19, approved SMILES were selected from downloaded records and ID-mapped to 1499 CIDs. For the Therapeutic Target Database (TTD), the approved drug SDF file was downloaded, converted to InChI strings and mapped to 1877 CIDs. The three sets were then compared inside PubChem.

Results of the 3-way comparison

Fig 1. The Venn diagram above shows that the intersect between the three is 815 (i.e. CIDs in common). The union (distinct CIDs across all three) is 2750. Note also that 1435 CIDs are each unique to a single database (n.b. a TTD mapping enhancement increased the overlap from the figure of 749 mentioned in the abstract).
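The three-way comparison above reduces to simple set operations on CIDs. As a minimal sketch (the `venn_summary` helper and the toy CID values are hypothetical, not the authors' workflow), the following computes intersect, union and source-unique counts for any number of sets. It then checks that the counts reported for Fig. 1 are consistent with the three set sizes, assuming those sizes refer to the same mapping as the figure.

```python
from collections import Counter


def venn_summary(*cid_sets):
    """Count CIDs found in all sets, in any set, and in exactly one set."""
    membership = Counter(cid for s in cid_sets for cid in set(s))
    n_sets = len(cid_sets)
    return {
        "intersect": sum(1 for n in membership.values() if n == n_sets),
        "union": len(membership),
        "source_unique": sum(1 for n in membership.values() if n == 1),
    }


# Toy CID sets (hypothetical values, not the real collections).
toy = venn_summary({1, 2, 3, 4, 5}, {1, 2, 3, 6}, {1, 2, 7, 8, 9})

# Counts reported in the text for the real comparison.
sizes = {"DrugBank": 1504, "ChEMBL19": 1499, "TTD": 1877}
intersect, union, source_unique = 815, 2750, 1435

# CIDs shared by exactly two sources are whatever remains of the union.
in_exactly_two = union - intersect - source_unique  # 500

# Each CID contributes once per source containing it, so the weighted
# total must equal the sum of the three set sizes (4880).
assert source_unique + 2 * in_exactly_two + 3 * intersect == sum(sizes.values())

pct_intersect = 100 * intersect / union   # about 29.6
pct_unique = 100 * source_unique / union  # about 52.2
print(f"intersect {pct_intersect:.1f}% of union, source-unique {pct_unique:.1f}%")
```

The consistency check passes, which implies that 500 CIDs are shared by exactly two of the three sources, a figure the text does not state directly.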
