Validation and Standardization of Molecular Structures in General and Sugars in Particular: a Case Study Colin Batchelor, Ken Karapetyan, Valery Tkachenko, Antony Williams 6th Joint Sheffield Conference on Chemoinformatics 2013-07-24

Page 1: Validation and Standardization of Molecular Structures in General and Sugars in Particular: a Case Study Colin Batchelor, Ken Karapetyan, Valery Tkachenko,

Validation and Standardization of Molecular Structures in General and Sugars in Particular: a Case Study

Colin Batchelor, Ken Karapetyan, Valery Tkachenko, Antony Williams

6th Joint Sheffield Conference on Chemoinformatics

2013-07-24

Page 2:

Overview

Open PHACTS and chemical validation and standardization

RDF for chemoinformatics calculations

General case study: ChEMBL and DrugBank

Sugar case study: Perspective perception

Page 4:

Who is involved?

28 Consortium Members

>45 Associated Partners

3-year European project funded by:

• European Pharmaceutical Industry

• Innovative Medicines Initiative

Open PHACTS API

Applications using the Open PHACTS API

dev.openphacts.org

Explorer

www.openphacts.org Twitter: @open_phacts

Page 5:

How do we fit in?

We integrate and standardize the chemical compound collection underpinning Open PHACTS and provide regular updates and ongoing data curation.

The validation and standardization rules have been derived from the FDA structure guidelines and have been adapted for consistency and in response to input from members of EFPIA.

Page 6:

Open PHACTS provides an integrated platform of publicly available pharmacological and physicochemical data.

Data accessible via:

• Free application programming interface (API) dev.openphacts.org

• Third-party applications built to use the API Open PHACTS app ecosystem

Page 7:

How does Open PHACTS work?

Page 8:

Currently integrated databases

Database                 Millions of triples
ACD Labs / ChemSpider    161.3
ChEBI                      0.9
ChEMBL                   146.1
ConceptWiki                3.7
DrugBank                   0.5
Enzyme                     0.1
Gene Ontology              0.9
SwissProt                156.6
WikiPathways               0.1
TOTAL                    470.2

Page 9:

CVSP and the OPS CRS

Standardization workflows (CVSP, FDA, OPS, custom) using modules such as:

• SMIRKS transformations

• layout (GGA)

• canonical tautomers (ChemAxon)

• sugar interpretation (RSC)

Page 11:

RDF and Open PHACTS

The underlying language of Open PHACTS is RDF.

There are few constraints as such, only guidelines for which classes of identifier to use and accounts of best practice.

This RDF goes into the data cache and we access the results through user interfaces built on RESTful JSON web services.

Page 12:

What does RDF look like?

In the Turtle format below, each line is a triple, in which a binary predicate links a subject and an object.

:CSID1execution obo:OBO_0000299 :CSID1prop11 .

:CSID1prop11 obo:IAO_0000136 ops:OPS1 .

:CSID1prop11 rdf:type cheminf:CHEMINF_000349 .

:CSID1prop11 qudt:numericValue "1.049E-17"^^xsd:double .

:CSID1prop11 qudt:unit obo:UO_0000324 .

There is also RDF/XML, which is less human-readable.
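As a rough illustration of the subject, predicate, object structure, the five Turtle lines on this slide can be split into triples with a few lines of Python. This is only a sketch: the function name `parse_simple_turtle` is made up here, prefixes are left unresolved, and it handles only the one-triple-per-line subset shown on the slide, not Turtle in general (a real application would use an RDF library).

```python
import re

# The five lines from the slide, copied verbatim.
TURTLE_LINES = """\
:CSID1execution obo:OBO_0000299 :CSID1prop11 .
:CSID1prop11 obo:IAO_0000136 ops:OPS1 .
:CSID1prop11 rdf:type cheminf:CHEMINF_000349 .
:CSID1prop11 qudt:numericValue "1.049E-17"^^xsd:double .
:CSID1prop11 qudt:unit obo:UO_0000324 .
"""

# Assumption: one triple per line, whitespace-separated, ending in " .".
TRIPLE = re.compile(r'^(\S+)\s+(\S+)\s+(.+?)\s*\.$')


def parse_simple_turtle(text):
    """Return (subject, predicate, object) tuples for one-triple-per-line Turtle."""
    triples = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        match = TRIPLE.match(line)
        if match is None:
            raise ValueError(f"not a simple triple: {line!r}")
        triples.append(match.groups())
    return triples


triples = parse_simple_turtle(TURTLE_LINES)

# Group predicate/object pairs by subject to see what is said about each node.
by_subject = {}
for subject, predicate, obj in triples:
    by_subject.setdefault(subject, []).append((predicate, obj))
```

Grouping by subject shows that four of the five statements describe the property node `:CSID1prop11`, while the fifth links the calculation to it.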

Page 13:

Royal Society of Chemistry data in Open PHACTS

1. Molecule synonyms and identifiers

2. Linksets between ChEBI, ChEMBL, DrugBank and OPS identifiers

3. Molecule–molecule relations (“parent–child”) of interest for drug discovery

4. Calculated physicochemical properties for compounds (both molecular and macroscopic)

Page 15:

Calculated physicochemical properties (ACD 12.0)

• log P

• log D (at pH 5.5, at pH 7.4)

• bioconcentration factor

• KOC (at pH 5.5, at pH 7.4)

• index of refraction

• polar surface area

• molar refractivity

• molar volume

• polarizability

• surface tension

• density at STP

• boiling point at 1 atm

• flash point at 1 atm

• enthalpy of vaporization at STP

• vapour pressure at STP

Page 16:

RDF for calculated properties: vocabularies

Two dozen calculated properties for each of >10^6 molecules.

CHEMINF ontology for kinds of calculation and chemical data

QUDT for results

OPS IDs for molecules

OBI and IAO to connect calculations to results
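To make the division of labour between these vocabularies concrete, here is a minimal sketch that emits the Turtle for one calculated value, following the pattern of the Turtle example earlier in the deck. The function name `property_triples` and its parameters are hypothetical; the predicate identifiers are copied as printed in that example rather than checked against the ontologies.

```python
def property_triples(execution, result, molecule, kind, value, unit):
    """Return Turtle lines linking one calculation to its result.

    All arguments are prefixed names or literal text supplied by the caller;
    this sketch does no escaping or validation.
    """
    return [
        # Calculation-to-result link, predicate exactly as printed on the slide.
        f"{execution} obo:OBO_0000299 {result} .",
        # IAO "is about": the result is about the OPS molecule.
        f"{result} obo:IAO_0000136 {molecule} .",
        # CHEMINF class saying what kind of calculated property this is.
        f"{result} rdf:type {kind} .",
        # QUDT carries the number and its unit.
        f'{result} qudt:numericValue "{value}"^^xsd:double .',
        f"{result} qudt:unit {unit} .",
    ]


lines = property_triples(
    execution=":CSID1execution",
    result=":CSID1prop11",
    molecule="ops:OPS1",
    kind="cheminf:CHEMINF_000349",
    value="1.049E-17",
    unit="obo:UO_0000324",
)
print("\n".join(lines))
```

The value is passed as a string so that the literal is written exactly as supplied, with no floating-point reformatting.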

Page 17:

RDF for calculated properties: schema

[Diagram: schema for one calculated property, with benzene as the example]

Nodes:

• benzene’s connection table (rdf:type: CHEMINF connection table)

• OPS benzene

• calculation process (rdf:type: CHEMINF execution of ACD/Labs PhysChem software library version 12.01)

• calculation result (rdf:type: CHEMINF calculated log P)

• “2.17”^^xsd:float

• “0.234”^^xsd:float

• QUDT dimensionless quantity

Edge labels:

• IAO is about

• OBI has specified input

• OBI has specified output

• QUDT has value

• QUDT has standard uncertainty

• QUDT has unit

Page 19:

ChEMBL and DrugBank analysed

ChEMBL 16 (http://www.ebi.ac.uk/chembl/) contains 1 295 510 distinct molecules; CVSP found something to say about 456 250 of them (35%).

DrugBank 3.0 (http://www.drugbank.ca/) contains 6510 distinct molecules; CVSP found something to say about 662 of them (10%).

(We haven’t done all of ChemSpider yet; we will.)

Page 20:

Potentially serious things

ChEMBL           DrugBank       Issue
14218   1.09%    202   3.10%    Not an overall neutral system
  485   0.04%     21   0.32%    Forbidden-valence atoms
   44   -          0   -        Has adjacent atoms with like charges
    4   -          0   -        Has more than one radical centre
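Three of these checks need only formal charges, radical flags and the bond list, so they can be sketched without a cheminformatics toolkit. The data layout below (a list of atom dictionaries plus index pairs for bonds) and the function names are assumptions made for illustration, not CVSP's implementation; the forbidden-valence check is left out because it needs a table of allowed valences.

```python
def not_neutral(atoms):
    """True if the formal charges do not sum to zero."""
    return sum(atom["charge"] for atom in atoms) != 0


def adjacent_like_charges(atoms, bonds):
    """Return bonded atom pairs whose formal charges have the same sign."""
    return [(i, j) for i, j in bonds
            if atoms[i]["charge"] * atoms[j]["charge"] > 0]


def multiple_radicals(atoms):
    """True if more than one atom is flagged as a radical centre."""
    return sum(1 for atom in atoms if atom.get("radical", False)) > 1


def validate(atoms, bonds):
    """Collect issue messages, worded as on the slide."""
    issues = []
    if not_neutral(atoms):
        issues.append("Not an overall neutral system")
    if adjacent_like_charges(atoms, bonds):
        issues.append("Has adjacent atoms with like charges")
    if multiple_radicals(atoms):
        issues.append("Has more than one radical centre")
    return issues


# Acetate anion drawn without a counter-ion (hydrogens implicit).
acetate = [
    {"element": "C", "charge": 0},
    {"element": "C", "charge": 0},
    {"element": "O", "charge": 0},
    {"element": "O", "charge": -1},
]
acetate_bonds = [(0, 1), (1, 2), (1, 3)]
print(validate(acetate, acetate_bonds))
```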

Page 21:

Aesthetics

ChEMBL           DrugBank       Issue
57275   4.42%     70   1.08%    Uneven-length bonds
25736   1.99%     78   1.20%    Congested layout
23622   1.82%     24   0.37%    Containing not-quite-linear cyano groups
  167   0.01%      1   -        Zero-dimensional structures
   70   0.01%      0   -        Containing not-quite-linear isocyano groups
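Several of these aesthetic checks reduce to simple 2D geometry on the drawing coordinates. The sketch below shows one possible reading of three of them; the tolerances (5% on bond length, 1 degree on the cyano angle) and all function names are assumptions chosen for illustration, since the slide does not state the thresholds CVSP uses.

```python
import math


def has_uneven_bonds(coords, bonds, tolerance=0.05):
    """True if any bond length differs from the mean by more than `tolerance`."""
    lengths = [math.dist(coords[i], coords[j]) for i, j in bonds]
    if not lengths:
        return False
    mean = sum(lengths) / len(lengths)
    if mean == 0:
        return False
    return any(abs(length - mean) / mean > tolerance for length in lengths)


def is_zero_dimensional(coords):
    """True if every atom sits at the same point."""
    return len(coords) > 1 and all(c == coords[0] for c in coords)


def angle_deg(a, b, c):
    """Angle a-b-c in degrees, measured at vertex b."""
    v1 = (a[0] - b[0], a[1] - b[1])
    v2 = (c[0] - b[0], c[1] - b[1])
    cosine = (v1[0] * v2[0] + v1[1] * v2[1]) / (math.hypot(*v1) * math.hypot(*v2))
    cosine = max(-1.0, min(1.0, cosine))  # guard against rounding
    return math.degrees(math.acos(cosine))


def is_not_quite_linear(neighbour, carbon, nitrogen, tolerance_deg=1.0):
    """True if the R-C-N angle of a cyano group deviates from 180 degrees."""
    return abs(180.0 - angle_deg(neighbour, carbon, nitrogen)) > tolerance_deg


# A cyano group drawn slightly bent.
print(is_not_quite_linear((0.0, 0.0), (1.0, 0.0), (2.0, 0.2)))
```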

Page 22:

Artwork molecules

ChEMBL   DrugBank   Issue
0        0          Cyclobutane
8        0          Ethane molecules in the structure
6        0          Sulfur atoms with no explicit bonds
4        0          Boron atoms with no explicit bonds
1        0          Ethyne molecule (in the ChEMBL case it actually is acetylene)
3        0          Stray methane molecules
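One way to look for such stray fragments is to split the record into connected components and compare the small ones against a list of suspects. The sketch below is an illustration under stated assumptions, not CVSP's method: hydrogens are implicit, bonds are `(i, j, order)` tuples, the names are invented here, and a record consisting of a single fragment is never flagged, which mirrors the note that the one ethyne hit really is acetylene.

```python
from collections import defaultdict

# Single heavy atoms with no bonds, labelled as on the slide.
LONE_ATOMS = {
    "C": "stray methane",
    "S": "sulfur atom with no explicit bonds",
    "B": "boron atom with no explicit bonds",
}
TWO_CARBON = {1: "ethane", 3: "ethyne"}


def fragments(n_atoms, bonds):
    """Return connected components as sorted lists of atom indices."""
    adjacency = defaultdict(set)
    for i, j, _order in bonds:
        adjacency[i].add(j)
        adjacency[j].add(i)
    seen, result = set(), []
    for start in range(n_atoms):
        if start in seen:
            continue
        seen.add(start)
        stack, fragment = [start], []
        while stack:
            atom = stack.pop()
            fragment.append(atom)
            for neighbour in adjacency[atom]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    stack.append(neighbour)
        result.append(sorted(fragment))
    return result


def artwork_fragments(elements, bonds):
    """Name suspicious small fragments in a multi-fragment record."""
    found = []
    parts = fragments(len(elements), bonds)
    if len(parts) < 2:
        return found  # a lone small molecule is the compound itself
    for part in parts:
        symbols = [elements[i] for i in part]
        if len(part) == 1 and symbols[0] in LONE_ATOMS:
            found.append(LONE_ATOMS[symbols[0]])
        elif symbols == ["C", "C"]:
            order = next(o for i, j, o in bonds if {i, j} == set(part))
            if order in TWO_CARBON:
                found.append(TWO_CARBON[order])
    return found


# Ethanol with an unconnected carbon atom left over from the drawing.
print(artwork_fragments(["C", "C", "O", "C"], [(0, 1, 1), (1, 2, 1)]))
```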

Page 23:

FDA tautomer and metal rules

ChEMBL           DrugBank       Issue
17508   1.35%     80   1.29%    In enol form (or chalcogenoenol form)
 9526   0.74%      4   0.07%    N=C–OH tautomer of a carbonyl compound
    2   -          1   -        Nitroso-form oximes
 1104   0.09%      6   0.09%    Metal–nitrogen bond
  845   0.06%     10   0.15%    Non-metal–transition-metal bond
  432   0.03%     10   0.15%    Metal–oxygen bond
    3   -          2   -        Aluminium–non-metal bond
    2   -          0   -        Metal–fluorine bond

Page 24:

Stereochemistry

ChEMBL            DrugBank      Rule     Issue
185742   14.3%    39   0.60%    G2-4     Has a single unknown stereocentre and no defined stereocentres: probably a racemate
 68572    5.3%    13   0.20%    G2-42    Has more than one unknown stereocentre and no defined stereocentres: probably problematic. Could indicate relative stereochemistry?
 36572    2.8%    27   0.44%    G2-44    At least one defined stereocentre, and at least one stereocentre undefined or unknown: probably an epimer or mixture of anomers
 26076    2.0%    11   0.17%    G2-46    Has more than one unknown stereocentre and more than one defined stereocentre: probably problematic again
 23113    1.8%    13   0.20%             Unknown double bond arrangement
   883    0.1%     1   -                 At least one ring containing stereobonds
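The four G2 rules depend only on how many stereocentres are defined and how many are unknown, so they can be written as a small decision function. This is a sketch under assumptions: the function name is invented, the rule codes are used as printed on the slide, and because the G2-44 and G2-46 conditions overlap as worded, the more specific G2-46 is tested first.

```python
def classify_stereo(n_defined, n_unknown):
    """Map stereocentre counts to the rule codes printed on the slide.

    Returns None when no stereocentre is unknown, i.e. no rule fires.
    """
    if n_unknown == 0:
        return None
    if n_defined == 0:
        # No defined centres: a single unknown centre suggests a racemate,
        # several unknown centres are flagged as probably problematic.
        return "G2-4" if n_unknown == 1 else "G2-42"
    if n_defined > 1 and n_unknown > 1:
        # Assumption: prefer the more specific rule when both could apply.
        return "G2-46"
    # At least one defined and at least one unknown centre:
    # probably an epimer or a mixture of anomers.
    return "G2-44"


for counts in [(0, 1), (0, 3), (1, 1), (2, 2), (3, 0)]:
    print(counts, classify_stereo(*counts))
```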

Page 26:

Sugar depiction challenges

Stereochemistry not stored in V2000 format (though present in .cdx).

Page 27:

Consequences

Page 28:

Sugar questions

ChEMBL (19275)   DrugBank (153)   Issue
5359   27.8%     138   90.2%      At least one L-pyranose ring (often antibiotics contain these)
4748   24.6%       0   -          At least one perspective chair
 416   2.16%       0   -          At least one Haworth ring
  52   0.03%       0   -          At least one perspective boat or twist boat

Page 29:

Sugar ring redepiction algorithm

1. Identify perspective conformation (boat, chair, Haworth)

2. Determine perspective stereo

3. Assign wedge or hash to bonds accordingly

4. Reconstruct sugar ring so as to minimize disruption to the rest of molecule

5. Tidy

Page 30:
Page 31:
Page 32:

Take the x-axis as parallel to the line through the top two chair atoms or through the bottom two chair atoms.

Δy positive: wedge

Δy negative: hash

Then remap chair to homotropous hexagon.
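The wedge-or-hash rule for the chair amounts to measuring the substituent's displacement perpendicular to the chosen axis. The sketch below is one reading of it, with assumptions stated: coordinates follow the usual convention of y increasing up the page, the axis is oriented left to right so that positive Δy stays "up", and the function and parameter names are invented for illustration.

```python
import math


def wedge_or_hash(ring_atom, substituent, axis_a, axis_b):
    """Classify a substituent bond on a perspective chair.

    axis_a and axis_b are the two top (or the two bottom) chair atoms; the
    x-axis is taken parallel to the line through them.
    """
    ax, ay = axis_b[0] - axis_a[0], axis_b[1] - axis_a[1]
    if ax < 0:
        ax, ay = -ax, -ay  # orient the axis left to right
    norm = math.hypot(ax, ay)
    if norm == 0:
        raise ValueError("axis atoms coincide")
    dx = substituent[0] - ring_atom[0]
    dy = substituent[1] - ring_atom[1]
    # Component of the bond vector perpendicular to the axis (rotated-frame y).
    delta_y = (-ay * dx + ax * dy) / norm
    if delta_y > 0:
        return "wedge"
    if delta_y < 0:
        return "hash"
    return "none"


# Horizontal axis: a substituent pointing up the page becomes a wedge.
print(wedge_or_hash((0.0, 0.0), (0.0, 1.0), (-1.0, 0.5), (1.0, 0.5)))
```

Projecting onto the axis normal instead of rotating every coordinate gives the same sign for Δy with less bookkeeping, and the result does not depend on the order in which the two axis atoms are given.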

Page 33:
Page 34:

In the boat case, the substituent further up the page is the wedge, while the one further down the page is the hash, regardless of whether bridgehead or not.

Page 35:

Depiction

1. Identify mean bond length and chair centroid.

2. Snap ring atoms to a regular-hexagonal grid.

3. Remove superfluous hydrogen atoms.

4. Only mark stereo on a single substituent if they are paired (cf. Grice).
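Steps 1 and 2 can be sketched as follows. The slide does not say how atoms are matched to grid positions, so this example makes its own assumption: it builds a regular hexagon at the ring centroid with side equal to the mean bond length, then tries every cyclic assignment in both directions and keeps the one that moves the atoms least. The function names and the 30-degree starting angle are likewise illustrative.

```python
import math


def centroid(points):
    """Arithmetic mean of a list of (x, y) points."""
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)


def mean_bond_length(ring):
    """Mean distance between consecutive atoms of a closed ring."""
    n = len(ring)
    return sum(math.dist(ring[i], ring[(i + 1) % n]) for i in range(n)) / n


def regular_hexagon(centre, side, start_deg=30.0):
    """Vertices of a regular hexagon; its circumradius equals its side."""
    return [
        (centre[0] + side * math.cos(math.radians(start_deg + 60 * k)),
         centre[1] + side * math.sin(math.radians(start_deg + 60 * k)))
        for k in range(6)
    ]


def snap_ring(ring):
    """Move six ring atoms, given in ring order, onto a regular hexagon."""
    if len(ring) != 6:
        raise ValueError("expected a six-membered ring")
    hexagon = regular_hexagon(centroid(ring), mean_bond_length(ring))
    best, best_cost = None, math.inf
    for direction in (1, -1):
        for offset in range(6):
            candidate = [hexagon[(offset + direction * i) % 6] for i in range(6)]
            # Total squared displacement of the atoms for this assignment.
            cost = sum(math.dist(a, b) ** 2 for a, b in zip(ring, candidate))
            if cost < best_cost:
                best, best_cost = candidate, cost
    return best


# A distorted six-membered ring, atoms listed in ring order.
chair = [(0.0, 0.0), (1.0, -0.5), (2.0, 0.0), (3.0, 1.0), (2.0, 1.5), (1.0, 1.0)]
snapped = snap_ring(chair)
```

After snapping, every ring bond has the same length as the original mean, and the ring stays centred where the drawing had it.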

Page 36:

Tidying: desiderata

Different problem from structure layout in general.

The structure we end up with is, in many important respects, fine.

Preserve drawing conventions, such as aglycones being on the top right-hand side.

Page 37:

Next steps

Stable user-facing URI for CVSP (currently http://cvsp.beta.rsc-us.org/, but subject to change)

Apply CVSP to all of ChemSpider.

Investigate fused rings.

Page 38:

Acknowledgements

In particular,

Jon Steele (RSC)

David Sharpe (RSC)

John Blunt (Canterbury, NZ)

Page 39:

Any questions?

[email protected]

@documentvector