acs san francisco 2010 cinf talk
TRANSCRIPT
NCI/CADD: Open-access chemical structure web platform
Markus Sitzmann [1], Wolf-Dietrich Ihlenfeldt [2], and Marc C. Nicklaus [1]
[1] Computer-Aided Drug Design Group, Chemical Biology Laboratory, NCI-Frederick, NIH, DHHS
[2] Xemistry GmbH, Auf den Stieden 8, D-35094 Lahntal, Germany
NCI/CADD Public Web Services
Enhanced NCI Database Browser
http://cactus.nci.nih.gov/ncidb2
web service for NCI/DTP's Open NCI Database
• first release 1998, updated 2001
• ~250,000 structure records
• ~60 million data points
Chemical Structure Lookup Service
http://cactus.nci.nih.gov/lookup
• first release 2006, updated 2008
• ~74 million structure records (~46 million unique structures)
• structure lookup in over 100 databases
OSRA
http://cactus.nci.nih.gov/osra/
converts graphical representations of chemical structures in journal articles, patent documents, textbooks, trade magazines, etc. into SMILES
Online SMILES Translator
http://cactus.nci.nih.gov/translate/
GIF Creator for Chemical Structures
http://cactus.nci.nih.gov/gifcreator/
PROSIT: Online Pseudorotation Tool Version 2
http://cactus.nci.nih.gov/prosit/
http://cactus.nci.nih.gov
New Web Services
chemical structure
Chemical Structure Representations
NCI/CADD Identifiers
InChI/InChIKey
ChemSpider ID
PubChem SID/CID
chemical names
CAS Registry Number
NSC number
FDA UNII
ChemNavigator SID
SMILES
SD File
Chemical Formula
ChEBI ID
PDB Ligand ID
MRV
CML
SYBYL Line Notation
GIF image
http://cactus.nci.nih.gov/chemical/structure
Works as a resolver for different chemical structure identifiers. Allows one to convert a given structure identifier into another representation or structure identifier.
Chemical Identifier Resolver
NCI/CADD Web Resources
http://cactus.nci.nih.gov/chemical/structure
first beta release: July 2009
second beta release: Nov. 2009
third beta release: April/May 2010
(beta versions will continue through 2010)
3.0 million requests since July 1, 2009 (~11,000/day)
• it is usable by a simple URL API:
http://cactus.nci.nih.gov/chemical/structure/"identifier"/"representation"
example: http://cactus.nci.nih.gov/chemical/structure/Tamiflu/cas
returns: 204255-11-8
MIME type: text/plain
• XML format: http://cactus.nci.nih.gov/chemical/structure/"identifier"/"representation"/xml
• if a request is not resolvable: HTTP 404 status message
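The URL scheme above can be sketched in a few lines of Python. Only the URL pattern, the /xml suffix, and the HTTP 404 behaviour come from the slide; the helper names `resolver_url` and `resolve`, the timeout, and the percent-encoding of the identifier are assumptions made for illustration.

```python
from urllib.error import HTTPError
from urllib.parse import quote
from urllib.request import urlopen

BASE_URL = "http://cactus.nci.nih.gov/chemical/structure"


def resolver_url(identifier, representation, xml=False):
    """Build a resolver URL following the pattern shown on the slide."""
    # Assumption: percent-encode the identifier so that names with spaces
    # or SMILES with special characters form a valid URL path segment.
    url = f"{BASE_URL}/{quote(identifier, safe='')}/{representation}"
    return url + "/xml" if xml else url


def resolve(identifier, representation, timeout=10.0):
    """Return the text/plain response body, or None on HTTP 404."""
    try:
        with urlopen(resolver_url(identifier, representation), timeout=timeout) as resp:
            return resp.read().decode("utf-8").strip()
    except HTTPError as err:
        if err.code == 404:  # the slide: unresolvable requests yield HTTP 404
            return None
        raise
```

Calling `resolve("Tamiflu", "cas")` issues the example request from the slide; returning `None` for a 404 is one possible way to surface "not resolvable" to the caller.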
[Flow diagram]
• http request: "identifier" (e.g. CAS number, chemical name) and "representation" (e.g. InChI, GIF image)
• detection of the identifier type
• identifier is a full structure representation (e.g. SMILES, InChI): calculation of the requested structure representation
• identifier is a hashed structure representation (e.g. InChIKey), chemical name, etc.: database lookup of the structure
• http response: requested representation with its MIME type
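The branching in the diagram could look roughly like the following Python sketch. The slide does not say how the service detects identifier types, so the regular expressions, the function names, and the "try both" fallback are all assumptions chosen for illustration.

```python
import re

# Assumed patterns; the service's real detection rules are not on the slide.
INCHIKEY_RE = re.compile(r"^[A-Z]{14}-[A-Z]{10}-[A-Z]$")
CAS_RE = re.compile(r"^\d{2,7}-\d{2}-\d$")


def identifier_type(identifier):
    """Guess the identifier type from its textual shape."""
    text = identifier.strip()
    if text.startswith("InChI="):
        return "InChI"
    if INCHIKEY_RE.match(text):
        return "InChIKey"
    if CAS_RE.match(text):
        return "CAS number"
    # A bare string may be a SMILES string or a chemical name.
    return "SMILES or chemical name"


def resolution_step(identifier):
    """Map the guessed type onto the two branches of the diagram."""
    kind = identifier_type(identifier)
    if kind == "InChI":
        return "calculation"        # full structure representation
    if kind in ("InChIKey", "CAS number"):
        return "database lookup"    # hashed representation or registry number
    return "try both"               # ambiguous: needs more than one attempt
```

The deliberately vague last branch reflects that a short string can be valid both as a SMILES string and as a name.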
"Chemical Structure Web Engine"
[Architecture diagram; components shown: Chemical Identifier Resolver, NCI/CADD web services, Chemical Structure Web Engine, CACTVS, NCI/CADD Chemical Structure Database (CSDB), other software packages, external web services (via http)]
Chemical Structure Database
NCI/CADD Web Resources
• number of structure records: 103.9 million
• number of unique structures:
    Std. InChIKey: ~73.0 million
    FICuS: ~70.6 million
    uuuuu: ~65.3 million
    union set of unique structures: ~83.6 million
• from the set of ~83.6 million unique structures we have derived ~10 million additional scaffold-type structures (for future structure searches); thus:
• for lookup "identifier → structure" available:
    ~92.9 million Standard InChIKeys
    ~93.3 million NCI/CADD Identifiers
    ~70 million chemical names linked to ~16 million structures
as of March 2010:
140 chemical structure databases
103.9 million structure records
~70.6 million unique structures by FICuS
• ChemNavigator iResearch Library (~56%): compilation of commercially available screening compounds from ~300 international chemistry suppliers
• PubChem database (~38%): including Open NCI database, EPA DSSTox databases, NIAID HIV databases, NIST Webbook, NLM ChemIDplus, ChemSpider …
• Commercial sources / others (~6%): Asinex, Comgenex, …
NCI/CADD Structure Identifiers
Unique Representation of Chemical Structures
• based on hashcodes calculated by the chemoinformatics toolkit CACTVS
• CACTVS hashcodes:
    represent a chemical structure uniquely as a 16-digit hexadecimal number (64-bit unsigned)
    have a high sensitivity to structural features of a compound
    change if connectivity changes
[Structure drawing of an example compound with its hashcode: 9850FD9F9E2B4E25]
[Figure: nine drawn variants of the example structure with their CACTVS hashcodes. Variant groups: charged form, tautomers, isotope, stereoisomers, salt, "errors". Hashcodes shown: A3DAE0788050DDE4, 3ECEF579D7DF025A, E92E4BA2869F3611, 8A7AD1EB498CC76A, 6C16DE2351F9FF50, 8F7A1DE5A733F0E0, 60525E1AF41497B6, B2FDA68AEDA06DB9, 9850FD9F9E2B4E25]
[Flow diagram: input structure (MDL Molfile, MDL SDF, SMILES, ChemDraw cdx, PDB; MDL SDF, SMILES, database) → structure normalization → parent structure → hashcode calculation (E_HASHISY) → NCI/CADD Identifier]
NCI/CADD Structure Identifiers
Structure Normalization
• adjustable levels of sensitivity:
    Fragments: sensitive, or un-sensitive (keep only largest organic fragment)
    Isotopes: sensitive, or un-sensitive (ignore isotope labels)
    Charges: sensitive, or un-sensitive (uncharge)
    Tautomers: sensitive, or un-sensitive (find canonical tautomer)
    Stereochemistry: sensitive, or un-sensitive (discard stereo information)
[Structure drawings illustrating each feature omitted]
FICTS identifier: representation of the exact drawing
    Fragments (F), Isotopes (I), Charges (C), Tautomers (T), Stereochemistry (S): all sensitive
[Figure: example structure pairs for each feature, marked = or ≠; drawings omitted]
FICuS identifier: comes closest to how a chemist perceives a compound
    Fragments (F), Isotopes (I), Charges (C), Stereochemistry (S): sensitive
    Tautomers (u): un-sensitive
[Figure: example structure pairs for each feature, marked = or ≠; drawings omitted]
uuuuu identifier: closely related forms of the same compound
    Fragments, Isotopes, Charges, Tautomers, Stereochemistry: all un-sensitive (u)
[Figure: example structure pairs for each feature, all marked =; drawings omitted]
[Figure: FICTS identifiers of the nine example structure variants (charged form, tautomers, isotope, salt, stereoisomers, "errors"). Identifiers shown: A3DAE0788050DDE4-FICTS, E5F83F10C5DB080A-FICTS (twice), B2FDA68AEDA06DB9-FICTS, 9850FD9F9E2B4E25-FICTS (twice), E92E4BA2869F3611-FICTS, 8A7AD1EB498CC76A-FICTS, 6C16DE2351F9FF50-FICTS]
[Figure: FICuS identifiers of the nine example structure variants (charged form, tautomers, isotope, salt, stereoisomers, "errors"). Identifiers shown: A3DAE0788050DDE4-FICuS, E5F83F10C5DB080A-FICuS (twice), B2FDA68AEDA06DB9-FICuS, 9850FD9F9E2B4E25-FICuS (three times), E92E4BA2869F3611-FICuS, 8A7AD1EB498CC76A-FICuS]
[Figure: uuuuu identifiers of the nine example structure variants (charged form, tautomers, isotope, stereoisomers, salt, "errors"). All variants carry the same hashcode: eight are labelled 9850FD9F9E2B4E25-uuuuu, and one label on the slide reads 9850FD9F9E2B4E25-FICuS]
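The identifier strings on these slides pair a 16-digit hexadecimal hashcode with a five-letter sensitivity suffix. The following Python sketch splits such a string into its parts. The reading of the suffix (an uppercase letter F, I, C, T, or S means "sensitive" for that feature, a lowercase u means "un-sensitive") follows the slides; the function and type names are made up for this example.

```python
import re
from typing import Dict, NamedTuple

FEATURES = ("fragments", "isotopes", "charges", "tautomers", "stereochemistry")
LETTERS = "FICTS"  # one letter per feature, in the same order as FEATURES
IDENTIFIER_RE = re.compile(r"^([0-9A-F]{16})-([FICTSu]{5})$")


class StructureIdentifier(NamedTuple):
    hashcode: int               # 64-bit unsigned value of the hex part
    sensitive: Dict[str, bool]  # feature name -> True if sensitive


def parse_identifier(text):
    """Split e.g. '9850FD9F9E2B4E25-FICuS' into hashcode and sensitivity flags."""
    match = IDENTIFIER_RE.match(text.strip())
    if match is None:
        raise ValueError(f"not an NCI/CADD identifier: {text!r}")
    hex_part, flags = match.groups()
    sensitive = {}
    for feature, letter, flag in zip(FEATURES, LETTERS, flags):
        if flag not in (letter, "u"):
            raise ValueError(f"unexpected flag {flag!r} for {feature}")
        sensitive[feature] = (flag == letter)
    return StructureIdentifier(int(hex_part, 16), sensitive)
```

For instance, parsing `9850FD9F9E2B4E25-FICuS` marks every feature except tautomers as sensitive.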
NCI/CADD Chemical Structure Database
[Database diagram]
• NCI/CADD:RID (structure records): 103.5 million
• NCI/CADD:CID (compounds, i.e. structures unique by CACTVS HASHISY): 83.6 million
• FICTS associations: ~72.0 million
• FICuS associations: ~70.6 million
• uuuuu associations: ~65.3 million
• ~130 million linkouts to original database records
• linked to: StdInChI[Key], chemical names, chemical formula, properties, etc.
Chemical Identifier Resolver
NCI/CADD Public Web Resources
http://cactus.nci.nih.gov/chemical/structure
"identifier" (resolver input):
chemical names, CAS numbers, SMILES strings, IUPAC InChI/InChIKeys, NCI/CADD Identifiers, CACTVS HASHISY, NSC number, PubChem SID/CID, FDA UNII, ChemSpider ID, ChemNavigator SID, Chemical Formula
"representation":
/smiles
/names, /iupac_name
/cas
/inchi, /stdinchi
/inchikey, /stdinchikey
/ficts, /ficus, /uuuuu
/image
/file, /sdf
/mw, /monoisotopic_mass
/formula
/twirl, /3d
/urls
/unii
/chemspider_id
/pubchem_sid
/chemnavigator_sid
Chemical Identifier Resolver
Standard InChIKey
• can resolve ~93.0 million Standard InChIKeys into a full structure representation:
http://cactus.nci.nih.gov/chemical/structure/LFQSCWFLJHTTHZ-UHFFFAOYSA-N/smiles
CCO
http://cactus.nci.nih.gov/chemical/structure/LFQSCWFLJHTTHZ-UHFFFAOYSA/smiles
CCO
CC[OH2+]
http://cactus.nci.nih.gov/chemical/structure/LFQSCWFLJHTTHZ/smiles
C(C(O)([2H])[2H])[2H]
CC(O)([2H])[2H]
C(CO)([2H])([2H])[2H]
CC[17OH]
C(CO)[2H]
[14CH3]CO
CCO
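The three requests above differ only in how many hyphen-separated blocks of the Standard InChIKey are sent: the full key, the first two blocks, or the first block alone. A small Python sketch that derives those three URLs from one key is given below; the function name and the default representation are assumptions for illustration.

```python
BASE_URL = "http://cactus.nci.nih.gov/chemical/structure"


def inchikey_queries(inchikey, representation="smiles"):
    """Return URLs for the full key, its first two blocks, and its first block."""
    blocks = inchikey.split("-")
    if len(blocks) != 3:
        raise ValueError(f"expected three hyphen-separated blocks: {inchikey!r}")
    # Progressively drop trailing blocks to widen the lookup.
    prefixes = ["-".join(blocks[:count]) for count in (3, 2, 1)]
    return [f"{BASE_URL}/{prefix}/{representation}" for prefix in prefixes]
```

In the slide's example the shorter the prefix, the more structures come back: the protonated form appears once the last block is dropped, and isotope-labelled forms appear once only the first block is sent.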
Chemical Identifier Resolver
File Representation
http://cactus.nci.nih.gov/chemical/structure/LFQSCWFLJHTTHZ-UHFFFAOYSA-N/file?format=sdf
• available formats:
    alc          Alchemy format
    cdxml        CambridgeSoft ChemDraw XML format
    cerius       MSI Cerius II format
    charmm       Chemistry at HARvard Macromolecular Mechanics file format
    cif          Crystallographic Information File
    cml          Chemical Markup Language
    ctx          Gasteiger Clear Text format
    gjf          Gaussian input data file
    gromacs      GROMACS file format
    hyperchem    HyperChem file format
    jme          Java Molecule Editor format
    maestro      Schroedinger MacroModel structure file format
    mol          Symyx molecule file
    sybyl2/mol2  Tripos Sybyl MOL2 format
    mrv          ChemAxon MRV format
    pdb          Protein Data Bank
    sdf          Symyx Structure Data Format
    sdf3000      Symyx Structure Data Format 3000
    sln          SYBYL Line Notation
    smiles       SMILES
    xyz          xyz file format
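A short Python sketch of building a /file request with a format check against the list above. Treating "sybyl2" and "mol2" as two separate accepted keywords is an assumption (the slide lists them as "sybyl2/mol2"), and the function name is hypothetical.

```python
from urllib.parse import urlencode

BASE_URL = "http://cactus.nci.nih.gov/chemical/structure"

# Format keywords as listed on the slide; "sybyl2" and "mol2" are assumed
# to be two alternative spellings of the same format.
FILE_FORMATS = {
    "alc", "cdxml", "cerius", "charmm", "cif", "cml", "ctx", "gjf",
    "gromacs", "hyperchem", "jme", "maestro", "mol", "sybyl2", "mol2",
    "mrv", "pdb", "sdf", "sdf3000", "sln", "smiles", "xyz",
}


def file_url(identifier, file_format):
    """Build a /file URL, rejecting formats that are not on the slide's list."""
    if file_format not in FILE_FORMATS:
        raise ValueError(f"format not listed on the slide: {file_format!r}")
    return f"{BASE_URL}/{identifier}/file?{urlencode({'format': file_format})}"
```

Checking the keyword locally avoids a round trip for an obviously unsupported format; the service itself may accept more formats than this list.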
http://cactus.nci.nih.gov/chemical/structure/buckyball/image?height=300&width=300&bgcolor=black&bondcolor=white
http://cactus.nci.nih.gov/chemical/structure/aspirin/image?height=200&width=200&symbolfontsize=7&footer="Aspirin"
[Rendered structure images; the aspirin image carries the footer "Aspirin"]
Chemical Identifier Resolver
Structure Image Generation
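The image requests above pass rendering options as ordinary query parameters. A minimal Python sketch for assembling such URLs follows; the function name is hypothetical, and only the option names that appear in the two example URLs (height, width, bgcolor, bondcolor, symbolfontsize, footer) are known from the slide.

```python
from urllib.parse import quote, urlencode

BASE_URL = "http://cactus.nci.nih.gov/chemical/structure"


def image_url(identifier, **options):
    """Build an /image URL; keyword arguments become query parameters."""
    url = f"{BASE_URL}/{quote(identifier, safe='')}/image"
    query = urlencode(options)  # keeps the order in which options were given
    return f"{url}?{query}" if query else url
```

For example, `image_url("buckyball", height=300, width=300, bgcolor="black", bondcolor="white")` reproduces the first URL on the slide.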
Chemical Identifier Resolver
TwirlyMol
implemented by Noel O'Boyle (University College Cork, Ireland)
simple JavaScript that allows you to render a rotatable/zoomable 3D representation of a molecule in your web browser
no plugin is needed, only a modern browser:
Chrome, Safari, FF3.5/3.6, FF3.0, FF2.0, IE8, IE7, IE6
• simple viewer: http://cactus.nci.nih.gov/chemical/structure/restasis/twirl
• embed into a web page:
<div id="canvas" height="400" width="400"></div>
<script src="http://cactus.nci.nih.gov/chemical/structure/restasis/twirl_cached/canvas"></script>
[Screenshot: TwirlyMol view of restasis]
http://www.coronene.com/blog/
http://chemical-quantum-images.blogspot.com
http://baoilleach.blogspot.com/
Chemical Identifier Resolver
Ambiguous Identifiers
• e.g. the string "CCO" can be resolved as
    the SMILES string of "ethanol"
    an abbreviation for "Carboxymethylthio-3-(3-Chlorphenyl)-1,2,4-Oxadiazol)"
• name a specific resolver module:
http://cactus.nci.nih.gov/chemical/structure/CCO/iupac_name?resolver=smiles
ethanol
http://cactus.nci.nih.gov/chemical/structure/CCO/iupac_name?resolver=name
2-[[3-(3-chlorophenyl)-1,2,4-oxadiazol-5-yl]sulfanyl]acetic acid
Chemical Identifier Resolver
Ambiguous Identifiers
• XML format:
http://cactus.nci.nih.gov/chemical/structure/CCO/iupac_name/xml
<?xml version="1.0" encoding="UTF-8" ?>
<request string="CCO" representation="iupac_name">
<data id="1" resolver="smiles" string_class="SMILES String">
<item id="1">ethanol</item>
</data>
<data id="2" resolver="name" string_class="Chemical Name">
<item id="1">2-[[3-(3-chlorophenyl)-1,2,4-oxadiazol-5-yl]sulfanyl]acetic acid</item>
</data>
</request>
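The XML response for an ambiguous identifier can be read with Python's standard library. The sketch below parses the response shown on the slide (embedded as a literal so that no network access is needed) and groups the results by resolver module; the function name is an assumption.

```python
import xml.etree.ElementTree as ET

# The response from the slide, embedded so the example runs offline.
SAMPLE = b"""<?xml version="1.0" encoding="UTF-8" ?>
<request string="CCO" representation="iupac_name">
<data id="1" resolver="smiles" string_class="SMILES String">
<item id="1">ethanol</item>
</data>
<data id="2" resolver="name" string_class="Chemical Name">
<item id="1">2-[[3-(3-chlorophenyl)-1,2,4-oxadiazol-5-yl]sulfanyl]acetic acid</item>
</data>
</request>"""


def interpretations(xml_bytes):
    """Map each resolver module to the list of items it returned."""
    root = ET.fromstring(xml_bytes)
    return {
        data.get("resolver"): [(item.text or "").strip() for item in data.findall("item")]
        for data in root.findall("data")
    }
```

A client could show both interpretations to the user, or pick one and repeat the request with the matching `resolver=` parameter.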
Chemical Identifier Resolver
Database URL Lookup
• get the URL of the original structure records:
http://cactus.nci.nih.gov/chemical/structure/restasis/urls/xml
<?xml version="1.0" encoding="UTF-8" ?>
<request string="restasis" representation="urls">
<data id="1" resolver="name" string_class="Chemical Name">
<item id="1" classification="exact" database="ChemSpider" publisher="ChemSpider">http://chemspider.com/structure.4939506</item>
<item id="2" classification="exact" database="ChemSpider" publisher="PubChem">http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?sid=43028058</item>
<item id="3" classification="exact" database="NLM ChemIDplus" publisher="NLM">http://chem.sis.nlm.nih.gov/chemidplus/direct.jsp?result=advanced&regno=059865133[…]
</data>
</request>
Chemical Identifier Resolver
Name Lookup
• get available names:
http://cactus.nci.nih.gov/chemical/structure/CC(=O)Oc1ccccc1C(O)=O/names/xml
<?xml version="1.0" encoding="UTF-8" ?>
<request string="CC(=O)Oc1ccccc1C(O)=O" representation="names">
<data id="1" resolver="smiles" string_class="SMILES String" description="CC(=O)Oc1ccccc1C(O)=O">
<item id="1" classification="PUBCHEM_IUPAC_NAME">2-acetyloxybenzoic acid</item>
<item id="2" classification="PUBCHEM_IUPAC_OPENEYE_NAME">2-Acetoxybenzoic acid</item>
<item id="3" classification="PUBCHEM_GENERIC_REGISTRY_NAME">50-78-2</item>
<item id="4" classification="PUBCHEM_GENERIC_REGISTRY_NAME">11126-35-5</item>
<item id="5" classification="PUBCHEM_GENERIC_REGISTRY_NAME">11126-37-7</item>
<item id="6" classification="PUBCHEM_GENERIC_REGISTRY_NAME">2349-94-2</item>
<item id="7" classification="PUBCHEM_GENERIC_REGISTRY_NAME">26914-13-6</item>
<item id="8" classification="PUBCHEM_SUBSTANCE_SYNONYM">NCGC00090977-04</item>
<item id="9" classification="PUBCHEM_SUBSTANCE_SYNONYM">KBioSS_002272</item>
<item id="10" classification="PUBCHEM_SUBSTANCE_SYNONYM">SBB015069</item>
<item id="11" classification="PUBCHEM_SUBSTANCE_SYNONYM">Aspirin</item>
<item id="12" classification="PUBCHEM_SUBSTANCE_SYNONYM">D00109</item>
[…]
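Each name in this response carries a `classification` attribute, so a client can separate systematic names, registry numbers, and synonyms. The Python sketch below groups names by that attribute. The embedded sample is an abridged copy of the slide's response with the closing tags added (the slide truncates the listing), and the function name is hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Abridged from the slide; closing tags added because the slide truncates.
SAMPLE = b"""<?xml version="1.0" encoding="UTF-8" ?>
<request string="CC(=O)Oc1ccccc1C(O)=O" representation="names">
<data id="1" resolver="smiles" string_class="SMILES String" description="CC(=O)Oc1ccccc1C(O)=O">
<item id="1" classification="PUBCHEM_IUPAC_NAME">2-acetyloxybenzoic acid</item>
<item id="3" classification="PUBCHEM_GENERIC_REGISTRY_NAME">50-78-2</item>
<item id="11" classification="PUBCHEM_SUBSTANCE_SYNONYM">Aspirin</item>
<item id="12" classification="PUBCHEM_SUBSTANCE_SYNONYM">D00109</item>
</data>
</request>"""


def names_by_classification(xml_bytes):
    """Group the returned names by their classification attribute."""
    grouped = defaultdict(list)
    for item in ET.fromstring(xml_bytes).iter("item"):
        key = item.get("classification", "unclassified")
        grouped[key].append((item.text or "").strip())
    return dict(grouped)
```

With the grouping in hand, picking out, say, only the registry numbers is a dictionary lookup on `PUBCHEM_GENERIC_REGISTRY_NAME`.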
http://cactus.nci.nih.gov/blog
/chemical/structure Blog
In Development
http://cactus.nci.nih.gov/TEST_chemical/structure
Chemical Identifier Resolver
"Chemical Operators"
http://cactus.nci.nih.gov/chemical/structure/operator:identifier/representation
• manipulates the structure created from the identifier
• new representation is calculated after structure manipulation
operators: tautomers, canonical_tautomer, addh, removeh, nostereo, rings, …
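The operator syntax prefixes the identifier with an operator name and a colon. A minimal Python sketch for building such URLs is shown below; the function name and the percent-encoding of the identifier are assumptions, and since the slide marks this feature as in development, the URL layout may differ in a released version.

```python
from urllib.parse import quote

BASE_URL = "http://cactus.nci.nih.gov/chemical/structure"


def operator_url(operator, identifier, representation):
    """Build an 'operator:identifier' URL as shown on the slide."""
    # The operator list on the slide ends with an ellipsis, so the operator
    # name is passed through without validation.
    return f"{BASE_URL}/{operator}:{quote(identifier, safe='')}/{representation}"
```

For example, `operator_url("tautomers", "guanine", "smiles")` yields a request for the SMILES strings of the guanine tautomers.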
"Chemical Operator"
Tautomers
http://cactus.nci.nih.gov/chemical/structure/tautomers:guanine/"representation"
[Figure: structure drawings of guanine tautomers]
IUPAC InChI/InChIKey Resolver
• (hopefully) there will be many resolvers from different providers with different backgrounds:
    publishers
    commercial databases
    free sources and databases: ChemSpider, PubChem, ChEBI, …
• Std. InChI[Key] is the perfect tool to interlink the resolvers
• ChemSpider and NCI/CADD are working on a test protocol for a federated InChI/InChIKey resolver
IUPAC InChI/InChIKey Resolver
[Diagram; elements shown: Clients, Chemical Identifier Resolver, IUPAC Root Resolver, Resolver 1, Resolver 2, Resolver 3, Resolver 3.1, Resolver 3.2, Resolver 3.3]
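One way to picture a federated resolver is a tree in which each node answers from its own records or delegates to the resolvers below it. The Python sketch below is purely illustrative: the test protocol mentioned in the talk is not described on the slides, so the class, its delegation order, the tree layout (inferred from the resolver numbering), and the sample record are all assumptions.

```python
class Resolver:
    """One node in a hypothetical resolver hierarchy."""

    def __init__(self, name, records=None, children=None):
        self.name = name
        self.records = dict(records or {})    # InChIKey -> structure (SMILES here)
        self.children = list(children or [])  # resolvers this node delegates to

    def resolve(self, inchikey):
        """Return (resolver name, structure) from the first node that knows the key."""
        if inchikey in self.records:
            return self.name, self.records[inchikey]
        for child in self.children:  # depth-first delegation
            hit = child.resolve(inchikey)
            if hit is not None:
                return hit
        return None


# Layout assumed from the resolver numbering on the slide.
root = Resolver("IUPAC Root Resolver", children=[
    Resolver("Resolver 1"),
    Resolver("Resolver 2"),
    Resolver("Resolver 3", children=[
        Resolver("Resolver 3.1"),
        Resolver("Resolver 3.2", records={"LFQSCWFLJHTTHZ-UHFFFAOYSA-N": "CCO"}),
        Resolver("Resolver 3.3"),
    ]),
])
```

Because the Standard InChIKey is the same string at every provider, the key itself is the only thing the nodes need to agree on.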
http://cactus.nci.nih.gov/chemical/structure
http://cactus.nci.nih.gov/blog
Acknowledgments
ChemNavigator
Scott Hutton
Tad Hurst
CADD Group, CBL, NCI
Igor Filippov
Noel O'Boyle
Hans-Juergen Himmler (Akos)
Thanks to all database providers!
Our web site: http://cactus.nci.nih.gov
Users
webel.py - A Cinfony module
IUPHAR DATABASE
http://www.iuphar-db.org
http://baoilleach.blogspot.com/2009/11/introducing-webel-cheminformatics.html
http://www.akosgmbh.eu/globalsearch/index.htm
avogadro.openmolecules.net/
CACTVS
http://www.xemistry.com
in silico toxicology
http://www.in-silico.ch/
Symyx Draw Resolver
http://www.symyx.com/