Chemical Biology Laboratory, Center for Cancer Research, National Cancer Institute
National Institutes of Health, Frederick, Maryland 21702
The NCI/CADD Group's InChI Usage and Analysis of Tautomerism for InChI V2
Marc C. Nicklaus
Computer-Aided Drug Design (CADD) Group
Chemical Identifier Resolver (CIR)
http://cactus.nci.nih.gov/chemical/structure
• “Resolves” structure identifiers or representations, i.e. converts one
structure identifier/representation into another
• Usable by humans; but optimized for communication between computers
(returns MIME/text pages wherever possible)
• Launched in June 2009
• Planned: major update of services and underlying database, the CADD
Group’s “Chemical Structure DataBase” (CSDB)
Chemical Identifier Resolver (CIR) Flowchart
• http request: identifier + requested representation
• detection of the identifier type
• identifier is a full structure representation (e.g. SMILES, InChI): calculation of the requested structure representation (e.g. InChI, GIF image)
• identifier is a hashed structure representation (e.g. InChIKey), or chemical name etc.: database lookup of the structure (result e.g. CAS number, chemical name)
• http response: representation
Chemical Identifier Resolver (CIR)
Examples:
• Naproxen sodium:
http://cactus.nci.nih.gov/chemical/structure/CDBRNDSHEYLDJV-FVGYRXGTSA-M/smiles
returns [C@H](C2=CC1=CC=C(OC)C=C1C=C2)(C([O-])=O)C.[Na+] (MIME type: text/plain)
• Buckyball:
http://cactus.nci.nih.gov/chemical/structure/XMWRBQBLMFGWIX-UHFFFAOYSA-N/image?height=300&width=300&bgcolor=black&bondcolor=white
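The URL pattern used in these examples (base URL, identifier, requested representation, optional query parameters) can be assembled programmatically. The following is a minimal Python sketch under that assumption; the helper name `cir_url` is hypothetical and not part of CIR, and the code only builds the URL string without contacting the service.

```python
from urllib.parse import quote, urlencode

# Base URL as shown on the slide.
CIR_BASE = "http://cactus.nci.nih.gov/chemical/structure"


def cir_url(identifier, representation, **params):
    """Build a CIR request URL: <base>/<identifier>/<representation>[?query].

    Hypothetical helper for illustration; it does not send the request.
    """
    # Percent-encode characters that are unsafe in a URL path (for example
    # '#' or '/' in a SMILES string); ':' is kept as-is.
    path = f"{CIR_BASE}/{quote(identifier, safe=':')}/{representation}"
    return f"{path}?{urlencode(params)}" if params else path


naproxen = cir_url("CDBRNDSHEYLDJV-FVGYRXGTSA-M", "smiles")
buckyball = cir_url(
    "XMWRBQBLMFGWIX-UHFFFAOYSA-N", "image",
    height=300, width=300, bgcolor="black", bondcolor="white",
)
print(naproxen)
print(buckyball)
```

The resulting strings can be passed to any HTTP client; since the service returns plain text wherever possible, the response body can usually be read directly.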
Chemical Structure Database (CSDB) in CIR
• ChemNavigator/Sigma iResearch Library (~56%): compilation of commercially available screening compounds from ~300 international chemistry suppliers
• PubChem database (~38%): including Open NCI database, EPA DSSTox databases, NIAID HIV database, NIST Webbook, NLM ChemIDplus, ChemSpider, …
• Commercial sources / others (~6%): Asinex, Comgenex, eMolecules, …
Currently available in CIR:
• 140 chemical structure databases
• 120 million structure records
• 84.6 million unique structures by FICuS
• 110 million unique Standard InChIKeys for lookup (includes ~25 million derived scaffold and ring structures)
Tautomerism in large small-molecule databases
NCI/CADD Structure Identifiers
Based on CACTVS hashcodes; 16-digit hex numbers (64 bit unsigned)
Each identifier name has five positions; an upper-case letter marks the identifier as sensitive to that feature, a lower-case "u" as un-sensitive:
• F: Fragments
• I: Isotopes
• C: Charges
• T: Tautomers
• S: Stereochemistry
FICTS is sensitive to all five features; uuuuu is un-sensitive to all five.
[Figure: example structure pairs for each feature, marked as equal (=) or different (≠)]
NCI/CADD Structure Identifiers
Sitzmann et al. SAR QSAR Environ. Res. 2008, 19, 1–9
FICuS identifier: comes closest to how a chemist perceives a compound
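To make the naming scheme concrete, here is a small Python sketch that splits an identifier string such as 9850FD9F9E2B4E25-FICuS into its 64-bit hashcode and its five sensitivity flags. The function name `parse_identifier` and the exact string layout (16 hex digits, a hyphen, five flag letters) are assumptions read off the examples in this talk, not a documented API.

```python
import re

# Feature order follows the identifier name F-I-C-T-S.
FEATURES = ("fragments", "isotopes", "charges", "tautomers", "stereochemistry")
LETTERS = "FICTS"

# Assumed layout: 16 upper-case hex digits, '-', then one flag per feature
# (the feature's letter if sensitive, 'u' if un-sensitive).
_PATTERN = re.compile(r"^([0-9A-F]{16})-([Fu][Iu][Cu][Tu][Su])$")


def parse_identifier(text):
    """Return (hashcode as int, {feature: is_sensitive}) for an identifier."""
    match = _PATTERN.match(text)
    if match is None:
        raise ValueError(f"not a recognised NCI/CADD identifier: {text!r}")
    hex_part, flags = match.groups()
    sensitivity = {
        feature: flag == letter
        for feature, flag, letter in zip(FEATURES, flags, LETTERS)
    }
    return int(hex_part, 16), sensitivity


hashcode, sensitivity = parse_identifier("9850FD9F9E2B4E25-FICuS")
print(hex(hashcode), sensitivity)
```

For FICuS, only the tautomer flag comes back as un-sensitive, which matches the description of FICuS as the tautomer-invariant identifier.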
Histidine: Std. InChIKey vs. FICuS
[Figure: histidine and its variants (charged form, tautomer, isotope, stereoisomers, salt, “errors”), each labeled with its Std. InChIKey and FICuS identifier]
Std. InChIKeys shown:
• HNDVDQJCIGZPNO-UHFFFAOYSA-N
• HNDVDQJCIGZPNO-CDYZYAPPSA-N
• HNDVDQJCIGZPNO-RXMQYKEDSA-N
• HNDVDQJCIGZPNO-YFKPBYRVSA-N
• UHPNKBYGGMJTIM-UHFFFAOYSA-M
FICuS identifiers shown:
• 9850FD9F9E2B4E25-FICuS (also shown as 9850FD9F9E2B4E25-FICuS-01-78)
• E5F83F10C5DB080A-FICuS
• E92E4BA2869F3611-FICuS
• 8A7AD1EB498CC76A-FICuS
• A3DAE0788050DDE4-FICuS
• B2FDA68AEDA06DB9-FICuS
Chemical Structure Database (current version)
Calculation of Standard InChIKey and NCI/CADD identifiers (from ~350 million original raw records including multiple database versions)
Workflow: original structure record → structure normalization → parent structure (FICTS, FICuS, uuuuu) → hashcode calculation (E_HASHISY) → NCI/CADD Identifier
InChIKeys (Standard InChIKey 1.04 plus Set 1, Set 2, Set 3), unique counts:
• Standard: 167,722,852
• Set 1: 167,722,850
• Set 2: 167,722,852
• Set 3: 167,698,426
• any unique (union set): 167,723,824
NCI/CADD Identifiers (unique counts):
• FICTS: 125,009,738
• FICuS: 121,429,689
• uuuuu: 108,993,792
(before normalization: ~127M)
Total unique records: 168,015,387 (includes ~40M derived scaffold and ring structures)
Tautomers
Tautomers are isomers that can transform into each other through chemical equilibrium reactions
- Prototropic tautomerism: intramolecular movement of a hydrogen atom (figure: enol form / keto form)
- Ring-chain tautomerism: movement of the proton accompanied by opening/closing of a ring (figure: cyclic form / acyclic form)
Strongly environment-dependent (pH, solvent, T, time, ... )
Importance of Getting Tautomers Right
The existence of multiple tautomeric forms of the same molecule can create problems!
• Ligand docking: hydrogen bonding interactions are different for different tautomers
• Clustering / diversity: (Tanimoto) similarity between tautomers can be very low
• Registration in databases (most important for InChI): may lead to duplicate registration, missed molecules in searches, or incorrect identification of two structures as the same compound
• Property calculation: variations across tautomers by several orders of magnitude, e.g. pKa, logP
• X-ray crystallography: what is the tautomeric form present in ligand-protein complexes?
May impact the success of drug discovery
Most important for InChI
Tautomerism in large small-molecule databases
Average tautomeric overlap per DB ~0.3%
Tautomeric overlap across all DBs: ~10%
Structures capable of tautomerism: ~68%
NCI/CADD Chemical Structure Database: Tautomer Analysis
[Figure: histogram of tautomeric overlap within each individual database release. x-axis: percentage of actual duplicates (FICTS − FICuS parent structures) in each database release, 0.0 to 2.0; y-axis: frequency (number of database releases), 0 to 100. Databases labeled: Asinex, ChemBridge, ComGenex, ChemNavigator, Columbia University Molecular Screening Center, EPA DSSTox, Specs, Ambinter, BIND, BindingDB, KEGG, NCI Open Database, NIST WebBook, NLM ChemIDplus, NMRShiftDB, Thomson Pharma, Wombat, NCI/DTP, PASS Training Set, SGC-Ox, ChemDB, ZINC, ChEBI, ChemSpider]
Sitzmann M, Ihlenfeldt WD, Nicklaus MC. J Comput Aided Mol Des. 2010 Jun;24(6-7):521-51.
Tautomerism in Chemoinformatics
• In reality, tautomerism is a quantum-mechanical effect (orbital changes)
• In principle calculable at the QM level
• But incorporating the conditions (solvent, pH, ...) is not easy
• ...and these calculations can take weeks for one molecule
• In chemoinformatics: rule-based (you have maybe 100 ms per structure!)
• These rules will be “correct” only in a statistical sense
• Whether QM or rule-based: one has to agree on what set and ranges of conditions to use to define “tautomerism” in a practical application, e.g. for identifiers used for compound registration in a database or repository
Tautomerism and Identifiers
Different chemical identifiers are sensitive to tautomerism to different degrees:
• InChI/InChIKey [IUPAC International Chemical Identifier]
– one identifier with a layer structure
– in principle designed to be tautomer-invariant
– but not all types of common tautomerism are handled by default (e.g. not invariant by default to keto/enol tautomerism in v. 1.04)
• NCI/CADD identifiers (1)
– several identifiers with different sensitivities to chemical features; most important one:
• FICuS – tautomer-invariant
(1) Sitzmann et al. SAR QSAR Environ. Res. 2008, 19, 1–9
Example of Known Issues in InChI
Dmitrii Tchekhovskoi, IUPAC InChI Committee Meeting, March 2012
1,4-oxime/nitroso tautomerism not currently handled by InChI. Adding it will break current InChI.
[Figure: structures of the tautomeric forms; the Standard InChI strings of the two forms are:]
InChI=1S/C5H5NO2/c7-5-3-1-2-4-6(5)8/h1-4,7H
InChI=1S/C5H5NO2/c7-5-3-1-2-4-6(5)8/h1-4,8H
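The two strings above differ only in where the last hydrogen is placed. A short Python sketch makes this visible by splitting each InChI into its layers and comparing them. The helper names `inchi_layers` and `differing_layers` are illustrative; the parser is deliberately naive and assumes a standard, single-component InChI in which each layer prefix occurs only once.

```python
def inchi_layers(inchi):
    """Split an InChI string into a dict of layers.

    Naive sketch: the first field is the version, the second the formula,
    and every further layer is keyed by its one-letter prefix.
    """
    prefix, formula, *rest = inchi.split("/")
    layers = {"version": prefix.split("=", 1)[1], "formula": formula}
    for part in rest:
        layers[part[0]] = part[1:]
    return layers


def differing_layers(inchi_a, inchi_b):
    """Return the sorted names of the layers in which two InChIs differ."""
    a, b = inchi_layers(inchi_a), inchi_layers(inchi_b)
    return sorted(key for key in a.keys() | b.keys() if a.get(key) != b.get(key))


form_a = "InChI=1S/C5H5NO2/c7-5-3-1-2-4-6(5)8/h1-4,7H"
form_b = "InChI=1S/C5H5NO2/c7-5-3-1-2-4-6(5)8/h1-4,8H"
print(differing_layers(form_a, form_b))
```

Formula and connectivity layers are identical here; the hydrogen layer places the remaining H on atom 7 in one form and on atom 8 in the other, so the two forms receive different InChIs.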
Tautomerism Transform Rules
• CACTVS: rule set of 21 transforms encoded as (CACTVS-extended) SMIRKS
• Types of tautomerism covered:
rule 1: 1,3 (thio)keto/(thio)enol*
rule 2: 1,5 (thio)keto/(thio)enol
rule 3: simple (aliphatic) imine
rule 4: special imine
rule 5: 1,3 aromatic heteroatom H shift
rule 6: 1,3 heteroatom H shift
rule 7: 1,5 (aromatic) heteroatom H shift (1)
rule 8: 1,5 (aromatic) heteroatom H shift (2)
rule 9: 1,7 (aromatic) heteroatom H shift
rule 10: 1,9 (aromatic) heteroatom H shift
rule 11: 1,11 (aromatic) heteroatom H shift
rule 12: furanones
rule 13: ketene/ynol exchange
rule 14: ionic nitro/aci-nitro
rule 15: pentavalent nitro/aci-nitro
rule 16: oxime/nitroso
rule 17: oxime/nitroso via phenol
rule 18: cyanic/iso-cyanic acids
rule 19: formamidinesulfinic acids
rule 20: isocyanides
rule 21: phosphonic acids
* rule 1 has been merged with rule 6
CACTVS by: Wolf-Dietrich Ihlenfeldt, Xemistry GmbH
Tautomer Overlap in Commercial Catalogs
Aldrich Market Select (AMS) database
of commercially available samples:
5,755,574 molecules (2012-09 version)
31,156 conflicts involving 62,872 molecules
n-tuples    Conflicts
2           30,619
3           514
4           21
5           1
Examples (prices per 1 g):
$300 / $188
Same original supplier!
$350 / $313
Quadruple Tautomeric Case
But wait, there is more: Ring-Chain Tautomerism
“The need for computer programs that predict ring-chain tautomerization, a capability absent from the current tautomer generation programs”
Y.C. Martin, “Let’s not forget tautomers”, J Comput Aided Mol Des. 2009, 23, 693-704
Prototropic tautomerism is handled by most chemoinformatics tools...
...but not ring-chain tautomerism (OUR GOAL)
[Figure: exocyclic and endocyclic ring closures etc.; XH: nucleophilic center, YZ: electrophilic center]
Example for Ring-Chain Tautomerism: Warfarin
• Anticoagulant drug used in the prevention of thrombosis
• Introduced in 1948 as a pesticide against rats and mice
• Approved in 1954 for use as a medication
• Most widely prescribed oral anticoagulant drug in the U.S.
• Inhibits vitamin K epoxide reductase (recycles oxidized vitamin K1 to its reduced form)
Valente E.J. et al., J. Med. Chem. 1977, 20, 1489-1493
Karlsson, B.C.G. et al., J. Phys. Chem. B 2007, 111, 10520-10528
Porter, R.P., J. Comput. Aided Mol. Des. 2010, 24, 553–573
Nicholls, I.A. et al., J. Mol. Recognit. 2010, 23, 604–608
Tautomerism of Warfarin – What to Expect
[Figure: warfarin tautomer structures, marked as “Submitted to PubChem”, “Mentioned in literature”, or “Confirmed experimentally”]
Can exist in principle in as many as 40 topologically distinct tautomeric forms!
Tautomerism of Warfarin – FICuS Identifier
http://cactus.nci.nih.gov/chemical/structure/tautomers:warfarin/ficus
[Figure: warfarin tautomers connected by prototropic and ring-chain tautomerism, each labeled with its FICuS identifier]
• D76B88C0354759F1-FICuS (shown for 9 structures)
• 09BB2FAADA1508A7-FICuS
• 8F5519DD1E62B6B2-FICuS
Ring-chain tautomerism: not covered by current rule set!
Tautomerism of Warfarin – InChIKey Identifier
http://cactus.nci.nih.gov/chemical/structure/tautomers:warfarin/stdinchikey
[Figure: warfarin tautomers connected by prototropic and ring-chain tautomerism, each labeled with its Standard InChIKey]
• QTXVAVXCBMYBJW-UHFFFAOYSA-N
• VWSXIGYSLWNCBN-VAWYXSNFSA-N
• GRAAPKVUSREWIL-UHFFFAOYSA-N
• FQEPJUOLUDFINX-UHFFFAOYSA-N
• UCKRWKACBKRIKB-VAWYXSNFSA-N
• NNLYDNMZCAHUOV-UHFFFAOYSA-N
• PJVWKTKQMONHTI-UHFFFAOYSA-N
• FVSFCRPKSVCTBA-VAWYXSNFSA-N
• BBOSKMPTDUUMKL-UHFFFAOYSA-N
• LSCYDZJASSKSMJ-UHFFFAOYSA-N
• PIBBOXWKSPNJFI-UHFFFAOYSA-N
All InChIKey identifiers are different!!
Revamp handling of tautomerism in InChI V.2
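The contrast between the two warfarin slides can be checked by counting distinct identifiers. The Python sketch below hard-codes the eleven FICuS identifiers and eleven Standard InChIKeys as they appear on the slides (the pairing of an individual structure with an individual identifier is not needed for the count and is not assumed here).

```python
from collections import Counter

# Identifiers as listed on the two warfarin slides (11 structures each).
ficus_ids = ["D76B88C0354759F1-FICuS"] * 9 + [
    "09BB2FAADA1508A7-FICuS",
    "8F5519DD1E62B6B2-FICuS",
]
std_inchikeys = [
    "QTXVAVXCBMYBJW-UHFFFAOYSA-N", "VWSXIGYSLWNCBN-VAWYXSNFSA-N",
    "GRAAPKVUSREWIL-UHFFFAOYSA-N", "FQEPJUOLUDFINX-UHFFFAOYSA-N",
    "UCKRWKACBKRIKB-VAWYXSNFSA-N", "NNLYDNMZCAHUOV-UHFFFAOYSA-N",
    "PJVWKTKQMONHTI-UHFFFAOYSA-N", "FVSFCRPKSVCTBA-VAWYXSNFSA-N",
    "BBOSKMPTDUUMKL-UHFFFAOYSA-N", "LSCYDZJASSKSMJ-UHFFFAOYSA-N",
    "PIBBOXWKSPNJFI-UHFFFAOYSA-N",
]

ficus_counts = Counter(ficus_ids)
distinct_ficus = len(ficus_counts)
distinct_inchikeys = len(set(std_inchikeys))
# The first InChIKey block alone already separates all eleven forms.
distinct_first_blocks = len({key.split("-")[0] for key in std_inchikeys})

print(distinct_ficus, distinct_inchikeys, distinct_first_blocks)
```

Nine of the eleven displayed forms collapse onto one FICuS identifier, whereas no two forms share a Standard InChIKey, not even in the first block.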
Definition of Ring-Chain Tautomerism Rules
Baldwin's Rules as starting point (disfavoured / favoured ring closures):
• Ring sizes being formed: 3 – 7
• Breaking bonds: exocyclic (exo) or endocyclic (endo)
• Geometry of carbon atoms: sp3 (tet), sp2 (trig) or sp (dig)
J.E. Baldwin, J Chem Soc Chem Commun. 1976: 734-736
11 ring-chain rules → each rule is encoded as a SMIRKS string
[Figure: comparison of our rules with Baldwin's rules]
Guasch, L. et al., JCIM (under revision)
Examples: Ring-chain tautomerism rules
[Figure: atom-numbered reaction schemes for the rules 6-exo-Trig, 5-exo-Dig, and 7-endo-Trig]
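Rule names such as 6-exo-Trig follow the Baldwin nomenclature described above: ring size, exo/endo bond breaking, and carbon geometry. As a minimal Python sketch, a name can be decomposed and validated as follows; the `RingClosure` type and `parse_rule_name` function are hypothetical helpers, and the 3 to 7 ring-size limit is taken from the slide.

```python
from typing import NamedTuple


class RingClosure(NamedTuple):
    ring_size: int      # number of atoms in the ring being formed
    bond: str           # "exo" or "endo"
    hybridization: str  # "sp3", "sp2" or "sp"


# Baldwin geometry labels and the carbon hybridization they stand for.
GEOMETRY = {"tet": "sp3", "trig": "sp2", "dig": "sp"}


def parse_rule_name(name):
    """Parse a rule name like '6-exo-Trig' into a RingClosure."""
    try:
        size_text, bond, geometry = name.lower().split("-")
        ring_size = int(size_text)
    except ValueError:
        raise ValueError(f"malformed rule name: {name!r}") from None
    if not 3 <= ring_size <= 7:
        raise ValueError(f"ring size outside 3-7: {name!r}")
    if bond not in ("exo", "endo"):
        raise ValueError(f"unknown bond type: {name!r}")
    if geometry not in GEOMETRY:
        raise ValueError(f"unknown geometry: {name!r}")
    return RingClosure(ring_size, bond, GEOMETRY[geometry])


for rule in ("6-exo-Trig", "5-exo-Dig", "7-endo-Trig"):
    print(rule, parse_rule_name(rule))
```

This only interprets the rule names; the actual transforms are the SMIRKS strings of the rule set, which are not reproduced in this talk.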
Tautomers per Structure in the AMS Database
                        Prototropic Tautomerism      Ring-chain Tautomerism
                        (21 Rules)                   (11 Rules)
                        Count         %              Count         %
no tautomers
(single molecule)       1,393,612     24.21          5,297,864     92.05
one tautomer            1,235,979     21.47          101,890       1.77
2 tautomers             833,492       14.48          214,488       3.73
3 tautomers             483,057       8.39           16,490        0.29
4 tautomers             223,114       3.88           40,606        0.71
5 – 10 tautomers        889,118       15.45          32,661        0.57
11 – 50 tautomers       584,842       10.16          37,267        0.65
51 – 100 tautomers      72,832        1.27           3,905         0.07
101 – 200 tautomers     35,901        0.62           7,078         0.12
201 – 500 tautomers     3,486         0.06           3,017         0.05
501 – 1000 tautomers    141           0.00           308           0.01
Ring-chain tautomerism is in the minority relative to prototropic tautomerism, but it is not an “exotic” occurrence either.
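The table can be summarized by the share of AMS molecules for which at least one tautomer is generated. The following Python sketch recomputes that share from the counts in the table; the variable and function names are illustrative only.

```python
# Counts per row of the table, from "no tautomers" down to "501 - 1000".
PROTOTROPIC = [1_393_612, 1_235_979, 833_492, 483_057, 223_114, 889_118,
               584_842, 72_832, 35_901, 3_486, 141]
RING_CHAIN = [5_297_864, 101_890, 214_488, 16_490, 40_606, 32_661,
              37_267, 3_905, 7_078, 3_017, 308]


def percent_with_tautomers(counts):
    """Percentage of molecules with at least one tautomer.

    The first entry of `counts` is the "no tautomers" row.
    """
    total = sum(counts)
    return 100.0 * (total - counts[0]) / total


proto_pct = percent_with_tautomers(PROTOTROPIC)
ring_pct = percent_with_tautomers(RING_CHAIN)
print(f"prototropic: {proto_pct:.2f}%  ring-chain: {ring_pct:.2f}%")
```

Both columns sum to the 5,755,574 molecules of the AMS database, so roughly three quarters of the catalog has at least one prototropic tautomer and about 8% has at least one ring-chain tautomer under these rule sets.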
Tautomerism Analysis by “Experimental Chemoinformatics”
Procedure:
• Select 100-200 tautomer tuples from AMS by
– coverage of types of tautomer transforms
– chemical diversity
– solubility
– availability from same original supplier
– likelihood to be distinguishable by NMR
– price
• Purchase samples
• Analyze by NMR spectroscopy
– measure as function of temperature, solvent, pH, shelf time...
Goal:
• Investigate prevalence of tautomeric overlap in a real commercial catalog
• Test which of the tautomer transform rules may be too “aggressive”
NMR Experiments
338 molecules from AMS:
• 127 prototropic tautomeric pairs
• 5 prototropic tautomeric triples
• 34 ring-chain tautomeric pairs
Bruker AVANCE™ 500 – Autosampler (24)
Solvent: DMSO-d6
Room temperature
1H and 13C NMR Spectra
Keto/enol Tautomerism (conflict 26)
[Figure: 1H and 13C NMR spectra of samples 26_1 and 26_2]
Same 1H and 13C NMR spectrum between samples. Assignment of chemical shifts indicates enol form is present in both samples.
InChI=1S/C14H12N4OS/c1-9-10(5-4-8-15)13(19)18(17-9)14-16-11-6-2-3-7-12(11)20-14/h2-3,6-7,17H,4-5H2,1H3
InChIKey=HGTVTJWWZHYAQS-UHFFFAOYSA-N
InChI=1S/C14H12N4OS/c1-9-10(5-4-8-15)13(19)18(17-9)14-16-11-6-2-3-7-12(11)20-14/h2-3,6-7,19H,4-5H2,1H3
InChIKey=KTGFJMXHOLHDQA-UHFFFAOYSA-N
Heteroaromatic Tautomerism (conflict 36)
[Figure: 1H and 13C NMR spectra of both samples]
Same 1H and 13C NMR spectrum between samples. Double number of peaks: both samples have the same mixture of tautomers.
InChI=1S/C9H13ClN4/c1-9(2,3)14-7-6(5-12-14)4-11-8(10)13-7/h5,12H,4H2,1-3H3
InChIKey=PCZNHIHWFBLCGG-UHFFFAOYSA-N
InChI=1S/C9H13ClN4/c1-9(2,3)14-7-6(5-12-14)4-11-8(10)13-7/h5H,4H2,1-3H3,(H,11,13)
InChIKey=XRAYTMYJALHNQU-UHFFFAOYSA-N
Ring-Chain Tautomerism (conflict 652)
[Figure: 1H and 13C NMR spectra of both samples]
Same 1H and 13C NMR spectrum between samples. Assignment of chemical shifts indicates open form is present in both samples.
InChI=1S/C14H14N2O2S/c15-13(17)12-10-5-1-2-6-11(10)19-14(12)16-8-9-4-3-7-18-9/h3-4,7-8H,1-2,5-6H2,(H2,15,17)
InChIKey=VKEHQTAMAUSRIS-UHFFFAOYSA-N
InChI=1S/C14H14N2O2S/c17-13-11-8-4-1-2-6-10(8)19-14(11)16-12(15-13)9-5-3-7-18-9/h3,5,7,12,16H,1-2,4,6H2,(H,15,17)/t12-/m0/s1
InChIKey=XMZKSUZSNNZEDZ-LBPRGKRZSA-N
Preliminary Results
• More than 200 spectra have been analyzed so far.
• VERY PRELIMINARY conclusions: Around 80% of prototropic tautomeric cases and 50% of ring-chain tautomeric cases show the same 1H and 13C NMR spectra
• Usually the same visual appearance of the samples of a pair (texture, color etc.) corresponds to identical NMR results.
• We have assigned the chemical shifts of some spectra to determine which tautomer is present in the samples.
• Some tautomeric conflicts, e.g. involving triazole, imidazole or pyrazole moieties, are practically indistinguishable by standard NMR experiments. Example: samples 12_1 and 12_2, which are listed with the same InChI:
InChI=1S/C6H4BrN3/c7-4-1-2-5-6(3-4)9-10-8-5/h1-3H,(H,8,9,10)
InChIKey=BQCIJWPKDPZNHD-UHFFFAOYSA-N
• We are starting to review the tautomeric rules based on the NMR results.
Tautomerism – Ongoing and Planned Activities
• Database of tautomerism data (structures; ratios, interconversion rates, relative energies...) from literature, both experimental and computational
• IUPAC Working Group: “Redesign of Handling of Tautomerism for InChI V2”
• Agree on conditions for “tautomerism” in InChI
• Addition of ring-chain tautomerism rule set
• Recommendation for a tautomerism rule set for InChI V2
• Definition of a canonical tautomer
Acknowledgements
NCI/CADD Team: Alexey Zakharov, Laura Guasch, Megan Peach, Marc Nicklaus, Markus Sitzmann
Xemistry GmbH: Wolf-Dietrich Ihlenfeldt
ChemNavigator: Scott Hutton
InChI Team
PubChem
All other database providers
NCI/CADD Structure Identifiers
Unique Representation of Chemical Structures
• based on hashcodes calculated by the chemoinformatics toolkit CACTVS
• CACTVS hashcodes:
– represent a chemical structure uniquely as 16-digit hexadecimal number (64-bit unsigned)
– high sensitivity to structural features of a compound
– change if connectivity changes
[Figure: example structure with hashcode 9850FD9F9E2B4E25]
Chemical Identifier Resolver (CIR)
Example:
http://cactus.nci.nih.gov/chemical/structure/PGZUMBJQJWIWGJ-ONAKXNSWSA-N/cas
returns 204255-11-8 (MIME type: text/plain)
Partial InChIKey Lookup (in preparation)
• resolve Standard InChIKey into full structure representation: Ethanol
http://cactus.nci.nih.gov/chemical/structure/LFQSCWFLJHTTHZ-UHFFFAOYSA-N/smiles
returns:
CCO
http://cactus.nci.nih.gov/chemical/structure/LFQSCWFLJHTTHZ-UHFFFAOYSA/smiles
returns:
CCO
CC[OH2+]
http://cactus.nci.nih.gov/chemical/structure/LFQSCWFLJHTTHZ/smiles
returns:
C(C(O)([2H])[2H])[2H]
CC(O)([2H])[2H]
C(CO)([2H])([2H])[2H]
CC[17OH]
C(CO)[2H]
[14CH3]CO
CCO
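The three ethanol lookups differ only in how many hyphen-separated blocks of the InChIKey are supplied. As a minimal Python sketch, the partial keys can be derived from a full Standard InChIKey as shown below; `split_inchikey` and `partial_keys` are hypothetical helper names, and the 14/10/1 block lengths are those of the InChIKey format.

```python
def split_inchikey(key):
    """Split an InChIKey into its three hyphen-separated blocks."""
    blocks = key.split("-")
    if [len(block) for block in blocks] != [14, 10, 1]:
        raise ValueError(f"not a well-formed InChIKey: {key!r}")
    return tuple(blocks)


def partial_keys(key):
    """Return the full key, the key without its last block, and the
    first block alone, from most to least specific."""
    first, second, third = split_inchikey(key)
    return [f"{first}-{second}-{third}", f"{first}-{second}", first]


ethanol_keys = partial_keys("LFQSCWFLJHTTHZ-UHFFFAOYSA-N")
for key in ethanol_keys:
    print(key)
```

In the ethanol example, dropping the final one-letter block additionally returns the protonated form, and using only the first block additionally returns the isotopically labeled forms.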
Chemical Structure Database (current version)
InChI/InChIKey (Version 1.04) calculated with four InChI flag sets:
• Standard: Standard InChIKey
• Set 1, Set 2, Set 3: flag list shown for each set: DONOTADDH W0 FIXEDH RECMET NEWPS SPXYZ SAsXYZ Fb Fnud KET 15T
Addition of hydrogen atoms (Add H):
• Standard Set, Set 1 & Set 2: addition of hydrogen atoms by CACTVS
• Set 3: addition of hydrogen atoms by the InChI library
NCI/CADD Structure Identifiers
Unique Representation of Chemical Structures
Workflow: original structure record (MDL Molfile, MDL SDF, SMILES, ChemDraw cdx, PDB) → database → structure normalization → parent structure (FICTS, FICuS, uuuuu) → hashcode calculation (E_HASHISY) → NCI/CADD Identifier
• we calculate a set of parent structures with different sensitivity to chemical features
• fine-grained representation of chemical structures
NCI/CADD Chemical Structure Database: Tautomer Analysis
[Figure: histogram of the occurrence of “tautomerism-critical” molecules within each individual database release (%), x-axis 0.5 to 24.5; y-axis: frequency (number of database releases), 0 to 30]
average: ~9.5% of FICuS parent structures
Sitzmann M, Ihlenfeldt WD, Nicklaus MC. J Comput Aided Mol Des. 2010 Jun;24(6-7):521-51.
Ring-chain tautomerism in a real database
Most commonly applicable ring-chain tautomerism rules in the AMS Database
SMIRKS rule     Count        %
3-exo-Trig      65,435       0.31
4-exo-Trig      10,560       0.05
5-exo-Trig      7,506,722    35.09   (most common)
6-exo-Trig      5,289,114    24.72
7-exo-Trig      4,185,292    19.56
5-exo-Dig       179,567      0.84
6-exo-Dig       472,074      2.21
7-exo-Dig       3,293,445    15.4
5-endo-Trig     169,007      0.79
6-endo-Trig     156,371      0.73
7-endo-Trig     65,239       0.3
Imine/amine tautomerism (conflict 5)
[Figure: 1H and 13C NMR spectra of samples 5_1 and 5_2]
Same 1H and 13C NMR spectrum between samples. Assignment of chemical shifts indicates imine form is present in both samples.
InChI=1S/C14H11FN6O/c1-22-7-11-12(8-2-4-9(15)5-3-8)14-19-18-10(6-16)13(17)21(14)20-11/h2-5H,7,17H2,1H3
InChIKey=ZUXILZLWKBIETJ-UHFFFAOYSA-N
InChI=1S/C14H11FN6O/c1-22-7-11-12(8-2-4-9(15)5-3-8)14-19-18-10(6-16)13(17)21(14)20-11/h2-5,17,19H,7H2,1H3
InChIKey=ZDHCKDOWMHOUPE-UHFFFAOYSA-N
InChI/InChIKey Resolver
• “loose coupling” of InChI resolvers provided by different organizations
• central list of resolvers
• each resolver must provide a specific protocol.
Evan Bolton (NCBI, NLM, NIH)
Valery Tkachenko (RSC/ChemSpider)
Marc Nicklaus (CADD Group, NCI, NIH)
Steven Bachrach (Trinity University)
Antony Williams (RSC/ChemSpider)
Markus Sitzmann (CADD Group, NCI, NIH)