Comparing Cahn-Ingold-Prelog Rule Implementations: The Need for an Open CIP

John Mayfield, Daniel Lowe, Roger Sayle

ACS Fall 2017, Washington, D.C.

Post on 02-Dec-2021

Page 2

“The Cahn–Ingold–Prelog (CIP) sequence rules … are a standard process used in organic chemistry to completely and unequivocally name a stereoisomer of a molecule.” - Wikipedia

Page 3

If you are not naming stereoisomers, you (probably) don’t want to use CIP.

Tools can give different answers. What can we do about it?

Page 4

NUMBER OF STEREOCENTRES PER ENTRY

[Figure: stacked bar chart showing, for each dataset, the percentage of entries with 0, 1, 2, ... 9+ stereocentres. Axes: % of Dataset, Dataset; legend: Count.]

eMolecules (2017-Jun-01): 14 million total
PubChem Substance: 234 million total
PubChem Compound (Aug 17): 93 million total
ChEMBL 23: 1.7 million total
ChEBI 154: 95 thousand total

Page 5

Many chemists are taught the CIP rules during their education, and the rules are deceptively simple

‣ Simple cases are easy for a human (and computers)
‣ Complex cases are hard for a human (and computers)

The IUPAC Blue Book (2013) extends the recommendations but is incomplete (and contains some mistakes)

Page 6

The Sequence RULES (in essence)

Rule 1
  a. Higher atomic number precedes lower
  b. An atom node duplicated closer to the root ranks higher than one duplicated further
Rule 2
  Higher atomic mass number precedes lower
Rule 3
  Z precedes E and this precedes nonstereogenic (nst) double bonds
Rule 4
  a. Chiral stereogenic units precede pseudoasymmetric stereogenic units and these precede nonstereogenic units (R = S > r = s > nst)
  b. When two ligands have different descriptor pairs, the one with the first chosen like descriptor pairs has priority over the one with a corresponding unlike descriptor pair
  c. r precedes s
Rule 5
  An atom or group with descriptor R has priority over its enantiomorph S
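The rules form a hierarchy: a later rule is only consulted for ligands that every earlier rule has left tied. The control flow can be sketched as below, assuming each rule has already been reduced to a sort key per ligand. The function name `rank_hierarchically`, the ligand records and the hand-written keys are illustrative assumptions and do not come from any of the tools discussed here.

```python
from itertools import groupby

def rank_hierarchically(ligands, rules):
    """Refine tied groups rule by rule; return (groups, last rule that split a tie)."""
    groups = [list(ligands)]          # ligands in the same group are still tied
    deciding_rule = None
    for name, key in rules:
        if all(len(group) == 1 for group in groups):
            break                     # fully ranked, later rules are not needed
        refined = []
        for group in groups:
            ordered = sorted(group, key=key, reverse=True)
            refined.extend(list(run) for _, run in groupby(ordered, key=key))
        if len(refined) > len(groups):
            deciding_rule = name
        groups = refined
    return groups, deciding_rule

# Hypothetical ligands of one stereocentre: OH, CH3, CD3 and H.
# "z" and "mass" are hand-written sphere-by-sphere keys (illustrative only).
ligands = [
    {"name": "CH3", "z": [[6], [1, 1, 1]], "mass": [[12], [1, 1, 1]]},
    {"name": "H",   "z": [[1], []],        "mass": [[1], []]},
    {"name": "CD3", "z": [[6], [1, 1, 1]], "mass": [[12], [2, 2, 2]]},
    {"name": "OH",  "z": [[8], [1]],       "mass": [[16], [1]]},
]
rules = [
    ("Rule 1a", lambda lig: lig["z"]),     # higher atomic number precedes lower
    ("Rule 2",  lambda lig: lig["mass"]),  # higher mass number precedes lower
]

groups, deciding_rule = rank_hierarchically(ligands, rules)
# Every group is a singleton here, so the first member is the only member.
order = [group[0]["name"] for group in groups]
print(order, deciding_rule)
```

Recording the last rule that split a tie is what makes per-rule statistics possible: CH3 and CD3 stay tied under Rule 1a, so this hypothetical centre would be counted under Rule 2.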

Page 7

Example

[Figure: 2-butanol with numbered atoms and its digraph, with spheres (i) and (ii) marked.]

1. In sphere (i), C2 and C5 are tied: O > C5 = C2 > H
2. In sphere (ii), C2 and C5 are split: C,H,H > H,H,H, and therefore C2 > C5
3. The priority is 4, 2, 5, 6 and the configuration is S
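The sphere-by-sphere comparison in this example can be sketched in a few lines. This is a simplified Rule 1a only: it assumes an acyclic molecule with single bonds (so no duplicate nodes are needed), and it pools all atoms of a sphere into one sorted list instead of keeping the per-branch sets a full implementation needs. The atom numbering is inferred from the example (atom 3 taken as the stereocentre), and `neighbours` and `rule_1a_key` are hypothetical helper names.

```python
Z = {"H": 1, "C": 6, "O": 8}  # atomic numbers

# 2-butanol, numbered as in the example: C1-C2-C3(-O4)(-C5)-H6.
ELEMENT = {1: "C", 2: "C", 3: "C", 4: "O", 5: "C", 6: "H"}
BONDS = [(1, 2), (2, 3), (3, 4), (3, 5), (3, 6)]
# Hydrogens that are not listed as atoms are stored as counts.
IMPLICIT_H = {1: 3, 2: 2, 3: 0, 4: 1, 5: 3, 6: 0}

def neighbours(atom):
    """Explicit atoms bonded to `atom`."""
    return [b if a == atom else a for a, b in BONDS if atom in (a, b)]

def rule_1a_key(root, start):
    """Atomic numbers per sphere along the branch root -> start, highest first."""
    key = [[Z[ELEMENT[start]]]]            # sphere (i): the ligand atom itself
    frontier = [(start, root)]             # (atom, parent) pairs
    while frontier:
        sphere, next_frontier = [], []
        for atom, parent in frontier:
            children = [n for n in neighbours(atom) if n != parent]
            sphere += [Z[ELEMENT[c]] for c in children] + [1] * IMPLICIT_H[atom]
            next_frontier += [(c, atom) for c in children]
        key.append(sorted(sphere, reverse=True))
        frontier = next_frontier
    return key

root = 3
ranked = sorted(neighbours(root), key=lambda a: rule_1a_key(root, a), reverse=True)
print(ranked)
print(rule_1a_key(root, 2)[1], ">", rule_1a_key(root, 5)[1])
```

Python compares the nested lists lexicographically, which mirrors the "compare sphere (i) first, then sphere (ii)" order of the example. Turning the ranking into an R or S label additionally needs the spatial arrangement, which this sketch does not model.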

Page 8

DIGRAPHS

• Rules are applied to hierarchical directed acyclic graphs (digraphs)

• Comparison proceeds in “spheres” out from the root of the graph

• Combinatorial explosions for some structures

[Figure: a structure with atoms numbered 1 to 7 and its digraph rooted at atom 1; duplicate nodes are shown in parentheses, e.g. (1).]
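To see where the combinatorial explosion comes from, the sketch below counts digraph nodes for two hypothetical carbon skeletons: a six-membered ring and a "ladder" of fused four-membered rings. It assumes single bonds only, omits hydrogens, and adds one duplicate node (not expanded further) whenever a branch reaches an atom that is already on its path from the root. The helpers `ring`, `ladder` and `digraph_size` are illustrative names and are not taken from any toolkit.

```python
def ring(n):
    """Adjacency lists for a simple n-membered ring."""
    return {i: [(i - 1) % n, (i + 1) % n] for i in range(n)}

def ladder(columns):
    """Adjacency lists for two rails of atoms joined by a rung at every column."""
    adj = {atom: [] for atom in range(2 * columns)}

    def bond(a, b):
        adj[a].append(b)
        adj[b].append(a)

    for i in range(columns):
        bond(2 * i, 2 * i + 1)                  # rung
        if i + 1 < columns:
            bond(2 * i, 2 * i + 2)              # first rail
            bond(2 * i + 1, 2 * i + 3)          # second rail
    return adj

def digraph_size(adj, root):
    """Number of nodes (including duplicates) in the digraph rooted at `root`."""
    def expand(atom, parent, on_path):
        size = 1
        for nbr in adj[atom]:
            if nbr == parent:
                continue
            if nbr in on_path:
                size += 1                       # ring closure: duplicate leaf node
            else:
                size += expand(nbr, atom, on_path | {nbr})
        return size
    return expand(root, None, frozenset([root]))

print("6-ring:", digraph_size(ring(6), 0))
sizes = [digraph_size(ladder(n), 0) for n in range(2, 9)]
print("ladders with 2..8 columns:", sizes)
```

Every digraph node corresponds to one distinct path from the root, so each additional fused ring multiplies the number of ways to continue a path, which is why the ladder counts grow so quickly.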

Page 9

PSEUDO-ASYMMETRY

Some confusion of lower case r and s

• Assigned only when Rule 5 has been used
• Not indication of non-constitutional

Why? Reflection is superimposable:

Page 10

AUXILIARY DESCRIPTORS

Auxiliary descriptors are used to split ties in symmetric molecules by labelling the asymmetric digraphs

Tie in initial digraph

Calculate auxiliary descriptors

R > S (Rule 5)

3: r

Picture: May, J. W. (2015). Cheminformatics for genome-scale metabolic reconstructions (doctoral thesis).

Page 11

mancude ring handling

P-92.1.4.4 Nomenclature of Organic Chemistry: IUPAC Recommendations and Preferred Names 2013

Different Kekulé forms can result in different digraphs

Handled using fractional atomic numbers
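As a rough illustration of fractional atomic numbers: the duplicate node for an atom in a mancude ring is given the mean atomic number of the atoms it would be double-bonded to across the Kekulé forms, so the result no longer depends on which form was drawn. The sketch below assumes the double-bond partners have already been listed by hand, one per Kekulé form. `duplicate_atomic_number` is a hypothetical helper, and the two examples were chosen for illustration and are not taken from the slides.

```python
from fractions import Fraction

Z = {"C": 6, "N": 7}  # atomic numbers

def duplicate_atomic_number(partners):
    """Mean atomic number over the double-bond partner in each Kekulé form."""
    return Fraction(sum(Z[p] for p in partners), len(partners))

# A benzene carbon is double-bonded to a carbon in both Kekulé forms.
benzene_c = duplicate_atomic_number(["C", "C"])
# Pyridine C2 is double-bonded to N1 in one form and to C3 in the other.
pyridine_c2 = duplicate_atomic_number(["N", "C"])

print(benzene_c, pyridine_c2)
```

Using `Fraction` rather than floats keeps comparisons between duplicate nodes exact, which matters when two branches differ only by such a fractional value.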


Page 13

MAJORITY HANDLED BY RULE 1a

Columns: ChEBI, ChEMBL, eMolecules, PubChem Compound, PubChem Substance

Rule 1a 281K 99.6% 1.8M 98.6% 2.4M 97.0% 53.5M 100.0% 93.1M 98.7%
Rule 1b 4 1 164 255
Rule 2 14 3,565 6,789
Rule 3 29 3 441 36 45
Rule 4a 122 126 273 4 12,770
Rule 4b 563 0.2% 4,037 0.2% 3,188 0.1% 125K 0.1%
Rule 4c 19 558
Rule 5 285 0.1% 23.4K 1.2% 69K 2.8% 15 1.1M 1.2%
Total 282K 1.9M 2.4M 53.5M 94.3M

Count is number of stereocentres; values of zero and percentages close to zero removed to reduce complexity

Page 14

[Figure: percentage of stereocentres against sphere (distance from root, 1 to 10) for chebi_154, chembl_23, eMolecules170601, pubchem and pubchem_substance.]

Majority (but not all) stereocentres labelled within first few spheres

Best to generate digraph lazily as required

Some digraphs are far too big to generate fully (e.g. fullerenes)
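One way to generate the digraph lazily is a generator that yields one sphere at a time, so that a comparison can stop as soon as a difference is found. The sketch below is an illustration under stated assumptions: the `ladder` skeleton is a hypothetical stand-in for a large polycycle, all bonds are single, hydrogens are omitted, and `spheres` is an illustrative name.

```python
from itertools import islice

def ladder(columns):
    """Hypothetical skeleton: two rails of atoms joined by a rung at every column."""
    adj = {atom: [] for atom in range(2 * columns)}
    for i in range(columns):
        pairs = [(2 * i, 2 * i + 1)]                                # rung
        if i + 1 < columns:
            pairs += [(2 * i, 2 * i + 2), (2 * i + 1, 2 * i + 3)]   # rails
        for a, b in pairs:
            adj[a].append(b)
            adj[b].append(a)
    return adj

def spheres(adj, root):
    """Lazily yield the digraph sphere by sphere as lists of (atom, is_duplicate)."""
    frontier = [(root, None, frozenset([root]), False)]
    while frontier:
        yield [(atom, dup) for atom, _, _, dup in frontier]
        next_frontier = []
        for atom, parent, path, dup in frontier:
            if dup:
                continue                        # duplicate nodes are leaves
            for nbr in adj[atom]:
                if nbr == parent:
                    continue
                if nbr in path:                 # ring closure: duplicate node
                    next_frontier.append((nbr, atom, path, True))
                else:
                    next_frontier.append((nbr, atom, path | {nbr}, False))
        frontier = next_frontier

big = ladder(40)
# Only the first three spheres (root included) are ever built.
first_three = list(islice(spheres(big, 0), 3))
print([len(sphere) for sphere in first_three])
```

Because nothing beyond the requested spheres is expanded, the cost depends on how deep the comparison has to go and not on the size of the full digraph, which for this skeleton grows exponentially with the number of columns.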


Page 15

comparison

Page 16

Rule 1A

[Figure: test structures I and II]

Tool                          I    II
Centres 2.0                   R    R
JMol 14.20.3                  R    R
ACD/ChemSketch 14.05beta      R    R
Balloon 1.6.5beta             R    R
KnowItAll ChemWindow 2018     R    R
ChemDraw 16.0                 R    R
BIOVIA Draw 2017              R    R
MarvinSketch 17.17            R    -
Indigo 1.3.0Beta.r16          -    R
RDKit 2017.03.03              S    R
DataWarrior 4.6.0             R    R
CACTVS (NCI Resolver Aug 17)  R    R
OPSIN 2.3.1                   R    R
LexiChem (OEChem) 20170613    R    R
ChemDoodle 7.0.2              R    R
CDK 2.0                       -    R
JUMBO 6                       R    -

Page 17

Rule 1B

Tool                          Label
Centres 2.0                   R
JMol 14.20.3                  R
ACD/ChemSketch 14.05beta      R
Balloon 1.6.5beta             R
KnowItAll ChemWindow 2018     R
ChemDraw 16.0                 R
BIOVIA Draw 2017              -
MarvinSketch 17.17            -
Indigo 1.3.0Beta.r16          -
RDKit 2017.03.03              R
DataWarrior 4.6.0             -
CACTVS (NCI Resolver Aug 17)  -
OPSIN 2.3.1                   R
LexiChem (OEChem) 20170613    -
ChemDoodle 7.0.2              -
CDK 2.0                       -
JUMBO 6                       -

Page 18

Rule 2

Tool                      Jan 2015  Aug 2017
Centres                   R         R
JMol                      n/a       R
ACD/ChemSketch            R         R
Balloon 1.6.5beta         n/a       R
KnowItAll ChemWindow      n/a       R
ChemDraw                  S         S
Accelrys/BIOVIA Draw      S         R
MarvinSketch              S         S
Indigo                    R         R
RDKit                     S         S
DataWarrior               S         S
CACTVS                    S         R
OPSIN                     R         R
LexiChem (OEChem)         S         R
ChemDoodle                S         n/a
CDK                       S         S
JUMBO                     -         -

R or S? Let’s Vote: https://nextmovesoftware.com/blog/2015/01/21/r-or-s-lets-vote/
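The two columns can be tallied directly. The snippet below re-enters the Rule 2 answers exactly as given in the table for this slide ("n/a" and "-" are kept as they appear there) and counts them. Nothing beyond the table is assumed, and the variable names are arbitrary.

```python
from collections import Counter

# (Jan 2015, Aug 2017) answers for the Rule 2 test structure, copied from the table.
results = {
    "Centres": ("R", "R"),
    "JMol": ("n/a", "R"),
    "ACD/ChemSketch": ("R", "R"),
    "Balloon": ("n/a", "R"),
    "KnowItAll ChemWindow": ("n/a", "R"),
    "ChemDraw": ("S", "S"),
    "Accelrys/BIOVIA Draw": ("S", "R"),
    "MarvinSketch": ("S", "S"),
    "Indigo": ("R", "R"),
    "RDKit": ("S", "S"),
    "DataWarrior": ("S", "S"),
    "CACTVS": ("S", "R"),
    "OPSIN": ("R", "R"),
    "LexiChem (OEChem)": ("S", "R"),
    "ChemDoodle": ("S", "n/a"),
    "CDK": ("S", "S"),
    "JUMBO": ("-", "-"),
}

votes_2015 = Counter(old for old, _ in results.values())
votes_2017 = Counter(new for _, new in results.values())
# Tools whose answer switched between R and S from one survey to the next.
changed = sorted(tool for tool, (old, new) in results.items()
                 if {old, new} == {"R", "S"})

print(votes_2015)
print(votes_2017)
print(changed)
```

The tally shows the vote moving from an S majority in January 2015 to an R majority in August 2017, with three tools switching from S to R.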

Page 19

Rule 4b

[Figure: test structure; stereocentre labels S, S, S, R]

Tool                          Label
Centres 2.0                   R
JMol 14.20.3                  R
ACD/ChemSketch 14.05beta      R
Balloon 1.6.5beta             R
KnowItAll ChemWindow 2018     R
ChemDraw 16.0                 R
BIOVIA Draw 2017              R
MarvinSketch 17.17            R
Indigo 1.3.0Beta.r16          R
RDKit 2017.03.03              S
DataWarrior 4.6.0             S
CACTVS (NCI Resolver Aug 17)  S
OPSIN 2.3.1                   -
LexiChem (OEChem) 20170613    -
ChemDoodle 7.0.2              s
CDK 2.0                       -
JUMBO 6                       -

Page 20

MANCUDE RINGS

[Figure: test structures I and II]

Tool                          I    II
Centres 2.0                   R    R
JMol 14.20.3                  R    R
ACD/ChemSketch 14.05beta      R    R
Balloon 1.6.5beta             R    R
KnowItAll ChemWindow 2018     R    R
ChemDraw 16.0                 R    R
BIOVIA Draw 2017              R    R
MarvinSketch 17.17            R    R
Indigo 1.3.0Beta.r16          S    R
RDKit 2017.03.03              R    R
DataWarrior 4.6.0             R    R
CACTVS (NCI Resolver Aug 17)  S    R
OPSIN 2.3.1                   S    R
LexiChem (OEChem) 20170613    S    R
ChemDoodle 7.0.2              S    R
CDK 2.0                       S    R
JUMBO 6                       S    S

Page 21

AUX DESCRIPTORS

Tool                          Label
Centres 2.0                   R
JMol 14.20.3                  R
ACD/ChemSketch 14.05beta      R
Balloon 1.6.5beta             R
KnowItAll ChemWindow 2018     R
ChemDraw 16.0                 R
BIOVIA Draw 2017              R
MarvinSketch 17.17            -
Indigo 1.3.0Beta.r16          -
RDKit 2017.03.03              -
DataWarrior 4.6.0             -
CACTVS (NCI Resolver Aug 17)  -
OPSIN 2.3.1                   -
LexiChem (OEChem) 20170613    -
ChemDoodle 7.0.2              -
CDK 2.0                       -
JUMBO 6                       -

Page 22

hard to implement A

MarvinSketch 17.17

[Figure: the same structure drawn twice; one stereocentre is labelled (S) in one drawing and (R) in the other.]

Turning aromaticity on flips stereochemistry (e.g. CHEBI:16063)

[Figure: the same structure with three different atom numberings (1 to 12), each giving different R/S/r/s labels.]

Labels depend on input order

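Page 23

hard to implement B

ChemDraw 16.0

[Figure: three structures whose stereocentre carries chains of (CH2)2 and CH2, (CH2)11 and (CH2)10, and (CH2)17 and (CH2)16; the first two centres are labelled (R), the third has no label.]

Becomes undefined at distance ≥ 16

[Figure: two further structures, with (CH2)2 and (CH2)11 chains, carrying the labels (R), (s), (R) and (r), (s), (R).]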

Page 24

open cip?

Why?
• Provide a blessed implementation that can be used directly or compared against
• Toolkit agnostic library to facilitate downstream integration

Page 25

“FIX-CIP” COLLABORATION

Robert Hanson (JMol), John Mayfield (Centres), Mikko Vainio (Balloon), Andrey Yerin (ACD/Name), Sophia Gillian Musacchio (St. Olaf College)

Goals
• Discuss and resolve software inconsistencies
• Generate comprehensive test set based on BlueBook structure
• Recommend rule amendments and additions

Publication in preparation

Page 26

should you use CIP?

Yes
  Systematic nomenclature
  Human conversation (if no pen is handy)

Probably not (better algorithms exist)
  Unique labelling (see right)
  Compute “conversation”
  Finding/cleaning stereocentres

No
  Relative comparison, e.g. substructure search

Page 27

[Figure: the same structure shown twice, each with eight stereocentres labelled (R) or (S).]

Page 28

acknowledgements

Robert Hanson (JMol)
Mikko Vainio (Balloon)
Andrey Yerin (ACD/Name)
Sophia Gillian Musacchio (St. Olaf College)
Karl Nedwed (Bio-Rad)
Noel O’Boyle (NextMove Software)
Shuzhe Wang (NextMove Software)

SciMix Poster

Comparing Cahn-Ingold-Prelog Rule Implementations: the need for open-cip

John Mayfield, Daniel Lowe and Roger Sayle
NextMove Software Ltd, Cambridge, UK.

NextMove Software Limited
Innovation Centre (Unit 23)
Cambridge Science Park
Milton Road, Cambridge
UK CB4 0EY
www.nextmovesoftware.com

Introduction

The Cahn-Ingold-Prelog (CIP) priority rules rank atoms around a stereogenic unit to assign a stereo-descriptor that is invariant to atom order and layout, for example R (right) or S (left) for tetrahedral atoms.

A directed acyclic graph (digraph) is constructed for each stereogenic unit, and the out edges from the root node are compared and ranked according to eight sequence rules[1]. Each rule is applied exhaustively and tested on the entire digraph before applying the next rule[2].

Conclusion

The CIP sequence rules provide a standard way for chemists to effectively describe the configurations of most stereogenic units. However, beyond simple cases the complexity of the rules necessitates the use of software as an aid to naming configurations. The results demonstrate that, even then, software implementations do not all agree on the configuration.

Through the results presented here and the on-going effort of the Fix CIP collaboration, software should aim to converge upon consistent stereochemistry naming. An Open CIP software tool could provide “blessed” stereochemistry configuration names and provide a standard algorithm implementation for other vendors to integrate or adapt.

Acknowledgements

Robert Hanson, Andrey Yerin, Mikko Vainio, and Sophia Gillian Musacchio for initiating and participating in the “Fix CIP” collaboration and the many in-depth technical discussions that have led to improvements in the tools. Karl Nedwed for providing KnowItAll results. Philip Skinner for providing ChemDraw licenses. Noel O’Boyle for feedback and suggestions.

Bibliography

1. P-92.1.3 Nomenclature of Organic Chemistry: IUPAC Recommendations and Preferred Names 2013
2. Paulina Mata. The CIP System Again: Respecting Hierarchies Is Always a Must. J. Chem. Inf. Comput. Sci., 1999, 39 (6)

Results

Eight sequence rules (in essence)

Rule 1
  a. Higher atomic number precedes lower
  b. An atom node duplicated closer to the root ranks higher than one duplicated further
Rule 2
  Higher atomic mass number precedes lower
Rule 3
  Z precedes E and this precedes nonstereogenic (nst) double bonds
Rule 4
  a. Chiral stereogenic units precede pseudoasymmetric stereogenic units and these precede nonstereogenic units (R = S > r = s > nst)
  b. When two ligands have different descriptor pairs, the one with the first chosen like descriptor pairs has priority over the one with a corresponding unlike descriptor pair
  c. r precedes s
Rule 5
  An atom or group with descriptor R has priority over its enantiomorph S

Stereochemistry in Databases

[Figure: stacked bar chart of the number of stereogenic units per record (count 0 to 9+) as a percentage of each dataset.]

eMolecules (June 2017): 14 million records
PubChem Substance: 234 million records
PubChem Compound (Aug 2017): 93 million records
ChEMBL 23: 1.7 million records
ChEBI 154: 95 thousand records

The number of defined stereogenic units per molecule varies between databases.

Applying Rule 1a to the digraph for 2-butanol ranks the out edges from the root as 4 > 2 > 5 > 6, giving the label S (4, 2, 5 run anticlockwise when looking towards 6).

Columns: ChEBI, ChEMBL, eMolecules, PubChem Compound^1, PubChem Substance

Rule 1a 281K 99.6% 1.8M 98.6% 2.4M 97.0% 53.5M 100.0% 93.1M 98.7%

Rule 1b 4 1 164 255

Rule 2 14 3,565 6,789

Rule 3 29 3 441 36 45

Rule 4a 122 126 273 4 12,770

Rule 4b 563 0.2% 4,037 0.2% 3,188 0.1% 125K 0.1%

Rule 4c 19 558

Rule 5 285 0.1% 23.4K 1.2% 69K 2.8% 15 1.1M 1.2%

Total 282K 1.9M 2.4M 53.5M 94.3M

The majority of stereogenic units are constitutionally asymmetric and can be ranked using Rule 1a. However, in some datasets the number of stereogenic units requiring Rules 4b and 5 can be significant (for example, Rule 5 is needed for 2.8% of stereogenic units in eMolecules).

Tool                          I    II   III  IV   V    VI   VII  VIII IX   X    XIa  XIb  XII  XIII
Centres 2.0                   R    R    R    R    R    R    R    R    R    r    R    R    r    R
JMol 14.20.3                  R    R    R    R    R    R    R    R    R    r    R    R    r    R
ACD/ChemSketch 14.05beta      R    R    R    R    R    R    R    R    R    r    R    R    r    R
Balloon 1.6.5beta             R    R    R    R    R    R    R    R    R    r    R    R    r    R
KnowItAll ChemWindow 2018     R    R    R    R    R    R    R    R    R    r    R    R    r    R^5
ChemDraw 16.0                 R    R    R    R    S    R    R    R    R    r    R    R    r    R
BIOVIA Draw 2017              R    R    R    -    R    R    R    R    R    -^1  R    R    -^1  R
MarvinSketch 17.17            R    -    -    -    S    R    -    R    -    r    R    R    r    -
Indigo 1.3.0Beta.r16          -^2  R    -    -    R    -    R    R    R    r    S    R    -    -
RDKit 2017.03.03              S    R    S    R    S    R    R    S    R    R    R    R    -    -
DataWarrior 4.6.0             R    R    R    -    S    R    R    S    R    R    R^3  R    -    -
CACTVS (NCI Resolver Aug 17)  R    R    S    -    S^4  R    R    S    R    R    S    R    -    -
OPSIN 2.3.1                   R    R    R    R    R    -    -    -    -    -    S    R    -    -
LexiChem (OEChem) 20170613    R    R    -    -    R    -    -    -    -    -    S    R    -    -
ChemDoodle 7.0.2              R    R    -    -    S    -    -    s    -    r    S    R    -    -
CDK 2.0                       -    R    R^5  -    S    -    -    -    -    -    S    R    -    -
JUMBO 6                       R    -    S    -    -    -    -    -    -    -    S    S    -    -

Column groups: Constitutional (Rule 1a, 1b, 2); Geometrical + Topographical (Rule 3, 4a, 4b, 4c, 5); Special (Mancude, Aux Descriptors)

1. Pseudoasymmetric r/s labels not displayed but must be calculated due to answers given for IX and XIII
2. Runtime error occurs
3. Impossible to test as different Kekulé forms are normalised
4. R in CACTVS since Feb 2015, NCI resolver is old version
5. Other descriptor is assigned differently

A set of fourteen structures was collected to identify differences between software implementations. The structures were selected to cover all the sequence rules and their applications to special cases.

Fix CIP Collaboration

Since submitting this work for presentation, the developers of Centres, JMol, ACD/ChemSketch, and Balloon have begun a collaboration. We are in the process of submitting for publication an extended in-depth validation set and proposing sequence rule refinements and additions where they are required.

1. As part of PubChem Compound’s processing, non-constitutional stereochemistry is removed: for example, the nine stereoisomers of inositol are all represented by CID 892.

Atoms connected by double and triple bonds as well as ring closures result in duplicated nodes in the digraph. In the structure below atoms 5 and 6 appear twice and atom 1 (the root) appears three times.

Due to this duplication, complex ring systems can generate exponentially large digraphs that are not computationally tractable. Further complexity in digraphs is caused by the use of fractional atomic numbers in mancude ring-systems and assignment of auxiliary descriptors for applying Rules 3-5.

[Figure: a structure with atoms numbered 1 to 7 and its digraph rooted at atom 1; duplicate nodes are shown in parentheses, e.g. (1).]

[Figure: 2-butanol with numbered atoms and its digraph.]