Extracting Synthetic Knowledge from Reaction Databases - ARChem at the 246th ACS

Extracting Synthetic Knowledge from Reaction Databases. Orr Ravitz, SimBioSys Inc., 246th ACS National Meeting.

Upload: simbiosysinc

Post on 09-Jun-2015


DESCRIPTION

Underpinning the computer-aided synthesis design system, ARChem, are algorithms that extract synthetic knowledge from large reaction databases. These slides discuss the generation of reaction rules that facilitate retrosynthetic analysis, as well as the extraction of information about expected yields, regioselectivity, functional group compatibility, and stereochemistry.

TRANSCRIPT

Page 1

Extracting Synthetic Knowledge from Reaction Databases

Orr Ravitz

SimBioSys Inc.

246th ACS National Meeting

Page 2

ARChem – main concepts

A computer-aided synthesis design system.

The Approach:

Comprehensive rule- and precedent-based retrosynthetic analysis back to available starting materials.

Automated rule generation with manual rule curation.

Generate many alternatives.

Provide supporting literature examples.

Allow user guidance and control.

Page 3

Solution Display

Page 4

Exploring Alternative Paths

Page 5

Supporting Examples

Page 6

Chemical Interference

Functional groups that may interfere with transformations are highlighted.

Page 7

Functional Group Tolerance

A breakdown of the example set by the functional groups present beyond the reaction center provides evidence for compatibility.

Examples can be exported to the database's web interface for further analysis.

Page 8

Stereochemistry

Currently: exact matches, starting materials

Coming soon: rule-based

Page 9

Essential Information

Automated extraction of knowledge

Reaction rules

Yield values

Chemical interference - functional group tolerance

Regioselectivity

Stereochemistry

Data → (perceive) → Information → (generalize) → Knowledge

Page 10

System Design

Diagram components:

Reactions

Reaction Rules

Starting Materials

Expert Knowledge-bases

Target

Page 11

Rule Extraction

Reactions → Reaction Rules

Source reactions: esterification examples; other examples (··· → ···, ··· → ···, ··· → ···)

Reaction rules: esterification rule; other rule (··· → ···)

Page 12

Reaction Perception

Reactions → Reaction Rules

Source reaction: reaction file with atom mapping

Extracted core: atoms attached to bonds changed, made or broken in the reaction

Extended core: includes all structural motifs that are essential for the reaction to occur
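The extracted core described on this slide can be sketched in a few lines. This is a minimal illustration, not ARChem's implementation: it assumes the atom-mapped reaction has already been parsed into tables that map a pair of atom-map numbers to a bond order, and the names `extracted_core` and `bonds`, as well as the toy esterification data, are hypothetical.

```python
def extracted_core(reactant_bonds, product_bonds):
    """Return the atom-map numbers attached to bonds changed, made or broken.

    Both arguments map frozenset({map_i, map_j}) -> bond order (an assumed
    input format; a real system would read this from a mapped reaction file).
    """
    core = set()
    for bond in set(reactant_bonds) | set(product_bonds):
        # A missing bond is treated as order 0, so made and broken bonds
        # are detected the same way as bond-order changes.
        if reactant_bonds.get(bond, 0) != product_bonds.get(bond, 0):
            core |= bond
    return core


def bonds(*triples):
    """Build a bond table from (atom_i, atom_j, order) triples."""
    return {frozenset((i, j)): order for i, j, order in triples}


# Toy mapped esterification: acid C4-C1(=O2)-O3 reacts with alcohol O5-C6.
reactants = bonds((4, 1, 1), (1, 2, 2), (1, 3, 1), (5, 6, 1))
products = bonds((4, 1, 1), (1, 2, 2), (1, 5, 1), (5, 6, 1))

print(sorted(extracted_core(reactants, products)))  # [1, 3, 5]
```

Only the C1-O3 bond (broken) and the C1-O5 bond (made) differ, so the core is atoms 1, 3 and 5. The carbonyl oxygen (atom 2) is not part of the extracted core; capturing it is the job of the extended core.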

Page 13

Extending the Core: Passengers vs Drivers

The goal of chemical perception is to discriminate between structural features that are essential for the reaction, and those that are passengers.

Shell-based approach: 1st shell, 2nd shell

Graph-based methods are inappropriate.
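For reference, a shell-based extension simply adds every atom within a fixed number of bonds of the core. The sketch below shows that idea only; the adjacency-list input and the function name `shells` are assumptions made for illustration, and nothing here reflects how ARChem chooses which atoms to keep.

```python
def shells(adjacency, core, n_shells):
    """Grow a core outward by bond distance; returns one set of atoms per shell."""
    seen = set(core)
    frontier = set(core)
    result = []
    for _ in range(n_shells):
        # Next shell: neighbours of the current frontier not visited yet.
        frontier = {nbr for atom in frontier for nbr in adjacency[atom]} - seen
        result.append(frontier)
        seen |= frontier
    return result


# Toy chain 1-2-3-4-5 with atom 3 as the reaction core (hypothetical data).
adjacency = {1: [2], 2: [1, 3], 3: [2, 4], 4: [3, 5], 5: [4]}
first, second = shells(adjacency, {3}, 2)
print(sorted(first), sorted(second))  # [2, 4] [1, 5]
```

A fixed shell count treats every neighbour alike, so it cannot by itself tell a passenger substituent from a driver that sits further from the core.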

Page 14

Mechanism-Dependent Core Extension

Nucleophilic aromatic substitution:

Addition/elimination mechanism: requires a π-acceptor group in the ortho or para position

Via organometallic intermediate

Page 15

Rule Extraction

Reactions → Reaction Rules

Similar extended cores

Completed reaction rule

Common extracted core

Nucleofuge (NF) - a leaving group which carries away the bonding electron pair.

Generalized rule

Generalized group (NF) is replaced by the most common group.
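The last step on this slide, replacing the generalized group by its most common instance, amounts to a frequency count over the examples that share a core. A minimal sketch follows, assuming each example has already been reduced to a record with a `nucleofuge` label; that record layout and the function name are hypothetical.

```python
from collections import Counter


def generalize_leaving_group(examples):
    """Pick the most common nucleofuge among examples sharing an extracted core."""
    counts = Counter(example["nucleofuge"] for example in examples)
    group, _ = counts.most_common(1)[0]
    return group, counts


# Hypothetical examples with similar extended cores.
examples = [
    {"nucleofuge": "Cl"},
    {"nucleofuge": "Br"},
    {"nucleofuge": "Cl"},
    {"nucleofuge": "OTs"},
]
group, counts = generalize_leaving_group(examples)
print(group)  # Cl
```

Keeping the full `counts` alongside the chosen group preserves the evidence for the less common leaving groups.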

Page 16

Interfering Functionality

Following rule abstraction, compatible functionality is detected by examining the examples:

Compatible Interfering

Moieties outside the extended core are listed as compatible.

Other functional groups will be inferred as 'possibly interfering'.

Possibly interfering functionality will be penalized in scoring and highlighted to the user.
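The logic of this slide can be sketched as set arithmetic: groups seen outside the extended core in at least one example are compatible, and anything else on the target is flagged. The functional-group names, the fixed per-group penalty and all function names below are assumptions for illustration; the slides do not state how the scoring penalty is computed.

```python
def compatible_groups(examples):
    """Functional groups seen outside the extended core in any example."""
    seen = set()
    for groups_outside_core in examples:
        seen |= set(groups_outside_core)
    return seen


def possibly_interfering(target_groups, compatible):
    """Groups on the target that no example supports."""
    return sorted(set(target_groups) - compatible)


def penalized_score(base_score, flagged, penalty=0.1):
    """Hypothetical scoring: subtract a fixed penalty per flagged group."""
    return base_score - penalty * len(flagged)


# Each example lists the functional groups found beyond its reaction core.
examples = [{"ester", "aryl halide"}, {"ether"}, {"ester", "nitrile"}]
compatible = compatible_groups(examples)
flagged = possibly_interfering({"ester", "aldehyde", "ether"}, compatible)
print(flagged, penalized_score(1.0, flagged))
```

Here the aldehyde is flagged because no example carries one, while the ester and ether are supported by precedent.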

Page 17

Regioselectivity – Main Steps

Recognize rule’s reaction type – electrophilic substitution, nucleophilic addition etc.

Only reactions prone to regioselectivity are subject to regio calculations.

Identify competing sites

Identify substituents and other structural motifs that may influence the directionality

Collect statistics from example set regarding selectivity in the reaction core as well as elsewhere in the molecule (chemoselectivity)

Assign regioselectivity to rule if predefined statistical requirements are met.


Page 18

Collecting Statistics

Electrophilic aromatic substitutions

For each example in DB:

Evaluate ring activation, including for heteroaromatic rings and fused rings

Evaluate location, type and neighborhood of ring substituents

Identify symmetry

Compute environment signatures that include all aromatic features plus relevant substituents

For each rule:

Cluster reacting vs. non-reacting signature-equivalent sites for reactions with yield > 20%

Define regioselectivity if the ratio of examples is at least 10:1
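The per-rule step can be sketched as a pairwise tally of which site signature reacted in preference to which. This is a simplified reading of the slide, not the actual algorithm: the record layout, the signature labels, the function name, and the choice to require at least ten supporting examples when there is no opposing evidence are all assumptions. Only the 20% yield cut-off and the 10:1 ratio come from the slide.

```python
from collections import Counter

MIN_YIELD = 20.0  # percent; the slide keeps reactions with yield > 20%
MIN_RATIO = 10    # the slide's 10:1 ratio


def regio_preferences(examples):
    """Return {(preferred_site, other_site): supporting_count} for clear cases."""
    wins = Counter()
    for example in examples:
        if example["yield"] <= MIN_YIELD:
            continue  # low-yield examples are not counted as evidence
        for loser in example["unreacted"]:
            wins[(example["reacted"], loser)] += 1
    preferences = {}
    for (winner, loser), count in wins.items():
        opposing = wins.get((loser, winner), 0)
        # Assumption: with no opposing examples, still require MIN_RATIO hits.
        if count >= MIN_RATIO * max(opposing, 1):
            preferences[(winner, loser)] = count
    return preferences


# Hypothetical signatures "para" and "meta" for two competing ring positions.
examples = (
    [{"yield": 84.0, "reacted": "para", "unreacted": ["meta"]}] * 12
    + [{"yield": 60.0, "reacted": "meta", "unreacted": ["para"]}]
    + [{"yield": 5.5, "reacted": "meta", "unreacted": ["para"]}]
)
print(regio_preferences(examples))  # {('para', 'meta'): 12}
```

The 5.5% example is ignored by the yield filter. Had it been counted, the tally would be 12:2 and the 10:1 requirement would no longer be met, which shows how sensitive the outcome is to the quality of the yield data.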

Page 19

Regio Example

X=Cl, 84%

X=Cl, 5.5%: rejected

Misinterpreted yield value provided positive evidence

Page 20

Stereochemistry – the challenge

Efficient machine perception and representation of a broad range of synthetically important stereogenic types, including tetrahedral C, S, N and P, as well as alkenes, allenes and atropisomers

Representation of stereochemical reaction rules and stereochemical strategies

Develop a versatile stereochemical substructure algorithm to support retron matching

Efficient discovery of symmetry in stereochemically defined molecules and rules - avoid duplicate routes

Stereoselectivity is captured inaccurately and inconsistently across common databases.

Page 21

The Data

| Database content | Portion of data | Notes |
|---|---|---|
| Number of unmapped examples | 14% | Reaction type unknown |
| Number of examples belonging to reactions with 5000 or more examples | 4% | Ubiquitous protection / deprotection reactions |
| Number of examples belonging to reactions with 20 or fewer examples | 16% | Bad atom maps (database errors); multistep reaction sequences |
| General usable examples | 65% | |

[Figure: "Examples with quoted selectivity values". x-axis: selectivity metric (yield, cs, de, ee); y-axis: % of database, 0 to 100.]

[Figure: "Examples with selectivity above a threshold". x-axis: threshold selectivity values (> 0%, > 25%, > 50%, > 75%, > 90%, > 95%, > 98%); y-axis: % of available, 0 to 100; series: yield, cs, de, ee.]

Page 22

Stereo-Rules Generation – A Different Approach

Manually code rules for a diverse set of useful enantioselective and generally selective reaction types.

Mine supporting examples from existing large reaction databases to discover reaction scope and limitations for each rule.

Find effective strategies to aid the planning of a stereocontrolled synthesis.

Reactions: Diels-Alder, Sharpless, reduction of C=C, reduction of C=O

Page 23

Designing a Rule-Set

70 reaction types with ee>95% and more than 50 examples

| Reaction type | Bond alterations | Examples with ee ≥95% | Notes |
|---|---|---|---|
| Addition of C nucleophiles to C=C | CH + C=C → CCCH | 1603 | Mostly conjugate additions |
| Reduction of C=O | C=O → HCOH | 1553 | Any type of carbonyl |
| Addition of C nucleophiles to C=O | CH + C=O → CCOH | 1265 | Mostly aldols and alkynylations |
| Reduction of C=C | C=C → HCCH | 1120 | Wide variety of environments |
| Addition of C nucleophiles to C=N | CH + C=N → CCNH | 639 | Any type of C=N |
| Epoxidation of C=C | C=C → C1CO1 | 415 | Sharpless, Jacobsen, Shi etc. |
| Addition via R3B to C=C | C-B + C=C → CCCH | 329 | Mostly conjugate addition to enones |
| Addition via R2Zn to C=O | C-Zn + C=O → CCOH | 306 | |
| Dihydroxylation of C=C | C=C → HOCCOH | 266 | |
| Reduction of C=N | C=N → HCNH | 256 | Any type of C=N |
| Diels-Alder | C=C + C=CC=C → C1CCC=CC1 | 222 | Carbocyclic Diels-Alder |
| Cyclopropanation of C=C | C=N + C=C → C1CC1 | 222 | Via diazo precursor (carbene) |
| Mukaiyama aldol | SiOC=C + C=O → O=CCCOH | 210 | |
| C substitution of Br | CH + CBr → CC | 199 | |
| [2+3] azomethine cycloaddition | C=NCH + C=C → N1CCCC1 | 198 | |
| Addition via R2Zn to C=C | CZn + C=C → CCCH | 162 | Mostly conjugate addition to enones |
| Addition via R3B to C=O | CB + C=O → CCOH | 141 | |
| Oxidation of sulphides | S → S=O | 137 | Chiral sulphoxides |
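The selection criterion behind this table (reaction types with high ee and enough examples) can be sketched as a filtered count. The record layout and function name are assumptions made for illustration; the thresholds follow the slide, using ee ≥ 95% as in the table header.

```python
from collections import Counter


def candidate_reaction_types(examples, min_ee=95.0, min_examples=50):
    """Count high-ee examples per reaction type and keep well-populated types."""
    counts = Counter(
        example["reaction_type"]
        for example in examples
        if example["ee"] is not None and example["ee"] >= min_ee
    )
    # "More than 50 examples" is read as a strict inequality.
    return {rtype: n for rtype, n in counts.items() if n > min_examples}


# Hypothetical database records; ee is None when no value was quoted.
examples = (
    [{"reaction_type": "Reduction of C=O", "ee": 97.0}] * 60
    + [{"reaction_type": "Diels-Alder", "ee": 99.0}] * 10
    + [{"reaction_type": "Epoxidation of C=C", "ee": 80.0}] * 55
    + [{"reaction_type": "Reduction of C=O", "ee": None}] * 5
)
print(candidate_reaction_types(examples))  # {'Reduction of C=O': 60}
```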

Page 24

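Enabling Technology

Perception of stereochemistry in structural diagrams

Stereocenter manipulation and stereo descriptors

Conceptual Model / Stereo Descriptor

Rotations (E + 8C3 + 3C2):

| Label | Op | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| A | E | 1 | 2 | 3 | 4 |
| B | C3^2 | 1 | 3 | 4 | 2 |
| C | C3^1 | 1 | 4 | 2 | 3 |
| D | C2 | 2 | 1 | 4 | 3 |
| E | C3^1 | 2 | 3 | 1 | 4 |
| F | C3^2 | 2 | 4 | 3 | 1 |
| G | C3^2 | 3 | 1 | 2 | 4 |
| H | C3^1 | 3 | 2 | 4 | 1 |
| J | C2 | 3 | 4 | 1 | 2 |
| K | C3^1 | 4 | 1 | 3 | 2 |
| L | C3^2 | 4 | 2 | 1 | 3 |
| M | C2 | 4 | 3 | 2 | 1 |

Reflection:

| Op | 1 | 2 | 3 | 4 |
|---|---|---|---|---|
| s | 2 | 1 | 3 | 4 |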
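The twelve rotations listed on this slide are exactly the even permutations of the four neighbour positions, and the reflection is an odd one. That gives a compact way to compare two orderings of the neighbours around a tetrahedral center. The sketch below illustrates this standard parity argument; it is not ARChem's stereo descriptor, and the function names and atom labels are hypothetical.

```python
from itertools import permutations


def parity(perm):
    """Return 0 for an even permutation, 1 for an odd one (inversion count)."""
    inversions = sum(
        1
        for i in range(len(perm))
        for j in range(i + 1, len(perm))
        if perm[i] > perm[j]
    )
    return inversions % 2


# The rotations of a tetrahedron permute its four vertices evenly.
ROTATIONS = sorted(p for p in permutations((1, 2, 3, 4)) if parity(p) == 0)


def same_configuration(order_a, order_b):
    """True if two neighbour orderings of one center are related by a rotation."""
    index = {neighbour: i + 1 for i, neighbour in enumerate(order_a)}
    relabelled = tuple(index[neighbour] for neighbour in order_b)
    return parity(relabelled) == 0


print(len(ROTATIONS))  # 12
print(same_configuration(("F", "Cl", "Br", "H"), ("Cl", "F", "H", "Br")))  # True
print(same_configuration(("F", "Cl", "Br", "H"), ("Cl", "F", "Br", "H")))  # False
```

Swapping two pairs of neighbours (operation D in the table) leaves the configuration unchanged, while swapping a single pair (the reflection s) inverts it.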

Page 25

Enabling Technology

Chemical constraints layer of representation

CONNECTIONS=1,2,3

RINGS=5+6,6+7

DIFFRING=1

SAMERING=1

DIFF=1

SAME=1

ARYL=YES

SPCENTRE=1,2,3

CHARGE=YES

HS=0,1,2

FUSION=BIARYL

BRIDGEHEAD=YES

EPS=0,1

HETS=0,1,2

NONAROMHETS=0,1,2

HALOGENS=0,1,2

FGS=ALCOHOL

FGNOT=CARBONYL

PROP=EWG

PROPNOT=Lg

Substructure search/match
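Constraints written as `KEY=value,value` lend themselves to a simple parse-and-check step during substructure matching. The sketch below is only a guess at how such a layer might be consumed: the slides do not define the keys, so treating each constraint as a list of allowed values for an atom property is an assumption, as are the function names and the property dictionaries. Negated keys such as `FGNOT` and compound values such as `RINGS=5+6` would need handling that is not shown.

```python
def parse_constraint(text):
    """Split 'KEY=v1,v2' into (key, set of allowed string values)."""
    key, _, values = text.partition("=")
    return key.strip(), {value.strip() for value in values.split(",")}


def atom_satisfies(atom_properties, constraints):
    """True if every constrained property of the atom has an allowed value."""
    for text in constraints:
        key, allowed = parse_constraint(text)
        # Assumed semantics: the atom's value must be one of the listed values.
        if str(atom_properties.get(key)) not in allowed:
            return False
    return True


# Hypothetical perceived properties for two query atoms.
print(atom_satisfies({"HS": 1, "HETS": 0}, ["HS=0,1,2", "HETS=0,1,2"]))
print(atom_satisfies({"HS": 3, "HETS": 0}, ["HS=0,1,2"]))
```

The first atom passes both constraints; the second fails because its `HS` value is not among the allowed ones.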

Page 26

Reduction of Ketones to Secondary Alcohols

Base ARChem rule

| Level | Hits | ee | de | (screen) | Notes |
|---|---|---|---|---|---|
| Level 0: Bond change constraints only | 10,004 | | | (10,004) | Not unique to ketone → secondary alcohol conversion |
| Level 1: + Environment constraints | 8,442 | | | (10,004) | Unique to ketone → secondary alcohol conversion; 140 tolerated functional groups |
| Level 2: + Stereochemical constraints | 6,525 | 3,457 | 4,711 | (6,765) | Enantioselective and diastereoselective examples |

Page 27

Dihydroxylation of Alkenes

Level 1: Bond changes with environment constraints

2,253 examples (2,416 screened)

Level 2: + Stereochemical constraints

| Hits | ee | de | (screen) |
|---|---|---|---|
| 1,428 | 1,008 | 1,151 | (1,634) |
| 428 | 117 | 352 | (444) |

Level 3: + Substitution patterns

| Hits | ee | de |
|---|---|---|
| 681 | 578 | 552 |
| 526 | 289 | 418 |
| 206 | 131 | 168 |
| 12 | 10 | 11 |
| 236 | 89 | 191 |
| 123 | 51 | 103 |
| 51 | 27 | 41 |
| 8 | 4 | 7 |

Page 28

Conclusions

Useful chemical knowledge can be extracted algorithmically from reaction databases.

Automation is crucial given the size and growth of databases.

Different layers of knowledge are tightly entangled: regioselectivity, chemoselectivity and stereoselectivity overlap considerably.

The extracted knowledge can be applied effectively in computer-aided synthesis design, and can empower chemists by offering new ideas and a broader perspective on the literature.

But...

The quality of extracted knowledge highly depends on the accuracy and scope of the source data!

Page 29

The Rule-Set

Cut-off threshold

Useful reactions

Noise

Distractions

Low utility reactions

Bad atom maps (avoid)

Rare multistep reaction sequences (low utility)

Multiple concurrent reactions on substrate (very low utility)

Exotic heterocycle formation (promote)

Ubiquitous protection / deprotection FGIs such as alcohol/ester, amine/amide, etc. (demote)

Page 30

Conclusions

A significant portion of the data is being lost due to mapping errors and other problems.

Yield and selectivity information is captured inconsistently.

What can be done:

Metadata perception can be improved (in progress).

Mapping algorithms should reflect contemporary mechanistic understanding of reactions.

Systematic mapping errors can be manually fixed (planned).

Extracted rules can be manually curated (continuous).

Page 31

Acknowledgements

SimBioSys

James Law - Regioselectivity

Victoria Lubitch

Yasamin Salmasi

Aniko Simon

Zsolt Zsoldos

Reaction Data

Elsevier – Reaxys

Wiley - CIRX

RSC - MOS

Accelrys - RefLib

University of Leeds

Tony Cook - Stereochemistry

Peter Johnson

Steve Marsden

Other Collaborators

ChemAxon

And…

ARChem users! THANK YOU!