fast descriptor calculation for combinatorial libraries geoff downs & john barnard sheffield, uk
![Page 1: Fast Descriptor Calculation for Combinatorial Libraries Geoff Downs & John Barnard Sheffield, UK](https://reader038.vdocuments.net/reader038/viewer/2022110206/56649cd95503460f949a35d5/html5/thumbnails/1.jpg)
Fast Descriptor Calculation for
Combinatorial Libraries
Geoff Downs & John Barnard
Sheffield, UK
BCI Barnard Chemical Information Ltd.
![Page 2: Fast Descriptor Calculation for Combinatorial Libraries Geoff Downs & John Barnard Sheffield, UK](https://reader038.vdocuments.net/reader038/viewer/2022110206/56649cd95503460f949a35d5/html5/thumbnails/2.jpg)
Descriptor Generation for Combinatorial Libraries
Need to calculate structure descriptors for large virtual libraries
• Subset selection better in property space than in reactant (precursor) space
Full library enumeration, followed by descriptor calculation for single molecules, can be slow
Direct analysis of Markush representation of library can offer order-of-magnitude speedups
• gives accurately-calculated descriptors for all molecules in the library
![Page 3: Fast Descriptor Calculation for Combinatorial Libraries Geoff Downs & John Barnard Sheffield, UK](https://reader038.vdocuments.net/reader038/viewer/2022110206/56649cd95503460f949a35d5/html5/thumbnails/3.jpg)
Markush Structures
Scaffold plus R-groups
Each R-group alternative (ni) shown once
Convenient for input and display
Markush is O(Σ ni)
• 1 core + 100 R1 + 100 R2 + 100 R3 = 301
Enumeration is O(Π ni)
• 1 core × 100 R1 × 100 R2 × 100 R3 = 1,000,000
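The difference between the two growth rates can be checked with a few lines of Python. This is only an illustrative sketch using the counts quoted on the slide; the variable names are not from the original software.

```python
from math import prod

# Number of alternatives at each variable position (values from the slide).
alternatives = {"R1": 100, "R2": 100, "R3": 100}

# Markush representation: one core plus each alternative stored once.
markush_size = 1 + sum(alternatives.values())

# Full enumeration: one product per combination of alternatives.
enumerated_size = prod(alternatives.values())

print(markush_size)     # 301
print(enumerated_size)  # 1000000
```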
![Page 4: Fast Descriptor Calculation for Combinatorial Libraries Geoff Downs & John Barnard Sheffield, UK](https://reader038.vdocuments.net/reader038/viewer/2022110206/56649cd95503460f949a35d5/html5/thumbnails/4.jpg)
Direct Analysis of Markush
Avoid multiple analysis of common/repeated parts
Time and space advantages
• do as much work as possible in Σ ni
• work in Π ni only when absolutely necessary
Generate partial descriptors from individual building blocks, and overlaps between them
Combine these using appropriate logic to form full descriptors for individual products
Applicable where descriptors are “additive” in nature
![Page 5: Fast Descriptor Calculation for Combinatorial Libraries Geoff Downs & John Barnard Sheffield, UK](https://reader038.vdocuments.net/reader038/viewer/2022110206/56649cd95503460f949a35d5/html5/thumbnails/5.jpg)
Two-stage Descriptor Generation from Markush Structures
1. Analyse core and R-group alternatives
• Build intermediate representation of “partial descriptors”
• Some partial descriptors may involve overlap between core and R-group(s)
• O(Σ ni) [Sigma Phase]
2. Assemble “full” descriptor for each individual molecule in library
• Usually simple addition, concatenation or logical OR of partial descriptors
• O(Π ni) [Pi Phase]
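A minimal sketch of the two stages, assuming fingerprints are held as integer bit masks and ignoring overlap terms. The bit patterns and names (`core_fp`, `r1_fps`, `sigma_phase`, `pi_phase`) are hypothetical and are not taken from the BCI software.

```python
from itertools import product

# Hypothetical partial fingerprints: one bit per dictionary fragment.
core_fp = 0b0001
r1_fps = {"R1a": 0b0010, "R1b": 0b0110}
r2_fps = {"R2a": 0b1000, "R2b": 0b0001}

def sigma_phase():
    """Analyse each building block once: work grows with the sum of n_i."""
    # In a real system these values would be computed from the structures.
    return core_fp, r1_fps, r2_fps

def pi_phase(core, r1, r2):
    """Assemble one full fingerprint per product: work grows with the product of n_i."""
    for (name1, fp1), (name2, fp2) in product(r1.items(), r2.items()):
        yield (name1, name2), core | fp1 | fp2   # logical OR of partial fingerprints

full = dict(pi_phase(*sigma_phase()))
```

The expensive structure analysis happens only in the first function; the second does nothing more than a cheap combination per product.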
![Page 6: Fast Descriptor Calculation for Combinatorial Libraries Geoff Downs & John Barnard Sheffield, UK](https://reader038.vdocuments.net/reader038/viewer/2022110206/56649cd95503460f949a35d5/html5/thumbnails/6.jpg)
Descriptors from Markush
Previously described structure fingerprint generation
• based on dictionary of predefined fragments
• “Partial” fingerprints for relevant building blocks ORed together for each specific structure
More recent work on calculation of property values
• “Lipinski” properties
• topological indices
![Page 7: Fast Descriptor Calculation for Combinatorial Libraries Geoff Downs & John Barnard Sheffield, UK](https://reader038.vdocuments.net/reader038/viewer/2022110206/56649cd95503460f949a35d5/html5/thumbnails/7.jpg)
Markush Analysis Software
[Figure: data-flow diagram. Labels: Markush Input; Reaction and Precursor Input; cSLN; RG File; MTZ File (SMIRKS/SMILES); Diversity Explorer Exchange File (SMIRKS/SMILES); Main internal Markush representation (EFCL); Fragment dictionary; SlogP atom types; Partial fingerprints; Enumerated Fingerprints; Centroid/Modal Fingerprint; Cluster Numbers; Partial SMILES; Enumerated SMILES; Partial property values; Enumerated property values]
![Page 8: Fast Descriptor Calculation for Combinatorial Libraries Geoff Downs & John Barnard Sheffield, UK](https://reader038.vdocuments.net/reader038/viewer/2022110206/56649cd95503460f949a35d5/html5/thumbnails/8.jpg)
Internal Markush Representation
Data structure held in memory only while needed for analysis
• Separate building blocks (“partial structures”) with logical relationships
• Several (non-independent) substituent groups may be included in a single structural variable
Can be built from various input formats
• “Markush-type” input (e.g. RGfile, cSLN) imported directly
• Generic reaction and precursor input is more complex
Representation may be “optimised” for efficient processing
![Page 9: Fast Descriptor Calculation for Combinatorial Libraries Geoff Downs & John Barnard Sheffield, UK](https://reader038.vdocuments.net/reader038/viewer/2022110206/56649cd95503460f949a35d5/html5/thumbnails/9.jpg)
Reaction/Precursor input
Build Markush incrementally, one reaction step at a time
Each step modifies core and adds an R-group (clipped reagents)
Input modules based on Daylight reaction toolkit implemented
• http://www.daylight.com/meetings/mug00/Barnard
Module based on Accord SDK under development
![Page 10: Fast Descriptor Calculation for Combinatorial Libraries Geoff Downs & John Barnard Sheffield, UK](https://reader038.vdocuments.net/reader038/viewer/2022110206/56649cd95503460f949a35d5/html5/thumbnails/10.jpg)
SMILES Enumeration
Markush analysis can be used for fast enumeration of non-canonical SMILES for library members
Based on SMILES trick: “C1.C1” is equivalent to “CC”
• dot separates the two carbon atoms
• “ring closure” numerals join them up again
Sigma Phase:
• Generate Partial SMILES for each Partial Structure
• Use unsatisfied ring closure numeral for bonds outside the Partial Structure
Pi phase:
• Concatenate Partial SMILES from each relevant PS
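The Pi phase for SMILES can be sketched as plain string concatenation. The partial SMILES below are the ones shown in this deck's worked example; the function name `enumerate_smiles` is an illustrative assumption, and no chemistry toolkit is used, so the strings are not validated or canonicalised.

```python
from itertools import product

# Partial SMILES from the deck's example. %11 and %12 are "unsatisfied"
# ring-closure numerals that mark bonds leaving each partial structure.
core = "O=C%12Nc1ccc(ccc1)C(=O)OC%11"
r1_alternatives = ["C%12", "C%12Br"]
r2_alternatives = ["[H]%11", "Cl%11"]

def enumerate_smiles(core, *r_groups):
    """Pi phase: join one partial SMILES per position with '.' separators."""
    for combo in product(*r_groups):
        yield ".".join((core, *combo))

library = list(enumerate_smiles(core, r1_alternatives, r2_alternatives))
for smiles in library:
    print(smiles)
```

Because the ring-closure numerals pair up across the dots, each concatenated string describes a single connected molecule.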
![Page 11: Fast Descriptor Calculation for Combinatorial Libraries Geoff Downs & John Barnard Sheffield, UK](https://reader038.vdocuments.net/reader038/viewer/2022110206/56649cd95503460f949a35d5/html5/thumbnails/11.jpg)
SMILES Enumeration
core . R1 . R2
O=C%12Nc1ccc(ccc1)C(=O)OC%11 . C%12 . [H]%11
O=C%12Nc1ccc(ccc1)C(=O)OC%11 . C%12 . Cl%11
O=C%12Nc1ccc(ccc1)C(=O)OC%11 . C%12Br . [H]%11
O=C%12Nc1ccc(ccc1)C(=O)OC%11 . C%12Br . Cl%11
Sigma phase: fast generation of partial SMILES
• < 0.04s for 100 x 100 x 100 = 1M benzodiazepine library
Pi phase: simple concatenation of partial SMILES
• 38,775 structures per sec (SGI R10k)
• Producing canonical SMILES slows down enumeration by factor of 45
o Each individual molecule must be separately canonicalised
![Page 12: Fast Descriptor Calculation for Combinatorial Libraries Geoff Downs & John Barnard Sheffield, UK](https://reader038.vdocuments.net/reader038/viewer/2022110206/56649cd95503460f949a35d5/html5/thumbnails/12.jpg)
Lipinski Property Generation
Molecular weight
• trivial addition of partial molecular weights
Count of aromatic rings
• addition of partial counts – optimisation of internal representation ensures that aromatic rings are not split between building blocks
Hydrogen bond donor/acceptor counts
• Partial counts may depend on combination of more than one R-group (e.g. where H is an alternative)
• “Overlap” terms (combinations of building blocks) may need to be included in addition
• HBD/HBA definitions can be customised
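A rough sketch of additive properties with one overlap term. All numbers are hypothetical: the core is assumed to weigh 100.0 and to carry a nitrogen bearing both R1 and R2, and the donor count is assumed to count donor atoms, so an NH2 nitrogen counts once. As the slide notes, real HBD/HBA definitions are customisable.

```python
from itertools import product

# Hypothetical partial values: (molecular weight, H-bond donor count).
core = (100.0, 0)                       # core with an N bearing R1 and R2
r1 = {"H": (1.008, 1), "CH3": (15.035, 0)}
r2 = {"H": (1.008, 1), "OH": (17.007, 1)}

# Overlap term: if R1 and R2 are both H, the NH2 nitrogen is one donor, not two.
hbd_overlap = {("H", "H"): -1}

def lipinski(core, r1, r2, overlap):
    """Pi phase: add partial values, then apply any overlap correction."""
    for (n1, (mw1, hbd1)), (n2, (mw2, hbd2)) in product(r1.items(), r2.items()):
        mw = core[0] + mw1 + mw2
        hbd = core[1] + hbd1 + hbd2 + overlap.get((n1, n2), 0)
        yield (n1, n2), (round(mw, 3), hbd)

table = dict(lipinski(core, r1, r2, hbd_overlap))
```

Molecular weight needs no correction, while the donor count shows how a property can depend on a combination of R-groups rather than on each one separately.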
![Page 13: Fast Descriptor Calculation for Combinatorial Libraries Geoff Downs & John Barnard Sheffield, UK](https://reader038.vdocuments.net/reader038/viewer/2022110206/56649cd95503460f949a35d5/html5/thumbnails/13.jpg)
Lipinski Property Generation
Rotatable bond counts
• Some complexities for bonds between core and R-group
• RB = any single bond except to a terminal atom (H, Cl etc.) or a terminal group (CH3, NO2 etc.)
• In example, R1 to ring single bond is not rotatable when R1 is CH3 or R2 and R3 are identical terminal atoms
[Figure: example structure with variable groups R1, R2 and R3 and a CH3 group; diagram labels include N, R1, N*R3, R2, CH3 and *R1=]
![Page 14: Fast Descriptor Calculation for Combinatorial Libraries Geoff Downs & John Barnard Sheffield, UK](https://reader038.vdocuments.net/reader038/viewer/2022110206/56649cd95503460f949a35d5/html5/thumbnails/14.jpg)
Lipinski Property Generation
logP
• Used SlogP atom-contribution method
o Wildman & Crippen, JCICS 1999, 39, 868-873
o 68 atom types (+ 4 supplemental) defined as SMARTS patterns
• Atom types redefined as BCI Fragment Dictionary, e.g. [CH3][(N,O,S,P,F,Cl,Br,I)] => C as X
o 797 fragments (644 AA + 26 AS + 131 direct assignment)
o Charged N,O and intermediate C, X, Y atom types
• Some atom types require examination of neighbouring building blocks
![Page 15: Fast Descriptor Calculation for Combinatorial Libraries Geoff Downs & John Barnard Sheffield, UK](https://reader038.vdocuments.net/reader038/viewer/2022110206/56649cd95503460f949a35d5/html5/thumbnails/15.jpg)
Lipinski Generation Timings
100 × 100 × 100 = 1M benzodiazepine library
SGI R10000
Sigma phase:
• calculation of partial property values
• <0.04s
Pi phase:
• assembly and output of full property values
• 95.59s for all 1M molecules
• 10,461 molecules/s
![Page 16: Fast Descriptor Calculation for Combinatorial Libraries Geoff Downs & John Barnard Sheffield, UK](https://reader038.vdocuments.net/reader038/viewer/2022110206/56649cd95503460f949a35d5/html5/thumbnails/16.jpg)
Topological Index Generation
Many topological indices are based on summing the terms for small parts of structure
• Simple extra calculation needed at end for some indices
• Several implemented (others under development)
o Kier Chi connectivity indices; any order
o Counts of different subgraph types; any order
o Kier Kappa and Phi shape indices
o Zagreb index
o (Wiener index and Balaban (JX, JY) indices)
• Hosoya Index not amenable to Markush approach
o Requires analysis of full molecule
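The Zagreb index is a convenient example of an index built by summing terms for small parts of the structure. The sketch below assumes the first Zagreb index (sum of squared vertex degrees on the hydrogen-suppressed graph) and a hypothetical block representation in which each building block lists its internal bonds and the atoms that have a bond leaving the block. It is not the BCI implementation.

```python
from collections import Counter

def partial_zagreb(bonds, open_bonds):
    """Sum of squared degrees for one building block.

    bonds: (atom, atom) pairs inside the block.
    open_bonds: atoms with a bond leaving the block (one entry per bond).
    """
    degree = Counter()
    for a, b in bonds:
        degree[a] += 1
        degree[b] += 1
    for a in open_bonds:
        degree[a] += 1          # count the bond to the neighbouring block
    return sum(d * d for d in degree.values())

# Sigma phase: one partial value per block (hypothetical three-carbon core).
core = partial_zagreb([("c1", "c2"), ("c2", "c3")], ["c2"])
methyl = partial_zagreb([], ["m1"])
ethyl = partial_zagreb([("e1", "e2")], ["e1"])

# Pi phase: simple addition per product.
isobutane = core + methyl          # 12
methylbutane = core + ethyl        # 16

# Cross-check against the fully enumerated isobutane graph.
full = partial_zagreb([("c1", "c2"), ("c2", "c3"), ("c2", "m1")], [])
```

Because every attachment atom already counts its outgoing bond, no overlap term is needed for this particular index.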
![Page 17: Fast Descriptor Calculation for Combinatorial Libraries Geoff Downs & John Barnard Sheffield, UK](https://reader038.vdocuments.net/reader038/viewer/2022110206/56649cd95503460f949a35d5/html5/thumbnails/17.jpg)
Kier Index Generation
Sigma Phase
• Identify all subgraphs up to n bonds (n is maximum index order)
• Count number of subgraphs of different types, and calculate contributions to Chi indices
Pi Phase
• Sum appropriate subgraph counts and index contributions for each molecule
• Kappa and Phi shape indices calculated from low-order subgraph counts
Sigma phase is significantly slower than for Lipinski properties and fingerprints
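For the lowest non-trivial order the idea can be sketched directly. The example assumes the simple (degree-based, non-valence) first-order chi index, which sums 1/sqrt(d_i * d_j) over bonds, and assumes each block has exactly one attachment bond. The block representation and function names are hypothetical.

```python
from collections import Counter
from math import sqrt

def partial_chi1(bonds, open_bonds):
    """Sigma phase: chi-1 terms for bonds inside one block.

    Returns the internal sum and the degree of the attachment atom,
    which is needed later for the bond that crosses between blocks.
    """
    degree = Counter()
    for a, b in bonds:
        degree[a] += 1
        degree[b] += 1
    for a in open_bonds:
        degree[a] += 1          # bond to the neighbouring block
    inside = sum(1 / sqrt(degree[a] * degree[b]) for a, b in bonds)
    return inside, degree[open_bonds[0]]

def chi1(core, r_group):
    """Pi phase: add partial sums plus the overlap term for the joining bond."""
    (inside_c, deg_c), (inside_r, deg_r) = core, r_group
    return inside_c + inside_r + 1 / sqrt(deg_c * deg_r)

core = partial_chi1([("c1", "c2"), ("c2", "c3")], ["c2"])
methyl = partial_chi1([], ["m1"])
ethyl = partial_chi1([("e1", "e2")], ["e1"])

chi_isobutane = chi1(core, methyl)       # about 1.732
chi_methylbutane = chi1(core, ethyl)     # about 2.270
```

The joining bond is the overlap term here: its contribution depends on both blocks, so it can only be added in the Pi phase. Higher orders involve longer subgraphs that may span several blocks, which is one reason the Sigma phase becomes slower.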
![Page 18: Fast Descriptor Calculation for Combinatorial Libraries Geoff Downs & John Barnard Sheffield, UK](https://reader038.vdocuments.net/reader038/viewer/2022110206/56649cd95503460f949a35d5/html5/thumbnails/18.jpg)
Chi Index Sigma Phase Timings
[Figure: plot of Sigma phase time in seconds (log scale, 0.01 to 100000) against maximum subgraph order (0 to 9)]
Calculation of partial subgraph counts up to specified order
100 x 100 x 3 = 30,000 compounds
400MHz Pentium Celeron, 64MB RAM
![Page 19: Fast Descriptor Calculation for Combinatorial Libraries Geoff Downs & John Barnard Sheffield, UK](https://reader038.vdocuments.net/reader038/viewer/2022110206/56649cd95503460f949a35d5/html5/thumbnails/19.jpg)
Slowdown at higher orders – number of subgraphs
Exponential increase in number of subgraphs at higher orders
• Also a problem when handling specific structures
Subgraph Types
• P (Path) – nodes have 1 or 2 connections
• C (Cluster) – nodes have 1 or 3 connections
• PC (Path/Cluster) – nodes have 1, 2 or 3 connections
• CH (Chain) – subgraph contains a ring
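The four subgraph types can be told apart from node degrees alone. The classifier below is a sketch that follows the definitions on this slide literally; it assumes the subgraph is connected, that each bond is listed once, and that the function name and the "other" fallback are illustrative additions.

```python
from collections import Counter

def subgraph_type(bonds):
    """Classify a connected subgraph given as (atom, atom) bond pairs."""
    degree = Counter()
    for a, b in bonds:
        degree[a] += 1
        degree[b] += 1
    # A connected graph with at least as many bonds as atoms contains a ring.
    if len(bonds) >= len(degree):
        return "CH"
    seen = set(degree.values())
    if seen <= {1, 2}:
        return "P"
    if seen <= {1, 3}:
        return "C"
    if seen <= {1, 2, 3}:
        return "PC"
    return "other"   # e.g. a four-connected branch point, not covered by the slide

print(subgraph_type([(1, 2), (2, 3)]))                   # P
print(subgraph_type([(1, 2), (1, 3), (1, 4)]))           # C
print(subgraph_type([(1, 2), (1, 3), (1, 4), (4, 5)]))   # PC
print(subgraph_type([(1, 2), (2, 3), (3, 1)]))           # CH
```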
![Page 20: Fast Descriptor Calculation for Combinatorial Libraries Geoff Downs & John Barnard Sheffield, UK](https://reader038.vdocuments.net/reader038/viewer/2022110206/56649cd95503460f949a35d5/html5/thumbnails/20.jpg)
Explosion in Number of Subgraphs
[Figure: mean number of subgraphs per molecule (axis 0 to 200) against subgraph order (0 to 9), with series for Path, Cluster, Chain and Path-Cluster subgraphs]
![Page 21: Fast Descriptor Calculation for Combinatorial Libraries Geoff Downs & John Barnard Sheffield, UK](https://reader038.vdocuments.net/reader038/viewer/2022110206/56649cd95503460f949a35d5/html5/thumbnails/21.jpg)
Explosion in Number of Subgraphs
[Figure: maximum number of subgraphs in any molecule (axis 0 to 6000) against subgraph order (0 to 9), with series for Path, Cluster, Chain and Path-Cluster subgraphs]
![Page 22: Fast Descriptor Calculation for Combinatorial Libraries Geoff Downs & John Barnard Sheffield, UK](https://reader038.vdocuments.net/reader038/viewer/2022110206/56649cd95503460f949a35d5/html5/thumbnails/22.jpg)
Slowdown at higher orders – number of R-groups
Higher order subgraphs can involve core and multiple R-groups
• Order 6 PC can involve all three R-groups
Depends on how well-separated R-groups are
[Figure: example structure with variable groups R1, R2 (appearing twice) and R3, and a CH3 group]
R1: 100 alternatives
R2 (symmetrical): 100 alternatives
R3: 3 alternatives
![Page 23: Fast Descriptor Calculation for Combinatorial Libraries Geoff Downs & John Barnard Sheffield, UK](https://reader038.vdocuments.net/reader038/viewer/2022110206/56649cd95503460f949a35d5/html5/thumbnails/23.jpg)
Speeding-up Kier Index Generation
Limit maximum order for subgraph counts and Kier connectivity indices
Avoid identifying PC/CH subgraphs if these indices are not required
• not available yet
• some complications as subgraphs can change type as bonds are added
[Figure: example subgraphs labelled Cluster, Path, Chain (i.e. Ring) and Path-Cluster]
![Page 24: Fast Descriptor Calculation for Combinatorial Libraries Geoff Downs & John Barnard Sheffield, UK](https://reader038.vdocuments.net/reader038/viewer/2022110206/56649cd95503460f949a35d5/html5/thumbnails/24.jpg)
Clustering Library Members
Previously described clustering of library members on the basis of fingerprints
Lipinski properties and topological indices can also be used as basis for clustering
• Descriptors are re-generated from partial descriptors as needed (Pi phase) and need not be stored
K-means relocation method needs O(N) time
• Non-hierarchical clustering method
• Produces high-quality clusters
• User specifies required number of clusters
• Results can depend on random selection of cluster seeds
![Page 25: Fast Descriptor Calculation for Combinatorial Libraries Geoff Downs & John Barnard Sheffield, UK](https://reader038.vdocuments.net/reader038/viewer/2022110206/56649cd95503460f949a35d5/html5/thumbnails/25.jpg)
Current Work: Library Overlap
Work in progress to identify the overlap between combinatorial libraries
Identify specific compounds in common
• expressed as another Markush structure
“Brute force” algorithm would
• Fully enumerate libraries involved
• Compare lists of (e.g.) canonical SMILES for common members
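The brute-force baseline amounts to a set intersection over the enumerated products. In this sketch the tuples of building-block names are a stand-in for canonical SMILES, and the library contents are invented for illustration. A real comparison would need canonical identifiers, because the same compound can be drawn with different core and R-group boundaries.

```python
from itertools import product

def enumerate_library(core, *r_groups):
    """Stand-in for full enumeration: yields one identifier per product."""
    for combo in product(*r_groups):
        yield (core, *combo)

# Two hypothetical libraries sharing a core and some R-group alternatives.
lib_a = set(enumerate_library("core1", ["H", "Cl", "Br"], ["CH3", "OH"]))
lib_b = set(enumerate_library("core1", ["Cl", "Br", "I"], ["OH", "NH2"]))

# Comparing the enumerated lists costs time and memory proportional to the
# full library sizes, which is what a Markush-level method tries to avoid.
common = lib_a & lib_b
print(sorted(common))
```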
![Page 26: Fast Descriptor Calculation for Combinatorial Libraries Geoff Downs & John Barnard Sheffield, UK](https://reader038.vdocuments.net/reader038/viewer/2022110206/56649cd95503460f949a35d5/html5/thumbnails/26.jpg)
Library Overlap
Markush algorithm originally designed for structure search in chemical patents
• uses “reduced graph” representation of Markush
o avoids “segmentation problem” (different boundaries between R-group and scaffold)
• eliminates non-matching parts very rapidly
• slower (atom-by-atom) check to confirm matches
o worst case is matching library against itself
Implementation in software toolkit form
• can be incorporated into users’ software
• could form basis for Markush Registration/Search system
![Page 27: Fast Descriptor Calculation for Combinatorial Libraries Geoff Downs & John Barnard Sheffield, UK](https://reader038.vdocuments.net/reader038/viewer/2022110206/56649cd95503460f949a35d5/html5/thumbnails/27.jpg)
Potential Future Work: 3D Conformation Generation
Preliminary discussions with Gasteiger group (Univ. Erlangen) on linking Markush approach with CORINA
CORINA works by
• separating cyclic and acyclic components
• establishing conformation for each independently
• linking them back together
• checking and adjusting for steric crowding
Some analogies with Markush approach
• First two steps are equivalent to Sigma phase
• Last two steps are equivalent to Pi phase
![Page 28: Fast Descriptor Calculation for Combinatorial Libraries Geoff Downs & John Barnard Sheffield, UK](https://reader038.vdocuments.net/reader038/viewer/2022110206/56649cd95503460f949a35d5/html5/thumbnails/28.jpg)
References
Barnard, J. M.; Downs, G. M.; von Scholley-Pfab, A.; Brown, R. D. “Use of Markush structure analysis techniques for descriptor generation and clustering of large combinatorial libraries.” J. Mol. Graph. Modelling 2000, 18 (4/5), 452-463
Reactions Markush (Daylight MUG00 meeting): http://www.daylight.com/meetings/mug00/Barnard
P.S. we are recruiting too…
http://www.bci1.demon.co.uk
Copyright © Barnard Chemical Information Ltd., 2001