1 Chemical Structure Representation and Search Systems
Lecture 5. Nov 13, 2003
John Barnard
Barnard Chemical Information Ltd
Chemical Informatics Software & Consultancy Services
Sheffield, UK

2 Lecture 5: Topics to be Covered
• Reaction searching
  o atom-atom mapping
  o Maximal Common Substructure search
• 3D substructure search
• Searching Markush structures in patents
  o nature and origin of Markush structures
  o fragment codes
  o topological systems (MARPAT, Markush DARC)

3 Searching Chemical Reactions
Each database entry contains several molecules
• reactants
• products
• catalysts
• solvents
• etc.
May want query substructure confined to one of these
• can be done by assigning a role indicator to each molecule
But role indicators are not enough on their own for a useful reaction search system

4 Reaction search
Query:
[Figure: query reaction, a carbonyl group (C=O) in the reactant becoming a hydroxyl group (C-OH) in the product]

5 Reaction search
Query:
[Figure: the C=O to C-OH query reaction]
“Hit”:
[Figure: retrieved reaction in which the product contains a hydroxyl group, HBr is a by-product, and the ketone oxygen is not the hydroxyl oxygen]
We didn’t get what we wanted because the hydroxyl in the product did not involve the same oxygen as the ketone in the reactant
We need to “map” the atoms between the reactant and product

6 Atom mapping
Atoms on each side of the reaction can be numbered to show which corresponds to which
• similar mappings can be used in the query
Automatic assignment of atom mapping is very important in reaction indexing systems
• problem is obviously related to finding a graph isomorphism between reactant and product sides
• except that the two sides are NOT isomorphic
[Figure: the reaction scheme with atoms numbered 1-12 on both the reactant and product sides to show the atom-atom mapping]
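
The difference between a role-indicator search and an atom-mapped search can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the reaction records, the dictionary layout and the function names (`co_pairs`, `unmapped_hit`, `mapped_hit`) are hypothetical, hydrogens are ignored, and the two toy reactions are not the ones drawn on the slides.

```python
# Hypothetical toy records: map number -> element, and (map_a, map_b) -> bond order.

def co_pairs(elements, bonds, order):
    """Return the set of (carbon, oxygen) map-number pairs joined by a bond of this order."""
    pairs = set()
    for (a, b), bond_order in bonds.items():
        if bond_order != order:
            continue
        if (elements[a], elements[b]) == ("C", "O"):
            pairs.add((a, b))
        elif (elements[a], elements[b]) == ("O", "C"):
            pairs.add((b, a))
    return pairs


def unmapped_hit(rxn):
    """Role indicators only: C=O somewhere in the reactants, C-O somewhere in the products."""
    before = co_pairs(rxn["elements"], rxn["reactant_bonds"], 2)
    after = co_pairs(rxn["elements"], rxn["product_bonds"], 1)
    return bool(before) and bool(after)


def mapped_hit(rxn):
    """Atom mapping: the same carbon and oxygen must go from C=O to C-O."""
    before = co_pairs(rxn["elements"], rxn["reactant_bonds"], 2)
    after = co_pairs(rxn["elements"], rxn["product_bonds"], 1)
    return bool(before & after)


# Assumed example 1: the ketone C1=O2 is untouched, an unrelated alcohol C3-O4 is
# present on both sides, and the only change is a new C5-Br6 bond.
bystander = {
    "elements": {1: "C", 2: "O", 3: "C", 4: "O", 5: "C", 6: "Br"},
    "reactant_bonds": {(1, 2): 2, (1, 3): 1, (3, 4): 1, (1, 5): 1},
    "product_bonds": {(1, 2): 2, (1, 3): 1, (3, 4): 1, (1, 5): 1, (5, 6): 1},
}

# Assumed example 2: a genuine reduction, C1=O2 becomes C1-O2.
reduction = {
    "elements": {1: "C", 2: "O"},
    "reactant_bonds": {(1, 2): 2},
    "product_bonds": {(1, 2): 1},
}

print(unmapped_hit(bystander), mapped_hit(bystander))  # false drop without mapping
print(mapped_hit(reduction))
```

The first record is retrieved by the unmapped search even though its carbonyl never reacts; requiring the same map numbers on both sides rejects it and keeps the real reduction.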

7 Maximal common subgraph
Atoms and bonds in red represent the largest subgraph that is common to both sides
• all these atoms have the same neighbours on both sides
• none of these bonds are made or broken
Remaining atoms and bonds represent the reaction site
[Figure: the atom-mapped reaction scheme (atoms numbered 1-12) with the maximal common subgraph highlighted in red]

8 Maximal common subgraph
Finding the MCS between two graphs is an NP-complete problem
• even worse than subgraph isomorphism because you don’t know in advance how big the subgraph will be
• exhaustive backtracking is prohibitively slow
• the best algorithms find an approximate solution (i.e. a large, but not necessarily maximal, subgraph)
• tricks can be used to determine an upper bound for the size of the MCS (so you can stop looking when you’ve found one of this size)
• new algorithm published 2002
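
To make the backtracking and upper-bound ideas concrete, here is a minimal sketch for very small graphs. It is not the 2002 algorithm mentioned above and not any production method: it assumes atoms are compared by element only, bond orders are ignored, the common subgraph is induced and may be disconnected, and the function name `mcs_size` and the list-of-sets graph encoding are choices made for this example.

```python
from collections import Counter


def mcs_size(labels1, adj1, labels2, adj2):
    """Atom count of the maximum common induced subgraph of two small labelled graphs."""
    count1, count2 = Counter(labels1), Counter(labels2)
    # Upper bound: an element can be matched at most as often as it occurs on the scarcer side.
    upper = sum(min(count1[e], count2[e]) for e in count1)
    n1, n2 = len(labels1), len(labels2)
    best = 0

    def extend(i, mapping):
        nonlocal best
        best = max(best, len(mapping))
        if best == upper or i == n1:
            return  # bound reached, or no atoms left to place
        if len(mapping) + (n1 - i) <= best:
            return  # the remaining atoms cannot beat the best found so far
        used = set(mapping.values())
        for j in range(n2):
            if j in used or labels1[i] != labels2[j]:
                continue
            # Bonded / non-bonded status must agree with every pair already mapped.
            if all((p in adj1[i]) == (q in adj2[j]) for p, q in mapping.items()):
                mapping[i] = j
                extend(i + 1, mapping)
                del mapping[i]
        extend(i + 1, mapping)  # also try leaving atom i unmatched

    extend(0, {})
    return best


# Heavy-atom graphs; a three-atom chain has the same adjacency in each case.
chain = [{1}, {0, 2}, {1}]
ethanol = ["C", "C", "O"]          # C-C-O
dimethyl_ether = ["C", "O", "C"]   # C-O-C
propane = ["C", "C", "C"]          # C-C-C

print(mcs_size(ethanol, chain, dimethyl_ether, chain))
print(mcs_size(propane, chain, ethanol, chain))
```

For propane against ethanol the element-count bound is 2, so the search stops as soon as the C-C fragment is found. For ethanol against dimethyl ether the bound of 3 is never reached and the search has to run to exhaustion, which is the expensive case the slide warns about.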

9 Applications of MCS
MCS algorithms can be applied to other things than atom-atom mapping in reactions
• structural similarity between molecules
  o size of MCS (relative to size of molecules) can be used as a measure of similarity of molecules
• approximate match searches
  o search for molecules containing at least 80% of query substructure
• multiple maximal common substructure
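
The slide does not say how the MCS size is normalised. One common choice, given here as an assumption rather than as the lecturer's definition, is a Tanimoto-style ratio over atom counts, together with a threshold form of the approximate-match criterion (A and B are molecules, Q is the query substructure, M is a database molecule, and |·| counts atoms):

```latex
S_{\mathrm{MCS}}(A, B) = \frac{|\mathrm{MCS}(A, B)|}{|A| + |B| - |\mathrm{MCS}(A, B)|},
\qquad
\text{approximate match: } |\mathrm{MCS}(Q, M)| \ge 0.8\,|Q|
```

With this definition, identical molecules score 1 and molecules with no common atoms score 0.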

10 Multiple MCS
Largest substructure common to a whole set of molecules
• can be used to extract a “core” for a Markush structure
• might represent features important for biological activity
• even more difficult than MCS of two molecules
  o unfortunately it doesn’t work to find MCS of first two, and then MCS between that and the third, etc.

11 3-D substructure search
Analogous to 2-D substructure search
• need to find atoms in correct spatial orientation relative to each other
  o some fuzziness (tolerance) permitted in distance values
• query can be defined as a group of atoms, with specified interatomic distances
  o sometimes called a pharmacophore
• both query and database structures can be shown as topological graphs in which the nodes are atoms, but the edges are interatomic distances

12 3-D substructure searching
The interatomic distances are the labels on the edges
Graph is fully-connected (an edge between every pair of nodes)
The graph edges do not correspond to bonds in the molecule
Matching is then a process of subgraph isomorphism between such graphs
[Figure: fully-connected graph on four atoms (N, C, C, O) whose six edges are labelled with interatomic distances 2.3, 5.1, 2.5, 6.4, 7.1 and 4.1 Å]

13 3D substructure searching
Subgraph isomorphism involving fully-connected graphs is computationally more demanding than for 2D substructure search
• Ullmann’s algorithm performs well
• other approaches (e.g. clique detection) have also been used
Fingerprint-like screening stages can also be applied in the search, based on 3D fragments such as 3-point pharmacophores
• screens based on torsion and valence angles have also been used
Willett, P. Three-Dimensional Chemical Structure Handling. Wiley: New York (1991)

14 Chemical patents
Contract between inventor and State to encourage innovation
• inventor reveals nature of invention
• State grants protected monopoly over its exploitation for limited period
Invention must be novel, useful and non-obvious
• new ways of making compounds
• new compounds with useful properties (therapeutic uses)
Essential for success of pharmaceutical industry
Knowledge of existing patents (prior art) essential to avoid fruitless development

15 Chemical patents
May claim single product or process
More usually claim class of products or processes to ensure protection for closely-related compounds etc.
Very broad claims can disguise true nature of invention
• but may claim compounds which lack claimed activity
• nested series of claims (A, preferably B, more preferably C etc.) can provide “fallback” positions
Extremely broad claims have become more common as Patent Offices moved to publication before examination
• Sibley, J. F. “Too broad generic disclosures: a problem for all”. J. Chem. Inf. Comput. Sci. 1991, 31 (1), 5-8

16 R1-X-R36
R1 is a substituted or unsubstituted, mono-, di- or polycyclic, aromatic or non-aromatic carbocyclic or heterocyclic ring system, or…
X is a single or double bond, substituted or unsubstituted heteroatom, or substituted carbon atom, or substituted or unsubstituted chain of two or more carbon atoms and/or heteroatoms…
R36 is substituted or unsubstituted asymmetrical heterocyclic ring system having at least 3 nitrogens…
[Structure 32 from Claim 105 of PCT Application 8704321, claimed as novel]

17 The patent explosion
Originally only granted patents were published.
Belgium (1950s), Netherlands (1964) and EPO (1978) moved to publishing all patent applications.
Rapid publication makes information available very quickly.
Huge number of patents, many of low quality, with insufficient or incorrect details, or no novelty.
Less work for patent examiners but greater problems for retrieval systems.

18 Structural information in chemical patents
Uses mixture of:
• 2D structure diagrams
• linear formulae (e.g. “C2H5”, “EtOH”)
• specific nomenclature (e.g. “phenyl”, “isopropyl”)
• generic nomenclature (e.g. “alkyl”, “heteroaryl”)
• non-structural expressions (e.g. “pharmaceutically acceptable cation”, “group known in the art”)
Many machine-readable systems just show structural information as free text and images

19 Specific Structures from Patents
Several databases contain specific molecules claimed in patents
• Chemical Abstracts Registry
• Derwent Registry
• MDL announced major new database Nov 2003
  o will include reactions, molecules and Markush display
  o http://www.mdl.com/company/news/press_releases/2003/pr_patentdb_07nov03.jsp

20 Markush Structures
Also known as “Generic Structures” or “R-group Structures”
Chemical structures involving variable parts
[Figure: example Markush structure, a scaffold bearing OH, R1 and R2, with R1 = Cl, Br or I and R2 = one of several alkyl chains, each drawn with its attachment point (*)]

21 Markush Structures
Compact representation of a set or class of specific compounds with common structural features
Used in
• chemical patents
• query structures in substructure search systems
• Quantitative Structure-Activity Relationship (QSAR) analysis
  o class of related compounds with activity data
• combinatorial libraries
  o rapid synthesis of large numbers of related compounds
• legislation (controlled drugs, chemical weapons)

22 Variability in Markush Structures
s-variation (substituent variation): list of alternative values for an R-group
p-variation (position variation): variable point of attachment
f-variation (frequency variation): multiple occurrence of groups
h-variation (homology variation): generically described group (e.g. “alkyl”)
• potentially infinite set of specific alternatives

23 Types of variation
substituent variation: R1 is methyl or ethyl
homology variation: R2 is alkyl
position variation: R3 is amino
frequency variation: m is 1-3
[Figure: example structure carrying OH, Cl, R1, R2, R3 and a (CH2)m chain]
24 Types of Markush structure
subst homol posn freq
Patents * * * *
Queries * (*) (*) (*)
QSAR * *
Libraries * (*) (*)
Legislation * * (*) (*)

25 Markush Structures
Compact representation for sets of molecules
• common parts shown once only
Can be considered as formal “grammar” for generating valid molecules (“sentences”)
Enumeration of coverage usually impractical and often impossible (infinite sets)
Appropriate algorithms for handling take advantage of Markush representation:
• avoid enumeration (especially infinite sets)
• compare finite grammars rather than infinite sets of valid sentences
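
The “grammar” view and the enumeration problem can be illustrated with a short sketch. Everything in it is an assumption made for the example: the scaffold is a hypothetical SMILES-like template, `None` is used to stand for a homology-variant group such as “alkyl”, and the function names are invented.

```python
from itertools import product
from math import inf, prod

# Hypothetical Markush structure: a template plus alternatives for each R-group.
scaffold = "Oc1ccc({R1})cc1{R2}"
r_groups = {
    "R1": ["C", "CC"],          # substituent variation: methyl or ethyl
    "R2": ["Cl", "Br", "I"],    # substituent variation: three halogens
}


def coverage_size(groups):
    """Number of specific structures covered; None marks an open-ended generic group."""
    return prod(inf if alts is None else len(alts) for alts in groups.values())


def enumerate_members(template, groups):
    """Generate every 'sentence' of the grammar (only possible for finite lists)."""
    names = list(groups)
    for combo in product(*(groups[n] for n in names)):
        yield template.format(**dict(zip(names, combo)))


members = list(enumerate_members(scaffold, r_groups))
print(len(members), members[0])

# Replacing one list with a generic term makes the coverage unbounded.
print(coverage_size({"R1": None, "R2": ["Cl", "Br", "I"]}))
```

Two finite lists already multiply to six structures; real patents multiply many longer lists, and a single generic term makes the set infinite, which is why the algorithms compare the compact representations instead of enumerating.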

26 Dr Eugene A. Markush
Born Budapest, Hungary, c. 1888
Migrated to USA, 1913 (Citizen, 1920)
Founded Pharma Chemical Corporation (NJ), 1919
Filed US patent 1506316 on pyrazolone dyes, 9 January 1924, using expression “where R is a group selected from ...” to circumvent USPTO “rule against ‘or’ ”
Died New York, 21 April 1968
27 Markush storage and retrieval
Early systems (1950s, 1960s) developed in-house by pharmaceutical companies/consortiums
High costs of patent abstracting and technical difficulties with automation shifted development to specialist companies
Fragmentation code systems superseded by topological (structure graphics) systems

28 Fragmentation Codes
Structural features (ring systems, functional groups, etc.) used as indexing terms
Structural relationships usually lost
• all alternatives tend to be “over-coded”
• retrieved structures include many “false drops” (“ballast”)
Codes originally assigned manually
• now usually generated (semi-)automatically from graphical input
• queries also generated automatically
Some codes use “closed” set of terms (periodically revised)
Others are “open-ended”
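
How over-coding produces false drops can be shown with plain set operations. This is a toy illustration only: the fragment terms, the two-member record and the function names are hypothetical and do not reproduce any real fragmentation code.

```python
# Hypothetical Markush record with two alternative members, each described
# by the fragment terms that member actually contains.
members = [
    {"phenyl", "chloro"},
    {"pyridyl", "hydroxy"},
]

# Over-coding: the record is indexed with the union of terms from all alternatives,
# so the information about which terms occur together is lost.
record_code = set().union(*members)


def code_match(query_terms, code):
    """Fragment-code search: every query term must appear in the record's code."""
    return query_terms <= code


def true_match(query_terms, alternatives):
    """Ground truth: some single member must contain all the query terms."""
    return any(query_terms <= m for m in alternatives)


query = {"phenyl", "hydroxy"}
print(code_match(query, record_code), true_match(query, members))
```

The record is retrieved because both terms are in its code, yet no single covered compound contains both, which is exactly a false drop.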

29 Fragmentation Codes
Derwent World Patent Index Chemical Code
• closed code with about one thousand terms
• large comprehensive backfile (from early 1960s)
• available for online searching (Questel)
IFI/Plenum Code
• open-ended code
• used for “CLAIMS” database (U.S. patents)
• available for online searching (STN)
  o no graphical interface

30 Fragmentation Codes
GREMAS code
• very sophisticated open-ended code
• private collaboration between (mainly) German pharmaceutical companies
• good retrieval performance
• input discontinued in early 1990s
• backfile (from 1950s) still searched at a few companies

31 Graphical (“topological”) systems
Development started in early 1980s
Intended to supplement graphical substructure search systems for specific structures
• MACCS, CAS Online, DARC, etc.
User draws graphical (sub)structure query
System displays graphical Markush structure hits
Two commercial systems implemented
• available for online searching only
• each with its own database
• no “in-house” systems or databases

32 Markush DARC
Joint development of
• Questel SA (software and online host)
• Derwent Information Ltd (WPIM database)
• INPI (French Patent Office) (PHARMSEARCH database)
Integrated database (“Merged Markush File”) now available
• http://www.inpi.fr/inpi/mms/index.htm
• extension forwards (Derwent) and backwards (INPI)

33 MARPAT
Software and database from Chemical Abstracts Service
Available online via STN International
• http://www.cas.org/CASFILES/marpat.html
Integrated with CA Registry database of specific compounds
Proposal to allow Derwent database to be searched with MARPAT software dropped in mid 1990s for commercial reasons

34 The Markush Problem
Representation
• mixture of structures and text
• generic (h-variant) expressions
• vagueness (“where by X we mean…”)
Search
• the “translation” problem
  o specific groups (e.g. tert. butyl) must be matched against generic expressions (e.g. 1-6C alkyl)
• the “segmentation” problem
  o boundaries between scaffold and R-groups may not coincide in query and database structures

35 Matching Markush Structures
Translation and Segmentation problems coincide to make it difficult to spot matching structures
[Figure: two Markush structures whose scaffolds and R-groups are segmented differently; the R-group definitions shown include alkyl, isopropyl, t-butyl, cycloalkyl, NH2, O and S]

36 Sheffield University Research
Extended project (1979-1994) on Markush structure storage and retrieval
• designed external (GENSAL) and internal (ECTR) storage formats
  o parameter lists for homology-variant groups
• developed novel matching algorithms based around graph isomorphism
  o “reduced graph” concept
• influenced development of commercial systems
  o independent work also done at CAS, Derwent and Questel
Downs and Barnard, J. Documentation, 1998, 54 (1), 106-120

37 GENSAL
Formalised version of language used in patent specifications
Design analogous to programming language
Lexical elements include
• structure diagrams
• specific and generic chemical nomenclature
• substitution operators
• position/multiplicity values
GENSAL Interpreter program (compiler) generates internal representation based on “partial” connection tables with links between them

38 GENSAL example
[Figure: structure diagram with substituent positions R1, R2 and R3]
R1 = H / alkyl <1-4>;
R2 = F / Cl;
R1 + R2 = SD [structure diagram with two attachment points (*), not reproduced];
R3 = phenyl OSB <1-2> Cl;
IF R2 = Cl THEN R1 = H.

39 Parameter Lists
Represent generic (“homology-variant”) expressions by set of permitted numerical ranges for structural parameters
e.g. “alkyl”:
• 1-n carbon atoms
• 0 heteroatoms
• 0 double or triple bonds
• 0-n branch points
• 0 rings
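
A parameter list of this kind maps naturally onto a table of numeric ranges. The sketch below encodes the “alkyl” list from the slide and tests specific groups against it; the dictionary keys, the use of `None` for an open upper limit (“n”), and the function name `satisfies` are assumptions for the example, not the Sheffield ECTR format.

```python
# (low, high) ranges per structural parameter; high=None means no upper limit ("n").
ALKYL = {
    "carbons": (1, None),
    "heteroatoms": (0, 0),
    "multiple_bonds": (0, 0),
    "branch_points": (0, None),
    "rings": (0, 0),
}

# A narrower generic term, e.g. "1-6C alkyl", only tightens the carbon range.
ALKYL_1_6C = dict(ALKYL, carbons=(1, 6))


def satisfies(params, ranges):
    """True if every parameter of a specific group lies inside the permitted range."""
    for name, (low, high) in ranges.items():
        value = params[name]
        if value < low or (high is not None and value > high):
            return False
    return True


# Parameters counted by hand for two specific groups (hydrogens ignored).
tert_butyl = {"carbons": 4, "heteroatoms": 0, "multiple_bonds": 0,
              "branch_points": 1, "rings": 0}
cyclohexyl = {"carbons": 6, "heteroatoms": 0, "multiple_bonds": 0,
              "branch_points": 0, "rings": 1}

print(satisfies(tert_butyl, ALKYL_1_6C), satisfies(cyclohexyl, ALKYL))
```

Checking a specific group such as tert-butyl against “1-6C alkyl” then needs only a few integer comparisons, with no enumeration of all possible alkyl groups.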

40 Reduced Graphs
Connected groups of atoms “collapsed” to form a single node of the reduced graph
• atoms in the same ring system (R)
• optionally branched carbon chains (C)
• connected acyclic heteroatoms (Z)
[Figure: example molecule and its reduced graph, with nodes labelled Z 3, R 9, C 2, Z 1 and Z 1]
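
The collapsing step can be sketched with a small union-find routine. This is an illustration under assumptions, not the Sheffield implementation: ring membership is supplied as input rather than perceived, hydrogens are ignored, ring atoms are merged only across ring bonds, and the function name `reduced_graph` and the example molecule are invented.

```python
def reduced_graph(elements, ring_atoms, bonds):
    """Collapse atoms into R / C / Z nodes; return node labels and node-node edges."""

    def kind(atom):
        if atom in ring_atoms:
            return "R"                       # ring-system atom
        return "C" if elements[atom] == "C" else "Z"   # acyclic carbon or heteroatom

    parent = list(range(len(elements)))

    def find(atom):
        while parent[atom] != atom:
            parent[atom] = parent[parent[atom]]   # path halving
            atom = parent[atom]
        return atom

    # Merge bonded atoms of the same kind; ring atoms only across ring bonds.
    for a, b, in_ring in bonds:
        if kind(a) == kind(b) and (kind(a) != "R" or in_ring):
            parent[find(a)] = find(b)

    groups = {}
    for atom in range(len(elements)):
        groups.setdefault(find(atom), []).append(atom)

    labels = {root: (kind(root), len(atoms)) for root, atoms in groups.items()}
    edges = {frozenset((find(a), find(b))) for a, b, _ in bonds if find(a) != find(b)}
    return labels, edges


# Hypothetical example: a six-membered carbon ring (atoms 0-5) with a methyl
# carbon (6) on atom 0 and a hydroxyl oxygen (7) on atom 3.
elements = ["C"] * 6 + ["C", "O"]
ring_atoms = {0, 1, 2, 3, 4, 5}
bonds = [(0, 1, True), (1, 2, True), (2, 3, True), (3, 4, True),
         (4, 5, True), (5, 0, True), (0, 6, False), (3, 7, False)]

labels, edges = reduced_graph(elements, ring_atoms, bonds)
print(sorted(labels.values()), len(edges))
```

Each node is reported as a (type, atom count) pair, in the same spirit as the node labels in the figure, and the two edges record that the chain and the heteroatom are both attached to the ring node.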

41 Reduced Graphs
Boundaries between nodes are non-arbitrary
• thus provides solution to segmentation problem
Each node can be described by a parameter list
Homology-variant groups can also be represented as reduced graph nodes with parameter lists
• thus provides solution to translation problem:
  o first identify isomorphism between reduced graphs
  o if parameter lists match, can do atom-by-atom match on original atoms in specific groups, if necessary
[Figure: reduced graph nodes annotated with parameter lists]

42 Design of Commercial Systems
Sheffield system never implemented commercially
Ideas incorporated into both Markush DARC and MARPAT
• also used by BCI Ltd. in various projects
Other ideas developed independently
• both systems have patent protection
Basic concepts parallel those developed at Sheffield
• Barnard, J. M. “A comparison of different approaches to Markush structure handling”. J. Chem. Inf. Comput. Sci. 1991, 31 (1), 64-67
• Berks, A. “Current state of the art of Markush topological search systems”. World Patent Information, 2001, 23, 5-13

43 Markush DARC
Specific groups shown as structure diagrams
• rather clunky display (one R-group at a time)
Generic groups shown as “superatoms”
• e.g. CHK = alkyl, HEF = fused heterocycle
• qualitative attributes used in searching
• quantitative parameters (texnotes) available for display
Reduced graph concepts used in atom-by-atom search stage

44 Markush DARC Display
[Figure: example Markush DARC display, not reproduced]

45 MARPAT
Part of CASLink substructure search system on STN
Input and display uses text and graphics
• similar to GENSAL
Generic Group Nodes with quantitative attributes (not fully implemented for search)

46 MARPAT Generic Group Nodes
R   any group
Cy  cyclic group
Ak  carbon chain
Q   heteroatom
Cb  carbocycle
Hy  heterocycle
X   halogen
M   metal
GGN definitions imply reduced graph concept
“Spin-off” GGNs generated for specific groups to allow specific-generic matching (“translation”)

47 MARPAT Display
MSTR 1
[Figure: Markush structure diagram, not reproduced]
G1 = N, CH
G2 = H, X, SC,Cl
DER: or acid addition salts
MPL: Claim 1

48 Conclusions from Lecture 5
Chemical reaction search requires atom-atom mapping between reactant and product
• Maximal Common Subgraph algorithms can be used
3D substructure search uses interatomic distances as edge labels in fully-connected graphs
Markush structures pose particular problems to structure search systems
• extremely broad classes
• homology-variant (generic) expressions
• segmentation between R-groups
Two publicly-available Markush search systems for chemical patents
• Markush DARC and MARPAT
49 Further Reading
Chen, L.; Nourse, J. G.; Christie, B. D.; Leland, B. A.; Grier, D. L. “Over 20 years of reaction access from MDL: a novel reaction substructure search system”. J. Chem. Inf. Comput. Sci. 2002, 42, 1296-1310.
“Representation and manipulation of 3D molecular structures”. Chapter 2 (pp. 27-52) in A. R. Leach and V. J. Gillet, An Introduction to Chemoinformatics, Dordrecht: Kluwer, 2003
Berks, A. H. “Current state of the art of Markush topological search systems”. In J. Gasteiger (ed.) Handbook of Chemoinformatics: From Data to Knowledge, Vol 2, pp. 885-903, Wiley-VCH, 2003

50 Lecture 6: Topics to be Covered
Similarity searching
• similarity search vs. substructure search
• similarity and distance metrics
• different types of descriptor for similarity search
• choice of descriptors
The drug discovery process