cheminformatics: an overview
TRANSCRIPT
CHEMINFORMATICS
Subhasis Banerjee (Asst. Professor), Gupta College of Technological Sciences, West Bengal, India
Contents
• An overview of cheminformatics
• Searching chemical structures
• Structure-spectra correlation
Chemoinformatics encompasses the design, creation, organisation, management, retrieval, analysis, dissemination, visualization and use of chemical information.
It is the mixing of information resources to transform data into information and information into knowledge, with the aim of making decisions faster in drug lead identification and optimization.
Concepts and techniques:
Structure search (Various Databases)
Molecular similarity
Substructure search
Quantitative structure-activity relationships (QSAR)
3D structure generation from a 2D structure
Conformer generation
Algorithms
How can a molecular structure be stored on a computer?
• Common name: aspirin
• IUPAC name: 2-acetoxybenzoic acid
• Formula: C9H8O4
• As an image (PNG, GIF, etc.)
• CAS number: 50-78-2
• File format: ChemDraw file, MOL file, etc.
• SMILES string: O=C(Oc1ccccc1C(=O)O)C
• Binary fingerprint: 10000100000001100000100100000001
SMILES: Simplified Molecular Input Line Entry System (Weininger, J Chem Inf Comput Sci, 1988, 28, 31)
Examples:
• CC represents CH3CH3 (ethane)
• CC(=O)O represents CH3COOH (acetic acid)
Basic guidelines:
• Hydrogens are implicit
• Parentheses indicate branches
• Each atom is connected to the preceding atom to its left (excluding branches in between)
• Single bonds are implicit; = denotes a double bond, # a triple bond
What is C(C)(C)(C)C?
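The guidelines above can be tried out with a small sketch. It is not a real SMILES parser: it assumes only the atoms C, N and O, the bond symbols -, = and #, and parentheses for branches (no rings, aromatic atoms, charges or bracket atoms). The function name and the valence table are illustrative choices, not taken from any toolkit.

```python
def parse_simple_smiles(smiles):
    """Parse a tiny SMILES subset and return (atoms, bonds, implicit hydrogens)."""
    valence = {"C": 4, "N": 3, "O": 2}   # assumed normal valences
    order = {"-": 1, "=": 2, "#": 3}
    atoms, bonds = [], []                # element symbols; (i, j, bond order)
    stack = []                           # atoms at which a branch was opened
    prev, pending = None, 1              # atom to bond to; order of next bond
    for ch in smiles:
        if ch in valence:
            atoms.append(ch)
            idx = len(atoms) - 1
            if prev is not None:         # connect to the preceding atom
                bonds.append((prev, idx, pending))
            prev, pending = idx, 1       # single bonds are implicit
        elif ch in order:
            pending = order[ch]
        elif ch == "(":                  # branch starts: remember the branch point
            stack.append(prev)
        elif ch == ")":                  # branch ends: return to the branch point
            prev = stack.pop()
        else:
            raise ValueError(f"unsupported SMILES character: {ch!r}")
    used = [0] * len(atoms)
    for i, j, bond_order in bonds:
        used[i] += bond_order
        used[j] += bond_order
    hydrogens = [valence[a] - u for a, u in zip(atoms, used)]
    return atoms, bonds, hydrogens


atoms, bonds, hydrogens = parse_simple_smiles("C(C)(C)(C)C")
print(atoms, hydrogens)
```

For C(C)(C)(C)C the sketch finds a central carbon with four carbon neighbours and no hydrogens, surrounded by four CH3 groups, which is 2,2-dimethylpropane (neopentane).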
InChI: International Chemical Identifier
• Line notation developed by NIST (National Institute of Standards and Technology) and IUPAC
• Goal: an index for uniquely identifying a molecule
• Aspirin: InChI=1/C9H8O4/c1-6(10)13-8-5-3-2-4-7(8)9(11)12/h2-5H,1H3,(H,11,12)/f/h11H
Features:
• Derived from the structure (unlike a CAS number)
• One-to-one relationship between InChI and structure
Notes:
• Not human readable or writeable
Databases:
• World Drug Index (WDI)
• Available Chemicals Directory (ACD)
• National Cancer Institute (NCI)
• Maybridge (3D formatted database, by the company)
• Beilstein (the largest database for organic chemistry, covering the literature since 1771; now maintained by Elsevier)
• ZINC (3D formatted database, by UCSF, since 2005)
• World of Molecular Bioactivity (WOMBAT)
• Toxic Substances Control Act (TSCA)
• Medchem
• CSD
• PDB
Cambridge Structural Database (CSD)
• X-ray crystal structures of more than 250,000 compounds (organic and organometallic)
• Established in 1965
• Searchable by textual queries, structural queries and specific 3D constraints (conformation or distance variables)
Protein Data Bank (PDB)
• More than 25,000 X-ray and NMR structures of proteins and protein-ligand complexes
• Some nucleic acid and carbohydrate structures
• Established in 1971 at Brookhaven National Laboratory; now run by a consortium
• Retrieval by textual queries or, in some interfaces, by amino acid sequences
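Full structure search: the query is a complete molecule, and a hit is that same molecule
Substructure search: the query is a pattern of atoms and bonds, and hits are the molecules that contain it
Superstructure search: the query is a complete molecule, and hits are the database structures contained within it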
Many databases contain millions of structures, so search speed is important
The simplest approach uses a canonical representation for query and database structures (e.g. canonical SMILES):
• sort the database SMILES into alphanumerical order
• search the sorted list for a match with the query
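A minimal sketch of the sorted-list approach, assuming the SMILES strings have already been canonicalised by some toolkit (canonicalisation itself is not shown). The function names are illustrative.

```python
from bisect import bisect_left


def build_sorted_index(canonical_smiles):
    """Sort the (already canonical) database SMILES once, in advance."""
    return sorted(canonical_smiles)


def full_structure_match(sorted_smiles, query):
    """Binary search of the sorted list for an exact match with the query."""
    pos = bisect_left(sorted_smiles, query)
    return pos < len(sorted_smiles) and sorted_smiles[pos] == query


index = build_sorted_index(["CCO", "CC(=O)O", "c1ccccc1", "CC"])
print(full_structure_match(index, "CCO"))
print(full_structure_match(index, "CCN"))
```

Binary search needs about log2(n) string comparisons, so even a database of millions of structures is searched in a couple of dozen steps.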
“Hash table” lookup can improve search speed:
• calculate a hash code (an “idiot number” in a predefined range) from the SMILES for each database structure
• this is the address (disk file or memory) at which the full representation is stored
• only SMILES that have the same hash code need to be compared
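The hash-table idea can be sketched as follows. The choice of CRC32 as the hash function and of 1024 addresses is an assumption made for illustration; a production system would choose both to suit its storage.

```python
import zlib

N_BUCKETS = 1024  # assumed size of the predefined address range


def hash_code(smiles):
    """Map a canonical SMILES string to an address in 0 .. N_BUCKETS-1."""
    return zlib.crc32(smiles.encode("utf-8")) % N_BUCKETS


def build_table(canonical_smiles):
    """Store each structure at the address given by its hash code."""
    table = {}
    for smi in canonical_smiles:
        table.setdefault(hash_code(smi), []).append(smi)
    return table


def lookup(table, query):
    """Compare the query only with structures that share its hash code."""
    return query in table.get(hash_code(query), [])


table = build_table(["CCO", "CC(=O)O", "c1ccccc1"])
print(lookup(table, "CC(=O)O"))
print(lookup(table, "CCN"))
```

Different SMILES can share a hash code, which is why the final string comparison within the bucket is still required.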
New compounds need to be added regularly:
• this used to be done by chemical information specialists
• it is now frequently done directly by bench chemists
The registration system must:
• check consistency of input data (e.g. compare molecular formula with structure)
• check that the compound is really new (there are different ways of handling tautomers, salts, stereoisomers etc.)
• assign a registry number
• add supplementary data (melting point etc.)
• make the data immediately available for search
In graph theory terms, when two full structures match, their graphs are said to be isomorphic:
• each node N1 in G1 must be mapped to a node N2 in G2
• neighbours of N1 must map to neighbours of N2
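The two mapping conditions can be illustrated with a small backtracking sketch. It assumes each molecule is given as a dictionary of atom labels plus an adjacency dictionary, and it ignores bond orders for brevity; it is a teaching example rather than an efficient algorithm.

```python
def is_isomorphic(labels1, adj1, labels2, adj2):
    """Backtracking test for graph isomorphism.

    labels: dict node -> element symbol; adj: dict node -> set of neighbours.
    """
    if len(labels1) != len(labels2):
        return False
    nodes1 = list(labels1)
    mapping = {}  # node in G1 -> node in G2

    def extend(k):
        if k == len(nodes1):
            return True                      # every node has been mapped
        n1 = nodes1[k]
        for n2 in labels2:
            if n2 in mapping.values():
                continue                     # n2 is already used
            if labels1[n1] != labels2[n2] or len(adj1[n1]) != len(adj2[n2]):
                continue                     # atom type or degree differs
            # mapped neighbours of n1 must map to neighbours of n2
            if all(mapping[m] in adj2[n2] for m in adj1[n1] if m in mapping):
                mapping[n1] = n2
                if extend(k + 1):
                    return True
                del mapping[n1]              # backtrack
        return False

    return extend(0)


chain = {0: {1}, 1: {0, 2}, 2: {1}}          # three atoms in a row
ethanol = {0: "C", 1: "C", 2: "O"}           # CCO
ethanol_reversed = {0: "O", 1: "C", 2: "C"}  # OCC
dimethyl_ether = {0: "C", 1: "O", 2: "C"}    # COC
print(is_isomorphic(ethanol, chain, ethanol_reversed, chain))
print(is_isomorphic(ethanol, chain, dimethyl_ether, chain))
```

Ethanol written as CCO and as OCC yields the same labelled graph, whereas dimethyl ether has the same atoms but a different connectivity and is rejected.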
So far we have considered matching one query substructure against one database full structure:
• each structure from the database needs to be compared against the query in turn
• many will fail because they don't contain the query substructure
• “screening” allows many of these to be eliminated before we get to this stage
The fragments present in a structure can be represented as a sequence of 0s and 1s:
00010100010101000101010011110100
• 0 means the fragment is not present in the structure
• 1 means the fragment is present in the structure (perhaps multiple times)
Each 0 or 1 can be represented as a single bit in the computer (a “bitstring”). For chemical structures these are often called structure “fingerprints”.
Build a fingerprint for the query substructure.
Only those database structures that contain all the fragments in the query can possibly match the query:
Query:       00000100010101000001010011010100
DB struct 1: 00010100010101000101010011110100  MATCH
DB struct 2: 00000000100101001001000011100000  NO MATCH
Comparing fingerprint bitstrings is very fast (logical & operation)
Only those structures that pass the screening stage need to be considered as candidates for atom-by-atom isomorphism search
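The screening test reduces to one bitwise operation: a database structure passes if every bit set in the query is also set in its fingerprint. The sketch below uses the three bitstrings from the example above; storing fingerprints as Python integers is an assumption made for brevity.

```python
def to_fingerprint(bitstring):
    """Convert a string of 0s and 1s into an integer fingerprint."""
    return int(bitstring, 2)


def passes_screen(query_fp, db_fp):
    """True if all fragments (bits) in the query are present in the structure."""
    return (query_fp & db_fp) == query_fp


query = to_fingerprint("00000100010101000001010011010100")
db1 = to_fingerprint("00010100010101000101010011110100")
db2 = to_fingerprint("00000000100101001001000011100000")

print(passes_screen(query, db1))  # candidate for atom-by-atom search
print(passes_screen(query, db2))  # eliminated at the screening stage
```

Passing the screen is necessary but not sufficient, so structure 1 still has to go through the atom-by-atom search.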
Ideally we want to eliminate as many structures as possible at the screening stage; 99% screenout or more would be good.
Fingerprint construction can help in this. The frequency distributions of fragments in a large database are very “skewed”:
• a few fragments occur in almost all compounds, and will therefore give little or no screenout
• many fragments occur in very few compounds, so a very long fingerprint (lots of fragments) is needed to ensure that some are present in the query
Do the time-consuming work in advance. Full structure search provides an example of this:
• canonicalisation is a slow process (no polynomial-time algorithm is known for general graphs), but it can be done in a pre-processing of the file, independently of the query
• the canonical representations are then stored
• rapid matches can be made against a canonicalised query structure
• this is faster than using a graph isomorphism algorithm on non-canonical representations
Similar principles are used in some substructure search systems. A tree structure is built, classifying all the atoms found in all the structures in the database:
• first level based on atom type
• second level based on number of connections
• third level based on type of first neighbour
• fourth level based on type of second neighbour, etc.
• lower levels based on classifications applied to neighbours (relaxation)
• the bottom of the tree lists the structures that contain this class of atom
[Figure: example classification tree with levels for central atom type, number of connections, first neighbour, second neighbour and third neighbour; the branches shown include C-C-Br and C-C-F]
Search can be done by tracing the tree, looking for atom classes found in the query, and combining the lists of structures found at the bottom.
A backtracking atom-by-atom search may be needed to check the hits found.
The best-known example is Beilstein's CrossFire. The main problem is updating the trees when new structures are added to the database.
Queries for substructure search systems may be more complicated than simple subgraphs. Different systems provide different capabilities:
• variable atom and bond types
• specification of allowed substitution
[Figure: example query annotated with a variable atom type ([F,Cl,Br,I]), a single or double bond, and the substitution constraints C(s0) and C(s1), labelled “no further substituents” and “one or more substituents”]
Daylight uses an extension of SMILES to describe structure queries (SMARTS). Various properties can be attached to each atom:
• [CX3] carbon with 3 connections
• [Nr5] nitrogen in a ring of size 5
Properties can be combined with logical operators:
• ! (NOT)
• & (AND, high precedence)
• , (OR)
• ; (AND, low precedence)
Complex patterns can be specified this way:
• [F,Cl,Br,I] any of the halogen atoms
• [!C;!R0] atom other than aliphatic carbon, in a ring (e.g. a ring heteroatom)
$(smarts_string) can also be used as an atom property; this is called recursive SMARTS:
• e.g. $(NC=*) nitrogen single-bonded to a carbon that is double-bonded to any atom (e.g. an amide)
Recursive SMARTS can be used to describe very complex patterns, e.g. primary or secondary amine, but not amide:
[N&X3;H2,H1;!$(NC=*)]
• N&X3: nitrogen AND 3 connections
• H2,H1: 2 attached hydrogens OR 1 attached hydrogen
• !$(NC=*): NOT a nitrogen connected to a carbon with a double bond to any atom
• the three parts are joined by ; (AND, low precedence)
Searching Markush structures in patents:
• nature and origin of Markush structures
• fragment codes
• topological systems (MARPAT, Markush DARC)
Reaction searching:
• atom-atom mapping
Maximal Common Substructure search:
• what is the largest substructure common to two molecules?
Data mining for conformational properties and intermolecular interactions (CSD & PDB)
Further understanding of the nature of protein structure and its relationship to amino acid sequence (PDB)
Homology modeling (comparative modeling) (PDB)
[Figure: three-point distance query with a = 8.62 ± 0.58 Å, b = 7.08 ± 0.56 Å, c = 3.35 ± 0.65 Å; the accompanying structure drawings are not recoverable]
Instead of searching for all molecules containing a given substructure, we search for molecules “similar” to a given target molecule.
Similar property principle: “Structurally similar molecules are expected to exhibit similar properties or biological activities”
Mark Johnson and Gerry Maggiora (Eds.), Concepts and Applications of Molecular Similarity. Wiley, New York, 1990
“Similarity is in the eye of the beholder”
Similarity can be measured in many different ways:
• equivalence classes: can say that two molecules are similar, or that they are different
• numerical measures: can say that two molecules have a similarity of, e.g., 0.85
  • similarity coefficients usually have values between 0.0 (totally different) and 1.0 (identical)
  • distance measures are the “opposite” of similarity (0.0 = identical; may have no maximum, or may be normalised to fix a maximum limit)
Two different molecules with the same graph if node and edge labels are ignored
[Figure: structure drawings of the two molecules (atom labels N, OH, O, NCH3, CH3)]
Normally we calculate some numerical measure of similarity between molecules:
• the query structure is a “target” molecule
• database structures can be ranked in decreasing order of similarity to the target
• find all molecules with similarity to the target above a threshold
• find the N most similar molecules to the target
No particular substructure is required in the retrieved molecules, but they will have structural features in common with the target.
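The three retrieval modes (ranking, threshold, N most similar) can be sketched on integer fingerprints. The similarity measure here is the ratio of shared bits to bits set in either fingerprint, chosen as an assumption for the example; the database contents and names are invented.

```python
def shared_bit_ratio(fp1, fp2):
    """Bits set in both fingerprints divided by bits set in either."""
    either = bin(fp1 | fp2).count("1")
    return bin(fp1 & fp2).count("1") / either if either else 0.0


def rank_by_similarity(target_fp, database):
    """Return (similarity, name) pairs in decreasing order of similarity."""
    scored = [(shared_bit_ratio(target_fp, fp), name) for name, fp in database.items()]
    return sorted(scored, reverse=True)


def above_threshold(ranked, threshold):
    """Keep only the molecules whose similarity exceeds the threshold."""
    return [(score, name) for score, name in ranked if score > threshold]


def most_similar(ranked, n):
    """Keep the n most similar molecules."""
    return ranked[:n]


# Hypothetical 6-bit fingerprints for four database molecules.
database = {"mol_a": 0b001111, "mol_b": 0b000111, "mol_c": 0b000001, "mol_d": 0b110000}
ranked = rank_by_similarity(0b001111, database)
print(ranked)
print(above_threshold(ranked, 0.5))
print(most_similar(ranked, 3))
```

Ranking once and then filtering keeps the threshold and top-N queries consistent with each other.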
Similarity measures are most commonly calculated from structure fingerprints:
• count the bits that are “on” in both molecules
• count the bits that are “on” in each molecule separately
struct A: 00010100010101000101010011110100  13 bits on (A)
struct B: 00000000100101001001000011100000   8 bits on (B)
A AND B:  00000000000101000001000011100000   6 bits on (C)
A similarity coefficient can be calculated from A, B and C:
$$\text{similarity} = \frac{C}{A + B - C} = \frac{6}{13 + 8 - 6} = 0.4$$
This is the number of bits set in both molecules divided by the number of bits set in either molecule.
The Tanimoto coefficient is the most commonly used similarity coefficient in chemical informatics; it is also called the Jaccard coefficient.
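The worked example can be reproduced directly from the two bitstrings. This is a sketch with illustrative function names; fingerprints are kept as strings so the counts A, B and C are easy to inspect.

```python
def bit_counts(fp_a, fp_b):
    """Return (A, B, C): bits on in each fingerprint and bits on in both."""
    a = fp_a.count("1")
    b = fp_b.count("1")
    c = sum(1 for x, y in zip(fp_a, fp_b) if x == "1" and y == "1")
    return a, b, c


def tanimoto(a, b, c):
    """Tanimoto (Jaccard) coefficient: C / (A + B - C)."""
    return c / (a + b - c)


struct_a = "00010100010101000101010011110100"
struct_b = "00000000100101001001000011100000"
a, b, c = bit_counts(struct_a, struct_b)
print(a, b, c)            # 13 8 6
print(tanimoto(a, b, c))  # 6 / 15 = 0.4
```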
Dice coefficient:
$$\text{similarity} = \frac{2C}{A + B} = \frac{12}{13 + 8} = 0.57$$
• does not give the same values as the Tanimoto coefficient, but will rank molecules in the same order of similarity to a target, i.e. it is “monotonic” with the Tanimoto coefficient
• also called the Czekanowski or Sørensen coefficient
Cosine coefficient:
$$\text{similarity} = \frac{C}{\sqrt{A \times B}} = \frac{6}{\sqrt{13 \times 8}} = 0.588$$
• not monotonic with the Tanimoto and Dice coefficients, but highly correlated with them
• also called the Ochiai coefficient
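A short sketch of the Dice and cosine coefficients for the same counts (A = 13, B = 8, C = 6). It also checks the relation Dice = 2T / (1 + T), where T is the Tanimoto coefficient, which is why the two rank molecules in the same order.

```python
import math


def dice(a, b, c):
    """Dice coefficient: 2C / (A + B)."""
    return 2 * c / (a + b)


def cosine(a, b, c):
    """Cosine (Ochiai) coefficient: C / sqrt(A * B)."""
    return c / math.sqrt(a * b)


a, b, c = 13, 8, 6
t = c / (a + b - c)                   # Tanimoto coefficient
print(round(dice(a, b, c), 2))        # 0.57
print(round(cosine(a, b, c), 3))      # 0.588
print(math.isclose(dice(a, b, c), 2 * t / (1 + t)))
```

Because 2T / (1 + T) increases steadily with T, any ranking by Dice matches the ranking by Tanimoto; no such one-to-one relation exists for the cosine coefficient.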
The three coefficients discussed so far all ignore bits that are off in both molecules. Is common absence of features evidence of similarity between them?
• are a camel, a horse and a nematode similar because they all lack wings?
• are a bat, a heron and a dragonfly similar because they all have wings?
Are these molecules similar because they all lack heteroatoms?
[Figure: hydrocarbon structures built only from CH and CH2 groups]
A similarity coefficient that takes into account the fingerprint bits that are off in both molecules (D) is the simple matching coefficient:
$$\text{similarity} = \frac{C + D}{N} = \frac{6 + 17}{32} = 0.719$$
where N is the length of the fingerprint:
$$N = A + B - C + D$$
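The same two example fingerprints (struct A and struct B, 32 bits each) can be used to check the value of D and the resulting coefficient. The function names are illustrative.

```python
def on_off_counts(fp_a, fp_b):
    """Return (C, D, N): bits on in both, bits off in both, fingerprint length."""
    c = sum(1 for x, y in zip(fp_a, fp_b) if x == "1" and y == "1")
    d = sum(1 for x, y in zip(fp_a, fp_b) if x == "0" and y == "0")
    return c, d, len(fp_a)


def simple_matching(fp_a, fp_b):
    """Simple matching coefficient: (C + D) / N."""
    c, d, n = on_off_counts(fp_a, fp_b)
    return (c + d) / n


struct_a = "00010100010101000101010011110100"
struct_b = "00000000100101001001000011100000"
print(on_off_counts(struct_a, struct_b))              # (6, 17, 32)
print(round(simple_matching(struct_a, struct_b), 3))  # 0.719
```

With long, sparse fingerprints D dominates, so almost any pair of molecules scores highly; this is the practical side of the "common absence" question raised above.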
Similarity can also be based on substructures, using maximal common subgraphs:
• A and B could be the number of atoms in each molecule, and C the number of atoms in their maximal common substructure (MCS)
• fingerprint-based similarity is generally faster than identifying the MCS, but the common features (fragments) will be smaller
Many different fragment types have been used for generating fingerprints for use in similarity searching:
• Atom sequence: linear path of atoms and bonds through the molecule; may generate only paths of certain lengths
• Augmented atom: an atom and its immediate neighbours
• Ring composition: atom/bond sequence around a ring; raises the question of which rings to choose
• Ring fusion patterns: sequence of ring connectivities around a ring; for each atom, specify the number of ring bonds it has
• Atom pairs: pair of atoms in the same molecule, with the number of bonds in the shortest path between them; additional differentiation between atom types by number of attached hydrogens / pi-bonds
• Topological torsions: connected sequence of 4 atom types, with atom types as described for atom pairs
Sometimes specific fragments (with detailed description of atom and bond types) are too specific to be of much use in fingerprints:
• very low frequency
• very sparse fingerprints
Atom and bond types can be generalised:
• any ring bond
• any halogen (F, Cl, Br, I)
• any chalcogen (O, S, Se, Te)
This gives fragments with higher frequency.
Fragments can be used to describe the 3D structure of a molecule too:
• they usually involve interatomic distances and/or bond angles
• because distance values are continuous variables, they are “binned”
• each bin represents a range of distances, e.g. 3.000 to 3.999 Å
• each bin corresponds to a fingerprint bit
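Binning can be sketched as follows. The bin width of 1 Å, the starting distance of 0 Å and the number of bins are assumptions chosen to match the 3.000 to 3.999 Å example; real systems often use narrower or unequal bins.

```python
def distance_bin(distance, min_dist=0.0, width=1.0, n_bins=10):
    """Return the index of the bin that a distance (in Å) falls into.

    Distances beyond the last bin are placed in the last bin.
    """
    index = int((distance - min_dist) // width)
    return max(0, min(index, n_bins - 1))


def distance_fingerprint(distances, **bin_options):
    """Set one fingerprint bit for every bin hit by at least one distance."""
    fingerprint = 0
    for d in distances:
        fingerprint |= 1 << distance_bin(d, **bin_options)
    return fingerprint


print(distance_bin(3.0), distance_bin(3.999), distance_bin(4.0))
print(bin(distance_fingerprint([3.2, 3.9, 7.1])))
```

Two distances in the same bin set the same bit, so the fingerprint records which distance ranges occur, not how often.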
A popular 3D descriptor is the 3-point pharmacophore:
• the molecule is analysed to identify “pharmacophoric points”, i.e. points in the molecule likely to be involved in binding to a receptor site:
  • positive charges
  • negative charges
  • hydrogen-bond donors (e.g. –OH, –NH2)
  • hydrogen-bond acceptors (e.g. =O)
  • aromatic groups
  • hydrophobic groups
• pharmacophoric points do not necessarily coincide with the positions of individual atoms
• each fragment consists of 3 pharmacophoric points; the distances between each pair of these points are binned and used to set fingerprint bits
• 4-point pharmacophore fragments are also used
• different people have used slightly different definitions of pharmacophoric points
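[Figure: 3-point pharmacophore with a hydrogen-bond donor (HBD), a hydrogen-bond acceptor (HBA) and an aromatic group (Ary), with distances of 2.8 Å, 3.6 Å and 1.2 Å]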
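A sketch of how 3-point pharmacophore fragments could be turned into keys, each of which would correspond to one fingerprint bit. The point types, the coordinates, the 1 Å bins and the key layout are all assumptions for illustration; identifying the pharmacophoric points in a real molecule is not shown.

```python
import math
from itertools import combinations


def bin_index(distance, width=1.0, n_bins=8):
    """Assign a distance (in Å) to a bin; long distances go in the last bin."""
    return min(int(distance // width), n_bins - 1)


def three_point_keys(points):
    """Return one order-independent key per triple of pharmacophoric points.

    points: list of (type, (x, y, z)) tuples.
    """
    keys = set()
    for triple in combinations(points, 3):
        sides = []
        for (type1, pos1), (type2, pos2) in combinations(triple, 2):
            pair = tuple(sorted((type1, type2)))
            sides.append((pair, bin_index(math.dist(pos1, pos2))))
        keys.add(tuple(sorted(sides)))  # sorting removes dependence on input order
    return keys


# Hypothetical coordinates for a donor, an acceptor and an aromatic group.
points = [("HBD", (0.0, 0.0, 0.0)), ("HBA", (2.5, 0.0, 0.0)), ("Ary", (0.0, 3.5, 0.0))]
for key in three_point_keys(points):
    print(key)
```

A molecule with n pharmacophoric points yields n(n-1)(n-2)/6 triples per conformation, which is one reason these fingerprints are long and sparse.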
How do we decide which descriptors and similarity or distance measures are “best”?
• go back to the “similar property” principle: “structurally similar molecules are expected to exhibit similar properties or biological activities”
• we can do some experiments using:
  • various different sets of descriptors
  • a dataset of compounds with known biological activity or a measured physico-chemical property value
The problem may be that the 3D descriptors we have (3-point pharmacophores with binned distances) are not good enough:
• there may be “spurious accuracy” in the detailed distances involved
• conformational flexibility may be causing problems (as one distance gets larger, another gets smaller)
• the molecule may change conformation during binding
• some improved success has been found by identifying “projection points” for hydrogen-bond donors/acceptors (i.e. where they're pointing to, not where they are)
2D descriptors provide “bounds” on possible 3D conformations:
• “2½D” descriptors (including some stereochemical information) may be useful
The “superiority” of 2D descriptors in some studies may be an artifact of the datasets used:
• datasets may have large numbers of close analogues
• these will have high 2D similarity, as well as correlated activity
Structure Spectra Correlation
Spectroscopy:
Spectroscopy is the study of the interaction of electromagnetic radiation with matter.
How does it help?
Sophisticated instruments, viz. UV, IR, NMR, mass spectrometry and many more, reveal the actual chemical features of a compound.
Thank you