cheminformatics: an overview

CHEMINFORMATICS Subhasis Banerjee (Asst. Professor) (Gupta College of Technological Sciences, West Bengal. India)

Upload: subhasis-banerjee

Post on 21-Mar-2017


Category:

Science



TRANSCRIPT

Page 1: Cheminformatics: An overview

CHEMINFORMATICS

Subhasis Banerjee (Asst. Professor)(Gupta College of Technological Sciences, West Bengal. India)

Page 2: Cheminformatics: An overview

Contents

•An overview on CHEMINFORMATICS

•Searching Chemical Structures

•Structure-spectra correlation

Page 3: Cheminformatics: An overview

Chemoinformatics encompasses the design, creation, organisation, management, retrieval, analysis, dissemination, visualization and use of chemical information

The mixing of information resources to transform data into information and information into knowledge, for the intended purpose of making decisions faster in the arena of drug lead identification and optimization

Page 4: Cheminformatics: An overview

Concepts and techniques:

Structure search (Various Databases)

Molecular similarity

Substructure search

Quantitative structure-activity relationships (QSAR)

3D structure generation from a 2D structure

Conformer generation

Algorithms

Page 5: Cheminformatics: An overview

How can a molecular structure be stored on a computer?

• Common name: aspirin
• IUPAC name: 2-acetoxybenzoic acid
• Formula: C9H8O4
• As an image (PNG, GIF, etc.)
• CAS number: 50-78-2
• File format: ChemDraw file, MOL file, etc.
• SMILES string: O=C(Oc1ccccc1C(=O)O)C
• Binary fingerprint: 10000100000001100000100100000001

Page 6: Cheminformatics: An overview

SMILES: Simplified Molecular Input Line Entry System
Weininger, J Chem Inf Comput Sci, 1988, 28, 31

Examples:
• CC represents CH3CH3 (ethane)
• CC(=O)O represents CH3COOH (acetic acid)

Basic guidelines:
• Hydrogens are implicit
• Parentheses indicate branches
• Each atom is connected to the preceding atom to its left (excluding branches in between)
• Single bonds are implicit; = denotes a double bond and # a triple bond

What is C(C)(C)(C)C?
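The guidelines can be tried out with a toy parser. The sketch below is not a real SMILES implementation: it assumes an acyclic molecule written only with C, N and O atoms, branches, and the = and # bond symbols, and it assumes the usual valences of 4, 3 and 2 to fill in the implicit hydrogens. The function name `smiles_formula` is invented for this illustration.

```python
from collections import Counter

# Usual valences for the few elements this toy parser understands (assumption).
VALENCE = {"C": 4, "N": 3, "O": 2}
BOND_ORDER = {"=": 2, "#": 3}


def smiles_formula(smiles: str) -> str:
    """Molecular formula for an acyclic SMILES string made of C, N and O atoms."""
    elements = []    # element symbol of each atom, in order of appearance
    used = []        # total bond order already used by each atom
    stack = []       # atoms at which a branch was opened
    previous = None  # index of the atom the next atom is bonded to
    order = 1        # order of the pending bond (single bonds are implicit)
    for ch in smiles:
        if ch in VALENCE:
            elements.append(ch)
            used.append(0)
            current = len(elements) - 1
            if previous is not None:
                used[previous] += order
                used[current] += order
            previous, order = current, 1
        elif ch in BOND_ORDER:
            order = BOND_ORDER[ch]
        elif ch == "(":
            stack.append(previous)   # remember where the branch starts
        elif ch == ")":
            previous = stack.pop()   # return to the branching atom
        else:
            raise ValueError(f"unsupported SMILES character: {ch!r}")
    counts = Counter(elements)
    # Implicit hydrogens fill whatever valence the explicit bonds leave free.
    counts["H"] = sum(VALENCE[e] - u for e, u in zip(elements, used))
    return "".join(f"{e}{counts[e] if counts[e] > 1 else ''}" for e in "CHNO" if counts[e])


for smiles in ("CC", "CC(=O)O", "C(C)(C)(C)C"):
    print(smiles, smiles_formula(smiles))
```

For the two examples on the slide this gives C2H6 and C2H4O2, and the branched string C(C)(C)(C)C comes out as C5H12.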

Page 7: Cheminformatics: An overview

InChI: International Chemical Identifier

• Line notation developed by NIST (National Institute of Standards and Technology) and IUPAC
• Goal: an index for uniquely identifying a molecule

Aspirin: InChI=1/C9H8O4/c1-6(10)13-8-5-3-2-4-7(8)9(11)12/h2-5H,1H3,(H,11,12)/f/h11H

Features
• Derived from the structure (unlike a CAS number)
• One-to-one relationship between InChI and structure

Notes
• Not human-readable or human-writable

Page 8: Cheminformatics: An overview

• World Drug Index (WDI)
• Available Chemicals Directory (ACD)
• National Cancer Institute (NCI)
• Maybridge (3D-formatted database, from the company of that name)
• Beilstein (the largest database for organic chemistry, covering the literature since 1771; now maintained by Elsevier)
• ZINC (3D-formatted database from UCSF, since 2005)
• World of Molecular Bioactivity (WOMBAT)
• Toxic Substances Control Act (TSCA)
• MedChem
• CSD
• PDB


Page 18: Cheminformatics: An overview

Cambridge Structural Database (CSD)

• X-ray crystal structures of more than 250,000 compounds (organic and organometallic)
• Established in 1965
• Textual queries
• Structural queries
• Specific 3D constraints (conformation or distance variables)

Page 19: Cheminformatics: An overview

Protein Data Bank (PDB)

More than 25,000 X-ray and NMR structures of proteins and protein-ligand complexes

Some nucleic acid and carbohydrate structures

Established in 1971 at Brookhaven National Laboratory; now run by a consortium

Retrieval by textual queries or in some interfaces by amino acid sequences

Page 20: Cheminformatics: An overview
Page 21: Cheminformatics: An overview

Full structure search: the query is a complete molecule, and hits are identical to it

Substructure search: the query is a pattern of atoms and bonds, and hits contain it

Superstructure search: the query is a complete molecule, and hits are contained within it

Page 22: Cheminformatics: An overview

Many databases contain millions of structures, so search speed is important

The simplest approach uses a canonical representation for query and database structures (e.g. canonical SMILES)
• could sort the database SMILES into alphanumerical order
• search the sorted list for a match with the query

“Hash table” lookup can improve search speed
• calculate a hash code (“idiot number” in a predefined range) from the SMILES for each database structure
• this is the address (disk file or memory) at which the full representation is stored
• only SMILES which have the same hash code need to be compared
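The hash-table idea can be sketched in a few lines. This is an illustration only: the strings are assumed to be canonical already, `zlib.crc32` stands in for whatever hash function a real system uses, and the table size and registry numbers are made up.

```python
import zlib

TABLE_SIZE = 8  # hypothetical; tiny on purpose, so some buckets hold several entries


def hash_code(canonical_smiles: str) -> int:
    """Map a canonical SMILES string to a number in the predefined range."""
    return zlib.crc32(canonical_smiles.encode("utf-8")) % TABLE_SIZE


def build_table(database):
    """database: iterable of (registry_id, canonical_smiles) pairs."""
    table = [[] for _ in range(TABLE_SIZE)]
    for registry_id, smiles in database:
        table[hash_code(smiles)].append((registry_id, smiles))
    return table


def full_structure_search(table, query_smiles):
    """Compare the query only with entries stored under the same hash code."""
    bucket = table[hash_code(query_smiles)]
    return [registry_id for registry_id, smiles in bucket if smiles == query_smiles]


database = [(1, "CC"), (2, "CC(=O)O"), (3, "CC(C)(C)C"), (4, "CCO")]
table = build_table(database)
print(full_structure_search(table, "CC(=O)O"))  # [2]
print(full_structure_search(table, "CCN"))      # []
```

The string comparison inside the bucket is still needed because different structures can share a hash code.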

Page 23: Cheminformatics: An overview

New compounds need to be added regularly
• this used to be done by chemical information specialists
• it is now frequently done directly by bench chemists

The registration system must
• check the consistency of input data, e.g. compare molecular formula with structure
• check that the compound is really new (there are different ways of handling tautomers, salts, stereoisomers, etc.)
• assign a registry number
• add supplementary data (melting point, etc.)
• make the data immediately available for search

Page 24: Cheminformatics: An overview

In graph theory terms, when two full structures match, their graphs are said to be isomorphic
• each node N1 in G1 must be mapped to a node N2 in G2
• neighbours of N1 must map to neighbours of N2
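The two mapping conditions can be checked by brute force on very small graphs. The sketch below tries every node mapping, which grows factorially with the number of atoms, so it illustrates the definition and is not a practical search algorithm. The representation (a list of atom labels plus a list of bonds as index pairs, with no duplicate bonds) is an assumption made for this example.

```python
from itertools import permutations


def is_isomorphic(labels1, bonds1, labels2, bonds2):
    """Brute-force isomorphism test: try every mapping of G1 nodes onto G2 nodes."""
    n = len(labels1)
    if n != len(labels2) or len(bonds1) != len(bonds2):
        return False
    if sorted(labels1) != sorted(labels2):
        return False
    edges2 = {frozenset(bond) for bond in bonds2}
    for mapping in permutations(range(n)):
        # Node labels (atom types) must agree under the mapping.
        if any(labels1[i] != labels2[mapping[i]] for i in range(n)):
            continue
        # Neighbours must map to neighbours: every G1 bond lands on a G2 bond.
        if all(frozenset((mapping[a], mapping[b])) in edges2 for a, b in bonds1):
            return True
    return False


# Ethanol numbered from either end, and dimethyl ether for contrast.
ethanol_a = (["C", "C", "O"], [(0, 1), (1, 2)])
ethanol_b = (["O", "C", "C"], [(0, 1), (1, 2)])
ether = (["C", "O", "C"], [(0, 1), (1, 2)])
print(is_isomorphic(*ethanol_a, *ethanol_b))  # True
print(is_isomorphic(*ethanol_a, *ether))      # False
```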

Page 25: Cheminformatics: An overview

So far we’ve considered matching one query substructure against one database full structure
• each structure from the database needs to be compared against the query in turn
• many will fail because they don’t contain the query substructure
• “screening” allows many of these to be eliminated before we get to this stage

Page 26: Cheminformatics: An overview

The fragments present in a structure can be represented as a sequence of 0s and 1s

00010100010101000101010011110100

• 0 means the fragment is not present in the structure
• 1 means the fragment is present in the structure (perhaps multiple times)

Each 0 or 1 can be represented as a single bit in the computer (a “bitstring”)

For chemical structures these are often called structure “fingerprints”

Page 27: Cheminformatics: An overview

Build a fingerprint for the query substructure

Only those database structures that contain all the fragments in the query can possibly match the query

Query:       00000100010101000001010011010100
DB struct 1: 00010100010101000101010011110100   MATCH
DB struct 2: 00000000100101001001000011100000   NO MATCH

Comparing fingerprint bitstrings is very fast (logical AND operation)

Only those structures that pass the screening stage need to be considered as candidates for atom-by-atom isomorphism search
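The screening test amounts to one AND and one comparison per database structure. A minimal sketch using the three bitstrings from the slide, held as Python integers (the function name is invented here):

```python
def passes_screen(query_fp: int, structure_fp: int) -> bool:
    """True if every bit set in the query is also set in the structure."""
    return (query_fp & structure_fp) == query_fp


query = int("00000100010101000001010011010100", 2)
database = {
    "DB struct 1": int("00010100010101000101010011110100", 2),
    "DB struct 2": int("00000000100101001001000011100000", 2),
}

# Only the survivors go on to the atom-by-atom search.
candidates = [name for name, fp in database.items() if passes_screen(query, fp)]
print(candidates)  # ['DB struct 1']
```

Passing the screen is necessary but not sufficient: a structure can contain all the query's fragments without containing the query itself.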

Page 28: Cheminformatics: An overview

Ideally we want to eliminate as many structures as possible at the screening stage
• 99% screenout or more would be good

Fingerprint construction can help in this
• frequency distributions of fragments in a large database are very “skewed”
• a few fragments occur in almost all compounds, and will therefore give little or no screenout
• many fragments occur in very few compounds, so a very long fingerprint (lots of fragments) is needed to ensure that some of them occur in the query

Page 29: Cheminformatics: An overview

Do the time-consuming work in advance

Full structure search provides an example of this
• canonicalisation can be a slow process (it is at least as hard as graph isomorphism, for which no polynomial-time algorithm is known)
• but it can be done as a pre-processing step on the file, independently of the query
• then store the canonical representations
• rapid matches can then be made against a canonicalised query structure
• this is faster than using a graph isomorphism algorithm on non-canonical representations

Page 30: Cheminformatics: An overview

Similar principles are used in some substructure search systems

A tree structure is built, classifying all the atoms found in all the structures in the database
• first level based on atom type
• second level based on number of connections
• third level based on type of first neighbour
• fourth level based on type of second neighbour, etc.
• lower levels based on classifications applied to neighbours (relaxation)
• the bottom of the tree lists the structures that contain this class of atom

Page 31: Cheminformatics: An overview

[Figure: classification tree for atoms, with levels labelled central atom type, number of connections, first neighbour, second neighbour and third neighbour; example fragments include C-C-Br, C-C-F, C-C, C-O and C-Br]

Page 32: Cheminformatics: An overview

Search can be done by tracing the tree, looking for atom classes found in the query
• combine the lists of structures found at the bottom

A backtracking atom-by-atom search may be needed to check the hits found

The best-known example is Beilstein’s CrossFire

The main problem is updating the trees when new structures are added to the database

Page 33: Cheminformatics: An overview

Queries for substructure search systems may be more complicated than simple subgraphs

Different systems provide different capabilities
• variable atom and bond types
• specification of allowed substitution

[Figure: example query with a variable atom type [F,Cl,Br,I], a single-or-double bond, and carbons annotated C(s0) and C(s1); labels: no further substituents, one or more substituents]

Page 34: Cheminformatics: An overview

Daylight uses an extension of SMILES to describe structure queries (SMARTS)

Various properties can be attached to each atom
• [CX3] carbon with 3 connections
• [Nr5] nitrogen in a ring of size 5

Properties can be combined with logical operators
• ! (NOT)
• & (AND, high precedence)
• , (OR)
• ; (AND, low precedence)

Page 35: Cheminformatics: An overview

Complex patterns can be specified this way:
• [F,Cl,Br,I] any of the halogen atoms
• [!C;!R0] heteroatom in a ring

$(smarts_string) can also be used as an atom property
• this is called recursive SMARTS
• e.g. $(NC=*): nitrogen single-bond carbon double-bond any-atom (e.g. an amide)

Page 36: Cheminformatics: An overview

Recursive SMARTS can be used to describe very complex patterns, e.g. a primary or secondary amine, but not an amide

[N&X3;H2,H1;!$(NC=*)]

• N&X3: nitrogen AND 3 connections
• ; AND
• H2,H1: 2 attached hydrogens OR 1 attached hydrogen
• ; AND
• !$(NC=*): NOT a nitrogen connected to a carbon with a double bond to any atom
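How the four operators nest can be shown with a toy evaluator for a single bracket expression. This is not a SMARTS implementation: atoms are plain dictionaries invented for the example, only element symbols and the X (connections) and H (attached hydrogens) primitives are understood, and a recursive $(...) term is looked up in a precomputed set because a real matcher would be needed to evaluate it.

```python
import re


def matches_primitive(atom: dict, primitive: str) -> bool:
    """Evaluate one primitive, optionally negated with '!', against an atom record."""
    if primitive.startswith("!"):
        return not matches_primitive(atom, primitive[1:])
    if primitive.startswith("$"):
        # Recursive SMARTS: the answer is assumed to be precomputed (toy shortcut).
        return primitive in atom["environments"]
    prop = re.fullmatch(r"([XH])(\d+)", primitive)
    if prop:
        key = {"X": "connections", "H": "hydrogens"}[prop.group(1)]
        return atom[key] == int(prop.group(2))
    return atom["element"] == primitive


def matches_expression(atom: dict, expression: str) -> bool:
    """';' is low-precedence AND, ',' is OR, '&' is high-precedence AND."""
    return all(
        any(
            all(matches_primitive(atom, p) for p in conjunction.split("&"))
            for conjunction in alternatives.split(",")
        )
        for alternatives in expression.split(";")
    )


PATTERN = "N&X3;H2,H1;!$(NC=*)"  # the bracket contents from the slide
primary_amine = {"element": "N", "connections": 3, "hydrogens": 2, "environments": set()}
amide = {"element": "N", "connections": 3, "hydrogens": 2, "environments": {"$(NC=*)"}}
tertiary_amine = {"element": "N", "connections": 3, "hydrogens": 0, "environments": set()}

for name, atom in [("primary amine", primary_amine), ("amide", amide),
                   ("tertiary amine", tertiary_amine)]:
    print(name, matches_expression(atom, PATTERN))
```

Splitting on ";" first and "&" last is what gives ";" the lowest and "&" the highest precedence of the three binary operators.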

Page 37: Cheminformatics: An overview

Searching Markush structures in patents
• nature and origin of Markush structures
• fragment codes
• topological systems (MARPAT, Markush DARC)

Reaction searching
• atom-atom mapping

Maximal Common Substructure search
• what is the largest substructure common to two molecules?

Page 38: Cheminformatics: An overview

Data mining for conformational properties and intermolecular interactions (CSD & PDB)

Data mining for information about intermolecular interactions (CSD & PDB)

Further understanding of the nature of protein structure and its relationship to amino acid sequence (PDB)

Homology modeling (comparative modeling) (PDB)

Page 39: Cheminformatics: An overview

[Figure: 3D query on O, O and N atoms with distance constraints a = 8.62 ± 0.58 Å, b = 7.08 ± 0.56 Å and c = 3.35 ± 0.65 Å, followed by several structure diagrams]

Page 40: Cheminformatics: An overview

Instead of searching for all molecules containing a given substructure, we search for molecules “similar” to a given target molecule

Similar property principle: “Structurally similar molecules are expected to exhibit similar properties or biological activities”

Mark Johnson and Gerry Maggiora (Eds.), Concepts and Applications of Molecular Similarity. Wiley, New York, 1990

Page 41: Cheminformatics: An overview

“Similarity is in the eye of the beholder”

Similarity can be measured in many different ways
• equivalence classes: can say that two molecules are similar, or that they are different
• numerical measures: can say that two molecules have a similarity of, e.g., 0.85
• similarity coefficients usually have values between 0.0 (totally different) and 1.0 (identical)
• distance measures are the “opposite” of similarity (0.0 = identical; may have no maximum, or may be normalised to fix a maximum limit)

Page 42: Cheminformatics: An overview

Two different molecules with the same graph if node and edge labels are ignored

[Figure: two structure diagrams; surviving atom labels are N, OH, O, NCH3 and CH3]

Page 43: Cheminformatics: An overview

Normally calculate some numerical measure of similarity between molecules

The query structure is a “target” molecule

Database structures can be ranked in decreasing order of similarity to the target
• find all molecules with > threshold similarity to the target
• find the N most similar molecules to the target

No particular substructure is required in the retrieved molecules, but they will have structural features in common with the target

Page 44: Cheminformatics: An overview

Similarity measures are most commonly calculated from structure fingerprints
• count the bits that are “on” in both molecules
• count the bits that are “on” in each molecule separately

struct A: 00010100010101000101010011110100   13 bits on (A)
struct B: 00000000100101001001000011100000    8 bits on (B)
A AND B:  00000000000101000001000011100000    6 bits on (C)

A similarity coefficient can be calculated from A, B and C

Page 45: Cheminformatics: An overview

Tanimoto coefficient

similarity = C / (A + B – C) = 6 / (13 + 8 – 6) = 0.4

• the number of bits set in both molecules divided by the number of bits set in either molecule
• the Tanimoto coefficient is the most commonly used similarity coefficient in chemical informatics
• also called the Jaccard coefficient

[Figure: diagram of bit sets A and B and their overlap C]
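The worked Tanimoto value can be reproduced from the two example fingerprints. A minimal sketch with the bitstrings held as Python integers; returning 0.0 when neither fingerprint has any bit set is a convention chosen here, not something the slides specify.

```python
def bits_on(fingerprint: int) -> int:
    """Number of bits set to 1."""
    return bin(fingerprint).count("1")


def tanimoto(fp_a: int, fp_b: int) -> float:
    a = bits_on(fp_a)          # bits on in A
    b = bits_on(fp_b)          # bits on in B
    c = bits_on(fp_a & fp_b)   # bits on in both
    if a + b - c == 0:
        return 0.0             # both fingerprints empty (convention assumed here)
    return c / (a + b - c)


struct_a = int("00010100010101000101010011110100", 2)
struct_b = int("00000000100101001001000011100000", 2)
print(bits_on(struct_a), bits_on(struct_b), bits_on(struct_a & struct_b))  # 13 8 6
print(tanimoto(struct_a, struct_b))
```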

Page 46: Cheminformatics: An overview

Dice coefficient

similarity = 2C / (A + B) = 12 / (13 + 8) = 0.57

• does not give the same values as the Tanimoto coefficient, but will rank molecules in the same order of similarity to a target, i.e. it is “monotonic” with the Tanimoto coefficient
• also called the Czekanowski or Sørensen coefficient

[Figure: diagram of bit sets A and B and their overlap C]
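Why the two coefficients give the same ranking can be seen by writing one in terms of the other. In the short derivation below, T stands for the Tanimoto coefficient C / (A + B – C) and S_Dice for 2C / (A + B); the symbol names are chosen here for convenience.

```latex
\begin{align*}
T &= \frac{C}{A + B - C}, \qquad S_{\text{Dice}} = \frac{2C}{A + B} \\
1 + T &= \frac{A + B}{A + B - C} \\
\frac{2T}{1 + T} &= \frac{2C}{A + B - C} \cdot \frac{A + B - C}{A + B}
                  = \frac{2C}{A + B} = S_{\text{Dice}} \\
\frac{dS_{\text{Dice}}}{dT} &= \frac{2}{(1 + T)^2} > 0
\end{align*}
```

Because the derivative is always positive, a larger Tanimoto value always means a larger Dice value. With the example numbers, T = 0.4 gives 0.8 / 1.4 ≈ 0.57.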

Page 47: Cheminformatics: An overview

Cosine coefficient

similarity = C / √(A × B) = 6 / √(13 × 8) = 0.588

• not monotonic with the Tanimoto and Dice coefficients, but highly correlated with them
• also called the Ochiai coefficient

[Figure: diagram of bit sets A and B and their overlap C]

Page 48: Cheminformatics: An overview

The three coefficients discussed so far all ignore bits that are off in both molecules
• is common absence of features evidence of similarity between them?
• are a camel, a horse and a nematode similar because they all lack wings?
• are a bat, a heron and a dragonfly similar because they all have wings?

Page 49: Cheminformatics: An overview

Are these molecules similar because they all lack heteroatoms?

[Figure: hydrocarbon ring structures drawn with CH and CH2 groups]

Page 50: Cheminformatics: An overview

A similarity coefficient that takes into account the fingerprint bits that are off in both molecules (D)

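Simple matching coefficient

similarity = (C + D) / N = (6 + 17) / 32 = 0.719

• N is the length of the fingerprint
• N = A + B – C + D

[Figure: diagram of bit sets A and B, their overlap C, and the bits D that are off in both]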
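The Dice, cosine and simple matching values can be checked the same way from the counts A, B, C and D. In this sketch the fingerprint length N has to be passed in, because leading zeros are lost once a bitstring is stored as an integer; the function names are invented for the example.

```python
from math import sqrt


def bit_counts(fp_a: int, fp_b: int, length: int):
    """Return (A, B, C, D) for two fingerprints of the given length N."""
    a = bin(fp_a).count("1")
    b = bin(fp_b).count("1")
    c = bin(fp_a & fp_b).count("1")
    d = length - (a + b - c)   # bits off in both, from N = A + B - C + D
    return a, b, c, d


def dice(a, b, c):
    return 2 * c / (a + b)


def cosine(a, b, c):
    return c / sqrt(a * b)


def simple_matching(c, d, length):
    return (c + d) / length


bits_a = "00010100010101000101010011110100"
bits_b = "00000000100101001001000011100000"
n = len(bits_a)
a, b, c, d = bit_counts(int(bits_a, 2), int(bits_b, 2), n)
print(a, b, c, d)  # 13 8 6 17
print(dice(a, b, c), cosine(a, b, c), simple_matching(c, d, n))
```

The formulas assume that each fingerprint has at least one bit set; an all-zero fingerprint would need a separate convention for Dice and cosine.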

Page 51: Cheminformatics: An overview

This provides a means of substructure similarity search
• also possible with maximal common subgraphs: A and B could be the number of atoms in each molecule, and C could be the number of atoms in their maximal common substructure (MCS)
• fingerprint-based similarity is generally faster than identifying the MCS, but the common features (fragments) will be smaller

Page 52: Cheminformatics: An overview

Many different fragment types have been used for generating fingerprints for use in similarity searching

Atom sequence
• linear path of atoms and bonds through the molecule
• may generate only paths of certain lengths

Augmented atom
• an atom and its immediate neighbours

Page 53: Cheminformatics: An overview

Ring composition
• atom/bond sequence around a ring
• question of which rings to choose

Ring fusion patterns
• sequence of ring connectivities around a ring
• for each atom, specify the number of ring bonds it has

Atom pairs
• pair of atoms in the same molecule, with the number of bonds in the shortest path between them
• additional differentiation between atom types: number of attached hydrogens / pi-bonds

Topological torsions
• connected sequence of 4 atom types
• atom types as described for atom pairs
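Of the fragment types above, atom pairs are simple to compute from a connection table. The sketch below uses a breadth-first search to find shortest path lengths in bonds. It uses bare element symbols as atom types, so the extra differentiation by attached hydrogens and pi-bonds is left out, and the input format (a list of symbols plus a list of bonds as index pairs) is an assumption made for the example.

```python
from collections import deque


def shortest_path_lengths(adjacency, start):
    """Breadth-first search: number of bonds from start to every reachable atom."""
    distance = {start: 0}
    queue = deque([start])
    while queue:
        atom = queue.popleft()
        for neighbour in adjacency[atom]:
            if neighbour not in distance:
                distance[neighbour] = distance[atom] + 1
                queue.append(neighbour)
    return distance


def atom_pairs(elements, bonds):
    """Set of (atom type, bonds in shortest path, atom type) fragments."""
    adjacency = {i: [] for i in range(len(elements))}
    for a, b in bonds:
        adjacency[a].append(b)
        adjacency[b].append(a)
    pairs = set()
    for i in range(len(elements)):
        for j, d in shortest_path_lengths(adjacency, i).items():
            if j > i:  # count each pair of atoms once
                first, second = sorted((elements[i], elements[j]))
                pairs.add((first, d, second))
    return pairs


# Acetic acid, CC(=O)O, heavy atoms only.
acetic_acid = atom_pairs(["C", "C", "O", "O"], [(0, 1), (1, 2), (1, 3)])
print(sorted(acetic_acid))
```

Each distinct fragment would then be mapped to a fingerprint bit, for example through a hash function.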

Page 54: Cheminformatics: An overview

Sometimes specific fragments (with detailed description of atom and bond types) are too specific to be of much use in fingerprints
• very low frequency
• very sparse fingerprints

Atom and bond types can be generalised
• any ring bond
• any halogen (F, Cl, Br, I)
• any chalcogen (O, S, Se, Te)

This gives fragments with higher frequency

Page 55: Cheminformatics: An overview

Fragments can be used to describe the 3D structure of a molecule too
• usually involve interatomic distances and/or bond angles
• because distance values are continuous variables, they are “binned”
• each bin represents a range of distances, e.g. a distance of 3.000 – 3.999 Å
• each bin corresponds to a fingerprint bit
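Binning can be sketched as integer division by the bin width. The 1 Å width matches the 3.000 – 3.999 Å example above; the number of bins and the choice to let the last bin absorb all longer distances are assumptions made for this illustration.

```python
def distance_bin(distance: float, bin_width: float = 1.0, n_bins: int = 16) -> int:
    """Index of the bin (and fingerprint bit) for a distance in Å."""
    if distance < 0:
        raise ValueError("distance must be non-negative")
    # The last bin catches every distance beyond the covered range.
    return min(int(distance // bin_width), n_bins - 1)


def distance_fingerprint(distances, bin_width: float = 1.0, n_bins: int = 16) -> int:
    """Set one bit per occupied distance bin."""
    fingerprint = 0
    for d in distances:
        fingerprint |= 1 << distance_bin(d, bin_width, n_bins)
    return fingerprint


print(distance_bin(3.0), distance_bin(3.999), distance_bin(4.0))  # 3 3 4
print(format(distance_fingerprint([3.2, 3.9, 5.1]), "016b"))
```

Two distances that fall in the same bin set the same bit, which is how small differences between conformations are smoothed over.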

Page 56: Cheminformatics: An overview

A popular 3D descriptor is the 3-point pharmacophore

The molecule is analysed to identify “pharmacophoric points”: points in the molecule likely to be involved in binding to a receptor site
• positive charges
• negative charges
• hydrogen-bond donors (e.g. –OH, –NH2)
• hydrogen-bond acceptors (e.g. =O)
• aromatic groups
• hydrophobic groups

Pharmacophoric points do not necessarily coincide with the positions of individual atoms

Page 57: Cheminformatics: An overview

Each fragment consists of 3 pharmacophoric points
• the distances between each pair of these points are binned and used to set fingerprint bits

4-point pharmacophore fragments are also used

Different people have used slightly different definitions of pharmacophoric points

[Figure: 3-point pharmacophore with points labelled HBD, Ary and HBA and inter-point distances of 2.8 Å, 3.6 Å and 1.2 Å]
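One way to turn pharmacophoric points into 3-point fragments is sketched below. The point types and coordinates are hypothetical (they are not taken from the figure), the 1 Å bin width is an assumption, and the sorted tuple used as a key is only one of several possible canonical forms. Each key would then be mapped to a fingerprint bit.

```python
from itertools import combinations
from math import dist  # Python 3.8+


def three_point_keys(points, bin_width: float = 1.0):
    """points: list of (feature_type, (x, y, z)). Returns one key per triplet."""
    keys = set()
    for triplet in combinations(points, 3):
        edges = []
        for (type_a, xyz_a), (type_b, xyz_b) in combinations(triplet, 2):
            pair = tuple(sorted((type_a, type_b)))
            binned = int(dist(xyz_a, xyz_b) // bin_width)  # binned distance
            edges.append((pair, binned))
        # Sorting the three edges makes the key independent of point order.
        keys.add(tuple(sorted(edges)))
    return keys


# Hypothetical pharmacophoric points; coordinates in Å.
points = [
    ("HBD", (0.0, 0.0, 0.0)),
    ("HBA", (2.5, 0.0, 0.0)),
    ("Ary", (0.0, 3.5, 0.0)),
]
print(three_point_keys(points))
```

Because only inter-point distances enter the key, the result does not change when the whole molecule is translated or rotated, up to rounding at bin boundaries.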

Page 58: Cheminformatics: An overview

How do we decide which descriptors and similarity or distance measures are “best”?

Go back to the “similar property” principle: “structurally similar molecules are expected to exhibit similar properties or biological activities”

We can do some experiments using
• various different sets of descriptors
• a dataset of compounds with known biological activity or a measured physico-chemical property value

Page 59: Cheminformatics: An overview

The problem may be that the 3D descriptors we have (3-point pharmacophores with binned distances) are not good enough
• there may be “spurious accuracy” in the detailed distances involved
• conformational flexibility may be causing problems (as one distance gets larger, another gets smaller)
• the molecule may change conformation during binding
• some improved success has been found by identifying “projection points” for hydrogen bond donors/acceptors (i.e. where they’re pointing to, not where they are)

Page 60: Cheminformatics: An overview

2D descriptors provide “bounds” on possible 3D conformations

“2½D” descriptors (including some stereochemical information) may be useful

“Superiority” of 2D descriptors in some studies may be an artifact of the datasets used
• datasets may have large numbers of close analogues
• these will have high 2D similarity, as well as correlated activity

Page 61: Cheminformatics: An overview

Structure Spectra Correlation

Page 62: Cheminformatics: An overview

Spectroscopy:

It is all about the interaction of electromagnetic radiation with matter

How does it help?

With sophisticated instruments (UV, IR, NMR, mass spectrometry and many more), it reveals the true chemical features of a compound

Page 63: Cheminformatics: An overview
Page 64: Cheminformatics: An overview

Thank you