TREC 2010 Chemical IR Workshop

Rajarshi Guha NIH Chemical Genomics Center

November 2010 NIST, Gaithersburg

Acknowledgements

•  John Barnard
•  Joseph Scheiber
•  Daniel Lowe
•  David Wild
•  Nina Jeliazkova

Outline

•  Chemical structure representation(s)
•  Processing chemistry-related documents
•  Structure retrieval
•  Chemical information toolkits

Chemical structure representation

Based on material from I571, David Wild, Indiana University

What to Include?

•  The representation depends on what you want to include (and defines what you can include)
    – Atoms
    – Connectivity
    – Stereochemistry
    – Charges/Isotopes
    – 3D configuration

C8H9NO3

Visual Representations

•  The 2D structure diagram is the lingua franca of bench chemists
•  Display is supported by nearly every cheminformatics system
•  Summarizes
    – Connectivity
    – Stereochemistry
    – Charge/isotope
•  We can use 2D representations as the starting point for many cheminformatics tasks

Chemical Names

•  Most papers and documents referring to chemistry will name molecules

•  Two forms of chemical names
    – Trivial (short, pronounceable, uninformative)
    – Systematic (long, unpronounceable, can usually get structure back from the name)

Tyrosine

or

β-(p-hydroxyphenyl)alanine
or α-amino-p-hydroxyhydrocinnamic acid
or 2-amino-3-(4-hydroxyphenyl)propanoic acid

Chemical Numbers

•  Also termed registry numbers
•  Arbitrary numbers assigned to one (or more) structures
•  Sometimes a hierarchy might be present
    – Parent compound
    – Stereoisomer
    – Salt …

•  CAS numbers / PubChem SID & CID / InChI Key

•  Only way to get back structure is lookup

Structures are Graphs (mostly)

•  For many cases, a chemical structure can be considered as a graph
    – Atoms are nodes, bonds are edges
    – Identical graphs imply identical molecules
•  But there are limits to this representation
    – Polymers, inorganic compounds, stereochemistry
•  And chemical phenomena can create problems for a graph theoretical approach

Aromaticity & Graphs

•  These two molecules are identical
•  Yet their molecular graphs would suggest a different connectivity
•  In fact, all atoms and bonds in benzene are equivalent due to resonance
•  In this case, we would (should) perceive each C-C bond as aromatic

1D Representations

•  1D representations are linear strings
•  Generally only encode connectivity, atom and bond type
    – Wiswesser line notation (WLN)
    – Sybyl line notation (SLN)
    – SMILES
    – InChI

2D/3D Representations

•  Multi-line text formats
•  Contain connectivity, atom and bond types, 3D coordinates as well as other (possibly arbitrary) information
    – MDL MOL format
    – PDB
    – Hyperchem HIN
    – Chemical Markup Language (CML)

SMILES

•  A simple line notation, pretty much the lingua franca of cheminformatics
•  Atoms are represented by their symbols
    – Lower case indicates an aromatic atom
•  Single bonds are implicit, double bonds are “=”, triple bonds are “#” and aromatic bonds are “:”
•  Rings are indicated by “ring closure numbers”
    – In C1CC1, the two carbons marked by “1” are connected

Canonical SMILES

•  Given a structure, you can write a SMILES representation in multiple ways
    – CC(C)CC
    – CCC(C)C
•  As a result, comparing molecules or searching for molecules based on arbitrary SMILES can give misleading or wrong results
•  To avoid this we canonicalize SMILES

Canonical SMILES

•  Given two structures that have identical atoms, bonds and connectivity, their canonical SMILES will be identical
•  In general, any permutation of atom indices will not affect the canonical SMILES
•  Canonicalization is a key feature of structure registration
    – You want to be sure that you have a single, unique representation of each structure in the database
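The permutation-invariance property can be illustrated with a toy sketch. It canonicalizes a tiny molecular graph by brute force: every atom ordering is tried and the lexicographically smallest description is kept. This is not the algorithm any real toolkit uses (those rely on far more efficient schemes), and the function name `canonical_form` and the graph encoding are assumptions made for illustration.

```python
from itertools import permutations


def canonical_form(atoms, bonds):
    """Return an ordering-independent description of a small molecular graph.

    atoms: list of element symbols, indexed by atom number
    bonds: list of (atom_i, atom_j, bond_order) tuples
    Brute force over all n! atom orderings, so only usable for tiny graphs.
    """
    n = len(atoms)
    best = None
    for perm in permutations(range(n)):  # perm[old_index] = new_index
        labels = [None] * n
        for old, new in enumerate(perm):
            labels[new] = atoms[old]
        edges = sorted(
            (min(perm[a], perm[b]), max(perm[a], perm[b]), order)
            for a, b, order in bonds
        )
        key = (tuple(labels), tuple(edges))
        if best is None or key < best:
            best = key
    return best


# Isopentane written two ways: CC(C)CC and CCC(C)C (hydrogens implicit)
iso_a = (["C"] * 5, [(0, 1, 1), (1, 2, 1), (1, 3, 1), (3, 4, 1)])
iso_b = (["C"] * 5, [(0, 1, 1), (1, 2, 1), (2, 3, 1), (2, 4, 1)])
# n-Pentane, CCCCC, has the same atoms but a different connectivity
pentane = (["C"] * 5, [(0, 1, 1), (1, 2, 1), (2, 3, 1), (3, 4, 1)])

print(canonical_form(*iso_a) == canonical_form(*iso_b))    # same molecule
print(canonical_form(*iso_a) == canonical_form(*pentane))  # different molecule
```

The two atom orderings of isopentane collapse to one key, while n-pentane gets a different one, which is exactly what a registration system needs from its canonicalizer.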

Generating Canonical SMILES

•  All toolkits are capable of generating these
•  Using OpenBabel at the command line is easy
•  Can convert lots of other file formats to canonical SMILES
•  Note that different products have different canonicalization algorithms

$ echo "c1(O)ccccc1" | /usr/local/openbabel/bin/babel -ismi -ocan -c
Oc1ccccc1

$ echo "c1(ccccc1)O" | /usr/local/openbabel/bin/babel -ismi -ocan -c
Oc1ccccc1

Tautomerism

•  Simply generating a unique string representation of a molecule may not be sufficient to uniquely identify that molecule

•  Tautomerism is a reaction involving the movement of a H, resulting in a change of bond order

Tautomerism

•  Due to the rapid equilibrium the molecule exists in both forms – which one do we store?
    – Consider the most stable tautomer
    – Just go with a ‘canonical’ tautomer
•  In general, you need to generate the canonical SMILES for a canonical tautomer
•  Using InChI, you can choose whether to consider tautomers or to ignore tautomer information

Tautomerism & InChI

http://www.chemspider.com/blog/does-inchi-account-for-tautomers.html

With mobile-H perception on

With mobile-H perception off

Markush Structures

•  Compact representation of a set or class of specific compounds with common structural features
•  Used in
    – chemical patents
    – query structures in substructure search systems
    – Quantitative Structure-Activity Relationship (QSAR) analysis
        •  class of related compounds with activity data
    – combinatorial libraries
        •  rapid synthesis of large numbers of related compounds
    – legislation (controlled drugs, chemical weapons)

Markush Structures

[Figure: structure diagram with R-groups R1 and R2]

Markush Structures

•  S-variation (substituent variation)
    – List of alternative values for an R-group
•  P-variation (position variation)
    – Variable point of attachment
•  F-variation (frequency variation)
    – Multiple occurrence of groups
•  H-variation (homology variation)
    – Generically described group (e.g., alkyl)
    – A (possibly) infinite set of S-variations

Markush Structures

•  S-variation – R1 is methyl or ethyl
•  H-variation – R2 is alkyl
•  P-variation – R3 is amino
•  F-variation – 3 (but more generally can be any number, say m)

Markush Structures

•  Can be considered as a formal “grammar” for generating valid molecules (“sentences”)
•  Enumeration of coverage is usually impractical and often impossible (infinite sets)
•  Appropriate algorithms for handling take advantage of the Markush representation
    – Avoid enumeration (especially infinite sets)
    – Compare finite grammars rather than infinite sets of valid sentences
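To see why enumeration gets out of hand, here is a minimal sketch that expands a made-up Markush structure written as a SMILES template. The scaffold, the R-group lists and the function name `enumerate_markush` are all hypothetical, and the H-variation "alkyl" is cut down to straight chains of one to six carbons so that the set is finite at all.

```python
from itertools import product

# Hypothetical Markush structure: a para-disubstituted benzene scaffold
scaffold = "{R1}c1ccc({R2})cc1"
r_groups = {
    "R1": ["C", "CC"],                     # S-variation: methyl or ethyl
    "R2": ["C" * n for n in range(1, 7)],  # finite slice of "alkyl": C1 to C6 chains
}


def enumerate_markush(scaffold, r_groups):
    """Yield one SMILES string per combination of R-group values."""
    names = sorted(r_groups)
    for choice in product(*(r_groups[name] for name in names)):
        yield scaffold.format(**dict(zip(names, choice)))


structures = list(enumerate_markush(scaffold, r_groups))
print(len(structures))  # 2 * 6 = 12 specific compounds
print(structures[:3])
```

The count is the product of the list lengths, so every extra R-group multiplies it, and a true H-variation (any alkyl, branched or not) has no finite list to multiply.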

The Markush Problem

•  Representation
    – Mixture of structures and text
    – Generic expressions (viz., H-variation)
    – Vagueness (“ … where by X we mean …”)
•  Searching
    – Translation problem – specific groups (ethyl, butyl) must be matched against an expression (1-6C alkyl)
    – Segmentation problem – boundaries between R-groups and the scaffold may not coincide
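The translation problem can be sketched in its simplest form as follows: a specific group name is mapped to a carbon count and tested against a generic "n to m carbon alkyl" expression. The lookup table covers only a few straight-chain alkyl names and, like the function name `matches_generic_alkyl`, is an assumption for illustration. Real systems match structures, not names.

```python
# Carbon counts for a few straight-chain alkyl names (illustrative, not exhaustive)
ALKYL_CARBONS = {
    "methyl": 1, "ethyl": 2, "propyl": 3, "butyl": 4,
    "pentyl": 5, "hexyl": 6, "heptyl": 7, "octyl": 8,
}


def matches_generic_alkyl(group_name, min_carbons, max_carbons):
    """True if a specific alkyl group falls inside a generic carbon-count range."""
    carbons = ALKYL_CARBONS.get(group_name.lower())
    if carbons is None:  # not an alkyl group known to this toy table
        return False
    return min_carbons <= carbons <= max_carbons


for name in ("ethyl", "butyl", "octyl", "phenyl"):
    print(name, matches_generic_alkyl(name, 1, 6))
```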

Markush Translation/Segmentation

Chemical Similarity

•  When are two molecules similar?
    – They are both 6-membered rings
    – Both have carbons
    – Both have only single bonds
•  In the second case, both have a N
•  Many ways to define similarity
•  Critically dependent on representation

Why Chemical Similarity?

•  Much of medicinal chemistry is based on the similarity principle
    – Similar molecules exhibit similar activities
    – J. Med. Chem., 2002, 45, 4350-4358
•  But there are many exceptions
•  Even then, looking for similar molecules gives us a useful starting point in many cases

Fingerprint Based Similarity

•  We can represent a chemical structure using a bit string representation
•  The example is a “key” fingerprint
    – Each bit position corresponds to a specific structural feature

1 0 1 1 0 0 0 1 0

Fingerprint Based Similarity

•  The similarity between two molecules is then defined in terms of the similarity between their fingerprints
    – Tanimoto, Dice, Cosine, Tversky
•  Clearly, depending on the nature of the fingerprints, two molecules can be more or less similar
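The four coefficients named above can be written down in a few lines. This is a minimal sketch assuming each fingerprint is stored as a Python set of its "on" bit positions and is non-empty. The first example fingerprint reuses the bit pattern 1 0 1 1 0 0 0 1 0 from the key fingerprint slide; the second is made up.

```python
import math


def tanimoto(a, b):
    """c / (|a| + |b| - c), where c is the number of bits set in both."""
    c = len(a & b)
    return c / (len(a) + len(b) - c)


def dice(a, b):
    """2c / (|a| + |b|)."""
    return 2 * len(a & b) / (len(a) + len(b))


def cosine(a, b):
    """c / sqrt(|a| * |b|)."""
    return len(a & b) / math.sqrt(len(a) * len(b))


def tversky(a, b, alpha, beta):
    """c / (alpha * |a only| + beta * |b only| + c); asymmetric when alpha != beta."""
    c = len(a & b)
    return c / (alpha * len(a - b) + beta * len(b - a) + c)


fp1 = {0, 2, 3, 7}  # bit string 1 0 1 1 0 0 0 1 0
fp2 = {0, 2, 4, 7}  # made-up second fingerprint, 1 0 1 0 1 0 0 1 0

print(tanimoto(fp1, fp2), dice(fp1, fp2), cosine(fp1, fp2))
print(tversky(fp1, fp2, 1.0, 1.0))  # alpha = beta = 1 reproduces Tanimoto
print(tversky(fp1, fp2, 0.5, 0.5))  # alpha = beta = 0.5 reproduces Dice
```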

Fingerprint Similarity

•  Information loss – fragment presence and absence instead of counts
•  Bit string saturation – within a large database almost all bits are set
•  The average similarity appears to increase with the complexity of the query compound
•  Larger queries are more discriminating (flatter curve, Tanimoto values spread wider)
•  Smaller queries have a sharp peak and are unable to distinguish between molecules

[Figure: The distribution of Tanimoto values found in database searches with a range of query molecules]

Flower D., On the Properties of Bit String-Based Measures of Chemical Similarity, J. Chem. Inf. Comput. Sci., Vol. 38, No. 3, 1998

Nina Jeliazkova, http://vedina.users.sourceforge.net/publications/2005/ChemicalSimilarity.ppt

Physical Similarity

•  Keyed fingerprints are inevitably lossy
•  Hashed and circular fingerprints can be better
•  But in the end they both ignore the 3D structure of a molecule
•  Shape and surface-property based similarities can be more relevant
    – Slower to evaluate
    – OpenEye ROCS is a tool to evaluate shape similarity

Structural Similarity & 3D Property Variation


Similarity Indices

Association indices
Correlation indices

J. D. Holliday, C-Y. Hu and P. Willett (2002), Grouping of Coefficients for the Calculation of Inter-Molecular Similarity and Dissimilarity using 2D Fragment Bit-Strings, Combinatorial Chemistry & High Throughput Screening, 5, 155-166


Structure retrieval

What Are We Asking For?

Exact Matches? Similar topology?

Substructure matches? Similar properties?

What do we have?

Caveat

•  If structures are not registered properly, results of queries can be misleading, incomplete or wrong
•  At the very least
    – Remove salts
    – Generate canonical tautomers
    – Create a canonical SMILES or InChI
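As a rough illustration of the "remove salts" step, the sketch below keeps the largest dot-separated component of a SMILES string. It is a heuristic under stated assumptions, not what a registration system would ship: atoms are counted with a regular expression rather than a SMILES parser, and the function name `strip_salts` is hypothetical. Toolkits provide proper salt-stripping routines.

```python
import re

# Approximate atom tokens: bracket atoms, two-letter halogens, then one-letter
# organic-subset symbols in upper (aliphatic) or lower (aromatic) case
ATOM_PATTERN = re.compile(r"\[[^\]]+\]|Cl|Br|[BCNOPSFI]|[bcnops]")


def strip_salts(smiles):
    """Return the dot-separated SMILES component with the most atoms."""
    fragments = smiles.split(".")
    return max(fragments, key=lambda frag: len(ATOM_PATTERN.findall(frag)))


print(strip_salts("CC(=O)[O-].[Na+]"))  # sodium acetate, counter-ion dropped
print(strip_salts("Cl.CCN"))            # ethylamine hydrochloride
```

Keeping only the largest fragment is itself a policy decision; it silently discards solvents and co-formers as well as counter-ions.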

Exact Structure Retrieval

Q Given a structure X, does the database have any instances of X?

•  The trivial way to do this is via string matching
•  Could match on canonical SMILES
•  But then you have to ensure that the query and the database employ the same canonicalizer


Q Given a structure X, does the database have any instances of X?

•  The trivial way to do this is via string matching
•  Better to use a hash code such as InChI
•  Need to be careful that you and the database are using the same settings (i.e., standard InChI)


•  This type of query is generally only useful during database registration
•  It is also handy when trying to match up one collection against another
•  But it can trip you up
    – Database stores stereochemistry, your query has none
    – You won’t find a match if you use a full InChI
    – Similar problems with tautomerism
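One way around the stereochemistry mismatch is to compare InChI strings with their stereo layers removed. The sketch below does this by plain string manipulation, assuming the usual layer prefixes (`/b`, `/t`, `/m`, `/s`) and a standard InChI without fixed-H or isotopic layers. The function name is hypothetical, and the example strings are intended to be L-alanine with and without stereo information.

```python
STEREO_LAYER_PREFIXES = ("b", "t", "m", "s")


def strip_stereo_layers(inchi):
    """Drop double-bond (/b) and tetrahedral (/t, /m, /s) stereo layers from an InChI."""
    head, formula, *layers = inchi.split("/")
    kept = [layer for layer in layers if not layer.startswith(STEREO_LAYER_PREFIXES)]
    return "/".join([head, formula] + kept)


stored = "InChI=1S/C3H7NO2/c1-2(4)3(5)6/h2H,4H2,1H3,(H,5,6)/t2-/m0/s1"  # with stereo
query = "InChI=1S/C3H7NO2/c1-2(4)3(5)6/h2H,4H2,1H3,(H,5,6)"             # no stereo

print(stored == query)                       # full-string match fails
print(strip_stereo_layers(stored) == query)  # connectivity-level match succeeds
```

The trimmed string is only a matching key; it should not be stored or exchanged as if it were a freshly generated InChI.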

Similar Structure Retrieval

Q Given a structure X, does the database have any molecules similar to X?

•  This can open up a can of worms!
•  Most common case is to find structurally similar molecules in terms of 2D
•  However, one can also consider 3D structural (i.e., shape) similarity
•  Finally, one could also identify similar molecules based on similar physicochemical properties

Property Similarity

•  Each molecule can be represented as an N-dimensional vector of numbers
    – These can represent structural descriptors (number of rings, graph invariants)
    – Physical characteristics (log P, polar surface area)
•  Similarity is then defined in terms of the distance between the descriptor vectors of the query molecule & the target molecules
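A minimal sketch of distance-based property similarity, using Euclidean distance. The descriptor vectors below (ring count, log P, polar surface area) are made-up placeholder numbers, not measured or computed values, and the molecule names are hypothetical. In practice the descriptors would be scaled first, since otherwise the one with the largest numeric range dominates the distance.

```python
import math


def euclidean_distance(u, v):
    """Straight-line distance between two equal-length descriptor vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))


def rank_by_distance(query, targets):
    """Return target names ordered from most to least similar to the query."""
    return sorted(targets, key=lambda name: euclidean_distance(query, targets[name]))


# Placeholder vectors: (number of rings, log P, polar surface area)
query = (1.0, 2.0, 40.0)
targets = {
    "mol_a": (1.0, 2.5, 45.0),
    "mol_b": (3.0, 4.0, 90.0),
    "mol_c": (1.0, 1.8, 38.0),
}

print(rank_by_distance(query, targets))
```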

Substructure Retrieval

Q Given a structure X, does the database have any molecules that contain X?

•  This is basically subgraph isomorphism
•  There are a number of variations
    – Find an exact substructure
    – Fuzzy substructure (e.g., ignore atom type)
    – Maximum common substructure

Substructure Retrieval

•  The basic subgraph isomorphism algorithms are quite well known

•  All cheminformatics toolkits support this

•  In the simplest approach, we can specify a SMILES string as a query
    – c1ccccc1C(=O) as a query looks for molecules containing a benzaldehyde moiety
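For readers who have not seen it, a bare-bones backtracking substructure matcher looks roughly like this. It is a sketch only: molecules are hand-built (atoms, bonds) tuples instead of parsed SMILES, aromaticity and hydrogens are ignored, and the function name `find_substructure` is hypothetical. Toolkits use much better optimized algorithms.

```python
def find_substructure(query, target):
    """Return a {query_atom: target_atom} mapping, or None if there is no match.

    Each molecule is (atoms, bonds): a list of element symbols and a list of
    (atom_i, atom_j, bond_order) tuples.
    """
    q_atoms, q_bonds = query
    t_atoms, t_bonds = target

    def bond_table(bonds):
        table = {}
        for a, b, order in bonds:
            table[(a, b)] = table[(b, a)] = order
        return table

    qb, tb = bond_table(q_bonds), bond_table(t_bonds)

    def extend(mapping):
        if len(mapping) == len(q_atoms):
            return dict(mapping)
        q = len(mapping)  # next query atom to place
        for t in range(len(t_atoms)):
            if t in mapping.values() or q_atoms[q] != t_atoms[t]:
                continue
            # Every query bond to an already placed atom must exist in the
            # target with the same bond order
            if all(tb.get((mapping[p], t)) == qb[(p, q)]
                   for p in mapping if (p, q) in qb):
                mapping[q] = t
                result = extend(mapping)
                if result is not None:
                    return result
                del mapping[q]  # backtrack
        return None

    return extend({})


carbonyl = (["C", "O"], [(0, 1, 2)])                         # C=O
acetaldehyde = (["C", "C", "O"], [(0, 1, 1), (1, 2, 2)])     # CC=O
ethanol = (["C", "C", "O"], [(0, 1, 1), (1, 2, 1)])          # CCO

print(find_substructure(carbonyl, acetaldehyde))
print(find_substructure(carbonyl, ethanol))
```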

Generic Substructure Queries

•  Using a SMILES as a query implies that you look for a specific substructure

•  What about finding molecules containing an aromatic ring connected to a carbonyl via a N or a C?

•  Valid molecules would be

•  We can perform these queries using SMARTS

SMARTS Queries

•  Regular expressions for chemical structures
•  The previous query is achieved by c1ccccc1[c,C,n,N]C=O
•  Very powerful system and fundamental to many cheminformatics methods
•  Also see SMARTSViewer to visualize SMARTS queries and the Daylight matcher to test them

Substructure Retrieval Performance?

•  Naively, substructure (SS) queries require one to check each entry in a database
•  But they can be significantly sped up by first making sure that the target molecule has all the features of the query molecule
    – A target that lacks any feature of the query can never match
    – Can do this by comparing fingerprints, which is much faster than doing graph isomorphism
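The screening step boils down to a bitwise test: a target survives only if every bit set in the query fingerprint is also set in the target fingerprint. A minimal sketch, assuming fingerprints are stored as Python integers used as bit masks (the bit patterns and molecule names are made up):

```python
def screen(query_fp, database):
    """Return names of targets whose fingerprints contain all query bits.

    Survivors still need a full subgraph isomorphism check; molecules that
    fail the screen can be rejected without one.
    """
    return [name for name, fp in database.items() if (query_fp & fp) == query_fp]


query_fp = 0b000101
database = {
    "mol_a": 0b110101,  # has both query bits, so it is kept for the slow check
    "mol_b": 0b110001,  # lacks one query bit, so it can never match
    "mol_c": 0b000101,  # identical bits
}

print(screen(query_fp, database))
```

The screen can produce false positives (shared bits without a real match) but, for keyed fingerprints built from substructure features, no false negatives.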

Maximum Common Substructure

•  Largest subgraph common to two structures
•  NP-complete problem
•  Many approaches try to identify an approximate solution
•  Multiple MCS can be used to identify core scaffolds

3D (sub) Graph Isomorphism

•  Also known as shape matching (if we consider whole molecules) or pharmacophore searching (if we consider substructures)
•  Essentially identify molecules that contain a 3D geometric motif
    – The motif is defined in terms of atoms/groups and distance/angle/dihedral constraints amongst them
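A toy version of such a 3D motif search is sketched below, restricted to distance constraints (angles and dihedrals are left out). The feature types, coordinates and the function name `find_motif` are assumptions for illustration; a real search would also have to deal with conformational flexibility. `math.dist` needs Python 3.8 or later.

```python
import math
from itertools import permutations


def find_motif(motif_types, distance_ranges, features):
    """Return indices of molecule features that satisfy the motif, or None.

    motif_types:     list of feature types, one per motif point
    distance_ranges: {(i, j): (low, high)} allowed distance between motif points i and j
    features:        list of (feature_type, (x, y, z)) for one conformer
    """
    for candidate in permutations(range(len(features)), len(motif_types)):
        # Feature types must agree point by point
        if any(features[f][0] != kind for f, kind in zip(candidate, motif_types)):
            continue
        # Every constrained pair must fall inside its distance window
        if all(low <= math.dist(features[candidate[i]][1],
                                features[candidate[j]][1]) <= high
               for (i, j), (low, high) in distance_ranges.items()):
            return candidate
    return None


# Made-up conformer with three pharmacophore features
features = [
    ("donor", (0.0, 0.0, 0.0)),
    ("acceptor", (3.0, 0.0, 0.0)),
    ("aromatic", (0.0, 4.0, 0.0)),
]

print(find_motif(["donor", "acceptor"], {(0, 1): (2.5, 3.5)}, features))
print(find_motif(["donor", "aromatic"], {(0, 1): (1.0, 2.0)}, features))
```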

3D (sub) Graph Isomorphism

Processing chemistry‐related documents

Parsing Chemical Documents

•  I have minimal expertise in this area and am more of a user than a developer of such tools

•  Two step process
    – Chemical entity extraction (from text or images)
    – Entity (i.e., name) to structure conversion

•  Variety of tools for both steps

Parsing Chemical Documents

Text entity recognition

(a)  Extractors (IUPAC names)
    - TEMIS Chemical Entity Relationships Skill Cartridge
    - Accelrys Pipeline Pilot extractor (Notiora)
    - Fraunhofer (ProMiner Chemistry)
    - ChemAxon (chemicalize.org)
    - Oscar (Corbett, Murray-Rust et al.)
    - SureChem
    - IBM ChemFrag Annotator

(b)  Converters
    - CambridgeSoft Name=Struct
    - OpenEye Lexichem
    - ChemAxon

Image recognition
    - OSRA (NIH)
    - Clide Pro (Keymodule Ltd.)
    - Fraunhofer chemoCR
    - ChemReader

http://www.chemaxon.com/library/user-presentations/chemical-entity-extraction-using-the-chemicalize-org-technology/

Entity Extraction

•  Many tools use some form of dictionary lookup (PubChem or ChEBI is a good source of chemical terms)

•  But dictionaries are certainly not sufficient

Daniel Lowe, 239th ACS National Meeting
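A minimal sketch of dictionary lookup, and of why it falls short. The four-term dictionary stands in for a real term list such as one exported from PubChem or ChEBI, and the function name `extract_entities` is hypothetical.

```python
import re

# Tiny stand-in for a real dictionary of chemical terms
CHEMICAL_TERMS = {"tyrosine", "benzene", "aspirin", "paracetamol"}


def extract_entities(text, dictionary=CHEMICAL_TERMS):
    """Return the dictionary terms found in the text, in order of appearance."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [token for token in tokens if token in dictionary]


text = ("Tyrosine, i.e. 2-amino-3-(4-hydroxyphenyl)propanoic acid, "
        "is named here alongside benzene.")
print(extract_entities(text))
```

The trivial names are found, but the systematic name is missed entirely: it is not in the dictionary, and the naive tokenizer even breaks it into pieces. Systematic names are open-ended, so they need a grammar-based recognizer rather than a lookup table.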

OPSIN

•  Can be used as a library or as a command line tool

•  Outputs CML by default, but this can easily be converted to other formats by CDK or OpenBabel

•  See here for some benchmarks of OPSIN versus Lexichem and ChemAxon

OPSIN Example

    package gov.nih.ncgc;

    import nu.xom.Element;
    import uk.ac.cam.ch.wwmm.opsin.NameToStructure;
    import uk.ac.cam.ch.wwmm.opsin.NameToStructureException;

    import java.io.BufferedReader;
    import java.io.File;
    import java.io.FileReader;
    import java.io.IOException;

    public class OpsinExample {
        String filename;

        public OpsinExample(String filename) throws NameToStructureException, IOException {
            this.filename = filename;
        }

        // Read one chemical name per line and print the CML that OPSIN generates for it
        public void run() throws NameToStructureException, IOException {
            NameToStructure nameToStructure = NameToStructure.getInstance();
            BufferedReader reader = new BufferedReader(new FileReader(new File(filename)));
            String line;
            while ((line = reader.readLine()) != null) {
                Element cmlElement = nameToStructure.parseToCML(line);
                System.out.println(cmlElement.toXML());
            }
            reader.close();
        }

        public static void main(String[] args) throws IOException, NameToStructureException {
            OpsinExample oe = new OpsinExample("/Users/guhar/Documents/Presentations/trec/sw/chemnames.txt");
            oe.run();
        }
    }

Lexichem Example

    from openeye.oechem import *
    from openeye.oeiupac import *

    # Convert each name in the file to a structure and print its canonical SMILES
    mol = OEGraphMol()
    for line in open('chemnames.txt'):
        line = line.strip()
        mol.Clear()
        OEParseIUPACName(mol, line)
        print(OECreateCanSmiString(mol))

Chemical information toolkits

Software Tools

•  Chemical Information Toolkits
    – CDK (Java, LGPL)
    – OpenBabel (C++, GPL2)
    – RDKit (C++, BSD)
    – Indigo (C++, Dual licensed)
    – JChem (Java, commercial, free for academics)
    – OEChem (C++, commercial, free for academics)
    – Daylight (C, commercial)

Software Tools

•  Name to Structure applications
    – OSCAR3 (Java, LGPL) consisting of OPSIN and ChemTok
    – ChemAxon (Java, commercial, free for academics)
    – LexiChem (C++, commercial, free for academics)