TREC 2010 Chemical IR Workshop
Rajarshi Guha NIH Chemical Genomics Center
November 2010 NIST, Gaithersburg
Acknowledgements
• John Barnard • Joseph Scheiber • Daniel Lowe • David Wild
• Nina Jeliazkova
Outline
• Chemical structure representation(s)
• Processing chemistry‐related documents
• Structure retrieval
• Chemical information toolkits
Chemical structure representation
Based on material from I571, David Wild, Indiana University
What to Include?
• The representation depends on what you want to include (and defines what you can include)
– Atoms
– Connectivity
– Stereochemistry
– Charges/Isotopes
– 3D configuration
C8H9NO3
Visual Representations
• 2D structure diagram is the lingua franca of bench chemists
• Display is supported by nearly every cheminformatics system
• Summarizes
– Connectivity
– Stereochemistry
– Charge/isotope
• We can use 2D representations as the starting point for many cheminformatics tasks
Chemical Names
• Most papers and documents referring to chemistry will name molecules
• Two forms of chemical names
– Trivial (short, pronounceable, uninformative)
– Systematic (long, unpronounceable, can usually get structure back from the name)
Tyrosine
or
β‐(p‐hydroxyphenyl)alanine or α‐amino‐p‐hydroxyhydrocinnamic acid or 2‐amino‐3‐(4‐hydroxyphenyl) propanoic acid
Chemical Numbers
• Also termed registry numbers
• Arbitrary numbers assigned to one (or more) structures
• Sometimes a hierarchy might be present
– Parent compound
– stereoisomer
– salt …
• CAS numbers / PubChem SID & CID / InChI Key
• Only way to get back structure is lookup
Structures are Graphs (mostly)
• For many cases, a chemical structure can be considered as a graph
– Atoms are nodes, bonds are edges
– Identical graphs imply identical molecules
• But there are limits to this representation
– Polymers, inorganic compounds, stereochemistry
• And chemical phenomena can create problems for a graph theoretical approach
Aromaticity & Graphs
• These two molecules are identical
• Yet their molecular graphs would suggest a different connectivity
• In fact, all atoms and bonds in benzene are equivalent due to resonance
• In this case, we would (should) perceive each C‐C bond as aromatic
1D Representations
• 1D representations are linear strings
• Generally only encode connectivity, atom and bond type
– Wiswesser line notation (WLN)
– Sybyl line notation (SLN)
– SMILES
– InChI
2D/3D Representations
• Multi‐line text formats
• Contain connectivity, atom and bond types, 3D coordinates as well as other (possibly arbitrary) information
– MDL MOL format
– PDB
– Hyperchem HIN
– Chemical Markup Language (CML)
SMILES
• A simple line notation, pretty much the lingua franca of cheminformatics
• Atoms are represented by their symbols
– Lower case indicates an aromatic atom
• Single bonds are implicit, double bonds are “=”, triple bonds are “#” and aromatic bonds are “:”
• Rings are indicated by “ring closure numbers”
– In C1CC1, the two carbons marked by “1” are connected
Canonical SMILES
• Given a structure, you can write a SMILES representation in multiple ways, e.g. CC(C)CC and CCC(C)C
• As a result, comparing molecules or searching for molecules based on arbitrary SMILES can give misleading or wrong results
• To avoid this we canonicalize SMILES
Canonical SMILES
• Given two structures that have identical atoms, bonds and connectivity, their canonical SMILES will be identical
• In general, any permutation of atom index will not affect the canonical SMILES
• Canonicalization is a key feature of structure registration
– You want to be sure that you have a single, unique representation of each structure in the database
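To make the "any permutation of atom index" point concrete, here is a minimal sketch in Python. It is not the algorithm used by any toolkit: it simply tries every renumbering of a tiny molecular graph and keeps the smallest description, which is only feasible for a handful of atoms. The function name canonical_key and the (atoms, bonds) encoding are assumptions made for illustration, and bond orders are ignored.

```python
from itertools import permutations

def canonical_key(atoms, bonds):
    """Return an ordering-independent key for a small molecular graph.

    atoms: list of element symbols, indexed by atom number
    bonds: list of (i, j) atom-index pairs (bond orders ignored here)
    Brute force over all renumberings, so only usable for tiny graphs.
    """
    n = len(atoms)
    best = None
    for perm in permutations(range(n)):
        # perm[old] is the new index assigned to atom `old`
        labels = [None] * n
        for old, new in enumerate(perm):
            labels[new] = atoms[old]
        edges = sorted(tuple(sorted((perm[i], perm[j]))) for i, j in bonds)
        key = (tuple(labels), tuple(edges))
        if best is None or key < best:
            best = key
    return best

# Two atom orderings of the same structure, as read from CC(C)CC and CCC(C)C
mol_a = (["C"] * 5, [(0, 1), (1, 2), (1, 3), (3, 4)])
mol_b = (["C"] * 5, [(0, 1), (1, 2), (2, 3), (2, 4)])
# n-pentane: same atoms, different connectivity
pentane = (["C"] * 5, [(0, 1), (1, 2), (2, 3), (3, 4)])

same = canonical_key(*mol_a) == canonical_key(*mol_b)
different = canonical_key(*mol_a) != canonical_key(*pentane)
print(same, different)
```

Both orderings of the branched structure map to one key, while n-pentane gets a different one, which is exactly the property a registration system relies on.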
Generating Canonical SMILES
• All toolkits are capable of generating these
• Using OpenBabel at the command line is easy
• Can convert lots of other file formats to canonical SMILES
• Note that different products have different canonicalization algorithms
$ echo "c1(O)ccccc1" | /usr/local/openbabel/bin/babel -ismi -ocan -c
Oc1ccccc1
$ echo "c1(ccccc1)O" | /usr/local/openbabel/bin/babel -ismi -ocan -c
Oc1ccccc1
Tautomerism
• Simply generating a unique string representation of a molecule may not be sufficient to uniquely identify that molecule
• Tautomerism is a reaction involving the movement of an H, resulting in a change of bond order
Tautomerism
• Due to the rapid equilibrium the molecule exists in both forms – which one do we store?
– Consider the most stable tautomer
– Just go with a ‘canonical’ tautomer
• In general, you need to generate the canonical SMILES for a canonical tautomer
• Using InChI, you can choose whether to consider tautomers or to ignore tautomer information
Tautomerism & InChI
http://www.chemspider.com/blog/does-inchi-account-for-tautomers.html
With mobile‐H perception on
With mobile‐H perception off
Markush Structures
• Compact representation of a set or class of specific compounds with common structural features
• Used in
– chemical patents
– query structures in substructure search systems
– Quantitative Structure‐Activity Relationship (QSAR) analysis
  • class of related compounds with activity data
– combinatorial libraries
  • rapid synthesis of large numbers of related compounds
– legislation (controlled drugs, chemical weapons)
Markush Structures
[Figure: Markush structure diagram with variable groups R1 and R2]
Markush Structures
• S‐variation (substituent variation)
– List of alternative values for an R‐group
• P‐variation (position variation)
– Variable point of attachment
• F‐variation (frequency variation)
– Multiple occurrence of groups
• H‐variation (homology variation)
– Generically described group (e.g., alkyl)
– A (possibly) infinite set of S‐variations
Markush Structures
• S‐variation – R1 is methyl or ethyl
• H‐variation – R2 is alkyl
• P‐variation – R3 is amino
• F‐variation – 3 (but more generally can be any number, say m)
Markush Structures
• Can be considered as a formal “grammar” for generating valid molecules (“sentences”)
• Enumeration of coverage is usually impractical and often impossible (infinite sets)
• Appropriate algorithms for handling take advantage of the Markush representation
– Avoid enumeration (especially infinite sets)
– Compare finite grammars rather than infinite sets of valid sentences
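A small sketch of why enumeration gets out of hand, even for plain S‐variation. The template syntax with {R1} and {R2} placeholders, the function name enumerate_markush and the R‐group lists are all hypothetical and chosen only for illustration; real Markush handling (P‐, F‐ and H‐variation in particular) is far more involved.

```python
from itertools import product

def enumerate_markush(template, r_groups):
    """Yield one SMILES string per combination of R-group values.

    template: SMILES-like string with {R1}, {R2}, ... placeholders
              (a made-up convention for this sketch)
    r_groups: dict mapping placeholder name -> list of SMILES fragments
    """
    names = sorted(r_groups)
    for combo in product(*(r_groups[name] for name in names)):
        yield template.format(**dict(zip(names, combo)))

# ortho-disubstituted benzene: R1 is methyl or ethyl, R2 is OH, NH2 or Cl
template = "{R1}c1ccccc1{R2}"
r_groups = {"R1": ["C", "CC"], "R2": ["O", "N", "Cl"]}

results = list(enumerate_markush(template, r_groups))
print(len(results), results)
```

The number of enumerated structures is the product of the list lengths, so a few R‐groups with a few dozen alternatives each already give millions of structures, and a homology group such as "alkyl" has no finite list at all.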
The Markush Problem
• Representation
– Mixture of structures and text
– Generic expressions (viz., H‐variation)
– Vagueness (“ … where by X we mean …”)
• Searching
– Translation problem – specific groups (ethyl, butyl) must be matched against an expression (1‐6C alkyl)
– Segmentation problem – boundaries between R‐groups and the scaffold may not coincide
Markush Translation/Segmentation
Chemical Similarity
• When are two molecules similar? – They are both 6‐membered rings
– Both have carbons – Both have only single bonds
• In the second case, both have an N
• Many ways to define similarity
• Critically dependent on representation
Why Chemical Similarity?
• Much of medicinal chemistry is based on the similarity principle
– Similar molecules exhibit similar activities
– J. Med. Chem., 2002, 45, 4350-4358
• But there are many exceptions
• Even then, looking for similar molecules gives us a useful starting point in many cases
Fingerprint Based Similarity
• We can represent a chemical structure using a bit string representation
• The example is a “key” fingerprint
– Each bit position corresponds to a specific structural feature
1 0 1 1 0 0 0 1 0
Fingerprint Based Similarity
• The similarity between two molecules is then defined in terms of the similarity between their fingerprints
– Tanimoto, Dice, Cosine, Tversky
• Clearly, depending on the nature of the fingerprints, two molecules can be more or less similar
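As a rough illustration of two of the coefficients named above, the sketch below computes Tanimoto and Dice similarity for two short key fingerprints. The first fingerprint is the bit pattern shown on the earlier slide; the second is made up for the comparison. Returning 1.0 when both fingerprints are empty is a convention assumed here, not something the slides specify.

```python
def on_bits(fp):
    """Return the set of positions that are set in a list of 0/1 bits."""
    return {i for i, bit in enumerate(fp) if bit}

def tanimoto(fp_a, fp_b):
    """Shared on-bits divided by on-bits present in either fingerprint."""
    a, b = on_bits(fp_a), on_bits(fp_b)
    union = len(a | b)
    # Assumed convention: two empty fingerprints count as identical
    return len(a & b) / union if union else 1.0

def dice(fp_a, fp_b):
    """Twice the shared on-bits divided by the total on-bits in both."""
    a, b = on_bits(fp_a), on_bits(fp_b)
    total = len(a) + len(b)
    # Same assumed convention for the all-zero case
    return 2 * len(a & b) / total if total else 1.0

fp1 = [1, 0, 1, 1, 0, 0, 0, 1, 0]  # bit pattern from the slide
fp2 = [1, 0, 0, 1, 0, 1, 0, 1, 0]  # hypothetical second molecule

t = tanimoto(fp1, fp2)  # 3 shared bits / 5 bits in the union
d = dice(fp1, fp2)      # 2 * 3 shared bits / (4 + 4) on-bits
print(t, d)
```

The same pair of molecules scores differently under the two coefficients, so similarity thresholds are not interchangeable between them.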
Fingerprint Similarity
• Information loss – fragment presence and absence instead of counts
• Bit string saturation – within a large database almost all bits are set
• The average similarity appears to increase with the complexity of the query compound
• Larger queries are more discriminating (flatter curve, Tanimoto values spread wider)
• Smaller queries have a sharp peak, unable to distinguish between molecules
Flower D., On the Properties of Bit String-Based Measures of Chemical Similarity, J. Chem. Inf. Comput. Sci., Vol. 38, No. 3, 1998
The distribution of Tanimoto values found in database searches with a range of query molecules
Nina Jeliazkova, http://vedina.users.sourceforge.net/publications/2005/ChemicalSimilarity.ppt
Physical Similarity
• Keyed fingerprints are inevitably lossy
• Hashed and circular fingerprints can be better
• But in the end they both ignore the 3D structure of a molecule
• Shape and surface‐property based similarities can be more relevant
– Slower to evaluate
– OpenEye ROCS is a tool to evaluate shape similarity
Structural Similarity & 3D Property Variation
Nina Jeliazkova, http://vedina.users.sourceforge.net/publications/2005/ChemicalSimilarity.ppt
Similarity Indices
Association indices Correlation indices
J. D. Holliday, C-Y. Hu and P. Willett (2002), Grouping of Coefficients for the Calculation of Inter-Molecular Similarity and Dissimilarity using 2D Fragment Bit-Strings, Combinatorial Chemistry & High Throughput Screening, 5, 155-166
Nina Jeliazkova, http://vedina.users.sourceforge.net/publications/2005/ChemicalSimilarity.ppt
Structure retrieval
What Are We Asking For?
Exact Matches? Similar topology?
Substructure matches? Similar properties?
What do we have?
Caveat
• If structures are not registered properly, results of queries can be misleading, incomplete or wrong
• At the very least – Remove salts
– Generate canonical tautomers – Create a canonical SMILES or InChI
Exact Structure Retrieval
Q: Given a structure X, does the database have any instances of X?
• The trivial way to do this is via string matching
• Could match on canonical SMILES
• But then you have to ensure that the query and the database employ the same canonicalizer
Exact Structure Retrieval
Q: Given a structure X, does the database have any instances of X?
• The trivial way to do this is via string matching
• Better to use a hash code such as InChI
• Need to be careful that you and the database are using the same settings (i.e., standard InChI)
Exact Structure Retrieval
• This type of query is generally only useful during database registration
• It is also handy when trying to match up one collection against another
• But can trip you up – Database stores stereochemistry, your query has none
– You won’t find a match, if you use a full InChI
– Similar problems with tautomerism
Similar Structure Retrieval
Q: Given a structure X, does the database have any molecules similar to X?
• This can open up a can of worms!
• Most common case is to find structurally similar molecules in terms of 2D
• However, one can also consider 3D structural (i.e., shape) similarity
• Finally, one could also identify similar molecules based on similar physicochemical properties
Property Similarity
• Each molecule can be represented as an N‐dimensional vector of numbers
– These can represent structural descriptors (number of rings, graph invariants)
– Physical characteristics (log P, polar surface area)
• Similarity is then defined in terms of the distance between the descriptor vectors of the query molecule & the target molecules
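A minimal sketch of distance-based property similarity, assuming Euclidean distance as the metric (the slides do not name one). The molecule names and descriptor values (log P, polar surface area, ring count) are invented for illustration.

```python
import math

def euclidean(u, v):
    """Euclidean distance between two equal-length descriptor vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))

# Hypothetical descriptor vectors: (log P, polar surface area, ring count)
query = (1.5, 60.0, 2)
targets = {
    "mol_a": (1.4, 58.0, 2),
    "mol_b": (4.0, 20.0, 1),
    "mol_c": (1.6, 75.0, 3),
}

# Smaller distance means more similar, so sort ascending
ranked = sorted(targets, key=lambda name: euclidean(query, targets[name]))
print(ranked)
```

Note that the polar surface area dominates the distance here because it has the largest numeric range; in practice descriptors are usually scaled to comparable ranges before distances are computed.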
Substructure Retrieval
Q: Given a structure X, does the database have any molecules that contain X?
• This is basically subgraph isomorphism
• There are a number of variations
• Find an exact substructure
• Fuzzy substructure (e.g., ignore atom type)
• Maximum common substructure
Substructure Retrieval
• The basic subgraph isomorphism algorithms are quite well known
• All cheminformatics toolkits support this
• In the simplest approach, we can specify a SMILES string as a query
– c1ccccc1C(=O) as a query looks for molecules containing a benzaldehyde moiety
Generic Substructure Queries
• Using a SMILES as a query implies that you look for a specific substructure
• What about finding molecules containing an aromatic ring connected to carbonyl via a N or a C?
• Valid molecules would be
• We can perform these queries using SMARTS
SMARTS Queries
• Regular expressions for chemical structures
• The previous query is achieved by c1ccccc1[c,C,n,N]C=O
• Very powerful system and fundamental to many cheminformatics methods
• Also see SMARTSViewer to visualize SMARTS queries and the Daylight matcher to test them
Substructure Retrieval Performance?
• Naively, SS queries require one to check each entry in a database
• But this can be significantly sped up by first making sure that the target molecule has all the features of the query molecule
– A target that lacks any feature of the query can never match
– Can do this by comparing fingerprints, which is much faster than doing graph isomorphism
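The fingerprint screen described above can be sketched with Python integers used as bit strings. The fingerprints and molecule names are made up; in a real system they would come from a toolkit's fingerprinter, and every molecule that passes the screen would still go through a full subgraph isomorphism check, since the screen can only rule candidates out.

```python
def passes_screen(query_fp, target_fp):
    """True if every bit set in the query is also set in the target."""
    return (query_fp & target_fp) == query_fp

# Hypothetical fingerprints, one integer per molecule
database = {
    "mol_a": 0b101101,
    "mol_b": 0b001001,
    "mol_c": 0b111111,
}
query = 0b001101

# Only these candidates need the expensive graph matching step
candidates = [name for name, fp in database.items() if passes_screen(query, fp)]
print(candidates)
```

Here mol_b is rejected by a single bitwise AND because it lacks a feature bit that the query has set.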
Maximum Common Substructure
• Largest subgraph common to two structures
• NP‐complete problem
• Many approaches try to identify an approximate solution
• Multiple MCS can be used to identify core scaffolds
3D (sub) Graph Isomorphism
• Also known as shape matching (if we consider whole molecules) or pharmacophore searching (if we consider substructures)
• Essentially identify molecules that contain a 3D geometric motif
– The motif is defined in terms of atoms/groups and distance/angle/dihedral constraints amongst them
3D (sub) Graph Isomorphism
Processing chemistry‐related documents
Parsing Chemical Documents
• I have minimal expertise in this area and am more of a user than a developer of such tools
• Two step process
– Chemical entity extraction (from text or images)
– Entity (i.e., name) to structure conversion
• Variety of tools for both steps
Parsing Chemical Documents
Text entity recognition
(a) Extractors (IUPAC names)
‐ TEMIS Chemical Entity Relationships Skill Cartridge
‐ Accelrys Pipeline Pilot extractor (Notiora)
‐ Fraunhofer (ProMiner Chemistry)
‐ ChemAxon (chemicalize.org)
‐ Oscar (Corbett, Murray‐Rust et al.)
‐ SureChem
‐ IBM ChemFrag Annotator
(b) Converter
‐ CambridgeSoft name=struct
‐ OpenEye Lexichem
‐ ChemAxon
Image recognition
‐ OSRA (NIH)
‐ Clide Pro (Keymodule Ltd.)
‐ Fraunhofer chemoCR
‐ ChemReader
http://www.chemaxon.com/library/user-presentations/chemical-entity-extraction-using-the-chemicalize-org-technology/
Entity Extraction
• Many tools use some form of dictionary lookup (PubChem or ChEBI is a good source of chemical terms)
• But dictionaries are certainly not sufficient
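A toy sketch of dictionary lookup, and of where it falls short. The three-term dictionary and the sentence are invented for illustration; a real dictionary would be built from a source such as PubChem or ChEBI, and none of the tools listed above is claimed to work this way internally.

```python
import re

def find_chemical_names(text, dictionary):
    """Return dictionary terms found in text, in order of appearance."""
    hits = []
    for term in dictionary:
        # Whole-word, case-insensitive match of the literal term
        pattern = r"\b" + re.escape(term) + r"\b"
        for match in re.finditer(pattern, text, flags=re.IGNORECASE):
            hits.append((match.start(), match.group(0)))
    return [name for _, name in sorted(hits)]

dictionary = {"tyrosine", "ethanol", "benzene"}  # hypothetical term list
text = ("Tyrosine was dissolved in ethanol, then "
        "2-amino-3-(4-hydroxyphenyl)propanoic acid was added.")

found = find_chemical_names(text, dictionary)
print(found)
```

The trivial names are found, but the systematic name in the same sentence is missed because it is not in the dictionary, and no finite dictionary can list every systematic name. That gap is what name-to-structure parsers are meant to fill.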
Daniel Lowe, 239th ACS National Meeting
OPSIN
OPSIN
• Can be used as a library or as a command line tool
• Outputs CML by default, but this can easily be converted to other formats by CDK or OpenBabel
• See here for some benchmarks of OPSIN versus Lexichem and ChemAxon
OPSIN Example
package gov.nih.ncgc;

import nu.xom.Element;
import uk.ac.cam.ch.wwmm.opsin.NameToStructure;
import uk.ac.cam.ch.wwmm.opsin.NameToStructureException;

import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.IOException;

public class OpsinExample {
    String filename;

    public OpsinExample(String filename) throws NameToStructureException, IOException {
        this.filename = filename;
    }

    // Read one chemical name per line and print the CML that OPSIN generates for it
    public void run() throws NameToStructureException, IOException {
        NameToStructure nameToStructure = NameToStructure.getInstance();
        BufferedReader reader = new BufferedReader(new FileReader(new File(filename)));
        String line;
        while ((line = reader.readLine()) != null) {
            Element cmlElement = nameToStructure.parseToCML(line);
            System.out.println(cmlElement.toXML());
        }
        reader.close();
    }

    public static void main(String[] args) throws IOException, NameToStructureException {
        OpsinExample oe = new OpsinExample("/Users/guhar/Documents/Presentations/trec/sw/chemnames.txt");
        oe.run();
    }
}
Lexichem Example
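from openeye.oechem import *
from openeye.oeiupac import *

mol = OEGraphMol()
# One chemical name per line in the input file
for line in open('chemnames.txt'):
    line = line.strip()
    mol.Clear()
    # Parse the name into mol, then print its canonical SMILES
    OEParseIUPACName(mol, line)
    print(OECreateCanSmiString(mol))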
Chemical information toolkits
Software Tools
• Chemical Information Toolkits
– CDK (Java, LGPL)
– OpenBabel (C++, GPL2)
– RDKit (C++, BSD)
– Indigo (C++, Dual licensed)
– JChem (Java, commercial, free for academics)
– OEChem (C++, commercial, free for academics)
– Daylight (C, commercial)
Software Tools
• Name to Structure applications
– OSCAR3 (Java, LGPL) consisting of OPSIN and ChemTok
– ChemAxon (Java, commercial, free for academics)
– LexiChem (C++, commercial, free for academics)