solved and unsolved problems in chemoinformaticsapplications of chemoinformatics: drug design...
TRANSCRIPT
C3 © Gasteiger
Solved and Unsolved Problems in
Chemoinformatics
Johann Gasteiger Computer-Chemie-Centrum
University of Erlangen-Nürnberg
D-91052 Erlangen, Germany
C3 © Gasteiger
Overview
• objectives of lecture
• achievements
• challenges
• unsolved problems
• summary
C3 © Gasteiger
Objectives
• many problems have been solved
let us be proud!
•there is still a lot to be done
chemoinformatics is a field of its own,
is attractive for students
C3 © Gasteiger
• Structure activity relationships 1963 Hantsch & Fujita
• structure representation
1965, Morgan
• structure elucidation
1965, Sasaki, Munk, DENDRAL
• synthesis design
1970, Corey & Wipke, Ugi+Gasteiger, Hendrickson
• molecular modeling
1970, Langridge, Marshall
• data analysis / chemometrics
1970, Kowalski, Wold
Chemoinformatics – An Old Discipline
C3 © Gasteiger
• access to chemical information
• learning from chemical information
• applications
Achievements
C3 © Gasteiger
Databases
• Chemical Abstracts Service (1975)
• Cambridge Structure Database (1984)
• Beilstein (1990)
• Gmelin (1990)
• ChemInformRX (1991)
• SpecInfo (1991)
• PubChem (2004)
• etc.
C3 © Gasteiger
Number of Compounds in Chemistry
compounds published in CAS
0
10
20
30
40
50
1965 1970 1975 1980 1985 1990 1995 2000 2005
year
co
mp
ou
nd
s (
millio
ns
)
102 million
compounds,
66 million
sequences
(June 2015)
x
C3 © Gasteiger
Chemical Structures
• computers have learnt the language of a chemist
communicating by structure diagrams
•molecules are stored with atomic resolution providing
access to each atom and bond
enabling substructure search
•the complex structures of molecules can be visualized
new insights can be gained
C3 © Gasteiger
• databases have strongly contributed to the
progress in chemistry and related fields
• without databases modern research in
chemistry would be inconceivable
Summary
C3 © Gasteiger
• learning from data
• QSAR/QSPR
• representation of chemical structures
Learning from Chemical Information
C3 © Gasteiger
Problem: Not Enough Information
102,000,000 chemical compounds
1,000,000 3D structures in
Cambridge Crystallographic Data File
we only have data on the 3D structure for less than
1% of the known compounds
C3 © Gasteiger
Problem: Not Enough Information
102,000,000 chemical compounds
1,000,000 3D structures in
Cambridge Crystallographic Data File
we only have data on the 3D structure for less than
1% of the known compounds
can we learn the rules from the known 3D structures
C3 © Gasteiger
Rule-Based Learning: CORINA
*
connection table & stereo descriptors
initialization of
internal coordinates
rings:
ring perception
ring template search
ring assembly
3D coordinates
acyclic systems:
removal of steric crowding
generates 3D coordinates for >99.5 % of all organic compounds
J.Sadowski, J.Gasteiger, Chem.Reviews, 1993, 93, 2567-2581.
http://www.molecular-networks.com/products/corina/
C3 © Gasteiger
Data-Based Learning
Structure-Property Relationships (QSPR, QSAR)
molecular
structure property
structure
descriptors
//
representation model building
C3 © Gasteiger
Representation of Chemical Structures
• topological indices
• fragment codes
• fingerprints
• ….
Several thousands of different types of
chemical descriptors have been developed
R.Todeschini, V.Consonni, Molecular Descriptors in
Chemoinformatics, 2 volumes, Wiley-VCH, 2009
C3 © Gasteiger
Representation of Chemical Structures
3D model
molecular
surface
constitution
N H
N O
O
N H 3 +
Of Molecules and Humans
J. Gasteiger,
J. Med. Chem., 2006, 55, 6429 - 6434
C3 © Gasteiger
• physical, e.g.
– aqueous solubility
– 13C NMR shifts
• chemical, e.g.
– acidity
• biological, e.g.
– toxicity
Prediction of Properties
C3 © Gasteiger
Applications of Chemoinformatics: Drug Design
Chemoinformatics has become an integral part of the
drug design process
• lead discovery
• virtual screening
• pharmacophore searching
• lead optimization
• QSAR
• molecular docking
• prediction of ADME properties
• solubility, adsorption, distribution, metabolism, excretion….
C3 © Gasteiger
Handbook of
Chemoinformatics
J. Gasteiger (Editor)
65 authors
73 contributions
4 volumes
1900 pages
Wiley-VCH, Weinheim
(August 2003)
From Data to Knowledge
C3 © Gasteiger
Chemoinformatics
- A Textbook -
J. Gasteiger, T. Engel
(Editors)
650 pages
Wiley-VCH, Weinheim
(September 2003)
C3 © Gasteiger
• essense of chemistry
• environmental impact of chemicals
• understanding chemistry
• understanding biological systems
• human health
applications in all fields of chemistry and related
scientific disciplines
Challenges
C3 © Gasteiger
Essense of Chemistry
Chemical Industry is not selling
compounds
but is selling
properties
i.e., compounds with desired properties
(drugs, paints, fibers, plastics, pesticides, catalysts,
ceramics, etc.)
C3 © Gasteiger
What structure do I need for a certain property?
structure-activity relationships
How do I make this structure?
synthesis design
What is the product of my reaction?
reaction prediction
structure elucidation
In all those areas the use of chemoinformatics could help!
Fundamental Questions in Chemistry
C3 © Gasteiger
REACH – Registration, Evaluation, Authorization and
restriction of Chemicals
• for those chemicals used with more than 10 tons/year manufactured or
imported into the European Union a Chemical Safety Report is needed
• law since June 1, 2007; registration until Dec 1, 2013
• applies to about 35,000 chemicals
• testing on harmful effects on human health or environment, determination of
persistence, bioaccumulation and toxicity
• testing is time-consuming, expensive and might need many animals
Risk Assessment of Chemicals
Use chemoinformatics methods for ranking of chemicals
C3 © Gasteiger
• for chemicals used in cosmetics products
• no compounds tested on animals are allowed in cosmetics
in Europe since 2009.
• all animal tested cosmetics will eventually be banned on the
European market.
Cosmetics Directive
Use chemoinformatics methods for developing
alternatives to animal testing
C3 © Gasteiger
• access to chemical information
• learning from chemical information
• applications
Problems Still to be Solved
C3 © Gasteiger
• input of chemical structures (hand writing, voice)
• representation of neglected compounds (boranes, organo-
metallic structures (ferrocene, etc.), Markush structures (patents),
polymers)
• text mining (optical character recognition)
• publishing chemical information (3D structures, spectra)
• publishing and searching on the internet
Improved Access to Chemical Information
C3 © Gasteiger
• on compounds
• store all available information (all properties)
• store all spectra
• on reactions
• give the entire stoichiometry of a reaction
• store all reaction conditions (solvent, temperature,
reaction time)
• kinetic data
Neede for Better Databases
C3 © Gasteiger
From Experiment to Reaction Database (Presently)
experiment
lab journal report
manuscript
patent publication
secondary literature abstract
input into reaction database
reaction database
printed reaction scheme
information
producer
many breaks
in
information
processing
information
consumer
written on paper
electronically
C3 © Gasteiger
From Experiment to Reaction Database (Future: Electronic Lab Notebooks)
experiment
e lab notebook report
manuscript
patent publication
secondary literature abstract
input into reaction database
reaction database
printed reaction scheme
information
producer
a direct
information
flow
information
consumer
written on paper
electronically
X
C3 © Gasteiger
• Note: standard QSAR studies have difficulties to get
published if they do not also
- provide new experimental results or
- increase our knowledge and insight
• represent structures by descriptors that can be interpreted
• combine substructures with physicochemical effects
• use data analysis methods that are not a black box
• a forest of decision trees gives more exact predictions but cannot be
interpreted; a single decision tree can be interpreted!
• combine a QSAR model with e.g. docking studies
Learning from Chemical Information
From models to interpretation
C3 © Gasteiger
• all fields of chemistry
• drug design
• analytical chemistry
• organic chemistry
• biochemistry
• toxicology
• material science
Applications
C3 © Gasteiger
• conformational flexibility of drugs and proteins
• docking into proteins
• no consensus scoring (science cannot be predicted by voting)
• try to model the physicochemistry of the process
• protein-protein and protein-DNA interactions
• prediction of ADME-Tox properties
• model the various organs of a human
Drug Design
C3 © Gasteiger
Analytical Chemistry
• origin of olive oils
• simulation of infrared spectra
• 3D structure from infrared spectrum
C3 © Gasteiger
Italian Olive Oils
9 regions
572 olive oil samples
contents of 8 fatty acids:
palmitic acid linoleic acid
palmitoleic acid arachidic acid
stearic acid linoleic acid
oleic acid eicosenoic acid
1
2
3
4
5
6
7 8
9
M.Forina, et al. Ann.Chim.(Rome), 1982, 72, 127 and 144.
C3 © Gasteiger
Self-Organizing Map of Olive Oils
6 6 6 6 6 5 5 5 5 5 5 8 8 8 8 8 8 8
6 6 6 6 5 5 5 5 5 7 8 8 8 8
6 6 6 6 5 5 5 5 5 5 7 8 8 8 8 8 6 6 6 6 5 5 5 5 5 5 7 7 7 7 8 8 8 8
6 6 5 5 5 5 7 7 7 7 7 7 8 8
3 3 5 5 5 7 7 7 9 9 9 9 8
3 5 5 5 7 7 7 9 9 9 9
3 3 5 5 7 7 7 7 9 9 9 9 9 3 3 3 7 7 7 7 9 9 9 1
3 3 3 3 3 3 3 7 7 7 9 1 4 1
3 3 3 3 3 3 3 2 2 4 2 1 1 1 1
3 3 3 3 3 3 3 3 2 2 2 2 2 2 1 1 1
3 3 3 3 3 3 3 3 3 2 2 1 1 1 3 3 3 3 3 3 2 2 2 2 2 4 4
3 3 3 3 3 3 3 3 3 3 3 2 2 2 2
3 3 3 3 3 3 3 3 3 3 2 2 2 4 4 4
3 3 3 3 3 3 3 3 3 3 3 3 4 2 2 4 4 4 4
3 3 3 3 3 3 3 3 3 4 2 2 2 2 3 3 3 3 3 3 3 3 4 4 3 4 4 2 2 2 2
3 3 3 3 3 3 3 3 3 3 3 3 4 4 4 2 2 2 4
572 samples
9 regions
8 fatty acids
250 training
322 test
10 prediction errors
J.Zupan, M.Novic, X.Li, J. Gasteiger, Anal. Chim. Acta, 1994, 292, 219-234
C3 © Gasteiger
Self-Organizing Map of Olive Oils
1
2
3
4
5
6
7 8
9
6 6 6 6 6 5 5 5 5 5 5 8 8 8 8 8 8 8
6 6 6 6 5 5 5 5 5 7 8 8 8 8
6 6 6 6 5 5 5 5 5 5 7 8 8 8 8 8 6 6 6 6 5 5 5 5 5 5 7 7 7 7 8 8 8 8
6 6 5 5 5 5 7 7 7 7 7 7 8 8
3 3 5 5 5 7 7 7 9 9 9 9 8
3 5 5 5 7 7 7 9 9 9 9
3 3 5 5 7 7 7 7 9 9 9 9 9 3 3 3 7 7 7 7 9 9 9 1
3 3 3 3 3 3 3 7 7 7 9 1 4 1
3 3 3 3 3 3 3 2 2 4 2 1 1 1 1
3 3 3 3 3 3 3 3 2 2 2 2 2 2 1 1 1
3 3 3 3 3 3 3 3 3 2 2 1 1 1 3 3 3 3 3 3 2 2 2 2 2 4 4
3 3 3 3 3 3 3 3 3 3 3 2 2 2 2
3 3 3 3 3 3 3 3 3 3 2 2 2 4 4 4
3 3 3 3 3 3 3 3 3 3 3 3 4 2 2 4 4 4 4
3 3 3 3 3 3 3 3 3 4 2 2 2 2 3 3 3 3 3 3 3 3 4 4 3 4 4 2 2 2 2
3 3 3 3 3 3 3 3 3 3 3 3 4 4 4 2 2 2 4
J.Zupan, M.Novic, X.Li, J. Gasteiger, Anal. Chim. Acta, 1994, 292, 219-234
The merits of
unsupervised
learning
C3 © Gasteiger
• CASE was one of the roots of chemoinformatics
• various groups worked on it (Munk, Sasaki, Funatsu,
Steinrück)
• not much done recently
• no useful general purpose system available
• much time spent by chemists on structure elucidation
structure elucidation should be done more efficiently,
using all available information and using software
Computer-Assisted Structure Elucidation (CASE)
C3 © Gasteiger
Simulation of Infrared Spectra
N number of atoms
A atomic properties
B temperature or smoothing factor
rij interatomic distance
g r A A ei j
j i
N
i
NB r rij( )
( )
1
12
3D structure representation by radial distribution function
Predict the Infrared Spectrum from the structure of a compound
C3 © Gasteiger
Radialcode
(128 values)
IR spectrum
(128 absorbance values)
3D structure transformation
Training of a Counterpropagation (CPG) Network
C3 © Gasteiger
Simulation of IR Spectra
r = 0.965
0
0.4
0.8
500100015002000250030003500
cm-1
absorb
ance
simulation
experiment
Br
O O
J.Schuur, J.Gasteiger, Anal.Chim. 1997, 69, 2398-2405.
C3 © Gasteiger
Radialcode
(128 values)
IR spectrum
(128 absorbance values)
3D structure decoding
3D Structure Prediction from IR Spectrum
C3 © Gasteiger
IR Spectrum of Unknown Compound
0
0.1
0.2
0.3
0.4
500100015002000250030003500 cm-1
Abs.
C3 © Gasteiger
3D Structure Prediction from IR Spectra
NO
O
F
O
Correct prediction!
M.Hemmer, J.Gasteiger, Anal.Chim.Acta, 2000, 420, 145-154.
C3 © Gasteiger
• needs better data in reaction databases
• reach out to theoretical chemists
• put the results of quantum mechanical calculations
into databases
Chemical Reactivity
C3 © Gasteiger
Calculation of Chemical Effects
charge distribution
J. Gasteiger, M. Marsili, Tetrahedron 36, 3219 (1980)
inductive effect
J. Gasteiger, M. G. Hutchings, Tetrah. Lett. 24, 2541 (1983)
resonance effect
J. Gasteiger, H. Saller, Angew. Chem. Int. Ed. Engl. 24, 687 (1985)
polarizability effect
J. Gasteiger, M. G. Hutchings, J. Chem. Soc. Perkin 2, 559 (1984)
bond dissociation energy
J. Gasteiger, Comp. Chem. 2, 85 (1978)
C3 © Gasteiger
Calculation of Proton Affinities (PA)
• alkyl amines only (49 cpds)
PA(kJ/mol) = 205.6 + 2.82 ad,N
J. Gasteiger, M.G. Hutchings, J. Chem. Soc. Perkin 2, 1984, 559
• alkyl amines + heteroatom substituted alkyl amines (80 cpds)
PA(kJ/mol) = 1435.5 + 12.5 ad,N - 116.3 cr,N
J. Gasteiger, M.G. Hutchings, Tetrahedron Lett. 1983, 2541
N N H+ H
C3 © Gasteiger
A pKa Model of Aliphatic Carboxylic Acids
10.1989.111.002.127.1254.37 ,,2, aminoCOODOa IAQpK a ca
n = 1122, r2 = 0.81, s = 0.42, F = 809
A2D Q a
H 2 O + H 3 O +
H c
C
O
O
C C
O
O R1
R2
R3
C
R1
R2
R3
J.Zhang, T.Kleinoeder, J.Gasteiger, J.Chem.Inf.Model., 2006, 46, 2256-2266.
C3 © Gasteiger
• CASD was one of the roots of chemoinformatics
• many products can be traced back to CASD work
•MACCS, REACCS, Beilstein DB, ChemInform RX DB
• however CASD systems are not yet widely accepted
by organic chemists
• but it is still true:
“The amount of information to be processed and the decisions between
many alternatives suggests the use of computers in synthesis design.”
(H.Gelernter, 1973)
The design of organic syntheses should be done more
efficiently, using all available information, by using software
Computer-Assisted Organic Synthesis Design (CASD)
C3 © Gasteiger
• search for enzyme inhibitors M.Reitz, A.von Homeyer, J.Gasteiger,
J.Chem.Inf.Model., 2006, 46, 2324-2332
• search for similar enzymes O.Sacher, M.Reitz, J.Gasteiger,
J.Chem.Inf.Model., 2009, 49, 1525-1534
X.Hu, A,Yan, T.Tan, O.Sacher, J.Gasteiger
J.Chem.Inf.Model., 2010, 50, 1089-1100
• discover essential pathways of diseases G.Kastenmüller, J.Gasteiger, H.W.Mewes,
Bioinformatics, 2008, 24, i56-i62
G.Kastenmüller, M.E.Schenk, J.Gasteiger, H.W.Mewes,
Genome Biology, 2009,10, R28
Application of BioPath.Database
http://www.molecular-networks.com/databases/biopath
C3 © Gasteiger 56
Uncovering Metabolic Pathways to Phenotypic Traits
Attribute selection
+ Pathway ranking
PEDANT
Annotated
genomes
BioPath
database
Phenotypic
traits
Metabolic
reconstruction
Pathway profile
Relevant pathways
C3 © Gasteiger 57
Periodontal Disease
Human oral flora:
> 700 species
PEDANT:
15 fully sequenced oral genomes
(incl. 4 of 6 periodontal pathogens)
BioPath
68 global pathways
306 smaller pathways
C3 © Gasteiger 58
Relevant Pathways for Phenotype
Periodontal Disease Causing Glutamate fermentation
Biosynthesis of L-proline
Biosynthesis of 5-formimino-THF
Conversion of l-glutamate to L-proline
Conversion of l-glutamate to l-ornithine
Degradation of l-histidine to l-glutamate
G. Kastenmüller, M.E. Schenk, J. Gasteiger, H.W.Mewes,
Genome Biology, 2009,10, R28
C3 © Gasteiger 59
Relevant Pathways for Phenotype
Periodontal Disease Causing Glutamate fermentation
Biosynthesis of L-proline
Biosynthesis of 5-formimino-THF
Conversion of l-glutamate to L-proline
Conversion of l-glutamate to l-ornithine
Degradation of l-histidine to l-glutamate
Several of these pathways produce NH3
Cytotoxic NH3 plays a major role in periodontal disease
G. Kastenmüller, M.E. Schenk, J. Gasteiger, H.W.Mewes,
Genome Biology, 2009,10, R28
C3 © Gasteiger
• meeting the challenges posed by REACH, drug design
and Cosmetics Directive
• projects funded by the European Union, Innovative
Medicine Initiative and Cosmetics Europe
• eTOX (11 academic groups, 6 SMEs, 13 pharma companies)
• COSMOS (5 academic groups, 3 public institutions, 5 SMEs,
2 cosmetics companies)
Toxicity and Risk Assessment
C3 © Gasteiger
• Chemotypes
> structure fragments with physico-
chemical properties
• ToxPrint
> alerting chemotypes for genotoxic
carcinogens
• Publicly available
• www.chemotyper.org
• www.toxprint.org
Search Structures for Toxic Alerts
CSRML: C.Yang et al., J. Chem. Inf. Model., 2015, 55, 510-528
C3 © Gasteiger
QSPR Modeling of Diverse Material Properties
• Nanomaterials
• Catalysts
• Biomaterials
• Polymers
• Ionic liquids
• Supercritical carbon dioxide
• Ceramics
T.Le, V.C.Epa, F.R.Burden, D.A.Winkler, Chem.Reviews, 2012, 112, 2889-2919.
David A. Winkler, CSIRO, Melbourne, Australia
C3 © Gasteiger
• Chemoinformatics has become indispensable in
drug design
• has many potential applications in all areas of chemistry
and other scientific fields
• talk to colleagues about their problems
• build models with their data – and knowledge
• Society expects a lot from chemoinformatics
(Nobel Prize in Chemistry 2013: Karplus, Levitt, Warshel)
Conclusions
C3 © Gasteiger
Teaching
• several textbooks have been published
• Gillet+Leach, Gasteiger+Engel, Bajorath, Wild
• define curriculum in chemoinformatics
• various universities already teach chemoinformatics
• the number is growing
• integrate chemoinformatics into regular
chemistry curricula
C3 © Gasteiger
• Many excellent and dedicated coworkers at
Computer-Chemie-Centrum, University of Erlangen-Nuremberg,
Germany
http://www2.chemie.uni-erlangen.de
• Excellent coworkers at
http://www.molecular-networks.com
• Dr.Ulrike Gasteiger
Acknowledgements