solved and unsolved problems in chemoinformaticsapplications of chemoinformatics: drug design...

72
C 3 © Gasteiger Solved and Unsolved Problems in Chemoinformatics Johann Gasteiger Computer-Chemie-Centrum University of Erlangen-Nürnberg D-91052 Erlangen, Germany [email protected]

Upload: others

Post on 27-Mar-2020

10 views

Category:

Documents


3 download

TRANSCRIPT

C3 © Gasteiger

Solved and Unsolved Problems in

Chemoinformatics

Johann Gasteiger Computer-Chemie-Centrum

University of Erlangen-Nürnberg

D-91052 Erlangen, Germany

[email protected]

C3 © Gasteiger

Overview

• objectives of lecture

• achievements

• challenges

• unsolved problems

• summary

C3 © Gasteiger

Objectives

• many problems have been solved

let us be proud!

•there is still a lot to be done

chemoinformatics is a field of its own,

is attractive for students

C3 © Gasteiger

• Structure activity relationships 1963 Hantsch & Fujita

• structure representation

1965, Morgan

• structure elucidation

1965, Sasaki, Munk, DENDRAL

• synthesis design

1970, Corey & Wipke, Ugi+Gasteiger, Hendrickson

• molecular modeling

1970, Langridge, Marshall

• data analysis / chemometrics

1970, Kowalski, Wold

Chemoinformatics – An Old Discipline

C3 © Gasteiger

• access to chemical information

• learning from chemical information

• applications

Achievements

C3 © Gasteiger

Databases

• Chemical Abstracts Service (1975)

• Cambridge Structure Database (1984)

• Beilstein (1990)

• Gmelin (1990)

• ChemInformRX (1991)

• SpecInfo (1991)

• PubChem (2004)

• etc.

C3 © Gasteiger

Number of Compounds in Chemistry

compounds published in CAS

0

10

20

30

40

50

1965 1970 1975 1980 1985 1990 1995 2000 2005

year

co

mp

ou

nd

s (

millio

ns

)

102 million

compounds,

66 million

sequences

(June 2015)

x

C3 © Gasteiger

C3 © Gasteiger

Search for Cancerostatic Drugs

similar substrates protein/substrate complex

C3 © Gasteiger

Chemical Structures

• computers have learnt the language of a chemist

communicating by structure diagrams

•molecules are stored with atomic resolution providing

access to each atom and bond

enabling substructure search

•the complex structures of molecules can be visualized

new insights can be gained

C3 © Gasteiger

• databases have strongly contributed to the

progress in chemistry and related fields

• without databases modern research in

chemistry would be inconceivable

Summary

C3 © Gasteiger

• learning from data

• QSAR/QSPR

• representation of chemical structures

Learning from Chemical Information

C3 © Gasteiger

Problem: Not Enough Information

102,000,000 chemical compounds

1,000,000 3D structures in

Cambridge Crystallographic Data File

we only have data on the 3D structure for less than

1% of the known compounds

C3 © Gasteiger

Problem: Not Enough Information

102,000,000 chemical compounds

1,000,000 3D structures in

Cambridge Crystallographic Data File

we only have data on the 3D structure for less than

1% of the known compounds

can we learn the rules from the known 3D structures

C3 © Gasteiger

Rule-Based Learning: CORINA

*

connection table & stereo descriptors

initialization of

internal coordinates

rings:

ring perception

ring template search

ring assembly

3D coordinates

acyclic systems:

removal of steric crowding

generates 3D coordinates for >99.5 % of all organic compounds

J.Sadowski, J.Gasteiger, Chem.Reviews, 1993, 93, 2567-2581.

http://www.molecular-networks.com/products/corina/

C3 © Gasteiger

Data-Based Learning

Structure-Property Relationships (QSPR, QSAR)

molecular

structure property

structure

descriptors

//

representation model building

C3 © Gasteiger

Representation of Chemical Structures

• topological indices

• fragment codes

• fingerprints

• ….

Several thousands of different types of

chemical descriptors have been developed

R.Todeschini, V.Consonni, Molecular Descriptors in

Chemoinformatics, 2 volumes, Wiley-VCH, 2009

C3 © Gasteiger

Representation of Chemical Structures

3D model

molecular

surface

constitution

N H

N O

O

N H 3 +

Of Molecules and Humans

J. Gasteiger,

J. Med. Chem., 2006, 55, 6429 - 6434

C3 © Gasteiger

• physical, e.g.

– aqueous solubility

– 13C NMR shifts

• chemical, e.g.

– acidity

• biological, e.g.

– toxicity

Prediction of Properties

C3 © Gasteiger

Applications of Chemoinformatics: Drug Design

Chemoinformatics has become an integral part of the

drug design process

• lead discovery

• virtual screening

• pharmacophore searching

• lead optimization

• QSAR

• molecular docking

• prediction of ADME properties

• solubility, adsorption, distribution, metabolism, excretion….

C3 © Gasteiger

Handbook of

Chemoinformatics

J. Gasteiger (Editor)

65 authors

73 contributions

4 volumes

1900 pages

Wiley-VCH, Weinheim

(August 2003)

From Data to Knowledge

C3 © Gasteiger

Chemoinformatics

- A Textbook -

J. Gasteiger, T. Engel

(Editors)

650 pages

Wiley-VCH, Weinheim

(September 2003)

C3 © Gasteiger

• essense of chemistry

• environmental impact of chemicals

• understanding chemistry

• understanding biological systems

• human health

applications in all fields of chemistry and related

scientific disciplines

Challenges

C3 © Gasteiger

Essense of Chemistry

Chemical Industry is not selling

compounds

but is selling

properties

i.e., compounds with desired properties

(drugs, paints, fibers, plastics, pesticides, catalysts,

ceramics, etc.)

C3 © Gasteiger

What structure do I need for a certain property?

structure-activity relationships

How do I make this structure?

synthesis design

What is the product of my reaction?

reaction prediction

structure elucidation

In all those areas the use of chemoinformatics could help!

Fundamental Questions in Chemistry

C3 © Gasteiger

REACH – Registration, Evaluation, Authorization and

restriction of Chemicals

• for those chemicals used with more than 10 tons/year manufactured or

imported into the European Union a Chemical Safety Report is needed

• law since June 1, 2007; registration until Dec 1, 2013

• applies to about 35,000 chemicals

• testing on harmful effects on human health or environment, determination of

persistence, bioaccumulation and toxicity

• testing is time-consuming, expensive and might need many animals

Risk Assessment of Chemicals

Use chemoinformatics methods for ranking of chemicals

C3 © Gasteiger

• for chemicals used in cosmetics products

• no compounds tested on animals are allowed in cosmetics

in Europe since 2009.

• all animal tested cosmetics will eventually be banned on the

European market.

Cosmetics Directive

Use chemoinformatics methods for developing

alternatives to animal testing

C3 © Gasteiger

• access to chemical information

• learning from chemical information

• applications

Problems Still to be Solved

C3 © Gasteiger

• input of chemical structures (hand writing, voice)

• representation of neglected compounds (boranes, organo-

metallic structures (ferrocene, etc.), Markush structures (patents),

polymers)

• text mining (optical character recognition)

• publishing chemical information (3D structures, spectra)

• publishing and searching on the internet

Improved Access to Chemical Information

C3 © Gasteiger

• on compounds

• store all available information (all properties)

• store all spectra

• on reactions

• give the entire stoichiometry of a reaction

• store all reaction conditions (solvent, temperature,

reaction time)

• kinetic data

Neede for Better Databases

C3 © Gasteiger

From Experiment to Reaction Database (Presently)

experiment

lab journal report

manuscript

patent publication

secondary literature abstract

input into reaction database

reaction database

printed reaction scheme

information

producer

many breaks

in

information

processing

information

consumer

written on paper

electronically

C3 © Gasteiger

From Experiment to Reaction Database (Future: Electronic Lab Notebooks)

experiment

e lab notebook report

manuscript

patent publication

secondary literature abstract

input into reaction database

reaction database

printed reaction scheme

information

producer

a direct

information

flow

information

consumer

written on paper

electronically

X

C3 © Gasteiger

• Note: standard QSAR studies have difficulties to get

published if they do not also

- provide new experimental results or

- increase our knowledge and insight

• represent structures by descriptors that can be interpreted

• combine substructures with physicochemical effects

• use data analysis methods that are not a black box

• a forest of decision trees gives more exact predictions but cannot be

interpreted; a single decision tree can be interpreted!

• combine a QSAR model with e.g. docking studies

Learning from Chemical Information

From models to interpretation

C3 © Gasteiger

• all fields of chemistry

• drug design

• analytical chemistry

• organic chemistry

• biochemistry

• toxicology

• material science

Applications

C3 © Gasteiger

• conformational flexibility of drugs and proteins

• docking into proteins

• no consensus scoring (science cannot be predicted by voting)

• try to model the physicochemistry of the process

• protein-protein and protein-DNA interactions

• prediction of ADME-Tox properties

• model the various organs of a human

Drug Design

C3 © Gasteiger

Analytical Chemistry

• origin of olive oils

• simulation of infrared spectra

• 3D structure from infrared spectrum

C3 © Gasteiger

Italian Olive Oils

9 regions

572 olive oil samples

contents of 8 fatty acids:

palmitic acid linoleic acid

palmitoleic acid arachidic acid

stearic acid linoleic acid

oleic acid eicosenoic acid

1

2

3

4

5

6

7 8

9

M.Forina, et al. Ann.Chim.(Rome), 1982, 72, 127 and 144.

C3 © Gasteiger

Self-Organizing Map of Olive Oils

6 6 6 6 6 5 5 5 5 5 5 8 8 8 8 8 8 8

6 6 6 6 5 5 5 5 5 7 8 8 8 8

6 6 6 6 5 5 5 5 5 5 7 8 8 8 8 8 6 6 6 6 5 5 5 5 5 5 7 7 7 7 8 8 8 8

6 6 5 5 5 5 7 7 7 7 7 7 8 8

3 3 5 5 5 7 7 7 9 9 9 9 8

3 5 5 5 7 7 7 9 9 9 9

3 3 5 5 7 7 7 7 9 9 9 9 9 3 3 3 7 7 7 7 9 9 9 1

3 3 3 3 3 3 3 7 7 7 9 1 4 1

3 3 3 3 3 3 3 2 2 4 2 1 1 1 1

3 3 3 3 3 3 3 3 2 2 2 2 2 2 1 1 1

3 3 3 3 3 3 3 3 3 2 2 1 1 1 3 3 3 3 3 3 2 2 2 2 2 4 4

3 3 3 3 3 3 3 3 3 3 3 2 2 2 2

3 3 3 3 3 3 3 3 3 3 2 2 2 4 4 4

3 3 3 3 3 3 3 3 3 3 3 3 4 2 2 4 4 4 4

3 3 3 3 3 3 3 3 3 4 2 2 2 2 3 3 3 3 3 3 3 3 4 4 3 4 4 2 2 2 2

3 3 3 3 3 3 3 3 3 3 3 3 4 4 4 2 2 2 4

572 samples

9 regions

8 fatty acids

250 training

322 test

10 prediction errors

J.Zupan, M.Novic, X.Li, J. Gasteiger, Anal. Chim. Acta, 1994, 292, 219-234

C3 © Gasteiger

Self-Organizing Map of Olive Oils

1

2

3

4

5

6

7 8

9

6 6 6 6 6 5 5 5 5 5 5 8 8 8 8 8 8 8

6 6 6 6 5 5 5 5 5 7 8 8 8 8

6 6 6 6 5 5 5 5 5 5 7 8 8 8 8 8 6 6 6 6 5 5 5 5 5 5 7 7 7 7 8 8 8 8

6 6 5 5 5 5 7 7 7 7 7 7 8 8

3 3 5 5 5 7 7 7 9 9 9 9 8

3 5 5 5 7 7 7 9 9 9 9

3 3 5 5 7 7 7 7 9 9 9 9 9 3 3 3 7 7 7 7 9 9 9 1

3 3 3 3 3 3 3 7 7 7 9 1 4 1

3 3 3 3 3 3 3 2 2 4 2 1 1 1 1

3 3 3 3 3 3 3 3 2 2 2 2 2 2 1 1 1

3 3 3 3 3 3 3 3 3 2 2 1 1 1 3 3 3 3 3 3 2 2 2 2 2 4 4

3 3 3 3 3 3 3 3 3 3 3 2 2 2 2

3 3 3 3 3 3 3 3 3 3 2 2 2 4 4 4

3 3 3 3 3 3 3 3 3 3 3 3 4 2 2 4 4 4 4

3 3 3 3 3 3 3 3 3 4 2 2 2 2 3 3 3 3 3 3 3 3 4 4 3 4 4 2 2 2 2

3 3 3 3 3 3 3 3 3 3 3 3 4 4 4 2 2 2 4

J.Zupan, M.Novic, X.Li, J. Gasteiger, Anal. Chim. Acta, 1994, 292, 219-234

The merits of

unsupervised

learning

C3 © Gasteiger

• CASE was one of the roots of chemoinformatics

• various groups worked on it (Munk, Sasaki, Funatsu,

Steinrück)

• not much done recently

• no useful general purpose system available

• much time spent by chemists on structure elucidation

structure elucidation should be done more efficiently,

using all available information and using software

Computer-Assisted Structure Elucidation (CASE)

C3 © Gasteiger

Simulation of Infrared Spectra

N number of atoms

A atomic properties

B temperature or smoothing factor

rij interatomic distance

g r A A ei j

j i

N

i

NB r rij( )

( )

1

12

3D structure representation by radial distribution function

Predict the Infrared Spectrum from the structure of a compound

C3 © Gasteiger

Radialcode

(128 values)

IR spectrum

(128 absorbance values)

3D structure transformation

Training of a Counterpropagation (CPG) Network

C3 © Gasteiger

Simulation of IR Spectra

r = 0.965

0

0.4

0.8

500100015002000250030003500

cm-1

absorb

ance

simulation

experiment

Br

O O

J.Schuur, J.Gasteiger, Anal.Chim. 1997, 69, 2398-2405.

C3 © Gasteiger

Radialcode

(128 values)

IR spectrum

(128 absorbance values)

3D structure decoding

3D Structure Prediction from IR Spectrum

C3 © Gasteiger

IR Spectrum of Unknown Compound

0

0.1

0.2

0.3

0.4

500100015002000250030003500 cm-1

Abs.

C3 © Gasteiger

3D Structure Prediction from IR Spectra

NO

O

F

O

Correct prediction!

M.Hemmer, J.Gasteiger, Anal.Chim.Acta, 2000, 420, 145-154.

C3 © Gasteiger

Organic Chemistry

• chemical reactivity

• organic synthesis design

C3 © Gasteiger

• needs better data in reaction databases

• reach out to theoretical chemists

• put the results of quantum mechanical calculations

into databases

Chemical Reactivity

C3 © Gasteiger

Calculation of Chemical Effects

charge distribution

J. Gasteiger, M. Marsili, Tetrahedron 36, 3219 (1980)

inductive effect

J. Gasteiger, M. G. Hutchings, Tetrah. Lett. 24, 2541 (1983)

resonance effect

J. Gasteiger, H. Saller, Angew. Chem. Int. Ed. Engl. 24, 687 (1985)

polarizability effect

J. Gasteiger, M. G. Hutchings, J. Chem. Soc. Perkin 2, 559 (1984)

bond dissociation energy

J. Gasteiger, Comp. Chem. 2, 85 (1978)

C3 © Gasteiger

Calculation of Proton Affinities (PA)

• alkyl amines only (49 cpds)

PA(kJ/mol) = 205.6 + 2.82 ad,N

J. Gasteiger, M.G. Hutchings, J. Chem. Soc. Perkin 2, 1984, 559

• alkyl amines + heteroatom substituted alkyl amines (80 cpds)

PA(kJ/mol) = 1435.5 + 12.5 ad,N - 116.3 cr,N

J. Gasteiger, M.G. Hutchings, Tetrahedron Lett. 1983, 2541

N N H+ H

C3 © Gasteiger

A pKa Model of Aliphatic Carboxylic Acids

10.1989.111.002.127.1254.37 ,,2, aminoCOODOa IAQpK a ca

n = 1122, r2 = 0.81, s = 0.42, F = 809

A2D Q a

H 2 O + H 3 O +

H c

C

O

O

C C

O

O R1

R2

R3

C

R1

R2

R3

J.Zhang, T.Kleinoeder, J.Gasteiger, J.Chem.Inf.Model., 2006, 46, 2256-2266.

C3 © Gasteiger

• CASD was one of the roots of chemoinformatics

• many products can be traced back to CASD work

•MACCS, REACCS, Beilstein DB, ChemInform RX DB

• however CASD systems are not yet widely accepted

by organic chemists

• but it is still true:

“The amount of information to be processed and the decisions between

many alternatives suggests the use of computers in synthesis design.”

(H.Gelernter, 1973)

The design of organic syntheses should be done more

efficiently, using all available information, by using software

Computer-Assisted Organic Synthesis Design (CASD)

C3 © Gasteiger

Biochemistry

and

Bioinformatics

• Study of diseases

C3 © Gasteiger

C3 © Gasteiger

• search for enzyme inhibitors M.Reitz, A.von Homeyer, J.Gasteiger,

J.Chem.Inf.Model., 2006, 46, 2324-2332

• search for similar enzymes O.Sacher, M.Reitz, J.Gasteiger,

J.Chem.Inf.Model., 2009, 49, 1525-1534

X.Hu, A,Yan, T.Tan, O.Sacher, J.Gasteiger

J.Chem.Inf.Model., 2010, 50, 1089-1100

• discover essential pathways of diseases G.Kastenmüller, J.Gasteiger, H.W.Mewes,

Bioinformatics, 2008, 24, i56-i62

G.Kastenmüller, M.E.Schenk, J.Gasteiger, H.W.Mewes,

Genome Biology, 2009,10, R28

Application of BioPath.Database

http://www.molecular-networks.com/databases/biopath

C3 © Gasteiger 56

Uncovering Metabolic Pathways to Phenotypic Traits

Attribute selection

+ Pathway ranking

PEDANT

Annotated

genomes

BioPath

database

Phenotypic

traits

Metabolic

reconstruction

Pathway profile

Relevant pathways

C3 © Gasteiger 57

Periodontal Disease

Human oral flora:

> 700 species

PEDANT:

15 fully sequenced oral genomes

(incl. 4 of 6 periodontal pathogens)

BioPath

68 global pathways

306 smaller pathways

C3 © Gasteiger 58

Relevant Pathways for Phenotype

Periodontal Disease Causing Glutamate fermentation

Biosynthesis of L-proline

Biosynthesis of 5-formimino-THF

Conversion of l-glutamate to L-proline

Conversion of l-glutamate to l-ornithine

Degradation of l-histidine to l-glutamate

G. Kastenmüller, M.E. Schenk, J. Gasteiger, H.W.Mewes,

Genome Biology, 2009,10, R28

C3 © Gasteiger 59

Relevant Pathways for Phenotype

Periodontal Disease Causing Glutamate fermentation

Biosynthesis of L-proline

Biosynthesis of 5-formimino-THF

Conversion of l-glutamate to L-proline

Conversion of l-glutamate to l-ornithine

Degradation of l-histidine to l-glutamate

Several of these pathways produce NH3

Cytotoxic NH3 plays a major role in periodontal disease

G. Kastenmüller, M.E. Schenk, J. Gasteiger, H.W.Mewes,

Genome Biology, 2009,10, R28

C3 © Gasteiger

gene drug protein lead

Bioinformatics Chemoinformatics

C3 © Gasteiger

• meeting the challenges posed by REACH, drug design

and Cosmetics Directive

• projects funded by the European Union, Innovative

Medicine Initiative and Cosmetics Europe

• eTOX (11 academic groups, 6 SMEs, 13 pharma companies)

• COSMOS (5 academic groups, 3 public institutions, 5 SMEs,

2 cosmetics companies)

Toxicity and Risk Assessment

C3 © Gasteiger

• Chemotypes

> structure fragments with physico-

chemical properties

• ToxPrint

> alerting chemotypes for genotoxic

carcinogens

• Publicly available

• www.chemotyper.org

• www.toxprint.org

Search Structures for Toxic Alerts

CSRML: C.Yang et al., J. Chem. Inf. Model., 2015, 55, 510-528

C3 © Gasteiger

Material Sciences

• Diverse material properties

C3 © Gasteiger

QSPR Modeling of Diverse Material Properties

• Nanomaterials

• Catalysts

• Biomaterials

• Polymers

• Ionic liquids

• Supercritical carbon dioxide

• Ceramics

T.Le, V.C.Epa, F.R.Burden, D.A.Winkler, Chem.Reviews, 2012, 112, 2889-2919.

David A. Winkler, CSIRO, Melbourne, Australia

C3 © Gasteiger

• Chemoinformatics has become indispensable in

drug design

• has many potential applications in all areas of chemistry

and other scientific fields

• talk to colleagues about their problems

• build models with their data – and knowledge

• Society expects a lot from chemoinformatics

(Nobel Prize in Chemistry 2013: Karplus, Levitt, Warshel)

Conclusions

C3 © Gasteiger

Teaching

• several textbooks have been published

• Gillet+Leach, Gasteiger+Engel, Bajorath, Wild

• define curriculum in chemoinformatics

• various universities already teach chemoinformatics

• the number is growing

• integrate chemoinformatics into regular

chemistry curricula

C3 © Gasteiger

• Many excellent and dedicated coworkers at

Computer-Chemie-Centrum, University of Erlangen-Nuremberg,

Germany

http://www2.chemie.uni-erlangen.de

• Excellent coworkers at

http://www.molecular-networks.com

• Dr.Ulrike Gasteiger

Acknowledgements

C3 © Gasteiger

C3 © Gasteiger 69

Atorvastatin

N

F

O

NH

O

O

OH

CORINA 3D structure

graphic

file

C3 © Gasteiger

C3 © Gasteiger

C3 © Gasteiger