In the name of GOD Basic Steps of QSAR/QSPR Investigations M.H. FATEMI Mazandaran University [email protected]

Post on 19-Dec-2015

In the name of GOD

Basic Steps of QSAR/QSPR Investigations

M.H. FATEMI

Mazandaran University [email protected]

QSAR

• Quantitative Structure-Activity Relationships

• Can one predict activity (or properties, in QSPR) simply on the basis of knowledge of the structure of the molecule?

• In other words, if one systematically changes a component, will it have a systematic effect on the activity?

What is QSAR?

A QSAR is a mathematical relationship between a biological activity of a molecular system and its geometric and chemical characteristics.

QSAR attempts to find a consistent relationship between biological activity and molecular properties, so that these “rules” can be used to evaluate the activity of new compounds.

Why QSAR?

The number of compounds that would have to be synthesized in order to place 10 different groups in 4 positions of the benzene ring is $10^4$.

Solution: synthesize a small number of compounds and from their data derive rules to predict the biological activity of other compounds.

QSXR

X = A: Activity

X = P: Property

X = R: Retention

$X = b_0 + b_1 D_1 + b_2 D_2 + \dots + b_n D_n$

$b_i$: regression coefficients

$D_i$: descriptors

n: number of descriptors
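
As a minimal sketch of how the coefficients b0 ... bn of such a linear model might be estimated, the following uses ordinary least squares in NumPy. The function names `fit_mlr` and `predict` and the small data set are made up for illustration and are not taken from the slides.

```python
import numpy as np

def fit_mlr(D, x):
    """Fit x = b0 + b1*D1 + ... + bn*Dn by ordinary least squares."""
    D = np.asarray(D, dtype=float)
    A = np.column_stack([np.ones(len(D)), D])      # intercept column for b0
    coeffs, *_ = np.linalg.lstsq(A, x, rcond=None)
    return coeffs                                  # [b0, b1, ..., bn]

def predict(coeffs, D):
    """Evaluate the fitted linear model for new descriptor rows."""
    return coeffs[0] + np.asarray(D, dtype=float) @ coeffs[1:]

# Made-up data: 6 compounds, 2 descriptors, exactly linear response
D = np.array([[1.0, 0.5], [2.0, 1.5], [3.0, 0.2],
              [4.0, 2.2], [5.0, 1.1], [6.0, 3.0]])
x = 0.5 + 1.2 * D[:, 0] - 0.7 * D[:, 1]

b = fit_mlr(D, x)
print(np.round(b, 3))
```

Because the made-up response is exactly linear, the fit recovers the generating coefficients; real activity data would leave residuals.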

History

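Early Examples

• Hammett (1930s-1940s)

Ionization of benzoic acid and of its para- and meta-substituted (X) derivatives:

C6H5COOH ⇌ C6H5COO⁻ + H⁺ (equilibrium constant K0)

p-X-C6H4COOH ⇌ p-X-C6H4COO⁻ + H⁺ (Kp)

m-X-C6H4COOH ⇌ m-X-C6H4COO⁻ + H⁺ (Km)

$\sigma_{para} = \log_{10}(K_p / K_0)$

$\sigma_{meta} = \log_{10}(K_m / K_0)$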

Hammett (cont.)

• Now suppose we have a related series:

X-C6H4CH2COOH ⇌ X-C6H4CH2COO⁻ + H⁺ (K'x)

$\log_{10}(K'_X / K'_0) = \rho\sigma$

σ reflects sensitivity to the substituent; ρ reflects sensitivity to the different system.
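
The Hammett relations can be sketched numerically as follows, assuming the usual form log10(K'x/K'0) = ρσ. The pKa values, the ρ value and the reference constant are illustrative assumptions, not data from the slides.

```python
import math

def hammett_sigma(K_x, K_0):
    """Substituent constant: sigma = log10(K_x / K_0)."""
    return math.log10(K_x / K_0)

def predict_K(K0_related, rho, sigma):
    """Related series: log10(K'_x / K'_0) = rho * sigma, solved for K'_x."""
    return K0_related * 10 ** (rho * sigma)

# Assumed, illustrative pKa values for the reference and a substituted acid
pKa_ref, pKa_sub = 4.20, 3.44
sigma = hammett_sigma(10 ** -pKa_sub, 10 ** -pKa_ref)
print(round(sigma, 2))

# Assumed reaction constant and reference K for the related series
rho = 0.5
K0_prime = 5.0e-5
K_pred = predict_K(K0_prime, rho, sigma)
print(K_pred)
```

A positive σ (a stronger acid than the reference) with a positive ρ gives a predicted K'x larger than K'0.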

Free-Wilson Analysis

• $\log(1/C) = \sum a_i + \mu$

where C = predicted activity,

$a_i$ = contribution per group, and $\mu$ = activity of the reference compound

Free-Wilson example

Log 1/C = -0.30 [m-F] + 0.21 [m-Cl] + 0.43 [m-Br] + 0.58 [m-I] + 0.45 [m-Me] + 0.34 [p-F] + 0.77 [p-Cl] + 1.02 [p-Br] + 1.43 [p-I] + 1.26 [p-Me] + 7.82

[Figure: parent structure of the analogs (labels N, Br, X, Y, HCl); activity of analogs]

Problems: at least two substituent positions are necessary, and the model can only predict new combinations of the substituents used in the analysis.
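
The example equation can be evaluated as a simple lookup of group contributions. This is a sketch that assumes an unsubstituted position (H) contributes zero, so that the H,H compound has the reference value 7.82; the dictionary and function names are hypothetical.

```python
# Group contributions copied from the example equation
META = {"H": 0.0, "F": -0.30, "Cl": 0.21, "Br": 0.43, "I": 0.58, "Me": 0.45}
PARA = {"H": 0.0, "F": 0.34, "Cl": 0.77, "Br": 1.02, "I": 1.43, "Me": 1.26}
MU = 7.82  # constant term: activity of the reference compound

def free_wilson(meta="H", para="H"):
    """Predicted log 1/C as the sum of group contributions plus the reference."""
    return META[meta] + PARA[para] + MU

print(round(free_wilson("Cl", "Br"), 2))  # 9.05
print(free_wilson())                      # 7.82
```

A substituent that is missing from the tables (for example a nitro group) raises a `KeyError`, which mirrors the limitation that only combinations of the analyzed substituents can be predicted.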

Hansch Analysis

$\log(1/C) = a\pi + b\sigma + c$

where $\pi(X) = \log P_{RX} - \log P_{RH}$

and log P is the logarithm of the octanol/water partition coefficient.

This is also a linear free energy relation.
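
A small numerical sketch of the Hansch approach, assuming the usual form log 1/C = aπ + bσ + c with π(X) = log P(RX) - log P(RH). The log P inputs, the σ value and the coefficients a, b, c are assumed for illustration only.

```python
def hansch_pi(logP_RX, logP_RH):
    """Hydrophobic substituent constant: pi(X) = log P(RX) - log P(RH)."""
    return logP_RX - logP_RH

def hansch_activity(pi, sigma, a, b, c):
    """log 1/C = a*pi + b*sigma + c."""
    return a * pi + b * sigma + c

# Assumed log P values for a substituted and an unsubstituted parent compound
pi_x = hansch_pi(2.84, 2.13)
print(round(pi_x, 2))  # 0.71

# Hypothetical regression coefficients and electronic constant
log_inv_C = hansch_activity(pi_x, sigma=0.2, a=1.0, b=0.5, c=3.0)
print(round(log_inv_C, 2))  # 3.81
```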

Applications of QSAR

• 1-Drug design

• 2-Prediction of Chemical toxicity

• 3-Prediction of environmental activity

• 4-Prediction of molecular properties

• 5-Investigation of retention mechanism

Steps in QSPR/QSAR

QSAR STEPS

Structure Entry & Molecular Modeling → Descriptor Generation → Feature Selection → Construct Model (MLRA or CNN) → Model Validation

Data set selection

• 1-Structural similarity of studied molecules

• 2-Data collected under the same conditions

• 3-Data set should be as large as possible


INTRODUCTION to Molecular Descriptors

• Molecular descriptors are numerical values that characterize properties of molecules

• Molecular descriptors encode structural features of molecules as numerical values

• They vary in the complexity of the encoded information and in compute time

• Examples:

– Physicochemical properties (empirical)

– Values from algorithms, such as 2D fingerprints

Classical Classification of Molecular Descriptors

[Figure: 2-D structural formula of an example molecule]

• Constitutional, Topological: 2-D structural formula

• Physicochemical

• Geometrical: 3-D shape and structure

• Quantum Chemical

• Hybrid descriptors

Topological Indexes: Examples

• Wiener Index

– Counts the number of bonds between pairs of atoms and sums the distances between all pairs

• Molecular Connectivity Indexes

– Randić branching index

  • Defines the “degree” of an atom as the number of adjacent non-hydrogen atoms

  • The bond connectivity value is the reciprocal of the square root of the product of the degrees of the two atoms in the bond

  • The branching index is the sum of the bond connectivities over all bonds in the molecule

– Chi indexes: introduce valence values to encode sigma, pi, and lone pair electrons
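
The two index definitions can be sketched on hydrogen-suppressed molecular graphs stored as adjacency lists. The graph encoding and the function names are illustrative choices, not part of the slides.

```python
from collections import deque
from math import sqrt

def wiener_index(adj):
    """Sum of shortest-path bond counts over all pairs of atoms."""
    total = 0
    for start in adj:
        dist = {start: 0}
        queue = deque([start])
        while queue:                      # breadth-first search from 'start'
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total += sum(dist.values())
    return total // 2                     # every pair was counted twice

def randic_index(adj):
    """Sum over bonds of 1 / sqrt(degree(u) * degree(v))."""
    return sum(1 / sqrt(len(adj[u]) * len(adj[v]))
               for u in adj for v in adj[u] if u < v)

# Carbon skeletons only (hydrogens suppressed)
n_butane = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
isobutane = {0: [1], 1: [0, 2, 3], 2: [1], 3: [1]}

print(wiener_index(n_butane), wiener_index(isobutane))   # 10 9
print(round(randic_index(n_butane), 3), round(randic_index(isobutane), 3))  # 1.914 1.732
```

Both indexes are smaller for the branched isomer, which is why they are used to encode branching.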

Electronic descriptors

• Electronic interactions play a very important role in controlling molecular properties.

• Electronic descriptors are calculated to encode aspects of the structures that are related to the electrons

• Electronic interaction is a function of charge distribution on a molecule

Physicochemical Properties Used in this QSAR

1. Liquid solubility Sw,L in mg/L and mmol/m3

2. Octanol-water partition coefficient Kow

3. Liquid Vapor Pressure Pv,L in Pa

4. Henry’s Law constant Hc in Pa∙m3/mole

5. Boiling point


Feature Selection

• E.g. comparing faces first requires the identification of key features.

• How do we identify these?

• The same applies to molecules.

Objective feature selection

• After descriptors have been calculated for each compound, this set must be reduced to a set of descriptors that is as information-rich but as small as possible

1- Deleting constant or near-constant descriptors

2- Pair correlation cut-off selection

3- Cluster analysis

4- Principal component analysis

5- K correlation analysis
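
The first two steps of objective feature selection could look like the sketch below. The thresholds `min_std` and `max_corr` and the rule of keeping the first descriptor of a correlated pair are assumptions; in practice the member of a pair that correlates better with the response is often kept instead.

```python
import numpy as np

def objective_filter(X, names, min_std=1e-6, max_corr=0.90):
    """Drop (near-)constant descriptors, then one of each highly correlated pair."""
    X = np.asarray(X, dtype=float)
    varying = [j for j in range(X.shape[1]) if X[:, j].std() > min_std]
    selected = []
    for j in varying:
        # Keep j only if it is not strongly correlated with a descriptor already kept
        if all(abs(np.corrcoef(X[:, j], X[:, k])[0, 1]) <= max_corr
               for k in selected):
            selected.append(j)
    return [names[j] for j in selected]

# Made-up descriptor table: b duplicates a, c is constant, d is independent
a = np.arange(10, dtype=float)
X = np.column_stack([a,
                     2 * a + 1,
                     np.ones(10),
                     [3, 1, 4, 1, 5, 9, 2, 6, 5, 3]])
kept = objective_filter(X, ["a", "b", "c", "d"])
print(kept)  # ['a', 'd']
```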

Descriptive Statistics

Descriptor           N    Minimum  Maximum  Mean      Std. Deviation
homo                 55   .01      9.44     .6524     1.66861
lumo                 55   .02      708.00   13.2664   95.41298
dip                  55   .00      7.35     2.7035    2.06794
mw                   55   123.11   307.99   192.4207  42.41658
mia                  55   .02      .19      .0580     .03451
mib                  55   .00      .23      .0270     .03070
mic                  55   .00      312.00   5.6900    42.06771
polar                54   63.45    153.63   95.7878   23.58493
x0                   55   4.07     9.13     5.9576    1.24159
x1p                  55   2.20     4.68     3.1949    .76452
x2p                  55   1.41     4.56     2.3626    .74960
x3p                  55   .79      2.71     1.4072    .49032
x3c                  55   .10      1.14     .2799     .16722
x4p                  55   .43      1.90     .8358     .38795
x4c                  55   .14      1.79     .4958     .27697
noa                  55   12.00    28.00    17.3091   4.11804
pcpa                 55   .05      .58      .3319     .19432
pcna                 55   -.45     -.05     -.2652    .11673
edn                  55   4.05     6.37     5.2470    .99529
edp                  55   .75      6.95     2.5227    1.99339
dspn                 55   .98      6.94     2.2400    1.62828
shape                55   1.42     3.93     2.6579    .43353
volm                 55   106.12   218.34   146.2387  25.62153
surf                 55   129.62   262.24   175.1636  28.52871
s1zy                 55   44.02    80.88    57.0065   8.44310
s2zx                 55   22.66    56.08    31.9507   7.16801
s3xy                 55   18.74    38.74    25.0053   4.42347
ss1                  55   .57      .80      .7089     .05104
ss2                  55   .65      .92      .8291     .07153
ss3                  55   .64      .90      .8080     .05988
logp                 55   1.49     6.63     3.6971    1.19562
bcf                  54   1.02     5.62     3.1893    .84204
number               55   1.00     110.00   37.5636   33.22246
Valid N (listwise)   53

Variable reduction

• Principal Component Analysis

Principal Component

• $PC_1 = a_{1,1}x_1 + a_{1,2}x_2 + \dots + a_{1,n}x_n$

• $PC_2 = a_{2,1}x_1 + a_{2,2}x_2 + \dots + a_{2,n}x_n$

• Keep only those components that possess the largest variance

• PCs are orthogonal to each other
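
One way to obtain such components is a singular value decomposition of the mean-centered descriptor matrix. This is a sketch on made-up random data; the function name `pca` is hypothetical, and descriptors are only centered here, whereas in practice they are often also scaled to unit variance first.

```python
import numpy as np

def pca(X, n_components):
    """Return scores, loadings (rows hold the a_k,1..a_k,n) and explained variance."""
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)                       # center each descriptor
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    loadings = Vt[:n_components]                  # coefficients of each PC
    scores = Xc @ loadings.T                      # PC values for every compound
    explained = s ** 2 / np.sum(s ** 2)           # fraction of total variance
    return scores, loadings, explained[:n_components]

# Made-up data: 20 compounds, 4 descriptors, the last nearly duplicating the first
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 4))
X[:, 3] = X[:, 0] + 0.01 * rng.normal(size=20)

scores, loadings, explained = pca(X, 2)
print(np.round(explained, 3))
print(np.round(loadings @ loadings.T, 6))         # identity: PCs are orthogonal
```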

Subjective Feature Selection

• The aim is to reach an optimal model

• 1-Search all possible models (Best MLR)

• 2-Forward, Backward & Stepwise methods

• 3-Genetic algorithm

• 4-Mutation and selection uncover models

• 5-Cluster significance analysis

• 6-Leaps & bounds regression

Feature Selection: Most existing feature selection algorithms consist of:

Starting point in the feature space

Search procedure

Evaluation function

Criterion of stopping the search

Feature Selection:

Starting point in the feature space

- no features

- all features

- random subset of features

Forward Selection

• 1- Variables are sequentially entered into the model. The first variable considered for entry into the equation is the one with the largest positive or negative correlation with the dependent variable. This variable is entered into the equation only if it satisfies the criterion for entry.

• 2- If the first variable is entered, the independent variable not in the equation that has the largest partial correlation is considered next.

• 3- The procedure stops when there are no variables that meet the entry criterion.
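
The forward selection steps can be sketched as follows. Picking the candidate that gives the lowest residual sum of squares is equivalent to picking the largest absolute partial correlation with the dependent variable; the entry criterion used here, a fixed F-to-enter threshold of 4.0, is an assumption (statistical packages typically use a probability of F instead).

```python
import numpy as np

def rss(X, y, cols):
    """Residual sum of squares of an MLR model (with intercept) on the given columns."""
    A = np.column_stack([np.ones(len(y))] + [X[:, j] for j in cols])
    resid = y - A @ np.linalg.lstsq(A, y, rcond=None)[0]
    return float(resid @ resid)

def forward_selection(X, y, f_to_enter=4.0):
    n, p = X.shape
    selected = []
    current = rss(X, y, selected)
    while len(selected) < p:
        candidates = [j for j in range(p) if j not in selected]
        best = min(candidates, key=lambda j: rss(X, y, selected + [j]))
        new = rss(X, y, selected + [best])
        dof = n - len(selected) - 2               # residual d.o.f. after entry
        if dof <= 0:
            break
        f_stat = (current - new) / max(new / dof, 1e-12)
        if f_stat < f_to_enter:                   # entry criterion not met: stop
            break
        selected.append(best)
        current = new
    return selected

# Made-up data: only descriptors 2 and 0 carry signal
rng = np.random.default_rng(1)
X = rng.normal(size=(40, 5))
y = 5 * X[:, 2] + X[:, 0] + 0.1 * rng.normal(size=40)
selected = forward_selection(X, y)
print(selected)
```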

Forward Selection example

Model Summary

Model  R      R Square  Adjusted R Square  Std. Error of the Estimate
1      .704a  .496      .486               .59485
2      .762b  .581      .564               .54785
3      .810c  .655      .634               .50184
4      .834d  .695      .670               .47674

a. Predictors: (Constant), logp
b. Predictors: (Constant), logp, mw
c. Predictors: (Constant), logp, mw, dip
d. Predictors: (Constant), logp, mw, dip, mia
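
The Adjusted R Square column can be checked against the R Square column with the standard formula 1 - (1 - R²)(n - 1)/(n - p - 1). The sketch assumes the regression used n = 53 compounds, the "Valid N (listwise)" value reported in the descriptive statistics; that link is an inference, not something the slides state.

```python
def adjusted_r2(r2, n, p):
    """Adjusted R^2 for a model with p predictors fitted to n observations."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

n = 53  # assumed number of compounds with complete data
models = [(0.496, 1), (0.581, 2), (0.655, 3), (0.695, 4)]  # (R Square, predictors)

adjusted = [round(adjusted_r2(r2, n, p), 3) for r2, p in models]
print(adjusted)  # [0.486, 0.564, 0.634, 0.67]
```

The reproduced values agree with the table, which supports the assumed n. Adjusted R² grows more slowly than R² because each added descriptor is penalized.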

Backward Elimination

• 1- All variables are entered into the equation and then sequentially removed.

• 2- The variable with the smallest partial correlation with the dependent variable is considered first for removal. If it meets the criterion for elimination, it is removed.

• 3- After the first variable is removed, the variable remaining in the equation with the smallest partial correlation is considered next.

• 4- The procedure stops when there are no variables in the equation that satisfy the removal criteria.
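
Backward elimination can be sketched in the same style. The variable with the smallest partial F statistic (equivalently, the smallest partial correlation) is the removal candidate; the fixed F-to-remove threshold of 4.0 is an assumption, and the sketch requires more compounds than descriptors.

```python
import numpy as np

def rss(X, y, cols):
    """Residual sum of squares of an MLR model (with intercept) on the given columns."""
    A = np.column_stack([np.ones(len(y))] + [X[:, j] for j in cols])
    resid = y - A @ np.linalg.lstsq(A, y, rcond=None)[0]
    return float(resid @ resid)

def backward_elimination(X, y, f_to_remove=4.0):
    n, p = X.shape
    selected = list(range(p))                     # start with all variables
    while selected:
        full = rss(X, y, selected)
        dof = n - len(selected) - 1               # residual d.o.f. of current model
        partial_f = {
            j: (rss(X, y, [k for k in selected if k != j]) - full)
               / max(full / dof, 1e-12)
            for j in selected
        }
        weakest = min(partial_f, key=partial_f.get)
        if partial_f[weakest] >= f_to_remove:     # nothing meets the removal criterion
            break
        selected.remove(weakest)
    return selected

# Made-up data: only descriptors 2 and 0 carry signal
rng = np.random.default_rng(1)
X = rng.normal(size=(40, 5))
y = 5 * X[:, 2] + X[:, 0] + 0.1 * rng.normal(size=40)
kept = backward_elimination(X, y)
print(kept)
```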

Stepwise

• At each step, the independent variable not in the equation that has the smallest probability of F is entered, if that probability is sufficiently small. Variables already in the regression equation are removed if their probability of F becomes sufficiently large. The method terminates when no more variables are eligible for inclusion or removal.

Stepwise Example

Model Summary

Model  R      R Square  Adjusted R Square  Std. Error of the Estimate
1      .704a  .496      .486               .59485
2      .762b  .581      .564               .54785
3      .810c  .655      .634               .50184
4      .834d  .695      .670               .47674
5      .824e  .679      .660               .48403

a. Predictors: (Constant), logp
b. Predictors: (Constant), logp, mw
c. Predictors: (Constant), logp, mw, dip
d. Predictors: (Constant), logp, mw, dip, mia
e. Predictors: (Constant), logp, dip, mia

Forward, Backward & Stepwise variable selection methods

• Advantages

• Fast and simple

• Can be done with most statistical packages

• Limitation

• Risk of Local minima

Genetic algorithm

Genetic Algorithm

Search Space

Definition

A genetic algorithm is a general-purpose search and optimization method, based on genetic principles and Darwin’s law, that is applicable to a wide variety of problems.

Darwin’s rules

• Survival of the fittest individuals

• Recombination

• Mutation

Biological background

• Chromosome

• Gene

• Reproduction

• Mutation

• Fitness

GA basic operation

• Population generation (chromosome )

• Selection (according to fitness )

• Recombination and mutation (offspring)

• Repetition

GA flow chart

• Initialize: population generation

• Evaluate: compute fitness for each chromosome

• Exploit: perform natural selection

• Explore: recombination & mutation operation

Binary Encoding

Chromosome A 1 0 1 1 0 0 1 1 1 0 0 0 0 1

Chromosome B 0 0 1 0 0 1 1 1 0 1 0 0 1 1

Every chromosome is a string of bits, 0 or 1

Selection

The best chromosomes should survive and create new offspring.

• Roulette wheel selection

• Rank selection

• Steady state selection

Roulette wheel selection

[Figure: roulette wheel for four chromosomes, fitness 1 > 2 > 3 > 4]
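
Roulette wheel selection can be sketched as drawing a random point on a wheel whose slices are proportional to fitness. The fitness values 4, 3, 2, 1 are made up to match the ordering 1 > 2 > 3 > 4, and the function assumes non-negative fitness values.

```python
import random

def roulette_select(population, fitness, rng=random):
    """Pick one individual with probability proportional to its fitness."""
    total = sum(fitness)
    pick = rng.uniform(0, total)
    running = 0.0
    for individual, f in zip(population, fitness):
        running += f
        if pick <= running:
            return individual
    return population[-1]          # guard against floating-point round-off

population = ["1", "2", "3", "4"]
fitness = [4.0, 3.0, 2.0, 1.0]     # assumed values, chromosome 1 is the fittest

rng = random.Random(0)             # fixed seed so the demo is repeatable
counts = {name: 0 for name in population}
for _ in range(10000):
    counts[roulette_select(population, fitness, rng)] += 1
print(counts)
```

Fitter chromosomes are chosen more often, but weaker ones still have a chance, which keeps diversity in the population.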

Crossover (binary encoding)

* Single point

11001011 + 11011111 = 11001111

* Two point crossover

11001011 + 11011111 = 11011111
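
Both crossover operators are easy to express on bit strings. The cut positions below are assumptions, chosen so that the results match the two example offspring on the slide.

```python
def single_point(a, b, point):
    """Take the head of parent a and the tail of parent b."""
    return a[:point] + b[point:]

def two_point(a, b, start, end):
    """Replace the segment [start, end) of parent a with that of parent b."""
    return a[:start] + b[start:end] + a[end:]

parent_a, parent_b = "11001011", "11011111"
print(single_point(parent_a, parent_b, 5))   # 11001111
print(two_point(parent_a, parent_b, 2, 6))   # 11011111
```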

Mutation

* Bit inversion (binary encoding)

11001001 => 10001001

* Ordering change ( permutation encoding )

(1 2 3 4 5 6 8 9 7) => (1 8 3 4 5 6 2 9 7)
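
The two mutation operators can be sketched as below. The positions passed in are chosen to reproduce the two examples above; in a real run they would be drawn at random.

```python
def bit_inversion(chromosome, position):
    """Flip one bit of a binary-encoded chromosome."""
    flipped = "1" if chromosome[position] == "0" else "0"
    return chromosome[:position] + flipped + chromosome[position + 1:]

def order_change(permutation, i, j):
    """Swap two positions of a permutation-encoded chromosome."""
    result = list(permutation)
    result[i], result[j] = result[j], result[i]
    return tuple(result)

print(bit_inversion("11001001", 1))                        # 10001001
print(order_change((1, 2, 3, 4, 5, 6, 8, 9, 7), 1, 6))     # (1, 8, 3, 4, 5, 6, 2, 9, 7)
```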

GA flow chart

Start → Population generation → Fitness → Selection → Crossover → Mutation → Replace → Test → End

Parameters of GA

• Crossover rate

• Mutation rate

• Population size

• Selection type

• Encoding

• Crossover and mutation type
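
To show how these parameters fit together for descriptor selection, here is a compact sketch. Each chromosome is a binary mask over the descriptors, and fitness is the adjusted R² of an MLR model on the selected descriptors. The fitness function, the rank-based truncation selection with elitism, and all parameter values are assumptions made for the sketch, not the configuration used in the slides.

```python
import numpy as np

def fitness(mask, X, y):
    """Adjusted R^2 of an MLR model built on the descriptors switched on in mask."""
    k, n = int(mask.sum()), len(y)
    if k == 0 or n - k - 1 <= 0:
        return -np.inf
    A = np.column_stack([np.ones(n), X[:, mask.astype(bool)]])
    resid = y - A @ np.linalg.lstsq(A, y, rcond=None)[0]
    r2 = 1 - resid @ resid / np.sum((y - y.mean()) ** 2)
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

def ga_select(X, y, pop_size=30, generations=40,
              crossover_rate=0.8, mutation_rate=0.05, seed=0):
    rng = np.random.default_rng(seed)
    p = X.shape[1]
    pop = rng.integers(0, 2, size=(pop_size, p))            # population generation
    for _ in range(generations):
        fit = np.array([fitness(c, X, y) for c in pop])     # evaluate
        pop = pop[np.argsort(fit)[::-1]]                    # rank, best first
        parents = pop[: pop_size // 2]                      # selection
        children = [pop[0].copy()]                          # elitism: keep the best
        while len(children) < pop_size:
            a, b = parents[rng.integers(len(parents), size=2)]
            child = a.copy()
            if rng.random() < crossover_rate:               # single-point crossover
                cut = rng.integers(1, p)
                child[cut:] = b[cut:]
            flip = rng.random(p) < mutation_rate            # bit-inversion mutation
            child[flip] = 1 - child[flip]
            children.append(child)
        pop = np.array(children)                            # replace
    fit = np.array([fitness(c, X, y) for c in pop])
    return pop[int(np.argmax(fit))]

# Made-up data: 40 compounds, 8 descriptors, only descriptors 1 and 4 matter
rng = np.random.default_rng(1)
X = rng.normal(size=(40, 8))
y = 2 * X[:, 1] - 3 * X[:, 4] + 0.1 * rng.normal(size=40)
best = ga_select(X, y)
print(best)
```

The returned mask should switch on descriptors 1 and 4; additional descriptors may also be switched on if they happen to raise the adjusted R² slightly.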

Advantages of GA

• Parallelism

• Provide a group of potential solutions

• Easy to implement

• Can locate global optima (though this is not guaranteed)

How many descriptors can be used in a QSAR model?

Rule of thumb:

- There must be at least 5 data points (molecules) per descriptor in the model

- Otherwise the possibility of finding a coincidental correlation is too high
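
Applied as arithmetic, the rule gives an upper bound on model size. The example assumes the 53 compounds with complete data reported as "Valid N (listwise)" in the descriptive statistics; the function name is hypothetical.

```python
def max_descriptors(n_molecules, points_per_descriptor=5):
    """Largest number of descriptors allowed by the 5-points-per-descriptor rule."""
    return n_molecules // points_per_descriptor

print(max_descriptors(53))  # 10
```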


Questions?