In the Name of God. Basic Steps of QSAR/QSPR Investigations. M.H. Fatemi, Mazandaran University...
Post on 19-Dec-2015
QSAR
• Quantitative Structure-Activity Relationships
• Can one predict activity (or properties in QSPR) simply on the basis of knowledge of the structure of the molecule?
• In other words, if one systematically changes a component, will it have a systematic effect on the activity?
What is QSAR?
A QSAR is a mathematical relationship between a biological activity of a
molecular system and its geometric and chemical characteristics.
QSAR attempts to find consistent relationships between biological activity
and molecular properties, so that these “rules” can be used to evaluate the
activity of new compounds.
Why QSAR?
The number of compounds that would have to be synthesized in order to place 10 different groups in 4 positions of the benzene ring is 10^4 (10,000).
Solution: synthesize a small number of compounds and from their data derive rules to predict the biological activity of other compounds.
QSXR
X = A: Activity
X = P: Property
X = R: Retention
$X = b_0 + b_1 D_1 + b_2 D_2 + \dots + b_n D_n$
$b_i$: regression coefficients
$D_i$: descriptors
$n$: number of descriptors
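The linear form above can be fitted by ordinary least squares. The following is a minimal sketch, assuming NumPy is available; the function name `fit_mlr` and the synthetic data are illustrative assumptions, not part of the original slides.

```python
import numpy as np

def fit_mlr(D, x):
    """Least-squares fit of X = b0 + b1*D1 + ... + bn*Dn.

    D: (m, n) descriptor matrix, x: (m,) observed activity/property/retention.
    Returns the coefficient vector [b0, b1, ..., bn].
    """
    D = np.asarray(D, dtype=float)
    x = np.asarray(x, dtype=float)
    A = np.column_stack([np.ones(len(D)), D])   # leading column of ones gives b0
    b, *_ = np.linalg.lstsq(A, x, rcond=None)
    return b

# Synthetic, noise-free data (hypothetical): X = 1.0 + 2.0*D1 - 0.5*D2
D = np.array([[0.0, 1.0], [1.0, 0.0], [2.0, 3.0], [3.0, 1.0], [4.0, 2.0]])
x = 1.0 + 2.0 * D[:, 0] - 0.5 * D[:, 1]
b = fit_mlr(D, x)
print(np.round(b, 3))
```

Because the data are noise-free, the fit recovers the generating coefficients; with real measurements the b_i are estimates with associated errors.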
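Early Examples
• Hammett (1930s-1940s)
Ionization of benzoic acid and of its para- and meta-substituted (X) derivatives:
C6H5-COOH ⇌ C6H5-COO⁻ + H⁺   (K0)
p-X-C6H4-COOH ⇌ p-X-C6H4-COO⁻ + H⁺   (Kp)
m-X-C6H4-COOH ⇌ m-X-C6H4-COO⁻ + H⁺   (Km)
$\sigma_{para} = \log_{10}(K_p / K_0)$
$\sigma_{meta} = \log_{10}(K_m / K_0)$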
Hammett (cont.)
• Now suppose we have a related series:
X-C6H4-CH2COOH ⇌ X-C6H4-CH2COO⁻ + H⁺   (K'x)
$\log_{10}(K'_x / K'_0) = \rho \sigma$
σ reflects sensitivity to the substituent; ρ reflects sensitivity to the different system
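The Hammett relations can be checked numerically: a substituent constant (sigma) is the log ratio of two ionization constants, and a reaction constant (rho) scales it for a related series. This is a minimal sketch; the function names and all numeric values are hypothetical and chosen only to illustrate the arithmetic.

```python
import math

def hammett_sigma(K_x, K_0):
    """Substituent constant: sigma = log10(K_x / K_0)."""
    return math.log10(K_x / K_0)

def predict_log_ratio(rho, sigma):
    """Hammett equation for a related series: log10(K'_x / K'_0) = rho * sigma."""
    return rho * sigma

# Hypothetical ionization constants, not measured values.
K_0 = 1.0e-5          # unsubstituted acid
K_p = 4.0e-5          # para-substituted acid
sigma_para = hammett_sigma(K_p, K_0)          # log10(4), about 0.602
rho = 0.5                                     # assumed reaction constant
log_ratio = predict_log_ratio(rho, sigma_para)
print(round(sigma_para, 3), round(log_ratio, 3))
```

A positive sigma here means the substituent makes the acid stronger than the reference; rho below 1 means the related series is less sensitive to the substituent than the reference series.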
Free-Wilson Analysis
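• $\log(1/C) = \sum_i a_i + \mu$
where log 1/C = predicted activity (C is the concentration giving the standard response),
$a_i$ = contribution per group, and $\mu$ = activity of the reference compound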
Free-Wilson example
Log 1/C = -0.30 [m-F] + 0.21 [m-Cl] + 0.43 [m-Br] + 0.58 [m-I] + 0.45 [m-Me] + 0.34 [p-F] + 0.77 [p-Cl]
+ 1.02 [p-Br] + 1.43 [p-I] + 1.26 [p-Me] + 7.82
[Figure: structure of the parent compound with substituent positions X and Y (labels N, Br, HCl); activity of analogs]
Limitations: at least two substituent positions are necessary, and the model can only predict new combinations of the
substituents already used in the analysis.
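The Free-Wilson example can be evaluated directly by summing the contributions of the groups present and adding the reference activity. This is a minimal sketch using the coefficients from the example equation; the function name `free_wilson` and the dictionary layout are illustrative assumptions.

```python
# Group contributions and reference activity copied from the example equation.
CONTRIB = {
    "m-F": -0.30, "m-Cl": 0.21, "m-Br": 0.43, "m-I": 0.58, "m-Me": 0.45,
    "p-F": 0.34, "p-Cl": 0.77, "p-Br": 1.02, "p-I": 1.43, "p-Me": 1.26,
}
MU = 7.82  # activity of the reference (unsubstituted) compound

def free_wilson(groups):
    """Predicted log 1/C = sum of contributions of the groups present + MU."""
    return sum(CONTRIB[g] for g in groups) + MU

# A new combination of substituents that were in the analysis.
print(round(free_wilson(["m-Cl", "p-Br"]), 2))   # 0.21 + 1.02 + 7.82
# The reference compound carries no substituent contribution.
print(round(free_wilson([]), 2))
```

A group that was not part of the analysis (for example a hypothetical "p-NO2") raises a `KeyError`, which mirrors the limitation stated above: only combinations of already-studied substituents can be predicted.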
Hansch Analysis
$\log(1/C) = a\pi + b\sigma + c$
where $\pi(X) = \log P_{RX} - \log P_{RH}$
and log P is the logarithm of the octanol/water partition coefficient
This is also a linear free energy relation
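As a small worked illustration of the Hansch terms: the hydrophobic substituent constant is a difference of two log P values, and the activity is a linear combination of the hydrophobic and electronic constants. This is a sketch only; the log P values, the electronic constant and the coefficients a, b, c below are hypothetical numbers, not fitted results.

```python
def hansch_pi(log_p_rx, log_p_rh):
    """Hydrophobic substituent constant: pi(X) = log P(RX) - log P(RH)."""
    return log_p_rx - log_p_rh

def hansch_activity(pi, sigma, a, b, c):
    """Hansch equation: log 1/C = a*pi + b*sigma + c."""
    return a * pi + b * sigma + c

# Hypothetical values for a substituted compound RX and its parent RH.
pi_x = hansch_pi(log_p_rx=2.84, log_p_rh=2.13)
activity = hansch_activity(pi_x, sigma=0.23, a=1.0, b=2.0, c=3.0)
print(round(pi_x, 2), round(activity, 2))
```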
Applications of QSAR
• 1-Drug design
• 2-Prediction of Chemical toxicity
• 3-Prediction of environmental activity
• 4-Prediction of molecular properties
• 5-Investigation of retention mechanism
Steps in QSPR/QSAR
• 1-Structure Entry & Molecular Modeling
• 2-Descriptor Generation
• 3-Feature Selection
• 4-Construct Model (MLRA or CNN)
• 5-Model Validation
Data set selection
• 1-Structural similarity of the studied molecules
• 2-Data collected under the same conditions
• 3-The data set should be as large as possible
INTRODUCTION to Molecular Descriptors
• Molecular descriptors are numerical values that characterize properties of molecules
• Molecular descriptors encode structural features of molecules as numerical values
• They vary in the complexity of the encoded information and in compute time
• Examples:
– Physicochemical properties (empirical)
– Values from algorithms, such as 2D fingerprints
Classical Classification of Molecular Descriptors
[Figure: example 2-D structural formula]
• Constitutional, Topological (from the 2-D structural formula)
• Physicochemical
• Geometrical (from the 3-D shape and structure)
• Quantum Chemical
• Hybrid descriptors
Topological Indexes: Examples
• Wiener Index
– Counts the number of bonds between pairs of atoms and sums the distances between all pairs
• Molecular Connectivity Indexes
– Randić branching index
  • Defines a “degree” of an atom as the number of adjacent non-hydrogen atoms
  • Bond connectivity value is the reciprocal of the square root of the product of the degrees of the two atoms in the bond
  • Branching index is the sum of the bond connectivities over all bonds in the molecule
– Chi indexes: introduce valence values to encode sigma, pi, and lone pair electrons
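Both the Wiener index and the Randić branching index can be computed from a hydrogen-suppressed molecular graph. This is a minimal sketch in plain Python, assuming the molecule is given as an adjacency list; the function names and the two test molecules are illustrative choices.

```python
from collections import deque
from math import sqrt

def wiener_index(adj):
    """Sum of shortest-path bond counts over all unordered atom pairs."""
    total = 0
    for start in adj:
        dist = {start: 0}
        queue = deque([start])
        while queue:                      # breadth-first search from `start`
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total += sum(dist.values())
    return total // 2                     # each pair was counted twice

def randic_index(adj):
    """Sum over bonds of 1 / sqrt(degree(i) * degree(j))."""
    total = 0.0
    for u in adj:
        for v in adj[u]:
            if u < v:                     # visit each bond once
                total += 1.0 / sqrt(len(adj[u]) * len(adj[v]))
    return total

# Hydrogen-suppressed carbon skeletons as adjacency lists.
n_butane = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
isobutane = {0: [1, 2, 3], 1: [0], 2: [0], 3: [0]}

print(wiener_index(n_butane), wiener_index(isobutane))
print(round(randic_index(n_butane), 3), round(randic_index(isobutane), 3))
```

Both indexes are lower for the branched isomer, which is why they are useful as numerical encodings of branching.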
Electronic descriptors
• Electronic interactions play a very important role in controlling molecular properties.
• Electronic descriptors are calculated to encode aspects of the structures that are related to the electrons
• Electronic interaction is a function of charge distribution on a molecule
Physicochemical Properties Used in this QSAR
1. Liquid solubility Sw,L in mg/L and mmol/m3
2. Octanol-water partition coefficient Kow
3. Liquid Vapor Pressure Pv,L in Pa
4. Henry’s Law constant Hc in Pa∙m3/mole
5. Boiling point
Feature Selection
• E.g. comparing faces first requires the
identification of key features.
• How do we identify these?
• The same applies to molecules.
Objective feature selection
• After descriptors have been calculated for each compound, this set must be reduced to a set of descriptors which is as information rich but as small as possible
• 1-Deleting constant or near-constant descriptors
• 2-Pair correlation cut-off selection
• 3-Cluster analysis
• 4-Principal component analysis
• 5-K correlation analysis
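The first two objective steps (removing near-constant descriptors, then applying a pair-correlation cut-off) can be sketched as follows. This assumes NumPy; the thresholds `var_tol` and `r_cut`, the function name, and the toy data are assumptions for illustration, and the rule of keeping the first descriptor of a correlated pair is only one possible convention.

```python
import numpy as np

def objective_filter(X, names, var_tol=1e-8, r_cut=0.9):
    """Return names of descriptors that survive both objective filters."""
    X = np.asarray(X, dtype=float)
    # Step 1: drop constant or near-constant descriptors.
    keep = [j for j in range(X.shape[1]) if X[:, j].var() > var_tol]
    # Step 2: pair-correlation cut-off. A descriptor is retained only if it
    # is not highly correlated with any descriptor already retained.
    selected = []
    for j in keep:
        if all(abs(np.corrcoef(X[:, j], X[:, k])[0, 1]) < r_cut for k in selected):
            selected.append(j)
    return [names[j] for j in selected]

# Toy data (hypothetical): "mw" is an exact linear function of "logp",
# and "const" never changes.
names = ["logp", "mw", "const", "dip"]
X = np.array([
    [1.0, 12.0, 1.0, 3.0],
    [2.0, 14.0, 1.0, 1.0],
    [3.0, 16.0, 1.0, 4.0],
    [4.0, 18.0, 1.0, 1.0],
    [5.0, 20.0, 1.0, 5.0],
    [6.0, 22.0, 1.0, 9.0],
])
kept = objective_filter(X, names)
print(kept)
```

In practice, of two highly correlated descriptors one often keeps the one that correlates better with the activity; that refinement is omitted here.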
Descriptive Statistics
Descriptor           N    Minimum   Maximum   Mean       Std. Deviation
homo                 55   .01       9.44      .6524      1.66861
lumo                 55   .02       708.00    13.2664    95.41298
dip                  55   .00       7.35      2.7035     2.06794
mw                   55   123.11    307.99    192.4207   42.41658
mia                  55   .02       .19       .0580      .03451
mib                  55   .00       .23       .0270      .03070
mic                  55   .00       312.00    5.6900     42.06771
polar                54   63.45     153.63    95.7878    23.58493
x0                   55   4.07      9.13      5.9576     1.24159
x1p                  55   2.20      4.68      3.1949     .76452
x2p                  55   1.41      4.56      2.3626     .74960
x3p                  55   .79       2.71      1.4072     .49032
x3c                  55   .10       1.14      .2799      .16722
x4p                  55   .43       1.90      .8358      .38795
x4c                  55   .14       1.79      .4958      .27697
noa                  55   12.00     28.00     17.3091    4.11804
pcpa                 55   .05       .58       .3319      .19432
pcna                 55   -.45      -.05      -.2652     .11673
edn                  55   4.05      6.37      5.2470     .99529
edp                  55   .75       6.95      2.5227     1.99339
dspn                 55   .98       6.94      2.2400     1.62828
shape                55   1.42      3.93      2.6579     .43353
volm                 55   106.12    218.34    146.2387   25.62153
surf                 55   129.62    262.24    175.1636   28.52871
s1zy                 55   44.02     80.88     57.0065    8.44310
s2zx                 55   22.66     56.08     31.9507    7.16801
s3xy                 55   18.74     38.74     25.0053    4.42347
ss1                  55   .57       .80       .7089      .05104
ss2                  55   .65       .92       .8291      .07153
ss3                  55   .64       .90       .8080      .05988
logp                 55   1.49      6.63      3.6971     1.19562
bcf                  54   1.02      5.62      3.1893     .84204
number               55   1.00      110.00    37.5636    33.22246
Valid N (listwise)   53
Principal Component
• $PC_1 = a_{1,1} x_1 + a_{1,2} x_2 + \dots + a_{1,n} x_n$
• $PC_2 = a_{2,1} x_1 + a_{2,2} x_2 + \dots + a_{2,n} x_n$
• Keep only those components that possess the largest variance
• PCs are orthogonal to each other
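The principal components can be obtained from a singular value decomposition of the mean-centred descriptor matrix. This is a minimal sketch assuming NumPy; centring the descriptors first is a standard convention rather than something stated on the slide, and the function name and random data are illustrative.

```python
import numpy as np

def pca(X, n_components):
    """Return scores, loadings and explained-variance fractions."""
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)                     # centre each descriptor
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    loadings = Vt[:n_components]                # row k holds a_{k,1} ... a_{k,n}
    scores = Xc @ loadings.T                    # PC_k value for every compound
    explained = (s ** 2) / np.sum(s ** 2)       # fraction of total variance
    return scores, loadings, explained[:n_components]

# Random stand-in for a 20-compound, 4-descriptor table (hypothetical data).
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 4))
scores, loadings, explained = pca(X, n_components=2)
print(scores.shape, loadings.shape)
```

The SVD returns components ordered by decreasing variance, so keeping the first few rows of `Vt` keeps the components with the largest variance, and the loading vectors are mutually orthogonal by construction.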
Subjective Feature Selection
• The aim is to reach an optimal model
• 1-Search all possible models (Best MLR)
• 2-Forward, Backward & Stepwise methods
• 3-Genetic algorithm
• 4-Mutation and selection uncover models
• 5-Cluster significance analysis
• 6-Leaps & bounds regression
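The first option, searching all possible models, can be sketched as an exhaustive loop over descriptor subsets of a given size, scoring each by R². This assumes NumPy; the function names and the synthetic data are illustrative assumptions, and the number of subsets grows combinatorially, so this is only practical for small descriptor pools.

```python
from itertools import combinations
import numpy as np

def r_squared(X, y):
    """R^2 of an MLR model with intercept fitted on the columns of X."""
    A = np.column_stack([np.ones(len(X)), X])
    b, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ b
    return 1.0 - (resid @ resid) / np.sum((y - y.mean()) ** 2)

def best_subset(X, y, k):
    """Try every k-descriptor subset and return the one with the highest R^2."""
    return max(combinations(range(X.shape[1]), k),
               key=lambda cols: r_squared(X[:, list(cols)], y))

# Synthetic data (hypothetical): only descriptors 1 and 3 drive the response.
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 5))
y = 2.0 * X[:, 1] - 3.0 * X[:, 3] + 0.01 * rng.normal(size=30)
best = best_subset(X, y, k=2)
print(best)
```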
Feature Selection:
Most existing feature selection algorithms consist of:
• Starting point in the feature space
• Search procedure
• Evaluation function
• Criterion for stopping the search
Feature Selection:
Starting point in the feature space
- no features
- all features
- random subset of features
Forward Selection
• 1-Variables are sequentially entered into the model. The first variable considered for entry into the equation is the one with the largest positive or negative correlation with the dependent variable. This variable is entered into the equation only if it satisfies the criterion for entry.
• 2-If the first variable is entered, the independent variable not in the equation that has the largest partial correlation is considered next.
• 3-The procedure stops when there are no variables that meet the entry criterion.
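The forward procedure can be sketched as a greedy loop. Choosing the candidate with the largest partial correlation is equivalent to choosing the one that most reduces the residual sum of squares, which is what the code does. NumPy is assumed; the entry criterion (a partial F of at least 4.0), the function names and the synthetic data are assumptions for illustration, since the slides do not state the criterion used.

```python
import numpy as np

def rss(X, y, cols):
    """Residual sum of squares of an MLR model with intercept on columns `cols`."""
    A = np.column_stack([np.ones(len(y))] + [X[:, j] for j in cols])
    b, *_ = np.linalg.lstsq(A, y, rcond=None)
    r = y - A @ b
    return float(r @ r)

def forward_selection(X, y, f_enter=4.0):
    n, m = X.shape
    selected = []
    current = rss(X, y, selected)            # intercept-only model
    while len(selected) < m:
        remaining = [j for j in range(m) if j not in selected]
        # Largest partial correlation is the same as largest drop in RSS.
        trial = {j: rss(X, y, selected + [j]) for j in remaining}
        best = min(trial, key=trial.get)
        dof = n - len(selected) - 2           # n - (predictors after entry) - 1
        if dof <= 0:
            break
        f_stat = (current - trial[best]) / (trial[best] / dof)
        if f_stat < f_enter:                  # entry criterion not met: stop
            break
        selected.append(best)
        current = trial[best]
    return selected

# Synthetic data (hypothetical): descriptors 0 and 2 carry the signal.
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 2] + 0.1 * rng.normal(size=40)
selected = forward_selection(X, y)
print(selected)
```

Once a variable has entered it is never removed, which is the main difference from the stepwise method.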
Forward Selection example
Model Summary
Model   R       R Square   Adjusted R Square   Std. Error of the Estimate
1       .704a   .496       .486                .59485
2       .762b   .581       .564                .54785
3       .810c   .655       .634                .50184
4       .834d   .695       .670                .47674
a. Predictors: (Constant), logp
b. Predictors: (Constant), logp, mw
c. Predictors: (Constant), logp, mw, dip
d. Predictors: (Constant), logp, mw, dip, mia
Backward Elimination
• 1-All variables are entered into the equation and then sequentially removed.
• 2-The variable with the smallest partial correlation with the dependent variable is considered first for removal. If it meets the criterion for elimination, it is removed.
• 3-After the first variable is removed, the variable remaining in the equation with the smallest partial correlation is considered next.
• 4-The procedure stops when there are no variables in the equation that satisfy the removal criteria.
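Backward elimination can be sketched in the same spirit: start from the full model and repeatedly drop the variable whose removal costs the least. NumPy is assumed; the removal criterion (a partial F below 4.0), the function names and the synthetic data are illustrative assumptions rather than the settings behind the slides.

```python
import numpy as np

def rss(X, y, cols):
    """Residual sum of squares of an MLR model with intercept on columns `cols`."""
    A = np.column_stack([np.ones(len(y))] + [X[:, j] for j in cols])
    b, *_ = np.linalg.lstsq(A, y, rcond=None)
    r = y - A @ b
    return float(r @ r)

def backward_elimination(X, y, f_remove=4.0):
    n, m = X.shape
    selected = list(range(m))                 # start with all variables
    while selected:
        full = rss(X, y, selected)
        dof = n - len(selected) - 1
        # Partial F of a variable = increase in RSS if it is deleted,
        # relative to the residual variance of the current model.
        f_stats = {
            j: (rss(X, y, [k for k in selected if k != j]) - full) / (full / dof)
            for j in selected
        }
        weakest = min(f_stats, key=f_stats.get)   # smallest partial correlation
        if f_stats[weakest] >= f_remove:          # nothing qualifies for removal
            break
        selected.remove(weakest)
    return selected

# Synthetic data (hypothetical): descriptors 0 and 2 carry the signal.
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 2] + 0.1 * rng.normal(size=40)
kept = backward_elimination(X, y)
print(kept)
```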
Stepwise
• Stepwise. At each step, the independent variable not in the equation that has the
smallest probability of F is entered, if that probability is sufficiently small. Variables
already in the regression equation are removed if their probability of F becomes sufficiently large. The method terminates
when no more variables are eligible for inclusion or removal.
Stepwise Example
Model Summary
Model   R       R Square   Adjusted R Square   Std. Error of the Estimate
1       .704a   .496       .486                .59485
2       .762b   .581       .564                .54785
3       .810c   .655       .634                .50184
4       .834d   .695       .670                .47674
5       .824e   .679       .660                .48403
a. Predictors: (Constant), logp
b. Predictors: (Constant), logp, mw
c. Predictors: (Constant), logp, mw, dip
d. Predictors: (Constant), logp, mw, dip, mia
e. Predictors: (Constant), logp, dip, mia
Forward, Backward & Stepwise variable selection methods
• Advantages
• Fast and simple
• Can be done with most statistical packages
• Limitation
• Risk of Local minima
Definition
A genetic algorithm is a general-purpose search and optimization
method, based on genetic principles and Darwin’s law, that is applicable to
a wide variety of problems
GA basic operation
• Population generation (chromosome )
• Selection (according to fitness )
• Recombination and mutation (offspring)
• Repetition
GA flow chart
• Initialize: population generation
• Evaluate: compute fitness for each chromosome
• Exploit: perform natural selection
• Explore: recombination & mutation operation
Binary Encoding
Chromosome A 1 0 1 1 0 0 1 1 1 0 0 0 0 1
Chromosome B 0 0 1 0 0 1 1 1 0 1 0 0 1 1
Every chromosome is a string of bits, each 0 or 1
Selection
The best chromosomes should survive and create new offspring.
• Roulette wheel selection
• Rank selection
• Steady state selection
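Of the selection schemes listed, roulette wheel selection is the simplest to sketch: each chromosome is chosen with probability proportional to its fitness. The code below is an illustrative implementation in plain Python, assuming non-negative fitness values with a positive sum; the function name and example population are hypothetical.

```python
import random

def roulette_wheel_select(population, fitness, rng):
    """Pick one chromosome with probability proportional to its fitness."""
    total = sum(fitness)
    pick = rng.uniform(0.0, total)        # where the "ball" lands on the wheel
    running = 0.0
    for chromosome, f in zip(population, fitness):
        running += f                      # each chromosome owns a slice of size f
        if pick <= running:
            return chromosome
    return population[-1]                 # guard against floating-point round-off

# Hypothetical population of two bit strings with fitness 1 and 9.
rng = random.Random(0)
population = ["10110", "00111"]
fitness = [1.0, 9.0]
draws = [roulette_wheel_select(population, fitness, rng) for _ in range(1000)]
print(draws.count("10110"), draws.count("00111"))
```

The fitter chromosome is drawn far more often, but the weaker one still has a chance, which preserves diversity in the population.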
Crossover (binary encoding)
* Single point crossover
11001011 + 11011111 = 11001111
* Two point crossover
11001011 + 11011111 = 11011111
Mutation
* Bit inversion (binary encoding)
11001001 => 10001001
* Ordering change (permutation encoding)
(1 2 3 4 5 6 8 9 7) => (1 8 3 4 5 6 2 9 7)
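The crossover and mutation operators can be written as small string and tuple operations that reproduce the bit patterns shown in the examples. This is a minimal sketch; the function names and the cut positions (after bit 5 for single point, bits 2 to 6 for two point) are assumptions chosen to match the example results, since the slides do not state them.

```python
def single_point_crossover(a, b, point):
    """Head of parent a up to `point`, tail of parent b after it."""
    return a[:point] + b[point:]

def two_point_crossover(a, b, start, end):
    """Parent a with the segment [start, end) replaced by parent b's segment."""
    return a[:start] + b[start:end] + a[end:]

def bit_inversion(chromosome, index):
    """Flip the bit at `index` (binary encoding)."""
    flipped = "1" if chromosome[index] == "0" else "0"
    return chromosome[:index] + flipped + chromosome[index + 1:]

def order_change(permutation, i, j):
    """Swap the genes at positions i and j (permutation encoding)."""
    out = list(permutation)
    out[i], out[j] = out[j], out[i]
    return tuple(out)

child_1 = single_point_crossover("11001011", "11011111", point=5)
child_2 = two_point_crossover("11001011", "11011111", start=2, end=6)
mutant = bit_inversion("11001001", index=1)
reordered = order_change((1, 2, 3, 4, 5, 6, 8, 9, 7), 1, 6)
print(child_1, child_2, mutant, reordered)
```

In a full GA the cut points and the mutated position would be drawn at random, with the crossover and mutation rates controlling how often each operator is applied.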
Parameters of GA
• Crossover rate
• Mutation rate
• Population size
• Selection type
• Encoding
• Crossover and mutation type
Advantages of GA
• Parallelism
• Provide a group of potential solutions
• Easy to implement
• Less prone to local optima; can reach global optima, though this is not guaranteed
How many descriptors can be used in a QSAR model?
Rule of thumb:
- For each descriptor, at least 5 data points (molecules) must exist in the model
Otherwise the possibility of finding a coincidental correlation is too high
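The rule of thumb translates into a one-line check on model size. This is a minimal sketch; the function name is hypothetical, and the ratio of 5 is the value stated above.

```python
def max_descriptors(n_molecules, points_per_descriptor=5):
    """Largest number of descriptors allowed by the data-points-per-descriptor rule."""
    return n_molecules // points_per_descriptor

# For example, a data set of 55 molecules supports at most 11 descriptors.
print(max_descriptors(55))
print(max_descriptors(53))
```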