# An Introduction to QSAR Methodology

#### Allen B. Richon

*
Network Science Corporation
4411 Connecticut Avenue NW, STE 514
Washington, DC 20008-8677
*

#### Stanley S. Young

*Glaxo Wellcome Research
Five Moore Drive
Research Triangle Park, NC 27709
*

**http://www.netsci.org/Science/Compchem/feature19.html**

## Introduction

Drug design is an iterative process which begins with a compound that displays an interesting biological profile and ends with optimizing both the activity profile for the molecule and its chemical synthesis. The process is initiated when the chemist conceives a hypothesis which relates the chemical features of the molecule (or series of molecules) to the biological activity. Without a detailed understanding of the biochemical process(es) responsible for activity, the hypothesis generally is refined by examining structural similarities and differences for active and inactive molecules. Compounds are selected for synthesis which maximize the presence of functional groups or features believed to be responsible for activity.

The combinatorial possibilities of this strategy for even simple systems can be explosive. As an example, the number of compounds required for synthesis in order to place 10 substituents on the four open positions of an asymmetrically disubstituted benzene ring system is approximately 10,000. The alternative to this labor intensive approach to compound optimization is to develop a theory that quantitatively relates variations in biological activity to changes in molecular descriptors which can easily be obtained for each compound. A Quantitative Structure Activity Relationship (QSAR) can then be utilized to help guide chemical synthesis. This chapter develops the concepts used to derive a QSAR and reviews the application of these techniques to medicinal research.

## Statistical Concepts

Computational chemistry represents molecular structures as a numerical models and simulates their behavior with the equations of quantum and classical physics. Available programs enable scientists to easily generate and present molecular data including geometries, energies and associated properties (electronic, spectroscopic and bulk). The usual paradigm for displaying and manipulating these data is a table in which compounds are defined by individual rows and molecular properties (or descriptors) are defined by the associated columns. A QSAR attempts to find consistent relationships between the variations in the values of molecular properties and the biological activity for a series of compounds so that these "rules" can be used to evaluate new chemical entities.

A QSAR generally takes the form of a linear equation

Biological Activity = Const + (C_{1} P_{1}) + (C_{2} P_{2}) + (C_{3}
P_{3}) + ...

where the parameters P_{1} through P_{n} are computed for each molecule in the series and the coefficients
C_{1} through C_{n} are calculated by fitting
variations in the parameters and the biological activity. Since
these relationships are generally discovered through the
application of statistical techniques, a brief introduction to the
principles behind the derivation of a QSAR follows.

The work
reported from The Sandoz Institute for Medical Research on the
development of novel analgesic agents^{1} can be used as an
example of a simple QSAR. In this study, vanillylamides and
vanillylthioureas related to capsaicin were prepared and their
activity was tested in an in vitro assay which measured
^{45}Ca^{2+} influx into dorsal root ganglia
neurons. The data, which was reported as the EC_{50}
(µM), is shown in Table 1 (note that compound 6f is the most
active of the series).

### TABLE 1

Capsaicin Analogs Activity Data

Cmpd Number |
Cmpd Name |
X | EC_{50}(µM) |
---|---|---|---|

1 | 6a | H | 11.80 ± 1.90 |

2 | 6b | Cl | 1.24 ± 0.11 |

3 | 6d | NO_{2} |
4.58 ± 0.29 |

4 | 6e | CN | 26.50 ± 5.87 |

5 | 6f | C_{6}H_{5} |
0.24 ± 0.30 |

6 | 6g | N(CH_{3})_{2} |
4.39 ± 0.67 |

7 | 6h | I | 0.35 ± 0.05 |

8 | 6i | NHCHO | ??? |

In the absence of additional information, the only way to derive a best "guess" for the activity of 6i is to calculate the average of the values for the current compounds in the series. The average, 7.24, provides a guess for the value of compound 8 but, how good is this guess? The graphical presentation of the data points is shown in Graph 1.

### GRAPH 1

Capsaicin Analogs Activity Data

The standard
deviation of the data, **s**, shows how far the activity values
are spread about their average. This value provides an indication
of the quality of the guess by showing the amount of variability
inherent in the data. The standard deviation is calculated as shown
below.

Rather than relying on this limited analysis, one would like to develop an understanding of the factors that influence activity within this series and use this understanding to predict activity for new compounds. In order to accomplish this objective, one needs:

- binding data measured with sufficient precision to distinguish
between compounds;
- a set of parameters which can be easily obtained and which are
likely to be related to receptor affinity;
- a method for detecting a relationship between the parameters
and binding data (the QSAR) and
- a method for validating the QSAR.

The QSAR equation is a linear model which relates variations in biological activity to variations in the values of computed (or measured) properties for a series of molecules. For the method to work efficiently, the compounds selected to describe the "chemical space" of the experiments (the training set) should be diverse. In many synthesis campaigns, compounds are prepared which are structurally similar to the lead structure. Not surprisingly, the activity values for this series of compounds will frequently span a limited range as well. In these cases, additional compounds must be made and tested to fill out the training set.

The quality of any QSAR will only be as good as the quality of the data which is used to derive the model. Dose-response curves need to be smooth, contain enough points to assure accuracy and should span two or more orders of magnitude. Multiple readings for a given observation should be reproducible and have relatively smaller errors. The issue being addressed is the signal-to-noise ratio. The variation of the readings obtained by repeatedly testing the same compound should be much smaller than the variation over the series. In cases where the data collected from biological experiments do not follow these guidelines, other methods of data analysis should be utilized since the QSAR models derived from the data will be questionable.

Once biological data has been collected, it is often found that the data is expressed in terms which cannot be used in a QSAR analysis. Since QSAR is based on the relationship of free energy to equilibrium constants, the data for a QSAR study must be expressed in terms of the free energy changes that occur during the biological response. When examining the potency of a drug (the dosage required to produce a biological effect), the change in free energy can be calculated to be proportional to the inverse logarithm of the concentration of the compound.

G_{0} = - 2.3RTlogK = log 1/[S]

Further,
since biological data are generally found to be skewed, the log
transformation moves the data to a nearly normal distribution.
Thus, when measuring responses under equilibrium conditions, the
most frequent transformation used is to express concentration
values (such as IC_{50}, EC_{50}, etc.) as log[C]
or log 1/[C]. The transformed data for the capsaicin agonists are
shown in Table 2.

### TABLE 2

Capsaicin Analogs

Transformed Data

Cmpd Number |
Cmpd Name |
X | EC_{50} |
Log EC_{50} |
Log 1/EC_{50} |
---|---|---|---|---|---|

1 | 6a | H | 11.80 ± 1.90 | 1.07 | -1.07 |

2 | 6b | Cl | 1.24 ± 0.11 | 0.09 | -0.09 |

3 | 6d | NO_{2} |
4.58 ± 0.29 | 0.66 | -0.66 |

4 | 6e | CN | 26.50 ± 5.87 | 1.42 | -1.42 |

5 | 6f | C_{6}H_{5} |
0.24 ± 0.30 | - 0.62 | 0.62 |

6 | 6g | N(CH_{3})_{2} |
4.39 ± 0.67 | 0.64 | -0.64 |

7 | 6h | I | 0.35 ± 0.05 | - 0.46 | 0.46 |

8 | 6i | NHCHO | ?? ± ?? | ?? | ?? |

The effect of this transformation on the spread of the data relative to the average is shown in Graph 2. Note that the data points, projected onto the Y-axis, have become more uniformly distributed.

### GRAPH 2

Capsaicin Analogs

Transformed Data

Given the transformed data, our best guess for the activity of 6i is still the average of the data set (or 0.40). As before, the error associated with this guess is calculated as the square root of the average of the squares of the deviations from the average.

This is an example data set intended to show the general approach; real data sets would have many more compounds and descriptors. Since the purpose of a QSAR is to highlight relationships between activity and structural features, we would like to find one or more structural features which relate these molecules and their associated activity. Additionally, we would like to find a parameter that works consistently for all of the molecules in the series.

There are several potential classes of parameters used in QSAR studies. Substituent constants and other physico-chemical parameters (such as Hammett sigma constants) measure the electronic effects of a group on the molecule. Fragment counts are used to enumerate the presence of specific substructures. Other parameters can include topological descriptors and values derived from quantum chemical calculations.

The selection of parameters is an important first step in any QSAR study. If the association between the parameter(s) selected and activity is strong, then activity predictions will be possible. If there is only weak association, knowing the value of the parameter(s) will not help in predicting activity. Thus, for a given study, parameters should be selected which are relevant to the activity for the series of molecules under investigation and these parameters should have values which are obtained in a consistent manner.

The Sandoz group divided their analysis of capsaicin analogs into three regions: the A-region which was occupied by an aromatic ring; the B-region which was defined by an amide bond; and the C-region which was occupied by a hydrophobic side-chain (See figure in Table 1). The hypothesis for the C-region assumed that a small, hydrophobic substituent would increase activity. Given this assumption, the parameters selected to best define this characteristic were molar refractivity (size) and , the hydrophobic substituent constant. These values are given in Table 3.

### TABLE 3

Capsaicin Analogs

Parameter Values

Cmpd Number |
Cmpd Name |
X | Log EC_{50} |
MR | |
---|---|---|---|---|---|

1 | 6a | H | 1.07 | 0.00 | 1.03 |

2 | 6b | Cl | 0.09 | 0.71 | 6.03 |

3 | 6d | NO_{2} |
0.66 | - 0.28 | 7.36 |

4 | 6e | CN | 1.42 | - 0.57 | 6.33 |

5 | 6f | C_{6}H_{5} |
- 0.62 | 1.96 | 25.36 |

6 | 6g | N(CH_{3})_{2} |
0.64 | 0.18 | 15.55 |

7 | 6h | I | - 0.46 | 1.12 | 13.94 |

8 | 6i | NHCHO | ?? | ?? | ?? |

The data above can be analyzed for relationships by two means: graphically and statistically. The most visual approach to a problem with a limited number of variables is graphical. In this case, a plot of activity versus either molar refractivity or hydrophobicity gives some insight into the relationship between the parameters and activity. The plots derived by the Sandoz group are reproduced in Graph 3.

### GRAPH 3

Capsaicin Analogs

Parameter Values

Does the graph provide insight into the the activity for compound 6i? Does knowing the value for either the hydrophobicity or molar refractivity parameters for this compound provide a good estimate for activity?

Since this is a simple example where only two values are examined, the answers to these questions are a qualified yes. In more complex situations however, where multiple parameters are correlated to activity, statistics is used to derive an equation which relates activity to the parameter set. The linear equation which defines the best model for this set of data is

Log EC_{50} = 0.764 - (0.817)

How much
confidence should we place in this model? The first step to
answering this question is to determine how well the equation
predicts activities for known compounds in the series. The equation
above estimates the average value for the EC_{50} based on
the value for ; because assays vary, it is not surprising that
individual values will differ from the regression estimate. The
difference between the calculated values and the actual (or
measured) values for each compound is termed the residual from the
model. The calculated values for activity and their residuals (or
the errors of the estimate for individual values) are shown in
Table 4.

### TABLE 4

Capsaicin Analogs

Calculated Values

Cmpd Number |
Cmpd Name |
X | Log EC_{50} |
Calculated Log EC _{50} |
Residual | |
---|---|---|---|---|---|---|

1 | 6a | H | 1.07 | 0.00 | 0.79 | 0.28 |

2 | 6b | Cl | 0.09 | 0.71 | 0.21 | - 0.12 |

3 | 6d | NO_{2} |
0.66 | - 0.28 | 1.02 | - 0.36 |

4 | 6e | CN | 1.42 | - 0.57 | 1.26 | 0.16 |

5 | 6f | C_{6}H_{5} |
- 0.62 | 1.96 | - 0.81 | 0.19 |

6 | 6g | N(CH_{3})_{2} |
0.64 | 0.18 | 0.65 | - 0.01 |

7 | 6h | I | - 0.46 | 1.12 | - 0.12 | - 0.34 |

8 | 6i | NHCHO | ?? | - 0.98 | 1.60 | ?? |

The residuals are one way to quantify the error in the estimate for individual values calculated by the regression equation for this data set. The standard error for the residuals is calculated by taking the root-mean-square of the residuals (in this calculation, the denominator shown as decremented by two to reflect the estimation of two parameters).

In order to be an improved model, the standard deviation of the residuals calculated from the model should be smaller than the standard deviation of the original data. The standard error about the mean was previously calculated to be 0.76 whereas the standard error from the QSAR model is 0.28. Clearly, the the use of linear regression has improved the accuracy of our analysis. The plot of measured values versus calculated is shown in Graph 4 with a 45° line.

### GRAPH 4

Capsaicin Analogs

Predicted Versus Actual EC_{50} Values

There are several assumptions inherent in deriving a QSAR model for a series of compounds. First, it is assumed that parameters can be calculated (or measured in some cases) more accurately and cheaply than activity can be measured. Second, it is assumed that deviations from the best fit line follow a normal (Gaussian) distribution. Finally, it is assumed that any variation in the line described by the QSAR equation is independent of the magnitude of both the activity and the parameters. Given these assumptions, the quality of the model can be gauged using a variety of techniques.

Variation in the data is quantified by the correlation coefficient, r, which measures how closely the observed data tracks the fitted regression line. Errors in either the model or in the data will lead to a bad fit. This indicator of fit to the regression line is calculated as:

where the Regression Variance is defined as the Original Variance minus the Variance around the regression line. The Original Variance is the sum-of-the-squares distances of the original data from the mean. This can be viewed graphically as shown in Graph 5.

The calculation is carried out as follows:

- Original Variance = (1.07 - 0.40)
^{2}+ (0.09 - 0.40)^{2}+ ... - Original Variance = 3.49
- Variance around the line = (0.28)
^{2}+ (- 0.12)^{2}+ (- 0.36)^{2}+ ... - Variance around the line = 0.40

- Regression Variance = Original Variance - Variance around the
line
- Regression Variance = 3.49 - 0.40 = 3.09
- r
^{2}= Regression Variance/Original Variance - r
^{2}= 3.09/3.49 - r
^{2}= 0.89

Possible
values reported for r^{2} fall between 0 and 1. An
r^{2} of 0 means that there is no relationship between
activity and the parameter(s) selected for the study. An
r^{2} of 1 means there is perfect correlation. The
interpretation of the r^{2} value for the capsaicin analogs
is that 89% of the variation in the value of the Log
EC_{50} is explained by variation in the value of , the
hydrophobicity parameter.

### GRAPH 5

Capsaicin Analogs

Derivation of r^{2} values

While the fit
of the data to the regression line is excellent, how can one decide
if this correlation is based purely on chance? The higher the value
for r^{2} the less likely that the relationship is due to
chance. If many explanatory variables are used in a regression
equation, it is possible to get a good fit to the data due to the
flexibility of the fitting process; a line will fit two points
perfectly, a quadratic curve will fit three, multiple linear
regression will fit the observed data if there are enough
explanatory variables^{2}. Given the assumption that the
data has a Gaussian distribution, the F statistic below assesses
the statistical significance of the regression equation.

The F
statistic is calculated from r^{2} and the number of data
points (or degrees of freedom) in the data set. The F ratio for the
capsaicin analogs is calculated as:

This value often appears as standard output from statistical programs or it can be checked in statistical tables to determine the significance of the regression equation. In this case, the probability that there is no relationship between activity and the value is less than 1% (p=0.01).

We have found that hydrophobicity values correlate well with biological activity. Does the addition of a size parameter (MR) improve our model? In order to analyze a relationship which is possibly influenced by several variables (or properties), it is useful to assess the contribution of each variable. and MR appear to be somewhat correlated in this data set so the order of fitting can influence how much the second variable helps the first. Multiple linear regression is used to determine the relative importance of multiple variables to the overall fit of the data.

Multiple
linear regression attempts to maximize the fit of the data to a
regression equation (minimize the squared deviations from the
regression equation) for the biological activity (maximize the
r^{2} value) by adjusting each of the available parameters
up or down. Regression programs often approach this task in a
stepwise fashion. That is, successive regression equations will be
derived in which parameters will be either added or removed until
the r^{2} and s values are optimized. The magnitude of the
coefficients derived in this manner indicate the relative
contribution of the associated parameter to biological activity.

There are two important caveats in applying multiple regression analysis. The first is based on the fact that, given enough parameters any data set can be fitted to a regression line. The consequence of this is that regression analysis generally requires significantly more compounds than parameters; a useful rule of thumb is three to six times the number of parameters under consideration. The difficulty is that regression analysis is most effective for interpolation and it is extrapolation that is most useful in a synthesis campaign (i.e., the region of experimental space described by the regression analysis has been explained, but projecting to a new, unanalyzed region can be problematic).

Using multiple regression for the capsaicin analogs, one can derive the following equation which relates hydrophobicity and molar refractivity to biological activity.

Log EC_{50} = 0.762 - (0.819) +
(0.011)MR

s = 0.313, r^{2} = 0.888

To judge the importance of a regression term, three items need to be considered.

- Statistical significance of the regression coefficient.
- The magnitude of the typical effect b
_{i}x_{i}(in this case, 0.011 25.36). - Any cross-correlation with other terms.

As more terms
are added to multiple linear regression, r^{2} always gets
larger. We recompute the previous calculations (r^{2} =
0.89) carrying three significant figures so that rounding does not
lead to confusion.

These results
of this analysis indicate that, **within this series**, steric
bulk is not an important factor in activity. The influence of the
hydrophobicity constant confirms the presence of a hydrophobic
binding site. Given the limited number of substituents in this
analysis, it is unlikely that more can be learned from further
analysis.

This section
has developed the fundamental mathematics of QSAR studies. Several
authors have published reviews of QSAR and have discussed various
aspects of the methods^{3-8}. Each of the examples to
follow uses these techniques to derive information about the
chemical factors which are important for activity.

## Approaches to Developing a QSAR

Drugs exert their biological effects by participating in a series of events which include transport, binding with the receptor and metabolism to an inactive species. Since the interaction mechanisms between the molecule and the putative receptor are unknown in most cases (i.e., no bound crystal structures), one is reduced to making inferences from properties which can easily be obtained (molecular properties and descriptors) to explain these interactions for known molecules. Once the relationship is defined, it can be used to aid in the prediction of new or unknown molecules.

The first approach to developing quantitative relationships which described activity as a function of chemical structure relied on the principles of thermodynamics. The free-energy terms E, H and S were represented by a series of parameters which could be derived for a given molecule.

Electronic
effects such as electron donating and withdrawing tendencies,
partial atomic charges and electrostatic field densities were
defined by Hammett sigma () values, resonance parameters (R values),
inductive parameters (F values) and Taft substituent values (*, *,
E_{s}). Steric effects such as molecular volume and surface
area were represented by values calculated for Molar Refractivity
and the Taft steric parameter. Enthalpic effects were calculated
using partition coefficients (LogP) or the hydrophobic parameter,
,
which was derived from the partition coefficient. In addition, an
assortment of structural indices were used to describe the presence
of specific functional groups at positions within the molecule. The
linear equation which described the relationship between activity
and this parameter set was the Hansch equation

log 1/[C] = A(logP) - B(logP)^{2} +
C(E_{s}) + D() + E + ...

Multiple linear regression analysis was used to derive the values of the coefficients. In general, Hansch type studies were performed on compounds which contained a common template (usually a rigid one such as an aromatic ring) with structural variation limited to functional group changes at specific sites.

Hansch
utilized this approach in his analysis of 256
4,6-diamino-1,2-dihydro-2,3-dimethyl-1-(X-phenyl)-s-triazines which
were active against tumor dihydrofolate reductase^{9}. It
was demonstrated that for 244 of the compounds, activity could be
correlated to the presence of hydrophobic groups at the three and
four positions of the N-phenyl ring. The parameters used to derive
this correlation were the hydrophobic constant () and molar
refractivity constant (MR) for meta and para substituents on the
N-phenyl ring and six indicator variables
I_{1}-I_{6} which were used to indicate the
presence (a value of 1) or absence (a value of 0) of specific
structural features. The equation which was formulated from these
data using the method of least squares is shown below.

### FIGURE 5

Analysis of the Baker Triazines

log 1/[C] = 0.680(_{3}) - 0.118(_{3})^{2} + 0.230(MR_{4}) -
0.024(MR_{4})^{2} + 0.238(I_{1}) -
2.530(I_{2}) - 1.991(I_{3}) + 0.877(I_{4})
+ 0.686(I_{5}) + 0.704(I_{6}) + 6.489

n = 244, r = 0.923, s = 0.377

The optimal
values for MR_{4} (4.7) and _{3}
(2.9) were obtained from the partial derivatives of the equation.
Note that the number of compounds in the data set was reduced to
244. Hansch and Silipo reported improvements in the value for r and
s by removing 12 compounds which were incorrectly predicted by a
factor of 10 or more.

While there
are limits to the Hansch approach, it permitted complex biological
systems to be modeled successfully using simple parameters. The
approach has been used successfully to predict substituent effects
in a wide number of biological assays. The main problem with the
approach was the large number of compounds which were required to
adequately explore all structural combinations. Further, the
analysis methods did not lend themselves to the consideration of
conformational effects. Several authors have published articles
which provide additional background on the Hansch
approach^{10-11}.

Alternative
approaches to compound design have been suggested which avoid the
combinatorial problem found in Hansch type analyses. Free and
Wilson used a series of substituent constants which related
biological activity to the presence of a specific functional group
at a specific location on the parent molecule^{12}. The
relationship between biological activity and the presence or
absence of a substituent was then expressed by the following
equation:

Activity = A + _{i}_{j}G_{ij}X_{ij}

where A was defined as the average biological activity for the
series, G_{ij} the contribution to activity of a functional
group **i** in the jth position and X_{ij} the presence
(1.0) or absence (0.0) of the functional group **i** in the jth
position.

The procedure used the equation above to build a matrix for the series and represented this matrix as a series of equations. Substituent constants then were derived for every functional group at every position. Statistical tests were used to test the importance of the constants. If the models were shown to be valid, the model was used to predict activity values for compounds which had not been prepared. In general, while a large number of compounds are required to explore the effects of multiple substitution patterns, the Free-Wilson approach substantially reduces the number of analogs required. However, the method demands that the effects of substituents are additive.

In 1972, John
Topliss published a paper which detailed methodology to automate
the Hansch approach^{2}. The method assumed that the lead
compound of interest contained at least one phenyl ring which could
serve as the template for functional group modifications. The first
modification to the template was preparation of the para-chloro
derivative to examine lipophilicity. Additional substitution
patterns were then made sequentially in an attempt to explore and
optimize the relationship between activity and the hydrophobic and
electronic character of the molecule. While the Topliss approach is
easy to follow, it has several drawbacks. The primary problems are
that the procedure is not applicable to all types of studies and
that there is a high degree of risk associated with its use (it
essentially ignores the possibility of interactions between
substituents as it changes one substituent at a time).

The use of classical QSAR was expanded during the 1960's as a means of correlating observed activity to chemical properties. However, there are many areas where these techniques could not be used or where they failed to provide useful correlations. These included situations in which activity was found to be determined by 3-dimensional geometry, where poor training sets of compounds were used or the set of compounds were too small or insufficiently diverse and cases where biological activity could not be well quantified. Many of these problems were addressed by extensions to the Hansch method and the development of alternative approaches to QSAR.

There are
cases where biological activity values cannot be determined
accurately for a variety of reasons, e.g. lack of sensitivity of a
particular test system. Alternative statistical techniques can be
used in these cases; the problem is simplified to a classification
scheme in which compounds are labeled as active, partially active,
inactive, etc. The resulting data set is then searched for patterns
which predict these categories. The methods which have been used
for this type of analysis include SIMCA (Soft, Independent Modeling
of Class Analogy)^{13}, ADAPT (Automated Data Analysis by
Pattern recognition Techniques)^{14}, CASE (Computer
Automated Structure Evaluation)^{15} and CSA (Cluster
Significance Analysis)^{16}.

Pattern
recognition methods^{17} attempt to define the set of
parameter values which will result in clustering compounds of
similar activity into regions of n-dimensional space. The methods
used to accomplish this goal can be parametric or nonparametric.
Parametric methods search the n-dimensional space for clusters of
compounds based on their calculated properties. These methods do
not use derived values (e.g., mean vectors and covariance
matrices), but instead use the original data to find clustering
definitions and apply iterative procedures to find the linear set
of parameters which best define the classification scheme.

Where the methods described above develop discriminant functions, SIMCA methods use Principal Component Analysis (PCA) to describe the data set. The objective of PCA is to create a reduced number variables which describe biological activity or chemical properties into a relatively few independent ones. This is accomplished through an analysis of the correlation matrix of biological or chemical properties.

Principle component analysis can be used to create derived variables for each class (e.g., active and inactive) separately by decomposing the correlation matrix; this method is useful to point out redundancies or interrelationships among the variables. PCA seeks to find simplified relationships in data by transforming the original parameters into a new set of uncorrelated variables which are termed principal components. The symmetric correlation matrix is decomposed by an eigenvalue decomposition. The largest eigenvalue and its eigenvector are used to form a linear combination of the original variables with maximum variance. Successively smaller eigenvalues and vectors produce linear combinations of the original variables with diminishing variance. Successive eigenvectors are independent of one another. The simplification is derived by disregarding eigenvectors associated with small eigenvalues. In summary, the procedure finds the set of orthogonal axes for the data which decompose variance in the data.

Another approach to examining the effects of chemical structure on activity was developed by the Jurs' group. Rather than rely on multivariant statistics to highlight these relationships, Jurs used the combination of cluster analysis and pattern recognition techniques as a tool to develop these correlations. The ADAPT program generated a data set of molecular descriptors (topological, geometrical and physicochemical) derived from three dimensional model building, projected these data points onto an n-dimensional surface and analyzed them using pattern recognition methods. The goal of this analysis was to discriminate between active and inactive compounds in a series.

Jurs has
reported several applications of the methodology contained in
ADAPT. In one study of chemical carcinogens^{18}, a linear
discriminant function was derived from a set of 28 calculated
structure features including fragment descriptors, substructure
descriptors, environment descriptors, molecular connectivity
descriptors and geometric descriptors. Two hundred and nine
compounds from twelve structural classes (130 carcinogens, 79
noncarcinogens) were selected for this study. The program was used
to identify a training set of 192 compounds which was used to find
the best set of descriptors and analyze the entire data set. A
predictive success of 90% for carcinogenic compounds and 78% for
noncarcinogenic compounds was obtained in randomized testing.

The CASE program extended the techniques in ADAPT by using topological methods to define substructural fragments which were essential for activity. CASE was able to differentiate between positional isomers. Both CASE and ADAPT are limited to analyzing structurally similar data sets.

The analysis
methods described to this point have not explicitly incorporated
the contribution of three dimensional shape in the analysis of the
activity of a molecule. While the use of chemical graph
indexes^{18}, intermolecular binding
distances^{19}, molecular surface areas^{20} and
electrostatic potentials^{21} contain some information
about the 3-D shape of molecules, the Hopfinger^{22} and
Marshall^{23} groups were the first to exhaustively analyze
these effects.

In 1979,
Marshall extended the 2-D approach to QSAR by explicitly
considering the conformational flexibility of a series as reflected
by their 3-D shape^{23}. The first step of the Active
Analog Approach was to exhaustively search the conformations of a
compound which was highly active in a particular biological assay.
The result of the search was a map of interatomic distances which
was used to filter the conformational searches of subsequent
molecules in the series. The implicit assumption of the method was
that all compounds which display similar activity profiles were
able to adopt similar conformations. Once the "active conformation"
was determined, molecular volumes for each molecule were calculated
and superimposed. Regression analysis of the volumes was used to
establish a relationship to biological activity. Marshall and
co-workers commercialized the Active Analog Approach and a suite of
other drug design techniques in the SYBYL molecular modeling
program.

Hopfinger and
co-workers also used 3-D shape in QSAR. In molecular shape
analysis^{24} of the Baker Triazines, the common space
shared by all molecules of a series and the differences in their
potential energy fields were computed. When these calculations were
combined with a set of rules for overlapping the series,
comparative indicies of the shape of different molecules were
obtained. Inclusion of these shape descriptors in standard Hansch
analysis schemes lead to improved descriptions relating computed
parameters to biological activity such that no compounds in the
original data set had to be eliminated from the calculations. The
techniques developed by Hopfinger and co-workers were made
available in the CAMSEQ, CAMSEQ-II, CHEMLAB and CAMSEQ-M computer
programs.

In 1988,
Richard Cramer proposed that biological activity could be analyzed
by relating the shape-dependent steric and electrostatic fields for
molecules to their biological activity^{25}. Additionally,
rather than limiting the analysis to fitting data to a regression
line, CoMFA (Comparative Molecular Field Analysis) utilized new
methods of data analysis, PLS (Partial Least Squares) and
cross-validation, to develop models for activity predictions.

The approach
used in the CoMFA procedure requires that the scientist define
alignment rules for the series which overlap the putative
pharmacophore for each molecule; the active conformation and
alignment rule must be specified. Once aligned, each molecule is
fixed into a three-dimensional grid by the program and the
electrostatic and steric components of the molecular mechanics
force field, arising from interaction with a probe atom (e.g., an
SP^{3} C atom), are calculated at intersecting lattice
points within the 3-D grid. The equations which result from this
exercise have the form

Act_{1} = Const_{1} +
a_{1}(steric_{xyz}) +
b_{1}(steric_{xyz}) + ... +
a'_{1}(estatic_{xyz}) +
b'_{1}(estatic_{xyz}) + ...

Act_{2} = Const_{2} +
a_{2}(steric_{xyz}) +
b_{2}(steric_{xyz}) + ... +
a'_{2}(estatic_{xyz}) +
b'_{2}(estatic_{xyz}) + ...

Act_{n} = Const_{n} +
a_{n}(steric_{xyz}) +
b_{n}(steric_{xyz}) + ... +
a'_{n}(estatic_{xyz}) +
b'_{n}(estatic_{xyz}) + ...

Traditional
regression methods require that the number of parameters must be
considerably smaller than the number of compounds in the data set
(or the number of degrees of freedom in the data). The data tables
which result from CoMFA analysis have far more parameters than
compounds. PLS, which removes this limitation, is used to derive
the coefficients for all of the steric and electrostatic terms. PLS
essentially relies upon the fact that the correlations among near
parts of a molecule are similar so that the real dimensionality is
smaller that the number of grid points. Since these coefficients
are position dependant, substituent patterns for the series are
elucidated which define regions of steric bulk and electrostatic
charge associated with increased or decreased activity. The size of
the model (the number of components^{27} needed for the
best model) and the validity of the model as a predictive tool are
assessed using cross-validation.

As opposed to
traditional regression methods, cross-validation evaluates the
validity of a model by how well it predicts data rather than how
well it fits data. The analysis uses a "leave-one-out" scheme; a
model is built with N-1 compounds and the N^{th} compound
is predicted. Each compound is left out of the model derivation and
predicted in turn. An indication of the performance of the model is
obtained from the cross-validated (or predictive) r^{2}
which is defined as

r^{2} (cross-validated) = (SD -
Press)/SD

SD is the Sum-of-Squares deviation for each activity from the mean. Press (or Predictive Sum of Squares) is the sum of the squared differences between the actual and that predicted when the compound is omitted from the fitting process.

As we have
discussed, values for conventional r^{2} range from 0 to 1.
Values for the cross-validated r^{2} are reported by the
method to range from -1 to 1. Negative values indicate that
biological activity values are estimated by the mean of the
activity values better than they are by the model (i.e, the
predictions derived from the model are worse than no model). Once a
model is developed which has the highest cross-validated
r^{2}, this model is used to derive the conventional QSAR
equation and conventional r^{2} and s values. The results
of the final model are then visualized as contour maps of the
coefficients.

The first
CoMFA study reported analyzed the binding affinities of 21 steroid
structures to human corticosteroid-binding globulins and
testosterone-binding globulins. This class of compounds is rigid
and was selected to eliminate conformationally dependant effects
from the study. The models for each steroid were built from
coordinates from the Cambridge Crystallographic Database which were
minimized using the Tripos force field. Side chain positioning was
accomplished using systematic conformational searching. The Field
Fit algorithm was used to align each structure within the fixed
lattice (the 3-D grid used to calculate the CoMFA field effects).
The fit of the regression line for the predicted versus actual
binding values for the corticosteroids showed a cross-validated
r^{2} of 0.65 (conventional r^{2} = 0.897, s =
0.397). For the testosterone-binding steroids, the cross-validated
r^{2} was 0.555 (conventional r^{2} = 0.873, s =
0.453).

As noted, CoMFA starts with defined pharmacophore and overlap rules and derives a 3-D model which can be used to predict activity for new chemical entities. The Apex and Catalyst (Accelrys, Formerly Molecular Simulations Incorporated) programs are used to identify pharmacophores from databases of chemical structures and biological activity. These models are then used to predict activities for novel compounds.

Apex-3D is an automated pharmacophore identification system which can identify possible pharmacophores from a set of biologically active molecules using statistical techniques and 3-D pattern matching algorithms. The program classifies molecular structures using three methods: the agreement inductive method identifies common structural patterns in compounds having similar activity; the difference inductive method identifies structural patterns which differentiate active and inactive compounds and the concomitant variations inductive method highlights variations in structural features that explain changes in biological activity for sets of compounds.

The methods defined above follow logic similar to that used by a practicing medicinal chemist: What pharmacophoric patterns are present in the active molecule which are not present in the inactive ones? Pharmacophores are defined by different chemical centers (atom centered functional groups) and the distances between these centers. These descriptor centers can include such things as aromatic ring centers, electron donor ability, hydrogen bonding sites, lipophilic regions, and partial atomic charge. The information for each molecule is stored in a knowledge base in the form of rules which can be used to predict the activity of novel structures.

Apex-3D contains an expert system which automatically selects the best conformation and alignment for structures based on identified pharmacophores. When quantitative biological data is available, a 3D-QSAR model can be developed for any of the possible identified pharmacophores. Depending on the type of biological activity available, it is possible to identify pharmacophores for different binding orientations, receptor subtypes, or agonist versus antagonist activity.

To use this approach, the scientist is required to assign the training set of compounds to one or more activity classes Specific descriptor centers also can be defined if desired. The automated pharmacophore identification portion of the program then builds the knowledge base using the following steps:

- identify all possible binding interaction centers for each
compound in the data set;
- generate topological (2D) or topographical (3D) distance
matrices based on the set of descriptors;
- identify possible pharmacophores from all pairs of molecules
using clique selection algorithms;
- classify these pharmacophores based upon their occurance in
compounds in each activity class using Bayesian statistics and
their nonchance occurance;
- set thresholds for probability and reliability statistics
associated with a pharmacophore so that all training set molecules
are properly classified by the pharmacophore rules;
- align compounds containing high probability pharmacophores on
the pharmacophore.

Once the knowledge base has been constructed, the scientist can use it to predict biological activity for compounds not included in the training set.

The pharmacophores defined above can be used to build 3D QSAR models by correlating indexes calculated for biophore sites, secondary sites or whole molecule properties. 3-D QSAR models in Apex are generated and screened using a modified scheme of multiple linear regression analysis with variable selection. Special randomization checks are made to estimate the chance of fortuitous correlation. The steps involved are:

- Interactively select the pharmacophore(s) to use in the
analysis;
- Interactively choose parameters to include in the pool of
possible parameters to be selected by the program and
- Calculate the best 3D QSAR models for each pharmacophore using
stepwise multiple regression and analysis of statistics to assess
the validity and predictive power of the model.

At this point the knowledge base with 3D QSAR models can be used to calculate activities of novel compounds. Given a set of conformations for a novel structure, an activity range will be calculated based on each conformation which contains one of the pharmacophores and fits the 3D QSAR model.

Few
applications of Apex have appeared to date in the literature. One
example applied to nonpeptide Angiotensin II antagonists will be
discussed for its heuristic value. Several structurally diverse
compounds in this activity class have been reported in the
literature^{31-44}. Pharmacophore models have been
postulated with some disagreement about whether all of the highly
active molecules are binding at the same site^{30}.
Automated pharmacophore identification can be used to analyze these
compounds and to assess the probability that they could be acting
at the same site. A set of 55 compounds with specific binding
activity (IC_{50}) values ranging over 6 orders of
magnitude was used. Multiple conformations were included using 3D
structures whose geometries were optimized by AMPAC. Four activity
classes were defined; the most active class (< 100 nM) included
27 compounds. Apex-3D was able to generate rules which properly
classified all compounds in the most active class without false
negatives or positives. The fact that several biophores were
required is consistent with the exsistence of multiple binding
sites.

Quantitative
3D QSAR models also were developed for some of these
pharmacophores. One model which contained 48 compounds had the
following statistical parameters: predicted R^{2} = 0.83,
predicted RMSE = 0.86. Predicted activities for compounds excluded
from the training set were within statistical boundaries.

Like Apex, Catalyst generates structure-activity hypotheses from a set of molecules of various activities. Once molecular connectivity and activity values are specified for all molecules, Catalyst derives a hypotheses which consist of a set of generalized chemical functions (regions of hydrophobic surface, hydrogen bond vectors, charge centers, or other user-defined features) at specified relative positions. Up to ten hypotheses are produced and ranked by estimated statistical significance. The hypotheses can be examined graphically, fit to new molecules, or fed directly to flexible 3D database search.

In the first step of the process, a set of representative conformers is found that covers the low-energy conformational space of each molecule. Representative structures are chosen to maximize the sampling of conformational space. The second step locates a list of candidate hypotheses that are common among active and rare among inactive compounds (the presence of all identified features is not required for inclusion in the active class since it is unlikely that all active members of a training set possess all the important binding features). The cost of a hypothesis is defined as the number of bits needed to describe the hypothesis as well as the errors in activities as estimated by the hypothesis. The theory of minimum complexity estimation indicates that a predictive hypothesis will minimize this cost. Minimization is carried out over the space of hypotheses covered by the candidates identified above. Statistical significance of the results (low probability of having found a chance correlation) on a variety of medicinal training sets is verified by a non-parametric randomization test.

The Genetic
Function Approximation (GFA) algorithm is a novel technique for
constructing QSAR models^{45}. It was specifically
developed for use with data sets containing many more variables
than samples, or data sets which contain nonlinear relationships
between the variables and the activity.

GFA begins with a population of randomly-constructed QSAR models; these models are rated using an error measure which estimates each model's relative predictiveness. The population is evolved by repeatedly selecting two better-rated models to serve as parents, then creating a new child model by using terms from each of the parent models. The worst-rated model in the population is replaced by this new model. As evolution proceeds, the population becomes enriched with higher and higher quality models.

The different models are, in effect, multiple fits to the data. Scientists can then use their scientific knowledge and intuition to select among the final models. By studying the similarities and differences among the models, the scientist may be provoked to consider alternative mechanisms which explain the data or may plan new experiments to decide between different possible models. Since the models usually use only a subset of the total number of variables in the data set, GFA behaves as a member of the class of variable reduction algorithms and is most useful if the critical information in the data set is concentrated in a few variables.

Experimental results against published data sets demonstrate that GFA discovers models which are comparable to, and in some cases superior to, models discovered using standard techniques such as stepwise regression, linear regression, or partial least-squares regression. As the number of computational and instrumentational sources of experimental data increase, the ability of GFA to perform variable reduction, to discover nonlinear relationships, and to present the user with multiple models representing multiple interpretations of the data set, may become increasingly useful in data analysis.

There are many additional properties beyond binding which contribute to the biological activity of a molecule including transportation, distribution, metabolism and elimination. Since the techniques described above address binding and typically do not account for any of these other processes, they often fail. However, one should note that their utility is one of efficiency and probability. They are generally more successful and efficient than are the ad hoc/intuitive methods. Because no single method works for all cases, many groups are examining alternative approaches to developing SAR equations including electronic based descriptors and topological indicies.

Peter
Goodford has reported an energy based grid approach to compound
design. The GRID program^{46} is a computational procedure
for detecting energetically favorable binding sites on molecules of
known structure. It has been used to study arrays of molecules in
membranes and crystals or proteins. The energies are calculated as
the electrostatic, hydrogen-bond and Lennard-Jones interactions of
a specific probe group with the target structure. GRID has also
been used to distinguish between selective binding sites for
different probes.

In general,
topological approaches start with a very different graph theoretic
representation of molecular structure in which atoms are
represented by vertexes and bonds by edges. Numerical indexes for
the structure are then defined which abstract information including
atom descriptors (atom types, atomic weight, atomic number, ratio
of valance electrons to core electrons, etc.) and sub-group
descriptors. In addition, a series of indexes (the indexes) are
developed which describe the molecule as a set of fragments of
varying size and complexity. Regression analysis of equations which
have chi indexes as parameters has been used to correlate chemical
structure to physicochemical behavior in applications such as
chromatographic retention times, molar refractivity, ionization
potential and heats of atomization. While the application of
topological indicies has been widely reported in
QSAR^{47-49}, the utility of these methods generally has
been limited to predicting structure-property relationships for
polymers and hydrocarbons.

## SUMMARY

Developing a quantitative structure activity relationship is difficult. Molecules are typically flexible and it is possible to compute many possibly useful properties that might relate to activity. Early in a research program there are typically few compounds to model. Thus we have a few compounds in a very high dimensional descriptor space. Which are the important variables and how do we optimize them? It is clear that many training compounds need to span through the space and model fitting techniques need to address not only deriving a fit, but the predictive quality of the fit. While these methods have not discovered a new compound, they have aided scientists in examining the volumes of data generated in a research program. As the methods evolve, they will find broader application in areas such as combinatorial chemistry.

## REFERENCES

- Christopher S.J. Walpole, Roger Wrigglerworth, Stuart Bevan,
Elizabeth A. Campbell, Andy Dray, Iain F. James, Kay J. Masdin,
Martin N. Perkins and Janet Winter,
*J. Med. Chem.*,**36**, 2381 (1993). - John G. Topliss, Utilization of Operational Schemes for Analog
Synthesis in Drug Design,
*J. Med. Chem.*,**15**, 1006 (1972). - Yvonne C. Martin, A Practitioner's Perspective of the Role of
Quantitative Structure Activity Analysis in Medicinal Chemistry,
*J. Med. Chem.*,**24**, 229 (1981). - John G. Topliss,
__Quantitative Structure-Activity Relationships of Drugs__, Academic Press, New York, (1983). - Franke, R.,
__Theoretical Drug Design Methods__, Elsevier, Amsterdam, (1984). - Seydel, J.K.,
__QSAR and Strategies in the Design of Bioactive Compounds__, VCH, Weinheim, (1985). - Yvonne C. Martin,
*Accounts of Chem. Res.*,**19**, 392 (1986). - Corwin Hansch,
*Accounts of Chem. Res.*,**2**, 232 (1969). - Corwin Hansch and Carlo Silipo,
*J. Amer. Chem. Soc.*,**97**, 6849 (1975). - Robert F. Gould (ed.),
__Biological Correlations -- The Hansch Approach__, Advances in Chemistry Series,**No. 114**, American Chemical Society, Washington, D.C., 1972. - Yvonne C. Martin,
__Quantitative Drug Design__, Marcel Dekker, New York, (1978). - Spencer M. Free and James W. Wilson, A Mathematical
Contribution to Structure-Activity Studies,
*J. Med. Chem.*,**7**, 395 (1964). - Svante Wold, Pattern Recognition by Means of Disjoint Principal
Components Models,
*Pattern Recognition*,**8**, 127 (1976). - A.J. Stuper, W.E. Brugger and P.C. Jurs,
__Chemometrics: Theory and Application__, B.R. Kowalski (Ed.), American Chemical Society, Washington, D.C., 1977. - G. Klopman and M.I. Dimayuga,
*J. Comput.-Aided Mol. Design*,**4**, 117 (1990). - J. McFarland,
*J. Med. Chem.*, - Peter C. Jurs, Chemometrics and Multivariate Analysis in
Analytical Chemistry, in
__Reviews in Computational Chemistry, Volume 1__, K.B. Lipkowitz and B.B. Boyd (Eds.), VCH Publishers, Inc., New York, 1990. - Peter C. Jurs, J. T. Chou and M. Yuan,
*J. Med. Chem.*,**22**, 476 (1979). - Lemont B. Kier,
__Molecular Orbital Theory in Drug Research__, Academic Press, New York, NY, 1971. - Gordon M. Crippen,
*J. Med. Chem.*,**22**, 988 (1979). - M. Mabilia, R.A. Pearlstein and A.J. Hopfinger, Computer
Graphics in Molecular Shape Analysis, in
__Molecular Graphics and Drug Design__, A.S.V. Burgen, G.C.K. Roberts and M.S. Tute (Eds.), Elsevier Science Publishers, Amsterdam, 1986. - H. Weinstein, in
__Chemical Applications of Molecular Electrostatic Potentials__, Peter Politzer and Donald G. Truhlar (Eds.), Plenum Press, New York, NY, 1981. - R. Potenzone, E. Cavicchi, H.J.R. Weintraub and A.J. Hopfinger,
*Comput. Chem.*,**1**, 187 (1977). - G.R. Marshall, C.D. Barry, H.E. Bosshard, R.A.. Dammkoehler and
D.A. Dunn, The Conformational Parameter in Drug Design: The Active
Analog Approach, in
__Computer Assisted Drug Design__, ACS Symposia,**112**, E.C. Olson and R.E. Christofferson (Eds.), American Chemical Society, Washington D.C., 1979 - Anton J. Hopfinger,
*J. Amer. Chem. Soc.*,**102**, 7196 (1980). - Richard D. Cramer III, David E. Patterson and Jeffrey D. Bunce,
*J. Amer. Chem. Soc.*,**110**, 5959 (1988). - Components are defined as the linear combination of all
parameters (the independent data) which are used to fit the shape
of the parameters to the shape of the activity data (the dependent
data).
- V. E. Golender and A. B. Rozenblit, Logical Structural Approach
to Computer Assisted Drug Design, in
__Drug Design, Vol IX__, Academic Press (1980). - V. E. Golender and A. B. Rozenblit,
__Logical and Combinatorial Algorithms for Drug Design__, Research Studies Press, UK (1983). - V. E. Golender and E. R. Vorpagel, Computer Assisted
Pharmacophore Identification, in
__3D QSAR in Drug Design: Theory, Methods and Applications__, ESCOM Science Publishers, Netherlands (1993). - A. P. Thomas, C. P. Allott, et. al., "New Nonpeptide
Angiotensin II Receptor Antagonists. 1. Synthesis, Biological
Properties, and Structure-Activity Relationships of 2-Alkyl
Benzimidazole Derivatives",
*J. Med. Chem.*,**35**, 877-885 (1992). - R. H. Bradbury, C. P. Allott, et. al., "New Nonpeptide
Angiotensin II Receptor Antagonists. 2. Synthesis, Biological
Properties, and Structure-Activity Relationships of
2-Alkyl-4-(biphenylylmethoxy)quinoline Derivatives",
*J. Med. Chem.*,**35**, 4027-4038 (1992). - R. H. Bradbury, C. P. Allott, et. al., "New Nonpeptide
Angiotensin II Receptor Antagonists. 3. Synthesis, Biological
Properties, and Structure-Activity Relationships of
2-Alkyl-4-(biphenylylmethoxy)pyridine Derivatives",
*J. Med. Chem.*,**36**, 1245-1254 (1993). - P. R. Bovy, D. B. Reitz, et. al., "Nonpeptide Angiotensin II
Antagonists: N-Phenyl-1H-pyrrole Derivatives Are Angiotensin II
Receptor Antagonists",
*J. Med. Chem.*,**36**, 101-110 (1993). - N. B. Mantlo, P. K. Chakravarty, et. al., "Potent, Orally
Active Imidazo[4,5-b] pyridine-Based Angiotensin II Receptor
Antagonists",
*J. Med. Chem.*,**34**, 2919-2922 (1991). - W. T. Ashton, C. L. Cantone, et. al., "Nonpeptide Angiotensin
II Antagonists Derived from 4H-1,2,4-Triazoles and
3H-Imidazo[1,2-b][1,2,4]triazoles",
*J. Med. Chem.*,**36**, 591-609 (1993). - K. S. Atwal, S. Z. Ahmed, et. al., "Dihydro pyrimidine
Angiotensin II Receptor Antagonists",
*J. Med. Chem.*,**35**, 4751-4763 (1992). - J. V. Duncia, A. T. Chiu, et. al., "The Discovery of Potent
Nonpeptide Angiotensin II Receptor Antagonists: A New Class of
Potent Antihypertensives",
*J. Med. Chem.*,**33**, 1312-1329 (1990). - B. De, M. Winn, et. al., "Discovery of a Novel Class of Orally
Active, Non-Peptide Angiotensin II Antagonists",
*J. Med. Chem.*,**35**, 3714-3717 (1992). - K. Kybo, Y. Kohara, et. al., "Nonpeptide Angiotensin II
Receptor Antagonists. Synthesis and Biological Activity of
Potential Prodrugs of Benzimidazole-7-carboxylic Acids",
*J. Med. Chem.*,**36**, 2343-2349 (1993). - I. Sircar, J. C. Hodges, et. al., "Nonpeptide Angiotensin II
Receptor Antagonists. 2. Design, Synthesis, and Structure-Activity
Relationships of 2-Alkyl-4-(1H-pyrrol-1-yl)-1H-imidazole
Derivatives: Profile of
2-Propyl-1-[[2'-(1H-tetrazol-5-yl)-[1,1'-biphenyl]-4-yl]-methyl]-4-[2-(trifluoroacetyl)-1H-pyrrol-1-yl]-1H-imidazole-5-carboxylic
Acid (CI-996)",
*J. Med. Chem.*,**36**, 2253-2265 (1993). - J. Weinstock, R. M. Keenan, et. al.,
"1-(Carboxybenzyl_imidazole-5-acrylic Acids: Potent and Selective
Angiotensin II Receptor Antagonists",
*J. Med. Chem.*,**34**, 1514-1517 (1991). - R. M. Keenan, J. Weinstock, et. al., "Imidazole-5-acrylic
Acids: Potent Nonpeptide Angiotensin II Receptor Antagonists
Designed Using a Novel Peptide Pharmacophore Model",
*J. Med. Chem.*,**35**, 3858-3872 (1992). - R. M. Keenan, J. Weinstock, et. al., "Potent Nonpeptide
Angiotensin II Receptor Antagonists. 2.
1-(Carboxybenzyl)imidazole-5-acrylic Acids",
*J. Med. Chem.*,**36**, 1880-1892 (1993). - D. Rogers and A. J. Hopfinger, "Applications of Genetic
Function Approximation (GFA) to Quantitative Structure-Activity
Relationships (QSAR) and Quantitative Structure Property
Relationships (QSPR)",
*J. Chem. Inf. Comp. Sci.*,**34**, 854-866 (1994). - Peter J. Goodford,
*J. Med. Chem.*,**28**, 849 (1985). - A.T. Balaban, Applications of Graph Theory in Chemistry,
*J. Chem. Inf. Comput. Sci.*,**25**, 334 (1985). - Lowell H. Hall and Lemont B. Kier, The Molecular Connectivity
Chi Indexes and Kappa Shape Indexes in Structure-Property Modeling,
in
__Reviews in Computational Chemistry, Volume 2__, K.B. Lipkowitz and D.B. Boyd (Eds.), VCH Publishers, Inc., New York, 1991. - I.B. Bersuker and A.S. Dimoglo, The Electron-Topological
Approach to the QSAR Problem, in
__Reviews in Computational Chemistry, Volume 2__, K.B. Lipkowitz and D.B. Boyd (Eds.), VCH Publishers, Inc., New York, 1991.

NetSci, ISSN 1092-7360, is published by Network Science Corporation. Except where expressly stated, content at this site is copyright (© 1995 - 2010) by Network Science Corporation and is for your personal use only. No redistribution is allowed without written permission from Network Science Corporation. This web site is managed by:

- Network Science Corporation
- 4411 Connecticut Avenue NW, STE 514
- Washington, DC 20008
- Tel: (828) 817-9811
- E-mail: TheEditors@netsci.org
- Website Hosted by Total Choice