APEX-3D Expert System for Drug Design

Valery Golender and Boris Vesterman
DCL Systems International Ltd.
20 Galgalei Haplada St.
POB 544
Herzlia, 46105, Israel

Erich Vorpagel
BIOSYM/Molecular Simulations Inc.
9685 Scranton Road
San Diego, CA 92121-3752



http://www.netsci.org/Science/Compchem/feature09.html

Introduction

Apex-3D is an expert system developed to represent, elucidate, and utilize knowledge of structure-activity relationships. It can be used to build 3D-SAR and 3D-QSAR models for activity classification and prediction. The general principle of operation is based on emulating the intelligence of a researcher engaged in establishing relationships between a compound's structural parameters and its activity. The cornerstone of the Apex-3D methodology is automated identification of biophores (pharmacophores) [1-3]. These biophores can be used for building qualitative activity prediction rules and for creating search queries to identify new leads in a 3D database. When good quantitative data are available, identified biophores can also serve as starting points for constructing 3D-QSAR models. The combination of a 3D pharmacophore with a quantitative regression equation is unique to the Apex-3D approach. Prediction of activity for a novel compound requires that the biophore be present; the activity level is then calculated from the QSAR equation. This paper describes the underlying principles of the system.

Apex-3D Architecture

The overall architecture of Apex-3D is presented in Figure 1.  

Description of the main functionality of modules is given below.

Computational Chemistry Module. This module performs computation of the quantum-chemical and other atomic and molecular indexes. In addition it performs clustering of conformers for flexible compounds.

Data Management Module. Apex-3D has an internal database system based on the RAIMA DBMS [4]. It provides storage of 3D structures, structural parameters, and activity data. Apex-3D's database is interfaced with external chemical databases through files.

Frame System. Apex-3D uses frames for the representation of chemical information. The notion of frames was first proposed by Marvin Minsky in the early 1970s [5]. A frame represents an object as a group of attributes, and each attribute of a particular frame is stored in a separate slot. In Apex-3D, the frame system is implemented using a built-in ChemLisp language interpreter. ChemLisp is a special dialect of LISP (LISt Processor) [6], a language widely used in artificial intelligence systems. ChemLisp accesses the main data structures and the modules containing basic algorithmic functions. Frames in Apex-3D are LISP expressions of the following form:

(frame framename
   (type frametype)
   (slots 
        ((slotname slotvalue)...)))

Slots can represent variables of different types (integer, real, symbol, interval) or LISP functions used in basic algorithms. Frames are employed in Apex-3D for generating descriptors, pseudoatoms, and setting basic algorithmic parameters.
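
To make the frame idea concrete outside ChemLisp, the following minimal Python sketch models a frame as a named object with typed slots and renders it in the layout shown above. The class name `Frame`, the method `to_sexpr`, and the example slot names and values are hypothetical and are not part of Apex-3D.

```python
from dataclasses import dataclass, field
from typing import Any, Dict


@dataclass
class Frame:
    """A frame: a named object whose attributes live in separate slots."""
    name: str
    frame_type: str
    slots: Dict[str, Any] = field(default_factory=dict)

    def to_sexpr(self) -> str:
        """Render the frame in the (frame ... (type ...) (slots (...))) layout."""
        slot_text = " ".join(f"({key} {value})" for key, value in self.slots.items())
        return f"(frame {self.name} (type {self.frame_type}) (slots ({slot_text})))"


# Hypothetical frame; slot names and values are illustrative only.
ring_center = Frame(
    name="ring-center",
    frame_type="pseudoatom",
    slots={"pattype": "set-of-atoms", "tolerance": 0.1},
)
print(ring_center.to_sexpr())
```

A dictionary keeps each attribute in its own slot, which mirrors the one-attribute-per-slot convention described above; slot values that are functions could be stored as Python callables in the same way.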

Rule Management Module. Structure-activity relationships in the system are represented by rules of the following form:

Qualitative rules:

IF structure S contains the biophoric pattern B,
THEN it possesses the activity A with probability P.

Quantitative rules:

IF structure S contains the biophoric pattern B 
   having an associated QSAR model A=F(B,S),
THEN it possesses activity A calculated using the model.

Rules are stored in the system's Knowledge Base.
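
The two rule forms can be sketched in Python as follows. This is only an illustration of the IF/THEN structure: the class names, the `Structure` stand-in, the biophore label, and the coefficients of the example model are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional, Set, Tuple


@dataclass
class Structure:
    """Hypothetical stand-in for a compound: its biophores and descriptors."""
    name: str
    biophores: Set[str]
    descriptors: Dict[str, float]


@dataclass
class QualitativeRule:
    biophore: str
    activity: str
    probability: float

    def apply(self, s: Structure) -> Optional[Tuple[str, float]]:
        # IF S contains B THEN it possesses activity A with probability P.
        if self.biophore in s.biophores:
            return (self.activity, self.probability)
        return None


@dataclass
class QuantitativeRule:
    biophore: str
    model: Callable[[Structure], float]  # the QSAR model A = F(B, S)

    def apply(self, s: Structure) -> Optional[float]:
        # IF S contains B THEN activity is calculated using the model.
        if self.biophore in s.biophores:
            return self.model(s)
        return None


# Made-up rules and compound, for illustration only.
qual = QualitativeRule("B1", "antagonist", 0.9)
quant = QuantitativeRule("B1", lambda s: 2.0 + 0.5 * s.descriptors["hydrophobicity"])
compound = Structure("cmpd-1", {"B1"}, {"hydrophobicity": 3.0})

print(qual.apply(compound))
print(quant.apply(compound))
```

Both rule types return `None` when the biophore is absent, reflecting the requirement that a prediction is made only for structures containing the biophoric pattern.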

Inductive Inference Module. This module performs generation of rules on the basis of structure and activity data. It is based on algorithms of the logical structural approach [1] and provides tools for automated selection of biophores (pharmacophores) and interactive building of 3D-QSAR models. The module performs statistical evaluation of the predictive and discriminating power of selected biophores and models.

Deductive Inference Module. This module performs prediction of activity based on the rules stored in the Knowledge Base. It provides the following functions:

  • Statistical prediction of activity type and level.

  • Explanation of predictions.

  • 3D graphic display of biophores found in the analyzed structures.

  • Superimposition of compounds with the same biophore.

Query Generator. Apex-3D generates biophore queries for MDL Information Systems' databases [7] in order to find new compounds satisfying the biophore definition. The query generator converts the internal biophore representation to the query format using several built-in rules. Database hits can be loaded into Apex-3D and processed using the activity prediction module to eliminate any false positive hits resulting from the approximate nature of the biophore query.

Integration with the molecular modeling and molecular graphics system Insight II [8]. Apex-3D is fully integrated into the Insight II environment [9]. Elements of the user interface in the latest 95.0 release are written using Insight II Open Interface libraries. Apex-3D uses Insight II functionality for:

  • Drawing 2D molecular sketches and converting them to 3D structures,

  • Generation of multiple conformers,

  • Graphical display of chemical structures and molecular superimpositions.

The general principles of basic Apex-3D algorithms are discussed below.

Chemical Structure Representation

Representation of chemical structure in Apex-3D is based on the concept of a descriptor center that represents a part of the hypothetical biophoric moiety capable of interacting with a receptor. Descriptor centers can be either atoms or pseudoatoms which can participate in ligand-receptor interactions based on the following types of physical properties:

  • Electrostatic interactions

  • Hydrogen bonds

  • Charge-transfer complexes

  • Hydrophobic interaction

  • van der Waals (or London) dispersion forces

These physical properties correlate with certain structural indexes including:

  • Quantum-chemical indexes derived from MOPAC 6.0 [10] calculations: atomic point charges, pi-populations, electron donor and acceptor indexes, HOMO and LUMO coefficients,

  • Atomic contributions to hydrophobicity [11],

  • Atomic contributions to molar refractivity [11].

Descriptor centers are composed of a combination of atom type and an atomic property. Atom types can be a set of atoms (for example, all hydrogen-bond donating groups), pseudoatoms (for example, aromatic ring centers), or molecular fragments. Atom types are defined using a structural language, SLang. SLang is a line notation similar to SMILES and SMARTS [12] providing the capability to identify combinations of fragments with similar pharmacophoric properties.

Sample SLang patterns include:


[N,P,O,S,F,CL,BR,I]                           heteroatom
CH(-NH)(-C=O)-[C,H] or CH(@N-C)(?C=O)@C       C-alpha in peptide unit
A01@[A,{4,5}]@A01                             5,6-membered ring
A01:[A,{5}]:A01                               6-membered aromatic ring
C1:N0H:N:N:N:C1                               tetrazole ring


Atom-based descriptor centers used for biophore identification are specified by setting the values of the following slots (done interactively by filling in table cells):


Name          The descriptor center name 
Pattern       SLang pattern defining atomic environment
Index         An atomic property index used for atom matching
TolA          Absolute tolerance for atomic property matching
TolR          Relative tolerance for atomic property matching
LowLimit      Lowest atomic property value considered in matching
HighLimit     Highest atomic property value considered in matching
Condition     An expression for filtering property values considered 
               in matching
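
The slot table above does not state how the tolerances and limits are combined during matching. The sketch below therefore makes an explicit assumption: two property values match when both lie within [LowLimit, HighLimit] and their difference does not exceed the larger of the absolute tolerance and the relative tolerance scaled by the larger magnitude. The class and method names are hypothetical, and the Condition slot is omitted.

```python
from dataclasses import dataclass


@dataclass
class DescriptorCenter:
    name: str          # Name slot
    pattern: str       # SLang pattern (stored only, not parsed here)
    index: str         # atomic property used for matching
    tol_a: float       # TolA
    tol_r: float       # TolR
    low_limit: float   # LowLimit
    high_limit: float  # HighLimit

    def in_range(self, value: float) -> bool:
        return self.low_limit <= value <= self.high_limit

    def values_match(self, a: float, b: float) -> bool:
        # ASSUMPTION: tolerances combine as max(TolA, TolR * max(|a|, |b|)).
        if not (self.in_range(a) and self.in_range(b)):
            return False
        allowed = max(self.tol_a, self.tol_r * max(abs(a), abs(b)))
        return abs(a - b) <= allowed


# Illustrative values only.
center = DescriptorCenter(
    name="hetero_charge",
    pattern="[N,P,O,S,F,CL,BR,I]",
    index="charge",
    tol_a=0.05,
    tol_r=0.1,
    low_limit=-1.0,
    high_limit=0.0,
)
print(center.values_match(-0.40, -0.43))
print(center.values_match(-0.40, -0.50))
```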


Apex-3D recognizes the following pseudoatoms:

  • Ring center (CRC)

  • Pair of points orthogonal to ring plane (PPP)

  • Hydrogen bond site (single point representing averaged position of lone-pair electrons) (HST)

  • Set of points corresponding to a hydrogen-bond-donating atom on the receptor that interacts with an N, O, or S atom on the ligand (HBD)

  • Set of points corresponding to a hydrogen-bond-accepting atom on the receptor that interacts with an NH, OH, or SH group on the ligand (HBA)

Definitions of HBA and HBD pseudoatoms represent generalizations of those given by Y. Martin and coworkers [13]. Procedures for generating the pseudoatoms are written using a combination of ChemLisp and SLang languages. It is possible to use ChemLisp for defining new pseudoatom types without recompiling the software. The pseudoatom frame has the following form:


(frame pseudoatom-prototype
   (type frametype)                         ; definition of frametype
   (slots (
      (name symbol)                         ; pseudoatom name
      (pattype symbol : atom set-of-atoms)  ; defines pseudoatom as
                                            ; an atom, or a set of atoms 
      (comment string)
      (select form)                         ; lisp-procedure for 
                                            ; identifying real atoms 
                                            ; defining the pseudoatom
      (actions form)                        ; lisp-procedure for
                                            ; assignment of pseudoatom 
                                            ; properties
      (distance2d symbol)                   ; lisp function for defining 
                                            ; a pseudobond between real 
                                            ; and pseudoatoms 
   ))
)


All descriptor center information is stored by Apex-3D in two matrices:

  1. The Property Matrix stores the structural indexes for all descriptor centers identified in a given structure.

  2. The Distance Matrix stores the distances between all pairs of descriptor centers.

Data from these matrices are used to define biophores, each of which corresponds to a subset of the matrices common to several compounds.
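
As a rough illustration of the two matrices, the Python sketch below builds both from a list of descriptor centers. The `Center` tuple layout, the center labels, the coordinates, and the index values are all made up for the example and do not reflect Apex-3D's internal data structures.

```python
import math
from typing import Dict, List, Sequence, Tuple

# (label, xyz coordinates, structural indexes); hypothetical layout.
Center = Tuple[str, Tuple[float, float, float], Dict[str, float]]


def property_matrix(centers: Sequence[Center], indexes: Sequence[str]) -> List[List[float]]:
    """One row per descriptor center, one column per structural index."""
    return [[values[ix] for ix in indexes] for _, _, values in centers]


def distance_matrix(centers: Sequence[Center]) -> List[List[float]]:
    """Symmetric matrix of distances between all pairs of descriptor centers."""
    n = len(centers)
    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = math.dist(centers[i][1], centers[j][1])
            dist[i][j] = dist[j][i] = d
    return dist


centers = [
    ("N1", (0.0, 0.0, 0.0), {"charge": -0.35, "hydrophobicity": -0.60}),
    ("O7", (3.0, 4.0, 0.0), {"charge": -0.42, "hydrophobicity": -0.30}),
    ("CRC", (0.0, 0.0, 12.0), {"charge": 0.00, "hydrophobicity": 1.20}),
]
prop_matrix = property_matrix(centers, ["charge", "hydrophobicity"])
dists = distance_matrix(centers)
print(prop_matrix)
print(dists)
```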

Biophore Identification Algorithm

Intuitively a biophore is understood to be the spatial and electronic pattern of elements responsible for receptor recognition and activation. Automated identification of biophores in Apex-3D incorporates the following elements:

  1. Structural Elements: defining pharmacophoric centers which interact with receptors, electronic and structural indexes quantifying ligand-receptor interaction effects, distance relationships between pharmacophoric centers forming unique recognizable patterns.

  2. Statistical Criteria: assessing the probability of correct activity prediction for compounds possessing a certain biophore.

A biophore example is shown in Figure 2.  

The automated identification of biophoric patterns according to the logico-structural approach involves the following steps:

  1. Separation of training set compounds into activity classes according to their activity type or level.

  2. Generation of a representative set of conformers for flexible compounds using a combination of such molecular modeling techniques as distance geometry, force field and quantum chemical geometry optimization, systematic conformational search, and molecular dynamics, all followed by conformer clustering [14].

  3. Generation of structural representations of compounds based on property and distance matrices.

  4. Identification of common structural patterns (features) in all pairs of compounds belonging to a given activity class using a clique selection algorithm [1]. Two biophore extraction algorithms for multiple conformers are available:
    • Exhaustive search: Matches all possible pairs of conformers of all compounds.

    • Fast search: Matches all conformers of each compound with only one conformer of all other compounds.

    (The fast search algorithm produces significantly fewer biophores, in many cases without losing the most frequently occurring biophoric patterns.)



  5. Calculation of the number of occurrences of all identified structural patterns (features) among compounds from each activity class of the analyzed data sets. These occurrence numbers are used to calculate statistical estimates of features [1]:
    • The probability that novel compounds having a given feature will belong to a certain activity class.

    • The reliability calculated as the probability of non-chance occurrence of the feature.



  6. Identification of biophores. Biophores are selected as features having both probability and reliability higher than certain thresholds. These thresholds are established during training of the activity prediction system.

  7. Optimization of molecular superposition of compounds possessing a common biophore. This procedure allows determination of superposition quality and can be used to filter biophores producing bad alignment [9].

  8. Prediction of biological activity of novel compounds which have been synthesized, or suggested for synthesis, based on the identified biophores.

  9. Generation of molecular database queries on the basis of selected biophores.
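
Step 4 above refers to a clique selection algorithm whose details are given in [1] and not in this paper. The sketch below is a stand-in, not the Apex-3D algorithm: it builds a correspondence graph between the descriptor centers of two compounds and enumerates maximal cliques with the basic Bron-Kerbosch procedure. The function name, the matching of centers by type label only, the distance tolerance, and the example data are all assumptions.

```python
from itertools import combinations
from typing import List, Sequence, Tuple

Pair = Tuple[int, int]


def common_patterns(types_a: Sequence[str], dist_a: Sequence[Sequence[float]],
                    types_b: Sequence[str], dist_b: Sequence[Sequence[float]],
                    tol: float = 0.3, min_size: int = 3) -> List[List[Pair]]:
    """Return maximal sets of center pairings with mutually consistent distances."""
    # Node (i, j): center i of compound A paired with center j of compound B.
    nodes = [(i, j) for i, ta in enumerate(types_a)
             for j, tb in enumerate(types_b) if ta == tb]
    adj = {node: set() for node in nodes}
    for (i, j), (k, l) in combinations(nodes, 2):
        # Two pairings are compatible if the inter-center distances agree.
        if i != k and j != l and abs(dist_a[i][k] - dist_b[j][l]) <= tol:
            adj[(i, j)].add((k, l))
            adj[(k, l)].add((i, j))

    cliques: List[List[Pair]] = []

    def bron_kerbosch(r: set, p: set, x: set) -> None:
        if not p and not x:
            if len(r) >= min_size:
                cliques.append(sorted(r))
            return
        for v in list(p):
            bron_kerbosch(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}

    bron_kerbosch(set(), set(nodes), set())
    return cliques


# Made-up descriptor centers and distance matrices for two compounds.
types_a = ["N", "O", "CRC", "C"]
dist_a = [[0.0, 3.0, 5.0, 7.0],
          [3.0, 0.0, 4.0, 6.0],
          [5.0, 4.0, 0.0, 2.5],
          [7.0, 6.0, 2.5, 0.0]]
types_b = ["O", "CRC", "N"]
dist_b = [[0.0, 4.1, 2.9],
          [4.1, 0.0, 5.2],
          [2.9, 5.2, 0.0]]

patterns = common_patterns(types_a, dist_a, types_b, dist_b)
print(patterns)
```

In this toy case the three shared centers (N, O, and the ring center) form a single clique, so the common pattern pairs center 0 of A with center 2 of B, center 1 with center 0, and center 2 with center 1. A fuller version would also compare atomic property values within their tolerances before pairing centers.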

3D-QSAR Active Site Models

3D-QSAR model building in Apex-3D allows identification of potential interaction sites in ligand molecules and correlation of physicochemical properties of these sites and global molecular properties with available quantitative biological data. Ligand active sites are centered on atoms and are divided into two groups:

  1. Biophore Sites: centers of specific ligand-receptor interactions participating in biophore definition and present in all analyzed molecules.

  2. Secondary Sites: centers of specific ligand-receptor interactions that may be present in only a subset of the analyzed structures and allow mapping of secondary receptor pockets which modify ligand activity.

Such subdivision of active site groups allows one to tailor 3D-QSAR complexity to available data. Models based only on biophore sites are more robust and less influenced by conformational uncertainties. Introduction of secondary sites usually requires more extensive molecular modeling to specify proper flexible tail positions.

Model parameters are based on an active site model and structural indexes calculated in Apex-3D's Computational Chemistry module. The calculated atomic properties are rounded off before use, based on an estimated parameter error. This helps avoid chance correlations based on insignificant variability in the property. Parameters are divided into the following three groups:

  1. Biophore site indexes
    • Charge, pi-population, electron donor index, electron acceptor index,

    • HOMO, LUMO, atomic hydrophobicity, atomic refractivity



  2. Secondary site indexes for the following types of secondary sites:
    • H-acceptors (presence, pi-population, charge, electron donor, hydrophobicity, refractivity)

    • H-donors (presence, pi-population, charge, hydrophobicity, refractivity)

    • Heteroatoms (presence, pi-population, charge, electron donor, hydrophobicity, refractivity, formal charge)

    • Hydrophobic (presence, pi-population, charge, electron donor, hydrophobicity, refractivity)

    • Steric (presence, pi-population, charge, electron donor, hydrophobicity, refractivity, formal charge)

    • Ring centers (presence, size, number of pi-electrons)



  3. Global molecular properties
    • Total hydrophobicity, total hydrophobicity squared, and total refractivity are calculated automatically from atomic increments.

    • User-supplied molecular properties such as molecular volumes, free energies, solvation energies, etc., (entered using Apex-3D's user interface).
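
The rounding of calculated atomic properties mentioned before the parameter list is not specified further in this paper. One simple reading, offered here only as an assumption, is to round each value to the nearest multiple of the estimated parameter error; the function name `round_to_error` and the numbers are illustrative.

```python
def round_to_error(value: float, error: float) -> float:
    """Round value to the nearest multiple of the estimated parameter error."""
    if error <= 0:
        raise ValueError("error must be positive")
    return round(value / error) * error


# Two charges that differ by less than the assumed error of 0.05
# collapse to the same rounded value.
charge_1 = round_to_error(-0.4273, 0.05)
charge_2 = round_to_error(-0.4391, 0.05)
print(charge_1, charge_2)
```

With this scheme, variation smaller than the error estimate can no longer contribute to a regression, which is the stated purpose of the rounding step.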



Positions of the secondary sites are selected from the positions of atoms in the superimposed molecules. An atom in a molecule occupies the secondary site if its distance from the site position is less than the user-specified site radius. To select only reasonable secondary sites, it is desirable to specify a site occupancy threshold: the minimal number of compounds that must occupy a position before it can be included as a site. The different atom classes used as secondary sites have been grouped to help associate chemical properties with activity. Rules describing the separation of site atoms into the classes described above are written in the ChemLisp language. This allows specification of Hydrophobic sites as carbon atoms which are part of a hydrophobic alkyl chain or aromatic ring, and Steric sites as any non-hydrogen atom. Any atom center used in the biophore definition is excluded from being a secondary site point.
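
The site radius and occupancy threshold can be illustrated with a short Python sketch. It assumes the molecules are already superimposed and are given simply as lists of atom coordinates; the function names and the coordinates are hypothetical, and atom classes and the exclusion of biophore atoms are left out.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]


def site_occupancy(site: Point, molecules: Sequence[Sequence[Point]], radius: float) -> int:
    """Number of molecules with at least one atom closer than radius to the site."""
    return sum(
        1 for atoms in molecules
        if any(math.dist(site, xyz) < radius for xyz in atoms)
    )


def select_sites(candidates: Sequence[Point], molecules: Sequence[Sequence[Point]],
                 radius: float, occupancy_threshold: int) -> List[Point]:
    """Keep candidate positions occupied by at least occupancy_threshold molecules."""
    return [pos for pos in candidates
            if site_occupancy(pos, molecules, radius) >= occupancy_threshold]


# Three made-up superimposed molecules.
molecules = [
    [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0)],
    [(0.2, 0.0, 0.0), (5.0, 5.0, 5.0)],
    [(0.1, 0.1, 0.0), (1.6, 0.0, 0.0)],
]
candidates = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (5.0, 5.0, 5.0)]
sites = select_sites(candidates, molecules, radius=0.5, occupancy_threshold=2)
print(sites)
```

Here the position occupied by only one molecule is rejected, while the two positions shared by at least two molecules are kept as secondary sites.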

Secondary sites serve three primary purposes:

  1. Identify possible extensions of the biophore common to the compounds in the model (for example, a region of space relative to the biophore with additional hydrogen-bond interactions which increase activity).

  2. Identify steric interference: regions of space which, when occupied by the ligand, decrease activity.

  3. Identify hydrophobic pockets (for example, regions of space which, when occupied by hydrophobic groups in the ligand, increase activity).

The biophore chosen for 3D-QSAR model building serves as a reference for superimposing the ligands. Biophore sites may also contribute quantitatively to the 3D-QSAR model as additional parameters.

A 3D-QSAR model example is shown in Figure 3.

The 3D-QSAR model building procedure involves the following steps:

  1. Automated selection of biophores.

  2. Optimization of superimposition of compounds sharing a common biophore.

  3. Interactive specification of the 3D-QSAR model parameters based on physicochemical properties of biophoric features, secondary sites, and global molecular properties.

  4. Calculation of the best 3D-QSAR model for the selected biophores.

  5. Selection of the multiple regression equation using modified stepwise multiple regression based on the PRESS statistic. In addition to the automated variable selection, the user can interactively check inclusion of certain variables.

  6. Estimation of non-randomness and predictive power of obtained models and filtering out unreliable models.
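
Step 5 relies on the PRESS statistic (predicted residual sum of squares). The modified stepwise procedure used by Apex-3D is not described in this paper, so the sketch below shows only the underlying idea under simplifying assumptions: leave-one-out PRESS for a one-variable linear model, and a single selection step that picks the variable with the lowest PRESS. Function names, variable names, and data are made up.

```python
from typing import Dict, List, Tuple


def fit_line(xs: List[float], ys: List[float]) -> Tuple[float, float]:
    """Least-squares slope and intercept for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def press(xs: List[float], ys: List[float]) -> float:
    """Leave-one-out predicted residual sum of squares."""
    total = 0.0
    for i in range(len(xs)):
        # Refit without compound i, then predict compound i.
        slope, intercept = fit_line(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
        total += (ys[i] - (slope * xs[i] + intercept)) ** 2
    return total


def best_single_variable(columns: Dict[str, List[float]], ys: List[float]) -> str:
    """One selection step: the variable whose model has the lowest PRESS."""
    return min(columns, key=lambda name: press(columns[name], ys))


activity = [1.0, 2.1, 2.9, 4.2, 5.0]
columns = {
    "charge": [1.0, 2.0, 3.0, 4.0, 5.0],
    "refractivity": [3.0, 1.0, 4.0, 1.0, 5.0],
}
best = best_single_variable(columns, activity)
print(best)
```

Because each compound is predicted by a model that never saw it, PRESS penalizes variables that only fit the training data by chance, which is why it is a natural criterion for variable selection.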

The probability of chance correlation is estimated using numerous random resamplings of the activity data. If, for such random samples, the probability of selecting a regression equation with the same or smaller number of variables (and a close or better multiple correlation coefficient) is greater than a small confidence limit (e.g., 0.01 or 0.05), then the initial correlation model is discarded.
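
The resampling test can be sketched as a permutation test. This simplified version assumes a single fixed variable, shuffles the activity values, and counts how often the shuffled data give a squared correlation at least as high as the observed one; it does not repeat the variable selection for each resample, and "close or better" is reduced to "equal or better". All names and data are illustrative.

```python
import random
from typing import List


def r_squared(xs: List[float], ys: List[float]) -> float:
    """Squared Pearson correlation between xs and ys."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return (sxy * sxy) / (sxx * syy)


def chance_probability(xs: List[float], ys: List[float],
                       n_trials: int = 1000, seed: int = 0) -> float:
    """Fraction of random activity shufflings that fit as well as the real data."""
    rng = random.Random(seed)
    observed = r_squared(xs, ys)
    shuffled = list(ys)
    hits = 0
    for _ in range(n_trials):
        rng.shuffle(shuffled)
        if r_squared(xs, shuffled) >= observed:
            hits += 1
    return hits / n_trials


descriptor = [float(i) for i in range(10)]
activity = [0.1, 2.2, 3.9, 6.1, 8.0, 9.8, 12.1, 14.0, 16.2, 17.9]
p_chance = chance_probability(descriptor, activity)
keep_model = p_chance <= 0.05  # discard the model if chance fits are too frequent
print(p_chance, keep_model)
```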

Applications

Non-peptide angiotensin II antagonists illustrate the utility of this knowledge engineering approach. Many compounds [15] of diverse structural type have been reported in the literature for the treatment of hypertension and congestive heart failure. Pharmacophore models have been postulated, with some disagreement about whether all of the highly active molecules bind at the same site [16]. Automated pharmacophore identification can be used to analyze these compounds and assess the probability that they act at the same site. A set of 55 compounds with specific binding activity (IC50) values ranging over 6 orders of magnitude was used. Examples of these compounds are presented in Figure 4.

Multiple conformations were included using 3D structures whose geometries were optimized by AMPAC [17]. Four activity classes were defined; the most active class (<100 nM) included 27 compounds. Apex-3D was able to generate rules which properly classified all compounds in the most active class without false negatives or positives. Several biophores were required, which is consistent with multiple binding sites. Three of these biophores are shown in Figure 5.

Results from a 3D database search using another biophore with a high probability of being associated with angiotensin II antagonist activity are presented in Figure 6.

Another example of an Apex-3D application is the development of a 3D-QSAR model using dihydrofolate reductase (DHFR) inhibitors. Figure 7 shows a model developed with 68 compounds belonging to several chemical classes of inhibitors, including pyrimidines, pyridopyrimidines, pyrroloquinazolines, quinazolines, and triazines. Single conformations of these compounds were obtained by fitting the 3D structures to a template of methotrexate in the conformation observed in the DHFR complex.

The QSAR model is based on a biophore consisting of a six-membered aromatic ring containing two sp2 hybridized nitrogen atoms. The equation includes 6 variables: total hydrophobicity, total number of H-donors, and partial atomic charge indexes for 4 secondary sites. Statistical parameters associated with the model include: predicted R2 = 0.72, predicted RMSE = 0.82. Predicted activities for 19 compounds excluded from the training set when compared with experimental values gave the following statistics: R2 = 0.81, RMSE = 0.78.
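
For readers who want to reproduce statistics of this kind on their own data, the sketch below computes R2 and RMSE for a set of observed and predicted activities. The paper does not state which definition of R2 was used; the coefficient-of-determination form here (one minus the ratio of residual to total sum of squares) is an assumption, and the numbers are made up rather than taken from the DHFR model.

```python
import math
from typing import Sequence


def rmse(observed: Sequence[float], predicted: Sequence[float]) -> float:
    """Root mean squared error of the predictions."""
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    return math.sqrt(ss_res / len(observed))


def r2(observed: Sequence[float], predicted: Sequence[float]) -> float:
    """ASSUMED definition: 1 - SS_res / SS_tot."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


# Made-up activities for a small external test set.
observed = [5.0, 6.0, 7.0, 8.0]
predicted = [5.5, 5.5, 7.5, 7.5]
print(r2(observed, predicted), rmse(observed, predicted))
```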

Acknowledgments

The work on which this paper is based was supported in part by a grant from the Israel-United States Binational Research and Development Foundation (P.O.B. 39104, Tel-Aviv, 61390, Israel). Views and information contained herein are those of the authors and not necessarily those of the Foundation. The Foundation assumes no liability for the contents of this document by virtue of the support given.

References

1. Golender, V.E.; Rozenblit, A.B. Logical and Combinatorial Algorithms in Drug Design, Research Studies Press: Letchworth, U.K. (1983).

2. Golender, V.E.; Vorpagel, E.R. In: 3D QSAR in Drug Design: Theory, Methods and Applications, Kubinyi, H. (Ed.), ESCOM, Leiden, 1993, pp. 137-149.

3. Golender, V.E.; Vesterman, B.; Ehyahu, O.; Kardash, A.; Kletzkin, M.; Vorpagel, E.R. Proceedings of the 10th European Symposium on Structure-Activity Relationships, in press.

4. Raima Database. Raima Corporation, Issaquah, 1993.

5. Mishkoff H.C. Understanding Artificial Intelligence, Howard W. Sams & Co., (1985).

6. Winston P.H., Horn B.K. Lisp, Addison-Wesley, Reading, (1989).

7. ISIS 3D Searching. New Features. Version 1.2. MDL Information Systems Inc. San Leandro, 1994.

8. Insight II User Guide, Release 95.0, Biosym/MSI, San Diego, 1995.

9. Apex-3D User Guide, Release 95.0, Biosym/MSI, San Diego, 1995.

10. MOPAC: A General Molecular Orbital Package (Version 6.0). Stewart J.J.P., QCPE#455.

11. Viswanadhan, V.N.; Ghose, A.K.; Revankar, G.R.; Robins, R.K. J. Chem. Inf. Comp. Sci., 29, 163-172 (1989).

12. Weininger, D.J. J. Chem. Inf. Comp. Sci., 28, 31 (1988).

13. Martin, Y.C.; Bures, M.G.; Danaher, E.A.; DeLazzer, J.; Lico, I.; Pavlik, P.A. J. Comput.-Aided Mol. Des., 7, 83 (1993).

14. Vesterman, B.; Golender, V.; Golender, L.; Fuchs, B. Proceedings of Second Electronic Computational Chemistry Conference. (http://www.dcl.co.il/ECCC2/conf_clust.html)

15. Duncia J. V., et al, J. Med. Chem., 33, 1312-1329 (1990).

16. Keenan R. M., et al, J. Med. Chem., 36, 1880-1892 (1993).

17. AMPAC, version 2.1 (QCPE No. 506), available from Quantum Chemical Program Exchange, Indiana University, Bloomington, IN.

Figures, Charts and Tables

 

FIGURE 1

FIGURE 2

FIGURE 3

FIGURE 4

FIGURE 5

FIGURE 6

FIGURE 7



NetSci, ISSN 1092-7360, is published by Network Science Corporation.