3D Database Searching in Drug Discovery

John H. VanDrie

Pharmacia & Upjohn Laboratories
301 Henrietta Street
Kalamazoo, MI 49001 USA
E-mail: vandrie@mindspring.com

http://www.netsci.org/Science/Cheminform/feature06.html

Abstract

This review begins with a brief description of the historical development of the field of 3D chemical database systems. Next, a detailed comparison of different approaches and different types of functionality is given. Following that is an overview of the few applications published in the literature. Then the author's view of the open issues in methods development is presented. Finally, the author attempts to address the question "What have we learned in the past ten years of applying 3D database searching?"

INTRODUCTION

In the past, many experimentalists shared a similar sentiment regarding the usefulness of theoretical and computational methods to their drug discovery problems, namely:

"You're always explaining to me after the fact why a molecule I've already discovered is active, but you're never able to tell me ahead of time which molecules I should be investigating."

This frustration stimulated the development of tools explicitly designed to have as their output a list of molecules which may be active against a receptor, based on three-dimensional (3D) structural information. The first such software tools to appear are generally classified under the heading '3D database searching', in which a database of conformation(s) of known molecules is examined, and those molecules whose conformations satisfy specified criteria are reported as 'hits'.

The first attempts at 3D database searching were initiated in the mid-70's by Peter Gund, then a post-doc in Todd Wipke's laboratory [1,2]. The 3D criteria taken as the database query were those of an atomic pharmacophore, i.e. a set of atom types with sets of distance constraints between them. The 3D database used was a database of crystallographically determined coordinates (the Cambridge Crystallographic Database). This approach of using an atomic pharmacophore query was later elaborated by Peter Willett and co-workers in Sheffield [3,4], and also Bob Sheridan and co-workers at Lederle [5].

In the early to mid-80's, a number of new approaches were taken to 3D database searching. Kuntz's group at UCSF developed DOCK, taking for their criteria for searching the database the shape of a receptor pocket, based on the X-ray crystal structure of that protein receptor [6,7], reporting as hits all molecules whose database conformation was capable of being docked into that pocket. Yvonne Martin and I at Abbott took another approach, one intended to be closer to the way a medicinal chemist looked at the problem: we allowed the query specification to contain the most general geometric description of the orientation of chemical functional groups [8]. The geometric concepts which medicinal chemists use go far beyond the interatomic distances of the atomic pharmacophores: concepts were included like the orientation of a lone-pair or the orientation of a side-chain, the height of a group above the plane of a ring, etc. The functional group specifications were also quite general: concepts were included like 'basic amine', 'anionic group', 'hydrophobic group'. In essence, we generalized the concept of a 'pharmacophore'. This searching system was named ALADDIN. Dave Weininger's software (Merlin, Thor, etc., Daylight Chemical Information Systems) made it very easy to implement these functional group descriptions. Finally, Paul Bartlett and co-workers at Berkeley developed CAVEAT, taking the approach of focussing solely on the vectorial/angular relationships in specifying the 3D database query [9]. This was less an advance in methods development, and more an advance in the manner in which this type of tool was applied, Bartlett being a practicing synthetic chemist, using 3D database searching to find scaffolds to mimic peptides in their bound conformation.

It should be noted that some 3D searching capability was provided with the Cambridge Crystallographic Database. However, this did not lend itself to the problem of suggesting possible new activities of known molecules. It was more oriented to the problem of computing average hydrogen-bond geometries, etc.

Simultaneous with the appearance of these new approaches to 3D database searching was the serendipitous development of CONCORD by Bob Pearlman and co-workers [10]. CONCORD is a tool for the rapid computation of a single high-quality approximate 3D structure. With this, we were able to move beyond the crystallographic databases, and those in the pharmaceutical industry were able to build 3D databases containing hundreds of thousands of proprietary compounds. This dramatically enhanced the utility of 3D database searching, for two reasons:

  1. these compounds could be readily tested, since most drug companies keep inventories of compounds they've synthesized over the years, and

  2. anything which was found to be active was generally something with an existing patent position, or something readily patentable.

Mark Bures at Abbott was the first to construct a corporate 3D database using CONCORD (after we had performed some experiments verifying the quality of those conformations); by 1990, most major pharmaceutical companies had used CONCORD to make some progress towards building a corporate 3D database.

The chemical software vendors were quick to exploit these new methods. The proposal for a consortium to develop MACCS-3D emerged shortly after the visit to Abbott by an MDL vice-president, intended as a 3D extension to their popular 2D searching software MACCS-II [11]. Chem-X entered the fray with their Chem-3DBS, one of the first attempts to move away from the reliance on a single conformer per molecule in the database [12].

In 1990, I joined BioCAD as one of the founding members; two years later we introduced the Catalyst software, a major component of which was the 3D database searching, relying on databases populated with multiple conformers per molecule. About that same time, Tripos introduced their entry in the 3D database field, Sybyl-3DB. Tripos continued to rely on CONCORD-generated conformers, and attempted first by 'random-tweak' and later by 'directed-tweak' to consider the conformational flexibility of the molecules in the database [13]. MDL introduced their own approach to on-the-fly analyses of conformational flexibility, CFS, at about the same time [14].

Starting in the late 80's, another set of computational tools emerged, directed at the problem of suggesting molecules which may be active, based on 3D criteria: de novo design tools. Rather than searching a database, these tools construct new molecules to satisfy the criteria. Richard Lewis and Phil Dean in Cambridge began with 2D de novo construction [15]; Joe Moon and Jeff Howe at Upjohn used the 3D structure of a receptor to drive the evolution of a polypeptide [16]. Today's state-of-the-art is probably best represented by the SPROUT software, developed by Peter Johnson and co-workers at Leeds, which allows arbitrary organic molecules to be constructed in response to either a 3D receptor structure or to criteria defined by a general pharmacophore [17]. The utility of de novo design tools vis-a-vis 3D database searching will be discussed in a later section.

SURVEY OF EXISTING TECHNIQUES FOR 3D DATABASE SEARCHING

A number of reviews have already been written, representing the points-of-view of the different schools of thought: Kuntz's 1989 review [18] focusses on the protein-structure-driven shape-matching aspects, Martin's 1991 review [19] focusses on the medicinal chemistry perspective, and Willett's 1991 book [20] focusses on the algorithmic and information retrieval aspects. The survey which follows is not intended to replace or supersede any of those - it represents my own idiosyncratic view of this area, reflects work that has come to my attention, and is not intended to be comprehensive. Those whose work I've omitted or poorly described I encourage to send me e-mail.

Let us compare and contrast different techniques, based on how they deal with the following issues:

  • Query language
    Steric interactions
    Functional groups (topological descriptions)
    Geometric relationships
    "1D" constraints
  • Searching
  • Computing conformation(s) of molecules in the database
  • Query construction and pharmacophore identification

Query language: Steric interactions

Emil Fischer's 1894 model of a lock-and-key as the origin of biological specificity continues to dominate the thinking of many drug designers. This model says that those molecules whose shape, i.e. whose steric characteristics, perfectly complements the shape of the receptor will be active.

The DOCK approach continues to be the most sophisticated method for dealing with the shape complementarity between the ligand and the receptor [6]. In this method, the shape of the active site cavity is approximated by a set of spheres, and each conformation in the database is tested for its ability to be docked into this representation of the cavity. The published pharmacophore-based methods all rely on a very crude representation of the potential steric clashes, using 'dummy atoms' [8] or 'outrigger atoms' [5] to represent spherical regions of excluded volumes, whose location is defined relative to the atoms or functional groups of the pharmacophore. Studies of the effectiveness of such approaches have never been published, but my own experience suggests this is a very poor way to describe the shape restrictions of the receptor (though, in theory, an unlimited number of such spheres should be able to describe any shape). We did experiment with a routine in ALADDIN, called CLASH, to represent the full shape of the receptor in atomic detail, into which each molecule which satisfied all other criteria was oriented via minimizing the rms; this didn't work all that well, for reasons that require more explanation than I should put in this survey.
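The excluded-volume idea used by the pharmacophore-based methods can be illustrated with a minimal sketch. This is not the code of ALADDIN or any other system named above; the function name `clashes`, the uniform `atom_radius` of 1.5 Angstroms, and the example coordinates are all assumptions made for illustration.

```python
import math

def clashes(atoms, excluded_spheres, atom_radius=1.5):
    """Return True if any ligand atom penetrates any excluded-volume sphere.

    atoms: list of (x, y, z) ligand atom positions, already oriented in the
           frame of the pharmacophore.
    excluded_spheres: list of ((x, y, z), radius) 'dummy atom' regions.
    atom_radius: a single assumed radius for every ligand atom (a crude
                 simplification; real systems would use per-element radii).
    """
    for atom in atoms:
        for center, radius in excluded_spheres:
            if math.dist(atom, center) < radius + atom_radius:
                return True
    return False

# Hypothetical data: one forbidden sphere of radius 2.0 centered at x = 5.
spheres = [((5.0, 0.0, 0.0), 2.0)]
ligand_ok = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]   # stays clear of the sphere
ligand_bad = [(0.0, 0.0, 0.0), (3.0, 0.0, 0.0)]  # second atom intrudes

print(clashes(ligand_ok, spheres), clashes(ligand_bad, spheres))
```

The sketch also makes the weakness visible: a handful of spheres is a very coarse stand-in for the true surface of a receptor pocket.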

Query language: Functional groups

For a molecule to bind to a receptor, it must have functional groups complementary to specific binding regions on the receptor. For example, most molecules active in the CNS (central nervous system) require a basic amine, which is thought to bind to an aspartate sidechain on the receptor. Compounds that do not possess that basic amine have a very small likelihood of being active in the CNS.

To properly detect arbitrary functional groups on molecules, one must use a subgraph isomorphism algorithm. It appears that most commercial 3D database software uses this. There is an undeniable speed cost for using this; Sheridan et al.'s approach [5], like Gund's original approach [2], instead uses atom types, each atom being encoded with a series of bits which depend on that atom's environment. Willett et al. [21] published a comparison of various algorithms for subgraph isomorphism, which concluded that the Ullmann algorithm was the best; this analysis may have led many to shy away from including subgraph isomorphism in their 3D search systems (in my opinion, the speed of isomorphism is not that critical an issue, and more easily-coded algorithms suffice).
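As an illustration of what an 'easily-coded' substructure match might look like, here is a minimal backtracking sketch. It is not the Ullmann algorithm and not the code of any system discussed here; the function name `find_substructure`, the graph encoding, and the neglect of bond orders and hydrogens are all simplifying assumptions.

```python
def find_substructure(q_atoms, q_bonds, m_atoms, m_bonds):
    """Return one mapping (query atom index -> molecule atom index) or None.

    q_atoms / m_atoms: lists of element symbols.
    q_bonds / m_bonds: sets of frozenset({i, j}) atom-index pairs.
    Bond orders are ignored in this sketch.
    """
    def extend(mapping):
        k = len(mapping)                      # next query atom to place
        if k == len(q_atoms):
            return list(mapping)
        for cand in range(len(m_atoms)):
            if cand in mapping or m_atoms[cand] != q_atoms[k]:
                continue
            # Every query bond from atom k to an already placed atom
            # must also be present in the molecule.
            if all(frozenset((cand, mapping[j])) in m_bonds
                   for j in range(k) if frozenset((k, j)) in q_bonds):
                mapping.append(cand)
                result = extend(mapping)
                if result is not None:
                    return result
                mapping.pop()                 # backtrack
        return None

    return extend([])

def bonds(*pairs):
    return {frozenset(p) for p in pairs}

# Query: a carboxyl-like pattern, one carbon bonded to two oxygens.
carboxyl = (["C", "O", "O"], bonds((0, 1), (0, 2)))
# Heavy-atom graphs of acetic acid and ethanol (hypothetical encoding).
acetic_acid = (["C", "C", "O", "O"], bonds((0, 1), (1, 2), (1, 3)))
ethanol = (["C", "C", "O"], bonds((0, 1), (1, 2)))

print(find_substructure(*carboxyl, *acetic_acid))
print(find_substructure(*carboxyl, *ethanol))
```

For functional groups of a few atoms, the search tree is shallow, which is consistent with the view that the raw speed of the isomorphism step is rarely the bottleneck.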

The DOCK developers have begun to introduce the notion of 'chemical filters', which appears to aspire to this goal of constraining the hits to those having the required chemical functionality [22]. The level of detail published thus far on this subject makes it difficult to evaluate precisely how they are dealing with this problem.

Query language: Geometric relationships

The hallmark of a specific ligand-receptor interaction is its sensitivity to chirality - the (S,S) isomer of captopril is a potent ACE inhibitor, but any other diastereomer is inactive. Yet the old-fashioned notion of a pharmacophore - a set of atoms separated by a set of distance constraints - is clearly not enantioselective, i.e. the set of distances one would measure for (S,S)-captopril is identical to that of its mirror image (R,R)-captopril. This suggests that additional geometric relationships are required to capture enantioselectivity in a 3D database search query. Sterics may capture some of that, but there are numerous examples where the flip of a chiral center barely alters the shape of a molecule, yet only one member of the pair is active, e.g. (+)- and (-)-apomorphine.

In ALADDIN, we included every conceivable type of geometric relationship, allowing the definition of points (e.g. center of mass of a carboxyl), lines (e.g. the line along an N-H), and planes (e.g. the plane of an aromatic ring), with distances between any of those, 3-centered angles or the angle between two planes, and 4-centered torsion angles. Of these, only signed torsion angles are enantioselective; in the apomorphine example, capturing the direction of the line along the N-lone pair relative to the catechol allows one to define the enantioselectivity. These geometric relationships, coupled with the functional group descriptions, allow one to capture a wide range of descriptions of ligand-receptor recognition, each query being both selective and specific for the receptor of interest. It is this selectivity and specificity that makes such queries superior to purely shape-based queries, which can rarely distinguish between similar analogs, one active, the other inactive. This set of geometric relationships has been widely used in 3D database search systems.
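The point that distances cannot distinguish mirror images while signed torsion angles can is easy to check numerically. The following is a small sketch with made-up coordinates (four points standing in for four pharmacophoric centers); the helper names are hypothetical and the torsion formula is the standard atan2 form.

```python
import math
from itertools import combinations

def sub(a, b):
    return tuple(x - y for x, y in zip(a, b))

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cross(a, b):
    return (a[1]*b[2] - a[2]*b[1],
            a[2]*b[0] - a[0]*b[2],
            a[0]*b[1] - a[1]*b[0])

def signed_torsion(p0, p1, p2, p3):
    """Signed 4-centered torsion angle in degrees, in the range (-180, 180]."""
    b0, b1, b2 = sub(p1, p0), sub(p2, p1), sub(p3, p2)
    n1, n2 = cross(b0, b1), cross(b1, b2)
    y = math.sqrt(dot(b1, b1)) * dot(b0, n2)
    x = dot(n1, n2)
    return math.degrees(math.atan2(y, x))

def pairwise_distances(points):
    return [math.dist(a, b) for a, b in combinations(points, 2)]

def mirror(points):
    """Reflect through the yz-plane, producing the enantiomeric arrangement."""
    return [(-x, y, z) for x, y, z in points]

# Hypothetical coordinates of four centers.
points = [(1.0, 0.0, 0.0), (0.0, 0.0, 0.0), (0.0, 0.0, 1.0), (0.0, 1.0, 1.0)]
mirrored = mirror(points)

print(pairwise_distances(points) == pairwise_distances(mirrored))
print(signed_torsion(*points), signed_torsion(*mirrored))
```

All six interatomic distances are unchanged by the reflection, whereas the torsion angle changes sign, which is why a query built only from distances accepts both enantiomers.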

In Catalyst, we were a bit more parsimonious, since it was apparent the most common use of a line was to define the direction of a hydrogen-bond-donor or -acceptor. This led us to include the built-in notion of a projected acceptor or donor point, similar to that described by Marshall and co-workers in their studies of receptor mapping [23]. The other geometric relationships were retained.

In CAVEAT, Bartlett and co-workers chose extreme parsimony: only vectorial relationships were included. While the success they've reported appears to be indisputable, it is difficult to defend the position that the subtlety of most receptor-ligand recognition can be encoded purely in terms of vectorial relationships. They also did not include any recognition of functional groups, leaving it up to the chemist using the software to filter out hits mentally, based on his own intuition.

Query language: "1D" constraints

"1D" constraints allow one to take data known about a compound into account while doing a search. For example, if one were performing a search of a corporate database for new angiotensin-II inhibitors, one would want to report as hits only those compounds whose compound inventory is sufficient for testing (e.g. AMOUNT-ON-HAND > 10 mg) and which hadn't already been tested for A-II activity (e.g. ANGIO-II-IC50 EQ NULL). Gary Chappell at BioCAD jokingly suggested the name "1D constraints" for such constraints (vs. the 2D constraints, the functional group specifications, and the 3D constraints, the geometric relationships), and the name has stuck.

The obvious way to deal with such constraints is to allow an arbitrary relational query, "SQL", in tandem with a 3D query. Inexplicably, this simple approach has yet to appear. ALADDIN was built upon a database system which only allowed essentially a flat-file operation, and thus only allowed primitive 1D constraints as in the example above. MACCS-3D allows certain views to be projected from ORACLE, a relational database which one may run coupled to MACCS. Catalyst is tightly integrated with an underlying relational database engine, but sadly what is presented to the user for searching is essentially a flat-file format.
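A minimal sketch of what running a relational query in tandem with a 3D search could look like, using Python's built-in sqlite3 module. The table name, column names, compound identifiers, and the list of 3D hits are all hypothetical; none of this reflects the actual interfaces of ALADDIN, MACCS-3D, or Catalyst.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE compounds ("
    " id TEXT PRIMARY KEY,"
    " amount_on_hand_mg REAL,"
    " angio_ii_ic50 REAL)"          # NULL means 'never tested'
)
conn.executemany(
    "INSERT INTO compounds VALUES (?, ?, ?)",
    [
        ("A-001", 25.0, None),      # enough material, untested
        ("A-002", 5.0, None),       # too little material
        ("A-003", 40.0, 1.2e-7),    # already tested
        ("A-004", 12.0, None),      # fine, but not a 3D hit
    ],
)

def apply_1d_constraints(conn, hit_ids):
    """Keep only 3D hits with > 10 mg on hand and no A-II result yet."""
    if not hit_ids:
        return []
    placeholders = ",".join("?" * len(hit_ids))
    rows = conn.execute(
        f"SELECT id FROM compounds WHERE id IN ({placeholders})"
        " AND amount_on_hand_mg > 10"
        " AND angio_ii_ic50 IS NULL"
        " ORDER BY id",
        list(hit_ids),
    )
    return [row[0] for row in rows]

# Assumed output of some 3D search step, supplied here by hand.
hits_3d = ["A-001", "A-002", "A-003"]
print(apply_1d_constraints(conn, hits_3d))
```

In this sketch the relational filter is applied after the 3D search; applying it before, to shrink the set of molecules the 3D search must examine, would be an equally reasonable design.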

Searching

The problem of efficiently processing a database search entails primarily two steps: selecting candidates quickly, molecules which may meet the query specification, and performing a slower detailed analysis on these candidates to find the hits, those molecules which exactly meet the query. There are three standard approaches in Computer Science to the problem of determining candidates:

  1. using a long bit-vector for each molecule, each bit being clear or set depending on whether that molecule possesses some property, e.g. has a carboxyl;
  2. using inverted-keys to index into the database; or
  3. using a tree-like structure to index a database [24].

Peter Willett and co-workers have published extensively on the first method (see ref. [20], and references therein). Bob Sheridan and co-workers employed the second method [5]. Examples of the third method have been described for 2D searching [25,26], but descriptions of employment of the third method in 3D database searching have not appeared. For 3D database searching which employs strictly shape information, I have not seen published information on any candidate selection procedure; this process poses some unique challenges not present in the pharmacophore-based 3D database searching.
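The first of these approaches, bit-vector screening, can be sketched in a few lines. The feature list, molecule names, and function names below are hypothetical; published screening systems use far longer bit-vectors, often encoding interatomic distance ranges as well as functional groups.

```python
# Hypothetical feature dictionary: one bit position per property.
FEATURES = ["carboxyl", "basic_amine", "aromatic_ring", "hydroxyl"]

def fingerprint(present):
    """Pack the features present in a molecule (or query) into an integer."""
    bits = 0
    for name in present:
        bits |= 1 << FEATURES.index(name)
    return bits

def select_candidates(query_bits, database):
    """Return molecules whose bit-vector contains every bit set in the query.

    Passing this screen is necessary but not sufficient: candidates must
    still go through the slower detailed analysis to become hits.
    """
    return [name for name, bits in database.items()
            if (bits & query_bits) == query_bits]

database = {
    "mol1": fingerprint(["carboxyl", "aromatic_ring"]),
    "mol2": fingerprint(["basic_amine", "aromatic_ring", "hydroxyl"]),
    "mol3": fingerprint(["basic_amine"]),
}
query = fingerprint(["basic_amine", "aromatic_ring"])

print(select_candidates(query, database))
```

The screen costs one AND and one comparison per molecule, which is why it is attractive as a first pass over a very large database.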

Identification of the hits by a detailed analysis of the list of candidates, "query isomorphism", is inevitably a time-consuming process; overall search performance is therefore a function of both the efficiency of candidate selection (how many candidates that are not hits must be processed in the detailed phase), and the speed of the detailed analysis. Those methods which rely on a specific orientation to an absolute coordinate frame of reference, such as DOCK or the CLASH routine in ALADDIN, inevitably are faced with a time-consuming iterative process of achieving the proper orientation of the molecule starting from the random orientation in the database. Pharmacophore-based methods are not faced with this hurdle; their most time-consuming step is usually the graph isomorphism step for mapping functional groups. One of the reasons for the slowness of the early 3D database search systems was that this detailed query isomorphism first processed the graph isomorphism in full, and only then applied the 3D constraints. Later systems built from scratch allowed the graph isomorphism to be integrated with the testing of the 3D constraints, an inherently much more efficient process. In the original version of Catalyst, we represented a 3D pharmacophoric query as a directed acyclic graph (DAG), the leaves of which represented a graph isomorphism for mapping a functional group. This allowed simple and highly efficient state-transition logic to traverse the DAG rapidly, and to stop the moment the first isomorphism was found. In my current research software, I extend the DAG down to the query atoms, and use a common state-transition logic for the entire DAG (without a separate graph isomorphism step).
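The benefit of testing 3D constraints during the mapping, rather than after it, can be illustrated with a simple backtracking sketch. This is not the DAG and state-transition scheme used in Catalyst, whose details are not given here; it is a plain recursive search in which each distance constraint is checked as soon as both of its features are assigned, so that geometrically impossible partial mappings are abandoned early. All names and coordinates are hypothetical, and functional groups are assumed to have been reduced to labeled points.

```python
import math

def first_match(query_types, dist_limits, mol_features):
    """Return the first mapping of query features onto molecule features.

    query_types: list of feature labels, e.g. ["amine", "aromatic"].
    dist_limits: {(i, j): (lo, hi)} with i < j, in Angstroms.
    mol_features: list of (label, (x, y, z)) for one stored conformer.
    Returns a list of molecule-feature indices, or None if no match.
    """
    def extend(assigned):
        k = len(assigned)
        if k == len(query_types):
            return list(assigned)            # stop at the first full match
        for idx, (label, xyz) in enumerate(mol_features):
            if idx in assigned or label != query_types[k]:
                continue
            ok = True
            for j in range(k):               # test 3D constraints right away
                limits = dist_limits.get((j, k))
                if limits is None:
                    continue
                d = math.dist(xyz, mol_features[assigned[j]][1])
                if not (limits[0] <= d <= limits[1]):
                    ok = False
                    break
            if ok:
                assigned.append(idx)
                found = extend(assigned)
                if found is not None:
                    return found
                assigned.pop()               # backtrack
        return None

    return extend([])

query_types = ["amine", "aromatic", "acceptor"]
dist_limits = {(0, 1): (5.0, 6.0), (1, 2): (3.0, 4.0)}
conformer = [
    ("aromatic", (0.0, 0.0, 0.0)),
    ("amine", (2.0, 0.0, 0.0)),      # too close to the ring
    ("amine", (5.5, 0.0, 0.0)),      # satisfies the amine-ring distance
    ("acceptor", (0.0, 3.5, 0.0)),
]

print(first_match(query_types, dist_limits, conformer))
```

Here the first amine is rejected as soon as the aromatic ring is considered, before any acceptor is examined; a two-phase scheme would instead enumerate every complete label mapping and only then discard those failing the distance tests.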

Computing conformation(s) of molecules in the database

The first databases which were used were either experimental single-conformer databases (CCD), or databases of multiple conformers computed from molecular mechanics [8]. For databases of hundreds of thousands of molecules, the latter was impractical; two disadvantages of the former were that conformational flexibility was not taken into account, and that everyone was searching the same molecules - the ability to search a proprietary database is a tremendous advantage. It was apparent at the time we initiated our studies with ALADDIN that some type of expert-system approach to generating conformers might be the solution, something being explored at that time by Dolata [27]. In mid-1987, CONCORD became available, which used an expert-system approach to generate a single conformer for each molecule. At roughly one second per molecule, its application to databases of hundreds of thousands of molecules was practical. Most of the effort associated with the conversion of a corporate database using CONCORD was associated with properly preparing the input - usually chirality. When faced with a center of unknown stereochemistry, CONCORD produced only one stereoisomer, since it was limited to producing one conformer per input molecule; hence a considerable amount of effort was required to preprocess the input so that both stereoisomers were supplied as input to CONCORD - for molecules with many unknown centers, this can be troublesome.

CONCORD's greatest limitation was that it could produce only one conformer per molecule, which meant that the early 3D searching systems did not take conformational flexibility into account at all. Two solutions emerged: populate the database with multiple conformers, or perform an analysis of the conformational flexibility on-the-fly starting with the single CONCORD structure. Andrew Leach's COBRA software was one of the first attempts to generate multiple conformers per molecule, using an expert-system approach [28]. Dave Weininger adapted some distance-geometry code to achieve the same end [29]. At BioCAD, Peter Towbin and Andrew Smellie applied many techniques to achieve a multipurpose conformational analysis, one use of which was the construction of multi-conformer databases. Recently, a group at Merck weighed in with their own approach to generating multi-conformer databases [30]. Gasteiger recently reported on CORINA, another program to generate a single high-quality approximate 3D structure [31].

A number of approaches to performing an on-the-fly conformational analysis starting from a single CONCORD structure appeared almost simultaneously, primarily applying the notion of torsion-space minimization. Tripos' method, 'directed-tweak', aims to distort the CONCORD structure to see if it may adopt a conformation consistent with the 3D database query [32]; no energy tests are performed, except for an optional crude bump-check. MDL's method, 'CFS', adopts a similar approach.
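The flavor of an on-the-fly flexibility test can be conveyed with a deliberately crude sketch. It is not Tripos' directed-tweak or MDL's CFS: instead of minimizing in torsion space, it simply scans a single torsion on a fixed grid and, like directed-tweak as described above, performs no energy test. The four-atom chain, the function names, and the 5-degree step are all assumptions made for illustration.

```python
import math

def rotate_about_axis(p, origin, axis, angle):
    """Rodrigues rotation of point p about a unit axis passing through origin."""
    v = [p[i] - origin[i] for i in range(3)]
    c, s = math.cos(angle), math.sin(angle)
    k_dot_v = sum(axis[i] * v[i] for i in range(3))
    k_cross_v = (axis[1]*v[2] - axis[2]*v[1],
                 axis[2]*v[0] - axis[0]*v[2],
                 axis[0]*v[1] - axis[1]*v[0])
    return tuple(origin[i] + v[i]*c + k_cross_v[i]*s + axis[i]*k_dot_v*(1 - c)
                 for i in range(3))

def tweak_to_distance(atoms, lo, hi, step_deg=5):
    """Rotate atom 3 about the 1-2 bond until the 0-3 distance lies in [lo, hi].

    atoms: four (x, y, z) positions of a chain 0-1-2-3 (the stored conformer).
    Returns (rotation in degrees, new position of atom 3), or None if no
    grid point satisfies the constraint.
    """
    a0, a1, a2, a3 = atoms
    bond = [a2[i] - a1[i] for i in range(3)]
    length = math.sqrt(sum(b * b for b in bond))
    axis = tuple(b / length for b in bond)
    for deg in range(0, 360, step_deg):
        moved = rotate_about_axis(a3, a2, axis, math.radians(deg))
        if lo <= math.dist(a0, moved) <= hi:
            return deg, moved
    return None

# Hypothetical stored conformer: a cis arrangement, 0-3 distance of 1.5.
chain = [(1.5, 0.0, 0.0), (0.0, 0.0, 0.0), (0.0, 0.0, 1.5), (1.5, 0.0, 1.5)]

print(tweak_to_distance(chain, 3.1, 3.4))   # reachable by rotation
print(tweak_to_distance(chain, 5.0, 6.0))   # not reachable at any torsion
```

Because no energy is evaluated, a conformation found this way may be geometrically consistent with the query yet energetically unreasonable.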

Much hue-and-cry has arisen over which of these various approaches is the best way for treating conformational flexibility in 3D database search. I have developed a method (presented at the Fall 1993 ACS meeting, manuscript in preparation) for studying in an objective way how well conformational flexibility is treated in 3D database search, by relying on a mathematical inequality which approaches an equality in the limit of perfect conformational analysis. This appears to indicate that storing discrete conformers in a multi-conformer database works surprisingly well, with directed-tweak performing significantly worse and the CFS method performing marginally better at a significant cost in search speed. More thorough studies are needed, however - these are very preliminary results.

Query construction and pharmacophore recognition

DOCK takes as input a crystallographically-determined protein structure with an inhibitor bound. Hence for such methods there are no apparent difficulties in constructing a query representing the shape of the pocket. However, users of every other 3D database method are faced with the challenge of constructing a query.

In many cases, pharmacophores are published in the literature, though my own experience indicates these are usually of little value. Working initially with ALADDIN, the approach I took was to begin with a query, initially taken from some chemical intuition and from the known structure-activity relationship (SAR), and to iteratively modify it until it was capable of returning all known actives as hits, while excluding all inactives (it was this procedure that constantly drove the requirements for a sophisticated query language). A number of methods for automating the construction of database queries emerged in the early 90's. BioCAD's 'hypothesis generation' tries to find points in space onto which all active molecules can be rms-fitted, such that the rms deviation for each molecule correlates linearly with the log of the activity [33]. Mark Bures and Yvonne Martin proposed DISCO [34], an adaptation of a procedure of Andrew Smellie et al. [35]; this method critically depends on the 'seeding' of the procedure with the bioactive conformation of one of the actives. Biosym adapted the APEX procedure of Golender [36] to this problem; this method tends to generate hundreds of possible pharmacophores. None of these methods is truly satisfactory, based on my own experience and what I've heard of the experience of others.

APPLICATIONS

Unfortunately, few of the successes of 3D database searching have appeared in the literature. The story of our first success with ALADDIN finally appeared in Martin's review [19], in which a small subset of the Abbott corporate database was searched with a dopaminergic query. From that search emerged a novel chemical structure, for which, after some elaboration, clinical research was initiated. One of DOCK's early successes was the discovery that haloperidol is a modest inhibitor of HIV-1 protease [37]; I believe, however, that this lead, and everything found by various users of 3D searching against the HIV-1 protease structure, did not lead to any clinical candidates. I am aware of at least one other pharmaceutical company besides Abbott who found one or more new leads using ALADDIN with a CONCORD-generated 3D database. I'm also aware of at least one pharmaceutical company having multiple successes in discovering new lead structures using the Catalyst 3D database software with a multi-conformer 3D database constructed with its own conformational analysis tools. Using the Chem-3DBS software, at least one pharmaceutical company reportedly has had a success. Unfortunately, no details have emerged in the published literature about any of these successes. In probably the finest study of this type published to date, Wang et al. [38] recently reported in complete detail their successful discovery of novel leads for protein kinase C agonists, using the Chem-3DBS searching software.

Given the information that is publicly available, it is difficult to draw any general conclusions about what elements were responsible for the successful applications of DOCK and related protein shape-matching programs. From what is published and my own first-hand knowledge, however, some comments can be made about common elements to most of the success stories of the pharmacophore-based 3D searching: exactly which searching algorithm or which conformational analysis method is not that critical; what is critical is a very large database, and facilities for screening the thousands of hits which emerge. High-quality pharmacophores are helpful, but, amazingly, positive results have emerged with comparatively low-quality pharmacophores. In one instance, the ratio of actives among the hit list was only about 1%, but since it superseded random screening proceeding at a rate of about 0.1%, this was considered a success. In a sense, the proportion of actives among the hit list is a direct measure of the quality of the pharmacophore. I have seen this number anywhere from 50% down to 0%, with numbers in the range 1-10% generally considered acceptable.
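The arithmetic behind that judgment is simple enough to write down. The sketch below uses hypothetical function names and made-up screening counts chosen to reproduce the 1% and 0.1% rates quoted above.

```python
def hit_rate(n_active, n_tested):
    """Fraction of tested compounds that turned out to be active."""
    return n_active / n_tested

def enrichment(search_rate, random_rate):
    """How many times more productive the 3D hit list is than random screening."""
    return search_rate / random_rate

def compounds_to_screen(n_actives_wanted, rate):
    """Expected number of compounds to test to find the desired number of actives."""
    return n_actives_wanted / rate

# Made-up counts: 20 actives among 2000 3D hits vs. 1 in 1000 at random.
search_rate = hit_rate(20, 2000)     # 1%
random_rate = hit_rate(1, 1000)      # 0.1%

print(enrichment(search_rate, random_rate))
print(compounds_to_screen(5, search_rate), compounds_to_screen(5, random_rate))
```

Under these assumed numbers, finding five actives would take on the order of 500 assays from the hit list versus about 5000 from random screening, which is why even a 1% hit list can count as a success.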

OPEN ISSUES IN METHODS DEVELOPMENT

Once again, the caveat: what follows is my own idiosyncratic view of what the major outstanding issues are for the future development of 3D database searching.

Merging of shape with chemical/geometric descriptors. In my opinion, this is still one of the major unresolved issues. Kuntz's group is actively pursuing this, from the perspective of adding chemical filters to their active-site-shape-matching procedure [39]. Unfortunately, it is difficult to assess precisely what they are doing, and what success they are having, based on the published literature. Those working from the pharmacophore perspective do not appear to regard the shape of the protein pocket as important, and continue to proceed with very crude methods for representing steric interactions.

How to treat molecular flexibility. The zealotry associated with the two schools of thought here (populating a database with multiple conformers vs. on-the-fly tests) suggests that inevitably the best solution to emerge will be one that is a mixture of both. The latest versions of the Catalyst 3D database searching software include such a combined algorithm, though unfortunately no details have emerged on how this is achieved. There is a strong need for objective tests to discriminate between competing methods. Any method, to be useful, must be applicable to the construction and searching of million-compound databases.

Constructing synthetic databases. The utility of 3D database searching is greatly enhanced if databases of hitherto-unknown compounds can be searched. This will be especially important for the directed mass synthesis of combinatorial chemistry. Ho and Marshall [40] recently reported an interesting approach to this problem.

Query construction from active and inactive analogs (automated pharmacophore recognition). As mentioned earlier, no existing methods are really satisfactory in this regard. The standards which I impose in measuring any new method are:

  1. it must be at least as good as a human medicinal chemist or computational chemist in developing a pharmacophore,

  2. the answers it produces should be reasonably robust to small changes in the input dataset, largely independent of how one selects that input dataset from the literature, and constant or smoothly-varying as the input dataset is expanded,

  3. 3D database searches using such pharmacophores should yield hit lists in which the proportion of hits that are actually active is greater than 1%,

  4. the resultant pharmacophore should be highly selective, especially against receptor subtypes (e.g. a beta-1-antagonist pharmacophore should not in general hit molecules which are solely beta-1-agonists or beta-2 active),

  5. the method should not be totally thrown off by bad data in the input dataset (e.g. one molecule is marked as active when in fact it is not); the ideal method would detect such erroneous data and flag them as suspect,

  6. nanomolar binders should not be necessary in the input dataset to get useful results (otherwise we're back to the setting described in the first sentence of this article), and

  7. the results should be statistically significant (e.g. if presented with two dozen randomly chosen molecules, it shouldn't blindly report what it considers a good pharmacophore).

Constructing pharmacophoric queries from a protein. One would think composing a pharmacophore query from a protein is a trivial task, and yet all of my own experience suggests it is not. If one looks at our structure of a C2-symmetric peptide bound to HIV-1 protease [41], one would think it simple to identify a few key interactions, which one would choose as features of a pharmacophore. Yet, what constitutes 'key'? One doesn't know that until one has a family of different ligands bound to the protein; then, certain regions of the protein will emerge as indisputable 'hot spots', binding to which is common for all potent inhibitors. Peter Goodford's GRID [42], and the HIPPO software of Peter Johnson and Zsolt Zsoldos [43] both aspire to answer this question, but it is apparent that more work still needs to be done in this area. Also, as recently pointed out by Gerhard Wagner [44], conformational flexibility of the protein (ligand-induced fit) may play a major role in our understanding of how ligands interact with receptors, and hence how pharmacophores may need to represent that abstractly.

Fuzzy queries. One of the strengths of 3D database searching is also one of its weaknesses: it returns only those molecules which exactly match the criteria you've specified. Probably the most frustrating thing about 3D database searching is dealing with the situation of zero hits being returned. The first question that one asks: in which direction must I relax my query to get some hits? It would be useful if one were given some feedback on a zero-hit query as to what molecules almost met the query - this could begin to answer that question. I refer to this as 'fuzzy query searching'. I've not seen or thought of anything productive along these lines, but it remains apparent to me that this issue would be a useful one to resolve.

Scoring hits. The Holy Grail of both 3D database searching and de novo design has been the ability to rank the hits in order of the likelihood that they're active. A reminder: the Holy Grail was the object of prolonged quests of medieval knights, something never attained but always held out as the goal of the quest. We had originally called the hit-reporting routine in ALADDIN the SCORE routine, because we'd planned from the outset to do this. We never did, in part because we were concerned that any system would be poor, and it might discourage the experimentalists from trying the low-scoring hits. I am not aware of anything in the literature that is really effective in this regard, though I am aware of various research projects underway to address this problem. One can appreciate the difficulty of this task by realizing that, to succeed, one must in effect be able to predict the binding free energy of a ligand given a protein structure.

WHAT HAVE WE LEARNED IN THE PAST TEN YEARS OF APPLYING 3D DATABASE SEARCHING?

One of my favorite questions to pose is: What are the general principles that govern ligand-receptor recognition? It amuses me greatly to hear that many people are absolutely convinced they know what these principles are, but that there is little consensus on what these are. The answers to this question are critical, since they underlie all of our thinking about posing queries for 3D database searching, and de novo design. And, interestingly, one of the lessons that has come out of our experience with 3D database searching is some empirical insight into the answers to that question.

The oldest principle for ligand-receptor recognition is the lock-and-key model proposed by Fischer in 1894. The notion of shape complementarity being the dominant element of ligand-receptor recognition is the guiding light behind the DOCK approach and related shape-matching methods. I'm certain it is a controversial point, but one conclusion I would make in surveying our experiences of the last ten years is that steric complementarity is necessary but not sufficient, and that it is apparently a second-order effect, not the dominant one. What convinces me of that is the relatively low proportion of active hits that come out of a shape-based search, and the high proportion of totally absurd molecules, compared to the proportion of active hits that one sees from a pharmacophore-based search. There is little published data on this issue, and I hope we will see some in the future, to either support or overturn my assertion.

It is an article of faith in the modelling community that accurate treatment of conformational flexibility is critical if 3D database searching is to succeed. When we described our initial successes with ALADDIN searching single-conformer, CONCORD-built databases, the response from the modelling community consistently reflected that faith, at times to the point of apparently disbelieving the success we reported. And while it is indisputable that we are now much better off with the various algorithms for conformationally-flexible search and for constructing multi-conformer databases, I believe it is still safe to assert that differences in various methods of treating conformational flexibility don't make huge differences in the effectiveness of 3D database searching. The primary effect is to allow it to be more selective, to allow one to build much more refined queries and hence to retrieve fewer hits. Initially, there was much concern about the question 'how do we deal with all these hits?', but now it appears that, with highly selective queries and high-throughput screening, the number of hits doesn't need to be that large, and whatever the number of hits is, they're readily screened.

That issue leads to another observation from our initial experience, one first reported by Guner, Henry, and Pearlman [45], and later made more rigorous by my method for assessing the conformational flexibility in 3D database search. Namely, there is a complementarity between the precision of the query and the demands on the treatment of conformational flexibility: imprecise queries can be handled by single-conformer databases, whereas precise queries (i.e. ones with tight tolerances in the distance constraints) require high-quality treatment of conformational flexibility. I call this 'complementarity', in the sense used by Bohr in quantum mechanics.
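
The trade-off between query precision and conformational treatment can be made concrete with a toy Python sketch. It is only an illustration under stated assumptions: the function name matches, the three per-conformer distances and both tolerance windows are invented for the example and are not taken from reference [45] or from any real database.

```python
def matches(distances, lo, hi):
    """True if any stored conformer has its feature-feature distance within [lo, hi]."""
    return any(lo <= d <= hi for d in distances)


# Hypothetical distances (Angstroms) between the same two features
# in three conformers of one flexible molecule.
conformers = [5.2, 6.5, 7.8]

# A single-conformer database stores only one of them (here, the first).
single = conformers[:1]

loose = (4.0, 8.0)   # imprecise query: wide tolerance
tight = (6.4, 6.6)   # precise query: tight tolerance

print(matches(single, *loose))       # loose query, single conformer
print(matches(single, *tight))       # tight query, single conformer
print(matches(conformers, *tight))   # tight query, multi-conformer set
```

With these invented numbers the loose query retrieves the molecule from the single stored conformer, while the tight query retrieves it only when the other conformers are available.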

I think one of the great surprises of our experience with 3D database searching is that these generalized pharmacophores are inexplicably useful. As protein crystal structures started becoming widely available, many seemed to treat the notion of a pharmacophore as some musty relic of the grind-and-bind past. I believe our experience has revitalized the notion of a pharmacophore, and it certainly leads to the fundamental theoretical question of 'why should this be so? - why should one be able to characterize the essence of binding by a small number of interactions?'. Especially astonishing is that dyad queries (2 functional groups connected by one distance constraint) work. (It has almost become a parlor game for me to search any corporate database with a query consisting of a basic nitrogen separated from a hydrophobic group by 5 to 7 Angstroms. Sorting the hits by molecular weight inevitably brings up a handful of CNS-active molecules at the top of the hit list, astonishing all the onlookers.) At a fundamental theoretical level, this suggests that ligand-receptor recognition is highly non-linear, that certain interactions contribute large amounts to the binding affinity, others less so (in the case of CNS-active agents, loss of that basic nitrogen usually results in the loss of 5 orders of magnitude of activity). Since most of our binding-energy-predicting methods are based on explicitly linear expressions for energy, this poses quite a conundrum. This suggests strongly that the essence of ligand-receptor recognition is NOT lock-and-key, though it remains to be understood exactly what the principles governing ligand-receptor recognition are.
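
The dyad query mentioned above (a basic nitrogen and a hydrophobic group 5 to 7 Angstroms apart, with hits sorted by molecular weight) is simple enough to sketch in a few lines of Python. This is a minimal illustration and not the ALADDIN implementation. The Molecule record, the function names, the compound names and all coordinates are hypothetical. The sketch assumes a single stored conformer per molecule, assumes that feature perception (deciding which atoms count as basic nitrogens or hydrophobic centers) has already been done, and assumes the sort is in ascending order of molecular weight.

```python
from math import dist
from typing import List, NamedTuple, Tuple

Point = Tuple[float, float, float]


class Molecule(NamedTuple):
    name: str
    mol_weight: float
    basic_nitrogens: List[Point]   # coordinates of pre-identified basic N atoms
    hydrophobes: List[Point]       # coordinates of pre-identified hydrophobic centers


def dyad_hit(mol, lo=5.0, hi=7.0):
    """True if any basic nitrogen lies lo..hi Angstroms from any hydrophobic center."""
    return any(lo <= dist(n, h) <= hi
               for n in mol.basic_nitrogens
               for h in mol.hydrophobes)


def dyad_search(database, lo=5.0, hi=7.0):
    """Return the hits sorted by ascending molecular weight (an assumed ordering)."""
    hits = [m for m in database if dyad_hit(m, lo, hi)]
    return sorted(hits, key=lambda m: m.mol_weight)


# Invented single-conformer records.
database = [
    Molecule("cmpd_1", 310.4, [(0.0, 0.0, 0.0)], [(6.1, 0.0, 0.0)]),
    Molecule("cmpd_2", 180.2, [(0.0, 0.0, 0.0)], [(3.0, 4.0, 0.0), (9.5, 0.0, 0.0)]),
    Molecule("cmpd_3", 250.3, [(0.0, 0.0, 0.0)], [(9.0, 0.0, 0.0)]),
    Molecule("cmpd_4", 420.5, [], [(1.0, 0.0, 0.0)]),
]

for hit in dyad_search(database):
    print(hit.name, hit.mol_weight)
```

Here cmpd_3 fails because its only hydrophobic center is too far from the nitrogen, and cmpd_4 fails because it has no basic nitrogen at all. The remaining two hits come back lightest first.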

I also believe our experience indicates that it is important to be able to express sophisticated queries, at least in a practical setting, to avoid an overload of hits. The types of things I refer to here are, for example, the ability to identify 'exocyclic' as a property of an atom, or 'tertiary carbon'. Very often, I will initially compose a query and find lots of things in the hit list that either I know from chemical intuition aren't correct, or that do not reflect what I wanted in my query. It is important to be able to immediately cycle back to modify the query to eliminate such hits.

Everyone in the pharmacophore-based-searching area is still having difficulty composing effective queries. This is true in database searching in general, and many ideas have been worked out there which have not appeared in the chemical 3D database searching arena, e.g. query-by-example. It is quite clear from our collective experience that good ideas translate eventually into good queries, which, when combined with a good database (usually a large corporate database) and screening capabilities, frequently lead to useful hits, but this process in the early query-building phase is still too tedious and error-prone.

I believe it is safe to say that our experience also suggests that there are no good methods for scoring or ranking hits. In my mind, it will be interesting to see if we can make some rough progress here, to demonstrate that it is a feasible goal to which to aspire.

Experiences with both 3D database searching and de novo design indicate that these methods are complementary, with neither obviating the other. At the risk of over-simplification, I would contend that 3D database searching is more appropriate for lead-finding, because our knowledge of the receptor-ligand recognition is usually poor at that stage and one doesn't want to invest much time synthesizing new molecules to test one's ideas. On the other hand, de novo design is most appropriate for lead optimization, to progress past the point where some patentable lead structure is already in hand by modifying it to optimally complement the receptor, and where one is at the point where one's understanding justifies a significant investment of synthetic efforts.

REFERENCES

1. Gund, P., Wipke, W.T., Langridge, R., "Computers in Chemical Research, Education, and Technology", 3:5, 1974.

2. Gund, P., Ann. Rep. in Med. Chem, 14:299, 1979.

3. Brint, A.T., Willett, P., J. Mol. Graph., 5:49, 1987.

4. Jakes, S.E., and Willett, P., J. Mol. Graph., 4:12, 1986.

5. Sheridan, R.P. et al., J. Chem. Inf. Comp. Sci., 29:255, 1989.

6. Kuntz, I.D., et al., J. Mol. Biol., 161:269, 1982.

7. DesJarlais et al., J. Med. Chem., 29:2149, 1986.

8. Van Drie, J.H., Weininger, D., Martin, Y.C., J. Comp-Aided Mol. Design, 3:255, 1989.

9. Bartlett, P.A., et al., in Molecular Recognition: Chemical and Biological Problems, Roberts, S.M. (ed.), London: Royal Soc. of Chemistry, 78:182, 1989; Lauri, G., Bartlett, P.A., J. Comp.-Aided Mol. Design, 8:51, 1994.

10. CONCORD, Rusinko A., et al. A proper description of CONCORD has yet to appear in a journal publication. The best reference to it is still Andy Rusinko's 1988 Ph.D. thesis, available from University Microfilms, Ann Arbor, MI.

11. Christie, B.D., et al., Online Inf., 90, 1990:137, 1990.

12. Murrall, N.W. and Davies, E.K., J. Chem. Inf. Comp. Sci., 30:312, 1990.

13. Hurst, T., J. Chem. Inf. Comp. Sci., 34:190, 1994.

14. Moock, T.E., et al., J. Chem. Inf. Comp. Sci., 34:184, 1994.

15. Lewis, R.A. and Dean, P.M., Proc. Roy. Soc., B236:125, 1989.

16. Moon, J.B. and Howe, W.J., Proteins: Str. Func. and Gen., 11:314, 1991.

17. Gillet, V., et al., J. Comp.-Aided Mol. Design, 7:127, 1993.

18. Kuntz, I.D., Science, 257:1078, 1992.

19. Martin, Y.C., J. Med. Chem., 35:2145, 1992.

20. Willett, P., 3D Chemical Structure Handling, New York: Wiley, 1991.

21. Brint, A.T., Willett, P., J. Mol. Graphics, 5:49, 1987.

22. Good, T., et al., J. Comp.-Aided Mol. Design, 9:1, 1995.

23. For example, Mayer et al., J. Comp.-Aided Mol. Design, 1:3, 1987.

24. See, for example, Cormen, T.H., Leiserson, C.E., and Rivest, R.L, Introduction to Algorithms, Cambridge: MIT Press, 1990.

25. Hicks, M.G., Jochum, C., J. Chem. Inf. Comput. Sci., 30:191, 1990.

26. The only reference to the internal workings of SSSS was an article in Software Entwicklung in der Chemie 1, J. Gasteiger (ed.), Berlin: Springer-Verlag. The detailed reference I've now lost. HTSS is briefly described by Nagy, M.Z., et al, Chemical Structures 1, Warr, W. (ed.), Berlin:Springer-Verlag, 1988, p. 127.

27. Dolata, D.P., Leach, A.R., Prout, K., J. Comp.-Aided Mol. Design, 1:73, 1987.

28. Leach, A.R., Dolata, D.P., Prout, K., J. Chem. Inf. Comput. Sci., 30:316, 1990.

29. I'm not aware of any publication describing his method. The software is available from Daylight Chemical Information Systems, Irvine, CA.

30. Kearsley, S.K., et al., J. Comp.-Aided Mol. Design, 8:565, 1994.

31. Sadowski, J., Gasteiger, J., Klebe, G., J. Chem. Inf. Comp. Sci., 34:1000, 1994.

32. Hurst, T., J. Chem. Inf. Comp. Sci., 34:190, 1994.

33. A description of this method has not appeared in a journal publication. A pamphlet is available from Biosym/MSI, at the moment headquartered in San Diego, CA.

34. Martin, Y.C., et al., J. Comp.-Aided Mol. Design, 7:83, 1993.

35. Smellie, A.S., Crippen, G.M., Richards, W.G., J. Chem. Inf. Comp. Sci., 31:386, 1991.

36. Golender, V.E., Rozenblit, A.B., Zh. Vses. Khim. O-va., 25:28, 1980.

37. DesJarlais et al., Proc. Nat. Acad. Sci. USA, 87:6644, 1990.

38. Wang et al., J. Med. Chem., 37:4479, 1994.

39. Good, T., et al., J. Comp.-Aided Mol. Design, 9:1, 1995.

40. Ho, C.M.W., and Marshall, G.R., J. Comp.-Aided Mol. Design, 9:65, 1995.

41. Erickson, J.W., et al., Science, 249:527, 1990.

42. Goodford, P.J., J. Med. Chem., 28:849, 1985.

43. Gillet, V.J., et al., J. Chem. Inf. Comp. Sci., 34:207, 1994.

44. Wagner, G., Nature Str. Biol., 2:255, 1995.

45. Guner, O.F., Henry, D.R., Pearlman, R.S., J. Chem. Inf. Comput. Sci., 32:101, 1992.



NetSci, ISSN 1092-7360, is published by Network Science Corporation. Except where expressly stated, content at this site is copyright (© 1995 - 2010) by Network Science Corporation and is for your personal use only. No redistribution is allowed without written permission from Network Science Corporation.