A Note on the Sense and Nonsense of Searching 3-D Databases for Pharmaceutical Leads
David Weininger
Daylight Chemical Information Systems, Inc.
http://www.netsci.org/Science/Cheminform/feature04.html
[Editor's note: This paper is the result of a request made to Dr. Weininger for a contribution for the NetSci issue on 3-D Databases. While the paper is neither a review nor a research paper, it consists of an overview of the basis for current 3-D database searching and suggestions for the future.]
Introduction
Searchable 3-D databases have become accessible to pharmaceutical chemists in recent years. In most respects, such databases are logical extensions of 3-D databases which have been around for a relatively long time. New capabilities are also becoming available as new types of data are stored and new data analysis techniques are developed. As with most new technologies, there is potential for both use and abuse. A number of strategies are described in overview. Although many of these strategies are used for pharmaceutical lead-finding, it appears that the majority of 3-D database searching done so far doesn't make much sense for this purpose. Suggestions are made for improving the relevance of such efforts in the future.
Background: 3-D Databases of Observed Conformations
The term "3-D databases" is used here to refer to databases containing descriptions of chemical conformations. In the vast majority of such databases, chemical conformations are represented by geometric positions of atoms. Other searchable descriptions of molecular conformation are possible which are arguably more relevant to pharmaceutical chemistry (e.g., surface descriptions and property-geometry representations), but these are currently not widely used as research tools.
Traditional 3-D databases are used to archive observed conformations, which are primarily derived from X-ray crystal structures. For mainly historical reasons, there is a large dichotomy between the database technologies used to manage conformations of "large molecules" (e.g., biopolymers such as proteins) and "small molecules" (all other structures). The most important databases representing these types are the Brookhaven Protein Data Bank (PDB) and the Cambridge Structural Database (CSDB), respectively. Methodologies for searching and analyzing observed conformations of large molecules are, in most respects, more advanced than those for small molecules. This is probably due to the better availability of PDB compared to CSDB, rather than the nature of the problem. However, in neither case can it be considered a "solved" problem.
Databases of observed conformations represent a 3-D form of molecular structure which exists in nature. Because of this direct relationship, searching and analysis software may be used to examine the nature of chemistry in 3 dimensions. Much of what we think we know about theoretical chemistry is validated from the small molecule data, since crystal structures represent certifiable low-energy conformations. Similarly, a large body of empirical information has been derived from observed large molecule conformations, e.g., relationships between primary and secondary structure.
From a pharmaceutical point of view, one is confronted with the fact that crystal structures represent an environment (a molecule dissolved in itself and frozen) which is very far removed from that of interest (typically, a protein surface or intracellular environment). However, when it comes to observations of 3-D structure, crystals are mainly what one can observe. Important exceptions to this generalization exist which provide data of greater relevance. A significant number of crystal structures are available for large molecules with bound ligands. A modest number of good-quality data are becoming available for NMR-derived structures which provide information about the 3-D nature of molecules in solution.
Foreground: 3-D Databases of Computed Conformations
A different type of 3-D database is used to manage computational results, which almost universally contain small organic molecules (since large molecule conformations can't be sensibly computed). Although such databases are superficially similar to those used to store observed conformations of small molecules, they share very little commonality in function.
Computed conformations don't represent observations of nature. They represent the result of a computer algorithm which is typically a rule-based or mathematical model. In general, it's pointless to examine databases of such conformations to discover the "nature" of their 3-D structure -- one will only "discover" the nature of the algorithm used to generate them. This is something best done by examining the algorithm directly unless the algorithm used is extremely complex, e.g., ab initio computation. The potential benefit of such database systems for purposes of pharmaceutical discovery is small. (The closely related problem of pharmacophore hypothesis generation, i.e., finding common structural features given a set of actives, is typically done by conformational analysis rather than 3-D database searching.)
Search systems for large databases of computed conformations are therefore oriented toward finding conformations matching user-specified 3-D patterns rather than toward discovering such patterns in computed 3-D data. Patterns can range from simple geometric relationships (e.g., 3- and 4-point pharmacophores) to comprehensive descriptions of a surface or cavity. The ideal pattern for pharmaceutical lead-finding is one which recognizes site activity. All current types of 3-D search patterns are only very rough approximations to this ideal. In the absence of a theoretically preferable choice, selection of a search method is generally dictated by how the hypothesis was generated, by availability of supporting software, or by the investigator's preference.
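The simplest kind of pattern mentioned above, a 3-point pharmacophore, can be sketched in a few lines. This is only an illustration: the feature labels, coordinates, tolerance, and the function name matches_three_point are hypothetical and do not come from any particular search system.

```python
import math
from itertools import permutations


def matches_three_point(features, labels, distances, tol=0.5):
    """Test one stored conformation against a 3-point pharmacophore.

    features  : list of (label, (x, y, z)) feature points for the conformation
    labels    : the three required feature labels (l1, l2, l3)
    distances : target distances (d12, d13, d23), same units as the coordinates
    tol       : allowed absolute deviation per distance (hypothetical default)
    """
    d12, d13, d23 = distances
    # Try every ordered triple of distinct feature points.
    for (la, pa), (lb, pb), (lc, pc) in permutations(features, 3):
        if (la, lb, lc) != tuple(labels):
            continue
        if (abs(math.dist(pa, pb) - d12) <= tol
                and abs(math.dist(pa, pc) - d13) <= tol
                and abs(math.dist(pb, pc) - d23) <= tol):
            return True
    return False


# Hypothetical query: donor-acceptor 3.0, donor-aromatic 4.0, acceptor-aromatic 5.0
QUERY_LABELS = ("donor", "acceptor", "aromatic")
QUERY_DISTANCES = (3.0, 4.0, 5.0)

# Two made-up conformations of the same molecule.
compact = [("donor", (0, 0, 0)), ("acceptor", (3, 0, 0)), ("aromatic", (0, 4, 0))]
extended = [("donor", (0, 0, 0)), ("acceptor", (3, 0, 0)), ("aromatic", (0, 8, 0))]

print(matches_three_point(compact, QUERY_LABELS, QUERY_DISTANCES))   # True
print(matches_three_point(extended, QUERY_LABELS, QUERY_DISTANCES))  # False
```

Note that the same molecule passes or fails depending only on which conformation happens to be stored, which is the issue taken up in the sections that follow.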
There are three basic methods for searching a 3-D database. The simplest is to search one or more stored conformations per molecule directly ("fixed search"). A more powerful method is to use one or more stored conformations to provide the starting point for a simplified conformational analysis which results in testing a range of accessible conformations for each molecule ("flexible search"). The most powerful method (in terms of the pattern to be found) is to generate compliant conformations for molecules in the database which are independent of stored 3-D data ("directed 3-D generation and/or sampling"). Performance optimizations are available for each of these methods which trade larger space requirements for higher speed. In general, more powerful searches are slower because less effective performance optimizations are available.
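The difference between a fixed and a flexible search can be illustrated with a toy model. The sketch below assumes a hypothetical molecule with a single rotatable bond and two feature points whose separation depends only on that torsion; the geometry, the 30-degree sampling step, and the function names are invented for illustration. A real flexible search would start from the stored conformation and use a proper conformational analysis rather than a blind torsion scan.

```python
import math


def feature_distance(torsion_deg):
    """Distance between two feature points of a toy one-torsion molecule.

    The rotatable bond lies along the x axis; feature A is fixed and
    feature D swings around the axis as the torsion changes.
    """
    phi = math.radians(torsion_deg)
    a = (-0.5, 1.4, 0.0)
    d = (2.0, 1.4 * math.cos(phi), 1.4 * math.sin(phi))
    return math.dist(a, d)


def fixed_search(stored_torsions, target, tol):
    """Test only the conformations actually stored in the database."""
    return [t for t in stored_torsions
            if abs(feature_distance(t) - target) <= tol]


def flexible_search(target, tol, step_deg=30):
    """Test a coarse sample of torsion values around the rotatable bond."""
    return [t for t in range(0, 360, step_deg)
            if abs(feature_distance(t) - target) <= tol]


# The database stores only the extended (180 degree) conformation.
print(fixed_search([180], target=2.9, tol=0.1))   # []
print(flexible_search(target=2.9, tol=0.1))       # [60, 300]
```

In this toy case the molecule is a legitimate hit, but only the flexible search finds it, at the cost of testing twelve conformations instead of one.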
In most cases, one or more conformations must be invented for each molecule in the database. The most widely-used method is to generate a single conformation using a rule-based model builder such as CONCORD or CORINA. The main advantage of this method is that it is very fast; the main disadvantage is that it produces a single low-energy conformation (at least with currently available tools). An alternate approach is to generate one or more conformations by sampling conformation space for the molecule, which is typically done with a distance geometry method (e.g., DG2, DGEOM, Rubicon). The advantage of this approach is that it provides conformation sampling for a subsequent fast, fixed search; the disadvantages are that it is slower and uses more storage space. Various energy minimization methods are also used, e.g., ab initio, semi-empirical, and molecular mechanics. Minimization methods are often used in concert with one of the other methods, e.g., conformations sampled by distance geometry may be minimized before storage.
Searching Nonsense
Whatever methodology is used for searching 3-D databases, the value of such searches is primarily dependent upon the relevance of the results to the problem at hand. Although this sounds trivial, it is probably the weakest part of all current 3-D database searching strategies.
Many of the problems posed to databases of observed conformations are quite simple geometric queries, e.g., "What is the range of out-of-plane torsions for ortho chlorides?", "What is the average length of a peptide bond?", and "How frequently does Ala-Gln-Tyr occur in a sheet?" By definition, we are asking questions about the conformations per se. The nice thing about such geometric questions is that they can be answered definitively by a geometric search.
In pharmaceutical discovery, questions posed to a database of computed conformations are generally in the form, "Find molecules that fit this binding site (or model of a binding site)." Although such questions are posed in geometric terms, they are actually asking for information about molecules, not conformations per se. For a fixed 3-D search, we need to assert that the desired bound molecular conformations exist in the database. If this assertion is not true, the search will fail with any pattern-recognition method. (Similarly, for flexible searches, the required assertion is that the desired conformation will be considered by the search algorithm.)
In practice, the required assertion may be true for some molecules in the database and false for others that would otherwise be hits. Search efficiency is dependent upon the amount of time required to do the search, the rigor of the search algorithm and the relevance of stored conformations.
The relevance of stored conformations is taken to be the fraction of would-be molecule hits for which desired conformations are accessible. This factor is hard to quantify. It is dependent upon the search method (fixed search < flexible search < directed generation < exhaustive analysis) and distribution of molecular structures in the database (flexible structures < rigid structures). In the extreme, it is clear that if all database structures are completely rigid, conformational relevance would be "100%". If the database structures were all extremely flexible (e.g., substituted acyclic C20's) it would be very poor. Molecular distributions in most real-world databases fall between these extremes.
One approach to answering this question is to look at the relative hit rate for flexible vs. fixed 3-D searches for the same query across the same set of database molecules. Although affected by a large number of factors (specific query and specific database), this ratio is quite small in practice, generally between 2 and 5, i.e., a flexible search finds 2 to 5 times as many hits as a fixed search on single conformations. This does not tell us anything about the relevance of flexible search results. The only conclusion that can be drawn from this observation is that the efficiency of fixed 3-D database searching is no better than 20 - 50%.
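The arithmetic behind the 20 - 50% figure can be written out explicitly. The sketch assumes that every fixed-search hit is also found by the flexible search, and that the flexible search may itself miss true hits, so the reciprocal of the ratio is an upper bound rather than an estimate. The function name is hypothetical.

```python
def fixed_search_efficiency_bound(flexible_to_fixed_ratio):
    """Upper bound on the fraction of findable hits a fixed search returns.

    If a flexible search returns `ratio` times as many hits as a fixed
    search, the fixed search found at most 1/ratio of the hits that exist.
    """
    if flexible_to_fixed_ratio < 1:
        raise ValueError("a flexible search should not find fewer hits")
    return 1.0 / flexible_to_fixed_ratio


print(fixed_search_efficiency_bound(2))  # 0.5
print(fixed_search_efficiency_bound(5))  # 0.2
```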
A more serious problem in geometric searches over databases of computed conformations of flexible molecules is that many conformations of interest are systematically not represented (or in the case of flexible search, may not be accessible). This is because the methods used to compute database conformations are purposely selected to produce low energy conformers (or approximations thereof). The in vacuo environment for which conformations are computed is far from that of pharmaceutical relevance (solvated and bound). Binding to biological catalysts is a relatively high energy process -- perhaps 10 - 50 kcal/mole of non-covalent energy for high-specificity competitive inhibitors -- in any case, plenty of energy to twist a few rotatable bonds to move the bound conformer far from a lowest energy conformation.
These considerations raise serious questions about the relevance of searching 3-D databases of low energy conformers, at least for flexible molecules. Except for rigid molecules, few known drug molecules bind in their fully extended, lowest energy in vacuo conformation. One must conclude that rigid molecules and these rarities are the only ones that will be found by a fixed-conformation 3-D search. Unfortunately, the plurality of 3-D database searching currently done depends on exactly this type of search.
A flexible search improves the prospects of finding relevant hits, but only to the extent that it accesses higher energy structures (typically reducing net efficiency). To maintain efficiency, many "flexible" searches only search low energy conformations. Directed conformation sampling improves prospects further (at reduced efficiency). However, post-sampling energy minimization as typically done probably reduces the benefits of this strategy.
Making Sense of the Static
The perspective outlined above suggests a number of sensible approaches to the problem of finding relevant leads via searching 3-D databases of computed conformations.
It is clear that the most relevant search methods, exhaustive search and directed conformation generation, are also the most computationally expensive by one or two orders of magnitude. One sensible approach would be to grind it out using 100x the computational effort required by current online systems. From a software designer's point-of-view, the easiest way to do this is to wait until such computational power is readily available, something on the order of 10 years. This isn't a viable option for molecular discovery chemists who are responsible for coming up with next month's test compounds.
With respect to the given problem, the main failing of the current crop of fast, rule-based conformation generators (e.g., CONCORD and CORINA) is that they are limited to producing one lowish energy conformation per molecule. There is nothing inherent in rule-based model building that makes this inevitable. When such programs operate on flexible molecules, they make a number of arbitrary and near-arbitrary choices. It should be possible to build a rule-based model builder that produces a variable number of conformations, ideally within a specifiable limit of the lowish energy structure. In principle, such a program would allow 3-D searching to operate on a larger number of more relevant structures (at a cost in space) or operate on-the-fly (at a cost in time). If the ability to add target constraints was combined with on-the-fly rule-based conformation generation, the system could theoretically approach directed conformation sampling in quality, but at speeds approaching those of fixed searches.
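A rough sketch of the idea of a rule-based builder that returns several conformations within a specifiable energy limit is given below. It is not how CONCORD or CORINA work. It assumes each rotatable bond independently takes one of three torsion states with an additive penalty; the penalty values, the table TORSION_STATES, and the function name are purely illustrative.

```python
from itertools import product

# Hypothetical rule table: torsion state (degrees) -> energy penalty (kcal/mol).
TORSION_STATES = {180: 0.0, 60: 0.9, -60: 0.9}


def enumerate_conformations(n_rotatable_bonds, energy_window, max_conformations=None):
    """Return (energy, torsions) pairs within energy_window of the lowest one.

    Energies are a simple sum of per-bond penalties (an assumption); results
    are sorted from lowest to highest energy and optionally truncated.
    """
    scored = []
    for torsions in product(TORSION_STATES, repeat=n_rotatable_bonds):
        energy = sum(TORSION_STATES[t] for t in torsions)
        scored.append((energy, torsions))
    scored.sort(key=lambda item: item[0])
    lowest = scored[0][0]
    # A small epsilon guards against floating-point rounding at the boundary.
    kept = [(e, t) for e, t in scored if e - lowest <= energy_window + 1e-9]
    return kept[:max_conformations]


# Two rotatable bonds: 9 combinations in total.
print(len(enumerate_conformations(2, energy_window=0.0)))  # 1 (single lowest conformation)
print(len(enumerate_conformations(2, energy_window=1.0)))  # 5
print(len(enumerate_conformations(2, energy_window=2.0)))  # 9
```

Widening the energy window trades storage space (or on-the-fly time) for coverage of conformation space, which is the trade-off described above.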
The simplest and most reliable approach to finding relevant structures is to work only with completely rigid or semi-rigid structures (e.g., with 2 or fewer rotatable bonds which affect overall conformation). The majority of problems associated with result relevancy are caused by structure flexibility: this approach attempts to finesse such problems rather than solve them. Limiting the database to rigid structures will significantly reduce the number of structures available in a given database. This has the obvious disadvantage that many potentially viable compounds are eliminated from consideration, but the corresponding advantages are that: results are likely to be relevant to the actual process, the fraction of spurious hits will presumably be reduced, and computational resources are conserved.
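Restricting a database to rigid or semi-rigid structures amounts to a simple pre-filter. The sketch below assumes each record already carries a count of rotatable bonds that affect overall conformation (computing that count is the hard part and is not shown); the record layout, identifiers, and function name are hypothetical.

```python
def select_rigid(records, max_rotatable_bonds=2):
    """Keep only records with at most max_rotatable_bonds rotatable bonds."""
    return [r for r in records if r["rotatable_bonds"] <= max_rotatable_bonds]


# Made-up database records; the counts are illustrative, not measured.
DATABASE = [
    {"id": "mol-001", "rotatable_bonds": 0},
    {"id": "mol-002", "rotatable_bonds": 2},
    {"id": "mol-003", "rotatable_bonds": 7},
    {"id": "mol-004", "rotatable_bonds": 19},
]

rigid_subset = select_rigid(DATABASE)
print([r["id"] for r in rigid_subset])  # ['mol-001', 'mol-002']
```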
It is recognized that this approach (working only with structures that are fairly rigid) doesn't follow the "rules of the game" for 3-D searching, but when combined with other advances in pharmaceutical chemistry, it might get the job done. In particular, it would appear that combinatorial synthesis could provide a mechanism for producing large numbers of rigid structures (given appropriate chemistry). The main disadvantage of this approach is that it is not a pure "data mining" technique (it requires new synthesis); the main advantage is that it can potentially deliver a large number of relevant structures relatively cheaply.
It must also be mentioned that there is the possibility that radically different types of geometric searching can be developed which are significantly more efficient and relevant to real-world problems than current methodologies. As described above, virtually all current 3-D database search methods are based on the positions of points or vectors in Euclidean space. Given the profound theoretical weaknesses inherent in geometric searching as currently done (and given the very modest success of these methods), there is no reason to believe that this representation is superior to all others. A number of researchers are experimenting with search methods which operate on direct representations of surfaces, volumes, and other mathematical descriptors of shape-space. To date, none of these methods have entered the mainstream of 3-D database searching, but this is most definitely an open area for research.
One final suggestion is based on the observation that, in general, observed large molecule conformations are used as the basis for developing pharmacophoric search criteria, but not used for searching directly. Observed conformations of interesting large molecules are becoming available at an ever-increasing rate. It might prove valuable to reverse the sense of 3-D database searching, i.e., look for a match of each given small molecule conformation to the surface of a database of large molecules. Unfortunately, most 3-D database search systems are not designed to efficiently search in this direction. Although such an approach might be computationally expensive, it offers a high potential for finding new pharmaceutical leads.
Conclusion
Searching 3-D databases containing single computed conformations of flexible molecules for pharmaceutical leads is inefficient in almost every conceivable respect. Even so, this continues to be the method that most researchers use. A number of alternative strategies are available which should at least increase the efficiency of the process. A promising application of 3-D database search technology might be to operate on databases of relatively rigid theoretical structures which can be generated in combinatorial libraries (or other "lightweight" syntheses).
NetSci, ISSN 1092-7360, is published by Network Science Corporation. Except where expressly stated, content at this site is copyright (© 1995 - 2010) by Network Science Corporation and is for your personal use only. No redistribution is allowed without written permission from Network Science Corporation.