Novel Software Tools for
Addressing Chemical Diversity

R. S. Pearlman

Laboratory for Molecular Graphics and Theoretical Modeling
College of Pharmacy
University of Texas
Austin TX 78712
E-mail: pearlman@vax.phr.utexas.edu



http://www.netsci.org/Science/Combichem/feature08.html

Perhaps the most fundamental task related to chemical diversity is that of selecting a diverse subset of compounds from a much larger population of compounds. The obvious objective of that task is to identify a subset which best represents the full range of chemical diversity present in the larger population. But what if that larger population does not include representatives of one or more chemical classes or pharmacophores? How could one know whether such compounds were under-represented in a particular population or database? How could one identify compounds from other databases or compound libraries which would fill in the chemical diversity missing from the first? This presentation will address those questions and the solutions we have implemented in our DiverseSelector package for managing chemical diversity.

Before directly addressing the tasks of identifying and filling in missing diversity, we must first address issues related to the concept of "chemistry space." Just as x-, y- and z-coordinates define positions of points in a 3-dimensional Cartesian space, the values of N different "molecular descriptors" define the positions of chemical compounds in an N-dimensional chemistry space. Whereas the dimensionality of our physical world is predefined as 3, the dimensionality of a chemistry space can be chosen to best suit our particular needs. The "diversity" of compounds positioned in chemistry space is intuitively related to the inter-compound distance as measured in that space.

Most software for addressing chemical diversity uses some form of "molecular fingerprint" to describe each compound in a population. Fingerprints are bit-strings (sequences of 1's and 0's) representing the answers to yes/no questions about the presence or absence of various substructural features within the molecular structure of a given compound. Although not often discussed in such terms, each bit represents an axis in a multi-dimensional chemistry space. Each axis can take either of two values: 0 or 1. Fingerprints typically consist of hundreds or even thousands of bits. Thus, a 1000-bit fingerprint represents a point in a 1000-dimensional chemistry space. Similar compounds are expected to be located near each other in this space; dissimilar or "diverse" compounds are expected to be farther apart from each other. It is often assumed that the Tanimoto dissimilarity index (1 minus the well-known Tanimoto similarity index) is directly related to inter-compound distance in these high-dimensional chemistry spaces. However, we have shown that the Tanimoto dissimilarity index is non-Euclidean and that another method for computing such distances may be advantageous. This, however, is not the focus of this presentation.
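
As a minimal illustration of the Tanimoto dissimilarity index mentioned above, the following Python sketch stores each fingerprint as an integer whose bits are the yes/no answers. The 8-bit fingerprints are hypothetical toy values, and treating two all-zero fingerprints as identical is an assumption made here, not something stated in the text.

```python
def popcount(bits: int) -> int:
    """Number of 1-bits in a fingerprint stored as a Python integer."""
    return bin(bits).count("1")


def tanimoto_similarity(fp_a: int, fp_b: int) -> float:
    """Shared on-bits divided by the on-bits present in either fingerprint."""
    union = popcount(fp_a | fp_b)
    if union == 0:
        # Assumption: two all-zero fingerprints are treated as identical.
        return 1.0
    return popcount(fp_a & fp_b) / union


def tanimoto_dissimilarity(fp_a: int, fp_b: int) -> float:
    """1 minus the Tanimoto similarity index."""
    return 1.0 - tanimoto_similarity(fp_a, fp_b)


# Toy 8-bit fingerprints (hypothetical, for illustration only).
fp_1 = 0b11010010
fp_2 = 0b10110010  # shares three on-bits with fp_1
fp_3 = 0b00001101  # shares no on-bits with fp_1

print(tanimoto_dissimilarity(fp_1, fp_2))
print(tanimoto_dissimilarity(fp_1, fp_3))
```

Here fp_1 and fp_2 share three of the five bits set in either fingerprint, so their similarity is 3/5; fp_1 and fp_3 share none, so their dissimilarity is 1.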

Satisfactory algorithms for diverse subset selection have been developed based upon calculations of inter-compound distances in these high-dimensional (fingerprint) chemistry spaces. However, distance-based algorithms are poorly suited to the tasks of identifying and filling in diversity voids. Imagine several hundred objects distributed (in some non-uniform fashion) across a 2-dimensional checkerboard. Given only the distances between all pairs of objects and no further information regarding their position on the checkerboard, how easy would it be to determine which squares contained very few objects or even none at all? And given only distance information, how easy would it be to determine whether some new object might be assigned to one of those under-represented squares? Clearly, identifying and filling in diversity voids would be trivially simple if we knew which and how many objects were located in each square.

We can extend the checkerboard analogy to a multi-dimensional chemistry space by defining a small, finite number of "bins" on each axis of that space. The bin-definitions, in turn, define multi-dimensional cells which, altogether, cover the entire space. Each compound occupies some particular cell. The population of each cell can easily be determined. "Empty" cells (those with less than some user-definable population) precisely identify diversity voids --- regions of chemistry space which are under-represented within the total population of compounds. Whether a new compound fills in missing diversity could be trivially determined simply by noting which cell it would occupy.
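
The cell bookkeeping described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: equal-width bins, the same number of bins on every axis, and hypothetical axis limits and coordinates. It is not the DiverseSelector implementation, whose bin definitions are not given here.

```python
from collections import Counter
from itertools import product


def cell_of(point, lows, highs, bins_per_axis):
    """Map a point in descriptor space to a tuple of bin indices (its cell)."""
    cell = []
    for value, low, high in zip(point, lows, highs):
        index = int((value - low) / (high - low) * bins_per_axis)
        # Clamp so values on or beyond the axis limits land in the edge bins.
        cell.append(min(max(index, 0), bins_per_axis - 1))
    return tuple(cell)


def cell_populations(points, lows, highs, bins_per_axis):
    """Count how many compounds fall in each occupied cell."""
    return Counter(cell_of(p, lows, highs, bins_per_axis) for p in points)


def diversity_voids(populations, n_axes, bins_per_axis, min_population=1):
    """Cells holding fewer than min_population compounds."""
    all_cells = product(range(bins_per_axis), repeat=n_axes)
    return [c for c in all_cells if populations.get(c, 0) < min_population]


# Hypothetical 2-dimensional example: a 3 x 3 "checkerboard" spanning 0..3.
lows, highs, bins = (0.0, 0.0), (3.0, 3.0), 3
compounds = [(0.5, 0.5), (0.7, 0.2), (1.5, 2.5), (2.9, 2.9), (2.1, 0.4)]

populations = cell_populations(compounds, lows, highs, bins)
voids = diversity_voids(populations, n_axes=2, bins_per_axis=bins)

# A new compound fills missing diversity if its cell is one of the voids.
new_compound_cell = cell_of((1.2, 1.9), lows, highs, bins)
print(new_compound_cell in voids)
```

Enumerating every cell, as diversity_voids does, is only practical because the space is low-dimensional.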

Cell-based (in contrast with distance-based) methods are ideally suited for identifying and filling in diversity voids and are also ideally suited for other tasks related to diverse subset selection. However, cell-based methods cannot be applied to high-dimensional representations of chemistry space. With just two (binary) bins on each axis of a 1000-dimensional chemistry space, there would be 2^1000 cells --- an astronomically large number. Thus, cell-based methods can only be applied if we are positioning compounds in a low-dimensional chemistry space (dimensionality less than 10 or so). This brings us to the crux of the issue: how can we rationally define a low-dimensional chemistry space?
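
The arithmetic behind this limitation is easy to check. In the sketch below, the 6-axis, 10-bin case is an illustrative assumption chosen for contrast, not a recommended setting.

```python
def cell_count(bins_per_axis: int, n_axes: int) -> int:
    """Total number of cells when every axis carries the same number of bins."""
    return bins_per_axis ** n_axes


fingerprint_cells = cell_count(2, 1000)  # binary bins, 1000-bit fingerprint
low_dim_cells = cell_count(10, 6)        # assumed: 10 bins on each of 6 axes

print(len(str(fingerprint_cells)))  # 302 decimal digits
print(low_dim_cells)                # 1000000
```

A million cells can be tabulated directly; a number with 302 digits cannot.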

Several groups have attempted to use "traditional" molecular descriptors (e.g., dipole moment, molecular weight, estimated logP, surface area, HOMO-LUMO gap, etc.) as the axes of a low-dimensional chemistry space. There are two basic reasons for which these efforts have not proven particularly useful. The first is that many of these "traditional" descriptors are highly correlated; the axes of a vector space should be orthogonal (uncorrelated). This first problem could be addressed to a limited extent by using principal components of the "traditional" descriptors as the axes but the second and more fundamental problem would remain. The "traditional" descriptors are whole-molecule descriptors which are, at best, only vaguely indicative of drug-receptor complementarity. The advantages of cell-based methods cannot be realized unless a definition of a meaningful low-dimensional chemistry space is developed.

We believe that our novel "BCUT values" represent descriptors useful for the definition of a meaningful low-dimensional chemistry space. Unlike all of the rigorously based work for which our Laboratory is known, our justification for the preceding statement is drawn primarily from our own empirical experience and the empirical experience of two other groups which considered relatively crude precursors to the BCUT approach described below.

In 1989, Burden [F. R. Burden, J. Chem. Info. Comp. Sci., 29, 225-7, 1989] suggested that a "molecular ID number" could be defined in terms of the two lowest eigenvalues of a matrix representing the hydrogen-suppressed connection table of the molecule. More specifically, Burden suggested putting the atomic numbers on the diagonal of the matrix. Off-diagonal matrix elements were assigned values of 0.1 times the nominal bond-type if the two atoms are bonded and 0.001 if the two atoms are not bonded. He also added 0.01 to the off-diagonal elements representing "leaf edges" in the molecular graph (i.e., terminal bonds to the last atom in a chain). In suggesting that a set of compounds could be ordered by eigenvalue he was actually proposing a 1-dimensional chemistry space. Since fingerprint-based similarity searching methods were just becoming available for modestly sized databases (under 0.5 million compounds), Burden's seemingly far-fetched suggestion was generally ignored.
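
The matrix construction described above can be sketched as follows. This is an illustration built only from the description in this text, not a reproduction of Burden's published procedure. In particular, a "leaf edge" is assumed to mean any bond in which one atom has a single heavy-atom neighbor. The eigenvalues are obtained with a small cyclic Jacobi routine so that the sketch needs nothing beyond the Python standard library, and the three-atom example (the heavy atoms of ethanol, C-C-O) is chosen for illustration.

```python
import math


def burden_matrix(atomic_numbers, bonds):
    """Burden-style matrix for a hydrogen-suppressed molecular graph.

    bonds maps an atom-index pair (i, j) to its nominal bond type (1, 2, 3).
    """
    n = len(atomic_numbers)
    degree = [0] * n
    for i, j in bonds:
        degree[i] += 1
        degree[j] += 1
    # Non-bonded pairs get 0.001; the diagonal holds the atomic numbers.
    m = [[0.001] * n for _ in range(n)]
    for i, z in enumerate(atomic_numbers):
        m[i][i] = float(z)
    for (i, j), bond_type in bonds.items():
        value = 0.1 * bond_type
        if degree[i] == 1 or degree[j] == 1:
            value += 0.01  # assumed reading of "leaf edge": bond to a terminal atom
        m[i][j] = m[j][i] = value
    return m


def symmetric_eigenvalues(matrix, tol=1e-20, max_sweeps=50):
    """Eigenvalues of a symmetric matrix by cyclic Jacobi rotations, ascending."""
    a = [row[:] for row in matrix]
    n = len(a)
    for _ in range(max_sweeps):
        off = sum(a[i][j] ** 2 for i in range(n) for j in range(i + 1, n))
        if off < tol:
            break
        for p in range(n - 1):
            for q in range(p + 1, n):
                if a[p][q] == 0.0:
                    continue
                # Rotation angle that zeroes a[p][q], kept within 45 degrees.
                tau = (a[q][q] - a[p][p]) / (2.0 * a[p][q])
                t = math.copysign(1.0, tau) / (abs(tau) + math.sqrt(1.0 + tau * tau))
                c = 1.0 / math.sqrt(1.0 + t * t)
                s = t * c
                for k in range(n):  # A <- A J (columns p and q)
                    akp, akq = a[k][p], a[k][q]
                    a[k][p], a[k][q] = c * akp - s * akq, s * akp + c * akq
                for k in range(n):  # A <- J^T A (rows p and q)
                    apk, aqk = a[p][k], a[q][k]
                    a[p][k], a[q][k] = c * apk - s * aqk, s * apk + c * aqk
    return sorted(a[i][i] for i in range(n))


# Heavy atoms of ethanol: C(0)-C(1)-O(2), both bonds single and terminal.
ethanol = burden_matrix([6, 6, 8], {(0, 1): 1, (1, 2): 1})
eigenvalues = symmetric_eigenvalues(ethanol)
two_lowest = eigenvalues[:2]  # the values a Burden-style ID would be built from
print(two_lowest)
```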

In 1993, Rusinko and Lipkus [A. Rusinko III and A. H. Lipkus, unpublished results obtained at Chemical Abstracts Service, Columbus OH] were eager to find some sort of "similarity searching method" applicable to the Chemical Abstracts Service (CAS) Registry File of approximately 12 million structures. Using a test database of 60,000 compounds, they were delighted to find that Burden's suggestion compared surprisingly well with the results of an accepted similarity searching procedure. They also experimented with the notion of assigning a constant value to all diagonal matrix elements or a constant value for all bonded off-diagonal elements but, in each case, were using the lowest eigenvalue of a single matrix to define a 1-dimensional chemistry space.

Based on Burden's original suggestion (B) and CAS's "validation" of the basic idea (C) we at the University of Texas (UT) added the following significant extensions which resulted in what we now refer to as the BCUT approach. First, we reasoned that if a 1-dimensional chemistry space showed some promise, a similarly defined multi-dimensional chemistry space should be even more promising. Second, we are interested in diversity with respect to the way in which compounds might interact with a bioreceptor. Atomic number has almost no bearing on the strength of intermolecular interactions. Rather, the strength of intermolecular interactions depends on atomic charges, atomic polarizabilities, and atomic H-bond-abilities. Thus, we proposed constructing three classes of matrices: one class with atomic charge-related values on the diagonal, a second class with atomic polarizability-related values on the diagonal, and a third class with H-bond-abilities on the diagonal. Third, we proposed using a variety of additional definitions for the off-diagonal elements including functions of interatomic distance, overlaps, computed bond-orders, etc. Fourth, we demonstrated that both the lowest and highest eigenvalues of Burden-like matrices should reflect aspects of the molecular structure. Clearly, considering all possible combinations of diagonal and off-diagonal choices, some method must be developed for rationally deciding which BCUT values (eigenvalues) would be best for representing the chemical diversity of a given population of compounds.
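
To make the idea concrete, the sketch below forms three Burden-like matrices for one three-atom fragment, one per property class, and takes the lowest and highest eigenvalue of each. Everything numeric here is a hypothetical placeholder: the diagonal values are not computed charges, polarizabilities, or H-bond-abilities, and the off-diagonal values simply reuse the 0.1/0.01/0.001 connectivity scheme described earlier. Because only the two extreme eigenvalues are needed, the sketch uses shifted power iteration, a choice made for brevity rather than the method used in our software.

```python
import math


def _dominant_eigenvalue(m, iterations=500):
    """Largest eigenvalue of a symmetric positive semidefinite matrix."""
    v = [1.0 + 0.37 * i for i in range(len(m))]  # arbitrary non-symmetric start
    for _ in range(iterations):
        w = [sum(mij * vj for mij, vj in zip(row, v)) for row in m]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            return 0.0
        v = [x / norm for x in w]
    mv = [sum(mij * vj for mij, vj in zip(row, v)) for row in m]
    return sum(vi * x for vi, x in zip(v, mv))  # Rayleigh quotient, |v| = 1


def extreme_eigenvalues(a):
    """(lowest, highest) eigenvalue of a symmetric matrix."""
    n = len(a)
    # Gershgorin bound: every eigenvalue lies in [-shift, +shift].
    shift = max(sum(abs(x) for x in row) for row in a)
    plus = [[a[i][j] + (shift if i == j else 0.0) for j in range(n)] for i in range(n)]
    minus = [[(shift if i == j else 0.0) - a[i][j] for j in range(n)] for i in range(n)]
    highest = _dominant_eigenvalue(plus) - shift
    lowest = shift - _dominant_eigenvalue(minus)
    return lowest, highest


def bcut_like_matrix(diagonal_values, off_diagonal):
    """Copy the off-diagonal template and place atomic property values on the diagonal."""
    m = [row[:] for row in off_diagonal]
    for i, value in enumerate(diagonal_values):
        m[i][i] = value
    return m


# Connectivity template for a three-atom chain (two terminal single bonds).
connectivity = [[0.0, 0.11, 0.001],
                [0.11, 0.0, 0.11],
                [0.001, 0.11, 0.0]]

# Hypothetical placeholder values, one list per matrix class.
charge_like = [-0.04, 0.04, -0.39]
polarizability_like = [1.1, 1.0, 0.6]
hbond_like = [0.0, 0.0, 1.0]

descriptors = []
for diagonal in (charge_like, polarizability_like, hbond_like):
    low, high = extreme_eigenvalues(bcut_like_matrix(diagonal, connectivity))
    descriptors.extend([low, high])
print(descriptors)  # six values: (low, high) for each of the three classes
```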

Since different charge-related values (e.g., Gasteiger-Marsili charges, AM1 charges, AM1 densities, etc.) are all intended to convey basically the same fundamental information, and since different measures of polarizability and H-bond-ability are all intended to convey the same fundamental information, we propose that only one matrix be chosen from each of the three classes. Since both the highest and lowest eigenvalues should be relevant and since they are relatively uncorrelated, this would result in a 6-dimensional BCUT space. The choice of exactly which six BCUT values are best for a given population can be made using a chi-squared approach based on the principle that the best choice will be that which results in the most uniform distribution of compounds in chemistry space. (This chi-squared approach works equally well with other descriptors of potential utility for defining low-dimensional chemistry spaces.)
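
The details of the chi-squared procedure are not given here, so the following sketch shows only one plausible reading of the stated principle: bin the compounds into cells for each candidate set of descriptors, compute Pearson's chi-squared statistic against a perfectly uniform occupancy, and keep the candidate set with the smallest value. The equal-width bins, the exhaustive search over column subsets, the descriptor names, and the data are all assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations


def chi_squared_uniformity(points, bins_per_axis):
    """Pearson chi-squared of cell occupancies against a uniform expectation."""
    n_axes = len(points[0])
    lows = [min(p[k] for p in points) for k in range(n_axes)]
    highs = [max(p[k] for p in points) for k in range(n_axes)]
    counts = Counter()
    for p in points:
        cell = []
        for value, low, high in zip(p, lows, highs):
            width = (high - low) or 1.0  # guard against a constant descriptor
            index = int((value - low) / width * bins_per_axis)
            cell.append(min(index, bins_per_axis - 1))
        counts[tuple(cell)] += 1
    n_cells = bins_per_axis ** n_axes
    expected = len(points) / n_cells
    occupied = sum((c - expected) ** 2 / expected for c in counts.values())
    # Each empty cell contributes (0 - E)^2 / E = E.
    empty = (n_cells - len(counts)) * expected
    return occupied + empty


def best_descriptor_subset(table, names, subset_size, bins_per_axis):
    """Return (score, names) for the most uniformly populated descriptor subset."""
    best = None
    for subset in combinations(range(len(names)), subset_size):
        points = [tuple(row[k] for k in subset) for row in table]
        score = chi_squared_uniformity(points, bins_per_axis)
        if best is None or score < best[0]:
            best = (score, tuple(names[k] for k in subset))
    return best


# Hypothetical data: 8 compounds, 3 candidate descriptors (columns).
names = ["desc_a", "desc_b", "desc_c"]
table = [
    (0.1, 0.2, 0.10), (0.2, 0.1, 0.12), (0.3, 0.8, 0.11), (0.4, 0.9, 0.13),
    (0.6, 0.3, 0.10), (0.7, 0.2, 0.12), (0.8, 0.7, 0.11), (0.9, 0.9, 0.95),
]
best = best_descriptor_subset(table, names, subset_size=2, bins_per_axis=2)
print(best)
```

In this toy data desc_c is clumped (seven of the eight compounds share one bin), so any pair containing it leaves a cell empty and scores worse than the pair (desc_a, desc_b), which fills all four cells equally.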

Note that BCUT values can be computed at three different levels (requiring increasing amounts of CPU time). The simple 2D connection table is all that is needed to compute crude Gasteiger-Marsili charges, tabulated polarizabilities and H-bond-abilities, and off-diagonal elements related only to topological connectivity. Using CONCORD to generate 3D structures enables the use of interatomic distance or overlap on the off-diagonals. Using HSCF, our FORTRAN- and C-callable molecular orbital package, enables use of more accurate semi-empirical MO charges or densities and our novel AM1-derived atomic polarizabilities on the diagonals and calculated bond-orders on the off-diagonals. Using our PipeComm software for distributed computing over a network of processors enables MO calculations to be performed on large databases in fairly short times (e.g., 200,000 AM1 calculations on drug-sized compounds in about 3 days on 29 SGI R4000 and R4400 processors). However, it should be emphasized that BCUT values based solely on 2D connection tables have proven quite satisfactory for diversity purposes.

Note that the chi-squared approach yields the set of BCUT values best for a particular population of compounds. Using the combination of MDL's ACD and MDDR databases as our non-optimal but best-available example of a "truly diverse" population, we have identified a set of BCUT values which might be useful in scenarios addressing "universal chemistry space." In contrast, two populations resulting from generating two different combinatorial virtual libraries should be expected to occupy two different regions of that universal chemistry space. Thus, diverse subset selection from each of those two populations might be aided by using a chemistry space specifically tailored to spread each population as uniformly as possible.

By representing a corporate database in the chemistry space best for the hypothetical "truly diverse" population, we can easily identify regions of chemistry space which are under-represented in the corporate database. By representing a commercially available chemical library in the same chemistry space, we can easily identify which compounds would fill particular diversity voids. If two companies both represented their corporate databases in the same chemistry space, they could exchange "empty-cell" lists enabling each company to identify which of its compounds might be "traded" to fill voids at the other company without either company revealing the complete contents of its corporate database to the other.
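
The exchange of "empty-cell" lists can be pictured with the short sketch below. It assumes both parties have already assigned their compounds to cells of the same chemistry space; the cell tuples and compound identifiers are hypothetical.

```python
def void_filling_candidates(empty_cells, library_cells):
    """Group library compounds by the empty cell they would fill.

    empty_cells: cells under-represented in the other party's collection.
    library_cells: dict mapping compound identifier -> cell tuple.
    """
    wanted = set(empty_cells)
    matches = {}
    for compound_id, cell in library_cells.items():
        if cell in wanted:
            matches.setdefault(cell, []).append(compound_id)
    return matches


# Company A shares only the indices of its empty cells, not its structures.
company_a_empty_cells = [(0, 1), (2, 2), (1, 0)]

# Company B looks up its own compounds (hypothetical identifiers) by cell.
company_b_cells = {
    "B-001": (0, 1),
    "B-002": (1, 1),
    "B-003": (2, 2),
    "B-004": (0, 1),
}

tradeable = void_filling_candidates(company_a_empty_cells, company_b_cells)
print(tradeable)
```

Only cell indices cross the company boundary; each company matches them against its own compounds privately.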

All of the cell-based as well as distance-based algorithms for selecting diverse subsets and managing diversity voids have been implemented in a comprehensive package called DiverseSelector which is designed to work either independently or in conjunction with our CombinDBMaker package for combinatorial database generation. Although developed on SGI hardware, DiverseSelector is easily portable to any platform which would support its X-window-based graphical user interface (GUI). The GUI was specifically designed to facilitate use by either computational or bench chemists.



NetSci, ISSN 1092-7360, is published by Network Science Corporation. Except where expressly stated, content at this site is copyright (© 1995 - 2010) by Network Science Corporation and is for your personal use only. No redistribution is allowed without written permission from Network Science Corporation.