Opportunities for Computational Chemists Afforded by the New Strategies in Drug Discovery: An Opinion
Yvonne Connolly Martin
Computer Assisted Molecular Design Project
100 Abbott Park Road
Abbott Park, IL 60064
The past three years have witnessed several challenges to the traditional ways in which we search for new drugs. These changes will also alter the way that computational chemistry contributes to the process. No longer are compounds designed and synthesized one at a time: instead, libraries of hundreds to hundreds of thousands of molecules are made.[1-6] No longer is a structure-activity set made up of precise biological measurements on 50 to 100 compounds: instead it may consist of "yes" or "possibly" answers on thousands of related or unrelated compounds.[6,7] No longer do we wait years for 3D structures of target proteins: instead, new experimental structures are being produced at an accelerated rate and homology modeling is providing increasingly accurate structures. The challenge to computational chemists is not only to keep up with the flood of new data, but to help convert the data into useful information by helping to plan the experiments performed and by analyzing the resulting data. We face the challenge of examining more data, and doing so faster. This opinion piece will examine several of the new technologies to highlight the challenges and opportunities each offers to computational chemists. We must use all of the skills and knowledge that we developed in molecular modeling, pharmacophore mapping, 3D database searching, and 3D-QSAR (Quantitative Structure-Activity Relationships),[10,11] but add to them strategies to handle more data in less time.
Changes in Strategies for Drug Discovery
Large Numbers of Compounds are being Tested by High Through-put Screening Strategies.
The race to find new leads has led many pharmaceutical companies to automate biological screening to the point that 100,000 miscellaneous compounds can be tested in a particular biological assay in a month, and promises are being made that this can be reduced to one week. This enormous rate of testing is accomplished by using robotics, bar-coding, and other automation strategies at every opportunity. Frequently the through-put is increased further by testing mixtures rather than single compounds. Although the strategy involves testing every available compound (and thus eliminates one traditional objective of 3D database searching), follow-up includes purchasing "similar" compounds from outside vendors or designing individual compounds or combinatorial libraries that capitalize on the information found in the initial screening. Additionally, there is the desire that the compounds tested be diverse and that they somehow explore the available chemical space.
Which methods will automatically handle all chemical structures and generate molecular descriptors for QSAR quickly?
Since the structures tested in such high through-put screening belong to hundreds to thousands of different series, traditional physicochemical properties such as the octanol-water partition coefficient, and even molecular connectivity indices, might not be sufficient to characterize what distinguishes active from inactive molecules. Clearly, if we are to analyze 100,000 to 5,000,000 compounds, we cannot depend on methods that take even a minute per compound: at that rate we could handle only 1,440 compounds per day, and compounds are being added to the collection from parallel combinatorial synthesis faster than that. Our preliminary analysis suggests that substructural descriptors perform quite well for this function. However, the set tested is by no means ideal, and we would expect appropriate 3D descriptors to perform better. More large datasets will be required to validate proposed descriptors.
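As an illustration of how cheaply substructural descriptors can be computed, the fragment-key idea can be sketched in a few lines of Python. The key set and the substring matching below are purely hypothetical stand-ins; real systems match subgraphs of the connection table, not SMILES substrings.

```python
# Hypothetical fragment keys; a production key set would contain hundreds
# of substructures defined as query subgraphs, not text patterns.
KEYS = ["c1ccccc1", "C(=O)O", "N", "O", "Cl"]

def fragment_fingerprint(smiles):
    """Return a binary vector: 1 if the key pattern occurs in the SMILES
    string. Substring matching is a crude stand-in for true substructure
    search, but it illustrates the sub-second cost per compound."""
    return [1 if key in smiles else 0 for key in KEYS]

print(fragment_fingerprint("CC(=O)Oc1ccccc1C(=O)O"))  # an aspirin-like SMILES
```

Because each compound reduces to a short bit vector, millions of structures can be characterized in the time a single conformational analysis would take.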
Are there methods that handle such high dimensions and are fast enough to be of use?
To calculate a relationship between molecular properties and biological activity, we need to be able to discover the relevant variables quickly and accurately. Methods such as partial least squares and principal components regression, which use all of the input molecular descriptors, suffer from the fact that descriptors irrelevant to biological activity add noise that dilutes the signal from the relevant ones. Although traditionally structure-activity relationships were analyzed as if all compounds fit one relationship, with high through-put screening data it is possible that the compounds should first be segregated into sub-sets and relationships developed within each sub-set. One challenge is to do this automatically.
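One crude way to discard irrelevant descriptors before regression is a univariate correlation filter, sketched below. The Pearson criterion and the threshold are illustrative assumptions; a real workflow would use cross-validated variable selection to avoid chance correlations.

```python
def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy) if sx and sy else 0.0

def select_descriptors(descriptor_columns, activity, threshold=0.5):
    """Keep indices of descriptor columns whose |r| with activity
    exceeds the threshold; the rest are treated as noise."""
    return [i for i, col in enumerate(descriptor_columns)
            if abs(pearson(col, activity)) > threshold]
```

A filter of this kind costs one pass over the data per descriptor, so it scales to the descriptor counts that high through-put datasets generate.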
How do I handle the possibility that some of the structures of the compounds are incorrect?
Usually high through-put screening involves testing compounds that have been on the shelf for years. Their structures may never have been proven with modern methods, or they could have decomposed over time. If purchased compounds are added to the collection, these, too, do not always match the structures listed in the database. Hence, any automated structure-activity analysis must be robust enough to tolerate a few errors in structure. The corresponding opportunity is that outliers from the SAR flag structures that should be verified.
Are all the hits found?
High through-put screening frequently gains its efficiency by testing mixtures of several compounds. Clearly, if the compounds in a mixture react with each other, then they are not present to produce the biological effect, although the reaction product might do so. In other cases one compound in a mixture may interfere with the biological assay, with the result that a legitimate effect of another compound in the mixture is missed. As a result, any automated structure-activity method must also be robust enough to tolerate a few errors in biological activity. The corresponding opportunity is that outliers from the SAR flag biological measurements that should be verified.
How do I get the data from various computers into mine?
Serious attempts to analyze the structure-activity relationships in high through-put screening data run into a roadblock if the data are not kept in a form accessible to other computer programs. The need for structure-activity analysis can be an important spur to organizing the data, developing plans for archiving, and so on. Because whole structure-activity datasets will not be published, many investigators will be unable to work on improving the methods.
Companies are unlikely to publish the contents of their compound collections. This means that advances will increasingly depend on the ingenuity of industrial investigators and of consultants who have signed a confidential disclosure agreement. Mechanisms need to be designed for publishing successful strategies in a form that can be peer-reviewed.
Increasing the diversity of the sample collection.
If high through-put screening is to identify useful hits, then the database of compounds should be structurally diverse.[13-15] Usually the strategy includes augmenting the compound collection with compounds purchased from outside vendors. Clearly, one would like these new compounds to be different from those already in hand; computational chemists typically take on this responsibility. The question then becomes how one quantitates diversity or similarity in a way that is relevant to choosing compounds for biological testing. Should it be based on 2D structures, 3D pharmacophores, or both? This is likely to be an area of active research in the coming years. Beyond this, how does one actually select the compounds that add the maximum diversity to a collection? Some of the existing methods are clearly too slow for extensive application. Invention and validation of new algorithms for compound selection is also likely to become an area of active research.
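One simple selection algorithm is the greedy "max-min" procedure, sketched below over binary fingerprints with the Tanimoto distance. Both the distance measure and the greedy scheme are only one of several reasonable choices, and this naive version is quadratic in the number of compounds, itself illustrating why faster algorithms are needed for large collections.

```python
def tanimoto_distance(a, b):
    """1 - Tanimoto similarity of two binary fingerprints."""
    inter = sum(x & y for x, y in zip(a, b))
    union = sum(x | y for x, y in zip(a, b))
    return 1.0 - (inter / union if union else 0.0)

def maxmin_pick(fps, n_pick):
    """Greedy MaxMin: start from the first compound, then repeatedly add
    the compound farthest from its nearest already-picked neighbour."""
    picked = [0]
    while len(picked) < n_pick:
        best, best_d = None, -1.0
        for i in range(len(fps)):
            if i in picked:
                continue
            d = min(tanimoto_distance(fps[i], fps[j]) for j in picked)
            if d > best_d:
                best, best_d = i, d
        picked.append(best)
    return picked
```

Given two near-duplicate fingerprints and one distinct one, the procedure picks one representative of each cluster rather than the two duplicates.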
Using the SAR data to design novel, potent, and selective inhibitors.
The availability of so much structure-activity data will make it easier to validate any proposed method of analysis and will spur the development of fast and efficient methods. Additionally, the measurement of many properties on the same molecules offers the opportunity to develop specific computational strategies that address selectivity as well as potency.
Many Series of Compounds to Follow up a Lead are Now Made by Parallel Synthesis.
Leads identified in high through-put screening or traditional project work are now scrutinized for opportunities for parallel synthesis of many analogs. This typically involves the synthesis of hundreds to a few thousand compounds in batches of 30 to 100.[2,4] Usually only a few synthetic steps are applied, using commercially available precursors. The synthesis often employs automation strategies including robotics and automated synthesizers.
Is the structure of the compounds correct or is an impurity causing the observed bioactivity?
The biggest challenge to the interpretation of data from parallel synthesis is the lack of purity of the compounds tested. Although some purity checks are made, frequently one accepts samples in which 10-20% of the mass is not the intended product. One must then be careful that the activity resides in the predominant product and not in an impurity. This is especially important if one of the starting materials has biological activity in the particular assay.
Series planning is needed.
Since the objective of parallel synthesis is to explore a wide range of structural types and often there are more precursors than are possible to use, series design strategies are especially appropriate. With them one can design libraries that sample a wide variety of chemical substructural types, a wide range of physicochemical properties, or a combination of both. Such series planning increases the chance that the resulting structure-activity data will yield a consistent pattern to form the basis for subsequent libraries or single-compound synthesis.
Can be based on a QSAR.
The selection of precursors for parallel synthesis can also include consideration of forecast bioactivity, if that is available. The number of compounds also makes it possible to generate sets that will distinguish between alternative quantitative structure-activity relationships. Because of the number of compounds involved, the predictions need not be especially precise to be useful.
SAR Data are Now Being Generated from Combinatorial Mixtures.
Combinatorial libraries may contain as many as 10⁵ or 10⁶ compounds synthesized by automated methods. Typically the compounds are bound to a solid support such as a bead, with each bead containing only one compound. Such mixtures are assayed by some sort of affinity-based assay in which only the most potent compounds present are identified.[19-25]
Are all of the compounds actually present in the mixture?
Although the chemical explorations prior to synthesis of the library examine the scope and conditions of the reaction, there is no guarantee that all of the compounds intended to be in the library are in fact present, and only the most active are identified. Hence any automated method for analyzing the structure-activity data must allow for the possibility that some of the compounds predicted to be active might not be present in the mixture.
How do I handle the lack of SAR data?
As was indicated above with high through-put screening, the lack of information on weakly active compounds confounds the search for quantitative relationships between structural and biological properties. New computational strategies will probably be needed.
Library planning is needed.
Just as in parallel synthesis, it is advisable to plan combinatorial libraries so that the maximum amount of information is produced. Sometimes so many precursors are available that a selection must be made; at other times not enough precursors are available, or they are too limited in properties, so the computational chemist may suggest which compounds that can easily be converted into precursors will add the most variety to the library.
A final point of library design involves the mixing strategy. Usually the compounds are prepared in such a way that the final library is not in one flask, but in several sub-pools. It may be important to optimize the mixtures present in any sub-pool to ease identifying the structures of the active compounds.
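A minimal illustration of such a mixing strategy is orthogonal row/column pooling: each compound appears in exactly one "row" pool and one "column" pool, so a single active compound is identified by the intersection of the two active pools. The sketch below assumes one hit and ignores the compound-interaction problems discussed earlier.

```python
def make_pools(compounds, n_cols):
    """Arrange compounds on a conceptual grid: each compound joins the
    pool for its row and the pool for its column."""
    rows, cols = {}, {}
    for idx, cpd in enumerate(compounds):
        rows.setdefault(idx // n_cols, []).append(cpd)
        cols.setdefault(idx % n_cols, []).append(cpd)
    return rows, cols

def deconvolute(active_row_pool, active_col_pool):
    """A single active compound is the one shared by the two active pools."""
    return set(active_row_pool) & set(active_col_pool)
```

With N compounds this scheme needs only about 2*sqrt(N) assays of pools to localize a hit, instead of N assays of single compounds.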
Using Protein Structures to Design Ligands Needs to be Fast.
Advances in molecular biology, macromolecular crystallography, and NMR have produced an explosion in the number of 3D structures of macromolecular targets for drug design. The structures of bound ligands can be determined quickly as well. Often the structures of two related proteins, for which one desires selective inhibitors, are also available. These macromolecular structures increasingly form the basis for the design of parallel-synthesis or combinatorial libraries.
We still lack a rapid and accurate scoring function.
Although much progress has been made, we still often cannot forecast binding affinity accurately enough to distinguish attractive from unattractive candidates for synthesis.[27,28] The protonation state of the protein, the presence or absence of bound waters or ions, and the conformational flexibility of the protein and the ligand all complicate selecting the optimal compounds from thousands of candidates. Additionally, accounting for the desolvation energy of ligand and macromolecule remains a challenge.
Combinatorial chemistry/parallel synthesis will provide more data to test a scoring function.
Feeding more data into the development of a scoring function should either improve its performance directly or reveal overlooked factors that will improve it.
Combinatorial chemistry/parallel synthesis allows one to be less accurate in one's predictions.
If the selection of compounds for combinatorial chemistry or parallel synthesis is based on predictions from 3D structures of the macromolecular target, then substantial error may be accommodated. All that is important is that some of the suggested molecules have high affinity. This may allow one to emphasize different aspects of the scoring function in different compounds of the mixture.
Genetic Algorithm Design of Compounds Using Bioassay as the Fitness Function.
Two recent publications demonstrate the efficacy of using molecular evolution to optimize the potency of molecules in a series.[29,30] The molecules in the first generation are selected at random from all combinations possible, and then tested in the biological assay of interest. The potency is used as the fitness function to guide genetic operations of cross-over and mutation that then design the next generation of compounds. This process continues until potency is optimized.
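The scheme can be sketched as follows, with a mock fitness function standing in for the biological assay; the substituent coding, population size, and genetic-operator rates are all illustrative assumptions.

```python
import random

SUBSTITUENTS = list(range(8))   # hypothetical substituent codes for each position

def mock_assay(molecule):
    """Stand-in fitness: in practice the compounds would be synthesized
    and their measured potency returned. This toy score peaks at (5, 5, 5)."""
    return -sum((s - 5) ** 2 for s in molecule)

def evolve(pop_size=20, n_positions=3, generations=30, seed=0):
    rng = random.Random(seed)
    # First generation: random combinations of substituents.
    pop = [[rng.choice(SUBSTITUENTS) for _ in range(n_positions)]
           for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=mock_assay, reverse=True)   # rank by assay potency
        parents = pop[:pop_size // 2]            # fitter half survives
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, n_positions)  # one-point crossover
            child = a[:cut] + b[cut:]
            if rng.random() < 0.2:               # occasional mutation
                child[rng.randrange(n_positions)] = rng.choice(SUBSTITUENTS)
            children.append(child)
        pop = parents + children
    return max(pop, key=mock_assay)
```

In the real strategy each generation corresponds to a round of synthesis and testing, so the cost of the optimization is measured in assays, not CPU time.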
This strategy precludes computational chemistry input based on biological activity.
Will this strategy make QSAR obsolete? If not, will the compounds so designed and their biological activities be useful for QSAR analysis? It is possible that such a set of molecules will have too much correlation between physical properties to allow an unambiguous QSAR to be determined.
The substituents should probably be selected to maximize molecular diversity.
The same strategies as are used to plan combinatorial libraries will be appropriate for this purpose as well. However, this field is so new that investigation of this point would be worthwhile.
Who else understands GAs?
The computational chemists will probably also be responsible for selecting the parameters of the specific genetic algorithm used in the optimization. Although general guidelines have been suggested, only more experience will tell if they apply in general or need to be modified in certain cases.
Just as the practices of medicinal chemistry and biological assays have added "quick and dirty" strategies to the more traditional ones, so can computational chemists make valuable contributions to drug discovery if they devise and perfect methods that handle large amounts of data quickly. Computational chemists will now have the opportunity to design sets of compounds that efficiently explore property space, to derive relationships between chemical and biological properties on sets of thousands of diverse molecules, and to explore such issues as chemical diversity. Additionally, the volumes of new data will provide information that should improve the methods for predicting the affinity of ligands for proteins of known 3D structure. The large amounts of data generated in many of these activities will increase the need to transfer data smoothly from one computer system to another. Lastly, the emerging genetic algorithm strategy for designing compounds to follow up a lead promises to move the practice of drug discovery in as yet unknown directions. These are exciting times for all, and the challenges present great opportunities for computational chemists.
(1). Doyle, P. M. Journal Of Chemical Technology And Biotechnology, 1995, 64, 317-324.
(2). Dewitt, S. H.; Czarnik, A. W. Current Opinion In Biotechnology, 1995, 6, 640-645.
(3). Chabala, J. C. Current Opinion In Biotechnology, 1995, 6, 632-639.
(4). Bunin, B. A.; Ellman, J. A. J. Am. Chem. Soc., 1992, 114, 10997-10998.
(5). Hermkens, P.; Ottenheijm, H.; Rees, D. Tetrahedron, 1996, 52, 4527-4554.
(6). Gordon, E. M.; Barrett, R. W.; Dower, W. J.; Fodor, S. P. A.; Gallop, M. A. J. Med. Chem., 1994, 37, 1387-1401.
(7). Rigler, R. Journal of Biotechnology, 1995, 41, 177-186.
(8). Martin, Y. C. In Design of Bioactive Molecules Using 3D Structural Information; P. Willett and Y. C. Martin, Eds.; American Chemical Society: Washington, DC, 1996.
(9). Martin, Y. C. J. Med. Chem., 1992, 35, 2145-2154.
(10). Martin, Y. C.; Kim, K.-H.; Lin, C. T. In Advances in Quantitative Structure Property Relationships, Vol. 1; M. Charton, Ed.; JAI Press: Greenwich, CT, 1996; pp 1-52.
(11). Greco, G.; Novellino, E.; Martin, Y. C. In Design of Bioactive Molecules Using 3D Structural Information; P. Willett and Y. C. Martin, Eds.; American Chemical Society: Washington, DC, 1996.
(12). Clark, M.; Cramer III, R. D. Quant. Struct.-Act. Relat., 1993, 12, 137-145.
(13). Johnson, M.; Lajiness, M.; Maggiora, G. M. In QSAR: Quantitative Structure-Activity Relationships in Drug Design; J. L. Fauchere, Ed.; Alan R. Liss: New York, 1989; pp 167-171.
(14). Lajiness, M. S.; Johnson, M. A.; Maggiora, G. M. In QSAR: Quantitative Structure-Activity Relationships in Drug Design; J. L. Fauchere, Ed.; Alan R. Liss: New York, 1989; pp 173-176.
(15). Downs, G. M.; Willett, P. In Chemometric Methods in Molecular Design; H. van de Waterbeemd, Ed.; VCH: Weinheim, 1994; pp 111-130.
(16). Brown, R. D.; Martin, Y. C. J. Chem. Inf. Computer Sci., 1996.
(17). Martin, E. J.; Blaney, J. M.; Siani, M. A.; Spellmeyer, D. C.; Wong, A. K.; Moos, W. H. J. Med. Chem., 1995, 38, 1431-1436.
(18). Sheridan, R. P.; Kearsley, S. K. J. Chem. Inf. Computer Sci., 1995, 35, 310-320.
(19). Frank, R. Journal of Biotechnology, 1995, 41, 259-272.
(20). Terrett, N. K.; Bojanic, D.; Brown, D.; Bungay, P. J.; Gardner, M.; Gordon, D. W.; Mayers, C. J.; Steele, J. Bioorganic & Medicinal Chemistry Letters, 1995, 5, 917-922.
(21). Baldwin, J. J.; Burbaum, J. J.; Henderson, I.; Ohlmeyer, M. H. J. J. Am. Chem. Soc., 1995, 117, 5588-5589.
(22). Murphy, M. M.; Schullek, J. R.; Gordon, E. M.; Gallop, M. A. J. Am. Chem. Soc., 1995, 117, 7029-7030.
(23). Gallop, M. A.; Barrett, R. W.; Dower, W. J.; Fodor, S. P. A.; Gordon, E. M. J. Med. Chem., 1994, 37, 1233-1251.
(24). Armstrong, R. W.; Combs, A. P.; Tempest, P. A.; Brown, S. D.; Keating, T. A. Acc. Chem. Res., 1996, 29, 123-131.
(25). Gordon, E. M.; Gallop, M. A.; Patel, D. V. Acc. Chem. Res., 1996, 29, 144-154.
(26). Combs, A. P.; Kapoor, T. M.; Feng, S. B.; Chen, J. K.; Daudesnow, L. F.; Schreiber, S. L. J. Am. Chem. Soc., 1996, 118, 287-288.
(27). Blaney, J. M.; Dixon, J. S. Perspectives in Drug Discovery and Design, 1993, 1, 301-319.
(28). Ajay; Murcko, M. A. J. Med. Chem. 1995, 38, 4953-4967.
(29). Singh, J.; Ator, M. A.; Jaeger, E. P.; Allen, M. P.; Whipple, D. A.; Soloweij, J. E.; Chowdhary, S.; Treasurywala, A. M. J. Am. Chem. Soc., 1996, 118, 1669-1676.
(30). Weber, L.; Wallbaum, S.; Boger, C.; Gubernator, K., Angew. Chem. Int. Ed. Engl., 1995, 34, 2280-2282.
NetSci, ISSN 1092-7360, is published by Network Science Corporation.