Formula for Determining the "Goodness of Hit Lists" in 3D Database Searches
Osman F. Güner
Accelrys
9685 Scranton Road
San Diego, CA 92121
Tel: (619) 458-9990
Douglas R. Henry
MDL Information Systems, Inc.
14600 Catalina Street
San Leandro, CA 94577
Tel: (510) 895-1313
http://www.netsci.org/Science/Cheminform/feature09.html
Presented at the 1998 Charleston Conference, March 9, 1998
Submitted for Publication June, 1998
Abstract
A measure of the "goodness" of hit lists obtained from chemical database searching is proposed (the G-H score). This measure takes into account both the yield (the fraction of the hit list that is active) and the percentage of the database's actives that are retrieved. By using variable coefficients on these terms and adjusting for the size of the hit list, a flexible but quantitative measure of hit list quality is obtained. We apply this measure to several published search results and show how the G-H score can also be used to measure the quality of clustering results.
Schematic Representation of a Database and a Hit List

Definition of Terms
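The slide that defined the terms was an image and is not reproduced in this text. The definitions below are a reconstruction, inferred from how the symbols are used in the test cases and tables that follow (for example, false negatives = A and false positives = D - A in the worst case). The symbol E for enrichment is introduced here only for convenience.

```latex
\begin{align*}
D &: \text{number of compounds in the database} \\
A &: \text{number of active compounds in the database} \\
H_t &: \text{number of compounds in the hit list} \\
H_a &: \text{number of active compounds in the hit list} \\
\%Y &= \frac{H_a}{H_t} \times 100 \\
\%A &= \frac{H_a}{A} \times 100 \\
E &= \frac{H_a / H_t}{A / D} \\
\text{false negatives} &= A - H_a \\
\text{false positives} &= H_t - H_a
\end{align*}
```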

G-H Score: Arithmetic Average of Yield and Percent of Actives
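The heading describes the score as the arithmetic average of the yield and the percent of actives, which is taken here as the average of the two quantities expressed as fractions between 0 and 1. A minimal Python sketch follows. The function name `gh_score` and the weight parameters `w_yield` and `w_actives` are illustrative assumptions: the abstract mentions variable coefficients, but their exact form is not reproduced in this text, so the weights simply default to the equal weighting of an arithmetic average. The adjustment for hit list size mentioned in the abstract is not modeled.

```python
def gh_score(ha, ht, a, w_yield=0.5, w_actives=0.5):
    """Weighted average of yield (Ha/Ht) and fraction of actives retrieved (Ha/A).

    ha: actives in the hit list; ht: hit list size; a: actives in the database.
    With the default weights this is the plain arithmetic average.
    The weights are a hypothetical generalization, not taken from the source.
    """
    if ht <= 0 or a <= 0:
        raise ValueError("ht and a must be positive")
    if not 0 <= ha <= min(ht, a):
        raise ValueError("ha must lie between 0 and min(ht, a)")
    yield_fraction = ha / ht      # %Y / 100
    actives_fraction = ha / a     # %A / 100
    return w_yield * yield_fraction + w_actives * actives_fraction


# "Typical good" test case from this document: Ht = 200, Ha = 80, A = 100
print(round(gh_score(80, 200, 100), 3))  # 0.6
```

Keeping the weights as parameters makes it easy to favor selectivity or coverage, but any choice other than the default is outside what the text specifies.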

Hit List Definitions: The "Best" and the "Worst"

Search Results Test Case
- The best case: Ht = Ha = A. The search retrieves all of the actives and nothing else; false negatives = 0, false positives = 0.
- The worst case: Ha = 0, Ht = D - A. The search retrieves everything in the database except the actives; false negatives = A, false positives = D - A.
- Extreme case 1 (%Y = 100 with a very small hit list): D = 50,000, A = 100, Ht = Ha = 1. All the hits in the hit list are active, but the search retrieves only a single hit.
- Extreme case 2 (%A = 100 with a very large hit list): D = Ht = 50,000, A = Ha = 100. All of the actives in the database are retrieved, together with the rest of the database.
- Typical good: D = 50,000, A = 100, Ht = 200, Ha = 80. A typical hit list with high %Y and %A.
- Typical bad: D = 50,000, A = 100, Ht = 1,000, Ha = 50. A typical hit list with low %Y and medium %A.
Six Database Search Scenarios Applied to G-H
| Case | %Y | %A | Enrichment | False negatives | False positives | G-H |
|---|---|---|---|---|---|---|
| Best | 100 | 100 | 500 | 0 | 0 | 1 |
| Typ. Good | 40 | 80 | 200 | 20 | 120 | 0.6 |
| Extreme 1 | 100 | 1 | 500 | 99 | 0 | 0.5 |
| Extreme 2 | 0.2 | 100 | 1 | 0 | 49900 | 0.5 |
| Typ. Bad | 5 | 50 | 25 | 50 | 950 | 0.26 |
| Worst | 0 | 0 | 0 | 100 | 49900 | 0 |
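The table can be recomputed from the four counts given for each scenario. The sketch below is an illustrative reconstruction, not the authors' code: the helper name `hit_list_metrics` and the dictionary layout are assumptions, enrichment is taken as the yield divided by the database hit rate A/D, and G-H is taken as the unweighted arithmetic average of the two fractions. Under those assumptions the typical bad case evaluates to 0.275 rather than the tabulated 0.26, which may reflect a weighting or size adjustment that is not reproduced in this text.

```python
def hit_list_metrics(d, a, ht, ha):
    """Return hit-list statistics for database size d, actives a,
    hit list size ht, and actives in the hit list ha."""
    yield_fraction = ha / ht
    actives_fraction = ha / a
    return {
        "%Y": 100 * yield_fraction,
        "%A": 100 * actives_fraction,
        "enrichment": yield_fraction / (a / d),
        "false_neg": a - ha,
        "false_pos": ht - ha,
        "G-H": (yield_fraction + actives_fraction) / 2,  # assumed equal weights
    }


# (D, A, Ht, Ha) for the six scenarios; D = 50,000 and A = 100 throughout,
# as implied by the enrichment and false-hit columns of the table
scenarios = {
    "Best": (50_000, 100, 100, 100),
    "Typ. Good": (50_000, 100, 200, 80),
    "Extreme 1": (50_000, 100, 1, 1),
    "Extreme 2": (50_000, 100, 50_000, 100),
    "Typ. Bad": (50_000, 100, 1_000, 50),
    "Worst": (50_000, 100, 49_900, 0),
}

for name, counts in scenarios.items():
    m = hit_list_metrics(*counts)
    print(f"{name:10s} %Y={m['%Y']:6.1f} %A={m['%A']:6.1f} "
          f"enr={m['enrichment']:6.1f} FN={m['false_neg']:4d} "
          f"FP={m['false_pos']:6d} G-H={m['G-H']:.3f}")
```

The two extreme cases show why neither term is sufficient alone: each scores 100 on one measure, yet both land near 0.5 once the other measure is averaged in.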
G-H Applied to a Published Analysis
| Query | Ha | Ht | %Y | %A | G-H |
|---|---|---|---|---|---|
| Q-4 | 64 | 91 | 70.3 | 82.1 | 0.76 |
| Q-5 | 72 | 645 | 11.2 | 92.3 | 0.52 |
| Q-6 | 58 | 560 | 10.4 | 74.4 | 0.42 |
| Q-7 | 24 | 165 | 14.5 | 30.8 | 0.23 |
The G-H scores corroborate the intuitive conclusion published with this work: the flexible query, Q-4, was proposed to be substantially more selective than the others without compromising the percent of actives too much.
Ref: Güner, O. F.; Henry, D. R.; Pearlman, R. S. J. Chem. Inf. Comput. Sci., 1992, 32, 101.
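As a quick check of the table, the G-H column can be recomputed from Ha and Ht. This sketch assumes the unweighted arithmetic average and assumes A = 78 actives in the database; that value is not stated in the text and is inferred here from the Ha and %A columns (for example, 64 / 0.821 is approximately 78). The variable names are illustrative.

```python
A = 78  # assumed number of actives, inferred from Ha and %A; not stated in the text

# query -> (Ha, Ht) as listed in the table
queries = {"Q-4": (64, 91), "Q-5": (72, 645), "Q-6": (58, 560), "Q-7": (24, 165)}

scores = {}
for name, (ha, ht) in queries.items():
    pct_yield = 100 * ha / ht
    pct_actives = 100 * ha / A
    scores[name] = (pct_yield + pct_actives) / 200  # average of the two fractions
    print(f"{name}: %Y={pct_yield:.1f} %A={pct_actives:.1f} G-H={scores[name]:.2f}")

# Rank the queries from best to worst by G-H
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)
```

Sorting by the score gives a single ordering of the queries, which is the kind of quantitative ranking the G-H measure is meant to provide.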
Clustering Classes of Active Compounds
| Activity No. | DDR Activity Index | Patented Activity |
|---|---|---|
| 1 | 02454 | TNF Inhibitor |
| 2 | 06245 | 5HT Uptake Inhibitor |
| 3 | 09221 | Acetylcholine Esterase Inhibitor |
| 4 | 09248 | Prolylendopeptidase Inhibitor |
| 5 | 12453 | Lipid Peroxidation Inhibitor |
| 6 | 12454 | Excitatory Amino Acid Inhibitor |
| 7 | 52502 | Squalene Synthetase Inhibitor |
| 8 | 54112 | H+/K+ATPase Inhibitor |
G-H Applied to Clusters of Compounds
Cluster Analysis* Results - Hit Counts
| Cls | A.1 | A.2 | A.3 | A.4 | A.5 | A.6 | A.7 | A.8 | Tot |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 2 | 45 | 263 | 53 | 92 | 16 | 52 | 162 | 685 |
| 2 | 1 | 1 | 23 | 0 | 26 | 4 | 10 | 201 | 266 |
| 3 | 5 | 12 | 91 | 202 | 48 | 27 | 16 | 38 | 439 |
| 4 | 0 | 0 | 3 | 3 | 51 | 16 | 0 | 84 | 157 |
| 5 | 17 | 1 | 19 | 1 | 14 | 20 | 7 | 0 | 79 |
| 6 | 17 | 42 | 4 | 7 | 31 | 0 | 17 | 0 | 118 |
| 7 | 0 | 0 | 1 | 0 | 6 | 0 | 40 | 1 | 48 |
| 8 | 26 | 0 | 0 | 0 | 41 | 0 | 50 | 10 | 127 |
| Tot | 68 | 101 | 404 | 266 | 309 | 83 | 192 | 496 | 1919 |
* Analyzed by the monothetic cluster analysis method of Kaufman and Rousseeuw. Ref: Kaufman, L.; Rousseeuw, P. J. Finding Groups in Data: An Introduction to Cluster Analysis; Wiley: New York, 1990; pp. 280-311.
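One way to read this contingency table in G-H terms, consistent with the scores tabulated in the next section, is to treat each cluster as a hit list and each activity class as the set of actives. The notation below is introduced here for illustration only: n_ij is the count for cluster i and activity j, with row total n_i· (the cluster size) and column total n_·j (the size of the activity class). The second line works through cluster 1 and activity 3.

```latex
\begin{align*}
GH_{ij} &= \frac{1}{2}\left(\frac{n_{ij}}{n_{i\cdot}} + \frac{n_{ij}}{n_{\cdot j}}\right) \\
GH_{1,3} &= \frac{1}{2}\left(\frac{263}{685} + \frac{263}{404}\right)
          \approx \frac{0.384 + 0.651}{2} \approx 0.517
\end{align*}
```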
G-H Applied to Clusters of Compounds
Cluster Analysis Results - G-H Scores
| Cl | Act. 1 | Act. 2 | Act. 3 | Act. 4 | Act. 5 | Act. 6 | Act. 7 | Act. 8 |
|---|---|---|---|---|---|---|---|---|
| 1 | .016 | .256 | .517 | .138 | .216 | .108 | .173 | .282 |
| 2 | .009 | .007 | .072 | .000 | .091 | .032 | .045 | .580 |
| 3 | .042 | .073 | .216 | .610 | .132 | .193 | .060 | .082 |
| 4 | .000 | .000 | .013 | .015 | .245 | .147 | .000 | .352 |
| 5 | .232 | .011 | .144 | .008 | .111 | .247 | .063 | .000 |
| 6 | .197 | .386 | .011 | .043 | .182 | .000 | .116 | .000 |
| 7 | .000 | .000 | .012 | .000 | .072 | .000 | .521 | .011 |
| 8 | .294 | .000 | .000 | .000 | .228 | .000 | .327 | .049 |
Using the G-H scores, it was possible to associate each cluster with compounds of a specific activity. Such an association is not obvious from the raw hit counts.
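The association between clusters and activities can be sketched programmatically. The example below is an illustration rather than the authors' procedure: it recomputes each cell's score from the hit counts as the unweighted average of the cluster yield and the fraction of the activity class captured, then picks, for each activity, the cluster with the highest score. The names `counts`, `gh_matrix`, and `best_cluster` are assumptions.

```python
# Hit counts: rows are clusters 1-8, columns are activity classes 1-8
counts = [
    [2, 45, 263, 53, 92, 16, 52, 162],
    [1, 1, 23, 0, 26, 4, 10, 201],
    [5, 12, 91, 202, 48, 27, 16, 38],
    [0, 0, 3, 3, 51, 16, 0, 84],
    [17, 1, 19, 1, 14, 20, 7, 0],
    [17, 42, 4, 7, 31, 0, 17, 0],
    [0, 0, 1, 0, 6, 0, 40, 1],
    [26, 0, 0, 0, 41, 0, 50, 10],
]

row_totals = [sum(row) for row in counts]        # cluster sizes (Ht)
col_totals = [sum(col) for col in zip(*counts)]  # activity class sizes (A)

# G-H per cell: average of yield (n / row total) and actives retrieved (n / column total)
gh_matrix = [
    [(n / row_totals[i] + n / col_totals[j]) / 2 for j, n in enumerate(row)]
    for i, row in enumerate(counts)
]

# For each activity (1-based), find the cluster (1-based) with the highest G-H score
best_cluster = {}
for j in range(len(col_totals)):
    column = [gh_matrix[i][j] for i in range(len(counts))]
    best_cluster[j + 1] = column.index(max(column)) + 1

print(best_cluster)
```

With these counts, each of the eight activities maps to a different cluster, which is the kind of association that is hard to see in the raw hit counts.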
Conclusions
We present the G-H formula as a convenient way to quantify hit lists obtained from searches with various queries. It can be used not only to rank the results quantitatively by "goodness of hits," but also in automated procedures that optimize a 3D query (see Miller, Henry, Güner, ACS 213th National Meeting, April 13-17, 1997, paper COMP-39). In addition, the equation can be used to identify the most appropriate clustering technique for work similar to that published by Brown and Martin (J. Chem. Inf. Comput. Sci., 1996, 36, 572-584).
NetSci, ISSN 1092-7360, is published by Network Science Corporation.