Formula for Determining the "Goodness of Hit Lists" in 3D Database Searches

Osman F. Güner

Accelrys
9685 Scranton Road
San Diego, CA 92121
Tel: (619) 458-9990

Douglas R. Henry
MDL Information Systems, Inc.
14600 Catalina Street
San Leandro, CA 94577
Tel: (510) 895-1313

http://www.netsci.org/Science/Cheminform/feature09.html
Presented at the 1998 Charleston Conference, March 9, 1998
Submitted for Publication June, 1998

Abstract

A measure of the "goodness" of hit lists obtained from chemical database searching is proposed (the G-H score). This measure takes into account both the yield (the fraction of the hit list that is active) and the percentage of the actives in the database that are retrieved. By using variable coefficients on these terms and adjusting for the size of the hit list, a flexible but quantitative measure of hit list quality is obtained. We show the application of this measure to several published search results. We also show how the G-H score can be used to measure the quality of clustering results.

Schematic Representation of a Database and a Hit List

[Figure: schematic of a database and a hit list]

Definition of Terms

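  • D: total number of compounds in the database
  • A: number of active compounds in the database
  • Ht: total number of compounds in the hit list
  • Ha: number of active compounds in the hit list
  • %Y (yield): Ha / Ht x 100
  • %A (percent of actives): Ha / A x 100
  • Enrichment (enr.): (Ha / Ht) / (A / D)
  • False negatives (false-): A - Ha
  • False positives (false+): Ht - Ha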

G-H Score: Arithmetic Average of Yield and Percent of Actives
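
The heading above describes the score as an arithmetic average of the yield and the percent of actives. A minimal reading of that description, with Ha the number of actives in the hit list, Ht the size of the hit list, and A the number of actives in the database, is sketched below. This unweighted form is an assumption inferred from the heading and the tabulated values; it does not include the variable coefficients or the hit-list-size adjustment mentioned in the abstract.

```latex
\[
\mathrm{GH} \;=\; \frac{1}{2}\left(\frac{H_a}{H_t} + \frac{H_a}{A}\right)
\;=\; \frac{H_a\,(A + H_t)}{2\,H_t\,A}
\]
% Worked example (typical good case: H_a = 80, H_t = 200, A = 100):
\[
\mathrm{GH} \;=\; \frac{80\,(100 + 200)}{2 \cdot 200 \cdot 100} \;=\; \frac{24000}{40000} \;=\; 0.6
\]
```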

Hit List Definitions: The "Best" and the "Worst"

Search Results Test Case

  • The best case:
    where Ht = Ha = A
    -Search retrieves all of the actives and nothing else;
    false negatives = 0, false positives = 0
  • The worst case:
    where Ha = 0, Ht = D - A
    -Search retrieves everything in the database except the actives;
    false negatives = A, false positives = D - A
  • Extreme case 1: (%Y = 100 with a very small hit list):
    where D = 50,000, A = 100, Ht = Ha = 1
    -Case where %Y is 100 (i.e., every hit in the hit list is active) but the search retrieves only a single hit.
  • Extreme case 2: (%A = 100 with a very large hit list):
    where D = Ht = 50,000, A = Ha = 100
    -Case where %A is 100 (i.e., all of the actives in the database are retrieved, together with the rest of the database).
  • Typical good:
    where D = 50,000, A = 100, Ht = 200, Ha = 80
    -A typical hit list with high %Y and %A
  • Typical bad:
    where D = 50,000, A = 100, Ht = 1,000, Ha = 50
    -A typical hit list with low %Y and medium %A

G-H Applied to Six Database Search Scenarios

Case        %Y     %A    Enrichment   False neg.   False pos.   G-H
Best        100    100   500          0            0            1
Typ. Good   40     80    200          20           120          0.6
Extreme 1   100    1     500          99           0            0.5
Extreme 2   0.2    100   1            0            49900        0.5
Typ. Bad    5      50    25           50           950          0.26
Worst       0      0     0            100          49900        0
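
The six scenarios can be recomputed with a short script. The sketch below is illustrative only: the names `HitList` and `hit_list_metrics` are hypothetical, G-H is taken as the unweighted average of yield and fraction of actives, and D = 50,000 and A = 100 are assumed for the best and worst cases (the case descriptions give only the relations between Ht, Ha, A, and D). Under these assumptions the typical bad case computes to 0.275 rather than the 0.26 listed, which may reflect a weighting or rounding that is not recoverable from this text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HitList:
    """Counts describing one search (illustrative container, not from the source)."""
    D: int   # compounds in the database
    A: int   # actives in the database
    Ht: int  # compounds in the hit list
    Ha: int  # actives in the hit list


def hit_list_metrics(h: HitList) -> dict:
    """Compute the table columns, using the unweighted average for G-H."""
    y = h.Ha / h.Ht   # yield as a fraction
    a = h.Ha / h.A    # fraction of the database's actives retrieved
    return {
        "%Y": 100 * y,
        "%A": 100 * a,
        "enrichment": y / (h.A / h.D),
        "false_neg": h.A - h.Ha,
        "false_pos": h.Ht - h.Ha,
        "GH": (y + a) / 2,
    }


# D = 50,000 and A = 100 for "Best" and "Worst" are assumptions.
scenarios = {
    "Best": HitList(50_000, 100, 100, 100),
    "Typ. Good": HitList(50_000, 100, 200, 80),
    "Extreme 1": HitList(50_000, 100, 1, 1),
    "Extreme 2": HitList(50_000, 100, 50_000, 100),
    "Typ. Bad": HitList(50_000, 100, 1_000, 50),
    "Worst": HitList(50_000, 100, 49_900, 0),
}

for name, h in scenarios.items():
    m = hit_list_metrics(h)
    print(f"{name:10s} %Y={m['%Y']:6.1f} %A={m['%A']:6.1f} "
          f"enr={m['enrichment']:6.1f} FN={m['false_neg']:4d} "
          f"FP={m['false_pos']:6d} G-H={m['GH']:.3f}")
```

The two extreme cases both score about 0.5, which shows why neither yield nor percent of actives alone is a sufficient measure of hit list quality.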


G-H Applied to a Published Analysis

Query   Ha   Ht    %Y     %A     G-H
Q-4     64   91    70.3   82.1   0.76
Q-5     72   645   11.2   92.3   0.52
Q-6     58   560   10.4   74.4   0.42
Q-7     24   165   14.5   30.8   0.23


The G-H scores corroborate the intuitive conclusion published with this work: the flexible query, Q-4, was reported to be substantially more selective than the others without greatly reducing the percent of actives retrieved.

Ref: Güner O. F.; Henry D. R.; and Pearlman R. S.
J. Chem. Inf. Comput. Sci., 1992, 32, 101.
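
As a small illustration of ranking queries by score, the sketch below recomputes G-H from the tabulated %Y and %A values and sorts the queries. The helper name `gh_from_percent` is hypothetical, and the unweighted average is assumed.

```python
# (query, Ha, Ht, %Y, %A) as listed in the table above
results = [
    ("Q-4", 64, 91, 70.3, 82.1),
    ("Q-5", 72, 645, 11.2, 92.3),
    ("Q-6", 58, 560, 10.4, 74.4),
    ("Q-7", 24, 165, 14.5, 30.8),
]


def gh_from_percent(pct_yield: float, pct_actives: float) -> float:
    """Unweighted average of two percentages, rescaled to the 0-1 range."""
    return (pct_yield + pct_actives) / 200.0


# Sort from highest to lowest G-H score.
ranked = sorted(results, key=lambda r: gh_from_percent(r[3], r[4]), reverse=True)

for name, ha, ht, pct_y, pct_a in ranked:
    print(f"{name}: G-H = {gh_from_percent(pct_y, pct_a):.2f}")
```

Q-5 retrieves the largest share of actives but ranks below Q-4 because its yield is much lower.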

Clustering Classes of Active Compounds

Activity No.   DDR Activity Index   Patented Activity
1              02454                TNF Inhibitor
2              06245                5HT Uptake Inhibitor
3              09221                Acetylcholine Esterase Inhibitor
4              09248                Prolylendopeptidase Inhibitor
5              12453                Lipid Peroxidation Inhibitor
6              12454                Excitatory Amino Acid Inhibitor
7              52502                Squalene Synthetase Inhibitor
8              54112                H+/K+ ATPase Inhibitor


G-H Applied to Clusters of Compounds
Cluster Analysis* Results - Hit Counts

Cluster   Act. 1   Act. 2   Act. 3   Act. 4   Act. 5   Act. 6   Act. 7   Act. 8   Total
1         2        45       263      53       92       16       52       162      685
2         1        1        23       0        26       4        10       201      266
3         5        12       91       202      48       27       16       38       439
4         0        0        3        3        51       16       0        84       157
5         17       1        19       1        14       20       7        0        79
6         17       42       4        7        31       0        17       0        118
7         0        0        1        0        6        0        40       1        48
8         26       0        0        0        41       0        50       10       127
Total     68       101      404      266      309      83       192      496      1919


* Analyzed by the monothetic cluster analysis method of Kaufman and Rousseeuw.
Ref: Kaufman L. and Rousseeuw P. J., Finding Groups in Data - an Introduction to Cluster Analysis, Wiley, N. Y., 1990, pp. 280-311.

G-H Applied to Clusters of Compounds
Cluster Analysis Results - G-H Scores

Cluster   Act. 1   Act. 2   Act. 3   Act. 4   Act. 5   Act. 6   Act. 7   Act. 8
1         .016     .256     .517     .138     .216     .108     .173     .282
2         .009     .007     .072     .000     .091     .032     .045     .580
3         .042     .007     .216     .610     .132     .193     .060     .082
4         .000     .000     .013     .015     .245     .147     .000     .352
5         .232     .011     .144     .008     .111     .247     .063     .000
6         .197     .386     .011     .043     .182     .000     .116     .000
7         .000     .000     .012     .000     .072     .000     .521     .011
8         .294     .000     .000     .000     .228     .000     .327     .049


Using the G-H scores, it was possible to associate each cluster with compounds of a specific activity. Such an association is not obvious from the raw hit counts.
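
One way to reproduce this analysis is to treat each cluster as a hit list. The sketch below assumes that the cluster size (row total) plays the role of Ht, the activity class total (column total) plays the role of A, and G-H is the unweighted average; variable names are illustrative. Under these assumptions the recomputed scores agree with most listed values to three decimals, although a few cells differ (for example, cluster 3 / activity 2 recomputes to about 0.073, against .007 listed).

```python
# Hit counts from the table above: rows are clusters 1-8, columns are activities 1-8.
counts = [
    [2, 45, 263, 53, 92, 16, 52, 162],
    [1, 1, 23, 0, 26, 4, 10, 201],
    [5, 12, 91, 202, 48, 27, 16, 38],
    [0, 0, 3, 3, 51, 16, 0, 84],
    [17, 1, 19, 1, 14, 20, 7, 0],
    [17, 42, 4, 7, 31, 0, 17, 0],
    [0, 0, 1, 0, 6, 0, 40, 1],
    [26, 0, 0, 0, 41, 0, 50, 10],
]

cluster_sizes = [sum(row) for row in counts]          # assumed to act as Ht
activity_totals = [sum(col) for col in zip(*counts)]  # assumed to act as A


def gh(ha: int, ht: int, a: int) -> float:
    """Unweighted average of yield (ha/ht) and fraction of actives (ha/a)."""
    return (ha / ht + ha / a) / 2


gh_matrix = [
    [gh(ha, ht, a) for ha, a in zip(row, activity_totals)]
    for row, ht in zip(counts, cluster_sizes)
]

# For each cluster, pick the activity class (1-based) with the highest G-H score.
best_activity = [max(range(len(row)), key=row.__getitem__) + 1 for row in gh_matrix]

for cluster, (activity, row) in enumerate(zip(best_activity, gh_matrix), start=1):
    print(f"Cluster {cluster}: activity {activity} (G-H = {row[activity - 1]:.3f})")
```

Taking the highest-scoring activity per cluster is only one possible assignment rule; two clusters can map to the same activity, as clusters 2 and 4 both do for activity 8 here.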

Conclusions

We present the G-H formula as a convenient way to quantify the quality of hit lists obtained from searches with various queries. It can be used not only to rank results quantitatively by "goodness of hits," but also in automated procedures that optimize a 3D query (see Miller, Henry, Güner, ACS 213th National Meeting, April 13-17, 1997, paper COMP-39). In addition, the equation can be used to identify the most appropriate clustering technique in work similar to that published by Brown and Martin (J. Chem. Inf. Comput. Sci., 1996, 36, 572-584).



NetSci, ISSN 1092-7360, is published by Network Science Corporation. Except where expressly stated, content at this site is copyright (© 1995 - 2010) by Network Science Corporation and is for your personal use only. No redistribution is allowed without written permission from Network Science Corporation.