QuaSAR-Binary: A New Method for the
Analysis of High Throughput Screening Data

Paul Labute

Chemical Computing Group Inc.
1255 University Street, Suite 1600
Montreal, Quebec, Canada H3B 3X3



http://www.netsci.org/Science/Compchem/feature21.html

Background

The advent of High Throughput Screening (HTS) technology promises to make available large amounts of experimental data. Quantitative Structure Activity Relationship (QSAR) techniques have been used successfully to analyze experimental data; however, HTS data present two formidable problems that require resolution:

  • Data Format. Many HTS technologies report a binary condition; i.e., pass/fail or active/inactive. Conventional QSAR techniques such as regression were developed for continuous activity measurements, not discrete ones.

  • Dirty Data. The error rate of most HTS technologies is significant. Conventional QSAR techniques are, for the most part, based upon least squares fitting, which is very sensitive to outliers. The significant error rate increases the proportion of outliers in the data set.
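The sensitivity of least squares fitting to outliers can be illustrated with a minimal sketch. The data below are hypothetical and chosen only for illustration: five points on the line y = 2x, and the same points with one erroneous measurement.

```python
def least_squares_slope(xs, ys):
    """Slope of the ordinary least squares line through the points (xs, ys)."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx

xs = [0, 1, 2, 3, 4]
clean = [0, 2, 4, 6, 8]     # hypothetical error-free measurements, y = 2x
dirty = [0, 2, 4, 6, 30]    # same data with a single erroneous value

slope_clean = least_squares_slope(xs, clean)
slope_dirty = least_squares_slope(xs, dirty)
print(slope_clean, slope_dirty)
```

A single bad point out of five more than triples the fitted slope in this example, which is the kind of distortion a noisy screen can introduce.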

In this poster, we describe QuaSAR-Binary, a new method for analyzing HTS results in which activity measurements are binary values. The method is also capable of modeling error and uncertainty in a more direct way than conventional QSAR techniques.

Method

We now present the mathematical methods used. Let

$$\{(y_i, x_i)\}, \quad i = 1, \ldots, m$$

be the results of $m$ HTS experiments on a common target, where the $y_i$ are discrete values that, without loss of generality, we may assume are the numbers $\{1, 2, \ldots, k\}$, and the $x_i$ are vectors of $n$ numbers each (the molecular descriptors), written

$$x_i = (x_{i1}, \ldots, x_{in})$$

Let $Y$ denote a random variable with values $\{1, 2, \ldots, k\}$ and let $X = (X_1, \ldots, X_n)$ denote a random variable over $n$-vectors (a random molecular descriptor).

The theoretical method is to use the conditional distribution $\Pr(Y \mid X)$ to determine the probability $\Pr(Y = y \mid X = L)$ that a new molecule $L$ belongs to activity class $y$. We can then, for example, sort the molecule into the class that has the highest probability. Mathematically, by Bayes' theorem, we have that

$$\Pr(Y = y \mid X = x) = \frac{\Pr(X = x \mid Y = y)\,\Pr(Y = y)}{\sum_{j=1}^{k} \Pr(X = x \mid Y = j)\,\Pr(Y = j)}$$

To use this formula in practice, it is necessary to analyze the HTS data in order to approximate the distributions on the right-hand side of the equation. The distribution of $Y$ is easily estimated using a maximum likelihood estimator or a Bayes estimator. Estimating the $k$ distributions of the form $\Pr(X = x \mid Y = j)$ is more problematic since $X$ is a vector of $n$ numbers: for values of $n$ of 5 or more, a straightforward counting procedure cannot be used in practice because there will not be enough experimental data to approximate the distribution with any reasonable accuracy. Furthermore, if the $X$ vectors contain continuous (non-discrete) descriptors, this problem becomes even more acute.
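The estimation of the class distribution and its use in Bayes' rule can be sketched as follows. This is a minimal illustration, not the poster's implementation: the function names, the sample labels, and the likelihood values are hypothetical, and the pseudocount of 1 is only one common choice of Bayes estimator (the posterior mean under a uniform prior).

```python
from collections import Counter

def class_priors(labels, k, pseudocount=0.0):
    """Estimate Pr(Y=j) for j = 1..k from observed class labels.

    pseudocount=0 gives the maximum likelihood estimate (relative frequency).
    pseudocount=1 gives the posterior mean under a uniform prior (assumed here
    as one possible Bayes estimator).
    """
    counts = Counter(labels)
    total = len(labels) + k * pseudocount
    return {j: (counts[j] + pseudocount) / total for j in range(1, k + 1)}

def posterior(priors, likelihoods):
    """Bayes' rule: Pr(Y=j | X=x) from Pr(Y=j) and Pr(X=x | Y=j)."""
    joint = {j: priors[j] * likelihoods[j] for j in priors}
    evidence = sum(joint.values())
    return {j: joint[j] / evidence for j in joint}

labels = [1, 2, 2, 2, 1, 2, 2, 2]        # hypothetical screen: 1 = active, 2 = inactive
mle_priors = class_priors(labels, k=2)
bayes_priors = class_priors(labels, k=2, pseudocount=1.0)

likelihoods = {1: 0.6, 2: 0.1}           # hypothetical values of Pr(X=x | Y=j)
post = posterior(mle_priors, likelihoods)
best_class = max(post, key=post.get)     # class with the highest posterior probability
print(mle_priors, bayes_priors, post, best_class)
```

The difficult part in practice is obtaining the likelihood values, which is the problem the remainder of this section addresses.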

Our method for approximating the distributions of $X$ is to transform a multidimensional distribution into a product of one-dimensional distributions. Principal component analysis is used to determine a $p$ by $n$ linear transform $Q$ and an $n$-vector $u$ such that the random variable $Z = Q(X - u)$ has a covariance matrix equal to the $p$ by $p$ identity matrix. We then assume, for the purposes of approximation, that the individual coordinates of $Z$ are independent. This leads to a model formula of

$$\Pr(Y = y \mid X = x) \approx \frac{\Pr(Y = y) \prod_{i=1}^{p} \Pr(Z_i = z_i \mid Y = y)}{\sum_{j=1}^{k} \Pr(Y = j) \prod_{i=1}^{p} \Pr(Z_i = z_i \mid Y = j)}, \qquad z = Q(x - u)$$

which can be estimated from the training set.
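The whitening transform and the product-of-one-dimensional-distributions model can be sketched with NumPy. This is a sketch under stated assumptions rather than the MOE implementation: the function names are hypothetical, the data are synthetic, and each one-dimensional class-conditional distribution is assumed to be Gaussian because the text does not specify how those distributions are estimated.

```python
import numpy as np

def fit_whitening(X, tol=1e-10):
    """Return (Q, u) such that Z = Q (x - u) has identity sample covariance."""
    u = X.mean(axis=0)
    cov = np.atleast_2d(np.cov(X, rowvar=False))
    evals, evecs = np.linalg.eigh(cov)            # eigenvalues in ascending order
    keep = evals > tol * evals.max()              # drop near-degenerate directions
    Q = (evecs[:, keep] / np.sqrt(evals[keep])).T # p by n
    return Q, u

def fit_model(X, y, var_floor=1e-6):
    """Fit class priors and per-coordinate Gaussians (an assumption) in Z space."""
    Q, u = fit_whitening(X)
    Z = (X - u) @ Q.T
    classes = np.unique(y)
    model = {"Q": Q, "u": u, "classes": classes,
             "log_prior": [], "mean": [], "var": []}
    for c in classes:
        Zc = Z[y == c]
        model["log_prior"].append(np.log(len(Zc) / len(Z)))
        model["mean"].append(Zc.mean(axis=0))
        model["var"].append(Zc.var(axis=0) + var_floor)
    return model

def predict_proba(model, X):
    """Posterior class probabilities, treating coordinates of Z as independent."""
    Z = (X - model["u"]) @ model["Q"].T
    log_post = []
    for lp, mu, var in zip(model["log_prior"], model["mean"], model["var"]):
        log_lik = -0.5 * (np.log(2 * np.pi * var) + (Z - mu) ** 2 / var).sum(axis=1)
        log_post.append(lp + log_lik)
    log_post = np.array(log_post).T               # shape (rows, classes)
    log_post -= log_post.max(axis=1, keepdims=True)  # for numerical stability
    post = np.exp(log_post)
    return post / post.sum(axis=1, keepdims=True)

# Synthetic two-class data, for illustration only.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = np.arange(200) % 2 + 1
X[y == 2] += 4.0

model = fit_model(X, y)
proba = predict_proba(model, X)
predicted = model["classes"][proba.argmax(axis=1)]
print((predicted == y).mean())
```

Working with log probabilities avoids underflow when many one-dimensional factors are multiplied together; any other one-dimensional density estimate (for example, a smoothed histogram) could be substituted for the Gaussian assumption.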

Results

The techniques presented in this poster were implemented in the SVL programming language of Chemical Computing Group's Molecular Operating Environment (MOE) version 1997.09.

The method was tested using 1659 drug-like compounds. An abstract "activity" criterion was created using molar refractivity. A compound was called "active" if its molar refractivity was less than 5. This created 288 "active" compounds in the data set (17.4%).

Four chi topological indices were used as molecular descriptors: the zeroth- and first-order connectivity indices and the zeroth- and first-order valence-corrected connectivity indices.
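For readers unfamiliar with these descriptors, the two simple (not valence-corrected) connectivity indices can be sketched from their standard definitions on the hydrogen-suppressed molecular graph: the zeroth-order index sums the inverse square root of each atom's degree, and the first-order index sums the inverse square root of the product of degrees over each bond. The function name and the bond-list input format are hypothetical; the valence-corrected variants replace the degree with a valence-based atom value and are not shown.

```python
from math import sqrt

def connectivity_indices(bonds):
    """Zeroth- and first-order connectivity indices of a hydrogen-suppressed graph.

    bonds is a list of (atom, atom) pairs between heavy atoms.
    """
    degree = {}
    for a, b in bonds:
        degree[a] = degree.get(a, 0) + 1
        degree[b] = degree.get(b, 0) + 1
    chi0 = sum(1 / sqrt(d) for d in degree.values())
    chi1 = sum(1 / sqrt(degree[a] * degree[b]) for a, b in bonds)
    return chi0, chi1

n_butane = [(0, 1), (1, 2), (2, 3)]     # C-C-C-C chain
isobutane = [(0, 1), (0, 2), (0, 3)]    # central carbon bonded to three carbons

chi0_butane, chi1_butane = connectivity_indices(n_butane)
chi0_iso, chi1_iso = connectivity_indices(isobutane)
print(chi0_butane, chi1_butane, chi0_iso, chi1_iso)
```

The two isomers have the same atoms but different indices, which is what makes these quantities useful as descriptors of molecular branching.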

A binary model was trained on the data set and its predictiveness was evaluated with a leave-one-out cross-validation protocol. The observed cross-validated accuracy was 94.0% (97.9% on the "active" subset).
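The leave-one-out protocol can be sketched generically: each compound is held out in turn, the model is fitted to the remaining compounds, and the held-out compound is predicted. In the sketch below, the nearest-class-mean classifier and the one-dimensional data are hypothetical stand-ins chosen to keep the example short; they are not the binary model or the data set described above.

```python
def leave_one_out_accuracy(xs, ys, fit, predict):
    """Fraction of samples predicted correctly when each is held out in turn."""
    hits = 0
    for i in range(len(xs)):
        train_x = xs[:i] + xs[i + 1:]
        train_y = ys[:i] + ys[i + 1:]
        model = fit(train_x, train_y)
        hits += predict(model, xs[i]) == ys[i]
    return hits / len(xs)

def fit_class_means(xs, ys):
    """Stand-in model: the mean descriptor value of each class."""
    sums, counts = {}, {}
    for x, y in zip(xs, ys):
        sums[y] = sums.get(y, 0.0) + x
        counts[y] = counts.get(y, 0) + 1
    return {y: sums[y] / counts[y] for y in sums}

def predict_nearest_mean(means, x):
    """Assign x to the class whose mean is closest."""
    return min(means, key=lambda y: abs(x - means[y]))

xs = [1.0, 1.2, 0.8, 3.0, 3.1, 2.9, 1.9]   # hypothetical one-dimensional descriptors
ys = [1, 1, 1, 2, 2, 2, 2]                 # hypothetical class labels
accuracy = leave_one_out_accuracy(xs, ys, fit_class_means, predict_nearest_mean)
print(accuracy)
```

Restricting the tally to the samples of one class gives a per-class figure of the kind quoted above for the "active" subset.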



NetSci, ISSN 1092-7360, is published by Network Science Corporation. Except where expressly stated, content at this site is copyright (© 1995 - 2010) by Network Science Corporation and is for your personal use only. No redistribution is allowed without written permission from Network Science Corporation.