Network Science Corporation





QuaSAR-Binary: A New Method for the
Analysis of High Throughput Screening Data

Paul Labute

Chemical Computing Group Inc.
1255 University Street, Suite 1600
Montreal, Quebec, Canada H3B 3X3



http://www.netsci.org/Science/Compchem/feature21.html

Background

The advent of High Throughput Screening (HTS) technology promises to make available large amounts of experimental data. Quantitative Structure Activity Relationship techniques have been used successfully to analyze experimental data; however, HTS data presents two formidable problems that require resolution:

In this poster, we describe QuaSAR-Binary, a new method for analyzing HTS results in which activity measurements are binary values. The method is also capable of modeling error and uncertainty in a more direct way than conventional QSAR techniques.

Method

We now present the mathematical methods used. Let

$$\{(y_i, x_i)\}_{i=1}^{m}$$

be the results of m HTS experiments on a common target, where the $y_i$ are discrete values that, without loss of generality, we may assume to be the numbers {1,2,..,k}, and the $x_i$ are vectors of n numbers each (the molecular descriptors), written

$$x_i = (x_{i1}, \ldots, x_{in}).$$

Let Y denote a random variable with values {1,2,..,k} and let $X = (X_1, \ldots, X_n)$ denote a random variable over n-vectors (a random molecular descriptor).

The theoretical method is to use the conditional distribution Pr(Y|X) to determine the probability Pr(Y=y|X=x) that a new molecule with descriptor vector x belongs to activity class y. We can then, for example, sort the molecule into the class that has the highest probability. Mathematically, by Bayes' theorem, we have that

$$\Pr(Y=y \mid X=x) = \frac{\Pr(X=x \mid Y=y)\,\Pr(Y=y)}{\sum_{j=1}^{k} \Pr(X=x \mid Y=j)\,\Pr(Y=j)}$$

To use this formula in practice, it is necessary to analyze the HTS data in order to approximate the distributions on the right-hand side of the equation. The distribution of Y is easily estimated using a maximum likelihood estimator or a Bayes estimator. Estimating the k distributions of the form Pr(X=x|Y=j) is more problematic because X is a vector of n numbers: for values of n of 5 or more, a straightforward counting procedure cannot be used in practice because there will not be enough experimental data to approximate the distribution with any reasonable accuracy. Furthermore, if the X vectors contain continuous (non-discrete) descriptors, this problem becomes even more acute.
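The two estimators for the distribution of Y, and the reason direct counting fails for X, can be illustrated with a minimal sketch. The function names are hypothetical, the Bayes estimator shown is the posterior mean under a symmetric Dirichlet prior (one common choice, assumed here rather than taken from the poster), and the figure of 10 bins per descriptor is an arbitrary illustration.

```python
from collections import Counter


def prior_ml(labels, k):
    """Maximum likelihood estimate: Pr(Y=j) = (count of class j) / m."""
    counts = Counter(labels)
    m = len(labels)
    return [counts.get(j, 0) / m for j in range(1, k + 1)]


def prior_bayes(labels, k, alpha=1.0):
    """Posterior-mean estimate under a symmetric Dirichlet(alpha) prior.

    alpha = 1 corresponds to add-one smoothing; this prior is an assumption.
    """
    counts = Counter(labels)
    m = len(labels)
    return [(counts.get(j, 0) + alpha) / (m + alpha * k) for j in range(1, k + 1)]


def cells_needed(n, bins):
    """Number of cells in a joint histogram over n descriptors."""
    return bins ** n


# Class sizes reported in the Results section: 288 "active" (class 2) of 1659.
labels = [2] * 288 + [1] * (1659 - 288)
p_ml = prior_ml(labels, k=2)
p_bayes = prior_bayes(labels, k=2)

# With a hypothetical 10 bins per descriptor, n = 5 descriptors already
# requires 100000 cells, far more than the 1659 observations available.
n_cells = cells_needed(5, 10)
```

With this many samples the two prior estimates are nearly identical; the difference matters mainly when a class has very few members.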

Our method to approximate the distributions of X is to transform a multidimensional distribution into a product of one dimensional distributions. The method of principal component analysis is used to determine a p by n linear transform Q and an n-vector u such that the random variable Z=Q(X-u) has a covariance matrix equal to the p by p identity matrix. We then assume for the purposes of approximation that the individual coordinates of Z are independent. This leads to a model formula of

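$$\Pr(Y=y \mid X=x) \approx \frac{\Pr(Y=y)\,\prod_{i=1}^{p} \Pr(Z_i=z_i \mid Y=y)}{\sum_{j=1}^{k} \Pr(Y=j)\,\prod_{i=1}^{p} \Pr(Z_i=z_i \mid Y=j)}, \qquad z = Q(x-u),$$

which can be estimated from the training set.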
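The approximation described above (whiten the descriptors with principal component analysis, then treat the coordinates of Z as independent within each class) can be sketched in Python with NumPy. This is not the SVL implementation: the class name `WhitenedProductModel` is hypothetical, and the use of smoothed histograms for the one-dimensional distributions, the bin count, and the synthetic demonstration data are all assumptions made for illustration.

```python
import numpy as np


def fit_whitening(X, eps=1e-10):
    """Return (Q, u) such that Z = Q (x - u) has identity sample covariance."""
    u = X.mean(axis=0)
    evals, evecs = np.linalg.eigh(np.cov(X, rowvar=False))
    keep = evals > eps * evals.max()                # drop near-degenerate directions
    Q = (evecs[:, keep] / np.sqrt(evals[keep])).T   # p by n transform
    return Q, u


class WhitenedProductModel:
    """Hypothetical sketch: class posterior from a product of 1-D histograms."""

    def fit(self, X, y, k, bins=10, alpha=1.0):
        self.Q, self.u = fit_whitening(X)
        Z = (X - self.u) @ self.Q.T
        p = Z.shape[1]
        # Shared bin edges per whitened coordinate (an assumption).
        self.edges = [np.linspace(Z[:, i].min(), Z[:, i].max(), bins + 1)
                      for i in range(p)]
        counts = np.array([(y == j).sum() for j in range(1, k + 1)])
        self.log_prior = np.log((counts + alpha) / (len(y) + alpha * k))
        # log_cond[j, i, b] = log Pr(Z_i falls in bin b | Y = j + 1), smoothed.
        self.log_cond = np.empty((k, p, bins))
        for j in range(k):
            Zj = Z[y == j + 1]
            for i in range(p):
                h, _ = np.histogram(Zj[:, i], bins=self.edges[i])
                self.log_cond[j, i] = np.log((h + alpha) / (h.sum() + alpha * bins))
        return self

    def predict_proba(self, X):
        Z = (X - self.u) @ self.Q.T
        log_post = np.tile(self.log_prior, (len(Z), 1))
        for i, e in enumerate(self.edges):
            # Bin index of each value; out-of-range values go to the end bins.
            b = np.clip(np.searchsorted(e, Z[:, i], side="right") - 1, 0, len(e) - 2)
            log_post += self.log_cond[:, i, b].T
        log_post -= log_post.max(axis=1, keepdims=True)   # numerical stability
        post = np.exp(log_post)
        return post / post.sum(axis=1, keepdims=True)


# Synthetic demonstration data (not from the poster): two classes of
# correlated 3-descriptor vectors, with class 2 shifted away from class 1.
rng = np.random.default_rng(0)
y = np.array([1] * 300 + [2] * 100)
A = np.array([[1.0, 0.0, 0.0], [0.8, 0.6, 0.0], [0.5, 0.5, 0.7]])
X = rng.normal(size=(400, 3)) @ A.T + 5.0 * (y == 2)[:, None]

model = WhitenedProductModel().fit(X, y, k=2)
post = model.predict_proba(X)
pred = post.argmax(axis=1) + 1
train_accuracy = (pred == y).mean()
```

Working in log space avoids underflow when many one-dimensional factors are multiplied. Because the same bin edges are used for every class, bin probabilities can stand in for densities in the ratio.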

Results

The techniques presented in this poster were implemented in the SVL programming language of Chemical Computing Group's Molecular Operating Environment (MOE) version 1997.09.

The method was tested using 1659 drug-like compounds. An abstract "activity" criterion was created using molar refractivity. A compound was called "active" if its molar refractivity was less than 5. This created 288 "active" compounds in the data set (17.4%).

Four chi topological indices were used as molecular descriptors: the zeroth and first order connectivity indices as well as the zeroth and first order valence-corrected connectivity indices.
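For readers unfamiliar with these descriptors, the following sketch computes the zeroth and first order connectivity indices from a hydrogen-suppressed molecular graph, using the standard definitions (a sum of 1/sqrt(delta) over atoms, and a sum of 1/sqrt(delta_a * delta_b) over bonds). It is an illustration with hypothetical function names, not the MOE implementation. The valence-corrected variants use the same sums with valence delta values in place of the simple vertex degrees; those values would have to be supplied by the caller and are not computed here.

```python
from math import sqrt


def simple_deltas(n_atoms, bonds):
    """Vertex degree of each heavy atom in the hydrogen-suppressed graph."""
    deltas = [0] * n_atoms
    for a, b in bonds:
        deltas[a] += 1
        deltas[b] += 1
    return deltas


def chi0(deltas):
    """Zeroth order connectivity index: sum over atoms of 1/sqrt(delta)."""
    return sum(1.0 / sqrt(d) for d in deltas)


def chi1(deltas, bonds):
    """First order connectivity index: sum over bonds of 1/sqrt(delta_a * delta_b)."""
    return sum(1.0 / sqrt(deltas[a] * deltas[b]) for a, b in bonds)


# n-Butane as a hydrogen-suppressed chain C0-C1-C2-C3.
bonds = [(0, 1), (1, 2), (2, 3)]
deltas = simple_deltas(4, bonds)
butane_chi0 = chi0(deltas)          # 2 + sqrt(2)
butane_chi1 = chi1(deltas, bonds)   # 0.5 + sqrt(2)
```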

A binary model was trained on the data set and its predictiveness was evaluated with a leave-one-out cross-validation protocol. The observed cross-validated accuracy was 94.0% (97.9% on the "active" subset).
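A leave-one-out protocol of this kind can be sketched as follows: each compound is held out in turn, the model is refitted on the rest, and the held-out compound is predicted. The sketch reports overall and per-class accuracy, mirroring the two figures quoted above. The classifier used here (nearest class mean on a single descriptor) is a deliberately simple stand-in, not the poster's model, and the data are made up for illustration.

```python
def leave_one_out(X, y, fit, predict):
    """Return (overall accuracy, per-class accuracy) under leave-one-out."""
    hits, totals = {}, {}
    for i in range(len(X)):
        # Refit on every sample except the i-th, then predict the i-th.
        model = fit(X[:i] + X[i + 1:], y[:i] + y[i + 1:])
        totals[y[i]] = totals.get(y[i], 0) + 1
        hits[y[i]] = hits.get(y[i], 0) + (predict(model, X[i]) == y[i])
    overall = sum(hits.values()) / len(X)
    per_class = {c: hits[c] / totals[c] for c in totals}
    return overall, per_class


# Stand-in classifier (hypothetical): nearest class mean on one descriptor.
def fit_means(X, y):
    means = {}
    for label in set(y):
        values = [x for x, t in zip(X, y) if t == label]
        means[label] = sum(values) / len(values)
    return means


def predict_means(means, x):
    return min(means, key=lambda label: abs(x - means[label]))


# Made-up data: class 1 clusters near 1, class 2 near 5, plus one
# class-1 outlier at 3.2 that is misclassified when held out.
X = [1.0, 1.2, 0.8, 1.1, 5.0, 5.3, 4.7, 3.2]
y = [1, 1, 1, 1, 2, 2, 2, 1]
overall, per_class = leave_one_out(X, y, fit_means, predict_means)
```

Reporting accuracy per class matters when classes are imbalanced, as in the data set above, because a high overall figure can hide poor performance on the minority class.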
