A Database System for
Combinatorial Synthesis Experiments

Craig A. James and Dr. David Weininger[+]

Daylight Chemical Information Systems, Inc.

The following talk was presented at the April 1995 IBC meeting in London and is a description of a system for handling Combinatorial Synthesis information.



http://www.netsci.org/Science/Combichem/feature06.html

 
  1. Background: The "Size" of the Chemical-Information Problem
  2. Monomer-level Description of Molecules: CHUCKLES
  3. Regular Mixtures: CHORTLES
  4. Searching: CHARTS
  5. An Example Database: Peptides
  6. Example: Diazepines
  7. Limitations of CHORTLES
  8. Conclusions
  9. Acknowledgment

Background: The "Size" of the Chemical-Information Problem

For the past decade, our work has been based on the principle that information-processing capacity increases more rapidly than the amount of chemical information available. The recent development of combinatorial mixture synthesis has challenged that principle.

Since the invention of the digital computer, its information-handling capacity has increased at an exponential rate, typically doubling every few years. The amount of chemical information in the world has historically grown at a more sedate pace, because its growth depends on the synthesis and analysis of chemicals. These trends indicate that the capacity of available computers will eventually exceed the amount of chemical information in the world.



Computing Capacity and Chemical Information

Indeed, by the mid-1980's, most corporate and academic chemical databases would fit into the memory of a "workstation" class computer, and the problem of managing all of the known structures in the world (ca. 1.5 x 10^7 structures) seemed within grasp. By the early 1990's, computers with more than a gigabyte of memory and tens of gigabytes of disk were no longer called "supercomputers," making all but the very largest databases accessible on readily-available computers. A major milestone was passed: a database of all known chemicals could, in principle, be managed by a single workstation-class computer.

At the same time this milestone was reached, new combinatorial-chemistry techniques were emerging that radically changed the nature of drug synthesis and evaluation. A single molecular-diversity experiment could, in principle, create more molecules in a day than the total number of molecules created in the previous history of chemistry. These techniques threatened to erase the gains in computing capacity that make modern chemical databases useful.

On closer examination, one finds that although combinatorial chemistry produces orders of magnitude more compounds than earlier techniques, the amount of chemical information produced has not changed very much. A typical experimental procedure and its results can be described with a few thousand bytes of English text.

Computer languages such as C, FORTRAN, SQL, and PICT have been developed for a wide variety of uses in information processing. Such languages are typically designed with the following goals:

  • Portability: A language needs well-defined syntax and semantics, so that the language's meaning remains constant across various computer architectures and software implementations.
  • Conciseness: A language should express information exactly and compactly.
  • Comprehensiveness: A language should express all information that it is intended to encompass.
  • Parsability: A computer should be able to efficiently interpret the language. Some languages also need to be human-readable.
 

A "connection table" is an example of a computer language used to express chemical structure; examples include the Protein Databank (PDB) format and "Mol" files [1] (Molecular Design, Ltd.). Connection-table languages have proved extremely useful and are still widely used. "Line notations" such as WLN [2], ROSDAL [3], and SMILES [4,5,6] are languages that represent chemical structures in compact typographical form. These languages have become indispensable to the chemical-information business.

The challenge presented by molecular-diversity chemistry is to extend the current use of computer languages so that molecular diversity can be expressed in a form that meets the above goals of portability, conciseness, comprehensiveness, and parsability.



Monomer-level Description of Molecules: CHUCKLES

 

The CHUCKLES language [7] was developed to express chemical structure at the "monomer" level rather than at the atomic level. A monomer is a "molecular chunk" -- a piece of a molecule that is typically more than one atom but less than a whole molecule. A monomer is often similar in concept to a "functional group", or to its namesake, a single monomer in a peptide or other oligomeric chemistry. However, the term "monomer", as used here, is a broad concept: any portion of a molecule can be defined as a monomer, ranging from a single atom to a whole molecule. A monomer is defined by three properties: its symbol, its SMILES, and its description. For example, the following properties define four monomers:

Symbol ................ SMILES.........................Description

  Gly .................. NCC(=O) ...................... glycine
  Ala .................. NC(C)C(=O) ................... alanine
  Cys .................. NC(CS&1)C(=O) ................ cysteine
  Oh ................... [OH] ......................... hydroxy

Monomers have some of the characteristics of atoms (e.g., they have symbols) and some of molecules (e.g., they have SMILES). Unlike atoms, there is not a fixed number of them in an immutable table. Monomers are defined on a per-application basis, typically in a "Monomer table" stored in a file or database.

Using the above Monomer definitions, the following CHUCKLES specify particular molecules, shown with their equivalent (non-unique) SMILES:


AlaCysOh ................ NC(C)C(=O)NC(CS)C(=O)O
CysGlyCysOh ............. NC(CS)C(=O)NCC(=O)NC(CS)C(=O)O

Bonds between monomers are specified with the bond symbols '-', '=', '#', and ':', representing single, double, triple, and aromatic, respectively. A "disconnection" (adjacent monomers that are not bonded) is represented with a period '.'. An unspecified bond defaults to single or aromatic, as appropriate. For example, consider the following simple monomer definitions:

Et .............. CC ............... ethyl
Nit .............. N ................. amine

We might use these with bond symbols as follows:

EtNit .......... CCN ................ ethyl amine 
Et=Nit ......... CC=N ............... ethyl imine 
Et#Nit ......... CC#N ............... acetonitrile 
Et.Nit ......... CC.N ............... ethane and ammonia 
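
The concatenation rules above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the Daylight implementation: the table `MONOMERS` and the function `chuckles_to_smiles` are hypothetical names, the hydroxy monomer is entered as `Oh` with SMILES `O` (the spelling used in the example molecules), and external connections (`&` followed by digits) are simply dropped because paired digits are not handled here. The aromatic bond symbol ':' is also left out.

```python
import re

# Hypothetical monomer table assembled from the definitions in the text.
MONOMERS = {
    "Gly": "NCC(=O)",
    "Ala": "NC(C)C(=O)",
    "Cys": "NC(CS&1)C(=O)",
    "Oh": "O",          # assumption: hydroxy written as plain O
    "Et": "CC",
    "Nit": "N",
}

BONDS = {"-", "=", "#", "."}

# A token is a monomer symbol (capital letter plus lowercase letters)
# or one explicit bond / disconnection symbol.
TOKEN = re.compile(r"[A-Z][a-z]*|[-=#.]")


def chuckles_to_smiles(chuckles):
    """Expand a linear CHUCKLES string into a (non-unique) SMILES."""
    tokens = TOKEN.findall(chuckles)
    if "".join(tokens) != chuckles:
        raise ValueError(f"unsupported characters in {chuckles!r}")
    parts = []
    for tok in tokens:
        if tok in BONDS:
            # Bond symbols are copied through; SMILES uses the same ones.
            parts.append(tok)
        else:
            # Drop external connections that this sketch does not resolve.
            parts.append(re.sub(r"&\d+", "", MONOMERS[tok]))
    return "".join(parts)


print(chuckles_to_smiles("Et#Nit"))    # CC#N
print(chuckles_to_smiles("AlaCysOh"))  # NC(C)C(=O)NC(CS)C(=O)O
```

Plain concatenation works here because each monomer's SMILES is written so that its last main-chain atom is the one that bonds to the next monomer; this is why the carbonyl oxygen appears as a branch, `C(=O)`.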

An ampersand character (&) followed by digits in a monomer's SMILES indicates an external connection -- the definition of Cys above illustrates this. The external connections are indicated by paired digits in CHUCKLES, which specify a bond between non-adjacent positions in the CHUCKLES. For example, the following CHUCKLES defines a Cysteine-Alanine-Cysteine-Hydroxy peptide with a cross-link between the sulphurs in the two cysteine groups:

Cys1AlaCys1Oh

Digits are also used in "parent-substituent" (non-oligomeric) chemistry. For example, consider the following monomer definitions for a 2,3,4-substituted phenol and several substituents:

Phen ........ Oc1c&2c&3c&4cc1 .......... 2,3,4-substituted phenol 
Me .......... C&1 ...................... methyl 
Et .......... C&1C ..................... ethyl 
Oh .......... O&1 ...................... hydroxy  

Using these, we can construct various molecules:

Phen234.Me2.Et3.Oh4 ................ 2-methyl, 3-ethyl, 4-hydroxy 
Phen234.Oh2.Oh3.Oh4 ................ 2,3,4-trihydroxy
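
One way to turn such a parent-substituent CHUCKLES into an ordinary SMILES is to rewrite every external connection as a ring-closure label, since SMILES ring closures may span a '.' disconnection. The sketch below rests on an assumption the text does not spell out: the i-th digit after a monomer symbol labels the i-th `&n` in that monomer's SMILES, and equal labels are bonded to each other. The function name `resolve_connections` and the choice of `%9k` as the ring-closure label are ours, made only for illustration.

```python
import re

# Monomer definitions from the text; "&n" marks an external connection.
MONOMERS = {
    "Phen": "Oc1c&2c&3c&4cc1",
    "Me": "C&1",
    "Et": "C&1C",
    "Oh": "O&1",
}

# One monomer symbol followed by zero or more connection digits.
UNIT = re.compile(r"([A-Z][a-z]*)(\d*)")


def resolve_connections(chuckles):
    """Expand a dot-separated parent-substituent CHUCKLES into SMILES.

    Each label k is written as the two-digit ring closure %9k so that it
    cannot clash with ring-closure digits used inside a monomer.
    """
    pieces = []
    for unit in chuckles.split("."):
        match = UNIT.fullmatch(unit)
        if match is None:
            raise ValueError(f"cannot parse {unit!r}")
        symbol, digits = match.groups()
        labels = iter(digits)

        def substitute(_m, labels=labels):
            label = next(labels, None)
            # An external connection without a digit stays unbonded.
            return f"%9{label}" if label is not None else ""

        pieces.append(re.sub(r"&\d+", substitute, MONOMERS[symbol]))
    return ".".join(pieces)


print(resolve_connections("Phen234.Me2.Et3.Oh4"))
# Oc1c%92c%93c%94cc1.C%92.C%93C.O%94
```

The result is a single connected molecule even though it contains dots, because each `%9k` pair forms a bond between the parent and one substituent.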



Regular Mixtures: CHORTLES

 

The CHORTLES language [8] is an extension of the CHUCKLES language that represents regular mixtures. Multiple monomer choices in a given position are specified via a "monomer set", made of semicolon-separated monomers in brackets, e.g.:

Ala[Cys;Gly]AlaOh ................................... 2 trimers 
[Ala;Gly][Cys;Gly][Ala;Gly]Oh ....................... 8 trimers 
[Gly;Ala;Cys][Gly;Ala;Cys][Gly;Ala;Cys]Oh .......... 27 trimers 
Cys1[Gly;Ala;Cys]Cys1Oh ............................. 3 cyclic trimers 

Multiple monomers are implicitly related by an AND operator; i.e., [Ala;Gly] means that both Ala and Gly are in the mixture.
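
The number of components in a regular mixture is the product of the sizes of its monomer sets, and the components themselves can be listed with a Cartesian product. The following sketch shows both; the function names are hypothetical, and cross-link digits are simply carried along with the position they follow.

```python
import re
from itertools import product
from math import prod

# A position is a bracketed monomer set or a single monomer symbol;
# optional cross-link digits after the position are kept with it.
POSITION = re.compile(r"\[([^\]]+)\](\d*)|([A-Z][a-z]*)(\d*)")


def positions(chortles):
    """Split a CHORTLES string into per-position lists of choices."""
    result = []
    for m in POSITION.finditer(chortles):
        if m.group(1) is not None:
            choices = [s.strip() for s in m.group(1).split(";")]
            digits = m.group(2)
        else:
            choices, digits = [m.group(3)], m.group(4)
        result.append([c + digits for c in choices])
    return result


def component_count(chortles):
    """Number of individual molecules described by the mixture."""
    return prod(len(choices) for choices in positions(chortles))


def enumerate_components(chortles):
    """List the CHUCKLES of every component of the mixture."""
    return ["".join(combo) for combo in product(*positions(chortles))]


print(component_count("[Ala;Gly][Cys;Gly][Ala;Gly]Oh"))  # 8
print(enumerate_components("Ala[Cys;Gly]AlaOh"))
# ['AlaCysAlaOh', 'AlaGlyAlaOh']
```

Counting needs only the set sizes, so it stays cheap even when full enumeration would not be practical.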



Searching: CHARTS

CHARTS is a language which describes monomer patterns, used for substructure searching at the monomer level. CHARTS monomer-level specifications include numeric ranges, variability in each position expressed as AND and OR specifications, and the special "pseudo-monomers" Begin, End, and Any. These features allow the construction of powerful queries for database-searching and for expressing chemical knowledge.

The CHARTS language is a "superset" of the CHUCKLES and CHORTLES languages. Any CHUCKLES is also a valid CHARTS which matches itself. Any CHORTLES is also a valid CHARTS that matches itself or any supermixture of itself.

A CHARTS "AND" expression indicates multiple requirements for a given monomer position, and is written as a semicolon-separated list of monomer symbols inside brackets, e.g., Ala[Pro;Tyr]His will match the CHORTLES:

[Ala;Gly;Lys][Pro;Ser;Tyr][Cys;His]Oh

A CHARTS "OR" expression indicates choices at a given monomer position and is written as a comma-separated list of monomer symbols inside brackets, e.g., Ala[Pro,Tyr]His will match CHUCKLES such as:

AlaProHisOh, AlaTyrHisOh, TyrAlaProHisGlyOh, etc.

The same pattern, Ala[Pro,Tyr]His will also match the CHORTLES:

[Ala;Phe][Thr;Tyr][Cys;His]Oh

This type of match may be used to find single components in regular mixtures; e.g., the above match indicates that at least one of the multimers AlaProHisOh or AlaTyrHisOh exists in the eight-component mixture [Ala;Phe][Thr;Tyr][Cys;His]Oh.

The symbols "Any", "Begin" and "End" represent special pseudo-monomers which match any monomer. "Begin" and "End" additionally are constrained to have no left-hand and right-hand bonds, respectively. For instance, the CHARTS expression:

BeginAnyAnyAnyEnd

will match all pentamers.
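
The matching rules described so far (AND sets, OR sets, and the three pseudo-monomers) can be sketched as a sliding-window comparison over monomer positions. This is our own minimal reading of the rules, not the CHARTS implementation: the function names are hypothetical, a target is reduced to one set of monomers per position, bonds and cross-links are ignored, and "Begin" and "End" are approximated as "first position" and "last position" of the target.

```python
import re

TOKEN = re.compile(r"\[([^\]]+)\]|([A-Z][a-z]*)")


def parse_target(text):
    """CHUCKLES or CHORTLES -> list of monomer sets, one per position."""
    sets = []
    for group, symbol in TOKEN.findall(text):
        names = group.split(";") if group else [symbol]
        sets.append(frozenset(n.strip() for n in names))
    return sets


def parse_pattern(text):
    """CHARTS -> list of (kind, monomers) pairs, one per position."""
    items = []
    for group, symbol in TOKEN.findall(text):
        if symbol in ("Any", "Begin", "End"):
            items.append((symbol, frozenset()))
        elif group and "," in group:
            names = group.split(",")
            items.append(("or", frozenset(n.strip() for n in names)))
        else:
            names = group.split(";") if group else [symbol]
            items.append(("and", frozenset(n.strip() for n in names)))
    return items


def position_matches(kind, wanted, index, target):
    if kind == "Begin":
        return index == 0
    if kind == "End":
        return index == len(target) - 1
    if kind == "Any":
        return True
    if kind == "or":
        return bool(wanted & target[index])   # at least one choice present
    return wanted <= target[index]            # "and": all must be present


def charts_matches(pattern, target_text):
    """True if the pattern matches some contiguous run of positions."""
    pat, target = parse_pattern(pattern), parse_target(target_text)
    for start in range(len(target) - len(pat) + 1):
        if all(position_matches(kind, wanted, start + i, target)
               for i, (kind, wanted) in enumerate(pat)):
            return True
    return False


print(charts_matches("Ala[Pro,Tyr]His", "[Ala;Phe][Thr;Tyr][Cys;His]Oh"))  # True
print(charts_matches("Ala[Pro;Tyr]His", "[Ala;Phe][Thr;Tyr][Cys;His]Oh"))  # False
```

The second call fails because the AND expression requires both Pro and Tyr at the second position, and the mixture offers only Thr and Tyr there.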

Repeat counts allow expression of repeated units and of variability in the match. Repeat counts are specified in brackets following the monomer symbols, and apply to the entire AND or OR expression. Examples of range specifications are:

[Ala:2] ......................... Two alanines 
[Ala,Gly,His:1-3] ............... One, two, or three Ala, Gly, or His 
[Ala:2-] ........................ At least two alanines 
[Ala:-5] ........................ At most five alanines 
[Ala:0-1] ....................... Zero or one alanine 
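
The range syntax shown above is regular enough to parse with a single pattern. The sketch below is illustrative only: `parse_repeat` is a hypothetical helper, and two of its behaviours are assumptions of ours rather than statements from the text, namely that a missing lower bound (as in `[Ala:-5]`) means zero and that an expression without any count means exactly one.

```python
import re

# "[" monomers, optionally ":" low "-" high, "]"; either bound may be empty.
REPEAT = re.compile(r"\[([^\]:]+)(?::(\d*)(-?)(\d*))?\]")


def parse_repeat(expr):
    """Parse a bracketed CHARTS expression into (kind, monomers, low, high).

    kind is "or" for comma-separated lists and "and" otherwise;
    high is None when the range is open-ended.
    """
    m = REPEAT.fullmatch(expr)
    if m is None:
        raise ValueError(f"cannot parse {expr!r}")
    body, low, dash, high = m.groups()
    kind = "or" if "," in body else "and"
    monomers = frozenset(s.strip() for s in re.split(r"[;,]", body))
    if low is None:                 # no ":" part at all
        return kind, monomers, 1, 1
    if not dash:                    # "[Ala:2]" means exactly two
        if not low:
            raise ValueError(f"empty repeat count in {expr!r}")
        return kind, monomers, int(low), int(low)
    return kind, monomers, int(low or 0), int(high) if high else None


for text in ("[Ala:2]", "[Ala,Gly,His:1-3]", "[Ala:2-]", "[Ala:-5]", "[Ala:0-1]"):
    kind, monomers, low, high = parse_repeat(text)
    print(text, kind, sorted(monomers), low, high)
```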



An Example Database: Peptides

 

The THOR database system [9] is a lexically-based, thesaurus-oriented storage and retrieval system that uses molecular structure (i.e. SMILES) as its primary key. In such a system, the CHORTLES language is a natural addition that facilitates storing regular mixtures.

The THOR system has built-in knowledge about CHORTLES; specifically, it can interpret them as structures. THOR automatically generates a SMILES for each mixture, replacing any variable positions with a "*" pseudo-atom.
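
The text does not show what such a generated SMILES looks like, so the following is only one plausible reading: fixed positions are expanded to their monomer SMILES, and each bracketed monomer set is replaced by a single "*". The function `mixture_smiles` and the small monomer table are hypothetical and are not part of THOR.

```python
import re

# Hypothetical monomer table; the Cys cross-link connection is omitted.
MONOMERS = {
    "Gly": "NCC(=O)",
    "Ala": "NC(C)C(=O)",
    "Cys": "NC(CS)C(=O)",
    "Oh": "O",
}

POSITION = re.compile(r"\[[^\]]+\]|[A-Z][a-z]*")


def mixture_smiles(chortles):
    """Write fixed positions as SMILES and variable positions as '*'."""
    parts = []
    for token in POSITION.findall(chortles):
        if token.startswith("["):
            parts.append("*")          # variable position
        else:
            parts.append(MONOMERS[token])
    return "".join(parts)


print(mixture_smiles("Ala[Cys;Gly]AlaOh"))  # NC(C)C(=O)*NC(C)C(=O)O
```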

The following data are from a contrived database, one created to illustrate the typical techniques one might use when making mixtures using the naturally-occurring peptides. The database represents an "experiment" in which a mixture is repeatedly analyzed for activity then "deconvolved" into submixtures, until a single compound is discovered with the desired activity.

The initial mixture is a trimer with 20, 8, and 8 choices at the first, second and third positions, respectively. Thus, we begin our pseudo-experiment with a single entry in the database that represents 20 x 8 x 8 = 1280 compounds.

The first stage of our deconvolution fixes the first position in each of 20 mixtures, e.g.

1 Ala[Ala;Arg;Asn;Asp;Cys;Gln;Glu;Gly][His;Ile;Leu;Lys;Met;Phe;Pro;Ser]Oh 
2 Arg[Ala;Arg;Asn;Asp;Cys;Gln;Glu;Gly][His;Ile;Leu;Lys;Met;Phe;Pro;Ser]Oh 
3 Asn[Ala;Arg;Asn;Asp;Cys;Gln;Glu;Gly][His;Ile;Leu;Lys;Met;Phe;Pro;Ser]Oh 
4 Asp[Ala;Arg;Asn;Asp;Cys;Gln;Glu;Gly][His;Ile;Leu;Lys;Met;Phe;Pro;Ser]Oh 
 ... etc ... 
20 Val[Ala;Arg;Asn;Asp;Cys;Gln;Glu;Gly][His;Ile;Leu;Lys;Met;Phe;Pro;Ser]Oh 

These twenty submixtures and their respective assay results are entered into the database, one record per submixture.

We examine our "assay" results, and select the best candidate (Reg. No. 20004), and proceed with the deconvolution, this time fixing the second position, e.g.:

1. AsnAla[His;Ile;Leu;Lys;Met;Phe;Pro;Ser]Oh 
2. AsnArg[His;Ile;Leu;Lys;Met;Phe;Pro;Ser]Oh  
3. AsnAsn[His;Ile;Leu;Lys;Met;Phe;Pro;Ser]Oh 
..... etc. .....
8. AsnGly[His;Ile;Leu;Lys;Met;Phe;Pro;Ser]Oh 

Each of these, together with its assay data, becomes a new record.

Finally, we fix the third position, which leaves eight single compounds.

At the end of our "experiment," we find that 2632 compounds were synthesized, representing 1280 unique molecules, yet we only made and tested 37 samples.
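
These totals follow directly from the number of choices at each position. A short calculation, with a hypothetical function name, makes the bookkeeping explicit: stage 0 is the full mixture, and stage k fixes the first k positions while the remaining positions stay variable.

```python
from math import prod


def deconvolution_totals(choices):
    """Return (samples made, compounds synthesized, unique molecules)."""
    samples = compounds = 0
    for fixed in range(len(choices) + 1):
        # Stage 0: one sample. Stage k: one sample per choice at position k.
        n_samples = 1 if fixed == 0 else choices[fixed - 1]
        # Each sample still contains every combination of the open positions.
        per_sample = prod(choices[fixed:])
        samples += n_samples
        compounds += n_samples * per_sample
    return samples, compounds, prod(choices)


print(deconvolution_totals((20, 8, 8)))  # (37, 2632, 1280)
```

The stages contribute 1, 20, 8 and 8 samples containing 1280, 64, 8 and 1 compounds each, which gives 37 samples and 1280 + 1280 + 64 + 8 = 2632 compounds.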

A key feature of the THOR Database system using CHORTLES is that we store exactly what we know, and no more. The database that represents this experiment has 37 records in it, exactly the same as the number of mixtures actually made. There is no need to "enumerate" the components of a mixture prior to storing information about the mixture.



Example: Diazepines

In addition to the oligomeric chemistry illustrated above, CHUCKLES and CHORTLES can also be used to specify non-oligomeric chemistry, such as a parent structure with substituents. To illustrate this, we'll use a single mixture of substituted diazepines.

Substituted Diazepines

We define the following monomers:

Symbol ................... SMILES ................... Description

Pam ...... C1&1N=C(c2c&2cccc2)c3cc&3ccc3N&4C1=O ..... benzodiazepine 
Hx  ........... [H] ..................................hydro 
Fx  ............ F ...................................fluoro
Clx  ........... Cl ..................................chloro 
Brx  ........... Br ..................................bromo
Ix  ............ I ...................................iodo
Nitrox  ........ O=[N+]([O-]) ........................nitro
Mex  ........... C ...................................methyl
Etx  ........... CC ..................................ethyl
Mohx  .......... OC ..................................methylhydroxy
Eohx ........... OCC .................................ethylhydroxy
Mcpx  .......... C1CC1C ..............................methylcyclopropyl 
Tppx ........... C#CC ................................2propynyl
Tfex ........... FC(F)(F)C ...........................trifluoroethyl
Phex ........... c1ccccc1 ............................phenyl
Ppox ........... Oc1ccc2cc1 ..........................paraphenolyl
Ohx ............ O ...................................hydroxy
Carx  .......... CN(C)C(=O)O .........................tert-N-carbamate

With these monomer definitions, the diazepines mixture illustrated above is represented by the following CHORTLES:

Pam1234.[Hx;Fx;Clx;Brx;Ix]1.[Hx;Fx;Clx;Nitrox]2.[Hx;Mex;Etx;Mohx;Eohx;Mcpx;Tppx;Tfex;Phex;Ppox]3.[Hx;Ohx;Carx]4
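
As a quick check on the size of this mixture, the number of choices at each substitution site can be read straight from the CHORTLES string. The helper below is a hypothetical sketch that assumes the exact layout used here: a parent monomer followed by dot-separated monomer sets, each ending in a single connection digit.

```python
from math import prod

CHORTLES = (
    "Pam1234.[Hx;Fx;Clx;Brx;Ix]1.[Hx;Fx;Clx;Nitrox]2."
    "[Hx;Mex;Etx;Mohx;Eohx;Mcpx;Tppx;Tfex;Phex;Ppox]3.[Hx;Ohx;Carx]4"
)


def substituent_choices(chortles):
    """Map each connection digit to the number of substituent choices."""
    counts = {}
    for unit in chortles.split(".")[1:]:      # skip the parent monomer
        body, digit = unit[:-1], unit[-1]
        counts[int(digit)] = len(body.strip("[]").split(";"))
    return counts


counts = substituent_choices(CHORTLES)
print(counts)                  # {1: 5, 2: 4, 3: 10, 4: 3}
print(prod(counts.values()))   # 600
```

A single line of CHORTLES therefore stands for 5 x 4 x 10 x 3 = 600 substituted diazepines.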



Limitations of CHORTLES

CHORTLES represents regular mixtures, those in which the "shapes" of all components of the mixture are the same. By "same shape" we mean that all components must have the same number of monomer positions, and the bonding between the monomer positions (the left- and right-hand bonds, and external connections, or "cross links") must be identical. Some mixtures do not meet these requirements. For example, a synthesis might generate 2,3-halogenated phenol, using the halogens Cl, Br, and I; however, the diameter of the iodine atom makes it impossible to have iodine in both the 2 and 3 positions. The CHORTLES language has no convenient expression to say, e.g. "if monomer A is X, then monomer B is (is not) Y."
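
The halogenated-phenol example can be checked by brute force. A single CHORTLES with two variable positions always describes a full cross product of two monomer sets, so the question is whether any pair of sets produces exactly the eight allowed combinations. The sketch below, with names of our own choosing, tries every pair.

```python
from itertools import combinations, product

HALOGENS = ("Cl", "Br", "I")

# Allowed (position 2, position 3) pairs: all except iodine at both.
allowed = {pair for pair in product(HALOGENS, repeat=2) if pair != ("I", "I")}


def nonempty_subsets(items):
    """Yield every non-empty subset of items as a tuple."""
    for size in range(1, len(items) + 1):
        yield from combinations(items, size)


# Is there any pair of monomer sets whose cross product is the allowed set?
expressible = any(
    set(product(a, b)) == allowed
    for a in nonempty_subsets(HALOGENS)
    for b in nonempty_subsets(HALOGENS)
)

print(len(allowed))   # 8
print(expressible)    # False
```

No pair of sets works, because a cross product of subsets of three halogens can only have 1, 2, 3, 4, 6 or 9 members. Such a mixture would have to be stored as more than one regular mixture.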

A second limitation is that CHORTLES is based on a set of monomer definitions that are defined on a per-application basis, unlike SMILES which is based on the immutable Periodic Table. It is possible that a CHARTS search will miss a substructure if the monomer representation in the pattern (the CHARTS) is different from that of the target (the CHUCKLES or CHORTLES). For example, consider the following two monomers:

Et ............. CC ................ ethyl
Pr ............. CCC ............... propyl

Using these, we might expect the CHARTS "PrPr" to match hexane. However, if our database contains hexane as "EtEtEt", a CHARTS search won't find it. In general, CHARTS searches are effective when consistent usage of particular monomers is maintained. For example, searching for "AlaPro" in a database of peptides is likely to work correctly, since such databases generally use the naturally-occurring amino-acid residues in a consistent fashion.
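
The hexane example is easy to reproduce. In the sketch below (hypothetical helper names, simple concatenation of monomer SMILES), both spellings expand to the same atom-level structure, yet a comparison at the monomer level finds no match.

```python
import re

MONOMERS = {"Et": "CC", "Pr": "CCC"}


def symbols(chuckles):
    """Split a CHUCKLES string into its monomer symbols."""
    return re.findall(r"[A-Z][a-z]*", chuckles)


def to_smiles(chuckles):
    """Concatenate the monomer SMILES (single bonds between monomers)."""
    return "".join(MONOMERS[s] for s in symbols(chuckles))


def monomer_match(pattern, target):
    """True if the pattern's symbols occur as a contiguous run in target."""
    p, t = symbols(pattern), symbols(target)
    return any(t[i:i + len(p)] == p for i in range(len(t) - len(p) + 1))


print(to_smiles("PrPr"), to_smiles("EtEtEt"))  # CCCCCC CCCCCC
print(monomer_match("PrPr", "EtEtEt"))         # False
```

An atom-level search on the SMILES would find the match, which is why consistent monomer usage matters for monomer-level searching.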



Conclusions

The CHUCKLES, CHORTLES, and CHARTS languages, used to represent molecules, mixtures, and queries, respectively, combined with the THOR thesaurus-oriented chemical-information database system, are effective tools for building databases of combinatorial libraries. Such databases are compact, in that each mixture is stored only once, and no enumeration of the mixture's components is required. Monomer-level searching using CHARTS is fast, and also does not require enumeration of a mixture's components. Atom-level searching, which does require enumeration of components, is also possible.



Acknowledgment

CHUCKLES and CHORTLES were proposed and prototyped by Michael A. Siani and Jeffrey M. Blaney of Chiron Corporation. Their suggestions and tests have been invaluable during the development of a database system for combinatorial libraries.


For more information about the Daylight ToolKit used in building the system described here, please look up the MONOMER ToolKit.

[+] References and Notes:

Address correspondence to this author at Daylight Chemical Information Systems, Inc., 419 E. Palace Ave. #1, Santa Fe, NM, 87501, USA.


[1] Nourse, J.G., et al., J. Chem. Inf. Comput. Sci., 1988, 32.

[2] Smith, E.G. and Baker, P.A. (1975), The Wiswesser Line-Formula Chemical Notation (WLN), Chemical Information Management Inc., New Jersey.

[3] Welford, S.M., ROSDAL MANUAL for Users of the Beilstein Database at Dialog, Version 1.1, 5 October 1989, Springer.

[4] Weininger, D., J. Chem. Inf. Comput. Sci., 1988, 28, 31.

[5] Weininger, D., J. Chem. Inf. Comput. Sci., 1989, 29, 97.

[6] For a more recent exposition on the SMILES language, see Daylight Theory Manual, chapter 3, Daylight Chemical Information Systems, Inc., 18500 Von Karman Ave. #450, Irvine CA 92715, USA.

[7] Siani, M.A., et al., J. Chem. Inf. Comput. Sci., 1994, 34.

[8] Siani, M.A., et al., CHORTLES: A Method for Representing Oligomeric and Template-based Mixtures, submitted for publication in J. Chem. Inf. Comput. Sci.

[9] Daylight Theory Manual, chapters 7-10, cited above.



NetSci, ISSN 1092-7360, is published by Network Science Corporation. Except where expressly stated, content at this site is copyright (© 1995 - 2010) by Network Science Corporation and is for your personal use only. No redistribution is allowed without written permission from Network Science Corporation.