Discovery Potential by Integrating Chemical and Biological Information Resources: Disease Analysis with Chemical Abstracts Service Databases

M. J. Toussant, H. Liu-Johnson and D. Dayton

Biochemistry and New Products Departments
Chemical Abstracts Service
Columbus, OH 43210



http://www.netsci.org/Science/Bioinform/feature04.html

Introduction

Knowledge is used for rational decision making. Traditional databases provide clues to knowledge by serving as information repositories that organize the mass of data from scientific observations. Discovery represents the crossover point between organized information collections and knowledge.

Algorithmic approaches to a discovery function may be a natural extension of the desktop technology intended to meet the needs of research scientists (1). In disease research, however, this requires bringing together information from databases and tools that are not currently integrated. Examples of these sources (Figure 1) include CAS databases (e.g., CAplus and CAS Registry), Internet sources and tools (e.g., World-Wide Web protein and nucleic acid databases, secondary structure prediction tools, solvent accessibility predictors), desktop and local tools (e.g., modeling software, metabolic fate data, biopathway databases, spectral data), and other public databases (e.g., BIOSIS, MedLine, and Mendelian Inheritance in Man).

[Figure 1]

Figure 1
Information sources for biological discovery function



The goal of this study was to explore bioinformatic design approaches to disease analysis by examining the potential for information integration and discovery with CAS and Internet bioinformation tools and databases. Further, the utility of developing search terms from protein sequence homology searches is demonstrated. Techniques for information integration and discovery were tested using familial hypercholesterolemia as an example, attempting to identify active agents or structural moieties that might interact with either of two important proteins in this disease - apolipoprotein B100 (apo B100) or the low-density lipoprotein (LDL) receptor.

Background and Design

Databases and Tools

CAS files have considerable information related to all aspects of chemistry (2-4). The CAS Registry file has over 14 million substances, with structure connection table information and names. It contains over 950,000 biosequences, with that number continuing to increase each week, and it is the world's most complete repository of uniquely identified small molecules. The CAplus file is a bibliographic database with over 12 million records published since 1967. Over 2,000 new records for published articles are currently added per day, and in 1994 about 265,000 published articles were covered in biology and biochemistry.

For protein homology searching, Blast software is widely used with databases including Swiss-Prot, EMBL, and the Protein Data Bank. A number of search services, such as the Baylor College of Medicine Search Launcher, provide Blast homology searches as well as secondary structure and solvent accessibility predictions.

MedLine represents a source of clinical and medical information. This database can be accessed via on-line services such as STN International as well as via the Internet or as a local tool.

Familial Hypercholesterolemia

Familial hypercholesterolemia is a complex, primarily genetic disease characterized by high levels of LDL cholesterol particles and limited amounts of high-density lipoprotein cholesterol (5-6). The condition is associated with the development of atherosclerotic plaques and myocardial infarction.

Apo B100 is a key protein in the LDL particle, delivering cholesterol to tissues, and it is critical to cell surface receptor binding of the particle. Uptake at the cell is mediated by its LDL receptor proteins (7-8). Aberrations in the structure of either Apo B100 or the LDL receptor proteins are associated with many of the disease conditions in the familial hypercholesterolemia syndrome.

Therapy for familial hypercholesterolemia has involved several approaches including at least the following (9-10):

  • Diet modification;
  • HMG-CoA reductase inhibitors;
  • Other cholesterol synthesis inhibitors;
  • Bile acid sequestrants/transport inhibitors;
  • Fibrates;
  • Antioxidants;
  • Niacin (apo B100 synthesis inhibitor);
  • LDL apheresis;
  • Combination approaches

Bioinformatic Approach

A basis for discovery is information integration (11). To accomplish this, the first step involved information collection, and the second step information integration and evaluation.

The approach used the protein sequences of Apo B100 and the LDL receptor proteins as the starting point for information collection (Figure 2). After sequences for these proteins were identified in the CAS Registry file, Blast searches were performed. For the homologues that were identified, the protein names were captured. This nomenclature was then combined with other terms that specified small molecule relations and was formatted into a script for uploading and searching in the CAplus file on STN. MedLine searches could also be performed. Additional information collected for later evaluation included secondary structure and solvent accessibility predictions, obtained via Internet tools, for Apo B100 or LDL receptor proteins as well as for proteins with relatively high homology scores. All of this information was then used for the integration and evaluation phase of the approach.

[Figure 2]

Figure 2
Information collection scheme



A two-step scheme was followed:

  1. Evaluate candidate homologous proteins by:

    • homology score compared to Apo B100 or LDL receptor;
    • similarity of secondary structure assignment between the query sequence and the homologue, judged by comparing overlapping ASCII designations for sheet, helix, and coil segments;
    • solvent accessibility similarity;
    • biological role related to that of Apo B100 or the LDL receptor.

  2. Select small molecules studied in relation to the homologue proteins with the greatest potential:

    • use the STN Messenger Select command to capture CAS Registry numbers studied in association with the homologue protein.

All of the above steps could be handled algorithmically except selecting proteins with biological role related to Apo B100 or the LDL receptor and making the final selection of small molecules that were directly associated with candidate homologue proteins.
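The secondary structure comparison in step 1 could be sketched as follows. This is a minimal illustration, not the procedure actually used in the study: it assumes that both predictions are ASCII strings of equal length covering the same aligned positions, and that the codes are H (helix), E (sheet), and C (coil). The function name `ss_agreement` is hypothetical.

```python
def ss_agreement(query_ss: str, subject_ss: str) -> float:
    """Return the fraction of aligned positions with the same structure code.

    Assumption: both strings are equal-length ASCII predictions over the
    aligned region, using H (helix), E (sheet), and C (coil).
    """
    if len(query_ss) != len(subject_ss):
        raise ValueError("predictions must cover the same aligned positions")
    if not query_ss:
        return 0.0
    # Count positions where the two predictions carry the same designation.
    matches = sum(q == s for q, s in zip(query_ss, subject_ss))
    return matches / len(query_ss)


# Illustrative (made-up) predictions for an eight-residue overlap.
agreement = ss_agreement("HHHHCCEE", "HHHCCCEE")
```

A score of this kind could be combined with the homology score to rank candidate homologues before the manual review of biological role.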

Results and Discussion

A search for Apo B100 and LDL receptor protein sequences in the CAS Registry file enabled identification of 57 sequences. From that group, one Apo B100 fragment sequence of 82 residues (CAS Registry Number 148882-09-1) was used for Blast searches. This sequence had been identified as an important substance in a patent (CA Abstract Number 119:67299). Names of identified homologues were stripped from the Blast output and used in a subsequent script (Figure 3). Over 4,000 CAplus records were identified using the script. For protein sequence homologues with relatively high homology scores (> 20), ASCII versions of secondary structure and solvent accessibility predictions were collected and compared.
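Stripping names from the Blast output might look like the sketch below. It assumes definition lines of the Swiss-Prot style shown in Figure 4 (a database identifier, a space, then a description that may include an EC number in parentheses). The function name and the returned dictionary layout are hypothetical, and the further shortening of names into the search terms of Figure 3 is not reproduced here.

```python
import re

# Matches an annotation such as "(EC 3.2.1.8)"; partial numbers like "3.4.11.-" also match.
EC_PATTERN = re.compile(r"\(EC\s+([\d.\-]+)\)")


def parse_definition_line(line: str) -> dict:
    """Split a Swiss-Prot-style Blast definition line into id, name, and EC number."""
    db_id, _, description = line.strip().partition(" ")
    ec_match = EC_PATTERN.search(description)
    ec_number = ec_match.group(1) if ec_match else None
    # Drop the EC annotation so that only the protein name remains.
    name = EC_PATTERN.sub("", description).strip()
    return {"id": db_id, "name": name, "ec": ec_number}


hit = parse_definition_line(
    "sp|P35811|XYNC_FIBSU ENDO-1,4-BETA-XYLANASE C PRECURSOR (EC 3.2.1.8)"
)
```

Both the name and the EC number are kept because the script in Figure 3 searches on each kind of term.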



S  3.2.1.8  and  (org/sc or synthesis or prepn)
S  3.4.11.  and  (org/sc or synthesis or prepn)

S  ALPHA-LATROTOXIN  and  (org/sc or synthesis or prepn)
S  AMINOPEPTIDASE II  and  (org/sc or synthesis or prepn)
S  APOLIPOPROTEIN B 100  and  (org/sc or synthesis or prepn)

S  DNA POLYMERASE  and  (org/sc or synthesis or prepn)
S  DNA-binding protein ci?  and  (org/sc or synthesis or prepn)
S  E2  PROTEIN  and  (org/sc or synthesis or prepn)
S  ENDO  BETA  XYLANASE  and  (org/sc or synthesis or prepn)
S  GAMMA-GLUTAMYL-PHOSPHATE REDUCTASE  and  (org/sc or synthesis or prepn)
S  GAS1  and  (org/sc or synthesis or prepn)
S  GENERAL SECRETION PATHWAY PROTEIN D PRECUSOR  and  (org/sc or synthesis or prepn)

Figure 3
Search script generated from Blast search output
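A script of the form shown in Figure 3 could be generated from a list of captured terms roughly as follows. This is a sketch under stated assumptions: the line layout copies Figure 3, while the de-duplication, the alphabetical ordering, and the names `SMALL_MOLECULE_FILTER` and `build_search_script` are illustrative choices rather than details given in the study.

```python
# Small-molecule qualifier copied from the search lines in Figure 3.
SMALL_MOLECULE_FILTER = "(org/sc or synthesis or prepn)"


def build_search_script(terms):
    """Return one search line per unique, non-empty term, sorted alphabetically."""
    unique_terms = sorted({t.strip() for t in terms if t.strip()})
    return [f"S  {term}  and  {SMALL_MOLECULE_FILTER}" for term in unique_terms]


script = build_search_script(
    ["GAS1", "3.2.1.8", "APOLIPOPROTEIN B 100", "GAS1"]
)
print("\n".join(script))
```

Removing duplicates before formatting keeps the uploaded script short when several homologues share a name or an EC number.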



The integration and evaluation scheme identified a xylanase (Figure 4) with significant homology to the query sequence (Blast score 75, probability 0.0051) and a relatively similar secondary structure. Additionally, the query sequence and the xylanase sequence overlapped in a region consistent with the enzyme's active site. The CAplus script search also retrieved information related to the enzymic activity, indicating that this xylanase has been studied in connection with the preparation of oligosaccharides, for example the xylose-containing oligosaccharide with CAS Registry Number 155957-71-4. Based on the relationships (homology, secondary structure, etc.) between the proteins, it can be hypothesized that this substance or an oligosaccharide of similar structure may interact with Apo B100 sequences. This may suggest an approach to a new lead compound that interacts with Apo B100, for therapeutic approaches such as LDL apheresis, or for a diagnostic agent.

sp|P35811|XYNC_FIBSU ENDO-1,4-BETA-XYLANASE C PRECURSOR (EC 3.2.1.8)
(XYLANASE C)
(1,4-BETA-D-XYLAN XYLANOHYDROLASE C).  >gp|U01037|FSU01037_2
endo-1,4-beta-xylanase precursor [Fibrobacter succinogenes]
Length = 608

Score = 75 (32.9 bits), Expect = 0.0052 P = 0.0051
Identities = 14/50 (28%), Positives = 28/50 (56%)

Query: 31 LTSYFSIESSTKGDVKGSVLSREYSGTIASEANTYLNSKSTRSSVKLQGT  80
          L  Y+ I+++   D+ GS +  E  GTI  +  TY+  ++TR+   ++ +    
SUBJ: 140 LVEYYVIDNTLANDMPGSWIGNERKGTITVDGGTYIVYRNTRTGPAIKNS 189

Figure 4
Xylanase homology with Apo B100 query
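The identity figure reported in Figure 4 can be checked directly from the two aligned segments. The sketch below is only an illustration of that check: it assumes an ungapped alignment of equal-length strings, and the function name `count_identities` is hypothetical. The sequences are copied from the Query and SUBJ lines of Figure 4.

```python
def count_identities(query: str, subject: str):
    """Return (identical positions, alignment length) for two aligned segments."""
    if len(query) != len(subject):
        raise ValueError("aligned segments must have the same length")
    # A position counts as identical when both residues match and neither is a gap.
    identical = sum(q == s and q != "-" for q, s in zip(query, subject))
    return identical, len(query)


# Aligned segments from Figure 4 (query residues 31-80, subject residues 140-189).
QUERY = "LTSYFSIESSTKGDVKGSVLSREYSGTIASEANTYLNSKSTRSSVKLQGT"
SUBJECT = "LVEYYVIDNTLANDMPGSWIGNERKGTITVDGGTYIVYRNTRTGPAIKNS"

identical, length = count_identities(QUERY, SUBJECT)
percent = round(100 * identical / length)
```

For these two segments the count is 14 of 50 positions (28%), which agrees with the Identities line in Figure 4. The Positives value depends on a substitution matrix and is not recomputed here.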



As an extension to this approach, other CAS databases can then be used to find possible routes of candidate oligosaccharide lead development. For instance, the CASREACT file can be used to find techniques for generating related compounds and expanding lead substance candidates.

Thus, this work demonstrates how simple integration of CAS databases with Internet tools and databases, using sequences and their name segments, can provide value for lead generation and discovery. An algorithmic approach, with some intelligent features at the researcher's desktop, could handle much of this interaction. Additionally, this work suggests that name segments from sequences identified in homology searches may be a valuable source of related biological information when used as search terms.

References



1. Williams, J., SciFinder from CAS - Information at the desk-top for scientists. Online, 1995, 19(4), 60-66.

2. Liu-Johnson, Huei-Nin; Haines, Reginald; Hackett, William, Searching for protein sequences in CAS Online. Biotech Forum Eur., 1991, 8(4), 204-9.

3. Stobaugh, Robert E., Chemical Abstracts Service Chemical Registry System. 11. Substance-related statistics: update and additions. J. Chem. Inf. Comput. Sci., 1988, 28(4), 180-7.

4. Zamora, Antonio; Dayton, David L., The Chemical Abstracts Service Chemical Registry System. V. Structure input and editing. J. Chem. Inf. Comput. Sci., 1976, 16(4), 219-22.

5. Humphries, Steve E.; Mailly, France; Gudnason, Vilmundur; Talmud, Philippa, The molecular genetics of pediatric lipid disorders: recent progress and future research directions. Pediatr. Res., 1993, 34(4), 403-15.

6. Schaefer, Ernst J.; Genest, Jacques J. Jr.; Ordovas, Jose M.; Salem, Deeb N.; Wilson, Peter W. F., Familial lipoprotein disorders and premature coronary artery disease. Atherosclerosis (Shannon, Irel.), 1994, 108(Suppl), S41-54.

7. Shachter, Neil S.; Weinberger, Judah, Mutations of the low-density-lipoprotein receptor gene and familial hypercholesterolemia. Trends Endocrinol. Metab., 1994, 5(6), 245-9.

8. Friedl, W. I., Familial defective apolipoprotein B-100. Molecular basis, prevalence, and clinical features. Klin. Wochenschr., 1991, 103(20), 621-5.

9. Larsen, Scott D.; Spilman, Charles H., New potential therapies for the treatment of atherosclerosis. Annu. Rep. Med. Chem., 1993, 28, 217-26.

10. Stein, E. A., Drug and alternative therapies for hyperlipidemia. Atherosclerosis (Shannon, Irel.), 1994, 108(Suppl), S105-16.

11. Liebman, Michael N., Distance-based approaches to protein structure-function analysis. Protein Struct. Distance Anal., pp 287-301. Editor(s): Bohr, Henrik; Brunak, Soeren. IOS Press: Amsterdam, Neth., 1994.



NetSci, ISSN 1092-7360, is published by Network Science Corporation. Except where expressly stated, content at this site is copyright (© 1995 - 2010) by Network Science Corporation and is for your personal use only. No redistribution is allowed without written permission from Network Science Corporation.