Daylight CGI interface (DCGI)
Presented at the Daylight Users Group Meeting, 2/96

David Weininger

Daylight Chemical Information Systems, Inc.

http://www.netsci.org/Science/Cheminform/feature08.html

Over the last few years, the World Wide Web has revolutionized the way information is exchanged. Starting from humble beginnings (distribution of static documents), it has grown to embody most computational user interface functions. One can easily argue that the Web has fundamentally transformed the ideas of "publishing", "computer", "database" and "user interface".

One of the most powerful Web concepts is the Common Gateway Interface (CGI), a mechanism to deliver computational results and a meta-GUI to an end user over the net. Given CGI-capable servers and clients, the "Web" becomes what we used to think of as a "computer" and computers become more like abstract "computational resources".

The Daylight Common Gateway Interface (DCGI) is a chemical information interface which operates using Web technology. Such technology can be used on an isolated machine, in a secure network, or in the global WWW environment. The Web provides a great framework and many general purpose utilities but was not designed to deliver chemical information interactively and is deficient in this respect. DCGI provides tools, utilities, and meta-interfaces required to successfully deliver chemical information in the Web environment.

A brief history of chemical information interface

Single-machine architecture

Database interfaces were terminal-based programs which directly accessed databases which were simple files. Data integrity was dependent on the quality of interface programs. Access to data was limited by the need to connect directly to the computer on which the data was stored. User interfaces were extremely machine-specific, e.g., only terminals for which the system was developed worked reliably. Such database systems are expensive to build, debug, maintain, improve, and purchase.

Client-server architecture

Databases are accessed by a single "server" process which provides all services required by one or more "client" processes via a network. TCP/IP becomes the de facto network standard. Client-server systems are able to provide high data integrity, concurrency, access control and high-volume data delivery. Database application programs become distinct from database systems, e.g., they could operate on completely different platforms.

In theory, user-interface programs should have become less precious since maintaining data integrity is now a server responsibility. In fact, this simplification was more than offset by the added complexity required to create a reliable, event-driven GUI. Such systems remained expensive largely due to the added effort required to build complex client interfaces.

World Wide Web architecture

The need for information delivery in other fields leads to the widespread availability of very capable, high-quality information browsers. HTML (Hypertext Markup Language) servers, HTML browsers, IP (Internet Protocol) and the Internet itself combine to form a de facto standard for information exchange. Because HTML servers now provide a meta-GUI as well as data, the whole job of data delivery can be done with servers.

This architecture should result in major improvements in the availability and cost of chemical information access. Since the entire information delivery interface is provided "free" to the developer, a system based on a "zero-cost seat" should be possible. Remaining issues include access control, high-volume data delivery and handling the special requirements of chemical information.

Key features

  • HTML browsers are used (E.g., Netscape, MacWeb, HotJava, Mosaic, etc.)
  • HTTP servers are used (E.g., from NCSA, CERN, Netscape, Sun, etc.)
  • Existing databases and database servers are used (E.g., thorserver, merlinserver, spresi95, wdi95, etc.)
  • Published URLs: Each DCGI release comes with a list of "published" URLs. Published URLs are supported, i.e., they're documented and you can count on them being there and working in future releases. The /dayhtml/published.html is an example.

  • All documentation is delivered via HTML: This includes all man(1) pages (with hyperlinked see-also's), user, administrator, theory and programmer manuals. Hard copy documentation is still available, but why lug it around?

  • Chemical information processing delivered in CGI programs - Such CGI programs are normal Daylight Toolkit programs which exchange information in HTML and other WWW formats.

  • Designed to work with public domain client-side chemical tools (e.g., the word GLYMIDINE is linked to a DCGI program which samples a conformation of the structure (using rubicon) and downloads it to rasmol via the HTML browser).

  • Daylight CGI tools are available - As always, the tools we use at Daylight are available as supported products. The DCGI Toolkit provides tools which allow you to add chemical information to your own HTML files and to create custom HTML interfaces.

Advantages of the DCGI approach

  • Highest quality chemical information interface to date

    Compared to any special-purpose chemical information end-user interface, HTML browsers are amazingly capable, reliable and inexpensive. There is no mystery to this. A program such as Netscape is beta tested by literally millions of users and its development cost is spread out widely. No chemical information interface will ever have this advantage -- there just aren't that many people interested in chemistry. (The same is true for HTTP servers such as httpd.)

  • Makes use of freely available chemical software

    There is a growing niche for public chemical information software. Much of this is oriented to format conversion and interoperability, but some chemical freeware has established itself as a standard in sophisticated areas (e.g. rasmol). The DCGI approach is more of an "organization between collaborating programs" than a "monolithic single-vendor system" and it makes it very easy to take advantage of such software.

  • Increase user acceptance, reduce user training

    There is something magical about HTML browsers. Despite the fact that they aren't actually easier to use or set up than conventional GUI interfaces, an enormous number of people have embraced them. The fact that they are freely available and generally useful seems to empower otherwise "computer-timid" users. The same user who is unwilling to spend 10 minutes reading a manual for a special purpose interface will gladly spend an hour setting up an HTML helper program on their personal computer. The reason, of course, is that the perceived value of an HTML browser is extremely high, since it opens up a whole world of information exchange. In any event, this phenomenon is wonderful from the user training point of view.

  • Works with all your existing hardware

    Hardware needed to implement the DCGI interface system includes a capable Unix box to host the servers, user workstations or personal computers which run Netscape, and an IP network to tie them together. In most cases, all the hardware needed is already in place.

  • Zero-cost seat

    Chemical companies in general do not provide adequate chemical information access to most of their employees who need such information. On one extreme, database managers and drug designers get excellent access. On the other extreme, temporary employees in the stockroom typically don't get any (although they probably have an equally valid need). The reality is that providing enough information access to optimize the productivity of each employee is very expensive due to the per-seat cost. It's simple math, e.g., $5000 x 1000 users is $5 million per year.

    Any time a chemical information systems vendor produces a commercial program to be run by an end-user, the development must be paid for and the product must be supported. All extant chemical information systems work this way and all of them are expensive. By using HTML browsers, however, the chemical information interfaces can be implemented on a server and the theoretical need for a per-seat cost disappears.

  • Deliver information locally, site-wide, world-wide

    The Web technology underlying the DCGI system fundamentally evolved to provide widely-distributed information exchange. By design, the resultant DCGI system scales up smoothly from a single isolated system with a single user to a global information service with 1000's of users.

  • Well-tested security

    The same arguments that apply to interface servers and clients apply to secure HTML servers (SHTML). No security system can claim absolute perfection, but it's reassuring to use a system which is under constant use and attack by a lot of clever people. To date, no chemical information system has been subject to such stringent testing.

  • Dramatically simplify custom interface development

    The "collaborating programs" approach used by DCGI simplifies prototyping and developing chemical information interfaces (compared to using other GUIs). The DCGI system allows most task-oriented programs to reference each other, so such interfaces can be used as tools for other "programs" without relinking or recompiling.

    However, the rest of the task in producing a production quality program is not equally simplified. DCGI doesn't dramatically reduce the effort needed to define a problem, establish an acceptable interface, document it, and respond to conflicting user requirements.

  • Overall system is stable, supportable, and extensible

    Most of the advantages of DCGI derive from the widespread use and development of the HTML protocol. Such an enormous amount of information is being delivered via the World Wide Web that we are assured of the survival of this technology for many years. Our current interfaces (e.g., xvthor, xvmerlin) have grown so complex that extending them has become a significant effort. The flexibilities inherent in the DCGI system are barely tapped.

Disadvantages of the DCGI approach

  • It's new

    Computational methods used in chemical information systems have historically lagged behind those used in mainstream computer science by about a decade. With DCGI, we're closing the technology gap to 2-3 years. This may make some people nervous, and for some good reasons. For instance, system administrators will need to learn new skills and to manage httpd servers and internal web networks. The newness of this system is ameliorated by its omnipresence (there are a lot of people in the same boat).

  • Real HTML security is good for the few, awkward for the many

    Secure HTML (SHTML) provides point-to-point security in the form of encrypted authorization and communication. This all-or-nothing approach to security was historically used in heavy-handed government and military systems while most chemical information systems have used lightweight authorization-only schemes. Although it's nice to provide real security, the result might not be entirely welcome. E.g., authorization passwords would need to be maintained for all users of a secure system, even those with access only to "internal public" data.

  • SMILES exposure

    Like all Web users, DCGI users are expected to understand something about the role of various information formats. For instance, Web users need to know basically what a GIF file is used for. A similar case in the DCGI system would be the SMILES format. Just as there are GIF creators and viewers, there are SMILES editors and viewers. In either case, users need to know when it's appropriate to use such formats and programs. The underlying Web methodology makes it difficult to protect users from such issues.

  • Not as flexible as WWW

    To achieve a supportable system in the long run, DCGI is very restrictive compared to the anarchy of the WWW. It's not OK to make arbitrary HTML references to anything in the system. DCGI is set up as a series of autonomous packages which can operate however they like internally but can only reference external entry points (URLs) which are "published". There is no question that DCGI needs such restrictions for stability and supportability. In doing so, it gives up some of the good aspects of the "footloose" character of the Web.

  • Increased computational overhead

    DCGI uses more computational overhead and network bandwidth than a properly installed X-based system. In fact, almost none of our customers use the Daylight X-system in the way that it was designed, so this might be a moot point.

DCGI Architecture

  • Thor and Merlin servers are used for database access

    Existing 4.x databases and database servers are unchanged. All database operations are implemented as network transactions. Databases and database servers communicate only via the network (using IP) and form a well-defined package that can be modified without affecting system operation.

  • Database functionality is encapsulated in DCGI programs

    DCGI programs communicate via the network: on one side they access databases (via Daylight Thor and Merlin Toolkits), on the other they talk to the HTTP server. Meta-interfaces are implemented by such CGI programs. Note that use of the Toolkits has been moved from the client side to server side in this architecture.

  • An HTTP server is used to deliver the user interface

    All traffic control is managed by an HTTP server, including access control, delivery of the meta-interface and transaction security. The system is designed to run on "vanilla" public domain servers (e.g., NCSA's httpd).

  • HTML browsers

    The end-user interface is an HTML browser such as Mosaic, Netscape, MacWeb, etc. The design user-interface protocol is HTML-3.x (HTML-2.0 with tables). The browser used for initial development is Netscape 1.1N because it is very capable, reasonably reliable, and widely available.

  • Helper programs

    Client-side helper programs are an important part of the system. Mainstream Web helper programs will be used whenever possible (e.g., GIF, JPEG, MPEG viewers). Helper programs specific to chemical information will also be used, either via specific languages (e.g., SMILES from molecular editors) or via a general chemical object-exchange mechanism (e.g., CEX to rasmol or other modeling packages).
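
To make the division of labor above concrete, here is a minimal sketch of the CGI side of such a system, written in Python purely for illustration (the DCGI programs themselves are Daylight Toolkit programs, not Python). It relies only on the general CGI convention that the HTTP server hands the query to the program in the QUERY_STRING environment variable and relays whatever the program writes to standard output. The parameter name "id" and the page text are assumptions invented for this example, and the point where a real program would call the Thor or Merlin Toolkits is only marked by a comment.

```python
import html
import os
from urllib.parse import parse_qs


def render_page(query_string: str) -> str:
    """Build a complete CGI response: header, blank line, then HTML body."""
    params = parse_qs(query_string)
    # "id" is a hypothetical parameter name used only in this sketch.
    identifier = params.get("id", [""])[0]
    # A real DCGI program would query a Thor or Merlin server at this point.
    body = (
        "<HTML><HEAD><TITLE>Lookup</TITLE></HEAD><BODY>\n"
        f"<P>You asked for: {html.escape(identifier)}</P>\n"
        "</BODY></HTML>\n"
    )
    return "Content-Type: text/html\r\n\r\n" + body


if __name__ == "__main__":
    # The HTTP server puts the part of the URL after "?" in QUERY_STRING.
    print(render_page(os.environ.get("QUERY_STRING", "")), end="")
```

Because the program only reads an environment variable and writes text, the browser, the HTTP server and the database servers can each be replaced without touching it, which is the property the architecture above depends on.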

Secure DCGI system

wdi -- a minimalistic interface

The wdi HTML interface is a special-purpose interface to Derwent's World Drug Index. It is designed to deliver essential information quickly and reliably to users who do not need to be chemical information specialists.
  • "What do you want? Here it is." interface

    Given a suitable task, HTML is an excellent protocol for writing "zerofaces" (minimalistic user interfaces). For the World Drug Index, the basic task is delivering pharmaceutical information given one of a large number of identifiers (structure, preferred name, trade name, INN, USAN, CAS number, manufacturer's id, etc.)

    The wdi HTML interface has only one entry field in which a user can enter any identifier. When SUBMIT or RETURN is selected, all data corresponding to that identifier is produced.

    The user does not need to know anything about the way the data is stored or retrieved, e.g., the host name, service, database name or version, what kind of identifier is entered, whether the structure is known, etc.

    For instance, here are the wdi entries for verapamil, No-Doz, RU-486, pectin and CC(=O)NO.

  • Appears to be a dynamic single page

    Visual context switching can be confusing. In the wdi interface this is kept to an absolute minimum by making the query and data pages appear to be the same "page".

  • Ambiguous references are handled within the page

    If an identifier is ambiguous, a graphical index is provided with links to the appropriate page section. For instance: dristan and udolac.

  • SMILES are visible only if needed

    Since WDI is used by non-chemists, "SMILES exposure" is kept to a minimum. SMILES are only used for structural entry. On the other hand, chemists often use WDI to find the structure given a trade name in which case having the SMILES available is important. This is handled by linking the depictions to a SMILES entry page.

  • Pages may be bookmarked

    Pages with data are entitled "WDI: identifier" and may be saved as a dynamic bookmark. When this bookmark is invoked (or page is reloaded), the page will reflect the current state of the database.

    This requires a bit of invisible trickiness since one cannot normally save the results of a FORM entry as a bookmark. The "trick" is done by using a FORM-handling CGI program wdiform which causes the httpd server to instruct the browser to re-invoke the original wdi CGI script with an argument containing new instructions (in this case, the hex-encoded identifier). Believe it or not, these kinds of twisted relationships are normal on the Web.

  • Pages may be referenced

    The "bookmarkability" trick establishes an absolute URL address for every possible wdi page. Given that, pages can be referenced in a normal HTML file, e.g.

    <A HREF="/daycgi/wdi?4449415a4550414d">DIAZEPAM</A>
    --------  ------ --- ----------------  -------- ---
    reference  alias  |  "DIAZEPAM" (hex)  hot-text end
                     CGI program name
    
    Appears in an HTML page as: DIAZEPAM
  • Pages may be printed

    One nice feature of HTML browsers is that you can print anything you see. Color depictions of structures are also nice, but unfortunately don't print well on most monochrome laser printers. To resolve this, a hot-link to a "black on paper" version appears at the top of each page. (This bit of HTML magic causes black-on-white GIF-89a images to be produced with a transparent background.)
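
The hex-encoded argument used for bookmarking and referencing wdi pages can be sketched in a few lines of Python. The encoding itself (two hex digits per character, as in the DIAZEPAM reference above) and the /daycgi/wdi path come from the text; the function names, the ASCII-only assumption and the server name are hypothetical. The text does not say how wdiform makes the browser re-invoke wdi; a CGI "Location:" header carrying a full URL is one conventional way to do it and is what the sketch assumes.

```python
def hex_encode(identifier: str) -> str:
    """Encode an identifier as lowercase hex, two digits per character."""
    # Assumption: identifiers are plain ASCII (names, SMILES, CAS numbers).
    return identifier.encode("ascii").hex()


def hex_decode(argument: str) -> str:
    """Recover the identifier from its hex-encoded form."""
    return bytes.fromhex(argument).decode("ascii")


def wdi_url(identifier: str) -> str:
    """Build the bookmarkable address of the wdi page for an identifier."""
    return "/daycgi/wdi?" + hex_encode(identifier)


def redirect_response(server: str, identifier: str) -> str:
    """CGI output a form handler could emit to send the browser to that page.

    A full URL is used so that the browser itself ends up at the new
    address, which is what makes the resulting page bookmarkable.
    """
    return "Location: " + server + wdi_url(identifier) + "\r\n\r\n"
```

For example, wdi_url("DIAZEPAM") gives /daycgi/wdi?4449415a4550414d, the address used in the reference above. Hex-encoding also keeps characters that are common in SMILES, such as "=", "#" and parentheses, out of the URL.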

acd -- a special-purpose interface

The acd HTML interface is a special-purpose interface to MDL's Available Chemical Directory. This interface is designed for ACD users who purchase chemicals (who are typically not chemical information specialists).
  • Simple (but not minimalistic) interface is used

    Like the interface to wdi, the acd interface implements a single-line query which can be any identifier (SMILES, CAS numbers, ACD names, ACD numbers, catalog names or catalog numbers). In ACD, the most common ID used is the catalog number, which can be extremely ambiguous (e.g., "11"). Delivering all possible data associated with an identifier would sometimes produce an unreasonable amount of data (e.g., supplier information for all sulfuric acid products).

    A three-page interface is therefore used:

    1. Query/lookup -- user specifies ID, resolves ambiguities
    2. Product table -- a table of products is presented
    3. Supplier information -- supplier coden links
  • Information is delivered in a table

    The core of the interface is a table containing all product entries sorted by specific price. If the specified structure is ambiguous, an index is provided to multiple tables on the page (e.g. for various isotopes). Supplier codens are hot-linked to ordering information (e.g., supplier name, address, phone number, etc.) Product table pages may be bookmarked, referenced, and printed, e.g., 2-AMINOHEPTANE.

savant -- a special-purpose interface

Savant is a special-purpose interface to the Spresi database which is designed to provide chemists with a synthetic literature survey as painlessly as possible. Given a structure, the user is presented with references to papers which describe the synthesis of the given structure (if available) and similar structures.

  • A three page interface is used
    1. Query -- user specifies structure of interest
    2. Search results -- user selects from list of similar structures
    3. Reference list -- a list of synthetic paper references
  • The initial page offers options

    To increase user efficiency, the query page offers two options with the structure specification.

    The first option is "language orientation". Spresi is a truly international database. This option limits the output to include only references to journals published in desired languages.

    The second option controls the number of similar structures retrieved. It is set to 10 by default but may be decreased to 1 (exact match or most similar structure only) or increased to as much as 100.

  • Savant pages may be bookmarked, referenced, and printed

    Referencing may be done at the query level, e.g., Savant query for diphenyl sulfide, or at the report level, e.g., Papers on synthesis of 3-iodo-toluene.

  • Savant is indirectly recursive

    The report page allows the user to automatically re-enter savant with a structure retrieved by the last search.

  • Savant works with all or part of Spresi

    Savant looks for journal article (JA) and patent (PAT) data which contain the keyword preparation. The spresi95preps database is a subset of spresi95 which only contains structures with journal articles which have this keyword and therefore works just as well as the whole database for this purpose, while saving the server a lot of space (200 vs 500 MB). Savant also works with spresi95demo, but not very well, since it only contains 1410 structures.

    It would be reasonable to create Savant-like interfaces for other purposes, e.g., for analytical papers, patents, etc.

hyperthor -- a general-purpose database browser

Hyperthor is a general-purpose HTML browser for Thor databases.

  • Designed for the database-oriented user

    Hyperthor allows you to access Thor data by specifying the host, service, database, datatype, and identifier. The fact that the user needs to know the names of these things to specify them is a certain disadvantage. Advantages include generality (works with all Thor databases), flexibility (for the user), and efficiency (for the computer). Default values can be set to minimize how much a user needs to know to get in to the system.

  • Displays hyperlinked datatrees and tables

    Hyperthor's datatree display should be clear to users of Daylight's 4.x systems. If enabled, 2-D data will be drawn as stored in a database, 3-D data can be downloaded to helper programs, and data survey tables are offered when more than one dataitem exists for a given datatype. Server-side behavior is controlled by options in the DCGI environment, e.g., one can control which types of data are displayed and how. Client-side behavior is controlled by browser options (e.g., .mailcap and .mime types control which helper programs get invoked). One could set up a default CGI shell script for each database (e.g., hyperacd, hyperwdi).

  • To be done before production release

    The version of Hyperthor shown at Euromug(s) is a beta-level program. The main issue currently outstanding is how to bookmark and reference hyperthor pages while allowing large amounts of data to be processed (e.g., the current version does not limit data size). Also needed are datatype-specific options which control helper program invocation from the server side (via mime typing).

wizard -- a general-purpose EDA interface

Wizard is a general HTML interface to Merlin, an exploratory data analysis tool. The wizard interface is quite different from others described here: it is basically a programming tool for MCL (Merlin Control Language).
  • Wizard is based on MCL (Merlin Control Language)

    MCL is an English-language interface to the Merlin search engine. The 4.42 version of MCL is nearly unchanged from previous releases except that it will optionally write output in HTML (mcl -h). The wizard user generates an MCL program with an HTML-GUI which is then run with HTML output. If the display is done with tables, the output looks a lot like something you might see in xvmerlin.

  • Wizard specification is done with a recursive CGI

    The Wizard program specification page allows the user to write an MCL program using an HTML graphical interface (menus, text fields, etc.) The interface is very powerful although somewhat rough looking.

  • To be done before production release

    The version of Wizard shown at Euromug(s) is an alpha-level program. The main issue currently outstanding is how to bookmark and reference wizard programs without creating a maintenance headache.

DCGI "Toolkit"

  • Tools to support custom HTML interfaces

    In putting together the Daylight HTML interfaces, we needed to develop quite a few CGI-specific widgets, gadgets and dohickies. As with virtually all other tools that we use internally at Daylight, we intend to offer these as a supported product, the DCGI toolkit. Unlike the oop-ish toolkits that we currently offer, the DCGI toolkit will be a mixture of C-object libraries, scripts, and programs.

  • Context management

    The biggest problem in building all but the simplest WWW interface is that it is based on a stateless model (no persistent client context). The DCGI system provides a reliable mechanism for managing a client context based on hidden FORM entries.

  • smi2gif, cxt2depict and other 2-D delights

    smi2gif is a CGI program which accepts a SMILES as a hex-encoded argument, generates a structural depiction, and produces a GIF file suitable for HTML display. Control of color mode and image size are provided. cxt2depict does the same job, but operates on context variables which makes it more powerful than smi2gif: it handles input of unlimited length and will use specified coordinates (if provided).

  • Legal GIF generation

    A persistent irritation when working with GIFs is that the algorithm which is nearly universally used for image compression is patented by Unisys. Everybody seems to use it anyway and Unisys doesn't seem to enforce their patent rights ... but ignoring it isn't a good solution for commercial software. The DCGI system includes a novel GIF generator which operates using a different compression algorithm (Sayles/Knuth). This method is only marginally slower than the patented algorithm and produces identical GIF files. A graphics library and graphics-object-to-GIF functions are provided with the DCGI system.

  • coord2cex and other 3-D delights

    Download a conformation from your program to an HTML browser as a CEX object for use with rasmol and other modeling programs. Can operate either from a Daylight conformation object or SMILES and (X,Y,Z) coordinates.

  • Miscellaneous tools

    Producing a reliable HTML interface involves keeping track of a large number of details. The DCGI system provides a number of utilities to solve such problems (or at least make them manageable). These include: an error-handling mechanism, functions which allow arbitrary text to be written in HTML safely, hex-encoding deduction and conversion, scratch file management, and others.
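
Two of the ideas in this list, carrying a client context in hidden FORM entries and writing arbitrary text into HTML safely, can be illustrated together. The Python sketch below is not the DCGI toolkit (which is described above as C-object libraries, scripts and programs); the function names and the context variables are assumptions made up for the example. It shows only the general technique: each context variable is written into the page as a hidden input with its value escaped, and the context is rebuilt from the form data the browser sends back on the next request.

```python
import html
from urllib.parse import parse_qs, urlencode


def hidden_fields(context: dict) -> str:
    """Render a client context as hidden FORM entries, one per variable."""
    lines = []
    for name, value in context.items():
        # Escaping keeps characters such as <, & and quotes from breaking
        # the page, whatever text the context happens to contain.
        lines.append(
            '<INPUT TYPE="hidden" NAME="{}" VALUE="{}">'.format(
                html.escape(name, quote=True), html.escape(value, quote=True)
            )
        )
    return "\n".join(lines)


def read_context(form_data: str) -> dict:
    """Rebuild the context from URL-encoded form data sent by the browser."""
    parsed = parse_qs(form_data, keep_blank_values=True)
    return {name: values[0] for name, values in parsed.items()}


if __name__ == "__main__":
    # Hypothetical context: the variable names are invented for this sketch.
    context = {"database": "wdi95", "smiles": "CC(=O)NO"}
    print(hidden_fields(context))
    # A browser would send the hidden fields back URL-encoded, like this:
    print(read_context(urlencode(context)))
```

Since the whole context travels with every page, the server keeps no per-user state between requests, which is how a stateless protocol can still support a multi-page interface.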

Grins in v4.5 and beyond

The version of Grins available in MUG '96 is the 4.42 version, suitable for input of generic organic structures only. There is only one control panel in this version:

The initial version of the 4.5x Grins prototype is very similar:

Additional capabilities (chirality and reaction specification) are provided by a second panel which is available by selecting the "More" control:

If reaction specification is prohibited the panel looks like this:

(Are the functions of the smiley faces clear? Can anyone suggest better icons?)

It is tempting to write a version of Grins in Java. Maybe we'll try it out and see. In any event, the basic HTML-Grins capabilities will be available in DCGI.

Outstanding issues

  • Distribution & installation

    Installing a Daylight HTML-based chemical information system is intrinsically more complicated than systems based on dedicated clients and servers. The main reason is that instead of one place for installation to take place, there are three: the Daylight software, the HTTP server, and the HTML clients. Fortunately, most of our customers are already running internal HTTP servers and HTML clients, so it shouldn't be too much of a hassle. In any event we will need to develop an installation scheme which smooths out this process.

  • Security

    As described above, it is possible to operate a secure DCGI system by using secure point-to-point protocols, i.e., encrypting all communication between the HTTP server and the HTML client. While secure, the current methods of achieving this require an "all-or-nothing" approach to security. Is this acceptable?

  • Licensing and pricing

    The idea of a server-based chemical information system with zero-cost seats is a new one in many respects. Extant methods used for licensing software (number of seats, number of users, number of concurrent users) just don't reflect the reality of what's going on.

    Our inclination is to simplify our licensing scheme in a way which is consistent with what's actually happening. We are proposing to consolidate the ever-increasing number of product components into a few functionally-defined packages and license them in "small", "medium", and "unlimited" versions based on annual usage. For instance, a "database server" package would include all servers and supporting software; average lookup usage limits might be 200/day (small) and 2000/day medium; a log would be kept for accounting purposes but no usage restrictions would be enforced (except an annual review of usage when renewing the license). You only pay for what we provide: internal database services. There would be no limits on the number of users (what's that to us?) the number of seats (now equipped with non-Daylight software) or CPU usage per se (after all, it's your CPU!)

  • Transitioning to DCGI

    Given that the DCGI system uses existing Thor databases and servers, and that capable HTML components are readily available, the technical transition should be relatively simple. The main challenges are establishing reliable installation and maintenance protocols.

    Physical transition is simplified by the fact that end-users can use any machine that runs Netscape. The same is true on the server side: any machine that can run the Daylight servers can run an httpd server. The only glitch would be if a company didn't have an IP network installed (almost unheard of these days).

    We expect that user training will be equally straightforward. It seems like everybody and their hairdresser knows how to run Netscape and surf the Web these days. Such skills apply directly to DCGI.

    The biggest change will be for people writing custom programs. Although writing HTML interfaces is much simpler than doing so with other GUIs, it's done very differently and has a steep learning curve.

    Finally, if we do take the plunge and change our licensing scheme, we will be committed to making the change as smooth and as fair to our customers as possible.
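
The usage-based tiers proposed under "Licensing and pricing" above amount to a simple calculation on the accounting log. The Python sketch below assumes the log has already been reduced to a list of lookups per day, and uses the example limits from the text (an average of 200 lookups/day for "small", 2000/day for "medium", anything above that "unlimited"). The function name and the treatment of boundary values are assumptions; the proposal itself does not define them.

```python
def license_tier(daily_lookups: list) -> str:
    """Pick the smallest tier whose average daily limit covers the usage."""
    if not daily_lookups:
        return "small"  # no recorded usage at all
    average = sum(daily_lookups) / len(daily_lookups)
    if average <= 200:
        return "small"
    if average <= 2000:
        return "medium"
    return "unlimited"
```

For example, a site averaging 1250 lookups per day would fall in the "medium" tier, however many users or seats generated those lookups.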




NetSci, ISSN 1092-7360, is published by Network Science Corporation. Except where expressly stated, content at this site is copyright (© 1995 - 2010) by Network Science Corporation and is for your personal use only. No redistribution is allowed without written permission from Network Science Corporation.