Chemical Publishing via the Internet


Benjamin J Whitaker[a] & Henry S Rzepa[b]

[a] School of Chemistry, University of Leeds,

Leeds LS2 9JT, UK: benw@chem.leeds.ac.uk

[b] Department of Chemistry, Imperial College,

London SW7 2AY, UK: rzepa@ic.ac.uk

Abstract:

Information dissemination in the molecular sciences raises a number of fundamental issues involving complex image and data types concomitantly with notational, and often mathematically oriented text. International standards are beginning to emerge which notionally address these issues, but which also posit novel mechanisms for scientific discourse which have no paper equivalents. Here we discuss our work on two developments which may be used to incorporate interactive three-dimensional representations of molecules into electronic documents; extensions to the MIME protocol and a 3D model description language (VRML). We argue that numerous information systems can be built on this technology including multimedia database search and retrieval engines, subject specific conferencing tools, and multimedia publishing. Progress in the development of a prototype electronic chemistry journal is also described. In addition to demonstrating an electronic document delivery mechanism, the aim of this experiment is to explore how interactive documents, in which structured data are embedded in the document in such a way as to allow the reader to control the presentation, will change, at a fundamental level, the way in which scientific information is disseminated and digested. This paper discusses issues such as strategies for document markup and the development of authoring tools, long term archival standards and data security, and molecular indexing and searching. We also describe methods for integrating electronic delivery mechanisms with other scientific activities, such as conferences and workshops, and teaching and training.

Introduction:

Developments in spectroscopy, structural analysis and computational methods have enabled the molecular scientist to study larger and larger molecules and ever more subtle interactions. In the process the visual and numerical complexity of the molecular information available has increased enormously. Such progress has not, however, been matched by associated enhancements in the methods by which such information is disseminated. Although the results of a research programme in, say, structural biology may frequently be made available through molecular co-ordinate and connectivity information repositories such as the Cambridge Structural Database [1] or the Protein Data Bank [2], the primary mode of information dissemination remains the research paper. There are a number of problems with this albeit well established technology. In the first place, complex notational conventions have had to be developed in order to describe intrinsically three dimensional structures on a two dimensional surface. Most people trained in the molecular sciences will, for example, be familiar with the Fischer projection formula for representing the structures of chiral compounds. However, this and other methods (e.g. Haworth and conformational representations), although useful, can be difficult to interpret, particularly for a non-specialist chemist, and become clumsy as the size of the molecule increases. In the second place, paper is essentially a narrative medium; things are best structured to follow one from another lest the reader be forever losing his or her place in the text. Whilst this is not a problem for something like a research paper, where the discipline is positively beneficial, it can be far from ideal for other sorts of text such as a thesaurus, and even in the dissemination of scientific labour it can be limiting. For example, by far the greater bulk of what is known as `supplemental data' is never made available to the scientific community at large because of the limited space available in journals and presentational difficulties.

The recent development of the World-Wide Web[3] (WWW), a server-client information exchange system based on the global Internet network, offers an interesting solution to at least some of these problems [4]. In this paper, we describe the implementation of a chemical structure markup language (CSML) which can provide on-screen annotation of 3D molecular diagrams embedded into electronically delivered text, and demonstrate how hyperlinks between information servers can be used to provide access to an author's supplemental data. We then discuss some of the implications of `moving beyond paper' with respect to the future of scientific discourse, and report on recent progress towards implementing these ideas in a functioning electronic journal, and discuss some of the issues, such as text conversion tools, involved in producing such a journal. Finally we describe our experiences with electronic conferences.

Beyond Conventional Text:

The WWW uses a communication protocol known as the hypertext transfer protocol (HTTP) to provide a client-server model of information exchange in which data can be distributed over a number of servers and retrieved through hypertextual links called uniform resource locators (URLs). The URLs are embedded into WWW documents which are themselves written in a text and graphical markup language known as hypertext markup language (HTML). The HTML specification defines structural features, such as paragraphs, lists and tables, within a document using a subset of the standard generalized markup language (SGML) definition (ISO 8879) [5]. It is the task of HTTP to retrieve document fragments from remote servers and of an HTML aware client to assemble and display the contents in an appropriate and consistent manner on the computer screen. The client program, or "browser", interprets the markup and displays it. Most browsers understand how to render text and some well defined image formats, such as GIF and JPEG, but other file formats, such as MPEG for digital video, are usually rendered using an external application program. The application is generally configurable within the browser through the use of multipurpose internet mail extension (MIME) types. MIME was originally conceived as a method for specifying the format of Internet message bodies (electronic mail) so that binary files could be attached to messages and posted through mail gateways without breaking them [6]. The configuration file which maps the known MIME types to an external viewer is generally known as a mailcap file. Using this mechanism a molecule referred to in a document can be associated with a set of molecular co-ordinates. Clearly the co-ordinate data themselves need not be stored on the same machine as the document, since the remote information is associated with a URL that enables it to be located across the Internet. When activated, the hyperlink transfers the data pointed to by the URL to the calling browser, where it can be passed to an application program for rendering.
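The browser-side lookup described above can be sketched in a few lines of Python. This is only an illustration: the mailcap entries, the viewer commands and the function names are our assumptions for the example, and the parser ignores refinements of the real mailcap format such as escaped semicolons and optional flags.

```python
import shlex

# Illustrative mailcap-style text; the entries below are assumptions made
# for this sketch, not a recommended configuration.
MAILCAP_TEXT = """\
# type/subtype; command template (%s stands for the downloaded file)
video/mpeg; mpeg_play %s
chemical/x-pdb; rasmol %s
"""


def parse_mailcap(text):
    """Build a {mime_type: command_template} table from mailcap-style text."""
    table = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = [field.strip() for field in line.split(";")]
        if len(fields) >= 2 and fields[1]:
            table[fields[0].lower()] = fields[1]
    return table


def viewer_command(table, mime_type, path):
    """Return the argument list for the external viewer, or None if unmapped."""
    template = table.get(mime_type.strip().lower())
    if template is None:
        return None  # a browser would typically offer to save the file instead
    return [path if token == "%s" else token for token in shlex.split(template)]


if __name__ == "__main__":
    table = parse_mailcap(MAILCAP_TEXT)
    print(viewer_command(table, "chemical/x-pdb", "/tmp/dna.pdb"))
```

The lookup is keyed on the MIME type alone, so the reader remains free to substitute any viewer by editing one line of the table.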

As currently defined by Internet RFC 1521 [6], every MIME type belongs to one of seven primary types: text, application, image, video, audio, message or multipart. These are qualified with secondary types. Thus a MIME specification of video/mpeg indicates a digital video data stream encoded according to the MPEG file format. Given this information the browser can launch an application program (previously defined in the mailcap configuration file) with which to render the data stream on the screen. It is important to understand that RFC 1521 strictly defines MIME as a two level hierarchy consisting of primary and secondary types only. The variety of digital chemical information therefore led us to propose, in an Internet draft [7], that a new primary MIME type, to be called chemical, be defined along with a variety of secondary types. These proposals have yet to be ratified by the Internet Engineering Task Force (IETF) and should be considered experimental. That being said, they have been widely adopted by the community, most notably by the National Institutes of Health "Molecules-R-Us" protein databank [8].
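The two level structure can be made concrete with a short sketch that splits a MIME type into its primary and secondary parts and reports whether the primary type is one of the seven listed above or the proposed chemical type. The function names and the status labels are hypothetical and are chosen only for this example.

```python
# The seven primary types named in the text, plus the proposed addition.
RFC1521_PRIMARY_TYPES = {
    "text", "application", "image", "video", "audio", "message", "multipart",
}
PROPOSED_PRIMARY_TYPES = {"chemical"}


def split_mime_type(mime_type):
    """Split 'primary/secondary' into its two parts, rejecting anything else."""
    primary, sep, secondary = mime_type.strip().lower().partition("/")
    if not sep or not primary or not secondary or "/" in secondary:
        raise ValueError(f"not a two level MIME type: {mime_type!r}")
    return primary, secondary


def classify(mime_type):
    """Return (primary, secondary, status, experimental) for a MIME type."""
    primary, secondary = split_mime_type(mime_type)
    if primary in RFC1521_PRIMARY_TYPES:
        status = "standard"
    elif primary in PROPOSED_PRIMARY_TYPES:
        status = "proposed"
    else:
        status = "unknown"
    # Secondary types beginning with "x-" are experimental.
    return primary, secondary, status, secondary.startswith("x-")


if __name__ == "__main__":
    for example in ("video/mpeg", "chemical/x-pdb"):
        print(example, classify(example))
```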

In previous publications[4] [9] we have described a number of applications of chemical MIME types. Examples may be found on the WWW at the URLs http://www.chem.leeds.ac.uk/Project/MIME.html and http://www.ch.ic.uk/chemical_mime.html. The molecular information is physically stored on these servers and the MIME type identified in the server configuration file from the filename tail (extension). At the reader's end the browser is configured to recognise the MIME type (not the filename) and map the encapsulated MIME data to an external application. Note that the choice of the application program is entirely at the discretion of the user.
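On the server side, the association between filename extension and MIME type amounts to a table lookup. The sketch below assumes a particular set of extensions (.pdb, .xyz and .csml are our choices for illustration; the actual server configuration is not reproduced here) and a hypothetical function name.

```python
import os.path

# Hypothetical extension table; the extensions are assumptions for this sketch.
EXTENSION_TO_MIME = {
    ".pdb": "chemical/x-pdb",
    ".xyz": "chemical/x-xyz",
    ".csml": "chemical/x-csml",
}


def content_type_for(filename, default="application/octet-stream"):
    """Choose the MIME type to send with a file, based on its extension."""
    _, extension = os.path.splitext(filename)
    return EXTENSION_TO_MIME.get(extension.lower(), default)


if __name__ == "__main__":
    print(content_type_for("oligomer.pdb"))
```

The extension matters only on the server. Once the data have been labelled, the browser sees just the MIME type, which is why the choice of viewer can be left to the reader.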

In Table 1 we list some of the secondary MIME types that have been implemented, together with typical application programs with which we have tested the concept.

Proposed MIME type          Associated applications
chemical/x-pdb              RasMol, XMol, EyeChem, Ball-and-Stick
chemical/x-mopac-input      XMol, EyeChem, Ball-and-Stick, Chem3D
chemical/x-xyz              RasMol, XMol
chemical/x-mdl-tgf          ISIS Draw
chemical/x-chemdraw         ChemOffice
chemical/x-gaussian-input   XMol, EyeChem
chemical/x-csml             RasMol

Table 1. Experimental chemical MIME types.

The secondary MIME types x-pdb, x-mopac-input, x-xyz and x-gaussian-input are associated with the Protein Data Bank, MOPAC, xyz co-ordinate and Gaussian file formats respectively. The x-mdl-tgf format is used to describe a complex reaction scheme in terms of two dimensional structure representations. The x- prefix indicates that these are experimental types which have not been ratified by the IETF. The last entry in the table refers to an experimental type which we call chemical structure markup language (CSML) and which is discussed in more detail below. It has been our philosophy that chemical MIME types should only be defined for data sets for which renderers are readily available across the common display platforms and which are preferably in the public domain.

As an example of CSML, Fig. 1 illustrates how the concept can be used to move beyond conventional ideas of text. The figure shows a screen dump of a WWW browser reading a document in which a NOESY 2D NMR spectrum of the DNA oligomer CGCGTTTTCGCG is being discussed. In the HTML document an embedded figure shows a representation of the 2D NMR spectrum, in the corner of which a small image of the biomolecule has been inserted. This "thumbnail" sketch of the structure has been linked to a file containing the co-ordinates. When the user clicks the mouse in this region the co-ordinates are downloaded, wrapped as a chemical/x-pdb MIME type. The browser therefore pipes the co-ordinate data into a suitable application program, in this case RasMol. This allows the reader to interact with the co-ordinate data and to select a suitable viewing angle and representation of the molecule. The image of the spectrum itself has been marked up using CSML so that cross-peaks in the NOESY spectrum, in this case representing inter-chain contacts, are defined as small circular regions in the figure. Each region is associated with a file on the server containing structural markup instructions, so that when the reader activates one of these regions the instructions contained in the file are sent to the RasMol viewer to update the display and render the two residues associated with the selected NOE peak as red spheres. This is achieved by wrapping the instructions as a chemical/x-csml MIME type.
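The hot zone mechanism can be sketched as follows. The CSML syntax itself is not reproduced in this paper, so the instructions generated here simply follow the RasMol script commands select, spacefill and colour; the class, the function names, the pixel co-ordinates and the residue numbers are all hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class HotZone:
    """A circular region of the spectrum image tied to a pair of residues."""
    x: float          # centre of the region, in image pixels
    y: float
    radius: float
    residues: tuple   # residue numbers linked by the cross-peak


def zone_at(zones, px, py):
    """Return the first zone containing the clicked point, or None."""
    for zone in zones:
        if math.hypot(px - zone.x, py - zone.y) <= zone.radius:
            return zone
    return None


def highlight_script(zone):
    """Viewer instructions that show the zone's residues as red spheres."""
    selection = ",".join(str(number) for number in zone.residues)
    return "\n".join([f"select {selection}", "spacefill", "colour red"])


if __name__ == "__main__":
    zones = [HotZone(x=120, y=85, radius=6, residues=(4, 21))]
    clicked = zone_at(zones, 122, 88)
    if clicked is not None:
        print(highlight_script(clicked))
```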


Figure 1. 2D NOESY spectrum of the DNA oligomer CGCGTTTTCGCG in which the cross-peaks are annotated with CSML commands sent to the RasMol viewer.

The hypertext paradigm of the WWW allows authors to link text with two dimensional images and, as we have shown with CSML, to enable dynamic markup of three dimensional molecular models initiated from comments in text or "hotzones" in 2D images. The process, however, does not work in reverse. Thus it was not possible to embed a hyperlink onto an individual atom or sub-structure of a 3D representation of a molecule so as to associate it with other sources of information. However, when the concept of a virtual reality modelling language (VRML) was first seriously introduced at the first WWW conference in May 1994, we realised that this would provide a powerful, and in particular a standard, mechanism for annotating 3D data. VRML was based on a well developed object oriented graphical language called Inventor, and hence a number of tools for creating VRML based descriptions of molecular information were readily created. In particular, a program system called EyeChem [10], which we had been working on, was straightforwardly converted into a VRML authoring system. During 1995 we have been exploring the potential applications of this language [11]. Examples of our use of this system include producing 3D molecule diagrams hyperlinked to a transparent rendering of a molecular orbital surface, "navigable" 3D projections of multi-dimensional potential energy surfaces, and 3D scatter diagrams of intermolecular interactions in crystal lattices. These VRML "worlds" can contain hyperlinks to other VRML descriptions, to more conventional HTML based text descriptions, or potentially to any Internet resource with a URL. In theory, a research paper could be presented as a navigable VRML based "world", although whether the general chemical community is ready for such an unusual departure from a conventional printed article remains to be tested.
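To indicate what a hyperlinked atom looks like in practice, the following sketch writes a small VRML 1.0 scene in which one sphere is wrapped in a WWWAnchor node. It is not the output of EyeChem; the function names, co-ordinates, radii and colours are assumptions made for the example, and the link target is the URL given in reference [11].

```python
def atom_node(x, y, z, radius, rgb, url=None, description=""):
    """Return VRML 1.0 text for one sphere, optionally wrapped in a hyperlink."""
    r, g, b = rgb
    body = (
        "Separator {\n"
        f"  Material {{ diffuseColor {r} {g} {b} }}\n"
        f"  Translation {{ translation {x} {y} {z} }}\n"
        f"  Sphere {{ radius {radius} }}\n"
        "}"
    )
    if url is None:
        return body
    indented = "\n".join("  " + line for line in body.splitlines())
    return (
        "WWWAnchor {\n"
        f'  name "{url}"\n'
        f'  description "{description}"\n'
        f"{indented}\n"
        "}"
    )


def vrml_world(nodes):
    """Assemble the nodes into a complete VRML 1.0 file."""
    return "#VRML V1.0 ascii\nSeparator {\n" + "\n".join(nodes) + "\n}\n"


if __name__ == "__main__":
    linked = atom_node(0, 0, 0, 0.7, (1, 0, 0),
                       url="http://www.ch.ic.ac.uk/VRML/",
                       description="Further information on this atom")
    plain = atom_node(1.2, 0, 0, 0.4, (1, 1, 1))
    print(vrml_world([linked, plain]))
```

Selecting the first sphere in a VRML browser would follow the link, while the second sphere carries no anchor and behaves as ordinary geometry.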

Electronic Journals:

The mechanisms just described immediately suggest a radically new model for information dissemination in which the reader can render source data to suit his or her tastes. However, the needs and wishes of publishers, authors and readers may not overlap. For the publisher, information management is a real issue, and it is extremely doubtful that electronic journals will lead to lower production costs, at least in the short term when electronic and paper versions will need to be produced in parallel. Questions of copyright and charging mechanisms also need to be addressed. For the author, electronic journals hold the promise of reduced publication times and, as we have just illustrated, the opportunity to "say new things in new ways". It might even seem tempting to cut out the publisher altogether, since it is a relatively trivial task to set up an HTTP server and publish directly on the Internet. On the other hand, publishing in today's prestige journals is often an important consideration for academic promotion, and frequently a criterion for future research funding. What, one can legitimately ask, is the "impact factor" of publishing on the Internet? For the reader the problem is one of information overload, and many have already been put off by the apparent necessity of having to browse through pages and pages of substandard information on the Internet before stumbling on something useful.

With a view to exploring these and related issues we have recently embarked on a collaborative project with the Royal Society of Chemistry to investigate the feasibility of delivering the journals published by the Society electronically[12]. The goal is to deliver Chemical Communications in parallel with the paper version by the end of 1996. We have only just begun this project and can do little more than report our progress to date and map out the areas we perceive to be important here.

There is considerable anecdotal evidence to suggest that chemists tend to scan printed journals by "looking at the pictures". This is hardly surprising because as we have already noted the molecular sciences are often concerned with the structural relationships between complicated three dimensional objects - molecules. The printed version of Chemical Communications has recognised this and has published so-called graphical abstracts on the contents page of the journal for several years. An electronic prototype of this contents page has been tested and will become available in August 1995 to external browsers at URL http://chemistry.rsc.org/rsc/. Very shortly thereafter we expect to link these graphical abstracts to the full text. Before discussing the implications of this for publisher and author, it is worth mentioning the indexing problem associated with the graphical abstract concept.

Clearly within SGML an instance of a graphical abstract could be defined in a document type definition (DTD) with an indexable header (text) element. However, it is also possible, using generalizations of the CSML that we have described, to construct a chemical indexing system. The Klotho project [13] describes a possible model for such a system. Here a chemical name is translated into a SMILES string using a well defined lexical and grammatical analysis. The SMILES representation is then used to generate a pseudo pdb set of molecular co-ordinates which can be returned to the user. It is not difficult to imagine a system in which the reverse process is used to generate an indexable representation from co-ordinate data referenced in a graphical abstract. One can then envisage search engines, based for example on the Lycos [14] model, which browse the Internet for specified chemical MIME types and construct an index in a common and well defined representation. Such a system might also provide a gateway to established databases such as those provided by Chemical Abstracts. It is clear that multimedia browsing and search tools would be of enormous use in other fields. For example, a pathologist might wish to demand of an image library "find me a slide that looks like the one I have under my microscope". It is far less clear how such a system might be implemented (we have proposed an indexing system based on principal component analysis but have yet to demonstrate it). As far as chemical indexing is concerned, however, the highly structured nature of the information, essentially a connection table, allows a much more rigorous semantic analysis. Work is underway to define a DTD specifically for chemical objects and to develop a content based chemical markup language (CML) [15].
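The first step of such a search engine, collecting the resources that are served with a chemical MIME type, might look like the sketch below. It assumes that a crawler has already gathered (URL, Content-Type) pairs; the function name and the example.org URLs are hypothetical, and the translation of co-ordinates into an indexable representation is not attempted.

```python
from collections import defaultdict


def build_chemical_index(resources):
    """Group URLs by chemical MIME type.

    `resources` is an iterable of (url, content_type) pairs, such as a
    crawler might collect from the Content-Type header of each response.
    """
    index = defaultdict(list)
    for url, content_type in resources:
        # Discard any parameters that follow the type, e.g. "; charset=...".
        mime_type = content_type.split(";", 1)[0].strip().lower()
        if mime_type.startswith("chemical/"):
            index[mime_type].append(url)
    return dict(index)


if __name__ == "__main__":
    crawled = [
        ("http://example.org/paper.html", "text/html"),
        ("http://example.org/dna.pdb", "chemical/x-pdb"),
        ("http://example.org/water.xyz", "chemical/x-xyz"),
    ]
    print(build_chemical_index(crawled))
```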

The major issues for the publisher of an electronic journal, as we see them, are data management, archiving and automatic markup of authors' manuscripts. Other concerns are copyright and charging mechanisms. We do not intend to address these latter issues here as they are essentially commercial rather than technical considerations, except to say that in our view an institutional subscription model (which incidentally is easy to implement through the Internet domain name structure) is preferred to "pay by page" mechanisms.
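The parenthetical remark about domain names can be illustrated with a short check of whether a requesting host falls within a subscribing institution's domain. This is a sketch of the matching logic only; the function name and the list of subscribed domains are assumptions, and a real service would also have to consider how reliably the host name of a client can be established.

```python
def is_subscriber(hostname, subscribed_domains):
    """Return True if the host lies within any subscribing institution's domain."""
    host = hostname.strip().lower().rstrip(".")
    for domain in subscribed_domains:
        domain = domain.strip().lower().lstrip(".")
        # Match the domain itself or any host below it, but not look-alikes
        # such as "notleeds.ac.uk".
        if host == domain or host.endswith("." + domain):
            return True
    return False


if __name__ == "__main__":
    subscribers = {"leeds.ac.uk", "ic.ac.uk"}  # hypothetical subscription list
    print(is_subscriber("www.chem.leeds.ac.uk", subscribers))
    print(is_subscriber("host.example.com", subscribers))
```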

As a site grows, data management on a WWW server becomes increasingly difficult. It is easy to get tangled in a web of one's own making. HTML documents are simply linked together, and there is no inbuilt database management structure. Recently a number of projects have tried to address this issue. Most notable are the Symposia project [16] developed at INRIA, which is a collaborative authoring tool for the web, and the virtual science park (VSP) project [17] at Leeds, in which X.500 directory services and HTTP are being used to build a collaborative working environment. Within the CLIC electronic journal project we will explore these and other models for managing the RSC journal "server", which is currently distributed across four physical servers.

Management of supplemental data and archiving in general also need to be addressed. The WWW system is a model of a distributed information system and it is obviously desirable to make use of this feature to reduce the load on individual servers. Authors, however, cannot be relied upon to maintain their own servers forever. Any user of the WWW knows how frequently information is moved and how URLs can end up pointing into empty space. Although it is possible that in the future the naming formalism will be augmented by more robust uniform resource names (URNs) [18], which would allow more facile information relocation, this mechanism cannot be relied on. One answer is simply to live with the situation and to treat supplemental data as ephemeral, but this is not very satisfactory. Better would be the creation of moderated data repositories along the lines of the Cambridge [1] and Brookhaven [2] data banks to which the more useful supplemental data would eventually migrate.

A related issue is the management of authors' manuscripts, which are likely to be submitted in a range of formats. The Institute of Physics (IoP) have recently made great progress in delivering their publications over the Internet in parallel with their paper equivalents [19]. The situation in the physics community is somewhat easier than in chemistry, with over 40% of manuscripts being submitted directly in TeX and 36% in TeX related (e.g. LaTeX) formats. A number of conversion tools exist, for example the perl translator LaTeX2HTML [20], which facilitate document markup into HTML from TeX source files, and the IoP have for a long time used TeX in typesetting their journals. Electronic submission is less prevalent in the chemical community and is further complicated by the wide variety of platform specific text preparation systems that are in use, although Microsoft Word is predominant. Within this environment rich text format (RTF) appears to offer the easiest metamarkup language, but is complicated by the existence of several dialects. Further complications arise because mathematically oriented text is treated as a graphical object rather than as a special text container as in TeX. Given current browser specifications based on the HTML2 standard this is not a significant limitation, since non-standard text has to be treated as an embedded object. It will, however, become limiting if the proposed HTML3 standards are adopted. HTML3 is much closer to SGML and defines a container for mathematical text in its DTD. We are currently developing software conversion tools for LaTeX2HTML3 and investigating strategies suitable for implementation by the RSC, but it is too early in the project to propose any guidelines. An alternative strategy, and one also under review by the IoP, is to make use of Adobe's portable document format (PDF), which is derived from PostScript. This has the advantage of delivering the electronic document in a form that has the same "house style" as the paper publication with the additional functionality of hypertext. It will not be a solution, however, if authors, publishers and readers wish to make use of the mechanisms described in the previous section to add features to electronic publications that are not available in the printed form.

Issues of this kind raise more general ones about the future of scientific discourse in an electronically mediated world. In the near future electronics are unlikely to change current working practices dramatically except in the speed with which results may be disseminated from the laboratory to the wider world. Here electronic mail, both for submission of manuscripts and for the referee review process, is already having an enormous impact. It is also becoming common practice in some communities to circulate electronic pre-prints once an article has been accepted for publication (and in some cases before!). However, as electronics impinge more on work practice, it is likely that this paradigm will change. Already it is difficult to decide what constitutes a publication. For example, is an unrefereed URL to be considered as a bona fide publication? Almost certainly not. But as electronic bulletin boards move towards being moderated discussion groups, the distinction between these forums and traditional journals distributed electronically becomes blurred. Presently the essential difference between the two seems to be the way in which a subject thread in a moderated discussion group is a "living document", frequently annotated by other authors, whereas a traditional journal paper is a static, once and for all, statement. The interesting question is whether we are dealing with information or precedence. For example, an author may discover that she or he has made an error and quoted a wrong result in a paper. Should the paper be amended? On the other hand there are important papers, say interpretations of quantum mechanics pre 1926, which although "wrong" contain useful insight into the historical development of an idea. The future is unclear but it is our belief that electronics will radically change the ways in which science is communicated.

Electronic Conferences:

An area which can particularly benefit from the immediacy of communication on the Internet is conferencing. Two electronic models can be envisaged [21]. The first is closely associated with a physically real conference enhanced by an electronic component. The first four WWW conferences followed this model, allowing on-line pre-registration, abstract and program dissemination, and access to a variety of supplemental information. The 210th American Chemical Society National meeting in Chicago in August 1995 also incorporated elements of this model, with the full conference agenda, including last minute changes, being available on-line [22], allowing delegates to plan their attendance in advance. In addition, a selection of electronic posters by authors who were not able to attend the conference in person was made available. The costs associated with attending multi-national conferences are not inconsiderable, and even attendance cannot ensure that a delegate can attend all the talks they might wish, because of frequent parallel sessions.

These particular issues are directly addressed in the second model, where no physical venue for delegates to meet is actually necessary. Here, the ethos is to create an environment where people can exchange ideas and to help promote subsequent collaborations and perhaps physical meetings between people. By July 1995, two mainstream chemistry conferences of this type had been organised: ECCC (Electronic Computational Chemistry Conference) [23], November 1994, and ECTOC (Electronic Conference on Trends in Organic Chemistry) [24], June 1995. For these events conference papers were available on-line; most were mounted on a single "conference" server, but around 25% were mounted on the author's own server, and simply linked to the main conference pages. Discussions occurred using an electronic mail distribution list to around 300-400 "attendees". These discussions were also made available via Web pages for those people who chose not to subscribe to the discussions. Various "value-added" facilities were also available. For example, keyword searches of the ECTOC accepted papers were available. Molecular co-ordinates for a number of structures associated with papers were made available through chemical MIME, and a "hyperglossary" containing these structures served to add structure and thematic elements to the conference. Sub-structure searches could be performed on this hyperglossary. An interesting component of the ECTOC was the access to statistical information on how the conference was being browsed by readers around the world. In excess of 2000 computer systems were used to view the conference documents, and over 100,000 server accesses were recorded during a four week period. In addition, up-to-date statistics for individual papers could be compiled. We wonder whether this particular type of information, hitherto unavailable for readers of journals or conference delegates, will eventually feature in funding or tenure applications! A rather less formal component was the conference "photograph", containing around 45 digital photographs that delegates had submitted of themselves. The final costs of these conferences were remarkably low in comparison to conventional meetings, and both ECCC and ECTOC were generally agreed to represent a valuable addition to the mechanisms available to the chemical community for the exchange of ideas and information.
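Statistics of this kind can be derived from the access log kept by the HTTP server. The sketch below assumes that the log is in the Common Log Format written by many servers; it is not the software used for ECTOC, and the function name, host names and paths in the example are hypothetical.

```python
import re
from collections import Counter

# One Common Log Format entry: host ident user [date] "request" status bytes
LOG_LINE = re.compile(
    r'^(\S+) \S+ \S+ \[[^\]]*\] "(?:GET|HEAD|POST) (\S+)[^"]*" (\d{3}) \S+$'
)


def summarise(lines):
    """Return (distinct hosts, total accesses, accesses per requested path)."""
    hosts = set()
    per_path = Counter()
    total = 0
    for line in lines:
        match = LOG_LINE.match(line.strip())
        if match is None:
            continue  # ignore lines that are not in the expected format
        host, path, _status = match.groups()
        hosts.add(host.lower())
        per_path[path] += 1
        total += 1
    return len(hosts), total, per_path


if __name__ == "__main__":
    sample = [
        'a.example.org - - [20/Jun/1995:10:00:00 +0100] "GET /ectoc/papers/01/ HTTP/1.0" 200 5120',
        'b.example.org - - [20/Jun/1995:10:02:10 +0100] "GET /ectoc/papers/01/ HTTP/1.0" 200 5120',
        'a.example.org - - [20/Jun/1995:10:05:30 +0100] "GET /ectoc/papers/02/ HTTP/1.0" 200 7301',
    ]
    n_hosts, n_accesses, by_paper = summarise(sample)
    print(n_hosts, n_accesses, by_paper.most_common())
```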

Conclusions:

During the period 1994-5, as awareness of the Internet rapidly increased, so did the criticism that the quality of information available via this mechanism was highly variable, and only infrequently actually useful. To some extent, this is inevitable for a phenomenon which in age is barely beyond the toddler stage. Nevertheless, the responsibility of all serious users of the Internet as an information delivery tool is to address these quality issues. At least in part, they will be solved by creative and innovative applications offering the user additional benefit over conventional and mature methods such as printed journals. Ultimately, the Internet should not be perceived as a mechanism in competition with traditional methods, or even as a mechanism for reducing cost structures. Instead, it should be viewed as a new and rapidly improving instrumental tool for use by the chemical community in general.

Acknowledgements:

We are grateful to the HEFC Joint Information System Committee for funding under the FIGIT programme, and to Drs D James, J Goodman, C Hildyard and Mr O Casher for stimulating discussions and technical assistance.

References

1. FH Allen, JE Davies, JJ Galloy, O Johnson, O Kennard, CF Macrae, EM Mitchell, GF Mitchell, JM Smith and DG Watson, J. Chem. Inf. Comput. Sci., 1991, 31, 187.

2. The Protein Data Bank, Chemistry Department, Brookhaven National Laboratory, Upton, NY.

3. TJ Berners-Lee, R Cailliau, J-F Groff and B Pollermann, CERN, "World-Wide-Web: The Information Universe" in Electronic Networking: Research, Applications and Policy, Meckler Publishing, Westport, CT, 1992, 2, 52.

4. HS Rzepa, BJ Whitaker and M Winter, JCS Chem. Comm., 1994, 17, 1907.

5. M Bryan, "SGML: An Author's Guide", Addison-Wesley, 1988, and International Standards Organization, "Information processing - Text and Office Systems - Standard Generalized Markup Language (SGML)" (ISO 8879), Geneva: ISO, 1986.

6. N Borenstein and N Freed, Internet Request for Comments (RFC) No. 1521, 1993. See also http://www.ncsa.uiuc.edu/SDG/Software/Mosaic/Docs/rfc1521.txt

7. HS Rzepa, P Murray-Rust and BJ Whitaker, "A Chemical Primary Content Type for MIME", http://www.ch.ic.ac/chemime2.html, February 1995.

8. National Institutes of Health, Bethesda, Maryland 20892, USA. See http://www.nih.gov/molecular_modelling/net_services.html

9. O Casher, GK Chandramohan, MJ Hargreaves, C Leach, P Murray-Rust, HS Rzepa, R Sayle and BJ Whitaker, JCS Perkin Trans. 2, 1995, 7

10. O Casher, HS Rzepa and SM Green, J. Mol. Graphics, 1994, 12, 226

11. See http://www.ch.ic.ac.uk/VRML/

12. The CLIC consortium, Electronic Journals Programme Area, Joint Information System Committee. See http://ukoln.bath.ac.uk/elib/intro.html

13. T Kazic, ACS Symposium Series, 1994, 576, 486.

14. ML Mauldin and JRR Leavitt, Proceedings of the ACM Special Interest Group on Networked Information Discovery and Retrieval, McLean, Virginia, August 1994. See http://fuzine.mt.cs.cmu.edu/mlm/signidr94.html

15. P Murray-Rust, C Leach and HS Rzepa, work in progress. See http://www.ch.ic.ac.uk/cml/

16. European Commission, Telematics Applications Programme W1001. See http://symposia.inria.fr/symposia

17. P Dew, CM Leigh and D Morris, http://dream1.leeds.ac.uk/~vsp/

18. TJ Berners-Lee, "WWW Names and Addresses, URIs, URLs, URNs", CERN, Geneva, 1993. See http://info.cern.ch/hypertext/WWW/Addressing/Adressing.html

19. Institute of Physics Publishing Ltd., Techno House, Bristol ,UK. See http://www.iop.org/

20. LaTeX2HTML, Nikos Drakos, University of Leeds. See http://cbl.leeds.ac.uk/

21. HS Rzepa, Trends Analyt. Chem., 1995 (in press). See also http://www.elsevier.nl/

22. http:/amerchem.acs.org/memgen/meetings/acs210/findex.html

23. SM Bachrach, Ed., http://hackberry.chem.niu.edu/ECCC2/

24. HS Rzepa and JM Goodman, Ed., http://www.ch.ic.ac.uk/ectoc
