Findings, questions and recommendations from the ISDA workshop

Future Generation Computer Systems 16 (1999) 1–8

Roy Williams a,∗, Julian Bunn b, Reagan Moore c, James C.T. Pool a

a Center for Advanced Computing Research, Caltech 158-79, Pasadena, CA 91125, USA
b CERN, CH 1211 Geneva 23, Switzerland

c San Diego Supercomputer Center, UCSD 0505, 9500 Gilman Drive, La Jolla, CA 92093-0505, USA

Accepted 11 February 1999

∗ Corresponding author. Tel.: +1-626-395-3670; fax: +1-626-584-5917. E-mail address: [email protected] (R. Williams)

1. Metadata

As in all workshops about digital libraries, there was a great deal of discussion about metadata. It is defined in terms of relationships between data objects, or as searchable data about data; for example cataloging information, call numbers, file names, object IDs, hyperlinks, ownership records, signatures and other parentage documents.

We point out that representations of data, summary data, and other kinds of derived data are not strictly considered to be metadata; examples are thumbnail images, graphs and visualizations, power spectra, and instrument calibration information. Computers can create these kinds of data automatically, and they can also make records of low-level data descriptors such as file size, file name, location, timestamps, access control, etc.

The more interesting metadata issues concern the semantic content, the ‘meaning’ of the data. Currently, this can only be created by a human mind, and the easiest way to do this is when the data objects are created in the first place. The key here is to create structured documents in a sophisticated markup language such as XML or SGML, or failing that, compliant HTML, with all the tags written in a compliant manner. If this is done, machines will be able to parse the document for the purposes of abstracting, sorting, graphing, summarizing and archiving.

Metadata may also be added to the archive gradually. The archive might be processed to extract a new attribute for the catalogue database, for example an evaluation of whether the data is valid in some sense. Larger quantities of data can be added to the archive by providing hyperlinks to other servers. If these additions are deemed valid by the administrators of the archive, they will make the transition from personal to public: from the cached data belonging to an individual scientist to an acknowledged part of the library. Other kinds of metadata that can be gradually added might include information on who has accessed or cited which parts of the library.

Scientific data archives are more often sophisticated query engines than collections of documents. We should consider a data object to consist of more than just a MIME-type and binary data: rather, it is a combination of the binary with a page of structured text which is a description of the object. This document, the metadata, is therefore generated in real time, as is the requested data object. We recommend that a markup language such as XML be used for the metadata description. As an example, scientific data contains numerical parameters; if the description contains tags like <param name="lambda">0.37</param> rather than the usual ad hoc script files, then (in the future) sophisticated XML software will be able to catalogue the results, and to sort, graph, and summarize them.
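To make this concrete, the following is a minimal sketch (not from the workshop; the tag names, parameter names and values are invented for illustration) of how generic XML software could pull such tagged parameters out of a metadata description for cataloguing, sorting or graphing. It uses the standard Java DOM parser.

    // Sketch: extracting tagged numerical parameters from an XML metadata description.
    import java.io.StringReader;
    import javax.xml.parsers.DocumentBuilder;
    import javax.xml.parsers.DocumentBuilderFactory;
    import org.w3c.dom.Document;
    import org.w3c.dom.Element;
    import org.w3c.dom.NodeList;
    import org.xml.sax.InputSource;

    public class MetadataExtractor {
        public static void main(String[] args) throws Exception {
            // Hypothetical metadata description accompanying a binary data object.
            String metadata =
                "<dataobject id=\"run-42\">" +
                "  <param name=\"lambda\">0.37</param>" +
                "  <param name=\"temperature\">300.0</param>" +
                "</dataobject>";

            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(metadata)));

            // Because the parameters are tagged rather than buried in ad hoc scripts,
            // generic software can catalogue, sort or graph them without special knowledge.
            NodeList params = doc.getElementsByTagName("param");
            for (int i = 0; i < params.getLength(); i++) {
                Element p = (Element) params.item(i);
                double value = Double.parseDouble(p.getTextContent().trim());
                System.out.println(p.getAttribute("name") + " = " + value);
            }
        }
    }

Run against the hypothetical description above, this prints lambda = 0.37 and temperature = 300.0; the same loop could just as well feed a catalogue database or a plotting tool.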

2. Collaboration

Collaboration is the lifeblood of scientific investigation, especially collaboration between disparate fields of enquiry. The structure of the scientific data archive can assist and encourage collaboration in a number of ways.

Federating libraries can foster collaboration. For example, the Digital Sky project is a federation of surveys at optical, infrared, and radio wavelengths; it is expected that there will be new astronomical knowledge when these archives are interoperating, knowledge that is created from joining existing archives, without any new observations. But another, social, effect of the library interoperation is the encouragement of collaboration. In forming metadata standards and agreeing upon semantics, subfield experts see a wider picture and work more closely with experts from related fields.

The library can provide a bulletin board and a mailing list. Useful contributed material includes documented scripts of existing sessions that can be used by others. Collaboration tools – groupware – can provide shared browsing, whiteboard, and conferencing facilities. Currently it is difficult to do collaborative scientific visualization, because it means that the pixels of one screen are copied, rather than the much smaller model from which the pixels are constructed. To use commercial groupware for scientific purposes, we need a published API that allows specialized scientific visualization software to be connected to the groupware in the bandwidth-optimal way described above.
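As an illustration of the kind of published API we have in mind, the following hypothetical Java interfaces sketch a groupware session that exchanges the compact visualization model rather than rendered pixels. The names are ours and do not correspond to any existing groupware product.

    // Sketch: a groupware hook that shares the visualization model, not the pixels.
    public interface SharedVisualizationSession {
        // Broadcast a model update to all participants; the model is assumed to be
        // far smaller than the rendered image it produces.
        void publishModel(VisualizationModel model);

        // Register a participant that re-renders locally whenever the model changes.
        void addParticipant(ModelListener listener);
    }

    interface VisualizationModel {
        byte[] serialize();   // compact description: geometry, colour maps, viewpoint
    }

    interface ModelListener {
        void modelChanged(VisualizationModel model);
    }

Each participant re-renders the model with its own specialized visualization software, so only the small model crosses the network.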

When designing user interfaces, we should not think only about the isolated user, but also in terms of shared browsing, tutorial sessions, and exchange of scripts among geographically separated users.

3. Librarians

Many of those at the workshop considered themselves creators or users of a scientific data archive, but not librarians or administrators. So the question we are left with is: who are the librarians for this increasing number of ever more complex libraries? Who will be ingesting and cataloguing new data; maintaining the data and software; archiving, compressing and deleting the old? Somebody should be analyzing and summarizing the data content, assuring provenance and attaching peer review; encouraging interaction from registered users and project collaborators. Perhaps the most important function of the librarian is to answer questions and teach new users.

4. How long will the archive last?

Scientific data archives contain valuable information that may be useful for a very long time, for example climate and remote-sensing data, long-term observations of astronomical phenomena, or protein sequences from extinct species. On the other hand, it may be that the archive is only interesting until the knowledge has been thoroughly extracted, or it may be that the archive contains results that turn out to be flawed. Thus a primary question about such archives is how long the data is intended to be available, followed by the secondary questions of who will manage it during its lifetime, and how that is to be achieved.

Data is useless unless it is accessible: unless it is catalogued and retrievable, unless the software that reads the binary files is available and there is a machine that can run that software. While we recognize the finite lifetime of hardware such as tapes and tape readers, we must also recognize that files written with specific software have a finite lifetime before they become incomprehensible binary streams. Simply copying the whole archive to newer media may solve the first problem, but to solve the second problem the archive must be kept ‘alive’ with upgrades and other intelligent maintenance.

A third limit on the lifetime of the data archive may be set by the lifetime of the collaboration that maintains it. At the end of the funding cycle that created the archive, it must be transformed in several ways if it is to survive. Unless those that created the data are ready to take up the rather different task of long-term maintenance, the archive may need to be taken over by a different group of people with different interests; indeed it may pass from federal funding to commercial product. The archive should be compressed and cleaned out before this transformation.

5. Flexibility

The keys to flexibility in a complex, distributed software system are specification of interfaces, clear control channels, and modularity of components.

Interfaces should be layered, with the upper layers specified closely and documented by the system architect, while the lower layers should be a standard software interface, specified by a standards body or a de facto standard which is well accepted; something that will be used and extended in the future. Unless there is really a new paradigm, it should not be necessary for the system architect to specify a protocol at the byte level, with all the error-prone drudgery of byte-swapping and packetization. It is also not a good idea to utilize a proprietary or non-standard protocol, unless documentation and source code are available and there are no property-rights issues.

Each process or thread that is involved in the computation should have a channel through which it receives control information, and it should be clear and well-defined when this channel opens, closes, or changes. If there are separate control and data streams, it must be clear who is listening where. The Unix model of input, output, and error streams is a good one.

On the other hand, the components that communicate through these interfaces need not be carefully tested, committee-approved software; they can be inefficient prototypes, hacked code, or obsolete implementations by a bankrupt company. Once the interfaces and protocols are well known and strong, different implementations of the components can be created and replaced easily.
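A small, hypothetical Java sketch of this point: once the interface is agreed, a quick in-memory prototype and a later, more serious implementation are interchangeable behind it, and callers never change.

    // Sketch: a fixed interface with replaceable implementations.
    public interface DataStore {
        byte[] fetch(String objectId);
    }

    // Quick prototype: everything kept in memory; inefficient but good enough to start.
    class InMemoryStore implements DataStore {
        private final java.util.Map<String, byte[]> objects = new java.util.HashMap<>();
        public void put(String id, byte[] data) { objects.put(id, data); }
        public byte[] fetch(String objectId) { return objects.get(objectId); }
    }

    // Later replacement, e.g. backed by an archival storage system; callers are unchanged.
    class ArchiveStore implements DataStore {
        public byte[] fetch(String objectId) {
            throw new UnsupportedOperationException("delegate to the archival system here");
        }
    }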

6. Brokers

A broker is a software process in a distributed computing system that connects clients with servers, and that translates, interprets, redirects or fuses queries and their results. Brokers provide the flexibility to unify servers with different implementation details and language variations, to translate between languages and to encapsulate legacy systems. A broker allows a user to create a complex query without exposing all the complexity; for example, a broker can create queries in a language such as SQL using a simple wizard-based graphical interface. With a broker, the client and server protocols and services can be optimized independently, with the broker providing the translation. Brokers can also be used to translate a single language to the particular dialect that each server might want.
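As a sketch of the wizard-to-SQL idea (the table and column names are invented for illustration), a broker might translate a simple map of user-chosen constraints into the SQL that a particular server expects:

    // Sketch: a broker turning wizard-style constraints into SQL.
    import java.util.LinkedHashMap;
    import java.util.Map;

    public class QueryBroker {
        // Translate user-friendly constraints into the dialect this server expects.
        public String toSql(String table, Map<String, String> constraints) {
            StringBuilder sql = new StringBuilder("SELECT * FROM " + table);
            String separator = " WHERE ";
            for (Map.Entry<String, String> c : constraints.entrySet()) {
                sql.append(separator).append(c.getKey()).append(" = '")
                   .append(c.getValue().replace("'", "''")).append("'");
                separator = " AND ";
            }
            return sql.toString();
        }

        public static void main(String[] args) {
            Map<String, String> constraints = new LinkedHashMap<>();
            constraints.put("wavelength_band", "infrared");
            constraints.put("survey", "2MASS");
            System.out.println(new QueryBroker().toSql("observations", constraints));
            // SELECT * FROM observations WHERE wavelength_band = 'infrared' AND survey = '2MASS'
        }
    }

The client sees only the simple constraint map; a different broker could emit the same query in another server's dialect.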

Brokers are also useful for system design, to provide modularity and portability. When considering the choice of a database product, it may be advantageous to separate the database from the rest of the system so that it can be extended or replaced with another product at a later time. This can be achieved by placing a broker between the database and the client system.

A very popular kind of broker that is universally accessible is the Web server. With an enormous implementation effort already made in the business world, it is easy to use a web server as the protean broker. A web server can work with multiple, heterogeneous databases, with high-performance archival storage systems, and with text-based legacy systems; it can hide the underlying OS; it can also provide authentication, encryption, multithreading, and sessions; web servers and application servers can be made from software components. There is much research into parallel, high-performance web servers and other advanced servers.

7. Distributed software components

Modern software technology offers the promise of flexible, high-performance, distributed systems where the code can be understood and modified without needing to know how everything works, knowing only the semantics and methods of one object. Distribution is supposedly easy if all the code is written in Java, through Remote Method Invocation (RMI) and JavaSpaces. At a higher level of complexity and flexibility, CORBA provides portable distribution of objects whose methods are written in other languages such as C++. Of course, the Java optimism may only be because it is newer than CORBA.
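For example, an archive service exposed through Java RMI might be no more than an interface like the following hypothetical sketch; a client needs only this interface and a registry lookup, not the server's implementation.

    // Sketch: a remote archive service defined purely by its interface.
    import java.rmi.Remote;
    import java.rmi.RemoteException;

    public interface ArchiveService extends Remote {
        // Return the structured metadata document describing a data object.
        String describe(String objectId) throws RemoteException;

        // Return the binary data object itself.
        byte[] retrieve(String objectId) throws RemoteException;
    }

A client would obtain a stub with something like Naming.lookup("rmi://archive-host/ArchiveService") (the host and binding name are invented here) and then call describe() and retrieve() as if they were local methods.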

Components offer the idea of plug-and-play software. When a component is introduced to a system, it announces its purpose through a process known as introspection, and allows customization. A GUI or other application is made by connecting components through events, with the events produced by one component being sent to other components. The creation of an interface thus involves many kinds of people: besides the end-user of the interface, there is a library of components written by different people, and the person who connects the components together into a GUI. Components offer an advantage to the end-user as well as to the creator of the GUI, because once the user has learned how to use a component, the knowledge can be reused; for example, the file chooser has the same look in all Microsoft Windows applications.
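A minimal JavaBeans-style sketch of the introspection step (the component and its properties are hypothetical): a builder tool discovers the customizable properties from the component's getter/setter pairs without any prior knowledge of the class.

    // Sketch: a component announcing its customizable properties via introspection.
    import java.beans.BeanInfo;
    import java.beans.Introspector;
    import java.beans.PropertyDescriptor;

    public class ScatterPlotComponent {
        private String title = "untitled";
        private int pointSize = 3;

        public String getTitle() { return title; }
        public void setTitle(String title) { this.title = title; }
        public int getPointSize() { return pointSize; }
        public void setPointSize(int pointSize) { this.pointSize = pointSize; }

        public static void main(String[] args) throws Exception {
            // What a GUI builder does when the component is dropped onto a form.
            BeanInfo info = Introspector.getBeanInfo(ScatterPlotComponent.class, Object.class);
            for (PropertyDescriptor p : info.getPropertyDescriptors()) {
                System.out.println("customizable property: " + p.getName());
            }
        }
    }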

8. Database software

Most of the case studies have embraced the idea that a relational or object database is essential for flexibility and modularity in design, for querying, for sorting, and for generating new types of data object or document.

Object database practitioners are largely convinced of their superiority over relational databases for scientific data, but of course relational databases can also work for scientific data. Object databases provide features such as data abstraction and hiding, inheritance, and virtual functions. One might also argue that if programs are in an object-oriented language, then it is natural and proper that the data be in an object database.

A portability question arises between relational and object databases. Relational schemas are simpler than the rich structures possible with a full object model, so porting between relational products entails writing the tables in some reasonable form, perhaps ASCII, and reading them into the other product. With an object database, however, much of the implementation work is writing code that interfaces to the proprietary API, and porting to another ODBMS will only be easy if the code is designed for portability right from the start.

While most scientific data archives have embraced the idea that a database is essential for storing the metadata, another question is how to store the large binary objects that represent the data objects themselves. One point of view maintains that investing time, effort, and money into a DBMS implies that it should be used for managing all the data in a unified way. This is appropriate when there are many, relatively static, not-too-large data objects. The other point of view is that cataloguing functions are very different from the specialized operations on the large data objects, and when we write the code that does complex processing and mining, we want to work directly with the data, not through the DBMS API. Splitting data from metadata in this way reduces dependence on a particular DBMS product, making it much easier to port the archive to a different software platform, since only the metadata needs to be moved.
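A hypothetical JDBC sketch of the split (the schema and column names are invented): the catalogue row, with its searchable description, goes into the DBMS, while the large binary object stays in a file or archival store and is referenced only by its location.

    // Sketch: metadata in the DBMS, the large binary object referenced by location only.
    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;

    public class CatalogueEntry {
        public static void register(String jdbcUrl, String objectId,
                                    String description, String dataLocation) throws Exception {
            try (Connection con = DriverManager.getConnection(jdbcUrl);
                 PreparedStatement ps = con.prepareStatement(
                     "INSERT INTO data_object (id, description, location) VALUES (?, ?, ?)")) {
                ps.setString(1, objectId);
                ps.setString(2, description);   // searchable metadata, lives in the database
                ps.setString(3, dataLocation);  // e.g. tape volume or file path; the bytes stay outside
                ps.executeUpdate();
            }
        }
    }

Mining code reads the binary object directly from its location, and porting to a different DBMS means moving only the catalogue table.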

9. Commercial software

Using commercial software in a digital library can be difficult if the client must pay for the use of the software. Licensing agreements are often written from the point of view of a single user running the software on a single workstation. While advances such as floating licences are welcomed, new kinds of software and licences are still needed: short-term licences, free client software, licences for running the software as part of an application server, or licences based on measured usage.

Sometimes it is a good idea for the library implementor to insulate herself from uncontrolled changes in a commercial product by thinking of escape right from the start. It may be possible to encapsulate the commercial package into a few functions or classes, so that the package can be ‘swapped out’, if necessary, at some future date.
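A minimal Java sketch of such encapsulation, with an invented image-compression example: only one class mentions the vendor's API, so swapping the product out later means rewriting that one class.

    // Sketch: a thin wrapper that confines a commercial package to one class.
    public interface ImageCodec {
        byte[] compress(byte[] rawImage);
        byte[] decompress(byte[] compressedImage);
    }

    // The only class that would call the (hypothetical) vendor library.
    class VendorImageCodec implements ImageCodec {
        public byte[] compress(byte[] rawImage) {
            throw new UnsupportedOperationException("delegate to the vendor library here");
        }
        public byte[] decompress(byte[] compressedImage) {
            throw new UnsupportedOperationException("delegate to the vendor library here");
        }
    }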

10. Exceptions and diagnostics

One of the most difficult aspects of distributed systems in general is exception handling. For any distributed system, each component must have access to a ‘hotline’ to the human user or log file, as well as diagnostics and error reporting at lower levels of urgency, which are not flushed as frequently. Only with high-quality diagnostics can we expect to find and remove not only bugs in the individual modules, but also the particularly difficult problems that depend on the distributed nature of the application.
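A hypothetical Java sketch of such a scheme: every component reports through one channel, with a hotline level that is written immediately while lower-urgency messages may be buffered and flushed less often.

    // Sketch: one diagnostics channel per component, with levels of urgency.
    public interface Diagnostics {
        enum Level { HOTLINE, ERROR, WARNING, TRACE }

        void report(String component, Level level, String message);
    }

    class SimpleDiagnostics implements Diagnostics {
        public void report(String component, Level level, String message) {
            String line = level + " [" + component + "] " + message;
            if (level == Level.HOTLINE) {
                System.err.println(line);   // straight to the human user or log file
                System.err.flush();
            } else {
                System.out.println(line);   // lower urgency; may be buffered and flushed later
            }
        }
    }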


11. Deep citation

In principle, the account of a scientific experiment includes detailed quantitative evidence that allows differentiation between competing theories. For a computational simulation or data mining investigation, the analogue of this evidence includes metadata that allows the reader full access to the evidence. Carrying this idea to its logical conclusion would allow readers access not only to the data that was used, but also to the programs that created and extracted the knowledge content of the data, so that they can verify results and examine variations by running the simulation or mining code themselves. This is deep citation. While this idea may seem excessive for a paper in a journal, it would be very desirable for a close collaboration or for educational purposes. Individual researchers would also benefit from keeping deep citations to their own work.

12. Data driven computing

We are used to the idea of writing a procedural program which reads files from some kind of data service. In some circumstances, this may not be the correct model: rather, we write the code as a handler of data objects which are handed to the compute service in some arbitrary order by the data service. Suppose, for example, that we wish to apply an algorithm to all the data of a large data archive, and that the archive is stored on a tape robot, where the data objects are arbitrarily ordered on many tapes. If the compute process is in control, then the robot may thrash the tapes furiously in order to deliver the data objects in the order demanded by the program. But in the data-driven model, all the data objects are extracted from a tape in the order in which they appear and delivered to the compute process (the handler), and the job is completed in much less time. Certain kinds of query, involving touching most of the data of the archive, may be scheduled for a ‘data-driven run’, where many such queries, from different users, can be satisfied with a single run through the data, perhaps scheduled to run over a weekend.
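A minimal Java sketch of the data-driven model (the names are ours): handlers for many pending queries are registered, and the data service calls deliver() for each object in whatever order the tapes yield it, so all the queries are satisfied in a single pass.

    // Sketch: compute code written as handlers, driven by the order of the data.
    import java.util.ArrayList;
    import java.util.List;

    public class DataDrivenRun {
        // One handler per pending query or user.
        public interface Handler {
            void handle(byte[] dataObject);
        }

        private final List<Handler> handlers = new ArrayList<>();

        public void register(Handler h) { handlers.add(h); }

        // Called by the (hypothetical) tape service as each object comes off tape.
        public void deliver(byte[] dataObject) {
            for (Handler h : handlers) {
                h.handle(dataObject);
            }
        }
    }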

13. Text-based interfaces

While the point-and-click interface is excellent for beginners, mature users prefer a text-based command stream. Such a stream provides a tangible record of how we got where we are in the library; it can be stored or mailed to colleagues; it can be edited, to change parameters and run again, or to convert an interactive session to a batch job; the command stream can be merged with other command streams to make a more sophisticated result; a command stream can be used as a start-up script to personalize the library; a collection of documented scripts can be used as examples of how to use the library.

We should thus focus effort on the transition between beginner and mature user. The graphical interface should create text commands, which are displayed to the user before execution, so that the beginner can learn the text interface.
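A small, hypothetical sketch of this transition aid: the graphical panel assembles the equivalent text command and shows it before executing it, so the same string can later be edited, mailed, or replayed as a batch job.

    // Sketch: the GUI builds and displays the text command it is about to run.
    public class QueryPanelSketch {
        // Values the user picked with point-and-click widgets (invented names).
        String survey = "infrared";
        double maxMagnitude = 14.5;

        String buildCommand() {
            return "select survey=" + survey + " magnitude<" + maxMagnitude;
        }

        public static void main(String[] args) {
            QueryPanelSketch panel = new QueryPanelSketch();
            String command = panel.buildCommand();
            System.out.println("About to run: " + command);  // the beginner sees and learns the text form
            // execute(command);  // the same string could be saved, mailed, or replayed in batch
        }
    }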

In a similar fashion, the library should produce results in a text stream in all but the most trivial cases. The stream would be a structured document containing information about how the results were achieved, with hyperlinks to the images and other large objects. Such an output would then be a self-contained document, not just an unlabeled graph or image. Because it is made from structured text, it can be searched, archived, summarized and cited.

14. XML

Extensible Markup Language (XML), which has been developed in a largely virtual W3C project, is the new ‘extremely simple’ dialect of SGML for the web. XML combines the robustness, richness and precision of SGML with the ease and ubiquity of HTML. Microsoft and other major vendors have already committed to XML in early releases of their products. Through style sheets, the structure of an XML document can be used for formatting, like HTML, but the structure can also be used for other purposes, such as automatic metadata extraction for the purposes of classification, cataloguing, discovery and archiving.

A less flexible choice for the documents produced by the archive is compliant HTML. The compliance means that certain syntax rules of HTML are followed: rules include closing all markup tags correctly, for example closing the paragraph tag <p> with a </p> and enclosing the body of the text with <body> ... </body> tags. More subtle rules should also be followed so that the HTML provides structure, not just formatting.


15. Metadata standards and federation

If metadata is to be useful to other libraries, to search and discovery services, or for federation of archives, then it must be standardized: there must be a consensus about the semantics, the structure, and the syntax of the metadata. A basis for the semantics is provided by the Dublin Core standard, which is gaining strong momentum in the library world, as well as among commercial information providers. There are 15 key components (Title, Author/Creator, Subject/Keywords, Description, Publisher, Other Contributor, Date, Resource Type, Format, Resource Identifier, Source, Language, Relation, Coverage, Rights Management). The Dublin Core also includes ways to extend the specification, either by adding components or by hierarchically dividing existing components.

Federation of archives grows in synergy with the creation of metadata standards. When a local effort tries to join two archives, a common vocabulary is created, leading to a metadata standard. Not only does this encourage collaboration, but it also leverages more knowledge from existing data assets. Metadata standards also encourage the beginning of collaborations because they allow and encourage discovery services.

We felt that the Dublin Core is an effective metadata standard that is appropriate (with extensions) to scientific data objects. What is needed in addition is the development of the languages needed to exchange metadata, schemas and ontologies. For each of these levels of abstraction, a definition language and a manipulation language are needed.

16. Authentication and security

The workshop identified a need for integration and consensus on authentication and security for access to scientific digital archives. Many in the scientific communities are experts in secure access to Unix hosts through X Windows and text interfaces. In the future we expect to be able to use any thin client (a Java-enabled browser) to securely access data and computing facilities. The workshop felt that there was a distinct lack of consensus on bridges from these latter access methods to the Unix world that many scientists inhabit.

Low-security access: This area can be accessed with a clear-text password, control by domain name, HTTP authentication, a password known to several people, or even ‘security through obscurity’. This kind of security emphasizes ease of access for authenticated users, and is not intended to keep out a serious break-in attempt. Appropriate types of data in this category might be prototype archives with data that is not yet scientifically sound, catalogues or other metadata, or data that is part of a collaborative effort, such as a partly-written paper.

High-security access: Access to these data and computing resources should be available only to authorized users, to those with root permission on a server machine, or to those who can watch the keystrokes of an authorized user. Access at this level allows copying and deletion of files and long runs on powerful computing facilities. The data may be valuable intellectual property and/or the principal investigator may have first-discovery rights. Appropriate protocols include Secure Socket Layer (SSL), Pretty Good Privacy (PGP), secure shell (ssh), One-Time Passwords (OTP) and digital certificates.

Once a user is authenticated to one machine, we may wish to do distributed computing, so there should be a mechanism for passing authentication to other machines. One way to do this is to have trust between a group of machines, perhaps using ssh; another way would be to utilize a metacomputing framework such as Globus, which provides its own security. Once we can provide effective access control to one Globus server, it can do a secure handover of authentication to other Globus hosts. Just as Globus is intended for heterogeneous computing, the Storage Resource Broker provides authentication handover for heterogeneous data storage.

17. Standard scientific data objects

We would like standard semantics and user interfaces for common objects that arise in scientific investigations. An example is the multi-dimensional point set, where several numerical attributes are chosen from a database relation (a table) as the ‘dimensions’ of the space, and standard tools are used to create 2D or 3D scatterplots, principal component extractions, and other knowledge extraction methods.


Another example of a standard object is a trajectory in a high-dimensional phase space, which occurs when storing the results of a molecular dynamics or other N-body computation, or when a multi-channel time series is recorded from a scientific instrument.
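As a sketch of what such standard semantics might look like (these interfaces are hypothetical, not a proposed standard), a multi-dimensional point set and a phase-space trajectory could share one minimal Java contract that generic scatterplot and analysis tools understand.

    // Sketch: shared semantics for point-set-like scientific data objects.
    public interface PointSet {
        int dimension();            // number of numerical attributes chosen as axes
        int size();                 // number of points (or time samples)
        double coordinate(int point, int axis);
        String axisName(int axis);  // e.g. the column name from the database relation
    }

    // A trajectory is simply a point set whose points are ordered in time.
    interface Trajectory extends PointSet {
        double time(int point);
    }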

18. Request estimation and optimization

It is important that the user is given continuous feedback when using the library. When a non-trivial query is issued, there should be an estimate of the resources (time, cost, etc.) that are needed to satisfy it, and the user should accept these charges before continuing. Large queries may be scheduled to run later; smaller queries will provide a continuously updated resource estimate.
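A hypothetical Java sketch of this interaction: the query service first returns an estimate, and execution (or deferred scheduling) happens only after the user has accepted it.

    // Sketch: estimate first, execute only after the user accepts the charges.
    public interface QueryService {
        // Predicted resources, computed without touching the bulk data.
        Estimate estimate(String query);

        // Called only with an estimate the user has accepted; a large request
        // might instead be queued for a later batch run.
        byte[] execute(String query, Estimate acceptedEstimate);

        class Estimate {
            public double cpuSeconds;
            public double gigabytesRead;
            public double estimatedCost;
        }
    }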

An area that needs particular attention is the estimation of resources in distributed systems, for example when a query joins data from geographically separated sites and computes with it at a third.

19. Data parentage, peer-review, and publisher imprint

Information without provenance is not worth much. Information is valuable when we know who created it, who reduced it, who drew the conclusions, and the review process it has undergone. To make digital archives reach their full flower, there must be ways to attach these values in an obvious and unforgeable way, so that the signature, the imprint of the author or publisher, stays with the information as it is reinterpreted in different ways. When the information is copied and abstracted, there should also be a mechanism to prevent illegal copying and reproduction of intellectual property while allowing easy access to the data for those who are authorized.
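One way to attach such an unforgeable imprint, sketched here with the standard Java security API and an invented provenance string, is a digital signature over the provenance record; anyone holding the author's or publisher's public key can verify that the record has not been altered.

    // Sketch: signing and verifying a provenance record.
    import java.security.KeyPair;
    import java.security.KeyPairGenerator;
    import java.security.Signature;

    public class ProvenanceSignature {
        public static void main(String[] args) throws Exception {
            byte[] provenance = "creator=A. Scientist; reviewed=yes".getBytes("UTF-8");

            // In practice the key pair belongs to the author or publisher.
            KeyPair keys = KeyPairGenerator.getInstance("DSA").generateKeyPair();

            Signature signer = Signature.getInstance("SHA1withDSA");
            signer.initSign(keys.getPrivate());
            signer.update(provenance);
            byte[] imprint = signer.sign();          // travels with the information

            Signature verifier = Signature.getInstance("SHA1withDSA");
            verifier.initVerify(keys.getPublic());
            verifier.update(provenance);
            System.out.println("provenance intact: " + verifier.verify(imprint));
        }
    }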

Roy Williams is a Senior Scientist with the Center for Advanced Computing Research at the California Institute of Technology, Pasadena, CA. He received the B.A. in mathematics from Cambridge University, England, in 1979 and the Ph.D. degree in Physics from the California Institute of Technology in 1983. He is interested in scientific data archives, their storage, retrieval, mining, and browsing, as well as ways to extract knowledge from the combination of heterogeneous data sources.

Dr. Julian Bunn has been researching in computing for High Energy (Particle) Physics since 1985. He was born in England in 1959, and educated at the University of Manchester, obtaining a B.Sc. (Hons) in Physics in 1977, and then at the University of Sheffield, where he obtained his Ph.D. in Experimental High Energy Physics in 1983. He was then appointed as a Research Associate at the Max Planck Institute for High Energy Physics in Munich. In the following year he accepted a position as a Research Associate at the Rutherford Appleton Laboratory in Oxford. Shortly afterwards he was offered a staff position at the European Laboratory for Particle Physics (CERN) in Geneva. Since joining CERN, he has held several positions as Project Leader and Section Leader in the Information Technology Division. Recently, he instigated, and became co-Principal Investigator of, the “GIOD” joint project between Caltech and CERN, an effort funded by Hewlett-Packard. The project is investigating the use of Object Oriented software, commercial Object Databases and mass storage systems as solutions to the PetaByte storage needs of the next generation of particle physics experiments. To carry out this project, Dr. Bunn is on Special Leave of Absence from CERN, working at Caltech. He is collaborating closely with Caltech’s Center for Advanced Computing Research (CACR). His work has involved the design and implementation of a scheme for populating an Object Database with 1 TeraByte of physics data, using SMP servers and clusters of NT workstations. He has developed C++ and Java/3D/JFC applications that run against the database (a featured application at the Fall ’98 “Internet-2” meeting), measured scalability and deployment issues, and evaluated the Object Database performance on a 256 CPU Exemplar system, using numerous distributed clients. Latterly, his work has focussed on modelling the system behaviour to produce scaling predictor algorithms, with special emphasis on the WAN aspects of the systems, and development of sophisticated event viewers based on Java 3D. The event viewers interact directly with the Object Database to access and render the complex event structures typical of particle physics.

Dr. Reagan W. Moore is Associate Director of Enabling Technologies at the San Diego Supercomputer Center and an Adjunct Professor in the UCSD CSE department. He coordinates research efforts in development of massive data analysis systems, scientific data publication systems, and persistent archives. An ongoing research interest is support for information-based data-intensive computing. Moore is an active participant in NSF workshops on digital libraries and Knowledge Networks. Recent publications include a chapter on data-intensive computing in the book “The Grid: Blueprint for a New Computing Infrastructure”.

Moore has been at SDSC since its inception, initially being responsible for operating system development. Prior to that he worked as a computational plasma physicist at General Atomics on equilibrium and stability of toroidal fusion devices. He has a Ph.D. in plasma physics from the University of California, San Diego (1978) and a B.S. in physics from the California Institute of Technology (1967).

James C.T. Pool, Executive Director of Caltech’s Center for Advanced Computing Research, currently has overall responsibility for the Center. The Center was established to ensure that Caltech and its Jet Propulsion Laboratory will be at the forefront in computational science and engineering and, in particular, to enable breakthroughs in computational science and engineering.

He is also Executive Director of Caltech’s Center for Simulation of Dynamic Response of Materials, a center of excellence of the DOE Accelerated Strategic Computing Initiative’s Academic Strategic Alliance Program.

He has previously held positions related to high performance computing at a national laboratory, federal research agencies, and an independent software vendor. At Argonne National Laboratory, he was responsible for the applied mathematics and computer science research program, including the research group that produced EISPACK and LINPACK. At the DOE Office of Energy Research and the DOD Office of Naval Research, he directed basic research programs in applied mathematics and computer science. At the Numerical Algorithms Group (NAG), he was responsible for NAG’s North American activities including interactions with the numerical software research community, major user sites, and high performance system vendors.