Sequencing Information Management System (SIMS)


Final Report

Sequencing Information Management System (SIMS)

Feasibility Study

National Center for Genome Resources 1800 Old Pecos Trail Santa Fe, NM 87505

Chris Fields, PI

Supported in part by:

Human Genome Program Office of Health and Environment Research

U.S. Department of Energy

Cooperative Agreement DE-FC03-95ER62062


I. Summary

A feasibility study to develop a requirements analysis and functional specification for a data management system for large-scale DNA sequencing laboratories was supported as a Supplement to Cooperative Agreement DE-FC03-95ER62062 between the Office of Health and Environment Research (OHER), U.S. Department of Energy (DOE), and the National Center for Genome Resources (NCGR). This effort resulted in a functional specification for a Sequencing Information Management System (SIMS). This document reports the results of this feasibility study and includes a functional specification for a SIMS relational schema. This specification is described as "Version 0.1" to indicate its prototype design status.

The Sequencing Information Management System (SIMS) is an integrated information management system that supports data acquisition, management, analysis, and distribution for DNA sequencing laboratories. The SIMS provides ad hoc query access to information on the sequencing process and its results, and partially automates the transfer of data between laboratory instruments, analysis programs, technical personnel, and managers. The SIMS user interfaces are designed for use by laboratory technicians, laboratory managers, and scientists.

The SIMS is designed to run in a heterogeneous, multiplatform environment in a client/server mode. The SIMS communicates with external computational and data resources via the internet.

The SIMS Version 0.1 requirements and specifications were assembled with substantial input from staff of the LANL Genome Center. Draft documentation on SIMS was transmitted in September to the LANL, LBNL, and LLNL Genome Centers. Written comments were obtained from the LLNL Center, some of which resulted in changes in the design.

Development of a SIMS meeting the requirements outlined here would be a major, many programmer-year effort. Development of a scaled-back SIMS for a single laboratory with well-defined procedures is feasible with the resources available to a typical genome laboratory.


II. Requirements (System Level)

The SIMS will meet the following system-level requirements:

1. Application.

The SIMS will support high- or low-pass DNA sequencing processes, employing a variety of strategies, in steady state.

2. Integration.

The SIMS will integrate data management for the entire sequencing process, from clone library preparation to finished sequence distribution.

3. Queries.

The SIMS will support ad hoc queries joining any SIMS-maintained data, either synchronously or retrospectively.

4. Interfaces.

The SIMS will provide user interfaces usable with minimal specialized training by laboratory technicians, laboratory managers, and scientists.

5. User control.

The SIMS will allow user review and intervention at any stage in any process under the direct control of the SIMS.

6. Documentation.

The SIMS will include complete design and user documentation.

7. Platform.

The SIMS will run in a heterogeneous, multiplatform environment.


III. Functional Specification (System Level)

The SIMS will meet the system-level requirements specified in Section II with the following functionality:

1. Application.

Requirement:

The SIMS will support high- or low-pass DNA sequencing processes, employing a variety of strategies, in steady state.

Specification:

1.1 The SIMS will represent the DNA sequencing process at a level of granularity suitable for expressing differences between sequencing and analysis strategies.

1.1.1 Acquisition and analysis of either continuous or discontinuous DNA sequences can be represented.

1.1.2 Both random and directed strategies will be represented.

1.1.3 SIMS will represent the current results of a sequencing process at any stage of completion.

1.1.4 Design will allow alterations in strategy with minimal recoding.

1.1.4.1 A process-step-oriented design will be employed.

1.1.4.2 Protocols employed at different steps can be varied at runtime.

1.1.4.3 Representations of additional protocols can be added with no recoding.

1.1.4.4 Interfaces to additional instruments or analysis programs can be added with no recoding other than the required user interfaces.


1.2 The SIMS will support both low- and high-throughput sequencing.

1.2.1 Throughputs of at least 960 templates per day.

1.2.2 Project sizes of at least 10 megabases.

1.3 The SIMS will support automated and interactive sequence analysis.

1.3.1 Both synchronous and retrospective analyses of sequence data.

1.3.1.1 Analysis process will be tracked.

1.3.1.2 Analysis results will be tracked and archived as necessary.

1.3.1.3 Retrospective reanalyses will be supported.

1.3.2 Analysis steps to include at least:

1.3.2.1 Vector identification and masking.

1.3.2.2 End-trimming and reversible masking.

1.3.2.3 Chimera detection and sequence splitting.

1.3.2.4 Contaminant screening and rejection.

1.3.2.5 Interactive quality-control assessment.

1.3.2.6 Overlap-driven sequence assembly.

1.3.2.7 Heuristic-driven meta-assembly.

1.3.2.8 Sequence similarity searching.

1.3.2.9 Compositional analysis.


1.3.2.10 Searches for discrete features.

1.3.2.11 Gene structure prediction.

1.3.3 Interactive sequence interpretation will be supported.

1.3.4 Distribution of automatically analyzed and interactively interpreted sequences to appropriate public databases will be supported.

1.3.4.1 All relevant data types for public distribution will be supported.

1.3.4.2 Client-server interaction with GSDB will be supported.

2. Integration.

Requirement:

The SIMS will integrate data management for the entire sequencing process, from clone library preparation to finished sequence distribution.

Specification:

2.1 The SIMS will represent at least the following process steps (illustrated in the attached Materials Flow Diagram, Figure 1):

2.1.1 Clone library preparation.

2.1.2 Template preparation.

2.1.3 Sequencing reactions.

2.1.4 Gel loading and running.

2.1.5 Basecalling.

2.1.6 Interactive basecall editing.


2.1.7 Sequence analysis.

2.1.8 Analyzed sequence distribution.

2.2 The SIMS will track the status of processes in real time.

2.2.1 Assign and maintain names of materials and data.

2.2.2 Track the protocols and devices used in processes.

2.2.3 Track status, location, and availability of materials, reagents, and supplies used in processes.

2.2.4 Maintain records of personnel executing processes.

2.3 The SIMS will track personnel availability and skills.

2.4 The SIMS will represent protocols for laboratory instruments.

2.4.1 In human-readable form.

2.4.2 In executable form where feasible.

2.5 The SIMS will provide mechanisms for archiving data and analysis results.

3. Queries.

Requirement:

The SIMS will support ad hoc queries joining any SIMS-maintained data, either synchronously or retrospectively.

Specification:

3.1 The SIMS will support queries to define daily workflow.


3.2 The SIMS will support queries for retrospective quality control.

3.3 The SIMS will support queries for personnel performance analysis.

3.4 The SIMS will support queries to assess resource usage and availability.

4. Interfaces.

Requirement:

The SIMS will provide user interfaces usable with minimal specialized training by laboratory technicians, laboratory managers, and scientists.

Specification:

4.1 The SIMS will provide query forms for standard query types and summary report generation.

4.1.1 Identification of next step in process.

4.1.2 Identification of next protocol to perform, and protocol contents.

4.1.3 Identification of step results.

4.1.4 Process status.

4.1.5 Process results to date.

4.1.6 Retrospective quality analysis.

4.2 The SIMS will provide tools for visualizing and editing sequences and features.

4.2.1 Sequences can be visualized at scales from bases to megabases.

4.2.2 Sequence alignments can be visualized and edited.


4.2.3 Span-oriented features can be added, deleted, and edited.

4.2.4 Histograms for continuous-variable analyses can be constructed.

4.3 The SIMS will allow any displayed data to be printed.

4.3.1 Files or hardcopy for text data.

4.3.2 Hardcopy for screen displays.

4.4 The SIMS will minimize data entry and file manipulation.

4.4.1 Data transfer between processes will be automated where software interfaces between devices or programs automating the process are feasible.

4.4.2 Archiving will be automated, with user configuration of archiving frequency, format, and media.

5. User control.

Requirement:

The SIMS will allow user review and intervention at any stage in any process under the direct control of the SIMS.

Specification:

5.1 Interfaces will include user options to edit, save, save-as, abort, or continue a step or a project.

5.2 Interfaces will include specified failure codes.

5.2.1 Failure codes will be specified by process step.

6. Documentation.


Requirement:

The SIMS will include complete design and user documentation.

Specification:

6.1 Documentation to include:

6.1.1 Summary.

6.1.2 Requirements.

6.1.3 Functional Specification.

6.1.4 Design Specification.

6.1.5 Embedded Help System.

6.1.6 User Manual.

6.1.7 Administrator Manual.

7. Platform.

Requirement:

The SIMS will run in a heterogeneous, multiplatform environment.

Specification:

7.1 All SIMS user interfaces will run in client-server mode.

7.2 User interfaces will run on at least:

7.2.1 Unix.


7.2.2 Macintosh 680x0 or PowerPC.

7.2.3 Windows NT.

7.3 SIMS core database will run on one or more of the following:

7.3.1 Sybase.

7.3.2 Oracle.

7.4 SIMS will interact with external resources in internet client-server mode.

7.4.1 External resource use will be run-time configurable.


IV. Component Model

SIMS will comprise the following software components. Functional specifications will be developed for each component prior to component-level design. Subcomponents listed here correspond to identifiable areas of effort required to develop each component.

1. Database

a) Schema structure

b) Data definitions

c) Controlled vocabularies

d) Management system

e) Programming interface

2. Instrument interface

a) Instrument file I/O

b) Instrument database I/O

c) Instrument user interface links (if any)

3. Technician interface

a) Query forms

b) Help system

4. Analysis tool interface

a) Data I/O language


b) Parsers

c) Resource-selection and control-flow system

5. Analysis tool user interface

a) Command forms

b) Results browsers and selectors

c) Option lists

d) Help system

6. Ad hoc query interface

a) Editable example SQL queries

b) Free-text SQL window

c) Error interpreter

d) Help system

7. Project/Step/Protocol definition interface

a) Editor

b) Step linker

c) Constraint editor and checker

d) Help system


V. Functional Specification, Schema Level (Database Component)

This document specifies the functionality of the SIMS Database Component with respect to the system-level SIMS functional specification. Parenthetical notations of numbers preceded by "FS" reference system-level specifications (cf. Section III).

1. Schema structure

See the attached "SIMS 0.1" entity-relationship diagram (Figure 2).

1.1 Conceptual structure

The operation of a sequencing laboratory can be viewed as involving multiple "projects," each comprising multiple "steps," that may be running concurrently. These projects may overlap in personnel and resource use. Sequencing a cosmid or other clone by some strategy, generating and sequencing a set of fragments and constructing STSs, or performing a retrospective analysis of a data set are examples of types of projects.

A project is requested or initiated by a particular person at a particular time. A project is completed at a particular time.

Completing a project involves executing a sequence of "steps" in an order. Constructing sequencing templates from clones, performing sequencing reactions, loading gels, performing BLAST searches, or reviewing analysis results prior to public or internal distribution of data are examples of steps. The order in which steps are performed is determined by enabling and precondition relations between steps; in general, each step will enable one or more further steps, and will have one or more steps as preconditions. A coherent project specifies a sequence of steps that is consistent with these enabling and precondition relations.
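The coherence condition described above can be illustrated with a minimal sketch. The step names and the dictionary layout below are assumptions made for illustration only; they are not part of the SIMS specification. The check treats a project as coherent when every step appears after all of the steps listed as its preconditions.

```python
from typing import Dict, List, Set

# Hypothetical precondition relation: step -> steps that must be completed first.
PRECONDITIONS: Dict[str, Set[str]] = {
    "template_prep": {"library_prep"},
    "sequencing_reaction": {"template_prep"},
    "gel_run": {"sequencing_reaction"},
    "basecalling": {"gel_run"},
}


def is_coherent(step_order: List[str], preconditions: Dict[str, Set[str]]) -> bool:
    """Return True if every step appears after all of its preconditions."""
    done: Set[str] = set()
    for step in step_order:
        # A step is enabled only when all of its preconditions are already done.
        if not preconditions.get(step, set()) <= done:
            return False
        done.add(step)
    return True


good_order = ["library_prep", "template_prep", "sequencing_reaction",
              "gel_run", "basecalling"]
bad_order = ["library_prep", "sequencing_reaction", "template_prep"]

print(is_coherent(good_order, PRECONDITIONS))
print(is_coherent(bad_order, PRECONDITIONS))
```

In this sketch a project that omits a required precondition step altogether is also reported as incoherent, which is one possible reading of the constraint.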

When a project is carried out, "instances" of the steps are performed. Step instances are initiated and completed at particular times, by particular people, instruments, or software systems. They employ particular data and/or materials as inputs, and produce particular data and/or materials as outputs. They employ particular methods and protocols, which may vary between instances of the same step. Particular laboratory devices are employed. A step instance may fail, in which case it may need to be repeated for the project to succeed.

The SIMS schema represents the relationships between projects, steps, step instances, and the people, devices, methods, protocols, inputs, and outputs of each step instance.

1.2 Representation of projects

Projects are represented by:

1.2.1 A Project table specifying the name of the project, requesting person, start and completion times, and current status (requested, in progress, aborted, completed).

1.2.2 A Project_Link table specifying the steps, their order, and the status of each step in the context of the project.

A project must comprise one or more steps. The level of granularity at which projects may be represented is dependent on the level of granularity at which steps are represented.

The project representation supports queries from managers to assess the status of each project on a step-by-step basis (FS 3.1, FS 4.1.4), and queries from technicians to determine the steps that must be performed to complete a project (FS 4.1.1).
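The two kinds of query mentioned above (a manager's step-by-step status report and a technician's list of remaining steps) might look as follows. This is a sketch only: the column names, the sample rows, and the use of SQLite in place of Sybase or Oracle are assumptions, and table names are written with underscores.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Project (
    project_id   INTEGER PRIMARY KEY,
    name         TEXT NOT NULL,
    requested_by TEXT,
    status       TEXT CHECK (status IN
                 ('requested', 'in progress', 'aborted', 'completed'))
);
-- Hypothetical layout: one row per step of a project, with order and status.
CREATE TABLE Project_Link (
    project_id INTEGER REFERENCES Project(project_id),
    step_name  TEXT NOT NULL,
    step_order INTEGER NOT NULL,
    status     TEXT NOT NULL
);
INSERT INTO Project VALUES (1, 'cosmid_A', 'manager_1', 'in progress');
INSERT INTO Project_Link VALUES
    (1, 'template preparation', 1, 'completed'),
    (1, 'sequencing reactions', 2, 'in progress'),
    (1, 'gel loading and running', 3, 'pending');
""")


def project_status(conn, project_id):
    """Manager view: status of every step of a project, in order."""
    return conn.execute(
        "SELECT step_order, step_name, status FROM Project_Link "
        "WHERE project_id = ? ORDER BY step_order",
        (project_id,),
    ).fetchall()


def remaining_steps(conn, project_id):
    """Technician view: steps still to be completed, in order."""
    rows = conn.execute(
        "SELECT step_name FROM Project_Link "
        "WHERE project_id = ? AND status <> 'completed' ORDER BY step_order",
        (project_id,),
    ).fetchall()
    return [row[0] for row in rows]


for row in project_status(conn, 1):
    print(row)
print(remaining_steps(conn, 1))
```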

1.3 Representation of steps and step instances.

Steps and their instances are represented by:

1.3.1 A Step_type table describing the step in human-readable form (text).

1.3.2 A Precondition table specifying enabling and precondition relations between steps.

1.3.3 A Step_instance table specifying when a step was performed, the person performing and (optionally) reviewing the performance, its status (pending, in progress, failed, completed), and how it failed (NULL indicates success).

1.3.4 An I/O_Link table specifying the inputs and outputs of the step instance.

The step and instance representations are fundamental to SIMS. Steps can be represented at any level of granularity. The scope of the SIMS is determined by the steps that it represents; supporting a new strategy or type of project requires only that the additional steps be specified (FS 1.1, FS 2.1).

The text description of a step must be sufficient to allow the technician executing the step to understand what must be done and perform the necessary actions. Specifying a step requires writing the descriptive text and specifying the steps it enables and its preconditions.

The distinction between steps and step instances allows the descriptive information specifying the step as an abstraction to be separated from the tracking of each individual performance of the step. This separation permits different instances of the same step to use different protocols (FS 1.1.4.2), devices, and materials, and provides a single identifier for linking (via I/O_Link) the performance of a step to the resources used and the results (FS 2.2.2, FS 3.1, FS 3.4, FS 4.1.3, FS 5.2.1).
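The separation of step types from step instances can be sketched as two tables joined on a step identifier. The column names and sample data are illustrative assumptions; the only convention taken from the text is that a NULL failure value indicates success.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Step_type (
    step_id     INTEGER PRIMARY KEY,
    description TEXT NOT NULL          -- human-readable text
);
CREATE TABLE Step_instance (
    instance_id  INTEGER PRIMARY KEY,
    step_id      INTEGER NOT NULL REFERENCES Step_type(step_id),
    performed_by TEXT,
    status       TEXT,
    failure      TEXT                  -- NULL indicates success
);
INSERT INTO Step_type VALUES (1, 'Load and run sequencing gel');
INSERT INTO Step_instance VALUES
    (10, 1, 'tech_a', 'failed',    'gel cracked'),
    (11, 1, 'tech_a', 'completed', NULL),
    (12, 1, 'tech_b', 'completed', NULL);
""")


def failure_counts(conn):
    """Per step type: number of instances and number that failed."""
    return conn.execute("""
        SELECT t.description,
               COUNT(i.instance_id) AS runs,
               COUNT(i.failure)     AS failures   -- counts non-NULL values only
        FROM Step_type t
        JOIN Step_instance i ON i.step_id = t.step_id
        GROUP BY t.step_id, t.description
    """).fetchall()


print(failure_counts(conn))
```

Because the description lives only in Step_type, repeating a failed step adds a row to Step_instance without duplicating the descriptive text.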

1.4 Representation of persons

Projects are requested, steps are performed, and results are reviewed by persons. Persons are represented by:

1.4.1 A Persons table specifying the person, their unavailability dates, and their status (project_id, step_id; NULL indicates unassigned).

1.4.2 A Skill table specifying which persons are qualified for which steps.

The person performing each step instance is recorded, together with the step status and start/stop times. This permits continuous tracking of personnel availability (FS 2.3), task assignments (FS 2.2.4), and performance (FS 3.3).
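As an illustration of how the Persons and Skill tables could answer "who is qualified and free for this step," the sketch below joins the two. Column names and rows are assumptions, and unavailability dates are omitted to keep the example short.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Persons (
    person_id  INTEGER PRIMARY KEY,
    name       TEXT NOT NULL,
    project_id INTEGER,               -- NULL indicates unassigned
    step_id    INTEGER                -- NULL indicates unassigned
);
CREATE TABLE Skill (
    person_id INTEGER REFERENCES Persons(person_id),
    step_id   INTEGER NOT NULL
);
INSERT INTO Persons VALUES
    (1, 'tech_a', NULL, NULL),
    (2, 'tech_b', 7,    3),
    (3, 'tech_c', NULL, NULL);
INSERT INTO Skill VALUES (1, 3), (2, 3), (3, 4);
""")


def available_for(conn, step_id):
    """Names of unassigned persons who are qualified for the given step."""
    rows = conn.execute(
        "SELECT p.name FROM Persons p "
        "JOIN Skill s ON s.person_id = p.person_id "
        "WHERE s.step_id = ? AND p.project_id IS NULL AND p.step_id IS NULL "
        "ORDER BY p.name",
        (step_id,),
    ).fetchall()
    return [row[0] for row in rows]


print(available_for(conn, 3))
```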

1.5 Representation of devices

Laboratory devices are represented by:

1.5.1 A Device table specifying the type, serial number, and status (available, in use, nonoperational) of the device.

The device actually used in a step instance is referenced by identifier, via a Resource_Link table that allows efficient representation of the use of multiple devices (e.g. multiple sequencing machines) in a single step (FS 2.2.2, FS 3.4). Device identifiers are referenced by descriptions of components using or used in a device (e.g. gels). Step failures due to device failures can be tracked by these identifiers (FS 5.2.1).
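A link table of this kind makes it straightforward to relate failures to devices. The following sketch assumes a simplified layout (only the columns needed for the join) and invented sample data; it counts failed step instances per device.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Device (
    device_id     INTEGER PRIMARY KEY,
    type          TEXT,
    serial_number TEXT,
    status        TEXT
);
CREATE TABLE Step_instance (
    instance_id INTEGER PRIMARY KEY,
    failure     TEXT                  -- NULL indicates success
);
-- One row per (step instance, device) pair, so a step may use many devices.
CREATE TABLE Resource_Link (
    instance_id INTEGER REFERENCES Step_instance(instance_id),
    device_id   INTEGER REFERENCES Device(device_id)
);
INSERT INTO Device VALUES
    (1, 'sequencer', 'SN-001', 'available'),
    (2, 'sequencer', 'SN-002', 'available');
INSERT INTO Step_instance VALUES
    (100, NULL), (101, 'lane dropout'), (102, 'lane dropout');
INSERT INTO Resource_Link VALUES (100, 1), (100, 2), (101, 2), (102, 2);
""")


def failures_by_device(conn):
    """Number of failed step instances in which each device was used."""
    return conn.execute("""
        SELECT d.serial_number, COUNT(*) AS failed_steps
        FROM Device d
        JOIN Resource_Link r ON r.device_id = d.device_id
        JOIN Step_instance i ON i.instance_id = r.instance_id
        WHERE i.failure IS NOT NULL
        GROUP BY d.device_id, d.serial_number
        ORDER BY d.serial_number
    """).fetchall()


print(failures_by_device(conn))
```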

1.6 Representation of experimental protocols

Experimental protocols are distinguished from data analysis "methods" to simplify the representation of each. Protocols are represented by:

1.6.1 A Protocol table describing the protocol and specifying its status (current, superseded).

1.6.2 A Super_Prot_Link table specifying superseding protocols.

Protocols are linked to step instances via Resource_Link to allow efficient representation of multiple protocol use in a step.
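One way to read the superseding-protocol link is as a chain from an old protocol to its current replacement. The sketch below follows such a chain; the dictionary stands in for the link table, and the protocol identifiers are hypothetical.

```python
from typing import Dict


def current_protocol(protocol_id: str, superseded_by: Dict[str, str]) -> str:
    """Follow superseding links until a protocol with no successor is reached."""
    seen = set()
    while protocol_id in superseded_by:
        if protocol_id in seen:
            # Guard against a malformed chain that loops back on itself.
            raise ValueError("cycle in superseding links")
        seen.add(protocol_id)
        protocol_id = superseded_by[protocol_id]
    return protocol_id


# Hypothetical link rows: old protocol -> protocol that supersedes it.
SUPER_PROT_LINK = {"P1": "P2", "P2": "P3"}

print(current_protocol("P1", SUPER_PROT_LINK))
print(current_protocol("P3", SUPER_PROT_LINK))
```

Keeping superseded rows rather than overwriting them means that older step instances can still be traced to the protocol version actually used.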

1.7 Representation of reagents

General reagents are represented by:

1.7.1 A Reagent table specifying the reagent and its source, availability, and status (available, on-order).

1.7.2 A Reagent_Link table linking reagents, protocols, and step instances.

1.7.3 A Kit table specifying the source and status (available, on-order) of a commercial sequencing-reaction reagent kit.

1.7.4 A Gel table specifying the separation reagent, prep protocol, and status (poured, ready, in-use) of a sequencing gel.

The Reagent and Reagent_Link tables provide a representation for generic reagents that must be regularly inventoried and reordered (FS 2.2.3) and that may be employed in multiple protocols. The Kit and Gel tables describe sequencing-specific reagents, and are separated from Reagent for ease of querying.

1.8 Representation of biological materials

Biological materials commonly used in sequencing are represented by individual tables for ease of querying. These include:

1.8.1 A Library table specifying the type, development history, and status (available, obsolete, on-order) of a clone library.

1.8.2 A Clone table specifying the type, source, and status (sampled, sequenced, chimeric) of an individual clone.

1.8.3 A Template table specifying the type, primer sites, source, and status (left-sequenced, right-sequenced, walked, chimeric) of a sequencing template.

1.8.4 A Primer table specifying the sequence, source, and status (available, test, order, synthesize) of a PCR primer.

1.8.5 A PCR_Link table linking PCR primers to amplification protocols.

1.8.6 A Mix table linking templates, primers, sequencing instruments and gel lanes, and specifying the status (ready, loaded, re-read, bad) of a loadable sequencing-reaction mix.

The Library and Clone tables provide a representation for clones other than sequencing templates and their sources. Arbitrarily deep subcloning can be represented (FS 2.1.1). The Template table provides a uniform representation for templates generated by either subcloning or PCR, and for either end-primer sequencing or walking (FS 1.1.2, FS 2.1.2). PCR protocols are represented using the Protocol table; this avoids a rigid specification of relevant conditions in the Primer and PCR_Link tables, and hence allows greater flexibility for varying PCR protocols. The Mix table assigns sequencing mixes to gel lanes, and allows multiple mixes to be loaded in a lane, as in multiplex sequencing (FS 1.1) or genotyping strategies.

1.9 Representation of automated sequencer output


SIMS will support automated sequencing systems that produce continuous ("trace" or electropherogram) output as well as base-called sequence data. SIMS will also support laboratories maintaining image files of gels obtained by, e.g. direct-transfer electrophoresis. Sequencer output is represented by:

1.9.1 A Gel_File table specifying the device and person that ran a gel, its inspection, failure (NULL indicates success), and tracking (manual or automated) status, the archival location of the image file, and the status of the gel file (in-progress, complete, tracked, inspected, failed, archived).

1.9.2 An Egram table linking the sequence-mix read during the sequencing run, the gel file, lane number, and the sequence, basecalling and editing status of the electropherogram, archival location, and status (called, edited, re-edited, archived).

These tables support the generation of multiple electropherograms from a gel file, and of multiplexing, multiple basecalls or repeated editing steps generating multiple electropherograms for a single lane (FS 2.1.4, FS 2.1.5, FS 2.1.6). Archiving of either gel files or electropherograms by an external process is tracked by the database (FS 1.3.1.2, FS 2.5).

1.10 Representation of sequences

SIMS represents sequences with a schema structure adapted from the Genome Sequence DataBase, version 2.2, for which a schema specification is available. The representation includes:

1.10.1 A Sequence table specifying the sequence, its type (RNA, DNA, AA), whether it is experimental or derived, gapped, single-source or composite, or a known clone-end sequence, trash code (vector, contaminant, chimera) and a status (edit, split, assemble, complete, archive, release).

1.10.2 A Component table specifying an overlap relation between two sequences and its type (assembly, instance-type, substrate-product).

1.10.3 A Component_Location table specifying the coordinates of each overlapping region for each pair of sequences represented by the Component table.

1.10.4 A Sequence_Pieces table specifying the order and distance between two discontiguous sequences.

1.10.5 An Assembly table specifying whether an assembly is constructed from sequences from a single clonal source or multiple sources.

1.10.6 An Inst_Type_Rel table specifying whether an instance-type relation between two sequences is between paralogs or orthologs, or is unknown.

1.10.7 A Tree table for representing similarity trees in flat form.

1.10.8 A Sub_Prod_Rel table for specifying the type of a substrate-product relationship (rearrangement, transcript, edited, spliced, translation, post-translation processed).

1.10.9 A Pathway table for representing substrate-product pathways in flat form.

1.10.10 A Confidence table specifying strand coverage and estimated sequence accuracy.

1.10.11 An Ext_Dist table specifying whether a sequence has been transferred to an external database, and referencing that database's identifier.

The Sequence, Component, Component_Location, and Sequence_Pieces tables provide the core representation of basecalled sequences, spatial relations and alignments between them, and consensus sequences (FS 1.3.2.6). They also provide a representation of translations, and of alignments between experimental sequences and those identified in database searches (FS 1.3.2.8).

The pairwise component representation handles multiple sequence alignments by referencing each sequence in the alignment to a common "derived" consensus sequence, which provides a common coordinate system for the alignment.
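The idea of a shared consensus coordinate system can be sketched as follows. Each member sequence is placed against the consensus by a start coordinate in each and a span length; a position in one member is mapped to another by passing through consensus coordinates. The Placement type, the 0-based coordinates, and the restriction to ungapped spans are simplifying assumptions, not part of the schema.

```python
from typing import NamedTuple, Optional


class Placement(NamedTuple):
    seq_start: int   # first aligned position in the member sequence (0-based)
    cons_start: int  # corresponding position in the consensus sequence
    length: int      # length of the aligned span (ungapped, for simplicity)


def to_consensus(p: Placement, pos: int) -> Optional[int]:
    """Map a member-sequence position to consensus coordinates."""
    offset = pos - p.seq_start
    return p.cons_start + offset if 0 <= offset < p.length else None


def from_consensus(p: Placement, cons_pos: int) -> Optional[int]:
    """Map a consensus position back to a member-sequence position."""
    offset = cons_pos - p.cons_start
    return p.seq_start + offset if 0 <= offset < p.length else None


def map_between(a: Placement, b: Placement, pos_in_a: int) -> Optional[int]:
    """Map a position in sequence a to sequence b via the consensus."""
    cons_pos = to_consensus(a, pos_in_a)
    return None if cons_pos is None else from_consensus(b, cons_pos)


read1 = Placement(seq_start=0, cons_start=100, length=400)
read2 = Placement(seq_start=20, cons_start=300, length=350)

print(map_between(read1, read2, 250))  # position covered by both reads
print(map_between(read1, read2, 50))   # position not covered by read2
```

With this arrangement, n aligned sequences need n placements against the consensus rather than a record for every pair.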

Virtual sequences "assembled" from multiple discontiguous samples are represented as "derived" sequences that reference multiple ordered "pieces" separated by gaps (FS 1.1.1, FS 1.1.3). The pieces themselves may be virtual sequences; hence multiple levels of resolution in mapping components can be represented. Local order uncertainty is represented by specifying a gap uncertainty greater than the gap size.
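The gap convention can be made concrete with a small sketch. The field names below are assumptions; the rule applied is the one stated in the text, namely that an uncertainty larger than the gap itself signals that the relative order of the two pieces is not known.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class PieceGap:
    left: str          # identifier of the first piece (hypothetical)
    right: str         # identifier of the second piece (hypothetical)
    gap: int           # estimated distance between the pieces, in bases
    uncertainty: int   # plus-or-minus uncertainty on the gap, in bases

    def gap_range(self) -> Tuple[int, int]:
        """Smallest and largest distance consistent with the estimate."""
        return (self.gap - self.uncertainty, self.gap + self.uncertainty)

    def order_is_certain(self) -> bool:
        """Order is treated as uncertain when uncertainty exceeds the gap."""
        return self.uncertainty <= self.gap


ordered = PieceGap("contig_1", "contig_2", gap=5000, uncertainty=1000)
unordered = PieceGap("contig_3", "contig_4", gap=2000, uncertainty=3000)

print(ordered.gap_range(), ordered.order_is_certain())
print(unordered.gap_range(), unordered.order_is_certain())
```

In the second case the lower bound of the range is negative, which is what makes either order of the two pieces consistent with the recorded values.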

The Assembly table supports a distinction between assemblies from single sources (e.g. shotgun fragments of a cosmid) and assemblies from multiple sources (e.g. ESTs from multiple libraries).

The Inst_Type_Rel and Tree tables allow the representation of sequence diversity, and the Sub_Prod_Rel and Pathway tables allow the representation of sequence products. These tables support alignments of, e.g. sequences from multiple isolates of a pathogen or disease gene, and of genomic, cDNA, and inferred protein sequences.

The Confidence table supports tracking of estimated sequence accuracy on a coordinate-span basis, and tracking coverage for assembled sequences.

The Ext_Dist table supports distribution of a sequence and its derived features to either public or private databases other than SIMS.

1.11 Representation of sequence analysis and its results

SIMS represents sequence analysis procedures as steps in the sequencing and analysis process; hence analysis is included in the specification of a project. The results of sequence analysis are represented either as aligned sequences or as features assigned to sequences by one or more coordinate spans. The representation has the following components:

1.11.1 A Method table, analogous to the Protocol table, specifying the algorithm, program, and parameters used in an analysis.

1.11.2 A Ref_Data table specifying the reference data set used in an analysis.

1.11.3 A Data table specifying any ancillary data used in or produced by an analysis.

1.11.4 A Sim_Search table specifying the results of a similarity analysis for a particular pair of aligned sequences.

1.11.5 A Feature table assigning a feature of a specific type to one or more sequences.

1.11.6 A Feature_Location table specifying the coordinates of a feature on a particular sequence.

1.11.7 A Feat_Inst table specifying whether the instance of a feature specified by Feature_Location is complete, pseudo (e.g. a pseudogene), consistent with relevant consensus sequences, or experimentally confirmed.

1.11.8 A Feature_Translation table specifying the amino-acid translation of a protein-coding feature.

1.11.9 A Transl_Except table specifying exceptions to the translation specified by the appropriate Transl_Table table.

1.11.10 A Feature_Search table specifying the results of any analysis to identify features specified by coordinate spans.

The Method and Ref_Data tables are linked to a step instance by an Analysis_Link table, permitting multiple analysis methods to be employed in a given step. These tables track the application of an analysis method to a data set of one or more sequences. The person performing and reviewing the analysis is tracked by the Step_instance table (FS 1.3.1.1, FS 2.1.7, FS 2.2.4).

Sequence-based quality control procedures typically employ both similarity and compositional analyses. Similarity searches that define vector or other heterologous contaminant sequences are tracked by the Sim_Search table, with the results represented as alignments. A new "trimmed" sequence may be "derived" as a new entry to the Sequence table, referencing a step_id (FS 1.3.2.1). Compositional analyses (e.g. based on word usage) are represented as feature searches, which may indicate that an entire sequence span has the "feature" of being a probable heterologous contaminant (FS 1.3.2.4). Low-quality end sequence is represented as a feature; again a new sequence trimmed to delete the region with this feature may be created (FS 1.3.2.2).

The Sim_Search table employs a generic "score" field to accommodate multiple scoring schemes used by different similarity algorithms. A statistical significance measure is included for algorithms that supply such a measure as output (FS 1.3.2.8). The Feature_Search table has a structure analogous to Sim_Search, but is separated for ease of querying. It supports compositional, discrete-feature, and feature-based gene-prediction analyses (FS 1.3.2.9, FS 1.3.2.10, FS 1.3.2.11).

The Feature, Feature_Location, and Feat_Inst tables allow the representation of features shared by multiple sequences in an alignment. The feature instances contained in different sequences may have different sequences, lengths, consensus properties, or functions (i.e. some may be nonfunctional pseudofeatures). A Super_Id_Link table permits tracking of superseding features.

The Feature_Translation table tracks translations of features and links them to the corresponding amino-acid sequences.

1.12 Representations of sources of materials

SIMS includes a minimal source representation, comprising:

1.12.1 A Source table describing the source, and specifying an external source database and identifier.

1.12.2 A Taxon table summarizing the taxonomic name of the source organism.

1.12.3 A Transl_Table table specifying the genetic code for the source organism and genome.

The Source table allows the specification of a particular tissue, individual, or collection as the source of a clone library, and provides a pointer to an external database more completely describing the source. The Taxon and Transl_Table tables provide the taxonomic information needed for naming and correct amino-acid translation.

2. Management System

The SIMS will employ a commercial relational database management system that supports ad hoc SQL queries joining any tables (FS 3.1, 3.2, 3.3, 3.4). Either Sybase (FS 7.3.1) or Oracle (FS 7.3.2) will be used for the initial implementation, with support for other SQL-compliant relational management systems optional. Subsets of the SIMS schema may also be implemented on additional platforms to facilitate user access or report generation.

VI. Functional Specification, Concept Level (User Interface)

This document specifies functionality of the SIMS user interface with respect to the system-level SIMS functional specification. Parenthetical notations of numbers preceded by "FS" reference system-level specifications (cf. Section III).

1. Conceptual structure

The SIMS 0.1 interface is forms-based, with default values provided by the database wherever possible. Screen flow through the interface is managed by ancillary tables within the database itself. Controlled vocabularies are maintained as resource files readable by the interface.

The representation of laboratory processes maintained by SIMS 0.1 is intended to be human-readable and queryable, but not to support complex inferential computations.

The interface supports pipeline processes only; it does not support conditional branching. This is a design simplification that is adequate in laboratories with well-defined operational processes that can be represented as partially-overlapping pipelines. It is not appropriate for situations in which procedures are under development or otherwise in flux.

2. Specification of projects and steps

A project is defined by specifying a sequence of steps to be performed. A "Project" interface screen supports this process. Steps are selected from a controlled volcabulary of step types (FS 2.1). Figure 3 illustrates the screen form for defining a project.

A step type is defined by a human-readable text description (FS 2.4.1). These descriptions are intended to be written, or transcribed from other sources such as procedure manuals, by expert personnel. The database exerts version control on these descriptions, in that each new description is represented as a new row in the Protocol table with a pointer to the row it supercedes, but no control over the content of the protocol descriptions. Figure 4 illustrates a screen form for defining a step type.

SIMS 0.1 supports a simple precondition constraint mechanism between step types via the Precondition table. Step types are linked to qualified personnel through the Skill table.
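One way such precondition and skill checks might be expressed is sketched below. The column names (condition_id, enabled_id, person, step_type, qualified_date) and both helper functions are assumptions made for illustration; they are not taken from the SIMS 0.1 schema.

```python
import sqlite3


def make_constraint_db():
    """Create hypothetical Precondition and Skill tables in memory."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE Precondition (
            condition_id TEXT NOT NULL,  -- step type that must be done first
            enabled_id   TEXT NOT NULL   -- step type that it enables
        );
        CREATE TABLE Skill (
            person         TEXT NOT NULL,
            step_type      TEXT NOT NULL,
            qualified_date TEXT
        );
        """
    )
    return conn


def unmet_preconditions(conn, step_type, completed):
    """List step types that must finish before step_type may start."""
    rows = conn.execute(
        "SELECT condition_id FROM Precondition "
        "WHERE enabled_id = ? ORDER BY condition_id",
        (step_type,),
    )
    return [r[0] for r in rows if r[0] not in completed]


def qualified_people(conn, step_type):
    """List personnel linked to step_type through the Skill table."""
    rows = conn.execute(
        "SELECT person FROM Skill WHERE step_type = ? ORDER BY person",
        (step_type,),
    )
    return [r[0] for r in rows]
```

A step would be offered to a technician only when unmet_preconditions returns an empty list and the technician appears in qualified_people for that step type.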


3. Completion of steps in a project

As a project is completed by one or more technicians, a sequence of screens is presented that both provide information about available resources and record information about the execution of each step (FS 2.2, 3.1, 4.1). Figures 5 and 6 provide examples of these screens. The order of presentation of these screens is controlled by the Step-Link table. Resources are described by screens that both report resource characteristics and provide fields for input or updates to resource status. Figure 7 provides an example.
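Table-driven screen ordering of this kind can be sketched as follows. The Step_Link layout used here (one row per step, naming its single successor) is a hypothetical simplification, not the SIMS 0.1 table definition. Making the step column the primary key allows at most one successor per step, which mirrors the pipeline-only design: a conditional branch cannot be represented.

```python
import sqlite3


def make_step_link_db(links):
    """Create a hypothetical Step_Link table from (step, next_step) pairs."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE Step_Link ("
        " step TEXT PRIMARY KEY,"      # at most one successor per step
        " next_step TEXT NOT NULL)"
    )
    conn.executemany("INSERT INTO Step_Link VALUES (?, ?)", links)
    return conn


def screen_order(conn, first_step):
    """Follow successor links from first_step and return the screen order."""
    order, seen = [], set()
    step = first_step
    while step is not None:
        if step in seen:
            raise ValueError(f"cycle in Step_Link at step {step!r}")
        seen.add(step)
        order.append(step)
        row = conn.execute(
            "SELECT next_step FROM Step_Link WHERE step = ?", (step,)
        ).fetchone()
        step = row[0] if row else None
    return order
```

An interface built this way would look up the next screen after each step is recorded, so changing the order of a pipeline means editing rows rather than interface code.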

4. Visualization of sequences and analysis results

Sequences managed by SIMS and results of sequence analysis processes are visualized using a browser/editor interface (FS 4.2). This interface may be based on the Genome Sequence DataBase v. 2.2 Annotator (Figure 8), for which requirements, specification, and design documents are available. The interface provides functionality for viewing sequences and annotation at multiple scales ranging from the base sequence scale to the > 1 Mb scale, editing base sequences, and specifying functional annotation and sequence alignments by coordinates.

5. General query support

Ad hoc queries, including queries supporting process quality control, personnel performance assessment, and resource allocation and status, are supported by a general query interface (FS 3.2 - 3.4). This interface may provide some level of form support, while displaying editable SQL statements of entered queries. A variety of "canned" queries that return standard reports on resource or project status are supplied.

Expert users of the database can be expected to master SQL as a general query language. Experiments at a number of sites, including LLNL and TIGR, have shown that laboratory managers and technicians rapidly learn needed SQL given interfaces that display the SQL for standard queries in editable form. Access to the full database for ad hoc queries such as those required for general quality-control analysis will require this type of general-language capability.
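The idea of a "canned" query whose SQL is shown to the user in editable form can be sketched as follows. The table layouts, the query name, and the status values are hypothetical and serve only to illustrate the pattern; they are not taken from the SIMS 0.1 schema.

```python
import sqlite3

# Canned queries are stored as plain SQL text so that an interface can
# display them and let the user edit them before execution.
CANNED_QUERIES = {
    "failures_by_person": (
        "SELECT p.name, COUNT(*) AS failed_steps\n"
        "FROM Step_Instance s JOIN Person p ON p.id = s.person_id\n"
        "WHERE s.status = 'failed'\n"
        "GROUP BY p.name\n"
        "ORDER BY failed_steps DESC, p.name"
    ),
}


def make_demo_db():
    """Create hypothetical Person and Step_Instance tables with sample rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE Person (id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE Step_Instance (
            id INTEGER PRIMARY KEY,
            person_id INTEGER REFERENCES Person(id),
            status TEXT,
            failure_code TEXT
        );
        INSERT INTO Person VALUES (1, 'Ames'), (2, 'Baker');
        INSERT INTO Step_Instance VALUES
            (1, 1, 'failed', 'F01'),
            (2, 1, 'failed', 'F02'),
            (3, 2, 'failed', 'F01'),
            (4, 2, 'complete', NULL);
        """
    )
    return conn


def run_sql(conn, sql):
    """Run SQL text (canned or user-edited); return column headers and rows."""
    cur = conn.execute(sql)
    headers = [col[0] for col in cur.description]
    return headers, cur.fetchall()
```

A user who wants completed rather than failed steps only has to change the status literal in the displayed text and run the edited statement; no new form is needed. A production interface would also need to restrict such free-form statements to read-only access.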


VII. Technical Feasibility and Resource Requirements

Development of the SIMS 0.1 functional specification as outlined in sections III - VI of this document raises a number of issues regarding the technical feasibility and resource requirements for a software system meeting the requirements outlined in section II of this document.

1. Technical feasibility

A relational model appears to provide an adequate basis for meeting the SIMS requirements. The current schema specification does not, however, fully address the requirements for automating control flow through the sequencing process. Doing so will require developing a system for encoding and interpreting machine-readable specifications of projects, and developing an interface that allows this system to control presentation of screens. The current system also does not support the representation of conditional branching in processes due to error conditions, test assays, or other criteria. Providing this capability will require developing both a specification interface and an interpreter for the specifications. The complexity of processes with conditional branches can easily approach that of arbitrary multistate automata, so the required interpreter will have at least the complexity of a context-free language recognizer. Despite considerable effort over the last three decades, the development of reliable recognizers for languages of this complexity in relatively open domains is not straightforward.


A second set of feasibility issues surrounds communication between SIMS and automated laboratory instruments and robots. Fixed programs for robots or other instruments can clearly be stored in machine-readable form in a database. However, these representations will generally only be interpretable by the robot itself (or, laboriously, by a skilled robot programmer). Hence any form of computation over such representations, e.g. for guiding control flow based on what steps a robot’s latest program demands, requires development of a further interpreter that extracts this information from the program. The results obtained by interpreting the stored program would then have to be compared against the actual state of the robot as independently measured. While systems for tracking robot performance in real time are commonplace, they are generally based on the assumption that the robot is executing a fixed protocol. Having a change in the robot’s program effectively reprogram the system that monitors the robot’s performance is, again, a substantial artificial intelligence problem.


These issues are largely, if not completely, obviated in a situation in which the SIMS subserves the operation of a laboratory using a fixed set of instruments and a fixed set of procedures. This will be the case, for example, in high-throughput laboratories that are subject to Good Laboratory Practices approval or a similar pre-approval process for their procedures. Some high-throughput genomics laboratories, especially EST laboratories, fall into this category. Development of a SIMS-like system custom-designed for a fixed set of procedures is straightforward using a design philosophy such as the one adopted here.

2. Resource Requirements

Development of a full-fledged SIMS that supports multiple strategies, even as pre-coded, selectable options, will be a major software effort. A robust, professional implementation of such a system, from the ground up, is likely to be a 20 - 30 programmer-year effort. Development of a system of interpreters to address the flexibility issues discussed above could be expected to at least double this level of effort, and to require personnel skilled in the interface between robotics, natural language understanding, and expert systems.

Development of a less flexible, single-process SIMS for a specific laboratory and set of procedures will be much more manageable: an effort of a few programmer-years. Much of this time will go into interface development; the simpler the interfaces, the shorter the development cycle. If a robust sequence and annotation visualization interface is not required, the interface development could proceed using only low-end tools, and could be done relatively quickly. A robust, multi-platform sequence visualization and editing tool is likely, by itself, to consume four or more programmer-years. Development of a true ad hoc query interface that imposes neither a complex query language nor an overly restrictive object model is similarly difficult; avoiding an interface of this complexity will yield a more developable product.


Figure 1. SIMS Materials Flow Diagram

Flow of protocols and the data they produce, in order:

Prep protocols -> Clone data
Prep protocols -> Template data
Reaction protocols -> Reaction-mix data
Load, Run, Call protocols -> Electropherogram/Raw sequence data
Inspect, Edit, Mask protocols -> Electropherogram/Edited sequence data
Assembly, meta-assembly protocols -> Assembled sequences
Search, Composition, Site, Gene protocols -> Interpreted sequence data
Selection, Distribution protocols -> Distributed sequencing results

Inputs shown alongside the protocol stages: Person, Device, and Materials for the laboratory stages; Person, Program, and Data for the computational stages.


SIMS 0.1 schema diagram (Jan 96)

Link types in the legend: one-to-many, one-to-one, many-to-many, and I/O.

Tables shown: Project-Link, Step-Instance, PCR-Link, Clone-Plate-Link, Plate, Person, Protocol, Primer, Method, Reagent, Device, Library, Component, Component-location, Inst-Type-Rel, Feature, Feat-Inst, Feature-location, Feature Translation, Taxon, Super-id-Link, Sequence, Sequence-Pieces, Sim-Search, Confidence.

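Figure 3. Screen form for defining a project.

Figure 4. Screen form for defining a step type.

Figure 5. Example screen for completing a step in a project.

Figure 6. Example screen for completing a step in a project.

Figure 7. Example resource screen.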