GUS 3.0: Implementation and Dependencies June 19, 2002 Jonathan Crabtree [email protected].

GUS 3.0: Implementationand Dependencies

June 19, 2002

Jonathan Crabtree

[email protected]

Outline

Schema "implementation" what's done, what's not

Dependencies data migration and testing other tasks

Database implementation details design decisions and implications production & development dbs

Future work/current schema issues

Implementation

Implemented so far: The schema itself (Pinney) Updated Perl object layer (Brunk) Revised GUS Application (GA) code (Schug) Preliminary version of allgenes interface (Fischer)

Not yet implemented: Extensive testing of schema, interface, objects, etc. Data migration from GUSdev to GUS 3.0

Migration Dependencies

Instantiate/finalize GUS 3.0 schema (Pinney) Upgrade database server operating system Install and configure new RAID device Write scripts to migrate existing data

Resolve any remaining inconsistencies

Freeze access to database Annotator's interface (Diskin, Mazzarelli) Current allgenes update (Pinney, Fischer)

Migration Dependencies cont.

Run scripts to migrate all existing data Fix any problems that arise

Begin to "certify" plugins as 3.0-compliant Discuss: how much does the GUS 3.0 schema

"implementation" depend on our data migration? In other words, the 3.0 schema can be viewed as

implemented but untested. Conflict with PlasmoDB final release date?

Migration Highlights

Two "namespaces" (Oracle schemas) to five: GUSdev,RADdev => Core,DoTS,SRes,RAD3,TESS Certain tables are now shared in Core, SRes Avoid primary key conflicts by reloading RAD3

Restructuring of DoTS "central dogma" tables: GeneInstance, RNAInstance, ProteinInstance Also GO terms, new LOE and Complex tables

Other pervasive changes: e.g. ExternalDatabase => ExternalDatabaseRelease

Other Tasks

Script(s) to automate schema creation: schemas (in the Oracle sense) tables sequences (to generate primary key values) views "bootstrap" rows populate other tables as desired? (Anatomy, etc.) constraints indexes GRANT permissions as desired

Other Tasks II

Complete schema documentation Convert plugins to new schema as needed

Remove site-specific dependencies or standardize e.g., hard-coded references to specific external_db_ids

Particularly for data loading plugins make it easier to load and display sample dataset

Formalize schema development process Which changes are "major" or "minor"? Automatically determine which plugins are affected?

Database Design Decisions

GUS vs. other "plain" relational databases: 1. subclassing (extra views) 2. [blame] tracking/access control (extra columns) 3. versioning (extra tables)

Minimal reliance on database-specific features no stored procedures no server-side Java no object-relational tables

Generic links and naming conventions

1. Subclassing With Views

Advantages conceptual clarity straightforward to query the superclasses schema evolution; views are easier to change

Implications large tables (number of columns and rows) complicates query optimization (number of rows) slows row accesses (number of columns)

Subclassing cont.

Query optimization issues Cost-based query optimization requires statistics Confounded by coexistence of subclasses in table Bigger tables make the worst case worse

Physical I/O issues Any row access must read the entire row, including a

potentially large set of irrelevant column values Also increases the likelihood of chaining

Subclassing - alternatives

Use views for the superclass not the subclasses? Isolates subclasses from one another more Requires changing tables rather than views Superclass view will be a large SQL UNION Queries likely less efficient over superclasses

Keep existing system, but use partitions to specify physical placement of subclass rows Solves some, but not all of the problems

"Large" Tables I

GUS: indexes=25G tables=~100G NASequenceImp = 11G AssemblySequenceVer = 8.6 G SimilaritySpan = 8G / 74 million rows Similarity = 4.8 G / 38 million rows AssemblySequence = 3.6G Evidence = 3G / 36 million rows SimilarityVer = 2.8G

Approximately 10-20 quite large tables

"Large" Tables II

Tables with largest average row length: GeneMapVer = 811 bytes NASequenceImp = 747 bytes AssemblySequence = 521 bytes

Tables with the most chained rows: NASequenceImp = 384,524 rows AssemblySequenceVer = 56,873 rows AssemblySequence = 21,458 rows

2. Tracking/Access Control

Advantages: Enables DBA to disburse wrath appropriately Aids in correcting errors

Disadvantages: Extra columns have foreign key constraints Several small tables become bottlenecks for certain

DDL and database update operations Access controls not fully implemented

where and how should they be implemented?

Tracking II

3. Versioning

Advantages: required for complete tracking

Disadvantages: space overhead, results in slower updates requires application-level code to implement may be unnecessary in some DBMSs currently not used uniformly

Different versions coexisting e.g., PlasmoDB

Development => Production

nemesis/8i (GUS) and erebus/9i (GUSdev) Release cycle based on whole-database copy Uses Oracle IMPORT and EXPORT utilities:

EXPORT over network to flat files change owner/schema name change physical placement of tables, indexes

Alternatives: transportable tablespaces, SQL-based copy

Future Work

Issues with current schema from PlasmoDB free text searching (and use of CLOB values) more sophisticated schema for tracking session-

oriented data (more on this tomorrow) supporting queries for genome browser(s)

GUS 3.0: Implementation and Dependencies June 19, 2002 Jonathan Crabtree [email protected].

Documents

Transcript of GUS 3.0: Implementation and Dependencies June 19, 2002 Jonathan Crabtree [email protected].