Post on 18-Jan-2016
Pathway/Genome Databases and Software Tools
Peter D. Karp, Ph.D.
Bioinformatics Research Group
SRI International
pkarp@ai.sri.com
http://ecocyc.DoubleTwist.com/ecocyc/
SRI International Bioinformatics
Overview
Overview of bioinformatics
Motivations for the EcoCyc project
EcoCyc demo
Description of EcoCyc database and Pathway Tools software
Underlying technologies: Ocelot object database, GKB Editor, X-windows to WWW translator
Definition of Bioinformatics
Computational techniques for management and analysis of biological data and knowledge
Methods for disseminating, archiving, interpreting, and mining scientific information
Motivations for Bioinformatics
Growth in molecular-biology knowledge
Industrialization of biological experimentation
High-throughput biology: genome sequences, gene and protein expression data, protein-protein interaction data, protein 3-D structures, …
Motivations for EcoCyc -- E. coli Encyclopedia
Integrate E. coli information dispersed in the literature
New paradigm of scientific publishing
Model the full metabolic network of an organism
Integrate genomic data with functional data
Develop algorithms for computing with function
Provide a challenging domain for computer-science research
Definitions
A chemical reaction interconverts chemical compounds
An enzyme is a protein that accelerates chemical reactions
A pathway is a linked set of reactions
A conceptual unit of the cell's biochemical machine
Example reaction: A + B = C + D
Example pathway (diagram): A, C, E
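The definitions above can be sketched as a small data model. This is an illustrative Python sketch only, not the EcoCyc schema: the class names `Reaction` and `Pathway`, the `is_linked` check, and the enzyme names are assumptions introduced here, and the second reaction (C to E) is made up to complete the example.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Reaction:
    """A chemical reaction interconverts compounds (hypothetical class)."""
    substrates: frozenset
    products: frozenset
    enzyme: str = ""  # protein that accelerates the reaction, if known


@dataclass
class Pathway:
    """A pathway is a linked set of reactions (hypothetical class)."""
    name: str
    reactions: list = field(default_factory=list)

    def is_linked(self) -> bool:
        # Consecutive reactions count as linked here when a product of
        # one reaction is a substrate of the next.
        return all(
            a.products & b.substrates
            for a, b in zip(self.reactions, self.reactions[1:])
        )


# A + B = C + D, followed by an assumed reaction that converts C into E.
r1 = Reaction(frozenset({"A", "B"}), frozenset({"C", "D"}), enzyme="enzyme-1")
r2 = Reaction(frozenset({"C"}), frozenset({"E"}), enzyme="enzyme-2")
pathway = Pathway("A-to-E", [r1, r2])
```

Representing compounds as sets keeps the linkage test to a single set intersection; a real schema would carry stoichiometry and direction as well.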
Organism-Specific Pathway/Genome Databases
Layer functional information above the genome
Rich ontology to encode biological information with high fidelity
Chromosomes, genes, operons, gene products, reactions, pathways
Curated by experts for that organism
Integrate literature and computational predictions
Pathway Tools Software
Pathway/Genome Navigator: WWW publishing of PGDBs; graphic depictions of pathways, chromosomes, operons; pathway visualization of gene-expression data
Pathway/Genome Editors: distributed curation of genome annotations; distributed object database system; interactive editing tools
PathoLogic: prediction of metabolic network from genome
EcoCyc = E. coli Dataset + Pathway/Genome Navigator
Genes: 4,393
Gene Products: 4,393
Reactions: 1,117
Pathways: 158
Metabolic Network
Compounds: 1,887
http://ecocyc.DoubleTwist.com/ecocyc/
Operons: 375
EcoCyc
Collaborative development via Internet:
Karp -- Bioinformatics architect
Riley -- Metabolic pathways, signal transduction
Saier and Paulsen -- Transport
Collado -- Regulation of gene expression
Ontology of 1,000 biological classes
14,000 instances
Over 2,600 registered users
Pathway Tools Software
Pathway/Genome Databases
Pathway/Genome Navigator
PathoLogic Pathway Predictor
Pathway/Genome Editors
Creation of the Overview Graph
Run layout algorithms on individual pathway graphs: automatically determine topology of pathway graph; apply associated layout algorithm (linear, circular, tidy tree)
Use superpathways to create hierarchical layouts: treat each individual pathway as a single node; pathway connections are edges; run appropriate layout algorithm
Manually position the resulting pathway clusters
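The step "determine topology, then apply the associated layout algorithm" can be illustrated with a small classifier. This is a minimal sketch under stated assumptions, not the Pathway Tools implementation: the function names `classify_topology` and `choose_layout`, the treatment of the pathway as an undirected graph, and the "generic" fallback are all hypothetical.

```python
from collections import defaultdict


def classify_topology(edges):
    """Classify an undirected graph as linear, circular, tree, or other."""
    neighbors = defaultdict(set)
    for a, b in edges:
        neighbors[a].add(b)
        neighbors[b].add(a)
    nodes = set(neighbors)
    if not nodes:
        return "other"

    # Depth-first search to check that the graph is connected.
    seen, stack = set(), [next(iter(nodes))]
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(neighbors[node] - seen)
    if seen != nodes:
        return "other"

    n_edges = sum(len(v) for v in neighbors.values()) // 2
    degrees = [len(neighbors[n]) for n in nodes]
    if n_edges == len(nodes) - 1:          # connected and acyclic
        return "linear" if max(degrees) <= 2 else "tree"
    if n_edges == len(nodes) and all(d == 2 for d in degrees):
        return "circular"                  # a single cycle
    return "other"


# Topology -> layout algorithm named on the slide.
LAYOUTS = {"linear": "linear", "circular": "circular", "tree": "tidy tree"}


def choose_layout(edges):
    # Fall back to a generic layout name (an assumption) for other shapes.
    return LAYOUTS.get(classify_topology(edges), "generic")
```

The same function could be applied one level up, to a graph whose nodes are whole pathways and whose edges are pathway connections, which is the hierarchical use the slide describes for superpathways.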
Inference of Metabolic Pathways
PathoLogic inputs: annotated genome (structured ASCII text file: list of genes/ORFs, list of gene products, DNA sequence); MetaCyc
PathoLogic outputs: Pathway/Genome Database (genomic map, genes, gene products, reactions, pathways, compounds, metabolic network); reports
Summary of H. pylori Analysis
For 121 E. coli pathways, what is the evidence that each pathway occurs in H. pylori?
Strong evidence: 41
Medium evidence: 29
Little or no evidence: 51
31 reactions catalyzed by H. pylori but not by E. coli
H. pylori has partial abilities to synthesize cofactors and amino acids, extremely limited carbohydrate catabolism, some amino acid utilization, and a reductive citric-acid pathway
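One way to picture how a pathway might be graded as strong, medium, or little evidence is to look at the fraction of its reactions that have a predicted enzyme in the target genome. The sketch below is an assumption made for illustration: the slides do not state the scoring rule PathoLogic uses, and the function name `pathway_evidence` and both thresholds are hypothetical.

```python
def pathway_evidence(pathway_reactions, catalyzed_reactions,
                     strong=0.75, medium=0.4):
    """Grade a pathway by the fraction of its reactions that are catalyzed.

    The thresholds are illustrative assumptions, not published values.
    """
    if not pathway_reactions:
        return "little or no evidence"
    present = sum(1 for r in pathway_reactions if r in catalyzed_reactions)
    fraction = present / len(pathway_reactions)
    if fraction >= strong:
        return "strong"
    if fraction >= medium:
        return "medium"
    return "little or no evidence"


# Hypothetical reference pathway and predicted reaction set.
reference_pathway = ["rxn-1", "rxn-2", "rxn-3", "rxn-4"]
predicted = {"rxn-1", "rxn-2", "rxn-3"}
grade = pathway_evidence(reference_pathway, predicted)
```

The three categories reported on the slide account for all pathways examined: 41 + 29 + 51 = 121.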
Microbial Pathway/Genome DBs
Literature-based Datasets:
MetaCyc
Escherichia coli
PathoLogic-based Datasets:
Bacillus subtilis
Mycobacterium tuberculosis
Helicobacter pylori
Haemophilus influenzae
Mycoplasma pneumoniae
Treponema pallidum
Chlamydia trachomatis
Saccharomyces cerevisiae
Pathway Tools Software Architecture
Implemented in Common Lisp
WWW server runs as a single Unix process with a separate thread to service each query
Grasper-CL graph manager
Ocelot object database
GKB Editor schema-driven editor
EcoCyc WWW Server
Pathway Tools Architecture -- Development Configuration
Ocelot DBMS
GFP API
Pathway/Genome Navigator
WWW Server
X-Windows Graphics
Object Editor
Pathway Editor
Reaction Editor
Oracle
Ocelot Database System
Object Database Manager
Persistence via filesystem or relational DBMS
Demand and background faulting of objects from RDBMS
Two-level object caching
Extensive bioinformatics schema
Stored transaction history: inspect object history
Ocelot Knowledge Server Architecture
Frame data model
Persistent storage via: disk files, Oracle DBMS
Optimistic concurrency-control protocol
Schema evolution
Logging facility
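Optimistic concurrency control in general lets editors work without holding locks and checks for conflicts only at commit time. The sketch below shows that general idea with per-frame version numbers; it is not a description of Ocelot's actual protocol, and `VersionedStore`, `ConflictError`, and the frame id `gene-1` are hypothetical names.

```python
class ConflictError(Exception):
    """Raised when another writer committed a newer version first."""


class VersionedStore:
    def __init__(self):
        self._data = {}  # frame id -> (version, value)

    def read(self, frame_id):
        # Returns (version, value); unknown frames start at version 0.
        return self._data.get(frame_id, (0, None))

    def commit(self, frame_id, read_version, new_value):
        # Validate at commit time: no locks are held while the user edits.
        current_version, _ = self.read(frame_id)
        if current_version != read_version:
            raise ConflictError(
                f"{frame_id} changed since version {read_version}")
        self._data[frame_id] = (current_version + 1, new_value)
        return current_version + 1


store = VersionedStore()
version, _ = store.read("gene-1")           # two editors both read version 0
new_version = store.commit("gene-1", version, {"map-position": 10.0})
try:
    store.commit("gene-1", version, {"map-position": 11.0})  # stale read
    conflict_detected = False
except ConflictError:
    conflict_detected = True
```

The second commit fails because it was based on a version that is no longer current; the editor would re-read the frame and retry.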
The Frame Data Model
Frames are of two types: classes, instances
Frames have slots that define their properties, attributes, relationships
A slot has one or more values
Each value can be any Lisp datatype
Slotunits define metadata about slots: domain, range, inverse; collection type, number of values, value constraints
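The frame data model described above can be sketched in a few lines. This is a Python illustration rather than Ocelot's Lisp implementation, and every name in it (`Frame`, `SLOTUNITS`, `check_slot`, the `Genes` class, the `map-position` metadata) is an assumption chosen for the example.

```python
class Frame:
    """A frame is either a class or an instance and holds slots."""

    def __init__(self, name, kind, parent=None):
        if kind not in ("class", "instance"):
            raise ValueError("kind must be 'class' or 'instance'")
        self.name = name
        self.kind = kind
        self.parent = parent  # the class this frame belongs to
        self.slots = {}       # slot name -> list of values

    def put_slot_values(self, slot, values):
        self.slots[slot] = list(values)

    def get_slot_values(self, slot):
        return self.slots.get(slot, [])


# Slot metadata ("slotunits"), keyed by slot name; contents are hypothetical.
SLOTUNITS = {
    "map-position": {"domain": "Genes", "range": float, "max_values": 1},
}


def check_slot(frame, slot):
    """Return True when the slot's values satisfy its slotunit metadata."""
    unit = SLOTUNITS[slot]
    values = frame.get_slot_values(slot)
    return (len(values) <= unit["max_values"]
            and all(isinstance(v, unit["range"]) for v in values))


genes = Frame("Genes", "class")
gene = Frame("gene-1", "instance", parent=genes)
gene.put_slot_values("map-position", [12.5])
```

Keeping slot metadata in its own table mirrors the idea that slotunits describe slots independently of any one frame.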
Inference Capabilities
Inheritance of defaults
Slot values computed via attached procedures
Maintenance of inverse relationships
Constraint system: deferred evaluation; tolerant of nonconformant data
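Two of the capabilities listed above, inheritance of defaults and maintenance of inverse relationships, can be shown with a small self-contained sketch. It is an illustration under assumptions, not Ocelot code: `Frame`, `INVERSES`, `add_slot_value`, and the slot names `product`, `gene`, and `location` are hypothetical.

```python
class Frame:
    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.slots = {}  # slot name -> list of values

    def get_slot_values(self, slot):
        # Inheritance of defaults: fall back to the parent class when
        # this frame has no local values for the slot.
        frame = self
        while frame is not None:
            if slot in frame.slots:
                return frame.slots[slot]
            frame = frame.parent
        return []


# Hypothetical inverse declarations: slot -> inverse slot.
INVERSES = {"product": "gene", "gene": "product"}


def add_slot_value(frame, slot, value):
    """Add a value and keep the declared inverse slot in step."""
    frame.slots.setdefault(slot, []).append(value)
    inverse = INVERSES.get(slot)
    if inverse is not None and isinstance(value, Frame):
        back = value.slots.setdefault(inverse, [])
        if frame not in back:
            back.append(frame)


proteins = Frame("Proteins")
proteins.slots["location"] = ["cytoplasm"]   # class-level default
protein = Frame("protein-1", parent=proteins)
gene = Frame("gene-1")
add_slot_value(gene, "product", protein)     # also fills protein's "gene" slot
```

After the last call, the protein frame points back at the gene without any separate update, and it reports the default location inherited from its class.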
Storage System Architecture
Oracle KBs
DBMS is submerged within the FRS (frame representation system)
Relational schema is domain independent, supports multiple KBs simultaneously
Frames transferred from DBMS to Ocelot: on demand; by background prefetcher
Memory cache
Persistent disk cache to speed performance via Internet
Frame Faulting
(get-slot-value gene 'map-position)
Gene present in in-memory object cache?
Gene present in cache on local disk?
Query Oracle DBMS
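The lookup order above (memory cache, then local disk cache, then the DBMS) can be sketched as follows. This is a Python illustration with plain dictionaries standing in for the disk cache and the Oracle DBMS; `FaultingStore` and its method names are assumptions, not Ocelot's API.

```python
class FaultingStore:
    """Look a frame up in memory, then on local disk, then in the DBMS."""

    def __init__(self, disk_cache, dbms):
        self.memory = {}              # in-memory object cache
        self.disk_cache = disk_cache  # stands in for the cache on local disk
        self.dbms = dbms              # stands in for the Oracle DBMS
        self.dbms_queries = 0         # counts how often the DBMS is reached

    def get_frame(self, frame_id):
        if frame_id in self.memory:
            return self.memory[frame_id]
        if frame_id in self.disk_cache:
            frame = self.disk_cache[frame_id]
        else:
            self.dbms_queries += 1
            frame = self.dbms[frame_id]
            self.disk_cache[frame_id] = frame  # keep a local copy
        self.memory[frame_id] = frame
        return frame

    def get_slot_value(self, frame_id, slot):
        return self.get_frame(frame_id)[slot]


dbms = {"gene-1": {"map-position": 42.0}}
store = FaultingStore(disk_cache={}, dbms=dbms)
first = store.get_slot_value("gene-1", "map-position")   # faults from the DBMS
second = store.get_slot_value("gene-1", "map-position")  # served from memory
```

Only the first access reaches the stand-in DBMS; later accesses are answered from the faster caches.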
Logging
Oracle DBMS stores: the latest version of each frame; a history of all OKBC operations applied to KB
Reconstruct earlier versions of KB
View history of changes to an object
Update replicates
Concurrency control
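Storing both the latest frames and the full operation history makes it possible to rebuild earlier states by replaying the log. The sketch below illustrates that idea only; `LoggedKB`, its methods, and the single logged operation type are hypothetical and much simpler than a real OKBC operation log.

```python
class LoggedKB:
    """Keep the latest frames plus a log of every operation applied."""

    def __init__(self):
        self.frames = {}  # latest version of each frame
        self.log = []     # ordered (frame_id, slot, value) operations

    def put_slot_value(self, frame_id, slot, value):
        self.frames.setdefault(frame_id, {})[slot] = value
        self.log.append((frame_id, slot, value))

    def version_at(self, n_operations):
        """Rebuild the KB as it was after the first n_operations."""
        frames = {}
        for frame_id, slot, value in self.log[:n_operations]:
            frames.setdefault(frame_id, {})[slot] = value
        return frames

    def history(self, frame_id):
        """List the operations that touched one frame, oldest first."""
        return [op for op in self.log if op[0] == frame_id]


kb = LoggedKB()
kb.put_slot_value("gene-1", "map-position", 10.0)
kb.put_slot_value("gene-2", "map-position", 55.0)
kb.put_slot_value("gene-1", "map-position", 12.0)
earlier = kb.version_at(2)
```

The same log could be replayed against another copy of the KB, which is one way to read "update replicates" on the slide.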
Schema Management
FRSs store and process class and instance information similarly
Applications can query schema information as easily as they can query instances
GKB Editor
Browser and editor for KBs and ontologies
Four editing tools
GKB Editor reusable with multiple FRSs: all database queries via OKBC/GFP API; interoperability achieved with Ocelot, LOOM, Ontolingua
All operations are schema driven
http://www.ai.sri.com/~gkb/overview.html
Editors
Taxonomy editor
Frame editor
Relationships editor
Spreadsheet editor
Results
Ocelot in use in the EcoCyc project for 5 years
Supports collaborative development of EcoCyc by four groups in North America
Distributed architecture
GKB Editor in active use
Supports development of 8 Pathway/Genome Databases
Summary
Pathway/Genome Databases
Pathway Tools software: extract pathways from genomes; distributed curation tools; query, visualization, WWW publishing; analysis algorithms
Computer Science Results
Extend scalability and multiuser access for knowledge representation systems
Reusable, schema-driven KB editor
Hierarchical graph layout algorithms
Dynamic translation from X-windows to HTML+GIF
Importance of ontologies and of content: Discovery = Algorithm + Database
Problem Solving Depends on Algorithms and Content
Database Size and Quality
Solution Quality
Algorithm Quality
Compute Time
Bioinformatics Results: Content
The EcoCyc database describes the full metabolic map of an organism
The MetaCyc database describes over 300 metabolic pathways
Ontology spans genome to pathway information
Bioinformatics Results: Algorithms
Software environment for genome and pathway information: query and visualization; distributed database development
PathoLogic algorithm predicts the metabolic network of an organism from its genome
Algorithms under development for qualitative modeling of the cell
Acknowledgements
Funding sources: NIH National Center for Research Resources
Collaborators: Monica Riley, Marine Biological Laboratory; Milton Saier, UC San Diego; Julio Collado, UNAM; Christos Ouzounis, European Bioinformatics Institute
Peter D. Karp, Ph.D.
http://www.ai.sri.com/pkarp/
pkarp@ai.sri.com