Bioinformatic Strategies: Integrating Data and Knowledge to … · 2017. 5. 18. · Bioinformatic...
Transcript of Bioinformatic Strategies: Integrating Data and Knowledge to … · 2017. 5. 18. · Bioinformatic...
Presented by
Managed by UT-Battelle for the U.S. Department of Energy
Bioinformatic Strategies:
Integrating Data and Knowledge to Improve
Biosurveillance
Bob Cottingham
Biosciences Division Oak Ridge National Laboratory
NDIA Biosurveillance Conference
August 28, 2012
Managed by UT-Battelle for the U.S. Department of Energy
Introduction and Motivation
• Biosurveillance has been primarily based on traditional and manual methods such PCR detection, and intelligence gathering and analysis.
• As science and technology advances there is increasing potential for terrorist engineered threats.
• But also increasing potential for computational means to detect both natural and synthetic threats.
2
Managed by UT-Battelle for the U.S. Department of Energy
What would it take to develop a
standardized biosurveillance system
• Common, Standardized, Scalable
• Computational framework
• Allows rapid, efficient development of new computational analytic methods in a common, integrated data environment.
• Enable common methods for the evaluation and comparison of analytic methods to drive improved performance.
• Provide the basis for computational work environments for biosurveillance analysts.
3
Managed by UT-Battelle for the U.S. Department of Energy
Some relevant topics to consider
• Data intensive computing for analysis of multiple real-time data streams,
• Genomic, transcriptomic, and other -omics analysis of samples,
• Sensor network integration,
• Spatial analysis and visualization,
• Social network analysis
4
Managed by UT-Battelle for the U.S. Department of Energy
Déjà vu all over again
– Yogi Berra
• Next-next gen sequencers e.g. PacBio = 100GB/hr in 2012?
• Is storing and processing data all that’s needed?
• Is computing becoming the bottleneck in research progress?
• Haven’t we been saying this since early 1990s?
• Is Moore’s Law keeping up?
• What about transcriptomics and omics integration?
0
100
200
300
400
500
600
700
800
900
2000 2001 2002 2003 2004 2005 2006
Year
Data
bases
Probes per array
1M
60GB DNA bases in Genbank
Probes tested 1T
Scaling of Bio Data - Both Volume and
Breadth of Data
5
Managed by UT-Battelle for the U.S. Department of Energy
Problem: Discovery of Microbial Pathways
Important to Production of Cellulosic Ethanol
Discover novel microbes
Microbial conversion
of cellulose
Cellulose Digestion Regulon Model
Model microbes and mutants
under relevant stress conditions
Annotation Pipeline
0
100
200
300
400
500
600
700
800
900
2000 2001 2002 2003 2004 2005 2006
Year
Data
bases
Other Bio Databases
High Throughput Assay Data Sets
2nd Gen Sequencing
Microarrays
Mass Spec
Metabolomic
Exponential Growth
PacBio 100GB/hr
Super-Exponential Growth
Scientific Model
Knowledgebase
Data Model
Wiki concept
Conceptual Modeling
Integration
Involve All Researchers
Transition Clusters > HPC 6
Managed by UT-Battelle for the U.S. Department of Energy
Computational Biology & Bioinformatics: Future of Analytical Integration
Microbial Annotation Pipeline
Sequencing
Super-exponential growth: more data to analyze, more often.
Quality assessment of data, analysis tools and results.
Versioning of analysis tools, data and evidence.
Align with all conserved protein domains >1000
Desulfobacula toluolica (AF271773)
U3SP-30 F1SP-21
U3SP-04 U3SU-27
U3SP-09 Desulfosporosinus orientis (AF271767)
Desulfotomaculum putei (AF273032) Desulfotomaculum nigrificans (AY015499)
Desulfotomaculum ruminis (U58118) Desulfotomaculum aeronauticum (AF273033)
Archaeoglobus profundus (AF071499) Archaeoglobus fulgidus (M95624)
U3SP-13 F1SU-14
F1SP-18 F1SU-12
F1SP-34 U3SU-15
U3SU-24 Desulfobacter vibrioformis (AJ250472)
Desulfobacter latus (U58124) F1SU-29
F1SU-06 Desulfonema limicola (U58128)
Desulfococcus multivorans (U58126) Desulfobotulus sapovorans (U58120) Desulfovibrio burkinensis (AB061536)
Desulfovibrio fructosovorans (AB061538) Desulfovibrio vulgaris (U16723)
Desulfovibrio longus (AB061540) Desulfovibrio desulfuricans (AF273034) Desulfovibrio intestinalis (AB061539)
Desulfovibrio simplex (AB061541) Desulfomonile tiedjei (AF334595)
Desulfotomaculum thermobenzoicum (AJ310432) Desulfotomaculum thermoacetoxidans (AF271770) Desulfotomaculum luciae (AY015494) Desulfotomaculum kuznetsovii (AF273031)
Desulfotomaculum thermocisternum (AF074396) Desulfotomaculum geothermicum (AF273029)
Desulfotomaculum thermosapovorans (AF271769) Desulfotomaculum acetoxidans (AF271768)
F1SU-19 F1SP-02
F1SP-04 F1SU-02
F1SP-16 F1SP-03
F1SP-26 F1SU-33
F1SU-32 F1SP-23
Desulfobulbus elongatus (AJ310430) Desulfobulbus propionicus (AF218452)
Desulfobulbus rhabdoformis (AJ250473) F1SU-18
Desulfovirga adipica (AF334591) Thermodesulforhabdus norvegica (AF334597) U3SP-34
U3SP-17 U3SU-08
U3SP-20 Desulfoarculus baarsii (AF334600)
F1SU-17 F1SU-26
U3SP-19 U3SP-11
F1SP-13 F1SP-32
U3SU-03 U3SU-13
Thermodesulfovibrio yellowstonii (AF334599) Thermodesulfovibrio islandicus (AF334599)
0.05 changes
100
53
51 100
100
74
100 100
100
100
82
69
82 76
80
100
100
78
98
77
63
80 100
68
71 96
100
80
100
100
60
100
77 86
87
86
54
60
80
96 100
56
64
99
62
100
83
59 70
100
100 100
100
100
Cluster DSR-4
Cluster DSR-9
Cluster DSR-8
Cluster DSR-2
Cluster DSR-7
Cluster DSR-3
Cluster DSR-6
Cluster DSR-5
Cluster DSR-1
58
Increased data and integration increases computation and storage
Predict protein and function
Correlate unknowns
Extend so researcher can see evidence, quality and history.
Establish open standards for assessing quality and performance of analytical methods.
Lead open source software policy.
Managed by UT-Battelle for the U.S. Department of Energy
• 2008 Workshop Report – Historically projects developed in
isolation resulting in isolated data and methods.
– Vision: Integrating, community informatics resource enabling a broader and more powerful systems biology research effort.
• Objective: Develop an implementation path toward the vision of the DOE Systems Biology Knowledgebase.
Knowledgebase R&D Project
Background and Objective
8
Managed by UT-Battelle for the U.S. Department of Energy
Experimental Design to Validate
Prediction
Managed by UT-Battelle for the U.S. Department of Energy
A Model of the Earth – 15th
century
Managed by UT-Battelle for the U.S. Department of Energy
A Current Model of the Earth
Managed by UT-Battelle for the U.S. Department of Energy
Image with Annotation
Managed by UT-Battelle for the U.S. Department of Energy
User Interface on
Computational Model
Managed by UT-Battelle for the U.S. Department of Energy
Research Function with User Interface
Managed by UT-Battelle for the U.S. Department of Energy
Experimental Design to Validate
Prediction
Managed by UT-Battelle for the U.S. Department of Energy
Could something like that exist for
biosurveillance?
• The ability to collect, communicate and analyze biological-related data is rapidly changing.
• Diverse sources of data can be used in predicting and mitigating an outbreak.
16
Synthetic Biology
DNA
Sequence
• When a bioterrorism event unfolds, we must rapidly collect, analyze, and filter diverse information to enable the best response and decision-making.
• Information could come from search engines like Google, from scientific labs, field detectors, handheld devices, social networks, blogs, over the counter sales, traditional media or virtually any person or entity that shares information on the Web.
Managed by UT-Battelle for the U.S. Department of Energy
Data Intensive Science – The 4th
Paradigm
• Experimental, Theoretical, Computer Simulation, … next Data Intensive.
• Data discipline based on databases, schemas, ontologies – scientific community generally lacks understanding of these topics.
• Data intensive science requires specialized skills and analysis tools.
• Each piece of data needs have its associated ontological and semantic information.
• Search, analysis and reuse is supported by standard vocabularies.
• IT industry building huge “cloud services” with high bandwidth, low cost storage and computing. No prominent bio examples yet.
• Future scientific progress in biology depends on how well the community acquires the necessary expertise in database, workflow, visualization and cloud computing techniques.
17
Managed by UT-Battelle for the U.S. Department of Energy
Scenario Driven Data Modeling SDDM:
1
ID=SCEN:SEQ_1003_3 SEQ=“gaaaatcgcc…”
ID=SCEN:SEQ_1003_2 SEQ=“tctcgacatg…”
3
ID=SCEN:SEQ_1003_1 SEQ=“cttggtcgtc…”
2
4
hasSequence
hasSequence
hasSequence
5
ID=INTERPRO:IPR000871
6
7
ID= INTERPRO: IPR001279
ID= INTERPRO:IPR001036
matches
matches
matches
8
ID=GO:0016787
annotatedBy
annotatedBy
annotatedBy
9
Beta-lactamase
Owl:sameAs
hasOutbreak
10
ID=SCEN:EVENT 1003_1_2 DATE=8/14/2010 DESC= At least 2 Canadians have become infected with a dangerous new "superbug" from India that is spreading around the world, partly due to medical tourism. COOR=-102.041016,55.727112,0.000000
11
hasEvent
12 ID=SCEN:EVENT 1003_1_1 DATE=8/11/2010 DESC=A new "superbug" that is resistant to even the most powerful antimicrobial agents has entered UK hospitals COOR=-2.065430,53.330872,0.000000
13
ID=SCEN:OUTBREAK_1003_2 START DATE=8/13/2010 END DATE=8/14/2010
14
ID=SCEN:EVENT 1003_2_1 DATE=8/13/2010 A Belgian man died from a drug-resistant so-called superbug originating in South Asia, the 1st reported death from the new health threat. COOR= 4.790039,50.359482,0.000000
hasOutbreak
hasEvent
16
ID=PROMED:20100815.2812
hasEvent hasPROMEDRef
hasPROMEDRef 15
ID=PROMED:20100817.2853
hasPROMEDRef
ID=OUTBREAK_1003_1 START DATE=8/11/2010 END DATE=8/14/2010
ID=SCEN:1003
Managed by UT-Battelle for the U.S. Department of Energy
• 2008 Workshop Report – Historically projects developed in
isolation resulting in isolated data and methods.
– Vision: Integrating, community informatics resource enabling a broader and more powerful systems biology research effort.
• Objective: Develop an implementation path toward the vision of the DOE Systems Biology Knowledgebase.
Knowledgebase R&D Project
Background and Objective
19
Managed by UT-Battelle for the U.S. Department of Energy
The Problem
• Biological research projects are becoming larger and more complex, both experimentally and computationally.
• To succeed there is an increasing need to cooperate and standardize.
• Technological advances continue to produce exponentially more and diverse types of data.
• A new kind of computational infrastructure is needed for the overall scientific effort to be productive and successful.
20
Managed by UT-Battelle for the U.S. Department of Energy
Kbase Mission
• To provide a large-scale, open community computational capability for systems biology research data management and analysis.
• Promote openness and sharing of data, code and computational infrastructure.
• Address problems of large scale data management and processing.
• Utilize computational techniques required for community effort at data integration, open development, large scale resource sharing and to meet other research community policies and objectives.
21
Managed by UT-Battelle for the U.S. Department of Energy
How Would Kbase Be Different?
• How would Kbase be different? – Integrate across projects and laboratories
– Implies a community research effort
– More standardized approach
– More mature software engineering approach
– More involvement of non-informatics researchers
• Reference models – technical: – Open source development, e.g. Linux
– iPhone Apps, Google Apps, Facebook Apps
– Google Maps cloud computing with smart phone app
– Wikipedia shared community reference resource
22
Managed by UT-Battelle for the U.S. Department of Energy
Kbase Guiding Principles
• Science drives Kbase development.
• Community effort integrating data and methods across multiple laboratories to improve research productivity.
• Open access – data and methods are available for anyone to use with perhaps a limited embargo policy.
• Open contribution – data and source code managed in an open environment and can be contributed by anyone with an editorial / peer review process.
• Distributed data and methods, Kbase is not a single, centralized, monolithic system.
23
Managed by UT-Battelle for the U.S. Department of Energy
DOE Systems Biology
Knowledgebase
Tools
Meta-
data Data
DOE Systems Biology Knowledgebase
Establishing A Systems Biology Modeling Framework
Open-Access Data and Information Exchange
• Flexible user interfaces • Easy data retrieval • Environment for in silico experimentation
Open Development of Open-Source Software and Tools
• Analysis and visualization • In silico experimentation • Tracking and evaluation of tool use
Community-Wide Stewardship
• User, Standards, and Advisory committees
• Value-added analysis • Training, tutorials, and support
Data generators
Data
users
Software and tool
developers
Seamless Submission and Incorporation of Diverse Data
• Standards for data, metadata • Quality control and assurance • Automated data handling
24
Managed by UT-Battelle for the U.S. Department of Energy
What is a Knowledgebase?
Databases enable the rapid organization and search of data
Knowledgebases should learn a “model” of the data to provide
“conclusions” (hypotheses)
Managed by UT-Battelle for the U.S. Department of Energy
Knowledge = Models
• The Knowledgebase should learn models from data and
human interaction.
• Models and their parts should carry information about
the quality of their data, annotations, and predictions.
• Data, protocols, algorithms and models should be
subject to both calculated and community quality
assessment.
Managed by UT-Battelle for the U.S. Department of Energy
The KBase Infrastructure and Services
Bulk Data Services Persistent Store Central Data Store
Managed by UT-Battelle for the U.S. Department of Energy
Concept: KBase User Experience
Managed by UT-Battelle for the U.S. Department of Energy
Concept: Interactive Community
Knowledge
DOE Office of Science • Office of Biological and Environmental Research
Narra veIdea
ScenarioA
A.1
A.2
ScenarioB
A.1
A.2
Narra veIdea
ScenarioA
A.1
A.2
ScenarioB
A.1
A.2
Narra veIdea
ScenarioA
A.1
A.2
ScenarioB
A.1
A.2
Narra veCodeVersioningandScenarioBranching
HH
H HE
! "##"$%&'() &"'
*+&, "#-. '/ ''
/ 01'
/ 02'
*+&, "#-. '3'
/ 01'
/ 02'
Narra veQueryUpdate
! "##"$%&'() &"'
*+&, "#-. '/ ''
/ 01'
/ 02'
*+&, "#-. '3'
/ 01'
/ 02'
Narra veDataChange
! "##"$%&'() &"'
*+&, "#-. '/ ''
/ 01'
/ 02'
*+&, "#-. '3'
/ 01'
/ 02'
ProjectLinkageandCita on
! "##"$%&'() &"'
*+&, "#-. '/ ''
/ 01'
/ 02'
*+&, "#-. '3'
/ 01'
/ 02'
ENarra veGraphsProvideMeasuresofAc vityandInfluence
Func onandDatacontentscanbetrackedtoassess“use”