Pacific Northwest National Laboratory U.S. Department of Energy DOE Data Workshop View from Information-intensive Applications H. Steven Wiley Biomolecular Systems Initiative Pacific Northwest National Laboratory (www.sysbio.org)
Date posted: 20-Dec-2015

Page 1

Pacific Northwest National Laboratory

U.S. Department of Energy

DOE Data Workshop: View from Information-intensive Applications

H. Steven Wiley
Biomolecular Systems Initiative
Pacific Northwest National Laboratory (www.sysbio.org)

Page 2

Information Intensive Science

Goals of IIS
  • Understanding systems versus individual phenomena
  • Strengthening/automating links between different types of data from different scales

Examples
  • Biology: Cell Signaling
  • Biology: BIRN
  • Chemistry: CMCS
  • Homeland Defense

Complexity of systems is becoming pervasive

Challenges
  • Efficient federation, graph-based queries
  • Continuous data correlation
  • Managing complex experiments, data provenance using multiple independent data and analysis resources

Priorities
  • High-performance federation, data mining, semantic query capabilities (software, hardware architecture)
  • Knowledge environments (lightweight, evolvable, powerful, …)
  • Organization and visualization of large-scale, complex information

Page 3

Combustion is a Multi-scale Chemical Science Challenge

A systems-science approach to address complex problems

New knowledge is assimilated from different data, tools, and disciplines at each scale
  • Real-time bi-directional information flow
  • Deep analysis across scales
  • Multiple applications for the same information

Challenges
  • Data, provenance, annotation publication
  • Syntactic and semantic federation
  • Standardization versus innovation

Examples
  • IUPAC: update of radical thermochemistry reference values by global expert group
  • PrIMe: community developed optimized reaction mechanisms

Guiding experimental plans across scales, providing community resources for applied research

Page 4

Homeland Security: Pulling insight out of information overload

  • Volume of data, orders of magnitude larger and at different levels of abstraction
  • Complexity of information spaces into very high dimensions, 200 the norm
  • Information often out of context, incomplete, fuzzy
  • Deception
  • Information in all media types: text, imagery, video, voice, web, sensor data
  • Time and temporal dynamics fundamentally change the approach
  • Spatial, yet non-spatial abstract data
  • Multiple ontologies, languages, cultures
  • Privacy issues

[Figure: data sources (Immigration, Financial, Sensors, Shipping, Communications) feeding the question "Is there a domestic terrorist plot?"]

Can we detect and prevent a terrorist attack BEFORE it happens?

For homeland security and science we now turn to data-intensive visual analytics

Page 5

Page 6

Systems Biology of Cells

Molecular parameters: protein levels / states / locations / interactions / activities

Cell function: death, proliferation, differentiation, migration, ...

Ultimate aim: Understanding and prediction of effects of component properties

Page 7

Page 8

Page 9

What, Where, Quantity, Quality?

To successfully model a complex biological system, one must minimally know the following information:

  • What parts are being made? (identity)
  • How is the regulatory network structured? (interactions)
  • Where are the proteins located in the cell? (location)
  • What are their levels? (quantity)
  • How do they interact with their partners? (activity)
      • As a function of covalent modification
      • Contribution of steric restrictions
      • Forward and reverse rate constants

Page 10

Cells as Input-Output Systems

  • Biologists look at their experiments as input-output systems
  • We start with a “defined” system to which we apply a stimulus (input: independent variable)
  • We then look for a specific response (output: dependent variable)
  • The relationship between the input and output provides insight into the workings of the system

[Diagram: Input → System → Output, with the label "Unknown context"]

So unless we control the experimental context, we cannot interpret our experiments

Page 11

The Two Greatest Challenges of Systems Biology

1. Working with indeterminate systems

2. Understanding context - what it is and how to control and capture it

Page 12

Defining the composition of living systems is driving analytical technologies

  • Genomics
  • Proteomics
  • Metabonomics
  • Expression profiling
  • Imaging
  • Etc.

All of these technologies seek to rigorously define the composition of living systems

Page 13

Global simultaneous quantitative proteome measurements

[Figure: 2-D display of detected peptides. Dimension one: separation time (LC elution time, 0 to 126 min). Dimension two: accurate mass (m/z, 750 to 1500; MW, 450 to 2,500).]

Capillary LC-FTICR 2-D display of peptides from a yeast soluble protein digest: >160,000 isotopic distributions corresponding to >100,000 polypeptides detected

Proteins identified and quantified using accurate mass and time (AMT) tags

Page 14

High Throughput Proteomics

9.4 Tesla High Throughput Mass Spectrometer
  • 1 experiment per hour
  • 5000 spectra per experiment
  • 4 MByte per spectrum

Per instrument:
  • 20 GBytes per hour
  • 480 GBytes per day

These are based on today's technologies.

  • Time to analyze offsite: 1 week
  • Time to analyze onsite: 48 hours
  • Time to analyze onsite with smart storage: 2 hours
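The per-instrument figures on this slide follow directly from the three acquisition numbers it quotes. A minimal sketch of that arithmetic, assuming decimal units (1 GByte = 1000 MByte) and continuous 24-hour operation; the constant and function names are illustrative and do not come from the talk:

```python
# Acquisition figures quoted on the slide.
EXPERIMENTS_PER_HOUR = 1
SPECTRA_PER_EXPERIMENT = 5000
MBYTES_PER_SPECTRUM = 4

# Assumption: decimal units, 1 GByte = 1000 MByte.
MBYTES_PER_GBYTE = 1000


def gbytes_per_hour() -> float:
    """Raw data volume produced by one instrument in one hour."""
    mbytes = EXPERIMENTS_PER_HOUR * SPECTRA_PER_EXPERIMENT * MBYTES_PER_SPECTRUM
    return mbytes / MBYTES_PER_GBYTE


def gbytes_per_day(hours_running: int = 24) -> float:
    """Daily volume, assuming the instrument runs for `hours_running` hours."""
    return gbytes_per_hour() * hours_running


if __name__ == "__main__":
    print(f"{gbytes_per_hour():.0f} GBytes per hour")
    print(f"{gbytes_per_day():.0f} GBytes per day")
```

Passing a smaller `hours_running` gives the daily volume for an instrument that is not run around the clock.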

Page 15

Integrated, High-throughput Experiments will Generate Enormous Amounts of Data

Experiment templates for a single microbe

class of experiment             | time points | treatments | conditions | genetic variants | biological replication | total biological samples | Proteomics data volume in TB | Metabolite data in TB | Transcription data in TB
simple (scratching the surface) | 10          | 1          | 3          | 1                | 3                      | 90                       | 1.8                          | 1.4                   | 0.009
moderate                        | 25          | 3          | 5          | 1                | 3                      | 1125                     | 22.5                         | 16.9                  | 0.1125
upper mid                       | 50          | 3          | 5          | 5                | 3                      | 11250                    | 225.0                        | 168.8                 | 1.125
complex                         | 20          | 5          | 5          | 20               | 3                      | 30000                    | 600.0                        | 450.0                 | 3
real interesting                | 20          | 5          | 5          | 50               | 3                      | 75000                    | 1500.0                       | 1125.0                | 7.5

Profiling method
  • Proteomics: looking at a possible 6000 proteins per microbe, assuming ~20 GB per sample
  • Metabolites: looking at a panel of 500-1000 different molecules, assuming ~15 GB per sample
  • Transcription: 6000 genes and 2 arrays per sample, ~100 MB

Typically a single significant scientific question takes the multidimensional analysis of at least 1000 biological samples
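The sample counts and data volumes in the experiment templates can be reproduced from the design factors and the per-sample sizes quoted for each profiling method. The sketch below assumes that the total number of biological samples is the product of the five design factors and that 1 TB = 1000 GB; the class and field names are illustrative and not taken from the talk:

```python
from dataclasses import dataclass

# Per-sample sizes quoted on the slide, in GB (~100 MB taken as 0.1 GB).
GB_PER_SAMPLE = {"proteomics": 20.0, "metabolites": 15.0, "transcription": 0.1}


@dataclass(frozen=True)
class ExperimentTemplate:
    name: str
    time_points: int
    treatments: int
    conditions: int
    genetic_variants: int
    replicates: int

    def total_samples(self) -> int:
        # Assumption: a full factorial design, so the factors multiply.
        return (self.time_points * self.treatments * self.conditions
                * self.genetic_variants * self.replicates)

    def volume_tb(self, method: str) -> float:
        # Assumption: decimal units, 1 TB = 1000 GB.
        return self.total_samples() * GB_PER_SAMPLE[method] / 1000


TEMPLATES = [
    ExperimentTemplate("simple", 10, 1, 3, 1, 3),
    ExperimentTemplate("moderate", 25, 3, 5, 1, 3),
    ExperimentTemplate("upper mid", 50, 3, 5, 5, 3),
    ExperimentTemplate("complex", 20, 5, 5, 20, 3),
    ExperimentTemplate("real interesting", 20, 5, 5, 50, 3),
]

if __name__ == "__main__":
    for t in TEMPLATES:
        volumes = ", ".join(
            f"{method}: {t.volume_tb(method):g} TB" for method in GB_PER_SAMPLE
        )
        print(f"{t.name}: {t.total_samples()} samples ({volumes})")
```

Under these assumptions the computed values match the slide once the metabolite column is rounded to one decimal place (for example, 1.35 TB is shown as 1.4).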

Page 16

Page 17

Trey Ideker

The Molecular Interaction Scaffold is Huge

Page 18

Cell Imaging

New multispectral, multidimensional imaging techniques can generate enormous amounts of data

Page 19

Cell Imaging Workflow

Complex set of metadata collected here

Page 20

How Much Data From Imaging?

  • Currently, a high quality image of a single cell field is 4 MB per image, obtained at 4 fps (16 MB/s)
  • Following a cell through one cell cycle is 24 h, or approximately 1.4 TB
  • New hyperspectral microscopes analyzing only 10 wavelengths would generate 7 TB/day
  • Characterizing dynamics of the most abundant set of genes (4000) would require 5.5 PB
  • This is for a single instrument and a single experiment using today’s technology
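The single-cell-field figures can be checked with a few lines of arithmetic. This sketch assumes decimal units (1 TB = 10^6 MB, 1 PB = 1000 TB) and one 24-hour recording per gene for the 4000-gene estimate; the names are illustrative. The hyperspectral figure is not reproduced because it depends on acquisition settings the slide does not state:

```python
# Imaging figures quoted on the slide.
MB_PER_IMAGE = 4
FRAMES_PER_SECOND = 4
CELL_CYCLE_HOURS = 24
GENES = 4000

# Assumption: decimal units.
MB_PER_TB = 1_000_000
TB_PER_PB = 1_000


def mb_per_second() -> int:
    """Sustained acquisition rate for one cell field."""
    return MB_PER_IMAGE * FRAMES_PER_SECOND


def tb_per_cell_cycle() -> float:
    """Volume from following one cell field through a 24 h cell cycle."""
    seconds = CELL_CYCLE_HOURS * 3600
    return mb_per_second() * seconds / MB_PER_TB


def pb_for_all_genes() -> float:
    """Assumption: one full cell-cycle recording per gene."""
    return tb_per_cell_cycle() * GENES / TB_PER_PB


if __name__ == "__main__":
    print(f"{mb_per_second()} MB/s")
    print(f"{tb_per_cell_cycle():.2f} TB per cell cycle")
    print(f"{pb_for_all_genes():.2f} PB for {GENES} genes")
```

Under these assumptions the results agree with the slide's rounded values of 16 MB/s, about 1.4 TB, and about 5.5 PB.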

Page 21

Understanding the influence of cell context is driving experimental and computational biology

  • Cell Signaling
  • Developmental biology
  • Cancer and growth control
  • Host-pathogen interactions
  • Dynamics of microbial communities
  • Cellular responses to stress

Page 22

Computational Modeling Approaches: Diverse Spectrum

[Figure: modeling approaches arranged on a spectrum from SPECIFIED to ABSTRACTED. Approaches shown: differential equations, Markov chains, Boolean models, Bayesian networks, statistical mining. Levels shown: mechanisms, influences, relationships.]

* (including structure)

Page 23

Computer Models Allow Reconstruction of Processes Across Different Scales

[Figure: MODEL DATABASE schema]

  • Levels: Organ (Organ 1 … Organ N), Tissue (… Tissue N), Cell (Model 1 … Cell DataSet N), Species (Species 1 … Species N)
  • Model record: Unique ID, Model Name, Model Descr., Default Par., Default Comp., Timestamp, Security
  • Solution Par.: Input_par ID, React. Rates, Chemical Par., Concen. Val.
  • Geometric Par.: Input_par ID, Value_par
  • Equation Docs.: Input_par ID, Symbolic, Source
  • Compute Par.: Input_par ID, Value_par
  • Initial Conditions: Input_fld ID, Value_par
  • Parameter Docs.: Input_par ID, References, Limits
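One way to picture the model database is as a record type per table. The sketch below is only an illustration under loud assumptions: the field names are loosely adapted from the labels visible in the figure, while the types, the grouping into classes, and the lookup helper are hypothetical and do not describe the actual schema:

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Dict, List, Optional


@dataclass
class ParameterValue:
    """One row of a parameter table (e.g. Geometric Par. or Compute Par.)."""
    input_par_id: str
    value: float


@dataclass
class ParameterDoc:
    """Documentation attached to a parameter (hypothetical types)."""
    input_par_id: str
    references: List[str] = field(default_factory=list)
    limits: Optional[str] = None


@dataclass
class ModelRecord:
    """Top-level model entry; the fields follow the figure's model record."""
    unique_id: str
    model_name: str
    model_descr: str
    timestamp: datetime
    security: str
    # Hypothetical grouping: table name -> rows of that table.
    parameters: Dict[str, List[ParameterValue]] = field(default_factory=dict)
    parameter_docs: List[ParameterDoc] = field(default_factory=list)

    def add_parameter(self, table: str, par: ParameterValue) -> None:
        self.parameters.setdefault(table, []).append(par)

    def lookup(self, table: str, input_par_id: str) -> Optional[ParameterValue]:
        """Return the first matching parameter in `table`, or None."""
        for par in self.parameters.get(table, []):
            if par.input_par_id == input_par_id:
                return par
        return None


if __name__ == "__main__":
    model = ModelRecord("m-001", "Example cell model", "illustrative entry",
                        datetime(2004, 1, 1), "public")
    model.add_parameter("geometric", ParameterValue("radius", 10.0))
    print(model.lookup("geometric", "radius"))
```

Keeping parameters keyed by table name means a new parameter table can be added without changing the record type, which fits the idea of a reference model that is later modified.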

Page 24

Page 25

Page 26

Page 27

Obstacles preventing scientists from utilizing available data

  • Data is distributed across many repositories with various ontologies and data formats
  • Analysis tools do not address integration of heterogeneous data sets
  • Minimal informatics-based analysis tools that support a systems biology approach
  • Collaboration capabilities are too primitive to support shared knowledge among researchers

Page 28

The Challenge for Data Handling is Two-fold

1. Managing the massive amounts of compositional data necessary to define all of the relevant experimental systems

2. Capturing all of the data on the relationships between context, composition and response

Integration of the analytical and experimental methodologies into a single system is necessary to link all of the data in a useful way

Page 29

END

Page 30

Understanding Living Cells

Cell responses are multiphasic

Different classes of stimulants (information) are processed at characteristic time scales

Processing nodes within cells are spatially segregated

Each cell responds independently depending on its specific context

A response generally induces a reprogramming of the cell machinery

To create cell simulations, we must “abstract” this information to create a reference model which can then be modified