Data Management in the DOE Genomics:GTL Program Janet Jacobsen and Adam Arkin Lawrence Berkeley National Laboratory University of California, Berkeley
Date posted: 19-Dec-2015


Data Management in the DOE Genomics:GTL Program

Janet Jacobsen and Adam Arkin

Lawrence Berkeley National Laboratory
University of California, Berkeley


Topics (talk or handout)

• Basic facts about the Genomics:GTL Program
• Goals of the GTL Program
• Experimental data generated by GTL
• Laboratory methods
• Data management challenges, requirements, and needs
• Survey on Data Standards, Data Sharing, and Data Management – if time
• Overall Recommendations

Lawrence Berkeley National Laboratory University of California 2


Genomics:GTL Program

• Genomes to Life renamed Genomics:GTL
• One of three DOE genome programs
• First funding awards in July 2002
• Plan to fund and develop four user facilities
  – Production and Characterization of Proteins
  – Whole Proteome Analysis
  – Characterization and Imaging of Molecular Machines
  – Analysis and Modeling of Cellular Systems


Goals of the GTL Program

Microbes are ubiquitous and have adapted to practically every environmental niche on earth. Some live and thrive in conditions generally thought to be inhospitable to life.

GTL plans to study microbes and microbial communities that may be helpful in
• energy generation,
• environmental cleanup,
• carbon sequestration.


Categories of Experimental Data

• Biomass production
• Genomic
  – sequence and annotate the microbe’s genome
• Transcriptomic
  – study transcription under different conditions
• Proteomic
  – what proteins are present and at what levels
• Metabolomic
  – what metabolites are present
• and others…


Laboratory Methods

• Biomass production
  – cell culture
• Transcriptomic (HTP)
  – microarrays
• Proteomic (HTP)
  – 2D gels, mass spectrometry
• Metabolomic (~HTP)
  – mass spectrometry, NMR


Data Volume and Complexity

Example: mass spectrometry
• mass spec used to identify proteins
• raw data analyzed to get peak list
• peak list used to identify peptides
• database search to identify proteins from peptides

Volume:
• size of raw data set per experiment ~ 10 GB
• multiple experiments per __/per organization
• use FedEx to ship disk drives

Complexity: see PEDRo UML class diagram on next slide


[Diagram: raw data → peak list → peptides → proteins]
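The chain on this slide (raw data, then peak list, then peptides, then proteins) can be sketched as a toy pipeline. This is only an illustration under loud assumptions: the spectrum, the intensity threshold, the mass tolerance, and both lookup tables are invented, and matching single masses is a deliberate simplification of what a real search engine does.

```python
# Toy illustration of: raw data -> peak list -> peptides -> proteins.
# Every number and name below is hypothetical, not taken from a GTL instrument.

def pick_peaks(spectrum, min_intensity):
    """Return m/z values that are local intensity maxima above a threshold."""
    peaks = []
    for i in range(1, len(spectrum) - 1):
        mz, intensity = spectrum[i]
        if (intensity >= min_intensity
                and intensity > spectrum[i - 1][1]
                and intensity > spectrum[i + 1][1]):
            peaks.append(mz)
    return peaks


def match_peptides(peaks, peptide_masses, tolerance):
    """Return names of peptides whose reference mass lies within tolerance of a peak."""
    found = []
    for name, mass in peptide_masses.items():
        if any(abs(mz - mass) <= tolerance for mz in peaks):
            found.append(name)
    return sorted(found)


def identify_proteins(peptides, protein_db):
    """Count, per protein, how many of its known peptides were observed."""
    observed = set(peptides)
    hits = {}
    for protein, known_peptides in protein_db.items():
        count = len(known_peptides & observed)
        if count:
            hits[protein] = count
    return hits


# Made-up raw spectrum as (m/z, intensity) pairs.
RAW = [(500.0, 2), (500.3, 40), (500.6, 3),
       (742.1, 5), (742.4, 55), (742.7, 4),
       (900.0, 1), (900.2, 8), (900.4, 2)]
PEPTIDE_MASSES = {"PEP_A": 500.3, "PEP_B": 742.4, "PEP_C": 900.2, "PEP_D": 1200.0}
PROTEIN_DB = {"PROT_1": {"PEP_A", "PEP_B"}, "PROT_2": {"PEP_C", "PEP_D"}}

peak_list = pick_peaks(RAW, min_intensity=10)
peptides = match_peptides(peak_list, PEPTIDE_MASSES, tolerance=0.05)
proteins = identify_proteins(peptides, PROTEIN_DB)
print(peak_list, peptides, proteins)
```

Each stage discards information from the one before it, which is one reason the raw data (about 10 GB per experiment on this slide) still has to be kept and moved around.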


[Figure: PEDRo UML class diagram]


Data Management Challenges

1. INTEGRATING DATA FROM DIVERSE SOURCES IS THE KEY TO GTL’S SUCCESS
   diverse = different laboratory methods, different organizations, different aspects of cellular functions/pathways
2. CAPTURING METADATA IS VERY IMPORTANT
3. In the future, we must be able to process LARGE numbers of LARGE data sets

Item 3 is important, but not as important as items 1 and 2. We have to address those first.


Why is Data Integration So Important to the GTL Program?

Experimental data will be used to build models of cellular pathways, i.e., what goes on inside of the cell. Different types of data contribute to building different aspects of the model (response to environmental conditions, growth phases, etc.). Think of building a pathway as an inverse problem.

In addition, experimental data are used to verify models.
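To make the "inverse problem" framing concrete, here is a minimal sketch. The model is a stand-in chosen for brevity (first-order decay of a single quantity), not a GTL pathway model, and the function names and numbers are assumptions. The forward direction predicts observations from a known parameter; the inverse direction recovers the parameter from observations.

```python
import math


def forward(x0, k, times):
    """Forward problem: predict x(t) = x0 * exp(-k * t) for a known rate k."""
    return [x0 * math.exp(-k * t) for t in times]


def estimate_rate(times, values):
    """Inverse problem: recover k as the negated least-squares slope of
    log(value) versus time. Assumes all values are positive."""
    logs = [math.log(v) for v in values]
    n = len(times)
    t_mean = sum(times) / n
    y_mean = sum(logs) / n
    num = sum((t - t_mean) * (y - y_mean) for t, y in zip(times, logs))
    den = sum((t - t_mean) ** 2 for t in times)
    return -num / den


# Hypothetical time course generated from a "true" rate, then inverted.
times = [0.0, 1.0, 2.0, 3.0, 4.0]
observed = forward(100.0, 0.5, times)
k_hat = estimate_rate(times, observed)
print(round(k_hat, 6))
```

With noise-free data the estimate matches the rate used to generate it. Real measurements are noisy and come from several methods at once, which is where integration and uncertainty capture start to matter.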


Why Are Metadata So Important to the GTL Program?

We need to capture not only sample treatment (e.g., heat shock, oxygen stress), but all of the conditions under which an experimental analysis was performed. Otherwise we cannot compare the results from different experiments. We want to investigate how the same organism responds to different conditions, and how different organisms respond to the same condition. We also want to capture uncertainty.
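One way to picture this is a metadata record attached to every experiment, plus a check that reports which conditions differ between two experiments. The sketch below is hypothetical: the field names, the example values, and the choice to store uncertainty as a standard deviation are all assumptions, not a GTL schema.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class ExperimentMetadata:
    # All field names are illustrative assumptions.
    organism: str
    treatment: str          # e.g. "heat shock", "oxygen stress"
    method: str             # e.g. "microarray"
    temperature_c: float
    growth_phase: str
    measurement_sd: float   # captured uncertainty, here a standard deviation


def differing_fields(a, b, ignore=("measurement_sd",)):
    """List the condition fields on which two experiments differ."""
    da, db = asdict(a), asdict(b)
    return sorted(k for k in da if k not in ignore and da[k] != db[k])


heat = ExperimentMetadata("Organism A", "heat shock", "microarray", 30.0, "mid-log", 0.10)
oxygen = ExperimentMetadata("Organism A", "oxygen stress", "microarray", 30.0, "mid-log", 0.12)
other = ExperimentMetadata("Organism B", "heat shock", "microarray", 37.0, "mid-log", 0.08)

print(differing_fields(heat, oxygen))  # same organism, different condition
print(differing_fields(heat, other))   # different organism, but temperature also changed
```

If the list contains only the variable under study, the two results can be compared directly. If it contains anything else, as in the second call, the comparison is confounded, and that can only be detected when the conditions were recorded in the first place.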


Other Data Management Needs

All of the usual ones…
• secure access
• storage of large volumes of data
• data archives
• data provenance

plus one wrinkle… “staging of data access and management”.


Staging of Data Access/Management

Stage 1: data are collected and go through QA/QC within the lab producing the data – manage data locally.
Stage 2: data are shared with other project collaborators – transport data and/or provide restricted access.
Stage 3: data are published and move into the public domain – provide community-wide access to data.
Stage 4: data are archived – need to provide safe storage from which the data can be retrieved.
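The four stages can be read as a small state machine in which the set of readers widens as data move forward. The sketch below is a hypothetical reading: the role names are invented, and the assumption that archived data keep the same readers as published data is ours, since the slide only asks for safe, retrievable storage.

```python
from enum import IntEnum


class Stage(IntEnum):
    LOCAL = 1       # collected and QA/QC'd within the producing lab
    SHARED = 2      # shared with project collaborators
    PUBLISHED = 3   # public domain, community-wide access
    ARCHIVED = 4    # safe long-term storage


# Hypothetical roles allowed to read data at each stage.
READERS = {
    Stage.LOCAL: {"producing_lab"},
    Stage.SHARED: {"producing_lab", "collaborator"},
    Stage.PUBLISHED: {"producing_lab", "collaborator", "public"},
    Stage.ARCHIVED: {"producing_lab", "collaborator", "public"},  # assumption
}


def can_read(role, stage):
    """True if the given role may read data held at the given stage."""
    return role in READERS[stage]


def advance(stage):
    """Move a data set to the next stage; archived data stay archived."""
    if stage is Stage.ARCHIVED:
        raise ValueError("data set is already archived")
    return Stage(stage + 1)


print(can_read("collaborator", Stage.LOCAL), can_read("collaborator", Stage.SHARED))
print(advance(Stage.SHARED).name)
```

Keeping the stage as an explicit attribute of each data set means the same storage system can apply different access and transport rules per stage instead of relying on separate ad hoc copies.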


Survey on Data Standards, Data Sharing, and Data Management

• Follow up to work by the GTL Data Standards Working Group

• Link to survey mailed to registrants for GTL Program Workshop

• 50+ respondents – mostly experimental biologists – 26 from nat’l labs, 16 from universities, 8 from other organizations

• See handout for summary of survey results


Survey Results

• Most common data ‘format’ (78%): spreadsheet
• Most common measurement type (70%): image
• Few respondents are using any data standard.
• FCS (Flow Cytometry Standard), which is a file format, is the only data standard that received a high rating.
• About 20% of the respondents expressed a willingness to participate in developing or implementing data standards for GTL.


Recommendations from the Survey

• Checklist of required information about experiments, experimental conditions, and data
• Data standards, data formats, file formats
• Software tools/Web interfaces for
  – data entry, including metadata and experiment details
  – data uploading, query, and access
• Data organization to relate information on sample origin to experimental data on the sample
• DBMS with software to enter data
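The last two recommendations, relating sample origin to the experimental data on that sample and backing it with a DBMS, can be sketched with Python's built-in sqlite3 module. The table names, column names, and rows are hypothetical and only show the shape of the relationship, not a proposed GTL schema.

```python
import sqlite3

# In-memory database; all table and column names are illustrative assumptions.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sample (
    sample_id  INTEGER PRIMARY KEY,
    organism   TEXT NOT NULL,
    origin     TEXT NOT NULL,      -- where the sample came from
    treatment  TEXT NOT NULL
);
CREATE TABLE measurement (
    measurement_id INTEGER PRIMARY KEY,
    sample_id      INTEGER NOT NULL REFERENCES sample(sample_id),
    method         TEXT NOT NULL,  -- e.g. microarray, mass spectrometry
    data_file      TEXT NOT NULL
);
""")

# Made-up rows: one sample measured by two methods.
conn.execute(
    "INSERT INTO sample VALUES (1, 'Organism A', 'culture batch 7', 'heat shock')"
)
conn.executemany(
    "INSERT INTO measurement (sample_id, method, data_file) VALUES (?, ?, ?)",
    [(1, "microarray", "array_001.txt"),
     (1, "mass spectrometry", "ms_001.raw")],
)


def measurements_for_origin(connection, origin):
    """Return (method, data_file) for every measurement on samples of one origin."""
    return connection.execute(
        """SELECT m.method, m.data_file
           FROM measurement AS m
           JOIN sample AS s ON s.sample_id = m.sample_id
           WHERE s.origin = ?
           ORDER BY m.measurement_id""",
        (origin,),
    ).fetchall()


print(measurements_for_origin(conn, "culture batch 7"))
```

Because every measurement row points back to a sample row, data produced by different methods on the same sample can be pulled together with a single join.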


Comments from the Survey

“It will help me a lot if someone will offer a short seminar on data standards.”

Data standards are “of more interest to computer scientists than [to] biological scientists.”

“This is all Greek to me which is exactly why very little to nothing is being developed that is useful to biologists like me.”


Difficulties in GTL Data Management

• Heterogeneous data. Metadata. Uncertainty.
• Lack of data standards. (Love/hate relationship.)
• Variety of DBMS being used.
• Variety of instrument output formats.
• Different DM phases with respect to data generation, analyses, and publication.
• Human factors: lab notebook -> electronic format (potential loss of information), data rearrangement in spreadsheets.
• Data attribution.


Overall Recommendations

GTL Program:
• Establish data standards and facilitate implementation. Data standards MUST be compatible with formats required by journals.
• Establish project-wide schema for organism/gene based database(s) to facilitate integration.
• Address data conversion problem.

DOE: Require description of data management plan as part of proposal. (Currently being done?)
Investigate digital notepad technology?
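One possible reading of the data conversion problem, given that spreadsheets were the most common format in the survey, is turning a spreadsheet export into structured records that carry their metadata with them. The sketch below assumes a CSV export with hypothetical `gene` and `value` columns and uses JSON as the target only for illustration; the talk does not name a target format.

```python
import csv
import io
import json


def spreadsheet_to_records(csv_text, metadata):
    """Convert CSV rows into dictionaries that each carry the experiment metadata.

    Assumes the export has 'gene' and 'value' columns (hypothetical names).
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    records = []
    for row in reader:
        records.append({
            "gene": row["gene"].strip(),
            "value": float(row["value"]),   # fail loudly on non-numeric cells
            "metadata": metadata,
        })
    return records


# Made-up spreadsheet export and metadata.
CSV_EXPORT = "gene,value\ngeneA,1.5\ngeneB,0.25\n"
METADATA = {"organism": "Organism A", "treatment": "heat shock"}

records = spreadsheet_to_records(CSV_EXPORT, METADATA)
as_json = json.dumps(records, indent=2)
print(as_json)
```

Converting cells to typed values at this step surfaces problems such as text in numeric columns early, rather than after the data have been merged with results from other labs.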


Acknowledgements

Carol Giometti
Argonne National Lab

Frank Olken
Lawrence Berkeley National Laboratory

Nancy Slater, GTL Project Manager
Lawrence Berkeley National Laboratory
