Data Management in the DOE Genomics:GTL Program
Janet Jacobsen and Adam Arkin
Lawrence Berkeley National Laboratory
University of California, Berkeley
Topics (talk or handout)
• Basic facts about the Genomics:GTL Program
• Goals of the GTL Program
• Experimental data generated by GTL
• Laboratory methods
• Data management challenges, requirements, and needs
• Survey on Data Standards, Data Sharing, and Data Management (if time)
• Overall Recommendations
Lawrence Berkeley National Laboratory University of California 2
Genomics:GTL Program
• Genomes to Life renamed Genomics:GTL
• One of three DOE genome programs
• First funding awards in July 2002
• Plan to fund and develop four user facilities
  – Production and Characterization of Proteins
  – Whole Proteome Analysis
  – Characterization and Imaging of Molecular Machines
  – Analysis and Modeling of Cellular Systems
Goals of the GTL Program
Microbes are ubiquitous and have adapted to practically every environmental niche on earth. Some live and thrive in conditions generally thought to be inhospitable to life.
GTL plans to study microbes and microbial communities that may be helpful in
• energy generation,
• environmental cleanup,
• carbon sequestration.
Categories of Experimental Data
• Biomass production
• Genomic
  – sequence and annotate the microbe’s genome
• Transcriptomic
  – study transcription under different conditions
• Proteomic
  – what proteins are present and at what levels
• Metabolomic
  – what metabolites are present
• and others…
Laboratory Methods
• Biomass production
  – cell culture
• Transcriptomic (HTP)
  – microarrays
• Proteomic (HTP)
  – 2D gels, mass spectrometry
• Metabolomic (~HTP)
  – mass spectrometry, NMR
Data Volume and Complexity
Example: mass spectrometry
• mass spec used to identify proteins
• raw data analyzed to get peak list
• peak list used to identify peptides
• database search to identify proteins from peptides
Volume:
• size of raw data set per experiment ~ 10 GB
• multiple experiments per __/per organization
• use FedEx to ship disk drives
Complexity: see PEDRo UML class diagram on next slide
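The four mass-spec steps above (raw data, peak list, peptides, proteins) can be illustrated with a toy sketch. This is not the method used by any GTL lab: the peak-picking rule, the tolerance, the peptide names, and the lookup tables are all hypothetical, and real searches also account for charge states and fragmentation spectra.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Peak:
    mz: float         # mass-to-charge ratio
    intensity: float  # detector signal at that m/z


def pick_peaks(spectrum, min_intensity):
    """Raw data -> peak list: keep local maxima at or above a threshold."""
    peaks = []
    for i in range(1, len(spectrum) - 1):
        mz, inten = spectrum[i]
        is_local_max = inten > spectrum[i - 1][1] and inten >= spectrum[i + 1][1]
        if inten >= min_intensity and is_local_max:
            peaks.append(Peak(mz, inten))
    return peaks


def match_peptides(peaks, peptide_mz, tol):
    """Peak list -> peptides: match peaks to expected m/z values within tol."""
    found = set()
    for peak in peaks:
        for peptide, expected in peptide_mz.items():
            if abs(peak.mz - expected) <= tol:
                found.add(peptide)
    return found


def identify_proteins(peptides, protein_db):
    """Peptides -> proteins: report proteins with at least one matched peptide."""
    hits = {}
    for protein, members in protein_db.items():
        matched = sorted(peptides & set(members))
        if matched:
            hits[protein] = matched
    return hits


# Hypothetical toy inputs (assumptions, not real measurements).
spectrum = [(500.0, 1.0), (500.3, 50.0), (500.6, 2.0),
            (742.4, 3.0), (742.5, 80.0), (742.6, 4.0), (900.0, 1.0)]
peptide_mz = {"PEPTIDE_A": 500.3, "PEPTIDE_B": 742.5, "PEPTIDE_C": 1200.0}
protein_db = {"protein_1": ["PEPTIDE_A", "PEPTIDE_B"], "protein_2": ["PEPTIDE_C"]}

peaks = pick_peaks(spectrum, min_intensity=10.0)
peptides = match_peptides(peaks, peptide_mz, tol=0.05)
proteins = identify_proteins(peptides, protein_db)
print(proteins)
```

Each stage produces a much smaller data set than the one before it, which is why the raw data dominate the storage and transport problem.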
[Figure: PEDRo UML class diagram, with regions labeled raw data, peak list, peptides, and proteins]
Data Management Challenges
1. INTEGRATING DATA FROM DIVERSE SOURCES IS THE KEY TO GTL’S SUCCESS
diverse = different laboratory methods, different organizations, different aspects of cellular functions/pathways
2. CAPTURING METADATA IS VERY IMPORTANT
3. In the future, we must be able to process LARGE numbers of LARGE data sets
Item 3 is important, but not as important as items 1 and 2. We have to address those first.
Why is Data Integration So Important to the GTL Program?
Experimental data will be used to build models of cellular pathways, i.e., what goes on inside the cell. Different types of data contribute to building different aspects of the model (response to environmental conditions, growth phases, etc.). Think of building a pathway as an inverse problem.
In addition, experimental data are used to verify models.
Why are MetaData So Important to the GTL Program?
We need to capture not only sample treatment (e.g., heat shock, oxygen stress), but all of the conditions under which an experimental analysis was performed. Otherwise we cannot compare the results from different experiments. We want to investigate how the same organism responds to different conditions, and how different organisms respond to the same condition. We also want to capture uncertainty.
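A minimal sketch of what capturing this metadata could look like, assuming a simple record type. The field names, condition keys, and the comparability rule are hypothetical illustrations, not a GTL schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Measurement:
    value: float
    uncertainty: float  # assumed here to be one standard deviation


@dataclass
class ExperimentRecord:
    organism: str
    treatment: str      # sample treatment, e.g. "heat shock"
    conditions: dict    # conditions under which the analysis was performed
    result: Measurement


def missing_metadata(record, required_keys):
    """List the required analysis conditions that were not recorded."""
    return [key for key in required_keys if key not in record.conditions]


def comparable(a, b, required_keys):
    """Two results are comparable only if every required condition
    is recorded in both records and has the same value."""
    if missing_metadata(a, required_keys) or missing_metadata(b, required_keys):
        return False
    return all(a.conditions[key] == b.conditions[key] for key in required_keys)


# Hypothetical records: same organism, different treatments.
REQUIRED = ["growth_phase", "instrument"]
control = ExperimentRecord("organism_X", "none",
                           {"growth_phase": "log", "instrument": "ms_1"},
                           Measurement(1.0, 0.2))
stressed = ExperimentRecord("organism_X", "heat shock",
                            {"growth_phase": "log", "instrument": "ms_1"},
                            Measurement(2.4, 0.3))
undocumented = ExperimentRecord("organism_X", "oxygen stress",
                                {"growth_phase": "log"},
                                Measurement(1.8, 0.4))

print(comparable(control, stressed, REQUIRED))
print(missing_metadata(undocumented, REQUIRED))
```

The third record shows the failure mode the slide warns about: the measurement exists, but without the missing condition it cannot safely be compared with the others.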
Other Data Management Needs
All of the usual ones…
• secure access
• storage of large volumes of data
• data archives
• data provenance
plus one wrinkle… “staging of data access and management”.
Staging of Data Access/Management
Stage 1: data are collected and undergo QA/QC within the lab producing the data – manage data locally.
Stage 2: data are shared with other project collaborators – transport data and/or provide restricted access.
Stage 3: data are published and move into the public domain – provide community-wide access to data.
Stage 4: data are archived – provide safe storage from which the data can be retrieved.
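The four stages can be sketched as a small state model with a per-stage access rule. The role names are hypothetical, and the rule that archived data stay readable by everyone who could read the published data is an assumption; the slide does not specify archive access.

```python
from enum import IntEnum


class Stage(IntEnum):
    LOCAL = 1      # collected and QA/QC'd within the producing lab
    SHARED = 2     # shared with project collaborators
    PUBLISHED = 3  # in the public domain
    ARCHIVED = 4   # in safe long-term storage


# Who may read data at each stage (role names are illustrative).
READERS = {
    Stage.LOCAL: {"producer"},
    Stage.SHARED: {"producer", "collaborator"},
    Stage.PUBLISHED: {"producer", "collaborator", "public"},
    # Assumption: archiving does not narrow access.
    Stage.ARCHIVED: {"producer", "collaborator", "public"},
}


def can_read(stage, role):
    """Return True if the given role may read data at this stage."""
    return role in READERS[stage]


def advance(stage):
    """Move a data set to the next stage; archived data have no next stage."""
    if stage is Stage.ARCHIVED:
        raise ValueError("archived data have no later stage")
    return Stage(stage + 1)


stage = Stage.LOCAL
print(can_read(stage, "collaborator"))
stage = advance(stage)
print(stage.name, can_read(stage, "collaborator"))
```

Modeling the stage explicitly lets one data set keep a single identity while its access rules change, rather than being copied into a different system at each stage.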
Survey on Data Standards, Data Sharing, and Data Management
• Follow up to work by the GTL Data Standards Working Group
• Link to survey mailed to registrants for GTL Program Workshop
• 50+ respondents
  – mostly experimental biologists
  – 26 from nat’l labs, 16 from universities, 8 from other organizations
• See handout for summary of survey results
Survey Results
• Most common data ‘format’ (78%): spreadsheet
• Most common measurement type (70%): image
• Few respondents are using any data standard.
• FCS (Flow Cytometry Standard), which is a file format, is the only data standard that received a high rating.
• About 20% of the respondents expressed a willingness to participate in developing or implementing data standards for GTL.
Recommendations from the Survey
• Checklist of required information about experiments, experimental conditions, and data
• Data standards, data formats, file formats
• Software tools/Web interfaces for
  – data entry, including metadata and experiment details
  – data uploading, query, and access
• Data organization to relate information on sample origin to experimental data on the sample
• DBMS with software to enter data
Comments from the Survey
“It will help me a lot if someone will offer a short seminar on data standards.”
Data standards are “of more interest to computer scientists than [to] biological scientists.”
“This is all Greek to me which is exactly why very little to nothing is being developed that is useful to biologists like me.”
Difficulties in GTL Data Management
• Heterogeneous data. Metadata. Uncertainty.
• Lack of data standards. (Love/hate relationship.)
• Variety of DBMS being used.
• Variety of instrument output formats.
• Different DM phases with respect to data generation, analyses, and publication.
• Human factors: lab notebook -> electronic format (potential loss of information), data rearrangement in spreadsheets.
• Data attribution.
Overall Recommendations
GTL Program:
• Establish data standards and facilitate implementation. Data standards MUST be compatible with formats required by journals.
• Establish project-wide schema for organism/gene based database(s) to facilitate integration.
• Address data conversion problem.
DOE: Require description of data management plan as part of proposal. (Currently being done?)
Investigate digital notepad technology?
Acknowledgements
Carol Giometti, Argonne National Lab
Frank Olken, Lawrence Berkeley National Laboratory
Nancy Slater, GTL Project Manager, Lawrence Berkeley National Laboratory