Metagenomic Data Provenance and Management using the ISA infrastructure --- overview, implementation...



description

Metagenomic Data Provenance and Management using the ISA infrastructure - overview, implementation patterns & software tools. Slides presented at the EBI Metagenomics Bioinformatics course: http://www.ebi.ac.uk/training/course/metagenomics2014

Transcript of Metagenomic Data Provenance and Management using the ISA infrastructure --- overview, implementation...

  • 1. Metagenomic Data Provenance and Management using the ISA infrastructure: overview, implementation patterns & software tools. Alejandra Gonzalez-Beltran, PhD; Eamonn [email protected]@oerc.ox.ac.uk. Metagenomics Bioinformatics, EMBL-EBI, Hinxton, UK, September 2014. University of Oxford e-Research Centre, UK

2-7. Experimental Metadata Roadmap: link to analysis platforms; submission to public repositories; data publication
8. Experimental Metadata. Notes in lab notebooks (information for humans); spreadsheets & tables; RDF statements (information for machines). It is all about structuring experimental information to make it available to computers and software agents to enable: provenance tracking; assessment and evaluation; accountability, reliability, trust, evidence; conservation, preservation, storage, archiving and mining
10. http://www.ama-rochester.org/WP/wp-content/uploads/2013/01/three-pillars.png
11. The community
12. A growing ecosystem of over 30 public and internal resources using the ISA metadata tracking framework (ISA-Tab and/or tools) to facilitate standards-compliant collection, curation, management and reuse of investigations in an increasingly diverse set of life science domains, including: stem cell discovery; systems biology; transcriptomics; toxicogenomics; environmental health; environmental genomics; metabolomics; metagenomics; nanotechnology; proteomics. Also used by communities working to build a library of cellular signatures
13. The format
14. Why ISA format and Tools?
   investigation: high level concept to link related studies
   study: the central unit, containing information on the subject under study, its characteristics and any treatments applied. A study has associated assays
   assay: test performed either on material taken from the subject or on the whole initial subject, which produces qualitative or quantitative measurements (data)
   Assays hold pointers to data file names/locations; the data are external files in native or other formats
   ISA metadata specifications: workflow and process orientated; compatible with checklist enforcement; compatible with external vocabulary resources; compatible by design with existing schemas (MAGE-Tab, Pride-xml, SRA-xml)
   [Figure: example graph from sources (H. sapiens; H1 and H2; 33 and 35 years) to samples (H1.sample1, H1.sample2, H2.sample1), through labeling to labeled extracts (H1.sample1.labeled, H2.sample1.labeled) and data files (h1-s1.cel, h1-s2.cel, h2-s1.cel)]
15. Essentials about ISA syntax: 3 types of files
   Investigation file: at max 1 (think executive summary). Why? general study description. How? methods / protocol declaration. How? variable declarations (factors and response variable). Who? contact and affiliation information
   Study file: true table (think sorting, filtering). What? Listing all biological materials collected over the study course
   Assay file: true table (think sorting, filtering). Results! Listing all data files collected by a given assay. n files, as many as there are assay types declared
16. Essentials about ISA syntax. Material transformations: inputs and outputs of protocols are Material Nodes (Source Name, Sample Name, Extract Name, Labeled Extract Name)
   Material Node: Characteristics[]; Factor Value[] (independent variables); Material Type; Comment[]
   Protocol (process): Parameter Value[]; Performer (operator effect); Date (day effect)
   Data File Node: Raw Data File; Derived Data File
17. Basic coding patterns
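The three file types listed under "Essentials about ISA syntax" can be illustrated with a small sketch. It assumes the conventional ISA-Tab file-name prefixes (`i_`, `s_`, `a_`), which the slides themselves do not state, and the file names in the example archive are hypothetical. It only sorts file names into roles and enforces the "at max 1" investigation file rule; it does not parse or validate the tables.

```python
from collections import defaultdict

# Assumption: ISA-Tab files are named i_*.txt (investigation),
# s_*.txt (study) and a_*.txt (assay). Anything else is treated as data.
PREFIXES = {"i_": "investigation", "s_": "study", "a_": "assay"}


def classify_isatab_files(filenames):
    """Group file names by their ISA-Tab role."""
    groups = defaultdict(list)
    for name in filenames:
        role = next(
            (r for p, r in PREFIXES.items()
             if name.startswith(p) and name.endswith(".txt")),
            "data",
        )
        groups[role].append(name)
    # The slides state that there is at most one investigation file.
    if len(groups["investigation"]) > 1:
        raise ValueError("at most one investigation file is allowed")
    return dict(groups)


# Hypothetical archive listing, for illustration only.
archive = [
    "i_investigation.txt",
    "s_study.txt",
    "a_metagenome.txt",
    "a_metatranscriptome.txt",
    "reads_01.sff",
]
layout = classify_isatab_files(archive)
print(layout)
```

One study file can sit next to several assay files, which mirrors the "n files, as many as there are assay types declared" point above.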
18. Essentials about ISA syntax. Branching events: tabular representation. One source material (human volunteer1) yields three sample materials (peripheral blood, muscle biopsy, liver biopsy)
   Source Name | Characteristics[organism] | Protocol REF | Parameter Value[storage condition] | Sample Name | Characteristics[organ]
   volunteer1 | Homo sapiens | sample collection | heparinated tube, room temperature | volunteer1-sample1 | peripheral blood
   volunteer1 | Homo sapiens | sample collection | liquid nitrogen | volunteer1-sample2 | muscle
   volunteer1 | Homo sapiens | sample collection | liquid nitrogen | volunteer1-sample3 | liver
19. Essentials about ISA syntax. Pooling events: tabular representation. Three source materials (animal1, animal2, animal3) are pooled into one sample material (salivary glands)
   Source Name | Characteristics[organism] | Protocol REF | Parameter Value[storage condition] | Sample Name | Characteristics[organ]
   animal1 | Mus musculus | sample collection | heparinated tube, room temperature | pool1 | salivary gland
   animal2 | Mus musculus | sample collection | heparinated tube, room temperature | pool1 | salivary gland
   animal3 | Mus musculus | sample collection | heparinated tube, room temperature | pool1 | salivary gland
20. Essentials about ISA syntax. Tagging with terminologies. Implicit column order matters. ISA tools (ISAcreator - ISAconfigurator) provide ontology term selection and term tagging facilities to help users
   Column order example: Source Name | Characteristics[organism] | Factor Value[compound agent] | Factor Value[perturbation agent] | Factor Value[dose] | Factor Value[duration] | Factor Value[washout period] | Factor Value[duration] | Factor Value[perturbation agent] | Factor Value[dose] | Factor Value[duration]
   individual1 | human
   Tagged example: Source Name | Characteristics[organism] | Term Source REF | Term Accession Number | Characteristics[duration] | Unit | Term Source REF | Term Accession Number | Factor Value[compound(htppt://purl] | Term Source REF | Term Accession Number
   individual1 | Homo sapiens | NCBITax | 9606 | 12 | week | UO | UO:wwerwta | aspirin | CHEBI | 1231354
21. Experimental design and workflows
22. Parallel group design (source: http://dx.doi.org/10.1016/S1569-9056(02)00115-X; figure 1)
23. Essentials about ISA syntax: representing interventions and treatments
   Treatments are expressed as sets of factor levels. Example: the treatment is a tadalafil supplementation; the factors will be compound, dose and duration (what? how much? how long for?)
   Source Name | Characteristics[organism] | Protocol REF | Factor Value[compound] | Factor Value[dose] | Factor Value[duration]
   volunteer1 | Homo sapiens | treatment | tadalafil | 250 mg/day | 12 weeks
   volunteer2 | Homo sapiens | treatment | tadalafil | 250 mg/day | 12 weeks
   volunteer3 | Homo sapiens | treatment | placebo | 20 mg/day | 12 weeks
   Implicit column order matters, but this is independent from the ISA syntax specification
24. Cross-over design (source: Roberts et al., Journal of the International Society of Sports Nutrition 2007, 4:25, doi:10.1186/1550-2783-4-25)
25-27. Cross-over design (doi:10.1371/journal.pone.0037479): treatment declaration
28-30. Assays: NMR
31. The software suite
32. [Section 1]
33. ISA configurations. Available from: http://isa-tools.org/configurations.html and https://github.com/ISA-tools/Configuration-Files
   Assembling workflow archetypes
   Setting annotation requirements: for compliance with database schemas (SRA, MAGE, PRIDE); for compliance with community based requirements (MIAME, MIAPE, MIMS, MIxS, ...)
   Guide users: provide pre-assembled templates; specify vocabulary support
   ISAconfigurator: supporting tool, https://github.com/ISA-tools/ISAconfigurator
34. ISA configurations. Available from: http://isa-tools.org/configurations.html and https://github.com/ISA-tools/Configuration-Files
   Minimum information about any (x) sequence (MIxS) guidelines, as issued by the Genomic Standards Consortium
   ENA-GSC-MIxS checklist XML document: based on MIxS guidelines; augmented with a number of regular expressions to further validate/regularize input; fixing a number of units used to report measurement; issued July 2013 (version 3.0), July 2014 (version 4.0)
   SRA 1.5 schema requirements (mandatory information and required terminology, e.g. Library Selection or Library Strategy)
   All this information is used to derive ISA MIxS configurations, allowing all those annotation requirements to be embedded in spreadsheet tables
35-36. ISAconfigurator Tables
37. Things to bear in mind with NGS data. Important considerations for managing data and submitting to public repositories:
   be aware of supported file formats: FastA, FastQ, SFF, ...
   be aware of the need to demultiplex reads
   the SRA schema evolves and updates are needed, e.g. Study replaced by Project
   updates to the ISAconverter: mapping from ISA is straightforward, as it brings a number of elements ISA already supports
38. Tools for creating ISA-Tab documents: ISAcreator
39. ISAcreator. Java desktop application, developed to be a user-friendly way to enter standards-compliant metadata. It has lots of features, and these are just some of them: we also have a data entry wizard and an import utility
40. ISAcreator features: automatic template generation
41. ISAcreator Wizard: automatic template generation. Prerequisites and conditions of use:
   - supports factorial design experiments, meaning sets of discrete factor levels combined together to define a treatment: a 2x2 factorial design as in 2 compounds and 2 time points; a 2x2x3 factorial design as in 2 compounds, 2 time points, 3 doses
   - assumes one sample collection event (all samples collected at sacrifice time)
   - supports some but not all currently available assay types
   - supports fractional factorial design
   - supports unbalanced factor group population sizes (ethical considerations for high dose toxic exposures)
   - automatically generates sample identifiers, human-readable and meaningful labels and, if requested, barcodes
   - does not support crossover designs, which have to be coded manually
   - does not support sample collection timeline management (under development)
42. Importing your own spreadsheet: mapping to third party table
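The factorial designs described for the ISAcreator wizard (treatments defined as combinations of discrete factor levels) can be sketched in a few lines. This is only an illustration of the idea, not how ISAcreator itself generates templates: the function names, the `G1.S1` identifier scheme and the "6 weeks" level are hypothetical, and the sketch covers balanced full factorial designs only.

```python
from itertools import product


def treatment_groups(factors):
    """Return one dict per treatment: the full cross of all factor levels."""
    names = list(factors)
    return [
        dict(zip(names, levels))
        for levels in product(*(factors[n] for n in names))
    ]


def sample_rows(factors, replicates):
    """Yield study-table rows with hypothetical identifiers such as 'G1.S2'."""
    for g, treatment in enumerate(treatment_groups(factors), start=1):
        for s in range(1, replicates + 1):
            row = {"Sample Name": f"G{g}.S{s}"}
            row.update({f"Factor Value[{k}]": v for k, v in treatment.items()})
            yield row


# A 2x2 design: 2 compounds x 2 durations ("6 weeks" is an invented level).
design = {
    "compound": ["tadalafil", "placebo"],
    "duration": ["6 weeks", "12 weeks"],
}
groups = treatment_groups(design)
rows = list(sample_rows(design, replicates=3))
print(len(groups), len(rows))  # 4 treatment groups, 12 samples
for row in rows[:2]:
    print(row)
```

Unbalanced group sizes or fractional designs, which the wizard is said to support, would need a per-group replicate count or a filtered subset of the combinations instead of the single `replicates` argument used here.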
43. ISAcreator features: visualizing experimental workflows. Work completed during investigation of a new approach for the creation of glyphs, with use of taxonomy for guidance. See Maguire et al., Taxonomy-Based Glyph Design with a Case Study on Visualizing Workflows of Biological Experiments, IEEE Transactions on Visualization and Computer Graphics, 2012
44. Tools for creating ISA-Tab documents. OntoMaton: a BioPortal powered ontology widget for Google Spreadsheets (Maguire et al., 2013, Bioinformatics)
   http://www.slideshare.net/proccaserra/ontomaton-icbo2013alternative-ordertwv3
   http://isatools.wordpress.com/2012/07/13/introducing-ontomaton-ontology-search-tagging-for-google-spreadsheets/
45. Potential issues and known hurdles: the problem of conflicting versions (especially high when working with big consortia and distributed, decentralised groups of users); lack of version control and history; absence of collaborative features. Looking for new solutions while retaining the features
   [Figure: logo equation, "= + + LOV"]
46. BioPortal meets Google Spreadsheet
47-48. Searching and Tagging. Templates: https://drive.google.com/templates?type=spreadsheets&q=ontomaton
50. [Section 2]
51. [Section 3]
52. Risa - ISA-Tab manipulation for analysis in R: Risa R package
53. R package available since BioConductor 2.11: http://www.bioconductor.org/packages/release/bioc/html/Risa.html
   Functionality for parsing ISA-Tab datasets into R objects, saving and updating them
   It bridges the ISA-Tab metadata to analysis pipelines of specific assay types, by building objects for use in other R packages downstream; currently considering mass spectrometry (xcms package, xcmsSet) and DNA microarray (Biobase package, ExpressionSet)
   [Figure: workflow from Experiment Design through Collect Samples and Run Assays to Analysis, linking samples (SAMPLE 1 to SAMPLE 11) to data files; Arabidopsis thaliana treatment groups 70%, 90%, 100%]
54-56. http://isatools.wordpress.com/2013/06/18/isacreator-available-in-genomespace/
57. [Section 4]
58. Submission Tool: https://github.com/ISA-tools/ISAcreator/wiki/ENASubmissionTool
59-60. Pre-requirements: registration to ENA/EBI Metagenomics; data upload by one of the methods provided by ENA: http://www.ebi.ac.uk/ena/about/sra_data_upload
61-62. https://github.com/ISA-tools/ISAcreator/wiki/ENASubmissionTool
66. ISA-Tab creation; ISA-Tab validation; ISA-Tab to SRA conversion (SRA-xml schema); submission to ENA
69. [Section 5]
70-71. http://gigasciencejournal.com http://gigadb.org/dataset/100035
72. New open-access, online-only publication for descriptions of scientifically valuable datasets. Only content type: Data Descriptor, narrative + structured parts. Initially focused on the life, environmental and biomedical sciences. Data Descriptors will be complementary to traditional research journals and data repositories. Designed to foster data sharing and reuse, and ultimately to accelerate scientific discovery. www.nature.com/scientificdata
73-74. Data Descriptors served by Scientific Data (www.nature.com/scientificdata)
   Narrative section: a brief article-like document with: Title; Abstract; Background & Summary; Methods; Technical Validation; Usage Notes; Figures & Tables; References
   Structured section: detailed descriptions of the experimental procedures used to produce the data; following community-defined minimum information requirements, for a level of detail sufficient to reproduce the experiments; using ontologies & controlled vocabularies, to maximise consistency of the descriptions
75. Training Material: http://isa-tools.org/training.html
76. Hands-on Material (http://isa-tools.org/training.html)
   Software: ISAcreator 1.7.8 (see pre-release); ISAconfigurator 1.6
   Configurations: ISA-ENA-MIxS Configuration; default MultiAssay Configuration
   ISA-Tab formatted datasets: BII-S-3: Western Channel Water Samples metagenome and metatranscriptome; BII-S-7: Human gut microbiome targeted gene survey
   Google Templates and OntoMaton
   ISA mapping file
77. The Exemplar Datasets. BII-S-3: metagenome and metatranscriptome on 454
78. The Exemplar Datasets. BII-S-7: targeted gene survey (16S rRNA) on 454; submitted to ENA via ISAcreator: ERP000133
79. Experimental Metadata Roadmap: link to analysis platforms; submission to public repositories; data publication
80. EBI; teams; funders
81. Thanks for your attention! Questions? You can email [email protected] our websites. View our Git repo & contribute: http://github.com/ISA-tools. View our blog: http://isatools.wordpress.com. Follow us on Twitter: @isatools