Leveraging the CEDAR Workbench for Ontology-linked Submission of Adaptive Immune Receptor...

Introduction

Next-generation sequencing technologies have led to the rapid production of high-throughput sequence data characterizing adaptive immune-receptor repertoires (AIRRs). As part of the AIRR community (http://airr-community.org) data standards working group, we have developed an initial set of metadata recommendations for publishing AIRR sequencing studies. These recommendations will be implemented in several public repositories, including the NCBI Sequence Read Archive (SRA). Submissions to SRA typically use a flat-file template and include only a minimal amount of term validation. To ease metadata authoring and to validate the ontology terms used to describe repertoire sequence data, we are developing an interactive template in the CEDAR Workbench that will allow for ontological validation and subsequent deposition in SRA. The CEDAR Workbench also allows the user to populate the template with metadata for submission to various data repositories. The incorporation of template-element-level ontology mapping not only facilitates validation of data submissions, but also enables intelligent queries within and across repositories.

High-quality Metadata and Challenges

High-quality metadata are seen as crucial to facilitating knowledge discovery. The biomedical community has a strong history of tackling metadata challenges by driving the development of metadata templates. These templates focus on addressing the reproducibility challenge by providing detailed checklists of the metadata needed to describe particular types of experimental data sources. The key goal is to provide sufficient metadata to enable the source studies to be reproduced. While individual metadata templates can provide a standard format for a particular data source, they rarely share common structure or semantics. There is also a disconnect between the high-level checklist-based template definitions developed by scientific communities and the submission formats required by metadata repositories. Moreover, different repositories provide their own locally defined templates for describing metadata. These templates lack common data elements and standard vocabularies, which creates a barrier to sharing and using metadata for knowledge discovery. We use the CEDAR Workbench to create common templates for entering metadata. To enhance machine readability, we use CEDAR's capability to link individual data elements and their values to ontology concepts.
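As a rough illustration of what linking a data element's value to an ontology concept can look like, the sketch below pairs a free-text entry with a term identifier and rejects entries that are not in the controlled vocabulary. The `ORGANISM_TERMS` table, the `LinkedValue` type, and the `link_value` helper are assumptions made for this example; they stand in for an ontology lookup and are not CEDAR's data model or API.

```python
from dataclasses import dataclass

# Hypothetical, hand-written vocabulary standing in for an ontology lookup.
# The identifiers are NCBI Taxonomy terms for the two organisms listed.
ORGANISM_TERMS = {
    "Homo sapiens": "NCBITaxon:9606",
    "Mus musculus": "NCBITaxon:10090",
}


@dataclass(frozen=True)
class LinkedValue:
    """A metadata value together with the ontology term it maps to."""
    label: str
    term_id: str


def link_value(label, vocabulary):
    """Return the value paired with its term, or raise if it is not a known term."""
    normalized = label.strip()
    if normalized not in vocabulary:
        raise ValueError(f"{label!r} is not a term in the controlled vocabulary")
    return LinkedValue(normalized, vocabulary[normalized])


# A metadata record whose "organism" element carries an ontology-linked value.
record = {"organism": link_value(" Homo sapiens ", ORGANISM_TERMS)}
print(record["organism"].term_id)
```

Because the stored value carries a term identifier rather than only a string, two repositories that use different labels for the same concept could still be queried together, which is the benefit the text attributes to ontology mapping.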

AIRR Data Submission to SRA Leveraging CEDAR Workbench

CEDAR is supported by grant U54 AI117925 awarded by the National Institute of Allergy and Infectious Diseases through funds provided by the trans-NIH Big Data to Knowledge (BD2K) initiative (www.bd2k.nih.gov).

Syed Ahmad Chan Bukhari1, Martin J. O'Connor2, John Graybeal2, Mark A. Musen2, Kei-Hoi Cheung3, Steven H. Kleinstein1

Leveraging the CEDAR Workbench for Ontology-linked Submission of Adaptive Immune Receptor Repertoire Data to the Sequence Read Archive (SRA)

1Department of Pathology, Yale School of Medicine, New Haven, CT; 2Center for Expanded Data Annotation and Retrieval, Stanford Center for Biomedical Informatics Research, Stanford University; and 3Department of Emergency Medicine, Yale School of Medicine, New Haven, CT

Figure 1. Metadata Life Cycle in CEDAR Workbench

The CEDAR Workbench

The Center for Expanded Data Annotation and Retrieval (CEDAR) is studying the creation of comprehensive and expressive metadata for biomedical datasets to facilitate data discovery, data interpretation, and data reuse. CEDAR takes advantage of emerging community-based standard templates for describing different kinds of biomedical datasets. The CEDAR Workbench investigates the use of computational techniques to help investigators assemble templates and fill in their values. We are creating a repository of metadata from which we plan to identify metadata patterns that will drive predictive data entry when filling in metadata templates. The metadata repository not only will capture annotations specified when experimental datasets are initially created, but also will incorporate links to the published literature, including secondary analyses and possible refinements or retractions of experimental interpretations. By working initially with the Human Immunology Project Consortium and the developers of the ImmPort data repository, we are developing and evaluating an end-to-end solution to the problems of metadata authoring and management that will generalize to other data-management environments.
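The text does not say how the planned predictive data entry will work. One simple possibility, sketched here purely as an assumption, is to rank the values that earlier records used for a field, restricted to records that agree with what the user has already entered. The `suggest_values` function and the sample records are invented for this illustration.

```python
from collections import Counter


def suggest_values(past_records, field, context, limit=3):
    """Rank values seen for `field` in past records that match `context`."""
    counts = Counter(
        record[field]
        for record in past_records
        if field in record
        and all(record.get(key) == value for key, value in context.items())
    )
    return [value for value, _ in counts.most_common(limit)]


# Invented example records standing in for a metadata repository.
past_records = [
    {"organism": "Homo sapiens", "tissue": "blood"},
    {"organism": "Homo sapiens", "tissue": "blood"},
    {"organism": "Homo sapiens", "tissue": "spleen"},
    {"organism": "Mus musculus", "tissue": "spleen"},
]

# The user has entered the organism; suggest likely tissue values.
print(suggest_values(past_records, "tissue", {"organism": "Homo sapiens"}))
```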

CEDAR: CENTER FOR EXPANDED DATA ANNOTATION AND RETRIEVAL


Minimal Standards WG Recommendations for The AIRR Sequencing Data

As high-throughput experiments become more prevalent in the field of immunology and elsewhere, there is an increased need for collective organization of data and standardized methods of data reporting. No current standards exist for adaptive immune receptor repertoire sequencing data. Data and metadata formats need to be harmonized so that data from different experiments can be mined. Once recovered, the mined data need to have sufficient descriptive metadata in order to be useful. To fulfill these unmet needs, we propose a set of minimal standards that we recommend journals adopt and that could form the requirements for submission to a public data repository:

1. The experimental study design including sample data relationships (e.g., which raw data file(s) relate to which sample, which samples are technical, which are biological replicates).

2. The essential sample annotation including experimental factors and their values (e.g., the set of markers used to sort the cell population being studied).

3. Sufficient annotation of the amplicon being sequenced that would allow the raw data to be transformed into the processed sequences (e.g., barcodes, primers, unique molecular identifiers).

4. The raw data for each sequencing run (e.g., FASTQ files).

5. The essential laboratory and data processing protocols (e.g., software tools with version numbers, quality thresholds, primer match cutoffs, etc.) that have been used to obtain the final processed data.

6. The final processed antigen receptor sequences for the set of samples in the experiment (e.g., the set of sequences used for V(D)J assignment), along with the V(D)J assignments for each sequence.
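The six items above can be read as a completeness checklist for a submission. The sketch below shows one way such a check might be expressed; the section keys (`study_design`, `raw_data`, and so on), the `missing_sections` helper, and the sample draft are hypothetical names chosen for this example, not field names from the AIRR standard or from SRA.

```python
# Hypothetical section keys, one per recommended item in the list above.
REQUIRED_SECTIONS = {
    "study_design": "experimental study design and sample data relationships",
    "sample_annotation": "essential sample annotation",
    "amplicon_annotation": "annotation of the amplicon being sequenced",
    "raw_data": "raw data for each sequencing run",
    "protocols": "laboratory and data processing protocols",
    "processed_sequences": "final processed sequences with V(D)J assignments",
}


def missing_sections(submission):
    """List required sections that are absent or empty in a submission."""
    return [key for key in REQUIRED_SECTIONS if not submission.get(key)]


# An invented, incomplete draft submission.
draft = {
    "study_design": {"samples": ["S1", "S2"]},
    "sample_annotation": {"sort_markers": ["CD19"]},
    "raw_data": ["S1.fastq", "S2.fastq"],
    "protocols": {},
}

for key in missing_sections(draft):
    print(f"missing: {REQUIRED_SECTIONS[key]}")
```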

Figure 2. Overview of the six high-level principles and associated data elements that comprise the AIRR standard draft agreed to at the second annual AIRR Community meeting in 2016.

Figure 4. CEDAR Workbench to SRA Conversion Workflow
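To give a feel for the kind of conversion the workflow performs, the sketch below turns a small JSON-LD-style metadata instance into XML using only the Python standard library. It is an illustration under stated assumptions, not the converter described here: the instance layout is simplified, and the XML element names (`metadata`, `attribute`, `tag`, `value`) are placeholders that should not be taken as the SRA XML schema.

```python
import json
import xml.etree.ElementTree as ET

# Simplified, invented instance. "@value" and "@id" are JSON-LD keywords;
# the field names and the overall layout are assumptions for this sketch.
instance_json = """
{
  "sample_name": {"@value": "subject1_pbmc"},
  "organism": {
    "@id": "http://purl.obolibrary.org/obo/NCBITaxon_9606",
    "rdfs:label": "Homo sapiens"
  }
}
"""


def instance_to_xml(instance):
    """Map each metadata field to a placeholder <attribute> element."""
    root = ET.Element("metadata")
    for field, content in instance.items():
        if field.startswith("@") or not isinstance(content, dict):
            continue  # skip JSON-LD bookkeeping such as "@context"
        attribute = ET.SubElement(root, "attribute")
        ET.SubElement(attribute, "tag").text = field
        value = content.get("@value") or content.get("rdfs:label", "")
        ET.SubElement(attribute, "value").text = value
        if "@id" in content:
            # Keep the ontology term so the link survives the conversion.
            attribute.set("term", content["@id"])
    return root


root = instance_to_xml(json.loads(instance_json))
print(ET.tostring(root, encoding="unicode"))
```

Carrying the term IRI into the output as an attribute is a design choice of this sketch; a real target schema may have no slot for it, in which case only the label would be submitted.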

Figure 3. (a) AIRR Minimal Standard Data Elements, (b) Ontology-Controlled Template Authoring Through CEDAR Workbench, and (c) AIRR Data Submission Template

CEDAR JSON-LD to SRA XML Converter Demo

References

1. Musen, Mark A., et al. "The center for expanded data annotation and retrieval." Journal of the American Medical Informatics Association 22.6 (2015): 1148-1152.
2. Leinonen, Rasko, Hideaki Sugawara, and Martin Shumway. "The sequence read archive." Nucleic Acids Research (2010): gkq1019.

Acknowledgement: We acknowledge Dr. Ben Busby from NCBI for his valuable suggestions during this research work.
