MiAIRR: Minimum information about an Adaptive Immune Receptor Repertoire Sequencing Experiment

Presenter: Syed Ahmad Chan Bukhari, PhD

Department of Pathology, Yale School of Medicine

Inability to reproduce scientific experiments is a big challenge.

Lithgow, G. J., Driscoll, M., & Phillips, P. (2017). A long journey to reproducible results. Nature News, 548(7668), 387.

● A drug-like molecule could extend a roundworm's lifespan by as much as 67%.

● Other labs failed to replicate the studies.

● Two cancer labs spent more than a year trying to understand inconsistencies in results from the same tumour biopsy.

● Because of a lack of standards, the two labs were using different cell isolation protocols.

Inability to reproduce scientific experiments is a big challenge.

Begley, C. G., & Ellis, L. M. (2012). Drug development: Raise standards for preclinical cancer research. Nature, 483(7391), 531-533.

● Amgen could reproduce the findings in only 6 of 53 “landmark” papers in cancer biology

● Bayer could validate only 25% of 67 preclinical studies

Inability to reproduce scientific experiments can have multiple causes.

● Undocumented scientific procedures

● Dataset size and variability

● Problems with statistical techniques

● A documented procedure that is difficult to follow

Standardization is a proven way to make sense of scientific procedures and outcomes.

Which array platform was used, and how?

Experiments in immunology face similar reproducibility challenges

● High-throughput sequencing (HTS) of B-cell (antibody, immunoglobulin) and T-cell receptor repertoires has increased dramatically since the technique was introduced in 2009.

○ Researchers previously relied on low-resolution approaches, such as flow cytometry, spectratyping and Sanger sequencing

● B cell receptors (BCRs) and T cell receptors (TCRs) serve as the primary means for specific detection of foreign antigens.

Adaptive Immune Receptor Repertoire (AIRR) Sequencing

● The collection of BCRs or TCRs in an individual, tissue, cell subset or during an immune response is referred to as the repertoire.

● AIRR-seq studies are associated with complex metadata, such as donor phenotypes, cell types and nucleic acid material used.

○ These metadata are crucial for ensuring reproducibility and facilitating secondary and meta-analyses

● AIRR sequencing has enormous promise for understanding the dynamics of the immune repertoire in vaccinology, infectious disease, autoimmunity, and cancer biology.

Adaptive Immune Receptor Repertoire (AIRR) Sequencing (Popularity)

Adaptive Immune-Receptor Repertoire (AIRR) Community

Next-generation sequencing of B & T cell receptor repertoires (AIRR-seq)

Developing standard protocols for reporting and sharing AIRR-seq data to optimize their use in biomedical research and patient care

AIRR Community Formed

AIRR Community Data Elements

Each of the 6 high-level principles has been expanded into a set of data elements

“Accurate specification of the pathophysiological condition is important for cross-comparison of multiple studies”

● This set describes the experimental study design including the title of the study, laboratory contact information etc.

● For individual subjects, the species, sex, age, and ancestry are included along with information about disease state(s) etc.

● This set describes the metadata about the diagnosis process
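
The study and subject elements listed above can be pictured as a simple record with a completeness check. The following is only a minimal sketch: the field names (`study_title`, `lab_contact`, and so on) are hypothetical placeholders chosen for illustration and are not the official MiAIRR element names.

```python
from dataclasses import dataclass, fields
from typing import List, Optional


@dataclass
class SubjectRecord:
    # Hypothetical field names for illustration; not official MiAIRR element names.
    study_title: Optional[str] = None
    lab_contact: Optional[str] = None
    species: Optional[str] = None
    sex: Optional[str] = None
    age: Optional[str] = None
    ancestry: Optional[str] = None
    disease_state: Optional[str] = None


def missing_fields(record: SubjectRecord) -> List[str]:
    """Return the names of fields that are empty or unset, in declaration order."""
    return [f.name for f in fields(record) if not getattr(record, f.name)]


# Made-up example values.
record = SubjectRecord(study_title="Example study", species="Homo sapiens", sex="female")
print(missing_fields(record))
```

A check like this only flags absent values; whether a given value is acceptable would still depend on the community's agreed definitions.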

“Information about the origin and expected composition of the biological sample(s) is central for the interpretation of downstream sequencing results.”

“Proper interpretation of experimental results for future comparative analysis requires information”

● How are cells prepared for processing?

● How is the sequencing performed?

● The quality of the data produced is critically important too.

“MiAIRR focuses on what information needs to be shared rather than suggesting the analysis techniques and tools”

Providing raw data enables the most up-to-date data processing to be performed, as the analysis tools for AIRR-seq data are undergoing rapid evolution

● Providing the raw NGS data for each sequencing run (e.g., FASTQ files) permits the reanalysis, secondary analysis and combination of multiple data sets from different studies using meta-analysis techniques.
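
To illustrate why raw FASTQ files lend themselves to reanalysis, here is a minimal sketch of reading FASTQ records with the Python standard library only. It assumes the common unwrapped layout of four lines per record (header, sequence, separator, quality); the example reads are made up, and a real reanalysis would more likely rely on an established parsing library.

```python
import io
from typing import Iterator, TextIO, Tuple


def read_fastq(handle: TextIO) -> Iterator[Tuple[str, str, str]]:
    """Yield (read_id, sequence, quality) from a four-line-per-record FASTQ stream."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return  # end of input
        sequence = handle.readline().rstrip("\n")
        separator = handle.readline().rstrip("\n")
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"malformed FASTQ record near {header!r}")
        if len(sequence) != len(quality):
            raise ValueError(f"sequence/quality length mismatch in {header!r}")
        yield header[1:], sequence, quality


# Tiny made-up example; real runs would be read from the deposited files.
example = io.StringIO("@read1\nACGT\n+\nIIII\n@read2\nGGCA\n+\nFFFF\n")
records = list(read_fastq(example))
print(len(records), [read_id for read_id, _, _ in records])
```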

● A variety of tools are in use for sequencing and processing. MiAIRR does not provide tool-specific details.

● MiAIRR defines broad categories that cover the essential data processing steps.

● Report the software tools with version numbers, quality thresholds, primer match and length cutoffs, etc.

● This final MiAIRR set will thus comprise the list of processed sequences, along with sequence-level annotations.

● This should include the V(D)J gene segment and constant region (isotype) annotation if used in the associated publication, along with the CDR3 sequence.
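
A list of processed sequences with sequence-level annotations could be shared as a simple tab-separated table. The sketch below is illustrative only: the column names (`sequence_id`, `v_gene`, `cdr3`, and so on) and all values are hypothetical placeholders, not a prescribed MiAIRR file format.

```python
import csv
import io
from typing import Dict, List, Sequence

# Hypothetical column names, chosen for illustration only.
COLUMNS = ["sequence_id", "v_gene", "d_gene", "j_gene", "isotype", "cdr3"]

# Made-up rows with placeholder gene names.
rows: List[Dict[str, str]] = [
    {"sequence_id": "seq1", "v_gene": "V-example-1", "d_gene": "D-example-1",
     "j_gene": "J-example-1", "isotype": "isotype-example", "cdr3": "TGTGCGAGA"},
    {"sequence_id": "seq2", "v_gene": "V-example-2", "d_gene": "",
     "j_gene": "J-example-2", "isotype": "", "cdr3": ""},
]


def incomplete(table: List[Dict[str, str]],
               required: Sequence[str] = ("v_gene", "j_gene", "cdr3")) -> List[str]:
    """Return the IDs of sequences missing any of the required annotations."""
    return [r["sequence_id"] for r in table if any(not r[k] for k in required)]


# Write the table as tab-separated text into an in-memory buffer.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=COLUMNS, delimiter="\t", lineterminator="\n")
writer.writeheader()
writer.writerows(rows)
print(buffer.getvalue())
print(incomplete(rows))
```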

MiAIRR Elements Distribution to the NCBI

What do MiAIRR elements look like?

https://github.com/airr-community/airr-standards/blob/master/AIRR_Minimal_Standard_Data_Elements.tsv

https://github.com/airr-community/airr-standards

BioSample

Sequence Read Archive

CAIRR: A pipeline to submit AIRR data to the NCBI through the CEDAR-workbench

NCBI is an important resource for archiving biomedical data

● NCBI hosts a collection of biomedical databases:

○ BioProject, BioSample, SRA, GenBank, GEO etc.

● Provides infrastructure to submit experimental data and associated metadata

● Minimal use of standard terminologies to define the necessary metadata

○ Ontologies recommended for some data elements (not implemented)

● NCBI metadata are often described using inconsistent terminologies

○ This limits our ability to access, find, interoperate and reuse the data sets

Goal: Leverage CEDAR to improve NCBI metadata submissions

The NCBI BioSample guideline suggests using Disease Ontology terms

What are the issues with the current NCBI submission process?

● Rapid growth

● Lack of metadata standardization

● Error-prone data entry

● Lack of community-specific metadata (e.g., AIRR)

● Laborious metadata entry

NCBI Growth

GenBank Growth

Metadata Diversity in NCBI repositories

How are metadata currently submitted to NCBI?

BioProject

BioSample

Sequence Read Archive

Combination of web-based forms and Excel templates

● No mechanism to enforce standardized vocabularies or ontology links
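
What enforcing a standardized vocabulary could mean in practice can be sketched as a lookup that maps free-text entries to controlled terms and rejects anything unrecognized. This is a simplified illustration only: the vocabulary and the `EX:` identifiers below are made up and are not real ontology IDs.

```python
from typing import Dict, Optional

# Made-up vocabulary: labels mapped to placeholder identifiers (not real ontology IDs).
DISEASE_TERMS: Dict[str, str] = {
    "influenza": "EX:0001",
    "multiple sclerosis": "EX:0002",
}


def normalize_term(value: str, vocabulary: Dict[str, str]) -> Optional[str]:
    """Map a free-text entry to a controlled identifier, or None if unrecognized."""
    key = " ".join(value.lower().split())  # lowercase and collapse whitespace
    return vocabulary.get(key)


for entry in ["Influenza", " multiple  sclerosis", "flu"]:
    print(repr(entry), "->", normalize_term(entry, DISEASE_TERMS))
```

An entry such as "flu" returns `None` here, which is the point at which a form could prompt the submitter to pick a recognized term instead of accepting free text.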

CAIRR Workflow

CAIRR Templates

Created CEDAR templates to submit metadata to: NCBI BioProject, BioSample and SRA

AIRR Data Submission

CAIRR Metadata Generation

Data Submitter

NCBI CAIRR

Controlled Vocabularies

Predictive Entry

Interactive Metadata Entry

Metadata Findability

Metadata Accessibility

Metadata Interoperability

Metadata Reusability

represents limited feature availability

Metadata submissions to NCBI BioProject, BioSample and SRA are ontologically controlled and relationally linked, which enables concept-based federated queries across repositories that are silos otherwise.
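
The idea of relationally linked, ontology-controlled records can be sketched with a toy in-memory example: samples carry a controlled disease identifier and runs point back to their sample, so a query by concept can cross record types. All class names, identifiers and values below are hypothetical; this does not reflect the actual NCBI or CEDAR data models or APIs.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Sample:
    sample_id: str
    project_id: str
    disease_term: str  # placeholder controlled identifier, not a real ontology ID


@dataclass
class Run:
    run_id: str
    sample_id: str  # link back to the sample the run was generated from


# Made-up records standing in for linked sample and run entries.
samples = [
    Sample("S1", "P1", "EX:0001"),
    Sample("S2", "P1", "EX:0002"),
    Sample("S3", "P2", "EX:0001"),
]
runs = [Run("R1", "S1"), Run("R2", "S2"), Run("R3", "S3"), Run("R4", "S3")]


def runs_for_disease(term: str, samples: List[Sample], runs: List[Run]) -> List[str]:
    """Return IDs of runs whose linked sample is annotated with the given term."""
    wanted = {s.sample_id for s in samples if s.disease_term == term}
    return [r.run_id for r in runs if r.sample_id in wanted]


print(runs_for_disease("EX:0001", samples, runs))
```

The query works across projects only because every sample uses the same identifier for the same concept; with inconsistent free-text labels the match would silently miss records.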

Why CAIRR?

Resources

● Download AIRR NCBI templates: https://github.com/airr-community/airr-standards

● Manual on how to submit AIRR data to NCBI: https://www.overleaf.com/read/tytddwptgkhb

Breden et al. “Reproducibility and Reuse of Adaptive Immune Receptor Repertoire Data” (2017)

Rubelt, F., Busse, C., Bukhari, SAC et al. “Adaptive Immune Receptor Repertoire (AIRR) Community Recommendations for Sharing Immune Repertoire Sequencing Data” (2017)

● Kei-Hoi Cheung, Yale University, Dept. of Medical Informatics

● AIRR Community

● Kleinstein Lab, Yale University, Dept. of Pathology
