Giab jan2016 intro and update 160128

genomeinabottle.org

Genome in a Bottle Consortium January 2016

Stanford University, Stanford, CA

Reference Materials for Human Genome Sequencing

Marc Salit, Ph.D. and Justin Zook, Ph.D.

National Institute of Standards and Technology

PGP Trio data described in…

Lots of social media action

August Workshop Description

Open, public meeting with 170 attendees from Government/Academia/Commercial partners

Cell Systems , Volume 1 , Issue 3 , 176 - 177

GIAB Scope

• The Genome in a Bottle Consortium is developing the reference materials, reference methods, and reference data needed to assess confidence in human whole genome variant calls.
• Priority is authoritative characterization of human genomes.

GIAB steering committee, Aug 2015

Genome in a Bottle Consortium Development

• NIST met with sequencing technology developers to assess standards needs
– Stanford, June 2011
• Open, exploratory workshop
– ASHG, Montreal, Canada
– October 2011
• Small workshop at NIST to develop consortium for human genome reference materials
– FDA, NCBI, NHGRI, NCI, CDC, Wash U, Broad, technology developers, clinical labs, CAP, PGP, Partners, ABRF, others
– developed draft work plan
– April 2012
• Open, public meetings of GIAB
– August 2012 at NIST
– March 2013 at Xgen
– August 2013 at NIST
– January 2014 at Stanford
– August 2014 at NIST
– January 2015 at Stanford
– August 2015 at NIST
– January 28-29, 2016 at Stanford
– September 15-16, 2016 at NIST?
• Website
– www.genomeinabottle.org

Well-characterized, stable RMs
• Obtain metrics for validation, QC, QA, PT
• Determine sources and types of bias/error
• Learn to resolve difficult structural variants
• Improve reference genome assembly
• Optimization
• Enable regulated applications

Comparison of SNP Calls for NA12878 on 2 platforms, 3 analysis methods

Analytical Performance

Generic measurement process: Sample, gDNA isolation, Library Prep, Sequencing, Alignment/Mapping, Variant Calling, Confidence Estimates, Downstream Analysis

• Use gDNA reference materials to benchmark performance
• Characterized Pilot Genome NA12878
• Ashkenazim Trio, Asian Trio from PGP in process
• Tools to facilitate their use
– With the Global Alliance Data Working Group Benchmarking Team

High-confidence SNP/indel calls

Zook et al., Nature Biotechnology, 2014.

• methods to develop SNP/indel call set described in manuscript

• broad and quick adoption of call set for benchmarking– struck nerve

NIST Released the GIAB Pilot Genome as RM 8398 in May 2015

>150 units sold so far

NIST Human Genome Reference Materials (RMs)

• NIST RM 8398 is available!
– tinyurl.com/giabpilot
– DNA isolated from large growth cell cultures
– Stable, homogeneous
– Best for regulated uses
– DNA from same cell line at Coriell (NA12878)
• New AJ and Asian Samples
– Available from Coriell now
– NIST RM available in 2016

Jan 2016 Workshop

Thursday
• Update and Roadmap
• Breakouts
– Analyses for PGP GIAB Trios
– Reference Material Selection and Development
• Breakout reports
• Roadmap discussion

Friday
• Using GIAB Products for technology development, optimization, and demonstration
– Experiences from the consortium
• Steering committee

Steering Committee Meeting

Topics
• Future workshops
– Format
– Program committee?
• Crafting a mission statement
• Defining scope
• Liaison with other efforts

Current members
– Marc Salit
– Justin Zook
– David Mittelman
– Andrew Grupe
– Michael Eberle
– Steve Sherry
– Deanna Church
– Francisco De La Vega
– Christian Olsen
– Monica Basehore
– Lisa Kalman
– Christopher Mason
– Elizabeth Mansfield
– Liz Kerrigan
– Leming Shi
– Melvin Limson
– Alexander Wait Zaranek
– Nils Homer
– Fiona Hyland
– Steve Lincoln
– Don Baldwin
– Robyn Temple-Smolkin
– Chunlin Xiao
– Kara Norman
– Luke Hickey

Agenda

Monday
• Breakfast and registration
• Welcome and Context Setting
• NIST RM Update and Status Report
• Charge to Working Groups
• Coffee Break
• Working Group Breakout Discussions
• Lunch (provided)
• Informal Working Group Reports
• Coffee Break
• Breakout Topical Discussions
– Topic #1: Moving beyond the 'easy' variants and regions of the genome
– Topic #2: Selecting future genomes for Reference Materials

Tuesday
• Breakfast and registration
• Use cases: Experiences using the pilot Reference Material
• Discussion of plans to release pilot Reference Material
• Coffee Break
• Working Group Breakout discussions
• Lunch (provided)
• Working Group leaders present plans and discussion
• Steering committee Overview
• First meeting of the Steering Committee (others adjourn)

Please Note

Slides will be made available on SlideShare after the workshop (see genomeinabottle.org).

Tweets are welcome unless the speaker requests otherwise. Please use #giab as the hashtag.

We are liaising with…
• Illumina Platinum Genomes
• CDC GeT-RM
• Korean Genome Project
• Genome Reference Consortium
• 1000 Genomes SV group
• CAP/CLIA
• ABRF
• FDA
• SEQC
• Global metrology system
• Global Alliance for Genomics and Health Benchmarking Team
• NCBI/CDC GeT-RM Browser
• GCAT website

NGS RM Project

Slide courtesy of Lisa Kalman, CDC

Association of Biomolecular Resource Facilities (ABRF)
www.abrf.org

Next Generation Sequencing Study
Phase 2: DNA sequencing platforms

Study Design and Launch Plan

Slides courtesy of Chris Mason

January 24, 2016

Aims

• Create reference data sets - Sequence data from reference samples will be generated with intra- and inter-lab replication to model the likely range of performance that should be expected under normal service laboratory conditions.
• Test and create reference samples - Designated reference samples will be easily accessible to the community for self-evaluation by comparison to the reference data. Samples should be standardized, able to be stably reproduced over time, and suitable for development of new laboratory and bioinformatics methods.
• Data release and immediate utility - Performance metrics and data will be developed for instrument platforms and sample preparation protocols that are deployed now or will be in the near future in core sequencing facilities. After QC, data will be released to the entire Genome in a Bottle (GIAB) and ABRF Consortia for use and preparation for submission as publications.

ABRF NGS Phase II Study

Samples and Platforms – All tested in triplicate across three distinct sites

Platform                            Human DNA           Bacterial DNA
Illumina HiSeq X Ten                A, B, C, C2, C2f    -
Illumina HiSeq 4000                 A, B, C             -
Illumina HiSeq 2500 v4 1T           A, B, C             -
Illumina HiSeq 2500 v3 Rapid Run    C                   Ste, Eco, Mil, P
Illumina NextSeq 500 High Output    C                   -
Illumina MiSeq                      -                   Ste, Eco, Mil, P
Life Tech Proton                    A, B, C exomes      Ste, Eco, Mil, P
Life Tech S5                        A, B, C exomes      Ste, Eco, Mil, P
Life Tech PGM                       -                   Ste, Eco, Mil, P
Pacific Biosciences                 -                   Ste, Eco, Mil, P
Oxford Nanopore                     -                   Ste, Eco, Mil, P

[Figure: Human Trio, and Bacterial Isolates and Mixture. Human trio samples A, B, C, C2: maternal, paternal, son, son (Coriell). Bacterial samples Ste, Eco, Mil, pool: Staphylococcus epidermidis, Escherichia coli, Micrococcus luteus, 8 additional species.]


Illumina (ILMN) - Samples

[Figure: Personal Genome Project / NIST Reference Human Genomes (A, B, C, C2, C2f: maternal, paternal, son, son (Coriell)) and reference bacterial genomes (Ste, Eco, Mil, pool: Staphylococcus epidermidis, Escherichia coli, Micrococcus luteus, 8 additional species; %GC: 28, 50, 72). Library labels: Reference DNA, TruSeq PCR-free 350; TruSeq PCR-free 550; FFPE DNA, TruSeq Nano; FFPE DNA, TruSeq PCR-free; KAPA libraries from sites a-b-c.]


Organization and Leadership


Sequencing Quality Control Phase II (SEQC2) – An Introduction

Slides courtesy of Weida Tong, Ph.D.
Division of Bioinformatics and Biostatistics, NCTR/FDA

SEQC2 Overview

Study Design

Datasets:
• Approaches: WGS and TGS
• Platforms: HiSeq, PacBio, MiSeq, etc.
• Samples: trios, NB, cell lines, etc.

Parameters:
• Personal vs reference genome
• Bioinformatics
• Coverage

• Short reads vs long reads: Assess short reads alone, long reads alone, and their combination for genome assembly and subsequent variant calling in WGS.
• Variant calling (e.g., SNV, CNV, indels): Assess WGS accuracy and reproducibility for variant calling by investigating the joint effect of read alignment pipelines, variant calling methods, and coverage, and by comparing results from a personal genome versus the reference genome.
• Detection power for rare mutations: Assess the detection power of ultra-deep sequencing (TGS) for subclonal mutations and its dependency on bioinformatics and coverage.
• Detection accuracy for difficult genes: Assess the accuracy for difficult genes, which varies significantly with the complexity of their genomic regions (e.g., GC regions), with a specific focus on HLA genes.
• Application scope of MiSeq: Assess the utility of MiSeq for (1) detection of subclonal mutations, (2) difficult genes (e.g., HLA), and (3) difficult variations (e.g., indels).

Three Trio Datasets

Columns: Trio Study; Coverage/platform (Short reads, Long reads); Notes

• SEQC2: HapMap Trio (European)
– Short reads: 80x
– Long reads: TBD
– Notes: Planned for both WGS and TGS; genotyping data and information from HapMap are available
• GIAB: Trio (Ashkenazim)
– Short reads: Illumina 300x; Complete Genomics; Ion Torrent; SOLiD (WGS)
– Long reads: 69x (son), 30x (parents); BioNano; Moleculo
– Notes: This dataset is generated by the Genome In A Bottle (GIAB) consortium. We work closely with GIAB to obtain updated information on this trio, and the GIAB leaders also participate in SEQC2.
• SEQC2: Chinese Trio and test of LCL-germline
– Short reads: 100x
– Long reads: 50x
– Notes: Planned; the datasets will be provided by Dr. Leming Shi, who is part of the SEQC2 leadership team.

Candidate NIST Reference Materials

Genome                  PGP ID      Coriell ID   NIST ID   NIST RM #
CEPH Mother/Daughter    N/A         GM12878      HG001     RM8398
AJ Son                  huAA53E0    GM24385      HG002     RM8391 (son)/RM8392 (trio)
AJ Father               hu6E4515    GM24149      HG003     RM8392 (trio)
AJ Mother               hu8E87A9    GM24143      HG004     RM8392 (trio)
Asian Son               hu91BD69    GM24631      HG005     RM8393
Asian Father            huCA017E    GM24694      N/A       N/A
Asian Mother            hu38168C    GM24695      N/A       N/A

NIST Microbial Genomic DNA Reference Materials

Credit: Nate Olson

Analysis process for Microbial RMs

Credit: Nate Olson

GIAB Progress Update

January 2016

Columns: Dataset; AJ Son; AJ Parents; Chinese son; Chinese parents; NA12878

Illumina Paired-end: X X X X X
Illumina Long Mate pair: X X X X X
Illumina “moleculo”: X X X X X
Complete Genomics: X X X X X
Complete Genomics LFR: X X X
Ion exome: X X X X
BioNano: X X X X
10X: X X X
PacBio: X X X
SOLiD single end: X X X
Illumina exome: X X X X
Oxford Nanopore: X

[Figure: GIAB ftp site downloads/unique-IPs by month. Horizontal axis: Month, 2013-6 to 2015-12. Vertical axes: 0 to 90000, and # IPs, 0 to 2000.]

GIAB Analysis Group – New Data Sets

Leaders
• Francisco de la Vega, Stanford, TOMA Biosciences
• Chris Mason, Weill Cornell Medical Center
• Tina Graves, Washington University
• Valerie Schneider, NCBI
• and Justin and Marc

Strategic Documents
• Analysis Group Responsibilities:
– https://docs.google.com/document/d/10eA0DwB4iYTSFM_LPO9_2LyyN2xEqH49OXHhtNH1uzw/edit?usp=sharing
• Analysis Milestones:
– https://docs.google.com/spreadsheets/d/1Pj4nSzH742g40wJz2fA6f8kFtZYAToZpSZYVPiC5st4/edit?usp=sharing
• Analysis Methods:
– https://docs.google.com/spreadsheets/d/1Je2g85H7oK6kMXbBOoqQ1FMNrvGnFuUJTJn7deyYiS8/edit?usp=sharing
• Analysis Plan:
– https://drive.google.com/file/d/0B7Ao1qqJJDHQdnVEaVdqbWdEdkE/view?usp=sharing
• Collecting data and analyses on GIAB FTP Site
• Recruiting people to help with the work

Goal: Establish and distribute a set of authoritative benchmark variant calls of all types and sizes, as well as homozygous reference regions, on GIAB PGP trios

GIAB Analysis Group – New Data Sets

Types of analyses and status
• SNPs/indels
– NIST working on integration
– 10X/moleculo/PacBio for difficult-to-map regions
• Assembly
– 2 de novo assemblies
– Being used for SV calling
• Structural variants
– Candidate calls being generated by 15+ groups with >20 different algorithms and 6 datasets
– 3+ integration methods
– ~monthly calls
• Long-range Phasing
– 2 phased calls so far (CG LFR and 10X)
– Integration methods needed
• Methylation analyses

SNP/Indel Integration Method Update
• Implementing refined integration methods
– Developed so others can readily reproduce results
– Consistent results for all GIAB genomes
– Simpler process taking advantage of best practices for each technology
• Validating with released NA12878 RM data
– Preliminary comparisons show minor changes
• Application to PGP trios
– Plan to analyze AJ trio by Q2 2016
– Release of NIST RMs in Q2 2016
– Develop calls for GRCh38

Data Release: Real-time, Open, Public Release

Individual Datasets
• Uploaded to GIAB FTP site as it is collected
• Includes raw reads, aligned reads, and variant/reference calls

Integrated High-confidence Calls
• First develop SNP, indel, and homozygous reference calls
• Then develop SV and non-SV calls
• Released calls are versioned
• Preliminary callsets will be made available to be critiqued

GIAB AJ Trio Hybrid PacBio/BioNano Assembly

Hybrid (PacBio with BioNano)

Columns: Input; Assembly; Notes; # of Scaffolds; N50; Max; Total

• HG002, Falcon: 248 scaffolds; N50 22.7Mb; Max 92.8Mb; Total 2.38Gb
• Trio, Falcon: 210 scaffolds; N50 29.3Mb; Max 87.6Mb; Total 2.32Gb
• Two Step Trio, celera (child) + falcon (trio): 187 scaffolds; N50 34.3Mb; Max 98.0Mb; Total 2.6Gb

Credits: Ali Bashir, Jason Chin, Alex Hastie
Pendleton et al, Nature Methods, 2015
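The N50 column above is the standard assembly contiguity statistic: the scaffold length at which the scaffolds of that length or longer cover at least half of the assembled bases. A minimal sketch of the calculation follows; the function name and the toy lengths are illustrative assumptions, not values from the GIAB assemblies.

```python
def n50(lengths):
    """Return the N50 of a collection of scaffold lengths (in bp).

    Scaffolds are sorted from longest to shortest and accumulated until the
    running sum reaches half of the total assembly length; the length of the
    scaffold that crosses that point is the N50.
    """
    if not lengths:
        raise ValueError("need at least one scaffold length")
    half_total = sum(lengths) / 2
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= half_total:
            return length


# Hypothetical toy assembly (lengths in bp), not GIAB data.
toy_scaffolds = [80, 70, 50, 40, 30, 20]
print("N50:", n50(toy_scaffolds))
```

Because N50 depends on the total assembled length, it is easiest to compare between assemblies whose totals are similar, as they are in the rows above.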

svclassify

Proposed approach to form high-confidence SV (and non-SV) calls

Generate Candidate Calls

Compare/evaluate calls using Parliament/MetaSV/svclassify/others?; manual inspection

Integrate new and revised calls; manual inspection

Combine integrated calls; manual inspection; targeted experimental validation?

August 30, 2015

January 2016

Plan in January 2016

Feb 2016 and beyond

Deletion overlap summary for son

By # of callsets

# of callsets    # of calls
1                3780
2                1391
3                859
4                574
5+               344

By Technology

Technology             # of calls
Illumina               3277
PacBio                 5177
BioNano                812
CG                     1758
Illumina/CG+PacBio     2318
Illumina/CG+BioNano    518
PacBio+BioNano         467
2+ technologies        2661

Converted all to bed; combined with bedtools multiinter; calls within 50 bp were merged
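The counting behind a summary like this can be illustrated in plain Python. This is a simplified sketch and not the bedtools multiinter workflow used for the slide: it pools BED-style intervals from several callsets, merges calls that overlap or lie within 50 bp of each other, and tallies how many callsets support each merged region. The function names and the toy callsets are assumptions made for illustration.

```python
from collections import Counter


def merge_calls(callsets, max_gap=50):
    """Pool calls from all callsets and merge those within max_gap bp.

    callsets maps a callset name to a list of (chrom, start, end) tuples in
    BED-style coordinates. Returns a list of
    (chrom, start, end, frozenset_of_supporting_callsets).
    """
    pooled = sorted(
        (chrom, start, end, name)
        for name, calls in callsets.items()
        for chrom, start, end in calls
    )
    merged = []
    for chrom, start, end, name in pooled:
        last = merged[-1] if merged else None
        if last and last[0] == chrom and start - last[2] <= max_gap:
            # Overlapping or within max_gap of the previous region: extend it.
            last[2] = max(last[2], end)
            last[3].add(name)
        else:
            merged.append([chrom, start, end, {name}])
    return [(c, s, e, frozenset(names)) for c, s, e, names in merged]


def support_histogram(merged):
    """Count merged regions by the number of callsets supporting them."""
    return Counter(len(names) for _, _, _, names in merged)


# Hypothetical toy deletion calls, not GIAB data.
toy_callsets = {
    "illumina": [("chr1", 1000, 1500), ("chr1", 9000, 9400)],
    "pacbio": [("chr1", 1020, 1530), ("chr2", 500, 900)],
    "bionano": [("chr1", 1560, 2000)],
}
regions = merge_calls(toy_callsets)
for n_callsets, n_regions in sorted(support_histogram(regions).items()):
    print(n_callsets, "callset(s):", n_regions, "region(s)")
```

Note that chained merging can join calls that are individually far apart, which is one reason merged counts depend on the gap threshold chosen.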

Preliminary Confirmation of SVs
Integration results from AJ son

• Parliament
– Candidates from Illumina
– Confirmed by PacBio and/or Illumina
– ~50% in both technologies
– ~4.5k deletions, 1k insertions
– 85% of genotypes consistent within trio
• MetaSV
– Multiple types of evidence from Illumina

Parliament: BMC Genomics, 2015, 16:286 (performed by Andrew Carroll, DNAnexus)
MetaSV: Bioinformatics, 2015, 31:2741 (performed by Marghoob Mohiyuddin, Bina/Roche)

[Figure: overlap of MetaSV and Parliament calls. MetaSV total: 2809, of which 569 (20 %) overlap Parliament and 2240 (80 %) are MetaSV only. Parliament total: 5467, of which 977 (18 %) overlap MetaSV and 4490 (82 %) are Parliament only. 50 % reciprocal overlap. Some overlap within Parliament calls.]
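A 50 % reciprocal overlap criterion means that the shared span must cover at least half of each of the two calls being compared. The sketch below shows the test and why the number of matched calls can differ between the two callsets (several calls in one set may match the same call in the other). Function names and coordinates are illustrative assumptions; this is not the comparison code used for the slide.

```python
def reciprocal_overlap(a, b, min_fraction=0.5):
    """True if calls a and b overlap by at least min_fraction of each call.

    a and b are (chrom, start, end) tuples with end > start.
    """
    if a[0] != b[0]:
        return False
    shared = min(a[2], b[2]) - max(a[1], b[1])
    if shared <= 0:
        return False
    return (shared >= min_fraction * (a[2] - a[1])
            and shared >= min_fraction * (b[2] - b[1]))


def count_matched(calls, other_calls, min_fraction=0.5):
    """Number of calls with at least one reciprocal-overlap partner in other_calls."""
    return sum(
        any(reciprocal_overlap(c, o, min_fraction) for o in other_calls)
        for c in calls
    )


# Hypothetical toy calls: two calls in set_b match the single call in set_a.
set_a = [("chr1", 100, 200)]
set_b = [("chr1", 110, 210), ("chr1", 90, 190), ("chr1", 100, 1000)]
print(count_matched(set_a, set_b), "of", len(set_a), "calls in A matched")
print(count_matched(set_b, set_a), "of", len(set_b), "calls in B matched")
```

The third call in set_b contains the call from set_a entirely, yet it is not a match because the shared span is far less than half of its own length.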

GeT-RM Browser from NCBI and CDC
• http://www.ncbi.nlm.nih.gov/variation/tools/get-rm/
• Allows visualization of data underlying each call

Uses of GIAB NA12878

Oncology – Molecular and Cellular Tumor Markers

“Next Generation” Sequencing (NGS) guidelines for somatic genetic variant detection

www.bioplanet.com/gcat

Global Alliance for Genomics and Health Benchmarking Task Team

Progress
• Initial version of standardized definitions for performance metrics like TP, FP, and FN
• Continued development of sophisticated benchmarking tools
– vcfeval – Len Trigg
– hap.py – Peter Krusche
– vgraph – Kevin Jacobs
• Standardized intermediate and final file formats
• Standardized bed files with difficult genome contexts for stratification
• Simulating reads with non-SNP ClinVar variants to demonstrate importance of these tools
• github.com/ga4gh/benchmarking-tools

Next steps
• Further analysis to demonstrate importance of sophisticated tools
• Write manuscript about the team’s tools
• Integrate vcfeval and hap.py to take advantage of strengths of each
• Recommend “Best Practices” for benchmarking
• Explore venues for making the team’s benchmarking process easier to use


Proposed Performance Metrics Definitions

• Define TP/FP/FN/TN in 4 ways depending on required stringency of match:
– Loose match: TP if within x-bp of a true variant
– Allele match: TP if ALT allele matches
– Genotype match: TP if genotype and ALT allele match
– Phasing match: TP if genotype, ALT allele, and phasing with nearby variants all match
• True negatives are difficult to define because an infinite number of potential alleles exist
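The four stringency levels nest: a phasing match implies a genotype match, which implies an allele match, which implies a loose match. The sketch below returns the strictest level a query call reaches against a truth call. It is a deliberately simplified illustration: the Call class, the window default, and the treatment of phasing as agreement of the ordered genotype are assumptions, and the sketch compares records as written rather than reconciling different representations of the same variant.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Call:
    """Hypothetical minimal variant record used only for this illustration."""
    chrom: str
    pos: int
    ref: str
    alt: str
    genotype: Tuple[int, int]  # e.g. (0, 1); order is meaningful only if phased
    phased: bool = False


def match_level(truth: Call, query: Call, window: int = 10) -> Optional[str]:
    """Return the strictest match level reached, or None if there is no match."""
    if truth.chrom != query.chrom or abs(truth.pos - query.pos) > window:
        return None
    same_allele = (truth.pos, truth.ref, truth.alt) == (query.pos, query.ref, query.alt)
    if not same_allele:
        return "loose"      # nearby, but a different allele
    if sorted(truth.genotype) != sorted(query.genotype):
        return "allele"     # same ALT allele, different genotype
    if truth.phased and query.phased and truth.genotype == query.genotype:
        return "phasing"    # genotype and haplotype assignment agree
    return "genotype"


truth = Call("chr21", 1000, "A", "G", (0, 1), phased=True)
queries = [
    Call("chr21", 1000, "A", "G", (0, 1), phased=True),
    Call("chr21", 1000, "A", "G", (1, 0), phased=True),
    Call("chr21", 1000, "A", "G", (1, 1)),
    Call("chr21", 1004, "A", "T", (0, 1)),
    Call("chr21", 5000, "A", "G", (0, 1)),
]
for q in queries:
    print(q.pos, q.alt, q.genotype, "->", match_level(truth, q))
```

A call counts as a TP under a given definition if its level is at least as strict as the one required, so the same query set yields different TP/FP/FN counts under each definition.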

Global Alliance for Genomics and Health Benchmarking Task Team

Credit: Rebecca Truty, Complete Genomics

How should we interpret this complex variant on chr21?

GA4GH Benchmarking Tool Architecture

• Inputs: Truth VCF, Query VCF
• Comparison Engine: vcfeval / vgraph / xcmp / bcftools / ...
• VCF-I: Two-column VCF with TP/FP/FN annotations
• Quantification, e.g. quantify / hap.py
– Additional inputs: Stratification BED files, Confident Call Regions
• VCF-R: Two-column VCF with TP/FP/FN/UNK annotations
• Counts

Credit: Peter Krusche
https://github.com/ga4gh/benchmarking-tools
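The quantification stage can be pictured as counting the truth-column and query-column annotations within each stratification region and turning the counts into metrics. The sketch below is a toy model of that idea only: the record tuples, region lists, and function names are assumptions for illustration and do not reflect the actual file formats or interfaces of quantify or hap.py.

```python
from collections import Counter


def in_regions(chrom, pos, regions):
    """True if (chrom, pos) falls in any BED-style (chrom, start, end) region."""
    return any(c == chrom and start <= pos < end for c, start, end in regions)


def quantify(records, strata):
    """Summarize annotated comparison records per stratum.

    records: iterable of (chrom, pos, truth_label, query_label), where
             truth_label is "TP", "FN" or None and query_label is "TP",
             "FP", "UNK" or None (mirroring a two-column annotation).
    strata:  dict mapping a stratum name to a list of regions.
    """
    counts = {name: Counter() for name in strata}
    for chrom, pos, truth_label, query_label in records:
        for name, regions in strata.items():
            if in_regions(chrom, pos, regions):
                if truth_label:
                    counts[name]["truth_" + truth_label] += 1
                if query_label:
                    counts[name]["query_" + query_label] += 1
    summary = {}
    for name, c in counts.items():
        truth_total = c["truth_TP"] + c["truth_FN"]
        query_total = c["query_TP"] + c["query_FP"]  # UNK calls are not scored
        summary[name] = {
            "recall": c["truth_TP"] / truth_total if truth_total else None,
            "precision": c["query_TP"] / query_total if query_total else None,
            "unk": c["query_UNK"],
        }
    return summary


# Hypothetical toy records and strata, not real benchmarking output.
toy_strata = {"all": [("chr1", 0, 10000)], "difficult": [("chr1", 5000, 6000)]}
toy_records = [
    ("chr1", 100, "TP", "TP"),
    ("chr1", 200, "TP", "TP"),
    ("chr1", 300, "FN", None),
    ("chr1", 5100, None, "FP"),
    ("chr1", 5200, "TP", "TP"),
    ("chr1", 5300, None, "UNK"),
]
for stratum, metrics in quantify(toy_records, toy_strata).items():
    print(stratum, metrics)
```

Keeping UNK separate from FP reflects the role of the confident call regions: query calls outside them cannot be judged right or wrong.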

Peter Krusche

Important: we should also annotate superloci.

Rebecca Truty

To facilitate building ROCs and other downstream analysis it would be helpful to preserve scores and annotations as much as possible here. I know that's difficult in a partial credit context, but preserving the annotations on the original call, along with a flag to indicate that this VCF line only represents part of the call, would be great

Peter Krusche

Quantify does this actually, it preserves GQ(X) and QUAL for truth and query separately.

Peter Krusche

Alternatively, this could also be handled in the comparison engine, which could take any given VCF feature (info / format field) and write it out as a uniformly named feature?

Rebecca Truty

That would be ok, but there are other VCF annotations that might be nice to have downstream -- things like functional annotations. It's nice to be able to quickly pull out things like sensitivity in the exome or sensitivity to things novel to dbsnp. If that's already in the input VCF and the annotation is carried through then that becomes trivial. But I also see the utility of having a clean slate with consistent feature names...

Peter Krusche

This can be used to create ROCs in a separate step.

Approaches to Benchmarking Variant Calling

• Well-characterized whole genome Reference Materials

• Many samples characterized in clinically relevant regions

• Synthetic DNA spike-ins
• Cell lines with engineered mutations
• Simulated reads
• Modified real reads
• Modified reference genomes
• Confirming results found in real samples over time

Challenges in Benchmarking Small Variant Calling

• It is difficult to do robust benchmarking of tests designed to detect many analytes (e.g., many variants)

• Easiest to benchmark only within high-confidence bed file, but…

• Benchmark calls/regions tend to be biased towards easier variants and regions
– Some clinical tests are enriched for difficult sites

• Challenges with benchmarking complex variants near boundaries of high-confidence regions

• Always manually inspect a subset of FPs/FNs
• Stratification by variant type and region is important
• Always calculate confidence intervals on performance metrics
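As an illustration of the last point, a sensitivity or precision estimate is a binomial proportion, so an interval can be attached to it. The sketch below uses the Wilson score interval, which is one common choice and an assumption here (the slides do not name a method); the counts are made up.

```python
import math


def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 for ~95 %)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    if not 0 <= successes <= trials:
        raise ValueError("successes must be between 0 and trials")
    p = successes / trials
    denom = 1 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half_width = z * math.sqrt(
        p * (1 - p) / trials + z * z / (4 * trials * trials)
    ) / denom
    return max(0.0, center - half_width), min(1.0, center + half_width)


# Hypothetical counts: 95 of 100 benchmark variants detected in one stratum.
tp, fn = 95, 5
low, high = wilson_interval(tp, tp + fn)
print(f"sensitivity = {tp / (tp + fn):.3f}, 95% CI = ({low:.3f}, {high:.3f})")
```

With only 100 variants in a stratum the interval spans several percentage points, which is why small strata (for example, difficult regions) need cautious interpretation.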

Particular Challenges in Benchmarking SV Calling

• How to establish benchmark calls for difficult regions?

• How to establish non-SV regions to assess FP rates?
• Multiple dimensions of accuracy:
– Predicted SV existence
– Predicted SV type
– Predicted size
– Predicted breakpoints
– Predicted exact sequence

Acknowledgments

• FDA – Elizabeth Mansfield

• Many members of Genome in a Bottle
– New members welcome!
– Sign up on website for email newsletters

GIAB Steering Committee
– Marc Salit
– Justin Zook
– David Mittelman
– Andrew Grupe
– Michael Eberle
– Steve Sherry
– Deanna Church
– Francisco De La Vega
– Christian Olsen
– Monica Basehore
– Lisa Kalman
– Christopher Mason
– Elizabeth Mansfield
– Liz Kerrigan
– Leming Shi
– Melvin Limson
– Alexander Wait Zaranek
– Nils Homer
– Fiona Hyland
– Steve Lincoln
– Don Baldwin
– Robyn Temple-Smolkin
– Chunlin Xiao
– Kara Norman
– Luke Hickey

For More Information

www.genomeinabottle.org - sign up for general GIAB and Analysis Team google group emails

github.com/genome-in-a-bottle – Guide to GIAB data & ftp

www.slideshare.net/genomeinabottle

www.ncbi.nlm.nih.gov/variation/tools/get-rm/ - Get-RM Browser

Data: http://biorxiv.org/content/early/2015/09/15/026468

Global Alliance Benchmarking Team
– https://github.com/ga4gh/benchmarking-tools

Twice yearly public workshops
– Winter at Stanford University, California, USA
– Summer at NIST, Maryland, USA

Justin Zook: [email protected]
Marc Salit: [email protected]

GIAB Roadmap: Where are we, Where are we going?

• Reference Materials
– Germline
– Somatic
• Informatics
– Analysis of GIAB data
– Benchmarking
• Documentary Standards/Publications
– Documentation of methods
– Supporting Use
