Giab jan2016 intro and update 160128
genomeinabottle.org
Genome in a Bottle Consortium January 2016
Stanford University, Stanford, CA
Reference Materials for Human Genome Sequencing
Marc Salit, Ph.D. and Justin Zook, Ph.D.
National Institute of Standards and Technology
PGP Trio data described in…
Lots of social media action
August Workshop Description
Open, public meeting with 170 attendees from Government/Academia/Commercial partners
Cell Systems , Volume 1 , Issue 3 , 176 - 177
GIAB Scope
• The Genome in a Bottle Consortium is developing the reference materials, reference methods, and reference data needed to assess confidence in human whole genome variant calls.
• Priority is authoritative characterization of human genomes.
GIAB steering committee, Aug 2015
Genome in a Bottle Consortium Development
• NIST met with sequencing technology developers to assess standards needs
– Stanford, June 2011
• Open, exploratory workshop
– ASHG, Montreal, Canada
– October 2011
• Small workshop at NIST to develop consortium for human genome reference materials
– FDA, NCBI, NHGRI, NCI, CDC, Wash U, Broad, technology developers, clinical labs, CAP, PGP, Partners, ABRF, others
– developed draft work plan
– April 2012
• Open, public meetings of GIAB
– August 2012 at NIST
– March 2013 at Xgen
– August 2013 at NIST
– January 2014 at Stanford
– August 2014 at NIST
– January 2015 at Stanford
– August 2015 at NIST
– January 28-29, 2016 at Stanford
– September 15-16, 2016 at NIST?
• Website
– www.genomeinabottle.org
Well-characterized, stable RMs
• Obtain metrics for validation, QC, QA, PT
• Determine sources and types of bias/error
• Learn to resolve difficult structural variants
• Improve reference genome assembly
• Optimization
• Enable regulated applications
[Figure: Comparison of SNP Calls for NA12878 on 2 platforms, 3 analysis methods]
Analytical Performance
Generic measurement process: Sample → gDNA isolation → Library Prep → Sequencing → Alignment/Mapping → Variant Calling → Confidence Estimates → Downstream Analysis
• Use gDNA reference materials to benchmark performance
• Characterized Pilot Genome NA12878
• Ashkenazim Trio, Asian Trio from PGP in process
• Tools to facilitate their use
– With the Global Alliance Data Working Group Benchmarking Team
High-confidence SNP/indel calls
Zook et al., Nature Biotechnology, 2014.
• methods to develop SNP/indel call set described in manuscript
• broad and quick adoption of call set for benchmarking– struck nerve
NIST Released the GIAB Pilot Genome
as RM 8398 in May 2015
>150 units sold so far
NIST Human Genome Reference Materials (RMs)
• NIST RM 8398 is available!
– tinyurl.com/giabpilot
– DNA isolated from large growth cell cultures
– Stable, homogeneous
– Best for regulated uses
– DNA from same cell line at Coriell (NA12878)
• New AJ and Asian Samples
– Available from Coriell now
– NIST RM available in 2016
Jan 2016 Workshop
Thursday
• Update and Roadmap
• Breakouts
– Analyses for PGP GIAB Trios
– Reference Material Selection and Development
• Breakout reports
• Roadmap discussion
Friday
• Using GIAB Products for technology development, optimization, and demonstration
– Experiences from the consortium
• Steering committee
Steering Committee Meeting
Topics
• Future workshops
• Format
• Program committee?
• Crafting a mission statement
• Defining scope
• Liaison with other efforts
Current members
Marc Salit, Justin Zook, David Mittelman, Andrew Grupe, Michael Eberle, Steve Sherry, Deanna Church, Francisco De La Vega, Christian Olsen, Monica Basehore, Lisa Kalman, Christopher Mason, Elizabeth Mansfield, Liz Kerrigan, Leming Shi, Melvin Limson, Alexander Wait Zaranek, Nils Homer, Fiona Hyland, Steve Lincoln, Don Baldwin, Robyn Temple-Smolkin, Chunlin Xiao, Kara Norman, Luke Hickey
Agenda
Monday
• Breakfast and registration
• Welcome and Context Setting
• NIST RM Update and Status Report
• Charge to Working Groups
• Coffee Break
• Working Group Breakout Discussions
• Lunch (provided)
• Informal Working Group Reports
• Coffee Break
• Breakout Topical Discussions
– Topic #1: Moving beyond the 'easy' variants and regions of the genome
– Topic #2: Selecting future genomes for Reference Materials
Tuesday
• Breakfast and registration
• Use cases: Experiences using the pilot Reference Material
• Discussion of plans to release pilot Reference Material
• Coffee Break
• Working Group Breakout discussions
• Lunch (provided)
• Working Group leaders present plans and discussion
• Steering committee Overview
• First meeting of the Steering Committee (others adjourn)
Please Note
Slides will be made available on SlideShare after the workshop (see genomeinabottle.org).
Tweets are welcome unless the speaker requests otherwise. Please use #giab as the hashtag.
We are liaising with…
• Illumina Platinum Genomes
• CDC GeT-RM
• Korean Genome Project
• Genome Reference Consortium
• 1000 Genomes SV group
• CAP/CLIA
• ABRF
• FDA
• SEQC
• Global metrology system
• Global Alliance for Genomics and Health Benchmarking Team
• NCBI/CDC GeT-RM Browser
• GCAT website
NGS RM Project
Slide courtesy of Lisa Kalman, CDC
Association of Biomolecular Resource Facilities (ABRF)
www.abrf.org
Next Generation Sequencing Study
Phase 2: DNA sequencing platforms
Study Design and Launch Plan
Slides courtesy of Chris Mason
January 24, 2016
Aims
• Create reference data sets - Sequence data from reference samples will be generated with intra- and inter-lab replication to model the likely range of performance that should be expected under normal service laboratory conditions.
• Test and create reference samples - Designated reference samples will be easily accessible to the community for self-evaluation by comparison to the reference data. Samples should be standardized, able to be stably reproduced over time, and suitable for development of new laboratory and bioinformatics methods.
• Data release and immediate utility - Performance metrics and data will be developed for instrument platforms and sample preparation protocols that are deployed now or will be in the near future in core sequencing facilities. After QC, data will be released to the entire Genome in a Bottle (GIAB) and ABRF Consortia for use and preparation for submission as publications.
ABRF NGS Phase II Study
Samples and Platforms – All tested in triplicate across three distinct sites
Platform | Human DNA | Bacterial DNA
Illumina HiSeq X Ten | A, B, C, C2, C2f |
Illumina HiSeq 4000 | A, B, C |
Illumina HiSeq 2500 v4 1T | A, B, C |
Illumina HiSeq 2500 v3 Rapid Run | C | Ste, Eco, Mil, P
Illumina NextSeq 500 High Output | C |
Illumina MiSeq | | Ste, Eco, Mil, P
Life Tech Proton | A, B, C exomes | Ste, Eco, Mil, P
Life Tech S5 | A, B, C exomes | Ste, Eco, Mil, P
Life Tech PGM | | Ste, Eco, Mil, P
Pacific Biosciences | | Ste, Eco, Mil, P
Oxford Nanopore | | Ste, Eco, Mil, P
[Figure: Human Trio (maternal, paternal, son, son (Coriell); samples A, B, C, C2) and Bacterial Isolates and Mixture (Ste, Eco, Mil, pool; Staphylococcus epidermidis, Escherichia coli, Micrococcus luteus, 8 additional species)]
Illumina (ILMN) - Samples
[Figure: Illumina sample and library layout. Personal Genome Project / NIST Reference Human Genomes: maternal, paternal, son, son (Coriell); samples A, B, C, C2, C2f. Reference bacterial genomes: Staphylococcus epidermidis, Escherichia coli, Micrococcus luteus, 8 additional species; Ste, Eco, Mil, pool; %GC: 28, 50, 72. Libraries: Reference DNA, TruSeq PCR-free 350; TruSeq PCR-free 550; FFPE DNA, TruSeq Nano; FFPE DNA, TruSeq PCR-free; KAPA libraries from sites a-b-c]
ABRF NGS Phase II Study
Organization and Leadership
Sequencing Quality Control Phase II (SEQC2) – An Introduction
Slides courtesy of Weida Tong, Ph.D.
Division of Bioinformatics and Biostatistics, NCTR/FDA
SEQC2 Overview
• Short reads vs long reads: Assess short reads alone, long reads alone, and their combination for genome assembly and subsequent variant calling in WGS.
• Detection power for rare mutations: Assess the detection power of ultra-deep sequencing (TGS) for subclonal mutations and its dependency on bioinformatics and coverage.
• Detection accuracy for difficult genes: Assess the accuracy for some difficult genes, which varies significantly due to the complexity of their genomic regions (e.g., GC region), with a specific focus on HLA genes.
• Application scope of MiSeq: Assess the utility of MiSeq for (1) detection of subclonal mutations, (2) difficult genes (e.g., HLA), and (3) difficult variations (e.g., indels).
• Variant calling (e.g., SNV, CNV, indels): Assess WGS accuracy and reproducibility for variant calling by investigating the joint effect of read alignment pipelines, variant calling methods, and coverage, as well as comparing the results from a personal genome versus the reference genome.
Study Design
Datasets:
• Approaches: WGS and TGS
• Platforms: HiSeq, PacBio, MiSeq, etc.
• Samples: trios, NB, cell lines, etc.
Parameters:
• Personal vs reference genome
• Bioinformatics
• Coverage
Three Trio Datasets
Trio Study | Short reads (coverage/platform) | Long reads (coverage/platform) | Notes
SEQC2: HapMap Trio (European) | 80x | TBD | Planned for both WGS and TGS; genotyping data and information from HapMap are available.
GIAB: Trio (Ashkenazim) | Illumina 300x; Complete Genomics; Ion Torrent; SOLiD (WGS) | 69x (son), 30x (parents); BioNano; Moleculo | This dataset is generated by the Genome in a Bottle (GIAB) consortium. We work closely with GIAB to obtain updated information on this trio, and the GIAB leaders also participate in SEQC2.
SEQC2: Chinese Trio and test of LCL-germline | 100x | 50x | Planned; the datasets will be provided by Dr. Leming Shi, who is part of the SEQC2 leadership team.
Candidate NIST Reference Materials
Genome | PGP ID | Coriell ID | NIST ID | NIST RM #
CEPH Mother/Daughter | N/A | GM12878 | HG001 | RM8398
AJ Son | huAA53E0 | GM24385 | HG002 | RM8391 (son)/RM8392 (trio)
AJ Father | hu6E4515 | GM24149 | HG003 | RM8392 (trio)
AJ Mother | hu8E87A9 | GM24143 | HG004 | RM8392 (trio)
Asian Son | hu91BD69 | GM24631 | HG005 | RM8393
Asian Father | huCA017E | GM24694 | N/A | N/A
Asian Mother | hu38168C | GM24695 | N/A | N/A
NIST Microbial Genomic DNA Reference Materials
Credit: Nate Olson
Analysis process for Microbial RMs
Credit: Nate Olson
GIAB Progress Update
January 2016
Dataset columns: AJ Son, AJ Parents, Chinese son, Chinese parents, NA12878
Illumina Paired-end: X X X X X
Illumina Long Mate pair: X X X X X
Illumina “moleculo”: X X X X X
Complete Genomics: X X X X X
Complete Genomics LFR: X X X
Ion exome: X X X X
BioNano: X X X X
10X: X X X
PacBio: X X X
SOLiD single end: X X X
Illumina exome: X X X X
Oxford Nanopore: X
[Figure: GIAB ftp site downloads and unique IPs by month, 2013-6 through 2015-12; x-axis: Month; y-axes: downloads and # IPs]
GIAB Analysis Group – New Data Sets
Leaders
• Francisco de la Vega, Stanford, TOMA Biosciences
• Chris Mason, Weill Cornell Medical Center
• Tina Graves, Washington University
• Valerie Schneider, NCBI
• and Justin and Marc
Strategic Documents
• Analysis Group Responsibilities:
– https://docs.google.com/document/d/10eA0DwB4iYTSFM_LPO9_2LyyN2xEqH49OXHhtNH1uzw/edit?usp=sharing
• Analysis Milestones:
– https://docs.google.com/spreadsheets/d/1Pj4nSzH742g40wJz2fA6f8kFtZYAToZpSZYVPiC5st4/edit?usp=sharing
• Analysis Methods:
– https://docs.google.com/spreadsheets/d/1Je2g85H7oK6kMXbBOoqQ1FMNrvGnFuUJTJn7deyYiS8/edit?usp=sharing
• Analysis Plan:
– https://drive.google.com/file/d/0B7Ao1qqJJDHQdnVEaVdqbWdEdkE/view?usp=sharing
• Collecting data and analyses on GIAB FTP Site
• Recruiting people to help with the work
Goal: Establish and distribute a set of authoritative benchmark variant calls of all types and sizes, as well as homozygous reference regions, on GIAB PGP trios
Types of analyses and status
• SNPs/indels
– NIST working on integration
– 10X/moleculo/PacBio for difficult-to-map regions
• Assembly
– 2 de novo assemblies
– Being used for SV calling
• Structural variants
– Candidate calls being generated by 15+ groups with >20 different algorithms and 6 datasets
– 3+ integration methods
– ~monthly calls
• Long-range Phasing
– 2 phased calls so far (CG LFR and 10X)
– Integration methods needed
• Methylation analyses
SNP/Indel Integration Method Update
• Implementing refined integration methods
– Developed so others can readily reproduce results
– Consistent results for all GIAB genomes
– Simpler process taking advantage of best practices for each technology
• Validating with released NA12878 RM data
– Preliminary comparisons show minor changes
• Application to PGP trios
– Plan to analyze AJ trio by Q2 2016
– Release of NIST RMs in Q2 2016
– Develop calls for GRCh38
Data Release: Real-time, Open, Public Release
Individual Datasets
• Uploaded to GIAB FTP site as they are collected
• Includes raw reads, aligned reads, and variant/reference calls
Integrated High-confidence Calls
• First develop SNP, indel, and homozygous reference calls
• Then develop SV and non-SV calls
• Released calls are versioned
• Preliminary callsets will be made available to be critiqued
GIAB AJ Trio Hybrid PacBio/BioNano Assembly
Hybrid (PacBio with BioNano) statistics
Input | Assembly | Notes | # of Scaffolds | N50 | Max | Total
HG002 | Falcon | | 248 | 22.7Mb | 92.8Mb | 2.38Gb
Trio | Falcon | | 210 | 29.3Mb | 87.6Mb | 2.32Gb
Trio | celera (child) + falcon (trio) | Two Step | 187 | 34.3Mb | 98.0Mb | 2.6Gb
Credits: Ali Bashir, Jason Chin, Alex Hastie
Pendleton et al, Nature Methods, 2015
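The N50 column in the assembly table is a standard contiguity statistic: the scaffold length at which the cumulative sum, taken from longest to shortest, first reaches half of the total assembly length. A minimal sketch of that calculation follows. The function name `n50` and the toy lengths are illustrative assumptions, not part of the GIAB analysis.

```python
def n50(lengths):
    """Return the N50 of a collection of scaffold lengths.

    N50 is the length L such that scaffolds of length >= L together
    cover at least half of the total assembled bases.
    """
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    raise ValueError("no scaffold lengths given")


# Hypothetical toy lengths, NOT taken from the GIAB assemblies.
toy_scaffolds = [80, 70, 50, 40, 30, 20, 10]
print(n50(toy_scaffolds))
```

A larger N50 at a similar total size, as in the trio rows of the table, indicates that more of the assembly sits in long scaffolds.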
svclassify
Proposed approach to form high-confidence SV (and non-SV) calls
Generate Candidate Calls
Compare/evaluate calls using Parliament/MetaSV/svclassify/others?;
manual inspection
Integrate new and revised calls; manual inspection
Combine integrated calls; manual inspection; targeted experimental validation?
August 30, 2015
January 2016
Plan in January 2016
Feb 2016 and beyond
Deletion overlap summary for son
By # of callsets
# of callsets | # of calls
1 | 3780
2 | 1391
3 | 859
4 | 574
5+ | 344
By Technology
Technology | # of calls
Illumina | 3277
PacBio | 5177
BioNano | 812
CG | 1758
Illumina/CG+PacBio | 2318
Illumina/CG+BioNano | 518
PacBio+BioNano | 467
2+ technologies | 2661
Converted all to bed; combined with bedtools multiinter; calls within 50 bp were merged
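The counting behind a "by # of callsets" summary can be sketched in plain Python. This is a simplified stand-in, not the bedtools multiinter workflow itself: it pools intervals from all callsets, merges intervals on the same chromosome whose gap is at most 50 bp, and counts how many distinct callsets support each merged region. The function names, the `slop` parameter, and the toy intervals are assumptions made for illustration.

```python
from collections import Counter


def merge_calls(callsets, slop=50):
    """Pool calls from several callsets and merge nearby intervals.

    callsets: dict mapping callset name -> list of (chrom, start, end).
    Intervals on the same chromosome separated by <= slop bp are merged.
    Returns a list of (chrom, start, end, supporting_callset_names).
    """
    pooled = sorted(
        (chrom, start, end, name)
        for name, calls in callsets.items()
        for chrom, start, end in calls
    )
    merged = []
    for chrom, start, end, name in pooled:
        last = merged[-1] if merged else None
        if last and last[0] == chrom and start <= last[2] + slop:
            last[2] = max(last[2], end)   # extend the current region
            last[3].add(name)             # record the supporting callset
        else:
            merged.append([chrom, start, end, {name}])
    return [tuple(region) for region in merged]


def support_histogram(merged):
    """Count merged regions by number of supporting callsets."""
    return Counter(len(region[3]) for region in merged)


# Hypothetical toy deletions, NOT GIAB data.
toy = {
    "illumina": [("chr1", 100, 200), ("chr1", 1000, 1100)],
    "pacbio": [("chr1", 230, 300), ("chr2", 50, 80)],
    "bionano": [("chr1", 120, 180)],
}
regions = merge_calls(toy)
print(support_histogram(regions))
```

Note that single-linkage merging like this can chain many small calls into one long region, which is one reason manual inspection of merged calls remains useful.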
Preliminary Confirmation of SVs
Integration results from AJ son
• Parliament
– Candidates from Illumina
– Confirmed by PacBio and/or Illumina
– ~50% in both technologies
– ~4.5k deletions, 1k insertions
– 85% of genotypes consistent within trio
• MetaSV
– Multiple types of evidence from Illumina
Parliament: BMC Genomics, 2015, 16:286 (performed by Andrew Carroll, DNAnexus)
MetaSV: Bioinformatics, 2015, 31:2741 (performed by Marghoob Mohiyuddin, Bina/Roche)
[Venn diagram: MetaSV vs. Parliament calls, 50% reciprocal overlap]
MetaSV total: 2809; overlapping: 569 (20%); MetaSV only: 2240 (80%)
Parliament total: 5467; overlapping: 977 (18%); Parliament only: 4490 (82%)
Some overlap within Parliament calls
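The "50% reciprocal overlap" criterion used to compare the two callsets can be written down directly: two calls match only if their shared span covers at least half of each call. The sketch below assumes half-open (chrom, start, end) intervals. The function name and example coordinates are illustrative and are not taken from Parliament or MetaSV.

```python
def reciprocal_overlap(a, b, min_frac=0.5):
    """Return True if intervals a and b overlap reciprocally.

    a, b: (chrom, start, end) with half-open coordinates.
    The shared span must be at least min_frac of the length of a
    AND at least min_frac of the length of b.
    """
    if a[0] != b[0]:
        return False
    shared = min(a[2], b[2]) - max(a[1], b[1])
    if shared <= 0:
        return False
    len_a = a[2] - a[1]
    len_b = b[2] - b[1]
    return shared >= min_frac * len_a and shared >= min_frac * len_b


# Hypothetical calls for illustration only.
small = ("chr1", 100, 200)
shifted = ("chr1", 140, 240)   # shares 60 bp with `small`
large = ("chr1", 150, 450)     # shares 50 bp, but is 300 bp long

print(reciprocal_overlap(small, shifted))
print(reciprocal_overlap(small, large))
```

Requiring the overlap in both directions keeps a very large call from matching every small call it happens to contain.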
GeT-RM Browser from NCBI and CDC
• http://www.ncbi.nlm.nih.gov/variation/tools/get-rm/
• Allows visualization of data underlying each call
Uses of GIAB NA12878
Oncology – Molecular and Cellular Tumor Markers
“Next Generation” Sequencing (NGS) guidelines for somatic genetic variant detection
www.bioplanet.com/gcat
Global Alliance for Genomics and Health Benchmarking Task Team
Progress
• Initial version of standardized definitions for performance metrics like TP, FP, and FN
• Continued development of sophisticated benchmarking tools
– vcfeval – Len Trigg
– hap.py – Peter Krusche
– vgraph – Kevin Jacobs
• Standardized intermediate and final file formats
• Standardized bed files with difficult genome contexts for stratification
• Simulating reads with non-SNP ClinVar variants to demonstrate importance of these tools
• github.com/ga4gh/benchmarking-tools
Next steps
• Further analysis to demonstrate importance of sophisticated tools
• Write manuscript about the team’s tools
• Integrate vcfeval and hap.py to take advantage of strengths of each
• Recommend “Best Practices” for benchmarking
• Explore venues for making the team’s benchmarking process easier to use
Proposed Performance Metrics Definitions
• Define TP/FP/FN/TN in 4 ways depending on required stringency of match:
– Loose match: TP if within x-bp of a true variant
– Allele match: TP if ALT allele matches
– Genotype match: TP if genotype and ALT allele match
– Phasing match: TP if genotype, ALT allele, and phasing with nearby variants all match
• True negatives are difficult to define because an infinite number of potential alleles exist
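The four stringency levels can be illustrated with a small comparison function. This is a simplified sketch and not how vcfeval or hap.py work: it assumes both calls are already normalized to the same representation, compares one query call against one truth call, and treats "phasing match" as agreement on allele order for two calls assumed to share a phase set. The `Call` class, the `matches` function, and the `window` parameter (standing in for the slide's "x-bp") are hypothetical names.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Call:
    chrom: str
    pos: int
    alt: str
    genotype: Tuple[int, int]   # allele indices, e.g. (0, 1)
    phased: bool = False        # True if allele order is meaningful


def matches(query, truth, level, window=10):
    """Return True if query counts as a TP against truth at this level."""
    if query.chrom != truth.chrom:
        return False
    if level == "loose":
        return abs(query.pos - truth.pos) <= window
    same_allele = query.pos == truth.pos and query.alt == truth.alt
    if level == "allele":
        return same_allele
    same_genotype = same_allele and sorted(query.genotype) == sorted(truth.genotype)
    if level == "genotype":
        return same_genotype
    if level == "phasing":
        # Simplification: both calls are assumed to be in the same phase set.
        return (same_genotype and query.phased and truth.phased
                and query.genotype == truth.genotype)
    raise ValueError(f"unknown level: {level}")


# Hypothetical calls for illustration only.
truth = Call("chr21", 1000, "T", (0, 1), phased=True)
nearby = Call("chr21", 1004, "G", (0, 1))
hom = Call("chr21", 1000, "T", (1, 1))
flipped = Call("chr21", 1000, "T", (1, 0), phased=True)

for level in ("loose", "allele", "genotype", "phasing"):
    print(level, [matches(q, truth, level) for q in (nearby, hom, flipped)])
```

Each level is strictly harder to satisfy than the one before it, so a call that is a TP under genotype match is also a TP under allele and loose match.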
Global Alliance for Genomics and Health Benchmarking Task Team
Credit: Rebecca Truty, Complete Genomics
How should we interpret this complex variant on chr21?
GA4GH Benchmarking Tool Architecture
[Diagram: a Truth VCF and a Query VCF feed the Comparison Engine (vcfeval / vgraph / xcmp / bcftools / ...), which produces VCF-I, a two-column VCF with TP/FP/FN annotations. Quantification (e.g. quantify / hap.py) combines VCF-I with Stratification BED files and Confident Call Regions to produce VCF-R, a two-column VCF with TP/FP/FN/UNK annotations, and Counts.]
Credit: Peter Krusche
https://github.com/ga4gh/benchmarking-tools
Approaches to Benchmarking Variant Calling
• Well-characterized whole genome Reference Materials
• Many samples characterized in clinically relevant regions
• Synthetic DNA spike-ins
• Cell lines with engineered mutations
• Simulated reads
• Modified real reads
• Modified reference genomes
• Confirming results found in real samples over time
Challenges in Benchmarking Small Variant Calling
• It is difficult to do robust benchmarking of tests designed to detect many analytes (e.g., many variants)
• Easiest to benchmark only within high-confidence bed file, but…
• Benchmark calls/regions tend to be biased towards easier variants and regions
– Some clinical tests are enriched for difficult sites
• Challenges with benchmarking complex variants near boundaries of high-confidence regions
• Always manually inspect a subset of FPs/FNs
• Stratification by variant type and region is important
• Always calculate confidence intervals on performance metrics
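One way to act on the last two points is to compute recall (sensitivity) separately per stratum and attach a binomial confidence interval to each estimate. The sketch below uses the Wilson score interval, which is one common choice and is not prescribed by the slides. The stratum names and TP/FN counts are hypothetical.

```python
import math


def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 ~ 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    denom = 1 + z * z / trials
    centre = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)


def recall_with_ci(tp, fn):
    """Return (recall, lower, upper) from true-positive and false-negative counts."""
    lower, upper = wilson_interval(tp, tp + fn)
    return tp / (tp + fn), lower, upper


# Hypothetical per-stratum counts, NOT real benchmarking results.
strata = {
    "SNPs, easy regions": (9950, 50),
    "indels, homopolymers": (95, 5),
}
for name, (tp, fn) in strata.items():
    recall, lower, upper = recall_with_ci(tp, fn)
    print(f"{name}: recall={recall:.3f} (95% CI {lower:.3f}-{upper:.3f})")
```

With only 100 sites in a stratum, the interval is several percentage points wide, which shows why a single genome-wide number can hide poor or poorly measured performance in difficult contexts.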
Particular Challenges in Benchmarking SV Calling
• How to establish benchmark calls for difficult regions?
• How to establish non-SV regions to assess FP rates?
• Multiple dimensions of accuracy:
– Predicted SV existence
– Predicted SV type
– Predicted size
– Predicted breakpoints
– Predicted exact sequence
Acknowledgments
• FDA – Elizabeth Mansfield
• Many members of Genome in a Bottle
– New members welcome!
– Sign up on website for email newsletters
GIAB Steering Committee
Marc Salit, Justin Zook, David Mittelman, Andrew Grupe, Michael Eberle, Steve Sherry, Deanna Church, Francisco De La Vega, Christian Olsen, Monica Basehore, Lisa Kalman, Christopher Mason, Elizabeth Mansfield, Liz Kerrigan, Leming Shi, Melvin Limson, Alexander Wait Zaranek, Nils Homer, Fiona Hyland, Steve Lincoln, Don Baldwin, Robyn Temple-Smolkin, Chunlin Xiao, Kara Norman, Luke Hickey
For More Information
www.genomeinabottle.org - sign up for general GIAB and Analysis Team google group emails
github.com/genome-in-a-bottle – Guide to GIAB data & ftp
www.slideshare.net/genomeinabottle
www.ncbi.nlm.nih.gov/variation/tools/get-rm/ - Get-RM Browser
Data: http://biorxiv.org/content/early/2015/09/15/026468
Global Alliance Benchmarking Team
– https://github.com/ga4gh/benchmarking-tools
Twice yearly public workshops
– Winter at Stanford University, California, USA
– Summer at NIST, Maryland, USA
Justin Zook: [email protected]
Marc Salit: [email protected]
GIAB Roadmap: Where are we, Where are we going?
• Reference Materials
– Germline
– Somatic
• Informatics
– Analysis of GIAB data
– Benchmarking
• Documentary Standards/Publications
– Documentation of methods
– Supporting Use