GIAB GRC Workshop slides
Genome in a Bottle
Justin Zook and Marc Salit
NIST Genome-Scale Measurements Group
JIMB
October 18, 2016
Genome in a Bottle Consortium: Whole Genome Variant Calling
• gDNA reference materials to evaluate performance
– materials certified for their variants against a reference sequence, with confidence estimates
• Established consortium to develop reference materials, data, methods, performance metrics
• Characterized Pilot Genome NA12878
• Ashkenazim Trio, Asian son from PGP released in September!
[Figure label: generic measurement process]
In September, we released 4 new GIAB RM Genomes.
• PGP Human Genomes
– AJ son
– AJ trio
– Asian son
• Parents also characterized
We’re also releasing a Microbial Genome RM
This Reference Material (RM) is intended for validation, optimization, process evaluation, and performance assessment of whole genome sequencing.
• Salmonella Typhimurium
• Pseudomonas aeruginosa
• Staphylococcus aureus
• Clostridium sporogenes
Bringing Principles of Metrology to the Genome
• Reference materials
– DNA in a tube you can buy from NIST
– NA12878 pilot sample, now 2 PGP-sourced trios
• Extensive state-of-the-art characterization
– as good as we can get for small variants
– arbitrated “gold standard” calls for SNPs, small indels
• “Upgradable” as technology develops
• Analysis of all samples ongoing as technology develops
• PGP genomes suitable for commercial derived products
• Developing benchmarking tools and software
– with GA4GH
• Samples being used to develop and demonstrate new technology
NIST Reference Materials
Genome | PGP ID | Coriell ID | NIST ID | NIST RM #
CEPH Mother/Daughter | N/A | GM12878 | HG001 | RM8398
AJ Son | huAA53E0 | GM24385 | HG002 | RM8391 (son)/RM8392 (trio)
AJ Father | hu6E4515 | GM24149 | HG003 | RM8392 (trio)
AJ Mother | hu8E87A9 | GM24143 | HG004 | RM8392 (trio)
Asian Son | hu91BD69 | GM24631 | HG005 | RM8393
Asian Father | huCA017E | GM24694 | N/A | N/A
Asian Mother | hu38168C | GM24695 | N/A | N/A
Data for GIAB PGP Trios
Dataset | Characteristics | Coverage | Availability | Most useful for…
Illumina Paired-end WGS | 150x150bp; 250x250bp | ~300x/individual; ~50x/individual | on SRA/FTP | SNPs/indels/some SVs
Complete Genomics | | 100x/individual | on SRA/FTP | SNPs/indels/some SVs
SOLiD 5500W WGS | 50bp single end | 70x/son | on FTP | SNPs
Illumina Paired-end WES | 100x100bp | ~300x/individual | on SRA/FTP | SNPs/indels in exome
Ion Proton Exome | | 1000x/individual | on SRA/FTP | SNPs/indels in exome
Illumina Mate pair | ~6000 bp insert | ~30x/individual | on FTP | SVs
Illumina “moleculo” | Custom library | ~30x by long fragments | on FTP | SVs/phasing/assembly
Complete Genomics LFR | | 100x/individual | on SRA/FTP | SNPs/indels/phasing
10X | Linked reads | 30-45x/individual | on FTP | SNPs/SVs/phasing/assembly
PacBio | ~10kb reads | ~70x on AJ son, ~30x on each AJ parent | on SRA/FTP | SVs/phasing/assembly/STRs
Oxford Nanopore | 5.8kb 2D reads | 0.05x on AJ son | on FTP | SVs/assembly
Nabsys 2.0 | ~100kbp N50 nanopore maps | 70x on AJ son | | SVs/assembly
BioNano Genomics | 200-250kbp optical map reads | ~100x/AJ individual; 57x on Asian son | on FTP | SVs/assembly
Datasets available for each genome (columns: AJ Son, AJ Parents, Chinese son, Chinese parents, NA12878; the column position of each X is not preserved for rows with fewer than five marks)
Illumina Paired-end: X X X X X
Illumina Long Mate pair: X X X X X
Illumina “moleculo”: X X X X X
Complete Genomics: X X X X X
Complete Genomics LFR: X X X
Ion exome: X X X X
BioNano: X X X X
10X: X X X
PacBio: X X X
SOLiD single end: X X X
Illumina exome: X X X X
Nabsys: X X
Oxford Nanopore: X
Paper describing data…
• 51 authors
• 14 institutions
• 12 datasets
• 7 genomes
• Data described in ISA-tab
Integration Methods to Establish Benchmark Variant Calls
Zook et al., Nature Biotechnology, 2014.
NEW: Reproducible integration pipeline with new calls for NA12878 and PGP Trios!
New Integration Methods to Establish Benchmark Variant Calls for GRCh38
• Illumina and 10X
– Map reads to GRCh38 with decoy but no ALT loci
– Call variants vs. GRCh38
• Complete Genomics, SOLiD, Ion
– Convert vcf and callable bed from GRCh37 to GRCh38
– Use GenomeWarp by Cory McLean, Verily
  • Accounts for changed bases
  • https://github.com/verilylifesciences/genomewarp
• ~100k fewer calls than GRCh37
• Comparison with PG
– ~300 differences not near filtered sites in either callset (3x GRCh37)
– Appears to result from fewer input callsets into PG
• Future work
– How can we use ALT loci?
– How to represent variation with respect to ALT loci?
– How to benchmark variants called on ALT loci?
Evolution of high-confidence calls
Calls | HC Regions | HC Calls | HC indels | Concordant with PG | NIST-only in beds | PG-only in beds | PG-only
v2.19 | 2.22 Gb | 3153247 | 352937 | 3030703 | 87 | 404 | 1018795
v3.1 | 2.55 Gb | 3453085 | - | 3330275 | 71 | 82 | 719223
v3.2.2 | 2.53 Gb | 3512990 | 335594 | 3391783 | 57 | 52 | 657715
v3.3 | 2.57 Gb | 3566076 | 358753 | 3441361 | 40 | 60 | 608137
v3.3.1 | 2.58 Gb | 3746191 | 505169 | 3550914 | 50 | 67 | 499023
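The columns of this table are related by simple arithmetic. The following is a minimal sketch, assuming that the high-confidence calls not found in PG are simply HC calls minus calls concordant with PG (the slide does not state this explicitly). The variable and function names are illustrative only.

```python
# Rows copied from the table above:
# (version, HC calls, concordant with PG, PG-only)
ROWS = [
    ("v2.19", 3153247, 3030703, 1018795),
    ("v3.1", 3453085, 3330275, 719223),
    ("v3.2.2", 3512990, 3391783, 657715),
    ("v3.3", 3566076, 3441361, 608137),
    ("v3.3.1", 3746191, 3550914, 499023),
]


def summarize(rows):
    """Return derived quantities per version.

    Assumption (not stated on the slide): HC calls absent from PG are
    hc_calls - concordant.
    """
    summary = {}
    for version, hc_calls, concordant, pg_only in rows:
        summary[version] = {
            "hc_not_in_pg": hc_calls - concordant,
            "fraction_concordant": concordant / hc_calls,
            "pg_only": pg_only,
        }
    return summary


summary = summarize(ROWS)
for version, stats in summary.items():
    print(version, stats["hc_not_in_pg"], round(stats["fraction_concordant"], 4))
```

For v3.3.1 the difference is 195277, which agrees with the "calls not in PG" figure quoted for that version in the v3.3.1 vs. v2.19 comparison.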
Newest calls (v3.3.1) vs. 2015 calls (v2.19)
V3.3.1
• 2.584 Gb high-confidence
• 3550914 match PG
• 499023 PG calls outside high conf
• 195277 calls not in PG
• After excluding low confidence regions and regions around filtered PG calls:
– 50 calls not in PG
– 67 extra PG calls
V2.19
• 2.216 Gb high-confidence
• 3030717 match PG
• 1018795 PG calls outside high conf
• 122359 calls not in PG
• After excluding low confidence regions and regions around filtered PG calls:
– 87 calls not in PG
– 404 extra PG calls
Newest calls (v3.3.1) vs. 2015 calls (v2.19)
Example vcf (verily), stratified
V3.3.1
• 16% of SNPs not assessed
– 23% of SNPs in RefSeq coding
– 52% of SNPs in “bad promoters”
• 68% of indels not assessed
– 2.0% error rate
• 17% FP rate in regions homologous to decoy
V2.19
• 27% of SNPs not assessed
– 36% of SNPs in RefSeq coding
– 82% of SNPs in “bad promoters”
• 78% of indels not assessed
– 1.2% error rate
• 0.2% FP rate in regions homologous to decoy
Principles of Integration Process
• Form sensitive variant calls from each dataset
• Define “callable regions” for each callset
• Filter calls from each method with annotations unlike concordant calls
• Compare high-confidence calls to other callsets and manually inspect subset of differences
– vs. pedigree-based calls
– vs. common pipelines
– Trio analysis
• When benchmarking a new callset against ours, most putative FPs/FNs should actually be FPs/FNs
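The principles above can be illustrated with a toy example. This is a simplified sketch, not the GIAB integration pipeline: it assumes a variant is kept only when at least two callsets report it and every callset that considers the site callable agrees. The function name, the `min_support` threshold, and the data layout are all hypothetical.

```python
def integrate(callsets, callable_regions, min_support=2):
    """Toy integration of variant calls from several callsets.

    callsets: {name: set of (chrom, pos, ref, alt)}, pos is 1-based
    callable_regions: {name: list of (chrom, start, end)}, 0-based half-open
    """
    def is_callable(name, chrom, pos):
        # A 1-based position pos lies in a 0-based half-open interval
        # (start, end) when start < pos <= end.
        return any(c == chrom and s < pos <= e
                   for c, s, e in callable_regions[name])

    candidates = set().union(*callsets.values())
    high_conf = set()
    for variant in sorted(candidates):
        chrom, pos, _ref, _alt = variant
        callable_in = [n for n in callsets if is_callable(n, chrom, pos)]
        support = [n for n in callable_in if variant in callsets[n]]
        # Every callset that can see the site must report the variant.
        if len(support) >= min_support and len(support) == len(callable_in):
            high_conf.add(variant)
    return high_conf


callsets = {
    "illumina": {("chr1", 100, "A", "G"), ("chr1", 150, "G", "A"),
                 ("chr1", 250, "C", "T")},
    "cg": {("chr1", 100, "A", "G")},
    "ion": {("chr1", 100, "A", "G"), ("chr1", 250, "C", "T")},
}
callable_regions = {
    "illumina": [("chr1", 0, 1000)],
    "cg": [("chr1", 0, 200)],
    "ion": [("chr1", 0, 1000)],
}
high_conf = integrate(callsets, callable_regions)
print(sorted(high_conf))
```

The call at position 150 is dropped because two callsets that consider the site callable do not report it. The call at position 250 is kept because the one callset lacking it does not consider the site callable.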
Criteria for including new callsets
• Form sensitive variant calls from each dataset
• Define “callable regions” for each callset
– Good coverage and MapQ
– Use knowledge about technology and manual inspection to exclude repetitive regions difficult for each dataset
• For new callsets, ensure most FNs in callable regions relative to current high-confidence calls are questionable in the current calls
• Filter calls from each method with annotations unlike concordant calls
– Annotations for which outliers are expected to indicate bias should be selected for each callset
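One way to picture "annotations unlike concordant calls" is a percentile cut-off derived from the concordant calls themselves. The sketch below is an illustration under assumptions, not the method used for the GIAB callsets: the annotation name `DP`, the 5% tails, and both function names are hypothetical choices.

```python
import statistics


def outlier_bounds(concordant_values, tail_pct=5):
    """Bounds outside which an annotation value is treated as atypical.

    Uses the tail_pct and (100 - tail_pct) percentiles of the annotation
    among concordant calls. The 5% default is an arbitrary illustration.
    """
    cuts = statistics.quantiles(concordant_values, n=100, method="inclusive")
    # cuts[i - 1] is the i-th percentile.
    return cuts[tail_pct - 1], cuts[100 - tail_pct - 1]


def filter_calls(calls, concordant_values, annotation="DP"):
    """Split calls into those within the bounds and those outside them."""
    low, high = outlier_bounds(concordant_values)
    kept, filtered = [], []
    for call in calls:
        if low <= call[annotation] <= high:
            kept.append(call)
        else:
            filtered.append(call)
    return kept, filtered


# Hypothetical annotation values observed at concordant calls.
concordant_depths = list(range(1, 102))
calls = [
    {"pos": 100, "DP": 50},
    {"pos": 200, "DP": 3},
    {"pos": 300, "DP": 400},
]
kept, filtered = filter_calls(calls, concordant_depths)
print([c["pos"] for c in kept], [c["pos"] for c in filtered])
```

As the slide notes, which annotations to use should be chosen per callset, so a single annotation and a fixed tail fraction are simplifications.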
Global Alliance for Genomics and Health Benchmarking Task Team
• Developed standardized definitions for performance metrics like TP, FP, and FN.
• Developing sophisticated benchmarking tools
– Integrated into a single framework with standardized inputs and outputs
• Standardized bed files with difficult genome contexts for stratification
https://github.com/ga4gh/benchmarking-tools
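As a reminder of how TP, FP, and FN counts turn into performance metrics, here is a minimal sketch using the usual textbook formulas for recall, precision, and F1. It uses a single TP count and hypothetical per-stratum counts; it is not the GA4GH reference implementation, and the stratum names are made up for illustration.

```python
def performance_metrics(tp, fp, fn):
    """Recall, precision, and F1 from counts of TP, FP, and FN."""
    nan = float("nan")
    recall = tp / (tp + fn) if tp + fn else nan
    precision = tp / (tp + fp) if tp + fp else nan
    if precision + recall > 0:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = nan
    return {"recall": recall, "precision": precision, "f1": f1}


# Hypothetical counts per stratification region.
counts_by_stratum = {
    "all": (90, 10, 30),
    "tandem_repeats": (5, 3, 15),
}
metrics_by_stratum = {
    name: performance_metrics(*counts)
    for name, counts in counts_by_stratum.items()
}
for name, metrics in metrics_by_stratum.items():
    print(name, {k: round(v, 3) for k, v in metrics.items()})
```

Reporting the same metrics per stratum makes it visible when overall numbers hide poor performance in a difficult genome context.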
Variant types can change when decomposing or recomposing variants:
Complex variant:
chr1 201586350 CTCTCTCTCT CA
DEL + SNP:
chr1 201586350 CTCTCTCTCT C
chr1 201586359 T A
Credit: Peter Krusche, Illumina
GA4GH Benchmarking Team
Workflow output
Benchmarking example: NA12878 / GiaB / 50X / PCR-Free / Hiseq2000
https://illumina.box.com/s/vjget1dumwmy0re19usetli2teucjel1
Credit: Peter Krusche, Illumina
GA4GH Benchmarking Team
GA4GH benchmarking on Github
• In-progress benchmarking standards document: doc/standards
• Description of intermediate formats: doc/ref-impl
• Truthset descriptions and download links: resources/high-confidence-sets
• Stratification bed files and descriptions: resources/stratification-bed-files
• Python code for HTML reporting and running benchmarks: reporting/basic
Please contribute / join the discussion!
https://github.com/ga4gh/benchmarking-tools
Credit: Peter Krusche, Illumina
GA4GH Benchmarking Team
Benchmarking Tools
Standardized comparison, counting, and stratification with Hap.py + vcfeval
https://precision.fda.gov/ https://github.com/ga4gh/benchmarking-tools
FN rates high in some tandem repeats
[Figure: “FN rate vs. average”; panels for 2bp, 3bp, and 4bp unit repeats, in repeat lengths of 11 to 50 bp and 51 to 200 bp; axis ticks 0.3x, 1x, 3x, 10x, 30x]
Approaches to Benchmarking Variant Calling
• Well-characterized whole genome Reference Materials
• Many samples characterized in clinically relevant regions
• Synthetic DNA spike-ins
• Cell lines with engineered mutations
• Simulated reads
• Modified real reads
• Modified reference genomes
• Confirming results found in real samples over time
Challenges in Benchmarking Variant Calling
• It is difficult to do robust benchmarking of tests designed to detect many analytes (e.g., many variants)
• Easiest to benchmark only within high-confidence bed file, but…
• Benchmark calls/regions tend to be biased towards easier variants and regions
– Some clinical tests are enriched for difficult sites
• Always manually inspect a subset of FPs/FNs
• Stratification by variant type and region is important
• Always calculate confidence intervals on performance metrics
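For the last point, one common choice of confidence interval for a proportion such as sensitivity is the Wilson score interval. The sketch below is one option among several and is not prescribed by the slides; the function name and the default z = 1.96 (approximately 95% coverage) are assumptions.

```python
import math


def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a binomial proportion.

    z=1.96 corresponds to roughly 95% confidence.
    """
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    denom = 1 + z ** 2 / trials
    center = (p + z ** 2 / (2 * trials)) / denom
    half_width = (
        z * math.sqrt(p * (1 - p) / trials + z ** 2 / (4 * trials ** 2)) / denom
    )
    return center - half_width, center + half_width


# Hypothetical example: 95 of 100 benchmark variants detected.
low, high = wilson_interval(95, 100)
print(f"sensitivity 0.95, interval {low:.3f} to {high:.3f}")

# Hypothetical example: no errors seen in 100 sites still leaves
# an upper bound on the error rate well above zero.
err_low, err_high = wilson_interval(0, 100)
print(f"error rate 0.00, interval {err_low:.3f} to {err_high:.3f}")
```

The second case shows why intervals matter for small strata: observing zero errors in 100 sites is compatible with a true error rate of a few percent.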
How can we extend this approach to structural variants?
Similarities to small variants
• Collect callsets from multiple technologies
• Compare callsets to find calls supported by multiple technologies
Differences from small variants
• Callsets have limited sensitivity
• Variants are often imprecisely characterized
– breakpoints, size, type, etc.
• Representation of variants is poorly standardized, especially when complex
• Comparison tools in infancy
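Because breakpoints are imprecise, finding deletions "supported by multiple technologies" needs a tolerant comparison. A minimal sketch, assuming a 50% reciprocal-overlap rule (a common heuristic, not necessarily the one used for the GIAB draft calls); the function names, thresholds, and example coordinates are hypothetical.

```python
def reciprocal_overlap(a, b):
    """Reciprocal overlap of two intervals given as (chrom, start, end)."""
    if a[0] != b[0]:
        return 0.0
    overlap = min(a[2], b[2]) - max(a[1], b[1])
    if overlap <= 0:
        return 0.0
    return min(overlap / (a[2] - a[1]), overlap / (b[2] - b[1]))


def supported_calls(callsets, min_overlap=0.5, min_technologies=2):
    """Return (call, technologies) for calls matched in enough technologies.

    callsets: {technology: [(chrom, start, end), ...]}
    Each technology's own version of a matched call is reported, so one
    underlying deletion can appear several times in the output.
    """
    supported = []
    for tech, calls in callsets.items():
        for call in calls:
            techs = {tech}
            for other, other_calls in callsets.items():
                if other != tech and any(
                    reciprocal_overlap(call, o) >= min_overlap
                    for o in other_calls
                ):
                    techs.add(other)
            if len(techs) >= min_technologies:
                supported.append((call, sorted(techs)))
    return supported


callsets = {
    "pacbio": [("chr1", 1000, 2000), ("chr2", 500, 900)],
    "bionano": [("chr1", 1100, 2100)],
    "illumina": [("chr1", 990, 1990)],
}
supported = supported_calls(callsets)
for call, techs in supported:
    print(call, techs)
```

Interval overlap ignores variant type and sequence, which is one reason the slides point toward sequence-based comparison for sequence-resolved calls.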
Preliminary process for integrated deletions
ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/data/AshkenazimTrio/analysis/NIST_DraftIntegratedDeletionsgt19bp_v0.1.8
Size range | <50bp | 50-100bp | 100-1000bp | 1kb-3kb | >3kbp
Pre-filtered calls | 2627 | 1600 | 2306 | 385 | 389
Post-filtered calls | 2548 | 1448 | 1996 | 297 | 262
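The fraction of calls surviving the filter in each size range follows directly from the two rows of counts. A short sketch of that arithmetic, with illustrative variable names:

```python
# Counts copied from the table above.
SIZE_BINS = ["<50bp", "50-100bp", "100-1000bp", "1kb-3kb", ">3kbp"]
PRE_FILTERED = [2627, 1600, 2306, 385, 389]
POST_FILTERED = [2548, 1448, 1996, 297, 262]

# Fraction of pre-filtered calls that remain after filtering, per size range.
retained = {
    size_bin: post / pre
    for size_bin, pre, post in zip(SIZE_BINS, PRE_FILTERED, POST_FILTERED)
}
for size_bin, fraction in retained.items():
    print(f"{size_bin}: {fraction:.1%} retained")
```

In these counts the retained fraction falls steadily as deletion size increases.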
Proposed improved integration process
[Flowchart with four stages]
• SV Discovery
– “Sequence-resolved” calls
– Imprecise SV calls
• SV Comparison
– Sequence-based comparison
– Comparison of all candidate calls (SURVIVOR/svcompare)
• SV Corroboration
– SV corroboration methods (e.g., parliament, svviz, nabsys, bionano)
• Form SV benchmark calls
– Heuristics to form tiers of benchmark SVs
– Machine learning to form benchmark SVs
• SV refinement? (e.g., parliament?, others?)
Sequence-resolved candidates
Currently sequence-resolved output
• MSPacMon
• Spiral (now only small have sequence)
• Fermikit (now only small have sequence)
• Cortex
• CG (small)
• GATK (small)
• Freebayes (small)
• Pindel
• manta
Potentially sequence-resolved output
• Newly submitted
– PBRefine
– Some MetaSV
– Assemblytics
– 10X deletions
• Possible
– Parliament?
– PBHoney
– Smrt-sv.dip
– Breakseq?
Draft de novo assemblies for AJ Son
Data | Method | Contig N50 | Scaffold N50 | Number Scaffolds | Total Size
PacBio | Falcon | 5.3 Mb | 5.3 Mb | 13231 | 3.04 Gb
PacBio | PBcR | 4.5 Mb | 4.5 Mb | 12523 | 2.99 Gb
PacBio + BioNano | Falcon + BioNano | 4.1 Mb | 22.7 Mb | 478 | 2.38 Gb
PacBio + Dovetail | Falcon + HiRise | 5.3 Mb | 12.9 Mb | 12459 | 3.04 Gb
PacBio + Dovetail | PBcR + HiRise | 4.1 Mb | 20.6 Mb | 10491 | 2.99 Gb
Illumina | DISCOVAR | 81 kb | 149 kb | 1.06M | 3.13 Gb
Illumina + Dovetail | DISCOVAR + HiRise | 85 kb | 12.9 Mb | 1.03M | 3.15 Gb
10X | Supernova | 106 kb | 15.2 Mb | 1360 | 2.73 Gb
Credits for assemblies:
• Ali Bashir, Mt. Sinai
• Jason Chin, PacBio
• Alex Hastie, BioNano
• Serge Koren, NHGRI
• Adam Phillippy, NHGRI
• Kareina Dill, Dovetail
• Noushin Ghaffari, TAMU
• 10X Genomics
Assembly-based SV calls:
• MSPAC
• Assemblytics
• PBRefine
IMPORTANT NOTE: These are draft assemblies and statistics should not be used to compare quality of assembly methods.
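For readers unfamiliar with the N50 columns: N50 is the length L such that contigs (or scaffolds) of length at least L together cover at least half of the total assembly size. A minimal sketch of that standard definition, with a hypothetical function name and toy lengths:

```python
def n50(lengths):
    """Length L such that sequences of length >= L cover at least half
    of the summed length of all sequences."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("lengths must be non-empty")


# Toy contig lengths, not taken from any of the assemblies above.
contigs = [2, 3, 4, 5, 6, 7, 8, 9, 10]
print(n50(contigs))
```

Joining contigs into scaffolds raises the scaffold N50 without changing the contig N50, which is consistent with the scaffolded rows in the table.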
New Samples
Additional ancestries
• Shorter term
– Use existing PGP individual samples
– Use existing integration pipeline
• Data-based selection
– E.g., PCA of existing samples
• 3 to 8 new samples
• Longer term
– Recruit large family
– Recruit trios from other ancestry groups
Cancer samples
• Longer term
• Make PGP-consented tumor and normal cell lines from same individual
• Select tumor with diversity of mutation types
Acknowledgements
• NIST
– Marc Salit
– Jenny McDaniel
– Lindsay Vang
– David Catoe
• Genome in a Bottle Consortium
• GA4GH Benchmarking Team
• FDA
– Liz Mansfield
– Zivana Tevak
– David Litwack
For More Information
www.genomeinabottle.org - sign up for general GIAB and Analysis Team google group emails
github.com/genome-in-a-bottle – Guide to GIAB data & ftp
www.slideshare.net/genomeinabottle
www.ncbi.nlm.nih.gov/variation/tools/get-rm/ - Get-RM Browser
Data: http://www.nature.com/articles/sdata201625
Global Alliance Benchmarking Team – https://github.com/ga4gh/benchmarking-tools
Public workshops
– Possible SV integration mini-workshop in Spring 2017
– Next large workshop in Fall 2017
NIST postdoc opportunities available!
Justin Zook: [email protected]
Marc Salit: [email protected]