Page 1

Best practices for benchmarking variant calls

Justin Zook and the GA4GH Benchmarking Team

NIST Genome-Scale Measurements Group

Joint Initiative for Metrology in Biology (JIMB)

Genome in a Bottle Consortium

November 14, 2017

Page 2

Take-home Messages

• Benchmarking variant calls is easy to do incorrectly

• The GA4GH Benchmarking Team has developed a set of public tools for robust, standardized benchmarking of variant calls

• Benchmarking results should be interpreted critically

• Ongoing work on difficult variants and regions

Page 3

Why are we doing this work?

• Technologies evolving rapidly

• Different sequencing and bioinformatics methods give different results

• Now have concordance in easy regions, but not in difficult regions

• Challenge: How do we benchmark variants in a 6-billion base-pair genome?

O’Rawe et al., Genome Medicine, 2013. https://doi.org/10.1186/gm432

Page 4

Genome in a Bottle Consortium: Authoritative Characterization of Human Genomes

Generic measurement process: Sample → gDNA isolation → Library Prep → Sequencing → Alignment/Mapping → Variant Calling → Confidence Estimates → Downstream Analysis

• gDNA reference materials to evaluate performance

• Established consortium to develop reference materials, data, methods, and performance metrics

www.slideshare.net/genomeinabottle

Page 5

Bringing Principles of Metrology to the Genome

• Reference materials

– DNA in a tube you can buy from NIST

• Extensive state-of-the-art characterization

– arbitrated “gold standard” calls for SNPs, small indels

• “Upgradable” as technology develops

• PGP genomes suitable for commercial derived products

• Developing benchmarking tools and software

– with GA4GH

• Samples being used to develop and demonstrate new technology

Page 6

Benchmarking the GIAB benchmarks

• Compare high-confidence calls to other callsets and manually inspect a subset of differences

– vs. pedigree-based calls

– vs. common pipelines

– Trio analysis

• When benchmarking a new callset against ours, most putative FPs/FNs should actually be FPs/FNs

Page 7

Manual curation is required

Page 8

Evolution of high-confidence calls

Version  HC Regions  HC Calls  HC indels  Concordant with PG  NIST-only in beds  PG-only in beds  PG-only   Variants Phased
v2.19    2.22 Gb     3153247   352937     3030703             87                 404              1018795   0.3%
v3.2.2   2.53 Gb     3512990   335594     3391783             57                 52               657715    3.9%
v3.3     2.57 Gb     3566076   358753     3441361             40                 60               608137    8.8%
v3.3.2   2.58 Gb     3691156   487841     3529641             47                 61               469202    99.6%

5-7 errors in NIST

1-7 errors in NIST

~2 FPs and ~2 FNs per million NIST variants in PG and NIST bed files

Page 9

Global Alliance for Genomics and Health Benchmarking Task Team

• Developed standardized definitions for performance metrics like TP, FP, and FN.

• Developing sophisticated benchmarking tools

• Integrated into a single framework with standardized inputs and outputs

• Standardized bed files with difficult genome contexts for stratification

https://github.com/ga4gh/benchmarking-tools

Variant types can change when decomposing or recomposing variants:

Complex variant: chr1 201586350 CTCTCTCTCT CA

DEL + SNP:

chr1 201586350 CTCTCTCTCT C

chr1 201586359 T A

Credit: Peter Krusche, Illumina / GA4GH Benchmarking Team
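To make the representation issue concrete, here is a minimal Python sketch (not a GA4GH tool) showing that a complex record and a deletion-plus-SNP decomposition can spell out the same haplotype even though the VCF lines differ. The flanking reference bases ("GG") are made up, and the decomposed records below are written in a non-overlapping, left-aligned form for the sake of the sketch, so they differ slightly from the records shown above.

```python
# Minimal sketch, not a GA4GH tool: apply VCF-style records to a reference
# fragment and compare the resulting haplotype strings. Real comparison
# engines such as vcfeval do this over whole haplotypes and genotypes.

def apply_variants(ref_seq, offset, variants):
    """Apply non-overlapping (pos, ref, alt) records (1-based pos, VCF-style)
    to ref_seq, where ref_seq[0] sits at genomic position `offset`."""
    pieces, cursor = [], 0
    for pos, ref, alt in sorted(variants):
        start = pos - offset
        assert start >= cursor, "records overlap"
        assert ref_seq[start:start + len(ref)] == ref, "REF mismatch"
        pieces.append(ref_seq[cursor:start])  # untouched bases before this record
        pieces.append(alt)                    # substitute the ALT allele
        cursor = start + len(ref)
    pieces.append(ref_seq[cursor:])
    return "".join(pieces)

# Assumed 12 bp of reference starting at chr1:201586350 (the "GG" padding is invented).
ref = "CTCTCTCTCTGG"
complex_rep = [(201586350, "CTCTCTCTCT", "CA")]
# One non-overlapping way to decompose the same change into a deletion plus a SNP.
decomposed = [(201586350, "CTCTCTCTC", "C"), (201586359, "T", "A")]

h_complex = apply_variants(ref, 201586350, complex_rep)
h_decomposed = apply_variants(ref, 201586350, decomposed)
print(h_complex, h_decomposed, h_complex == h_decomposed)  # CAGG CAGG True
```

The two representations yield identical haplotype sequences, which is why record-by-record comparison of VCF lines undercounts matches unless the comparison engine is representation-aware.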

Page 10

Why are definitions important?

Challenges

• Genotype comparisons don’t naturally fall into 2 categories as required for sensitivity, precision, and specificity

• Sometimes variants are partially called and/or partially filtered

• Clustered variants can be counted individually or as a single complex event

• How should filtered variants or “no-call” sites be treated?

Example cases

• Truth is a heterozygous SNP but the VCF has a homozygous SNP
– 1 FP, 1 FN, and 1 genotype mismatch

• Truth is an indel but the VCF has a SNP at the same position
– 1 FP, 1 FN, and 1 allele mismatch

• Truth is a deletion + SNP but the VCF has the deletion only
– 1 TP and 1 FN, or 1 FP and 1-2 FNs, depending on representations and comparison method (see the sketch below)
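As a rough illustration of how match stringency changes the counts, the following Python sketch classifies query calls against the truth under genotype-level versus allele-level matching. The category names, the simple site-keyed matching, and the handling of genotype mismatches are assumptions made for this sketch, not the GA4GH definitions themselves; the GA4GH tools handle multi-allelic sites, filtered records, and differing representations much more carefully.

```python
# Hedged sketch (not the GA4GH reference implementation): classify calls at
# matching positions under two stringencies. Truth and query are dicts keyed by
# (chrom, pos) with values (ref, alt, genotype); multi-allelic sites, differing
# representations, and filtered/no-call records are deliberately ignored here.

def classify(truth, query, genotype_must_match=True):
    counts = {"TP": 0, "FP": 0, "FN": 0, "GT_mismatch": 0}
    for site, (ref, alt, gt) in truth.items():
        if site not in query:
            counts["FN"] += 1                      # truth variant missed entirely
            continue
        q_ref, q_alt, q_gt = query[site]
        if (q_ref, q_alt) != (ref, alt):
            counts["FP"] += 1                      # wrong allele called here
            counts["FN"] += 1                      # true allele still missed
        elif genotype_must_match and q_gt != gt:
            counts["GT_mismatch"] += 1             # right allele, wrong genotype
            # (one convention; some tools additionally count such sites as FP and FN)
        else:
            counts["TP"] += 1
    counts["FP"] += sum(1 for site in query if site not in truth)
    return counts

# Example from the slide: truth is a heterozygous SNP, query calls it homozygous.
truth = {("chr1", 100): ("A", "G", "0/1")}
query = {("chr1", 100): ("A", "G", "1/1")}
print(classify(truth, query))                              # counted as a genotype mismatch
print(classify(truth, query, genotype_must_match=False))   # counted as a TP
```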

Page 11

Why are sophisticated comparison tools needed? Normalization isn’t sufficient

Page 12

Comparison methods affect performance metrics

• Some callers are affected by the comparison method more than others
– Biggest effect from clustering nearby variants

Page 13

GA4GH Reference Implementation

Truth VCF + Query VCF → Comparison engine (vcfeval / vgraph / xcmp / bcftools / ...) → VCF-I → Quantification (quantify / hap.py), using stratification BED files and confident call regions → VCF-R, counts / ROCs, and an HTML report (e.g., for precisionFDA)
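In practice, this workflow is typically driven through hap.py. Below is a minimal sketch of one way to invoke it from Python; the file names are placeholders, and although the flags shown are ones hap.py provides, the exact options available should be checked against the installed version's help.

```python
# Minimal sketch of driving the GA4GH reference implementation (hap.py) from
# Python. File names are placeholders; flags shown (-r, -f, -o, --engine,
# --stratification) exist in hap.py, but verify against `hap.py --help`.
import subprocess

cmd = [
    "hap.py",
    "truth.vcf.gz",            # benchmark ("truth") VCF, e.g. GIAB high-confidence calls
    "query.vcf.gz",            # callset being evaluated
    "-r", "reference.fa",      # reference FASTA the calls were made against
    "-f", "confident.bed",     # high-confidence (confident call) regions
    "--engine", "vcfeval",     # haplotype-aware comparison engine
    "--stratification", "stratification.tsv",  # TSV listing stratification BED files
    "-o", "benchmark_output",  # prefix for summary/extended CSVs and ROC files
]
subprocess.run(cmd, check=True)
```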

Page 14

Workflow output

Benchmarking example: NA12878 / GIAB / 50x / PCR-free / HiSeq 2000

https://illumina.box.com/s/vjget1dumwmy0re19usetli2teucjel1

Credit: Peter Krusche, Illumina / GA4GH Benchmarking Team

Page 15

Benchmarking Tools: Standardized comparison, counting, and stratification with hap.py + vcfeval

https://precision.fda.gov/
https://github.com/ga4gh/benchmarking-tools

Page 16

FN rates high in some tandem repeats

[Figure: FN rate vs. average coverage (0.3x, 1x, 3x, 10x, 30x) for tandem repeats with 2 bp, 3 bp, and 4 bp repeat units, shown separately for repeat lengths of 11 to 50 bp and 51 to 200 bp.]

Page 17

Benchmarking stats can be difficult to interpret. Example: FN SNPs in coding regions

RefSeq Coding Regions

• Studies often focus on variants in coding regions

• We look at FN SNP rates for BWA-GATK calls made using the decoy reference

SNP benchmarking stats vs. PG and 3.3.2

• 97.98% sensitivity vs. PG
– FNs predominantly in low-MQ and/or segmental duplication regions

– ~80% of FNs supported by long or linked reads

• 99.96% sensitivity vs. NIST v3.3.2
– 62x lower FN rate than vs. PG

• As always, true sensitivity is unknown

Page 18


True accuracy is hard to estimate, especially in difficult regions.

Page 19

Benchmarking against each GIAB genome

Genome  Type  Subset  100% - recall (%)  100% - precision (%)  Recall  Precision  Fraction of calls outside high-conf bed

HG001 SNP all 0.0277 0.1274 0.9997 0.9987 0.1653

HG002 SNP all 0.0664 0.1342 0.9993 0.9987 0.1910

HG003 SNP all 0.0625 0.1489 0.9994 0.9985 0.1967

HG004 SNP all 0.0633 0.1592 0.9994 0.9984 0.1975

HG005 SNP all 0.1175 0.0870 0.9988 0.9991 0.1834

HG001 SNP notinalldifficultregions 0.0096 0.0783 0.9999 0.9992 0.0491

HG002 SNP notinalldifficultregions 0.0102 0.0576 0.9999 0.9994 0.0864

HG003 SNP notinalldifficultregions 0.0128 0.0819 0.9999 0.9992 0.0864

HG004 SNP notinalldifficultregions 0.0102 0.0860 0.9999 0.9991 0.0854

HG005 SNP notinalldifficultregions 0.0931 0.0541 0.9991 0.9995 0.0664

HG001 INDEL all 0.8354 0.7458 0.9916 0.9925 0.4485

HG002 INDEL all 0.8271 0.7016 0.9917 0.9930 0.4547

HG003 INDEL all 0.7546 0.6523 0.9925 0.9935 0.4632

HG004 INDEL all 0.7345 0.6390 0.9927 0.9936 0.4592

HG005 INDEL all 0.9840 0.7418 0.9902 0.9926 0.4850

HG001 INDEL notinalldifficultregions 0.0551 0.1475 0.9994 0.9985 0.1927

HG002 INDEL notinalldifficultregions 0.0497 0.0893 0.9995 0.9991 0.2208

HG003 INDEL notinalldifficultregions 0.0508 0.1627 0.9995 0.9984 0.2229

HG004 INDEL notinalldifficultregions 0.0496 0.1307 0.9995 0.9987 0.2190

HG005 INDEL notinalldifficultregions 0.1182 0.1535 0.9988 0.9985 0.2049
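A quick check of how the leading percentage columns relate to the recall and precision columns, using the HG001 SNP "all" row above:

```python
# Sanity check on the table's units: the "100% - recall" and "100% - precision"
# columns are percentages. For the HG001 SNP "all" row:
recall_pct_missed = 0.0277            # "100% - recall" column, in percent
recall = 1 - recall_pct_missed / 100
print(round(recall, 4))               # 0.9997, matching the Recall column
```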

Page 20

Approaches to Benchmarking Variant Calling

• Well-characterized whole genome Reference Materials

• Many samples characterized in clinically relevant regions

• Synthetic DNA spike-ins

• Cell lines with engineered mutations

• Simulated reads

• Modified real reads

• Modified reference genomes

• Confirming results found in real samples over time

Page 21

Challenges in Benchmarking Variant Calling

• It is difficult to do robust benchmarking of tests designed to detect many analytes (e.g., many variants)

• Easiest to benchmark only within high-confidence bed file, but…

• Benchmark calls/regions tend to be biased towards easier variants and regions

– Some clinical tests are enriched for difficult sites

• Can you predict your performance for clinical variants of interest based on sequencing reference samples?

Page 22

Best Practices for Benchmarking

Benchmark sets

Use benchmark sets with both high-confidence variant calls and high-confidence regions, so that both false negatives and false positives can be assessed.

Stringency of variant comparison

Determine whether it is important that the genotype matches exactly, that only the allele matches, or that the call just needs to be near the true variant.

Variant comparison tools

Use sophisticated variant comparison engines such as vcfeval, xcmp, or varmatch that are able to determine whether different representations of the same variant are consistent with the benchmark call. Subsetting by high-confidence regions and, if desired, targeted regions should only be done after comparison, to avoid problems comparing variants with differing representations.

Manual curation

Manually curate alignments, ideally from multiple data types, around at least a subset of putative false positive and false negative calls in order to ensure they are truly errors in the user’s callset and to understand the cause(s) of errors. Report back to benchmark set developers any potential errors found in the benchmark set (e.g., using https://goo.gl/forms/ECbjHY7nhz0hrCR52 for GIAB).

Interpretation of metrics

All performance metrics should only be interpreted with respect to the limitations of the variants and regions in the benchmark set. Performance metrics are likely to be lower for more difficult variant types and regions that are not fully represented in the benchmark set, such as those in repetitive or difficult-to-map regions. When comparing methods, method 1 may perform better in the high-confidence regions, but method 2 may perform better for more difficult variants outside the high-confidence regions.

Stratification

Overall performance metrics can be useful, but for many applications it is important to assess performance for particular variant types and genome contexts. Performance often varies significantly across variant types and genome contexts, and stratification allows users to understand this. In addition, stratification allows users to see if some variant types and genome contexts of interest are not sufficiently represented.
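As a toy illustration of the idea (separate from the standardized GA4GH stratification BED files), the sketch below splits false negatives by whether they fall inside a BED of difficult regions. The file handling is simplified and assumes a sorted, non-overlapping BED; the function and file names are made up for this example.

```python
# Toy stratification sketch (assumed file formats, not the GA4GH stratification
# machinery): count false negatives inside vs. outside a BED of difficult regions.
# BED intervals are 0-based, half-open, and assumed sorted and non-overlapping.
from bisect import bisect_right
from collections import defaultdict

def load_bed(path):
    """Return {chrom: (starts, ends)} with parallel, sorted interval lists."""
    regions = defaultdict(lambda: ([], []))
    with open(path) as fh:
        for line in fh:
            chrom, start, end = line.split()[:3]
            regions[chrom][0].append(int(start))
            regions[chrom][1].append(int(end))
    return regions

def in_regions(regions, chrom, pos):
    """True if 1-based position pos falls in any interval on chrom."""
    starts, ends = regions.get(chrom, ([], []))
    i = bisect_right(starts, pos - 1) - 1        # last interval starting at/before pos
    return i >= 0 and pos - 1 < ends[i]

def stratify_fns(fn_sites, difficult_bed):
    """fn_sites: iterable of (chrom, pos) false negatives from a comparison run."""
    regions = load_bed(difficult_bed)
    counts = {"difficult": 0, "not_difficult": 0}
    for chrom, pos in fn_sites:
        key = "difficult" if in_regions(regions, chrom, pos) else "not_difficult"
        counts[key] += 1
    return counts

# Hypothetical usage:
# print(stratify_fns([("chr1", 201586350), ("chr2", 500)], "difficult_regions.bed"))
```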

Confidence Intervals

Confidence intervals for performance metrics such as precision and recall should be calculated. This is particularly critical for the smaller numbers of variants found when benchmarking in targeted regions and/or less common stratified variant types and regions.
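As one way to follow this practice, here is a short sketch of the Wilson score interval applied to recall computed from TP and FN counts; the counts are made up, and other interval methods (exact binomial, bootstrap) are equally reasonable choices.

```python
# Sketch of a Wilson score 95% confidence interval for a proportion such as
# recall = TP / (TP + FN) or precision = TP / (TP + FP). Counts are made up.
from math import sqrt

def wilson_interval(successes, total, z=1.96):
    """Return (lower, upper) Wilson score interval for successes/total."""
    if total == 0:
        return (0.0, 1.0)
    p = successes / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half = z * sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return (center - half, center + half)

# Made-up counts: a small targeted panel vs. a whole-genome comparison.
for label, tp, fn in [("targeted panel", 98, 2), ("whole genome", 3_690_000, 1_200)]:
    lo, hi = wilson_interval(tp, tp + fn)
    print(f"{label}: recall={tp / (tp + fn):.4f}, 95% CI=({lo:.4f}, {hi:.4f})")
```

The small targeted panel yields a much wider interval than the whole-genome comparison, which is exactly why confidence intervals matter most for targeted regions and less common stratified subsets.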

Page 23

Ongoing and Future Work

• Characterizing difficult variants and regions
– Large indels and structural variants

– Tandem repeats and homopolymers

– Difficult to map regions

– Complex variants

• New germline samples
– Additional ancestries

• Tumor/normal cell lines
– Developing IRB protocol for broadly consented samples

Page 24

Acknowledgements

• NIST/JIMB

– Marc Salit

– Jenny McDaniel

– Lindsay Vang

– David Catoe

– Lesley Chapman

• Genome in a Bottle Consortium

• GA4GH Benchmarking Team

• FDA

Page 25

For More Information

www.genomeinabottle.org - sign up for general GIAB and Analysis Team Google group emails

github.com/genome-in-a-bottle – Guide to GIAB data & ftp

www.slideshare.net/genomeinabottle

www.ncbi.nlm.nih.gov/variation/tools/get-rm/ - Get-RM Browser

Data: http://www.nature.com/articles/sdata201625

Global Alliance Benchmarking Team
– https://github.com/ga4gh/benchmarking-tools
– Web-based implementation at precision.fda.gov

Public workshops – Next workshop Jan 25-26, 2018 at Stanford University, CA, USA

NIST/JIMB postdoc opportunities available!
Justin Zook: [email protected]
Marc Salit: [email protected]