Comprehensive structural and copy-number variant detection ... · Comprehensive structural and...

1
For Research Use Only. Not for use in diagnostic procedures. © Copyright 2019 by Pacific Biosciences of California, Inc. All rights reserved. Pacific Biosciences, the Pacific Biosciences logo, PacBio, SMRT, SMRTbell, Iso-Seq, and Sequel are trademarks of Pacific Biosciences. BluePippin and SageELF are trademarks of Sage Science. NGS-go and NGSengine are trademarks of GenDx. Femto Pulse and Fragment Analyzer are trademarks of Agilent Technologies Inc. All other trademarks are the sole property of their respective owners. Comprehensive structural and copy-number variant detection with long reads Aaron M. Wenger, Armin Töpfer, Yuan Li, Luke Hickey Pacific Biosciences, 1305 O'Brien Drive, Menlo Park, CA 94025 Variant Detection with HiFi Reads Existing long read variant calling methods rely on de novo assembly or spanning reads to detect variants. These methods are effective for SVs but miss many CNVs that involve long segmental duplications. Copy-number variant (CNV) calling with pbsv CNVs in HG001 and COLO829T PacBio highly accurate, long reads (HiFi reads) comprehensively detect variants in the human genome, including in difficult repetitive regions. References 1. Wenger AM, Peluso P et al. (2019). Accurate circular consensus long-read sequencing improves variant detection and assembly of a human genome . Nat Biotechnol. doi:10.1038/s41587-019-0217-9. 2. Zook JM et al. (2019). An open resource for accurately benchmarking small variant and reference calls . Nat Biotechnol. 37(5):561-566. 3. Zook JM et al. (2019). A robust benchmark for germline structural variant detection . bioRxiv. doi:10.1101/664623. [Preprint] 4. Chaisson MJP et al. (2019). Multi-platform discovery of haplotype-resolved structural variation in human genomes. Nat Commun. 10(1):1784. 5. https://github.com/PacificBiosciences/pbsv Thank you to David Scherer, Kristin Robertshaw, and Pamela Bentley Mills for poster production support. Ref GCAGGCAGCGACTACGTACGCTAACAGCGATCTCAG Alt GCAGGCAGCGACTACGTCCTCTAACAGCGATCTCAG SNVs Indels SVs 99.97% Precision 99.95% Recall 99.10% Precision 98.86% Recall Precision and recall, as measured against the Genome in a Bottle (GIAB) benchmarks 2,3 , is high for single nucleotide variants (SNVs), indels, and structural variants (SVs). Figure 1. Accurate PacBio HiFi reads detect variants in difficult-to-map exons of the disease gene STRC 1 . Ref GCAGGCAGCGACTACGTACGCTAACAGCGATCTCAG Alt GCAGGCAGCGACTACGT-CGCTAACAGCGATCTCAG Figure 2. PacBio SNV calling performance for HG002 with 32-fold HiFi coverage (Rowell, poster 1866/W). Figure 3. PacBio indel calling performance for HG002 with 32-fold HiFi coverage. "We determined that 57% and 15% of the copy number variable bases within segmental duplications detected by dCGH and Genome STRiP, respectively, were not in contigs resolved by de novo assembly [of long reads]." – Chaisson et al. 2019 (ref. 4) We extended the PacBio SV caller, pbsv 5 , to detect CNVs using a combination of read clipping and depth. Determine genome-wide coverage median coverage of non-gap positions at mapping quality 60 Identify candidate CNV breakpoints positions with multiple clipped reads Evaluate coverage between adjacent candidate breakpoints calculate z-score vs Poisson expectation pbsv CNV algorithm Figure 5. PacBio read coverage is Poisson distributed in autosomes for (a) samples from GIAB and (b) the normal sample from a tumor/normal pair (provided by WP Kloosterman). A tumor sample shows CNV regions of reduced and increased coverage. (c) To call CNVs, pbsv identifies candidate CNV breakpoints with multiple clipped reads and then evaluates read depth between adjacent breakpoints compared to the genome-wide typical coverage. a b c candidate breakpoint candidate breakpoint HG001 PacBio reads Segdups Repeats GRCh37 chr6:103,727,761-103,773,116 (45 kb) 25 kb 96.26% Precision 94.93% Recall Ref Alt Figure 4. PacBio SV calling performance for HG002 with 32-fold HiFi coverage. HG001 PacBio reads Segdups Repeats GRCh37 chr19:50,579,869-50,659,719 (80 kb) HG001 pbsv CNV HG001 PacBio reads Segdups Repeats HG001 pbsv CNV GRCh37 chr6:220,626-410,470 (189 kb) Genes Genes Copy number Copy number Total CNV length (Mb) Coverage Coverage 100 bp windows 100 bp windows HG001 COLO829T (tumor) segdup segdup unique unique Figure 6. pbsv CNV calls in HG001 in (a) segmentally duplicated (segdup) sequence and (b) unique sequence. a b Figure 7. Genomic distribution of pbsv CNV calls. (a, b) Most CNVs in the "healthy" genome HG001 involve segdups. (c, d) The tumor genome COLO829T has many more CNVs than HG001 and most fall outside of segdups. Summary PacBio HiFi reads comprehensively detect variants in a human genome. The pbsv variant caller identifies CNVs in HiFi reads using read clipping and depth signatures. b a d c GIAB benchmark DeepVariant PacBio callset 3,047,837 1,571 1,030 GIAB benchmark DeepVariant PacBio callset 459,481 5,283 4,329 GIAB benchmark DeepVariant PacBio callset 9,196 489 357 54 0 6 12 18 24 30 36 42 48 60 90 0 10 20 30 40 50 60 70 80 100 0% 1% 2% 3% 4% 5% 6% 0% 1% 2% 3% 4% 5% 9 0 1 2 3 4 5 6 7 8 ≥10 9 0 1 2 3 4 5 6 7 8 ≥10 0 1 2 3 4 5 0 1 2 3 4 5 0 50 100 150 200 0 50 100 150 200 HG001 HG002 HG005 COLO829N COLO829T

Transcript of Comprehensive structural and copy-number variant detection ... · Comprehensive structural and...

Page 1: Comprehensive structural and copy-number variant detection ... · Comprehensive structural and copy-number variant detection with long reads Aaron M. Wenger, Armin Töpfer, Yuan Li,

For Research Use Only. Not for use in diagnostic procedures. © Copyright 2019 by Pacific Biosciences of California, Inc. All rights reserved. Pacific Biosciences, the Pacific Biosciences logo, PacBio, SMRT, SMRTbell, Iso-Seq, and Sequel are trademarks of Pacific Biosciences. BluePippin and SageELF are trademarks of Sage Science. NGS-go and NGSengine are trademarks of GenDx. Femto Pulse and Fragment Analyzer are trademarks of Agilent Technologies Inc. All other trademarks are the sole property of their respective owners.

Comprehensive structural and copy-number variant detection with long readsAaron M. Wenger, Armin Töpfer, Yuan Li, Luke HickeyPacific Biosciences, 1305 O'Brien Drive, Menlo Park, CA 94025

Variant Detection with HiFi Reads

Existing long read variant calling methods rely on de novo assembly or spanning reads to detect variants. These methods are effective for SVs but miss many CNVs that involve long segmental duplications.

Copy-number variant (CNV)calling with pbsv CNVs in HG001 and COLO829T

PacBio highly accurate, long reads (HiFi reads) comprehensively detect variants in the human genome, including in difficult repetitive regions.

References1. Wenger AM, Peluso P et al. (2019). Accurate circular consensus long-read sequencing

improves variant detection and assembly of a human genome. Nat Biotechnol. doi:10.1038/s41587-019-0217-9.

2. Zook JM et al. (2019). An open resource for accurately benchmarking small variant and reference calls. Nat Biotechnol. 37(5):561-566.

3. Zook JM et al. (2019). A robust benchmark for germline structural variant detection. bioRxiv. doi:10.1101/664623. [Preprint]

4. Chaisson MJP et al. (2019). Multi-platform discovery of haplotype-resolved structuralvariation in human genomes. Nat Commun. 10(1):1784.

5. https://github.com/PacificBiosciences/pbsvThank you to David Scherer, Kristin Robertshaw, and Pamela Bentley Mills for poster production support.

Ref GCAGGCAGCGACTACGTACGCTAACAGCGATCTCAGAlt GCAGGCAGCGACTACGTCCTCTAACAGCGATCTCAG

SNVs

Indels

SVs

99.97% Precision99.95% Recall

99.10% Precision98.86% Recall

Precision and recall, as measured against the Genome in a Bottle (GIAB) benchmarks2,3, is high for single nucleotide variants (SNVs), indels, and structural variants (SVs).

Figure 1. Accurate PacBio HiFi reads detect variants in difficult-to-map exons of the disease gene STRC1.

Ref GCAGGCAGCGACTACGTACGCTAACAGCGATCTCAGAlt GCAGGCAGCGACTACGT-CGCTAACAGCGATCTCAG

Figure 2. PacBio SNV calling performance for HG002 with 32-fold HiFi coverage (Rowell, poster 1866/W).

Figure 3. PacBio indel calling performance for HG002 with 32-fold HiFi coverage.

"We determined that 57% and 15% of the copy number variable bases within segmental duplications detected by dCGH and Genome STRiP, respectively, were not in contigs resolved by de novo assembly [of long reads]."– Chaisson et al. 2019 (ref. 4)

We extended the PacBio SV caller, pbsv5, to detect CNVs using a combination of read clipping and depth.

Determine genome-wide coveragemedian coverage of non-gap

positions at mapping quality 60

Identify candidate CNV breakpointspositions with multiple clipped reads

Evaluate coverage between adjacent candidate breakpoints

calculate z-score vs Poisson expectation

pbsv CNV algorithm

Figure 5. PacBio read coverage is Poisson distributed in autosomes for (a) samples from GIAB and (b) the normal sample from a tumor/normal pair (provided by WP Kloosterman). A tumor sample shows CNV regions of reduced and increased coverage. (c) To call CNVs, pbsvidentifies candidate CNV breakpoints with multiple clipped reads and then evaluates read depth between adjacent breakpoints compared to the genome-wide typical coverage.

a b

c

candidate breakpoint

candidate breakpoint

HG001PacBio

reads

Segdups

Repeats

GRCh37 chr6:103,727,761-103,773,116 (45 kb)

25 kb

96.26% Precision94.93% Recall

RefAlt

Figure 4. PacBio SV calling performance for HG002 with 32-fold HiFi coverage.

HG001PacBio

reads

Segdups

Repeats

GRCh37 chr19:50,579,869-50,659,719 (80 kb)

HG001pbsv CNV

HG001PacBio

reads

Segdups

Repeats

HG001pbsv CNV

GRCh37 chr6:220,626-410,470 (189 kb)

Genes

Genes

Copy number Copy number

Tota

l CN

V le

ngth

(Mb)

CoverageCoverage

100

bp w

indo

ws

100

bp w

indo

ws

HG001 COLO829T (tumor)

segdup segdup

unique unique

Figure 6. pbsv CNV calls in HG001 in (a) segmentally duplicated (segdup) sequence and (b) unique sequence.

a

b

Figure 7. Genomic distribution of pbsv CNV calls. (a, b) Most CNVs in the "healthy" genome HG001 involve segdups. (c, d) The tumor genome COLO829T has many more CNVs than HG001 and most fall outside of segdups.

Summary

• PacBio HiFi reads comprehensively detect variants in a human genome.

• The pbsv variant caller identifies CNVs in HiFi reads using read clipping and depth signatures.

b

a

d

cGIABbenchmark

DeepVariantPacBio callset

3,047,8371,5711,030

GIABbenchmark

DeepVariantPacBio callset

459,4815,2834,329

GIABbenchmark

DeepVariantPacBio callset

9,196489357

540 6 12 18 24 30 36 42 48 60 900 10 20 30 40 50 60 70 80 1000%

1%

2%

3%

4%

5%

6%

0%

1%

2%

3%

4%

5%

90 1 2 3 4 5 6 7 8 ≥10 90 1 2 3 4 5 6 7 8 ≥100

1

2

3

4

5

0

1

2

3

4

5

0

50

100

150

200

0

50

100

150

200

HG001HG002HG005

COLO829NCOLO829T