Targeted sequencing and chromosomal haplotype assembly using TLA and SMRT Sequencing



For Research Use Only. Not for use in diagnostics procedures. © Copyright 2016 by Pacific Biosciences of California, Inc. All rights reserved. Pacific Biosciences, the Pacific Biosciences logo, PacBio, SMRT, SMRTbell, Iso-Seq, and Sequel are trademarks of Pacific Biosciences. BluePippin and SageELF are trademarks of Sage Science. NGS-go and NGSengine are trademarks of GenDx. TLA is a trademark of Cergentis. All other trademarks are the sole property of their respective owners.

TLA and SMRT Sequencing: Targeted Sequencing and Chromosomal Haplotype Assembly

Lawrence S Hon¹, Yu-Chih Tsai¹, Steve Kujawa¹, Erik Splinter², Marieke Simonis², Tyson Clark¹, Jonas Korlach¹, Max van Min²

¹ PacBio, 1380 Willow Road, Menlo Park, CA 94025
² Cergentis B.V., Padualaan 8, 3584 CH Utrecht, The Netherlands

We combined SMRT Sequencing with Cergentis’ Targeted Locus Amplification (TLA) Technology to prepare, sequence, and haplotype individual genes, chromosomes, and genomes.

Introduction

TLA is a strategy to selectively amplify and sequence complete loci on the basis of the crosslinking of physically proximal sequences. Unlike other targeted sequencing methods, TLA works without detailed prior locus information, as one primer pair is sufficient to amplify and sequence tens to hundreds of kilobases of surrounding DNA. TLA enables targeted complete sequencing and the detection of single-nucleotide and structural variants in genes of interest. In addition, TLA enables the haplotyping of sequenced regions.

Unamplified TLA Template can be used for genome-wide phasing and assembly.

SMRT Sequencing enables the complete sequencing of TLA products and therefore empowers phasing and assembly.

Additional Information

References:

Bansal, V. and Bafna, V., HapCUT: an efficient and accurate algorithm for the haplotype assembly problem. Bioinformatics, 2008.

de Vree, J.P., et al., Targeted sequencing by proximity ligation for comprehensive variant detection and local haplotyping. Nature Biotechnology, 2014.

GIAB data:
ftp://ftp-trace.ncbi.nih.gov/giab/ftp/data/NA12878/
ftp://ftp-trace.ncbi.nih.gov/giab/ftp/data/AshkenazimTrio/HG002_NA24385_son/

BRCA1 Sequencing & Phasing

Mapped reads generated with the BRCA1 TLA fully cover the BRCA1 region (panel A), and heterozygous SNPs are clearly visible in the TLA data (panel B), allowing excellent phasing performance (table C). The 81 kb BRCA1 gene is represented by a single haplotype block (haplotyping was validated against a reference dataset).

[Panels A and B: figures]

C)
# Haplotype Blocks               1
Block Span                       81,463 bp
# hetSNPs Phased                 116
# hetSNPs in Validation Set      117
Switch Errors                    0
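The figures in table C can be turned into two summary numbers with a few lines of arithmetic. The sketch below is only illustrative: it assumes the 116 phased hetSNPs are a subset of the 117 hetSNPs in the validation set, which the poster does not state explicitly, and the helper name `fraction` is hypothetical.

```python
# Values copied from table C (BRCA1 phasing statistics).
HET_SNPS_PHASED = 116
HET_SNPS_IN_VALIDATION_SET = 117
BLOCK_SPAN_BP = 81_463


def fraction(part: int, whole: int) -> float:
    """Return part / whole, rejecting a non-positive denominator."""
    if whole <= 0:
        raise ValueError("whole must be positive")
    return part / whole


# Assumption: every phased hetSNP is also present in the validation set.
phased_fraction = fraction(HET_SNPS_PHASED, HET_SNPS_IN_VALIDATION_SET)

# Average distance between phased hetSNPs inside the single block.
mean_spacing_bp = BLOCK_SPAN_BP / HET_SNPS_PHASED

print(f"hetSNPs phased relative to validation set: {phased_fraction:.1%}")
print(f"mean spacing between phased hetSNPs: {mean_spacing_bp:.0f} bp")
```

Under that assumption, all but one validation hetSNP falls inside the single 81 kb block.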

Whole-Chromosome Phasing

Because the targeted TLA data include segments that align far outside the BRCA1 gene region (plot on right), we performed longer-range phasing by combining those data with whole-genome shotgun PacBio data. HapCUT constructed a phasing block that spanned nearly all of chromosome 17 (79.6 Mb of 81.2 Mb) with low switch rates, demonstrating the feasibility of the approach.

Statistics of Longest Phasing Block on Chr17
Block Span               79,628,306 bp
Chromosome 17 Size       81,195,210 bp
# Phased Bases           28,133,018 bp
# hetSNPs Phased         21,762
Long Switch Rate         0.4%
Short Switch Rate        0.08%
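The poster reports long and short switch rates without defining them. The sketch below uses one common convention, stated here as an assumption: a short switch is a single hetSNP whose phase is flipped relative to its neighbours, and a long switch is a phase change that persists. The function name and the example haplotype strings are hypothetical and are not taken from the poster's data.

```python
from typing import Tuple


def count_switches(truth: str, predicted: str) -> Tuple[int, int]:
    """Count (long_switches, short_switches) between two haplotype strings.

    Each string gives, per hetSNP in genomic order, the allele ('0' or '1')
    assigned to haplotype 1. A globally inverted prediction counts as
    error-free, because haplotype labels are arbitrary.
    """
    if len(truth) != len(predicted):
        raise ValueError("haplotype strings must have equal length")

    # agree[i] is True when the prediction matches the truth at hetSNP i.
    agree = [t == p for t, p in zip(truth, predicted)]

    # A switch occurs wherever agreement changes between adjacent hetSNPs.
    switches = [i for i in range(1, len(agree)) if agree[i] != agree[i - 1]]

    long_switches = 0
    short_switches = 0
    i = 0
    while i < len(switches):
        # Two switches at adjacent hetSNPs bracket one isolated flipped SNP.
        if i + 1 < len(switches) and switches[i + 1] == switches[i] + 1:
            short_switches += 1
            i += 2
        else:
            long_switches += 1
            i += 1
    # Simplification: a flip at the very first or last hetSNP produces a
    # single switch and is therefore counted as a long switch here.
    return long_switches, short_switches


# Hypothetical example: one isolated flip (index 2) and one lasting switch.
truth_hap = "0000000000"
pred_hap = "0010011111"
long_sw, short_sw = count_switches(truth_hap, pred_hap)
print(long_sw, short_sw)
```

A rate would then be the count divided by the number of adjacent hetSNP pairs, `len(truth) - 1`; whether the poster's percentages use that denominator is not stated.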

Whole Genome Phasing

In a whole-genome TLA Template dataset, segments from the same read can lie far apart in the genome (plot A), and many reads have >10 segments (plot B), which greatly increases the chance that two segments from one read will each contain a heterozygous SNP. Combining these data with shotgun data from the same individual dramatically increases the number of phased SNPs (table C, validation in progress).

[Panels A and B: plots]

C) Statistics of Longest Phasing Block on Chr17
Block Span               81,121,761 bp
Chromosome 17 Size       81,195,210 bp
# Phased Bases           70,906,325 bp
# hetSNPs Phased         48,349
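The argument that more segments per read raises the chance of linking two heterozygous SNPs can be made concrete with a simple binomial model. This is a sketch under strong assumptions: each segment is treated as independently covering a hetSNP with the same probability, and the value `P_HET = 0.1` is purely hypothetical, not a measurement from this dataset.

```python
def prob_at_least_two_informative(n_segments: int, p_het: float) -> float:
    """Probability that at least two of n segments each cover a hetSNP.

    Assumes segments are independent and share one per-segment probability.
    """
    if n_segments < 0:
        raise ValueError("n_segments must be non-negative")
    if not 0.0 <= p_het <= 1.0:
        raise ValueError("p_het must lie in [0, 1]")
    if n_segments < 2:
        return 0.0
    p_none = (1.0 - p_het) ** n_segments
    p_exactly_one = n_segments * p_het * (1.0 - p_het) ** (n_segments - 1)
    return 1.0 - p_none - p_exactly_one


# Hypothetical per-segment probability of covering a hetSNP.
P_HET = 0.1

# Segment counts echo those quoted in the text (~4, >10, >20 per read).
for n in (4, 10, 20):
    print(n, round(prob_at_least_two_informative(n, P_HET), 3))
```

Only reads with at least two informative segments contribute a phasing link, which is why the probability is zero below two segments and grows quickly with segment count.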

Experiment

Here, we applied TLA to the BRCA1 gene in NA12878 with a primer pair at (hg19) Chr17:41237179-41236511 (located ~40 kb from the start of the 81 kb BRCA1 gene) and then sequenced the resulting 2 kb circles on the PacBio RS II instrument.

We then explored chromosomal-scale haplotype assembly by combining these data with whole-genome shotgun PacBio long reads.
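One way to see why combining the two data types helps is to treat each read as a link between the hetSNPs it covers: shotgun reads connect nearby SNPs, while multi-segment TLA reads can connect distant ones. The sketch below groups SNPs into connected blocks with a union-find structure. It illustrates connectivity only and is not HapCUT's algorithm, which additionally resolves which allele belongs to which haplotype; the SNP indices and read lists are hypothetical.

```python
from typing import Dict, Iterable, List


def phase_blocks(fragments: Iterable[Iterable[int]]) -> List[List[int]]:
    """Group hetSNP indices into blocks connected by shared fragments."""
    parent: Dict[int, int] = {}

    def find(x: int) -> int:
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a: int, b: int) -> None:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    for fragment in fragments:
        snps = list(fragment)
        for snp in snps:
            find(snp)  # register SNPs seen on single-SNP fragments too
        for other in snps[1:]:
            union(snps[0], other)

    blocks: Dict[int, List[int]] = {}
    for snp in list(parent):
        blocks.setdefault(find(snp), []).append(snp)
    return sorted(sorted(block) for block in blocks.values())


# Hypothetical hetSNP indices covered by each read.
shotgun_reads = [[1, 2], [2, 3], [5, 6], [6, 7]]  # local links only
tla_reads = [[3, 5]]                              # one long-range link

print(phase_blocks(shotgun_reads))
print(phase_blocks(shotgun_reads + tla_reads))
```

In this toy input, the single long-range link merges two locally phased blocks into one.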

Finally, by size-selecting TLA Templates >5 kb to maximize the number of segments per read and then sequencing, we targeted whole-genome haplotype assembly across all chromosomes.

PacBio SMRTbell libraries were created from the Cergentis samples following published PacBio sample prep procedures (with 6 kb BluePippin size selection and additional damage repair for the whole-genome TLA Template) and sequenced on the PacBio RS II.

TLA yields 2 kb CCS reads with ~4 segments/read, and TLA Template yields >10 kb reads with >20 segments/read. For targeted BRCA1 phasing, SNPs were called de novo using SAMtools and BCFtools. For whole-chromosome analysis, BAM (PacBio shotgun) and VCF files were obtained from GIAB. HapCUT was then used to phase selected regions, incorporating whole-genome PacBio shotgun data for whole-chromosome phasing.
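Only heterozygous SNPs carry phasing information, so a VCF from either source is typically reduced to those sites before haplotype assembly. The following is a minimal sketch of that filtering step for a single-sample VCF, not the authors' pipeline: the function name is hypothetical and the example records use made-up positions and genotypes.

```python
from typing import Iterable, List, Tuple


def heterozygous_snps(vcf_lines: Iterable[str]) -> List[Tuple[str, int, str, str]]:
    """Return (chrom, pos, ref, alt) for biallelic heterozygous SNPs.

    Reads the first sample column of a single-sample VCF.
    """
    hets = []
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header and blank lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 10:
            continue  # no genotype column
        chrom, pos, ref, alt = fields[0], fields[1], fields[3], fields[4]
        fmt, sample = fields[8], fields[9]
        # Keep SNPs only: one reference base, one alternate base.
        if len(ref) != 1 or len(alt) != 1 or alt == ".":
            continue
        keys = fmt.split(":")
        if "GT" not in keys:
            continue
        genotype = sample.split(":")[keys.index("GT")]
        alleles = genotype.replace("|", "/").split("/")
        if len(alleles) == 2 and "." not in alleles and alleles[0] != alleles[1]:
            hets.append((chrom, int(pos), ref, alt))
    return hets


# Hypothetical records for illustration only.
EXAMPLE_VCF = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tSAMPLE",
    "chr17\t1000\t.\tA\tG\t50\tPASS\t.\tGT:DP\t0/1:30",   # het SNP
    "chr17\t2000\t.\tC\tT\t50\tPASS\t.\tGT:DP\t1/1:28",   # homozygous
    "chr17\t3000\t.\tG\tGA\t50\tPASS\t.\tGT:DP\t0/1:25",  # indel
    "chr17\t4000\t.\tT\tC\t50\tPASS\t.\tGT:DP\t1|0:31",   # het SNP
]

hets = heterozygous_snps(EXAMPLE_VCF)
print(hets)
```

Indels and multi-allelic sites are dropped here for simplicity; a real analysis might choose to keep them.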

Sample     Prep                    Library size   Sequencing Chemistry   Fold Coverage
NA12878    TLA targeting BRCA1     2 kb           P6-C4                  Variable with peak at BRCA1
NA12878    Whole-genome shotgun    ~7 kb          P5-C3 and older        ~40X
GM24385    TLA Template            10 kb          P6-C4                  0.8X
GM24385    Whole-genome shotgun    >10 kb         P6-C4                  ~50X

Figure captions:

BRCA1 Sequencing & Phasing: Schematic depiction of TLA BRCA1 SMRT Sequencing and phasing.

Whole-Chromosome Phasing: Schematic depiction of TLA BRCA1 SMRT Sequencing-based phasing of chromosome 17 (only one allele shown).

Schematic depiction of TLA Template SMRT Sequencing-based phasing of chromosome 17 (only one allele shown).