Efficient whole genome haplotyping and single molecule ... › smash › get › diva2:... · to...

28
1 Efficient whole genome haplotyping and single molecule phasing with barcode-linked reads David Redin 1 , Tobias Frick 1 , Hooman Aghelpasand 1 , Jennifer Theland 1 , Max Käller 1 , Erik Borgström 1 , Remi-Andre Olsen 2 , and Afshin Ahmadian 1,* . 1 .Royal Institute of Technology (KTH), School of Engineering Sciences in Chemistry, Biotechnology and Health, Department of Gene Technology, Science for Life Laboratory, SE-171 65, Solna, Sweden. 2 .Stockholm University, Department of Biochemistry and Biophysics, Science for Life Laboratory, Box 1031, 171 21 Solna, Sweden. * _Correspondence and requests for materials should be addressed to A.A. (email: [email protected]). SUMMARY The future of human genomics is one that seeks to resolve the entirety of genetic variation through sequencing. The prospect of utilizing genomics for medical purposes require cost-efficient and accurate base calling, long-range haplotyping capability, and reliable calling of structural variants. Short-read sequencing has lead the development towards such a future but has struggled to meet the latter two of these needs. To address this limitation, we developed a technology that preserves the molecular origin of short sequencing reads, with an insignificant increase to sequencing costs. We demonstrate a library preparation method which enables whole genome haplotyping, long-range phasing of single DNA molecules, and de novo genome assembly through barcode-linked reads (BLR). Millions of random barcodes are used to reconstruct megabase-scale phase blocks and call structural variants. We also highlight the versatility of our technology by generating libraries from different organisms using picograms to nanograms of input material. KEYWORDS Whole genome haplotyping, single molecule phasing, de novo assembly, barcode-linked reads, DNA phasing, BLR, droplet barcoding, whole genome sequencing, linked-read sequencing,

Transcript of Efficient whole genome haplotyping and single molecule ... › smash › get › diva2:... · to...

Page 1: Efficient whole genome haplotyping and single molecule ... › smash › get › diva2:... · to 10x barcodes (Methods), the Long Ranger pipeline yielded an N50 phase block length

1

Efficient whole genome haplotyping and single molecule phasing with barcode-linked reads

David Redin1, Tobias Frick1, Hooman Aghelpasand1, Jennifer Theland1, Max Käller1, Erik Borgström1,

Remi-Andre Olsen2, and Afshin Ahmadian1,*.

1.Royal Institute of Technology (KTH), School of Engineering Sciences in Chemistry, Biotechnology and

Health, Department of Gene Technology, Science for Life Laboratory, SE-171 65, Solna, Sweden.

2.Stockholm University, Department of Biochemistry and Biophysics, Science for Life Laboratory, Box

1031, 171 21 Solna, Sweden.

*_Correspondence and requests for materials should be addressed to A.A. (email:

[email protected]).

SUMMARY

The future of human genomics is one that seeks to resolve the entirety of genetic variation through

sequencing. The prospect of utilizing genomics for medical purposes require cost-efficient and accurate

base calling, long-range haplotyping capability, and reliable calling of structural variants. Short-read

sequencing has lead the development towards such a future but has struggled to meet the latter two of these

needs. To address this limitation, we developed a technology that preserves the molecular origin of short

sequencing reads, with an insignificant increase to sequencing costs. We demonstrate a library preparation

method which enables whole genome haplotyping, long-range phasing of single DNA molecules, and de

novo genome assembly through barcode-linked reads (BLR). Millions of random barcodes are used to

reconstruct megabase-scale phase blocks and call structural variants. We also highlight the versatility of

our technology by generating libraries from different organisms using picograms to nanograms of input

material.

KEYWORDS

Whole genome haplotyping, single molecule phasing, de novo assembly, barcode-linked reads, DNA

phasing, BLR, droplet barcoding, whole genome sequencing, linked-read sequencing,

Page 2: Efficient whole genome haplotyping and single molecule ... › smash › get › diva2:... · to 10x barcodes (Methods), the Long Ranger pipeline yielded an N50 phase block length

2

INTRODUCTION

Elucidating the true impact of genetic variation and its potential contribution to healthcare requires a

complete characterization of the human genome. While massive efforts and technological developments

have been made towards this goal, the vast majority of whole genome sequencing data produced has been

limited to profiles of unphased nucleotide variations. High-throughput short read sequencing has become

the backbone of genomics as a field, yet generating a haploid consensus rather than a haplotype-resolved

genome has limited our ability to associate genetic variation with health and disease (Sudmant et al., 2015).

Deriving phenotypes from more than just profiles of SNVs (single nucleotide variations), by investigating

structural variants, gene fusion events, and the cumulative effects of mutations across long distances is

likely to be greatly beneficial. It is estimated that more than half of human genomic variation is constituted

by structural variants (Huddleston et al., 2017; Huddleston and Eichler, 2016) in the form of deletions,

insertions, inversions, duplications and translocations, and studies have shown such events to have a larger

effect on gene expression than SNVs (Chiang et al., 2017). Identifying such variants all but equates to a

need for long-range haplotype information so differences between maternal and paternal alleles can be

resolved (Huddleston and Eichler, 2016; Collins et al., 2017). The importance of long-range phasing

information is exemplified by the characterization of compound heterozygosity being essential for

diagnosis of recessive Mendelian diseases. Furthermore, delving deeper into the genetic basis of elaborate

phenotypes have shown structural variants to be drivers of cancers and complex diseases (Feuk et al., 2006;

Moncunill et al., 2014).

Despite the promise of new-found insights for medical genomics, the adoption of whole genome

haplotyping technologies have been limited by high costs due to specialized instruments and platform-

dependent reagents. Assays based on dilution of genomic fragments in discrete compartments, combined

with compartment-specific barcoding, have been proven effective for obtaining long-range phasing

information by linking reads that share a common barcode (Zheng et al., 2016; Amini et al., 2014; Peters et

al., 2014; Lan et al., 2016). Such technologies are able to utilize the high throughput and accuracy of short

read sequencing platforms while maintaining the long-range information crucial for haplotyping. One

technology in particular (Zheng et al., 2016), has in recent years established itself as the foremost

alternative for whole genome haplotyping, but its reliance on microfluidic equipment and barcoded beads

have limited its scalability and flexibility. Furthermore, the high cost of preparing libraries for sequencing

has notably limited widespread adoption of this technology. Sequencing based on ‘contiguity preserving

transposition on beads’ (Zhang et al., 2017), offers an alternative solution for genome-wide haplotyping

with a single tube reaction setup. Performing tagmentation of genomic fragments on uniquely barcoded

beads, rather than in discrete compartments, opens up the potential for automated library preparation in the

future. However, an evident bottleneck of this technology is the laborious generation of a transposase-

Page 3: Efficient whole genome haplotyping and single molecule ... › smash › get › diva2:... · to 10x barcodes (Methods), the Long Ranger pipeline yielded an N50 phase block length

3

linked bead library with barcodes of sufficient complexity to resolve a human genome. Furthermore, a

product that enables phasing of DNA molecules has not yet been made commercially available. Alternative

platforms based on long read sequencing of single molecules (Clarke et al., 2009; Eid et al., 2009) provide

long-range phasing information, albeit with a lower sequencing accuracy and throughput than short read

sequencing, rendering them unfit for the scale required to haplotype human-sized genomes (Laver et al.,

2015; Quail et al., 2012; Loman et al., 2015). Ultimately, these platforms have mostly been used to provide

longer scaffolding information for genome assembly, whilst still relying on short read sequencing for

coverage (Koren et al., 2012; Pendleton et al., 2015).

The key to making an assay affordable is minimizing the dependency on proprietary technologies in terms

of instruments and reagent kits. A number of droplet-based barcoding strategies have been developed to

enable high-throughput analysis of single molecules (Lan et al., 2016) or single cells (Zheng et al., 2017;

Klein et al., 2015; Macosko et al., 2015), all utilizing microfluidic devices for droplet generation and/or

uniquely barcoded beads to distinguish the contents in each droplet. In-house manufacturing of

microfluidic systems has been the solution for many academic groups to avoid commercial devices for

droplet generation which are expensive and typically constricted to particular assays. Although the

hardware components for microfluidic systems are easy to obtain, the manufacturing of single-use

microfluidic chips requires expertise and equipment that goes beyond what is available in most laboratories

(Lan et al., 2016). Likewise, the production of barcoded beads is not a trivial matter as highly complex

libraries of unique barcodes are required to perform high-throughput analyses. The process of barcoding

beads can for instance be done by clonal amplification in droplets (Borgström et al., 2015), by

combinatorial extension cycling (Klein et al., 2015) or by split-pool cycles of phosphoramidite synthesis

(Macosko et al., 2015); but regardless of strategy it remains a costly and laborious pre-requisite for the

intended assay reaction.

Here we describe a low-cost assay for whole genome haplotyping and single molecule phasing, based on

previous work of barcoding of long DNA fragments in emulsion droplets formed by simple shaking (Redin

et al., 2017). It does not require microfluidic devices or complex libraries of barcoded beads, making it

more scalable and affordable than alternative methods. Being free from these requirements also means

libraries can be prepared in any laboratory setting, using readily available reagents at a cost of $19 per

library and without investments in platform-specific laboratory equipment. The number of genomic

fragments per droplet can be tuned according to sample input, from picograms to nanograms of DNA input,

enabling haplotyping of whole human genomes, or smaller genomes with single molecule resolution. In

this study, we present a haplotype-resolved human genome and investigate use of barcode-linked reads for

the reference-free assembly of human and bacterial genomes.

Page 4: Efficient whole genome haplotyping and single molecule ... › smash › get › diva2:... · to 10x barcodes (Methods), the Long Ranger pipeline yielded an N50 phase block length

4

First, a tagmentation reaction introduces a universal DNA

sequence at arbitrary yet evenly distributed positions throughout

the genome. Bead-linked transposases preserve the proximity of

tagmented constituents from long DNA fragments, and by

linking the template molecule(s) from each bead to a mutually

exclusive barcode it enables the information of proximity to be

conserved through DNA sequencing (Figure 1). This is achieved

by separating the beads into millions of discrete compartments

(emulsion droplets) together with single copies of a barcoding

oligonucleotide. The barcoding oligonucleotide features a semi-

randomized sequence with an unrestricted complexity, ensuring

the barcode present in each compartment is unique. Within each

droplet, PCR amplification is used to first generate clonal

populations of single stranded barcoding oligos, and then to

couple the barcode to template molecules (Figure S1). As a

consequence of limited dilution, droplets without either the

barcoding or template molecules will be formed, but neither will

yield coupled amplicons. Such products are removed from the

library through a target-specific enrichment following breakage

of the emulsion reaction. Libraries consisting of barcode-linked

molecules are then sequenced using the standard Illumina short

read platform where reads are grouped according to the barcode

to reconstruct long-range haplotype information of the original

fragment(s).

Figure 1. Overview of the BLR technology. High molecular weight DNA fragments are diluted and tagmented with bead-linked transposases. DNA-loaded beads are put into emulsion droplets with barcoding oligonucleotides and primers for amplification, and the constituents of each original molecule is coupled to a unique barcode sequence through emulsion PCR. Following the removal of uncoupled molecules, the library undergoes standard short read sequencing and subsequent grouping of reads according to the barcode sequence. The resulting barcode-linked reads are utilized for long-range DNA phasing, genome-wide haplotyping or de novo genome assembly.

Page 5: Efficient whole genome haplotyping and single molecule ... › smash › get › diva2:... · to 10x barcodes (Methods), the Long Ranger pipeline yielded an N50 phase block length

5

RESULTS

We generated barcode-linked read libraries from various sources and amounts of input DNA to validate the

technology and showcase its flexibility. First, to highlight the aspect of single molecule phasing, a library

was prepared with 25 pg of input DNA from E. coli strain BL21 and sequenced to a depth of 190X.

Processing of 17,247,153 sequencing reads yielded 358,792 barcoded read groups and an N50 molecule

length of 42,780 bp. A reference-free assembly of these reads initially yielded 190 contigs, which was

reduced to 8 scaffolds when utilizing the barcode connections for scaffolding; with a single scaffold

covering 84.2% of the genome (or 97.8% with two scaffolds). In contrast, assembling with conventional

scaffolding, without taking the barcode links into account, yielded 131 scaffolds (Figure 2a). Moreover,

the NG50 value was 115,062 bp for the short-read assembly compared to 3,812,784 bp for the barcode-

linked assembly.

Next, a library from the human ‘genome in a bottle’ (GIAB) reference individual GM24385 was generated

to enable benchmarking of results against external haplotyping technologies and sequencing data. A total of

451M sequencing read pairs were processed (Table S1), leaving 321M read pairs assigned to 2,204,497

barcoded read groups that were used as input for phasing analysis. With our barcode sequences translated

to 10x barcodes (Methods), the Long Ranger pipeline yielded an N50 phase block length of 1,832,815 bp.

The haplotype was resolved for 3,620,251 SNVs (97.9% of identified SNVs) with a mean sequencing depth

of 19.1X and 0.7% of the reference genome not covered. Figure 2b shows that we obtain an even coverage

across the whole genome and for both haplotypes, with minor discrepancies observed in some

heterochromatic regions of chromosomes. The software calculated that 74.8% and 18.6% of input

molecules were over 20 Kb and 100 Kb, respectively. Variant calling of constructed phase blocks yielded

29 large structural variant (LSV) calls (Figure 2b) and 4,008 short deletion calls. Based on sequenced

bases and mapping coordinates, we estimate the coupling efficiency of our assay to be 74.4%

(Supplementary Note). For the purpose of benchmarking, our data was compared to a GIAB resource

dataset based on the 10x Genomics Chromium Genome platform (Zook et al., 2016), see Table S2 for a

corresponding overview of output metrics from the Long Ranger analysis pipeline.

Comparing SNVs to an external ‘ground truth’ dataset (Zook et al., 2016), based on standard Illumina

sequencing to a depth of ~300X, the sensitivity of our assay was estimated to be 76.1%, with an accuracy

rate of 95.4% (Table S3). In context, the GIAB (10x Genomics) resource dataset with significantly higher

sequencing depth, featured a sensitivity of 85.1% and accuracy of 90.7%. As an example, our data shows

all HLA family genes have their haplotypes resolved. Out of 29 large structural variant calls, 28 (96.6%)

were validated through manual review of correspondence against the GIAB (10x Genomics) dataset (Table

S4). Called structural variants consist of inversions, deletions and duplication events across the genome,

varying from 40 kb to 1.2 Mb, three of which are visualized in Figure 2c and Figure 2d.

Page 6: Efficient whole genome haplotyping and single molecule ... › smash › get › diva2:... · to 10x barcodes (Methods), the Long Ranger pipeline yielded an N50 phase block length

6

Figure 2. Whole genome haplotyping and de novo assembly results. (a) De novo assembly of the E. coli

str. BL21 genome. From center, contigs assembled using short reads (grey), standard scaffolding

performed using paired end reads (light indigo), scaffolding performed with barcode-linked reads (indigo),

and the relative sequence coverage across the genome (light grey). (b) Sequence data of haplotype-

resolved human genome, GM24385 (19X). From center, phased SNV density and relative read coverage

for haplotype 1 (red), phased SNV density and relative read coverage for haplotype 2 (orange), total read

coverage (light grey) on a scale from 0 to 25X. The localization of large structural variants is visualized by

bands in grey (not drawn to scale). (c) Heatmap of barcode overlap for reads spanning called structural

variants with a window size of 250 kb, an 86.0 kb inversion in chromosome 12 (left) and a 40.8 kb

heterozygous deletion in chromosome 2 (right). Relative read coverage collapsed for the two haplotypes

shown in grey, x and y axis are identical. (d) Barcode-linked reads of a heterozygous deletion identified in

chromosome 4, with reads assigned to either haplotype, and where the reads on each line share a mutually

exclusive barcode. Reads in the top haplotype (shown in orange) are linked across the deletion spanning

49.5 Kb.

Page 7: Efficient whole genome haplotyping and single molecule ... › smash › get › diva2:... · to 10x barcodes (Methods), the Long Ranger pipeline yielded an N50 phase block length

7

Complementing the GM24385 library with an additional library and sequencing increased the coverage to

34.7X, enabling the evaluation of a reference-free assembly of the combined dataset. Running the human

de novo pipeline generated an assembly with a total length of 3.2 Gb, with scaffolds covering 85.4% of the

GRCh38 reference genome (heterochromatic regions not excluded, Figure S2). Scaffolding based on

barcode-linked reads increased the assembly N50 from 24.4 kb to 5.60 Mb, whereof the longest contig

spanned 47.9 Mb. The benefit of a higher sequencing depth was also evaluated for the Long Ranger

pipeline, which with an input of 540M sequencing read pairs resulted in 98.8% of SNVs phased and phase

blocks up to 11.9 megabases (N50 phase block length of 2,812,019 bp). Variant calling was in agreement

with the dataset of lower sequencing depth, resulting in 30 LSVs and 4,047 short deletion calls (Table S2).

DISCUSSION

Geneticists are recognizing that short read sequencing by itself is not sufficient to resolve the connection

between genetic variation and common aspects of health and disease (Chaisson et al., 2017). Expanding on

the capability of short read sequencing platforms to generate vast amounts of high quality data, by

resolving haplotypes and calling structural variants, presents a pivotal change in genomics that enables

more accurate reconstruction of genomes (Church et al., 2015; Schneider et al., 2017). It is clear that

preserving the contiguity of short sequences is currently the most affordable strategy for obtaining

haplotyping information on a human genome-wide scale. We have described a novel library preparation

method for barcode-linked reads that enables whole genome haplotyping for $19 per library (Table S5),

substantially less than the most established alternative (Zheng et al., 2016). Furthermore, as the assay does

not require an investment in a platform-specific instrument it can be implemented in any laboratory with a

thermal cycler, in a single day (Figure S3).

Our results show that we can phase up to 99% of called SNVs in the human genome with an accuracy of

~95% and generate phase blocks with an N50 of 2.8 Mb. Analysis of data from a single lane of sequencing

identified 29 large structural variants in the human genome, of which 28 could be independently verified

(96.6%). Overall, the data shows that our assay is comparable to that of alternative technologies, in terms

of generating long-range haplotyping information, phasing SNVs (Table S3) and calling LSVs (Table S4)

with high accuracy. To showcase the aspect of single molecule phasing and de novo sequencing, we also

applied our method to a sample of bacterial DNA using only 25 pg of input material. The results

demonstrate a drastic reduction in assembly contigs and improvement in NG50 values when barcode-linked

reads are utilized compared to standard short read sequencing. To highlight the potential for de novo

assembly of complex genomes, we were able to assemble the human genome with a modest sequencing

depth of 35X using barcode-linked reads. These results showcase the power of the BLR technology for

such applications.

Page 8: Efficient whole genome haplotyping and single molecule ... › smash › get › diva2:... · to 10x barcodes (Methods), the Long Ranger pipeline yielded an N50 phase block length

8

The presented method offers laboratories all over the world the benefit of adding long-range phasing

information to short read sequencing, through a simple protocol independent of specialized equipment and

expensive reagents. Generating millions of discreet reaction chambers by shaking presents a convenient

and scalable approach suitable for a wide range of applications. We recognize that droplets formed by

shaking results in a non-uniform, albeit controllable (Redin et al., 2017), size distribution compared to

microfluidic chips. Though if microfluidic devices are readily available the proposed chemistry would be

fully compatible. Regardless, the results in this study show that microfluidic devices are not necessary for

producing high quality phased data. As the field seeks to draw upon the advantages of long-range

haplotyping information, standard practices for extracting DNA will need to shift towards more meticulous

protocols aimed at maintaining the integrity of large genomic fragments for phasing. Unlike long read

sequencing platforms, an advantage of our assay is that there is no inherent bias in the length of fragments

that can be phased.

We present a flexible and scalable solution for whole genome haplotyping and de novo sequencing that can

be used for DNA inputs ranging from picograms to nanograms. Consequently, it can be tailored according

to the size or complexity of the genome and the resolution to which clinical or biological studies would

require – e.g. mammals, insects or plant genomics. An intriguing application of this technology would be

reference-free assembly of complex metagenome samples, where high-throughput phasing of long single

molecules is of particular interest. Furthermore, the low input requirement means haplotype-resolved

genomes from single cells are feasible as a future prospect. For the purpose of expanding our understanding

of population diversity and individual variance, the next frontier for large-scale genomics ought to be de

novo and haplotype-resolved genome analyses (Aleman, 2017). The rise of long read sequencing and

linked-read platforms show that more and more researchers are realizing the benefit of long-range phasing

information. The method proposed within offers a unique opportunity for researchers to tackle the hurdles

of de novo sequencing and genome-wide haplotyping without being limited by a lack of resources.

Combined with the continued reduction in short read sequencing costs, the need for an affordable library

preparation that maximizes the yield of medically relevant information is evident. In the near future, there

will simply be no room for library preparation assays that cost hundreds of dollars per sample when the

cost of sequencing the human genome will be reduced to a fraction of that.

Page 9: Efficient whole genome haplotyping and single molecule ... › smash › get › diva2:... · to 10x barcodes (Methods), the Long Ranger pipeline yielded an N50 phase block length

9

MATERIALS AND METHODS

Sample preparation. High molecular weight gDNA was extracted from freshly cultivated cells of E. coli strain BL21 according to the MagAttract HMW DNA Kit (Qiagen). Following extraction, the DNA was quantified by Qubit 3.0 (Life Technologies) and diluted in Elution Buffer (Qiagen). A sample of human gDNA from GIAB (Genome in a Bottle) individual GM24385 was obtained from the 10x Chromium Genome Kit (Control gDNA). On-bead tagmentation of DNA was performed using Nextera DNA Flex Library Prep (Illumina) according to the manufacturer's reference guide for tagmentation; except that each reagent was scaled down to 15% of the specified volumes to reduce the amount of BLT (bead-linked transposases) used per reaction. The input quantity of DNA added to the on-bead tagmentation reactions varied according to the size of the genome assayed, 1 ng gDNA was used for human samples and 25 pg was used for the E. coli sample. Rather than amplification of tagmented DNA, the polymerase amplification was prepared without the addition of Nextera Flex indexes, and beads were subjected to an incubation at 68 degrees for 10 min instead of the specified PCR cycling protocol. The beads were subsequently washed twice with the supplied TWB reagent and then resuspended in 5 ul Elution Buffer (Qiagen) prior to emulsification.

Reaction emulsification and retrieval. Assay reactions consist of 50 µl PCR reagents that are mixed and added on top of emulsion oil before shaking for emulsification. The PCR volumes consist of 5 ul beads with tagmented DNA (see section above), 1x Phusion Flex Master Mix (New England Biolabs), 1 M Betaine (Sigma Aldrich), 3%vol DMSO (Thermo Scientific), 2%wt PEG-6000 (Sigma Aldrich), 2%vol Tween-20 (Sigma Aldrich), 400 nM Enrichment Oligo, 80 nM Coupling Oligo and 330 fM Barcoding Oligo (see Table S6 for oligonucleotide sequences; purchased from Integrated DNA Technologies). Emulsification is carried out by pipetting the 50 ul PCR volume on top of 100 µl HFE-7500 oil with 5%(w/V) 008-Fluorosurfactant (Ran Biotechnologies) in a Qubit tube (Life Technologies) and shaking at 14.0 Hz for 8 min using a Tissuelyser instrument (Qiagen). Emulsion reactions were then left to stand upright for 15 min to settle, the excess oil was removed from the bottom and the remaining emulsion phase was transferred to a PCR tube with 60 µl FC-40 oil with 5%(w/V) 008-Fluorosurfactant (Ran Biotechnologies) already added to it. 85 µl of mineral oil (Sigma Aldrich) was then added on top of the emulsion reactions, before placing the reaction tubes in a Mastercycler Pro S (Eppendorf) instrument for reaction cycling with the following protocol: 5 min at 95°C, 30 cycles of [95°C for 30 s - 55°C for 30 s - 72°C for 30 s], followed by 8 cycles of [95°C for 1 min - 40°C for 2 min - 72°C for 5 min, with 3% temperature ramp speed] and ending the protocol with 10 min at 72°C and holding at 12°C. Following emulsion PCR, the mineral oil was discarded with a pipette and 4 µl EDTA (100 mM) was added. The entire emulsion reaction and excess emulsion oil was transferred to a 0.5 ml tube (Eppendorf) and 100 µl 1H,1H,2H,2H-Perfluoro-1-octanol (Sigma Aldrich) was added before vortexing at maximum speed. After centrifugation for 1 min at 20,000 g, the aqueous phase was collected from the top and a magnetic rack was used to discard the beads.

Page 10: Efficient whole genome haplotyping and single molecule ... › smash › get › diva2:... · to 10x barcodes (Methods), the Long Ranger pipeline yielded an N50 phase block length

10

Sample enrichment and sequencing library preparation. Following retrieval of aqueous phases from

emulsion reactions the library preparation was continued by a bead-based purification to remove short and

uncoupled barcoding amplicons below 200 bp using sample purification beads included in the Nextera Flex

kit. Biotinylated and barcode-coupled molecules were then enriched for by washing and incubating the

sample with 20 ul DynaBeads MyOne Streptavidin T1 beads (Life Technologies) in B&W buffer (1 M

NaCl, 5 mM Tris-HCl, 500 µM EDTA) for 30 min under rotation at room temperature. The supernatant

was then discarded and the beads washed twice with Elution Buffer, four times with NaOH (0.125 N), and

finally two more times with Elution Buffer. An indexing PCR was then performed on the washed

enrichment beads in 1x Phusion Flex Master Mix with 400 nM Indexing Oligo; using a protocol starting

with 2 min at 95°C, 5 min at 55°C (with 10% ramp speed), 10 min at 72°C (with 3% ramp speed), and 1

min at 95°C. At this point, the PCR reaction was paused and placed on a magnetic rack (heated to 80°C),

and the supernatant was transferred to a fresh PCR tube. The i5 Adapter Oligo (Table S6) was then added

to a final concentration of 400 nM to the reaction and the PCR indexing protocol was continued by running

4 cycles of [95°C for 30 s - 55°C for 30 s - 72°C for 1 min], followed by 2 min at 72°C. Reactions were

then purified and samples were quantified by Qubit (Life Technologies). Human samples were diluted to 2

nM and sequenced using the HiSeq X platform (Illumina) with 150 bp paired-end sequencing and an 8 bp

single-end index. The bacterial sample was sequenced using the MiSeq platform (Illumina) with a v2 kit

for 150 bp paired-end sequencing.

Data analysis overview. The analysis pipeline combines custom written python scripts, commonly used

bioinformatics tools and three pipelines for sequencing reads linked by barcodes, used depending on the

sample type and purpose as detailed below and outlined in Figure S4. Samtools (v1.7) and pysam (v0.14)

were used extensively in in-house developed scripts (Li et al., 2009). All parts of the pipeline are available

through GitHub (https://github.com/FrickTobias/BLR), and all sequencing data is available on the

Sequence Read Archive (SRA) accession PRJNA479411.

Read trimming and barcode deconvolution. For all samples, sequencing reads were initially trimmed

with cutadapt (v1.16) (Martin, 2011) using a similarity threshold of 20%, to remove universal handle

sequences upstream of the barcode sequence, downstream of the barcode sequence, and downstream of

genomic inserts (when applicable). Reads not following the structure and sequence of the universal handles

were omitted from downstream analysis steps (Table S1). The barcode sequence, matching a predefined

design (Table S6), were extracted from reads and divided in 8 subsets based on the first two bases, using

bc_extract.py. These files were then individually clustered using CD-HIT-454 (v4.6) (Fu et al., 2012) with

k-mer cutoff of 0.9, to deconvolute the barcoded read groups.

Bacterial de novo assembly. Bacterial samples were first required to have their barcodes grouped by

making an initial assembly with IDBA-UD (Peng et al., 2012) followed by mapping original reads to it

Page 11: Efficient whole genome haplotyping and single molecule ... › smash › get › diva2:... · to 10x barcodes (Methods), the Long Ranger pipeline yielded an N50 phase block length

11

with BWA (Li and Durbin, 2009). Mappings were put into the barcode-aware Athena (Bishara et al., 2017)

to yield an assembly. This assembly was preprocessed with ARCS (v1.0.1) (Yeo et al., 2018) to enable

scaffolding of linked-read data with LINKS (v1.8.5) (Warren et al., 2015), using a head tail region of 5 kb

for the final assembly. The conventional short-read assembly was instead generated by scaffolding the

initial non-linked assembly using SPACE (v2.2) (Boetzer et al., 2011). Bacterial assembly metrics were

calculated using metaQUAST (v5.0.0) (Mikheenko et al., 2016).

Human genome haplotyping and de novo assembly. Trimmed reads were first mapped using Bowtie2

(v2.3.4.1) (Langmead et al., 2012) with GRCh38 as reference, thereafter read duplicates with the same

barcode were called and removed using picard tools (v2.5.0) (http://broadinstitute.github.io/picard/). To

identify and collapse barcode sequences originating from the same droplet, barcode-linked reads sharing at

least two proximal (<100 Kb) read pairs were merged using cluster_rmdup.py. To exclude potentially

erroneous phasing information from abnormally large droplets, the barcode sequence was stripped from all

reads originating from droplets with >260 molecules using filter_clusters.py. For this purpose, molecules

were defined as barcode-linked reads whereby each read mapped no further than 30 kb from its closest

neighbor. Reads along with corresponding barcodes were then converted in accordance to the input of the

10x Genomics analysis pipelines (www.10xgenomics.com) with wfa2tenx.py. To enable use of these

pipelines, which feature a limit of 4.8M barcodes, the complexity of our barcode population was reduced

by stripping the barcode information from all barcoded read groups with less than 4 read pairs. For whole

genome haplotyping analysis, the Long Ranger analysis pipeline (v2.2.2) was run with GATK (v3.8) for

SNV calling and a 10x GRCh38 blacklist to reduce false positive variant calls. For the de novo assembly,

the Supernova (Weisenfeld et al., 2017) pipeline was utilized, with the option of no-preflight to allow for

shorter insert sizes than the typical Chromium data.

Data validation and visualization. For de novo assemblies, the final scaffolds zero coverage and

mismatch percentages were calculated based on metrics supplied from metaQUAST (Mikheenko et al.,

2016) for the bacterial assembly and QUAST (4.6.4) (Gurevich et al., 2013) for the human genome

assembly respectively. Human genome scaffolds were aligned using Minimap2 (v2.4) (Li, 2018). SNV

calls were evaluated by comparing output variant files from the Long Ranger analysis to variant calls from

a reference dataset (300X Illumina sequencing) (Zook et al., 2016) using VCFtools (Danecek et al., 2011)

and large structural variation calls (>30kb) were compared manually (Table S4, Figure S5, Figure S6) to

variants identified in the resource dataset GIAB (10x Genomics) (Zook et al., 2016). The GIAB (10x

Genomics) resource dataset was downloaded and run through Long Ranger with the same version and

parameters as previously described. In addition to previously mentioned software, MultiQC (v1.6) (Ewels

et al., 2016), bedtools (v2.27.1) (Quinlan and Hall, 2010), Circos (v0.69.6) (Krzyminski et al., 2009) and

dotPlotly (https://github.com/tpoorten/dotPlotly) were used for calculations and data visualization.

Page 12: Efficient whole genome haplotyping and single molecule ... › smash › get › diva2:... · to 10x barcodes (Methods), the Long Ranger pipeline yielded an N50 phase block length

12

ACKNOWLEDGEMENTS

The authors would like to thank Christian Pou for assistance in cultivating the bacterial strain, and the

National Genomics Infrastructure (NGI) and UPPMAX for providing sequencing and computational

support and infrastructure. This work was funded by the Erling Persson Family Foundation, and Olle

Engkvist Foundation [2015/347].

AUTHOR CONTRIBUTIONS

A.A. and D.R. conceived the technology. D.R., T.F. and A.A. designed the experiments. T.F. developed the

analysis pipeline, with support from E.B., R.O and M.K.. D.R. performed all experiments, with support

from H.A.. J.T. produced the bacterial de novo assembly. R.O. produced the human de novo assembly. All

authors contributed in writing the manuscript.

COMPETING INTERESTS

The authors declare no competing interests.

Page 13: Efficient whole genome haplotyping and single molecule ... › smash › get › diva2:... · to 10x barcodes (Methods), the Long Ranger pipeline yielded an N50 phase block length

13

REFERENCES

Aleman, F. (2017). The necessity of diploid genome sequencing to unravel the genetic component of complex phenotypes. Front. Genet. 8, 148.

Amini, S., Pushkarev, D., Christiansen, L., Kostem, E., Royce, T., Turk, C., Pignatelli, N., Adey, A., Kitzman, J.O., Vijayan, K., et al. (2014). Haplotype-resolved whole-genome sequencing by contiguity-preserving transposition and combinatorial indexing. Nat. Genet. 46, 1343-1349.

Boetzer, M., Henkel, C. V., Jansen, H.J., Butler, D., and Pirovano, W. (2011). Scaffolding pre-assembled contigs using SSPACE. Bioinformatics 27, 578-579.

Borgström, E., Redin, D., Lundin, S., Berglund, E., Andersson, A.F., and Ahmadian, A. (2015). Phasing of single DNA molecules by massively parallel barcoding. Nat. Commun. 6

Chaisson, M.J.P., Sanders, A.D., Zhao, X., Malhotra, A., Porubsky, D., Rausch, T., Gardner, E.J., Rodriguez, O., Guo, L., Collins, R.L., et al. (2017). Multi-platform discovery of haplotype-resolved structural variation in human genomes. BioRxiv 193144. doi:10.1101/193144

Chiang, C., Scott, A.J., Davis, J.R., Tsang, E.K., Li, X., Kim, Y., Hadzic, T., Damani, F.N., Ganel, L., Montgomery, S.B., et al. (2017). The impact of structural variation on human gene expression. Nat. Genet. 49, 692-699.

Church, D.M., Schneider, V.A., Steinberg, K.M., Schatz, M.C., Quinlan, A.R., Chin, C.S., Kitts, P.A., Aken, B., Marth, G.T., Hoffman, M.M., et al. (2015). Extending reference assembly models. Genome Biol. 16.

Clarke, J., Wu, H.C., Jayasinghe, L., Patel, A., Reid, S., and Bayley, H. (2009). Continuous base identification for single-molecule nanopore DNA sequencing. Nat. Nanotechnol. 4, 265-270.

Collins, R.L., Brand, H., Redin, C.E., Hanscom, C., Antolik, C., Stone, M.R., Glessner, J.T., Mason, T., Pregno, G., Dorrani, N., et al. (2017). Defining the diverse spectrum of inversions, complex structural variation, and chromothripsis in the morbid human genome. Genome Biol. 18.

Danecek, P., Auton, A., Abecasis, G., Albers, C. a, Banks, E., DePristo, M. a, Handsaker, R.E., Lunter, G., Marth, G.T., Sherry, S.T., et al. (2011). The variant call format and VCF tools. Bioinformatics 27, 2156-2158.

Eid, J., Fehr, A., Gray, J., Luong, K., Lyle, J., Otto, G., Peluso, P., Rank, D., Baybayan, P., Bettman, B., et al. (2009). Real-time DNA sequencing from single polymerase molecules. Science 323, 133-138.

Ewels, P., Magnusson, M., Lundin, S., and Käller, M. (2016). MultiQC: Summarize analysis results for multiple tools and samples in a single report. Bioinformatics 32, 3047-3048.

Feuk, L., Carson, A.R., and Scherer, S.W. (2006). Structural variation in the human genome. Nat. Rev. Genet. 7, 85-97.

Fu, L., Niu, B., Zhu, Z., Wu, S., and Li, W. (2012). CD-HIT: Accelerated for clustering the next-generation sequencing data. Bioinformatics 28, 3150-3152.

Gurevich, A., Saveliev, V., Vyahhi, N., and Tesler, G. (2013). QUAST: quality assessment tool for genome assemblies. Bioinformatics 29, 1072-1075.

Huddleston, J., Chaisson, M.J.P., Steinberg, K.M., Warren, W., Hoekzema, K., Gordon, D., Graves-Lindsay, T.A., Munson, K.M., Kronenberg, Z.N., Vives, L., et al. (2017). Discovery and genotyping of structural variation from long-read haploid genome sequence data. Genome Res. 27, 677-685.

Page 14: Efficient whole genome haplotyping and single molecule ... › smash › get › diva2:... · to 10x barcodes (Methods), the Long Ranger pipeline yielded an N50 phase block length

14

Huddleston, J., and Eichler, E.E. (2016). An incomplete understanding of human genetic variation. Genetics 202, 1251–1254.

Klein, A.M., Mazutis, L., Akartuna, I., Tallapragada, N., Veres, A., Li, V., Peshkin, L., Weitz, D.A., and Kirschner, M.W. (2015). Droplet barcoding for single-cell transcriptomics applied to embryonic stem cells. Cell 161, 1187-1201.

Koren, S., Schatz, M.C., Walenz, B.P., Martin, J., Howard, J.T., Ganapathy, G., Wang, Z., Rasko, D.A., McCombie, W.R., Jarvis, E.D., et al. (2012). Hybrid error correction and de novo assembly of single-molecule sequencing reads. Nat. Biotechnol. 30, 693-700.

Krzywinski, M., Schein, J., Birol, I., Connors, J., Gascoyne, R., Horsman, D., Jones, S.J., and Marra, M.A. (2009). Circos: An information aesthetic for comparative genomics. Genome Res. 19, 1639-1645.

Lan, F., Haliburton, J.R., Yuan, A., and Abate, A.R. (2016). Droplet barcoding for massively parallel single-molecule deep sequencing. Nat. Commun. 7.

Langmead, B., and Salzberg, S.L. (2012). Fast gapped-read alignment with Bowtie 2. Nat. Methods. 9, 357-359.

Laver, T., Harrison, J., O’Neill, P.A., Moore, K., Farbos, A., Paszkiewicz, K., and Studholme, D.J. (2015). Assessing the performance of the Oxford Nanopore Technologies MinION. Biomol. Detect. Quantif. 3, 1-8.

Li, H., and Durbin, R. (2009). Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics 25, 1754-1760.

Li, H., Handsaker, B., Wysoker, A., Fennell, T., Ruan, J., Homer, N., Marth, G., Abecasis, G., Durbin, R., Data, G.P., et al. (2009). The Sequence Alignment / Map format and SAMtools. Bioinformatics 25, 2078-2079.

Li, H. (2017). Minimap2: versatile pairwise alignment for nucleotide sequences. Bioinformatics bty191. doi:10.1093/bioinformatics/bty191

Loman, N.J., Quick, J., and Simpson, J.T. (2015). A complete bacterial genome assembled de novo using only nanopore sequencing data. Nat. Methods. 12, 733-735.

Macosko, E.Z., Basu, A., Satija, R., Nemesh, J., Shekhar, K., Goldman, M., Tirosh, I., Bialas, A.R., Kamitaki, N., Martersteck, E.M., et al. (2015). Highly parallel genome-wide expression profiling of individual cells using nanoliter droplets. Cell 161, 1202–1214.

Martin, M. (2011). Cutadapt removes adapter sequences from high-throughput sequencing reads. EMBnet.Journal 17, 10-12.

Mikheenko, A., Saveliev, V., and Gurevich, A. (2016). MetaQUAST: Evaluation of metagenome assemblies. Bioinformatics 32, 1088-1090.

Moncunill, V., Gonzalez, S., Beà, S., Andrieux, L.O., Salaverria, I., Royo, C., Martinez, L., Puiggròs, M., Segura-Wang, M., Stütz, A.M., et al. (2014). Comprehensive characterization of complex structural variations in cancer by directly comparing genome sequence reads. Nat. Biotechnol. 32, 1106-1112.

Moss, E., Bishara, A., Tkachenko, E., Kang, J.B., Andermann, T.M., Wood, C., Handy, C., Ji, H., Batzoglou, S., and Bhatt, A.S. (2017). De novo assembly of microbial genomes from human gut metagenomes using barcoded short read sequences. BioRxiv 125211. doi:10.1101/125211

Page 15: Efficient whole genome haplotyping and single molecule ... › smash › get › diva2:... · to 10x barcodes (Methods), the Long Ranger pipeline yielded an N50 phase block length

15

Pendleton, M., Sebra, R., Pang, A.W.C., Ummat, A., Franzen, O., Rausch, T., Stütz, A.M., Stedman, W., Anantharaman, T., Hastie, A., et al. (2015). Assembly and diploid architecture of an individual human genome via single-molecule technologies. Nat. Methods. 12, 780-786.

Peng, Y., Leung, H.C.M., Yiu, S.M., and Chin, F.Y.L. (2012). IDBA-UD: A de novo assembler for single-cell and metagenomic sequencing data with highly uneven depth. Bioinformatics 28, 1420-1428.

Peters, B.A., Kermani, B.G., Sparks, A.B., Alferov, O., Hong, P., Alexeev, A., Jiang, Y., Dahl, F., Tang, Y.T., Haas, J., et al. (2012). Accurate whole-genome sequencing and haplotyping from 10 to 20 human cells. Nature 487, 190-195.

Quail, M.A., Smith, M., Coupland, P., Otto, T.D., Harris, S.R., Connor, T.R., Bertoni, A., Swerdlow, H.P., and Gu, Y. (2012). A tale of three next generation sequencing platforms: comparison of Ion Torrent, Pacific Biosciences and Illumina MiSeq sequencers. BMC Genomics 13.

Quinlan, A.R., and Hall, I.M. (2010). BEDTools: A flexible suite of utilities for comparing genomic features. Bioinformatics 26, 841–842.

Redin, D., Borgström, E., He, M., Aghelpasand, H., Käller, M., and Ahmadian, A. (2017). Droplet Barcode Sequencing for targeted linked-read haplotyping of single DNA molecules. Nucleic Acids Res. 45.

Schneider, V.A., Graves-Lindsay, T., Howe, K., Bouk, N., Chen, H.C., Kitts, P.A., Murphy, T.D., Pruitt, K.D., Thibaud-Nissen, F., Albracht, D., et al. (2017). Evaluation of GRCh38 and de novo haploid genome assemblies demonstrates the enduring quality of the reference assembly. Genome Res. 27, 849-864.

Sudmant, P.H., Rausch, T., Gardner, E.J., Handsaker, R.E., Abyzov, A., Huddleston, J., Zhang, Y., Ye, K., Jun, G., Fritz, M.H.Y., et al. (2015). An integrated map of structural variation in 2,504 human genomes. Nature 526, 75-81.

Warren, R.L., Yang, C., Vandervalk, B.P., Behsaz, B., Lagman, A., Jones, S.J.M., and Birol, I. (2015). LINKS: Scalable, alignment-free scaffolding of draft genomes with long reads. Gigascience 4.

Weisenfeld, N.I., Kumar, V., Shah, P., Church, D.M., and Jaffe, D.B. (2017). Direct determination of diploid genome sequences. Genome Res. 27, 757-767.

Yeo, S., Coombe, L., Warren, R.L., Chu, J., and Birol, I. (2018). ARCS: Scaffolding genome drafts with linked reads. Bioinformatics 34, 725-731.

Zhang, F., Christiansen, L., Thomas, J., Pokholok, D., Jackson, R., Morrell, N., Zhao, Y., Wiley, M., Welch, E., Jaeger, E., et al. (2017). Haplotype phasing of whole human genomes using bead-based barcode partitioning in a single tube. Nat. Biotechnol. 35, 852-857.

Zheng, G.X.Y., Lau, B.T., Schnall-Levin, M., Jarosz, M., Bell, J.M., Hindson, C.M., Kyriazopoulou-Panagiotopoulou, S., Masquelier, D.A., Merrill, L., Terry, J.M., et al. (2016). Haplotyping germline and cancer genomes with high-throughput linked-read sequencing. Nat. Biotechnol. 34, 303-311.

Zheng, G.X.Y., Terry, J.M., Belgrader, P., Ryvkin, P., Bent, Z.W., Wilson, R., Ziraldo, S.B., Wheeler, T.D., McDermott, G.P., Zhu, J., et al. (2017). Massively parallel digital transcriptional profiling of single cells. Nat. Commun. 8.

Zook, J.M., Catoe, D., McDaniel, J., Vang, L., Spies, N., Sidow, A., Weng, Z., Liu, Y., Mason, C.E., Alexander, N., et al. (2016). Extensive sequencing of seven human genomes to characterize benchmark reference materials. Sci. Data. 3.

Page 16: Efficient whole genome haplotyping and single molecule ... › smash › get › diva2:... · to 10x barcodes (Methods), the Long Ranger pipeline yielded an N50 phase block length

16

Figure S1. Detailed method overview illustrating bead-bound template preparation (a) and subsequent assay

protocol (s). For clarity, only the contents of a single droplet and the manipulation of corresponding product

molecules are shown in the figure. The protocol for (a) follows User Guidelines for the Nextera DNA Flex

Library Preparation kit, with the exception that an extension is performed without added oligonucleotides

instead of the specified PCR cycling protocol. The process of tagmentation introduces a pair of universal

sequences (shown in red and blue) at multiple random positions over the length of long DNA fragments. (b)

Bead-bound template molecules and assay oligonucleotides (Table S6) are incorporated into emulsion droplets

by shaking. Inside the droplets, clonal amplification of the barcoding molecule (featuring DBS, a Droplet

Barcode Sequence, with a degeneracy of 3^20, or ~3.5*10^9) is performed asymmetrically to yield single

stranded products, followed by coupling to the universal sequences on template molecules. After emulsion

breakage, uncoupled amplicons are removed through bead-assisted size selection, and then by streptavidin-

biotin based enrichment. An on-bead extension is performed and the supernatant is collected after denaturing to

exclude uncoupled barcoding molecules to interfere with the indexing. Finally, an i5 sequencing adaptor is

added by PCR for sequencing using the Illumina platform. After sequencing, the reads are clustered according

to the barcode sequence to form barcoded read groups (Figure S4).

Page 17: Efficient whole genome haplotyping and single molecule ... › smash › get › diva2:... · to 10x barcodes (Methods), the Long Ranger pipeline yielded an N50 phase block length

17

Figure S2. Dot plot of GM24385 de novo assembly against the GRCh38 reference genome. Data is shown as a

heatmap of the mean percent identity (per query), for a minimum query length of 1 kb and excluding scaffolds

<10 kb. The reference is divided in the order of descending chromosome size. A lower assembly contiguity can

be observed in chrX and chrY but such discrepancies are to be expected since these chromosomes are harder to

assemble.

100%

75%

50%

25%

0%

1 2 3 4 5 6 7 X 8 9 11 10 12 13 14 15 16 17 18 20 19 Y 22 21

de n

ovo

asse

mbl

y

GRCh38 reference genome

Page 18: Efficient whole genome haplotyping and single molecule ... › smash › get › diva2:... · to 10x barcodes (Methods), the Long Ranger pipeline yielded an N50 phase block length

18

Figure S3. Overview of protocol steps with corresponding bench and instrument times. The whole protocol can

be performed in a single day.

TAGMENTATION AND REACTION SETUP

EMULSIFICATION AND PCR SETUP

DNA EXTRACTION AND QUANTIFICATION

SEQUENCING

EMULSION PCR

CLEAN UP AND QC

INDEXING PCR

EMULSION BREAKAGE AND CLEAN UP

20 min | Bench Time40 min | Incubation Time

20 min | Bench Time10 min | Instrument Time

15 min | Bench Time30 min | Instrument Time

15 min | Bench Time30 min | Instrument Time

3 hours | Instrument Time

20 min | Bench Time40 min | Instrument Time

30 min | Bench Time30 min | Instrument Time

TOTAL TIME 8 hours

HANDS ON~2 hours

ENRICHMENT AND WASHING

Page 19: Efficient whole genome haplotyping and single molecule ... › smash › get › diva2:... · to 10x barcodes (Methods), the Long Ranger pipeline yielded an N50 phase block length

19

Figure S4. Flowchart of computational pipeline. (a) Read trimming (b) barcode clustering (c) Bacterial de novo

assembly (d) Filtering followed by human phasing and de novo pipelines. In-house developed scripts are

marked in dark grey.

bowtie2

tag_bam.py

picardtools

cluster_rmdup.py

filter_clusters.py

cdhit_prep.py

CD-HIT-454

Clustered Barcodes

Phased genome

Assembled genome

c d

a b

cutadapt

cutadapt

bc_extract.py

Raw Reads

IDBA-UD

BWA

Athena

ARCS

LINKS

SPACE

Barcode-linked Assembly

Conventional Assembly

wfa2tenx.py

Long Ranger Supernova

Haplotyped Genome

de novo Assembly

Trimmed Reads

Page 20: Efficient whole genome haplotyping and single molecule ... › smash › get › diva2:... · to 10x barcodes (Methods), the Long Ranger pipeline yielded an N50 phase block length

20

Figure S5. Visual inspection of barcode overlaps in the GIAB (10x Genomics) resource dataset, with the

coordinates corresponding to (a) variant call 8 and (b) variant call 25 in the GM24385 (19X) dataset shown by

dotted lines (see Table S4 for variant calls). Shown in (a), variant call 8 is a duplication event spanning 120 kb

on chromosome 6, the barcode overlap and coverage of the reference dataset strongly suggests the presence of

localized duplication. Similarly for (b), there appears to be a heterozygous deletion identified within the same

region as variant call 25.

a

b

200,

000

440,

000

15,7

75,0

0016

,195

,000 26

0

26

0

70 k

bC

all 8

Cal

l 25

120

kb

Page 21: Efficient whole genome haplotyping and single molecule ... › smash › get › diva2:... · to 10x barcodes (Methods), the Long Ranger pipeline yielded an N50 phase block length

21

Figure S6. Visual inspection of barcode overlapping for non-verified variant call 15, corresponding to

GM24385 (19 X) (left) and GIAB (10x Genomics) reference dataset (right). The region covered by the

identified variant in chromosome 12 (duplication event spanning 120 kb) features a significant drop out in

coverage that has likely influenced the accuracy by which variants can be called. Since the same pattern of non-

existent barcode overlap is observed for the reference dataset, the authenticity of the variant call could not be

verified.

10,9

10,0

0011

,240

,000 26

0

110

kbC

all 1

5GM24835 (19X) GIAB (10x Genomics)

Page 22: Efficient whole genome haplotyping and single molecule ... › smash › get › diva2:... · to 10x barcodes (Methods), the Long Ranger pipeline yielded an N50 phase block length

22

Table S1. Processing of reads for GM24385 (19 X). The number of sequencing reads (not paired) and the

corresponding number of barcoded read groups are detailed after removal of reads within the analysis pipeline,

as described in Methods and Figure S4. A total of 233,006,619 reads (25.8%) were marked as duplicates.

Analysis Step Reads Remaining % Total Barcoded Read Groups

Raw data 903,723,210 100% 3,339,866Trimming 890,047,118 98.5% 3,339,866Filtering 641,457,522 71.0% 2,204,497

Page 23: Efficient whole genome haplotyping and single molecule ... › smash › get › diva2:... · to 10x barcodes (Methods), the Long Ranger pipeline yielded an N50 phase block length

23

Table S2. Analysis output from the Long Ranger phasing pipeline. Haplotyping of human libraries produced

with barcode-linked reads and an external resource dataset from the GIAB consortium (based on the 10x

Genomics platform) is shown. It is reasonable to believe that the metrics regarding phase block lengths are

influenced by the sequencing coverage and length distribution of input DNA molecules. In addition, the perfect

quality of sequencing reads reported for the GIAB (10x Genomics) dataset suggests that the data has been

filtered before phasing analysis with Long Ranger; a process which in all likelihood has affected key assay

metrics for the better.

* Large structural variant (LSV) calls featured heterozygous deletions calls in chromosome X, which following correspondence with 10x Genomics were confirmed as erroneous (Table S4).

Library GM24385 (19X) GM24385 (35X) GIAB (10x Genomics)

Sequencing reads 641,457,522 1,080,294,792 976,557,530Mean depth 19.1 X 34.7 X 41.7 XSNVs Phased 97.9% 98.8% 98.5%N50 Phase Block 1,832,815 bp 2,812,019 bp 9,657,460 bpLongest Phase Block 7,771,012 bp 11,919,151 bp 35,805,844 bpMean Molecule Length 25,946 bp 26,780 bp 104,745 bpMolecules >20 kb 74.8% 74.8% 92.2%Molecules >100 kb 18.6% 18.6% 44.7%LSV Calls * 35 35 35Short Deletion Calls 4,008 4,047 4,383Median Insert Size 231 bp 234 bp 308 bpMapped Reads 82.7% 89.6% 96.2%Zero Coverage 0.735% 0.507% 0.178%Q30 bases, Read 1 82.4% 86.2% 100%Q30 bases, Read 2 64.2% 69.3% 100%

Page 24: Efficient whole genome haplotyping and single molecule ... › smash › get › diva2:... · to 10x barcodes (Methods), the Long Ranger pipeline yielded an N50 phase block length

24

Table S3. Verification of Single Nucleotide Variation (SNV) calls for all datasets against a reference ground

truth dataset based on Illumina sequencing to a sequencing depth of 300x. The reference dataset contained

4,756,689, of which 4,319,399 SNVs had a minimum phred score of 60. Calculations of sensitivity and

accuracy are based on SNV calls with a minimum phred score of 60.

Library GM24385 (19X) GM24385 (35X) GIAB (10x Genomics)

Sequencing Depth 19.1X 34.7X 41.1XTotal SNV Calls 3,697,907 3,924,654 4,517,053SNVs (phred score >60) 3,445,072 3,804,691 4,054,372True Positive SNV Calls 3,287,856 3,599,024 3,675,500Non-called SNVs 1,031,533 720,359 643,885False Positive SNV Calls 157,210 205,655 378,862Discordant SNV Calls 6 12 10Detection Sensitivity 76.12% 83.32% 85.09%Detection Accuracy 95.44% 94.59% 90.66%

Page 25: Efficient whole genome haplotyping and single molecule ... › smash › get › diva2:... · to 10x barcodes (Methods), the Long Ranger pipeline yielded an N50 phase block length

25

Table S4. Verification of large structural variant (LSV) calls in GM24385 (19 X) against the GIAB (10x

Genomics) resource dataset. Phasing analysis was performed with the same reference and version of the Long

Ranger pipeline for both datasets. In addition to specified LSV calls, multiple heterozygous deletions were

called in the X chromosome for both datasets, despite the individual in question being a male. These anomalies

were reported and subsequently confirmed as erroneous by 10x Genomics.

* Variant type not classified in the GIAB (10x Genomics) dataset. ** Adjoining calls identified as a single variant in the GIAB (10x Genomics) dataset. *** Variant call visually supported from barcode overlap and read mappings (Figure S5) **** Visual inspection support does not support verification of the variant call (Figure S6).

Call Chr Position Type Size Position Type Size Note

1 1 72,300,641 - 72,346,143 DEL 45.5 kb 72,300,641 - 72,346 200 DEL 45.6 kb2 1 121,610,000 - 121,940,000 DEL 330 kb 121,610,000 - 121,840,000 DEL 230 kb3 2 34,470,699 - 34,511,486 DEL 40.8 kb 34,470,693 - 34,511,500 DEL 40.8 kb4 3 130,044,539 - 130,088,085 DEL 43.5 kb 130,044,542 - 130,087,886 DEL 43.3 kb5 4 34,778,034 - 34,827,542 DEL 49.5 kb 34,778,232 - 34,827,339 DEL 49.5 kb6 4 68,508,038 - 68,625,387 DEL 117 kb 68,508,058 - 68,625,383 DEL 117 kb7 4 69,270,000 - 69,370,000 DEL 100 kb 69,280,000 - 69,370,000 DEL 90 kb8 6 260,000 - 380,000 DUP 120 kb N/A ***9 6 32,483,200 - 32,586,623 DUP 103 kb 32,483,184 - 32,591,863 UNK 109 kb *

10 7 143,127,754 - 143,196,851 DEL 69.1 kb 143,127,755 - 143,196,800 DEL 69.0 kb11 9 60,550,000 - 60,630,000 DEL 80.0 kb 60,550,000 - 60,630,000 DEL 80.0 kb12 10 41,864,883 - 41,904,216 UNK 39.3 kb 41,859,660 - 41,900,226 UNK 40.6 kb13 11 55,590,000 - 55,660,000 DEL 70.0 kb 55,597,954 - 55,660,000 UNK 80.4 kb *14 12 17,767,858 - 17,853,857 INV 86.0 kb 17,773,181 - 17,869,963 INV 96.8 kb15 12 11,020,000 - 11,130,000 DUP 110 kb N/A ****16 12 11,084,352 - 11,126,407 INV 42.1 kb 11,083,875 - 11,126,374 UNK 42.5 kb *17 14 19,860,000 - 19,950,000 DUP 90.0 kb 19,700,000 - 19,950,000 DUP 225 kb18 14 105,770,000 - 106,420,000 DEL 650 kb 105,760,000 - 106,630,000

19 14 106,420,000 - 106,540,000 DEL 120 kb 105,760,000 - 106,630,000

20 14 106,540,000 - 106,630,000 DEL 90.0 kb 105,760,000 - 106,630,000

21 16 22,610,000 - 22,700,000 DUP 90.0 kb 22,530,000 - 22,700,000 DUP 170 kb22 16 34,040,000 - 34,240,000 DUP 200 kb 34,110,000 - 34,210,000 DUP 100 kb23 17 46,140,000 - 46,240,000 DUP 100 kb 46,140,000 - 46,230,000 DUP 90 kb24 19 24,330,000 - 24,410,000 DUP 80.0 kb 24,330,000 - 24,410,000 DUP 80.0 kb25 22 15,950,000 - 16,020,000 DEL 70.0 kb N/A ***26 22 22,760,000 - 22,910,000 DEL 150 kb 22,758,960 - 22,907,024 DEL 148 kb27 Y 9,470,000 - 10,250,000 DEL 780 kb 9,460,000 - 10,260,000 DEL 800 kb28 Y 18,850,000 - 20,070,000 DEL 1.22 Mb 10,600,000 - 20,070,000 DEL 9.47 Mb29 Y 20,350,000 - 21,490,000 DEL 1.14 Mb 20,350,000 - 26,630,000 DEL 6.28 Mb

Visually confirmed

GM243825 (19 X) GIAB (10x Genomics)

**

Visually confirmed

Not verified

DEL 870 kb

Page 26: Efficient whole genome haplotyping and single molecule ... › smash › get › diva2:... · to 10x barcodes (Methods), the Long Ranger pipeline yielded an N50 phase block length

26

Table S5. Assay cost assessment.

* Based on reagent volumes provided in Nextera DNA Flex Library Preparation Kit for 24 reactions, of which 15% of a kit reaction volume was used per assay library.

** Costs for reagents have been converted from SEK to USD using an exchange rate of 0.116 USD/SEK. Common materials (plastic consumables and buffers) have not been included in the cost assessment.

Reagent Supplier

Nextera DNA Flex Library Prep Kit * Illumina $5.83Phusion® Hot Start Flex 2X Master Mix NEB $3.62Emulsion Additives Sigma $0.04Oligonucleotides IDT $0.05Novec 7500 with 5%wt 008-FS RAN Biotechnologies $3.78FC-40 with 5%wt 08-FS RAN Biotechnologies $2.761H ,1H ,2H ,2H -Perfluoro-1-octanol Sigma $1.22DynaBeads MyOne Streptavidin T1 Thermo Scientific $1.65Total ** $18.95

Cost Per Library

Page 27: Efficient whole genome haplotyping and single molecule ... › smash › get › diva2:... · to 10x barcodes (Methods), the Long Ranger pipeline yielded an N50 phase block length

27

Table S6. Oligonucleotide sequences.

* The indexing oligonucleotide is identical to Illumina’s Adapter Sequences for Nextera (PCR primer for index 1 read) that features an 8 bp i7-index from a pre-defined set of sequences.

Name Sequence (5’ - 3’)

Enrichment Oligo BIOTIN-GGAGCCATTAAGTCGTAGCT

Barcoding Oligo GGAGCCATTAAGTCGTAGCTCAGTTGATCATCAGCAGGTAATCTGGBDHVBDHVBDHVBDHVBDHVCATGACCTCTTGGAACTGTC

Coupling Oligo CTGTCTCTTATACACATCTGACAGTTCCAAGAGGTCATG

Indexing Oligo * CAAGCAGAAGACGGCATACGAGAT [i7] GTCTCGTGGGCTCGG

i5 Adaptor Oligo AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCTCAGTTGATCATCAGCAGGTAATCTGG

Page 28: Efficient whole genome haplotyping and single molecule ... › smash › get › diva2:... · to 10x barcodes (Methods), the Long Ranger pipeline yielded an N50 phase block length

28

Supplementary Note. Coupling efficiency.

Based on mapping positions identified within the barcoded read groups, molecule lengths were defined as the

distance spanning the outer most reads. Given the reads spanning each molecule, a measure of bases sequenced

was calculated for each molecule, yielding an average coverage of 18.6%.

Given transpose adaptors A and B, the template molecules formed will be structured as either A-A’, A-B’, B-

A’, or B-B’. As a consequence of this, half of the template molecule will be unavailable for amplification due

to complementary ends forming secondary hairpin structures (A-A’ and B-B’). Furthermore, given the protocol

used for library indexing (Figure S1), only half of the available molecules will be amplified to yield products

for sequencing. The coupling efficiency of barcode sequences to template molecules can thus crudely be

appreciated to be 74.4% (18.6 * 2 * 2).