Error Correction and Assembly of Single Molecule...
Transcript of Error Correction and Assembly of Single Molecule...
![Page 1: Error Correction and Assembly of Single Molecule ...schatzlab.cshl.edu/presentations/2014.09.20.Genome Informatics... · Roberts, RJ, Carneiro, MO, Schatz, MC (2013) Genome Biology.](https://reader033.fdocuments.us/reader033/viewer/2022042920/5f65fb0c81651f5e68395546/html5/thumbnails/1.jpg)
Error Correction and Assembly of Single Molecule Sequencing Data
James Gurtowski
![Page 2: Error Correction and Assembly of Single Molecule ...schatzlab.cshl.edu/presentations/2014.09.20.Genome Informatics... · Roberts, RJ, Carneiro, MO, Schatz, MC (2013) Genome Biology.](https://reader033.fdocuments.us/reader033/viewer/2022042920/5f65fb0c81651f5e68395546/html5/thumbnails/2.jpg)
Assembly Complexity
A" R"
B"
C"
A" R" B" R" C" R"
![Page 3: Error Correction and Assembly of Single Molecule ...schatzlab.cshl.edu/presentations/2014.09.20.Genome Informatics... · Roberts, RJ, Carneiro, MO, Schatz, MC (2013) Genome Biology.](https://reader033.fdocuments.us/reader033/viewer/2022042920/5f65fb0c81651f5e68395546/html5/thumbnails/3.jpg)
Assembly Complexity
A" R" B" C"
A" R" B" R" C" R"
R" R"
A" R" B" R" C" R"
The advantages of SMRT sequencing Roberts, RJ, Carneiro, MO, Schatz, MC (2013) Genome Biology. 14:405
![Page 4: Error Correction and Assembly of Single Molecule ...schatzlab.cshl.edu/presentations/2014.09.20.Genome Informatics... · Roberts, RJ, Carneiro, MO, Schatz, MC (2013) Genome Biology.](https://reader033.fdocuments.us/reader033/viewer/2022042920/5f65fb0c81651f5e68395546/html5/thumbnails/4.jpg)
Oxford Nanopore MinION • Thumb drive sized sequencer
powered over USB
• Senses DNA by measuring changes to ion flow
• Reads both DNA Strands (2D)
![Page 5: Error Correction and Assembly of Single Molecule ...schatzlab.cshl.edu/presentations/2014.09.20.Genome Informatics... · Roberts, RJ, Carneiro, MO, Schatz, MC (2013) Genome Biology.](https://reader033.fdocuments.us/reader033/viewer/2022042920/5f65fb0c81651f5e68395546/html5/thumbnails/5.jpg)
Nanopore Basecalling
Basecalling currently performed at Amazon with frequent updates to algorithm
![Page 6: Error Correction and Assembly of Single Molecule ...schatzlab.cshl.edu/presentations/2014.09.20.Genome Informatics... · Roberts, RJ, Carneiro, MO, Schatz, MC (2013) Genome Biology.](https://reader033.fdocuments.us/reader033/viewer/2022042920/5f65fb0c81651f5e68395546/html5/thumbnails/6.jpg)
Our Data - Yeast W303
30 Flowcells
Best: 446Mb
Mean Read Length: ~6kb (10kb shear)
Mean: 50Mb
0
50
100
150
200
250
300
350
400
450
500
06/23/14
06/26/14
07/01/14
07/02/14
07/04/14
07/10/14
07/10/14
07/11/14
07/11/14
07/12/14
07/16/14
07/16/14
07/18/14
07/18/14
07/21/14
07/23/14
07/24/14
07/24/14
07/25/14
07/28/14
07/28/14
08/04/14
08/05/14
08/07/14
08/07/14
08/07/14
08/14/14
08/14/14
08/14/14
08/24/14
0
2000
4000
6000
8000
10000
12000
Yiel
d (M
egab
ases
)
Read
Len
gth
(bp)
Date
Oxford Flowcell Yields
Flowcell YieldRead Length
![Page 7: Error Correction and Assembly of Single Molecule ...schatzlab.cshl.edu/presentations/2014.09.20.Genome Informatics... · Roberts, RJ, Carneiro, MO, Schatz, MC (2013) Genome Biology.](https://reader033.fdocuments.us/reader033/viewer/2022042920/5f65fb0c81651f5e68395546/html5/thumbnails/7.jpg)
Nanopore Readlengths
Max: 146,992bp 8x over 20kb
41x over 10kbp
Spike-in
Mean: 5473bp
noise
![Page 8: Error Correction and Assembly of Single Molecule ...schatzlab.cshl.edu/presentations/2014.09.20.Genome Informatics... · Roberts, RJ, Carneiro, MO, Schatz, MC (2013) Genome Biology.](https://reader033.fdocuments.us/reader033/viewer/2022042920/5f65fb0c81651f5e68395546/html5/thumbnails/8.jpg)
Nanopore Alignments
Max: 50,900bp 1.8x over 20kb
13.8x over 10kbp
Mean: 6903bp
Alignment Statistics (BLASTN) Mean read length at ~7kbp
Shearing targeted 10kbp 70k reads align (32%)
40x coverage
![Page 9: Error Correction and Assembly of Single Molecule ...schatzlab.cshl.edu/presentations/2014.09.20.Genome Informatics... · Roberts, RJ, Carneiro, MO, Schatz, MC (2013) Genome Biology.](https://reader033.fdocuments.us/reader033/viewer/2022042920/5f65fb0c81651f5e68395546/html5/thumbnails/9.jpg)
Nanopore Accuracy Alignment Quality (BLASTN) Of reads that align, average ~64% identity “2D base-calling” improves to ~70% identity
57% Mismatches 32% Deletions 11% Insertions
![Page 10: Error Correction and Assembly of Single Molecule ...schatzlab.cshl.edu/presentations/2014.09.20.Genome Informatics... · Roberts, RJ, Carneiro, MO, Schatz, MC (2013) Genome Biology.](https://reader033.fdocuments.us/reader033/viewer/2022042920/5f65fb0c81651f5e68395546/html5/thumbnails/10.jpg)
Nanopore Alignment Summary 32% of the data map using BLASTN
Full Length Alignments
Partial Alignments
Unaligned
![Page 11: Error Correction and Assembly of Single Molecule ...schatzlab.cshl.edu/presentations/2014.09.20.Genome Informatics... · Roberts, RJ, Carneiro, MO, Schatz, MC (2013) Genome Biology.](https://reader033.fdocuments.us/reader033/viewer/2022042920/5f65fb0c81651f5e68395546/html5/thumbnails/11.jpg)
Long Read Correction Algorithms PacBioToCA & ECTools
Hybrid Error Correction
Koren, Schatz, et al (2012) Nature Biotechnology. 30:693–700
HGAP & Quiver
LR-only Correction & Polishing
Chin et al (2013) Nature Methods. 10:563–569
PBJelly
Gap Filling and Assembly Upgrade
English et al (2012) PLOS One. 7(11): e47768
< 5x > 50x Long Read Coverage
![Page 12: Error Correction and Assembly of Single Molecule ...schatzlab.cshl.edu/presentations/2014.09.20.Genome Informatics... · Roberts, RJ, Carneiro, MO, Schatz, MC (2013) Genome Biology.](https://reader033.fdocuments.us/reader033/viewer/2022042920/5f65fb0c81651f5e68395546/html5/thumbnails/12.jpg)
NanoCorr: Nanopore-Illumina Hybrid Error Correction
1. BLAST Miseq reads to all raw Oxford Nanopore reads
2. Select non-repetitive alignments
○ First pass scans to remove “contained” alignments
○ Second pass uses Dynamic Programming (LIS) to select set of high-identity alignments with minimal overlaps
3. Compute consensus of each Oxford
Nanopore read ○ Currently using Pacbio’s pbdagcon
Mean: 97%
Mean: 64%
0
5000
10000
15000
20000
25000
30000
35000
40000
45000
50000
40 50 60 70 80 90 100
Freq
uenc
y
Percent Identity
Read Identity Before and After Correction (NanoCorr)
CorrectedRaw
![Page 13: Error Correction and Assembly of Single Molecule ...schatzlab.cshl.edu/presentations/2014.09.20.Genome Informatics... · Roberts, RJ, Carneiro, MO, Schatz, MC (2013) Genome Biology.](https://reader033.fdocuments.us/reader033/viewer/2022042920/5f65fb0c81651f5e68395546/html5/thumbnails/13.jpg)
Long Read Assembly S288C Reference sequence • 12.1Mbp; 16 chromo + mitochondria; N50: 924kbp
Oxford Nanopore 30x corrected reads > 6kb NanoCorr + Celera Assembler • 234 non-redundant contigs • N50: 362kbp >99.78% id
Illumina MiSeq 30x, 300bp PE (Flashed) Celera Assembler • 6953 non-redundant contigs • N50: 59kb >99.9% id
Pacific Biosciences 25x corrected reads > 10kb HGAP + Celera Assembler • 21 non-redundant contigs • N50: 811kb >99.8% id
![Page 14: Error Correction and Assembly of Single Molecule ...schatzlab.cshl.edu/presentations/2014.09.20.Genome Informatics... · Roberts, RJ, Carneiro, MO, Schatz, MC (2013) Genome Biology.](https://reader033.fdocuments.us/reader033/viewer/2022042920/5f65fb0c81651f5e68395546/html5/thumbnails/14.jpg)
Acknowledgements
Michael Schatz
Dick McCombie
Sara Goodwin
Schatz Lab