Improving the Accuracy of Genome Assemblies
July 17th 2012
Roy Ronen*,1, Christina Boucher*,1, Hamidreza Chitsaz2 and Pavel Pevzner1
1. University of California, San Diego
2. Wayne State University, Michigan
* Contributed equally to this work
Draft Genome from HTS
Sample Preparation → Fragments → Sequencing → Reads → Assembly → Contigs → Analysis, Analysis, Analysis
HTS assemblies (contigs) still contain an abundance of errors:
• 20-30 substitution errors per 100 kbp with SOAPdenovo.
• 5-20 substitution errors per 100 kbp with Velvet.
• Small (<50 bp) INDEL errors.
• Misassemblies, large INDELs, etc.
Errors in the assembled contigs will profoundly affect any downstream analysis.
Sample Preparation → Fragments → Sequencing → Reads → Assembly → Contigs → SEQuel → Refined Contigs → Analysis, Analysis, Analysis
De Bruijn Graph (Pevzner, Tang, Waterman 2001)
Edges between consecutive 3-mers, before gluing:
GCC → CCA, CCA → CAT, CAT → ATT, ATT → TTA
GCC → CCT, CCT → CTT, CTT → TTT, TTT → TTA
CCT → CTA, CTA → TAT, TAT → ATT
Identically labeled vertices are then glued one label at a time (the frames show CCA, then CCT, then GCC).
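The gluing step can be illustrated with a minimal Python sketch. It assumes the vertex-per-k-mer convention drawn on the slides, and the three input strings are a hypothetical reconstruction spelled from the k-mer paths above, not data from the talk. Keying a dictionary by k-mer glues identical labels implicitly.

```python
from collections import defaultdict

def de_bruijn_edges(sequences, k=3):
    """Map each k-mer to the set of k-mers that follow it in any sequence."""
    graph = defaultdict(set)
    for seq in sequences:
        for i in range(len(seq) - k):
            # Consecutive k-mers overlap by k-1 characters.
            graph[seq[i:i + k]].add(seq[i + 1:i + k + 1])
    return graph

# Hypothetical input: strings spelled by the three k-mer paths on the slide.
sequences = ["GCCATTA", "GCCTTTA", "CCTATT"]
graph = de_bruijn_edges(sequences)

for kmer in sorted(graph):
    print(kmer, "->", sorted(graph[kmer]))
```

After gluing, GCC and CCT each have two successors, which is where the graph branches.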
Genome: ...GCCTAGGAC...CACTTGGCA...
Reads: GCCTAGGAC (×3), CACTTGGCA (×3)
k-mer paths (k = 3):
GCC → CCT → CTA → TAG → AGG → GGA → GAC
CAC → ACT → CTT → TTG → TGG → GGC → GCA
Sequencing errors cause bulges in the de Bruijn graph
Genome: ...GCCTAGGAC...CACTTGGCA...
Reads: GCCTAGGAC (×2), GCCTTGGAC (×1, with a substitution error), CACTTGGCA (×3)
The erroneous read adds the edges CCT → CTT and TGG → GGA, so two branches now connect CCT to GGA:
CCT → CTA → TAG → AGG → GGA (edge multiplicities 2, 2, 2, 2)
CCT → CTT → TTG → TGG → GGA (edge multiplicities 1, 4, 4, 1)
All other edges have multiplicity 3: GCC → CCT, GGA → GAC, CAC → ACT, ACT → CTT, TGG → GGC, GGC → GCA.
Removing the CTA → TAG → AGG branch leaves the contigs:
...CACTTGGCA...
...GCCTTGGAC...
The second contig retains the sequencing error (GCCTTGGAC instead of GCCTAGGAC).
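The edge multiplicities in this example can be recomputed from the reads with a short Python sketch. Comparing branches by mean edge multiplicity is an assumption made here for illustration only; it is not claimed to be the rule used by any particular assembler.

```python
from collections import Counter

def edge_multiplicities(reads, k=3):
    """Count how many times each (k-mer, next k-mer) edge occurs in the reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k):
            counts[(read[i:i + k], read[i + 1:i + k + 1])] += 1
    return counts

# Reads from the slide: two correct copies, one with an A->T error, three from the other region.
reads = ["GCCTAGGAC"] * 2 + ["GCCTTGGAC"] + ["CACTTGGCA"] * 3
counts = edge_multiplicities(reads)

correct_branch = [("CCT", "CTA"), ("CTA", "TAG"), ("TAG", "AGG"), ("AGG", "GGA")]
error_branch = [("CCT", "CTT"), ("CTT", "TTG"), ("TTG", "TGG"), ("TGG", "GGA")]

def mean_coverage(path):
    """Hypothetical branch score: average multiplicity over the edges of the path."""
    return sum(counts[edge] for edge in path) / len(path)

print("correct branch:", mean_coverage(correct_branch))
print("error branch:  ", mean_coverage(error_branch))
```

Because the erroneous branch shares the edges CTT → TTG and TTG → TGG with the other genomic region, it scores higher under this criterion than the correct branch.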
The SEQuel Algorithm
Permissively aligned read-pair: a read-pair for which at least one read aligned uniquely.
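The definition of a permissively aligned read-pair can be written as a simple predicate. The `ReadPair` record and its hit-count fields below are hypothetical, chosen only to make the sketch self-contained; a real pipeline would derive these counts from an aligner's output.

```python
from dataclasses import dataclass

@dataclass
class ReadPair:
    """Hypothetical record: number of alignment locations found for each mate."""
    name: str
    hits_read1: int
    hits_read2: int

def is_permissively_aligned(pair):
    """True if at least one read of the pair aligned to exactly one location."""
    return pair.hits_read1 == 1 or pair.hits_read2 == 1

pairs = [
    ReadPair("both_unique", 1, 1),
    ReadPair("one_unique", 1, 5),
    ReadPair("mate_unaligned", 0, 1),
    ReadPair("both_repeat", 3, 4),
]
kept = [p.name for p in pairs if is_permissively_aligned(p)]
print(kept)
```

Under this definition a pair is kept even when its second read is ambiguous or unaligned, as long as the first read anchors it uniquely.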
Positional De Bruijn Graph
Positional k-mer: a pair (k-mer, position), e.g. (GCCA, 111).
Edges between positional 3-mers:
(GCC,111) → (CCA,112) → (CAT,113) → (ATT,114) → (TTA,115)
(CCT,112) → (CTT,113) → (TTT,114) → (TTA,115)
(GCC,975) → (CCT,976) → (CTA,977) → (TAT,978) → (ATT,979)
Vertices with the same k-mer but distant positions, such as (ATT,114) and (ATT,979), remain distinct.
partial contig #1: GCCATTA
partial contig #2: GCCTATT
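A minimal Python sketch of how positions keep repeated k-mers apart follows. It assumes vertices are glued only on an exact (k-mer, position) match, which is a simplification for illustration, and it uses two hypothetical placed sequences taken from the partial contigs above rather than real read alignments.

```python
from collections import defaultdict

def positional_de_bruijn(placed_seqs, k=3):
    """Build edges between positional k-mers, i.e. (k-mer, position) pairs."""
    graph = defaultdict(set)
    for seq, start in placed_seqs:
        for i in range(len(seq) - k):
            src = (seq[i:i + k], start + i)
            dst = (seq[i + 1:i + k + 1], start + i + 1)
            graph[src].add(dst)
    return graph

def spell_path(graph, node):
    """Follow edges from node while exactly one successor exists; spell the sequence."""
    seq = node[0]
    while len(graph.get(node, ())) == 1:
        (node,) = graph[node]
        seq += node[0][-1]  # each step contributes one new character
    return seq

# Hypothetical placements: (sequence, position of its first k-mer).
placed = [("GCCATTA", 111), ("GCCTATT", 975)]
graph = positional_de_bruijn(placed)

print(spell_path(graph, ("GCC", 111)))
print(spell_path(graph, ("GCC", 975)))
```

In a plain de Bruijn graph GCC would have two successors (CCA and CCT); with positions attached, (GCC,111) and (GCC,975) are separate vertices and each path is unambiguous.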
Original contig: GTATTCCGAGGACCACTGGATTATGA
Results
• Standard and Single-Cell E. coli.
• 100 bp paired-end, Illumina (GAII) reads.
• Mean coverage ≈ 600x.
• Assemblies compared to reference with and without SEQuel.
Summary
• Removed 35% to 96% of small-scale assembly errors.
• Introduced positional de Bruijn graph for contig refinement.
• Demonstrated utility in hard (single-cell) assembly.
• SEQuel can be used in combination with any assembler.
• Freely available at: http://bix.ucsd.edu/SEQuel