Algorithms and filters used to improve the Tribolium draft assembly with physical maps based on imaging ultra-long single DNA molecules
Jennifer Shelton 2014
3) Use the sequence reference to adjust molecule stretch for each scan
Assembly Pipeline
In recent datasets, when SNR is low and alignment is good, we see a spike in bases per pixel (bpp) in the first scan, followed by a plateau and then a lower plateau
[Figure annotation: first scan in a flow cell]
5) Use the sequence reference to determine assembly noise parameters. The estimated genome size is used to set the p-value threshold.
6/7) Variants of the starting p-value and the default minimum molecule length are explored in nine assemblies.
223 scaffolds from the sequence-based assembly were longer than 20 (kb) with more than 5 labels and were converted into in silico CMAPs
Current Tribolium sequence-based assembly
Input file                  N50 (Mb)   Number of Contigs   Cumulative Length (Mb)
Genome FASTA                1.16       2240                160.74
in silico CMAP from FASTA   1.20       223                 152.53
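The selection rule stated for the in silico CMAPs (scaffolds longer than 20 kb with more than 5 labels) can be sketched as a simple filter. This is a minimal illustration rather than the pipeline's own code: the `Scaffold` record and the example values are hypothetical, and label counts are assumed to come from an in silico digest performed elsewhere.

```python
from dataclasses import dataclass

MIN_LENGTH_BP = 20_000  # scaffolds must be longer than 20 kb
MIN_LABELS = 5          # and carry more than 5 labels


@dataclass
class Scaffold:
    name: str       # hypothetical identifier
    length_bp: int  # scaffold length in bp
    n_labels: int   # label sites from an in silico digest (computed elsewhere)


def eligible_for_cmap(scaffold: Scaffold) -> bool:
    """True when the scaffold is long enough and has enough labels."""
    return scaffold.length_bp > MIN_LENGTH_BP and scaffold.n_labels > MIN_LABELS


# Made-up scaffolds, for illustration only.
scaffolds = [
    Scaffold("scaffold_a", 1_200_000, 140),
    Scaffold("scaffold_b", 15_000, 9),   # too short
    Scaffold("scaffold_c", 60_000, 4),   # too few labels
]
kept = [s.name for s in scaffolds if eligible_for_cmap(s)]
print(kept)
```

Both comparisons are strict, following the wording "longer than" and "more than".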
BNG assembled molecules had a higher N50 and longer cumulative length than the sequence assembly
The estimated size of the Tribolium genome is ~200 (Mb)
Assembly Results
Input file                                     N50 (Mb)   Number of Contigs   Cumulative Length (Mb)
Genome FASTA                                   1.16       2240                160.74
in silico CMAP from FASTA                      1.20       223                 152.53
CMAP from assembled BNG molecules (BNG CMAP)   1.35       216                 200.47
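For readers unfamiliar with the N50 column, the statistic can be computed as follows. This is a generic sketch using the common definition of N50, with made-up contig lengths; it is not taken from the scripts used for these slides.

```python
def n50(lengths):
    """Return the N50: the length L such that contigs of length >= L
    together cover at least half of the total assembly length."""
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    raise ValueError("n50() needs at least one contig length")


# Toy contig lengths (arbitrary units), for illustration only.
toy_lengths = [10, 8, 6, 4, 2]
print(n50(toy_lengths))
```

In the toy example the total is 30, so the N50 is the length of the contig at which the running sum first reaches 15.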
Breadth of alignment coverage for in silico CMAP: 2.1 (Mb)
Total alignment length for in silico CMAP: 2.1 (Mb)
Breadth of alignment coverage for BNG CMAP: 2.4 (Mb)
Total alignment length for BNG CMAP: 2.4 (Mb)
Simplest XMAP alignment description
[Figure: in silico CMAP 1 and in silico CMAP 2 (in silico CMAPs from the genome FASTA) aligned to BNG CMAP 1 and BNG CMAP 2 (CMAPs from assembled molecules); lengths marked on the figure are 1 (Mb), 1.1 (Mb), 1.1 (Mb) and 1.3 (Mb)]
Breadth of alignment coverage for in silico CMAP: 1 (Mb)
Total alignment length for in silico CMAP: 2 (Mb)
Breadth of alignment coverage for BNG CMAP: 2.4 (Mb)
Total alignment length for BNG CMAP: 2.4 (Mb)
Complex XMAP alignment description
[Figure: in silico CMAP 1 (in silico CMAP from the genome FASTA) aligned to both BNG CMAP 1 and BNG CMAP 2 (CMAPs from assembled molecules); lengths marked on the figure are 1 (Mb), 1.1 (Mb) and 1.3 (Mb)]
Breadth of alignment coverage compared to total aligned length can indicate relevant relationships between assemblies
In this example, differences between "breadth" and "total" length could be due to:
Duplications in the sample that molecules were extracted from
Assembly of alternate haplotypes
Mis-assembly creating redundant contigs
Collapsed repeat in sequence assembly
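The distinction between breadth and total length can be made concrete with interval arithmetic: total length sums every alignment, while breadth counts each covered position once. The following is a minimal sketch, assuming each alignment is reduced to a (start, end) interval in bp on the map it covers. The coordinates are invented to reproduce the lengths in the simple and complex examples, since the slides give only lengths.

```python
def total_alignment_length(intervals):
    """Sum of alignment lengths; overlapping alignments are counted twice."""
    return sum(end - start for start, end in intervals)


def breadth_of_coverage(intervals):
    """Length of the union of the intervals; overlaps are counted once."""
    breadth = 0
    current_start = current_end = None
    for start, end in sorted(intervals):
        if current_end is None or start > current_end:
            # Close the previous merged block before starting a new one.
            if current_end is not None:
                breadth += current_end - current_start
            current_start, current_end = start, end
        else:
            current_end = max(current_end, end)
    if current_end is not None:
        breadth += current_end - current_start
    return breadth


def summarize(alignments_by_map):
    """Return (breadth, total) in bp, summed over all maps."""
    breadth = sum(breadth_of_coverage(iv) for iv in alignments_by_map.values())
    total = sum(total_alignment_length(iv) for iv in alignments_by_map.values())
    return breadth, total


# Simple case: two in silico CMAPs, one alignment each (hypothetical coordinates).
simple = {
    "in silico CMAP 1": [(0, 1_000_000)],
    "in silico CMAP 2": [(0, 1_100_000)],
}
# Complex case: the same 1 Mb of one in silico CMAP aligns to two BNG CMAPs.
complex_case = {"in silico CMAP 1": [(0, 1_000_000), (0, 1_000_000)]}

print(summarize(simple))
print(summarize(complex_case))
```

When breadth equals total, no region is aligned more than once; a total larger than breadth marks redundant alignment of the kind listed above.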
Alignment of CMAPs
Close to 4% of the alignment of the in silico CMAP appears to be redundant
Overall 81% of the in silico CMAP aligns to the BNG consensus map
Alignment of BNG assembly to reference genome
CMAP name                                      Breadth of alignment coverage for CMAP (Mb)   Length of total alignment for CMAP (Mb)   Percent of CMAP aligned
in silico CMAP from FASTA                      124.04                                        132.40                                    81
CMAP from assembled BNG molecules (BNG CMAP)   131.64                                        132.34                                    67
Typically, where redundant alignments occur, two BNG consensus maps align to the same in silico map, suggesting they represent haplotypes, although this has not been verified
Tribolium super-scaffolds overlapping BNG cmap
[Figure: BNG consensus maps aligned to the ChLG 9 scaffolds and to the ChLG 9 super-scaffold, min confidence 10. Scaffolds shown: 128 130 131 133 134 132 129 135 127 136 137 139 138 140 141 142 143 144 145]
Note on figure: ChLG9 currently (2150 scaffold_133 aligns but is not visible in IrysView?) why is Super_scaffold_65 backwards?
Stitch.pl estimates super scaffolds using alignments of scaffolds and assembled BNG molecules using BNG Refaligner
Alignment of BNG assembly to reference genome was used to super-scaffold the Tribolium scaffolds
in silico CMAP aligned as reference
alignment is inverted and used as input for stitch
alignments are filtered based on alignment length relative to total possible alignment length and confidence
[Figure: candidate alignments of + in silico CMAP 1, + in silico CMAP 2, - in silico CMAP 3 and + in silico CMAP 4 to BNG CMAP 1 and BNG CMAP 2]
Stitch.pl checks alignment length against potential alignment lengths to find relevant global rather than local alignments
[Figure sequence: each candidate alignment between an in silico CMAP (+ in silico CMAP 1, + in silico CMAP 2, - in silico CMAP 3, + in silico CMAP 4) and a BNG CMAP (BNG CMAP 1, BNG CMAP 2) is checked in turn]
alignment passes because the alignment length is greater than 30% of the potential alignment length
alignment fails because the alignment length is less than 30% of the potential alignment length
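The pass/fail check against potential alignment length can be illustrated with a small calculation. The slides do not define how the potential alignment length is computed, so the definition below is an assumption: the two maps are placed on a common axis at the start of the alignment, both in "+" orientation and at the same scale, and the potential length is the overlap they would share if the alignment extended to the nearest map ends. Function names and coordinates are hypothetical.

```python
def potential_alignment_length(ref_len, qry_len, ref_start, qry_start):
    """Overlap (bp) the two maps would share if the alignment ran end to end.

    Assumption: '+' orientation and equal scale on both maps.
    """
    left = min(ref_start, qry_start)                       # room before the alignment
    right = min(ref_len - ref_start, qry_len - qry_start)  # room from its start onward
    return left + right


def percent_of_potential(ref_len, qry_len, ref_start, ref_end, qry_start):
    """Aligned length as a percentage of the potential alignment length."""
    aligned = ref_end - ref_start
    potential = potential_alignment_length(ref_len, qry_len, ref_start, qry_start)
    return 100.0 * aligned / potential


# End-to-end overlap: the last 300 kb of one map matches the first 300 kb of the other.
global_like = percent_of_potential(1_000_000, 800_000, 700_000, 1_000_000, 0)
# Short internal match: 100 kb aligned although 800 kb could have overlapped.
local_like = percent_of_potential(1_000_000, 800_000, 400_000, 500_000, 300_000)

print(global_like, global_like > 30)
print(local_like, local_like > 30)
```

Under this assumed definition, a short alignment at the overlapping ends of two maps scores high, while an equally short alignment in the middle of two long maps scores low, which is the global-versus-local distinction the slides describe.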
high quality scaffolding alignments are filtered for longest and highest confidence alignment for each in silico CMAP
Passing alignments are used to super scaffold
Stitch is iterated and additional super scaffolding alignments are found, until all super scaffolds are joined
Iteration takes advantage of alignments where sequence-based scaffolds stitch BNG consensus maps
[Figure: - in silico CMAP 3, + in silico CMAP 2, + in silico CMAP 4 and + in silico CMAP 1 joined through BNG CMAP 1 and BNG CMAP 2]
If gap length is estimated to be negative, gaps are represented by 100 (bp) fillers
Gap lengths
Of the automated stitch.pl Tribolium super-scaffolds, 66 gaps had known lengths and 26 had negative lengths (set to 100 (bp))
Of the manually edited Tribolium super-scaffolds, 66 gaps had known lengths and 24 had negative lengths (set to 100 (bp))
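The filler rule can be sketched in a few lines. This is an illustration under stated assumptions rather than the stitch.pl code: the function name and the example gap estimates are hypothetical, and a gap estimate of exactly zero is assumed to be kept as zero because the slides only describe the negative case.

```python
FILLER_BP = 100  # length used when the estimated gap is negative


def spacer_length(estimated_gap_bp: int) -> int:
    """Length of the spacer placed between two joined scaffolds."""
    if estimated_gap_bp < 0:
        return FILLER_BP       # negative estimate: fixed 100 bp filler
    return estimated_gap_bp    # known (non-negative) gap: used as estimated


# Made-up gap estimates (bp), for illustration only.
estimates = [25_000, -3_000, 0, -1_200_000]
spacers = [spacer_length(g) for g in estimates]
n_negative = sum(1 for g in estimates if g < 0)

print(spacers)
print(n_negative)
```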
[Figure: Distribution of gap lengths for automated output. Histogram of count against gap length (bp), x-axis from −1500000 to 1000000, with negative gap lengths and positive gap lengths distinguished]
Negative gap lengths
The longest negative gap length is from a BNG consensus map joining in silico 23 and 136
Because the same region of 136 aligns to another BNG consensus map that aligns to its chromosome linkage group, this alignment was rejected and stitch was re-run
Two new super scaffolds were created and the sequence similarity is being evaluated
Note on figure: Is part of scaffold_23 connected to 136? I went with the second alignment (21-26 together and 136-137 together because it is supported by genetic maps) but we should check these assemblies. In the bottom alignment of 136 you can see that a large section of BNG map 32 (which joins 23 to 136) is a duplicate in the BNG assembly?
[Figure: alignments involving scaffolds 22, 23, 129, 136 and 137]
[Figure: BNG consensus maps aligned to the ChLG 9 scaffolds and to the ChLG 9 super-scaffold, min confidence 10. Scaffolds shown: 128 130 131 133 134 132 129 135 127 136 137 139 138 140 141 142 143 144 145]
[Figure: BNG consensus maps aligned to the ChLG 2 scaffolds and to the ChLG 2 super-scaffold, min confidence 10. Scaffolds shown: U 18 14 16 19 20 21 22 23 24 25 26 27 28 30]
Gap lengths
This negative alignment also indicated a potential assembly issue
Negative gap lengths
This negative gap length is from a BNG consensus map joining in silico 81, 102 and 103
Half of scaffold_81 aligns with ChLG7
Because the other half of 81 aligns to another BNG consensus map that aligns to its chromosome linkage group, this alignment was rejected and stitch was re-run
The BNG maps suggest a mis-assembly of in silico 81 at the sequence level
[Figure: scaffolds 79 80 81 82 83]
Gap lengths
All extremely small negative gap lengths, < -40,000 (bp) (shaded in the gap length distribution), were independently flagged as potential sequence mis-assemblies to be checked at the sequence level
All gaps from the shaded regions were also manually rejected and stitch.pl was rerun without them for the current super-scaffolded assembly
We suspect extremely small negative gap sizes may be useful in locating sequence mis-assemblies
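A screen for such gaps could look like the sketch below. It assumes the gap estimates are available as a mapping from a join name to an estimated gap in bp; the join names and values are invented, and only the -40,000 (bp) cutoff comes from the slides.

```python
REVIEW_CUTOFF_BP = -40_000  # gaps below this were flagged for sequence-level checks


def classify_gap(estimated_gap_bp: int) -> str:
    """Sort a gap estimate into one of three illustrative categories."""
    if estimated_gap_bp < REVIEW_CUTOFF_BP:
        return "review"    # candidate sequence mis-assembly
    if estimated_gap_bp < 0:
        return "negative"  # small overlap, handled with the 100 bp filler
    return "known"         # non-negative gap length


# Hypothetical joins and gap estimates (bp).
gap_estimates = {
    "join_A": 12_000,
    "join_B": -2_500,
    "join_C": -650_000,
    "join_D": -40_000,
}
labels = {name: classify_gap(gap) for name, gap in gap_estimates.items()}
to_review = sorted(name for name, label in labels.items() if label == "review")

print(labels)
print(to_review)
```

The comparison is strict, so a gap of exactly -40,000 (bp) is not flagged, matching the "< -40,000" wording.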
N50 of the super-scaffolded genome was ~4 times greater than the original
Super-scaffolds tend to agree with the Tribolium genetic map
Tribolium super-scaffolds
Input file             N50 (Mb)   Number of Contigs   Cumulative Length (Mb)
genome FASTA           1.16       2240                160.74
super-scaffold FASTA   4.46       2150                165.92
For Tribolium:
first minimum percent aligned = 30%
first minimum confidence = 13
second minimum percent aligned = 90%
second minimum confidence = 8
Lower quality alignments were manually selected if the genetic map also supported the order
Complex scaffolds were broken manually for sequence level evaluation
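The two threshold pairs listed for Tribolium can be read as a two-pass filter. The sketch below assumes that an alignment is kept if it satisfies either pair (higher confidence with a lower percent aligned, or lower confidence with a higher percent aligned), and that the minimums are inclusive. Neither point is stated on the slides, and the `Alignment` record and example values are hypothetical.

```python
from dataclasses import dataclass

# (minimum percent aligned, minimum confidence), values from the slide
FIRST_PASS = (30.0, 13.0)
SECOND_PASS = (90.0, 8.0)


@dataclass
class Alignment:
    name: str               # hypothetical label
    percent_aligned: float  # alignment length as % of potential alignment length
    confidence: float       # alignment confidence score


def passes(aln: Alignment, thresholds) -> bool:
    """True when the alignment meets both minimums of one threshold pair."""
    min_percent, min_confidence = thresholds
    return aln.percent_aligned >= min_percent and aln.confidence >= min_confidence


def keep(aln: Alignment) -> bool:
    # Assumption: passing either threshold pair is sufficient.
    return passes(aln, FIRST_PASS) or passes(aln, SECOND_PASS)


# Made-up alignments, for illustration only.
candidates = [
    Alignment("confident_partial", 35.0, 15.0),  # passes the first pair
    Alignment("weak_but_complete", 95.0, 9.0),   # passes the second pair
    Alignment("middling", 50.0, 10.0),           # fails both pairs
    Alignment("confident_local", 20.0, 20.0),    # fails both pairs
]
kept_names = [a.name for a in candidates if keep(a)]
print(kept_names)
```

Under this reading, the second pair admits lower confidence alignments only when nearly the whole potential overlap aligns.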
ChLG X was reduced from 13 scaffolds to 2 with one scaffold being moved to ChLG 3
From ChLGX, 11 of the previous 13 scaffolds were joined with two unplaced scaffolds (U) into one super scaffold.
[Figure: BNG consensus maps aligned to the ChLG X scaffolds and to the ChLG X super-scaffold, min confidence 10. Scaffolds shown: U 3 4 5 6 7 U 8 9 10 11 12 13]
The second scaffold from ChLG X aligned to scaffolds from a portion of ChLG 3
[Figure: BNG consensus maps aligned to the ChLG 3 scaffolds and to the ChLG 3 super-scaffolds, min confidence 10. Scaffolds shown: 51 U 43 45 44 46 47 U U 152 48 49 50 52 53 54 U 57 55; 32 33 34 35 36 2 37 38 39 40 41 42]
Two unplaced scaffolds aligned to ChLG X
4% Redundancy in alignment may be from assembly of haplotypes (generally observed as two BNG consensus maps aligning to the same in silico map)
For ChLG 9, 21 scaffolds were reduced to 9
[Figure: BNG consensus maps aligned to the ChLG 9 scaffolds and to the ChLG 9 super-scaffold, min confidence 10. Scaffolds shown: 128 130 131 133 134 132 129 135 127 136 137 139 138 140 141 142 143 144 145]
For ChLG 5, 17 scaffolds were reduced to 4
[Figure: BNG consensus maps aligned to the ChLG 5 scaffolds and to the ChLG 5 super-scaffold, min confidence 10. Scaffolds shown: 69 68 70 71 72 73 74 U 75 76 77 78 79 80 81 82 83]
Acknowledgements
K-INBRE Bioinformatics Core
Susan Brown - PI
Nic Herndon - script development
Nanyan Lu - manual editing
Michelle Coleman - extractions and running the Irys
Zachary Sliefert - metric summaries
Bionano Genomics
Ernest Lam - assembly pipeline best practices assistance
Weiping Wang - assistance with data formats
Palak Sheth - collaboration to standardize analysis
Script availability
https://github.com/i5K-KINBRE-script-share/Irys-scaffolding
BNG scripts available by request from BNG