Bng presentation draft

Algorithms and filters used to improve the Tribolium draft Assembly with Physical Maps Based on Imaging Ultra-Long Single DNA Molecules Jennifer Shelton 2014

Transcript of Bng presentation draft

Page 1: Bng presentation draft

Algorithms and filters used to improve the Tribolium draft Assembly with Physical Maps Based on Imaging Ultra-Long Single DNA Molecules

Jennifer Shelton 2014

Page 2: Bng presentation draft

3) Use the sequence reference to adjust molecule stretch for each scan

Assembly Pipeline

Page 3: Bng presentation draft

In recent datasets, when SNR is low and alignment is good, we see a spike in bases per pixel (bpp) in the first scan, followed by a plateau and then a lower plateau

Assembly Pipeline

First scan in a flow cell
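
The per-scan stretch adjustment can be illustrated with a minimal Python sketch. It is not the pipeline's own code: the nominal value of 500 bpp, the function name `rescale_lengths`, and the per-scan bpp estimates are all assumptions chosen for illustration.

```python
# Hypothetical sketch: rescale molecule lengths per scan by the estimated bpp.
NOMINAL_BPP = 500.0  # assumed nominal bases per pixel; not stated in the slides


def rescale_lengths(lengths_bp, estimated_bpp, nominal_bpp=NOMINAL_BPP):
    """Scale the molecule lengths from one scan by estimated_bpp / nominal_bpp."""
    factor = estimated_bpp / nominal_bpp
    return [length * factor for length in lengths_bp]


# Made-up per-scan estimates: a spike in the first scan, then two plateaus.
bpp_by_scan = {1: 520.0, 2: 505.0, 3: 495.0}
molecules_by_scan = {1: [150_000.0, 200_000.0], 2: [180_000.0], 3: [250_000.0]}

adjusted = {
    scan: rescale_lengths(lengths, bpp_by_scan[scan])
    for scan, lengths in molecules_by_scan.items()
}
```

Rescaling scan by scan, rather than with one global factor, is what lets a first-scan spike be corrected without distorting later scans.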

Page 4: Bng presentation draft

5) Use sequence reference to determine assembly noise parameters. Estimated genome size is used to set the p-value threshold.

Assembly Pipeline

Page 5: Bng presentation draft

6/7) Variants of the starting p-value and default minimum molecule length are explored in nine assemblies.

Assembly Pipeline
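
A parameter grid of nine assemblies could be laid out as in the sketch below. The slides do not give the actual values, so everything numeric here is an assumption: the rule that the base p-value is 1e-5 divided by the genome size in Mb, the three scaling factors, and the three minimum molecule lengths. The function names are hypothetical.

```python
from itertools import product


def base_p_value(genome_size_mb, numerator=1e-5):
    """Assumed rule of thumb: the threshold scales inversely with genome size (Mb)."""
    return numerator / genome_size_mb


def assembly_grid(genome_size_mb,
                  p_factors=(0.1, 1.0, 10.0),
                  min_lengths_kb=(100, 150, 180)):
    """Return every combination of p-value variant and minimum molecule length."""
    base = base_p_value(genome_size_mb)
    return [
        {"p_value": base * factor, "min_length_kb": min_len}
        for factor, min_len in product(p_factors, min_lengths_kb)
    ]


# ~200 Mb is the Tribolium genome size estimate quoted in these slides.
grid = assembly_grid(200)
for params in grid:
    print(params)
```

Three p-value variants crossed with three minimum lengths gives the nine assemblies; any other 3 x 3 choice would fit the slide equally well.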

Page 6: Bng presentation draft

223 scaffolds from the sequence-based assembly were longer than 20 (kb) with more than 5 labels and were converted into in silico CMAPs

Current Tribolium sequence-based assembly

Input file                   N50 (Mb)   Number of Contigs   Cumulative Length (Mb)
Genome FASTA                 1.16       2240                160.74
in silico CMAP from FASTA    1.20       223                 152.53
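
The length and label filter applied before in silico conversion can be sketched as follows. This is an illustration only: the recognition motif `GCTCTTC` is an assumed example site (the slides do not name the enzyme), and `count_labels` and `passes_filter` are hypothetical helper names.

```python
def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def count_labels(sequence, motif="GCTCTTC"):
    """Count motif occurrences on both strands (the motif is an assumed example)."""
    sequence = sequence.upper()
    count = 0
    for site in {motif, reverse_complement(motif)}:
        start = sequence.find(site)
        while start != -1:
            count += 1
            start = sequence.find(site, start + 1)
    return count


def passes_filter(sequence, min_length_bp=20_000, min_labels=5):
    """Keep scaffolds longer than min_length_bp with more than min_labels labels."""
    return len(sequence) > min_length_bp and count_labels(sequence) > min_labels


# Toy scaffolds: only the first is both long enough and label-rich enough.
scaffolds = {
    "scaffold_a": ("GCTCTTC" + "A" * 4000) * 6,
    "scaffold_b": ("GCTCTTC" + "A" * 5000) * 5,
    "scaffold_c": "GCTCTTC" * 10,
}
kept = [name for name, seq in scaffolds.items() if passes_filter(seq)]
```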

Page 7: Bng presentation draft

BNG assembled molecules had a higher N50 and longer cumulative length than the sequence assembly

The estimated size of the Tribolium genome is ~200 (Mb)

Assembly Results

Input file                                      N50 (Mb)   Number of Contigs   Cumulative Length (Mb)
Genome FASTA                                    1.16       2240                160.74
in silico CMAP from FASTA                       1.20       223                 152.53
CMAP from assembled BNG molecules (BNG CMAP)    1.35       216                 200.47
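
For readers unfamiliar with the N50 column, a minimal sketch of one common way to compute it is given below. The function name `n50` and the toy lengths are illustrative; the slides do not say which tool produced the reported values.

```python
def n50(lengths):
    """Length of the contig at which the running total, taken from the longest
    contig down, first reaches half of the cumulative length."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0


# Toy contig lengths (not the Tribolium data).
print(n50([2, 3, 4, 5, 6]))  # 5
```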

Page 8: Bng presentation draft

Breadth of alignment coverage for in silico CMAP: 2.1 (Mb)
Total alignment length for in silico CMAP: 2.1 (Mb)

Breadth of alignment coverage for BNG CMAP: 2.4 (Mb)
Total alignment length for BNG CMAP: 2.4 (Mb)

Simplest XMAP alignment description

[Diagram: in silico CMAPs from genome FASTA (in silico CMAP 1, 1 (Mb); in silico CMAP 2, 1.1 (Mb)) shown with CMAPs from assembled molecules (BNG CMAP 1, 1.1 (Mb); BNG CMAP 2, 1.3 (Mb))]

Page 9: Bng presentation draft

Breadth of alignment coverage for in silico CMAP: 1 (Mb)
Total alignment length for in silico CMAP: 2 (Mb)

Breadth of alignment coverage for BNG CMAP: 2.4 (Mb)
Total alignment length for BNG CMAP: 2.4 (Mb)

Complex XMAP alignment description

[Diagram: in silico CMAP from genome FASTA (in silico CMAP 1, 1 (Mb)) shown with CMAPs from assembled molecules (BNG CMAP 1, 1.1 (Mb); BNG CMAP 2, 1.3 (Mb))]
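
The difference between breadth of coverage and total alignment length can be made concrete with a short sketch. Breadth is taken as the union of the aligned intervals on a map and total length as their plain sum. The interval coordinates are invented to mirror the complex example (one 1 Mb in silico CMAP aligned twice), and the function names are hypothetical.

```python
def total_alignment_length(intervals):
    """Sum of aligned interval lengths; overlapping regions are counted twice."""
    return sum(end - start for start, end in intervals)


def breadth_of_coverage(intervals):
    """Length of the union of aligned intervals; overlaps are counted once."""
    breadth = 0
    current_start = current_end = None
    for start, end in sorted(intervals):
        if current_end is None or start > current_end:
            if current_end is not None:
                breadth += current_end - current_start
            current_start, current_end = start, end
        else:
            current_end = max(current_end, end)
    if current_end is not None:
        breadth += current_end - current_start
    return breadth


# Aligned intervals on in silico CMAP 1 (positions in Mb), one per BNG CMAP.
alignments = [(0.0, 1.0), (0.0, 1.0)]
print(breadth_of_coverage(alignments))     # 1.0
print(total_alignment_length(alignments))  # 2.0
```

When the two numbers agree, no region of the map is aligned more than once; a total larger than the breadth signals redundant alignment.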

Page 10: Bng presentation draft

Breadth of alignment coverage compared to total aligned length can indicate relevant relationships between assemblies

In this example, differences between "breadth" and "total" length could be due to:

Duplications in the sample that molecules were extracted from
Assembly of alternate haplotypes
Mis-assembly creating redundant contigs
Collapsed repeat in sequence assembly

Alignment of CMAPs

[Diagram: in silico CMAP from genome FASTA (in silico CMAP 1, 1 (Mb)) shown with CMAPs from assembled molecules (BNG CMAP 1, 1.1 (Mb); BNG CMAP 2, 1.3 (Mb))]

Page 11: Bng presentation draft

Close to 4% of the alignment of the in silico CMAP appears to be redundant

Overall 81% of the in silico CMAP aligns to the BNG consensus map

Alignment of BNG assembly to reference genome

CMAP name                                       Breadth of alignment coverage for CMAP (Mb)   Length of total alignment for CMAP (Mb)   Percent of CMAP aligned
in silico CMAP from FASTA                       124.04                                        132.40                                    81
CMAP from assembled BNG molecules (BNG CMAP)    131.64                                        132.34                                    67

Page 12: Bng presentation draft

Typically, where redundant alignments occur, two BNG consensus maps aligned, suggesting they represent haplotypes, although this has not been verified

Alignment of BNG assembly to reference genome

min confidence 10

ChLG9 currently (2150 scaffold_133 aligns but is not visible in IrysView?) why is Super_scaffold_65 backwards?

[Figure: BNG consensus maps aligned to ChLG 9 scaffolds (130 131 133 134 132 129 135 127 136 137 139 138 140 141 142 143 144 145) and to the ChLG 9 super-scaffold]

Page 13: Bng presentation draft

Tribolium super-scaffolds overlapping BNG cmap

min confidence 10

ChLG9 currently (2150 scaffold_133 aligns but is not visible in IrysView?) why is Super_scaffold_65 backwards?

[Figure: BNG consensus maps aligned to ChLG 9 scaffolds (128 130 131 133 134 132 129 135 127 136 137 139 138 140 141 142 143 144 145) and to the ChLG 9 super-scaffold]

Page 14: Bng presentation draft

Stitch.pl estimates super-scaffolds from alignments between scaffolds and assembled BNG molecules, made with BNG RefAligner

Alignment of BNG assembly to reference genome was used to super-scaffold the Tribolium scaffolds

in silico CMAP aligned as reference

[Diagram: in silico CMAPs 1 (+), 2 (+), 3 (-), and 4 (+) with BNG CMAP 1 and BNG CMAP 2]

Page 15: Bng presentation draft

Stitch.pl estimates super-scaffolds from alignments between scaffolds and assembled BNG molecules, made with BNG RefAligner

Alignment of BNG assembly to reference genome was used to super-scaffold the Tribolium scaffolds

in silico CMAP aligned as reference

The alignment is inverted and used as input for stitch

[Diagram: in silico CMAPs 1 (+), 2 (+), 3 (-), and 4 (+) with BNG CMAP 1 and BNG CMAP 2]

Page 16: Bng presentation draft

Stitch.pl estimates super-scaffolds from alignments between scaffolds and assembled BNG molecules, made with BNG RefAligner

Alignment of BNG assembly to reference genome was used to super-scaffold the Tribolium scaffolds

in silico CMAP aligned as reference

The alignment is inverted and used as input for stitch

Alignments are filtered based on confidence and on alignment length relative to the total possible alignment length

[Diagram: in silico CMAPs 1 (+), 2 (+), 3 (-), and 4 (+) with BNG CMAP 1 and BNG CMAP 2]

Page 17: Bng presentation draft

Stitch.pl checks alignment length against potential alignment lengths to find relevant global rather than local alignments

Alignment of BNG assembly to reference genome was used to super-scaffold the Tribolium scaffolds

[Diagram: in silico CMAPs 1 (+), 2 (+), 3 (-), and 4 (+) with BNG CMAP 1 and BNG CMAP 2; highlighted pair: BNG CMAP 1 and + in silico CMAP 1]

Alignment passes because the alignment length is greater than 30% of the potential alignment length
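
The length check can be sketched in Python as below. The slides do not define "potential alignment length", so the definition used here is an assumption: the aligned span plus however far the alignment could extend on each side before running off the end of either map. Only forward-orientation alignments are handled, and the function names are hypothetical rather than taken from stitch.pl.

```python
def potential_alignment_length(q_start, q_end, q_len, r_start, r_end, r_len):
    """Assumed definition: aligned span plus the shorter overhang on each side."""
    left = min(q_start, r_start)
    right = min(q_len - q_end, r_len - r_end)
    return left + (q_end - q_start) + right


def passes_length_filter(q_start, q_end, q_len, r_start, r_end, r_len,
                         min_percent=30.0):
    """True when the aligned span is at least min_percent of the potential length."""
    aligned = q_end - q_start
    potential = potential_alignment_length(q_start, q_end, q_len,
                                           r_start, r_end, r_len)
    return 100.0 * aligned / potential >= min_percent


# End-to-end overlap of two 1 Mb maps: aligned span equals the potential span.
global_like = passes_length_filter(0, 900_000, 1_000_000,
                                   100_000, 1_000_000, 1_000_000)
# A 100 kb match in the middle of two 1 Mb maps: only 10% of the potential span.
local_like = passes_length_filter(400_000, 500_000, 1_000_000,
                                  400_000, 500_000, 1_000_000)
```

Under this definition a short internal match between two long maps scores low, which is the behaviour needed to favour global over local alignments.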

Page 18: Bng presentation draft

Stitch.pl checks alignment length against potential alignment lengths to find relevant global rather than local alignments

Alignment of BNG assembly to reference genome was used to super-scaffold the Tribolium scaffolds

[Diagram: in silico CMAPs 1 (+), 2 (+), 3 (-), and 4 (+) with BNG CMAP 1 and BNG CMAP 2; highlighted pair: + in silico CMAP 2 and BNG CMAP 1]

Alignment passes because the alignment length is greater than 30% of the potential alignment length

Page 19: Bng presentation draft

Stitch.pl checks alignment length against potential alignment lengths to find relevant global rather than local alignments

Alignment of BNG assembly to reference genome was used to super-scaffold the Tribolium scaffolds

[Diagram: in silico CMAPs 1 (+), 2 (+), 3 (-), and 4 (+) with BNG CMAP 1 and BNG CMAP 2; highlighted pair: - in silico CMAP 2 and BNG CMAP 2]

Alignment passes because the alignment length is greater than 30% of the potential alignment length

Page 20: Bng presentation draft

Stitch.pl checks alignment length against potential alignment lengths to find relevant global rather than local alignments

Alignment of BNG assembly to reference genome was used to super-scaffold the Tribolium scaffolds

[Diagram: in silico CMAPs 1 (+), 2 (+), 3 (-), and 4 (+) with BNG CMAP 1 and BNG CMAP 2; highlighted pair: - in silico CMAP 2 and BNG CMAP 2]

Alignment fails because the alignment length is less than 30% of the potential alignment length

Page 21: Bng presentation draft

Stitch.pl checks alignment length against potential alignment lengths to find relevant global rather than local alignments

Alignment of BNG assembly to reference genome was used to super-scaffold the Tribolium scaffolds

[Diagram: in silico CMAPs 1 (+), 2 (+), 3 (-), and 4 (+) with BNG CMAP 1 and BNG CMAP 2; highlighted pair: + in silico CMAP 2 and BNG CMAP 2]

Alignment fails because the alignment length is less than 30% of the potential alignment length

Page 22: Bng presentation draft

Stitch.pl checks alignment length against potential alignment lengths to find relevant global rather than local alignments

Alignment of BNG assembly to reference genome was used to super-scaffold the Tribolium scaffolds

[Diagram: in silico CMAPs 1 (+), 2 (+), 3 (-), and 4 (+) with BNG CMAP 1 and BNG CMAP 2; highlighted pair: - in silico CMAP 3 and BNG CMAP 2]

Alignment passes because the alignment length is greater than 30% of the potential alignment length

Page 23: Bng presentation draft

Stitch.pl checks alignment length against potential alignment lengths to find relevant global rather than local alignments

Alignment of BNG assembly to reference genome was used to super-scaffold the Tribolium scaffolds

[Diagram: in silico CMAPs 1 (+), 2 (+), 3 (-), and 4 (+) with BNG CMAP 1 and BNG CMAP 2; highlighted pair: BNG CMAP 2 and - in silico CMAP 3]

Alignment fails because the alignment length is less than 30% of the potential alignment length

Page 24: Bng presentation draft

Stitch.pl checks alignment length against potential alignment lengths to find relevant global rather than local alignments

Alignment of BNG assembly to reference genome was used to super-scaffold the Tribolium scaffolds

[Diagram: in silico CMAPs 1 (+), 2 (+), 3 (-), and 4 (+) with BNG CMAP 1 and BNG CMAP 2; highlighted pair: BNG CMAP 2 and + in silico CMAP 4]

Alignment passes because the alignment length is greater than 30% of the potential alignment length

Page 25: Bng presentation draft

Alignment of BNG assembly to reference genome was used to super-scaffold the Tribolium scaffolds

[Diagram: in silico CMAPs 1 (+), 2 (+), 3 (-), and 4 (+) with BNG CMAP 1 and BNG CMAP 2]

High quality scaffolding alignments...

Page 26: Bng presentation draft

Alignment of BNG assembly to reference genome was used to super-scaffold the Tribolium scaffolds

High quality scaffolding alignments are filtered for the longest and highest confidence alignment for each in silico CMAP

[Diagram: in silico CMAPs 1 (+), 2 (+), 3 (-), and 4 (+) with BNG CMAP 1 and BNG CMAP 2]
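
Keeping one alignment per in silico CMAP can be sketched as follows. The slides say only "longest and highest confidence", so the ranking used here (confidence first, aligned length as the tie-breaker) is an assumption, as are the record fields and the function name.

```python
from collections import namedtuple

# Hypothetical alignment record; field names are illustrative only.
Alignment = namedtuple("Alignment", "in_silico_id bng_id aligned_length confidence")


def best_alignment_per_map(alignments):
    """Keep one alignment per in silico CMAP, ranked by (confidence, length)."""
    best = {}
    for aln in alignments:
        current = best.get(aln.in_silico_id)
        key = (aln.confidence, aln.aligned_length)
        if current is None or key > (current.confidence, current.aligned_length):
            best[aln.in_silico_id] = aln
    return best


candidates = [
    Alignment(1, "BNG CMAP 1", 800_000, 45.0),
    Alignment(2, "BNG CMAP 1", 300_000, 20.0),
    Alignment(2, "BNG CMAP 2", 650_000, 38.0),
]
selected = best_alignment_per_map(candidates)
```

Reversing the order of the key would rank by length first; which of the two matches the real script is not stated in the slides.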

Page 27: Bng presentation draft

Alignment of BNG assembly to reference genome was used to super-scaffold the Tribolium scaffolds

High quality scaffolding alignments are filtered for the longest and highest confidence alignment for each in silico CMAP

Passing alignments are used to super-scaffold

[Diagram: in silico CMAPs 1 (+), 2 (+), 3 (-), and 4 (+) with BNG CMAP 1 and BNG CMAP 2]

Page 28: Bng presentation draft

Alignment of BNG assembly to reference genome was used to super-scaffold the Tribolium scaffolds

Stitch is iterated and additional super-scaffolding alignments are found

[Diagram: in silico CMAPs 1 (+), 2 (+), 3 (-), and 4 (+) with BNG CMAP 1 and BNG CMAP 2]

Iteration takes advantage of alignments where sequence-based scaffolds stitch BNG consensus maps

Page 29: Bng presentation draft

Alignment of BNG assembly to reference genome was used to super-scaffold the Tribolium scaffolds

Stitch is iterated and additional super-scaffolding alignments are found until all super-scaffolds are joined

[Diagram: in silico CMAPs 1 (+), 2 (+), 3 (-), and 4 (+) with BNG CMAP 1 and BNG CMAP 2]

Iteration takes advantage of alignments where sequence-based scaffolds stitch BNG consensus maps

Page 30: Bng presentation draft

Alignment of BNG assembly to reference genome was used to super-scaffold the Tribolium scaffolds

[Diagram: in silico CMAPs 1 (+), 2 (+), 3 (-), and 4 (+) with BNG CMAP 1 and BNG CMAP 2]

If the gap length is estimated to be negative, the gap is represented by a 100 (bp) filler
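
The gap rule can be sketched as below. The use of `N` characters as the filler and the helper names are assumptions for illustration; the slides state only that negative gaps become 100 (bp) fillers.

```python
NEGATIVE_GAP_FILLER_BP = 100  # filler length stated in the slides


def gap_sequence(estimated_gap_bp):
    """Return a run of N (assumed filler character) for one gap.

    Negative estimates are replaced by a fixed 100 bp filler; zero and positive
    estimates are used as given.
    """
    if estimated_gap_bp < 0:
        return "N" * NEGATIVE_GAP_FILLER_BP
    return "N" * estimated_gap_bp


def join_scaffolds(scaffolds, gaps_bp):
    """Join scaffold sequences in order, inserting one gap between neighbours."""
    if len(gaps_bp) != len(scaffolds) - 1:
        raise ValueError("need exactly one gap estimate per adjacent scaffold pair")
    parts = [scaffolds[0]]
    for gap, scaffold in zip(gaps_bp, scaffolds[1:]):
        parts.append(gap_sequence(gap))
        parts.append(scaffold)
    return "".join(parts)


super_scaffold = join_scaffolds(["ACGT", "TTGA", "CCAT"], [5, -40_000])
```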

Page 31: Bng presentation draft

Gap lengths

In the automated stitch.pl Tribolium super-scaffolds, 66 gaps had known lengths and 26 had negative lengths (set to 100 (bp))

In the manually edited Tribolium super-scaffolds, 66 gaps had known lengths and 24 had negative lengths (set to 100 (bp))

[Figure: Distribution of gap lengths for automated output. Histogram of count against gap length (bp), from −1500000 to 1000000, with negative and positive gap lengths marked]

Page 32: Bng presentation draft

Gap lengths

In the automated stitch.pl Tribolium super-scaffolds, 66 gaps had known lengths and 26 had negative lengths (set to 100 (bp))

In the manually edited Tribolium super-scaffolds, 66 gaps had known lengths and 24 had negative lengths (set to 100 (bp))

[Figure: Distribution of gap lengths for automated output. Histogram of count against gap length (bp), from −1500000 to 1000000, with negative and positive gap lengths marked]

Page 33: Bng presentation draft

Negative gap lengths

The longest negative gap length is from a BNG consensus map joining in silico 23 and 136

Is part of scaffold_23 connected to 136?
I went with the second alignment (21-26 together and 136-137 together) because it is supported by genetic maps, but we should check these assemblies.
In the bottom alignment of 136 you can see that a large section of BNG map 32 (which joins 23 to 136) is a duplicate in the BNG assembly?

[Figure: alignment view showing scaffolds 22, 23, 129, 136, and 137]

Page 34: Bng presentation draft

Negative gap lengths

Because the same region of 136 aligns to another BNG consensus map that aligns to its chromosome linkage group, this alignment was rejected and stitch was re-run

Is part of scaffold_23 connected to 136?
I went with the second alignment (21-26 together and 136-137 together) because it is supported by genetic maps, but we should check these assemblies.
In the bottom alignment of 136 you can see that a large section of BNG map 32 (which joins 23 to 136) is a duplicate in the BNG assembly?

[Figure: alignment view showing scaffolds 22, 23, 129, 136, and 137]

Page 35: Bng presentation draft

Negative gap lengths

Two new super scaffolds were created and the sequence similarity is being evaluated

min confidence 10

ChLG9 currently (2150 scaffold_133 aligns but is not visible in IrysView?) why is Super_scaffold_65 backwards?

[Figure: BNG consensus maps aligned to ChLG 9 scaffolds (128 130 131 133 134 132 129 135 127 136 137 139 138 140 141 142 143 144 145) and to the ChLG 9 super-scaffold]

[Figure: BNG consensus maps aligned to ChLG 2 scaffolds (U 18 14 16 19 20 21 22 23 24 25 26 27 28 30) and to the ChLG 2 super-scaffold]

Page 36: Bng presentation draft

Gap lengths

This negative alignment also indicated a potential assembly issue

[Figure: Distribution of gap lengths for automated output. Histogram of count against gap length (bp), from −1500000 to 1000000, with negative and positive gap lengths marked]

Page 37: Bng presentation draft

Negative gap lengths

This negative gap length is from a BNG consensus map joining in silico 81, 102, and 103

Half of scaffold_81 aligns with ChLG7

Page 38: Bng presentation draft

Negative gap lengths

Because the other half of 81 aligns to another BNG consensus map that aligns to its chromosome linkage group, this alignment was rejected and stitch was re-run

The BNG maps suggest a mis-assembly of in silico 81 at a sequence level

Half of scaffold_81 aligns with ChLG7

79 80 81 82 83

Page 39: Bng presentation draft

[Figure: Distribution of gap lengths for automated output. Histogram of count against gap length (bp), from −1500000 to 1000000, with negative and positive gap lengths marked]

Gap lengths

All extremely small negative gap lengths, < -40,000 (bp) (shaded), were independently flagged as potential sequence mis-assemblies to be checked at the sequence level
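
Flagging such gaps for review could look like the sketch below. The -40,000 (bp) cutoff is from the slide; the record layout, the example gap values, and the function name are hypothetical.

```python
FLAG_THRESHOLD_BP = -40_000  # cutoff quoted on the slide


def flag_suspect_gaps(gaps, threshold_bp=FLAG_THRESHOLD_BP):
    """Return the gaps whose estimated length is below the threshold."""
    return [gap for gap in gaps if gap["gap_bp"] < threshold_bp]


# Invented gap estimates between pairs of in silico maps.
gaps = [
    {"left": "map_a", "right": "map_b", "gap_bp": 12_500},
    {"left": "map_b", "right": "map_c", "gap_bp": -900},
    {"left": "map_c", "right": "map_d", "gap_bp": -250_000},
]
for gap in flag_suspect_gaps(gaps):
    print(gap["left"], gap["right"], gap["gap_bp"])
```

Small negative estimates such as -900 bp stay unflagged here; they would still receive the 100 (bp) filler but are not treated as evidence of mis-assembly.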

Page 40: Bng presentation draft

[Figure: Distribution of gap lengths for automated output. Histogram of count against gap length (bp), from −1500000 to 1000000, with negative and positive gap lengths marked]

Gap lengths

All gaps from the shaded regions were also manually rejected and stitch.pl was rerun without them for the current super-scaffolded assembly

We suspect extremely small negative gap sizes may be useful in locating sequence mis-assemblies

Page 41: Bng presentation draft

N50 of the super-scaffolded genome was ~4 times greater than the original

Super-scaffolds tend to agree with the Tribolium genetic map

Tribolium super-scaffolds

Input file              N50 (Mb)   Number of Contigs   Cumulative Length (Mb)
genome FASTA            1.16       2240                160.74
super-scaffold FASTA    4.46       2150                165.92

Page 42: Bng presentation draft

For Tribolium:
first minimum percent aligned = 30%
first minimum confidence = 13
second minimum percent aligned = 90%
second minimum confidence = 8

Lower quality alignments were manually selected if genetic map also supported the order

Complex scaffolds were broken manually for sequence level evaluation

Tribolium super-scaffolds

Input file              N50 (Mb)   Number of Contigs   Cumulative Length (Mb)
genome FASTA            1.16       2240                160.74
super-scaffold FASTA    4.46       2150                165.92
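
One way the two threshold pairs listed for Tribolium might combine is sketched below. The slides list the four values but not how they interact, so the rule used here (an alignment is kept if it satisfies either pair) is an assumption, and the function name is hypothetical.

```python
# Threshold values as listed on the slide for Tribolium.
FIRST_MIN_PERCENT, FIRST_MIN_CONFIDENCE = 30.0, 13.0
SECOND_MIN_PERCENT, SECOND_MIN_CONFIDENCE = 90.0, 8.0


def passes_either_filter(percent_aligned, confidence):
    """Assumed rule: keep an alignment that meets the first or the second pair."""
    first = (percent_aligned >= FIRST_MIN_PERCENT
             and confidence >= FIRST_MIN_CONFIDENCE)
    second = (percent_aligned >= SECOND_MIN_PERCENT
              and confidence >= SECOND_MIN_CONFIDENCE)
    return first or second


print(passes_either_filter(45.0, 20.0))  # True: meets the first pair
print(passes_either_filter(95.0, 9.0))   # True: meets the second pair
print(passes_either_filter(45.0, 9.0))   # False: meets neither pair
```

Read this way, the second pair admits lower-confidence alignments only when nearly the whole potential span is aligned.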

Page 43: Bng presentation draft

ChLG X was reduced from 13 scaffolds to 2 with one scaffold being moved to ChLG 3

Tribolium super-scaffolds

min confidence 10

From ChLGX, 11 of the previous 13 scaffolds were joined with two unplaced scaffolds (U) into one super scaffold.

[Figure: BNG consensus maps aligned to ChLG X scaffolds (U 3 4 5 6 7 U 8 9 10 11 12 13) and to the ChLG X super-scaffold]

Page 44: Bng presentation draft

The second scaffold from ChLG X aligned to scaffolds from a portion of ChLG 3

Tribolium super-scaffolds

min confidence 10

[Figure: BNG consensus maps aligned to ChLG 3 scaffolds and to ChLG 3 super-scaffolds. Scaffold labels shown: 51 U 43 45 44 46; 47 U U 152 48 49 50 52 53 54 U 57 55; 32 33 34 35 36 2 37 38 39 40 41 42]

Page 45: Bng presentation draft

Two unplaced scaffolds aligned to ChLG X

Tribolium super-scaffolds

min confidence 10

From ChLGX, 11 of the previous 13 scaffolds were joined with two unplaced scaffolds (U) into one super scaffold.

[Figure: BNG consensus maps aligned to ChLG X scaffolds (U 3 4 5 6 7 U 8 9 10 11 12 13) and to the ChLG X super-scaffold]

Page 46: Bng presentation draft

4% Redundancy in alignment may be from assembly of haplotypes (generally observed as two BNG consensus maps aligning to the same in silico map)

Tribolium super-scaffolds

min confidence 10

From ChLGX, 11 of the previous 13 scaffolds were joined with two unplaced scaffolds (U) into one super scaffold.

[Figure: BNG consensus maps aligned to ChLG X scaffolds (U 3 4 5 6 7 U 8 9 10 11 12 13) and to the ChLG X super-scaffold]

Page 47: Bng presentation draft

Tribolium super-scaffolds overlapping BNG cmap

min confidence 10

From ChLGX, 11 of the previous 13 scaffolds were joined with two unplaced scaffolds (U) into one super scaffold.

[Figure: BNG consensus maps aligned to ChLG X scaffolds (U 3 4 5 6 7 U 8 9 10 11 12 13) and to the ChLG X super-scaffold]

Page 48: Bng presentation draft

For ChLG 9, 21 scaffolds were reduced to 9

Tribolium super-scaffolds

min confidence 10

ChLG9 currently (2150 scaffold_133 aligns but is not visible in IrysView?) why is Super_scaffold_65 backwards?

[Figure: BNG consensus maps aligned to ChLG 9 scaffolds (128 130 131 133 134 132 129 135 127 136 137 139 138 140 141 142 143 144 145) and to the ChLG 9 super-scaffold]

Page 49: Bng presentation draft

For ChLG 5, 17 scaffolds were reduced to 4

Tribolium super-scaffolds

min confidence 10

[Figure: BNG consensus maps aligned to ChLG 5 scaffolds (69 68 70 71 72 73 74 U 75 76 77 78 79 80 81 82 83) and to the ChLG 5 super-scaffold]

Page 50: Bng presentation draft

K-INBRE Bioinformatics Core
Susan Brown - PI
Nic Herndon - script development
Nanyan Lu - manual editing
Michelle Coleman - extractions and running the Irys
Zachary Sliefert - metric summaries

Bionano Genomics
Ernest Lam - assembly pipeline best practices assistance
Weiping Wang - assistance with data formats
Palak Sheth - collaboration to standardize analysis

Script availability
https://github.com/i5K-KINBRE-script-share/Irys-scaffolding
BNG scripts available by request from BNG

Acknowledgements

Page 51: Bng presentation draft

Gap lengths

In the automated stitch.pl Tribolium super-scaffolds, 66 gaps had known lengths and 26 had negative lengths (set to 100 (bp))

In the manually edited Tribolium super-scaffolds, 66 gaps had known lengths and 24 had negative lengths (set to 100 (bp))

[Figure: Distribution of gap lengths for automated output. Histogram of count against gap length (bp), from −1500000 to 1000000, with negative and positive gap lengths marked]