Bng presentation draft

Post on 05-Jul-2015

274 views 3 download

description

Algorithms and filters used to improve the Tribolium draft Assembly with Physical Maps Based on Imaging Ultra-Long Single DNA Molecules

Transcript of Bng presentation draft

Algorithms and filters used to improve the Tribolium draft Assembly with Physical Maps Based on

Imaging Ultra-Long Single DNA Molecules !

Jennifer Shelton 2014

3) use sequence reference to adjust molecule stretch for each scan

Assembly Pipeline

In recent datasets when SNR is low and alignment is good we see a spike in bases per pixel (bpp) in the first scan, a plateau and a lower plateau

Assembly Pipeline

First scan in a flow cell

5) Use sequence reference to determine assembly noise parameters. Estimated genome size is used to set the p-value threshold.

Assembly Pipeline

6/7) Variants of the starting p-value and default minimum molecule length are explored in nine assemblies.

Assembly Pipeline

223 scaffolds from the sequence-based assembly were longer than 20 (kb) with more than 5 labels and were converted into in silico CMAPs

Current Tribolium sequence-based assembly

Input file N50 (Mb) Number of Contigs

Cumulative Length (Mb)

Genome FASTA 1.16 2240 160.74in silico CMAP from FASTA 1.20 223 152.53

BNG assembled molecules had a higher N50 and longer cumulative length than the sequence assembly

!The estimated size of the Tribolium genome is ~200 (Mb)

Assembly Results

Input file N50 (Mb) Number of Contigs

Cumulative Length (Mb)

Genome FASTA 1.16 2240 160.74in silico CMAP from FASTA 1.20 223 152.53

CMAP from assembled BNG molecules (BNG CMAP)

1.35 216 200.47

Breadth of alignment coverage for in silico CMAP: 2.1 (Mb) Total alignment length for in silico CMAP: 2.1 (Mb)

!Breadth of alignment coverage for BNG CMAP: 2.4 (Mb)

Total alignment length for BNG CMAP: 2.4 (Mb)

Simplest XMAP alignment description

1 (Mb)

1.1 (Mb) 1.3 (Mb)

in silico CMAP from genome

FASTA

CMAP from assembled molecules

in silico CMAP 2in silico CMAP 1

BNG CMAP 1 BNG CMAP 2

1.1 (Mb)

Breadth of alignment coverage for in silico CMAP: 1 (Mb) Total alignment length for in silico CMAP: 2 (Mb)

!Breadth of alignment coverage for BNG CMAP: 2.4 (Mb)

Total alignment length for BNG CMAP: 2.4 (Mb)

Complex XMAP alignment description

in silico CMAP 1

BNG CMAP 1 BNG CMAP 2

1 (Mb)

1.1 (Mb) 1.3 (Mb)

in silico CMAP from genome

FASTA

CMAP from assembled molecules

Breadth of alignment coverage compared to total aligned length can indicate relevant relationships between assemblies

!In this example differences between "breadth" and "total" length could be due to:

!Duplications in sample molecules were extracted from

Assembly of alternate haplotypes Mis-assembly creating redundant contigs Collapsed repeat in sequence assembly

Alignment of CMAPs

in silico CMAP 1

BNG CMAP 1 BNG CMAP 2

1 (Mb)

1.1 (Mb) 1.3 (Mb)

in silico CMAP from genome

FASTA

CMAP from assembled molecules

Close to 4% of the alignment of the in silico CMAP appears to be redundant !

Overall 81% of the in silico CMAP aligns to the BNG consensus map

Alignment of BNG assembly to reference genome

CMAP name Breadth of alignment coverage for CMAP (Mb)

Length of total alignment for CMAP (Mb)

Percent of CMAP aligned

in silico CMAP from FASTA 124.04 132.40 81

CMAP from assembled BNG molecules (BNG CMAP)

131.64 132.34 67

Typically where redundant alignments occur two BNG consensus maps aligned suggesting they represent haplotypes although this has not been

verified

Alignment of BNG assembly to reference genome

min confidence 10

ChLG9 currently (2150 scaffold_133 aligns but is not visible in IrysView?) why is Super_scaffold_65 backwards?

130 131 133 134 132 129 135 127 136 137 139 138 140 141 142 143 144 145

BNG consensus maps

ChLG 9!scaffolds

BNG consensus maps

ChLG 9 super!scaffold

Tribolium super-scaffolds overlapping BNG cmap

min confidence 10

ChLG9 currently (2150 scaffold_133 aligns but is not visible in IrysView?) why is Super_scaffold_65 backwards?

128 130 131 133 134 132 129 135 127 136 137 139 138 140 141 142 143 144 145

BNG consensus maps

ChLG 9!scaffolds

BNG consensus maps

ChLG 9 super!scaffold

Stitch.pl estimates super scaffolds using alignments of scaffolds and assembled BNG molecules using BNG Refaligner

Alignment of BNG assembly to reference genome was used to super-scaffold the Tribolium scaffolds

in silico CMAP aligned as reference

- in silico CMAP 3+ in silico CMAP 2

BNG CMAP 1 BNG CMAP 2

+ in silico CMAP 4+ in silico CMAP 1

Stitch.pl estimates super scaffolds using alignments of scaffolds and assembled BNG molecules using BNG Refaligner

Alignment of BNG assembly to reference genome was used to super-scaffold the Tribolium scaffolds

in silico CMAP aligned as reference

alignment is inverted and

used as input for stitch

- in silico CMAP 3+ in silico CMAP 2

BNG CMAP 1 BNG CMAP 2

+ in silico CMAP 4+ in silico CMAP 1

- in silico CMAP 3+ in silico CMAP 2

BNG CMAP 1 BNG CMAP 2

+ in silico CMAP 4+ in silico CMAP 1

Stitch.pl estimates super scaffolds using alignments of scaffolds and assembled BNG molecules using BNG Refaligner

Alignment of BNG assembly to reference genome was used to super-scaffold the Tribolium scaffolds

in silico CMAP aligned as reference

alignment is inverted and

used as input for stitch

- in silico CMAP 3+ in silico CMAP 2

BNG CMAP 1 BNG CMAP 2

+ in silico CMAP 4

alignments are filtered based on alignment length

relative total possible

alignment length and confidence

- in silico CMAP 3+ in silico CMAP 2

BNG CMAP 1 BNG CMAP 2

+ in silico CMAP 4+ in silico CMAP 1

- in silico CMAP 3+ in silico CMAP 2

BNG CMAP 1 BNG CMAP 2

+ in silico CMAP 4+ in silico CMAP 1

+ in silico CMAP 1

Stitch.pl checks alignment length against potential alignment lengths to find relevant global rather than local alignments

Alignment of BNG assembly to reference genome was used to super-scaffold the Tribolium scaffolds

- in silico CMAP 3+ in silico CMAP 2

BNG CMAP 1 BNG CMAP 2

+ in silico CMAP 4+ in silico CMAP 1

BNG CMAP 1

+ in silico CMAP 1

alignment passes because

the alignment length is greater than 30% of the

potential alignment length

+ in silico CMAP 2

BNG CMAP 1

Stitch.pl checks alignment length against potential alignment lengths to find relevant global rather than local alignments

Alignment of BNG assembly to reference genome was used to super-scaffold the Tribolium scaffolds

- in silico CMAP 3+ in silico CMAP 2

BNG CMAP 1 BNG CMAP 2

+ in silico CMAP 4+ in silico CMAP 1

alignment passes because

the alignment length is greater than 30% of the

potential alignment length

- in silico CMAP 2

BNG CMAP 2

Stitch.pl checks alignment length against potential alignment lengths to find relevant global rather than local alignments

Alignment of BNG assembly to reference genome was used to super-scaffold the Tribolium scaffolds

- in silico CMAP 3+ in silico CMAP 2

BNG CMAP 1 BNG CMAP 2

+ in silico CMAP 4+ in silico CMAP 1

alignment passes because

the alignment length is greater than 30% of the

potential alignment length

- in silico CMAP 2

BNG CMAP 2

Stitch.pl checks alignment length against potential alignment lengths to find relevant global rather than local alignments

Alignment of BNG assembly to reference genome was used to super-scaffold the Tribolium scaffolds

- in silico CMAP 3+ in silico CMAP 2

BNG CMAP 1 BNG CMAP 2

+ in silico CMAP 4+ in silico CMAP 1

alignment fails because the

alignment length is less than 30% of the potential

alignment length

+ in silico CMAP 2

BNG CMAP 2

Stitch.pl checks alignment length against potential alignment lengths to find relevant global rather than local alignments

Alignment of BNG assembly to reference genome was used to super-scaffold the Tribolium scaffolds

- in silico CMAP 3+ in silico CMAP 2

BNG CMAP 1 BNG CMAP 2

+ in silico CMAP 4+ in silico CMAP 1

alignment fails because the

alignment length is less than 30% of the potential

alignment length

Stitch.pl checks alignment length against potential alignment lengths to find relevant global rather than local alignments

Alignment of BNG assembly to reference genome was used to super-scaffold the Tribolium scaffolds

- in silico CMAP 3+ in silico CMAP 2

BNG CMAP 1 BNG CMAP 2

+ in silico CMAP 4+ in silico CMAP 1

alignment passes because

the alignment length is greater than 30% of the

potential alignment length

- in silico CMAP 3

BNG CMAP 2

BNG CMAP 2

Stitch.pl checks alignment length against potential alignment lengths to find relevant global rather than local alignments

Alignment of BNG assembly to reference genome was used to super-scaffold the Tribolium scaffolds

- in silico CMAP 3+ in silico CMAP 2

BNG CMAP 1 BNG CMAP 2

+ in silico CMAP 4+ in silico CMAP 1

alignment fails because the

alignment length is less than 30% of the potential

alignment length- in silico CMAP 3

BNG CMAP 2

Stitch.pl checks alignment length against potential alignment lengths to find relevant global rather than local alignments

Alignment of BNG assembly to reference genome was used to super-scaffold the Tribolium scaffolds

- in silico CMAP 3+ in silico CMAP 2

BNG CMAP 1 BNG CMAP 2

+ in silico CMAP 4+ in silico CMAP 1

alignment passes because

the alignment length is greater than 30% of the

potential alignment length

+ in silico CMAP 4

Alignment of BNG assembly to reference genome was used to super-scaffold the Tribolium scaffolds

- in silico CMAP 3+ in silico CMAP 2

BNG CMAP 1 BNG CMAP 2

+ in silico CMAP 4

high quality scaffolding

alignments...+ in silico CMAP 1

Alignment of BNG assembly to reference genome was used to super-scaffold the Tribolium scaffolds

are filtered for longest and

highest confidence

alignment for each in silico

CMAP

- in silico CMAP 3+ in silico CMAP 2

BNG CMAP 1 BNG CMAP 2

+ in silico CMAP 4+ in silico CMAP 1

- in silico CMAP 3+ in silico CMAP 2

BNG CMAP 1 BNG CMAP 2

+ in silico CMAP 4

high quality scaffolding

alignments...+ in silico CMAP 1

Alignment of BNG assembly to reference genome was used to super-scaffold the Tribolium scaffolds

Passing alignments are used to super

scaffold

are filtered for longest and

highest confidence

alignment for each in silico

CMAP

- in silico CMAP 3+ in silico CMAP 2

BNG CMAP 1 BNG CMAP 2

+ in silico CMAP 4+ in silico CMAP 1

- in silico CMAP 3+ in silico CMAP 2

BNG CMAP 1 BNG CMAP 2

+ in silico CMAP 4+ in silico CMAP 1

- in silico CMAP 3+ in silico CMAP 2

BNG CMAP 1 BNG CMAP 2

+ in silico CMAP 4

high quality scaffolding

alignments...+ in silico CMAP 1

Alignment of BNG assembly to reference genome was used to super-scaffold the Tribolium scaffolds

Stitch is iterated and additional

super scaffolding

alignments are found

- in silico CMAP 3+ in silico CMAP 2

BNG CMAP 1 BNG CMAP 2

+ in silico CMAP 4+ in silico CMAP 1

Iteration takes advantage of alignments where sequence-based scaffolds stitch BNG consensus maps

Alignment of BNG assembly to reference genome was used to super-scaffold the Tribolium scaffolds

Stitch is iterated and additional

super scaffolding

alignments are found

- in silico CMAP 3+ in silico CMAP 2

BNG CMAP 1 BNG CMAP 2

+ in silico CMAP 4+ in silico CMAP 1

Until all super scaffolds are

joined - in silico CMAP 3+ in silico CMAP 2

BNG CMAP 1 BNG CMAP 2

+ in silico CMAP 4+ in silico CMAP 1

Iteration takes advantage of alignments where sequence-based scaffolds stitch BNG consensus maps

Alignment of BNG assembly to reference genome was used to super-scaffold the Tribolium scaffolds

- in silico CMAP 3

+ in silico CMAP 2

BNG CMAP 1 BNG CMAP 2

+ in silico CMAP 4

+ in silico CMAP 1

If gap length is estimated to be negative gaps are represented by 100 (bp) fillers

Gap lengths

Of the automated stitch.pl Tribolium super-scaffolds there were 66 gaps had known lengths and 26 had negative lengths (set to 100 (bp))

!Of the manually edited Tribolium super-scaffolds there were 66 gaps had

known lengths and 24 had negative lengths (set to 100 (bp))

Distribution of gap lengths for automated output

Gap length (bp)

Cou

nt

−1500000 −1000000 −500000 0 500000 1000000

05

1015

20 Negative gap lengthsPositive gap lengths

Gap lengths

Of the automated stitch.pl Tribolium super-scaffolds there were 66 gaps had known lengths and 26 had negative lengths (set to 100 (bp))

!Of the manually edited Tribolium super-scaffolds there were 66 gaps had

known lengths and 24 had negative lengths (set to 100 (bp))

Distribution of gap lengths for automated output

Gap length (bp)

Cou

nt

−1500000 −1000000 −500000 0 500000 1000000

05

1015

20 Negative gap lengthsPositive gap lengths

Negative gap lengths

The longest negative gap length is from a BNG consenus map joining in silico 23 and 136

Is part of scaffold_23 connected to 136?!I went with the second alignment (21-26 together and 136-137 together because it is supported by genetic maps) but we should check these assemblies. !!In bottom alignment of 136 you can see that a large section of the BNG map 32 (which joins 23 to 136) is a duplicate in the BNG assembly?

22 23 129 136 137

Negative gap lengths

!Because the same region of 136 aligns to another BNG consensus map that

aligns to its chromosome linkage group this alignment was rejected and stitch was re-run

Is part of scaffold_23 connected to 136?!I went with the second alignment (21-26 together and 136-137 together because it is supported by genetic maps) but we should check these assemblies. !!In bottom alignment of 136 you can see that a large section of the BNG map 32 (which joins 23 to 136) is a duplicate in the BNG assembly?

22 23 129 136 137

Negative gap lengths

Two new super scaffolds were created and the sequence similarity is being evaluated

min confidence 10

ChLG9 currently (2150 scaffold_133 aligns but is not visible in IrysView?) why is Super_scaffold_65 backwards?

128 130 131 133 134 132 129 135 127 136 137 139 138 140 141 142 143 144 145

BNG consensus maps

ChLG 9!scaffolds

BNG consensus maps

ChLG 9 super!scaffold

min confidence 10

ChLG9 currently (2150 scaffold_133 aligns but is not visible in IrysView?) why is Super_scaffold_65 backwards?

128 130 131 133 134 132 129 135 127 136 137 139 138 140 141 142 143 144 145

BNG consensus maps

ChLG 9!scaffolds

BNG consensus maps

ChLG 9 super!scaffold

min confidence 10

U 18 14 16 19 20 21 22 23 24 25 26 27 28 30

BNG consensus maps

ChLG 2!scaffolds

BNG consensus maps

ChLG 2 super!scaffold

min confidence 10

U 18 14 16 19 20 21 22 23 24 25 26 27 28 30

BNG consensus maps

ChLG 2!scaffolds

BNG consensus maps

ChLG 2 super!scaffold

Gap lengths

This negative alignment also indicated a potential assembly issue

Distribution of gap lengths for automated output

Gap length (bp)

Cou

nt

−1500000 −1000000 −500000 0 500000 1000000

05

1015

20 Negative gap lengthsPositive gap lengths

Negative gap lengths

This negative gap length is from a BNG consenus map joining in silico 81 and 102 and 103

Half of scaffold_81 aligns with ChLG7

Negative gap lengths

Because the other half of 81 aligns to another BNG consensus map that aligns to its chromosome linkage group this alignment was rejected and stitch was re-

run !

The BNG maps suggest a mis-assembly of in silico 81 at a sequence level

Half of scaffold_81 aligns with ChLG7

79 80 81 82 83

Distribution of gap lengths for automated output

Gap length (bp)

Cou

nt

−1500000 −1000000 −500000 0 500000 1000000

05

1015

20 Negative gap lengthsPositive gap lengths

Gap lengths

All extremely small negative gap lengths, < -40,000 (bp) (shaded), were independently flagged as potential sequence mis-assemblies to be checked at

the sequence-level

Distribution of gap lengths for automated output

Gap length (bp)

Cou

nt

−1500000 −1000000 −500000 0 500000 1000000

05

1015

20 Negative gap lengthsPositive gap lengths

Gap lengths

All gaps from the shaded regions were also manually rejected and stitch.pl was rerun without them for the current super-scaffolded assembly

!We suspect extremely small negative gap sizes may be useful in locating

sequence mis-assemblies

N50 of the super-scaffolded genome was ~4 times greater than the original !

Super-scaffolds tend to agree with the Tribolium genetic map

Tribolium super-scaffolds

Input file N50 (Mb) Number of Contigs

Cumulative Length (Mb)

genome FASTA 1.16 2240 160.74

super-scaffold FASTA

4.46 2150 165.92

For Tribolium : first minimum percent aligned = 30%

first minimum confidence = 13 second minimum percent aligned = 90%

second minimum confidence = 8 !

Lower quality alignments were manually selected if genetic map also supported the order

Complex scaffolds were broken manually for sequence level evaluation

Tribolium super-scaffolds

Input file N50 (Mb) Number of Contigs

Cumulative Length (Mb)

genome FASTA 1.16 2240 160.74

super-scaffold FASTA

4.46 2150 165.92

ChLG X was reduced from 13 scaffolds to 2 with one scaffold being moved to ChLG 3

Tribolium super-scaffolds

min confidence 10

From ChLGX, 11 of the previous 13 scaffolds were joined with two unplaced scaffolds (U) into one super scaffold.

BNG consensus maps

ChLG X!scaffolds

BNG consensus maps

ChLG X super!scaffold

U 3 4 5 6 7 U 8 9 10 11 12 13

The second scaffold from ChLG X aligned to scaffolds from a portion of ChLG 3

Tribolium super-scaffolds

min confidence 10

51 U 43 45 44 46

47 U U 152 48 49 50 52 53 54 U 57 55

BNG consensus maps

ChLG 3!scaffolds

BNG consensus maps

ChLG 3 super!scaffold

32 33 34 35 36 2 37 38 39 40 41 42

BNG consensus maps

ChLG 3 super!scaffold

BNG consensus maps

ChLG 3!scaffolds

BNG consensus maps

ChLG 3 super!scaffold

Two unplaced scaffolds aligned to ChLG X

Tribolium super-scaffolds

min confidence 10

From ChLGX, 11 of the previous 13 scaffolds were joined with two unplaced scaffolds (U) into one super scaffold.

BNG consensus maps

ChLG X!scaffolds

BNG consensus maps

ChLG X super!scaffold

U 3 4 5 6 7 U 8 9 10 11 12 13

4% Redundancy in alignment may be from assembly of haplotypes (generally observed as two BNG consensus maps aligning to the same in silico map)

Tribolium super-scaffolds

min confidence 10

From ChLGX, 11 of the previous 13 scaffolds were joined with two unplaced scaffolds (U) into one super scaffold.

BNG consensus maps

ChLG X!scaffolds

BNG consensus maps

ChLG X super!scaffold

U 3 4 5 6 7 U 8 9 10 11 12 13

Tribolium super-scaffolds overlapping BNG cmap

min confidence 10

From ChLGX, 11 of the previous 13 scaffolds were joined with two unplaced scaffolds (U) into one super scaffold.

BNG consensus maps

ChLG X!scaffolds

BNG consensus maps

ChLG X super!scaffold

U 3 4 5 6 7 U 8 9 10 11 12 13

For ChLG 9 21 scaffolds were reduced to 9

Tribolium super-scaffoldsmin confidence 10

ChLG9 currently (2150 scaffold_133 aligns but is not visible in IrysView?) why is Super_scaffold_65 backwards?

128 130 131 133 134 132 129 135 127 136 137 139 138 140 141 142 143 144 145

BNG consensus maps

ChLG 9!scaffolds

BNG consensus maps

ChLG 9 super!scaffold

For ChLG 5 17 scaffolds were reduced to 4

Tribolium super-scaffoldsmin confidence 10

BNG consensus maps

ChLG 5!scaffolds

BNG consensus maps

ChLG 5 super!scaffold

69 68 70 71 72 73 74 U 75 76 77 78 79 80 81 82 83

K-INBRE Bioinformatics Core!Susan Brown - PI Nic Herndon - script development Nanyan Lu - manual editing Michelle Coleman - extractions and running the Irys! Zachary Sliefert - metric summaries !Bionano Genomics!Ernest Lam - assembly pipeline best practices assistance Weiping Wang - assistance with data formats Palak Sheth - collaboration to standardize analysis !Script availability!https://github.com/i5K-KINBRE-script-share/Irys-scaffolding BNG scripts available by request from BNG

Acknowledgements

Gap lengths

Of the automated stitch.pl Tribolium super-scaffolds there were 66 gaps had known lengths and 26 had negative lengths (set to 100 (bp))

!Of the manually edited Tribolium super-scaffolds there were 66 gaps had

known lengths and 24 had negative lengths (set to 100 (bp))

Distribution of gap lengths for automated output

Gap length (bp)

Cou

nt

−1500000 −1000000 −500000 0 500000 1000000

05

1015

20 Negative gap lengthsPositive gap lengths