Cross_genome: Assembly Scaffolding using Cross-species Synteny Zemin Ning High Performance Assembly.



Can synteny help? And How?

Contig gap closure

Scaffolding

RACA - Reference-assisted chromosome assembly

[Figure: target sequence (Scaffold 1, Scaffold 2, Scaffold 3) against the reference]

Q = scaff(i) * 2^32 + contig_loci(j)
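The key Q combines a scaffold index and a contig locus into a single integer. A minimal Python sketch of that encoding follows, assuming contig_loci(j) is a non-negative position smaller than 2^32; the function names `pack_key` and `unpack_key` are illustrative and not taken from Cross_genome. Under that assumption, sorting by Q orders entries by scaffold first and by locus second.

```python
SHIFT = 2 ** 32  # multiplier from the formula Q = scaff(i) * 2^32 + contig_loci(j)


def pack_key(scaff_index: int, contig_locus: int) -> int:
    """Combine a scaffold index and a contig locus into one integer Q."""
    if not 0 <= contig_locus < SHIFT:
        # Assumption of this sketch: the locus fits in 32 bits.
        raise ValueError("contig_locus must be in [0, 2**32)")
    return scaff_index * SHIFT + contig_locus


def unpack_key(q: int) -> tuple[int, int]:
    """Recover (scaffold index, contig locus) from Q."""
    return divmod(q, SHIFT)


q = pack_key(3, 1_500_000)
print(q)              # 12886401888
print(unpack_key(q))  # (3, 1500000)
```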

Lattice of Target-Reference

After Noise Cleaning

[Figure: target sequence (Scaffold 1, Scaffold 2, Scaffold 3) against the reference, with positions X and Y marked]

Gap_size = Y - X
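As a worked illustration of Gap_size = Y - X, the sketch below assumes that X is the reference coordinate where one scaffold's alignment ends and Y is the reference coordinate where the next scaffold's alignment begins. That reading of X and Y, the `Placement` record, and the coordinates are assumptions made for the example only.

```python
from dataclasses import dataclass


@dataclass
class Placement:
    """Hypothetical record: where a target scaffold aligns on the reference."""
    name: str
    ref_start: int
    ref_end: int


def gap_size(left: Placement, right: Placement) -> int:
    """Gap_size = Y - X, with X = end of left and Y = start of right (assumed)."""
    x = left.ref_end
    y = right.ref_start
    return y - x


placements = [
    Placement("Scaffold 1", 1_000, 52_000),
    Placement("Scaffold 2", 55_500, 90_000),
    Placement("Scaffold 3", 91_200, 140_000),
]
# Order scaffolds along the reference, then measure each adjacent pair.
placements.sort(key=lambda p: p.ref_start)
gaps = [gap_size(a, b) for a, b in zip(placements, placements[1:])]
print(gaps)  # [3500, 1200]
```

A negative result would mean that the two placements overlap on the reference.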

Cases Shouldn’t Join

[Figure, two panels: Scaffold 1 and Scaffold 2 of the target against the reference, with Gap_size marked]

GAGE: Human Chr14 and RACA using Orangutan

Assembler                  N_bases  N_scaffs  N50 (Mb)
Allpaths-LG  Original      88.8     418       81.6
             RACA          86.8*
             Cross_genome  89       221       85.5
Bambus2      Original      78.6     1472      0.37
             RACA          72.1*
             Cross_genome  78.6     1094      13.7
CABOG        Original      86.5     498       0.4
             RACA          81.4*
             Cross_genome  86.3     46        85.5
MSR-CA       Original      89.7     1094      0.88
             RACA          83.4*
             Cross_genome  89.6     -         13.7
SGA          Original      94.7     30975     0.075
             RACA          57.4*
             Cross_genome  94.8     29662     77.3
SOAPdenovo   Original      108      38477     0.453
             RACA          84.4*
             Cross_genome  102.8    12955     78.9
Velvet       Original      143.8    61455     0.84
             RACA          123*
             Cross_genome  139.4    3278      8.71

* One value is given per RACA row; its column is not indicated.
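The N50 values quoted here use the standard contiguity metric: the length L such that sequences of length L or longer contain at least half of all assembled bases. A small Python sketch of that definition, with made-up scaffold lengths that are not taken from any of the assemblies above:

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths."""
    total = sum(lengths)
    running = 0
    # Walk from the longest sequence down until half the bases are covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("lengths must be non-empty")


print(n50([80, 70, 50, 40, 30, 20]))  # 70
```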

Scaffold N50 for Other Genome Assemblies

                  Original  Cross_genome  References
Panda             1.3Mb     25Mb          Dog, Human
Tibetan Antelope  2.6Mb     42Mb          Cattle, Dog, Human
Tasmanian Devil   1.8Mb     6.8Mb         Opossum

Availability

ftp://ftp.sanger.ac.uk/pub/users/zn1/merge/cross_genome/

Improve gorilla assembly using human reference

Contig Merge/Break

Variation correction

Contig gap size re-estimation

Read Alignment (Pair-wise/Multiple)

Combined Gorilla-Human Assembly

Human Reference

Gorilla Assembly

Final Gorilla Assembly

Re-estimate Contig Gap Sizes from Reference

[Figure: target sequence against the reference sequence, with "Gap size" and "New gap size" marked]

New gap size

Read alignment and variation correction

Ref seq inserted

Contig Consensus using Gap5

Target (query) aligned against Reference

Before

Target (query) aligned against Reference

Reference Sequence Replacement & Variation Correction

Variations: 2 indels (4 bp and 1 bp) corrected

Original Contig (query) against New Assembly after Contig Break

Alignment Inconsistency

The Gorilla Assemblies

                          Original  New
Total number of contigs:  464,875   285,139
N50 contig size:          11.7kb    23.9kb
Largest contig:           191,556   322,733
Average contig size:      6085      9928

Acknowledgements:

Hannes Ponstingl
Frank Liu – Nanjing University of Information Technology (NUIT)
Yan Li – (NUIT)
Gorilla genome sequencing data
BGI – Panda and Tibetan Antelope assemblies