Fuzzypath – Algorithms, Applications and Future Developments Zemin Ning Sequence Assembly and...

Page 1: Fuzzypath – Algorithms, Applications and Future Developments Zemin Ning Sequence Assembly and Analysis.

Fuzzypath – Algorithms, Applications and Future Developments

Zemin Ning

Sequence Assembly and Analysis

Page 2:

Outline of the Talk:

- Sequence Reconstruction and Euler Path
- Assembly strategy
- Sequence extension using read pairs, base qualities, fuzzy kmers or longer reads
- Repeat junctions
- Installation, data process and running
- Gap5 - visual inspection for mis-assembly errors
- Integration into the Phusion pipeline

Page 3:

Sequence Repeat Graph

Figure labels: Repeat, Sequences

Page 4:

Sequence Reconstruction - Hamiltonian path approach

S = (ATGCAGGTCC)
ATG -> TGC -> GCA -> CAG -> AGG -> GGT -> GTC -> TCC

ATG AGG TGC TCC GTC GGT GCA CAG

Vertices: k-tuples from the spectrum shown in red (8);
Edges: overlapping k-tuples (7);
Path: visiting all vertices corresponding to the sequence.
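The Hamiltonian path formulation on this slide can be illustrated with a minimal Python sketch. It assumes every k-tuple in the spectrum is distinct (true for this example); the function names `spectrum`, `overlap_edges`, `hamiltonian_path` and `spell` are hypothetical and not part of Fuzzypath.

```python
def spectrum(seq, k):
    """All k-tuples of seq, in order of occurrence."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def overlap_edges(kmers):
    """Edge a -> b whenever the last k-1 bases of a equal the first k-1 of b."""
    return {a: [b for b in kmers if a != b and a[1:] == b[:-1]] for a in kmers}


def hamiltonian_path(kmers):
    """Backtracking search for a path that visits every vertex exactly once."""
    edges = overlap_edges(kmers)

    def walk(path, visited):
        if len(path) == len(kmers):
            return path
        for nxt in edges[path[-1]]:
            if nxt not in visited:
                found = walk(path + [nxt], visited | {nxt})
                if found:
                    return found
        return None

    for start in kmers:
        found = walk([start], {start})
        if found:
            return found
    return None


def spell(path):
    """Merge a path of overlapping k-tuples back into one sequence."""
    return path[0] + "".join(kmer[-1] for kmer in path[1:])


# Sort the spectrum so the search cannot rely on the original order.
kmers = sorted(spectrum("ATGCAGGTCC", 3))
path = hamiltonian_path(kmers)
print(" -> ".join(path))
print(spell(path))
```

The backtracking search is exponential in the worst case, which is the usual argument for preferring the Euler path formulation on real data.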

Page 5:

Sequence Reconstruction - Euler path approach

Vertices: correspond to (k-1)-tuples (7);
Edges: correspond to k-tuples from the spectrum (8);
Path: visiting all EDGES corresponding to the sequence.

Vertices: AT, GT, CG, CA, GC, TG, GG

Possible reconstructions: ATGCGTGGCA and ATGGCGTGCA

ATG -> TGG -> GGC -> GCG -> CGT -> GTG -> TGC -> GCA
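As a rough illustration of the Euler path formulation, the sketch below builds the graph from the eight k-tuples on this slide and enumerates every path that uses each edge exactly once. The function name `euler_reconstructions` is hypothetical, and the exhaustive enumeration is only practical for toy inputs; it is not how Fuzzypath is implemented.

```python
from collections import defaultdict


def euler_reconstructions(kmers):
    """Return every sequence spelled by a path that uses each k-tuple once."""
    graph = defaultdict(list)   # (k-1)-tuple -> list of successor (k-1)-tuples
    balance = defaultdict(int)  # out-degree minus in-degree per vertex
    for kmer in kmers:
        left, right = kmer[:-1], kmer[1:]
        graph[left].append(right)
        balance[left] += 1
        balance[right] -= 1

    # An Euler path starts at the vertex with one more outgoing than incoming
    # edge; if there is none (a closed tour), try every vertex as a start.
    starts = [v for v, b in balance.items() if b == 1] or list(graph)
    results = set()

    def walk(node, used, spelled):
        if len(used) == len(kmers):
            results.add(spelled)
            return
        for i, nxt in enumerate(graph.get(node, [])):
            if (node, i) not in used:
                walk(nxt, used | {(node, i)}, spelled + nxt[-1])

    for start in starts:
        walk(start, frozenset(), start)
    return sorted(results)


kmers = ["ATG", "TGG", "GGC", "GCG", "CGT", "GTG", "TGC", "GCA"]
for seq in euler_reconstructions(kmers):
    print(seq)
```

For this spectrum the enumeration finds two sequences, which is why additional information is needed to choose between paths at a branch.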

Page 6:

Assembly Strategy

Genome/Chromosome

Forward-reverse paired reads: 30-75 bp at each end, known distance ~500 bp

Solexa read assembler to extend short reads to 1-2 kb long reads

Capillary reads assembler: Phrap/Phusion

Page 7:

Kmer Extension & Walk

Page 8:

Base Quality to Filter Base Errors

Page 9:

Read Pairs in Repeat Junctions

Page 10:

Kmer Extension & Repeat Junctions

Means to handle repeats:
- Base quality
- Read pair
- Fuzzy kmers
- Closely related reference
- 454 or Sanger reads

Pileup of other reads (454, Sanger, etc.) at a repeat junction

Consensus
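To make the idea of a repeat junction concrete, here is a minimal sketch of a greedy k-mer extension that stops when more than one next base is possible. It is an assumption-laden toy, not Fuzzypath's algorithm: the functions `count_kmers` and `extend_right`, the `min_count` parameter and the two example reads are all hypothetical.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every k-mer that occurs in the reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def extend_right(seed, kmers, k, min_count=1):
    """Extend seed one base at a time; stop at a dead end, junction or cycle."""
    contig = seed
    seen = {seed[-k:]}
    while True:
        suffix = contig[-(k - 1):]
        candidates = [b for b in "ACGT" if kmers.get(suffix + b, 0) >= min_count]
        if len(candidates) != 1:
            # Several candidate bases mean a junction; none means a dead end.
            return contig, "junction" if candidates else "dead end"
        kmer = suffix + candidates[0]
        if kmer in seen:
            return contig, "cycle"
        seen.add(kmer)
        contig += candidates[0]


# Two hypothetical reads that share the repeat TTGCA and then diverge.
reads = ["ACGTTGCAAG", "GGATTGCATC"]
kmers = count_kmers(reads, 4)
print(extend_right("ACGT", kmers, 4))
```

At the junction the walk has no way to choose between the two continuations on k-mer evidence alone; the means listed on the slide (base quality, read pairs, fuzzy kmers, a related reference, longer reads) are the extra evidence that would be consulted at that point.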

Page 11:

Handling of Repeat Junctions

Page 12:

Handling of Single Base Variations

Page 13:

Fuzzypath Pipeline

Page 14:

Fuzzypath Read File

Page 15:

Fuzzypath Fastq File

Page 16:

Salmonella senftenberg Solexa Assembly from Pair-End Reads

Solexa reads:
Number of reads: 6,000,000;
Finished genome size: ~4.8 Mbp;
Read length: 2x37 bp;
Estimated read coverage: ~92.5 X;
Insert size: 170/50-300 bp;

Assembly features - contig stats:

                              Solexa      454
Total number of contigs:      75          390
Total bases of contigs:       4.80 Mbp    4.77 Mb
N50 contig size:              139,353     25,702
Largest contig:               395,600     62,040
Averaged contig size:         63,969      12,224
Contig coverage on genome:    ~99.8 %     99.4%
Contig extension errors:      0
Mis-assembly errors:          0           4
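The coverage figure on this slide can be checked with a short calculation. The sketch assumes that "Number of reads: 6,000,000" counts read pairs, each contributing 2 x 37 bases; under that assumption the arithmetic reproduces the stated ~92.5 X. The function name `read_coverage` is hypothetical.

```python
def read_coverage(num_pairs, read_length, genome_size):
    """Average depth: total sequenced bases divided by genome size."""
    return num_pairs * 2 * read_length / genome_size


# 6,000,000 pairs of 37 bp reads over a ~4.8 Mbp genome.
coverage = read_coverage(6_000_000, 37, 4_800_000)
print(f"{coverage:.1f}X")  # 92.5X
```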

Page 17:

maq

ssaha2

Page 18:

maq

ssaha2

Page 19:

Page 20:

maq

ssaha2

Page 21:

maq

ssaha2

Page 22:

New Phusion Assembler

Flowchart labels: Solexa Reads; Assembly; Reads Group; Data Process; Long Insert Reads; Supercontig; Contigs; PRono; Fuzzypath; Phrap; Velvet; 2x75 or 2x100

Page 23:

Human Assembly – COLO-829 Normal Cell

Solexa reads:
Number of reads: 557 Million;
Finished genome size: 3.0 Gb;
Read length: 2x75 bp;
Estimated read coverage: ~25X;
Insert size: 190/50-300 bp;
Number of reads clustered: 458 Million

Assembly features - contig stats:
Total number of contigs: 1,040,582;
Total bases of contigs: 2.703 Gb;
N50 contig size: 6,484;
Largest contig: 85,595;
Averaged contig size: 2,597;
Contig coverage over the genome: ~90 %;
Mis-assembly errors: ?
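For readers unfamiliar with the contig statistics quoted here, the sketch below shows one common way to compute N50 and checks the averaged contig size against the slide's totals. The `n50` function and the toy contig lengths are hypothetical; only the two totals (2.703 Gb and 1,040,582 contigs) come from the slide, and the first of those is rounded.

```python
def n50(lengths):
    """Length of the contig at which the running total, taken from the
    longest contig downwards, first reaches half of all assembled bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0


# Hypothetical contig lengths, for illustration only.
toy_contigs = [80, 70, 50, 40, 30, 20]
print(n50(toy_contigs))  # 70

# Averaged contig size from the slide's totals (integer division).
average_contig = 2_703_000_000 // 1_040_582
print(average_contig)  # 2597
```

The average agrees with the 2,597 reported on the slide, and 2.703 Gb out of 3.0 Gb is consistent with the stated ~90 % contig coverage.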

Page 24:

Acknowledgements:

Yong Gu, James Bonfield, Heng Li, Hannes Ponstingl, Daniel Zerbino (EBI), Helen Beasley, Siobhan Whitehead, Tony Cox