Fuzzypath – Algorithms, Applications and Future Developments Zemin Ning Sequence Assembly and...
Fuzzypath – Algorithms, Applications and Future Developments
Zemin Ning
Sequence Assembly and Analysis
Outline of the Talk:
 - Sequence Reconstruction and Euler Path
 - Assembly strategy
 - Sequence extension using read pairs, base qualities, fuzzy kmers or longer reads
 - Repeat junctions
 - Installation, data process and running
 - Gap5 - visual inspection for mis-assembly errors
 - Integration into the Phusion pipeline
Sequence Repeat Graph
[Figure: sequences with regions labelled "Repeat"]
Sequence Reconstruction - Hamiltonian path approach
S = (ATGCAGGTCC)
ATG -> TGC -> GCA -> CAG -> AGG -> GGT -> GTC -> TCC
ATG AGG TGC TCC GTC GGT GCA CAG
Vertices: k-tuples from the spectrum shown in red (8);
Edges: overlapping k-tuples (7);
Path: visiting all vertices corresponding to the sequence.
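The Hamiltonian path formulation can be sketched in a few lines. This is only an illustration of the slide's example, not code from Fuzzypath; the function names (`kmer_spectrum`, `hamiltonian_paths`, `spell`) are made up here, and the brute-force search assumes a small spectrum of distinct k-tuples.

```python
def kmer_spectrum(seq, k):
    """Return all k-tuples of seq, in order of occurrence."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def hamiltonian_paths(kmers):
    """Yield every ordering that visits each k-tuple exactly once,
    where consecutive k-tuples overlap by k-1 bases."""
    def extend(path, remaining):
        if not remaining:
            yield list(path)
            return
        for nxt in sorted(remaining):
            # An edge exists when the suffix of one k-tuple is the prefix of the next.
            if path[-1][1:] == nxt[:-1]:
                path.append(nxt)
                yield from extend(path, remaining - {nxt})
                path.pop()

    for start in sorted(kmers):
        yield from extend([start], set(kmers) - {start})


def spell(path):
    """Merge an overlapping path of k-tuples back into a sequence."""
    return path[0] + "".join(kmer[-1] for kmer in path[1:])


spectrum = set(kmer_spectrum("ATGCAGGTCC", 3))
reconstructions = sorted({spell(p) for p in hamiltonian_paths(spectrum)})
print(reconstructions)
```

For this spectrum every k-tuple has at most one successor, so the search finds a single path. In general the Hamiltonian path problem is NP-complete, which is the usual motivation for moving to the Euler path formulation.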
Sequence Reconstruction - Euler path approach
Vertices: correspond to (k-1)-tuples (7);
Edges: correspond to k-tuples from the spectrum (8);
Path: visiting all EDGES corresponding to the sequence.
[Figure: graph with vertices AT, GT, CG, CA, GC, TG, GG]
ATGCGTGGCA, ATGGCGTGCA
ATG -> TGG -> GGC -> GCG -> CGT -> GTG -> TGC -> GCA
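The Euler path formulation on this slide can be sketched with Hierholzer's algorithm. This is a minimal illustration, not Fuzzypath's implementation; `euler_path` and `spell` are names chosen here, and the input is assumed to admit an Euler path.

```python
from collections import defaultdict


def euler_path(kmers):
    """Return a list of (k-1)-tuple vertices that uses every k-tuple edge once."""
    graph = defaultdict(list)
    indeg = defaultdict(int)
    outdeg = defaultdict(int)
    for kmer in kmers:
        left, right = kmer[:-1], kmer[1:]  # each k-tuple is an edge left -> right
        graph[left].append(right)
        outdeg[left] += 1
        indeg[right] += 1

    nodes = set(indeg) | set(outdeg)
    # Start at the vertex with one more outgoing than incoming edge, if any.
    start = next((n for n in nodes if outdeg[n] - indeg[n] == 1), next(iter(graph)))

    stack, path = [start], []
    while stack:
        node = stack[-1]
        if graph[node]:
            stack.append(graph[node].pop())  # follow an unused edge
        else:
            path.append(stack.pop())         # vertex exhausted: emit it
    path.reverse()
    return path


def spell(path):
    """Merge an overlapping vertex path back into a sequence."""
    return path[0] + "".join(node[-1] for node in path[1:])


spectrum = ["ATG", "TGG", "GGC", "GCG", "CGT", "GTG", "TGC", "GCA"]
sequence = spell(euler_path(spectrum))
print(sequence)
```

The spectrum has two Euler paths, spelling ATGCGTGGCA and ATGGCGTGCA, so which one is printed depends on the order in which edges are followed. The k-tuples alone cannot distinguish the two, which is where extra evidence such as read pairs comes in.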
Assembly Strategy
Solexa read assembler to extend short reads to 1-2 kb long reads
Genome/Chromosome
Capillary reads assembler: Phrap/Phusion
Forward-reverse paired reads: two reads of 30-75 bp at a known distance of ~500 bp
Kmer Extension & Walk
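A kmer walk can be sketched as follows. The slide gives no algorithmic detail, so this is an assumed, simplified version: extend a seed one base at a time while exactly one successor kmer exists, and stop at a dead end or at a branch (a possible repeat junction). The names `kmers_of` and `walk_right` are hypothetical, and k is assumed to be at least 2.

```python
def kmers_of(reads, k):
    """Collect the set of all kmers found in the reads."""
    return {r[i:i + k] for r in reads for i in range(len(r) - k + 1)}


def walk_right(seed, kmer_set, max_len=10_000):
    """Extend seed to the right while the next kmer is unambiguous."""
    k = len(seed)
    contig = seed
    seen = {seed}
    while len(contig) < max_len:
        tail = contig[-(k - 1):]
        successors = [tail + base for base in "ACGT" if tail + base in kmer_set]
        if len(successors) != 1:
            break  # dead end (0) or branch (2+): stop extending
        nxt = successors[0]
        if nxt in seen:
            break  # already visited: avoid walking around a loop
        seen.add(nxt)
        contig += nxt[-1]
    return contig


# Hypothetical error-free reads of length 5 tiled over ATGCAGGTCC.
reads = ["ATGCA", "TGCAG", "GCAGG", "CAGGT", "AGGTC", "GGTCC"]
contig = walk_right("ATG", kmers_of(reads, 3))
print(contig)

# With a branching spectrum the walk stops at the junction after the seed.
branching = {"ATG", "TGG", "GGC", "GCG", "CGT", "GTG", "TGC", "GCA"}
print(walk_right("ATG", branching))
```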
Base Quality to Filter Base Errors
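One simple way to use base qualities is to discard any kmer that contains a low-quality base. The sketch below assumes Phred+33 encoded quality strings and a threshold of 20; both are illustrative choices, not values taken from Fuzzypath, and `high_quality_kmers` is a hypothetical helper.

```python
def high_quality_kmers(seq, qual, k, min_q=20, offset=33):
    """Return the kmers of seq whose bases all have quality >= min_q."""
    scores = [ord(c) - offset for c in qual]  # decode the quality string
    kept = []
    for i in range(len(seq) - k + 1):
        if min(scores[i:i + k]) >= min_q:
            kept.append(seq[i:i + k])
    return kept


# 'I' decodes to Q40 and '#' to Q2, so every kmer covering the fifth base is dropped.
kept = high_quality_kmers("ATGCAGG", "IIII#II", 3)
print(kept)
```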
Read Pairs in Repeat Junctions
Means to handle repeats:
 - Base quality
 - Read pair
 - Fuzzy kmers
 - Closely related reference
 - 454 or Sanger reads
Kmer Extension & Repeat Junctions
Pileup of other reads like 454, Sanger etc at a repeat junction
Consensus
Handling of Repeat Junctions
Handling of Single Base Variations
Fuzzypath Pipeline
Fuzzypath Read File
Fuzzypath Fastq File
Solexa reads:
Number of reads: 6,000,000;
Finished genome size: ~4.8 Mbp;
Read length: 2x37 bp;
Estimated read coverage: ~92.5 X;
Insert size: 170/50-300 bp;
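The coverage figure can be checked with simple arithmetic. The stated ~92.5 X is reproduced if the 6,000,000 "reads" are counted as read pairs, each contributing 2x37 bases; that interpretation is an assumption made here, and `read_coverage` is a hypothetical helper.

```python
def read_coverage(num_pairs, read_len, genome_size):
    """Coverage = total sequenced bases / genome size, for paired reads."""
    return num_pairs * 2 * read_len / genome_size


# 6,000,000 pairs of 37 bp reads over a ~4.8 Mbp genome.
coverage = read_coverage(6_000_000, 37, 4_800_000)
print(coverage)  # 92.5
```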
Assembly features - contig stats
                              Solexa      454
Total number of contigs:      75          390
Total bases of contigs:       4.80 Mbp    4.77 Mbp
N50 contig size:              139,353     25,702
Largest contig:               395,600     62,040
Averaged contig size:         63,969      12,224
Contig coverage on genome:    ~99.8 %     99.4 %
Contig extension errors:      0
Mis-assembly errors:          0           4
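For readers unfamiliar with the N50 statistic quoted above: it is the contig length at which contigs of that size or longer hold at least half of the assembled bases. A minimal sketch follows; the contig lengths are invented for illustration and are not from either assembly.

```python
def n50(lengths):
    """Return the N50 of a list of contig lengths."""
    if not lengths:
        raise ValueError("no contigs")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:  # reached half of the assembled bases
            return length


# Hypothetical contig lengths: total 300, and 80 + 70 reaches half of it.
print(n50([80, 70, 50, 40, 30, 20, 10]))  # 70
```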
Salmonella Senftenberg Solexa Assembly from Pair-End Reads
[Figure: four panels, each labelled maq and ssaha2]
New Phusion Assembler
[Figure: pipeline flowchart with labels Solexa Reads, Assembly, Reads Group, Data Process, Long Insert Reads, Supercontig, Contigs, PRono, Fuzzypath, Phrap, Velvet, 2x75 or 2x100]
Solexa reads:
Number of reads: 557 Million;
Finished genome size: 3.0 Gb;
Read length: 2x75 bp;
Estimated read coverage: ~25X;
Insert size: 190/50-300 bp;
Number of reads clustered: 458 Million
Assembly features - contig stats
Total number of contigs: 1,040,582;
Total bases of contigs: 2.703 Gb;
N50 contig size: 6,484;
Largest contig: 85,595;
Averaged contig size: 2,597;
Contig coverage over the genome: ~90 %;
Mis-assembly errors: ?
Human Assembly – COLO-829 Normal Cell
Acknowledgements:
Yong Gu, James Bonfield, Heng Li, Hannes Ponstingl, Daniel Zerbino (EBI), Helen Beasley, Siobhan Whitehead, Tony Cox