BLAST Genome assembly Conclusion
BLAST & Genome assembly
Solon P. Pissis Tomas Flouri
Heidelberg Institute for Theoretical Studies
May 15, 2014
1 BLAST
  What is BLAST?
  The algorithm
2 Genome assembly
  De novo assembly
  Mapping assembly
3 Conclusion
  Overview
What is BLAST?
One of the most widely used algorithms in Bioinformatics
Purpose: Query large databases for similar sequences
“Google for molecular biology”
Publicly available at: http://blast.st-va.ncbi.nlm.nih.gov/Blast.cgi
BLAST: a set of programs
Basic Local Alignment Search Tool (BLAST) is a set of programs for fast, approximate comparison of biological sequences.
In particular, BLAST is useful for comparing a query sequence against a library or database of sequences, in order to identify library sequences that resemble the query sequence above a certain threshold.
The five traditional BLAST implementations are:
BLASTN: both the database and the query are nucleotide sequences
BLASTP: both the database and the query are protein sequences
BLASTX: the database contains protein sequences and the query is a nucleotide sequence translated into protein
TBLASTN: the database contains nucleotide sequences translated into protein and the query is a protein sequence
TBLASTX: both the database and the query are nucleotide sequences translated into protein
The algorithm
“Why not the Smith-Waterman algorithm?”
The Smith-Waterman algorithm computes the optimal (maximum-scoring) local alignment between two sequences.
In biological applications, we usually need to infer the statistically significant alignments very fast.
BLAST does not explore the entire search space (the DP matrix); it restricts the search space for efficiency...
...at the cost of sensitivity.
It uses three layers of rules to sequentially identify and refine potential high-scoring pairs (HSPs).
These heuristic layers (seeding, extension, and evaluation) form a stepwise refinement procedure.
This allows sampling the entire search space without wasting time on dissimilar regions.
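To make the cost of the exhaustive approach concrete, here is a minimal sketch of Smith-Waterman scoring. It fills every cell of the DP matrix, so the work grows with the product of the two sequence lengths, which is what BLAST avoids. The function name and the scoring values (match, mismatch, linear gap cost) are illustrative assumptions, not taken from the slides.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score of a and b (assumed linear gap cost)."""
    prev = [0] * (len(b) + 1)  # previous row of the DP matrix
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, hence the floor at 0.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best
```

Every one of the len(a) x len(b) cells is visited, even for completely dissimilar sequences.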
The algorithm: seeding
BLAST assumes that significant alignments have common subwords (substrings or factors) of a fixed length W.
It first determines the locations of all the common exactly matching substrings, which are called word hits.
Only those regions with word hits will be used as alignment seeds.
In this way BLAST ignores a large fraction of the search space.
The neighborhood of a subword contains the word itself and all other words whose score is ≥ T when compared via the substitution matrix to the subword.
We may adjust T to control the size of the neighborhood, which affects speed and sensitivity.
Hence, the interplay between W, T, and the substitution matrix is critical.
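The seeding step can be sketched as follows. This is a toy illustration, not BLAST's actual implementation: a simple match/mismatch score over the nucleotide alphabet stands in for a real substitution matrix, and the names `score`, `neighborhood`, and `word_hits` are hypothetical.

```python
from itertools import product

ALPHABET = "ACGT"  # toy alphabet (assumption)

def score(u, v, match=2, mismatch=-1):
    """Toy stand-in for a substitution-matrix score of two equal-length words."""
    return sum(match if x == y else mismatch for x, y in zip(u, v))

def neighborhood(word, T):
    """All words of the same length scoring at least T against `word`."""
    candidates = ("".join(c) for c in product(ALPHABET, repeat=len(word)))
    return {w for w in candidates if score(word, w) >= T}

def word_hits(query, subject, W, T):
    """(query_pos, subject_pos) pairs whose words are neighbors under threshold T."""
    index = {}  # subject word -> list of start positions
    for j in range(len(subject) - W + 1):
        index.setdefault(subject[j:j + W], []).append(j)
    hits = []
    for i in range(len(query) - W + 1):
        for w in neighborhood(query[i:i + W], T):
            hits.extend((i, j) for j in index.get(w, []))
    return sorted(hits)
```

With W = 3 and this scoring, T = 6 admits only the word itself, while T = 3 also admits every word with one mismatch, so lowering T enlarges the neighborhood and produces more seeds.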
The algorithm: extension
Once the search space is seeded, alignments can be generated by starting from the individual seeds.
BLAST extends each seed into a longer alignment between the query and the database sequence, to the left and to the right of the word.
It only searches a subset of the space, so it needs a mechanism to know when to stop the extension procedure.
It uses a threshold X representing how much the score is allowed to drop off since the last maximum.
The extension is stopped as soon as the running score decreases by more than X compared with the highest value obtained during the extension process.
The alignment is then trimmed back to the maximum score.
It is generally a good idea to use a large value for X, which reduces the risk of premature termination.
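The drop-off rule above can be sketched for the ungapped case. This is a minimal illustration under assumed match/mismatch scores; the function names are hypothetical and gapped extension is not covered.

```python
def extend_one_direction(pairs, X, match=1, mismatch=-1):
    """Scan aligned character pairs; return (best score gain, length kept)."""
    total = best = best_len = 0
    for n, (x, y) in enumerate(pairs, start=1):
        total += match if x == y else mismatch
        if total > best:
            best, best_len = total, n   # new maximum: remember where it occurred
        elif best - total > X:
            break                       # dropped more than X below the maximum
    return best, best_len               # implicitly trimmed back to the maximum

def extend_seed(query, subject, qpos, spos, W, X, match=1, mismatch=-1):
    """Extend a length-W seed both ways; return (score, query_start, query_end)."""
    seed = sum(match if query[qpos + k] == subject[spos + k] else mismatch
               for k in range(W))
    right = zip(query[qpos + W:], subject[spos + W:])
    left = zip(reversed(query[:qpos]), reversed(subject[:spos]))
    r_gain, r_len = extend_one_direction(right, X, match, mismatch)
    l_gain, l_len = extend_one_direction(left, X, match, mismatch)
    return seed + l_gain + r_gain, qpos - l_len, qpos + W + r_len
```

For the query ACGTTACGT against ACGTAACGT with a seed at position 0 and W = 3, X = 0 stops at the single mismatch and keeps only ACGT, whereas X = 2 tolerates it and extends over the full length.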
The algorithm: evaluation
Once seeds have been extended in both directions to create alignments, these alignments are evaluated (post-processed) to determine whether they are statistically significant.
The significant alignments are termed HSPs (high-scoring pairs).
At the simplest level we can use an optional, empirically determined alignment score threshold (cut-off) S to sort the alignments into low and high scoring.
By examining the distribution of alignment scores obtained by comparing random sequences, S can be chosen large enough to guarantee the significance of the remaining HSPs.
BLAST next assesses the statistical significance of each HSP score by using a so-called expect value or E-value.
It computes the probability p of observing a score S equal to or greater than a score x by exploiting the Gumbel extreme value distribution (GEVD).
It has been shown that the distribution of Smith-Waterman local alignment scores between two random sequences follows the GEVD.
The computation of p is based on statistical parameters that depend on the substitution matrix, the gap penalties, and the problem size.
The expect value E (computed from p) of a database match is the expected number of times that a random sequence would obtain a score S higher than x by chance.
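The slides do not spell out the formula. A commonly cited form is the Karlin-Altschul relation E = K·m·n·exp(−λx), with p = 1 − exp(−E), where m and n are the query and database lengths and K and λ are the statistical parameters that depend on the scoring system. The sketch below assumes that form; the function and parameter names are illustrative.

```python
import math

def expect_value(score, m, n, K, lam):
    """Expected number of chance HSPs scoring at least `score` (Karlin-Altschul form)."""
    return K * m * n * math.exp(-lam * score)

def p_value(score, m, n, K, lam):
    """Probability of at least one chance HSP scoring at least `score`: 1 - exp(-E)."""
    return -math.expm1(-expect_value(score, m, n, K, lam))
```

E falls exponentially as the score rises and grows linearly with the search space m·n; for small E, p and E are nearly equal.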
Genome assembly: DNA sequencing
ATTAGCATAC...
Genome assembly
Genome assembly is the process of taking a huge number of DNA sequences and putting them back together to create a representation of the genome from which the DNA originated.
De novo: assembling short reads to create full-length (sometimes novel) sequences.
Mapping: assembling reads by aligning them against an existing reference sequence, building a sequence that is similar but not necessarily identical to the reference.
Genome assembly is generally a very difficult computational problem and, since 2005, probably one of the hottest topics in bioinformatics.
In terms of time and space complexity, de novo assembly is orders of magnitude slower and more memory intensive than mapping assembly.
De novo assembly: what is it?
A de novo assembly is a hierarchical data structure that maps the sequence data to a putative reconstruction of the target.
It groups reads into contigs and contigs into scaffolds.
Contigs provide a multiple sequence alignment of reads plus the consensus sequence.
Scaffolds define the contig order and orientation and the sizes of the gaps between contigs, using mate-pair (paired-end) information.
Assemblies are measured by the size and accuracy of their contigs and scaffolds.
Assembling a genome from many short NGS reads requires a different approach from the methods developed for the fewer but longer reads produced by Sanger sequencing.
There are two basic algorithmic approaches to de novo assembly: overlap graphs and de Bruijn graphs.
De novo assembly algorithms: Overlap graphs
Figure: Colored nucleotides indicate overlaps between reads
Compute all pairwise overlaps between the reads and capture this information in a graph.
Each node in the graph corresponds to a read, and an edge denotes an overlap between two reads.
The overlap graph is used to compute an arrangement of reads and a consensus sequence of contigs.
This method works best when there is a small number of reads with significant overlap.
Some NGS assemblers use overlap graphs, but this traditional approach is computationally intensive: even a de novo assembly of a small genome needs millions of reads, making the overlap graph extremely large.
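A naive construction of the overlap graph can be sketched as below. It uses exact suffix-prefix matches only (real assemblers tolerate sequencing errors) and compares every ordered pair of reads, which is where the quadratic cost comes from. The function names and the `min_len` parameter (assumed to be at least 1) are illustrative.

```python
def overlap(a, b, min_len):
    """Length of the longest suffix of a equal to a prefix of b, or 0 if below min_len."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a[-k:] == b[:k]:
            return k
    return 0

def overlap_graph(reads, min_len):
    """Edges (i, j) -> overlap length, over all ordered pairs of distinct reads."""
    graph = {}
    for i, a in enumerate(reads):
        for j, b in enumerate(reads):
            if i != j:
                k = overlap(a, b, min_len)
                if k:
                    graph[(i, j)] = k
    return graph
```

For the reads ATTAGC, AGCATA, and ATACGG with `min_len=3`, the graph has the two edges 0 → 1 and 1 → 2, each with overlap 3, which chain the reads into ATTAGCATACGG.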
Walking along a Hamiltonian cycle (visiting each vertex once) by following the edges in numerical order allows one to reconstruct the genome by combining alignments between successive reads.
This method, however, although simple, is computationally extremely expensive.
A million reads will require a trillion pairwise alignments.
There is no known efficient algorithm for finding a Hamiltonian cycle.
De novo assembly algorithms: de Bruijn graphs
Figure: The trick is to construct the de Bruijn graph by representing all k-mer prefixes and suffixes as nodes and then drawing edges that represent k-mers having a particular prefix and suffix
Most NGS assemblers use de Bruijn graphs.
De Bruijn graphs reduce the computational effort by breaking reads into smaller sequences of DNA, called k-mers, where the parameter k denotes the length in bases of these sequences.
The de Bruijn graph captures overlaps of length k − 1 between these k-mers and not between the actual reads.
By reducing the entire data set down to k-mer overlaps, the de Bruijn graph reduces redundancy in short-read data sets (identical k-mers are represented by a unique node in the graph).
The most efficient k-mer size for a particular assembly is determined by the read length as well as the error rate; k has a significant influence on the quality of the assembly.
Another attractive property of de Bruijn graphs is that repeats in the genome can be collapsed in the graph and do not lead to many spurious overlaps.
Finding an Eulerian cycle (a cycle that visits each edge exactly once) allows one to reconstruct the genome by forming an alignment in which each successive k-mer (from successive edges) is shifted by one position.
Hence we avoid the computationally expensive task of finding a Hamiltonian cycle.
As we visit all edges of the de Bruijn graph, which represent all possible k-mers, we can spell out a candidate genome; for each edge we traverse, we record the first nucleotide of the k-mer assigned to that edge.
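The traversal just described can be sketched as follows. This is only a toy version under stated assumptions: the genome is circular, every k-mer occurs once, the graph is connected and balanced, and Hierholzer's algorithm is used to find the Eulerian cycle (the slides do not name a specific algorithm). The helper names `cyclic_kmers`, `eulerian_cycle` and `spell_genome` are hypothetical.

```python
from collections import defaultdict

def cyclic_kmers(genome, k):
    """All k-mers of a circular genome (the text wraps around at the end)."""
    wrapped = genome + genome[:k - 1]
    return {wrapped[i:i + k] for i in range(len(genome))}

def eulerian_cycle(kmer_set):
    """Return the k-mers (edges) in the order an Eulerian cycle visits them.

    Nodes are (k-1)-mers; each k-mer is an edge from its prefix to its suffix.
    Assumes the graph is connected and every node has equal in/out degree.
    """
    graph = defaultdict(list)
    for kmer in sorted(kmer_set):
        graph[kmer[:-1]].append(kmer)
    start = next(iter(graph))
    stack = [(start, None)]          # (current node, edge used to reach it)
    cycle = []
    while stack:
        node, edge = stack[-1]
        if graph[node]:              # an unused outgoing edge remains
            kmer = graph[node].pop()
            stack.append((kmer[1:], kmer))
        else:                        # dead end: the edge is final, record it
            stack.pop()
            if edge is not None:
                cycle.append(edge)
    cycle.reverse()
    return cycle

def spell_genome(cycle):
    """Record the first nucleotide of the k-mer on each traversed edge."""
    return "".join(kmer[0] for kmer in cycle)

genome = "ATGATC"                    # hypothetical circular genome
edges = cyclic_kmers(genome, 3)
cycle = eulerian_cycle(edges)
candidate = spell_genome(cycle)
print(candidate)                     # a rotation of the circular genome
```

Because the genome is circular, the candidate is only determined up to rotation; with repeated k-mers several distinct Eulerian cycles, and hence several candidate genomes, may exist.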
De novo assembly: a note for Computer Scientists
A simple formulation of the de novo assembly problem as an optimization problem phrases it as a classical problem of algorithms on strings: the Shortest Common Superstring (SCS) problem.
Input: strings s1, s2, ..., sk, where si ∈ Σ*.
Output: the shortest string s containing each si as a factor.
E.g. given s1 = abaab, s2 = baba, s3 = aabbb, and s4 = bbab, we want to output s = bbabaabbb.
The decision version of the SCS problem has been shown to be NP-complete (via the Traveling Salesman problem)!
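For tiny inputs such as the example above, the SCS can be found by exhaustive search. The sketch below tries every ordering of the strings and merges neighbours with their maximal overlap; it runs in factorial time and is meant only to check the example, not as a practical assembler. The function names are illustrative.

```python
from itertools import permutations

def overlap(a, b):
    """Length of the longest suffix of a that is also a prefix of b."""
    for length in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:length]):
            return length
    return 0

def shortest_common_superstring(strings):
    """Brute force: try every order and merge with maximal overlaps."""
    best = None
    for order in permutations(strings):
        merged = order[0]
        for s in order[1:]:
            if s in merged:          # already covered, nothing to append
                continue
            merged += s[overlap(merged, s):]
        if best is None or len(merged) < len(best):
            best = merged
    return best

strings = ["abaab", "baba", "aabbb", "bbab"]
scs = shortest_common_superstring(strings)
print(scs)  # bbabaabbb
```

The exponential growth of the number of orderings is consistent with the hardness result stated above.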
Mapping assembly
Mapping assembly: what is it?
[Figure: a reference sequence ATTAGCATAC... of about 3 GB; at depth 10, the read data amounts to 10 × 3 GB = 30 GB]
Hundreds of millions of short reads (dozens or hundreds of gigabytes) must be mapped (aligned) against a reference sequence (3 Gb for human).
Definition
Given a text t of length n, where t ∈ Σ+, Σ = {A, C, G, T}, a set {p1, p2, ..., pr} of patterns, each of length m < n, where pi ∈ Σ+ for all 1 ≤ i ≤ r, and an integer e < m, find, for each pi, all the factors of t of length m that are at Hamming distance at most e from pi. Here Σ+ denotes the set of all strings over the alphabet Σ except the empty string ε.
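The definition can be read directly as a brute-force procedure. The following Python sketch compares the pattern with every length-m factor of the text; the text and pattern are small made-up examples, and `map_read` is a hypothetical name.

```python
def hamming_distance(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))

def map_read(text, pattern, e):
    """Start positions of all length-m factors of text within distance e."""
    m = len(pattern)
    return [i for i in range(len(text) - m + 1)
            if hamming_distance(text[i:i + m], pattern) <= e]

text = "ATTAGCATAC"                      # toy reference
print(map_read(text, "AGCA", 1))         # [3]
print(map_read(text, "AGCA", 2))         # [0, 3]
```

Each call inspects up to n − m + 1 windows of m characters, which is why this direct approach does not scale to a whole genome and millions of reads.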
Mapping assembly: algorithms
The most straightforward way of finding all the occurrences of a read, if no gap is allowed, consists in sliding the read along the genome sequence and noting the positions where there exists a match.
Unfortunately, although conceptually simple, this algorithm is very costly: in the worst case it takes O(nm) time per read, for a genome of length n and reads of length m.
When gaps are allowed, one has to resort to traditional dynamic programming algorithms, such as the Needleman-Wunsch algorithm.
Unfortunately, the cost becomes even larger, since a dynamic programming table with Θ(nm) cells has to be filled for every read.
Therefore, to be efficient, all the methods must rely on some sort of pre-processing, i.e. indexing the genome to provide direct and fast access to its substrings of a given size, using either hashing-based indexes or Burrows-Wheeler-transform-based indexes.
Mapping assembly algorithms: hashing
Store the positions of the k-mers in an array of linked lists, using a value of k significantly less than the read size, say k = 9.
In terms of space, the problem is tractable since there are at most $4^9 = 262144$ different 9-mers in the genome.
Select a k-mer for each read (a good choice is the leftmost part, because the quality is better) and map it to the genome using the hashing procedure: this is the seed.
For each possible hit, the procedure then tries to map the rest of the read to the genome (extend), possibly allowing errors, with a Needleman-Wunsch-like algorithm.
This two-step strategy is called seed and extend.
The drawback is that seeds are usually highly repeated in the reference genome: huge linked lists!
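A minimal sketch of seed and extend is given below. Two simplifications are assumed: a Python dictionary of lists stands in for the array of linked lists, and the extension step only counts mismatches instead of running a Needleman-Wunsch-like alignment with gaps. The genome, read and function names are invented for the example.

```python
from collections import defaultdict

def build_index(genome, k):
    """Hash every k-mer of the genome to the list of its start positions."""
    index = defaultdict(list)
    for pos in range(len(genome) - k + 1):
        index[genome[pos:pos + k]].append(pos)
    return index

def seed_and_extend(read, genome, index, k, max_mismatches):
    """Look up the leftmost k-mer of the read (seed), then extend each hit.

    Simplification: the extension is gap-free and only counts mismatches.
    """
    hits = []
    for pos in index.get(read[:k], []):
        window = genome[pos:pos + len(read)]
        if len(window) < len(read):      # read would run off the genome end
            continue
        mismatches = sum(a != b for a, b in zip(read[k:], window[k:]))
        if mismatches <= max_mismatches:
            hits.append((pos, mismatches))
    return hits

genome = "ACGTACGTTAGCACGTAA"            # toy reference
index = build_index(genome, 4)
print(index["ACGT"])                     # [0, 4, 12]
print(seed_and_extend("ACGTTAGG", genome, index, 4, 1))   # [(4, 1)]
```

The seed ACGT occurs three times, so three extensions are attempted for a single read; on a real genome with short seeds these position lists become very long, which is the drawback noted above.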
A better approach is to divide each read into q equally long non-overlapping substrings.
Suppose that one allows for e mismatches.
At least q − e out of the q substrings can be mapped exactly (in the worst case, the e errors are located in e different substrings, thus leaving q − e substrings without error).
The above follows immediately from the pigeonhole principle and is known as the filtering or partitioning into exact matches strategy.
The q − e substrings that exactly match the genome constitute an anchor.
There exist $\binom{q}{q-e}$ possible anchor combinations of the q fragments of a read that we have to check and also extend.
In practice, for the seed part, we use q = 4 and e = 2: $\binom{q}{q-e} = \binom{4}{2} = 6$ combinations.
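The pigeonhole argument and the count of anchor combinations can be checked on a small example. The sketch assumes that the read length is divisible by q; the read, the genome window and the helper names are hypothetical.

```python
from itertools import combinations
from math import comb

def split_read(read, q):
    """Cut a read into q equally long, non-overlapping pieces.

    Assumes len(read) is a multiple of q.
    """
    size = len(read) // q
    return [read[i * size:(i + 1) * size] for i in range(q)]

def exact_pieces(read, window, q):
    """Indices of the pieces of the read that match the window exactly."""
    pairs = zip(split_read(read, q), split_read(window, q))
    return [i for i, (a, b) in enumerate(pairs) if a == b]

def anchor_combinations(q, e):
    """Every choice of q - e pieces that could form an anchor."""
    return list(combinations(range(q), q - e))

q, e = 4, 2
read = "ACGTACGTACGTACGT"
window = "ACGTAGGTACGTACTT"      # genome window with two mismatches
matching = exact_pieces(read, window, q)
print(matching)                               # [0, 2]
print(len(anchor_combinations(q, e)), comb(q, q - e))   # 6 6
```

With two mismatches falling in two different pieces, exactly q − e = 2 pieces remain error-free, which is the worst case described above.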
Mapping assembly algorithms: BWT
The Burrows-Wheeler Transform (BWT) (Burrows and Wheeler, 1994) is an algorithm used in data compression applications such as bzip2.
It can be applied to create a permanent index of the reference sequence, which may be re-used across mapping runs.
Consider the n × n matrix in which each row contains a different cyclic rotation of the original text of length n.
Sort the rows lexicographically.
The BWT is the rightmost column in the sorted matrix.
If the text has several repeating substrings, then the BWT will have several places where a single character is repeated; e.g. BWT(mississippi) = pssmipissii.
The remarkable thing about the BWT is that it is reversible: the original text can be regenerated from the last column alone!
mississippi
imississipp
pimississip
ppimississi
ippimississ
sippimissis
ssippimissi
issippimiss
sissippimis
ssissippimi
ississippim
Table: n × n matrix of the cyclic rotations of mississippi
Prefix of n − 1 letters    nth letter (BWT)
imississip                 p
ippimissis                 s
issippimis                 s
ississippi                 m
mississipp                 i
pimississi                 p
ppimississ                 i
sippimissi                 s
sissippimi                 s
ssippimiss                 i
ssissippim                 i
Table: n × n lexicographically sorted matrix of the cyclic rotations of mississippi
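The two tables can be reproduced with a few lines of Python. This is a naive sketch that materialises all rotations, so it needs quadratic space and is only suitable for short strings. One assumption is made explicit: since no end-of-text sentinel is used here, the inverse also needs the row at which the original text sits in the sorted matrix.

```python
def bwt(text):
    """Return (last column of the sorted rotation matrix, row of the text)."""
    n = len(text)
    rotations = sorted(text[i:] + text[:i] for i in range(n))
    last_column = "".join(row[-1] for row in rotations)
    return last_column, rotations.index(text)

def inverse_bwt(last_column, row):
    """Rebuild the sorted matrix column by column and read off one row.

    Each pass prepends the last column to the partial rows and re-sorts;
    after n passes the table holds the sorted rotations.
    """
    n = len(last_column)
    table = [""] * n
    for _ in range(n):
        table = sorted(last_column[i] + table[i] for i in range(n))
    return table[row]

transformed, row = bwt("mississippi")
print(transformed, row)                  # pssmipissii 4
print(inverse_bwt(transformed, row))     # mississippi
```

Practical implementations usually append a unique sentinel character instead of storing the row index, and derive the transform from a suffix array rather than from explicit rotations.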
There exists a direct relationship between the BWT and the suffix array, an efficient indexing data structure from which we may obtain the BWT directly.
The amount of storage needed for the BWT, however, is significantly smaller than that needed for the suffix array.
An increasing number of algorithms are being developed to search these compressed full-text indexes, permitting fast substring queries; the best known is the FM-index (Ferragina and Manzini, 2000).
It can be used to efficiently find the number of occurrences of a pattern within the compressed text, as well as to locate the position of each occurrence.
Both the query time and storage space requirements are sublinear with respect to the size of the input data.
Most recent mapping tools are based on such BWT indexes.
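To make the link between the suffix array, the BWT and pattern counting concrete, here is a toy sketch of backward search, the core query of an FM-index. Several simplifications are assumed: a sentinel `$` is appended to the text, the suffix array is built by naive sorting and kept in full, and character ranks are recomputed with `str.count` instead of being read from compressed, sampled tables. It therefore shows the logic only, not the sublinear space or time of a real FM-index.

```python
def suffix_array(text):
    """Naive suffix array: start positions of the suffixes in sorted order."""
    return sorted(range(len(text)), key=lambda i: text[i:])

def bwt_from_suffix_array(text, sa):
    """BWT[i] is the character preceding the i-th smallest suffix."""
    # For i == 0, text[-1] wraps around to the final sentinel character.
    return "".join(text[i - 1] for i in sa)

def build_c_table(text):
    """For each character, count the characters in text that are smaller."""
    c_table, total = {}, 0
    for char in sorted(set(text)):
        c_table[char] = total
        total += text.count(char)
    return c_table

def backward_search(pattern, bwt, c_table):
    """Return the half-open suffix array range [lo, hi) matching pattern."""
    lo, hi = 0, len(bwt)
    for char in reversed(pattern):
        if char not in c_table:
            return 0, 0
        lo = c_table[char] + bwt.count(char, 0, lo)   # rank of char before lo
        hi = c_table[char] + bwt.count(char, 0, hi)   # rank of char before hi
        if lo >= hi:
            return 0, 0
    return lo, hi

text = "mississippi" + "$"               # "$" is an assumed sentinel
sa = suffix_array(text)
bwt = bwt_from_suffix_array(text, sa)
c_table = build_c_table(text)
lo, hi = backward_search("ssi", bwt, c_table)
positions = sorted(sa[lo:hi])
print(bwt)                               # ipssm$pissii
print(hi - lo, positions)                # 2 [2, 5]
```

The width of the final range gives the number of occurrences, and the suffix array entries inside it give their positions.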
Mapping assembly algorithms: some experiments
Table: Mapping 25,000,000 64 bp-long simulated reads to the human chromosome 6 (166,880,988 bp)

Programme    Total time               Reads aligned
             Indexing    Mapping
SOAP2        5m10s       28m25s       22,699,605
REAL -q 0    0m00s       26m43s       22,509,708
Bowtie       7m35s       49m11s       21,594,916
REAL -q 1    0m00s       31m54s       22,519,739

All programmes were run with a 48 bp-long seed, with at most two mismatches in the seed, and reported best hits only.
Conclusion
Overview
Recent technological advances have dramatically improved next-generation sequencing throughput and quality.
In parallel with the technological improvements that have increased the throughput of the next-generation short-read sequencers, many algorithmic advances have been made.
BLAST: a set of programs for the comparison of biological sequences.
Genome assembly: taking a huge number of DNA sequences and putting them back together to create a representation of the genome from which the DNA originated.
De novo assembly: assembling short reads to create full-length (sometimes novel) sequences.
Mapping assembly: assembling reads by aligning them against an existing reference sequence, building a sequence that is similar but not necessarily identical to the reference.
Want to know more?
A very nice online course:
“Bioinformatics Algorithms (Part 1)” by Phillip E. C. Compeau, Nikolay Vyahhi, Pavel Pevzner
https://class.coursera.org/bioinformatics-001
References
S. F. Altschul, W. Gish, W. Miller, E. W. Myers, and D. J. Lipman. Basic Local Alignment Search Tool. Journal of Molecular Biology, 215(3):403–410, 1990.
M. Burrows and D. J. Wheeler. A block-sorting lossless data compression algorithm. Technical Report SRC-RR-124, Digital Equipment Corporation, Systems Research Center, 1994.
J. C. Dohm, C. Lottaz, T. Borodina, and H. Himmelbauer. SHARCGS, a fast and highly accurate short-read assembly algorithm for de novo genomic sequencing. Genome Research, 17(11):1697–1706, 2007.
P. Ferragina and G. Manzini. Opportunistic data structures with applications. In Proceedings of the forty-first Annual Symposium on Foundations of Computer Science (FOCS 2000), pages 390–398, USA, 2000. IEEE Computer Society.
R. D. Fleischmann, M. D. Adams, O. White, R. A. Clayton, E. F. Kirkness, A. R. Kerlavage, C. J. Bult, J. F. Tomb, B. A. Dougherty, and J. M. Merrick. Whole-genome random sequencing and assembly of Haemophilus influenzae. Science, 269:496–512, 1995.
K. Frousios, C. S. Iliopoulos, L. Mouchard, S. P. Pissis, and G. Tischler. REAL: an efficient REad ALigner for next generation sequencing reads. In A. Zhang, M. Borodovsky, G. Ozsoyoglu, and A. R. Mikler, editors, Proceedings of the first ACM International Conference on Bioinformatics and Computational Biology (BCB 2011), pages 154–159, USA, 2010. ACM.
D. Hernandez, P. Francois, L. Farinelli, M. Osteras, and J. Schrenzel. De novo bacterial genome sequencing: millions of very short reads assembled on a desktop computer. Genome Research, March 2008.
B. Langmead, C. Trapnell, M. Pop, and S. L. Salzberg. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biology, 10(3):R25, 2009.
H. Li and R. Durbin. Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics, 25(14):1754–1760, 2009.
R. Li, C. Yu, Y. Li, T.-W. Lam, S.-M. Yiu, K. Kristiansen, and J. Wang. SOAP2: an improved ultrafast tool for short read alignment. Bioinformatics, 25(16):1966–1967, 2009.
R. Li, H. Zhu, J. Ruan, W. Qian, X. Fang, Z. Shi, Y. Li, S. Li, G. Shan, K. Kristiansen, S. Li, H. Yang, J. Wang, and J. Wang. De novo assembly of human genomes with massively parallel short read sequencing. Genome Research, 20(2):265–272, 2010.
M. Margulies, M. Egholm, W. E. Altman, S. Attiya, J. S. Bader, L. A. Bemben, J. Berka, M. S. Braverman, Y.-J. Chen, Z. Chen, S. B. Dewell, L. Du, J. M. Fierro, X. V. Gomes, B. C. Godwin, W. He, S. Helgesen, C. H. Ho, G. P. Irzyk, S. C. Jando, M. L. I. Alenquer, T. P. Jarvie, K. B. Jirage, J.-B. Kim, J. R. Knight, J. R. Lanza, J. H. Leamon, S. M. Lefkowitz, M. Lei, J. Li, K. L. Lohman, H. Lu, V. B. Makhijani, K. E. McDade, M. P. McKenna, E. W. Myers, E. Nickerson, J. R. Nobile, R. Plant, B. P. Puc, M. T. Ronan, G. T. Roth, G. J. Sarkis, J. F. Simons, J. W. Simpson, M. Srinivasan, K. R. Tartaro, A. Tomasz, K. A. Vogt, G. A. Volkmer, S. H. Wang, Y. Wang, M. P. Weiner, P. Yu, R. F. Begley, and J. M. Rothberg. Genome sequencing in microfabricated high-density picolitre reactors. Nature, 437(7057):376–380, 2005.
J. R. Miller, S. Koren, and G. Sutton. Assembly algorithms for next-generation sequencing data. Genomics, 95(6):315–327, 2010.
J. Shendure, G. J. Porreca, N. B. Reppas, X. Lin, J. P. McCutcheon, A. M. Rosenbaum, M. D. Wang, K. Zhang, R. D. Mitra, and G. M. Church. Accurate Multiplex Polony Sequencing of an Evolved Bacterial Genome. Science, 309(5741):1728–1732, 2005.
J. R. ten Bosch and W. W. Grody. Keeping up with the next generation: Massively parallel sequencing in clinical diagnostics. Journal of Molecular Diagnostics, 10(6):484–492, 2008.