Genome Assembly
Kelley Bullard, Henry Dewhurst, Kizee Etienne, Esha Jain, VivekSagar KR, Benjamin Metcalf, Raghav Sharma, Charles Wigington, Juliette Zerick
Outline
• Stakeholders
• Biology
• NGS Review
• Introduction to Genome Assembly
• Challenges
• Analysis pipeline / strategy
• Tool selection
• Summary (final pipeline)
Stakeholders
• CDC (Centers for Disease Control and Prevention)
• GaTech
• Immunocompromised individuals
• Consumers of seafood
• Prediction group (and subsequent groups)
Stakeholders / Biology / NGS Review / Introduction to Genome Assembly / Challenges /Analysis Pipeline-Strategy / Tool Selection / Summary
Biology…
Image of V. vulnificus
Vibrio vulnificus
• Gram-negative
  o Lipopolysaccharide membrane
• Motile, facultative anaerobe
• Halophilic (salt-loving) organism
• Abundant in estuarine ecosystems
• Major cause of seafood-related deaths
Vibrio vulnificus – genome architecture
• Bacterial genomes are coding-dense
  o Introns rare
• Contains plasmids (pYJ016)
• V. vulnificus: ~5.2 Mbp genome
  o GC content: 45-47% (similar to E. coli, ~50%)
Vibrio navarrensis
• Gram-negative
  o Lipopolysaccharide membrane
• Motile, facultative anaerobe
• Moderately halophilic organism
• Some strains do not grow well in moderate to high salt concentrations
Vibrio navarrensis - genomic architecture
NGS - Review
Roche 454 sequencing workflow overview
• Sample input: genomic DNA, BACs, amplicons, cDNA
• Generation of small DNA fragments via shearing
• Ligation of A/B adaptors flanking single-stranded DNA fragments
• Emulsification of beads and fragments in water-in-oil microreactors
• Clonal amplification of fragments bound to beads in microreactors
• Sequencing and base calling
One fragment = one bead = one read
400,000 reads per run
Flowgram
GS FLX data analysis – flowgram generation
[Figure: flowgram of the sequence TTCTGCGAA; flow order T, A, C, G; signal heights labelled 1-mer, 2-mer, 3-mer, 4-mer]
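A minimal Python sketch of how a flowgram relates to a read, assuming the nucleotide flow order T, A, C, G and ideal, noise-free signals. The function name `flowgram` is hypothetical and is not part of any 454 software. Each flow reports how many bases were incorporated, so a homopolymer run shows up as one taller signal rather than several separate ones.

```python
def flowgram(seq, flow_order="TACG"):
    """Return [(flowed_base, bases_incorporated), ...] for an ideal read."""
    seq = seq.upper()
    flows = []
    pos = 0
    idle = 0  # consecutive flows in which nothing was incorporated
    while pos < len(seq):
        base = flow_order[len(flows) % len(flow_order)]
        run = 0
        # A homopolymer run is consumed entirely within a single flow.
        while pos < len(seq) and seq[pos] == base:
            run += 1
            pos += 1
        flows.append((base, run))
        idle = 0 if run else idle + 1
        if idle >= len(flow_order):
            # A full cycle without progress means the next base is not in flow_order.
            raise ValueError("base %r is not in the flow order" % seq[pos])
    return flows


for base, signal in flowgram("TTCTGCGAA"):
    print(base, signal)
```

Real signals are noisy real numbers rather than integers, which is why the length of long homopolymer runs is the characteristic 454 error.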
Example of homopolymer errors from 454 sequencing data
Example of 454 sff file (text format)
Illumina sequencing overview
[Figure labels: 0.1 - 1.0 μg; user or core facility; cBot; GAIIx]
Example of Illumina *.fastq file
@C3PO_0001:2:1:17:1499#0/1
TGAATTCATTGACCATAACAATCATATGCATGATGCAAATTATAATATCATTTTTAGTGACGTCGTGAATCGTTT
+C3PO_0001:2:1:17:1499#0/1
abaaaaaaaaaaa`a`aa_aaaaaaaaaaaaaaaa_a aaa`aaaaa^aaaaa`a]^`a YZYZ^`NJDJ\_Z
@C3PO_0001:2:1:17:1291#0/1
TGTTTGAGCAAATGATTCATAATAATGTATTTCAATATTTTTAGGAATATCTCCCAATATTGCGCGTGCTGAATT
+C3PO_0001:2:1:17:1291#0/1
a`_`_\a_aaaa_a^Z^^a[a^aa]a_^_a_``aa `aa`X^X^^`aa_\_]VR`\a_]W\_`_a]a]][\RZV
@C3PO_0001:2:2:1452:1316#0/1
GTCCATCCGCAGCAGCGAATTTTTGACGTCCCCCCCCGAANGGANGNGANNNNGNNGNNNTNTNNAAANGNNNNN
+C3PO_0001:2:2:1452:1316#0/1
_U a\ `]_`ZP\\_Z^[]aa^a_]XNBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB…
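A short Python sketch of reading such a file, assuming the usual four lines per record (header, sequence, "+" line, quality string). The helper names `read_fastq` and `phred_scores` are hypothetical. The quality offset of 64 is an assumption inferred from the lowercase letters and runs of "B" in the example, which are typical of older Illumina pipelines; current files usually use an offset of 33.

```python
import io


def read_fastq(handle):
    """Yield (name, sequence, quality_string) from a 4-line-per-record FASTQ stream."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return  # end of file (a blank line is also treated as the end)
        seq = handle.readline().rstrip("\n")
        plus = handle.readline().rstrip("\n")
        qual = handle.readline().rstrip("\n")
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError("malformed FASTQ record near " + header)
        yield header[1:], seq, qual


def phred_scores(qual, offset=64):
    """Convert quality characters to integer scores (offset 64 is assumed here)."""
    return [ord(c) - offset for c in qual]


example = io.StringIO("@read1/1\nACGTN\n+read1/1\nhhh_B\n")
for name, seq, qual in read_fastq(example):
    print(name, seq, phred_scores(qual))
```

With offset 64 the character "B" decodes to 2, so the long run of "B" in the third record marks the low-quality tail that also carries the N calls.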
Genome Assembly
Input reads
V. navarrensis   V. vulnificus
2423-01   2009V-1368
08-2462 06-2432
2541-90 08-2435
2756-81 08-2439
07-2444
Introduction to genome assembly
An assembly is a hierarchical data structure that maps the sequence data to a putative reconstruction of the target.
In addition to contigs, a set of unassembled or partially assembled reads is also given as an output.
Reads
Contigs - multiple sequence alignment of reads plus the consensus sequence
Scaffolds - define the contig order and orientation
Output (FASTA)
How do we check the quality of our assembly?
METRICS!
• N50
• Minimum/maximum contig length
• No. of contigs
• No. of errors
• FRC (feature response curve)
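The size-based metrics can be computed directly from the contig lengths. The sketch below is a minimal Python version using the common definition of N50 (the length of the contig at which the running total of the sorted lengths first reaches half of the assembled bases). The function names are hypothetical.

```python
def n50(lengths):
    """Contig length at which the sorted cumulative length reaches half the total."""
    if not lengths:
        raise ValueError("no contigs given")
    half = sum(lengths) / 2
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= half:
            return length


def summarize(lengths):
    """Size-only summary of an assembly; it says nothing about correctness."""
    return {
        "contigs": len(lengths),
        "min": min(lengths),
        "max": max(lengths),
        "total": sum(lengths),
        "N50": n50(lengths),
    }


print(summarize([2, 3, 4, 5, 6]))
```

Some evaluations compute N50 against the known genome size instead of the assembled total, which makes values comparable across assemblies of different total length.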
Feature-by-Feature – evaluating de-novo assembly
• BREAKPOINT: Points in the assembly where leftover reads partially align
• COMPRESSION: Area representing a possible repeat collapse
• STRETCH: Area representing a possible repeat expansion
• LOW_GOOD_CVG: Area composed of paired reads at the right distance and with the right orientation but at low coverage
• HIGH_NORMAL_CVG: Area composed of normal oriented reads but at high coverage
• HIGH_LINKING_CVG: Area composed of reads with associated mates in another scaffold
• HIGH_SPANNING_CVG: Area composed of reads with associated mates in another contig
• HIGH_OUTIE_CVG: Area composed of incorrectly oriented mates (--> -->, <-- -->)
• HIGH_SINGLEMATE_CVG: Area composed of single reads (mate not present anywhere)
• HIGH_READ_COVERAGE: Region in assembly with unexpectedly high local read coverage
• HIGH_SNP: SNP with high coverage
• KMER_COV: Problematic k-mer distribution
Feature-by-Feature – evaluating de-novo assembly
• Most of the traditional metrics used to evaluate assemblies (N50, mean contig size, etc.) emphasize only size, while nothing (or almost nothing) is said about how correct the assemblies are.
• A typical alternative (especially in the NGS context) is to align contigs back to an available reference. However, this naive technique simply counts the number of mis-assemblies without attempting to distinguish or categorize them any further.
• After running amosvalidate, each contig is assigned the number of features that correspond to doubtful sequences in the assembly.
• For a fixed feature threshold w, the contigs are sorted by size and, starting from the longest, only those contigs are tallied whose sum of features is ≤ w. For this set of contigs, the corresponding approximate genome coverage is computed, leading to a single point of the Feature-Response curve (FRC).
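The construction of one FRC point can be sketched in Python as follows. This is one reading of the description, stated as an assumption: the feature counts are accumulated over the contigs in order of decreasing size, and tallying stops at the first contig that would push the running sum above w. The input format (a list of `(length, n_features)` pairs) and the function names are hypothetical and do not reflect the amosvalidate output format.

```python
def frc_point(contigs, w, genome_size):
    """Approximate genome coverage reached before the feature sum exceeds w."""
    covered = 0
    features = 0
    # Walk the contigs from longest to shortest.
    for length, n_features in sorted(contigs, key=lambda c: c[0], reverse=True):
        if features + n_features > w:
            break  # assumed rule: stop at the first contig that exceeds the budget
        features += n_features
        covered += length
    return covered / genome_size


def frc_curve(contigs, genome_size, thresholds):
    """One (w, coverage) point per feature threshold."""
    return [(w, frc_point(contigs, w, genome_size)) for w in thresholds]


# Hypothetical contigs: (length in bp, number of suspicious features).
contigs = [(500, 2), (300, 0), (200, 5)]
print(frc_curve(contigs, genome_size=1000, thresholds=[0, 2, 7]))
```

An assembler whose curve rises steeply covers most of the genome while accumulating few suspicious features.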
Assembly Challenges
Challenges
Intrinsic
• Genome architecture
• Repeats
• Homopolymer runs
• Sequence complexity
• Chimeras?
• Contaminants
Technical
• Short reads
• Poisson distribution of coverage
• Sequencing errors
• Variable quality
• Sequence tags
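The "Poisson distribution of coverage" item can be made concrete with the standard Lander-Waterman style approximation: if reads land uniformly at random, the depth at a base is roughly Poisson with mean c = N·L/G. The Python sketch below uses the 400,000 reads per run and the ~5.2 Mbp genome quoted earlier; the 400 bp read length is a hypothetical value chosen only for illustration.

```python
import math


def mean_coverage(n_reads, read_len, genome_len):
    """Average depth c = N * L / G."""
    return n_reads * read_len / genome_len


def prob_depth(k, c):
    """Poisson probability that a given base is covered exactly k times."""
    return math.exp(-c) * c ** k / math.factorial(k)


def expected_uncovered_bases(genome_len, c):
    """Expected number of bases with zero coverage, G * e^(-c)."""
    return genome_len * math.exp(-c)


c = mean_coverage(400_000, 400, 5_200_000)  # read length of 400 bp is assumed
print(round(c, 1), expected_uncovered_bases(5_200_000, c))
print(expected_uncovered_bases(5_200_000, 5.0))  # the same genome at only 5x
```

Even at a respectable average depth some positions are covered thinly or not at all, which is one reason assemblies break into contigs.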
454 raw reads
Pre-processing
Illumina raw reads
Pre-processing
454 reads
Illumina reads
Statistical analysis
Read stats
Published genomes from public databases
V. vulnificus YJ016
V. vulnificus CMCP6
V. vulnificus MO6-24/O
Align Illumina against the reference
• FastQC
• PRINSEQ
• NGS QC
Compare mapping statistics
Reference genome
• samstats
• bwa
Reference selection
Hybrid de novo
• Ray
• MIRA
Illumina / 454 / hybrid de novo assembly
454 de novo
• Newbler
• CABOG
• SUTTA
Illumina de novo
• ALLPATHS-LG
• SOAPdenovo
• Velvet
• ABySS
• Taipan
• Bambus2
• SUTTA
contigs * 3
Align Illumina reads against 454 contigs
Unmapped reads
• MacVector
• CLC wb
contigs
Unmapped reads
Evaluation
• GAGE
• Hawkeye
Illumina/(454?) reference-based assembly
• AMOScmp
contigs
Unmapped reads
De novo assembly
Reference-based assembly
Draft / finished genome
Reference evaluation
Reference evaluation
• dnadiff
• dnadiff
Parameter optimization
Contig merging
All possible combinations of the best 3
• Minimus
• MAIA
• PAGIT
• Mauve
Finished genome
Scaffolds
• GAGE
Genome finishing
Gap filling
Nucleotide identity
• dnadiff
• GRASS
• Built-in
Process
454
Illumina
Info.
Chosen Ref.
Assemblers
Assemblers
Illumina / 454
LEGEND
hybrid
Tool Selection - Assembly Algorithm profile
Greedy / Seed-and-extension / Graph based / Branch-and-Bound
Greedy
• Basic operation: given any read or contig, add one more read or contig until no more reads or contigs are available
• The contigs grow by "greedy extension", always incorporating the read that is found with the highest-scoring overlap
• Makes the locally optimal choice in the hope of reaching a globally optimal solution
• No foresight -> misassembly
Greedy: example reads
It was the best of
of times, it was the
best of times, it was
times, it was the worst
was the best of times,
the best of times, it
of times, it was the
times, it was the age
It was the best of
of times, it was the
times, it was the worst
was the best of times,
the best of times, it
was the worst of times,
best of times, it was
it was the age of
it was the worst of
of times, it was the
times, it was the age
was the age of wisdom,
the age of wisdom, it
age of wisdom, it was
of wisdom, it was the
it was the age of
was the age of foolishness,
the worst of times, it
• It was the best of times, it was the [worst/age]
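A minimal Python sketch of a greedy assembler, given as an illustration rather than as any particular tool's implementation. It repeatedly merges the pair of sequences with the longest exact suffix-prefix overlap. The function names and the `min_overlap` parameter are hypothetical, and the DNA reads are made up for the example.

```python
def overlap(a, b, min_len=1):
    """Length of the longest suffix of a that equals a prefix of b (0 if below min_len)."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads, min_overlap=3):
    """Merge the best-overlapping pair until no pair overlaps by min_overlap or more."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep input order
    while len(contigs) > 1:
        best = (0, None, None)
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    k = overlap(a, b, min_overlap)
                    if k > best[0]:  # ties go to the first pair found
                        best = (k, i, j)
        k, i, j = best
        if k == 0:
            break  # nothing left to merge
        merged = contigs[i] + contigs[j][k:]
        contigs = [c for idx, c in enumerate(contigs) if idx not in (i, j)]
        contigs.append(merged)
    return contigs


reads = ["ATTAGACCTG", "CCTGCCGGAA", "AGACCTGCCG", "GCCGGAATAC"]
print(greedy_assemble(reads))
```

When a repeat such as "it was the" makes two overlaps equally good, this procedure simply takes whichever it meets first, which is the "no foresight" weakness noted on the previous slide.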
Seed-and-Extension
• Variation of the greedy assembler
• Common in aligners, thus some assemblers/aligners may incorporate this approach
• Particularly designed for short reads, based on a contig heuristic scheme
• Prefix-tree data structure
• A contig is elongated at either end contingent upon the existence of reads with a prefix of minimal length perfectly matching the end of the contig
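A simplified Python sketch of the extension step, under two stated simplifications: a dictionary keyed by the first `k` bases of each read stands in for the prefix tree, and the contig is extended only at its right end (the left end would be handled symmetrically on reverse complements). The function names and reads are hypothetical.

```python
from collections import defaultdict


def build_prefix_index(reads, k):
    """Map each k-base prefix to the reads starting with it (stand-in for a prefix tree)."""
    index = defaultdict(list)
    for read in reads:
        if len(read) >= k:
            index[read[:k]].append(read)
    return index


def extend_right(seed, reads, k=4):
    """Grow the seed while some unused read's prefix (>= k bases) perfectly matches the contig end."""
    index = build_prefix_index(reads, k)
    longest = max(map(len, reads), default=0)
    contig, used = seed, {seed}
    extended = True
    while extended:
        extended = False
        # Try the longest possible overlap first, down to the minimal prefix length k.
        for olap in range(min(len(contig), longest - 1), k - 1, -1):
            suffix = contig[-olap:]
            for read in index.get(suffix[:k], []):
                if read not in used and len(read) > olap and read.startswith(suffix):
                    contig += read[olap:]  # append only the non-overlapping tail
                    used.add(read)
                    extended = True
                    break
            if extended:
                break
    return contig


reads = ["ATTAGACCTG", "AGACCTGCCG", "CCTGCCGGAA", "GCCGGAATAC"]
print(extend_right("ATTAGACCTG", reads))
```

Each read is used at most once, so the loop always terminates; extension stops when no read prefix matches the contig end.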
Graph based: overlap-layout-consensus (OLC), pairwise consensus
• Overlap: find potentially overlapping reads
• Layout: lay out the reads based on matching alignments
• Consensus: derive the DNA consensus sequence by joining the read sequences
..ACGATTACAATAGGTT..
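The consensus step can be illustrated with a column-wise majority vote in Python. This sketch assumes the layout is already known as `(offset, read)` pairs without gaps; real OLC assemblers derive it from pairwise alignments. The reads and the function name are hypothetical, and one read carries a deliberate error to show the vote at work.

```python
from collections import Counter


def consensus(layout):
    """Majority base per column for reads placed at fixed offsets (no gaps)."""
    length = max(offset + len(read) for offset, read in layout)
    columns = [Counter() for _ in range(length)]
    for offset, read in layout:
        for i, base in enumerate(read):
            columns[offset + i][base] += 1
    # Columns that no read covers are reported as N.
    return "".join(col.most_common(1)[0][0] if col else "N" for col in columns)


layout = [
    (0, "ACGATTACA"),
    (3, "ATTGCAATAG"),  # contains a miscalled G at consensus position 6
    (5, "TACAATA"),
    (9, "ATAGGTT"),
]
print(consensus(layout))
```

The miscalled base is outvoted because two other reads cover the same column, which is why coverage depth matters for consensus accuracy.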
Graph based: Hamiltonian approach
Find an assembled sequence that explains the observed sequence = finding a path through a graph that visits every vertex once
[Figure labels: Repeat, Repeat, Repeat]
Graph based: de Bruijn graph
Basic operation: k-mer approach, Eulerian approach
Reads
AAGA ACTT ACTC ACTG AGAG CCGA CGAC CTCC CTGG CTTT…
de Bruijn graph (nodes)
AAG AGA GAC ACT CTT TTT CTC TCC CCG CGA CTG TGG GGG GGA
Potential genomes
AAGACTCCGACTGGGACTTT
AAGACTGGGACTCCGACTTT
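A small Python sketch of the k-mer construction, using the two 20-base candidate genomes AAGACTCCGACTGGGACTTT and AAGACTGGGACTCCGACTTT with k = 4. Each 4-mer becomes an edge from its 3-base prefix to its 3-base suffix. The function names are hypothetical, and the Eulerian path search itself is left out to keep the example short.

```python
from collections import defaultdict


def kmers_of(seq, k):
    """All overlapping k-mers of seq, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def de_bruijn(kmers):
    """Edge list of the de Bruijn graph: (k-1)-mer prefix -> (k-1)-mer suffix."""
    graph = defaultdict(list)
    for kmer in kmers:
        graph[kmer[:-1]].append(kmer[1:])
    return graph


genome_a = "AAGACTCCGACTGGGACTTT"
genome_b = "AAGACTGGGACTCCGACTTT"

graph = de_bruijn(kmers_of(genome_a, 4))
for node in sorted(graph):
    print(node, "->", graph[node])

# Both candidates contain exactly the same 4-mers, so they give the same graph.
print(sorted(kmers_of(genome_a, 4)) == sorted(kmers_of(genome_b, 4)))
```

Because the two genomes share the same edges, both are valid Eulerian paths through one graph; the branching at node ACT (followed by CTC, CTG or CTT) is where the repeat GACT makes the order ambiguous.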
Branch-and-Bound
• Basic operation: relies on "consistent layouts"; it generates all possible consistent layouts, organizing them as paths in a "double tree" structure rooted at a randomly selected seed read
• Progressive evaluation of optimality criteria encoded by a set of score functions based on the set of overlaps along the layout
Tid-bits of advice
Greedy
• Advantages: guaranteed to find a solution
• Disadvantages: misassembly; easily confused by complex repeats
Seed-and-Extension
• Advantages: sensitivity
• Disadvantages: can be very slow; memory usage
OLC
• Advantages: suitable for low coverage long reads
• Disadvantages: computation of overlaps is time intensive
De Bruijn
• Advantages: repeats are immediately recognized; suitable for high coverage short reads
• Disadvantages: RAM intensive
Branch-and-Bound
• Advantages: algorithm allows for checks
• Disadvantages: ambiguities delay pruning
Tools of Choice
454 platform assembly
Name | Algorithm | Supporting evidence
Newbler 2.5 | OLC | Comparative analysis of algorithms for whole-genome assembly of pyrosequencing data
CABOG | OLC | Comparative analysis of algorithms for whole-genome assembly of pyrosequencing data
SUTTA | Branch-and-Bound | Feature-by-Feature – Evaluating De Novo Sequence Assembly
Evaluation of 454 assemblers: genomes used for comparison
Comparative analysis of algorithms for whole-genome assembly of pyrosequencing data Brief Bioinform (2012) 13(3): 269-280
Comparison of 454 assemblers using E. coli genome
Comparison of 454 assemblers using E. coli genome
• The maximum value reached by the bars is the hypothetical reconstruction (HR), defined as the ratio between the assembled bases and the reference length.
• The white section represents the real reconstruction (RR), i.e. the portion of the genome correctly reconstructed by the assemblers.
• The difference between HR and RR, here called erroneous reconstruction (ER), is shown in black.
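The three quantities reduce to simple ratios, sketched here in Python. The function name and the input numbers are hypothetical; how "correctly reconstructed" bases are counted (for example from an alignment to the reference) is outside this sketch.

```python
def reconstruction_ratios(assembled_bases, correct_bases, reference_length):
    """HR, RR and ER as fractions of the reference length."""
    hr = assembled_bases / reference_length   # hypothetical reconstruction
    rr = correct_bases / reference_length     # real reconstruction
    return {"HR": hr, "RR": rr, "ER": hr - rr}  # ER: erroneous reconstruction


# Made-up numbers for illustration only.
print(reconstruction_ratios(assembled_bases=110, correct_bases=95, reference_length=100))
```

An HR above 1 means the assembly contains more bases than the reference, for example through duplicated regions.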
Illumina platform assembly
Name | Algorithm | Supporting evidence
ALLPATHS-LG | de Bruijn | GAGE: A critical evaluation of genome assemblies and assembly algorithms
Velvet | de Bruijn | Comparative studies of de novo assembly tools for next-generation sequencing technologies
Taipan | Hybrid (greedy-based and graph) | A Practical Comparison of De Novo Genome Assembly Software Tools for Next-Generation Sequencing Technologies
SOAPdenovo | de Bruijn | Feature-by-Feature – Evaluating De Novo Sequence Assembly
SUTTA | Branch-and-Bound | Feature-by-Feature – Evaluating De Novo Sequence Assembly
Evaluation of Illumina assemblers: genomes used for comparison
GAGE: A critical evaluation of genome assemblies and assembly algorithms. Steven L. Salzberg, Adam M. Phillippy, Aleksey Zimin, et al. Genome Res. 2012 22: 557-567
Comparison of Illumina assemblers
• The best value for each column is shown in bold. For all assemblies
• The Errors column contains the number of misjoins plus indel errors >5 bp for contigs, and the total number of misjoins for scaffolds.
• Corrected N50 values were computed after correcting contigs and scaffolds by breaking them at each error. See the evaluation section in the text for details on how errors were identified.
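The idea of a corrected N50 can be sketched in Python: every contig is cut at its error positions and N50 is recomputed on the pieces. The input format (contig length plus a list of error coordinates) and the function names are hypothetical. N50 here is taken relative to the total assembled length, which is an assumption and may differ from the genome-size-based values reported in an evaluation.

```python
def n50(lengths):
    """Length at which the sorted cumulative length reaches half the total."""
    half = sum(lengths) / 2
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= half:
            return length


def break_at_errors(contig_len, error_positions):
    """Split one contig into pieces at each error coordinate inside it."""
    cuts = sorted(p for p in set(error_positions) if 0 < p < contig_len)
    bounds = [0] + cuts + [contig_len]
    return [end - start for start, end in zip(bounds, bounds[1:])]


def corrected_n50(contigs):
    """contigs: list of (length, [error positions]) pairs."""
    pieces = []
    for length, errors in contigs:
        pieces.extend(break_at_errors(length, errors))
    return n50(pieces)


contigs = [(1000, [400]), (500, []), (300, [])]
print(n50([length for length, _ in contigs]), corrected_n50(contigs))
```

A single misjoin in the longest contig is enough to change the figure noticeably, which is why corrected values give a fairer comparison between aggressive and conservative assemblers.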
Comparison of Illumina assemblers
• A "chaff" contig is defined as a single contig <200 bp in length. In many cases, these contigs can be as small as the k-mer size used to build the de Bruijn graph (e.g., 36 bp) and are too short to support any further genomic analysis.
• A duplicated repeat is one that appears in more copies than necessary in the assembly, and a compressed repeat is one that occurs in fewer copies.
Comparison of Illumina assemblers
• "Misjoin" errors are perhaps the most harmful type, in that they represent a significant structural error. A misjoin occurs when an assembler incorrectly joins two distant loci of the genome, which most often occurs within a repeat sequence.
• We have tallied three types of misjoins: (1) inversions, where part of a contig or scaffold is reversed with respect to the true genome; (2) relocations, or rearrangements that move a contig or scaffold within a chromosome; and (3) translocations, or rearrangements between chromosomes.
Comparison of Illumina assemblers
• Average contig (A) and scaffold (B) sizes, measured by N50 values, versus error rates, averaged over all three genomes for which the true assembly is known: S. aureus, R. sphaeroides, and human chromosome 14.
• Errors (vertical axis) are measured as the average distance between errors, in kilobases.
• In both plots, the best assemblers appear in the upper right.
Applicability of assemblers: genomes used for comparison
A Practical Comparison of De novo Genome Assembly Software Tools for Next-Generation Sequencing Technologies. Wenyu Zhang, et al. Plos One. 2011 6: 1-12
Comparison of Illumina assemblers
Comparison of Illumina assemblers
Hybrid platform assembly
Name | Algorithm | Supporting evidence
Ray | SBH | Feature-by-Feature – Evaluating De Novo Sequence Assembly
Feature-by-Feature – evaluating de-novo assembly
• COMPRESSION: Area representing a possible repeat collapse
• LOW_GOOD_CVG: Area composed of paired reads at the right distance and with the right orientation but at low coverage
• HIGH_OUTIE_CVG: Area composed of incorrectly oriented mates (--> -->, <-- -->)
• HIGH_SINGLEMATE_CVG: Area composed of single reads (mate not present anywhere)
• HIGH_READ_COVERAGE: Region in assembly with unexpectedly high local read coverage
• KMER_COV: Problematic k-mer distribution
Feature-by-Feature – evaluating de-novo assembly: real data, long reads
Feature-by-Feature – evaluating de-novo assembly: real data, short reads
Final Approach
454 raw reads
Pre-processing
Illumina raw reads
Pre-processing
454 reads
Illumina reads
Statistical analysis
Read stats
Published genomes from public databases
V. vulnificus YJ016
V. vulnificus CMCP6
V. vulnificus MO6-24/O
Align Illumina against the reference
• FastQC
• PRINSEQ
• NGS QC
Compare mapping statistics
Reference genome
• samstats
• bwa
Reference selection
Hybrid de novo
• Ray
Illumina / 454 / hybrid de novo assembly
454 de novo
• Newbler
• CABOG
• SUTTA
Illumina de novo
• ALLPATHS-LG
• SOAPdenovo
• Velvet
• Taipan
• SUTTA
contigs * 3
Align Illumina reads against 454 contigs
Unmapped reads
• MacVector
• CLC wb
contigs
Unmapped reads
Evaluation
• GAGE
• Hawkeye
Illumina/(454?) reference-based assembly
• AMOScmp
contigs
Unmapped reads
De novo assembly
Reference-based assembly
Draft / finished genome
Reference evaluation
Reference evaluation
• dnadiff
• MUMmer
Parameter optimization
Contig merging
All possible combinations of the best 3
• Minimus
• MAIA
• PAGIT
• Mauve
Finished genome
Scaffolds
• GAGE
Genome finishing
Gap filling
Nucleotide identity
• MUMmer
• GRASS
• Built-in
Process
454
Illumina
Info.
Chosen Ref.
Assemblers
Assemblers
Illumina / 454
LEGEND
hybrid
References
1. Finotello, F., et al., Comparative analysis of algorithms for whole-genome assembly of pyrosequencing data. Brief Bioinform, 2012. 13(3): p. 269-80.
2. Vezzi, F., G. Narzisi, and B. Mishra, Feature-by-feature--evaluating de novo sequence assembly. PLoS One, 2012. 7(2): p. e31002.
3. Zhang, W., et al., A practical comparison of de novo genome assembly software tools for next-generation sequencing technologies. PLoS One, 2011. 6(3): p. e17915.
4. Salzberg, S.L., et al., GAGE: A critical evaluation of genome assemblies and assembly algorithms. Genome Res, 2012. 22(3): p. 557-67.
5. Narzisi, G. and B. Mishra, Comparing de novo genome assembly: the long and short of it. PLoS One, 2011. 6(4): p. e19175.
6. Miller, J.R., S. Koren, and G. Sutton, Assembly algorithms for next-generation sequencing data. Genomics, 2010. 95(6): p. 315-27.
7. Li, Z., et al., Comparison of the two major classes of assembly algorithms: overlap-layout-consensus and de-bruijn-graph. Brief Funct Genomics, 2012. 11(1): p. 25-37.
8. Lin, Y., et al., Comparative studies of de novo assembly tools for next-generation sequencing technologies. Bioinformatics, 2011. 27(15): p. 2031-7.
9. Zhang, J., et al., The impact of next-generation sequencing on genomics. J Genet Genomics, 2011. 38(3): p. 95-109.