Genome Assembly Final Results
![Page 1: Genome Assembly Final Results](https://reader036.fdocuments.us/reader036/viewer/2022062310/5681662a550346895dd98907/html5/thumbnails/1.jpg)
Jeri Dilts, Suzanna Kim, Hema Nagrajan, Deepak Purushotham, Ambily Sivadas, Amit Rupani, Leo Wu
Genome Assembly Final Results
02-22-2012
![Page 2: Genome Assembly Final Results](https://reader036.fdocuments.us/reader036/viewer/2022062310/5681662a550346895dd98907/html5/thumbnails/2.jpg)
Outline
- Pipeline for evaluation
- Quantitative evaluation
- Qualitative evaluation
- Choosing the BEST assembly
- Final results
- Demo
![Page 3: Genome Assembly Final Results](https://reader036.fdocuments.us/reader036/viewer/2022062310/5681662a550346895dd98907/html5/thumbnails/3.jpg)
Pipeline for evaluation
![Page 4: Genome Assembly Final Results](https://reader036.fdocuments.us/reader036/viewer/2022062310/5681662a550346895dd98907/html5/thumbnails/4.jpg)
Strategy – Key alterations
- Prinseq preprocessing: unnecessary, since the assemblers have built-in capabilities. Use Prinseq for data statistics.
- Error correction: does not fit our methods.
  - Coral is based on overlap-layout-consensus and works best with de Bruijn graph assemblers.
  - Echo has never been tested on 454 data.
- Final assemblers: Newbler, Mira, Celera, AMOScmp
- Discarded assemblers: ABySS, Velvet, and Pcap454
- MAIA hybrid assembly: needs a phylogenetically close reference genome.
![Page 5: Genome Assembly Final Results](https://reader036.fdocuments.us/reader036/viewer/2022062310/5681662a550346895dd98907/html5/thumbnails/5.jpg)
![Page 6: Genome Assembly Final Results](https://reader036.fdocuments.us/reader036/viewer/2022062310/5681662a550346895dd98907/html5/thumbnails/6.jpg)
Quantitative Evaluation
Metrics
- No. of contigs: the fewer, the better
- N50: the higher, the better
- Assembly size: the closer to the estimated genome size, the better

Quantitative Assembly Score

$$\text{Score} = \frac{\text{N50} \times \text{Assembly size}}{\text{No. of contigs}}$$

The higher the score, the better!
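The slides use N50 without defining it. A minimal sketch of the usual definition follows: the length of the contig at which the running total of contig lengths, sorted from longest to shortest, first reaches half of the assembly size. The function name `n50` and the example lengths are illustrative, not taken from the slides.

```python
def n50(contig_lengths):
    """Return the N50 of a collection of contig lengths.

    Contigs are sorted from longest to shortest; N50 is the length of the
    contig at which the running total first reaches half the assembly size.
    """
    lengths = sorted(contig_lengths, reverse=True)
    half_total = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half_total:
            return length
    raise ValueError("contig_lengths must not be empty")


# Illustrative contig lengths (not from the slides): total 290, half 145.
# 80 + 70 = 150 reaches the half-way mark, so N50 is 70.
print(n50([80, 70, 50, 40, 30, 20]))
```

A higher N50 means that more of the assembly sits in long contigs, which is why the slide treats it as "higher is better".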
![Page 7: Genome Assembly Final Results](https://reader036.fdocuments.us/reader036/viewer/2022062310/5681662a550346895dd98907/html5/thumbnails/7.jpg)
M19107 - Evaluation

| Runs | # Contigs | N50 | Total Size | Score |
|---|---|---|---|---|
| Newbler | 199 | 16319 | 1753573 | 8.16 |
| Mira | 201 | 19353 | 1790088 | 8.24 |
| Celera | 146 | 20609 | 1747621 | 8.39 |
| Newbler_Mira | 129 | 25914 | 1774129 | 8.55 |
| Newbler_Celera | 104 | 27207 | 1719874 | 8.65 |
| Newbler_Mira_Celera | 96 | 27478 | 1701316 | 8.69 |
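The Score column for M19107 is on a much smaller scale than the raw ratio N50 × assembly size / number of contigs. The values are consistent with the base-10 logarithm of that ratio, rounded to two decimals. The slides do not state this, so treat the log10 step as an inference. The sketch below recomputes the column from the tabulated values; the names `RUNS`, `quantitative_score` and `log_score` are made up for illustration.

```python
import math

# (number of contigs, N50, total size) copied from the M19107 evaluation table
RUNS = {
    "Newbler": (199, 16319, 1753573),
    "Mira": (201, 19353, 1790088),
    "Celera": (146, 20609, 1747621),
    "Newbler_Mira": (129, 25914, 1774129),
    "Newbler_Celera": (104, 27207, 1719874),
    "Newbler_Mira_Celera": (96, 27478, 1701316),
}


def quantitative_score(n_contigs, n50, assembly_size):
    """Raw quantitative score: N50 * assembly size / number of contigs."""
    return n50 * assembly_size / n_contigs


def log_score(n_contigs, n50, assembly_size):
    """log10 of the raw score (assumed to be what the Score column reports)."""
    return math.log10(quantitative_score(n_contigs, n50, assembly_size))


for name, (n_contigs, n50, size) in RUNS.items():
    print(f"{name:<22}{log_score(n_contigs, n50, size):.2f}")
```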
![Page 8: Genome Assembly Final Results](https://reader036.fdocuments.us/reader036/viewer/2022062310/5681662a550346895dd98907/html5/thumbnails/8.jpg)
![Page 9: Genome Assembly Final Results](https://reader036.fdocuments.us/reader036/viewer/2022062310/5681662a550346895dd98907/html5/thumbnails/9.jpg)
Qualitative Evaluation
- Strategy: align the assembly contigs to the original reference genome and compute differences
- Challenge: no original reference genome for our data set
- Approach: create simulated 454 read datasets from a completely sequenced genome
- Tools used: FlowSim, 454Sim, Art-454
![Page 10: Genome Assembly Final Results](https://reader036.fdocuments.us/reader036/viewer/2022062310/5681662a550346895dd98907/html5/thumbnails/10.jpg)
FlowSim
- A simulation pipeline based on real data
- Lets you model each step of the pyrosequencing process
- Utilities:
  - Clonesim: simulates the shearing step. Usage: clonesim -c count -l dist input.fasta
  - Gelfilter: selects a certain range of clone lengths. Usage: gelfilter min max
  - Kitsim: attaches A and B adaptors. Usage: kitsim -k key -a adapter input.fasta -o output.fasta
  - Mutator: introduces random substitutions and indels in the sequences. Usage: mutator -i indel_rate -s subst_rate input.fasta -o output.fasta
  - Duplicator: generates artificial duplicates of many clones. Usage: duplicator dup_prob
  - Flowsim: simulates the actual pyrosequencing process. Usage: flowsim -G generation input.fasta -o output.sff
- Example: clonesim -c 400000 -l "Normal 350 95" input.fasta | gelfilter 25 600 | kitsim | mutator | duplicator 0.03 | flowsim -G Titanium -o output.sff
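To make the first two stages of the example pipeline concrete, here is a toy sketch of shearing followed by a length filter. It is not how clonesim or gelfilter are implemented. It assumes that "Normal 350 95" denotes a normal length distribution with mean 350 and standard deviation 95, and that fragment start positions are uniform. The function `shear` and all of its parameters are hypothetical.

```python
import random


def shear(genome, count, mean_len, sd_len, min_len, max_len, seed=0):
    """Toy stand-in for shearing plus length filtering.

    Draws `count` fragment lengths from a normal distribution, cuts each
    fragment at a uniformly random position, and keeps only fragments whose
    length lies in [min_len, max_len].
    """
    rng = random.Random(seed)
    fragments = []
    for _ in range(count):
        length = int(round(rng.gauss(mean_len, sd_len)))
        if length < 1 or length > len(genome):
            continue  # a draw that cannot be cut from this genome
        start = rng.randrange(len(genome) - length + 1)
        fragment = genome[start:start + length]
        if min_len <= len(fragment) <= max_len:
            fragments.append(fragment)
    return fragments


# Hypothetical 5 kb random "genome", used only to exercise the function.
base_rng = random.Random(1)
genome = "".join(base_rng.choice("ACGT") for _ in range(5000))
fragments = shear(genome, count=1000, mean_len=350, sd_len=95,
                  min_len=25, max_len=600)
print(len(fragments), "fragments kept")
```

The real tools go further by modelling adaptors, mutations, duplicates and flowgram noise, which this sketch deliberately leaves out.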
![Page 11: Genome Assembly Final Results](https://reader036.fdocuments.us/reader036/viewer/2022062310/5681662a550346895dd98907/html5/thumbnails/11.jpg)
454Sim
- 454 simulation at higher speed and accuracy
- USP: configurable statistical models
- Supports GS FLX, Titanium and GS 20
- Utilities:
  - fragsim: simulates shearing. Usage: fragsim -c 1000000 -l 1000 genome.fasta > genome.fragments.fasta
  - 454sim: simulates the sequencing step. Usage: 454sim -o genome.sff genome.fragments.fasta
- Example: fragsim -c 250000 -l 1000 genome.fasta | 454sim -g FLX -o genome.sff
![Page 12: Genome Assembly Final Results](https://reader036.fdocuments.us/reader036/viewer/2022062310/5681662a550346895dd98907/html5/thumbnails/12.jpg)
ART-454
- Supports Illumina, 454 and SOLiD read simulation
- Used for the 1000 Genomes Project
- Usage:
  - Single-end reads: art_454 Input.fasta Output_prefix Fold_coverage
  - Paired-end reads: art_454 Input.fasta Output_prefix Fold_coverage Mean_Frag_Len Std_Deviation
![Page 13: Genome Assembly Final Results](https://reader036.fdocuments.us/reader036/viewer/2022062310/5681662a550346895dd98907/html5/thumbnails/13.jpg)
Running pipeline on Simulated reads
Reference – Haemophilus influenzae F3047 (NC_014922)
Ran 454Sim, FlowSim and Art-454 to generate reads
Ran de novo assemblers - Newbler, Mira3 and Celera (CABOG)
Merged assemblies using Minimus2
Evaluate Assembly Accuracy (How?)
![Page 14: Genome Assembly Final Results](https://reader036.fdocuments.us/reader036/viewer/2022062310/5681662a550346895dd98907/html5/thumbnails/14.jpg)
Assembly Accuracy
- Challenge: alignment of contigs to the reference genome
- Approach:
  - Local alignment (BLAST, bwa, bowtie)
  - Whole genome alignment (Mauve, MUMmer)
- Align the assembly to the reference genome
- Compute nucleotide differences, gaps and rearranged segments
![Page 15: Genome Assembly Final Results](https://reader036.fdocuments.us/reader036/viewer/2022062310/5681662a550346895dd98907/html5/thumbnails/15.jpg)
Mauve
- Uses positional homology genome alignment
  - Each site in the assembly maps to at most one site on the reference
  - Optimized contiguity
  - E.g. progressiveMauve
- Ordering of contigs: Mauve Contig Mover algorithm
- Compare to identify differences
![Page 16: Genome Assembly Final Results](https://reader036.fdocuments.us/reader036/viewer/2022062310/5681662a550346895dd98907/html5/thumbnails/16.jpg)
Mauve Genome Aligner
![Page 17: Genome Assembly Final Results](https://reader036.fdocuments.us/reader036/viewer/2022062310/5681662a550346895dd98907/html5/thumbnails/17.jpg)
After Ordering of Contigs
![Page 18: Genome Assembly Final Results](https://reader036.fdocuments.us/reader036/viewer/2022062310/5681662a550346895dd98907/html5/thumbnails/18.jpg)
Mauve Assembly Metrics
- Basecalling accuracy
  - Count and location of bases called wrongly
  - Direction of miscalling, e.g. A->G
  - Count and location of bases predicted to exist, but uncalled
- Genome content accuracy
  - Count and location of bases missing from the assembly
  - Count and location of extra bases in the assembly
  - Size distribution of the missing and extra fragments
- Genome structure accuracy
  - Estimate of misassembly count
![Page 19: Genome Assembly Final Results](https://reader036.fdocuments.us/reader036/viewer/2022062310/5681662a550346895dd98907/html5/thumbnails/19.jpg)
Example
- Reference genome: AGGCTAGCGCGCGATTAGGATC
- Assembly: AGTAGCGGGCCGATTAAGANC
- Alignment:
  - Reference: AGGCTAGCGCG-CGATTAGGATC
  - Assembly: AG--TAGCGGGCCGATTAAGANC
- Miscalls: 2 (C->G and G->A)
- Uncalled bases: 1 (N)
- Extra bases: 1 (insertion of C)
- Missing bases: 2 (deletion of GC)
- Missing segments: 1
- Extra segments: 1
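The counts in this example can be reproduced by walking the two aligned rows column by column. The sketch below is not Mauve's implementation. It assumes a pairwise alignment given as two equal-length strings with `-` for gaps and `N` for uncalled bases, and it treats each run of consecutive gap columns as one segment. The function name `alignment_metrics` is made up for illustration.

```python
def alignment_metrics(ref_aln, asm_aln):
    """Count miscalls, uncalled, missing and extra bases in a gapped alignment."""
    if len(ref_aln) != len(asm_aln):
        raise ValueError("aligned rows must have the same length")
    miscalls = []
    uncalled = missing = extra = 0
    missing_segments = extra_segments = 0
    in_missing = in_extra = False
    for r, a in zip(ref_aln, asm_aln):
        is_missing = a == "-" and r != "-"   # reference base absent from assembly
        is_extra = r == "-" and a != "-"     # assembly base absent from reference
        if is_missing:
            missing += 1
            if not in_missing:
                missing_segments += 1        # start of a new run of missing bases
        elif is_extra:
            extra += 1
            if not in_extra:
                extra_segments += 1          # start of a new run of extra bases
        elif a == "N":
            uncalled += 1                    # base exists but was not called
        elif r != a:
            miscalls.append(f"{r}->{a}")     # substitution and its direction
        in_missing, in_extra = is_missing, is_extra
    return {
        "miscalls": miscalls,
        "uncalled": uncalled,
        "extra": extra,
        "missing": missing,
        "missing_segments": missing_segments,
        "extra_segments": extra_segments,
    }


# The alignment from the example slide
result = alignment_metrics("AGGCTAGCGCG-CGATTAGGATC",
                           "AG--TAGCGGGCCGATTAAGANC")
print(result)
```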
![Page 20: Genome Assembly Final Results](https://reader036.fdocuments.us/reader036/viewer/2022062310/5681662a550346895dd98907/html5/thumbnails/20.jpg)
Scoring simulated reads with Mauve
Reference – Haemophilus influenzae F3047 (NC_014922)
Ran 454Sim, FlowSim and Art-454 to generate reads
Ran de novo assemblers - Newbler, Mira3 and Celera (CABOG)
Merged assemblies using Minimus2
Ran Mauve to align the assemblies back to the reference genome
Computed assembly metrics
![Page 21: Genome Assembly Final Results](https://reader036.fdocuments.us/reader036/viewer/2022062310/5681662a550346895dd98907/html5/thumbnails/21.jpg)
Miscalled Bases
[Bar chart: number of miscalled bases per assembly]
- Newbler: 16
- Mira: 52
- CA: 8
- Newbler+Mira: 18
- Newbler+CA: 24
- Newbler+Mira+CA: 36
![Page 22: Genome Assembly Final Results](https://reader036.fdocuments.us/reader036/viewer/2022062310/5681662a550346895dd98907/html5/thumbnails/22.jpg)
Uncalled bases
[Bar chart: number of uncalled bases per assembly; no value is labeled for Newbler or CA]
- Mira: 3
- Newbler+Mira: 14
- Newbler+CA: 7
- Newbler+Mira+CA: 34
![Page 23: Genome Assembly Final Results](https://reader036.fdocuments.us/reader036/viewer/2022062310/5681662a550346895dd98907/html5/thumbnails/23.jpg)
Total missing bases
[Bar chart: number of missed bases per assembly]
- Newbler: 90,490
- Mira: 76,648
- CA: 92,196
- Newbler+Mira: 73,632
- Newbler+CA: 82,121
- Newbler+Mira+CA: 195,619
![Page 24: Genome Assembly Final Results](https://reader036.fdocuments.us/reader036/viewer/2022062310/5681662a550346895dd98907/html5/thumbnails/24.jpg)
Total extra segments
[Bar chart: number of extra bases per assembly]
- Newbler: 709
- Mira: 4,387
- CA: 1,907
- Newbler+Mira: 5,895
- Newbler+CA: 4,590
- Newbler+Mira+CA: 5,218
![Page 25: Genome Assembly Final Results](https://reader036.fdocuments.us/reader036/viewer/2022062310/5681662a550346895dd98907/html5/thumbnails/25.jpg)
![Page 26: Genome Assembly Final Results](https://reader036.fdocuments.us/reader036/viewer/2022062310/5681662a550346895dd98907/html5/thumbnails/26.jpg)
Choosing the BEST assembly
- Quantitative metrics: N50, contig count, assembly size
- Qualitative metrics: miscalled bases, uncalled bases, missing bases, extra bases
![Page 27: Genome Assembly Final Results](https://reader036.fdocuments.us/reader036/viewer/2022062310/5681662a550346895dd98907/html5/thumbnails/27.jpg)
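Assembly Scores

Quantitative Score

$$\text{Quantitative score} = \frac{\text{N50} \times \text{Assembly size}}{\text{No. of contigs}}$$

Qualitative Score (% Accuracy)

$$\text{Qualitative score} = 1 - \frac{\text{Miscalls} + \text{Uncalled} + \text{Missing} + \text{Extra} + \text{Gaps in Ref} + \text{Gaps in Assembly}}{\text{Reference size}}$$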
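As a rough sketch, the qualitative score (one minus the summed error counts over the reference size) can be computed as below. The function name and argument names are illustrative. The worked call reuses the counts from the short alignment example earlier in the deck (2 miscalls, 1 uncalled, 2 missing, 1 extra, 22-base reference) and sets both gap terms to 0, which is an assumption because the slides do not give gap counts for that example.

```python
def qualitative_score(miscalls, uncalled, missing, extra,
                      gaps_in_ref, gaps_in_assembly, reference_size):
    """Return accuracy as a fraction: 1 - (sum of error counts) / reference size."""
    if reference_size <= 0:
        raise ValueError("reference_size must be positive")
    errors = (miscalls + uncalled + missing + extra
              + gaps_in_ref + gaps_in_assembly)
    return 1 - errors / reference_size


# Counts from the toy alignment example; both gap terms are assumed to be 0.
score = qualitative_score(miscalls=2, uncalled=1, missing=2, extra=1,
                          gaps_in_ref=0, gaps_in_assembly=0,
                          reference_size=22)
print(f"{score:.1%}")
```

Multiplying the fraction by 100 gives the percentage used on the "Quality of output" axis of the plots.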
![Page 28: Genome Assembly Final Results](https://reader036.fdocuments.us/reader036/viewer/2022062310/5681662a550346895dd98907/html5/thumbnails/28.jpg)
Metrics Summary – Art-454
[Table of ASSEMBLY SCORE and QUALITY SCORE per assembly; the values are not preserved in this transcript]
![Page 29: Genome Assembly Final Results](https://reader036.fdocuments.us/reader036/viewer/2022062310/5681662a550346895dd98907/html5/thumbnails/29.jpg)
Assembly spec. vs Accuracy plot – 454Sim
[Plot of assembly score (axis ticks 0.1 to 0.5) against quality of output (axis ticks 90.0% to 100.0%). Labeled points: Mira, Newbler+CA+Mira, Newbler+Mira, Newbler+CA, CA, Newbler. Point coordinates are not preserved in this transcript.]
![Page 30: Genome Assembly Final Results](https://reader036.fdocuments.us/reader036/viewer/2022062310/5681662a550346895dd98907/html5/thumbnails/30.jpg)
Assembly spec. vs Accuracy plot – Art-454
[Plot of assembly score (axis ticks 0 to 6) against quality of output (axis ticks 90.0% to 100.0%). Labeled points: Mira, Newbler, Newbler+CA, CA, Newbler+Mira, Newbler+Mira+CA. Point coordinates are not preserved in this transcript.]
![Page 31: Genome Assembly Final Results](https://reader036.fdocuments.us/reader036/viewer/2022062310/5681662a550346895dd98907/html5/thumbnails/31.jpg)
Assembly spec. vs Accuracy plot – FlowSim
[Plot of assembly score (axis ticks 0 to 8) against quality of output (axis ticks 90.0% to 100.0%). Labeled points: Mira, Newbler+Mira, Newbler+Mira+CA, Newbler+CA, CA, Newbler. Point coordinates are not preserved in this transcript.]
![Page 32: Genome Assembly Final Results](https://reader036.fdocuments.us/reader036/viewer/2022062310/5681662a550346895dd98907/html5/thumbnails/32.jpg)
Assembly spec. vs Accuracy plot – M21709
[Plot of assembly score (axis ticks 0 to 8) against quality of output (axis ticks 50.0% to 80.0%). Labeled points: Newbler+Mira, AMOScmp, Celera, Mira, Newbler+Mira+CA, Newbler, Newbler+CA, AMOScmp+Newbler. Point coordinates are not preserved in this transcript.]
![Page 33: Genome Assembly Final Results](https://reader036.fdocuments.us/reader036/viewer/2022062310/5681662a550346895dd98907/html5/thumbnails/33.jpg)
Inference
- Striking a balance is critical.
- We chose:
  - Newbler + MIRA for H. haemolyticus
  - Newbler + AMOScmp for H. influenzae
- Universally applicable pipeline
  - Adopt the most consistent tool/pipeline (conservative approach)
  - NEWBLER
- Assembling specific genomes/strains
  - Choose the one that strikes the best balance for your genome
  - NEWBLER + (CELERA/MIRA)
![Page 34: Genome Assembly Final Results](https://reader036.fdocuments.us/reader036/viewer/2022062310/5681662a550346895dd98907/html5/thumbnails/34.jpg)
![Page 35: Genome Assembly Final Results](https://reader036.fdocuments.us/reader036/viewer/2022062310/5681662a550346895dd98907/html5/thumbnails/35.jpg)
Final Results

| Genome | Contig # | N50 | Size | Method |
|---|---|---|---|---|
| M19107 | 129 | 25914 | 1774129 | Newbler + Mira |
| M19501 | 19 | 284900 | 1809865 | Newbler + Mira |
| M21127 | 32 | 122121 | 2029793 | Newbler + Mira |
| M21621 | 27 | 139238 | 1959123 | Newbler + Mira |
| M21639 | 56 | 87673 | 2397857 | Newbler + Mira |
| M21709 | 28 | 140484 | 1808157 | Newbler + AMOScmp |
![Page 36: Genome Assembly Final Results](https://reader036.fdocuments.us/reader036/viewer/2022062310/5681662a550346895dd98907/html5/thumbnails/36.jpg)
Key take-aways
- Understand your data: platform, long/short reads, coverage, paired/non-paired, quality of basecalling, etc.
- Evaluate the need for error correction
- Choose a set of “best” assemblers: de novo/reference assembly, DBG/OLC algorithm
- Merge assemblies
- Ordering and scaffolding
- Finishing

Evaluate your assembly at every step to ensure that you are on the right track!
![Page 37: Genome Assembly Final Results](https://reader036.fdocuments.us/reader036/viewer/2022062310/5681662a550346895dd98907/html5/thumbnails/37.jpg)
Coming next: Demo
![Page 38: Genome Assembly Final Results](https://reader036.fdocuments.us/reader036/viewer/2022062310/5681662a550346895dd98907/html5/thumbnails/38.jpg)