Pamela Ferretti Laboratory of Computational Metagenomics Centre for Integrative Biology
Pamela Ferretti
Laboratory of Computational Metagenomics
Centre for Integrative Biology, University of Trento
Italy
Microbial Genome Assembly
Outline-summary
1. QUICK INTRODUCTION
2. GENOME ASSEMBLY
3. ASSEMBLY STRATEGIES
4. CASE STUDY
DNA packaging
Next Generation Sequencing
TCTTATTGTGACC TAGGCTAGCTTAG
GCAATGCAGTAAC TCCAGCTAGGTTC
ACGTAGGCTAGCGTTAGCGA ........ CTGCAT C
Genome Assembly
1. GENOME SEQUENCING
2. PRELIMINARY ANALYSIS
3. ASSEMBLY
4. ADVANCED BIOINFORMATIC ANALYSIS
OVERLAPPING SEQUENCE ALIGNMENT
On the feasibility of sequence assembly
Sequencing the human genome with shotgun sequencing + assembly is the only feasible strategy
Weber, James L., and Eugene W. Myers. "Human whole-genome shotgun sequencing." Genome Research 7.5 (1997): 401-409.
Computational assembly of shotgun sequencing data is simply unfeasible, and a bad idea anyway
Green, Philip. "Against a whole-genome shotgun." Genome Research 7.5 (1997): 410-417.
They were both right! (…well, Weber and Myers were a bit more right from the practical viewpoint…)
Genome assembly strategies
Greedy approach → SSAKE
De Bruijn graph (DBG) → Velvet, SOAPdenovo
Overlap Layout Consensus (OLC) → MIRA
Mixed approaches → MaSuRCA
Genome assembly strategies
DE BRUIJN GRAPH APPROACH (DBG)
Velvet, SOAPdenovo2
Nodes = overlapping sequences of reads of uniform length
Edges = k-mers (unique subsequences of length k within reads)
EULERIAN PATH
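A minimal sketch of the DBG idea, not the implementation used by Velvet or SOAPdenovo2. It assumes error-free reads and a graph that contains a single Eulerian path. The function names, the toy reads and k=4 are illustrative assumptions. Each distinct k-mer becomes one edge from its (k-1)-base prefix to its (k-1)-base suffix, and the contig is spelled by walking every edge once.

```python
from collections import defaultdict


def build_de_bruijn(reads, k):
    """Return adjacency lists: each distinct k-mer adds one edge prefix -> suffix."""
    graph = defaultdict(list)
    seen = set()
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if kmer in seen:  # each distinct k-mer is stored just once
                continue
            seen.add(kmer)
            graph[kmer[:-1]].append(kmer[1:])
    return graph


def eulerian_path(graph):
    """Hierholzer's algorithm; assumes an Eulerian path exists."""
    in_deg = defaultdict(int)
    for targets in graph.values():
        for v in targets:
            in_deg[v] += 1
    # Start at the node with one more outgoing than incoming edge, if there is one.
    start = next((u for u in graph if len(graph[u]) - in_deg[u] == 1),
                 next(iter(graph)))
    edges = {u: list(vs) for u, vs in graph.items()}
    stack, path = [start], []
    while stack:
        u = stack[-1]
        if edges.get(u):
            stack.append(edges[u].pop())
        else:
            path.append(stack.pop())
    return path[::-1]


def spell(path):
    """Merge consecutive nodes, each adding its last base to the contig."""
    return path[0] + "".join(node[-1] for node in path[1:])


# Toy reads (hypothetical), overlapping by four bases.
reads = ["ATGGCG", "GGCGTG", "CGTGCA"]
contig = spell(eulerian_path(build_de_bruijn(reads, k=4)))
print(contig)
```

Note that no read is ever compared against another read: the overlaps are implicit in shared k-mers.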
Genome assembly strategies
OVERLAP LAYOUT CONSENSUS (OLC)
MIRA
Nodes = reads
Edges = overlap between reads
1. OVERLAP
2. LAYOUT
3. CONSENSUS
HAMILTONIAN PATH
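The three OLC steps can be sketched on a toy example. This is an illustration under strong assumptions (distinct, error-free reads and exact suffix-prefix matches), not MIRA's algorithm. The function names and the `min_len` parameter are hypothetical. The layout step brute-forces every read ordering, which shows why the Hamiltonian path formulation does not scale.

```python
from itertools import permutations


def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that equals a prefix of b."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def overlap_graph(reads, min_len=3):
    """Step 1 (OVERLAP): nodes are reads, edges[(a, b)] is the overlap length."""
    edges = {}
    for a, b in permutations(reads, 2):  # all ordered pairs: the costly stage
        n = overlap(a, b, min_len)
        if n > 0:
            edges[(a, b)] = n
    return edges


def layout(reads, edges):
    """Step 2 (LAYOUT): brute-force Hamiltonian path with maximal total overlap."""
    best, best_score = None, -1
    for order in permutations(reads):
        pairs = list(zip(order, order[1:]))
        if all(p in edges for p in pairs):
            score = sum(edges[p] for p in pairs)
            if score > best_score:
                best, best_score = order, score
    return best


def consensus(order, edges):
    """Step 3 (CONSENSUS): with error-free reads this is a plain merge."""
    seq = order[0]
    for a, b in zip(order, order[1:]):
        seq += b[edges[(a, b)]:]
    return seq


# Toy reads (hypothetical), given out of order on purpose.
reads = ["GGCGTG", "ATGGCG", "CGTGCA"]
edges = overlap_graph(reads)
order = layout(reads, edges)
print(consensus(order, edges))
```

A real consensus step would align the reads in each layout column and vote on every base. The merge above only works because the toy reads contain no errors.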
Genome assembly strategies
DBG vs OLC
ADVANTAGES
DBG: Very sensitive to repeats
DBG: K-mers stored just once
DBG: Eulerian cycle
DBG: Never explicitly computes pairwise overlaps
OLC: Modular algorithmic design
OLC: Flexibility and robustness
DISADVANTAGES
DBG: Sensitive to sequencing errors (new k-mers)
DBG: Large computational memory space requirements
OLC: Hamiltonian cycle
OLC: Overlap stage is time-consuming
OLC: Genome-size limitations
Genome Assemblers
Average Coverage
Number of Contigs
Number of Contigs > 1Kb
N50 contig size
Fraction of reads assembled
Total consensus (in nt)
Number of scaffolds
N50 scaffolds size
Ion Torrent PGM → MIRA 3.9
Illumina → MaSuRCA
MIRA 3.9 also produced good-quality results, but it has a longer execution time and becomes unstable with large numbers of short reads
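Several of the metrics above are simple functions of the contig lengths. A small sketch follows, using the common definition of N50: the length of the contig at which the running total, taken from the longest contig down, first reaches half of the total assembled bases. The function names and the example lengths are hypothetical and do not come from the assemblies discussed here.

```python
def n50(lengths):
    """Contig length at which the sorted cumulative sum reaches half the total."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0  # empty input


def contigs_longer_than(lengths, threshold=1000):
    """Number of contigs strictly longer than the threshold (1 kb by default)."""
    return sum(1 for length in lengths if length > threshold)


# Hypothetical contig lengths in nucleotides.
contig_lengths = [5000, 3000, 1500, 800, 200]
print("Number of contigs:", len(contig_lengths))
print("Number of contigs > 1Kb:", contigs_longer_than(contig_lengths))
print("Total consensus (nt):", sum(contig_lengths))
print("N50 contig size:", n50(contig_lengths))
```

The same `n50` function applies to scaffold lengths.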
Mycobacteria Assembly: Case Study
Responsible for many animal and human diseases
M. tuberculosis and M. leprae (TM)
M. fortuitum (NTM) outbreak (nail salon, 2002)
M. chelonae (NTM) outbreak (face lifts, 2004)
Illumina HiSeq sequencing (NGS Facility – CIBIO/UNITN)
Twenty mycobacterial strains
From 20 different Mycobacteria species
→ MaSuRCA
Novel mycobacteria detection clinical tests
Raw data quality assessment and pre-processing
Fastq-mcf tool removes:
• poor quality ends of reads
• Ns, duplicates and sequencing adapters
• reads that are too short
Reduction up to 73%
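To make the filtering steps concrete, here is a rough Python sketch of this kind of read clean-up. It is not fastq-mcf's algorithm or interface. The thresholds `min_qual` and `min_len`, the function names and the Phred+33 encoding are assumptions chosen for illustration, and adapter removal is left out.

```python
def phred33(qual_string):
    """Decode a FASTQ quality string, assuming Phred+33 encoding."""
    return [ord(c) - 33 for c in qual_string]


def trim_read(seq, quals, min_qual=20, min_len=30):
    """Trim low-quality bases from both ends; return None if the read is dropped."""
    start, end = 0, len(seq)
    while end > start and quals[end - 1] < min_qual:
        end -= 1
    while start < end and quals[start] < min_qual:
        start += 1
    seq, quals = seq[start:end], quals[start:end]
    if len(seq) < min_len or "N" in seq:  # too short, or contains ambiguous bases
        return None
    return seq, quals


def clean_reads(records, min_qual=20, min_len=30):
    """Apply trimming to (seq, quality_string) pairs and drop exact duplicates."""
    seen, kept = set(), []
    for seq, qual_string in records:
        result = trim_read(seq, phred33(qual_string), min_qual, min_len)
        if result is None or result[0] in seen:
            continue
        seen.add(result[0])
        kept.append(result[0])
    return kept


# Toy records (hypothetical): 'I' decodes to Q40 and '#' to Q2.
records = [
    ("ACGTACGTAC", "IIIIIIII##"),  # poor-quality 3' end is trimmed
    ("ACGTACGT", "IIIIIIII"),      # duplicate of the trimmed read above
    ("ACGNACGT", "IIIIIIII"),      # contains an N
    ("ACG", "III"),                # too short
]
print(clean_reads(records, min_qual=20, min_len=5))
```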
Assembly parameters setting
K-mers: strings of a particular length k, which are shorter than entire reads
Best empirical k-mer length: 91 bases
High coverage
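A short sketch of what choosing k means in practice: a read of length L yields L - k + 1 k-mers, so a long k leaves few k-mers per read and relies on high coverage. The read length of 100 used below is a hypothetical value, not taken from this dataset, and the function names are illustrative.

```python
from collections import Counter


def kmers(read, k):
    """All substrings of length k, in order; empty if the read is shorter than k."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def kmer_spectrum(reads, k):
    """Count how often each k-mer occurs across all reads."""
    return Counter(kmer for read in reads for kmer in kmers(read, k))


print(kmers("ACGTAC", 3))

# With a hypothetical 100-base read, k = 91 leaves 100 - 91 + 1 = 10 k-mers.
print(len(kmers("A" * 100, 91)))

spectrum = kmer_spectrum(["ACGTAC", "CGTACG"], 3)
print(spectrum.most_common(2))
```

K-mers that appear only once in a high-coverage dataset are often the result of sequencing errors, which is one reason the k-mer spectrum is commonly inspected before assembly.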
MaSuRCA results of Mycobacteria
Abnormal GC content
Genome size too high
GC content based quality analysis
Examples of environmental contamination
Staphylococcus epidermidis
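A GC-content check of this kind can be sketched as follows. Mycobacterial genomes are GC-rich while Staphylococcus epidermidis is GC-poor, so contigs from a contaminant tend to stand out. The expected GC fraction (0.65), the tolerance, the contig names and the function names below are illustrative assumptions, not values from this study.

```python
def gc_content(seq):
    """Fraction of G and C among the unambiguous bases of a sequence."""
    seq = seq.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    if acgt == 0:
        return 0.0
    return (seq.count("G") + seq.count("C")) / acgt


def flag_contigs(contigs, expected_gc, tolerance=0.10):
    """Names of contigs whose GC content deviates from the expected value."""
    return [name for name, seq in contigs.items()
            if abs(gc_content(seq) - expected_gc) > tolerance]


# Toy contigs (hypothetical): one GC-rich, one GC-poor.
contigs = {
    "contig_1": "GCGCGCGCGCGCGATATATA",  # 13 of 20 bases are G or C
    "contig_2": "ATATATATGC",            # 2 of 10 bases are G or C
}
print({name: round(gc_content(seq), 2) for name, seq in contigs.items()})
print(flag_contigs(contigs, expected_gc=0.65))
```

Flagged contigs are only candidates for contamination. Confirming the source organism requires a sequence comparison against reference genomes.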
Thanks
http://gcat.davidson.edu/phast/#methods