Genome Assembly at JGI · 4/3/2016
Genome Assembly at JGI
Alicia Clum Genomic Technologies Workshop JGI User Meeting March 22, 2016
Outline
• Overview
• Improving assemblies with long read technology
• Future improvements
Genome assembly review
Genomic DNA → fragmentation → library creation → sequencing → assemble reads
Overview of assembly at JGI
| Program | Size (MB) | Libraries | Assembler | Target assemblies / year |
| --- | --- | --- | --- | --- |
| Microbe | 5 | 1 | SPAdes / HGAP | 1,330 |
| Fungi | 10s | 1 | ALLPATHS-LG / Falcon | 160 |
| Plant | 100-10,000 | 3+ | Arachne / ALLPATHS-LG / Falcon | 20 |
| Metagenome | 10-10,000 | 1 | MEGAHIT | 825 |
Challenges in genome assembly
• Repeat content
• Genome size
• GC content
• DNA quality and quantity
• Ploidy
[Figure: Fungal Repeat Content vs Genome Size (MB). x-axis: Genome Size (MB); y-axis: Repeat Content]
• 37 MB median genome size
• 9% median repeat content
Making assemblies better
Microbial drafts: number of contigs by data type
[Figure: Number of contigs by data type]
• Illumina fragment: median = 43 contigs (N = 1203)
• PacBio 10 kb: median = 2 contigs (N = 216)
Timeline - PacBio for fungal genomes
[Timeline figure, 2012 to 2016; events listed in slide order]
• Feb. - First Illumina/PacBio hybrid release (APLG)
• May - First PacBio-only release (HBAR-DTK)
• July - Falcon development begins
• Summer - JGI Falcon testing begins, first good diploid assemblies
• July - daligner work begins
• Jan. - Falcon incorporates daligner
• Oct. - First Falcon assembly to annotation
• Summer - Validated switch to PacBio for fungal assemblies for FY 2016
Can a single PacBio library approach produce better fungal assemblies?
4 fungal genomes (~5 µg DNA each):
| Genome | Genome Size (MB) | Repeat Content (%) | Ploidy |
| --- | --- | --- | --- |
| Clavicorona pyxidata | 43 | 14 | diploid |
| Byssothecium circinans | 48 | 15 | haploid |
| Clathrospora elynae | 45 | 47 | haploid |
| Lindgomyces ingoldianus | 66 | 20 | diploid |
• Illumina: one fragment library and one 4 kb mate-pair library, assembled with ALLPATHS-LG
• PacBio: one 10 kb AMPure library, assembled with Falcon
Image Credit: Laszlo Nagy, Manfred Binder, Pedro Crous, David Culley
PacBio assemblies have fewer contigs
[Figure: Number of Contigs. Contigs (N) per genome, PacBio vs. Illumina, for Clavicorona pyxidata, Byssothecium circinans, Clathrospora elynae, and Lindgomyces ingoldianus]
PacBio assemblies produce longer contigs
[Figure: Contig L50. Contig L50 (kb) per genome, PacBio vs. Illumina, for Clavicorona pyxidata, Byssothecium circinans, Clathrospora elynae, and Lindgomyces ingoldianus]
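Contig L50 can be computed from a list of contig lengths. The sketch below is illustrative only and is not the JGI pipeline. It assumes that the L50 on this slide (reported in kb) is the contig length at which contigs of that size or longer hold at least half of the assembled bases, a value many tools call N50. The function name `contig_l50` and any example lengths are hypothetical.

```python
def contig_l50(lengths):
    """Return (length, count) for a set of contig lengths.

    length: size of the contig at which the running total, taken from the
            longest contig downward, first reaches half of the assembly.
    count:  number of contigs needed to reach that point.
    Assumption: this matches the length-based "L50" shown on the slide.
    """
    if not lengths:
        raise ValueError("lengths must not be empty")

    ordered = sorted(lengths, reverse=True)  # longest contig first
    half_total = sum(ordered) / 2            # half of the assembled bases

    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= half_total:
            return length, count
```

For the hypothetical lengths `[80, 70, 50, 40, 30, 20]` (total 290), the two longest contigs already cover 150 bases, so the function returns a length of 70 reached with 2 contigs. Fewer, longer contigs push the length up and the count down, which is the pattern the PacBio assemblies show.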
PacBio assemblies are larger
• Larger assembled genome sizes reflect additional assembled repeat content
[Figure: Assembled Genome Size. Assembled Size (MB) per genome, PacBio vs. Illumina, for Clavicorona pyxidata, Byssothecium circinans, Clathrospora elynae, and Lindgomyces ingoldianus]
PacBio assembles more repeat content
[Figure: Percent of Assembled Genome Repeat Masked. Masked Sequence (%) per genome, PacBio vs. Illumina, for Basme, Boled, Hesve, Lacbi, Lizem, and Pirfi]
Median difference of 7% in masked sequence between Illumina and PacBio assemblies
Data courtesy of the fungal annotation team
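The masked percentage and the median difference can be sketched as follows. This is a minimal illustration, not the fungal annotation team's method. It assumes soft-masked sequence, where repeat-masked bases are written in lowercase, and it pairs each genome's PacBio value with its Illumina value. The function names and every number in the usage note are hypothetical.

```python
from statistics import median


def masked_percent(sequences):
    """Percent of bases that are repeat masked.

    Assumption: sequences are soft-masked, so masked bases are lowercase.
    """
    total = sum(len(seq) for seq in sequences)
    if total == 0:
        raise ValueError("no bases to evaluate")
    masked = sum(1 for seq in sequences for base in seq if base.islower())
    return 100.0 * masked / total


def median_difference(pacbio_percents, illumina_percents):
    """Median of per-genome differences (PacBio minus Illumina)."""
    if len(pacbio_percents) != len(illumina_percents):
        raise ValueError("both lists must cover the same genomes")
    return median(p - i for p, i in zip(pacbio_percents, illumina_percents))
```

With made-up inputs, `masked_percent(["ACGTacgt", "AAaa"])` gives 50.0, and `median_difference([30.0, 20.0, 45.0], [25.0, 10.0, 38.0])` gives 7.0. Taking the median of paired differences keeps one unusually repeat-rich genome from dominating the summary.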
PacBio-only assembly now implemented for fungal assembly
Genomic DNA → random fragmentation → library creation → sequencing → assemble reads
• Illumina: short insert fragment (270 bp); paired-end short insert reads (millions)
• PacBio: long fragment (10 kb); long reads (~100,000)
Courtesy: Jason Chin
(Clavicorona pyxidata HHB10654)
Managed to phase >50% of the genome. JGI data with current Falcon is at < 25%.
Conclusions
• Assembly pipelines vary by program and input data
• Long read technology and assembly algorithm development have improved assembly results
• Continued efforts for further improvements
Acknowledgments
JGI
• Alex Copeland
• Igor Grigoriev & Fungal Annotation Group
• Chris Daum & Sequencing Technologies Group
• Genome Assembly & QA/QC Groups
Pacific Biosciences
• Jason Chin
• Paul Peluso
• David Rank
• Kristi Spittle
Supplement
Long Reads Span Common Repetitive Elements
[Figure: PacBio Read Length Distribution. Example for the input data: length distribution of the pre-assembled reads for assembly (Acc. > 99%), with common repeat element lengths marked: transposons, 45S rDNAs, retrotransposons]
Methods for pre-assembly consensus: S. Koren, et al., Genome Biology 2013, 14:R101; C.-S. Chin, et al., Nature Methods 10, 563–569 (2013).
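One way to read this figure is to ask what share of reads is long enough to span a repeat of a given length. The sketch below is a rough illustration under stated assumptions, not part of the original analysis: a read counts as able to span a repeat if it covers the repeat plus some unique flanking sequence on each side. The function name `fraction_spanning`, the default flank of 500 bases, and the example values are all hypothetical. Length alone is a necessary condition, so the result is an upper bound on the reads that actually span a given repeat copy.

```python
def fraction_spanning(read_lengths, repeat_length, flank=500):
    """Fraction of reads long enough to cover a repeat plus flanks.

    A read qualifies if its length is at least
    repeat_length + 2 * flank (unique sequence on both sides).
    Assumption: the flank size is an illustrative choice, not a
    value taken from the slides.
    """
    if not read_lengths:
        raise ValueError("read_lengths must not be empty")
    if repeat_length < 0 or flank < 0:
        raise ValueError("repeat_length and flank must be non-negative")

    needed = repeat_length + 2 * flank
    spanning = sum(1 for length in read_lengths if length >= needed)
    return spanning / len(read_lengths)
```

For hypothetical reads of 2,000, 6,000, 9,000, and 12,000 bases and a 5,000-base repeat, three of the four reads reach the required 6,000 bases, so the function returns 0.75.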
>10 kb AMPure Subread Lengths
L50 subread lengths range from 3.3 kb to 6.5 kb
Evaluating Assemblers