How to sequence a large eukaryotic genome
-
Upload
lex-nederbragt -
Category
Technology
-
view
12.442 -
download
3
description
Transcript of How to sequence a large eukaryotic genome
How to sequence a large eukaryotic genome
and how we sequenced the cod genome
Lex NederbragtNorwegian High-Throughput Sequencing Centre (NSC)
andCentre for Ecological and Evolutionary Synthesis (CEES)
What is a genome assembly?
A hierarchical data structure
that maps the sequence data
to a putative reconstruction of the target
Miller et al 2010, Genomics 95 (6): 315-327
Hierarchical structure
Sequence data
Reads
http://www.cbcb.umd.edu/research/assembly_primer.shtml
Reads!
http://www.sciencephoto.com/media/210915/enlarge
Contigs
Building contigs
Contigs
Building contigs
Repeat copy 1 Repeat copy 2
http://www.cbcb.umd.edu/research/assembly_primer.shtml
Collapsed repeat consensus
Contig orienation?Contig order?
Mate pairs
Other read type
Repeat copy 1 Repeat copy 2
mate pair reads(much) longer fragments
Scaffolds
Ordered, oriented contigs
contigs
mate pairs
gap size estimate
Hierarchical structure
Algorithms
All are graph-based
Read 1 Read 2
Overlap
Graph-theory!
Algorithms
Hamiltonian path– a path that contains all the nodes
http://www.cbcb.umd.edu/research/assembly_primer.shtml
Algorithms
Overlap calculation (alignment)– computationally intensive
Read 1 Read 2
Overlap
Algorithms
Path through the graphcontig
Read 1 Read 2
Overlap
Read 3
Overlap
Read 4
Overlap
Greedy extension
Oldest
http://www.cbcb.umd.edu/research/assembly_primer.shtml
Overlap-Layout-Consensus
Typical for Sanger-type reads– also used by newbler from 454 Life Sciences
Steps– Overlap computation– Layout: graph simplification– Consensus: sequence
Overlap-Layout-Consensus
Overlap phase:– K-mer seeds initiate overlap
ACGCGATTCAGGTTACCACG
de Bruijn graphs
Developed outside of DNA-related work– Best solution for very short reads ≤100 nt
GACCTACAGAC ACC CCT CTA TAC ACA
Read
K-mers (K=3)
K-1 bases overlap
de Bruijn graph
Graphs
Schatz M C et al. Genome Res. 2010;20:1165-1173
Graphs
Simplify the graph
Add scaffolding information
Sequence data
Sequencing errors– add complexity to graph– create new k-mers
Correction of errors– k-mer frequency
Kelley et al. Genome Biology 2010 11:R116
How to sequence a genome
human 1990'scod 1 2009 - 2011cod 2 2011 - 2012
Human genome
Public effort– BAC-by-BAC sequencing– hierarchical shotgun sequencing
Genome
BACs
Select BACs
http://www.cbcb.umd.edu/research/assembly_primer.shtml
shotgun sequencing
100-150 kb
Human genome
Celera: shotgun sequencing– entire genome shotgun– use of mate pairs
How to sequence a genome
PreparationsBAC-by-BAC
Add shotgunand mate pairs
Most projects
Double haploid individual ✔
OR Inbred line ✔
BAC library ✔
Genetic map ✔
Physical map ✔
The cod genome project
Preparations
Most projects
Codproject
Double haploid individual ✔ -
OR Inbred line ✔ -BAC library ✔ ✔*
Genetic map ✔ -Physical map ✔ -
* From a different individual
Cod: strategy
‘454 only’– NO subcloning– Pure ‘shotgun’ approach– 454 specific paired end libraries
Supplementary– BAC ends using Sanger sequencing
Cod: sequencing
Cod: assembly
Input for assembly– 84 million reads– 28 billion bases (Gb)
• 34x coverage
Assembly program– Newbler from 454– Celera from Venter Inst.
Computing nodes– 24 cpus– 128 GB of memory
Cod: assembly
611 Mb in 6 467 scaffolds– but 35% gap bases– short contigs– incomplete genes
Cod: gaps
Polymorphic contig 2Polymorphic contig 2
Polymorphic contig 3Polymorphic contig 3
Contig 4
Contig 1
Heterozygosity
Short Tandem Repeats
ACACACACACACACACACACACACACACACACACACACACACACACACACACACACACAACACACACACACACACACACACACACACACACACACACACACACACACACACACACACAACACACACACACACACACACACACACACACACACACACACACACACACACACACACACAACACACACACACACACACACACACACACACACACACACACACACACACACACACACACA
Cod: annotation
Ensembl– 'repair' genes based on stickleback sequence– ~22 000 genes
http://pre.ensembl.org/Gadus_morhua/
Cod 2: 2011-2012
Close the gaps– increase contig size
Pseudochromosomes– genetic linkage map– scaffolds to 'chromosomes'
• anchoring• ordering and orienting
Cod 2: strategy
New data– Illumina reads– longer 454 reads ~700 bases– PacBio reads?
Improved programs– newbler
New programs– assembly– gap closing
Many programs to choose from
Assembly competitions
Assemblathon 1– simulated datasets– ALLPATHS_LG – Broad Institute MIT (US)– Soapdenovo – BGI (China)– SGA – Sanger Institute (UK)
Assembly competitions
Assemblathon 2– real datasets
• snake – Illumina only• cichlid fish – Illumina only• parrot
– Illumina– 454 FLX+– PacBio
http://assemblathon.org/
How to sequence a genome
In 2011
Most projects
Double haploid individual ✔
OR Inbred line ✔
BAC library ✔
Genetic map ✔
Physical map ✔
Cheap alternative: RAD-tag sequencing
How to sequence a genome
Foundation of Illumina data– 100x coverage Paired End reads (2x100bp)– several Mate Pair libraries
• 2kb, 3kb, 8k, 10kb, bigger?
– this is now very cheap!
Fill gaps with long reads– 454 or PacBio
How to sequence a genome
Add lots of bioinformatics...
http://cores.montana.edu/index.php?page=bioinformatics-core-facility
Thank you!
www.sequencing.uio.no
www.sequencing.uio.no