Next generation sequence data and de novo assembly For human genetics By Jaap van der Heijden.
Next generation sequence data and de novo assembly
For human genetics
By Jaap van der Heijden
De novo assembly
Overall idea
Repeats and non-random shearing
Scaffolding
• Multiple libraries
• Contigs are ordered and oriented by mate pairs -> scaffolding
4 types of assemblers
Greedy algorithms
Overlap-layout-consensus
Align-layout-consensus
BAC-by-BAC sequencing
Types of assemblers I
Greedy algorithms
• join similar reads
• are easily confused by repeats
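The greedy idea can be sketched in a few lines of Python. This is a minimal illustration, not how any of the assemblers named in these slides is implemented: the function names, the toy reads and the minimum overlap of 3 are all assumptions, and reverse complements, sequencing errors and contained reads are ignored.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that equals a prefix of b (0 if shorter than min_len)."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of reads with the longest overlap until none is left."""
    reads = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while len(reads) > 1:
        best_n, best_i, best_j = 0, None, None
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_n:
                        best_n, best_i, best_j = n, i, j
        if best_n == 0:
            break  # no overlaps left: the remaining reads stay separate contigs
        merged = reads[best_i] + reads[best_j][best_n:]
        reads = [r for k, r in enumerate(reads) if k not in (best_i, best_j)] + [merged]
    return reads


# Toy reads (hypothetical) that tile a 12-nucleotide sequence.
print(greedy_assemble(["ATGCGT", "CGTACC", "ACCTTG"]))
```

Because the best-scoring pair is merged without looking at the wider picture, two reads that end in the same repeat can be joined even when they come from different places in the genome, which is the weakness named on the slide.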
Types of assemblers II
Overlap-layout-consensus assembler
• nodes represent ends of reads
• lines represent similarity between reads (overlap)
• layout step removes redundant information
• consensus step builds the genome sequence
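The three steps can be sketched on a toy example. This is a simplified illustration under several assumptions: each node is a whole read rather than a read end, overlaps are exact, the layout step is reduced to removing transitive edges, and the consensus step simply concatenates the non-overlapping parts along a single path instead of voting over a multiple alignment. All names and reads are hypothetical.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that equals a prefix of b."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def build_overlap_graph(reads, min_len=3):
    """Overlap step: map each edge (a, b) to its overlap length."""
    graph = {}
    for a in reads:
        for b in reads:
            if a != b:
                n = overlap(a, b, min_len)
                if n > 0:
                    graph[(a, b)] = n
    return graph


def remove_transitive_edges(graph):
    """Layout step (simplified): drop a->c when a->b and b->c already exist."""
    successors = {}
    for a, b in graph:
        successors.setdefault(a, set()).add(b)
    reduced = dict(graph)
    for a, c in graph:
        for b in successors.get(a, ()):
            if b != c and c in successors.get(b, ()):
                reduced.pop((a, c), None)
    return reduced


def walk_path(reads, graph):
    """Consensus step (simplified): follow the single path and append the new bases."""
    next_read = {a: (b, n) for (a, b), n in graph.items()}
    has_incoming = {b for _, b in graph}
    node = next(r for r in reads if r not in has_incoming)
    sequence, seen = node, {node}
    while node in next_read and next_read[node][0] not in seen:
        node, n = next_read[node]
        seen.add(node)
        sequence += node[n:]
    return sequence


reads = ["ATGCGTAC", "GCGTACCT", "GTACCTTG"]  # hypothetical toy reads
graph = build_overlap_graph(reads)
layout = remove_transitive_edges(graph)
print(len(graph), len(layout), walk_path(reads, layout))
```

In this example the first and third read also overlap directly; that edge carries no extra information once the two shorter steps are known, so the layout step removes it.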
Types of assemblers III
Align-layout-consensus: a process called comparative assembly.
The overlap stage of assembly is replaced by an alignment step.
The layout stage is also greatly simplified by the additional constraints that the alignment to the reference provides.
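A minimal sketch of comparative assembly, assuming ungapped placement of each read at the reference position with the fewest mismatches, followed by a per-position majority vote. Real tools use indexed aligners and handle gaps and quality values; the function names, the mismatch limit and the toy sequences below are assumptions for illustration only.

```python
from collections import Counter


def best_position(read, reference):
    """Align step: ungapped placement with the fewest mismatches."""
    best_pos, best_mismatches = 0, len(read) + 1
    for pos in range(len(reference) - len(read) + 1):
        window = reference[pos:pos + len(read)]
        mismatches = sum(1 for x, y in zip(read, window) if x != y)
        if mismatches < best_mismatches:
            best_pos, best_mismatches = pos, mismatches
    return best_pos, best_mismatches


def comparative_assembly(reads, reference, max_mismatches=2):
    """Layout and consensus: pile reads on the reference and take the majority base."""
    columns = [Counter() for _ in reference]
    for read in reads:
        pos, mismatches = best_position(read, reference)
        if mismatches <= max_mismatches:
            for offset, base in enumerate(read):
                columns[pos + offset][base] += 1
    # Positions without any read are reported as N.
    return "".join(col.most_common(1)[0][0] if col else "N" for col in columns)


reference = "ATGCGTACCTTGAC"                      # hypothetical reference
reads = ["ATGCGAAC", "GCGAACCT", "ACCTTGAC"]      # sample differs at one position
print(comparative_assembly(reads, reference))
```

The reference fixes where every read belongs, so no read-against-read comparison is needed; the price is that the result is only as good as the similarity between the sample and the reference.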
Types of assemblers IV
BAC-by-BAC sequencing
• Genome is broken into fragments
• Each BAC's location is determined in the lab
• Minimum tiling path (the whole genome is covered by at least one BAC)
• BACs are sequenced
Lander-Waterman equation
“Rain drops” to cover a tile
8-10 fold coverage: 5 contigs for a 1 Mb genome
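The figure on this slide can be reproduced with the Lander-Waterman expectation, in which the expected number of contigs is N * exp(-c * (1 - theta)) for N reads, coverage c, and a minimum detectable overlap theta given as a fraction of the read length. The sketch below assumes 500-nucleotide reads and theta = 0; neither value is stated on the slide.

```python
import math


def lander_waterman(genome_size, read_length, coverage, theta=0.0):
    """Return (number of reads, expected contigs, expected fraction of genome covered)."""
    n_reads = coverage * genome_size / read_length
    expected_contigs = n_reads * math.exp(-coverage * (1.0 - theta))
    fraction_covered = 1.0 - math.exp(-coverage)
    return n_reads, expected_contigs, fraction_covered


# Assumed values: 1 Mb genome, 500 nt reads (the read length is an assumption).
for c in (8, 10):
    n_reads, contigs, covered = lander_waterman(1_000_000, 500, c)
    print(f"{c}x coverage: {n_reads:.0f} reads, {contigs:.1f} expected contigs, "
          f"{covered:.5f} of the genome covered")
```

Under these assumptions, 8-fold coverage gives an expectation of about 5 contigs and 10-fold coverage about 1, in line with the numbers quoted on the slide.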
Timeline
1975 Sanger sequencing
1990 First shotgun/EST assemblers (overlap-layout-consensus approach)
2000 Human shotgun assembly
2001 Mouse shotgun assembly
2005 454 (Roche) available
2006 Solexa available
2007 Short-read assemblers (de Bruijn graphs)
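For readers unfamiliar with the de Bruijn graphs mentioned in the last timeline entry, the core data structure can be sketched as follows. Every k-mer in a read becomes an edge from its first k-1 bases to its last k-1 bases, so reads are never compared with each other directly. The function name, the reads and k = 3 are hypothetical; error correction and graph simplification, which real short-read assemblers depend on, are left out.

```python
from collections import defaultdict


def de_bruijn_graph(reads, k):
    """Map each (k-1)-mer to the set of (k-1)-mers that follow it in some read."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph


graph = de_bruijn_graph(["ATGCG", "GCGTA"], k=3)  # hypothetical toy reads
for node in sorted(graph):
    print(node, "->", sorted(graph[node]))
```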
The complexity of sequence assembly
• Long reads
  - better identification
  - much slower
• Short reads
  - faster to align
  - more difficult with repeats
• Number of reads
• Length of reads
• Mismatches
• Algorithms can show quadratic or even exponential complexity
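As a rough illustration of the quadratic case: a naive overlap stage compares every read with every other read, so the number of comparisons grows with the square of the number of reads. The read counts below are arbitrary round numbers, not taken from the projects in these slides.

```python
def all_vs_all_comparisons(n_reads):
    """Number of unordered read pairs a naive all-against-all overlap step examines."""
    return n_reads * (n_reads - 1) // 2


for n in (1_000, 1_000_000):
    print(f"{n} reads -> {all_vs_all_comparisons(n)} pairwise comparisons")
```

A thousand times more reads means roughly a million times more comparisons, which is why short-read assemblers avoid the all-against-all step.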
Name                             Type                     Technologies                  Author
ABySS                            genomes                  Solexa, SOLiD                 Simpson, J. et al.
AMOS                             genomes                  Sanger, 454                   Salzberg, S. et al.
Arachne WGA                      (large) genomes          Sanger                        Batzoglou, S. et al.
CAP3, PCAP                       genomes                  Sanger                        Huang, X. et al.
Celera WGA Assembler / CABOG     (large) genomes          Sanger, 454, Solexa           Myers, G. et al.; Miller G. et al.
CLC Genomics Workbench           genomes                  Sanger, 454, Solexa, SOLiD    CLC bio
CodonCode Aligner                BACs (, genomes?)        Sanger                        CodonCode Corporation
Euler                            genomes                  Sanger, 454 (, Solexa ?)      Pevzner, P. et al.
Euler-sr                         genomes                  454, Solexa                   Chaisson, MJ. et al.
MIRA, miraEST                    genomes, ESTs            Sanger, 454, Solexa           Chevreux, B.
NextGENe                         (small genomes?)         454, Solexa, SOLiD            Softgenetics
Newbler                          genomes                  454, Sanger                   454/Roche
Phrap                            genomes                  Sanger, 454                   Green, P.
TIGR Assembler                   genomic                  Sanger                        -
Sequencher                       (small) genomes          Sanger                        Gene Codes Corporation
SeqMan NGen                      (small) genomes          Sanger, 454, Solexa           DNASTAR
SHARCGS                          (small) genomes          Solexa                        Dohm et al.
SSAKE                            (small) genomes          Solexa (SOLiD? Helicos?)      Warren, R. et al.
Staden gap4 package              BACs (, small genomes?)  Sanger                        Staden et al.
VCAKE                            (small) genomes          Solexa (SOLiD?, Helicos?)     Jeck, W. et al.
Phusion assembler                (large) genomes          Sanger                        Mullikin JC, et al.
Quality Value Guided SRA (QSRA)  genomes                  Sanger, Solexa                Bryant DW, et al.
Velvet (algorithm)               (small) genomes          Sanger, 454, Solexa, SOLiD    Zerbino, D. et al.
3 NGS projects
Dragonfly
Medicinal maggots
EST comparison
Dragonfly (Dutch: libelle)
Order Odonata
3000 species, 90 in Europe
Undergo a morphological change
Pilot study for African dragonfly
Morphological change
Some migrate, others don't
Genetically divergent
Contain lots of introns in their genome
Project questions
What are the homologies with other species?
How big is the genome?
Are there already sequences in GenBank, and are they present in the data?
Dragonfly project data
Genomic, single end
1 x 1,147,762 reads
Trimmed to 34 of 51 nucleotides
39,023,908 nucleotides sequenced
cDNA, paired end
2 x 1,291,901 reads
Read length = 51
131,773,902 nucleotides sequenced
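The nucleotide totals on this slide follow directly from the read counts and read lengths, which the short check below reproduces. It assumes the genomic total is counted at the trimmed length of 34 nucleotides, which is the reading that matches the slide's figure; the function name is hypothetical.

```python
def nucleotides_sequenced(reads_per_end, read_length, ends=1):
    """Total nucleotides = number of ends x reads per end x read length."""
    return ends * reads_per_end * read_length


genomic = nucleotides_sequenced(1147762, 34)        # single end, trimmed to 34 nt
cdna = nucleotides_sequenced(1291901, 51, ends=2)   # paired end, 51 nt per read
print(genomic, cdna)
```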
Dragonfly methods
Assemble cDNA
BLAST resulting contigs to determine homologies
Align genomic DNA to contigs
Calculate genome size
Dragonfly assembly results
total contigs: 3898
average length of contigs: 176
average coverage of contigs: 24
contigs larger than 300 nucleotides: 800
average length of contigs larger than 300: 508
average coverage of contigs larger than 300: 15
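Summary figures of this kind can be computed from a list of contigs as sketched below. The input format (one length and one coverage value per contig), the unweighted averages and the toy values are assumptions; the slides do not say how the assembler reports or averages coverage.

```python
def contig_stats(contigs, min_length=300):
    """contigs: list of (length, coverage) pairs. Returns summary numbers as a dict."""
    def summarise(items):
        if not items:
            return 0, 0.0, 0.0
        mean_length = sum(length for length, _ in items) / len(items)
        mean_coverage = sum(cov for _, cov in items) / len(items)
        return len(items), mean_length, mean_coverage

    large = [c for c in contigs if c[0] > min_length]  # strictly larger than the cut-off
    n_all, len_all, cov_all = summarise(contigs)
    n_large, len_large, cov_large = summarise(large)
    return {
        "total contigs": n_all,
        "average length": len_all,
        "average coverage": cov_all,
        "contigs larger than cut-off": n_large,
        "average length (large)": len_large,
        "average coverage (large)": cov_large,
    }


# Hypothetical toy contigs, not the dragonfly data.
toy = [(100, 30.0), (250, 20.0), (400, 10.0), (600, 20.0)]
for name, value in contig_stats(toy).items():
    print(f"{name}: {value}")
```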
Dragonfly genes and homologies
Libellula pulchella
Enallagma aspersum
Erythromma najas
Ischnura verticalis
many Drosophila species
The criteria used in this analysis were an e-value of less than 1*10^-40 and a score of more than 200.
Found in the cDNA contigs:
COII gene with accession number GQ256052.1 (partial)
COI gene with accession number GQ256032.1 (partial)
NDI gene with accession number GQ255994.1 (partial)
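The e-value and score criteria can be applied to BLAST results with a short filter such as the one below. It assumes the hits were saved in BLAST's standard 12-column tabular format, where the e-value is column 11 and the bit score is column 12, and that "score" on the slide means bit score; the slides state neither. The contig and subject names in the example are invented placeholders, not results from this project.

```python
def filter_hits(lines, max_evalue=1e-40, min_score=200.0):
    """Keep hits with e-value < max_evalue and bit score > min_score."""
    kept = []
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank and comment lines
        fields = line.rstrip("\n").split("\t")
        evalue, score = float(fields[10]), float(fields[11])
        if evalue < max_evalue and score > min_score:
            kept.append((fields[0], fields[1], evalue, score))
    return kept


# Hypothetical tabular hits: query, subject, %identity, length, mismatches, gaps,
# query start/end, subject start/end, e-value, bit score.
example = [
    "contig_1\tsubject_A\t98.5\t300\t4\t0\t1\t300\t1\t300\t1e-150\t540",
    "contig_2\tsubject_B\t80.0\t120\t24\t0\t1\t120\t1\t120\t1e-20\t95.0",
    "contig_3\tsubject_C\t90.0\t200\t20\t0\t1\t200\t1\t200\t1e-45\t180",
]
for hit in filter_hits(example):
    print(hit)
```

Only the first example hit passes: the second fails both criteria and the third has a low enough e-value but a score below 200.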
Dragonfly genome size
30 genomic genes selected after BLAST search
Size 300-1500
Alignment with Bowtie
“calculation”
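The slide does not spell out the calculation. One common back-of-the-envelope approach, shown here purely as an assumption about what may have been done, is to divide the total number of genomic nucleotides sequenced by the average depth observed on the selected genes. It presumes the genes are single-copy and that reads sample the genome uniformly. The function name is hypothetical and the aligned-base and gene-length values below are made-up placeholders; only the genomic total is taken from the data slide.

```python
def estimate_genome_size(total_bases, aligned_bases, target_length):
    """Genome size ~ total bases sequenced / mean depth on the reference genes."""
    mean_depth = aligned_bases / target_length
    return total_bases / mean_depth


total_genomic = 39_023_908     # nucleotides sequenced, from the data slide
aligned_on_genes = 1_500       # hypothetical: bases of genomic reads aligned to the genes
gene_length_total = 25_000     # hypothetical: combined length of the selected genes

size = estimate_genome_size(total_genomic, aligned_on_genes, gene_length_total)
print(f"estimated genome size: {size:,.0f} nucleotides")
```

With so little genomic data the depth on any single gene is far below 1x, so an estimate of this kind is very sensitive to the number of genes used and to which reads align.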
Medicinal maggots
Applied to non-healing wounds
Genes revealed
• Signaling proteins: inhibitor of apoptosis protein 2
• Digestive enzymes: lipases, proteinases
• Antimicrobial peptides (AMPs): Lucilia defensin, diptericin
Medicinal maggots data
5 degenerate peptide sequences
36 peptides
cDNA: 8,199,983 reads
Read length = 32
2,623,994,560
Medicinal maggots question
Have we sequenced (pieces of) the genes corresponding to the peptides?
Medicinal maggots methods
Build local library of peptides
Assemble contigs with:
• CLC bio
• NextGENe
• Velvet
BLAST contigs to peptides
Find hits
Make coverage plot
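The last step can be sketched as a per-position depth calculation. The sketch assumes the alignments are available as 0-based, half-open (start, end) intervals on one contig, which is a hypothetical input format, and it prints a crude text plot rather than using a plotting library.

```python
def coverage_profile(contig_length, alignments):
    """Depth at every position of a contig, given (start, end) intervals (0-based, half-open)."""
    diff = [0] * (contig_length + 1)
    for start, end in alignments:
        diff[start] += 1   # coverage rises where an alignment begins
        diff[end] -= 1     # and falls just after it ends
    depth, running = [], 0
    for change in diff[:-1]:
        running += change
        depth.append(running)
    return depth


# Hypothetical alignments on a 6-nucleotide contig.
profile = coverage_profile(6, [(0, 3), (2, 6)])
for position, d in enumerate(profile):
    print(f"{position:3d} {'#' * d}")
```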
NextGENe assembly maggots
number of contigs = 59048
average length = 59
average coverage = 11
number of contigs >300 = 719
average length >300 = 661
average coverage >300 = 64
CLC assembly
number of contigs = 78
average length = 2282
average coverage = 514
Velvet assembly
total contigs: 586
length of contigs: 168
coverage of contigs: 55
contigs larger than 300 nucleotides: 62
length of contigs larger than 300: 779
coverage of contigs larger than 300: 63
Found genes maggots
C. vicina mRNA for arylphorin subunit A4 (Velvet)
Drosophila willistoni GK21455 (Dwil\GK21455) mRNA (NextGENe)
Lucilia cuprina clone sbsp9 serine proteinase mRNA (NextGENe)
EST comparison
Traditional EST sequencing
Known library
Assemblers:
• CLC bio
• NextGENe
• Velvet
EST comparison method
Assemble cDNA and match with known ESTs
EST results
conclusions
Big differences between assemblers in:
• coverage
• length
• number of nodes
• sequence
x performs best on EST test
Questions?