Next generation sequence data and de novo assembly For human genetics By Jaap van der Heijden.

Page 1:

Next generation sequence data and de novo assembly

For human genetics

By Jaap van der Heijden

Page 2:

De novo assembly

Page 3:

Overall idea

Page 4:

Overall idea

Page 5:

Repeats and non-random shearing

Page 6:

Scaffolding

• Multiple libraries

• Contigs are ordered and oriented by mate pairs

-> scaffolding
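
The idea of chaining contigs with mate pairs can be sketched as follows. This is a minimal illustration, assuming every mate pair has already been reduced to an (upstream contig, downstream contig) link; the contig names, the `min_support` threshold and the function name are made up for the example, and contig orientation and gap sizes are ignored.

```python
from collections import Counter

def build_scaffold(links, min_support=2):
    """Order contigs from mate-pair links given as (upstream, downstream) pairs."""
    support = Counter(links)
    successor = {}
    for (left, right), count in support.items():
        if count < min_support or left == right:
            continue  # ignore weakly supported links and self-links
        current = successor.get(left)
        if current is None or count > support[(left, current)]:
            successor[left] = right  # keep the best-supported neighbour
    downstream = set(successor.values())
    # Start at a contig that is never placed downstream of another one.
    starts = sorted(c for c in successor if c not in downstream)
    if not starts:
        return []
    order, seen = [starts[0]], {starts[0]}
    while order[-1] in successor and successor[order[-1]] not in seen:
        order.append(successor[order[-1]])
        seen.add(order[-1])
    return order

# Hypothetical links: A->B seen three times, B->C twice, A->C once (too weak).
links = ([("contig_A", "contig_B")] * 3
         + [("contig_B", "contig_C")] * 2
         + [("contig_A", "contig_C")])
print(build_scaffold(links))
```

Requiring several independent mate pairs per link is a common way to keep chimeric pairs from joining unrelated contigs.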

Page 7:

4 types of assemblers

Greedy algorithms

Overlap-layout-consensus

Align-layout-consensus

BAC-by-BAC sequencing

Page 8:

Types of assemblers I

Greedy algorithms join similar reads

easily confused by repeats
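
A greedy assembler can be sketched in a few lines: repeatedly merge the two reads with the longest suffix-prefix overlap. This is a toy version that assumes error-free reads on a single strand; the reads and the `min_len` threshold are invented for illustration.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that equals a prefix of b (0 if too short)."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0

def greedy_assemble(reads, min_len=3):
    """Merge the best-overlapping pair until no overlap of at least min_len remains."""
    reads = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while len(reads) > 1:
        best_n, best_i, best_j = 0, None, None
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_n:
                        best_n, best_i, best_j = n, i, j
        if best_n == 0:
            break  # nothing left to join
        merged = reads[best_i] + reads[best_j][best_n:]
        reads = [r for k, r in enumerate(reads) if k not in (best_i, best_j)]
        reads.append(merged)
    return reads

print(greedy_assemble(["ATGGCGT", "GCGTACC", "TACCTTA"]))
```

Because the choice is purely local, two reads that end in the same repeat copy look equally good to this procedure, which is why repeats mislead it.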

Page 9:

Types of assemblers II

Overlap-layout-consensus assembler

nodes represent ends of reads

lines represent similarity between reads (overlap)

layout step removes redundant information

consensus step builds the genome sequence
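
The three steps can be illustrated with a small sketch. It makes simplifying assumptions: nodes are whole reads rather than read ends, reads are error-free and from one strand, the layout step only removes transitive edges, and the "consensus" simply spells out the single remaining path instead of voting over a multiple alignment. All names and reads are illustrative.

```python
def overlap(a, b, min_len=3):
    """Longest proper suffix of a that is a prefix of b (0 if shorter than min_len)."""
    for n in range(min(len(a), len(b)) - 1, min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0

def overlap_graph(reads, min_len=3):
    """Overlap step: map each edge (a, b) to its overlap length."""
    edges = {}
    for a in reads:
        for b in reads:
            if a != b:
                n = overlap(a, b, min_len)
                if n:
                    edges[(a, b)] = n
    return edges

def layout(edges):
    """Layout step: drop a->c whenever a->b and b->c already imply it."""
    nodes = {x for edge in edges for x in edge}
    return {(a, c): n for (a, c), n in edges.items()
            if not any((a, b) in edges and (b, c) in edges for b in nodes)}

def consensus(reads, edges):
    """Consensus step (simplified): walk the path and append the new bases."""
    targets = {b for _, b in edges}
    start = next(r for r in reads if r not in targets)
    succ = {a: (b, n) for (a, b), n in edges.items()}
    seq, cur = start, start
    while cur in succ:
        cur, n = succ[cur]
        seq += cur[n:]
    return seq

reads = ["ATGGCGTA", "GGCGTACC", "CGTACCTT"]
graph = overlap_graph(reads)
print(len(graph), len(layout(graph)), consensus(reads, layout(graph)))
```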

Page 10:

Types of assemblers III

Align-layout-consensus: a process called comparative assembly.

The overlap stage of assembly is replaced by an alignment step.

The layout stage is also greatly simplified due to the additional constraints provided by the alignment to the reference.
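
A rough sketch of comparative assembly, assuming a closely related reference, ungapped placement of each read at its lowest-mismatch position, and a per-column majority vote. Real tools use indexed aligners; the sequences and function name here are invented.

```python
from collections import Counter

def comparative_assemble(reference, reads):
    """Place reads on the reference, then call the majority base per column."""
    columns = [Counter() for _ in reference]
    for read in reads:
        # Alignment step: ungapped position with the fewest mismatches.
        best_pos = min(
            range(len(reference) - len(read) + 1),
            key=lambda p: sum(a != b for a, b in zip(read, reference[p:p + len(read)])),
        )
        # Layout step: the reference coordinate fixes where the read goes.
        for offset, base in enumerate(read):
            columns[best_pos + offset][base] += 1
    # Consensus step: majority base per column, 'N' where no read covers it.
    return "".join(c.most_common(1)[0][0] if c else "N" for c in columns)

reference = "ACGTTGCAAGGCT"
reads = ["ACGTTGTA", "TGTAAGGC", "GTAAGGCT"]  # sample differs from reference at one base
print(comparative_assemble(reference, reads))
```

The reads carry a T where the reference has a C, so the consensus reports the sample's base rather than the reference's.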

Page 11:

Types of assemblers IV

BAC-by-BAC sequencing

genome broken into fragments

each BAC's location is determined in the lab

minimum tiling path (whole genome is covered by at least one BAC)

BACs are sequenced
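
Choosing a minimum tiling path is an interval-cover problem. The sketch below uses the standard greedy rule (always take the clone that reaches furthest from the current covered edge) and assumes the mapped BAC positions are given as `(name, start, end)` with an exclusive end; the clone names and coordinates are hypothetical.

```python
def minimum_tiling_path(bacs, genome_length):
    """Greedy interval cover over BACs given as (name, start, end), end exclusive."""
    path, covered = [], 0
    while covered < genome_length:
        # Clones that start at or before the covered edge and extend beyond it.
        candidates = [b for b in bacs if b[1] <= covered < b[2]]
        if not candidates:
            raise ValueError(f"gap in BAC map at position {covered}")
        best = max(candidates, key=lambda b: b[2])  # reaches furthest
        path.append(best[0])
        covered = best[2]
    return path

bacs = [("A", 0, 150), ("B", 100, 220), ("C", 120, 300),
        ("D", 280, 400), ("E", 290, 380)]
print(minimum_tiling_path(bacs, 400))
```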

Page 12:

Lander-Waterman equation

“Rain drops” to cover a tile

8-10 fold coverage: 5 contigs for a 1 Mb genome
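
The Lander-Waterman model gives the expected number of contigs as N·e^(−c·σ), with N the number of reads, c the coverage and σ = 1 − (minimum detectable overlap / read length). The sketch below evaluates it for a 1 Mb genome. The read length of 500 nucleotides and the zero minimum overlap are assumptions chosen for illustration; the slide does not state them.

```python
import math

def expected_contigs(genome_size, read_length, coverage, min_overlap=0):
    """Lander-Waterman expected number of contigs (islands)."""
    n_reads = coverage * genome_size / read_length
    sigma = 1 - min_overlap / read_length
    return n_reads * math.exp(-coverage * sigma)

for c in (2, 4, 8, 10):
    # Assumed: 1 Mb genome, 500-nucleotide reads.
    print(c, round(expected_contigs(1_000_000, 500, c), 1))
```

Under these assumptions 8-fold coverage leaves roughly five contigs, in line with the figure on the slide, and each extra fold of coverage shrinks that number quickly.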

Page 13:

Timeline

1975 Sanger sequencing

1990 First shotgun/EST assemblers (overlap-layout-consensus approach)

2000 Human shotgun assembly

2001 Mouse shotgun assembly

2005 454 (Roche) available

2006 Solexa available

2007 Short-read assemblers (de Bruijn graphs)

Page 14:

The complexity of sequence assembly

• Long reads
– better identification
– much slower

• Short reads
– faster to align
– more difficult with repeats

• Number of reads

• Length of reads

• Mismatches

• Algorithms can show quadratic or even exponential complexity

Page 15:

Name | Type | Technologies | Author
ABySS | genomes | Solexa, SOLiD | Simpson, J. et al.
AMOS | genomes | Sanger, 454 | Salzberg, S. et al.
Arachne WGA | (large) genomes | Sanger | Batzoglou, S. et al.
CAP3, PCAP | genomes | Sanger | Huang, X. et al.
Celera WGA Assembler / CABOG | (large) genomes | Sanger, 454, Solexa | Myers, G. et al.; Miller G. et al.
CLC Genomics Workbench | genomes | Sanger, 454, Solexa, SOLiD | CLC bio
CodonCode Aligner | BACs (, genomes?) | Sanger | CodonCode Corporation
Euler | genomes | Sanger, 454 (, Solexa?) | Pevzner, P. et al.
Euler-sr | genomes | 454, Solexa | Chaisson, MJ. et al.
MIRA, miraEST | genomes, ESTs | Sanger, 454, Solexa | Chevreux, B.
NextGENe | (small genomes?) | 454, Solexa, SOLiD | Softgenetics
Newbler | genomes | 454, Sanger | 454/Roche
Phrap | genomes | Sanger, 454 | Green, P.
TIGR Assembler | genomic | Sanger | -
Sequencher | (small) genomes | Sanger | Gene Codes Corporation
SeqMan NGen | (small) genomes | Sanger, 454, Solexa | DNASTAR
SHARCGS | (small) genomes | Solexa | Dohm et al.
SSAKE | (small) genomes | Solexa (SOLiD? Helicos?) | Warren, R. et al.
Staden gap4 package | BACs (, small genomes?) | Sanger | Staden et al.
VCAKE | (small) genomes | Solexa (SOLiD?, Helicos?) | Jeck, W. et al.
Phusion assembler | (large) genomes | Sanger | Mullikin JC, et al.
Quality Value Guided SRA (QSRA) | genomes | Sanger, Solexa | Bryant DW, et al.
Velvet (algorithm) | (small) genomes | Sanger, 454, Solexa, SOLiD | Zerbino, D. et al.

Page 16:

3 NGS projects

Dragonfly

Medicinal maggots

EST comparison

Page 17:

Dragonfly (Dutch: libelle)

Order Odonata

3000 species, 90 in Europe

Undergo metamorphosis

Page 18:

Pilot study for an African dragonfly

Metamorphosis

Some migrate, others don't

Genetically divergent

Contain lots of introns in their genome

Page 19:

Project questions

What are the homologies with other species?

How big is the genome?

Are there already sequences in GenBank, and are they present in the data?

Page 20:

Dragonfly project data

Genomic, single end

1 x 1,147,762 reads

Trimmed from 51 to 34 nucleotides

39,023,908 nucleotides sequenced

cDNA, paired end

2 x 1,291,901 reads

Read length = 51

131,773,902 nucleotides sequenced

Page 21:

Dragonfly methods

Assemble cDNA

BLAST the resulting contigs to determine homologies

Align genomic DNA to contigs

Calculate genome size

Page 22:

Dragonfly assembly results

total contigs: 3898

average length of contigs: 176

average coverage of contigs: 24

contigs larger than 300 nucleotides: 800

average length of contigs larger than 300: 508

average coverage of contigs larger than 300: 15
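
Summary figures of this kind can be derived from a list of per-contig lengths and coverages. The sketch below assumes the assembler output has already been parsed into `(length, coverage)` tuples; the input values are toy numbers, not the project's data.

```python
def contig_stats(contigs, min_length=300):
    """Summarize (length, coverage) tuples for all contigs and for the long ones."""
    def summarize(items):
        if not items:
            return {"count": 0, "mean_length": 0.0, "mean_coverage": 0.0}
        return {
            "count": len(items),
            "mean_length": sum(length for length, _ in items) / len(items),
            "mean_coverage": sum(cov for _, cov in items) / len(items),
        }
    long_contigs = [c for c in contigs if c[0] > min_length]  # strictly larger
    return {"all": summarize(contigs), "long": summarize(long_contigs)}

toy = [(100, 30), (200, 20), (400, 10), (600, 20)]  # made-up contigs
print(contig_stats(toy))
```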

Page 23:

Dragonfly genes and homologies

Homologies found with:

Libellula pulchella

Enallagma aspersum

Erythromma najas

Ischnura verticalis

many Drosophila species

Criteria used in this analysis: an e-value of less than 1*10^-40 and a score of more than 200.

Found in the cDNA contigs:

COII gene with accession number GQ256052.1 (partial)

COI gene with accession number GQ256032.1 (partial)

ND1 gene with accession number GQ255994.1 (partial)
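
The e-value and score cut-offs can be applied to tabular BLAST output with a short filter. This sketch assumes the standard 12-column tabular format (`-outfmt 6` in BLAST+, `-m 8` in legacy blastall), where column 11 is the e-value and column 12 the bit score, and it assumes that "score" on the slide means bit score. The example rows are invented.

```python
def filter_hits(lines, max_evalue=1e-40, min_score=200.0):
    """Keep hits with e-value below max_evalue and bit score above min_score."""
    hits = []
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.rstrip("\n").split("\t")
        evalue, bitscore = float(fields[10]), float(fields[11])
        if evalue < max_evalue and bitscore > min_score:
            hits.append((fields[0], fields[1], evalue, bitscore))
    return hits

rows = [  # hypothetical query/subject names and values
    "contig_1\tsubject_A\t98.5\t400\t6\t0\t1\t400\t10\t409\t3e-52\t610",
    "contig_2\tsubject_B\t85.0\t120\t18\t0\t1\t120\t5\t124\t2e-12\t95.3",
    "contig_3\tsubject_C\t99.0\t150\t1\t0\t1\t150\t1\t150\t1e-45\t180",
]
for hit in filter_hits(rows):
    print(hit)
```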

Page 24:

Dragonfly genome size

30 genomic genes selected after BLAST

Size: 300-1500 nucleotides

Alignment with Bowtie

“calculation”
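
The slide does not spell out the calculation. One common approach, shown here purely as an assumption, is to divide the total number of genomic bases sequenced by the average read depth observed on the selected genes, which are presumed to be single-copy. The function name and the toy numbers are illustrative only.

```python
def estimate_genome_size(total_bases, gene_alignments):
    """Estimate genome size as total sequenced bases / mean depth on reference genes.

    gene_alignments: list of (gene_length, aligned_bases) per gene.
    """
    depths = [aligned / length for length, aligned in gene_alignments if length > 0]
    if not depths or sum(depths) == 0:
        raise ValueError("no aligned bases to estimate depth from")
    mean_depth = sum(depths) / len(depths)
    return total_bases / mean_depth

# Toy numbers: 1,000,000 bases sequenced, two genes each covered 2x on average.
print(estimate_genome_size(1_000_000, [(1000, 2000), (500, 1000)]))
```

With low-coverage data the depth per gene is noisy, so averaging over many genes, as with the 30 selected here, matters for the estimate.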

Page 25:

Medicinal maggots

Used to treat non-healing wounds

Genes revealed:

Signaling proteins: inhibitor of apoptosis protein 2

Digestive enzymes: lipases, proteinases

Antimicrobial peptides (AMPs): Lucilia defensin, diptericin

Page 26:

Medicinal maggots data

5 degenerate peptide sequences

36 peptides

cDNA: 8,199,983 reads

Read length: 32

2,623,994,560
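
The totals on the data slides can be checked as reads × read length × number of ends. The dragonfly figures reproduce exactly. For the maggot data, the read count and the total as transcribed differ by exactly a factor of ten, and the slide does not say which of the two is mistyped. The function name below is illustrative.

```python
def total_bases(n_reads, read_length, ends=1):
    """Total nucleotides sequenced for a run."""
    return n_reads * read_length * ends

# Dragonfly: genomic single end (trimmed to 34) and cDNA paired end (51).
print(total_bases(1_147_762, 34))          # matches the stated 39,023,908
print(total_bases(1_291_901, 51, ends=2))  # matches the stated 131,773,902

# Maggots: stated read count and read length versus the stated total.
maggot_computed = total_bases(8_199_983, 32)
maggot_stated = 2_623_994_560
print(maggot_computed, maggot_stated / maggot_computed)
```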

Page 27:

Medicinal maggots question

Have we sequenced (pieces of) the genes corresponding to the peptides?

Page 28:

Medicinal maggots methods

Build local library of peptides

Assemble contigs with:

CLC bio

NextGENe

Velvet

BLAST contigs against peptides

Find hits

Make coverage plot
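
The coverage-plot step can be sketched as a per-position depth count over one peptide. The example assumes the BLAST hits have already been reduced to `(start, end)` positions on the peptide, 1-based and inclusive; the hit coordinates are invented, and the text bar chart stands in for a real plot.

```python
def coverage_profile(peptide_length, hits):
    """Count how many hits cover each peptide position (1-based, inclusive hits)."""
    depth = [0] * peptide_length
    for start, end in hits:
        for pos in range(max(start, 1), min(end, peptide_length) + 1):
            depth[pos - 1] += 1
    return depth

def text_plot(depth):
    """Render the profile as one bar of '#' characters per position."""
    return "\n".join(f"{i + 1:>3} {'#' * d}" for i, d in enumerate(depth))

profile = coverage_profile(6, [(1, 3), (2, 5)])  # hypothetical hits
print(text_plot(profile))
```

Positions with zero depth show directly which parts of a peptide have no corresponding contig.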

Page 29:

NextGENe assembly, maggots

number of contigs = 59048

average length = 59

average coverage = 11

number of contigs >300 = 719

average length >300 = 661

average coverage >300 = 64

Page 30:

CLC assembly

number of contigs = 78

average length = 2282

average coverage = 514

Page 31:

Velvet assembly

total contigs: 586

average length of contigs: 168

average coverage of contigs: 55

contigs larger than 300 nucleotides: 62

average length of contigs larger than 300: 779

average coverage of contigs larger than 300: 63

Page 32:

Found genes, maggots

C. vicina mRNA for arylphorin subunit A4 (Velvet)

Drosophila willistoni GK21455 (Dwil\GK21455) mRNA (NextGENe)

Lucilia cuprina clone sbsp9 serine proteinase mRNA (NextGENe)

Page 33:

EST comparison

Traditional EST sequencing

Known library

Assemblers:

CLC bio

NextGENe

Velvet

Page 34:

EST comparison method

Assemble cDNA and match with known ESTs

Page 35:

EST results

Page 36:

Conclusions

Big differences between assemblers in:

coverage

length

number of nodes

sequence

x performs best on EST test

Page 37:

Questions?