Assembling Genome Timothee Cezard EBI NGS workshop 16/10/2012.


Assembly Algorithms

• Goal: Find the shortest common superstring of a set of reads.

• This is an NP-hard problem, so we need to use an approximation algorithm.
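To make the approximation idea concrete, here is a minimal sketch of one common heuristic: greedily merging the pair of reads with the largest overlap. The function names `overlap` and `greedy_superstring` are hypothetical, and this is an illustration only, not the method of any particular assembler.

```python
from itertools import permutations


def overlap(a: str, b: str) -> int:
    """Length of the longest suffix of a that is also a prefix of b."""
    for n in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_superstring(reads: list[str]) -> str:
    """Greedy approximation: repeatedly merge the pair with the largest overlap."""
    reads = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    # Drop reads that are fully contained in another read
    reads = [r for r in reads if not any(r != s and r in s for s in reads)]
    while len(reads) > 1:
        n, a, b = max(
            ((overlap(a, b), a, b) for a, b in permutations(reads, 2)),
            key=lambda t: t[0],
        )
        reads.remove(a)
        reads.remove(b)
        reads.append(a + b[n:])  # merge b onto a, sharing the overlap
    return reads[0] if reads else ""


print(greedy_superstring(["ATTCACGTAG", "CGTAGTGGCAT"]))  # ATTCACGTAGTGGCAT
```

The result always contains every read, but it is not guaranteed to be the shortest possible superstring, and the all-pairs search makes it practical only for small inputs.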

Main algorithms used:
• Overlap Layout Consensus
• De Bruijn graphs

Overlap-layout-consensus

Step 1: Find Overlapping Reads

• Need an efficient alignment algorithm
• Doesn’t scale well when the number of reads is high
• Use seed-based alignment with extension

TACATAGATTACACAGATTACTGA

|| ||||||||||||||||||||

TAGTTAGATTACACAGATTACTAGA
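A minimal sketch of seed-based overlap detection with extension, under simplifying assumptions: the seed is the first `seed_len` characters of each read, extension is a plain mismatch count with no gaps, and the names and default parameters (`seed_len`, `min_overlap`, `max_mismatches`) are hypothetical. The point is that the seed index avoids comparing every read against every other read in full.

```python
from collections import defaultdict


def build_seed_index(reads, seed_len):
    """Map every seed_len-mer to the (read index, offset) pairs where it occurs."""
    index = defaultdict(list)
    for i, read in enumerate(reads):
        for off in range(len(read) - seed_len + 1):
            index[read[off:off + seed_len]].append((i, off))
    return index


def find_overlaps(reads, seed_len=4, min_overlap=5, max_mismatches=1):
    """Return (i, j, length) where a suffix of reads[i] matches a prefix of reads[j]."""
    index = build_seed_index(reads, seed_len)
    overlaps = []
    for j, read_j in enumerate(reads):
        seed = read_j[:seed_len]  # the prefix of read j acts as the seed
        for i, off in index.get(seed, []):
            if i == j:
                continue
            suffix = reads[i][off:]
            if len(suffix) < min_overlap or len(suffix) > len(read_j):
                continue
            # Extension: compare the whole suffix with the prefix of read j
            mismatches = sum(a != b for a, b in zip(suffix, read_j))
            if mismatches <= max_mismatches:
                overlaps.append((i, j, len(suffix)))
    return overlaps


print(find_overlaps(["ATTCACGTAG", "CGTAGTGGCAT"]))  # [(0, 1, 5)]
```

Because the seed itself must match exactly, an error inside the first `seed_len` bases of a read hides its overlaps; real tools use several seeds per read to reduce that risk.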

Overlap-layout-consensus

Step 2: Construct overlap graph

• A graph is constructed:
(1) Nodes are reads
(2) Edges represent overlapping reads

CGTAGTGGCAT

ATTCACGTAG

Overlap graph

Try to find the Hamiltonian path:
• A path in the graph that contains each node exactly once.
• Computationally expensive
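The graph construction and the path search can be sketched as follows. This is a brute-force illustration that tries every ordering of the reads, which shows why the approach is expensive; the function names and the `min_len` threshold are hypothetical, and the third read in the example is invented for the demonstration.

```python
from itertools import permutations


def overlap(a, b, min_len=3):
    """Longest suffix of a that is a prefix of b, or 0 if shorter than min_len."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def build_overlap_graph(reads, min_len=3):
    """Nodes are reads; a directed edge a -> b is weighted by the overlap length."""
    graph = {read: {} for read in reads}
    for a, b in permutations(reads, 2):
        n = overlap(a, b, min_len)
        if n:
            graph[a][b] = n
    return graph


def hamiltonian_path(graph):
    """Brute force over all node orderings: O(n!), so only usable on tiny inputs."""
    for order in permutations(graph):
        if all(b in graph[a] for a, b in zip(order, order[1:])):
            return list(order)
    return None


reads = ["GGCATCC", "ATTCACGTAG", "CGTAGTGGCAT"]  # "GGCATCC" is a made-up read
graph = build_overlap_graph(reads)
print(hamiltonian_path(graph))  # ['ATTCACGTAG', 'CGTAGTGGCAT', 'GGCATCC']
```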

Overlap-layout-consensus

Step 3: Find Contigs

CGTAGTGGCAT

ATTCACGTAG
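Once the reads are ordered, the contig can be derived by laying the reads out at their offsets and taking a per-column majority vote. The sketch below assumes gap-free overlaps and uses hypothetical helper names (`layout_from_path`, `consensus`); the third read in the usage example, with one deliberate error, is invented for illustration.

```python
from collections import Counter


def overlap_len(a, b):
    """Length of the longest suffix of a that is also a prefix of b."""
    for n in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def layout_from_path(path):
    """Layout: place each read of an ordered path (a list) at its contig offset."""
    layout, pos = [], 0
    for prev, read in zip([None] + path, path):
        if prev is not None:
            pos += len(prev) - overlap_len(prev, read)
        layout.append((pos, read))
    return layout


def consensus(layout):
    """Consensus: majority vote per column over all reads covering it."""
    length = max(off + len(read) for off, read in layout)
    columns = [Counter() for _ in range(length)]
    for off, read in layout:
        for i, base in enumerate(read):
            columns[off + i][base] += 1
    return "".join(c.most_common(1)[0][0] if c else "N" for c in columns)


layout = layout_from_path(["ATTCACGTAG", "CGTAGTGGCAT"])
print(consensus(layout))  # ATTCACGTAGTGGCAT
# A third read with one wrong base is outvoted by the two correct reads
print(consensus(layout + [(3, "CACGTTGTGG")]))  # ATTCACGTAGTGGCAT
```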

Overlap-layout-consensus

• This approach is used in Celera (CABOG), Newbler, Mira, SGA…

• It is mostly used with Sanger or 454 data.

• Can’t assemble repeats longer than the read length

• Could come back if reads get longer.

De Bruijn Graphs example

“It was the best of times, it was the worst of times, it was the age of

wisdom, it was the age of foolishness, it was the epoch of belief, it was

the epoch of incredulity,.... “

Dickens, Charles. A Tale of Two Cities. 1859. London: Chapman Hall

Velvet example courtesy of J. Leipzig 2010

De Bruijn Graphs example

itwasthebestoftimesitwastheworstoftimesitwastheageofwisdomitwastheageoffoolishness…

Generate random ‘reads’

How do we assemble?

Traditional all-vs-all comparisons of datasets this size require immense computational resources.

De Bruijn solution: Construct a graph efficiently

fincreduli geoffoolis Itwasthebe Itwasthebe geofwisdom itwastheep epochofinc timesitwas stheepocho nessitwast wastheageo theepochof stheepocho hofincredu estoftimes eoffoolish lishnessit hofbeliefi pochofincr itwasthewo twastheage toftimesit domitwasth ochofbelie eepochofbe eepochofbe astheworst chofincred theageofwi iefitwasth ssitwasthe astheepoch efitwasthe wisdomitwa ageoffooli twasthewor ochofbelie sdomitwast sitwasthea eepochofbe ffoolishne eofwisdomi hebestofti stheageoff twastheepo eworstofti stoftimesi theepochof esitwasthe heepochofi theepochof sdomitwast astheworst rstoftimes worstoftim stheepocho geoffoolis ffoolishne timesitwas lishnessit stheageoff eworstofti orstoftime fwisdomitw wastheageo heageofwis incredulit ishnessitw twastheepo wasthewors astheepoch heworstoft ofbeliefit wastheageo heepochofi pochofincr heageofwis stheageofw fincreduli astheageof wisdomitwa wastheageo astheepoch olishnessi astheepoch itwastheep twastheage wisdomitwa fbeliefitw bestoftime epochofbel theepochof sthebestof lishnessit hofbeliefi Itwasthebe ishnessitw sitwasthew ageofwisdo twastheage esitwasthe twastheage shnessitwa fincreduli fbeliefitw theepochof mesitwasth domitwasth ochofbelie heageofwis oftimesitw stheepocho bestoftime twastheage foolishnes ftimesitwa thebestoft itwastheag theepochof itwasthewo ofbeliefit bestoftime mitwasthea imesitwast timesitwas orstoftime estoftimes twasthebes stoftimesi sdomitwast wisdomitwa theworstof astheworst sitwasthew theageoffo eepochofbe theageofwi foolishnes incredulit ofbeliefit chofincred beliefitwa beliefitwa wisdomitwa eageoffool eoffoolish itwastheag mesitwasth epochofinc ssitwasthe itwastheep astheageof stheageoff sitwasthee thebestoft oolishness heepochofb ochofbelie wastheepoc bestoftime mesitwasth ebestoftim pochofincr

…etc. to tens of millions of reads

De Bruijn Graphs

Step 1: create kmers

Step 1: “Kmerize” the data

Reads and their kmers (k=3):

theageofwi: the hea eag age geo eof ofw fwi

sthebestof: sth the heb ebe bes est sto tof

astheageof: ast sth the hea eag age geo eof

worstoftim: wor ors rst sto tof oft fti tim

imesitwast: ime mes esi sit itw twa was ast

…..etc for all reads in the dataset
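A minimal sketch of the “kmerize” step: slide a window of length k along each read and count how often each kmer occurs. The function names `kmerize` and `count_kmers` are hypothetical.

```python
from collections import Counter


def kmerize(read, k):
    """Return every substring of length k, in read order."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def count_kmers(reads, k):
    """Count kmer occurrences across all reads."""
    counts = Counter()
    for read in reads:
        counts.update(kmerize(read, k))
    return counts


print(kmerize("theageofwi", 3))
# ['the', 'hea', 'eag', 'age', 'geo', 'eof', 'ofw', 'fwi']

counts = count_kmers(["theageofwi", "sthebestof", "astheageof"], 3)
print(counts["the"], counts["sth"])  # 3 2
```

A read of length L yields L - k + 1 kmers, so ten-letter reads give eight kmers at k=3 and a single kmer at k=10.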

De Bruijn Graphs

Step 2: Build the graph

Look for k-1 overlaps: given by the reads

the -> hea -> eag -> age -> geo -> eof -> ofw -> fwi

sth -> the -> heb -> ebe -> bes -> est -> sto -> tof

ast -> sth -> the -> hea -> eag -> age -> geo -> eof

wor -> ors -> rst -> sto -> tof -> oft -> fti -> tim

ime -> mes -> esi -> sit -> itw -> twa -> was -> ast

…..etc for all ‘kmers’ in the dataset
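The graph-building step can be sketched as below, following the convention used on these slides: nodes are kmers, and an edge joins two kmers that are consecutive in some read (so they overlap by k-1 characters). The function name `build_de_bruijn` is hypothetical.

```python
def build_de_bruijn(reads, k):
    """Return {kmer: set of kmers that follow it in at least one read}."""
    graph = {}
    for read in reads:
        kmers = [read[i:i + k] for i in range(len(read) - k + 1)]
        for kmer in kmers:
            graph.setdefault(kmer, set())  # every kmer becomes a node
        for left, right in zip(kmers, kmers[1:]):
            graph[left].add(right)  # consecutive kmers share k-1 characters
    return graph


graph = build_de_bruijn(["theageofwi", "sthebestof", "astheageof"], 3)
print(sorted(graph["the"]))  # ['hea', 'heb']
print(sorted(graph["sth"]))  # ['the']
```

Identical kmers from different reads collapse into one node, which is why no read-against-read comparison is needed; "the" becomes a branch point because two different kmers follow it.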

De Bruijn Graphs

Step 3: simplify the graph

No single solution!

Break the graph to give the final assembly

De Bruijn Graphs

Step 4: Create contigs
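One way to break the graph into contigs is to walk every maximal non-branching path and stop at each branch point. The sketch below assumes the kmer-node graph convention from the previous steps; the function names are hypothetical, no error removal is attempted, and isolated cycles (loops with no branch point) are ignored.

```python
def build_graph(reads, k):
    """Nodes are kmers; edges join kmers that are consecutive in a read."""
    graph = {}
    for read in reads:
        kmers = [read[i:i + k] for i in range(len(read) - k + 1)]
        for kmer in kmers:
            graph.setdefault(kmer, set())
        for left, right in zip(kmers, kmers[1:]):
            graph[left].add(right)
    return graph


def contigs_from_graph(graph):
    """Spell one contig per maximal non-branching path (isolated cycles skipped)."""
    indegree = {node: 0 for node in graph}
    for successors in graph.values():
        for succ in successors:
            indegree[succ] += 1

    def is_simple(node):
        # Exactly one way in and one way out: safe to walk straight through
        return indegree[node] == 1 and len(graph[node]) == 1

    contigs = []
    for node in graph:
        if is_simple(node):
            continue  # interior nodes are reached from a path start
        if indegree[node] == 0 and not graph[node]:
            contigs.append(node)  # isolated kmer
        for succ in sorted(graph[node]):
            path = [node, succ]
            while is_simple(path[-1]):
                path.append(next(iter(graph[path[-1]])))
            # Each step along the path adds one new character
            contigs.append(path[0] + "".join(n[-1] for n in path[1:]))
    return contigs


print(contigs_from_graph(build_graph(["tabcd", "tabce"], 3)))
# ['tabc', 'abcd', 'abce']
```

In the usage example the two reads disagree after "abc", so the graph is broken there and three contigs are reported instead of one.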

De Bruijn example

The final assembly (k=3)

wor times itwasthe foolishness

incredulity age epoch be

st wisdom

of belief

A better assembly (k=10)

itwasthebestoftimesitwastheworstoftimesitwastheageofwisdomitwastheageoffoolis…

Repeat with a longer “kmer” length

Why not always use the longest ‘k’ possible?

Sequencing errors:

Read with one error: sthebentof (correct read: sthebestof)

k=3: sth the heb ebe ben ent nto tof

Mostly unaffected kmers

k=10: sthebentof

100% wrong kmer
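The effect of k on error sensitivity can be checked with a small calculation. This sketch assumes a substitution error only, so the true and observed reads have the same length; the function name `fraction_wrong` is hypothetical.

```python
def kmers(seq, k):
    """Return every substring of length k, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def fraction_wrong(true_read, observed_read, k):
    """Fraction of kmers in the observed read that differ from the true kmers."""
    pairs = list(zip(kmers(true_read, k), kmers(observed_read, k)))
    return sum(t != o for t, o in pairs) / len(pairs)


print(fraction_wrong("sthebestof", "sthebentof", 3))   # 0.375
print(fraction_wrong("sthebestof", "sthebentof", 10))  # 1.0
```

A single wrong base corrupts up to k kmers, so with k=3 only three of the eight kmers are wrong, while with k=10 the read's only kmer is wrong.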

Strengths and problems of De Bruijn approach

Strengths:
• No need to calculate the overlaps
• Size of the final graph is a function of the genome size
• Repeats are collapsed

Problems:
• Can only resolve repeats up to k long
• Lose connectivity when creating the contigs

Resolve repeats through scaffolding

Align reads from short-insert or long-insert libraries

Join contigs using evidence from paired-end data

Contigs from assembly

Scaffold
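A heavily simplified sketch of joining contigs from paired-end evidence. Everything here is an assumption made for illustration: `pair_hits` is a hypothetical list saying which contig each mate of a pair aligned to, the first mate's contig is assumed to come before the second's, orientation and insert size are ignored, and gaps are written as a fixed run of N. Real scaffolders estimate gap sizes from the library insert size.

```python
from collections import Counter


def scaffold(contigs, pair_hits, min_links=2, gap="N" * 10):
    """Chain contigs that are linked by at least min_links read pairs.

    contigs:   {name: sequence}
    pair_hits: iterable of (contig of read 1, contig of read 2)
    """
    # Count pairs whose two mates land on different contigs
    links = Counter((a, b) for a, b in pair_hits if a != b)
    nxt = {}
    for (a, b), n in links.most_common():  # strongest evidence first
        if n >= min_links and a not in nxt and b not in nxt.values():
            nxt[a] = b  # each contig gets at most one successor and predecessor
    # Scaffolds start at contigs nobody points to (circular chains are skipped)
    starts = [name for name in contigs if name not in nxt.values()]
    scaffolds = []
    for start in starts:
        chain, node = [contigs[start]], start
        while node in nxt:
            node = nxt[node]
            chain.append(contigs[node])
        scaffolds.append(gap.join(chain))
    return scaffolds


contigs = {"c1": "AAAA", "c2": "CCCC", "c3": "GGGG"}
hits = [("c1", "c2"), ("c1", "c2"), ("c2", "c3"), ("c2", "c3"), ("c1", "c3")]
print(scaffold(contigs, hits, gap="NN"))  # ['AAAANNCCCCNNGGGG']
```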

De Bruijn assemblers

• Velvet: http://www.ebi.ac.uk/~zerbino/velvet/

• ABySS: http://www.bcgsc.ca/platform/bioinfo/software/abyss

• SOAPdenovo: http://soap.genomics.org.cn/soapdenovo.html

• ALLPATHS-LG: http://www.broadinstitute.org/software/allpaths-lg/blog/

• IDBA-UD: http://i.cs.hku.hk/~alse/hkubrg/projects/idba_ud/


What makes an assembly good?

• High coverage: 50 to 100X
• Different but precise insert size libraries
• Few to no sequencing errors
• Avoid a large number of variants

• Try different assemblers
• Need a big fat memory machine (from 16 GB to 1 TB)
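As a rough guide to what 50 to 100X means in practice, average coverage is the total number of sequenced bases divided by the genome size. The genome size and read length below are made-up example values, and the function names are hypothetical.

```python
import math


def coverage(n_reads, read_len, genome_size):
    """Average coverage = total sequenced bases / genome size."""
    return n_reads * read_len / genome_size


def reads_needed(target_coverage, read_len, genome_size):
    """Number of reads required to reach the target average coverage."""
    return math.ceil(target_coverage * genome_size / read_len)


# Hypothetical example: a 5 Mb genome sequenced with 100 bp reads
print(coverage(2_500_000, 100, 5_000_000))  # 50.0
print(reads_needed(50, 100, 5_000_000))     # 2500000
```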

What makes your assembly better?

Error Correction: Correct the reads before assembly
http://bib.oxfordjournals.org/content/early/2012/04/06/bib.bbs015.full

• SOAPdenovo
• Reptile: http://aluru-sun.ece.iastate.edu/doku.php?id=reptile
• SGA: https://github.com/jts/sga

Joining overlapping reads:
• COPE: ftp://ftp.genomics.org.cn/pub/cope/

• FLASH: http://genomics.jhu.edu/software/FLASH/index.shtml


What makes your assembly better?

Gap filling: IMAGE (Tsai et al., Genome Biology 2010)

Assembly validation

N50 is the most commonly used metric:
Weighted median such that 50% of your assembly is contained in contigs of length >= N50
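The N50 definition above can be sketched as a short function: sort the contig lengths from longest to shortest and report the length at which the running total first reaches half of the assembly size. The contig lengths in the example are made up.

```python
def n50(lengths):
    """Largest length L such that contigs of length >= L hold at least 50% of the bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0  # empty assembly


print(n50([80, 70, 50, 40, 30, 20]))  # 70
```

Here the assembly totals 290 bases; the two longest contigs together hold 150, which is more than half, so N50 is 70.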

CEGMA: Core Eukaryotic Genes Mapping Approach
• Looks in your assembly for genes that should be there
• Usually the best assembly has the best CEGMA score
http://korflab.ucdavis.edu/datasets/cegma/

There is no magic tool

