Assembling Genome Timothee Cezard EBI NGS workshop 16/10/2012.
Assembling Genome
Timothee Cezard
EBI NGS workshop, 16/10/2012
Assembly Algorithms
• Goal: Find the shortest common superstring of a set of reads.
• This is an NP-hard problem, so we need to use an approximation algorithm.
Main algorithms used:
• Overlap Layout Consensus
• De Bruijn graphs
Overlap-layout-consensus
Step 1: Find Overlapping Reads
• Needs an efficient alignment algorithm
• Doesn’t scale well when the number of reads is high
• Use seed-based alignment with extension
TACATAGATTACACAGATTACTGA
|| ||||||||||||||||||||
TAGTTAGATTACACAGATTACTAGA
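The seed-and-extend idea can be sketched as below. This is a minimal illustration, not what any particular assembler does: the function name overlap and the min_len parameter are assumptions, and it only accepts exact matches, whereas a real aligner would tolerate mismatches like the ones in the alignment above.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of `a` that is also a prefix of `b`.

    Returns 0 if no exact overlap of at least `min_len` characters exists.
    """
    start = 0
    while True:
        # Seed: find the first min_len characters of b somewhere in a.
        start = a.find(b[:min_len], start)
        if start == -1:
            return 0
        # Extension: the rest of a, from the seed onwards, must match the start of b.
        if b.startswith(a[start:]):
            return len(a) - start
        start += 1


if __name__ == "__main__":
    print(overlap("ATTCACGTAG", "CGTAGTGGCAT"))  # 5 ("CGTAG")
```

Searching from the leftmost seed first means the longest overlap is returned first.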
Overlap-layout-consensus
Step 2: Construct overlap graph
• A graph is constructed:
(1) Nodes are reads
(2) Edges represent overlapping reads
CGTAGTGGCAT
ATTCACGTAG
Overlap graph
Try to find a Hamiltonian path:
• A path in the graph that contains each node exactly once.
• Computationally expensive
Overlap-layout-consensus
Step 3: Find Contigs
CGTAGTGGCAT
ATTCACGTAG
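The three steps can be put together on the two reads above. This is a toy sketch under stated assumptions: exact overlaps only, a brute-force search over every read ordering (which is exactly why the approach is expensive), and illustrative function names (overlap_graph, layout, consensus) that do not come from any assembler.

```python
from itertools import permutations


def overlap(a, b, min_len=3):
    """Longest exact suffix of a / prefix of b overlap, or 0 if below min_len."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def overlap_graph(reads, min_len=3):
    """Nodes are reads; an edge (a, b) stores the overlap length between a and b."""
    graph = {}
    for a, b in permutations(reads, 2):
        n = overlap(a, b, min_len)
        if n > 0:
            graph[(a, b)] = n
    return graph


def layout(reads, graph):
    """Try every ordering and keep the Hamiltonian path with the largest total overlap."""
    best, best_score = None, -1
    for order in permutations(reads):
        pairs = list(zip(order, order[1:]))
        if all(pair in graph for pair in pairs):
            score = sum(graph[pair] for pair in pairs)
            if score > best_score:
                best, best_score = order, score
    return best  # None if no Hamiltonian path exists


def consensus(path, graph):
    """Merge reads along the path, dropping the overlapping prefix of each next read."""
    contig = path[0]
    for a, b in zip(path, path[1:]):
        contig += b[graph[(a, b)]:]
    return contig


if __name__ == "__main__":
    reads = ["CGTAGTGGCAT", "ATTCACGTAG"]
    graph = overlap_graph(reads)
    print(consensus(layout(reads, graph), graph))
```

With n reads the layout step tries n! orderings, so this only works for a handful of reads.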
Overlap-layout-consensus
• This approach is used in Celera (CABOG), Newbler, Mira, SGA…
• It is mostly used with Sanger or 454 data.
• Can’t resolve repeats longer than the read length
• Could make a comeback if reads get longer.
De Bruijn Graphs example
“It was the best of times, it was the worst of times, it was the age of
wisdom, it was the age of foolishness, it was the epoch of belief, it was
the epoch of incredulity,.... “
Dickens, Charles. A Tale of Two Cities. 1859. London: Chapman Hall
Velvet example courtesy of J. Leipzig 2010
De Bruijn Graphs example
itwasthebestoftimesitwastheworstoftimesitwastheageofwisdomitwastheageoffoolishness…
Generate random ‘reads’. How do we assemble?
Traditional all-vs-all comparisons of datasets this size require immense computational resources.
De Bruijn solution: Construct a graph efficiently
fincreduli geoffoolis Itwasthebe Itwasthebe geofwisdom itwastheep epochofinc timesitwas stheepocho nessitwast wastheageo theepochof stheepocho hofincredu estoftimes eoffoolish lishnessit hofbeliefi pochofincr itwasthewo twastheage toftimesit domitwasth ochofbelie eepochofbe eepochofbe astheworst chofincred theageofwi iefitwasth ssitwasthe astheepoch efitwasthe wisdomitwa ageoffooli twasthewor ochofbelie sdomitwast sitwasthea eepochofbe ffoolishne eofwisdomi hebestofti stheageoff twastheepo eworstofti stoftimesi theepochof esitwasthe heepochofi theepochof sdomitwast astheworst rstoftimes worstoftim stheepocho geoffoolis ffoolishne timesitwas lishnessit stheageoff eworstofti orstoftime fwisdomitw wastheageo heageofwis incredulit ishnessitw twastheepo wasthewors astheepoch heworstoft ofbeliefit wastheageo heepochofi pochofincr heageofwis stheageofw fincreduli astheageof wisdomitwa wastheageo astheepoch olishnessi astheepoch itwastheep twastheage wisdomitwa fbeliefitw bestoftime epochofbel theepochof sthebestof lishnessit hofbeliefi Itwasthebe ishnessitw sitwasthew ageofwisdo twastheage esitwasthe twastheage shnessitwa fincreduli fbeliefitw theepochof mesitwasth domitwasth ochofbelie heageofwis oftimesitw stheepocho bestoftime twastheage foolishnes ftimesitwa thebestoft itwastheag theepochof itwasthewo ofbeliefit bestoftime mitwasthea imesitwast timesitwas orstoftime estoftimes twasthebes stoftimesi sdomitwast wisdomitwa theworstof astheworst sitwasthew theageoffo eepochofbe theageofwi foolishnes incredulit ofbeliefit chofincred beliefitwa beliefitwa wisdomitwa eageoffool eoffoolish itwastheag mesitwasth epochofinc ssitwasthe itwastheep astheageof stheageoff sitwasthee thebestoft oolishness heepochofb ochofbelie wastheepoc bestoftime mesitwasth ebestoftim pochofincr
…etc. to tens of millions of reads
De Bruijn Graphs
Step 1: Create kmers (“kmerize” the data)
Reads and their kmers (k=3):
theageofwi: the hea eag age geo eof ofw fwi
sthebestof: sth the heb ebe bes est sto tof
astheageof: ast sth the hea eag age geo eof
worstoftim: wor ors rst sto tof oft fti tim
imesitwast: ime mes esi sit itw twa was ast
…etc. for all reads in the dataset
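Kmerizing a read is a sliding window of width k. A minimal sketch (the function name kmerize is just illustrative):

```python
from collections import Counter


def kmerize(read, k):
    """Return every substring of length k, in the order it appears in the read."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


if __name__ == "__main__":
    reads = ["theageofwi", "sthebestof", "astheageof", "worstoftim", "imesitwast"]
    # Count how often each kmer is seen across all reads.
    counts = Counter(kmer for read in reads for kmer in kmerize(read, 3))
    print(kmerize("theageofwi", 3))
    print(counts["the"])
```

A read of length L gives L - k + 1 kmers, so each 10-letter read above gives 8 kmers at k=3.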
De Bruijn Graphs
Step 2: Build the graph
Look for k-1 overlaps between consecutive kmers, as given by the reads:
the hea eag age geo eof ofw fwi
sth the heb ebe bes est sto tof
ast sth the hea eag age geo eof
wor ors rst sto tof oft fti tim
ime mes esi sit itw twa was ast
…etc. for all kmers in the dataset
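A rough sketch of the graph construction, following the slides' convention that nodes are kmers and an edge links two kmers that are consecutive in some read (and therefore overlap by k-1 letters). The function name build_graph and the dictionary-of-sets representation are assumptions made for illustration.

```python
from collections import defaultdict


def build_graph(reads, k):
    """Map each kmer to the set of kmers that directly follow it in at least one read."""
    graph = defaultdict(set)
    for read in reads:
        kmers = [read[i:i + k] for i in range(len(read) - k + 1)]
        for left, right in zip(kmers, kmers[1:]):
            graph[left].add(right)  # left[1:] == right[:-1], a k-1 overlap
    return graph


if __name__ == "__main__":
    reads = ["theageofwi", "sthebestof", "astheageof", "worstoftim", "imesitwast"]
    graph = build_graph(reads, 3)
    for kmer in sorted(graph):
        print(kmer, "->", sorted(graph[kmer]))
```

No read-against-read comparison is needed: each read is scanned once, and identical kmers from different reads collapse into the same node.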
De Bruijn Graphs
Step 3: Simplify the graph
No single solution! Break the graph to give the final assembly.
De Bruijn Graphs
Step 4: Create contigs
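One common way to break the graph is to cut it at every branching kmer and report each unbranched stretch as a contig. The sketch below does that under simplifying assumptions: no error removal, no handling of isolated cycles, and branch kmers are repeated at the ends of neighbouring contigs. The names build_graph and contigs are illustrative only.

```python
from collections import defaultdict


def build_graph(reads, k):
    """Return successor and predecessor sets for every kmer seen in the reads."""
    succ, pred = defaultdict(set), defaultdict(set)
    for read in reads:
        kmers = [read[i:i + k] for i in range(len(read) - k + 1)]
        for left, right in zip(kmers, kmers[1:]):
            succ[left].add(right)
            pred[right].add(left)
    return succ, pred


def contigs(reads, k):
    """Spell out every maximal non-branching path of the kmer graph."""
    succ, pred = build_graph(reads, k)

    def is_linear(node):
        # Exactly one way in and one way out: safe to walk through.
        return len(pred.get(node, ())) == 1 and len(succ.get(node, ())) == 1

    result = []
    for start in sorted(set(succ) | set(pred)):
        if is_linear(start):
            continue  # paths only start at ends or branch points
        for nxt in sorted(succ.get(start, ())):
            path = [start, nxt]
            while is_linear(path[-1]):
                path.append(next(iter(succ[path[-1]])))
            # Each further kmer contributes only its last letter.
            result.append(path[0] + "".join(kmer[-1] for kmer in path[1:]))
    return result


if __name__ == "__main__":
    text = "itwasthebestoftimesitwastheworstoftimes"
    print(contigs([text], 3))
    print(contigs([text], 10))
```

Running it on the start of the example text shows the effect of k: at k=3 the repeated words break the sequence into short pieces, while at k=10 no kmer is repeated and the whole text comes back as a single contig.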
De Bruijn example
The final assembly (k=3):
wor, times, itwasthe, foolishness, incredulity, age, epoch, be, st, wisdom, of, belief
A better assembly (k=10):
itwasthebestoftimesitwastheworstoftimesitwastheageofwisdomitwastheageoffoolis…
Repeat with a longer “kmer” length
Why not always use the longest ‘k’ possible?
Sequencing errors:
Read with one error: sthebentof (correct read: sthebestof)
k=3: sth the heb ebe ben ent nto tof (mostly unaffected kmers)
k=10: sthebentof (100% wrong kmers)
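The trade-off can be checked by counting how many kmers of the erroneous read differ from those of the correct read. A small sketch, assuming a substitution error only (both reads have the same length); the function names are illustrative.

```python
def kmerize(read, k):
    """All substrings of length k, in order."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def wrong_fraction(correct_read, observed_read, k):
    """Fraction of kmers in the observed read that differ from the correct ones."""
    pairs = list(zip(kmerize(correct_read, k), kmerize(observed_read, k)))
    wrong = sum(expected != seen for expected, seen in pairs)
    return wrong / len(pairs)


if __name__ == "__main__":
    for k in (3, 5, 10):
        print(k, wrong_fraction("sthebestof", "sthebentof", k))
```

A single wrong base spoils up to k kmers, so at k=3 only 3 of the 8 kmers are wrong, while at k=10 the only kmer of the read is wrong.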
Strengths and problems of the De Bruijn approach
Strengths:
• No need to calculate the overlaps
• Size of the final graph is a function of the genome size
• Repeats are collapsed
Problems:
• Cannot resolve repeats longer than k
• Loses connectivity when creating the contigs
Resolve repeats through scaffolding
Align reads from short-insert or long-insert libraries
Join contigs using evidence from paired-end data
Contigs from assembly
Scaffold
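The first step of scaffolding can be pictured as counting read pairs whose two mates align to different contigs. The sketch below is a strong simplification: the input format (one pair of contig names per read pair), the function name contig_links and the min_links threshold are all hypothetical, and real scaffolders also use mate orientation and insert size to order and space the contigs.

```python
from collections import Counter


def contig_links(pair_hits, min_links=2):
    """Count read pairs that bridge two different contigs.

    pair_hits: list of (contig_of_mate1, contig_of_mate2), one entry per read pair.
    Returns the contig pairs supported by at least min_links read pairs.
    """
    links = Counter(
        tuple(sorted(hit)) for hit in pair_hits if hit[0] != hit[1]
    )
    return {pair: n for pair, n in links.items() if n >= min_links}


if __name__ == "__main__":
    # Hypothetical alignment results for five read pairs.
    hits = [("c1", "c2"), ("c2", "c1"), ("c1", "c1"), ("c2", "c3"), ("c1", "c2")]
    print(contig_links(hits))
```

Requiring several supporting pairs guards against joining contigs on the evidence of a single misaligned read.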
De Bruijn assemblers
• Velvet: http://www.ebi.ac.uk/~zerbino/velvet/
• ABySS: http://www.bcgsc.ca/platform/bioinfo/software/abyss
• SOAPdenovo: http://soap.genomics.org.cn/soapdenovo.html
• ALLPATHS-LG: http://www.broadinstitute.org/software/allpaths-lg/blog/
• IDBA-UD: http://i.cs.hku.hk/~alse/hkubrg/projects/idba_ud/
What makes an assembly good?
• High coverage: 50 to 100X
• Different but precise insert size libraries
• Few or no sequencing errors
• Avoid a large number of variants
• Try different assemblers
• Needs a machine with a lot of memory (from 16 GB to 1 TB)
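Coverage is the total number of sequenced bases divided by the genome size, so the 50 to 100X target translates directly into a number of reads. A quick sketch; the genome size and read length in the example are made-up values.

```python
import math


def coverage(n_reads, read_length, genome_size):
    """Average depth: total sequenced bases divided by genome size."""
    return n_reads * read_length / genome_size


def reads_needed(target_coverage, read_length, genome_size):
    """Smallest number of reads that reaches the target coverage."""
    return math.ceil(target_coverage * genome_size / read_length)


if __name__ == "__main__":
    # Hypothetical example: 5 Mb genome, 100 bp reads, 50X target.
    n = reads_needed(50, 100, 5_000_000)
    print(n, coverage(n, 100, 5_000_000))
```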
What makes your assembly better?
Error correction: correct the reads before assembly
http://bib.oxfordjournals.org/content/early/2012/04/06/bib.bbs015.full
• SOAPdenovo
• Reptile: http://aluru-sun.ece.iastate.edu/doku.php?id=reptile
• SGA: https://github.com/jts/sga
Joining overlapping reads:
• COPE: ftp://ftp.genomics.org.cn/pub/cope/
• FLASH: http://genomics.jhu.edu/software/FLASH/index.shtml
What makes your assembly better?
Gap filling: IMAGE (Tsai et al., Genome Biology 2010)
Assembly validation
N50 is the most commonly used metric: a weighted median such that 50% of your assembly is contained in contigs of length >= N50
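N50 can be computed by sorting contigs from longest to shortest and adding up lengths until half of the assembly is covered. A minimal sketch (the contig lengths in the example are made up):

```python
def n50(lengths):
    """Length L such that contigs of length >= L hold at least half of the assembly."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0  # empty assembly


if __name__ == "__main__":
    # Hypothetical contig lengths, total 290: 80 + 70 = 150 reaches half.
    print(n50([80, 70, 50, 40, 30, 20]))  # 70
```

Because it is weighted by length, a few long contigs raise N50 much more than many short ones.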
CEGMA: Core Eukaryotic Genes Mapping Approach
• Looks in your assembly for genes that should be there
• Usually the best assembly has the best CEGMA score
http://korflab.ucdavis.edu/datasets/cegma/
There is no magic tool