MGM Workshop Assembly Tutorial Matthew Blow DOE Joint Genome Institute, Walnut Creek, CA Sep 26th,...

MGM WorkshopAssembly Tutorial

Matthew BlowDOE Joint Genome Institute,

Walnut Creek, CA

Sep 26th, 2011

• Introduction to short-read genome sequencing• Short read genome assembly theory• Factors effecting short read genome assembly• Short read genome assembly of microbes at JGI

Contents

Sanger 454 Illumina HiSeq

Mb/dayCost / MbRead length

150bp

450bp

650bp

1

$400

$15 $0.11,000

20,000

Traditional genome sequencing technology

Short-read

We have to figure out how to sequence microbial genomes using only illumina data!

Why sequence genomes using short reads?

Short read genome sequencing

How do we convert this data back into a genome?

GenomicDNA

300bpfragments

Random fragmentation

3-10 kbfragments

Paired-end long insert

reads(10’s millions)

Paired-end short insert

reads(10’s millions)

molecular biology

Sequencing(Illumina)

Contents

• Introduction to short-read genome sequencing• Short read genome assembly theory• Factors effecting short read genome assembly• Short read genome assembly of microbes at JGI

Why assemble?

Unassembled reads

Assemblies

Data Insights

Minimal

• Genes / operons(high quality consensus)• Pathways• Reference

sequences

Data reduction

Gb / Tb

Mb

Short read assembly strategy

Contigs

ScaffoldsAssembly algorithms

e.g. Allpaths, Velvet,

Meraculous

~107 reads

~107 reads

‘Finished’


Contigs

Scaffolds

~107

~107

‘Finished’

‘De Bruijn’ assembly

De Bruijn Graph example

“It was the best of times, it was the worst of times, it was the

age of wisdom, it was the age of foolishness, it was the

epoch of belief, it was the epoch of incredulity,.... “

Dickens, Charles. A Tale of Two Cities. 1859. London: Chapman Hall

Velvet example courtesy of J. Leipzig 2010

De Bruijn example

itwasthebestoftimesitwastheworstoftimesitwastheageofwisdomitwastheageoffoolishness…

Generate random ‘reads’ How do we assemble?

Traditional all-vs-all comparisons of datasets this size require immense computational resources.

De Bruijn solution: represent the data as a graph

fincreduli geoffoolis Itwasthebe Itwasthebe geofwisdom itwastheep epochofinc timesitwas stheepocho nessitwast wastheageo theepochof

stheepocho hofincredu estoftimes eoffoolish lishnessit hofbeliefi pochofincr itwasthewo twastheage toftimesit domitwasth ochofbelie

eepochofbe eepochofbe astheworst chofincred theageofwi iefitwasth ssitwasthe astheepoch efitwasthe wisdomitwa ageoffooli twasthewor

ochofbelie sdomitwast sitwasthea eepochofbe ffoolishne eofwisdomi hebestofti stheageoff twastheepo eworstofti stoftimesi theepochof

esitwasthe heepochofi theepochof sdomitwast astheworst rstoftimes worstoftim stheepocho geoffoolis ffoolishne timesitwas lishnessit

stheageoff eworstofti orstoftime fwisdomitw wastheageo heageofwis incredulit ishnessitw twastheepo wasthewors astheepoch heworstoft

ofbeliefit wastheageo heepochofi pochofincr heageofwis stheageofw fincreduli astheageof wisdomitwa wastheageo astheepoch olishnessi

astheepoch itwastheep twastheage wisdomitwa fbeliefitw bestoftime epochofbel theepochof sthebestof lishnessit hofbeliefi Itwasthebe

ishnessitw sitwasthew ageofwisdo twastheage esitwasthe twastheage shnessitwa fincreduli fbeliefitw theepochof mesitwasth domitwasth

ochofbelie heageofwis oftimesitw stheepocho bestoftime twastheage foolishnes ftimesitwa thebestoft itwastheag theepochof itwasthewo

ofbeliefit bestoftime mitwasthea imesitwast timesitwas orstoftime estoftimes twasthebes stoftimesi sdomitwast wisdomitwa theworstof

astheworst sitwasthew theageoffo eepochofbe theageofwi foolishnes incredulit ofbeliefit chofincred beliefitwa beliefitwa wisdomitwa eageoffool

eoffoolish itwastheag mesitwasth epochofinc ssitwasthe itwastheep astheageof stheageoff sitwasthee thebestoft oolishness heepochofb

ochofbelie wastheepoc bestoftime mesitwasth ebestoftim pochofincr…etc. to 10’s of millions of reads

De Bruijn example

Step 1: “Kmerize” the data

Reads: theageofwi

age

geo

eof

ofw

fwi

sthebestof

sth

the

heb

ebe

bes

est

sto

tof

astheageof

ast

sth

the

hea

eag

age

geo

eof

worstoftim

wor

ors

rst

sto

tof

oft

fti

tim

imesitwast

ime

mes

esi

sit

itw

twa

was

ast

…..etc for all reads in the dataset

Kmers :(k=3)

the

hea

eag

De Bruijn example

Step 2: Represent the ‘kmers’ in a graph

age geo eof ofw fwihea eagthesth the

heb ebe bes est sto tof

ast sththe hea eag age geo eof

Look for k-1 overlaps

wor ors rststo tof

oft fti tim

ime mes

esisititwtwa

was

ast

…..etc for all ‘kmers’ in the dataset

De Bruijn example

Step 3: Simplify the graph as much as possible

->A De Bruijn graph

Strengths of De Bruijn approach

• Computationally efficient• Overlaps are implicit in graph

Size of graph (and therefore computational memory requirement) is function of genome size, not number of reads

Drawback of De Bruijn approach

Information is lost where repeats are ‘collapsed’

itwasthebestoftimesitwastheworstoftimesitwastheageofwisdomitwastheageoffoolishness…

A ‘collapsed” repeat

No single solution!Break the graph to give the final assembly

Drawback of De Bruijn approach

Connectivity is lost extracting the assembly

De Bruijn example

The final assembly (k=3)

wor times itwasthe foolishness

incredulity age epoch be

st wisdom

of belief

A better assembly (k=20)

itwasthebestoftimesitwastheworstoftimesitwastheageofwisdomitwastheageoffoolis…

Repeat with a longer “kmer” length

Why not always use longest ‘k’ possible?

Sequencing errors:

sthebentof

sth theheb

ebeben

entnto

tof

sthebentof

k=3

k=10100% wrong kmer

Mostly unaffected

kmers

Does the De Bruijn approach work for assembly of microbe genomes?

Simulate short read genome assembly from six microbes with known genomes

Simulated De Bruijn assembly for six ‘known’microbial genomes

0 10 20 30 40 50 60 70 80 90 100

110

120

130

140

150

0

10

20

30

40

50

60

70

80

90

100

A. haemolyticumB. MurdochiiC. FlavigenaS. SmaragdinaeH. Turkmenica

'Kmer' length (bp)

% G

en

om

e i

n u

niq

ue

‘k

me

rs’

Bacteria name:

Kmer =30, most of the genome CAN be assembled

97%

Kmer = 150A small fraction of the genome remains ‘unassemblable’

1. What fraction of the genome should we be able to assemble?

How do we use short read data to improve this?

Microbe

A. haemolyticum B. Murdochii C. Flavigena C. Woesei H. Turkmenica S. Smaragdinae0

50

100

150

200

250

Pre

dic

ted

nu

mb

er

of

fra

gm

en

ts i

n t

he

as

-s

em

bly

Simulated De Bruijn assembly for six ‘known’microbial genomes

2. How fragmented do we expect microbial assemblies to be?


Contigs

Scaffolds

~107

~107

‘De Bruijn’ assembly

‘Finished’

scaffolding

Scaffolding using paired-end information

Align reads from short insert or long insert library

Join contigs using evidence from paired end data

Contigs from De Bruijn assembly

Scaffold

Predicted improvement of microbial genome assemblies by scaffolding

A. hae

moly

ticum

B. Murd

ochii

C. Fla

vigen

a

C. Woes

ei

H. Turk

men

ica

S. Sm

arag

dinae

0

50

100

150

200

250

Pre

dic

ted

nu

mb

er

of

fra

gm

en

ts i

n

the

as

se

mb

ly

No scaffolding:~100 fragments

Scaffolding with short insert library:

~30 fragments

Scaffolding with long insert library:

1 – 7 fragments

Summary of short read genome assembly theory

• De Bruijn graphs can efficiently assemble massive short read datasets

• Pairing information from short reads substantially improves contiguity of assemblies

• Theoretically, complete or near-complete microbe genomes can be generated using only short-read data

• Introduction to short-read genome sequencing• Short read genome assembly theory• Factors effecting short read genome assembly in

practice• Short read genome assembly of microbes at JGI

Contents

Effect of genome properties on assembly results

Biased sequence composition

RESULT: incomplete / fragmented assembly

ACTGTCTAGTCAGCGCGCGCGCGCGCGCCCGCGCGCGCGGGCGGCGGCGCGGGCGGGCGCATGTAGTGATC

High repeat content

RESULT: misassemblies /

collapsed assemblies

r

rrr

r

Polyploidy

RESULT: fragmented assembly

a a’

Bias – non-uniform sampling of gDNA due to gDNA prep, sample prep or sequencing.RESULT: incomplete / fragmented assembly

Read quality – Sequence error, inhomogeneity RESULT: fragmented assembly

Contamination – fragments containing vector, adapter , linker, stuffer or undesirable gDNA.RESULT: incorrect and inflated assembly

Chimeric reads – distinct genomic locations artificially connected in a read. RESULT: mis-assembly

Effect of data properties on assembly results

How to get a good assembly

• Use the best available sequence data and informatics tools– JGI: Constant improvement of sequencing

chemistry, molecular biology and software• Quality control of sequence data is essential

– JGI: Automated QC pipeline to detect and filter out known problems

• Introduction to short-read genome sequencing• Short read genome assembly theory• Factors effecting short read genome assembly• Progress in short read genome assembly of

microbes at JGI

Contents

Short read assemblies are improving over time

0

2

4

6

8

10

0

20

40

60

80

100

Ge

no

me

siz

e

(Mb

)

Ge

no

me

GC

(%)

Q4 2009(n = 64)

Q1 2010(n = 90)

Q2 2010(n = 31)

Q3 2010(n = 68)

Q4 2010(n = 94)

Q1 2011(n = 43)

50

100

150

10

20

30

40

Lo

ng

es

t c

on

tig

(K

b)

co

nti

g N

50

(K

b)

20

40

60

80

405060708090100

Re

ad

nu

m-

be

r (m

illio

ns

)

Hig

h q

ua

lity

re

ad

s (

%)

Sequenced genome properties remain constant

But illumina sequence quantity and quality is increasing…

…resulting in better microbial genome assemblies

Average results from sub-optimal “QC” assemblies

How close are we to obtaining “finished” genomes using only short-read genome sequencing?

Analyze real short read genome assembly from six microbes with known genomes

real

B. Murdochii

C. Flavigena

C. Woesei

H. Turkmenica

S. Smaragdinae

0 20 40 60 80 100 120 140

B. Murdochii

C. Flavigena

C. Woesei

H. Turkmenica

S. Smaragdinae

0 1 2 3 4 5 6 7 8

Short insert library only

simulated

Short read assemblies are approaching predicted ‘best possible’ results

Short + long insert libraries

Number of fragments in the assembly

Comparison of short-read assemblies with simulated ‘best possible’ results

Average number of fragments

Average % known genes

identified

85 97.3%

3 97.4%

Short insert library only

Short + long insert libraries

1 100%

?

Short read assemblies include the vast majority of known genes

Comparison of short-read assemblies with reference genome annotation

Pacific Biosciences Sequencer

Long reads from “3rd generation” Pacific Biosciences sequencer hold promise for improving short-read based assemblies

Maximum read length >4kb

1 32 4 5 6 7Read length (Kb)

2,000

4,000

6,000

8,000

10,000

12,000

14,000

Nu

mb

er o

f re

ads

Mean read length = 1080bp

Conclusion

• High quality genome sequencing using only short-reads is within reach

• Existing short-read microbial genomes assemblies are minimally fragmented and contain the vast majority known genes

• Third-generation sequencing may provide an inexpensive path to finished genomes

Metagenome assembly is an ongoing challenge

-All challenges of isolate genome assembly remain-Extra challenges from diversity and different abundance of constituent genomes- The same strategies as isolate assembly can be used, but many heuristics fail for metagenomes

De Bruijn art

Useful Reviews

• Miller JR, Koren S, Sutton G. , Assembly algorithms for next-generation sequencing data. Genomics. 2010 Jun;95(6):315-27.

• Mihai Pop, Genome assembly reborn: recent computational challenges. Brief Bioinform (2009) 10 (4):354-366.

http://www.ncbi.nlm.nih.gov/pubmed?term=%22Miller%20JR%22%5BAuthor%5D

http://www.ncbi.nlm.nih.gov/pubmed?term=%22Koren%20S%22%5BAuthor%5D

http://www.ncbi.nlm.nih.gov/pubmed?term=%22Sutton%20G%22%5BAuthor%5D

Illumina data qualitySyntrophorhabdus aromaticivorans

PASS

Read Quality

Genome Properties

Library Quality

Run Quality

Illumina data quality Opitutaceae bacterium TAV2

Genome Properties

Library Quality

Run Quality

Read Quality

FAIL

Metagenomes are harder to assemble

velvet gap size distribution of aligned contig shreds

0

20

40

60

80

100

120

< 100 100-999 1000-1999 2000-2999 3000-3999 > 4000

Gap size (bp)

nu

mb

er

of

ga

ps

4085750 std (trimmed toQ20)4085750 jumping(trimmed to Q20)4085750 std trimmed +jumping trimmed4086221 std

4086221 jumping

4086221 std + jumping

Velvet gaps Fibrobacter succinogenes & Ignisphaera aggregans

Features of Assemblers

Algorithm Feature OLC Assemblers DBG Assemblers

Read features Base substitutions Euler, AllPaths, SOAPHomopolymer miscount CABOGConcentrated error in 3′ end EulerFlow space Newbler

Removal of erroneous reads Based on K-mer frequencies Euler, Velvet, AllPathsBased on K-mer freq and QV AllPathsFor multiple values of K AllPathsBy alignment to other reads CABOG

Base error correction Based on K-mer frequencies Euler, SOAPBased on Kmer freq and QV AllPathsBased on alignments CABOG

Graph construction Reads as graph nodes CABOG, Newbler, EdenaK-mers as graph nodes Euler, Velvet, ABySS, SOAPSimple paths as graph nodes AllPaths

Graph reduction Collapse simple paths CABOG, Newbler Euler, Velvet, SOAPErosion of spurs CABOG, Edena Euler, Velvet, AllPaths, SOAPBubble smoothing Edena Euler, Velvet, SOAPBubble detection AllPathsReads separate tangled paths Euler, SOAPBreak at low coverage Velvet, SOAPBreak at high coverage CABOG EulerHigh coverage indicates repeat CABOG Velvet

Graph partitions Partition by K-mers ABySSPartition by scaffolds AllPaths

Mate pairs Constrain path searches Euler, Velvet, AllPathsGuide path selection Euler, AllpathsMerge contigs or fill gaps CABOG, Shorty Velvet, ABySS, SOAPTransitive link reduction CABOG SOAPDetect, avoid repeat contigs CABOG Velvet, SOAPCreate scaffolds CABOG, Shorty Euler, Velvet, AllPaths, SOAP

J.R.Miller et al. Genomics 95 (2010)

Pop M Brief Bioinform 2009;10:354-366

(A) Overlap between two read (agreement within overlapping region need not be perfect); (B) Correct assembly of a genome with two repeats (boxes) using four reads A–D; (C) Incorrect assembly produced by the greedy approach.(D) Disagreement between two reads (thin lines) that could extend a contig (thick line), indicating a potential repeat boundary. Contig extension must be terminated in order to avoid misassembly.

Greedy overlap

CORRECT

INCORRECT

Overlap graph of a genome containing a two-copy repeat (B).

Overlap-Layout-Consensus (OLC)

Comparing Overlap and de Bruijn Graphs

Schatz et al., Genome Res. (2010)

Iterative kmer evaluation: IDBA

Y Peng, et al. IDBA - A Practical Iterative de Bruijn Graph De Novo Assembler (2010)

http://i.cs.hku.hk/~alse/idba/files/idba.pdf

Jeremy Leipzig (jerdemo.blogspot.com/2009/11/using-vmatch-to-combine-assemblies.html

N50

The N50 size of a set of entities (e.g., contigs or scaffolds) represents the largest entity E such that at least half of the total size of the entities is contained in entities larger than E.

For example, given a collection of contigs with sizes 7, 4, 3, 2, 2, 1, and 1 kb (total size = 20kbp), the N50 length is 4 because we can cover 10 kb with contigs bigger than 4kb. (http://www.cbcb.umd.edu/research/castats.shtml)

N50 length is the length ‘x’ such that 50% of the sequence is contained in contigs of length x or greater.

(Waterston http://www.pnas.org/cgi/reprint/100/6/3022.pdf)

Theoretical performance

Cahill et al., PLoS ONE (2010)

Assessing performance of a range of read lengths

Repeat-induced gaps

Supplement: Assembler flowcharts

Phrap

Bamidele-Abegunde T. 2010 http://library2.usask.ca

CAP3 & PCAP


MIRA


Velvet


Supplement: Genome Improvement

Typical Microbial project

FINISHING

Annotation

Public release

Sequencing

Draftassembly

Goals:

Completely restore genome

Produce high quality consensus

Metagenomic assembly and Finishing

• Typically size of metagenomic sequencing project is very large

• Different organisms have different coverage. Non-uniform sequence coverage results in significant under- and over-representation of certain community members

• Low coverage for the majority of organisms in highly complex communities leads to poor (if any) assemblies

• Chimerical contigs produced by co-assembly of sequencing reads originating from different species.

• Genome rearrangements and the presence of mobile genetic elements (phages, transposons) in closely related organisms further complicate assembly.

• No assemblers developed for metagenomic data sets

The whole-genome shotgun sequencing approach was used for a number of

microbial community projects, however useful quality control and assembly

of these data require reassessing methods developed to handle relatively

uniform sequences derived from isolate microbes.

QC: Annotation of poor quality sequence

To avoid this:

make sure you use high quality sequence

choose proper assembler

MGM Workshop Assembly Tutorial Matthew Blow DOE Joint Genome Institute, Walnut Creek, CA Sep 26th,...

Documents

Transcript of MGM Workshop Assembly Tutorial Matthew Blow DOE Joint Genome Institute, Walnut Creek, CA Sep 26th,...