2012 stamps-mbl-1

Metagenome assembly – part I C. Titus Brown [email protected]

Transcript of 2012 stamps-mbl-1

Page 1: 2012 stamps-mbl-1

Metagenome assembly – part I

C. Titus Brown [email protected]

Page 2: 2012 stamps-mbl-1

About me

• Asst Prof at MSU, in CSE and Micro

• Software: http://github.com/ged-lab/

• Blog: http://ivory.idyll.org/blog/

• Pubs & grants: http://ged.msu.edu/interests.html

Page 3: 2012 stamps-mbl-1

Tomorrow (talk #2)

My research!

Soil! Great Prairie Grand Challenge!

MASSIVE AMOUNTS OF DATA!!!!

My research solves all your problems!! *

* Results may vary. Terms and conditions apply.

Page 4: 2012 stamps-mbl-1

Some basic assembly references

• “Assembly algorithms for next gen sequence data,” Miller et al., pmid 20211242

• Metagenome assembly tools:
– MetaVelvet, pmid 22821567
– MetaIDBA, pmid 21685107
– SOAPdenovo, pmid 20511140

• My precious! khmer, pmid 22847406.

Page 5: 2012 stamps-mbl-1

Illumina + metagenomic assembly

• MetaHIT (2010): pmid 20203603

• Rumen (2011): pmid 21273488

• Permafrost (2011): pmid 22056985

• Hydrothermal plumes (2012): pmid 22695863

• HMP (2012): pmid 22699610

Please let me know if I’ve missed any!

Page 6: 2012 stamps-mbl-1

Culture independent methods

• Observation that 99% of microbes cannot easily be cultured in the lab. (“The great plate count anomaly”)

• While this is less true for host-associated microbes, culture independent methods are still important:
– Syntrophic relationships
– Niche-specificity or unknown physiology
– Dormant microbes
– Abundance within communities

Single-cell sequencing & shotgun metagenomics are two common ways to investigate microbial communities.

Page 7: 2012 stamps-mbl-1

Shotgun metagenomics

• Collect samples;

• Extract DNA;

• Feed into sequencer;

• Computationally analyze.

Wikipedia: Environmental shotgun sequencing.png

Page 8: 2012 stamps-mbl-1

Shotgun sequencing & assembly

Randomly fragment & sequence from DNA; reassemble computationally.

UMD assembly primer (cbcb.umd.edu)

Page 9: 2012 stamps-mbl-1

Shotgun sequencing & assembly

• Why assembly?
– Assumption free (no reference needed)
– Necessary for soil and marine; useful for host-associated?
– Assembly can serve as reference for metatranscriptome interpretation

• Fragment, sequence, computationally assemble.

• What kind of results do you get?
– Almost certainly chimerism between different strains; but still useful for gene content & operon structure.
– Specificity seems high, but sensitivity is dependent on sequencing depth.

• Because of sampling rate, Illumina is primary choice for complex metagenomes.

Page 10: 2012 stamps-mbl-1

Shotgun metagenomics: good news

• Cheap and easy to generate vast whole metagenome/metatranscriptome shotgun data sets from essentially any community you can sample.

• Such data can be quite interesting!
– Low hanging fruit – correlation with diet, etc.
– Still early days for observation of “pan genome” and functional content.

• Potential to illuminate or inform:
– Dynamics and selective pressures of antibiotic resistance, virulence genes, and pathogenicity islands
– Phage and viral communities
– Community interactions.

Page 11: 2012 stamps-mbl-1

Shotgun metagenomics: bad news

• Massive data needed for complex populations (tomorrow!)

• Computational techniques are still relatively immature
– Mapping to known genomes?
– Discovery of unknown genomes & strain variants?
– Sensitivity and specificity are hard to evaluate.
– Computational ecosystem is not that rich…

• Interpreting the data is still the bottleneck, of course.
– Vast majority of genes not usefully annotated.
– Reliance on specific reference databases, annotations.
– Tools for (e.g.) inferring community interactions from community dynamics & functional capacity are desperately needed.

Page 12: 2012 stamps-mbl-1

Assembly vs mapping

• No reference needed, for assembly!
– De novo genomes, transcriptomes…

• But:
– Scales poorly; need a much bigger computer.
– Biology gets in the way (repeats!)
– Need higher coverage

• But but:
– Often your reference isn’t that great, so assembly may actually be the best/only way to go.

Page 13: 2012 stamps-mbl-1
Page 14: 2012 stamps-mbl-1

Assembly

It was the best of times, it was the wor
, it was the worst of times, it was the
isdom, it was the age of foolishness
mes, it was the age of wisdom, it was th

It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness

…but for lots and lots of fragments!
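As a minimal sketch of what "reassembling" these fragments means, the Python below repeatedly merges the pair of fragments with the longest suffix/prefix overlap. The function names (`overlap`, `greedy_assemble`) and the `min_len` cutoff are illustrative assumptions, and the fragment boundaries are one reading of the slide; real assemblers do not work this naively.

```python
def overlap(a, b, min_len=5):
    """Length of the longest suffix of a that equals a prefix of b (0 if < min_len)."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(fragments, min_len=5):
    """Repeatedly merge the pair with the longest overlap until nothing overlaps."""
    frags = list(fragments)
    while len(frags) > 1:
        best_n, best_i, best_j = 0, None, None
        for i, a in enumerate(frags):
            for j, b in enumerate(frags):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_n:
                        best_n, best_i, best_j = n, i, j
        if best_n == 0:
            break  # no remaining overlaps: leave separate contigs
        merged = frags[best_i] + frags[best_j][best_n:]
        frags = [f for idx, f in enumerate(frags) if idx not in (best_i, best_j)]
        frags.append(merged)
    return frags


fragments = [
    "It was the best of times, it was the wor",
    ", it was the worst of times, it was the ",
    "isdom, it was the age of foolishness",
    "mes, it was the age of wisdom, it was th",
]
contigs = greedy_assemble(fragments)
print(contigs)
```

Note that the fourth fragment also overlaps the second by 11 characters (", it was th"), purely because the phrase repeats. Greedy merging gets away with it here only because the true overlaps are longer.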

Page 15: 2012 stamps-mbl-1

Repeats cause problems:

Assemble based on word overlaps:

Page 16: 2012 stamps-mbl-1

Shotgun sequencing & assembly

Randomly fragment & sequence from DNA; reassemble computationally.

UMD assembly primer (cbcb.umd.edu)

Page 17: 2012 stamps-mbl-1

Assembly – no subdivision!

Assembly is inherently an all-by-all process. There is no good way to subdivide the reads without potentially missing a key connection.

Page 18: 2012 stamps-mbl-1

Short-read assembly

• Short-read assembly is problematic

• Relies on very deep coverage, ruthless read trimming, paired ends.

UMD assembly primer (cbcb.umd.edu)

Page 19: 2012 stamps-mbl-1

Short read lengths are hard.

Whiteford et al., Nuc. Acid Res, 2005

Page 20: 2012 stamps-mbl-1

Short read lengths are hard.

Whiteford et al., Nuc. Acid Res, 2005

Conclusion: even with a read length of 200, the E. coli genome cannot be assembled completely.

Why?

Page 21: 2012 stamps-mbl-1

Short read lengths are hard.

Whiteford et al., Nuc. Acid Res, 2005

Conclusion: even with a read length of 200, the E. coli genome cannot be assembled completely.

Why? REPEATS.

This is why paired-end sequencing is so important for assembly.

Page 22: 2012 stamps-mbl-1

Four main challenges for de novo sequencing.

• Repeats.

• Low coverage.

• Errors.

These introduce breaks in the construction of contigs.

• Variation in coverage – transcriptomes and metagenomes, as well as amplified genomic.

This challenges the assembler to distinguish between erroneous connections (e.g. repeats) and real connections.

Page 23: 2012 stamps-mbl-1

Repeats

• Overlaps don’t place sequences uniquely when there are repeats present.

UMD assembly primer (cbcb.umd.edu)

Page 24: 2012 stamps-mbl-1

Coverage

Easy calculation:

(# reads x avg read length) / genome size

So, for haploid human genome:

30m reads x 100 bp = 3 bn bp, i.e. roughly 1x coverage of a ~3 Gbp genome
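The calculation on this slide can be sketched in a few lines of Python. The function names are illustrative, and the 3 Gbp figure is a round-number assumption for the haploid human genome.

```python
import math


def coverage(n_reads, avg_read_len, genome_size):
    """Average coverage = total sequenced bases / genome size."""
    return n_reads * avg_read_len / genome_size


def reads_needed(target_coverage, avg_read_len, genome_size):
    """Invert the formula: how many reads for a given average coverage?"""
    return math.ceil(target_coverage * genome_size / avg_read_len)


HUMAN_GENOME_BP = 3_000_000_000  # assumed round number

print(coverage(30_000_000, 100, HUMAN_GENOME_BP))   # the slide's example
print(reads_needed(30, 100, HUMAN_GENOME_BP))       # reads for 30x
```

So 30 million 100 bp reads give only about 1x on a human-sized genome; 30x would take on the order of 900 million such reads.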

Page 25: 2012 stamps-mbl-1

Coverage

• “1x” doesn’t mean every DNA sequence is read once.

• It means that, if sampling were systematic, it would be.

• Sampling isn’t systematic, it’s random!

(What does ‘coverage’ mean, for metagenomes?)

Page 26: 2012 stamps-mbl-1

Actual coverage varies widely from the average.
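One common way to put numbers on this, offered here as an assumption and not something the slides state: if reads land uniformly at random and edge effects are ignored, the depth at any one base is approximately Poisson-distributed with mean equal to the average coverage. A small Python sketch:

```python
import math


def poisson_pmf(depth, mean_coverage):
    """P(a base is covered exactly `depth` times) under the Poisson assumption."""
    return math.exp(-mean_coverage) * mean_coverage ** depth / math.factorial(depth)


def fraction_uncovered(mean_coverage):
    """Expected fraction of bases that no read covers at all."""
    return poisson_pmf(0, mean_coverage)


for c in (1, 5, 10):
    print(f"{c}x average: {fraction_uncovered(c):.5f} of bases never sampled")
```

Under this model, "1x" leaves roughly 37% of bases unsequenced, which is why average coverage alone says little about what was actually observed.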

Page 27: 2012 stamps-mbl-1

Two basic assembly approaches

• Overlap/layout/consensus

• De Bruijn k-mer graphs

The former is used for long reads, esp all Sanger-based assemblies. The latter is used because of memory efficiency.

Page 28: 2012 stamps-mbl-1

Overlap/layout/consensus

Essentially,
1. Calculate all overlaps.
2. Cluster based on overlap.
3. Do a multiple sequence alignment.

UMD assembly primer (cbcb.umd.edu)

Page 29: 2012 stamps-mbl-1

K-mers

Essentially, break reads (of any length) down into multiple overlapping words of fixed length k.

ATGGACCAGATGACAC (k=12) =>

ATGGACCAGATG TGGACCAGATGA GGACCAGATGAC GACCAGATGACA ACCAGATGACAC
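The decomposition above is a sliding window; a minimal Python sketch (the function name `kmers` is just illustrative):

```python
def kmers(read, k):
    """All overlapping words of length k, in order; a read of length L yields L - k + 1."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


for word in kmers("ATGGACCAGATGACAC", 12):
    print(word)
```

A 16-base read with k=12 gives 16 - 12 + 1 = 5 k-mers, matching the list on the slide.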

Page 30: 2012 stamps-mbl-1

K-mers – what k to use?

Butler et al., Genome Res, 2009

Page 31: 2012 stamps-mbl-1

K-mers – what k to use?

Butler et al., Genome Res, 2009

Page 32: 2012 stamps-mbl-1

Big genomes are problematic

Butler et al., Genome Res, 2009

Page 33: 2012 stamps-mbl-1

K-mer graphs - overlaps

J.R. Miller et al. / Genomics (2010)

Page 34: 2012 stamps-mbl-1

K-mer graph (k=14)

Each node represents a 14-mer; links between each node are 13-mer overlaps.
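A toy version of such a graph can be built as below. This is a sketch under stated assumptions: it uses k=12 and two made-up reads cut from the earlier example sequence (not the k=14 data behind the figure), ignores reverse complements, and `build_kmer_graph` is a hypothetical name, not any assembler's API.

```python
def build_kmer_graph(reads, k):
    """Map each k-mer (node) to the set of k-mers that follow it in some read.

    Consecutive k-mers in a read overlap by k-1 bases, which is the link.
    """
    graph = {}
    for read in reads:
        words = [read[i:i + k] for i in range(len(read) - k + 1)]
        for word in words:
            graph.setdefault(word, set())
        for left, right in zip(words, words[1:]):
            graph[left].add(right)
    return graph


# Two overlapping reads from ATGGACCAGATGACAC (bases 1-14 and 3-16).
reads = ["ATGGACCAGATGAC", "GGACCAGATGACAC"]
graph = build_kmer_graph(reads, k=12)

for node, successors in graph.items():
    print(node, "->", sorted(successors))
```

The two reads share one 12-mer, so their paths fuse into a single unbranched chain of five nodes without any explicit read-vs-read overlap computation.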

Page 35: 2012 stamps-mbl-1

K-mer graph (k=14)

Branches in the graph represent partially overlapping sequences.

Page 36: 2012 stamps-mbl-1

K-mer graph (k=14)

Single nucleotide variations cause long branches

Page 37: 2012 stamps-mbl-1

K-mer graph (k=14)

Single nucleotide variations cause long branches; they don’t rejoin quickly.
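The reason the branches are long is that every k-mer spanning the variant position differs, so one changed base produces up to k new k-mers on each side. A small Python sketch, using k=5 and a made-up variant of the earlier example read purely to keep the output short (the slide uses k=14):

```python
def kmer_set(seq, k):
    """The set of distinct k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


k = 5
reference = "ATGGACCAGATGACAC"
variant = "ATGGACCTGATGACAC"  # hypothetical: one base changed (A -> T)

ref_kmers = kmer_set(reference, k)
var_kmers = kmer_set(variant, k)

shared = ref_kmers & var_kmers
only_ref = ref_kmers - var_kmers
only_var = var_kmers - ref_kmers

print(len(shared), "shared;", len(only_ref), "only in reference;", len(only_var), "only in variant")
```

Here a single substitution yields k = 5 k-mers unique to each sequence; with k=14 each branch would be 14 nodes long before the paths rejoin.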

Page 38: 2012 stamps-mbl-1

K-mer graphs - branching

For decisions about which paths etc, biology-based heuristics come into play as well.

Page 39: 2012 stamps-mbl-1

K-mer graph complexity - spur

(Short) dead-end in graph.

Can be caused by error at the end of some overlapping reads, or low coverage

J.R. Miller et al. / Genomics (2010)
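A spur can be reproduced and detected on a toy graph. The sketch below is an illustration only: the reads are made up (the second one carries an error in its last base), k=5 is chosen for brevity, and `find_spurs` with its `max_len` cutoff is a hypothetical helper that only looks at forward dead ends.

```python
def build_kmer_graph(reads, k):
    """Map each k-mer to the set of k-mers that follow it in some read."""
    graph = {}
    for read in reads:
        words = [read[i:i + k] for i in range(len(read) - k + 1)]
        for word in words:
            graph.setdefault(word, set())
        for left, right in zip(words, words[1:]):
            graph[left].add(right)
    return graph


def find_spurs(graph, max_len):
    """Return dead-end paths of at most max_len nodes that hang off a branching node."""
    preds = {}
    for node, succs in graph.items():
        for succ in succs:
            preds.setdefault(succ, set()).add(node)
    spurs = []
    for node, succs in graph.items():
        if succs:
            continue  # not a dead end
        path = [node]
        while len(path) <= max_len:
            parents = preds.get(path[-1], set())
            if len(parents) != 1:
                break
            (parent,) = parents
            if len(graph[parent]) > 1:  # reached the branch point
                spurs.append(path[::-1])
                break
            path.append(parent)
    return spurs


reads = ["ATGGACCAGATGACAC", "ATGGACCAGT"]  # second read ends in an erroneous T
graph = build_kmer_graph(reads, k=5)
spurs = find_spurs(graph, max_len=3)
print(spurs)
```

The genuine end of the sequence is also a dead end, but it is not reported because walking back from it never reaches a branch within `max_len` nodes.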

Page 40: 2012 stamps-mbl-1

K-mer graph complexity - bubble

Multiple parallel paths that diverge and join.

Caused by sequencing error and true polymorphism / polyploidy in sample.

J.R. Miller et al. / Genomics (2010)

Page 41: 2012 stamps-mbl-1

K-mer graph complexity – “frayed rope”

Converging, then diverging paths.

Caused by repetitive sequences.

J.R. Miller et al. / Genomics (2010)

Page 42: 2012 stamps-mbl-1

Resolving graph complexity

• Primarily heuristic (approximate) approaches.

• Detecting complex graph structures generally cannot be done efficiently.

• Much of the divergence in functionality of new assemblers comes from this.

• Three examples:

Page 43: 2012 stamps-mbl-1

Read threading

Single read spans k-mer graph => extract the single-read path.

J.R. Miller et al. / Genomics (2010)

Page 44: 2012 stamps-mbl-1

Mate threading

Resolve “frayed-rope” pattern caused by repeats, by separating paths based on mate-pair reads.

J.R. Miller et al. / Genomics (2010)

Page 45: 2012 stamps-mbl-1

Path following

Reject inconsistent paths based on mate-pair reads and insert size.

J.R. Miller et al. / Genomics (2010)

Page 46: 2012 stamps-mbl-1

More assembly issues

• Many parameters to optimize!

• Metagenomes have variation in copy number; naïve assemblers can treat this as repetitive and eliminate it.

• Assembly requires gobs of memory (4 lanes, 60m reads => ~ 150gb RAM)

• How do we evaluate assemblies?
– What’s the best assembler?

Page 47: 2012 stamps-mbl-1

Metagenomics: Mixed community sampling

Coverage distribution

Page 48: 2012 stamps-mbl-1

Conclusions re mixed community sampling

In shotgun metagenomics, you are sampling randomly from the mixed population.

Therefore, the lowest abundance member of the population (that you want to observe) drives the required depth of sequencing!

1 in a million => ~50 Tbp sequencing for 10x coverage.
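The arithmetic behind that number can be sketched as follows. The slide does not state a genome size; the 5 Mbp used here is an assumption that happens to reproduce the ~50 Tbp figure, and "1 in a million" is read as the organism's share of the DNA in the sample. The function name is illustrative.

```python
def required_sequencing_bp(target_coverage, genome_size, one_in):
    """Total bases to sequence so an organism making up 1/one_in of the
    sample's DNA reaches the target average coverage."""
    return target_coverage * genome_size * one_in


needed = required_sequencing_bp(target_coverage=10, genome_size=5_000_000, one_in=1_000_000)
print(f"{needed / 1e12:.0f} Tbp")
```

Only one read in a million comes from the rare genome, so reaching 10x on its 5 Mbp takes a million times the 50 Mbp you would need from a pure culture.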

Page 49: 2012 stamps-mbl-1

‘k’ parameter sets effective coverage.

coverage

Simulated data set.

Page 50: 2012 stamps-mbl-1

Conclusions re ‘k’ parameter

• The previous slide shows you coverage histograms for per-base (mapping) coverage, as well as k-mer distributions.

• People will tell you k is about specificity: a longer ‘k’ is more stringent and requires a more specific overlap between reads.

• However, the practical effect of increasing k is to lower your effective coverage.

• This is one (the?) reason why different ‘k’ parameters can give you different subsets of the metagenomic population.
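A standard back-of-the-envelope relation makes the effect concrete; it is added here as an assumption and is not taken from the slides. An error-free read of length L contains L - k + 1 k-mers, so k-mer coverage is roughly base coverage times (L - k + 1) / L:

```python
def kmer_coverage(base_coverage, read_len, k):
    """Approximate k-mer coverage for error-free reads of a fixed length."""
    return base_coverage * (read_len - k + 1) / read_len


for k in (31, 51, 71):
    print(f"k={k}: 20x base coverage -> {kmer_coverage(20, 100, k):.0f}x k-mer coverage")
```

With 100 bp reads at 20x, going from k=31 to k=71 drops effective coverage from about 14x to about 6x, which can push lower-abundance members of the community below what an assembler can use.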

Page 51: 2012 stamps-mbl-1

Assembly depends on high coverage

HMP mock community assembly; Velvet-based protocol.

Page 52: 2012 stamps-mbl-1

Conclusions from previous slide

• To recover any contigs at all, you need coverage > 10 (green line).

• To recover long sequences, you want coverage > 20 (blue line).

Page 53: 2012 stamps-mbl-1

Assemblers fail to assemble complex regions into contigs.

Page 54: 2012 stamps-mbl-1

Conclusions from previous slide

• Contig assemblers don’t like “complex” regions in the graph (repeats, high polymorphism, etc.)

• They will simply end the contig there.

• This is why you need paired-end sequencing and scaffolding.

• Friends don’t let friends scaffold metagenomic data
– See rumen paper, Hess et al., pmid 21273488, for discussion.

Page 55: 2012 stamps-mbl-1

What do metagenomic assemblers do?

MetaVelvet and MetaIDBA (and khmer) “partition” the assembly graph into sections from different organisms, and then assemble those individually.

This allows them to adjust coverage parameters “locally”.

Page 56: 2012 stamps-mbl-1
Page 57: 2012 stamps-mbl-1

MetaVelvet & partitioning

Page 58: 2012 stamps-mbl-1

Errors

200x coverage – but most k-mers are from errors!

Page 59: 2012 stamps-mbl-1

Conclusions from previous slide

• For a simulated data set with coverage of 200, the vast majority (80%) of unique k-mers are low-abundance and caused by errors.

• Errors cause major problems for de Bruijn graph assemblers.

• For genomes, you can trim off low-abundance k-mers. For metagenomes, that removes real data. Dilemma.
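The effect, and the trimming step, can be sketched on toy data. Everything below is illustrative: ten made-up reads of the same 16-base sequence, one of them carrying a single error, k=5, and hypothetical helper names; it is not the simulated data set from the slide.

```python
from collections import Counter


def kmer_counts(reads, k):
    """Count how often each k-mer occurs across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def trim_low_abundance(counts, min_count):
    """Keep only k-mers seen at least min_count times."""
    return {kmer for kmer, n in counts.items() if n >= min_count}


true_read = "ATGGACCAGATGACAC"
error_read = "ATGGACCTGATGACAC"  # one sequencing error
reads = [true_read] * 9 + [error_read]

counts = kmer_counts(reads, k=5)
singletons = [kmer for kmer, n in counts.items() if n == 1]
solid = trim_low_abundance(counts, min_count=2)

print(len(counts), "unique k-mers,", len(singletons), "seen once,", len(solid), "kept")
```

One error in ten reads already accounts for 5 of the 17 unique k-mers. Trimming at a count of 2 removes exactly those here, but in a metagenome a genuinely rare organism would produce count-1 k-mers too and be trimmed along with the errors.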

Page 60: 2012 stamps-mbl-1

coverage

For genomes, you can trim low-abundance k-mers.

Not so for metagenomes…

Page 61: 2012 stamps-mbl-1

Conclusions from previous slide

• For a simulated data set with coverage of 200, the vast majority (80%) of unique k-mers are low-abundance and caused by errors.

• Errors cause major problems for de Bruijn graph assemblers.

• For genomes, you can trim off low-abundance k-mers. For metagenomes, that removes real data. Dilemma.

Page 62: 2012 stamps-mbl-1

Some concluding thoughts (day 1)

• Opinions:
– For polymorphic/strain variants, contig assembly is more likely to fail to produce a contig than it is to produce a chimera (contig assembly is specific).
– Scaffolding, at least with Velvet/MetaVelvet, seems to be prone to producing chimeras.
– My biggest concern with metagenome assembly is not specificity but rather sensitivity.
– We know so little about most environments that we have no good way of assessing what we’re missing.

Page 63: 2012 stamps-mbl-1

Some more concluding thoughts

• Assembly is a gigantic black box into which you feed your data, and out of which comes… something.

• Think hard about how to evaluate the results and be prepared to spend lots of time doing so.

• Your computation is part of your science! If you’re just running someone else’s program blindly, you’re doing it wrong.