Genome assembly: then and now — v1.0

Genome assembly: then and now Keith Bradnam Image from Wellcome Trust

description

A talk that I gave to a general audience at UC Davis. Slides were also used for Prof. Ian Korf's presentation at the Genome 10K workshop (May 25th, 2013). This talk mostly concerns the results of the Assemblathon 2 contest, but also covers other issues relating to genome assembly. Note: this talk has been superseded by updated versions (also available on slideshare)!

Transcript of Genome assembly: then and now — v1.0

Page 1: Genome assembly: then and now — v1.0

Genome assembly: then and now

Keith Bradnam

Image from Wellcome Trust

Page 2: Genome assembly: then and now — v1.0

Image from flickr.com/photos/dougitdesign/5613967601/

Contents

Sequencing 101

Genome assembly: then

Genome assembly: now

Assemblathon 1

Assemblathon 2

Assemblathon 3

Page 3: Genome assembly: then and now — v1.0

More info

✤ http://assemblathon.org

✤ http://arxiv.org

✤ http://twitter.com/assemblathon

The Assemblathon 2 paper has been reviewed; we are now addressing the reviewers' comments.

Page 4: Genome assembly: then and now — v1.0

Sequencing 101

A, C, G, T...

Image from nlm.nih.gov

Fred Sanger

Page 5: Genome assembly: then and now — v1.0

Read

Most sequencing technologies start with a sequencing read. A read could be as short as 25 bp (Solexa sequencing from a few years ago), or >15,000 bp (PacBio with latest chemistry).

Page 6: Genome assembly: then and now — v1.0

Read pair

Most sequencing is done with pairs of connected reads, separated by a short interval whose length is known. Read pairs can also overlap with each other.

Page 7: Genome assembly: then and now — v1.0

Read pair

Mate pair

Mate pairs, also known as jumping pairs, have much larger inserts (thousands or tens of thousands of bp), but it is hard to make good mate pair libraries. Having very large inserts is very useful for the purposes of genome assembly.

Page 8: Genome assembly: then and now — v1.0

Sequence a whole lot of read pairs, and hopefully they will overlap with each other and allow you to start making contiguous sequences...

Page 9: Genome assembly: then and now — v1.0

Contigs

...which are better known as contigs.
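As a rough illustration of how overlapping reads can be merged into a contig, here is a minimal Python sketch. The three reads, the `overlap` and `merge` helpers, and the minimum overlap of 3 bp are all made up for this example. Real assemblers use far more sophisticated approaches and have to cope with sequencing errors and repeats, so treat this as a toy only.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of `a` that matches a prefix of `b`.

    Returns 0 if no overlap of at least `min_len` bases exists.
    """
    for size in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:size]):
            return size
    return 0


def merge(a, b, min_len=3):
    """Merge `b` onto the end of `a` if they overlap, else return None."""
    size = overlap(a, b, min_len)
    return a + b[size:] if size else None


# Hypothetical reads, listed in the order in which they overlap.
reads = ["ATGGCGTAC", "GTACCTTAG", "TTAGGCAAT"]

contig = reads[0]
for read in reads[1:]:
    merged = merge(contig, read)
    if merged is None:
        raise ValueError(f"read {read} does not overlap the growing contig")
    contig = merged

print(contig)  # ATGGCGTACCTTAGGCAAT
```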

Page 10: Genome assembly: then and now — v1.0

Mate pairs — or other information — can hopefully be used to connect contigs together into scaffolds. The unknown gap between contigs is replaced with unknown bases (Ns).

Page 12: Genome assembly: then and now — v1.0

Scaffold

NNNNNNNNNNNNNNNNNNN

Mate pairs — or other information — can hopefully be used to connect contigs together into scaffolds. The unknown gap between contigs is replaced with unknown bases (Ns).
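The idea of joining contigs into a scaffold can be sketched in a few lines of Python. The contig sequences and gap sizes below are invented, and `build_scaffold` is a hypothetical helper written only for this illustration. In practice the gap sizes would be estimated from mate pair insert sizes or other evidence.

```python
def build_scaffold(contigs, gap_sizes):
    """Join contigs in order, filling each gap with a run of Ns."""
    if len(gap_sizes) != len(contigs) - 1:
        raise ValueError("need one gap size between each pair of adjacent contigs")
    parts = [contigs[0]]
    for gap, contig in zip(gap_sizes, contigs[1:]):
        parts.append("N" * gap)  # unknown bases between two contigs
        parts.append(contig)
    return "".join(parts)


# Made-up contigs and estimated gap sizes (in bp).
contigs = ["ACGTACGT", "GGCATT", "TTAGC"]
gaps = [5, 3]

scaffold = build_scaffold(contigs, gaps)
n_fraction = scaffold.count("N") / len(scaffold)

print(scaffold)
print(f"{n_fraction:.1%} of the scaffold consists of Ns")
```

Counting the Ns this way is also how you would find out how much of a scaffold-based assembly size comes from unknown bases.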

Page 13: Genome assembly: then and now — v1.0

Assembly size

Figure: 12 scaffolds of 70, 25, 20, 10, 10, 5, 5, 5, 15, 15, 15 and 5 Mbp; some contain runs of Ns.

Assembly size is simply the sum of all scaffolds or contigs that are included in the final genome assembly. If you are calculating the assembly size from scaffolds, then some fraction of that final size will come from the Ns in scaffold sequences.

Here we have a toy genome assembly, with 12 scaffolds totaling 200 Mbp.
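The toy example can be written out as a short Python sketch. The scaffold lengths are the ones shown on the slide (in Mbp); the variable names are just for illustration.

```python
# Scaffold lengths of the toy assembly, in Mbp, as shown on the slide.
toy_assembly = [70, 25, 20, 10, 10, 5, 5, 5, 15, 15, 15, 5]

number_of_scaffolds = len(toy_assembly)
assembly_size = sum(toy_assembly)

print(f"{number_of_scaffolds} scaffolds totaling {assembly_size} Mbp")
```

With real data you would sum the lengths of the sequences in the assembly file instead, and the scaffold total would include any Ns.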

Page 14: Genome assembly: then and now — v1.0

Assembly size

Figure: 12 scaffolds of 70, 25, 20, 10, 10, 5, 5, 5, 15, 15, 15 and 5 Mbp, labelled with their total of 200 Mbp.

Assembly size is simply the sum of all scaffolds or contigs that are included in the final genome assembly. If you are calculating the assembly size from scaffolds, then some fraction of that final size will come from the Ns in scaffold sequences.

Here we have a toy genome assembly, with 12 scaffolds totaling 200 Mbp.

Page 15: Genome assembly: then and now — v1.0

N50 length

Figure: the 12-scaffold toy assembly (200 Mbp total).

The most widely used measure to describe genome assemblies is the N50 length of scaffolds or contigs. This is essentially a weighted median, designed to be more informative than a crude mean length (which is not very useful if you end up with thousands of very short contigs). To calculate the N50 scaffold length, start with the length of the longest scaffold...

Page 17: Genome assembly: then and now — v1.0

N50 length

Figure: the 12-scaffold toy assembly (200 Mbp total); running total after the longest scaffold = 70 Mbp.

The most widely used measure to describe genome assemblies is the N50 length of scaffolds or contigs. This is essentially a weighted median, designed to be more informative than a crude mean length (which is not very useful if you end up with thousands of very short contigs). To calculate the N50 scaffold length, start with the length of the longest scaffold...

Page 18: Genome assembly: then and now — v1.0

N50 length

Figure: the 12-scaffold toy assembly (200 Mbp total); running total after the two longest scaffolds = 95 Mbp.

If this length does not reach 50% of the total assembly size (50% is why it is N50), proceed to the next longest scaffold, and add the length to a running total.

Page 20: Genome assembly: then and now — v1.0

N50 length

Figure: the 12-scaffold toy assembly (200 Mbp total); running total after the three longest scaffolds = 115 Mbp.

Now we have exceeded 50% of the total assembly size.

Page 22: Genome assembly: then and now — v1.0

N50 length

Figure: the 12-scaffold toy assembly (200 Mbp total).

The length of the contig or scaffold that takes you past 50% is what is reported as the N50 length. So here, we have an N50 length of 20 Mbp.
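The procedure walked through on the last few slides can be sketched in Python as follows. The function name `n50` is just a label for this example. The sketch stops once the running total reaches at least 50% of the assembly size, which is the usual convention.

```python
def n50(lengths):
    """Return the N50 of a list of scaffold (or contig) lengths."""
    total = sum(lengths)
    running_total = 0
    # Walk from the longest scaffold to the shortest.
    for length in sorted(lengths, reverse=True):
        running_total += length
        # Stop at the scaffold that takes us to (or past) 50% of the total.
        if 2 * running_total >= total:
            return length
    raise ValueError("cannot compute N50 of an empty assembly")


# Scaffold lengths of the toy assembly, in Mbp, as shown on the slides.
toy_assembly = [70, 25, 20, 10, 10, 5, 5, 5, 15, 15, 15, 5]

print(n50(toy_assembly))  # 20
```

The running total goes 70, 95, 115 Mbp, so the third-longest scaffold (20 Mbp) is the one that passes the 100 Mbp halfway mark.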

Page 23: Genome assembly: then and now — v1.0

N50 length

Figure: the 12-scaffold toy assembly, including its four 5 Mbp scaffolds.

N50 may be more robust than using a simple mean length, but it can still be easily manipulated. What if we excluded the two shortest scaffolds from our assembly?

Page 25: Genome assembly: then and now — v1.0

N50 length

Figure: the toy assembly with two of its 5 Mbp scaffolds removed, leaving 10 scaffolds.

N50 may be more robust than using a simple mean length, but it can still be easily manipulated. What if we excluded the two shortest scaffolds from our assembly?

Page 26: Genome assembly: then and now — v1.0

N50 length

Figure: the toy assembly with two of its 5 Mbp scaffolds removed, leaving 10 scaffolds.

Now the total assembly size is 10 Mbp smaller, which is only 5%, but the N50 increases to 25 Mbp...a 25% increase in size. If these were two different assemblies and you only saw an N50 of 25 Mbp vs N50 of 20 Mbp, you might think the first assembly was much better.
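The effect of dropping the two shortest scaffolds can be checked with a small Python sketch. The `n50` helper is an illustrative implementation (it stops once the running total reaches at least 50% of the assembly size), and the lengths are those of the toy assembly, in Mbp.

```python
def n50(lengths):
    """Return the N50 of a list of scaffold (or contig) lengths."""
    total = sum(lengths)
    running_total = 0
    for length in sorted(lengths, reverse=True):
        running_total += length
        if 2 * running_total >= total:
            return length
    raise ValueError("cannot compute N50 of an empty assembly")


full = [70, 25, 20, 10, 10, 5, 5, 5, 15, 15, 15, 5]
# Exclude the two shortest scaffolds (two of the 5 Mbp ones).
trimmed = sorted(full, reverse=True)[:-2]

size_drop = 100 * (sum(full) - sum(trimmed)) / sum(full)
n50_gain = 100 * (n50(trimmed) - n50(full)) / n50(full)

print(f"Assembly size: {sum(full)} -> {sum(trimmed)} Mbp ({size_drop:.0f}% smaller)")
print(f"N50: {n50(full)} -> {n50(trimmed)} Mbp ({n50_gain:.0f}% larger)")
```

In the trimmed assembly the two longest scaffolds (70 + 25 = 95 Mbp) already account for exactly half of the 190 Mbp total, which is why the N50 jumps to 25 Mbp.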

Page 27: Genome assembly: then and now — v1.0

N50 length

Figure: the toy assembly with two of its 5 Mbp scaffolds removed, leaving 10 scaffolds (190 Mbp total).

Now the total assembly size is 10 Mbp smaller, which is only 5%, but the N50 increases to 25 Mbp...a 25% increase in size. If these were two different assemblies and you only saw an N50 of 25 Mbp vs N50 of 20 Mbp, you might think the first assembly was much better.

Page 29: Genome assembly: then and now — v1.0

N50 for two assemblies

Here are another two fictional assemblies. The first assembly now has a lower N50 value, but this is purely because it contains more sequence (albeit in short scaffolds). Do you want more sequence in your assembly, or fewer but longer sequences?

Page 30: Genome assembly: then and now — v1.0

N50 for two assemblies

208 Mbp 190 Mbp

Here are another two fictional assemblies. The first assembly now has a lower N50 value, but this is purely because it contains more sequence (albeit in short scaffolds). Do you want more sequence in your assembly, or fewer but longer sequences?

Page 31: Genome assembly: then and now — v1.0

N50 for two assemblies

208 Mbp 190 Mbp

N50 = 15 Mbp N50 = 25 Mbp

Here are another two fictional assemblies. The first assembly now has a lower N50 value, but this is purely because it contains more sequence (albeit in short scaffolds). Do you want more sequence in your assembly, or fewer but longer sequences?

Page 32: Genome assembly: then and now — v1.0

NG50 for two assemblies

208 Mbp 190 Mbp

We prefer a measure called NG50. This does not use the assembly size, but instead uses the known (or estimated) genome size (the 'G' in NG50 refers to the Genome).

Page 33: Genome assembly: then and now — v1.0

NG50 for two assemblies

We prefer a measure called NG50. This does not use the assembly size, but instead uses the known (or estimated) genome size (the 'G' in NG50 refers to the Genome).

Page 34: Genome assembly: then and now — v1.0

NG50 for two assemblies

Expected genome size = 250 Mbp

We prefer a measure called NG50. This does not use the assembly size, but instead uses the known (or estimated) genome size (the 'G' in NG50 refers to the Genome).

Page 35: Genome assembly: then and now — v1.0

Expected genome size = 250 Mbp

NG50 for two assemblies

The NG50 of these two assemblies is now the same. We think that NG50 is a fairer way of comparing genome assemblies that might differ in their total size.

Page 36: Genome assembly: then and now — v1.0

NG50 = 15 Mbp NG50 = 15 Mbp

Expected genome size = 250 Mbp

NG50 for two assemblies

The NG50 of these two assemblies is now the same. We think that NG50 is a fairer way of comparing genome assemblies that might differ in their total size.
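NG50 can be sketched in Python in much the same way as N50, with the threshold taken from the genome size rather than the assembly size. The `ng50` function below is illustrative. The scaffold lengths of the 208 Mbp assembly are not given on the slides, so the sketch assumes the 200 Mbp and 190 Mbp toy assemblies from the earlier N50 example as stand-ins.

```python
def ng50(lengths, genome_size):
    """Return the NG50: like N50, but relative to the expected genome size.

    Returns None if the assembly covers less than half of the genome size.
    """
    running_total = 0
    for length in sorted(lengths, reverse=True):
        running_total += length
        if 2 * running_total >= genome_size:
            return length
    return None


expected_genome_size = 250  # Mbp, as on the slide

# Stand-in assemblies (Mbp): the earlier toy example, with and without
# its two shortest scaffolds. These are assumptions for illustration.
assembly_a = [70, 25, 20, 10, 10, 5, 5, 5, 15, 15, 15, 5]  # 200 Mbp
assembly_b = [70, 25, 20, 10, 10, 5, 5, 15, 15, 15]        # 190 Mbp

print(ng50(assembly_a, expected_genome_size))  # 15
print(ng50(assembly_b, expected_genome_size))  # 15
```

Both stand-in assemblies need their four longest scaffolds (70 + 25 + 20 + 15 = 130 Mbp) to pass the 125 Mbp halfway mark, so they share an NG50 of 15 Mbp even though their N50 values differ.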

Page 37: Genome assembly: then and now — v1.0

How do I describe thee? Let me count the ways

Apart from assembly size, and N50/NG50 length, there are many other ways to describe a genome assembly.

Page 38: Genome assembly: then and now — v1.0

How do I describe thee? Let me count the ways

Metric            Description
Assembly size     With or without very short contigs?
N50 / NG50        For contigs and/or scaffolds
Coverage          When compared to a reference sequence
Errors            Base errors from alignment to reference sequence and/or input read data
Number of genes   From comparison to reference transcriptome and/or set of known genes

Apart from assembly size, and N50/NG50 length, there are many other ways to describe a genome assembly.

Page 39: Genome assembly: then and now — v1.0

How do I describe thee? Let me count the ways

Metric            Description
Assembly size     With or without very short contigs?
N50 / NG50        For contigs and/or scaffolds
Coverage          When compared to a reference sequence
Errors            Base errors from alignment to reference sequence and/or input read data
Number of genes   From comparison to reference transcriptome and/or set of known genes

And many, many more...

Apart from assembly size, and N50/NG50 length, there are many other ways to describe a genome assembly.

Page 40: Genome assembly: then and now — v1.0

Genome assembly

Back in the day...

How were genomes assembled back in the late 1990s when genome sequencing projects were starting to make the news?

Page 41: Genome assembly: then and now — v1.0

Genome assembly: then

Genome sequencing projects often had a fantastic amount of supporting material which helped put the genome together. They were further helped by targeting genomes which had low heterozygosity. And of course this was all done with Sanger sequencing which gave long, accurate reads.

Page 42: Genome assembly: then and now — v1.0

Genetic maps ✓

Physical maps ✓

Understanding of target genome ✓

Haploid / low heterozygosity genome ✓

Accurate & long reads ✓

Resources (time, money, people) ✓

Genome assembly: then

Genome sequencing projects often had a fantastic amount of supporting material which helped put the genome together. They were further helped by targeting genomes which had low heterozygosity. And of course this was all done with Sanger sequencing which gave long, accurate reads.

Page 48: Genome assembly: then and now — v1.0

So what was the result of spending millions of dollars to assemble genomes of well-characterized species, with accurate long reads and detailed maps???

So hopefully this gave us a useful set of finished genomes, right?

Page 49: Genome assembly: then and now — v1.0

✤ 2000: published genome size = 125 Mbp

✤ 2007: genome size = 157 Mbp

✤ 2012: genome size = 135 Mbp

Arabidopsis thaliana

Many published genome sizes are based on estimates, which can be wrong. As more and more of the Arabidopsis genome was sequenced, the estimate of its size had to be revised. So between 2000 and 2007 more sequence was produced but, paradoxically, the genome became less complete because the size estimate went up. Now the estimate has come back down again, but the genome remains unfinished.

Page 50: Genome assembly: then and now — v1.0

✤ 2000: published genome size = 125 Mbp

✤ 2007: genome size = 157 Mbp

✤ 2012: genome size = 135 Mbp

✤ Amount sequenced = 119 Mbp

Arabidopsis thaliana

Many published genome sizes are based on estimates, which can be wrong. As more and more of the Arabidopsis genome was sequenced, the estimate of its size had to be revised. So between 2000 and 2007 more sequence was produced but, paradoxically, the genome became less complete because the size estimate went up. Now the estimate has come back down again, but the genome remains unfinished.

Page 51: Genome assembly: then and now — v1.0

✤ 2000: published genome size = 125 Mbp

✤ 2007: genome size = 157 Mbp

✤ 2012: genome size = 135 Mbp

✤ Amount sequenced = 119 Mbp

✤ Ns = 0.2% of genome

Arabidopsis thaliana

Many published genome sizes are based on estimates, which can be wrong. As more and more of the Arabidopsis genome was sequenced, the estimate of its size had to be revised. So between 2000 and 2007 more sequence was produced but, paradoxically, the genome became less complete because the size estimate went up. Now the estimate has come back down again, but the genome remains unfinished.
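To see how much the apparent completeness depends on the size estimate, here is a small Python sketch. It uses the three genome size estimates and the 119 Mbp figure from the slide. Applying that single 119 Mbp value to all three estimates is a simplification made for this example, since the amount of sequence also changed over time.

```python
sequenced_mbp = 119  # amount sequenced, from the slide

# Genome size estimates (Mbp) by year, from the slide.
size_estimates_mbp = {2000: 125, 2007: 157, 2012: 135}

completeness = {
    year: 100 * sequenced_mbp / size
    for year, size in size_estimates_mbp.items()
}

for year, percent in completeness.items():
    print(f"{year} estimate ({size_estimates_mbp[year]} Mbp): {percent:.1f}% sequenced")
```

The same 119 Mbp of sequence looks about 95% complete against the 2000 estimate, but only about 76% complete against the 2007 estimate.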

Page 52: Genome assembly: then and now — v1.0

Drosophila melanogaster

✤ Genome published 1998

✤ Heterochromatin finished 2007

The fly genome was 'finished' in 1998. But this was only really the easy-to-sequence portion of the genome (the euchromatin). The trickier heterochromatin was sequenced as a separate project that didn't finish until almost a decade later. The fly genome remains unfinished.

Page 53: Genome assembly: then and now — v1.0

Drosophila melanogaster

✤ Genome published 1998

✤ Heterochromatin finished 2007

✤ Ns = 4% of genome

The fly genome was 'finished' in 1998. But this was only really the easy-to-sequence portion of the genome (the euchromatin). The trickier heterochromatin was sequenced as a separate project that didn't finish until almost a decade later. The fly genome remains unfinished.

Page 54: Genome assembly: then and now — v1.0

Caenorhabditis elegans

✤ Genome published 1998

✤ 2004: last N removed

The worm genome has no unknown bases in it. However, since its publication the genome sequence has continued to be refined as errors are corrected. The last batch of changes was made just last November. So almost 15 years after the genome was published, we can still find over 1,400 errors in one of the best-characterized genome sequences that exist.

Page 55: Genome assembly: then and now — v1.0

Caenorhabditis elegans

✤ Genome published 1998

✤ 2004: last N removed

✤ 1998–2013: genome sequence changes

The worm genome has no unknown bases in it. However, since its publication the genome sequence has continued to be refined as errors are corrected. The last batch of changes was made just last November. So almost 15 years after the genome was published, we can still find over 1,400 errors in one of the best-characterized genome sequences that exist.

Page 56: Genome assembly: then and now — v1.0

Caenorhabditis elegans

✤ Genome published 1998

✤ 2004: last N removed

✤ 1998–2013: genome sequence changes

✤ 558 insertions

✤ 230 deletions

✤ 614 substitutions

The worm genome has no unknown bases in it. However, since its publication the genome sequence has continued to be refined as errors are corrected. The last batch of changes was made just last November. So almost 15 years after the genome was published, we can still find over 1,400 errors in one of the best-characterized genome sequences that exist.

Page 57: Genome assembly: then and now — v1.0

Caenorhabditis elegans

✤ Genome published 1998

✤ 2004: last N removed

✤ 1998–2013: genome sequence changes

✤ 558 insertions

✤ 230 deletions

✤ 614 substitutions

} Nov 2012

The worm genome has no unknown bases in it. However, since its publication the genome sequence has continued to be refined as errors are corrected. The last batch of changes was made just last November. So almost 15 years after the genome was published, we can still find over 1,400 errors in one of the best-characterized genome sequences that exist.

Page 58: Genome assembly: then and now — v1.0

Saccharomyces cerevisiae

✤ Genome published 1997

✤ 12 Mbp genome

✤ 1,653 changes to genome since 1997

Likewise in yeast. The first eukaryotic genome sequence continues to receive fixes that correct the sequence. The last set of changes was made in 2011. These changes affected coding sequences, not just intergenic and intronic DNA.

Page 59: Genome assembly: then and now — v1.0

Saccharomyces cerevisiae

✤ Genome published 1997

✤ 12 Mbp genome

✤ 1,653 changes to genome since 1997

✤ Last changes made in 2011

Likewise in yeast. The first eukaryotic genome sequence continues to receive fixes that correct the sequence. The last set of changes was made in 2011. These changes affected coding sequences, not just intergenic and intronic DNA.

Page 60: Genome assembly: then and now — v1.0

Genetic maps ✓

Physical maps ✓

Understanding of target genome ✓

Haploid / low heterozygosity genome ✓

Accurate & long reads ✓

Resources (time, money, people) ✓

Genome assembly: then

And all of this was done in an era when we had all of these supporting materials.

Page 61: Genome assembly: then and now — v1.0

Genetic maps ✗

Physical maps ✗

Understanding of target genome ✗

Haploid / low heterozygosity genome ✗

Accurate & long reads ✗

Resources (time, money, people) ✗

Genome assembly: now

We don't have these now! Genome sequencing no longer requires an international consortium; it could instead be a project for a grad student.

Page 62: Genome assembly: then and now — v1.0

Assembling & finishing a genome is not easy!

It was never easy, even when we had access to lots of resources to help us put genomes together. And it is not easy now. Don't be fooled into thinking that, because there are many published genome sequences, those sequences represent the absolute ideal genome sequence.

Page 63: Genome assembly: then and now — v1.0

Assemblathons

A new idea is born

Image from flickr.com/photos/dullhunk/4422952630

Page 64: Genome assembly: then and now — v1.0

The Assemblathon was born out of the Genome 10K project.

Page 65: Genome assembly: then and now — v1.0

If you sequence 10,000 genomes...

...you need to assemble 10,000 genomes

The Assemblathon was born out of the Genome 10K project.

Page 66: Genome assembly: then and now — v1.0

How many assembly tools are out there?

There are many, many tools out there for assembling, or helping to assemble, a genome sequence. It seems reasonable to ask...which is the best?

Page 67: Genome assembly: then and now — v1.0

How many assembly tools are out there?

Ray, Celera, MIRA, ALLPATHS-LG, SGA, Curtain, Metassembler, Phusion, ABySS, Amos, Arapan, CLC, Cortex, DNAnexus, DNA Dragon, EULER, Edena, Forge, Geneious, IDBA, Newbler, PRICE, PADENA, PASHA, Phrap, TIGR, Sequencher, SeqMan NGen, SHARCGS, SOPRA, SSAKE, SPAdes, Taipan, VCAKE, Velvet, Arachne, PCAP, GAM, Monument, Atlas, ABBA, Anchor, ATAC, Contrail, DecGPU, GenoMiner, Lasergene, PE-Assembler, Pipeline Pilot, QSRA, SeqPrep, SHORTY, fermi, Telescoper, Quast, SCARPA, Hapsembler, HapCompass, HaploMerger, SWiPS, GigAssembler, MSR-CA

There are many, many tools out there for assembling, or helping to assemble, a genome sequence. It seems reasonable to ask...which is the best?

Page 68: Genome assembly: then and now — v1.0

How many assembly tools are out there?

Which is the best?

There are many, many tools out there for assembling, or helping to assemble, a genome sequence. It seems reasonable to ask...which is the best?

Page 69: Genome assembly: then and now — v1.0

Comparing assemblers

✤ Can't fairly compare two assemblers if they:

However, it is not always straightforward to compare two tools if they were used on different species or on different datasets from the same species.

Page 70: Genome assembly: then and now — v1.0

Comparing assemblers

✤ Can't fairly compare two assemblers if they:

✤ produced assemblies from different species

However, it is not always straightforward to compare two tools if they were used on different species or on different datasets from the same species.

Page 71: Genome assembly: then and now — v1.0

Comparing assemblers

✤ Can't fairly compare two assemblers if they:

✤ produced assemblies from different species

✤ assembled same species, but used sequence data from different NGS platforms

However, it is not always straightforward to compare two tools if they were used on different species or on different datasets from the same species.

Page 72: Genome assembly: then and now — v1.0

Comparing assemblers

✤ Can't fairly compare two assemblers if they:

✤ produced assemblies from different species

✤ assembled same species, but used sequence data from different NGS platforms

✤ used same NGS platform but different sequence libraries

However, it is not always straightforward to compare two tools if they were used on different species or on different datasets from the same species.

Page 73: Genome assembly: then and now — v1.0

Comparing assemblers

✤ Can't fairly compare two assemblers if they:

✤ produced assemblies from different species

✤ assembled same species, but used sequence data from different NGS platforms

✤ used same NGS platform but different sequence libraries

✤ Even using different options for the same assembler may produce very different assemblies!

However, it is not always straightforward to compare two tools if they were used on different species or on different datasets from the same species.

Page 74: Genome assembly: then and now — v1.0

A genome assembly competition

That's where the Assemblathon came in.

Page 75: Genome assembly: then and now — v1.0

An attempt to standardize some aspects of the genome assembly process

Genome assembly contests

Others have been trying to do the same thing, e.g. GAGE and dnGASP.

Page 76: Genome assembly: then and now — v1.0

✤ 2010–2011

✤ Used synthetic data

✤ Small genome (~100 Mbp)

✤ We knew the answer!

Assemblathon 1

It is easier to judge a tool when you know what the final answer should look like. However, many people that work on developing assemblers would prefer to work with real data...

Page 77: Genome assembly: then and now — v1.0

Here we go again

...which is where Assemblathon 2 came in.

Page 78: Genome assembly: then and now — v1.0

                 Type of data   Number of genomes   Size of genomes   Do we know the answer?
Assemblathon 1   Synthetic      1                   Small             ✓
Assemblathon 2   Real           3                   Large             ✗

Page 80: Genome assembly: then and now — v1.0

Melopsittacus undulatus

Boa constrictor constrictor

Maylandia zebra

A budgie, a cichlid fish from Lake Malawi, and a reptile.

Page 81: Genome assembly: then and now — v1.0

Bird

Snake

Fish

Let's simplify the names for the rest of the talk.

Page 82: Genome assembly: then and now — v1.0

Why these three species?

There is no special reason why these species were used. People had a need to sequence the genomes, and some companies were willing to donate sequences.

Page 83: Genome assembly: then and now — v1.0

Why these three species?

Because they were there

There is no special reason why these species were used. People had a need to sequence the genomes, and some companies were willing to donate sequences.

Page 84: Genome assembly: then and now — v1.0

Species   Estimated genome size   Illumina              Roche 454           PacBio
Bird      1.2 Gbp                 285x (14 libraries)   16x (3 libraries)   10x (2 libraries)
Fish      1.0 Gbp                 192x (8 libraries)    -                   -
Snake     1.6 Gbp                 125x (4 libraries)    -                   -

Assemble this!

Lots of sequence data was provided for the bird. Mate pair and read pair libraries were available for all Illumina datasets.
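To get a feel for how much raw sequence these coverage figures represent, here is a rough Python sketch. It assumes that each "x" value is fold coverage relative to the estimated genome size, so that the amount of sequence is roughly coverage multiplied by genome size. The dictionary layout and the `sequence_provided_mbp` helper are illustrative only.

```python
# Estimated genome sizes in Mbp, from the table.
genome_size_mbp = {"bird": 1200, "fish": 1000, "snake": 1600}

# Fold coverage per platform, from the table.
coverage = {
    "bird": {"Illumina": 285, "Roche 454": 16, "PacBio": 10},
    "fish": {"Illumina": 192},
    "snake": {"Illumina": 125},
}


def sequence_provided_mbp(species):
    """Approximate total sequence provided (Mbp): coverage x genome size."""
    size = genome_size_mbp[species]
    return sum(fold * size for fold in coverage[species].values())


for species in genome_size_mbp:
    total_gbp = sequence_provided_mbp(species) / 1000
    print(f"{species}: roughly {total_gbp:.1f} Gbp of sequence")
```

Under that assumption, the bird dataset alone amounts to several hundred Gbp of reads.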

Page 88: Genome assembly: then and now — v1.0

Who took part?

Lots of teams took part. Not just from the big sequencing/genome centers.

Page 90: Genome assembly: then and now — v1.0

Who took part?

21 teams

43 assemblies

52,013,623,777 bp of sequence

Lots of teams took part. Not just from the big sequencing/genome centers.

Page 91: Genome assembly: then and now — v1.0

Species   Competitive entries   Evaluation entries
Bird      12                    3
Fish      10                    6
Snake     12                    0

Entries

There were evaluation entries (not eligible to be declared the winner) allowed in addition to competition entries (only 1 per team).
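As a quick sanity check, the entries in the table can be tallied with a few lines of Python. The numbers come straight from the table; the dictionary layout is just for illustration.

```python
# (competitive entries, evaluation entries) per species, from the table.
entries = {
    "bird": (12, 3),
    "fish": (10, 6),
    "snake": (12, 0),
}

competitive_total = sum(competitive for competitive, _ in entries.values())
evaluation_total = sum(evaluation for _, evaluation in entries.values())
all_assemblies = competitive_total + evaluation_total

print(f"{competitive_total} competitive + {evaluation_total} evaluation "
      f"= {all_assemblies} assemblies")
```

The total agrees with the 43 assemblies mentioned earlier in the talk.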

Page 97: Genome assembly: then and now — v1.0

Goals

✤ Assess 'quality' of assemblies

✤ Define quality!

✤ Produce ranking of assemblies for each species

✤ Produce ranking of assemblers across species?

Page 98: Genome assembly: then and now — v1.0

Who did what?

Person/group                 Jobs
Me, Ian, and Joseph Fass     Perform various analyses of all assemblies
David Schwartz et al.        Produce & evaluate optical maps
Jay Shendure et al.          Produce Fosmid sequences (bird & snake only)
Martin Hunt & Thomas Otto    Perform REAPR analysis
Dent Earl & Benedict Paten   Help with meta-analysis of final rankings

Page 100: Genome assembly: then and now — v1.0

91 co-authors!

flickr.com/photos/jamescridland/613445810

Hard to get agreement on how best to interpret the results. Some analyses and interpretations in the Assemblathon 2 paper end up being compromises.

Page 101: Genome assembly: then and now — v1.0

Results!

Page 102: Genome assembly: then and now — v1.0

Lots of results!

A screen grab of my master spreadsheet that contains all of the numerical results.

Page 103: Genome assembly: then and now — v1.0
Page 104: Genome assembly: then and now — v1.0

102 different metrics!

Page 105: Genome assembly: then and now — v1.0

10 key metrics

We focused on 10 of 102 metrics that we thought were a) useful and b) captured different aspects of an assembly's quality.

Page 106: Genome assembly: then and now — v1.0

Key Metric Description

1 NG50 scaffold length

2 NG50 contig length

3 Amount of assembly in 'gene-sized' scaffolds

4 Number of 'core genes' present

5 Fosmid coverage

6 Fosmid validity

7 Short-range scaffold accuracy

8 Optical map: level 1

9 Optical map: levels 1–3

10 REAPR summary score

The 10 key metrics.

Page 107: Genome assembly: then and now — v1.0

1) Scaffold NG50 lengths

✤ Can calculate NG50 length for each assembly

✤ But also calculate NG60, NG70 etc.

✤ Plot all results as a graph

An N50 (or NG50) value on its own doesn't tell you that much. Ideally you should always be aware of the total assembly size and the distribution of lengths when comparing assemblies. You can do this by not only calculating NG50, but NG1..NG100. NG1 would be the length of scaffold that captures 1% of the estimated genome size (when summing scaffolds from longest to shortest).
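
As a rough sketch of the NG1..NG100 calculation described above: the function name and the toy scaffold lengths below are illustrative assumptions, not taken from the Assemblathon analysis code.

```python
def ngx(lengths, genome_size, x):
    """Return the NG(x) length for an assembly.

    Scaffolds are summed from longest to shortest; NG(x) is the length of
    the scaffold at which the running total first reaches x% of the
    estimated genome size. Returns 0 if the assembly never gets there.
    """
    target = genome_size * x / 100
    running_total = 0
    for length in sorted(lengths, reverse=True):
        running_total += length
        if running_total >= target:
            return length
    return 0


# Hypothetical scaffold lengths (bp) and estimated genome size.
scaffolds = [400, 300, 200, 100, 50]
estimated_genome_size = 1000

# One value per percentage point gives the curve that can be plotted.
ng_curve = [ngx(scaffolds, estimated_genome_size, x) for x in range(1, 101)]
print("NG50 =", ngx(scaffolds, estimated_genome_size, 50))
```

Because the target is a fraction of the estimated genome size rather than of the assembly size, an assembly that is smaller than the genome estimate runs out of sequence before reaching NG100 and the function returns 0 from that point onwards.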

Page 108: Genome assembly: then and now — v1.0

1) Scaffold NG50 lengths

Scaffold length is on a log axis and team identifiers are shown in the legend.

The black dashed line marks the NG50 value, while the point where each series starts on the left shows the length of that assembly's longest scaffold. Also, if the NG100 value is greater than zero (i.e. the series reaches the right-hand edge), then that assembly is at least as big as the known/estimated genome size.

Page 109: Genome assembly: then and now — v1.0

2) Contig vs scaffold NG50

We did the same thing for contig NG50 as well as scaffold NG50. The two measures are sometimes, but not always, correlated. The two highlighted data points show outliers for bird assemblies, reflecting assemblies that are good at making long contigs *or* good at making long scaffolds, but not both.

Page 116: Genome assembly: then and now — v1.0

3) Gene-sized scaffolds

✤ Do assemblers get a little too excited by length?

✤ How long is 'long enough' for a scaffold?

✤ What if you just wanted to find genes?

✤ Average vertebrate gene = ~25 Kbp

It is great to have long scaffolds, but for many questions that you might be interested in (e.g. studying codon usage bias), you only need scaffolds that have a good chance of capturing a full-length gene.

Page 117: Genome assembly: then and now — v1.0

3) Gene-sized scaffolds

The blue line shows the percentage of the estimated genome size that is present in scaffolds of 25 Kbp or longer. Most assemblies, even if they have a much shorter *average* scaffold length, may contain many scaffolds that are long enough to contain a single gene.
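
The 'gene-sized scaffolds' measure is simple to compute. The sketch below assumes only what is stated above (a 25 Kbp threshold, expressed relative to the estimated genome size); the function name and the example lengths are made up for illustration.

```python
def percent_in_gene_sized_scaffolds(lengths, genome_size, min_length=25_000):
    """Percentage of the estimated genome size found in scaffolds >= min_length."""
    long_total = sum(length for length in lengths if length >= min_length)
    return 100 * long_total / genome_size


# Hypothetical assembly: three scaffolds reach the 25 Kbp threshold.
example_lengths = [120_000, 40_000, 25_000, 10_000, 5_000]
print(percent_in_gene_sized_scaffolds(example_lengths, genome_size=250_000))
```

Note that the denominator is the estimated genome size, so an assembly that is larger than the estimate could score above 100%.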

Page 121: Genome assembly: then and now — v1.0

4) Core genes

✤ Used CEGMA tool

✤ CEGMA = set of 458 'Core Eukaryotic Genes' (CEGs)

✤ How many full-length CEGs are in each assembly?

A previously developed tool (CEGMA) was used to see how many 'core genes' (extremely highly conserved genes) are present in each assembly. Note that CEGMA only finds genes where a full-length (or nearly full-length) gene is present within a single scaffold. Many more core genes might be present, but split across scaffolds.

Page 122: Genome assembly: then and now — v1.0

4) Core genes

These results show the number of CEGMA genes that were present in any one assembly as a percentage of all possible CEGMA genes (i.e. those present across all assemblies for each species).

Page 123: Genome assembly: then and now — v1.0

4) Core genes

Core genes (out of 458)

Species   Best individual assembly   Across all assemblies
Bird      420                        442
Fish      436                        455
Snake     438                        454

In the three species, most of the core genes were present across all assemblies, but individual assemblies typically lacked several core genes.
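
To make the percentage measure concrete, here is a small sketch that expresses the best individual assembly as a percentage of the core genes found across all assemblies, using the counts from the table above. The helper name is an assumption for illustration.

```python
# (best individual assembly, found across all assemblies), from the table above.
core_gene_counts = {
    "Bird": (420, 442),
    "Fish": (436, 455),
    "Snake": (438, 454),
}


def percent_of_possible(found_in_assembly, found_across_all):
    """Core genes in one assembly as a percentage of those found in any assembly."""
    return 100 * found_in_assembly / found_across_all


for species, (best, across_all) in core_gene_counts.items():
    print(f"{species}: {percent_of_possible(best, across_all):.1f}%")
```

Even the best assembly for each species is therefore missing a few percent of the core genes that at least one other assembly managed to capture.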

Page 125: Genome assembly: then and now — v1.0

ABYSS  MNTVLTRANSLFAFSLSVMAALTFGCFITTAFKERTVPVSIAVSRVML-------KNVED
BCM    MNTVLTRANSLFAFSLSVMAALTFGCFITTAFKERTVPVSIAVSRVML-------KNVED
CRACS  MNTVLTRANSLFAFSLSVMAALTFGCFITTAFKERTVPVSIAVSRVML-------KNVED
CURT   MNTVLTRANSLFAFSLSVMAALTFGCFITTAFKERTVPVSIAVSRVML-------KNVED
GAM    MNTVLTRANSLFAFSLSVMAALTFGCFITTAFKERTVPVSIAVSRVMLFYEVRKIKNVED
MERAC  MNTVLTRANSLFAFSLSVMAALTFGCFITTAFKERTVPVSIAVSRVML-------KNVED
PHUS   MNTVLTRANSLFAFSLSVMAALTFGCFITTAFKERTVPVSIAVSRVML-------KNVED
RAY    MNTVLTRANSLFAFSLSVMAALTFGCFITTAFKERTVPVSIAVSRVML-------KNVED
SGA    MNTVLTRANSLFAFSLSVMAALTFGCFITTAFKERTVPVSIAVSRVML-------KNVED
SYMB   MNTVLTRANSLFAFSLSVMAALTFGCFITTAFKERTVPVSIAVSRVMLFYEVRKIKNVED
SOAP   MNTVLTRANSLFAFSLSVMAALTFGCFITTAFKERTVPVSIAVSRVML-------KNVED
       ************************************************       *****

ABYSS  FTGPGERSDLGIITFNISANILYYKHSSLFPNIFDWNVKQLFLYLSAEYSTKNN------
BCM    FTGPGERSDLGIITFNISANILYYKHSSLFPNIFDWNVKQLFLYLSAEYSTKNN------
CRACS  FTGPGERSDLGIITFNISANILYYKHSSLFPNIFDWNVKQLFLYLSAEYSTKNN------
CURT   FTGPGERSDLGIITFNISANILYYKHSSLFPNIFDWNVKQLFLYLSAEYSTKNN------
GAM    FTGPGERSDLGIITFNISANILYYKHSSLFPNIFDWNVKQLFLYLSAEYSTKNNLPHTHI
MERAC  FTGPGERSDLGIITFNISANILYYKHSSLFPNIFDWNVKQLFLYLSAEYSTKNN------
PHUS   FTGPGERSDLGIITFNISANILYYKHSSLFPNIFDWNVKQLFLYLSAEYSTKNN------
RAY    FTGPGERSDLGIITFNISANILYYKHSSLFPNIFDWNVKQLFLYLSAEYSTKNN------
SGA    FTGPGERSDLGIITFNISANILYYKHSSLFPNIFDWNVKQLFLYLSAEYSTKNN------
SYMB   FTGPGERSDLGIITFNISANILYYKHSSLFPNIFDWNVKQLFLYLSAEYSTKNN------
SOAP   FTGPGERSDLGIITFNISANILYYKHSSLFPNIFDWNVKQLFLYLSAEYSTKNN------
       ******************************************************

ABYSS  ---ALNQVVLWDKIILRGDDPNLLLKDMKSKYFFFDDGNGLKGNRNVTLTLSWNVVPNAG
BCM    ---ALNQVVLWDKIILRGDDPNLLLKDMKSKYFFFDDGNGLKGNRNVTLTLSWNVVPNAG
CRACS  ---ALNQVVLWDKIILRGDDPNLLLKDMKSKYFFFDDGNGLKGNRNVTLTLSWNVVPNAG
CURT   ---ALNQVVLWDKIILRGDDPNLLLKDMKSKYFFFDDGNGLKGNRNVTLTLSWNVVPNAG
GAM    YGHALNQVVLWDKIILRGDDPNLLLKDMKSKYFFFDDGNGLK------------------
MERAC  ---ALNQVVLWDKIILRGDDPNLLLKDMKSKYFFFDDGNGLKGNRNVTLTLSWNVVPNAG
PHUS   ---ALNQVVLWDKIILRGDDPNLLLKDMKSKYFFFDDGNGLKGNRNVTLTLSWNVVPNAG
RAY    ---ALNQVVLWDKIILRGDDPNLLLKDMKSKYFFFDDGNGLKGNRNVTLTLSWNVVPNAG
SGA    ---ALNQVVLWDKIILRGDDPNLLLKDMKSKYFFFDDGNGLKGNRNVTLTLSWNVVPNAG
SYMB   ---ALNQVVLWDKIILRGDDPNLLLKDMKSKYFFFDDGNGLKGNRNVTLTLSWNVVPNAG
SOAP   ---ALNQVVLWDKIILRGDDPNLLLKDMKSKYFFFDDGNGLKGNRNVTLTLSWNVVPNAG
          ***************************************

ABYSS  ILPLVTGAGHISVPFPDTYKMTKSY
BCM    ILPLVTGAGHISVPFPDTYKMTKSY
CRACS  ILPLVTGAGHISVPFPDTYKMTKSY
CURT   ILPLVTGAGHISVPFPDTYKMTKSY
GAM    -------------------------

4) Core genes

Example of one core gene predicted in bird assemblies. CEGMA gene predictions are available as supplementary material with the paper.

Page 130: Genome assembly: then and now — v1.0

5) Fosmid coverage

✤ Had to first assemble Fosmids

✤ Looked at repeat content & coverage across Fosmids

✤ Aligned assembly scaffolds to Fosmids

✤ Only had Fosmids for bird and snake

Page 131: Genome assembly: then and now — v1.0

5) Fosmid coverage

We looked at coverage of the Fosmids by aligning some of the input reads to them. Occasionally we saw small gaps in coverage. These represent either Fosmid assembly errors or regions of the genome not captured by the input read data. We also aligned scaffolds to the Fosmids and saw that most assemblies contain most of the Fosmid sequence, but repeats complicate the picture.

Page 140: Genome assembly: then and now — v1.0

5) Fosmid coverage

✤ Only used regions of Fosmids that were validated by one or more assemblies

✤ Validated Fosmid Regions (VFRs)

✤ 99% of bird Fosmids

✤ 89% of snake Fosmids

Most of the Fosmid sequences were used as 'Trusted' reference sequences by which to assess the assemblies.

Page 141: Genome assembly: then and now — v1.0

5 & 6) Coverage & Validity

COMPASS tool by Joe Fass

The COMPASS tool compared the Validated Fosmid Regions (VFRs) to the scaffolds to calculate four measures, two of which ('coverage' and 'validity') were used as key metrics.

Page 142: Genome assembly: then and now — v1.0

5 & 6) Coverage & Validity

Some COMPASS results from the bird assemblies. Multiplicity was high when an assembly was large compared to the estimated genome size.

Page 145: Genome assembly: then and now — v1.0

Validated Fosmid Region

100 nt 100 nt

7) Short-range scaffold accuracy

We also used the VFRs in another way. We took pairs of 100 nt 'tag' sequences from either end of consecutive 1000 nt fragments across all VFR sequences.

Page 149: Genome assembly: then and now — v1.0

Validated Fosmid Region

Map pairs of 'tag' sequences to assembly scaffolds

7) Short-range scaffold accuracy

How many map as a pair to one scaffold?

How many map at expected distance apart (900 ± 2 bp)?

The start coordinates of each pair of tag sequences should map 900 nt apart in the assemblies and hopefully both tags map only to the same scaffold. We combined both of these into one 'summary score' metric.
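
The tag-pair idea can be sketched as follows. This is only an illustration under stated assumptions: the function names are hypothetical, the mapping step itself is not shown (each hit is assumed to arrive as a `(scaffold_id, start)` tuple from whatever aligner is used), and strand is ignored.

```python
def tag_pairs(vfr_sequence, fragment_len=1000, tag_len=100):
    """Yield (left_tag, right_tag) from consecutive, non-overlapping fragments."""
    last_start = len(vfr_sequence) - fragment_len
    for start in range(0, last_start + 1, fragment_len):
        fragment = vfr_sequence[start:start + fragment_len]
        yield fragment[:tag_len], fragment[-tag_len:]


def classify_pair(hit_a, hit_b, expected=900, tolerance=2):
    """Classify one mapped tag pair; each hit is (scaffold_id, start) or None."""
    if hit_a is None or hit_b is None:
        return "unmapped"
    if hit_a[0] != hit_b[0]:
        return "different scaffolds"
    distance = abs(hit_b[1] - hit_a[1])
    if abs(distance - expected) <= tolerance:
        return "expected distance"
    return "wrong distance"


# Toy VFR of 2,400 nt gives two complete 1000 nt fragments.
toy_vfr = "ACGT" * 600
pairs = list(tag_pairs(toy_vfr))
print(len(pairs), classify_pair(("scaffold_1", 5000), ("scaffold_1", 5901)))
```

With 100 nt tags at either end of a 1000 nt fragment, the two start coordinates are 900 nt apart, which is where the expected distance comes from.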

Page 150: Genome assembly: then and now — v1.0

7) Short-range scaffold accuracy

Expected distance apart (900 bp)

Species   Shortest   Longest
Bird      702 bp     41,949 bp
Snake     673 bp     46,813 bp

Most pairs of tags mapped to the same scaffold, and at the expected distance apart, but there were a few notable exceptions.

Page 151: Genome assembly: then and now — v1.0

7) Short-range scaffold accuracy

The red line indicates the theoretical maximum summary score that could be achieved.

Page 157: Genome assembly: then and now — v1.0

8 & 9) Optical maps

✤ Stretch out DNA

✤ Cut with restriction enzymes

✤ Note lengths of fragments

✤ Compare to in silico digest of scaffolds

✤ Not all scaffolds suitable for analysis

For optical map analysis, scaffolds had to be a certain minimum length *and* possess enough restriction enzyme sites.
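
An in silico digest of a scaffold can be sketched in a few lines. Everything specific here is an assumption for illustration: the function names, the thresholds, and the default site (GGATCC, the BamHI recognition sequence, cut after the first G), since the enzyme and cut-offs actually used are not given above.

```python
def in_silico_digest(sequence, site="GGATCC", cut_offset=1):
    """Return fragment lengths after cutting at every occurrence of `site`.

    `cut_offset` is the position within the site where the cut falls.
    Only the given strand is searched, which is sufficient for a
    palindromic site such as the default.
    """
    cut_positions = []
    pos = sequence.find(site)
    while pos != -1:
        cut_positions.append(pos + cut_offset)
        pos = sequence.find(site, pos + 1)
    boundaries = [0] + cut_positions + [len(sequence)]
    return [end - start for start, end in zip(boundaries, boundaries[1:])]


def suitable_for_optical_map(sequence, min_length, min_sites, site="GGATCC"):
    """A scaffold must be long enough *and* contain enough restriction sites."""
    n_sites = len(in_silico_digest(sequence, site)) - 1
    return len(sequence) >= min_length and n_sites >= min_sites


# Toy scaffold with two sites.
toy_scaffold = "AAAA" + "GGATCC" + "TTTTT" + "GGATCC" + "CC"
print(in_silico_digest(toy_scaffold))
```

The resulting list of fragment lengths is what gets compared against the ordered fragment lengths measured on the optical map.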

Page 158: Genome assembly: then and now — v1.0

8 & 9) Optical maps

Image from University of Wisconsin-Madison

An example of an optical map. After cutting, each DNA fragment is measured to estimate its length. Optical map results were divided into three categories (levels 1–3).

Page 159: Genome assembly: then and now — v1.0

8 & 9) Optical maps

White bars: total length of scaffolds that were suitable for optical map analysis. Dark blue: global alignments of scaffolds to maps (these are the best quality). Light blue: global alignments with more permissive thresholds. Orange bars: local alignments. We used level 1 (dark blue) as one key metric and levels 1+2+3 as a second key metric. The MLK assembly is good, *relatively* speaking (high percentage of suitable scaffolds are in level 1 category), but we record scores on an absolute basis (MERAC highest for level 1, SOAP highest for levels 1+2+3).

Page 160: Genome assembly: then and now — v1.0

8 & 9) Optical maps

Fish optical map results were much worse than in bird, with very few assemblies having scaffolds with 'level 1' global alignments to the optical map. SGA had the most level 1 coverage, but a much lower amount of sequence that was alignable at any level (1, 2, or 3).

Page 161: Genome assembly: then and now — v1.0

8 & 9) Optical maps

Snake optical map results were intermediate compared to bird and fish.

Page 162: Genome assembly: then and now — v1.0

10) REAPR summary score

REAPR is a tool that aligns input reads to scaffolds and looks for base errors and regions which might represent misassemblies (where scaffolds should ideally be split in two). These two facets are combined into one summary score.

Page 164: Genome assembly: then and now — v1.0

What does this all mean?

Page 165: Genome assembly: then and now — v1.0

102 metrics per assembly

10 key metrics

1 final ranking

Using the 10 key metrics, we combined the results to produce a single score for each assembly by which to rank them.

Page 166: Genome assembly: then and now — v1.0

Assembly Number of core genes Rank Z-score

CRACS 438 1 +0.68

SYMB 436 2 +0.59

PHUS 435 3 +0.54

BCM 434 4 +0.49

SGA 433 5 +0.44

MERAC 430 6 +0.30

ABYSS 429 7 +0.25

SOAP 428 8 +0.21

RAY 422 9 –0.08

GAM 415 10 –0.41

CURT 360 11 –3.02

Although we did take an average rank from the 10 individual rankings, we preferred to use a Z-score approach. For each metric, an assembly's Z-score is the number of standard deviations by which its value lies above or below the average for that metric; the final score is the sum of these Z-scores across metrics. This rewards/penalizes assemblies with very high/low scores in individual metrics. The above results are from the CEGMA metric in snake.
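
The Z-scores in the core-gene table can be reproduced with a few lines of code. The sketch below uses the counts from the table; the function name is an assumption, and the population standard deviation is used because that is what matches the tabulated values.

```python
from statistics import mean, pstdev

# Number of core genes per assembly, from the table above.
core_genes = {
    "CRACS": 438, "SYMB": 436, "PHUS": 435, "BCM": 434, "SGA": 433,
    "MERAC": 430, "ABYSS": 429, "SOAP": 428, "RAY": 422, "GAM": 415,
    "CURT": 360,
}


def z_scores(values_by_assembly):
    """Z-score of each assembly for one metric (population standard deviation)."""
    values = list(values_by_assembly.values())
    mu = mean(values)
    sigma = pstdev(values)
    return {name: (value - mu) / sigma for name, value in values_by_assembly.items()}


for name, z in z_scores(core_genes).items():
    print(f"{name:6s} {z:+.2f}")
```

One low outlier (CURT) pulls the mean down and inflates the standard deviation, which is why the top assemblies are separated by only a fraction of a standard deviation while the bottom one sits three standard deviations below the mean.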

Page 169: Genome assembly: then and now — v1.0

This graph shows the final rankings of bird assemblies based on their sum Z-scores. Assemblies in red are the evaluation entries. The error bars reflect what would be the highest and lowest sum Z-score if we had used any combination of 9 key metrics rather than 10. Note that the highest ranked bird assembly was an evaluation assembly by Baylor College of Medicine; their competitive entry ranked number 2.
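
A minimal sketch of how such error bars could be derived for one assembly, assuming each bar end is simply the sum Z-score with one metric left out. The function name and the ten Z-scores are invented for illustration and are not results from the paper.

```python
def sum_z_with_range(z_by_metric):
    """Return (sum Z-score, lowest, highest) over all leave-one-metric-out sums."""
    total = sum(z_by_metric)
    leave_one_out = [total - z for z in z_by_metric]
    return total, min(leave_one_out), max(leave_one_out)


# Hypothetical Z-scores for one assembly across the 10 key metrics.
example_z = [1.5, -0.5, 0.25, 2.0, 0.0, 1.0, -1.0, 0.5, 0.75, 0.5]
print(sum_z_with_range(example_z))
```

The lowest value comes from dropping the assembly's best metric and the highest from dropping its worst, so a long error bar indicates an assembly whose rank depends heavily on one or two metrics.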

Page 174: Genome assembly: then and now — v1.0

In fish, BCM ranked 1st, though the error bars suggest there is much variability. The lack of Fosmid data means that there are only 7 key metrics rather than 10.

Page 175: Genome assembly: then and now — v1.0

Snake seemed to be the only species where, outwardly at least, one assembler outperformed all others (SGA, in this case). We will return to this issue. Note that there were no evaluation entries for snake.

Page 176: Genome assembly: then and now — v1.0

Another way of looking at all of this data is to plot the Z-scores for each metric as a heat map (red = higher Z-scores).

Page 177: Genome assembly: then and now — v1.0

A parallel coordinates plot is another way of trying to show all of the information at once.

Page 178: Genome assembly: then and now — v1.0

What does this all mean?

Page 179: Genome assembly: then and now — v1.0

No really, what does this all mean?

Still a bit hard to make sense of the overall rankings. What are the main findings from our paper?

Page 180: Genome assembly: then and now — v1.0

Some conclusions

✤ Very hard to find assemblers that performed well across all 10 key metrics

✤ Assemblers that perform well in one species do not always perform as well in another

✤ Bird & snake assemblies appear better than fish

✤ No real 'winner' for bird and fish

Page 181: Genome assembly: then and now — v1.0

SGA — best assembler for snake?

Even if we had happened to use 9 key metrics rather than 10, and even if we threw out the metric where SGA performed the best, it would still probably rank 1st. So is that the end of the story?

Page 183: Genome assembly: then and now — v1.0

Description Rank of snake SGA assembly

NG50 scaffold length 2

NG50 contig length 5

Amount of assembly in 'gene-sized' scaffolds 7

Number of 'core genes' present 5

Fosmid coverage 2

Fosmid validity 2

Short-range scaffold accuracy 3

Optical map: level 1 2

Optical map: levels 1–3 1

REAPR summary score 2

SGA only ranked 1st in one of the ten key metrics and ranked 7th in another. So it is a good assembler *on average*. But if one of these metrics was highly important to you, you may want to use an assembler that ranked higher in that metric.

Page 186: Genome assembly: then and now — v1.0

Best assembler across species?

Assembler Number of 1st places (out of 27)

BCM 5

Meraculous 4

Symbiose 4

Ray 3

Excluding evaluation entries

Not all assemblers were used for all species, but many teams submitted entries for 2 or 3 of the species. In theory, if a team submitted an entry for all species, and if their assembler ranked 1st in all metrics, they could achieve 1st place twenty-seven times (10 + 10 + 7 for fish). So what was the best assembler across species, as judged by total number of 1st places? It is BCM. But Ray comes 4th with three 1st places.

Page 188: Genome assembly: then and now — v1.0

Ray performance

Species  Final ranking
Bird     7
Fish     7
Snake    9

However, Ray ranks much lower when its performance is judged across all key metrics. So some assemblers do very well on specific measures but not so well on others, while other assemblers do moderately well across many metrics (e.g. SGA).
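To see how counting 1st places and averaging across metrics can disagree, here is a small Python sketch. The three assemblers and all of their ranks are made up for illustration; they are not Assemblathon 2 results.

```python
from statistics import mean

# HYPOTHETICAL ranks (1 = best) for three made-up assemblers on five metrics.
ranks = {
    "specialist": [1, 1, 8, 9, 9],  # wins two metrics, poor elsewhere
    "generalist": [3, 3, 2, 2, 3],  # never wins, never does badly
    "middling": [2, 4, 5, 4, 5],
}


def count_first_places(metric_ranks):
    """Number of metrics in which an assembler ranked 1st."""
    return sum(1 for rank in metric_ranks if rank == 1)


# Order the assemblers in two different ways.
by_first_places = sorted(
    ranks, key=lambda name: count_first_places(ranks[name]), reverse=True
)
by_mean_rank = sorted(ranks, key=lambda name: mean(ranks[name]))

print("Best by number of 1st places:", by_first_places[0])
print("Best by mean rank:", by_mean_rank[0])
print("Worst by mean rank:", by_mean_rank[-1])
```

In this toy data the assembler with the most 1st places also has the worst mean rank, which is the same kind of pattern described above for Ray.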

Page 189: Genome assembly: then and now — v1.0

We found it interesting that the best bird assembly was the evaluation entry by Baylor College of Medicine. What is different about this entry compared to their competitive entry?

Page 191: Genome assembly: then and now — v1.0

BCM bird assemblies

Assembler          Final rank  NGS data used in assembly  Coverage Z-score  Validity Z-score  NG50 contig Z-score
BCM - evaluation   1           Illumina + 454             +2.0              +1.4              +1.5
BCM - competitive  2           Illumina + 454 + PacBio    –0.3              –0.8              +2.7

The only difference is that the BCM competitive entry included PacBio data, and somehow this led to the paradoxical situation where including more sequence data produced lower measures for coverage and validity (from the Fosmids), though one key metric (NG50 contig length) did improve.
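As a reminder of what these numbers mean, a Z-score expresses how many standard deviations a value lies from the mean of all entries for that metric. The Python sketch below shows a generic Z-score calculation and then adds up the three Z-scores quoted for each BCM entry. Two assumptions are mine rather than the talk's: the use of the population standard deviation, and the idea of simply summing these three values. The final ranks were based on all ten key metrics, not just these three.

```python
from statistics import mean, pstdev


def z_scores(values):
    """Convert raw metric values to Z-scores (population SD is an assumption)."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]  # all entries identical: no spread
    return [(value - mu) / sigma for value in values]


# Z-scores quoted for the two BCM bird entries.
bcm_bird = {
    "BCM - evaluation": {"coverage": 2.0, "validity": 1.4, "NG50 contig": 1.5},
    "BCM - competitive": {"coverage": -0.3, "validity": -0.8, "NG50 contig": 2.7},
}

# Sum of the three quoted Z-scores per entry (illustrative only).
totals = {
    name: round(sum(scores.values()), 1) for name, scores in bcm_bird.items()
}

for name, total in totals.items():
    print(f"{name}: summed Z-score over these three metrics = {total:+.1f}")
```

Even on these three metrics alone, the gain in NG50 contig length for the PacBio-containing entry does not make up for the drop in Fosmid coverage and validity.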

Page 198: Genome assembly: then and now — v1.0

BCM evaluation scaffold: NNNNNNNNNNNNNNNNNNN

BCM competition scaffold: NNNNNNNNNNNNNNNNNNN

PacBio sequence

BCM used PacBio data to help fill in the gaps in their scaffolds.

Page 200: Genome assembly: then and now — v1.0

BCM evaluation scaffold: NNNNNNNNNNNNNNNNNNN

BCM competition scaffold: CGTCGNNATCNNGGTTACG

Mismatches from the PacBio sequence penalized the alignment score more than unknown bases (Ns) did

Errors in the PacBio sequence were penalized by the choice of alignment program used to align Fosmids to scaffolds.
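The effect can be illustrated with a toy scoring function in Python. This is not lastz's scoring scheme, and the talk does not say which option was involved. The match, mismatch and unknown-base scores and all three sequences below are hypothetical values chosen only to show the principle.

```python
def toy_alignment_score(reference, scaffold, match=1, mismatch=-3, unknown=0):
    """Score an ungapped alignment of two equal-length sequences.

    An 'N' in the scaffold scores `unknown`, a matching base scores `match`,
    and any other difference scores `mismatch`. All values are hypothetical.
    """
    if len(reference) != len(scaffold):
        raise ValueError("sequences must be the same length")
    score = 0
    for ref_base, scaffold_base in zip(reference, scaffold):
        if scaffold_base == "N":
            score += unknown
        elif ref_base == scaffold_base:
            score += match
        else:
            score += mismatch
    return score


fosmid = "ACGTACGTAC"              # hypothetical trusted sequence
gap_of_ns = "NNNNNNNNNN"           # scaffold region left as unknown bases
filled_with_errors = "ACCTAGGTTC"  # same region filled in, 3 of 10 bases wrong

print("Gap left as Ns:    ", toy_alignment_score(fosmid, gap_of_ns))
print("Filled, with errors:", toy_alignment_score(fosmid, filled_with_errors))
print("Filled, error-free: ", toy_alignment_score(fosmid, fosmid))
```

With these particular penalties, a gap filled with error-containing sequence scores lower than the same gap left as Ns, even though most of the filled-in bases are correct. A milder mismatch penalty or a lower error rate would reverse the outcome, which is why a single scoring choice can matter so much.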

Page 201: Genome assembly: then and now — v1.0

The choice of one command-line option, used by one tool in the calculation of one key metric...

...probably made enough difference to drop the PacBio-containing assembly to 2nd place.

This came down to the use of a single command-line option for the lastz alignment program. If we had not chosen this option, the PacBio-containing entry would probably have ranked 1st among all bird assemblies.

Page 202: Genome assembly: then and now — v1.0

Other conclusions

✤ Different metrics tell different stories

✤ Heterozygosity was a big issue for bird & fish assemblies

✤ Final rankings very sensitive to changes in metrics

✤ N50 is a semi-useful predictor of assembly quality

The last point may disappoint some. Despite looking at many different metrics, N50 scaffold length still does a reasonable job of predicting overall quality. However...

Page 203: Genome assembly: then and now — v1.0

...the outliers in this relationship should be noted. The highlighted bird assembly had the second highest scaffold N50 length, but ranked 6th among bird assemblies.
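For anyone who wants to compute these statistics themselves, here is a minimal Python sketch of N50 and NG50. It uses the usual definitions: N50 is the length of the sequence at which the running total, counted from the longest sequence downwards, first reaches half of the total assembly size; NG50 is the same but measured against half of the estimated genome size. The function names and the example lengths are illustrative.

```python
def length_at_half(lengths, target_size):
    """Return the sequence length at which the running total, taken from the
    longest sequence downwards, first reaches half of target_size.

    Returns 0 if the sequences never add up to half of target_size.
    """
    running_total = 0
    for length in sorted(lengths, reverse=True):
        running_total += length
        if running_total * 2 >= target_size:
            return length
    return 0


def n50(lengths):
    """N50: half of the total assembly size is the target."""
    return length_at_half(lengths, sum(lengths))


def ng50(lengths, genome_size):
    """NG50: half of the estimated genome size is the target."""
    return length_at_half(lengths, genome_size)


# Hypothetical scaffold lengths (arbitrary units), total = 290.
scaffold_lengths = [80, 70, 50, 40, 30, 20]

print("N50: ", n50(scaffold_lengths))
print("NG50:", ng50(scaffold_lengths, genome_size=400))
```

Because NG50 uses the same target for every assembly of a given genome, it allows a fairer comparison between assemblies of different total size. Neither statistic says anything about whether the long scaffolds are correct, which is one way an assembly can have a high scaffold N50 and still rank poorly overall.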

Page 207: Genome assembly: then and now — v1.0

Inter-specific differences matter

✤ The three species have genomes with different properties

  ✤ repeats

  ✤ heterozygosity

✤ The three genomes had very different NGS data sets

  ✤ Only bird had PacBio & 454 data

  ✤ Different insert sizes in short-insert libraries

Biological differences may account for differences in assembler performance between species. However, the input data for each species were also very different, and this may play a role as well (some assemblers prefer certain short-insert sizes).

Page 209: Genome assembly: then and now — v1.0

The Big Conclusion

"You can't always get what you want"

Sir Michael Jagger, 1969

Page 212: Genome assembly: then and now — v1.0

What comes next?

3?

There may be an Assemblathon 3. This will be decided at the next Genome 10K workshop (in April, 2013).

Page 220: Genome assembly: then and now — v1.0

A wish list for Assemblathon 3

✤ Only have 1 species

✤ Teams have to 'buy' resources using virtual budgets

✤ Factor in CPU time/cost?

✤ Agree on metrics before evaluating assemblies!

✤ Encourage experimental assemblies

✤ Use new FASTG genome assembly file format

✤ Get someone else to write the paper!

If there is to be an Assemblathon 3, here are some things that we have learned from Assemblathon 2.

Page 221: Genome assembly: then and now — v1.0

~ fin ~