09.05.2012 CS-681 PRESENTATION 1/25

ALLPATHS-LG

a new standard for assembling a billion-piece genome puzzle

09.05.2012 CS-681 PRESENTATION 2/25

CS 681

presented by

Ömer Köksal

High-quality draft assemblies of mammalian genomes from massively parallel sequence data

ALLPATHS-LG

by Sante Gnerre et al. (20 authors)

Jan 25th, 2011

09.05.2012 CS-681 PRESENTATION 3/25

Agenda

Introduction

Results

Model for Input Data

Sequencing Data

ALLPATHS-LG Assembly Method

Uncertainty in Assemblies

Human and Mouse Assemblies

Human Genome

Mouse Genome

Segmental Duplications

Understanding Gaps

Discussion

09.05.2012 CS-681 PRESENTATION 4/25

Introduction

High-quality assembly of a genome sequence is critical

Particularly challenging for large, repeat rich genomes such as those of mammals

Using traditional capillary-based sequencing (reads of >700 bases), such assemblies have been produced for multiple mammals at a cost of tens of millions of dollars each.

New massively parallel technologies were expected to lower the cost dramatically, but had not yet done so, because their data are:

• short (reads of ~100 bases)

• less accurate

• difficult to assemble

09.05.2012 CS-681 PRESENTATION 5/25

Introduction (cont’d)

ALLPATHS-LG

de novo assembly of large (and small) genomes

It should be possible to generate high-quality draft assemblies of large genomes

at ~1000-fold lower cost than a decade ago

Previous versions:

• ALLPATHS 1.0 (2008)

• ALLPATHS 2.0 (2009)

09.05.2012 CS-681 PRESENTATION 6/25

Results

Model for Input Data

Sequencing Data

ALLPATHS-LG Assembly Method

Uncertainty in Assemblies

Human and Mouse Assemblies

Human Genome

Mouse Genome

Segmental Duplications

Understanding Gaps

09.05.2012 CS-681 PRESENTATION 7/25

Results - Model for Input Data

De novo genome assembly depends on

• computational methods

• nature and quantity of sequence data used

The fairly standard model for capillary-based sequencing was modified

The model sets a target of 100-fold sequence coverage to compensate for shorter reads and nonuniform coverage

Despite the higher coverage, the proposed model is dramatically cheaper, since the per-base cost is ~10,000-fold lower than for capillary sequencing

Illumina sequencing was used (Table 1)
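
The coverage target and cost argument on this slide come down to simple arithmetic, sketched below. Coverage is (number of reads x read length) / genome size. The ~3 Gb genome size and the 8-fold capillary coverage are assumptions made only for illustration; neither value appears on the slides. The function names are hypothetical.

```python
import math


def reads_needed(genome_size_bp: int, read_length_bp: int, coverage: float) -> int:
    """Smallest number of reads N with N * read_length / genome_size >= coverage."""
    return math.ceil(coverage * genome_size_bp / read_length_bp)


def fold_cheaper(per_base_fold: float, coverage_new: float, coverage_old: float) -> float:
    """Overall cost advantage once the extra coverage has been paid for.

    per_base_fold: how many times cheaper one base is with the new technology.
    coverage_new / coverage_old: how many times more bases must be sequenced.
    """
    return per_base_fold * coverage_old / coverage_new


# Assumed values (not from the slides): a ~3 Gb mammalian genome and
# 8-fold coverage for a capillary-based draft.
GENOME_SIZE_BP = 3_000_000_000
CAPILLARY_COVERAGE = 8

# From the slides: ~100-base reads, 100-fold coverage, ~10,000-fold cheaper per base.
n_reads = reads_needed(GENOME_SIZE_BP, read_length_bp=100, coverage=100)
advantage = fold_cheaper(10_000, coverage_new=100, coverage_old=CAPILLARY_COVERAGE)

print(f"reads needed: {n_reads:,}")
print(f"overall cost advantage: ~{advantage:.0f}-fold")
```

Under these assumptions the extra coverage consumes only a small part of the per-base saving, which is in line with the roughly 1000-fold overall reduction quoted in the talk.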

09.05.2012 CS-681 PRESENTATION 8/25

Results - Model for Input Data (cont’d)

Table 1 – Provisional sequencing model for de novo assembly

09.05.2012 CS-681 PRESENTATION 9/25

Results – Sequencing Data

Sequence data were generated with the model above for:

• Human Genome

• Mouse Genome

Human Genome:

• GM12878 (Coriell Institute), from the 1000 Genomes Pilot Project

• NCBI Short Read Archive: Human_NA_12878_Genome_on_illumina

Mouse Genome:

• C57BL/6J female DNA

• NCBI Short Read Archive: Mouse_B6_Genome_on_illumina

09.05.2012 CS-681 PRESENTATION 10/25

Results - ALLPATHS-LG Assembly Method

Previous versions were improved extensively

Can also assemble small genomes

Freely available at:

http://www.broadinstitute.org/science/programs/genome-biology/crd

09.05.2012 CS-681 PRESENTATION 11/25

Results - ALLPATHS-LG Assembly Method (cont’d)

Some key innovations in ALLPATHS-LG:

- Handling repetitive sequences

  - more resilient to repeats

- Error correction

  - for every 24-mer, the algorithm examines the stack of all reads containing that 24-mer

  - the incidence of incorrect error correction was reduced

- Use of jumping data

  - it can work even with such data: it trims bases beyond junction points and treats each read pair as belonging to one of two possible distributions

- Efficient memory usage

  - can assemble human genomes on commercial servers (48 processors and 512 GB RAM) in a few weeks

  - (3 weeks for mouse and 3.5 weeks for human)
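
The idea of a "stack of all reads containing a k-mer" can be illustrated with a toy sketch. The slides do not describe the correction criteria ALLPATHS-LG actually applies, so the per-column majority vote below is a stand-in chosen for illustration, and a tiny k is used instead of 24 to keep the example readable. All function names are hypothetical.

```python
from collections import Counter, defaultdict


def kmer_stacks(reads, k):
    """Map each k-mer to its 'stack': the (read index, offset) pairs where it occurs."""
    stacks = defaultdict(list)
    for i, read in enumerate(reads):
        for pos in range(len(read) - k + 1):
            stacks[read[pos:pos + k]].append((i, pos))
    return stacks


def column_consensus(reads, stack):
    """Align the reads of one stack on their shared k-mer and vote per column.

    Columns are numbered relative to the start of the shared k-mer.
    The majority vote is a simplification, not the published method.
    """
    columns = defaultdict(Counter)
    for i, pos in stack:
        for j, base in enumerate(reads[i]):
            columns[j - pos][base] += 1
    return {col: counts.most_common(1)[0][0] for col, counts in columns.items()}


# Toy input: the third read differs from the other two at one position.
reads = ["ACGTTGCA", "ACGTTGCA", "ACGTAGCA"]
stacks = kmer_stacks(reads, k=4)
consensus = column_consensus(reads, stacks["ACGT"])
corrected = "".join(consensus[c] for c in sorted(consensus))
print(corrected)
```

Examining the whole stack, rather than each read in isolation, lets a rare base stand out against the reads that agree with each other at the same column.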

09.05.2012 CS-681 PRESENTATION 12/25

Results – Uncertainty in Assemblies

The goal of assembly is to reconstruct the genome as accurately as possible

However, in some locations the data may be compatible with more than one solution

Rather than making an arbitrary choice (and introducing errors), the algorithm retains the alternatives

The ALLPATHS-LG algorithm generates an assembly graph whose edges are sequences and whose branches represent alternative choices

ATC{A,T}GGTTTTTTT{T,TT}ACAC

Variant Call Format (.VCF file)
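
To make the brace notation concrete, here is a minimal sketch that expands a sequence written in this style into every plain sequence it stands for. It assumes only what the example above shows: non-nested braces holding comma-separated alternatives. It is not the parser used by ALLPATHS-LG, and `expand_alternatives` is a hypothetical name.

```python
import itertools
import re


def expand_alternatives(seq: str) -> list[str]:
    """Expand 'ATC{A,T}GG' style notation into all concrete sequences."""
    # re.split with a capture group alternates literal text (even indices)
    # with the contents of each {...} group (odd indices).
    parts = re.split(r"\{([^{}]*)\}", seq)
    choices = []
    for i, part in enumerate(parts):
        if i % 2 == 0:
            choices.append([part])
        else:
            choices.append([alt.strip() for alt in part.split(",")])
    return ["".join(combo) for combo in itertools.product(*choices)]


variants = expand_alternatives("ATC{A,T}GGTTTTTTT{T,TT}ACAC")
for v in variants:
    print(v)
```

The example from the slide has two independent choice points with two options each, so it stands for four candidate sequences. The number of combinations grows multiplicatively with each added choice point, which is one reason to keep the compact notation in the assembly output.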

09.05.2012 CS-681 PRESENTATION 13/25

Results – Uncertainty in Assemblies (cont’d)

NOTE:

The current version of ALLPATHS-LG captures only single-base and simple-sequence indel uncertainties

Better ways to capture alternatives are needed (many alternatives are still lost in the current version, giving rise to errors)

It would be desirable to assign probabilities to each alternative

09.05.2012 CS-681 PRESENTATION 14/25

Results – Human & Mouse Assemblies

Resulting genome assemblies provide good coverage of the human and mouse genomes

ALLPATHS-LG assemblies were compared with previously published assemblies

- Capillary-based sequencing

- SOAP (massively parallel sequencing)

09.05.2012 CS-681 PRESENTATION 15/25

Results – Human & Mouse Assemblies (cont’d)

09.05.2012 CS-681 PRESENTATION 16/25

Results – Human Genome

N50 contig length of 24 kb

N50 scaffold length of 11.5 Mb

Contiguity (contig length) is > 4-fold greater than with the SOAP algorithm

Connectivity (scaffold length) is > 25-fold greater than with the SOAP algorithm

Assembled sequence contains 91.1% of the reference genome (SOAP: 74.3%)

Assembled sequence contains 95.1% of the reference genome (SOAP: 81.2%)

Results are similar to capillary-based assemblies
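
The N50 statistic quoted for contigs and scaffolds has a short definition: it is the length L such that sequences of length L or longer together contain at least half of all assembled bases. A minimal sketch of the computation follows. The input lengths are made-up toy numbers, not data from the talk.

```python
def n50(lengths):
    """Return the N50 of a collection of contig or scaffold lengths."""
    total = sum(lengths)
    running = 0
    # Walk from the longest sequence down until half of the bases are covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("n50 is undefined for an empty collection")


# Toy contig lengths: 100 bases in total.
toy_lengths = [10, 20, 30, 40]
print(n50(toy_lengths))
```

Because N50 is weighted by length, many tiny contigs barely move it, which makes it a more informative contiguity measure than the mean contig length.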

09.05.2012 CS-681 PRESENTATION 17/25

Results – Human Genome (cont’d)

Local assembly error: 3.5%

- Capillary: 4.1%

- SOAP: 6.2%

Long-range accuracy: 99.1%

- Capillary: 99.7%

- SOAP: 99.5%

09.05.2012 CS-681 PRESENTATION 18/25

Results – Mouse Genome

Results are broadly similar for the mouse genome

N50 contig length of 16 kb

N50 scaffold length of 7.2 Mb

Connectivity is > 20-fold greater than with the SOAP algorithm

Approaches the capillary results (contig: 25 kb, scaffold: 16.9 Mb)

Assembled sequence contains 88.7% of the reference genome (Capillary: 94.2%)

Assembled sequence contains 96.7% of the reference genome (Capillary: 97.3%)

Results are considerably better than SOAP

09.05.2012 CS-681 PRESENTATION 19/25

Results – Mouse Genome (cont’d)

Local assembly error: 3.0%

- Capillary: 2.7%

- SOAP: 14.2%

Long-range accuracy: 99.0%

- Capillary: 99.1%

- SOAP: 98.8%

09.05.2012 CS-681 PRESENTATION 20/25

Results – Segmental Duplications

Segmental duplications remain a challenge

ALLPATHS-LG assemblies (human and mouse) cover only ~40% of segmental duplications

- Capillary: 60%

- SOAP: 12%

NOTE:

Clearly, additional work is needed here

09.05.2012 CS-681 PRESENTATION 21/25

Results – Understanding Gaps

Roughly three quarters of the gaps are captured

The remaining gaps are not spanned

The majority of the sequence within the gaps consists of repetitive elements (61.9% of gaps):

- For mouse: LINE elements are the major contributors to gaps

- For human: LINE and SINE elements

09.05.2012 CS-681 PRESENTATION 22/25

Discussion

High-quality vertebrate genomes have provided an essential foundation for comparative analysis of the human genome

Each cost tens of millions of dollars to generate with capillary-based sequencing

In this work, ALLPATHS-LG was presented, lowering the cost ~1000-fold

09.05.2012 CS-681 PRESENTATION 23/25

Discussion (cont’d)

ALLPATHS-LG

- Good long-range connectivity,

- Good accuracy,

- Good coverage

relative to capillary-based sequencing, and better than SOAP

ALLPATHS-LG

- Quality of the assemblies is considerably better than with SOAP:

scaffolds are > 25 times longer

09.05.2012 CS-681 PRESENTATION 24/25

Discussion (cont’d)

ALLPATHS-LG is anticipated to yield even better results in improved versions.

ALLPATHS-LG introduced a preliminary syntax for expressing alternatives: TTTT{T,TT}

Computational requirements:

- SOAP is faster (takes 3 days) but its accuracy is low

- ALLPATHS-LG is slower but produces high-quality assemblies

ALLPATHS-LG is expected to be sped up through algorithmic improvements (keeping in mind the trade-off between speed and accuracy)

09.05.2012 CS-681 PRESENTATION 25/25

Thank you.

Questions?