Improving the Accuracy of Genome Assemblies

July 17th 2012

Roy Ronen*,1, Christina Boucher*,1, Hamidreza Chitsaz2 and Pavel Pevzner1

1. University of California, San Diego
2. Wayne State University, Michigan

* Contributed equally to this work

≈ $ billions, ≈ several years, ≈ hundreds of people

≈ $ thousands, ≈ several weeks, ≈ two people

High Throughput Sequencing Assemblies


Draft Genome from HTS

Sample Preparation → Fragments → Sequencing → Reads → Assembly → Contigs → Analysis, Analysis, Analysis

HTS assemblies (contigs) still contain an abundance of errors:

• 20-30 substitution errors per 100 kbp with SOAPdenovo.
• 5-20 substitution errors per 100 kbp with Velvet.
• Small (<50 bp) INDEL errors.
• Misassemblies, large INDELs, etc.

Errors in the assembled contigs will profoundly affect any downstream analysis.

Sample Preparation → Fragments → Sequencing → Reads → Assembly → Contigs → SEQuel → Refined Contigs → Analysis, Analysis, Analysis

De Bruijn Graph for Fragment Assembly

De Bruijn Graph

GCC → CCA, CCA → CAT, CAT → ATT, ATT → TTA

GCC → CCT, CCT → CTT, CTT → TTT, TTT → TTA

CCT → CTA, CTA → TAT, TAT → ATT

(Pevzner, Tang, Waterman 2001)

[Figure: animation in which identical nodes are glued one at a time (first CCA, then CCT, then GCC), merging the separate paths into a single de Bruijn graph.]


De Bruijn Graph
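
The gluing shown on the preceding slides can be sketched in a few lines of Python. This is a minimal illustration, not the construction used by any particular assembler: the three input sequences are inferred from the edge lists on the slides (an assumption), and the function name `de_bruijn` is hypothetical. Nodes are 3-mers and each 4-mer contributes one edge; using the node label as a dictionary key is what glues identical nodes together.

```python
from collections import defaultdict


def de_bruijn(sequences, k):
    """Return {node: set of successor nodes}, where nodes are (k-1)-mers."""
    graph = defaultdict(set)
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            # Each k-mer is an edge from its prefix to its suffix. Identical
            # (k-1)-mers share one dictionary key, which performs the gluing.
            graph[kmer[:-1]].add(kmer[1:])
    return dict(graph)


# Sequences inferred from the slide's edge lists (assumption, not from the source).
sequences = ["GCCATTA", "GCCTTTA", "CCTATT"]
graph = de_bruijn(sequences, k=4)

for node in sorted(graph):
    print(node, "->", sorted(graph[node]))
```

After gluing, GCC and CCT each have two successors, which is where the three paths branch.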


Challenges

GCC CCT CTA TAG AGG GGA GAC

CAC ACT CTT TTG TGG GGC GCA

Genome: ..............GCCTAGGAC.............CACTTGGCA..............

Reads: GCCTAGGAC, GCCTAGGAC, GCCTAGGAC, CACTTGGCA, CACTTGGCA, CACTTGGCA

Sequencing errors cause bulges in the de Bruijn graph

GCC CCT CTA TAG AGG GGA GAC

CAC ACT CTT TTG TGG GGC GCA

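Genome: ..............GCCTAGGAC.............CACTTGGCA..............

Reads: GCCTAGGAC, GCCTAGGAC, GCCTTGGAC, CACTTGGCA, CACTTGGCA, CACTTGGCA

[Figure: the read GCCTTGGAC, which contains a sequencing error, adds edges between the two paths. Edge labels: CCTT, TGGA, CTTGTTGA.]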

[Figure: the same graph, with the nodes grouped as GCC CCT / CTA TAG AGG / GGA GAC and CAC ACT / CTT TTG TGG / GGC GCA, and every edge labelled with its multiplicity in the reads.]

[Figure: the graph after the CTA → TAG → AGG branch has been removed, leaving only the path through CTT → TTG → TGG, with the remaining edge multiplicities.]

Resulting contigs:

......CACTTGGCA............GCCTTGGAC......
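
The edge multiplicities for the six reads on these slides can be recomputed with a short Python sketch. It counts every 4-mer (an edge between two 3-mer nodes) across the reads. The comparison of mean branch multiplicity at the end is a hypothetical selection rule used only to illustrate why a bulge can be resolved in favour of the erroneous branch; it is not claimed to be the rule used by any specific assembler.

```python
from collections import Counter


def edge_multiplicities(reads, k):
    """Count how often each k-mer (graph edge) occurs across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def mean_multiplicity(counts, edges):
    """Average multiplicity over the edges of one branch."""
    return sum(counts[e] for e in edges) / len(edges)


# Reads from the slide: two correct copies, one copy with an A→T error,
# and three copies of the second genomic region.
reads = ["GCCTAGGAC"] * 2 + ["GCCTTGGAC"] + ["CACTTGGCA"] * 3
counts = edge_multiplicities(reads, k=4)

correct_branch = ["CCTA", "CTAG", "TAGG", "AGGA"]
error_branch = ["CCTT", "CTTG", "TTGG", "TGGA"]

print("correct branch:", mean_multiplicity(counts, correct_branch))
print("error branch:  ", mean_multiplicity(counts, error_branch))
```

The erroneous branch borrows the well-covered edges CTTG and TTGG from the second genomic region, so its mean multiplicity exceeds that of the correct branch even though its two entry and exit edges are supported by a single read.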

The SEQuel Algorithm


Permissively aligned read-pair: a read-pair for which at least one read aligned uniquely.
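
As a minimal sketch of this definition, the filter below keeps a read-pair when at least one of its two reads has exactly one alignment location. The `ReadPair` record and its hit-count fields are hypothetical stand-ins for whatever an aligner reports; they are not part of SEQuel's interface.

```python
from dataclasses import dataclass


@dataclass
class ReadPair:
    name: str
    hits_first: int   # number of alignment locations found for read 1 (hypothetical field)
    hits_second: int  # number of alignment locations found for read 2 (hypothetical field)


def is_permissively_aligned(pair):
    """True if at least one read of the pair aligned uniquely."""
    return pair.hits_first == 1 or pair.hits_second == 1


pairs = [
    ReadPair("p1", 1, 1),  # both reads unique
    ReadPair("p2", 1, 5),  # one unique, one multi-mapped
    ReadPair("p3", 0, 1),  # one unaligned, one unique
    ReadPair("p4", 3, 2),  # both multi-mapped
    ReadPair("p5", 0, 0),  # neither aligned
]
kept = [p.name for p in pairs if is_permissively_aligned(p)]
print(kept)
```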


Positional De Bruijn Graph


(GCC,111) → (CCA,112), (CCA,112) → (CAT,113), (CAT,113) → (ATT,114), (ATT,114) → (TTA,115)

(CCT,112) → (CTT,113), (CTT,113) → (TTT,114), (TTT,114) → (TTA,115)

(GCC,975) → (CCT,976), (CCT,976) → (CTA,977), (CTA,977) → (TAT,978), (TAT,978) → (ATT,979)

Positional k-mer: a pair (k-mer, position), e.g. (GCCA, 111).
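
A minimal sketch of how positional k-mers change the graph: each node is keyed by a (3-mer, position) pair, so the same 3-mer seen at two distant positions is not glued. The reads and their alignment start positions are inferred from the node labels on the slide (an assumption), and the sketch glues only on exact position match.

```python
from collections import defaultdict


def positional_de_bruijn(aligned_reads, k):
    """Build {(node, pos): set of (node, pos)} from (read, start position) pairs."""
    graph = defaultdict(set)
    for read, start in aligned_reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            src = (kmer[:-1], start + i)      # prefix (k-1)-mer at its position
            dst = (kmer[1:], start + i + 1)   # suffix (k-1)-mer, one position later
            graph[src].add(dst)
    return dict(graph)


# Reads and start positions inferred from the slide's node labels (assumption).
aligned_reads = [("GCCATTA", 111), ("GCCTTTA", 111), ("GCCTATT", 975)]
graph = positional_de_bruijn(aligned_reads, k=4)

print(sorted(graph[("GCC", 111)]))
print(sorted(graph[("GCC", 975)]))
```

The two occurrences of GCC, and likewise of CCT and ATT, stay separate because their positions differ, so the region near position 975 cannot be tangled with the region near position 111.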

[Figure: nodes with identical (k-mer, position) labels, such as (CCA,112), (CAT,113) and (ATT,114), are glued, while (ATT,114) and (ATT,979) remain separate nodes. Edge multiplicities shown in the final graph: 4, 4, 4, 4.]

partial contig #1: GCCATTA

partial contig #2: GCCTATT
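
The two partial contigs can be reproduced with a short sketch that counts positional edges, discards poorly supported ones and reads off the remaining paths. The read counts (four copies of each well-supported read, one copy of the erroneous read) and the threshold `min_multiplicity=2` are assumptions chosen for illustration. The path walk is deliberately simple and assumes the filtered graph consists of unbranched paths.

```python
from collections import Counter


def positional_edges(aligned_reads, k):
    """Count each positional k-mer (k-mer, position) across aligned reads."""
    counts = Counter()
    for read, start in aligned_reads:
        for i in range(len(read) - k + 1):
            counts[(read[i:i + k], start + i)] += 1
    return counts


def partial_contigs(edge_counts, min_multiplicity):
    """Drop weak edges, then spell out each remaining unbranched path."""
    succ, has_pred = {}, set()
    for (kmer, pos), count in edge_counts.items():
        if count < min_multiplicity:
            continue  # edge supported by too few reads
        src, dst = (kmer[:-1], pos), (kmer[1:], pos + 1)
        succ.setdefault(src, []).append(dst)
        has_pred.add(dst)
    contigs = []
    starts = [n for n in succ if n not in has_pred]
    for start in sorted(starts, key=lambda n: n[1]):
        seq, node = start[0], start
        # Positions strictly increase along a path, so this loop terminates.
        while node in succ and len(succ[node]) == 1:
            node = succ[node][0]
            seq += node[0][-1]
        contigs.append(seq)
    return contigs


# Hypothetical read set: counts are assumptions, not taken from the slides.
aligned_reads = ([("GCCATTA", 111)] * 4
                 + [("GCCTTTA", 111)]
                 + [("GCCTATT", 975)] * 4)
counts = positional_edges(aligned_reads, k=4)
contigs = partial_contigs(counts, min_multiplicity=2)
print(contigs)
```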

The SEQuel Algorithm

Original contig: GTATTCCGAGGACCACTGGATTATGA

Original contig aligned with two partial contigs:

GTATTCCGAGGACCAC---TGGATTATGA
GCGGGCCGAGGA   CAAATGGATTACGA

After substituting the first partial contig:

GCGGGCCGAGGACCAC---TGGATTATGA

After substituting the second partial contig (refined contig):

GCGGGCCGAGGACCACAAATGGATTACGA

Repeat for all contigs.
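
The substitution step for one contig can be sketched as a splice: each partial contig replaces the span of the original contig it aligns to. The spans below (half-open, in original contig coordinates) are read off the alignment on the slides and are assumptions. How SEQuel computes the alignment is not shown here, and `refine_contig` is a hypothetical helper name.

```python
def refine_contig(contig, patches):
    """Replace aligned spans of a contig with partial contigs.

    patches: list of (start, end, replacement) with half-open spans given
    in the coordinates of the original contig; spans must not overlap.
    """
    refined = contig
    # Apply from right to left so that a length change in one patch does not
    # shift the coordinates of the patches still to be applied.
    for start, end, replacement in sorted(patches, reverse=True):
        refined = refined[:start] + replacement + refined[end:]
    return refined


original = "GTATTCCGAGGACCACTGGATTATGA"
patches = [
    (0, 12, "GCGGGCCGAGGA"),     # corrects substitutions at the left end
    (15, 26, "CAAATGGATTACGA"),  # inserts AAA and corrects one substitution
]
refined = refine_contig(original, patches)
print(refined)
```

The slides apply the left partial contig first. The sketch applies the patches in the opposite order only to keep the coordinates valid, and the result is the same.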


Results

• Standard and Single-Cell E. coli.

• 100 bp paired-end, Illumina (GAII) reads.

• Mean coverage ≈ 600x.

• Assemblies compared to reference with and without SEQuel.

[Figure: results for standard E. coli.]

Single Cell Sequencing

[Figure: panels "Standard" and "Single Cell" (Chitsaz et al., 2011).]

[Figure: results for single-cell E. coli.]

Summary


• Removed 35% to 96% of small-scale assembly errors.

• Introduced positional de Bruijn graph for contig refinement.

• Demonstrated utility in hard (single-cell) assembly.

• SEQuel can be used in combination with any assembler.

• Freely available at: http://bix.ucsd.edu/SEQuel

Acknowledgments

3P41RR024851-02S1

CCF-1115206