JAMES LINDSAY*, HAMED SALOOTI, ALEX ZELIKOVSKI, ION MANDOIU* ACM-BCB 2012 Scaffolding Large Genomes...


JAMES LINDSAY*, HAMED SALOOTI, ALEX ZELIKOVSKI, ION MANDOIU*

ACM-BCB 2012

Scaffolding Large Genomes Using Integer Linear Programming

University of Connecticut* Georgia State University

De-novo Assembly Paradigm

the genome → shotgun sequencing → short reads → de novo assembly → short contigs → scaffolding → the scaffolds

Why Scaffolding?

• Annotation
• Comparative biology
• Re-sequencing and gap filling
• Structural variation!

[Figure: gene XYZ with its 5' UTR and 3' UTR, shown with a scaffold and with no scaffold]

[Figure: Sanger sequencing of gene XYZ with its 3' UTR and 5' UTR]

Biologist: There are holes in my genes!

Massive Sequencing Projects

• I5k: 5,000 insect and arthropod species
• G10k: 10,000 vertebrate species

Effects of Read Length

Genome    Coverage   Technology   N50
Dog       7.5x       Sanger       180 Kb
Chicken   6x         Illumina     12 Kb
Human     100x       Illumina     24 Kb

Fragmented Genomes

The Scaffolding Problem

GIVEN
• Contigs, paired reads

FIND
• Orientation, ordering, relative distance

GOAL
• Recreate true scaffolds

Paired Read Construction

Paired Read Styles

• Mate Pair
• Paired End

[Figure: paired reads R1 and R2, drawn once with the same strand and orientation and once with different strand and orientation; size labels 100b, 100b, 2kb, 10kb]

Linkage Information

Two contigs are adjacent if:
• A read pair spans the contigs

Possible States (mate pair)
• State (A, B, C, D) depends on orientation of the read
• Order of contigs is arbitrary
• Each read pair can be “consistent” with one of the four states

[Figure: contig i and contig j drawn 5' to 3', with read pair R1, R2 shown in each of the states A, B, C, D]

Scaffolding Graph

Nodes
• Nodes are contigs

Edges
• Adjacent contigs have 4 edges (one for each state)
• Weighted by overlap with repetitive region

[Figure: contig i and contig j, State A]

$W_{ij}^{A} = \sum_{\text{read pairs}} \left( 1 - \frac{\text{bp in repeat region}}{\text{bp in read}} \right)$
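A minimal sketch of the edge weight above, assuming each supporting read pair has already been summarized as a (bp in repeat region, bp in read) tuple. The function name `edge_weight` and the input layout are illustrative assumptions, not part of the original slides.

```python
def edge_weight(read_pairs):
    """Sum of (1 - repeat fraction) over the read pairs supporting one state.

    read_pairs: iterable of (bp_in_repeat, bp_in_read) tuples (assumed layout).
    """
    return sum(1.0 - bp_repeat / bp_read for bp_repeat, bp_read in read_pairs)


# Three hypothetical pairs supporting state A: repeat-free, 25% repeat, all repeat.
pairs_state_a = [(0, 200), (50, 200), (200, 200)]
print(edge_weight(pairs_state_a))
```

A pair that lies entirely in a repeat contributes nothing, so repeat-induced links cannot dominate an edge.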

Integer Linear Program Formulation

Variables
• Contig orientation: $S_i \in \{0,1\}$
• Adjacent contig consistency: $S_{ij} \in \{0,1\}$
• Contig pair state: $A_{ij}, B_{ij}, C_{ij}, D_{ij} \in \{0,1\}$

Objective
• Maximize weight of consistent pairs

$z = \max \sum_{(i,j) \in E} \left( W_{ij}^{A} A_{ij} + W_{ij}^{B} B_{ij} + W_{ij}^{C} C_{ij} + W_{ij}^{D} D_{ij} \right)$

Constraints

Pairwise Orientation

$S_{ij} \le S_j + S_i$

$S_{ij} \le 2 - S_i - S_j$

$S_{ij} \ge S_j - S_i$

$S_{ij} \ge S_i - S_j$
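The four pairwise orientation constraints can be checked by enumeration. This small sketch (the helper name `feasible_sij` is an assumption) lists which values of $S_{ij}$ remain feasible for each orientation pair; the only survivor is the exclusive-or of $S_i$ and $S_j$, so $S_{ij} = 1$ exactly when the two contigs have different orientations.

```python
from itertools import product


def feasible_sij(s_i, s_j):
    """Values of S_ij in {0, 1} that satisfy all four pairwise constraints."""
    return [
        s_ij
        for s_ij in (0, 1)
        if s_ij <= s_j + s_i
        and s_ij <= 2 - s_i - s_j
        and s_ij >= s_j - s_i
        and s_ij >= s_i - s_j
    ]


for s_i, s_j in product((0, 1), repeat=2):
    print(s_i, s_j, feasible_sij(s_i, s_j))
```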

State Variables

$2A_{ij} \le (1 - S_i) + (1 - S_j)$

$2B_{ij} \le (1 - S_i) + S_j$

$2C_{ij} \le S_i + (1 - S_j)$

$2D_{ij} \le S_i + S_j$

Mutual Exclusivity

$A_{ij} + D_{ij} \le 1 - S_{ij}$

$B_{ij} + C_{ij} \le S_{ij}$

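Constraints

Forbid 2 Cycles

$B_{ij} + C_{ij} \le S_{ij}$

$A_{ij} + D_{ij} \le 1 - S_{ij}$

Forbid 3 Cycles

[Equations garbled in source; only the constant 2 survives from each constraint]

*larger cycles are broken at the end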
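To make the formulation concrete, here is a brute-force sketch for a tiny instance. It is not the ILP solver used in the talk: it enumerates every orientation vector and, for each contig pair, credits the single state that the state-variable constraints allow (A for $S_i = S_j = 0$, B for $(0,1)$, C for $(1,0)$, D for $(1,1)$). The cycle constraints are deliberately left out, and all names and weights are hypothetical.

```python
from itertools import product

# State permitted by the state-variable constraints for each (S_i, S_j).
STATE_OF = {(0, 0): "A", (0, 1): "B", (1, 0): "C", (1, 1): "D"}


def solve_by_enumeration(n_contigs, weights):
    """weights: {(i, j): {"A": w, "B": w, "C": w, "D": w}} with i < j.

    Returns (best objective value, best orientation tuple).
    Exponential in n_contigs; for illustration only.
    """
    best_z, best_s = -1.0, None
    for s in product((0, 1), repeat=n_contigs):
        z = sum(w[STATE_OF[(s[i], s[j])]] for (i, j), w in weights.items())
        if z > best_z:
            best_z, best_s = z, s
    return best_z, best_s


# Hypothetical weights for three contigs.
weights = {
    (0, 1): {"A": 5, "B": 1, "C": 0, "D": 0},
    (1, 2): {"A": 0, "B": 4, "C": 0, "D": 2},
    (0, 2): {"A": 0, "B": 3, "C": 0, "D": 0},
}
print(solve_by_enumeration(3, weights))
```

Because the weights are non-negative, setting the one permitted state variable to 1 on every pair is optimal for a fixed orientation, which is why the inner sum needs no further search.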

Graph Decomposition: Articulation Points

[Figure labels: Largest Connected Component; Articulation point; Largest Biconnected Component; solve]

MIP, Salmela 2011
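Articulation points can be found with one depth-first search using discovery times and low-link values. The following is a generic textbook sketch on a plain adjacency dictionary, not the implementation used in SILP or MIP; the function name and the example graph are assumptions.

```python
def articulation_points(adj):
    """adj: {node: set(neighbors)} for a simple undirected graph."""
    disc, low, result = {}, {}, set()
    counter = [0]

    def dfs(u, parent):
        disc[u] = low[u] = counter[0]
        counter[0] += 1
        children = 0
        for v in adj[u]:
            if v == parent:
                continue
            if v in disc:
                # Back edge: u can reach an earlier-discovered node.
                low[u] = min(low[u], disc[v])
            else:
                children += 1
                dfs(v, u)
                low[u] = min(low[u], low[v])
                # Non-root u separates v's subtree if it cannot climb above u.
                if parent is not None and low[v] >= disc[u]:
                    result.add(u)
        # The DFS root is an articulation point iff it has several DFS children.
        if parent is None and children > 1:
            result.add(u)

    for node in adj:
        if node not in disc:
            dfs(node, None)
    return result


# Two triangles joined by the bridge 2-3 (hypothetical contig graph).
graph = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
print(articulation_points(graph))
```

The recursion is fine for small examples; a scaffolding graph with many thousands of contigs would need an iterative DFS or a raised recursion limit.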

Non-Serial Dynamic Programming

A technique which exploits the sparsity of the scaffolding graph by computing the solution in stages, incorporating the results from previous stages

~inspired by (Neumaier, 06)

Non-Serial Dynamic Programming

[Figure: a component attached through a 2-cut is solved for each of the four orientation combinations of the two cut nodes (+ +, + -, - +, - -), giving $z_A$, $z_B$, $z_C$, $z_D$]

Objective Modification: $z_A$, $z_B$, $z_C$, $z_D$
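A sketch of the 2-cut elimination step, under two assumptions that the transcript does not spell out: that $z_A, z_B, z_C, z_D$ are the best component objectives for cut-node orientations (0,0), (0,1), (1,0), (1,1), and that the objective modification adds them as the weights of a virtual edge between the cut nodes. Brute force stands in for the ILP solver, and all names and weights are hypothetical.

```python
from itertools import product

STATE_OF = {(0, 0): "A", (0, 1): "B", (1, 0): "C", (1, 1): "D"}


def objective(s, weights):
    """s: {node: 0 or 1}; weights: {(i, j): {"A": w, "B": w, "C": w, "D": w}}."""
    return sum(w[STATE_OF[(s[i], s[j])]] for (i, j), w in weights.items())


def solve(nodes, weights):
    """Best objective over all orientations of `nodes` (exhaustive)."""
    return max(
        objective(dict(zip(nodes, s)), weights)
        for s in product((0, 1), repeat=len(nodes))
    )


def reduce_component(cut, inner, weights):
    """Best component objective for each fixed orientation of the 2-cut."""
    u, v = cut
    z = {}
    for s_u, s_v in product((0, 1), repeat=2):
        z[STATE_OF[(s_u, s_v)]] = max(
            objective({u: s_u, v: s_v, **dict(zip(inner, s))}, weights)
            for s in product((0, 1), repeat=len(inner))
        )
    return z


def w(a=0, b=0, c=0, d=0):
    return {"A": a, "B": b, "C": c, "D": d}


# Component {0, 1, 2, 3} hangs off the rest of the graph through the 2-cut (0, 3).
component = {(0, 1): w(a=3, b=1), (1, 2): w(b=2, d=2), (2, 3): w(a=1, c=4)}
rest = {(0, 4): w(d=5), (3, 4): w(a=1, b=2)}

z = reduce_component((0, 3), [1, 2], component)
full = solve([0, 1, 2, 3, 4], {**component, **rest})
reduced = solve([0, 3, 4], {(0, 3): z, **rest})
print(z, full, reduced)
```

The full problem and the reduced problem reach the same optimum because the component's contribution depends on the rest of the graph only through the orientations of the two cut nodes.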

SPQR-tree Based Implementation

• SPQR-tree efficiently finds 2 cuts (Tarjan, 73)

• DFS of SPQR-tree is used to schedule elimination order for NSDP

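Post Processing ILP Solution

ILP solution
• May have cycles
• Not a total ordering

For each connected component: bipartite matching between outgoing and incoming copies of the contigs

[Figure: ILP solution on contigs A to F, and the corresponding bipartite graph with outgoing copies A to F and incoming copies A to F]

Bipartite matching objectives:
• Max weight
• Max cardinality
• Max cardinality / Max weight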
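A sketch of the matching step with the max-weight objective. Each directed link in the ILP solution becomes an edge from the outgoing copy of one contig to the incoming copy of another, and a matching keeps at most one successor and one predecessor per contig. The exhaustive search below is only meant for small examples; the function name and the weights are assumptions, and a production implementation would use a polynomial-time matching algorithm.

```python
def max_weight_matching(edges):
    """edges: {(out_contig, in_contig): weight}.

    Returns (total weight, list of kept (out, in) links). Exhaustive search.
    """
    outs = sorted({o for o, _ in edges})

    def search(k, used_in):
        if k == len(outs):
            return 0, []
        # Option 1: leave outs[k] without a successor.
        best_w, best_m = search(k + 1, used_in)
        # Option 2: match outs[k] to any still-free incoming copy.
        for (o, i), weight in edges.items():
            if o == outs[k] and i not in used_in:
                sub_w, sub_m = search(k + 1, used_in | {i})
                if weight + sub_w > best_w:
                    best_w, best_m = weight + sub_w, [(o, i)] + sub_m
        return best_w, best_m

    return search(0, frozenset())


# Hypothetical ILP links among contigs A to D.
links = {("A", "B"): 5, ("A", "C"): 2, ("B", "C"): 4, ("B", "D"): 1, ("C", "D"): 3}
print(max_weight_matching(links))
```

A matching can still close a cycle, which is consistent with the note that larger cycles are broken at the end.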

GAGE Framework

Genome                     Size (Mb)   # reads
Staphylococcus aureus      2.9         3,494,070
Rhodobacter sphaeroides    4.6         2,050,868
Human Chr14                107         22,669,408

Assembled using: ABySS, Allpaths-LG, Bambus2, CABOG, MSR-CA, SGA, SOAPdenovo, Velvet

Scaffolded using: SILP (our method), Opera, MIP, Bambus2

Testing Metrics

TPN50
• Break scaffold at incorrect edges, then find N50
• N50: the contig size such that contigs of at least this size contain 50% of the total bases
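A sketch of TPN50 under stated assumptions: N50 follows the usual convention (the largest length L such that pieces of length at least L cover at least half of the total bases), each scaffold is given as its contig lengths plus one correct/incorrect flag per adjacency, and gap sizes are ignored. The function names and the input layout are illustrative only.

```python
def n50(lengths):
    """Largest length L such that pieces of length >= L cover half the bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0


def tp_n50(scaffolds):
    """scaffolds: list of (contig_lengths, edge_is_correct) pairs, where
    edge_is_correct has one boolean per adjacency (len(contig_lengths) - 1)."""
    pieces = []
    for lengths, correct in scaffolds:
        current = lengths[0]
        for length, ok in zip(lengths[1:], correct):
            if ok:
                current += length       # adjacency is right: extend the piece
            else:
                pieces.append(current)  # adjacency is wrong: break here
                current = length
        pieces.append(current)
    return n50(pieces)


# Hypothetical scaffolds: the second adjacency of the first scaffold is wrong.
print(tp_n50([([100, 200, 300], [True, False]), ([400], [])]))
```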

Binary Classification
• Given n contigs in a scaffold, how many of the n-1 adjacencies can you predict?
• PPV
• Sensitivity
• MCC
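The three scores follow from the usual confusion-matrix counts. This sketch uses the standard definitions; how true negatives are counted for contig adjacencies is not stated in the slides, so the counts in the example are purely hypothetical.

```python
from math import sqrt


def classification_metrics(tp, fp, tn, fn):
    """Return (PPV, sensitivity, MCC) from confusion-matrix counts."""
    ppv = tp / (tp + fp) if tp + fp else 0.0
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return ppv, sensitivity, mcc


# Hypothetical counts of predicted versus true adjacencies.
ppv, sensitivity, mcc = classification_metrics(tp=8, fp=2, tn=85, fn=5)
print(ppv, sensitivity, mcc)
```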

Results

[Figure: Scaffolding TPN50. Bar chart of TPN50 (bp, axis 0 to 450,000) by genome (staph, rhodo, chr14) for silp, opera, mip, bambus2]

Results

[Figure: PPV. Bar chart of PPV (axis 0% to 120%) by genome (staph, rhodo, chr14) for silp, opera, mip, bambus2]

Results

[Figure: Sensitivity. Bar chart of sensitivity (axis 0% to 80%) by genome (staph, rhodo, chr14) for silp, opera, mip, bambus2]

Results

[Figure: Matthews Correlation Coefficient. Bar chart of MCC (axis 0% to 90%) by genome (staph, rhodo, chr14) for silp, opera, mip, bambus2]

Conclusions

Success
• ILP solves scaffolding problem!
• NSDP works

Improvements
• Include SOAPdenovo, Allpaths-LG scaffolds in comparison
• Look at parameter effects
• Practical considerations (read style, multi-libraries, merge contigs)

Future Work
• Where else can I apply NSDP?
• Scaffold before assembly … promising
• Structural Variation??

Questions?