JAMES LINDSAY*, HAMED SALOOTI, ALEX ZELIKOVSKI, ION MANDOIU* ACM-BCB 2012 Scaffolding Large Genomes...

JAMES LINDSAY* , HAMED SALOOTI, ALEX ZELIKOVSKI, ION MANDOIU* ACM-BCB 2012 Scaffolding Large Genomes Using Integer Linear Programming University of Connecticut* Georgia State University


Page 1

Scaffolding Large Genomes Using Integer Linear Programming

JAMES LINDSAY*, HAMED SALOOTI, ALEX ZELIKOVSKI, ION MANDOIU*

University of Connecticut*, Georgia State University

ACM-BCB 2012

Page 2

De-novo Assembly Paradigm

[Diagram: the genome → (shotgun sequencing) → short reads → (de novo assembly) → short contigs → (scaffolding) → the scaffolds]

Page 3

Why Scaffolding?

• Annotation
• Comparative biology
• Re-sequencing and gap filling
• Structural variation!

[Figure: gene XYZ with its 5’ UTR and 3’ UTR, drawn once with a scaffold and once with no scaffold]

Page 4

Why Scaffolding?

• Annotation
• Comparative biology
• Re-sequencing and gap filling
• Structural variation!

[Figure: Sanger sequencing of gene XYZ with its 5’ UTR and 3’ UTR]

Biologist: There are holes in my genes!

Page 5

Why Scaffolding?

• Annotation
• Comparative biology
• Re-sequencing and gap filling
• Structural variation!

Page 6

Massive Sequencing Projects

• I5k: 5,000 insect and arthropod species
• G10k: 10,000 vertebrate species

Effects of Read Length

• Dog Genome: 7.5x Sanger, N50: 180 Kb
• Chicken Genome: 6x Illumina, N50: 12 Kb
• Human Genome: 100x Illumina, N50: 24 Kb

Fragmented Genomes

Page 7

The Scaffolding Problem

GIVEN
• Contigs, paired reads

FIND
• Orientation, ordering, relative distance

GOAL
• Recreate true scaffolds

Page 8

Paired Reads

Paired Read Construction

[Figure: construction of a read pair; size labels: 2kb, 2kb, 100b, 100b, 10kb]

Paired Read Styles

• Mate Pair
• Paired End

[Figure: reads R1 and R2 drawn for each style, one labeled "same strand and orientation" and the other "different strand and orientation"]

Page 9

Linkage Information

Two contigs are adjacent if:
• A read pair spans the contigs

State (A, B, C, D)
• Depends on orientation of the read
• Order of contigs is arbitrary

Each read pair can be “consistent” with one of the four states

[Figure: Possible States (mate pair); contig i and contig j drawn 5’ to 3’, with reads R1 and R2 placed for each of the states A, B, C, D]

Page 10

Scaffolding Graph

Nodes
• Nodes are contigs

Edges
• Adjacent contigs have 4 edges (one for each state)
• Weighted by overlap with repetitive region

[Figure: contig i and contig j joined by the State A edge]

$$W_{ij}^{A} = \sum_{\text{read pairs}} \left( 1 - \frac{\text{bp in repeat region}}{\text{bp in read}} \right)$$
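The edge weight on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the record fields (`i`, `j`, `state`, `repeat_bp`, `read_bp`) are hypothetical names, and the sketch assumes the base-pair counts are taken over both reads of a pair.

```python
from collections import defaultdict

def edge_weights(read_pairs):
    """Sum, per (contig i, contig j, state), one minus the repeat fraction of each read pair."""
    weights = defaultdict(float)
    for rp in read_pairs:
        # Fraction of the read pair's bases that fall in a repeat region (assumed definition).
        repeat_fraction = rp["repeat_bp"] / rp["read_bp"]
        weights[(rp["i"], rp["j"], rp["state"])] += 1.0 - repeat_fraction
    return dict(weights)

# Hypothetical read pairs linking contig 0 and contig 1.
pairs = [
    {"i": 0, "j": 1, "state": "A", "repeat_bp": 0, "read_bp": 100},
    {"i": 0, "j": 1, "state": "A", "repeat_bp": 50, "read_bp": 100},
    {"i": 0, "j": 1, "state": "B", "repeat_bp": 100, "read_bp": 100},
]
w = edge_weights(pairs)
print(w)
```

A read pair that lies entirely in a repeat contributes nothing, so a state supported only by repetitive reads gets weight 0 and has no pull on the objective.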

Page 11

Integer Linear Program Formulation

Variables
• Contig orientation: $S_i \in \{0,1\}$
• Adjacent contig consistency: $S_{ij} \in \{0,1\}$
• Contig pair state: $A_{ij}, B_{ij}, C_{ij}, D_{ij} \in \{0,1\}$

Objective
Maximize weight of consistent pairs

$$z = \max \sum_{(i,j) \in E} \left( W_{ij}^{A} A_{ij} + W_{ij}^{B} B_{ij} + W_{ij}^{C} C_{ij} + W_{ij}^{D} D_{ij} \right)$$

Page 12

Constraints

Pairwise Orientation

$$S_{ij} \le S_i + S_j$$
$$S_{ij} \le 2 - S_i - S_j$$
$$S_{ij} \ge S_j - S_i$$
$$S_{ij} \ge S_i - S_j$$
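Taken together, the four pairwise orientation inequalities leave exactly one feasible value of S_ij for each pair of orientations, namely the exclusive-or of S_i and S_j. The short Python check below enumerates the truth table to show this. The function name is illustrative only.

```python
from itertools import product

def feasible_sij(s_i, s_j):
    """Return the values of S_ij in {0, 1} that satisfy all four inequalities."""
    return [
        s_ij for s_ij in (0, 1)
        if s_ij <= s_i + s_j
        and s_ij <= 2 - s_i - s_j
        and s_ij >= s_j - s_i
        and s_ij >= s_i - s_j
    ]

# Enumerate every orientation pair and list the S_ij values that remain feasible.
table = {(s_i, s_j): feasible_sij(s_i, s_j) for s_i, s_j in product((0, 1), repeat=2)}
for (s_i, s_j), allowed in table.items():
    print(s_i, s_j, allowed)
```

So S_ij is 1 exactly when the two contigs have different orientations, which is what lets the later constraints tie the states A and D to equal orientations and B and C to opposite ones.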

Page 13

Constraints

State Variables

$$2A_{ij} \le (1 - S_i) + (1 - S_j)$$
$$2B_{ij} \le (1 - S_i) + S_j$$
$$2C_{ij} \le S_i + (1 - S_j)$$
$$2D_{ij} \le S_i + S_j$$

Page 14

Constraints

Mutual Exclusivity

$$A_{ij} + D_{ij} \le 1 - S_{ij}$$
$$B_{ij} + C_{ij} \le S_{ij}$$
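To see how the pairwise orientation, state variable, and mutual exclusivity constraints interact with the objective, the following Python sketch solves a tiny instance by exhaustive enumeration instead of an ILP solver. It is an illustration under stated assumptions: the instance and its weights are invented, the cycle constraints are left out, and enumeration only scales to a handful of contigs.

```python
from itertools import product

def edge_feasible(s_i, s_j, s_ij, a, b, c, d):
    """Check the pairwise-orientation, state-variable, and mutual-exclusivity constraints."""
    return (
        s_ij <= s_i + s_j and s_ij <= 2 - s_i - s_j
        and s_ij >= s_j - s_i and s_ij >= s_i - s_j
        and 2 * a <= (1 - s_i) + (1 - s_j)
        and 2 * b <= (1 - s_i) + s_j
        and 2 * c <= s_i + (1 - s_j)
        and 2 * d <= s_i + s_j
        and a + d <= 1 - s_ij
        and b + c <= s_ij
    )

def best_edge_value(s_i, s_j, w):
    """Best objective contribution of one edge for fixed orientations, by enumeration."""
    best = 0.0
    for s_ij, a, b, c, d in product((0, 1), repeat=5):
        if edge_feasible(s_i, s_j, s_ij, a, b, c, d):
            value = w["A"] * a + w["B"] * b + w["C"] * c + w["D"] * d
            best = max(best, value)
    return best

def solve_tiny(n, weights):
    """Enumerate all orientation vectors S and return (best objective, best S)."""
    best = (-1.0, None)
    for S in product((0, 1), repeat=n):
        z = sum(best_edge_value(S[i], S[j], w) for (i, j), w in weights.items())
        if z > best[0]:
            best = (z, S)
    return best

# Hypothetical weights W_ij^A..W_ij^D for three contigs.
weights = {
    (0, 1): {"A": 3.0, "B": 1.0, "C": 0.0, "D": 0.0},
    (1, 2): {"A": 0.0, "B": 4.0, "C": 0.0, "D": 0.0},
    (0, 2): {"A": 2.0, "B": 2.5, "C": 0.0, "D": 0.0},
}
z, S = solve_tiny(3, weights)
print(z, S)
```

Once the orientations are fixed, the constraints allow at most one state per edge, so the objective separates over edges. That is why each edge can be maximized on its own inside the loop.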

Page 15

Constraints

Forbid 2 Cycles

$$B_{ij} + C_{ij} \le S_{ij}$$
$$A_{ij} + D_{ij} \le 1 - S_{ij}$$

Forbid 3 Cycles

[Eight 3-cycle constraints; only the constant 2 in each survives in the transcript]

*larger cycles are broken at the end

Page 16

Largest Connected Component

Page 17

Graph Decomposition: Articulation Points

[Figure: the scaffolding graph split at an articulation point, with the resulting parts labeled "solve"]

MIP, Salmela 2011
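Articulation points can be found with one depth-first search. The Python sketch below uses the standard low-link method on a small adjacency-list graph. It is a generic textbook routine, not the implementation used by SILP or MIP, and the example graph is invented.

```python
def articulation_points(adj):
    """Return the set of articulation points of an undirected graph {node: [neighbors]}."""
    disc, low, points = {}, {}, set()
    counter = [0]

    def dfs(u, parent):
        disc[u] = low[u] = counter[0]
        counter[0] += 1
        children = 0
        for v in adj[u]:
            if v not in disc:
                children += 1
                dfs(v, u)
                low[u] = min(low[u], low[v])
                # A non-root u separates v's subtree if that subtree cannot reach above u.
                if parent is not None and low[v] >= disc[u]:
                    points.add(u)
            elif v != parent:
                low[u] = min(low[u], disc[v])
        # The DFS root is an articulation point only if it has two or more DFS children.
        if parent is None and children > 1:
            points.add(u)

    for node in adj:
        if node not in disc:
            dfs(node, None)
    return points

# Two triangles of contigs sharing contig 2: removing 2 disconnects them.
graph = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3, 4], 3: [2, 4], 4: [2, 3]}
cut_nodes = articulation_points(graph)
print(cut_nodes)
```

The recursion is fine for small examples. A graph with many thousands of contigs would need an iterative DFS or a graph library to avoid Python's recursion limit.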

Page 18

Largest Biconnected Component

Page 19

Non-Serial Dynamic Programming

A technique which exploits the sparsity of the scaffolding graph by computing the solution in stages, incorporating the results from previous stages

~inspired by (Neumaier, 06)

Page 20

Non-Serial Dynamic Programming

[Figure: a component attached to the rest of the graph through a 2-cut is solved once for each orientation combination of the two cut nodes, (+,+), (+,-), (-,+), (-,-), giving the values $z_A$, $z_B$, $z_C$, $z_D$]

Page 21

Non-Serial Dynamic Programming

[Figure: the component behind the 2-cut, solved for the orientation combinations (+,+), (+,-), (-,+), (-,-), is replaced by the values $z_A$, $z_B$, $z_C$, $z_D$]

Objective Modification:

[Equation not recoverable from the transcript; it involves $z_A$, $z_B$, $z_C$, $z_D$]
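The staging idea can be checked on a toy instance. The Python sketch below solves the component behind a 2-cut once per orientation state of the cut pair, then solves the rest of the graph with those four values added to its objective. It assumes a simplified objective in which each edge scores the weight of the single state its orientations allow, and it uses brute force in place of an ILP solver. The instance and helper names are hypothetical.

```python
from itertools import product

# State allowed by the state-variable constraints for each (S_i, S_j).
STATE = {(0, 0): "A", (0, 1): "B", (1, 0): "C", (1, 1): "D"}

def value(edges, S):
    """Objective for an orientation dict S: each edge scores the weight of its state."""
    return sum(w[STATE[(S[i], S[j])]] for (i, j), w in edges.items())

def brute_force(edges, free, fixed=None, bonus=None):
    """Maximize over orientations of `free` nodes, with `fixed` nodes held constant."""
    best = float("-inf")
    for bits in product((0, 1), repeat=len(free)):
        S = dict(fixed or {})
        S.update(zip(free, bits))
        z = value(edges, S) + (bonus(S) if bonus else 0.0)
        best = max(best, z)
    return best

def w(a, b, c, d):
    return {"A": a, "B": b, "C": c, "D": d}

# Hypothetical instance: the 2-cut {1, 2} separates interior nodes {3, 4} from node 0.
component = {(1, 3): w(2, 0, 1, 0), (3, 4): w(0, 3, 0, 1), (4, 2): w(1, 0, 0, 2)}
rest = {(0, 1): w(1, 2, 0, 0), (0, 2): w(0, 0, 2, 1), (1, 2): w(1, 1, 1, 1)}

# Stage 1: solve the component once per orientation state of the cut pair.
z = {
    STATE[(s1, s2)]: brute_force(component, free=[3, 4], fixed={1: s1, 2: s2})
    for s1, s2 in product((0, 1), repeat=2)
}

# Stage 2: solve the rest, adding z for whichever state the cut pair takes.
staged = brute_force(rest, free=[0, 1, 2], bonus=lambda S: z[STATE[(S[1], S[2])]])

# Reference: solve everything at once.
direct = brute_force({**component, **rest}, free=[0, 1, 2, 3, 4])
print(z, staged, direct)
```

The staged and direct optima agree because the interior nodes touch only component edges, so their best orientations depend on the rest of the graph only through the two cut nodes.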

Page 22

SPQR-tree Based Implementation

• SPQR-tree efficiently finds 2 cuts (Tarjan, 73)

• DFS of SPQR-tree is used to schedule elimination order for NSDP

Page 23

Post Processing ILP Solution

• May have cycles
• Not a total ordering

For each connected component:

[Figure: the ILP solution on contigs A to F, redrawn as a bipartite graph in which each contig appears once as an outgoing node and once as an incoming node]

Bipartite matching

Objectives:
• Max weight
• Max cardinality
• Max cardinality / Max weight
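One way to read the bipartite construction: each contig may keep at most one outgoing and one incoming adjacency, so a matching between outgoing and incoming copies yields chains of contigs. The Python sketch below implements only the max cardinality objective, using the standard augmenting-path (Kuhn) algorithm. The input edges and function names are hypothetical, and the weighted objectives are not covered.

```python
def max_cardinality_matching(successors):
    """Kuhn's augmenting-path algorithm; maps each outgoing contig to its chosen next contig."""
    match_in = {}  # incoming contig -> outgoing contig currently matched to it

    def try_assign(u, seen):
        for v in successors.get(u, []):
            if v in seen:
                continue
            seen.add(v)
            # Take v if it is free, or if its current partner can be rerouted elsewhere.
            if v not in match_in or try_assign(match_in[v], seen):
                match_in[v] = u
                return True
        return False

    for u in successors:
        try_assign(u, set())
    return {u: v for v, u in match_in.items()}

def chains(next_of):
    """Follow successor links from every contig that has no predecessor."""
    has_pred = set(next_of.values())
    nodes = set(next_of) | has_pred
    result = []
    for start in sorted(n for n in nodes if n not in has_pred):
        path = [start]
        while path[-1] in next_of:
            path.append(next_of[path[-1]])
        result.append(path)
    return result

# Hypothetical ILP output: directed adjacencies that do not yet form simple paths.
ilp_edges = {"A": ["B", "C"], "B": ["C"], "C": ["D"], "E": ["D", "F"]}
next_of = max_cardinality_matching(ilp_edges)
print(chains(next_of))
```

A matching can still close a cycle, which `chains` would skip because a cycle has no starting contig. Such cycles would have to be broken separately.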

Page 24

GAGE Framework

Genome                     Size (Mb)   # reads
Staphylococcus aureus      2.9         3,494,070
Rhodobacter sphaeroides    4.6         2,050,868
Human Chr14                107         22,669,408

Assembled using:
ABySS, Allpaths-LG, Bambus2, CABOG, MSR-CA, SGA, SOAPdenovo, Velvet

Scaffolded using:
SILP (our method), Opera, MIP, Bambus2

Page 25

Testing Metrics

TPN50
• Break scaffold at incorrect edges, then find N50
• N50: the length L such that pieces of length at least L contain 50% of the total assembled bases

Binary Classification
• Given n contigs in a scaffold
• How many of the n-1 adjacencies can you predict?
• Measures: PPV, Sensitivity, MCC
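These metrics are simple to compute once the adjacencies have been labeled. The Python sketch below is one possible reading rather than the evaluation code behind the results: it measures each piece by the sum of its contig lengths and ignores gap sizes, it marks an incorrect adjacency by the index of the contig it follows, and the way true negatives are counted for MCC is left to the caller.

```python
import math

def n50(lengths):
    """Length of the piece at which the running total, longest first, reaches half of all bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0

def tp_n50(scaffold_contig_lengths, bad_edges):
    """Break each scaffold after every incorrect adjacency index, then take N50 of the pieces."""
    pieces = []
    for lengths, bad in zip(scaffold_contig_lengths, bad_edges):
        current = 0
        for k, length in enumerate(lengths):
            current += length
            # Index k marks the adjacency between contig k and contig k + 1.
            if k in bad or k == len(lengths) - 1:
                pieces.append(current)
                current = 0
    return n50(pieces)

def classification_metrics(tp, fp, fn, tn):
    """Return (PPV, sensitivity, MCC) from adjacency counts."""
    ppv = tp / (tp + fp)
    sensitivity = tp / (tp + fn)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return ppv, sensitivity, mcc

# Hypothetical scaffolds: the first has an incorrect join after its first contig.
scaffolds = [[100, 200, 300], [400]]
bad = [{0}, set()]
print(n50([sum(s) for s in scaffolds]), tp_n50(scaffolds, bad))
print(classification_metrics(tp=8, fp=2, fn=2, tn=8))
```

Breaking at the incorrect join turns the 600 bp scaffold into pieces of 100 bp and 500 bp, so TPN50 penalizes a scaffolder that reaches a high N50 through wrong joins.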

Page 26

Results

[Bar chart: Scaffolding TPN50. x-axis: Genome (staph, rhodo, chr14); y-axis: TPN50 (bp), 0 to 450,000; series: silp, opera, mip, bambus2]

Page 27

Results

[Bar chart: PPV. x-axis: Genome (staph, rhodo, chr14); y-axis: PPV, 0% to 120%; series: silp, opera, mip, bambus2]

Page 28

Results

[Bar chart: Sensitivity. x-axis: Genome (staph, rhodo, chr14); y-axis: Sensitivity, 0% to 80%; series: silp, opera, mip, bambus2]

Page 29

Results

[Bar chart: Matthews Correlation Coefficient. x-axis: Genome (staph, rhodo, chr14); y-axis: MCC, 0% to 90%; series: silp, opera, mip, bambus2]

Page 30

Conclusions

Success
• ILP solves scaffolding problem!
• NSDP works

Improvements
• Include SOAPdenovo, Allpaths-LG scaffolds in comparison
• Look at parameter effects
• Practical considerations (read style, multi-libraries, merge ctgs)

Future Work
• Where else can I apply NSDP?
• Scaffold before assembly … promising
• Structural Variation??

Page 31

Questions?