Scaffolding Large Genomes Using Integer Linear Programming

JAMES LINDSAY*, HAMED SALOOTI, ALEX ZELIKOVSKI, ION MANDOIU*

University of Connecticut*, Georgia State University


Page 2:

De-novo Assembly Paradigm

[Figure: The Genome → Sequencing → The Reads → Assembly → The Contigs → Scaffolding → The Scaffolds]

Page 3:

Why Scaffolding?

• Annotation
• Comparative biology
• Re-sequencing and gap filling
• Structural variation!

[Figure: gene XYZ with its 5’ UTR and 3’ UTR, shown with a scaffold ("Scaffold") and without one ("No scaffold")]

Page 4:

Why Scaffolding?

• Annotation
• Comparative biology
• Re-sequencing and gap filling
• Structural variation!

[Figure: Sanger sequencing of gene XYZ with its 5’ UTR and 3’ UTR]

Biologist: There are holes in my genes!

Page 5:

Why Scaffolding?

• Annotation
• Comparative biology
• Re-sequencing and gap filling
• Structural variation!

Page 6:

Read Pairs

Paired Read Construction

[Figure: read pair R1, R2; labels: 2kb, same strand and orientation]

Informative Reads
• Align each read against the contigs
• Only accept uniquely mapped reads
• Use the non-unique reads later
• Both reads in a pair must map to different contigs
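The filtering rules on this slide can be sketched in a few lines of Python. This is only an illustration under assumptions: the `(pair_id, read, contig)` record layout and the function name `informative_pairs` are hypothetical and do not come from the talk.

```python
from collections import Counter


def informative_pairs(alignments):
    """Return {pair_id: (contig_of_R1, contig_of_R2)} for informative pairs.

    `alignments` holds one (pair_id, read, contig) tuple per alignment hit,
    where `read` is "R1" or "R2" (hypothetical record layout).
    """
    # Count hits per read so that non-uniquely mapped reads can be skipped.
    hits = Counter((pair_id, read) for pair_id, read, _ in alignments)
    contig_of = {}
    for pair_id, read, contig in alignments:
        if hits[(pair_id, read)] == 1:
            contig_of[(pair_id, read)] = contig

    kept = {}
    for (pair_id, read), contig in contig_of.items():
        if read != "R1":
            continue
        mate_contig = contig_of.get((pair_id, "R2"))
        # Both reads must be uniquely mapped and land on different contigs.
        if mate_contig is not None and mate_contig != contig:
            kept[pair_id] = (contig, mate_contig)
    return kept
```

Reads dropped here for being non-unique are simply ignored; the slide says they are used later, which this sketch does not attempt.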

Page 7:

Linkage Information

Two contigs are adjacent if:
• A read pair spans the contigs

State (A, B, C, D)
• Depends on orientation of the read
• Order of contigs is arbitrary

Each read pair can be “consistent” with one of the four states

Possible States

[Figure: contig i and contig j drawn 5’ to 3’, with reads R1 and R2 shown in each of the four states A, B, C, D]
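As a rough illustration, a read pair can be assigned to one of the four states from the strands of its two alignments. The exact strand-to-state mapping is not recoverable from the slide, so the table below is an assumption chosen only so that A and D group together and B and C group together, matching how the constraints later in the talk pair them.

```python
# Hypothetical mapping from (strand of R1 on contig i, strand of R2 on contig j)
# to a state label. The talk's figure defines the real mapping; this one is
# an assumption for illustration.
STATE_BY_STRANDS = {
    ("+", "+"): "A",
    ("+", "-"): "B",
    ("-", "+"): "C",
    ("-", "-"): "D",
}


def pair_state(strand_r1, strand_r2):
    """Return the state (A, B, C or D) a read pair is consistent with."""
    return STATE_BY_STRANDS[(strand_r1, strand_r2)]


def same_relative_orientation(state):
    """True for states A and D, False for states B and C."""
    return state in ("A", "D")
```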

Page 8:

The Scaffolding Problem

Given
• Contigs
• Paired reads

Find
• Orientation
• Ordering
• Relative distance

Goal
• Recreate true scaffolds

Possible Objectives
• Un-weighted
  • Max number of consistent read pairs
• Weighted
  • Each state is weighted: $W_{ij}^{A}, W_{ij}^{B}, W_{ij}^{C}, W_{ij}^{D}$
    • Overlap with repeat
    • Deviation from expected distance
    • …
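For the un-weighted objective, one simple way to obtain the four per-pair weights is to count the read pairs consistent with each state. The sketch below assumes each informative read pair has already been reduced to a hypothetical `(contig_i, contig_j, state)` triple with a fixed contig order; the weighted variants (repeat overlap, distance deviation) are not modelled.

```python
from collections import defaultdict


def unweighted_state_weights(pair_states):
    """Count read pairs per contig pair and state.

    `pair_states` is an iterable of (contig_i, contig_j, state) triples
    (hypothetical layout). Returns {(i, j): {"A": n, "B": n, "C": n, "D": n}}.
    """
    weights = defaultdict(lambda: {"A": 0, "B": 0, "C": 0, "D": 0})
    for i, j, state in pair_states:
        weights[(i, j)][state] += 1
    return dict(weights)
```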

Page 9:

Graph Representation

Using input we can define a scaffolding graph:

$G = (V, E)$
• $V$: set of all contigs
• $E$: set of …

This is an undirected multi-graph

Assume it is connected
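A minimal sketch of such a graph, using plain dictionaries. The slide's definition of E is cut off, so treating each linking read pair as one edge (which yields parallel edges) is an assumption, as are the function names. The component search shows one way to handle the "assume it is connected" condition by working on one component at a time.

```python
from collections import defaultdict, deque


def build_scaffold_graph(contigs, links):
    """Adjacency lists for an undirected multigraph.

    `links` has one (contig_i, contig_j) entry per linking read pair
    (assumed edge definition), so parallel edges are kept.
    """
    adj = defaultdict(list)
    for contig in contigs:
        adj[contig]  # make sure isolated contigs appear as nodes
    for i, j in links:
        adj[i].append(j)
        adj[j].append(i)
    return dict(adj)


def connected_components(adj):
    """Return the connected components as lists of nodes (BFS)."""
    seen, components = set(), []
    for start in adj:
        if start in seen:
            continue
        seen.add(start)
        component, queue = [], deque([start])
        while queue:
            u = queue.popleft()
            component.append(u)
            for v in adj[u]:
                if v not in seen:
                    seen.add(v)
                    queue.append(v)
        components.append(component)
    return components
```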

Page 10:

Integer Linear Program Formulation

Variables
• Contig Pair State: $A_{ij}, B_{ij}, C_{ij}, D_{ij} \in \{0,1\}$
• Contig Orientation: $S_i \in \{0,1\}$
• Pairwise Contig Consistency: $S_{ij} \in \{0,1\}$

Objective
• Maximize weight of consistent pairs

$$\max \sum_{(i,j) \in E} \left( W_{ij}^{A} A_{ij} + W_{ij}^{B} B_{ij} + W_{ij}^{C} C_{ij} + W_{ij}^{D} D_{ij} \right)$$

Page 11:

Constraints

Pairwise Orientation

$$S_{ij} \le S_j + S_i$$
$$S_{ij} \le 2 - S_i - S_j$$
$$S_{ij} \ge S_j - S_i$$
$$S_{ij} \ge S_i - S_j$$

Mutual Exclusivity

$$A_{ij} + D_{ij} \le 1 - S_{ij}$$
$$B_{ij} + C_{ij} \le S_{ij}$$

Forbid 2- and 3-cycles explicitly
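The four pairwise-orientation inequalities force $S_{ij}$ to equal the exclusive-or of $S_i$ and $S_j$, which is easy to confirm by enumeration. The sketch below checks that, and then solves a tiny instance by brute force rather than with an ILP solver. It assumes non-negative weights and leaves out the 2- and 3-cycle constraints, so it illustrates the orientation and mutual-exclusivity part of the model only; the names `xor_constraints_hold` and `solve_tiny` are hypothetical.

```python
from itertools import product


def xor_constraints_hold(s_i, s_j, s_ij):
    """Check the four pairwise-orientation inequalities for 0/1 values."""
    return (s_ij <= s_i + s_j and s_ij <= 2 - s_i - s_j
            and s_ij >= s_j - s_i and s_ij >= s_i - s_j)


def solve_tiny(n, weights):
    """Brute-force the objective over all orientations of contigs 0..n-1.

    `weights` maps (i, j) to {"A": w, "B": w, "C": w, "D": w}.
    Cycle constraints are NOT enforced (simplifying assumption).
    """
    best_value, best_orientation = -1, None
    for s in product((0, 1), repeat=n):
        value = 0
        for (i, j), w in weights.items():
            s_ij = s[i] ^ s[j]
            # A + D <= 1 - S_ij and B + C <= S_ij: at most one state is
            # selected, and only from the group that S_ij permits.
            allowed = ("B", "C") if s_ij else ("A", "D")
            value += max(w[state] for state in allowed)
        if value > best_value:
            best_value, best_orientation = value, s
    return best_value, best_orientation
```

Enumeration grows as 2^n, so this is only useful for checking the model on a handful of contigs.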

Page 12:

Graph Decomposition: Articulation Points

[Figure: graph decomposed at an articulation point; labels: solve, Articulation point]
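Articulation points can be found with the standard depth-first low-link method. The sketch below is a generic textbook version, not code from the talk. It is recursive, so it suits small examples rather than genome-scale graphs, and it takes the graph as a hypothetical `{node: neighbours}` dictionary.

```python
def articulation_points(adj):
    """Return the set of articulation points of an undirected graph.

    `adj` maps each node to an iterable of neighbours. Parallel edges are
    harmless here because they do not affect vertex connectivity.
    """
    disc, low, points = {}, {}, set()
    counter = [0]

    def dfs(u, parent):
        disc[u] = low[u] = counter[0]
        counter[0] += 1
        children = 0
        for v in adj[u]:
            if v == parent:
                continue
            if v in disc:
                # Back edge: u can reach an earlier-discovered node.
                low[u] = min(low[u], disc[v])
            else:
                children += 1
                dfs(v, u)
                low[u] = min(low[u], low[v])
                # v's subtree cannot climb above u, so removing u cuts it off.
                if parent is not None and low[v] >= disc[u]:
                    points.add(u)
        # The DFS root is an articulation point only with 2+ DFS children.
        if parent is None and children > 1:
            points.add(u)

    for node in adj:
        if node not in disc:
            dfs(node, None)
    return points
```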

Page 13:

Graph Decomposition: 2-cuts

[Figure: graph decomposed at a 2-cut; labels: 2-cut, +, -]

Page 14:

Non-Serial Dynamic Programming

• SPQR-tree to schedule decomposition

• Traverse tree using DFS

• NSDP utilizes solutions of previous stage in current stage

Page 15:

Largest Connected Component

Page 16:

Largest Biconnected Component

Page 17:

Largest Triconnected Component

Page 18:

Post Processing ILP Solution

The ILP solution:
• May have cycles
• Is not a total ordering

For each connected component, solve a bipartite matching (outgoing vs. incoming).

[Figure: ILP solution on contigs A to F, and the corresponding bipartite graph with outgoing copies A to F on one side and incoming copies A to F on the other]

Bipartite matching objectives:
• Max weight
• Max cardinality
• Max cardinality / Max weight
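The max-cardinality variant can be sketched with the classic augmenting-path method, matching each contig's outgoing copy to at most one incoming copy so that every contig ends up with at most one successor and one predecessor. This is a generic algorithm, not the talk's implementation; the weighted objectives would need a different method, and the `successors` input layout is an assumption.

```python
def max_cardinality_matching(successors):
    """Match outgoing copies to incoming copies of contigs.

    `successors` maps a contig to the contigs that may follow it in the
    ILP solution (hypothetical layout). Returns {contig: chosen successor}.
    """
    pred_of = {}  # incoming copy -> outgoing copy currently matched to it

    def try_assign(u, visited):
        for v in successors.get(u, ()):
            if v in visited:
                continue
            visited.add(v)
            # Take v if it is free, or if its current partner can move on.
            if v not in pred_of or try_assign(pred_of[v], visited):
                pred_of[v] = u
                return True
        return False

    for u in successors:
        try_assign(u, set())
    return {u: v for v, u in pred_of.items()}
```

Note that a matching alone only bounds in- and out-degree by one; any cycle that survives would still have to be broken separately.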

Page 19:

Testing Framework

Venter Genome

Read Type     Total Reads   Total Bases   Avg length   Coverage
Sanger        31,861,976    2.79E+10      875          9.930637
SOLiD pairs   4.85E+08      2.42E+10      50           8.623028

4x Assembly

# Reads      # Bases in reads   # Contigs   # Bases in contigs   N50
112,00,000   1.1E+10            422,837     2.26E+09             7704

Page 20:

Testing Metrics

Computer Scientists
• Finding Scaffold = Binary Classification Test
• n contigs, try to predict n-1 adjacencies
• TP, FP, TN, FN, Sensitivity, PPV

Biologists (main focus)
• N50 (length-weighted median scaffold size, ignoring gaps)
• TP50: break scaffold at incorrect edges, then find N50
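These metrics can be written down compactly. The sketch below uses the usual definition of N50 (the largest length L such that pieces of length at least L hold half or more of all bases) and computes TP50 as described on the slide, by cutting scaffolds at incorrect adjacencies first. The data layout and the `is_correct` callback are assumptions for illustration.

```python
def n50(lengths):
    """Largest length L such that pieces of length >= L cover half the total."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0


def tp50(scaffolds, is_correct):
    """N50 after breaking each scaffold at its incorrect adjacencies.

    `scaffolds` is a list of scaffolds, each an ordered list of
    (contig, length) tuples; `is_correct(a, b)` says whether placing
    contig b right after contig a is a true adjacency. Gaps are ignored.
    """
    pieces = []
    for scaffold in scaffolds:
        current = 0
        for k, (contig, length) in enumerate(scaffold):
            if k > 0 and not is_correct(scaffold[k - 1][0], contig):
                pieces.append(current)  # cut at the incorrect edge
                current = 0
            current += length
        pieces.append(current)
    return n50(pieces)


def sensitivity_and_ppv(tp, fp, fn):
    """Sensitivity = TP / (TP + FN); PPV = TP / (TP + FP)."""
    return tp / (tp + fn), tp / (tp + fp)
```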

Page 21:

Results

test case   method   bundle size   sensitivity   ppv      N50      TP50
10%         opera    2             81.13%        99.26%   27,567   27,327
10%         mip      2             59.01%        98.94%   19,988   19,755
10%         ilp      1             79.86%        98.58%   26,814   26,459
25%         opera    2             80.44%        98.27%   27,296   26,849
25%         mip      2             58.95%        97.56%   19,842   19,518
25%         ilp      1             79.30%        96.93%   26,684   26,079
100%        opera    3             pending       …        …        …
100%        mip      3             failed        n/a      n/a      n/a
100%        ilp      1             68.25%        89.90%   20,538   19,006

Page 22:

Conclusions

Success
• ILP solves scaffolding problem!
• NSDP works.

Improvements
• Finalize large test cases (then publish?!)
• Practical considerations (read style, multi-libraries, merge contigs)

Future Work
• Where else can I apply NSDP?
• Scaffold before assembly??
• Structural Variation??

Page 23:

Questions?