Scaffolding Large Genomes Using Integer Linear Programming

James Lindsay*, Hamed Salooti, Alex Zelikovski, Ion Mandoiu*
University of Connecticut*, Georgia State University


Page 1: Scaffolding Large Genomes Using Integer Linear Programming

Scaffolding Large Genomes Using Integer Linear Programming

James Lindsay*, Hamed Salooti, Alex Zelikovski, Ion Mandoiu*

University of Connecticut*, Georgia State University

Page 2: Scaffolding Large Genomes Using Integer Linear Programming

De-novo Assembly Paradigm

The Genome → Sequencing → The Reads → Assembly → The Contigs → Scaffolding → The Scaffolds

Page 3: Scaffolding Large Genomes Using Integer Linear Programming

Why Scaffolding?

• Annotation
• Comparative biology
• Re-sequencing and gap filling
• Structural variation!

[Figure: gene XYZ with its 5’ UTR and 3’ UTR, shown with a scaffold and with no scaffold]

Page 4: Scaffolding Large Genomes Using Integer Linear Programming

Why Scaffolding?

• Annotation
• Comparative biology
• Re-sequencing and gap filling
• Structural variation!

[Figure: Sanger sequencing of gene XYZ with its 5’ UTR and 3’ UTR]

Biologist: There are holes in my genes!

Page 5: Scaffolding Large Genomes Using Integer Linear Programming

Why Scaffolding?

• Annotation
• Comparative biology
• Re-sequencing and gap filling
• Structural variation!

Page 6: Scaffolding Large Genomes Using Integer Linear Programming

Read Pairs

Paired Read Construction

[Figure: paired reads R1 and R2, labelled 2kb, on the same strand and in the same orientation]

Informative Reads
• Align each read against the contigs
• Only accept uniquely mapped reads (use the non-unique reads later)
• Both reads in a pair must map to different contigs
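The filtering rules for informative reads can be sketched as below. This is a minimal illustration, not the authors' pipeline: the `Alignment` record and its fields (`contig`, `position`, `strand`, `num_hits`) are hypothetical names introduced here to stand in for whatever an aligner reports.

```python
from collections import namedtuple

# Hypothetical record summarising one read's alignment (not from the slides).
Alignment = namedtuple("Alignment", ["contig", "position", "strand", "num_hits"])

def informative_pairs(pairs):
    """Keep pairs whose reads both map uniquely and to different contigs."""
    kept = []
    for r1, r2 in pairs:
        if r1 is None or r2 is None:
            continue  # one read of the pair did not align at all
        if r1.num_hits != 1 or r2.num_hits != 1:
            continue  # non-unique reads are set aside for later use
        if r1.contig == r2.contig:
            continue  # a pair inside one contig links nothing
        kept.append((r1, r2))
    return kept

example_pairs = [
    (Alignment("c1", 900, "+", 1), Alignment("c2", 40, "+", 1)),    # kept
    (Alignment("c1", 100, "+", 1), Alignment("c1", 2100, "+", 1)),  # same contig
    (Alignment("c3", 10, "-", 3), Alignment("c4", 75, "+", 1)),     # R1 not unique
]
informative = informative_pairs(example_pairs)
```

Only the first example pair survives, since it is the only one with two unique alignments on different contigs.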

Page 7: Scaffolding Large Genomes Using Integer Linear Programming

Linkage Information

Two contigs are adjacent if:
• A read pair spans the contigs

Possible States
• State (A, B, C, D) depends on orientation of the read
• Order of contigs is arbitrary

Each read pair can be “consistent” with one of the four states.

[Figure: contig i and contig j drawn 5’ to 3’, with reads R1 and R2 placed in each of the four states A, B, C, D]
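As a rough sketch, a read pair can be assigned to a state from the strands on which its two reads align. The slides do not spell out which orientation combination each letter denotes, so the labelling below is an assumption, chosen only so that A and D go with equal contig orientations and B and C with opposite ones, as in the constraints on the ILP slide.

```python
# Assumed labelling (hypothetical): strand of R1 on contig i, strand of R2 on contig j.
STATE_BY_STRANDS = {
    ("+", "+"): "A",
    ("+", "-"): "B",
    ("-", "+"): "C",
    ("-", "-"): "D",
}

def pair_state(strand_i, strand_j):
    """Return the state a read pair is consistent with, under the assumed labelling."""
    return STATE_BY_STRANDS[(strand_i, strand_j)]

def same_orientation(state):
    """True when the state implies both contigs share an orientation (A or D)."""
    return state in ("A", "D")
```

Because the order of contigs is arbitrary, a caller would first fix which contig of the pair plays the role of i, for example by sorting the two names.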

Page 8: Scaffolding Large Genomes Using Integer Linear Programming

The Scaffolding Problem

Given
• Contigs
• Paired reads

Find
• Orientation
• Ordering
• Relative distance

Goal
• Recreate true scaffolds

Possible Objectives
• Un-weighted
  • Max number of consistent read pairs
• Weighted
  • Each state is weighted: $W_{ij}^{A}, W_{ij}^{B}, W_{ij}^{C}, W_{ij}^{D}$
  • Overlap with repeat
  • Deviation from expected distance
  • …
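For the un-weighted objective, one plausible reading is that each state weight is simply the number of read pairs consistent with that state. The sketch below counts them; the input format (tuples of contig names and a state letter) is an assumption made for illustration.

```python
from collections import Counter, defaultdict

def unweighted_state_weights(observations):
    """observations: iterable of (contig_i, contig_j, state), state in 'ABCD'.

    Returns {(contig_i, contig_j): Counter({state: number of read pairs})}.
    """
    weights = defaultdict(Counter)
    for i, j, state in observations:
        weights[(i, j)][state] += 1
    return dict(weights)

obs = [
    ("c1", "c2", "A"),
    ("c1", "c2", "A"),
    ("c1", "c2", "B"),
    ("c2", "c3", "D"),
]
W = unweighted_state_weights(obs)
```

A weighted variant would replace the `+= 1` with a per-pair score, for instance one that penalises overlap with repeats or deviation from the expected distance.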

Page 9: Scaffolding Large Genomes Using Integer Linear Programming

Graph Representation

Using the input we can define a scaffolding graph:

$G = (V, E)$

$V$: set of all contigs
$E$: set of …

• This is an undirected multi-graph
• Assume it is connected
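A small sketch of such a graph follows. The definition of the edge set is cut off in the slide, so treating every informative read pair as one edge between the two contigs it touches is an assumption here. The component search shows how the "assume it is connected" condition could be met by handling each component on its own.

```python
from collections import defaultdict, deque

def build_scaffold_graph(links):
    """links: iterable of (contig_u, contig_v); repeated pairs become parallel edges."""
    adj = defaultdict(list)
    for u, v in links:
        adj[u].append(v)
        adj[v].append(u)
    return adj

def connected_components(adj):
    """Breadth-first search over the multigraph; returns sorted node lists."""
    seen, components = set(), []
    for start in adj:
        if start in seen:
            continue
        seen.add(start)
        queue, comp = deque([start]), []
        while queue:
            node = queue.popleft()
            comp.append(node)
            for nxt in adj[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
        components.append(sorted(comp))
    return components

links = [("c1", "c2"), ("c1", "c2"), ("c2", "c3"), ("c4", "c5")]
graph = build_scaffold_graph(links)
components = connected_components(graph)
```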

Page 10: Scaffolding Large Genomes Using Integer Linear Programming

Integer Linear Program Formulation

Variables
• Contig Pair State: $A_{ij}, B_{ij}, C_{ij}, D_{ij} \in \{0,1\}$
• Contig Orientation: $S_i \in \{0,1\}$
• Pairwise Contig Consistency: $S_{ij} \in \{0,1\}$

Objective
Maximize weight of consistent pairs:

$$\max \sum_{(i,j) \in E} \left( W_{ij}^{A} A_{ij} + W_{ij}^{B} B_{ij} + W_{ij}^{C} C_{ij} + W_{ij}^{D} D_{ij} \right)$$
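To make the objective concrete, the sketch below evaluates it for a given assignment of the state variables. The dictionary layout for weights and chosen states is an assumption for illustration; an actual ILP solver would hold these as model variables.

```python
def objective(weights, chosen):
    """Sum of W^X_ij * X_ij over all edges and states X in A, B, C, D.

    weights: {(i, j): {'A': w, 'B': w, 'C': w, 'D': w}}
    chosen:  {(i, j): {'A': 0 or 1, ...}}; missing entries count as 0.
    """
    total = 0.0
    for edge, w in weights.items():
        x = chosen.get(edge, {})
        total += sum(w[s] * x.get(s, 0) for s in "ABCD")
    return total

weights = {
    ("c1", "c2"): {"A": 5, "B": 1, "C": 0, "D": 0},
    ("c2", "c3"): {"A": 0, "B": 2, "C": 4, "D": 0},
}
chosen = {("c1", "c2"): {"A": 1}, ("c2", "c3"): {"C": 1}}
value = objective(weights, chosen)
```

Here the assignment picks state A for the first edge and state C for the second, so the objective is the sum of those two weights.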

Page 11: Scaffolding Large Genomes Using Integer Linear Programming

Constraints

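Pairwise Orientation

$$S_{ij} \le S_j + S_i, \qquad S_{ij} \le 2 - S_i - S_j$$

$$S_{ij} \ge S_j - S_i, \qquad S_{ij} \ge S_i - S_j$$

Mutual Exclusivity

$$A_{ij} + D_{ij} \le 1 - S_{ij}, \qquad B_{ij} + C_{ij} \le S_{ij}$$

Forbid 2- and 3-cycles explicitly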
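A quick way to check what these constraints do is to enumerate them. The sketch below is a brute-force check, not a solver: it shows that the four pairwise-orientation inequalities leave exactly one feasible value of S_ij, the exclusive-or of S_i and S_j, and that the last two inequalities allow at most one state per contig pair.

```python
from itertools import product

def feasible_s_ij(s_i, s_j):
    """Values of S_ij allowed by the four pairwise-orientation inequalities."""
    return [
        s for s in (0, 1)
        if s <= s_i + s_j and s <= 2 - s_i - s_j
        and s >= s_j - s_i and s >= s_i - s_j
    ]

def feasible_states(s_ij):
    """Binary (A, B, C, D) tuples with A + D <= 1 - S_ij and B + C <= S_ij."""
    return [
        (a, b, c, d) for a, b, c, d in product((0, 1), repeat=4)
        if a + d <= 1 - s_ij and b + c <= s_ij
    ]

xor_table = {(si, sj): feasible_s_ij(si, sj) for si, sj in product((0, 1), repeat=2)}
```

With S_ij = 0 only A or D (or neither) can be selected, and with S_ij = 1 only B or C (or neither), so the state chosen for an edge always agrees with the orientations of its two contigs.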

Page 12: Scaffolding Large Genomes Using Integer Linear Programming

Graph Decomposition: Articulation Points

[Figure: graph split at an articulation point; label: solve]
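Articulation points can be found with the standard depth-first low-link method. The sketch below is a textbook version on a simple adjacency dictionary, offered as an illustration rather than the code behind the slides; it is recursive, so very large graphs would need an iterative rewrite or a raised recursion limit.

```python
def articulation_points(adj):
    """adj: {node: iterable of neighbours} for an undirected graph."""
    disc, low, points = {}, {}, set()
    counter = [0]

    def dfs(u, parent):
        disc[u] = low[u] = counter[0]
        counter[0] += 1
        children = 0
        for v in adj[u]:
            if v == parent:
                continue
            if v in disc:
                # Back edge: u can reach an earlier-discovered node.
                low[u] = min(low[u], disc[v])
            else:
                children += 1
                dfs(v, u)
                low[u] = min(low[u], low[v])
                # The subtree under v cannot climb above u, so u separates it.
                if parent is not None and low[v] >= disc[u]:
                    points.add(u)
        if parent is None and children > 1:
            points.add(u)

    for node in adj:
        if node not in disc:
            dfs(node, None)
    return points

# Triangle a-b-c with a tail c-d-e: removing c or d disconnects the graph.
adj = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c", "e"},
    "e": {"d"},
}
cut_nodes = articulation_points(adj)
```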

Page 13: Scaffolding Large Genomes Using Integer Linear Programming

Graph Decomposition: 2-cuts

[Figure: a 2-cut, annotated with the four sign combinations (+, +), (+, −), (−, +), (−, −)]
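For intuition, 2-cuts can be listed by brute force: remove every pair of nodes in turn and test whether the rest stays connected. This quadratic sketch is only an illustration on a toy graph and is not how a large scaffolding graph would be decomposed.

```python
from itertools import combinations

def is_connected(adj, removed):
    """Depth-first reachability over the nodes that are not removed."""
    nodes = [n for n in adj if n not in removed]
    if not nodes:
        return True
    seen, stack = {nodes[0]}, [nodes[0]]
    while stack:
        u = stack.pop()
        for v in adj[u]:
            if v not in removed and v not in seen:
                seen.add(v)
                stack.append(v)
    return len(seen) == len(nodes)

def two_cuts(adj):
    """All node pairs whose removal disconnects the remaining graph."""
    return [
        pair for pair in combinations(sorted(adj), 2)
        if not is_connected(adj, set(pair))
    ]

# A 6-cycle is biconnected: no articulation point, but many 2-cuts.
cycle = {i: {(i - 1) % 6, (i + 1) % 6} for i in range(6)}
cuts = two_cuts(cycle)
```

In the cycle, removing two neighbouring nodes leaves a path, while removing any two non-neighbouring nodes splits it, which is why only the non-adjacent pairs are reported.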

Page 14: Scaffolding Large Genomes Using Integer Linear Programming

Non-Serial Dynamic Programming

• SPQR-tree to schedule decomposition

• Traverse tree using DFS

• NSDP utilizes solutions of previous stage in current stage
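The scheduling idea can be sketched generically as a depth-first, children-before-parents traversal of a decomposition tree, where each stage receives the solutions of the stages below it. The tree layout and the `solve_component` callback are hypothetical stand-ins; building an actual SPQR-tree and solving each component's ILP are outside this sketch.

```python
def solve_tree(tree, root, solve_component):
    """Post-order traversal of a decomposition tree.

    tree: {node: [children]}
    solve_component(node, child_solutions) -> solution for that node.
    Children are solved first so each stage can reuse earlier results.
    """
    solutions = {}
    stack = [(root, False)]
    while stack:
        node, expanded = stack.pop()
        if expanded:
            child_solutions = [solutions[c] for c in tree.get(node, [])]
            solutions[node] = solve_component(node, child_solutions)
        else:
            stack.append((node, True))
            for child in tree.get(node, []):
                stack.append((child, False))
    return solutions

# Toy stand-in for a component solve: own value plus the children's values.
tree = {"root": ["x", "y"], "x": ["z"]}
own = {"root": 1, "x": 2, "y": 3, "z": 4}
result = solve_tree(tree, "root", lambda node, kids: own[node] + sum(kids))
```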

Page 15: Scaffolding Large Genomes Using Integer Linear Programming

Largest Connected Component

Page 16: Scaffolding Large Genomes Using Integer Linear Programming

Largest Biconnected Component

Page 17: Scaffolding Large Genomes Using Integer Linear Programming

Largest Triconnected Component

Page 18: Scaffolding Large Genomes Using Integer Linear Programming

Post Processing ILP Solution

• May have cycles
• Not a total ordering

For each connected component:

[Figure: ILP solution over contigs A to F, redrawn as a bipartite graph with an outgoing copy and an incoming copy of each contig]

Bipartite matching objectives:
• Max weight
• Max cardinality
• Max cardinality / Max weight
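The max-cardinality variant can be sketched with the classic augmenting-path method on the outgoing/incoming bipartite graph. This is an illustrative stand-in rather than the authors' implementation, and the input format (a dictionary of candidate "u is followed by v" adjacencies) is an assumption.

```python
def max_cardinality_matching(out_edges):
    """out_edges: {u: [v, ...]} candidate 'u is followed by v' adjacencies.

    Returns {u: v} where every contig has at most one successor and is the
    successor of at most one contig.
    """
    match_in = {}  # successor v -> predecessor u

    def try_assign(u, visited):
        for v in out_edges.get(u, []):
            if v in visited:
                continue
            visited.add(v)
            # Take v if it is free, or if its current predecessor can move on.
            if v not in match_in or try_assign(match_in[v], visited):
                match_in[v] = u
                return True
        return False

    for u in out_edges:
        try_assign(u, set())
    return {u: v for v, u in match_in.items()}

# ILP output with a branch: A is followed by both B and C.
ilp_edges = {"A": ["B", "C"], "B": ["C"], "C": ["D"]}
matching = max_cardinality_matching(ilp_edges)
```

A matching limits every contig to one predecessor and one successor, which yields paths and possibly cycles; any remaining cycle would still have to be broken in a separate step.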

Page 19: Scaffolding Large Genomes Using Integer Linear Programming

Testing Framework

Venter Genome

Read Type    Total Reads  Total Bases  Avg length  Coverage
Sanger       31,861,976   2.79E+10     875         9.930637
SOLiD pairs  4.85E+08     2.42E+10     50          8.623028

4x Assembly

# Reads     # Bases in reads  # Contigs  # Bases in contigs  N50
112,00,000  1.1E+10           422,837    2.26E+09            7704

Page 20: Scaffolding Large Genomes Using Integer Linear Programming

Testing Metrics

Computer Scientists
• Finding Scaffold = Binary Classification Test
• n contigs, try to predict n-1 adjacencies
• TP, FP, TN, FN, Sensitivity, PPV

Biologists (main focus)
• N50 (length-weighted median scaffold size, ignoring gaps)
• TP50: break scaffold at incorrect edges, then find N50
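These metrics can be sketched as follows. N50 uses its usual definition, and TP50 follows the description above; the data layout (scaffolds as lists of contig names with lengths, and a set of true adjacencies) is an assumption made for the example, and scaffolds are assumed to be non-empty.

```python
def n50(lengths):
    """Largest length L such that pieces of length >= L hold at least half the bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0

def tp50(scaffolds, correct):
    """Break each scaffold at every incorrect adjacency, then take N50 of the pieces.

    scaffolds: lists of (contig_name, length); correct: set of true (prev, next) pairs.
    """
    pieces = []
    for scaffold in scaffolds:
        current = scaffold[0][1]
        for (prev, _), (name, length) in zip(scaffold, scaffold[1:]):
            if (prev, name) in correct:
                current += length
            else:
                pieces.append(current)
                current = length
        pieces.append(current)
    return n50(pieces)

def sensitivity_ppv(tp, fp, fn):
    """Sensitivity = TP / (TP + FN); PPV = TP / (TP + FP)."""
    return tp / (tp + fn), tp / (tp + fp)

scaffolds = [[("a", 50), ("b", 30), ("c", 20)], [("d", 40)]]
truth = {("a", "b")}
```

In the example the first scaffold is broken after contig b because the b-c adjacency is not in the truth set, so TP50 comes out lower than N50.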

Page 21: Scaffolding Large Genomes Using Integer Linear Programming

Results

test case  method  bundle size  sensitivity  PPV     N50     TP50
10%        opera   2            81.13%       99.26%  27,567  27,327
10%        mip     2            59.01%       98.94%  19,988  19,755
10%        ilp     1            79.86%       98.58%  26,814  26,459
25%        opera   2            80.44%       98.27%  27,296  26,849
25%        mip     2            58.95%       97.56%  19,842  19,518
25%        ilp     1            79.30%       96.93%  26,684  26,079
100%       opera   3            pending      …       …       …
100%       mip     3            failed       n/a     n/a     n/a
100%       ilp     1            68.25%       89.90%  20,538  19,006

Page 22: Scaffolding Large Genomes Using Integer Linear Programming

Conclusions

Success
• ILP solves scaffolding problem!
• NSDP works.

Improvements
• Finalize large test cases (then publish?!)
• Practical considerations (read style, multi-libraries, merge contigs)

Future Work
• Where else can I apply NSDP?
• Scaffold before assembly??
• Structural Variation??

Page 23: Scaffolding Large Genomes Using Integer Linear Programming

Questions?