JAMES LINDSAY*, HAMED SALOOTI, ALEX ZELIKOVSKI, ION MANDOIU* Scaffolding Large Genomes Using Integer...

Post on 14-Dec-2015

JAMES LINDSAY* , HAMED SALOOTI , ALEX ZELIKOVSKI , ION MANDOIU*

Scaffolding Large Genomes Using Integer Linear Programming

University of Connecticut* Georgia State University

De-novo Assembly Paradigm

[Figure: The Genome → Sequencing → The Reads → Assembly → The Contigs → Scaffolding → The Scaffolds]

Why Scaffolding?

• Annotation
• Comparative biology
• Re-sequencing and gap filling
• Structural variation!

[Figure: gene XYZ with its 5’ UTR and 3’ UTR, shown with a scaffold and with no scaffold]

[Figure: Sanger Sequencing of gene XYZ with its 5’ UTR and 3’ UTR]

Biologist: There are holes in my genes!

Read Pairs

Paired Read Construction

[Figure: paired read construction; labels: 2kb, 2kb, same strand and orientation, R1, R2]

Informative Reads

• Align each read against the contigs
• Only accept uniquely mapped reads
  • Use the non-unique reads later
• Both reads in a pair must map to different contigs
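The filtering rules above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' pipeline: the input format (one `(pair_id, mate, contig)` tuple per alignment hit) and the function name `informative_pairs` are assumptions made for the example.

```python
from collections import Counter

def informative_pairs(alignments):
    """Keep read pairs whose two mates map uniquely to different contigs.

    alignments: list of (pair_id, mate, contig) tuples, one per alignment
    hit, where mate is 1 or 2 (hypothetical input format).
    Returns {pair_id: (contig_of_mate_1, contig_of_mate_2)}.
    """
    # Count hits per read; a read with exactly one hit is uniquely mapped.
    hits = Counter((pair_id, mate) for pair_id, mate, _ in alignments)

    placed = {}
    for pair_id, mate, contig in alignments:
        if hits[(pair_id, mate)] == 1:
            placed.setdefault(pair_id, {})[mate] = contig

    # Both mates must be placed, and on different contigs.
    return {
        pair_id: (mates[1], mates[2])
        for pair_id, mates in placed.items()
        if len(mates) == 2 and mates[1] != mates[2]
    }
```

Non-unique reads are simply dropped here; the slide notes they can be used later, which this sketch does not attempt.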

Linkage Information

Possible States

Two contigs are adjacent if:
• A read pair spans the contigs

State (A, B, C, D):
• Depends on orientation of the read
• Order of contigs is arbitrary

Each read pair can be “consistent” with one of the four states

[Figure: contig i and contig j drawn 5’ to 3’, with read pair R1, R2 shown in each of the states A, B, C, D]
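As a rough illustration of how a read pair could be assigned to one of the four states, the sketch below maps the strands of the two alignments to a label. The exact encoding is defined by the slide's figure, which is not recoverable from the transcript, so the table here is an assumption. It is chosen only so that A and D are the states where both contigs have the same orientation, which matches the grouping of A with D and B with C in the ILP constraints.

```python
from collections import Counter

# Hypothetical encoding: (strand of R1 on contig i, strand of R2 on contig j).
# A and D: both contigs in the same orientation; B and C: opposite orientations.
STATE_TABLE = {
    ("+", "+"): "A",
    ("+", "-"): "B",
    ("-", "+"): "C",
    ("-", "-"): "D",
}

def pair_state(strand_r1, strand_r2):
    """Return the state label a read pair is consistent with."""
    return STATE_TABLE[(strand_r1, strand_r2)]

def state_counts(pairs):
    """Count read pairs per state for one contig pair (i, j).

    pairs: iterable of (strand_r1, strand_r2) tuples.
    """
    return Counter(pair_state(s1, s2) for s1, s2 in pairs)
```

For example, `state_counts([("+", "+"), ("+", "+"), ("-", "+")])` counts two pairs for state A and one for state C under this assumed encoding.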

The Scaffolding Problem

Given
• Contigs
• Paired reads

Find
• Orientation
• Ordering
• Relative distance

Goal
• Recreate true scaffolds

Possible Objectives
• Un-weighted
  • Max number of consistent read pairs
• Weighted
  • Each state is weighted: $W_{ij}^A, W_{ij}^B, W_{ij}^C, W_{ij}^D$
  • Overlap with repeat
  • Deviation of expected distance
  • …

Graph Representation

Using input we can define a scaffolding graph:

$G = (V, E)$

$V$, set of all contigs
$E$, set of

• This is an undirected multi-graph
• Assume it is connected
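A small sketch of the graph construction follows. The definition of E is cut off in the transcript, so the example assumes one edge per informative read pair linking two contigs (which is what makes the graph a multi-graph). The function names are illustrative. The second function splits the graph into connected components, so that each can be treated as the connected graph the slide assumes.

```python
def build_scaffolding_graph(contigs, links):
    """Build an undirected multi-graph as an adjacency dict.

    contigs: iterable of contig ids (the vertex set V).
    links: list of (contig_i, contig_j), assumed here to be one entry per
           informative read pair; parallel edges are kept.
    """
    adj = {c: [] for c in contigs}
    for i, j in links:
        adj[i].append(j)
        adj[j].append(i)
    return adj

def connected_components(adj):
    """Return the connected components as sorted lists of contig ids."""
    seen, components = set(), []
    for start in adj:
        if start in seen:
            continue
        seen.add(start)
        stack, component = [start], []
        while stack:  # iterative depth-first search
            v = stack.pop()
            component.append(v)
            for w in adj[v]:
                if w not in seen:
                    seen.add(w)
                    stack.append(w)
        components.append(sorted(component))
    return components
```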

Integer Linear Program Formulation

Variables
• Contig Pair State: $A_{ij}, B_{ij}, C_{ij}, D_{ij}$
• Contig Orientation: $S_i \in \{0,1\}$
• Pairwise Contig Consistency: $S_{ij} \in \{0,1\}$

Objective
• Maximize weight of consistent pairs

$\max \sum_{(i,j) \in E} \left( W_{ij}^A A_{ij} + W_{ij}^B B_{ij} + W_{ij}^C C_{ij} + W_{ij}^D D_{ij} \right)$

Constraints
• Pairwise Orientation

$S_{ij} \le S_j + S_i$

$S_{ij} \le 2 - S_i - S_j$

$S_{ij} \ge S_j - S_i$

$S_{ij} \ge S_i - S_j$

• Mutual Exclusivity

$A_{ij} + D_{ij} \le 1 - S_{ij}$

$B_{ij} + C_{ij} \le S_{ij}$

• Forbid 2 and 3 Cycles Explicitly
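The four pairwise-orientation inequalities are the standard linearization of an exclusive-or: they force the consistency variable for a pair to be 1 exactly when the two contig orientations differ. The sketch below checks this by enumeration and then solves a tiny instance by brute force. It is an illustration only: it assumes non-negative weights, ignores the cycle constraints, and enumerates orientations instead of calling an ILP solver. The function names and the weight dictionary layout are hypothetical.

```python
from itertools import product

def xor_is_forced():
    """Check that the four inequalities leave S_ij = S_i XOR S_j as the
    only feasible value for every choice of S_i and S_j."""
    for s_i, s_j in product((0, 1), repeat=2):
        feasible = [
            s_ij for s_ij in (0, 1)
            if s_ij <= s_i + s_j
            and s_ij <= 2 - s_i - s_j
            and s_ij >= s_j - s_i
            and s_ij >= s_i - s_j
        ]
        if feasible != [s_i ^ s_j]:
            return False
    return True

def brute_force(n, weights):
    """Maximize the objective over all orientations of n contigs.

    weights: {(i, j): {"A": w, "B": w, "C": w, "D": w}} with contig
    indices 0..n-1 (hypothetical layout). For a fixed orientation, the
    mutual-exclusivity constraints allow at most one state per pair:
    A or D when orientations agree, B or C when they differ.
    Returns (best objective value, orientation tuple).
    """
    best_value, best_s = -1, None
    for s in product((0, 1), repeat=n):
        value = 0
        for (i, j), w in weights.items():
            allowed = "BC" if s[i] ^ s[j] else "AD"
            value += max(w[state] for state in allowed)
        if value > best_value:
            best_value, best_s = value, s
    return best_value, best_s
```

Enumeration grows as 2^n, so this is only useful for checking a formulation on toy inputs.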

Graph Decomposition: Articulation Points

[Figure: graph decomposed at an articulation point; labels: solve, Articulation point]
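For reference, articulation points can be found with the classic depth-first search that tracks discovery times and low-links. The sketch below is a generic textbook version, not code from the talk. It is recursive, so very large graphs would need an iterative rewrite or a higher recursion limit. The adjacency format (`{node: set of neighbours}`) is an assumption.

```python
def articulation_points(adj):
    """Return the set of articulation points of an undirected graph.

    adj: {node: set(neighbours)}; parallel edges do not affect the result,
    so a multi-graph can be collapsed to sets first.
    """
    disc, low, points = {}, {}, set()
    counter = [0]

    def dfs(v, parent):
        disc[v] = low[v] = counter[0]
        counter[0] += 1
        children = 0
        for w in adj[v]:
            if w == parent:
                continue
            if w in disc:
                # Back edge: v can reach an earlier-discovered node.
                low[v] = min(low[v], disc[w])
            else:
                children += 1
                dfs(w, v)
                low[v] = min(low[v], low[w])
                # A non-root v separates w's subtree if the subtree
                # cannot reach above v.
                if parent is not None and low[w] >= disc[v]:
                    points.add(v)
        # The DFS root is an articulation point iff it has 2+ children.
        if parent is None and children > 1:
            points.add(v)

    for v in adj:
        if v not in disc:
            dfs(v, None)
    return points
```

Removing an articulation point splits the graph into parts that share only that vertex, which is what allows the parts to be solved separately.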

Graph Decomposition: 2-cuts

[Figure: graph decomposed at a 2-cut; nodes labeled + and -]

Non-Serial Dynamic Programming

• SPQR-tree to schedule decomposition

• Traverse tree using DFS

• NSDP utilizes solutions of previous stage in current stage

[Figure panels: Largest Connected Component, Largest Biconnected Component, Largest Triconnected Component]

Post Processing ILP Solution

ILP solution:
• May have cycles
• Not a total ordering for each connected component

[Figure: ILP solution on contigs A to F, and the corresponding bipartite graph with an outgoing and an incoming copy of each contig A to F]

Bipartite matching

Objectives:
• Max weight
• Max cardinality
• Max cardinality / Max weight
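The outgoing/incoming construction can be illustrated with a maximum-cardinality bipartite matching. The sketch below uses the standard augmenting-path method (Kuhn's algorithm) and covers only the max-cardinality objective; the max-weight variants are not shown. It assumes the ILP solution is available as a list of directed adjacencies `(u, v)`, which is a hypothetical representation, and it does not remove any cycles that remain after matching.

```python
def max_cardinality_matching(edges):
    """Pick at most one successor and one predecessor per contig.

    edges: list of directed adjacencies (u, v); u's outgoing copy is
    linked to v's incoming copy in the bipartite graph.
    Returns {u: v} for the matched adjacencies.
    """
    successors = {}
    for u, v in edges:
        successors.setdefault(u, []).append(v)

    match_in = {}  # incoming copy v -> outgoing copy u matched to it

    def try_augment(u, visited):
        for v in successors.get(u, []):
            if v in visited:
                continue
            visited.add(v)
            # Take v if it is free, or if its current partner can be
            # re-matched to some other incoming copy.
            if v not in match_in or try_augment(match_in[v], visited):
                match_in[v] = u
                return True
        return False

    for u in successors:
        try_augment(u, set())

    return {u: v for v, u in match_in.items()}
```

Each matched pair `u: v` reads as "v directly follows u", so following the pairs yields chains of contigs.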

Testing Framework

Venter Genome

Read Type    Total Reads  Total Bases  Avg length  Coverage
Sanger       31,861,976   2.79E+10     875         9.930637
SOLiD pairs  4.85E+08     2.42E+10     50          8.623028

4x Assembly

# Reads     # Bases in reads  # Contigs  # Bases in contigs  N50
112,00,000  1.1E+10           422,837    2.26E+09            7704

Testing Metrics

Computer Scientists
• Finding Scaffold = Binary Classification Test
• n contigs, try to predict n-1 adjacencies
• TP, FP, TN, FN, Sensitivity, PPV

Biologists (main focus)
• N50 (basically average scaffold size, ignore gaps)
• TP50
  • Break scaffold at incorrect edges, then find N50
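The metrics above can be written down compactly. The sketch below uses the usual definition of N50 (the length at which scaffolds of that size or larger cover at least half of the total bases) and computes TP50 as the slide describes. Scaffold length is taken as the sum of contig lengths, ignoring gaps. The data layout (adjacencies as `frozenset` pairs, scaffolds as non-empty lists of contig ids) is an assumption for illustration.

```python
def n50(lengths):
    """Largest length L such that pieces of length >= L cover at least
    half of the total bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0

def sensitivity_ppv(predicted, true):
    """predicted, true: sets of adjacencies, each a frozenset of two ids."""
    tp = len(predicted & true)
    fp = len(predicted - true)
    fn = len(true - predicted)
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    ppv = tp / (tp + fp) if tp + fp else 0.0
    return sensitivity, ppv

def tp50(scaffolds, contig_len, true):
    """Break each scaffold at incorrect adjacencies, then take N50.

    scaffolds: list of non-empty contig-id lists in scaffold order.
    contig_len: {contig_id: length}.
    """
    pieces = []
    for scaffold in scaffolds:
        current = contig_len[scaffold[0]]
        for a, b in zip(scaffold, scaffold[1:]):
            if frozenset((a, b)) in true:
                current += contig_len[b]
            else:
                pieces.append(current)  # incorrect edge: cut here
                current = contig_len[b]
        pieces.append(current)
    return n50(pieces)
```

TP50 can never exceed N50 for the same scaffolds, since breaking at incorrect edges only shortens pieces.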

Results

test case  method  bundle size  sensitivity  ppv      N50      TP50
10%        opera   2            81.13%       99.26%   27,567   27,327
10%        mip     2            59.01%       98.94%   19,988   19,755
10%        ilp     1            79.86%       98.58%   26,814   26,459
25%        opera   2            80.44%       98.27%   27,296   26,849
25%        mip     2            58.95%       97.56%   19,842   19,518
25%        ilp     1            79.30%       96.93%   26,684   26,079
100%       opera   3            pending      …        …        …
100%       mip     3            failed       n/a      n/a      n/a
100%       ilp     1            68.25%       89.90%   20,538   19,006

Conclusions

Success
• ILP solves scaffolding problem!
• NSDP works.

Improvements
• Finalize large test cases (then publish?!)
• Practical considerations (read style, multi-libraries, merge ctgs)

Future Work
• Where else can I apply NSDP?
• Scaffold before assembly??
• Structural Variation??

Questions?