APBC 2005 1

Improved Algorithms for Multiplex PCR Primer Set Selection with Amplification Length Constraints

Kishori M. Konwar

Ion I. Mandoiu

Alexander C. Russell

Alexander A. Shvartsman

CS&E Dept., Univ. of Connecticut

Combinatorial Optimization in Bioinformatics

• Fast growing number of applications
  – Sequence alignment
  – DNA sequencing
  – Haplotype inference
  – Pathogen identification
  – …
  – High-throughput assay design
    • Microarray probe selection
    • Microarray quality control
    • Universal tag arrays
    • …
    • This talk: Multiplex PCR primer set selection

Outline

• Background and problem formulation

• “Potential function” greedy algorithm

• Approximation guarantee

• Experimental results

• Conclusions

The Polymerase Chain Reaction

[Figure: PCR schematic showing the target sequence, the polymerase, and primers 1 and 2; the cycle is repeated 20-30 times]

Primer Pair Selection Problem

• Given:

• Genomic sequence around amplification locus

• Primer length k

• Amplification upper bound L

• Find: Forward and reverse primers of length k that hybridize within a distance of L of each other and optimize amplification efficiency (melting temperature, secondary structure, mis-priming, etc.)

[Figure: forward and reverse primers flanking the amplification locus, at most L apart; strand ends labeled 5' and 3']

PCR for SNP Genotyping

• Thousands of SNPs to be genotyped using hybridization methods (e.g., SBE)

• Selective PCR amplification needed to improve accuracy of detection steps
  – Whole-genome amplification not appropriate

• Simultaneous amplification OK → Multiplex PCR

Multiplex PCR

• How it works

– Multiple DNA fragments amplified simultaneously

– Each amplified fragment still defined by two primers

– A primer may participate in amplification of multiple targets

• Primer set selection
  – Currently done by time-consuming trial and error
  – An important objective is to minimize number of primers
    • Reduced assay cost
    • Higher effective concentration of primers → higher amplification efficiency
    • Reduced unintended amplification

Primer Set Selection Problem

• Given:

• Genomic sequences around n amplification loci

• Primer length k

• Amplification upper bound L

• Find:

• Minimum size set S of primers of length k such that, for each amplification locus, there are two primers in S hybridizing with the forward and reverse genomic sequences within a distance of L of each other
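The feasibility condition in this problem statement can be sketched as a small check. This is only an illustration under simplifying assumptions that are not in the slides: hybridization is reduced to exact k-mer matching, the amplification length is measured from the start of the forward site to the end of the reverse site, and the names `reverse_complement` and `is_feasible` are hypothetical.

```python
# Assumption: hybridization = exact k-mer match (the slides do not fix a model).
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA string over A, C, G, T."""
    return seq.translate(COMPLEMENT)[::-1]


def is_feasible(primers, loci, k, L):
    """Check that every locus has a forward and a reverse primer within L.

    primers: iterable of length-k strings (the candidate set S)
    loci:    list of (sequence, position) pairs; position is the 0-based
             index of the amplification locus inside the sequence
    """
    primer_set = set(primers)
    for seq, pos in loci:
        # Forward sites: k-mers lying entirely before the locus that equal a primer.
        fwd_starts = [s for s in range(0, pos - k + 1)
                      if seq[s:s + k] in primer_set]
        # Reverse sites: k-mers entirely after the locus whose reverse
        # complement is a primer; we record the end of each site.
        rev_ends = [s + k for s in range(pos + 1, len(seq) - k + 1)
                    if reverse_complement(seq[s:s + k]) in primer_set]
        if not fwd_starts or not rev_ends:
            return False
        # The shortest amplicon uses the closest site on each side of the locus.
        if min(rev_ends) - max(fwd_starts) > L:
            return False
    return True


seq = "GATTACAGGCTCA"  # toy sequence with the locus at index 6
print(is_feasible({"GAT", "TGA"}, [(seq, 6)], k=3, L=13))  # True
print(is_feasible({"GAT", "TGA"}, [(seq, 6)], k=3, L=12))  # False
```

In the toy example the only forward site starts at index 0 and the only reverse site ends at index 13, so the pair is usable exactly when L is at least 13.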

Previous Work on Primer Selection

• Well-studied problem: [Pearson et al.’96], [Linhart & Shamir’02], [Souvenir et al.’03], etc.

• Almost all problem formulations decouple selection of forward and reverse primers
  – To enforce bound of L on amplification length, select only primers that hybridize within L/2 bases of desired target
  – In worst case, this method can increase the number of primers by a factor of O(n) compared to the optimum

• [Pearson et al.’96] Greedy set cover algorithm gives O(ln n) approximation factor for the “decoupled” formulation

Previous Work (2)

• [Fernandes & Skiena’02] study primer set selection with uniqueness constraints

• Minimum Multi-Colored Subgraph Problem:
  – Vertices correspond to candidate primers
  – Edge colored by color i between u and v iff corresponding primers hybridize within a distance of L of each other around i-th amplification locus

– Goal is to find minimum size set of vertices inducing edges of all colors
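To make the Minimum Multi-Colored Subgraph objective concrete, here is a brute-force sketch on a toy instance. It is exponential in the number of vertices and only meant to illustrate the definition. The function name, the edge encoding as `(u, v, color)` triples, and the toy data are assumptions made for this example.

```python
from itertools import combinations


def min_multicolored_subgraph(vertices, edges, colors):
    """Smallest vertex set whose induced edges show every color.

    edges: iterable of (u, v, color) triples; parallel edges with
           different colors are allowed.
    Returns a set of vertices, or None if no vertex set works.
    """
    vertices = list(vertices)
    wanted = set(colors)
    # Try subsets in order of increasing size, so the first hit is minimum.
    for size in range(len(vertices) + 1):
        for subset in combinations(vertices, size):
            chosen = set(subset)
            induced = {c for (u, v, c) in edges if u in chosen and v in chosen}
            if wanted <= induced:
                return chosen
    return None


# Toy instance: vertices are candidate primers, colors are loci 0, 1, 2.
vertices = ["f1", "f2", "r1", "r2", "r3"]
edges = [
    ("f1", "r1", 0), ("f1", "r1", 1),  # one primer pair serving two loci
    ("f2", "r2", 1),
    ("f2", "r3", 2), ("f1", "r3", 2),
]
print(min_multicolored_subgraph(vertices, edges, colors={0, 1, 2}))
```

On this instance no pair of vertices induces all three colors, and the set {f1, r1, r3} is the first three-vertex set that does.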

The Set Cover Problem

Given:
- Universal set $U$ with $n$ elements
- Family of sets $(S_x, x \in X)$ covering all elements of $U$

Find:
- Minimum size subset $X'$ of $X$ s.t. $(S_x, x \in X')$ covers all elements of $U$

Selection w/ Length Constraints

• “Simultaneous set covering” problem:

- Ground set partitioned into n disjoint sets $S_i$ (one for each target), each with 2L elements

- Goal is to select minimum number of sets (i.e., primers) covering at least 1/2 of the elements (L of the 2L) in each $S_i$

[Figure: window of L bases on each side of SNP i]

Greedy Setcover Algorithm

Classical result (Johnson’74, Lovasz’75, Chvatal’79): the greedy setcover algorithm has an approximation factor of $H(n) = 1 + 1/2 + 1/3 + \dots + 1/n \le 1 + \ln n$
- The approximation factor is tight
- Cannot be approximated within a factor of $(1-\varepsilon)\ln n$ unless $\mathrm{NP} \subseteq \mathrm{DTIME}(n^{\log\log n})$

Greedy Algorithm:
- Repeatedly pick the set with most uncovered elements
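The greedy rule on this slide can be written in a few lines. This is a minimal sketch, not the authors' code; the function names and the toy instance are made up for illustration, and ties are broken by dictionary order.

```python
def greedy_set_cover(universe, family):
    """Greedy set cover: return the indices x of the chosen sets S_x.

    universe: iterable of elements
    family:   dict mapping an index x to the set S_x
    """
    uncovered = set(universe)
    chosen = []
    while uncovered:
        # Pick the set with the most still-uncovered elements.
        best = max(family, key=lambda x: len(family[x] & uncovered))
        if not family[best] & uncovered:
            raise ValueError("family does not cover the universe")
        chosen.append(best)
        uncovered -= family[best]
    return chosen


def harmonic(n):
    """H(n) = 1 + 1/2 + ... + 1/n, the greedy approximation factor."""
    return sum(1.0 / i for i in range(1, n + 1))


family = {"a": {1, 2, 3, 4}, "b": {1, 2, 5}, "c": {3, 4, 6}, "d": {5, 6}}
print(greedy_set_cover(range(1, 7), family))  # ['a', 'd']
print(round(harmonic(6), 2))                  # 2.45
```

Here greedy first takes "a" (four new elements), after which "d" is the only set covering both remaining elements.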

Potential Functions

• Set cover
  • $\Phi$ = #uncovered elements
  • Initially, $\Phi = n$
  • For feasible solutions, $\Phi = 0$

• Primer selection with length constraints
  • $\Phi$ = minimum number of elements that must still be covered = $\sum_i \max\{0, L - \#\text{covered elements in } S_i\}$
  • Initially, $\Phi = nL$
  • For feasible solutions, $\Phi = 0$
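A small sketch of the primer-selection potential follows. For the potential to start at nL and reach 0 on feasible solutions, the per-target term has to count the elements of each partition that are already covered, so that is what the code does. The data layout (`covers`, `partitions`) and the toy values are assumptions for this example.

```python
def potential(selected, covers, partitions, L):
    """Sum over targets i of max(0, L - #covered elements in S_i).

    selected:   set of chosen primers X'
    covers:     dict mapping each primer to the ground elements it covers
    partitions: list of the disjoint sets S_i, one per target
    """
    covered = set().union(*(covers[x] for x in selected))
    return sum(max(0, L - len(covered & S_i)) for S_i in partitions)


L = 2
# Two targets, each with 2L = 4 ground elements.
partitions = [{("t1", j) for j in range(2 * L)},
              {("t2", j) for j in range(2 * L)}]
covers = {
    "p": {("t1", 0), ("t1", 1), ("t2", 0)},
    "q": {("t2", 1), ("t2", 2)},
}
print(potential(set(), covers, partitions, L))       # 4, i.e. n * L
print(potential({"p"}, covers, partitions, L))       # 1
print(potential({"p", "q"}, covers, partitions, L))  # 0, feasible
```

The `max(0, ...)` clamp means that covering more than L elements of one target does not compensate for a shortfall in another.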

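General setting

• Potential function $\Phi(X') \ge 0$
• $\Phi(\{\}) = \Phi_{\max}$
• $\Phi(X') = 0$ for all feasible solutions
• $X'' \subseteq X' \Rightarrow \Phi(X'') \ge \Phi(X')$
• If $\Phi(X') > 0$, then there exists $x$ s.t. $\Phi(X'+x) < \Phi(X')$
• $X'' \subseteq X' \Rightarrow \Delta(x, X'') \ge \Delta(x, X')$ for every $x$, where $\Delta(x, X') := \Phi(X') - \Phi(X'+x)$

Objective: find minimum size set $X'$ with $\Phi(X') = 0$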

Generic Greedy Algorithm

• Theorem: The generic greedy algorithm has an approximation factor of $1 + \ln \Delta_{\max}$

• Corollary: $1 + \ln(nL)$ approximation for PCR primer selection (since $\Delta_{\max} \le \Phi(\{\}) = nL$)

$X' \leftarrow \{\}$
While $\Phi(X') > 0$
    Find $x$ with maximum $\Delta(x, X')$
    $X' \leftarrow X' + x$
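The loop above translates almost line for line into code. The sketch below takes the potential as a callable on frozensets and is demonstrated on set cover, where the potential is the number of uncovered elements. The names `generic_greedy` and `phi`, and the toy instance, are illustrative assumptions.

```python
def generic_greedy(candidates, phi):
    """Potential-function greedy: add the x with the largest decrease in phi.

    candidates: iterable of selectable items
    phi:        callable mapping a frozenset of items to a number >= 0
    Returns the chosen items in the order they were picked.
    """
    pool = list(candidates)
    chosen = frozenset()
    order = []
    while phi(chosen) > 0:
        current = phi(chosen)
        # Decrease of x given the current set: phi(X') - phi(X' + x).
        best = max(pool, key=lambda x: current - phi(chosen | {x}))
        if current - phi(chosen | {best}) <= 0:
            raise ValueError("no candidate decreases the potential")
        chosen = chosen | {best}
        order.append(best)
    return order


# Set cover as a special case: phi = number of uncovered elements.
universe = set(range(1, 7))
family = {"a": {1, 2, 3, 4}, "b": {1, 2, 5}, "c": {3, 4, 6}, "d": {5, 6}}


def phi(selected):
    covered = set().union(*(family[x] for x in selected))
    return len(universe - covered)


print(generic_greedy(family, phi))  # ['a', 'd']
```

Swapping in a different `phi`, such as the primer-selection potential, changes the problem without touching the loop, which is the point of the general setting.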

Proof Sketch (1)

• Let x1, x2,…,xg be the elements selected by greedy, in the order in which they are chosen

• Let x*1, x*2,…,x*k be the elements of an optimum solution

Charging scheme: xi charges to x*j a cost of

where ij = ∆(xi,{x1,…, xi-1}{x*1,…,x*j})

Fact 1: Each x*j gets charged a total cost of at most 1+ln ∆max

Proof Sketch (2)

Fact 2: Each xi charges at least 1 unit of cost

Experimental Setting

• Datasets extracted from NCBI databases, L=1000

• Dell PowerEdge 2.8GHz Xeon

• Compared algorithms

– G-FIX: greedy primer cover algorithm [Pearson et al.]

– MIPS-PT: iterative beam-search heuristic [Souvenir et al.]

• Restrict primers to L/2 bases around amplification locus

– G-VAR: naïve modification of G-FIX

• First selected primer can be up to L bases away

• Opposite sequence truncated after selecting first primer

– G-POT: potential function driven greedy algorithm

Experimental Results, NCBI tests

G-FIX: Pearson et al.; G-VAR: G-FIX with dynamic truncation; MIPS-PT: Souvenir et al.; G-POT: potential-function greedy. CPU times are in seconds.

#Targets  k   G-FIX            G-VAR            MIPS-PT          G-POT
              #Primers  CPU s  #Primers  CPU s  #Primers  CPU s  #Primers  CPU s
20        8   7         0.04   7         0.08   8         10     6         0.10
20        10  9         0.03   10        0.08   13        15     9         0.08
20        12  14        0.04   13        0.08   18        26     13        0.11
50        8   13        0.13   15        0.30   21        48     10        0.32
50        10  23        0.22   24        0.36   30        150    18        0.33
50        12  31        0.14   32        0.30   41        246    29        0.28
100       8   17        0.49   20        0.89   32        226    14        0.58
100       10  37        0.37   37        0.72   50        844    31        0.75
100       12  53        0.59   48        0.84   75        2601   42        0.61
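One way to read these counts is relative to 2n, the number of primers needed if every target had its own dedicated pair. The sketch below does that arithmetic on the #Primers values from the NCBI results table. The dictionary layout and the helper name `percent_of_2n` are assumptions for this example, and no outputs beyond the table's own numbers are implied.

```python
ALGORITHMS = ("G-FIX", "G-VAR", "MIPS-PT", "G-POT")

# (number of targets n, primer length k) -> #Primers per algorithm,
# copied from the results table in the order listed in ALGORITHMS.
PRIMERS = {
    (20, 8): (7, 7, 8, 6),
    (20, 10): (9, 10, 13, 9),
    (20, 12): (14, 13, 18, 13),
    (50, 8): (13, 15, 21, 10),
    (50, 10): (23, 24, 30, 18),
    (50, 12): (31, 32, 41, 29),
    (100, 8): (17, 20, 32, 14),
    (100, 10): (37, 37, 50, 31),
    (100, 12): (53, 48, 75, 42),
}


def percent_of_2n(num_primers, n):
    """Primer count as a percentage of the trivial bound of 2n primers."""
    return 100.0 * num_primers / (2 * n)


for (n, k), counts in sorted(PRIMERS.items()):
    row = ", ".join(f"{name}: {percent_of_2n(c, n):.1f}%"
                    for name, c in zip(ALGORITHMS, counts))
    print(f"n={n:3d} k={k:2d}  {row}")
```

For example, G-POT's 14 primers at n=100, k=8 amount to 7% of 2n, against 8.5% for G-FIX on the same instance.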

[Figure: #primers as a percentage of 2n (l=8), plotted against n]

[Figure: CPU seconds (l=10), plotted against n]

Conclusions

• Numerous combinatorial optimization problems arising in the area of high-throughput assay design

• Theoretical insights such as approximation results can lead to significant practical improvements

• Choosing the proper problem model is critical to solution efficiency

Ongoing Work & Open Problems

• Degenerate primers

• Accurate hybridization model (melting temperature, secondary structure, cross hybridization,…)
  – In-silico MP-PCR simulator

• Partition into multiple multiplexed PCR reactions (Aumann et al. WABI’03)

Acknowledgments

• Financial support from UCONN’s Research Foundation
