APBC 2005 1
Improved Algorithms for Multiplex PCR Primer Set Selection with Amplification Length Constraints
Kishori M. Konwar
Ion I. Mandoiu
Alexander C. Russell
Alexander A. Shvartsman
CS&E Dept., Univ. of Connecticut
Combinatorial Optimization in Bioinformatics
• Fast growing number of applications
– Sequence alignment
– DNA sequencing
– Haplotype inference
– Pathogen identification
– …
– High-throughput assay design
    • Microarray probe selection
    • Microarray quality control
    • Universal tag arrays
    • …
• This talk: Multiplex PCR primer set selection
Outline
• Background and problem formulation
• “Potential function” greedy algorithm
• Approximation guarantee
• Experimental results
• Conclusions
The Polymerase Chain Reaction
[Figure: PCR schematic showing the target sequence, the polymerase, and primers 1 and 2; the cycle is repeated 20-30 times]
Primer Pair Selection Problem
• Given:
• Genomic sequence around amplification locus
• Primer length k
• Amplification upper bound L
• Find: Forward and reverse primers of length k that hybridize within a distance of L of each other and optimize amplification efficiency (melting temperature, secondary structure, mis-priming, etc.)
[Figure: forward and reverse primers on opposite strands (5' and 3' ends marked), at most L apart, flanking the amplification locus]
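A minimal Python sketch of the search space in this problem. It assumes exact-match hybridization and measures amplification length from the start of the forward site to the end of the reverse site; both are simplifying assumptions, and the names `reverse_complement` and `candidate_pairs` are hypothetical. It only enumerates pairs within distance L; ranking them by melting temperature, secondary structure, or mis-priming is left out.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA string over A, C, G, T."""
    comp = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(comp[base] for base in reversed(seq))


def candidate_pairs(seq, locus, k, L):
    """Yield (forward, reverse, length) for every primer pair around `locus`.

    seq   : sense strand, written 5'->3'
    locus : 0-based index of the amplification locus in seq
    k     : primer length
    L     : upper bound on the amplification length

    Assumption: a forward primer is a k-mer lying entirely before the locus,
    a reverse primer is the reverse complement of a k-mer lying entirely
    after it, and the amplification length is r_end - f_start.
    """
    for f_start in range(max(0, locus - L), locus - k + 1):
        last_end = min(len(seq), f_start + L)
        for r_end in range(locus + 1 + k, last_end + 1):
            forward = seq[f_start:f_start + k]
            reverse = reverse_complement(seq[r_end - k:r_end])
            yield forward, reverse, r_end - f_start


if __name__ == "__main__":
    for pair in candidate_pairs("ACGTACGTAC", locus=4, k=2, L=6):
        print(pair)
```

A real selector would score each enumerated pair and keep the best one; the enumeration alone already shows that the number of candidates grows roughly quadratically in L.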
PCR for SNP Genotyping
• Thousands of SNPs to be genotyped using hybridization methods (e.g., SBE)
• Selective PCR amplification needed to improve accuracy of detection steps
– Whole-genome amplification not appropriate
• Simultaneous amplification OK ⇒ Multiplex PCR
Multiplex PCR
• How it works
– Multiple DNA fragments amplified simultaneously
– Each amplified fragment still defined by two primers
– A primer may participate in amplification of multiple targets
• Primer set selection
– Currently done by time-consuming trial and error
– An important objective is to minimize number of primers
    • Reduced assay cost
    • Higher effective concentration of primers ⇒ higher amplification efficiency
    • Reduced unintended amplification
Primer Set Selection Problem
• Given:
• Genomic sequences around n amplification loci
• Primer length k
• Amplification upper bound L
• Find:
• Minimum size set S of primers of length k such that, for each amplification locus, there are two primers in S hybridizing with the forward and reverse genomic sequences within a distance of L of each other
Previous Work on Primer Selection
• Well-studied problem: [Pearson et al. 96], [Linhart & Shamir’02], [Souvenir et al.’03], etc.
• Almost all problem formulations decouple selection of forward and reverse primers
– To enforce bound of L on amplification length, select only primers that hybridize within L/2 bases of desired target
– In worst case, this method can increase the number of primers by a factor of O(n) compared to the optimum
• [Pearson et al. 96] Greedy set cover algorithm gives O(ln n) approximation factor for the “decoupled” formulation
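The difference between the decoupled and the length-constrained formulation can be sketched in a few lines of Python. The sketch assumes that the amplification length is approximated by the sum of the two primer distances from the locus; the function names are hypothetical.

```python
def feasible_coupled(d_fwd, d_rev, L):
    """Length-constrained formulation: only the total distance is bounded."""
    return d_fwd + d_rev <= L


def feasible_decoupled(d_fwd, d_rev, L):
    """Decoupled formulation: each primer must lie within L/2 of the target."""
    return d_fwd <= L / 2 and d_rev <= L / 2


if __name__ == "__main__":
    L = 1000
    # A forward primer close to the locus paired with a distant reverse primer.
    print(feasible_coupled(100, 800, L))    # pair respects the bound L
    print(feasible_decoupled(100, 800, L))  # pair is rejected by the L/2 rule
```

Every pair accepted by the decoupled rule is also accepted by the coupled one, but not the other way around, which is why the decoupled formulation can miss primers shared across many targets.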
Previous Work (2)
• [Fernandes&Skiena’02] study primer set selection with uniqueness constraints
• Minimum Multi-Colored Subgraph Problem:
– Vertices correspond to candidate primers
– Edge colored by color i between u and v iff corresponding primers hybridize within a distance of L of each other around i-th amplification locus
– Goal is to find minimum size set of vertices inducing edges of all colors
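As an illustration of the feasibility condition in the multi-colored subgraph formulation, the following Python sketch checks whether a vertex set induces at least one edge of every color. The edge-list representation and the function names are assumptions made for this example.

```python
def induced_colors(edges, vertices):
    """Colors of the edges whose two endpoints both belong to `vertices`.

    edges    : iterable of (u, v, color) triples
    vertices : iterable of selected vertices (candidate primers)
    """
    chosen = set(vertices)
    return {color for u, v, color in edges if u in chosen and v in chosen}


def covers_all_colors(edges, vertices, n_colors):
    """True if the selected vertices induce an edge of each color 0..n_colors-1."""
    return induced_colors(edges, vertices) == set(range(n_colors))


if __name__ == "__main__":
    # Two amplification loci (colors 0 and 1), five candidate primers.
    edges = [("p1", "p2", 0), ("p2", "p3", 1), ("p1", "p3", 1), ("p4", "p5", 0)]
    print(covers_all_colors(edges, {"p1", "p2", "p3"}, 2))
    print(covers_all_colors(edges, {"p1", "p2"}, 2))
```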
The Set Cover Problem
Given:
- Universal set U with n elements
- Family of sets (Sx, x∈X) covering all elements of U
Find:
- Minimum size subset X’ of X s.t. (Sx, x∈X’) covers all elements of U
Selection w/ Length Constraints
• “Simultaneous set covering” problem:
- Ground set partitioned into n disjoint sets Si (one for each target), each with 2L elements
- Goal is to select minimum number of sets == primers covering at least 1/2 of the elements in each partition
[Figure: SNPi flanked by L bases on each side]
Greedy Setcover Algorithm
Classical result (Johnson’74, Lovasz’75, Chvatal’79): the greedy setcover algorithm has an approximation factor of H(n) = 1+1/2+1/3+…+1/n < 1+ln(n)
- The approximation factor is tight
- Cannot be approximated within a factor of (1-ε)ln(n) unless NP ⊆ DTIME(n^O(log log n))
Greedy Algorithm:
- Repeatedly pick the set with most uncovered elements
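The greedy rule can be sketched in Python as follows. The dictionary-based representation of the family and the names `greedy_set_cover` and `harmonic` are choices made for this example, and ties are broken by dictionary order, which the classical analysis does not depend on.

```python
import math


def harmonic(n):
    """H(n) = 1 + 1/2 + ... + 1/n, the greedy approximation factor."""
    return sum(1.0 / i for i in range(1, n + 1))


def greedy_set_cover(universe, family):
    """Repeatedly pick the set covering the most still-uncovered elements.

    universe : iterable of elements
    family   : dict mapping a set name x to the set S_x
    Returns the list of chosen names in the order they were picked.
    """
    uncovered = set(universe)
    chosen = []
    while uncovered:
        best = max(family, key=lambda x: len(family[x] & uncovered))
        if not family[best] & uncovered:
            raise ValueError("the family does not cover the universe")
        chosen.append(best)
        uncovered -= family[best]
    return chosen


if __name__ == "__main__":
    family = {"a": {1, 2, 3}, "b": {3, 4}, "c": {4, 5}, "d": {1}}
    print(greedy_set_cover({1, 2, 3, 4, 5}, family))
    print(harmonic(5), 1 + math.log(5))
```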
Potential Functions
• Set cover
    • Φ = #uncovered elements
    • Initially, Φ = n
    • For feasible solutions, Φ = 0
• Primer selection with length constraints
    • Φ = minimum number of elements that must still be covered = Σi max{0, L - #covered elements in Si}
    • Initially, Φ = nL
    • For feasible solutions, Φ = 0
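Both potentials can be written down directly. The Python sketch below assumes an abstract representation in which every candidate (set or primer) is mapped to the ground elements it covers; for primer selection, each target contributes the number of elements still missing to reach L covered elements in its part Si, clipped at zero. Function and parameter names are hypothetical.

```python
def set_cover_potential(universe, family, chosen):
    """Number of elements of the universe not covered by the chosen sets."""
    covered = set().union(*(family[x] for x in chosen))
    return len(set(universe) - covered)


def primer_potential(partition, covers, chosen, L):
    """Sum over the parts S_i of max(0, L - number of covered elements in S_i).

    partition : list of disjoint sets S_i, one per target, each with 2L elements
    covers    : dict mapping a primer to the set of ground elements it covers
    chosen    : iterable of selected primers
    """
    covered = set().union(*(covers[p] for p in chosen))
    return sum(max(0, L - len(part & covered)) for part in partition)


if __name__ == "__main__":
    L = 2
    partition = [{("t1", j) for j in range(2 * L)},
                 {("t2", j) for j in range(2 * L)}]
    covers = {"p": {("t1", 0), ("t1", 1), ("t2", 0)},
              "q": {("t2", 1), ("t2", 2), ("t2", 3)}}
    for chosen in ([], ["p"], ["p", "q"]):
        print(chosen, primer_potential(partition, covers, chosen, L))
```

With no primer selected the second potential equals nL, and it reaches 0 exactly when at least half of every part is covered.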
General setting
• Potential function Φ(X’) ≥ 0
• Φ({}) = Φmax
• Φ(X’) = 0 for all feasible solutions
• X’’ ⊆ X’ ⇒ Φ(X’’) ≥ Φ(X’)
• If Φ(X’) > 0, then there exists x s.t. Φ(X’+x) < Φ(X’)
• X’’ ⊆ X’ ⇒ ∆(x,X’’) ≥ ∆(x,X’) for every x, where ∆(x,X’) := Φ(X’) - Φ(X’+x)
• Objective: find minimum size set X’ with Φ(X’) = 0
Generic Greedy Algorithm
• Theorem: The generic greedy algorithm has an approximation factor of 1+ln ∆max
• Corollary: 1+ln(nL) approximation for PCR primer selection
X’ ← {}
While Φ(X’) > 0
    Find x with maximum ∆(x,X’)
    X’ ← X’ + x
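A direct Python transcription of the generic greedy loop, as a sketch: the potential is passed in as a function `phi` on lists of chosen elements, and the set-cover potential is used only as an example instance. The names `generic_greedy`, `candidates`, and `phi` are hypothetical, and re-evaluating `phi` for every candidate in every round is the simplest possible implementation, not an efficient one.

```python
def generic_greedy(candidates, phi):
    """Greedy driven by a potential function.

    candidates : iterable of elements x that may be added to the solution
    phi        : function mapping a list of chosen elements to a number >= 0,
                 equal to 0 exactly for feasible solutions
    Returns the chosen elements in the order they were added.
    """
    candidates = list(candidates)
    chosen = []
    while phi(chosen) > 0:
        current = phi(chosen)
        # Delta(x, X') = Phi(X') - Phi(X' + x); pick the largest decrease.
        best = max(candidates, key=lambda x: current - phi(chosen + [x]))
        if current - phi(chosen + [best]) <= 0:
            raise ValueError("no candidate decreases the potential")
        chosen.append(best)
    return chosen


if __name__ == "__main__":
    universe = {1, 2, 3, 4, 5}
    family = {"a": {1, 2, 3}, "b": {3, 4}, "c": {4, 5}, "d": {1}}

    def phi(chosen):
        # Set-cover potential: number of uncovered elements.
        covered = set().union(*(family[x] for x in chosen))
        return len(universe - covered)

    print(generic_greedy(family, phi))
```

Plugging in the primer-selection potential instead of the set-cover one turns the same loop into a potential-function greedy for primer sets.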
Proof Sketch (1)
• Let x1, x2,…,xg be the elements selected by greedy, in the order in which they are chosen
• Let x*1, x*2,…,x*k be the elements of an optimum solution
Charging scheme: xi charges to x*j a cost of [formula missing from the transcript],
where δij = ∆(xi, {x1,…,xi-1} ∪ {x*1,…,x*j})
Fact 1: Each x*j gets charged a total cost of at most 1+ln ∆max
Proof Sketch (2)
Fact 2: Each xi charges at least 1 unit of cost
Experimental Setting
• Datasets extracted from NCBI databases, L=1000
• Dell PowerEdge 2.8GHz Xeon
• Compared algorithms
– G-FIX: greedy primer cover algorithm [Pearson et al.]
– MIPS-PT: iterative beam-search heuristic [Souvenir et al.]
• Restrict primers to L/2 bases around amplification locus
– G-VAR: naïve modification of G-FIX
• First selected primer can be up to L bases away
• Opposite sequence truncated after selecting first primer
– G-POT: potential function driven greedy algorithm
Experimental Results, NCBI tests
G-FIX: Pearson et al.; G-VAR: G-FIX with dynamic truncation; MIPS-PT: Souvenir et al.; G-POT: potential-function greedy. CPU times are in seconds.

              G-FIX              G-VAR              MIPS-PT            G-POT
#Targets  k   #Primers  CPU sec  #Primers  CPU sec  #Primers  CPU sec  #Primers  CPU sec
20        8   7         0.04     7         0.08     8         10       6         0.10
20        10  9         0.03     10        0.08     13        15       9         0.08
20        12  14        0.04     13        0.08     18        26       13        0.11
50        8   13        0.13     15        0.30     21        48       10        0.32
50        10  23        0.22     24        0.36     30        150      18        0.33
50        12  31        0.14     32        0.30     41        246      29        0.28
100       8   17        0.49     20        0.89     32        226      14        0.58
100       10  37        0.37     37        0.72     50        844      31        0.75
100       12  53        0.59     48        0.84     75        2601     42        0.61
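One way to read these numbers is to normalize the primer counts by 2n, the number of primers used when every target gets its own forward and reverse primer. The Python sketch below does this for the primer counts reported above; the dictionary layout and the name `percent_of_2n` are choices made for this example.

```python
# (number of targets n, primer length k) -> number of primers per algorithm,
# copied from the NCBI results above.
primers = {
    (20, 8):   {"G-FIX": 7,  "G-VAR": 7,  "MIPS-PT": 8,  "G-POT": 6},
    (20, 10):  {"G-FIX": 9,  "G-VAR": 10, "MIPS-PT": 13, "G-POT": 9},
    (20, 12):  {"G-FIX": 14, "G-VAR": 13, "MIPS-PT": 18, "G-POT": 13},
    (50, 8):   {"G-FIX": 13, "G-VAR": 15, "MIPS-PT": 21, "G-POT": 10},
    (50, 10):  {"G-FIX": 23, "G-VAR": 24, "MIPS-PT": 30, "G-POT": 18},
    (50, 12):  {"G-FIX": 31, "G-VAR": 32, "MIPS-PT": 41, "G-POT": 29},
    (100, 8):  {"G-FIX": 17, "G-VAR": 20, "MIPS-PT": 32, "G-POT": 14},
    (100, 10): {"G-FIX": 37, "G-VAR": 37, "MIPS-PT": 50, "G-POT": 31},
    (100, 12): {"G-FIX": 53, "G-VAR": 48, "MIPS-PT": 75, "G-POT": 42},
}


def percent_of_2n(num_primers, n):
    """Primer count as a percentage of 2n (one private pair per target)."""
    return 100.0 * num_primers / (2 * n)


if __name__ == "__main__":
    for (n, k), row in sorted(primers.items()):
        shares = {alg: round(percent_of_2n(count, n), 1) for alg, count in row.items()}
        print(n, k, shares)
```

In all nine rows G-POT needs no more primers than any of the other three algorithms.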
[Figure: #primers as percentage of 2n (l=8), plotted against n]
[Figure: #primers as percentage of 2n (l=10), plotted against n]
[Figure: #primers as percentage of 2n (l=12), plotted against n]
[Figure: CPU seconds (l=10), plotted against n]
Conclusions
• Numerous combinatorial optimization problems arising in the area of high-throughput assay design
• Theoretical insights such as approximation results can lead to significant practical improvements
• Choosing the proper problem model is critical to solution efficiency
Ongoing Work & Open Problems
• Degenerate primers
• Accurate hybridization model (melting temperature, secondary structure, cross hybridization,…)
– In-silico MP-PCR simulator
• Partition into multiple multiplexed PCR reactions (Aumann et al. WABI’03)
Acknowledgments
• Financial support from UCONN’s Research Foundation