Multi-seed lossless filtration
![Page 1: Multi-seed lossless filtration](https://reader036.fdocuments.us/reader036/viewer/2022062809/5681593c550346895dc678fa/html5/thumbnails/1.jpg)
Multi-seed lossless filtration
Gregory Kucherov, Laurent Noé
LORIA/INRIA, Nancy, France
Mikhail Roytberg
Institute of Mathematical Problems in Biology, Puschino, Russia
CPM (Istanbul), July 5-7, 2004
![Page 2: Multi-seed lossless filtration](https://reader036.fdocuments.us/reader036/viewer/2022062809/5681593c550346895dc678fa/html5/thumbnails/2.jpg)
Text filtration: general principle
potential matches
![Page 3: Multi-seed lossless filtration](https://reader036.fdocuments.us/reader036/viewer/2022062809/5681593c550346895dc678fa/html5/thumbnails/3.jpg)
Text filtration: general principle
potential matches
![Page 4: Multi-seed lossless filtration](https://reader036.fdocuments.us/reader036/viewer/2022062809/5681593c550346895dc678fa/html5/thumbnails/4.jpg)
Text filtration: general principle
lossless and lossy filters
true match
![Page 5: Multi-seed lossless filtration](https://reader036.fdocuments.us/reader036/viewer/2022062809/5681593c550346895dc678fa/html5/thumbnails/5.jpg)
Filtration applied to sequence comparison
potential similarities
![Page 6: Multi-seed lossless filtration](https://reader036.fdocuments.us/reader036/viewer/2022062809/5681593c550346895dc678fa/html5/thumbnails/6.jpg)
Filtration applied to sequence alignment
potential similarities
![Page 7: Multi-seed lossless filtration](https://reader036.fdocuments.us/reader036/viewer/2022062809/5681593c550346895dc678fa/html5/thumbnails/7.jpg)
Filtration applied to sequence alignment
true similarities
![Page 8: Multi-seed lossless filtration](https://reader036.fdocuments.us/reader036/viewer/2022062809/5681593c550346895dc678fa/html5/thumbnails/8.jpg)
Gapless similarities. Hamming distance.
Similarities are defined through Hamming distance
GCTACGACTTCGAGCTGC
...CTCAGCTATGACCTCGAGCGGCCTATCTA...
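The alignment on this slide can be checked with a short sketch. It slides the 18-letter pattern along the text fragment shown above and reports every window within a given Hamming distance. The function names `hamming` and `windows_within` are illustrative and do not come from the talk.

```python
def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between two equal-length strings."""
    if len(a) != len(b):
        raise ValueError("strings must have equal length")
    return sum(x != y for x, y in zip(a, b))


def windows_within(text: str, pattern: str, k: int) -> list[tuple[int, int]]:
    """All (offset, distance) pairs where the window of text is within k mismatches."""
    m = len(pattern)
    found = []
    for offset in range(len(text) - m + 1):
        d = hamming(text[offset:offset + m], pattern)
        if d <= k:
            found.append((offset, d))
    return found


pattern = "GCTACGACTTCGAGCTGC"
text = "CTCAGCTATGACCTCGAGCGGCCTATCTA"  # fragment from the slide, ellipses dropped
print(windows_within(text, pattern, 3))
```

The window starting at offset 4 (0-based) differs from the pattern in three positions, so it is a similarity of length 18 with 3 mismatches.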
![Page 9: Multi-seed lossless filtration](https://reader036.fdocuments.us/reader036/viewer/2022062809/5681593c550346895dc678fa/html5/thumbnails/9.jpg)
Gapless similarities. Hamming distance.
Similarities are defined through Hamming distance
![Page 10: Multi-seed lossless filtration](https://reader036.fdocuments.us/reader036/viewer/2022062809/5681593c550346895dc678fa/html5/thumbnails/10.jpg)
Gapless similarities. Hamming distance.
Similarities are defined through Hamming distance
(m,k)-problem, (m,k)-instances (m: similarity length, k: maximal number of mismatches)
![Page 11: Multi-seed lossless filtration](https://reader036.fdocuments.us/reader036/viewer/2022062809/5681593c550346895dc678fa/html5/thumbnails/11.jpg)
Gapless similarities. Hamming distance.
Similarities are defined through Hamming distance
(m,k)-problem, (m,k)-instances (m: similarity length, k: maximal number of mismatches)
This work: lossless filtering
![Page 12: Multi-seed lossless filtration](https://reader036.fdocuments.us/reader036/viewer/2022062809/5681593c550346895dc678fa/html5/thumbnails/12.jpg)
Filtering by contiguous fragment
PEX (Navarro & Raffinot 2002)
– Searching for a contiguous pattern of length $\lfloor m/(k+1) \rfloor$
PEX with errors
– Searching for a contiguous pattern with l possible errors
• requires retrieval of all l-variants in the index. Efficient for
– small alphabets (DNA, RNA)
– relatively small l (<= 2)
Example: (m,k) = (18,3)
#### (conserved fragment, no errors)
######### (1) (fragment with 1 possible error)
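The two fragment lengths drawn on this slide (4 and 9 for m=18, k=3) follow from a pigeonhole argument: if the window is cut into p pieces and p > k/(l+1), at least one piece carries at most l errors. The sketch below computes the resulting fragment length. The formula is a reconstruction from that argument rather than a quotation of the slide, and `pex_seed_length` is a hypothetical helper name.

```python
def pex_seed_length(m: int, k: int, l: int = 0) -> int:
    """Length of the contiguous fragment guaranteed to occur with at most l errors.

    Assumption: the window of length m is split into floor(k/(l+1)) + 1
    pieces of (almost) equal length; with k errors in total, at least one
    piece contains at most l of them.
    """
    pieces = k // (l + 1) + 1
    return m // pieces


print(pex_seed_length(18, 3))       # exact fragment (l = 0)
print(pex_seed_length(18, 3, l=1))  # fragment with one possible error
```

For the (18,3)-problem this gives 4 for the exact fragment and 9 for the fragment with one error, matching the seeds `####` and `#########` on the slide.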
![Page 13: Multi-seed lossless filtration](https://reader036.fdocuments.us/reader036/viewer/2022062809/5681593c550346895dc678fa/html5/thumbnails/13.jpg)
Superposition of two filters
Pevzner & Waterman 1995
Idea: combine PEX with another filter based on a regularly-spaced seed
PEX: ####
Spaced PEX (matches occurring at every k+1 positions): #---#---#---#
[Figure: copies of #---#---#---# shifted along the similarity; the distance between successive # is k+1]
![Page 14: Multi-seed lossless filtration](https://reader036.fdocuments.us/reader036/viewer/2022062809/5681593c550346895dc678fa/html5/thumbnails/14.jpg)
Spaced seeds
Spaced seeds (spaced Q-grams)
– proposed by Burkhardt & Kärkkäinen (CPM 2001) for solving (m,k)-problems
Principle
– Searching for spaced rather than contiguous patterns
– Selectivity
• defined by the weight of the seed (number of #'s)
Example: ###-##
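A minimal sketch of what "searching for a spaced pattern" means. A similarity is encoded here as a string of `1` (match) and `0` (mismatch), which is an encoding chosen for this example; a seed hits at a position when every `#` faces a `1`, while `-` (joker) positions are ignored. The names `seed_weight` and `seed_hits` are illustrative.

```python
def seed_weight(seed: str) -> int:
    """Weight of a seed: the number of '#' (required match) positions."""
    return seed.count("#")


def seed_hits(seed: str, similarity: str) -> list[int]:
    """Positions where the seed occurs in a 0/1 similarity string."""
    required = [j for j, c in enumerate(seed) if c == "#"]
    span = len(seed)
    return [
        p
        for p in range(len(similarity) - span + 1)
        if all(similarity[p + j] == "1" for j in required)
    ]


print(seed_weight("###-##"))           # 5
print(seed_hits("###-##", "1110110"))  # [0]: the mismatch falls under the joker
print(seed_hits("####", "1110110"))    # []: no run of four matches
```

The second and third calls show the point of spacing: the same similarity is caught by the spaced seed and missed by the contiguous one.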
![Page 15: Multi-seed lossless filtration](https://reader036.fdocuments.us/reader036/viewer/2022062809/5681593c550346895dc678fa/html5/thumbnails/15.jpg)
Example: (18,3)-problem
###-##
[Figure: copies of the seed ###-## aligned against (18,3)-instances]
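Whether a seed (or a family of seeds) solves a small (m,k)-problem can be checked by brute force: enumerate every placement of k mismatches in a window of length m and look for at least one seed occurrence that avoids them all. This is only a sketch for small parameters, not the method used in the talk; `solves` is a hypothetical name.

```python
from itertools import combinations


def solves(family: list[str], m: int, k: int) -> bool:
    """True if every window of length m with k mismatches is hit by some seed."""
    required = [[j for j, c in enumerate(seed) if c == "#"] for seed in family]
    spans = [len(seed) for seed in family]
    for errors in combinations(range(m), k):
        bad = set(errors)
        detected = any(
            all(p + j not in bad for j in req)
            for req, span in zip(required, spans)
            for p in range(m - span + 1)
        )
        if not detected:
            return False  # this instance escapes every seed: the filter is lossy
    return True


print(solves(["###-##"], 18, 3))  # True
print(solves(["####"], 18, 3))    # True
print(solves(["#####"], 18, 3))   # False
```

The contiguous seed of weight 5 fails (mismatches at positions 4, 9 and 14 leave runs of 4, 4, 4 and 3 matches), whereas the spaced seed of the same weight is lossless.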
![Page 16: Multi-seed lossless filtration](https://reader036.fdocuments.us/reader036/viewer/2022062809/5681593c550346895dc678fa/html5/thumbnails/16.jpg)
Spaced seeds for sequence comparison
Ma, Tromp, Li 2002 (PatternHunter)
Estimating seed sensitivity: Keich et al 2002, Buhler et al 2003, Brejova et al 2003, Choi&Zhang 2004, Choi et al 2004, Kucherov et al 2004, ...
Extended seed models: BLASTZ 2003, Brejova et al 2003, Chen&Sung 2003, Noé&Kucherov 2004, ...
![Page 17: Multi-seed lossless filtration](https://reader036.fdocuments.us/reader036/viewer/2022062809/5681593c550346895dc678fa/html5/thumbnails/17.jpg)
Families of spaced seeds
This work: lossless filtration using spaced seed families (extension of Burkhardt & Kärkkäinen 2001)
– single filter based on several distinct seeds
– each seed detects a part of (m,k)-instances but together they must detect all (m,k)-instances
Independent work (lossy seed families for sequence alignment):
– Li, Ma, Kisman, Tromp 2004 (PatternHunter II)
– Xu, Brown, Li, Ma, this conference
– Sun, Buhler, RECOMB 2004 (Mandala)
![Page 18: Multi-seed lossless filtration](https://reader036.fdocuments.us/reader036/viewer/2022062809/5681593c550346895dc678fa/html5/thumbnails/18.jpg)
Example: (18,3)-problem (cont)
F = { ##-#-#### , ###---#--##-# }
Family F solves the (18,3)-problem
– every (18,3)-instance contains an occurrence of a seed of F
– all seeds of the family have the same weight 7
![Page 19: Multi-seed lossless filtration](https://reader036.fdocuments.us/reader036/viewer/2022062809/5681593c550346895dc678fa/html5/thumbnails/19.jpg)
Example: (18,3)-problem (cont)
w=7: { ##-#-#### , ###---#--##-# }
w=9: { ##-##-##### , ###-####--## , ###-##---#-### , ##----####-### , ###---#-#-##-## , ###-#-#-#-----### }
![Page 20: Multi-seed lossless filtration](https://reader036.fdocuments.us/reader036/viewer/2022062809/5681593c550346895dc678fa/html5/thumbnails/20.jpg)
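Comparative selectivity
#### : w=4, selectivity $\sim 39 \cdot 10^{-4}$
###-## : w=5, selectivity $\sim 9.8 \cdot 10^{-4}$
{ ##-#-#### , ###---#--##-# } : w=7, selectivity $\sim 1.2 \cdot 10^{-4}$
{ ##-##-##### , ###-####--## , ###-##---#-### , ##----####-### , ###---#-#-##-## , ###-#-#-#-----### } : w=9, selectivity $\sim 0.23 \cdot 10^{-4}$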
Selectivity of families on Bernoulli similarities (p(match) = 1/4) estimated as the probability for one of the seeds to occur at a given position
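The figures on this slide can be reproduced with a one-line estimate: under the Bernoulli model each seed of weight w occurs at a given position with probability p^w, and for a small family the probability that one of the seeds occurs is approximately the sum of these terms. The sketch assumes this sum approximation, and it assumes the weight-9 family has six seeds, which is inferred from the slide's value of about 0.23·10^-4 rather than stated in the text.

```python
def family_selectivity(weights: list[int], p_match: float = 0.25) -> float:
    """Approximate probability that one of the seeds occurs at a given position.

    Uses the sum of per-seed probabilities p_match ** w, which slightly
    overestimates the true probability of the union of the events.
    """
    return sum(p_match ** w for w in weights)


for label, weights in [
    ("#### (w=4)", [4]),
    ("###-## (w=5)", [5]),
    ("two seeds of weight 7", [7, 7]),
    ("six seeds of weight 9 (assumed count)", [9] * 6),
]:
    print(f"{label}: {family_selectivity(weights):.3g}")
```

Each added seed costs a factor in selectivity, but raising the weight by one gains a factor of 4, which is why a family of heavier seeds can be more selective than a single lighter seed.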
![Page 21: Multi-seed lossless filtration](https://reader036.fdocuments.us/reader036/viewer/2022062809/5681593c550346895dc678fa/html5/thumbnails/21.jpg)
How far should we go
A trivial extreme solution ...
– would be to pick all seeds of weight m - k ($\sim \binom{m}{k}$ seeds).
– prohibitive cost except for very small problems
We are interested in intermediate solutions:
– relatively small number of seeds (< 10) to keep the hash table of a reasonable size,
– the seed weight sufficiently large to obtain a good selectivity
![Page 22: Multi-seed lossless filtration](https://reader036.fdocuments.us/reader036/viewer/2022062809/5681593c550346895dc678fa/html5/thumbnails/22.jpg)
Results
Computing properties of seed families
Seed design
– Seed expansion/contraction
– Periodic seeds
– Seed optimality
– Heuristic seed design
Experiments
– Examples of designed seed families
– Application to computing specific oligonucleotides
Conclusions
![Page 23: Multi-seed lossless filtration](https://reader036.fdocuments.us/reader036/viewer/2022062809/5681593c550346895dc678fa/html5/thumbnails/23.jpg)
Measuring the efficiency of a family
Optimal threshold (Burkhardt & Kärkkäinen): minimal number of seed occurrences over all (m,k)-instances
A seed family F is lossless iff the optimal threshold $T_F(m,k) \geq 1$
$T_F(m,k)$ can be computed by a dynamic programming algorithm in time $O(m \cdot k \cdot 2^{(S+1)})$ and space $O(k \cdot 2^{(S+1)})$, where S is the maximal length of a seed from F
– optimizations are possible (see the paper)
– the resulting space complexity is the same as in the Burkhardt & Kärkkäinen algorithm
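A minimal sketch of such a dynamic programme, under stated assumptions: the state is the pair (number of mismatches so far, last S match/mismatch bits), and the stored value is the minimal number of seed occurrences over all prefixes reaching that state. This is a straightforward version without the optimizations mentioned above, and `optimal_threshold` is a hypothetical name, not the authors' code.

```python
from math import inf


def optimal_threshold(family: list[str], m: int, k: int) -> int:
    """Minimal number of seed occurrences over windows of length m with at most k mismatches."""
    S = max(len(seed) for seed in family)
    full = (1 << S) - 1
    # Seed masks aligned to the end of the prefix: bit 0 is the newest position.
    seed_masks = []
    for seed in family:
        mask = 0
        for j, c in enumerate(reversed(seed)):
            if c == "#":
                mask |= 1 << j
        seed_masks.append((len(seed), mask))

    states = {(0, 0): 0}  # (mismatches, suffix bits) -> minimal hit count
    for i in range(1, m + 1):
        nxt = {}
        for (errors, suffix), hits in states.items():
            for bit in (1, 0):  # 1 = match, 0 = mismatch
                new_errors = errors + (1 - bit)
                if new_errors > k:
                    continue
                new_suffix = ((suffix << 1) | bit) & full
                # Count seeds that end exactly at position i and fit in the prefix.
                new_hits = hits + sum(
                    1
                    for span, mask in seed_masks
                    if span <= i and new_suffix & mask == mask
                )
                key = (new_errors, new_suffix)
                if new_hits < nxt.get(key, inf):
                    nxt[key] = new_hits
        states = nxt
    return min(states.values())


print(optimal_threshold(["####"], 18, 3))    # 3
print(optimal_threshold(["#####"], 18, 3))   # 0, so this seed is lossy
print(optimal_threshold(["###-##"], 18, 3) >= 1)
```

Allowing at most k mismatches gives the same minimum as exactly k, since adding a mismatch can never create a seed occurrence.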
![Page 24: Multi-seed lossless filtration](https://reader036.fdocuments.us/reader036/viewer/2022062809/5681593c550346895dc678fa/html5/thumbnails/24.jpg)
Measuring the efficiency of a family (cont)
Using a similar DP technique we can compute, within the same time complexity bound:
– the number $U_F(m,k)$ of undetected (m,k)-similarities for a (lossy) family F
– the contribution of a seed of F, i.e. the number of (m,k)-similarities detected exclusively by this seed
[see the paper for details]
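Both quantities can be sketched with the same kind of state space, now counting instances instead of minimizing hits. The sketch below counts windows with exactly k mismatches that no seed hits (the "exactly k" convention is an assumption made here), and obtains the contribution of a seed as the difference between the undetected counts without and with that seed. The function names are hypothetical.

```python
def count_undetected(family: list[str], m: int, k: int) -> int:
    """Number of windows of length m with exactly k mismatches hit by no seed."""
    S = max((len(seed) for seed in family), default=1)
    full = (1 << S) - 1
    seed_masks = []
    for seed in family:
        mask = 0
        for j, c in enumerate(reversed(seed)):
            if c == "#":
                mask |= 1 << j  # bit 0 = newest position
        seed_masks.append((len(seed), mask))

    states = {(0, 0): 1}  # (mismatches, suffix bits) -> number of undetected prefixes
    for i in range(1, m + 1):
        nxt = {}
        for (errors, suffix), count in states.items():
            for bit in (1, 0):  # 1 = match, 0 = mismatch
                new_errors = errors + (1 - bit)
                if new_errors > k:
                    continue
                new_suffix = ((suffix << 1) | bit) & full
                if any(span <= i and new_suffix & mask == mask
                       for span, mask in seed_masks):
                    continue  # a seed ends here: the prefix is detected, drop it
                key = (new_errors, new_suffix)
                nxt[key] = nxt.get(key, 0) + count
        states = nxt
    return sum(count for (errors, _), count in states.items() if errors == k)


def contribution(family: list[str], index: int, m: int, k: int) -> int:
    """Number of instances detected by family[index] and by no other seed."""
    others = family[:index] + family[index + 1:]
    return count_undetected(others, m, k) - count_undetected(family, m, k)


print(count_undetected(["#####"], 18, 3))  # 4
print(count_undetected(["####"], 18, 3))   # 0
print(contribution(["##", "###"], 0, 3, 1))  # 2
```

A family is lossless exactly when `count_undetected` returns 0, and a seed with contribution 0 is redundant in its family.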
![Page 25: Multi-seed lossless filtration](https://reader036.fdocuments.us/reader036/viewer/2022062809/5681593c550346895dc678fa/html5/thumbnails/25.jpg)
Design of seed families
Pruning exhaustive search tree (Burkhardt & Kärkkäinen)
– Construct all solutions of weight w from solutions of weight w – 1
– Example: if ##--#--# and ##-#---# are solutions of weight w-1, consider their «union» ##-##--# of weight w.
– Prohibitive cost:
• more than a week for computing all single-seed solutions of the (50,5)-problem
• the search space blows up for multi-seed families
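The «union» step in the example is a position-wise OR of two seeds of the same span. A small sketch, assuming both seeds are aligned at their first position; `seed_union` is an illustrative name.

```python
def seed_union(a: str, b: str) -> str:
    """Position-wise union of two equal-span seeds: '#' wherever either has '#'."""
    if len(a) != len(b):
        raise ValueError("this sketch only handles seeds of equal span")
    return "".join("#" if "#" in (x, y) else "-" for x, y in zip(a, b))


candidate = seed_union("##--#--#", "##-#---#")
print(candidate, candidate.count("#"))  # ##-##--# 5
```

The candidate still has to be checked against the (m,k)-problem; the union of two solutions is not guaranteed to be a solution.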
![Page 26: Multi-seed lossless filtration](https://reader036.fdocuments.us/reader036/viewer/2022062809/5681593c550346895dc678fa/html5/thumbnails/26.jpg)
Seed expansion/contraction
Burkhardt & Kärkkäinen: the only two solutions of weight 12 solving the (50,5)-problem:
###-#--###-#--###-#
#-#-#---#-----#-#-#---#-----#-#-#---#
![Page 27: Multi-seed lossless filtration](https://reader036.fdocuments.us/reader036/viewer/2022062809/5681593c550346895dc678fa/html5/thumbnails/27.jpg)
Seed expansion/contraction
Burkhardt & Kärkkäinen: the only two solutions of weight 12 solving the (50,5)-problem:
###-#--###-#--###-#
#-#-#---#-----#-#-#---#-----#-#-#---#
the only solution of weight 12 of the (25,2)-problem
![Page 28: Multi-seed lossless filtration](https://reader036.fdocuments.us/reader036/viewer/2022062809/5681593c550346895dc678fa/html5/thumbnails/28.jpg)
Seed expansion/contraction
Burkhardt & Kärkkäinen: the only two solutions of weight 12 solving the (50,5)-problem:
###-#--###-#--###-#
#-#-#---#-----#-#-#---#-----#-#-#---#
– Let $F^{\otimes i}$ be the i-regular expansion of F obtained by inserting i-1 jokers between successive positions of each seed of F
– Example: If F = { ###-# , ##-## } then
$F^{\otimes 2}$ = { #-#-#---# , #-#---#-# }
$F^{\otimes 3}$ = { #--#--#-----# , #--#-----#--# }
the only solution of weight 12 of the (25,2)-problem
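The i-regular expansion defined on this slide is easy to state in code: put i-1 jokers between every two successive positions of the seed. The sketch below also includes the inverse operation. The names `expand` and `contract` are illustrative.

```python
def expand(seed: str, i: int) -> str:
    """i-regular expansion: insert i-1 jokers between successive seed positions."""
    return ("-" * (i - 1)).join(seed)


def contract(seed: str, i: int) -> str:
    """Inverse of expand; fails if the seed is not an i-regular expansion."""
    if (len(seed) - 1) % i != 0 or any(
        c != "-" for pos, c in enumerate(seed) if pos % i != 0
    ):
        raise ValueError("seed is not an i-regular expansion")
    return seed[::i]


print([expand(s, 2) for s in ["###-#", "##-##"]])
print([expand(s, 3) for s in ["###-#", "##-##"]])
print(expand("###-#--###-#--###-#", 2))
```

The last call shows that the second weight-12 solution of the (50,5)-problem quoted above is the 2-regular expansion of the first one.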
![Page 29: Multi-seed lossless filtration](https://reader036.fdocuments.us/reader036/viewer/2022062809/5681593c550346895dc678fa/html5/thumbnails/29.jpg)
Seed expansion/contraction (cont)
Lemma:
– If a family F solves an (m,k)-problem, then both F and $F^{\otimes i}$ solve the $(i \cdot m, (k+1) \cdot i - 1)$-problem
– If a family $F^{\otimes i}$ solves the $(i \cdot m, k)$-problem, then its i-contraction F solves the $(m, \lfloor k/i \rfloor)$-problem
Example (i = 2): F = { ##-#-#### , ###---#--##-# } solves the (18,3)-problem, so both F and $F^{\otimes 2}$ = { #-#---#---#-#-#-# , #-#-#-------#-----#-#---# } solve the (36,7)-problem
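The parameter arithmetic of the lemma can be sketched as two small helpers. The expansion bound used here, (k+1)·i − 1 mismatches, is the one consistent with the slide's (18,3) to (36,7) example: if i consecutive blocks of length m share at most (k+1)·i − 1 mismatches, some block carries at most k. The function names are hypothetical.

```python
def expanded_problem(m: int, k: int, i: int) -> tuple[int, int]:
    """Problem solved by F and its i-regular expansion when F solves (m, k)."""
    return (i * m, (k + 1) * i - 1)


def contracted_problem(im: int, k: int, i: int) -> tuple[int, int]:
    """Problem solved by the i-contraction when the expanded family solves (im, k)."""
    if im % i != 0:
        raise ValueError("window length must be a multiple of i")
    return (im // i, k // i)


print(expanded_problem(18, 3, 2))    # (36, 7)
print(contracted_problem(50, 5, 2))  # (25, 2)
```

The second call matches the earlier slides: a weight-12 solution of the (50,5)-problem that is a 2-regular expansion contracts to a solution of the (25,2)-problem.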
![Page 30: Multi-seed lossless filtration](https://reader036.fdocuments.us/reader036/viewer/2022062809/5681593c550346895dc678fa/html5/thumbnails/30.jpg)
Periodic seeds
Iterating short seeds with good properties into longer seeds
###-#--###-#--###-# (obtained by iterating ###-#--)
![Page 31: Multi-seed lossless filtration](https://reader036.fdocuments.us/reader036/viewer/2022062809/5681593c550346895dc678fa/html5/thumbnails/31.jpg)
Cyclic problem
Lemma: If a seed Q solves a cyclic (m,k)-problem, then the seed $Q_i = [Q, -^{(m-s(Q))}]^i \cdot Q$ solves the linear $(m \cdot (i+1) + s(Q) - 1, k)$-problem.
Cyclic (11,3)-problem: ###-#--#---
Linear (30,3)-problem: ###-#--#---###-#--#
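The construction in the lemma pads Q with jokers up to the cycle length m, repeats that period i times, and appends one more copy of Q. A small sketch, reading s(Q) as the span (length) of Q, which is an assumption about the notation; `periodic_seed` is an illustrative name.

```python
def periodic_seed(q: str, m: int, i: int) -> str:
    """Build the seed [Q, m - s(Q) jokers] repeated i times, followed by Q."""
    if m < len(q):
        raise ValueError("cycle length m must be at least the span of Q")
    period = q + "-" * (m - len(q))
    return period * i + q


print(periodic_seed("###-#--#", 11, 1))  # ###-#--#---###-#--#
```

With Q = ###-#--# and m = 11, the padded period is the pattern ###-#--#--- shown for the cyclic problem, and i = 1 gives the linear seed drawn on the slide.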
![Page 32: Multi-seed lossless filtration](https://reader036.fdocuments.us/reader036/viewer/2022062809/5681593c550346895dc678fa/html5/thumbnails/32.jpg)
Extension to multi-seed case
Cyclic (11,3)-problem: ###-#--#---
Linear (25,3)-problem: { ###-#--#---###-#--# , #--#---###-#--#---### }
![Page 33: Multi-seed lossless filtration](https://reader036.fdocuments.us/reader036/viewer/2022062809/5681593c550346895dc678fa/html5/thumbnails/33.jpg)
Extension to multi-seed case
Cyclic (11,3)-problem: ###-#--#---
Linear (25,3)-problem: { ###-#--#---###-#--# , #--#---###-#--#---### }
![Page 34: Multi-seed lossless filtration](https://reader036.fdocuments.us/reader036/viewer/2022062809/5681593c550346895dc678fa/html5/thumbnails/34.jpg)
Asymptotic optimality
Theorem: Fix a number of errors k. Let w(m) be the maximal weight of a seed solving the linear (m,k)-problem. Then $m - w(m) = \Theta\left(m^{\frac{k-1}{k}}\right)$
– the fraction of the number of jokers tends to 0 but the convergence speed depends on k
– seed expansion cannot provide an asymptotically optimal solution
![Page 35: Multi-seed lossless filtration](https://reader036.fdocuments.us/reader036/viewer/2022062809/5681593c550346895dc678fa/html5/thumbnails/35.jpg)
Non-asymptotic optimality
Fix a number of errors k. For each seed (seed family) Q there exists $m_Q$ s.t. $\forall m \geq m_Q$, Q solves the (m,k)-problem
For a class of seeds $\mathcal{C}$, Q is an optimal seed in $\mathcal{C}$ iff Q realizes the minimal $m_Q$ over all seeds of $\mathcal{C}$
Lemma: Let n be an integer and r = n/3. For every $k \geq 2$, seed $\#^{n-r}$-$\#^{r}$ is optimal among seeds of weight n with one joker.
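For small parameters, the smallest window length from which a seed is lossless can be found by brute force, increasing m until every placement of k mismatches leaves a seed occurrence. This sketch relies on the fact that a seed solving the (m,k)-problem also solves the (m+1,k)-problem. The names `solves` and `minimal_m` and the search bound `m_max` are choices made for this example.

```python
from itertools import combinations
from typing import Optional


def solves(seed: str, m: int, k: int) -> bool:
    """True if every window of length m with k mismatches contains the seed."""
    required = [j for j, c in enumerate(seed) if c == "#"]
    for errors in combinations(range(m), k):
        bad = set(errors)
        if not any(
            all(p + j not in bad for j in required)
            for p in range(m - len(seed) + 1)
        ):
            return False
    return True


def minimal_m(seed: str, k: int, m_max: int = 40) -> Optional[int]:
    """Smallest m (up to m_max) for which the seed solves the (m,k)-problem."""
    for m in range(max(len(seed), k), m_max + 1):
        if solves(seed, m, k):
            return m
    return None


print(minimal_m("####", 3))  # 16
print(minimal_m("###", 2))   # 9
print(minimal_m("###-##", 3))
```

Comparing the returned values across seeds of one class (for instance all seeds of a given weight with one joker) identifies the optimal seed of that class in the sense defined above.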
![Page 36: Multi-seed lossless filtration](https://reader036.fdocuments.us/reader036/viewer/2022062809/5681593c550346895dc678fa/html5/thumbnails/36.jpg)
Heuristic seed design: genetic algorithm
– a population of seed families is evolving by mutating and crossing over
– seed families are screened against sets of difficult (m,k)-instances
– for a family that detects all difficult instances, the number of undetected similarities is computed by a DP algorithm. A family is kept if it yields a smaller number than currently known families
– compute the contribution of each seed of the family. Mutate the less “valuable” seeds.
[Figure: difficult (m,k)-instances and seed families, with arrows labelled "select" and "select and reorder"]
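The screening step can be sketched as a cheap pre-test: before running the full DP, a candidate family is checked against a short list of stored instances that earlier candidates failed on. The representation of an instance as a tuple of mismatch positions and the names `detects` and `passes_screen` are assumptions for this example; the mutation and crossover steps are not shown.

```python
def detects(family: list[str], errors: tuple[int, ...], m: int) -> bool:
    """True if some seed occurs in the window of length m avoiding all mismatches."""
    bad = set(errors)
    for seed in family:
        required = [j for j, c in enumerate(seed) if c == "#"]
        for p in range(m - len(seed) + 1):
            if all(p + j not in bad for j in required):
                return True
    return False


def passes_screen(family: list[str], difficult: list[tuple[int, ...]], m: int) -> bool:
    """Keep a candidate only if it detects every stored difficult instance."""
    return all(detects(family, errors, m) for errors in difficult)


difficult = [(4, 9, 14)]  # evenly spread mismatches in a window of length 18
print(passes_screen(["#####"], difficult, 18))   # False
print(passes_screen(["###-##"], difficult, 18))  # True
```

Only candidates that pass this screen need the more expensive count of undetected similarities.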
![Page 37: Multi-seed lossless filtration](https://reader036.fdocuments.us/reader036/viewer/2022062809/5681593c550346895dc678fa/html5/thumbnails/37.jpg)
Example: (25,2)-problem
![Page 38: Multi-seed lossless filtration](https://reader036.fdocuments.us/reader036/viewer/2022062809/5681593c550346895dc678fa/html5/thumbnails/38.jpg)
Example: (25,3)-problem
![Page 39: Multi-seed lossless filtration](https://reader036.fdocuments.us/reader036/viewer/2022062809/5681593c550346895dc678fa/html5/thumbnails/39.jpg)
Application of lossless filtering: oligo design
Specific oligonucleotides: small DNA molecules (10-50bp) that hybridize with a given target sequence and do not hybridize with the other background sequences (e.g. the rest of the genome)
Formalization: given a sequence, find all windows of length m which do not occur elsewhere within k substitution errors
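The formalization can be written down directly as a quadratic-time reference check, which is what a lossless filter is meant to speed up: the filter discards most window pairs, and only the surviving candidates need the explicit comparison. The sketch below compares every pair of windows within a single sequence; `specific_windows` is a hypothetical name and the toy sequence is made up for the example.

```python
def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def specific_windows(seq: str, m: int, k: int) -> list[int]:
    """Start positions of length-m windows with no other window within k substitutions."""
    n = len(seq) - m + 1
    windows = [seq[i:i + m] for i in range(n)]
    return [
        i
        for i in range(n)
        if all(i == j or hamming(windows[i], windows[j]) > k for j in range(n))
    ]


print(specific_windows("ACGTACGA", 4, 1))  # [1, 2, 3]
```

In the toy sequence, the windows at positions 0 (ACGT) and 4 (ACGA) differ by one substitution, so neither is specific for k = 1.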
![Page 40: Multi-seed lossless filtration](https://reader036.fdocuments.us/reader036/viewer/2022062809/5681593c550346895dc678fa/html5/thumbnails/40.jpg)
Seed design: (32,5)-problem
![Page 41: Multi-seed lossless filtration](https://reader036.fdocuments.us/reader036/viewer/2022062809/5681593c550346895dc678fa/html5/thumbnails/41.jpg)
Experiment
This filter has been applied to the rice EST database (100015 sequences of total size ~42 Mbp)
All 32-windows occurring elsewhere within 5 errors have been computed
The computation took slightly more than 1 hour on a P4 3GHz computer
87% of the database has been “filtered out”
![Page 42: Multi-seed lossless filtration](https://reader036.fdocuments.us/reader036/viewer/2022062809/5681593c550346895dc678fa/html5/thumbnails/42.jpg)
Further questions
Combinatorial structure of optimal seed families
Efficient design algorithm
![Page 43: Multi-seed lossless filtration](https://reader036.fdocuments.us/reader036/viewer/2022062809/5681593c550346895dc678fa/html5/thumbnails/43.jpg)
Questions