Efficient Algorithms for Some Variants of the Farthest String Problem
Chih Huai Cheng, Ching Chiang Huang, Shu Yu Hu, Kun-Mao Chao
Abstract
Given k strings of the same length L and an integer d, find a string s such that the Hamming distance between s and each of the k strings is at least d.
The problem is NP-complete if the distance d is not fixed.
We provide an efficient algorithm for fixed k and L.
Let’s Begin
Input: Strings s1, s2, . . . , sk over alphabet Σ of length L, and a nonnegative integer d.
Question: Is there a string s of length L such that dH(s, si) >= d for all i = 1, . . . , k?
FARTHEST STRING can be solved in O( kL · (|Σ|(L-d))^(L-d) ) time, yielding a bounded search tree algorithm for fixed parameters L and d.
Definitions
s : a string with length L
S : a set of length L strings
dH( s1, s2 ) : the Hamming distance between the two strings s1, s2, i.e. the number of positions in which they differ.
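As a small illustration of the definition above, the Hamming distance of two equal-length strings can be computed by counting mismatching positions. The function name `hamming_distance` is only an illustrative choice and does not come from the slides.

```python
def hamming_distance(s1: str, s2: str) -> int:
    """Return dH(s1, s2): the number of positions where s1 and s2 differ."""
    if len(s1) != len(s2):
        # The definition only covers strings of the same length L.
        raise ValueError("strings must have the same length")
    return sum(a != b for a, b in zip(s1, s2))


print(hamming_distance("0011", "0101"))
```

The two example strings differ in the second and third positions only.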
Key Observation
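Given a set of binary strings S = { s1, s2, . . . , sk } and a positive integer d. If there are i, j ∈ {1, . . . , k} with dH(si, sj) > x, then there is no string s with min i=1,...,k {dH(s, si)} > L-x/2.
[Figure: si and sj aligned, with positions marked "same" or "different".]
Illustration for two strings that differ in exactly x positions:
Positions where si and sj are the same contribute at most L-x to the distance from s.
Positions where si and sj are different contribute at most x/2 to the smaller of the two distances, because a binary string s differs from exactly one of si and sj there.
L-x + x/2 = L-x/2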
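The bound can be checked by brute force on a small binary example. The following Python sketch is only an illustration; the helper names `hamming` and `best_min_distance` and the example strings are assumptions, not part of the slides.

```python
from itertools import product


def hamming(a: str, b: str) -> int:
    """Number of positions in which a and b differ."""
    return sum(x != y for x, y in zip(a, b))


def best_min_distance(si: str, sj: str) -> int:
    """Largest value of min(dH(s, si), dH(s, sj)) over all binary strings s."""
    L = len(si)
    candidates = ("".join(bits) for bits in product("01", repeat=L))
    return max(min(hamming(s, si), hamming(s, sj)) for s in candidates)


si, sj = "000000", "000111"   # example values: L = 6, x = 3 differing positions
x = hamming(si, sj)
best = best_min_distance(si, sj)
# Compare the best achievable minimum with the bound L - x/2.
print(best, "<=", len(si) - x / 2)
```

Enumerating all 2^L strings is only feasible for tiny L; the sketch is meant to confirm the counting argument, not to solve the problem.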
Key Observation
Given a set of binary strings S = { s1, s2, . . . , sk } and a positive integer d. If there are i, j ∈ {1, . . . , k} with dH(s̄i, sj) < 2d-L, where s̄i is the complement of si, then there is no string s with min i=1,...,k {dH(s, si)} >= d.
This can be used to discard some of the strings.
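A minimal sketch of this test for binary strings, assuming the garbled operand in the slide is the bitwise complement of si. The names `complement` and `pair_rules_out_solution` are hypothetical helpers introduced for illustration.

```python
from typing import List


def hamming(a: str, b: str) -> int:
    """Number of positions in which a and b differ."""
    return sum(x != y for x, y in zip(a, b))


def complement(s: str) -> str:
    """Flip every bit of a binary string."""
    return "".join("1" if c == "0" else "0" for c in s)


def pair_rules_out_solution(strings: List[str], d: int) -> bool:
    """True if some pair (i, j) has dH(complement(si), sj) < 2d - L.

    In that case no string can be at distance >= d from all inputs.
    A False result does not guarantee that a solution exists.
    """
    L = len(strings[0])
    return any(
        hamming(complement(si), sj) < 2 * d - L
        for si in strings
        for sj in strings
    )


print(pair_rules_out_solution(["0000", "1111"], 3))
print(pair_rules_out_solution(["0000", "0011"], 3))
```

In the first call the two inputs are complements of each other, so every string is within distance 2 of one of them and d = 3 is impossible; the second instance is not ruled out by the test.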
The Idea Of The Algorithm
Choose a "candidate string" first, e.g. s̄1 (the complement of s1).
For a string si, i = 2, . . . , k, that matches the candidate string in more than L-d positions, we recursively try several ways to move the candidate string "away from" si. Stop either if the candidate is "too far away" from s̄1 or if we find a solution. By a careful selection of subcases, we can limit the size of this search tree to O( (L-d)^(L-d) ).
Algorithm By Pseudo-Code
Recursive Procedure FSd(s, Δd)
In the beginning: FSd(s̄1, L-d), where s̄1 is the complement of s1.

(D0) If Δd < 0, then return "not found";
     This means that we moved more than L-d steps.
(D1) If dH(s, si) < 2d-L for some i ∈ {1, . . . , k}, then return "not found";
     From the previous observation.
(D2) If dH(s, si) ≥ d for all i = 1, . . . , k, then return s;
     We found the answer.
(D3) Choose any i ∈ {1, . . . , k} such that dH(s, si) < d;
     Choose an unsatisfied string.
       P := { p | s[p] = si[p] };
     Set of positions where the candidate string and the unsatisfied string have the same letter.
       Choose any P' ⊆ P with |P'| = L-d+1;
     Choose a subset of P that has L-d+1 positions.
       For every p ∈ P' do
         s' := s;
         s'[p] := s̄i[p];
     Change a position once.
         sret := FSd(s', Δd-1);
     Recursive call.
         If sret ≠ "not found", then return sret;
(D4) Return "not found"
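The recursive procedure can be sketched in Python as follows. This is a minimal illustration for binary strings only, under the assumption that the candidate starts as the complement of s1 and that "changing a position" means flipping the bit; the function names `farthest_string`, `fsd`, `hamming` and `complement` are illustrative and not taken from the slides.

```python
from typing import List, Optional


def hamming(a: str, b: str) -> int:
    """Number of positions in which a and b differ."""
    return sum(x != y for x, y in zip(a, b))


def complement(s: str) -> str:
    """Flip every bit of a binary string."""
    return "".join("1" if c == "0" else "0" for c in s)


def farthest_string(strings: List[str], d: int) -> Optional[str]:
    """Return a binary string at distance >= d from every input, or None."""
    L = len(strings[0])

    def fsd(s: str, budget: int) -> Optional[str]:
        if budget < 0:                                  # (D0) moved too often
            return None
        dists = [hamming(s, t) for t in strings]
        if any(x < 2 * d - L for x in dists):           # (D1) pruning rule
            return None
        if all(x >= d for x in dists):                  # (D2) solution found
            return s
        # (D3) pick an unsatisfied string and branch on L-d+1 agreeing positions
        i = next(idx for idx, x in enumerate(dists) if x < d)
        same = [p for p in range(L) if s[p] == strings[i][p]]
        for p in same[: L - d + 1]:
            flipped = "1" if s[p] == "0" else "0"       # move away from strings[i]
            result = fsd(s[:p] + flipped + s[p + 1:], budget - 1)
            if result is not None:
                return result
        return None                                     # (D4)

    return fsd(complement(strings[0]), L - d)


print(farthest_string(["0000", "0011", "0101"], 3))
```

Because dH(s, si) < d in step (D3), the candidate and si agree in at least L-d+1 positions, so the slice always yields a full set of branches.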
Illustration By Graph
[Figure: search tree of the algorithm, annotated with the number of branches per node and the height of the tree. In this example, L = 7 and d = 5.]
Pseudo-Code & Time Complexity
Recursive Procedure FSd(s, Δd)
(D0) check of Δd < 0: O(1)
(D1), (D2) Hamming distances from s to all k strings: O(kL)
(D3) branching: O( (L-d)^(L-d) ) recursive calls
total = O( kL · (L-d)^(L-d) )
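The count of recursive calls can be made explicit with a short calculation. This is a sketch under the assumption that every node branches into at most L-d+1 subcases and that the recursion depth is at most m = L-d; the symbol m is introduced here only for brevity.

```latex
% m = L - d is the maximum search depth; each node has at most m + 1 children.
\[
\#\{\text{nodes at depth} \le m\}
\;\le\; \sum_{j=0}^{m} (m+1)^{j}
\;\le\; 2\,(m+1)^{m} \qquad (m \ge 1),
\]
\[
(m+1)^{m} \;=\; m^{m}\left(1 + \tfrac{1}{m}\right)^{m} \;\le\; e \cdot m^{m}
\;\;\Longrightarrow\;\;
\text{total time} \;=\; O\!\left(kL \cdot (L-d)^{L-d}\right).
\]
```

For the example values L = 7 and d = 5 this gives m = 2, at most 3 branches per node and at most 3^2 = 9 nodes at depth 2. Calls one level deeper stop in (D0) after O(1) work, and since m <= L their total cost stays within the same bound.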
Correctness
We have to show that Algorithm FSd will find a string s with min i=1,...,k {dH(s, si)} ≥ d, if such an s exists.
Case 1: s̄1 (the complement of s1) satisfies min i=1,...,k {dH(s̄1, si)} ≥ d.
Case 2: s̄1 is not a solution, but there exists a string s that satisfies min i=1,...,k {dH(s, si)} ≥ d.
In Case 2 there is a string si, i = 2, . . . , k, such that dH(s̄1, si) < d.
We will explain why the algorithm creates L-d+1 subcases and prove that it finds the correct answer.
Correctness – Case2
Take the first recursion as an example:
1. dH(s, s1) ≥ d. This means s and s1 have the same letter in at most L-d positions, so s differs from the candidate s̄1 (the complement of s1) in at most L-d positions.
2. dH(s, si) ≥ d. This means s and si have the same letter in at most L-d positions.
3. We choose L-d+1 positions in which s̄1 and si have the same letter.
4. Because s and si have the same letter in at most L-d positions, by the pigeonhole principle at least one of the chosen positions is a position where s and si differ.
5. Changing the candidate string at that position moves it closer to the farthest string s.
6. In at most L-d steps, the farthest string is reached.
Correctness - Case2
[Figure: the strings involved in Case 2 aligned position by position, with positions marked "same" or "different". In this example, L = 9, d = 4.]
Farthest String by Maximum Hamming Distance Sum
Input: Strings s1, s2, …, sk over alphabet Σ of length L.
Question: Which string s ∈ {s1, s2, . . . , sk} maximizes $\sum_{i=1}^{k} d_H(s, s_i)$?
Naïve Approach
Approach: Select the letter that occurs the fewest times in each column. This is the so-called minority vote.
[Figure: example with binary candidate strings and the minority vote of each column.]
In the first column, 1 occurs the fewest times among the candidates, so the minority vote is 1.
In the second column, 0 occurs the fewest times, so the minority vote is 0.
This approach still doesn't work.
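One way to see a difficulty with the minority vote is that the voted string need not be one of the input strings, which the problem statement requires. The sketch below illustrates this; the slides do not spell out their own counterexample, so the example strings and the function name `minority_vote` are assumptions made for illustration.

```python
from collections import Counter
from typing import List


def minority_vote(strings: List[str], alphabet: str = "01") -> str:
    """Pick, in every column, a letter of the alphabet that occurs least often."""
    L = len(strings[0])
    voted = []
    for p in range(L):
        counts = Counter(s[p] for s in strings)   # missing letters count as 0
        voted.append(min(alphabet, key=lambda a: counts[a]))
    return "".join(voted)


strings = ["00", "01", "11"]                      # hypothetical input set
vote = minority_vote(strings)
print(vote, vote in strings)
```

Here the first column votes for 1 and the second for 0, and the resulting string is not among the three inputs.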
The concept of weighted sum
We therefore want to decide how much each letter in a column contributes to the total Hamming distance, and then calculate the sum of Hamming distances for every string. To achieve this, we use an array that records the number of times each letter occurs in every column.
Key Observation
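Definition: num(α, p) is the number of times the letter α appears in the pth column.
We have to prove that, for every string si, the sum of the Hamming distances to the other strings equals, added up over all columns p, the total number of strings k minus the number of times the letter si[p] appears in column p:
$$
\begin{aligned}
\sum_{j \ne i} d_H(s_i, s_j) &= \sum_{j \ne i} \sum_{p=1}^{L} d_H(s_i[p], s_j[p]) \\
&= \sum_{p=1}^{L} \sum_{j \ne i} d_H(s_i[p], s_j[p]) \\
&= \sum_{p=1}^{L} \bigl(k - \mathrm{num}(s_i[p], p)\bigr)
\end{aligned}
$$
Given this, the string with the maximum sum is our answer.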
Pseudo-code & Time Complexity
for p = 1 to L
    for i = 1 to k
        num[p][si[p]] += 1

farthest = 1
dis = 0
for i = 1 to k
    temp_dis = 0
    for p = 1 to L
        temp_dis += k - num[p][si[p]]
    if temp_dis > dis
        dis = temp_dis
        farthest = i
return s_farthest

Counting the number of times each letter occurs in each column and storing it in the two-dimensional array num[] takes O(kL) time.
Calculating the weighted sum of one string takes O(L) time, so all k strings take O(kL) time, and the total time is therefore O(kL).
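The pseudo-code above can be sketched in Python as follows. It is a minimal illustration that uses one `Counter` per column in place of the two-dimensional array num[]; the function name `farthest_by_distance_sum` and the example input are assumptions, not part of the slides.

```python
from collections import Counter
from typing import List, Tuple


def farthest_by_distance_sum(strings: List[str]) -> Tuple[int, int]:
    """Return (index, distance sum) of the input string that maximizes
    the sum of Hamming distances to all input strings."""
    k = len(strings)
    L = len(strings[0])
    # num[p][a] = number of strings with letter a in column p; O(kL) to build.
    num = [Counter(s[p] for s in strings) for p in range(L)]

    best_index, best_sum = 0, -1
    for i, s in enumerate(strings):
        # Column p contributes k - num[p][s[p]] strings that differ from s there.
        total = sum(k - num[p][s[p]] for p in range(L))
        if total > best_sum:          # strict '>' keeps the first maximum
            best_index, best_sum = i, total
    return best_index, best_sum


example = ["0000", "0011", "0101", "1111"]
print(farthest_by_distance_sum(example))
```

Each string is scored in O(L) time after the counting pass, which matches the O(kL) total stated above.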
Thank You