Efficient Algorithms for Some Variants of the Farthest String Problem Chih Huai Cheng, Ching Chiang...

Efficient Algorithms for Some Variants of the Farthest String Problem

Chih Huai Cheng, Ching Chiang Huang, Shu Yu Hu, Kun-Mao Chao

Abstract

Given k strings of the same length L and an integer d, find a string s such that the Hamming distance between s and each of the k strings is at least d.

The problem is NP-complete if the distance d is not fixed.

We provide an efficient algorithm for fixed k and L.

Let’s Begin

Input: Strings s1, s2, . . . , sk over alphabet Σ of length L, and a nonnegative integer d.

Question: Is there a string s of length L such that dH(s, si) ≥ d for all i = 1, . . . , k?

FARTHEST STRING can be solved in O(kL · (|Σ|(L-d))^(L-d)) time, yielding a bounded search tree algorithm for fixed parameters L and d.

Definitions

s : a string with length L

S : a set of length L strings

dH(s1, s2) : the Hamming distance between the two strings s1 and s2, i.e. the number of positions at which they differ.
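As a minimal sketch of the distance used throughout (the function name `hamming` is an illustrative choice, not taken from the slides):

```python
def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("strings must have the same length")
    return sum(x != y for x, y in zip(a, b))


# "0110" and "0011" differ at positions 1 and 3 (0-based).
print(hamming("0110", "0011"))  # prints 2
```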

Key Observation

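Given a set of binary strings S = { s1, s2, . . . , sk } and a positive integer x: if there are i, j ∈ {1, . . . , k} with dH(si, sj) > x, then there is no string s with min_{i=1,...,k} {dH(s, si)} > L - (x/2).

[Figure: two strings si and sj that differ in x positions. The L-x positions where they are the same contribute at most L-x to the distance from s. In the x positions where they are different, s can differ from only one of the two strings, so these positions contribute at most x/2 to the smaller of the two distances. In total: (L-x) + x/2 = L - x/2.]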
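The bound can be checked by brute force on a small instance. The sketch below is only an illustration: the helper names and the two example strings are assumptions, and the exhaustive enumeration is feasible only for short strings.

```python
from itertools import product


def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))


def best_min_distance(a, b):
    """Largest value of min(dH(s, a), dH(s, b)) over all binary strings s."""
    return max(
        min(hamming(s, a), hamming(s, b))
        for s in product("01", repeat=len(a))
    )


a, b = "000000", "111100"      # hypothetical example with L = 6
x = hamming(a, b)              # the two strings differ in x = 4 positions
# No string s can be farther than L - x/2 from both a and b.
assert best_min_distance(a, b) <= len(a) - x / 2
```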

Key Observation

Given a set of binary strings S = { s1, s2, . . . , sk } and a positive integer d: if there are i, j ∈ {1, . . . , k} such that the complement of si is at distance less than 2d-L from sj, i.e. dH(si, sj) > 2(L-d), then there is no string s with min_{i=1,...,k} {dH(s, si)} ≥ d.

This can be used to discard some of the strings.
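A sketch of this test, under the assumption that the distance in the observation is measured from the bitwise complement of si; for binary strings that condition is equivalent to dH(si, sj) > 2(L-d). The function name `pair_rules_out_solution` is hypothetical.

```python
from itertools import combinations


def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def pair_rules_out_solution(strings, d):
    """True if some pair of binary input strings is so far apart that no
    string can be at distance >= d from both of them."""
    L = len(strings[0])
    return any(hamming(a, b) > 2 * (L - d) for a, b in combinations(strings, 2))


# L = 7, d = 5: the threshold is 2 * (L - d) = 4.
print(pair_rules_out_solution(["0000000", "1111111"], 5))  # distance 7 > 4
print(pair_rules_out_solution(["0000000", "0000011"], 5))  # distance 2 <= 4
```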

The Idea Of The Algorithm

Choose a "candidate string" first, e.g. the complement of s1.

For a string si, i = 2, . . . , k, that matches the candidate string in more than L-d positions, we recursively try several ways to move the candidate string "away from" si. Stop either if the candidate is "too far away" from the complement of s1 or if we find a solution. By a careful selection of subcases, we can limit the size of this search tree to O((L-d)^(L-d)).

Algorithm By Pseudo-Code

Recursive Procedure FSd(s, Δd)

(D0) If Δd < 0, then return "not found";
     (This means that we moved more than L-d steps.)
(D1) If dH(s, si) < 2d - L for some i ∈ {1, . . . , k}, then return "not found";
     (From the previous observation.)
(D2) If dH(s, si) ≥ d for all i = 1, . . . , k, then return s;
     (The answer is found.)
(D3) Choose any i ∈ {1, . . . , k} such that dH(s, si) < d;
     (Choose an unsatisfied string.)
     P := {p | s[p] = si[p]};
     (The set of positions where the candidate string and the unsatisfied string have the same symbol.)
     Choose any P' ⊆ P with |P'| = L-d+1;
     (Choose a subset of P that has L-d+1 positions.)
     For every p ∈ P' do
         s' := s;
         s'[p] := complement of si[p];
         (Change a position once.)
         s_ret := FSd(s', Δd - 1);
         (Recursive call.)
         If s_ret ≠ "not found", then return s_ret;
(D4) Return "not found"

In the beginning, call FSd(c, L-d), where c is the complement of s1.
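The procedure can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' implementation: it handles only binary strings over "0" and "1", it starts from the bitwise complement of s1 (one reading of the slides), and the names `fsd`, `farthest_string` and `flip` are hypothetical.

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def flip(bit):
    """Complement of a single binary symbol (assumes '0' or '1')."""
    return "1" if bit == "0" else "0"


def fsd(s, delta, strings, d):
    """Bounded search tree: return a string at distance >= d from every
    input string, or None if this branch cannot reach one."""
    L = len(s)
    if delta < 0:                                        # (D0) too many moves
        return None
    if any(hamming(s, t) < 2 * d - L for t in strings):  # (D1) prune
        return None
    unsatisfied = [t for t in strings if hamming(s, t) < d]
    if not unsatisfied:                                  # (D2) s is a solution
        return s
    t = unsatisfied[0]                                   # (D3) pick any unsatisfied string
    same = [p for p in range(L) if s[p] == t[p]]         # positions where s and t agree
    for p in same[: L - d + 1]:                          # branch on L-d+1 of them
        moved = s[:p] + flip(s[p]) + s[p + 1:]           # make s differ from t at p
        result = fsd(moved, delta - 1, strings, d)
        if result is not None:
            return result
    return None                                          # (D4)


def farthest_string(strings, d):
    """Start the search from the complement of the first string."""
    L = len(strings[0])
    start = "".join(flip(c) for c in strings[0])
    return fsd(start, L - d, strings, d)


print(farthest_string(["000000", "111100"], 3))
print(farthest_string(["0000000", "1111111"], 5))  # no solution exists: None
```

An unsatisfied string agrees with the candidate in more than L-d positions, so the slice `same[: L - d + 1]` always contains exactly L-d+1 positions.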

Illustration By Graph

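[Figure: search tree of the algorithm, annotated with the number of branches per node (L-d+1) and the height of the tree (L-d). In this example, L = 7 and d = 5, so each node has 3 branches and the tree has height 2.]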

Pseudo-Code & Time Complexity

Recursive Procedure FSd(s, Δd), with the cost of each part:

(D0) The check on Δd: O(1)
Computing dH(s, si) for all k strings of length L within one call: O(kL)
Size of the search tree: O((L-d)^(L-d)) recursive calls

total = O(kL · (L-d)^(L-d))

Correctness

We have to show that Algorithm FSd will find a string s with min_{i=1,...,k} {dH(s, si)} ≥ d, if such an s exists.

Case 1: The initial candidate (the complement of s1) satisfies min_{i=1,...,k} {dH(candidate, si)} ≥ d.

Case 2: The initial candidate is not a solution, but there exists a string s that satisfies min_{i=1,...,k} {dH(s, si)} ≥ d. Then there is a string si, i = 2, . . . , k, such that dH(candidate, si) < d.

We will explain why the algorithm creates L-d+1 subcases and prove that it reaches the correct answer.

Correctness – Case2

Take the first recursion as an example, where the candidate is the complement of s1:

1. dH(s, s1) ≥ d, which means there are at most L-d positions where s and s1 have the same symbol.

2. dH(s, si) ≥ d, which means there are at most L-d positions where s and si have the same symbol.

3. We choose L-d+1 positions where the candidate and si have the same symbol.

4. Because s and si have the same symbol in at most L-d positions, by the pigeonhole principle at least one of the chosen positions is a position where s and si differ.

5. Changing the candidate at that position moves it closer to the farthest string.

6. In at most L-d steps, the farthest string is reached.

Correctness - Case2

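[Figure: the farthest string s, the string si and the candidate string obtained from s1, aligned position by position and marked "same" or "different". In this example, L = 9, d = 4.]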

Farthest String by Maximum Hamming Distance Sum

Input: Strings s1, s2, . . . , sk over alphabet Σ of length L.

Question: Which string s ∈ {s1, s2, . . . , sk} maximizes $\sum_{i=1}^{k} d_H(s, s_i)$?

Naïve Approach

Approach: Select the symbol that occurs the fewest times in each column. This is the so-called minority vote.

[Figure: example with two columns of bits.] The number of 1's in the first bit is the smallest among the candidates, so we choose 1 as the minority vote. The number of 0's in the second bit is the smallest, so the minority vote is 0.

This approach still doesn't work: the string assembled from the minority votes need not be one of s1, . . . , sk.
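A small sketch of one way the minority vote can fail for this variant, which asks for one of the given strings. The input strings are a made-up example, and `minority_vote` is a hypothetical helper name.

```python
from collections import Counter


def minority_vote(strings, alphabet="01"):
    """Pick, for every column, the symbol that occurs the fewest times.
    Ties are broken by the order of `alphabet`."""
    result = []
    for column in zip(*strings):
        counts = Counter(column)  # missing symbols count as 0
        result.append(min(alphabet, key=lambda a: counts[a]))
    return "".join(result)


strings = ["00", "01", "01"]      # hypothetical input
vote = minority_vote(strings)
# First column: no 1's at all, so the vote is 1.
# Second column: a single 0, so the vote is 0.
print(vote)             # prints 10
print(vote in strings)  # prints False: the vote is not one of the inputs
```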

The concept of weighted sum

Therefore we want to determine, for each column, how much each symbol contributes to the total Hamming distance, and then calculate the sum of Hamming distances for every string. To achieve this, we use an array that records the number of times each symbol occurs in every column.

Key Observation

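$$
\sum_{j=1}^{k} d_H(s_i, s_j)
= \sum_{j=1}^{k} \sum_{p=1}^{L} d_H(s_i[p], s_j[p])
= \sum_{p=1}^{L} \sum_{j=1}^{k} d_H(s_i[p], s_j[p])
= \sum_{p=1}^{L} \bigl(k - num(s_i[p], p)\bigr)
$$

Definition: num(α, p) is the number of times the symbol α appears in the p-th column.

We have to prove that the sum of the Hamming distances from a string si to all k strings equals, summed over all columns, the number of strings k minus the number of times the symbol of si appears in that column. Once this is proven, the string with the maximum sum is our answer.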
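The identity can be checked numerically on a small example. This is only an illustrative sketch: the three input strings and the function names are assumptions.

```python
from collections import Counter


def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def distance_sum_direct(strings, i):
    """Sum of dH(s_i, s_j) over all j, computed pair by pair."""
    return sum(hamming(strings[i], t) for t in strings)


def distance_sum_by_counts(strings, i):
    """Same sum, computed column by column as k - num(s_i[p], p)."""
    k = len(strings)
    total = 0
    for p, column in enumerate(zip(*strings)):
        num = Counter(column)            # num[a] plays the role of num(a, p)
        total += k - num[strings[i][p]]
    return total


strings = ["ACGT", "ACGA", "TCGA"]       # hypothetical input
for i in range(len(strings)):
    assert distance_sum_direct(strings, i) == distance_sum_by_counts(strings, i)
```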

Pseudo-code & Time Complexity

for p = 0 to L-1
    for i = 0 to k-1
        num[p][si[p]] += 1

farthest = 0
dis = 0
for i = 0 to k-1
    temp_dis = 0
    for p = 0 to L-1
        temp_dis += k - num[p][si[p]]
    if temp_dis > dis
        dis = temp_dis
        farthest = i
return s_farthest

Counting the number of times each symbol occurs in each column and storing the counts in the two-dimensional array num[] takes O(kL) time.

Calculating the weighted sum of one string takes O(L) time, so doing this for all k strings takes O(kL) time. The total time is therefore O(kL).
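A runnable sketch of the same two passes, assuming the strings are given as a Python list; the function name `farthest_by_distance_sum` and the example input are illustrative.

```python
from collections import Counter


def farthest_by_distance_sum(strings):
    """Return the input string with the largest sum of Hamming distances
    to all input strings, together with that sum. Runs in O(kL) time."""
    k = len(strings)
    # First pass: num[p][a] = number of times symbol a occurs in column p.
    num = [Counter(column) for column in zip(*strings)]
    best_index, best_sum = 0, -1
    # Second pass: weighted sum of every string, O(L) per string.
    for i, s in enumerate(strings):
        total = sum(k - num[p][c] for p, c in enumerate(s))
        if total > best_sum:
            best_index, best_sum = i, total
    return strings[best_index], best_sum


print(farthest_by_distance_sum(["0000", "0001", "1110"]))  # ('1110', 7)
```

When several strings share the maximum sum, this sketch returns the first of them.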

Thank You
