Hannu Peltola Jorma Tarhio Aalto University Finland Variations of Forward-SBNDM.


Hannu Peltola Jorma Tarhio

Aalto University

Finland

Variations of Forward-SBNDM


Aug. 29, 2011

Aims

Tuning algorithms for exact string matching.

Studying the effect of simultaneous 2-byte read.


SBNDM: Simple Backward Nondeterministic DAWG Matching

SBNDM [18] is a simplification of BNDM [17]. Both are bit-parallel algorithms.

Text T = t1...tn, pattern P = p1...pm.

At each alignment window of P in T, scan T from right to left until the suffix of the window is not a factor of P or an occurrence of P is found.


Shift of SBNDM

No factor: m

P found: 1

Else: next alignment starts at the last factor
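The scan and the three shift rules can be sketched in Python as follows. This is a minimal sketch and not the authors' implementation: the function name sbndm, the 0-based indexing and the bit layout (last pattern character in bit 0) are assumptions made for illustration, and the shift after an occurrence is 1, as stated on the slide.

```python
def sbndm(pattern, text):
    """Return the 0-based start positions of pattern in text (sketch)."""
    m, n = len(pattern), len(text)
    if m == 0:
        raise ValueError("pattern must be non-empty")
    # Occurrence vectors: bit (m-1-j) of B[c] is set when pattern[j] == c.
    B = {}
    for j, c in enumerate(pattern):
        B[c] = B.get(c, 0) | (1 << (m - 1 - j))
    occurrences = []
    pos = 0                                  # start of the current alignment
    while pos <= n - m:
        end = pos + m - 1
        D = B.get(text[end], 0)              # state after reading 1 character
        k = 1                                # number of characters read
        while D != 0 and k < m:
            # Extend the suffix of the window by one character to the left.
            D = (D << 1) & B.get(text[end - k], 0)
            k += 1
        if D != 0:                           # all m characters form a factor
            occurrences.append(pos)
            pos += 1                         # P found: shift 1
        else:
            # The suffix of length k is not a factor; the next alignment
            # starts at the last factor (length k-1). For k = 1 this is m.
            pos += m - k + 1
    return occurrences


print(sbndm("banana", "antanabadbanana"))
```

With P = banana and T = antanabadbanana the sketch visits three alignments (shifts 3 and 6) and reports the occurrence at 0-based position 9.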


SBNDM, example

P = banana, T = antanabadbanana...

alignment: antanabadbanana (window antana); the suffixes a, na, ana are factors of P

not a factor: tana
next alignment: antanabadbanana (window anabad)

not a factor: d
next alignment: antanabadbanana (window banana)


SBNDMq

SBNDMq [6] is a tuned version of SBNDM.

Processing of an alignment starts with checking a q-gram.

Let q = 4. Consider an alignment at antana. Instead of testing four suffixes a, na, ana, tana, only tana is tested.

Testing is done in a fast loop.
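The q-gram test can be illustrated with a few lines of Python. This is only a sketch of the factor test itself, assuming the usual bit-parallel occurrence vectors with the last pattern character in bit 0; the helper name qgram_state is hypothetical, and the fast loop of the real algorithm is not modelled.

```python
def qgram_state(pattern, gram):
    """State vector after reading gram right to left; 0 means 'not a factor'."""
    m = len(pattern)
    B = {}
    for j, c in enumerate(pattern):
        B[c] = B.get(c, 0) | (1 << (m - 1 - j))
    D = B.get(gram[-1], 0)                   # rightmost character first
    for c in reversed(gram[:-1]):
        D = (D << 1) & B.get(c, 0)
    return D


print(qgram_state("banana", "tana"))                  # 0: tana is not a factor
print(format(qgram_state("banana", "nana"), "06b"))   # non-zero: nana is a factor
```

For the alignment at antana the single test of tana already yields a zero state vector, so the shorter suffixes a, na and ana never have to be tested separately.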


Forward-SBNDM

Forward-SBNDM (FSB for short) by Faro & Lecroq [7] is a lookahead version of SBNDM2.

Both FSB and SBNDM2 read a 2-gram x1x2 before a factor test.

x1x2 is matched with the end of P in SBNDM2.

Only x1 is matched with the end of P in FSB, and x2 is a lookahead character following the current alignment.

FSB is faster than SBNDM2 for large alphabets.


Generalization of FSB: FSB(q,f)

FSB(q,f) (= Forward-SBNDM(q,f)) is SBNDMq with f lookahead characters, f = 0, 1, ..., q-1.

FSB(2,1) = FSB and FSB(q,0) = SBNDMq.

Motivation: SBNDMq works well on modern processors also for q>2.


FSB(q,f)

Let UV be a q-gram, where |V| = f.

After reading UV there are 3 alternatives:

i. If U is a suffix of P, reading continues leftwards.

ii. Else if UV is a factor of P, reading continues leftwards.

iii. Else the state vector is zero and P is shifted m-q+f+1 positions (f positions more than in SBNDMq).
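The three alternatives can be put together into a complete search routine. The following Python code is a sketch reconstructed from the slides, not the authors' implementation: the name fsb_search, the 0-based indexing, the handling of lookahead positions past the end of the text (treated like a character that does not occur in P) and the shift of 1 after an occurrence are assumptions. The occurrence vectors have m+f bits, and the f lowest bits are always set.

```python
def fsb_search(pattern, text, q=4, f=2):
    """Sketch of FSB(q, f): 0-based start positions of pattern in text."""
    m, n = len(pattern), len(text)
    if not (0 <= f < q <= m):
        raise ValueError("expected 0 <= f < q <= m")
    low = (1 << f) - 1                       # the f extra bits
    F = {}
    for j, c in enumerate(pattern):
        F[c] = F.get(c, low) | (1 << (m - 1 - j + f))

    def vec(i):
        # Assumption: positions past the text behave like a character not in P.
        return F.get(text[i], low) if i < n else low

    occurrences = []
    pos = 0                                  # start of the current alignment
    while pos <= n - m:
        right = pos + m - 1 + f              # last lookahead position
        D = vec(right)
        for k in range(1, q):                # read the q-gram UV right to left
            D = (D << 1) & vec(right - k)
        k = q
        while D != 0 and k < m + f:          # alternatives i and ii: go leftwards
            D = (D << 1) & vec(right - k)
            k += 1
        if D != 0:                           # whole window read: occurrence
            occurrences.append(pos)
            pos += 1
        else:                                # alternative iii when k == q
            pos += m + f - k + 1
    return occurrences


print(fsb_search("banana", "antanabadbanana", q=4, f=2))
```

With f = 0 the routine degenerates to an SBNDMq-style search, and with q = 2, f = 1 it corresponds to the FSB setting, in line with FSB(q,0) = SBNDMq and FSB(2,1) = FSB.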


Occurrence vectors in FSB(q,2)

Example: P = banana

SBNDMq: B[n] = 00001010

FSB(q,2): B[n] = 00101011, B[a] = 01010111, B[x] = 00000011

The two rightmost bits are the extra bits.


State vectors in FSB(q,2) for q=4

4-gram nanx, read from right to left:

x   00000011
n   00101011
a   01010111
n   00101011

State vector: 00001000

4-gram   State vector   Conclusion
nanx     00001000       na is a suffix of P
xana     00000000       not a factor
anan     01000000       factor of P

nanx is not a factor
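The occurrence vectors and the state vectors for nanx and xana can be recomputed with a short Python sketch. The helper names fsb_vectors and state_vector are hypothetical, and the update rule D = (D << 1) & B[c], applied from right to left, is an assumption that reproduces these two rows.

```python
def fsb_vectors(pattern, f):
    """Occurrence vectors with f extra low bits that are always set."""
    m = len(pattern)
    low = (1 << f) - 1
    F = {}
    for j, c in enumerate(pattern):
        F[c] = F.get(c, low) | (1 << (m - 1 - j + f))
    return F, low


def state_vector(F, low, gram):
    """State after reading gram from right to left."""
    D = F.get(gram[-1], low)
    for c in reversed(gram[:-1]):
        D = (D << 1) & F.get(c, low)
    return D


F, low = fsb_vectors("banana", 2)
for c in "nax":
    print(c, format(F.get(c, low), "08b"))
for gram in ("nanx", "xana"):
    print(gram, format(state_vector(F, low, gram), "08b"))
```

The non-zero state for nanx comes from the extra bits: the lookahead part nx is accepted unconditionally, so only U = na has to match the end of P.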


Benefits / drawbacks of lookahead characters and extra bits

Benefits

• Longer shifts => more speed

• Combined suffix/factor test

Drawback

• More q-grams accepted => less speed


Greedy skip loop for SBNDM2 (GSB2 = Greedy-SBNDM2)

Factor tests of two 2-grams are done in one round.

Let B2[x,y] denote the combined occurrence vector of characters x and y. B2[x,y] = B[x] & (B[y]<<1)

next: D ← B2[t[i], t[i+1]]
      if D = 0 then
          if B2[t[i+m-1], t[i+m]] = 0 then
              i ← i + 2*m - 2
              goto next
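The greedy skip loop can be embedded in a full search as in the following Python sketch. It is an illustration under assumptions, not the authors' code: the name gsb2, the 0-based indexing, the SBNDM-style continuation after a successful 2-gram test and the shift of m-1 when only the first 2-gram fails are filled in here, since the slide shows only the double-failure case.

```python
def gsb2(pattern, text):
    """Sketch of SBNDM2 with a greedy skip loop; returns 0-based positions."""
    m, n = len(pattern), len(text)
    if m < 2:
        raise ValueError("pattern must have at least 2 characters")
    B = {}
    for j, c in enumerate(pattern):
        B[c] = B.get(c, 0) | (1 << (m - 1 - j))

    def b2(i):
        # Combined vector of the 2-gram text[i], text[i+1].
        return B.get(text[i], 0) & (B.get(text[i + 1], 0) << 1)

    occurrences = []
    i = m - 2                      # index of the window's last 2-gram
    while i + 1 < n:
        D = b2(i)
        if D == 0:
            # Greedy step: also test the 2-gram one shift (m-1) ahead.
            if i + m < n and b2(i + m - 1) == 0:
                i += 2 * m - 2
            else:
                i += m - 1         # a real implementation would reuse D here
            continue
        k = 2                      # characters of the window read so far
        while D != 0 and k < m:
            D = (D << 1) & B.get(text[i + 1 - k], 0)
            k += 1
        if D != 0:
            occurrences.append(i + 2 - m)
            i += 1
        else:
            i += m - k + 1
    return occurrences


print(gsb2("banana", "antanabadbanana"))
```

The double shift is safe because every alignment that is skipped would contain the first rejected 2-gram, which is not a factor of P.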


2-byte read

Read two characters (= 2 bytes = 16 bits) in one instruction (in a skip loop).

Well suited to q-gram algorithms with even q.

For experiments we made two versions of the algorithms:

• Standard (1-byte read)

• b-version using 2-byte read


2-byte read (cont.)

Advantage: part of the computation can be moved to the preprocessing phase

• Example: B2[x,y] = B[x] & (B[y]<<1)

The speed-up factor can even exceed 2.

Drawback: extra 0.1 ms for preprocessing.
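The idea of moving B2 into preprocessing can be sketched as a table with 2^16 entries that is indexed directly by the 16-bit value read from the text. The Python code below only illustrates the data layout and cannot show the speed gain, which comes from a single machine-level 2-byte load; the names build_b2_table and read2 are hypothetical, and a little-endian read (first character in the low byte) is an assumption.

```python
import struct


def build_b2_table(pattern: bytes):
    """Table of B[x] & (B[y] << 1), indexed by the 16-bit value x | (y << 8)."""
    m = len(pattern)
    B = [0] * 256
    for j, c in enumerate(pattern):
        B[c] |= 1 << (m - 1 - j)
    table = [0] * 65536
    for y in range(256):
        if B[y] == 0:
            continue                      # B[x] & 0 stays 0
        shifted = B[y] << 1
        for x in range(256):
            table[x | (y << 8)] = B[x] & shifted
    return table


def read2(text: bytes, i: int) -> int:
    """Read text[i] and text[i+1] as one little-endian 16-bit value."""
    return struct.unpack_from("<H", text, i)[0]


table = build_b2_table(b"banana")
print(format(table[read2(b"antana", 4)], "06b"))   # 2-gram "na"
print(table[read2(b"anabad", 4)])                  # 2-gram "ad"
```

Filling the 65536 entries is the kind of extra preprocessing work that the drawback refers to.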


4-byte read?

Many border crossings happen => slowdown

Tables with 2^32 entries are too big in practice


Experimental results/KJV Bible

In the recent comparison by S. Faro and T. Lecroq, The Exact String Matching Problem: a Comprehensive Experimental Evaluation (2010), the algorithms EBOM and Hash3 were the fastest on the Bible text for m = 4, ..., 20.

m       4      8      16
Hash3   14.6   5.42   2.79
EBOM    6.53   3.87   2.91


KJV: EBOM & Hash3 (on ThinkPad X61s)

[Plot: GB/s (0 to 4) versus m (4, 8, 12, 16, 20); curves: EBOM, Hash3]


KJV: EBOMb & Hash3b (with 2-byte read) added

[Plot: GB/s (0 to 4) versus m (4, 8, 12, 16, 20); curves: EBOM, EBOMb, Hash3, Hash3b]


KJV: SBNDM2b = FSB(2,0)b added

[Plot: GB/s (0 to 4) versus m (4, 8, 12, 16, 20); curves: EBOM, EBOMb, Hash3, Hash3b, FSB(2,0)b]


KJV: GSB2b added

[Plot: GB/s (0 to 4) versus m (4, 8, 12, 16, 20); curves: EBOM, EBOMb, Hash3, Hash3b, FSB(2,0)b, GSB2b]


KJV: FSB(4,i)b added, i = 0,1,2

[Plot: GB/s (0 to 4) versus m (4, 8, 12, 16, 20); curves: EBOM, EBOMb, Hash3, Hash3b, FSB(2,0)b, GSB2b, FSB(4,0)b, FSB(4,1)b, FSB(4,2)b]


KJV: Speed-up factors of 2-byte read

GSB2       1.32
FSB(2,0)   1.34
FSB(2,1)   1.24
FSB(4,0)   1.72
FSB(4,1)   2.15
FSB(4,2)   2.03
Hash3      1.05
EBOM       1.17


Other experiments

DNA and binary data were also tested.

• The gain from lookahead characters or the greedy loop was smaller than with the Bible data.

Gain of 2-byte read was smaller with 64-bit code than with 32-bit code.


Conclusions

Two new algorithms were presented:

• FSB(q,f)

• GSB2

The new algorithms are faster than earlier algorithms on English data:

• GSB2 for m = 4, …, 8

• FSB(q,f) for m = 8, …, 20

2-byte read makes most string algorithms faster.


Web site for practical speed comparison

cse.aalto.fi/stringmatching