CSC 212 – Data Structures Lecture 36: Pattern Matching.

Post on 18-Jan-2016



Suffixes and Prefixes

“I am the Lizard King!”

Prefixes:
  "I"
  "I "
  "I a"
  "I am"
  …
  "I am the Lizard Kin"
  "I am the Lizard King"
  "I am the Lizard King!"

Suffixes:
  "!"
  "g!"
  "ng!"
  "ing!"
  …
  "am the Lizard King!"
  " am the Lizard King!"
  "I am the Lizard King!"
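As a minimal sketch of the two lists above, the snippet below enumerates every non-empty prefix and suffix of a string. The function names `prefixes` and `suffixes` are illustrative and do not come from the slides.

```python
def prefixes(s):
    """Return every non-empty prefix of s, shortest first."""
    return [s[:k] for k in range(1, len(s) + 1)]


def suffixes(s):
    """Return every non-empty suffix of s, shortest first."""
    return [s[-k:] for k in range(1, len(s) + 1)]


phrase = "I am the Lizard King!"
print(prefixes(phrase)[:4])  # ['I', 'I ', 'I a', 'I am']
print(suffixes(phrase)[:4])  # ['!', 'g!', 'ng!', 'ing!']
```

Note that the whole string is both its own longest prefix and its own longest suffix; a proper prefix or suffix excludes that case.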

KMP Algorithm

Asymptotically optimal algorithm
  Means cannot do better in big-Oh terms
Compares from left-to-right
  So like BruteForce, not Boyer-Moore
  But shifts pattern intelligently
Relies on a Key Insight™
  Preprocess pattern to avoid redundant comparisons
  Always go forward; never, ever look back

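The KMP Algorithm

[Figure: text T = . . a b a a b x . . . . . with pattern P = a b a a b a aligned beneath it. The first five characters of P match; the mismatch occurs at rank j, where T holds x. P is then shifted right so that its prefix "a b" lines up with the "a b" already matched in T. Labels: "Do not repeat these comparisons"; "Need to resume comparing here"; "Shifting P here ensures these two entries match".]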

KMP Failure Function

Assume P[j] ≠ T[k]. Need the rank in P to compare next against T[k]
  E.g., how far should we shift P after a miss?
Uses the failure function value F(j-1)
  One value defined for each rank in P
  Specifies the rank in P at which comparisons must restart

Computing Failure Function

For rank j, F(j) is the length of the longest proper prefix of P[0...j] that is also a suffix of P[0...j]
For speed, store failure function in an array
Unlike Boyer-Moore, works w/infinite alphabets
Takes at most O(2m) = O(m) time
The algorithm that computes the failure function is similar to KMP itself
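To make the definition concrete, here is a deliberately slow sketch that computes the failure function straight from its definition, trying every proper prefix length at each rank. The name `failure_by_definition` is hypothetical; this is not the O(m) method from the lecture, only a reference for what the values mean.

```python
def failure_by_definition(p):
    """For each rank j, return the length of the longest proper prefix
    of p[0..j] that is also a suffix of p[0..j]."""
    f = []
    for j in range(len(p)):
        sub = p[:j + 1]
        best = 0
        # A proper prefix is strictly shorter than sub itself.
        for k in range(1, len(sub)):
            if sub[:k] == sub[-k:]:
                best = k
        f.append(best)
    return f


print(failure_by_definition("abaaba"))  # [0, 0, 1, 1, 2, 3]
```

This version does far more work than necessary, which is why a linear-time computation is worth having.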

Computing Failure Function

Algorithm KMPFailureFunction(String P)
  F[0] ← 0
  i ← 1
  j ← 0
  while i < P.length()
    if P[i] = P[j]       // So, P[0…j] = P[i - j…i]
      F[i] ← j + 1       // Record the length of this prefix/suffix
      i ← i + 1          // Advance a character and see if still matches
      j ← j + 1
    else if j > 0        // No match, need to restart our computation
      j ← F[j - 1]       // Skip over longest prefix that is also a suffix
    else
      F[i] ← 0           // No prefix of P[0…i] is a suffix of P[0…i]
      i ← i + 1          // Move to the next character
  return F
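The pseudocode above translates almost line for line into Python. The sketch below is one possible rendering; the name `kmp_failure_function` and the choice to preallocate the list with zeros are assumptions made for illustration.

```python
def kmp_failure_function(p):
    """Return the list F where F[i] is the length of the longest proper
    prefix of p[0..i] that is also a suffix of p[0..i]."""
    m = len(p)
    f = [0] * m          # F[0] is 0; an empty pattern yields an empty list
    i, j = 1, 0
    while i < m:
        if p[i] == p[j]:
            # p[0..j] equals p[i-j..i], so the match has length j + 1.
            f[i] = j + 1
            i += 1
            j += 1
        elif j > 0:
            # Fall back to the next shorter prefix that is also a suffix.
            j = f[j - 1]
        else:
            # No prefix can be extended at this rank.
            f[i] = 0
            i += 1
    return f


print(kmp_failure_function("abaaba"))  # [0, 0, 1, 1, 2, 3]
```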

KMP Failure Function

j     0  1  2  3  4  5
P[j]  a  b  a  a  b  a
F(j)  0  0  1  1  2  3

The KMP Algorithm

Algorithm KMPMatch(String T, String P)
  F ← KMPFailureFunction(P)
  i ← 0
  j ← 0
  while i < T.length()
    if P[j] = T[i]       // So, P[0…j] = T[i - j…i]
      if j = P.length() - 1
        return i - j     // Found P in T, starting at rank i - j
      i ← i + 1          // Advance and see if still a match
      j ← j + 1
    else if j > 0        // No match, but a prefix of P[0…j-1] matches
      j ← F[j - 1]       // So skip past longest prefix that is a suffix
    else
      i ← i + 1          // Nothing to reuse, move to the next character
  return -1              // P does not occur in T
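A runnable sketch of the matcher follows. It bundles its own copy of the failure function so that it stands alone. Two details are assumptions rather than anything stated on the slides: it returns -1 when the pattern is absent, and it treats an empty pattern as matching at rank 0.

```python
def kmp_failure_function(p):
    """Failure function: longest proper prefix/suffix length per rank."""
    f = [0] * len(p)
    i, j = 1, 0
    while i < len(p):
        if p[i] == p[j]:
            f[i] = j + 1
            i += 1
            j += 1
        elif j > 0:
            j = f[j - 1]
        else:
            i += 1
    return f


def kmp_match(t, p):
    """Return the rank in t where p first occurs, or -1 if it never does."""
    if not p:
        return 0  # assumption: the empty pattern matches at the start
    f = kmp_failure_function(p)
    i = j = 0
    while i < len(t):
        if p[j] == t[i]:
            if j == len(p) - 1:
                return i - j      # the whole pattern has matched
            i += 1
            j += 1
        elif j > 0:
            j = f[j - 1]          # reuse the prefix that already matches
        else:
            i += 1                # nothing to reuse; move on in the text
    return -1


print(kmp_match("abacaabaccabacabaabb", "abacab"))  # 10
```

The index `i` into the text never decreases, which is the "never look back" property the lecture emphasizes.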

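Example

T = a b a c a a b a c c a b a c a b a a b b
P = a b a c a b

[Figure: P is shown at five successive alignments beneath T, with the character comparisons numbered 1 through 19 in the order they are made. Comparisons 1-6 belong to the first alignment, 7 to the second, 8-12 to the third, 13 to the fourth, and 14-19 to the fifth, where the match is found.]

j     0  1  2  3  4  5
P[j]  a  b  a  c  a  b
F(j)  0  0  1  0  1  2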

The KMP Algorithm

In each pass of KMPMatch, either:
  P[j] = T[i]          ⇒ i increases by one, or
  P[j] ≠ T[i] & j > 0  ⇒ P shifted right by at least 1, or
  P[j] ≠ T[i] & j = 0  ⇒ i increases by 1
So at most 2n iterations of loop
KMPMatch takes O(2n) = O(n) time
KMPFailureFunction needs O(m) time
Thus, algorithm runs in O(m + n) time
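The 2n bound can be checked empirically with an instrumented matcher that counts loop iterations, each of which makes exactly one character comparison. This is a sketch only: the name `kmp_match_counting` and the returned (rank, comparisons) pair are hypothetical, and the sample text assumes the lecture's example is T = abacaabaccabacabaabb with P = abacab.

```python
def kmp_failure_function(p):
    """Failure function: longest proper prefix/suffix length per rank."""
    f = [0] * len(p)
    i, j = 1, 0
    while i < len(p):
        if p[i] == p[j]:
            f[i] = j + 1
            i += 1
            j += 1
        elif j > 0:
            j = f[j - 1]
        else:
            i += 1
    return f


def kmp_match_counting(t, p):
    """Return (rank of first match or -1, number of comparisons made).
    Assumes p is non-empty."""
    f = kmp_failure_function(p)
    i = j = comparisons = 0
    while i < len(t):
        comparisons += 1              # one comparison per loop pass
        if p[j] == t[i]:
            if j == len(p) - 1:
                return i - j, comparisons
            i += 1
            j += 1
        elif j > 0:
            j = f[j - 1]
        else:
            i += 1
    return -1, comparisons


text = "abacaabaccabacabaabb"         # assumed example text, n = 20
print(kmp_match_counting(text, "abacab"))  # (10, 19)
```

Here 19 comparisons is comfortably below 2n = 40; the bound holds because every pass either advances i or shifts the pattern right, and each can happen at most n times.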

Your Turn

Get back into groups and do activity

Before Next Lecture…

Finish up assignments
Start thinking about questions for Final