Survey of String Matching Algorithm

Post on 14-Apr-2017


Survey: String Matching Algorithms

Vishal Kumar Jaiswal

1st yr. M.Tech. CST

IIEST Shibpur

Outline

❖Problem Statement

❖Exact String Matching

➢ Naive String Matching

➢ Rabin Karp Algorithm

➢ Finite State Automaton

➢ Knuth Morris Pratt Algorithm

❖Approximate String Matching

String Matching

The objective of string searching is to find the location of a specific text pattern within a larger body of text (e.g., a sentence, a paragraph, a book, etc.).

Formally, let the text be an array T[1..n] of length n and the pattern be an array P[1..m] of length m<=n. The elements of P and T are characters drawn from a finite alphabet Σ.

Assumptions and Terminology

The text T is static, given before queries are made, available for preprocessing and storing in a data structure.

The concatenation of two strings x and y, denoted xy, has length |x|+|y| and consists of the characters from x followed by the characters from y.

A string w is a prefix of a string x, denoted w ⊏ x, if x=wy for some string y∈Σ*.

A string w is a suffix of a string x, denoted w ⊐ x, if x=yw for some string y∈Σ*.

Problem Statement

The pattern P occurs with shift s in text T (or, equivalently, that pattern P occurs beginning at position s+1 in text T) if 0<=s<=n-m and T[s+1....s+m]=P[1...m] (that is, if T[s+j]=P[j], for 1<=j<=m).

Fig. 1. Demonstration of string search problem

Valid / Invalid shifts

If P occurs with shift s in T, then we call s a valid shift; otherwise we call s an invalid shift. The string matching problem is the problem of finding all valid shifts with which a given pattern P occurs in a given text T.

If a variable width encoding is in use then it is slow (time proportional to N) to find the Nth character.

Algorithms

Algorithms for searching a finite set of patterns:

Aho-Corasick string matching algorithm

Commentz-Walter algorithm

Rabin-Karp string search algorithm

Algorithms for searching an infinite set of patterns:

Represent the patterns using a regular expression or a regular grammar.
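As a small illustration of describing an infinite pattern set with a regular expression, the sketch below uses Python's standard re module. The expression ab*c stands for the infinite set {ac, abc, abbc, ...}; the helper name find_regex_shifts and the sample text are assumptions made for this example, not part of the original slides.

```python
import re

def find_regex_shifts(text, regex):
    """Return (shift, matched string) for each non-overlapping match of regex in text."""
    return [(m.start(), m.group()) for m in re.finditer(regex, text)]

# "ab*c" denotes infinitely many patterns: ac, abc, abbc, ...
matches = find_regex_shifts("xacyabbczabc", r"ab*c")
print(matches)  # [(1, 'ac'), (4, 'abbc'), (9, 'abc')]
```

Note that re.finditer reports non-overlapping matches only, so it does not enumerate every valid shift when matches overlap.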

Exact String Matching

Naive string matching

Check at all positions in the text between 0 and n-m whether an occurrence of the pattern starts there or not.

After each attempt, shift the pattern by exactly one position to the right.

The time complexity of this searching phase is O(mn) (when searching for a^(m-1)b in a^n, for instance).

Requires no preprocessing phase.

Only constant extra space needed.

The expected number of text character comparisons is 2n.

Naive string matching Algorithm
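The procedure described above (try every shift, compare character by character, move right by one) can be sketched in Python as follows. Shifts are 0-based, so a returned value s corresponds to an occurrence beginning at position s+1 in the 1-based notation of the problem statement. The function name naive_search is chosen for this illustration.

```python
def naive_search(text, pattern):
    """Return every valid shift s (0-based) at which pattern occurs in text."""
    n, m = len(text), len(pattern)
    shifts = []
    for s in range(n - m + 1):          # candidate shifts 0 .. n-m
        j = 0
        # Compare left to right until a mismatch or the pattern is exhausted.
        while j < m and text[s + j] == pattern[j]:
            j += 1
        if j == m:                      # all m characters matched
            shifts.append(s)
    return shifts

print(naive_search("abracadabra", "abra"))  # [0, 7]
```

No preprocessing is done and only a few integer variables are kept besides the output list, in line with the properties listed above.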

Example

String matching with finite automata

The automaton examines each text character exactly once, taking a constant amount of time per text character. So, the matching time after preprocessing is Θ(n).

To search a word x with an automaton, first build the minimal Deterministic Finite Automaton (DFA) A(x) recognizing the language Σ*x.

If Σ is large, then the time to build the automaton can be large.

String matching with finite automata

The DFA A(x)=(Q,q0,T,E) recognizing the language Σ*x is defined as follows:

Q is the set of all prefixes of x: Q={ε, x[0], x[0..1], ..., x[0..m-2], x};

q0=ε;

T={x};

For q in Q (q is prefix of x) and a in Σ, (q,a,qa) is in E if and only if qa is also a prefix of x, otherwise (q,a,p) is in E such that p is the longest suffix of qa which is a prefix of x.

Finite automata: Example
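A minimal Python sketch of this construction follows. Each state is represented by the length of the prefix it stands for, so state q means "the prefix x[0..q-1]". The transition table is built directly from the definition (longest suffix of qa that is a prefix of x), which is simple but slower than the Θ(m|Σ|) construction; the names build_dfa and dfa_search, and the default choice of alphabet, are assumptions for this illustration.

```python
def build_dfa(pattern, alphabet):
    """delta[q][a] = length of the longest prefix of pattern that is a suffix of pattern[:q] + a."""
    m = len(pattern)
    delta = [dict() for _ in range(m + 1)]
    for q in range(m + 1):
        for a in alphabet:
            candidate = pattern[:q] + a
            k = min(m, q + 1)
            # Shrink k until pattern[:k] is a suffix of candidate (k = 0 always works).
            while k > 0 and not candidate.endswith(pattern[:k]):
                k -= 1
            delta[q][a] = k
    return delta

def dfa_search(text, pattern, alphabet=None):
    """Return all 0-based shifts of pattern in text by running the automaton over text."""
    if alphabet is None:
        alphabet = set(pattern) | set(text)   # assumed alphabet: characters actually seen
    delta = build_dfa(pattern, alphabet)
    m = len(pattern)
    q, shifts = 0, []
    for i, ch in enumerate(text):
        q = delta[q].get(ch, 0)               # a character outside the alphabet resets to state 0
        if q == m:                            # reached the terminal state x
            shifts.append(i - m + 1)
    return shifts

print(dfa_search("abababa", "aba"))  # [0, 2, 4]
```

After the table is built, the scan touches each text character once, which is where the Θ(n) matching time comes from.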

Rabin Karp String Matching

Performs well in practice and also generalizes to other problems e.g. two dimensional pattern matching.

Uses hashing to find any one of a set of pattern strings in a text.

For text of length n and p patterns of combined length m, its average and best case running time is O(n+m) in space O(p), but its worst case time is O(nm).

A practical application of this algorithm is detecting plagiarism.

Hash Function

A hash function is a function which converts every string into a numeric value, called its hash value. For example, hash(“hello”)=5. A hash function should be computationally efficient and highly discriminating.

hash(y[j+1..j+m]) must be easily computable from hash(y[j..j+m-1]), y[j] and y[j+m]:

hash(y[j+1...j+m])=rehash(y[j],y[j+m],hash(y[j…j+m-1]))

Hash Function

For a word w of length m let hash(w) be defined as follows:

hash(w[0..m-1]) = (w[0]*2^(m-1) + w[1]*2^(m-2) + ... + w[m-1]*2^0) mod q, where q is a large number.

If two strings are equal, their hash values are also equal.

Compute hash value of the substring we’re searching for, and then look for a substring with the same hash value.

Rabin Karp Example
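The following Python sketch shows the rolling-hash idea for a single pattern, using the base-2 hash defined above. The default modulus q=1000003 and the function name rabin_karp are assumptions made for this illustration; any sufficiently large q would do. Because different strings can share a hash value, every hash hit is confirmed by a direct comparison.

```python
def rabin_karp(text, pattern, q=1_000_003):
    """Return all 0-based shifts of pattern in text using a base-2 rolling hash mod q."""
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []
    high = pow(2, m - 1, q)                 # weight of the leading character, 2^(m-1) mod q
    hp = ht = 0
    for i in range(m):                      # hash of the pattern and of the first window
        hp = (hp * 2 + ord(pattern[i])) % q
        ht = (ht * 2 + ord(text[i])) % q
    shifts = []
    for s in range(n - m + 1):
        # Equal hashes only suggest a match; verify to rule out a collision.
        if hp == ht and text[s:s + m] == pattern:
            shifts.append(s)
        if s < n - m:
            # rehash: drop text[s], shift by the base, append text[s+m]
            ht = ((ht - ord(text[s]) * high) * 2 + ord(text[s + m])) % q
    return shifts

print(rabin_karp("abracadabra", "abra"))  # [0, 7]
```

With a very small q almost every window collides and each one is verified, which is the O(nm) worst case mentioned above.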

The Knuth Morris Pratt Algorithm

Consider an attempt at a left position j, that is, when the window is positioned on the text factor y[j..j+m-1]. Assume that the first mismatch occurs between x[i] and y[i+j] with 0<i<m. Then, x[0..i-1]=y[j..i+j-1]=u and a=x[i]!=y[i+j]=b.

When shifting, a prefix v of the pattern can match some suffix of the portion u of the text. If we want to avoid another immediate mismatch, the character following the prefix v in the pattern must be different from a. The longest such prefix v is called the tagged border of u.

Prefix Table

q      1  2  3  4  5  6  7  8
P[q]   G  C  A  G  A  G  A  G
π[q]   0  0  0  1  0  1  0  1

Knuth Morris Pratt Example
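A minimal Python sketch of the search is given below. It uses the plain prefix function π shown in the prefix table (the length of the longest proper prefix of P[1..q] that is also a suffix of it), not the tagged-border refinement; the function names and the sample text are assumptions for this illustration. Indices in the code are 0-based, so pi[q-1] corresponds to π[q] in the table.

```python
def prefix_table(pattern):
    """pi[q] = length of the longest proper prefix of pattern[:q+1] that is also its suffix."""
    pi = [0] * len(pattern)
    k = 0
    for q in range(1, len(pattern)):
        while k > 0 and pattern[k] != pattern[q]:
            k = pi[k - 1]                   # fall back to the next shorter border
        if pattern[k] == pattern[q]:
            k += 1
        pi[q] = k
    return pi

def kmp_search(text, pattern):
    """Return all 0-based shifts of pattern in text without re-reading text characters."""
    m = len(pattern)
    if m == 0:
        return []
    pi = prefix_table(pattern)
    shifts, k = [], 0                       # k = number of pattern characters matched so far
    for i, ch in enumerate(text):
        while k > 0 and pattern[k] != ch:
            k = pi[k - 1]
        if pattern[k] == ch:
            k += 1
        if k == m:
            shifts.append(i - m + 1)
            k = pi[k - 1]                   # keep going to find overlapping occurrences
    return shifts

print(prefix_table("GCAGAGAG"))                           # [0, 0, 0, 1, 0, 1, 0, 1]
print(kmp_search("GCATCGCAGAGAGTATACAGTACG", "GCAGAGAG")) # [5]
```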

The Knuth Morris Pratt Algorithm

Fig. 2. Comparison of single-pattern search algorithms

Algorithm                             Preprocessing Time     Matching Time
Naïve search algorithm                0 (no preprocessing)   Θ((n-m+1)m)
Rabin Karp string search algorithm    Θ(m)                   Average Θ(n+m), worst Θ((n-m+1)m)
Finite State Automaton                Θ(m|Σ|)                Θ(n)
Knuth Morris Pratt algorithm          Θ(m)                   Θ(n)
Boyer Moore string search algorithm   Θ(m+|Σ|)               Best Ω(n/m), worst O(mn) (O(n) with the Galil rule)
Bitap algorithm                       Θ(m+|Σ|)               O(mn)

Note: Boyer Moore String Search Algorithm is the standard benchmark for the practical string search literature

Approximate String Matching

Approximate (fuzzy) string searching finds strings that match a pattern approximately (rather than exactly).

The closeness of a match is measured in terms of the primitive operations necessary to convert the string into an exact match. This number is called the edit distance between the string and the pattern. The usual primitive operations are:

Insertion: cot->coat

Deletion: coat->cot

Substitution: coat->cost

Transposition: cost->cots
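To make the four operations concrete, here is a small Python sketch that counts them with a dynamic programming table. It computes the restricted (optimal string alignment) variant, in which a transposition swaps two adjacent characters and counts as one operation; the function name edit_distance is an assumption for this illustration.

```python
def edit_distance(a, b):
    """Minimum number of insertions, deletions, substitutions and adjacent transpositions."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i                         # delete all i characters of a
    for j in range(cols):
        d[0][j] = j                         # insert all j characters of b
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # match or substitution
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[-1][-1]

print(edit_distance("cot", "coat"))   # 1 (insertion)
print(edit_distance("coat", "cot"))   # 1 (deletion)
print(edit_distance("coat", "cost"))  # 1 (substitution)
print(edit_distance("cost", "cots"))  # 1 (transposition)
```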

Problem Definition and Solution Strategy

Given a pattern string P=p1p2...pm and a text string T=t1t2...tn, find a substring T[j'..j]=tj'...tj of T which, of all substrings of T, has the smallest edit distance to the pattern P.

Brute force approach:

First, compute the edit distance to P for all substrings of T.

Then choose the substring with the minimum edit distance. However, this algorithm would have a running time of O(n^3 m).

Dynamic programming based approach

For each position j in the text T and each position i in the pattern P, go through all substrings of T ending at position j, and determine which one of them has the minimal edit distance to the first i characters of the pattern P. Write this minimal distance as E(i,j).

After computing E(i,j), for all i and j, we can easily find a solution to the original problem: It is the substring for which E(m,j) is minimal (m being the length of the pattern P).
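The table E(i,j) can be filled column by column in O(nm) time. The Python sketch below does this with insertion, deletion and substitution as the unit-cost operations (transposition is left out to keep it short) and also remembers where the best substring starts so it can be returned. The function name best_approximate_match and the tie-breaking rule are assumptions for this illustration.

```python
def best_approximate_match(text, pattern):
    """Return (distance, substring) for a substring of text closest to pattern."""
    m = len(pattern)
    # Column j = 0: E(i, 0) = i, measured against the empty substring at offset 0.
    prev = list(range(m + 1))
    prev_start = [0] * (m + 1)
    best = (prev[m], 0, 0)                  # (distance, start, end) for text[start:end]
    for j in range(1, len(text) + 1):
        cur = [0] * (m + 1)                 # E(0, j) = 0: a match may start anywhere
        cur_start = [j] * (m + 1)
        for i in range(1, m + 1):
            cost = 0 if pattern[i - 1] == text[j - 1] else 1
            cur[i], cur_start[i] = min(
                (prev[i - 1] + cost, prev_start[i - 1]),  # match or substitution
                (prev[i] + 1, prev_start[i]),             # extra character in the text
                (cur[i - 1] + 1, cur_start[i - 1]),       # pattern character missing from the text
            )
        if cur[m] < best[0]:                # E(m, j) is the best match ending at j
            best = (cur[m], cur_start[m], j)
        prev, prev_start = cur, cur_start
    distance, start, end = best
    return distance, text[start:end]

print(best_approximate_match("xxabcxx", "abc"))  # (0, 'abc')
```

Only two columns are kept at a time, so the extra space is O(m) rather than O(nm).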

Online Matching

Online searching:

The pattern can be preprocessed before searching, but the text cannot. One of the most refined online searching algorithms is the bitap algorithm, which is used by the Unix search utility agrep.
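The bit-parallel idea behind bitap can be sketched in Python for the exact-matching case. This is only the core of the technique: the approximate variant keeps one such bit vector per allowed error and is not shown here. The function name bitap_search is an assumption for this illustration. Bit i of the state is set when the first i+1 pattern characters match the text ending at the current position.

```python
def bitap_search(text, pattern):
    """Exact-match bitap (shift-and): return all 0-based shifts of pattern in text."""
    m = len(pattern)
    if m == 0:
        return []
    # Preprocessing: masks[ch] has bit i set when pattern[i] == ch.
    masks = {}
    for i, ch in enumerate(pattern):
        masks[ch] = masks.get(ch, 0) | (1 << i)
    accept = 1 << (m - 1)                   # bit that signals a full match
    state, shifts = 0, []
    for j, ch in enumerate(text):
        # Extend every partial match by one character and start a new one at bit 0.
        state = ((state << 1) | 1) & masks.get(ch, 0)
        if state & accept:
            shifts.append(j - m + 1)
    return shifts

print(bitap_search("abracadabra", "abra"))  # [0, 7]
```

Only the pattern is preprocessed (into the per-character masks), and the text is read once from left to right, which matches the online setting described above.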

Applications

• Spam Filtering,

• De-duplication,

• Identity Resolution,

• Microsoft’s Spell Checker and Autocorrect feature,

• Google search engine’s “Showing results for”,

• Searching lyrics of a song.

Note: Approximate string search cannot be used for most binary data, such as images and music. These applications require different algorithms, such as acoustic fingerprinting.

Applications

Error in Fuzzy String Matching!

Bibliography

1. Christian Charras and Thierry Lecroq, “Exact String Matching Algorithms: Animation in Java”, http://www-igm.univ-mlv.fr/~lecroq/string/index.html

2. Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest and Clifford Stein, “Introduction to Algorithms”, Second Edition. MIT Press and McGraw-Hill, 2001. ISBN 0-262-03293-7. Chapter 32: String Matching, pp. 906–932.

Thank You

Find this presentation at:

www.slideshare.net/VishalKumarJaiswal2