Survey: String Matching Algorithms
Vishal Kumar Jaiswal
1st yr. M.Tech. CST
IIEST Shibpur
Outline
❖Problem Statement
❖Exact String Matching
➢ Naive String Matching
➢ Rabin Karp Algorithm
➢ Finite State Automaton
➢ Knuth Morris Pratt Algorithm
❖Approximate String Matching
String Matching
The objective of string searching is to find the location of a specific text pattern within a larger body of text (e.g., a sentence, a paragraph, a book, etc.).
Formally, assume that the text is an array T[1..n] of length n and that the pattern is an array P[1..m] of length m <= n. The elements of P and T are characters drawn from a finite alphabet Σ.
Assumptions and Terminology
The text T is static, given before queries are made, available for preprocessing and storing in a data structure.
The concatenation of two strings x and y, denoted xy, has length |x|+|y| and consists of the characters from x followed by the characters from y.
A string w is a prefix of a string x, denoted w ⊏ x, if x = wy for some string y ∈ Σ*.
A string w is a suffix of a string x, denoted w ⊐ x, if x = yw for some string y ∈ Σ*.
Problem Statement
The pattern P occurs with shift s in text T (or, equivalently, pattern P occurs beginning at position s+1 in text T) if 0 <= s <= n-m and T[s+1..s+m] = P[1..m] (that is, if T[s+j] = P[j] for 1 <= j <= m).
Fig. 1. Demonstration of string search problem
Valid / Invalid shifts
If P occurs with shift s in T, then we call s a valid shift; otherwise we call s an invalid shift.
The string matching problem is the problem of finding all valid shifts with which a given pattern P occurs in a given text T.
If a variable-width encoding is in use, finding the Nth character is slow (time proportional to N).
Algorithms
Algorithms that search for a finite set of patterns:
Aho-Corasick string matching algorithm
Commentz-Walter algorithm
Rabin-Karp string search algorithm
Algorithms that search for an infinite set of patterns:
Represent pattern using regular expression or regular grammar.
Exact String Matching
Naive string matching
Check at all positions in the text between 0 and n-m whether an occurrence of the pattern starts there or not.
After each attempt, shift the pattern by exactly one position to the right.
The time complexity of this searching phase is O(mn) (for instance, when searching for a^(m-1)b in a^n).
Requires no preprocessing phase.
Only constant extra space needed.
The expected number of text character comparisons is 2n.
Naive string matching Algorithm
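The procedure described on the previous slide can be sketched in Python as follows. This is a minimal illustration rather than the original slide's pseudocode; the function name naive_search is an assumption, and shifts are reported with 0-based string indexing, so a returned value s corresponds to the shift s defined earlier.

```python
def naive_search(text, pattern):
    """Return every valid shift s at which pattern occurs in text."""
    n, m = len(text), len(pattern)
    shifts = []
    # Try each candidate shift from 0 to n - m, moving one position at a time.
    for s in range(n - m + 1):
        j = 0
        # Compare character by character until a mismatch or a full match.
        while j < m and text[s + j] == pattern[j]:
            j += 1
        if j == m:
            shifts.append(s)
    return shifts


if __name__ == "__main__":
    print(naive_search("abcabaabcabac", "abaa"))
```

No preprocessing is done and only a few integer variables are kept besides the result list, which matches the properties listed above.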
Example
String matching with finite automata
The automaton examines each text character exactly once, taking a constant amount of time per text character. So, the matching time after preprocessing is Θ(n).
To search a word x with an automaton, first build the minimal Deterministic Finite Automaton (DFA) A(x) recognizing the language Σ*x.
If Σ is large, then the time to build the automaton can be large.
String matching with finite automata
The DFA A(x) = (Q, q0, T, E) recognizing the language Σ*x is defined as follows:
Q is the set of all prefixes of x: Q = {ε, x[0], x[0..1], ..., x[0..m-2], x};
q0 = ε;
T={x};
For q in Q (q is prefix of x) and a in Σ, (q,a,qa) is in E if and only if qa is also a prefix of x, otherwise (q,a,p) is in E such that p is the longest suffix of qa which is a prefix of x.
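The definition above can be turned into a small Python sketch. Here each state is represented by the length of the corresponding prefix of x, and the transition table is built directly from the "longest suffix of qa that is a prefix of x" rule. The names build_dfa and dfa_search and the explicit alphabet parameter are assumptions made for illustration. The construction is deliberately naive and slower than the Θ(m|Σ|) bound achievable with a more careful method; it also assumes a non-empty pattern whose characters all belong to the alphabet.

```python
def build_dfa(pattern, alphabet):
    """Transition table: delta[q][a] is the next state (a prefix length)."""
    m = len(pattern)
    delta = [dict() for _ in range(m + 1)]
    for q in range(m + 1):
        for a in alphabet:
            candidate = pattern[:q] + a
            # Longest prefix of the pattern that is a suffix of candidate.
            k = min(m, q + 1)
            while k > 0 and not candidate.endswith(pattern[:k]):
                k -= 1
            delta[q][a] = k
    return delta


def dfa_search(text, pattern, alphabet):
    """Return all 0-based shifts at which pattern occurs in text."""
    delta = build_dfa(pattern, alphabet)
    m = len(pattern)
    q = 0  # start state: the empty prefix
    shifts = []
    for i, ch in enumerate(text):
        # A character outside the alphabet cannot extend any prefix.
        q = delta[q].get(ch, 0)
        if q == m:  # terminal state: the whole pattern has been read
            shifts.append(i - m + 1)
    return shifts


if __name__ == "__main__":
    print(dfa_search("abababacaba", "ababaca", "abc"))
```

Once the table is built, the search loop does one table lookup per text character, which is where the Θ(n) matching time comes from.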
Finite automata: Example
Rabin Karp String Matching
Performs well in practice and also generalizes to other problems e.g. two dimensional pattern matching.
Uses hashing to find any one of a set of pattern strings in a text.
For text of length n and p patterns of combined length m, its average and best case running time is O(n+m) in space O(p), but its worst case time is O(nm).
A practical application of this algorithm is detecting plagiarism.
Hash Function
A hash function is a function that converts every string into a numeric value, called its hash value. For example, hash(“hello”) = 5. A hash function should be computationally efficient and highly discriminating.
hash(y[j+1...j+m]) must be easily computable from hash(y[j…j+m-1]) and y[j+m];
hash(y[j+1...j+m])=rehash(y[j],y[j+m],hash(y[j…j+m-1]))
Hash Function
For a word w of length m let hash(w) be defined as follows:
hash(w[0..m-1]) = (w[0]*2^(m-1) + w[1]*2^(m-2) + ... + w[m-1]*2^0) mod q, where q is a large number
If two strings are equal, their hash values are also equal.
Compute hash value of the substring we’re searching for, and then look for a substring with the same hash value.
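A minimal Python sketch of this idea, assuming the base-2 hash defined above, character codes taken from ord, and an arbitrarily chosen prime modulus q (the default value here is an assumption, not something the slides specify). The rolling update plays the role of rehash: it removes the contribution of the outgoing character, doubles, and adds the incoming character. Because different strings can share a hash value, every hash hit is confirmed by a direct comparison.

```python
def rabin_karp(text, pattern, q=1_000_000_007):
    """Return all 0-based shifts at which pattern occurs in text."""
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []
    high = pow(2, m - 1, q)  # weight 2^(m-1) of the leading character, mod q

    def full_hash(s):
        # Horner evaluation of s[0]*2^(m-1) + ... + s[m-1]*2^0, mod q.
        h = 0
        for ch in s:
            h = (2 * h + ord(ch)) % q
        return h

    pattern_hash = full_hash(pattern)
    window_hash = full_hash(text[:m])
    shifts = []
    for j in range(n - m + 1):
        # Equal hashes only suggest a match, so verify the characters.
        if window_hash == pattern_hash and text[j:j + m] == pattern:
            shifts.append(j)
        if j + m < n:
            # rehash: drop text[j], shift left by one, append text[j + m].
            window_hash = ((window_hash - ord(text[j]) * high) * 2
                           + ord(text[j + m])) % q
    return shifts


if __name__ == "__main__":
    print(rabin_karp("abracadabra", "abra"))
```

With a very small q the result stays correct, but many spurious hash hits have to be verified, which is how the worst case degrades toward O(nm).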
Rabin Karp Example
The Knuth Morris Pratt Algorithm
Consider an attempt at a left position j, that is, when the window is positioned on the text factor y[j..j+m-1]. Assume that the first mismatch occurs between x[i] and y[i+j] with 0 < i < m. Then x[0..i-1] = y[j..i+j-1] = u and a = x[i] != y[i+j] = b.
When shifting, a prefix v of the pattern can match some suffix of the portion u of the text. If we want to avoid another immediate mismatch, the character following the prefix v in the pattern must be different from a. The longest such prefix v is called the tagged border of u.
Prefix Table
q 1 2 3 4 5 6 7 8
P[q] G C A G A G A G
𝜋[q] 0 0 0 1 0 1 0 1
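The table above can be reproduced with a short Python sketch. It computes the plain prefix function 𝜋 (the length of the longest proper prefix of P[1..q] that is also a suffix of it) and uses it to search; it does not implement the tagged-border refinement described earlier. The names prefix_function and kmp_search are assumptions, and the code uses 0-based indexing, so pi[q-1] in the code corresponds to 𝜋[q] in the table.

```python
def prefix_function(pattern):
    """pi[q] = length of the longest proper prefix of pattern[:q+1]
    that is also a suffix of it."""
    m = len(pattern)
    pi = [0] * m
    k = 0  # length of the border currently being extended
    for q in range(1, m):
        while k > 0 and pattern[k] != pattern[q]:
            k = pi[k - 1]  # fall back to the next shorter border
        if pattern[k] == pattern[q]:
            k += 1
        pi[q] = k
    return pi


def kmp_search(text, pattern):
    """Return all 0-based shifts at which pattern occurs in text."""
    m = len(pattern)
    if m == 0:
        return []
    pi = prefix_function(pattern)
    shifts = []
    k = 0  # number of pattern characters matched so far
    for i, ch in enumerate(text):
        while k > 0 and pattern[k] != ch:
            k = pi[k - 1]
        if pattern[k] == ch:
            k += 1
        if k == m:
            shifts.append(i - m + 1)
            k = pi[k - 1]  # keep going to find overlapping occurrences
    return shifts


if __name__ == "__main__":
    print(prefix_function("GCAGAGAG"))
    print(kmp_search("GCATCGCAGAGAGTATACAGTACG", "GCAGAGAG"))
```

The text index i never moves backwards, which gives the Θ(n) matching time after Θ(m) preprocessing.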
Knuth Morris Pratt Example
The Knuth Morris Pratt Algorithm
Fig. 2. Comparison of single-pattern search algorithms
Algorithm                             Preprocessing time        Matching time
Naïve search algorithm                0 (no preprocessing)      Θ((n-m)m)
Rabin-Karp string search algorithm    Θ(m)                      Average Θ(n+m), worst Θ((n-m)m)
Finite state automaton                Θ(m|Σ|)                   Θ(n)
Knuth-Morris-Pratt algorithm          Θ(m)                      Θ(n)
Boyer-Moore string search algorithm   Θ(m+|Σ|)                  Best Ω(n/m), worst O(n)
Bitap algorithm                       Θ(m+|Σ|)                  O(mn)
Note: The Boyer-Moore string search algorithm is the standard benchmark in the practical string search literature.
Approximate String Matching
Approximate / Fuzzy String Searching
Finds strings that match a pattern approximately (rather than exactly).
The closeness of a match is measured in terms of the primitive operations necessary to convert the string into an exact match. This number is called the edit distance between the string and the pattern. The usual primitive operations are:
Insertion: cot->coat
Deletion: coat->cot
Substitution: coat->cost
Transposition: cost->cots
Problem Definition and Solution Strategy
Given a pattern string P = p1p2...pm and a text string T = t1t2...tn, find a substring T[j'..j] = tj'...tj of T which, of all substrings of T, has the smallest edit distance to the pattern P.
Brute-force approach:
First, compute the edit distance to P for all substrings of T.
Then choose the substring with the minimum edit distance. However, this algorithm has running time O(n^3 m).
Dynamic programming based approach
For each position j in the text T and each position i in the pattern P, go through all substrings of T ending at position j, and determine which one of them has the minimal edit distance to the first i characters of the pattern P. Write this minimal distance as E(i,j).
After computing E(i,j), for all i and j, we can easily find a solution to the original problem: It is the substring for which E(m,j) is minimal (m being the length of the pattern P).
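The slides do not spell out how E(i,j) is computed. One common way, shown here as a hedged Python sketch, is the recurrence E(0,j) = 0, E(i,0) = i, and E(i,j) = min(E(i-1,j-1) + [p_i != t_j], E(i-1,j) + 1, E(i,j-1) + 1). This sketch counts only insertions, deletions and substitutions (transpositions are not handled), keeps a single column of the table at a time, and returns the smallest distance together with the text position where the best substring ends. The function name best_approximate_match is an assumption.

```python
def best_approximate_match(text, pattern):
    """Return (distance, end) where distance is the smallest edit distance
    between pattern and any substring of text, and end is the 1-based
    position in text where the first such substring ends (0 if empty)."""
    n, m = len(text), len(pattern)
    prev = list(range(m + 1))  # column j = 0: E(i, 0) = i
    best_dist, best_end = prev[m], 0
    for j in range(1, n + 1):
        cur = [0] * (m + 1)  # E(0, j) = 0: a match may start anywhere
        for i in range(1, m + 1):
            cost = 0 if pattern[i - 1] == text[j - 1] else 1
            cur[i] = min(prev[i - 1] + cost,  # match or substitution
                         cur[i - 1] + 1,      # pattern character left unmatched
                         prev[i] + 1)         # extra text character
        if cur[m] < best_dist:
            best_dist, best_end = cur[m], j
        prev = cur
    return best_dist, best_end


if __name__ == "__main__":
    print(best_approximate_match("the coat is red", "cot"))
```

Filling the table takes O(nm) time, compared with O(n^3 m) for the brute-force approach.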
Online Matching
Online searching:
The pattern can be processed before searching, but the text cannot. The best-known improved online searching algorithm is the bitap algorithm, which is used by the Unix search utility agrep.
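To give a feel for how bitap works, here is a minimal Python sketch of its exact-matching core (often called Shift-And). It is only an illustration: the approximate variant used for fuzzy search keeps one such bit vector per allowed error, which is not shown here. The function name bitap_search is an assumption. Bit i of the state is set when the first i+1 pattern characters match the text ending at the current position.

```python
def bitap_search(text, pattern):
    """Return all 0-based shifts at which pattern occurs in text (exact)."""
    m = len(pattern)
    if m == 0:
        return []
    # Preprocessing: masks[ch] has bit i set wherever pattern[i] == ch.
    masks = {}
    for i, ch in enumerate(pattern):
        masks[ch] = masks.get(ch, 0) | (1 << i)
    accept = 1 << (m - 1)  # bit that signals a full match
    state = 0
    shifts = []
    for j, ch in enumerate(text):
        # Extend every partial match by one character and start a new one.
        state = ((state << 1) | 1) & masks.get(ch, 0)
        if state & accept:
            shifts.append(j - m + 1)
    return shifts


if __name__ == "__main__":
    print(bitap_search("abracadabra", "abra"))
```

Only the pattern is preprocessed (into the masks), and the text is read once from left to right, which is what makes the method suitable for online searching.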
Applications
• Spam Filtering,
• De-duplication,
• Identity Resolution,
• Microsoft’s Spell Checker and Autocorrect feature,
• Google search engine’s “Showing results for”,
• Searching lyrics of a song.
Note: Approximate string search cannot be used for most binary data, such as images and music. These applications require different algorithms, such as acoustic fingerprinting.
Applications: Error in Fuzzy String Matching!
Bibliography
1. Christian Charras and Thierry Lecroq, “Exact String Matching Algorithms: Animation in Java”, http://www-igm.univ-mlv.fr/~lecroq/string/index.html
2. Thomas H. Cormen et al., “Introduction to Algorithms”, Second Edition. MIT Press and McGraw-Hill, 2001. ISBN 0-262-03293-7. Chapter 32: String Matching, pp. 906–932.
Thank You
Find this presentation at:
www.slideshare.net/VishalKumarJaiswal2