Lec 6-String Processing

String Processing Engr. Tazeen Muzammil


A lecture from the Data Structures course.

Transcript of Lec 6-String Processing

Page 1: Lec 6-String Processing

String Processing

Engr. Tazeen Muzammil

Page 2: Lec 6-String Processing

Basic Terminologies

• Each programming language contains a character set that is used to communicate with the computer. The character set usually includes the following:

• Alphabet: A, B, C, D, ..., Z
• Digits: 0, 1, 2, 3, 4, 5, 6, 7, 8, 9
• Special characters: +, -, /, *, ^, &, %, = etc.

• A finite sequence of 0 or more characters is called a string.

• The number of characters in a string is called its length.

• The string with zero characters is called the empty string or null string.

Page 3: Lec 6-String Processing

Storing Strings

Strings are stored in three types of structures:

1. Fixed-length structure
2. Variable-length structure
3. Linked structure

Page 4: Lec 6-String Processing

Fixed-Length Storage

• Record-Oriented

– In fixed-length storage each line of print is viewed as a record, where all records have the same length, i.e. each record accommodates the same number of characters. Assume our record has length 80 unless otherwise stated.

• Suppose the input consists of a program. Using a record-oriented, fixed-length storage medium, the input data will appear in memory as shown in the figure, where we assume that 200 is the address of the first character of the program.
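With records of length 80 stored back to back from address 200, the address of any record follows from simple arithmetic. The sketch below illustrates this; the function name `recordAddress` and its default arguments are assumptions made for this example, not part of the lecture.

```cpp
#include <cstddef>

// Hypothetical helper: address of the first character of record k (1-based),
// assuming records are stored consecutively starting at `base`.
constexpr std::size_t recordAddress(std::size_t k,
                                    std::size_t base = 200,
                                    std::size_t recordLength = 80) {
    return base + (k - 1) * recordLength;
}
```

Under these assumptions the first record starts at 200, the second at 280, and the third at 360, which is why any record can be reached directly without scanning the ones before it.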

Page 5: Lec 6-String Processing

Program

C PROGRAM PRINTING TWO INTEGERS IN INCREASING ORDER
      READ *,J,K
      IF(J.LE.K) THEN
            PRINT *,J,K
      ELSE
            PRINT *,K,J
      ENDIF
      STOP
      END

Page 6: Lec 6-String Processing

Record stored sequentially in computer

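[Figure: the records of the program laid out in consecutive memory cells, one character per cell. The first record (the comment line beginning "C PROGRAM PRINTING TWO") starts at address 200; the second record (the READ statement) starts at address 280.]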


Page 7: Lec 6-String Processing

Record stored sequentially in computer

[Figure: continuation of the memory layout. The third record (the IF ... THEN line) starts at address 360; the final record (END) starts at address 840.]


Page 8: Lec 6-String Processing

Advantages and Disadvantages

• Advantages
– The ease of accessing data from any given record
– The ease of updating data in any given record (as long as the length of the new data does not exceed the record length)

• Disadvantages
– Time is wasted reading an entire record if most of the storage consists of blank spaces.
– Certain records may require more space than is available.
– When a correction consists of more or fewer characters than the original text, changing a misspelled word requires the entire record to be changed.

Page 9: Lec 6-String Processing

Variable-Length Storage with Fixed Maximum

• Although strings may be stored in fixed-length memory locations as above, there are advantages in knowing the actual length of each string: one does not have to read the entire record when the string occupies only the beginning part of the memory location.

• The storage of variable-length strings in memory cells with fixed lengths can be done in two general ways:
1. One can use a marker, such as two dollar signs ($$), to signal the end of the string.
2. One can list the length of the string as an additional item in the pointer array.
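The two ways can be sketched as follows. This is a minimal illustration, not the lecture's implementation: the names `Record`, `SizedRecord`, `storeWithMarker`, `readUntilMarker`, `storeWithLength` and `readWithLength` are invented for the example, and the marker version assumes the text holds at most 78 characters and never contains "$$".

```cpp
#include <algorithm>
#include <array>
#include <cstddef>
#include <string>

constexpr std::size_t kRecordLength = 80;   // fixed cell size, as in the text
using Record = std::array<char, kRecordLength>;

// Way 1: write the text followed by the end marker "$$", pad with blanks.
Record storeWithMarker(const std::string& text) {
    Record cell;
    cell.fill(' ');
    const std::string marked = text + "$$";
    std::copy_n(marked.begin(), std::min(marked.size(), kRecordLength), cell.begin());
    return cell;
}

// Read only up to the marker instead of scanning all 80 cells.
std::string readUntilMarker(const Record& cell) {
    std::string out;
    for (std::size_t i = 0; i + 1 < kRecordLength; ++i) {
        if (cell[i] == '$' && cell[i + 1] == '$') break;
        out.push_back(cell[i]);
    }
    return out;
}

// Way 2: keep the length next to the record (standing in for the pointer array entry).
struct SizedRecord {
    Record cell;
    std::size_t length;
};

SizedRecord storeWithLength(const std::string& text) {
    SizedRecord r;
    r.cell.fill(' ');
    r.length = std::min(text.size(), kRecordLength);
    std::copy_n(text.begin(), r.length, r.cell.begin());
    return r;
}

std::string readWithLength(const SizedRecord& r) {
    return std::string(r.cell.begin(), r.cell.begin() + r.length);
}
```

In both versions a reader stops after the meaningful characters. The marker costs two cells and forbids "$$" inside the text, while the stored length costs one extra integer per record.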


Page 10: Lec 6-String Processing

Linked Storage

• A computer must be able to correct and modify printed matter, which usually means deleting, changing, and inserting words, phrases, sentences and even paragraphs in the text. Fixed-length memory cells do not easily lend themselves to these operations. For this reason strings are stored by means of linked lists.

Page 11: Lec 6-String Processing

Linked List

• A linked list, or one-way list, is a linear collection of data elements called nodes, where the linear order is given by means of pointers.

Page 12: Lec 6-String Processing

Linked Lists

• A linked list is a series of connected nodes
• Each node contains at least
– A piece of data (any type)
– Pointer to the next node in the list
• Head: pointer to the first node
• The last node points to NULL

[Figure: Head points to node A, which links to B and then C; each node consists of a data field and a pointer field.]
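A node as described above can be sketched in C++ as a small struct. The names `Node` and `collect` are chosen for this example only; the data field is a `char` here, although any type would do.

```cpp
#include <string>

struct Node {
    char data;    // a piece of data (a char here)
    Node* next;   // pointer to the next node, or nullptr for the last node
};

// Walk the list from the head and gather the data fields in order.
std::string collect(const Node* head) {
    std::string out;
    for (const Node* cur = head; cur != nullptr; cur = cur->next) {
        out.push_back(cur->data);
    }
    return out;
}
```

Linking three nodes A, B and C and passing a pointer to A as the head visits them in that order; the traversal stops when it reaches the null pointer in the last node.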

Page 13: Lec 6-String Processing

Linked Storage

• Strings may be stored in linked lists as follows. Each memory cell is assigned one character or a fixed number of characters, and a link contained in the cell gives the address of the cell containing the next character or group of characters in the string. For example:

To be or not to be, that is the question.

Page 14: Lec 6-String Processing

Linked Storage

[Figure: the start of the string stored as a linked list in two ways: one character per node, and four characters per node.]
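The two layouts differ only in how many characters each node holds. The following sketch builds either one; `ChunkNode`, `toLinked`, `fromLinked` and `countNodes` are hypothetical names, `perNode` is assumed to be at least 1, and `std::make_unique` needs C++14.

```cpp
#include <cstddef>
#include <memory>
#include <string>

struct ChunkNode {
    std::string chars;                 // up to perNode characters
    std::unique_ptr<ChunkNode> next;   // link to the next group, or null
};

// Split text into groups of perNode characters, one group per node.
std::unique_ptr<ChunkNode> toLinked(const std::string& text, std::size_t perNode) {
    std::unique_ptr<ChunkNode> head;
    ChunkNode* tail = nullptr;
    for (std::size_t pos = 0; pos < text.size(); pos += perNode) {
        auto node = std::make_unique<ChunkNode>();
        node->chars = text.substr(pos, perNode);   // last group may be shorter
        ChunkNode* raw = node.get();
        if (tail != nullptr) {
            tail->next = std::move(node);
        } else {
            head = std::move(node);
        }
        tail = raw;
    }
    return head;
}

// Follow the links and rebuild the original string.
std::string fromLinked(const ChunkNode* head) {
    std::string out;
    for (const ChunkNode* cur = head; cur != nullptr; cur = cur->next.get()) {
        out += cur->chars;
    }
    return out;
}

std::size_t countNodes(const ChunkNode* head) {
    std::size_t n = 0;
    for (const ChunkNode* cur = head; cur != nullptr; cur = cur->next.get()) ++n;
    return n;
}
```

One character per node makes single-character insertion and deletion simple but spends a link on every character; four characters per node needs roughly a quarter as many links, at the price of more work when an edit falls in the middle of a group.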

Page 15: Lec 6-String Processing

String Operations

• Substring (substr(pos, len))
– Accessing a substring from a given string requires two pieces of information:
1. The position of the first character of the substring, and
2. The length of the substring.

• Indexing (find())
– Indexing refers to finding the location of a substring within a string.
find(string)
find(string, positionFirstChar)
find(string, positionFirstChar, len)
rfind() (finds the last occurrence of a string or substring)

• Concatenation
– String concatenation is the operation of joining two character strings end to end. For example, the strings "snow" and "ball" may be concatenated to give "snowball".

• Length (length(), size())
– The number of characters in the string is called the length or size of the string.

Page 16: Lec 6-String Processing

Example

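Declarations used in the examples:
string s = "abc def abc";
string s2 = "abcde uvwxyz";
char c;
char ch[] = "aba daba do";
char *cp = ch;
string::size_type i;

• Substring
s = s2.substr(1,4);     // "bcde"
s = s2.substr(1,50);    // "bcde uvwxyz" (len is clipped at the end of s2)

• Length
i = s.length();
i = s.size();           // same value as length()

• Concatenation
s2 = s2 + "x";
s2 += "x";

• Find
i = s.find("ab",4);     // 8 when s is still "abc def abc"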

Page 17: Lec 6-String Processing

Word Processing

• The operations usually associated with word processing are:
– Replacement
• Replacing one string in the text by another
replace(pos1, len1, string)
replace(pos1, len1, string, pos2, len2)
– Insertion
• Inserting a string in the middle of the text
insert()
– Deletion
• Deleting a string from the text
erase(positionFirstChar)
erase(positionFirstChar, len)

Page 18: Lec 6-String Processing

Example

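Declarations used in the examples:
string s = "abc def abc";
string s2 = "abcde uvwxyz";
char c;
char ch[] = "aba daba do";
char *cp = ch;

Each call below is applied to the original s = "abc def abc".

• Replace
s.replace(4,3,"x");     // "abc x abc"

• Erase
s.erase(4,5);           // "abc bc"
s.erase(4);             // "abc " (everything from position 4 on is removed)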

Page 19: Lec 6-String Processing

Question

A. A text T and a pattern P are in memory. Write an algorithm which deletes every occurrence of P in T.

B. A text T and patterns P and Q are in memory. Write an algorithm which replaces every occurrence of P in T by Q.

Solution A.
[Find the index of P] Set K = Find(T,P)
Repeat while K ≠ 0
a) [Delete P from T] Set T = Delete(T, Find(T,P), Length(P))
b) [Update index] Set K = Find(T,P)
[End of loop]
Write T
Exit

Solution B.
[Find the index of P] Set K = Find(T,P)
Repeat while K ≠ 0
a) [Replace P by Q] Set T = Replace(T,P,Q)
b) [Update index] Set K = Find(T,P)
[End of loop]
Write T
Exit
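The two algorithms can be sketched in C++ with the string operations from the earlier slides. The function names `deleteAll` and `replaceAll` are assumptions for this example. `std::string::find` returns `npos` rather than 0 when P is absent, so the loop condition changes accordingly. One deliberate deviation: `replaceAll` resumes the search after the inserted Q, because rescanning from the start, as algorithm B does, would never terminate when Q itself contains P.

```cpp
#include <string>

// Algorithm A: delete every occurrence of p in t, rescanning from the start
// after each deletion, as in the pseudocode.
std::string deleteAll(std::string t, const std::string& p) {
    if (p.empty()) return t;                  // nothing sensible to delete
    std::string::size_type k = t.find(p);
    while (k != std::string::npos) {
        t.erase(k, p.size());
        k = t.find(p);
    }
    return t;
}

// Algorithm B: replace every occurrence of p in t by q.
// Deviation from the pseudocode: continue searching after the inserted q.
std::string replaceAll(std::string t, const std::string& p, const std::string& q) {
    if (p.empty()) return t;
    std::string::size_type k = t.find(p);
    while (k != std::string::npos) {
        t.replace(k, p.size(), q);
        k = t.find(p, k + q.size());
    }
    return t;
}
```

Because `deleteAll` rescans from the beginning, occurrences created by a deletion are removed as well: deleting "ab" from "aabb" first gives "ab" and then the empty string.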

Page 20: Lec 6-String Processing

Pattern matching Algorithm

• Given strings T (text) and P (pattern), the pattern matching problem consists of finding a substring of T equal to P.

• T: “the rain in spain stays mainly on the plain”
• P: “n th”

• We assume that the length of the pattern does not exceed the length of the text.

• Applications:
– Text editors
– Web search engines (e.g. Google)

Page 21: Lec 6-String Processing

The Brute Force Algorithm

• Check each position in the text T to see if the pattern P starts in that position

[Figure: T = "andrew", P = "rew". P moves 1 char at a time through T until it lines up with the matching substring.]

Page 22: Lec 6-String Processing

The Brute Force Algorithm

• The first pattern matching algorithm is the one in which we compare a given pattern P with each of the substrings of T, moving from left to right, until we get a match.

• Let Wk denote the substring of T having the same length as P and beginning with the Kth character of T.

Wk = Substring(T, K, Length(P))

• First we compare P, character by character, with the first substring W1.

• If all the characters are the same, then P = W1, and so P appears in T and Index(T,P) = 1.

• If some character of P is not the same as the corresponding character of W1, then P is not equal to W1 and we move on to the next substring W2.

• The process stops when we find a match of P with some substring Wk, in which case P appears in T and Index(T,P) = K, or

• when we exhaust all the Wk with no match, which means P does not appear in T.

• The maximum value of the index K is Length(T) - Length(P) + 1.

Page 23: Lec 6-String Processing

The Brute Force Algorithm

• P and T are strings with lengths R and S, respectively, and are stored as arrays with one character per element. The algorithm finds the INDEX of P in T.

1. [Initialize] Set K = 1 and MAX = S-R+1
2. Repeat Steps 3 to 5 while K <= MAX
3. Repeat for L = 1 to R [Test each character of P]
If P[L] != T[K+L-1], then: Go to Step 5.
[End of inner loop]
4. [Success] Set INDEX = K, and Exit
5. Set K = K+1
[End of Step 2 outer loop]
6. [Failure] Set INDEX = 0
7. Exit.
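The steps above translate almost line for line into C++. In this sketch the function name `bruteForceIndex` is an assumption; it keeps the algorithm's 1-based K and L and returns 0 on failure, so the 0-based `std::string` subscripts are shifted by one.

```cpp
#include <cstddef>
#include <string>

// Returns the 1-based INDEX of p in t, or 0 if p does not appear in t.
std::size_t bruteForceIndex(const std::string& t, const std::string& p) {
    const std::size_t s = t.size();
    const std::size_t r = p.size();
    if (r > s) return 0;                       // pattern longer than text
    const std::size_t maxK = s - r + 1;        // Step 1: MAX = S-R+1
    for (std::size_t k = 1; k <= maxK; ++k) {  // Step 2: outer loop over K
        std::size_t l = 1;
        // Step 3: compare P[L] with T[K+L-1] (shifted to 0-based subscripts)
        while (l <= r && p[l - 1] == t[k + l - 2]) {
            ++l;
        }
        if (l > r) return k;                   // Step 4: all R characters matched
    }                                          // Step 5: K = K+1
    return 0;                                  // Step 6: failure
}
```

For T = "andrew" and P = "rew" the first three alignments fail on their first character and the fourth succeeds, so the function returns 4.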

Page 24: Lec 6-String Processing

Analysis

• Brute force pattern matching runs in time O(mn) in the worst case, where n is the length of the text T and m is the length of the pattern P.

• But most searches of ordinary text take O(m+n), which is very quick.

• Example of a worst case:
– T: "aaaaaaaaaaaaaaaaaaaaaaaaaah"
– P: "aaah"

• Example of a more average case:
– T: "a string searching example is standard"
– P: "store"
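The difference between the two cases can be made concrete by counting character comparisons. The sketch below is an instrumented brute-force search written for this purpose; the name `countComparisons` is an assumption, and it uses 0-based subscripts.

```cpp
#include <cstddef>
#include <string>

// Number of character comparisons the brute-force search performs
// before it finds the first match or runs out of alignments.
std::size_t countComparisons(const std::string& t, const std::string& p) {
    std::size_t count = 0;
    if (p.size() > t.size()) return count;
    for (std::size_t k = 0; k + p.size() <= t.size(); ++k) {
        std::size_t l = 0;
        while (l < p.size()) {
            ++count;                          // one comparison of p[l] with t[k + l]
            if (p[l] != t[k + l]) break;      // mismatch: shift the pattern by one
            ++l;
        }
        if (l == p.size()) return count;      // full match found
    }
    return count;
}
```

In the worst-case example every alignment matches "aaa" before failing (or succeeding) on the final character, so each alignment costs m comparisons. In the ordinary-text example most alignments fail on the very first character, which keeps the total close to n.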

Page 25: Lec 6-String Processing

The Boyer-Moore Algorithm

• The Boyer-Moore pattern matching algorithm is based on two techniques.

• 1. The looking-glass technique
– find P in T by moving backwards through P, starting at its end

• 2. The character-jump technique
– when a mismatch occurs at T[i] == x
– the character in pattern P[j] is not the same as T[i]

• There are 3 possible cases, tried in order.