CSCE350 Algorithms and Data Structure

31
CSCE350 Algorithms and Data Structure Lecture 16 Jianjun Hu Department of Computer Science and Engineering University of South Carolina 2009.11.

description

CSCE350 Algorithms and Data Structure. Lecture 16 Jianjun Hu Department of Computer Science and Engineering University of South Carolina 2009.11. Chapter 7: Space-Time Tradeoffs. For many problems some extra space really pays off: extra space in tables (breathing room?) hashing - PowerPoint PPT Presentation

Transcript of CSCE350 Algorithms and Data Structure

Page 1: CSCE350 Algorithms and Data Structure

CSCE350 Algorithms and Data StructureLecture 16Jianjun Hu

Department of Computer Science and Engineering

University of South Carolina2009.11.

Page 2: CSCE350 Algorithms and Data Structure

Chapter 7: Space-Time Tradeoffs• For many problems some extra space really pays off: • extra space in tables (breathing room?)– hashing– non comparison-based sorting

• input enhancement– indexing schemes (eg, B-trees)– auxiliary tables (shift tables for pattern matching)

• tables of information that do all the work– dynamic programming

Page 3: CSCE350 Algorithms and Data Structure

String Matching• pattern: a string of m characters to search for• text: a (long) string of n characters to search in

• Brute force algorithm:1. Align pattern at beginning of text2. moving from left to right, compare each character of pattern to

the corresponding character in text until• all characters are found to match (successful search); or• a mismatch is detected

3. while pattern is not found and the text is not yet exhausted, realign pattern one position to the right and repeat step 2.

What is the complexity of the brute-force string matching?

Page 4: CSCE350 Algorithms and Data Structure

String Searching - History• 1970: Cook shows (using finite-state machines) that problem can be

solved in time proportional to n+m• 1976 Knuth and Pratt find algorithm based on Cook’s idea; Morris

independently discovers same algorithm in attempt to avoid “backing up” over text

• At about the same time Boyer and Moore find an algorithm that examines only a fraction of the text in most cases (by comparing characters in pattern and text from right to left, instead of left to right)

• 1980 Another algorithm proposed by Rabin and Karp virtually always runs in time proportional to n+m and has the advantage of extending easily to two-dimensional pattern matching and being almost as simple as the brute-force method.

Page 5: CSCE350 Algorithms and Data Structure

Horspool’s Algorithm

• A simplified version of Boyer-Moore algorithm that retains key insights:

– compare pattern characters to text from right to left

– given a pattern, create a shift table that determines how much to shift the pattern when a mismatch occurs (input enhancement)

Page 6: CSCE350 Algorithms and Data Structure

Consider the Problem• Search pattern BARBER in some text

• Compare the pattern in the current text position from the right to the left

• If the whole match is found, done.• Otherwise, decide the shift distance of the pattern

(move it to the right)• There are four cases!

s0 … c … sn-1

B A R B E R

Page 7: CSCE350 Algorithms and Data Structure

Shift Distance -- Case 1:

• There is no ‘c’ in the pattern. Shift by the m – the length of the pattern

s0 … S … sn-1

B A R B E R

B A R B E R

s0 … c … sn-1

B A R B E R

Page 8: CSCE350 Algorithms and Data Structure

Shift Distance -- Case 2:

• There are occurrence of ‘c’ in the pattern, but it is not the last one. Shift should align the rightmost occurrence of the ‘c’ in the pattern

s0 … B … sn-1

B A R B E R

B A R B E R

s0 … c … sn-1

B A R B E R

Page 9: CSCE350 Algorithms and Data Structure

Shift Distance -- Case 3:

• ‘c’ matches the last character in the pattern, but no ‘c’ among the other m-1 characters. Follow Case 1 and shift by m

s0 … M E R … sn-1

L E A D E R

L E A D E R

s0 … c … sn-1

L E A D E R

Page 10: CSCE350 Algorithms and Data Structure

Shift Distance -- Case 4:

• ‘c’ matches the last character in the pattern, and there are other ‘c’s among the other m-1 characters. Follow Case 2.

s0 … O R … sn-1

O R D E R

O R D E R

s0 … c … sn-1

O R D E R

Page 11: CSCE350 Algorithms and Data Structure

We can precompute the shift distance for every possible character ‘c’ (given a pattern)

• Shift Table for the pattern “BARBER”

otherwise character, last its to pattern the of characters 1- first the among rightmost the from distance the

pattern the of characters 1- first the among not is if, length spattern' the

mc

mcm

ct )(

c A B C D E F … R … Z _

t(c) 4 2 6 6 1 6 6 3 6 6 6

Page 12: CSCE350 Algorithms and Data Structure

Example

• See Section 7.2 for the pseudocode of the shift-table construction algorithm and Horspool’s algorithm

• Example: find the pattern BARBER from the following textJ I M _ S A W _ M E _ I N _ A _ B A R B E R S H O

B A R B E R

B A R B E R

B A R B E R

B A R B E R

B A R B E R

B A R B E R

Page 13: CSCE350 Algorithms and Data Structure

Algorithm Efficiency

• The worst-case complexity is Θ(nm)• In average, it is Θ(n)• It is usually much faster than the brute-force

algorithm

• A simple exercise: Create the shift table of 26 letters and space for the pattern BAOBAB

Page 14: CSCE350 Algorithms and Data Structure

Boyer-Moore algorithm

• Based on same two ideas:– compare pattern characters to text from right to left

– given a pattern, create a shift table that determines how much to shift the pattern when a mismatch occurs (input enhancement)

• Uses additional shift table with same idea applied to the number of matched characters

• The worst-case efficiency of Boyer-Moore algorithm is linear. See Section 7.2 of the textbook for detail

Page 15: CSCE350 Algorithms and Data Structure

Boyer-Moore algorithm

s0 … S E R … sn-1

B A R B E R

B A R B E R

Bad-symbol shift

Page 16: CSCE350 Algorithms and Data Structure

Space and Time Tradeoffs: Hashing

• A very efficient method for implementing a dictionary, i.e., a set with the operations:

• insert • find • delete

• Applications:• databases• symbol tables

Addressing by index numberAddressing by content

Page 17: CSCE350 Algorithms and Data Structure

Example Application: How to store student records into a data structure ?• Store student record• xxx-xx-6453 Jeffrey …..Los angels….• xxx-xx-2038• xxx-xx-0913• xxx-xx-4382• xxx-xx-9084• xxx-xx-2498

0 1 2 3 4 5 6 7 8 9

6453

09134382 9084

2038

2498

Page 18: CSCE350 Algorithms and Data Structure

Hash tables and hash functions• Hash table: an array with indices that correspond to

buckets• Hash function: determines the bucket for each record

• Example: student records, key=SSN. Hash function:h(k) = k mod m(k is a key and m is the number of buckets)

– if m = 1000, where is record with SSN= 315-17-4251 stored?

• Hash function must:– be easy to compute– distribute keys evenly throughout the table

Page 19: CSCE350 Algorithms and Data Structure

Collisions

• If h(k1) = h(k2) then there is a collision.• Good hash functions result in fewer collisions.• Collisions can never be completely eliminated.• Two types handle collisions differently:

– Open hashing - bucket points to linked list of all keys hashing to it.– Closed hashing –

• one key per bucket • in case of collision, find another bucket for one of the keys (need

Collision resolution strategy)– linear probing: use next bucket – double hashing: use second hash function to compute increment

Page 20: CSCE350 Algorithms and Data Structure

Example of Open Hashing

• Store student record into 10 bucket using hashing function

• h(SSN)=SSN mod 10• xxx-xx-6453• xxx-xx-2038• xxx-xx-0913• xxx-xx-4382• xxx-xx-9084• xxx-xx-2498

0 1 2 3 4 5 6 7 8 9

6453

09134382 9084

2038

2498

Page 21: CSCE350 Algorithms and Data Structure

Open hashing

• If hash function distributes keys uniformly, average length of linked list will be α =n/m (load factor)

• Average number of probes = 1+α/2

• Worst-case is still linear!

• Open hashing still works if n>m.

Page 22: CSCE350 Algorithms and Data Structure

Example: Closed Hashing (Linear Probing)

• xxx-xx-6453• xxx-xx-2038• xxx-xx-0913• xxx-xx-4382• xxx-xx-9084• xxx-xx-2498

4382 6453 0913 9084 2038 24980 1 2 3 4 5 6 7 8 9

Page 23: CSCE350 Algorithms and Data Structure

Hash function for strings

• h(SSN)=SSN mod 10 SSN must be a integer• What is the key is a string?• “ABADGUYISMOREWELCOME”………..Bill….

Page 24: CSCE350 Algorithms and Data Structure

Hash function for strings

• h(SSN)=SSN mod 10 SSN must be a integer• What is the key is a string?• static unsigned long sdbm(str) unsigned char

*str; { – unsigned long hash = 0; int c; – while (c = *str++) • hash = c + (hash << 6) + (hash << 16) - hash;

– return hash; – }

Page 25: CSCE350 Algorithms and Data Structure

Closed Hashing (Linear Probing)

• Avoids pointers.• Does not work if n>m. • Deletions are not straightforward.• Number of probes to insert/find/delete a key

depends on load factor α = n/m (hash table density)

• successful search: (½) (1+ 1/(1- α))• unsuccessful search: (½) (1+ 1/(1- α)²)

• As the table gets filled (α approaches 1), number of probes increases dramatically:

Page 26: CSCE350 Algorithms and Data Structure

B-Trees

• Organize data for fast queries• Index for fast search• For datasets of structured records, B-tree

indexing is used

Page 27: CSCE350 Algorithms and Data Structure

B-Trees 27

Motivation (cont.)• Assume that we use an AVL tree to store about 20 million

records• We end up with a very deep binary tree with lots of different

disk accesses; log2 20,000,000 is about 24, so this takes about 0.2 seconds

• We know we can’t improve on the log n lower bound on search for a binary tree

• But, the solution is to use more branches and thus reduce the height of the tree!– As branching increases, depth decreases

Page 28: CSCE350 Algorithms and Data Structure

B-Trees 28

Constructing a B-tree

17

3 8 28 48

1 2 6 7 12 14 16 52 53 55 6825 26 29 45

Page 29: CSCE350 Algorithms and Data Structure

B-Trees 29

Definition of a B-tree

• A B-tree of order m is an m-way tree (i.e., a tree where each node may have up to m children) in which:1.the number of keys in each non-leaf node is one less than the

number of its children and these keys partition the keys in the children in the fashion of a search tree

2.all leaves are on the same level3.all non-leaf nodes except the root have at least m / 2 children4.the root is either a leaf node, or it has from two to m children5.a leaf node contains no more than m – 1 keys

• The number m should always be odd

Page 30: CSCE350 Algorithms and Data Structure

B-Trees 30

Comparing Trees• Binary trees

– Can become unbalanced and lose their good time complexity (big O)– AVL trees are strict binary trees that overcome the balance problem– Heaps remain balanced but only prioritise (not order) the keys

• Multi-way trees– B-Trees can be m-way, they can have any (odd) number of children– One B-Tree, the 2-3 (or 3-way) B-Tree, approximates a permanently

balanced binary tree, exchanging the AVL tree’s balancing operations for insertion and (more complex) deletion operations

Page 31: CSCE350 Algorithms and Data Structure

Announcement

• Midterm Exam 2 Statistics:• Points Possible: 100 • Class Average: 85 (expected) • High Score: 100