Search Algorithm Lecture-14

Search AlgorithmLecture-14

1

Chapter 15-2

Algorithms for SELECT and JOIN Operations

Implementing the SELECT Operation :Search Methods for Simple Selection:• S1. Linear search (brute force): Retrieve every record in the

file, and test whether its attribute values satisfy the selection condition.

• S2. Binary search: If the selection condition involves an equality comparison on a key attribute on which the file is ordered, binary search (which is more efficient than linear search) can be used.

• S3. Using a primary index or hash key to retrieve a single record: If the selection condition involves an equality comparison on a key attribute with a primary index (or a hash key), use the primary index (or the hash key) to retrieve the record.

Chapter 15-3


Implementing the SELECT Operation (cont.):Search Methods for Simple Selection:• S4. Using a primary index to retrieve multiple records: If the comparison

condition is >, ≥, <, or ≤ on a key field with a primary index, use the index to find the record satisfying the corresponding equality condition, then retrieve all subsequent records in the (ordered) file.

• S5. Using a clustering index to retrieve multiple records: If the selection condition involves an equality comparison on a non-key attribute with a clustering index, use the clustering index to retrieve all the records satisfying the selection condition.

• S6. Using a secondary (B+-tree) index: On an equality comparison, this search method can be used to retrieve a single record if the indexing field has unique values (is a key) or to retrieve multiple records if the indexing field is not a key. In addition, it can be used to retrieve records on conditions involving >,>=, <, or <=. (FOR RANGE QUERIES)

Chapter 15-4


Implementing the SELECT Operation (cont.):Search Methods for Complex Selection:• S7. Conjunctive selection: If an attribute involved in any single

simple condition in the conjunctive condition has an access path that permits the use of one of the methods S2 to S6, use that condition to retrieve the records and then check whether each retrieved record satisfies the remaining simple conditions in the conjunctive condition.

• S8. Conjunctive selection using a composite index: If two or more attributes are involved in equality conditions in the conjunctive condition and a composite index (or hash structure) exists on the combined field, we can use the index directly.

Chapter 15-5


Implementing the SELECT Operation (cont.):Search Methods for Complex Selection:• S9. Conjunctive selection by intersection of record

pointers: This method is possible if secondary indexes are available on all (or some of) the fields involved in equality comparison conditions in the conjunctive condition and if the indexes include record pointers (rather than block pointers). Each index can be used to retrieve the record pointers that satisfy the individual condition. The intersection of these sets of record pointers gives the record pointers that satisfy the conjunctive condition, which are then used to retrieve those records directly. If only some of the conditions have secondary indexes, each retrieved record is further tested to determine whether it satisfies the remaining conditions.

Chapter 15-6

Algorithms for SELECT and JOIN Operations (7)

Implementing the SELECT Operation (cont.):• Whenever a single condition specifies the selection, we can only

check whether an access path exists on the attribute involved in that condition. If an access path exists, the method corresponding to that access path is used; otherwise, the “brute force” linear search approach of method S1 is used.

• For conjunctive selection conditions, whenever more than one of the attributes involved in the conditions have an access path, query optimization should be done to choose the access path that retrieves the fewest records in the most efficient way .

Chapter 15-7


Implementing the JOIN Operation:• Join (EQUIJOIN, NATURAL JOIN)– two–way join: a join on two filese.g. R A=B S

– multi-way joins: joins involving more than two files. e.g. R A=B S C=D T

• Examples(OP6): EMPLOYEE DNO=DNUMBER DEPARTMENT

(OP7): DEPARTMENT MGRSSN=SSN EMPLOYEE

Chapter 15-8


Implementing the JOIN Operation (cont.):Methods for implementing joins:• J1. Nested-loop join (brute force): For each record t in R

(outer loop), retrieve every record s from S (inner loop) and test whether the two records satisfy the join condition t[A] = s[B].

• J2. Single-loop join (Using an access structure to retrieve the matching records): If an index (or hash key) exists for one of the two join attributes — say, B of S — retrieve each record t in R, one at a time, and then use the access structure to retrieve directly all matching records s from S that satisfy s[B] = t[A].

Chapter 15-9


Implementing the JOIN Operation (cont.):Methods for implementing joins:• J3. Sort-merge join: If the records of R and S are

physically sorted (ordered) by value of the join attributes A and B, respectively, we can implement the join in the most efficient way possible. Both files are scanned in order of the join attributes, matching the records that have the same values for A and B. In this method, the records of each file are scanned only once each for matching with the other file—unless both A and B are non-key attributes, in which case the method needs to be modified slightly.

Chapter 15-10


Implementing the JOIN Operation (cont.):Methods for implementing joins:• J4. Hash-join: The records of files R and S are both

hashed to the same hash file, using the same hashing function on the join attributes A of R and B of S as hash keys. A single pass through the file with fewer records (say, R) hashes its records to the hash file buckets. A single pass through the other file (S) then hashes each of its records to the appropriate bucket, where the record is combined with all matching records from R.

BTREE fundamentalsLecture -15

11

B-TREE

• A B-tree is a specialized multiway tree designed especially for use on disk

• In a B-tree each node may contain a large number of keys• Only a small number of nodes must be read from disk to

retrieve an item• The goal is to get fast access to the data

12

• A multiway tree of order m is an ordered tree where each node has at most m children. For each node, if k is the actual number of children in the node, then k - 1 is the number of keys in the node. If the keys and subtrees are arranged in the fashion of a search tree, then this is called a multiway search tree of order m. For example, the following is a multiway search tree of order 4. Note that the first row in each node shows the keys, while the second row shows the pointers to the child nodes.

• Of course, in any useful application there would be a record of data associated with each key, so that the first row in each node might be an array of records where each record contains a key and its associated data. Another approach would be to have the first row of each node contain an array of records where each record contains a key and a record number for the associated data record, which is found in another file. This last method is often used when the data records are large.

13

B-TREE

Hashing FunctionsLecture -16

14

Hash tables• hash table: an array of some fixed size,

that positions elements according to an algorithm called a hash function

15elements (e.g., strings)

0

…

length –1

hash func.h(element)

hash table

Hashing, hash functions• The idea: somehow we map every element into some index in the

array ("hash" it);this is its one and only place that it should go– Lookup becomes constant-time: simply look at that one slot again

later to see if the element is there– add, remove, contains all become O(1) !

• For now, let's look at integers (int)– a "hash function" h for int is trivial:

store int i at index i (a direct mapping)• if i >= array.length, store i at index(i % array.length)

– h(i) = i % array.length

16

Hash function in action

• Add these elements to thehash table:– 89– 18– 49– 58– 9

17

Hash function example• elements = Integers• h(i) = i % 10• add 41, 34, 7, and 18• constant-time lookup:– just look at i % 10 again later

• We lose all ordering information:– getMin, getMax, removeMin, removeMax– the various ordered traversals– printing items in sorted order 18

0

1 41

2

3

4 34

5

6

7 7

8 18

9

Hash collisions• collision: the event that two hash table

elements map into the same slot in the array

• example: add 41, 34, 7, 18, then 21– 21 hashes into the same slot as 41!– 21 should not replace 41 in the hash table;

they should both be there

collision resolution: means for fixing collisions in a hash table 19

0

1 21

2

3

4 34

5

6

7 7

8 18

9

Linear probing• linear probing: resolving collisions in slot i by putting the colliding

element into the next available slot (i+1, i+2, ...)– add 41, 34, 7, 18, then 21, then 57

• 21 collides (41 is already there), so we search ahead until we find empty slot 2• 57 collides (7 is already there), so we search ahead twice until we find empty

slot 9

– lookup algorithm becomes slightly modified; we have to loop now until we find the element or an empty slot• what happens when the table gets mostly full?

20

0

1 41

2 21

3

4 34

5

6

7 7

8 18

9 57

Clustering problem• clustering: nodes being placed close

together by probing, which degrades hash table's performance– add 89, 18, 49, 58, 9

– now searching for the value 28 will have to check half the hash table! no longer constant time...

21

0 49

1 58

2 9

3

4

5

6

7

8 18

9 89

Quadratic probing• quadratic probing: resolving collisions on

slot i by putting the colliding element into slot i+1, i+4, i+9, i+16, ...– add 89, 18, 49, 58, 9• 49 collides (89 is already there), so we search ahead

by +1 to empty slot 0• 58 collides (18 is already there), so we search ahead

by +1 to occupied slot 9, then +4 to empty slot 2• 9 collides (89 is already there), so we search ahead

by +1 to occupied slot 0, then +4 to empty slot 3– clustering is reduced– what is the lookup algorithm?

22

0 49

1

2 58

3 9

4

5

6

7

8 18

9 89

Quadratic probing in action

23

Chaining

• chaining: All keys that map to the same hash value are kept in a linked list

24

0

1

2

3

4

5

6

7

8

9

10

107

22 12 42

Load factor• load factor: ratio of elements to capacity

– load factor = size / capacity

= 6 / 10

= 0.6

25

0

1 41

2 21

3

4 34

5

6

7 7

8 18

9 57

Rehashing, hash table size• rehash: increasing the size of a hash table's array, and

re-storing all of the items into the array using the hash function– can we just copy the old contents to the larger array?

– When should we rehash? Some options:• when load reaches a certain level (e.g., = 0.5)• when an insertion fails

• What is the cost (Big-Oh) of rehashing?• what is a good hash table array size?– how much bigger should a hash table get when it grows?

26

Hash table removal• lazy removal: instead of actually removing elements, replace them

with a special REMOVED value– avoids expensive re-shuffling of elements on remove

– example: remove 18 -->

– lookup algorithm becomes slightly modified• what should we do when we hit a slot containing the REMOVED value?

– keep going

– add algorithm becomes slightly modified• what should we do when we hit a slot containing the REMOVED value?

– use that slot, replace REMOVED with the new value– add(17) --> slot 8

27

0

1 41

2 21

3

4 34

5

6

7 7

8 REMOVED

9 57

Hashing practice problem• Draw a diagram of the state of a hash table of size 10,

initially empty, after adding the following elements:

7, 84, 31, 57, 44, 19, 27, 14, and 64

Assume that the hash table uses linear probing.

Assume that rehashing occurs at the start of an add where the load factor is 0.75.

• Repeat the above problem using quadratic probing.

28

Writing a hash function

• If we write a hash table that can store objects, we need a hash function for the objects, so that we know what index to store them

We want a hash function to:1.be simple/fast to compute2.map equal elements to the same index3.map different elements to different indexes4.have keys distributed evenly among indexes

29

Hash function for strings• elements = Strings• let's view a string by its letters:– String s : s0, s1, s2, …, sn-1

• how do we map a string into an integer index?(how do we "hash" it?)

• one possible hash function:– treat first character as an int, and hash on that

• h(s) = s0 % array.length• is this a good hash function? When will strings collide?

30

Better string hash functions• view a string by its letters:

– String s : s0, s1, s2, …, sn-1

• another possible hash function:– treat each character as an int, sum them, and hash on that

• h(s) = % array.length

• what's wrong with this hash function? When will strings collide?

• a third option:– perform a weighted sum of the letters, and hash on that

– h(s) = % array.length

31

1

0

37k

i

iis

1

0

n

iis

Analysis of hash table search

• load: the load of a hash table is the ratio: no. of elements

array size

• analysis of search, with linear probing:

– unsuccessful:

– successful: 32

M

N

2)1(

11

2

1

1

11

2

1

Analysis of hash table search

• analysis of search, with chaining:– unsuccessful:

(the average length of a list at hash(i))

– successful: 1 + (/2)(one node, plus half the avg. length of a list (not including the item))

33

Compound collections• Collections can be nested to represent

more complex data

• example: A person can have one or many phone numbers– want to be able to quickly find all of a

person's phone numbers, given their name

• implement this example as a HashMap of Lists– keys are Strings (names)– values are Lists (e.g ArrayList) of Strings,

where each String is one phone number

String List<String>name --> list of phone numbers

"Phil" --> ["234-8793", "439-8575", ...]34

35

The Tuning CycleLecture -17

Performance tuning

• Performance tuning is the main activity associated with performance management.

• Reduced to its most basic level, tuning consists of finding and eliminating bottlenecks — a condition that occurs, and is revealed, when a piece of hardware or software in a server approaches the limits of its capacity.

36

The Tuning Cycle

• The tuning cycle, which is an iterative series of controlled performance experiments..

• Repeat the four phases of the tuning cycle shown below until you achieve the performance goals that you established prior to starting the tuning process.

• Tuning Cycle:

37

• Collecting The Collecting phase is the starting point of any tuning exercise. Gathering data with the collection of performance counters that are chosen for a specific part of the system.

• AnalyzingAfter you have collected the performance data that you require for tuning the selected part of the system, you need to analyze the data to determine bottlenecks.

38

Tuning Cycle

Tuning Cycle

• Configuring After you have collected your data and completed the analysis of the results, you can determine which part of the system is the best candidate for a configuration change and implement this change.

• TestingAfter implementing a configuration change, you must complete the appropriate level of testing to determine the impact of the change on the system that you are tuning.

39

• When testing, be sure to: • Check the correctness and performance of the application

that you are using for testing by looking for memory leaks and inordinate delays in response to client requests.

• Ensure that all tests are working correctly.• Make sure you can repeat all tests by using the same

transaction mix and the same clients generating the same load.

• Document changes and results.

40

Tuning Cycle

Database Performance Tuning• DBMS Installation– Setting installation parameters

• Memory Usage – Set cache levels– Choose background processes

• Input / Output (I/O) Contention– Distribution of heavily accessed files

• CPU Usage– Monitor CPU load

• Application tuning– Modification of SQL code in applications

41

Search Algorithm Lecture-14

Documents

Transcript of Search Algorithm Lecture-14