MySQL Performance Optimization: Indexing Algorithms and Data Structures

MySQL Performance Optimization, Part II: Indexing Data Structures and Algorithms

Abhijit Mondal

Software Engineer at HolidayIQ

Contents

Hash Indexes

B-Trees and B+ Trees Indexes

Indexing Strategies for High Performance

Full Text Searching

Hash Indexes

A hash index is built on a hash table and is useful only for exact lookups that use every column in the index. For each row, the storage engine computes a hash code of the indexed columns, which is a small value that will probably differ from the hash codes computed for other rows with different key values. It stores the hash codes in the index and stores a pointer to each row in a hash table.

CREATE TABLE user_info (user_id int not null primary key auto_increment, username varchar(50), password char(32), KEY USING HASH(username, password)) ENGINE=MEMORY;

Suppose the hash function is f(), i.e. f : (username, password) -> Integer. Then each row has a hash value, e.g. f('john','abc123') = 2789. The index's data structure will have a pointer from slot 2789 to the row which has username 'john' and password 'abc123'.

If the function f() is very selective, i.e. for each combination of username and password it gives a different integer as output, then lookups take O(1) constant time (very fast). For queries such as SELECT * from user_info where username='john' and password='abc123', it will not scan the table but will compute f('john','abc123')=2789 and directly pick up the row from slot 2789.
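
The lookup path just described can be sketched in a few lines of Python. This is only an illustration: the slot count NUM_SLOTS, the use of zlib.crc32 as a stand-in for f(), and the in-memory rows list are assumptions made for the example, not details of how the MEMORY engine is implemented.

```python
import zlib

NUM_SLOTS = 4096  # assumed slot count, chosen only for this example

# Toy "table": the list position stands in for a row pointer.
rows = [
    {"user_id": 1, "username": "mary", "password": "25qwer"},
    {"user_id": 2, "username": "john", "password": "abc123"},
]

def f(username, password):
    """Stand-in hash function over both indexed columns."""
    key = (username + "\x00" + password).encode("utf-8")
    return zlib.crc32(key) % NUM_SLOTS

# Build the index: slot number -> row pointer.
# Collisions are ignored here; a later row simply overwrites the slot.
index = {f(r["username"], r["password"]): pos for pos, r in enumerate(rows)}

def lookup(username, password):
    """Exact-match lookup: hash both values, then follow the pointer."""
    pos = index.get(f(username, password))
    if pos is None:
        return None
    row = rows[pos]
    # Confirm the stored values, since different keys can share a slot.
    if row["username"] == username and row["password"] == password:
        return row
    return None
```

Calling lookup('john', 'abc123') hashes the pair once and follows a single pointer, without visiting any other row. Note that f() cannot be evaluated at all unless both column values are supplied.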

ORDER BY queries on the MEMORY engine will not take advantage of hash indexes, as rows are not stored in sorted order.

Queries such as SELECT * from user_info where username='john'; will not use the hash index, because computing f() requires both username and password.

Range queries don't use hash indexes, because computing f() requires exact values for the parameters.

If the function f() is not selective, i.e. it returns the same integer output for more than one username-password pair, e.g. f('john','abc123')=2789 and f('mary','25qwer')=2789 and so on for 5 other pairs, then slot 2789 points to a linked list of row pointers, where each row pointer in the linked list refers to a username-password pair that gives the same output when f() is applied to it. This technique is termed chaining.

In case of hash collisions, the worst-case performance for a query like SELECT * from user_info where username='mary' and password='25qwer'; can be equivalent to a full table scan, if all username-password pairs in the table have the same hash value.
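
To make the collision cost concrete, here is a minimal Python sketch of a chained index. The two hash functions (spread_hash built on zlib.crc32, and a deliberately bad constant_hash that sends everything to slot 2789) and the generated sample rows are hypothetical, chosen only to contrast the two cases.

```python
import zlib
from collections import defaultdict

def build_index(rows, hash_fn):
    """Map each slot to a chain (list) of row positions."""
    index = defaultdict(list)
    for pos, (username, password) in enumerate(rows):
        index[hash_fn(username, password)].append(pos)
    return index

def lookup(rows, index, hash_fn, username, password):
    """Return (row position or None, number of chain entries compared)."""
    compared = 0
    for pos in index.get(hash_fn(username, password), []):
        compared += 1
        if rows[pos] == (username, password):
            return pos, compared
    return None, compared

def spread_hash(username, password):
    """Reasonably selective stand-in for f()."""
    return zlib.crc32((username + "\x00" + password).encode("utf-8"))

def constant_hash(username, password):
    """Deliberately bad f(): every pair lands in slot 2789."""
    return 2789

# 1000 generated rows, with the row we search for stored last.
rows = [("user%d" % i, "pw%d" % i) for i in range(1000)]
rows.append(("mary", "25qwer"))

good = build_index(rows, spread_hash)
bad = build_index(rows, constant_hash)
```

With constant_hash, lookup(rows, bad, constant_hash, 'mary', '25qwer') walks all 1001 chain entries before it finds the row, which is the full-scan behaviour described above. With spread_hash the chain is expected to hold only one or a few entries.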

Analysis of hashing with chaining:

1. How long does it take to return the output of the query SELECT * from user_info where username='johnny' and password='derp123'?
2. Assuming simple uniform hashing, if there are m slots in the index and a total of n rows, then the expected number of rows each slot points to is a = n/m (the average length of the linked list for each slot).
3. For such a query, the average number of lookups is Θ(1+a).

Proof: Suppose the username-password combination we are searching for does not exist. MySQL computes f('johnny','derp123') = x and then searches the linked list of pointers in slot x. Since the pair is not there, it has to search to the end of the linked list, whose average length is a. Counting the hash computation, the cost is 1 + a = Θ(1+a).

If the particular username-password combination is present, the number of lookups is 1 + #(row pointers before ('johnny','derp123') in the linked list). For large values of n (the number of rows in the table), we can assume that the expected number of row pointers before ('johnny','derp123') in its linked list is a/2. Thus the average number of lookups is 1 + a/2 = Θ(1+a).
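
The two averages can be checked numerically. The Python sketch below assigns rows to slots uniformly at random, which models the simple-uniform-hashing assumption rather than any real hash function. The function name average_lookups, the seed, and the sizes n = 10000 and m = 1000 are arbitrary choices for the illustration.

```python
import random

def average_lookups(n, m, seed=0):
    """Estimate average lookups per search under simple uniform hashing.

    Each of the n rows is assigned to one of m slots uniformly at random.
    Returns (successful, unsuccessful) averages, each counting 1 for
    computing the hash plus the chain entries that have to be passed.
    """
    rng = random.Random(seed)
    chain_len = [0] * m
    entries_before = 0  # total entries sitting ahead of each row in its chain
    for _ in range(n):
        slot = rng.randrange(m)
        entries_before += chain_len[slot]  # new row goes at the end of its chain
        chain_len[slot] += 1
    successful = 1 + entries_before / n
    # A missing key hashes to a uniformly random slot and walks its whole chain.
    unsuccessful = 1 + sum(chain_len) / m
    return successful, unsuccessful

n, m = 10000, 1000
a = n / m
successful, unsuccessful = average_lookups(n, m)
```

Here a = 10, so the unsuccessful average is exactly 1 + a = 11, and the successful average should come out close to 1 + a/2 = 6. Both grow linearly with a, which is what Θ(1+a) expresses.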

Hash Indexes for InnoDB engine : The InnoDB storage engine has a special feature called adaptive hash indexes. When InnoDB notices that some index values are being accessed very frequently, it builds a hash index for them in memory on top of B-Tree indexes.

A 'good' hash function f(): each row is equally likely to hash to any of the m slots, independently of where any other row has hashed to, i.e. f('john','abc123') should be independent of f('johnny','derp123').

InnoDB does not support explicit hash indexes, so there is no built-in hash function that we can take advantage of for indexing. Instead, we can maintain one column in the table for our hash values and put a regular index on it: ALTER TABLE user_info ADD COLUMN hash CHAR(32), ADD KEY (hash);

Collision analysis using the 16-byte (32 hexadecimal digits) MD5() hash function:

1. MD5() hash lookups are time consuming: the algorithm takes time to compute the value, and since the value is a 32-digit hexadecimal string, comparison also takes time.
2. SELECT * from user_info where username='johnny' and password='derp123' and hash='690cdca9655043e9d087a1d50cd74e02'; We still need the checks on the username and password fields so that a single row is returned in case of collisions.
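
For reference, the value stored in such a hash column could be produced as in the Python sketch below. The slides do not say how username and password are combined before hashing, so joining them with a NUL byte is an assumption, and md5_hash is a hypothetical helper name.

```python
import hashlib

def md5_hash(username, password):
    """Return the 32-character hexadecimal MD5 digest of the two values.

    Joining with a NUL byte is an assumption; it keeps ('ab', 'c') and
    ('a', 'bc') from producing the same input string.
    """
    key = (username + "\x00" + password).encode("utf-8")
    return hashlib.md5(key).hexdigest()

digest = md5_hash("johnny", "derp123")
```

The digest is always 32 hexadecimal characters (16 bytes of hash), which is why CHAR(32) fits the column and why each comparison has to work through a comparatively long string.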

Method 2: Using CRC32(), another built-in hash function, is a better choice than MD5(), since it results in an integer value of at most 10 digits, which speeds up comparisons.

SELECT * from user_info where username='johnny' and password='derp123' and hash=3682452828;
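
A CRC32 value is an unsigned 32-bit integer, so it fits in 4 bytes and never exceeds 10 decimal digits. The Python sketch below illustrates this with zlib.crc32. How the two columns are combined before hashing is not stated in the slides, so the NUL-byte join and the helper name crc32_hash are assumptions, and the result is not claimed to reproduce the value in the query above.

```python
import zlib

def crc32_hash(username, password):
    """Return a CRC-32 checksum of the two values as an unsigned integer."""
    key = (username + "\x00" + password).encode("utf-8")
    return zlib.crc32(key)  # in Python 3 this is in the range 0 .. 2**32 - 1

value = crc32_hash("johnny", "derp123")
packed = value.to_bytes(4, "big")  # the whole hash fits in 4 bytes
```

Compared with a 32-character MD5 string, a 4-byte integer is cheaper to store and compare, at the price of many more collisions, which is why the query still filters on username and password.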

Method 3: Using column prefixes as a hash index. We can use fixed-length prefixes from our username and password values. For example, for username 'johnny' and password 'derp123' we can choose our hash to be the (4+3)-character string 'johnder'.

1. SELECT * from user_info where username='johnny' and password='derp123' and hash='johnder';
2. Less comparison overhead compared to indexing the whole username and password values.
3. Less selectivity. Define selectivity s1 = (# of distinct username-password pairs)/(# of rows in user_info) and s2 = (# of distinct hash values)/(# of rows in user_info). Choose a length L for our hash values for which s2 ≈ s1; then the number of collisions will be minimized.
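
Choosing the prefix length can be explored with a small Python sketch that computes s1 and s2 for a few candidate lengths. The sample pairs and the candidate (username, password) prefix lengths are made up for the illustration; on a real table the same ratios would be computed with COUNT(DISTINCT ...) queries.

```python
def selectivity(values):
    """Fraction of distinct values: 1.0 means every row is unique."""
    return len(set(values)) / len(values)

def prefix_hash(username, password, user_len, pass_len):
    """Concatenate fixed-length prefixes of the two columns."""
    return username[:user_len] + password[:pass_len]

# Hypothetical sample rows.
pairs = [
    ("johnny", "derp123"),
    ("johnson", "derby99"),
    ("john", "derp123"),
    ("mary", "25qwer"),
    ("maria", "25qwer"),
    ("marty", "25asdf"),
]

s1 = selectivity(pairs)  # selectivity of the full username-password pairs

# s2 for several (username prefix, password prefix) lengths.
s2_by_length = {
    (u, p): selectivity([prefix_hash(un, pw, u, p) for un, pw in pairs])
    for u, p in [(2, 2), (4, 3), (5, 4)]
}
```

On this sample, s1 is 1.0, the (2+2) prefix yields only 2 distinct hashes out of 6 rows, the (4+3) prefix yields 4 out of 6 (the three 'john...' rows all map to 'johnder'), and the (5+4) prefix reaches 6 out of 6, matching s1. The shortest length at which s2 comes close to s1 is the one to pick.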

Method 4: Using a universal class of hash functions. Convert our username and password strings to integers by summing up their ASCII character values, assuming the following:

1. The ASCII character values for usernames and passwords lie between 0 and 255.
2. The maximum length of a username is 10 and of a password is 10. Thus the maximum integer value for a username is 255*10 and for a password is 255*10; adding them gives the maximum integer value for our key = 5100.
3. Assuming there are 1000 distinct username-password pairs in our database, choose a prime p > 5100, p=5101, choose 2 integers 1