Data Management Systems · •Management, replacement •Relation to overall system Gustavo Alonso...


Page 1

Data Management Systems

• Storage Management

• Memory hierarchy

• Segments and file storage

• Database buffer cache

• Storage techniques in context

• Basic principles

• The Buffer Cache

• Management, replacement

• Relation to overall system

Gustavo Alonso

Institute of Computing Platforms

Department of Computer Science

ETH Zürich Storage - Buffer cache 1

Page 2

Buffer Cache: basic principles

• Data must be in memory to be processed, but what if not all the data fits in main memory?

• Databases cache blocks in memory, writing them back to storage when dirty (modified) or in need of more space

• Similar to OS virtual memory and paging mechanisms, but:

  • The database knows the access patterns

  • The database can optimize the process much more

• The buffer cache is a key component of any database with many implications for the rest of the system


Page 3

Storage Management

[Figure: layers of storage management, from the application down to physical storage. Labels: Application; Queries, Transactions (SQL); Logical view (logical data); Logical data (tables, schemas); Relations, views; Record interface; Record access; Logical records (tuples); Physical records; Access paths; Page access; Pages in memory; Physical data in memory; Page structure; File access; Blocks, files, segments; Storage allocation; Physical storage]

Page 4

Disclaimers

• The buffer manager (also called buffer cache, buffer pool, etc.) is a complex system with significant performance implications:

  • Many tuning parameters

  • Many aspects affect performance and behavior

  • Many options to optimize its use and tailor it to particular data

• We will cover the basic ideas and discuss the performance implications; we will not be able to cover all possible optimizations or system specifics.


Page 5

[Figure: structure of the buffer cache. Labels: latches; hash buckets; linked lists of buffer headers; buffer header; memory cache; blocks in cache]

Page 6

Buffer manager: latches

• Databases distinguish between a lock and a latch:

  • Lock: mechanism to avoid conflicting updates to the data by transactions

  • Latch: mechanism to avoid conflicting updates to system data structures

• The buffer cache latches:

  • Avoid conflicting access to the hash buckets containing the block headers

  • Each latch covers several hash buckets (tunable parameter)

• Why not a latch per bucket or per block header?

  • There would be far too many of them!

  • Very common trade-off in databases: how much space to devote to the engine data structures?
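As a rough illustration of this trade-off, the sketch below assumes a fixed pool of latches in which bucket i is covered by latch (i mod L). The class and method names are hypothetical, and Python's `threading.Lock` stands in for a latch; real engines use their own lightweight primitives.

```python
import threading


class LatchedBucketTable:
    """Hash buckets protected by a small pool of latches (hypothetical sketch)."""

    def __init__(self, num_buckets: int, num_latches: int) -> None:
        # Each bucket is a list standing in for a linked list of buffer headers.
        self.buckets = [[] for _ in range(num_buckets)]
        # Far fewer latches than buckets: each latch covers several buckets.
        self.latches = [threading.Lock() for _ in range(num_latches)]

    def bucket_of(self, block_id) -> int:
        return hash(block_id) % len(self.buckets)

    def latch_of(self, bucket: int) -> threading.Lock:
        # Buckets are striped across latches: bucket i uses latch i mod L.
        return self.latches[bucket % len(self.latches)]

    def insert(self, block_id) -> None:
        b = self.bucket_of(block_id)
        with self.latch_of(b):
            self.buckets[b].append(block_id)

    def contains(self, block_id) -> bool:
        b = self.bucket_of(block_id)
        # The latch is held only while this bucket's chain is inspected.
        with self.latch_of(b):
            return block_id in self.buckets[b]
```

With 64 buckets and 8 latches, every latch covers 8 buckets, so two sessions looking up unrelated blocks still contend whenever their buckets share a latch (for example, buckets 3 and 11). More latches reduce that contention at the cost of more memory for the engine's own structures.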


Page 7

Performance issues of latches in buffer cache

• When looking for a block, a query or a transaction checks the buffer cache to see whether the block is already in memory. This requires acquiring a latch for every block accessed.

• A latch can be held by only one process at a time, and each latch covers several linked lists of block headers!

• Contention on these latches may cause performance problems:

  • Hot blocks

  • SQL statements that access too many blocks

  • Similar SQL statements executed concurrently


Page 8

How to address latch performance issues

• Reduce the amount of data in a block so that there is less contention on it (in Oracle, use PCTFREE, PCTUSED)

• Configure the database engine with more latches and fewer buckets per latch (DBAdmin)

• Use multiple buffer pools (DBAdmin but also at table creation)

• Tune queries to minimize the number of blocks they access (avoid table scans)

• Avoid many concurrent queries that access the same data

• Avoid concurrent transactions and queries against the same data (the problem becomes clear once we see how updates are managed; see later)


Page 9

Buffer manager: Hash buckets

• The correct linked list where a block header resides is found by hashing on some form of block identifier (e.g., file ID and block number)

• After hashing, the linked list is traversed looking for the entry of the corresponding block:

  • Expensive => lists should be kept short by having as many hash buckets as possible (a parameter tunable by the DBAdmin) => trade-off
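A minimal sketch of the lookup, assuming chained hashing on the pair (file ID, block number); all names are hypothetical. `lookup` also reports how many headers it visited, which makes the cost of long chains visible.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Header:
    file_id: int
    block_no: int
    frame: int                         # index of the memory frame holding the block
    next: Optional["Header"] = None    # next header in the same bucket chain


class BufferHash:
    def __init__(self, num_buckets: int) -> None:
        self.heads: list[Optional[Header]] = [None] * num_buckets

    def _bucket(self, file_id: int, block_no: int) -> int:
        # Hash on the block identifier (file ID, block number).
        return hash((file_id, block_no)) % len(self.heads)

    def add(self, file_id: int, block_no: int, frame: int) -> None:
        b = self._bucket(file_id, block_no)
        # Prepend the new header to this bucket's chain.
        self.heads[b] = Header(file_id, block_no, frame, self.heads[b])

    def lookup(self, file_id: int, block_no: int) -> tuple[Optional[int], int]:
        """Return (frame or None if not cached, number of headers visited)."""
        visited = 0
        h = self.heads[self._bucket(file_id, block_no)]
        while h is not None:
            visited += 1
            if h.file_id == file_id and h.block_no == block_no:
                return h.frame, visited
            h = h.next
        return None, visited
```

With a single bucket, the lookup degenerates into a walk over every cached header; spreading the same headers over more buckets shortens each chain but costs memory for the bucket array.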


Page 10

Buffer manager: block headers, linked lists

• The blocks that are in memory are located through a block header stored in the corresponding linked list. The header contains quite a bit of information:

  • Block number

  • Block type (typically refers to the segment the block belongs to, but at this level we no longer see the segment, only the block)

  • Format

  • LSN = Log Sequence Number (also Change Number, Commit Number, etc.): timestamp of the last transaction to modify the block

  • Checksum for integrity

  • Latches/status flags

  • Buffer replacement information (see later)
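The fields listed above can be pictured as a record like the following. This is a hypothetical layout, not any engine's actual header format; CRC32 (from Python's `zlib`) stands in for whatever checksum a real system uses, and the latches are omitted.

```python
import zlib
from dataclasses import dataclass


@dataclass
class BufferHeader:
    block_no: int
    block_type: str        # e.g. "data", "index", "undo" (illustrative values)
    fmt: int               # block format version
    lsn: int               # log sequence number of the last modification
    checksum: int          # integrity check over the block contents
    pinned: int = 0        # status flag: pin count
    dirty: bool = False    # status flag: modified since it was read
    usage_count: int = 0   # buffer replacement information


def make_header(block_no: int, block_type: str, data: bytes, lsn: int) -> BufferHeader:
    # CRC32 is a stand-in for the engine's real checksum.
    return BufferHeader(block_no, block_type, fmt=1, lsn=lsn, checksum=zlib.crc32(data))


def verify(header: BufferHeader, data: bytes) -> bool:
    """Detect corruption by recomputing the checksum over the block contents."""
    return zlib.crc32(data) == header.checksum
```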


Page 11

Status of a block

• The following states are relevant for managing the buffer:

  • Pinned: if a block is pinned, it cannot be evicted

  • Usage count (in some systems): how many queries are using or have used the block, or how often it has been accessed

  • Clean/dirty: the block has not been / has been modified

• This information is used when implementing cache replacement policies
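A sketch of how pinning and the clean/dirty flag might feed a replacement decision. The names and the tie-breaking rule (prefer clean blocks, then the least used one) are assumptions for illustration; the point is that a pinned block is never a candidate.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Frame:
    block_id: int
    pin_count: int = 0
    usage_count: int = 0
    dirty: bool = False


def pin(f: Frame) -> None:
    f.pin_count += 1
    f.usage_count += 1      # record the access for the replacement policy


def unpin(f: Frame, modified: bool = False) -> None:
    if f.pin_count == 0:
        raise ValueError("unpin without a matching pin")
    f.pin_count -= 1
    f.dirty = f.dirty or modified


def evictable(f: Frame) -> bool:
    # A pinned block must never be evicted.
    return f.pin_count == 0


def pick_victim(frames: list) -> Optional[Frame]:
    candidates = [f for f in frames if evictable(f)]
    if not candidates:
        return None
    # Prefer clean blocks (no write-back needed), then the least used one.
    return min(candidates, key=lambda f: (f.dirty, f.usage_count))
```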


Page 12

What is in the linked list

• Depending on how the database engine works, the nature of the blocks in the linked list might be different. Besides normal blocks, one can have, for instance (Oracle):

  • Version blocks: every update to a block results in a copy of the block being inserted in the list with the timestamp of the corresponding transaction

  • Undo blocks/redo blocks (for recovery)

  • Dirty blocks

  • Pinned blocks

  • …

• In the case of Oracle, the version blocks play a big role in transaction management and implementing snapshot isolation


Page 13

Performance implications of version blocks

• It is a form of shadow paging: keep the old block in the linked list and add a new entry for the modified block. The same discussion as for shadow paging applies. However:

  • It allows queries to read data as of the time they started without having to worry about writes => huge advantage for concurrency control (see later)

  • One can find older versions, enabling reading “in the past”

  • Facilitates recovery (as in shadow paging)

• If many concurrent transactions update the same data, the linked list will grow too long, creating a performance problem (see earlier discussion on latches)
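A minimal model of a version chain as described above, under the assumption that each update prepends a new version stamped with the committing transaction's timestamp, and that a query reads the newest version no later than its start time. It illustrates the idea only and is not Oracle's actual mechanism; `chain_length` shows how the list grows with every update to the same block.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class BlockVersion:
    commit_ts: int                          # timestamp of the writing transaction
    data: str
    older: Optional["BlockVersion"] = None


class VersionedBlock:
    def __init__(self, data: str, commit_ts: int = 0) -> None:
        self.newest = BlockVersion(commit_ts, data)

    def update(self, data: str, commit_ts: int) -> None:
        # Keep the old version; the new one becomes the head of the chain.
        self.newest = BlockVersion(commit_ts, data, self.newest)

    def read_as_of(self, snapshot_ts: int) -> Optional[str]:
        # Walk back to the newest version committed at or before the snapshot.
        v = self.newest
        while v is not None and v.commit_ts > snapshot_ts:
            v = v.older
        return None if v is None else v.data

    def chain_length(self) -> int:
        n, v = 0, self.newest
        while v is not None:
            n, v = n + 1, v.older
        return n
```

A query that started at time 15 keeps reading the version written at time 10, even after a later transaction writes a new version at time 20.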


Page 14

Buffer replacement

• Any form of caching requires a cache replacement policy:

  • What to cache

  • What to keep in the cache

  • What to evict from the cache and when

  • How to avoid thrashing the cache with unnecessary traffic

• Similar to the OS but, as usual, the database has much more information on how and when the data will be used.

• Real systems have many parameters and many options to determine how to manage the buffer cache (and even how to avoid it)


Page 15

LRU: Least Recently Used

[Figure: buffer pool with 16 numbered block slots, tables T, R, S and P, and the LRU list running from MRU (top) to LRU (bottom)]

The idea is to keep track of when each page was used by means of a list. When a block is used, it goes to the top (Most Recently Used); to decide which blocks to evict, pick those at the bottom (Least Recently Used).
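A minimal LRU sketch, assuming Python's `collections.OrderedDict` as the list: the end of the dictionary is the MRU end, `move_to_end` moves a block to the top on a hit, and `popitem(last=False)` removes the block at the bottom. The pool has 4 slots instead of the 16 in the figure, and the class name is hypothetical.

```python
from collections import OrderedDict


class LRUBufferPool:
    """Buffer pool with LRU replacement; the OrderedDict's end is the MRU end."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self.frames: OrderedDict = OrderedDict()   # block id -> block contents
        self.evicted: list = []                     # log of evicted block ids

    def access(self, block_id) -> None:
        if block_id in self.frames:
            # Hit: move the block to the top (MRU) of the list.
            self.frames.move_to_end(block_id)
            return
        if len(self.frames) >= self.capacity:
            # Miss with a full pool: evict from the bottom (LRU) of the list.
            victim, _ = self.frames.popitem(last=False)
            self.evicted.append(victim)
        self.frames[block_id] = f"contents of {block_id}"   # stand-in for a disk read

    def lru_to_mru(self) -> list:
        return list(self.frames)
```

For example, after accessing T1, T2, S1, S2 and then T1 again, a miss on R1 evicts T2, which is now the least recently used block.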


Page 16

LRU: Least Recently Used

[Figure: LRU list after SELECT * FROM T, with entries 7, 6, 5, 4, 3]

Page 17

LRU: Least Recently Used

[Figure: LRU list after SELECT * FROM T and SELECT * FROM S; the blocks of S are added at the MRU end]

Page 18

LRU: Least Recently Used

[Figure: LRU list after SELECT * FROM T, SELECT * FROM S and SELECT * FROM R]

At this point, the cache is full and we cannot bring more blocks from R into memory without removing something: we will remove the block at the end of the list.


Page 19

LRU: Least Recently Used

[Figure: LRU list after SELECT * FROM T, SELECT * FROM S and SELECT * FROM R, once the block at the LRU end has been evicted to make room]

Page 20

The trouble with LRU

• LRU is a common strategy in operating systems but does not work well in databases (although it was used in some systems years ago).

• Table scan flooding = a large table loaded to be scanned once will pollute the cache

• Index range scan = a range scan using an index will pollute the cache with random pages

• Note how we can use the knowledge of what queries do to identify the problems. These two types of queries pollute the cache but do not benefit from it, as they do not reuse the data
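The flooding effect can be reproduced with a toy replay, assuming an 8-block pool, a working set of 4 hot blocks, and a one-off scan of 20 blocks (all numbers are illustrative and the function name is hypothetical). With plain LRU, the scan pushes every hot block out of the pool.

```python
from collections import OrderedDict


def simulate_lru(capacity: int, accesses: list) -> tuple[int, int]:
    """Replay block accesses through an LRU pool; return (hits, misses)."""
    pool: OrderedDict = OrderedDict()
    hits = misses = 0
    for block in accesses:
        if block in pool:
            hits += 1
            pool.move_to_end(block)          # hit: move to the MRU end
        else:
            misses += 1
            if len(pool) >= capacity:
                pool.popitem(last=False)     # evict from the LRU end
            pool[block] = True
    return hits, misses


# A small working set of hot blocks that is reused over and over...
hot = [f"hot{i}" for i in range(4)]
# ...and a one-off scan of a table larger than the buffer pool.
scan = [f"big{i}" for i in range(20)]

without_scan = hot * 3
with_scan = hot + scan + hot + hot
```

Replaying `without_scan` in an 8-block pool yields 8 hits and 4 misses, while `with_scan` yields only 4 hits: the first pass over the hot blocks after the scan misses every time, and the scan itself never gets a single hit.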


Page 21

Modified LRU

• A way to avoid polluting the cache when using data that is rarely accessed is to put those blocks at the bottom of the list rather than at the top. That way they are thrown away quickly.

• Another modification is to simply not cache large tables
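A sketch of the first modification, assuming the caller flags blocks read by a scan: such blocks enter at the LRU end, so each new scan block replaces the previous one instead of the hot blocks. The `scan` flag and the class name are hypothetical; how a real engine decides which accesses are scans is out of scope here.

```python
from collections import OrderedDict


class ModifiedLRU:
    """LRU where blocks read by scans enter at the LRU (bottom) end."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self.pool: OrderedDict = OrderedDict()   # first item = LRU end, last = MRU end
        self.hits = self.misses = 0

    def access(self, block, scan: bool = False) -> None:
        if block in self.pool:
            self.hits += 1
            if not scan:
                self.pool.move_to_end(block)             # normal hit -> MRU end
            return
        self.misses += 1
        if len(self.pool) >= self.capacity:
            self.pool.popitem(last=False)                # evict from the LRU end
        self.pool[block] = True                          # inserted at the MRU end...
        if scan:
            self.pool.move_to_end(block, last=False)     # ...unless it comes from a scan


m = ModifiedLRU(8)
hot = [f"hot{i}" for i in range(4)]
for b in hot:
    m.access(b)
for i in range(20):
    m.access(f"big{i}", scan=True)   # one-off scan of a large table
for b in hot + hot:
    m.access(b)
```

In this replay (8-block pool, 4 hot blocks, a 20-block scan, then two more passes over the hot blocks), the hot blocks survive the scan and both later passes hit, for 8 hits and 24 misses.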

[Figure: modified LRU: LRU list after SELECT * FROM T, SELECT * FROM S and SELECT * FROM R]

Page 22

Reading assignment

Read the paper:

An Evaluation of Buffer Management Strategies for Relational Database Systems, Hong-Tai Chou, David J. DeWitt, VLDB 1985

Keep in mind that it was written for very different system sizes (e.g., a query may have its pages evicted before it finishes), but many of its ideas are still valid and it provides an excellent overview of database engine design


Page 23

Database optimizations

• While plain LRU is rarely used, it serves to illustrate many of the problems a database buffer cache has and how to solve them:

• Keep Buffer Pool (Oracle): tell the database which blocks are important and should not be evicted from memory (will go to a separate buffer)

• Recycle Buffer Pool (Oracle): tell the database which blocks should not be kept after they are used (will go to a separate buffer)

• Keep usage statistics for tables and let the system decide automatically what should be cached and what should not


Page 24

Interactions with other optimizations

• Cache pollution is an important aspect because it interacts with other optimizations implemented by databases, e.g., pre-fetching or read-ahead:

• In read-ahead (SQL Server), the database uses the plan of a query to find out which blocks are needed. Instead of bringing the blocks in one by one, it reads them in chunks of up to 64 contiguous blocks, even before the query requests them

• Sequential read-ahead: for tables that are not ordered, sort the blocks by location and fetch them sequentially. Indexes are read sequentially by key.

• Random pre-fetching (for non-clustered indexes): fetch the needed blocks at the same time as the index pages are processed

• Read-ahead is not free: it might fetch data that is not needed (the data is fetched in the hope that it will be used).
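The chunking described above can be sketched as follows, assuming the plan yields a list of needed block numbers: the blocks are sorted by location and grouped into runs of contiguous blocks, each read capped at 64 blocks as on the slide. The function name is hypothetical, and this is not SQL Server's implementation.

```python
def read_ahead_requests(block_numbers, max_chunk: int = 64) -> list[tuple[int, int]]:
    """Group needed blocks into (start, count) reads of contiguous blocks.

    Blocks are sorted by location first, so the reads sweep the file
    sequentially; each read covers at most max_chunk contiguous blocks.
    """
    requests: list[tuple[int, int]] = []
    for b in sorted(set(block_numbers)):
        if requests:
            start, count = requests[-1]
            # Extend the current run if b directly follows it and it is not full.
            if b == start + count and count < max_chunk:
                requests[-1] = (start, count + 1)
                continue
        requests.append((b, 1))
    return requests
```

For instance, the blocks 5, 3, 4, 10, 11 and 200 become three reads: 3 blocks starting at 3, 2 blocks starting at 10, and 1 block at 200. The cost noted above is visible too: whatever the plan lists is fetched, whether or not the query ends up using it.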


Page 25

Further optimizations

• Pages can be clean (have not been modified) or dirty (have been modified). If there is a choice, evicting a clean page is faster than evicting a dirty page as the dirty page needs to be written to storage

• Ring buffers (Postgres): for scans, allocate the pages in a ring so that blocks are allocated only within the ring. When the ring is full, evict the pages from the beginning of the ring, as those have already been scanned

• Block sizes are not necessarily homogeneous; when several block sizes are used, a buffer cache is needed for each block size.
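A minimal sketch of a scan ring, assuming a ring of 4 slots (the size is illustrative): the scan cycles through its own slots, overwriting the block it read longest ago, so it never displaces pages outside the ring. The class name is hypothetical; this is not PostgreSQL's code.

```python
class ScanRing:
    """A small ring of buffer slots reserved for one sequential scan."""

    def __init__(self, size: int) -> None:
        self.slots: list = [None] * size   # block id held by each slot of the ring
        self.next = 0                      # position of the next slot to reuse
        self.evicted: list = []

    def read(self, block_id) -> int:
        """Place block_id in the ring and return the slot used."""
        slot = self.next
        old = self.slots[slot]
        if old is not None:
            # The oldest block in the ring has already been scanned: reuse its slot.
            self.evicted.append(old)
        self.slots[slot] = block_id
        self.next = (slot + 1) % len(self.slots)
        return slot
```

However long the table, such a scan only ever uses 4 slots: after reading 10 blocks, the ring holds the last 4 and the first 6 have been dropped.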


Page 26

Touch Count (Hot/Cold list)

• Algorithm used in Oracle

• A more sophisticated LRU

• Insert new blocks in the middle of the list (instead of at the top)

• Keep a count of accesses (increased when the page is touched). Frequently accessed pages float to the top (hot); rarely accessed blocks sink to the bottom (cold)

• To avoid counting problems (a page is accessed many times but only for a short period of time), the counter is incremented only if a (tunable) number of seconds has passed since the last increment

• Periodically, the counters are decreased
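A simplified sketch of the touch-count idea, under several assumptions: new blocks are inserted in the middle of the list; a touch is counted only if `touch_interval` seconds have passed since the last counted one; and when a block reaches the cold end with a count of at least `hot_threshold`, it is moved to the hot end with its counter reset instead of being evicted. All names and defaults are hypothetical, and Oracle's actual algorithm has more parameters than this.

```python
class TouchCountCache:
    """Hot/cold list with midpoint insertion (simplified sketch)."""

    def __init__(self, capacity: int, touch_interval: float = 3.0, hot_threshold: int = 2) -> None:
        self.capacity = capacity
        self.touch_interval = touch_interval   # seconds between counted touches
        self.hot_threshold = hot_threshold     # touches needed to avoid eviction
        self.order: list = []                  # index 0 = hot end, last = cold end
        self.count: dict = {}
        self.last_touch: dict = {}

    def _evict(self) -> object:
        while True:
            victim = self.order.pop()          # look at the cold end
            if self.count[victim] >= self.hot_threshold:
                # Hot block: move it to the hot end instead of evicting it.
                self.count[victim] = 0
                self.order.insert(0, victim)
                continue
            del self.count[victim], self.last_touch[victim]
            return victim

    def access(self, block, now: float):
        """Access block at time `now`; return the evicted block, if any."""
        if block in self.count:
            # Count the touch only if enough time passed since the last counted one.
            if now - self.last_touch[block] >= self.touch_interval:
                self.count[block] += 1
                self.last_touch[block] = now
            return None
        victim = self._evict() if len(self.order) >= self.capacity else None
        self.order.insert(len(self.order) // 2, block)   # insert in the middle
        self.count[block] = 0
        self.last_touch[block] = now
        return victim

    def age(self) -> None:
        """Periodic decay: decrease every counter."""
        for b in self.count:
            self.count[b] = max(0, self.count[b] - 1)
```

Rapid repeated touches within `touch_interval` do not raise the counter, so a short burst of accesses cannot make a block look hot; the periodic `age` call lets blocks that stop being used lose their protection.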

[Figure: age list with a HOT region at the top and a COLD region at the bottom. New blocks are inserted in the middle; a block is pushed up as its counter increases and down as it decreases; hot pages remain, and blocks are evicted from the bottom]

Page 27

Second Chance

• Whatever the policy, something like the LRU list can become a bottleneck (accessing, sorting, maintaining, updating, etc.) if it is large.

• An alternative design is to use the “second chance” algorithm and implement it using a “clock sweep” approach:

  • No list is maintained

  • Counters are kept in the blocks

  • The buffer is treated as a circular buffer, with an eviction process going around the blocks in the buffer

  • When a page is accessed, its counter is set to 1

  • When the eviction process passes by, if the counter = 1, it sets it to 0 and moves on; if the counter = 0, it evicts the page.
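The second-chance clock described above can be sketched as follows; the names are hypothetical. One reference bit per frame replaces the list, and the hand clears bits as it sweeps, evicting the first frame whose bit is already 0. A newly loaded block counts as accessed, so its bit starts at 1.

```python
class SecondChanceClock:
    """Circular buffer with one reference bit per frame and a sweeping hand."""

    def __init__(self, num_frames: int) -> None:
        self.blocks: list = [None] * num_frames   # block held by each frame
        self.ref = [0] * num_frames               # 1 = accessed since the hand last passed
        self.where: dict = {}                     # block id -> frame index
        self.hand = 0

    def access(self, block):
        """Touch block (loading it on a miss); return the evicted block, if any."""
        if block in self.where:
            self.ref[self.where[block]] = 1       # accessed: set counter to 1
            return None
        while True:
            i = self.hand
            self.hand = (self.hand + 1) % len(self.blocks)
            if self.blocks[i] is None:            # free frame: use it
                break
            if self.ref[i] == 1:
                self.ref[i] = 0                   # second chance: clear and move on
                continue
            break                                 # counter = 0: evict this frame
        victim = self.blocks[i]
        if victim is not None:
            del self.where[victim]
        self.blocks[i] = block
        self.ref[i] = 1
        self.where[block] = i
        return victim
```

If every bit is set, the hand clears them all in one sweep and evicts the frame it started at, so the clock then behaves like FIFO; a block touched since the hand last passed it gets its second chance and survives.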


Page 28

Clock Sweep

• Same as second chance, but it takes into account that some pages are accessed frequently at regular intervals, so it uses a counter rather than just a 1/0 flag. This is the approach used in Postgres

• The algorithm is the same:

  • Upon touching a block, its counter is increased (up to a tunable maximum)

  • With every pass of the eviction process, the counter is decreased

  • If the counter = 0, the block can be evicted

• That way, blocks that are accessed regularly have a higher chance of staying in memory, since their counter will tend to be high
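Replacing the 1/0 flag of second chance with a bounded counter gives the following sketch. The cap `max_usage=5` is an illustrative default, pinned buffers (which a real system would skip) are ignored, and the names are hypothetical; this is not PostgreSQL's source code.

```python
class ClockSweep:
    """Clock sweep with a bounded usage counter per frame."""

    def __init__(self, num_frames: int, max_usage: int = 5) -> None:
        self.max_usage = max_usage               # tunable cap on the counter
        self.blocks: list = [None] * num_frames
        self.usage = [0] * num_frames
        self.where: dict = {}                    # block id -> frame index
        self.hand = 0

    def access(self, block):
        """Touch block (loading it on a miss); return the evicted block, if any."""
        if block in self.where:
            i = self.where[block]
            self.usage[i] = min(self.usage[i] + 1, self.max_usage)
            return None
        while True:
            i = self.hand
            self.hand = (self.hand + 1) % len(self.blocks)
            if self.blocks[i] is None or self.usage[i] == 0:
                break                            # free frame or counter = 0: take it
            self.usage[i] -= 1                   # each pass of the hand decreases it
        victim = self.blocks[i]
        if victim is not None:
            del self.where[victim]
        self.blocks[i] = block
        self.usage[i] = 1
        self.where[block] = i
        return victim
```

A block touched many times saturates at the cap and stays in memory while the hand evicts the blocks that were touched only once, even though the hand passes over the busy block twice on the way.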


Page 29

2Q: using two lists

• Another way to achieve something similar is to use two lists:

  • A FIFO list for blocks that do not need to be kept

  • An LRU list for blocks that are accessed several times

• A block in the FIFO list that is accessed again is moved to the LRU list

• A block at the bottom of the LRU list is either moved to the FIFO list or evicted

• Evict from the FIFO list
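A sketch of the two-list scheme as described above. It is a simplified variant: the original 2Q algorithm of Johnson and Shasha also keeps a "ghost" list of recently evicted block identifiers, which is omitted here, and this sketch always moves a block demoted from the LRU list back to the FIFO list rather than evicting it. The list sizes and names are hypothetical.

```python
from collections import OrderedDict


class TwoListCache:
    """Simplified two-list scheme: FIFO for new blocks, LRU for re-accessed ones."""

    def __init__(self, fifo_size: int, lru_size: int) -> None:
        self.fifo_size, self.lru_size = fifo_size, lru_size
        self.fifo: OrderedDict = OrderedDict()   # insertion order = FIFO order
        self.lru: OrderedDict = OrderedDict()    # first item = LRU end
        self.evicted: list = []

    def _push_fifo(self, block) -> None:
        if len(self.fifo) >= self.fifo_size:
            victim, _ = self.fifo.popitem(last=False)   # evict from the FIFO list
            self.evicted.append(victim)
        self.fifo[block] = True

    def access(self, block) -> None:
        if block in self.lru:
            self.lru.move_to_end(block)                 # normal LRU hit
        elif block in self.fifo:
            # Accessed again: promote from the FIFO list to the LRU list.
            del self.fifo[block]
            if len(self.lru) >= self.lru_size:
                demoted, _ = self.lru.popitem(last=False)
                self._push_fifo(demoted)                # bottom of LRU goes back to FIFO
            self.lru[block] = True
        else:
            self._push_fifo(block)                      # first access: FIFO only
```

A burst of one-off blocks only churns the FIFO list, while a block that has been accessed twice stays in the LRU list until other re-accessed blocks push it out.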


Page 30

Summary

• Buffer cache management is essential to obtain performance

• Fundamental difference from OS approaches: databases know what the operations do, and know it in advance (every query has a plan):

  • Leads to a variety of optimizations

• Many different approaches

• Overhead of the data structures needed to keep track of things should not be underestimated

• Many tuning parameters in all database engines to adjust the behavior


Page 31

What is out there

• Some of these approaches change over time!

• Oracle: LRU, modified LRU, and Hot/Cold

• SQL Server: LRU-K/2 (the blocks are sorted according to their frequency of access rather than just an access counter, which allows the interarrival times of accesses to be taken into account)

• Postgres: Clock Sweep and ring buffers for scans

• MySQL: Hot/Cold

• SAP HANA NSE: 2Q with hot buffers list and LRU
