Kernel Implementation: Page table structures (cs9242/02/lectures/08-vm.pdf)
Virtual memory
Almost all modern operating systems support virtual memory. VM lets you:
• run applications larger than physical memory
• make best use of physical memory
• protect applications from each other (and themselves!)
But, paging virtual memory has some unavoidable overheads:
• translation lookaside buffer (TLB)
• TLB misses
• page table
• page faults
In the 1980’s, these overheads were typically around 5% to 8% (Clark & Emer 1985).
Suddenly in the mid-1990’s, studies start to report TLB refill overheads of 30% to 50%,
and even 80%.
What went wrong?
↑ increasing MHz
↑ increasing MBytes
↑ increasing instructions per cycle (superscalar, VLIW, etc.)
↑ more address bits (64-bit addresses)
↑ higher miss penalty (pipeline & exception costs)
– i80386: 9 to 13 cycles
– VAX-11/780: 17 to 23 cycles
– Pentium 4: 31 to 48 cycles (assuming L2 cache hits)
– PowerPC 604: 120 cycles (assuming level 2 TLB hit)
TLBs aren’t getting bigger and faster as fast as CPU and RAM.
Why not just make bigger and faster TLBs?
• large CAMs are slow and hot
• often flushed (context switch, address space teardown, protection change, etc.)
• MHz, MBytes, and caches sell computers, not TLBs
Why not just increase the page size?
• fragmentation
• I/O latency
• inertia
In many processors, TLB thrashing is a bottleneck.
System impact
Some system features exacerbate TLB load.
• service decomposition
– user-level daemons
– micro-kernels
• sparse address space layout
• shared memory
– increases fragmentation
– reduces locality
– requires more TLB entries to cover the available RAM
Micro-kernel based systems and SASOSes particularly suffer.
Application impact
Some features of recent languages and applications reduce locality:
• bloat
• dynamic libraries
• indirection
• garbage collection
Chen, Borg, and Jouppi (1992) trace a process undergoing garbage collection.
Hardware solutions
• shared TLB tags (StrongARM, PA-RISC, IA-64)
• virtual caches
• super-pages
machine        ITLB  DTLB  page sizes
StrongARM        32    32  4k, 64k, 1M
Pentium III      32    64  4k, 4M
Itanium          64    96  4k, 8k, 16k, 64k, 256k, 1M, 4M, 16M, 64M, 256M, 4G
Alpha 21264     128   128  8k, 64k, 512k, 4M
UltraSPARC       64        8k, 64k, 512k, 4M
MIPS R4000       96        4k, 16k, 64k, 256k, 1M, 4M, 16M
PowerPC 601     256        4k
Software solutions
• optimize page table lookup
• cross-link page tables for shared memory
Page table structures
TLB miss overhead is directly limited by page table performance.
Classical page table structures were designed for
• 32-bit address spaces, and
• the Unix two-segment model.
How well do they perform in:
• large (64-bit) address spaces?
• sparse address-space distributions?
• micro-kernel system structures?
Requirements
• time
• space
• establishment, tear-down cost
• cache friendliness
• support for sharing/aliasing
• support for mixed page sizes
• support for operations on large regions
Multi-level page table
[Diagram: the virtual address is split into P1, P2, and Off. P1 indexes the level-1 table (reached from the root pointer), P2 indexes the level-2 table, and Off selects the byte within the physical frame.]
• used on many 32-bit processors (Pentium, StrongARM, etc.)
• require 4–5 levels for 64-bit address space (AMD Hammer)
• used in Linux (on all platforms)
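To make the walk concrete, here is a minimal sketch of a two-level lookup in C, assuming a hypothetical 10/10/12 address split as on the 32-bit processors above; the types, constants, and the name `mpt_lookup` are illustrative, not any particular kernel's code.

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical 32-bit split: 10-bit P1 index, 10-bit P2 index, 12-bit offset. */
#define P1_SHIFT 22
#define P2_SHIFT 12
#define IDX_MASK 0x3ffu
#define OFF_MASK 0xfffu

typedef struct { uint32_t frame; int valid; } pte_t;
typedef struct { pte_t *table; } pde_t;  /* level-1 entry: pointer to a level-2 table */

/* Walk the two-level table; return ~0u to signal a page fault. */
uint32_t mpt_lookup(pde_t *root, uint32_t vaddr)
{
    pde_t *pde = &root[(vaddr >> P1_SHIFT) & IDX_MASK];
    if (pde->table == NULL)
        return ~0u;                      /* no level-2 table: fault */
    pte_t *pte = &pde->table[(vaddr >> P2_SHIFT) & IDX_MASK];
    if (!pte->valid)
        return ~0u;                      /* page not mapped: fault */
    return (pte->frame << P2_SHIFT) | (vaddr & OFF_MASK);
}
```

Note that each lookup costs one memory access per level, which is why a 64-bit address space needing 4–5 levels is so much more expensive.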
Multi-level page table
Advantages:
• can support super pages efficiently
• can support page table sharing
[Diagram: tasks A and B each have a page directory, reached via the PDBR in their TSS. Private PDEs point to private page tables whose PTEs map each task's own pages; one PDE in each directory points to a shared page table whose PTEs map the shared pages.]
• significant alignment restrictions on this sharing and superpage support
Virtual linear page table
[Diagram: the page table is one linear virtual array of PTEs. The virtual page number P indexes the array, Off selects within the frame, and the array itself is backed by a root table in physical memory.]
Equivalent to MPT, but:
• better best-case lookup time
• steals TLB entries from application
• requires nested exception handling
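The best-case lookup above can be sketched as a single indexed load, assuming the whole table is mapped as one linear array; `vlpt_lookup` and the constants are illustrative. In C we can only model the array itself: on real hardware the load below is a virtual access, and if it misses in the TLB a nested exception walks the root table in physical memory.

```c
#include <stdint.h>

#define PAGE_SHIFT 12  /* illustrative 4k pages */

/* Best case: one shift and one load fetch the PTE from the linear array.
 * On real hardware this load may itself TLB-miss, which is the nested
 * exception handling mentioned above. */
uint32_t vlpt_lookup(const uint32_t *vlpt_base, uint32_t vaddr)
{
    return vlpt_base[vaddr >> PAGE_SHIFT];
}
```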
Inverted page table
[Diagram: the virtual address is split into tag T and Off. T is hashed into the hash anchor table, which points into the frame table; the chain of frame-table entries (tag T, protection P, link L) is searched for a matching tag, and the index of the matching entry gives frame F.]
Inverted page table
Advantages:
• scales with physical, not virtual, memory size
• no problem with virtual sparsity
• one IPT per system, not per address space
– PTEs are bigger, as they must store a tag
– but the system needs a frame table anyway
Disadvantages:
• newer machines have sparse physical address space
• difficult to support super-pages
• sharing is hard
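A minimal sketch of the IPT lookup, assuming a toy 8-frame machine; `hash_tag`, the chaining scheme, and all field names are illustrative (a real IPT also tags entries with an address-space identifier).

```c
#include <stdint.h>

#define NFRAMES 8
#define HAT_SIZE 8
#define PAGE_SHIFT 12
#define INVALID (-1)

struct frame {
    uint32_t tag;  /* virtual page number (plus ASID in a real system) */
    int next;      /* next frame with the same hash, or INVALID */
    int valid;
};

struct ipt {
    int hat[HAT_SIZE];           /* hash anchor table */
    struct frame frames[NFRAMES];/* one entry per physical frame */
};

static unsigned hash_tag(uint32_t tag) { return tag % HAT_SIZE; }

/* Return the frame number mapping vaddr, or INVALID (page fault). */
int ipt_lookup(const struct ipt *pt, uint32_t vaddr)
{
    uint32_t tag = vaddr >> PAGE_SHIFT;
    for (int f = pt->hat[hash_tag(tag)]; f != INVALID; f = pt->frames[f].next)
        if (pt->frames[f].valid && pt->frames[f].tag == tag)
            return f;
    return INVALID;
}
```

The table scales with the number of frames, not with the virtual address space, which is exactly the scaling advantage listed above.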
Hashed page table
[Diagram: the tag T is hashed directly into a hash table whose entries hold (T, P, L); a matching entry yields the frame.]
Very similar to IPT.
Clustering
Clustering is a page table optimization which can in principle be applied to any page table structure.
[Diagram: the virtual address is split into block tag BT, block offset B, and Off. BT is hashed into the hash table; a matching entry (tag T, attributes L) holds a cluster of sub-PTEs (P0, P1, P2, P3), and B selects one of them.]
• store multiple pages per PTE
• load multiple pages into TLB per miss
• improves performance in presence of spatial locality
• used in MIPS R4000 hardware TLB entry
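The idea can be sketched as a clustered PTE holding several subpage mappings, assuming a hypothetical cluster factor of 4; the struct layout and names are illustrative.

```c
#include <stdint.h>

#define PAGE_SHIFT 12
#define CLUSTER 4                 /* pages per clustered PTE (illustrative) */
#define CLUSTER_MASK (CLUSTER - 1)

struct cpte {
    uint32_t tag;                 /* VPN of the first page in the cluster */
    uint32_t frame[CLUSTER];      /* one frame per subpage */
    uint8_t  valid;               /* bitmap: bit i set if subpage i is mapped */
};

/* Look up vaddr in one clustered PTE; return the frame or ~0u on a miss. */
uint32_t cpte_lookup(const struct cpte *e, uint32_t vaddr)
{
    uint32_t vpn = vaddr >> PAGE_SHIFT;
    uint32_t sub = vpn & CLUSTER_MASK;
    if ((vpn & ~(uint32_t)CLUSTER_MASK) != e->tag || !(e->valid & (1u << sub)))
        return ~0u;
    /* On a real refill all CLUSTER subpages could be loaded into the TLB
     * at once, which is how clustering exploits spatial locality. */
    return e->frame[sub];
}
```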
Level 2 TLB
• a direct-mapped cache of TLB entries in main memory
• fast lookup; can achieve >95% hit rate
• also called software TLB or TLB cache
Software TLB
[Diagram: the virtual address is split into tag T and Off. T indexes the software TLB; on a hit the cached (T, P) pair supplies the frame, on a miss the page table proper is walked.]
• simple enough for hardware implementation
• difficult to support super-pages
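The fast path can be sketched as a direct-mapped probe, with illustrative sizes and names; on a miss the handler falls back to the full page table walk (not shown here).

```c
#include <stdint.h>

#define PAGE_SHIFT 12
#define STLB_SIZE 1024          /* must be a power of two (illustrative) */

struct stlb_entry { uint32_t tag; uint32_t frame; int valid; };

/* Fast path of a refill handler: probe the direct-mapped software TLB.
 * Returns 1 and fills *frame on a hit; returns 0 on a miss, after which
 * the real page table must be walked and the entry refilled. */
int stlb_lookup(const struct stlb_entry *stlb, uint32_t vaddr, uint32_t *frame)
{
    uint32_t vpn = vaddr >> PAGE_SHIFT;
    const struct stlb_entry *e = &stlb[vpn & (STLB_SIZE - 1)];
    if (e->valid && e->tag == vpn) {
        *frame = e->frame;
        return 1;
    }
    return 0;
}
```

Because the index is just a mask of the VPN, the lookup is a handful of instructions, which is why this structure is simple enough to implement in hardware.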
Guarded page table
In large address spaces, MPT often creates page table levels with only one valid entry.
• idea: bypass these tables
• some address bits are not used to index any table: check these bits during lookup
• skipped bits called a guard
• technique also called path compression
[Diagram: a GPT tree in which each node stores a guard; levels that would contain only one valid entry are bypassed.]
Invented by Liedtke (1994), inventor of L4.
TLB performance studies
How well do these page tables perform?
reference      CPU         OS          page table  benchmark  TLB miss penalty
Clark & Emer   VAX-11/780  VMS         VLA2        TS1        6.6%
(1985)                                             TS2        6.4%
                                                   EDU        6.0%
                                                   SCI        5.5%
                                                   COM        6.8%
Nagle et al.   MIPS R2000  Ultrix 3.1  VLA2        mixed      2.03%
(1993)                     OSF/1 1.0   VLA2½                  5.81%
                           Mach 3.0    VLA2½                  8.21%
reference       CPU         OS           page table  benchmark  TLB miss penalty
Huck & Hays     PA-RISC     HP-UX        IPT         matrix     45%
(1993)                                               nasker     18%
                                                     OLTP1      12%
                                                     gcc         6%
                                         HPT         matrix     39%
                                                     nasker     15%
                                                     OLTP1       9%
                                                     gcc         4%
Talluri & Hill  UltraSPARC  Solaris 2.1  HPT         coral      50%
(1994)                                               nasa7      40%
                                                     compress   26%
                                                     mp3d       11%
                                                     gcc         3%
reference      CPU          OS         page table  benchmark  TLB miss penalty
Romer et al.   Alpha 21064  OSF/1 2.1  VLA2½       coral      41.4%
(1995)                                             compress   35.2%
                                                   spice       9.4%
                                                   gcc         5.2%
Subramanian    PA-RISC      HP-UX      HPT         Verilog    49%
et al. (1998)                                      apsi       31%
                                                   compress   15%
Elphinstone    MIPS R4700   L4/MIPS    GPT16       c4         32.4%
(1999)                                             gcc        17.9%
                                                   compress   14.0%
                                                   wave5       9.3%
                                       128k STLB   c4         14.4%
                                                   gcc         5.9%
                                                   compress    4.9%
                                                   wave5       3.1%
GPT performance
Elphinstone (1999) studied GPT and various other page tables, using L4/MIPS as a testbed.
source   name      size (M)  type  remarks
SPEC     go           0.8    I     game of go
CPU95    swim        14.2    F     PDE solver
         gcc          9.3    I     GNU C compiler
         compress    34.9    I     file (un)compression
         apsi         2.2    F     PDE solver
         wave5       40.4    F     PDE solver
Alburto  c4           5.1    I     game of connect four
         nsieve       4.9    I     prime number generator
         heapsort     4.0    I     sorting large arrays
         mm           7.7    F     matrix multiply
         tfftdp       4.0    F     fast fourier transform
GPT refill time
[Plot: GPT refill cost, 0 to 300 cycles, versus entries per GPT node (G2 to G256).]
GPT versus other page tables
[Plot: refill cost, 0 to 160 cycles, for MPT, G16, H8k, H128k, C128k, G16/S8k, and G16/S128k.]
GPT depth
Compare average GPT depth with (fixed) MPT depth.
[Plot: average lookup depth, 0 to 14, versus entries per GPT node (G2 to G256).]
GPT space
Compare GPT storage requirements with expected MPT storage requirements.
[Plot: page table storage in bytes per PTE, 0 to 80, versus entries per GPT node (G2 to G256).]
GPT versus other page tables
[Plot: storage in bytes per PTE, 0 to 350, for MPT, G16, H8k, H128k, and C128k.]
Address space establishment/teardown cost
[Plots: establishment/teardown cost in µsec, 0 to 5000, for MPT, C128k, H128k, H8k, G16, G16/S8k, and G16/S128k; detail, 0 to 200 µsec, for H8k, G16, G16/S8k, and G16/S128k.]
Other benchmarks
• sparse benchmark: uniformly distributed pages in a 1 TB address space
• file benchmark: uniformly distributed multi-page objects
[Plots: bytes per PTE, 0 to 1000, versus entries per GPT node (G2 to G256), for the worst, uniform, file, and conv distributions; detail, 0 to 100, for G2 to G16.]
GPT conclusions
• low establishment/teardown cost
• small GPT node size saves space, especially for sparse distributions
• tree depth can become a problem, especially for dense distributions
L4/MIPS solution: use GPT with a level 2 TLB.
Implementation in L4
L4 provides three operations: map, grant, and unmap.
[Diagram: a mapping tree rooted at σ0: σ0 maps pages to A; A maps to B and C; B maps to D and E.]
L4 must remember the history of map operations in the mapping database, to allow
future undo with unmap.
Memory management and I/O are the responsibility of user-level pagers.
L4 implementation
Most L4 implementations (including L4/MIPS) have a similar implementation of recursive address spaces.
• guarded page table
• frame table
• mapping database
[Diagram: the kernel's guarded page table, frame table, and mapping database, with pointers between them; a direct GPT-to-mapping-database pointer is shown as a green arrow.]
Direct pointers between GPT and mapping database (green arrow) were considered
by Elphinstone, but rejected to allow PT implementation freedom.
Level and Path Compressed Trie
• invented by Andersson and Nilsson (1991)
• implemented by Szmajda in Calypso VM system
• a simplified and flattened version of GPT
• allows node size and skip size to be an arbitrary power of two
• all guard comparisons are deferred until the leaves
Calypso implementation
• store two shift amounts and a pointer in internal nodes
• extract bits with two shifts
internal node:  | m | f | prot | size | skip |
                | ptr |
• store virtual address (and other goodies) in an enlarged PTE
enlarged PTE:   | m | f | gen | task | hard | size | skip |
                | ptr |
                | m | 0 | phys | c | w | v | g |
                | virt |
Each PTE may represent any (hard) page size.
Page tables may be shared (with an addition to the L4 API).
Calypso vs. GPT
Elphinstone (1999) derived the following GPT algorithm.
repeat {
    u = v >> (vlen − s)
    g = (p + 32u)→guard
    glen = (p + 32u)→guardlen
    if g == ((v >> (vlen − s − glen)) and (2^glen − 1)) {
        v′len = vlen − s − glen
        v′ = v and (2^v′len − 1)
        s′ = (p + 32u)→size
        p′ = (p + 32u)→table
    } else
        page fault
} until p is a leaf
Calypso vs. GPT
After common subexpression elimination, the GPT loop has 17 arithmetic and load operations.
Calypso is much simpler.
repeat {
    p = &p→ptr[v << p→skip >> p→size]
} until p is a leaf
if p→virt ≠ v
    page fault
All guard checking is deferred until the end.
The inner loop on the MIPS R4000 requires only 7 instructions.
Calypso policies
How large to make Calypso nodes?
Andersson and Nilsson (1998) used thresholds like 50% full → double, etc. Calypso’s
implementation is different.
• each page table is greedy, and takes all the memory it can
• unused page tables are liable to be chopped in half at any time, and then returned to the memory manager
• power of two regions are managed by a buddy system allocator
To prevent excess greed, kernel memory is managed by user-level pagers, instead of in a single fixed pool.
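The power-of-two region management rests on the standard buddy computation, sketched below; `buddy_of` and `split` are illustrative helpers, not Calypso's actual code.

```c
#include <stdint.h>

/* A block of size 2^order at address a has its buddy at a XOR 2^order;
 * when both buddies are free they coalesce into one block of order+1. */
static uintptr_t buddy_of(uintptr_t addr, unsigned order)
{
    return addr ^ ((uintptr_t)1 << order);
}

/* Chopping a block "in half" splits a block of the given order into two
 * buddies of order-1, at addr and at addr plus half the block size. */
static void split(uintptr_t addr, unsigned order,
                  uintptr_t *lo, uintptr_t *hi)
{
    *lo = addr;
    *hi = addr + ((uintptr_t)1 << (order - 1));
}
```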
Status: memory management API undergoing standardization.
Page table fragmentation
But, representing many page sizes can blow out depth.
[Diagram: a region mapped with six 4k pages and two 16k pages; the mixed sizes force extra levels in the trie.]
Solution: key expansion
[Diagram: with expanded keys, the six 4k pages and two 16k pages are indexed at a single level.]
Complicates the mapping database (later).
Calypso mapping database
Topologically sort each mapping graph into a singly-linked list.
[Diagram: a mapping graph rooted at σ0, with nodes A2, B1, D2, B4, D4, E2, C2, E3, and E4, flattened into a singly-linked list in topological order.]
Integrate the mapping database list into the PTEs.
Link
New VM operation: link, which establishes a shared domain between pager and pagee.
Semantically, link is like map, but instead of just copying a snapshot of the pager’s
mappings to the destination address space, the pager and pagee always share the
mappings, even if the pager’s address space is updated by future maps or unmaps.
L4 primitive Unix analogy
unmap rm
map cp
grant mv
link ln -s
More on link
Restrictions:
• virtual address of the fpage in pager and pagee must be equal
• fpage size may be restricted
Advantages:
• natural generalization of map and grant
• reduces kernel crossings
• reduces page fault IPC
• restricted by L4’s usual IPC confinement model (e.g. clans and chiefs)
Calypso performance
Results measured by running with VM on and off, and comparing run-times.
• counts all direct and indirect costs of VM
• nor malized to percentage overhead
Calypso also includes other optimizations beyond the scope of this lecture.
          HPT    CPT    GPT    GPT+TLB2  CALY4
wave5     15.4%  14.9%  16.2%   5.1%      6.2%
swim       4.7%   2.4%   1.1%   0.5%      2.6%
gcc       24.3%  26.8%  31.4%   9.1%      9.5%
compress  16.2%  17.2%  24.5%   7.9%      7.6%
Single-tasking performance
Calypso performance
Enabling page size mixtures drastically improves performance, but the space/time tradeoff is harder to measure.
          CALY4  CALY64  CALY1024  CALY16384
wave5     6.2%   2.4%    <0.1%     <0.1%
swim      2.6%   1.1%     0.0%      0.0%
gcc       9.5%   0.8%     0.0%      0.0%
compress  7.6%   2.6%    <0.1%     <0.1%
Mixed page sizes (assuming infinite physical memory)
Multi-tasking performance was measured with and without LINK, and using the G (global) bit to simulate shared TLB tags.
          GPT    GPT+TLB2  CALY4M  CALY4L  CALY4G
wave5     20.2%   9.1%      8.2%    8.0%    7.6%
swim       2.6%   2.1%      2.9%    2.8%    2.8%
gcc       36.9%  13.4%     11.8%   11.5%   10.8%
compress  27.9%  10.1%      9.1%    8.6%    8.3%
Multi-tasking performance (assuming infinite physical memory)
Conclusions
• Modern hardware and recent software can lead to high VM overhead.
– 64-bit addresses
– sparse address space usage
– micro-kernel service decomposition
– bloated applications
• Conventional page tables don’t perform well in these conditions.
• Level 2 TLB is the best solution to a slow page table
• Calypso performs as well as level 2 TLB for dense address spaces
• Performance in sparse situations is yet to be evaluated
• Optimization of the critical path pays off
– but only after evaluation and measurement.
References and further information
http://www.cse.unsw.edu.au/˜cls/