Kernel Implementation: Page table structures (cs9242/02/lectures/08-vm.pdf)
Virtual memory
Almost all modern operating systems support virtual memory. VM lets you:
• run applications larger than physical memory
• make best use of physical memory
• protect applications from each other (and themselves!)
But, paging virtual memory has some unavoidable overheads:
• translation lookaside buffer (TLB)
• TLB misses
• page table
• page faults
In the 1980’s, these overheads were typically around 5% to 8% (Clark & Emer 1985).
Suddenly in the mid-1990’s, studies start to report TLB refill overheads of 30% to 50%,
and even 80%.
What went wrong?
↑ increasing MHz
↑ increasing MBytes
↑ increasing instructions per cycle (superscalar, VLIW, etc.)
↑ more address bits (64-bit addresses)
↑ higher miss penalty (pipeline & exception costs)
– i80386: 9 to 13 cycles
– VAX-11/780: 17 to 23 cycles
– Pentium 4: 31 to 48 cycles (assuming L2 cache hits)
– PowerPC 604: 120 cycles (assuming level 2 TLB hit)
TLBs aren’t getting bigger and faster as fast as CPU and RAM.
Why not just make bigger and faster TLBs?
• large CAMs are slow and hot
• often flushed (context switch, address space teardown, protection change, etc.)
• MHz, MBytes, and caches sell computers, not TLBs
Why not just increase the page size?
• fragmentation
• I/O latency
• inertia
In many processors, TLB thrashing is a bottleneck.
System impact
Some system features exacerbate TLB load.
• service decomposition
– user-level daemons
– micro-kernels
• sparse address space layout
• shared memory
– increases fragmentation
– reduces locality
– requires more TLB entries to cover the available RAM
Micro-kernel based systems and SASOSes particularly suffer.
Application impact
Some features of recent languages and applications reduce locality:
• bloat
• dynamic libraries
• indirection
• garbage collection
Chen, Borg, and Jouppi (1992) trace a process undergoing garbage collection.
Hardware solutions
• shared TLB tags (StrongARM, PA-RISC, IA-64)
• virtual caches
• super-pages
machine        ITLB  DTLB  page sizes
StrongARM        32    32  4k, 64k, 1M
Pentium III      32    64  4k, 4M
Itanium          64    96  4k, 8k, 16k, 64k, 256k, 1M, 4M, 16M, 64M, 256M, 4G
Alpha 21264     128   128  8k, 64k, 512k, 4M
UltraSPARC       64        8k, 64k, 512k, 4M
MIPS R4000       96        4k, 16k, 64k, 256k, 1M, 4M, 16M
PowerPC 601     256        4k
Software solutions
• optimize page table lookup
• cross-link page tables for shared memory
Page table structures
TLB miss overhead is directly limited by page table performance.
Classical page table structures were designed for
• 32-bit address spaces, and
• the Unix two-segment model.
How well do they perform in:
• large (64-bit) address spaces?
• sparse address-space distributions?
• micro-kernel system structures?
Requirements
• time
• space
• establishment, tear-down cost
• cache friendliness
• support for sharing/aliasing
• support for mixed page sizes
• support for operations on large regions
Multi-level page table
[Diagram: the virtual address is split into P1, P2, and Off. P1 indexes the level-1 table (reached from the root pointer), P2 indexes the level-2 table, and Off selects the byte within the physical frame.]
• used on many 32-bit processors (Pentium, StrongARM, etc.)
• require 4–5 levels for 64-bit address space (AMD Hammer)
• used in Linux (on all platforms)
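To make the walk concrete, here is a minimal sketch of a two-level lookup in C, assuming a hypothetical 10/10/12 address split as on the 32-bit processors above; the types, constants, and the name `mpt_lookup` are illustrative, not any particular kernel's code.

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical 32-bit split: 10-bit P1 index, 10-bit P2 index, 12-bit offset. */
#define P1_SHIFT 22
#define P2_SHIFT 12
#define IDX_MASK 0x3ffu
#define OFF_MASK 0xfffu

typedef struct { uint32_t frame; int valid; } pte_t;
typedef struct { pte_t *table; } pde_t;  /* level-1 entry: pointer to a level-2 table */

/* Walk the two-level table; return ~0u to signal a page fault. */
uint32_t mpt_lookup(pde_t *root, uint32_t vaddr)
{
    pde_t *pde = &root[(vaddr >> P1_SHIFT) & IDX_MASK];
    if (pde->table == NULL)
        return ~0u;                      /* no level-2 table: fault */
    pte_t *pte = &pde->table[(vaddr >> P2_SHIFT) & IDX_MASK];
    if (!pte->valid)
        return ~0u;                      /* page not mapped: fault */
    return (pte->frame << P2_SHIFT) | (vaddr & OFF_MASK);
}
```

Note that each lookup costs one memory access per level, which is why a 64-bit address space needing 4–5 levels is so much more expensive.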
Multi-level page table
Advantages:
• can support super pages efficiently
• can support page table sharing
[Diagram: tasks A and B each have a page directory, reached via the PDBR in their TSS. Private PDEs point to private page tables whose PTEs map each task's own pages; one PDE in each directory points to a shared page table whose PTEs map the shared pages.]
• significant alignment restrictions on this sharing and superpage support
Virtual linear page table
[Diagram: the page table is one linear virtual array of PTEs. The virtual page number P indexes the array, Off selects within the frame, and the array itself is backed by a root table in physical memory.]
Equivalent to MPT, but:
• better best-case lookup time
• steals TLB entries from application
• requires nested exception handling
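The best-case lookup above can be sketched as a single indexed load, assuming the whole table is mapped as one linear array; `vlpt_lookup` and the constants are illustrative. In C we can only model the array itself: on real hardware the load below is a virtual access, and if it misses in the TLB a nested exception walks the root table in physical memory.

```c
#include <stdint.h>

#define PAGE_SHIFT 12  /* illustrative 4k pages */

/* Best case: one shift and one load fetch the PTE from the linear array.
 * On real hardware this load may itself TLB-miss, which is the nested
 * exception handling mentioned above. */
uint32_t vlpt_lookup(const uint32_t *vlpt_base, uint32_t vaddr)
{
    return vlpt_base[vaddr >> PAGE_SHIFT];
}
```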
Inverted page table
[Diagram: the virtual address is split into tag T and Off. T is hashed into the hash anchor table, which points into the frame table; the chain of frame-table entries (tag T, protection P, link L) is searched for a matching tag, and the index of the matching entry gives frame F.]
Inverted page table
Advantages:
• scales with physical, not virtual, memory size
• no problem with virtual sparsity
• one IPT per system, not per address space
– PTEs are bigger, as they must store a tag
– but the system needs a frame table anyway
Disadvantages:
• newer machines have sparse physical address space
• difficult to support super-pages
• sharing is hard
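A minimal sketch of the IPT lookup, assuming a toy 8-frame machine; `hash_tag`, the chaining scheme, and all field names are illustrative (a real IPT also tags entries with an address-space identifier).

```c
#include <stdint.h>

#define NFRAMES 8
#define HAT_SIZE 8
#define PAGE_SHIFT 12
#define INVALID (-1)

struct frame {
    uint32_t tag;  /* virtual page number (plus ASID in a real system) */
    int next;      /* next frame with the same hash, or INVALID */
    int valid;
};

struct ipt {
    int hat[HAT_SIZE];           /* hash anchor table */
    struct frame frames[NFRAMES];/* one entry per physical frame */
};

static unsigned hash_tag(uint32_t tag) { return tag % HAT_SIZE; }

/* Return the frame number mapping vaddr, or INVALID (page fault). */
int ipt_lookup(const struct ipt *pt, uint32_t vaddr)
{
    uint32_t tag = vaddr >> PAGE_SHIFT;
    for (int f = pt->hat[hash_tag(tag)]; f != INVALID; f = pt->frames[f].next)
        if (pt->frames[f].valid && pt->frames[f].tag == tag)
            return f;
    return INVALID;
}
```

The table scales with the number of frames, not with the virtual address space, which is exactly the scaling advantage listed above.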
Hashed page table
[Diagram: the tag T is hashed directly into a hash table whose entries hold (T, P, L); a matching entry yields the frame.]
Very similar to IPT.
Clustering
Clustering is a page table optimization which can in principle be applied to any page table structure.
[Diagram: the virtual address is split into block tag BT, block offset B, and Off. BT is hashed into the hash table; a matching entry (tag T, attributes L) holds a cluster of sub-PTEs (P0, P1, P2, P3), and B selects one of them.]
• store multiple pages per PTE
• load multiple pages into TLB per miss
• improves performance in presence of spatial locality
• used in MIPS R4000 hardware TLB entry
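The idea can be sketched as a clustered PTE holding several subpage mappings, assuming a hypothetical cluster factor of 4; the struct layout and names are illustrative.

```c
#include <stdint.h>

#define PAGE_SHIFT 12
#define CLUSTER 4                 /* pages per clustered PTE (illustrative) */
#define CLUSTER_MASK (CLUSTER - 1)

struct cpte {
    uint32_t tag;                 /* VPN of the first page in the cluster */
    uint32_t frame[CLUSTER];      /* one frame per subpage */
    uint8_t  valid;               /* bitmap: bit i set if subpage i is mapped */
};

/* Look up vaddr in one clustered PTE; return the frame or ~0u on a miss. */
uint32_t cpte_lookup(const struct cpte *e, uint32_t vaddr)
{
    uint32_t vpn = vaddr >> PAGE_SHIFT;
    uint32_t sub = vpn & CLUSTER_MASK;
    if ((vpn & ~(uint32_t)CLUSTER_MASK) != e->tag || !(e->valid & (1u << sub)))
        return ~0u;
    /* On a real refill all CLUSTER subpages could be loaded into the TLB
     * at once, which is how clustering exploits spatial locality. */
    return e->frame[sub];
}
```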
Level 2 TLB
• a direct-mapped cache of TLB entries in main memory
• fast lookup; can achieve >95% hit rate
• also called software TLB or TLB cache
Software TLB
[Diagram: the virtual address is split into tag T and Off. T indexes the software TLB; on a hit the cached (T, P) pair supplies the frame, on a miss the page table proper is walked.]
• simple enough for hardware implementation
• difficult to support super-pages
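The fast path can be sketched as a direct-mapped probe, with illustrative sizes and names; on a miss the handler falls back to the full page table walk (not shown here).

```c
#include <stdint.h>

#define PAGE_SHIFT 12
#define STLB_SIZE 1024          /* must be a power of two (illustrative) */

struct stlb_entry { uint32_t tag; uint32_t frame; int valid; };

/* Fast path of a refill handler: probe the direct-mapped software TLB.
 * Returns 1 and fills *frame on a hit; returns 0 on a miss, after which
 * the real page table must be walked and the entry refilled. */
int stlb_lookup(const struct stlb_entry *stlb, uint32_t vaddr, uint32_t *frame)
{
    uint32_t vpn = vaddr >> PAGE_SHIFT;
    const struct stlb_entry *e = &stlb[vpn & (STLB_SIZE - 1)];
    if (e->valid && e->tag == vpn) {
        *frame = e->frame;
        return 1;
    }
    return 0;
}
```

Because the index is just a mask of the VPN, the lookup is a handful of instructions, which is why this structure is simple enough to implement in hardware.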
Guarded page table
In large address spaces, MPT often creates page table levels with only one valid entry.
• idea: bypass these tables
• some address bits are not used to index any table: check these bits during lookup
• skipped bits called a guard
• technique also called path compression
[Diagram: a GPT tree in which each node stores a guard; levels that would contain only one valid entry are bypassed.]
Invented by Liedtke (1994), inventor of L4.
TLB performance studies
How well do these page tables perform?
reference      CPU         OS          page table  benchmark  TLB miss penalty
Clark & Emer   VAX-11/780  VMS         VLA2        TS1        6.6%
(1985)                                             TS2        6.4%
                                                   EDU        6.0%
                                                   SCI        5.5%
                                                   COM        6.8%
Nagle et al.   MIPS R2000  Ultrix 3.1  VLA2        mixed      2.03%
(1993)                     OSF/1 1.0   VLA2½                  5.81%
                           Mach 3.0    VLA2½                  8.21%
reference       CPU         OS           page table  benchmark  TLB miss penalty
Huck & Hays     PA-RISC     HP-UX        IPT         matrix     45%
(1993)                                               nasker     18%
                                                     OLTP1      12%
                                                     gcc         6%
                                         HPT         matrix     39%
                                                     nasker     15%
                                                     OLTP1       9%
                                                     gcc         4%
Talluri & Hill  UltraSPARC  Solaris 2.1  HPT         coral      50%
(1994)                                               nasa7      40%
                                                     compress   26%
                                                     mp3d       11%
                                                     gcc         3%
reference      CPU          OS         page table  benchmark  TLB miss penalty
Romer et al.   Alpha 21064  OSF/1 2.1  VLA2½       coral      41.4%
(1995)                                             compress   35.2%
                                                   spice       9.4%
                                                   gcc         5.2%
Subramanian    PA-RISC      HP-UX      HPT         Verilog    49%
et al. (1998)                                      apsi       31%
                                                   compress   15%
Elphinstone    MIPS R4700   L4/MIPS    GPT16       c4         32.4%
(1999)                                             gcc        17.9%
                                                   compress   14.0%
                                                   wave5       9.3%
                                       128k STLB   c4         14.4%
                                                   gcc         5.9%
                                                   compress    4.9%
                                                   wave5       3.1%
GPT performance
Elphinstone (1999) studied GPT and various other page tables, using L4/MIPS as a testbed.
source   name      size (M)  type  remarks
SPEC     go           0.8    I     game of go
CPU95    swim        14.2    F     PDE solver
         gcc          9.3    I     GNU C compiler
         compress    34.9    I     file (un)compression
         apsi         2.2    F     PDE solver
         wave5       40.4    F     PDE solver
Alburto  c4           5.1    I     game of connect four
         nsieve       4.9    I     prime number generator
         heapsort     4.0    I     sorting large arrays
         mm           7.7    F     matrix multiply
         tfftdp       4.0    F     fast fourier transform
GPT refill time
[Plot: GPT refill cost, 0 to 300 cycles, versus entries per GPT node (G2 to G256).]
GPT versus other page tables
[Plot: refill cost, 0 to 160 cycles, for MPT, G16, H8k, H128k, C128k, G16/S8k, and G16/S128k.]
GPT depth
Compare average GPT depth with (fixed) MPT depth.
[Plot: average lookup depth, 0 to 14, versus entries per GPT node (G2 to G256).]
GPT space
Compare GPT storage requirements with expected MPT storage requirements.
[Plot: page table storage in bytes per PTE, 0 to 80, versus entries per GPT node (G2 to G256).]
GPT versus other page tables
[Plot: storage in bytes per PTE, 0 to 350, for MPT, G16, H8k, H128k, and C128k.]
Address space establishment/teardown cost
[Plots: establishment/teardown cost in µsec, 0 to 5000, for MPT, C128k, H128k, H8k, G16, G16/S8k, and G16/S128k; detail, 0 to 200 µsec, for H8k, G16, G16/S8k, and G16/S128k.]
Other benchmarks
• sparse benchmark: uniformly distributed pages in a 1 TB address space
• file benchmark: uniformly distributed multi-page objects
[Plots: bytes per PTE, 0 to 1000, versus entries per GPT node (G2 to G256), for the worst, uniform, file, and conv distributions; detail, 0 to 100, for G2 to G16.]
GPT conclusions
• low establishment/teardown cost
• small GPT node size saves space, especially for sparse distributions
• tree depth can become a problem, especially for dense distributions
L4/MIPS solution: use GPT with a level 2 TLB.
Implementation in L4
L4 provides three operations: map, grant, and unmap.
[Diagram: a mapping tree rooted at σ0: σ0 maps pages to A; A maps to B and C; B maps to D and E.]
L4 must remember the history of map operations in the mapping database, to allow
future undo with unmap.
Memory management and I/O are the responsibility of user-level pagers.
L4 implementation
Most L4 implementations (including L4/MIPS) have a similar implementation of recursive address spaces.
• guarded page table
• frame table
• mapping database
[Diagram: the kernel's guarded page table, frame table, and mapping database, with pointers between them; a direct GPT-to-mapping-database pointer is shown as a green arrow.]
Direct pointers between GPT and mapping database (green arrow) were considered
by Elphinstone, but rejected to allow PT implementation freedom.
Level and Path Compressed Trie
• invented by Andersson and Nilsson (1991)
• implemented by Szmajda in Calypso VM system
• a simplified and flattened version of GPT
• allows node size and skip size to be an arbitrary power of two
• all guard comparisons are deferred until the leaves
Calypso implementation
• store two shift amounts and a pointer in internal nodes
• extract bits with two shifts
internal node:  | m | f | prot | size | skip |
                | ptr |
• store virtual address (and other goodies) in an enlarged PTE
enlarged PTE:   | m | f | gen | task | hard | size | skip |
                | ptr |
                | m | 0 | phys | c | w | v | g |
                | virt |
Each PTE may represent any (hard) page size.
Page tables may be shared (with an addition to the L4 API).
Calypso vs. GPT
Elphinstone (1999) derived the following GPT algorithm.
repeat {
    u = v >> (vlen − s)
    g = (p + 32u)→guard
    glen = (p + 32u)→guardlen
    if g == ((v >> (vlen − s − glen)) and (2^glen − 1)) {
        v′len = vlen − s − glen
        v′ = v and (2^v′len − 1)
        s′ = (p + 32u)→size
        p′ = (p + 32u)→table
    } else
        page fault
} until p is a leaf
Calypso vs. GPT
After common subexpression elimination, the GPT loop has 17 arithmetic and load operations.
Calypso is much simpler.
repeat {
    p = &p→ptr[v << p→skip >> p→size]
} until p is a leaf
if p→virt ≠ v
    page fault
All guard checking is deferred until the end.
The inner loop on the MIPS R4000 requires only 7 instructions.
Calypso policies
How large to make Calypso nodes?
Andersson and Nilsson (1998) used thresholds like 50% full → double, etc. Calypso’s
implementation is different.
• each page table is greedy, and takes all the memory it can
• unused page tables are liable to be chopped in half at any time, and then returned to the memory manager
• power of two regions are managed by a buddy system allocator
To prevent excess greed, kernel memory is managed by user-level pagers, instead of in a single fixed pool.
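The power-of-two region management rests on the standard buddy computation, sketched below; `buddy_of` and `split` are illustrative helpers, not Calypso's actual code.

```c
#include <stdint.h>

/* A block of size 2^order at address a has its buddy at a XOR 2^order;
 * when both buddies are free they coalesce into one block of order+1. */
static uintptr_t buddy_of(uintptr_t addr, unsigned order)
{
    return addr ^ ((uintptr_t)1 << order);
}

/* Chopping a block "in half" splits a block of the given order into two
 * buddies of order-1, at addr and at addr plus half the block size. */
static void split(uintptr_t addr, unsigned order,
                  uintptr_t *lo, uintptr_t *hi)
{
    *lo = addr;
    *hi = addr + ((uintptr_t)1 << (order - 1));
}
```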
Status: memory management API undergoing standardization.
Page table fragmentation
But, representing many page sizes can blow out depth.
[Diagram: a region mapped with six 4k pages and two 16k pages; the mixed sizes force extra levels in the trie.]
Solution: key expansion
[Diagram: with expanded keys, the six 4k pages and two 16k pages are indexed at a single level.]
Complicates the mapping database (later).
Calypso mapping database
Topologically sort each mapping graph into a singly-linked list.
[Diagram: a mapping graph rooted at σ0, with nodes A2, B1, D2, B4, D4, E2, C2, E3, and E4, flattened into a singly-linked list in topological order.]
Integrate the mapping database list into the PTEs.
Link
New VM operation: link, which establishes a shared domain between pager and pagee.
Semantically, link is like map, but instead of just copying a snapshot of the pager’s
mappings to the destination address space, the pager and pagee always share the
mappings, even if the pager’s address space is updated by future maps or unmaps.
L4 primitive Unix analogy
unmap rm
map cp
grant mv
link ln -s
More on link
Restrictions:
• virtual address of the fpage in pager and pagee must be equal
• fpage size may be restricted
Advantages:
• natural generalization of map and grant
• reduces kernel crossings
• reduces page fault IPC
• restricted by L4’s usual IPC confinement model (e.g. clans and chiefs)
Calypso performance
Results measured by running with VM on and off, and comparing run-times.
• counts all direct and indirect costs of VM
• nor malized to percentage overhead
Calypso also includes other optimizations beyond the scope of this lecture.
          HPT    CPT    GPT    GPT+TLB2  CALY4
wave5     15.4%  14.9%  16.2%   5.1%      6.2%
swim       4.7%   2.4%   1.1%   0.5%      2.6%
gcc       24.3%  26.8%  31.4%   9.1%      9.5%
compress  16.2%  17.2%  24.5%   7.9%      7.6%
Single-tasking performance
Calypso performance
Enabling page size mixtures drastically improves performance, but the space/time tradeoff is harder to measure.
          CALY4  CALY64  CALY1024  CALY16384
wave5     6.2%   2.4%    <0.1%     <0.1%
swim      2.6%   1.1%     0.0%      0.0%
gcc       9.5%   0.8%     0.0%      0.0%
compress  7.6%   2.6%    <0.1%     <0.1%
Mixed page sizes (assuming infinite physical memory)
Multi-tasking performance was measured with and without LINK, and using the G (global) bit to simulate shared TLB tags.
          GPT    GPT+TLB2  CALY4M  CALY4L  CALY4G
wave5     20.2%   9.1%      8.2%    8.0%    7.6%
swim       2.6%   2.1%      2.9%    2.8%    2.8%
gcc       36.9%  13.4%     11.8%   11.5%   10.8%
compress  27.9%  10.1%      9.1%    8.6%    8.3%
Multi-tasking performance (assuming infinite physical memory)
Conclusions
• Modern hardware and recent software can lead to high VM overhead.
– 64-bit addresses
– sparse address space usage
– micro-kernel service decomposition
– bloated applications
• Conventional page tables don’t perform well in these conditions.
• Level 2 TLB is the best solution to a slow page table
• Calypso performs as well as level 2 TLB for dense address spaces
• Performance in sparse situations is yet to be evaluated
• Optimization of the critical path pays off
– but only after evaluation and measurement.
References and further information
http://www.cse.unsw.edu.au/˜cls/