Operating Systems & Memory Systems: Address Translation
Computer Science 220 / ECE 252
Professor Alvin R. Lebeck, Fall 2006
CPS 220 © Alvin R. Lebeck 2001
Outline
• Finish Main Memory
• Address Translation
  – basics
  – 64-bit Address Space
• Managing memory
• OS Performance
Throughout:
• Review Computer Architecture
• Interaction with Architectural Decisions
Fast Memory Systems: DRAM specific
• Multiple RAS accesses: several names (page mode)
  – 64 Mbit DRAM: cycle time = 100 ns, page mode = 20 ns
• New DRAMs to address gap; what will they cost, will they survive?
– Synchronous DRAM: provide a clock signal to DRAM; transfers synchronous to the system clock
– RAMBUS: reinvent the DRAM interface (Intel will use it)
  » Each chip a module vs. a slice of memory
  » Short bus between CPU and chips
  » Does its own refresh
  » Variable amount of data returned
  » 1 byte / 2 ns (500 MB/s per chip)
– Cached DRAM (CDRAM): keep an entire row in SRAM
Main Memory Summary
• Big DRAM + Small SRAM = Cost Effective
  – Cray C-90 uses all SRAM (how many sold?)
• Wider Memory
• Interleaved Memory: for sequential or independent accesses
• Avoiding bank conflicts: SW & HW
• DRAM-specific optimizations: page mode & specialty DRAM, CDRAM
  – Niche memory or main memory?
    » e.g., Video RAM for frame buffers, DRAM + fast serial output
• IRAM: Do you know what it is?
Review: Reducing Miss Penalty Summary
• Five techniques
  – Read priority over write on miss
  – Subblock placement
  – Early Restart and Critical Word First on miss
  – Non-blocking Caches (Hit Under Miss)
  – Second Level Cache
• Can be applied recursively to Multilevel Caches
  – Danger is that time to DRAM will grow with multiple levels in between
Review: Improving Cache Performance
1. Reduce the miss rate,
2. Reduce the miss penalty, or
3. Reduce the time to hit in the cache
Review: Cache Optimization Summary
Technique                           MR  MP  HT  Complexity
Larger Block Size                   +   –       0
Higher Associativity                +       –   1
Victim Caches                       +           2
Pseudo-Associative Caches           +           2
HW Prefetching of Instr/Data        +           2
Compiler Controlled Prefetching     +           3
Compiler Reduce Misses              +           0
Priority to Read Misses                 +       1
Subblock Placement                      +   +   1
Early Restart & Critical Word 1st       +       2
Non-Blocking Caches                     +       3
Second Level Caches                     +       2
Small & Simple Caches               –       +   0
Avoiding Address Translation                +   2
Pipelining Writes                           +   1
System Organization

[Figure: the Processor with its Cache connects through a Core Chip Set to Main Memory and an I/O Bus; a Disk Controller (two Disks), a Graphics Controller, and a Network Interface sit on the I/O Bus and deliver interrupts to the processor.]
Computer Architecture
• Interface Between Hardware and Software
[Figure: Applications, Compiler, and Operating System software layered above Hardware: CPU, Memory, I/O, Multiprocessor, Networks. The hardware/software boundary is the architecture.]

This is IT
Memory Hierarchy 101
[Figure: hierarchy of P (processor), $ (cache), Memory, and disk.
  – Processor: very fast, <1 ns clock, multiple instructions per cycle
  – Cache: SRAM, fast, small, expensive
  – Memory: DRAM, slow, big, cheap (called physical or main memory)
  – Disk: magnetic, really slow, really big, really cheap]

=> Cost-Effective Memory System (Price/Performance)
Virtual Memory: Motivation
• Process = Address Space + thread(s) of control
• Address space = PA
  – programmer controls movement from disk
  – protection?
  – relocation?
• Linear Address space
  – larger than physical address space
    » 32, 64 bits vs. 28-bit physical (256 MB)
• Automatic management

[Figure: a virtual address space mapped onto physical memory.]
Virtual Memory
• Process = virtual address space + thread(s) of control
• Translation
  – VA -> PA
  – What physical address does virtual address A map to?
  – Is VA in physical memory?
• Protection (access control)
  – Do you have permission to access it?
Virtual Memory: Questions
• How is data found if it is in physical memory?
• Where can data be placed in physical memory? Fully Associative, Set Associative, Direct Mapped
• What data should be replaced on a miss? (Take Compsci 210 …)
Segmented Virtual Memory
• Virtual address (2^32, 2^64) to Physical Address mapping (2^30)
• Variable size, base + offset, contiguous in both VA and PA
[Figure: variable-size contiguous segments at virtual addresses 0x0000, 0x1000, 0x2000 mapped to physical addresses 0x1000, 0x6000, 0x9000, 0x11000.]
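The base + offset translation above can be sketched in a few lines. This is a minimal illustration, not any real segmentation hardware; the segment table values are invented for the example.

```python
# Hypothetical segment table: selector -> (base, limit).
# Each segment is contiguous in both VA and PA; translation is a
# bounds check followed by an add.
SEGMENTS = {
    0: (0x1000, 0x5000),   # offsets 0x0000..0x4FFF land at PA 0x1000..0x5FFF
    1: (0x9000, 0x2000),
}

def seg_translate(selector, offset):
    base, limit = SEGMENTS[selector]
    if offset >= limit:
        raise MemoryError("segmentation fault: offset beyond segment limit")
    return base + offset
```

Variable-size segments make the check against `limit` essential: unlike paging, the hardware cannot rely on a fixed power-of-two size.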
Intel Pentium Segmentation
[Figure: a Logical Address (Segment Selector + Offset) indexes a Segment Descriptor in the Global Descriptor Table (GDT); the descriptor supplies the Segment Base Address, which combines with the offset to form an address in the Physical Address Space.]
Pentium Segmentation (Continued)
• Segment Descriptors
  – Local and Global
  – base, limit, access rights
  – Can define many
• Segment Registers
  – contain segment descriptors (faster than load from memory)
  – Only 6
• Must load a segment register with a valid entry before the segment can be accessed
  – generally managed by compiler and linker, not the programmer
Paged Virtual Memory
• Virtual address (2^32, 2^64) to Physical Address mapping (2^28)
  – virtual page to physical page frame
• Fixed-size units for access control & translation
[Figure: fixed-size virtual pages at 0x0000, 0x1000, 0x2000 mapped to physical page frames at 0x1000, 0x6000, 0x9000, 0x11000; a virtual address is a virtual page number plus an offset.]
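The virtual-page-number/offset split can be sketched as follows. Page size and the page-table contents are illustrative only (4 KB pages assumed; the slides' Alpha example later uses 8 KB).

```python
PAGE_SHIFT = 12                         # assumed 4 KB pages
PAGE_SIZE = 1 << PAGE_SHIFT

# Toy linear page table: virtual page number -> physical page frame number.
page_table = {0x0: 0x1, 0x1: 0x6, 0x2: 0x9}

def page_translate(va):
    vpn = va >> PAGE_SHIFT              # high bits select the page
    offset = va & (PAGE_SIZE - 1)       # low bits pass through untranslated
    if vpn not in page_table:
        raise KeyError("page fault")    # no valid PTE for this VPN
    return (page_table[vpn] << PAGE_SHIFT) | offset
```

Because pages are a fixed power of two, the split is a shift and a mask, with no limit check per mapping; access control lives in the PTE instead.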
Page Table
• Kernel data structure (per process)
• Page Table Entry (PTE)
  – VA -> PA translations (if none, page fault)
  – access rights (Read, Write, Execute, User/Kernel, cached/uncached)
  – reference, dirty bits
• Many designs
  – Linear, Forward mapped, Inverted, Hashed, Clustered
• Design Issues
  – support for aliasing (multiple VAs to a single PA)
  – large virtual address space
  – time to obtain translation
Alpha VM Mapping (Forward Mapped)
• “64-bit” address divided into 3 segments
  – seg0 (bit 63 = 0): user code/heap
  – seg1 (bit 63 = 1, bit 62 = 1): user stack
  – kseg (bit 63 = 1, bit 62 = 0): kernel segment for OS
• Three-level page table, each level one page
  – Alpha 21064: only 43 unique bits of VA
  – (future min page size up to 64KB => 55 bits of VA)
• PTE bits: valid, kernel & user read & write enable (no reference, use, or dirty bit)
  – What do you do for replacement?

[Figure: a seg 0/1 virtual address splits into 10-bit L1, 10-bit L2, and 10-bit L3 indices plus a 13-bit page offset (PO); the page table base register plus each index selects the next level's page, yielding the physical page frame number.]
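The three-level forward-mapped walk in the figure can be sketched as below, using the 10/10/10-bit indices and 13-bit offset (8 KB pages) shown for the Alpha; the nested dicts stand in for page-table pages, so each indexed lookup corresponds to one memory reference.

```python
L_BITS = 10          # bits per level index (1024 PTEs fit in an 8 KB page)
PO_BITS = 13         # 8 KB page offset

def walk(root, va):
    """Forward-mapped walk: root is the L1 table; each level costs a memory ref."""
    po = va & ((1 << PO_BITS) - 1)
    i3 = (va >> PO_BITS) & ((1 << L_BITS) - 1)
    i2 = (va >> (PO_BITS + L_BITS)) & ((1 << L_BITS) - 1)
    i1 = (va >> (PO_BITS + 2 * L_BITS)) & ((1 << L_BITS) - 1)
    l2 = root[i1]            # memory reference 1
    l3 = l2[i2]              # memory reference 2
    pfn = l3[i3]             # memory reference 3: the PTE
    return (pfn << PO_BITS) | po
```

This is what a hardware walker (or a PALcode miss handler) does on a TLB miss; the three dependent references are exactly the miss cost the later slides worry about.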
Inverted Page Table (HP, IBM)
• One PTE per page frame
– only one VA per physical frame
• Must search for virtual address
• More difficult to support aliasing
• Force all sharing to use the same VA
[Figure: the virtual page number is hashed into a Hash Anchor Table (HAT), which points into the Inverted Page Table (IPT); IPT entries hold the VA and PA/status, chained for hash collisions, and the matching entry identifies the physical frame.]
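A minimal sketch of the HAT + IPT lookup, with invented sizes and a simple modulo hash; a real design hashes more carefully and stores status bits alongside each entry.

```python
NFRAMES = 8
hat = [None] * 4          # hash anchor table: hash bucket -> first frame index
ipt = [None] * NFRAMES    # one entry per physical frame: (vpn, next_frame)

def ipt_insert(vpn, frame):
    h = vpn % len(hat)
    ipt[frame] = (vpn, hat[h])     # chain onto any colliding entry
    hat[h] = frame

def ipt_lookup(vpn):
    f = hat[vpn % len(hat)]
    while f is not None:           # walk the collision chain
        entry_vpn, nxt = ipt[f]
        if entry_vpn == vpn:
            return f               # the frame index IS the translation
        f = nxt
    raise KeyError("page fault")
```

Note how the table size tracks physical memory (one entry per frame), and why aliasing is hard: each frame's slot can record only one VPN.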
Intel Pentium Segmentation + Paging
[Figure: the Logical Address (Segment Selector + Offset) is translated via a Segment Descriptor in the GDT into a Linear Address Space address with Dir, Table, and Offset fields; the Page Directory and Page Table then map it into the Physical Address Space.]
The Memory Management Unit (MMU)
• Input
  – virtual address
• Output
  – physical address
  – access violation (exception, interrupts the processor)
• Access Violations
  – not present
  – user vs. kernel
  – write
  – read
  – execute
Translation Lookaside Buffers (TLB)
• Need to perform address translation on every memory reference
– 30% of instructions are memory references
– 4-way superscalar processor
– at least one memory reference per cycle
• Make the Common Case Fast, others correct
• Throw HW at the problem
• Cache PTEs
Fast Translation: Translation Buffer
• Cache of translated addresses
• Alpha 21164 TLB: 48-entry, fully associative

[Figure: the virtual page number is compared against all 48 tags at once; a 48:1 mux selects the matching entry's valid/read/write bits and physical frame number, which is concatenated with the page offset.]
TLB Design
• Must be fast, must not increase the critical path
• Must achieve a high hit ratio
• Generally small and highly associative
• Mapping change
  – page removed from physical memory
  – processor must invalidate the TLB entry
• PTE is a per-process entity
  – Multiple processes with the same virtual addresses
  – Context Switches?
    » Flush TLB
    » Add ASID (PID)
      – part of processor state, must be set on context switch
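The ASID idea can be sketched with a toy fully associative TLB: entries are tagged with an address-space ID, so a context switch changes the current ASID instead of flushing. Sizes and the FIFO replacement policy are illustrative choices, not a real design.

```python
class TLB:
    """Toy fully associative TLB with ASID tags (48 entries, like the 21164)."""

    def __init__(self, size=48):
        self.size = size
        self.entries = []                    # list of (asid, vpn, pfn)

    def lookup(self, asid, vpn):
        for a, v, p in self.entries:
            if a == asid and v == vpn:       # hit only for the right process
                return p
        return None                          # TLB miss

    def fill(self, asid, vpn, pfn):
        if len(self.entries) >= self.size:
            self.entries.pop(0)              # FIFO eviction (one possible policy)
        self.entries.append((asid, vpn, pfn))
```

Two processes using the same VPN now coexist in the TLB; without the ASID tag, every context switch would require a full flush plus the resulting "compulsory" refills.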
Hardware Managed TLBs
• Hardware handles the TLB miss
• Dictates page table organization
• Complicated state machine to “walk the page table”
  – Multiple levels for forward mapped
  – Linked list for inverted
• Exception only if access violation

[Figure: on a miss, hardware control logic between the CPU's TLB and Memory walks the page table directly.]
Software Managed TLBs
• Software Handles TLB miss
• Flexible page table organization
• Simple Hardware to detect Hit or Miss
• Exception if TLB miss or access violation
• Should you check for access violation on TLB miss?
[Figure: on a miss, the CPU takes an exception and OS software refills the TLB from Memory; hardware only detects hit or miss.]
Mapping the Kernel

• Digital Unix Kseg
  – kseg (bit 63 = 1, bit 62 = 0)
• Kernel has direct access to physical memory
• One VA->PA mapping for the entire Kernel
• Lock (pin) the TLB entry
  – or special HW detection
[Figure: the virtual address space from 0 to 2^64-1 (User Code/Data, User Stack, Kernel at the top) alongside Physical Memory; the kernel region maps directly onto physical memory.]
Considerations for Address Translation
Large virtual address space
• Can map more things
  – files
  – frame buffers
  – network interfaces
  – memory from another workstation
• Sparse use of address space
• Page Table Design
  – space
  – less locality => TLB misses

OS structure
• microkernel => more TLB misses
Address Translation for Large Address Spaces
• Forward Mapped Page Table
  – grows with virtual address space
    » worst case 100% overhead not likely
  – TLB miss time: one memory reference per level
• Inverted Page Table
  – grows with physical address space
    » independent of virtual address space usage
  – TLB miss time: memory reference to HAT, then IPT list search
Hashed Page Table (HP)
• Combine Hash Table and IPT [Huck96]
– can have more entries than physical page frames
• Must search for virtual address
• Easier to support aliasing than IPT
• Space
  – grows with physical space
• TLB miss
  – one less memory reference than IPT

[Figure: the virtual page number is hashed directly into the Hashed Page Table (HPT), whose entries hold the VA and PA/status; no separate anchor table is needed.]
Clustered Page Table (SUN)
• Combine benefits of HPT and Linear [Talluri95]
• Store one base VPN (TAG) and several PPN values
– virtual page block number (VPBN)
– block offset
[Figure: the virtual address's virtual page block number (VPBN) is hashed to a chain of clustered entries; each entry stores one VPBN tag, a next pointer, and several (PA, attribute) pairs (PA0..PA3), selected by the block offset (Boff).]
Reducing TLB Miss Handling Time
• Problem
  – must walk the Page Table on a TLB miss
  – usually incur cache misses
  – big problem for IPC in microkernels
• Solution
  – build a small second-level cache in SW
  – on a TLB miss, first check the SW cache
    » use a simple shift-and-mask index into a hash table
Cache Indexing
• Tag on each block
  – No need to check index or block offset
• Increasing associativity shrinks the index, expands the tag
  – Fully Associative: no index
  – Direct-Mapped: large index

[Figure: the Block Address splits into TAG and Index fields, followed by the Block offset.]
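The tag/index/offset split follows directly from the geometry. A small sketch (parameters illustrative):

```python
def split_address(addr, block_bytes, nsets):
    """Split an address into (tag, index) given block size and set count.

    The block offset is addr % block_bytes; the remaining block address
    divides into an index (low bits, selects the set) and a tag (high bits).
    More associativity means fewer sets, so a shorter index and longer tag.
    """
    block_addr = addr // block_bytes
    return block_addr // nsets, block_addr % nsets   # (tag, index)
```

For a fully associative cache `nsets` is 1 and the index vanishes; for direct-mapped, `nsets` equals the number of blocks and the index is at its largest.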
Address Translation and Caches
• Where is the TLB with respect to the cache?
• What are the consequences?
• Most of today’s systems have more than one cache
  – Digital 21164 has 3 levels: 2 levels on chip (8KB data, 8KB inst, 96KB unified), one level off chip (2-4MB)
• Does the OS need to worry about this?

Definition: page coloring = careful selection of the va->pa mapping
TLBs and Caches
[Figure: three organizations.
  – Conventional Organization: CPU -> TLB (VA to PA) -> physically addressed cache -> memory.
  – Virtually Addressed Cache: CPU -> cache indexed and tagged with the VA; translate only on a miss; suffers the alias (synonym) problem.
  – Overlapped organization: CPU sends the VA to the cache and TLB in parallel; the cache is virtually indexed but holds PA tags, which requires the cache index to remain invariant across translation; an L2 cache operates purely on PAs.]
Virtual Caches
• Send the virtual address to the cache. Called a Virtually Addressed Cache or just Virtual Cache, vs. a Physical Cache or Real Cache
• Avoid address translation before accessing the cache
  – faster hit time to cache
• Context Switches?
  – Just like the TLB (flush or PID)
  – Cost is time to flush + “compulsory” misses from an empty cache
  – Add a process-identifier tag that identifies the process as well as the address within the process: can’t get a hit if wrong process
• I/O must interact with the cache
I/O and Virtual Caches

[Figure: the Processor with a Virtual Cache connects to Main Memory over the Memory Bus; an I/O Bridge links to the I/O Bus holding the Disk Controller (two Disks), Graphics Controller, and Network Interface, which raise interrupts; I/O traffic uses Physical Addresses.]

I/O is accomplished with physical addresses. DMA:
• flush pages from the cache
• need pa->va reverse translation
• coherent DMA
Aliases and Virtual Caches
• Aliases (sometimes called synonyms): two different virtual addresses map to the same physical address
• But, but... the virtual address is used to index the cache
• Could have data in two different locations in the cache

[Figure: two address spaces (User Code/Data, User Stack, Kernel, 0 to 2^64-1) both map the kernel region onto the same Physical Memory.]
Index with Physical Portion of Address

• If the index is in the physical part of the address, can start the tag access in parallel with translation and then compare against the physical tag
• Limits the cache to the page size: what if we want bigger caches and the same trick?
  – Higher associativity
  – Page coloring

[Figure: the address splits into Page Address | Page Offset; the Index and Block Offset fall entirely within the Page Offset, so only the Address Tag needs translation.]
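The "limits cache to page size" constraint is a quick computation: the index plus block-offset bits must fit inside the untranslated page offset. A sketch (function name and parameters are my own):

```python
def index_fits_in_page_offset(cache_bytes, block_bytes, assoc, page_bytes):
    """True if a virtually indexed, physically tagged cache of this geometry
    can be indexed entirely from page-offset bits (no coloring needed).

    Bytes covered by one "way" = sets * block size = cache size / associativity;
    that span must not exceed the page size.
    """
    nsets = cache_bytes // (block_bytes * assoc)
    return nsets * block_bytes <= page_bytes
```

This is why higher associativity helps: doubling the ways halves the sets, pulling the index back inside the page offset without shrinking the cache.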
Page Coloring for Aliases
• HW guarantee: every cache frame holds a unique physical address
• OS guarantee: the lower n bits of the virtual & physical page numbers must have the same value; if direct-mapped, then aliases map to the same cache frame
  – one form of page coloring

[Figure: the Index spans the Page Address / Page Offset boundary; coloring keeps the translated index bits identical between VA and PA.]
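The OS-side invariant is a one-line check. A sketch, with the number of color bits chosen arbitrarily for illustration:

```python
COLOR_BITS = 2    # assumed: index extends 2 bits past the page offset (4 colors)

def color_ok(vpn, ppn):
    """Page-coloring invariant: the low COLOR_BITS of the virtual and
    physical page numbers must match, so every alias of a page indexes
    the same cache frames in a direct-mapped cache."""
    mask = (1 << COLOR_BITS) - 1
    return (vpn & mask) == (ppn & mask)
```

A page allocator that only hands out frames satisfying `color_ok` guarantees aliases collide in the cache rather than silently duplicating data.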
Page Coloring to reduce misses
• Notion of bin
  – region of the cache that may contain cache blocks from a page
• Random vs. careful mapping
• Selection of the physical page frame dictates the cache index
• Overall goal is to minimize cache misses

[Figure: page frames of each color map to their corresponding bin (region) of the cache.]
Careful Page Mapping
[Kessler92, Bershad94]
• Select a page frame such that cache conflict misses are reduced
  – only choose from available pages (no VM replacement induced)
• static
  – “smart” selection of the page frame at page fault time
• dynamic
  – move pages around
A Case for Large Pages
• Page table size is inversely proportional to the page size
  – memory saved
• Fast cache hit time is easy when cache size <= page size (VA caches)
  – a bigger page makes this feasible as cache size grows
• Transferring larger pages to or from secondary storage, possibly over a network, is more efficient
• The number of TLB entries is restricted by clock cycle time
  – a larger page size maps more memory
  – reduces TLB misses
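The inverse relationship in the first bullet is simple arithmetic. A back-of-the-envelope sketch for a linear page table (32-bit VA and 4-byte PTEs assumed):

```python
def linear_table_bytes(page_size, va_bits=32, pte_bytes=4):
    """Size of a full linear page table: one PTE per virtual page.
    Doubling the page size halves the number of pages, and thus the table."""
    num_pages = 2 ** va_bits // page_size
    return num_pages * pte_bytes
```

With 4 KB pages the table is 4 MB per process; moving to 64 KB pages drops it to 256 KB, a 16x saving matching the 16x page-size increase.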
A Case for Small Pages
• Fragmentation
  – large pages can waste storage
  – data must be contiguous within a page
• Quicker process start for small processes (??)
Superpages
• Hybrid solution: multiple page sizes
  – 8KB, 16KB, 32KB, 64KB pages
  – 4KB, 64KB, 256KB, 1MB, 4MB, 16MB pages
• Need to identify candidate superpages
  – Kernel
  – Frame buffers
  – Database buffer pools
• Application/compiler hints
• Detecting superpages
  – static, at page fault time
  – dynamically create superpages
• Page Table & TLB modifications
More details on page coloring to reduce misses
Page Coloring
• Make the physical index match the virtual index
• Behaves like a virtually indexed cache
  – no conflicts for sequential pages
• Possibly many conflicts between processes
  – address spaces all have the same structure (stack, code, heap)
  – modify to XOR the PID with the address (MIPS used a variant of this)
• Simple implementation
• Pick an arbitrary page if necessary
Bin Hopping
• Allocate sequentially mapped pages (time) to sequential bins (space)
• Can exploit temporal locality
  – pages mapped close in time will be accessed close in time
• Search from the last allocated bin until a bin with an available page frame
• Separate search list per process
• Simple implementation
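The bin-hopping policy above can be sketched as a per-process circular search; class name, free-frame bookkeeping, and sizes are illustrative.

```python
class BinHopper:
    """Bin hopping: pages allocated close in time get sequential bins,
    so temporally close pages land in different cache regions."""

    def __init__(self, nbins, free_per_bin):
        self.nbins = nbins
        self.free = [free_per_bin] * nbins   # free page frames per bin
        self.last = -1                       # this process's search pointer

    def allocate(self):
        # Hop forward from the last allocated bin to the next bin
        # that still has a free page frame.
        for step in range(1, self.nbins + 1):
            b = (self.last + step) % self.nbins
            if self.free[b] > 0:
                self.free[b] -= 1
                self.last = b
                return b                     # the bin (color) chosen
        raise MemoryError("no free page frames")
```

Keeping `last` per process is what makes each address space's hot pages spread across bins even when allocations from different processes interleave.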
Best Bin
• Keep track of two counters per bin
  – used: # of pages allocated to this bin for this address space
  – free: # of available pages in the system for this bin
• Bin selection is based on low values of used and high values of free
• Low used value
  – reduces conflicts within the address space
• High free value
  – reduces conflicts between address spaces
Hierarchical
• Best bin could be linear in the # of bins
• Build a tree
  – internal nodes contain the sum of child <used, free> values
• Independent of cache size
  – simply stop at a particular level in the tree
Benefit of Static Page Coloring
• Reduces cache misses by 10% to 20%
• Multiprogramming
  – want to distribute mappings to avoid inter-address-space conflicts
Dynamic Page Coloring
• Cache Miss Lookaside (CML) buffer [Bershad94]
  – proposed hardware device
• Monitor the # of misses per page
• If # of misses >> # of cache blocks in a page
  – must be conflict misses
  – interrupt the processor
  – move the page (recolor)
• Cost of moving the page << benefit
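The CML trigger condition can be sketched as a per-page miss counter compared against a threshold. The blocks-per-page value and the ">>" factor are illustrative assumptions, not parameters from [Bershad94].

```python
BLOCKS_PER_PAGE = 64    # assumed: 4 KB page / 64 B cache blocks
FACTOR = 4              # assumed reading of ">>": 4x the block count

misses = {}             # page -> observed miss count (the CML's job in HW)

def record_miss(page):
    """Count a cache miss against a page; return True when the count far
    exceeds the blocks on the page, i.e. the misses must be conflicts and
    the page is a recoloring candidate."""
    misses[page] = misses.get(page, 0) + 1
    return misses[page] > FACTOR * BLOCKS_PER_PAGE
```

The logic behind the threshold: capacity/compulsory misses can touch each block of a page only so many times, so a count far above the block count implies the same blocks are being evicted and refetched, which is the signature of conflicts.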