Operating Systems & Memory Systems: Address Translation


Computer Science 220 / ECE 252

Professor Alvin R. Lebeck, Fall 2006


Outline

• Finish Main Memory
• Address Translation
  – basics
  – 64-bit Address Space
• Managing memory
• OS Performance

Throughout:
• Review Computer Architecture
• Interaction with Architectural Decisions


Fast Memory Systems: DRAM specific

• Multiple RAS accesses: several names (page mode)
  – 64 Mbit DRAM: cycle time = 100 ns, page mode = 20 ns

• New DRAMs to address gap; what will they cost, will they survive?

– Synchronous DRAM: Provide a clock signal to DRAM, transfer synchronous to system clock

– RAMBUS: reinvent DRAM interface (Intel will use it)
  » Each chip a module vs. slice of memory
  » Short bus between CPU and chips
  » Does own refresh
  » Variable amount of data returned
  » 1 byte / 2 ns (500 MB/s per chip)

– Cached DRAM (CDRAM): Keep entire row in SRAM


Main Memory Summary

• Big DRAM + Small SRAM = Cost Effective
  – Cray C-90 uses all SRAM (how many sold?)
• Wider Memory
• Interleaved Memory: for sequential or independent accesses
• Avoiding bank conflicts: SW & HW
• DRAM-specific optimizations: page mode & specialty DRAM, CDRAM
  – Niche memory or main memory?
    » e.g., Video RAM for frame buffers, DRAM + fast serial output
• IRAM: Do you know what it is?


Review: Reducing Miss Penalty Summary

• Five techniques
  – Read priority over write on miss
  – Subblock placement
  – Early Restart and Critical Word First on miss
  – Non-blocking Caches (Hit Under Miss)
  – Second Level Cache

• Can be applied recursively to Multilevel Caches
  – Danger is that time to DRAM will grow with multiple levels in between


Review: Improving Cache Performance

1. Reduce the miss rate,
2. Reduce the miss penalty, or
3. Reduce the time to hit in the cache
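These three targets are the terms of the standard average memory access time decomposition:

AMAT = hit time + miss rate × miss penalty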


Review: Cache Optimization Summary

Technique                          MR  MP  HT  Complexity
Larger Block Size                  +   –       0
Higher Associativity               +       –   1
Victim Caches                      +           2
Pseudo-Associative Caches          +           2
HW Prefetching of Instr/Data       +           2
Compiler Controlled Prefetching    +           3
Compiler Reduce Misses             +           0
Priority to Read Misses                +       1
Subblock Placement                     +   +   1
Early Restart & Critical Word 1st      +       2
Non-Blocking Caches                    +       3
Second Level Caches                    +       2
Small & Simple Caches              –       +   0
Avoiding Address Translation               +   2
Pipelining Writes                          +   1

(MR = miss rate, MP = miss penalty, HT = hit time; + helps, – hurts)


System Organization

[Figure: processor with cache connected through a core chip set to main memory and an I/O bus; the I/O bus hosts a disk controller (two disks), a graphics controller (graphics), and a network interface (network); devices raise interrupts to the processor]


Computer Architecture

• Interface Between Hardware and Software

[Figure: software (Applications, Compiler, Operating System) sits above the hardware interface; hardware comprises CPU, Memory, I/O, Multiprocessor, Networks. "This is IT": the architecture is that interface]


Memory Hierarchy 101

[Figure: memory hierarchy]
• Processor (P): very fast, <1 ns clock, multiple instructions per cycle
• Cache ($): SRAM, fast, small, expensive
• Memory: DRAM, slow, big, cheap (called physical or main)
• Disk: magnetic, really slow, really big, really cheap

=> Cost Effective Memory System (Price/Performance)


Virtual Memory: Motivation

• Process = Address Space + thread(s) of control

• Address space = PA
  – programmer controls movement from disk
  – protection?
  – relocation?
• Linear Address space
  – larger than physical address space
    » 32, 64 bits vs. 28-bit physical (256MB)
• Automatic management

[Figure: virtual address space pages mapped to physical page frames]


Virtual Memory

• Process = virtual address space + thread(s) of control
• Translation
  – VA -> PA
  – What physical address does virtual address A map to?
  – Is VA in physical memory?
• Protection (access control)
  – Do you have permission to access it?


Virtual Memory: Questions

• How is data found if it is in physical memory?

• Where can data be placed in physical memory? Fully Associative, Set Associative, Direct Mapped

• What data should be replaced on a miss? (Take Compsci 210 …)


Segmented Virtual Memory

• Virtual address (2^32, 2^64) to Physical Address mapping (2^30)

• Variable size, base + offset, contiguous in both VA and PA

[Figure: variable-size segments at virtual addresses 0x1000, 0x6000, 0x9000 mapped to physical addresses among 0x0000, 0x1000, 0x2000, 0x11000; each segment is contiguous in both spaces]
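A minimal sketch of the base + offset translation with a limit check; the descriptor layout and table here are hypothetical:

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical segment descriptor: base + limit, as on the slide. */
typedef struct {
    uint32_t base;   /* starting physical address of the segment */
    uint32_t limit;  /* segment length in bytes */
} segment_t;

/* Translate (segment number, offset) to a physical address.
 * Returns 0 and sets *pa on success; -1 on a limit violation. */
int seg_translate(const segment_t *table, size_t nsegs,
                  size_t seg, uint32_t offset, uint32_t *pa) {
    if (seg >= nsegs || offset >= table[seg].limit)
        return -1;                   /* protection fault */
    *pa = table[seg].base + offset;  /* contiguous in both VA and PA */
    return 0;
}
```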


Intel Pentium Segmentation

[Figure: a logical address (segment selector + offset) indexes a segment descriptor in the Global Descriptor Table (GDT); the descriptor's segment base address plus the offset yields an address in the physical address space]


Pentium Segmentation (Continued)

• Segment Descriptors
  – Local and Global
  – base, limit, access rights
  – Can define many
• Segment Registers
  – contain segment descriptors (faster than load from mem)
  – Only 6
• Must load segment register with a valid entry before segment can be accessed
  – generally managed by compiler, linker, not programmer


Paged Virtual Memory

• Virtual address (2^32, 2^64) to Physical Address mapping (2^28)
  – virtual page to physical page frame
• Fixed-size units for access control & translation

[Figure: fixed-size virtual pages at 0x1000, 0x6000, 0x9000 mapped to physical page frames at 0x0000, 0x1000, 0x2000, 0x11000; a virtual address divides into virtual page number and offset]
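A minimal sketch of the virtual-page-number/offset split, assuming 8 KB pages (13 offset bits):

```c
#include <stdint.h>

#define PAGE_SHIFT 13u                   /* 8 KB pages (assumed) */
#define PAGE_SIZE  (1u << PAGE_SHIFT)
#define PAGE_MASK  (PAGE_SIZE - 1)

uint64_t vpn(uint64_t va)    { return va >> PAGE_SHIFT; } /* virtual page number */
uint64_t offset(uint64_t va) { return va & PAGE_MASK; }   /* offset within page */

/* Physical address = frame number recombined with the untranslated offset. */
uint64_t make_pa(uint64_t pfn, uint64_t va) {
    return (pfn << PAGE_SHIFT) | offset(va);
}
```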


Page Table

• Kernel data structure (per process)
• Page Table Entry (PTE)
  – VA -> PA translations (if none, page fault)
  – access rights (Read, Write, Execute, User/Kernel, cached/uncached)
  – reference, dirty bits
• Many designs
  – Linear, Forward mapped, Inverted, Hashed, Clustered
• Design Issues
  – support for aliasing (multiple VA to single PA)
  – large virtual address space
  – time to obtain translation
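A sketch of what a PTE carrying the fields above might look like; the layout and field widths are illustrative, not any particular machine's format:

```c
#include <stdint.h>

/* Illustrative PTE: translation plus the access-control and
 * bookkeeping bits listed above. Field widths are made up. */
typedef struct {
    uint64_t pfn        : 40; /* physical page frame number */
    uint64_t valid      : 1;  /* translation present? else page fault */
    uint64_t read       : 1;
    uint64_t write      : 1;
    uint64_t execute    : 1;
    uint64_t user       : 1;  /* user vs. kernel access */
    uint64_t uncached   : 1;  /* cached/uncached */
    uint64_t referenced : 1;  /* set on access (for replacement) */
    uint64_t dirty      : 1;  /* set on write (must write back) */
} pte_t;
```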


Alpha VM Mapping (Forward Mapped)

• “64-bit” address divided into 3 segments
  – seg0 (bit 63 = 0): user code/heap
  – seg1 (bit 63 = 1, bit 62 = 1): user stack
  – kseg (bit 63 = 1, bit 62 = 0): kernel segment for OS
• Three-level page table, each level one page
  – Alpha 21064: only 43 unique bits of VA
  – (future min page size up to 64KB => 55 bits of VA)
• PTE bits: valid, kernel & user read & write enable (no reference, use, or dirty bit)
  – What do you do for replacement?

[Figure: the virtual address splits into seg0/1 bits, three 10-bit level indices (L1, L2, L3), and a 13-bit page offset (PO); the page table base plus the L1 index selects an L2 table, and so on down to the physical page frame number]
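A minimal sketch of this three-level walk, assuming the 10/10/13 split in the figure; pfn_to_table is a hypothetical helper that maps a frame number to the kernel-visible table page it holds:

```c
#include <stdint.h>

#define L_BITS     10u
#define L_MASK     ((1u << L_BITS) - 1)
#define PAGE_SHIFT 13u                 /* 8 KB pages */

/* Each level is one page of 2^10 PTEs; a PTE holds a frame number. */
typedef struct { uint64_t pfn : 40, valid : 1; } pte_t;

/* Hypothetical helper: frame number -> address of that table page. */
extern pte_t *pfn_to_table(uint64_t pfn);

/* Walk L1 -> L2 -> L3; returns 0 and sets *pa, or -1 to fault. */
int walk(pte_t *l1, uint64_t va, uint64_t *pa) {
    uint64_t i1 = (va >> (PAGE_SHIFT + 2 * L_BITS)) & L_MASK;
    uint64_t i2 = (va >> (PAGE_SHIFT + L_BITS)) & L_MASK;
    uint64_t i3 = (va >> PAGE_SHIFT) & L_MASK;

    if (!l1[i1].valid) return -1;
    pte_t *l2 = pfn_to_table(l1[i1].pfn);
    if (!l2[i2].valid) return -1;
    pte_t *l3 = pfn_to_table(l2[i2].pfn);
    if (!l3[i3].valid) return -1;
    *pa = ((uint64_t)l3[i3].pfn << PAGE_SHIFT)
        | (va & ((1u << PAGE_SHIFT) - 1));
    return 0;
}
```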


Inverted Page Table (HP, IBM)

• One PTE per page frame

– only one VA per physical frame

• Must search for virtual address

• More difficult to support aliasing

• Force all sharing to use the same VA

[Figure: the virtual page number hashes into the Hash Anchor Table (HAT), which points into the Inverted Page Table (IPT); each IPT entry holds a VA tag plus PA and status, chained for collisions]
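A minimal sketch of the lookup; the structures are hypothetical, and an entry's index doubles as its frame number since the IPT has one entry per frame:

```c
#include <stdint.h>

#define NO_ENTRY ((uint32_t)-1)

typedef struct {
    uint64_t vpn;   /* virtual page number that owns this frame */
    uint32_t next;  /* next IPT index in this hash chain, or NO_ENTRY */
} ipt_entry_t;

/* ipt[i] describes physical frame i, so a hit at index i means pfn == i. */
int ipt_lookup(const uint32_t *hat, uint32_t hat_size,
               const ipt_entry_t *ipt, uint64_t vpn, uint64_t *pfn) {
    uint32_t i = hat[vpn % hat_size];   /* hash anchor table */
    while (i != NO_ENTRY) {             /* walk the collision chain */
        if (ipt[i].vpn == vpn) { *pfn = i; return 0; }
        i = ipt[i].next;
    }
    return -1;                          /* not resident: page fault */
}
```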


Intel Pentium Segmentation + Paging

[Figure: a logical address (segment selector + offset) picks a segment descriptor from the Global Descriptor Table (GDT); the segment base address forms a linear address (dir, table, offset), which walks the page directory and page table to reach the physical address space]


The Memory Management Unit (MMU)

• Input
  – virtual address
• Output
  – physical address
  – access violation (exception, interrupts the processor)
• Access Violations
  – not present
  – user vs. kernel
  – write
  – read
  – execute


Translation Lookaside Buffers (TLB)

• Need to perform address translation on every memory reference

  – 30% of instructions are memory references
  – 4-way superscalar processor
  – at least one memory reference per cycle
• Make Common Case Fast, others correct
• Throw HW at the problem
• Cache PTEs


Fast Translation: Translation Buffer

• Cache of translated addresses
• Alpha 21164 TLB: 48 entry fully associative

[Figure: the virtual page number is compared against the tag of all 48 entries (each holding valid/read/write bits, tag, and physical frame); a 48:1 mux selects the matching physical frame, which is concatenated with the page offset]
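A minimal software model of this lookup; hardware compares all 48 tags in parallel, while the sketch below scans sequentially. The entry fields are assumptions:

```c
#include <stdint.h>

#define TLB_ENTRIES 48

typedef struct {
    uint64_t tag;                /* virtual page number */
    uint64_t pfn;                /* physical page frame */
    uint8_t  valid, read, write; /* v, r, w bits from the figure */
} tlb_entry_t;

/* Returns 0 on hit, -1 on miss (walk page table), -2 on violation. */
int tlb_lookup(const tlb_entry_t tlb[TLB_ENTRIES], uint64_t vpn,
               int is_write, uint64_t *pfn) {
    for (int i = 0; i < TLB_ENTRIES; i++) {
        if (tlb[i].valid && tlb[i].tag == vpn) {
            if (is_write ? !tlb[i].write : !tlb[i].read)
                return -2;       /* access violation */
            *pfn = tlb[i].pfn;
            return 0;            /* hit */
        }
    }
    return -1;                   /* miss */
}
```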


TLB Design

• Must be fast, not increase critical path
• Must achieve high hit ratio
• Generally small, highly associative
• Mapping change
  – page removed from physical memory
  – processor must invalidate the TLB entry
• PTE is a per-process entity
  – Multiple processes with same virtual addresses
  – Context Switches?
    » Flush TLB
    » Add ASID (PID)
      – part of processor state, must be set on context switch


Hardware Managed TLBs

• Hardware Handles TLB miss

• Dictates page table organization

• Complicated state machine to “walk page table”

– Multiple levels for forward mapped

– Linked list for inverted

• Exception only if access violation

[Figure: CPU backed by a TLB; a hardware control state machine walks the page table in memory on a miss]


Software Managed TLBs

• Software Handles TLB miss

• Flexible page table organization

• Simple Hardware to detect Hit or Miss

• Exception if TLB miss or access violation

• Should you check for access violation on TLB miss?

[Figure: CPU backed by a TLB; simple control hardware raises an exception and the OS walks the page table in memory]


Mapping the Kernel

• Digital Unix Kseg
  – kseg (bit 63 = 1, bit 62 = 0)

• Kernel has direct access to physical memory

• One VA->PA mapping for entire Kernel

• Lock (pin) TLB entry
  – or special HW detection

[Figure: the 2^64-byte virtual address space (0 to 2^64 - 1) holds user code/data, user stack, and kernel regions; the kernel region maps directly onto physical memory]


Considerations for Address Translation

Large virtual address space
• Can map more things
  – files
  – frame buffers
  – network interfaces
  – memory from another workstation
• Sparse use of address space
• Page Table Design
  – space
  – less locality => TLB misses

OS structure
• microkernel => more TLB misses


Address Translation for Large Address Spaces

• Forward Mapped Page Table
  – grows with virtual address space
    » worst case 100% overhead not likely
  – TLB miss time: memory reference for each level
• Inverted Page Table
  – grows with physical address space
    » independent of virtual address space usage
  – TLB miss time: memory reference to HAT, IPT, list search


Hashed Page Table (HP)

• Combine Hash Table and IPT [Huck96]

– can have more entries than physical page frames

• Must search for virtual address

• Easier to support aliasing than IPT

• Space
  – grows with physical space
• TLB miss
  – one less memory reference than IPT

[Figure: the virtual page number hashes directly into the Hashed Page Table (HPT) of <VA, PA, status> entries; no separate anchor table]


Clustered Page Table (SUN)

• Combine benefits of HPT and Linear [Talluri95]

• Store one base VPN (TAG) and several PPN values

– virtual page block number (VPBN)

– block offset

[Figure: the virtual address splits into virtual page block number (VPBN), block offset (Boff), and page offset; the VPBN hashes to a chain of clustered entries, each holding a VPBN tag, a next pointer, and several <PA, attribute> slots (PA0..PA3) selected by the block offset]


Reducing TLB Miss Handling Time

• Problem
  – must walk Page Table on TLB miss
  – usually incur cache misses
  – big problem for IPC in microkernels
• Solution
  – build a small second-level cache in SW
  – on TLB miss, first check SW cache
    » use simple shift and mask index to hash table
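A minimal sketch of such a software second-level cache; the size and direct-mapped layout are assumptions:

```c
#include <stdint.h>

#define STLB_BITS 10u                  /* 1024-entry SW cache (assumed) */
#define STLB_SIZE (1u << STLB_BITS)

typedef struct { uint64_t vpn; uint64_t pte; } stlb_entry_t;
static stlb_entry_t stlb[STLB_SIZE];

/* "Simple shift and mask" index: the VPN is the already-shifted VA. */
static inline uint32_t stlb_index(uint64_t vpn) {
    return (uint32_t)(vpn & (STLB_SIZE - 1));
}

/* On a TLB miss, check here before the full page-table walk. */
int stlb_lookup(uint64_t vpn, uint64_t *pte) {
    stlb_entry_t *e = &stlb[stlb_index(vpn)];
    if (e->pte != 0 && e->vpn == vpn) { *pte = e->pte; return 0; }
    return -1;   /* miss: fall through to the page-table walk */
}
```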


Cache Indexing

• Tag on each block
  – No need to check index or block offset
• Increasing associativity shrinks index, expands tag
  – Fully Associative: no index
  – Direct-Mapped: large index

[Figure: address = TAG | Index | Block offset; tag and index together form the block address]
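For example, an 8 KB direct-mapped cache with 32-byte blocks has 8 KB / 32 B = 256 sets: 5 block-offset bits, 8 index bits, and the rest tag. Making it 4-way associative shrinks the index to 6 bits (64 sets) and grows the tag by 2 bits.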


Address Translation and Caches

• Where is the TLB with respect to the cache? What are the consequences?
• Most of today's systems have more than 1 cache
  – Digital 21164 has 3 levels: 2 levels on chip (8KB data, 8KB inst, 96KB unified), one level off chip (2-4MB)

• Does the OS need to worry about this?

Definition: page coloring = careful selection of va->pa mapping


TLBs and Caches

[Figure: three organizations. (1) Conventional: CPU -> TLB -> physically addressed cache -> memory. (2) Virtually addressed cache: CPU -> cache (VA), TLB translates only on a miss; raises the alias (synonym) problem. (3) Overlapped: the cache is indexed with the VA while the TLB translates in parallel, with physical tags (or virtual tags plus a physical L2); requires the cache index to remain invariant across translation]


Virtual Caches

• Send virtual address to cache. Called Virtually Addressed Cache or just Virtual Cache, vs. Physical Cache or Real Cache
• Avoid address translation before accessing cache
  – faster hit time to cache
• Context Switches?
  – Just like the TLB (flush or pid)
  – Cost is time to flush + “compulsory” misses from empty cache
  – Add process identifier tag that identifies process as well as address within process: can't get a hit if wrong process

• I/O must interact with cache


I/O and Virtual Caches

[Figure: processor with a virtual cache on the memory bus; an I/O bridge connects the memory bus to an I/O bus hosting a disk controller (disks), graphics controller, and network interface; devices raise interrupts and use physical addresses]

• I/O is accomplished with physical addresses
• DMA
  – flush pages from cache
  – need pa -> va reverse translation
  – coherent DMA


Aliases and Virtual Caches

• Aliases (sometimes called synonyms): two different virtual addresses map to the same physical address

• But, but... the virtual address is used to index the cache

• Could have data in two different locations in the cache

[Figure: a user region and the kernel region of the 2^64-byte virtual address space both map to the same physical memory page]


Index with Physical Portion of Address

• If index is physical part of address, can start tag access in parallel with translation so that we can compare to the physical tag
• Limits cache to page size: what if we want bigger caches and use the same trick?
  – Higher associativity
  – Page coloring

[Figure: page address | page offset aligned against address tag | index | block offset; the index and block offset fall entirely within the untranslated page offset]
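Worked example: with 8 KB pages, a direct-mapped cache indexed this way is limited to 8 KB; 4-way associativity stretches the limit to 4 × 8 KB = 32 KB, since only the index (not the way selection) must come from untranslated bits.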


Page Coloring for Aliases

• HW that guarantees that every cache frame holds unique physical address

• OS guarantee: lower n bits of virtual & physical page numbers must have same value; if direct-mapped, then aliases map to same cache frame

– one form of page coloring

[Figure: here the index extends past the page offset into the page address bits, so part of the index comes from translated bits]
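A minimal sketch of the OS-side guarantee, assuming n = 4 color bits:

```c
#include <stdint.h>
#include <stdbool.h>

#define COLOR_BITS 4u                      /* n low page-number bits (assumed) */
#define COLOR_MASK ((1u << COLOR_BITS) - 1)

/* The OS guarantee above: a frame may back a virtual page only if the
 * low n bits of the virtual and physical page numbers agree. */
bool frame_ok(uint64_t vpn, uint64_t pfn) {
    return (vpn & COLOR_MASK) == (pfn & COLOR_MASK);
}
```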


Page Coloring to Reduce Misses

• Notion of bin
  – region of cache that may contain cache blocks from a page
• Random vs. careful mapping
• Selection of physical page frame dictates cache index
• Overall goal is to minimize cache misses

[Figure: page frames map to bins of the cache according to their color]


Careful Page Mapping

[Kessler92, Bershad94]
• Select a page frame such that cache conflict misses are reduced
  – only choose from available pages (no VM replacement induced)
• Static
  – “smart” selection of page frame at page fault time
• Dynamic
  – move pages around


A Case for Large Pages

• Page table size is inversely proportional to the page size

– memory saved

• Fast cache hit time is easy when cache <= page size (VA caches)
  – bigger page makes it feasible as cache size grows

• Transferring larger pages to or from secondary storage, possibly over a network, is more efficient

• Number of TLB entries is restricted by clock cycle time
  – larger page size maps more memory
  – reduces TLB misses
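Worked example: with 4-byte PTEs, a 32-bit address space needs 2^20 PTEs (4 MB of linear page table) at 4 KB pages, but only 2^16 PTEs (256 KB) at 64 KB pages.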


A Case for Small Pages

• Fragmentation
  – large pages can waste storage
  – data must be contiguous within page
• Quicker process start for small processes (??)


Superpages

• Hybrid solution: multiple page sizes
  – 8KB, 16KB, 32KB, 64KB pages
  – 4KB, 64KB, 256KB, 1MB, 4MB, 16MB pages
• Need to identify candidate superpages
  – Kernel
  – Frame buffers
  – Database buffer pools
• Application/compiler hints
• Detecting superpages
  – static, at page fault time
  – dynamically create superpages

• Page Table & TLB modifications


More details on page coloring to reduce misses


Page Coloring

• Make physical index match virtual index
• Behaves like virtual index cache
  – no conflicts for sequential pages
• Possibly many conflicts between processes
  – address spaces all have same structure (stack, code, heap)
  – modify to xor PID with address (MIPS used variant of this)
• Simple implementation
• Pick arbitrary page if necessary
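A minimal sketch of both policies; NBINS (cache size / page size) is an assumption:

```c
#include <stdint.h>

#define NBINS 16u   /* cache size / page size (assumed) */

/* Physical index matches virtual index: pick the bin from the VPN. */
uint32_t bin_of(uint64_t vpn) { return (uint32_t)(vpn % NBINS); }

/* Variant mentioned above: xor the PID in, so identically structured
 * address spaces (stack, code, heap) land in different bins per process. */
uint32_t bin_of_pid(uint64_t vpn, uint32_t pid) {
    return (uint32_t)((vpn ^ pid) % NBINS);
}
```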


Bin Hopping

• Allocate sequentially mapped pages (time) to sequential bins (space)

• Can exploit temporal locality
  – pages mapped close in time will be accessed close in time
• Search from last allocated bin until bin with available page frame
• Separate search list per process
• Simple implementation
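A minimal sketch of the allocator, assuming hypothetical free_count/pop_frame hooks into the frame allocator:

```c
#include <stdint.h>

#define NBINS 16u   /* cache size / page size (assumed) */

/* Hypothetical allocator hooks: free frames of each color, and a
 * routine that takes one frame of the given color. */
extern unsigned free_count[NBINS];
extern uint64_t pop_frame(unsigned bin);

/* Per-process cursor: start just past the last allocation, so pages
 * mapped close in time land in sequential bins. */
int alloc_frame(unsigned *last_bin, uint64_t *pfn) {
    for (unsigned i = 0; i < NBINS; i++) {
        unsigned b = (*last_bin + 1 + i) % NBINS;
        if (free_count[b]) { *last_bin = b; *pfn = pop_frame(b); return 0; }
    }
    return -1;   /* no free frames */
}
```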


Best Bin

• Keep track of two counters per bin
  – used: # of pages allocated to this bin for this address space
  – free: # of available pages in the system for this bin
• Bin selection is based on low values of used and high values of free
• Low used value
  – reduce conflicts within the address space
• High free value
  – reduce conflicts between address spaces


Hierarchical

• Best bin could be linear in # of bins
• Build a tree
  – internal nodes contain sum of child <used, free> values
• Independent of cache size
  – simply stop at a particular level in the tree


Benefit of Static Page Coloring

• Reduces cache misses by 10% to 20%
• Multiprogramming

– want to distribute mapping to avoid inter-address space conflicts


Dynamic Page Coloring

• Cache Miss Lookaside (CML) buffer [Bershad94]
  – proposed hardware device
• Monitor # of misses per page
• If # of misses >> # of cache blocks in page
  – must be conflict misses
  – interrupt processor
  – move a page (recolor)

• Cost of moving page << benefit