Operating Systems & Memory Systems: Address Translation


Computer Science 220 / ECE 252

Professor Alvin R. Lebeck, Fall 2006


Outline

• Finish Main Memory
• Address Translation
  – basics
  – 64-bit Address Space
• Managing memory
• OS Performance

Throughout:
• Review Computer Architecture
• Interaction with Architectural Decisions


Fast Memory Systems: DRAM specific

• Multiple RAS accesses: several names (page mode)
  – 64 Mbit DRAM: cycle time = 100 ns, page mode = 20 ns

• New DRAMs to address gap; what will they cost, will they survive?

– Synchronous DRAM: Provide a clock signal to DRAM, transfer synchronous to system clock

– RAMBUS: reinvent DRAM interface (Intel will use it)
  » Each chip a module vs. slice of memory
  » Short bus between CPU and chips
  » Does own refresh
  » Variable amount of data returned
  » 1 byte / 2 ns (500 MB/s per chip)

– Cached DRAM (CDRAM): Keep entire row in SRAM


Main Memory Summary

• Big DRAM + Small SRAM = Cost Effective
  – Cray C-90 uses all SRAM (how many sold?)
• Wider Memory
• Interleaved Memory: for sequential or independent accesses
• Avoiding bank conflicts: SW & HW
• DRAM-specific optimizations: page mode & specialty DRAM, CDRAM
  – Niche memory or main memory?
    » e.g., Video RAM for frame buffers, DRAM + fast serial output
• IRAM: Do you know what it is?


Review: Reducing Miss Penalty Summary

• Five techniques
  – Read priority over write on miss
  – Subblock placement
  – Early Restart and Critical Word First on miss
  – Non-blocking Caches (Hit Under Miss)
  – Second Level Cache

• Can be applied recursively to Multilevel Caches
  – Danger is that time to DRAM will grow with multiple levels in between


Review: Improving Cache Performance

1. Reduce the miss rate,
2. Reduce the miss penalty, or
3. Reduce the time to hit in the cache
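These three targets are the terms of the standard average memory access time decomposition:

AMAT = hit time + miss rate × miss penalty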


Review: Cache Optimization Summary

Technique                          MR  MP  HT  Complexity
Larger Block Size                  +   –       0
Higher Associativity               +       –   1
Victim Caches                      +           2
Pseudo-Associative Caches          +           2
HW Prefetching of Instr/Data       +           2
Compiler Controlled Prefetching    +           3
Compiler Reduce Misses             +           0
Priority to Read Misses                +       1
Subblock Placement                     +   +   1
Early Restart & Critical Word 1st      +       2
Non-Blocking Caches                    +       3
Second Level Caches                    +       2
Small & Simple Caches              –       +   0
Avoiding Address Translation               +   2
Pipelining Writes                          +   1

(MR = miss rate, MP = miss penalty, HT = hit time; + helps, – hurts)


System Organization

[Figure: processor with cache connected through a core chip set to main memory and an I/O bus; the I/O bus hosts a disk controller (two disks), a graphics controller (graphics), and a network interface (network); devices raise interrupts to the processor]


Computer Architecture

• Interface Between Hardware and Software

[Figure: software (Applications, Compiler, Operating System) sits above the hardware interface; hardware comprises CPU, Memory, I/O, Multiprocessor, Networks. "This is IT": the architecture is that interface]


Memory Hierarchy 101

[Figure: memory hierarchy]
• Processor (P): very fast, <1 ns clock, multiple instructions per cycle
• Cache ($): SRAM, fast, small, expensive
• Memory: DRAM, slow, big, cheap (called physical or main)
• Disk: magnetic, really slow, really big, really cheap

=> Cost Effective Memory System (Price/Performance)


Virtual Memory: Motivation

• Process = Address Space + thread(s) of control

• Address space = PA
  – programmer controls movement from disk
  – protection?
  – relocation?
• Linear Address space
  – larger than physical address space
    » 32, 64 bits vs. 28-bit physical (256MB)
• Automatic management

[Figure: virtual address space pages mapped to physical page frames]


Virtual Memory

• Process = virtual address space + thread(s) of control
• Translation
  – VA -> PA
  – What physical address does virtual address A map to?
  – Is VA in physical memory?
• Protection (access control)
  – Do you have permission to access it?


Virtual Memory: Questions

• How is data found if it is in physical memory?

• Where can data be placed in physical memory? Fully Associative, Set Associative, Direct Mapped

• What data should be replaced on a miss? (Take Compsci 210 …)


Segmented Virtual Memory

• Virtual address (2^32, 2^64) to Physical Address mapping (2^30)

• Variable size, base + offset, contiguous in both VA and PA

[Figure: variable-size segments at virtual addresses 0x1000, 0x6000, 0x9000 mapped to physical addresses among 0x0000, 0x1000, 0x2000, 0x11000; each segment is contiguous in both spaces]
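A minimal sketch of the base + offset translation with a limit check; the descriptor layout and table here are hypothetical:

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical segment descriptor: base + limit, as on the slide. */
typedef struct {
    uint32_t base;   /* starting physical address of the segment */
    uint32_t limit;  /* segment length in bytes */
} segment_t;

/* Translate (segment number, offset) to a physical address.
 * Returns 0 and sets *pa on success; -1 on a limit violation. */
int seg_translate(const segment_t *table, size_t nsegs,
                  size_t seg, uint32_t offset, uint32_t *pa) {
    if (seg >= nsegs || offset >= table[seg].limit)
        return -1;                   /* protection fault */
    *pa = table[seg].base + offset;  /* contiguous in both VA and PA */
    return 0;
}
```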


Intel Pentium Segmentation

[Figure: a logical address (segment selector + offset) indexes a segment descriptor in the Global Descriptor Table (GDT); the descriptor's segment base address plus the offset yields an address in the physical address space]


Pentium Segmentation (Continued)

• Segment Descriptors
  – Local and Global
  – base, limit, access rights
  – Can define many
• Segment Registers
  – contain segment descriptors (faster than load from mem)
  – Only 6
• Must load segment register with a valid entry before segment can be accessed
  – generally managed by compiler, linker, not programmer


Paged Virtual Memory

• Virtual address (2^32, 2^64) to Physical Address mapping (2^28)
  – virtual page to physical page frame
• Fixed-size units for access control & translation

[Figure: fixed-size virtual pages at 0x1000, 0x6000, 0x9000 mapped to physical page frames at 0x0000, 0x1000, 0x2000, 0x11000; a virtual address divides into virtual page number and offset]
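A minimal sketch of the virtual-page-number/offset split, assuming 8 KB pages (13 offset bits):

```c
#include <stdint.h>

#define PAGE_SHIFT 13u                   /* 8 KB pages (assumed) */
#define PAGE_SIZE  (1u << PAGE_SHIFT)
#define PAGE_MASK  (PAGE_SIZE - 1)

uint64_t vpn(uint64_t va)    { return va >> PAGE_SHIFT; } /* virtual page number */
uint64_t offset(uint64_t va) { return va & PAGE_MASK; }   /* offset within page */

/* Physical address = frame number recombined with the untranslated offset. */
uint64_t make_pa(uint64_t pfn, uint64_t va) {
    return (pfn << PAGE_SHIFT) | offset(va);
}
```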


Page Table

• Kernel data structure (per process)
• Page Table Entry (PTE)
  – VA -> PA translations (if none, page fault)
  – access rights (Read, Write, Execute, User/Kernel, cached/uncached)
  – reference, dirty bits
• Many designs
  – Linear, Forward mapped, Inverted, Hashed, Clustered
• Design Issues
  – support for aliasing (multiple VA to single PA)
  – large virtual address space
  – time to obtain translation
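A sketch of what a PTE carrying the fields above might look like; the layout and field widths are illustrative, not any particular machine's format:

```c
#include <stdint.h>

/* Illustrative PTE: translation plus the access-control and
 * bookkeeping bits listed above. Field widths are made up. */
typedef struct {
    uint64_t pfn        : 40; /* physical page frame number */
    uint64_t valid      : 1;  /* translation present? else page fault */
    uint64_t read       : 1;
    uint64_t write      : 1;
    uint64_t execute    : 1;
    uint64_t user       : 1;  /* user vs. kernel access */
    uint64_t uncached   : 1;  /* cached/uncached */
    uint64_t referenced : 1;  /* set on access (for replacement) */
    uint64_t dirty      : 1;  /* set on write (must write back) */
} pte_t;
```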


Alpha VM Mapping (Forward Mapped)

• “64-bit” address divided into 3 segments
  – seg0 (bit 63 = 0): user code/heap
  – seg1 (bit 63 = 1, bit 62 = 1): user stack
  – kseg (bit 63 = 1, bit 62 = 0): kernel segment for OS
• Three-level page table, each level one page
  – Alpha 21064: only 43 unique bits of VA
  – (future min page size up to 64KB => 55 bits of VA)
• PTE bits: valid, kernel & user read & write enable (no reference, use, or dirty bit)
  – What do you do for replacement?

[Figure: the virtual address splits into seg0/1 bits, three 10-bit level indices (L1, L2, L3), and a 13-bit page offset (PO); the page table base plus the L1 index selects an L2 table, and so on down to the physical page frame number]
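A minimal sketch of this three-level walk, assuming the 10/10/13 split in the figure; pfn_to_table is a hypothetical helper that maps a frame number to the kernel-visible table page it holds:

```c
#include <stdint.h>

#define L_BITS     10u
#define L_MASK     ((1u << L_BITS) - 1)
#define PAGE_SHIFT 13u                 /* 8 KB pages */

/* Each level is one page of 2^10 PTEs; a PTE holds a frame number. */
typedef struct { uint64_t pfn : 40, valid : 1; } pte_t;

/* Hypothetical helper: frame number -> address of that table page. */
extern pte_t *pfn_to_table(uint64_t pfn);

/* Walk L1 -> L2 -> L3; returns 0 and sets *pa, or -1 to fault. */
int walk(pte_t *l1, uint64_t va, uint64_t *pa) {
    uint64_t i1 = (va >> (PAGE_SHIFT + 2 * L_BITS)) & L_MASK;
    uint64_t i2 = (va >> (PAGE_SHIFT + L_BITS)) & L_MASK;
    uint64_t i3 = (va >> PAGE_SHIFT) & L_MASK;

    if (!l1[i1].valid) return -1;
    pte_t *l2 = pfn_to_table(l1[i1].pfn);
    if (!l2[i2].valid) return -1;
    pte_t *l3 = pfn_to_table(l2[i2].pfn);
    if (!l3[i3].valid) return -1;
    *pa = ((uint64_t)l3[i3].pfn << PAGE_SHIFT)
        | (va & ((1u << PAGE_SHIFT) - 1));
    return 0;
}
```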


Inverted Page Table (HP, IBM)

• One PTE per page frame

– only one VA per physical frame

• Must search for virtual address

• More difficult to support aliasing

• Force all sharing to use the same VA

[Figure: the virtual page number hashes into the Hash Anchor Table (HAT), which points into the Inverted Page Table (IPT); each IPT entry holds a VA tag plus PA and status, chained for collisions]
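A minimal sketch of the lookup; the structures are hypothetical, and an entry's index doubles as its frame number since the IPT has one entry per frame:

```c
#include <stdint.h>

#define NO_ENTRY ((uint32_t)-1)

typedef struct {
    uint64_t vpn;   /* virtual page number that owns this frame */
    uint32_t next;  /* next IPT index in this hash chain, or NO_ENTRY */
} ipt_entry_t;

/* ipt[i] describes physical frame i, so a hit at index i means pfn == i. */
int ipt_lookup(const uint32_t *hat, uint32_t hat_size,
               const ipt_entry_t *ipt, uint64_t vpn, uint64_t *pfn) {
    uint32_t i = hat[vpn % hat_size];   /* hash anchor table */
    while (i != NO_ENTRY) {             /* walk the collision chain */
        if (ipt[i].vpn == vpn) { *pfn = i; return 0; }
        i = ipt[i].next;
    }
    return -1;                          /* not resident: page fault */
}
```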


Intel Pentium Segmentation + Paging

[Figure: a logical address (segment selector + offset) picks a segment descriptor from the Global Descriptor Table (GDT); the segment base address forms a linear address (dir, table, offset), which walks the page directory and page table to reach the physical address space]


The Memory Management Unit (MMU)

• Input
  – virtual address
• Output
  – physical address
  – access violation (exception, interrupts the processor)
• Access Violations
  – not present
  – user vs. kernel
  – write
  – read
  – execute


Translation Lookaside Buffers (TLB)

• Need to perform address translation on every memory reference

  – 30% of instructions are memory references
  – 4-way superscalar processor
  – at least one memory reference per cycle
• Make Common Case Fast, others correct
• Throw HW at the problem
• Cache PTEs


Fast Translation: Translation Buffer

• Cache of translated addresses
• Alpha 21164 TLB: 48 entry fully associative

[Figure: the virtual page number is compared against the tag of all 48 entries (each holding valid/read/write bits, tag, and physical frame); a 48:1 mux selects the matching physical frame, which is concatenated with the page offset]
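A minimal software model of this lookup; hardware compares all 48 tags in parallel, while the sketch below scans sequentially. The entry fields are assumptions:

```c
#include <stdint.h>

#define TLB_ENTRIES 48

typedef struct {
    uint64_t tag;                /* virtual page number */
    uint64_t pfn;                /* physical page frame */
    uint8_t  valid, read, write; /* v, r, w bits from the figure */
} tlb_entry_t;

/* Returns 0 on hit, -1 on miss (walk page table), -2 on violation. */
int tlb_lookup(const tlb_entry_t tlb[TLB_ENTRIES], uint64_t vpn,
               int is_write, uint64_t *pfn) {
    for (int i = 0; i < TLB_ENTRIES; i++) {
        if (tlb[i].valid && tlb[i].tag == vpn) {
            if (is_write ? !tlb[i].write : !tlb[i].read)
                return -2;       /* access violation */
            *pfn = tlb[i].pfn;
            return 0;            /* hit */
        }
    }
    return -1;                   /* miss */
}
```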


TLB Design

• Must be fast, not increase critical path
• Must achieve high hit ratio
• Generally small, highly associative
• Mapping change
  – page removed from physical memory
  – processor must invalidate the TLB entry
• PTE is a per-process entity
  – Multiple processes with same virtual addresses
  – Context Switches?
    » Flush TLB
    » Add ASID (PID)
      – part of processor state, must be set on context switch


Hardware Managed TLBs

• Hardware Handles TLB miss

• Dictates page table organization

• Complicated state machine to “walk page table”

– Multiple levels for forward mapped

– Linked list for inverted

• Exception only if access violation

[Figure: CPU backed by a TLB; a hardware control state machine walks the page table in memory on a miss]


Software Managed TLBs

• Software Handles TLB miss

• Flexible page table organization

• Simple Hardware to detect Hit or Miss

• Exception if TLB miss or access violation

• Should you check for access violation on TLB miss?

[Figure: CPU backed by a TLB; simple control hardware raises an exception and the OS walks the page table in memory]


Mapping the Kernel

• Digital Unix Kseg
  – kseg (bit 63 = 1, bit 62 = 0)

• Kernel has direct access to physical memory

• One VA->PA mapping for entire Kernel

• Lock (pin) TLB entry
  – or special HW detection

[Figure: the 2^64-byte virtual address space (0 to 2^64 - 1) holds user code/data, user stack, and kernel regions; the kernel region maps directly onto physical memory]


Considerations for Address Translation

Large virtual address space
• Can map more things
  – files
  – frame buffers
  – network interfaces
  – memory from another workstation
• Sparse use of address space
• Page Table Design
  – space
  – less locality => TLB misses

OS structure
• microkernel => more TLB misses


Address Translation for Large Address Spaces

• Forward Mapped Page Table
  – grows with virtual address space
    » worst case 100% overhead not likely
  – TLB miss time: memory reference for each level
• Inverted Page Table
  – grows with physical address space
    » independent of virtual address space usage
  – TLB miss time: memory reference to HAT, IPT, list search


Hashed Page Table (HP)

• Combine Hash Table and IPT [Huck96]

– can have more entries than physical page frames

• Must search for virtual address

• Easier to support aliasing than IPT

• Space
  – grows with physical space
• TLB miss
  – one less memory reference than IPT

[Figure: the virtual page number hashes directly into the Hashed Page Table (HPT) of <VA, PA, status> entries; no separate anchor table]


Clustered Page Table (SUN)

• Combine benefits of HPT and Linear [Talluri95]

• Store one base VPN (TAG) and several PPN values

– virtual page block number (VPBN)

– block offset

[Figure: the virtual address splits into virtual page block number (VPBN), block offset (Boff), and page offset; the VPBN hashes to a chain of clustered entries, each holding a VPBN tag, a next pointer, and several <PA, attribute> slots (PA0..PA3) selected by the block offset]


Reducing TLB Miss Handling Time

• Problem
  – must walk Page Table on TLB miss
  – usually incur cache misses
  – big problem for IPC in microkernels
• Solution
  – build a small second-level cache in SW
  – on TLB miss, first check SW cache
    » use simple shift and mask index to hash table
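A minimal sketch of such a software second-level cache; the size and direct-mapped layout are assumptions:

```c
#include <stdint.h>

#define STLB_BITS 10u                  /* 1024-entry SW cache (assumed) */
#define STLB_SIZE (1u << STLB_BITS)

typedef struct { uint64_t vpn; uint64_t pte; } stlb_entry_t;
static stlb_entry_t stlb[STLB_SIZE];

/* "Simple shift and mask" index: the VPN is the already-shifted VA. */
static inline uint32_t stlb_index(uint64_t vpn) {
    return (uint32_t)(vpn & (STLB_SIZE - 1));
}

/* On a TLB miss, check here before the full page-table walk. */
int stlb_lookup(uint64_t vpn, uint64_t *pte) {
    stlb_entry_t *e = &stlb[stlb_index(vpn)];
    if (e->pte != 0 && e->vpn == vpn) { *pte = e->pte; return 0; }
    return -1;   /* miss: fall through to the page-table walk */
}
```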


Cache Indexing

• Tag on each block
  – No need to check index or block offset
• Increasing associativity shrinks index, expands tag
  – Fully Associative: no index
  – Direct-Mapped: large index

[Figure: address = TAG | Index | Block offset; tag and index together form the block address]
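For example, an 8 KB direct-mapped cache with 32-byte blocks has 8 KB / 32 B = 256 sets: 5 block-offset bits, 8 index bits, and the rest tag. Making it 4-way associative shrinks the index to 6 bits (64 sets) and grows the tag by 2 bits.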


Address Translation and Caches

• Where is the TLB with respect to the cache? What are the consequences?
• Most of today's systems have more than 1 cache
  – Digital 21164 has 3 levels: 2 levels on chip (8KB data, 8KB inst, 96KB unified), one level off chip (2-4MB)

• Does the OS need to worry about this?

Definition: page coloring = careful selection of va->pa mapping


TLBs and Caches

[Figure: three organizations. (1) Conventional: CPU -> TLB -> physically addressed cache -> memory. (2) Virtually addressed cache: CPU -> cache (VA), TLB translates only on a miss; raises the alias (synonym) problem. (3) Overlapped: the cache is indexed with the VA while the TLB translates in parallel, with physical tags (or virtual tags plus a physical L2); requires the cache index to remain invariant across translation]


Virtual Caches

• Send virtual address to cache. Called Virtually Addressed Cache or just Virtual Cache, vs. Physical Cache or Real Cache
• Avoid address translation before accessing cache
  – faster hit time to cache
• Context Switches?
  – Just like the TLB (flush or pid)
  – Cost is time to flush + “compulsory” misses from empty cache
  – Add process identifier tag that identifies process as well as address within process: can't get a hit if wrong process

• I/O must interact with cache


I/O and Virtual Caches

[Figure: processor with a virtual cache on the memory bus; an I/O bridge connects the memory bus to an I/O bus hosting a disk controller (disks), graphics controller, and network interface; devices raise interrupts and use physical addresses]

• I/O is accomplished with physical addresses
• DMA
  – flush pages from cache
  – need pa -> va reverse translation
  – coherent DMA


Aliases and Virtual Caches

• Aliases (sometimes called synonyms): two different virtual addresses map to the same physical address

• But, but... the virtual address is used to index the cache

• Could have data in two different locations in the cache

[Figure: a user region and the kernel region of the 2^64-byte virtual address space both map to the same physical memory page]


Index with Physical Portion of Address

• If index is physical part of address, can start tag access in parallel with translation so that we can compare to the physical tag
• Limits cache to page size: what if we want bigger caches and use the same trick?
  – Higher associativity
  – Page coloring

[Figure: page address | page offset aligned against address tag | index | block offset; the index and block offset fall entirely within the untranslated page offset]
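Worked example: with 8 KB pages, a direct-mapped cache indexed this way is limited to 8 KB; 4-way associativity stretches the limit to 4 × 8 KB = 32 KB, since only the index (not the way selection) must come from untranslated bits.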


Page Coloring for Aliases

• HW that guarantees that every cache frame holds unique physical address

• OS guarantee: lower n bits of virtual & physical page numbers must have same value; if direct-mapped, then aliases map to same cache frame

– one form of page coloring

[Figure: here the index extends past the page offset into the page address bits, so part of the index comes from translated bits]
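A minimal sketch of the OS-side guarantee, assuming n = 4 color bits:

```c
#include <stdint.h>
#include <stdbool.h>

#define COLOR_BITS 4u                      /* n low page-number bits (assumed) */
#define COLOR_MASK ((1u << COLOR_BITS) - 1)

/* The OS guarantee above: a frame may back a virtual page only if the
 * low n bits of the virtual and physical page numbers agree. */
bool frame_ok(uint64_t vpn, uint64_t pfn) {
    return (vpn & COLOR_MASK) == (pfn & COLOR_MASK);
}
```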


Page Coloring to Reduce Misses

• Notion of bin
  – region of cache that may contain cache blocks from a page
• Random vs. careful mapping
• Selection of physical page frame dictates cache index
• Overall goal is to minimize cache misses

[Figure: page frames map to bins of the cache according to their color]


Careful Page Mapping

[Kessler92, Bershad94]
• Select a page frame such that cache conflict misses are reduced
  – only choose from available pages (no VM replacement induced)
• Static
  – “smart” selection of page frame at page fault time
• Dynamic
  – move pages around


A Case for Large Pages

• Page table size is inversely proportional to the page size

– memory saved

• Fast cache hit time is easy when cache <= page size (VA caches)
  – bigger page makes it feasible as cache size grows

• Transferring larger pages to or from secondary storage, possibly over a network, is more efficient

• Number of TLB entries is restricted by clock cycle time
  – larger page size maps more memory
  – reduces TLB misses
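Worked example: with 4-byte PTEs, a 32-bit address space needs 2^20 PTEs (4 MB of linear page table) at 4 KB pages, but only 2^16 PTEs (256 KB) at 64 KB pages.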


A Case for Small Pages

• Fragmentation
  – large pages can waste storage
  – data must be contiguous within page
• Quicker process start for small processes (??)


Superpages

• Hybrid solution: multiple page sizes
  – 8KB, 16KB, 32KB, 64KB pages
  – 4KB, 64KB, 256KB, 1MB, 4MB, 16MB pages
• Need to identify candidate superpages
  – Kernel
  – Frame buffers
  – Database buffer pools
• Application/compiler hints
• Detecting superpages
  – static, at page fault time
  – dynamically create superpages

• Page Table & TLB modifications


More details on page coloring to reduce misses


Page Coloring

• Make physical index match virtual index
• Behaves like virtual index cache
  – no conflicts for sequential pages
• Possibly many conflicts between processes
  – address spaces all have same structure (stack, code, heap)
  – modify to xor PID with address (MIPS used variant of this)
• Simple implementation
• Pick arbitrary page if necessary
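A minimal sketch of both policies; NBINS (cache size / page size) is an assumption:

```c
#include <stdint.h>

#define NBINS 16u   /* cache size / page size (assumed) */

/* Physical index matches virtual index: pick the bin from the VPN. */
uint32_t bin_of(uint64_t vpn) { return (uint32_t)(vpn % NBINS); }

/* Variant mentioned above: xor the PID in, so identically structured
 * address spaces (stack, code, heap) land in different bins per process. */
uint32_t bin_of_pid(uint64_t vpn, uint32_t pid) {
    return (uint32_t)((vpn ^ pid) % NBINS);
}
```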


Bin Hopping

• Allocate sequentially mapped pages (time) to sequential bins (space)

• Can exploit temporal locality
  – pages mapped close in time will be accessed close in time
• Search from last allocated bin until bin with available page frame
• Separate search list per process
• Simple implementation
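A minimal sketch of the allocator, assuming hypothetical free_count/pop_frame hooks into the frame allocator:

```c
#include <stdint.h>

#define NBINS 16u   /* cache size / page size (assumed) */

/* Hypothetical allocator hooks: free frames of each color, and a
 * routine that takes one frame of the given color. */
extern unsigned free_count[NBINS];
extern uint64_t pop_frame(unsigned bin);

/* Per-process cursor: start just past the last allocation, so pages
 * mapped close in time land in sequential bins. */
int alloc_frame(unsigned *last_bin, uint64_t *pfn) {
    for (unsigned i = 0; i < NBINS; i++) {
        unsigned b = (*last_bin + 1 + i) % NBINS;
        if (free_count[b]) { *last_bin = b; *pfn = pop_frame(b); return 0; }
    }
    return -1;   /* no free frames */
}
```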


Best Bin

• Keep track of two counters per bin
  – used: # of pages allocated to this bin for this address space
  – free: # of available pages in the system for this bin
• Bin selection is based on low values of used and high values of free
• Low used value
  – reduce conflicts within the address space
• High free value
  – reduce conflicts between address spaces


Hierarchical

• Best bin could be linear in # of bins
• Build a tree
  – internal nodes contain sum of child <used, free> values
• Independent of cache size
  – simply stop at a particular level in the tree


Benefit of Static Page Coloring

• Reduces cache misses by 10% to 20%
• Multiprogramming

– want to distribute mapping to avoid inter-address space conflicts


Dynamic Page Coloring

• Cache Miss Lookaside (CML) buffer [Bershad94]
  – proposed hardware device
• Monitor # of misses per page
• If # of misses >> # of cache blocks in page
  – must be conflict misses
  – interrupt processor
  – move a page (recolor)

• Cost of moving page << benefit