Nesting Paging in VM Replay for MPs
![Page 1: Nesting Paging in VM Replay for MPs](https://reader036.fdocuments.us/reader036/viewer/2022062305/56815d33550346895dcb2f36/html5/thumbnails/1.jpg)
CS530 Operating System
Nesting Paging in VM Replay for MPs
Jaehyuk Huh, Computer Science, KAIST
![Page 2: Nesting Paging in VM Replay for MPs](https://reader036.fdocuments.us/reader036/viewer/2022062305/56815d33550346895dcb2f36/html5/thumbnails/2.jpg)
Address Translation in VM
• Need to translate guest VA (gVA) to machine address
  – gVA (guest VA) → gPA (guest PA) → sPA (system PA)
• Paravirtualization
  – Guest page table (managed by guest OS) directly maps gVA to sPA
  – Hypervisor validates guest page table
• Full virtualization
  – SW technique: shadow paging
  – HW-assisted technique: nested paging
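
The two-step translation above can be sketched as follows. This is a minimal illustration, assuming flat one-level page maps instead of real multi-level page tables; the names `translate`, `guest_pt`, and `host_map` and the page numbers are hypothetical.

```python
PAGE_SIZE = 4096  # 4 KB pages

def translate(addr, table):
    """Translate one address through a flat {page number: frame number} map."""
    page, offset = divmod(addr, PAGE_SIZE)
    return table[page] * PAGE_SIZE + offset

# Hypothetical mappings, for illustration only.
guest_pt = {0x10: 0x2}   # gVA page 0x10 -> gPA page 0x2 (managed by the guest OS)
host_map = {0x2: 0x7F}   # gPA page 0x2 -> sPA page 0x7F (managed by the hypervisor)

def gva_to_spa(gva):
    """gVA -> gPA -> sPA; the page offset is preserved at each step."""
    gpa = translate(gva, guest_pt)
    return translate(gpa, host_map)

print(hex(gva_to_spa(0x10ABC)))  # 0x7fabc
```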
![Page 3: Nesting Paging in VM Replay for MPs](https://reader036.fdocuments.us/reader036/viewer/2022062305/56815d33550346895dcb2f36/html5/thumbnails/3.jpg)
X86 4KB page tables in long mode
![Page 4: Nesting Paging in VM Replay for MPs](https://reader036.fdocuments.us/reader036/viewer/2022062305/56815d33550346895dcb2f36/html5/thumbnails/4.jpg)
Shadow Page Table
• Shadow page table (sPT)
  – Translates from gVA to sPA
  – Maintained by VMM (hypervisor)
• VMM intercepts updates of the page table base address
  – CR3 updates in x86
  – Sets CR3 to the sPT base address instead of the gPT base address
• sPT must be consistent with the guest page table (gPT): gPT updates must be reflected in sPT
• Any page fault must be intercepted by VMM
  – VMM must tell guest-induced page faults from VMM-induced ones
  – Vectors guest-induced page faults to guest OS
  – High overheads for page fault handling
![Page 5: Nesting Paging in VM Replay for MPs](https://reader036.fdocuments.us/reader036/viewer/2022062305/56815d33550346895dcb2f36/html5/thumbnails/5.jpg)
How to make gPT and sPT consistent?
• Write-protecting gPT
  – Any modification of gPT (adding or removing a translation) causes a fault
  – VMM updates sPT accordingly
• Exploiting page-fault behavior and TLB consistency rules
  – Adding a page translation
    • Guest OS can add a new translation to gPT without interception by VMM
    • A later access by the guest VM causes a page fault on the new translation
    • VMM updates sPT on the page fault: it must inspect gPT to find the new page
  – Deleting a page translation
    • Guest OS executes INVLPG to invalidate the TLB entry
    • VMM intercepts the execution and removes the entry from sPT
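
The second scheme (lazy fill on page fault, removal on INVLPG) can be sketched as a toy model. This is an illustration under simplifying assumptions, not a real VMM: tables are flat dictionaries keyed by page number, and the class and method names are hypothetical.

```python
class ShadowPager:
    """Toy VMM that keeps a shadow table (gVA page -> sPA page) in sync lazily."""

    def __init__(self, gpa_to_spa):
        self.gpa_to_spa = gpa_to_spa  # hypervisor's gPA page -> sPA page map
        self.gpt = {}                 # guest page table: gVA page -> gPA page
        self.spt = {}                 # shadow page table: gVA page -> sPA page

    def guest_map(self, vpage, gpage):
        # The guest adds a translation; the VMM does not intercept this write.
        self.gpt[vpage] = gpage

    def guest_unmap_and_invlpg(self, vpage):
        # The guest removes the entry and executes INVLPG, which the VMM
        # intercepts to drop the matching shadow entry.
        self.gpt.pop(vpage, None)
        self.spt.pop(vpage, None)

    def access(self, vpage):
        if vpage in self.spt:
            return self.spt[vpage], "hit"
        # Fault on the shadow table: the VMM inspects the guest page table.
        if vpage not in self.gpt:
            return None, "guest fault"      # guest-induced, forwarded to the guest OS
        self.spt[vpage] = self.gpa_to_spa[self.gpt[vpage]]
        return self.spt[vpage], "vmm fill"  # VMM-induced, hidden from the guest
```

The return tag distinguishes the two kinds of page fault that the VMM must tell apart: one caused by a missing shadow entry and one that the guest OS itself has to handle.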
![Page 6: Nesting Paging in VM Replay for MPs](https://reader036.fdocuments.us/reader036/viewer/2022062305/56815d33550346895dcb2f36/html5/thumbnails/6.jpg)
Overheads of Shadow Paging
• Any page fault requires expensive VMM intervention
  – Guest-induced page faults
  – Hypervisor-induced page faults
• Accessed and dirty bit updates
  – HW page walker sets bits in sPT (not gPT)
  – Guest OS needs the information to make paging decisions
  – Dirty bit example: set pages pointed to by sPT read-only, so the first write traps and the VMM can update the dirty bit in gPT
• Problems in MPs
  – What if a VM uses multiple processors?
  – Replicating sPT for each processor? Memory overheads
  – Sharing sPT? Must synchronize sPT for any change
![Page 7: Nesting Paging in VM Replay for MPs](https://reader036.fdocuments.us/reader036/viewer/2022062305/56815d33550346895dcb2f36/html5/thumbnails/7.jpg)
Shadow Paging Overheads
![Page 8: Nesting Paging in VM Replay for MPs](https://reader036.fdocuments.us/reader036/viewer/2022062305/56815d33550346895dcb2f36/html5/thumbnails/8.jpg)
Nesting Page Table
• A source of address translation overheads in traditional x86 VMMs
  – A fixed hardware page walker handles a TLB miss
  – It can walk only one page table (pointed to by CR3)
• Nested paging
  – Separate HW states affecting paging (two copies of CR3, etc.) for guest OS and VMM
  – HW page walker can walk both gPT and the VMM's nested page table (gPA to sPA)
  – TLB can hold a translation from gVA to sPA directly
• Benefit: no more traps on guest page table accesses
• Drawback: extra page table steps add latency to a TLB miss
• May add extra caching for page translation
  – Nested TLB
  – 2D page walk cache
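
The extra latency of a nested (2D) walk can be estimated with a short calculation. The sketch below assumes no TLB or page walk cache hits: every guest page table level holds a gPA that must itself be translated through all nested levels before the guest entry can be read, and the final gPA needs one more nested walk. The function name is hypothetical.

```python
def nested_walk_refs(guest_levels, nested_levels):
    """Worst-case memory references for one TLB miss under nested paging."""
    # Each guest level: translate its gPA (nested_levels refs) + read the entry (1 ref).
    per_guest_level = nested_levels + 1
    # After the last guest level, the final gPA still needs a nested walk.
    return guest_levels * per_guest_level + nested_levels

print(nested_walk_refs(4, 4))  # 24, versus 4 for a native 4-level walk
```

This is why the slide lists a nested TLB and a 2D page walk cache as mitigations: both cut the number of references that actually go to memory.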
![Page 9: Nesting Paging in VM Replay for MPs](https://reader036.fdocuments.us/reader036/viewer/2022062305/56815d33550346895dcb2f36/html5/thumbnails/9.jpg)
Nested Paging
![Page 10: Nesting Paging in VM Replay for MPs](https://reader036.fdocuments.us/reader036/viewer/2022062305/56815d33550346895dcb2f36/html5/thumbnails/10.jpg)
Nested Paging
![Page 11: Nesting Paging in VM Replay for MPs](https://reader036.fdocuments.us/reader036/viewer/2022062305/56815d33550346895dcb2f36/html5/thumbnails/11.jpg)
Address Space IDs
• Old x86 did not support address space IDs (ASIDs) in TLBs
  – Must flush TLBs on a VM switch
  – Assign an ASID to each VM
  – Still need to flush TLBs on a context switch within a VM
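
A minimal sketch of an ASID-tagged TLB, assuming one ASID per VM as on the slide. The class and method names are hypothetical, and a dictionary stands in for the hardware lookup structure.

```python
class TaggedTLB:
    """Toy TLB whose entries are tagged with an address space ID."""

    def __init__(self):
        self.entries = {}  # (asid, virtual page) -> physical page

    def insert(self, asid, vpage, ppage):
        self.entries[(asid, vpage)] = ppage

    def lookup(self, asid, vpage):
        # Entries of other VMs never match, so a VM switch needs no flush.
        return self.entries.get((asid, vpage))

    def flush_asid(self, asid):
        # A context switch inside a VM still flushes that VM's entries,
        # because all of its processes share the same ASID.
        self.entries = {k: v for k, v in self.entries.items() if k[0] != asid}
```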
![Page 12: Nesting Paging in VM Replay for MPs](https://reader036.fdocuments.us/reader036/viewer/2022062305/56815d33550346895dcb2f36/html5/thumbnails/12.jpg)
Replay Papers
• VM-based replay
  – Execution Replay for Multiprocessor Virtual Machines
  – Dunlap et al.
• HW-based replay
  – Rerun: Exploiting Episodes for Lightweight Memory Race Recording
  – Hower and Hill
• ODR: Output-Deterministic Replay for Multicore Debugging
  – Altekar and Stoica
• Slides adapted from the presentation slides by the paper authors
![Page 13: Nesting Paging in VM Replay for MPs](https://reader036.fdocuments.us/reader036/viewer/2022062305/56815d33550346895dcb2f36/html5/thumbnails/13.jpg)
Big ideas
• Detection and replay of memory races is possible on commodity hardware
• Overhead high for some workloads
• …but surprisingly low for other workloads
![Page 14: Nesting Paging in VM Replay for MPs](https://reader036.fdocuments.us/reader036/viewer/2022062305/56815d33550346895dcb2f36/html5/thumbnails/14.jpg)
Execution Replay
[Figure labels: CPU, Memory, Disk, Network, Keyboard/mouse, Interrupts]
![Page 15: Nesting Paging in VM Replay for MPs](https://reader036.fdocuments.us/reader036/viewer/2022062305/56815d33550346895dcb2f36/html5/thumbnails/15.jpg)
Deterministic Replay
• Deterministic replay
  – Faithfully replay an execution such that all instructions appear to complete in the same order and produce the same result
• Valuable
  – Debugging [LeBlanc, et al. - COMP ’87]
    • e.g., time travel debugging, rare bug replication
  – Fault tolerance [Bressoud, et al. - SIGOPS ’95]
    • e.g., hot backup virtual machines
  – Security [Dunlap et al. – OSDI ’02]
    • e.g., attack analysis
  – Tracing [Xu et al. – WDDD ’07]
    • e.g., unobtrusive replay tracing
![Page 16: Nesting Paging in VM Replay for MPs](https://reader036.fdocuments.us/reader036/viewer/2022062305/56815d33550346895dcb2f36/html5/thumbnails/16.jpg)
Single-processor Replay
• Basic principles well understood
  – Log all non-deterministic inputs
  – Timing of asynchronous events
• Minimal overhead (Dunlap02)
  – 13% worst case
  – Log for months or years
• Available commercially
  – VMware: Record/Replay
![Page 17: Nesting Paging in VM Replay for MPs](https://reader036.fdocuments.us/reader036/viewer/2022062305/56815d33550346895dcb2f36/html5/thumbnails/17.jpg)
The Multiprocessor Challenge
• Interleaved reads and writes
  – Fine-grained non-determinism
  – Much more difficult
• Existing solutions
  – Hardware modification
  – Software instrumentation
• SMP-ReVirt
  – Uses the hardware MMU to detect sharing
![Page 18: Nesting Paging in VM Replay for MPs](https://reader036.fdocuments.us/reader036/viewer/2022062305/56815d33550346895dcb2f36/html5/thumbnails/18.jpg)
Multiprocessor Replay
[Figure: P1 and P2 share memory; the labels n=3, n=5, and if (n<4) show accesses to n whose outcome depends on the interleaving]
![Page 19: Nesting Paging in VM Replay for MPs](https://reader036.fdocuments.us/reader036/viewer/2022062305/56815d33550346895dcb2f36/html5/thumbnails/19.jpg)
Ordering Memory Accesses
• Preserving order will reproduce execution
  – a→b: “a happens-before b”
  – Ordering is transitive: a→b, b→c means a→c
• Two instructions must be ordered if:
  – they both access the same memory, and
  – one of them is a write
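
The ordering rule above amounts to a conflict test on pairs of accesses. The sketch below is illustrative; the `Access` record is hypothetical, and the check that the two accesses come from different processors is an added assumption (accesses on one processor are already ordered by program order).

```python
from typing import NamedTuple

class Access(NamedTuple):
    cpu: int
    addr: int
    is_write: bool

def must_order(a, b):
    """True if the two accesses conflict and their order must be recorded."""
    same_memory = a.addr == b.addr
    one_is_write = a.is_write or b.is_write
    return a.cpu != b.cpu and same_memory and one_is_write
```

Two reads of the same address never need an ordering constraint, which is what the Concurrent-Read state of the CREW protocol exploits.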
![Page 20: Nesting Paging in VM Replay for MPs](https://reader036.fdocuments.us/reader036/viewer/2022062305/56815d33550346895dcb2f36/html5/thumbnails/20.jpg)
Constraints: Enforcing order
[Figure: timelines for P1 (a, then b) and P2 (c, then d), with the label “overconstrained”]
• To guarantee a→d, any one of these suffices:
  – a→d
  – b→d
  – a→c
  – b→c
• Suppose we need b→c
  – b→c is necessary
  – a→d is redundant
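
Whether a constraint is redundant can be checked by asking if it already follows from program order plus the logged constraints by transitivity. The sketch below is a plain reachability search over happens-before edges; the function name and the edge-list encoding are assumptions made for illustration.

```python
def implied(edges, src, dst):
    """Return True if src happens-before dst follows from edges by transitivity."""
    stack, seen = [src], set()
    while stack:
        node = stack.pop()
        for a, b in edges:
            if a == node and b not in seen:
                if b == dst:
                    return True
                seen.add(b)
                stack.append(b)
    return False

program_order = [("a", "b"), ("c", "d")]  # P1: a then b; P2: c then d
logged = program_order + [("b", "c")]     # the one constraint that is needed

print(implied(logged, "a", "d"))  # True: a→b, b→c, c→d
```

The reverse does not hold: logging only a→d would not imply b→c, so b→c is the constraint to keep.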
![Page 21: Nesting Paging in VM Replay for MPs](https://reader036.fdocuments.us/reader036/viewer/2022062305/56815d33550346895dcb2f36/html5/thumbnails/21.jpg)
CREW Protocol
• Each shared object is in one of two states:
  – Concurrent-Read: all processors can read, none can write
  – Exclusive-Write: one processor (the owner) can read and write; others have no access
• Enforced with hardware MMU
  – Read/write
  – Read-only
  – None
• Change CREW states on demand
  – Fault, fixup, re-execute
• CREW event
  – Increasing or reducing permission due to CREW state changes
![Page 22: Nesting Paging in VM Replay for MPs](https://reader036.fdocuments.us/reader036/viewer/2022062305/56815d33550346895dcb2f36/html5/thumbnails/22.jpg)
CREW Property
• If two instructions on different processors
  – access the same page, and
  – one of them is a write,
• then there will be a CREW event on each processor between them.
![Page 23: Nesting Paging in VM Replay for MPs](https://reader036.fdocuments.us/reader036/viewer/2022062305/56815d33550346895dcb2f36/html5/thumbnails/23.jpg)
Generating Constraints
[Figure: timelines for P1 and P2 with events a, d, r, i, and d*]
• State: Concurrent Read
  – All processors read-only
• d*: CREW fault
• New state: P2 Exclusive
• r: privilege reduction
  – Read to None
• i: privilege increase
  – Read to Read/write
• Log timing of r and i
• Constraint:
  – r → i
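
The walk-through above can be reproduced with a toy model of one CREW page. This is a sketch under stated assumptions, not SMP-ReVirt's implementation: the class name, the permission strings, and the event tuples are hypothetical, and reductions are always applied before increases so that each logged pair respects r → i.

```python
class CrewPage:
    """Tracks per-processor permissions for one shared page under CREW."""

    def __init__(self, n_cpus):
        # Concurrent-Read: every processor starts read-only.
        self.perm = ["r"] * n_cpus  # each entry is "rw", "r", or "none"
        self.events = []            # logged (cpu, old permission, new permission)

    def _set(self, cpu, new):
        if self.perm[cpu] != new:
            self.events.append((cpu, self.perm[cpu], new))
            self.perm[cpu] = new

    def write(self, cpu):
        if self.perm[cpu] != "rw":            # CREW fault on the writer
            for other in range(len(self.perm)):
                if other != cpu:
                    self._set(other, "none")  # r: privilege reduction
            self._set(cpu, "rw")              # i: privilege increase

    def read(self, cpu):
        if self.perm[cpu] == "none":          # CREW fault on the reader
            for other, p in enumerate(self.perm):
                if p == "rw":
                    self._set(other, "r")     # reduce the owner first
            for other in range(len(self.perm)):
                self._set(other, "r")         # back to Concurrent-Read

page = CrewPage(2)
page.write(1)       # P2 (index 1) writes a page that was Concurrent-Read
print(page.events)  # [(0, 'r', 'none'), (1, 'r', 'rw')]
```

In the printed log the reduction on the first processor precedes the increase on the second, which is the r → i constraint that would be recorded for replay.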
![Page 24: Nesting Paging in VM Replay for MPs](https://reader036.fdocuments.us/reader036/viewer/2022062305/56815d33550346895dcb2f36/html5/thumbnails/24.jpg)
Predicting results
• Key changes in sharing attributes
  – 4096-byte sharing granularity
  – “Miss” is very expensive
• SPLASH2
  – Good: high spatial locality / low false sharing
  – Bad: random access patterns / high false sharing
• The Linux kernel
  – Tuned to 16-byte cacheline
  – Involving the kernel may be expensive
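
The effect of the 4096-byte granularity can be seen with a small calculation: two variables that sit in different cache lines can still share a page, so page-level tracking reports sharing that line-level tracking would not. The addresses and the 64-byte line size below are assumptions chosen for illustration.

```python
PAGE_SIZE = 4096  # CREW sharing granularity from the slide
LINE_SIZE = 64    # assumed cache-line size, for comparison only

def same_unit(addr_a, addr_b, unit):
    """True if both addresses fall into the same unit-sized block."""
    return addr_a // unit == addr_b // unit

a, b = 0x1000, 0x1800  # hypothetical addresses of two unrelated variables

print(same_unit(a, b, LINE_SIZE))  # False: different cache lines
print(same_unit(a, b, PAGE_SIZE))  # True: same page, so false sharing under CREW
```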
![Page 25: Nesting Paging in VM Replay for MPs](https://reader036.fdocuments.us/reader036/viewer/2022062305/56815d33550346895dcb2f36/html5/thumbnails/25.jpg)
Single-processor Xen guests
[Figure: bar chart of normalized runtime (y-axis 0 to 1.2) for FMM, LU, ocean, radix, water-spatial, kernel-build, radiosity, and dbench. Series: unmodified 1-cpu guest, logging 1-cpu guest. Bar labels as extracted: 1.00, 1.04, 1.01, 1.00, 1.03, 1.13, 1.00, 1.05.]
![Page 26: Nesting Paging in VM Replay for MPs](https://reader036.fdocuments.us/reader036/viewer/2022062305/56815d33550346895dcb2f36/html5/thumbnails/26.jpg)
2-processor Xen guests
[Figure: bar chart of normalized runtime (y-axis 0 to 2.5) for FMM, LU, ocean, radix, water-spatial, and kernel-build. Series: unmodified 2-cpu guest, logging 2-cpu guest, logging 1-cpu guest. Bar labels as extracted: 1.51, 1.00, 1.08, 1.60, 1.48, 2.10, 1.90, 1.76, 1.96, 1.74, 1.83, 1.99.]
![Page 27: Nesting Paging in VM Replay for MPs](https://reader036.fdocuments.us/reader036/viewer/2022062305/56815d33550346895dcb2f36/html5/thumbnails/27.jpg)
2-processor, cont’d
[Figure: bar chart of normalized runtime (y-axis 0 to 10) for radiosity and dbench. Series: unmodified 2-cpu guest, logging 2-cpu guest, logging 1-cpu guest. Bar labels as extracted: 8.70, 7.21, 1.85, 1.88.]
![Page 28: Nesting Paging in VM Replay for MPs](https://reader036.fdocuments.us/reader036/viewer/2022062305/56815d33550346895dcb2f36/html5/thumbnails/28.jpg)
4-processor Xen guests
[Figure: bar chart of normalized runtime (y-axis 0 to 10) for FMM, LU, ocean, radix, water-spatial, and kernel-build. Series: unmodified domain, 4 cpus; CREW logging, 4 cpus; CREW logging, 2 cpus*; CREW logging, 1 cpu. Bar labels as extracted: 7.36, 1.12, 1.28, 4.20, 1.72, 9.03.]
![Page 29: Nesting Paging in VM Replay for MPs](https://reader036.fdocuments.us/reader036/viewer/2022062305/56815d33550346895dcb2f36/html5/thumbnails/29.jpg)
HW Memory Race Recording
• SW-only approach
  – Too slow to be always on
  – SW alters the execution path
• Want
  – Small log: record longer for the same state
  – Small hardware: reduce cost, especially when not used
  – Unobtrusive: should not alter execution
• Rerun: Exploiting Episodes for Lightweight Memory Race Recording
![Page 30: Nesting Paging in VM Replay for MPs](https://reader036.fdocuments.us/reader036/viewer/2022062305/56815d33550346895dcb2f36/html5/thumbnails/30.jpg)
Episodic Recording
• Most code executes without races
  – Use race-free regions as unit of ordering
• Episodes: independent execution regions
  – Defined per thread
  – Identified passively: does not affect execution
  – Encompass every instruction
[Figure: load/store streams (LD A, ST B, ST C, LD F, ...) for threads T0, T1, and T2]
![Page 31: Nesting Paging in VM Replay for MPs](https://reader036.fdocuments.us/reader036/viewer/2022062305/56815d33550346895dcb2f36/html5/thumbnails/31.jpg)
Capturing Causality
• Via scalar Lamport clocks [Lamport ’78]
  – Assigns timestamps to events
  – Timestamp order is consistent with causality: if a→b, then a has the smaller timestamp
• Replay in timestamp order
  – Episodes with the same timestamp can be replayed in parallel
[Figure: episodes of T0, T1, and T2 labeled with timestamps such as 22, 23, 43, 44, 45, 60, 61, 62]
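
The scalar Lamport clock rule can be sketched in a few lines. This is the textbook update rule, not Rerun's hardware logic; the class and method names are hypothetical.

```python
class LamportClock:
    """Scalar logical clock: causally later events get larger timestamps."""

    def __init__(self, start=0):
        self.time = start

    def tick(self):
        """Local event: advance the clock by one."""
        self.time += 1
        return self.time

    def send(self):
        """Return the timestamp to attach to an outgoing message."""
        return self.tick()

    def receive(self, msg_time):
        """Merge a received timestamp so the receive is ordered after the send."""
        self.time = max(self.time, msg_time) + 1
        return self.time

t0, t1 = LamportClock(60), LamportClock(22)
stamp = t0.send()          # t0 advances to 61
print(t1.receive(stamp))   # 62: t1 jumps past the sender's timestamp
```

Events that are not causally related may end up with equal timestamps, which is why episodes with the same timestamp can be replayed in parallel.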
![Page 32: Nesting Paging in VM Replay for MPs](https://reader036.fdocuments.us/reader036/viewer/2022062305/56815d33550346895dcb2f36/html5/thumbnails/32.jpg)
Episode Benefits
• Multiple races can be captured by a single episode
  – Reduces amount of information to be logged
• Episodes are created passively
  – No speculation, no rollback
• Episodes can end early
  – Eases implementation
• Episode information is thread-local
  – Promotes scalability, avoids synchronization overheads
![Page 33: Nesting Paging in VM Replay for MPs](https://reader036.fdocuments.us/reader036/viewer/2022062305/56815d33550346895dcb2f36/html5/thumbnails/33.jpg)
Hardware
• Rerun requirements:
  – Detect races: track r/w sets
  – Mark episode boundaries
  – Maintain logical time
[Figure: base system of 16 cores (Core 0 to Core 15) with L1 caches and coherence controllers, L2 banks 0 to 15, an interconnect, and DRAM]
[Figure labels for the added state: Write Filter (WF), Read Filter (RF), Timestamp (TS), References (REFS), Memory Timestamp (MTS); sizes as extracted: 32 bytes, 128 bytes, 2 bytes, 4 bytes, 4 bytes]
Total State: 166 bytes/core
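
How the three requirements fit together can be sketched with a toy per-thread recorder. This is a simplified illustration, not the Rerun hardware: exact Python sets stand in for the read and write filters, conflicts are reported through a direct method call instead of coherence messages, and all names are hypothetical.

```python
class EpisodeRecorder:
    """Toy per-thread episode recorder."""

    def __init__(self):
        self.reads = set()   # stand-in for the Read Filter (RF)
        self.writes = set()  # stand-in for the Write Filter (WF)
        self.refs = 0        # memory references in the current episode (REFS)
        self.ts = 0          # scalar timestamp of the current episode (TS)
        self.log = []        # (timestamp, refs) for each finished episode

    def local_access(self, block, is_write):
        (self.writes if is_write else self.reads).add(block)
        self.refs += 1

    def end_episode(self):
        self.log.append((self.ts, self.refs))
        self.reads.clear()
        self.writes.clear()
        self.refs = 0
        self.ts += 1

    def remote_access(self, block, is_write, remote):
        """Another thread touches block; end our episode if that is a race."""
        conflict = block in self.writes or (is_write and block in self.reads)
        if conflict:
            self.end_episode()
            # The remote thread's current episode must replay after the one
            # just logged, so its timestamp is pushed past the logged value.
            remote.ts = max(remote.ts, self.ts)
```

Because the episode ends only when a conflicting remote access arrives, the boundary is found passively and the log holds one small entry per episode instead of one per race.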
![Page 34: Nesting Paging in VM Replay for MPs](https://reader036.fdocuments.us/reader036/viewer/2022062305/56815d33550346895dcb2f36/html5/thumbnails/34.jpg)
HW Replay Summary
• Requires some modification to existing HW
  – Will CPU manufacturers add the support any time soon? Not likely
• Other low-overhead approaches with SW-based replay
  – ODR: Output-Deterministic Replay for Multicore Debugging, Altekar and Stoica, SOSP ’09