Xen Summit 2009 Shanghai Ras
![Page 1: Xen Summit 2009 Shanghai Ras](https://reader034.fdocuments.us/reader034/viewer/2022051211/553922c84a7959016b8b499c/html5/thumbnails/1.jpg)
Towards Mission Critical Xen
-- RAS Enabling Update
Jiang, Yunhong
Ke, Liping
Liu, Jinsong
![Page 2: Xen Summit 2009 Shanghai Ras](https://reader034.fdocuments.us/reader034/viewer/2022051211/553922c84a7959016b8b499c/html5/thumbnails/2.jpg)
Legal Information
INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL PRODUCTS. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. EXCEPT AS PROVIDED IN INTEL'S TERMS AND CONDITIONS OF SALE FOR SUCH PRODUCTS, INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO SALE AND/OR USE OF INTEL PRODUCTS INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.
Intel may make changes to specifications and product descriptions at any time, without notice.
All products, computer systems, dates, and figures specified are preliminary based on current expectations, and are subject to change without notice.
Intel is a trademark of Intel Corporation in the U.S. and other countries.
*Other names and brands may be claimed as the property of others.
Copyright © 2009, Intel Corporation. All rights are protected.
![Page 3: Xen Summit 2009 Shanghai Ras](https://reader034.fdocuments.us/reader034/viewer/2022051211/553922c84a7959016b8b499c/html5/thumbnails/3.jpg)
Agenda
• Overview
• CPU/Memory error handling
• Host CPU hot-plug support
• Guest vCPU Hot-plug
![Page 4: Xen Summit 2009 Shanghai Ras](https://reader034.fdocuments.us/reader034/viewer/2022051211/553922c84a7959016b8b499c/html5/thumbnails/4.jpg)
Continuous Improvement on RAS Support

| Feature | February 2009 (Xen Summit North America 2009) | Now |
| --- | --- | --- |
| CPU/Memory error handling | Infrastructure proposed; implementation WIP | Checked in |
| I/O error handling | Infrastructure proposed; PV guest supported | WIP for HVM guest support |
| Host CPU hot-add | Not started yet | Checked in |
| Host memory hot-add | Not started yet | WIP |
| Guest vCPU hot-plug | Not started yet | Ready to send out |
| Guest vMem hot-plug | Not started yet | May not be supported |
![Page 5: Xen Summit 2009 Shanghai Ras](https://reader034.fdocuments.us/reader034/viewer/2022051211/553922c84a7959016b8b499c/html5/thumbnails/5.jpg)
Agenda
• Overview
• CPU/Memory error handling
• Host CPU hot-plug support
• Virtual CPU Hot-plug
![Page 6: Xen Summit 2009 Shanghai Ras](https://reader034.fdocuments.us/reader034/viewer/2022051211/553922c84a7959016b8b499c/html5/thumbnails/6.jpg)
Why Error Handling Enhancement: Motivation
• Pre-virtualization: 1 application per server; an error kills 1 application
• Post-virtualization: ~5-10 applications per server; a hardware/hypervisor (HW/HV) error kills all 5-10 applications
• With error handling enhancements: a HW/HV error kills only the 1 virtualized application to which the error is localized
Any difference in system hardware or software design or configuration may affect actual performance.
![Page 7: Xen Summit 2009 Shanghai Ras](https://reader034.fdocuments.us/reader034/viewer/2022051211/553922c84a7959016b8b499c/html5/thumbnails/7.jpg)
Retrospect: CPU/Memory Error Handling Flow
• An error happens in the CPU or memory and is detected by hardware
– E.g. ECC error in L3 cache, ECC error in a memory cell
• An MCE (Machine Check Exception) is raised to the Xen hypervisor
• The Xen hypervisor's MCE handler parses the error information
– Information from: MCE error code, MCE MSRs, etc.
• The Xen hypervisor takes action to recover from the error, e.g.
– Offline a page for a memory error
– Kill the impacted guest
• The system continues if the error is recovered successfully
– Log the error information in dom0
• Otherwise, reset the system
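
The flow above can be sketched as a small decision function. This is only an illustrative model, not Xen's actual handler: `MceInfo`, its three flags, `Action`, and `handle_mce` are hypothetical names, and a real handler decodes the MCA MSRs rather than receiving ready-made booleans.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    OFFLINE_PAGE = "offline page"
    KILL_GUEST = "kill impacted guest"
    RESET_SYSTEM = "reset system"


@dataclass
class MceInfo:
    """Hypothetical, simplified result of parsing the MCE error code and MSRs."""
    recoverable: bool     # can software recover from this error at all?
    memory_error: bool    # does the error point at a memory page?
    guest_impacted: bool  # is a running guest affected?


def handle_mce(info, dom0_log):
    """Return the list of actions taken for one machine check.

    dom0_log is a plain list standing in for the error log kept in dom0.
    """
    if not info.recoverable:
        # Unrecoverable error: the only option is to reset the system.
        return [Action.RESET_SYSTEM]
    actions = []
    if info.memory_error:
        actions.append(Action.OFFLINE_PAGE)
    if info.guest_impacted:
        actions.append(Action.KILL_GUEST)
    # Recovery succeeded: the system continues and the error is logged.
    dom0_log.append(info)
    return actions
```

For example, a recoverable memory error that hits no guest yields only the page-offline action and one log entry, while an unrecoverable error yields a reset and nothing is logged.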
![Page 8: Xen Summit 2009 Shanghai Ras](https://reader034.fdocuments.us/reader034/viewer/2022051211/553922c84a7959016b8b499c/html5/thumbnails/8.jpg)
Retrospect: CPU/Memory Error Handling Architecture
[Figure: architecture diagram. Labels: Hardware, Error, MCE, MCA MSR, Xen, MCA Handler, VirtMCA telemetry, Action (Page offline, CPU offline, Reset System), Dom0 (user space tools (FMA/mcelog), vIRQ handler, vMCE handler), DomU (vMCE handler).]
Infrastructure is already implemented by Ke, Liping / Frank van der Linden / Christoph Egger
![Page 9: Xen Summit 2009 Shanghai Ras](https://reader034.fdocuments.us/reader034/viewer/2022051211/553922c84a7959016b8b499c/html5/thumbnails/9.jpg)
Next Step for CPU/Memory Error Handling
• Add Xen awareness to Linux MCA tools
• Support more MCA error codes
– Currently supports two software-recoverable errors
  • UCR errors detected by memory controller scrubbing
  • UCR errors detected during L3 cache explicit write-backs
• Enhancement to memory error handling (see next pages)
![Page 10: Xen Summit 2009 Shanghai Ras](https://reader034.fdocuments.us/reader034/viewer/2022051211/553922c84a7959016b8b499c/html5/thumbnails/10.jpg)
Memory Error Handling – Current Solution
• Read access to broken memory affects data integrity
– The whole system may even crash
• Recovery action from the Xen hypervisor

| Type | Probability | Action |
| --- | --- | --- |
| Free memory | Depends on workload | Offline the page |
| Guest memory | Large | Offline the page when it is freed; a virtual MCE is sent to the guest |
| Critical memory: Xen’s private data/heap; shadow/HAP/IOMMU/P2M page tables; granted memory used by backend services | Small | Reset the system |
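
The recovery policy in the table can be written as a simple lookup. This is a minimal sketch under stated assumptions: `PageOwner` and `recovery_actions` are hypothetical names, and how the hypervisor actually classifies a broken page is not shown here.

```python
from enum import Enum


class PageOwner(Enum):
    FREE = "free memory"
    GUEST = "guest memory"
    # Xen's private data/heap, shadow/HAP/IOMMU/P2M page tables,
    # and granted memory used by backend services.
    CRITICAL = "critical memory"


def recovery_actions(owner):
    """Map the owner of a broken page to the recovery actions from the table."""
    if owner is PageOwner.FREE:
        return ["offline the page"]
    if owner is PageOwner.GUEST:
        return ["send a virtual MCE to the guest",
                "offline the page when it is freed"]
    # Critical memory cannot be recovered in the current solution.
    return ["reset the system"]
```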
![Page 11: Xen Summit 2009 Shanghai Ras](https://reader034.fdocuments.us/reader034/viewer/2022051211/553922c84a7959016b8b499c/html5/thumbnails/11.jpg)
Memory Error Handling Enhancement (Cont’d)
• Issue: Xen may access broken guest memory
– Xen scans the guest’s page tables when killing a shadow-mode guest
  • The Xen hypervisor crashes if a page-table page is broken
– Xen accesses the guest’s memory for instruction emulation
– Kexec accesses the broken pages
– ……
– Proposal: avoid high-probability accesses
• Issue: the guest’s access to broken memory is not prevented
– A malicious guest can trigger a system crash
– Proposal: detect the access in the Xen hypervisor in advance
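
One way to picture "detecting the access in advance" is a check against a set of known-broken pages before any read. The slides do not describe the mechanism, so this is purely an illustrative assumption: `GuestMemory`, `mark_broken`, `read`, and `BrokenPageError` are hypothetical names, and pages are modelled as a dictionary keyed by frame number.

```python
class BrokenPageError(Exception):
    """Raised instead of touching a page known to have a hardware error."""


class GuestMemory:
    def __init__(self, pages):
        # pages: dict mapping a frame number to that page's contents
        self._pages = dict(pages)
        self._broken = set()

    def mark_broken(self, frame):
        """Record that a frame has an uncorrected error."""
        self._broken.add(frame)

    def read(self, frame):
        """Read a page, refusing up front if the frame is marked broken."""
        if frame in self._broken:
            raise BrokenPageError(f"frame {frame} is marked broken")
        return self._pages[frame]
```

The point of the sketch is the ordering: the broken-page check happens before the access, so the caller gets a recoverable error rather than consuming corrupted data.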
![Page 12: Xen Summit 2009 Shanghai Ras](https://reader034.fdocuments.us/reader034/viewer/2022051211/553922c84a7959016b8b499c/html5/thumbnails/12.jpg)
Agenda
• Overview
• CPU/Memory error handling
• Host CPU hot-plug support
• Guest vCPU Hot-plug
![Page 13: Xen Summit 2009 Shanghai Ras](https://reader034.fdocuments.us/reader034/viewer/2022051211/553922c84a7959016b8b499c/html5/thumbnails/13.jpg)
Host CPU Hot-add Support
• Host CPU hot-add works in a Xen environment in 2 steps
• Step 1: the CPU is marked present
– A CPU is hot-added to the physical platform
– The platform raises an interrupt (SCI) to dom0
– Dom0’s ACPI driver parses the ACPI table to get the CPU information
– Dom0’s ACPI driver notifies the Xen hypervisor that a new CPU was added
– The new CPU is marked present in the Xen hypervisor, but will not be scheduled
• Step 2: management tools notify the Xen hypervisor to bring the CPU online
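
The two steps above amount to a small state machine: absent, then present, then online. The sketch below models only that ordering; `PcpuTable`, `CpuState`, and the method names are hypothetical and do not correspond to real Xen hypercalls.

```python
from enum import Enum


class CpuState(Enum):
    ABSENT = "absent"
    PRESENT = "present"  # known to Xen, but not scheduled
    ONLINE = "online"    # available to the scheduler


class PcpuTable:
    """Toy model of the hypervisor's view of physical CPUs."""

    def __init__(self):
        self._state = {}

    def state(self, cpu_id):
        return self._state.get(cpu_id, CpuState.ABSENT)

    def mark_present(self, cpu_id):
        """Step 1: dom0's ACPI driver reports a newly added CPU."""
        if self.state(cpu_id) is not CpuState.ABSENT:
            raise ValueError(f"CPU {cpu_id} is already known")
        self._state[cpu_id] = CpuState.PRESENT

    def bring_online(self, cpu_id):
        """Step 2: management tools ask for the CPU to be brought online."""
        if self.state(cpu_id) is not CpuState.PRESENT:
            raise ValueError(f"CPU {cpu_id} is not in the present state")
        self._state[cpu_id] = CpuState.ONLINE

    def schedulable(self):
        """Only online CPUs may be scheduled."""
        return sorted(c for c, s in self._state.items() if s is CpuState.ONLINE)
```

Keeping "present" and "online" as separate states mirrors the slide: a hot-added CPU is visible to the hypervisor after step 1 but receives no work until step 2.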
![Page 14: Xen Summit 2009 Shanghai Ras](https://reader034.fdocuments.us/reader034/viewer/2022051211/553922c84a7959016b8b499c/html5/thumbnails/14.jpg)
Host CPU Hot-add
[Figure: host CPU hot-add diagram. Labels: HW (CPU, ACPI), Dom0 (ACPI Driver, Xen PCPU Driver, Management Tools), Xen; paths "CPU Add" and "CPU Online"; links "ACPI Notification", "sysfs", and two "Hypercall" links.]
Patch is on the way to upstream
![Page 15: Xen Summit 2009 Shanghai Ras](https://reader034.fdocuments.us/reader034/viewer/2022051211/553922c84a7959016b8b499c/html5/thumbnails/15.jpg)
15
Agenda
• Overview
• CPU/Memory error handling
• Host CPU hot-plug support
• Guest vCPU Hot-plug
![Page 16: Xen Summit 2009 Shanghai Ras](https://reader034.fdocuments.us/reader034/viewer/2022051211/553922c84a7959016b8b499c/html5/thumbnails/16.jpg)
HVM Guest vCPU Hotplug
• Hot-add/remove an HVM guest’s vCPUs dynamically
– PV guests have had vCPU hot-plug for a long time
• Code is almost ready
![Page 17: Xen Summit 2009 Shanghai Ras](https://reader034.fdocuments.us/reader034/viewer/2022051211/553922c84a7959016b8b499c/html5/thumbnails/17.jpg)
HVM Guest vCPU Hotplug
[Figure: HVM guest vCPU hotplug diagram. Labels: Xen vCPU Management, Dom0, DomU, Control Panel (xm vcpu-add), QEMU, Virtual ACPI HW, Virtual ACPI Table, Virtual BIOS, Hot plug Driver, Notification.]
![Page 18: Xen Summit 2009 Shanghai Ras](https://reader034.fdocuments.us/reader034/viewer/2022051211/553922c84a7959016b8b499c/html5/thumbnails/18.jpg)
Next Step for RAS Effort
• CPU/Memory enhancement
• PCI AER support for devices assigned to HVM guests
– Based on PCI-Express support in QEMU
• Host memory hot-add support