XPDS13: Performance Optimization on Xen-based Android Device - Jack Ren, Intel and Xiantao Zhang,...
Performance Optimization on Xen-based Android Device
Jack Ren/Xiantao Zhang/Dongxiao Xu
Key contributor: Eddie Dong
Intel Corporation
Legal Disclaimer
INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL® PRODUCTS. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. EXCEPT AS PROVIDED IN INTEL’S TERMS AND CONDITIONS OF SALE FOR SUCH PRODUCTS, INTEL ASSUMES NO LIABILITY WHATSOEVER, AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO SALE AND/OR USE OF INTEL® PRODUCTS INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT. INTEL PRODUCTS ARE NOT INTENDED FOR USE IN MEDICAL, LIFE SAVING, OR LIFE SUSTAINING APPLICATIONS.
Intel may make changes to specifications and product descriptions at any time, without notice.
All products, dates, and figures specified are preliminary based on current expectations, and are subject to change without notice.
Intel, processors, chipsets, and desktop boards may contain design defects or errors known as errata, which may cause the product to deviate from published specifications. Current characterized errata are available on request.
Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States and other countries.
*Other names and brands may be claimed as the property of others.
Copyright © 2012 Intel Corporation.
Agenda
• Overview
• Design Details
• Gaps, Analysis & Optimizations
• Summary
Overview
• Back to Xen Summit 2011 in Seoul…
“Mobile virtualization will be more important…Xen has unique advantages there”
- “Mobile Virtualization using the Xen Technologies”, Jun Nakajima, Intel.
And Jun proposed a Xen-based Android system:
Overview (continued)
• New use case: Android in Dom0, hypervisor as TEE (Trusted Execution Engine)
[Figure: software stack]
Dom0
  Android userland (ring 3): Gallery, VideoPlayer, Browser, …
  Android framework: Surface Manager, OpenGLES, Dalvik, …
  Android kernel (ring 1): GFX, Video, PM, …
Xen (ring 0): Virtual CPU, Virtual MMU, Virtual IRQ, …
But we don’t want to sacrifice performance and power too much
Design Details
− For example, Quadrant I/O: 21% degradation
Component     Virtualization                                      Performance
I/O           Pass-through to Android                             Close to native performance
CPU           vCPUs pinned to physical CPUs                       Eliminates the vCPU scheduling penalty
MMU           Para-virtualized                                    Good run-time performance
IRQ           Xen owns; dispatched to Android via event channel   Main overhead: ring switch
FPU           Para-virtualized                                    No vCPU scheduling, very good performance
CpuIdle       Pass-through to Android                             Completely consistent with Android PM
CpuFreq       Pass-through to Android                             Same as above
Standby (S3)  Pass-through to Android                             Same as above
• Android runs almost natively
Standby (S3) is a little bit tricky…
Design Details (continued)
Re-design S3
• Dom0 owns the full suspend/resume logic.
• Xen assists Dom0 to issue the real monitor/mwait.
• 2X faster than native for S3 resume.
[Figure: S3 suspend/resume time line for CPU0 to CPU3. Labels: HYPERVISOR_vcpu_op(VCPUOP_down), do_mwait_suspend(), mwait, sleep, wake up, HYPERVISOR_vcpu_op(VCPUOP_up).]
Preliminary Power (normalized)
• > 90% of benchmarks reach 95% of native power
[Chart: Power KPIs, normalized to native; y-axis from 80% to 105%.]
But we still identified several gaps…
Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests,
such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any
change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully
evaluating your contemplated purchases, including the performance of that product when combined with other products.
Configurations: [describe config + what test used + who did testing]. For more information go to http://www.intel.com/performance
Preliminary Performance (normalized)
• > 90% of benchmarks reach 97% of native performance
[Chart: Performance KPIs, normalized to native; y-axis from 0% to 120%. Benchmarks on the x-axis: EEMBC CoreMark, Dhrystone - BENC, CaffeineMark, iSPEC00 - speed, Micro Benchmark… (two entries), AnTuTu 2.9.4 CPU Int, Sunspider, EEMBC BrowsingBench, Browsermark, Octane, FishIE Tank - 200M, BaseMark ES2v1 Taiji, BaseMark ES2v1…, Smark Bench2012, Quadrant 2D, Quadrant 3D, Quadrant IO, GLBenchmark 2.5.1… (two entries), Cold Boot time to…, H.264/MPEG-4 AVC…, H.264 video record, 3G HSDPA download, WLAN download, CF-Bench (malloc), USB MTP read large…, USB MTP write…]
But we still identified several gaps and need some tools to help us…
Tools Enabling
Enabled a lot of tools for performance tuning
• VTune
− Based on PMU, mainly used to tune Dom0
• Xentrace
− Based on the original Xentrace, but revised to count key events and hypercalls
• Perf
− Based on PMU, mainly used to tune Dom0
• Xenoprof
− Based on PMU, mainly used to tune Xen
These tools proved very helpful in the later performance and power tuning
Case #1: Quadrant I/O (perf)
Gap: 21%
• Analysis:
Storage data is cached in the page cache, which is allocated from high memory (high_memory). Each page cache access needs a kmap/kunmap pair, which leads to a lot of PVMMU hypercalls.
• Optimizations:
− Shrink Xen memory footprint from 168M to 72M
− Force the page cache to be allocated from low memory
• Gap reduced to 8.5%
Can we continue to optimize and close that gap of 8.5%?
Case #1: Quadrant I/O (perf) (continued)
Profiled by VTune
• Among the 8.5%: Xen overhead = 134/3138 ≈ 4.27%
Xen traces
• Among the 4.27%: PVMMU overhead ≈ 70.88%
Hard to close the remaining 8.5% gap further due to PVMMU overhead
type   name               count   cost          cost%
hcall  mwait_idle_op      3759    37142118744   -
hcall  multicall          12147   145492506     32.12%
hcall  mmu_update         27126   113270256     25.00%
hcall  mmuext_op          7781    50615724      11.17%
hcall  vcpu_op            6577    39658986      8.75%
hcall  event_channel_op   3405    26617650      5.88%
hcall  xen_version        4937    12374700      2.73%
event  PAGE_FAULT         9764    11719224      2.59%
event  IRQ                1119    10178934      2.25%
hcall  event_channel_op   1259    9081834       2.00%
hcall  physdev_op         1692    8251512       1.82%
hcall  event_channel_op   840     7024398       1.55%
hcall  event_channel_op   761     6150300       1.36%
event  TIMER_IRQ          472     5745738       1.27%
hcall  event_channel_op   545     4361118       0.96%
event  TRAP               1038    1040916       0.23%
event  PRIVOP             1032    872700        0.19%
hcall  fpu_taskswitch     1038    439638        0.10%
hcall  undefined          21      102672        0.02%
hcall  apic_op            3       5484          0.00%
total cost (excluding mwait_idle_op): 453004290
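The 70.88% PVMMU figure can be reproduced from the trace table. The sketch below is a minimal check, not the authors' tooling: it assumes that PVMMU overhead means the PAGE_FAULT, mmu_update, multicall and mmuext_op rows (the grouping the Case #2 slide names explicitly), and the names `ROWS`, `PVMMU_NAMES` and `share` are introduced here for illustration only.

```python
# Trace rows copied from the Case #1 table: (type, name, cost).
# mwait_idle_op is omitted because the slide's total cost does not include it.
ROWS = [
    ("hcall", "multicall", 145492506),
    ("hcall", "mmu_update", 113270256),
    ("hcall", "mmuext_op", 50615724),
    ("hcall", "vcpu_op", 39658986),
    ("hcall", "event_channel_op", 26617650),
    ("hcall", "xen_version", 12374700),
    ("event", "PAGE_FAULT", 11719224),
    ("event", "IRQ", 10178934),
    ("hcall", "event_channel_op", 9081834),
    ("hcall", "physdev_op", 8251512),
    ("hcall", "event_channel_op", 7024398),
    ("hcall", "event_channel_op", 6150300),
    ("event", "TIMER_IRQ", 5745738),
    ("hcall", "event_channel_op", 4361118),
    ("event", "TRAP", 1040916),
    ("event", "PRIVOP", 872700),
    ("hcall", "fpu_taskswitch", 439638),
    ("hcall", "undefined", 102672),
    ("hcall", "apic_op", 5484),
]
TOTAL_COST = 453004290  # "total cost" line from the slide

# Assumption: these four entries are what the slides count as PVMMU work.
PVMMU_NAMES = {"multicall", "mmu_update", "mmuext_op", "PAGE_FAULT"}


def share(names, rows, total):
    """Percentage of the total cost spent in the named hypercalls/events."""
    return 100.0 * sum(cost for _, name, cost in rows if name in names) / total


pvmmu_share = share(PVMMU_NAMES, ROWS, TOTAL_COST)
print(f"PVMMU share of Xen cost: {pvmmu_share:.2f}%")
```

Under that grouping the four rows add up to the 70.88% quoted on the slide, and the listed rows sum exactly to the stated total cost.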
Case #2: Home Screen Scroll (power)
Gap: 1.2%
Profiled by VTune
• Xen overhead = 30/3176 ≈ 1%
type   name               count   cost          cost%
event  IRQ                1843    18323532      7.04%
event  TRAP               88      131352        0.05%
event  PAGE_FAULT         943     3237852       1.24%
event  PRIVOP             1385    533748        0.21%
event  TIMER_IRQ          144     2062704       0.79%
hcall  mmu_update         990     8866296       3.41%
hcall  fpu_taskswitch     95      66816         0.03%
hcall  multicall          8736    109199952     41.96%
hcall  xen_version        3914    10860348      4.17%
hcall  vcpu_op            9694    55009236      21.13%
hcall  mmuext_op          3858    34409052      13.22%
hcall  event_channel_op   1188    10105920      3.88%
hcall  physdev_op         1078    7469256       2.87%
hcall  mwait_idle_op      3938    23493503868   -
total cost (excluding mwait_idle_op): 260276064
PVMMU cost (PAGE_FAULT + mmu_update + multicall + mmuext_op): 155713152 (59.83%)
Xen traces
• Among the 1%: PVMMU overhead = 59.83%
PVMMU overhead again…
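The two case studies nest one percentage inside another (Xen's share of the profile, then PVMMU's share of Xen's cost). The sketch below multiplies them out to estimate how much of the whole profile PVMMU accounts for. It assumes that 134/3138 and 30/3176 are fractions of the VTune profile, which the slides do not state, and `end_to_end_share` is a hypothetical helper name.

```python
def end_to_end_share(xen_part, profile_total, pvmmu_pct_of_xen):
    """Return (Xen % of the profile, PVMMU % of the profile)."""
    xen_pct = 100.0 * xen_part / profile_total
    return xen_pct, xen_pct * pvmmu_pct_of_xen / 100.0


# (Xen part, profile total, PVMMU % of Xen cost), as quoted on the slides.
CASES = {
    "Quadrant I/O": (134, 3138, 70.88),
    "Home Screen Scroll": (30, 3176, 59.83),
}

for name, (xen_part, total, pvmmu_pct) in CASES.items():
    xen_pct, pvmmu_total_pct = end_to_end_share(xen_part, total, pvmmu_pct)
    print(f"{name}: Xen {xen_pct:.2f}%, PVMMU {pvmmu_total_pct:.2f}% of the profile")
```

Read this way, PVMMU work comes to roughly 3% of the Quadrant I/O profile and under 1% of the home screen scroll profile, which is consistent with the slides' point that the remaining gap is small but hard to remove while Dom0 stays para-virtualized.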
Other Gaps
Other cases show similar Xen overheads:
• PVMMU
• TLS/stack switching
Some cases can be optimized by reducing the number of hypercalls through guest-side optimization
• For example, Quadrant I/O
Other cases are hard to optimize due to PV overhead
• For example, CF-Bench malloc
These could be fixed by HVM Dom0
Summary
• Dom0 Android achieved near-native power and performance
• Still found some power and performance gaps caused by PVOPS
− PVMMU
− TLS/Stack switch
• Those gaps could be fixed by HVM Dom0