Workload Offloading for Native Codes
from ARM to x86
Hyunjoon Park, Gwangmu Lee, Taehwa Kim, Hanjun Kim
CoreLab, POSTECH
ARM is widely used in various smart devices
Source: http://www.rudebaguette.com/assets/smart_devices.jpg
ARM is much slower than x86
[Figure: execution time of ARM and x86, normalized to x86 execution time, for 2mm, 3mm, cholesky, correlation, covariance, doitgen, dynprog, fdtd-2d, gemm, reg_detect, symm, syr2k, syrk, and GEOMEAN; y-axis from 0 to 35]
Client: ARMv7 Processor 1.70GHz 4-core (Lubuntu 14.04)
Server: Intel(R) Xeon(R) E5-2407 2.20GHz 8-core (Ubuntu 14.04)
Offloading has been proposed
• Existing offloading techniques rely on virtual machines
[Figure: CloneCloud architecture. ARM side: OS, Application, Migration, Profiler, Runtime, App. VM, Manager. x86 side: VMM, Virtual HW, OS, Application, Migration, Profiler, App. VM, Manager]
Source: Byung-Gon Chun et al. CloneCloud: elastic execution between mobile device and cloud. EuroSys '11
VMs are SLOW!!!
[Figure: execution time of an image edge detection program, runtime normalized to C, for C, C++ using STL containers, Java, JIT JavaScript, and Interpreted JavaScript; y-axis from 0 to 9, with one bar annotated 50X]
Source: Mojtaba Mehrara et al. Dynamic parallelization of JavaScript applications using an ultra-lightweight speculation mechanism. HPCA '11
Huge Performance Overhead of Virtual Machine
Offloading for Native Code is necessary
[Figure: ARM machine running OS A with the application binary; x86 machine running OS B with the application binary]
• Different ISAs
• Different Memory Layouts
• Different ABIs (Application Binary Interface)
  • Sizes, layout, and alignment of data types
  • Calling convention
  • System Libraries
Overall System
Inputs: Source Code, Target Info.
Target Selector: Hot Function Detector, Function Filter
Native Offloader: Partitioner, Unified Virtual Address Mngr., ABI Convertor, Communication Optimizer
Outputs: ARM Binary (Whole Prgm), x86 Binary (Offloaded Fcn)
164.gzip in SPEC 2000
main() {
    init();
    compress();
    uncompress();
    verification();
}
init() {
    ..
    file_read_to_memory();
    ..
}
• Constraint cases
  • File I/O
  • System call
  • Machine specific code
Function Coverage
compress 37%
uncompress 42%
verification 1.5%
Total 100%
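The coverage table suggests how a hot function detector could combine with the function filter to pick offloading targets. The following is a minimal sketch in C, not the authors' implementation: the 10% threshold used in the usage note, the `fn_profile` record, and the `offloadable` flag are all assumptions made for illustration, and the coverage of init() is a placeholder because the slides do not give it.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical profile record; names and fields are illustrative only. */
struct fn_profile {
    const char *name;
    double coverage;   /* share of total execution time, in percent */
    int offloadable;   /* 0 if the function filter rejected it (file I/O, system calls, ...) */
};

/* Store in `out` pointers to the entries that pass the filter and reach
   the coverage threshold. Returns the number of selected functions. */
size_t select_targets(const struct fn_profile *in, size_t n,
                      double threshold, const struct fn_profile **out)
{
    assert(in != NULL && out != NULL);
    size_t count = 0;
    for (size_t i = 0; i < n; i++) {
        if (in[i].offloadable && in[i].coverage >= threshold)
            out[count++] = &in[i];
    }
    return count;
}

/* Profile of the 164.gzip example. init() is marked non-offloadable because
   it reads a file; its coverage value of 0 is a placeholder, not a measurement. */
const struct fn_profile gzip_profile[] = {
    { "init",         0.0,  0 },
    { "compress",     37.0, 1 },
    { "uncompress",   42.0, 1 },
    { "verification", 1.5,  1 },
};
```

Calling `select_targets(gzip_profile, 4, 10.0, out)` under these assumptions keeps compress and uncompress and drops verification, whose coverage is too small to repay the communication cost.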
Native Offloader: Partitioner, Unified Virtual Address Mngr., ABI Convertor, Communication Optimizer
Client: ARM
main() {
    init();
    compress();
    uncompress();
    verification();
}
Server: x86
main() {
    while (id = recv()) {
        switch (id) {
        }
        send(ret);
    }
}
Client: ARM
main() {
    init();
    send(compress_id);
    ret = recv();
    uncompress();
    verification();
}
Server: x86
main() {
    while (id = recv()) {
        switch (id) {
        case compress_id:
            ret = compress();
            break;
        }
        send(ret);
    }
}
Client: ARM
main() {
    init();
    send(compress_id);
    ret = recv();
    send(uncompress_id);
    ret = recv();
    verification();
}
Server: x86
main() {
    while (id = recv()) {
        switch (id) {
        case compress_id:
            ret = compress();
            break;
        case uncompress_id:
            ret = uncompress();
            break;
        }
        send(ret);
    }
}
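The partitioned client and server above can be sketched as compilable C. This is an illustration under stated assumptions rather than the generated code: the function IDs, the stub bodies, and the names `server_dispatch` and `client_offload` are hypothetical, and the network is replaced by a direct call so the example stays self-contained.

```c
#include <assert.h>

/* Function IDs shared by client and server (hypothetical values). */
enum fn_id { FN_NONE = 0, FN_COMPRESS = 1, FN_UNCOMPRESS = 2 };

/* Stubs standing in for the offloaded functions. */
static int compress_stub(void)   { return 37; }
static int uncompress_stub(void) { return 42; }

/* Server side: handle one request and produce the value to send back.
   Unknown IDs yield -1. */
int server_dispatch(enum fn_id id)
{
    int ret = -1;
    switch (id) {
    case FN_COMPRESS:
        ret = compress_stub();
        break;              /* without break, the next case would also run */
    case FN_UNCOMPRESS:
        ret = uncompress_stub();
        break;
    default:
        break;
    }
    return ret;
}

/* Client side: a call such as compress() becomes a request/response pair.
   Here the transport is a direct call; a real system would use a socket. */
int client_offload(enum fn_id id)
{
    assert(id != FN_NONE);
    /* send(id); ret = recv(); */
    return server_dispatch(id);
}
```

Keeping the IDs in one shared enum means the client and server binaries, although compiled for different ISAs, agree on which function each request names.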
[Figure: client's memory layout and server's memory layout, each with text, global variables, heap (brk) and stack (sp); regions are marked "Overwritten" where the two layouts conflict]
[Figure: client's memory layout and server's memory layout under the Unified Virtual Address Mngr., each with text, global variables, heap (brk) and stack (sp); no region is marked as overwritten]
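Structure layout
struct Foo {
    char a;
    long long b;
    int c;
};
[Figure: layout of struct Foo. ARM: a, b, and c each start a new row. x86: a and b share a row, followed by c]
Conversion: on x86, struct Foo is replaced by a struct padded to the ARM layout
struct Foo_cvrt {
    char a;
    char dummy1[7];
    long long b;
    int c;
    int dummy2;
};
[Figure: layout of struct Foo_cvrt on x86: a, b, and c each start a new row, as on ARM]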
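The effect of the padding can be checked at compile time. The sketch below assumes 8-byte long long and 4-byte int, and that the x86 layout on the slide is the 32-bit one, where long long is aligned to 4 bytes while ARM aligns it to 8. The padding members are named `dummy1` and `dummy2` here because a C struct cannot contain two members with the same name.

```c
#include <assert.h>
#include <stddef.h>

/* ARM-compatible layout written with explicit padding, so the offsets do not
   depend on whether the host ABI aligns long long to 4 or 8 bytes. */
struct Foo_cvrt {
    char a;
    char dummy1[7];   /* pushes b to offset 8, as on ARM */
    long long b;
    int c;
    int dummy2;       /* rounds the size up to a multiple of 8 */
};

/* Compile-time checks of the layout the ARM side expects. */
static_assert(offsetof(struct Foo_cvrt, a) == 0,  "a must sit at offset 0");
static_assert(offsetof(struct Foo_cvrt, b) == 8,  "b must sit at offset 8");
static_assert(offsetof(struct Foo_cvrt, c) == 16, "c must sit at offset 16");
static_assert(sizeof(struct Foo_cvrt) == 24,      "total size must be 24 bytes");
```

Because every gap is spelled out as a member, data written by the ARM client can be read through `struct Foo_cvrt` on the server without any per-field translation.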
internal Foo fn1(Foo a);
Function offloaded() {
    …
    Foo a = *pa;
    Foo ret = fn1(a);
    …
}
internal Foo fn1(Foo a);
internal Foo_cvrt fn1_cvrt(Foo_cvrt a);
Function offloaded() {
    …
    Foo_cvrt a = *pa;
    Foo_cvrt ret = fn1_cvrt(a);
    …
}
external Foo fn2(Foo a);
Function offloaded() {
    …
    Foo a = *pa;
    Foo tret = fn2(a);
    …
}
external Foo fn2(Foo a);
Function offloaded() {
    …
    Foo_cvrt a = *pa;
    Foo ta = convert_to_x86(a);
    Foo tret = fn2(ta);
    Foo_cvrt ret = convert_to_arm(tret);
    …
}
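The slides name `convert_to_x86` and `convert_to_arm` but do not show their bodies. One plausible way to write them is a field-by-field copy, sketched below; the bodies, the stand-in `fn2`, the wrapper `call_fn2`, and the padding names `dummy1` and `dummy2` are assumptions for illustration.

```c
#include <assert.h>

/* Native layout on the server, chosen by the compiler. */
struct Foo { char a; long long b; int c; };

/* ARM-compatible layout with explicit padding. */
struct Foo_cvrt { char a; char dummy1[7]; long long b; int c; int dummy2; };

static_assert(sizeof(struct Foo_cvrt) == 24, "assumes 8-byte long long, 4-byte int");

/* Field-by-field copies: the compiler computes each offset, so no manual
   pointer arithmetic is needed. */
struct Foo convert_to_x86(struct Foo_cvrt in)
{
    struct Foo out = { .a = in.a, .b = in.b, .c = in.c };
    return out;
}

struct Foo_cvrt convert_to_arm(struct Foo in)
{
    /* Members not named in the initializer (the padding) are zeroed. */
    struct Foo_cvrt out = { .a = in.a, .b = in.b, .c = in.c };
    return out;
}

/* Stand-in for an external library function that only knows struct Foo. */
static struct Foo fn2(struct Foo a) { a.c += 1; return a; }

/* Wrapper used inside offloaded code: convert in, call, convert back. */
struct Foo_cvrt call_fn2(struct Foo_cvrt a)
{
    return convert_to_arm(fn2(convert_to_x86(a)));
}
```

Internal functions avoid this cost because they can be recompiled against `Foo_cvrt` directly; the conversion is only needed at the boundary with external code whose layout cannot be changed.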
• Speculative page migration (Before offloading)
Profiling result: Used In Offloaded(): Page #1, #2, #3 …

Client memory
Page#  Value   Dirty
1      0x5052  0
2      0xFF00  0
3      0x2A48  0
4      0xF35A  0
……

Server memory
Page#  Value   Dirty
1
2
3
4
……

Migration (client to server)
Page#  Value   Dirty
1      0x5052  0
2      0xFF00  0
3      0x2A48  0
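Speculative migration can be illustrated with a toy page table. This is a simulation under stated assumptions, not the system's mechanism: each page holds a single value, `struct page` and `prefetch_pages` are hypothetical names, and page indices are 0-based, so page #1 on the slide is index 0.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NPAGES 4   /* tiny memory for illustration */

struct page {
    unsigned value;
    bool present;   /* a valid copy exists on this machine */
    bool dirty;     /* modified since it was copied */
};

/* Before offloading: copy the pages that profiling saw the offloaded
   function use. Returns the number of pages sent. */
size_t prefetch_pages(const struct page *client, struct page *server,
                      const size_t *used, size_t nused)
{
    size_t sent = 0;
    for (size_t i = 0; i < nused; i++) {
        size_t p = used[i];
        assert(p < NPAGES);
        if (!server[p].present) {
            server[p].value = client[p].value;
            server[p].present = true;
            server[p].dirty = false;
            sent++;
        }
    }
    return sent;
}
```

With the profile from the slide (pages #1 to #3), three pages are sent in one batch before the offloaded function starts, and page #4 stays on the client only.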
• Lazy Loading (During offloading)

Client memory
Page#  Value   Dirty
1      0x5052  0
2      0xFF00  0
3      0x2A48  0
4      0xF35A  0
……

Server memory
Page#  Value         Dirty
1      0x5052        0
2      0xFF00        0
3      0x2A48        0
4      (Page Fault)
……

Request: page 4
Migration (client to server)
Page#  Value   Dirty
4      0xF35A  0
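Lazy loading can be sketched with the same kind of toy page table. The code below simulates the page fault with a software `present` flag; a real implementation would have to rely on hardware page protection, which the slides do not detail. The names `struct page` and `server_read` are hypothetical, and indices are 0-based.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NPAGES 4   /* tiny memory for illustration */

struct page {
    unsigned value;
    bool present;   /* a valid copy exists on this machine */
    bool dirty;     /* modified since it was copied */
};

/* Read a page on the server. If the page was not migrated in advance,
   count a fault, fetch it from the client, and then serve the read. */
unsigned server_read(struct page *server, const struct page *client,
                     size_t p, unsigned *faults)
{
    assert(p < NPAGES && faults != NULL);
    if (!server[p].present) {
        (*faults)++;                        /* page fault: request to client */
        server[p].value = client[p].value;  /* migration of the missing page */
        server[p].present = true;
        server[p].dirty = false;
    }
    return server[p].value;
}
```

Only the first access to page #4 pays for a round trip; later reads find the page present and proceed locally.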
• Write-back (After offloading)

Client memory
Page#  Value   Dirty
1      0x5052  0
2      0xFF00  0
3      0x2A48  0
4      0xF35A  0
……

Server memory (before write-back)
Page#  Value   Dirty
1      0x5052  0
2      0x00AC  1
3      0x2000  1
4      0xF35A  0
……

Server memory (after write-back)
Page#  Value   Dirty
1      0x5052  0
2      0x00AC  0
3      0x2000  0
4      0xF35A  0
……

Write-back (server to client)
Page#  Value   Dirty
2      0x00AC  0
3      0x2000  0
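Write-back can be sketched in the same toy model: only pages whose dirty bit is set travel back to the client, and the bit is cleared afterwards. `struct page`, `server_write`, and `write_back` are hypothetical names, each page holds a single value, and indices are 0-based.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NPAGES 4   /* tiny memory for illustration */

struct page {
    unsigned value;
    bool present;   /* a valid copy exists on this machine */
    bool dirty;     /* modified since it was copied */
};

/* A store executed by the offloaded function marks the page dirty. */
void server_write(struct page *server, size_t p, unsigned value)
{
    assert(p < NPAGES && server[p].present);
    server[p].value = value;
    server[p].dirty = true;
}

/* After offloading: send dirty pages back and clear their dirty bits.
   Returns the number of pages written back. */
size_t write_back(struct page *client, struct page *server)
{
    size_t sent = 0;
    for (size_t p = 0; p < NPAGES; p++) {
        if (server[p].present && server[p].dirty) {
            client[p].value = server[p].value;
            server[p].dirty = false;
            sent++;
        }
    }
    return sent;
}
```

In the slide's example, pages #2 and #3 are modified on the server, so two pages are returned and pages #1 and #4 cause no traffic.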
Evaluation
[Figure: speedup for gemm, 2mm, 3mm, cholesky, correlation, covariance, doitgen, dynprog, fdtd-2d, reg_detect, symm, syr2k, syrk, and GeoMean; y-axis from 0 to 10]
Client: ARMv7 Processor 1.70GHz 4-core (Lubuntu 14.04)
Server: Intel(R) Xeon(R) E5-2407 2.20GHz 8-core (Ubuntu 14.04)
Conclusion
• We developed a compiler framework that provides workload offloading for native code from ARM to x86.
• We solve the problems of different ISAs, memory layouts, and ABIs that occur when offloading native code.