Be-Nice Scheduling for embedded SMT processors. Apr 6th, 2008, Boston. Handong Ye.
![Page 1: Be-Nice Scheduling for embedded SMT processors Apr 6 th, 2008 Boston Handong Ye.](https://reader035.fdocuments.us/reader035/viewer/2022062715/56649d7e5503460f94a60b20/html5/thumbnails/1.jpg)
Be-Nice Scheduling for embedded SMT processors
Apr 6th, 2008, Boston
Handong Ye
![Page 2: Be-Nice Scheduling for embedded SMT processors Apr 6 th, 2008 Boston Handong Ye.](https://reader035.fdocuments.us/reader035/viewer/2022062715/56649d7e5503460f94a60b20/html5/thumbnails/2.jpg)
Be-Nice Scheduling
• ITS (Inter-Thread Stall) Introduction
• Be-Nice Scheduling
• Some experimental results
![Page 3: Be-Nice Scheduling for embedded SMT processors Apr 6 th, 2008 Boston Handong Ye.](https://reader035.fdocuments.us/reader035/viewer/2022062715/56649d7e5503460f94a60b20/html5/thumbnails/3.jpg)
Be-Nice Scheduling
• ITS Introduction
  – ITS in Out-Of-Order processor
  – ITS in In-Order processor
• Be-Nice Scheduling
• Some experimental results
![Page 4: Be-Nice Scheduling for embedded SMT processors Apr 6 th, 2008 Boston Handong Ye.](https://reader035.fdocuments.us/reader035/viewer/2022062715/56649d7e5503460f94a60b20/html5/thumbnails/4.jpg)
• ITS Introduction
  – ITS in Out-Of-Order machine
    • A thread holds (or fills up) shared resources for too long, e.g., the instruction queue, reservation stations, ..., and blocks the other threads
    • Flush, …
  – ITS in In-Order machine
    • A thread holds Functional Units, blocking the other threads
    • 2 examples
    • What can the compiler do?
Be-Nice Scheduling
![Page 5: Be-Nice Scheduling for embedded SMT processors Apr 6 th, 2008 Boston Handong Ye.](https://reader035.fdocuments.us/reader035/viewer/2022062715/56649d7e5503460f94a60b20/html5/thumbnails/5.jpg)
• ITS Introduction
  – ITS in In-Order machine
    • Examples, assume:
      – SMT, 2 threads
      – Embedded
      – 2 LS units and 2 ALUs
      – Separate dispatch buffer
Be-Nice Scheduling
![Page 6: Be-Nice Scheduling for embedded SMT processors Apr 6 th, 2008 Boston Handong Ye.](https://reader035.fdocuments.us/reader035/viewer/2022062715/56649d7e5503460f94a60b20/html5/thumbnails/6.jpg)
• ITS Introduction
  – ITS in In-Order machine
    • Example 1 (Same-FU ITS)
      – A load that misses can block other threads that are using the same LS unit
Be-Nice Scheduling
![Page 7: Be-Nice Scheduling for embedded SMT processors Apr 6 th, 2008 Boston Handong Ye.](https://reader035.fdocuments.us/reader035/viewer/2022062715/56649d7e5503460f94a60b20/html5/thumbnails/7.jpg)
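[Figure: Example 1, same-FU block. Pipeline diagram with the Thread-A and Thread-B dispatch buffers, the functional units LS1, LS2, ALU1, and ALU2, and the EXE, MEM, and WB stages; ld and add instructions are in flight and one load is marked MISS.]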
![Page 8: Be-Nice Scheduling for embedded SMT processors Apr 6 th, 2008 Boston Handong Ye.](https://reader035.fdocuments.us/reader035/viewer/2022062715/56649d7e5503460f94a60b20/html5/thumbnails/8.jpg)
• ITS Introduction
  – ITS in In-Order machine
    • Example 2 (Cross-FU ITS)
      – A load that misses can block other threads that are using non-LS Functional Units, e.g., an ALU
Be-Nice Scheduling
![Page 9: Be-Nice Scheduling for embedded SMT processors Apr 6 th, 2008 Boston Handong Ye.](https://reader035.fdocuments.us/reader035/viewer/2022062715/56649d7e5503460f94a60b20/html5/thumbnails/9.jpg)
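[Figure: Example 2, cross-FU block. Pipeline diagram with the Thread-A and Thread-B dispatch buffers, the functional units LS1, LS2, ALU1, and ALU2, and the EXE, MEM, and WB stages; ld and add instructions are in flight and one load is marked MISS.]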
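The two examples can be restated as a small classification rule. The sketch below is a toy model in Python, not the simulator behind the talk: the function name `blocked_by_miss`, the pipe contents, and the rule that every in-flight instruction of the missing thread stalls in place are all assumptions made for illustration, and the instruction placement is not read off the original figures.

```python
def blocked_by_miss(pipes, miss_thread, miss_fu):
    """Classify other threads' instructions stuck behind a stalled thread.

    pipes maps each functional unit name to the instructions currently in
    that unit's pipe, oldest first, as (thread, op) tuples.
    """
    blocked = {"same_fu": [], "cross_fu": []}
    for fu, slots in pipes.items():
        stalled_ahead = False
        for thread, op in slots:
            if thread == miss_thread:
                # Assumption: every in-flight instruction of the missing
                # thread stalls in place until the miss is served.
                stalled_ahead = True
            elif stalled_ahead:
                # Another thread's instruction is queued behind a stalled one.
                kind = "same_fu" if fu == miss_fu else "cross_fu"
                blocked[kind].append((fu, thread, op))
    return blocked


# Example 1: Thread-B's load sits behind Thread-A's missed load in LS1.
example1 = {"LS1": [("A", "ld"), ("B", "ld")], "LS2": [],
            "ALU1": [("B", "add")], "ALU2": []}
# Example 2: Thread-B's add sits behind Thread-A's stalled add in ALU1.
example2 = {"LS1": [("A", "ld")], "LS2": [],
            "ALU1": [("A", "add"), ("B", "add")], "ALU2": [("B", "add")]}

print(blocked_by_miss(example1, "A", "LS1"))
print(blocked_by_miss(example2, "A", "LS1"))
```

In this model a Thread-B instruction is only counted when it is queued directly behind a stalled Thread-A instruction in the same pipe; knock-on stalls inside Thread-B itself are ignored.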
![Page 10: Be-Nice Scheduling for embedded SMT processors Apr 6 th, 2008 Boston Handong Ye.](https://reader035.fdocuments.us/reader035/viewer/2022062715/56649d7e5503460f94a60b20/html5/thumbnails/10.jpg)
• ITS Introduction
  – ITS in In-Order machine
Be-Nice Scheduling
Assume:
1. Thread-A has a cache miss rate of around 1%~2%
2. Thread-B always hits
Results:
1. Half of the idle cycles are due to ITS
2. Almost 1/3 of the cycles are idle
The effect of ITS, from thread-A to thread-B
![Page 11: Be-Nice Scheduling for embedded SMT processors Apr 6 th, 2008 Boston Handong Ye.](https://reader035.fdocuments.us/reader035/viewer/2022062715/56649d7e5503460f94a60b20/html5/thumbnails/11.jpg)
• ITS Introduction
  – ITS in In-Order machine
    • What can the compiler do?
      – Focused on in-order embedded processors
      – Needs a little simple HW support
      – Uses Open64, in Instruction Scheduling
Be-Nice Scheduling
![Page 12: Be-Nice Scheduling for embedded SMT processors Apr 6 th, 2008 Boston Handong Ye.](https://reader035.fdocuments.us/reader035/viewer/2022062715/56649d7e5503460f94a60b20/html5/thumbnails/12.jpg)
Be-Nice Scheduling
• ITS (Inter-Thread Stall) Introduction
• Be-Nice Scheduling
• Some experimental results
![Page 13: Be-Nice Scheduling for embedded SMT processors Apr 6 th, 2008 Boston Handong Ye.](https://reader035.fdocuments.us/reader035/viewer/2022062715/56649d7e5503460f94a60b20/html5/thumbnails/13.jpg)
• Be-Nice Scheduling
  • Intuitive thinking
    – Prefetch: unacceptable for embedded systems
    – Reduce Cross-FU ITS: reduce the number of FUs held by thread-A
    – Reduce Same-FU ITS: avoid issuing instructions from other threads into the blocked FUs
Be-Nice Scheduling
![Page 14: Be-Nice Scheduling for embedded SMT processors Apr 6 th, 2008 Boston Handong Ye.](https://reader035.fdocuments.us/reader035/viewer/2022062715/56649d7e5503460f94a60b20/html5/thumbnails/14.jpg)
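[Figure: pipeline diagram (Thread-A and Thread-B dispatch buffers; LS1, LS2, ALU1, ALU2; EXE, MEM, WB stages) showing the original Thread-A instruction sequence of add and ld instructions next to its scheduled ("sched") version.]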
![Page 15: Be-Nice Scheduling for embedded SMT processors Apr 6 th, 2008 Boston Handong Ye.](https://reader035.fdocuments.us/reader035/viewer/2022062715/56649d7e5503460f94a60b20/html5/thumbnails/15.jpg)
• Be-Nice Scheduling
  – Objective
    • Schedule n (>=2) loads back-to-back
    • Issue the n loads to the same FU
  – Compiler + HW solution
    • HW side
      – Add an extra load instruction, ld.n (n=1,2), which sends the load only to the nth LS unit
      – Each thread has its own preferred LS unit
    • Compiler side
      – Profile to find the loads that are highly likely to miss, say 'load_a'
      – Schedule another load, say 'load_b', directly behind 'load_a', and glue the two together as a pseudo OP
      – Change 'load_a' and 'load_b' to use the thread's preferred LS unit, e.g., both are changed to 'ld.1'
Be-Nice Scheduling
![Page 16: Be-Nice Scheduling for embedded SMT processors Apr 6 th, 2008 Boston Handong Ye.](https://reader035.fdocuments.us/reader035/viewer/2022062715/56649d7e5503460f94a60b20/html5/thumbnails/16.jpg)
• Be-Nice Scheduling
  – A Compiler + HW solution
Be-Nice Scheduling
BB1:
    $r1 = ld $r2
    $r2 = $r2 + 4
    $r3 = ld $r4
    $r3 = $r3 + 4
    $r5 = $r1 + $r3

BB1:
    $r1 = ld $r2
    $r3 = ld $r4
    $r2 = $r2 + 4
    $r3 = $r3 + 4
    $r5 = $r1 + $r3

Identified to miss:
BB1:
    $r1 = ld $r2
    $r2 = $r2 + 4
    $r3 = ld $r4
    $r3 = $r3 + 4
    $r5 = $r1 + $r3

BB1:
    $r1 = ld.1 $r2
    $r3 = ld.1 $r4
    $r2 = $r2 + 4
    $r3 = $r3 + 4
    $r5 = $r1 + $r3
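The transformation on this slide can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the Open64 implementation: the `Instr` class, `parse`, `independent`, and `be_nice_pair` are hypothetical names, the load to pair is given by index rather than by profiling, and only register dependences are checked (a real scheduler would also have to respect memory dependences such as intervening stores).

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Instr:
    dest: str
    op: str        # "ld", "ld.1", "+", ...
    srcs: tuple

    def __str__(self):
        if self.op.startswith("ld"):
            return f"{self.dest} = {self.op} {self.srcs[0]}"
        return f"{self.dest} = {self.srcs[0]} {self.op} {self.srcs[1]}"


def parse(line):
    """Parse '$r1 = ld $r2' or '$r2 = $r2 + 4' into an Instr."""
    dest, rhs = (s.strip() for s in line.split("=", 1))
    parts = rhs.split()
    if parts[0].startswith("ld"):
        return Instr(dest, parts[0], (parts[1],))
    return Instr(dest, parts[1], (parts[0], parts[2]))


def regs(instr):
    """Source operands that are registers (immediates are skipped)."""
    return {s for s in instr.srcs if s.startswith("$")}


def independent(a, b):
    """True if a and b have no RAW, WAR, or WAW register dependence."""
    return a.dest not in regs(b) and b.dest not in regs(a) and a.dest != b.dest


def be_nice_pair(block, miss_idx, ls_unit=1):
    """Pull the next movable load up behind block[miss_idx]; steer both loads."""
    load_a = block[miss_idx]
    for j in range(miss_idx + 1, len(block)):
        load_b = block[j]
        if not load_b.op.startswith("ld"):
            continue
        between = block[miss_idx + 1:j]
        # load_b may move up only past instructions it does not depend on.
        if all(independent(x, load_b) for x in between):
            steered = f"ld.{ls_unit}"
            pair = [replace(load_a, op=steered), replace(load_b, op=steered)]
            return block[:miss_idx] + pair + between + block[j + 1:]
    return list(block)  # no legal partner found: leave the block unchanged


bb1 = [parse(s) for s in ["$r1 = ld $r2", "$r2 = $r2 + 4", "$r3 = ld $r4",
                          "$r3 = $r3 + 4", "$r5 = $r1 + $r3"]]
scheduled = be_nice_pair(bb1, miss_idx=0)
for instr in scheduled:
    print(instr)
```

Run on BB1 with the first load marked as likely to miss, the sketch produces the same final ordering as the slide, with both loads rewritten to ld.1.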
![Page 17: Be-Nice Scheduling for embedded SMT processors Apr 6 th, 2008 Boston Handong Ye.](https://reader035.fdocuments.us/reader035/viewer/2022062715/56649d7e5503460f94a60b20/html5/thumbnails/17.jpg)
[Figure: Open64 code generator flowchart. Boxes shown: WHIRL, CG-expand, CGIR, Control flow opt., If-conversion, Loop optimizations (Software pipelining, Loop unrolling), Scheduling pre-pass (GCM here), Global register alloc, Local register alloc, Scheduling post-pass, Prolog and Epilog, Extended block optimizer, Code emission, .s]
Be-Nice Scheduling
![Page 18: Be-Nice Scheduling for embedded SMT processors Apr 6 th, 2008 Boston Handong Ye.](https://reader035.fdocuments.us/reader035/viewer/2022062715/56649d7e5503460f94a60b20/html5/thumbnails/18.jpg)
• Be-Nice Scheduling (in Open64 GCM and LIS)
  – The key points during code motion
    • Use GCM to find candidates for a <ld.1, ld.1> pair
    • Move the pair as a single 'pseudo' instruction
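One way to picture moving the pair as a single pseudo instruction is to give the glued pair the union of both loads' register definitions and uses, then move it with the usual dependence checks. The Python sketch below assumes a flat list of operations; `Op`, `glue`, `conflicts`, and `hoist` are hypothetical names and do not correspond to Open64 data structures, and the instructions in the example are made up.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Op:
    text: str
    defs: frozenset   # registers written
    uses: frozenset   # registers read


def glue(load_a, load_b):
    """Combine two loads into one pseudo op that moves as a unit."""
    return Op(f"<{load_a.text}; {load_b.text}>",
              load_a.defs | load_b.defs,
              load_a.uses | load_b.uses)


def conflicts(x, y):
    """True if x and y have a RAW, WAR, or WAW register dependence."""
    return bool(x.defs & y.uses or x.uses & y.defs or x.defs & y.defs)


def hoist(block, idx):
    """Move block[idx] upward past every preceding op it does not conflict with."""
    block = list(block)
    while idx > 0 and not conflicts(block[idx - 1], block[idx]):
        block[idx - 1], block[idx] = block[idx], block[idx - 1]
        idx -= 1
    return block


i0 = Op("$r4 = $r9 + 8", frozenset({"$r4"}), frozenset({"$r9"}))
i1 = Op("$r6 = $r7 + 1", frozenset({"$r6"}), frozenset({"$r7"}))
pair = glue(Op("$r1 = ld.1 $r2", frozenset({"$r1"}), frozenset({"$r2"})),
            Op("$r3 = ld.1 $r4", frozenset({"$r3"}), frozenset({"$r4"})))

moved = hoist([i0, i1, pair], 2)
print([op.text for op in moved])
```

Here the pair moves above `i1`, which it does not depend on, and stops below `i0`, which writes `$r4`, the address register of the second load. Because the two loads travel inside one `Op`, no code motion can separate them.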
Be-Nice Scheduling
![Page 19: Be-Nice Scheduling for embedded SMT processors Apr 6 th, 2008 Boston Handong Ye.](https://reader035.fdocuments.us/reader035/viewer/2022062715/56649d7e5503460f94a60b20/html5/thumbnails/19.jpg)
Be-Nice Scheduling
• Some experimental results
  – Be-Nice Schedule on Thread-A
  – Performance difference on Thread-B
![Page 20: Be-Nice Scheduling for embedded SMT processors Apr 6 th, 2008 Boston Handong Ye.](https://reader035.fdocuments.us/reader035/viewer/2022062715/56649d7e5503460f94a60b20/html5/thumbnails/20.jpg)
Be-Nice Scheduling
• Some experimental results
The Number of ITS Cycles in thread-B: w/ Be-Nice vs. w/o Be-Nice
![Page 21: Be-Nice Scheduling for embedded SMT processors Apr 6 th, 2008 Boston Handong Ye.](https://reader035.fdocuments.us/reader035/viewer/2022062715/56649d7e5503460f94a60b20/html5/thumbnails/21.jpg)
Be-Nice Scheduling
• Some experimental results
IPC Improvement of thread-B with Be-Nice Instruction Scheduling