Lec15 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- EPIC VLIW
ECE 4100/6100 Advanced Computer Architecture
Lecture 15 Static Scheduling Machines
Prof. Hsien-Hsin Sean Lee
School of Electrical and Computer Engineering
Georgia Institute of Technology
Static Scheduling
• Compiler performs instruction scheduling
• VLIW: Very Long Instruction Word
• An alternative to dynamic scheduling processors
• Pack multiple operations into one instruction
• Move scheduling to the compiler (software approach)
• Can simplify the complexity of a hardware-based instruction scheduler
• Cydrome, Multiflow, EPIC
Very Long Instruction Word (VLIW)
• Rely on compilers
• Simple hardware
• Dependency is explicitly represented in the instructions
• Instruction window, supposedly, is much larger than a hardware scheduling window
– How about loop boundary?
– How about function boundary?
– Interprocedural optimization is generally difficult
• Might lead to compatibility or performance issues if instruction latency changes
• EPIC/Itanium closely follows the VLIW philosophy; many embedded and DSP processors embrace VLIW
Intel Itanium ISA
• Itanium Instruction “Bundle” (VLIW)
– 128 bits each
– Contains three Itanium instructions (aka syllables)
– Template bits in each bundle specify dependencies both within a bundle as well as between sequential bundles
– A collection of independent bundles forms a “group” (use stops)
• Each Itanium Instruction
– Fixed-length 41 bits long
– Left-most 4 bits (40-37) are the major opcode (e.g. FP ld/st, INT ld/st, ALU)
– Contains max three 7-bit register specifiers
– Contains a 6-bit field for specifying one of the 64 one-bit qualifying predicate registers
Bundle layout (bit 127 down to bit 0):
instruction slot (bits 127-87) | instruction slot (bits 86-46) | instruction slot (bits 45-5) | template (bits 4-0)
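The bundle layout above can be made concrete with a small decoder. This is only a sketch: the helper names (`encode_bundle`, `decode_bundle`, `decode_slot`) are hypothetical, and the position of the qualifying predicate in bits 5-0 of each instruction is taken from the IA-64 instruction formats rather than from the slide.

```python
SLOT_BITS = 41       # each instruction (syllable) is 41 bits
TEMPLATE_BITS = 5    # template occupies bits 4-0 of the bundle


def encode_bundle(template, slots):
    """Pack a 5-bit template and three 41-bit slots into a 128-bit integer."""
    bundle = template & ((1 << TEMPLATE_BITS) - 1)
    for i, instr in enumerate(slots):
        shift = TEMPLATE_BITS + i * SLOT_BITS
        bundle |= (instr & ((1 << SLOT_BITS) - 1)) << shift
    return bundle


def decode_bundle(bundle):
    """Split a 128-bit bundle into its template and its three slots."""
    if not 0 <= bundle < (1 << 128):
        raise ValueError("bundle must fit in 128 bits")
    template = bundle & ((1 << TEMPLATE_BITS) - 1)
    slots = [
        (bundle >> (TEMPLATE_BITS + i * SLOT_BITS)) & ((1 << SLOT_BITS) - 1)
        for i in range(3)
    ]
    return {"template": template, "slots": slots}


def decode_slot(instr):
    """Extract the two fields whose positions are fixed for every instruction."""
    return {
        "major_opcode": (instr >> 37) & 0xF,  # bits 40-37
        "qp": instr & 0x3F,                   # assumed: qualifying predicate in bits 5-0
    }
```

Register specifiers are left out on purpose, since their positions depend on the instruction format.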
Encoding Instruction Bundle
• Use “;;” as the “stop bit” in assembly code to separate dependent instructions
• Instructions between “;;” belong to the same “instruction group”
– RAW and WAW are not allowed in the same instruction group
– WAR is allowed except for a special case: when writing p63 by a modulo-scheduled branch (e.g. br.ctop) after reading p63 (e.g. as qualifying predicate) by a B-type instruction
• Each instruction slot can represent one (out of 5) functional unit type based on encoding (e.g. slot 0 can be M-unit or B-unit)
• 12 basic templates provided, each with 2 versions depending on stop bit
– MII, MI_I, MLX, MMI, M_MI, MFI, MMF, MIB, MBB, BBB, MMB, MFB
– MII_, MI_I_, MLX_, MMI_, M_MI_, MFI_, MMF_, MIB_, MBB_, BBB_, MMB_, MFB_
Example:
{ .mii
    ld4 r28 = [r8]
    add r9 = 2, r1 ;;
    add r30 = 1, r9
}
MI_I format ⇒ Template encoded “02”
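The RAW/WAW rule for instruction groups can be checked mechanically. The sketch below is a simplification under stated assumptions: `check_groups` is a hypothetical helper, instructions are given as (destinations, sources, stop) tuples, and predicates plus the architecture's special-case exceptions are ignored.

```python
def check_groups(instrs):
    """Report RAW/WAW dependences inside an instruction group.

    instrs: list of (dests, srcs, stop) in program order, where dests and
    srcs are lists of register names and stop is True if ';;' follows.
    Returns a list of (instruction index, kind, register).
    """
    violations = []
    written = set()                 # registers written so far in this group
    for idx, (dests, srcs, stop) in enumerate(instrs):
        for reg in srcs:
            if reg in written:
                violations.append((idx, "RAW", reg))
        for reg in dests:
            if reg in written:
                violations.append((idx, "WAW", reg))
        written.update(dests)
        if stop:                    # ';;' ends the group
            written.clear()
    return violations


# The MI_I example: the stop after the first add makes the read of r9 legal.
with_stop = [(["r28"], ["r8"], False),
             (["r9"], ["r1"], True),
             (["r30"], ["r9"], False)]
without_stop = [(d, s, False) for d, s, _ in with_stop]
```

Here `check_groups(with_stop)` finds nothing, while `check_groups(without_stop)` flags the read of r9 in the third instruction.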
Itanium Instruction Example
{ .mii
    add r1 = r2, r3
    sub r4 = r4, r5 ;;
    shr r1 = r4, r1 ;;
}
{ .mmi
    ld8 r2 = [r1] ;;
    st8 [r1] = r23
    tbit p1, p2 = r4, 5
}
{ .mbb
    ld8 r45 = [r55]
    (p3) br.call b1 = func1
    (p4) br.cond Label1
}
{ .mfi
    st4 [r45] = r6
    fmac f1 = f2, f3
    add r3 = r3, 8 ;;
}
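Itanium Register Files
[Figure: three register files, each split into a static part and a stacked (rotating) part.
General Purpose Registers: 0-31 static, 32-127 stacked (rotating); bits 63-0
FP Registers: 0-31 static, 32-127 stacked (rotating); bits 81-0
Predicate Registers: 0-15 static, 16-63 stacked (rotating); 1 bit each]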
Register Stack Engine
• Avoid spills/fills during function call/return
• Callee uses the instruction alloc r1=ar.pfs, i, l, o, r upon entering a function
[Figure: GPR file with static registers 0-31 and, from r32 up, the current frame divided into inputs, locals, and outputs; registers above the frame (up to r127) are illegal.
Current Frame Marker (CFM), 38 bits, with fields:
sof: size of frame
sol: size of locals (sol = i+l)
sor: size of rotating
rrb.gr, rrb.fr, rrb.pr]
Function Call Example
main() {
    a = foo(i*i, b[i]);
}
int foo(int ii, int bb) {
}
main: alloc r32=ar.pfs,0,12,2,0
foo: alloc r26=ar.pfs,2,5,0,0
[Figure: Caller (main) GPR frame r32-r45: locals r32-r43, outputs r44 = i*i and r45 = b[i].
Callee (foo) GPR frame r32-r38: the caller's outputs appear as inputs r32 = i*i and r33 = b[i].]
RSE: A Function Call
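[Figure: before the call, the caller's frame starts at r32 with locals r32-r45 and outputs r46-r52; CFM: sol = 14, sof = 21; PFS.pfm: xx.
After the call, the callee sees only the caller's outputs, renamed to r32-r38; CFM: sol = 0, sof = 7; PFS.pfm: sol = 14, sof = 21.]
pfm: Previous frame marker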
RSE: Alloc
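[Figure: caller frame (locals r32-r45, outputs r46-r52; CFM: sol = 14, sof = 21; PFS.pfm: xx).
After call: frame r32-r38 (outputs only); CFM: sol = 0, sof = 7; PFS.pfm: sol = 14, sof = 21.
After alloc r32=ar.pfs,7,9,3,0: inputs and locals r32-r47, outputs r48-r50; CFM: sol = 16, sof = 19; PFS.pfm: sol = 14, sof = 21.]
alloc copies PFM to GR (r32)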
RSE: Return
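[Figure: caller frame (locals r32-r45, outputs r46-r52; CFM: sol = 14, sof = 21; PFS.pfm: xx).
After call: frame r32-r38; CFM: sol = 0, sof = 7; PFS.pfm: sol = 14, sof = 21.
After alloc: locals r32-r47, outputs r48-r50; CFM: sol = 16, sof = 19; PFS.pfm: sol = 14, sof = 21.
After return: the caller's frame is restored (locals r32-r45, outputs r46-r52); CFM: sol = 14, sof = 21; PFS.pfm: sol = 14, sof = 21.]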
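The call, alloc, and return steps in the three RSE figures can be replayed with a toy model of the frame markers. This is a sketch under assumptions: `Frame` and `RegisterStack` are hypothetical names, only sol and sof are tracked, and a single `pfm` slot stands in for PFS.pfm, so nested calls (which need alloc to save ar.pfs in a GR) are not modelled.

```python
from dataclasses import dataclass


@dataclass
class Frame:
    sol: int  # size of locals (inputs + locals)
    sof: int  # size of frame (inputs + locals + outputs)


class RegisterStack:
    def __init__(self, sol, sof):
        self.cfm = Frame(sol, sof)   # current frame marker
        self.pfm = None              # stands in for PFS.pfm
        self.base = 0                # physical offset of r32 (illustrative)

    def call(self):
        # The callee's frame is the caller's output area, renamed to start at r32.
        self.pfm = Frame(self.cfm.sol, self.cfm.sof)
        self.base += self.cfm.sol
        self.cfm = Frame(sol=0, sof=self.pfm.sof - self.pfm.sol)

    def alloc(self, i, l, o, r=0):
        # Resize the current frame; rotating size r is accepted but not modelled.
        self.cfm = Frame(sol=i + l, sof=i + l + o)

    def ret(self):
        # Restore the caller's frame from the previous frame marker.
        self.cfm = Frame(self.pfm.sol, self.pfm.sof)
        self.base -= self.cfm.sol
```

Starting from `RegisterStack(sol=14, sof=21)`, calling `call()`, `alloc(7, 9, 3, 0)`, and `ret()` in turn walks through the same CFM values as the figures.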
Itanium Pipelines
• Performance improvement due to pipeline shortening: 4% to 6%
• The large integer register file causes an extra stage, WLD (Word Line Decode), in Itanium; the circuit was improved for Itanium 2
• Inter-group latency is enforced by a scoreboard
– Latency due to scheduling that failed to space instructions out
– Due to cache misses
[Figure: Itanium and Itanium 2 pipelines. Labels: Front-end; Ckt improved; Dependency Scoreboard Stall checked here prior to EXE]
Itanium 2 Eight-stage Pipeline
Core: IPG  ROT  EXP  REN  REG  EXE  DET  WB
FP:   FP1  FP2  FP3  FP4  WB
L2:   L2N  L2I  L2A  L2M  L2D  L2C  L2W

IPG      IP Generate, L1I cache (6 inst) and TLB access
ROT      Instruction Rotate and Buffer (6 inst)
EXP      Expand, Port assignment and routing
REN      INT and FP register rename
REG      INT and FP register file read
EXE      ALU Execute, L1D Cache and TLB Access + L2 Cache Tag Access
DET      Exception Detect, Branch Correction
WB       Writeback, INT register update
FP1-WB   FP FMAC pipeline (2) + register write
L2N-L2I  L2 Queue Nominate/Issue (4) (speculatively issued with L1 request)
L2A-L2W  L2 Access, Rotate, Correct, Write (4)
Itanium 2 Microarchitecture
[Figure: block diagram with the following components.
Front end: L1 I-Cache & Fetch/Prefetch engine, I-TLB, Branch Prediction, IA-32 Decode & Control
Instruction Queue: 8 bundles
11 issue ports: B B B, M M M M, I I, F F
Register stack engine / remapping
Register files: Branch & Predicate, 128 INT Registers, 128 FP Registers
Execution: Branch Units, INT & MM Units, Floating Point Units
Scoreboard, Predicate, NaT, Exceptions
Data side: Quad-port (INT) L1 PIPT Data Cache (WT), D-TLB, ALAT
PIPT Unified L2 Cache, Quad-Port (ECC)
On-chip PIPT Unified L3 Cache, Single-ported (ECC)
Bus Controller (ECC)]
Control Speculation (Speculative Load)
Elevate loads above a branch
• To hide memory latency by control speculation at compile time
• Defer exceptions by setting NaT (GR’s 65th bit) that indicates:
– Whether or not an exception has occurred
– Branch to fixup code required
• NaT set during ld.s, checked by chk.s
Conventional architectures (the branch is a barrier):
    instr 1
    instr 2
    . . .
    br
    load
    use
Itanium:
    ld.s
    instr 1
    instr 2
    br
    chk.s
    use
Control Speculation (Hoist Uses)
• The uses of speculative data can be executed speculatively
– Distinguishes speculation from simple prefetch
• NaT bit propagates down the dependent instruction chain
IA-64:
    ld.s
    instr 1
    instr 2
    br
    chk.s
    use
Control Speculation (Recovery)
• All computation instructions propagate NaTs to the consumers to reduce the number of checks
• Cmp propagates “false” if NaT is set when writing predicates (“0” for both target predicates)
    ld8.s r3 = (r9)
    ld8.s r4 = (r10)
    add   r6 = r3, r4
    ld8.s r5 = (r6)
    p1,p2 = cmp(...)
    chk.s r5, recv        (allows single chk on result)
    sub   r7 = r5, r2
Recovery code:
    ld8
    ld8
    add
    ld8
    br home
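The NaT mechanism can be imitated in a few lines: a speculative load returns a poison marker instead of faulting, arithmetic passes the marker along, and one check at the end decides whether recovery code runs. This is a sketch only; `NAT`, `ld_s`, `add`, `chk_s`, and `run` are hypothetical names, and the `deferred` set (addresses whose speculative access is deferred but whose normal access succeeds) is an assumption made to give the recovery path something to do.

```python
NAT = object()  # stands in for a register whose NaT bit is set


def ld_s(memory, addr, deferred=frozenset()):
    """Speculative load: return NAT instead of raising on a problem access."""
    if addr is NAT or addr in deferred or addr not in memory:
        return NAT
    return memory[addr]


def add(a, b):
    """Computation propagates NaT from either source to the result."""
    return NAT if (a is NAT or b is NAT) else a + b


def chk_s(value, recovery):
    """Single check on the final result; returns (value, recovery_was_used)."""
    if value is NAT:
        return recovery(), True
    return value, False


def run(memory, r9, r10, deferred=frozenset()):
    # Speculative chain, mirroring the slide: two loads, an add, a dependent load.
    r3 = ld_s(memory, r9, deferred)
    r4 = ld_s(memory, r10, deferred)
    r6 = add(r3, r4)
    r5 = ld_s(memory, r6, deferred)

    def recovery():
        # Non-speculative re-execution; a genuine fault would raise KeyError here.
        return memory[memory[r9] + memory[r10]]

    return chk_s(r5, recovery)
```

With `memory = {0: 1, 1: 2, 3: 42}`, `run(memory, 0, 1)` completes without recovery, while `run(memory, 0, 1, deferred={1})` reaches the same value through the recovery path.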
Data Speculation (Advanced Loads)
• Compiler can hoist a load prior to a preceding, possibly-conflicting store
• ALAT (Advanced Load Address Table) is used for checking every store address in-between
• Can be done by a superscalar machine using store coloring
Conventional architectures (the store is a barrier):
    instr 1
    instr 2
    . . .
    st8
    ld8
    use
Itanium:
    ld8.a
    instr 1
    instr 2
    st8
    ld.c
    use
Data Speculation (load.a + chk.a)
• Compiler hoists a load and its subsequent consumers prior to a preceding, possibly-conflicting store
• Needs recovery code to patch up a mis-speculation
Hoisting the load only (ld.c):
    ld8.a r3=
    instr 1
    instr 2
    st8
    ld.c
    add =r3,
Hoisting the load and its use (chk.a):
    ld8.a r3=
    instr 1
    add =r3,
    instr 2
    st8
    chk.a
L1:
Recovery code:
    ld8 r3=
    add =r3,
    br L1
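The role of the ALAT can be sketched as a table keyed by target register: an advanced load records its address, every store removes matching entries, and the later check only asks whether the entry survived. The class below is a toy under assumptions (the name `ALAT` for the Python class and its methods are hypothetical; matching is by exact address, with no access sizes, partial overlap, or capacity limit).

```python
class ALAT:
    def __init__(self):
        self.entries = {}  # register name -> address of its advanced load

    def ld_a(self, memory, reg, addr):
        """Advanced load: read memory and remember the address for reg."""
        self.entries[reg] = addr
        return memory[addr]

    def store(self, memory, addr, value):
        """Store: write memory and invalidate advanced loads from that address."""
        memory[addr] = value
        for reg in [r for r, a in self.entries.items() if a == addr]:
            del self.entries[reg]

    def check(self, reg):
        """True if the advanced load is still valid (ld.c / chk.a would pass)."""
        return reg in self.entries

    def ld_c(self, memory, reg, addr, value):
        """Check load: keep the speculative value on a hit, reload on a miss."""
        return value if self.check(reg) else memory[addr]
```

A store to an unrelated address leaves the entry in place, so the hoisted value is kept; a store to the same address removes it, and `ld_c` falls back to a fresh load (for `chk.a`, recovery code would run instead).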
Parallel Compare Types
• Three new types of compares:
– and: both target predicates set FALSE if compare is false
– or: both target predicates set TRUE if compare is true
– DeMorgan: if true, sets one TRUE, sets other FALSE
• Do not confuse these with the “parallel compare” instructions pcmp1/pcmp2/pcmp4
[Figure: control flow through blocks A, B, C, D, shown serially and with A, B, C evaluated in parallel. Caption: Reduces Critical Path]
Eight Queen Example
Source: Crawford & Huck
if ((b[j] == true) && (a[i+j] == true) && (c[i-j+7] == true))
Unconditional Compares
Cycle 1:  R1=&b[j]
          R3=&a[i+j]
          R5=&c[i-j+7]
Cycle 2:  ld R2=[R1]
          ld.s R4=[R3]
          ld.s R6=[R5]
Cycle 4:  p1,p2=cmp.unc(R2==true)
Cycle 5:  (p1) chk.s R4
          (p1) p3,p4=cmp.unc(R4==true)
Cycle 6:  (p3) chk.s R6
          (p3) p5,p6=cmp.unc(R6==true)
Cycle 7:  (p5) br then
          else
[Figure: 8 queens control flow. Three nested tests with predicate pairs P1/P2, P3/P4, P5/P6 leading to Then or Else]
Eight Queen Example
Source: Crawford & Huck
if ((b[j] == true) && (a[i+j] == true) && (c[i-j+7] == true))
Parallel Compares
Cycle 1:  R1=&b[j]
          R3=&a[i+j]
          R5=&c[i-j+7]
Cycle 2:  p1 <- true
          ld R2=[R1]
          ld R4=[R3]
          ld R6=[R5]
Cycle 4:  p1,p2 <- cmp.and(R2==true)
          p1,p2 <- cmp.and(R4==true)
          p1,p2 <- cmp.and(R6==true)
Cycle 5:  (p1) br then
          else
Reduced from 7 cycles to 5
[Figure: the control flow with predicate pairs P1/P2, P3/P4, P5/P6 collapses to a single test: P1 = true goes to Then, P1 = false goes to Else]
More Examples of Parallel Compare
if (c1 && c2 && c3 && c4)
    r1 = r2 + r3;
else
    r4 = r5 - r6;
Itanium Code
Cycle 0:  cmp.eq p1,p2 = r0,r0 ;;
Cycle 1:  cmp.ne.and.orcm p1,p2 = c1,r0
          cmp.ne.and.orcm p1,p2 = c2,r0
          cmp.ne.and.orcm p1,p2 = c3,r0
          cmp.ne.and.orcm p1,p2 = c4,r0 ;;
Cycle 2:  (p1) add r1=r2,r3
          (p2) sub r4=r5,r6
[Figure: control flow testing c1, c2, c3, c4 in sequence; all true leads to then, any false leads to else]
• Parallel cmp.crel.and or cmp.crel.or write the same values to both predicates
• Use cmp.crel.and.orcm or cmp.crel.or.andcm for writing complementary predicates
• Also called DeMorgan type (for complementary output)
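The and.orcm update rule is easy to model: a compare whose relation is false clears the first predicate and sets the second, and a compare whose relation is true writes nothing. Because every writer that does write stores the same values, the compares can share a cycle. The sketch below uses hypothetical names (`cmp_and_orcm`, `if_all`) and treats each condition as true when it is non-zero.

```python
from itertools import permutations


def cmp_and_orcm(relation_true, p1, p2):
    """and.orcm type: only a false relation updates the predicate pair."""
    return (p1, p2) if relation_true else (False, True)


def if_all(conditions):
    """Evaluate c1 && c2 && ... the way the Itanium code does."""
    p1, p2 = True, False                 # initialisation: cmp.eq p1,p2 = r0,r0
    for c in conditions:                 # order is irrelevant
        p1, p2 = cmp_and_orcm(c != 0, p1, p2)
    return p1, p2


# Any ordering of the compares gives the same predicates.
orders_agree = len({if_all(p) for p in permutations([1, 0, 1, 1])}) == 1
```

`if_all([1, 1, 1, 1])` leaves p1 set for the then-path; a single zero anywhere flips the pair so that p2 guards the else-path.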
Multiway Branches
w/o Speculation (3 branch cycles):
    ld8 r6 = (ra)
    (p1) br exit1
    ld8 r7 = (rb)
    (p3) br exit2
    ld8 r8 = (rc)
    (p5) br exit3
Hoisting Loads (3 branch cycles):
    ld8 r6 = (ra)
    ld8.s r7 = (rb)
    ld8.s r8 = (rc)
    (p1) br exit1
    chk r7, rec1
    (p3) br exit2
    chk r8, rec2
    (p5) br exit3
Multi-way Branches (1 branch cycle):
    ld8 r6 = (ra)
    ld8.s r7 = (rb)
    ld8.s r8 = (rc)
    (p2) chk r7, rec1
    (p4) chk r8, rec2
    (p1) br exit1
    (p3) br exit2
    (p5) br exit3
[Figure: control flow with predicate pairs P1/P2, P3/P4, P5/P6]
• Multiway branches: more than 1 branch in a single cycle
– Itanium allows multiple “consecutive” B instructions in the same inst group
– Allows n-way branching per cycle (Itanium and Itanium 2 have 3 branch units)
– Ordering matters if branch predicates are not mutually exclusive
• E.g. BBB template enables 3 branches in one bundle
Branch and Prefetch Hints
• Compiler provides hints for the branch predictor by
– Completer in branch instructions, e.g. br.call.sptk
    • 4 completer types for static and dynamic predictions: sptk, spnt, dptk, dpnt
– Explicit brp instructions
• Compiler provides hints for instruction sequential prefetching
– Use completer in branch instructions, e.g. br.call.sptk.many
    • 2 completer types: many, few
    • many and few are implementation-specific
• Compiler directs predictor allocation
– For managing branch predictor resources
– Use completer in branch instructions, e.g. br.call.sptk.many.none
    • 2 completer types: none, clr
    • none: don’t deallocate; clr: deallocate branch info
Modulo Scheduling Support
• Will be discussed next
• Itanium features support modulo scheduling (or software pipelining)
– Full predication
– Special branch handling features
    • br.ctop (for for-loop with known loop count)
    • br.wtop (for while-loop)
– Register rotation: removes loop copy overhead
    • No modulo variable expansion, tighter code
– Predicate rotation/generation
    • Removes prologue & epilogue
List Scheduling
P = Mem[A++] + C1;
Q = P * C2;
Y = P * C3 + (P + Q) * (P * C3);
Mem[B++] = Y;
Latency: Mem: 1 cycle; Adder: 2 cycles; Multiplier: 2 cycles
• Build dependency graph
• Assign a priority of “0” to all operations having no successors
• Assign each remaining operation the sum of the priority and latency of its successor. If more than one successor, assign the maximum.
• Schedule instructions based on priority
[Figure: dependency graph with load X1, adds A1, A2, A3, multiplies M1, M2, M3 (constants C1, C2, C3), and store X2.
Priorities: X1 = 11, A1 = 9, M1 = 7, A2 = 5, M2 = 5, M3 = 3, A3 = 1, X2 = 0]
Schedule = {X1, A1, M1, A2, M2, M3, A3, X2}
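The priority rule can be written as a short recursion over the dependency graph. The edge list below is a reading of the four source statements (P*C3 is computed once, by M2, and used twice), so treat it as an assumption; `priorities` is a hypothetical helper name.

```python
LATENCY = {"X1": 1, "A1": 2, "M1": 2, "A2": 2, "M2": 2, "M3": 2, "A3": 2, "X2": 1}

# op -> operations that consume its result
SUCCESSORS = {
    "X1": ["A1"],              # P = Mem[A++] + C1
    "A1": ["M1", "M2", "A2"],  # P feeds Q, P*C3 and P+Q
    "M1": ["A2"],              # Q feeds P+Q
    "A2": ["M3"],              # (P+Q) feeds the product
    "M2": ["M3", "A3"],        # P*C3 feeds the product and the final add
    "M3": ["A3"],
    "A3": ["X2"],              # Y is stored
    "X2": [],
}


def priorities(successors, latency):
    """Priority = max over successors of (successor priority + successor latency)."""
    memo = {}

    def prio(op):
        if op not in memo:
            memo[op] = max(
                (prio(s) + latency[s] for s in successors[op]), default=0
            )
        return memo[op]

    return {op: prio(op) for op in successors}


PRIO = priorities(SUCCESSORS, LATENCY)
# Highest priority first; ties keep dictionary order (A2 before M2).
ORDER = sorted(PRIO, key=lambda op: -PRIO[op])
```

The tie between A2 and M2 is broken arbitrarily; the dictionary order was chosen so that `ORDER` matches the schedule on the slide.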
List Scheduling
[Figure: the same dependency graph and priorities as on the previous slide]
• LS (a heuristic) provides a near-optimal schedule
• But no guarantee for optimality, especially in terms of throughput
Reservation Table
Time  MEM  ADDER  MULT
0     X1
1          A1
2
3                 M1
4                 M2
5          A2
6
7                 M3
8
9          A3
10
11    X2
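A greedy scheduler reproduces this reservation table: take operations in priority order and place each at the first cycle where its operands are ready and its unit is not already starting another operation. This is a sketch with hypothetical names, and it assumes each unit can start one operation per cycle (M1 and M2 start in consecutive cycles in the table, so the multiplier is treated as pipelined).

```python
LATENCY = {"X1": 1, "A1": 2, "M1": 2, "A2": 2, "M2": 2, "M3": 2, "A3": 2, "X2": 1}
UNIT = {"X1": "MEM", "X2": "MEM",
        "A1": "ADDER", "A2": "ADDER", "A3": "ADDER",
        "M1": "MULT", "M2": "MULT", "M3": "MULT"}
PREDS = {"X1": [], "A1": ["X1"], "M1": ["A1"], "M2": ["A1"],
         "A2": ["A1", "M1"], "M3": ["A2", "M2"], "A3": ["M2", "M3"], "X2": ["A3"]}
ORDER = ["X1", "A1", "M1", "A2", "M2", "M3", "A3", "X2"]  # priority order


def list_schedule(order, preds, latency, unit):
    """Return {op: start cycle} for a greedy list schedule."""
    start = {}
    busy = set()  # (unit, cycle) pairs already taken
    for op in order:
        # Earliest cycle at which every operand has been produced.
        t = max((start[p] + latency[p] for p in preds[op]), default=0)
        while (unit[op], t) in busy:   # resource conflict: slide down
            t += 1
        start[op] = t
        busy.add((unit[op], t))
    return start


SCHEDULE = list_schedule(ORDER, PREDS, LATENCY, UNIT)
```

The only resource conflict in this example is M2, which is ready at cycle 3 but has to wait one cycle for the multiplier.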
Scheduling
• If I want to use the same schedule, what is the minimum initiation interval?
• In the example, do I need to wait for 12 cycles?
• If not, how do I avoid collision?
Time  MEM  ADDER  MULT
0     X1
1          A1
2
3                 M1
4                 M2
5          A2
6
7                 M3
8
9          A3
10
11    X2
Modulo Scheduling [RauGlaeser’81]
• A.k.a. “Polycyclic scheduling” or “Software pipelining”
• Exploit ILP among loop iterations to maximize
– Machine utilization
– Throughput
• Use a common schedule for the majority of iterations
• Overlap execution of consecutive iterations
• Constant initiation rate: Initiation Interval (II)
• Minimum II (MII) generates an optimal schedule with maximum throughput
• Originally developed for polycyclic architecture (or horizontal architecture, later known as VLIW) at TRW/ESL
Modulo Scheduling: Resource Constraint
• The optimal schedule is constrained by the number of available resources
• Determine ResII (Resource minimal initiation interval)
– Successive iterations will be scheduled ResII cycles apart
• N(i) is the number of usages of resource i in a loop
• C(i) is the number of resources i

$$\mathrm{ResII} = \max\left(\frac{N(1)}{C(1)},\ \frac{N(2)}{C(2)},\ \frac{N(3)}{C(3)},\ \ldots\right)$$
Resource II
[Figure: dependency graph of the list-scheduling example (X1, A1, M1, M2, A2, M3, A3, X2)]
• Assume 3 FUs
– 1 adder with 2-cycle latency
– 1 mult with 2-cycle latency
– 1 mem unit with 1-cycle latency
• Determine MII = Resource II

$$\mathrm{MII} = \mathrm{ResII} = \max\left(\frac{2}{1},\ \frac{3}{1},\ \frac{3}{1}\right) = 3$$
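The resource bound is a one-line computation once the usage counts are known. The sketch below uses a hypothetical helper `res_ii` and rounds the result up to a whole cycle, which is an assumption added here because an initiation interval must be an integer; the slide's formula does not show the rounding.

```python
import math
from fractions import Fraction


def res_ii(uses, units):
    """Resource-constrained initiation interval.

    uses:  {resource: number of operations per iteration that need it}
    units: {resource: number of copies of that resource}
    """
    worst = max(Fraction(uses[r], units[r]) for r in uses)
    return math.ceil(worst)   # assumed: round up to a whole number of cycles


# The example above: 2 memory ops, 3 adds, 3 multiplies, one unit of each kind.
example = res_ii({"MEM": 2, "ADDER": 3, "MULT": 3},
                 {"MEM": 1, "ADDER": 1, "MULT": 1})
```

Exact fractions are used so that a ratio such as 3/2 is compared without floating-point error before rounding.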
Modulo Reservation Table (MRT)
Original schedule:
Time  MEM  ADDER  MULT
0     X1
1          A1
2
3                 M1
4                 M2
5          A2
6
7                 M3
8
9          A3
10
11    X2
New Schedule for 1 iteration: empty table over times 0 to 14, with Modulo = Time mod 3 (0, 1, 2, 0, 1, 2, ...)
MRT: empty table with rows Modulo 0, 1, 2 and columns MEM, ADDER, MULT
Modulo Reservation Table (MRT)
New Schedule for 1 iteration (original list-schedule times: M3 at 7, A3 at 9, X2 at 11):
Modulo  Time  MEM  ADDER  MULT
0       0     X1
1       1          A1
2       2
0       3                 M1
1       4                 M2
2       5          A2
0       6
1       7
2       8                 M3
0       9
1       10
2       11
0       12         A3
1       13
2       14    X2
MRT
Modulo  MEM  ADDER  MULT
0       X1   A3     M1
1            A1     M2
2       X2   A2     M3
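The way the MRT is filled can be sketched as follows: each operation is placed at the first cycle, at or after the cycle its operands are ready, whose slot (time mod II) is still free for its unit. The names are hypothetical, the dependency edges are a reading of the example loop, and this simple version never backtracks, which real modulo schedulers may need to do.

```python
LATENCY = {"X1": 1, "A1": 2, "M1": 2, "A2": 2, "M2": 2, "M3": 2, "A3": 2, "X2": 1}
UNIT = {"X1": "MEM", "X2": "MEM",
        "A1": "ADDER", "A2": "ADDER", "A3": "ADDER",
        "M1": "MULT", "M2": "MULT", "M3": "MULT"}
PREDS = {"X1": [], "A1": ["X1"], "M1": ["A1"], "M2": ["A1"],
         "A2": ["A1", "M1"], "M3": ["A2", "M2"], "A3": ["M2", "M3"], "X2": ["A3"]}
ORDER = ["X1", "A1", "M1", "A2", "M2", "M3", "A3", "X2"]  # priority order


def modulo_schedule(order, preds, latency, unit, ii):
    """Return ({op: start cycle}, {(unit, slot): op}) for initiation interval ii."""
    start, mrt = {}, {}
    for op in order:
        ready = max((start[p] + latency[p] for p in preds[op]), default=0)
        for offset in range(ii):            # every slot repeats after ii cycles
            t = ready + offset
            if (unit[op], t % ii) not in mrt:
                start[op] = t
                mrt[(unit[op], t % ii)] = op
                break
        else:
            raise ValueError(f"no free {unit[op]} slot for {op} at II={ii}")
    return start, mrt


START, MRT = modulo_schedule(ORDER, PREDS, LATENCY, UNIT, ii=3)
```

With II = 3 this pushes M3 from cycle 7 to 8, A3 to 12, and X2 to 14, and every MRT slot except the middle memory slot ends up occupied.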
Modulo Scheduled Loop
Prolog: times 0 to 11. Kernel, steady state (MRT schedule): from time 12. Numbers in parentheses are iteration numbers.
Modulo  Time  MEM      ADDER    MULT
0       0     X1 (1)
1       1              A1 (1)
2       2
0       3     X1 (2)            M1 (1)
1       4              A1 (2)   M2 (1)
2       5              A2 (1)
0       6     X1 (3)            M1 (2)
1       7              A1 (3)   M2 (2)
2       8              A2 (2)   M3 (1)
0       9     X1 (4)            M1 (3)
1       10             A1 (4)   M2 (3)
2       11             A2 (3)   M3 (2)
0       12    X1 (5)   A3 (1)   M1 (4)
1       13             A1 (5)   M2 (4)
2       14    X2 (1)   A2 (4)   M3 (3)
0       15    X1 (6)   A3 (2)   M1 (5)
1       16             A1 (6)   M2 (5)
2       17    X2 (2)   A2 (5)   M3 (4)
0       18    X1 (7)   A3 (3)   M1 (6)
1       19             A1 (7)   M2 (6)
2       20    X2 (3)   A2 (6)   M3 (5)
0       21    X1 (8)   A3 (4)   M1 (7)
1       22             A1 (8)   M2 (7)
2       23    X2 (4)   A2 (7)   M3 (6)
0       24    X1 (9)   A3 (5)   M1 (8)
1       25             A1 (9)   M2 (8)
2       26    X2 (5)   A2 (8)   M3 (7)
0       27    X1 (10)  A3 (6)   M1 (9)
1       28             A1 (10)  M2 (9)
2       29    X2 (6)   A2 (9)   M3 (8)
0       30    X1 (11)  A3 (7)   M1 (10)
1       31             A1 (11)  M2 (10)
2       32    X2 (7)   A2 (10)  M3 (9)
Modulo Scheduled Loop
The left half of this slide repeats the prolog and kernel table for times 0 to 32. The right half shows the end of the loop for N iterations: the last kernel pass (T+6 to T+8, where iteration N starts) followed by the epilog.
Modulo  Time  MEM       ADDER     MULT
0       T+0   X1 (N-2)  A3 (N-6)  M1 (N-3)
1       T+1             A1 (N-2)  M2 (N-3)
2       T+2   X2 (N-6)  A2 (N-3)  M3 (N-4)
0       T+3   X1 (N-1)  A3 (N-5)  M1 (N-2)
1       T+4             A1 (N-1)  M2 (N-2)
2       T+5   X2 (N-5)  A2 (N-2)  M3 (N-3)
0       T+6   X1 (N)    A3 (N-4)  M1 (N-1)
1       T+7             A1 (N)    M2 (N-1)
2       T+8   X2 (N-4)  A2 (N-1)  M3 (N-2)
0       T+9             A3 (N-3)  M1 (N)
1       T+10                      M2 (N)
2       T+11  X2 (N-3)  A2 (N)    M3 (N-1)
0       T+12            A3 (N-2)
1       T+13
2       T+14  X2 (N-2)            M3 (N)
0       T+15            A3 (N-1)
1       T+16
2       T+17  X2 (N-1)
0       T+18            A3 (N)
1       T+19
2       T+20  X2 (N)
Another Modulo Schedule Example
Given 2 adders (1-cycle) & 1 multiplier (2-cycle)
MII = max(3/2, 2/1) = 2
[Figure: dependency graph with inputs A, B, C, D, E and output Z; adds A1, A2, A3 and multiplies M1, M2; node priorities 0, 1, 1, 3, 3. The resulting schedule is shown as prolog, 5x kernel, epilog.]
Modulo Reservation Table
Modulo  ADDER1  ADDER2  MULT
0       A1 (3)  A2 (3)  M2 (2)
1       A3 (1)          M1 (3)
Multiplier is fully utilized
How to Perform Register Allocation?
• We are overlapping multiple iterations into one schedule.
– Example: iterations 1 to 5 are alive at the same time
• Registers from multiple iterations are alive during a period of time
MRT
Modulo  MEM     ADDER   MULT
0       X1 (5)  A3 (1)  M1 (4)
1               A1 (5)  M2 (4)
2       X2 (1)  A2 (4)  M3 (3)
Modulo Variable Expansion
• Analyze the “life time” of an architecture register
• Unroll the loop to enable the modulo schedule
• R5 needs to stay alive for 8 cycles = ⌈8/3⌉ = 3 MII (i.e. unroll 3 times)
Register lifetimes in cycles: r1 (1), r2 (4), r3 (2), r4 (3), r5 (8), r6 (4), r7 (2)
The cycle numbers assume WAR is allowed in the same cycle
Modulo  Time  MEM             ADDER               MULT
0       0     ld r1, (A)++
1       1                     add r2, r1, $c1
2       2
0       3     ld r11, (A)++                       mul r3, r2, $c2
1       4                     add r12, r11, $c1   mul r5, r2, $c3
2       5                     add r4, r2, r3
0       6     X1 (3)                              mul r13, r12, $c2
1       7                     A1 (3)              mul r15, r12, $c3
2       8                     add r14, r12, r13   mul r6, r4, r5
0       9     X1 (4)                              M1 (3)
1       10                    A1 (4)              M2 (3)
2       11                    A2 (3)              mul r16, r14, r15
0       12    X1 (5)          add r7, r5, r6      M1 (4)
1       13                    A1 (5)              M2 (4)
2       14    st r7, (B)++    A2 (4)              M3 (3)
Post MVE code
Kernel (unrolled 3 times): times 12 to 20, repeated at times 21 to 29
Modulo  Time  MEM             ADDER               MULT
0       0     ld r1, (A)++
1       1                     add r2, r1, $c1
2       2
0       3     ld r11, (A)++                       mul r3, r2, $c2
1       4                     add r12, r11, $c1   mul r5, r2, $c3
2       5                     add r4, r2, r3
0       6     ld r21, (A)++                       mul r13, r12, $c2
1       7                     add r22, r21, $c1   mul r15, r12, $c3
2       8                     add r14, r12, r13   mul r6, r4, r5
0       9     ld r1, (A)++                        mul r23, r22, $c2
1       10                    add r2, r1, $c1     mul r25, r22, $c3
2       11                    add r24, r22, r23   mul r16, r14, r15
0       12    ld r11, (A)++   add r7, r5, r6      mul r3, r2, $c2
1       13                    add r12, r11, $c1   mul r5, r2, $c3
2       14    st r7, (B)++    add r4, r2, r3      mul r26, r24, r25
0       15    ld r21, (A)++   add r17, r15, r16   mul r13, r12, $c2
1       16                    add r22, r21, $c1   mul r15, r12, $c3
2       17    st r17, (B)++   add r14, r12, r13   mul r6, r4, r5
0       18    ld r1, (A)++    add r27, r25, r26   mul r23, r22, $c2
1       19                    add r2, r1, $c1     mul r25, r22, $c3
2       20    st r27, (B)++   add r24, r22, r23   mul r16, r14, r15
0       21    ld r11, (A)++   add r7, r5, r6      mul r3, r2, $c2
1       22                    add r12, r11, $c1   mul r5, r2, $c3
2       23    st r7, (B)++    add r4, r2, r3      mul r26, r24, r25
0       24    ld r21, (A)++   add r17, r15, r16   mul r13, r12, $c2
1       25                    add r22, r21, $c1   mul r15, r12, $c3
2       26    st r17, (B)++   add r14, r12, r13   mul r6, r4, r5
0       27    ld r1, (A)++    add r27, r25, r26   mul r23, r22, $c2
1       28                    add r2, r1, $c1     mul r25, r22, $c3
2       29    st r27, (B)++   add r24, r22, r23   mul r16, r14, r15
Register Allocation for MVE
• To save registers, not every register needs to be expanded
• Calculate the lifetime of each register to determine if a new register is needed across iterations (the formula assumes WAR in the same instruction bundle is allowed)
• # of copies = (MII % ⌈lifetime/MII⌉ == 0) ? ⌈lifetime/MII⌉ : MII
– R1 is alive for 1 cycle = ⌈1/3⌉ = 1 MII (need 1 copy)
– R2 is alive for 4 cycles = ⌈4/3⌉ = 2 MII (need 3 copies since 3%2=1)
– R3 is alive for 2 cycles = ⌈2/3⌉ = 1 MII (need 1 copy)
– R4 is alive for 3 cycles = ⌈3/3⌉ = 1 MII (need 1 copy)
– R5 is alive for 8 cycles = ⌈8/3⌉ = 3 MII (need 3 copies)
– R6 is alive for 4 cycles = ⌈4/3⌉ = 2 MII (need 3 copies since 3%2=1)
– R7 is alive for 2 cycles = ⌈2/3⌉ = 1 MII (need 1 copy)
• 13 registers used, instead of 21 with the same unrolling degree
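The register count can be reproduced from the lifetimes alone. In the sketch below the helper names are hypothetical, and one interpretation is an assumption: the value the formula divides (written MII on the slide, 3 in this example) is treated as the unroll degree, taken to be the largest lifetime measured in initiation intervals.

```python
LIFETIME = {"r1": 1, "r2": 4, "r3": 2, "r4": 3, "r5": 8, "r6": 4, "r7": 2}  # cycles
II = 3


def lifetime_in_ii(lifetime, ii):
    """Lifetime rounded up to a whole number of initiation intervals."""
    return -(-lifetime // ii)          # integer ceiling


def copies(lifetime, ii, unroll):
    """Slide's rule: use q copies if q divides the unroll degree, else unroll copies."""
    q = lifetime_in_ii(lifetime, ii)
    return q if unroll % q == 0 else unroll


# Assumed: unroll degree = longest lifetime in initiation intervals (r5 here).
UNROLL = max(lifetime_in_ii(t, II) for t in LIFETIME.values())
COPIES = {reg: copies(t, II, UNROLL) for reg, t in LIFETIME.items()}
TOTAL = sum(COPIES.values())
FULL_EXPANSION = len(LIFETIME) * UNROLL
```

The divisibility test exists because the kernel repeats every `UNROLL` iterations, so a register that cycled through 2 names could not return to the same name at the start of each 3-iteration kernel.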
MVE (reallocate registers)
Kernel (unrolled 3 times): times 12 to 20, repeated at times 21 to 29
The cycle numbers assume WAR is allowed in the same cycle
Modulo  Time  MEM             ADDER               MULT
0       0     ld r1, (A)++
1       1                     add r2, r1, $c1
2       2
0       3     ld r1, (A)++                        mul r3, r2, $c2
1       4                     add r12, r1, $c1    mul r5, r2, $c3
2       5                     add r4, r2, r3
0       6     ld r1, (A)++                        mul r3, r12, $c2
1       7                     add r22, r1, $c1    mul r15, r12, $c3
2       8                     add r4, r12, r3     mul r6, r4, r5
0       9     ld r1, (A)++                        mul r3, r22, $c2
1       10                    add r2, r1, $c1     mul r25, r22, $c3
2       11                    add r4, r22, r3     mul r16, r4, r15
0       12    ld r1, (A)++    add r7, r5, r6      mul r3, r2, $c2
1       13                    add r12, r1, $c1    mul r5, r2, $c3
2       14    st r7, (B)++    add r4, r2, r3      mul r26, r4, r25
0       15    ld r1, (A)++    add r7, r15, r16    mul r3, r12, $c2
1       16                    add r22, r1, $c1    mul r15, r12, $c3
2       17    st r7, (B)++    add r4, r12, r3     mul r6, r4, r5
0       18    ld r1, (A)++    add r7, r25, r26    mul r3, r22, $c2
1       19                    add r2, r1, $c1     mul r25, r22, $c3
2       20    st r7, (B)++    add r4, r22, r3     mul r16, r4, r15
0       21    ld r1, (A)++    add r7, r5, r6      mul r3, r2, $c2
1       22                    add r12, r1, $c1    mul r5, r2, $c3
2       23    st r7, (B)++    add r4, r2, r3      mul r26, r4, r25
0       24    ld r1, (A)++    add r7, r15, r16    mul r3, r12, $c2
1       25                    add r22, r1, $c1    mul r15, r12, $c3
2       26    st r7, (B)++    add r4, r12, r3     mul r6, r4, r5
0       27    ld r1, (A)++    add r7, r25, r26    mul r3, r22, $c2
1       28                    add r2, r1, $c1     mul r25, r22, $c3
2       29    st r7, (B)++    add r4, r22, r3     mul r16, r4, r15
![Page 45: Lec15 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- EPIC VLIW](https://reader031.fdocuments.us/reader031/viewer/2022020214/58efb0e31a28ab585d8b45af/html5/thumbnails/45.jpg)
45
Final Modulo Schedule
Prolog Code (12 instruction bundles)
Epilog Code (12 instruction bundles)
**Branch instruction not shown
9 instruction bundles
ld r11, (A)++ add r7, r5, r6 mul r3, r2, $c2
add r12, r11, $c1 mul r5, r2, $c3
st r7, (B)++ add r4, r2, r3 mul r26, r24, r25
ld r21, (A)++ add r17, r15, r16 mul r13, r12, $c2
add r22, r21, $c1 mul r15, r12, $c3
st r17, (B)++ add r14, r12, r13 mul r6, r4, r5
ld r1, (A)++ add r27, r25, r26 mul r23, r22, $c2
add r2, r1, $c1 mul r25, r22, $c3
st r27, (B)++ add r24, r22, r23 mul r16, r14, r15
![Page 46: Lec15 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- EPIC VLIW](https://reader031.fdocuments.us/reader031/viewer/2022020214/58efb0e31a28ab585d8b45af/html5/thumbnails/46.jpg)
46
Final Modulo Schedule (Reallocate Registers)
Prolog Code (12 instruction bundles)
Epilog Code (12 instruction bundles)
**Branch instruction not shown
9 instruction bundles
ld r1, (A)++ add r7, r5, r6 mul r3, r2, $c2
add r12, r1, $c1 mul r5, r2, $c3
st r7, (B)++ add r4, r2, r3 mul r26, r4, r25
ld r1, (A)++ add r7, r15, r16 mul r3, r12, $c2
add r22, r1, $c1 mul r15, r12, $c3
st r7, (B)++ add r4, r12, r3 mul r6, r4, r5
ld r1, (A)++ add r7, r25, r26 mul r3, r22, $c2
add r2, r1, $c1 mul r25, r22, $c3
st r7, (B)++ add r4, r22, r3 mul r16, r4, r15
![Page 47: Lec15 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- EPIC VLIW](https://reader031.fdocuments.us/reader031/viewer/2022020214/58efb0e31a28ab585d8b45af/html5/thumbnails/47.jpg)
47
Issues with Modulo Variable Expansion
• Many architectural registers are needed
• Code size grows as more unrolling is needed
• Alternative solution: Rotating register file
– A hardware technique
– Solves the problem without code duplication
– Similar to register window plus renaming: keep old iteration values on the stack (Itanium calls the hardware Register Stack Engine or RSE)
![Page 48: Lec15 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- EPIC VLIW](https://reader031.fdocuments.us/reader031/viewer/2022020214/58efb0e31a28ab585d8b45af/html5/thumbnails/48.jpg)
48
Intention of Using Rotation Registers
• Use exactly the same schedule (below) for all code, including
– Kernel code
– Prolog code
– Epilog code
• The “registers” need to be re-allocated
• Registers “rotate” per iteration!!!
**Branch instruction not shown
ld r1, (A)++ add r7, r5, r6 mul r3, r2, $c2
add r2, r1, $c1 mul r5, r2, $c3
st r7, (B)++ add r4, r2, r3 mul r6, r4, r5
![Page 49: Lec15 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- EPIC VLIW](https://reader031.fdocuments.us/reader031/viewer/2022020214/58efb0e31a28ab585d8b45af/html5/thumbnails/49.jpg)
49
Idea of Rotation Register (Original Schedule)
ite Time Mem Adder Multiplier
0 0 ld r41, (A)++
1 add r42, r41, $c1
2
1 3 mul r43, r42, $c2
4 mul r45, r42, $c3
5 add r44, r42, r43
2 6
7
8 mul r46, r44, r45
3 9
10
11
4 12 add r47, r45, r46
13
14 st r47, (B)++
In Intel Itanium, integer registers 32 – 127 are rotating registers
![Page 50: Lec15 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- EPIC VLIW](https://reader031.fdocuments.us/reader031/viewer/2022020214/58efb0e31a28ab585d8b45af/html5/thumbnails/50.jpg)
50
Original Code Schedule
ite Time Mem Adder Multiplier
0 0 ld r41, (A)++
1 add r42, r41, $c1
2
1 3 mul r43, r42, $c2
4 mul r45, r42, $c3
5 add r44, r42, r43
2 6
7
8 mul r46, r44, r45
3 9
10
11
4 12 add r47, r45, r46
13
14 st r47, (B)++
In Intel Itanium, integer registers 32 – 127 are rotating registers
![Page 51: Lec15 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- EPIC VLIW](https://reader031.fdocuments.us/reader031/viewer/2022020214/58efb0e31a28ab585d8b45af/html5/thumbnails/51.jpg)
51
Assume HW Rotation Registers
ite Time Mem Adder Multiplier
0 0 ld r41, (A)++
1 add r42, r41, $c1
2
1 3 mul r44, r43, $c2
4 mul r45, r43, $c3
5 add r52, r43, r44
2 6
7
8 mul r48, r53, r46
3 9
10
11
4 12 add r51, r48, r50
13
14 st r51, (B)++
Assuming that registers are rotated per iteration automatically
In Intel Itanium, integer registers 32 – 127 are rotating registers
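The renaming on this slide can be replayed in a small Python model. This is a toy sketch, not Itanium's actual implementation: it assumes a rotating region r32 to r127 whose base is decremented once per stage, so a value written as rX is read as rX+1 one stage later. The class `RotatingRegs` and both function names are hypothetical.

```python
class RotatingRegs:
    """Toy model of a rotating register region r32..r127 (hypothetical helper)."""
    BASE, SIZE = 32, 96

    def __init__(self):
        self.phys = [0] * self.SIZE
        self.rrb = 0  # rotating register base, decremented on each rotation

    def _index(self, virt):
        return (virt - self.BASE + self.rrb) % self.SIZE

    def read(self, virt):
        return self.phys[self._index(virt)]

    def write(self, virt, value):
        self.phys[self._index(virt)] = value

    def rotate(self):
        # After a rotation, the value written as rX is visible as rX+1.
        self.rrb = (self.rrb - 1) % self.SIZE


def one_iteration_rotated(a, c1, c2, c3):
    """One loop iteration, using the register names of the rotated schedule."""
    r = RotatingRegs()
    r.write(41, a)                          # stage 0: ld r41, (A)++
    r.write(42, r.read(41) + c1)            #          add r42, r41, $c1
    r.rotate()
    r.write(44, r.read(43) * c2)            # stage 1: mul r44, r43, $c2
    r.write(45, r.read(43) * c3)            #          mul r45, r43, $c3
    r.write(52, r.read(43) + r.read(44))    #          add r52, r43, r44
    r.rotate()
    r.write(48, r.read(53) * r.read(46))    # stage 2: mul r48, r53, r46
    r.rotate()
    r.rotate()                              # stage 3: nothing issued for this iteration
    r.write(51, r.read(48) + r.read(50))    # stage 4: add r51, r48, r50
    return r.read(51)                       #          st r51, (B)++


def one_iteration_original(a, c1, c2, c3):
    """Same computation with the register names of the original schedule."""
    r42 = a + c1
    r43 = r42 * c2
    r45 = r42 * c3
    r44 = r42 + r43
    r46 = r44 * r45
    return r45 + r46                        # r47


rotated = one_iteration_rotated(2, 1, 3, 4)
original = one_iteration_original(2, 1, 3, 4)
print(rotated, original)
```

Both functions return the same value, which suggests that the register names in the rotated schedule refer to the same data as the original names once the per-stage rotation is taken into account.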
![Page 52: Lec15 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- EPIC VLIW](https://reader031.fdocuments.us/reader031/viewer/2022020214/58efb0e31a28ab585d8b45af/html5/thumbnails/52.jpg)
52
Rotation Registers in Itanium Processors
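Register files, each split into a static part and a stacked (rotating) part:
– General Purpose Registers (bits 63–0): registers 0–31 static, registers 32–127 stacked (rotating)
– FP Registers (bits 81–0): registers 0–31 static, registers 32–127 stacked (rotating)
– Predicate Registers: registers 0–15 static, registers 16–63 stacked (rotating)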
![Page 53: Lec15 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- EPIC VLIW](https://reader031.fdocuments.us/reader031/viewer/2022020214/58efb0e31a28ab585d8b45af/html5/thumbnails/53.jpg)
53
Register Rotation (Prolog i0)
ite Time Mem Adder Multiplier
0 0 ld r41, (A)++
1 add r42, r41, $c1
2
Assuming that registers are rotated per iteration automatically
In Intel Itanium, integer registers 32 – 127 are rotating registers
![Page 54: Lec15 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- EPIC VLIW](https://reader031.fdocuments.us/reader031/viewer/2022020214/58efb0e31a28ab585d8b45af/html5/thumbnails/54.jpg)
54
Register Rotation (Prolog i1)
ite Time Mem Adder Multiplier
0 0 ld r41, (A)++
1 add r42, r41, $c1
2
1 3 ld r41, (A)++ mul r44, r43, $c2
4 add r42, r41, $c1 mul r45, r43, $c3
5 add r52, r43, r44
Assuming that registers are rotated per iteration automatically
In Intel Itanium, integer registers 32 – 127 are rotating registers
![Page 55: Lec15 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- EPIC VLIW](https://reader031.fdocuments.us/reader031/viewer/2022020214/58efb0e31a28ab585d8b45af/html5/thumbnails/55.jpg)
55
Register Rotation (Prolog i2)
ite Time Mem Adder Multiplier
0 0 ld r41, (A)++
1 add r42, r41, $c1
2
1 3 ld r41, (A)++ mul r44, r43, $c2
4 add r42, r41, $c1 mul r45, r43, $c3
5 add r52, r43, r44
2 6 ld r41, (A)++ mul r44, r43, $c2
7 add r42, r41, $c1 mul r45, r43, $c3
8 add r52, r43, r44 mul r48, r53, r46
3 9
10
11
4 12
13
14
Assuming that registers are rotated per iteration automatically
In Intel Itanium, integer registers 32 – 127 are rotating registers
![Page 56: Lec15 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- EPIC VLIW](https://reader031.fdocuments.us/reader031/viewer/2022020214/58efb0e31a28ab585d8b45af/html5/thumbnails/56.jpg)
56
Register Rotation (Prolog i3)
ite Time Mem Adder Multiplier
0 0 ld r41, (A)++
1 add r42, r41, $c1
2
1 3 ld r41, (A)++ mul r44, r43, $c2
4 add r42, r41, $c1 mul r45, r43, $c3
5 add r52, r43, r44
2 6 ld r41, (A)++ mul r44, r43, $c2
7 add r42, r41, $c1 mul r45, r43, $c3
8 add r52, r43, r44 mul r48, r53, r46
3 9 ld r41, (A)++ mul r44, r43, $c2
10 add r42, r41, $c1 mul r45, r43, $c3
11 add r52, r43, r44 mul r48, r53, r46
Assuming that registers are rotated per iteration automatically
In Intel Itanium, integer registers 32 – 127 are rotating registers
![Page 57: Lec15 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- EPIC VLIW](https://reader031.fdocuments.us/reader031/viewer/2022020214/58efb0e31a28ab585d8b45af/html5/thumbnails/57.jpg)
57
Register Rotation (Kernel Steady State i4)
ite Time Mem Adder Multiplier
0 0 ld r41, (A)++
1 add r42, r41, $c1
2
1 3 ld r41, (A)++ mul r44, r43, $c2
4 add r42, r41, $c1 mul r45, r43, $c3
5 add r52, r43, r44
2 6 ld r41, (A)++ mul r44, r43, $c2
7 add r42, r41, $c1 mul r45, r43, $c3
8 add r52, r43, r44 mul r48, r53, r46
3 9 ld r41, (A)++ mul r44, r43, $c2
10 add r42, r41, $c1 mul r45, r43, $c3
11 add r52, r43, r44 mul r48, r53, r46
4 12 ld r41, (A)++ add r51, r48, r50 mul r44, r43, $c2
13 add r42, r41, $c1 mul r45, r43, $c3
14 st r51, (B)++ add r52, r43, r44 mul r48, r53, r46
Assuming that registers are rotated per iteration automatically
In Intel Itanium, integer registers 32 – 127 are rotating registers
Registers wrapped around if exceeding specified bound
![Page 58: Lec15 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- EPIC VLIW](https://reader031.fdocuments.us/reader031/viewer/2022020214/58efb0e31a28ab585d8b45af/html5/thumbnails/58.jpg)
58
Register Rotation (Kernel)
• Execute many iterations in the kernel …
![Page 59: Lec15 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- EPIC VLIW](https://reader031.fdocuments.us/reader031/viewer/2022020214/58efb0e31a28ab585d8b45af/html5/thumbnails/59.jpg)
59
Register Rotation (Kernel to Epilog, i<-4>)
ite Time Mem Adder Multiplier
-4 N-14 ld r41, (A)++ add r51, r48, r50 mul r44, r43, $c2
N-13 add r42, r41, $c1 mul r45, r43, $c3
N-12 st r51, (B)++ add r52, r43, r44 mul r48, r53, r46
-3 N-11
N-10
N-9
-2 N-8
N-7
N-6
-1 N-5
N-4
N-3
0 N-2
N-1
N
Assuming that registers are rotated per iteration automatically
In Intel Itanium, integer registers 32 – 127 are rotating registers
![Page 60: Lec15 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- EPIC VLIW](https://reader031.fdocuments.us/reader031/viewer/2022020214/58efb0e31a28ab585d8b45af/html5/thumbnails/60.jpg)
60
Register Rotation (Kernel to Epilog, i<-3>)
ite Time Mem Adder Multiplier
-4 N-14 ld r41, (A)++ add r51, r48, r50 mul r44, r43, $c2
N-13 add r42, r41, $c1 mul r45, r43, $c3
N-12 st r51, (B)++ add r52, r43, r44 mul r48, r53, r46
-3 N-11 add r51, r48, r50 mul r44, r43, $c2
N-10 mul r45, r43, $c3
N-9 st r51, (B)++ add r52, r43, r44 mul r48, r53, r46
-2 N-8
N-7
N-6
-1 N-5
N-4
N-3
0 N-2
N-1
N
Assuming that registers are rotated per iteration automatically
In Intel Itanium, integer registers 32 – 127 are rotating registers
![Page 61: Lec15 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- EPIC VLIW](https://reader031.fdocuments.us/reader031/viewer/2022020214/58efb0e31a28ab585d8b45af/html5/thumbnails/61.jpg)
61
Register Rotation (Kernel to Epilog, i<-2>)
ite Time Mem Adder Multiplier
-4 N-14 ld r41, (A)++ add r51, r48, r50 mul r44, r43, $c2
N-13 add r42, r41, $c1 mul r45, r43, $c3
N-12 st r51, (B)++ add r52, r43, r44 mul r48, r53, r46
-3 N-11 add r51, r48, r50 mul r44, r43, $c2
N-10 mul r45, r43, $c3
N-9 st r51, (B)++ add r52, r43, r44 mul r48, r53, r46
-2 N-8 add r51, r48, r50
N-7
N-6 st r51, (B)++ mul r48, r53, r46
-1 N-5
N-4
N-3
0 N-2
N-1
N
Assuming that registers are rotated per iteration automatically
In Intel Itanium, integer registers 32 – 127 are rotating registers
![Page 62: Lec15 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- EPIC VLIW](https://reader031.fdocuments.us/reader031/viewer/2022020214/58efb0e31a28ab585d8b45af/html5/thumbnails/62.jpg)
62
Register Rotation (Kernel to Epilog, i<-1>)
ite Time Mem Adder Multiplier
-4 N-14 ld r41, (A)++ add r51, r48, r50 mul r44, r43, $c2
N-13 add r42, r41, $c1 mul r45, r43, $c3
N-12 st r51, (B)++ add r52, r43, r44 mul r48, r53, r46
-3 N-11 add r51, r48, r50 mul r44, r43, $c2
N-10 mul r45, r43, $c3
N-9 st r51, (B)++ add r52, r43, r44 mul r48, r53, r46
-2 N-8 add r51, r48, r50
N-7
N-6 st r51, (B)++ mul r48, r53, r46
-1 N-5 add r51, r48, r50
N-4
N-3 st r51, (B)++
0 N-2
N-1
N
Assuming that registers are rotated per iteration automatically
In Intel Itanium, integer registers 32 – 127 are rotating registers
![Page 63: Lec15 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- EPIC VLIW](https://reader031.fdocuments.us/reader031/viewer/2022020214/58efb0e31a28ab585d8b45af/html5/thumbnails/63.jpg)
63
Register Rotation (Kernel to Epilog, final ite)
ite Time Mem Adder Multiplier
-4 N-14 ld r41, (A)++ add r51, r48, r50 mul r44, r43, $c2
N-13 add r42, r41, $c1 mul r45, r43, $c3
N-12 st r51, (B)++ add r52, r43, r44 mul r48, r53, r46
-3 N-11 add r51, r48, r50 mul r44, r43, $c2
N-10 mul r45, r43, $c3
N-9 st r51, (B)++ add r52, r43, r44 mul r48, r53, r46
-2 N-8 add r51, r48, r50
N-7
N-6 st r51, (B)++ mul r48, r53, r46
-1 N-5 add r51, r48, r50
N-4
N-3 st r51, (B)++
0 N-2 add r51, r48, r50
N-1
N st r51, (B)++
Assuming that registers are rotated per iteration automatically
In Intel Itanium, integer registers 32 – 127 are rotating registers
![Page 64: Lec15 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- EPIC VLIW](https://reader031.fdocuments.us/reader031/viewer/2022020214/58efb0e31a28ab585d8b45af/html5/thumbnails/64.jpg)
64
Modulo Schedule with Rotating Register Support
• No loop unrolling required (requires careful register allocation)
• Tighter code, saving space
• However, there are still prolog and epilog codes
• Can we use the same schedule for prolog/epilog?
– Use stage predicates to execute instructions conditionally
– Requires new ISA support (Itanium)
ld r41, (A)++ add r51, r48, r50 mul r44, r43, $c2
add r42, r41, $c1 mul r45, r43, $c3
st r51, (B)++ add r52, r43, r44 mul r48, r53, r46
![Page 65: Lec15 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- EPIC VLIW](https://reader031.fdocuments.us/reader031/viewer/2022020214/58efb0e31a28ab585d8b45af/html5/thumbnails/65.jpg)
65
Predicated Instruction Execution (Prolog i0)
ite Time Mem Adder Multiplier
0 0 (p16) ld r41, (A)++ add r51, r48, r50 mul r44, r43, $c2
1 (p16) add r42, r41, $c1 mul r45, r43, $c3
2 st r51, (B)++ add r52, r43, r44 mul r48, r53, r46
1 3
4
5
2 6
7
8
3 9
10
11
4 12
13
14
Don’t execute shaded instructions
cc0: only issue ld
cc1: only issue add
cc2: no issue
![Page 66: Lec15 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- EPIC VLIW](https://reader031.fdocuments.us/reader031/viewer/2022020214/58efb0e31a28ab585d8b45af/html5/thumbnails/66.jpg)
66
Predicated Prolog (Prolog i1)
ite Time Mem Adder Multiplier
0 0 (p16) ld r41, (A)++ add r51, r48, r50 mul r44, r43, $c2
1 (p16) add r42, r41, $c1 mul r45, r43, $c3
2 st r51, (B)++ add r52, r43, r44 mul r48, r53, r46
1 3 (p16) ld r41, (A)++ add r51, r48, r50 (p17) mul r44, r43, $c2
4 (p16) add r42, r41, $c1 (p17) mul r45, r43, $c3
5 st r51, (B)++ (p17) add r52, r43, r44 mul r48, r53, r46
2 6
7
8
3 9
10
11
4 12
13
14
cc3: ld(i1) & mul(i0)
cc4: add(i1) & mul(i0)
cc5: add(i0)
Note that stage predicates also “rotate” per iteration
![Page 67: Lec15 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- EPIC VLIW](https://reader031.fdocuments.us/reader031/viewer/2022020214/58efb0e31a28ab585d8b45af/html5/thumbnails/67.jpg)
67
Predicated Prolog (Prolog i2)
ite Time Mem Adder Multiplier
0 0 (p16) ld r41, (A)++ add r51, r48, r50 mul r44, r43, $c2
1 (p16) add r42, r41, $c1 mul r45, r43, $c3
2 st r51, (B)++ add r52, r43, r44 mul r48, r53, r46
1 3 (p16) ld r41, (A)++ add r51, r48, r50 (p17) mul r44, r43, $c2
4 (p16) add r42, r41, $c1 (p17) mul r45, r43, $c3
5 st r51, (B)++ (p17) add r52, r43, r44 mul r48, r53, r46
2 6 (p16) ld r41, (A)++ add r51, r48, r50 (p17) mul r44, r43, $c2
7 (p16) add r42, r41, $c1 (p17) mul r45, r43, $c3
8 st r51, (B)++ (p17) add r52, r43, r44 (p18) mul r48, r53, r46
3 9
10
11
4 12
13
14
cc6: ld(i2) & mul(i1)
cc7: add(i2) & mul(i1)
cc8: add(i1) & mul(i0)
![Page 68: Lec15 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- EPIC VLIW](https://reader031.fdocuments.us/reader031/viewer/2022020214/58efb0e31a28ab585d8b45af/html5/thumbnails/68.jpg)
68
Predicated Prolog (Prolog i3)
ite Time Mem Adder Multiplier
0 0 (p16) ld r41, (A)++ add r51, r48, r50 mul r44, r43, $c2
1 (p16) add r42, r41, $c1 mul r45, r43, $c3
2 st r51, (B)++ add r52, r43, r44 mul r48, r53, r46
1 3 (p16) ld r41, (A)++ add r51, r48, r50 (p17) mul r44, r43, $c2
4 (p16) add r42, r41, $c1 (p17) mul r45, r43, $c3
5 st r51, (B)++ (p17) add r52, r43, r44 mul r48, r53, r46
2 6 (p16) ld r41, (A)++ add r51, r48, r50 (p17) mul r44, r43, $c2
7 (p16) add r42, r41, $c1 (p17) mul r45, r43, $c3
8 st r51, (B)++ (p17) add r52, r43, r44 (p18) mul r48, r53, r46
3 9 (p16) ld r41, (A)++ add r51, r48, r50 (p17) mul r44, r43, $c2
10 (p16) add r42, r41, $c1 (p17) mul r45, r43, $c3
11 st r51, (B)++ (p17) add r52, r43, r44 (p18) mul r48, r53, r46
4 12
13
14
cc9: ld(i3) & mul(i2)
cc10: add(i3) & mul(i2)
cc11: add(i2) & mul(i1)
![Page 69: Lec15 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- EPIC VLIW](https://reader031.fdocuments.us/reader031/viewer/2022020214/58efb0e31a28ab585d8b45af/html5/thumbnails/69.jpg)
69
Predicated Kernel (i4)
ite Time Mem Adder Multiplier
0 0 (p16) ld r41, (A)++ add r51, r48, r50 mul r44, r43, $c2
1 (p16) add r42, r41, $c1 mul r45, r43, $c3
2 st r51, (B)++ add r52, r43, r44 mul r48, r53, r46
1 3 (p16) ld r41, (A)++ add r51, r48, r50 (p17) mul r44, r43, $c2
4 (p16) add r42, r41, $c1 (p17) mul r45, r43, $c3
5 st r51, (B)++ (p17) add r52, r43, r44 mul r48, r53, r46
2 6 (p16) ld r41, (A)++ add r51, r48, r50 (p17) mul r44, r43, $c2
7 (p16) add r42, r41, $c1 (p17) mul r45, r43, $c3
8 st r51, (B)++ (p17) add r52, r43, r44 (p18) mul r48, r53, r46
3 9 (p16) ld r41, (A)++ add r51, r48, r50 (p17) mul r44, r43, $c2
10 (p16) add r42, r41, $c1 (p17) mul r45, r43, $c3
11 st r51, (B)++ (p17) add r52, r43, r44 (p18) mul r48, r53, r46
4 12 (p16) ld r41, (A)++ (p20) add r51, r48, r50 (p17) mul r44, r43, $c2
13 (p16) add r42, r41, $c1 (p17) mul r45, r43, $c3
14 (p20) st r51, (B)++ (p17) add r52, r43, r44 (p18) mul r48, r53, r46
cc12: ld(i4) & add(i0) & mul(i3)
cc13: add(i4) & mul(i3)
cc14: st(i0) & add(i3) & mul(i2)
(p20) is used in iteration 4, not (p19) because of predicate rotation
![Page 70: Lec15 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- EPIC VLIW](https://reader031.fdocuments.us/reader031/viewer/2022020214/58efb0e31a28ab585d8b45af/html5/thumbnails/70.jpg)
70
Register Rotation (Kernel)
• Execute many iterations in the kernel …
![Page 71: Lec15 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- EPIC VLIW](https://reader031.fdocuments.us/reader031/viewer/2022020214/58efb0e31a28ab585d8b45af/html5/thumbnails/71.jpg)
71
Predicated Epilog (i<-4>)
ite Time Mem Adder Multiplier
-4 N-14 (p16) ld r41, (A)++ (p20) add r51, r48, r50 (p17) mul r44, r43, $c2
N-13 (p16) add r42, r41, $c1 (p17) mul r45, r43, $c3
N-12 (p20) st r51, (B)++ (p17) add r52, r43, r44 (p18) mul r48, r53, r46
-3 N-11
N-10
N-9
-2 N-8
N-7
N-6
-1 N-5
N-4
N-3
0 N-2
N-1
N
![Page 72: Lec15 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- EPIC VLIW](https://reader031.fdocuments.us/reader031/viewer/2022020214/58efb0e31a28ab585d8b45af/html5/thumbnails/72.jpg)
72
Predicated Epilog (i<-3>)
ite Time Mem Adder Multiplier
-4 N-14 (p16) ld r41, (A)++ (p20) add r51, r48, r50 (p17) mul r44, r43, $c2
N-13 (p16) add r42, r41, $c1 (p17) mul r45, r43, $c3
N-12 (p20) st r51, (B)++ (p17) add r52, r43, r44 (p18) mul r48, r53, r46
-3 N-11 (p16) ld r41, (A)++ (p20) add r51, r48, r50 (p17) mul r44, r43, $c2
N-10 (p16) add r42, r41, $c1 (p17) mul r45, r43, $c3
N-9 (p20) st r51, (B)++ (p17) add r52, r43, r44 (p18) mul r48, r53, r46
-2 N-8
N-7
N-6
-1 N-5
N-4
N-3
0 N-2
N-1
N
![Page 73: Lec15 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- EPIC VLIW](https://reader031.fdocuments.us/reader031/viewer/2022020214/58efb0e31a28ab585d8b45af/html5/thumbnails/73.jpg)
73
Predicated Epilog (i<-2>)
ite Time Mem Adder Multiplier
-4 N-14 (p16) ld r41, (A)++ (p20) add r51, r48, r50 (p17) mul r44, r43, $c2
N-13 (p16) add r42, r41, $c1 (p17) mul r45, r43, $c3
N-12 (p20) st r51, (B)++ (p17) add r52, r43, r44 (p18) mul r48, r53, r46
-3 N-11 (p16) ld r41, (A)++ (p20) add r51, r48, r50 (p17) mul r44, r43, $c2
N-10 (p16) add r42, r41, $c1 (p17) mul r45, r43, $c3
N-9 (p20) st r51, (B)++ (p17) add r52, r43, r44 (p18) mul r48, r53, r46
-2 N-8 (p16) ld r41, (A)++ (p20) add r51, r48, r50 (p17) mul r44, r43, $c2
N-7 (p16) add r42, r41, $c1 (p17) mul r45, r43, $c3
N-6 (p20) st r51, (B)++ (p17) add r52, r43, r44 (p18) mul r48, r53, r46
-1 N-5
N-4
N-3
0 N-2
N-1
N
![Page 74: Lec15 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- EPIC VLIW](https://reader031.fdocuments.us/reader031/viewer/2022020214/58efb0e31a28ab585d8b45af/html5/thumbnails/74.jpg)
74
Predicated Epilog (i<-1>)
ite Time Mem Adder Multiplier
-4 N-14 (p16) ld r41, (A)++ (p20) add r51, r48, r50 (p17) mul r44, r43, $c2
N-13 (p16) add r42, r41, $c1 (p17) mul r45, r43, $c3
N-12 (p20) st r51, (B)++ (p17) add r52, r43, r44 (p18) mul r48, r53, r46
-3 N-11 (p16) ld r41, (A)++ (p20) add r51, r48, r50 (p17) mul r44, r43, $c2
N-10 (p16) add r42, r41, $c1 (p17) mul r45, r43, $c3
N-9 (p20) st r51, (B)++ (p17) add r52, r43, r44 (p18) mul r48, r53, r46
-2 N-8 (p16) ld r41, (A)++ (p20) add r51, r48, r50 (p17) mul r44, r43, $c2
N-7 (p16) add r42, r41, $c1 (p17) mul r45, r43, $c3
N-6 (p20) st r51, (B)++ (p17) add r52, r43, r44 (p18) mul r48, r53, r46
-1 N-5 (p16) ld r41, (A)++ (p20) add r51, r48, r50 (p17) mul r44, r43, $c2
N-4 (p16) add r42, r41, $c1 (p17) mul r45, r43, $c3
N-3 (p20) st r51, (B)++ (p17) add r52, r43, r44 (p18) mul r48, r53, r46
0 N-2
N-1
N
![Page 75: Lec15 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- EPIC VLIW](https://reader031.fdocuments.us/reader031/viewer/2022020214/58efb0e31a28ab585d8b45af/html5/thumbnails/75.jpg)
75
Predicated Epilog (final iteration)
ite Time Mem Adder Multiplier
-4 N-14 (p16) ld r41, (A)++ (p20) add r51, r48, r50 (p17) mul r44, r43, $c2
N-13 (p16) add r42, r41, $c1 (p17) mul r45, r43, $c3
N-12 (p20) st r51, (B)++ (p17) add r52, r43, r44 (p18) mul r48, r53, r46
-3 N-11 (p16) ld r41, (A)++ (p20) add r51, r48, r50 (p17) mul r44, r43, $c2
N-10 (p16) add r42, r41, $c1 (p17) mul r45, r43, $c3
N-9 (p20) st r51, (B)++ (p17) add r52, r43, r44 (p18) mul r48, r53, r46
-2 N-8 (p16) ld r41, (A)++ (p20) add r51, r48, r50 (p17) mul r44, r43, $c2
N-7 (p16) add r42, r41, $c1 (p17) mul r45, r43, $c3
N-6 (p20) st r51, (B)++ (p17) add r52, r43, r44 (p18) mul r48, r53, r46
-1 N-5 (p16) ld r41, (A)++ (p20) add r51, r48, r50 (p17) mul r44, r43, $c2
N-4 (p16) add r42, r41, $c1 (p17) mul r45, r43, $c3
N-3 (p20) st r51, (B)++ (p17) add r52, r43, r44 (p18) mul r48, r53, r46
0 N-2 (p16) ld r41, (A)++ (p20) add r51, r48, r50 (p17) mul r44, r43, $c2
N-1 (p16) add r42, r41, $c1 (p17) mul r45, r43, $c3
N (p20) st r51, (B)++ (p17) add r52, r43, r44 (p18) mul r48, r53, r46
![Page 76: Lec15 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- EPIC VLIW](https://reader031.fdocuments.us/reader031/viewer/2022020214/58efb0e31a28ab585d8b45af/html5/thumbnails/76.jpg)
76
Final Modulo Schedule (Itanium-like)
• Before entering the loop, set p16 = 1 (p16 is the first rotating predicate register)
• When the modulo-scheduled loop branch (e.g. br.ctop) is encountered
– p63 is set to 1 by hardware in the prolog code (see next slide)
– All registers (rotating registers and rotating predicate registers) rotate as each stage (iteration) advances
• Only 3 Itanium instruction bundles (= 3 VLIWs) needed
– No prolog or epilog code
– No modulo variable expansion that stresses registers and blows up code size
mov ar.lc = 196 // loop count
mov ar.ec = 5 // epilog stages+1
mov pr.rot = 0x10000 // special inst set pr[16]=1 and p[63:17]=0
L1top:
(p16) r41 = (A)++ (p20) r51 = r48 + r50 (p17) r44 = r43 * $c2
(p16) r42 = r41 + $c1 (p17) r45 = r43 * $c3
(p20) (B)++ = r51 (p17) r52 = r43 + r44 (p18) r48 = r53 * r46
br.ctop L1top
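How the stage predicates turn one kernel into prolog, steady state, and epilog can be traced with a short Python model. This is a simplified sketch of a counted-loop branch, assuming that while LC is nonzero the branch decrements LC and feeds a 1 into the rotating predicates, that it afterwards decrements EC and feeds in a 0, and that it exits when EC reaches 1. The function name `run_ctop_loop` and the small LC value are illustrative, not taken from the slide.

```python
def run_ctop_loop(lc: int, ec: int):
    """Return the (p16, p17, p18, p19, p20) values seen on each pass through the kernel."""
    preds = [0] * 64
    preds[16] = 1                      # set before entering the loop
    trace = []
    while True:
        trace.append(tuple(preds[16:21]))
        # Model of the loop branch at the bottom of the kernel.
        if lc > 0:                     # prolog/kernel: start another source iteration
            lc -= 1
            new_p16, taken = 1, True
        elif ec > 1:                   # epilog: drain the iterations still in flight
            ec -= 1
            new_p16, taken = 0, True
        else:                          # final pass: fall out of the loop
            ec = max(ec - 1, 0)
            new_p16, taken = 0, False
        # Rotate: every predicate value moves up one name; the new value enters at p16.
        preds[17:64] = preds[16:63]
        preds[16] = new_p16
        if not taken:
            return trace


trace = run_ctop_loop(lc=3, ec=5)
for n, stage_preds in enumerate(trace):
    print(n, stage_preds)
loads = sum(p[0] for p in trace)       # passes on which the (p16) load issues
stores = sum(p[4] for p in trace)      # passes on which the (p20) store issues
```

Under this model the kernel body runs LC + EC times, and the number of loads equals the number of stores, so every iteration that starts in the prolog is completed in the epilog.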
![Page 77: Lec15 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- EPIC VLIW](https://reader031.fdocuments.us/reader031/viewer/2022020214/58efb0e31a28ab585d8b45af/html5/thumbnails/77.jpg)
Counted Modulo-scheduled Loop
Rotating predicate registers: p20 = 0, p19 = 0, p18 = 0, p17 = 0, p16 = 1, p63 = 1, p62 = 0
Stage 0 (Prolog)
Mem Adder Multiplier
(p16) ld r41, (A)++ (p20) add r51, r48, r50 (p17) mul r44, r43, $c2
(p16) add r42, r41, $c1 (p17) mul r45, r43, $c3
(p20) st r51, (B)++ (p17) add r52, r43, r44 (p18) mul r48, r53, r46
After the first iteration: LC = 195, EC = 5
![Page 78: Lec15 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- EPIC VLIW](https://reader031.fdocuments.us/reader031/viewer/2022020214/58efb0e31a28ab585d8b45af/html5/thumbnails/78.jpg)
Counted Modulo-scheduled Loop
Rotating predicate registers: p20 = 0, p19 = 0, p18 = 0, p17 = 0, p16 = 1, p63 = 1, p62 = 0
Stage 1 (Prolog)
Mem Adder Multiplier
(p16) ld r41, (A)++ (p20) add r51, r48, r50 (p17) mul r44, r43, $c2
(p16) add r42, r41, $c1 (p17) mul r45, r43, $c3
(p20) st r51, (B)++ (p17) add r52, r43, r44 (p18) mul r48, r53, r46
Before the 2nd iteration
![Page 79: Lec15 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- EPIC VLIW](https://reader031.fdocuments.us/reader031/viewer/2022020214/58efb0e31a28ab585d8b45af/html5/thumbnails/79.jpg)
Counted Modulo-scheduled Loop
Rotating predicate registers: p20 = 0, p19 = 0, p18 = 0, p17 = 0, p16 = 1, p63 = 1, p62 = 1, p61 = 0
Stage 2 (Prolog)
Mem Adder Multiplier
(p16) ld r41, (A)++ (p20) add r51, r48, r50 (p17) mul r44, r43, $c2
(p16) add r42, r41, $c1 (p17) mul r45, r43, $c3
(p20) st r51, (B)++ (p17) add r52, r43, r44 (p18) mul r48, r53, r46
Before the 3rd iteration
![Page 80: Lec15 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- EPIC VLIW](https://reader031.fdocuments.us/reader031/viewer/2022020214/58efb0e31a28ab585d8b45af/html5/thumbnails/80.jpg)
Counted Modulo-scheduled Loop
Rotating predicate registers: p20 = 0, p19 = 0, p18 = 0, p17 = 0, p16 = 1, p63 = 1, p62 = 1, p61 = 1, p60 = 0
Stage 3 (Prolog)
Mem Adder Multiplier
(p16) ld r41, (A)++ (p20) add r51, r48, r50 (p17) mul r44, r43, $c2
(p16) add r42, r41, $c1 (p17) mul r45, r43, $c3
(p20) st r51, (B)++ (p17) add r52, r43, r44 (p18) mul r48, r53, r46
Before the 4th iteration
![Page 81: Lec15 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- EPIC VLIW](https://reader031.fdocuments.us/reader031/viewer/2022020214/58efb0e31a28ab585d8b45af/html5/thumbnails/81.jpg)
Counted Modulo-scheduled Loop
Rotating predicate registers: p20 = 0, p19 = 0, p18 = 0, p17 = 0, p16 = 1, p63 = 1, p62 = 1, p61 = 1, p60 = 1, p59 = 0
Stage 4 (Kernel)
Mem Adder Multiplier
(p16) ld r41, (A)++ (p20) add r51, r48, r50 (p17) mul r44, r43, $c2
(p16) add r42, r41, $c1 (p17) mul r45, r43, $c3
(p20) st r51, (B)++ (p17) add r52, r43, r44 (p18) mul r48, r53, r46
Before the 5th iteration
![Page 82: Lec15 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- EPIC VLIW](https://reader031.fdocuments.us/reader031/viewer/2022020214/58efb0e31a28ab585d8b45af/html5/thumbnails/82.jpg)
In the Kernel
• After Another 191 Iterations …..
![Page 83: Lec15 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- EPIC VLIW](https://reader031.fdocuments.us/reader031/viewer/2022020214/58efb0e31a28ab585d8b45af/html5/thumbnails/83.jpg)
Counted Modulo-scheduled Loop
Rotating predicate registers: p20 = 1, p19 = 1, p18 = 1, p17 = 1, p16 = 1, p63 = 1, p62 = 1, p61 = 1, p60 = 1, p59 = 1, p58 = 1, p57 = 1, p56 = 1, p55 = 1
Stage 195 (Kernel)
Mem Adder Multiplier
(p16) ld r41, (A)++ (p20) add r51, r48, r50 (p17) mul r44, r43, $c2
(p16) add r42, r41, $c1 (p17) mul r45, r43, $c3
(p20) st r51, (B)++ (p17) add r52, r43, r44 (p18) mul r48, r53, r46
Before the 196th iteration: LC = 0, EC = 5
![Page 84: Lec15 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- EPIC VLIW](https://reader031.fdocuments.us/reader031/viewer/2022020214/58efb0e31a28ab585d8b45af/html5/thumbnails/84.jpg)
Counted Modulo-scheduled Loop
Rotating predicate registers: p20 = 1, p19 = 1, p18 = 1, p17 = 1, p16 = 1, p63 = 1, p62 = 1, p61 = 1, p60 = 0, p59 = 1, p58 = 1, p57 = 1, p56 = 1, p55 = 1
Stage 195 (Kernel)
Mem Adder Multiplier
(p16) ld r41, (A)++ (p20) add r51, r48, r50 (p17) mul r44, r43, $c2
(p16) add r42, r41, $c1 (p17) mul r45, r43, $c3
(p20) st r51, (B)++ (p17) add r52, r43, r44 (p18) mul r48, r53, r46
After the 196th iteration: EC = 4
![Page 85: Lec15 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- EPIC VLIW](https://reader031.fdocuments.us/reader031/viewer/2022020214/58efb0e31a28ab585d8b45af/html5/thumbnails/85.jpg)
Counted Modulo-scheduled Loop
Rotating predicate registers: p20 = 1, p19 = 1, p18 = 1, p17 = 1, p16 = 1, p63 = 1, p62 = 1, p61 = 1, p60 = 0, p59 = 1, p58 = 1, p57 = 1, p56 = 1, p55 = 1
Stage 196 (Epilog)
Mem Adder Multiplier
(p16) ld r41, (A)++ (p20) add r51, r48, r50 (p17) mul r44, r43, $c2
(p16) add r42, r41, $c1 (p17) mul r45, r43, $c3
(p20) st r51, (B)++ (p17) add r52, r43, r44 (p18) mul r48, r53, r46
Before the 197th iteration: EC = 4
![Page 86: Lec15 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- EPIC VLIW](https://reader031.fdocuments.us/reader031/viewer/2022020214/58efb0e31a28ab585d8b45af/html5/thumbnails/86.jpg)
Counted Modulo-scheduled Loop
Rotating predicate registers: p20 = 1, p19 = 1, p18 = 1, p17 = 1, p16 = 1, p63 = 1, p62 = 1, p61 = 1, p60 = 0, p59 = 0, p58 = 1, p57 = 1, p56 = 1, p55 = 1
Stage 197 (Epilog)
Mem Adder Multiplier
(p16) ld r41, (A)++ (p20) add r51, r48, r50 (p17) mul r44, r43, $c2
(p16) add r42, r41, $c1 (p17) mul r45, r43, $c3
(p20) st r51, (B)++ (p17) add r52, r43, r44 (p18) mul r48, r53, r46
Before the 198th iteration: EC = 3
![Page 87: Lec15 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- EPIC VLIW](https://reader031.fdocuments.us/reader031/viewer/2022020214/58efb0e31a28ab585d8b45af/html5/thumbnails/87.jpg)
Counted Modulo-scheduled Loop
Rotating predicate registers: p20 = 1, p19 = 1, p18 = 1, p17 = 1, p16 = 1, p63 = 1, p62 = 1, p61 = 1, p60 = 0, p59 = 0, p58 = 0, p57 = 1, p56 = 1, p55 = 1
Stage 198 (Epilog)
Mem Adder Multiplier
(p16) ld r41, (A)++ (p20) add r51, r48, r50 (p17) mul r44, r43, $c2
(p16) add r42, r41, $c1 (p17) mul r45, r43, $c3
(p20) st r51, (B)++ (p17) add r52, r43, r44 (p18) mul r48, r53, r46
Before the 199th iteration: EC = 2
![Page 88: Lec15 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- EPIC VLIW](https://reader031.fdocuments.us/reader031/viewer/2022020214/58efb0e31a28ab585d8b45af/html5/thumbnails/88.jpg)
Counted Modulo-scheduled Loop
Rotating predicate registers: p20 = 1, p19 = 1, p18 = 1, p17 = 1, p16 = 1, p63 = 1, p62 = 1, p61 = 1, p60 = 0, p59 = 0, p58 = 0, p57 = 0, p56 = 1, p55 = 1
Stage 199 (Epilog)
Mem Adder Multiplier
(p16) ld r41, (A)++ (p20) add r51, r48, r50 (p17) mul r44, r43, $c2
(p16) add r42, r41, $c1 (p17) mul r45, r43, $c3
(p20) st r51, (B)++ (p17) add r52, r43, r44 (p18) mul r48, r53, r46
Before the 200th iteration (last iteration): EC = 1
![Page 89: Lec15 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- EPIC VLIW](https://reader031.fdocuments.us/reader031/viewer/2022020214/58efb0e31a28ab585d8b45af/html5/thumbnails/89.jpg)
Counted Modulo-scheduled Loop
Rotating predicate registers: p20 = 1, p19 = 1, p18 = 1, p17 = 1, p16 = 1, p63 = 1, p62 = 1, p61 = 1, p60 = 0, p59 = 0, p58 = 0, p57 = 0, p56 = 0, p55 = 1
Stage 199 (Epilog)
Mem Adder Multiplier
(p16) ld r41, (A)++ (p20) add r51, r48, r50 (p17) mul r44, r43, $c2
(p16) add r42, r41, $c1 (p17) mul r45, r43, $c3
(p20) st r51, (B)++ (p17) add r52, r43, r44 (p18) mul r48, r53, r46
After the 200th iteration (last iteration): EC = 0
• “br.ctop” instruction exits the loop
![Page 90: Lec15 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- EPIC VLIW](https://reader031.fdocuments.us/reader031/viewer/2022020214/58efb0e31a28ab585d8b45af/html5/thumbnails/90.jpg)
90
Modulo Scheduling Example
Loop{
P=A+B
Q=C+D;
X=PxE
Y=PxQ
Z=X+Y
}
Step 1: Data flow graph
Data flow graph: A1 (+) adds A and B; A2 (+) adds C and D; M1 (x) multiplies the A1 result by E; M2 (x) multiplies the A1 and A2 results; A3 (+) adds the M1 and M2 results to produce Z
![Page 91: Lec15 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- EPIC VLIW](https://reader031.fdocuments.us/reader031/viewer/2022020214/58efb0e31a28ab585d8b45af/html5/thumbnails/91.jpg)
Modulo Scheduling
Step 2: Generate a list schedule
[Figure: data flow graph of the loop (nodes A1, A2, M1, M2, A3; inputs A, B, C, D, E; output Z), annotated with the values 0, 1, 1, 3, 3]
Execution units:
• 2 adders, 1-cycle latency
• 1 multiplier, 2-cycle latency
![Page 92: Lec15 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- EPIC VLIW](https://reader031.fdocuments.us/reader031/viewer/2022020214/58efb0e31a28ab585d8b45af/html5/thumbnails/92.jpg)
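Modulo Scheduling
Step 2: Generate a list schedule
[Figure: data flow graph of the loop (nodes A1, A2, M1, M2, A3; inputs A, B, C, D, E; output Z), annotated with the values 0, 1, 1, 3, 3]
Reservation Table
| Time | Adder1 | Adder2 | Mult |
|---|---|---|---|
| 0 | A1 | A2 | |
| 1 | | | M1 |
| 2 | | | M2 |
| 3 | | | |
| 4 | A3 | | |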
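
The list schedule in the reservation table can be reproduced with a small greedy scheduler. This is a minimal sketch, not the compiler's algorithm: `list_schedule` is a hypothetical helper, the dependence lists are read off the loop body, and the multiplier is assumed to be pipelined (it accepts a new operation every cycle), which is what the table implies by issuing M1 at time 1 and M2 at time 2.

```python
# Operations of the example loop: name -> (unit kind, predecessors).
OPS = {
    "A1": ("add", []),            # P = A + B
    "A2": ("add", []),            # Q = C + D
    "M1": ("mul", ["A1"]),        # X = P x E
    "M2": ("mul", ["A1", "A2"]),  # Y = P x Q
    "A3": ("add", ["M1", "M2"]),  # Z = X + Y
}
LATENCY = {"add": 1, "mul": 2}    # cycles until the result is available
UNITS = {"add": 2, "mul": 1}      # assumption: the multiplier is pipelined


def list_schedule(ops):
    """Greedy list scheduling; ops must be given in topological order."""
    start = {}
    issued = {}  # (cycle, kind) -> number of units already used
    for name, (kind, preds) in ops.items():
        # Earliest cycle at which all operands are ready.
        t = max((start[p] + LATENCY[ops[p][0]] for p in preds), default=0)
        # Delay until a unit of the right kind is free.
        while issued.get((t, kind), 0) >= UNITS[kind]:
            t += 1
        issued[(t, kind)] = issued.get((t, kind), 0) + 1
        start[name] = t
    return start
```

Calling `list_schedule(OPS)` places A1 and A2 at time 0, M1 at 1, M2 at 2 and A3 at 4, the same issue times as the reservation table.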
![Page 93: Lec15 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- EPIC VLIW](https://reader031.fdocuments.us/reader031/viewer/2022020214/58efb0e31a28ab585d8b45af/html5/thumbnails/93.jpg)
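Modulo Scheduling
Generating Modulo Schedule:
1. Determine the MII (minimum initiation interval):
$$\mathrm{MII} = \max_{\text{resources}} \left\lceil \frac{N}{C} \right\rceil, \qquad N = \text{resource demand}, \quad C = \text{resource availability}$$
The loop needs 3 additions on 2 adders and 2 multiplications on 1 multiplier, so
$$\mathrm{MII} = \max\left( \left\lceil \tfrac{3}{2} \right\rceil, \left\lceil \tfrac{2}{1} \right\rceil \right) = 2$$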
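
As a quick check of the arithmetic, the resource bound can be computed directly. This sketch covers only the resource term used on this slide; `res_mii` and the dictionary keys are hypothetical names chosen for illustration.

```python
import math


def res_mii(demand, available):
    """Resource-constrained MII: the worst ratio of demand to units, rounded up."""
    return max(math.ceil(demand[r] / available[r]) for r in demand)


# Example loop: 3 additions on 2 adders, 2 multiplications on 1 multiplier.
demand = {"adder": 3, "multiplier": 2}
available = {"adder": 2, "multiplier": 1}
print(res_mii(demand, available))  # 2
```

The multiplier is the bottleneck here: even with a third adder, the bound would stay at 2.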
![Page 94: Lec15 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- EPIC VLIW](https://reader031.fdocuments.us/reader031/viewer/2022020214/58efb0e31a28ab585d8b45af/html5/thumbnails/94.jpg)
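Modulo Scheduling
Mapping from list schedule to modulo schedule
List schedule:
| Time | Adder1 | Adder2 | Mult |
|---|---|---|---|
| 0 | A1 | A2 | |
| 1 | | | M1 |
| 2 | | | M2 |
| 3 | | | |
| 4 | A3 | | |
Modulo schedule for 1 iteration:
| Time | Modulo 2 | Adder1 | Adder2 | Mult |
|---|---|---|---|---|
| 0 | 0 | A1 | A2 | |
| 1 | 1 | | | M1 |
| 2 | 0 | | | M2 |
| 3 | 1 | | | |
| 4 | 0 | | | |
| 5 | 1 | A3 | | |
| 6 | 0 | | | |
A3 moves from time 4 to time 5: modulo slot 0 already holds A1 and A2, one on each adder, so A3 takes a free adder in slot 1.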
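
The mapping step can be sketched as a walk over the list schedule that books each operation into a modulo reservation table and delays it when its slot is full. This is a simplified greedy sketch under stated assumptions: `modulo_schedule` is a hypothetical helper, the list schedule is taken as given, and delayed operations are not re-checked against their successors, which is harmless here only because A3 is the last operation.

```python
II = 2  # initiation interval, equal to the MII of the example
UNITS = {"add": 2, "mul": 1}

# (operation, unit kind, issue time in the list schedule)
LIST_SCHEDULE = [
    ("A1", "add", 0),
    ("A2", "add", 0),
    ("M1", "mul", 1),
    ("M2", "mul", 2),
    ("A3", "add", 4),
]


def modulo_schedule(list_schedule, ii, units):
    """Place each op at the first time >= its list time whose modulo slot is free."""
    mrt = {}     # (time mod ii, kind) -> ops booked in that slot
    placed = {}  # op -> issue time within one iteration
    for name, kind, t in list_schedule:
        while len(mrt.get((t % ii, kind), [])) >= units[kind]:
            t += 1  # slot is full in every iteration, so try the next cycle
        mrt.setdefault((t % ii, kind), []).append(name)
        placed[name] = t
    return placed, mrt


placed, mrt = modulo_schedule(LIST_SCHEDULE, II, UNITS)
print(placed)  # {'A1': 0, 'A2': 0, 'M1': 1, 'M2': 2, 'A3': 5}
```

A3 is the only operation that moves: time 4 falls in slot 0, where both adders are already booked by A1 and A2, so it lands at time 5.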
![Page 95: Lec15 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- EPIC VLIW](https://reader031.fdocuments.us/reader031/viewer/2022020214/58efb0e31a28ab585d8b45af/html5/thumbnails/95.jpg)
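Modulo Scheduling
Inserting iteration 2:
| Time | Modulo 2 | Adder1 | Adder2 | Mult |
|---|---|---|---|---|
| 0 | 0 | 1:A1 | 1:A2 | |
| 1 | 1 | | | 1:M1 |
| 2 | 0 | 2:A1 | 2:A2 | 1:M2 |
| 3 | 1 | | | 2:M1 |
| 4 | 0 | | | 2:M2 |
| 5 | 1 | 1:A3 | | |
| 6 | 0 | | | |
| 7 | 1 | 2:A3 | | |
| 8 | 0 | | | |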
![Page 96: Lec15 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- EPIC VLIW](https://reader031.fdocuments.us/reader031/viewer/2022020214/58efb0e31a28ab585d8b45af/html5/thumbnails/96.jpg)
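Modulo Scheduling
Inserting iteration 3:
| Time | Modulo 2 | Adder1 | Adder2 | Mult |
|---|---|---|---|---|
| 0 | 0 | 1:A1 | 1:A2 | |
| 1 | 1 | | | 1:M1 |
| 2 | 0 | 2:A1 | 2:A2 | 1:M2 |
| 3 | 1 | | | 2:M1 |
| 4 | 0 | 3:A1 | 3:A2 | 2:M2 |
| 5 | 1 | 1:A3 | | 3:M1 |
| 6 | 0 | | | 3:M2 |
| 7 | 1 | 2:A3 | | |
| 8 | 0 | | | |
| 9 | 1 | 3:A3 | | |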
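
Inserting further iterations is mechanical: iteration k is the one-iteration modulo schedule shifted by (k - 1) × II cycles. The sketch below assumes the one-iteration schedule from the tables (A1, A2 at 0, M1 at 1, M2 at 2, A3 at 5); `overlapped` is a hypothetical helper name.

```python
II = 2
ONE_ITERATION = {"A1": 0, "A2": 0, "M1": 1, "M2": 2, "A3": 5}


def overlapped(n_iterations, ii=II, schedule=ONE_ITERATION):
    """Map each cycle to the 'iteration:op' labels issued in that cycle."""
    rows = {}
    for k in range(1, n_iterations + 1):
        offset = (k - 1) * ii  # iteration k starts ii cycles after k - 1
        for op, t in schedule.items():
            rows.setdefault(t + offset, []).append(f"{k}:{op}")
    return rows


for time, ops in sorted(overlapped(3).items()):
    print(time, time % II, ops)
```

With three iterations, cycle 4 holds 2:M2, 3:A1 and 3:A2, and cycle 5 holds 1:A3 and 3:M1, the same entries as the corresponding rows of the table.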
![Page 97: Lec15 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- EPIC VLIW](https://reader031.fdocuments.us/reader031/viewer/2022020214/58efb0e31a28ab585d8b45af/html5/thumbnails/97.jpg)
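Modulo Scheduled Loop
[Figure: overlapped schedule divided into prolog, kernel (labeled "5x kernel") and epilog]
MRT (modulo reservation table):
| Modulo 2 | Adder 1 | Adder 2 | Mult |
|---|---|---|---|
| 0 | 3:A1 | 3:A2 | 2:M2 |
| 1 | 1:A3 | | 3:M1 |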
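
The kernel rows of the MRT follow from the one-iteration schedule: an operation issued at time t sits in slot t mod II and belongs to stage t div II, so in steady state it comes from the iteration that started that many stages earlier. The sketch below assumes the one-iteration schedule from the earlier slides; `kernel` is a hypothetical helper name.

```python
II = 2
ONE_ITERATION = {"A1": 0, "A2": 0, "M1": 1, "M2": 2, "A3": 5}


def kernel(schedule, ii, newest):
    """Build the steady-state kernel when iteration `newest` is just starting."""
    slots = {s: [] for s in range(ii)}
    for op, t in schedule.items():
        stage = t // ii  # how many kernel passes after its iteration began
        slots[t % ii].append(f"{newest - stage}:{op}")
    return slots


print(kernel(ONE_ITERATION, II, newest=3))
# {0: ['3:A1', '3:A2', '2:M2'], 1: ['3:M1', '1:A3']}
```

The schedule spans three stages (times 0 to 5 with II = 2), so one kernel pass mixes work from three consecutive iterations. The prolog and epilog are the passes in which some of those stages are not yet, or no longer, active.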