The IA-64 Architectural Innovations
Hardware Support for Software Pipelining
José Nelson Amaral
Suggested Reading
Intel IA-64 Architecture Software Developer’s Manual, Chapters 8, 9
Instruction Group
An instruction group is a set of instructions that have no read-after-write (RAW) or write-after-write (WAW) register dependencies.
Consecutive instruction groups are separated by stops (represented by a double semicolon in the assembly code).

ld8 r1=[r5]        // First group
sub r6=r8, r9 ;;   // First group
add r3=r1,r4       // Second group
st8 [r6]=r12       // Second group
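The definition of an instruction group can be checked mechanically. The sketch below is a simplified Python model, not an assembler: the helper names (`parse`, `split_groups`, `violations`) are hypothetical, it only understands the `op dst=src,...` forms used on these slides, and it ignores post-increment updates and predicate registers.

```python
import re

def parse(line):
    """Return (written_regs, read_regs, has_stop) for a simplified IA-64 line."""
    code = line.split("//")[0]              # drop the comment
    has_stop = ";;" in code                 # a stop ends the instruction group
    lhs, rhs = code.replace(";;", "").strip().split("=", 1)
    regs = lambda text: set(re.findall(r"\br\d+\b", text))
    if "[" in lhs:                          # store: the address register is read
        return set(), regs(lhs) | regs(rhs), has_stop
    return regs(lhs), regs(rhs), has_stop

def split_groups(lines):
    """Split a list of instructions into instruction groups at each stop."""
    groups, current = [], []
    for line in lines:
        written, read, stop = parse(line)
        current.append((written, read))
        if stop:
            groups.append(current)
            current = []
    if current:
        groups.append(current)
    return groups

def violations(group):
    """List RAW and WAW register dependencies inside one instruction group."""
    written_so_far, found = set(), []
    for index, (written, read) in enumerate(group):
        if read & written_so_far:
            found.append(("RAW", index, sorted(read & written_so_far)))
        if written & written_so_far:
            found.append(("WAW", index, sorted(written & written_so_far)))
        written_so_far |= written
    return found
```

Calling `violations` on each group returned by `split_groups` gives an empty list when the stops are placed legally.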
Instruction Bundles
Instructions are organized in bundles of three instructions, with the following format:

Field                 Bits      Width
instruction slot 2    127-87    41
instruction slot 1    86-46     41
instruction slot 0    45-5      41
template              4-0       5

Instruction Type    Description         Execution Unit Type
A                   Integer ALU         I-unit or M-unit
I                   Non-ALU integer     I-unit
M                   Memory              M-unit
F                   Floating-Point      F-unit
B                   Branch              B-unit
L+X                 Extended            I-unit/B-unit
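The bit layout of a bundle can be illustrated with a few shifts and masks. This is a minimal Python sketch that treats a bundle as a 128-bit integer; the function names are hypothetical and the slot contents are opaque 41-bit values, not decoded instructions.

```python
SLOT_BITS = 41       # width of each instruction slot
TEMPLATE_BITS = 5    # width of the template field
SLOT_MASK = (1 << SLOT_BITS) - 1
TEMPLATE_MASK = (1 << TEMPLATE_BITS) - 1

def pack_bundle(template, slot0, slot1, slot2):
    """Assemble a 128-bit bundle: template in bits 4-0, slots above it."""
    assert 0 <= template <= TEMPLATE_MASK
    assert all(0 <= s <= SLOT_MASK for s in (slot0, slot1, slot2))
    return (template
            | (slot0 << 5)     # bits 45-5
            | (slot1 << 46)    # bits 86-46
            | (slot2 << 87))   # bits 127-87

def unpack_bundle(bundle):
    """Split a 128-bit bundle into (template, slot0, slot1, slot2)."""
    return (bundle & TEMPLATE_MASK,
            (bundle >> 5) & SLOT_MASK,
            (bundle >> 46) & SLOT_MASK,
            (bundle >> 87) & SLOT_MASK)
```

The three 41-bit slots plus the 5-bit template account for all 128 bits of the bundle.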
Bundles
In assembly, each 128-bit bundle is enclosed in curly braces and contains a template specification:

{ .mii
  ld4 r28=[r8]    // Load a 4-byte value
  add r9=2,r1     // 2+r1 and put in r9
  add r30=1,r1    // 1+r1 and put in r30
}

An instruction group can extend over an arbitrary number of bundles.
Templates
There are restrictions on the type of instructions that can be bundled together. The IA-64 has five slot types (M, I, F, B, and L), six instruction types (M, I, A, F, B, L), and twelve basic template types (MII, MI_I, MLX, MMI, M_MI, MFI, MMF, MIB, MBB, BBB, MMB, and MFB).
The underscore in the template acronym indicates a stop.
Every basic template type has two versions: one with a stop at the end of the bundle and one without.
Control Dependency Preventing Code Motion
In the code below, the ld4 is control dependent on the branch, and thus cannot be safely moved up in conventional processor architectures.

add r7=r6,1             // cycle 0
add r13=r25, r27
cmp.eq p1, p2=r12, r23
(p1) br.cond some_label ;;
ld4 r2=[r3] ;;          // cycle 1
sub r4=r2, r11          // cycle 3

[Figure: control-flow diagram with block A (ending in the br) and block B (containing the ld).]
Control Speculation
In the following code, suppose a load latency of two cycles:

(p1) br.cond.dptk L1    // cycle 0
ld8 r3=[r5] ;;          // cycle 1
shr r7=r3,r87           // cycle 3

However, if we execute the load before we know that we actually have to do it (control speculation), we get:

ld8.s r3=[r5]           // earlier cycle
// other, unrelated instructions
(p1) br.cond.dptk L1 ;; // cycle 0
chk.s r3, recovery      // cycle 1
shr r7=r3,r87           // cycle 1

The ld8.s instruction is a speculative load, and the chk.s instruction is a check instruction that verifies that the loaded value is good (no exception was deferred by the speculative load); otherwise it branches to the recovery code.
Ambiguous Memory Dependencies
An ambiguous memory dependency is a dependence between a load and a store, or between two stores, where it cannot be determined if the instructions involved access overlapping memory locations.
Two or more memory references are independent if it is known that they access non-overlapping memory locations.
Data Speculation
An advanced load allows a load to be moved above a store even if it is not known whether the load and the store may reference overlapping memory locations.

Original code:
st8 [r55]=r45     // cycle 0
ld8 r3=[r5] ;;    // cycle 0
shr r7=r3,r87     // cycle 2

Code with an advanced load:
ld8.a r3=[r5] ;;  // Advanced Load
// other, unrelated instructions
st8 [r55]=r45     // cycle 0
ld8.c r3=[r5] ;;  // cycle 0 - check
shr r7=r3,r87     // cycle 0
Moving Up Loads + Uses: Recovery Code
Original Code

st8 [r4] = r12      // cycle 0: ambiguous store
ld8 r6 = [r8] ;;    // cycle 0: load to advance
add r5 = r6,r7      // cycle 2
st8 [r18] = r5      // cycle 3

Speculative Code

ld8.a r6 = [r8] ;;  // cycle -3
// other, unrelated instructions
add r5 = r6,r7      // cycle -1; add that uses r6
// other, unrelated instructions
st8 [r4]=r12        // cycle 0
chk.a r6, recover   // cycle 0: check
back:               // Return point from jump to recover
st8 [r18] = r5      // cycle 0

recover:
ld8 r6 = [r8] ;;    // Reload r6 from [r8]
add r5 = r6,r7      // Re-execute the add
br back             // Jump back to main code
ld.c, chk.a and the ALAT
The execution of an advanced load, ld.a, creates an entry in a hardware structure, the Advanced Load Address Table (ALAT). This table is indexed by the register number. Each entry records the load address, the load type, and the size of the load.
When a check is executed, the entry for the register is checked to verify that a valid entry with the type specified is there.
An entry e is removed from the ALAT when:
(1) A store overlaps with the memory locations specified in e;
(2) Another advanced load to the same register is executed;
(3) There is a context switch caused by the operating system (or hardware);
(4) Capacity limitation of the ALAT implementation requires reuse of the ALAT slot.
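The four removal rules can be sketched as a toy model of the ALAT. This is an illustration in Python under stated assumptions, not a description of any real implementation: the class name, the default capacity of 32 entries, and the oldest-first replacement policy are all hypothetical, and the load type is omitted.

```python
class ALAT:
    """Toy Advanced Load Address Table, indexed by register number."""

    def __init__(self, capacity=32):          # capacity is an assumed value
        self.capacity = capacity
        self.entries = {}                     # register -> (address, size)

    def advanced_load(self, reg, addr, size):
        """ld.a: allocate an entry for reg."""
        self.entries.pop(reg, None)           # rule 2: same register reloaded
        if len(self.entries) >= self.capacity:
            oldest = next(iter(self.entries)) # rule 4: reuse a slot when full
            del self.entries[oldest]
        self.entries[reg] = (addr, size)

    def store(self, addr, size):
        """Any store removes the entries it overlaps (rule 1)."""
        self.entries = {
            reg: (a, s) for reg, (a, s) in self.entries.items()
            if a + s <= addr or addr + size <= a   # keep non-overlapping only
        }

    def context_switch(self):
        """Rule 3: a context switch invalidates every entry."""
        self.entries.clear()

    def check(self, reg):
        """ld.c / chk.a succeed only if a valid entry is still present."""
        return reg in self.entries
```

A failed `check` corresponds to the case where the load must be re-executed (ld.c) or recovery code must run (chk.a).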
Not a Thing (NaT)
The IA-64 has 128 general purpose registers, each with 64+1 bits, and 128 floating point registers, each with 82 bits.
The extra bit in the GPRs is the NaT bit that is used to indicate that the content of the register is not valid.
NaT=1 indicates that an instruction that generated an exception wrote to the register. It is a way to defer exceptions caused by speculative loads.
Any operation that uses NaT as an operand results in NaT.
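NaT propagation can be modeled with a value plus a flag. The Python sketch below is only an illustration: the `Reg` type, the dictionary standing in for memory, and the rule that a missing address is the "exception" are assumptions made for the example.

```python
from dataclasses import dataclass

MASK64 = (1 << 64) - 1

@dataclass(frozen=True)
class Reg:
    """A general register: 64 data bits plus the NaT bit."""
    value: int = 0
    nat: bool = False

def speculative_load(memory, addr):
    """Model of ld8.s: a faulting access sets NaT instead of raising."""
    if addr not in memory:                 # assumed stand-in for an exception
        return Reg(0, nat=True)
    return Reg(memory[addr] & MASK64)

def add(a, b):
    """Any operation with a NaT operand produces NaT."""
    if a.nat or b.nat:
        return Reg(0, nat=True)
    return Reg((a.value + b.value) & MASK64)

def chk_s(reg):
    """Model of chk.s: returns True when recovery code must run."""
    return reg.nat
```

The deferred exception only becomes visible when `chk_s` is executed, which is what lets the load move above the branch that guards it.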
If-conversion
If-conversion uses predicates to transform conditional code into a single control stream.

if (r4) {
  add r1 = r2, r3
  ld8 r6 = [r5]
}

cmp.ne p1, p0 = r4, 0 ;;   // Set predicate reg
(p1) add r1 = r2, r3
(p1) ld8 r6 = [r5]

if (r1)
  r2 = r3 + r4
else
  r7 = r6 - r5

cmp.ne p1, p2 = r1, 0 ;;   // Set predicate reg
(p1) add r2 = r3, r4
(p2) sub r7 = r6, r5
Optimization of Loops
Instruction descriptions:

ld4 r4 = [r5], 4 ;;     r4 ← MEM[r5]; r5 ← r5 + 4
st4 [r6] = r7, 4        MEM[r6] ← r7; r6 ← r6 + 4
br.cloop L1             if LC ≠ 0 then LC ← LC - 1; goto L1

void f(int *p, int *q, int A, int N)
{
  int t, c;
  for (c = 0; c < N; c++) {
    t = *p++;
    t = t + A;
    *q++ = t;
  }
}

L1: ld4 r4 = [r5], 4 ;;   // Cycle 0 load postinc 4
    add r7 = r4, r9 ;;    // Cycle 2
    st4 [r6] = r7, 4      // Cycle 3 store postinc 4
    br.cloop L1 ;;        // Cycle 3
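(a) L1: ld4 r4 = [r5], 4 ;;
(b)     add r7 = r4, r9 ;;
(c)     st4 [r6] = r7, 4
(d)     br.cloop L1 ;;

Schedule (cycle in which each instruction issues):
Iteration 1: cycle 0 a, cycle 2 b, cycle 3 c/d
Iteration 2: cycle 4 a, cycle 6 b, cycle 7 c/d
Iteration 3: cycle 8 a, cycle 10 b, cycle 11 c/d
Iteration 4: cycle 12 a, cycle 14 b

If LC=1000, how long does it take for this loop to execute?
It takes 4000 cycles (4 cycles per iteration).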
Optimization of Loops: Loop Unrolling

(a) L1: ld4 r4 = [r5], 4 ;;
(b)     ld4 r14 = [r5], 4 ;;
(c)     add r7 = r4, r9 ;;
(d)     add r17 = r14, r9
(e)     st4 [r6] = r7,4 ;;
(f)     st4 [r6] = r17,4
(g)     br.cloop L1 ;;

Schedule (cycle in which each instruction issues):
Kernel iteration 1: cycle 0 a, cycle 1 b, cycle 2 c, cycle 3 d/e, cycle 4 f/g
Kernel iteration 2: cycle 5 a, cycle 6 b, cycle 7 c, cycle 8 d/e, cycle 9 f/g
Kernel iteration 3: cycle 10 a, cycle 11 b, cycle 12 c, cycle 13 d/e, cycle 14 f/g

For simplicity we assume that N is a multiple of 2.
Because the loads (a) and (b) both update r5, they have to be serialized.

If LC=1000 for the original loop, how long does it take for this loop to execute?
It takes 2500 cycles (500 iterations of 5 cycles each). Thus the loop is 4000/2500 = 1.6 times faster.
Optimization of Loops: Expanding the Induction Variable

        add r15 = 4, r5
        add r16 = 4, r6 ;;
(a) L1: ld4 r4 = [r5], 8
(b)     ld4 r14 = [r15], 8 ;;
(c)     add r7 = r4, r9
(d)     add r17 = r14, r9
(e)     st4 [r6] = r7,8
(f)     st4 [r16] = r17,8
(g)     br.cloop L1 ;;

Schedule (cycle in which each instruction issues):
Kernel iteration 1: cycle 0 a/b, cycle 2 c/d, cycle 3 e/f/g
Kernel iteration 2: cycle 4 a/b, cycle 6 c/d, cycle 7 e/f/g
Kernel iteration 3: cycle 8 a/b, cycle 10 c/d, cycle 11 e/f/g
Kernel iteration 4: cycle 12 a/b, cycle 14 c/d

We use twice as many functional units as the original code.
But no instruction is issued in cycle 1, and functional units are still under-utilized.

If LC=1000 for the original loop, how long does it take for this loop to execute?
It takes 2000 cycles (500 iterations of 4 cycles each). Thus the loop is 4000/2000 = 2.0 times faster.
Optimization of Loops: Further Loop Unrolling

        add r15 = 4, r5
        add r25 = 8, r5
        add r35 = 12, r5
        add r16 = 4, r6
        add r26 = 8, r6
        add r36 = 12, r6 ;;
(a) L1: ld4 r4 = [r5], 16
(b)     ld4 r14 = [r15], 16 ;;
(c)     ld4 r24 = [r25], 16
(d)     ld4 r34 = [r35], 16 ;;
(e)     add r7 = r4, r9
(f)     add r17 = r14, r9 ;;
(g)     st4 [r6] = r7,16
(h)     st4 [r16] = r17,16
(i)     add r27 = r24, r9
(j)     add r37 = r34, r9 ;;
(k)     st4 [r26] = r27, 16
(l)     st4 [r36] = r37, 16
(m)     br.cloop L1 ;;

Schedule (cycle in which each instruction issues):
Kernel iteration 1: cycle 0 a/b, cycle 1 c/d, cycle 2 e/f, cycle 3 g/h/i/j, cycle 4 k/l/m
Kernel iteration 2: cycle 5 a/b, cycle 6 c/d, cycle 7 e/f, cycle 8 g/h/i/j, cycle 9 k/l/m
Kernel iteration 3: cycle 10 a/b, cycle 11 c/d, cycle 12 e/f, cycle 13 g/h/i/j, cycle 14 k/l/m

If LC=1000 for the original loop, how long does it take for this loop (unrolled 4 times) to execute?
It takes 250*5 = 1250 cycles. Thus the loop is 4000/1250 = 3.2 times faster.
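The cycle counts quoted for the four versions of the loop all follow from the same arithmetic: source iterations divided by the unroll factor, times the cycles per kernel iteration read off each schedule. A small Python sketch (the function name and the dictionary keys are just labels for this example) reproduces the numbers:

```python
def total_cycles(source_iterations, unroll, cycles_per_kernel_iteration):
    """Cycles for the whole loop, assuming the trip count divides evenly."""
    assert source_iterations % unroll == 0
    return (source_iterations // unroll) * cycles_per_kernel_iteration

# (unroll factor, cycles per kernel iteration) taken from the schedules
versions = {
    "original": (1, 4),
    "unrolled x2": (2, 5),
    "unrolled x2, expanded induction variables": (2, 4),
    "unrolled x4": (4, 5),
}

cycles = {name: total_cycles(1000, u, c) for name, (u, c) in versions.items()}
speedup = {name: cycles["original"] / n for name, n in cycles.items()}
```

Here `cycles` holds 4000, 2500, 2000 and 1250, and `speedup` holds 1.0, 1.6, 2.0 and 3.2 in the same order.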
Loop Optimization: Loop Unrolling
In the previous example we obtained a good utilization of the functional units through loop unrolling, but at the cost of code expansion and higher register pressure.
Software pipelining offers an alternative by overlapping the execution of operations from multiple iterations of the loop.
Loop Optimization: Software Pipelining

(S1) ld4 r4 = [r5], 4
(S2) - - -
(S3) add r7 = r4, r9
(S4) st4 [r6] = r7, 4

* This is not real code

Cycle   Iter 1  Iter 2  Iter 3  Iter 4  Iter 5  Iter 6  Iter 7
0       S1                                                       prologue
1               S1                                               prologue
2       S3              S1                                       prologue
3       S4      S3              S1                               kernel
4               S4      S3              S1                       kernel
5                       S4      S3              S1               kernel
6                               S4      S3              S1       kernel
7                                       S4      S3               epilogue
8                                               S4      S3       epilogue
9                                                       S4       epilogue
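The overlapped schedule can be generated rather than drawn. The Python sketch below assumes, as in the diagram, that a new source iteration starts every cycle and that stages S1, S3 and S4 issue 0, 2 and 3 cycles after the start of their iteration; the function name is hypothetical.

```python
def pipeline_schedule(iterations, stage_offsets):
    """Map each cycle to the (stage, iteration) pairs that issue in it."""
    schedule = {}
    for iteration in range(iterations):           # iteration i starts at cycle i
        for stage, offset in stage_offsets.items():
            schedule.setdefault(iteration + offset, []).append(
                (stage, iteration + 1))           # iterations numbered from 1
    return dict(sorted(schedule.items()))

offsets = {"S1": 0, "S3": 2, "S4": 3}
schedule = pipeline_schedule(7, offsets)
for cycle, work in schedule.items():
    print(cycle, work)
```

Under these assumptions the whole loop takes `iterations + 3` cycles, so seven iterations finish in ten cycles (cycles 0 to 9), and every cycle from 3 to 6 issues one load, one add, and one store.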
Loop Optimization: Software Pipelining Code

prologue:
    ld4 r4 = [r5], 4 ;;   // load x[1]
    ld4 r4 = [r5], 4 ;;   // load x[2]
    add r7 = r4, r9       // y[1] = x[1] + k
    ld4 r4 = [r5], 4 ;;   // load x[3]

kernel:
L1: ld4 r4 = [r5], 4      // load x[i+3]
    add r7 = r4, r9       // y[i+1] = x[i+1] + k
    st4 [r6] = r7, 4      // store y[i]
    br.cloop L1 ;;

epilogue:
    st4 [r6] = r7, 4      // store y[n-2]
    add r7 = r4, r9 ;;    // y[n-1] = x[n-1] + k
    st4 [r6] = r7, 4      // store y[n-1]
    add r7 = r4, r9 ;;    // y[n] = x[n] + k
    st4 [r6] = r7, 4      // store y[n]
Software Pipelining and Data Dependencies

void f(int *p, int *q, int N)
{
  int t, c;
  for (c = 0; c < N; c++) {
    t = *p++;
    t = t + 1;
    *q++ = t;
  }
}

Naïve Code:
loop:
    ldl  RA ← [RC]
    incr RC ← RC+1
    add  RB ← 1 + RA
    stl  [RD] ← RB
    incr RD ← RD+1
    if(loop not done) goto loop

Create an auto-increment addressing mode.
Code with auto-increment addressing:
loop:
    ldl RA ← [RC]+
    add RB ← 1 + RA
    stl [RD]+ ← RB
    if(loop not done) goto loop

Scalar Expansion: Write to a different register in each iteration.
Code with scalar expansion:
loop:
    ldl RAi ← [RC]+
    add RBi ← 1 + RAi
    stl [RD]+ ← RBi
    if(loop not done) goto loop

How to create an unbounded number of registers?
Still have RAW dependencies!
Rotate the Registers!
Code with explicit register rotation:
loop:
    ldl R32 ← [RC]+
    add R34 ← 1 + R33
    stl [RD]+ ← R35
    if(loop not done)
        copy temp ← R35
        copy R35 ← R34
        copy R34 ← R33
        copy R33 ← R32
        copy R32 ← temp
        goto loop

Dependencies on the copies!
Hardware Rotates Registers Automatically!
Simulating an Infinite Register File
prolog:
    ldl r32 ← [r12]+
    (rotate) r33 ← r32
    ldl r32 ← [r12]+
    add r34 ← 1 + r33
    (rotate) r35 ← r34
    (rotate) r34 ← r33
    (rotate) r33 ← r32

loop:
    ldl r32 ← [r12]+
    add r34 ← 1 + r33
    stl [r13]+ ← r35
    if(loop is not done)
        (rotate) temp ← r39
        (rotate) r39 ← r38
        (rotate) r38 ← r37
        (rotate) r37 ← r36
        (rotate) r36 ← r35
        (rotate) r35 ← r34
        (rotate) r34 ← r33
        (rotate) r33 ← r32
        (rotate) r32 ← temp
        goto loop

epilog:
    add r34 ← 1 + r33
    stl [r13]+ ← r35
    (rotate) r35 ← r34
    stl [r13]+ ← r35

It would be better not to generate separate code for the prolog and epilog.
Use predicate registers.
With a predicate in front of each instruction (1 = execute, 0 = skip):

prolog:
    (1) ldl r32 ← [r12]+
    (0) add r34 ← 1 + r33
    (0) stl [r13]+ ← r35
    (rotate all)
    (1) ldl r32 ← [r12]+
    (1) add r34 ← 1 + r33
    (0) stl [r13]+ ← r35
    (rotate all)

loop:
    (1) ldl r32 ← [r12]+
    (1) add r34 ← 1 + r33
    (1) stl [r13]+ ← r35
    if(loop is not done)
        (rotate all)
        goto loop

epilog:
    (0) ldl r32 ← [r12]+
    (1) add r34 ← 1 + r33
    (1) stl [r13]+ ← r35
    (rotate all)
    (0) ldl r32 ← [r12]+
    (0) add r34 ← 1 + r33
    (1) stl [r13]+ ← r35
    (rotate all)

Still need separate code for prolog and epilog.
Rotate predicate registers!
With rotating predicate registers:

loop:
    (p16) ldl r32 ← [r12]+
    (p17) add r34 ← 1 + r33
    (p18) stl [r13]+ ← r35
    if(loop is not done)
        (rotate all)
        goto loop

We have been ignoring the loop counter c and the test at the end of the loop.
Create a special software pipelining branch.
Support for Software Pipelining in the IA-64
After a loop is converted into a software pipeline, it looks quite different from the original loop. Intel adopts the following terminology:
source loop and source iteration: refer to the original source code.
kernel loop and kernel iteration: refer to the code that implements the software pipeline.
Loop Support in the IA-64: Register Rotation
The IA-64 has a rotating register base (rrb) register that is decremented by special software pipelined loop branches.
When the rrb is decremented, the value stored in register X appears to move to register X+1, and the value of the highest numbered rotating register appears to move to the lowest numbered rotating register.
• What registers can rotate?
  – The predicate registers p16-p63;
  – The floating-point registers f32-f127;
  – A programmable portion of the general registers:
    • The alloc instruction can allocate 0, 8, 16, 24, …, 96 general registers as rotating registers;
    • The lowest numbered rotating register is r32.
  – There are three rrbs: rrb.gr, rrb.fr, and rrb.pr.
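The renaming performed by the rotating register base can be written as one modular expression. The Python sketch below assumes a rotating region of eight general registers starting at r32, as in the examples on these slides; the function name and default arguments are hypothetical.

```python
def physical_register(logical, rrb, first=32, size=8):
    """Physical register that a logical rotating register name refers to.

    rrb starts at 0 and is decremented by each software pipelining branch.
    """
    assert first <= logical < first + size
    return first + (logical - first + rrb) % size

# A value written to logical r35 when rrb == 0 lives in physical r35 ...
written = physical_register(35, rrb=0)
# ... and two rotations later the same physical register is named r37.
read_later = physical_register(37, rrb=-2)
```

Both names resolve to the same physical register, which is why a value written as r35 can be read as r37 two kernel iterations later without any copy instruction.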
How Register Rotation Helps Software Pipeline
The concept of a software pipelining branch:

L1: ld4 r35 = [r4], 4    // post-increment by 4
    st4 [r5] = r37, 4    // post-increment by 4
    swp_branch L1 ;;

The pseudo-instruction swp_branch in the example rotates the general registers.
Therefore the value stored into r35 is read in r37 two kernel iterations (and two rotations) later.
The register rotation eliminates a dependence between the load and the store instructions, and allows each iteration of the loop to execute in one cycle.
[Figure: physical and logical names of registers r32-r39 for RRB = 0, -1, and -2 while the loop above executes. The values 7, 8, and 9 are loaded in successive kernel iterations; each load writes logical r35, and each rotation shifts the logical names by one so that a loaded value is read as r37 two iterations later.]
The stage predicate
(S1): (p16) ld4 r4 = [r5], 4
(S2): (p17) - - -
(S3): (p18) add r7 = r4, r9
(S4): (p19) st4 [r6] = r7, 4

When assembling a software pipeline the programmer can assign a stage predicate to each stage of the pipeline to control the execution of the instructions in that stage.
p16 is architecturally defined as the predicate for the first stage, p17 for the second, and so on.
The software pipeline branch rotates the predicate registers and injects a 1 in p16, thus enabling the stages of the pipeline one at a time during the execution of the prolog.
When the loop counter (LC) reaches zero, the software pipeline branch starts to decrement the epilog counter (EC) and injects a 0 in p16 at every rotation, to execute the epilogue of the software pipelined loop.
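Anatomy of a Software Pipelining Branch

[Flowchart, summarized by case:]

Condition                                Counter update   PR[16]   RRB     Outcome
LC ≠ 0 (prolog/kernel)                   LC--             1        RRB--   branch
LC == 0 (epilog), EC > 1                 EC--             0        RRB--   branch
LC == 0 (epilog), EC = 1                 EC--             0        RRB--   fall-thru
LC == 0, EC = 0 (special unrolled loops) EC unchanged     0        (none)  fall-thru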
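The decision structure of the software pipelining branch can be expressed as a small state-transition function. This Python sketch follows the four cases of the flowchart; the function name and the tuple it returns are conventions chosen for this example only.

```python
def swp_branch(lc, ec, rrb):
    """One execution of the software pipelining branch.

    Returns (lc, ec, rrb, p16, taken): the updated counters, the value
    injected into p16 after rotation, and whether the branch is taken.
    """
    if lc != 0:                      # prolog/kernel: keep starting iterations
        return lc - 1, ec, rrb - 1, 1, True
    if ec > 1:                       # epilog: drain the pipeline
        return lc, ec - 1, rrb - 1, 0, True
    if ec == 1:                      # last epilog stage: rotate, then exit
        return lc, 0, rrb - 1, 0, False
    return lc, ec, rrb, 0, False     # ec == 0: no rotation, fall through

def count_kernel_iterations(lc, ec):
    """Number of times the loop body runs for the given initial counters."""
    rrb, executed = 0, 0
    while True:
        executed += 1
        lc, ec, rrb, _, taken = swp_branch(lc, ec, rrb)
        if not taken:
            return executed, rrb
```

With LC = 4 and EC = 3 the body runs seven times (five to start the source iterations plus two more to drain the remaining stages) and the registers rotate seven times.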
Software Pipelining Example in the IA-64
mov pr.rot = 0             // Clear all rotating predicate registers
cmp.eq p16,p0 = r0,r0      // Set p16=1
mov ar.lc = 4              // Set loop counter to n-1
mov ar.ec = 3              // Set epilog counter to 3
…
loop:
(p16) ldl r32 = [r12], 1   // Stage 1: load x
(p17) add r34 = 1, r33     // Stage 2: y=x+1
(p18) stl [r13] = r35,1    // Stage 3: store y
      br.ctop loop         // Branch back
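The interaction of stage predicates, register rotation, and the two counters in this loop can be followed with a short simulation. The Python sketch below is a simplified model under stated assumptions: eight rotating general registers, only p16-p18 modeled, all register reads in a kernel iteration happening before the writes, and a Python list standing in for the memory read through r12 and written through r13. The function name is hypothetical.

```python
def run_pipelined_loop(xs):
    """Model the three-stage kernel; returns the values stored to memory."""
    assert len(xs) >= 1
    lc, ec, rrb = len(xs) - 1, 3, 0     # ar.lc = n-1, ar.ec = 3
    gr = [0] * 8                        # physical r32..r39 (assumed size)
    p16, p17, p18 = 1, 0, 0             # pr.rot cleared, then p16 set
    loads = iter(xs)
    stored = []

    def phys(logical):
        return (logical - 32 + rrb) % 8  # rotating register renaming

    while True:
        r33, r35 = gr[phys(33)], gr[phys(35)]   # reads precede writes
        if p16:
            gr[phys(32)] = next(loads)          # Stage 1: load x
        if p17:
            gr[phys(34)] = r33 + 1              # Stage 2: y = x + 1
        if p18:
            stored.append(r35)                  # Stage 3: store y
        # br.ctop: update counters, rotate, choose the new p16
        if lc != 0:
            lc, inject, taken = lc - 1, 1, True
        else:
            ec, inject, taken = ec - 1, 0, ec > 1
        rrb -= 1
        p16, p17, p18 = inject, p16, p17
        if not taken:
            return stored
```

For example, `run_pipelined_loop([10, 20, 30, 40, 50])` returns `[11, 21, 31, 41, 51]`: the value loaded as r32 is read as r33 one rotation later, and the sum written as r34 is read as r35 one rotation after that.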
[Animation: step-by-step execution of the loop above for five elements x1..x5 in memory, showing the physical and logical names of r32-r39, the predicate registers p16-p18, LC, EC, and RRB. The state after each execution of br.ctop is summarized below.]

br.ctop     p16  p17  p18   LC   EC   RRB   Work done in the kernel iteration before the branch
(initial)   1    0    0     4    3    0
1           1    1    0     3    3    -1    load x1
2           1    1    1     2    3    -2    load x2; y1 = x1 + 1
3           1    1    1     1    3    -3    load x3; y2 = x2 + 1; store y1
4           1    1    1     0    3    -4    load x4; y3 = x3 + 1; store y2
5           0    1    1     0    2    -5    load x5; y4 = x4 + 1; store y3
6           0    0    1     0    1    -6    y5 = x5 + 1; store y4
7           0    0    0     0    0    -7    store y5; the branch falls through