The IA-64 architecture and Itanium processors Explicitly Parallel Instruction Computing


Page 1: The IA-64 architecture and Itanium processors Explicitly Parallel Instruction Computing

Frans Dondorp

Presentation et 4 074, January 8th 2001

Page 2: Contents

Introduction to the IA-64 architecture and EPIC

The Itanium processor

Branch removal

Predication

Speculative execution

Control speculation

Comparison: ARM conditional instructions

Data speculation

Page 3: Introduction to the IA-64 architecture

Joint research by Intel and Hewlett-Packard (1994)

Exploitation of the ILP concept

Tight coupling of hardware and software

EPIC is introduced as the basic concept: Explicitly Parallel Instruction Computing

This results in a more complex task for the compiler and in hardware support for communication of meta-information: speculation, predication and branch hints

"The future of computing" (Intel web site)

Page 4: The Itanium processor

The Itanium, formerly code-named Merced, is the first processor based on the IA-64 architecture

Still a prototype, compilers announced (as of Nov. 2000)

10-stage pipeline, running at 800 MHz

To support EPIC, it is equipped with: 4 ALUs, 4 MMX units, 4 FPUs (2 SP, 2 DP), 2 L/S units, 3 branch units

MS Win2K and Linux announced (as of Oct. 2000)

Page 5: IA-64 resources and instructions

Register resources

GR (general registers): 128 registers, r0 ... r127, 64 + 1 b each. r0 ... r31 are static; r32 ... r127 are stacked/rotating (support for register stack and software pipelining). The extra bit marks a deferred exception (Not A Thing, NaT), used for control speculation.

FR (floating point registers): 128 registers, f0 ... f127, 82 b each. f32 ... f127 are rotating.

AR (application registers): 128 registers, ar0 ... ar127, 64 b each.

BR (branch registers): 8 registers, b0 ... b7, 64 b each. Function call linkage and return (64 b address space!).

PR (predicate registers): 64 registers, p0 ... p15 and p16 ... p63, 1 b each. Each holds the result of a conditional expression evaluation; used for predication.

Page 6: IA-64 resources and instructions

Instruction encoding

IA-64 "bundle" (128 b):

    Instruction 2 | Instruction 1 | Instruction 0 | Template
    41 b          | 41 b          | 41 b          | 5 b

Instruction format (41 b):

    Op   | Reg 1 | Reg 2 | Reg 3 | Predicate
    14 b | 7 b   | 7 b   | 7 b   | 6 b

Example:

    { .mii
      ld8 r1 = 4[r2]
      add r3 = r1, r3
      shr r7 = r4, r12
    }
    { .mbb
      ld8 r6 = 8[r5]
      (p3) br.cond Label1
      (p4) br.cond Label2
    }

Templates are used to group instructions to exploit parallel execution by keeping execution units busy.

Predicates are used to allow for conditional execution. 6 bits are used to address the 64 predicate registers.

The Itanium processor issues 8 ops/clock:

Execution units: 4 ALU, 4 MMX, 2 L/S, 4 FP (2 single precision, 2 double precision), 3 BR

Slots of the two example bundles: M I I M B B
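The field widths above can be checked with a small packing sketch. This is only an illustration under assumptions: the field order (most significant field first, as drawn on the slide) and the names `pack_instruction`, `pack_bundle` and `unpack_bundle` are hypothetical, and the template value used in the usage note is an arbitrary placeholder, not a real IA-64 template code.

```python
# Field widths are taken from the slide; the MSB-to-LSB field order is an assumption.
INSTR_FIELDS = [("op", 14), ("reg1", 7), ("reg2", 7), ("reg3", 7), ("pred", 6)]
INSTR_BITS = sum(width for _, width in INSTR_FIELDS)   # 14 + 7 + 7 + 7 + 6 = 41
TEMPLATE_BITS = 5
BUNDLE_BITS = 3 * INSTR_BITS + TEMPLATE_BITS           # 3 * 41 + 5 = 128


def pack_instruction(op, reg1, reg2, reg3, pred):
    """Pack one 41-bit instruction word, first field in the highest bits."""
    values = {"op": op, "reg1": reg1, "reg2": reg2, "reg3": reg3, "pred": pred}
    word = 0
    for name, width in INSTR_FIELDS:
        value = values[name]
        if not 0 <= value < (1 << width):
            raise ValueError(f"{name} does not fit in {width} bits")
        word = (word << width) | value
    return word


def pack_bundle(instr2, instr1, instr0, template):
    """Pack three 41-bit instructions and a 5-bit template into 128 bits."""
    if not 0 <= template < (1 << TEMPLATE_BITS):
        raise ValueError("template does not fit in 5 bits")
    bundle = 0
    for instr in (instr2, instr1, instr0):
        if not 0 <= instr < (1 << INSTR_BITS):
            raise ValueError("instruction does not fit in 41 bits")
        bundle = (bundle << INSTR_BITS) | instr
    return (bundle << TEMPLATE_BITS) | template


def unpack_bundle(bundle):
    """Inverse of pack_bundle: returns (instr2, instr1, instr0, template)."""
    template = bundle & ((1 << TEMPLATE_BITS) - 1)
    rest = bundle >> TEMPLATE_BITS
    mask = (1 << INSTR_BITS) - 1
    instr0 = rest & mask
    instr1 = (rest >> INSTR_BITS) & mask
    instr2 = (rest >> (2 * INSTR_BITS)) & mask
    return instr2, instr1, instr0, template
```

For example, `pack_bundle(i2, i1, i0, 3)` followed by `unpack_bundle` returns the same four values, and the 7-bit register fields are exactly wide enough to name 128 registers.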

Page 7: Branch removal

Branch prediction is costly: the cost of a misprediction is proportional to pipeline length

Optimizing the use of prediction resources can significantly improve the overall performance

Conditional instructions can eliminate the need for branches

With branches:

    cmp r1, r2
    beq equal
    mov r1, #0
    bal end
    .equal
    mov r2, #0
    .end

With conditional instructions:

    cmp r1, r2
    movne r1, #0
    moveq r2, #0

moveq executes only if the eq-bit is set in the status register; else NOP
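As a rough model of branch removal, the sketch below takes the branching sequence as the reference and compares it with a branch-free version built from conditional moves. The function names are hypothetical, and Python conditionals only stand in for the hardware behaviour (a conditional instruction whose condition fails acts as a NOP).

```python
def with_branch(r1, r2):
    """Reference behaviour of the branching sequence."""
    # cmp r1, r2 ; beq equal
    if r1 == r2:
        r2 = 0          # .equal: mov r2, #0
    else:
        r1 = 0          # mov r1, #0 ; bal end
    return r1, r2


def with_conditional_moves(r1, r2):
    """Branch-free version: one compare, then two guarded moves."""
    eq = (r1 == r2)                 # cmp sets the eq-bit once
    # Both moves are always issued; each one commits only if its
    # condition holds, otherwise it behaves like a NOP.
    new_r1 = r1 if eq else 0        # move into r1, guarded by "not equal"
    new_r2 = 0 if eq else r2        # move into r2, guarded by "equal"
    return new_r1, new_r2
```

Both functions return the same pair for every input, which is the property a compiler relies on when it replaces the branch.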

Page 8: Branch removal – Conditional instructions

Conditional instructions can reduce the branch penalty due to a misprediction from N pipeline stages to 1

Implementing conditional instructions in instruction space directly increases instruction size, while the number of conditions to test on is limited (typically to a few bits in the processor status register)

Unbalanced execution paths: conditional code might perform worse than a mispredicted branch

See also: ARM conditional instructions

Page 9: Branch removal – Conditional instructions

Example: conditional code performance (one instruction executed each cycle)

Conditional code:

    cmp r1, r2
    moveq r1, #0
    addeq r2, r2, #10
    ldbeq r3, (r5)+
    inceq r3
    stbeq r3, (r5)+
    inceq r1
    mov r2, #0

If r1 ≠ r2: 6 NOPs. LOSS: 6

vs

Branching code:

    cmp r1, r2
    bne end
    mov r1, #0
    add r2, r2, #10
    ldb r3, (r5)+
    inc r3
    stb r3, (r5)+
    inc r1
    .end
    mov r2, #0

If r1 ≠ r2 and the branch is mispredicted: pipeline flushed (branch penalty). LOSS: number of pipeline stages

On a machine with a 5-stage pipeline, conditional instructions would lead to performance loss

The compiler should decide!
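The comparison on this slide amounts to a simple cost model, sketched below. The worst-case losses (one cycle per skipped conditional instruction, one cycle per pipeline stage on a misprediction) come from the slide; the probability parameters `p_false` and `p_mispredict` are an added assumption for illustration, and all function names are hypothetical.

```python
def conditional_loss(n_conditional, p_false=1.0):
    """Expected cycles wasted on conditional instructions that act as NOPs.

    n_conditional: number of conditional instructions in the block.
    p_false: assumed probability that the condition does not hold.
    """
    return p_false * n_conditional


def branch_loss(pipeline_stages, p_mispredict=1.0):
    """Expected cycles lost to a pipeline flush after a misprediction."""
    return p_mispredict * pipeline_stages


def prefer_conditional(n_conditional, pipeline_stages,
                       p_false=1.0, p_mispredict=1.0):
    """True when conditional code is expected to lose fewer cycles."""
    return (conditional_loss(n_conditional, p_false)
            < branch_loss(pipeline_stages, p_mispredict))
```

With the slide's numbers, `prefer_conditional(6, 5)` is false (6 NOPs cost more than a 5-stage flush), while on a 10-stage pipeline `prefer_conditional(6, 10)` is true. A compiler making this decision would also need estimates for the two probabilities.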

Page 10: Predication

Predication: tagging instructions with a boolean value

    cmp.ne p1, p0 = r4, 0;;
    (p1) add r1 = r2, r3
    (p1) ld8 r6 = [r5]

cmp.ne sets the boolean values: compare r4 to 0 for "not equal". p1 is TRUE if r4 ≠ 0; the second target predicate receives NOT(p1).

(p1) add: if r4 ≠ 0 then r1 = (r2 + r3)

(p1) ld8: if r4 ≠ 0 then r6 = MEM(r5)

The limitations of conditional instructions are decreased by predication: with predication the number of conditions to test on equals the number of predicate registers
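The predicated sequence can be modelled in a few lines. This is a simplified sketch, not the architecture's definition: registers and memory are plain dictionaries, the function name `run_predicated` is hypothetical, and p0 is treated as hardwired to TRUE with writes to it discarded.

```python
def run_predicated(regs, mem):
    """Simulate: cmp.ne p1, p0 = r4, 0 ;; (p1) add ... ; (p1) ld8 ..."""
    regs = dict(regs)                 # work on a copy
    pred = [False] * 64               # 64 one-bit predicate registers
    pred[0] = True                    # p0 always reads as TRUE

    def set_pred(index, value):
        if index != 0:                # writes to p0 are discarded
            pred[index] = value

    # cmp.ne p1, p0 = r4, 0
    not_equal = regs["r4"] != 0
    set_pred(1, not_equal)            # first target: the comparison result
    set_pred(0, not not_equal)        # second target: the complement

    # (p1) add r1 = r2, r3
    if pred[1]:
        regs["r1"] = regs["r2"] + regs["r3"]
    # (p1) ld8 r6 = [r5]
    if pred[1]:
        regs["r6"] = mem[regs["r5"]]
    return regs
```

When `r4` is zero both guarded instructions fall through without effect, so no branch is needed to skip them.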

Page 11: Advantages of predication

The compiler has more freedom when scheduling if predicates are guaranteed not to conflict.

Code motion past branches and Ld/Str ops results in speculative execution

Predication – moving instructions

Code motion: upward and downward

Page 12: Speculative execution

The compiler selects commonly executed blocks

Instruction selection, prioritization and reordering

To enable aggressive code motion by the compiler, explicitly speculative instructions must be available

Page 13: Speculative execution – Control speculation

IA-64 provides speculative load instructions

Original code:

    instrA
    instrB
    ...
    br
    ld8 r1 = [r2]
    use r1

The load instruction is replaced by a speculative load, followed later by a speculation check:

    ld8.s r1 = [r2]      // NaT may be written in r1
    use r1
    instrA
    instrB
    ...
    br
    chk.s

Exception handling: if a speculative load raises an exception, a deferred exception token (NaT) is written to the target register. This NaT is propagated by almost all instructions.

chk.s checks for NaT and, if present, jumps to fix-up code (compiler generated). This fix-up code may execute the load non-speculatively and return to the main code afterwards.
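A toy model of the NaT mechanism may make the flow easier to follow. Everything here is an assumption made for illustration: memory is a dictionary, a missing address stands in for a faulting load, `use` is an arbitrary "+ 1" computation, and the names `ld_s`, `chk_s` and `speculated_path` are hypothetical.

```python
NAT = object()   # stand-in for the deferred exception token (Not A Thing)


def ld_s(mem, addr):
    """Speculative load: a fault is deferred by returning NaT."""
    return mem.get(addr, NAT)


def use(value):
    """Any instruction consuming the register; NaT propagates through it."""
    return NAT if value is NAT else value + 1


def chk_s(value, fixup):
    """Speculation check: run the fix-up code only if NaT is present."""
    return fixup() if value is NAT else value


def speculated_path(mem, addr, branch_taken):
    r1 = ld_s(mem, addr)          # ld8.s hoisted above the branch
    r3 = use(r1)                  # the use is speculative as well
    if branch_taken:
        return None               # result never needed, nothing is raised
    # chk.s: the fix-up repeats the load non-speculatively, then the use.
    # A bad address now raises a real exception (KeyError in this model).
    return chk_s(r3, lambda: use(mem[addr]))
```

A faulting speculative load is therefore harmless on the path that leaves through the branch, and surfaces as a real exception only on the path that actually needs the value.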

Page 14: Speculative execution – Data speculation

IA-64 provides advanced load instructions

Original code:

    instrA
    ...
    ...
    store
    ld8 r1 = [r2]
    use r1

The load instruction is replaced by an advanced load, followed later by an advanced load check:

    ld8.a r1 = [r2]
    use r1
    instrA
    ...
    store
    chk.a

reg#, addr and size are stored in the advanced load address table (ALAT):

    reg# | addr | size
    ...  | ...  | ...

WaR handling: when the store is executed, all ALAT entries are compared with the store address. Overlapping entries are removed.

chk.a checks for the address of its corresponding advanced load in the ALAT. If the address is still there, chk.a does nothing. If it's gone, chk.a jumps to fix-up code.
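The ALAT bookkeeping described above can be sketched as a small class. This is a simplified model under assumptions: entries are keyed by register name, the table has unlimited capacity, and the class and method names are hypothetical rather than taken from any Intel documentation.

```python
class ALAT:
    """Toy advanced load address table: reg# -> (addr, size)."""

    def __init__(self):
        self.entries = {}

    def advanced_load(self, reg, addr, size):
        """ld.a: record which bytes the register's value came from."""
        self.entries[reg] = (addr, size)

    def store(self, addr, size):
        """A store removes every entry whose byte range overlaps it."""
        self.entries = {
            reg: (a, s)
            for reg, (a, s) in self.entries.items()
            if not (a < addr + size and addr < a + s)
        }

    def check(self, reg):
        """chk.a: True if the entry survived, False if fix-up is needed."""
        return reg in self.entries
```

After `advanced_load("r1", 0x100, 8)`, a store to a disjoint range such as `store(0x200, 8)` leaves `check("r1")` true, while a store that touches any of the eight loaded bytes makes it false and would send execution to the fix-up code.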

Page 15: Speculative execution – fix-up

The fix-up code generated by the compiler is general

In case of control speculation: not only the load is speculative, but also all instructions using the destination register.

In case of data speculation: not only the load is speculative, but also all computations before the (possibly conflicting) store.

Although the compiler must include fix-up code to handle exceptions and WaR conflicts, this relatively simple mechanism allows for aggressive code motion

Page 16: The IA-64 architecture and Itanium processors Explicitly Parallel Instruction Computing

0000 EQ Z

0001 NE ~Z

0010 CS C

0011 CC ~C

0100 MI N

0101 PL ~N

0110 VS V

0111 VC ~V

1000 HI C and ~Z

1001 LS ~C or Z

1010 GE N = V

1011 LT N = ~V

1100 GT (N = V) and ~Z

1101 LE (N = ~V) or Z

1110 AL True

1111 NV False (=NOP)

Comparison: ARM conditional instructionsComparison: ARM conditional instructions

Conditional instructions to allow for branch-removal as Conditional instructions to allow for branch-removal as implemented in the ARM processor (+/- 1985)implemented in the ARM processor (+/- 1985)

Cond 000 OPC S SRC1 DEST SH# SH SRC2

ADDEQS Rd, Rn, Rm, ASL Rc

Rd = Sign(Rn+(Rm << Rc)) Single cycle execution

Straightforward orthogonal instruction coding: all instructions can Straightforward orthogonal instruction coding: all instructions can be coded conditionally on all conditionsbe coded conditionally on all conditions

Only 4 condition bits: Z, C, N, V in processor status register: set Only 4 condition bits: Z, C, N, V in processor status register: set by CMN, CMP, TEQ, TSTby CMN, CMP, TEQ, TST

Flexibility: branch removal, but no code motion!Flexibility: branch removal, but no code motion!(conditional instructions (conditional instructions afterafter CMP) CMP)

Instruction format code
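The condition table can be turned into an executable sketch. The condition expressions below follow the table; the flag computation for a compare (`cmp_flags`, a hypothetical helper) assumes a 32-bit subtraction in which C means "no borrow" and V means signed overflow, which is how ARM's CMP behaves to the best of my knowledge.

```python
# Each condition maps the four status bits to True (execute) or False (NOP).
CONDITIONS = {
    "EQ": lambda n, z, c, v: z,
    "NE": lambda n, z, c, v: not z,
    "CS": lambda n, z, c, v: c,
    "CC": lambda n, z, c, v: not c,
    "MI": lambda n, z, c, v: n,
    "PL": lambda n, z, c, v: not n,
    "VS": lambda n, z, c, v: v,
    "VC": lambda n, z, c, v: not v,
    "HI": lambda n, z, c, v: c and not z,
    "LS": lambda n, z, c, v: (not c) or z,
    "GE": lambda n, z, c, v: n == v,
    "LT": lambda n, z, c, v: n != v,
    "GT": lambda n, z, c, v: (n == v) and not z,
    "LE": lambda n, z, c, v: (n != v) or z,
    "AL": lambda n, z, c, v: True,
    "NV": lambda n, z, c, v: False,
}


def cmp_flags(a, b, bits=32):
    """Flags (N, Z, C, V) after comparing a with b, i.e. computing a - b."""
    mask = (1 << bits) - 1
    a &= mask
    b &= mask
    result = (a - b) & mask
    sign = bits - 1
    n = bool((result >> sign) & 1)
    z = result == 0
    c = a >= b                                  # carry set means no borrow
    v = bool((((a ^ b) & (a ^ result)) >> sign) & 1)   # signed overflow
    return n, z, c, v


def condition_holds(cond, flags):
    """True if an instruction with this condition suffix would execute."""
    return bool(CONDITIONS[cond](*flags))
```

For instance, after `cmp_flags(3, 5)` the conditions LT, CC and LS hold while EQ and HI do not, so only instructions carrying one of the first three suffixes would take effect.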

Page 17: In conclusion

EPIC: The future of computing?

As processors grow in complexity, shifting responsibilities to the compiler seems obvious

Keeping up with Moore's law calls for conceptual innovations, not only technological ones


Page 19: It is now safe to ask your questions