® Compiling for the Intel® Itanium™ Architecture Steve Skedzielewski Intel Corporation Compiler...


Transcript of ® Compiling for the Intel® Itanium™ Architecture Steve Skedzielewski Intel Corporation Compiler...

Page 1: ® Compiling for the Intel® Itanium™ Architecture Steve Skedzielewski Intel Corporation Compiler Tricks.

Compiling for the Intel® Itanium™ Architecture

Steve Skedzielewski

Intel Corporation

Compiler Tricks

Page 2:

Agenda

Architecture Principles

Compiler Bag of Tricks
– Speculation
– Predication
– Branching
– Loop Generation

Page 3:

Traditional Architectures: Limited Parallelism

[Figure: Original Source Code → Compiler → Sequential Machine Code → Hardware (parallelized code, multiple functional units)]

Execution units available, but used inefficiently.

Today’s processors are often 60% idle.

Page 4:

Itanium™ Architecture: Explicit Parallelism

[Figure: Original Source Code → Compile (Itanium™ compiler views wider scope) → Parallel Machine Code → Hardware (multiple functional units)]

More efficient use of execution resources.

Increases parallel execution.

Page 5:

Itanium™ Architecture Principles

Explicit parallelism:
– Instruction level parallelism (ILP) in machine code
– Compiler schedules across a wide scope

Enhanced ILP:
– Predication, Speculation, Software pipelining, ...

Compatibility:
– Across all Itanium™ processor family members
– IA-32 in hardware and PA-RISC through instruction mapping

Massive resources:
– Many registers
– Many functional units

Page 6:

Speculation Review

Memory latency is a major performance bottleneck in today’s systems
– CPU to memory gap increasing

Traditional Architectures:
    instr 1
    instr 2
    . . .
    br          <- barrier
    load
    use

Itanium™ Architecture:
    ld.s
    instr 1
    instr 2
    br
    chk.s
    use

Advances a load, even above a branch.

Page 7:

Speculating Uses

Uses of speculative data can also be executed speculatively
– distinguishes speculation from simple prefetch

Itanium™ Architecture:
    ld.s
    instr 1
    instr 2
    br
    chk.s
    use

Enables further parallelism.

Page 8:

Introducing the NaT (“Not a Thing”)

NaT is the GR’s 65th bit that indicates:
– whether or not an exception has occurred
– when a branch to recovery code is required

NaT is set during ld.s and tested by chk.s.

Itanium™ Architecture:
    ld.s        ; Exception Detection
    instr 1
    instr 2
    br
    chk.s       ; Exception Delivery
    use

The exception is propagated from ld.s to chk.s.

Page 9:

Propagation

All computations propagate NaTs, which reduces the number of checks.

Cmp propagates “false” when writing predicates.

    ld8.s  r3 = (r9)
    ld8.s  r4 = (r10)
    shladd r6 = r3, 3, r4
    ld8.s  r5 = (r6)
    p1,p2 = cmp(...)
    chk.s  r5
    sub    r7 = r5,r2

Needs only one chk on result.
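The propagation rule above can be illustrated with a small Python model. This is only a sketch under stated assumptions: `NAT`, `ld_s`, `shladd`, `chk_s`, and `chain` are hypothetical helper names, memory is a plain dictionary, and a missing address stands in for any deferred fault.

```python
NAT = object()  # sentinel standing in for a register whose NaT bit is set


def ld_s(memory, addr):
    """Model of a speculative load: a faulting access yields NaT instead of trapping."""
    if addr is NAT or addr not in memory:
        return NAT
    return memory[addr]


def shladd(a, count, b):
    """Shift-left-and-add; a NaT in either source propagates to the result."""
    if a is NAT or b is NAT:
        return NAT
    return (a << count) + b


def chk_s(value):
    """Return True when recovery code must run (the NaT bit is set)."""
    return value is NAT


def chain(memory, r9, r10):
    """The dependent load chain from the slide, ending in r5."""
    r3 = ld_s(memory, r9)
    r4 = ld_s(memory, r10)
    r6 = shladd(r3, 3, r4)
    r5 = ld_s(memory, r6)
    return r5


good = {0x10: 2, 0x20: 0x100, 0x110: 42}
bad = {0x20: 0x100, 0x110: 42}           # the first load would fault

print(chk_s(chain(good, 0x10, 0x20)))    # no recovery needed
print(chk_s(chain(bad, 0x10, 0x20)))     # a single check on r5 catches the early fault
```

Because every intermediate result carries the NaT forward, checking only the final register is enough in this model, which is the point the slide makes.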

Page 10:

Exception Deferral: More Than Skin Deep

Costly exceptions can be deferred.

OS can control deferral of:
– Page faults
– Protection violations
– …

NaTs enable deferral with recovery.

Home block:
    ld.s
    instr 1
    instr 2
    uses
    br
    chk.s

Recovery code:
    ld
    uses
    br home

Enables aggressive code motion at compile time.

Page 11:

Store Barrier

Traditional Architectures:
    instr 1
    instr 2
    . . .
    Store (*)   <- barrier
    Load (*)
    use

Traditional architectures are limited by the store barrier.

Page 12:

Introducing Data Speculation

Compiler can issue a load prior to a preceding, possibly-conflicting store.

Traditional Architectures:
    instr 1
    instr 2
    . . .
    st8         <- barrier
    ld8
    use

Itanium™ Architecture:
    ld8.a
    instr 1
    instr 2
    st8
    ld.c
    use

Unique to Itanium™ Architecture.

Page 13:

Data Speculation

Uses can be speculated.

With a check load (ld.c):
    ld8.a
    instr 1
    instr 2
    st8
    ld.c
    use

With an advanced load check (chk.a):
    ld8.a
    instr 1
    use
    instr 2
    st8
    chk.a

Recovery code:
    ld8
    uses
    br home

Synergy with control speculation increases performance.

Page 14:

Architectural Support for Data Speculation

Instructions
– ld.a - advanced loads
– ld.c - check loads
– chk.a - advanced load checks

A speculative advanced load (ld.sa) is an advanced load with deferral.

ALAT - HW structure containing outstanding advanced loads

Page 15:

Advanced Load Address Table - ALAT

ld.a inserts entries.

Conflicting stores remove entries
– Also: ld.c.clr, chk.a.clr

Presence of entry indicates success
– chk.a branches when no entry is found

[Figure: the ALAT as a table of (reg #, Address) entries; "ld.a reg# = ..." inserts an entry, st is compared against the addresses, and "chk.a reg#" queries by register number]
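The ALAT behaviour described on this slide can be sketched in a few lines of Python. The model is deliberately simplified and is an assumption, not the hardware design: the class name `Alat` and its methods are hypothetical, entries are keyed by register number, and a store conflicts only when its address matches exactly (real hardware also considers access size and may compare partial addresses).

```python
class Alat:
    """Toy model of the Advanced Load Address Table."""

    def __init__(self):
        self.entries = {}  # register number -> address of the advanced load

    def ld_a(self, reg, addr, memory):
        """Advanced load: read memory early and record the address."""
        self.entries[reg] = addr
        return memory[addr]

    def store(self, addr, value, memory):
        """A store removes every entry whose address it conflicts with."""
        memory[addr] = value
        self.entries = {r: a for r, a in self.entries.items() if a != addr}

    def chk_a(self, reg):
        """True means the entry survived, so the speculation succeeded."""
        return reg in self.entries


memory = {0x100: 7, 0x200: 9}
alat = Alat()

r4 = alat.ld_a(4, 0x100, memory)   # load hoisted above the stores below
alat.store(0x200, 1, memory)       # different address: entry for r4 survives
ok_after_unrelated_store = alat.chk_a(4)

alat.store(0x100, 8, memory)       # same address: entry for r4 is removed
ok_after_conflicting_store = alat.chk_a(4)
```

When `chk_a` reports a missing entry, the program would branch to recovery code and reload the value, which in this example would now be 8 rather than the stale 7.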

Page 16:

Speculation Benefits

Reduces impact of memory latency

Improves code with many cache accesses
– Large databases
– Operating systems

Gives scheduling flexibility

Page 17:

Agenda

Architecture Principles

Compiler Bag of Tricks
– Speculation
– Predication
– Branching
– Loop Generation

Page 18:

Predication

[Figure: Traditional Architectures: a cmp followed by a branch to separate "then" and "else" blocks. Itanium™ Architecture: a cmp writes p1 and p2, and the instructions of the two paths are predicated on p1 and p2.]

Converts branches to conditional execution
– Executes multiple paths simultaneously

Exposes parallelism and reduces critical path
– Better utilizes wider machines
– Reduces mispredicted branches
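As a rough illustration of if-conversion, the Python sketch below compares a branching function with a predicated one. The function names and the add/subtract bodies are invented for the example; the only point is that both paths are issued and each result is committed only when its predicate is true.

```python
def with_branch(r1, r2, r4):
    """Conventional form: one of two paths is chosen by a branch."""
    if r1 == r2:
        return r4 + 1
    else:
        return r4 - 1


def predicated(r1, r2, r4):
    """If-converted form: no branch, both paths guarded by predicates."""
    p1 = r1 == r2          # cmp writes a predicate ...
    p2 = not p1            # ... and its complement
    r5 = None
    r5 = r4 + 1 if p1 else r5   # (p1) add r5 = r4, 1
    r5 = r4 - 1 if p2 else r5   # (p2) sub r5 = r4, 1
    return r5


samples = [(1, 1, 10), (1, 2, 10), (0, 0, -5), (3, 4, 0)]
same = all(with_branch(*s) == predicated(*s) for s in samples)
```

The two guarded statements do not depend on each other, so a wide machine could issue them in the same cycle, and there is no branch left to mispredict.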

Page 19:

Complex Transformations

Not your simple if-then-else

• Mark from SPEC CPU95 130.li
• Low ILP in each block

Highly mispredicted branch

Page 20:

Complex Transformations

Global control flow reduction

[Figure: the blocks of the loop merged into a single region whose instructions are predicated on p1 or p2]

Set p1 = true

Set p1 or p2 based upon next path

• One loop back branch - always taken
• Utilizes machine width

Page 21:

Upward Code Movement

Before:
    cmp.unc.eq p1,p2 = r1,r2
    :
    (p1) br --> label
    :
    ld r4 = [r3]
    add r5 = r4,1

After:
    cmp.unc.eq p1,p2 = r1,r2
    :
    ld.s r4 = [r3]
    add r5 = r4,1
    :
    (p1) br --> label
    chk.s r4, rec

Speculate both the load and the use.

Depending upon deferral mode, the add could cause a cache miss.

Page 22:

Upward Code Movement

Before:
    cmp.unc.eq p1,p2 = r1,r2
    :
    (p1) br --> label
    :
    ld r4 = [r3]
    add r5 = r4,1

After:
    cmp.unc.eq p1,p2 = r1,r2
    :
    (p2) ld r4 = [r3]
    (p2) add r5 = r4,1
    :
    (p1) br --> label

Predicate with fall-thru predicate.

Motion bounded by compare.

Predication can avoid speculative side effects.

Page 23:

Downward Code Movement

[Figure: control-flow graph with blocks A, B, and C]

Predication enables downward code movement from A to C without compensation code in B.

[Figure: main trace through A and C, with a compensation block and a merge block]

Use predication to merge sparse code in compensation block with code in merge block.

Page 24:

Code Motion Tradeoffs

[Figure: control-flow graph with blocks A, B, C, and D, annotated with upward and downward code motion]

Slots available in hot path

Predicate region formation occurs before scheduling

Predication can pull instructions from lower weight path

Scheduler can move instructions from above and below

Solutions
• Heuristic formation
• Preschedule information
• Reverse if-conversion

Page 25:

Introducing Parallel Compares

Three new types of compares:
– AND: both target predicates set FALSE if compare is false
– OR: both target predicates set TRUE if compare is true
– ANDOR: if true, sets one TRUE, sets other FALSE

[Figure: blocks A, B, C, and D, before and after applying parallel compares]

Reduces critical path.

Page 26:

Method of Use

Or Predicate
• Initially clear predicate
• All true compares will set
• All false compares do nothing

    cmp.unc.ne p1 = r0,r0      (p1 = 0)
    cmp.or.eq  p1 = 40,r7
    cmp.or.eq  p1 = 9,r7

And Predicate
• Initially set predicate
• All true compares do nothing
• All false compares will clear

    cmp.unc.eq p1 = r0,r0      (p1 = 1)
    cmp.and.ge p1 = 48,r6
    cmp.and.lt p1 = 58,r6

Page 27:

Parallel Compare Example

    if (c1 && c2 && c3 && c4)
    then then_code
    else else_code

Itanium™ Architecture Code

    cmp.unc.eq   p1,p2 = r0,0
    cmp.and.orcm p1,p2 = c1
    cmp.and.orcm p1,p2 = c2
    cmp.and.orcm p1,p2 = c3
    cmp.and.orcm p1,p2 = c4
    (p1) then_code
    (p2) else_code

[Figure: a chain of tests c1, c2, c3, c4 leading to "then" or "else", next to the parallel-compare schedule annotated with cycles 0, 1, 2]

Significant control height reduction.
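The reason the four compares can issue together is that each one only ever clears p1 (and sets p2), so the final predicates do not depend on the order in which the updates land. The Python sketch below checks that property exhaustively; `and_orcm` and `parallel_if` are hypothetical names, and the conditions are modelled as plain booleans.

```python
from itertools import permutations, product


def and_orcm(p1, p2, cond):
    """Model of a cmp.and.orcm update: a false condition clears p1 and sets p2;
    a true condition leaves both predicates unchanged."""
    if not cond:
        return False, True
    return p1, p2


def parallel_if(conds, order=None):
    """Start with p1 = 1, p2 = 0, then apply the compares in the given order."""
    p1, p2 = True, False
    for i in (order or range(len(conds))):
        p1, p2 = and_orcm(p1, p2, conds[i])
    return p1, p2


order_independent = all(
    parallel_if(conds, order) == (all(conds), not all(conds))
    for conds in product([False, True], repeat=4)
    for order in permutations(range(4))
)
```

Since every ordering gives p1 = c1 && c2 && c3 && c4 and p2 as its complement, the compares need no sequencing among themselves.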

Page 28:

Predication Benefits

Reduces branches and mispredict penalties

Parallel compares further reduce critical paths

Greatly improves code with hard to predict branches

Works in tandem with speculation

Traditional architectures’ “bolt-on” approach can’t efficiently approximate predication
– Cmove: 39% more instructions, 23% slower performance*
– All instructions need predication

* Source: S. Mahlke, 1995

Page 29:

Agenda

Architecture Principles

Compiler Bag of Tricks
– Speculation
– Predication
– Branching
– Loop Generation

Page 30:

Branch Instruction

[Figure: a 128-bit bundle, bits 127 to 0, holding 41-bit instruction slots (Instruction 1 and Instruction 0 shown) and a Template; a branch instruction with the fields Branch, IP-Offset (21 bits), and QP]

Two basic branch formats
– Relative: IP := IP + Offset21
– Indirect: IP := BR[I]
– 8 branch registers for efficient branch execution
– Call/Return linking through branch registers

Loop branches with 64-bit loopcount register (LC)
– Enables perfect branch prediction of counted loops
– Traditional architectures always mispredict last iteration
– Important for low trip count loops

Page 31:

Branch Predicates

Conditional branches:
    cmp p1 = cond
    (p1) br target;

Unconditional branch:
    (p0) br target;

Compare and branch can be in same cycle

Compiler-directed static prediction augments dynamic prediction
– Reduced false mispredicts due to aliasing
– Frees space in H/W predictor
– Can give hint for dynamic predictor

Page 32:

8 Queens Example: Unconditional Compares

    if ((b[j] == true) && (a[i+j] == true) && (c[i-j+7] == true))

    R1=&b[j]
    R3=&a[i+j]
    R5=&c[i-j+7]
    ld R2=[R1]
    ld.s R4=[R3]
    ld.s R6=[R5]
    P1,P2 <- cmp.unc(R2==true)
    (p1) chk.s R4
    (p1) P3,P4 <- cmp.unc(R4==true)
    (p3) chk.s R6
    (p3) P5,P6 <- cmp.unc(R6==true)
    (P5) br then
    else

[Figure: 8 queens control flow through the predicate pairs P1/P2, P3/P4, and P5/P6 to Then or Else; the code is annotated with cycles 1, 2, 4, 5, 6, 7]

Page 33:

Eight Queens Example

    if ((b[j] == true) && (a[i+j] == true) && (c[i-j+7] == true))

    R1=&b[j]
    R3=&a[i+j]
    R5=&c[i-j+7]
    p1 <- true
    ld R2=[R1]
    ld R4=[R3]
    ld R6=[R5]
    p1,p2 <- cmp.and(R2==true)
    p1,p2 <- cmp.and(R4==true)
    p1,p2 <- cmp.and(R6==true)
    (p1) br then
    else

[The code is annotated with cycles 1, 2, 4]

Major reduction in control flow.

Page 34:

Multi-way Branch

w/o Speculation:
    ld8 r6 = (ra)
    (p1) br exit1
    ld8 r7 = (rb)
    (p3) br exit2
    ld8 r8 = (rc)
    (p5) br exit3

Hoisting Loads (3 branch cycles):
    ld8.s r6 = (ra)
    ld8.s r7 = (rb)
    ld8.s r8 = (rc)
    chk r6, rec0
    (p1) br exit1
    chk r7, rec1
    (p3) br exit2
    chk r8, rec2
    (p5) br exit3

Multi-way branch (1 branch cycle):
    ld8.s r6 = (ra)
    ld8.s r7 = (rb)
    ld8.s r8 = (rc)
    {
    chk r6, rec0
    (p2) chk r7, rec1
    (p4) chk r8, rec2
    }
    {
    (p1) br exit1
    (p3) br exit2
    (p5) br exit3
    }

[Figure: control flow through the predicate pairs P1/P2, P3/P4, and P5/P6]

Multi-way branches: more than 1 branch in a single cycle

Allows n-way branching

Supports aggressive speculation.

Page 35:

Multi-way Branch

w/o Predication:
    cmp p1, p2 = c1
    cmp p3, p4 = c2
    cmp p5, p6 = c3
    :
    :
    st [r10] =
    (p1) br exit1
    st [r11] =
    (p3) br exit2
    st [r12] =
    (p5) br exit3

Predication:
    cmp p1, p2 = c1
    cmp p3, p4 = c2
    cmp p5, p6 = c3
    :
    :
    st [r10] =
    (p2) st [r11] =
    (p4) st [r12] =
    (p1) br exit1
    (p3) br exit2
    (p5) br exit3

Predication and Multi-way increase ILP.

Page 36:

Agenda

Architecture Principles

Compiler Bag of Tricks
– Speculation
– Predication
– Branching
– Loop Generation

Page 37:

Loop Example

Convert string to uppercase:

    for (i=0; i< len; i++) {
        if (IS_LOWERCASE(line[i]))
            newline[i] = CNVT_TO_UPPERCASE(line[i]);
        else
            newline[i] = line[i];
    }

After macro expansion:

    for (i=0; i< len; i++) {
        if (line[i] >= 'a' && line[i] <= 'z')
            newline[i] = line[i]-32;
        else
            newline[i] = line[i];
    }

Typical integer-type loop.

Page 38:

Loop Assembly Code

Traditional Arch (40 cycles for 8 iterations):

    loop:   ld c = [ra], 1
            bgt c, 96 bottom
            blt c, 123 bottom
            sub c = c,32
    bottom: st [rb] = c, 1
            blt ra, end loop

Itanium™ Architecture (32 cycles for 8 iterations):

    loop:   ld c = [ra], 1
            cmp p1 = true
            cmp.and p1 = (c > 96)
            cmp.and p1 = (c < 123)
    (p1)    sub c = c,32
            st [rb] = c, 1
            br.cloop loop

[The traditional code is annotated with cycles 1 to 5, the Itanium™ code with cycles 1 to 4]

Fewer branches and no mispredictions. Still low ILP.

Page 39:

Unroll for ILP

            ld c = [ra],1
    loop:   ld d = [ra],1
            bgt c,115,b1
            blt c,96, b1
            sub c=c,36
    b1:     st [rb] = c,1
            beq rb,end, exit
            ld c = [ra],1
            bgt d,115,b2
            blt d,96, b2
            sub d=d,36
    b2:     st [rb] = d,1
            blt rb,end, loop

[Figure: schedule of the unrolled loop showing ld d, bgt c, blt c, sub, st c, beq, ld c, bgt d, blt d, sub, st d, and blt, with the labels loop:, b1:, and b2:]

Unroll twice
• 8 iterations in 33 cycles
• 1.2x performance improvement
• Code size: 2x
• Won’t gain by unrolling more
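The cycle counts quoted on the last two slides can be cross-checked with a little arithmetic. The sketch below assumes that the 1.2x figure compares the unrolled loop (33 cycles) against the original traditional loop (40 cycles); the slides do not state the baseline explicitly, and the dictionary keys are labels chosen for this example.

```python
ITERATIONS = 8

# Cycle counts as quoted on the slides, for 8 iterations each.
cycles = {
    "traditional": 40,
    "predicated": 32,
    "unrolled twice": 33,
}

# Average cost of one iteration under each scheme.
per_iteration = {name: total / ITERATIONS for name, total in cycles.items()}

# Assumed baseline: the traditional, non-unrolled loop.
speedup_unrolled = cycles["traditional"] / cycles["unrolled twice"]

for name, cost in per_iteration.items():
    print(f"{name}: {cost} cycles per iteration")
print(f"unrolling speedup: {speedup_unrolled:.2f}x")
```

Under that assumption the ratio is 40/33, which rounds to the 1.2x on the slide, at the price of doubling the code size.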

Page 40:

Software Pipelining

Overlapping execution of different loop iterations

[Figure: sequential iterations vs. overlapped iterations]

More iterations in same amount of time

Whole loop computation in one cycle

Page 41:

Software Pipelining

    Cycle 1:  ld
    Cycle 2:  ld  cmps
    Cycle 3:  ld  cmps  ?sub
    Cycle 4:  ld  cmps  ?sub  st     <- Kernel

[Figure: Input → ld → cmps → ?sub → st → Output]

Data transferred from one functional unit to the next.
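The staircase above can be generated mechanically. The sketch below assumes an idealized pipeline in which every stage takes exactly one cycle and a new iteration starts every cycle; the function names are hypothetical. Under those assumptions, stage s (counting from 0) works on iteration `cycle - s`, and the whole loop takes `iterations + stages - 1` cycles.

```python
STAGES = ["ld", "cmps", "?sub", "st"]


def sequential_cycles(iterations, stages):
    """Each iteration finishes all of its stages before the next one starts."""
    return iterations * stages


def pipelined_cycles(iterations, stages):
    """One new iteration per cycle, plus the cycles needed to drain the pipe."""
    if iterations == 0:
        return 0
    return iterations + stages - 1


def active_stages(cycle, iterations, stage_names):
    """Stages busy in a 1-based cycle; stage s handles iteration cycle - s."""
    return [name for s, name in enumerate(stage_names)
            if 1 <= cycle - s <= iterations]


for cycle in range(1, pipelined_cycles(8, len(STAGES)) + 1):
    print(cycle, " ".join(active_stages(cycle, 8, STAGES)))
```

For the 8-iteration example this gives 11 cycles, compared with the 32 cycles quoted earlier for the non-pipelined predicated loop.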

Page 42:

Introducing Rotating Registers

GR 32-127, FR 32-127 can rotate

Separate Rotating Register Base for each: GRs, FRs

Loop branches decrement all register rotating bases (RRB)

Instructions contain a “virtual” register number
– RRB + virtual register number = physical register number.

References
– “Overlapped Loop Support in the Cydra 5” - Dehnert et al., 1989
– “Code Generation Schemas for Modulo-Scheduled Loops” - Rau et al., MICRO-25, 1992

Allows painless transfer of data between stages.
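The register renaming rule can be written as a one-line function. The slide gives only "RRB + virtual register number = physical register number"; the wrap-around inside the rotating region is filled in here as ordinary modular arithmetic, and the `base` and `size` parameters are assumptions of this sketch (the default covers r32 to r127, and the 6-register region is purely illustrative).

```python
def physical_register(virtual, rrb, base=32, size=96):
    """Map a virtual rotating register number to a physical one.

    The offset within the rotating region is shifted by RRB and wrapped
    so that the result stays inside base .. base + size - 1.
    """
    return base + (virtual - base + rrb) % size


# With RRB = 0 the mapping is the identity.
print(physical_register(34, 0))

# After one loop branch (RRB = -1) the same instruction writes one register lower,
# so last iteration's r34 is now visible under the name r35.
print(physical_register(34, -1), physical_register(35, -1))

# A small 6-register region wraps from the bottom back to the top.
print(physical_register(34, -3, size=6))
```

Because the value written as r34 in one iteration is read as r35 in the next, no copy instructions are needed to hand data from one pipeline stage to the following one.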

Page 43:

Pipelined Loop

Kernel code:

    loop:
          ld r34 = [ra], 1
          cmp p1 = true
          cmp.and p1 = (r35>96)
          cmp.and p1 = (r35<123)
    (p1)  sub r36 = r36, 32
          st [rb] = r37, 1
          br.ctop loop

[Figure: pipeline stages s1 to s4 (ld, cmp< and cmp>, sub, st); the virtual register number plus RRB selects the physical register]

RRB = 0

Physical register file: r34 = xx, r35 = xx, r36 = xx, r37 = xx

Virtual registers: r34 = xx, r35 = xx, r36 = xx, r37 = xx

Page 44:

Fill the pipe ...

Execute prologue stage

Kernel code:

    loop:
          ld r34 = [ra], 1
          cmp p1 = true
          cmp.and p1 = (r35>96)
          cmp.and p1 = (r35<123)
    (p1)  sub r36 = r36, 32
          st [rb] = r37, 1
          br.ctop loop

Input: G o _ G r e y h

RRB = 0

Physical register file: r34 = G (written by ld), r35 = xx, r36 = xx, r37 = xx

Virtual registers: r34 = G, r35 = xx, r36 = xx, r37 = xx

Page 45:

Fill the pipe ...

Perform a loop branch
• Decrement lc
• Rotate registers by decrementing RRB

RRB = 0

Physical register file: r34 = G (written by ld), r35 = xx, r36 = xx, r37 = xx

Virtual registers: r34 = G, r35 = xx, r36 = xx, r37 = xx

Page 46:

Fill the pipe ...

Execute prologue stage

Kernel code:

    loop:
          ld r34 = [ra], 1
          cmp p1 = true
          cmp.and p1 = (r35>96)
          cmp.and p1 = (r35<123)
    (p1)  sub r36 = r36, 32
          st [rb] = r37, 1
          br.ctop loop

RRB = -1

Physical register file: r33 = o (written by ld), r34 = G, r35 = xx, r36 = xx

Virtual registers: r34 = o, r35 = G, r36 = xx, r37 = xx

Page 47:

Fill the pipe ...

Execute prologue stage

Kernel code:

    loop:
          ld r34 = [ra], 1
          cmp p16 = true
          cmp.and p16 = (r35>96)
          cmp.and p16 = (r35<123)
    (p17) sub r36 = r36, 32
          st [rb] = r37, 1
          br.ctop loop

Input: G o _ G r e y h

RRB = -2

Physical register file: r32 = _ (written by ld), r33 = o, r34 = G, r35 = xx

Virtual registers: r34 = _, r35 = o, r36 = G, r37 = xx

Page 48:

Execute the Kernel

Execute kernel: whole iteration per cycle

Kernel code:

    loop:
          ld r34 = [ra], 1
          cmp p16 = true
          cmp.and p16 = (r35>96)
          cmp.and p16 = (r35<123)
    (p17) sub r36 = r36, 32
          st [rb] = r37, 1
          br.ctop loop

Input: G o _ G r e y h

Output: G

RRB = -3

Physical register file: r37 = G (written by ld), r32 = _, r33 = o, r34 = G

Virtual registers: r34 = G, r35 = _, r36 = o, r37 = G

Page 49:

Execute the Kernel

Execute kernel: whole iteration per cycle

Kernel code:

    loop:
          ld r34 = [ra], 1
          cmp p16 = true
          cmp.and p16 = (r35>96)
          cmp.and p16 = (r35<123)
    (p17) sub r36 = r36, 32
          st [rb] = r37, 1
          br.ctop loop

Input: G o _ G r e y h

Output: G O

RRB = -4

Physical register file: r36 = r (written by ld), r37 = G, r32 = _, r33 = O

Virtual registers: r34 = r, r35 = G, r36 = _, r37 = O

Page 50:

Execute the Kernel

Execute kernel: whole iteration per cycle

Kernel code:

    loop:
          ld r34 = [ra], 1
          cmp p16 = true
          cmp.and p16 = (r35>96)
          cmp.and p16 = (r35<123)
    (p17) sub r36 = r36, 32
          st [rb] = r37, 1
          br.ctop loop

Input: G o _ G r e y h

Output: G O

RRB = -5

Physical register file: r35 = e (written by ld), r36 = r, r37 = G, r32 = _

Virtual registers: r34 = e, r35 = r, r36 = G, r37 = _

Page 51:

Pipelining Overhead

[Figure: Prologue, Kernel, Epilogue]

Prologue and Epilogue are bad
• Code size expansion
• Overhead not good for low trip count loops - cache performance

Can we avoid prologue and epilogue?

Page 52:

Prologue Code

    Cycle 1:  ld
    Cycle 2:  ld  cmps
    Cycle 3:  ld  cmps  ?sub
    Cycle 4:  ld  cmps  ?sub  st     <- Kernel

Incrementally turn on functional units.

Page 53:

Avoid Pro and Epilogues

Unit Enabler

Have enable bit on each functional unit

Enablers are initialized to off

Feed through a sequence of bits of length dependent upon loop count and pipe depth

[Figure: enable bits feeding the functional units ld, cmp< and cmp>, sub, and st; the bit sequence covers the Kernel (loop count) and then the Epilogue; physical register file r34 to r37 = xx]


Revisiting Rotating Predicate Registers

Pipeline stages: s1, s2, s3, s4

• PR16-63 can rotate, with separate Rotating Register Base

• Loop branches decrement all register rotating base (RRB)

• Instructions contain a “virtual” predicate register number

– RRB + virtual register number = physical register number

• Some predicates control pipeline stages, Stage Predicates

• Qualifying Predicates can still be in the loop

Complete Loop Code

loop:
(p16) ld r34 = [ra], 1
(p16) cmp.unc p20 = true
(p17) cmp.and p21 = (r35>96)
(p17) cmp.and p21 = (r35<123)
(p22) sub r36 = r36, 32
(p19) st [rb] = r37, 1
br.ctop loop


How does this work

Physical register file: r35 = xx, r36 = xx, r34 = G, r37 = xx

Functional units: ld, cmp<, cmp>, sub, st

RRB = 0

Figure labels: Stage Predicates, Qualifying Predicate, Kernel, Epilogue

Complete Loop Code

loop:
(p16) ld r34 = [ra], 1
(p16) cmp.unc p20 = true
(p17) cmp.and p21 = (r35>96)
(p17) cmp.and p21 = (r35<123)
(p22) sub r36 = r36, 32
(p19) st [rb] = r37, 1
br.ctop loop


Auto Predicate Generation

Predicate Generator (state: lc, ec, RRB, p16)

Initialize

• lc to trip count

• ec to epilogue count

• p16 to true

Loop branches

• Rotate predicates by decrementing RRB

• When lc > 0

- Decrement lc, set p16 = true

• When lc = 0

- Decrement ec, set p16 = false

• Fall through when ec = 0
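The predicate generator described above can be sketched as a small Python simulation that prints which stage predicates are on in each pass through the kernel. This is a simplified model and not a statement of the exact hardware behaviour: it assumes lc starts at the iteration count minus one (the first iteration is already enabled by p16 = true) and ec starts at the number of pipeline stages. The function name `simulate_ctop` and its parameters are hypothetical.

```python
from collections import deque


def simulate_ctop(iterations, stages):
    """Return the stage predicates seen on each pass through the kernel.

    Each entry is a tuple (p16, p17, ...) of booleans, one per stage.
    Assumes iterations >= 1. Simplified model of the counted loop branch.
    """
    lc = iterations - 1          # assumption: remaining iterations after the first
    ec = stages                  # assumption: one epilogue count per stage
    preds = deque([True] + [False] * (stages - 1))   # preds[0] is p16
    trace = []
    while True:
        trace.append(tuple(preds))
        if lc > 0:               # still starting new iterations
            lc -= 1
            next_p16 = True
            taken = True
        else:                    # draining the pipe
            ec -= 1
            next_p16 = False
            taken = ec > 0       # fall through when ec reaches 0
        preds.rotate(1)          # decrementing RRB: old p16 is now seen as p17
        preds[0] = next_p16
        if not taken:
            return trace


for number, row in enumerate(simulate_ctop(iterations=3, stages=2), start=1):
    print(number, row)
```

In this model every stage predicate is true for exactly as many passes as there are iterations, which is what lets one copy of the kernel serve as prologue, kernel, and epilogue.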


Fill the pipe again ...

Physical register file: r35 = xx, r36 = xx, r34 = G, r37 = xx

Functional units: ld, cmp<, cmp>, sub, st

RRB = 0

Figure labels: Stage Predicates, Kernel, Epilogue

Complete Loop Code

loop:
(p16) ld r34 = [ra], 1
(p16) cmp.unc p20 = true
(p17) cmp.and p21 = (r35>96)
(p17) cmp.and p21 = (r35<123)
(p22) sub r36 = r36, 32
(p19) st [rb] = r37, 1
br.ctop loop


Fill the pipe again ...

Physical register file: r34 = G, r35 = xx, r33 = o, r36 = xx

Functional units: ld, cmp<, cmp>, sub, st

RRB = -1

Figure labels: Stage Predicates, Kernel, Epilogue

Complete Loop Code

loop:
(p16) ld r34 = [ra], 1
(p16) cmp.unc p20 = true
(p17) cmp.and p21 = (r35>96)
(p17) cmp.and p21 = (r35<123)
(p22) sub r36 = r36, 32
(p19) st [rb] = r37, 1
br.ctop loop


Fill the pipe again ...

Physical register file: r33 = o, r34 = G, r32 = _, r35 = xx

Functional units: ld, cmp<, cmp>, sub, st

RRB = -2

Figure labels: Stage Predicates, Kernel, Epilogue

Complete Loop Code

loop:
(p16) ld r34 = [ra], 1
(p16) cmp.unc p20 = true
(p17) cmp.and p21 = (r35>96)
(p17) cmp.and p21 = (r35<123)
(p22) sub r36 = r36, 32
(p19) st [rb] = r37, 1
br.ctop loop


Chunking thru kernel

Physical register file: r32 = _, r33 = o, r37 = G, r34 = G

Functional units: ld, cmp<, cmp>, sub, st

RRB = -3

Figure labels: Kernel, Epilogue

Complete Loop Code

loop:
(p16) ld r34 = [ra], 1
(p16) cmp.unc p20 = true
(p17) cmp.and p21 = (r35>96)
(p17) cmp.and p21 = (r35<123)
(p22) sub r36 = r36, 32
(p19) st [rb] = r37, 1
br.ctop loop


Chunking thru kernel

Physical register file: r37 = G, r32 = _, r36 = r, r33 = O

Functional units: ld, cmp<, cmp>, sub, st

RRB = -4

Stored output so far: G

Figure labels: Kernel, Epilogue

Complete Loop Code

loop:
(p16) ld r34 = [ra], 1
(p16) cmp.unc p20 = true
(p17) cmp.and p21 = (r35>96)
(p17) cmp.and p21 = (r35<123)
(p22) sub r36 = r36, 32
(p19) st [rb] = r37, 1
br.ctop loop


Chunking thru kernel

Physical register file: r36 = r, r37 = G, r35 = e, r32 = _

Functional units: ld, cmp<, cmp>, sub, st

RRB = -5

Stored output so far: G O

Figure label: Epilogue

Complete Loop Code

loop:
(p16) ld r34 = [ra], 1
(p16) cmp.unc p20 = true
(p17) cmp.and p21 = (r35>96)
(p17) cmp.and p21 = (r35<123)
(p22) sub r36 = r36, 32
(p19) st [rb] = r37, 1
br.ctop loop


Chunking thru kernel

Physical register file: r35 = e, r36 = r, r34 = y, r37 = G

Functional units: ld, cmp<, cmp>, sub, st

RRB = -6

Stored output so far: G O

Figure label: Epilogue

Complete Loop Code

loop:
(p16) ld r34 = [ra], 1
(p16) cmp.unc p20 = true
(p17) cmp.and p21 = (r35>96)
(p17) cmp.and p21 = (r35<123)
(p22) sub r36 = r36, 32
(p19) st [rb] = r37, 1
br.ctop loop


Chunking thru kernel

Physical register file: r34 = y, r35 = e, r33 = h, r36 = r

Functional units: ld, cmp<, cmp>, sub, st

RRB = -7

Stored output so far: G O G

Figure label: Epilogue

Complete Loop Code

loop:
(p16) ld r34 = [ra], 1
(p16) cmp.unc p20 = true
(p17) cmp.and p21 = (r35>96)
(p17) cmp.and p21 = (r35<123)
(p22) sub r36 = r36, 32
(p19) st [rb] = r37, 1
br.ctop loop


Draining the pipe

Physical register file: r33 = h, r34 = Y, r32 = xx, r35 = E

Functional units: ld, cmp<, cmp>, sub, st

RRB = -8

Stored output so far: G O G R

Complete Loop Code

loop:
(p16) ld r34 = [ra], 1
(p16) cmp.unc p20 = true
(p17) cmp.and p21 = (r35>96)
(p17) cmp.and p21 = (r35<123)
(p22) sub r36 = r36, 32
(p19) st [rb] = r37, 1
br.ctop loop


Draining the pipe

Physical register file: r34 = xx, r35 = H, r33 = xx, r36 = Y

Functional units: ld, cmp<, cmp>, sub, st

RRB = -9

Stored output so far: G O G R E

Complete Loop Code

loop:
(p16) ld r34 = [ra], 1
(p16) cmp.unc p20 = true
(p17) cmp.and p21 = (r35>96)
(p17) cmp.and p21 = (r35<123)
(p22) sub r36 = r36, 32
(p19) st [rb] = r37, 1
br.ctop loop


Draining the pipe

Physical register file: r33 = xx, r34 = xx, r32 = xx, r35 = H

Functional units: ld, cmp<, cmp>, sub, st

RRB = -10

Stored output so far: G O G R E Y

Fall through the loop; don’t rotate

Complete Loop Code

loop:
(p16) ld r34 = [ra], 1
(p16) cmp.unc p20 = true
(p17) cmp.and p21 = (r35>96)
(p17) cmp.and p21 = (r35<123)
(p22) sub r36 = r36, 32
(p19) st [rb] = r37, 1
br.ctop loop


Example Summary

Physical register file: r33 = xx, r34 = xx, r32 = xx, r35 = H

Functional units: ld, cmp<, cmp>, sub, st

RRB = -10

Stored output: G O G R E Y H

loop:
(p16) ld r34 = [ra], 1
(p16) cmp.unc p20 = true
(p17) cmp.and p21 = (r35>96)
(p17) cmp.and p21 = (r35<123)
(p22) sub r36 = r36, 32
(p19) st [rb] = r37, 1
br.ctop loop

• 8 iterations in 12 cycles

• 2.6x speedup of initial code

• 2.75x over unrolled traditional

• No code expansion

• No mispredicts (4x, 1 10 cycle miss)

• Minimal register usage
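As a cross-check of what the walkthrough computes, here is a rough Python model of the four-stage pipelined loop (ld, compares, predicated sub, st). It is an idealized sketch under stated assumptions: each kernel pass runs stage s on iteration (pass - s), and per-iteration lists stand in for the rotating registers and predicates. The function name `pipelined_upper` is hypothetical, and the pass count it reports is a property of this model and is not meant to reproduce the slide's cycle figures.

```python
def pipelined_upper(text):
    """Uppercase ASCII lowercase letters using a 4-stage software pipeline model.

    Returns (output string, number of kernel passes in this model).
    """
    stages = 4
    n = len(text)
    value = [None] * n       # stands in for the rotating data register
    lower = [False] * n      # stands in for the qualifying predicate
    stored = []
    passes = n + stages - 1 if n else 0
    for cycle in range(passes):
        i = cycle            # stage 1: ld
        if 0 <= i < n:
            value[i] = text[i]
        i = cycle - 1        # stage 2: parallel compares (>96 and <123)
        if 0 <= i < n:
            lower[i] = 96 < ord(value[i]) < 123
        i = cycle - 2        # stage 3: sub 32, only if the predicate is set
        if 0 <= i < n and lower[i]:
            value[i] = chr(ord(value[i]) - 32)
        i = cycle - 3        # stage 4: st
        if 0 <= i < n:
            stored.append(value[i])
    return "".join(stored), passes


print(pipelined_upper("Go_Greyh"))
```

The stages inside one pass touch different iterations, so their order within the pass does not matter, which is the property the hardware exploits to issue them together.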


Software Pipelining

Itanium™ architecture features support SWP

– Full Predication

– Special branch handling features

– Register rotation: removes loop copy overhead

– Predicate rotation/generation: removes prologue & epilogue

Traditional architectures use loop unrolling

– High overhead: extra code for loop body, prologue, and epilogue

Especially Useful for Integer Code with Small Number of Loop Iterations


Compiler Bag of Tricks

Predication

– Removes branches and mispredictions

– Enables aggressive code motion

– Parallel compares increase parallelism

Speculation

– Hides memory latency

– Enables aggressive code motion

– Control speculation over branches

– Data speculation over stores

– Compiler-controlled recovery code


Compiler Bag of Tricks

Rich branch architecture

– Multi-way branches increase ILP

– Loop branches

– Static direction hints assist prediction

S/W pipelining support with minimal overhead encourages broad usage

– Performance for small integer loops with unknown trip counts as well as monster FP loops


BACKUP


8 Queens Example

if ((b[j] == true) && (a[i+j] == true) && (c[i-j+7] == true))

8 queens control flow (figure): predicates P1 to P6 label the paths to Then and Else

Parallel Compares

R1=&b[j]
R3=&a[i+j]
R5=&c[i-j+7]
p1 <- true
ld R2=[R1]
ld R4=[R3]
ld R6=[R5]
p1,p2 <- cmp.and(R2==true)
p1,p2 <- cmp.and(R4==true)
p1,p2 <- cmp.and(R6==true)
(p1) br then
else

Cycle labels in the figure: 1, 2, 4, 5

Then (P1 = true), Else (P1 = false)

Reduced from 7 cycles to 5


Five Predicate Compare Types

(qp) p1,p2 <- cmp.relation

– if(qp) {p1 = relation; p2 = !relation};

(qp) p1,p2 <- cmp.relation.unc

– p1 = qp&relation; p2 = qp&!relation;

(qp) p1,p2 <- cmp.relation.and

– if(qp & (relation==FALSE)) { p1=0; p2=0; }

(qp) p1,p2 <- cmp.relation.or

– if(qp & (relation==TRUE)) { p1=1; p2=1; }

(qp) p1,p2 <- cmp.relation.or.andcm

– if(qp & (relation==TRUE)) { p1=1; p2=0; }
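The five definitions above translate directly into a small Python model, which also shows why three `cmp.and` compares can run in parallel in the 8 queens example: each one can only clear the predicate, so their order does not matter. This is a sketch of the slide's definitions, not of any assembler or simulator; the function names `cmp_pred` and `all_true` and the string tags for the compare types are made up for illustration.

```python
def cmp_pred(kind, qp, relation, p1, p2):
    """Return the new (p1, p2) after a predicate compare of the given kind."""
    if kind == "normal":
        return (relation, not relation) if qp else (p1, p2)
    if kind == "unc":
        # Unconditional: both targets are written even when qp is false
        return (qp and relation, qp and not relation)
    if kind == "and":
        return (False, False) if (qp and not relation) else (p1, p2)
    if kind == "or":
        return (True, True) if (qp and relation) else (p1, p2)
    if kind == "or.andcm":
        return (True, False) if (qp and relation) else (p1, p2)
    raise ValueError(f"unknown compare type: {kind}")


def all_true(b, a, c):
    """8 queens style test: initialize p1 to true, then AND in each compare."""
    p1, p2 = True, True
    for value in (b, a, c):
        p1, p2 = cmp_pred("and", True, value is True, p1, p2)
    return p1


print(all_true(True, True, True), all_true(True, False, True))
```

Note the difference between the first two types in this model: with qp false, the normal compare leaves p1 and p2 untouched, while the `unc` form clears both.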


Control Speculation Summary

All loads have a speculative form that sets the NaT bit when deferring exceptions

Computational instructions propagate NaTs

OS controls deferral of faults but supported directly in HW - “no-fault speculation”

– Minimizes overhead of data that is not used

Chk more effective than non-faulting load


More complex example

Killtime loop in m88ksim

for (i=0; i<32; i++)
    comptime[i] -= MIN(comptime[i], time);

Initial Loop

loop:
      ld r5=[r10],4
      cmp p1,p2 = r5,r32
(p1)  br side
      sub r5=r5,r32
      st [addr]=r5,4
      br cloop
side:
      add t=0,r0
      st4 [addr]=t,4
      br cloop

Pipelined Loop

loop:
(p16) ld r36 = [r10],4
(p18) cmp p21,p23 = r38,r32
(p22) sub r37 = r0,0
(p24) sub r38 = r38,r32
(p20) st [r11] = r40,4
      br.ctop loop
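For reference, the source-level effect of the killtime loop can be written in a few lines of Python, together with a branch-free variant that mirrors the predicated form: both outcomes are computed and a predicate picks one. This is only a sketch of the arithmetic; the function names are hypothetical, and the direction of the compare is an assumption, since the slide does not spell out the relation.

```python
def killtime(comptime, time):
    """Direct transcription of: comptime[i] -= MIN(comptime[i], time)."""
    return [c - min(c, time) for c in comptime]


def killtime_predicated(comptime, time):
    """Same result, written as a compare producing two complementary predicates."""
    result = []
    for c in comptime:
        p_zero = c <= time            # assumed sense of the compare
        p_sub = not p_zero
        zero_value = 0                # path taken when p_zero holds
        sub_value = c - time          # path taken when p_sub holds
        result.append(zero_value if p_zero else sub_value)
        assert p_zero != p_sub        # exactly one predicate is on
    return result


print(killtime([5, 1, 3], 3), killtime_predicated([5, 1, 3], 3))
```

Because exactly one of the two predicates is true per element, the two guarded operations never conflict, which is what lets the branch to `side` disappear from the pipelined loop.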


Software Pipelining Benefits

Loop pipelining maximizes performance; minimizes overhead

– High applicability

– Minimum code size - fewer cache misses

– Reduced register usage

– Greater performance improvements in higher latency conditions

Reduced overhead allows S/W pipelining of small loops with unknown trip counts

– Good for integer scalar codes


Memory Address Modes

Register Indirect is the only address mode

– Memory address comes from a General Register

– No add in critical memory access path

Post-Increment provided for efficient address arithmetic

– Can add 9-bit signed immediate value, or a value from a general register

– Uses idle ALU resources

– Avoids extra add instructions

Benefits vector Floating Point Code


Memory Address Modes

Load Instructions

– (qp) ld{1,2,4,8} r1 = [r3]          no post-inc

– (qp) ld{1,2,4,8} r1 = [r3], imm9

– (qp) ld{1,2,4,8} r1 = [r3], r2

Store Instructions

– (qp) st{1,2,4,8} [r3] = r2          no post-inc

– (qp) st{1,2,4,8} [r3] = r2, imm9