ITANIUM PROCESSOR MICROARCHITECTURE

Harsh Sharangpani and Ken Arora
Intel

THE ITANIUM PROCESSOR EMPLOYS THE EPIC DESIGN STYLE TO EXPLOIT INSTRUCTION-LEVEL PARALLELISM. ITS HARDWARE AND SOFTWARE WORK IN CONCERT TO DELIVER HIGHER PERFORMANCE THROUGH A SIMPLER, MORE EFFICIENT DESIGN.

The Itanium processor is the first implementation of the IA-64 instruction set architecture (ISA). The design team optimized the processor to meet a wide range of requirements: high performance on Internet servers and workstations, support for 64-bit addressing, reliability for mission-critical applications, full IA-32 instruction set compatibility in hardware, and scalability across a range of operating systems and platforms.

The processor employs EPIC (explicitly parallel instruction computing) design concepts for a tighter coupling between hardware and software. In this design style the hardware-software interface lets the software exploit all available compilation time information and efficiently deliver this information to the hardware. It addresses several fundamental performance bottlenecks in modern computers, such as memory latency, memory address disambiguation, and control flow dependencies.

EPIC constructs provide powerful architectural semantics and enable the software to make global optimizations across a large scheduling scope, thereby exposing available instruction-level parallelism (ILP) to the hardware. The hardware takes advantage of this enhanced ILP, providing abundant execution resources. Additionally, it focuses on dynamic runtime optimizations to enable the compiled code schedule to flow through at high throughput. This strategy increases the synergy between hardware and software, and leads to higher overall performance.

The processor provides a six-wide and 10-stage-deep pipeline, running at 800 MHz on a 0.18-micron process. This combines abundant resources to exploit ILP with high frequency for minimizing the latency of each instruction. The resources consist of four integer units, four multimedia units, two load/store units, three branch units, two extended-precision floating-point units, and two additional single-precision floating-point units (FPUs). The hardware employs dynamic prefetch, branch prediction, nonblocking caches, and a register scoreboard to optimize for compilation time nondeterminism. Three levels of on-package cache minimize overall memory latency. This includes a 4-Mbyte level-3 (L3) cache, accessed at core speed, providing over 12 Gbytes/s of data bandwidth.

The system bus provides glueless multiprocessor support for up to four-processor systems and can be used as an effective building block for very large systems. The advanced FPU delivers over 3 Gflops of numeric capability (6 Gflops for single precision).



The balanced core and memory subsystems provide high performance for a wide range of applications, ranging from commercial workloads to high-performance technical computing.

In contrast to traditional processors, the machine's core is characterized by hardware support for the key ISA constructs that embody the EPIC design style.1,2 This includes support for speculation, predication, explicit parallelism, register stacking and rotation, branch hints, and memory hints. In this article we describe the hardware support for these novel constructs, assuming a basic level of familiarity with the IA-64 architecture (see the "IA-64 Architecture Overview" article in this issue).

EPIC hardware

The Itanium processor introduces a number of unique microarchitectural features to support the EPIC design style.2 These features focus on the following areas:

• supplying plentiful, fast, parallel, and pipelined execution resources, exposed directly to the software;

• supporting the bookkeeping and control for new EPIC constructs such as predication and speculation; and

• providing dynamic support to handle events that are unpredictable at compilation time, so that the compiled code flows through the pipeline at high throughput.

Figure 1 presents a conceptual view of the EPIC hardware. It illustrates how the various EPIC instruction set features map onto the micropipelines in the hardware.

The core of the machine is the wide execution engine, designed to provide the computational bandwidth needed by ILP-rich EPIC code that abounds in speculative and predicated operations.

The execution control is augmented with a bookkeeping structure called the advanced load address table (ALAT) to support data speculation and, with hardware, to manage the deferral of exceptions on speculative execution. The hardware control for speculation is quite simple: adding an extra bit to the data path supports deferred exception tokens. The controls for both the register scoreboard and bypass network are enhanced to accommodate predicated execution.

Operands are fed into this wide execution core from the 128-entry integer and floating-point register files. The register file addressing undergoes register remapping in support of register stacking and rotation.

Figure 1. Conceptual view of EPIC hardware. Compiler-programmed features (branch hints; register stack and rotation; data and control speculation; memory hints; predication; explicit parallelism and instruction templates) map onto hardware features (fetch, with the instruction cache and branch predictors; register handling, with 128 GRs, 128 FRs, register remap, and the stack engine; parallel resources of 4 integer and 4 MMX units, 2 load/store units, 3 branch units, a 32-entry ALAT, and 2 + 2 FMACs; fast, simple 6-issue control with bypasses and dependencies and speculation deferral management; and a memory subsystem with three levels of cache, L1 through L3). GR: general register file; FR: floating-point register file.


The register management hardware is enhanced with a control engine called the register stack engine that is responsible for saving and restoring registers that overflow or underflow the register stack.

An instruction dispersal network feeds the execution pipeline. This network uses explicit parallelism and instruction templates to efficiently issue fetched instructions onto the correct instruction ports, both eliminating complex dependency detection logic and streamlining the instruction routing network. A decoupled fetch engine exploits advanced prefetch and branch hints to ensure that the fetched instructions come from the correct path and arrive early enough to avoid cache miss penalties. Finally, memory locality hints are employed by the cache subsystem to improve the cache allocation and replacement policies, resulting in better use of the three levels of on-package cache and all associated memory bandwidth.

EPIC features allow software to more effectively communicate high-level semantic information to the hardware, thereby eliminating redundant or inefficient hardware and leading to a more effective design. Notably absent from this machine are the complex hardware structures seen in dynamically scheduled contemporary processors. Reservation stations, reorder buffers, and memory ordering buffers are all replaced by simpler hardware for speculation. Register alias tables used for register renaming are replaced with the simpler and semantically richer register-remapping hardware. Expensive register dependency-detection logic is eliminated via the explicit parallelism directives that are precomputed by the software.

Using EPIC constructs, the compiler optimizes the code schedule across a very large scope. This scope of optimization far exceeds the limited hardware window of a few hundred instructions seen on contemporary dynamically scheduled processors. The result is an EPIC machine in which the close collaboration of hardware and software enables high performance with a greater degree of overall efficiency.

Overview of the EPIC core

The engineering team designed the EPIC core of the Itanium processor to be a parallel, deep, and dynamic pipeline that enables ILP-rich compiled code to flow through at high throughput. At the highest level, three important directions characterize the core pipeline:

• wide EPIC hardware delivering a new level of parallelism (six instructions/clock),

• deep pipelining (10 stages) enabling a high frequency of operation, and

• dynamic hardware for runtime optimization and handling of compilation time indeterminacies.

New level of parallel execution

The processor provides hardware for these execution units: four integer ALUs, four multimedia ALUs, two extended-precision floating-point units, two additional single-precision floating-point units, two load/store units, and three branch units. The machine can fetch, issue, execute, and retire six instructions each clock cycle. Given the powerful semantics of the IA-64 instructions, this expands to many more operations being executed each cycle. The "Machine resources per port" sidebar enumerates the full processor execution resources.

Figure 2 illustrates two examples demonstrating the level of parallel operation supported for various workloads. For enterprise and commercial codes, the MII/MBB template combination in a bundle pair provides six instructions, or eight parallel operations, per clock (two load/store operations, two general-purpose ALU operations, two postincrement ALU operations, and two branch instructions).

Figure 2. Two examples illustrating supported parallelism. An MII/MBB bundle pair provides six instructions, or 8 parallel ops/clock, for enterprise and Internet applications (2 loads and 2 ALU ops with postincrement; 2 ALU ops; 2 branch instructions). An MFI/MFI bundle pair provides six instructions, or 12 parallel ops/clock for scientific computing and 20 parallel ops/clock for digital content creation (2 ALU ops; loads of 4 DP (8 SP) operands via 2 ldf-pair plus 2 postincrement ALU ops; 4 DP flops (8 SP flops)). SP: single precision; DP: double precision.


Alternatively, an MIB/MIB pair allows the same mix of operations, but with one branch hint and one branch operation instead of two branch operations.

For scientific code, the use of the MFI template in each bundle enables 12 parallel operations per clock (loading four double-precision operands to the registers and executing four double-precision floating-point, two integer ALU, and two postincrement ALU operations). For digital content creation codes that use single-precision floating point, the SIMD (single instruction, multiple data) features in the machine effectively enable up to 20 parallel operations per clock (loading eight single-precision operands, executing eight single-precision floating-point, two integer ALU, and two postincrementing ALU operations).

Deep pipelining (10 stages)

The cycle time and the core pipeline are balanced and optimized for the sequential execution of integer scalar codes by minimizing the latency of the most frequent operations, thus reducing dead time in the overall computation. The high frequency (800 MHz) and careful pipelining enable independent operations to flow through this pipeline at high throughput, thus also optimizing vector numeric and multimedia computation. The cycle time accommodates the following key critical paths:

• a single-cycle ALU, globally bypassed across four ALUs and two loads;

• two cycles of latency for load data returned from a dual-ported, 16-Kbyte level-1 (L1) cache; and

• scoreboard and dependency control to stall the machine on an unresolved register dependency.

The feature set was chosen to comply with the high-frequency target and the degree of pipelining; it includes aggressive branch prediction and three levels of on-package cache. The pipeline design employs robust, scalable circuit design techniques. We consciously attempted to manage interconnect lengths and pipeline away secondary paths.

Figure 3 illustrates the 10-stage core pipeline. The bold line in the middle of the core pipeline indicates a point of decoupling in the pipeline. The pipeline accommodates the decoupling buffer in the ROT (instruction rotation) stage, dedicated register-remapping hardware in the REN (register rename) stage, and pipelined access of the large register file across the WLD (word line decode) and REG (register read) stages. The DET (exception detection) stage accommodates delayed branch execution as well as memory exception management and speculation support.

Dynamic hardware for runtime optimization

While the processor relies on the compiler to optimize the code schedule based upon the deterministic latencies, the processor provides special support to dynamically optimize for several compilation time indeterminacies. These dynamic features ensure that the compiled code flows through the pipeline at high throughput. To tolerate additional latency on data cache misses, the data caches are nonblocking, a register scoreboard enforces dependencies, and the machine stalls only on encountering the use of unavailable data.

We focused on reducing sensitivity to branch and fetch latency. The machine employs hardware and software techniques beyond those used in conventional processors and provides aggressive instruction prefetch and advanced branch prediction through a hierarchy of branch prediction structures.

Figure 3. Itanium processor core pipeline. Stages: IPG, FET, ROT (front end: prefetch/fetch of six instructions/clock, hierarchy of branch predictors, decoupling buffer); EXP, REN (instruction delivery: dispersal of six instructions onto nine issue ports, register remapping, register save engine); WLD, REG (operand delivery: register file read and bypass, register scoreboard, predicated dependencies); EXE, DET, WRB (execution core: four single-cycle ALUs and two load/stores, advanced load control, predicate delivery and branch, NaT/exceptions/retirement).


A decoupling buffer allows the front end to speculatively fetch ahead, further hiding instruction cache latency and branch prediction latency.

Block diagram

Figure 4 provides the block diagram of the Itanium processor. Figure 5 provides a die plot of the silicon database. A few top-level metal layers have been stripped off to create a suitable view.

Details of the core pipeline

The following describes details of the core processor microarchitecture.

Decoupled software-directed front end

Given the high execution rate of the processor (six instructions per clock), an aggressive front end is needed to keep the machine effectively fed, especially in the presence of disruptions due to branches and cache misses. The machine's front end is decoupled from the back end.

Acting in conjunction with sophisticated branch prediction and correction hardware, the machine speculatively fetches instructions from a moderate-size, pipelined instruction cache into a decoupling buffer. A hierarchy of branch predictors, aided by branch hints, provides up to four progressively improving instruction pointer resteers. Software-initiated prefetch probes for future misses in the instruction cache and then prefetches such target code from the level-2 (L2) cache into a streaming buffer and eventually into the instruction cache.

Figure 4. Itanium processor block diagram. Major blocks: the L1 instruction cache and fetch/prefetch engine with ITLB; branch prediction; an eight-bundle decoupling buffer; IA-32 decode and control; the register stack engine/remapping; 128 integer and 128 floating-point registers; branch and predicate registers; branch units; integer and MM units; the dual-port L1 data cache; the ALAT; the floating-point (SIMD FMAC) units; the scoreboard, predicate NaTs, and exceptions; the 96-Kbyte L2 cache; the 4-Mbyte L3 cache; and the bus controller.


Figure 6 illustrates the front-end microarchitecture.

Speculative fetches. The 16-Kbyte, four-way set-associative instruction cache is fully pipelined and can deliver 32 bytes of code (two instruction bundles, or six instructions) every clock. The cache is supported by a single-cycle, 64-entry instruction translation look-aside buffer (TLB) that is fully associative and backed up by an on-chip hardware page walker.

The fetched code is fed into a decoupling buffer that can hold eight bundles of code. As a result of this buffer, the machine's front end can continue to fetch instructions into the buffer even when the back end stalls. Conversely, the buffer can continue to feed the back end even when the front end is disrupted by fetch bubbles due to branches or instruction cache misses.
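The decoupling behavior can be pictured as a ring buffer between the two pipeline halves. The following C sketch is illustrative only: the type and function names are invented here, and partial-bundle consumption and pipeline flushes are ignored.

    #include <stdbool.h>
    #include <stdint.h>

    #define BUNDLES 8                      /* the buffer holds eight bundles */

    typedef struct { uint64_t slot[3]; } Bundle;   /* three syllables, idealized */

    typedef struct {
        Bundle buf[BUNDLES];
        int head, tail, count;             /* front end pushes; back end pops */
    } DecouplingBuffer;

    /* Front end: keeps fetching into the buffer even when the back end stalls. */
    static bool push_bundle(DecouplingBuffer *d, Bundle b) {
        if (d->count == BUNDLES) return false;     /* full: fetch must wait */
        d->buf[d->tail] = b;
        d->tail = (d->tail + 1) % BUNDLES;
        d->count++;
        return true;
    }

    /* Back end: keeps issuing from the buffer across front-end fetch bubbles. */
    static bool pop_bundle(DecouplingBuffer *d, Bundle *out) {
        if (d->count == 0) return false;           /* empty: back end starves */
        *out = d->buf[d->head];
        d->head = (d->head + 1) % BUNDLES;
        d->count--;
        return true;
    }

Because the push and pop sides are independent, either half of the machine can make progress while the other is held up, which is exactly the decoupling described above.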

Hierarchy of branch predictors. The processor employs a hierarchy of branch prediction structures to deliver high-accuracy and low-penalty predictions across a wide spectrum of workloads. Note that if a branch misprediction led to a full pipeline flush, there would be nine cycles of pipeline bubbles before the pipeline is full again. This would mean a heavy performance loss. Hence, we've placed significant emphasis on boosting the overall branch prediction rate as well as reducing the branch prediction and correction latency.

The branch prediction hardware is assisted by branch hint directives provided by the compiler (in the form of explicit branch predict, or BRP, instructions, as well as hint specifiers on branch instructions). The directives provide branch target addresses, static hints on branch direction, and indications of when to use dynamic prediction. These directives are programmed into the branch prediction structures and used in conjunction with dynamic prediction schemes.

Figure 5. Die plot of the silicon database, showing the core processor die and the 4 × 1-Mbyte L3 cache.

Figure 6. Processor front end (stages IPG through EXP). The instruction cache, ITLB, and streaming buffers feed the decoupling buffer; the IP multiplexer is steered by the target address registers, the adaptive multiway two-level predictor, the branch target address cache, the return stack buffer, the loop exit corrector, and branch address calculations 1 and 2.


The machine provides up to four progressive predictions and corrections to the fetch pointer, greatly reducing the likelihood of a full-pipeline flush due to a mispredicted branch.

• Resteer 1: Single-cycle predictor. A special set of four branch prediction registers (called target address registers, or TARs) provides single-cycle turnaround on certain branches (for example, loop branches in numeric code), operating under tight compiler control. The compiler programs these registers using BRP hints, distinguishing these hints with a special "importance" bit designator indicating that these directives must be allocated into this small structure. When the instruction pointer of the candidate branch hits in these registers, the branch is predicted taken, and these registers provide the target address for the resteer. On such taken branches, no bubbles appear in the execution schedule due to branching.

• Resteer 2: Adaptive multiway and return predictors. For scalar codes, the processor employs a dynamic, adaptive, two-level prediction scheme3,4 to achieve well over 90% prediction rates on branch direction. The branch prediction table (BPT) contains 512 entries (128 sets × 4 ways). Each entry, selected by the branch address, tracks the four most recent occurrences of that branch. This 4-bit value then indexes into one of 128 pattern tables (one per set). The 16 entries in each pattern table use a 2-bit, saturating, up-down counter to predict branch direction (a minimal sketch of this two-level scheme appears after this list).

The branch prediction table structure is additionally enhanced for multiway branches with a 64-entry multiway branch prediction table (MBPT) that employs a similar algorithm but keeps three history registers per bundle entry. A find-first-taken selection provides the first taken-branch indication for the multiway branch bundle. Multiway branches are expected to be common in EPIC code, where multiple basic blocks are expected to collapse after use of speculation and predication.

Target addresses for this branch resteer are provided by a 64-entry target address cache (TAC).

This structure is updated by branch hints (using BRP and move-BR hint instructions) and is also managed dynamically. Having the compiler program the structure with the upcoming footprint of the program is an advantage and enables a small, 64-entry structure to be effective even on large commercial workloads, saving die area and implementation complexity. The BPT and MBPT cause a front-end resteer only if the target address for the resteer is present in the target address cache. In the case of misses in the BPT and MBPT, a hit in the target address cache also provides a branch direction prediction of taken.

A return stack buffer (RSB) provides predictions for return instructions. This buffer contains eight entries and stores return addresses along with corresponding register stack frame information.

• Resteers 3 and 4: Branch address calculation and correction. Once branch instruction opcodes are available (ROT stage), it's possible to apply a correction to predictions made earlier. The BAC1 stage applies a correction for the exit condition on modulo-scheduled loops through a special "perfect-loop-exit-predictor" structure that keeps track of the loop count extracted during the loop initialization code. Thus, loop exits should never see a branch misprediction in the back end. Additionally, in case of misses in the earlier prediction structures, BAC1 extracts static prediction information and addresses from branch instructions in the rightmost slot of a bundle and uses these to provide a correction. Since most templates will place a branch in the rightmost slot, BAC1 should handle most branches. BAC2 applies a more general correction for branches located in any slot.
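The sketch promised under Resteer 2: a minimal C model of the two-level adaptive direction predictor, using the sizes given above (128 sets × 4 ways, a 4-bit per-branch history, and 16-entry pattern tables of 2-bit saturating counters). Set and way selection from the branch address, tag matching, and the multiway (MBPT) extension are all abstracted away, so this illustrates the algorithm rather than the hardware.

    #include <stdbool.h>
    #include <stdint.h>

    #define SETS 128
    #define WAYS 4

    static uint8_t history[SETS][WAYS];   /* 4 most recent outcomes per branch */
    static uint8_t pattern[SETS][16];     /* one table of 2-bit counters per set */

    static bool predict(int set, int way) {
        uint8_t h = history[set][way] & 0xF;   /* 4-bit local history */
        return pattern[set][h] >= 2;           /* counter MSB set: predict taken */
    }

    static void update(int set, int way, bool taken) {
        uint8_t h = history[set][way] & 0xF;
        uint8_t *c = &pattern[set][h];
        if (taken  && *c < 3) (*c)++;          /* saturating up-down counter */
        if (!taken && *c > 0) (*c)--;
        history[set][way] = (uint8_t)(((h << 1) | (taken ? 1 : 0)) & 0xF);
    }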

Software-initiated prefetch. Another key element of the front end is its software-initiated instruction prefetch. Prefetch is triggered by prefetch hints (encoded in BRP instructions as well as in actual branch instructions) as they pass through the ROT stage. Instructions are prefetched from the L2 cache into an instruction-streaming buffer (ISB) containing eight 32-byte entries.



Support exists to prefetch either a short burst of 64 bytes of code (typically, a basic block residing in up to four bundles) or a long sequential instruction stream. Short-burst prefetch is initiated by a BRP instruction hoisted well above the actual branch. For longer code streams, the sequential streaming ("many") hint from the branch instruction triggers a continuous stream of additional prefetch requests until a taken branch is encountered. The instruction cache filters prefetch requests. The cache tags and the TLB have been enhanced with an additional port to check whether an address will lead to a miss. Such requests are sent to the L2 cache.

The compiler can improve overall fetch performance by aggressive issue and hoisting of BRP instructions, and by issuing sequential prefetch hints on the branch instruction when branching to long sequential codes. To fully hide the latency of returns from the L2 cache, BRP instructions that initiate prefetch should be hoisted 12 fetch cycles ahead of the branch; hoisting by five cycles breaks even with no prefetch at all, and every cycle of hoisting beyond five has the potential of shaving one fetch bubble. Although this kind of hoisting of BRP instructions is a tall order, it does provide a mechanism for the compiler to eliminate instruction fetch bubbles.

Efficient instruction and operand delivery

After instructions are fetched in the front end, they move into the middle pipeline, which disperses instructions, implements the architectural renaming of registers, and delivers operands to the wide parallel hardware. The hardware resources in the back end of the machine are organized around nine issue ports. The instruction and operand delivery hardware maps the six incoming instructions onto the nine issue ports and remaps the virtual register identifiers specified in the source code onto the physical registers used to access the register file. It then provides the source data to the execution core. The dispersal and renaming hardware exploits high-level semantic information provided by the IA-64 software, efficiently enabling greater ILP and reduced instruction path length.

Explicit parallelism directives. The instruction dispersal mechanism disperses instructions presented by the decoupling buffer to the processor's issue ports.

The processor has a total of nine issue ports, capable of issuing up to two memory instructions (ports M0 and M1), two integer instructions (ports I0 and I1), two floating-point instructions (ports F0 and F1), and three branch instructions (ports B0, B1, and B2) per clock. The processor's 17 execution units are fed through the M, I, F, and B groups of issue ports.


Machine resources per port

Tables A-C describe the vocabulary of operations supported on the different issue ports in the Itanium processor. The issue ports feed the memory (M), integer (I), floating-point (F), and branch (B) execution data paths.

Table A. Memory and integer execution resources.

    Instruction class                                      Issue ports      Latency (clocks)
    ALU (add, shift-add, logical, addp4, compare)          M0, M1, I0, I1   1
    Sign/zero extend, move long                            I0, I1           1
    Fixed extract/deposit, Tbit, TNaT                      I0               1
    Multimedia ALU (add/avg/etc.)                          M0, M1, I0, I1   2
    MM shift, avg, mix, pack                               I0, I1           2
    Move to/from branch/predicates/ARs,
      packed multiply, pop count                           I0               2
    Load/store/prefetch/setf/break.m/
      cache control/memory fence                           M0, M1           2+
    Memory management/system/getf                          M0               2+

Table B. Floating-point execution resources.

    Instruction class                                      Issue ports      Latency (clocks)
    FMAC, SIMD FMAC                                        F0, F1           5
    Fixed multiply                                         F0, F1           7
    Fclrf                                                  F0, F1           1
    Fchkf                                                  F0, F1           1
    Fcompare                                               F0               2
    Floating-point logicals, class, min/max,
      pack, select                                         F1               5

Table C. Branch execution resources.

    Instruction class                                      Issue ports
    Conditional or unconditional branch                    B0, B1, B2
    Call/return/indirect                                   B0, B1, B2
    Loop-type branch, BSW, cover                           B2
    RFI                                                    B2
    BRP (branch hint)                                      B0, B1, B2



The decoupling buffer feeds the dispersal in a bundle-granular fashion (up to two bundles, or six instructions, per cycle), with a fresh bundle being presented each time one is consumed. Dispersal from the two bundles is instruction granular: the processor disperses as many instructions as can be issued (up to six), in left-to-right order. The dispersal algorithm is fast and simple, with instructions being dispersed to the first available issue port, subject to two constraints: detection of instruction independence and detection of resource oversubscription.

• Independence. The processor must ensure that all instructions issued in parallel are either independent or contain only allowed dependencies (such as a compare instruction feeding a dependent conditional branch). This question is easily dealt with by using the stop-bits feature of the IA-64 ISA to explicitly communicate parallel instruction semantics. Instructions between consecutive stop bits are deemed independent, so the instruction independence detection hardware is trivial. This contrasts with traditional RISC processors, which must perform O(n²) comparisons (typically dozens) between source and destination register specifiers to determine independence.

• Oversubscription. The processor must also guarantee that there are sufficient execution resources to process all the instructions that will be issued in parallel. This oversubscription problem is eased by the IA-64 ISA feature of instruction bundle templates. Each instruction bundle not only specifies three instructions but also contains a 4-bit template field indicating the type of each instruction: memory (M), integer (I), branch (B), and so on. By examining the template fields from the two bundles (a total of only 8 bits), the dispersal logic can quickly determine the number of memory, integer, floating-point, and branch instructions incoming every clock. This is a hardware simplification resulting from the IA-64 instruction set architecture. Unlike conventional instruction set architectures, the instruction encoding itself doesn't need to be examined to determine the type of each operation. This feature removes decoders that would otherwise be required to examine many bits of the encoded instruction to determine the instruction's type and associated issue port.

A second key advantage of the template-based dispersal strategy is that certain instruction types can occur only in specific locations within a bundle. As a result, the dispersal interconnection network can be significantly optimized; the routing required from dispersal to issue ports is roughly only half of that required for a fully connected crossbar.

Table 1 illustrates the effectiveness of the dispersal strategy by enumerating the instruction bundles that may be issued at full bandwidth. As can be seen, a rich mix of instructions can be issued to the machine at high throughput (six per clock). The combination of stop bits and bundle templates, as specified in the IA-64 instruction set, allows the compiler to indicate the independence and instruction-type information directly and effectively to the dispersal hardware. As a result, the hardware is greatly simplified, thereby allowing an efficient implementation of instruction dispersal to a wide execution core.
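In C, the dispersal constraints reduce to a short left-to-right loop. This sketch keeps only the type-pool bookkeeping; the real hardware additionally honors the per-port placement rules of Tables A-C and the template-position restrictions, so treat the simple port pools here as an approximation.

    typedef enum { M, I, F, B } Syl;      /* syllable type, read from the template */

    /* Free issue ports per clock: 2 memory, 2 integer, 2 floating-point, 3 branch. */
    static int disperse(const Syl *syl, int n) {
        int freeport[4] = { 2, 2, 2, 3 }; /* indexed by Syl: M, I, F, B */
        int issued = 0;
        for (int k = 0; k < n && k < 6; k++) {
            /* Stop bits already delimited this issue group, so the instructions
               are known independent; only oversubscription must be checked. */
            if (freeport[syl[k]] == 0) break;   /* oversubscribed: group splits */
            freeport[syl[k]]--;                 /* claim first available port */
            issued++;
        }
        return issued;                          /* the rest issue next clock */
    }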

Efficient register remapping. After dispersal, the next step in preparing incoming instructions for execution involves implementing the register stacking and rotation functions.


Table 1. Instruction bundles capable of full-bandwidth dispersal.

    First bundle*   Second bundle
    MIH             MLI, MFI, MIB, MBB, or MFB
    MFI or MLI      MLI, MFI, MIB, MBB, BBB, or MFB
    MII             MBB, BBB, or MFB
    MMI             BBB
    MFH             MII, MLI, MFI, MIB, MBB, or MFB

    * B slots support branches and branch hints; H designates a branch hint
      operation in the B slot.


Register stacking is an IA-64 technique that significantly reduces function call and return overhead. It ensures that all procedural input and output parameters are in specific register locations, without requiring the compiler to perform register-register or memory-register moves. On procedure calls, a fresh register frame is simply stacked on top of the existing frames in the large register file, without the need for an explicit save of the caller's registers. This enables low-overhead procedure calls, providing a significant performance benefit on codes that are heavy in calls and returns, such as those written in object-oriented languages.

Register rotation is an IA-64 technique that allows very low overhead software-pipelined loops. It broadens the applicability of compiler-driven software pipelining to a wide variety of integer codes. Rotation provides a form of register renaming that allows every iteration of a software-pipelined loop to have a fresh copy of the loop variables. This is accomplished by accessing the registers through an indirection based on the iteration count.

Both stacking and rotation require the hardware to remap the register names. This remapping translates the incoming virtual register specifiers onto outgoing physical register specifiers, which are then used to perform the actual lookup of the various register files. Stacking can be thought of as simply adding an offset to the virtual register specifier. In a similar fashion, rotation can be viewed as an offset-modulo add. The remapping function supports both stacking and rotation for the integer register specifiers, but only register rotation for the floating-point and predicate register specifiers.

The Itanium processor efficiently supports the register remapping for both register stacking and rotation with a set of adders and multiplexers contained in the pipeline's REN stage. The stacking logic requires only one 7-bit adder for each specifier, and the rotation logic requires either one (predicate or floating-point) or two (integer) additional 7-bit adders. The extra adder on the integer side is needed due to the interaction of stacking with rotation. Therefore, for full six-syllable execution, a total of ninety-eight 7-bit adders and 42 multiplexers implement the combination of integer, floating-point, and predicate remapping for all incoming source and destination registers. The total area taken by this function is less than 0.25 square millimeters.
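The remapping arithmetic itself is small enough to show. In this C sketch, rotation is the offset-modulo add and stacking is the offset add described above; the parameter names (frame base, rename base, rotating-region size) and the wraparound of the stacked file are simplifications, not the exact IA-64 frame-marker encoding.

    #include <stdint.h>

    /* Map a virtual integer register specifier (0..127) to a physical one.
     * r0-r31 are static; r32 and up form the stacked/rotating region. */
    static uint8_t remap_gr(uint8_t vreg,
                            uint8_t frame_base,   /* stacking offset (assumed) */
                            uint8_t rrb,          /* register rename base */
                            uint8_t rot_size)     /* size of rotating region */
    {
        if (vreg < 32) return vreg;               /* static registers pass through */
        uint8_t r = (uint8_t)(vreg - 32);
        if (r < rot_size)
            r = (uint8_t)((r + rrb) % rot_size);  /* rotation: offset-modulo add */
        /* stacking: offset add, wrapping within the stacked physical file */
        return (uint8_t)(32 + (r + frame_base) % 96);
    }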

The register-stacking model also requires special handling when software allocates more virtual registers than are currently physically available in the register file. A special state machine, the register stack engine (RSE), handles this case, termed stack overflow. This engine observes all stacked register allocation and deallocation requests. When an overflow is detected on a procedure call, the engine silently takes control of the pipeline, spilling registers to a backing store in memory until sufficient physical registers are available. In a similar manner, the engine handles the converse situation, termed stack underflow, when registers need to be restored from the backing store in memory. While these registers are being spilled or filled, the engine simply stalls instructions waiting on the registers; no pipeline flushes are needed to implement the register spill/restore operations.

Register stacking and rotation combine to provide significant performance benefits for a variety of applications, at the modest cost of a number of small adders, an additional pipeline stage, and the control logic for a programmer-invisible register stack engine.

Large, multiported register files. The processor provides an abundance of registers and execution resources. The 128-entry integer register file supports eight read ports and six write ports. Note that four ALU operations require eight read ports and four write ports from the register file, while pending load-data returns need two additional write ports (two returns per cycle). The read and write ports can adequately support two memory and two integer instructions every clock. The IA-64 instruction set includes a feature known as postincrement, whereby the address register of a memory operation can be incremented as a side effect of the operation. This is supported by simply using two of the four ALU write ports. (These two ALUs and write ports would otherwise be idle when memory operations issue off their ports.)

The floating-point register file also consists of 128 registers, supports double extended-precision arithmetic, and can sustain two memory ports in parallel with two multiply-accumulate units.



This combination of resources requires eight read and four write ports. The register write ports are separated into even and odd banks, allowing each memory return to update a pair of floating-point registers.

The other large register file is the predicate register file. This register file has several unique characteristics: each entry is 1 bit; it has many read and write ports (15 reads, 11 writes); and it supports a "broadside" read or write of the entire register file. As a result, it has a distinct implementation, as described in the "Implementing predication elegantly" section.

High ILP execution core

The execution core is the heart of the EPIC implementation. It supports data-speculative and control-speculative execution, as well as predicated execution and the traditional functions of hazard detection and branch execution. Furthermore, the processor's execution core provides these capabilities in the context of the wide execution width and powerful instruction semantics that characterize the EPIC design philosophy.

Stall-based scoreboard control strategy. As mentioned earlier, the frequency target of the Itanium processor was governed by several key timing paths, such as the ALU plus bypass and the two-cycle data cache. All of the control paths within the core pipeline fit within the given cycle time; detecting and dealing with data hazards was one such key control path.

To achieve high performance, we adopted a nonblocking cache with a scoreboard-based, stall-on-use strategy. This is particularly valuable in the context of speculation, in which certain load operations may be aggressively boosted to avoid cache miss latencies and the resulting data may potentially not be consumed. For such cases, it is key that 1) the pipeline not be interrupted because of a cache miss, and 2) the pipeline be interrupted only if and when the unavailable data is needed.

Thus, the strategy for dealing with detected data hazards is based on stalls: the pipeline stalls only when unavailable data is needed, and stalls only as long as the data is unavailable. This strategy allows the entire processor pipeline to remain filled, and the in-flight dependent instructions are immediately ready to continue as soon as the required data is available. This contrasts with other high-frequency designs, which are based on flushing and require that the pipeline be emptied when a hazard is detected, resulting in reduced performance. On the Itanium processor, innovative techniques reap the performance benefits of a stall-based strategy and yet enable high-frequency operation on this wide machine.

The scoreboard control is also enhanced to support predication. Since most operations within the IA-64 instruction set architecture can be predicated, either the producer or the consumer of a given piece of data may be nullified by having a false predicate. Figure 7 illustrates an example of such a case. Note that if either the producer or the consumer operation is nullified via predication, there is no hazard. The processor scoreboard therefore considers both the producer and consumer predicates, in addition to the normal operand availability, when evaluating whether a hazard exists. This hazard evaluation occurs in the REG (register read) pipeline stage.

Given the high frequency of the processor pipeline, there is not sufficient time to both compute the existence of a hazard and effect a global pipeline stall in a single clock cycle. Hence, we use a unique deferred-stall strategy. This approach allows any dependent consumer instructions to proceed from the REG into the EXE (execute) pipeline stage, where they are then stalled; hence the term deferred stall.

However, the instructions in the EXE stage no longer have read-port access to the register file to obtain new operand data. Therefore, to ensure that the instructions in the EXE stage procure the correct data, the latches at the start of the EXE stage (which contain the source data values) continuously snoop all returning data values, intercepting any data that the instruction requires.


    Clock 1:  cmp.eq r1, r2 → p1, p3      // compute predicates p1, p3
              cmp.eq r3, r4 → p2, p4 ;;   // compute predicates p2, p4

    Clock 2:  (p1) ld4 [r3] → r4 ;;       // load nullified if p1 = false
                                          //   (producer nullification)

    Clock N:  (p4) add r4, r1 → r5        // add nullified if p4 = false
                                          //   (consumer nullification)

    Note that a hazard exists only if (a) p1 = p4 = true, and (b) the r4
    result is not available when the add collects its source data.

Figure 7. Predicated producer-consumer dependencies.
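The rule in Figure 7 reduces to a three-term test per source operand, evaluated in the REG stage; a minimal C rendering:

    #include <stdbool.h>

    /* Stall only when the producer really writes (its predicate is true), the
     * consumer really reads (its predicate is true), and the scoreboard says
     * the data has not yet returned. */
    static bool hazard(bool producer_pred, bool consumer_pred, bool data_ready) {
        return producer_pred && consumer_pred && !data_ready;
    }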


The logic used to perform this data interception is identical to the register bypass network used to collect operands for instructions in the REG stage. Noting that instructions observing a deferred stall in the REG stage don't require the use of the bypass network, the EXE-stage instructions can usurp the bypass network for the deferred stall. By reusing the existing register bypass hardware, the deferred-stall strategy is implemented in an area-efficient manner. This allows the processor to combine the benefits of high frequency with stall-based pipeline control, thereby precluding the penalty of pipeline flushes due to replays on register hazards.
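A C sketch of the EXE-stage interception: during a deferred stall, each source latch compares its register ID against the IDs of results returning on the writeback paths, exactly as the REG-stage bypass comparators would. The structure names are illustrative.

    #include <stdbool.h>
    #include <stdint.h>

    typedef struct { uint8_t src_id; uint64_t data; bool ready; } SrcLatch;

    /* Snoop all values returning this cycle; capture the one being waited on. */
    static void snoop(SrcLatch *s, const uint8_t *ret_id,
                      const uint64_t *ret_data, int nreturns) {
        for (int i = 0; i < nreturns; i++) {
            if (!s->ready && ret_id[i] == s->src_id) {
                s->data  = ret_data[i];   /* intercept the returning value */
                s->ready = true;          /* this source no longer blocks */
            }
        }
    }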

Execution resources. The processor provides an abundance of execution resources to exploit ILP. The integer execution core includes two memory and two integer ports, with all four ports capable of executing arithmetic, shift-and-add, logical, compare, and most integer SIMD multimedia operations. The memory ports can also perform load and store operations, including loads and stores with postincrement functionality. The integer ports add the ability to perform the less common integer instructions, such as test bit, look for zero byte, and variable shift. Additional uncommon instructions are implemented on only the first integer port.

See the earlier sidebar for a full enumeration of the per-port capabilities and associated instruction latencies on the processor. In general, we designed the method used to map instructions onto each port to maximize overall performance, balancing instruction frequency against the area and timing impact of additional execution resources.

Implementing predication elegantly. Predication is another key feature of the IA-64 architecture, allowing higher performance by eliminating branches and their associated misprediction penalties.5 However, predication affects several key aspects of the pipeline design. Predication turns a control dependency (branching on the condition) into a data dependency (execution and forwarding of data dependent upon the value of the predicate). If spurious stalls and pipeline disruptions get introduced during predicated execution, the benefit of branch misprediction elimination will be squandered. Care was taken to ensure that predicates are implemented transparently in the pipeline.

The basic strategy for predicated execution is to allow all instructions to read the register file and be issued to the hardware regardless of their predicate value. Predicates are used to configure the data-forwarding network, detect the presence of hazards, control pipeline advances, and conditionally nullify the execution and retirement of issued operations. Predicates also feed the branching hardware. The predicate register file is a highly multiported structure, accessed in parallel with the general registers in the REG stage. Since predicates themselves are generated in the execution core (from compare instructions, for example) and may be in flight when they're needed, they must be forwarded quickly to the specific hardware that consumes them.


IA-32 compatibility

Another key feature of the Itanium processor is its full support of the IA-32 instruction set in hardware (see Figure A). This includes support for running a mix of IA-32 applications and IA-64 applications on an IA-64 operating system, as well as IA-32 applications on an IA-32 operating system, in both uniprocessor and multiprocessor configurations. The IA-32 engine makes use of the EPIC machine's registers, caches, and execution resources. To deliver high performance on legacy binaries, the IA-32 engine dynamically schedules instructions.1,2 The IA-64 Seamless Architecture is defined to enable running IA-32 system functions in native IA-64 mode, thus delivering native performance levels for the system functionality.

References
1. R. Colwell and R. Steck, "A 0.6µm BiCMOS Microprocessor with Dynamic Execution," Proc. Int'l Solid-State Circuits Conf., IEEE Press, Piscataway, N.J., 1995, pp. 176-177.
2. D. Papworth, "Tuning the Pentium Pro Microarchitecture," IEEE Micro, Mar./Apr. 1996, pp. 8-15.

Figure A. IA-32 compatibility microarchitecture. Blocks: IA-32 instruction fetch and decode; IA-32 dynamic scheduler; IA-32 retirement and exceptions; shared instruction cache and TLB; shared IA-64 execution core.



Note that predication affects the hazard detection logic by nullifying either data producer or consumer instructions. Consumer nullification is performed after reading the predicate register file (PRF) for the predicate sources of the six instructions in the REG pipeline stage. Producer nullification is performed after reading the predicate register file for the predicate sources of the six instructions in the EXE stage.

Finally, three conditional branches can be executed in the DET pipeline stage; this requires reading three additional predicate sources. Thus, a total of 15 read ports are needed to access the predicate register file. From a write-port perspective, 11 predicates can be written every clock: eight from four parallel integer compares, two from a floating-point compare, and one via the stage-predicate write feature of loop branches. These read and write ports are in addition to a broadside read and write capability that allows a single instruction to read or write the entire 64-entry predicate register file into or from a single 64-bit integer register. The predicate register file is implemented as a single 64-bit latch, with 15 simple 64:1 multiplexers serving as the read ports. Similarly, the 11 write ports are efficiently implemented, each being a 6:64 decoder, with an AND-OR structure used to update the actual predicate register file latch. Broadside reads and writes are easily implemented by reading or writing the contents of the entire 64-bit latch.
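Because every entry is one bit, the whole predicate file fits in a single 64-bit word, and the ports reduce to bit selects and bit updates. A C sketch of the latch-based implementation:

    #include <stdbool.h>
    #include <stdint.h>

    static uint64_t prf;                /* the 64-entry file as one 64-bit latch */

    /* A read port is a 64:1 multiplexer selecting one bit. */
    static bool prf_read(int preg)          { return (prf >> preg) & 1; }

    /* A write port decodes the register number and AND-ORs the bit in. */
    static void prf_write(int preg, bool v) {
        prf = v ? (prf | (1ULL << preg)) : (prf & ~(1ULL << preg));
    }

    /* Broadside access moves the whole file to or from an integer register. */
    static uint64_t prf_read_all(void)        { return prf; }
    static void     prf_write_all(uint64_t v) { prf = v; }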

In-flight predicates must be forwarded quickly after generation to the point of consumption.

The costly bypass logic that would have been needed for this is eliminated by taking advantage of the fact that all predicate-writing instructions have deterministic latency. Instead, a speculative predicate register file (SPRF) is used and updated as soon as predicate data is computed. The source predicate of any dependent instruction is then read directly from this register file, obviating the need for bypass logic. A separate architectural predicate register file (APRF) is updated only when a predicate-writing instruction retires and is only then allowed to update the architectural state.

In case of an exception or pipeline flush, the SPRF is copied from the APRF in the shadow of the flush latency, undoing the effect of any misspeculative predicate writes. The combination of the latch-based implementation and the two-file strategy allows an area-efficient and timing-efficient implementation of the highly ported predicate registers.

Figure 8 shows one of the six EXE-stage predicates that allow or nullify data forwarding in the data-forwarding network. The other five predicates are handled identically. Predication control of the bypass network is implemented very efficiently by ANDing the predicate value with the destination-valid signal present in conventional bypass logic networks. Instructions with false predicates are treated as simply not writing to their destination register. Thus, the impact of predication on the operand-forwarding network is fairly minimal.
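In C, the predicated bypass control of Figure 8 is a single added AND term; the signal names are illustrative.

    #include <stdbool.h>
    #include <stdint.h>

    /* A bypass fires only when the register IDs match, the forwarding
     * instruction has a valid destination, and its predicate is true; a false
     * predicate makes the instruction simply "not write" its result. */
    static bool bypass_select(uint8_t src_id, uint8_t fwd_dst_id,
                              bool fwd_dst_valid, bool fwd_pred) {
        return (src_id == fwd_dst_id) && fwd_dst_valid && fwd_pred;
    }

    static uint64_t source_mux(bool sel, uint64_t fwd_data, uint64_t rf_data) {
        return sel ? fwd_data : rf_data;  /* forwarded data wins when selected */
    }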

Optimized speculation support in hardware. With minimal hardware impact, the Itanium processor enables software to hide the latency of load instructions and their dependent uses by boosting them out of their home basic block. This is termed speculation. To perform effective speculation, two key issues must be addressed. First, any exceptions that are detected must be deferrable until the operation's home basic block is encountered; this is termed control speculation. Second, all stores between the boosted load and its home location must be checked for address overlap; if there is an overlap, the latest store should forward the correct data. This is termed data speculation.

Figure 8. Predicated bypass control. The forwarded instruction's predicate is ANDed with the comparison of the source RegID against the forwarded destination RegID; the result steers the source bypass multiplexer to select either bypass-forwarded data or register file data as the "true" source data.


The Itanium processor provides effective support for both forms of speculation.

In the case of control speculation, normal exception checks are performed for a control-speculative load instruction. In the common case, no exception is encountered, and therefore no special handling is required. On a detected exception, the hardware examines the exception type, software-managed architectural control registers, and page attributes to determine whether the exception should be handled immediately (such as for a TLB miss) or deferred for future handling.

For a deferral, a special deferred exception token called the NaT (Not a Thing) bit is retained for each integer register, and a special floating-point value, called NaTVal and encoded in the NaN space, is set for floating-point registers. This token indicates that a deferred exception was detected. The deferred exception token is then propagated into result registers when any of the source registers indicates such a token. The exception is reported when either a speculation check or a nonspeculative use (such as a store instruction) consumes a register that is flagged with the deferred exception token. In this way, NaT generation leverages traditional exception logic simply, and NaT propagation uses straightforward data path logic.

The existence of NaT bits and NaTVals also affects the register spill-and-fill logic. For explicit software-driven register spills and fills, special move instructions (store.spill and load.fill) are supported that don't take exceptions when encountering NaT'ed data. For floating-point data, the entire data is simply moved to and from memory. For integer data, the extra NaT bit is written into a special register (called UNaT, or user NaT) on spills and is read back on the load.fill instruction. The UNaT register can also be written to memory if more than 64 registers need to be spilled. In the case of implicit spills and fills generated by the register save engine, the engine collects the NaT bits into another special register (called RNaT, or register NaT), which is then spilled (or filled) once for every 64 register save engine stores (or loads).
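A C sketch of the NaT mechanics described in the last two paragraphs: each integer register carries an extra bit, propagation ORs the source tokens into the result, and an integer spill parks the bit in UNaT. Selecting the UNaT bit from address bits 8:3 follows the IA-64 spill definition; the memory model here is trivialized.

    #include <stdbool.h>
    #include <stdint.h>

    typedef struct { uint64_t val; bool nat; } GR;  /* 64 data bits + NaT bit */

    static uint64_t unat;                   /* user NaT collection register */

    /* Deferred-exception propagation: any NaT'ed source NaTs the result. */
    static GR add_gr(GR a, GR b) {
        GR r = { a.val + b.val, a.nat || b.nat };
        return r;
    }

    /* st8.spill: data bits go to memory; the NaT bit goes to UNaT{addr 8:3}. */
    static void st8_spill(uint64_t *mem, uint64_t addr, GR r) {
        *mem = r.val;
        unsigned bit = (addr >> 3) & 63;
        unat = (unat & ~(1ULL << bit)) | ((uint64_t)r.nat << bit);
    }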

For data speculation, the software issues an advanced load instruction.

When the hardware encounters an advanced load, it places the address, size, and destination register of the load into the ALAT structure. The ALAT then observes all subsequent explicit store instructions, checking for overlaps with the valid advanced-load addresses present in the ALAT. In the common case, there is no match, the ALAT state is unchanged, and the advanced load result is used normally. In the case of an overlap, all address-matching advanced loads in the ALAT are invalidated.

After the last undisambiguated store prior to the load's home basic block, an instruction can query the ALAT and find that the advanced load was matched by an intervening store address. In this situation recovery is needed. When only the load and no dependent instructions were boosted, a load-check (ld.c) instruction is used, and the load instruction is reissued down the pipeline, this time retrieving the updated memory data. As an important performance feature, the ld.c instruction can be issued in parallel with instructions dependent on the load result data. By allowing this optimization, the critical load uses can be issued immediately, allowing the ld.c to be effectively a zero-cycle operation. When the advanced load and its dependent uses were boosted, an advanced check-load (chk.a) instruction traps to a user-specified handler for special fix-up code that reissues the load instruction and the operations dependent on the load. Thus, support for data speculation was added to the pipeline in a straightforward manner, requiring only the management of a small ALAT in hardware.



Floating-point feature set

The FPU in the processor is quite advanced. The native 82-bit hardware provides efficient support for multiple numeric programming models, including support for single-, double-, extended-, and mixed-mode-precision computations. The wide-range, 17-bit exponent enables efficient support for extended-precision library functions as well as fast emulation of quad-precision computations. The large 128-entry register file provides adequate register resources. The FPU execution hardware is based on the floating-point multiply-add (FMAC) primitive, which is an effective building block for scientific computation.1 The machine provides execution hardware for four double-precision or eight single-precision flops per clock. This abundant computation bandwidth is balanced with adequate operand bandwidth from the registers and memory subsystem. With judicious use of data prefetch instructions, as well as cache locality and allocation management hints, software can effectively arrange the computation for sustained high utilization of the parallel hardware.

FMAC units

The FPU supports two fully pipelined, 82-bit FMAC units that can execute single-, double-, or extended-precision floating-point operations. This delivers a peak of four double-precision flops/clock, or 3.2 Gflops at 800 MHz. The FMAC units execute FMA, FMS, FNMA, FCVTFX, and FCVTXF operations. When bypassed to one another, the latency of the FMAC arithmetic operations is five clock cycles.

The processor also provides support for executing two SIMD floating-point instructions in parallel. Since each instruction issues two single-precision FMAC operations (or four single-precision flops), the peak execution bandwidth is eight single-precision flops/clock, or 6.4 Gflops at 800 MHz. Two supplemental single-precision FMAC units support this computation. (Since the read of an 82-bit register actually yields two single-precision SIMD operands, the second operand in each case is peeled off and sent to the supplemental SIMD units for execution.) The high computational rate on single precision is especially suitable for digital content creation workloads.

The divide operation is done in software and can take advantage of the twin fully pipelined FMAC hardware. Software-pipelined divide operations can yield high throughput on the division and square-root operations common in 3D geometry codes.

The machine also provides one hardware pipe for execution of FCMPs and other operations (such as FMERGE, FPACK, FSWAP, FLogicals, reciprocal, and reciprocal square root). Latency of the FCMP operations is two clock cycles; latency of the other floating-point operations is five clock cycles.

Operand bandwidth

Care has been taken to ensure that the high computational bandwidth is matched with operand feed bandwidth (see Figure B). The 128-entry floating-point register file has eight read and four write ports. Every cycle, the eight read ports can feed two extended-precision FMACs (each with three operands) as well as two floating-point stores to memory. The four write ports can accommodate two extended-precision results from the two FMAC units and the results from two load instructions each clock. To increase the effective write bandwidth into the FPU from memory, we divided the floating-point registers into odd and even banks. This enables the two physical write ports dedicated to load returns to be used to write four values per clock to the register file (two to each bank), using two ldf-pair instructions. The ldf-pair instructions must obey the restriction that the pair of consecutive memory operands being loaded sends one operand to an even register and the other to an odd register for proper use of the banks.

The earliest cache level to feed the FPU is the unified L2 cache (96 Kbytes). Two ldf-pair instructions can load four double-precision values from the L2 cache into the registers. The latency of loads from this cache to the FPU is nine clock cycles. For data beyond the L2 cache, the bandwidth from the L3 cache is two double-precision operands/clock (one 64-byte line every four clock cycles).

[Figure B. FMAC units deliver 8 flops/clock. The 128-entry, 82-bit register file feeds the FMACs 6 × 82 bits per clock and accepts 2 stores/clock; the L2 cache supplies 4 double-precision operands/clock (2 × ldf-pair, into even and odd banks), and the 4-Mbyte L3 cache supplies 2 double-precision operands/clock.]

Obviously, to achieve the peak rating of four double-precision floating-point operations per clock cycle, one needs to feed the FMACs with six operands per clock. The L2 memory can feed a peak of four operands per clock. The remaining two need to come from the register file. Hence, with the right amount of data reuse, and with appropriate cache management strategies aimed at ensuring that the L2 cache is well primed to feed the FPU, many workloads can deliver sustained performance at near the peak floating-point operation rating. For data without locality, use of the NT2 and NTA hints enables the data to appear to virtually stream into the FPU through the next level of memory.

FPU and integer core coupling

The floating-point pipeline is coupled to the integer pipeline. Register file read occurs in the REG stage, with seven stages of execution



extending beyond the REG stage, followed by floating-point write back. Safe instruction recognition (SIR) hardware enables delivery of precise exceptions on numeric computation. In the FP1 (or EXE) stage, an early examination of operands is performed to determine the possibility of numeric exceptions on the instructions being issued. If the instructions are unsafe (have potential for raising exceptions), a special form of hardware microreplay is incurred. This mechanism enables instructions in the floating-point and integer pipelines to flow freely in all situations in which no exceptions are possible.

The FPU is coupled to the integer data path via transfer paths between the integer and floating-point register files. These transfers (setf, getf) are issued on the memory ports and made to look like memory operations (since they need register ports on both the integer and floating-point registers). While setf can be issued on either the M0 or M1 port, getf can only be issued on the M0 port. Transfer latency from the FPU to the integer registers (getf) is two clocks. The latency for the reverse transfer (setf) is nine clocks, since this operation appears like a load from the L2 cache.

We enhanced the FPU to support integer multiply inside the FMAC hardware. Under software control, operands are transferred from the integer registers to the FPU using setf. After multiplication is complete, the result is transferred to the integer registers using getf. This sequence takes a total of 18 clocks (nine for setf, seven for fmul to write the registers, and two for getf). The FPU can execute two integer multiply-add (XMA) operations in parallel. This is very useful in cryptographic applications. The presence of twin XMA pipelines at 800 MHz allows for over 1,000 decryptions per second on a 1,024-bit RSA using private keys (server-side encryption/decryption).

FPU controls

The FPU controls for operating precision and rounding are derived from the floating-point status register (FPSR). This register also contains the numeric execution status of each operation. The FPSR also supports speculation in floating-point computation. Specifically, the register contains four parallel fields, or tracks, for both controls and flags to support three parallel speculative streams in addition to the primary stream.

Special attention has been placed on delivering high performance for speculative streams. The FPU provides high throughput in cases where the FCLRF instruction is used to clear status from speculative tracks before forking off a fresh speculative chain. No stalls are incurred on such changes. In addition, the FCHKF instruction (which checks for exceptions on speculative chains on a given track) is also supported efficiently. Interlocks on this instruction are track-granular, so that no interlock stalls are incurred if floating-point instructions in the pipeline are only targeting the other tracks. However, changes to the control bits in the FPSR (made via the FSETC instruction or the MOV GR→FPSR instruction) have a latency of seven clock cycles.

FPU summary

The FPU feature set is balanced to deliver high performance across a broad range of computational workloads. This is achieved through the combination of abundant execution resources, ample operand bandwidth, and a rich programming environment.

References
1. B. Olsson et al., "RISC System/6000 Floating-Point Unit," IBM RISC System/6000 Technology, IBM Corp., 1990, pp. 34-42.



Parallel zero-latency delay-executed branching

Achieving the highest levels of performance requires a robust control flow mechanism. The processor's branch-handling strategy is based on three key directions. First, branch semantics providing more program details are needed to allow the software to convey complex control flow information to the hardware. Second, aggressive use of speculation and predication will progressively lead to an emptying out of basic blocks, leaving clusters of branches. Finally, since the data flow from compare to dependent branch is often very tight, special care needs to be taken to enable high performance for this important case. The processor optimizes across all three of these fronts.

The processor efficiently implements the powerful branch vocabulary of the IA-64 instruction set architecture. The hardware takes advantage of the new semantics for improved branch handling. For example, the loop count (LC) register indicates the number of iterations in a For-type loop, and the epilogue count (EC) register indicates the number of epilogue stages in a software-pipelined loop.

By using the loop count information, high performance can be achieved by software pipelining all loops. Moreover, the implementation avoids pipeline flushes for the first and last loop iterations, since the actual number of iterations is effectively communicated to the hardware. By examining the epilogue count register information, the processor automatically generates correct stage predicates for the epilogue iterations of the software-pipelined loop. This step leverages the predicate-remapping hardware along with the branch prediction information from the loop count register-based branch predictor.

Unlike conventional processors, the Itanium processor can execute up to three parallel branches per clock. This is implemented by examining the three controlling conditions (either predicates or the loop count/epilogue count values) for the three parallel branches, and performing a priority encode to determine the earliest taken branch. All side effects of later instructions are automatically squashed within the branch execution unit itself, preventing any architectural state update from branches in the shadow of a taken branch. Given that the powerful branch prediction in the front end contains tailored support for multiway branch prediction, minimal pipeline disruptions can be expected from this parallel branch execution.

Finally, the processor optimizes for the common case of a very short distance between the branch and the instruction that generates the branch condition. The IA-64 instruction set architecture allows a conditional branch to be issued concurrently with the integer compare that generates its condition code—no stop bit is needed. To accommodate this important performance optimization, the processor pipelines the compare-branch sequence. The compare instruction is performed in the pipeline's EXE stage, with the results being known by the end of the EXE clock. To accommodate the delivery of this condition to the branch hardware, the processor executes all branches in the DET stage. (Note that the presence of the DET stage isn't an overhead needed solely for branching. This stage is also used for exception collection and prioritization, and for the second clock of execution for integer-SIMD operations.) Thus, any branch issued in parallel with the compare that generates the condition will be evaluated in the DET stage, using the predicate results created in the previous (EXE) stage. In this manner, the processor can easily handle the case of compare and dependent branches issued in parallel.

As a result of branch execution in the DET stage, in the rare case of a full pipeline flush due to a branch misprediction, the processor will incur a branch misprediction penalty of nine pipeline bubbles. Note that we expect this to occur rarely, given the aggressive multitier branch prediction strategy in the front end. Most branches should be predicted correctly using one of the four progressive resteers in the front end.

The combination of enhanced branch semantics, three-wide parallel branch execution, and zero-cycle compare-to-branch latency allows the processor to achieve high performance on control-flow-dominated codes, in addition to its high performance on


more computation-oriented, data-flow-dominated workloads.

Memory subsystem

In addition to the high-performance core, the Itanium processor provides a robust cache and memory subsystem, which accommodates a variety of workloads and exploits the memory hints of the IA-64 ISA.

Three levels of on-package cache

The processor provides three levels of on-package cache for scalable performance across a variety of workloads. At the first level, instruction and data caches are split, each 16 Kbytes in size, four-way set-associative, and with a 32-byte line size. The dual-ported data cache has a load latency of two cycles, is write-through, and is physically addressed and tagged. The L1 caches are effective on moderate-size workloads and act as a first-level filter for capturing the immediate locality of large workloads.

The second cache level is 96 Kbytes in size, is six-way set-associative, and uses a 64-byte line size. The cache can handle two requests per clock via banking. This cache is also the level at which ordering requirements and semaphore operations are implemented. The L2 cache uses a four-state MESI (modified, exclusive, shared, and invalid) protocol for multiprocessor coherence. The cache is unified, allowing it to service both instruction- and data-side requests from the L1 caches. This approach allows optimal cache use for both instruction-heavy (server) and data-heavy (numeric) workloads. Since floating-point workloads often have large data working sets and are used with compiler optimizations such as data blocking, the L2 cache is the first point of service for floating-point loads. Also, because floating-point performance requires high bandwidth to the register file, the L2 cache can provide four double-precision operands per clock to the floating-point register file, using two parallel floating-point load-pair instructions.

The third level of on-package cache is 4 Mbytes in size, uses a 64-byte line size, and is four-way set-associative. It communicates with the processor at core frequency (800 MHz) using a 128-bit bus. This cache serves the large workloads of server and transaction-processing applications, and minimizes the cache traffic on the frontside system bus. The L3 cache also implements a MESI protocol for multiprocessor coherence.

A two-level hierarchy of TLBs handles virtual address translations for data accesses. The hierarchy consists of a 32-entry first-level and 96-entry second-level TLB, backed by a hardware page walker.

Optimal cache management

To enable optimal use of the cache hierarchy, the IA-64 instruction set architecture defines a set of memory locality hints used for better managing the memory capacity at specific hierarchy levels. These hints indicate the temporal locality of each access at each level of the hierarchy. The processor uses them to determine allocation and replacement strategies for each cache level. Additionally, the IA-64 architecture allows a bias hint, indicating that the software intends to modify the data of a given cache line. The bias hint brings a line into the cache with ownership, thereby optimizing the MESI protocol latency.

Table 2 lists the hint bits and their mapping to cache behavior. If data is hinted to be nontemporal for a particular cache level, that data is simply not allocated to the cache. (On the L2 cache, to simplify the control logic, the processor implements this algorithm approximately. The data can be allocated to the cache, but the least recently used, or LRU, bits are modified to mark the line as the next target for replacement.) Note that the nearest cache level to feed


the floating-point unit is the L2 cache. Hence, for floating-point loads, the behavior is modified to reflect this shift (an NT1 hint on a floating-point access is treated like an NT2 hint on an integer access, and so on).

Table 2. Implementation of cache hints.

Hint          Semantics                 L1 response        L2 response                     L3 response
NTA           Nontemporal (all levels)  Don't allocate     Allocate, mark as next replace  Don't allocate
NT2           Nontemporal (2 levels)    Don't allocate     Allocate, mark as next replace  Normal allocation
NT1           Nontemporal (1 level)     Don't allocate     Normal allocation               Normal allocation
T1 (default)  Temporal                  Normal allocation  Normal allocation               Normal allocation
Bias          Intent to modify          Normal allocation  Allocate into exclusive state   Allocate into exclusive state

Allowing the software to explicitly provide high-level semantics of the data usage pattern enables more efficient use of the on-chip memory structures, ultimately leading to higher performance for any given cache size and access bandwidth.

System bus

The processor uses a multidrop, shared system bus to provide four-way glueless multiprocessor system support. No additional bridges are needed for building up to a four-way system. Systems with eight or more processors are designed through clusters of these nodes using high-speed interconnects. Note that multidrop buses are a cost-effective way to build high-performance four-way systems for commercial transaction-processing and e-business workloads. These workloads often have highly shared writeable data and demand high throughput and low latency on transfers of modified data between the caches of multiple processors.

In a four-processor system, the transaction-based bus protocol allows up to 56 pending bus transactions (including 32 read transactions) on the bus at any given time. An advanced MESI coherence protocol helps in reducing bus invalidation transactions and in providing faster access to writeable data. The cache-to-cache transfer latency is further improved by an enhanced "defer mechanism," which permits efficient out-of-order data transfers and out-of-order transaction completion on the bus. A deferred transaction on the bus can be completed without reusing the address bus. This reduces data return latency for deferred transactions and efficiently uses the address bus. This feature is critical for scalability beyond four-processor systems.

The 64-bit system bus uses source-synchronous data transfer to achieve 266 Mtransfers/s, which enables a bandwidth of 2.1 Gbytes/s. The combination of these features makes the Itanium processor system a scalable building block for large multiprocessor systems.

The Itanium processor is the first IA-64 processor and is designed to meet the demanding needs of a broad range of enterprise and scientific workloads. Through its use of EPIC technology, the processor fundamentally shifts the balance of responsibilities between software and hardware. The software performs global scheduling across the entire compilation scope, exposing ILP to the hardware. The hardware provides abundant execution resources, manages the bookkeeping for EPIC constructs, and focuses on dynamic fetch and control flow optimizations to keep the compiled code flowing through the pipeline at high throughput. The tighter coupling and increased synergy between hardware and software enable higher performance with a simpler and more efficient design.

Additionally, the Itanium processor delivers significant value propositions beyond just performance. These include support for 64 bits of addressing, reliability for mission-critical applications, full IA-32 instruction set compatibility in hardware, and scalability across a range of operating systems and multiprocessor platforms. MICRO

References
1. L. Gwennap, "Merced Shows Innovative Design," Microprocessor Report, MicroDesign Resources, Sunnyvale, Calif., Oct. 6, 1999, pp. 1, 6-10.
2. M.S. Schlansker and B.R. Rau, "EPIC: Explicitly Parallel Instruction Computing," Computer, Feb. 2000, pp. 37-45.
3. T.Y. Yeh and Y.N. Patt, "Two-Level Adaptive Training Branch Prediction," Proc. 24th Ann. Int'l Symp. Microarchitecture, ACM Press, New York, Nov. 1991, pp. 51-61.




4. L. Gwennap, "New Algorithm Improves Branch Prediction," Microprocessor Report, Mar. 27, 1995, pp. 17-21.
5. B.R. Rau et al., "The Cydra 5 Departmental Supercomputer: Design Philosophies, Decisions, and Trade-Offs," Computer, Jan. 1989, pp. 12-35.

Harsh Sharangpani was Intel's principal microarchitect on the joint Intel-HP IA-64 EPIC ISA definition. He managed the microarchitecture definition and validation of the EPIC core of the Itanium processor. He has also worked on the 80386 and i486 processors, and was the numerics architect of the Pentium processor. Sharangpani received an MSEE from the University of Southern California, Los Angeles, and a BSEE from the Indian Institute of Technology, Bombay. He holds 20 patents in the field of microprocessors.

Ken Arora was the microarchitect of the execution and pipeline control of the EPIC core of the Itanium processor. He participated in the joint Intel/HP IA-64 EPIC ISA definition and helped develop the initial IA-64 simulation environment. Earlier, he was a designer and architect on the i486 and Pentium processors. Arora received a BS degree in computer science and a master's degree in electrical engineering from Rice University. He holds 12 patents in the field of microprocessor architecture and design.

Direct questions about this article to Harsh Sharangpani, Intel Corporation, Mail Stop SC 12-402, 2200 Mission College Blvd., Santa Clara, CA 95052; [email protected].

