Intel Itanium Architecture (64-bit)

Overview

• Why develop? In 1989, HP predicted that RISC processing would hit a limit of one instruction per cycle
• This led HP to develop EPIC (Explicitly Parallel Instruction Computing)
  • Uses a form of VLIW (Very Long Instruction Word)
• In 1994, HP decided to partner with Intel to develop a new architecture based on EPIC
• IA-64 is born

Versions

• Merced
  • Codename of the first Intel/HP joint IA-64 chip
  • Development problems
    • Transistor numbers
    • Teams had different priorities
    • Unanticipated research
• Itanium
  • Official name of Merced
  • Released 2001
  • Due to development delays, its performance was lacking by the time it shipped
    • Called the "Itanic"
    • RISC and CISC performance had increased in the meantime due to superscalar architectures

• Itanium 2
  • Released 2002
  • Codenamed McKinley
  • Improved on the Itanium design
  • Outperformed comparable RISC and CISC processors
• Madison
  • Released 2003
  • Basis for all future versions until 2006

• Montecito
  • Released 2006
  • Dual-core implementation of Itanium 2
  • Performance doubled
  • Power consumption cut by 20%
  • New features also added
    • Multi-threading (two threads per core)
    • Expanded cache
    • Silicon-level support for virtualization
• Montvale
  • Released 2007
  • Fastest IA-64 chip to date

Competing Chips

• UltraSPARC (Scalable Processor Architecture)
  • Developed by Sun Microsystems
  • RISC architecture
• SPARC64
  • Developed by Fujitsu
  • RISC architecture
• POWER6 (Performance Optimization With Enhanced RISC)
  • Developed by IBM
  • RISC architecture

• Opteron
  • Developed by AMD
  • x86 architecture
• Xeon
  • Developed by Intel
  • x86 architecture

Chip Layout

Itanium Architecture Diagram

Itanium Specs

• 4 integer ALUs
• 4 multimedia ALUs
• 2 extended-precision FP units
• 2 single-precision FP units
• 2 load or store units
• 3 branch units
• 10-stage, 6-wide pipeline
• 32 KB L1 cache
• 96 KB L2 cache
• 4 MB L3 cache (external)
• 800 MHz clock
• 180 nm process
• System bus speed: 2.1 GB/s
  • 266 MHz
  • 64 bits wide

Itanium 2 Specs

• 6 integer ALUs
• 6 multimedia ALUs
• 2 extended-precision FP units
• 2 single-precision FP units
• 2 load and store units
• 3 branch units
• 8-stage, 6-wide pipeline
• 32 KB L1 cache
• 256 KB L2 cache
• 3 MB L3 cache (on die)
• 1 GHz clock initially
  • Up to 1.66 GHz on Montvale
• 180 nm process
  • Reduced to 130 nm in 2003
  • Further reduced to 90 nm in 2006 (Montecito)
• System bus speed: 6.4 GB/s
  • 400 MHz
  • 128 bits wide
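
The bus speeds quoted on the spec slides follow from the bus clock and width. The sketch below reproduces that arithmetic. It assumes one transfer per listed clock tick and 1 GB = 10^9 bytes; the function name is made up for illustration.

```python
def peak_bus_bandwidth_gb_s(clock_mhz: float, width_bits: int) -> float:
    """Peak bandwidth in GB/s, assuming one transfer per clock tick."""
    bytes_per_transfer = width_bits / 8
    return clock_mhz * 1e6 * bytes_per_transfer / 1e9


# Figures taken from the spec slides.
itanium_bw = peak_bus_bandwidth_gb_s(266, 64)     # about 2.1 GB/s
itanium2_bw = peak_bus_bandwidth_gb_s(400, 128)   # about 6.4 GB/s
print(f"Itanium:   {itanium_bw:.1f} GB/s")
print(f"Itanium 2: {itanium2_bw:.1f} GB/s")
```

Doubling the width and raising the clock by about 1.5x gives the roughly 3x increase between the two generations.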

Itanium 2 Improvements

• Initially a 180 nm process
  • Reduced to 130 nm in 2003
  • Further reduced to 90 nm in 2006 (Montecito)
• Improved thermal management
• Clock speed increased to 1.0 GHz
• Bus speed increased from 266 MHz to 400 MHz
• L3 cache moved on die
  • Faster access rate

IA-64 Pipeline Features

• Branch prediction
  • Predicate registers allow branches to be turned on or off
  • Compiler can provide branch prediction hints
• Register rotation
  • Allows faster loop execution in parallel
• Predication
  • Controls pipeline stages
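
A rough sketch of the register rotation idea listed above: a logical register number is mapped to a physical register through a rotating base, so a value written under one name in one loop iteration shows up under the next name in the following iteration, with no copy instructions. The class name, the 8-register window and the starting register number 32 are assumptions chosen for illustration, not details from the slides.

```python
class RotatingRegisters:
    """Toy model of a rotating register window (sizes are illustrative)."""

    def __init__(self, size: int = 8, base: int = 32):
        self.size = size          # number of rotating registers (assumed)
        self.base = base          # first rotating register number (assumed)
        self.rrb = 0              # rotating register base offset
        self.physical = [0] * size

    def _index(self, logical: int) -> int:
        # Map a logical register number to a physical slot.
        return (logical - self.base + self.rrb) % self.size

    def read(self, logical: int):
        return self.physical[self._index(logical)]

    def write(self, logical: int, value) -> None:
        self.physical[self._index(logical)] = value

    def rotate(self) -> None:
        # Done once per loop iteration: every value moves up one name.
        self.rrb = (self.rrb - 1) % self.size


regs = RotatingRegisters()
regs.write(32, "loaded in iteration 0")
regs.rotate()
print(regs.read(33))  # the same value, now visible as register 33
```

This is what lets several loop iterations overlap: each stage of the loop body always reads and writes the same register names, while rotation keeps the iterations' values apart.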

Cache Features

• L1 cache
  • 4-way associative
  • 16 KB instruction
  • 16 KB data
• L2 cache
  • Itanium
    • 6-way associative
    • 96 KB
  • Itanium 2
    • 8-way associative
    • 256 KB initially
    • 256 KB data and 1 MB instruction on Montvale!

• L3 cache
  • Itanium
    • 4-way associative
    • Accessible through FSB
    • 2-4 MB
  • Itanium 2
    • 2- to 4-way associative
    • On die
    • 3 MB
    • Up to 24 MB on Montvale chips (12 MB/core)!

Instruction Set Architecture

Registers

• 128 integer registers
• 128 floating-point registers
• 64 one-bit predicates
• 8 branch registers

Overview

RISC architectures approaching processing limit of 1 instruction per clock cycle

Explicitly Parallel Instruction Computing (EPIC) allowed multiple instructions in one cycle

Implements a form of Very Long Instruction Word (VLIW)

Compiler determines in advance which instructions can be executed in parallel

VLIW

• Normally, the hardware checks instructions for interdependencies while pipelining, then resolves them
  • This comes at the cost of hardware complexity
• With VLIW, determining which operations can execute in parallel is done by the compiler
  • Extra scheduling hardware not needed
• Result is less hardware complexity, but greater compiler complexity
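
To make the compiler's job concrete, here is a minimal sketch of static grouping: walk the instructions in order and start a new group whenever an instruction reads or writes a register that was already written in the current group. The `Instr` type, the `group_independent` function and the pseudo-assembly strings are hypothetical; a real IA-64 compiler does far more (reordering, speculation, template selection).

```python
from typing import List, NamedTuple, Optional, Tuple


class Instr(NamedTuple):
    text: str                 # pseudo-assembly, for display only
    dest: Optional[str]       # register written, if any
    srcs: Tuple[str, ...]     # registers read


def group_independent(instrs: List[Instr], width: int = 6) -> List[List[Instr]]:
    """Greedy in-order grouping; each group is safe to issue together."""
    groups: List[List[Instr]] = []
    current: List[Instr] = []
    written = set()
    for ins in instrs:
        depends = any(s in written for s in ins.srcs) or ins.dest in written
        if current and (depends or len(current) == width):
            groups.append(current)
            current, written = [], set()
        current.append(ins)
        if ins.dest is not None:
            written.add(ins.dest)
    if current:
        groups.append(current)
    return groups


program = [
    Instr("add r1 = r2, r3", "r1", ("r2", "r3")),
    Instr("sub r4 = r5, r6", "r4", ("r5", "r6")),
    Instr("add r7 = r1, r4", "r7", ("r1", "r4")),  # needs r1 and r4
]
for i, group in enumerate(group_independent(program)):
    print(i, [ins.text for ins in group])
```

The first two instructions land in one group and the third in the next, since it depends on both earlier results. The default width of 6 mirrors the 6-wide pipeline on the spec slides.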

Instruction Execution

Each 128-bit instruction word contains 3 instructions

Fetch mechanism can read up to two instruction words per clock cycle

Whenever possible, the compiler can take advantage of this, allowing the processor to execute up to 6 instructions per cycle

Processor Units

• The processor has 30 functional units in 11 groups
• Each unit can execute a particular subset of the instruction set
• Common instructions can be executed by multiple units

Processor Units – cont.

• 6 general-purpose ALUs, 2 integer units, 1 shift unit
• 4 data cache units
• 6 multimedia units, 2 parallel shift units, 1 parallel multiply, 1 population count
• 2 floating-point multiply-accumulate units, 2 "miscellaneous" floating-point units
• 3 branch units

Processor Units – cont.

Some of the units are designed for specific tasks, to improve performance

• For instance, the floating-point multiply-accumulate unit
  • Executes a multiply followed by an add as a single instruction
  • Very common in scientific processing
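
As an illustration of why multiply-accumulate matters in scientific code, a dot product is nothing but a chain of multiply-then-add steps. The sketch below only shows the pattern; Python itself performs the multiply and the add as two separate operations, whereas a multiply-accumulate unit would do each loop step in one instruction.

```python
def dot(xs, ys):
    """Dot product written as repeated multiply-accumulate steps."""
    acc = 0.0
    for x, y in zip(xs, ys):
        acc = acc + x * y  # one multiply-accumulate per element
    return acc


print(dot([1.0, 2.0, 3.0], [4.0, 5.0, 6.0]))  # 32.0
```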

Instruction Types

There are a total of 6 instruction types

Bundle Format

3 instructions are grouped together into 128-bit aligned containers called “bundles”

Each bundle has three 41-bit instruction slots and a 5-bit template field

Execution order goes from slot 0 to slot 2
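
The bundle arithmetic works out exactly: 3 x 41 + 5 = 128 bits. Below is a small sketch of packing and unpacking a bundle held in a Python integer. It assumes the template occupies the lowest 5 bits with slot 0 directly above it; the function names are invented for this example.

```python
SLOT_BITS = 41
TEMPLATE_BITS = 5
SLOT_MASK = (1 << SLOT_BITS) - 1
TEMPLATE_MASK = (1 << TEMPLATE_BITS) - 1


def pack_bundle(template: int, slots) -> int:
    """Combine a 5-bit template and three 41-bit slots into 128 bits."""
    if len(slots) != 3:
        raise ValueError("a bundle holds exactly three instruction slots")
    bundle = template & TEMPLATE_MASK
    for i, slot in enumerate(slots):
        bundle |= (slot & SLOT_MASK) << (TEMPLATE_BITS + i * SLOT_BITS)
    return bundle


def unpack_bundle(bundle: int):
    """Split a 128-bit bundle into (template, (slot0, slot1, slot2))."""
    if not 0 <= bundle < (1 << 128):
        raise ValueError("bundle must fit in 128 bits")
    template = bundle & TEMPLATE_MASK
    slots = tuple(
        (bundle >> (TEMPLATE_BITS + i * SLOT_BITS)) & SLOT_MASK
        for i in range(3)
    )
    return template, slots


bundle = pack_bundle(0x10, (0x123, 0x456, 0x789))
print(unpack_bundle(bundle))
```

The template value tells the hardware which kind of unit each slot is destined for, which is how three 41-bit instructions can share one opcode space.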

Instruction Format

• Instructions are 41 bits long
• Leftmost 4 bits are the opcode
• Next is the opcode extension
• Then the 3 registers (or immediate values)
• The last 6 bits deal with predicates (more on this later)
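
A sketch of pulling those fields out of a 41-bit instruction. With 128 registers, each register field needs 7 bits, which leaves 41 - 4 - 3 x 7 - 6 = 10 bits for the opcode extension. The exact field positions and the field names below are assumptions for a plain three-register instruction; other instruction types divide the middle bits differently.

```python
def encode_fields(opcode, extension, r3, r2, r1, qp) -> int:
    """Build a 41-bit instruction word from its fields (assumed layout)."""
    return (
        (opcode & 0xF) << 37
        | (extension & 0x3FF) << 27
        | (r3 & 0x7F) << 20
        | (r2 & 0x7F) << 13
        | (r1 & 0x7F) << 6
        | (qp & 0x3F)
    )


def decode_fields(instr: int) -> dict:
    """Split a 41-bit instruction word back into its fields."""
    return {
        "opcode": (instr >> 37) & 0xF,       # leftmost 4 bits
        "extension": (instr >> 27) & 0x3FF,  # 10 bits of opcode extension
        "r3": (instr >> 20) & 0x7F,          # second source register
        "r2": (instr >> 13) & 0x7F,          # first source register
        "r1": (instr >> 6) & 0x7F,           # destination register
        "qp": instr & 0x3F,                  # qualifying predicate
    }


word = encode_fields(opcode=8, extension=0, r3=3, r2=2, r1=1, qp=0)
print(decode_fields(word))
```

The 6-bit predicate field matches the 64 one-bit predicate registers: 2^6 = 64.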

Instruction Set Sample

Instruction Example 1

Example Optimizations

Implements branch prediction, speculation, and predication

Prediction and speculation deal with determining which branch will most likely be taken

All of this is done by the compiler, and each instruction word has special bits for it

Branch Predication

• All possible branches are executed
• Correct path is kept, all others discarded
• Almost every instruction in the IA-64 instruction set is predicated (qp field)
• Predicates stored in special registers
• One of these registers (p0) is always TRUE, so unpredicated instructions always execute
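
A toy interpreter can show how predication replaces a branch. The program below computes c = max(a, b) with no jump: a compare sets two complementary predicates, and each move only takes effect if its qualifying predicate is true. The helper names and the tuple-based program format are invented for this sketch; predicate 0 plays the always-true register.

```python
def cmp_gt(p_true, p_false, a, b):
    """Compare: set p_true to (a > b) and p_false to its complement."""
    def op(regs, preds):
        result = regs[a] > regs[b]
        preds[p_true] = result
        preds[p_false] = not result
    return op


def mov(dst, src):
    """Copy register src into register dst."""
    def op(regs, preds):
        regs[dst] = regs[src]
    return op


def run(program, regs):
    """Execute (qualifying predicate, operation) pairs in order."""
    preds = [False] * 64
    preds[0] = True              # predicate 0 is always true
    for qp, op in program:
        if preds[qp]:            # instruction is squashed if predicate is false
            op(regs, preds)
        preds[0] = True          # writes to predicate 0 have no effect
    return regs


MAX_PROGRAM = [
    (0, cmp_gt(1, 2, "a", "b")),  # unpredicated: always runs
    (1, mov("c", "a")),           # kept only if a > b
    (2, mov("c", "b")),           # kept only if a <= b
]
print(run(MAX_PROGRAM, {"a": 5, "b": 3, "c": 0})["c"])
```

Both moves flow through the pipeline, but only one changes state, so there is no branch to mispredict.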

Register Renaming

Sometimes instructions share the same register name, but do not depend on each other

This makes it impossible to run the instructions in parallel

In this case, a special technique can be used to rename the conflicting registers

This is also performed by the compiler

Register Renaming - Example

1. lw $1, 1024
2. addi $1, $1, 2
3. sw $1, 1032
4. lw $1, 2048
5. addi $1, $1, 4
6. sw $1, 2056

• Instructions 4, 5, and 6 are independent of 1, 2, and 3, but the processor cannot start 4 until 3 is done, because 4 would overwrite $1 and 3 would then store the wrong value

Register Renaming - Example

1. lw $1, 1024
2. addi $1, $1, 2
3. sw $1, 1032
4. lw $2, 2048
5. addi $2, $2, 4
6. sw $2, 2056

• Now instructions 4, 5, and 6 can be executed in parallel with 1, 2, and 3.
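
One simple way a compiler could do this automatically is to give every register write a fresh name and rewrite later reads to use the newest name. The sketch below applies that to the six instructions above. It is more aggressive than the slide (it renames every write, not just the second use of $1), and the tuple format and the `v0`, `v1`, ... names are made up for illustration.

```python
def rename(instrs):
    """Give each register write a fresh name; reads follow the latest name."""
    current = {}     # architectural register -> latest fresh name
    counter = 0
    renamed = []
    for op, dest, srcs in instrs:
        # Sources must be rewritten before the destination is renamed.
        new_srcs = tuple(current.get(s, s) for s in srcs)
        new_dest = dest
        if dest is not None:
            new_dest = f"v{counter}"
            counter += 1
            current[dest] = new_dest
        renamed.append((op, new_dest, new_srcs))
    return renamed


# (opcode, destination register, source registers); addresses omitted.
program = [
    ("lw", "$1", ()),
    ("addi", "$1", ("$1",)),
    ("sw", None, ("$1",)),
    ("lw", "$1", ()),
    ("addi", "$1", ("$1",)),
    ("sw", None, ("$1",)),
]
for line in rename(program):
    print(line)
```

After renaming, the first three instructions and the last three touch disjoint sets of registers, so the two groups can be scheduled in parallel.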


Compiler/OS Support

Compiler Support

• Intel has its own suite of compilers for producing machine code for IA-64 chips
  • Available through Intel
• As of 2007, the following outside compilers can also produce machine code for IA-64 architectures
  • GCC
  • Open64
  • MS Visual Studio 2005

OS Support

• The following operating systems support IA-64 as of 2007
  • Windows Server 2003
  • Linux
    • Debian
    • Red Hat
    • Novell SuSE
  • FreeBSD
  • HP-UX
  • OpenVMS
  • NonStop

• HP provides virtualization support for its HP-UX operating system
• GCOS is supported by Itanium chips
  • Does this via instruction set simulators
    • Essentially an application that acts as a middleman for OS-to-hardware communication

GCC News

• As of 2007, GCC has been further optimized for IA-64
• Superblock framework introduced into GCC
  • Improves effectiveness of later optimizations
  • Duplicates frequently executed code
• This means GCC will produce faster IA-64 machine code
  • As most Linux distros use GCC as their main compiler, this means better and faster open source code!
• Being reviewed for inclusion into mainline GCC


Conclusions

• Several differences exist compared with MIPS, for example:
  • Large instruction sizes
  • Deeper pipeline
    • 8 or more stages for IA-64
    • 5 for MIPS
  • Large instruction set
• Pros
  • Very fast FP units
  • Very useful for companies operating large servers
  • Supercomputing
    • Thunder (LLNL)
    • 2nd fastest supercomputer in the world (June 2004 TOP500 list)
    • 19.94 TFlops

• Cons
  • Very costly
    • < $4000 per chip
  • Requires very smart compilers that are very hard to develop
    • GCC machine code still has bugs
      • Fails to compile at times
      • May be fixed when new optimizations are introduced into mainline GCC
• OVERALL
  • Great processor for high-end servers
  • Not useful for the average user

• Future work
  • Tukwila
    • Scheduled to be released late 2008
    • May use a 32 nm process
    • 30 MB of on-die cache
    • Itanium bus replaced with Intel QuickPath Interconnect
      • Faster data transfer rates
    • 4 cores
  • Poulson
    • Will use a 32 nm process
    • More cores, more parallelism
    • Not much known as of yet
  • Kittson
    • Codename for the newest IA-64 project
    • Not much else known, stay tuned for more!

Q&A

Questions?

Thank You

Thanks for listening!
