Intel Itanium Architecture (64-bit)
Overview
Why develop?
  RISC processing limit of one instruction per cycle, predicted in 1989 by HP
  Led to HP's development of EPIC (Explicitly Parallel Instruction Computing)
  Uses a form of VLIW (Very Long Instruction Word)
HP decides to partner with Intel in 1994 to develop a new architecture based on EPIC
IA-64 is born
Versions
Merced
  Codename of the first Intel/HP joint IA-64 chip
  Development problems
    Transistor numbers
    Teams had different priorities
    Unanticipated research
Itanium
  Official name of Merced
  Released 2001
  Due to development delays, it was lacking at release
    RISC and CISC performance had increased in the meantime due to superscalar architectures
  Nicknamed the "Itanic"
Versions – cont.
Itanium 2
  Released 2002
  Codenamed McKinley
  Improved on the Itanium design
  Outperformed comparable RISC and CISC processors
Madison
  Released 2003
  Basis for all future versions until 2006
Versions – cont.
Montecito
  Released 2006
  Dual-core implementation of Itanium 2
  Performance doubled
  Power consumption cut by 20%
  New features also added
    Multi-threading (two threads per core)
    Expanded cache
    Silicon-level support for virtualization
Montvale
  Released 2007
  Fastest IA-64 chip to date
Competing Chips
UltraSPARC (Scalable Processor Architecture)
  Developed by Sun Microsystems
  RISC architecture
SPARC64
  Developed by Fujitsu
  RISC architecture
POWER6 (Performance Optimization With Enhanced RISC)
  Developed by IBM
  RISC architecture
Competing Chips – cont.
Opteron
  Developed by AMD
  x86 architecture
Xeon
  Developed by Intel
  x86 architecture
Intel Itanium Architecture
Chip Layout
[Figure: Itanium architecture diagram]
Itanium Specs
4 integer ALUs
4 multimedia ALUs
2 extended-precision FP units
2 single-precision FP units
2 load or store units
3 branch units
10-stage, 6-wide pipeline
32 KB L1 cache
96 KB L2 cache
4 MB L3 cache (external)
800 MHz clock
180 nm process
System bus speed 2.1 GB/s
  266 MHz
  64 bits wide
Itanium 2 Specs
6 integer ALUs
6 multimedia ALUs
2 extended-precision FP units
2 single-precision FP units
2 load and store units
3 branch units
8-stage, 6-wide pipeline
32 KB L1 cache
256 KB L2 cache
3 MB L3 cache (on die)
1 GHz clock initially
  Up to 1.66 GHz on Montvale
180 nm process
  Shrunk to 130 nm in 2003
  Shrunk further to 90 nm in 2006 (Montecito)
System bus speed 6.4 GB/s
  400 MHz
  128 bits wide
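The bus speeds quoted for Itanium and Itanium 2 follow from clock rate times bus width. A minimal sketch of that arithmetic, assuming one transfer per quoted clock; the function name is illustrative and not from the slides:

```python
def bus_bandwidth_gb_per_s(clock_mhz: float, width_bits: int) -> float:
    """Peak bus bandwidth in GB/s, assuming one transfer per quoted clock."""
    bytes_per_transfer = width_bits / 8
    return clock_mhz * 1e6 * bytes_per_transfer / 1e9


itanium = bus_bandwidth_gb_per_s(266, 64)     # 266 MHz, 64 bits wide
itanium2 = bus_bandwidth_gb_per_s(400, 128)   # 400 MHz, 128 bits wide
print(f"Itanium:   {itanium:.1f} GB/s")
print(f"Itanium 2: {itanium2:.1f} GB/s")
```

The two results round to the 2.1 GB/s and 6.4 GB/s figures on the spec slides.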
Itanium 2 Improvements
Initially a 180 nm process
  Shrunk to 130 nm in 2003
  Shrunk further to 90 nm in 2006 (Montecito)
Improved thermal management
Clock speed increased to 1.0 GHz
Bus speed increased from 266 MHz to 400 MHz
L3 cache moved on die
  Faster access rate
IA-64 Pipeline Features
Branch prediction
  Predicate registers allow branches to be turned on or off
  Compiler can provide branch prediction hints
Register rotation
  Allows faster loop execution in parallel
Predication
  Controls pipeline stages
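Register rotation can be pictured with a toy model. The sketch below assumes a small rotating region where a logical register number is offset by a rotating base that decrements once per loop pass, so a value written to logical register 0 is read as logical register 1 on the next pass. The class, the region size of 8, and the two-stage loop are illustrative simplifications, not the real IA-64 register layout, and the rotating predicates that normally gate each stage are replaced by plain `if` tests.

```python
class RotatingRegisters:
    """Toy rotating register region (simplified, not the real IA-64 layout)."""

    def __init__(self, size: int) -> None:
        self.size = size
        self.physical = [0] * size
        self.rrb = 0  # rotating register base

    def _index(self, logical: int) -> int:
        return (logical + self.rrb) % self.size

    def write(self, logical: int, value: int) -> None:
        self.physical[self._index(logical)] = value

    def read(self, logical: int) -> int:
        return self.physical[self._index(logical)]

    def rotate(self) -> None:
        # After rotating, the old logical register n is visible as n + 1.
        self.rrb = (self.rrb - 1) % self.size


def doubled(data):
    """Two-stage software pipeline: load in one pass, use in the next."""
    regs = RotatingRegisters(8)
    out = []
    for i in range(len(data) + 1):        # one extra pass drains the pipeline
        if i < len(data):
            regs.write(0, data[i])        # stage 1: "load" into logical r0
        if i > 0:
            out.append(regs.read(1) * 2)  # stage 2: previous pass's load, now r1
        regs.rotate()
    return out


print(doubled([1, 2, 3]))
```

Because each pass sees fresh register names, the load for one iteration and the computation for the previous one can sit in the same pass without the compiler unrolling the loop.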
Cache Features
L1 cache
  4-way associative
  16 KB instruction
  16 KB data
L2 cache
  Itanium
    6-way associative
    96 KB
  Itanium 2
    8-way associative
    256 KB initially
    256 KB data and 1 MB instruction on Montvale
Cache Features – cont.
L3 cache
  Itanium
    4-way associative
    Accessible through FSB
    2-4 MB
  Itanium 2
    2- to 4-way associative
    On die
    3 MB
    Up to 24 MB on Montvale chips (12 MB/core)
Instruction Set Architecture
Registers
128 integer registers
128 floating-point registers
64 one-bit predicates
8 branch registers
Overview
RISC architectures approaching processing limit of 1 instruction per clock cycle
Explicitly Parallel Instruction Computing (EPIC) allowed multiple instructions in one cycle
Implements a form of Very Long Instruction Word (VLIW)
Compiler determines in advance which instructions can be executed in parallel
VLIW
Normally, pipelining is done by checking for interdependencies, then resolving them
  This comes at the cost of hardware complexity
With VLIW, determining which operations can execute in parallel is done by the compiler
  Extra scheduling hardware not needed
Result is less hardware complexity, but greater compiler complexity
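The compiler-side work can be sketched as a small greedy scheduler. This is an illustrative toy, not how any real IA-64 compiler is implemented: it only tracks register dependencies (memory dependencies and unit types are ignored), the `Instr` type and `schedule` function are made-up names, and the instruction strings are generic pseudo-assembly.

```python
from typing import NamedTuple, Optional, Tuple


class Instr(NamedTuple):
    text: str                 # pseudo-assembly, for display only
    dest: Optional[str]       # register written, if any
    srcs: Tuple[str, ...]     # registers read


def schedule(instrs, width=3):
    """Greedily pack instructions into groups that may issue together."""
    groups = []      # groups[g] = instruction texts issued in group g
    group_of = []    # group index chosen for each instruction
    for i, ins in enumerate(instrs):
        earliest = 0
        for j in range(i):
            prev = instrs[j]
            raw = prev.dest is not None and prev.dest in ins.srcs
            waw = prev.dest is not None and prev.dest == ins.dest
            war = ins.dest is not None and ins.dest in prev.srcs
            if raw or waw:
                earliest = max(earliest, group_of[j] + 1)  # must come later
            elif war:
                earliest = max(earliest, group_of[j])      # not before the read
        g = earliest
        while g < len(groups) and len(groups[g]) >= width:
            g += 1                                         # group is full
        while len(groups) <= g:
            groups.append([])
        groups[g].append(ins.text)
        group_of.append(g)
    return groups


program = [
    Instr("add r1 = r2, r3", "r1", ("r2", "r3")),
    Instr("sub r4 = r5, r6", "r4", ("r5", "r6")),
    Instr("and r7 = r1, r4", "r7", ("r1", "r4")),
    Instr("add r8 = r9, r10", "r8", ("r9", "r10")),
]
for n, group in enumerate(schedule(program)):
    print(n, group)
```

Here the `and` has to wait for both results it reads, while the last `add` is independent and moves up into the first group.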
Instruction Execution
Each 128-bit instruction word contains 3 instructions
Fetch mechanism can read up to two instruction words per clock cycle
Whenever possible, the compiler can take advantage of this, allowing the processor to execute up to 6 instructions per cycle
Processor Units
The processor has 30 functional units in 11 groups
Each unit can execute a particular subset of the instruction set
Common instructions can be executed by multiple units
Processor Units – cont.
6 general-purpose ALUs, 2 integer units, 1 shift unit
4 data cache units
6 multimedia units, 2 parallel shift units, 1 parallel multiply, 1 population count
2 floating-point multiply-accumulate units, 2 "miscellaneous" floating-point units
3 branch units
Processor Units – cont.
Some of the units are designed for specific tasks, to improve performance
For instance, the floating-point multiply-accumulate unit
Allows an instruction that has a multiply followed by an add
Very common in scientific processing
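A dot product is a typical case: every loop step is one multiply followed by one add into a running total, which is the shape a multiply-accumulate unit handles in a single instruction. The sketch below only shows that shape; plain Python rounds after the multiply and again after the add, whereas a fused hardware operation would round once, and the function name is illustrative.

```python
def dot(xs, ys):
    """Dot product written as a chain of multiply-accumulate steps."""
    acc = 0.0
    for x, y in zip(xs, ys):
        acc = x * y + acc   # one multiply-accumulate per element pair
    return acc


print(dot([1.0, 2.0, 3.0], [4.0, 5.0, 6.0]))
```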
Instruction Types
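There are a total of 6 instruction types: A (integer ALU), I (non-ALU integer), M (memory), F (floating-point), B (branch), and L+X (extended)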
Bundle Format
3 instructions are grouped together into 128-bit aligned containers called “bundles”
Each bundle has three 41-bit instruction slots and a 5-bit template field
Execution order within a bundle goes from slot 0 to slot 2
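The bundle arithmetic (5 + 3 × 41 = 128 bits) can be checked with a small pack/unpack sketch. It assumes the template sits in the low five bits with slot 0 directly above it, as laid out in Intel's architecture manuals; the helper names are illustrative, and the slot values in the example are arbitrary bit patterns rather than real instructions.

```python
TEMPLATE_BITS = 5
SLOT_BITS = 41
SLOT_MASK = (1 << SLOT_BITS) - 1
TEMPLATE_MASK = (1 << TEMPLATE_BITS) - 1


def pack_bundle(template: int, slots) -> int:
    """Pack a 5-bit template and three 41-bit slots into one 128-bit integer."""
    if len(slots) != 3:
        raise ValueError("a bundle holds exactly three instruction slots")
    bundle = template & TEMPLATE_MASK
    for i, slot in enumerate(slots):
        bundle |= (slot & SLOT_MASK) << (TEMPLATE_BITS + SLOT_BITS * i)
    return bundle


def unpack_bundle(bundle: int):
    """Split a 128-bit bundle back into (template, [slot0, slot1, slot2])."""
    template = bundle & TEMPLATE_MASK
    slots = [
        (bundle >> (TEMPLATE_BITS + SLOT_BITS * i)) & SLOT_MASK
        for i in range(3)
    ]
    return template, slots


example = pack_bundle(0x10, [0x123, 0x456, 0x789])   # arbitrary values
print(unpack_bundle(example))
print(TEMPLATE_BITS + 3 * SLOT_BITS)                 # total bundle width in bits
```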
Instruction Format
Instructions are 41 bits long
Leftmost 4 bits are the opcode
Next is the opcode extension
Then the 3 registers (or immediate values)
The last 6 bits deal with predicates (more on this later)
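The field layout can be sketched as bit slicing. The widths below are an assumption based on the common three-register ALU form: a 4-bit opcode, a 10-bit extension area, three 7-bit register numbers (enough to name 128 registers), and a 6-bit predicate, which add up to 41 bits. Other instruction formats place immediates and extensions differently, and the function names and example values are illustrative.

```python
def encode(opcode: int, ext: int, r1: int, r2: int, r3: int, qp: int = 0) -> int:
    """Build a 41-bit instruction word from its fields (assumed ALU layout)."""
    return (
        (opcode & 0xF) << 37      # bits 40..37: opcode
        | (ext & 0x3FF) << 27     # bits 36..27: opcode extension
        | (r3 & 0x7F) << 20       # bits 26..20: source register 2
        | (r2 & 0x7F) << 13       # bits 19..13: source register 1
        | (r1 & 0x7F) << 6        # bits 12..6:  destination register
        | (qp & 0x3F)             # bits 5..0:   qualifying predicate
    )


def decode(instr: int) -> dict:
    """Split a 41-bit instruction word back into its fields."""
    return {
        "opcode": (instr >> 37) & 0xF,
        "ext": (instr >> 27) & 0x3FF,
        "r3": (instr >> 20) & 0x7F,
        "r2": (instr >> 13) & 0x7F,
        "r1": (instr >> 6) & 0x7F,
        "qp": instr & 0x3F,
    }


word = encode(opcode=8, ext=0, r1=1, r2=2, r3=3, qp=5)   # arbitrary field values
print(decode(word))
```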
Instruction Set Sample
Instruction Example 1
Instruction Example 2
Instruction Example 3
Example Optimizations
Implements branch prediction, speculation, and predication
Prediction and speculation deal with determining which branch will most likely be taken
All of this is done by the compiler, and each word has special bits for this
Branch Predication
All possible branches are executed
Correct path is kept, all others discarded
Almost every instruction in the IA-64 instruction set is predicated (qp field)
Predicates stored in special registers
One of these registers (p0) is always TRUE, so unpredicated instructions always execute
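A toy model of predication, assuming a compare that sets a pair of complementary predicates and a loop that issues every instruction but commits only those whose qualifying predicate is true. The register names, the three-instruction program, and the function name are illustrative; real hardware does this per instruction in the pipeline rather than in a software loop.

```python
def run_predicated(a: int, b: int) -> dict:
    """Compute |a - b| and a + b without a branch, using predicates."""
    pred = [False] * 64
    pred[0] = True              # p0 is always true
    regs = {}

    # Compare step: sets p1 and its complement p2.
    pred[1] = a > b
    pred[2] = not pred[1]

    program = [
        (1, "r8", lambda: a - b),   # (p1) r8 = a - b
        (2, "r8", lambda: b - a),   # (p2) r8 = b - a
        (0, "r9", lambda: a + b),   # (p0) r9 = a + b, effectively unpredicated
    ]
    for qp, dest, compute in program:
        result = compute()          # both sides of the "branch" are executed
        if pred[qp]:                # only the true-predicate result is kept
            regs[dest] = result
    return regs


print(run_predicated(7, 3))
print(run_predicated(3, 7))
```

Both calls produce the same absolute difference in r8, and neither needs a conditional jump.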
Register Renaming
Sometimes instructions share the same register name, but do not depend on each other
This makes it impossible to run the instructions in parallel
In this case, a special technique can be used to rename the conflicting registers
This is also performed by the compiler
Register Renaming - Example
1. lw $1, 1024
2. addi $1, $1, 2
3. sw $1, 1032
4. lw $1, 2048
5. addi $1, $1, 4
6. sw $1, 2056
• Instructions 4, 5, and 6 are independent of 1, 2, and 3, but the processor cannot finish 4 until 3 is done, because otherwise 3 would store the wrong value
Register Renaming - Example
1. lw $1, 1024
2. addi $1, $1, 2
3. sw $1, 1032
4. lw $2, 2048
5. addi $2, $2, 4
6. sw $2, 2056
• Now instructions 4, 5, and 6 can be executed in parallel with 1, 2, and 3.
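The renaming step can be sketched as a small pass over the instruction list. This version is a general textbook approach, assumed here for illustration: every write gets a fresh name and every read uses the latest name for its register, so it renames more aggressively than the slide, which only switches the second group to $2. The tuple layout, the `v1`, `v2`, ... names, and the function name are made up for the example.

```python
def rename(program):
    """Give each register write a fresh name; reads follow the latest name.

    Each instruction is (op, dest_or_None, [source registers], constant).
    """
    current = {}     # original register name -> most recent fresh name
    counter = 0
    renamed = []
    for op, dest, srcs, const in program:
        new_srcs = [current.get(s, s) for s in srcs]   # read latest values
        new_dest = None
        if dest is not None:
            counter += 1
            new_dest = f"v{counter}"
            current[dest] = new_dest
        renamed.append((op, new_dest, new_srcs, const))
    return renamed


program = [
    ("lw", "$1", [], 1024),
    ("addi", "$1", ["$1"], 2),
    ("sw", None, ["$1"], 1032),
    ("lw", "$1", [], 2048),
    ("addi", "$1", ["$1"], 4),
    ("sw", None, ["$1"], 2056),
]
for line in rename(program):
    print(line)
```

After renaming, instructions 1 to 3 touch only v1 and v2 while 4 to 6 touch only v3 and v4, so the two groups no longer share a register.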
Intel Itanium Architecture
Compiler/OS Support
Compiler Support
Intel has its own suite of compilers for producing machine code for IA-64 chips
  Available through Intel
As of 2007, the following outside compilers can also produce machine code for IA-64 architectures
  GCC
  Open64
  MS Visual Studio 2005
OS Support
The following operating systems support IA-64 as of 2007
  Windows Server 2003
  Linux
    Debian
    Red Hat
    Novell SuSE
  FreeBSD
  HP-UX
  OpenVMS
  NonStop
OS Support – cont.
HP provides virtualization support for its HP-UX operating system
GCOS is supported on Itanium chips
  Does this via instruction set simulators
  Essentially an application that acts as a middleman for OS-to-hardware communication
GCC News
As of 2007, GCC has been further optimized for IA-64
Superblock framework introduced into GCC
  Improves effectiveness of later optimizations
  Duplicates frequently executed code
This means GCC will produce faster IA-64 machine code
  As most Linux distros use GCC as their main compiler, this means better and faster open source code!
Being reviewed for inclusion into mainline GCC
Intel Itanium Architecture
Conclusions
Several differences exist compared with MIPS, for example:
  Large instruction sizes
  Deeper pipeline
    8 or more stages for IA-64
    5 for MIPS
  Large instruction set
Pros
  Very fast FP units
  Very useful for companies operating large servers
  Supercomputing
    Thunder (LLNL)
    2nd fastest supercomputer in the world (June 2004 TOP500 list)
    19.94 TFlops
Conclusions – cont.
Cons
  Very costly
    < $4000 per chip
  Requires very smart compilers that are very hard to develop
  GCC machine code still has bugs
    Fails to compile at times
    May be fixed when new optimizations are introduced into mainline GCC
OVERALL
  Great processor for high-end servers
  Not useful for the average user
Conclusions – cont.
Future Work
  Tukwila
    Scheduled to be released late 2008
    May use 32 nm process
    30 MB on-die caches
    Itanium bus replaced with Intel QuickPath Interconnect
      Faster data transfer rates
    4 cores
  Poulson
    Will use 32 nm process
    More cores, more parallelism
    Not much known as of yet
  Kittson
    Codename for newest IA-64 project
    Not much else known, stay tuned for more!
Q&A
Questions?
Thank You
Thanks for listening!