Transcript
Page 1: Intel Itanium Architecture(64-bit)

Intel Itanium Architecture(64-bit)

Page 2: Intel Itanium Architecture(64-bit)

Overview

Page 3: Intel Itanium Architecture(64-bit)

Overview

Why develop?
  In 1989, HP predicted that RISC processing would hit a limit of one instruction per cycle
  This led HP to develop EPIC (Explicitly Parallel Instruction Computing)
  EPIC uses a form of VLIW (Very Long Instruction Word)
In 1994, HP decided to partner with Intel to develop a new architecture based on EPIC
  IA-64 is born

Page 4: Intel Itanium Architecture(64-bit)

Versions

Merced
  Codename of the first Intel/HP joint IA-64 chip
  Development problems
    Transistor numbers
    Teams had different priorities
    Unanticipated research
Itanium
  Official name of Merced
  Released 2001
  Due to development delays, its performance was lacking
  Called the "Itanic"
  RISC and CISC performance had increased in the meantime due to superscalar architectures

Page 5: Intel Itanium Architecture(64-bit)

Versions

Itanium 2
  Released 2002
  Codenamed McKinley
  Improved on Itanium design
  Outperformed comparable RISC and CISC processors
Madison
  Released 2003
  Basis for all future versions until 2006

Page 6: Intel Itanium Architecture(64-bit)

Versions

Montecito
  Released 2006
  Dual-core implementation of Itanium 2
  Performance doubled
  Power consumption cut by 20%
  New features also added
    Multi-threading (two threads per core)
    Expanded cache
    Silicon-level support for virtualization
Montvale
  Released 2007
  Fastest IA-64 chip to date

Page 7: Intel Itanium Architecture(64-bit)

Competing Chips

UltraSPARC (Scalable Processor Architecture)
  Developed by Sun Microsystems
  RISC architecture
SPARC64
  Developed by Fujitsu
  RISC architecture
POWER6 (Performance Optimization With Enhanced RISC)
  Developed by IBM
  RISC architecture

Page 8: Intel Itanium Architecture(64-bit)

Competing Chips

Opteron
  Developed by AMD
  x86 architecture
Xeon
  Developed by Intel
  x86 architecture

Page 9: Intel Itanium Architecture(64-bit)

Intel Itanium Architecture

Chip Layout

Page 10: Intel Itanium Architecture(64-bit)

Chip Layout

Itanium Architecture Diagram

Page 11: Intel Itanium Architecture(64-bit)

Chip Layout

Page 12: Intel Itanium Architecture(64-bit)

Itanium Specs

4 integer ALUs
4 multimedia ALUs
2 extended-precision FP units
2 single-precision FP units
2 load or store units
3 branch units
10-stage, 6-wide pipeline
32 KB L1 cache
96 KB L2 cache
4 MB L3 cache (external)
800 MHz clock

Page 13: Intel Itanium Architecture(64-bit)

Itanium Specs

Process: 180 nm
System bus speed: 2.1 GB/s
  266 MHz
  64 bits wide

Page 14: Intel Itanium Architecture(64-bit)

Itanium 2 Specs

6 integer ALUs
6 multimedia ALUs
2 extended-precision FP units
2 single-precision FP units
2 load and store units
3 branch units
8-stage, 6-wide pipeline
32 KB L1 cache
256 KB L2 cache
3 MB L3 cache (on die)
1 GHz clock initially
  Up to 1.66 GHz on Montvale

Page 15: Intel Itanium Architecture(64-bit)

Itanium 2 Specs

180 nm process
  Shrunk to 130 nm in 2003
  Further shrunk to 90 nm in 2006 (Montecito)
System bus speed: 6.4 GB/s
  400 MHz
  128 bits wide
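
The bus bandwidth figures on the two spec slides follow from clock rate times bus width. A quick check, assuming the quoted 266 MHz and 400 MHz are effective transfer rates with one transfer per clock (the function name is just for illustration):

```python
def bus_bandwidth_gb_per_s(transfers_mhz, width_bits):
    """Peak bandwidth in GB/s (10^9 bytes/s), assuming one transfer per quoted clock."""
    bytes_per_transfer = width_bits / 8
    return transfers_mhz * 1e6 * bytes_per_transfer / 1e9

itanium = bus_bandwidth_gb_per_s(266, 64)     # 266 MHz x 8 bytes
itanium2 = bus_bandwidth_gb_per_s(400, 128)   # 400 MHz x 16 bytes

print(f"Itanium:   {itanium:.1f} GB/s")    # about 2.1 GB/s
print(f"Itanium 2: {itanium2:.1f} GB/s")   # 6.4 GB/s
```

The wider bus contributes as much as the faster clock: doubling the width from 64 to 128 bits doubles the bytes moved per transfer.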

Page 16: Intel Itanium Architecture(64-bit)

Itanium 2 Improvements

Initially a 180 nm process
  Shrunk to 130 nm in 2003
  Further shrunk to 90 nm in 2006 (Montecito)
Improved thermal management
Clock speed increased to 1.0 GHz
Bus speed increased from 266 MHz to 400 MHz
L3 cache moved on die
  Faster access rate

Page 17: Intel Itanium Architecture(64-bit)

IA-64 Pipeline Features

Branch Prediction
  Predicate registers allow branches to be turned on or off
  Compiler can provide branch prediction hints
Register Rotation
  Allows faster loop execution in parallel
Predication
  Controls pipeline stages

Page 18: Intel Itanium Architecture(64-bit)

Cache Features

L1 Cache
  4-way associative
  16 KB instruction
  16 KB data
L2 Cache
  Itanium
    6-way associative
    96 KB
  Itanium 2
    8-way associative
    256 KB initially
    256 KB data and 1 MB instruction on Montvale!

Page 19: Intel Itanium Architecture(64-bit)

Cache Features

L3 Cache
  Itanium
    4-way associative
    Accessible through FSB
    2-4 MB
  Itanium 2
    2- to 4-way associative
    On die
    3 MB
    Up to 24 MB on Montvale chips (12 MB/core)!

Page 20: Intel Itanium Architecture(64-bit)

Instruction Set Architecture

Page 21: Intel Itanium Architecture(64-bit)

Registers

128 integer registers
128 floating-point registers
64 one-bit predicate registers
8 branch registers

Page 22: Intel Itanium Architecture(64-bit)

Overview

RISC architectures approaching processing limit of 1 instruction per clock cycle

Explicitly Parallel Instruction Computing (EPIC) allowed multiple instructions in one cycle

Implements a form of Very Long Instruction Word (VLIW)

Compiler determines in advance which instructions can be executed in parallel

Page 23: Intel Itanium Architecture(64-bit)

VLIW

Normally, pipelining is done by checking for interdependencies, then resolving them
  This comes at the cost of hardware complexity
With VLIW, determining which operations can execute in parallel is done by the compiler
  Extra scheduling hardware not needed
Result is less hardware complexity, but greater compiler complexity
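
To make the compiler's job concrete, here is a minimal sketch of grouping instructions that can issue together. It is not how any real IA-64 compiler works: the `Op` record, the register names, and the greedy in-order strategy are all assumptions for illustration. A new group starts whenever an instruction reads or rewrites a register written earlier in the current group, or when the group reaches the 6-instruction width.

```python
from typing import NamedTuple, Optional, Tuple

class Op(NamedTuple):
    name: str                # label for printing (hypothetical)
    dest: Optional[str]      # register written, if any
    srcs: Tuple[str, ...]    # registers read

def issue_groups(ops, width=6):
    """Greedily split ops, in program order, into groups with no
    read-after-write or write-after-write dependency inside a group."""
    groups, current, written = [], [], set()
    for op in ops:
        depends = any(s in written for s in op.srcs) or op.dest in written
        if current and (depends or len(current) == width):
            groups.append(current)
            current, written = [], set()
        current.append(op)
        if op.dest is not None:
            written.add(op.dest)
    if current:
        groups.append(current)
    return groups

ops = [
    Op("load a", "r1", ()),
    Op("load b", "r2", ()),
    Op("add",    "r3", ("r1", "r2")),
    Op("load c", "r4", ()),
    Op("mul",    "r5", ("r3", "r4")),
]
names = [[op.name for op in group] for group in issue_groups(ops)]
print(names)
```

A real scheduler would also reorder instructions (for example, hoisting "load c" into the first group); this sketch only splits the existing order, which is enough to show the dependency check moving from hardware to software.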

Page 24: Intel Itanium Architecture(64-bit)

Instruction Execution

Each 128-bit instruction word contains 3 instructions

Fetch mechanism can read up to two instruction words per clock cycle

Whenever possible, the compiler can take advantage of this, allowing the processor to execute up to 6 instructions per cycle

Page 25: Intel Itanium Architecture(64-bit)

Processor Units

The processor has 30 functional units in 11 groups
Each unit can execute a particular subset of the instruction set
Common instructions can be executed by multiple units

Page 26: Intel Itanium Architecture(64-bit)

Processor Units – cont.

6 general-purpose ALUs, 2 integer units, 1 shift unit
4 data cache units
6 multimedia units, 2 parallel shift units, 1 parallel multiply, 1 population count
2 floating-point multiply-accumulate units, 2 "miscellaneous" floating-point units
3 branch units

Page 27: Intel Itanium Architecture(64-bit)

Processor Units – cont.

Some of the units are designed for specific tasks, to improve performance

For instance, the floating-point multiply-accumulate unit

Allows an instruction that has a multiply followed by an add

Very common in scientific processing
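
A dot product is a typical case of the multiply-then-add pattern: each loop iteration is one multiply-accumulate. The sketch below only illustrates the pattern; Python itself performs a separate multiply and add, whereas on hardware with a multiply-accumulate unit a compiler could map the loop body to a single instruction.

```python
def dot(xs, ys):
    """Dot product written as a chain of multiply-accumulate steps."""
    acc = 0.0
    for x, y in zip(xs, ys):
        acc = x * y + acc   # the multiply-accumulate pattern: a * b + c
    return acc

result = dot([1.0, 2.0, 3.0], [4.0, 5.0, 6.0])
print(result)  # 4 + 10 + 18 = 32.0
```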

Page 28: Intel Itanium Architecture(64-bit)

Instruction Types

There are a total of 6 instruction types

Page 29: Intel Itanium Architecture(64-bit)

Bundle Format

3 instructions are grouped together into 128-bit aligned containers called “bundles”

Each bundle has three 41-bit instruction slots and a 5-bit template field

Execution order goes from slot 0 to slot 2
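
The arithmetic works out as 3 x 41 + 5 = 128 bits. A minimal sketch of packing and unpacking a bundle follows; the bit positions (template in the lowest 5 bits, then slots 0, 1, and 2 in ascending order) are an assumption not stated on the slide, and the function names are hypothetical.

```python
SLOT_BITS = 41
TEMPLATE_BITS = 5
SLOT_MASK = (1 << SLOT_BITS) - 1
TEMPLATE_MASK = (1 << TEMPLATE_BITS) - 1

def pack_bundle(template, slots):
    """Combine a 5-bit template and three 41-bit slots into one 128-bit integer."""
    if not 0 <= template <= TEMPLATE_MASK:
        raise ValueError("template must fit in 5 bits")
    if len(slots) != 3:
        raise ValueError("a bundle holds exactly 3 instruction slots")
    bundle = template
    for i, slot in enumerate(slots):
        if not 0 <= slot <= SLOT_MASK:
            raise ValueError("each slot must fit in 41 bits")
        bundle |= slot << (TEMPLATE_BITS + i * SLOT_BITS)
    return bundle

def unpack_bundle(bundle):
    """Split a 128-bit integer back into (template, [slot0, slot1, slot2])."""
    template = bundle & TEMPLATE_MASK
    slots = [(bundle >> (TEMPLATE_BITS + i * SLOT_BITS)) & SLOT_MASK
             for i in range(3)]
    return template, slots

bundle = pack_bundle(0x10, [1, 2, SLOT_MASK])
print(bundle.bit_length() <= 128)   # the packed value fits in 128 bits
print(unpack_bundle(bundle) == (0x10, [1, 2, SLOT_MASK]))
```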

Page 30: Intel Itanium Architecture(64-bit)

Instruction Types

Page 31: Intel Itanium Architecture(64-bit)

Instruction Format

Instructions are 41 bits long
Leftmost 4 bits are the opcode
Next is the opcode extension
Then the 3 registers (or immediate values)
The last 6 bits deal with predicates (more on this later)
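
The field widths can be worked out from the numbers in this deck: 4 opcode bits and 6 predicate bits are given, each register field needs 7 bits to address 128 registers, and the remaining 41 - 4 - 21 - 6 = 10 bits go to the opcode extension. The decoder below is a sketch under those assumptions; the field names and the r3, r2, r1 ordering are not from the slide, and real instruction types vary in how they use the middle bits.

```python
def decode(instr):
    """Split a 41-bit instruction into its assumed fields (see lead-in)."""
    if not 0 <= instr < (1 << 41):
        raise ValueError("instruction must fit in 41 bits")
    return {
        "opcode": (instr >> 37) & 0xF,    # leftmost 4 bits
        "ext":    (instr >> 27) & 0x3FF,  # 10-bit opcode extension (derived width)
        "r3":     (instr >> 20) & 0x7F,   # 7 bits select 1 of 128 registers
        "r2":     (instr >> 13) & 0x7F,
        "r1":     (instr >> 6) & 0x7F,
        "qp":     instr & 0x3F,           # 6 bits select 1 of 64 predicates
    }

# Illustrative field values only, not the encoding of a real instruction.
word = (8 << 37) | (5 << 27) | (3 << 20) | (2 << 13) | (1 << 6) | 7
fields = decode(word)
print(fields)
```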

Page 32: Intel Itanium Architecture(64-bit)

Instruction Set Sample

Page 33: Intel Itanium Architecture(64-bit)

Instruction Example 1

Page 34: Intel Itanium Architecture(64-bit)

Instruction Example 2

Page 35: Intel Itanium Architecture(64-bit)

Instruction Example 3

Page 36: Intel Itanium Architecture(64-bit)

Example Optimizations

Implements branch prediction, speculation, and predication

Prediction and speculation deal with determining which branch will most likely be taken

All of this is done by the compiler, and each word has special bits for this

Page 37: Intel Itanium Architecture(64-bit)

Branch Predication

All possible branches are executed
Correct path is kept, all others discarded
Almost every instruction in the IA-64 instruction set is predicated (qp field)
Predicates stored in special registers
One of these registers is always TRUE, so unpredicated instructions always have the value true
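
As a rough model of the idea, the sketch below replaces an if/else with two predicated subtractions: both are issued, and only the one whose qualifying predicate is true commits its result. The `Machine` class, its method names, and the register names are hypothetical and only mimic the behavior described above.

```python
class Machine:
    """Toy model: 64 one-bit predicates, predicate 0 hard-wired to True."""
    def __init__(self):
        self.pred = [False] * 64
        self.pred[0] = True
        self.reg = {}

    def cmp_gt(self, p_true, p_false, x, y):
        # A compare writes a complementary pair of predicates.
        self.pred[p_true] = x > y
        self.pred[p_false] = not (x > y)

    def sub(self, qp, dest, x, y):
        value = x - y              # the operation is carried out either way
        if self.pred[qp]:          # the result is kept only if qp is true
            self.reg[dest] = value

def abs_diff(a, b):
    m = Machine()
    m.cmp_gt(1, 2, a, b)           # p1 = a > b, p2 = not p1
    m.sub(1, "r8", a, b)           # (p1) r8 = a - b
    m.sub(2, "r8", b, a)           # (p2) r8 = b - a
    return m.reg["r8"]

print(abs_diff(7, 3), abs_diff(3, 7))   # both paths give 4
```

An unpredicated instruction corresponds to `qp=0` here, which always commits because predicate 0 is always true.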

Page 38: Intel Itanium Architecture(64-bit)

Register Renaming

Sometimes instructions share the same register name, but do not depend on each other

This makes it impossible to run the instructions in parallel

In this case, a special technique can be used to rename the conflicting registers

This is also performed by the compiler

Page 39: Intel Itanium Architecture(64-bit)

Register Renaming - Example

1. lw $1, 1024
2. addi $1, $1, 2
3. sw $1, 1032
4. lw $1, 2048
5. addi $1, $1, 4
6. sw $1, 2056

• Instructions 4, 5, and 6 are independent of 1, 2, and 3, but the processor cannot execute 4 until 3 is done, because 4 overwrites $1 and 3 would then store the wrong value

Page 40: Intel Itanium Architecture(64-bit)

Register Renaming - Example

1. lw $1, 1024
2. addi $1, $1, 2
3. sw $1, 1032
4. lw $2, 2048
5. addi $2, $2, 4
6. sw $2, 2056

• Now instructions 4, 5, and 6 can be executed in parallel with 1, 2, and 3.
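
The same example can be checked mechanically. In this sketch each instruction is reduced to the registers it reads and writes (memory addresses are left out, since the two halves touch different addresses); the tuple layout and both function names are assumptions for illustration, not a real compiler pass.

```python
# Each instruction: (mnemonic, registers read, registers written)
first = [
    ("lw",   (),      ("$1",)),
    ("addi", ("$1",), ("$1",)),
    ("sw",   ("$1",), ()),
]
second = list(first)   # instructions 4-6 use the same register before renaming

def conflicts(a, b):
    """Registers that force block b to wait for block a:
    read-after-write, write-after-read, or write-after-write."""
    reads_a = {r for _, reads, _ in a for r in reads}
    writes_a = {r for _, _, writes in a for r in writes}
    reads_b = {r for _, reads, _ in b for r in reads}
    writes_b = {r for _, _, writes in b for r in writes}
    return (writes_a & reads_b) | (reads_a & writes_b) | (writes_a & writes_b)

def rename(block, mapping):
    """Return a copy of block with registers substituted per mapping."""
    return [(op,
             tuple(mapping.get(r, r) for r in reads),
             tuple(mapping.get(r, r) for r in writes))
            for op, reads, writes in block]

before = conflicts(first, second)
after = conflicts(first, rename(second, {"$1": "$2"}))
print(before)   # {'$1'}: the shared name blocks parallel execution
print(after)    # set(): no shared registers after renaming
```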

Page 41: Intel Itanium Architecture(64-bit)

Chip Layout

Page 42: Intel Itanium Architecture(64-bit)

Intel Itanium Architecture

Compiler/OS Support

Page 43: Intel Itanium Architecture(64-bit)

Compiler Support

Intel has its own suite of compilers that it uses to produce machine code for IA-64 chips
  Available through Intel
As of 2007, the following outside compilers can also produce machine code for IA-64 architectures
  GCC
  Open64
  MS Visual Studio 2005

Page 44: Intel Itanium Architecture(64-bit)

OS Support

The following operating systems support IA-64 as of 2007
  Windows Server 2003
  Linux
    Debian
    Red Hat
    Novell SuSE
  FreeBSD
  HP-UX
  OpenVMS
  NonStop

Page 45: Intel Itanium Architecture(64-bit)

OS Support

HP provides virtualization support for its HP-UX operating system
GCOS is supported on Itanium chips
  Does this via instruction set simulators
  Essentially an application that acts as a middleman for OS-to-hardware communication

Page 46: Intel Itanium Architecture(64-bit)

GCC News

As of 2007, GCC has been further optimized for IA-64
Superblock framework introduced into GCC
  Improves effectiveness of later optimizations
  Duplicates frequently executed code
This means GCC will produce faster IA-64 machine code
  As most Linux distros use GCC as their main compiler, this means better and faster open source code!
Being reviewed for inclusion into mainline GCC

Page 47: Intel Itanium Architecture(64-bit)

Intel Itanium Architecture

Conclusions

Page 48: Intel Itanium Architecture(64-bit)

Conclusions

Several differences exist over MIPS, for example:
  Large instruction sizes
  Deeper pipeline
    8 and greater for IA-64
    5 for MIPS
  Large instruction set
Pros
  Very fast FP units
  Very useful for companies operating large servers
  Supercomputing
    Thunder (LLNL)
      2nd fastest supercomputer in the world
      19.94 TFlops

Page 49: Intel Itanium Architecture(64-bit)

Conclusions

Cons
  Very costly
    < $4000 per chip
  Requires very smart compilers that are very hard to develop
  GCC machine code still has bugs
    Fails to compile at times
    May be fixed when new optimizations are introduced into mainline GCC
OVERALL
  Great processor for high-end servers
  Not useful for the average user

Page 50: Intel Itanium Architecture(64-bit)

Conclusions

Future Work
  Tukwila
    Scheduled to be released late 2008
    May use 32 nm process
    30 MB on-die caches
    Itanium bus replaced with Intel QuickPath Interconnect
      Faster data transfer rates
    4 cores
  Poulson
    Will use 32 nm process
    More cores, more parallelism
    Not much known as of yet
  Kittson
    Codename for newest IA-64 project
    Not much else known, stay tuned for more!

Page 51: Intel Itanium Architecture(64-bit)

Q&A

Questions?

Page 52: Intel Itanium Architecture(64-bit)

Thank You

Thanks for listening!
