IA-64 Architecture
Sunil Saxena, Principal Engineer, Intel Corporation
September 11th, 2000
Copyright © 2000 Intel Corporation. Linux Supercluster Users Conference
IA Processor Roadmap
[Figure: performance vs. time, '99 to '02, across the .25µ, .18µ and .13µ process generations]
• IA-64 line: Itanium™ processor, McKinley, then Madison (IA-64 Perf) and Deerfield (IA-64 Price/Perf)
  - Extends IA Headroom, Scalability and Availability for the Most Demanding Environments
• IA-32 line: Pentium® III Xeon™ processor, Cascades, Foster, Future IA-32
  - Outstanding Performance for 32 Bit Volume Apps
Strong Execution on Itanium™ Processor, Continued Focus on the Long Term
Agenda: IA-64 Architecture
• EPIC 101
  - Application Architecture
  - System Architecture
  - Itanium Microarchitecture
• Itanium Update
• Useful URLs
EPIC Design Philosophy
• Maximize performance via hardware & software synergy
• Advanced features enhance instruction level parallelism
  - Predication, Speculation, ...
• Massive hardware resources for parallel execution
• High performance EPIC building block
Achieving performance at the most fundamental level
[Figure: performance vs. time, rising through CISC, RISC, OOO / SuperScalar and VLIW, to EPIC]
Instruction Format: Explicit Parallelism
128-bit bundle, bit 127 down to bit 0:
  Instruction 2 | Instruction 1 | Instruction 0 | Template
• Breaking the sequential execution paradigm
• Explicit instruction dependency: template
• Flexibly groups any number of independent instructions
• Explicitly scheduled parallelism
• Enables compiler to create greater parallelism
• Simplifies hardware by removing dynamic mechanisms
• Fully interlocked: hardware provides compatibility
The new instruction format enables scalability w/ compatibility
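The bundle layout can be illustrated with a small Python sketch that packs and unpacks a 128-bit bundle. The field widths used here (a 5-bit template in the low bits, followed by three 41-bit instruction slots) are taken from the published IA-64 instruction format, not from this slide, and the function names are invented for the example.

```python
TEMPLATE_BITS = 5   # template field in bits 4:0 (assumed from the IA-64 ISA, not the slide)
SLOT_BITS = 41      # width of each of the three instruction slots (same assumption)


def pack_bundle(template, slots):
    """Pack a template and three instruction slots into one 128-bit integer."""
    if not 0 <= template < (1 << TEMPLATE_BITS):
        raise ValueError("template does not fit in 5 bits")
    if len(slots) != 3:
        raise ValueError("a bundle holds exactly three instructions")
    bundle = template
    for i, slot in enumerate(slots):
        if not 0 <= slot < (1 << SLOT_BITS):
            raise ValueError("instruction does not fit in 41 bits")
        # Instruction 0 sits just above the template, instruction 2 at the top.
        bundle |= slot << (TEMPLATE_BITS + i * SLOT_BITS)
    return bundle


def unpack_bundle(bundle):
    """Split a 128-bit bundle back into (template, (inst0, inst1, inst2))."""
    template = bundle & ((1 << TEMPLATE_BITS) - 1)
    mask = (1 << SLOT_BITS) - 1
    slots = tuple((bundle >> (TEMPLATE_BITS + i * SLOT_BITS)) & mask
                  for i in range(3))
    return template, slots
```

The three slots and the template add up to exactly 128 bits (5 + 3 x 41), which is why the hardware can fetch a bundle as one aligned unit.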
Branches Limit Performance
Control flow introduces branches
Source:
  if a[i].ptr != 0
      b[i] = a[i].l;
  else
      b[i] = a[i].r;
  i = i + 1
Traditional Architectures: 4 basic blocks
  if:    load a[i].ptr
         p1, p2 = cmp a[i].ptr != 0
         branch if p2
  then:  load a[i].l
         store b[i]
         branch
  else:  load a[i].r
         store b[i]
  i = i + 1
Predication
Predication removes branches and eliminates mispredicts
Source:
  if a[i].ptr != 0
      b[i] = a[i].l;
  else
      b[i] = a[i].r;
  i = i + 1
With predicates attached:
  if:          load a[i].ptr
               p1, p2 = cmp a[i].ptr != 0
               branch if p2
  then:  <p1>  load a[i].l
         <p1>  store b[i]
               branch
  else:  <p2>  load a[i].r
         <p2>  store b[i]
  i = i + 1
Predication Enhances Parallelism
Traditional Architectures: 4 basic blocks
  if:    load a[i].ptr
         p1, p2 = cmp a[i].ptr != 0
         jump if p2
  then:  load a[i].l
         store b[i]
         jump
  else:  load a[i].r
         store b[i]
  i = i + 1
IA-64™ Architecture: 1 basic block
         load a[i].ptr
         p1, p2 = cmp a[i].ptr != 0
  <p1>   load a[i].l
  <p1>   store b[i]
  <p2>   load a[i].r
  <p2>   store b[i]
         i = i + 1
Predication enables more effective use of parallel hardware
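To make the one-basic-block form concrete, here is a minimal Python sketch of the same loop body. Each record is assumed to be a (ptr, l, r) tuple, and the conditional expressions stand in for predicated instructions: both guarded operations are issued on every iteration, and only the one whose predicate is true takes effect. Python itself has no predication, so this models the semantics rather than the hardware.

```python
def copy_with_branches(a):
    """Traditional form: control flow picks one of two paths per element."""
    b = []
    for ptr, left, right in a:
        if ptr != 0:
            b.append(left)
        else:
            b.append(right)
    return b


def copy_with_predicates(a):
    """Predicated form: one straight-line body, guarded by p1 and p2."""
    b = []
    for ptr, left, right in a:
        p1 = ptr != 0          # p1, p2 = cmp a[i].ptr != 0
        p2 = not p1
        dest = None
        dest = left if p1 else dest    # <p1> load a[i].l ; <p1> store b[i]
        dest = right if p2 else dest   # <p2> load a[i].r ; <p2> store b[i]
        b.append(dest)
    return b
```

Because p1 and p2 are complementary, exactly one of the two guarded updates takes effect, so both functions return the same list.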
Memory Latency Causes Delays
• Loads significantly affect performance
  - Often first instruction in dependency chain of instructions
  - Can incur high latencies
• Loads can cause exceptions
Source:
  t1 = t1 + 1
  if t1 > t2
      j = a[t1 - t2]
      b[j]++
Traditional Architectures:
  add t1 + 1
  comp t1 > t2
  branch
  ---- Barrier ----
  load a[t1 - t2]
  load b[j]
  add b[j] + 1
Speculation with IA-64™ Architecture
• Separate load behavior from exception behavior
  - Speculative load instruction (load.s) initiates a load operation and detects exceptions
  - Propagate an exception "token" (stored with destination register) from load.s to check.s
  - Speculative check instruction (check.s) delivers any exceptions detected by load.s
  add t1 + 1
  load.s a[t1 - t2]       ; Exception Detection
  comp t1 > t2
  jump
        (Propagate Exception)
  check.s                 ; Exception Delivery
  load b[j]
  add b[j] + 1
Speculation Minimizes the Effect of Memory Latency
• Give scheduling freedom to the compiler
  - Allows load.s to be scheduled above branches
  - check.s remains in home block, branches to fixup code if an exception is propagated
Traditional Architectures:
  add t1 + 1
  comp t1 > t2
  jump
  ---- Barrier ----
  load a[t1 - t2]
  load b[j]
  add b[j] + 1
IA-64 Architecture:
  add t1 + 1
  load.s a[t1 - t2]       ; Exception Detection
  comp t1 > t2
  jump
        (Propagate Exception)
  check.s                 ; Exception Delivery
  load b[j]
  add b[j] + 1
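The load.s / check.s split can be modelled in a few lines of Python. This is a sketch under stated assumptions: the `NAT` sentinel stands in for the exception token stored with the destination register, an out-of-range index stands in for a faulting load, and the function names are made up for illustration.

```python
NAT = object()  # hypothetical stand-in for the deferred-exception token


def load_s(array, index):
    """Speculative load: detect the exception but defer it as a token."""
    if 0 <= index < len(array):
        return array[index]
    return NAT


def check_s(value, fixup):
    """Speculative check: if a token was propagated, run the fix-up code."""
    return fixup() if value is NAT else value


def update(a, b, t1, t2):
    t1 = t1 + 1
    j = load_s(a, t1 - t2)       # hoisted above the branch, cannot fault here
    if t1 > t2:                  # home block of the original load
        # Fix-up redoes the load non-speculatively, so a real fault surfaces here.
        j = check_s(j, lambda: a[t1 - t2])
        b[j] += 1
    return t1
```

When the branch is not taken, a token left behind by a bad speculative load is simply never checked, so no exception is delivered for work the program did not ask for.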
Predication & Speculation
Source:
  if a[i].ptr != 0
      b[i] = a[i].l;
  else
      b[i] = a[i].r;
  i = i + 1
With Predication:
         load a[i].ptr
         p1, p2 = cmp a[i].ptr != 0
  <p1>   load a[i].l
  <p1>   store b[i]
  <p2>   load a[i].r
  <p2>   store b[i]
         i = i + 1
With Predication & Speculation:
         load a[i].ptr
         load.s a[i].l
         load.s a[i].r
         p1, p2 = cmp a[i].ptr != 0
  <p1>   check.s
  <p1>   store b[i]
  <p2>   check.s
  <p2>   store b[i]
         i = i + 1
Predication and Speculation = higher ILP
Agenda: IA-64 Architecture
• EPIC 101
  - Application Architecture
  - System Architecture
  - Itanium Micro-Architecture
• Itanium Update
• Useful URLs
IA-64 System Architecture
• Virtual Memory Model
• Interruption Model
• System Software Stack
• Reliability, Availability, Serviceability
• Compatibility
IA-64 Virtual Memory Model
• Process Address Space
• System Address Space Management
• Virtual Address Translation
  - TLB and Page table
• Flexible Object Sharing Model
  - Aliasing and Global addressing
Process Address Space
Flat Virtual Space: 2^64 bytes
[Figure: a 64-bit address (bits 63 to 0) indexing a flat virtual space from 0 to 2^64]
Process Address Space
8 Regions/process
[Figure: a 64-bit address (bits 63 to 0) indexing a virtual space from 0 to 2^64 that is split into regions holding Code/Text, Data/Heap, DLLs and the OS Kernel]
System Address Space
≥ 2^18 Regions
[Figure: a 64-bit address (bits 63 to 0) indexing a system address space, 0 to 2^64, made up of regions that are in turn divided into pages]
IA-64 Region Registers
• 8 Region Registers, selected by bits 63:61 of the 64-bit address
• Each region register supplies a region ID (RID)
• ≥ 2^18 Regions, each 2^61 bytes in size
• Bits 60:0 address the pages within the region
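A short Python sketch of the address split described on this slide: the top three bits of a 64-bit virtual address pick one of the eight region registers, and the low 61 bits are the offset inside that 2^61-byte region. The list `region_registers` and the function names are illustrative assumptions, not an OS or hardware interface.

```python
OFFSET_BITS = 61                       # each region is 2**61 bytes
OFFSET_MASK = (1 << OFFSET_BITS) - 1


def split_virtual_address(va):
    """Return (region register index, offset within the region)."""
    if not 0 <= va < (1 << 64):
        raise ValueError("not a 64-bit address")
    return va >> OFFSET_BITS, va & OFFSET_MASK


def region_id(region_registers, va):
    """Look up the RID held in the region register that va selects."""
    index, _ = split_virtual_address(va)
    return region_registers[index]
```

Changing the RID held in one of the eight registers remaps a whole 2^61-byte region at once, which is what makes per-process address spaces cheap to switch.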
Processes and Threads
[Figure: Process 1 and Process 2, each mapped onto its own set of regions]
Regions Enable Efficient Management Of Processes For Multi-tasking Environments
Virtual Address Translation: TLB
• Mapping to Physical Address
• Access Rights
[Figure: virtual addresses of Process A, Process B and Process C pass through the TLB to physical addresses]
TLB Organization
• Separate instruction and data TLBs
  - Instruction: ITC, ITR
  - Data: DTC, DTR
• Software Manages
  - TR entries
  - Page-table updates
• Hardware Manages
  - TC TLB refill
  - Broadcast TLB Purge
Balance TLB For Efficient Memory Management
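Virtual Address Translation
[Figure: bits 63:61 of the 64-bit address select a region register (RRx), which supplies the RID; bits 60:0 hold the virtual page # and the offset. The TLB "matches" on (RID, virtual page #) and "delivers" the physical page # and protection. Physical page # plus offset forms the physical address.]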
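The match/deliver step can be sketched in Python as a dictionary lookup keyed by (RID, virtual page #). The sketch assumes the RID has already been read from the selected region register, uses a single 16 KB page size for simplicity, and invents the names `translate` and `TLBMiss`; a real TLB holds entries of several page sizes and is searched associatively in hardware.

```python
PAGE_BITS = 14  # 16 KB pages, an arbitrary choice for this sketch


class TLBMiss(Exception):
    """Raised when no entry matches the (RID, virtual page #) pair."""


def translate(rid, region_offset, tlb):
    """Return (physical address, protection) for an address within a region."""
    vpn = region_offset >> PAGE_BITS
    offset = region_offset & ((1 << PAGE_BITS) - 1)
    entry = tlb.get((rid, vpn))          # "match" on RID and virtual page #
    if entry is None:
        raise TLBMiss((rid, vpn))
    ppn, protection = entry              # "deliver" physical page # and protection
    return (ppn << PAGE_BITS) | offset, protection
```

Because the RID is part of the match, two processes can use the same virtual page number without their entries colliding, and the TLB need not be flushed on a context switch.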
Protection: Can I See it? Can I Access it?
[Figure: the TLB entry matched by (RID, virtual page #) carries a key and access rights. The key is compared against the Protection Key Registers (pkr0 ... pkrn), for example Key1 r-x, Key2 rw-, Key3 r--, Key4 rwx, Key5 rw-. The rights, the privilege level and the access type together decide "Allow?".]
Protection Keys Increase TLB Utilization For Large Object Databases
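A simplified Python model of the two questions in the slide title. It assumes a TLB entry exposes a key and an `rwx`-style rights string, and that the protection key registers are a list of (key, permissions) pairs as drawn in the figure. Privilege-level checks are deliberately left out, and `access_allowed` is a made-up name, so treat this as a sketch of the idea rather than the architected check.

```python
def access_allowed(entry_key, entry_rights, access, pkrs):
    """Decide whether one access ('r', 'w' or 'x') may proceed.

    entry_key, entry_rights: key and rights string from the matched TLB entry.
    pkrs: list of (key, permissions) pairs, e.g. (2, "rw-").
    """
    if access not in ("r", "w", "x"):
        raise ValueError("access must be 'r', 'w' or 'x'")
    # Can I access it? The page's own rights must permit the access type.
    if access not in entry_rights:
        return False
    # Can I see it? Some protection key register must hold the page's key
    # and must not withhold this kind of access.
    for key, permissions in pkrs:
        if key == entry_key:
            return access in permissions
    return False  # no register holds the key
```

With the register contents shown in the figure, a page tagged Key3 is readable but not writable, whatever its own rights say:

```python
pkrs = [(1, "r-x"), (2, "rw-"), (3, "r--"), (4, "rwx"), (5, "rw-")]
access_allowed(3, "rw-", "w", pkrs)   # False
access_allowed(3, "rw-", "r", pkrs)   # True
```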
Variable Page Sizes
• Minimum on all implementations
  - 4K, 8K, 16K, 64K, 256K, 1M, 4M, 16M, 64M, 256M-bytes
• 4 GB purge
  - Simplify address space de-allocation
Variable Page Sizes Enable TLB Efficiency For OS And Application Performance
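The efficiency claim is easy to check with arithmetic: the larger the page, the fewer TLB entries a given object consumes. The Python sketch below uses the page sizes listed on the slide; the 64-entry TLB in the usage note is a hypothetical figure chosen only to make the numbers concrete.

```python
PAGE_SIZES = {
    "4K": 4 << 10, "8K": 8 << 10, "16K": 16 << 10, "64K": 64 << 10,
    "256K": 256 << 10, "1M": 1 << 20, "4M": 4 << 20, "16M": 16 << 20,
    "64M": 64 << 20, "256M": 256 << 20,
}


def pages_needed(num_bytes, page_size):
    """Number of pages (and so TLB entries) needed to map num_bytes."""
    return -(-num_bytes // page_size)   # ceiling division


def tlb_reach(entries, page_size):
    """Bytes of address space covered when every entry maps one page."""
    return entries * page_size
```

Mapping a 1 GB object takes 262,144 entries with 4K pages but only 4 with 256M pages, and a 64-entry TLB of 16K pages reaches just 1 MB.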
Hardware Accessed Page Table
[Figure: inside the processor, bits 63:61 of the 64-bit address select a region register (RRx) that supplies the RID; the RID and virtual page # feed a hash, and the hardware searches the Virtual Hashed Page Table (VHPT) in memory.]
Flexible Hardware Mechanisms Enable Parallel Execution
Virtual Memory Model: Example
Virtual address space, by region:
• Region 0 - Different RID in each process
  - Unique address spaces for data (P1, P2, P3, P4)
• Region 1 - Same RID if shared
  - Single address space for code (P1,2,3,4)
• Region 2 - One RID, protection via multiple keys
  - Shared memory areas (shared by 1,2; by 3; by 1; by 2,3)
• ...
• Region 7 - One RID, no key
  - Kernel, protected by Priv. level
Flexible Virtual Memory Architecture Enables Variety Of Efficient OS Implementations
IA-64 System Architecture
• Virtual Memory Model
• Interruption Model
• System Software Stack
• Reliability, Availability, Serviceability
• Compatibility
IA-64 Interruption Model
• Parallel instruction execution, . . .
  - Exception delivery is sequential & precise
  - All exceptions reported on the excepting instruction (including numeric exceptions)
• "Interruption" is IA-64 term for...
  - Abort: hardware reset, machine check
  - Interrupt: asynchronous external event (device or platform management interrupt, soft-reset)
  - Fault: exception taken before instruction commit, e.g. TLB miss
  - Trap: exception taken after instruction commit, e.g. FP trap
IA-64 Provides Precise Exception Model To Match Today's OS Designs
IA-64 Interruption Process
Normal Instruction Execution Flow:
• Instruction A executed
[Figure: Application Code (0x1000 INST A, 0x1010 INST B, 0x1020 INST C, ...) with IP at INST A; IVT Code (0x4000 INST X, 0x4010 INST Y, 0x4020 RFI, ...); registers 0-15, banked registers 16-31 (BANK0 REG: OS data, BANK1 REG: app data), registers 32-127. Current Processor State: IP = 0x1000, PSR. Interruption Registers: IIP, IPSR.]
IA-64 Interruption Process
Normal Instruction Execution Flow:
• Instruction B executed
[Figure: Application Code (0x1000 INST A, 0x1010 INST B, 0x1020 INST C, ...) with IP at INST B; IVT Code (0x4000 INST X, 0x4010 INST Y, 0x4020 RFI, ...); registers 0-15, banked registers 16-31 (BANK0 REG: OS data, BANK1 REG: app data), registers 32-127. Current Processor State: IP = 0x1010, PSR. Interruption Registers: IIP, IPSR.]
IA-64 Interruption Process
Interruption Delivery (INTERRUPTION arrives with IP at 0x1010 INST B):
1. Bank switching: processor switches to Bank 0 registers, preparing to run IVT code
2. Processor saves state: processor saves current state to interruption registers before interrupt handling
[Figure: Application Code with IP at 0x1010 INST B; IVT Code (0x4000 INST X, 0x4010 INST Y, 0x4020 RFI, ...); registers 16-31 switch from BANK1 REG (app data) to BANK0 REG (OS data). Current Processor State: IP = 0x1010, PSR. Interruption Registers: IIP = 0x1010, IPSR.]
IA-64 Interruption Process
Interruption Handling:
• Instruction X executed in interrupt vector table
[Figure: Interrupt Vector Table (IVT) Code with IP at 0x4000 INST X; registers 16-31 use BANK0 REG (OS data) while BANK1 REG (app data) is held aside. Current Processor State: IP = 0x4000, PSR. Interruption Registers: IIP = 0x1010, IPSR.]
IA-64 Interruption Process
Interruption Handling:
• Instruction Y executed in interrupt vector table
[Figure: IVT Code with IP at 0x4010 INST Y; registers 16-31 use BANK0 REG (OS data) while BANK1 REG (app data) is held aside. Current Processor State: IP = 0x4010, PSR. Interruption Registers: IIP = 0x1010, IPSR.]
IA-64 Interruption Process
Restoring Pre-Interruption State (RETURN TO APP CODE, with IP at 0x4020 RFI):
1. Bank switching: processor switches back to Bank 1 registers
2. Processor restores state: processor restores state from interruption registers before returning from interrupt
[Figure: IVT Code with IP at 0x4020 RFI; registers 16-31 switch from BANK0 REG (OS data) back to BANK1 REG (app data). Current Processor State: IP = 0x4020, PSR. Interruption Registers: IIP = 0x1010, IPSR.]
IA-64 Interruption Process
Resume Normal Instruction Execution:
• Instruction B executed
[Figure: Application Code with IP back at 0x1010 INST B; IVT Code (0x4000 INST X, 0x4010 INST Y, 0x4020 RFI, ...); registers 16-31 use BANK1 REG (app data) again. Current Processor State: IP = 0x1010, PSR. Interruption Registers: IIP, IPSR.]
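The whole walkthrough (deliver, handle, return) fits in a small Python state machine. This is a toy model under stated assumptions: the class and method names are invented, PSR is reduced to the single piece of state the slides track (which register bank is active), and the pre-interruption state is captured before the bank switch so that the return restores Bank 1.

```python
BUNDLE_BYTES = 0x10   # instructions in the slides sit 0x10 bytes apart


class Cpu:
    def __init__(self, ip):
        self.ip = ip        # current instruction pointer
        self.bank = 1       # active bank for registers 16-31: 1 = app, 0 = OS
        self.iip = None     # interruption registers, filled in on delivery
        self.ipsr = None

    def step(self):
        """Execute one instruction and move on to the next."""
        self.ip += BUNDLE_BYTES

    def deliver_interruption(self, handler_ip):
        """Save state to the interruption registers and enter the IVT."""
        self.iip = self.ip
        self.ipsr = {"bank": self.bank}
        self.bank = 0               # run the handler on Bank 0
        self.ip = handler_ip

    def rfi(self):
        """Return from interruption: restore the bank and the saved IP."""
        self.bank = self.ipsr["bank"]
        self.ip = self.iip
```

Replaying the slides: start at 0x1000, step to 0x1010, deliver an interruption to 0x4000, step through 0x4010 to the RFI at 0x4020, and `rfi()` resumes at 0x1010 on Bank 1.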
Interruption Features
• Low interruption latency
  - Interruption delivery causes single pipeline break
  - Key state captured in on-chip registers
• State-save controlled by system software
  - Software makes performance/nesting trade-off
  - Shared mechanism for IA-64/IA-32 interruptions
• Efficient handler execution
  - Interruption vector table (IVT) contains code for interrupt service routine
Provides Fast And Flexible Interruptions For Large I/O Intensive Applications
Parallelism Across System Calls
Application Code (Application Address Space):
            ...
            // make system call
            br.call _write
            ...
EPC Page (PL promote and execute only):
  _write:   epc             // privilege promote without pipeline flush
            br os_write
            ...
Operating System Kernel (privileged code):
  os_write: ...             // perform system call
            br.ret          // demote PL and return to user
Fast System Calls Improve Synergy Between OS & Application
IA-64 External Interrupts
[Figure: several processors on the System Bus exchange IPI messages; one processor also has LINT0 / LINT1 pins (Intel 8259A compatible). A Bridge joins the System Bus to the I/O Bus, where an External Interrupt Controller, a Device w/ Interrupt Controller and plain Devices send Interrupt messages.]
High Performance Message-Based Interrupts Compatible With Today's Platforms
IA-64 System Architecture
• Virtual Memory Model
• Interruption Model
• System Software Stack
• Reliability, Availability, Serviceability
• Compatibility
IA-64 System Software Stack: OS Boot
[Figure: software stack made up of Operating System Software, EFI, System Abstraction Layer (SAL) with IA-32 BIOS, Processor Abstraction Layer (PAL), Processor (hardware) and Platform (hardware). Labeled paths: OS boot; Access to platform resources; Reset, machine checks.]
OS Running
[Figure: software stack made up of Operating System Software, EFI, System Abstraction Layer (SAL) with IA-32 BIOS, Processor Abstraction Layer (PAL), Processor (hardware) and Platform (hardware). Labeled paths: Instructions; Interruptions; External Interrupts (performance critical); I/O.]
OS Calls To Firmware Services
[Figure: software stack made up of Operating System Software, EFI, System Abstraction Layer (SAL) with IA-32 BIOS, Processor Abstraction Layer (PAL), Processor (hardware) and Platform (hardware). Labeled paths: Run-Time Services; Access to platform resources.]
Machine Check Handling
[Figure: software stack made up of Operating System Software, EFI, System Abstraction Layer (SAL) with IA-32 BIOS, Processor Abstraction Layer (PAL), Processor (hardware) and Platform (hardware). Labeled paths: Reset, machine checks; Machine Check Services; Access to platform resources.]
Architected RAS Features
• Reliability
  - 3 levels of error signaling: continuable, local, and global
• Availability
  - Fine grained error containment by cooperation between hardware and firmware
• Serviceability
  - Extensive error logs for error analysis
  - Common error logs for firmware and OS
Advanced Machine Check Architecture For High Levels of Reliability, Availability, And Serviceability
IA-64 System Architecture
• Virtual Memory Model
• Interruption Model
• System Software Stack
• Reliability, Availability, Serviceability
• Compatibility
Compatibility
• IA-64 supports IA-32 OS
  - Capable of running unmodified multi-processing IA-32 OS, e.g. NT4.0, Linux
• IA-64 OS supports IA-32 Platform peripherals
  - IA-64 support for legacy I/O port space
• Dependent upon OS & platform implementation
IA-64 Offers Full IA-32 Compatibility In Hardware: Platforms, OS, Applications
Agenda: IA-64 Architecture
• EPIC 101
  - Application Architecture
  - System Architecture
  - Itanium Micro-architecture
• Itanium Update
• Useful URLs
EPIC Design Maximizes SW-HW Synergy
Architecture Features programmed by compiler:
• Branch Hints
• Memory Hints
• Register Stack & Rotation
• Explicit Parallelism
• Predication
• Data & Control Speculation
Micro-architecture Features in hardware:
• Fetch: Instruction Cache & Branch Predictors
• Memory Subsystem: Three levels of cache (L1, L2, L3)
• Register Handling: 128 GR & 128 FR, Register Remap & Stack Engine
• Issue Control: Fast, Simple 6-Issue
• Bypasses & Dependencies
• Parallel Resources: 4 Integer + 4 MMX Units, 2 FMACs (4 for SSE), 2 LD/ST units, 32 entry ALAT
• Speculation Deferral Management
Intel® Itanium™ Processor Block Diagram
[Figure: block diagram with these labeled blocks: L1 Instruction Cache and Fetch/Pre-fetch Engine with ITLB; Branch Prediction; IA-32 Decode and Control; Instruction Queue (8 bundles); 9 Issue Ports (B B B M M I I F F); Register Stack Engine / Re-Mapping; Branch & Predicate Registers; 128 Integer Registers; 128 FP Registers; Scoreboard, Predicate, NaTs, Exceptions; Branch Units; Integer and MM Units; Floating Point Units with two SIMD FMACs; ALAT; Dual-Port L1 Data Cache and DTLB; L2 Cache; L3 Cache; Bus Controller. Several blocks are marked ECC.]
Floating Point Features
• Native 82-bit hardware provides support for multiple numeric models
• 2 Extended precision pipelined FMACs deliver 4 EP / DP FLOPs/cycle
• Performance for security and 3-D graphics
• 2 Additional single-precision FMACs for 8 SP FLOPs/cycle (SIMD)
• Efficient use of hardware: Integer multiply-add and s/w divide
• Balanced with plenty of operand bandwidth from registers / memory
[Figure: 128 entry 82-bit RF with 6 x 82-bit operands and 2 x 82-bit results; L2 Cache and 4 Mbyte L3 Cache; data paths labeled 4 DP Ops/clk (2 x Fld-pair), 2 DP Ops/clk and 2 stores/clk; even and odd banks.]
Itanium™ processor delivers industry-leading floating point performance
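The per-cycle figures translate into peak rates with one multiplication. The Python sketch below counts a fused multiply-add as two floating-point operations and uses the 800 MHz production frequency quoted elsewhere in this deck; the function names are illustrative, and the results are theoretical peaks, not measured performance.

```python
CLOCK_HZ = 800_000_000   # 800 MHz production frequency from the Itanium update
FLOPS_PER_FMA = 2        # one multiply plus one add


def flops_per_cycle(fmac_lanes):
    """FLOPs per cycle when every FMAC lane retires one fused multiply-add."""
    return fmac_lanes * FLOPS_PER_FMA


def peak_gflops(per_cycle, clock_hz=CLOCK_HZ):
    """Theoretical peak in GFLOPS for a given FLOPs/cycle figure."""
    return per_cycle * clock_hz / 1e9
```

Two FMACs give `flops_per_cycle(2) == 4`, so the 4 EP / DP FLOPs/cycle on the slide corresponds to a 3.2 GFLOPS peak, and the 8 SP FLOPs/cycle SIMD figure to 6.4 GFLOPS.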
Example: Memory Latency
unrolled_loop:
        ld8     t0=[src],32             // four loads, post-incrementing by 32
        ld8     t1=[src2],32
        add     loopcnt=-1,loopcnt      // count down
        ld8     t2=[src3],32
        ld8     t3=[src4],32
        ;;
        ld8     t4=[src],32
        ld8     t5=[src2],32
        cmp.ne  p8,p9=r0,loopcnt        // p8 set while loopcnt != 0
        ld8     t6=[src3],32
        ld8     t7=[src4],32
        lfetch.nta      [sf],64         // prefetch ahead of the loads
        lfetch.excl.nta [df],64         // prefetch ahead of the stores
        st8     [dst]=t0,32
        st8     [dst2]=t1,32
        st8     [dst3]=t2,32
        st8     [dst4]=t3,32
        ;;
        st8     [dst]=t4,32
        st8     [dst2]=t5,32
        st8     [dst3]=t6,32
        st8     [dst4]=t7,32
(p8)    br.cond.sptk.few unrolled_loop
Agenda: IA-64 Architecture
• EPIC 101
  - Application Architecture
  - System Architecture
  - Itanium Micro-architecture
• Itanium Update
• Useful URLs
Itanium™ Processor
• 800 MHz production frequency
  - Up to 20 operations per clock
• 4 MB high speed on-cartridge L3 cache
• Over 320M transistors
  - 25M in CPU, 295M in L3 cache
• 2.1 GB/s system bus
  - Enhanced Defer Mechanism enables high scalability through improved bus efficiency
• Extensive reliability and availability features
  - ECC, parity protection, enhanced MCA
• Excellent functionality on initial silicon
  - No architectural or ISA changes planned
Itanium™ Processor System Architecture
[Figure: 460GX system block diagram with components labeled MAC and MDC, buses F16[0] to F16[3], the 82460GX parts SDC, SAC, WXB (three), PXB, PID and IFB, and FWH]
• Intel 460GX Chipset
  - Support for 1-4 processors
  - Dual memory ports
    - 4.2 GB/s
    - Up to 64 GB SDRAM
  - 64b / 66MHz PCI Hot Plug I/O
  - Extensive ECC, parity protection
  - Full-speed front-side bus operation achieved in MP environment with pre-production samples
• Over 30 OEM system designs; multiple custom chipsets
  - Over five 8 processor and greater system designs
  - Multiple OEM chipsets powered up in 2H '99
IA-64 Platform Launch Readiness

SOFTWARE
• Workstation Applications (ISVs): Mental Ray, Softimage, Nastran ...
• Server Applications (ISVs): Oracle 8i, SQL, SAP, IIS ...
• Software Tools (Intel/ISVs): C++, Fortran, Java, other offerings from Microsoft, EPC, IBM ...
• Operating Systems (OSVs): 64-bit Windows, Unix / Linux, Novell developer releases

HARDWARE
• Hardware, I/O, Graphics (IHVs): Adaptec, Q-Logic, 3-D Labs, Matrox …; Fast Track Driver program
• System Designs (OEMs): 2P workstations, 4P to 512P servers
• Chipsets (Intel/Industry): Intel 460 GX PCI-set; custom OEM chipset designs supporting high MP systems
• Processor (Intel): IA-64 processor roadmap; over 5 products identified, more planned

All Elements of IA-64 Program Converging for Successful Solution Launch
IA-64 Application Program Summary

Workstation ISVs publicly committed to IA-64:
Adobe, Alias/Wavefront, Avid/SoftImage, Cadence, Dassault, Discreet, Flometrics, Infinity, Invent Computing, Lizard Tech, Magma, Mental Images, Mentor Graphics, Molecular Simulations Inc., MSC, Parametric, Risk Metrics, SCALI, Synopsys, Unigraphics, Viewlogic, Visual Insights

Server ISVs publicly committed to IA-64:
Ariba, Allegis, AltoWeb, Apogee Networks, Baan, BEA Weblogic, Brokat, Entrust, Extricity, IBM Software, Informix, IONA Technologies, Lutris, Microsoft, Nuance, Oasis, Oblix, Oracle, People Soft, Persistence, Relativity Technologies, RSA, SAP, SAS, Selectica, Silknet, Silverstream, Softway, Speechworks, TimesTen, Torrent, Verisign, Webline

IA-64 Workstation focus applications
Segment   Focus areas
DCC       Rendering, Editing, 3D Animation
EDA       Verification, Synthesis, DRC
MDA       FEA, Modeling, Hi-end CAE
Finance   Equity, Treasury, Risk Analysis
Other     CFD, GIS, Molecular Modeling

IA-64 Server focus applications
• Very Large Databases
  – Data Warehousing
  – Decision Support
  – OLTP and OLAP
  – ERP and LOB
  – Customer Management
• E-Business Services
  – Security Services
  – Directory Services
  – VPN/IP Gateways
  – ISP Dedicated Switches
• Scientific / Technical Computing
  – Computer Aided Engineering/Design
  – Finite Element Analysis
  – Fluid Dynamics and Simulations

IA-64 Software Program Increasing Depth and Breadth of IA-64 Software Ramp
McKinley Processor Feature Overview
• Enhanced Itanium™ processor microarchitecture & system bus
  – Fully binary compatible in hardware with Itanium™ processor
  – Expanded resources including more load/store ports and ALUs
  – On-chip L3 cache
• Builds upon Itanium™ platform infrastructure
  – Reuses key technologies: bus protocols, power delivery technology, software tools, other key platform components
• Continued focus on high availability for e-business
  – Extensive error detection & correction
  – System management bus with on-package power thermal management
• Production target: end of 2001
Extends IA-64 capability for end ’01 timeframe
IA-64 Docs & URLs
• IA-64 Software Developer’s Manual
  – Info for system & application software, & development tools for IA-64
  – Software optimization techniques
  – Performance monitoring info for optimization support
• More IA-64 Documentation:
  – IA-64 Software Conventions and Runtime Architecture Guide
  – Assembly Language Reference Guide
  – IA-64 assembler & reference guide
  – IA-64 Processor-specific Application Binary Interface
  – System Abstraction Layer Specification
  and more …
IA-64 Docs Available On Internet: developer.intel.com/design/ia-64/devinfo.htm
Backup
IA-64™ Architecture Innovations: Outline
• Background
• IA-64 at work: Code Examples
  – xlxgetvalue from LI
    – control speculation to chase pointers
  – puzzle code fragment
    – loop with nested if statements
  – treeins code fragment
    – classic if-then-else statement
• Summary
Xlxgetvalue in a Nutshell
• Code fragment from SpecInt95 benchmark LI
  – representative of pointer chasing code
• Technique used: Control Speculation
• Benefits:
  – hide memory latency
  – expose Instruction Level Parallelism allowing parallel execution
Example Machine Model for xlxgetvalue
[Figure: machine block diagram showing Instruction Pointer, L0 Icache, Instruction Decode and Dispatch, Resteer, Register File and L0 DCache. Annotations: 6 functional units, 2 memory ports, 1 cycle load latency, 8 cycle branch mispredict.]
6 Execution units, 2 memory ports, 1 cycle load latency
Xlxgetvalue Step by Step
• Compile one iteration
  – use control speculation to issue loads as early as possible
• Unroll the loop
  – use control speculation to start next iteration before it is safe to do so
  – take advantage of the machine width to execute several iterations in parallel
Expose ILP with Control Speculation in pointer chasing code
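The idea behind the Load.s / check.s pairs used on the following slides can be sketched in C. This is only a software model under stated assumptions: the names `spec_val`, `load_s`, `chk_s` and `guarded_read` are hypothetical, the deferred-exception token is modelled as a plain flag, and only a null address is treated as invalid. On IA-64 the equivalent happens in hardware registers, not in library code.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of a register that can carry a deferred-exception token. */
typedef struct {
    long value;
    bool deferred;   /* set when the speculative load could not complete */
} spec_val;

/* Model of a speculative load: it never faults, it defers the problem. */
static spec_val load_s(const long *addr)
{
    spec_val r = { 0, false };
    if (addr == NULL)
        r.deferred = true;   /* assumption: only NULL counts as invalid here */
    else
        r.value = *addr;
    return r;
}

/* Model of the check: true means speculation failed and recovery must run. */
static bool chk_s(spec_val v)
{
    return v.deferred;
}

/* The load is hoisted above the guard; the check stays where the load was.
   Returns *p when p is valid, otherwise fallback. */
long guarded_read(const long *p, long fallback)
{
    spec_val v = load_s(p);  /* issued early, before p is known to be valid */
    if (p == NULL)           /* the original guard */
        return fallback;
    if (chk_s(v))            /* never true for a non-null p in this model */
        return *p;           /* recovery: redo the load non-speculatively */
    return v.value;
}
```

Calling `guarded_read` with a valid pointer returns the pointed-to value; calling it with `NULL` returns the fallback without ever dereferencing the pointer, which is the property that lets a compiler move the load above the branch.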
Xlxgetvalue Code Fragment

    for (fp = xlenv; fp; fp = cdr(fp))
        for (ep = car(fp); ep; ep = cdr(ep))
            if (sym == car(car(ep)))
                return (cdr(car(ep)));

Corresponding operation sequence:

    Ld fp
    cmp fp == nil
    br to exit if true
    Load ep
    Cond1 = (cmp ep == nil)
    br nxt_fp if Cond1
    load car(ep)
    load x = car(car(ep))
    Cond2 = (cmp sym == x)
    br to return if Cond2
    br nxt_ep
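For readers who want to compile the fragment, here is a minimal self-contained sketch. The `node` type, its field names and the function name `lookup` are assumptions made for illustration; the real LI/XLISP node structure is richer than a bare pair of pointers.

```c
#include <stddef.h>

/* Hypothetical cons cell: just a car pointer and a cdr pointer. */
typedef struct node {
    struct node *n_car;
    struct node *n_cdr;
} node;

#define car(n) ((n)->n_car)
#define cdr(n) ((n)->n_cdr)

/* Walk every frame fp of the environment, then every (symbol . value)
   pair ep in that frame, and return the value bound to sym.
   Returns NULL when sym is not bound in any frame. */
node *lookup(node *xlenv, node *sym)
{
    node *fp, *ep;

    for (fp = xlenv; fp; fp = cdr(fp))
        for (ep = car(fp); ep; ep = cdr(ep))
            if (sym == car(car(ep)))
                return cdr(car(ep));
    return NULL;
}
```

Every step of the inner loop depends on a pointer loaded in the previous step (`ep`, then `car(ep)`, then `car(car(ep))`), which is why the slides describe this as pointer chasing code.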
Compiling . . .
    for (ep = car(fp); ep; ep = cdr(ep))
        if (sym == car(car(ep)))

Cycle  Operations issued (Units 1-6)
1      Load ep1
2      (empty)
3      (empty)
4      (empty)
5      (empty)
Compiling . . .
    for (ep = car(fp); ep; ep = cdr(ep))
        if (sym == car(car(ep)))

Cycle  Operations issued (Units 1-6)
1      Load ep1
2      Cond1 = (ep1 == nil)
3      (empty)
4      (empty)
5      (empty)
Compiling . . .
    for (ep = car(fp); ep; ep = cdr(ep))
        if (sym == car(car(ep)))

Cycle  Operations issued (Units 1-6)
1      Load ep1
2      Cond1 = (ep1 == nil); Load.s car(ep1)
3      check.s
4      (empty)
5      (empty)
Compiling . . .
    for (ep = car(fp); ep; ep = cdr(ep))
        if (sym == car(car(ep)))

Cycle  Operations issued (Units 1-6)
1      Load ep1
2      Cond1 = (ep1 == nil); Load.s car(ep1)
3      check.s; Load x = car(car(ep1))
4      (empty)
5      (empty)
Compiling . . .
    for (ep = car(fp); ep; ep = cdr(ep))
        if (sym == car(car(ep)))

Cycle  Operations issued (Units 1-6)
1      Load ep1
2      Cond1 = (ep1 == nil); Load.s car(ep1)
3      check.s; Load x = car(car(ep1))
4      Cond2 = (sym == x)
5      (empty)
First Iteration
    for (ep = car(fp); ep; ep = cdr(ep))
        if (sym == car(car(ep)))
            return (cdr(car(ep)));

Cycle  Operations issued (Units 1-6)
1      Load ep1
2      Cond1 = (ep1 == nil); Load.s car(ep1); Br nxt_fp if cond1
3      check.s; Load x = car(car(ep1))
4      Cond2 = (sym == x); Br return if cond2; Br nxt_ep
5      (empty)

Speculation allows the loads to be started early
Second Iteration: Unrolling . . .
    for (ep = car(fp); ep; ep = cdr(ep))
        if (sym == car(car(ep)))

Cycle  Operations issued (Units 1-6)
1      Ld ep1
2      Ld.s car(ep1); Cond1 = Cmp ep1 == nil; Ld.s ep2 = cdr(ep1); Br nxt_fp if cond1
3      Check.s; Ld car(car(ep1))
4      Cond2 = Cmp car(car(ep1)) == sym; Br return if cond2
5      (empty)
Second Iteration: Unrolling . . .
    for (ep = car(fp); ep; ep = cdr(ep))
        if (sym == car(car(ep)))

Cycle  Operations issued (Units 1-6)
1      Ld ep1
2      Ld.s car(ep1); Cond1 = Cmp ep1 == nil; Ld.s ep2 = cdr(ep1); Br nxt_fp if cond1
3      Check.s; Ld car(car(ep1)); Ld.s car(ep2); Cond3 = Cmp ep2 == nil
4      Cond2 = Cmp car(car(ep1)) == sym; Check.s; Br return if cond2; Br nxt_fp if cond3
5      (empty)

Only 1 check for 2 dependent loads
Second Iteration: Unrolling . . .
    for (ep = car(fp); ep; ep = cdr(ep))
        if (sym == car(car(ep)))

Cycle  Operations issued (Units 1-6)
1      Ld ep1
2      Ld.s car(ep1); Cond1 = Cmp ep1 == nil; Ld.s ep2 = cdr(ep1); Br nxt_fp if cond1
3      Check.s; Ld car(car(ep1)); Ld.s car(ep2); Cond3 = Cmp ep2 == nil
4      Cond2 = Cmp car(car(ep1)) == sym; Check.s; Ld car(car(ep2)); Br return if cond2; Br nxt_fp if cond3
5      (empty)
Second Iteration: Unrolling . . .
    for (ep = car(fp); ep; ep = cdr(ep))
        if (sym == car(car(ep)))

Cycle  Operations issued (Units 1-6)
1      Ld ep1
2      Ld.s car(ep1); Cond1 = Cmp ep1 == nil; Ld.s ep2 = cdr(ep1); Br nxt_fp if cond1
3      Check.s; Ld car(car(ep1)); Ld.s car(ep2); Cond3 = Cmp ep2 == nil
4      Cond2 = Cmp car(car(ep1)) == sym; Check.s; Ld car(car(ep2)); Br return if cond2; Br nxt_fp if cond3
5      Cond4 = Cmp car(car(ep2)) == sym; Br return if cond4; Br nxt_ep

Speculation enables efficient machine utilization
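At source level, the unroll-by-two shape that this schedule corresponds to might look like the following C sketch. The `node` type and the function name `scan_frame_unrolled` are assumptions for illustration. Plain C has no speculative load, so here `ep2` is read only after `ep1` is known to be non-null, whereas the schedule issues that read early as Ld.s.

```c
#include <stddef.h>

/* Hypothetical cons cell, as in a minimal Lisp interpreter. */
typedef struct node {
    struct node *n_car;
    struct node *n_cdr;
} node;

/* Inner loop of the lookup, unrolled by two.
   Returns the value bound to sym in this frame, or NULL if absent. */
node *scan_frame_unrolled(node *ep1, node *sym)
{
    while (ep1) {                          /* Cond1: ep1 == nil ends the frame */
        node *ep2 = ep1->n_cdr;            /* ep2 = cdr(ep1) */

        if (sym == ep1->n_car->n_car)      /* Cond2: first unrolled iteration */
            return ep1->n_car->n_cdr;
        if (!ep2)                          /* Cond3: ep2 == nil ends the frame */
            return NULL;
        if (sym == ep2->n_car->n_car)      /* Cond4: second unrolled iteration */
            return ep2->n_car->n_cdr;

        ep1 = ep2->n_cdr;                  /* next ep1 = cdr(ep2) */
    }
    return NULL;
}
```

Each pass through the loop body handles two list elements, so the two compares and their feeding loads are independent work that a wide machine can overlap.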
Optimized Code
    for (ep = car(fp); ep; ep = cdr(ep))
        if (sym == car(car(ep)))

Cycle  Operations issued (Units 1-6)
0      Ld ep1 (done outside of the loop)
1      Ld.s car(ep1); Cond1 = Cmp ep1 == nil; Ld.s ep2 = cdr(ep1); Br nxt_fp if cond1
2      Check.s; Ld car(car(ep1)); Ld.s car(ep2); Cond3 = Cmp ep2 == nil
3      Cond2 = Cmp car(car(ep1)) == sym; Check.s; Ld car(car(ep2)); Br return if cond2; Br nxt_fp if cond3
4      Ld nxt ep1 = cdr(ep2); Cond4 = Cmp car(car(ep2)) == sym; Br return if cond4; Br nxt_ep

First load can be done at the bottom of the loop
Scheduled without Control Speculation

Cycle  Operations issued (Units 1-6)
0      Ld ep1 (done outside of the loop)
1      Ld.s car(ep1); Cond1 = Cmp ep1 == nil; Ld.s ep2 = cdr(ep1); Br nxt_fp if cond1
2      Check.s; Ld car(car(ep1)); Ld.s car(ep2); Cond3 = Cmp ep2 == nil
3      Cond2 = Cmp car(car(ep1)) == sym; Check.s; Ld car(car(ep2)); Br return if cond2; Br nxt_fp if cond3
4      Ld nxt ep1 = cdr(ep2); Cond4 = Cmp car(car(ep2)) == sym; Br return if cond4; Br nxt_ep

Loads are delayed by one clock
Recompiled without Speculation

Cycle  Operations issued (Units 1-6)
1      Cond1 = Cmp ep1 == nil; Br nxt_fp if cond1
2      Ld car(ep1); Ld ep2 = cdr(ep1)
3      Ld car(car(ep1)); Cond3 = Cmp ep2 == nil
4      Cond2 = Cmp car(car(ep1)) == sym; Ld car(ep2); Br return if cond2; Br nxt_fp if cond3
5      Ld car(car(ep2))
6      Ld nxt ep1 = cdr(ep2); Cond4 = Cmp car(car(ep2)) == sym; Br return if cond4; Br nxt_ep

Inefficient use of machine width
xlxgetvalue Conclusions
• Control speculation benefits:
  – hides memory latency through
    – loading data before knowing if the address is a valid pointer
    – loading data before knowing if the next loop iteration is valid
  – enables the compiler to expose parallelism in pointer chasing code
On average over 50% of loads can be executed speculatively
Scoreboard

              With speculation          Without speculation
xlxgetvalue   2 cycles per iteration    3 cycles per iteration

Speculation provides a significant performance advantage
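The scoreboard figures follow from the schedules, on one reading of them: the optimized loop body occupies cycles 1 to 4 for two unrolled iterations, and the version recompiled without speculation occupies cycles 1 to 6 for the same two iterations. A small C sketch of that arithmetic, with hypothetical function names:

```c
/* Cycles per source-level iteration for an unrolled loop body. */
double cycles_per_iteration(int cycles_per_pass, int iterations_per_pass)
{
    return (double)cycles_per_pass / iterations_per_pass;
}

/* How many times faster the quicker schedule is than the slower one. */
double speedup(double slow_cycles, double fast_cycles)
{
    return slow_cycles / fast_cycles;
}
```

With these inputs, `cycles_per_iteration(4, 2)` gives 2 and `cycles_per_iteration(6, 2)` gives 3, so `speedup(3.0, 2.0)` is 1.5 for this loop on the example machine model.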