IA-64 Architecture

Sunil Saxena
Principal Engineer
Intel Corporation
Intel Labs

September 11th, 2000

Copyright © 2000 Intel Corporation. Linux Supercluster Users Conference

IA Processor Roadmap

[Figure: performance versus time, '99 to '02, with process generations .25µ, .18µ, .13µ.]

• IA-64: Extends IA Headroom, Scalability and Availability for the Most Demanding Environments
  – Itanium™ processor
  – McKinley
  – Madison (IA-64 Perf)
  – Deerfield (IA-64 Price/Perf)
• IA-32: Outstanding Performance for 32 Bit Volume Apps
  – Pentium® III Xeon™ processor
  – Cascades
  – Foster
  – Future IA-32

Strong Execution on Itanium™ Processor, Continued Focus on the Long Term

Agenda: IA-64 Architecture

• EPIC 101
  – Application Architecture
  – System Architecture
  – Itanium Microarchitecture
• Itanium Update
• Useful URLs

EPIC Design Philosophy

• Maximize performance via hardware & software synergy
• Advanced features enhance instruction level parallelism
  – Predication, Speculation, ...
• Massive hardware resources for parallel execution
• High performance EPIC building block

Achieving performance at the most fundamental level

[Figure: performance versus time, rising through CISC, RISC, OOO / SuperScalar, VLIW, and EPIC.]

Instruction Format: Explicit Parallelism

[Figure: 128-bit bundle, bit 127 down to bit 0: Instruction 2 | Instruction 1 | Instruction 0 | Template]

• Breaking the sequential execution paradigm
  – Explicit instruction dependency: template
  – Flexibly groups any number of independent instructions
• Explicitly scheduled parallelism
  – Enables compiler to create greater parallelism
  – Simplifies hardware by removing dynamic mechanisms
• Fully interlocked: hardware provides compatibility

The new instruction format enables scalability with compatibility
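A minimal Python sketch of the bundle layout shown on this slide. The field widths (a 5-bit template in bits 4:0 with three 41-bit instruction slots above it) come from the IA-64 architecture definition rather than from the slide itself, and the helper names `pack_bundle` and `unpack_bundle` are hypothetical.

```python
TEMPLATE_BITS = 5    # template occupies bits 4:0 of the bundle
SLOT_BITS = 41       # each of the three instruction slots is 41 bits wide


def pack_bundle(template, slots):
    """Assemble a 128-bit bundle from a template and three slot values."""
    if not 0 <= template < (1 << TEMPLATE_BITS):
        raise ValueError("template must fit in 5 bits")
    if len(slots) != 3:
        raise ValueError("a bundle holds exactly three instruction slots")
    bundle = template
    for i, slot in enumerate(slots):
        if not 0 <= slot < (1 << SLOT_BITS):
            raise ValueError("each slot must fit in 41 bits")
        bundle |= slot << (TEMPLATE_BITS + i * SLOT_BITS)
    return bundle


def unpack_bundle(bundle):
    """Split a 128-bit bundle into (template, (slot0, slot1, slot2))."""
    template = bundle & ((1 << TEMPLATE_BITS) - 1)
    slots = tuple(
        (bundle >> (TEMPLATE_BITS + i * SLOT_BITS)) & ((1 << SLOT_BITS) - 1)
        for i in range(3)
    )
    return template, slots
```

The template travels with the instructions, so a decoder can read it before looking at any slot; that is what lets the hardware learn the grouping without discovering dependencies dynamically.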

Branches Limit Performance

Control flow introduces branches

Source:

    if a[i].ptr != 0
        b[i] = a[i].l;
    else
        b[i] = a[i].r;
    i = i + 1

Traditional architectures: 4 basic blocks

    if:     load a[i].ptr
            p1, p2 = cmp a[i].ptr != 0
            branch if p2
    then:   load a[i].l
            store b[i]
            branch
    else:   load a[i].r
            store b[i]
    join:   i = i + 1

Predication

Source:

    if a[i].ptr != 0
        b[i] = a[i].l;
    else
        b[i] = a[i].r;
    i = i + 1

Predicated form:

    if:     load a[i].ptr
            p1, p2 = cmp a[i].ptr != 0
            branch if p2
    then:   <p1> load a[i].l
            <p1> store b[i]
            branch
    else:   <p2> load a[i].r
            <p2> store b[i]
    join:   i = i + 1

Predication removes branches and eliminates mispredicts

Predication Enhances Parallelism

Traditional architectures: 4 basic blocks

    if:     load a[i].ptr
            p1, p2 = cmp a[i].ptr != 0
            jump if p2
    then:   load a[i].l
            store b[i]
            jump
    else:   load a[i].r
            store b[i]
    join:   i = i + 1

IA-64™ architecture: 1 basic block

    load a[i].ptr
    p1, p2 = cmp a[i].ptr != 0
    <p1> load a[i].l
    <p1> store b[i]
    <p2> load a[i].r
    <p2> store b[i]
    i = i + 1

Predication enables more effective use of parallel hardware
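The if-conversion on this slide can be modeled in a few lines of Python. This is only an illustrative sketch: the record layout (dictionaries with `ptr`, `l`, `r` keys) and both function names are assumptions, and Python's conditional expression merely stands in for a hardware predicate guard.

```python
def copy_with_branches(a):
    """Branching form: exactly one of the two paths runs per element."""
    b = [None] * len(a)
    for i, rec in enumerate(a):
        if rec["ptr"] != 0:
            b[i] = rec["l"]
        else:
            b[i] = rec["r"]
    return b


def copy_with_predicates(a):
    """If-converted form: both guarded operations are issued for every
    element, and only the one whose predicate is true takes effect."""
    b = [None] * len(a)
    for i, rec in enumerate(a):
        p1 = rec["ptr"] != 0      # p1, p2 = cmp a[i].ptr != 0
        p2 = not p1               # the compare writes complementary predicates
        for predicate, value in ((p1, rec["l"]), (p2, rec["r"])):
            # Models a predicated store: a false predicate makes it a no-op.
            b[i] = value if predicate else b[i]
    return b
```

Both functions produce the same result; the second has no data-dependent control flow inside the loop body, which is the property that lets the compiler schedule the `<p1>` and `<p2>` operations side by side.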

Memory Latency Causes Delays

• Loads significantly affect performance
  – Often first instruction in dependency chain of instructions
  – Can incur high latencies
• Loads can cause exceptions

Source:

    t1 = t1 + 1
    if t1 > t2
        j = a[t1 - t2]
        b[j]++

Traditional architectures:

    add t1 + 1
    comp t1 > t2
    branch
    ---- barrier ----
    load a[t1 - t2]
    load b[j]
    add b[j] + 1

Speculation with IA-64™ Architecture

• Separate load behavior from exception behavior
  – Speculative load instruction (load.s) initiates a load operation and detects exceptions
  – Propagate an exception "token" (stored with destination register) from load.s to check.s
  – Speculative check instruction (check.s) delivers any exceptions detected by load.s

    add t1 + 1
    load.s a[t1 - t2]       ; exception detection
    comp t1 > t2
    jump
                            ; exception token propagates
    check.s                 ; exception delivery
    load b[j]
    add b[j] + 1

Speculation Minimizes the Effect of Memory Latency

• Give scheduling freedom to the compiler
  – Allows load.s to be scheduled above branches
  – check.s remains in home block, branches to fixup code if an exception is propagated

Traditional architectures:

    add t1 + 1
    comp t1 > t2
    jump
    ---- barrier ----
    load a[t1 - t2]
    load b[j]
    add b[j] + 1

IA-64 architecture:

    add t1 + 1
    load.s a[t1 - t2]       ; exception detection
    comp t1 > t2
    jump
                            ; exception token propagates
    check.s                 ; exception delivery
    load b[j]
    add b[j] + 1
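A small Python model of the load.s / check.s pairing, offered as a sketch rather than a description of the hardware. The sentinel `NAT`, the bounds check that stands in for a faulting access, and the function names are all assumptions made for illustration.

```python
NAT = object()   # stands in for the deferred-exception token


def load_s(memory, index):
    """Speculative load: never raises; returns NAT if the access would fault."""
    if 0 <= index < len(memory):
        return memory[index]
    return NAT


def check_s(value, recover):
    """Speculation check: run the fix-up code only if a token was propagated."""
    return recover() if value is NAT else value


def update(a, b, t1, t2):
    """t1 = t1 + 1; if t1 > t2: j = a[t1 - t2]; b[j] += 1"""
    t1 = t1 + 1
    j_spec = load_s(a, t1 - t2)        # hoisted above the compare and branch
    if t1 > t2:
        # Home block: the check stays here; fix-up redoes the load normally.
        j = check_s(j_spec, lambda: a[t1 - t2])
        b[j] += 1
    return t1
```

When `t1 <= t2` the speculative load may have touched an invalid index, but no check ever runs, so no exception is delivered for a load the original program would not have executed.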

Predication & Speculation

Source:

    if a[i].ptr != 0
        b[i] = a[i].l;
    else
        b[i] = a[i].r;
    i = i + 1

With predication:

    load a[i].ptr
    p1, p2 = cmp a[i].ptr != 0
    <p1> load a[i].l
    <p1> store b[i]
    <p2> load a[i].r
    <p2> store b[i]
    i = i + 1

With predication & speculation:

    load a[i].ptr
    load.s a[i].l       load.s a[i].r
    p1, p2 = cmp a[i].ptr != 0
    <p1> check.s
    <p1> store b[i]
    <p2> check.s
    <p2> store b[i]
    i = i + 1

Predication and Speculation = higher ILP


IA-64 System Architecture

• Virtual Memory Model
• Interruption Model
• System Software Stack
• Reliability, Availability, Serviceability
• Compatibility

IA-64 Virtual Memory Model

• Process Address Space
• System Address Space Management
• Virtual Address Translation
  – TLB and Page table
• Flexible Object Sharing Model
  – Aliasing and Global addressing

Process Address Space

• Flat virtual space: 2^64 bytes, from 0 to 2^64, reached with a 64-bit address (bits 63:0)
• 8 regions per process, holding for example Code/Text, Data/Heap, DLLs, and the OS Kernel

System Address Space

[Figure: the 64-bit address (bits 63:0) of a process's 0 to 2^64 space maps into a system address space of ≥ 2^18 regions, each divided into pages.]

IA-64 Region Registers

• 8 region registers, selected by bits 63:61 of the 64-bit address; bits 60:0 address within the region
• Each region register holds a region ID (RID)
• ≥ 2^18 regions, each 2^61 bytes in size and divided into pages
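The address split on this slide can be sketched in Python as follows. The function names and the list-of-integers model of the eight region registers are assumptions for illustration; only the bit positions (63:61 for the region register index, 60:0 for the offset) are taken from the slide.

```python
REGION_BITS = 3                       # address bits 63:61 select a region register
OFFSET_BITS = 64 - REGION_BITS        # remaining 61 bits: offset within the region


def split_virtual_address(va):
    """Return (region register index, offset within region) for a 64-bit address."""
    va &= (1 << 64) - 1
    return va >> OFFSET_BITS, va & ((1 << OFFSET_BITS) - 1)


def to_system_address(va, region_registers):
    """Replace the 3-bit region index with the RID held in that region register."""
    index, offset = split_virtual_address(va)
    return region_registers[index], offset
```

Because the RID is wider than the 3-bit index it replaces, the (RID, offset) pair names a location in a system-wide space that is much larger than any single process's 2^64 bytes.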

Processes and Threads

[Figure: Process 1 and Process 2, each with its own set of regions.]

Regions Enable Efficient Management Of Processes For Multi-tasking Environments

Virtual Address Translation: TLB

• Mapping to Physical Address
• Access Rights

[Figure: virtual addresses of Process A, Process B, and Process C pass through the TLB to physical addresses.]

TLB Organization

• Separate instruction and data TLBs
• Software manages
  – TR entries
  – Page-table updates
• Hardware manages
  – TC TLB refill
  – Broadcast TLB purge

[Figure: instruction side with ITC and ITR; data side with DTC and DTR.]

Balance TLB For Efficient Memory Management

Virtual Address Translation

[Figure: the 64-bit address is split into RRx (bits 63:61), virtual page #, and offset (bits 60:0). RRx selects one of the region registers, which supplies the RID. The TLB holds entries of the form RID | Virtual Page # | Physical Page # | Protection. The RID and virtual page # are compared against the TLB ("match"); the matching entry's physical page # is combined with the offset to form the physical address ("deliver").]
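A hedged Python sketch of the match-and-deliver step in this diagram. It assumes a fixed 4 KB page size, models the TLB as a dictionary keyed by (RID, virtual page number), and raises `LookupError` on a miss; all of those choices, and the function name `translate`, are illustrative assumptions rather than architectural detail.

```python
PAGE_BITS = 12       # assume 4 KB pages for this example
OFFSET_BITS = 61     # bits 60:0 are the offset within a region


def translate(va, region_registers, tlb):
    """tlb maps (RID, virtual page number) -> (physical page number, protection)."""
    rid = region_registers[(va >> OFFSET_BITS) & 0x7]
    region_offset = va & ((1 << OFFSET_BITS) - 1)
    vpn = region_offset >> PAGE_BITS
    page_offset = region_offset & ((1 << PAGE_BITS) - 1)
    entry = tlb.get((rid, vpn))           # "match" on RID + virtual page number
    if entry is None:
        raise LookupError("TLB miss")     # a real system would refill or fault here
    ppn, protection = entry
    return (ppn << PAGE_BITS) | page_offset, protection   # "deliver"
```

Tagging each entry with the RID means entries belonging to different address spaces can coexist in the TLB, since a lookup only hits when both the RID and the page number agree.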

Protection: Can I See it? Can I Access it?

[Figure: the virtual address (RRx | virtual page # | offset) is matched by RID and virtual page # against TLB entries of the form RID | Virtual Page # | Rights | Key. The entry's key is compared with the protection key registers pkr0 ... pkrn, for example Key1 r-x, Key2 rw-, Key3 r--, Key4 rwx, Key5 rw-. The entry's rights are checked against the privilege level and access type. Result: Allow?]

Protection Keys Increase TLB Utilization For Large Object Databases
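The two questions in the slide title can be sketched as one Python predicate. This is a simplification under stated assumptions: rights are modeled as strings such as `rw-` exactly as drawn on the slide, the protection key registers are modeled as a dictionary from key to rights, and the privilege-level input shown in the diagram is left out entirely. The name `access_allowed` is hypothetical.

```python
def access_allowed(access, entry_rights, entry_key, pkrs):
    """Decide whether one access may proceed.

    access        -- 'r', 'w' or 'x'
    entry_rights  -- rights string from the matching TLB entry, e.g. 'rw-'
    entry_key     -- protection key stored in the TLB entry
    pkrs          -- dict of key -> rights string, standing in for pkr0..pkrn
    """
    key_rights = pkrs.get(entry_key)
    if key_rights is None:
        return False                      # "Can I see it?": no matching key loaded
    # "Can I access it?": the page rights and the key rights must both permit it.
    return access in entry_rights and access in key_rights
```

With this arrangement one TLB entry can serve many processes: what differs per process is the small set of keys loaded into its protection key registers, not the translation itself.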

Variable Page Sizes

• Minimum on all implementations
  – 4K, 8K, 16K, 64K, 256K, 1M, 4M, 16M, 64M, 256M bytes
• 4 GB purge
  – Simplify address space de-allocation

Variable Page Sizes Enable TLB Efficiency For OS And Application Performance
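To make the TLB-efficiency claim concrete, here is a short Python calculation using the page sizes listed on this slide. The helper names are hypothetical, and alignment constraints are ignored for simplicity.

```python
KB, MB = 1 << 10, 1 << 20
PAGE_SIZES = [4 * KB, 8 * KB, 16 * KB, 64 * KB, 256 * KB,
              1 * MB, 4 * MB, 16 * MB, 64 * MB, 256 * MB]


def entries_needed(mapping_bytes, page_size):
    """TLB entries required to map a contiguous area with one page size."""
    return -(-mapping_bytes // page_size)      # ceiling division


def largest_page_for(mapping_bytes):
    """Largest listed page size not exceeding the mapping (alignment ignored)."""
    fitting = [size for size in PAGE_SIZES if size <= mapping_bytes]
    return max(fitting) if fitting else PAGE_SIZES[0]
```

For example, mapping a 256 MB area takes 65,536 entries with 4 KB pages but a single entry with a 256 MB page, which is why large pages relieve pressure on a TLB with a limited number of entries.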

Hardware Accessed Page Table

[Figure: the 64-bit address is split into RRx (bits 63:61), virtual page #, and offset (bits 60:0). The region register selected by RRx supplies the RID. The processor hashes the RID and virtual page # and searches the Virtual Hashed Page Table (VHPT) in memory.]

Flexible Hardware Mechanisms Enable Parallel Execution

Virtual Memory Model: Example

Virtual address space, by region:

• Region 0: different RID in each process
  – Unique address spaces for data (P1, P2, P3, P4)
• Region 1: same RID if shared
  – Single address space for code (P1,2,3,4)
• Region 2: one RID, protection via multiple keys
  – Shared memory areas (shared by processes 1,2; 3; 1; 2,3)
• ...
• Region 7: one RID, no key
  – Kernel, protected by privilege level

Flexible Virtual Memory Architecture Enables Variety Of Efficient OS Implementations
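The region assignment in this example can be sketched in Python. The concrete RID values and the function names are hypothetical; only the policy (a private RID for region 0, a common RID for regions 1, 2, and 7) follows the slide.

```python
KERNEL_RID, CODE_RID, SHARED_RID = 1, 2, 3      # hypothetical RID values


def region_registers_for(process_id):
    """Region register contents for one process, following the slide's example."""
    rr = [None] * 8
    rr[0] = 100 + process_id   # region 0: private data, different RID per process
    rr[1] = CODE_RID           # region 1: shared code, same RID in every process
    rr[2] = SHARED_RID         # region 2: shared memory, one RID, guarded by keys
    rr[7] = KERNEL_RID         # region 7: kernel, protected by privilege level
    return rr


def system_address(process_id, va):
    """(RID, offset) pair that the TLB would match for this process and address."""
    rr = region_registers_for(process_id)
    return rr[(va >> 61) & 0x7], va & ((1 << 61) - 1)
```

Two processes that use the same virtual address in region 1 resolve to the same (RID, offset) pair and can share a translation, while the same virtual address in region 0 resolves to different pairs and stays private.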


IA-64 Interruption Model

• Parallel instruction execution, . . .
  – Exception delivery is sequential & precise
  – All exceptions reported on the excepting instruction (including numeric exceptions)
• "Interruption" is IA-64 term for...
  – Abort: hardware reset, machine check
  – Interrupt: asynchronous external event (device or platform management interrupt, soft-reset)
  – Fault: exception taken before instruction commit, e.g. TLB miss
  – Trap: exception taken after instruction commit, e.g. FP trap

IA-64 Provides Precise Exception Model To Match Today's OS Designs

IA-64 Interruption Process

[Animated figure spread over seven slides. Application code: 0x1000 INST A, 0x1010 INST B, 0x1020 INST C. IVT code: 0x4000 INST X, 0x4010 INST Y, 0x4020 RFI. Current processor state: IP, PSR. Interruption registers: IIP, IPSR. General registers 0-15 and 32-127, plus registers 16-31 in two banks: BANK0 REG (OS data) and BANK1 REG (app data).]

1. Normal instruction execution flow: instruction A executed (IP = 0x1000).
2. Normal instruction execution flow: instruction B executed (IP = 0x1010).
3. Interruption delivery (IP = 0x1010):
   1. Bank switching: processor switches to Bank 0 registers, preparing to run IVT code.
   2. Processor saves state: processor saves current state to interruption registers before interrupt handling (IIP = 0x1010).
4. Interruption handling: instruction X executed in interrupt vector table (IP = 0x4000, IIP = 0x1010).
5. Interruption handling: instruction Y executed in interrupt vector table (IP = 0x4010, IIP = 0x1010).
6. Restoring pre-interruption state (IP = 0x4020, RFI; IIP = 0x1010):
   1. Bank switching: processor switches back to Bank 1 registers.
   2. Processor restores state: processor restores state from interruption registers before returning from interrupt (return to app code).
7. Resume normal instruction execution: instruction B executed (IP = 0x1010).
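The animation can be replayed with a toy Python model. This is a sketch under loud assumptions: the whole PSR is reduced to a single register-bank number, the class and method names are invented for illustration, and the save is done before the bank switch simply because this model stores the bank inside the saved state.

```python
from dataclasses import dataclass


@dataclass
class Cpu:
    ip: int = 0x1010       # about to execute INST B
    bank: int = 1          # stands in for the PSR: Bank 1 holds app data
    iip: int = 0           # interruption register: saved IP
    ipsr_bank: int = 0     # interruption register: saved PSR (bank only)

    def deliver_interruption(self, vector):
        """Save current state, switch to Bank 0, and enter IVT code."""
        self.iip, self.ipsr_bank = self.ip, self.bank
        self.bank = 0
        self.ip = vector

    def rfi(self):
        """Return from interruption: restore the saved IP and register bank."""
        self.ip, self.bank = self.iip, self.ipsr_bank


cpu = Cpu()
cpu.deliver_interruption(0x4000)
in_handler = (cpu.ip, cpu.bank, cpu.iip)
cpu.rfi()
after_return = (cpu.ip, cpu.bank)
```

After `rfi` the model is back at 0x1010 with Bank 1 selected, matching the last slide of the sequence, where instruction B is executed again.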

Interruption Features

• Low interruption latency
  – Interruption delivery causes single pipeline break
  – Key state captured in on-chip registers
• State-save controlled by system software
  – Software makes performance/nesting trade-off
  – Shared mechanism for IA-64/IA-32 interruptions
• Efficient handler execution
  – Interruption vector table (IVT) contains code for interrupt service routine

Provides Fast And Flexible Interruptions For Large I/O Intensive Applications

Parallelism Across System Calls

Application address space:

  Application code:

    ...
    // make system call
    br.call _write
    ...

  EPC page (PL promote and execute only):

    _write:
        epc             // privilege promote without pipeline flush
        br os_write
        ...

Operating system kernel (privileged code):

    os_write:
        ...             // perform system call
        br.ret          // demote PL and return to user

Fast System Calls Improve Synergy Between OS & Application

IA-64 External Interrupts

[Figure: several processors on the system bus exchange IPI messages. A bridge connects the system bus to the I/O bus, which carries interrupt messages from an external interrupt controller (serving plain devices) and from a device with its own interrupt controller. Each processor also has LINT0 and LINT1 pins (Intel 8259A compatible).]

High Performance Message-Based Interrupts Compatible With Today's Platforms


IA-64 System Software Stack: OS Boot

[Figure: layered stack, top to bottom: Operating System Software; EFI; System Abstraction Layer (SAL), alongside IA-32 BIOS; Processor Abstraction Layer (PAL); Processor (hardware); Platform (hardware). Arrows highlighted on this slide: Reset, machine checks; OS boot; Access to platform resources.]

OS Running

[Figure: the same stack (Operating System Software; EFI; SAL alongside IA-32 BIOS; PAL; Processor (hardware); Platform (hardware)). Arrows highlighted on this slide: Instructions; Interruptions; External Interrupts (performance critical); I/O.]

OS Calls To Firmware Services

[Figure: the same stack (Operating System Software; EFI; SAL alongside IA-32 BIOS; PAL; Processor (hardware); Platform (hardware)). Arrows highlighted on this slide: Run-Time Services; Access to platform resources.]

Machine Check Handling

[Figure: the same stack (Operating System Software; EFI; SAL alongside IA-32 BIOS; PAL; Processor (hardware); Platform (hardware)). Arrows highlighted on this slide: Reset, machine checks; Machine Check Services; Access to platform resources.]

Architected RAS Features

• Reliability
  – 3 levels of error signaling: continuable, local, and global
• Availability
  – Fine grained error containment by cooperation between hardware and firmware
• Serviceability
  – Extensive error logs for error analysis
  – Common error logs for firmware and OS

Advanced Machine Check Architecture For High Levels of Reliability, Availability, And Serviceability


Compatibility

• IA-64 supports IA-32 OS
  – Capable of running unmodified multi-processing IA-32 OS, e.g. NT4.0, Linux
• IA-64 OS supports IA-32 platform peripherals
  – IA-64 support for legacy I/O port space
• Dependent upon OS & platform implementation

IA-64 Offers Full IA-32 Compatibility In Hardware: Platforms, OS, Applications


Intel Labs Page 49

EPIC Design Maximizes SW-HW Synergy

Architecture features programmed by compiler:
• Branch Hints
• Explicit Parallelism
• Register Stack & Rotation
• Predication
• Data & Control Speculation
• Memory Hints

Micro-architecture features in hardware:
• Fetch: Instruction Cache & Branch Predictors
• Issue: Fast, Simple 6-Issue
• Register Handling: 128 GR & 128 FR, Register Remap & Stack Engine
• Control: Bypasses & Dependencies
• Parallel Resources: 4 Integer + 4 MMX Units, 2 FMACs (4 for SSE), 2 LD/ST units, 32 entry ALAT
• Speculation Deferral Management
• Memory Subsystem: Three levels of cache: L1, L2, L3

Intel Labs Page 50

Intel® Itanium™ Processor Block Diagram

[Block diagram. Labeled blocks: L1 Instruction Cache and Fetch/Pre-fetch Engine; ITLB; Branch Prediction; IA-32 Decode and Control; Instruction Queue (8 bundles); 9 Issue Ports (B B B M M I I F F); Register Stack Engine / Re-Mapping; Branch & Predicate Registers; 128 Integer Registers; 128 FP Registers; Branch Units; Integer and MM Units; Dual-Port L1 Data Cache and DTLB; ALAT; Floating Point Units; SIMD FMAC; Scoreboard, Predicate, NaTs, Exceptions; L2 Cache; L3 Cache; Bus Controller. ECC is marked on several blocks.]

Intel Labs Page 51

Floating Point Features

• Native 82-bit hardware provides support for multiple numeric models
• 2 extended-precision pipelined FMACs deliver 4 EP / DP FLOPs/cycle
• Performance for security and 3-D graphics
• 2 additional single-precision FMACs for 8 SP FLOPs/cycle (SIMD)
• Efficient use of hardware: integer multiply-add and s/w divide
• Balanced with plenty of operand bandwidth from registers / memory

[Diagram labels: 4 Mbyte L3 Cache; L2 Cache; 128-entry 82-bit RF (even / odd); 6 x 82-bit operands; 2 x 82-bit results; 2 stores/clk; 2 DP Ops/clk; 4 DP Ops/clk (2 x Fld-pair).]

Itanium™ processor delivers industry-leading floating point performance
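The per-cycle figures on this slide can be turned into peak rates with simple arithmetic. The sketch below is illustrative only: it assumes one fused multiply-add counts as two FLOPs, takes the FMAC counts from the slides (2 for EP/DP, 4 for SP), and uses the 800 MHz production frequency quoted on the Itanium™ Processor slide. The function names are made up for this example.

```c
/* Peak floating-point throughput from FMAC count and clock rate.
 * Assumption: each FMAC retires one fused multiply-add per cycle,
 * counted as 2 FLOPs (one multiply plus one add). */

/* FLOPs per cycle for a given number of FMAC units. */
int flops_per_cycle(int fmac_units)
{
    return fmac_units * 2;
}

/* Peak rate in GFLOPS for a given FMAC count and clock in MHz. */
double peak_gflops(int fmac_units, double clock_mhz)
{
    return flops_per_cycle(fmac_units) * clock_mhz / 1000.0;
}
```

Under these assumptions, flops_per_cycle(2) gives the slide's 4 EP/DP FLOPs/cycle and flops_per_cycle(4) gives its 8 SP FLOPs/cycle. At 800 MHz that corresponds to peaks of about 3.2 and 6.4 GFLOPS respectively. These are upper bounds, not measured results.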

Intel Labs Page 52

Example – Memory Latency

unrolled_loop:
        ld8     t0=[src],32
        ld8     t1=[src2],32
        add     loopcnt=-1,loopcnt
        ld8     t2=[src3],32
        ld8     t3=[src4],32
        ;;
        ld8     t4=[src],32
        ld8     t5=[src2],32
        cmp.ne  p8,p9=r0,loopcnt
        ld8     t6=[src3],32
        ld8     t7=[src4],32
        lfetch.nta      [sf],64
        lfetch.excl.nta [df],64
        st8     [dst]=t0,32
        st8     [dst2]=t1,32
        st8     [dst3]=t2,32
        st8     [dst4]=t3,32
        ;;
        st8     [dst]=t4,32
        st8     [dst2]=t5,32
        st8     [dst3]=t6,32
        st8     [dst4]=t7,32
(p8)    br.cond.sptk.few unrolled_loop
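For readers less familiar with IA-64 assembly, the loop above can be read as a block copy that moves 64 bytes per trip and prefetches ahead on both the source and destination streams. The C sketch below is one interpretation, not the deck's code. It assumes the four source and four destination pointers are interleaved 8 bytes apart, and it uses the GCC/Clang builtin `__builtin_prefetch` as a stand-in for lfetch. `PREFETCH_AHEAD` and `copy_blocks` are hypothetical names, and the prefetch distance is not stated on the slide.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical prefetch distance in bytes; the slide does not give one. */
#define PREFETCH_AHEAD 512

/* Copy n_blocks blocks of 64 bytes (eight 64-bit words each).
 * Per trip: 8 loads, 2 prefetch hints, 8 stores, as in the assembly loop. */
void copy_blocks(uint64_t *dst, const uint64_t *src, size_t n_blocks)
{
    for (size_t i = 0; i < n_blocks; i++) {
        /* All loads are issued before any store so that their
         * latencies overlap instead of adding up. */
        uint64_t t0 = src[0], t1 = src[1], t2 = src[2], t3 = src[3];
        uint64_t t4 = src[4], t5 = src[5], t6 = src[6], t7 = src[7];

#if defined(__GNUC__)
        /* Second argument: 0 = read hint, 1 = write hint.
         * Third argument: 0 = low temporal locality (compare .nta).
         * Addresses are formed with integer arithmetic; a prefetch of an
         * invalid address is a hint only and does not fault. */
        __builtin_prefetch((const void *)((uintptr_t)src + PREFETCH_AHEAD), 0, 0);
        __builtin_prefetch((const void *)((uintptr_t)dst + PREFETCH_AHEAD), 1, 0);
#endif

        dst[0] = t0; dst[1] = t1; dst[2] = t2; dst[3] = t3;
        dst[4] = t4; dst[5] = t5; dst[6] = t6; dst[7] = t7;

        src += 8;
        dst += 8;
    }
}
```

Whether a compiler keeps this load-then-store grouping is up to its scheduler. The assembly version fixes the grouping explicitly with the `;;` stops.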

Intel Labs Page 53

Agenda – IA-64 Architecture

• EPIC 101
  – Application Architecture
  – System Architecture
  – Itanium Micro-architecture
• Itanium Update
• Useful URLs

Intel Labs Page 54

Itanium™ Processor

• 800 MHz production frequency
  – Up to 20 operations per clock
• 4 MB high speed on-cartridge L3 cache
• Over 320M transistors
  – 25M in CPU, 295M in L3 cache
• 2.1 GB/s system bus
  – Enhanced Defer Mechanism enables high scalability through improved bus efficiency
• Extensive reliability and availability features
  – ECC, parity protection, enhanced MCA
• Excellent functionality on initial silicon
  – No architectural or ISA changes planned

Intel Labs Page 55

Itanium™ Processor System Architecture

[System diagram. Labeled components: 82460gx SAC; 82460gx SDC; MAC (x4) and MDC (x8) memory components; F16[0] to F16[3]; 82460gx WXB (x3); 82460gx PXB; 82460gx PID; 82460gx IFB; FWH.]

• Intel 460GX Chipset
  – Support for 1-4 processors
  – Dual memory ports
    – 4.2 GB/s
    – Up to 64 GB SDRAM
  – 64b / 66MHz PCI Hot Plug I/O
  – Extensive ECC, parity protection
  – Full-speed front-side bus operation achieved in MP environment with pre-production samples
• Over 30 OEM system designs; multiple custom chipsets
  – Over five 8 processor and greater system designs
  – Multiple OEM chipsets powered up in 2H '99

Intel Labs Page 56

IA-64 Platform Launch Readiness

The slide stacks the following layers under SOFTWARE and HARDWARE headings, each with its owner and status:
• Server Applications (ISVs): Oracle 8i, SQL, SAP, IIS ...
• Workstation Applications (ISVs): Mental Ray, Softimage, Nastran ...
• Software Tools (Intel/ISVs): C++, Fortran, Java, other offerings from Microsoft, EPC, IBM ...
• Operating Systems (OSVs): 64-bit Windows, Unix / Linux, Novell developer releases
• Hardware, I/O, Graphics (IHVs): Adaptec, Q-Logic, 3-D Labs, Matrox …; Fast Track Driver program
• System Designs (OEMs): 2P workstations, 4P to 512P servers
• Chipsets (Intel/Industry): Intel 460 GX PCI-set; custom OEM chipset designs supporting high MP systems
• Processor (Intel): IA-64 processor roadmap; over 5 products identified, more planned

All Elements of IA-64 Program Converging for Successful Solution Launch

Intel Labs Page 57

IA-64 Application Program Summary

Workstation ISVs publicly committed to IA-64:
Adobe, Alias/Wavefront, Avid/SoftImage, Cadence, Dassault, Discreet, Flometrics, Infinity, Invent Computing, Lizard Tech, Magma, Mental Images, Mentor Graphics, Molecular Simulations Inc., MSC, Parametric, Risk Metrics, SCALI, Synopsys, Unigraphics, Viewlogic, Visual Insights

Server ISVs publicly committed to IA-64:
Ariba, Allegis, AltoWeb, Apogee Networks, Baan, BEA Weblogic, Brokat, Entrust, Extricity, IBM Software, Informix, IONA Technologies, Lutris, Microsoft, Nuance, Oasis, Oblix, Oracle, People Soft, Persistence, Relativity Technologies, RSA, SAP, SAS, Selectica, Silknet, Silverstream, Softway, Speechworks, TimesTen, Torrent, Verisign, Webline

IA-64 Workstation focus applications:
• DCC: Rendering, Editing, 3D Animation
• EDA: Verification, Synthesis, DRC
• MDA: FEA, Modeling, Hi-end CAE
• Finance: Equity, Treasury, Risk Analysis
• Other: CFD, GIS, Molecular Modeling

IA-64 Server focus applications:
• Very Large Databases
  – Data Warehousing
  – Decision Support
  – OLTP and OLAP
  – ERP and LOB
  – Customer Management
• E-Business Services
  – Security Services
  – Directory Services
  – VPN/IP Gateways
  – ISP Dedicated Switches
• Scientific / Technical Computing
  – Computer Aided Engineering/Design
  – Finite Element Analysis
  – Fluid Dynamics and Simulations

IA-64 Software Program Increasing Depth and Breadth of IA-64 Software Ramp

Intel Labs Page 58

McKinley Processor Feature Overview

• Enhanced Itanium™ processor microarchitecture & system bus
  – Fully binary compatible in hardware with Itanium™ processor
  – Expanded resources including more load/store ports and ALU
  – On-chip L3 cache
• Builds upon Itanium™ platform infrastructure
  – Reuses key technologies: bus protocols, power delivery technology, software tools, other key platform components
• Continued focus on high availability for e-business
  – Extensive error detection & correction
  – System management bus with on-package power thermal management
• Production target: end of 2001

Extends IA-64 capability for end ’01 timeframe

Intel Labs Page 59

IA-64 Docs & URLs

• IA-64 Software Developer’s Manual
  – Info for system & application software, & development tools for IA-64
  – Software optimization techniques
  – Performance monitoring info for optimization support
• More IA-64 Documentation:
  – IA-64 Software Conventions and Runtime Architecture Guide
  – Assembly Language Reference Guide
  – IA-64 assembler & reference guide
  – IA-64 Processor-specific Application Binary Interface
  – System Abstraction Layer Specification
  and more …

IA-64 Docs Available On Internet: developer.intel.com/design/ia-64/devinfo.htm

Intel Labs Page 60

Backup

Intel Labs Page 61

IA-64™ Architecture Innovations: Outline

• Background
• IA-64 at work: Code Examples
  – xlxgetvalue from LI
    – control speculation to chase pointers
  – puzzle code fragment
    – loop with nested if statements
  – treeins code fragment
    – classic if-then-else statement
• Summary

Intel Labs Page 62

Xlxgetvalue in a Nutshell

• Code fragment from SpecInt95 benchmark LI
  – representative of pointer chasing code
• Technique used: Control Speculation
• Benefits:
  – hide memory latency
  – expose Instruction Level Parallelism allowing parallel execution

Intel Labs Page 63

Example Machine Model for xlxgetvalue

[Machine diagram labels: Instruction Pointer; L0 Icache; Instruction Decode and Dispatch; Register File; 6 functional units; L0 DCache; Resteer; 2 memory ports; 1 cycle load latency; 8 cycle branch mispredict.]

6 Execution units, 2 memory ports, 1 cycle load latency

Intel Labs Page 64

Xlxgetvalue Step by Step

• Compile one iteration
  – use control speculation to issue loads as early as possible
• Unroll the loop
  – use control speculation to start next iteration before it is safe to do so
  – take advantage of the machine width to execute several iterations in parallel

Expose ILP with Control Speculation in pointer chasing code

Intel Labs Page 65

Xlxgetvalue Code Fragment

for (fp = xlenv; fp; fp = cdr(fp))
    for (ep = car(fp); ep; ep = cdr(ep))
        if (sym == car(car(ep)))
            return (cdr(car(ep)));

Ld fp
cmp fp == nil
br to exit if true
Load ep
Cond1 = (cmp ep == nil)
br nxt_fp if Cond1
load car(ep)
load x = car(car(ep))
Cond2 = (cmp sym == x)
br to return if Cond2
br nxt_ep
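The fragment above depends on XLISP's node type and its car/cdr accessors, which the slide does not show. A minimal self-contained sketch follows, assuming a plain two-pointer cons cell. The `Node` type and the `lookup_value` name are hypothetical stand-ins, not the benchmark's actual definitions.

```c
#include <stddef.h>

/* Hypothetical cons cell; XLISP's real node layout is richer. */
typedef struct Node {
    struct Node *car;
    struct Node *cdr;
} Node;

static inline Node *car(const Node *n) { return n->car; }
static inline Node *cdr(const Node *n) { return n->cdr; }

/* The environment is a list of frames; each frame is a list of
 * (symbol . value) pairs. Returns the value bound to sym, or NULL
 * when no frame binds it. */
Node *lookup_value(Node *xlenv, const Node *sym)
{
    for (Node *fp = xlenv; fp; fp = cdr(fp))
        for (Node *ep = car(fp); ep; ep = cdr(ep))
            if (sym == car(car(ep)))
                return cdr(car(ep));
    return NULL;
}
```

Every test in the inner loop waits on a chain of dependent loads (ep, then car(ep), then car(car(ep))). That dependency chain is why the deck uses this loop as its pointer-chasing example.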

Intel Labs Page 66

Compiling . . .

for (ep = car(fp); ep; ep = cdr(ep))
    if (sym == car(car(ep)))

Schedule (operations issued per cycle across Units 1 to 6):
Cycle 1: Load ep1
Cycle 2: (empty)
Cycle 3: (empty)
Cycle 4: (empty)
Cycle 5: (empty)

Intel Labs Page 67

Compiling . . .

for (ep = car(fp); ep; ep = cdr(ep))
    if (sym == car(car(ep)))

Schedule (operations issued per cycle across Units 1 to 6):
Cycle 1: Load ep1
Cycle 2: Cond1 = (ep1 == nil)
Cycle 3: (empty)
Cycle 4: (empty)
Cycle 5: (empty)

Intel Labs Page 68

Compiling . . .

for (ep = car(fp); ep; ep = cdr(ep))
    if (sym == car(car(ep)))

Schedule (operations issued per cycle across Units 1 to 6):
Cycle 1: Load ep1
Cycle 2: Cond1 = (ep1 == nil) | Load.s car(ep1)
Cycle 3: check.s
Cycle 4: (empty)
Cycle 5: (empty)

Intel Labs Page 69

Compiling . . .

for (ep = car(fp); ep; ep = cdr(ep))
    if (sym == car(car(ep)))

Schedule (operations issued per cycle across Units 1 to 6):
Cycle 1: Load ep1
Cycle 2: Cond1 = (ep1 == nil) | Load.s car(ep1)
Cycle 3: check.s | Load x = car(car(ep1))
Cycle 4: (empty)
Cycle 5: (empty)

Intel Labs Page 70

Compiling . . .

for (ep = car(fp); ep; ep = cdr(ep))
    if (sym == car(car(ep)))

Schedule (operations issued per cycle across Units 1 to 6):
Cycle 1: Load ep1
Cycle 2: Cond1 = (ep1 == nil) | Load.s car(ep1)
Cycle 3: check.s | Load x = car(car(ep1))
Cycle 4: Cond2 = (sym == x)
Cycle 5: (empty)

Intel Labs Page 71

First Iteration

for (ep = car(fp); ep; ep = cdr(ep))
    if (sym == car(car(ep)))
        return (cdr(car(ep)));

Schedule (operations issued per cycle across Units 1 to 6):
Cycle 1: Load ep1
Cycle 2: Cond1 = (ep1 == nil) | Load.s car(ep1) | Br nxt_fp if cond1
Cycle 3: check.s | Load x = car(car(ep1))
Cycle 4: Cond2 = (sym == x) | Br return if cond2 | Br nxt_ep
Cycle 5: (empty)

Speculation allows the loads to be started early
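The Load.s / check.s pairing in the schedule can be modelled in portable C to show the control flow. The load is issued before the nil test resolves, any fault is deferred into a flag, and the flag is examined only on the path that actually uses the value. This is an emulation sketch under loud assumptions: `SpecReg`, `spec_load_car`, and `match_entry` are invented for illustration, the NaT bit is modelled as a plain boolean, and a "bad" address is modelled only as NULL. Real hardware defers on more conditions and does this without extra instructions on the fast path.

```c
#include <stdbool.h>
#include <stddef.h>

typedef struct Node { struct Node *car, *cdr; } Node;  /* hypothetical cons cell */

/* Result of an emulated speculative load: a value plus a NaT-style flag. */
typedef struct {
    Node *value;
    bool  nat;   /* true: the load was deferred and value is not usable */
} SpecReg;

/* Emulates Load.s car(p): never faults. A NULL address sets the flag
 * instead of trapping. */
SpecReg spec_load_car(const Node *p)
{
    SpecReg r = { NULL, true };
    if (p != NULL) {
        r.value = p->car;
        r.nat = false;
    }
    return r;
}

/* One step of the inner loop.
 * Returns -1 if ep is nil, 1 if the entry's symbol is sym, 0 otherwise. */
int match_entry(const Node *ep, const Node *sym)
{
    SpecReg pair = spec_load_car(ep);   /* Load.s car(ep), hoisted above the branch */

    if (ep == NULL)                     /* Cond1: ep == nil */
        return -1;                      /* the speculative result is simply dropped */

    if (pair.nat)                       /* check.s: taken only if the load was deferred */
        pair.value = ep->car;           /* recovery: redo the load non-speculatively
                                           (unreachable in this simplified model) */

    return pair.value->car == sym;      /* Cond2: sym == car(car(ep)) */
}
```

In this model the hoisted load is harmless when ep is nil because its result is never consumed. That is the property that lets the schedule start the load in the same cycle as the compare.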

Intel Labs Page 72

Second Iteration: Unrolling . . .

for (ep = car(fp); ep; ep = cdr(ep))
    if (sym == car(car(ep)))

Schedule (operations issued per cycle across Units 1 to 6):
Cycle 1: Ld ep1
Cycle 2: Ld.s car(ep1) | Cond1 = Cmp ep1 == nil | Ld.s ep2 = cdr(ep1) | Br nxt_fp if cond1
Cycle 3: Check.s | Ld car(car(ep1))
Cycle 4: Cond2 = Cmp car(car(ep1)) == sym | Br return if cond2
Cycle 5: (empty)

Intel Labs Page 73

Second Iteration: Unrolling . . .

for (ep = car(fp); ep; ep = cdr(ep))
    if (sym == car(car(ep)))

Schedule (operations issued per cycle across Units 1 to 6):
Cycle 1: Ld ep1
Cycle 2: Ld.s car(ep1) | Cond1 = Cmp ep1 == nil | Ld.s ep2 = cdr(ep1) | Br nxt_fp if cond1
Cycle 3: Check.s | Ld car(car(ep1)) | Ld.s car(ep2) | Cond3 = Cmp ep2 == nil
Cycle 4: Cond2 = Cmp car(car(ep1)) == sym | Check.s | Br return if cond2 | Br nxt_fp if cond3
Cycle 5: (empty)

Only 1 check for 2 dependent loads

Intel Labs Page 74

Second Iteration: Unrolling . . .

for (ep = car(fp); ep; ep = cdr(ep))
    if (sym == car(car(ep)))

Schedule (operations issued per cycle across Units 1 to 6):
Cycle 1: Ld ep1
Cycle 2: Ld.s car(ep1) | Cond1 = Cmp ep1 == nil | Ld.s ep2 = cdr(ep1) | Br nxt_fp if cond1
Cycle 3: Check.s | Ld car(car(ep1)) | Ld.s car(ep2) | Cond3 = Cmp ep2 == nil
Cycle 4: Cond2 = Cmp car(car(ep1)) == sym | Check.s | Ld car(car(ep2)) | Br return if cond2 | Br nxt_fp if cond3
Cycle 5: (empty)

Intel Labs Page 75

Second Iteration: Unrolling . . .

for (ep = car(fp); ep; ep = cdr(ep))
    if (sym == car(car(ep)))

Schedule (operations issued per cycle across Units 1 to 6):
Cycle 1: Ld ep1
Cycle 2: Ld.s car(ep1) | Cond1 = Cmp ep1 == nil | Ld.s ep2 = cdr(ep1) | Br nxt_fp if cond1
Cycle 3: Check.s | Ld car(car(ep1)) | Ld.s car(ep2) | Cond3 = Cmp ep2 == nil
Cycle 4: Cond2 = Cmp car(car(ep1)) == sym | Check.s | Ld car(car(ep2)) | Br return if cond2 | Br nxt_fp if cond3
Cycle 5: Cond4 = Cmp car(car(ep2)) == sym | Br return if cond4 | Br nxt_ep

Speculation enables efficient machine utilization

Intel Labs Page 76

Optimized Code

for (ep = car(fp); ep; ep = cdr(ep))
    if (sym == car(car(ep)))

Schedule (operations issued per cycle across Units 1 to 6):
Cycle 0: Ld ep1 (done outside of the loop)
Cycle 1: Ld.s car(ep1) | Cond1 = Cmp ep1 == nil | Ld.s ep2 = cdr(ep1) | Br nxt_fp if cond1
Cycle 2: Check.s | Ld car(car(ep1)) | Ld.s car(ep2) | Cond3 = Cmp ep2 == nil
Cycle 3: Cond2 = Cmp car(car(ep1)) == sym | Check.s | Ld car(car(ep2)) | Br return if cond2 | Br nxt_fp if cond3
Cycle 4: Ld nxt ep1 = cdr(ep2) | Cond4 = Cmp car(car(ep2)) == sym | Br return if cond4 | Br nxt_ep

First load can be done at the bottom of the loop
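At source level, the optimized schedule corresponds to the inner loop unrolled by two, with the pointer for the next trip loaded at the bottom of the loop. The C sketch below shows that shape only. The `Node` type and the function name are hypothetical, and plain C cannot express the speculative loads themselves: here every dereference still sits behind its nil test, whereas Ld.s lets the hardware schedule start them earlier.

```c
#include <stddef.h>

typedef struct Node { struct Node *car, *cdr; } Node;  /* hypothetical cons cell */

/* Search one frame, a list of (symbol . value) pairs, two entries per trip.
 * Returns the value bound to sym, or NULL if the frame does not bind it. */
Node *search_frame_unrolled(Node *ep1, const Node *sym)
{
    while (ep1) {                            /* Cond1: ep1 == nil ends the frame */
        Node *ep2 = ep1->cdr;                /* Ld ep2 = cdr(ep1) */
        Node *pair1 = ep1->car;              /* Ld car(ep1) */
        if (pair1->car == sym)               /* Cond2 */
            return pair1->cdr;

        if (!ep2)                            /* Cond3: ep2 == nil ends the frame */
            break;
        Node *pair2 = ep2->car;              /* Ld car(ep2) */
        if (pair2->car == sym)               /* Cond4 */
            return pair2->cdr;

        ep1 = ep2->cdr;                      /* Ld nxt ep1 = cdr(ep2), bottom of loop */
    }
    return NULL;
}
```

Each trip through this loop covers two iterations of the original loop. The slide's four-cycle loop body works the same way, which is how it reaches two cycles per original iteration.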

Intel Labs Page 77

Scheduled without Control Speculation

Schedule (operations issued per cycle across Units 1 to 6):
Cycle 0: Ld ep1 (done outside of the loop)
Cycle 1: Ld.s car(ep1) | Cond1 = Cmp ep1 == nil | Ld.s ep2 = cdr(ep1) | Br nxt_fp if cond1
Cycle 2: Check.s | Ld car(car(ep1)) | Ld.s car(ep2) | Cond3 = Cmp ep2 == nil
Cycle 3: Cond2 = Cmp car(car(ep1)) == sym | Check.s | Ld car(car(ep2)) | Br return if cond2 | Br nxt_fp if cond3
Cycle 4: Ld nxt ep1 = cdr(ep2) | Cond4 = Cmp car(car(ep2)) == sym | Br return if cond4 | Br nxt_ep

Loads are delayed by one clock

Intel Labs Page 78

Recompiled without Speculation

Schedule (operations issued per cycle across Units 1 to 6):
Cycle 1: Cond1 = Cmp ep1 == nil | Br nxt_fp if cond1
Cycle 2: Ld car(ep1) | Ld ep2 = cdr(ep1)
Cycle 3: Ld car(car(ep1)) | Cond3 = Cmp ep2 == nil
Cycle 4: Cond2 = Cmp car(car(ep1)) == sym | Ld car(ep2) | Br return if cond2 | Br nxt_fp if cond3
Cycle 5: Ld car(car(ep2))
Cycle 6: Ld nxt ep1 = cdr(ep2) | Cond4 = Cmp car(car(ep2)) == sym | Br return if cond4 | Br nxt_ep

Inefficient use of machine width

Intel Labs Page 79

xlxgetvalue Conclusions

• Control speculation benefits:
  – hides memory latency through
    – loading data before knowing if the address is a valid pointer
    – loading data before knowing if the next loop iteration is valid
  – enables the compiler to expose parallelism in pointer chasing code

On average over 50% of loads can be executed speculatively

Intel Labs Page 80

Scoreboard

              With speculation         Without speculation
xlxgetvalue   2 cycles per iteration   3 cycles per iteration

Speculation provides a significant performance advantage
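The scoreboard figures follow from the two schedules shown earlier: the speculative loop body takes 4 cycles and the non-speculative one 6, and each covers two iterations of the original loop. A small sketch of that arithmetic, with function names chosen only for this example:

```c
/* Cycles per original-loop iteration for a schedule that covers
 * `iterations` iterations in `cycles` cycles. */
double cycles_per_iteration(int cycles, int iterations)
{
    return (double)cycles / iterations;
}

/* Speedup of an improved schedule over a baseline, both expressed
 * in cycles per iteration. */
double speedup(double baseline_cpi, double improved_cpi)
{
    return baseline_cpi / improved_cpi;
}
```

With the slide's numbers, cycles_per_iteration(4, 2) is 2 and cycles_per_iteration(6, 2) is 3, so the speedup on this idealized machine model is 3 / 2 = 1.5x. Real gains depend on cache behaviour and branch prediction, which the model leaves out.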