Using Intel® Developer on Itanium® Architecture for Application Tuning


Page 1:

Using Intel® Developer on Itanium® Architecture for Application Tuning

Heinz Bast
[email protected]
Intel® Software Enabling Group EMEA

* Other brands and names may be claimed as the property of others. Copyright © 2002 Intel Corporation. All rights reserved.

VTune and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries.

Page 2:

VTL - The Intel® VTune™ Performance Analyzer for Linux

Page 3:

VTune™ Performance Analyzer

Helps to identify and characterize performance issues by

• Collecting performance data
  • CPU cycles (time)
  • Micro-architectural events of the processor
  • Platform resource utilization
• Organizing and displaying the data
• Identifying performance ‘hotspots’
• Suggesting improvements

Page 4:

A Note about Vtune & other Tools

This can also be done by many other tools
– HPCMon
  – Free utility from Intel, includes source code
  – Ask the presenter for a copy
– EMON
  – Batch-like tool used within Intel
  – Also knows about some non-published monitor events
  – Available on request (no support) if there is a need (NDA)
– PFMON from HP
  – ftp://ftp.hpl.hp.com/pub/linux-ia64/
– PAPI (PapiRun, PapiProf), Rabbit, HPCToolKit, etc.
  – Look at the Web: there are many of them

The differences are in ease of use, added features, APIs, processor support, OS support, navigation, performance data compatibility, source code support, etc.

The most useful feature of Vtune is Event Based Sampling: configuring and monitoring the Itanium™ architecture performance counters and displaying the event occurrence data against the workload of the system being analyzed.

Page 5:

Vtune™ Performance Analysis for Linux

Native: Vtune for Linux 3.0
– Any IA-32 or Itanium® system running a recent Linux version
  – Some kernel and GLIBC dependencies
– Full Eclipse-based GUI only for IA-32 today
  – Due to Eclipse issues with 64-bit
– For Itanium® & EM64T: command-line version
  – But graphical viewers for results
  – Eclipse-based release for 64-bit systems later in 2005

Remote Data Collection
– Allows the full Windows GUI to be used for Linux too

Page 6:

Remote Data Collection

VTune™ analyzer for Windows installed on host system
Remote sampling data collector installed on target system

Host System
• Windows* OS
• IA-32 or Itanium
• Controls target
• View results of data collection

LAN Connection

Target System
- IA-32 or Itanium® processor family
- Windows or Linux*
- Intel® PXA250 applications processor running Windows CE

Page 7:

Linux Driver Kit

Required for RDC and Vtune™ for Linux
– Pre-built binaries for many kernels
– Source code SDK in Vtune 7.2
  – Also at http://opensource.intel.com
– Driver kit requires the kernel to export sys_call_table
  – Some older kernels have to be rebuilt
– Many OSV kernels have explicit support
  – SUSE 8.x, 9.0, Redhat AS 2.1 Update 2
– Support for the 2.6 kernel is being added now (patch for Vtune 3.0)

Page 8:

Vtune™ Features

Sampling of Execution Addresses
– Profiling based on processor event counters

Call Graph Profiling (instrumented analysis)
– Call tree, number of calls, timing information
– Executing instrumented code

Tracking of System Performance Counters
– Performance Monitor (perfmon) style counters
– Extended Performance DLL APIs, SDK available!

Intel® Tuning Assistant: interprets the results (Windows or RDC only)

Page 9:

The Sampling Methodology

“Sample” the CPU’s execution context
– Instruction address (module, source line, assembly line)
– OS process
– OS thread ID

Very easy to use, no special build
– Source line view requires symbol info (-g compiler option)
– Very low intrusion
– System-wide measurements

Sample rate set to provide statistically meaningful data
– Based on CPU clock speed or auto-calibrated

Measures performance-sensitive CPU events
– Cycles (time)
– Cache misses, branch mispredictions, bank conflicts…
– On Itanium there are well over 100 such events, many of them with multiple sub-events
– At most 4 events per run
  – Restricted by the number of PMU (performance monitoring unit) registers
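Because only four events fit into one run, measuring a longer list means repeating the workload several times. The following is a minimal sketch of how such a list could be split into runs; the helper name `plan_runs` is hypothetical and not part of VTune, and the sketch assumes any four events can share a run, which real hardware may not allow.

```python
from typing import List

# Limit quoted on the slide: at most 4 events can be collected per run.
MAX_EVENTS_PER_RUN = 4


def plan_runs(events: List[str], per_run: int = MAX_EVENTS_PER_RUN) -> List[List[str]]:
    """Split event names into consecutive groups of at most `per_run` events.

    Assumption: any combination of events may be collected together.
    Counter-specific restrictions are not modelled here.
    """
    if per_run < 1:
        raise ValueError("per_run must be at least 1")
    return [events[i:i + per_run] for i in range(0, len(events), per_run)]


if __name__ == "__main__":
    wanted = [
        "CPU_CYCLES", "BACK_END_BUBBLE.ALL", "BE_FLUSH_BUBBLE",
        "BE_L1D_FPU_BUBBLE", "BE_EXE_BUBBLE", "BE_RSE_BUBBLE",
        "BACK_END_BUBBLE.FE",
    ]
    for number, group in enumerate(plan_runs(wanted), start=1):
        print(f"run {number}: {', '.join(group)}")
```

Each group then corresponds to one sampling run of the same workload, so results from different groups are only comparable if the workload behaves the same way each time.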

Page 10:

Sampling Process View: System-wide Data Collection

Page 11:

Sampling Source View: Source Code Annotated with Performance Data

Page 12:

VTL - Vtune™ Native Linux Version: Sampling on Linux / IA32

Test: MySQL 4.0.11: test-select

memcpy contains 6 of the first 11 top hot-spots

Page 13:

Selective Sampling

The Vtune™ Pause/Resume API can be used to limit sampling to specific parts of your app
– #include <vtuneapi.h>
– Link with vtuneapi.lib
– Call VTResume() and VTPause() as appropriate
– Enable the “Start with data collection paused” option in the configuration dialog

There is also a more sophisticated Config/Start/Stop API available (see online documentation for more details)

Page 14:

Vtune™ Call Graph Feature

Instrumented technology
– Some performance degradation
– Binary is instrumented
– Identifies function-to-function calling sequences

Reports statistics for each called function
– Execution time
– Blocked time
– Calling sequences & frequency of occurrence

Page 15:

Vtune™ Call-graph View

Page 16:

Vtune™ Call-graph View (VTL, cgview)

Page 17:

VTL – Vtune™ for Linux Usage Model (1 of 2)

Single-invocation command line
– $ vtl activity -c sampling
– $ vtl run
– $ vtl activity -c sampling run

All VTune Activities and results are stored in a semi-hidden project

User configures an Activity and runs it with a single invocation
– User may have multiple Activities in the project
– Each Activity may have multiple data collectors and multiple application/module profiles

Page 18:

VTL – Vtune™ for Linux Usage Model (2 of 2)

Results viewed with a single invocation
– Some filtering available depending on the data
– Results accumulate until deleted by the user

User may pack the project and unpack it on a Windows system
– User can take advantage of the VTune GUI on Windows
– Provides access to capabilities not found on the command line

Page 19:

VTL Command Line Syntax: Some Examples

General status commands
– vtl query -lc
  lists all collectors (sampling and callgraph for 2.0)
– vtl -help -c sampling
  lists all events available for EBS (event based sampling)

Create/Run a Sampling activity
– vtl activity -c sampling -app gzip,"-f big" run
  Create and run a single Sampling collector Activity with application ‘gzip -f big’; default settings (Instructions Retired and Cycles)
– vtl activity -d 20 -c sampling -o "-ec en='L3_READS-ALL-MISS'" -app gzip,"-f big" run
  Create and run for 20 seconds a single Sampling collector Activity with application ‘gzip -f big’, collecting all L3 cache misses (data and instruction)
– Use option -cpu_mask <list> to select a subset of processors

Page 20:

VTL Command Line Syntax (2): More Examples

View Sampling Results
– vtl view
– vtl view -gui
  shows the result of the last activity (defaults)
– vtl view -hf -mn gzip
  view results for module (application) gzip in hot-spot function mode (most active modules first)
– vtl view -code -mn gzip -fn deflate -sea poa
  view results in source code mode for function ‘deflate’ of module (application) gzip; show events as percentage of activity

Page 21:

VTL Command Line Syntax (3): More Examples

Configuring and viewing a Callgraph Activity
– vtl activity -c callgraph -app gzip,"-f big" -moi gzip run
  Create and run a Callgraph Activity with application ‘gzip -f big’; default settings; module of interest ‘gzip’; in case ‘app’ is a script, the module of interest can select the binary to be analyzed
– vtl view
  show the just-generated call graph in table format
– vtl view -gui
  show the just-generated call graph in GUI format; requires installation of the CGVIEW tool (freely available from Intel)

Page 22:

VTune in Eclipse – Call Graph View

Page 23:

How to use Vtune™ for Itanium®

1. Find “hotspots” regarding time (cycles)
– By sampling of event CPU_CYCLES
– By call graph
– Straightforward and all you need in many cases
– But doesn’t tell you “why”

2. Find “hotspots” regarding expensive ‘occurrence’ events
– By sampling for e.g. L3 cache misses, branch mispredictions, RSE activations
– Provides hints for code modifications
– Interpretation can be misleading
  – E.g. L3 cache misses can be neutral (prefetch) or a hint for expensive events
– Requires some generic knowledge about the Itanium architecture

3. Stall cycle analysis
– By sampling for events causing stalls
– Most sophisticated and requires detailed knowledge of the processor
– Only available in this form for the Itanium® architecture

Page 24:

Introduction to Stall Cycle Analysis

The main idea:
– Assume algorithm and platform are perfectly optimized/configured
– Total Cycles = Cycles to execute instructions + Cycles where the processor pipeline is stalled
– Minimize the stall cycles
  – If this value is zero, we have 6 instructions/cycle and thus can’t do better

This is Itanium-2 specific
– For Itanium (-1) the counter structure and names are slightly different
– Does not work this way for IA-32 due to more non-deterministic (out-of-order) execution features

We can only outline the main idea here
– Detailed documentation available:
  – Itanium Reference Manual for Software Developers
  – Itanium-2 Reference Manual for Software Optimization
  – Introduction to Micro-architectural Software Optimization
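The cycle split above can be turned into a small calculation. This is only an illustrative sketch: the function names are hypothetical, the input numbers are invented, and the peak of 6 instructions per cycle is the figure quoted on the slide.

```python
# Peak issue rate quoted on the slide for Itanium 2.
PEAK_INSTRUCTIONS_PER_CYCLE = 6


def executing_cycles(total_cycles: int, stall_cycles: int) -> int:
    """Cycles spent executing instructions: total minus stalled."""
    if total_cycles <= 0:
        raise ValueError("total_cycles must be positive")
    if not 0 <= stall_cycles <= total_cycles:
        raise ValueError("stall_cycles must lie between 0 and total_cycles")
    return total_cycles - stall_cycles


def stall_fraction(total_cycles: int, stall_cycles: int) -> float:
    """Share of all cycles in which the pipeline was stalled."""
    # Reuse the validation in executing_cycles before dividing.
    executing_cycles(total_cycles, stall_cycles)
    return stall_cycles / total_cycles


def achieved_ipc(instructions_retired: int, total_cycles: int) -> float:
    """Instructions per cycle, to compare against the peak of 6."""
    if total_cycles <= 0:
        raise ValueError("total_cycles must be positive")
    return instructions_retired / total_cycles


if __name__ == "__main__":
    # Invented example counts, not measurements.
    total, stalled, retired = 1_000_000, 250_000, 1_500_000
    print("executing cycles:", executing_cycles(total, stalled))
    print("stall fraction:  ", stall_fraction(total, stalled))
    print("IPC:", achieved_ipc(retired, total), "of peak", PEAK_INSTRUCTIONS_PER_CYCLE)
```

A large stall fraction suggests the stall breakdown is worth following; a small one with low IPC points elsewhere, for example at how densely the compiler could pack instructions.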

Page 25:

Starting the Tree Structure Analysis

CPU_CYCLES
  Cycles Retiring Instr
  BACK_END_BUBBLE.ALL

BACK_END_BUBBLE.ALL counts the stall cycles to be reduced. Eliminating the major contributions to this number is the most promising route to optimization.

Page 26:

Itanium® 2 Processor 8-stage Pipeline

Pipeline Front End: IPG, ROT
Instruction Buffer
Pipeline Back End: EXP, REN, REG, EXE, DET, WRB

Attached micropipelines: L1D micropipeline (L1D cache), FPU micropipeline

Page 27:

Top-Level Sum Rule: First-level partitioning

Listed in priority order:

Back_End_Bubble.All =
    BE_Flush_Bubble +       (contributions from DET stage)
    BE_L1D_FPU_Bubble +     (micropipelines stall DET also)
    BE_EXE_Bubble +         (from EXE stage)
    BE_RSE_Bubble +         (from REN stage)
    Back_End_Bubble.FE      (from pipeline FE)

Cycle Accounting Sum Rule Starts the Tree
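The sum rule suggests a simple way to decide where to look first: measure the five components, rank them, and check how well they add up to the total. The sketch below assumes the counts are already available as plain numbers; the function names are hypothetical and the sample values are invented.

```python
from typing import Dict, List, Tuple

# The five first-level components, in the priority order given on the slide.
COMPONENTS = (
    "BE_FLUSH_BUBBLE",
    "BE_L1D_FPU_BUBBLE",
    "BE_EXE_BUBBLE",
    "BE_RSE_BUBBLE",
    "BACK_END_BUBBLE.FE",
)


def rank_components(counts: Dict[str, int]) -> List[Tuple[str, int, float]]:
    """Return (name, cycles, share of component sum), largest first."""
    total = sum(counts[name] for name in COMPONENTS)
    if total <= 0:
        raise ValueError("component counts must sum to a positive value")
    ranked = [(name, counts[name], counts[name] / total) for name in COMPONENTS]
    return sorted(ranked, key=lambda item: item[1], reverse=True)


def sum_rule_residual(counts: Dict[str, int], back_end_bubble_all: int) -> int:
    """BACK_END_BUBBLE.ALL minus the sum of the five components.

    With sampled data from separate runs this is rarely exactly zero.
    """
    return back_end_bubble_all - sum(counts[name] for name in COMPONENTS)


if __name__ == "__main__":
    # Invented numbers for illustration only.
    measured = {
        "BE_FLUSH_BUBBLE": 10,
        "BE_L1D_FPU_BUBBLE": 40,
        "BE_EXE_BUBBLE": 30,
        "BE_RSE_BUBBLE": 5,
        "BACK_END_BUBBLE.FE": 15,
    }
    for name, cycles, share in rank_components(measured):
        print(f"{name:20s} {cycles:6d} {share:6.1%}")
    print("residual:", sum_rule_residual(measured, 100))
```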

Page 28:

Stall Cycles Have 5 Components

CPU_CYCLES
  Cycles Retiring Instr
  BACK_END_BUBBLE.ALL
    BE_FLUSH_BUBBLE
    BE_L1D_FPU_BUBBLE
    BE_EXE_BUBBLE
    BE_RSE_BUBBLE
    BACK_END_BUBBLE.FE

The five components are listed in order of decreasing pipeline priority.

Page 29:

Important BE_EXE_BUBBLE Sub-Events

BE_EXE_BUBBLE.GRALL
– Counts all stall cycles due to waiting for valid data delivery to a general register

BE_EXE_BUBBLE.GRGR
– Counts all stall cycles due to waiting for valid data delivery to a general register from an integer instruction

Difference
– BE_EXE_BUBBLE.GRALL - BE_EXE_BUBBLE.GRGR
– Approximates the memory access stall for integer data

BE_EXE_BUBBLE.FRALL
– Counts all stall cycles due to waiting for valid data delivery to an FP register
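The difference rule above is a single subtraction, shown here as a sketch with invented counts. The function name is hypothetical; the result is an approximation, as the slide says.

```python
def integer_load_stall_estimate(grall: int, grgr: int) -> int:
    """Approximate integer memory-access stall cycles.

    grall: count for BE_EXE_BUBBLE.GRALL (all general-register waits)
    grgr:  count for BE_EXE_BUBBLE.GRGR  (waits on integer instructions)
    """
    if grall < 0 or grgr < 0:
        raise ValueError("counts must be non-negative")
    # Sampled counts from separate runs can cross over slightly;
    # clamp at zero rather than report a negative stall estimate.
    return max(grall - grgr, 0)


if __name__ == "__main__":
    # Invented counts for illustration only.
    print(integer_load_stall_estimate(grall=900_000, grgr=300_000))
```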

Page 30:

Main Components of BE_EXE_BUBBLE

CPU_CYCLES
  Cycles Retiring Instr
  BACK_END_BUBBLE.ALL
    BE_FLUSH_BUBBLE
    BE_L1D_FPU_BUBBLE
    BE_EXE_BUBBLE
      BE_EXE_BUBBLE.GRALL
      BE_EXE_BUBBLE.GRGR
      BE_EXE_BUBBLE.FRALL
    BE_RSE_BUBBLE
    BACK_END_BUBBLE.FE

Sub-Events of BE_EXE_BUBBLE Yield More Detail

Page 31:

Stall Cycles and Occurrence Events

Sub-events often lead to the cause of stalls
– Point to occurrence events
  – E.g. L2 bank conflicts (each adding at least 6 cycles) or cache misses
– Stall-Subevent = ∑ (Occurrence Event × Penalty)

Cache Level      Integer Loads   FP Loads
L1 Data Cache    1 cycle         Not Applicable
L2 Cache         5 cycles        6 cycles
L3 Cache         13 cycles       13 cycles
Memory           210 cycles      210 cycles
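The penalty-weighted sum can be worked through with the latencies from the table. The sketch below is a rough upper-bound estimate under the assumption that every latency is fully exposed as a stall, which the slides do not claim; the function name and the example counts are invented.

```python
from typing import Dict

# Load latencies in cycles, copied from the table on this slide.
LOAD_LATENCY_CYCLES: Dict[str, Dict[str, int]] = {
    "int": {"L1D": 1, "L2": 5, "L3": 13, "MEM": 210},
    "fp": {"L2": 6, "L3": 13, "MEM": 210},  # FP loads do not use L1D
}


def estimated_stall_cycles(loads_served: Dict[str, int], kind: str = "int") -> int:
    """Sum of (occurrence count x penalty) over the levels that served loads.

    loads_served maps a level name ("L1D", "L2", "L3", "MEM") to the number
    of loads satisfied at that level. Assumption: no latency is hidden by
    other work, so the result overestimates the real stall.
    """
    penalties = LOAD_LATENCY_CYCLES[kind]
    total = 0
    for level, count in loads_served.items():
        if level not in penalties:
            raise KeyError(f"level {level!r} not applicable to {kind} loads")
        total += count * penalties[level]
    return total


if __name__ == "__main__":
    # Invented counts for illustration only.
    served = {"L2": 100, "L3": 10, "MEM": 1}
    print("integer loads:", estimated_stall_cycles(served, "int"))
    print("FP loads:     ", estimated_stall_cycles(served, "fp"))
```

Even a single load served from memory costs as much as 42 integer loads served from L2, which is why the memory term tends to dominate such a sum.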

Page 32:

Itanium® 2 L2 Cache Access

L2 data access is controlled by a 32-entry queue (OzQ) and allows out-of-order data return
– FP data is loaded to the FP register file directly from L2

Minimum integer latency is 5 cycles

Minimum floating point latency is 6 cycles

Latency is increased by:
– Cache miss
– Bank conflicts cause OzQ cancels (measured to add 6 cycles)
– Multiple misses and misses to lines being updated will cause OzQ recirculates (measured to add ~17 cycles)

Page 33:

L2 Unified Cache Bank Structure

256KB, 128-byte cache lines, 8-way associativity

Each associative set is 1KB, 256 associative sets

Bank structure allows fast transfers from/to the large Itanium® 2 processor L2 cache

16 banks, each 16 bytes wide

[Figure: cache lines “Line 1 of 8” through “Line 8 of 8” laid out across the banks]

16 banks cover 256 bytes = 2 cache lines

Bank 0 is a column of 1024 16-byte elements
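The numbers on this slide are consistent with each other, which a few lines of arithmetic can confirm. The bank-selection function below is an assumption derived from “16 banks, each 16 bytes wide”: it takes the bank to be chosen by address bits 4 to 7. The slides do not state the exact mapping, and whether two accesses to the same bank really conflict depends on conditions not covered here.

```python
# Geometry as stated on the slide.
CACHE_BYTES = 256 * 1024
LINE_BYTES = 128
WAYS = 8
NUM_BANKS = 16
BANK_WIDTH_BYTES = 16

# Derived values, matching the slide's own figures.
SET_BYTES = LINE_BYTES * WAYS                                # 1KB per set
NUM_SETS = CACHE_BYTES // SET_BYTES                          # 256 sets
ROWS_PER_BANK = CACHE_BYTES // (NUM_BANKS * BANK_WIDTH_BYTES)  # 1024 rows


def l2_bank(address: int) -> int:
    """Bank index for a byte address.

    Assumption (not stated in the slides): consecutive 16-byte chunks map
    to consecutive banks, i.e. the bank is address bits 4..7.
    """
    if address < 0:
        raise ValueError("address must be non-negative")
    return (address // BANK_WIDTH_BYTES) % NUM_BANKS


def same_bank(address_a: int, address_b: int) -> bool:
    """True if both addresses fall into the same bank under that mapping."""
    return l2_bank(address_a) == l2_bank(address_b)


if __name__ == "__main__":
    print("sets:", NUM_SETS, "set size:", SET_BYTES, "rows per bank:", ROWS_PER_BANK)
    # Addresses 256 bytes apart (two cache lines) land in the same bank.
    print(l2_bank(0x1000), l2_bank(0x1100), same_bank(0x1000, 0x1100))
```

Under this assumed mapping, array strides that are multiples of 256 bytes would hit the same bank repeatedly, which is the kind of access pattern the bank-conflict events are meant to expose.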


Follow the Tree down to the Occurrence Events

Ex: Stall Cycles are Associated with Occurrence Events

CPU_CYCLES

Cycles Retiring Instr

BACK_END_BUBBLE.ALL

BE_FLUSH_BUBBLE

BE_L1D_FPU_BUBBLE

BE_EXE_BUBBLE

BE_RSE_BUBBLE

BACK_END_BUBBLE.FE

“_EXE_”.GRALL

“_EXE_”.FRALL

“_EXE_”.GRGR

L2_Force_Recirc

L1D_Read_Misses

L2_Data_References

L3_Reads.Data_read.All

L3_Reads.Data_read.Hit

L3_Reads.Data_read.Miss

L2_OzQ_Cancels1.Bank_Conf

Dear_Latency


Constraining Performance Monitoring Events on IPF

The performance monitoring events can be constrained to increment only on a particular:
– Instruction type (opcode matching)
– Instruction pointer range (IP matching)
– Virtual address range (data address matching)
– Or any combination of the above
– Default is no constraint = collect all events

Unique Features of the Itanium® Processor Family


Opcode Matching the Matrix Multiply Example

O2

Opcode Match   CPU_Cycles    Instructions Retired   L3 Cache Misses
Default        2.2 x 10^10   6.4 x 10^9             6.7 x 10^7
Fp Load        2.2 x 10^10   2.1 x 10^9             6.7 x 10^7
Prefetch       2.2 x 10^10   59254 [instructions retired and L3 cache misses run together in the source]

O3

Opcode Match   CPU_Cycles    Instructions Retired   L3 Cache Misses
Default        3.3 x 10^9    6.4 x 10^9             6.7 x 10^7
Fp Load        3.3 x 10^9    2.1 x 10^9             1 x 10^5
Prefetch       3.3 x 10^9    5 x 10^8               6.7 x 10^7

Opcode Matching Shows L3 Misses Are Fixed by O3


How Does This Work?

Instructions are 41-bit fields
– Define a unique instruction and register usage
– 3 per 128-bit bundle
– Plus 5 bits for the template

Opcode matching can work with classes of instructions
– By using only a subset of the 41 bits
– Done with
   – an instruction field
   – a mask field (defining which bits to ignore)
   – a template field


Example Masks

lfetch
– Template is M
– Opcode field is 0x0CB00000000
– Mask field is 0x030FFFFFFFF

fploads
– Template is M
– Opcode field is 0x0C000000000
– Mask field is 0x037FFFFFFFF

This is WAY too Painful!!
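The matching rule itself is simple, even if picking the values is painful. The sketch below models it conceptually: an instruction matches when it agrees with the opcode field on every bit the mask does not mark as ignored. It does not model the template field or the actual register encoding of the hardware; the function name is hypothetical, and the constants are the lfetch and fploads values from the slide.

```c
#include <stdint.h>

/* Instructions are 41-bit fields. */
#define INSN_41BIT ((UINT64_C(1) << 41) - 1)

/* Conceptual opcode match: bits set in ignore_mask are not compared.
   Returns 1 on a match, 0 otherwise. */
static int opcode_matches(uint64_t insn, uint64_t opcode, uint64_t ignore_mask)
{
    uint64_t care = ~ignore_mask & INSN_41BIT;   /* bits that must agree */
    return ((insn ^ opcode) & care) == 0;
}
```

With the slide's values, an instruction equal to the lfetch opcode field does not match the fploads opcode and mask pair, because bit 35 differs and is not ignored by the fploads mask.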


The Prototype VTune™ Analyzer


Example: Matrix Multiply

“Naïve” coding:

    for (i = 0; i < N; i++) {
        for (j = 0; j < N; j++) {
            for (k = 0; k < N; k++) {
                c[i][j] = c[i][j] + a[i][k] * b[k][j];
            }
        }
    }


Sample: Matrix Multiply
A=BxC, N=M=K=1024, double, in C

Coding Method / Modification      Compiler Option   Fraction of peak performance
Naïve Coding: ijk-Loop            -O2               0.006
Simple Loop Exchange: ikj-Loop    -O2               0.03
Explicit Pre-fetching             -O2               0.07
Full Compiler Optimization        -O3               0.16
Loop Unrolling                    -O3               0.35
Bank Conflicts                    -O3               0.54
Transpose                         -O3               0.75
Data Blocking                     -O3               0.91
DTLB                              -O3               0.95
MKL                               N/A               0.92

Peak Performance: 4 GFLOPs (1 GHz Itanium® 2)
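The first two rows of the table differ only in loop order. A minimal sketch of both variants follows, assuming square row-major arrays of double; N is reduced to 64 purely for illustration (the slides use 1024) and the function names are hypothetical.

```c
#define N 64   /* illustrative size only; the measurements above use 1024 */

/* Naive ijk order: the inner loop reads b[k][j] down a column,
   so consecutive accesses are N doubles apart. */
static void matmul_ijk(double c[N][N], double a[N][N], double b[N][N])
{
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            for (int k = 0; k < N; k++)
                c[i][j] = c[i][j] + a[i][k] * b[k][j];
}

/* Loop exchange to ikj order: the inner loop now walks c[i][j] and
   b[k][j] along rows, giving unit-stride accesses. */
static void matmul_ikj(double c[N][N], double a[N][N], double b[N][N])
{
    for (int i = 0; i < N; i++)
        for (int k = 0; k < N; k++)
            for (int j = 0; j < N; j++)
                c[i][j] = c[i][j] + a[i][k] * b[k][j];
}
```

Both variants add the products into each c[i][j] in the same order of k, so they produce identical results; only the memory access pattern changes.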


Software Pipelining Terms

Initiation Interval (II) is the number of cycles between the start of successive iterations in the loop
– If the II is n cycles, a new loop iteration will be completed every n cycles at steady state

Scheduled II is the cycles per iteration of the pipelined loop

Minimum II is the smallest II that is feasible for the pipelined loop (according to the compiler's coded microarchitecture scheduling rules)

Recurrence II is caused by loop-carried dependence edges (memory and register dependences) from instructions in one iteration to instructions in subsequent iterations
– If the recurrences are caused by memory-dependence edges, the report prints out details of such edges


Tips on How to Read the SWP Report

If Recurrence II > 0, the compiler detected loop-carried dependencies

If Minimum II = Scheduled II, the loop is optimally scheduled according to the compiler

The percent of Resource II used by memory ops, floating-point ops and integer ops shows the utilization of the corresponding execution units throughout the loop kernel

If your floating-point Resource II utilization is less than the memory utilization, the situation is not optimal for a number-crunching algorithm. Consider loop balancing.
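To make the notion of a loop-carried dependence concrete, here is a small sketch contrasting two loops. The function names are hypothetical, and what a particular compiler actually reports for them in its SWP report is not shown here; `restrict` is used so that aliasing cannot introduce additional memory dependences.

```c
/* Independent iterations: no value flows from one iteration to the next,
   so there is no recurrence to limit the initiation interval. */
static void scale_add(double *restrict a, const double *restrict b,
                      const double *restrict c, int n)
{
    for (int i = 0; i < n; i++)
        a[i] = 2.0 * b[i] + c[i];
}

/* Loop-carried dependence: iteration i needs a[i-1] written by
   iteration i-1, which forms a recurrence across iterations. */
static void running_sum(double *restrict a, const double *restrict b, int n)
{
    if (n <= 0)
        return;
    a[0] = b[0];
    for (int i = 1; i < n; i++)
        a[i] = a[i - 1] + b[i];
}
```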


Sample: Matrix Multiply
Demo VTune™ on Matrix Application

VTune™ opens the ‘Saved Project File’ of a real Itanium® 2 session

Before: loop unrolled; many L2 bank conflicts

Next step: change the alignment of the matrices by adding padding data

Look at the decrease in L2_OZQ_CANCELS1.BANK_CONF

Cycle reduction: multiply this number by 6 (the penalty in cycles)
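Why padding helps can be sketched with a little address arithmetic. The helper below is hypothetical and assumes a row-major array of 8-byte doubles whose base is 256-byte aligned, with the 16 L2 banks interleaved every 16 bytes; the pad of 2 doubles per row is an illustrative choice, not a value from the demo.

```c
#include <stddef.h>

enum { BANK_WIDTH = 16, NUM_BANKS = 16 };

/* L2 bank of element [row][col] when each row holds row_len doubles.
   Assumes a 256-byte-aligned base and 16-byte bank interleaving. */
static unsigned bank_of_element(size_t row, size_t col, size_t row_len)
{
    size_t byte_offset = (row * row_len + col) * sizeof(double);
    return (unsigned)((byte_offset / BANK_WIDTH) % NUM_BANKS);
}
```

With row_len = 1024 the row stride is 8192 bytes, a multiple of 256, so walking down a column hits the same bank on every row. With row_len = 1024 + 2 the stride becomes 8208 bytes and each successive row moves on to the next bank.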


HLO: Prefetching of Data

Load data to caches ahead of access

Most efficient way to hide memory latency

Three levels of control detail for activation
– Implicitly by the compiler using -O3
– #pragma prefetch <address> / noprefetch <address>
– lfetch assembly instruction via the intrinsic __lfetch(hint, pointer) or the explicit function call CALL LFETCH(hint, pointer)

    #pragma prefetch b,c
    #pragma noprefetch d
    for (i = 0; i < MAX; ++i) {
        a[i] = b[i] * c[i] + d[i];
    }


Prefetching of Data
Explicit Use of the lfetch Intrinsic

    #include <ia64intrin.h>

    void multiply_d(double a[][DIM], double b[][DIM], double c[][DIM])
    {
        int i, j, k, advance = 5;
        double *basepnt_b, *basepnt_c;
        double temp;   /* declared on the slide, unused in this excerpt */

        for (i = 0; i < NUM; i++) {
            for (k = 0; k < NUM; k++) {
                /* Prefetch pointers start 16*advance elements ahead in each row */
                basepnt_b = &b[k][16*advance];
                basepnt_c = &c[i][16*advance];
                for (j = 0; j < NUM; j++) {
                    /* 16 doubles = one 128-byte line: prefetch once per line */
                    if (j%16 == 0 && NUM-j > 16*advance) {
                        __lfetch(__lfhint_nt1, (void *)basepnt_b);
                        basepnt_b += 16;
                        __lfetch(__lfhint_nt1, (void *)basepnt_c);
                        basepnt_c += 16;
                    }
                    c[i][j] = c[i][j] + a[i][k] * b[k][j];
                }
            }
        }
    }


Avoid Denormal Computations

Exponent bits all zero; mantissa bits not zero (single-precision value < 1.1e-38)

Computed in software on the Intel® Itanium® processor: 100s of clock cycles

Symptom: excessive kernel time

Understand why you have denormal results and see if they are justified; if not:
– Translate to normal problem
– Increase precision by scaling values
– Set flush-to-zero (ftz) mode
   – -ftz (default in ICC 8.1 / -O3; turn off with -ftz-)
   – Applies to main() function only
   – Caveat: potentially changes result

Many more switches to control FP precision, exceptions, IEEE 754 standard conformance, etc.
– Whole chapter in compiler manual
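Before changing compiler modes, it can help to check whether denormal values actually occur in the data. The following is a minimal sketch using only standard C; the function name is hypothetical, and it assumes the code is built without flush-to-zero or fast-math options, which would hide the values being counted.

```c
#include <float.h>
#include <stddef.h>

/* Counts single-precision denormals in x[0..n-1]: nonzero values whose
   magnitude is below FLT_MIN, the smallest normal float. */
static size_t count_denormals(const float *x, size_t n)
{
    size_t count = 0;
    for (size_t i = 0; i < n; i++) {
        float mag = x[i] < 0.0f ? -x[i] : x[i];
        if (mag > 0.0f && mag < FLT_MIN)
            count++;
    }
    return count;
}
```

A nonzero count on intermediate arrays suggests looking at scaling or at the flush-to-zero mode described above.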


Non-IEEE is faster …

Result of simple benchmark:

    for (I = 0; I < len; I++) a[I] = func(b[I]);

The penalties on Itanium® for IEEE divides and sqrts and the impact of non-IEEE algorithms are approximated by this test run:

function     Default   Stall cycles       NoIEEE   Gain
base         1.7
recip        7.1       5.4 cycles/call    4.1      3 cycles/call
div          7.1       5.4 cycles/call    4.3      2.8 cycles/call
recip_sqrt   15.1      13.4 cycles/call   8.6      6.5 cycles/call
sqrt         14.6      12.9 cycles/call   7.6      7 cycles/call

Compiler (C/C++) option: -IPF_FP_relaxed(-)


Data Alignment

Aligning data on boundaries ensures maximal performance

Compiler works hard to align data on boundaries for you

Some coding practices can help or hinder the compiler's efforts, and some cause major performance problems
– Avoid casting pointers
– Avoid using packed structures


Alignment Performance

Sample Application:

                                   IPF                   IA-32
                                   Data Size   Clocks    Data Size   Clocks
Aligned                            117MB       1x        58.6MB      1x
Packed                             63.5MB      5x        43.9MB      .75x
Byte Load (__unaligned keyword)    63.5MB      1.2x      N/A         N/A
Reordered                          78MB        .75x      58.6MB      1x
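The "Packed" and "Reordered" rows correspond to two ways of shrinking a structure. The sketch below is illustrative only: the structure and its fields are hypothetical, not the sample application's data, and `__attribute__((packed))` is a GCC-style compiler extension used here as an assumption about how packing is requested.

```c
/* Natural layout: padding is inserted so each member is aligned. */
struct rec_natural {
    char   tag;
    double value;
    char   flag;
    int    count;
};

/* Reordered: largest members first, so less padding is needed and
   every member stays naturally aligned. */
struct rec_reordered {
    double value;
    int    count;
    char   tag;
    char   flag;
};

/* Packed: no padding at all, but value and count become misaligned. */
struct __attribute__((packed)) rec_packed {
    char   tag;
    double value;
    char   flag;
    int    count;
};
```

Reordering recovers much of the space saving of packing while keeping aligned accesses, which is consistent with the clock figures in the table.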


Handling Misalignment

If packed structures are unavoidable
– Consider changing the algorithm
   – Unpack structure once at beginning
   – Do all computation with unpacked data
   – Repack structure when done
– Use the __unaligned keyword
   – Forces compiler to do byte loads
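The unpack, compute, repack approach can be sketched with memcpy, which is a portable way to read a misaligned field without a misaligned load in user code. The 9-byte record format and all names below are hypothetical.

```c
#include <string.h>

/* Hypothetical packed record: a 1-byte tag followed directly by an
   8-byte double, which is therefore misaligned in the buffer. */
enum { PACKED_REC_SIZE = 1 + sizeof(double) };

/* Naturally aligned working copy used for all computation. */
struct rec {
    double        value;
    unsigned char tag;
};

/* Unpack once at the beginning. */
static void unpack_rec(struct rec *out, const unsigned char *in)
{
    out->tag = in[0];
    memcpy(&out->value, in + 1, sizeof out->value);
}

/* Repack when the computation is done. */
static void repack_rec(unsigned char *out, const struct rec *in)
{
    out[0] = in->tag;
    memcpy(out + 1, &in->value, sizeof in->value);
}
```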


Summary

Most VTune™ tools are available for Itanium® today
– Missing parts will show up very soon

VTune™ Performance Analyzer offers in-depth but comfortable access to the complex “performance monitor counters” on Itanium® 2

VTune™ is a ‘live’ product: support for new processors and architectural features is immediately available

VTune 7.2 (Windows) and VTune 3.0 for Linux will support EM64T (64-bit extensions) too – Q3/2004


Itanium® 2 Processor 8-Stage Pipeline

Two-stage Front End (FE) gets and formats instructions from the L1I cache or the Instruction Streaming Buffer
– FE loads the pipeline Instruction Buffer, which stages instructions for the Back End

6-stage Pipeline Back End
– Expands the templates (EXP)
– Prepares registers for access by the instructions (REN)
– Loads data from registers to functional units (REG)
– EXE stage invokes instructions and routes output from single-cycle ALUs back to the REG stage as needed
– DET stage detects micropipeline stalls, exceptions and branch mispredictions and flushes the pipeline
– WRB stage writes output of functional units to registers


Itanium® 2 Memory System

[Figure: Itanium® 2 processor memory hierarchy]
– Core pipeline (functional units)
– 128 FP registers, 128 general registers
– L1I: 16KB, 64B lines, 1 CLK
– L1D: 16KB, 64B lines, 1 CLK
– L2: 256KB, 8-way, 128B lines, 5-7 CLKs, banked
– L3: 3MB, 12-way, 128B lines, 12-15 CLKs
– External memory
– Bandwidth labels in the figure: 32 GB/s (three links), 6.4 GB/s