
HotChips 2002 - Intel® Itanium® 2 Processor

A Brief Analysis of the SPEC CPU2000 Benchmarks on the Intel® Itanium® 2 Processor

James McCormick, HP
Allan Knies, Intel

All trademarks are the property of their respective owners

Agenda

This is a data-oriented presentation, not research.

– Brief performance summary
– Comparison of how the HP and Intel compilers use Itanium® architecture features, and how they compare to the best RISC
– Analysis of the microarchitectural features of the Intel® Itanium® 2 processor and how they affect SPEC CPU2000 performance

Overall Performance

SPEC{int/fp}_base2000: 810/1356 (best on 0.18 µm)
Linpack 1000: 3.5 Gflops (best overall)
TPC-C (SQL/4P): 78K tpmC (best 4P number)
SPECweb_SSL: 1520 connections (best of class)

! Itanium® 2 processor best of class on a wide range of applications

SPEC CPU2000 Results
[Bar chart: SPECint_base2000, SPECint_peak2000, SPECfp_base2000, and SPECfp_peak2000 scores (axis 0 to 1600) for Itanium 2/HP 1 GHz, IBM Power4 1.3 GHz, Alpha 21264 1 GHz, and UltraSPARC III Cu 1.05 GHz. Results from www.spec.org.]

Part I: ISA and Compiler Comparisons

Allan Knies
Itanium® Architecture and Performance Team
Intel

• The number of ‘useful instructions’ (shown in blue) is within ±1% of Alpha
• Total instructions (blue + green) are 20-30% higher than Alpha, due to NOPs
• Bundling is the main cause of the extra NOPs
! We’ll see that the extra instructions/NOPs do not substantially impact performance

Instruction Mix: High Level
[Stacked bar chart: total instructions (axis 0.0E+00 to 4.0E+11) for Itanium 2/Intel, Itanium 2/HP, and Alpha, broken down into Branch or Not Predicated, Predicate True, Predicate False, and Nop.]
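To make the link between bundling and NOPs concrete, here is a minimal sketch of a greedy bundle packer. It is an illustration only, not how either compiler bundles: it uses a subset of the IA-64 three-slot templates, ignores stop bits, the MLX template, data dependencies, and dispersal rules, and the function names (`pack`, `nop_fraction`) are made up for this example.

```python
# Simplified IA-64 bundle templates (stop-bit variants and MLX omitted).
TEMPLATES = ["MII", "MMI", "MFI", "MMF", "MIB", "MBB", "BBB", "MMB", "MFB"]


def fits(op, slot):
    """An op fits a slot of its own type; 'A' (simple ALU) fits M or I."""
    return op == slot or (op == "A" and slot in "MI")


def pack_one(ops, template):
    """Place ops, in program order, into one bundle.

    Returns (bundle, number of ops consumed). Slots that cannot take
    the next op are padded with 'nop'.
    """
    bundle, used = [], 0
    for slot in template:
        if used < len(ops) and fits(ops[used], slot):
            bundle.append(ops[used])
            used += 1
        else:
            bundle.append("nop")
    return bundle, used


def pack(ops):
    """Greedy packing: per bundle, pick the template that consumes the most ops."""
    bundles, i = [], 0
    while i < len(ops):
        bundle, used = max(
            (pack_one(ops[i:], t) for t in TEMPLATES), key=lambda r: r[1]
        )
        if used == 0:
            raise ValueError(f"no template accepts op {ops[i]!r}")
        bundles.append(bundle)
        i += used
    return bundles


def nop_fraction(bundles):
    """Fraction of all bundle slots that hold a NOP."""
    slots = [s for b in bundles for s in b]
    return slots.count("nop") / len(slots)


# Hypothetical instruction stream: two memory ops, one FP op, one branch.
example = pack(["M", "M", "F", "B"])
print(example, nop_fraction(example))
```

In this toy stream the lone branch cannot share a bundle with the preceding ops, so it occupies a bundle of its own padded with two NOPs. A real compiler has more freedom (reordering, stop bits) and more constraints (dependencies), so the fraction here is not comparable to the 20-30% reported above.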

Instruction Mix Details: Introduction

Itanium/Alpha ISAs:
• Itanium® architecture has 40% fewer memory operations and 30% fewer branches than Alpha (some impact from the lack of PGO on Alpha)
• Itanium architecture has about 10% more ALU/compare/shift operations than Alpha
• Itanium architecture has about 20-30% NOPs; we eventually expect this to be under 20%

HP/Intel compiler:
• The HP compiler uses more memory and ALU ops than the Intel compiler
• This implies the HP compiler is more conservative with registers; we’ll see the impact later

! Itanium® architecture trades more ‘easy’ instructions (NOP, ALU, cmp) for fewer ‘hard’ instructions (branch, load)

! There is more than one way to get good performance from a compiler

• Alpha has 1.4x the memory references of Itanium® architecture (including RSE ops)
• RSE ops are easy to optimize in the future, if needed
• The HP compiler is more conservative with registers, but has more ALU ops/reloads

! Large register file, good compiler technology, and RSE pay off
! HP has lower RSE costs, but more reloads/ALU ops

Instruction Mix: Memory Instruction Details
[Stacked bar chart: memory instructions (axis 0.0E+00 to 1.0E+11), split into RSE, Stores, and Loads, for Itanium 2/Intel, Itanium 2/HP, and Alpha.]

• Total time spent in the RSE is only 3%-4% of overall execution
• 1½-3 cycles per call/return for RSE spill/fill activity
• 1-3 instructions per subroutine setup for the RSE
• The Intel compiler has fewer calls, but more cycles per call

! Register stack provides very low-overhead call/return support

Features: Register Stack
[Bar chart: total RSE cycles (axis 0.0E+00 to 1.0E+10) for the Intel and HP compilers.]
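The spill/fill behavior behind these RSE numbers can be sketched with a toy model. This is a simplification under stated assumptions: the class `RegisterStack` and its methods are hypothetical names, frames do not overlap (real frames share output/input registers), and spills and fills happen eagerly at call and return, whereas the real RSE can work lazily in the background. The default of 96 matches the stacked general registers r32-r127.

```python
class RegisterStack:
    """Toy register stack: frames live in a fixed pool of physical registers."""

    def __init__(self, physical=96):
        self.physical = physical
        self.frames = []    # size of each active frame, oldest first
        self.resident = []  # registers of each frame currently in the pool
        self.spills = 0     # registers written to the backing store
        self.fills = 0      # registers read back from the backing store

    def call(self, frame_size):
        """Allocate a frame; spill the oldest registers if the pool overflows."""
        if frame_size > self.physical:
            raise ValueError("frame larger than the physical register pool")
        self.frames.append(frame_size)
        self.resident.append(frame_size)
        overflow = sum(self.resident) - self.physical
        # Spill oldest-first; the newly allocated frame is never spilled.
        for i in range(len(self.resident) - 1):
            if overflow <= 0:
                break
            moved = min(overflow, self.resident[i])
            self.resident[i] -= moved
            self.spills += moved
            overflow -= moved

    def ret(self):
        """Drop the current frame; fill any spilled part of the caller's frame."""
        self.frames.pop()
        self.resident.pop()
        if self.frames:
            missing = self.frames[-1] - self.resident[-1]
            self.fills += missing
            self.resident[-1] = self.frames[-1]


# Hypothetical tiny pool so that a three-deep call chain overflows.
rs = RegisterStack(physical=8)
for _ in range(3):
    rs.call(4)
rs.ret()
rs.ret()
print(rs.spills, rs.fills)
```

As long as the call chain fits in the pool, calls and returns cost no memory traffic at all, which is the intuition behind the low overhead reported on this slide.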

• The HP compiler generates 13% fewer branches than the Intel compiler, but 9% more mispredictions

• The HP compiler generates 31% fewer branches than Alpha

• Itanium-based binaries have fewer taken branches than Alpha; the data is skewed by the lack of PGO on Alpha

! Itanium® architecture reduces the number of branches and branch mispredicts
! The HP and Intel compilers both reduce the number of branches, but with a different focus

Instruction Mix: Branch Detail
[Stacked bar chart: branch instructions (axis 0.0E+00 to 4.0E+10), split into Not Taken Branches and Taken Branches, for Itanium 2/Intel, Itanium 2/HP, and Alpha. Data labels in extraction order: 1.389E+10, 1.282E+10, 2.42E+10, 1.895E+10, 1.624E+10, 1.383E+10.]

• Useful instructions per taken branch is very high
• Useful instructions per mispredicted branch is slightly better for Intel
• Useful instructions per call is very high; both the Intel and HP compilers inline very aggressively

! Advanced compilers reduce stress on branch prediction and I-cache hardware
! Trading instruction-stream size for regularity improves hardware efficiency

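Parallelism: Region Size
[Bar chart: useful instructions (axis 0 to 250) per taken branch, per mispredicted branch, and per call, for the Intel and HP compilers. Data labels:]

Metric                                Intel    HP
Useful instr / taken branch              19    20
Useful instr / mispredicted branch      159   141
Useful instr / call                     212   158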

• About 20-30% of loads are speculative in Intel binaries
• The data shows a tiny penalty for chk.s despite its high usage rate
• Intel has 10x more chk.s than HP; HP uses the ‘no recovery model’ selectively (a per-benchmark decision)
• HP has 10x more chk.a/ld.c than Intel; recovery takes less than 1% of the time

! Speculation is heavily used, but causes little overhead

Features: Speculation
[Bar chart, log scale (1.0E+00 to 1.0E+10): number of checks and number of failed checks, for chk.s and for chk.a and ld.c, Intel vs. HP.]
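For readers unfamiliar with control speculation, the following toy model sketches the idea behind ld.s and chk.s: a speculative load defers any fault by producing a NaT token, the token propagates through dependent operations, and the check runs recovery code only when the token shows up. The Python functions below are stand-ins named after the instructions, the `defer` set and the `stats` dictionary are hypothetical, and nothing here models timing or the ld.a/chk.a/ld.c data-speculation path.

```python
NAT = object()  # stands in for the NaT (Not a Thing) deferred-exception token


def ld_s(memory, addr, defer=frozenset()):
    """Speculative load: return NaT instead of faulting.

    `defer` models addresses whose speculative load is deferred even
    though a normal load would succeed (for example, on a TLB miss).
    """
    if addr in defer or addr not in memory:
        return NAT
    return memory[addr]


def add(a, b):
    """NaT propagates through dependent operations."""
    return NAT if a is NAT or b is NAT else a + b


def chk_s(value, recovery):
    """Check: run recovery code only if the speculated value is NaT."""
    return recovery() if value is NAT else value


def guarded_increment(memory, addr, stats, defer=frozenset()):
    """Source form: `if addr is not None: return memory[addr] + 1`."""
    t = add(ld_s(memory, addr, defer), 1)  # load and add hoisted above the branch
    if addr is None:
        return None  # the speculated value is never consumed, so never checked
    stats["checks"] += 1

    def recovery():
        stats["failed"] += 1
        return memory[addr] + 1  # non-speculative re-execution; may fault for real

    return chk_s(t, recovery)


stats = {"checks": 0, "failed": 0}
mem = {0x10: 41}
print(guarded_increment(mem, 0x10, stats))                 # check passes
print(guarded_increment(mem, None, stats))                 # branch not taken
print(guarded_increment(mem, 0x10, stats, defer={0x10}))   # check fails, recovery runs
print(stats)
```

The `checks` and `failed` counters correspond loosely to the two series in the chart: a check is cheap when it passes, and recovery code runs only on the failed fraction.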

• Useful IPC is computed using ‘unstalled IPC’
• Compilers find 2.5-3.0 IPC in integer apps (even beyond SPEC)
• Dynamic delays reduce this to 1.3 achieved for CPU2000 integer

! Differences in performance/heuristics show headroom for both compilers
! Good IPC is found by both compilers; there is room for microarchitecture improvements

Useful IPC
[Bar chart: useful IPC (axis 0.00 to 4.00), Achieved vs. Found, for the Intel and HP compilers on gzip, vpr, gcc, mcf, crafty, parser, eon, perl, gap, vortex, bzip2, twolf, and the average.]
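The relation between "found" and "achieved" IPC can be written as a small identity: if found IPC is instructions per unstalled cycle and achieved IPC is instructions per total cycle, then achieved equals found times the unstalled fraction of cycles. The sketch below assumes that reading of the slide; the function names and the counter values are hypothetical and chosen only to land near the reported figures.

```python
def unstalled_ipc(instructions, total_cycles, stalled_cycles):
    """IPC the compiler 'found': instructions per cycle with stalls excluded."""
    return instructions / (total_cycles - stalled_cycles)


def achieved_ipc(found_ipc, unstalled_fraction):
    """IPC actually delivered once stall cycles are counted."""
    return found_ipc * unstalled_fraction


# Hypothetical counters: 130 useful instructions, 100 cycles, 50 of them stalled.
found = unstalled_ipc(130, 100, 50)
achieved = achieved_ipc(found, unstalled_fraction=50 / 100)
print(found, achieved)
```

With these made-up inputs, a found IPC of 2.6 and half of all cycles unstalled give an achieved IPC of about 1.3, the same order as the integer numbers quoted in the bullets.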

Notes

• The Itanium-based binaries used for these stats are older than those used for the official SPEC submission (less than 10% difference)

• The results for the Intel® Itanium® 2 processor in Part I are: one with the Intel compiler running on a 64-bit MS OS and another with the HP compiler running under HP-UX

• Alpha ISA numbers were obtained via simulation; the binaries used are near peak (no profile-guided optimization), tuned for the 21264. Alpha data is missing VPR and PERL, so these are left out of all averages in Part I.

• Results are computed with arithmetic averages, so the data is skewed towards long-running benchmarks
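The skew mentioned in the last note can be seen in a small sketch that compares a ratio of arithmetic means with a geometric mean of per-benchmark ratios. The benchmark names and instruction counts are invented for illustration and are not taken from the slides.

```python
from math import prod


def arithmetic_mean(xs):
    return sum(xs) / len(xs)


def geometric_mean(xs):
    return prod(xs) ** (1 / len(xs))


# Hypothetical instruction counts for two compilers on three benchmarks.
compiler_a = {"short1": 1e9, "short2": 2e9, "long": 100e9}
compiler_b = {"short1": 2e9, "short2": 4e9, "long": 100e9}

# Ratio of arithmetic means: dominated by the long-running benchmark.
ratio_of_means = arithmetic_mean(list(compiler_b.values())) / arithmetic_mean(
    list(compiler_a.values())
)

# Geometric mean of per-benchmark ratios: every benchmark weighs equally.
mean_of_ratios = geometric_mean([compiler_b[k] / compiler_a[k] for k in compiler_a])

print(ratio_of_means, mean_of_ratios)
```

Compiler B executes twice as many instructions on two of the three benchmarks, yet the arithmetic-average comparison shows only about a 3% difference because the long benchmark dominates the sum; the per-benchmark geometric mean shows a far larger gap.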

Part II: Microarchitecture

James McCormick
HP Colorado VLSI Laboratory

Cycle Accounting: INT
[Pie chart. Categories: PIPE FLUSH, DATA ACCESS, INSTR ACCESS, SCOREBOARD, RSE, UNSTALLED. Slice labels in extraction order: 7%, 38%, 3%, 1%, 49%, 2%.]

Cycle Accounting: FP
[Pie chart, same categories. Slice labels in extraction order: 1%, 47%, 1%, 0%, 50%, 1%.]

! Remaining: unstalled execution and data access
! Great performance

Instructions Per Unstalled Cycle
[Stacked bar chart: IPC (axis 0 to 6) for Integer and Floating Point; legend: USEFUL, w/ PRED OFF, + NOPS.]

! High machine parallelism

Total Instruction Fetch Latency
[Stacked bar chart: percent of total instruction fetch latency (axis 0 to 100) by source (L1I HIT, L2 INSTR HIT, L3 INSTR HIT, MEM INSTR ACCESS) per benchmark.
Integer: 164.gzip, 175.vpr, 176.gcc, 181.mcf, 186.crafty, 197.parser, 252.eon, 253.perlbmk, 254.gap, 255.vortex, 256.bzip2, 300.twolf, AVERAGE.
Floating point: 168.wupwise, 171.swim, 172.mgrid, 173.applu, 177.mesa, 178.galgel, 179.art, 183.equake, 187.facerec, 188.ammp, 189.lucas, 191.fma3d, 200.sixtrack, 301.apsi, AVERAGE.]

! Noticeable L1I misses
! Very small I-fetch component

Instruction Fetch Ahead
[Chart for INT AVERAGE and FP AVERAGE: instructions per cycle (axis 0 to 6) and percent exposed (axis 0 to 100); legend: DECOUPLED DISP THROUGHPUT, DECOUPLED FE THROUGHPUT, FE BUBBLES EXPOSED.]

! High FE throughput rate
! I-miss latency hidden

Branch Mispredictions
[Chart for the integer benchmarks (164.gzip, 175.vpr, 176.gcc, 181.mcf, 186.crafty, 197.parser, 252.eon, 253.perlbmk, 254.gap, 255.vortex, 256.bzip2, 300.twolf, AVERAGE): misprediction percentage (axis 0 to 20, one off-scale value labeled 37.7) and full pipelines per flush (axis 0 to 20); legend: WRONG PATH, WRONG TARGET, UNKNOWN, FULL PIPELINES PER MISPRED BR.]

! High accuracy, low penalty
! Helps instruction fetch

Total Data Read Latency
[Stacked bar chart: percent of total data read latency (axis 0 to 100) by source (L1D DATA READ, L2 DATA READ, L3 DATA READ, MEM DATA READ) per benchmark.
Integer: 164.gzip, 175.vpr, 176.gcc, 181.mcf, 186.crafty, 197.parser, 252.eon, 253.perlbmk, 254.gap, 255.vortex, 256.bzip2, 300.twolf, AVERAGE.
Floating point: 168.wupwise, 171.swim, 172.mgrid, 173.applu, 177.mesa, 178.galgel, 179.art, 183.equake, 187.facerec, 188.ammp, 189.lucas, 191.fma3d, 200.sixtrack, 301.apsi, AVERAGE.]

! Large component, large opportunity
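One plausible way to derive a per-level latency breakdown like the one charted here is to weight the number of reads served by each level by that level's latency. The slides do not state the exact method, so the sketch below is an assumption; the function name, read counts, and cycle latencies are all illustrative and not measured values.

```python
def latency_share(reads, latency_cycles):
    """Return each level's percentage of total read latency.

    reads: number of reads served by each level.
    latency_cycles: cycles per read for each level.
    """
    totals = {level: reads[level] * latency_cycles[level] for level in reads}
    grand_total = sum(totals.values())
    return {level: 100.0 * t / grand_total for level, t in totals.items()}


# Hypothetical counts and latencies, for illustration only.
reads = {"L1D": 900, "L2": 80, "L3": 15, "MEM": 5}
latency_cycles = {"L1D": 1, "L2": 5, "L3": 12, "MEM": 200}

shares = latency_share(reads, latency_cycles)
for level, pct in shares.items():
    print(f"{level}: {pct:.1f}%")
```

Even with only 0.5% of reads going to memory in this made-up mix, memory accounts for the largest share of latency, which is the kind of effect that makes data access a large opportunity.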

Summary

Itanium® 2 Processor Delivers Leadership Performance

– Architecture / Compilers
  • Expose lots of ILP to in-order pipeline
  • Replace difficult instructions with easy ones
  • RSE and large register file work well together

– CPU Design
  • Machine parallelism handles high ILP
  • Aggressive design reduces bottlenecks

Acknowledgements

Jason Cantin (U. Wisconsin):
Ran all the experiments to gather the Alpha ISA data, with changes/alterations at our request. Without Jason, we would have no Alpha data of any kind. Thanks for his efforts to rerun all the data for us!

Hwansoo Han, Youngsoo Choi, Geetha Vederaman (Intel):
Intel data gathering, formatting, methodology, performance monitoring

Bryan Black, Ed Grochowski, Jim Callister, Carole Dulong (Intel):
Extensive draft review and comments

Caliper Team (HP):
Tool support

Backup/Reference Slides

Configuration Information

Intel® Itanium® 2 processor data for Intel systems/compilers:
– Binaries: Pre-production version of Intel C++ 6.0 Compiler, -O3 with interprocedural optimization and profile-guided optimization
– Run on: Pre-production stepping of Itanium 2 processor, 800 MHz/200 MHz core/bus, Intel 870 chipset; monitoring kernel and user instructions and events

Intel® Itanium® 2 processor data for HP systems/compilers:
– Binaries: Pre-production version of HP compiler
– Run on: Pre-production stepping of Itanium 2 processor, 1000 MHz/200 MHz core/bus, rx2600 prototype; monitoring kernel and user instructions and events

Alpha ISA data:
– Run on: Functional simulator, system code not simulated
– Simulation details: http://www.cs.wisc.edu/multifacet/misc/spec2000cache-data/
– All data thanks to Jason Cantin at the University of Wisconsin.
– Binaries: http://www.eecs.umich.edu/~chriswea/benchmarks/Com.cf
  In general, these binaries are optimized for the 21264, peak optimization. Usually -g3 -fast -O4, but NO profile feedback. Compiler: DEC C V5.9-005 and DIGITAL C++ V6.1-027

Other remarks:
– All averages in the slides leave out PERL and VPR because data was not available for Alpha