A Brief Analysis of the SPEC CPU2000 Benchmarks on the Intel® Itanium… · 2013. 7. 28. ·...

HotChips 2002 - Intel® Itanium® 2 Processor A Brief Analysis of the SPEC CPU2000 Benchmarks on the Intel® Itanium® 2 Processor James McCormick, HP Allan Knies, Intel All trademarks are the property of their respective owners

  • HotChips 2002 - Intel® Itanium® 2 Processor

    A Brief Analysis of the SPEC CPU2000 Benchmarks on the Intel® Itanium® 2 Processor

    James McCormick, HP
    Allan Knies, Intel

    All trademarks are the property of their respective owners

  • HotChips 2002 - Intel® Itanium® 2 Processor

    Agenda

    This is a data-oriented presentation, not research

    – Brief performance summary
    – Comparison of how the HP and Intel compilers use Itanium® architecture features, with a comparison to the best RISC
    – Analysis of microarchitectural features of the Intel® Itanium® 2 processor and how they affect SPEC CPU2000 performance

  • HotChips 2002 - Intel® Itanium® 2 Processor

    Overall Performance

    SPEC{int/fp}_base2000: 810/1356 (best 0.18u)
    Linpack 1000: 3.5 Gflops (best overall)
    TPC-C (SQL/4P): 78K tpmC (best 4P number)
    SPECweb_SSL: 1520 connections (best of class)

    ! Itanium® 2 processor best of class on a wide range of applications

    SPEC CPU2000 Results

    [Figure: bar chart (axis 0 to 1600) of SPECint_base2000, SPECint_peak2000, SPECfp_base2000, and SPECfp_peak2000 for Itanium 2/HP 1 GHz, IBM Power4 1.3 GHz, Alpha 21264 1 GHz, and UltraSPARC IIICu 1.05 GHz. Results from www.spec.org]

  • HotChips 2002 - Intel® Itanium® 2 Processor

    Part I: ISA and Compiler Comparisons

    Allan Knies

    Itanium® Architecture and Performance Team

    Intel

  • HotChips 2002 - Intel® Itanium® 2 Processor

    Instruction Mix: High Level

    • Number of ‘useful instructions’ (shown in blue) is within +/- 1% of Alpha
    • Total instructions (blue+green) are 20-30% higher than Alpha (due to NOPs)
    • Bundling is the main cause of extra NOPs

    ! We’ll see that the extra instructions/NOPs do not substantially impact performance

    [Figure: stacked bar chart of total instructions (axis 0.0E+00 to 4.0E+11) for Itanium 2/Intel, Itanium 2/HP, and Alpha. Legend: Branch or Not Predicated, Predicate True, Predicate False, Nop]
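As a rough illustration of why bundling adds NOPs, the sketch below greedily packs a stream of instruction unit types into 3-slot bundles using a subset of the Itanium bundle templates. It is a simplified model and not how either compiler schedules code: stop bits, the MLX template, and instructions that can issue on more than one unit type are ignored, and the names `pack` and `nop_fraction` are hypothetical.

```python
# Simplified subset of the Itanium bundle templates (slot unit types only).
# Assumption: stop bits and the MLX template are ignored.
TEMPLATES = ["MII", "MMI", "MFI", "MMF", "MIB", "MBB", "BBB", "MMB", "MFB"]


def pack(stream):
    """Greedily pack unit types ('M', 'I', 'F', 'B') into 3-slot bundles.

    Instructions keep their program order; a slot whose unit type does not
    match the next instruction is filled with a NOP.
    """
    if any(unit not in "MIFB" for unit in stream):
        raise ValueError("stream may only contain M, I, F, or B")
    bundles, pos = [], 0
    while pos < len(stream):
        best_slots, best_used = None, 0
        for template in TEMPLATES:
            slots, used = [], 0
            for unit in template:
                if pos + used < len(stream) and stream[pos + used] == unit:
                    slots.append(unit)
                    used += 1
                else:
                    slots.append("nop")
            if used > best_used:  # keep the template that places the most
                best_slots, best_used = slots, used
        bundles.append(best_slots)
        pos += best_used
    return bundles


def nop_fraction(bundles):
    """Share of all bundle slots that hold a NOP."""
    slots = [slot for bundle in bundles for slot in bundle]
    return slots.count("nop") / len(slots) if slots else 0.0
```

In this model `pack("MIB")` fills one bundle with no NOPs, while `pack("FF")` needs two MFI bundles and four NOPs, because no template in the subset offers two F slots.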

  • HotChips 2002 - Intel® Itanium® 2 Processor

    Instruction Mix Details Introduction

    Itanium/Alpha ISAs:
    • Itanium® architecture has 40% fewer memory operations and 30% fewer branches than Alpha (some impact from no PGO on Alpha)
    • Itanium architecture has about 10% more ALU/compares/shifts than Alpha
    • Itanium architecture has about 20-30% NOPs; eventually expect this to be under 20%

    HP/Intel Compiler:
    • HP compiler uses more memory and ALU ops than Intel
    • Implies HP is more conservative with registers; we’ll see the impact later

    ! Itanium® architecture trades more ‘easy’ instructions (NOP, ALU, cmp) for fewer ‘hard’ instructions (branch, load)

    ! More than one way to get good performance from a compiler

  • HotChips 2002 - Intel® Itanium® 2 Processor

    Instruction Mix: Memory Instruction Details

    • Alpha has 1.4x the memory references of Itanium® architecture (incl. RSE ops)
    • RSE ops are easy to optimize in the future, if needed
    • HP compiler is more conservative with registers, but has more ALU ops/reloads

    ! Large register file, good compiler technology, and RSE pay off
    ! HP has lower RSE costs, but more reloads/ALU ops

    [Figure: stacked bar chart of memory instructions (axis 0.0E+00 to 1.0E+11) for Itanium 2/Intel, Itanium 2/HP, and Alpha. Legend: RSE, Stores, Loads]

  • HotChips 2002 - Intel® Itanium® 2 Processor

    Features: Register Stack

    • Total time spent in RSE is only 3%-4% of overall execution
    • 1½-3 cycles per call/return for RSE spill/fill activity
    • 1-3 instructions per subroutine setup for RSE
    • Intel compiler has fewer calls, but more cycles/call

    ! Register stack provides very low overhead call/return support

    [Figure: bar chart of total RSE cycles (axis 0.0E+00 to 1.0E+10) for the Intel and HP compilers]
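To make the spill/fill idea concrete, here is a toy model of a register stack, written as a sketch under stated assumptions rather than a description of the actual RSE hardware. It assumes 96 stacked physical registers, spills only as many of the oldest registers as a new frame needs, and refills only the caller's frame on return. The class name `RegisterStackModel` and its counters are hypothetical.

```python
class RegisterStackModel:
    """Toy model of a register stack with spill/fill to a backing store."""

    def __init__(self, physical=96):
        self.physical = physical  # stacked physical registers (assumed 96)
        self.frames = []          # sizes of active frames, outermost first
        self.resident = 0         # registers of active frames held on chip
        self.spills = 0           # registers written to the backing store
        self.fills = 0            # registers read back from the backing store

    def call(self, frame_size):
        """Allocate a frame, spilling the oldest registers if needed."""
        if frame_size > self.physical:
            raise ValueError("frame larger than the register file")
        overflow = self.resident + frame_size - self.physical
        if overflow > 0:
            self.spills += overflow
            self.resident -= overflow
        self.frames.append(frame_size)
        self.resident += frame_size

    def ret(self):
        """Release the current frame and refill the caller's if it was spilled."""
        self.resident -= self.frames.pop()
        if self.frames and self.resident < self.frames[-1]:
            missing = self.frames[-1] - self.resident
            self.fills += missing
            self.resident += missing
```

In this model a call chain that fits in the register file (for example, four nested 16-register frames) causes no memory traffic at all; spills and fills appear only once the nesting depth exceeds the physical registers.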

  • HotChips 2002 - Intel® Itanium® 2 Processor

    Instruction Mix: Branch Detail

    • HP compiler generates 13% fewer branches than Intel, 9% more mispredictions
    • HP compiler generates 31% fewer branches than Alpha
    • Itanium-based binaries have fewer taken branches than Alpha; data skewed by lack of PGO

    ! Itanium® architecture reduces the number of branches and branch mispredicts
    ! HP/Intel compilers both reduce the number of branches, but with different focus

    [Figure: stacked bar chart of branch instructions (axis 0.0E+00 to 4.0E+10) for Itanium 2/Intel, Itanium 2/HP, and Alpha. Legend: Not Taken Branches, Taken Branches. Data labels in extraction order: 1.389E+10, 1.282E+10, 2.42E+10, 1.895E+10, 1.624E+10, 1.383E+10]

  • HotChips 2002 - Intel® Itanium® 2 Processor

    Parallelism: Region Size

    • Useful instructions per taken branch is very high
    • Useful instructions per mispredicted branch slightly better for Intel
    • Useful instructions/call very high; Intel/HP compilers inline very aggressively

    ! Advanced compilers reduce stress on branch prediction/Icache HW
    ! Trading Istream size for regularity improves HW efficiency

    [Figure: bar chart of useful instructions (axis 0 to 250) per taken branch, per mispredicted branch, and per call, for the Intel and HP compilers. Data labels in extraction order: 19, 159, 212, 20, 141, 158]

  • HotChips 2002 - Intel® Itanium® 2 Processor

    Features: Speculation

    • About 20-30% of loads are speculative in Intel binaries
    • Data shows tiny penalty for chk.s usage despite high usage rate
    • Intel has 10x more chk.s than HP; HP uses ‘no recovery model’ selectively (per benchmark decision)
    • HP has 10x more chk.a/ld.c than Intel; recovery is less than 1% of time

    ! Speculation heavily used, but causes little overhead

    [Figure: bar chart of instruction counts (log scale, 1.0E+00 to 1.0E+10) showing number of checks and number of failed checks, for chk.s and for chk.a and ld.c, Intel vs. HP]
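The control-speculation pattern behind `chk.s` can be sketched in a few lines. This is a conceptual Python model, not Itanium code: a speculative load (`ld.s`) defers any fault by producing a NaT token, and the later check (`chk.s`) runs recovery code only when that token is seen. The dictionary-as-memory and the helper names `ld_s`, `chk_s`, `load_checked`, and `guarded_read` are assumptions made for illustration.

```python
NAT = object()  # stand-in for the NaT ("not a thing") deferred-exception token


def ld_s(memory, address):
    """Speculative load: defer the fault instead of raising it."""
    return memory.get(address, NAT)


def chk_s(value, recovery):
    """Check a speculative result; run recovery code only if it was deferred."""
    return recovery() if value is NAT else value


def load_checked(memory, address):
    """Non-speculative load used by the recovery path; faults are raised."""
    return memory[address]


def guarded_read(memory, pointer, default=0):
    """Read through a pointer, with the load hoisted above the null check."""
    hoisted = ld_s(memory, pointer)  # load issued before the guard is known
    if pointer is None:              # original guard
        return default
    return chk_s(hoisted, lambda: load_checked(memory, pointer))
```

When the hoisted load succeeds, the check costs almost nothing, which is consistent with the slide's observation that heavy `chk.s` usage carries only a tiny penalty as long as checks rarely fail.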

  • HotChips 2002 - Intel® Itanium® 2 Processor

    Useful IPC

    • Useful IPC computed using ‘unstalled IPC’
    • Compilers find 2.5-3.0 IPC in integer apps (even beyond SPEC)
    • Dynamic delays reduce this to 1.3 achieved for CPU2000 integer

    ! Differences in perf/heuristics show headroom for both compilers
    ! Good IPC found by both compilers, room for uArch improvements

    [Figure: bar chart of useful IPC (axis 0.00 to 4.00), Achieved vs. Found, for the Intel and HP compilers on gzip, vpr, gcc, mcf, crafty, parser, eon, perl, gap, vortex, bzip2, twolf, and the average]
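One way to read the "found" versus "achieved" distinction is as the same useful-instruction count divided by two different cycle counts. The sketch below assumes that "found" IPC uses only unstalled cycles and "achieved" IPC uses all cycles; the function name and the counter values in the usage note are hypothetical, not measurements from the slides.

```python
def useful_ipc(useful_instructions, total_cycles, stalled_cycles):
    """Return (found, achieved) useful IPC.

    found    = useful instructions / unstalled cycles (what the compiler exposes)
    achieved = useful instructions / all cycles       (what the machine delivers)
    """
    unstalled_cycles = total_cycles - stalled_cycles
    if unstalled_cycles <= 0 or total_cycles <= 0:
        raise ValueError("cycle counts must leave some unstalled cycles")
    found = useful_instructions / unstalled_cycles
    achieved = useful_instructions / total_cycles
    return found, achieved
```

For example, `useful_ipc(2_600_000_000, 2_000_000_000, 1_000_000_000)` gives a found IPC of 2.6 and an achieved IPC of 1.3: stalling for half of all cycles halves the delivered rate.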

  • HotChips 2002 - Intel® Itanium® 2 Processor

    Notes

    • Itanium-based binaries used for these stats are older than those used for the official SPEC submission (less than 10% difference)

    • The results for Intel® Itanium® 2 processor in Part I are: one with the Intel compiler running on 64-bit MS OS and another with the HP compiler running under HP-UX

    • Alpha ISA numbers via simulation, binaries used near peak (no profile guided optimization), tuned for 21264. Alpha data is missing VPR and PERL, which are thus left out of all averages in Part I.

    • Results computed with arithmetic averages; data thus skewed towards long-running benchmarks
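The skew mentioned in the last note can be seen with a small worked example. The sketch below compares a ratio of arithmetic totals with a geometric mean of per-benchmark ratios; the two "benchmarks" and their counts are invented for illustration and are not data from the slides.

```python
from statistics import geometric_mean

# Hypothetical per-benchmark instruction counts (not data from the slides).
machine_a = {"short": 1.0e9, "long": 100.0e9}
machine_b = {"short": 2.0e9, "long": 100.0e9}


def ratio_of_sums(a, b):
    """Ratio of arithmetic totals: dominated by the largest benchmarks."""
    return sum(b.values()) / sum(a.values())


def geomean_of_ratios(a, b):
    """Geometric mean of per-benchmark ratios: each benchmark weighs equally."""
    return geometric_mean([b[name] / a[name] for name in a])
```

Here the short benchmark doubles its count on machine B, yet the ratio of sums barely moves (102/101), while the geometric mean of the ratios is the square root of 2. An arithmetic aggregate therefore mostly reflects the long-running benchmark.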

  • HotChips 2002 - Intel® Itanium® 2 Processor

    Part II: Microarchitecture

    James McCormick
    HP Colorado VLSI Laboratory

  • HotChips 2002 - Intel® Itanium® 2 Processor

    Cycle Accounting: INT

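    [Figure: pie chart. Slice labels in extraction order: 7%, 38%, 3%, 1%, 49%, 2%. Legend: PIPE FLUSH, DATA ACCESS, INSTR ACCESS, SCOREBOARD, RSE, UNSTALLED]

    Cycle Accounting: FP

    [Figure: pie chart. Slice labels in extraction order: 1%, 47%, 1%, 0%, 50%, 1%. Legend: PIPE FLUSH, DATA ACCESS, INSTR ACCESS, SCOREBOARD, RSE, UNSTALLED]

    ! Remaining: unstalled execution and data access
    ! Great performance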

  • HotChips 2002 - Intel® Itanium® 2 Processor

    Instructions Per Unstalled Cycle

    [Figure: bar chart of IPC (axis 0 to 6) for Integer and Floating Point. Legend as extracted: USEFUL w/ PRED OFF + NOPS]

    ! High machine parallelism

  • HotChips 2002 - Intel® Itanium® 2 Processor

    Total Instruction Fetch Latency

    [Figure: stacked bar chart, percent (0 to 100) of total instruction fetch latency per benchmark. Integer: 164.gzip, 175.vpr, 176.gcc, 181.mcf, 186.crafty, 197.parser, 252.eon, 253.perlbmk, 254.gap, 255.vortex, 256.bzip2, 300.twolf, AVERAGE. Floating Point: 168.wupwise, 171.swim, 172.mgrid, 173.applu, 177.mesa, 178.galgel, 179.art, 183.equake, 187.facerec, 188.ammp, 189.lucas, 191.fma3d, 200.sixtrack, 301.apsi, AVERAGE. Legend: L1I HIT, L2 INSTR HIT, L3 INSTR HIT, MEM INSTR ACCESS]

    ! Noticeable L1I misses
    ! Very small I-fetch component

  • HotChips 2002 - Intel® Itanium® 2 Processor

    Instruction Fetch Ahead

    [Figure: chart for INT AVERAGE and FP AVERAGE with two axes: instructions per cycle (0 to 6) and percent exposed (0 to 100). Legend: DECOUPLED DISP THROUGHPUT, DECOUPLED FE THROUGHPUT, FE BUBBLES EXPOSED]

    ! High FE throughput rate
    ! I-miss latency hidden
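The idea of a front end that fetches ahead of the back end can be sketched with a small queue model. This is an illustration under assumptions, not a model of the actual pipeline: the front end delivers some number of bundles per cycle into a decoupling buffer, the back end tries to issue some number per cycle, and a front-end bubble is "exposed" only when the buffer cannot cover the back end's demand. The function name and the default capacity of 8 bundles are assumptions.

```python
def exposed_bubbles(fetched_per_cycle, wanted_per_cycle, capacity=8):
    """Count cycles in which the back end wanted bundles the buffer lacked.

    fetched_per_cycle: bundles delivered by the front end each cycle
    wanted_per_cycle:  bundles the back end could issue each cycle
    capacity:          decoupling buffer size in bundles (assumed value)
    """
    buffered = 0
    exposed = 0
    for fetched, wanted in zip(fetched_per_cycle, wanted_per_cycle):
        buffered = min(capacity, buffered + fetched)  # front end fills buffer
        issued = min(wanted, buffered)                # back end drains it
        buffered -= issued
        if issued < wanted:
            exposed += 1
    return exposed
```

With a front end that fetches two bundles per cycle but stalls for two cycles, `exposed_bubbles([2, 2, 2, 0, 0, 2], [1] * 6)` is 0, because the buffer built up earlier covers the stall. If the front end only matches demand, `exposed_bubbles([1, 1, 1, 0, 0, 1], [1] * 6)` is 2.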

  • HotChips 2002 - Intel® Itanium® 2 Processor

    Branch Mispredictions

    [Figure: chart per integer benchmark (164.gzip, 175.vpr, 176.gcc, 181.mcf, 186.crafty, 197.parser, 252.eon, 253.perlbmk, 254.gap, 255.vortex, 256.bzip2, 300.twolf, AVERAGE) with two axes: misprediction percentage (0 to 20) and full pipelines per flush (0 to 20). Legend: WRONG PATH, WRONG TARGET, UNKNOWN, FULL PIPELINES PER MISPRED BR. Data label as extracted: 37.7]

    ! High accuracy, low penalty
    ! Helps instruction fetch

  • HotChips 2002 - Intel® Itanium® 2 Processor

    Total Data Read Latency

    [Figure: stacked bar chart, percent of latency (0 to 100) of total data read latency per benchmark. Integer: 164.gzip, 175.vpr, 176.gcc, 181.mcf, 186.crafty, 197.parser, 252.eon, 253.perlbmk, 254.gap, 255.vortex, 256.bzip2, 300.twolf, AVERAGE. Floating Point: 168.wupwise, 171.swim, 172.mgrid, 173.applu, 177.mesa, 178.galgel, 179.art, 183.equake, 187.facerec, 188.ammp, 189.lucas, 191.fma3d, 200.sixtrack, 301.apsi, AVERAGE. Legend: L1D DATA READ, L2 DATA READ, L3 DATA READ, MEM DATA READ]

    ! Large component, large opportunity
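A breakdown of read latency by memory level can be computed as reads times latency for each level, normalized to 100%. The sketch below shows that arithmetic only; the function name, the read counts, and the per-level latencies in the usage note are hypothetical values chosen for illustration, not measurements or specifications from the slides.

```python
def latency_shares(reads, latency_cycles):
    """Percent of total read latency contributed by each memory level.

    reads:          number of reads satisfied at each level
    latency_cycles: assumed latency, in cycles, of a read at that level
    """
    cycles = {level: reads[level] * latency_cycles[level] for level in reads}
    total = sum(cycles.values())
    if total == 0:
        raise ValueError("no latency to attribute")
    return {level: 100.0 * spent / total for level, spent in cycles.items()}
```

With hypothetical inputs such as `latency_shares({"L1D": 900, "L2": 80, "L3": 15, "MEM": 5}, {"L1D": 1, "L2": 5, "L3": 12, "MEM": 200})`, memory accounts for the largest share of latency even though it serves the fewest reads, which is the kind of effect a per-level latency chart makes visible.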

  • HotChips 2002 - Intel® Itanium® 2 Processor

    Summary

    Itanium® 2 Processor Delivers Leadership Performance

    – Architecture / Compilers
      • Expose lots of ILP to in-order pipeline
      • Replace difficult instructions with easy ones
      • RSE and large register file work well together

    – CPU Design
      • Machine parallelism handles high ILP
      • Aggressive design reduces bottlenecks

  • HotChips 2002 - Intel® Itanium® 2 Processor

    Acknowledgements

    Jason Cantin (U. Wisconsin):
    Ran all the experiments to gather the Alpha ISA data with changes/alterations at our request. Without Jason, we would have no Alpha data of any kind. Thanks to his efforts to rerun all the data for us!

    Hwansoo Han, Youngsoo Choi, Geetha Vederaman (Intel):
    Intel data gathering, formatting, methodology, performance monitoring

    Bryan Black, Ed Grochowski, Jim Callister, Carole Dulong (Intel):
    Extensive draft review and comments

    Caliper Team (HP):
    Tool support

  • HotChips 2002 - Intel® Itanium® 2 Processor

    Backup/Reference Slides

  • HotChips 2002 - Intel® Itanium® 2 Processor

    Configuration Information

    Intel® Itanium® 2 processor data for Intel systems/compilers:
    – Binaries: Pre-production version of Intel C++ 6.0 Compiler, -O3 with interprocedural optimization and profile guided optimization
    – Run on: Pre-production stepping of Itanium 2 processor 800 MHz/200 MHz core/bus, Intel 870 chipset, monitor kernel and user instructions and events

    Intel® Itanium® 2 processor data for HP systems/compilers:
    – Binaries: Pre-production version of HP compiler
    – Run on: Pre-production stepping of Itanium 2 processor 1000 MHz/200 MHz core/bus, rx2600 prototype, monitor kernel and user instructions and events

    Alpha ISA data:
    – Run on: Functional simulator, system code not simulated
    – Simulation details: http://www.cs.wisc.edu/multifacet/misc/spec2000cache-data/
    – All data thanks to Jason Cantin at the University of Wisconsin.
    – Binaries: http://www.eecs.umich.edu/~chriswea/benchmarks/Com.cf
      In general, these binaries are optimized for 21264, peak optimization. Usually, -g3 -fast -O4, but NO profile feedback. Compiler: DEC C V5.9-005 and DIGITAL C++ V6.1-027

    Other remarks:
    – All averages in slides left out PERL and VPR due to data not available for Alpha