Simultaneous Multithreading: A Platform for Next-Generation Processors

Susan J. Eggers, University of Washington
Joel S. Emer, Digital Equipment Corp.
Henry M. Levy, University of Washington
Jack L. Lo, University of Washington
Rebecca L. Stamm, Digital Equipment Corp.
Dean M. Tullsen, University of California, San Diego

Simultaneous multithreading exploits both instruction-level and thread-level parallelism by issuing instructions from different threads in the same cycle.


As the processor community prepares for a billion transistors on a chip, researchers continue to debate the most effective way to use them. One approach is to add more memory (either cache or primary memory) to the chip, but the performance gain from memory alone is limited. Another approach is to increase the level of systems integration, bringing support functions like graphics accelerators and I/O controllers on chip. Although integration lowers system costs and communication latency, the overall performance gain to applications is again marginal.

We believe the only way to significantly improve performance is to enhance the processor's computational capabilities. In general, this means increasing parallelism in all its available forms. At present only certain forms of parallelism are being exploited. Current superscalars, for example, can execute four or more instructions per cycle; in practice, however, they achieve only one or two, because current applications have low instruction-level parallelism. Placing multiple superscalar processors on a chip is also not an effective solution because, in addition to the low instruction-level parallelism, performance suffers when there is little thread-level parallelism. A better solution is to design a processor that can exploit all types of parallelism well.

Simultaneous multithreading is a processor design that meets this goal, because it consumes both thread-level and instruction-level parallelism. In SMT processors, thread-level parallelism can come from either multithreaded, parallel programs or individual, independent programs in a multiprogramming workload. Instruction-level parallelism comes from each single program or thread. Because it successfully (and simultaneously) exploits both types of parallelism, SMT processors use resources more efficiently, and both instruction throughput and speedups are greater.

Simultaneous multithreading combines hardware features of wide-issue superscalars and multithreaded processors. From superscalars, it inherits the ability to issue multiple instructions each cycle; and like multithreaded processors, it contains hardware state for several programs (or threads). The result is a processor that can issue multiple instructions from multiple threads each cycle, achieving better performance for a variety of workloads. For a mix of independent programs (multiprogramming), the overall throughput of the machine is improved. Similarly, programs that are parallelizable, either by a compiler or a programmer, reap the same throughput benefits, resulting in program speedup. Finally, a single-threaded program that must execute alone will have all machine resources available to it and will maintain roughly the same level of performance as when executing on a single-threaded, wide-issue processor.

Equal in importance to its performance benefits is the simplicity of SMT's design. Simultaneous multithreading adds minimal hardware complexity to, and in fact is a straightforward extension of, conventional dynamically scheduled superscalars. Hardware designers can focus on building a fast, single-threaded superscalar and add SMT's multithread capability on top.

Given the enormous transistor budget in the next computer era, we believe simultaneous multithreading provides an efficient base technology that can be used in many ways to extract improved performance. For example, on a one-billion-transistor chip, 20 to 40 SMTs could be used side by side to achieve performance comparable to that of a much larger number of conventional superscalars. With IRAM technology, SMT's high execution rate, which currently doubles memory bandwidth requirements, can fully exploit the increased bandwidth capability. In both billion-transistor scenarios, the SMT processor we describe here could serve as the processor building block.

How SMT works

The difference between superscalar, multithreading, and simultaneous multithreading is pictured in Figure 1, which shows sample execution sequences for the three architectures. Each row represents the issue slots for a single execution cycle: a filled box indicates that the processor found an instruction to execute in that issue slot on that cycle; an empty box denotes an unused slot. We characterize the unused slots as horizontal or vertical waste. Horizontal waste occurs when some, but not all, of the issue slots in a cycle can be used. It typically occurs because of poor instruction-level parallelism. Vertical waste occurs when a cycle goes completely unused. This can be caused by a long-latency instruction (such as a memory access) that inhibits further instruction issue.

[Figure 1. How architectures partition issue slots (functional units): a superscalar (a), a fine-grained multithreaded superscalar (b), and a simultaneous multithreaded processor (c). The rows of squares represent issue slots. The processor either finds an instruction to execute (filled box) or the slot goes unused (empty box).]

Figure 1a shows a sequence from a conventional superscalar. As in all superscalars, it is executing a single program, or thread, from which it attempts to find multiple instructions to issue each cycle. When it cannot, the issue slots go unused, and it incurs both horizontal and vertical waste.

Figure 1b shows a sequence from a multithreaded architecture, such as the Tera.1 Multithreaded processors contain hardware state (a program counter and registers) for several threads. On any given cycle a processor executes instructions from one of the threads. On the next cycle, it switches to a different thread context and executes instructions from the new thread. As the figure shows, the primary advantage of multithreaded processors is that they better tolerate long-latency operations, effectively eliminating vertical waste. However, they cannot remove horizontal waste. Consequently, as instruction issue width continues to increase, multithreaded architectures will ultimately suffer the same fate as superscalars: they will be limited by the instruction-level parallelism in a single thread.
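For contrast with the SMT policy described next, here is a minimal Python sketch (our illustration, not from the article) of the fine-grained policy just described: exactly one thread may issue per cycle, rotating round-robin, which is why horizontal waste within a cycle is untouched.

    # Fine-grained multithreading: one thread owns each cycle, in rotation.
    from itertools import cycle

    def fgmt_schedule(num_threads, num_cycles):
        """Yield the single thread allowed to issue on each cycle."""
        rr = cycle(range(num_threads))
        return [next(rr) for _ in range(num_cycles)]

    print(fgmt_schedule(4, 6))   # -> [0, 1, 2, 3, 0, 1]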

Figure 1c shows how each cycle an SMT processor selects instructions for execution from all threads. It exploits instruction-level parallelism by selecting instructions from any thread that can (potentially) issue. The processor then dynamically schedules machine resources among the instructions, providing the greatest chance for the highest hardware utilization. If one thread has high instruction-level parallelism, that parallelism can be satisfied; if multiple threads each have low instruction-level parallelism, they can be executed together to compensate. In this way, SMT can recover issue slots lost to both horizontal and vertical waste.
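The waste taxonomy is easy to state as code. This Python sketch (ours, not the authors'; the trace format is an assumption) tallies used slots, horizontal waste, and vertical waste from a per-cycle issue trace.

    # Each cycle is a list of issue slots: True = instruction issued, False = empty.
    def tally_waste(trace, issue_width):
        """Return (used, horizontal_waste, vertical_waste) slot counts."""
        used = horizontal = vertical = 0
        for cycle_slots in trace:
            issued = sum(cycle_slots)
            used += issued
            if issued == 0:
                vertical += issue_width              # whole cycle lost
            else:
                horizontal += issue_width - issued   # partially filled cycle
        return used, horizontal, vertical

    # Example: a 4-wide machine over three cycles.
    trace = [
        [True, True, False, False],    # 2 slots of horizontal waste
        [False, False, False, False],  # 4 slots of vertical waste
        [True, True, True, True],      # fully utilized
    ]
    print(tally_waste(trace, 4))       # -> (6, 2, 4)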

SMT model

We derived our SMT model from a high-performance, out-of-order, superscalar architecture whose dynamic scheduling core is similar to that of the MIPS R10000. In each cycle the processor fetches eight instructions from the instruction cache. After instruction decoding, the register-renaming logic maps the architectural registers to the hardware renaming registers to remove false dependencies. Instructions are then fed to either the integer or floating-point dispatch queues. When their operands become available, instructions are issued from these queues to their corresponding functional units. To support out-of-order execution, the processor tracks instruction and operand dependencies so that it can determine which instructions it can issue and which must wait for previously issued instructions to finish. After instructions complete execution, the processor retires them in order and frees hardware registers that are no longer needed.

Our SMT model, which can simultaneously execute threads from up to eight hardware contexts, is a straightforward extension of this conventional superscalar. We replicated some superscalar resources to support simultaneous multithreading: state for the hardware contexts (registers and program counters) and per-thread mechanisms for pipeline flushing, instruction retirement, trapping, precise interrupts, and subroutine return. We also added per-thread (address-space) identifiers to the branch target buffer and translation look-aside buffer. Only two components, the instruction fetch unit and the processor pipeline, were redesigned to benefit from SMT's multithread instruction issue.

Simultaneous multithreading needs no special hardware to schedule instructions from the different threads onto the functional units. Dynamic scheduling hardware in current out-of-order superscalars is already functionally capable of simultaneous multithreaded scheduling. Register renaming eliminates register name conflicts both within and between threads by mapping thread-specific architectural registers onto the hardware registers; the processor then issues instructions (after their operands have been calculated or loaded from memory) without regard to thread.
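A small Python sketch may make this concrete (our illustration; the table organization is simplified, not the actual rename hardware): keying the rename map by (thread, architectural register) is what makes two threads' uses of the same register name independent.

    # Per-thread register renaming: same architectural name, different threads,
    # never collide on a physical register.
    class Renamer:
        def __init__(self, num_physical):
            self.free = list(range(num_physical))  # free physical registers
            self.map = {}                          # (thread, arch_reg) -> phys_reg

        def rename_dest(self, thread, arch_reg):
            """Allocate a fresh physical register for a destination write."""
            phys = self.free.pop()
            self.map[(thread, arch_reg)] = phys
            return phys

        def rename_src(self, thread, arch_reg):
            """Look up the current mapping for a source read."""
            return self.map[(thread, arch_reg)]

    r = Renamer(num_physical=356)  # e.g., 8 contexts x 32 arch regs + 100 renaming
    p0 = r.rename_dest(thread=0, arch_reg=5)
    p1 = r.rename_dest(thread=1, arch_reg=5)  # same arch name, different thread
    assert p0 != p1                # no interthread conflict after renaming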

This minimal redesign has two important consequences. First, since most hardware resources are still available to a single thread executing alone, SMT provides good performance for programs that cannot be parallelized. Second, because the changes to enable simultaneous multithreading are minimal, the commercial transition from current superscalars to SMT processors should be fairly smooth.

However, should an SMT implementation negatively impact either the targeted processor cycle time or the time to design completion, designers could take several approaches to simplify it. Because most of SMT's implementation complexity stems from its wide-issue superscalar underpinnings, many of the alternatives involve altering the superscalar. One solution is to partition the functional units across a duplicated register file (as in the Alpha 21264) to reduce ports on the register file and dispatch queues.2 Another is to build interleaved caches augmented with multiple, independently addressed banks, or to phase-pipeline cache accesses to increase the number of simultaneous cache accesses. A third alternative would subdivide the dispatch queue to reduce instruction issue delays.3 As a last resort, a designer could reduce the number of register and cache ports by using a narrower issue width. This alternative would have the greatest impact on performance.

Instruction fetching. In a conventional processor, the instruction unit fetches instructions from a single thread into the execution unit. Performance issues revolve around maximizing the number of useful instructions that can be fetched (by minimizing branch mispredictions, for example) and fetching independent instructions quickly enough to keep the functional units busy. An SMT processor places additional stress on the fetch unit. The fetch unit must now fetch instructions more quickly to satisfy SMT's more efficient dynamic scheduler, which issues more instructions each cycle (because it takes them from multiple threads). Thus, the fetch unit becomes SMT's performance bottleneck.

On the other hand, an SMT fetch unit can take advantage of the interthread competition for instruction bandwidth to enhance performance. First, it can partition this bandwidth among the threads. This is an advantage, because branch instructions and cache-line boundaries often make it difficult to fill issue slots if the fetch unit can access only one thread at a time. We fetch from two threads each cycle to increase the probability of fetching only useful (nonspeculative) instructions. In addition, the fetch unit can be smart about which threads it fetches, fetching those that will provide the most immediate performance benefit.

2.8 fetching. The fetch unit we propose is the 2.8 scheme.4 The unit has eight program counters, one for each thread context. On each cycle, it selects two different threads (from those not already incurring instruction cache misses) and fetches eight instructions from each thread. To match the issue hardware's lower instruction width, it then chooses a subset of these instructions for decoding. It takes instructions from the first thread until it encounters a branch instruction or the end of a cache line and then takes the remaining instructions from the second thread.
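A minimal Python sketch of this selection step, as we read it (the function and field names are our own, not the hardware's):

    def select_for_decode(first, second, decode_width=8):
        """first/second: up to eight (is_branch, ends_cache_line) flag pairs,
        one per fetched instruction, in fetch order."""
        chosen = []
        for instr in first:
            chosen.append(instr)
            is_branch, ends_line = instr
            if is_branch or ends_line or len(chosen) == decode_width:
                break
        # Fill any remaining decode slots from the second thread.
        chosen += second[:decode_width - len(chosen)]
        return chosen

    # Example: thread A hits a branch after three instructions; the other
    # five decode slots are filled from thread B.
    a = [(False, False), (False, False), (True, False)] + [(False, False)] * 5
    b = [(False, False)] * 8
    print(len(select_for_decode(a, b)))   # -> 8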

Because the instructions come from two different threads, there is a greater likelihood of fetching useful instructions; indeed, the 2.8 scheme performed 10% better than fetching from one thread at a time. Its hardware cost is only an additional port on the instruction cache and logic to locate the branch instruction, nothing extraordinary for near-future processors. A less hardware-intensive alternative is to fetch from multiple threads while limiting the fetch bandwidth to eight instructions. However, the performance gain here is lower: for example, the 2.8 scheme performed 5% better than fetching four instructions from each of two threads.

Icount feedback. Because it can fetch instructions from more than one thread, an SMT processor can be selective about which threads it fetches. Not all threads provide equally useful instructions in a particular cycle. If the processor can predict which threads will produce the fewest delays, performance should improve. Our thread selection hardware uses the Icount feedback technique,4 which gives highest priority to the threads with the fewest instructions in the decode, renaming, and queue pipeline stages.

Icount increases performance in several ways. First, it replenishes the dispatch queues with instructions from the fast-moving threads, avoiding those that will fill the queues with instructions that depend on, and are consequently blocked behind, long-latency instructions. Second and most important, it maintains in the queues a fairly even distribution of instructions among these fast-moving threads, thereby increasing interthread parallelism (and the ability to hide more latencies). Finally, it avoids thread starvation, because threads whose instructions are not executing will eventually have few instructions in the pipeline and will be chosen for fetching. Thus, even though eight threads are sharing and competing for slots in the dispatch queues, the percentage of cycles in which the queues are full is actually less than on a single-threaded superscalar (8% of cycles versus 21%).

This performance also comes with a very low hardware cost. Icount requires only a small amount of additional logic to increment (decrement) per-thread counters when instructions enter the decode stage (exit the dispatch queues) and to pick the two smallest counter values.
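The counter discipline lends itself to a compact sketch. The following Python fragment (our illustration; class and method names are hypothetical) mirrors the text: increment a thread's counter at decode entry, decrement at dispatch-queue exit, and fetch from the two threads with the smallest counts.

    import heapq

    class Icount:
        def __init__(self, num_threads):
            self.count = [0] * num_threads

        def on_decode(self, tid, n):
            self.count[tid] += n   # n instructions enter the decode stage

        def on_issue(self, tid, n):
            self.count[tid] -= n   # n instructions exit the dispatch queues

        def pick_two(self, eligible):
            """Return the two eligible threads with the fewest pre-issue
            instructions in flight (ties broken arbitrarily)."""
            return heapq.nsmallest(2, eligible, key=lambda t: self.count[t])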

Icount feedback works because it addresses all causes of dispatch queue inefficiency. In our experiments,4 it outperformed alternative schemes that addressed a particular cause of dispatch queue inefficiency, such as those that minimize branch mispredictions (by giving priority to threads with the fewest outstanding branches) or minimize load delays (by giving priority to threads with the fewest outstanding on-chip cache misses).

Register file and pipeline. In simultaneous multithreading (as in a superscalar processor), each thread can address 32 architectural integer (and floating-point) registers. The register-renaming mechanism maps these architectural registers onto a hardware register file whose size is determined by the number of architectural registers in all thread contexts, plus a set of additional renaming registers. The larger SMT register file requires a longer access time; to avoid increasing the processor cycle time, we extended the SMT pipeline two stages to allow two-cycle register reads and two-cycle writes.

The two-stage register access has several ramifications on the architecture. For example, the extra stage between instruction fetch and execute increases the branch misprediction penalty by one cycle. The two extra stages between register renaming and instruction commit increase the minimum time that an executing instruction can hold a hardware register; this, in turn, increases the pressure on the renaming registers. Finally, the extra stage needed to write back results to the register file requires an extra level of bypass logic.
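With the implementation parameters given below (eight contexts, 32 architectural registers per context, 100 additional renaming registers), the integer register file works out to 8 × 32 + 100 = 356 hardware registers, and likewise for floating point; it is this size, rather than any SMT-specific logic, that lengthens the access time.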

Implementation parameters. SMT's implementation parameters will, of course, change as technologies shrink. In our simulations, they targeted an implementation that should be realized approximately three years from now. For the CPU, the parameters are

• an eight-instruction fetch/decode width;
• six integer units, four of which can load (store) from (to) memory;
• four floating-point units;
• 32-entry integer and floating-point dispatch queues;
• hardware contexts for eight threads;
• 100 additional integer renaming registers;
• 100 additional floating-point renaming registers; and
• retirement of up to 12 instructions per cycle.

For the memory subsystem, the parameters are

• 128-Kbyte, two-way set-associative L1 instruction and data caches; the D-cache has four dual-ported banks; the I-cache has eight single-ported banks; the access time per bank is two cycles;
• a 16-Mbyte, direct-mapped, unified L2 cache; the single bank has a transfer time of 12 cycles on a 256-bit bus;
• an 80-cycle memory latency on a 128-bit bus;
• 64-byte blocks on all caches;
• 16 outstanding cache misses;
• data and instruction TLBs that contain 128 entries each; and
• McFarling-style branch prediction hardware:5 a 256-entry, four-way set-associative branch target buffer with an additional thread identifier field, and a hybrid branch predictor that selects between global and local predictors. The global predictor has 13 history bits; the local predictor has a 2,048-entry local history table that indexes into a 4,096-entry prediction table.
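For reference, here is a compact sketch of how the CPU parameters above might be encoded in a simulator configuration (our Python encoding; the field names are hypothetical):

    from dataclasses import dataclass

    @dataclass
    class SMTConfig:
        fetch_width: int = 8        # instructions fetched/decoded per cycle
        int_units: int = 6          # four of these can also do loads/stores
        fp_units: int = 4
        dispatch_queue: int = 32    # entries, integer and floating point each
        contexts: int = 8           # hardware thread contexts
        int_rename_regs: int = 100  # beyond the architectural registers
        fp_rename_regs: int = 100
        retire_width: int = 12      # instructions retired per cycle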

Simulation environment

We compared SMT with its two ancestral processor architectures, wide-issue superscalars and fine-grained, multithreaded superscalars. Both are single-processor architectures designed to improve instruction throughput. Superscalars do so by issuing and executing multiple instructions from a single thread, exploiting instruction-level parallelism. Multithreaded superscalars, in addition to heightening instruction-level parallelism, hide latencies of one thread by switching to and executing instructions from another thread, thereby exploiting thread-level parallelism.

To gauge SMT's potential for executing parallel workloads, we also compared it to a third alternative for improving instruction throughput: small-scale, single-chip, shared-memory multiprocessors, whose processors are also superscalars. We examined both two- and four-processor multiprocessors, partitioning their scheduling unit resources (the functional units, and therefore the issue width, dispatch queues, and renaming registers) differently for each case. In the two-processor multiprocessor (MP2), each processor received half of SMT's execution resources, so that the total resources of the two architectures were comparable. Each processor of the four-processor multiprocessor (MP4) contains approximately one-fourth of SMT's chip resources. For some experiments we increased these base configurations until each MP processor had the resource capability of a single SMT.

MP2 and MP4 represent an interesting trade-off between thread-level and instruction-level parallelism. MP2 can exploit more instruction-level parallelism, because each processor has more functional units than its MP4 counterpart. On the other hand, MP4 has two more processors to take advantage of more thread-level parallelism.

All processor simulators are execution-driven, cycle-level simulators; they model the processor pipelines and memory subsystems (including interthread contention for all structures in the memory hierarchy and the buses between them) in great detail. The simulators for the three alternative architectures reflect the SMT implementation model, but without the simultaneous multithreading extensions. Instead they use single-threaded fetching (per processor) and the shorter pipeline, without simultaneous multithreaded issue. The fine-grained multithreaded processor simulator context-switches between threads each cycle in a round-robin fashion for instruction fetch, issue, and retirement. Table 1 summarizes the characteristics of each architecture.

Table 1. Comparison of processor architectures.*

Features                         SS    MP2   MP4   FGMT          SMT
CPUs                             1     2     4     1             1
Functional units/CPU             10    5     3     10            10
Architectural registers/CPU
  (integer or floating point)    32    32    32    256           256
                                                   (8 contexts)  (8 contexts)
Dispatch queue size
  (integer or floating point)    32    16    8     32            32
Renaming registers/CPU
  (integer or floating point)    100   50    25    100           100
Pipe stages                      7     7     7     9             9
Threads fetched/cycle            1     1     1     1             2
Multithread fetch algorithm      n/a   n/a   n/a   Round-robin   Icount

*SS represents a wide-issue superscalar; MP2 and MP4 represent small-scale, single-chip, shared-memory multiprocessors whose processors (two and four, respectively) are also superscalars; and FGMT represents a fine-grained multithreaded superscalar.

We evaluated simultaneous multithreading on a multiprogramming workload consisting of several single-threaded programs and on a group of parallel (multithreaded) applications. We used both types of workloads because each exercises different parts of an SMT processor.

The larger (workload-wide) working set of the multiprogramming workload should stress the shared structures in an SMT processor (for example, the caches, TLB, and branch prediction hardware) more than the largely identical threads of the parallel programs, which share both instructions and data. We chose the programs in the multiprogramming workload from the Spec95 and Splash2 benchmark suites. Each SMT program executed as a separate thread. To eliminate the effects of benchmark differences when simulating fewer than eight threads, each data point in Table 2 and Figure 2 comprises at least four simulation runs, where each run used a different combination of the benchmarks. The data represent continuous execution with a particular number of threads; that is, we stopped simulating when one of the threads completed.

The parallel workload consists of coarse-grained (parallel threads) and medium-grained (parallel loop iterations) parallel programs targeted for shared-memory multiprocessors. As such, it was a fair basis for evaluating MP2 and MP4. It also presents a different, but equally challenging, test of SMT. Unlike the multiprogramming workload, all threads in a parallel application execute the same code and, therefore, have similar execution resource requirements (they may need the same functional units at the same time, for example). Consequently, there is potentially more contention for these resources.

We also selected the parallel applications from the Spec95 and Splash2 suites. Where feasible, we executed the entire parallel portions of the programs; for the long-running Spec95 programs, we simulated several iterations of the main loops, using their reference data sets. The Spec95 programs were parallelized with the SUIF compiler,6 using policies developed for shared-memory machines.

We compiled the multiprogramming workload with cc and the parallel benchmarks with a version of the Multiflow compiler7 that produces Alpha executables. For all programs we set compiler optimizations to maximize each program's performance on the superscalar; however, we disabled trace scheduling so that speculation could be guided by the branch prediction and out-of-order execution hardware.

Simulation results

As Table 2 shows, simultaneous multithreading achieved much higher throughput than the one to two instructions per cycle normally reported for current wide-issue superscalars. Throughput rose consistently with the number of threads; at eight threads, it reached 6.2 for the multiprogramming workload and 6.1 for the parallel applications. As Figure 2b shows, speedups for parallel applications at eight threads averaged 1.9 over the same processor with one thread, demonstrating SMT's parallel processing capability.

Table 2. Instruction throughput (instructions per cycle) executing a multiprogramming workload and a parallel workload.

           Multiprogramming workload    Parallel workload
Threads    SS     FGMT    SMT           SS     MP2    MP4    FGMT    SMT
1          2.7    2.6     3.1           3.3    2.4    1.5    3.3     3.3
2          --     3.3     3.5           --     4.3    2.6    4.1     4.7
4          --     3.6     5.7           --     --     4.2    4.2     5.6
8          --     2.8     6.2           --     --     --     3.5     6.1

[Figure 2. Speedups with the multiprogramming workload (a) and the parallel workload (b), for one to eight threads; curves shown for SMT, MP2, MP4, and FGMT.]
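As a consistency check on these numbers (our arithmetic, not the article's): dividing the parallel workload's eight-thread throughput by SMT's own one-thread throughput in Table 2 gives 6.1 / 3.3 ≈ 1.85, matching the average speedup of roughly 1.9 in Figure 2b; the multiprogramming figures give 6.2 / 2.7 ≈ 2.3 over the superscalar, the factor quoted later.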

As we explained earlier, SMT's pipeline is two cycles longer than that of the superscalar and the processors of the single-chip multiprocessor. Therefore, a single thread executing on an SMT processor has additional latencies and should have lower performance. However, as Figure 2 shows, SMT's single-thread performance was only 1% (parallel workload) and 1.5% (multiprogramming workload) worse than that of the single-threaded superscalar. Accurate branch prediction hardware and the shared pool of renaming registers prevented additional penalties from occurring frequently.

Resource contention on SMT is also a potential problem. Many of SMT's hardware structures, such as the caches, TLBs, and branch prediction tables, are shared by all threads. The unified organization allows a more flexible, and therefore higher, utilization of these structures, as executing threads place different usage demands on them. It also makes the entire structures available when fewer than eight threads, most importantly a single thread, are executing. On the downside, interthread use leads to competition for the shared resources, potentially driving up cache misses, TLB misses, and branch mispredictions.

We found that interthread interference was significant only for the L1 data cache and the branch prediction tables; the data sets of our workload fit comfortably into the off-chip L2 cache, and conflicts in the TLBs and the L1 instruction cache were minimal. L1 data cache misses rose by 68% (parallel workload) and 66% (multiprogramming workload), and branch mispredictions by 50% and 27%, as the number of threads went from one to eight.

SMT was able to absorb the additional conflicts. Many of the L1 misses were covered by the fully pipelined 16-Mbyte L2 cache, whose latency was only 10 cycles longer than that of the L1 cache. Consequently, L1 interthread conflict misses degraded performance by less than 1%.8 The most important factor, however, was SMT's ability to simultaneously issue instructions from multiple threads. Thus, although simultaneous multithreading introduces additional conflicts for the shared hardware structures, it has a greater ability to hide them.

SMT vs. the superscalar. The single-threaded superscalar fell far short of SMT's performance. As Table 2 shows, the superscalar's instruction throughput averaged 2.7 instructions per cycle, out of a potential of eight, for the multiprogramming workload; the parallel workload had a slightly higher average throughput of 3.3.

Consequently, SMT executed the multiprogramming workload 2.3 times faster and the parallel workload 1.9 times faster (at eight threads). The superscalar's inability to exploit more instruction-level parallelism and any thread-level parallelism (and consequently to hide horizontal and vertical waste) contributed to its lower performance.

SMT vs. fine-grained multithreading. By eliminating vertical waste, the fine-grained multithreaded architecture provided speedups over the superscalar as high as 1.3 on both workloads. However, this maximum speedup occurred at only four threads; with additional threads, performance fell. Two factors contribute. First, fine-grained multithreading eliminates only vertical waste; given the latency-hiding capability of its out-of-order processor and lockup-free caches, four threads were sufficient to do that. Second, fine-grained multithreading cannot hide the additional conflicts from interthread competition for shared resources, because it can issue instructions from only one thread each cycle. Neither limitation applies to simultaneous multithreading. Consequently, SMT was able to get higher instruction throughput and greater program speedups than the fine-grained multithreaded processor.

SMT vs. the multiprocessors. SMT obtained better speedups than the multiprocessors (MP2 and MP4), not only when simulating the machines at their maximum thread capability (eight for SMT, four for MP4, and two for MP2), but also for a given number of threads. At maximum thread capability, SMT's throughput reached 6.1 instructions per cycle, compared with 4.3 for MP2 and 4.2 for MP4.

Speedups on the multiprocessors were hindered by the fixed partitioning of their hardware resources across processors, which prevents them from responding well to changes in instruction- and thread-level parallelism. Processors were idle when thread-level parallelism was insufficient, and the multiprocessors' narrower processors had trouble exploiting large amounts of instruction-level parallelism in the unrolled loops of individual threads. An SMT processor, on the other hand, dynamically partitions its resources among threads, and therefore can respond well to variations in both types of parallelism, exploiting them interchangeably. When only one thread is executing, (almost) all machine resources can be dedicated to it; and additional threads (more thread-level parallelism) can compensate for a lack of instruction-level parallelism in any single thread.

To understand how fixed partitioning can hurt multiprocessor performance, we measured the number of cycles in which one processor needed an additional hardware resource and the resource was idle in another processor. (In SMT, the idle resource would have been used.) Fixed partitioning of the integer units, for both arithmetic and memory operations, was responsible for most of MP2's and MP4's inefficient resource use. The floating-point units were also a bottleneck for MP4 on this largely floating-point-intensive workload. Selectively increasing the hardware resources of MP2 and MP4 to match those of SMT eliminated a particular bottleneck. However, it did not improve speedups, because the bottleneck simply shifted to a different resource. Only when we gave each processor within MP2 and MP4 all the hardware resources of SMT did the multiprocessors obtain greater speedups. However, this occurred only when the architectures executed the same number of threads; at maximum thread capability, SMT still did better.

The speedup results also affect the implementation of these machines. Because of their narrower issue width, the multiprocessors could very well be built with a shorter cycle time. The speedups indicate that a multiprocessor's cycle time must be less than 70% of SMT's before its performance is comparable.
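The 70% figure follows directly from the throughputs above (our arithmetic): MP2's 4.3 instructions per cycle against SMT's 6.1 gives 4.3 / 6.1 ≈ 0.70, so a multiprocessor would need a cycle time below roughly 70% of SMT's to deliver equal instructions per second.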

SIMULTANEOUS MULTITHREADING is an evolutionary design that attacks multiple sources of waste in wide-issue processors. Without sacrificing single-thread performance, SMT uses instruction-level and thread-level parallelism to substantially increase effective processor utilization and to accelerate both multiprogramming and parallel workloads. Our measurements show that an SMT processor achieves performance superior to that of several competing designs, such as superscalar, traditional multithreaded, and on-chip multiprocessor architectures. We believe that the efficiency of a simultaneous multithreaded processor makes it an excellent building block for future technologies and machine designs.

Several issues remain whose resolution should improve SMT's performance even more. Our future work includes compiler and operating systems support for optimizing programs targeted for SMT, and combining SMT's multithreading capability with multiprocessing architectures.

Acknowledgments

We thank John O'Donnell of Equator Technologies, Inc. and Tryggve Fossum of Digital Equipment Corp. for the source to the Alpha AXP version of the Multiflow compiler. We also thank Jennifer Anderson of DEC Western Research Laboratory for copies of the SpecFP95 benchmarks, parallelized by the most recent version of the SUIF compiler, and Sujay Parekh for comments on an earlier draft.

This research was supported by NSF grants MIP-9632977, CCR-9200832, and CCR-9632769, NSF PYI Award MIP-9058439, DEC Western Research Laboratory, and several fellowships (Intel, Microsoft, and the Computer Measurement Group).

Alternative approaches to exploiting parallelism

Other researchers have studied simultaneous multithreading designs, as well as several architectures that represent alternative approaches to exploiting parallelism. Hirata and colleagues1 presented an architecture for an SMT processor and simulated its performance on a ray-tracing application. Gulati and Bagherzadeh2 proposed an SMT processor with four-way issue. Yamamoto and Nemirovsky3 evaluated an SMT architecture with separate dispatch queues for up to four threads.

Thread-level parallelism is also essential to other next-generation architectures. Olukotun and colleagues4 investigated design trade-offs for a single-chip multiprocessor and compared the performance and estimated area of this architecture with that of superscalars. Rather than building wider superscalar processors, they advocate the use of multiple, simpler superscalars on the same chip.

Multithreaded architectures have also been widely investigated. The Tera5 is a fine-grained multithreaded processor that issues up to three operations each cycle. Keckler and Dally6 describe an architecture that dynamically interleaves operations from LIW instructions onto individual functional units. Their M-Machine can be viewed as a coarser grained, compiler-driven SMT processor.

Some architectures use threads in a speculative manner to exploit both thread-level and instruction-level parallelism. Multiscalar7 speculatively executes threads using dynamic branch prediction techniques and squashes threads if control (branches) or data (memory) speculation is incorrect. The superthreaded architecture8 also executes multiple threads concurrently, but does not speculate on data dependencies.

Although all of these architectures exploit multiple forms of parallelism, only simultaneous multithreading has the ability to dynamically share execution resources among all threads. In contrast, the others partition resources either in space or in time, thereby limiting their flexibility to adapt to available parallelism.

References (sidebar)

1. H. Hirata et al., "An Elementary Processor Architecture with Simultaneous Instruction Issuing from Multiple Threads," Proc. Int'l Symp. Computer Architecture, ACM, 1992, pp. 136-145.
2. M. Gulati and N. Bagherzadeh, "Performance Study of a Multithreaded Superscalar Microprocessor," Proc. Int'l Symp. High-Performance Computer Architecture, ACM, 1996, pp. 291-301.
3. W. Yamamoto and M. Nemirovsky, "Increasing Superscalar Performance through Multistreaming," Proc. Int'l Conf. Parallel Architectures and Compilation Techniques, IFIP, Laxenburg, Austria, 1995, pp. 49-58.
4. K. Olukotun et al., "The Case for a Single-Chip Multiprocessor," Proc. Int'l Conf. Architectural Support for Programming Languages and Operating Systems, ACM, 1996, pp. 2-11.
5. R. Alverson et al., "The Tera Computer System," Proc. Int'l Conf. Supercomputing, ACM, 1990, pp. 1-6.
6. S.W. Keckler and W.J. Dally, "Processor Coupling: Integrating Compile Time and Runtime Scheduling for Parallelism," Proc. Int'l Symp. Computer Architecture, ACM, 1992, pp. 202-213.
7. G.S. Sohi, S.E. Breach, and T. Vijaykumar, "Multiscalar Processors," Proc. Int'l Symp. Computer Architecture, ACM, 1995, pp. 414-425.
8. J.-Y. Tsai and P.-C. Yew, "The Superthreaded Architecture: Thread Pipelining with Run-time Data Dependence Checking and Control Speculation," Proc. Int'l Conf. Parallel Architectures and Compilation Techniques, IEEE Computer Society Press, Los Alamitos, Calif., 1996, pp. 49-58.

References

1. R. Alverson et al., "The Tera Computer System," Proc. Int'l Conf. Supercomputing, ACM, New York, 1990, pp. 1-6.
2. K. Farkas et al., "The Multicluster Architecture: Reducing Cycle Time Through Partitioning," to appear in Proc. 30th Ann. IEEE/ACM Int'l Symp. Microarchitecture, IEEE Computer Society Press, Los Alamitos, Calif., Dec. 1997.
3. S. Palacharla, N.P. Jouppi, and J.E. Smith, "Complexity-Effective Superscalar Processors," Proc. Int'l Symp. Computer Architecture, ACM, 1997, pp. 206-218.
4. D.M. Tullsen et al., "Exploiting Choice: Instruction Fetch and Issue on an Implementable Simultaneous Multithreading Processor," Proc. Int'l Symp. Computer Architecture, ACM, 1996, pp. 191-202.
5. S. McFarling, "Combining Branch Predictors," Tech. Report TN-36, Western Research Laboratory, Digital Equipment Corp., Palo Alto, Calif., June 1993.
6. M.W. Hall et al., "Maximizing Multiprocessor Performance with the SUIF Compiler," Computer, Dec. 1996, pp. 84-89.
7. P.G. Lowney et al., "The Multiflow Trace Scheduling Compiler," J. Supercomputing, May 1993, pp. 51-142.
8. J.L. Lo et al., "Converting Thread-level Parallelism to Instruction-level Parallelism via Simultaneous Multithreading," ACM Trans. Computer Systems, Aug. 1997.

Susan J. Eggers is an associate professor of computer science and engineering at the University of Washington. Her research interests are computer architecture and back-end compilation, with an emphasis on experimental performance analysis. Her current focus is on issues in compiler optimization (dynamic compilation and compiling for multithreaded machines) and processor design (multithreaded architectures).

Eggers received a PhD in computer science from the University of California at Berkeley. She is a recipient of an NSF Presidential Young Investigator award and a member of the IEEE and ACM.

Joel S. Emer is a senior consulting engineer at Digital Equipment Corp., where he has worked on processor performance analysis and performance modeling methodologies for a number of VAX and Alpha CPUs. His current research interests include multithreaded processor organizations, techniques for increased instruction-level parallelism, instruction and data cache organizations, branch-prediction schemes, and data-prefetch strategies for future Alpha processors.

Emer received a PhD in electrical engineering from the University of Illinois. He is a member of the IEEE, ACM, Tau Beta Pi, and Eta Kappa Nu.

Henry M. Levy is professor of computer science and engineering at the University of Washington, where his research involves operating systems, distributed and parallel systems, and computer architecture. His current research interests include operating system support for simultaneous multithreaded processors and distributed resource management in local-area and wide-area networks.

Levy received an MS in computer science from the University of Washington. He is a senior member of the IEEE, a fellow of the ACM, and a recipient of a Fulbright Research Scholar award.

Jack L. Lo is currently a PhD candidate in computer science at the University of Washington. His research interests include architectural and compiler issues for simultaneous multithreading, processor microarchitecture, instruction-level parallelism, and code scheduling.

Lo received a BS and an MS in computer science from Stanford University. He is a member of the ACM.

Rebecca L. Stamm is a principal hardware engineer in Digital Semiconductor's Advanced Development Group at Digital Equipment Corp. She has worked on the architecture, logic and circuit design, performance analysis, verification, and debugging of four generations of VAX and Alpha microprocessors and is currently doing research in hardware multithreading. She holds 10 US patents in computer architecture.

Stamm received a BSEE from the Massachusetts Institute of Technology and a BA in history from Swarthmore College. She is a member of the IEEE and the ACM.

Dean M. Tullsen is a faculty member in computer science at the University of California, San Diego. His research interests include simultaneous multithreading, novel branch architectures, compiling for multithreaded and other high-performance processor architectures, and memory subsystem design. He received a 1997 Career Award from the National Science Foundation.

Tullsen received a PhD in computer science from the University of Washington, with emphasis on simultaneous multithreading. He is a member of the ACM.

Direct questions concerning this article to Eggers at Dept. of Computer Science and Engineering, Box 352350, University of Washington, Seattle, WA 98115; eggers@cs.washington.edu.
