Simultaneous Multithreading: A Platform for Next-Generation Processors

Susan J. Eggers, University of Washington
Joel S. Emer, Digital Equipment Corp.
Henry M. Levy, University of Washington
Jack L. Lo, University of Washington
Rebecca L. Stamm, Digital Equipment Corp.
Dean M. Tullsen, University of California, San Diego

Simultaneous multithreading exploits both instruction-level and thread-level parallelism by issuing instructions from different threads in the same cycle.


As the processor community prepares for a billion transistors on a chip, researchers continue to debate the most effective way to use them. One approach is to add more memory (either cache or primary memory) to the chip, but the performance gain from memory alone is limited. Another approach is to increase the level of systems integration, bringing support functions like graphics accelerators and I/O controllers on chip. Although integration lowers system costs and communication latency, the overall performance gain to applications is again marginal.

We believe the only way to significantly improve performance is to enhance the processor's computational capabilities. In general, this means increasing parallelism in all its available forms. At present only certain forms of parallelism are being exploited. Current superscalars, for example, can execute four or more instructions per cycle; in practice, however, they achieve only one or two, because current applications have low instruction-level parallelism. Placing multiple superscalar processors on a chip is also not an effective solution because, in addition to the low instruction-level parallelism, performance suffers when there is little thread-level parallelism. A better solution is to design a processor that can exploit all types of parallelism well.

Simultaneous multithreading is a processor design that meets this goal, because it consumes both thread-level and instruction-level parallelism. In SMT processors, thread-level parallelism can come from either multithreaded, parallel programs or individual, independent programs in a multiprogramming workload. Instruction-level parallelism comes from each single program or thread. Because it successfully (and simultaneously) exploits both types of parallelism, SMT processors use resources more efficiently, and both instruction throughput and speedups are greater.

Simultaneous multithreading combines hardware features of wide-issue superscalars and multithreaded processors. From superscalars, it inherits the ability to issue multiple instructions each cycle; and like multithreaded processors, it contains hardware state for several programs (or threads). The result is a processor that can issue multiple instructions from multiple threads each cycle, achieving better performance for a variety of workloads. For a mix of independent programs (multiprogramming), the overall throughput of the machine is improved. Similarly, programs that are parallelizable, either by a compiler or a programmer, reap the same throughput benefits, resulting in program speedup. Finally, a single-threaded program that must execute alone will have all machine resources available to it and will maintain roughly the same level of performance as when executing on a single-threaded, wide-issue processor.

Equal in importance to its performance benefits is the simplicity of SMT's design. Simultaneous multithreading adds minimal hardware complexity to, and in fact is a straightforward extension of, conventional dynamically scheduled superscalars. Hardware designers can focus on building a fast, single-threaded superscalar and add SMT's multithread capability on top.

Given the enormous transistor budget in the next computer era, we believe simultaneous multithreading provides an efficient base technology that can be used in many ways to extract improved performance. For example, on a one-billion-transistor chip, 20 to 40 SMTs could be used side by side to achieve performance comparable to that of a much larger number of conventional superscalars. With IRAM technology, SMT's high execution rate, which currently doubles memory bandwidth requirements, can fully exploit the increased bandwidth capability. In both billion-transistor scenarios, the SMT processor we describe here could serve as the processor building block.

How SMT works

The difference between superscalar, multithreading, and simultaneous multithreading is pictured in Figure 1, which shows sample execution sequences for the three architectures. Each row represents the issue slots for a single execution cycle: a filled box indicates that the processor found an instruction to execute in that issue slot on that cycle; an empty box denotes an unused slot. We characterize the unused slots as horizontal or vertical waste. Horizontal waste occurs when some, but not all, of the issue slots in a cycle can be used. It typically occurs because of poor instruction-level parallelism. Vertical waste occurs when a cycle goes completely unused. This can be caused by a long-latency instruction (such as a memory access) that inhibits further instruction issue.

[Figure 1. How architectures partition issue slots (functional units): a superscalar (a), a fine-grained multithreaded superscalar (b), and a simultaneous multithreaded processor (c). The rows of squares represent issue slots. The processor either finds an instruction to execute (filled box) or the slot goes unused (empty box).]

Figure 1a shows a sequence from a conventional superscalar. As in all superscalars, it is executing a single program, or thread, from which it attempts to find multiple instructions to issue each cycle. When it cannot, the issue slots go unused, and it incurs both horizontal and vertical waste.

Figure 1b shows a sequence from a multithreaded architecture, such as the Tera.1 Multithreaded processors contain hardware state (a program counter and registers) for several threads. On any given cycle a processor executes instructions from one of the threads. On the next cycle, it switches to a different thread context and executes instructions from the new thread. As the figure shows, the primary advantage of multithreaded processors is that they better tolerate long-latency operations, effectively eliminating vertical waste. However, they cannot remove horizontal waste. Consequently, as instruction issue width continues to increase, multithreaded architectures will ultimately suffer the same fate as superscalars: they will be limited by the instruction-level parallelism in a single thread.
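For contrast with the SMT policy described next, here is a minimal Python sketch (our illustration, not from the article) of the fine-grained policy just described: exactly one thread may issue per cycle, rotating round-robin, which is why horizontal waste within a cycle is untouched.

    # Fine-grained multithreading: one thread owns each cycle, in rotation.
    from itertools import cycle

    def fgmt_schedule(num_threads, num_cycles):
        """Yield the single thread allowed to issue on each cycle."""
        rr = cycle(range(num_threads))
        return [next(rr) for _ in range(num_cycles)]

    print(fgmt_schedule(4, 6))   # -> [0, 1, 2, 3, 0, 1]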

Figure 1c shows how each cycle an SMT processor selects instructions for execution from all threads. It exploits instruction-level parallelism by selecting instructions from any thread that can (potentially) issue. The processor then dynamically schedules machine resources among the instructions, providing the greatest chance for the highest hardware utilization. If one thread has high instruction-level parallelism, that parallelism can be satisfied; if multiple threads each have low instruction-level parallelism, they can be executed together to compensate. In this way, SMT can recover issue slots lost to both horizontal and vertical waste.
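The waste taxonomy is easy to state as code. This Python sketch (ours, not the authors'; the trace format is an assumption) tallies used slots, horizontal waste, and vertical waste from a per-cycle issue trace.

    # Each cycle is a list of issue slots: True = instruction issued, False = empty.
    def tally_waste(trace, issue_width):
        """Return (used, horizontal_waste, vertical_waste) slot counts."""
        used = horizontal = vertical = 0
        for cycle_slots in trace:
            issued = sum(cycle_slots)
            used += issued
            if issued == 0:
                vertical += issue_width              # whole cycle lost
            else:
                horizontal += issue_width - issued   # partially filled cycle
        return used, horizontal, vertical

    # Example: a 4-wide machine over three cycles.
    trace = [
        [True, True, False, False],    # 2 slots of horizontal waste
        [False, False, False, False],  # 4 slots of vertical waste
        [True, True, True, True],      # fully utilized
    ]
    print(tally_waste(trace, 4))       # -> (6, 2, 4)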

SMT model

We derived our SMT model from a high-performance, out-of-order, superscalar architecture whose dynamic scheduling core is similar to that of the MIPS R10000. In each cycle the processor fetches eight instructions from the instruction cache. After instruction decoding, the register-renaming logic maps the architectural registers to the hardware renaming registers to remove false dependencies. Instructions are then fed to either the integer or floating-point dispatch queues. When their operands become available, instructions are issued from these queues to their corresponding functional units. To support out-of-order execution, the processor tracks instruction and operand dependencies so that it can determine which instructions it can issue and which must wait for previously issued instructions to finish. After instructions complete execution, the processor retires them in order and frees hardware registers that are no longer needed.

Our SMT model, which can simultaneously execute threads from up to eight hardware contexts, is a straightforward extension of this conventional superscalar. We replicated some superscalar resources to support simultaneous multithreading: state for the hardware contexts (registers and program counters) and per-thread mechanisms for pipeline flushing, instruction retirement, trapping, precise interrupts, and subroutine return. We also added per-thread (address-space) identifiers to the branch target buffer and translation look-aside buffer. Only two components, the instruction fetch unit and the processor pipeline, were redesigned to benefit from SMT's multithread instruction issue.

Simultaneous multithreading needs no special hardware to schedule instructions from the different threads onto the functional units. Dynamic scheduling hardware in current out-of-order superscalars is already functionally capable of simultaneous multithreaded scheduling. Register renaming eliminates register name conflicts both within and between threads by mapping thread-specific architectural registers onto the hardware registers; the processor then issues instructions (after their operands have been calculated or loaded from memory) without regard to thread.
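A small Python sketch may make this concrete (our illustration; the table organization is simplified, not the actual rename hardware): keying the rename map by (thread, architectural register) is what makes two threads' uses of the same register name independent.

    # Per-thread register renaming: same architectural name, different threads,
    # never collide on a physical register.
    class Renamer:
        def __init__(self, num_physical):
            self.free = list(range(num_physical))  # free physical registers
            self.map = {}                          # (thread, arch_reg) -> phys_reg

        def rename_dest(self, thread, arch_reg):
            """Allocate a fresh physical register for a destination write."""
            phys = self.free.pop()
            self.map[(thread, arch_reg)] = phys
            return phys

        def rename_src(self, thread, arch_reg):
            """Look up the current mapping for a source read."""
            return self.map[(thread, arch_reg)]

    r = Renamer(num_physical=356)  # e.g., 8 contexts x 32 arch regs + 100 renaming
    p0 = r.rename_dest(thread=0, arch_reg=5)
    p1 = r.rename_dest(thread=1, arch_reg=5)  # same arch name, different thread
    assert p0 != p1                # no interthread conflict after renaming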

This minimal redesign has two important consequences. First, since most hardware resources are still available to a single thread executing alone, SMT provides good performance for programs that cannot be parallelized. Second, because the changes to enable simultaneous multithreading are minimal, the commercial transition from current superscalars to SMT processors should be fairly smooth.

However, should an SMT implementation negatively impact either the targeted processor cycle time or the time to design completion, designers could take several approaches to simplify it. Because most of SMT's implementation complexity stems from its wide-issue superscalar underpinnings, many of the alternatives involve altering the superscalar. One solution is to partition the functional units across a duplicated register file (as in the Alpha 21264) to reduce ports on the register file and dispatch queues.2 Another is to build interleaved caches augmented with multiple, independently addressed banks, or to phase-pipeline cache accesses to increase the number of simultaneous cache accesses. A third alternative would subdivide the dispatch queue to reduce instruction issue delays.3 As a last resort, a designer could reduce the number of register and cache ports by using a narrower issue width. This alternative would have the greatest impact on performance.

Instruction fetching. In a conventional processor, the instruction unit fetches instructions from a single thread into the execution unit. Performance issues revolve around maximizing the number of useful instructions that can be fetched (by minimizing branch mispredictions, for example) and fetching independent instructions quickly enough to keep the functional units busy. An SMT processor places additional stress on the fetch unit. The fetch unit must now fetch instructions more quickly to satisfy SMT's more efficient dynamic scheduler, which issues more instructions each cycle (because it takes them from multiple threads). Thus, the fetch unit becomes SMT's performance bottleneck.

On the other hand, an SMT fetch unit can take advantage of the interthread competition for instruction bandwidth to enhance performance. First, it can partition this bandwidth among the threads. This is an advantage, because branch instructions and cache-line boundaries often make it difficult to fill issue slots if the fetch unit can access only one thread at a time. We fetch from two threads each cycle to increase the probability of fetching only useful (nonspeculative) instructions. In addition, the fetch unit can be smart about which threads it fetches, fetching those that will provide the most immediate performance benefit.

2.8 fetching. The fetch unit we propose is the 2.8 scheme.4 The unit has eight program counters, one for each thread context. On each cycle, it selects two different threads (from those not already incurring instruction cache misses) and fetches eight instructions from each thread. To match the issue hardware's lower instruction width, it then chooses a subset of these instructions for decoding. It takes instructions from the first thread until it encounters a branch instruction or the end of a cache line and then takes the remaining instructions from the second thread.
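A minimal Python sketch of this selection step, as we read it (the function and field names are our own, not the hardware's):

    def select_for_decode(first, second, decode_width=8):
        """first/second: up to eight (is_branch, ends_cache_line) flag pairs,
        one per fetched instruction, in fetch order."""
        chosen = []
        for instr in first:
            chosen.append(instr)
            is_branch, ends_line = instr
            if is_branch or ends_line or len(chosen) == decode_width:
                break
        # Fill any remaining decode slots from the second thread.
        chosen += second[:decode_width - len(chosen)]
        return chosen

    # Example: thread A hits a branch after three instructions; the other
    # five decode slots are filled from thread B.
    a = [(False, False), (False, False), (True, False)] + [(False, False)] * 5
    b = [(False, False)] * 8
    print(len(select_for_decode(a, b)))   # -> 8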

Because the instructions come from two different threads, there is a greater likelihood of fetching useful instructions; indeed, the 2.8 scheme performed 10% better than fetching from one thread at a time. Its hardware cost is only an additional port on the instruction cache and logic to locate the branch instruction, nothing extraordinary for near-future processors. A less hardware-intensive alternative is to fetch from multiple threads while limiting the fetch bandwidth to eight instructions. However, the performance gain here is lower: for example, the 2.8 scheme performed 5% better than fetching four instructions from each of two threads.

Icount feedback. Because it can fetch instructions from more than one thread, an SMT processor can be selective about which threads it fetches. Not all threads provide equally useful instructions in a particular cycle. If the processor can predict which threads will produce the fewest delays, performance should improve. Our thread selection hardware uses the Icount feedback technique,4 which gives highest priority to the threads with the fewest instructions in the decode, renaming, and queue pipeline stages.

Icount increases performance in several ways. First, it replenishes the dispatch queues with instructions from the fast-moving threads, avoiding those that will fill the queues with instructions that depend on, and are consequently blocked behind, long-latency instructions. Second and most important, it maintains in the queues a fairly even distribution of instructions among these fast-moving threads, thereby increasing interthread parallelism (and the ability to hide more latencies). Finally, it avoids thread starvation, because threads whose instructions are not executing will eventually have few instructions in the pipeline and will be chosen for fetching. Thus, even though eight threads are sharing and competing for slots in the dispatch queues, the percentage of cycles in which the queues are full is actually less than on a single-threaded superscalar (8% of cycles versus 21%).

This performance also comes with a very low hardware cost. Icount requires only a small amount of additional logic to increment (decrement) per-thread counters when instructions enter the decode stage (exit the dispatch queues) and to pick the two smallest counter values.
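The counter discipline lends itself to a compact sketch. The following Python fragment (our illustration; class and method names are hypothetical) mirrors the text: increment a thread's counter at decode entry, decrement at dispatch-queue exit, and fetch from the two threads with the smallest counts.

    import heapq

    class Icount:
        def __init__(self, num_threads):
            self.count = [0] * num_threads

        def on_decode(self, tid, n):
            self.count[tid] += n   # n instructions enter the decode stage

        def on_issue(self, tid, n):
            self.count[tid] -= n   # n instructions exit the dispatch queues

        def pick_two(self, eligible):
            """Return the two eligible threads with the fewest pre-issue
            instructions in flight (ties broken arbitrarily)."""
            return heapq.nsmallest(2, eligible, key=lambda t: self.count[t])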

Icount feedback works because it addresses all causes of dispatch queue inefficiency. In our experiments,4 it outperformed alternative schemes that addressed a particular cause of dispatch queue inefficiency, such as those that minimize branch mispredictions (by giving priority to threads with the fewest outstanding branches) or minimize load delays (by giving priority to threads with the fewest outstanding on-chip cache misses).

Register file and pipeline. In simultaneous multithreading (as in a superscalar processor), each thread can address 32 architectural integer (and floating-point) registers. The register-renaming mechanism maps these architectural registers onto a hardware register file whose size is determined by the number of architectural registers in all thread contexts, plus a set of additional renaming registers. The larger SMT register file requires a longer access time; to avoid increasing the processor cycle time, we extended the SMT pipeline two stages to allow two-cycle register reads and two-cycle writes.

The two-stage register access has several ramifications on the architecture. For example, the extra stage between instruction fetch and execute increases the branch misprediction penalty by one cycle. The two extra stages between register renaming and instruction commit increase the minimum time that an executing instruction can hold a hardware register; this, in turn, increases the pressure on the renaming registers. Finally, the extra stage needed to write back results to the register file requires an extra level of bypass logic.
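With the implementation parameters given below (eight contexts, 32 architectural registers per context, 100 additional renaming registers), the integer register file works out to 8 × 32 + 100 = 356 hardware registers, and likewise for floating point; it is this size, rather than any SMT-specific logic, that lengthens the access time.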

Implementation parameters. SMT's implementation parameters will, of course, change as technologies shrink. In our simulations, they targeted an implementation that should be realized approximately three years from now. For the CPU, the parameters are

• an eight-instruction fetch/decode width;
• six integer units, four of which can load (store) from (to) memory;
• four floating-point units;
• 32-entry integer and floating-point dispatch queues;
• hardware contexts for eight threads;
• 100 additional integer renaming registers;
• 100 additional floating-point renaming registers; and
• retirement of up to 12 instructions per cycle.

For the memory subsystem, the parameters are

• 128-Kbyte, two-way set-associative L1 instruction and data caches; the D-cache has four dual-ported banks; the I-cache has eight single-ported banks; the access time per bank is two cycles;
• a 16-Mbyte, direct-mapped, unified L2 cache; the single bank has a transfer time of 12 cycles on a 256-bit bus;
• an 80-cycle memory latency on a 128-bit bus;
• 64-byte blocks on all caches;
• 16 outstanding cache misses;
• data and instruction TLBs that contain 128 entries each; and
• McFarling-style branch prediction hardware:5 a 256-entry, four-way set-associative branch target buffer with an additional thread identifier field, and a hybrid branch predictor that selects between global and local predictors. The global predictor has 13 history bits; the local predictor has a 2,048-entry local history table that indexes into a 4,096-entry prediction table.
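For reference, here is a compact sketch of how the CPU parameters above might be encoded in a simulator configuration (our Python encoding; the field names are hypothetical):

    from dataclasses import dataclass

    @dataclass
    class SMTConfig:
        fetch_width: int = 8        # instructions fetched/decoded per cycle
        int_units: int = 6          # four of these can also do loads/stores
        fp_units: int = 4
        dispatch_queue: int = 32    # entries, integer and floating point each
        contexts: int = 8           # hardware thread contexts
        int_rename_regs: int = 100  # beyond the architectural registers
        fp_rename_regs: int = 100
        retire_width: int = 12      # instructions retired per cycle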

Simulation environment

We compared SMT with its two ancestral processor architectures, wide-issue superscalars and fine-grained, multithreaded superscalars. Both are single-processor architectures designed to improve instruction throughput. Superscalars do so by issuing and executing multiple instructions from a single thread, exploiting instruction-level parallelism. Multithreaded superscalars, in addition to heightening instruction-level parallelism, hide latencies of one thread by switching to and executing instructions from another thread, thereby exploiting thread-level parallelism.

To gauge SMT's potential for executing parallel workloads, we also compared it to a third alternative for improving instruction throughput: small-scale, single-chip, shared-memory multiprocessors, whose processors are also superscalars. We examined both two- and four-processor multiprocessors, partitioning their scheduling unit resources (the functional units, and therefore the issue width, dispatch queues, and renaming registers) differently for each case. In the two-processor multiprocessor (MP2), each processor received half of SMT's execution resources, so that the total resources of the two architectures were comparable. Each processor of the four-processor multiprocessor (MP4) contains approximately one-fourth of SMT's chip resources. For some experiments we increased these base configurations until each MP processor had the resource capability of a single SMT.

MP2 and MP4 represent an interesting trade-off between thread-level and instruction-level parallelism. MP2 can exploit more instruction-level parallelism, because each processor has more functional units than its MP4 counterpart. On the other hand, MP4 has two more processors to take advantage of more thread-level parallelism.

All processor simulators are execution-driven, cycle-level simulators; they model the processor pipelines and memory subsystems (including interthread contention for all structures in the memory hierarchy and the buses between them) in great detail. The simulators for the three alternative architectures reflect the SMT implementation model, but without the simultaneous multithreading extensions. Instead they use single-threaded fetching (per processor) and the shorter pipeline, without simultaneous multithreaded issue. The fine-grained multithreaded processor simulator context-switches between threads each cycle in a round-robin fashion for instruction fetch, issue, and retirement. Table 1 summarizes the characteristics of each architecture.

Table 1. Comparison of processor architectures.*

Features                         SS    MP2   MP4   FGMT          SMT
CPUs                             1     2     4     1             1
Functional units/CPU             10    5     3     10            10
Architectural registers/CPU
  (integer or floating point)    32    32    32    256           256
                                                   (8 contexts)  (8 contexts)
Dispatch queue size
  (integer or floating point)    32    16    8     32            32
Renaming registers/CPU
  (integer or floating point)    100   50    25    100           100
Pipe stages                      7     7     7     9             9
Threads fetched/cycle            1     1     1     1             2
Multithread fetch algorithm      n/a   n/a   n/a   Round-robin   Icount

*SS represents a wide-issue superscalar; MP2 and MP4 represent small-scale, single-chip, shared-memory multiprocessors whose processors (two and four, respectively) are also superscalars; and FGMT represents a fine-grained multithreaded superscalar.

We evaluated simultaneous multithreading on a multiprogramming workload consisting of several single-threaded programs and on a group of parallel (multithreaded) applications. We used both types of workloads because each exercises different parts of an SMT processor.

The larger (workload-wide) working set of the multiprogramming workload should stress the shared structures in an SMT processor (for example, the caches, TLB, and branch prediction hardware) more than the largely identical threads of the parallel programs, which share both instructions and data. We chose the programs in the multiprogramming workload from the Spec95 and Splash2 benchmark suites. Each SMT program executed as a separate thread. To eliminate the effects of benchmark differences when simulating fewer than eight threads, each data point in Table 2 and Figure 2 comprises at least four simulation runs, where each run used a different combination of the benchmarks. The data represent continuous execution with a particular number of threads; that is, we stopped simulating when one of the threads completed.

The parallel workload consists of coarse-grained (parallel threads) and medium-grained (parallel loop iterations) parallel programs targeted for shared-memory multiprocessors. As such, it was a fair basis for evaluating MP2 and MP4. It also presents a different, but equally challenging, test of SMT. Unlike the multiprogramming workload, all threads in a parallel application execute the same code and, therefore, have similar execution resource requirements (they may need the same functional units at the same time, for example). Consequently, there is potentially more contention for these resources.

We also selected the parallel applications from the Spec95 and Splash2 suites. Where feasible, we executed the entire parallel portions of the programs; for the long-running Spec95 programs, we simulated several iterations of the main loops, using their reference data sets. The Spec95 programs were parallelized with the SUIF compiler,6 using policies developed for shared-memory machines.

We compiled the multiprogramming workload with cc and the parallel benchmarks with a version of the Multiflow compiler7 that produces Alpha executables. For all programs we set compiler optimizations to maximize each program's performance on the superscalar; however, we disabled trace scheduling so that speculation could be guided by the branch prediction and out-of-order execution hardware.

Simulation results

As Table 2 shows, simultaneous multithreading achieved much higher throughput than the one to two instructions per cycle normally reported for current wide-issue superscalars. Throughput rose consistently with the number of threads; at eight threads, it reached 6.2 for the multiprogramming workload and 6.1 for the parallel applications. As Figure 2b shows, speedups for parallel applications at eight threads averaged 1.9 over the same processor with one thread, demonstrating SMT's parallel processing capability.

Table 2. Instruction throughput (instructions per cycle) executing a multiprogramming workload and a parallel workload.

           Multiprogramming workload    Parallel workload
Threads    SS     FGMT    SMT           SS     MP2    MP4    FGMT    SMT
1          2.7    2.6     3.1           3.3    2.4    1.5    3.3     3.3
2          --     3.3     3.5           --     4.3    2.6    4.1     4.7
4          --     3.6     5.7           --     --     4.2    4.2     5.6
8          --     2.8     6.2           --     --     --     3.5     6.1

[Figure 2. Speedups with the multiprogramming workload (a) and the parallel workload (b), for one to eight threads; curves shown for SMT, MP2, MP4, and FGMT.]
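As a consistency check on these numbers (our arithmetic, not the article's): dividing the parallel workload's eight-thread throughput by SMT's own one-thread throughput in Table 2 gives 6.1 / 3.3 ≈ 1.85, matching the average speedup of roughly 1.9 in Figure 2b; the multiprogramming figures give 6.2 / 2.7 ≈ 2.3 over the superscalar, the factor quoted later.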

As we explained earlier, SMT's pipeline is two cycles longer than that of the superscalar and the processors of the single-chip multiprocessor. Therefore, a single thread executing on an SMT processor has additional latencies and should have lower performance. However, as Figure 2 shows, SMT's single-thread performance was only 1% (parallel workload) and 1.5% (multiprogramming workload) worse than that of the single-threaded superscalar. Accurate branch prediction hardware and the shared pool of renaming registers prevented additional penalties from occurring frequently.

Resource contention on SMT is also a potential problem. Many of SMT's hardware structures, such as the caches, TLBs, and branch prediction tables, are shared by all threads. The unified organization allows a more flexible, and therefore higher, utilization of these structures, as executing threads place different usage demands on them. It also makes the entire structures available when fewer than eight threads, most importantly a single thread, are executing. On the downside, interthread use leads to competition for the shared resources, potentially driving up cache misses, TLB misses, and branch mispredictions.

We found that interthread interference was significant only for the L1 data cache and the branch prediction tables; the data sets of our workload fit comfortably into the off-chip L2 cache, and conflicts in the TLBs and the L1 instruction cache were minimal. L1 data cache misses rose by 68% (parallel workload) and 66% (multiprogramming workload), and branch mispredictions by 50% and 27%, as the number of threads went from one to eight.

SMT was able to absorb the additional conflicts. Many of the L1 misses were covered by the fully pipelined 16-Mbyte L2 cache, whose latency was only 10 cycles longer than that of the L1 cache. Consequently, L1 interthread conflict misses degraded performance by less than 1%.8 The most important factor, however, was SMT's ability to simultaneously issue instructions from multiple threads. Thus, although simultaneous multithreading introduces additional conflicts for the shared hardware structures, it has a greater ability to hide them.

SMT vs. the superscalar. The single-threaded superscalar fell far short of SMT's performance. As Table 2 shows, the superscalar's instruction throughput averaged 2.7 instructions per cycle, out of a potential of eight, for the multiprogramming workload; the parallel workload had a slightly higher average throughput of 3.3.

Consequently, SMT executed the multiprogramming workload 2.3 times faster and the parallel workload 1.9 times faster (at eight threads). The superscalar's inability to exploit more instruction-level parallelism and any thread-level parallelism (and consequently to hide horizontal and vertical waste) contributed to its lower performance.

SMT vs. fine-grained multithreading. By eliminating vertical waste, the fine-grained multithreaded architecture provided speedups over the superscalar as high as 1.3 on both workloads. However, this maximum speedup occurred at only four threads; with additional threads, performance fell. Two factors contribute. First, fine-grained multithreading eliminates only vertical waste; given the latency-hiding capability of its out-of-order processor and lockup-free caches, four threads were sufficient to do that. Second, fine-grained multithreading cannot hide the additional conflicts from interthread competition for shared resources, because it can issue instructions from only one thread each cycle. Neither limitation applies to simultaneous multithreading. Consequently, SMT was able to get higher instruction throughput and greater program speedups than the fine-grained multithreaded processor.

SMT vs. the multiprocessors. SMT obtained better speedups than the multiprocessors (MP2 and MP4), not only when simulating the machines at their maximum thread capability (eight for SMT, four for MP4, and two for MP2), but also for a given number of threads. At maximum thread capability, SMT's throughput reached 6.1 instructions per cycle, compared with 4.3 for MP2 and 4.2 for MP4.

Speedups on the multiprocessors were hindered by the fixed partitioning of their hardware resources across processors, which prevents them from responding well to changes in instruction- and thread-level parallelism. Processors were idle when thread-level parallelism was insufficient, and the multiprocessors' narrower processors had trouble exploiting large amounts of instruction-level parallelism in the unrolled loops of individual threads. An SMT processor, on the other hand, dynamically partitions its resources among threads, and therefore can respond well to variations in both types of parallelism, exploiting them interchangeably. When only one thread is executing, (almost) all machine resources can be dedicated to it; and additional threads (more thread-level parallelism) can compensate for a lack of instruction-level parallelism in any single thread.

To understand how fixed partitioning can hurt multiprocessor performance, we measured the number of cycles in which one processor needed an additional hardware resource and the resource was idle in another processor. (In SMT, the idle resource would have been used.) Fixed partitioning of the integer units, for both arithmetic and memory operations, was responsible for most of MP2's and MP4's inefficient resource use. The floating-point units were also a bottleneck for MP4 on this largely floating-point-intensive workload. Selectively increasing the hardware resources of MP2 and MP4 to match those of SMT eliminated a particular bottleneck. However, it did not improve speedups, because the bottleneck simply shifted to a different resource. Only when we gave each processor within MP2 and MP4 all the hardware resources of SMT did the multiprocessors obtain greater speedups. However, this occurred only when the architectures executed the same number of threads; at maximum thread capability, SMT still did better.

The speedup results also affect the implementation of these machines. Because of their narrower issue width, the multiprocessors could very well be built with a shorter cycle time. The speedups indicate that a multiprocessor's cycle time must be less than 70% of SMT's before its performance is comparable.
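The 70% figure follows directly from the throughputs above (our arithmetic): MP2's 4.3 instructions per cycle against SMT's 6.1 gives 4.3 / 6.1 ≈ 0.70, so a multiprocessor would need a cycle time below roughly 70% of SMT's to deliver equal instructions per second.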

SIMULTANEOUS MULTITHREADING is an evolutionary design that attacks multiple sources of waste in wide-issue processors. Without sacrificing single-thread performance, SMT uses instruction-level and thread-level parallelism to substantially increase effective processor utilization and to accelerate both multiprogramming and parallel workloads. Our measurements show that an SMT processor achieves performance superior to that of several competing designs, such as superscalar, traditional multithreaded, and on-chip multiprocessor architectures. We believe that the efficiency of a simultaneous multithreaded processor makes it an excellent building block for future technologies and machine designs.

Several issues remain whose resolution should improve SMT's performance even more. Our future work includes compiler and operating systems support for optimizing programs targeted for SMT, and combining SMT's multithreading capability with multiprocessing architectures.

Acknowledgments

We thank John O'Donnell of Equator Technologies, Inc. and Tryggve Fossum of Digital Equipment Corp. for the source to the Alpha AXP version of the Multiflow compiler. We also thank Jennifer Anderson of DEC Western Research Laboratory for copies of the SpecFP95 benchmarks, parallelized by the most recent version of the SUIF compiler, and Sujay Parekh for comments on an earlier draft.

This research was supported by NSF grants MIP-9632977, CCR-9200832, and CCR-9632769, NSF PYI Award MIP-9058439, DEC Western Research Laboratory, and several fellowships (Intel, Microsoft, and the Computer Measurement Group).

Alternative approaches to exploiting parallelism

Other researchers have studied simultaneous multithreading designs, as well as several architectures that represent alternative approaches to exploiting parallelism. Hirata and colleagues1 presented an architecture for an SMT processor and simulated its performance on a ray-tracing application. Gulati and Bagherzadeh2 proposed an SMT processor with four-way issue. Yamamoto and Nemirovsky3 evaluated an SMT architecture with separate dispatch queues for up to four threads.

Thread-level parallelism is also essential to other next-generation architectures. Olukotun and colleagues4 investigated design trade-offs for a single-chip multiprocessor and compared the performance and estimated area of this architecture with that of superscalars. Rather than building wider superscalar processors, they advocate the use of multiple, simpler superscalars on the same chip.

Multithreaded architectures have also been widely investigated. The Tera5 is a fine-grained multithreaded processor that issues up to three operations each cycle. Keckler and Dally6 describe an architecture that dynamically interleaves operations from LIW instructions onto individual functional units. Their M-Machine can be viewed as a coarser grained, compiler-driven SMT processor.

Some architectures use threads in a speculative manner to exploit both thread-level and instruction-level parallelism. Multiscalar7 speculatively executes threads using dynamic branch prediction techniques and squashes threads if control (branches) or data (memory) speculation is incorrect. The superthreaded architecture8 also executes multiple threads concurrently, but does not speculate on data dependencies.

Although all of these architectures exploit multiple forms of parallelism, only simultaneous multithreading has the ability to dynamically share execution resources among all threads. In contrast, the others partition resources either in space or in time, thereby limiting their flexibility to adapt to available parallelism.

References (sidebar)

1. H. Hirata et al., "An Elementary Processor Architecture with Simultaneous Instruction Issuing from Multiple Threads," Proc. Int'l Symp. Computer Architecture, ACM, 1992, pp. 136-145.
2. M. Gulati and N. Bagherzadeh, "Performance Study of a Multithreaded Superscalar Microprocessor," Proc. Int'l Symp. High-Performance Computer Architecture, ACM, 1996, pp. 291-301.
3. W. Yamamoto and M. Nemirovsky, "Increasing Superscalar Performance through Multistreaming," Proc. Int'l Conf. Parallel Architectures and Compilation Techniques, IFIP, Laxenburg, Austria, 1995, pp. 49-58.
4. K. Olukotun et al., "The Case for a Single-Chip Multiprocessor," Proc. Int'l Conf. Architectural Support for Programming Languages and Operating Systems, ACM, 1996, pp. 2-11.
5. R. Alverson et al., "The Tera Computer System," Proc. Int'l Conf. Supercomputing, ACM, 1990, pp. 1-6.
6. S.W. Keckler and W.J. Dally, "Processor Coupling: Integrating Compile Time and Runtime Scheduling for Parallelism," Proc. Int'l Symp. Computer Architecture, ACM, 1992, pp. 202-213.
7. G.S. Sohi, S.E. Breach, and T. Vijaykumar, "Multiscalar Processors," Proc. Int'l Symp. Computer Architecture, ACM, 1995, pp. 414-425.
8. J.-Y. Tsai and P.-C. Yew, "The Superthreaded Architecture: Thread Pipelining with Run-time Data Dependence Checking and Control Speculation," Proc. Int'l Conf. Parallel Architectures and Compilation Techniques, IEEE Computer Society Press, Los Alamitos, Calif., 1996, pp. 49-58.

References

1. R. Alverson et al., "The Tera Computer System," Proc. Int'l Conf. Supercomputing, ACM, New York, 1990, pp. 1-6.
2. K. Farkas et al., "The Multicluster Architecture: Reducing Cycle Time Through Partitioning," to appear in Proc. 30th Ann. IEEE/ACM Int'l Symp. Microarchitecture, IEEE Computer Society Press, Los Alamitos, Calif., Dec. 1997.
3. S. Palacharla, N.P. Jouppi, and J.E. Smith, "Complexity-Effective Superscalar Processors," Proc. Int'l Symp. Computer Architecture, ACM, 1997, pp. 206-218.
4. D.M. Tullsen et al., "Exploiting Choice: Instruction Fetch and Issue on an Implementable Simultaneous Multithreading Processor," Proc. Int'l Symp. Computer Architecture, ACM, 1996, pp. 191-202.
5. S. McFarling, "Combining Branch Predictors," Tech. Report TN-36, Western Research Laboratory, Digital Equipment Corp., Palo Alto, Calif., June 1993.
6. M.W. Hall et al., "Maximizing Multiprocessor Performance with the SUIF Compiler," Computer, Dec. 1996, pp. 84-89.
7. P.G. Lowney et al., "The Multiflow Trace Scheduling Compiler," J. Supercomputing, May 1993, pp. 51-142.
8. J.L. Lo et al., "Converting Thread-level Parallelism to Instruction-level Parallelism via Simultaneous Multithreading," ACM Trans. Computer Systems, Aug. 1997.

Susan J. Eggers is an associate professor of computer science and engineering at the University of Washington. Her research interests are computer architecture and back-end compilation, with an emphasis on experimental performance analysis. Her current focus is on issues in compiler optimization (dynamic compilation and compiling for multithreaded machines) and processor design (multithreaded architectures).

Eggers received a PhD in computer science from the University of California at Berkeley. She is a recipient of an NSF Presidential Young Investigator award and a member of the IEEE and ACM.

Joel S. Emer is a senior consulting engineer at Digital Equipment Corp., where he has worked on processor performance analysis and performance modeling methodologies for a number of VAX and Alpha CPUs. His current research interests include multithreaded processor organizations, techniques for increased instruction-level parallelism, instruction and data cache organizations, branch-prediction schemes, and data-prefetch strategies for future Alpha processors.

Emer received a PhD in electrical engineering from the University of Illinois. He is a member of the IEEE, ACM, Tau Beta Pi, and Eta Kappa Nu.

Henry M. Levy is professor of computer science and engineering at the University of Washington, where his research involves operating systems, distributed and parallel systems, and computer architecture. His current research interests include operating system support for simultaneous multithreaded processors and distributed resource management in local-area and wide-area networks.

Levy received an MS in computer science from the University of Washington. He is a senior member of the IEEE, a fellow of the ACM, and a recipient of a Fulbright Research Scholar award.

Jack L. Lo is currently a PhD candidate in computer science at the University of Washington. His research interests include architectural and compiler issues for simultaneous multithreading, processor microarchitecture, instruction-level parallelism, and code scheduling.

Lo received a BS and an MS in computer science from Stanford University. He is a member of the ACM.

Rebecca L. Stamm is a principal hardware engineer in Digital Semiconductor's Advanced Development Group at Digital Equipment Corp. She has worked on the architecture, logic and circuit design, performance analysis, verification, and debugging of four generations of VAX and Alpha microprocessors and is currently doing research in hardware multithreading. She holds 10 US patents in computer architecture.

Stamm received a BSEE from the Massachusetts Institute of Technology and a BA in history from Swarthmore College. She is a member of the IEEE and the ACM.

Dean M. Tullsen is a faculty member in computer science at the University of California, San Diego. His research interests include simultaneous multithreading, novel branch architectures, compiling for multithreaded and other high-performance processor architectures, and memory subsystem design. He received a 1997 Career Award from the National Science Foundation.

Tullsen received a PhD in computer science from the University of Washington, with emphasis on simultaneous multithreading. He is a member of the ACM.

Direct questions concerning this article to Eggers at Dept. of Computer Science and Engineering, Box 352350, University of Washington, Seattle, WA 98115; eggers@cs.washington.edu.
