
A survey of new research directions in microprocessors

J. Šilc a,*, T. Ungerer b, B. Robič c

a Computer Systems Department, Jožef Stefan Institute, Jamova 39, 1001 Ljubljana, Slovenia
b Department of Computer Design and Fault Tolerance, University of Karlsruhe, Germany
c Faculty of Computer and Information Science, University of Ljubljana, Slovenia

Received 25 January 1999; received in revised form 2 November 1999; accepted 2 November 1999

Abstract

Current microprocessors utilise instruction-level parallelism through a deep processor pipeline and the superscalar instruction issue technique. VLSI technology offers several solutions for aggressive exploitation of instruction-level parallelism in future generations of microprocessors. Technological advances will replace the gate delay by the on-chip wire delay as the main obstacle to increasing chip complexity and cycle rate. The implication for the microarchitecture is that functionally partitioned designs with strict nearest-neighbour connections must be developed. Among the major problems facing microprocessor designers is the application of an even higher degree of speculation in combination with functional partitioning of the processor, which prepares the way for exceeding the classical dataflow limit imposed by data dependences. In this paper we survey the current approaches to solving this problem; in particular, we analyse several new research directions whose solutions are based on complex uniprocessor architecture. A uniprocessor chip features a very aggressive superscalar design combined with a trace cache and superspeculative techniques. Superspeculative techniques exceed the classical dataflow limit, where even with unlimited machine resources a program cannot execute any faster than the execution of the longest dependence chain introduced by the program’s data dependences. Superspeculative processors also speculate about control dependences. The trace cache stores dynamic instruction traces contiguously and fetches instructions from the trace cache rather than from the instruction cache. Since a dynamic trace of instructions may contain multiple taken branches, there is no need to fetch from multiple targets, as would be necessary when predicting multiple branches and fetching 16 or 32 instructions from the instruction cache. Multiscalar and trace processors define several processing cores that speculatively execute different parts of a sequential program in parallel. Multiscalar processors use a compiler to partition the program into segments, whereas a trace processor uses a trace cache to dynamically generate trace segments for the processing cores. A datascalar processor runs the same sequential program redundantly on several processing elements, where each processing element has a different data set. This paper discusses and compares the performance potential of these complex uniprocessors. © 2000 Elsevier Science B.V. All rights reserved.

Keywords: Advanced superscalar processor; Superspeculative processor; Trace processor; Multiscalar processor; Datascalar processor

1. Introduction

Today’s microprocessors are the powerful descendants of the von Neumann computer dating back to a memo of Burks, Goldstine, and von Neumann in 1946 [3]. The so-called von Neumann architecture is characterised by a sequential control flow resulting in a sequential instruction stream. A program counter addresses the next instruction if the preceding instruction is not a control instruction such as a jump, branch, subprogram call or return. An instruction is coded in an instruction format of fixed or variable length, where the opcode is followed by one or more operands that can be data, addresses of data, or the address of an instruction in the case of a control instruction. The opcode defines the types of the operands. Code and data are stored in a common storage that is linear and addressed in units of memory words (bytes, words, etc.).

The sequential operating principle of the von Neumann architecture is still the basis for today’s most widely used high-level programming languages and, even more astounding, for the instruction sets of all modern microprocessors. While the characteristics of the von Neumann architecture still determine those of a contemporary microprocessor, its internal structure has changed considerably. The main goal of the von Neumann design—minimal hardware structure—is today far outweighed by the goal of maximum performance. However, the architectural characteristics of the von Neumann design are still valid, since the sequential high-level programming languages used today follow the von Neumann architectural paradigm.

Microprocessors and Microsystems 24 (2000) 175–190

0141-9331/00/$ - see front matter © 2000 Elsevier Science B.V. All rights reserved. PII: S0141-9331(00)00072-7

www.elsevier.nl/locate/micpro

* Corresponding author. Tel.: +386-61-177-32-68; fax: +386-61-219-385.

E-mail address: [email protected] (J. Šilc).


Current superscalar microprocessors are a long way from the original von Neumann computer. However, despite the inherent use of out-of-order execution within superscalar microprocessors today, the order of the instruction flow as seen from outside by the compiler or assembly language programmer still retains the sequential program order—often coined result serialisation—defined by the von Neumann architecture. At the same time, today’s microprocessors strive to extract as much fine-grained or even coarse-grained parallelism from the sequential program flow as can be achieved by the hardware. Unfortunately, a large portion of the exploited parallelism is speculative parallelism, which, in the case of incorrect speculation, leads to an expensive rollback mechanism and to a waste of instruction slots. Therefore, the result serialisation of the von Neumann architecture poses a severe bottleneck.

At least four classes of possible future developments can be distinguished, all of which continue the ongoing evolution of the von Neumann computer:

• Microarchitectures that retain the von Neumann architecture principle (the result serialisation), although instruction execution is internally performed in a highly parallel fashion. However, only instruction-level parallelism can be exploited by contemporary microprocessors. Because instruction-level parallelism is limited for sequential threads, the exploited parallelism is enhanced by speculative parallelism. Besides the superscalar principle applied in commodity microprocessors, the superspeculative, multiscalar, trace, and datascalar processor principles are all hot research topics. All these approaches belong to the same class of implementation techniques because result serialisation must be preserved. A reordering of results is performed in a retirement or commitment phase in order to fulfil this requirement.

• Processors that modestly deviate from the von Neumann architecture but allow the use of sequential von Neumann languages. Programs are compiled to the new instruction set principles. Such architectural deviations include very long instruction word (VLIW), SIMD in the case of multimedia instructions, and vector operations.

• Processors that optimise the throughput of a multiprogramming workload by executing multiple threads of control simultaneously. Each thread of control is a sequential thread executable on a von Neumann computer. The new processor principles are the single-chip multiprocessor and the simultaneous multithreaded processor.

• Architectures that break totally with the von Neumann principle and that need to use new languages, such as dataflow with dataflow single-assignment languages, or hardware–software codesign with hardware description languages. The processor-in-memory, reconfigurable computing, and asynchronous processor approaches also point in that direction.

This paper focuses on microprocessors that retain result serialisation; all other microarchitecture principles are described in detail in [29]. We describe several new research directions in complex uniprocessor design that retain the result serialisation of the von Neumann architecture. In Section 2 we define and explain selected technical terms concerning state-of-the-art uniprocessor architectures and describe challenges as well as requirements for future microprocessors. Section 3 describes five research directions in complex uniprocessor architectures.


Fig. 1. Components of a superscalar processor.


The performance potential and a comparison of these directions are given in Section 4. In Section 5 we give concluding remarks as well as a brief discussion of the multiprocessor alternatives and highly parallel chip architectures that deviate from the von Neumann model.

2. State-of-the-art microprocessors

Processor architecture covers the following two aspects of microprocessor design: the instruction set architecture, which defines the boundary between hardware and software (often also referred to as the “architecture” of a processor), and the “microarchitecture”, or internal organisation of a processor, which concerns features like pipelining, superscalar techniques, primary cache organisation, etc.

2.1. Microarchitecture

First we briefly discuss a common framework of state-of-the-art and future uniprocessor architectures (for more details see Ref. [29]). A state-of-the-art uniprocessor is usually a deeply pipelined, superscalar RISC processor whose components include (see Fig. 1):

• (primary) instruction cache, which holds the instructions to be fetched and executed;

• instruction fetch unit, which fetches several instructions in each clock cycle from the instruction cache into a fetch buffer; the fetch unit is supported by a branch target address cache (BTAC) and a memory management unit (MMU);

• instruction decode and register rename unit, where the fetched instructions are decoded and placed in the instruction buffer; rename registers (specific physical registers) are associated with the architectural registers referred to in the decoded instructions;

• issue unit, which examines the waiting instructions in the instruction buffer and simultaneously assigns a number of them to the functional units;

• reorder buffer, which stores the program order of the issued instructions;

• functional units, which usually comprise one or two load/store units for transferring data between the data cache and integer or floating-point registers, one or more floating-point units, one or more integer units, and a multimedia unit for performing specific multimedia operations with a single instruction; reservation stations, which hold instructions that wait for unavailable operands, may be arranged in front of the individual functional units or in front of groups of functional units;

• branch unit, which monitors the branch history, updates the branch history table (BHT), and unrolls speculatively executed instructions in the case of mispredicted branches;

• retire unit, which is used to assure that instructions retire (or commit) in program order even though functional units may have completed them out of program order;

• physical registers, which can be architectural (integer, floating-point) and rename (integer, floating-point);

• internal buffers (instruction buffer, reorder buffer, etc.);

• data cache, which only holds the data of a program;

• bus interface unit for connecting to the environment (e.g. secondary cache when this is not already on the chip, external memory, I/O devices, etc.).

The pipelining (see Fig. 2) starts with the instruction fetch stage, which fetches several instructions from the instruction cache into a fetch buffer. Typically, at least as many instructions as the maximum issue rate are fetched at once. To avoid pipeline interlocking due to jump or branch instructions, the BTAC contains the jump and branch target addresses that are used to fetch instructions from the target address. The fetch buffer decouples the fetch stage from the decode stage.

In the instruction decode stage, a number of instructions are decoded. The operand and result registers are renamed, i.e. available physical registers are assigned to the architectural registers specified in the instructions. Then the instructions are placed in an instruction buffer, often called the instruction window. Instructions in the instruction window are free from control dependences due to branch prediction, and free from name dependences due to register renaming. So, only data dependences and structural conflicts remain to be solved. The instruction window decouples the fetch and decode part, which operates strictly in program order, from the execution core, which executes instructions out of program order.
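The renaming step described above can be sketched as follows. This is a minimal toy model; the register names, the free-list size, and the three-operand instruction format are illustrative assumptions rather than any particular processor's design.

```python
# Hypothetical sketch of register renaming: architectural registers are
# mapped to free physical (rename) registers so that name dependences
# (WAR/WAW) disappear and only true data dependences remain.

free_physical = list(range(8, 16))                 # unused physical registers
rename_map = {f"r{i}": f"p{i}" for i in range(8)}  # architectural -> physical

def rename(instr):
    """instr = (op, dest, src1, src2); returns the renamed instruction."""
    op, dest, src1, src2 = instr
    # Sources read the CURRENT mapping (the latest producer of that register).
    s1, s2 = rename_map[src1], rename_map[src2]
    # The destination gets a fresh physical register, breaking WAW/WAR hazards.
    new_dest = f"p{free_physical.pop(0)}"
    rename_map[dest] = new_dest
    return (op, new_dest, s1, s2)

# Two successive writes to r1 no longer conflict after renaming:
print(rename(("add", "r1", "r2", "r3")))  # ('add', 'p8', 'p2', 'p3')
print(rename(("sub", "r1", "r1", "r4")))  # ('sub', 'p9', 'p8', 'p4')
```

After the second call the instruction window holds only the true dependence (the `sub` reads `p8`, produced by the `add`); the write-after-write conflict on `r1` is gone.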

The issue logic examines the waiting instructions in the instruction window and simultaneously assigns (“issues”) a number of instructions to the functional units. The program order of the issued instructions is stored in the reorder buffer. Instruction issue from the instruction window can be in order (only in program order) or out of order. It can be either subject to data dependences and resource constraints simultaneously, or divided into two (or more) stages, checking structural conflicts in the first and data dependences in the next stage. In the case of structural conflicts first, the instructions are issued to reservation stations (buffers) in front of the functional units, where the issued instructions await unavailable operands. Depending on the specific processor, reservation stations can be central to a number of functional units (e.g. Intel Pentium III), or each functional unit can have one or more dedicated reservation stations (e.g. IBM/Motorola/Apple PowerPC 604). In the latter case, a structural conflict arises if more than one instruction is issued to the reservation stations of the same functional unit simultaneously. In this case only one instruction can be issued within a cycle.

Fig. 2. Superscalar pipeline.

The instructions await their operands in the reservation stations. An instruction is then dispatched from a reservation station to the functional unit when all operands are available, and execution starts. The dispatch sends the operands to the functional unit. If all its operands are available during issue and the functional unit is not busy, an instruction is immediately dispatched, starting execution in the cycle following issue. Thus, the dispatch is usually not a pipeline stage. An issued instruction may stay in the reservation station for zero to several cycles. Dispatch and execution are performed out of program order.

When the functional unit finishes the execution of an instruction and the result is ready for forwarding and buffering, the instruction is said to complete. Instruction completion is out of program order. During completion the reservation station is freed and the state of the execution is noted in the reorder buffer. The state of the reorder buffer entry can denote an interrupt occurrence. The instruction can be completed and still be speculatively assigned, which is also monitored in the reorder buffer.
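The interplay of waiting, result forwarding, and out-of-order dispatch described above can be sketched as a toy model; the single shared station, the string-valued instructions, and the operand naming are simplifying assumptions, not a description of any real design.

```python
# Hypothetical sketch of a reservation station: an issued instruction waits
# until all of its operands are available, then is dispatched out of
# program order to its functional unit.

class ReservationStation:
    def __init__(self):
        self.waiting = []          # issued instructions awaiting operands

    def issue(self, instr, needed_operands):
        self.waiting.append((instr, set(needed_operands)))

    def operand_ready(self, reg):
        """A completing instruction forwards its result; any instruction
        whose operand set just became empty is dispatched."""
        dispatched, still_waiting = [], []
        for instr, needed in self.waiting:
            needed.discard(reg)
            if needed:
                still_waiting.append((instr, needed))
            else:
                dispatched.append(instr)   # all operands available -> execute
        self.waiting = still_waiting
        return dispatched

rs = ReservationStation()
rs.issue("add p3, p1, p2", {"p1", "p2"})
rs.issue("mul p5, p2, p4", {"p2"})
print(rs.operand_ready("p2"))  # ['mul p5, p2, p4']  (out of program order)
print(rs.operand_ready("p1"))  # ['add p3, p1, p2']
```

The younger `mul` dispatches first because its only missing operand arrives first; the older `add` keeps waiting, which is exactly the out-of-order behaviour the text describes.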

After completion, operations are committed in order. By or after commitment, the result of an instruction is made permanent in the architectural register set, usually by writing the result back from the rename register to the architectural register. This is often done in a stage of its own, after the commitment of the instruction, with the effect that the rename register is freed one cycle after commitment.
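The in-order commitment from the reorder buffer can be sketched as follows; the tag-based entries are an illustrative assumption, but the head-of-buffer rule is the essential mechanism.

```python
# Hypothetical sketch of in-order commitment: the reorder buffer holds
# instructions in program order; an entry commits only when it has
# completed AND every older entry has already committed.

from collections import deque

rob = deque()                     # entries in program order

def insert(tag):
    rob.append({"tag": tag, "done": False})

def complete(tag):                # completion may happen out of order
    for e in rob:
        if e["tag"] == tag:
            e["done"] = True

def commit():
    """Retire completed instructions from the head only (in order)."""
    retired = []
    while rob and rob[0]["done"]:
        retired.append(rob.popleft()["tag"])
    return retired

insert("i1"); insert("i2"); insert("i3")
complete("i2")                    # i2 finishes first...
print(commit())                   # [] ...but cannot retire before i1
complete("i1")
print(commit())                   # ['i1', 'i2']  in program order
```

The head-only rule is what restores the result serialisation: results become architecturally permanent strictly in program order, no matter how execution was reordered.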

Branch prediction is already a well-developed part of state-of-the-art microarchitecture design. State-of-the-art microprocessors use either a two-bit prediction scheme, a correlation-based prediction, or a combination of both. In a two-bit prediction scheme, two bits are assigned to each entry in the branch history table. The two bits stand for the prediction states “predict strongly taken”, “predict weakly taken”, “predict strongly not taken”, and “predict weakly not taken”. In the case of a misprediction from one of the strong states, the prediction direction is not changed yet. Instead, the prediction moves to the respective weak state. A prediction must therefore mispredict twice before the prediction direction is changed. This predictor scheme works well with nested loops, because only a single misprediction occurs at the exit point of each inner loop cycle.
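The two-bit scheme is a saturating counter per table entry; a minimal sketch of its state machine (the state encoding as integers 0–3 is an illustrative choice):

```python
# Hypothetical sketch of the two-bit prediction scheme: a saturating
# counter per BHT entry; the predicted direction only flips after two
# consecutive mispredictions, so a single loop-exit misprediction does
# not retrain the entry.

STRONG_NT, WEAK_NT, WEAK_T, STRONG_T = 0, 1, 2, 3

def predict(counter):
    return counter >= WEAK_T          # True = predict taken

def update(counter, taken):
    if taken:
        return min(counter + 1, STRONG_T)   # saturate at strongly taken
    return max(counter - 1, STRONG_NT)      # saturate at strongly not taken

# Inner-loop behaviour: taken, taken, ..., one not-taken at the loop exit.
c = STRONG_T
c = update(c, False)                  # loop exit: one misprediction
print(predict(c))                     # True - still predicts taken
c = update(c, True)                   # loop re-entered: back to strong
print(predict(c))                     # True
```

This shows the nested-loop property from the text: the single not-taken outcome at the loop exit moves the counter only to the weak state, so the next iteration of the outer loop is still predicted correctly.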

The two-bit predictor scheme uses only the recent behaviour of a single branch to predict the future of that branch. Correlation between different branch instructions is not taken into account. Many integer workloads feature complex control flows whereby the outcome of a branch is affected by the outcomes of recently executed branches. The correlation-based predictors [23] or two-level adaptive predictors [42,43] are branch predictors that additionally use the behaviour of other branches to make a prediction. Some two-level adaptive branch predictors (such as GAg, GAp, GAs) depend on the history of neighbouring branches. Other two-level predictors (such as PAg, PAs, PAp) depend on what happened when the predicted branch itself previously exhibited a specific history pattern [29]. While two-bit predictors use self-history only, the correlating predictor also uses the history of neighbouring branches recorded in a branch history register. The mentioned correlation-based predictors use a pattern history table (also called a branch prediction table), which is indexed by the branch history register. In order to reduce conflicts, one set of correlation-based predictors uses a hash function into the pattern history table instead of indexing the table directly. For example, McFarling [21] introduced the gshare predictor, which uses the bitwise exclusive OR of part of the branch address and the branch history register as a hash function to select an entry in the pattern history table.
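The gshare index computation can be sketched in a few lines; the table size, the shift of the branch address, and the history-register width are illustrative assumptions, not McFarling's exact parameters.

```python
# Hypothetical sketch of the gshare index computation: bits of the branch
# address are XORed with the global branch-history register to spread
# different (branch, history) pairs across the pattern history table.

PHT_BITS = 12                       # assumed 4096-entry pattern history table
PHT_MASK = (1 << PHT_BITS) - 1

def gshare_index(branch_pc, history_reg):
    # The XOR hash folds the branch identity and the recent outcomes of
    # neighbouring branches into one index, reducing conflicts relative
    # to indexing by history alone.
    return ((branch_pc >> 2) ^ history_reg) & PHT_MASK

def update_history(history_reg, taken):
    # Shift the latest branch outcome into the global history register.
    return ((history_reg << 1) | int(taken)) & PHT_MASK

h = 0
h = update_history(h, True)         # history = ...0001
h = update_history(h, False)        # history = ...0010
print(gshare_index(0x40001234, h))  # index depends on both PC and history
```

The two-bit counters of the previous scheme then live in the selected pattern history table entry, so the prediction reflects both self-history and neighbouring-branch history.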

2.2. Challenges and requirements for future microprocessors

Today’s general trend in microprocessor design is driven by several architectural challenges. For example, scalable and faster busses are needed to support fast chip-to-chip communication [15,26]. The memory latency bottleneck may be addressed by a combination of technological improvements [6,11,14] and advanced memory hierarchy techniques (e.g. larger caches, more elaborate cache hierarchies, prefetching, compiler optimisations, streaming buffers, intelligent memory controllers, and bank interleaving) [5,26]. Fault-tolerant chips are needed to cope with soft errors caused by cosmic rays or gamma radiation [26]. To deal with legacy binaries, object code compatibility should be preserved [5,34], and low power consumption is of specific importance for the expanding market for mobile computers and appliances.

Moreover, processor architecture must take into account the technological aspects of the hardware, such as logic design and packaging technology. The long interconnect wire delay problem requires a strict functional partitioning within the microarchitecture and a floor plan that avoids long interconnects. Designers can probably best accomplish this by dividing the microarchitecture into multiple processing elements, each no larger than today’s superscalar processors. As noted in [30], co-ordinating these processing elements to act as a single unified processor will require an additional level of microarchitecture hierarchy for both control (distribution of instructions) and data (communicating values among the processing elements).

It has yet to be seen whether a modular design is cost-effective for a very complex single instruction stream general-purpose processor, or whether the pendulum will swing back to less complex processors. First, several such simple processors could be combined into a multiprocessor chip, or a simple processor could be integrated on a DRAM for closer processor-memory integration. For example, a 256 Mbit memory chip together with memory compression could afford to devote some reasonable percentage of its die to a simple CPU [12]. The second less-complex hardware approach is represented by the explicitly parallel instruction computing (EPIC) design style, which specifies instruction-level parallelism explicitly in the machine code and uses a fully predicated instruction set [7].

A general-purpose processor will probably not be able to cope with all the challenges mentioned. As a result, several possible approaches can be identified, such as:

• focusing on particular market segments [15];

• using desktop personal computers in multimedia and high-end microprocessors in specialised applications (due to different performance/cost trade-offs) [2];

• integrating functions to systems on a chip (e.g. accelerated graphics port on-chip) [5];

• partitioning the microprocessor into a client chip part for general user interaction and server chip parts for special applications [34];

• using a CPU core that works like a large ASIC block, allowing system developers to instantiate various devices on a chip with a simple CPU core [12];

• applying reconfigurable on-chip parts that adapt to application program requirements.

3. Future microprocessor architectures

Superscalar processors utilise instruction-level parallelism to speed up single-thread performance. However, instruction-level parallelism between successively taken branches is limited, particularly for integer-dominated programs. Therefore, the excess of resources in issue bandwidth, pipeline breadth and functional units, which is possible with today’s processor chips, is utilised by speculation. Here, branch prediction with single-path speculation and future value speculation techniques speculate solely about a single thread. Only a single instruction pointer is provided in these complex uniprocessor architectures. Dual-path execution of branches (see below) would be a first step beyond this limitation, because two or even more threads of control are followed.

One of the major challenges is to find ways of expressing and exposing more parallelism to the processor. Since the language structures of the von Neumann languages used today limit the amount of extractable parallelism, not enough instruction-level parallelism is available. Hence, we need to look for parallelism at a higher level than individual instructions and run more than one thread in parallel. Higher-level parallelism allows smaller computational units to work in parallel on multiple threads and thereby favours a modular design approach [34].

Another class of architectural proposals contains the multithreaded architectures, in particular simultaneous multithreaded architectures. A multithreaded processor is able to pursue two or more threads of control in parallel within the processor pipeline. A fast context switch is supported by multiple program counters and often by multiple register sets on the processor chip. While the multithreading technique was defined originally for single-issue processors and for VLIWs, the simultaneous multithreading technique combines multithreading with a wide-issue superscalar approach. The simultaneous multithreading (SMT) technique [28,35,36] issues instructions from multiple threads simultaneously to the execution units of a superscalar processor. Multithreading techniques use coarse-grain parallelism to speed up the computation of a multithreaded workload through better utilisation of the resources of a single processor. However, single-thread performance may even slightly deteriorate because of minor register set and thread overheads. We call these techniques explicit multithreading techniques, because the existence of multiple program counters in the microarchitecture is perceptible in the architecture. Explicit multithreading techniques are not discussed any further in this paper.

In contrast, implicitly multithreaded architectures spawn and execute multiple threads implicitly—the threads are not visible at the architectural level and only concern the microarchitecture. Multiple threads of control are implicitly defined by the machine from a sequential program and utilised by some kind of speculative thread-of-control execution. Examples of such architectures can be found in the multiscalar processor, the trace processor, and the datascalar processor. These architectural approaches try to speed up single-thread performance by utilising coarse-grain parallelism in addition to instruction-level parallelism.

In this section we survey five research directions in complex uniprocessor architectures:

1. Advanced superscalar processors, which are wide-issue superscalars that can issue up to 32 instructions per cycle (IPC).

2. Superspeculative processors, which are wide-issue superscalars that use aggressive speculation techniques.

3. Multiscalar processors, which divide a program into a collection of tasks that are distributed to a number of parallel processing elements (PEs) under the control of a single hardware sequencer.

4. Trace processors, which break up the processor into several PEs (similar to multiscalar) and the program into several traces, so that the current trace is executed on one PE while future traces are speculatively executed on other PEs.

5. Datascalar processors, which run the same sequential program redundantly across multiple processors using distributed data sets. Loads and stores are only performed locally by the processor that owns the data, but a local load broadcasts the loaded value to all other processors.

The first two directions focus on the use of instruction-level parallelism together with speculation, while the last three are based on implicit multithreading.
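The datascalar load rule of direction 5 can be sketched as follows. The interleaved ownership function, the number of PEs, and the dictionary-backed memories are illustrative assumptions; only the owner-loads-and-broadcasts rule is taken from the text.

```python
# Hypothetical sketch of a datascalar load: every processing element runs
# the same program, each owns a partition of memory, and the owner of a
# loaded address broadcasts the value, so no PE ever requests data remotely.

NUM_PES = 4

def owner(addr):
    return addr % NUM_PES          # assumed interleaved data distribution

def load(addr, memories):
    """memories[i] is the local data set of PE i; the owning PE performs
    the load locally and broadcasts the value to all other PEs."""
    value = memories[owner(addr)][addr]
    return [value] * NUM_PES       # after the broadcast, every PE holds it

# Each PE owns every fourth word (value = address * 10 for illustration).
memories = [{a: a * 10 for a in range(pe, 16, NUM_PES)} for pe in range(NUM_PES)]
print(load(6, memories))           # [60, 60, 60, 60] - PE 2 owns address 6
```

Because all PEs execute the identical instruction stream, the broadcast is one-way: the owner pushes the value and the others simply consume it at the same point in the program.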



3.1. Advanced superscalar processors

Reaching the highest execution rate for a single instruction stream involves delivering the maximum possible instruction bandwidth to the execution core in each cycle and consuming the delivered bandwidth within the execution core. Delivering the optimal instruction bandwidth requires a high number of instructions to be fetched each cycle, a minimal number of cycles in which instructions fetched from a wrongly predicted path are subsequently discarded, and a very wide, full instruction issue each cycle. Consuming this instruction bandwidth requires a sufficient data supply, so that instructions are not unnecessarily inhibited from being executed, and sufficient functional units to handle the instruction bandwidth. Such an advanced superscalar processor was suggested in Ref. [24], assuming one-giga-transistor chips, which are foreseeable by the year 2005. The block diagram of the processor is shown in Fig. 3. The instruction cache provides for out-of-order fetch in the event of an instruction cache miss. A large, sophisticated trace cache provides a contiguous instruction stream. An aggressive multi-hybrid branch predictor uses multiple separate branch predictors, each tuned to a different class of branches, with support for context switching, indirect jumps, and interference handling. The processor features a very wide issue width of 16 or 32 IPC. Since there will be many (24–48) highly optimised, pipelined functional units, a large number of reservation stations will be needed to accommodate approximately 2000 instructions. Adequate resolution and forwarding logic will also be required. More than half of the transistors on the chip are allocated to data and on-chip secondary caches.

3.1.1. Branch prediction

A fast and accurate branch prediction is essential for advanced superscalar processors with hundreds of in-flight instructions. Unfortunately, many branches display different characteristics that cannot be optimally predicted by any single-scheme branch predictor. As a result, hybrid predictors comprise several predictors, each targeting different classes of branches [8,21]. The principal idea is that each predictor scheme works best for a particular branch type. As predictor tables increase in size, they often take more time to react to changes in a program (warm-up time). A hybrid predictor with several components can solve this problem by using component predictors with shorter warm-up times while the larger predictors are warming up [24]. The multi-hybrid branch predictor uses a set of selection counters for each entry in the branch target buffer, in the trace cache, or in a similar structure, keeping track of the predictor currently most accurate for each branch, and then using the prediction from that predictor for that branch [8,24]. The multi-hybrid predictor performs better than a hybrid predictor and reaches a prediction rate of 95% with a 16 kbyte predictor size and almost 97% with 256 kbyte predictors on programs of the SPECint95 benchmark suite [24].
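The per-branch selection mechanism can be sketched as follows. This is a minimal illustration, not the exact scheme of Ref. [24]: the counter width, the initial counter values, and the tie-breaking rule are all assumptions.

```python
# Sketch of a multi-hybrid selector: per-branch saturating counters track,
# for each component predictor, how accurate it has recently been for that
# branch; the prediction comes from the currently best counter.

class MultiHybridSelector:
    def __init__(self, predictors, counter_max=3):
        self.predictors = predictors   # list of component predictor objects
        self.counter_max = counter_max
        self.counters = {}             # branch PC -> one counter per predictor

    def _ctrs(self, pc):
        return self.counters.setdefault(
            pc, [self.counter_max] * len(self.predictors))

    def predict(self, pc):
        ctrs = self._ctrs(pc)
        best = max(range(len(self.predictors)), key=lambda i: ctrs[i])
        return self.predictors[best].predict(pc)

    def update(self, pc, taken):
        ctrs = self._ctrs(pc)
        for i, p in enumerate(self.predictors):
            correct = p.predict(pc) == taken
            ctrs[i] = (min(ctrs[i] + 1, self.counter_max) if correct
                       else max(ctrs[i] - 1, 0))
            p.update(pc, taken)        # each component also trains itself
```

Any object with `predict(pc)` and `update(pc, taken)` methods can serve as a component predictor, so differently specialised schemes can be mixed freely.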

Despite this high prediction rate, the remaining mispredictions still incur a large performance penalty. If a branch is not, or only sometimes, predictable, its irregular behaviour will frequently yield costly misspeculations. The predictability of branches can be assessed by additionally measuring the confidence in the prediction. A low-confidence branch is a branch which frequently changes its branch direction in an irregular way, making its outcome hard to predict or even unpredictable. Confidence estimation [13] can be used for speculation control, provided that techniques other than branch speculation are provided to utilise the processor resources. Alternative techniques include predication to enlarge the number of instructions between two speculative predictions, thread switching in multithreaded processors, or both-path execution (as in the PolyPath architecture [16]) where instructions from both branch directions are fetched and executed, and the wrong-path instructions are eventually discarded. Probably a combination of branch handling techniques will be applied in the future, such as a multi-hybrid branch predictor combined with a dual path execution technique to handle unpredictable branches [37,38,39].
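A confidence estimator in the spirit of Ref. [13] can be sketched as a per-branch record of recent prediction outcomes; the history length and threshold below are illustrative assumptions, not values from the paper.

```python
# Sketch of a confidence estimator: a branch is reported high-confidence only
# when nearly all of its recent predictions were correct; low-confidence
# branches are candidates for predication, thread switching, or dual-path
# execution instead of ordinary speculation.

from collections import defaultdict, deque

class ConfidenceEstimator:
    def __init__(self, history_len=8, threshold=7):
        self.history_len = history_len
        self.threshold = threshold   # correct outcomes needed for confidence
        self.history = defaultdict(lambda: deque(maxlen=history_len))

    def record(self, pc, prediction_correct):
        self.history[pc].append(prediction_correct)

    def high_confidence(self, pc):
        h = self.history[pc]
        # Until enough history exists, conservatively report low confidence.
        return len(h) == self.history_len and sum(h) >= self.threshold
```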

Fig. 3. Advanced superscalar architecture.

Future microprocessors will require more than one prediction per cycle, starting speculation over multiple branches in a single cycle. Here a simple correlation-based predictor scheme is already able to predict multiple branches without knowing the branch instruction address. However, the instruction fetch is also affected. When multiple taken branches are predicted in each cycle, instructions must be fetched from multiple target addresses per cycle, complicating instruction cache access. As long as only the final branch is predicted as taken, multiple branches can be predicted in each cycle without requiring multiple fetches. Prediction of multiple branches per cycle and instruction fetch from multiple target addresses per cycle can be achieved simultaneously by a trace cache in combination with next trace prediction.

3.1.2. Out-of-order fetch

To increase instruction supply in the case of instruction cache misses, an out-of-order fetch can be employed. An in-order fetch processor, upon encountering a trace cache miss, waits until the miss is serviced before fetching any new segments. An out-of-order fetch processor temporarily ignores the segment associated with the miss and attempts to fetch, decode, and issue the segments that follow it. After the miss has been serviced, the processor decodes and issues the ignored segment. A related technique fetches instructions that appear after a mispredicted branch but are not control dependent upon that branch. Out-of-order fetch provides a way of fetching such control-independent instructions by skipping the block that follows a hard-to-predict branch until either an accurate prediction can be made or the branch is resolved. The processor fetches, decodes, and issues instructions that begin at the merge point of the alternative paths that follow the branch. These instructions are guaranteed to be on the program's correct path. As soon as a prediction can be made or the branch is resolved, the fetch unit will return to the branch and restart fetching there. Upon reaching the merge point, the processor will then jump past the instructions that it has already fetched [24].

3.1.3. Trace cache

A trace cache is a special instruction cache that captures dynamic instruction sequences, in contrast to the instruction cache, which contains static instruction sequences. Each line in the trace cache stores a dynamic code sequence, which may contain one or more taken branches. Each line therefore stores a snapshot, or trace, of the dynamic instruction stream. A trace is a sequence of instructions that potentially covers several basic blocks, starting at any point in the dynamic instruction stream. An entire trace consisting of multiple basic blocks is fetched in one clock cycle.

Dynamic instruction sequences are built as the program executes. The trace construction is off the critical path; it does not lengthen the pipeline [25,30]. A trace cache stores an instruction sequence contiguously, while the same instruction sequence is stored in the instruction cache in non-contiguous areas because of branch or jump instructions. Moreover, since several branches may be captured within a trace, trace prediction automatically leads to multiple predicted branches per cycle. The trace cache is indexed using the next address field of the previously fetched trace, combined with prediction information for next trace prediction.
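The lookup described above can be sketched as follows; identifying a trace by its starting address plus the branch outcomes embedded in it is the essential idea, while the field layout and the three-branch limit are illustrative assumptions.

```python
# Sketch of trace-cache fill and fetch: one access delivers a multi-basic-block
# instruction sequence, but only if the stored trace matches the predicted
# path through its internal branches.

class TraceCache:
    def __init__(self, max_branches=3):
        self.max_branches = max_branches
        self.lines = {}   # (start PC, branch-outcome bits) -> instruction list

    def fill(self, start_pc, branch_outcomes, instructions):
        key = (start_pc, tuple(branch_outcomes[:self.max_branches]))
        self.lines[key] = list(instructions)

    def fetch(self, start_pc, predicted_outcomes):
        key = (start_pc, tuple(predicted_outcomes[:self.max_branches]))
        return self.lines.get(key)   # None signals a trace-cache miss
```

On a miss, instructions would be fetched conventionally from the instruction cache and a new trace would be constructed off the critical path, as the text describes.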

3.1.4. Data-cache hierarchy

A 16-wide-issue processor will need to execute about eight loads/stores per cycle. The primary design goal of the data cache hierarchy is therefore to provide the necessary bandwidth to support these memory references. A single, monolithic, multi-ported primary data cache would be so large that it would jeopardise the cycle time. Because of this, the primary data cache will be replicated to provide the required ports. Notice that replicating the primary data cache increases only the number of read ports. Further features of the data supply system are a bigger, secondary data cache with fewer port requirements, and data prefetching.

3.1.5. Data prediction

In addition to control prediction, data prediction will be widely used in future superscalar processors. Whenever the full issue bandwidth cannot be filled with instructions that can be executed in parallel, control and data prediction can be applied to increase processor utilisation and potentially processor performance. Prefetching and set prediction in caches are expected to become the norm in processor design. Future processors will also predict the addresses of loads, allowing loads to be executed before the availability of the operands needed for their address calculations. Processors will also predict dependencies between loads and stores, allowing them to predict that a load is always dependent on some older store.
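Load-address prediction is often illustrated with a stride predictor: if a load's address grows by a constant amount, the next address can be predicted before the address-calculation operands are available. The table organisation and confidence thresholds below are assumptions for the sketch.

```python
# Sketch of a stride-based load-address predictor. An entry is only trusted
# after the same stride has been confirmed twice, so irregular loads do not
# trigger misspeculation.

class StrideAddressPredictor:
    def __init__(self):
        self.table = {}   # load PC -> (last address, stride, confidence)

    def predict(self, pc):
        entry = self.table.get(pc)
        if entry and entry[2] >= 2:      # require two confirming strides
            return entry[0] + entry[1]
        return None                      # no confident prediction

    def update(self, pc, address):
        last, stride, conf = self.table.get(pc, (address, 0, 0))
        new_stride = address - last
        if new_stride == stride:
            conf = min(conf + 1, 3)      # stride confirmed again
        else:
            conf, stride = 0, new_stride # new candidate stride, reset
        self.table[pc] = (address, stride, conf)
```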

3.1.6. Out-of-order execution core

The execution core must consume 16–32 instructions per cycle to keep up with the fetch unit. As in today's superscalar processors, logical registers must be renamed to avoid unnecessary delays due to false dependencies, and instructions must be executed out-of-order to compensate for the delays imposed by the data dependencies that are not predicted [24]. Execution cores comprising 24–48 functional units are envisioned. These units will be supplied with instructions from large reservation station units with a total storage capacity of 2000 or more instructions. For better functional partitioning and shortening of the signal propagation paths on the processor die, the execution units will be partitioned into clusters of three to five units. Each cluster will maintain an individual register file, and each execution unit has its own reservation station unit. Data forwarding within a cluster will require one cycle, while data forwarding between different clusters will require multiple cycles. To solve the difficulty of scheduling instructions from a centralised large instruction window, instruction scheduling will be done in stages.

3.2. Superspeculative processors

The basis for the superspeculative approach is the observation that in real programs instructions generate many highly predictable result values. Consumer instructions can therefore frequently and successfully speculate on their source operand values and begin execution without results from their producer instructions. Consequently, a superspeculative processor can remove the serialisation constraints between producer and consumer instructions. As a result, program performance can potentially exceed the classical dataflow limit, which states that a program cannot execute faster than the longest execution path set by the program's true data dependencies.

The reasons for the existence of value locality are manifold [20]. First, due to register spill code, the re-use distance of many shared values is very short in processor cycles; thus, many stores do not even make it out of the store queue before their values are needed again. Second, input sets often contain data with little variation (e.g. sparse matrices or text files with white spaces). Third, compilers often generate run-time constants due to error-checking, switch statement evaluation, and virtual function calls. Finally, compilers also often load program constants from memory rather than using immediate operands.

3.2.1. Weak dependence model

Conventional superscalar processors employ the strong-dependence model for program execution, which implies a total instruction ordering of a sequential program. In the strong-dependence model two instructions are identified as either dependent or independent, and when in doubt, dependencies are pessimistically assumed to exist. Dependencies are never allowed to be violated and are enforced during instruction processing. To date, most machines enforce such dependencies in a rigorous fashion.

Since the traditional model is overly rigorous and unnecessarily restricts available parallelism, the weak-dependence model is applied in superspeculative processors. This model specifies that dependencies can be temporarily violated during instruction execution as long as recovery can be performed prior to affecting the permanent machine state. The weak-dependence model's advantage is that the machine can speculate aggressively and temporarily violate the dependencies, as long as corrective measures are in place to recover from misspeculation. If a significant percentage of speculations are correct, the machine can exceed the performance limit imposed by the traditional strong-dependence model.

Similar in concept to branch prediction in current processors, superspeculation uses two interacting engines. The front-end engine assumes the weak-dependence model and is highly speculative, predicting instructions to speculate aggressively past them. When predictions are correct, these speculative instructions effectively skip over certain stages of instruction execution. On the other hand, the back-end engine still uses the strong-dependence model to validate the speculations, recover from misspeculation, and provide history and guidance information to the speculative engine.

3.2.2. Superspeculative microarchitecture

A superspeculative microarchitecture must maximise three key parameters (Fig. 4): instruction flow, i.e. the rate at which useful instructions are fetched, decoded, and dispatched to the execution core; register dataflow, i.e. the rate at which results are produced and register values become available; and memory dataflow, i.e. the rate at which data values are stored in and retrieved from data memory. These three flows roughly correspond to the processing of branch, arithmetic/logical, and load/store instructions, respectively. In a superspeculative microarchitecture [19], aggressive speculative techniques are employed to accelerate the processing of all three instruction types.

Fig. 4. Superspeculative architecture.

Speculation on the instruction flow uses a two-level adaptive branch predictor with local and global branch history, combined with a trace cache to execute more than one taken branch per cycle, similar to the advanced superscalar architecture in the previous section. The misprediction latency is reduced by data speculation.

Speculation in the register dataflow comprises source operand value prediction and value stride prediction. Source operand value prediction eliminates data dependencies by the use of a dynamic value prediction table per static instruction. Value stride prediction speculates on constant, incremental increases in operand values to increase the accuracy of value prediction: a dynamic hardware mechanism detects constant, incremental increases in operand values (strides) and uses them to predict future values. Dependence prediction is applied to predict the inter-instruction dependencies. Instructions that are data ready are allowed to execute in parallel with the dependence checking for these instructions. Dependence prediction is used when the dynamic history shows that value prediction cannot be successfully applied. It can be implemented by a dependence prediction table with entries that are indexed by hashing together the instruction address bits, the gshare branch predictor's branch history register, and the relative position of the operand being looked up [18].
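The table indexing just described can be sketched as a simple hash; the table size, bit widths, and XOR combination below are illustrative assumptions, not the scheme of Ref. [18].

```python
# Sketch of a dependence-prediction-table index: instruction address bits,
# the gshare-style global branch-history register, and the operand position
# are folded into one table index.

TABLE_BITS = 12                      # assumed 4096-entry table
MASK = (1 << TABLE_BITS) - 1

def dependence_table_index(pc, branch_history, operand_pos):
    # XOR folding keeps the same (pc, operand) pair from always aliasing
    # onto one entry regardless of the path taken to reach the instruction.
    return (pc ^ branch_history ^ (operand_pos << (TABLE_BITS - 2))) & MASK
```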

Deeper pipelining often results in dependence checking and dispatch in multiple pipelined stages. With dependence and value prediction, a three-cycle dispatch nearly matches the performance of a single-cycle dispatch.

The memory dataflow is used to predict load values to bridge the latency of accessing the storage device, load addresses to eliminate the address generation interlock (i.e. the delay of a load whose address is not yet known), and aliases with earlier outstanding stores. Load value prediction [20] predicts the results of load instructions at the time of dispatch by exploiting the affinity between load instruction addresses and the values the loads produce. Prediction tables implement the predictions. Memory loads are predicted by a load value prediction unit, which consists of a load value prediction table for generating value predictions, a load classification table for deciding which predictions are likely to be correct, and a constant verification unit that replaces accessing the conventional memory hierarchy for verifying highly predictable loads.
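The interplay of the load value prediction table and the load classification table can be sketched as follows; the last-value policy, counter width, and threshold are simplifying assumptions in the spirit of Ref. [20], not its exact design.

```python
# Sketch of a load value prediction unit: the LVPT remembers the last value
# each static load produced, and the LCT (saturating counters) classifies
# loads as predictable or not, so only loads with a good track record are
# speculated on.

class LoadValuePredictor:
    def __init__(self, predict_threshold=2, counter_max=3):
        self.lvpt = {}                 # load PC -> last loaded value
        self.lct = {}                  # load PC -> confidence counter
        self.threshold = predict_threshold
        self.counter_max = counter_max

    def predict(self, pc):
        if self.lct.get(pc, 0) >= self.threshold:
            return self.lvpt.get(pc)
        return None   # classified unpredictable: access memory normally

    def update(self, pc, actual_value):
        conf = self.lct.get(pc, 0)
        if self.lvpt.get(pc) == actual_value:
            self.lct[pc] = min(conf + 1, self.counter_max)
        else:
            self.lct[pc] = max(conf - 1, 0)
        self.lvpt[pc] = actual_value   # last-value policy (assumption)
```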

Alias prediction is related to dependence prediction. Rather than predicting the dependence distance to a preceding register write, alias prediction predicts the distance to a preceding store to memory. The predicted distance is then used to obtain the load value from that offset in the processor's store queue, which holds outstanding stores. For this speculative forwarding to occur, neither the load nor the store needs to have its address available. A related approach, called memory dependence prediction, identifies the stores on which a load depends. The processor uses a load's store set, i.e. the set of stores upon which the load has depended, to predict which stores a load must wait for before executing [4].
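The store-set idea can be sketched as follows; the flat per-load dictionary is an illustrative assumption, whereas real hardware such as the design in Ref. [4] uses small indexed tables shared between loads.

```python
# Sketch of store-set memory dependence prediction: each load accumulates the
# set of store PCs it has been observed (via misspeculation) to depend on, and
# is then made to wait for any of those stores that are still in flight.

from collections import defaultdict

class StoreSetPredictor:
    def __init__(self):
        self.store_sets = defaultdict(set)   # load PC -> set of store PCs

    def record_violation(self, load_pc, store_pc):
        # Called when a load issued before an older store to the same address.
        self.store_sets[load_pc].add(store_pc)

    def must_wait_for(self, load_pc, pending_store_pcs):
        # In-flight stores this load is predicted to depend on.
        return self.store_sets[load_pc] & set(pending_store_pcs)
```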

3.3. Multiscalar processors

The multiscalar model of execution [9,33] represents another paradigm to extract a large amount of inherent parallelism from a sequential instruction flow. A program is divided into a collection of tasks by a combination of hardware and software. The tasks are distributed to a number of parallel processing elements (PEs) within a processor. Each PE fetches and executes instructions belonging to its assigned task. The functional decomposition of the processor chip that is required in order to achieve short wire delays in future generation high-density processor chips is thus naturally realised.

The structure of the processor can be viewed as a collection of sequential (or scalar) processors that cooperate in executing a sequential program [32]. The difference when compared to a single-chip multiprocessor (CMP) is the close coupling of the PEs in the multiscalar processor. While a CMP executes different threads of control that are statically determined by the programmer or by a parallelising compiler, the multiscalar processor executes a sequential program that is enriched by sequencing information.

3.3.1. Control flow graph

A static program is represented by a control flow graph (CFG), where the nodes represent basic blocks and the arcs represent the flow of control from one basic block to another. Program execution can be viewed as walking through the CFG, generating a dynamic sequence of basic blocks that have to be executed for a particular run of the program. To achieve high performance, the multiscalar processor must walk through the CFG with a high level of parallelism. The primary constraint of any parallel walk is that it must preserve the sequential semantics assumed in the program. A program is statically partitioned into (not necessarily independent) tasks, which are marked by annotations of the CFG. Each task is a collection of instructions: part of a (large) basic block, a basic block, a collection of basic blocks, a single loop iteration, an entire loop, a function call, etc. A task sequencer (speculatively) sequences through the program a task at a time, assigning each task to a PE, which in turn unravels the task to determine the dynamic instructions to be executed, and executes them.

A multiscalar processor walks through the CFG speculatively, taking task-sized steps, without pausing to inspect any of the instructions within a task. A task is assigned to one of the PEs for execution by passing the initial program counter of the task to the PE. For each step of its walk, a multiscalar processor assigns a task to a PE for execution, without concern for the actual content of the task, and continues from this point to the next point in the CFG.

3.3.2. Multiscalar microarchitecture and execution model

A possible microarchitecture for a multiscalar processor is shown in Fig. 5. A multiscalar processor can be considered as a collection of PEs with a sequencer that assigns tasks to the PEs. Once a task is assigned to a PE, the PE fetches and executes the instructions of the task until it is complete. Multiple PEs, each with its own internal instruction sequencing mechanism, support the execution of multiple tasks, and thereby multiple instructions, in any given step. Multiple tasks then execute in parallel on the PEs, resulting in an aggregate execution rate of multiple IPC [33].

The key problem is the proper resolution of inter-task data dependencies. This concerns, in particular, data that is passed between instructions via registers and via memory. It is in this area of inter-task data communication that the multiscalar approach differs significantly from the more traditional multiprocessing methods.

To maintain a sequential appearance, a twofold strategy is employed. First, each processing element adheres to sequential execution semantics for the task assigned to it. Second, a loose sequential order is enforced over the collection of processing elements, which in turn imposes a sequential order on the tasks. The sequential order on the processing elements is maintained by organising the elements into a circular queue. Head and tail pointers indicate, respectively, the elements that are executing the earliest and the latest of the current tasks.

Because a sequential execution model views storage as a single set of registers and memory locations, multiscalar execution must maintain this view as well. In order to provide this behaviour, communication between tasks is synchronised.

The appearance of a single logical register file is maintained, although copies are distributed to each parallel PE. Register results are dynamically routed among the many parallel processing elements with the help of compiler-generated masks.

In the case of registers, the control logic synchronises the production of register values in predecessor tasks with the consumption of these values in successor tasks via reservations on the registers. The register values that a task may produce can be determined statically and maintained in a create mask. Bits in the create mask correspond to each of the logical registers: a bit is set to one if the register is potentially written by the task. At the time a register value in the create mask is produced, it is forwarded via a circular unidirectional ring to later tasks, i.e. to PEs which are logical successors. The reservations on registers for a successor task are given in the accum mask, which is the union of the create masks of currently active predecessor tasks. As values arrive from the predecessor PEs, reservations are cleared in the successor PEs. If a task uses one of these values, the consuming instruction can proceed only if the value has been received; otherwise, it waits for the value to arrive.
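The mask mechanics above can be sketched with plain bit operations; the 32-register file and one-bit-per-register layout are assumptions for illustration.

```python
# Sketch of multiscalar register synchronisation: a task's create mask marks
# the registers it may write; a successor's accum mask (union of active
# predecessors' create masks) records which registers it must wait for, and
# reservations are cleared as values arrive on the forwarding ring.

NUM_REGS = 32

def create_mask(written_regs):
    """Bit i set if logical register i is potentially written by the task."""
    m = 0
    for r in written_regs:
        m |= 1 << r
    return m

def accum_mask(predecessor_create_masks):
    """Reservations for a successor: union of predecessors' create masks."""
    m = 0
    for cm in predecessor_create_masks:
        m |= cm
    return m

def value_arrived(accum, reg):
    """Clear the reservation on `reg` once its value arrives on the ring."""
    return accum & ~(1 << reg)

def can_read(accum, reg):
    """A consuming instruction may proceed only with no pending reservation."""
    return not (accum & (1 << reg))
```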

Fig. 5. A multiscalar processor.

For memory operations, the situation is more complicated. When a PE is ready to execute a load, it does not even know whether previous tasks have stores, let alone stores to a given memory location. Here multiscalar processing employs data dependence speculation: it speculates that a load does not depend on instructions executing in predecessor tasks. Memory accesses may occur speculatively without knowledge of preceding loads and stores. Addresses are disambiguated dynamically, many in parallel, and processing waits only for true data dependencies. An address resolution buffer (ARB) is provided to hold speculative memory operations and to detect violations of memory dependencies. The ARB checks that the speculation was correct, squashing instructions if it was not.
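The violation check the ARB performs can be sketched as follows; the structure is a simplified assumption, ignoring the banked, interleaved organisation of a real ARB.

```python
# Sketch of ARB-style dependence checking: speculative loads are logged with
# their task (sequence) order; when a store from a logically earlier task
# arrives at an address a later task already loaded, a memory dependence
# violation is flagged and the later task must be squashed.

class ARB:
    def __init__(self):
        self.loads = []    # (task id, address) of speculative loads

    def record_load(self, task_id, address):
        self.loads.append((task_id, address))

    def record_store(self, task_id, address):
        # Task ids that loaded this address although they are logically later
        # than the storing task: these loads speculated wrongly.
        return sorted({t for t, a in self.loads if a == address and t > task_id})

    def squash_from(self, task_id):
        # Discard speculative state of the squashed task and its successors.
        self.loads = [(t, a) for t, a in self.loads if t < task_id]
```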

Thus the multiscalar paradigm has at least two forms of speculation [32]: control speculation, which is used by the task sequencer, and data dependence speculation, which is performed by each PE. It could also use other forms of speculation, such as data value speculation, to alleviate inter-task dependences.

Multiscalar processors use multiple internal sequencers (PCs) to sequence through a sequential program. The internal sequencers require information about which tasks are the possible successors of any given task in the CFG. Such information can be determined statically and placed in a task descriptor. Each internal sequencer may also speculatively sequence through a task. The task descriptors may be dispersed within the program text, for instance before the code of the task, or placed in a single location beside the program text. A multiscalar program may be generated from existing binaries by augmenting the binary with task descriptors and tag bits.

3.4. Trace processors

The main ideas of the trace processor were presented in Refs. [25,30,40], where it was proposed to create subsystems similar in complexity to today's superscalar processors and to combine replicated subsystems into a full processor. The focus of a trace processor is the trace cache.

A trace processor (Fig. 6) is partitioned into multiple distinct PEs (similar to a multiscalar processor). The code is broken up into traces that are captured and stored by hardware. One PE executes the current trace while the others execute future traces speculatively. Instruction fetch hardware fetches instructions from the instruction cache and simultaneously generates traces of 8–32 instructions, including predicted conditional branches. Traces are built as the program executes, and they are stored in a trace cache. A trace fetch unit reads traces from the trace cache and parcels them out to the parallel PEs.

A trace cache miss causes a trace to be built through conventional instruction fetching with branch prediction. Blocks of instructions are pre-processed before being put in the trace cache, which greatly simplifies processing after they are fetched. Pre-processing can include capturing data dependence relationships, combining and reordering instructions, or determining instruction resource requirements, all of which can be re-used. To support precise interrupts, information about the original instruction order must also be saved with the trace. During the dispatch phase, instructions move from the trace cache to the instruction buffers in the PEs. Only inter-trace dependence checking and register renaming are required.

Because traces are the basic units for fetching and execution, control-flow prediction is moved up to the trace level. The unit of control prediction should be a trace, not individual branches, which suggests a next-trace predictor. Next-trace prediction predicts multiple branches per cycle. Trace processors also employ data value prediction. Data value prediction speculates on the input data values of a trace and is combined with next-trace prediction. Successfully predicting a trace's input data values makes the trace independent of data availability and leads to a further decoupling of traces, allowing the trace to execute immediately and in parallel with other traces. Data value prediction and speculation is restricted to inter-trace dependencies.

The expected parallelism within a single trace is suitable for execution in a modest superscalar unit that can be chosen to implement the PEs. As multiple PEs issue instructions in parallel, both intra-trace and inter-trace parallelism are exploited.

Fig. 6. A trace processor.

Because the PEs and register files are distributed, so is the communication of register data. The relatively simple bypass paths within a PE allow local result forwarding in a single cycle. Global paths are used for communicating global register results between PEs. The global bypass paths are likely to require multiple clock cycles. The trace processor uses a conventional set of logical registers. Physical registers are divided into local and global sets. The hierarchical organisation of registers allows small register files with fast access times and fewer ports per file. The trace dispatcher remaps the trace's source and destination registers to the global registers without the need for intra-trace dependence checking. The dispatcher maps local registers with reusable mappings based on the intra-trace dependencies detected during instruction pre-processing. Dispatch logic can remap a 16-instruction trace line using register rename logic that is no more complex than the logic used by a conventional four-way superscalar processor.

Memory systems for the trace processor will have to provide very high bandwidth to supply enough data to the processor's multiple PEs. Distributed, multi-ported caches can be employed, provided that coherence among the distributed caches is maintained. A large, interleaved cache system is also possible, although designers will have to deal with the additional latency in such systems. Each PE in the trace processor generates a stream of load and store requests to memory. Moreover, these address streams are generated speculatively and out of order. The hardware to sort out the address streams and make sure that all memory locations are accessed in the correct order will have to be fairly sophisticated. The ARB as proposed for multiscalar processors solves the problem of parallel resolution of memory addressing hazards. However, the ARB is a centralised device, separate from the data caches, and must be developed further, using distributed mechanisms that merge address resolution and data caching.

3.5. Datascalar processors

The datascalar model of execution runs the same sequential program redundantly across multiple processors [1]. The data set is distributed across physical memories that are tightly coupled to their distinct processors. Each processor broadcasts operands that it loads from its local memory to all other processors. Instead of explicitly accessing a remote memory, processors wait until the requested value is broadcast. Stores are executed only by the processor that owns the memory address and are dropped by the others.

The most heavily accessed data is statically replicated by duplicating whole memory pages that are stored in each processor's local memory. Access to a replicated page requires no data communication. The address space is divided into a replicated and a communicated section. The latter holds values that exist only in single copies and are owned by the respective processors. Replicated pages are mapped into each processor's local memory, and the communicated section of the address space is distributed among the processors.

Fig. 7 demonstrates the execution of load and store operations for replicated and communicated memory. Assume that both processors execute a sequence of load-1, store-1, load-2, and store-2. Operations load-1 and store-1 are issued to the replicated memory and can therefore complete locally on both processors. Operations load-2 and store-2 are issued to the communicated memory of the first processor. The load-2 of the second processor is deferred until the value is broadcast by the first processor, which owns the address. Since all processors are running the same program, they all generate the same store value, which is stored only in the communicated memory of the processor that owns the address. Therefore, store-2 is completed by the first processor, but is aborted on the second processor.
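The load/store rules just walked through can be sketched as follows; representing the broadcast bus as a shared list and ownership as an address-range function are simplifying assumptions for illustration.

```python
# Sketch of datascalar load/store semantics: replicated pages complete locally
# on every processor; a communicated address is served by its owner, which
# broadcasts loaded values one-way, while non-owners defer their load until
# the broadcast arrives and drop (abort) their copy of the store.

def handle_load(proc_id, address, owner_of, local_memory, broadcast_bus):
    if owner_of(address) is None:                 # replicated page
        return local_memory[proc_id][address]     # completes locally
    if owner_of(address) == proc_id:              # this processor owns it
        value = local_memory[proc_id][address]
        broadcast_bus.append((address, value))    # one-way broadcast
        return value
    for a, v in broadcast_bus:                    # non-owner: wait for the
        if a == address:                          # owner's broadcast instead
            return v                              # of a remote access
    return None                                   # still waiting (deferred)

def handle_store(proc_id, address, value, owner_of, local_memory):
    if owner_of(address) is None:                 # replicated: store locally
        local_memory[proc_id][address] = value    # on every processor
        return True
    if owner_of(address) == proc_id:              # owner completes the store
        local_memory[proc_id][address] = value
        return True
    return False                                  # aborted on non-owners
```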

The main goal of the datascalar model of execution is the improvement of memory system performance. Since all physical memory is local to at least one processor, a request for a remote operand is never sent, thus reducing memory access latency and bus traffic. All communication is one-way; writes never appear on the global bus.

The processors execute the same program in slightly different time steps, due to asynchronous memory accesses and the ability to perform out-of-order execution. One processor, the lead processor, runs slightly ahead of the others, especially when it is broadcasting while the others wait for the broadcast value. When the program execution accesses an operand that is not owned by the lead processor, a lead change occurs. All processors stall until the new lead processor catches up and broadcasts its operands. The capability of each processor to run ahead on computation that involves operands owned by that processor is called datathreading [1].

Fig. 7. Accesses of datascalar processors to replicated and communicated memory.

The datascalar model creates opportunities for new optimisations. Because each processor executes the instructions in a different order due to the out-of-order execution facility, it is possible for a processor to execute a private computation, broadcasting only the result and not the operands to the other processors. This technique, called result communication, deviates from the strict single-program-single-data-stream (SPSD) model of datascalar computation. Speculation is another optimisation opportunity. However, the additional bus traffic that might be caused by speculation must be weighed against the possible performance advantage. The broadcast of data may be a critical limitation of the datascalar model, and frequent superfluous broadcasts would greatly hinder performance. Possible solutions are to hold speculative broadcasts until the speculative condition is resolved, or to send the broadcast immediately upon issue, followed by a corresponding squash message if the load that generated the broadcast is squashed.
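The two speculative-broadcast policies just mentioned can be contrasted with a small sketch. The class names and the list-as-bus abstraction are ours for illustration; the paper describes the policies only at the level of messages on the bus.

```python
# Sketch of the two speculative-broadcast policies for datascalar processors:
# eager (send now, squash later if wrong) vs. lazy (hold until resolved).

class EagerBroadcaster:
    """Send immediately; follow with a squash message if the load is squashed."""
    def __init__(self, bus):
        self.bus, self.inflight = bus, {}
    def broadcast(self, tag, addr, value):
        self.bus.append(("data", tag, addr, value))
        self.inflight[tag] = addr
    def resolve(self, tag, mispredicted):
        if mispredicted:
            self.bus.append(("squash", tag))   # consumers discard the earlier value
        self.inflight.pop(tag, None)

class LazyBroadcaster:
    """Hold the broadcast until the speculative condition is resolved."""
    def __init__(self, bus):
        self.bus, self.pending = bus, {}
    def broadcast(self, tag, addr, value):
        self.pending[tag] = (addr, value)      # buffered, nothing on the bus yet
    def resolve(self, tag, mispredicted):
        addr, value = self.pending.pop(tag)
        if not mispredicted:
            self.bus.append(("data", tag, addr, value))

bus_eager, bus_lazy = [], []
e, l = EagerBroadcaster(bus_eager), LazyBroadcaster(bus_lazy)

e.broadcast(1, 0x2000, 7); e.resolve(1, mispredicted=True)
l.broadcast(1, 0x2000, 7); l.resolve(1, mispredicted=True)

# On a misprediction, eager pays two bus messages; lazy pays none but
# delays correct broadcasts until resolution.
assert bus_eager == [("data", 1, 0x2000, 7), ("squash", 1)]
assert bus_lazy == []
```

The trade-off is exactly the one the text weighs: the eager policy hides broadcast latency at the cost of superfluous bus traffic, while the lazy policy keeps the bus clean but serialises the broadcast behind speculation resolution.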

The datascalar model is primarily a memory system optimisation intended for codes that are limited in performance by the memory system and difficult to parallelise. However, every datascalar machine is a de facto multiprocessor. When codes contain coarse-grain parallelism, the datascalar machine can also run like a traditional multiprocessor. Datascalar and multiprocessing can be viewed as two endpoints on a spectrum: the datascalar model is restricted to running sequential code and makes no attempt to exploit the coarse-grained parallelism in the code, while multiprocessing requires compiler and/or programmer support to generate the parallel code, and its main focus is explicit exploitation of coarse-grain parallelism.

The datascalar model can be applied to speed up sequential execution wherever multiprocessor hardware is available without a sufficient load of parallel tasks. Three possible candidates for datascalar systems were proposed. The first is the concept of merging processor and memory on a chip, as proposed by the intelligent RAM (IRAM) approach [17]. IRAM-based systems connected by a bus or a point-to-point ring would exhibit the parameters needed for a cost-effective datascalar implementation, because remote memory accesses to other IRAM chips would certainly be more expensive than on-chip memory accesses. Second, the datascalar model of execution may also be applied within a single chip to alleviate wiring delays, for example by extending the concept of a chip multiprocessor (CMP). CMPs access operands from a processor-local memory faster than requesting an operand from a remote processor memory across the chip, due to wiring delays. Finally, datascalar could be an alternative to paging in networks of workstations (NOWs), provided that broadcasts were sufficiently inexpensive. Some network topologies like fat trees support efficient broadcasts. Alternatively, optical interconnects, especially free-space optical interconnects, provide extremely cheap broadcasts. However, the data threads of the datascalar model may prove to be too fine-grained to tolerate the communication latencies of today's NOWs.

4. Performance potential

In this section, we estimate the performance potential for each of the uniprocessor alternatives. Note that up to now only simulation studies exist and very few prototype implementations are available, so our estimation is preliminary.

4.1. Advanced superscalar processors

Despite all extensive speculation mechanisms, advanced superscalar processors only make sense if enough instruction-level parallelism can be supplied by the application programs. Simulations of SPECint95 with an instruction window of 2048 instructions, perfect caches and perfect branch prediction show an IPC rate of about 10 for an issue/execution width of 16, increasing to 12 for an issue width of 24, and approximately 13 for an issue width of 32 [24]. The simulation results show the high potential for IPC improvements over contemporary superscalar processors by applying aggressive superscalar techniques. Further improvements may be reached by the value prediction and data speculation techniques that are still in their infancy.

4.2. Superspeculative processors

In [19], the performance of a superspeculative processor called Superflow was described. Superflow was simulated assuming a fetch width of 32, a 128-entry reorder buffer, 64 kbyte 4-way set-associative data and instruction caches with a 10-cycle miss delay to a perfect, pipelined 16 Mbyte unified secondary cache, and a 128-entry, fully associative store queue. Additional resources were a value load table for generating value predictions, a classification table to decide which predictions are likely to be correct, a dependence prediction table, an alias prediction table, and additional pattern history tables.

Superflow simulations yielded up to 9 IPC for the SPECint95 benchmark suite on a configuration with an issue bandwidth of 32 instructions per cycle [19]. The results demonstrate the high performance potential of superspeculative techniques, although such techniques are still far from mature.

4.3. Multiscalar processors

In [41], key implications of the architectural features of multiscalar processor organisation and task-level speculation for compiler task selection are studied from the point of view of performance. Important performance issues were identified, including control speculation, data communication, data dependence speculation, load imbalance, and task overhead. These issues were then correlated with a few key characteristics of tasks: task size, inter-task control flow, and inter-task data dependence. In particular, task size affects load imbalance and overhead, inter-task control flow influences control speculation, and inter-task data dependence impacts data communication and data dependence speculation. Therefore, task selection crucially affects the overall performance achieved by a multiscalar processor. It was found that tasks should be neither too small nor too large, that the number of successors of a task should be no more than the control-flow speculation hardware can cope with, and that data dependences should be included within tasks to avoid communication and synchronisation delays or misspeculation and rollback penalties.
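The three selection criteria can be summarised as a simple filter over candidate tasks. The thresholds and the function itself are invented for illustration; [41] derives the criteria empirically rather than as fixed cut-offs.

```python
# Illustrative filter for multiscalar task selection, encoding the three
# findings above (all thresholds are hypothetical, not from [41]).

def task_ok(size, num_successors, crossing_dependences,
            min_size=8, max_size=64, max_successors=4):
    # Tasks should be neither too small (task overhead dominates)
    # nor too large (load imbalance between processing units grows).
    if not (min_size <= size <= max_size):
        return False
    # No more successor targets than the control-flow speculation
    # hardware can track when predicting the next task.
    if num_successors > max_successors:
        return False
    # Data dependences crossing task boundaries cost communication,
    # synchronisation delays, or misspeculation and rollback.
    return crossing_dependences == 0

assert task_ok(size=32, num_successors=2, crossing_dependences=0)
assert not task_ok(size=4, num_successors=2, crossing_dependences=0)   # too small
assert not task_ok(size=32, num_successors=8, crossing_dependences=0)  # too many targets
assert not task_ok(size=32, num_successors=2, crossing_dependences=1)  # crossing dependence
```

A real compiler would trade these criteria off against each other rather than rejecting tasks outright, but the filter captures the direction of each correlation.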

Compared to superscalar processors, each task is equivalent to a subwindow of the instruction window; collectively, the multiple sequencers capture a portion of the dynamic instruction stream. Interoperation communication can be carried out more efficiently if the total instruction window is broken into subwindows, with more frequent intrawindow and less frequent interwindow communication. Likewise, instruction scheduling becomes more efficient if the overall schedule is treated as an ensemble of several smaller schedules, where each smaller schedule is the schedule in a subwindow, as is achieved by the multiscalar model of execution.

4.4. Trace processors

In [25], trace processors with 4, 8, and 16 PEs were simulated. Each PE could hold a trace of 16 instructions. The simulations used the SPECint95 benchmarks compress, gcc, go, ijpeg, and xlisp, and three sets of experiments were performed. First, a detailed evaluation of trace processor configurations affirmed that significant instruction-level parallelism can be exploited in integer programs (2–6 IPC). Second, for a trace processor with data prediction applied to inter-trace dependences, the potential performance improvement with perfect prediction was around 45% for all benchmarks. Next, with realistic prediction, the gcc benchmark achieved an actual improvement of 10%. Finally, the evaluation of aggressive control flow revealed that some benchmarks benefited from control independence by as much as 10%.

4.5. Datascalar processors

Simulation results suggest that the datascalar model of execution works best with codes for which traditional parallelisation techniques fail [1]. Six unmodified SPEC95 binaries ran from 7% slower to 50% faster on two nodes, and from 9 to 100% faster on four nodes, than on a system with a comparable, more traditional memory system.

However, current technological parameters do not make datascalar systems a cost-effective alternative to today's microprocessors. For a datascalar system to be more cost-effective than the alternatives, processing power should be cheap, i.e. the dominant cost of each node should be memory. At first glance the datascalar model looks like an immense waste of processing power. However, the conclusions are that the datascalar model of execution may be advantageous when remote memory accesses are significantly slower than local memory accesses, when global broadcasts are relatively inexpensive, and when the cost of additional processors is only a small addition to the total system cost. The communication bandwidth demands are themselves limited, since the datascalar model only produces broadcasts, never remote stores or remote load requests.

5. Conclusions

As a microprocessor generation becomes mature and future hardware technologies are better defined, the next generation starts to become visible. We are at that stage now. The last time we were in a similar position was almost 20 years ago, when superscalar processors began to be discussed.

In this paper we have surveyed the uniprocessor alternatives, namely the advanced superscalar, superspeculative, multiscalar, trace, and datascalar processors, which are all examples of new microarchitectures suitable for the next generation (see Table 1). Developing superspeculative techniques prepares the way for exceeding the classical dataflow limit imposed by data dependences. The multiscalar processor proposed the basics for a functional distribution of processing elements working simultaneously on tasks generated from sequential machine programs. The trace processor is similar to the multiscalar processor except for its use of hardware-generated dynamic traces rather than compiler-generated static tasks. The datascalar approach reduces interprocessor data traffic in favour of broadcasting. A real uniprocessor chip of the future is likely to combine some of these execution modes within a single microarchitecture. For instance, an advanced superscalar processor makes use of superspeculative techniques and a trace cache.

Table 1
Summary of new research directions in microprocessors

Research direction              Key idea
Advanced superscalar processor  Wide-issue superscalar processing core with speculative execution
Superspeculative processor      Wide-issue superscalar processing core with aggressive data and control speculation
Multiscalar processor           Several processing cores speculatively execute different statically generated program segments
Trace processor                 Several processing cores speculatively execute different dynamically generated trace segments
Datascalar processor            Several processing cores redundantly execute the same sequential program with different data sets

Recent studies have shown that many instructions perform the same computation and, hence, produce the same result over and over again, i.e. there is significant result redundancy in programs. Two hardware techniques have been proposed to exploit this redundancy [31]: value prediction and instruction reuse. Their impact on superscalar, multiscalar, or trace processors, as well as how to combine value prediction techniques in hybrid value predictors, has recently been described in Refs. [4,10,27].
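The semantic difference between the two techniques can be made concrete with minimal table sketches: a value prediction is speculative and must be verified, whereas a reused result is architecturally correct. Table organisation and indexing here are illustrative assumptions, not the designs evaluated in [31].

```python
# Minimal sketches of the two redundancy-exploiting techniques.

class LastValuePredictor:
    """Predict that an instruction will produce the value it produced last
    time. The prediction is speculative and must be verified when the real
    result becomes available."""
    def __init__(self):
        self.table = {}                      # PC -> last result
    def predict(self, pc):
        return self.table.get(pc)            # None means no prediction
    def update(self, pc, result):
        self.table[pc] = result

class ReuseBuffer:
    """Instruction reuse: if the same (PC, operands) pair was seen before,
    skip execution entirely -- the stored result is architecturally correct,
    so no verification is needed."""
    def __init__(self):
        self.table = {}                      # (PC, operands) -> result
    def lookup(self, pc, operands):
        return self.table.get((pc, operands))
    def insert(self, pc, operands, result):
        self.table[(pc, operands)] = result

vp, ru = LastValuePredictor(), ReuseBuffer()

# First execution of the instruction at PC 0x40: nothing is known yet.
assert vp.predict(0x40) is None
vp.update(0x40, 5); ru.insert(0x40, (2, 3), 5)

# Value prediction fires even if the operands changed (and may be wrong);
# instruction reuse fires only on an exact operand match (and is always right).
assert vp.predict(0x40) == 5
assert ru.lookup(0x40, (2, 3)) == 5
assert ru.lookup(0x40, (2, 4)) is None
```

This is why the two techniques interact differently with the pipeline: value prediction needs a recovery mechanism on misprediction, while instruction reuse shortens the critical path without any speculation.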

All the microarchitecture techniques described increase the memory bandwidth requirements compared to today's superscalar microprocessors. Therefore, all these microarchitecture techniques may be combined in future with the processor-in-memory or intelligent RAM approaches that combine processing elements of various complexity with DRAM memory on the same chip to solve the memory latency bottleneck.

All the research directions described in this paper retain result serialisation: the serial instruction flow as seen by the programmer and forced by the von Neumann architecture. However, the microarchitectures strive to extract as much fine-grained or even coarse-grained parallelism from the sequential program flow as can be achieved by the hardware. Unfortunately, a large portion of the exploited parallelism is speculative parallelism, which, in the case of incorrect speculation, leads to an expensive rollback mechanism and to a waste of instruction slots. Therefore, the result serialisation of the von Neumann architecture poses a severe bottleneck.

Current superscalar microprocessors are able to issue up to six instructions each clock cycle from a conventional linear instruction stream. In this paper we have shown that VLSI technology, together with a higher degree of speculation and functional partitioning of the processor, will allow future complex uniprocessors with an issue bandwidth of 8–32 instructions per clock cycle. However, the instruction-level parallelism found in a conventional instruction stream is limited. In general, integer-dominated programs feature rather low instruction-level parallelism, while high instruction-level parallelism can be extracted from floating-point programs.

The solution is the additional utilisation of more coarse-grained parallelism. The main approaches are the chip multiprocessor (CMP) and the simultaneous multithreaded (SMT) processor. A CMP places a small number of distinct processors (4–16) on a single chip and runs parallel programs and/or multiple independent tasks on these processors. An SMT processor [28,35,36] combines a wide-issue superscalar processor with multithreading and exhibits high performance increases over single-threaded (superscalar) processors [22]. These two alternatives optimise the throughput of a multiprogramming workload while each thread or process retains the result serialisation of the von Neumann architecture.

There are also research directions in highly parallel chip architectures that deviate from the von Neumann architecture model. Two such directions are focused on the processor-in-memory approach (also called intelligent RAM) and on reconfigurable processors. In short, IRAM integrates processor and memory on the same chip to increase memory bandwidth, while reconfigurable processors allow the hardware to adapt dynamically at run-time to the needs of an application.

Acknowledgements

We thank the referees of this paper for many helpful comments.

References

[1] D. Burger, S. Kaxiras, J.R. Goodman, Datascalar architectures, in: Proceedings of the ISCA 24, Denver, CO, 1997, pp. 338–349.

[2] B. Burgess, Specialization: a way of life, Computer 31 (1998) 43.

[3] A.P. Burks, H.H. Goldstine, J. von Neumann, Preliminary discussion of the logical design of an electronic computing instrument. Report to the US Army Ordnance Department, 1946. Reprint in: W. Aspray, A.P. Burks (Eds.), Papers of John von Neumann, MIT Press, Cambridge, MA, 1987, pp. 97–146.

[4] G.Z. Chrysos, J.S. Emer, Memory dependence prediction using store sets, in: Proceedings of the ISCA 25, Barcelona, Spain, 1998, pp. 142–153.

[5] R.P. Colwell, Maintaining a leading position, Computer 31 (1998) 45–47.

[6] R. Crisp, Direct rambus technology: the new main memory standard, IEEE Micro 17 (1997) 18–28.

[7] C. Dulong, The IA-64 architecture at work, Computer 31 (1998) 24–31.

[8] M. Evers, P.-Y. Chang, Y.N. Patt, Using hybrid branch predictors to improve branch prediction accuracy in the presence of context switches, in: Proceedings of the ISCA 23, Philadelphia, PA, 1996, pp. 3–11.

[9] M. Franklin, The multiscalar architecture, Computer Science Technical Report No. 1196, University of Wisconsin-Madison, WI, 1993.

[10] F. Gabbay, A. Mendelson, The effect of instruction fetch bandwidth on value prediction, in: Proceedings of the ISCA 25, Barcelona, Spain, 1998, pp. 272–281.

[11] P. Gillingham, B. Vogley, High-performance, open-standard memory, IEEE Micro 17 (1997) 29–39.

[12] G. Grohoski, Reining in complexity, Computer 31 (1998) 41–42.

[13] D. Grunwald, A. Klauser, S. Manne, A. Pleszkun, Confidence estimation for speculation control, in: Proceedings of the ISCA 25, Barcelona, Spain, 1998, pp. 122–131.

[14] Y. Katayama, Trends in semiconductor memories, IEEE Micro 17 (1997) 10–17.

[15] E. Killian, Challenges, not roadblocks, Computer 31 (1998) 44–45.

[16] A. Klauser, A. Paithankar, D. Grunwald, Selective eager execution on the polypath architecture, in: Proceedings of the ISCA 25, Barcelona, Spain, 1998, pp. 250–259.

[17] C.E. Kozyrakis, D.A. Patterson, New direction for computer architecture research, Computer 31 (1998) 24–32.

[18] M.H. Lipasti, J.P. Shen, The performance potential of value and dependence prediction, Lect. Notes Comput. Sci. 1300 (1997) 1043–1052.



[19] M.H. Lipasti, J.P. Shen, Superspeculative microarchitecture for beyond AD 2000, Computer 30 (1997) 59–66.

[20] M.H. Lipasti, C.B. Wilkerson, J.P. Shen, Value locality and load value prediction, in: Proceedings of the ASPLOS VII, Cambridge, MA, 1996, pp. 138–147.

[21] S. McFarling, Combining branch predictors, WRL Technical Note TN-36, Digital Western Research Laboratory, 1993.

[22] H. Oehring, U. Sigmund, T. Ungerer, Simultaneous multithreading and multimedia, in: Proceedings of the Workshop on Multi-Threaded Execution, Architecture and Compilation, Orlando, FL, 1999.

[23] S.-T. Pan, K. So, J.T. Rahmeh, Improving the accuracy of dynamic branch prediction using branch correlation, in: Proceedings of the ASPLOS V, Boston, MA, 1992, pp. 76–84.

[24] Y.N. Patt, S.J. Patel, M. Evers, D.H. Friendly, J. Stark, One billion transistors, one uniprocessor, one chip, Computer 30 (1997) 51–57.

[25] E. Rotenberg, Q. Jacobson, Y. Sazeides, J.E. Smith, Trace processors, in: Proceedings of the MICRO-30, Research Triangle Park, NC, 1997, pp. 138–148.

[26] P.I. Rubinfeld, Managing problems at high speed, Computer 31 (1998) 47–48.

[27] B. Rychlik, J. Faistl, B. Krug, J.P. Shen, Efficiency and performance impact of value prediction, in: Proceedings of the PACT, Paris, France, 1998, pp. 148–154.

[28] U. Sigmund, T. Ungerer, Identifying bottlenecks in multithreaded superscalar multiprocessors, Lect. Notes Comput. Sci. 1123 (1996) 797–800.

[29] J. Silc, B. Robic, T. Ungerer, Processor Architecture, Springer, Berlin, 1999.

[30] J.E. Smith, S. Vajapeyam, Trace processors: moving to fourth-generation microarchitectures, Computer 30 (1997) 68–74.

[31] A. Sodani, G.S. Sohi, Understanding the differences between value prediction and instruction reuse, in: Proceedings of the MICRO-31, Dallas, TX, 1998.

[32] G.S. Sohi, Multiscalar: another fourth-generation processor, Computer 30 (1997) 72.

[33] G.S. Sohi, S.E. Breach, T.N. Vijaykumar, Multiscalar processors, in: Proceedings of the ISCA 22, Santa Margherita Ligure, Italy, 1995, pp. 414–425.

[34] M. Tremblay, Increasing work, pushing the clock, Computer 31 (1998) 40–41.

[35] D.M. Tullsen, S.J. Eggers, H.M. Levy, Simultaneous multithreading: maximizing on-chip parallelism, in: Proceedings of the ISCA 22, Santa Margherita Ligure, Italy, 1995, pp. 392–403.

[36] D.M. Tullsen, S.J. Eggers, J.S. Emer, H.M. Levy, J.L. Lo, R.L. Stamm, Exploiting choice: instruction fetch and issue on an implementable simultaneous multithreading processor, in: Proceedings of the ISCA 23, Philadelphia, PA, 1996, pp. 191–202.

[37] G. Tyson, K. Lick, M. Farrens, Limited dual path execution, Technical Report CSE-TR 346-97, University of Michigan, 1997.

[38] A. Unger, T. Ungerer, E. Zehendner, A compiler technique for speculative execution of alternative program paths targeting multithreaded architectures, in: Proceedings of the Yale Multithreaded Programming Workshop, New Haven, CT, 1998.

[39] A. Unger, T. Ungerer, E. Zehendner, Static speculation, dynamic resolution, in: Proceedings of the Seventh Workshop on Compilers for Parallel Computers, Linkoping, Sweden, 1998, pp. 243–253.

[40] S. Vajapeyam, T. Mitra, Improving superscalar instruction dispatch and issue by exploiting dynamic code sequences, in: Proceedings of the ISCA 24, Denver, CO, 1997, pp. 1–12.

[41] T.N. Vijaykumar, G.S. Sohi, Task selection for the multiscalar architecture, J. Parallel Distr. Comput. 58 (1999) 132–158.

[42] T.-Y. Yeh, Y.N. Patt, Alternative implementation of two-level adaptive branch prediction, in: Proceedings of the ISCA 19, Gold Coast, Australia, 1992, pp. 124–134.

[43] T.-Y. Yeh, Y.N. Patt, A comparison of dynamic branch predictors that use two levels of branch history, in: Proceedings of the ISCA 20, San Diego, CA, 1993, pp. 257–266.


Jurij Šilc is a researcher at the Jožef Stefan Institute in Ljubljana, Slovenia. From 1980 to 1986 he was an assistant researcher at the Department for Computer Science and Informatics. From 1987 to 1993 he was the head of the Laboratory for Computer Architecture at the same department. Since 1994 he has been the deputy head of the Computer Systems Department at the Jožef Stefan Institute. Šilc received his PhD degree in electrical engineering from the University of Ljubljana, Slovenia, in 1992. His research interests include computer architecture, high-level synthesis, and parallel computing.

Theo Ungerer is a professor of computer science at the University of Karlsruhe, Germany. Previously, he was a scientific assistant at the University of Augsburg (1982–1989 and 1990–1992), visiting assistant professor at the University of California, Irvine (1988–1990), and professor of computer architecture at the University of Jena, Germany (1992–1993). Since 1993 he has been with the Department of Computer Design and Fault Tolerance, University of Karlsruhe. Ungerer received a doctoral degree from the University of Augsburg in 1986, and a second doctoral degree (habilitation) from the University of Augsburg in 1992. His current research interests are in processor architecture and parallel and distributed computing.

Borut Robic has been an assistant professor of computer science at the Faculty of Computer and Information Science, University of Ljubljana, Slovenia, since 1994. From 1984 to 1993 he was an assistant researcher at the Department of Computer Science and Informatics, Jožef Stefan Institute, Ljubljana, and since 1994 he has been a researcher at the Computer Systems Department at the same institute. Robic received his PhD degree in computer science from the University of Ljubljana, Slovenia, in 1993. His present research interests include parallel computing, algorithms, and complexity theory.