
ARAS: Asynchronous RISC Architecture Simulator

Chia-Hsing Chien, Mark A. Franklin, Tienyo Pan, and Prithvi Prabhu

Computer and Communications Research Center, Washington University, St. Louis, Missouri 63130-4899, U.S.A.

(This research has been funded in part by NSF Grant CCR-9021041 and ARPA Contract DABT-93-C0057. Tienyo Pan is currently with ITRI: Industrial Technology Research Institute, Taiwan, Republic of China.)

Abstract

In this paper, an asynchronous pipeline instruction simulator, ARAS, is presented. With this simulator, one can design selected instruction pipelines and check their performance. Performance measurements of a pipeline configuration are obtained by simulating the execution of benchmark programs on the machine architectures developed. Depending on the simulation results obtained by using ARAS, the pipeline configuration can be altered to improve its performance. Thus, one can explore the design space of asynchronous pipeline architectures.

1 Introduction

This paper presents a graphic simulation tool, ARAS (Asynchronous RISC Architecture Simulator), which has been developed to ease the modeling, visualization and performance evaluation of asynchronous instruction pipelines. The paper has three objectives:

• To present the essential elements of the ARAS modeling tool.
• To show how ARAS can model instruction pipelines.
• To demonstrate how ARAS may be used to obtain important performance data which can then guide the design of asynchronous instruction pipelines.

Early RISC machines often employed a single, 4 or 5-stage pipeline [9, 6]. In more advanced machines, performance has been improved by increasing the pipeline depth (i.e., superpipelined) and by employing multiple pipelines (i.e., superscalar). Superpipelined techniques divide each instruction into finer segments, thus reducing the cycle time. Superscalar techniques, on the other hand, issue several instructions into parallel pipelines and thus increase the average number of instructions being processed per cycle. In this paper, instruction pipelines are modeled using the ARAS simulation tool, which permits visualization of pipeline operation and the collection of pipeline performance data.

Most commercial machines are currently clocked; however, asynchronous design has attracted more attention in recent years as clock rates and power levels have increased. Although asynchronous methodology currently requires more chip area, generally entails higher design complexity, and does not have a large base of available design automation tools, there are certain potential advantages. Among these are having performance governed by mean versus worst case function delays, eliminating limitations associated with clock skew (although introducing other limitations), and having potentially lower power levels. The performance advantages associated with asynchronous systems are discussed in more detail in [3]. While the use of asynchronous modules in the design of processors goes back to the late 1960s with work at Washington University (St. Louis) [2], it wasn't until 1988 that an entire asynchronous microprocessor was designed at the California Institute of Technology [10]. Later an asynchronous version of the ARM processor, AMULET1, was completed at the University of Manchester (UK) [5]. Currently, SUN Microsystems is developing an asynchronous microprocessor called the Counterflow Pipeline Processor (CFPP) [11], and it appears that several other institutions and companies are also investigating the design of asynchronous machines [8].

Figure 1 shows a typical 5-stage instruction pipeline with several buffers between each stage. Analytic modeling of such a system is not easily accomplished for the following reasons:



Figure 1: A 5-stage pipeline with multiple buffers

• The service rates associated with each stage are not easily quantifiable since they reflect the processing requirements of each instruction type.

• To achieve a realistic pipeline arrival distribution it is best to model the arrival process with real workload instruction traces. Such traces generally cannot be modeled analytically in a tractable manner.

• The basic queueing system associated with a realistic instruction pipeline model has a host of logical exception conditions associated with eliminating hazards, result forwarding, etc. These are not easily captured in analytic models.

In this paper, Section 2 illustrates how the ARAS display helps in visualization of the pipeline's operation and also discusses the data collection facilities. In Section 3, the basic ARAS operation is briefly considered. Section 4 considers the problems associated with ensuring that the ARAS pipeline presented by the user can be implemented and indicates the constraints associated with pipeline construction. Section 5 illustrates three uses of ARAS. The first is a pipeline design experiment; the second and third consider the effects of adder design and handshaking delays on pipeline performance. Conclusions and suggestions for future research and modifications to ARAS are presented in Section 6.

2 Using ARAS

ARAS simulates the instruction pipeline of a processor executing a (benchmark) program and then evaluates processor performance. Figure 2 is an example of an ARAS display. Construction of the display and the driving program is considered later.

Each rectangle in the display represents a block of micro-operations associated with a stage in a processor's instruction pipeline. Lines between blocks represent paths that instructions may take during execution.

Figure 2: ARAS Display

Figure 2 illustrates a 4-stage instruction pipeline. The pipeline begins with two parallel instruction fetch blocks, both of which connect to a shared instruction decode block. The two fetch blocks represent the possible parallel fetching of instructions. The third stage has two different, but parallel, blocks to handle branch and operand fetch operations respectively. In certain situations this will permit further instruction parallelism. Finally, three different but parallel blocks are present. This permits two ALU register-based instructions to proceed in parallel while a memory access instruction (e.g., Load) is also taking place.
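As a point of reference, the topology just described can be written down as a simple block graph. The block names and successor lists below are read off the prose above and are only illustrative; they are not an ARAS data structure.

```python
# Illustrative successor lists for the Figure 2 pipeline described above.
# Names and exact fan-out are assumptions, not ARAS internals.
FIGURE_2_PIPELINE = {
    "IF_A": ["ID"],                       # two parallel instruction fetch blocks
    "IF_B": ["ID"],
    "ID": ["BRANCH", "OPERAND_FETCH"],    # shared decode feeding the third stage
    "BRANCH": [],                         # branch handling
    "OPERAND_FETCH": ["ALU_1", "ALU_2", "MEM_ACCESS"],
    "ALU_1": [],                          # two parallel register-based ALU blocks
    "ALU_2": [],
    "MEM_ACCESS": [],                     # memory access instructions (e.g., Load)
}
```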

Visualization of pipeline operation and dynamics employs the following main techniques:

• The execution and movement of instructions through the pipeline corresponds to the movement of dots from one block to another. The presence of a dot in a particular block indicates that an instruction is being processed in that block; otherwise, the block is empty.

• At the time an instruction moves from one block to another, the line between the blocks involved momentarily thickens.


• The color of the dot (not shown in the black and white figure) corresponds to the instruction type being executed (e.g., arithmetic, logic, branch, etc.).

• The border of each block is color coded to reflect changes in block status. A block may be idle, busy, or blocked. Changes in border color reflect changes in block status.

Since there is no clock in this system, movement of dots between blocks is governed by the availability of blocks and input instructions. As indicated, a dot (instruction) may be blocked in a particular block if the successor block that is required by the instruction is busy processing another instruction. Thus, the movement of the dots, the temporary thickening of block interconnect lines, and the color changes (dot or border) together allow a designer to visualize the progress of instructions through the asynchronous pipeline. Instructions flow from left to right, finishing (in this example) on completion of the fourth stage (or sometimes earlier when a branch is encountered).

In addition to the dynamic visual display of asynchronous instruction processing, a designer can gather both global system and local block performance results. Global system performance includes: system throughput, the number of processed instructions, and the simulation time required for execution of the processed instructions. The local performance of any given block can also be obtained. This includes: the number of instructions processed, throughput, and the percent of idle, working and blocked time associated with the selected block.

Using these global and local results, a designer can attempt to improve pipeline performance by modifying and restructuring the pipeline. For example, pipeline performance is generally improved if each stage takes roughly equal time. Thus, the designer can examine the effect of moving various micro-operations between blocks (done by editing a configuration data file), rerunning ARAS, and comparing the various results. Technical report [1] provides the procedures to be followed for altering the pipeline configuration data files.

3 Basic ARAS Operation

3.1 Discrete-event simulation

The core of ARAS is a standard trace-driven discrete-event simulator. After ARAS has accessed the configuration file for a particular pipeline, a block table is created.

Figure 3: The steps to process an event in ARAS (example block tables with finish times, a sample pipeline, and the event-processing flow: find the block with the smallest finish time, decrease the remaining finish times and increase the system clock, update the system state and perform the current block's operations, then move the instruction to the next block or leave it in the current block)

This table is used to schedule events (e.g., the completion of an instruction's usage of a block) generated by the pipeline's operation. As indicated earlier, a block can be in one of three possible states: idle, busy, or blocked. A block in an idle state is available to process an instruction. The busy state denotes that the block is currently processing a portion of the instruction's operation. The blocked state occurs when an instruction is prevented from moving into the next block in the pipeline because its successor block is busy or because there exists an unresolved data dependency. An event occurs in ARAS principally when a block completes its processing of an instruction and changes state from busy to either idle, blocked, or busy (for the next instruction). Once an event has been processed, other blocks may change their states, triggering yet other events.

Event scheduling is depicted in Figure 3, where the configuration of an example pipeline is shown. The block table lists the blocks using their unique identification numbers and indicates the finish time associated with each block. Whenever an instruction enters a block, a finish time for the block is calculated based on the instruction type and the operation(s) to be performed by the block. Block table (a) of Figure 3 shows the allocated finish times at a point in the simulation. All the blocks have finish times except for block 4, which is idle.



ARAS proceeds through the simulation by selecting the busy block with the smallest finish time (scanning from right to left). The finish time of the selected block (the current block) is added to the global simulation clock and is also used to decrement the finish time of all other busy blocks in the table. Next, ARAS performs the operations associated with processing of the instruction in the current block (e.g., updating registers and tables, changing block states, etc.). It then checks to see if the current block's instruction can be moved to the next block it requires. With every movement, all other blocks are checked to see if the event permits the further movement of blocked instructions.
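A minimal sketch of this selection-and-advance loop is shown below. The Block structure and the advance callback are simplifying assumptions made for illustration; they are not the actual ARAS implementation.

```python
# Sketch of an ARAS-style event loop (illustrative, not ARAS code).
class Block:
    def __init__(self, name):
        self.name = name
        self.state = "idle"        # idle | busy | blocked
        self.finish_time = 0.0     # remaining processing time while busy
        self.instruction = None

def simulate(blocks, advance):
    """advance(block) performs the current block's operations, tries to move
    its instruction onward, and returns any blocks whose state changed."""
    clock = 0.0
    while any(b.state == "busy" for b in blocks):
        busy = [b for b in blocks if b.state == "busy"]
        current = min(busy, key=lambda b: b.finish_time)   # next event
        step = current.finish_time
        clock += step                                      # advance the global clock
        for b in busy:                                     # age the other busy blocks
            b.finish_time -= step
        changed = advance(current)                         # process the event
        for b in changed:                                  # blocked instructions may now move
            if b.state == "blocked":
                advance(b)
    return clock
```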

The procedures described above continue, with a new block now being selected as the current block. The simulation ends when the entire trace program has been processed.

3.2 Driving ARAS

The instruction traces used to drive ARAS are obtained by collecting execution information for programs running on a SPARC computer. The program to be traced is first executed on a SPARC in single-step mode under control of the debugger, and an instruction trace is collected. The execution information collected is sent through an interpreter which puts it into a form useful to ARAS. It is placed in a part of ARAS memory corresponding to the source segment. Data and stack segments are also created for ARAS.
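The fragment below sketches one way such interpreted trace records could be loaded. The line format (program counter, opcode, comma-separated operands) is purely hypothetical; the real format is defined by the interpreter programs documented in technical report [1].

```python
from dataclasses import dataclass

# Hypothetical trace-record loader (the actual ARAS trace format differs).
@dataclass
class TraceEntry:
    pc: int
    opcode: str
    operands: tuple

def load_trace(path):
    """Read a single-stepped trace and return the records that drive the simulator."""
    entries = []
    with open(path) as f:
        for line in f:
            pc, opcode, *rest = line.split()
            operands = tuple(rest[0].split(",")) if rest else ()
            entries.append(TraceEntry(int(pc, 16), opcode.lower(), operands))
    return entries
```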

Standard benchmark traces are being developed for users interested in experimenting with various pipeline designs. For users interested in developing their own traces, details on how to use the interpreter programs are described in technical report [1].

4 Pipeline Design Constraints

There are some basic constraints on the use of blocks in the design of instruction pipelines. This section discusses these constraints, an associated set of rules, and the way ARAS handles them.

The ARAS instruction set is derived from the SPARC instruction set. Each instruction can be divided into a number of micro-operations (initially based on DLX micro-operations [6]), some of which may be common to all or a group of instructions. Prior to defining pipeline blocks, the instruction state diagram which indicates the sequencing of micro-operations must be constructed. A simple example is shown in Figure 4, where certain ALU and Jump instructions are illustrated [6].

Figure 4: A Typical Finite State Diagram

In general, a pipeline block may consist of a single micro-operation, or a group of sequential micro-operations as defined in the state diagram. While the entire set of micro-operations available to the user does not correspond to the actual micro-operations associated with any given SPARC implementation, the basic micro-operations needed for execution of any SPARC instruction are present. Currently, due to implementation limitations, ARAS does not handle arbitrary assignments of micro-operations to blocks. However, a set of 40 predefined pipeline blocks (with their associated micro-operations) is available to the user to explore a variety of pipeline designs. Work is under way to extend this capability and to develop a more flexible micro-operation to block assignment mechanism.

Figure 4 shows three instructions (ADD R1, R2, Rd; SUB R1, R2, Rd; and JR R1) and the path they must take through the finite state diagram in order to be executed correctly. Figure 5 shows how each individual state represents a micro-operation and how these micro-operations can be grouped together to form blocks which constitute stages in the pipeline.

As a simple example, consider the implementation of the Instruction Fetch (IF) operation. One approach is to define a single IF block which performs both the instruction fetch (micro-operation 1 in Figure 4) and the program counter (PC) update (micro-operation 2) micro-operations. Another approach is to have two separate sequential IF blocks, the first performing the instruction fetch, and the second handling the PC update. Block types are available to explore both approaches.


Figure 5: Synthesis of Micro-Operations and Macroblocks (individual micro-operations are grouped into macroblocks forming the Instruction Fetch, Instruction Decode, Execution, and Write Back stages)
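Expressed as data, the two instruction-fetch alternatives just described might look as follows. The block names and micro-operation labels are illustrative assumptions, not ARAS's predefined block set.

```python
# Single combined instruction-fetch block:
PIPELINE_A = {
    "IF": ["IR <- Mem[PC]", "PC <- PC + 4"],   # micro-operations 1 and 2 together
    "ID": ["A <- R1", "B <- R2"],
    "EX": ["C <- A op B"],
    "WB": ["Rd <- C"],
}

# Instruction fetch split into two sequential blocks:
PIPELINE_B = {
    "IF1": ["IR <- Mem[PC]"],                  # fetch only
    "IF2": ["PC <- PC + 4"],                   # PC update only
    "ID":  ["A <- R1", "B <- R2"],
    "EX":  ["C <- A op B"],
    "WB":  ["Rd <- C"],
}
```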

In designing a working pipelined machine, general rules apply:

• The micro-operations constituting an instruction should be executed in the proper sequence, and this should be true for every instruction and possible instruction sequence.

• A pipelined machine should be able to handle control hazards which occur due to branching.

• A pipelined machine should be able to handle resource hazards that may arise when multiple instructions request the same execution resource (e.g., an adder).

• A pipelined machine should be able to handle data hazards and ensure that instructions are executed with their proper operands.

The first rule is implemented by the user in defining blocks and pipelines that maintain the operation sequence defined by each instruction's state diagram. For example, suppose that the write back micro-operation (RD ← C) is placed prior to the execution micro-operations performing addition and subtraction (C ← A + TEMP and C ← A - TEMP). This clearly would result in incorrect instruction execution, since the C register would be written prior to being updated. ARAS users are provided with a table (similar to Figure 5) which gives the sequence of micro-operations to be followed. The table also indicates which micro-operations can be performed in parallel (provided there are no resource hazards).
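The first rule can be thought of as a simple ordering check, sketched below under assumed micro-operation labels: the position of each micro-operation along the pipeline must not violate the order demanded by the instruction's state diagram.

```python
def respects_sequence(pipeline_blocks, required_sequence):
    """pipeline_blocks: ordered list of (block_name, [micro_ops]);
    required_sequence: micro-ops in the order the state diagram demands."""
    flattened = [op for _, ops in pipeline_blocks for op in ops]
    positions = {op: i for i, op in enumerate(flattened)}
    order = [positions[op] for op in required_sequence if op in positions]
    return order == sorted(order)

# Placing the write back before the execute micro-operation fails the check:
bad = [("IF", ["IR <- Mem[PC]", "PC <- PC + 4"]),
       ("WB", ["Rd <- C"]),
       ("EX", ["C <- A + TEMP"])]
assert not respects_sequence(bad, ["IR <- Mem[PC]", "C <- A + TEMP", "Rd <- C"])
```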

The second rule is implemented by requiring that all instructions are in order up to and including the block or blocks which include instruction decoding, branch address calculation, and branch condition evaluation. Out-of-order execution is permitted after these micro-operations have been performed. If a branch is detected when the condition evaluation takes place, only instructions earlier in the pipeline need to be flushed. This ensures that there will be no control hazards, since the instructions which have entered the pipeline after the branch will not have changed any register states before they are flushed.

The third rule is enforced by the handshaking protocol inherent in the asynchronous design. That is, an instruction requesting resources (other blocks) that are busy will be automatically stalled in the current block until an acknowledgement is received.

ARAS handles data hazards by using a standard scoreboarding technique [6]. This includes checking the register reservation table at the start of the operand fetch micro-operation to ensure that the register has not been reserved by a prior instruction, and stalling at this block if the register has been reserved. In addition, an instruction which will write to a register must reserve that register when it is in the operand fetch block. Later, when the write back micro-operation has completed, this same instruction must cancel the register reservation. This ensures that register operations are all in order.
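A minimal sketch of this reservation discipline is given below. The Scoreboard class and its method names are assumptions made for illustration, not ARAS code, and checking the destination register as well as the sources is a simplification.

```python
class Scoreboard:
    def __init__(self):
        self.reserved = set()      # registers reserved by in-flight instructions

    def can_issue(self, srcs, dest):
        # stall at operand fetch if a prior instruction still owns one of our registers
        return not (set(srcs) | {dest}) & self.reserved

    def issue(self, dest):
        self.reserved.add(dest)    # reservation made in the operand fetch block

    def write_back(self, dest):
        self.reserved.discard(dest)  # reservation cancelled after write back

sb = Scoreboard()
assert sb.can_issue(srcs=("R1", "R2"), dest="R3")
sb.issue("R3")
assert not sb.can_issue(srcs=("R3",), dest="R4")   # RAW hazard: stall
sb.write_back("R3")
assert sb.can_issue(srcs=("R3",), dest="R4")
```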


Figure 6: Pipeline Configurations Analyzed Using ARAS (4SP, 6SP, 8SP, and 4SP/PF). IF: Instruction Fetch, ID: Instruction Decode, EX: Execution, WB: Write Back

5 Performance Evaluation and Design

This section illustrates three uses of ARAS. First, the architecture of a pipelined machine is developed using ARAS. Second, the effect of alternative module (in this case, adder) designs on pipeline performance is presented. Third, the influence of a particular design parameter (handshaking delay) on performance is considered.

5.1 Design of an Asynchronous Pipeline

Consider an initial pipeline which has three basic stages: an Instruction Fetch stage, an Instruction Decode stage, and an Execute stage. Assume that Write Back operations may be performed in the Execute stage. The Execute stage also includes the Memory Access micro-operations.

One approach to developing higher performance machines is to attempt to lengthen the instruction pipeline. Various design alternatives are available, and the issue is to find a design which yields the highest instruction throughput. Preliminary executions indicated that in a three-stage pipeline, the Instruction Fetch stage was a bottleneck. It was therefore divided into two stages, and the four-stage pipeline of Figure 6 was examined with ARAS. (We are not concerned here whether or not this can be achieved in an actual hardware implementation. If the results show significant improvement, then such an implementation can be explored.)

Figure 7: Block Status Display

A set of benchmark programs (see Table 1) was executed on the four-stage pipelined machine, and the simulation results were then used to identify blocks in this pipeline which were performance bottlenecks.

Block status information can be obtained both visually (using the ARAS display) and from the ARAS output at the end of the simulation. Figure 7 shows the visualization of the block status in the ARAS display. With this display option the user can get information about both the dynamic and the final block status. The figure shows the final block status of a 4-stage pipeline machine for one of the benchmark programs.

The micro-operations for those stages identified as performance bottlenecks were redistributed over a greater number of blocks. This resulted in more blocks (increasing the pipeline depth) with fewer micro-operations per block (reducing block delay) and enabled us to find configurations which yielded higher throughput. There are a large number of ways in which the blocks may be divided; however, using simple heuristic approaches in conjunction with ARAS it was not difficult to analyze various configurations.


Table 1: Benchmark Programs and Instruction Counts

Program   Description                                        ALU     Branch   Memory   Total
Qsort     quick sort of 30 randomly generated numbers        11833   3508     10893    26234
Sieve     calculation of prime numbers up to 100             20462   8393     3235     32090
Dsim      discrete event simulation (first 121 iterations)   22321   5715     12442    40478
Percent                                                      55.3%   17.8%    26.9%    100%


Table 1 gives details about the three benchmark programs used: their instruction counts and the total number of ALU, branch and memory access instructions for each program. Qsort is a program which uses the quick sort algorithm to sort thirty numbers. Sieve is a program which uses the sieve of Eratosthenes to obtain all prime numbers less than one hundred. Dsim is a discrete event simulator which performs one hundred and twenty one iterations. Table 2 shows the percentage of idle, busy and blocked time for the 4-Stage Pipeline (4SP) shown in Figure 6. The benchmark programs discussed above were used to obtain this data. The simulation results (Table 2) show that the ID stage was busy for 90.94% of the time, indicating that it is a performance bottleneck. This is reinforced by the high percentage of time that the IF stages are blocked. The user of ARAS could also see this through the visualization of the pipeline, noting that the ID block was almost always busy. Note also that the IF1 stage is never idle since instructions are always available to be processed.

One approach to reducing this bottleneck is to divide the micro-operations associated with the ID block into two blocks, ID1 and ID2. In addition, since the IF blocks are busier than the EX block, further increase in performance may result from dividing the micro-operations associated with the instruction fetch stage into three blocks, IF1, IF2 and IF3. Simulation results (Table 3) for this 6-Stage Pipeline (6SP) show an 11% performance improvement over the four-stage case.

An 8-Stage Pipeline (8SP) was then constructed with the ID micro-operations now divided across three blocks, and the EX block divided into an EX and a separate WB (Write Back) block. The results of that simulation are shown in Table 4. Performance for this case, however, while higher than the 4-stage pipeline, was lower than the 6-stage pipeline. This results from a combination of factors. First, as the number of pipeline stages increases, the average cycle time per stage decreases. Handshaking delays (discussed later) thus become a larger percentage of the stage cycle time and establish a lower bound on this time. In addition, with longer pipelines, the role of hazards increases, since both the probability of a hazard and the pipeline stall penalties associated with a hazard increase. The combination of these factors results in a lower throughput [3].

Another approach to increasing performance, in addition to dividing the blocks, is to alter the pipeline configuration to take advantage of instruction level parallelism. An example of this is shown in the 4SP/PF case of Figure 6, which illustrates a simple superscalar configuration. In this case, branch instructions flow through the BR (Branch) block, while other instructions proceed through the OF (Operand Fetch) block. Since there are multiple IF and EX blocks, instruction level parallelism is present, thus potentially improving performance. Note that while the output of the BR block is shown as entering the EX blocks, since branch instructions will have zero execute time, they will essentially exit from the system after leaving BR. As shown in Table 3, this configuration yields the highest performance.

Table 2: Block Status for the 4-Stage Pipeline (4SP)

Block State   IF1     IF2     ID      EX
Idle          0       5.4%    7.6%    40.2%
Busy          61.2%   68.2%   90.9%   59.8%
Blocked       38.8%   26.4%   1.4%    0

Table 3: Simulation results by model (Qsort, Sieve, Dsim, Harmonic Mean)


Table 4: Block Status for the 8-Stage Pipeline (8SP)

Block State   IF1     IF2     IF3     ID1     ID2     ID3     EX      WB
Idle          0       4.6%    10.8%   14.4%   26.2%   24.3%   48.8%   66.1%
Busy          55.4%   66.9%   61.1%   55.2%   33.1%   72.3%   50.8%   33.9%
Blocked       44.6%   28.5%   28.1%   30.4%   40.7%   3.4%    0.4%    0

Table 5: Block Status for the 4SP/PF Configuration

Block State   IF(Up)   IF(Down)   ID      BR      OF      EX1     EX2
Idle          0        0          12.2%   84.9%   28.4%   60.6%   69.0%
Busy          67.4%    74.1%      64.8%   12.7%   69.7%   39.4%   31.0%
Blocked       32.6%    25.9%      23.1%   2.4%    1.9%    0       0

5.2 Influence of a Functional Module

In the current version of ARAS, several functional modules are simulated in greater detail and provide operation-time information to the main pipeline simulation. For example, a cache memory simulator is present to determine the times associated with each memory access. In this simulator, separate set-associative caches are present for instructions and data.
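The sketch below illustrates the kind of memory-access timing model described here. The cache geometry and LRU replacement are assumptions for this sketch; only the 10 ns cache access and 100 ns main memory access figures are taken from Table 7, interpreted here as the hit and miss costs.

```python
class SetAssocCache:
    def __init__(self, num_sets=64, ways=2, line_bytes=16,
                 hit_ns=10.0, miss_ns=100.0):
        self.num_sets, self.ways, self.line_bytes = num_sets, ways, line_bytes
        self.hit_ns, self.miss_ns = hit_ns, miss_ns
        self.sets = [[] for _ in range(num_sets)]   # each set holds tags in LRU order

    def access(self, addr):
        """Return the delay charged for this access and update LRU state."""
        line = addr // self.line_bytes
        index, tag = line % self.num_sets, line // self.num_sets
        s = self.sets[index]
        if tag in s:
            s.remove(tag); s.append(tag)            # hit: move to most-recently-used
            return self.hit_ns
        s.append(tag)                               # miss: fill the line
        if len(s) > self.ways:
            s.pop(0)                                # evict least-recently-used
        return self.miss_ns

icache, dcache = SetAssocCache(), SetAssocCache()   # separate I- and D-caches
```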

Another module which is simulated in greater detail is the 32-bit 2's complement integer adder. Three different adder designs are available: the asynchronous Ripple-Carry Adder (RCA), the asynchronous carry-SELect adder (SEL), and the Conditional-Sum Adder (CSA). The structures of these adders are described in [4]. Users can observe the change in performance resulting from the use of different adders in a particular configuration. The simulation results for the previous configurations using different adders are shown in Table 6.

Since the CSA has a tree structure and takes a constant amount of time (O(log2 n)) to complete an addition, it is considered to be one of the fastest adders (for clocked designs) [7, 12] and achieves the highest throughput for all configurations. However, the complexity of the CSA circuit is higher than that of most other adder circuits. Of the remaining adders, the use of the SEL(8) results in the best throughput. The SEL(8) has 8 blocks, each of which performs an addition of two 4-bit numbers. If the SEL(16) (containing 16 blocks, each of which performs an addition of 2-bit numbers) is used, the performance is not as good as in the SEL(8) case. This is due to the presence of multiplexers, which alter the output of each block depending on the carry of the previous block. The multiplexer delay becomes more significant when the number of partitions is increased; therefore, the overall throughput is decreased. The configuration using the RCA has the worst throughput among all adders since the long carry chain present increases the addition delay.
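The data-dependent behavior described here can be illustrated with a first-order delay model. Only the 1 ns per carry-chain bit figure comes from Table 7; the carry-select treatment and the 0.5 ns multiplexer delay are assumptions made for illustration, not the adder models of [4].

```python
def longest_carry_chain(a, b, width=32):
    """Longest run of bit positions through which a carry must ripple."""
    longest = run = 0
    carry = 0
    for i in range(width):
        x, y = (a >> i) & 1, (b >> i) & 1
        if carry and (x ^ y):              # incoming carry propagates one bit further
            run += 1
            longest = max(longest, run)
        else:
            run = 0
        carry = (x & y) | ((x ^ y) & carry)
    return longest

def rca_delay_ns(a, b):
    # Ripple-carry: delay tracks the longest carry chain of the operands.
    return max(1, longest_carry_chain(a, b)) * 1.0

def sel_delay_ns(blocks=8, width=32, mux_ns=0.5):
    # Carry-select: blocks add their slices in parallel, then the true carry
    # is steered through (blocks - 1) multiplexers, so more partitions trade
    # a shorter per-block add for more multiplexer delay.
    return (width // blocks) * 1.0 + (blocks - 1) * mux_ns

print(rca_delay_ns(0x0000FFFF, 0x00000001))   # long carry chain -> large RCA delay
print(sel_delay_ns(8), sel_delay_ns(16))      # SEL(8) vs SEL(16)
```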

Figures 8 and 9 show the distribution of the addition times for two of the adders, RCA and SEL(8). These were obtained using the Qsort benchmark program referred to in Table 1. The ARAS display enables the user to view the distribution of addition times either dynamically or at the end of the simulation. Addition times are dependent on the operand distribution and the adder used, and not on the pipeline configuration. The distributions indicate that long carry chains occur more often than would be expected with uniformly distributed operands. This appears to be due to additions which occur during effective address calculations (e.g., stack operations, which occur at high memory addresses) and can influence the performance of pipeline configurations and the choice of adders. With these sorts of simulation results, used in conjunction with VLSI implementation requirements, users can decide which adder design is most appropriate for their design environment. For example, given that a CSA is much larger than an SEL(8) and increases the pipeline performance by only 3%, an SEL(8) design might be the desired choice.


Table 6: Simulation results (throughput) for different adder designs and pipeline configurations

Config    RCA    SEL(2)   SEL(4)   SEL(8)   SEL(16)   CSA
4SP       54.6   58.5     62.4     63.4     60.8      66.2
6SP       60.4   64.9     69.5     70.7     67.9      73.3
8SP       60.0   63.3     67.2     68.2     65.8      73.3
4SP/PF    64.4   69.0     73.6     74.9     71.9      79.3

Figure 8: Addition Time for RCA

Figure 9: Addition Time for SEL(8)

Similar studies can be performed by focusing on other functional modules, such as the influence of the number of sets in the cache memory or the size of the instruction and data cache memories.

5.3 Technology and Design Parameters

The prior sections have focused on high-level pipeline architecture considerations and on individual functional modules. However, underlying the asynchronous simulation is a set of assumptions concerning technology parameters and implementations. ARAS allows one to explore the effects of many of these parameters on performance in conjunction with configuration considerations.

Table 7: Operation Delays

Type of operation                       Delay
Handshaking                             3 ns
Cache access                            10 ns
Main memory access                      100 ns
Instruction decode                      5 ns
Simple ALU operation                    3 ns
Add or Sub (each bit of carry chain)    1 ns
Register access and update

The performance results associated with Table 3 are based on the parameter values of Table 7. The optimum configuration may depend on these parameters. For example, if handshaking delays are reduced, then the optimum configuration may have more stages. These synchronization overheads are important factors which affect the performance of asynchronous pipelines (especially in the case of the longer pipelines). To obtain an idea of the sensitivity to handshaking delays with a given configuration, consider the results of Table 8. These results indicate that a 16% performance gain can be achieved in a 6-stage pipeline if the handshaking delays are reduced from 3 to 1.5 ns. If techniques for overlapping handshaking with other operations can be developed which lead to an effective zero handshaking delay, then a 38% improvement can be achieved over the 3 ns case. The results also show that with lower handshaking delays, the 8-stage pipeline yields higher performance than the 6-stage pipeline.
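The trend can be illustrated with a crude stage-time model: each stage costs roughly its share of the per-instruction work plus a fixed handshaking overhead, so deeper pipelines shrink the first term but not the second. The total work figure below is an arbitrary assumption, and hazards (which ARAS does capture through its traces) are ignored, so this is a sketch of the effect, not ARAS's timing.

```python
def stage_time_ns(total_work_ns, stages, handshake_ns):
    # Crude model: stage time = per-stage share of work + handshaking overhead.
    return total_work_ns / stages + handshake_ns

WORK = 30.0                                   # assumed per-instruction work, ns
for hs in (3.0, 1.5, 0.0):                    # handshaking delays studied in Table 8
    for stages in (6, 8):
        t = stage_time_ns(WORK, stages, hs)
        share = hs / t if t else 0.0
        print(f"handshake {hs} ns, {stages}-stage: "
              f"{t:5.2f} ns/stage ({share:4.0%} spent handshaking)")
```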

Thus, ARAS reflects the change in the optimal depth of the pipelines with the change in handshaking delays. Keeping other factors constant, deeper pipelines will not necessarily perform better than shallower pipelines unless handshaking delays are reduced. This is due to data dependencies and is seen in the comparison of the 6- and 8-stage pipelines at a handshaking delay of 3 ns.


Table 8: Simulation Results (MIPS: Harmonic Mean for all Benchmarks)

Handshaking delay   4SP     6SP     8SP      4SP/PF
0 ns                81.3    97.7    104.1    100.1
1.5 ns              71.5    82.5    83.1     87.0
3 ns                63.8    71.0    68.2     76.8
4.5 ns              57.7    62.1    58.0     67.1

Since ARAS uses program traces, such factors are taken into account. ARAS also indicates that for handshaking delays greater than 1.5 ns, the superscalar configuration presented outperforms all of the single-pipeline configurations, since this configuration is less sensitive to changes in handshaking delays.

6 Conclusions and Future Research

An asynchronous RISC architecture simulator, ARAS, was presented in this paper. A brief description of simulator operation and of the user interface was given. General rules for building a working instruction pipeline were explained, and examples of how ARAS can be used to explore the performance of alternative pipeline configurations, functional module performance, and parameter values were given. ARAS can be used to simulate a pipelined machine, visualize its operation and obtain performance data. ARAS simulation times depend on the size of the benchmark programs being used. For the set of programs given in Table 1, ARAS took an average of about 30 seconds on a Sun SPARC-20 workstation for simulation and presentation of the final results. If intermediate results are required (such as the status of registers after each instruction) or if the visualization is used, then longer times will result.

This paper reports on a first version of ARAS. Current work includes making ARAS more user friendly and flexible in the block and configuration specification area. Other enhancements include permitting ARAS to simultaneously simulate both clocked and asynchronous versions of the same pipeline configuration. The performance data presented in this paper used program traces which did not include system calls (supervisor mode instructions). Work is now being done to remove this limitation and to integrate the SPECmark set of programs into ARAS.

References

[1] C-H. Chien and P. Prabhu. ARAS: Asynchronous RISC Architecture Simulator. Technical Report WUCCRC-94-18, Washington Univ., St. Louis, MO, 1994.

[2] W.A. Clark. Macromodular Computer Systems. In Proc. AFIPS Spring Joint Computer Conference, pages 489-504, April 1967.

[3] M.A. Franklin and T. Pan. Clocked and Asynchronous Instruction Pipelines. In Proc. 26th ACM/IEEE Symp. on Microarchitecture, pages 177-184, Austin, TX, December 1993.

[4] M.A. Franklin and T. Pan. Performance Comparison of Asynchronous Adders. In Proc. Symp. on Advanced Research in Asynchronous Circuits and Systems, Salt Lake City, UT, November 1994.

[5] S.B. Furber, P. Day, J.D. Garside, N.C. Paver, and J.V. Woods. A Micropipelined ARM. In Int'l Conf. on Very Large Scale Integration (VLSI'93), September 1993.

[6] J.L. Hennessy and D.A. Patterson. Computer Architecture: A Quantitative Approach. Morgan Kaufmann Publishers, Palo Alto, CA, 1990.

[7] K. Hwang. Computer Arithmetic: Principles, Architecture, and Design. Wiley, 1979.

[8] IEEE Computer Society. Proceedings of the Int'l Symposium on Advanced Research in Asynchronous Circuits and Systems, Salt Lake City, UT, November 1994.

[9] N.P. Jouppi and D.W. Wall. Available Instruction-Level Parallelism for Superscalar and Superpipelined Machines. In ASPLOS-III Proceedings, pages 272-282, April 1989.

[10] A.J. Martin, S.M. Burns, T.K. Lee, D. Borkovic, and P.J. Hazewindus. The Design of an Asynchronous Microprocessor. In Proc. Decennial Caltech Conf. on VLSI, pages 20-22. The MIT Press, March 1989.

[11] R.F. Sproull, I.E. Sutherland, and C.E. Molnar. Counterflow Pipeline Processor Architecture. IEEE Design and Test of Computers, Fall 1994.

[12] S. Waser and M. Flynn. Introduction to Arithmetic for Digital System Designers. CBS College Pub., New York, 1982.
