
CS152

Final Project

Berkelium

Simon Lee cs152-simonl
Dan Teodorescu cs152-danteo
Jana van Greunen cs152-janavg
Gisela Yu cs152-gisela
Katherine Yu cs152-katyu

Section: 101 Mon 11-1

Abstract .

The goal of the final project was to optimize our processor for better performance. With this objective in mind, we chose to implement multiprocessing, non-blocking loads, and a multiply/divide unit. All three of these options posed challenges and required changes to our design from previous labs. In particular, as our design became larger and more complicated, testing became correspondingly more challenging.

Our first option, multiprocessing, required that we implement cache coherency; the idea behind cache coherency is to keep the caches of both processors in sync with the main DRAM. Second, in order to implement non-blocking loads we had to change many aspects of the processor itself. In the initial design of the processor we had made the simplifying assumption that the processor would always stall when memory was busy, so data and address lines would be kept constant; we changed both the cache and the processor to remove this assumption. Beyond the changes to the processor and cache, non-blocking loads also required a large amount of extra control logic to keep track of pending loads and to stall the pipeline when a pending load is about to be used. Finally, we chose the third option because we had already implemented a multiplier for Lab 5, and adding a divider did not require many changes to the pipeline itself.


This is the organization of our top-level diagram. Note that we have two processors and four caches connected to a single memory.

Division of Labor .

The initial division of labor was:

Simon: Non-blocking loads, L2 cache

Dan: Not-so-deep pipelining, monitor

Jana: Multiplier, stream buffer

Gisela: Not-so-deep pipelining

Katherine: Non-blocking loads, victim cache

However, we decided to implement multiprocessing instead of the smaller cache enhancements. The table below reflects the actual division of labor; the task of writing the report was shared by everyone.

Simon: Non-blocking loads, debugging

Dan: Monitor, processor performance

Jana: Multiprocessing, multiply/divide unit, debugging

Gisela: Testing

Katherine: Non-blocking loads, debugging

Detailed Strategy .

Multiply/Divide unit

Both the control and the datapath for the multiply/divide unit are done entirely in schematic. The unit comprises three main modules that compute an unsigned multiply, a signed multiply, and an unsigned divide in parallel. At the end of the computation, the correct data is selected by a mux and latched into external HI and LO registers. This is illustrated in the figure below.

For the unsigned multiply we used algorithm number 3 described in our textbook (Computer Architecture, Patterson & Hennessy). The signed multiply is implemented as a Booth multiplier with radix-2 encoding. Both the signed and unsigned divides are computed using the restoring-divide algorithm described in Patterson & Hennessy; the only difference between the signed and unsigned divide is the processing of the operands before and after the divide. Before a signed divide operation, the operands are negated if they are negative. On completion of the signed divide, the signs of the remainder and quotient are corrected by negating them as needed (the remainder should always have the same sign as the dividend, and the quotient is negative only if the signs of the operands differ).

In addition to the sign handling, there were some problems doing a signed divide with two very large numbers (both with bit 31 set). With such operands, the quotient can only be 0 or 1, so special bypass logic had to be implemented to select the original dividend as the remainder when the quotient is 0.
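To make the divide behavior concrete, here is a minimal C sketch of restoring division with the sign pre/post-processing described above. It illustrates the textbook algorithm rather than transcribing our schematic; the function names are ours, and the divisor is assumed nonzero.

#include <stdio.h>
#include <stdint.h>

/* Restoring divide: build quotient and remainder one bit per iteration,
 * subtracting the divisor only when it fits (otherwise "restore"). */
static void divide_unsigned(uint32_t dividend, uint32_t divisor,
                            uint32_t *quot, uint32_t *rem)
{
    uint64_t r = 0;                            /* running partial remainder */
    uint32_t q = 0;
    for (int i = 31; i >= 0; i--) {
        r = (r << 1) | ((dividend >> i) & 1);  /* shift in next dividend bit */
        if (r >= divisor) {                    /* subtract only if no underflow */
            r -= divisor;
            q |= 1u << i;
        }
    }
    *quot = q;
    *rem = (uint32_t)r;
}

/* Signed divide = unsigned divide plus sign fixup: negate negative operands
 * first; afterwards the remainder takes the dividend's sign, and the
 * quotient is negative only if the operand signs differ. */
static void divide_signed(int32_t dividend, int32_t divisor,
                          int32_t *quot, int32_t *rem)
{
    uint32_t a = dividend < 0 ? -(uint32_t)dividend : (uint32_t)dividend;
    uint32_t b = divisor  < 0 ? -(uint32_t)divisor  : (uint32_t)divisor;
    uint32_t q, r;
    divide_unsigned(a, b, &q, &r);
    *quot = ((dividend < 0) != (divisor < 0)) ? -(int32_t)q : (int32_t)q;
    *rem  = (dividend < 0) ? -(int32_t)r : (int32_t)r;
}

int main(void)
{
    int32_t q, r;
    /* Two very large operands (both with bit 31 set): quotient is 0 or 1.
     * In software the dividend naturally remains as the remainder when the
     * quotient is 0; this is the case our hardware bypass logic covers. */
    divide_signed(INT32_MIN + 1, INT32_MIN, &q, &r);
    printf("q = %d, r = %d\n", q, r);  /* q = 0, r = -2147483647 */
    return 0;
}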

Interface with the rest of the processor

The multiply/divide unit is essentially disconnected from the rest of the processor. The only time it really needs to interact with the rest of the processor is at the start of a multiply/divide, mfhi, or mflo instruction. We reuse the hazard detection unit from Lab 5 (the unit that detects a hazard on a lw) to detect whether the multiplier is still busy


when mfhi or mflo is issued. If the multiply has not yet completed, we simply stall the pipeline until it is finished.
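The stall rule can be sketched in C as follows. This is illustrative only (the real check is schematic logic), the opcode tags are hypothetical, and we assume a back-to-back mult/div would also wait, which the description above does not spell out.

/* Hypothetical opcode tags for illustration. */
enum { OP_MULT, OP_DIV, OP_MFHI, OP_MFLO, OP_OTHER };

/* The hazard unit freezes the pipeline only when an instruction needs the
 * multiply/divide unit while it is still iterating; everything else flows
 * past an in-flight multiply untouched. */
int must_stall(int multdiv_busy, int opcode)
{
    if (!multdiv_busy)
        return 0;
    return opcode == OP_MFHI || opcode == OP_MFLO ||  /* result not ready yet */
           opcode == OP_MULT || opcode == OP_DIV;     /* unit occupied (assumed) */
}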

Multiprocessing

Three main units implement the functionality needed for multiprocessing: cache coherence, arbitration among four caches, and a synchronization unit.

Arbitrator

The arbitrator had to be modified to work with four caches instead of just two, which essentially doubled the number of arbitrator states. The order of servicing requests remained essentially the same as in Lab 6.

The arbitrator must now communicate with four caches.

If the arbitrator is idle and all four caches simultaneously make a memory request, the first processor’s data cache will be serviced first, followed by its instruction cache and then the second processor’s data and instruction caches. Should a request from a processor be made while the arbitrator is not idle, the arbitrator will “stall” that processor, and service its request as soon as the current request has been serviced. To ensure fairness, no processor can have two consecutive requests from the same cache serviced while there is a request from another processor pending.
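The servicing policy can be summarized with a small C sketch. It is illustrative only; the real arbitrator is a VHDL state machine, and the helper names here are ours.

#include <stdbool.h>

enum Cache { D1, I1, D2, I2, NONE };  /* fixed priority order */

/* Processor (0 or 1) that owns a given cache. */
static int owner(enum Cache c) { return (c == D2 || c == I2); }

/* Pick the next cache to service. req[] holds pending-request flags, and
 * last is the cache granted previously. Priority is D1 > I1 > D2 > I2,
 * except that the same cache is not served twice in a row while the other
 * processor has a request pending (the fairness rule above). */
enum Cache pick_next(const bool req[4], enum Cache last)
{
    bool other_pending = false;
    for (int c = D1; c <= I2; c++)
        if (req[c] && last != NONE && owner((enum Cache)c) != owner(last))
            other_pending = true;

    for (int c = D1; c <= I2; c++) {
        if (!req[c])
            continue;
        if ((enum Cache)c == last && other_pending)
            continue;                 /* fairness skip */
        return (enum Cache)c;
    }
    return NONE;                      /* no request pending */
}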

Below is the VHDL port definition:

entity arbitrator_new is
generic (arbitrator_delay : TIME := 5 ns; delay : TIME := 2 ns);
port (
    signal clk : in vlbit;
    signal datacache_addr : in vlbit_1d(31 downto 0);
    signal datacache_data : in vlbit_1d(31 downto 0);
    signal datacache_req : in vlbit;
    signal datacache_r_w : in vlbit; -- set to 1 if the processor is doing a write
    signal instrcache_addr : in vlbit_1d(31 downto 0);
    signal instrcache_req : in vlbit;
    signal datacache_adr2 : in vlbit_1d(31 downto 0);
    signal datacache_data_second : in vlbit_1d(31 downto 0);
    signal datacache_req2 : in vlbit;
    signal datacache_r_w2 : in vlbit; -- set to 1 if the processor is doing a write
    signal instrcache_adr2 : in vlbit_1d(31 downto 0);
    signal instrcache_req2 : in vlbit;
    signal wait_H : in vlbit; -- from DRAM
    signal datavalid_H : in vlbit; -- from DRAM
    signal processor_wait : out vlbit;
    signal processor_wait2 : out vlbit;
    signal dram_r_w : out vlbit;
    signal dram_req : out vlbit;
    signal dram_adr : out vlbit_1d(9 downto 0);
    signal dram_data_out : out vlbit_1d(31 downto 0);
    signal D_datavalid_H : out vlbit;
    signal I_datavalid_H : out vlbit;
    signal D_datavalid_H2 : out vlbit;
    signal I_datavalid_H2 : out vlbit;
    signal state_out : out vlbit_1d(31 downto 0)
);
end arbitrator_new;

(See the appendix for the entire VHDL file)

Cache coherence

Cache coherence is implemented in two modules. The first module is placed in the main schematic and is essentially a write detector: whenever a processor requests a write to memory, this module sends invalidate_H and invalidate-address signals to all four caches. The second module, placed in the cache table itself, handles the invalidation from within the caches. When the invalidate signal is high, the module scans through all the tags, and if a tag (together with its corresponding index) matches the invalidate address, "0" is written to that tag.

A diagram of the invalidator's behavior; the red arrow signifies an invalidate request.
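In C-like terms, the invalidator behaves roughly as follows. This is a sketch only: the address-to-set/tag split shown is hypothetical, and the real module compares all tags in parallel in one cycle rather than looping.

#include <stdint.h>

#define NUM_SETS 8   /* S0..S7, matching the tag signals below */
#define NUM_WAYS 2   /* BLK1 and BLK2 */

static uint32_t tag_table[NUM_SETS][NUM_WAYS];

/* Hypothetical address split, for illustration only. */
static uint32_t set_of(uint32_t addr) { return addr & (NUM_SETS - 1); }
static uint32_t tag_of(uint32_t addr) { return addr / NUM_SETS; }

/* On an invalidate request, any tag whose (tag, index) pair matches the
 * invalidate address is overwritten with "0", killing the cached copy. */
void invalidate(uint32_t inval_addr)
{
    uint32_t s = set_of(inval_addr);
    for (int w = 0; w < NUM_WAYS; w++)
        if (tag_table[s][w] == tag_of(inval_addr))
            tag_table[s][w] = 0;
}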

Below is the VHDL port definition for the cache invalidator unit:

entity cache_invalidator is
generic (delay : TIME := 2 ns);
port (
    signal TAG_COMBO_S0_BLK1_OUT : in vlbit_1d(31 downto 0);
    signal TAG_COMBO_S1_BLK1_OUT : in vlbit_1d(31 downto 0);
    signal TAG_COMBO_S2_BLK1_OUT : in vlbit_1d(31 downto 0);
    signal TAG_COMBO_S3_BLK1_OUT : in vlbit_1d(31 downto 0);
    signal TAG_COMBO_S4_BLK1_OUT : in vlbit_1d(31 downto 0);
    signal TAG_COMBO_S5_BLK1_OUT : in vlbit_1d(31 downto 0);
    signal TAG_COMBO_S6_BLK1_OUT : in vlbit_1d(31 downto 0);
    signal TAG_COMBO_S7_BLK1_OUT : in vlbit_1d(31 downto 0);
    signal TAG_COMBO_S0_BLK2_OUT : in vlbit_1d(31 downto 0);
    signal TAG_COMBO_S1_BLK2_OUT : in vlbit_1d(31 downto 0);
    signal TAG_COMBO_S2_BLK2_OUT : in vlbit_1d(31 downto 0);
    signal TAG_COMBO_S3_BLK2_OUT : in vlbit_1d(31 downto 0);
    signal TAG_COMBO_S4_BLK2_OUT : in vlbit_1d(31 downto 0);
    signal TAG_COMBO_S5_BLK2_OUT : in vlbit_1d(31 downto 0);
    signal TAG_COMBO_S6_BLK2_OUT : in vlbit_1d(31 downto 0);
    signal TAG_COMBO_S7_BLK2_OUT : in vlbit_1d(31 downto 0);
    signal invalid : in vlbit;
    signal invalid_second : in vlbit;
    signal addr_second : in vlbit_1d(31 downto 0);
    signal addr : in vlbit_1d(31 downto 0);
    signal WRITE_S0_BLK1 : out vlbit;
    signal WRITE_S1_BLK1 : out vlbit;
    signal WRITE_S2_BLK1 : out vlbit;
    signal WRITE_S3_BLK1 : out vlbit;
    signal WRITE_S4_BLK1 : out vlbit;
    signal WRITE_S5_BLK1 : out vlbit;
    signal WRITE_S6_BLK1 : out vlbit;
    signal WRITE_S7_BLK1 : out vlbit;
    signal WRITE_S0_BLK2 : out vlbit;
    signal WRITE_S1_BLK2 : out vlbit;
    signal WRITE_S2_BLK2 : out vlbit;
    signal WRITE_S3_BLK2 : out vlbit;
    signal WRITE_S4_BLK2 : out vlbit;
    signal WRITE_S5_BLK2 : out vlbit;
    signal WRITE_S6_BLK2 : out vlbit;
    signal WRITE_S7_BLK2 : out vlbit;
    signal data : out vlbit_1d(31 downto 0);
    signal data_second : out vlbit_1d(31 downto 0);
    signal clk : in vlbit
);
end cache_invalidator;

(See the appendix for the entire VHDL file)

Synchronization Unit

The synchronization unit comprises two parts: 16 individual registers that serve as the storage for the module, and the control logic that wraps around them. The control logic also serves as the interface to the rest of the processor and determines what data is returned to it. When two test-and-sets arrive simultaneously, processor one is given precedence.
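The register semantics we assume here are the usual test-and-set ones; a minimal C model follows (illustrative only, not the VHDL; that a plain write releases the register is our assumption).

#include <stdint.h>

#define NUM_LOCKS 16

static uint32_t lock_reg[NUM_LOCKS];   /* the 16 storage registers */

/* A test-and-set returns the old value and leaves the register set;
 * a return of 0 means the caller acquired the lock. On simultaneous
 * requests the control logic serializes them, processor one first. */
uint32_t test_and_set(int idx)
{
    uint32_t old = lock_reg[idx];
    lock_reg[idx] = 1;                 /* set as a side effect of the read */
    return old;
}

/* A plain write releases the register (assumed). */
void release(int idx)
{
    lock_reg[idx] = 0;
}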

Below is the VHDL port definition:

entity sync is
generic (quick : TIME := 1 ns; delay : TIME := 2 ns);
port (
    signal req : in vlbit;
    signal req_second : in vlbit;
    signal r_w : in vlbit;
    signal r_w_second : in vlbit;
    signal clk : in vlbit;
    signal from_proc : in vlbit_1d(31 downto 0);
    signal from_proc_second : in vlbit_1d(31 downto 0);
    signal addr : in vlbit_1d(31 downto 0);
    signal addr_second : in vlbit_1d(31 downto 0);
    signal data : out vlbit_1d(31 downto 0);
    signal data_second : out vlbit_1d(31 downto 0);
    signal data_zero : out vlbit_1d(31 downto 0);
    signal data_one : out vlbit_1d(31 downto 0);
    signal data_two : out vlbit_1d(31 downto 0);
    signal data_three : out vlbit_1d(31 downto 0);
    signal data_four : out vlbit_1d(31 downto 0);
    signal data_five : out vlbit_1d(31 downto 0);
    signal data_six : out vlbit_1d(31 downto 0);
    signal data_seven : out vlbit_1d(31 downto 0);
    signal data_eight : out vlbit_1d(31 downto 0);
    signal data_nine : out vlbit_1d(31 downto 0);
    signal data_ten : out vlbit_1d(31 downto 0);
    signal data_eleven : out vlbit_1d(31 downto 0);
    signal data_twelve : out vlbit_1d(31 downto 0);
    signal data_thirteen : out vlbit_1d(31 downto 0);
    signal data_fourteen : out vlbit_1d(31 downto 0);
    signal data_fifteen : out vlbit_1d(31 downto 0);
    signal data_sixteen : out vlbit_1d(31 downto 0);
    signal data_zero_in : in vlbit_1d(31 downto 0);
    signal data_one_in : in vlbit_1d(31 downto 0);
    signal data_two_in : in vlbit_1d(31 downto 0);
    signal data_three_in : in vlbit_1d(31 downto 0);
    signal data_four_in : in vlbit_1d(31 downto 0);
    signal data_five_in : in vlbit_1d(31 downto 0);
    signal data_six_in : in vlbit_1d(31 downto 0);
    signal data_seven_in : in vlbit_1d(31 downto 0);
    signal data_eight_in : in vlbit_1d(31 downto 0);
    signal data_nine_in : in vlbit_1d(31 downto 0);
    signal data_ten_in : in vlbit_1d(31 downto 0);
    signal data_eleven_in : in vlbit_1d(31 downto 0);
    signal data_twelve_in : in vlbit_1d(31 downto 0);
    signal data_thirteen_in : in vlbit_1d(31 downto 0);
    signal data_fourteen_in : in vlbit_1d(31 downto 0);
    signal data_fifteen_in : in vlbit_1d(31 downto 0);
    signal data_sixteen_in : in vlbit_1d(31 downto 0);
    signal write_one : out vlbit;
    signal write_two : out vlbit;
    signal write_three : out vlbit;
    signal write_four : out vlbit;
    signal write_five : out vlbit;
    signal write_six : out vlbit;
    signal write_seven : out vlbit;
    signal write_eight : out vlbit;
    signal write_nine : out vlbit;
    signal write_ten : out vlbit;
    signal write_eleven : out vlbit;
    signal write_twelve : out vlbit;
    signal write_thirteen : out vlbit;
    signal write_fourteen : out vlbit;
    signal write_fifteen : out vlbit;
    signal write_sixteen : out vlbit
);
end sync;

(See the appendix for the VHDL file)

Non-blocking loads

Non-blocking loads allow the pipeline to continue execution during a data-cache miss until the requested data is actually needed. Since instruction fetch must be handled as soon as possible, the instruction caches are not associated with the MSHR (Miss Status Holding Register) tables. Each data cache has a two-entry table that stores information about outstanding loads. An MSHR entry contains (1) a valid bit indicating the entry is in use, (2) the address that missed in the cache, and (3) the register that the result must go back to.
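An MSHR entry can be pictured as the following C struct; the field names are illustrative, chosen to match the three fields just listed.

#include <stdint.h>
#include <stdbool.h>

#define MSHR_ENTRIES 2        /* two outstanding loads per data cache */

typedef struct {
    bool     valid;           /* (1) entry is in use */
    uint32_t miss_addr;       /* (2) address that missed in the cache */
    uint8_t  dest_reg;        /* (3) register the result must go back to */
} Mshr;

static Mshr mshr_table[MSHR_ENTRIES];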


Implementing non-blocking loads requires changes to the instruction decode stage and the memory stage.

Decode Stage

In the decode stage, we added logic to check whether the instruction being decoded falls into one of the following three cases:

1) The instruction uses a register in a valid MSHR entry. This check prevents RAW and WAW hazards.

2) The instruction uses an address in a valid MSHR entry.

3) The instruction is a sw instruction and some MSHR entry is valid (i.e. in use).

If one of the above three cases holds, the ID stage stalls the processor until the MSHR entry is freed. Data returning from memory checks the MSHR and updates the appropriate F/E bit and register, then invalidates the MSHR entry.
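Using the Mshr sketch above, the decode-stage check amounts to the following (illustrative C; the instruction fields are hypothetical placeholders):

#include <stdbool.h>
#include <stdint.h>

typedef struct {
    uint8_t  src1, src2, dst;  /* register fields of the instruction */
    uint32_t mem_addr;         /* effective address, when applicable */
    bool     is_store;         /* sw instruction */
} DecodedInstr;

bool decode_must_stall(const DecodedInstr *in)
{
    for (int i = 0; i < MSHR_ENTRIES; i++) {
        if (!mshr_table[i].valid)
            continue;
        uint8_t pending = mshr_table[i].dest_reg;
        /* case 1: instruction touches a register a pending load will write */
        if (in->src1 == pending || in->src2 == pending || in->dst == pending)
            return true;
        /* case 2: instruction uses an address with a miss outstanding */
        if (in->mem_addr == mshr_table[i].miss_addr)
            return true;
        /* case 3: any store while some entry is valid */
        if (in->is_store)
            return true;
    }
    return false;
}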

Memory Stage

We use an MSHR controller to handle requests to memory in the memory stage. Upon a request, we stall the processor, check whether it is a store or a load, and send the request over to the data cache. If it is a store, or a load that hits, we are done at this point and unfreeze the processor. If it is a load miss, however, we check whether there is a free MSHR entry. If we cannot find one, we continue to stall the processor until an entry is freed (i.e. some previous load miss completes and frees its entry). If we find one, we enter the information for the load into the MSHR, flush all following instructions, unfreeze the processor, and restart fetching at the instruction after the load. Upon a done signal from the data cache controller, we invalidate the appropriate MSHR entry, and if there is a pending entry in the MSHR table we send another request to the data cache. The flow is best explained with a data flow diagram:
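Alongside the diagram, the same flow can be sketched in C, continuing the Mshr model above. The pipeline hooks here are placeholder stubs for this sketch, not real interfaces.

/* Placeholder processor hooks, stubbed out for the sketch only. */
static void stall_pipeline(void)               { /* freeze PC and latches */ }
static void unstall_pipeline(void)             { /* resume the pipeline */ }
static void flush_following_instructions(void) { /* squash younger instrs */ }
static bool dcache_access(uint32_t addr, bool is_store)
{
    (void)addr; (void)is_store;
    return false;                               /* stub: always miss */
}

static int find_free_mshr(void)
{
    for (int i = 0; i < MSHR_ENTRIES; i++)
        if (!mshr_table[i].valid)
            return i;
    return -1;
}

/* Returns true when the request has been handled and the pipeline may
 * continue; false means keep stalling and retry (no free MSHR entry). */
bool handle_mem_request(const DecodedInstr *in)
{
    stall_pipeline();
    bool hit = dcache_access(in->mem_addr, in->is_store);
    if (in->is_store || hit) {          /* store, or load hit: done */
        unstall_pipeline();
        return true;
    }
    int e = find_free_mshr();           /* load miss: need a free entry */
    if (e < 0)
        return false;                   /* stall until one is freed */
    mshr_table[e].valid = true;         /* record the outstanding load */
    mshr_table[e].miss_addr = in->mem_addr;
    mshr_table[e].dest_reg = in->dst;
    flush_following_instructions();     /* restart after the load */
    unstall_pipeline();
    return true;
}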


Testing Methodology .

Following the modular structure of our project, we tried to make the testing modular as well. To test the components individually in a meaningful way, we defined specific interfaces between them. For each component, we verified both that it performed its intended function and that it conformed to the specified interface. In the end we had to do less testing of the entire complicated system, because we knew that certain parts worked.

Multiplier testing methodology

The multiplier testing occurred in two stages. First, we tested the functionality of each individual multiplier or divider unit: we wrote a .cmd file to toggle specific signals and requests into the multipliers/dividers and then checked the results. In the second stage, the multiplier was connected to the processor, and we wrote a MIPS test file to exercise the multiply and divide functionality combined with other instructions such as mfhi and mflo. This ensured that the pipeline stalled at the correct times.

Multiprocessor testing methodology

In order to decrease the complexity of testing the entire multiprocessing unit, we broke it down into smaller modules that could be tested separately. Following the same division as the schematic, we tested the arbitrator, the synchronization unit, and the cache coherency unit separately. To test the arbitrator, we connected two processors and the memory but left out the other two multiprocessing components. We then simultaneously executed two programs with completely separate memory spaces within the DRAM. Correctness was ensured by verifying each program separately. We used our multiply test as well as the Lab 5 mystery program.

Test files: arbitrator.cmd, multiprocessor_test.s; schematic: multproc_fortest (the version without the processors)

Finally, we tested the synchronization and coherency units by disconnecting them from the processors and simulating loads and stores to the memory. This gave us more control over the memory system and let us examine tight timing in more depth. We ran the simulations by manipulating signals in Digital Fusion and then inspecting the waveform for timing results.

Digital Fusion test files: lock.cmd, coherence.cmd


Non-blocking loads testing methodology

We tested non-blocking loads on a single processor with various test cases to make sure every execution path was covered. To make sure valid non-blocking loads end up in the MSHR entries, we ran test code that fills the MSHR entries and checked their values manually. We tested that the processor stalls correctly in these three cases:

lw to the same address line as an entry in the MSHR table:
    lw $2, 0x01
    lw $2, 0x02

an instruction that uses the same destination register as a register in the MSHR:
    lw $2, 0x01
    add $2, $2, $2

store word following load word:
    lw $2, 0x01
    sw $5, 0x02

Performance Analysis .

We calculated the critical path for this processor from the simulation wave files. The critical path is the DRAM_DATA signal, which is updated after 30 ns; thus, the maximum clock speed for this processor is 33.3 MHz. Fig. 1 shows the processor's maximum instruction throughput along with the throughput of the single-cycle processor and the pipelined processor with cache. Note that the multiprocessor executes twice as many instructions per second as the non-blocking-load processor, since it is made up of two non-blocking-load processors.

Figure 1 - Processor Speed (in Millions of Instructions Per Second)


The MIPS rating, as a standard, is defined as the maximum possible number of instructions executed per second; it does not indicate the execution rate in every case. The rating can be obtained by assuming that the processors never stall. The loop in the following code fragment is such a case, since the loop is loaded into the instruction cache when the first add is fetched from the DRAM (assuming that the first add is four-word aligned in memory so that the entire code fits in one cache row). Since the loop does not access memory, it does not interfere with the second processor in any way, and each processor can execute an instruction on every cycle.

        addiu $t1, $zero, 0      # Possibly causes a stall
        addiu $t2, $zero, 1024
loop:   bne   $t1, $t2, loop     # Loop executes; one instruction
        addiu $t1, $t1, 1        # per cycle without stalls.

Figure 2 - Loop Running at Maximum Speed

In order to show the processor's real performance we used actual programs to demonstrate the significance of all the new features. Instead of using programs written solely to illustrate the performance improvement, we implemented three algorithms actually used in industry. These algorithms appear in many digital devices such as cell phones and portable MP3 players. We implemented a 16x8 matrix transpose, a single-biquad infinite impulse response filter, and a 16-input, 8-coefficient complex finite impulse response filter. The code for each of these benchmarks is in Appendix III.

The matrix transpose benchmark is designed to run on one processor only, since its purpose is to show the execution-time improvement from non-blocking loads. We optimized the algorithm by loading the new value at the beginning of the loop and storing it with the last instruction of the loop; the code in between is control code that does not depend on the loaded value. The previous processor executed the code in 5413 cycles, while the new processor executed it in 4014 cycles, a 26% cycle-count improvement. In both cases, the total program memory used was the same (112 bytes). This particular program might not be used as often as the other two; however, it highlights the performance improvement available to the kind of versatile array-manipulation code found in typical C/C++/Java programs.

The infinite impulse response filter reads a new value, updates the internal states, and calculates the new output on each iteration through the loop (see Fig. 3 for the filter block diagram). Like the transpose benchmark, this program was designed to run on one processor only, since its purpose is to show the combined efficiency of non-blocking loads and the multiplier. (The second processor could be used to execute a similar filter in parallel, though.) The multiplier and the memory unit are the parts of the processor that can most significantly increase program execution time. This benchmark, however, is carefully designed to use both of these resources efficiently, taking only a few more cycles than the most "cycle consuming" resource requires. In this case, the four sequential multiplications are that resource; they use 128 cycles. We can be sure that a load will never cause a stall, since each load


is "shadowed" by the ongoing multiplications. Note that the program loads the four filter coefficients at the beginning, before it enters the loop. Our analysis does not include those loads because they happen only once, whereas the loop can execute indefinitely. On the previous processor this program executed in 3082 cycles; on this processor it executed in 2182 cycles, a 29% cycle-count improvement. The program memory use was the same in both cases (120 bytes).

The complex block filter is a typical digital signal processing algorithm that is intricate enough to deploy both processors on its execution. To implement this program, we made full use of all the new resources on both processors. The program loads and unpacks 16 complex input values, stores the result of the previous computation, and convolves the new input with an 8-value complex impulse response. The input data is packed: each value is stored as one byte within a word. Processor 1 loads the new input data and unpacks it by re-writing each byte to a different word location. At the same time, processor 2 stores the previously calculated output to a different location so it won't be overwritten as the new output is generated. The processors stall until they have both finished their tasks, and then both start computing parts of the convolution. Processor 1 is responsible for the first 8 (complex) output values, while processor 2 is responsible for the remaining 8. See Fig. 4 for the data manipulation and flow. The processors once again stall until they both finish their tasks. The purpose of the wait between the processors is to prevent them from getting so far out of sync that they end up manipulating completely different sets of data. We synchronized the processors by having each processor grab its assigned flag at the beginning of the task, release it at the end, and then wait until it obtains the other processor's flag. The processors release the newly obtained flags, grab their original flags, and then continue program execution. In order to compare the performance of this processor with the single pipelined processor, we re-wrote the benchmark for the old processor: the old processor loads and unpacks the new input data, stores the old output data, and then performs the entire convolution by itself.
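The handshake just described can be expressed against the test_and_set()/release() model from the synchronization-unit sketch. This is per-processor pseudologic with illustrative flag indices; on the real hardware the two processors run it concurrently against the shared synchronization unit.

/* Each processor enters a phase already holding its own flag. */
void phase_barrier(int my_flag, int other_flag)
{
    /* ...task body ran here, with my_flag held... */
    release(my_flag);                      /* announce: my task is done */
    while (test_and_set(other_flag) != 0)  /* wait for the other side */
        ;                                  /* to release its flag */
    release(other_flag);                   /* hand the flag back */
    while (test_and_set(my_flag) != 0)     /* re-acquire my own flag */
        ;                                  /* before the next phase */
}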

Figure 3 - Infinite Impulse Response Filter Block Diagram



The benchmark finished in 14310 cycles on the new processor and in 25702 cycles on the old processor, a 44.3% cycle-count improvement. The total instruction memory use is 616 bytes for the new processor but only 284 bytes for the old processor, a substantial 117% instruction-memory increase. Note that the version of the program written for the old processor did not need any synchronization code, so it is reasonable that it is less than half the size of the code used for the new processor.

Conclusion .

The final project reflected a realistic design process. At the start we had to choose certain optimizations based on their potential speedup and the time and resources available to us. After choosing which modules to implement, the most important aspect of the project was to define interfaces among the new modules themselves and between them and the old processor. Even though we were careful in defining the interfaces, we still had integration and incompatibility problems when combining them. These problems, however, could have been overcome relatively easily given a little more time. As always, testing proved to be one of the most important aspects of our design.

Our final design included a multiplier, multiprocessing, and non-blocking loads. We could have improved the design in several ways. The cache's write-through allocate policy combined with multiprocessing made memory exceptionally slow when both processors were writing to the same data. Instead of simply invalidating the data in a cache when the other processor writes to that address in the DRAM, our cache coherency scheme could have allowed one processor to write directly into the other's cache. These improvements were not implemented due to time constraints. We made a valiant effort to implement all the functionality and tried to ensure that it works (at least for basic test cases).

This is the highest-performance processor we have designed so far, as we have shown using the MIPS ratings and the benchmark programs. However, we realized that, unlike with the previous processors, the features that make this processor powerful demand that the programmer carefully schedule the instructions within a program. A carelessly designed program can easily squander the processor's computational efficiency.

Figure 4 - Complex Convolution Data Flow

Appendix I (notebooks) .

Kat's notebook (Kat_Notebook.txt)
Gisela's notebook (Gisela-notebook.txt)
Dan's notebook (Dan-notebook.txt)
Simon's notebook (Simon-notebook.txt)
Jana's notebook (Jana-notebook.txt)

Appendix II (VHDL and schematic files) .

Multiplier
Multiplier Divider Unit: MultDiv(1).jpg, MultDiv(2).jpg, MultDiv(3).jpg, MultDiv(4).jpg
Multipliers: Mult(1).jpg, MultSigned(1).jpg
Divider: Div(1).jpg

Multiprocessing
Cache coherence: behv/coherence_check.vhd, behv/cache_invalidator.vhd
Arbitrator: behv/arbitrator_new.vhd
Synchronization: behv/sync.vhd, behv/choose_sync.vhd
Schematics: Multi(1).jpg, Multi(2).jpg, Multi(3).jpg, Multi(4).jpg

Non-blocking-load processor
Proc-NonBlock(1).jpg, Proc-NonBlock(2).jpg, Proc-NonBlock(3).jpg, Proc-NonBlock(4).jpg, Proc-NonBlock(5).jpg, Proc-NonBlock(6).jpg, Proc-NonBlock(7).jpg

Non-blocking loads

Cache: behv/data_cache_control.vhd
MSHR checker: behv\MSHR_Ctrl.vhd, MSHR_Checker.jpg
MSHR table and MSHR controller: behv\MSHR_mem_control.vhd, MSHR and MSHR_control.jpg

Monitor
Monitor: Monitor.vhd
Monitor Extension: MonitorEx(1).jpg, MonitorEx(2).jpg

Caches
Data: DCache(1).jpg
Instruction: ICache(1).jpg


Appendix III (testing and logs) .

Waveforms: Critical-Path.jpg

Test files: cache_performance.s, hazard.s, hazard2.s, lab4_mystery.s, lab5_test1.s, lab5_test2.s, lab5_mystery.s, multiply_test.s, multiprocessor_test.s, inittest/multiply_test.s, inittest/multiprocessor_load_test.s, inittest/combo_sync.s, inittest/coherence_sync.s, inittest/coherency_test.s

Log files: cache_performance_singP_nonblk.log, hazard_singleP_nonBlk.log, hazard2_singleP_nonBlk.log, lab4_mystery_singleP_nonBlk.log, lab5_test1_singleP_nonBlk.log, lab5_test2_singleP_nonBlk.log, lab5_mystery_singleP_nonBlk.log

Test scripts used: simonm.cmd, mult_div_unit_test.cmd, multproc_disjoint.CMD, behv/mult_div_unit_test.cmd, inittest/multiprocessor.cmd, inittest/multproc_load.cmd, inittest/multproc_load.CMD, inittest/multt.cmd, inittest/quickcachetest.cmd, inittest/tempmult.mem, inittest/datapath_mult.cmd, inittest/coherence.cmd

Benchmarks: transpose.s, iir.s, cxconvolution.s (Multiprocessor), cxconcolution2.s (Single Processor)