High Performance Soft Processor Architectures for Applications with Irregular Data- and Instruction-Level Parallelism
by
Kaveh Aasaraai
A thesis submitted in conformity with the requirements for the degree of Doctor of Philosophy
Graduate Department of Electrical and Computer Engineering
University of Toronto
© Copyright 2014 by Kaveh Aasaraai
Abstract
High Performance Soft Processor Architectures for Applications with Irregular Data-
and Instruction-Level Parallelism
Kaveh Aasaraai
Doctor of Philosophy
Graduate Department of Electrical and Computer Engineering
University of Toronto
2014
Embedded systems based on FPGAs frequently incorporate soft processors, owing to their flexibility and adaptability to the application. However, soft processors offer only moderate performance compared to hard cores and custom logic; hence, faster-performing soft processors are desirable.
Many soft processor architectures have been studied in the past, including Vector
processors and VLIWs. These architectures focus on regular applications in which it is
possible to extract data and/or instruction level parallelism offline. However, applications
with irregular parallelism only benefit marginally from such architectures. Targeting
such applications, we investigate superscalar, out-of-order, and Runahead execution on
FPGAs. Although these architectures have been investigated in the ASIC world, they
have not been studied thoroughly for FPGA implementations.
We start by investigating the challenges of implementing a typical in-order pipeline on
FPGAs and propose effective solutions to shorten the processor critical path. We then
show that superscalar processing is undesirable on FPGAs as it leads to low clock fre-
quency and high area cost due to wide datapaths. Accordingly, we focus on investigating
and proposing FPGA-friendly OoO and Runahead soft processors.
We propose FPGA-friendly alternatives for various mechanisms and components used
in OoO execution. We introduce CFC, a novel copy-free checkpointing mechanism that exploits
FPGA block RAMs for fast and dense storage. Using CFC, we propose an FPGA-friendly
register renamer and investigate the design and implementation of instruction schedulers
on FPGAs.
We then investigate Runahead execution and introduce NCOR, an FPGA-friendly
non-blocking cache tailored for FPGAs. NCOR removes CAM-based structures used in
conventional designs and achieves a high clock frequency of 278 MHz. Finally, we introduce SPREX, a complete Runahead soft core incorporating CFC and NCOR. Compared
to Nios II, SPREX provides as much as 38% higher performance for applications with
irregular data-level parallelism with minimal area overhead.
Acknowledgements
An important part of my studies was that they were more than just studying at school.
I interacted with many people and learned many life lessons from them, directly and
indirectly. I would like to acknowledge them all for their support, friendship, supervision,
and company. I hope those who have been omitted from these pages will forgive me, for
the omission is not intentional.
I never took my research supervisor and advisor, Prof. Andreas Moshovos, for granted.
He guided me through my research and supported me academically, financially, and
mentally. He managed to create the perfect balance between supervision and freedom of
work, which I truly appreciate. I would advise anyone looking for a Ph.D. supervisor to
make him their first choice.
Halfway through my studies I was accompanied by my now ex-wife, Monia Ghobadi.
Although our relationship ended before my studies did, I must admit she was always
supportive and helpful. I wish her well in her life and thank her for all her support.
My parents played a big role in forming my personality and helping me to get to this
point in my life. They are both academic people and throughout the years encouraged
me in my studies. My mother has always been my go-to person in times of despair and
hardship.
I’d like to thank my committee members, Professors Paul Chow, Greg Steffan, and
Jason Anderson, for their support in my studies. I had the pleasure of taking several
courses with them and finished some interesting projects under their advice. Throughout
my studies I encountered many technical difficulties, and with no hesitation I knew that
I could seek help from them. Prof. Anderson was kind enough to accept to be part of
my committee last minute, and I truly appreciate his support.
My friends have always been a big part of my life. My life during the past several
years has had many good and bad moments and I’m honored to have had such caring
friends to always be beside me. Soheil and Shabnam, the lovely couple who helped
me through my studies and relationship difficulties will always be my dear friends. My
best friend Paige has always been supportive in every respect, and generous with her
attention when I needed it. She encouraged me, pushed me, and picked me up whenever I
was going through difficult times. Diego and I essentially shared the lab space at school.
His company throughout the years has been very helpful and I’m glad to have made such
a good friend at school.
I’d like to thank all my colleagues at school who were always helpful with their
support and most importantly their constructive criticism. I’d like to thank Myrto for
her friendship and support. I’d also like to thank Maryam, Ian, Alhassan, Elias, Elham,
Jason, Patrick, Mitchel, Eric, Davor, Henry and many more who were always in the
lab! I particularly enjoyed having long and intellectually rich conversations about soft
processors with Henry.
My research was highly dependent on equipment primarily donated to our lab by
Altera Corp. I would like to thank them for their support and generosity, which greatly
facilitated my research in this field.
Many faculty members in our group helped me throughout my studies. I had many
interesting and thought-provoking conversations with Prof. Jonathan Rose. Prof. Vaughn
Betz also helped me through my studies and helped me in making connections to the
industry. Prof. Jason Anderson has always been my good friend and very supportive of
my studies, besides being a member of my committee.
My studies would not have been possible without financial support. I was fortunate to
be granted many awards, which helped me focus on my studies. I received support from
programs including OGSST, NSERC-CGS, DCA, and the Graduate Student Endowment
Fund awarded by the Dean of Graduate Studies. Additionally, Prof. Moshovos has
always been generous in supporting me financially to attend conferences and events. In
the latter part of my studies, when I had no financial support from the university, he
supported me completely.
I’d like to thank all the administrative staff at school. Kelly Chan has always been
cheerful and helpful. Jayne Leake coordinated my TAships and never complained about
all the hardship I caused her with late paperwork! Judith Levene and Darlene Gorzo
helped with all the school administrative work and were always available to answer my
never-ending questions!
Contents

1 Introduction
  1.1 Superscalar Execution
  1.2 Out-of-Order Execution
  1.3 Runahead Execution
  1.4 Superscalar vs. OoO and Runahead Execution
  1.5 Objectives
  1.6 Thesis Overview
    1.6.1 Soft-Processor Implementation Challenges
    1.6.2 Copy-Free Checkpointing
    1.6.3 Instruction Scheduling
    1.6.4 Non-Blocking Data Cache
    1.6.5 Soft Processor with Runahead Execution
  1.7 Thesis Contributions
2 Background and Motivation
  2.1 Superscalar Processing
  2.2 Out-of-Order Execution
  2.3 Runahead Execution
  2.4 Narrow vs. Wide Datapath
3 Experimental Methodology
  3.1 Comparison Metrics
    3.1.1 Area
    3.1.2 Frequency
    3.1.3 IPC
    3.1.4 IPS
  3.2 Software Setup
    3.2.1 Software Simulation
    3.2.2 Operating System
    3.2.3 Benchmarks
  3.3 Hardware Setup
    3.3.1 Verilog Implementation
    3.3.2 Component Isolation
    3.3.3 Inorder Processor Resembling Nios II
    3.3.4 The System
    3.3.5 System Bus
    3.3.6 Memory Controller
    3.3.7 Peripherals
4 Soft Processor Implementation Challenges
  4.1 Identifying Implementation Inefficiencies
  4.2 Processor Pipeline
    4.2.1 Fetch Stage
    4.2.2 Decode Stage
    4.2.3 Execute Stage
    4.2.4 Memory Stage
    4.2.5 Writeback Stage
  4.3 Methodology
  4.4 Critical Path Study
  4.5 Eliminating Critical Paths
    4.5.1 Multiplier and Shifter
    4.5.2 Branch Misprediction Detection
    4.5.3 Data Forwarding
    4.5.4 Fetch Address Selection
    4.5.5 Data Operand Specialization
  4.6 Performance
  4.7 Related Work
  4.8 Conclusion
5 CFC: Copy-Free Checkpointing
  5.1 The Need for Checkpointing
  5.2 Register Renaming
    5.2.1 Checkpointed RAT
  5.3 CFC
    5.3.1 The New RAT Structure
    5.3.2 RAT Operations
  5.4 FPGA Mapping
    5.4.1 Flattening
    5.4.2 Multiporting the RAT
    5.4.3 Dirty Flag Array
    5.4.4 Pipelining the CFC
  5.5 Evaluation
    5.5.1 Methodology
    5.5.2 LUT Usage
    5.5.3 Frequency
    5.5.4 Impact of Pipelining on IPC
    5.5.5 Performance
  5.6 Related Work
  5.7 Conclusion
6 Instruction Scheduler
  6.1 Instruction Scheduling
  6.2 CAM-Based Scheduler
    6.2.1 CAM on FPGAs
    6.2.2 CAM Performance
    6.2.3 Back-to-Back Scheduling
    6.2.4 Scheduling Policy
  6.3 Evaluation
    6.3.1 Methodology
    6.3.2 Area
    6.3.3 Frequency
    6.3.4 IPC
    6.3.5 Performance
  6.4 Related Work
  6.5 Conclusion
7 NCOR: Non-blocking Cache For Runahead Execution
  7.1 Introduction
  7.2 Conventional Non-Blocking Cache
  7.3 Making a Non-Blocking Cache FPGA-Friendly
    7.3.1 Eliminating MSHRs
    7.3.2 Making the Common Case Fast
  7.4 NCOR Architecture
    7.4.1 Cache Operation
    7.4.2 Lookup
    7.4.3 Request
    7.4.4 Bus
    7.4.5 Data and Tag Storage
    7.4.6 Request Queue
    7.4.7 Meta Data
  7.5 FPGA Implementation
    7.5.1 Storage Organization
    7.5.2 BRAM Port Limitations
    7.5.3 State Machine Complexity
    7.5.4 Latching the Address
  7.6 Evaluation
    7.6.1 Methodology
    7.6.2 Simplified MSHR-Based Non-Blocking Cache
    7.6.3 Resources
    7.6.4 Frequency
    7.6.5 MSHR-Based Cache Scalability
    7.6.6 Runahead Execution
    7.6.7 Cache Performance
    7.6.8 Secondary Misses
    7.6.9 Writeback Stall Effect
  7.7 Related Work
  7.8 Conclusion
8 SPREX: Soft Processor with Runahead EXecution
  8.1 Challenges of Runahead Execution in Soft Processors
  8.2 SPREX: An FPGA-Friendly Runahead Architecture
    8.2.1 Checkpointing
    8.2.2 Non-Blocking Cache
    8.2.3 Extra Decoding
    8.2.4 Store Instructions
    8.2.5 Register Validity Tracking
  8.3 Evaluation
    8.3.1 Methodology
    8.3.2 Stores During Runahead
    8.3.3 Register Validity Tracking
    8.3.4 Number of Outstanding Requests
    8.3.5 Memory Bandwidth
    8.3.6 Branch Prediction Accuracy
    8.3.7 Final Processor Performance
    8.3.8 Runahead Overhead
  8.4 Related Work
  8.5 Conclusion
9 Concluding Remarks
  9.1 Thesis Summary
  9.2 Future Work
    9.2.1 Out-of-Order Execution
    9.2.2 Multi-Processor Designs
    9.2.3 Power and Energy
Bibliography
List of Tables

3.1 SoinSim Parameters
4.1 Processor critical paths.
5.1 Architectural properties of the simulated processors.
5.2 LUT and BRAM usage and maximum frequency for 4 and 8 checkpoints on different platforms.
7.1 Architectural properties of simulated processors.
8.1 Architectural properties of the simulated and implemented processors.
8.2 Runahead processor hardware cost breakdown. Numbers in parentheses denote overhead for Runahead support.
List of Figures

2.1 A typical out-of-order pipeline using register renaming and a reorder buffer.
2.2 (a) In-order execution of instructions resulting in stalls on cache misses. (b) Overlapping memory requests in Runahead execution.
2.3 Area and maximum frequency of a minimalistic pipeline for 1-, 2-, and 4-way superscalar processors.
2.4 IPC performance of superscalar, out-of-order, and Runahead processors as a function of cache size.
4.1 The typical 5-stage pipeline implemented in this work. Dotted lines represent control signals.
4.2 Multiplication and shift/rotate operations before (a) and after (b) optimization.
4.3 Branch misprediction detection before (a) and after (b) optimization. Dashed boxes represent registers.
4.4 Forwarding data path before and after optimization in the pipeline. Dashed line is the added forwarding path.
4.5 Next address selection data path in the Fetch stage before (a) and after (b) optimization. Dashed boxes represent registers.
4.6 IPC and relative IPS improvement for the processor after removing critical paths.
5.1 Epochs illustrated in a sequence of instructions.
5.2 CFC main structure consists of c+1 tables and a dirty flag array.
5.3 Finding the most recent mapping: the most recent mapping for register R1 is in the second column (01), while for R2, it resides in the fourth (11).
5.4 Performance impact of an extra renaming stage.
5.5 Overall processor performance in terms of IPS using various checkpointing schemes.
6.1 An example sequence of instructions being scheduled. The current state of the processor is presumed as instruction A being in the memory stage, while instructions B and C are in the scheduler, waiting to be selected for execution.
6.2 CAM scheduler with back-to-back scheduling and compaction. OR gates provide back-to-back scheduling. The dashed gray lines show the shifting interconnect, which preserves the relative instruction order inside the scheduler for the age-based policy. The selection logic prioritizes instruction selection based on location, i.e., it is a priority encoder.
6.3 Number of ALUTs used by scheduler designs.
6.4 Maximum clock frequency of the scheduler designs.
6.5 Instructions per cycle achieved using four scheduler designs.
6.6 Overall performance in million instructions per second of four scheduler designs.
6.7 Overall performance of scheduler designs when the operating frequency is limited to 303 MHz.
7.1 Non-blocking cache structure.
7.2 The organization of the Data and Tag storage units.
7.3 Connections between Data and Tag storages and the Lookup and Bus components.
7.4 (a) Two-component cache controller. (b) Three-component cache controller.
7.5 Lookup and Request state machines. Double-lined states are initial states. Lookup waits for Request completion in the "wait" state. All black states generate requests targeted at the Bus controller.
7.6 Area comparison of NCOR and MSHR-based caches over various capacities.
7.7 BRAM usage of NCOR and MSHR-based caches over various capacities.
7.8 Clock frequency comparison of NCOR and of a four-entry MSHR-based cache over various cache capacities.
7.9 Area and clock frequency of a 32KB MSHR-based cache with various numbers of MSHRs. The left axis is ALUTs and the right axis is clock frequency.
7.10 Speedup gained by Runahead execution on 1- to 4-way superscalar processors. The lower parts of the bars show the IPC of the normal processors. The full bars show the IPC of the Runahead processor.
7.11 The impact of the number of outstanding requests on IPC. Speedup is measured over the first configuration with two outstanding requests.
7.12 Speedup gained by Runahead execution with two and 32 outstanding requests, with memory latencies of 26 and 100 cycles.
7.13 Performance comparison of Runahead with NCOR and an MSHR-based cache.
7.14 Average runtime in seconds for NCOR and an MSHR-based cache.
7.15 Cache hit ratio for both normal and Runahead execution.
7.16 Number of misses per 1000 instructions executed in both normal and Runahead execution.
7.17 Average number of secondary misses (misses only to different cache blocks) observed per invocation of Runahead execution in a 1-way processor.
7.18 IPC comparison of normal, Runahead, and Runahead with the worst-case scenario for write-back stalls.
8.1 Gray components form a typical 5-stage in-order pipeline. Black components are added to support Runahead execution.
8.2 Store handling during runahead mode: speedup comparison (see text for a description of the three choices).
8.3 Speedup with and without register validity tracking.
8.4 NCOR resource usage based on the number of outstanding requests.
8.5 Speedup comparison of architectures with various numbers of outstanding requests.
8.6 Memory bandwidth usage increase due to Runahead execution.
8.7 Comparison of branch prediction accuracy for normal and Runahead execution.
8.8 Speedup gained with Runahead execution over normal execution on an actual FPGA.
Chapter 1
Introduction
Embedded systems increasingly use FPGAs due to their superior cost and flexibility
compared to custom integrated circuits. There are several reasons why FPGA-based
systems often include processors: certain tasks are best implemented in processors, cost-
or performance-wise, and processor-based implementations can be faster and easier to
develop and debug than custom logic. If history is any indication of the future of
embedded systems, it is safe to expect that their applications will evolve, increasing in
complexity, footprint, and functionality (cell phone designs, for example, have followed
similar trends). Accordingly, it is important to develop higher-performing embedded
processors.
FPGA systems often incorporate two types of processors: soft and hard. Soft processors
are implemented using the FPGA fabric itself. Hard cores, on the other hand, are
fabricated separately and are either embedded in or external to the FPGA. Hard cores
can offer higher performance than soft processors. However, both options
have their shortcomings: Embedded hard cores are wasted when not needed and are
inflexible. External hard cores increase system cost and suffer from increased inter-chip
communication latency. Accordingly, there is a need to develop soft cores that provide
high performance.
Processor performance improvement techniques generally rely on increasing the concurrency of instruction processing. Such techniques include pipelining, superscalar [54],
Very Long Instruction Word (VLIW) [54], Single Instruction Multiple Data (SIMD), and
Vector execution [54, 66]. VLIW, SIMD, and Vector execution exploit instruction-level
parallelism that can be extracted by the programmer or the compiler. When this is
possible, each of these alternatives has specific advantages.
However, there are applications where parallelism is less structured and much more
difficult to extract. It is not possible to extract such irregular parallelism offline; rather,
a dynamic architecture is required. Such architectures dynamically identify and extract
instruction-level and data-level parallelism in the code at runtime. Examples of such
architectures are superscalar, out-of-order (OoO), and Runahead execution [25, 54]. The
next three sections review Superscalar, OoO, and Runahead architectures and comment
on their suitability for a soft-core implementation.
1.1 Superscalar Execution
Superscalar processors use multiple datapaths operating in parallel to increase instruction
throughput. They attempt to overlap the execution of two or more adjacent instructions.
An n-way superscalar processor can execute up to n consecutive instructions at the same
time. To do so, it effectively replicates the pipeline, including all control and data paths.
Superscalar processors are limited in the amount of parallelism they can extract from
the code because the instructions running in parallel must be spatially close to each other.
Furthermore, a wide datapath results in complex interconnect, which leads to inefficient
implementations on FPGAs, as we discuss in Section 1.4. In addition to the
datapath, the control plane also grows in complexity with the widened datapath and
leads to lower clock frequency.
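The adjacency constraint can be made concrete with a small sketch. The following Python model is an assumed, illustrative example (not the thesis's hardware): a hypothetical 2-way machine that pairs an instruction with its immediate successor only when the successor does not read the first instruction's result.

```python
# Illustrative model of 2-way superscalar pairing (assumed example):
# an instruction may dual-issue with its immediate neighbour only when
# the neighbour does not read the first instruction's destination.

instrs = [
    ("I1", "r1", ["r5"]),  # (name, destination, sources)
    ("I2", "r2", ["r1"]),  # reads r1, so it cannot pair with I1
    ("I3", "r3", ["r6"]),
    ("I4", "r4", ["r7"]),  # independent of I3, so I3 and I4 could pair
]

def dual_issue_pairs(instrs):
    """Greedily group adjacent instructions into issue pairs."""
    pairs, i = [], 0
    while i < len(instrs):
        # Pair only when the second instruction does not read the first's result.
        if i + 1 < len(instrs) and instrs[i][1] not in instrs[i + 1][2]:
            pairs.append((instrs[i][0], instrs[i + 1][0]))
            i += 2
        else:
            pairs.append((instrs[i][0],))
            i += 1
    return pairs

print(dual_issue_pairs(instrs))  # [('I1',), ('I2', 'I3'), ('I4',)]
```

Even though I3 and I4 are mutually independent, I1 issues alone because its only neighbour depends on it: parallelism that is not spatially adjacent is lost.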
1.2 Out-of-Order Execution
Out-of-Order (OoO) processors exploit instruction-level and data-level parallelism to
achieve high performance. OoO processors allow instructions to execute in any order
that does not violate program semantics [54, 45]. OoO can extract more parallelism than
superscalar execution because in OoO the instructions executing in parallel do not have
to be adjacent in the program order. Furthermore, OoO execution can extract more
parallelism using register renaming and speculative execution [47].
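As a toy illustration of this dynamic reordering (an assumed example, not the renaming hardware developed later in this thesis), the following Python sketch issues whichever pending instruction has all of its source operands available, oldest first:

```python
# Illustrative out-of-order issue (assumed example): instructions issue
# when their data dependences are satisfied, not in program order.
# Each instruction is (name, destination, sources).

program = [
    ("I1", "r1", ["r0"]),  # r0 is the result of a pending load: I1 must wait
    ("I2", "r2", ["r1"]),  # depends on I1
    ("I3", "r3", ["r6"]),  # independent of the load
]

produced = {"r6"}          # r6 is already available; r0 is still in flight
pending = list(program)
order = []
while pending:
    ready = [i for i in pending if all(s in produced for s in i[2])]
    if not ready:
        produced.add("r0")  # the outstanding load finally completes
        continue
    name, dest, _ = ready[0]  # oldest-ready-first (age-based selection)
    pending.remove(ready[0])
    produced.add(dest)
    order.append(name)

print(order)  # ['I3', 'I1', 'I2']: I3 issues ahead of program order
```

I3 issues while the load feeding I1 is still outstanding, even though I3 is younger in program order, which is exactly the parallelism a superscalar front end restricted to adjacent instructions cannot reach.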
OoO execution is orthogonal to superscalar processing. As such, when combined
with multiple datapaths, OoO execution can offer higher performance than superscalar
design alone. However, this thesis shows that even a 1-way OoO processor provides performance
comparable to that of a wide superscalar processor.
In mid- to high-end hard cores, OoO execution has been the architecture of choice
since the 1990s, but not so for soft cores [29, 55, 65, 39, 37, 6]. Implementing support
for OoO execution in FPGAs requires a prohibitively large amount of on-chip resources
relative to the potential gain in performance. Other techniques such as VLIW may
provide comparable performance at less expense in terms of on-chip resources. OoO
structures have been developed for Application Specific Integrated Circuits (ASIC) and
for this reason are not necessarily well-suited for the FPGA substrate. However, it may
be possible to port most of the benefits of OoO execution while using structures that are
a better fit to the FPGA substrate. Accordingly, this thesis investigates and develops
FPGA-friendly OoO components as a step toward a practical and efficient OoO soft core.
1.3 Runahead Execution
Runahead execution is a technique that allows the processor to exploit memory-level
parallelism to achieve higher performance. Runahead extends a conventional inorder
pipeline with the ability to continue execution when a memory operation misses in the
cache. With Runahead, the processor continues execution with the hope of finding more
useful misses that can be issued concurrently and thus finish earlier.
Runahead can be considered a lower-complexity alternative to OoO architectures.
In fact, Runahead has been shown to offer most of the benefits of OoO execution [25].
Runahead relies on the observation that often most of the performance benefits of OoO
execution result from allowing multiple outstanding main-memory requests.
Originally, Runahead’s effectiveness was demonstrated for high-end general-purpose
systems with main-memory latencies of a few hundred processor cycles [33]. This the-
sis demonstrates that Runahead remains effective even under the lower main memory
latencies of a few tens of cycles that are observed in FPGA-based systems today.
Runahead, as originally proposed, requires adding components to a basic in-order pipeline. In particular, Runahead relies on a non-blocking data cache, which does not map well onto FPGAs because conventional designs use highly-associative Content-Addressable Memories (CAMs) [25]. Implementing CAMs on FPGAs increases area and decreases clock frequency. This thesis proposes FPGA-friendly alternatives for Runahead components which deliver performance comparable to CAM-based
techniques without a significant increase in area and without a significant degradation in
clock frequency.
1.4 Superscalar vs. OoO and Runahead Execution
General-purpose processors combine OoO and Runahead with superscalar execution because resources are plentiful (more than a billion transistors per chip is common today).
When maintaining low resource usage is important, as it is on an FPGA substrate, OoO
or Runahead can be used on a narrow datapath. In Chapter 2 we demonstrate that single-datapath, or single-issue, OoO and Runahead execution have the potential to improve performance over wide superscalar processors, which require more area and
cause reductions in clock frequency when implemented in FPGAs. OoO, for example,
improves performance over simple pipelining by not stalling when an instruction requires
additional cycles to execute. Waiting for the main memory is a major source of delay
even for soft cores.
We also demonstrate that increasing the number of datapaths in the processor leads
to considerably larger area and lower clock frequencies. In fact, we show that by moving
from a 1-way superscalar to a 4-way superscalar processor, the area requirement increases
by a factor of 10, and clock frequency drops by 33%, while the gain in IPC is only 10%.
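The net effect of these figures is easy to check: IPS scales as the product of IPC and clock frequency. A back-of-the-envelope sketch using the numbers quoted above (illustrative arithmetic only, not new measurements):

```python
# Relative performance of a 4-way vs. a 1-way superscalar, using the
# figures quoted in the text (illustrative arithmetic, not new data).
def relative_ips(ipc_gain, freq_ratio):
    """IPS scales as IPC * frequency; return 4-way IPS relative to 1-way."""
    return (1.0 + ipc_gain) * freq_ratio

area_factor = 10.0  # the 4-way uses roughly 10x the LUTs of the 1-way
ips_4way = relative_ips(ipc_gain=0.10, freq_ratio=1.0 - 0.33)

# Despite 10x the area, the 4-way delivers less than 1x the IPS.
print(f"4-way relative IPS: {ips_4way:.2f}")  # ~0.74
print(f"IPS per unit area:  {ips_4way / area_factor:.3f}")
```

That is, the 10% IPC gain does not come close to compensating for the 33% frequency loss, let alone the tenfold area cost.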
We conclude that OoO and Runahead have the potential to improve performance
beyond the level of performance provided by simple pipelining, while avoiding the super-
linear costs of datapath replication. Several challenges remain for this potential to be
exploited effectively. First, performance depends not only on IPC but also on the oper-
ating frequency. Hence, the inclusion of OoO and Runahead must be done in a manner
that limits any reduction in the clock frequency. Second, OoO and Runahead introduce
additional structures into the implementation of a basic 1-way in-order pipeline. Such
additional resources in a single-datapath implementation must not increase the total area
beyond that of a multiple-datapath design; otherwise, the use of OoO and Runahead is no longer advantageous.
1.5 Objectives
Ideally, existing OoO and Runahead implementations would map easily onto FPGAs and
would achieve reasonable performance at reasonable resource cost. However, the FPGA substrate differs from that of ASICs and exhibits different trade-offs.
Accordingly, it is necessary to revisit conventional OoO and Runahead implementations
while taking the unique characteristics of FPGAs into consideration.
This thesis takes steps towards understanding whether with FPGA-friendly designs
it will be possible to build OoO cores that are performance- and resource-effective. The
goal is to revisit individual components involved in the OoO architecture and propose
FPGA-friendly alternatives.
One objective of this thesis is to propose a complete pipeline that supports Runahead
execution. The proposed design should not impose significant area overhead compared
to an in-order pipeline and it should offer reasonable speedup. Section 1.7 summarizes
the contributions of this thesis in more detail.
1.6 Thesis Overview
Chapter 2 provides background on superscalar, OoO, and Runahead execution, and motivates exploring narrow architectures as opposed to wide datapaths. Chapter 3 discusses
the experimental methodology followed in this thesis. Chapter 4 investigates soft pro-
cessor implementation challenges and provides solutions to remove most of the identified
difficulties. Chapter 5 proposes a novel checkpointing mechanism, a key component used
in both OoO and Runahead architectures. Chapter 6 studies instruction scheduler de-
signs for OoO architectures and proposes a configuration to be implemented on FPGAs,
offering the best performance for the least area cost. Chapter 7 proposes NCOR, a novel
non-blocking data cache optimized for Runahead execution on FPGAs. Chapter 8 in-
troduces SPREX, a complete soft processor with Runahead execution support. Finally,
Chapter 9 offers concluding remarks and outlines future research directions. The re-
mainder of this section offers an overview of each chapter and its corresponding technical
contribution.
1.6.1 Soft-Processor Implementation Challenges
Similar to any other embedded design, soft processors face their own unique challenges.
The first challenge in implementing soft processors is that the timing-critical components
of a typical pipeline must be identified. Chapter 4 investigates the challenges in imple-
menting a conventional 5-stage inorder pipeline on the FPGA substrate. It starts with a
straightforward soft-processor implementation. It then systematically identifies the crit-
ical paths of the implementation and classifies them into those of the control planes and
data planes.
There are two major challenges in identifying the critical path of a processor. First,
for various reasons, such as the inherent randomness in the synthesis and place-and-route algorithms, critical paths are inter-dependent and a single path may not always constitute the critical path of a design. Second, it is an open question how to properly identify the next critical path without removing the first critical path.
In this thesis, the choice is made to use the longest path reported by the timing-
analysis tool as the critical path for a particular implementation synthesized by a computer-
aided design tool. This approach enables the selection of only a single path in the presence
of many tightly-coupled paths. Next, in order to identify the next longest path, we artifi-
cially remove the current critical path by introducing registers in the middle of the path.
This technique allows us to remove the path without having to introduce extra logic into
the design. The insertion of registers causes the behavior of the implementation to differ
from the design specifications, and is therefore not strictly correct. Nonetheless, this
approach reflects the focus of this chapter on path identification.
Chapter 4 then moves on to propose solutions for eliminating the critical paths of the processor while preserving its correctness. It proposes various optimizations and shows that processor performance can be greatly improved by applying them.
1.6.2 Copy-Free Checkpointing
One of the key mechanisms used in almost all modern processor architectures is specu-
lative execution. Speculative execution allows the processor to continue execution when
the outcome of a particular operation, such as a branch instruction, takes multiple cy-
cles to be determined. The processor predicts the outcome of such an operation and
continues execution using the predicted outcome. Once the actual result is available, it
is compared with the prediction. A correct prediction allows the processor to continue
execution with no penalty. An incorrect prediction, on the other hand, introduces a
penalty that stems from having to discard the results of any speculative computations
and perform computations again based on the actual result.
In order to support speculative execution, many approaches have been proposed.
One popular approach is checkpointing, which dictates that a copy of the processor state
must be saved, i.e., checkpointed, for every prediction made. Later, if a prediction is
found to be incorrect, the processor state is restored from the checkpoint corresponding
to that prediction. The storage required to implement this technique is proportional to
the number of checkpoints, i.e., the scope of permissible speculation. To provide good
performance, checkpointing requires the saving and restoring of state to be performed
quickly.
For soft processors on FPGAs, checkpointing presents a specific implementation challenge. Rapid copying of the processor state (ideally in a single cycle) is complicated
by the typical use of FPGA block RAM (BRAM) to store the processor state instead of
flip-flops in logic blocks. BRAMs are high-speed, area-efficient memory arrays that signif-
icantly increase design efficiency. However, BRAM components have a limited number of
access ports. Consequently, copy operations to save or restore processor state would effec-
tively be serialized, resulting in poor performance for checkpointing. Chapter 5 proposes
CFC, a novel Copy-Free Checkpointing mechanism which provides the full functionality
of conventional checkpointing, while avoiding data copying. CFC is well-suited for FPGA
implementation as it addresses the port limitations of BRAMs.
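The port limitation can be illustrated with a simple cycle-count model of copy-based checkpointing (an illustrative sketch; the register and port counts below are assumed values, not taken from a specific design):

```python
def copy_checkpoint_cycles(num_registers, write_ports):
    """Cycles to copy the register file into checkpoint storage when at
    most write_ports values can be written per cycle (ceiling division)."""
    return -(-num_registers // write_ports)

# Flip-flop state with a fully parallel copy path: one cycle.
print(copy_checkpoint_cycles(num_registers=32, write_ports=32))  # 1
# BRAM-based state with two ports: the copy is serialized.
print(copy_checkpoint_cycles(num_registers=32, write_ports=2))   # 16
```

A 16-cycle stall per prediction would erase much of the benefit of speculation, which is what motivates avoiding the copy altogether.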
1.6.3 Instruction Scheduling
An OoO processor executes instructions in any order that does not violate data depen-
dencies. Instructions are placed in a pool, and those with available operands are chosen to
be executed. The instruction scheduler in an OoO pipeline is responsible for identifying
and issuing ready-to-execute instructions.
Chapter 6 investigates various instruction scheduler designs and proposes the best
configuration for FPGA implementation, considering both performance and area cost. It
considers a range of scheduling policies, number of entries, and the cost-effectiveness of
back-to-back scheduling on the FPGA substrate. It shows that in a practical implementation, the best performance is achieved with a four-entry scheduler that incorporates back-to-back scheduling and an age-based selection policy.
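The selected policy can be sketched behaviorally: from a small pool of scheduler entries, issue the oldest instruction whose operands are all ready (a simplified software model with an illustrative entry layout, not the hardware design):

```python
def select_oldest_ready(entries):
    """entries: one (age, ready) pair per scheduler slot; a smaller age
    value means an older instruction. Returns the index of the oldest
    ready instruction, or None if nothing can issue this cycle."""
    candidates = [(age, slot) for slot, (age, ready) in enumerate(entries)
                  if ready]
    if not candidates:
        return None
    return min(candidates)[1]

# Four-entry scheduler: slots 0, 2, 3 are ready; slot 2 is the oldest.
pool = [(5, True), (3, False), (1, True), (7, True)]
print(select_oldest_ready(pool))  # 2
```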
1.6.4 Non-Blocking Data Cache
Runahead execution exploits data-level parallelism to achieve high performance. It ex-
ploits data prefetching to pre-populate the data cache while a data cache miss is being
serviced. However, unlike data prefetchers, Runahead uses the program’s own instruction
stream to induce subsequent data cache misses that then generate additional memory
requests that act as prefetch operations [23, 16].
Runahead extends a simple inorder pipeline with the ability to continue instruction
execution after a miss in the data cache. Runahead continues instruction execution
speculatively and allows continued access to the data cache. This requires a non-blocking
data cache which is costly to implement on FPGAs because conventional non-blocking
designs use highly-associative CAMs for cache-line tracking [17].
Chapter 7 proposes NCOR, a novel non-blocking data cache optimized for Runahead execution on FPGAs. NCOR provides only the subset of a conventional non-blocking cache's features that is required for Runahead execution. Most importantly,
NCOR avoids using CAMs for tracking pending cache lines. Instead, it uses an in-cache
tracking system, in which metadata are stored along with the cache lines. NCOR’s simple
tracking system provides most of the benefits of a conventional tracking scheme, while
using negligible storage.
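The in-cache tracking idea can be caricatured as follows: the outstanding-miss state is stored in the indexed line's own metadata, so no associative search over pending misses is needed (an illustrative model, not NCOR's actual design; conflicting misses are handled naively here):

```python
class Line:
    """One cache line; the pending flag is metadata stored with the line."""
    def __init__(self):
        self.valid = False
        self.pending = False
        self.tag = None

def lookup(cache, index, tag):
    """Return 'hit', 'pending' (a miss is already outstanding), or 'miss'.
    No CAM search: the outstanding-miss state lives in the indexed line."""
    line = cache[index]
    if line.pending and line.tag == tag:
        return "pending"          # duplicate request suppressed
    if line.valid and line.tag == tag:
        return "hit"
    # New miss: mark it outstanding in the line itself. (A conflicting
    # miss simply replaces the pending state in this simplified model.)
    line.tag = tag
    line.pending = True
    line.valid = False
    return "miss"

cache = [Line() for _ in range(4)]
print(lookup(cache, 0, 0x12))  # miss: request issued, line marked pending
print(lookup(cache, 0, 0x12))  # pending: no duplicate memory request
```

When the fill returns, clearing the pending bit and setting the valid bit turns subsequent accesses to that line into hits.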
1.6.5 Soft Processor with Runahead Execution
Chapter 8 introduces SPREX, a complete soft processor implementation with Runahead
execution. SPREX extends simple pipelining and provides Runahead functionality with
minimal area and frequency penalty. SPREX exploits CFC and NCOR for checkpointing
processor state and providing continuous access to the data cache, respectively. SPREX
also tracks the dataflow graph to increase speedup. On average, SPREX offers 10%
speedup over a conventional inorder pipeline.
1.7 Thesis Contributions
The following are the contributions of this thesis:
• This thesis investigates processor implementation challenges on the FPGA sub-
strate and provides solutions to improve performance. Although many designs
require custom processor implementations, many designs also share common features, and so the challenges of interest are common as well. It is therefore important to be aware of these challenges and their workarounds when implementing a soft processor.
• This thesis proposes CFC, a novel checkpointing mechanism which avoids data
copying, to address the problem of serialized data copying due to BRAM port
limitations. Checkpointing is a widely used mechanism in modern architectures,
e.g., superscalar processing and thread-level speculation. However, conventional
checkpointing schemes use single-cycle parallel data copying to store/retrieve check-
points. Such data copying is feasible in ASIC implementations using techniques
such as bit interleaving. However, on FPGAs large storage is provided using
BRAMs which provide a limited number of ports to access the data. By avoid-
ing data copying, CFC is still able to use BRAMs for storage, yielding a highly
efficient design both in terms of area and frequency.
• This thesis investigates the best configuration, in terms of area, frequency and
instructions per cycle, for instruction scheduling in OoO processors on FPGAs.
Instruction scheduling is at the heart of OoO processors and reorders instructions
for execution to extract the most instruction- and data-level parallelism. How-
ever, a poor choice of scheduling policy can lead to low parallelism, hence lower
performance. Additionally, a scheduler with high clock frequency is desirable to
achieve high performance. Finally, the FPGA substrate offers different trade-offs
compared to ASICs. This thesis proposes the best scheduler configuration when all
such parameters are taken into account.
• This thesis proposes NCOR, a novel non-blocking data cache specialized for Runa-
head execution on FPGAs. Runahead execution relies heavily on continuous access
to the data cache even after an access misses in the cache. Conventional non-
blocking caches track pending cache misses using CAMs, which map poorly to
FPGAs. Accordingly, NCOR does away with CAMs and takes a simpler approach
to track pending misses. Instead of tracking all possible combinations, NCOR only
tracks a subset of miss combinations. Hence, NCOR is able to use in-cache tracking
which greatly reduces area cost and increases its clock frequency.
• This thesis introduces SPREX, a complete soft processor with Runahead execu-
tion. SPREX extends an in-order, 5-stage pipeline with Runahead execution which
offers, on average, 10% speedup. SPREX incorporates CFC and NCOR to pro-
vide Runahead functionality, while achieving high frequency and area efficiency.
Compared to off-the-shelf soft processors, SPREX offers higher performance with
comparable area usage.
Chapter 2
Background and Motivation
This chapter provides background on superscalar, Out-of-Order, and Runahead execu-
tion. It discusses the functionality of each architecture and their advantages over simple
pipelining. We also compare the cost and benefits of narrow, 1-way OoO and Runahead
execution to those of 2- and 4-way superscalar processing. We show that narrow OoO and Runahead architectures are more suitable options for FPGA implementation.
2.1 Superscalar Processing
An N-way superscalar processor can execute, in parallel, up to N instructions adjacent
in the original program order. To do so, most of the datapath components are replicated
N times. This includes the Arithmetic Logic Units (ALUs), the branch predictor, and the writeback logic. Furthermore, many components used in the pipeline must support multiple accesses
in the same cycle. These include the instruction cache, register file, and data cache.
Most importantly, to avoid unnecessary pipeline stalls, bypass paths are needed among
all N datapaths. Accordingly, superscalar resource costs increase super-linearly with the
number of ways.
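A first-order count of the operand-forwarding paths alone shows this super-linear growth (the constants below, two source operands and two forwarding stages, are assumptions about the pipeline, not figures from the thesis):

```python
def bypass_paths(n_ways, src_operands=2, fwd_stages=2):
    """First-order count of operand-forwarding paths: each of the n_ways
    consumers may receive each source operand from any of the n_ways
    producers, at each forwarding stage."""
    return src_operands * fwd_stages * n_ways * n_ways

for n in (1, 2, 4):
    print(n, bypass_paths(n))  # 4, 16, 64: quadratic in the issue width
```

The quadratic term shows up not only as extra wiring but as wider multiplexers on the operand inputs, which lengthen the critical path.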
2.2 Out-of-Order Execution
Out-of-order (OoO) processors exploit instruction- and data-level parallelism to achieve
high performance. OoO executes instructions in parallel by issuing multiple instructions
to multiple functional and memory units. Furthermore, instructions taking multiple
cycles to execute, e.g., missing loads, do not block the pipeline as subsequent independent
instructions are free to execute using other functional units [61].
OoO avoids stalling the pipeline by executing independent instructions past those
waiting to complete. Executing instructions out of order provides the opportunity to
overlap instruction execution and achieve higher performance. For example, if a load
instruction is waiting for its data from the main memory, subsequent instructions which
do not depend on the load data are free to execute.
Compared to in-order processors, OoO relies on additional mechanisms to reorder in-
structions while maintaining correctness. Using mechanisms such as Scoreboarding and
Tomasulo’s algorithm, OoO is able to reorder instructions and assign them to functional
units for execution [22, 59]. OoO also uses register renaming to remove false data dependencies in the program. False dependencies are a side effect of the limited number of architectural registers.
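Renaming can be illustrated with a minimal map-table model: each destination receives a fresh physical register, so later writes to the same architectural register no longer conflict with earlier reads (WAR) or writes (WAW), while true (RAW) dependencies are preserved through the mapping. A toy sketch, not the thesis's implementation:

```python
def rename(instrs, num_arch_regs=8):
    """instrs: (dst, src1, src2) architectural register numbers.
    Returns renamed tuples over an ever-growing physical register space
    (a toy model; real designs recycle physical registers)."""
    map_table = list(range(num_arch_regs))  # architectural -> physical
    next_phys = num_arch_regs
    renamed = []
    for dst, src1, src2 in instrs:
        ps1, ps2 = map_table[src1], map_table[src2]  # read current mapping
        map_table[dst] = next_phys  # fresh destination removes WAW/WAR
        renamed.append((next_phys, ps1, ps2))
        next_phys += 1
    return renamed

# r1 = r2 + r3 ; r1 = r4 + r5 : the same architectural destination
# becomes two independent physical destinations (p8 and p9).
print(rename([(1, 2, 3), (1, 4, 5)]))  # [(8, 2, 3), (9, 4, 5)]
```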
Figure 2.1 shows a typical OoO pipeline using register renaming. Fetch and Decode
stages are similar to those of an in-order pipeline. Next, instructions have their register
operands renamed in the Rename stage. Register renaming is the process of mapping
architectural registers to physical registers. Instructions then enter the instruction scheduler, where they wait to be assigned to functional and memory units.
After execution, e.g., loads reading data from the cache, instructions move to the Write
stage. Here instructions write their results, if they have any, into the register file. Finally,
instructions commit sequentially in the Commit stage.
Instructions are fetched, decoded, and renamed in the program order. After being
placed into the instruction scheduler, instructions are free to execute out of order. Any
Figure 2.1: A typical out-of-order pipeline using register renaming and reorder buffer.
ready instruction, i.e., one that has all its source operands ready, is free to execute.
Renaming also allows result writebacks to occur out-of-order as false dependencies have
been removed. However, instructions commit, i.e., apply their changes to the processor architectural state, in program order to preserve correctness. For example, store instructions apply their changes to the data cache only when they commit.
OoO uses a Re-Order Buffer (ROB) to preserve the original instruction ordering.
The ROB maintains the list of instructions in the order they were fetched. Later in
the Commit stage, instructions are retrieved from the ROB sequentially and committed.
Hence, instructions are committed in the original program order.
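The ROB's role can be sketched as a FIFO whose entries complete in any order but retire strictly from the head (a behavioral model with an illustrative structure, not the hardware design):

```python
from collections import deque

class ROB:
    """Entries complete out of order but retire only from the head."""
    def __init__(self):
        self.buf = deque()  # entries held in fetch (program) order

    def dispatch(self, tag):
        self.buf.append({"tag": tag, "done": False})

    def complete(self, tag):
        # Out-of-order completion: mark the entry wherever it sits.
        for entry in self.buf:
            if entry["tag"] == tag:
                entry["done"] = True

    def commit(self):
        """Retire completed entries from the head only."""
        retired = []
        while self.buf and self.buf[0]["done"]:
            retired.append(self.buf.popleft()["tag"])
        return retired

rob = ROB()
for tag in "ABC":
    rob.dispatch(tag)
rob.complete("B")    # B finishes first...
print(rob.commit())  # ...but nothing retires yet: []
rob.complete("A")
print(rob.commit())  # A, then B, retire in program order: ['A', 'B']
```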
2.3 Runahead Execution
Runahead is an extension of a simple in-order processor that maps well onto the FPGA
fabric. Runahead improves performance by avoiding stalls caused by cache misses, as
Figure 2.2(a) depicts. A conventional in-order processor stalls whenever a memory re-
quest misses in the cache. Even on reconfigurable platforms, a main memory request
may take several tens of soft processor cycles to complete, thereby limiting performance.
Figure 2.2: (a) In-order execution of instructions resulting in stalls on cache misses. (b) Overlapping memory requests in Runahead execution.
Main memory controllers, however, support multiple outstanding requests. Runahead exploits this capability and improves performance by requesting multiple data blocks from memory instead of stalling on each miss.
A pipeline with Runahead execution is similar to that of an in-order pipeline. Typ-
ically it consists of five stages of Fetch, Decode, Execute, Memory, and Writeback. In
an in-order pipeline, when a memory instruction is blocked in the Memory stage, all
subsequent instructions are blocked.
As Figure 2.2(b) shows, upon encountering a cache miss, termed the trigger miss, the processor continues to execute subsequent independent instructions instead of stalling the pipeline. This is done in the hope of encountering more cache misses to overlap with
the trigger miss. Effectively, Runahead uses the program itself to predict near-future
accesses that the program will perform, and overlaps their retrieval.
Although all results produced during Runahead mode are discarded, all valid memory
requests are serviced by the main memory and the data requested is eventually placed in
the processor cache. If the program subsequently accesses some of this data, performance
may improve because this data was prefetched (i.e., requested earlier from memory).
Provided that a sufficient number of instructions independent of the trigger miss are found during Runahead mode, the processor has a good chance of reaching other memory requests that miss. When enough useful requests are reached, performance improves, as the processor effectively prefetches their data into the cache.
When a cache miss is detected, the processor creates a checkpoint of its architectural
state (e.g., registers) and enters the Runahead execution mode. While the trigger miss is
being serviced, the processor continues executing subsequent independent instructions.
Upon the delivery of the trigger miss data, the processor uses the checkpoint and restores
all architectural state, so that the results produced in Runahead mode are not visible to
the program. The processor then resumes normal execution starting immediately after
the instruction that caused the trigger miss.
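The mode transitions just described can be summarized as a small state machine (a behavioral sketch; real hardware overlaps these steps with pipeline operation, and the event names are illustrative):

```python
def run(events):
    """events: 'miss' (trigger load misses in the cache), 'fill' (trigger
    miss data returns), or any other opcode string. Returns a trace of
    (mode, event) pairs showing which mode each event executed in."""
    mode, checkpoint, trace = "normal", None, []
    for ev in events:
        if mode == "normal" and ev == "miss":
            checkpoint = "saved-arch-state"  # checkpoint, enter Runahead
            mode = "runahead"
        elif mode == "runahead" and ev == "fill":
            assert checkpoint is not None    # restore, resume after trigger
            mode = "normal"
        trace.append((mode, ev))
    return trace

trace = run(["add", "miss", "load", "store", "fill", "add"])
print([mode for mode, _ in trace])
# ['normal', 'runahead', 'runahead', 'runahead', 'normal', 'normal']
```

Everything executed between the "miss" and "fill" events runs speculatively; its register results are discarded on restore, but its memory requests remain in flight.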
Performance trade-offs with Runahead are complex. On one hand, those memory
accesses that were initiated during Runahead mode and bring useful data into the cache
effectively prefetch data for subsequent instructions and reduce overall execution time.
On the other hand, memory accesses that bring useless data pollute the cache and con-
sume memory bandwidth and resources, e.g., they may evict useful data from the cache
or they may delay subsequent requests.
2.4 Narrow vs. Wide Datapath
In this section we compare narrow, 1-way OoO and Runahead execution with wide super-
scalar pipelines. We estimate the processor performance using a full-system simulator we
developed. We also implement, in Verilog, a trimmed-down 5-stage pipeline to estimate the effect of superscalar processing on FPGAs. For this experiment, each datapath uses a conventional five-stage pipeline with full bypass paths and pipeline latches. For simplicity, the ALU includes only a 32-bit adder. No other components are modeled.
Figure 2.3: Area and maximum frequency of a minimalistic pipeline for 1-, 2-, and 4-way superscalar processors.
See Chapter 3 for a more detailed explanation of the experimental methodology.
Figure 2.3 reports how the area and frequency of a superscalar pipeline scale as
the number of ways increases from one to four. The figure shows that the frequency of a wide pipeline is lower than that of a narrow one, while its area cost is significantly higher. The maximum frequency of the 4-way superscalar is 33%
less than that of the single-issue processor. The 4-way superscalar must extract sufficient
instruction level parallelism (ILP) to compensate for this frequency disadvantage.
Figure 2.4 compares superscalar processing with OoO and Runahead execution in terms of IPC. The figure shows that narrow OoO and Runahead processors come close to wide, 4-way superscalar in-order pipelines. The comparison is made for the performance
of 1-, 2-, and 4-way superscalar, and single-issue OoO and Runahead processors, for a
wide range of cache configurations. The cache size varies from 4KB up to 32KB (stacked
bars). The OoO pipeline outperforms both 1-way and 2-way superscalars for all cache
Figure 2.4: IPC performance of superscalar, out-of-order, and Runahead processors as a function of cache size.
sizes, while it performs worse than the 4-way superscalar only with the 32KB cache.
It should be noted that a more advanced compiler could improve the performance of the superscalar processor.
The data presented in this section demonstrate that narrow, 1-way OoO and Runahead execution have the potential to improve the performance of an in-order pipeline on
FPGAs. In addition, these architectures avoid the superlinear costs of datapath replica-
tion and can potentially achieve low area costs with high clock frequencies.
Following the data presented in this chapter, this thesis targets narrow OoO and
Runahead architectures for FPGA implementation, to avoid the superlinear costs of
superscalar processing on FPGAs.
Chapter 3
Experimental Methodology
In this chapter we explain our experimental methodology. We use a combination of
software simulation and actual hardware implementation to evaluate various designs we
propose. We use multiple performance metrics to measure the efficiency of a given design
and to compare different processor configurations.
3.1 Comparison Metrics
We measure an architecture’s performance using two different metrics: Instructions Per
Cycle (IPC) and Instructions Per Second (IPS). We use IPC for both simulations and
actual hardware implementations. After synthesis and placement-and-routing, we compare designs based on IPS. We also compare designs based on their area and frequency characteristics.
3.1.1 Area
We use a design’s area usage as a metric to measure its implementation efficiency on
FPGAs. We measure area usage based on LUT and BRAM usage reported by the
synthesis tool. We primarily use Altera Stratix III FPGAs, in which the basic building
blocks are Adaptive Logic Modules (ALMs). Each ALM contains two combinational
adaptive LUTs (ALUTs), two flip-flops and two full adders. ALMs can be configured to
implement logic functions, arithmetic functions, and register functions [12].
In Chapter 5, we also compare checkpointing schemes based on their silicon real
estate. We estimate the silicon area as the total equivalent area: the sum of the areas of all the ALUTs plus the areas of the BRAMs, following the scheme described in [41, 62].
3.1.2 Frequency
We compare designs based on their operating frequency, a property which can directly
affect their runtime performance. In order to reduce the effect of the inherent randomness
in the tool, we place-and-route every design multiple times using different random seeds.
We report the average of the maximum clock frequencies reported by the tool.
3.1.3 IPC
Before implementing a given design in hardware, we estimate its performance irrespective
of its implementation details. We use IPC, a frequency-independent performance metric, to compare designs before implementation. IPC is measured as the rate of instruction
execution per processor clock cycle. Using IPC we can compare the performance of two
architectures solely based on their architectural properties, avoiding implementation-
specific optimizations.
3.1.4 IPS
We use IPS to assess actual performance of a given design. Using IPS we can compare
two designs considering their architectural properties and their hardware implementation
limitations. In Chapter 8 where a complete processor is implemented in hardware, we
measure IPS by clocking the execution time of a specific number of instructions. However,
in the rest of the thesis, when we are designing individual components of the processor, we
estimate the IPS to provide insight into the processor performance. We implement that
particular component in hardware and assume the entire processor can operate at the
same clock frequency. We first use software simulation to measure the processor's IPC with the proposed micro-architecture, and then use Formula 3.1 to estimate the processor's IPS.
IPS = IPC × Frequency (3.1)
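For example, under this estimate a design with higher IPC can still deliver fewer instructions per second if it sacrifices too much clock frequency (illustrative numbers, not results from this thesis):

```python
def ips(ipc, freq_hz):
    """Formula 3.1: estimated instructions per second."""
    return ipc * freq_hz

# A higher-IPC design can still lose overall if it costs too much
# clock frequency (illustrative numbers, not thesis results).
fast_clock = ips(0.60, 200e6)   # 0.60 IPC at 200 MHz -> 120 MIPS
slow_clock = ips(0.75, 150e6)   # 0.75 IPC at 150 MHz -> 112.5 MIPS
print(fast_clock > slow_clock)  # True
```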
3.2 Software Setup
3.2.1 Software Simulation
In order to motivate and evaluate this work we use software simulations to pre-evaluate a
given design before going into the time-consuming process of hardware implementation.
We implement, in software, a model of the proposed hardware and use simulations to
estimate its performance in hardware.
We have developed SoinSim, an open-source, cycle-accurate, full-system simulator
for the Altera Nios II instruction set, written in the C language. SoinSim is capable of
modeling various superscalar, Runahead and OoO architectures with numerous detailed
parameters, some of which are listed in Table 3.1.
SoinSim is capable of booting and running the uCLinux operating system [15]. It
models a system consisting of a Nios-II-compatible processor, main memory, timer and
UART. All components are connected through a system bus.
Table 3.1: SoinSim Parameters

Common Properties
    Pipeline Stages            5, 6
    BPredictor Type            Bimodal, GShare
    Bimodal Entries            Configurable
    BTB Entries                Configurable
    Cache Size                 Configurable
    Cache Associativity        Configurable
    Memory Latency             Configurable
    Data Forwarding            Configurable

Superscalar Specific
    No. Ways                   Configurable

Out-of-Order Specific
    Pipeline Stages            7
    Scheduler Size             Configurable
    Scheduler Policy           Age / Random
    Scheduler Latency          Configurable
    ROB Size                   Configurable
    No. Physical Registers     Configurable
    Checkpoints                Configurable

Runahead Specific
    No. Outstanding Requests   Configurable
    Include Store Insts.       Configurable
    Track Registers            Configurable
System Bus
SoinSim connects all the components of the system using a bus that follows the Avalon
Bus specifications [9]. In the modeled system, the processor is the only Avalon master
component, capable of initiating bus requests.
Memory Model
In order to estimate main memory access latency in our simulations, we experiment on an Altera DE-3 board, accessing DDR2 memory clocked at 266MHz. We experiment
with various memory access patterns and find that a single memory request, on average,
takes 20 cycles to complete. We also find that the memory controller is capable of
pipelining memory requests, and back-to-back memory accesses are serviced faster. For
example, a continuous four-word request is serviced in 30 cycles, rather than 80 cycles if
requested separately.
Accordingly, we have developed a DDR2 memory model which models a fixed-latency,
pipelined memory controller. That is, every initial memory request takes a fixed number
of cycles to service. However subsequent requests that are received before the initial
request is serviced take fewer cycles to return.
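This model can be sketched directly from the measurements above: the first access pays the full latency, and each back-to-back access overlaps with it. The per-word overlap cost below is derived from the 30-versus-80-cycle example and is an approximation, not a parameter of the actual model:

```python
def burst_latency(num_words, first=20, per_extra=10 / 3):
    """Cycles to service num_words back-to-back requests: the first pays
    the full latency; later ones overlap with it in the pipeline."""
    if num_words == 0:
        return 0
    return first + per_extra * (num_words - 1)

print(burst_latency(1))         # a single request: 20 cycles
print(round(burst_latency(4)))  # four back-to-back: ~30, vs. 80 separately
```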
Peripherals
SoinSim models three memory-mapped peripherals which are accessible to the processor
through the system bus:
• UART: SoinSim models a UART following Altera’s JtagUART specifications [10].
• Timer: A programmable timer, modeled in software, that resembles Altera’s SOPC
Timer module [10].
• Performance Counter: This is a custom performance counter to facilitate measuring
various metrics.
3.2.2 Operating System
We boot and run the uCLinux operating system [15] on top of all simulated and hardware-
implemented processors. uCLinux is a simplified version of the Linux operating system
which is capable of running arbitrary applications cross-compiled for the Nios II ISA.
The uCLinux version we use does not support virtual memory, minimizing the overhead
of hardware and software memory management.
We use the ramdisk driver to create a memory-mapped disk available to applications.
We store benchmarking data files in this disk space.
3.2.3 Benchmarks
We estimate a given processor’s runtime performance by running a specific set of
benchmarks. We use benchmarks from the SPEC CPU 2006
benchmark suite that are typically used to evaluate the performance of desktop sys-
tems [56]. We use them as representative of applications that have unstructured data-
and instruction-level parallelism. We make an assumption, motivated by past experi-
ence, that in the future, embedded and FPGA-based systems will be called upon to run
demanding applications such as these.
We use a set of reference inputs for benchmarks as provided by the benchmark suite.
As we do not include floating point units in our processor architectures, as is the case
with Nios II, we use the integer subset of the benchmarks. We compile the benchmarks
using the gcc ported for Nios II by Altera Corp.
Due to the slow speed of simulations, ∼200KIPS on average, we use a sample of one
billion instructions per benchmark. We fast-forward the first billion instructions so as
to skip the initialization phase of each benchmark.
We compare designs based on the speedup they achieve in IPC over a baseline imple-
mentation. We run all the benchmarks on every design. To provide a single number as
the speedup for a design, we use the geometric mean of speedups over the execution of
all benchmarks.
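The single-number summary described above can be sketched as follows; the benchmark speedup values here are made up for illustration, not measured results.

```python
import math

def geomean_speedup(speedups):
    """Geometric mean of per-benchmark speedups over the baseline."""
    return math.prod(speedups) ** (1.0 / len(speedups))

# Hypothetical per-benchmark IPC speedups versus the baseline design.
speedups = [1.25, 0.95, 1.40, 1.10]
print(round(geomean_speedup(speedups), 3))
```

Unlike the arithmetic mean, the geometric mean treats a 2x speedup and a 0.5x slowdown as cancelling out, which is why it is the conventional choice for summarizing speedup ratios.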
3.3 Hardware Setup
3.3.1 Verilog Implementation
We implement proposed hardware designs in Verilog and deploy them on Altera Stratix-
III FPGAs. We use Quartus II for synthesis and place-and-route. Over the
course of this study, we used various Quartus II versions ranging from 8.1 to 12.1.
We use the Altera DE3 development board equipped with an Altera Stratix-III-150
FPGA [58]. The DE3 board has a SODIMM DDR2 slot, providing access to memory
capacities on the order of gigabytes, as needed to boot the operating system and run our
demanding benchmarks.
3.3.2 Component Isolation
In order to measure the area and frequency characteristics of a single component design,
we isolate it for placement and routing. This is done by synthesizing the design in a
top-level module containing only the design itself. In order to reduce the effect of pin
placement on the clock frequency (e.g., due to excessive global routing), all the inputs
and outputs are registered. These include the instruction and data buses, interrupt
lines, and the clock and reset signals. All inputs are fed with shift registers to minimize the
number of pins used. All wide outputs (e.g., the data bus writedata) are reduced to one-bit
signals with XOR reduction operations.
3.3.3 Inorder Processor Resembling Nios II
One of the main objectives of this thesis is to implement a complete soft processor in
hardware and compare it to the current state-of-the-art soft processors. We have chosen
to compare our work with the Nios II/f processor provided by Altera Corp [13]. As
the source code for Nios II is not disclosed, as is the case with most commercial soft
processors, we found it necessary to implement a baseline in-order pipelined processor
resembling, as accurately as possible, Nios II/f, the fastest variant of the Nios II
processor.
We have implemented Soin, a complete Nios II replica. We test Soin’s correctness
using micro-benchmarks. After initial testing, we boot uCLinux on the processor as a
thorough test case; the Linux boot process proved comprehensive, covering almost all
corner cases of the processor implementation.
3.3.4 The System
We use Qsys to create a complete system consisting of the processor, system bus, memory
controller and peripherals.
3.3.5 System Bus
All components in the Qsys system are connected through a memory-mapped Avalon
bus. The processor is the only master component on the bus.
3.3.6 Memory Controller
We use Altera’s UniPhy DDR2 memory interface to access the DDR2 slot on the DE3
board [11]. UniPhy is a commercially used, high performance DDR2 interface, capable
of pipelining memory requests.
3.3.7 Peripherals
The system implemented in hardware consists of the following peripherals which are
connected as Avalon slaves to the system bus.
• Jtag UART: We use Jtag UART to connect to the operating system’s console. Jtag
UART provides UART connectivity through the Jtag port available on the DE3
board.
• Timer: This is a programmable timer available in Altera’s IP library. The timer is used
by the operating system for task scheduling.
• Performance Counter: This is a custom-made performance counter to measure the
processor’s performance.
Chapter 4
Soft Processor Implementation
Challenges
General-purpose soft processors are a key component in reconfigurable computing since
they provide adequate performance, especially for workloads that have little parallelism,
and because they facilitate easy and quick development. Accordingly, many modern de-
signs incorporate multiple instances of general-purpose soft processors. The widespread
use of general-purpose soft processors has led to many designs both by the academic
community and industry. For example, Altera’s Nios II [13] and Xilinx’s Microblaze [64]
are two commonly used designs which provide adequate performance at a low cost. More
advanced soft processors, e.g., LEON3 [51], provide additional functionality and recon-
figurability at the expense of clock frequency and area.
Despite the popularity of soft processors and their widespread use, the implementation
inefficiencies of an entire pipeline as a whole have not been systematically explored.
Instead, several works have addressed specific implementation inefficiencies mostly on a
case-by-case basis. However, a processor pipeline is a complex system, which incorporates
a wide variety of components. Naïvely porting conventional designs that were originally
developed for custom logic implementation can easily lead to high complexity in the
processor’s data path and control logic. Accordingly, there is a need to systematically
characterize the sources of inefficiency in soft processor designs. Such a characterization
serves to deepen our understanding of FPGA implementation trade-offs and can serve as
the starting point for developing FPGA-friendly designs that achieve higher performance
and/or lower area.
This chapter systematically characterizes which circuit paths dominate the operat-
ing clock frequency when implementing a typical pipelined general purpose processor
on an FPGA. To do so, we first develop an implementation of a 5-stage pipelined pro-
cessor, a commonly used soft processor architecture. The baseline implementation is
representative of a “textbook” implementation of a 5-stage pipeline that is optimized for
custom logic implementation and that focuses on correctness, modularity, and speed of
development.
The two key questions this chapter then asks are:
1. Which circuit path dominates latency and thus determines the operating clock
frequency?
2. If this critical path were eliminated somehow, which path would dominate the
clock frequency next?
To answer these questions, this work follows an iterative approach by progressively
synthesizing the design and identifying its critical path. Once the current critical path
has been identified, it is “removed” and the design is synthesized again to identify the
next critical path. Section 4.1 elaborates on the challenges any systematic critical path
identification study faces and the best-effort approach this work follows. Once the various
critical paths are identified, this work proposes a set of optimizations that eliminate them,
improving overall processor frequency.
In summary, this chapter makes two contributions:
1. It identifies the sources of inefficiency in a typical implementation of a 5-stage
pipeline. This analysis focuses on operating frequency, identifying where and why
it suffers. The result of the analysis is an ordered list of critical paths.
2. It proposes several optimizations that eliminate the processor critical paths, im-
proving the operating frequency and performance. The optimizations demonstrate
the utility of the critical path analysis and improve the processor’s clock frequency
from 145MHz to 281MHz. Overall, actual instruction processing throughput in-
creases by 80%.
The goal of this chapter is not to develop the best possible soft processor, nor do we
claim that all the optimizations presented are novel. Rather, this is a step toward system-
atically understanding the sources of inefficiency in soft processor designs. Future work
may rely on the analysis presented here to improve soft processor designs and may follow
a similar methodology to characterize other soft processor designs and architectures.
The remainder of this chapter is organized as follows. Section 4.1 discusses the crit-
ical path identification methodology. Section 4.2 presents the baseline processor design.
Section 4.3 discusses details for the implementation and testing, and it also describes the
specific tools used during the critical path exploration. Section 4.4 presents the criti-
cal path analysis while Section 4.5 proposes several performance optimizations. Finally,
Section 4.6 measures how the processor’s overall performance improves after applying
various optimizations.
4.1 Identifying Implementation Inefficiencies
Given a pipelined processor implementation a designer can follow an iterative refinement
approach in order to improve the processor’s operating frequency and performance. At
each step of the process, the designer would identify the critical path that dominates
the clock frequency. Then they would proceed to develop, if possible, a circuit- or an
architectural-level technique to “remove” this path. If and when the current critical path
is eliminated, another path becomes the critical path and the process can be
repeated. Alternatively, the designer may completely rethink the processor architecture
and design. This work follows the first, iterative approach but the insights it offers are
useful should one decide to completely rethink the processor’s architecture.
A challenge with the iterative refinement approach is that at each step, specific
optimizations must be developed to eliminate the current critical path. Without actual
optimizations, the study would be of limited value, as it would only be able to identify a
single critical path. To overcome this limitation, this work uses a “best-effort” approach
where it artificially removes the critical path at each step. Section 4.4 explains the path
elimination heuristics used on a case-by-case basis. The approach followed in this work
represents a “what if the critical path was magically removed” scenario.
A limitation of the presented analysis is that actual optimizations may alter the rel-
ative importance of the various circuit paths or may give rise to other critical paths.
However, we believe that this analysis represents a meaningful and useful step in
identifying the sources of inefficiency in FPGA-based designs in the absence of actual
optimizations.
Moreover, this work goes beyond the critical path analysis and in Section 4.5 presents
specific optimizations that eliminate these paths, while preserving design correctness.
These optimizations demonstrate the utility of the presented path analysis. Section 4.3
discusses how the analysis methodology compensated for lower-level FPGA-specific chal-
lenges during the critical path identification analysis.
4.2 Processor Pipeline
This work implements a classic 5-stage processor pipeline [31] in Verilog. Fig. 4.1 shows
the block diagram of the processor including Fetch, Decode, Execute, Memory and Write-
back stages. The baseline implementation focuses on correctness, modularity and extensi-
Figure 4.1: The typical 5-stage pipeline implemented in this work. Dotted lines represent control signals.
bility rather than clock speed. This section describes the implementation of each pipeline
stage.
4.2.1 Fetch Stage
The Fetch stage is responsible for providing the instruction bits to the Decode stage. It
includes an instruction cache for speeding up instruction fetches as the main memory
latency is high. The instruction cache is capable of fetching one instruction per cycle if
the address hits in the cache.
The fetch stage also predicts the direction and target address of conditional branches
to avoid bubbles in the pipeline [31]. A bimodal branch direction predictor, a dy-
namic branch predictor comprising a table of two-bit saturating counters, predicts the
direction [34]. A Branch Target Buffer (BTB) predicts the target address for taken
branches [34]. The implementation uses a tagless BTB for simplicity and speed. Both
the bimodal predictor and the BTB have 256 entries which are indexed with a portion of
the PC. The BTB and bimodal entries are stored as pairs in one BRAM. It is possible to
use the same BRAM row to store a bimodal and a BTB entry, as they use the same in-
dexing scheme [63]. It has been shown that fusing BTB and bimodal predictor structures
into the same BRAM provides storage and frequency advantages on FPGAs [63].
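The two-bit saturating counter scheme described above can be sketched in software as follows. This is a behavioral model, not our Verilog; the table size matches the 256 entries used here, while the initial counter value and word-aligned indexing are illustrative assumptions.

```python
class BimodalPredictor:
    """Behavioral sketch of a bimodal predictor: one 2-bit saturating
    counter (0..3) per entry, indexed by low-order PC bits."""

    def __init__(self, entries=256):
        self.entries = entries
        self.table = [1] * entries  # start weakly not-taken (assumption)

    def _index(self, pc):
        return (pc >> 2) % self.entries  # drop byte offset, then index

    def predict(self, pc):
        return self.table[self._index(pc)] >= 2  # True = predict taken

    def update(self, pc, taken):
        i = self._index(pc)
        if taken:
            self.table[i] = min(3, self.table[i] + 1)
        else:
            self.table[i] = max(0, self.table[i] - 1)

bp = BimodalPredictor()
pc = 0x1000
for _ in range(3):            # train on a repeatedly taken branch
    bp.update(pc, taken=True)
assert bp.predict(pc)          # counter saturated at "strongly taken"
bp.update(pc, taken=False)     # a single not-taken outcome...
assert bp.predict(pc)          # ...does not flip the prediction (hysteresis)
```

The saturation is what gives the predictor hysteresis: a loop-closing branch that is taken many times and falls through once is still predicted taken on the next iteration.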
4.2.2 Decode Stage
The Decode stage is responsible for preparing all data and control signals for the Execute
stage. Depending on the instruction type and pipeline state, data operands may come
from the register file, forwarding lines, or they can be an immediate value from the
instruction bits.
The Decode stage is also responsible for detecting hazards in the pipeline. Hazards
can occur for multiple reasons, for example, if an instruction requires an operand whose
value is yet to be produced in the pipeline. When a hazard is detected, the pipeline must
either be stalled, which introduces penalty cycles as bubbles that perform no useful work,
or a technique must be applied that eliminates the need to stall while ensuring correct
execution semantics.
In order to avoid bubbles in the pipeline, data forwarding is used [31]. One method
of implementing data forwarding is to introduce paths that provide data generated in
later stages of the pipeline to a dependent instruction that is in the decode stage. A
multiplexer must be introduced for each input operand in the decode stage to select
between the normal operand value and the possible forwarding paths from other stages.
Additional logic must also be introduced to perform register-identifier comparisons that
determine the appropriate input to select for each multiplexer.
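The selection logic described above — compare register identifiers against each in-flight producer and prioritize the youngest match — can be sketched behaviorally. The function and list layout are illustrative, not our Verilog; the r0-is-always-zero check reflects the Nios II ISA.

```python
def select_operand(src_reg, regfile_value, producers):
    """Pick the operand value for source register `src_reg`.

    `producers` lists in-flight result-producing instructions
    youngest-first, e.g. [(exec_dest, exec_result),
    (mem_dest, mem_result), ...]. The youngest matching producer
    wins; otherwise fall back to the register file.
    """
    for dest_reg, result in producers:
        if dest_reg == src_reg and dest_reg != 0:  # r0 is hardwired to 0
            return result
    return regfile_value

# Hypothetical pipeline state: Execute is producing r5 = 42 while an
# older instruction in Memory is also writing r5 = 7.
producers = [(5, 42), (5, 7)]
assert select_operand(5, 99, producers) == 42  # youngest match wins
assert select_operand(6, 99, producers) == 99  # no match: register file
```

In hardware this loop becomes a priority multiplexer whose select lines are driven by the register-identifier comparators, which is exactly the structure that later shows up on the critical path.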
4.2.3 Execute Stage
The Execute stage includes an arithmetic and logic unit (ALU), which consists of a logical
operation unit, one comparator, one multiplier, one shifter, and two adders. One adder
is used for arithmetic operations and memory address calculations, while the other adder
is used for branch target calculation.
For branch instructions the Execute stage performs a series of operations. First, it
calculates the target address of the branch. In parallel, it determines the branch outcome,
i.e., whether the branch is taken or not. Finally, the calculated branch target is compared
to the predicted target address which was provided by the branch predictor during Fetch.
A misprediction signal is broadcast to all earlier pipeline stages if the two addresses do
not match, and the pipeline is flushed in the same clock cycle.
4.2.4 Memory Stage
The Memory stage includes a data cache to compensate for the long latency of accessing
the main memory. Load and store instructions look up their addresses in the data cache.
If the address hits in the cache, loads complete in a single cycle, while stores take two
cycles to complete. For stores, after determining a hit, i.e., in the second cycle, the actual
store operation happens. As a result, a load immediately following a store in the original
program order will have to wait for one additional cycle in the Execute stage.
The data cache is a 2KB blocking, write-back cache [35]. It consists of two storage
units implemented using BRAMs, one for tags and one for data. Loads and stores access
the data cache in the Memory stage. If the address misses in the data cache, the entire
pipeline is stalled, while the cache line is being retrieved from the main memory. For all
other instructions, the memory stage is a pass-through stage.
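The tag/data split and hit check described above can be sketched behaviorally. This is a software model under stated assumptions: the line size, line count, and direct-mapped organization are illustrative choices that happen to give a 2KB cache; the real cache's associativity is configurable, and the hardware also tracks valid and dirty bits that are omitted here.

```python
LINE_BYTES = 32        # illustrative line size
NUM_LINES  = 64        # 64 lines * 32 B = 2 KB (direct-mapped assumption)

def split_address(addr):
    """Split a byte address into (tag, index, offset) fields."""
    offset = addr % LINE_BYTES
    index  = (addr // LINE_BYTES) % NUM_LINES
    tag    = addr // (LINE_BYTES * NUM_LINES)
    return tag, index, offset

tags = [None] * NUM_LINES   # models the tag BRAM (data BRAM omitted)

def is_hit(addr):
    tag, index, _ = split_address(addr)
    return tags[index] == tag

tag, index, _ = split_address(0x1234)
tags[index] = tag            # fill the line
assert is_hit(0x1234)
# An address that maps to the same index but a different tag misses:
assert not is_hit(0x1234 + LINE_BYTES * NUM_LINES)
```

On a miss, the model's `None`/mismatching tag corresponds to the case where the real pipeline stalls while the line is fetched from main memory.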
4.2.5 Writeback Stage
The Writeback stage writes the result of instructions back to the register file.
4.3 Methodology
The entire processor under study is implemented in Verilog, and conforms to the Nios II
ISA. Following the same methodology explained in Chapter 3, we test the processor’s
functionality and measure its performance in terms of both IPC and IPS.
The Verilog design is synthesized using Quartus II 12.1 to a Stratix III chip. The
TimeQuest timing analyzer of the Quartus II software is used to measure the maximum
clock frequency at which the design can operate. The target clock speed is set to 333MHz
(3ns period) in the design constraint file. Our goal is to reach frequencies close to that
of Nios II/f, which is 270MHz on Stratix III devices [14].
There can be many different interfaces and devices the processor may connect to.
To identify the critical paths that are inherent to the processor design and to avoid
artifacts caused by external components, the processor design is isolated for placement
and routing. The isolation process is explained in more detail in Section 3.3.2.
In the critical path analysis, we synthesize the processor and locate the critical path,
that is, the circuit path with the longest cumulative delay. Most critical paths
are tightly coupled with other parallel paths; however, in the analysis we focus on
the top failing path reported by the synthesis tool. Once the critical path is identified,
we artificially eliminate it by introducing registers along the path, effectively
splitting it over two cycles. We then re-synthesize the core to find the next critical path
and continue the process as described. The goal of our analysis is to eliminate paths by
adding or removing as little logic as possible. Following this method to eliminate the paths
may result in a processor design that does not operate correctly. However, we believe
that this is a reasonable approach to determine the next design bottleneck in the absence
of an actual optimization for removing the current critical path. Another method is to
declare the top critical paths as false paths in the toolset, excluding them from timing
analysis. We choose the method of introducing registers as it operates at the architecture
level, as opposed to the false-path setting, which is at the circuit level. Section 4.5 demonstrates
the utility of our approach by presenting several optimizations.
4.4 Critical Path Study
Table 4.1 reports the critical paths found in each synthesis iteration. The table reports
the maximum operating frequency with the corresponding path included. The baseline
processor design can operate at 145MHz. The table reports the top 15 critical paths; if it
were possible to eliminate them all, the processor would operate at 281.68MHz. Removing
most paths results in a monotonic increase in operating frequency, except between paths
(D) and (E). Specifically, removing the critical path (D) in the fourth iteration improves
frequency more than removing the critical path (E) in the fifth iteration. Removing path
(D) results in an isolated, efficient routing configuration, mainly caused by the random
nature of the place-and-route process. We conclude that the list of the various paths
is more important than their relative order. The results also suggest that there
is no single path that, if eliminated, would result in a significant improvement in clock
frequency. Instead, the designer has to contend with multiple, tightly-spaced critical
paths.
The rest of this section discusses each path in more detail, also explaining how we
“eliminated” the path for the purpose of identifying the next important critical path. In
some cases the technique used to “eliminate” the critical path breaks correct functionality.
Section 4.5 presents proper ways of removing the critical paths that preserve correctness.
The goal of the analysis is to identify the various critical paths in order of importance,
in the absence of actual optimizations.
A: This path includes the multiplier and forwarding data path. It starts from the
data operand registers provided by the Decode stage, through the multiplier in the
Execute stage, routed back through the forwarding logic to the Decode stage and ends
at the data operand registers. This represents data computation and communication.
Table 4.1: Processor critical paths.
Path    Max. Freq. (MHz)    Main Component    Type
A 144.99 Multiplier Data
B 184.71 Branch Control
C 199.72 Branch Control
D 211.01 Shifter Data
E 200.84 Hazard Detection Control
F 201.90 Memory Stalls Control
G 206.95 Hazard Detection Control
H 211.46 Forwarding Control
I 214.68 Forwarding Data
J 230.95 ICache Hit Control
K 231.59 Forwarding Data
L 242.72 Multiplier Data
M 249.50 ICache Hit Control
N 249.75 DCache Hit Control
O 281.69 Memory Mux Data
In order to remove this path, we registered the output of the multiplier and allowed
bypassing only from the Memory stage.
B: This path includes the branch misprediction logic and pipeline redirection. It starts
from the data operands to the Execute stage, and continues in the ALU’s compara-
tor for branch outcome determination. It also includes the address comparator for
misprediction identification which signals the Fetch stage to redirect the program
counter. This path is for branch misprediction identification followed by fetch stage
redirection. We removed this path by registering the branch mispredict signal broad-
cast to the Fetch stage. This effectively delays branch misprediction detection by
one cycle.
C: The third critical path includes the branch misprediction logic and stall signal sent to
the Decode stage. When a branch is identified as mispredicted at the Execute stage,
the instruction currently at the Decode stage must be annulled. To remove this path
we registered the branch outcome signal. This signal determines whether a branch
is taken or not-taken. Similar to path (B), branch misprediction identification is
delayed by one cycle.
D: The fourth critical path includes the shifter in the ALU and the forwarding logic.
It starts from the data operand registers, follows through the shifter and forwarding
logic back to the data operand registers. This is another data computation and com-
munication path. We register the output of the shifter to eliminate this path. This
effectively eliminates one bypass path from the Execute stage back to the Decode
stage.
E: The hazard detection logic in the Decode stage dominates the fifth path. The hazard
signal is broadcast to the Fetch stage where it stalls the fetch process. We register
the fetch redirect signal to remove this path.
F: This path includes the stall signal from the Memory stage to the rest of the pipeline.
We remove this path by registering the memory stall signal.
G: This is another hazard detection dominated path. Hazards are identified in the
Decode stage by checking all forwarding lines. We register the forwarding selection
logic signals to remove this path.
H: The next critical path is in the data path including the forwarding data lines from
the Memory stage to the Decode stage and ending in the data operand registers. We
reduced the data operand multiplexer size by removing one of the inputs (immediate
value for shift operations).
I: The critical path is still through the forwarding logic from the Memory stage to
the Decode stage. We remove two more inputs from the data multiplexers in the
Memory stage (shift and multiplication results).
J: The instruction cache hit signal contributes to this path. This signal directs the
address multiplexer in the Fetch stage to select the next instruction address. We
remove this path by eliminating one input to the multiplexer.
K: The forwarding logic from the Memory stage to the Decode stage surfaces again. This
path includes the sign extension logic required after loading the data from the data
cache. We remove this path by eliminating load instructions from the forwarding
logic.
L: At this point the multiplier alone is the critical path. Both the inputs and the output
of the multiplier are registered. We remove this path by replacing the multiplier with
a simple XOR logic.
M: The path from the instruction cache hit signal to the fetch address selection surfaces
again. We remove this path by registering the ready signal from the instruction cache
to the Fetch stage.
N: This path includes the data cache’s lookup address selection logic. The lookup
address is either from load/store instructions or from the write-back logic. The
selection depends on the cache’s next state. We remove this path by using the
cache’s current state (a register) to select the address.
O: This path includes the multiplexer to select between shift, multiplication, loads from
data cache or all other instruction results in the Memory stage. The result is passed
on to the Writeback stage.
We stop critical path exploration at this point as the maximum clock frequency
reached (281 MHz) is higher than our target frequency (270 MHz).
The results of this analysis show there is no single path that dominates the clock
frequency. Instead, removing each problematic path results in a relatively small im-
provement. Only if several paths are eliminated can the operating frequency improve
substantially.
4.5 Eliminating Critical Paths
This section proposes solutions to eliminate some of the critical paths that Section 4.4
identified. All proposed solutions are confined to the processor implementation: they are
compiler-independent, and no compiler options are changed during optimization.
The proposed solutions preserve
the processor’s functionality while increasing its clock frequency. Some of the proposed
optimizations increase clock frequency at the expense of introducing pipeline bubbles
under certain scenarios. These bubbles may delay certain instructions, leading to lower
IPCs. However, as long as these delays are infrequent enough, the gain in frequency can
compensate for the loss in IPC. Section 4.6 measures the resulting performance in IPS,
considering both IPC and clock frequency.
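The trade-off can be made concrete with IPS = IPC x clock frequency. As a worked sketch using the chapter's endpoint frequencies, the reported 80% throughput gain at 281MHz implies the optimizations cost roughly 7% in IPC relative to the 145MHz baseline; the baseline IPC of 1.0 below is an illustrative placeholder, not a measured value.

```python
# Instructions per second combines cycle efficiency and clock rate.
def ips(ipc, freq_hz):
    return ipc * freq_hz

base = ips(ipc=1.0, freq_hz=145e6)      # illustrative baseline IPC

# At 281 MHz, an overall 1.80x throughput gain back-solves to an
# optimized relative IPC of 1.80 * 145 / 281, i.e. roughly 0.93.
opt = ips(ipc=1.80 * 145 / 281, freq_hz=281e6)
assert abs(opt / base - 1.80) < 1e-9
```

This is why infrequent extra bubbles are acceptable: a small IPC loss is more than compensated by the near-doubling of clock frequency.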
4.5.1 Multiplier and Shifter
The original processor implementation included a multiplier and a shifter in the Execute
stage. Although this reduces the number of cycles required for multiplication and shifting,
it also lowers the clock frequency, manifesting as critical paths (A), (D) and (L).
Instead, we propose to delay the forwarding of multiplication and shifting operations in
the pipeline by eliminating the bypass path from the execute stage back to the decode
stage. This will introduce bubbles in the pipeline when the next in order instruction in
the pipeline requires the result of the multiplier or shifter. Fig. 4.2 shows the pipeline
before and after this optimization.
Figure 4.2: Multiplication and shift/rotate operations before (a) and after (b) optimization.
4.5.2 Branch Misprediction Detection
The Fetch stage predicts the outcome and target address of branch instructions to avoid
stalling fetch on branches. However, when eventually the actual outcome of the branch
is computed in the Execute stage, it must be compared to the one predicted earlier. If a
mismatch is detected, any incorrectly introduced instructions must be flushed from the
pipeline and fetching must be redirected to the computed target address.
Branch misprediction detection includes three steps: 1) The outcome of the branch is
determined, i.e., whether the branch is taken or not-taken. 2) The target address of the
branch is calculated. 3) The actual target of the branch (either fall-through address or
Figure 4.3: Branch misprediction detection before (a) and after (b) optimization. Dashed boxes represent registers.
the target) is compared to the address predicted in the Fetch stage. Fig. 4.3-a shows
the block diagram of this mechanism.
In order to shorten the long combinatorial paths (B) and (C), we propose to delay
the branch misprediction detection by one clock cycle. As shown in Fig. 4.3-b, branch
outcome and target are calculated in the first clock cycle (Execute stage), and the com-
parison with the predicted target occurs in the next clock cycle (Memory stage). This
shortens the combinatorial path by introducing a register in the path. This optimization
increases branch misprediction recovery time by one clock cycle.
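The delayed detection can be sketched behaviorally as two pipeline steps: the Execute-stage cycle resolves direction and target and latches them, and the Memory-stage cycle performs the comparison. The function and field names are illustrative, not our Verilog.

```python
def execute_stage(taken, taken_target, fallthrough):
    """Cycle 1: resolve branch direction and target, register them."""
    actual_target = taken_target if taken else fallthrough
    return {"actual_target": actual_target}   # pipeline register contents

def memory_stage(regs, predicted_target):
    """Cycle 2: compare against the Fetch-stage prediction."""
    return regs["actual_target"] != predicted_target  # True = mispredict

# A taken branch whose target was predicted correctly:
regs = execute_stage(taken=True, taken_target=0x2000, fallthrough=0x1004)
assert memory_stage(regs, predicted_target=0x2000) is False
# With a wrong prediction, the flush fires one cycle later than before:
assert memory_stage(regs, predicted_target=0x1004) is True
```

The register between the two functions is exactly what breaks the long combinatorial path of Fig. 4.3-a in two.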
4.5.3 Data Forwarding
When an earlier instruction, already inside the processor pipeline, produces a result used
by a later instruction, its data must be forwarded to avoid a pipeline bubble [31]. Fig. 4.4-
a shows the datapath for forwarding data from various pipeline stages to the Decode stage
where data operands are prepared. Forwarding logic can end up on the critical path, as it
requires a large multiplexer and complex selection logic.
Full-blown forwarding logic forwards data from every pipeline stage after the Execute
stage, and requires a large multiplexer. Furthermore, the selection logic for the
forwarding multiplexer proves to be relatively complex. First, all instructions in the
pipeline producing a result for the same register must be identified through a set of
register-identifier comparisons. Among all matches, younger sources must be prioritized
over older ones. As the number of data sources increases, the complexity of the selection
process increases as well. We eliminate the data forwarding delay with the two
optimizations described next.
Two-Cycle Forwarding
The most critical path in data forwarding is due to the selection logic manifested in path (G). This logic is large and performs various operations sequentially. We shorten this long combinatorial path using the following observation: data source identification and data selection do not have to occur in the same cycle. Instead, they can be performed in two separate cycles. In the first cycle, the forwarding logic determines the source for a particular data operand. In the next clock cycle, the actual data selection occurs. This scheme effectively cuts the long path of data forwarding into two smaller paths.
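The split can be modeled at cycle level as a registered select signal, as in this hedged sketch (class and signal names are illustrative, not from the thesis RTL):

```python
class TwoCycleForward:
    """Model of splitting forwarding into identify and select cycles."""

    def __init__(self):
        self.sel_q = None   # select signal registered at the end of cycle 1

    def identify(self, src_reg, stage_dsts):
        """Cycle 1: compare src_reg against in-flight destination
        registers (youngest first) and register which source to use."""
        self.sel_q = None
        for i, dst in enumerate(stage_dsts):
            if dst == src_reg:
                self.sel_q = i
                break

    def select(self, stage_values, rfile_value):
        """Cycle 2: mux the operand using the registered select."""
        if self.sel_q is None:
            return rfile_value
        return stage_values[self.sel_q]
```

Only the small mux remains in the second cycle; the comparisons and priority encoding are paid for in the first.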
Delayed Data Forwarding
Delaying multiplication and shift operations by one cycle requires forwarding their results from the Memory stage to the Decode stage, as Fig. 4.4-a shows. Combined with the loads
Figure 4.4: Forwarding data path before (a) and after (b) optimization in the pipeline. Dashed line is the added forwarding path.
and ALU instructions, this requires a 4-to-1 multiplexer in the Memory stage. This multiplexer manifests in critical path (I) since it resides directly in the forwarding path. We propose to delay multiplication and shift results one more cycle and forward them from the Writeback stage. This reduces the multiplexer size to 2-to-1.
As Fig. 4.4-a shows, load data from memory passes through the sign-extension logic. This further prolongs the forwarding path, manifested in path (K). We propose to remove load data forwarding to the Decode stage, thereby eliminating the multiplexer in the Memory stage altogether. This further shortens the forwarding data path, as Fig. 4.4-b shows. Both optimizations may delay certain instruction combinations.
4.5.4 Fetch Address Selection
Although the baseline Fetch stage uses the branch predictor to guess the next instruction
address, it does not have to do so in all cases. More specifically, there are five options
for the next instruction address:
A1: Reset vector
A2: IRQ vector
A3: Redirect address due to branch misprediction
A4: Current PC due to instruction cache miss
A5: The predicted next address by the branch predictor
These options lead to a large 32-bit 5-to-1 multiplexer in the Fetch stage. Further-
more, the select signal depends on the following control signals: reset, interrupt, branch
misprediction, instruction cache miss, data hazard, and memory stall. Having a large
number of combinatorial signals as inputs, the multiplexer in the Fetch stage gives rise
to paths (E) and (J).
We propose reducing the size of the next address multiplexer to 3-to-1 as follows.
We observe that all the address options A1-A3 are redirection addresses. In addition,
we expect that A5 will be the common case, with A4 being less common and A1-A3
occurring infrequently. Accordingly, we propose delaying options A1, A2, and A3 by one
clock cycle. We introduce a redirect address register, holding the redirection address,
selected among options A1, A2, and A3. We use the redirect register to steer the fetch
accordingly in the next cycle.
We also include option A4 in the redirect register by observing that if the Fetch stage is allowed to advance the PC even when the instruction cache misses, returning to the previous fetch address can be treated as a redirection. Therefore, we can include option
Figure 4.5: Next address selection data path in the Fetch stage before (a) and after (b) optimization. Dashed boxes represent registers.
A4 in the redirect register, effectively removing the instruction cache miss signal from
the multiplexer select input. Fig. 4.5 shows this scheme in detail.
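The reduced selection can be summarized behaviorally as below. This is a sketch under the assumption that all redirection causes (A1-A4) have already been folded into the registered redirect address in the previous cycle:

```python
def next_fetch_address(redirect_valid, redirect_addr, stall, pc, prediction):
    """3-to-1 next-address selection after the optimization."""
    if redirect_valid:        # reset, IRQ, branch miss, or i-cache miss
        return redirect_addr  # replay, registered in the previous cycle
    if stall:                 # data hazard or memory stall: hold current PC
        return pc
    return prediction         # the common case: the branch predictor's guess
```

The common-case prediction path now sees a much smaller mux with far fewer combinatorial select inputs.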
4.5.5 Data Operand Specialization
In the Nios II ISA the second operand for shift/rotate operations can come from only
two sources: the register file or an immediate value from the instruction bits. However,
other instruction types have four options for the second operand. The original, modular
Verilog code of our processor implementation included all possible data sources for all
types of instructions. However, it is not necessary to use the same data multiplexer for
all instruction types. We propose to use a separate 2-to-1 multiplexer for shift/rotate
instructions, shortening path (H).
4.6 Performance
The optimizations proposed in Section 4.5 remove critical paths but may increase the
number of pipeline stalls. Overall processor performance depends on both the Instruc-
tions Per Cycle (IPC) rate and the clock frequency. This section studies the performance
of the processor pipeline taking both into account. Fig. 4.6 reports IPC along with the
instruction per second (IPS) throughput for the various processor designs shown along
the x axis. The baseline configuration is shown at the leftmost side. From left to right,
the graph reports instruction throughput as all paths listed along the x-axis are removed.
For example, configuration I has paths A through E and I removed. The IPS results show
that frequency gains due to optimizations more than compensate for loss in IPC. Pro-
cessor performance starts at 47 million IPS and reaches as high as 85 million IPS after
applying the optimizations, an 80% improvement.
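The trade-off rests on the throughput identity IPS = IPC x frequency: a small IPC loss is recovered whenever the clock frequency rises by more. A quick check against the figures quoted above:

```python
def ips(ipc, freq_hz):
    # instructions per second = instructions per cycle x cycles per second
    return ipc * freq_hz

# the improvement quoted above: 47 million IPS -> 85 million IPS
speedup = 85e6 / 47e6 - 1.0   # roughly 0.81, i.e. the ~80% improvement
```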
4.7 Related Work
To the best of our knowledge no previous work exists that systematically characterizes
the critical paths in a general purpose soft processor implementation. Several works that
propose optimizations for soft processor implementations exist. The analysis of this work
complements such works and serves as a guide for further optimizations. The closest work
is by Wong et al., who compare the area and delay of processors implemented on custom
CMOS and FPGA substrates [62]. They find that SRAMs and adders are efficient on
Figure 4.6: IPC and relative IPS improvement for the processor after removing critical paths.
FPGAs mainly due to having dedicated resources. However, CAMs and multiplexers
are found to be extremely inefficient. They also find that data forwarding is inefficient
on FPGAs compared to custom CMOS implementations. Our work complements this
past work as it looks at the architecture of a full processor design identifying specific
architecture components and techniques that are inefficient in an FPGA implementation.
Yiannacouras et al. explore the impact of soft processor customization on performance [68]. They consider various factors including pipeline depth, pipeline organization, data forwarding, and multi-cycle operations. They show that fine-grain microarchitectural customizations can yield higher overall performance compared to a few hand-picked optimizations. Furthermore, they show that by subsetting the ISA they can reach a modest frequency improvement of 4% for a 5-stage pipeline. They conclude that after removing logic from a given path, often another path of similar length remains, so it is unlikely that one can simply shorten all paths.
4.8 Conclusion
This chapter considered a typical pipelined processor design and implemented it on a
modern FPGA. The baseline implementation focused on correctness, development speed,
modularity, and extensibility. It then explored sources of inefficiency in the implemen-
tation and found that the major components limiting speed were branch misprediction
detection, data forwarding, fetch address selection, certain computations, and stall broad-
cast signals. Finally, this work proposed various optimizations to increase processor clock
frequency in order to achieve higher performance.
Chapter 5
CFC: Copy-Free Checkpointing
This chapter proposes CFC, Copy-Free Checkpointing, a novel checkpointing mechanism suitable for FPGA implementation. CFC avoids the data copying that would otherwise have to be performed serially due to the port limitations of BRAM storage in FPGAs. Here we discuss the need for checkpointing in OoO processors and show that conventional checkpointing mechanisms map poorly to FPGAs. We then demonstrate CFC for checkpointing the register rename table, a key component in OoO architectures. Finally, CFC is shown to map well to FPGAs while providing all the functionality of a conventional checkpointing scheme. The novel CFC scheme presented in this chapter has been published as [1].
5.1 The Need for Checkpointing
OoO processors use speculative execution to boost performance. In speculative execution the processor executes instructions without being certain that it should. A common form of speculative execution is based on control-flow prediction, where the processor executes instructions starting at the predicted target address of a branch. When the speculation is correct, performance may improve because the processor had a chance to execute instructions earlier than if it had waited for the branch to resolve its target. When the speculation fails, all changes made by the erroneously executed instructions must be undone. For this purpose, OoO processors rely on the Re-Order Buffer (ROB) [54]. The ROB records, in order, all the changes made by instructions as they execute. To recover from a mis-speculation, the processor processes the ROB in reverse order, reverting all erroneous changes to the processor state.
Recovery via the ROB is slow, requiring time proportional to the number of erroneously executed instructions. For this reason, many OoO processors employ checkpointing, a recovery mechanism with a fixed latency, often a single cycle [44]. For a storage-based component, a checkpoint is a complete snapshot of its contents. Checkpoints are expensive to build and increase latency. Accordingly, only a few of them are typically implemented [65, 8, 44].
When both checkpoints and an ROB are available, recovery can proceed as follows: if the mis-speculated instruction has a checkpoint, recovery uses that checkpoint alone. Otherwise, recovery first restores the closest subsequent checkpoint and then proceeds via the ROB to the relevant instruction [44]. Alternatively, the processor can recover to the closest preceding checkpoint at the expense of re-executing any intervening instructions [8]; in this case an ROB is unnecessary. It has been shown that a few checkpoints offer performance close to that possible with an infinite number of checkpoints [8, 44]. Accordingly, we limit our attention to four or eight checkpoints.
5.2 Register Renaming
An OoO processor reorders instructions to extract instruction- and data-level parallelism.
However, instruction reordering must preserve data dependencies, which can be catego-
rized into the following:
1. read-after-write (RAW)
2. write-after-read (WAR)
3. write-after-write (WAW)
The last two are also known as false dependencies since they are an artifact of re-
using a limited number of registers. Register renaming eliminates false dependencies by
mapping, at run time, the architectural registers referred to by instructions to a larger
set of physical registers implemented in hardware [59]. False dependencies are eliminated
by using a different physical register for each write to the same architectural register.
Typical implementations of register renaming use a Register Alias Table (RAT), which maps architectural to physical registers [54, 65, 57, 45]. The RAT is indexed with the architectural register name, and each entry provides a physical register name. Renaming an instruction for a three-operand instruction set such as that of Nios II proceeds as follows:
• The two source registers are renamed by reading their current mapping from the
RAT.
• A new mapping is created for the destination register, if any. A free list provides
the new physical register name. The processor recycles a physical register when
it is certain that no instruction will ever access its value (e.g., when a subsequent
instruction that overwrites the same architectural register commits).
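The two steps above can be sketched as a minimal software model of RAT-based renaming. The class and method names are ours, and checkpointing and register recycling are omitted:

```python
class SimpleRenamer:
    """Minimal RAT + free-list renamer for a three-operand ISA."""

    def __init__(self, num_arch, num_phys):
        self.rat = list(range(num_arch))             # arch -> phys mapping
        self.free = list(range(num_arch, num_phys))  # unallocated phys regs

    def rename(self, dst, src1, src2):
        p_src1 = self.rat[src1]        # read current source mappings
        p_src2 = self.rat[src2]
        prev_dst = self.rat[dst]       # previous mapping, kept for recovery
        new_dst = self.free.pop(0)     # fresh physical register
        self.rat[dst] = new_dst        # removes WAR/WAW hazards on dst
        return p_src1, p_src2, new_dst, prev_dst
```

Renaming two successive writes to the same architectural register yields two different physical registers, which is exactly how the false dependencies are eliminated.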
5.2.1 Checkpointed RAT
In order to support speculative execution, the RAT's contents need to be checkpointed. RAMRAT, a common RAT implementation, is a table indexed by architectural register names whose entries contain physical register names [65]. For the Nios II instruction set, this table needs three read ports: two for the source operands plus one for reading the previous mapping of the destination operand so it can be stored in the ROB. The table also needs one write port to write the new mapping for the destination register. A checkpoint is a snapshot of the table's contents and is stored in a separate table. Multiple checkpoints require multiple tables. Recovery amounts to copying back a checkpoint into the main
Figure 5.1: Epochs illustrated in a sequence of instructions.
table. A checkpoint is taken when the processor renames an instruction that initiates a new speculation (e.g., a branch). Such an instruction terminates an epoch comprising the instructions seen since the last preceding checkpoint, as shown in Figure 5.1. Recovering at a checkpoint effectively discards all the instructions of all subsequent epochs.
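For contrast with what follows, the conventional copy-based scheme can be sketched as below (an illustrative model, not the thesis RTL). Every checkpoint and every recovery copies the entire table, which is precisely the bulk copying that serializes poorly through BRAM ports:

```python
class CopyingRAT:
    """RAMRAT-style checkpointing: full table copies."""

    def __init__(self, num_arch):
        self.table = list(range(num_arch))   # arch -> phys mapping
        self.checkpoints = []

    def checkpoint(self):
        self.checkpoints.append(list(self.table))   # bulk copy out

    def restore(self, idx):
        self.table = list(self.checkpoints[idx])    # bulk copy back
        del self.checkpoints[idx:]                  # discard younger epochs
```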
5.3 CFC
CFC modifies RAMRAT so that it better matches an FPGA substrate. The key challenge when implementing RAMRAT on an FPGA is the implementation of the checkpoints. Creating a checkpoint requires copying all the bits of the main table into one of the checkpoint tables. In ASIC implementations the checkpoints are implemented as small queues embedded next to each RAT bit. However, such an implementation is expensive and inefficient on an FPGA because it cannot exploit BRAMs and uses LUTs exclusively (see Section 5.5.2).
In RAMRAT, the main table holds all the changes applied to the RAT by all the
instructions, both speculative and non-speculative. The advantage of this implementation
is that the most recent mapping for a register always appears at the corresponding
entry of the main table. Hence lookups are streamlined. Checkpoints, however, need
to take a complete snapshot of the main table and this results in an inefficient FPGA
implementation. Instead of storing updates always in the same main table, CFC uses
a set of BRAM-implemented tables and manages them as a circular queue. CFC stores
RAT updates done in each epoch in a different table. Therefore, recovering from a mis-
speculated epoch is as simple as discarding the corresponding table. This significantly
simplifies RAT updates and checkpoint operations. RAT lookups, however, turn into
ordered searches through all the tables to find the most recent mapping.
By eliminating the need for copying, CFC is able to exploit on-chip BRAMs to store
mappings. It uses a few LUTs to maintain the relative order among tables and to
implement the search logic involved in read operations. Implementing reads is inexpensive
on FPGAs as LUTs can efficiently implement complex logic.
5.3.1 The New RAT Structure
Figure 5.2 shows the organization of CFC. There are two main structures, the RAT tables
and the dirty flag array (DFA). Each RAT table contains one entry per architectural
register, which provides a physical register name. A total of c+1 tables exist, which
correspond to c checkpoints and the committed state of the RAT. Each checkpoint table
contains mappings introduced by instructions of an epoch. For simplicity, epochs and
tables use the same indexes. CFC uses two pointer registers, head and tail, to specify the
relative order of the tables, similar to a circular queue. The DFA tracks valid mappings
in each checkpoint table. Accordingly, DFA contains one bit per checkpoint table and
per architectural register.
The (c+1)th table, the committed table, represents the RAT's architectural state, i.e., the latest mappings applied by non-speculative (committed) instructions. The processor uses this table to recover from unexpected events, e.g., page faults or hardware interrupts.
5.3.2 RAT Operations
This section explains in detail how CFC performs various renaming operations.
Figure 5.2: CFC main structure consists of c+1 tables and a dirty flag array.
Figure 5.3: Finding the most recent mapping: the most recent mapping for register R1 is in the second column (01), while for R2 it resides in the fourth (11).
Finding the Most Recent Mapping
When renaming a source operand, the processor needs to identify the table holding
the most recent mapping. This is achieved by examining the DFA row indexed by the
architectural register name. Conceptually, this is done sequentially, looking for the first set dirty flag, starting from the head and moving backwards towards the tail. If no dirty flag is set, the committed table is used. Figure 5.3 shows two examples. In practice, this search is implemented as a lookup table whose inputs are the DFA row and the two pointers.
Creating a New Mapping
When renaming a destination register, CFC stores a new mapping into the most recent
table identified by the head pointer. CFC obtains the new mapping from a free list of
physical registers. It also sets the corresponding DFA bit indicating a valid mapping in
the corresponding entry of the table.
Creating a Checkpoint
CFC creates a checkpoint by simply advancing the head pointer, ensuring that all subsequent updates to the RAT go to a new table. As this table holds no valid mappings yet, CFC also clears all DFA bits of the corresponding column, identified by the new head pointer. Since all subsequent RAT updates are directed to this new table, the previous tables remain intact, and CFC uses them for recovery. Note that no data copying is necessary when creating a new checkpoint.
Committing a Checkpoint
Upon instruction commit, CFC places the destination register mapping into the com-
mitted table. Instead of copying all mappings of an epoch en masse from a checkpoint
table to the committed table, CFC commits changes progressively. CFC stores mappings
one-by-one into the committed table as individual instructions commit. Finally, CFC
advances the tail pointer upon the commit of an instruction that started an epoch and
allocated a checkpoint, e.g., a branch, effectively recycling the checkpoint.
Restoring from a Checkpoint
On a mis-speculation, e.g., a branch misprediction, the RAT must be restored to the state it was in before the mis-speculated instruction was renamed. All that is needed is to update the head pointer to the epoch number of the mis-speculated instruction. The intervening tables are effectively discarded, since only the columns between head and tail are considered during subsequent lookups. Notice that restoring a checkpoint does not involve any copying either.
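Taken together, the operations above can be collected into a small software model. This is an illustrative sketch of the mechanism (names and structure are ours, not the thesis's Verilog), with the free list omitted and commit simplified to a single mapping write:

```python
class CFC:
    """Copy-free checkpointing: one RAT table per epoch, circular queue."""

    def __init__(self, num_arch, num_checkpoints):
        self.n = num_checkpoints
        self.tables = [[None] * num_arch for _ in range(num_checkpoints)]
        self.dirty = [[False] * num_checkpoints for _ in range(num_arch)]
        self.committed = list(range(num_arch))   # architectural state
        self.head = 0                            # current (youngest) epoch
        self.tail = 0                            # oldest live epoch

    def _live_epochs_newest_first(self):
        e, out = self.head, []
        while True:
            out.append(e)
            if e == self.tail:
                return out
            e = (e - 1) % self.n

    def lookup(self, reg):
        # search from head back to tail; fall through to committed state
        for e in self._live_epochs_newest_first():
            if self.dirty[reg][e]:
                return self.tables[e][reg]
        return self.committed[reg]

    def write(self, reg, phys):
        self.tables[self.head][reg] = phys
        self.dirty[reg][self.head] = True

    def checkpoint(self):
        # copy-free: advance head and clear that table's dirty column
        self.head = (self.head + 1) % self.n
        for row in self.dirty:
            row[self.head] = False

    def restore(self, epoch):
        self.head = epoch   # later epochs are simply ignored by lookups

    def commit(self, reg, phys):
        self.committed[reg] = phys   # progressive, per-instruction commit
```

Note that both `checkpoint` and `restore` only move a pointer; no mapping is ever copied.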
5.4 FPGA Mapping
This section details how CFC is implemented on an FPGA. The implementation differs slightly from the organization described above, taking advantage of FPGA-specific properties. Most of the RAT state is stored in BRAMs, which are high-speed, area-efficient memory arrays.
5.4.1 Flattening
Selecting the most recent checkpoint table, as determined by the DFA logic, requires a C-to-1 multiplexer, C being the number of checkpoints. Such a multiplexer is area- and latency-inefficient. Flattening the RAT array, that is, storing all checkpoints sequentially in one table, eliminates this C-to-1 multiplexer. Accessing the new flattened RAT, however, requires a new index composed of two parts: a base index and an offset. The base index selects the checkpoint while the offset selects the entry within that checkpoint. The base index is determined by the dirty flags, whereas the offset is determined by the architectural register. As long as C is a power of two, calculating the index amounts to concatenating the architectural register name to the column index reported by the DFA logic.
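With C a power of two, the index computation is pure bit concatenation, as in this sketch (the 5-bit register width assumes the 32 architectural registers of Nios II):

```python
def flat_index(column, arch_reg, arch_reg_bits=5):
    """Index into the flattened RAT: checkpoint-column bits concatenated
    with the architectural register bits (32 registers -> 5 bits)."""
    return (column << arch_reg_bits) | arch_reg
```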
5.4.2 Multiporting the RAT
Two processor stages access the RAT structure in CFC: rename and commit. Renaming
an instruction requires reading at most three mappings, and changing one mapping.
Commit also writes a mapping into the committed copy. In total, the RAT structure
must have three read ports and two write ports. Unfortunately, BRAMs have only one read and one write port. Multiported BRAM-based storage designs have been proposed, but they come at the expense of area and frequency overheads that we would like to avoid [41].
We make the following observation to enable the use of BRAMs for storage. The
commit stage writes only to the committed table, while the rename stage writes only
to the checkpoint tables. Accordingly, we implement the committed table separately
to avoid needing two write ports to the same BRAM. On lookups, a 2-to-1 multiplexer
selects between the committed and the most recent checkpoint table. This multiplexer
does not add significant area or latency overhead, due to its fixed, minimal size.
To provide three read ports, we replicate each BRAM three times. The write ports of
the checkpoint BRAMs are connected to a single external write port so that all copies are
updated simultaneously. Similarly, the write ports of the three BRAMs implementing
the committed copy are also connected to a single write port.
5.4.3 Dirty Flag Array
The DFA is accessed one row at a time for lookups. However, for checkpoint creation,
an entire column is reset at the same time. Therefore, DFA and the associated logic are
described as a lookup table to the synthesis tools and are implemented using LUTs.
5.4.4 Pipelining the CFC
Compared to RAMRAT, CFC adds a level of indirection prior to accessing the table array. Before accessing the tables, CFC must determine the index of the table to be used, which involves accessing the DFA. Consequently, its latency can be longer than that of RAMRAT.
However, CFC’s clock frequency can be improved as it can be pipelined. Specifically, we
implement CFC as a two-stage pipeline as follows:
• In the first stage CFC decodes the dirty flags corresponding to the architectural
register being renamed. It generates a BRAM index based on the DFA row read
and the architectural register name. It also updates the DFA row if a new mapping
is being placed in the RAT.
• In stage two CFC accesses the checkpoint and committed BRAMs in parallel and
at the end selects the appropriate copy. At the end of stage 2, all BRAM updates
occur as well.
5.5 Evaluation
This section compares the performance and cost of CFC against FPGA implementations of two conventional renaming methods. Section 5.5.1 details the experimental methodology. Section 5.5.2 reports the LUT usage of each mechanism, while
Section 5.5.3 reports their operating frequencies. Section 5.5.4 measures the impact of
pipelining on IPC performance. Finally, Section 5.5.5 reports overall performance and
summarizes our findings taking into account LUT and BRAM usage.
5.5.1 Methodology
We compare CFC to two conventional methods which we call RAM and CAM. RAM
uses LUTs exclusively to checkpoint the RAT. CAM uses content-addressable-memories
Table 5.1: Architectural properties of the simulated processors.

Common Properties
    Branch Predictor Type:  Bimodal
    Bimodal Entries:        512
    BTB Entries:            512
    Cache Size (Bytes):     4K, 8K, 16K, 32K
    Cache Associativity:    Direct Mapped, 2-way
    Memory Latency:         20 Cycles
Superscalar Specific
    Pipeline Stages:        5
Out-of-Order Specific
    Pipeline Stages:        7
    Scheduler Size:         32
    ROB Size:               32
    Physical Registers:     64
    Checkpoints:            4
to provide checkpointing functionality at the expense of reduced clock frequency [54, 47].
We consider designs with four and eight checkpoints as past work has shown that this
number of checkpoints is sufficient [8, 44]. We implemented all three renaming schemes in
Verilog. We follow the same experimental methodology described in Chapter 3 to obtain
IPC, area and frequency characteristics of the designs. Table 5.1 details the architecture
of the processor simulated in this study.
5.5.2 LUT Usage
Table 5.2 reports the number of LUTs used by the three renaming mechanisms with four and eight checkpoints on two platforms. Because only the DFA and its associated logic use LUTs in CFC, its cost is considerably lower than that of CAM and RAM. For example, with eight checkpoints on Stratix III, CFC uses approximately 2x and 7x fewer resources than CAM and RAM, respectively. On Cyclone II with eight checkpoints, CFC uses 2.73% of the available LUTs, while CAM and RAM use 10.37% and 18.19%, respectively.
CFC uses six BRAMs, which is only a small fraction of the BRAMs available on either
platform. We conclude that CFC is superior in terms of resource usage.
Table 5.2: LUT and BRAM usage and maximum frequency for 4 and 8 checkpoints on different platforms.

                            RAM      CAM      CFC
LUT
    Cyclone II/4            3220     2378     501
    Cyclone II/8            6368     3631     964
    Stratix III/4           3002     1802     399
    Stratix III/8           7082     2327     996
BRAM
    Cyclone II/4            0        0        6
    Cyclone II/8            0        0        6
    Stratix III/4           0        0        6
    Stratix III/8           0        0        6
Silicon Tile Area (mm2)
    Stratix III/4           3.3022   1.9822   0.8199
    Stratix III/8           7.7902   2.5597   1.4766
Frequency (MHz)
    Cyclone II/4            122      85       137
    Cyclone II/8            82       71       104
    Stratix III/4           195      133      292
    Stratix III/8           196      105      220
We also compare designs based on their equivalent silicon real estate used. We calcu-
late equivalent area by summing the area of all the LUTs plus the silicon area of all the
BRAMs. As shown in Table 5.2, even after considering the BRAM area, CFC is still sig-
nificantly smaller compared to both RAM and CAM. Specifically, with four checkpoints,
CFC is 75% and 58% smaller than RAM and CAM, respectively. With eight checkpoints,
CFC is 81% and 42% smaller than RAM and CAM, respectively.
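The percentages follow directly from the Stratix III silicon-tile areas in Table 5.2; a quick arithmetic check:

```python
def reduction(cfc_area, other_area):
    # fractional area saving of CFC relative to the other scheme
    return 1.0 - cfc_area / other_area

# Stratix III silicon-tile areas (mm^2) from Table 5.2
r4_ram = reduction(0.8199, 3.3022)   # four checkpoints, vs RAM (~0.75)
r4_cam = reduction(0.8199, 1.9822)   # four checkpoints, vs CAM (~0.58)
r8_ram = reduction(1.4766, 7.7902)   # eight checkpoints, vs RAM (~0.81)
r8_cam = reduction(1.4766, 2.5597)   # eight checkpoints, vs CAM (~0.42)
```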
5.5.3 Frequency
Table 5.2 reports the maximum clock frequency for the three checkpointing mechanisms.
CFC outperforms both conventional schemes on both FPGA platforms, operating up to 118% and 50% faster than the CAM- and RAM-based schemes, respectively. This is because CFC exploits BRAMs for storage; using BRAMs leads to a less complex interconnect and higher routing efficiency, yielding higher clock frequencies.
Figure 5.4: Performance impact of an extra renaming stage.
5.5.4 Impact of Pipelining on IPC
CFC outperforms RAM and CAM in terms of clock frequency; however, it is pipelined in two stages, and pipelining imposes a runtime performance penalty. Figure 5.4 compares the IPC of single-cycle and two-cycle pipelined renaming (the base architecture has six stages, one of which is for renaming). The performance penalty incurred by the additional renaming stage is small, i.e., less than a 2% IPC drop. Coupled with the frequency advantage of CFC, we thus expect CFC to outperform both the RAM and CAM schemes.
5.5.5 Performance
Figure 5.5 reports the overall performance of processors using the different checkpointing schemes. The processors differ in operating clock frequency, set to the maximum achieved by each checkpointing scheme, and in average IPC. Compared to RAM and CAM, CFC is slightly slower in terms of IPC, on average 0.54 vs. 0.55, due to the added pipeline stage. However, overall performance, measured in IPS, is significantly higher due to the higher clock frequency.
Figure 5.5: Overall processor performance in terms of IPS using various checkpointing schemes.
5.6 Related Work
Mesa-Martinez et al. propose implementing an OoO soft core, SCOORE, on FPGAs [27]. They investigate OoO architectures for FPGA implementation and show that OoO architectures, in their conventional form, result in expensive and inefficient implementations, proposing several general remedies. The SCOORE project differs from the work in this thesis in that its primary goal is simulation acceleration.
Fytraki and Pnevmatikatos also implement parts of an OoO processor on an FPGA for the purpose of accelerating processor simulation [30]. Their work is motivated, in part, by the same inefficiencies that prior works identified. Our goal, however, is different: we aim to develop cost- and performance-effective, FPGA-friendly OoO components for use in embedded system applications.
5.7 Conclusion
In order to have a practical and efficient OoO soft processor on FPGAs, it is necessary to develop FPGA-friendly implementations of the various units that OoO execution requires. This chapter presented CFC, a novel checkpointing technique that avoids data copying so that it can exploit BRAMs on FPGAs. Using CFC, fast and area-efficient register renaming, a key component of OoO execution, is possible on FPGAs. The proposed copy-free checkpointed register renaming was shown to be much more resource-efficient than conventional alternatives, and it can be pipelined, offering superior performance.
Although this chapter focused on register renaming, checkpointing has many other applications. For example, checkpointing can be used to support alternative execution models such as transactional memory. The proposed copy-free checkpointing scheme has already been used for such applications [40].
Chapter 6
Instruction Scheduler
As an additional step toward an FPGA-friendly, single-issue OoO design, this chapter
studies instruction scheduler implementations. The instruction scheduler is the core of
OoO implementations which rely on reordering instructions to maximize instruction-
and data-level parallelism. The scheduler is where instructions wait for all their source
operands and execution resources to become available. This work starts with a conven-
tional, content-addressable-memory-based scheduler design [49] and studies its implemen-
tation on a modern FPGA. Specifically, performance and area are studied as a function
of the number of scheduler entries, the inclusion of back-to-back scheduling support, and
the use of age-based priority scheduling. The results of the study done in this chapter
have been published as [2].
This chapter shows that considering the scheduler in isolation, the best performance
is achieved with a two-entry scheduler without back-to-back scheduling and with the
simpler location-based selection policy. However, when the scheduler is considered as
part of the rest of the pipeline, it is shown that best performance is achieved with a four-
entry scheduler with back-to-back scheduling and age-based selection. This four-entry
configuration is inexpensive and fast. It uses 164 ALUTs and operates at 303MHz.
[Figure omitted: pipeline diagram with stages Fetch, Decode, Rename, Scheduler, Execute, Mem, Write, Commit, and the instruction sequence (A) ldw r1, 0(r2); (B) addi r3, r1, 1; (C) muli r4, r5, 3 progressing over time.]
Figure 6.1: An example sequence of instructions being scheduled. The current state of the processor is presumed to be: instruction A in the memory stage, with instructions B and C in the scheduler, waiting to be selected for execution.
6.1 Instruction Scheduling
An OoO processor can execute instructions in any order that does not violate data de-
pendencies. Instructions enter the instruction scheduler, a pool where they wait until
they become ready, that is until all their source operands are available. The instruction
scheduler identifies ready-to-execute instructions, then among those, selects W instruc-
tions to issue to functional units, W being the processor datapath width. This chapter
focuses on single-issue schedulers (W = 1) as it was shown in Chapter 2 that it is the
number of datapaths that dominates the area and frequency of FPGA implementations.
Figure 6.1 shows an example sequence of instructions that enter an OoO processor.
Instruction A is in the memory stage waiting to load data from the data cache, while
instructions B and C reside in the scheduler pool. Instruction B depends on instruction
A through register r1, thus it cannot execute before instruction A finishes. As soon
as instruction A finishes loading data from memory, instruction B can be chosen for
execution as all its source operands (r1) are now available. Encountering a cache miss
can delay the execution of instruction A for multiple cycles. While instruction B stalls
waiting for A, instruction C is free to execute.
An instruction scheduler comprises a wakeup unit and a selection unit. Wakeup
is responsible for identifying ready-to-execute instructions among those residing in the
scheduler pool. It observes instructions as they produce their results, notifying wait-
ing instructions accordingly. A waiting instruction becomes ready when all its source
operands have been produced. With back-to-back scheduling support, an instruction
can be scheduled for execution in the cycle immediately following the completion of
an instruction it depends on. In the case of multiple dependencies, the
last one to be resolved dictates the cycle in which the dependent instruction is scheduled
for execution.
At any given time, there can be more ready instructions than the number of available
functional units. All ready instructions request execution from the selection unit. For
example, in Figure 6.1, if instruction A finishes execution at the time C enters the scheduler,
both B and C become ready and request execution. In a single-issue OoO processor,
only one instruction can proceed to execution immediately. The selection unit is respon-
sible for selecting among the ready instructions the one that will execute. Typically the
selection unit uses a pre-specified selection policy for doing so. The selection policy can
be based on many factors, such as instruction age, instruction type, or availability of
functional units.
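The wakeup/select flow described above can be sketched as a small behavioral model. This is a Python sketch, not the thesis's Verilog implementation; the `Entry` fields and the age-based tie-break are illustrative.

```python
# Behavioral sketch of single-issue (W = 1) instruction scheduling:
# wakeup clears produced source operands; select issues one ready
# instruction per cycle, preferring the oldest.

class Entry:
    def __init__(self, name, sources, age):
        self.name = name
        self.pending = set(sources)  # source registers not yet produced
        self.age = age               # smaller = older

def wakeup(entries, produced_reg):
    """Mark a just-produced register as available in every waiting entry."""
    for e in entries:
        e.pending.discard(produced_reg)

def select(entries):
    """Pick one ready instruction; prefer the oldest (age-based policy)."""
    ready = [e for e in entries if not e.pending]
    if not ready:
        return None
    winner = min(ready, key=lambda e: e.age)
    entries.remove(winner)
    return winner.name

# Figure 6.1's example: B waits on A (through r1); C has no pending sources.
pool = [Entry("B", ["r1"], age=0), Entry("C", [], age=1)]
assert select(pool) == "C"   # C issues while B still waits on r1
wakeup(pool, "r1")           # A completes, broadcasting r1
assert select(pool) == "B"
```

In hardware the wakeup broadcast and the selection happen in parallel combinational logic, but the cycle-by-cycle behavior matches this model.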
6.2 CAM-Based Scheduler
A common scheduler design, referred to here simply as CAM, is built around content-addressable
memories [49]. Figure 6.2 depicts its structure. The wakeup part is an array with one row per
instruction. Each row contains one column per source operand. Each column contains the
source operand tag along with a ready bit indicating the operand’s availability. For the
Nios II ISA, every instruction can have up to two source operands. Each row is accompa-
nied by two comparators. When an earlier instruction finishes execution, its destination
register tag is broadcast over all entries and compared to source operands. All matching
entries mark their corresponding source operands as available. All instructions that have
both their source operands marked ready request execution. The selection logic selects
one among those ready instructions for execution. Figure 6.2 shows the ready signals as
inputs to the selection logic.
6.2.1 CAM on FPGAs
Despite CAM’s simple structure, it is expensive to build on FPGAs. As Section 6.3
will show, area and frequency degrade as the number of entries increases. By increasing
the number of entries, the network connecting the comparators and source operand tags
becomes more complex, leading to longer critical paths, and hence lower clock frequencies.
Because all entries are used for comparison in every clock cycle, BRAMs cannot be used
for storing the tags due to read/write port limitations. Additionally, there is a comparator
for each source operand of every instruction, resulting in a high resource usage.
6.2.2 CAM Performance
It is well documented that the ILP that can be extracted from a program increases with
the number of scheduler entries [49]. The resulting IPC benefits tend to level off after a
certain number of entries. The actual saturation point varies depending on the processor
architecture and also on system properties, such as memory latency. Furthermore, actual
performance depends not only on IPC but on the processor clock frequency as well. It
has been shown that scheduler frequency deteriorates with the number of entries [49]. As
a result there is a trade-off between scheduler size and performance in conventional CAM
implementations. Section 6.3 will explore this trade-off for FPGA-based implementations.
[Figure omitted: the CAM array, one row per instruction, each row holding tag-L/rdy-L and tag-R/rdy-R fields with comparators, OR and AND gates; the destination tag is broadcast to all rows, whose ready signals feed the selection logic.]
Figure 6.2: CAM scheduler with back-to-back scheduling and compaction. OR gates provide back-to-back scheduling. The dashed gray lines show the shifting interconnect which preserves the relative instruction order inside the scheduler for the age-based policy. The selection logic prioritizes instruction selection based on location, i.e., it is a priority encoder.
6.2.3 Back-to-Back Scheduling
To exploit more ILP it is desirable to execute dependent instructions in consecutive
cycles, or back-to-back, avoiding bubbles in the pipeline. In this regard, the CAM must
generate ready signals in the same clock cycle as the destination tag is broadcast by
earlier instructions. Figure 6.2 depicts a CAM with back-to-back scheduling. The OR
gates ensure that ready signals are produced by either the ready register bit or by the
result of the comparisons performed in the current clock cycle. In this design, the wakeup
and select units must operate in the same clock cycle. Although supporting back-to-back
scheduling increases processor IPC, it adversely affects operating frequency. Section 6.3
will show that back-to-back scheduling in fact increases latency on FPGAs. The reduced
clock frequency can overshadow any IPC advantage back-to-back scheduling has to offer.
Area-wise, adding the OR gates has a small overhead as will be shown in Section 6.3.
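The per-entry ready generation with back-to-back support can be sketched behaviorally. The field names mirror Figure 6.2 (tag-L/rdy-L, tag-R/rdy-R); the function itself is a hypothetical model, not the thesis's Verilog.

```python
# Sketch of the per-entry ready logic with back-to-back support:
# each source is ready if its stored ready bit was already set OR its
# tag matches the destination tag broadcast this cycle. The OR lets a
# dependent instruction issue in the very next cycle.

def entry_ready(tag_l, rdy_l, tag_r, rdy_r, broadcast_tag):
    match_l = (tag_l == broadcast_tag)   # left comparator
    match_r = (tag_r == broadcast_tag)   # right comparator
    # OR gates from Figure 6.2: stored bit or same-cycle match
    return (rdy_l or match_l) and (rdy_r or match_r)

# addi r3, r1, 1 waits on r1; its second source is an immediate,
# modeled here as already ready. The cycle r1 is broadcast, the entry
# becomes ready without first latching the bit.
assert entry_ready("r1", False, None, True, broadcast_tag="r1")
assert not entry_ready("r1", False, None, True, broadcast_tag="r7")
```

Without back-to-back support, the `match` terms would only set the ready bits for the *next* cycle, decoupling wakeup from selection and shortening the critical path.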
6.2.4 Scheduling Policy
In the event that more instructions are ready than there are available execution units, the
selection unit, based on a selection policy, determines which instructions to execute.
This policy can be based on various parameters such as instruction age or location inside
the scheduler. Previous work has shown that a selection policy based on instruction age
tends to perform better than other simple-to-implement heuristics.
One way to consider instruction age in the selection policy is to organize the scheduler
as a FIFO queue. FIFOs preserve instruction ordering and provide relative age infor-
mation based on the location in the queue. Using FIFOs, insertion of instructions is a
trivial queue push operation. However, removing instructions requires more than pop
operations. As instructions can be selected for execution from arbitrary positions of the
FIFO, additional functionality must be provided to maintain the relative positioning of
instructions inside the FIFO.
In order to remove instructions from arbitrary positions inside the scheduler once
they execute, compaction has been implemented in commercial designs to maintain FIFO
ordering [37]. In Figure 6.2, the interconnect between rows provides compaction capabil-
ity. Upon scheduling an instruction, all entries starting from its position to the bottom
(younger) are shifted towards the top (older). This ensures that at any point in time
older instructions are placed at the top. This design guarantees that an instruction’s
relative position also reflects its relative age. The selection logic then uses a priority
encoder which prioritizes based on instruction location, giving entries closer to the top
higher priority.
Compaction, as described above, requires extra connections between scheduler rows,
hence it impacts the area and latency of the scheduler design. Note, however, that
since we study only single-issue datapaths, at most one scheduler entry is freed per
cycle, so a single shift operation per cycle suffices to provide compaction.
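The compaction behavior can be sketched with a simple list-based model; the hardware uses a shifting interconnect, and the tuple representation below is illustrative.

```python
# Sketch of compaction: the scheduler is kept oldest-first, and when an
# entry issues, all younger entries shift up one slot. With a
# single-issue datapath at most one slot frees per cycle, so one shift
# suffices.

def select_and_compact(entries):
    """Priority-encode from the top (oldest) among ready entries,
    then close the gap by shifting younger entries up."""
    for i, (name, ready) in enumerate(entries):
        if ready:
            del entries[i]        # younger rows shift toward the top
            return name
    return None

sched = [("A", False), ("B", True), ("C", True)]  # top = oldest
assert select_and_compact(sched) == "B"           # oldest ready wins
assert sched == [("A", False), ("C", True)]       # C shifted up
```

Because relative position always reflects relative age, the selection logic reduces to a plain priority encoder over the ready signals.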
The simplest-to-implement alternative to the age-based policy is location-based
scheduling, effectively a random policy, where priority is given to instructions according
to where they are stored inside the scheduler. Upon scheduling an instruction, its position is marked as free and
can be filled with a future instruction. Over time, instruction location provides almost
no information about its relative age.
6.3 Evaluation
This section compares the aforementioned scheduler designs based on their area, op-
erating frequency, IPC, and overall performance, measured in instructions per second
(IPS).
6.3.1 Methodology
We implement all scheduler designs in Verilog following the methodology explained in
Chapter 3. We also use software simulations to estimate the performance of each sched-
uler design. The simulated OoO processor consists of seven pipeline stages, uses 32KB
direct-mapped caches, and a 512-entry bimodal branch predictor. We simulate 2- to
32-entry instruction schedulers.
We use the following notation: CAM-B and CAM refer to schedulers with and without
back-to-back scheduling, regardless of their entry count. CAM-[B]A and CAM-[B]L refer
[Plot omitted: ALUTs vs. entries (2 to 32) for CAM-BA, CAM-A, CAM-BL, and CAM-L.]
Figure 6.3: Number of ALUTs used by the scheduler designs.
to schedulers with age- and location-based (random) selection policies respectively.
6.3.2 Area
Figure 6.3 shows how the area of the various designs scales as a function of entry count
(x-axis is exponential). Area requirements grow at least linearly with the number of
entries. Back-to-back scheduling and selection policy have negligible impact on area. For
example, a 4-entry CAM-BA uses 164 ALUTs while CAM-A uses 161 ALUTs. The area
scaling is primarily determined by the number of scheduler entries, unlike conventional
custom-logic implementations, whose area scaling is wire-dominated.
A Nios II/f processor can be implemented using approximately 1500 ALUTs [13].
Given the results in Figure 6.3, using more than eight entries in a scheduler would
introduce an area overhead of 20% or more. Hence, from the area standpoint, schedulers
with more than eight entries are not advisable.
[Plot omitted: MHz vs. entries (2 to 32) for CAM-BA, CAM-A, CAM-BL, and CAM-L.]
Figure 6.4: Maximum clock frequency of the scheduler designs.
6.3.3 Frequency
Figure 6.4 reports the maximum frequency achieved by each design. Schedulers without
back-to-back scheduling consistently achieve higher frequencies because their wakeup and
selection circuitry form two separate combinational stages rather than one long
path. For example, the 8-entry CAM-A and CAM-L operate at 344MHz
and 404MHz respectively, while the same size CAM-BA and CAM-BL operate at 244MHz
and 264MHz, a difference of 29% and 35% respectively. We also observe frequency losses
when moving from the location- to the age-based policy, as the shifting interconnect is added
to support compaction. This drop is highest, at 20%, between the 16-entry CAM-L and
CAM-A, which operate at 330MHz and 265MHz respectively.
6.3.4 IPC
A lower frequency design is not necessarily a worse performing design. Performance de-
pends also on the number of instructions retired per cycle (IPC). Figure 6.5 reports IPC
[Plot omitted: IPC vs. entries (2 to 32) for CAM-BA, CAM-A, CAM-BL, and CAM-L.]
Figure 6.5: Instructions per cycle achieved using the four scheduler designs.
for the various schedulers. CAM-BA consistently outperforms the rest of the schedulers.
The highest difference observed is 7.5% between the two-entry CAM-BA and CAM-BL
schedulers. Back-to-back scheduling improves IPC as expected. CAM-BA and CAM-
BL are superior to CAM-A and CAM-L respectively. Similarly, the age-based selection
outperforms the location-based selection. Most of the IPC benefits come from back-to-
back scheduling rather than from age-based selection. CAM-BL consistently outperforms
CAM-A beyond the scheduler entry count of two. We conclude that to improve IPC we
need to have a scheduler with age-based selection and back-to-back scheduling. However,
should any of these features need to be sacrificed (e.g., due to frequency constraints), we
find it best to replace the age-based policy with the location-based one rather than
remove back-to-back scheduling support. The IPC advantage that back-to-back scheduling
provides is greater than that of the age-based selection policy.
[Plot omitted: IPS vs. entries (2 to 32) for CAM-BA, CAM-A, CAM-BL, and CAM-L.]
Figure 6.6: Overall performance, in million instructions per second, of the four scheduler designs.
6.3.5 Performance
This section compares the overall performance of various scheduler designs in terms of
instructions per second (IPS). IPS considers both clock frequency and IPC. Figure 6.6
compares the IPS of 2- to 32-entry schedulers. We observe that schedulers without
back-to-back scheduling perform consistently better at all entry counts. Although
these designs were shown to reach lower IPCs, their superior clock frequency provides
higher overall performance. Similarly, dropping age-based selection in favor of the simpler
location-based selection results in higher performance. Increasing the number of scheduler
entries reduces performance. Assuming that the entire processor can operate at the
scheduler speed, one would conclude that a very small scheduler with two entries would
be best.
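The trade-off behind this conclusion is simply that overall performance is the product of IPC and clock frequency. The sketch below illustrates it with the 8-entry frequencies reported in Section 6.3.3; the IPC values are illustrative placeholders in the range of Figure 6.5, not exact measurements.

```python
# Illustration of the IPS trade-off: a lower-IPC scheduler can still
# win on IPS if it clocks sufficiently faster.

def ips(ipc, mhz):
    return ipc * mhz * 1e6  # instructions per second

cam_ba = ips(0.31, 244)  # back-to-back + age-based: higher IPC, slower clock
cam_l  = ips(0.29, 404)  # no back-to-back, location-based: fast clock
assert cam_l > cam_ba    # the frequency advantage dominates
```

Once the clock is capped externally (e.g., at the 303MHz of the rename unit), the frequency advantage of the simpler schedulers disappears and their IPC deficit starts to matter, which is exactly what Figure 6.7 shows.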
Chapter 5 shows that an FPGA-friendly renaming unit, a crucial OoO component,
operates at 303MHz when implemented on the same platform. Thus, in Figure 6.7
we study the effect of limiting the processor clock frequency to 303MHz. In this case
[Plot omitted: IPS vs. entries (2 to 32) for CAM-BA, CAM-A, CAM-BL, and CAM-L, capped at 303MHz.]
Figure 6.7: Overall performance of the scheduler designs when the operating frequency is limited to 303MHz.
using a slightly larger scheduler proves to be better. The four-entry CAM-BA, eight-
entry CAM-A and 16-entry CAM-L are the top three designs. Comparing these three
designs we observe that the performance loss due to decreasing scheduler entry count is
effectively compensated by the age-based selection policy and back-to-back scheduling
support. Additionally, as lower entry counts are desirable considering area usage, we
conclude that the 4-entry CAM-BA is the best configuration to choose, both in terms of
area and performance.
6.4 Related Work
To avoid the low frequency and high area usage of content-addressable memories, Mesa-
Martinez et al. [43] propose SEED, Scalable Efficient Enforcement of Dependences. SEED
uses indexed tables to track instruction dependencies. It uses multi-banked structures
and is shown to scale well on ASICs. However, SEED’s scalability is shown to be poor
on FPGAs as routing overhead among multiple components becomes critical. Fytraki
and Pnevmatikatos [30] and Derek et al. [18] implemented parts of an OoO processor on
an FPGA for the purpose of accelerating processor simulations. This is the first work
that studies how the area, frequency and most importantly performance of CAM-based
instruction schedulers scale with the number of scheduler entries on an FPGA.
6.5 Conclusion
This chapter explored part of the design space of instruction schedulers for out-of-order
soft processors. It examined the effect of scheduler size, instruction selection policy,
and back-to-back scheduling on performance, area and frequency. It showed that in
isolation (no restrictions on the clock frequency), a two-entry scheduler with a location-
based selection policy and no back-to-back scheduling achieves maximum performance.
However, by limiting the processor frequency to 303MHz (the frequency that an FPGA-
friendly register renamer operates at) we showed that a four-entry scheduler with age-
based selection policy and back-to-back scheduling reaches the maximum performance.
The results of this chapter can be used to estimate the best scheduler design under various
operating frequency assumptions.
Chapter 7
NCOR: Non-blocking Cache For
Runahead Execution
7.1 Introduction
This chapter presents NCOR (Nonblocking Cache Optimized for Runahead execution),
an FPGA-friendly alternative to conventional non-blocking caches. NCOR is specifically
designed for Runahead execution on FPGAs. NCOR avoids content-addressable
memories, structures that map poorly to FPGAs. Instead, it judiciously sacrifices some of
the flexibility of a conventional non-blocking cache to achieve higher operating frequency
and thus superior performance when implemented on an FPGA. Specifically, NCOR
sacrifices the ability to issue secondary misses, that is, requests for memory blocks that
map onto a cache line with an outstanding request to memory. Ignoring secondary misses
enables NCOR to track outstanding misses within the cache frames themselves, avoiding
the need for associative lookups. This chapter demonstrates that this simplification
affects neither performance nor correctness under Runahead execution.
This chapter quantitatively demonstrates that the usage of CAMs in conventional
non-blocking caches leads to a low operating frequency and high area usage. It also pro-
vides a detailed description of NCOR and of the underlying design trade-offs. It explains
how NCOR avoids the inefficiencies of conventional designs. It compares the frequency
and area of conventional CAM-based non-blocking caches and NCOR. Finally, it measures
how often secondary misses, those that NCOR does not service, occur in Runahead
execution, showing that they are relatively infrequent. The NCOR cache architecture
proposed in this chapter has been published as [4, 3].
The rest of this chapter is organized as follows: Section 7.2 reviews conventional, CAM-based
non-blocking caches. Section 7.3 provides the rationale behind the optimizations
incorporated in NCOR. Section 7.4 presents the NCOR architecture. Section 7.5 dis-
cusses the FPGA-implementation of NCOR. Section 7.6 evaluates NCOR comparing it
to conventional CAM-based non-blocking cache implementations. Section 7.7 reviews
related work, and Section 7.8 summarizes our findings.
7.2 Conventional Non-Blocking Cache
Non-blocking caches are used to extract Memory Level Parallelism (MLP) and reduce
latency compared to conventional blocking caches that service cache miss requests one at a
time. In blocking caches, if a memory request misses in the cache, all subsequent memory
requests are blocked and are forced to wait for the outstanding miss to receive data from
the main memory. Blocked requests may include requests for data that is already in
the cache or that could be serviced concurrently by modern main memory devices. A
non-blocking cache does not block subsequent memory requests when a request misses.
Instead, these requests are allowed to proceed concurrently. Some may hit in the cache,
while others are sent to the main memory system. Overall, because multiple
requests are serviced concurrently, the total amount of time the program has to wait for
the memory to service its requests is reduced.
To keep track of outstanding requests and to make the cache available while a miss
is pending, Miss Status Holding Registers (MSHRs) are used to store information
regarding all outstanding requests [38]. MSHRs maintain the information necessary
to direct the data received from the main memory to its rightful destination, e.g.,
cache frame or a functional unit. MSHRs can also detect whether a memory request
is for a block for which a previous request is still pending. Such requests can be ser-
viced without issuing an additional main memory request. To detect these accesses and
to avoid duplicate requests, for every request missing in the cache, the entire array of
MSHRs is searched. A matching MSHR means the data has already been requested
from the memory. Such requests are queued and serviced when the data arrives. Search-
ing the MSHRs requires an associative lookup, which is implemented using a Content-
Addressable-Memory (CAM). CAMs map poorly to reconfigurable logic as Section 7.6
shows. As the number of MSHRs bounds the maximum number of outstanding requests,
more MSHRs are desirable to extract more MLP. Unfortunately, the area and latency of
the underlying CAM grow disproportionately with the number of MSHRs, making a
large number of MSHRs impractical.
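The associative MSHR lookup described above can be sketched behaviorally. In hardware the search over all entries is a CAM; the structure and function names below are illustrative, not a real cache's API.

```python
# Behavioral sketch of the MSHR lookup in a conventional non-blocking
# cache: every miss searches all MSHRs associatively. A hit means the
# block is already in flight, so the new request is queued on that
# MSHR instead of generating another memory request.

class MSHR:
    def __init__(self, block_addr):
        self.block_addr = block_addr
        self.waiters = []  # requests serviced when the data arrives

def handle_miss(mshrs, block_addr, request):
    for m in mshrs:                     # associative search (CAM in hardware)
        if m.block_addr == block_addr:
            m.waiters.append(request)   # secondary miss: no new fetch
            return "merged"
    mshrs.append(MSHR(block_addr))      # primary miss: go to memory
    return "sent_to_memory"

mshrs = []
assert handle_miss(mshrs, 0x40, "ld r1") == "sent_to_memory"
assert handle_miss(mshrs, 0x40, "ld r3") == "merged"
assert len(mshrs) == 1
```

It is precisely this all-entries search, done on every miss, that forces a CAM implementation and makes large MSHR files expensive on FPGAs.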
7.3 Making a Non-Blocking Cache FPGA-Friendly
Runahead execution is conceptually an extension to a simple in-order processor. The
simplicity of its architecture is one of the primary reasons that makes Runahead suitable
for reconfigurable fabrics. However, for Runahead to be feasible on these fabrics, the ex-
tensions must come with low overhead. As Section 7.6 shows, conventional non-blocking
cache designs based on MSHRs do not map well onto FPGAs. Accordingly there is a need
to design a low cost non-blocking cache suitable for FPGAs. This work observes that
Runahead execution does not need the full functionality of a conventional non-blocking
cache and exploits this observation to arrive to an FPGA-friendly non-blocking cache
design for Runahead execution.
Conventional non-blocking caches that use MSHRs do not map well on reconfigurable
fabrics. The primary reason is that MSHRs use a CAM to perform associative searches.
As Section 7.6 shows MSHRs lead to low clock frequencies and high area usage. In
addition to MSHRs, the controller of a non-blocking cache is considerably more complex
compared to the one in a blocking cache. The controller is responsible for a wide range of
concurrent operations resulting in large, complex state machines. This work presents the
Non-blocking Cache Optimized for Runahead execution, or NCOR. NCOR has an FPGA-
friendly design that revisits the conventional non-blocking cache design considering the
specific needs of Runahead execution. NCOR does away with MSHRs and incorporates
optimizations for the cache controller and data storage.
7.3.1 Eliminating MSHRs
Using the following observations, NCOR eliminates the MSHRs:
1) As originally proposed, Runahead executes all trigger-miss-independent instruc-
tions during Runahead mode. However, since the results produced in Runahead
mode are later discarded, the processor can choose not to execute some of these in-
structions as it finds necessary. This option of selective execution can be exploited
to reduce complexity by avoiding the execution of instructions that require additional
hardware support. One such class is instructions that cause secondary
misses, that is, misses on already-pending cache frames. Supporting secondary
misses is conventionally done via MSHRs, which do not map well to FPGAs.
2) In most cases servicing secondary misses offers no performance benefit. There are
two types of secondary misses: redundant and distinct. A redundant secondary
miss requests the same memory block as the trigger miss while a distinct secondary
miss requests a different memory block that happens to map to the same cache
frame as the trigger miss. Section 7.6 shows that distinct secondary misses are a
very small fraction of the memory accesses made in Runahead mode. It should be
noted that this fraction is larger in ASIC implementations in which many factors,
e.g., memory latency and pipeline depth, are different.
Servicing a redundant secondary miss cannot directly improve performance further,
as the trigger miss will bring the data into the cache. A redundant secondary miss may
be feeding another load that will miss and that could otherwise be prefetched.
However, this cannot happen: the trigger miss is serviced first, which switches the
processor back to normal execution. On the other hand, distinct secondary
misses could prefetch useful data, but as Section 7.6 shows, this has a negligible
impact on performance.
Based on these observations, the processor can simply discard instructions that cause
secondary misses during Runahead mode while retaining most, and often all, of the
performance benefits of Runahead execution. However, NCOR still needs to identify
secondary misses in order to discard them. NCOR identifies secondary misses by tracking
outstanding misses within the cache frames, using a single pending bit per frame. Whenever
an address misses in the cache, the corresponding cache frame is marked as pending.
Subsequent accesses to this frame observe the pending bit, are identified as
secondary misses, and are discarded by the processor. Effectively, NCOR embeds the
MSHRs in the cache, while judiciously simplifying their functionality to reduce complexity
and retain most of the performance benefits.
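The pending-bit mechanism can be sketched as follows. This is a behavioral model of the miss path only (hits on non-pending frames are not modeled), and the direct-mapped frame indexing and block size are illustrative assumptions.

```python
# Sketch of NCOR's replacement for MSHRs: one pending bit per cache
# frame. A trigger miss sets the frame's bit; any later access that
# finds the bit set is a secondary miss, which is discarded in Runahead
# mode and stalls the processor in normal mode.

NUM_FRAMES = 8  # illustrative; real caches have many more frames

def access(pending, addr, runahead):
    frame = (addr // 16) % NUM_FRAMES   # 16-byte blocks, direct-mapped
    if pending[frame]:
        # Secondary miss: no associative lookup needed, just one bit.
        return "discard" if runahead else "stall"
    pending[frame] = True               # trigger miss: mark frame
    return "miss_issued"

pending = [False] * NUM_FRAMES
assert access(pending, 0x100, runahead=True) == "miss_issued"
assert access(pending, 0x100, runahead=True) == "discard"   # secondary
assert access(pending, 0x100, runahead=False) == "stall"
```

Note that a single bit per frame cannot distinguish redundant from distinct secondary misses; both are treated the same, which is exactly the simplification the two observations above justify.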
7.3.2 Making the Common Case Fast
Ideally, the cache performs all operations in as few cycles as possible. In particular, it is
desirable to service cache hits in a single cycle, as hits are expected to be the common
case. In general, it is desirable to design the controller to favor the frequent operations
over the infrequent ones. Accordingly, NCOR uses a three-part cache controller which
[Figure omitted: block diagram with the Lookup, Request, and Bus controller components, the request Queue between Request and Bus, the Tag, Data, and MetaData storage units, and the system bus connection.]
Figure 7.1: Non-blocking cache structure.
favors the most frequent requests, i.e., cache hits, by dedicating a simple sub-controller
just for hits. Cache misses and all non-cacheable requests (e.g., I/O requests) are handled
by other sub-controllers which are triggered exclusively for such events and are off the
critical path for hits. These requests complete in multiple cycles. The next section
explains the NCOR cache controller architecture in detail.
7.4 NCOR Architecture
Figure 7.1 depicts the basic structure of NCOR. The cache controller comprises Lookup,
Request, and Bus components. NCOR also contains Data, Tag, Request and Metadata
storage units.
7.4.1 Cache Operation
NCOR functions as follows:
• Cache Hit: The address is provided to Lookup which determines, as explained in
Section 7.4.2, that this request is a hit. The data is returned in the same cycle for
Load operations, and is stored in the cache during the next cycle for Store operations.
Other soft processor caches, such as those of Altera Nios II, use two cycles for stores
as well [13].
• Trigger Cache Miss: If Lookup identifies a cache miss, it sends a signal to Request to
generate the necessary requests to handle the miss. Lookup blocks the cache interface
until Request signals back that it has generated all the necessary requests.
Request generates all the necessary requests directed at Bus to fulfil the pending mem-
ory operation. If a dirty line must be evicted, a write-back request is generated first.
Then a cache line read request is generated and placed in the Queue between Request
and Bus.
Bus receives requests through the Queue and sends the appropriate signals to the
system bus. The pending bit of the cache frame that will receive the data is set.
• Secondary Cache Miss in Runahead Mode: If Lookup identifies a secondary cache miss,
i.e., a miss on a cache frame with pending bit set, it discards the operation.
• Secondary Cache Miss in Normal Mode: If Lookup identifies a secondary cache miss in
normal execution mode, it blocks the pipeline until the frame’s pending bit is cleared.
It is possible to have a secondary miss in normal execution mode, as a memory access
initiated in Runahead mode may still be pending. In normal execution the processor
cannot discard operations and must wait for the memory request to be fulfilled.
The following subsections describe the function of each NCOR component.
7.4.2 Lookup
Lookup is the cache interface that communicates with the processor and receives memory
requests. Lookup performs the following operations:
• For cache accesses, Lookup compares the request address with the tag stored in the
Tag storage to determine whether this is a hit or a miss.
• For cache hits, on a load, Lookup reads the data from the Data storage and provides
it to the processor in the same cycle as the Tag access. Reading the Data storage
proceeds in parallel with the Tag access and comparison. Stores, on the other hand,
take two cycles to complete as writes to the Data storage happen in the cycle after the
hit is determined. In addition, the cache line is marked as dirty.
• For cache misses, Lookup marks the cache line as pending.
• For cache misses and non-cacheable requests, Lookup triggers Request to generate
the appropriate requests. In addition, for loads, it stores the instruction metadata,
including the destination register name, in the MetaData storage. Lookup blocks the
processor interface until Request signals it has generated all the necessary requests.
• For cache accesses, whether the request hits or misses in the cache, if the corresponding
cache line is pending, Lookup discards the request if the processor is in Runahead mode.
However, if the processor is in normal execution mode, Lookup stalls the processor.
Note that it is possible to incur a pending line in normal execution mode under the
following scenario. The processor incurs a cache miss and switches to Runahead mode.
In Runahead mode it incurs a second cache miss and initiates another memory request,
setting the corresponding cache line to pending. The initial miss request is returned
and the processor switches back to normal execution mode. While the second miss
initiated in Runahead mode is still pending, the processor incurs another cache miss
that maps to the same pending cache line, hence the processor must stall.
7.4.3 Request
Request is normally idle waiting for a trigger from Lookup. When triggered, it issues
the appropriate requests to Bus through request Queue. Request performs the following
operations:
• Waits in the idle state until triggered by Lookup.
• For cache misses, Request generates a cache line read request. In addition if the evicted
line is dirty, Request generates a cache line write-back request.
• For non-cacheable requests, depending on the operation, Request generates a single
read or write request.
• When all necessary requests are generated and queued, Request notifies Lookup of its
completion and returns to its idle state.
7.4.4 Bus
Bus is responsible for servicing bus requests generated by Request. Bus receives requests
through the request Queue and communicates through the system bus with the main
memory and peripherals. Bus consists of two internal modules:
Sender
Sender sends requests to the system bus. It removes requests from the request Queue
and, depending on the request type, sends the appropriate signals to the system bus. A
request can be of one of the following types:
• Cache Line Read: Read requests are sent to the system bus for each data word of the
cache line. The critical word (word originally requested by the processor) is requested
first. This ensures minimum wait time for data delivery to the processor.
• Cache Line Write-Back: Write requests are sent to the system bus for each data word
of the dirty cache line. Data words are retrieved from Data storage and sent to the
system bus.
• Single Read/Write: A single read/write request is sent to the memory/peripheral
through the system bus.
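The critical-word-first ordering used for cache line reads can be illustrated with a short sketch (Python; the wrap-around order after the critical word is an assumption, since the text only specifies that the originally requested word is sent first):

```python
def cwf_order(critical_word: int, words_per_line: int = 8) -> list:
    """Word indices for a cache-line read, critical word first,
    wrapping around the line (32-byte line, 4-byte words -> 8 words)."""
    return [(critical_word + i) % words_per_line for i in range(words_per_line)]
```

For example, if the processor requested word 3 of a line, the read requests would go out in the order 3, 4, 5, 6, 7, 0, 1, 2, so the processor's word arrives first.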
Receiver
Receiver handles the system bus responses. Depending on the processor’s original request
type, one of the following actions is taken:
• Load from Cache: Upon receipt of the first data word, Receiver signals request com-
pletion to the processor and provides the data. This is done by providing the corre-
sponding metadata from the MetaData storage to the processor. Receiver also stores
all the data words received in the Data storage. Upon receipt of the last word, it stores
the cache line tag in the corresponding entry in the Tag storage, sets the valid bit and
clears both dirty and pending bits.
• Store to Cache: The first data word received is the data required to perform the store.
Receiver combines the data provided by the processor with the data received from the
system bus and stores it in the Data storage. It also stores subsequent data words,
as they are received, in the Data storage. Upon the receipt of the last word, Receiver
stores the cache line tag in the corresponding entry in the Tag storage, sets both valid
and dirty bits and clears the pending bit.
• Non-Cacheable Load: Upon receipt of the data word, Receiver signals request com-
pletion to the processor and provides the data. It also provides the corresponding
metadata from the MetaData storage. Non-cacheable loads are operations, e.g., I/O operations, that the instruction specifies must bypass the data cache and go directly to the system bus.
7.4.5 Data and Tag Storage
The Data and Tag storage units are tables holding cache line data words, tags, and status
bits. Lookup and Bus both access Data and Tag .
7.4.6 Request Queue
Request Queue is a FIFO memory holding requests generated by Request and directed at Bus. Request Queue conveys requests in the order they are generated.
7.4.7 Meta Data
For outstanding load requests, i.e., load requests missing in the cache or non-cacheable
operations, the cache stores the metadata accompanying the request. This data includes
Program Counter and destination register for Load instructions. Eventually when the
request is fulfilled this information is provided to the processor along with the data loaded
from the memory or I/O. This information allows the loaded data to be written to the
register file. MetaData is designed as a queue so that requests are processed in the order
they were received. No information is placed in the MetaData for Stores as the processor
does not require acknowledgements for their completion.
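The MetaData queue described above can be sketched in a few lines. This is an illustrative Python model, not hardware; it assumes a (PC, destination register) tuple per outstanding load:

```python
from collections import deque

class MetaDataQueue:
    """Sketch of the MetaData storage: outstanding loads complete
    in the order their requests were issued (FIFO). Stores enqueue
    nothing, as the processor needs no acknowledgement for them."""
    def __init__(self):
        self.q = deque()

    def issue_load(self, pc: int, dest_reg: int) -> None:
        self.q.append((pc, dest_reg))

    def complete(self, data: int) -> dict:
        # Pair the returned data with the oldest outstanding load's
        # metadata so it can be written to the register file.
        pc, dest_reg = self.q.popleft()
        return {"pc": pc, "dest": dest_reg, "data": data}
```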
7.5 FPGA Implementation
This section presents the implementation of the non-blocking cache on FPGAs. It dis-
cusses the design challenges and the optimizations applied to improve clock frequency
and minimize the area. It first discusses the storage organization and usage and the
corresponding optimizations. It then discusses the complexity of the cache controller’s
state machine and how its critical path was shortened for the most common operations.
7.5.1 Storage Organization
Modern FPGAs contain dedicated Block RAM (BRAM) storage units that are fast and
take significantly less area compared to LUT-based storage. This subsection explains
the design choices that made it possible to use BRAMs for most of the cache storage
components.
Figure 7.2: The organization of the Data and Tag storage units. Each Tag entry packs {unused, valid, Tag} into its lower 24 bits and {unused, dirty, pending} into its upper 8 bits; each Data entry holds one word of a cache line.
Data
Figure 7.2 depicts the Data storage organization. As BRAMs have a limited port width,
the entire cache line does not fit in one entry. Consequently, cache line words are spread,
one word per entry, over multiple BRAM entries. This work targets the Nios-II ISA [13]
which supports byte, half-word, and word stores (one, two, and four bytes respectively).
These are implemented using the BRAM byte enable signal [21]. Using this signal avoids
two-stage writes (read-modify-write) which would increase area due to the added multi-
plexers.
Tag
Figure 7.2 depicts the Tag storage organization. Unlike cache line data, a tag fits in one
BRAM entry. In order to reduce BRAM usage, we store cache line status bits, i.e., valid,
dirty and pending bits, along with the tags.
Despite the savings in BRAM usage by storing cache line status bits along with the
tags, the following problem arises. Lookup makes changes only to the dirty and pending
bits and should not alter valid or Tag bits. In order to preserve valid and Tag bits while
performing a write, a two stage write could be used, in which bits are first read and then
written back. This read-modify-write sequence increases area and complexity and hurts
performance. We overcome this problem by using the byte enable signals. As Figure 7.2
shows, we store valid and Tag bits in the lower 24 bits, and dirty and pending bits in the
higher eight bits. Depending on the tag size, a number of bits are unused in the lower 24-bit portion. Using the byte enable signal, Lookup is able to change only the upper
byte, i.e., dirty and pending bits.
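The packing and byte-enabled update can be modeled as follows. This is a Python sketch; the exact bit positions of valid, dirty, and pending within their bytes are assumptions, as the text only fixes the 24/8-bit split:

```python
# Assumed bit layout of a 32-bit Tag entry (positions are illustrative):
# tag in bits 0-22, valid at bit 23, pending at bit 24, dirty at bit 25.
VALID, PENDING, DIRTY = 1 << 23, 1 << 24, 1 << 25

def pack(tag: int, valid: bool, dirty: bool, pending: bool) -> int:
    entry = (tag & 0x7FFFFF) | (VALID if valid else 0)
    entry |= (DIRTY if dirty else 0) | (PENDING if pending else 0)
    return entry

def write_upper_byte(entry: int, dirty: bool, pending: bool) -> int:
    """Emulate a byte-enabled write: only the upper byte changes, so
    valid and tag survive without a read-modify-write sequence."""
    upper = (DIRTY if dirty else 0) | (PENDING if pending else 0)
    return (entry & 0x00FFFFFF) | upper
```

The point of the byte-enable trick is visible in `write_upper_byte`: the lower 24 bits pass through untouched, which is exactly what the BRAM does in hardware.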
7.5.2 BRAM Port Limitations
Although BRAMs provide fast and area-efficient storage, they have a limited number
of ports. A typical BRAM in today’s FPGAs has two ports available for reading and
writing [21]. Figure 7.3 shows that both Lookup and Bus write and read to/from the
Data and Tag storages. This requires four ports. Our design uses only two ports based
on the following observations: BRAMs can be configured to provide two ports, each
providing both write and read operations over one address line. Although Lookup and
Bus both write and read to/from the Data and Tag at the same time, each only requires
one address line.
Tag
For every access from Lookup to the Tag storage, Lookup reads the Tag , valid , dirty and
pending bits for a given cache line. Lookup also writes to the Tag storage in order to
mark a line dirty or pending . However, reads and writes never happen at the same time
as marking a line dirty (for stores) or pending (for misses) happens one cycle after the
Figure 7.3: Connections between the Data and Tag storages and the Lookup and Bus components.
tag and other status bits are read. Bus only writes to the Tag storage when a cache line
is retrieved from the main memory. Therefore, dedicating one address line to Lookup and
one to Bus is sufficient to access the Tag storage.
Data
For every Lookup access to the Data storage, Lookup either reads or writes a single word, or part of a word. However, Bus may need to write to, or read from, the Data storage at the same time. This occurs if Bus is sending the words of a write-back request while previously requested data is being delivered by the system bus. To avoid this conflict, we restrict
Bus to send a write-back data word only when the system bus is not delivering any
data. Forward progress is guaranteed as outstanding write-back requests do not block
responses from the system bus. This restriction minimally impacts cache performance
as words are sent as soon as the system bus is idle. In Section 7.6.9 we show that even
in the worst case scenario, impact on performance is marginal. With this modification,
dedicating one address line to Lookup and one to Bus is sufficient for accessing the Data
storage.
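This arbitration policy amounts to a simple fixed-priority choice over the shared Data-storage port, sketched below (Python; the names and return values are illustrative, not from the thesis RTL):

```python
def bus_grant(response_valid: bool, writeback_pending: bool) -> str:
    """One-cycle arbitration sketch: a system-bus response always wins
    the Data-storage port; a write-back word is sent only on cycles
    when the bus is not delivering data."""
    if response_valid:
        return "receive_response"
    if writeback_pending:
        return "send_writeback_word"
    return "idle"
```

Because responses are never blocked by pending write-backs, forward progress is guaranteed, matching the argument in the text.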
Figure 7.4: (a) Two-component cache controller. (b) Three-component cache controller.
7.5.3 State Machine Complexity
The cache controller is responsible for cache lookups, performing loads and stores, generating bus requests, and handling bus transactions. Given the number of
operations that the controller handles, in many cases concurrently, it requires a large
and complex state machine. A centralized cache controller can be slow, and has the
disadvantage of treating all requests the same. However, we would like the controller to
respond as quickly as possible to those requests that are most frequent, i.e., requests that
hit in the cache. Accordingly, we partition the controller into sub-components. One could
partition the controller into two components of CPU-side and bus-side, as Figure 7.4(a)
shows. The CPU-side component would be responsible for looking up addresses in the
cache, performing loads and stores, handling misses and non-cacheable operations, and
sending necessary requests to the bus-side component. The bus-side component would
communicate with the main memory and system peripherals through the system bus.
Due to the variety of operations that the CPU-side component is responsible for, we
find that it still requires a non-trivial state machine. The state machine has numerous
Figure 7.5: Lookup and Request state machines. Double-lined states are initial states. Lookup waits for Request completion in the "wait" state. All black states generate requests targeted at the Bus controller.
input signals and this reduces performance. Among its inputs is the cache hit/miss signal,
a time-critical signal due to the large comparator used for tag comparison. As a result,
implementing the CPU-side component as one state machine leads to a long critical path.
Higher operating frequency is possible by further partitioning the CPU-side compo-
nent into two subcomponents, Lookup and Request , which cooperatively perform the
same set of operations. Figure 7.4(b) depicts the three-component cache controller. The
main advantage of this controller is that cache lookups that hit in the cache, the most
frequent operations, are handled only by Lookup and are serviced as fast as possible.
However, this organization has its own disadvantages. In order for Lookup and Request to communicate, e.g., in the case of cache misses, extra clock cycles are required. Fortunately, these actions are relatively rare. In addition, in such cases servicing the request takes on the order of tens of cycles. Therefore, adding one extra cycle delay to
the operation has little impact on performance. Figure 7.5 shows an overview of the two
state machines corresponding to Lookup and Request.
7.5.4 Latching the Address
We use BRAMs to store data and tags in the cache. As BRAMs are synchronous RAMs,
the input address needs to be available just before the appropriate clock edge (rising in
our design) of the cycle when cache lookup occurs. Therefore, in a pipelined processor,
the address has to be forwarded to the cache from the previous pipeline stage, e.g., the
execute stage in a typical 5-stage pipeline. After the first clock edge, the input address
to the cache changes as it is forwarded from the previous pipeline stage. However, the
input address is further required for various operations, e.g., tag comparison. Therefore,
the address must be latched.
Since some cache operations take multiple cycles to complete, the address must be
latched only when a new request is received. This occurs when Lookup’s state machine
is entering the lookup state. Therefore, the address register is clocked based on the next
state signal. This is a time-critical signal and using it to clock a wide register, as is the
case with the address register, negatively impacts performance.
To avoid using this time-critical signal we make the following observations: The
cache uses a latched address in two phases: In the first cycle for tag comparison, and
in subsequent cycles for writes to the Data storage and for request generation. Accordingly, we can use two separate registers, addr_always and addr_lookup, one per phase. At every clock cycle, we latch the input address into addr_always; this register is used for tag comparison in the first cycle. At the end of the first cycle, if Lookup is in the lookup state, the content of addr_always is copied into addr_lookup; this register is used for writes to the cache and for request generation. As a result, the addr_always register is unconditionally clocked every cycle. In addition, we use Lookup's current-state register, rather than its next-state combinational signal, to clock the addr_lookup register. This improves the design's operating frequency.
Table 7.1: Architectural properties of simulated processors.

No. Ways                  1-4
I-Cache Size (Bytes)      32K
D-Cache Size (Bytes)      4-32K
Cache Line Size           32 Bytes
Cache Associativity       Direct Mapped
Memory Latency            26 Cycles
BPredictor Type           GShare
BPredictor Entries        4096
BTB Entries               256
Pipeline Stages           5
No. Outstanding Misses    32
7.6 Evaluation
This section evaluates NCOR. It first compares the area and frequency of NCOR with
those of a conventional MSHR-based non-blocking cache. It then shows the potential
performance advantage that Runahead execution has over an in-order processor using a
non-blocking cache.
7.6.1 Methodology
We use software simulations to estimate the performance of various NCOR configurations.
We follow the methodology explained in Chapter 3. The processor models include a 5-
stage in-order pipelined processor with Runahead execution support. Table 7.1 details
the simulated processor micro-architecture. We also compare the area and frequency
characteristics of NCOR against a conventional, MSHR-based non-blocking cache.
Although NCOR’s architecture is applicable to set-associative caches as well, in this
study we only consider direct-mapped caches. We find that set-associativity substantially
increases cache’s architectural and implementation complexity [67]. Specifically, set-
associative caches require multiple comparison operations for every lookup, which leads
to low clock frequencies. We decide not to include set-associative caches in our study as
we expect substantial frequency loss by making the cache set associative.
7.6.2 Simplified MSHR-Based Non-Blocking Cache
NCOR was motivated as a higher-speed, lower-cost, and lower-complexity alternative to conventional, MSHR-based non-blocking caches. A comparison of the two designs is needed
to demonstrate the magnitude of these advantages. Our experience has been that the
complexity of a conventional non-blocking cache design quickly results in an impractically
slow and large FPGA implementation. This makes it necessary to seek FPGA-friendly
alternatives such as NCOR. For the purposes of demonstrating that NCOR is faster
and smaller than a conventional non-blocking cache, it is sufficient to compare against
a simplified non-blocking cache. This is sufficient, as long as the results demonstrate
the superiority of NCOR and provided that the simplified conventional cache is clearly
faster and smaller than a full-blown conventional non-blocking cache implementation.
The simplifications made to the conventional MSHR non-blocking cache are as follows:
• Requests mapping to a cache frame for which a request is already pending are not
supported. Allowing multiple pending requests targeting the same cache frame
substantially increases complexity.
• Each MSHR entry tracks a single processor memory request, as opposed to all processor requests for the same cache block [38]. This eliminates the need
for a request queue per MSHR entry which tracks individual processor requests,
some of which may map onto the same cache block. In this organization the MSHRs
serve as queues for both pending cache blocks and processor requests. Secondary
misses are disallowed.
• Partial (byte or half-word) loads/stores are not supported.
We use this simplified MSHR-based cache for FPGA resource and clock frequency
comparison with NCOR. In the performance simulations, we use a regular MSHR-based
cache.
7.6.3 Resources
FPGA resources include ALUTs, block RAMs (BRAMs), and the interconnect. In these designs interconnect usage is mostly tied to ALUT and BRAM usage. Accordingly, this
section compares the ALUT and BRAM usage of the two cache designs. Figure 7.6
reports the number of ALUTs used by NCOR and the MSHR-based cache for various
capacities. The conventional non-blocking cache uses almost three times as many ALUTs
compared to NCOR. There are two main reasons why this difference occurs:
1. The MSHRs in the MSHR-based cache must use ALUTs exclusively instead of a
mix of ALUTs and BRAMs due to the nature of CAMs included in their design.
2. The CAM structure of the MSHRs requires a large number of comparators, which consume many ALUTs.
While the savings in ALUTs are small compared to the capacity of today's high-end FPGAs (>100K ALUTs), such savings add up, for example in a multi-processor environment. Additionally, some designs require low-capacity FPGAs, for example in low-budget or low-power applications.
In NCOR, the bulk of the cache is implemented using BRAMs, hence the high area
density and efficiency of the cache design. The vast majority of the BRAMs contain the
cache’s data, tag and status bits. As expected, both caches experience a negligible change
in ALUT usage over different capacities, as most of the cache storage is implemented using
BRAMs.
Figure 7.7 shows the number of BRAMs used in each cache for various capacities.
Compared to the conventional cache, NCOR uses one more BRAM as it stores pending
memory requests in BRAMs rather than in MSHRs.
Figure 7.6: Area comparison of NCOR and MSHR-based caches over various capacities.
Figure 7.7: BRAM usage of NCOR and MSHR-based caches over various capacities.
7.6.4 Frequency
Figure 7.8 reports the maximum clock frequency at which NCOR and the MSHR-based cache can operate, for various capacities. NCOR is consistently faster. The difference is at its highest (58%) for the 4KB caches, with NCOR operating at 329MHz compared to 207MHz for the MSHR-based cache. For both caches, and in most cases, frequency
decreases as the cache capacity increases. At 32KB NCOR’s operating frequency is within
18% of the 4KB NCOR. Although increased capacity results in reduced frequency in most
cases, the 8KB MSHR-based cache is faster than its 4KB counterpart. As the cache
Figure 7.8: Clock frequency comparison of NCOR and a four-entry MSHR-based cache over various cache capacities.
capacity increases, more sets are used, and hence the tag size decreases. Accordingly,
this makes tag comparisons faster. At the same time, the rest of the cache becomes
slower. These two latencies combine to determine the operating frequency which is at
a local maximum at a capacity of 8KB for the MSHR-based cache. However, as cache
capacity continues to grow, any reduction in tag comparison latency is overshadowed by
the increase in latency in other components.
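The tag-size effect follows from direct-mapped address arithmetic: tag bits = address bits − index bits − block-offset bits, so doubling the number of sets removes one tag bit. A small worked sketch (Python; the 32-bit address width is an assumption consistent with the Nios II target):

```python
from math import log2

def tag_bits(capacity_bytes: int, line_bytes: int = 32, addr_bits: int = 32) -> int:
    """Tag width for a direct-mapped cache: address bits minus
    set-index bits minus block-offset bits."""
    sets = capacity_bytes // line_bytes          # direct-mapped: one line per set
    return addr_bits - int(log2(sets)) - int(log2(line_bytes))
```

For 32-byte lines, a 4KB cache has 128 sets (7 index bits, 20 tag bits) while a 32KB cache has 1024 sets (10 index bits, 17 tag bits), so the larger cache compares a narrower tag.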
7.6.5 MSHR-Based Cache Scalability
The NCOR studied in this work is capable of handling up to 32 outstanding requests.
Supporting more outstanding requests in NCOR comes essentially for free, as they are tracked in BRAMs. An MSHR-based cache, however, uses CAMs, and hence LUTs, for storage.
Figure 7.9 reports how the frequency and area of the MSHR-based cache scale with MSHR
entry count. As expected, as the number of MSHRs increases clock frequency drops and
area increases. With 32 MSHRs, the MSHR-based cache operates at only 126MHz and
requires 3269 ALUTs.
Figure 7.9: Area and clock frequency of a 32KB MSHR-based cache with various numbers of MSHRs. The left axis is ALUTs and the right axis is clock frequency.
7.6.6 Runahead Execution
Figure 7.10 reports the speedup achieved by Runahead execution on 1- to 4-way super-
scalar processors modeled in simulation. For this comparison, performance is measured
as the instructions per cycle (IPC) rate. IPC is a frequency-independent metric and thus is useful in determining the range of frequencies at which an implementation can operate and still outperform an alternative. Runahead is able to outperform the corresponding in-order processor by extracting memory-level parallelism, effectively hiding the high main memory latency. For a typical single-issue pipeline (1-way), on average, Runahead
improves IPC by 26%.
As the number of outstanding memory requests increases, higher memory level par-
allelism is extracted, hence higher performance. Figure 7.11 shows how the IPC scales
with increasing the number of outstanding requests. Moving from two outstanding re-
quests to 32, we gain, on average, 7% in IPC. The impact of the number of outstanding
requests is even greater as the memory latency increases, as is expected with the increas-
ing gap between FPGA and DDR clock speeds. We study memory latency impact in
Figure 7.12. When memory latency is lower, increasing the number of outstanding requests improves speedup only marginally, i.e., by 7%. However, with a high memory latency, moving from
Figure 7.10: Speedup gained by Runahead execution on 1- to 4-way superscalar processors. The lower parts of the bars show the IPC of the normal processors. The full bars show the IPC of the Runahead processor.
Figure 7.11: The impact of the number of outstanding requests on IPC. Speedup is measured over the first configuration with two outstanding requests.
two outstanding requests to 32, the speedup doubles, i.e., from 26% to 54%.
Next we compare the speedup gained with NCOR to that of a full-blown MSHR-
based cache. Figure 7.13 compares the IPC of Runahead execution with NCOR and
MSHR-based caches. NCOR achieves slightly lower IPC, less than 4% on average, as it
sacrifices memory level parallelism for lower complexity. However, in the case of sjeng,
MSHR performs worse. MSHR is more aggressive in prefetching cache lines, and in this
Figure 7.12: Speedup gained by Runahead execution with two and 32 outstanding requests, with memory latency of 26 and 100 cycles.
Figure 7.13: Performance comparison of Runahead with NCOR and MSHR-based cache.
case pollutes the cache rather than prefetching useful data.
Finally, we compare NCOR and MSHR-based caches based on both IPC and operating frequency. We simulate two processors with different caches and compare them in terms of runtime, in seconds, to complete the execution of our benchmark set.
Figure 7.14 compares the two systems over a range of cache sizes. NCOR performs
the same task up to 34% faster than MSHR. It should be noted that NCOR with 4KB
capacity performs faster than a 32KB MSHR-based cache.
Figure 7.14: Average runtime in seconds for NCOR and MSHR-based cache.
Figure 7.15: Cache hit ratio for both normal and Runahead execution.
7.6.7 Cache Performance
This section compares cache performance with and without Runahead execution. Figure 7.15 reports the hit ratio of a 32KB cache with and without Runahead execution. Runahead improves the cache hit ratio, by as much as 23% for hmmer and by
about 7% on average. We also report the number of cache Misses Per Kilo Instructions
(MPKI) in Figure 7.16. Runahead reduces MPKI, on average by 39% as it effectively
prefetches useful data into the cache.
Figure 7.16: Number of misses per 1000 instructions executed in both normal and Runahead execution.
7.6.8 Secondary Misses
Runahead execution tied with NCOR achieves high performance even though the cache
is unable to service secondary misses. This section provides additional insight on why
discarding secondary misses has little effect on performance. Figure 7.17 reports, on
average, how many times the cache observes a secondary miss (only misses to a different
memory block) while in Runahead mode. The graph shows that every time the processor
switches to Runahead mode only 0.1 secondary misses are encountered, on average over
all benchmarks. Even if the cache were able to service secondary misses, it would have
generated only 10 memory requests every 100 times that it switches to Runahead mode.
Therefore, discarding secondary misses does not take away a significant opportunity to
overlap memory requests. Even for hmmer which experiences a high number of secondary
misses, Runahead achieves a 28% speedup as Figure 7.10 reports. This shows that non-
secondary misses are in fact fetching useful data.
Figure 7.17: Average number of secondary misses (misses only to different cache blocks) observed per invocation of Runahead execution in a 1-way processor.
7.6.9 Writeback Stall Effect
In Section 7.5.2 we showed that the BRAM port limitation requires NCOR to delay write-backs when the system bus is responding to an earlier cache line read request. Unfortunately, studying the impact on IPC would require a highly accurate DDR2 model in software simulation, which our infrastructure does not include. Instead, we study the most pessimistic scenario, in which every write-back coincides with the data return of a pending cache line read, resulting in a write-back stall. Although possible, this scenario is unlikely to occur and represents the absolute worst case in this study. Figure 7.18 shows that even in this worst-case scenario, Runahead execution with NCOR remains effective, losing less than 2% performance on average.
7.7 Related Work
Related work in soft processor cache design includes work on automatic generation of
caches and synthesizable high performance caches, including non-blocking and traversal
caches. To the best of our knowledge, NCOR is the first FPGA-friendly non-blocking
data cache optimized for Runahead execution.
Figure 7.18: IPC comparison of normal, Runahead, and Runahead with the worst-case scenario for write-back stalls.
The technique used in NCOR for tracking pending cache lines is similar to that proposed by Franklin and Sohi, which stores the MSHR information in the cache line rather than in a separate structure [28]. They add a transit bit to each cache line, indicating that the line is being fetched from main memory. In their scheme, the data stored in a cache line marked as in-transit provides the MSHR information. NCOR, however, uses separate registers to store this information, as it only requires MSHR information for one cache line, hence the area overhead is low.
Yiannacouras and Rose created an automatic cache generation tool for FPGAs [67].
Their tool is capable of generating a wide range of caches based on a set of configuration
parameters, for example cache size, associativity, latency, and data width. The tool is
also useful in identifying the best cache configuration for a specific application.
The PowerPC 470S is a synthesizable soft-core implementation that is equipped
with non-blocking caches. This core is available under a non-disclosure agreement from
IBM [42]. A custom-logic implementation of this core, the PowerPC 476FP, has been produced by LSI and IBM [42]. However, the 470S is not tuned for FPGA implementation and its efficiency on such a platform remains to be studied.
Coole et al. present a traversal data cache framework for soft processors [20]. Traversal caches are suitable for applications with pointer-based data structures. It is shown that, for such applications, traversal caches may improve performance by as much as 27x. Traversal caches are orthogonal to NCOR.
Choi et al. study the design and implementation of multi-ported data caches on FPGAs [19]. They investigate various cache architectures for systems in which multiple components access the cache at the same time. They propose a multi-pumped cache that achieves high performance without partitioning the memory, so the entire cached memory space is available through all cache ports in a single cycle.
To improve area and frequency efficiency, NCOR avoids CAMs. Dhawan and DeHon propose dMHC, a near-associative memory architecture that exploits BRAMs to
store data and uses Bloom filters to track and match keys inside the memory [24]. dMHC
is shown to achieve higher performance compared to a naive, LUT-based implementation
of content addressable memories on FPGAs [24].
7.8 Conclusion
This chapter presented NCOR, an FPGA-friendly non-blocking data cache implementation for soft processors with Runahead execution. It showed that a conventional non-
blocking cache is expensive to build on FPGAs due to the CAM-based structures used
in its design. NCOR exploits the key properties of Runahead execution to avoid CAMs
and instead stores information about pending requests inside the cache itself. In addi-
tion, the cache controller is optimized by breaking its large and complex state machine
into multiple, smaller, and simpler sub-controllers. Such optimizations improve design
operating frequency. A 4KB NCOR operates at 329 MHz on Stratix III FPGAs while
it uses only 270 logic elements. A 32KB NCOR operates at 278 MHz using 269 logic
elements.
Chapter 8
SPREX: Soft Processor with Runahead EXecution
This chapter presents SPREX (Soft Processor with Runahead EXecution), an FPGA-
friendly, synthesizable soft processor with Runahead execution. Conventional Runahead
implementations were proposed for ASIC designs, whose constraints differ from those of
the FPGA fabric; they rely on structures, such as CAMs, that do not map well onto
FPGAs. SPREX avoids the inefficiencies of conventional Runahead designs by exploiting
CFC and NCOR. CFC avoids copying, allowing BRAMs to be used for storage while still
providing checkpointing functionality. NCOR does not use CAMs while it
provides non-blocking data cache functionality required by Runahead execution. In this
chapter we discuss the details of the SPREX implementation and the challenges of tuning
it to map well onto FPGAs. We implement SPREX in Verilog and show that, for our
benchmark set, it improves performance by 9% on average and by as much as 36%. The
architecture of SPREX and its performance study have been published in [5].
The rest of this chapter is organized as follows: Section 8.1 discusses the challenges
in implementing a Runahead processor on FPGAs. Section 8.2 presents the architecture
of SPREX. Section 8.3 presents our experimental evaluation of SPREX using both
software simulation and an actual hardware implementation. Section 8.4 presents related
work and, finally, Section 8.5 concludes this chapter.
8.1 Challenges of Runahead Execution in Soft Processors
A processor with Runahead execution requires additional functionality beyond simple
pipelining (Chapter 2 discussed Runahead execution in more detail), and this
functionality comes with area and frequency overheads. Conventional Runahead designs
were proposed for custom ASIC implementation, in which the implementation trade-offs
differ from those on FPGAs. For example, on FPGAs BRAMs have a limited number of
ports and discrete sizes, whereas arbitrary SRAMs can be implemented in ASICs [21].
One of the key mechanisms Runahead requires is register file checkpointing. Instructions
executed in Runahead mode must not alter the processor's architectural state,
including the register file contents. As the processor switches to Runahead mode, it
checkpoints the register file; that is, it saves a copy of the register file contents,
to be restored when the processor exits Runahead mode.
Saving the register file requires copying its contents to backup storage. Conventional
ASIC implementations checkpoint register files by interleaving checkpoint bits next to
each register file bit [53], which allows mass copying of data in a single cycle. On
FPGAs, however, register files are implemented using BRAMs for area efficiency, and
BRAMs are equipped with a limited number of ports, normally at most two [21].
Therefore, copying the entire register file takes multiple cycles. Multi-cycle checkpointing
delays the processor's entry into Runahead mode, diminishing any performance
benefits. An alternative would be to implement the register file using dedicated registers
in LUTs, which leads to a large and area-inefficient design.
A Runahead processor pipelines multiple memory requests to reduce total data retrieval
time. Consequently, the processor requires a non-blocking data cache. Conventional
non-blocking cache designs are based on CAMs, which map poorly onto FPGAs.
CAMs include an array of large comparators to perform associative lookups. The
resulting FPGA implementation stores the CAM data in dedicated registers in LUTs,
and further uses LUTs to implement a collection of multiplexers that select the matching
cell; the result is slow and large [3, 4].
8.2 SPREX: An FPGA-Friendly Runahead Architecture
This section describes SPREX, an FPGA-friendly Runahead architecture that has been
tailored to map well onto reconfigurable fabrics. SPREX is based on the Nios II ISA
and resembles a Nios II/s implementation [13]. SPREX revisits the conventional Runahead
architecture, considering which functions are needed, how well they map onto FPGAs,
and their corresponding performance benefit. As a result, SPREX keeps just those
functions that are absolutely necessary for Runahead while avoiding others that in most
cases yield negligible performance gains. Figure 8.1 shows a conventional in-order
processor architecture augmented with additional components for Runahead support.
8.2.1 Checkpointing
A Runahead processor uses checkpointing to preserve its architectural state while execut-
ing instructions in Runahead mode. For checkpointing the register file, we use Copy-Free
Checkpointing (CFC) as proposed in Chapter 5. CFC checkpoints the register file with-
out performing any copy operations, therefore is ideal for implementation using BRAMs.
CFC can support multiple checkpoints and could be used as a component for an out-of-
Figure 8.1: Gray components (Fetch, Decode, Execute, Memory, Write, register file)
form a typical 5-stage in-order pipeline. Black components (CFC, NCOR, register
tracking, Runahead control) are added to support Runahead execution.
order soft-core implementation. For Runahead, only one checkpoint of the register file
is required. SPREX is based on the Nios II ISA, which includes 32 registers of 32 bits
each, totaling 1024 bits of storage. With the checkpoint, the total storage needed is
2048 bits, which still fits in one block RAM (an M9K block on Stratix III devices).
Therefore, using CFC to checkpoint the register file incurs no storage overhead in
terms of block RAMs used.
CFC requires a small vector for checkpoint tracking. This vector is stored in dedicated
registers rather than in a block RAM as parallel access is required. Only one bit per
architectural register is needed, or 32 bits in total.
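To make CFC's copy-free behaviour concrete, the following Python sketch models one plausible single-checkpoint organization: each register has two storage slots (modelling a double-width BRAM word), a per-register select bit picks the live slot, and taking a checkpoint saves only the 32-bit select vector. The two-slot layout and all names here are illustrative assumptions, not the actual CFC design.

```python
NUM_REGS = 32

class CFCRegisterFile:
    """Behavioral sketch of copy-free checkpointing with one checkpoint."""
    def __init__(self):
        # Two slots per register, modelling a double-width BRAM word.
        self.slots = [[0, 0] for _ in range(NUM_REGS)]
        self.sel = [0] * NUM_REGS   # which slot holds the architectural value
        self.saved_sel = None       # the checkpoint tracking vector

    def read(self, r):
        return self.slots[r][self.sel[r]]

    def write(self, r, value):
        if self.saved_sel is not None:
            # In Runahead mode: never overwrite the checkpointed slot.
            target = 1 - self.saved_sel[r]
        else:
            target = self.sel[r]
        self.slots[r][target] = value
        self.sel[r] = target

    def checkpoint(self):
        # No data is copied: only the 32-bit select vector is saved.
        self.saved_sel = list(self.sel)

    def restore(self):
        # Exiting Runahead mode: revert to the checkpointed mapping.
        self.sel = self.saved_sel
        self.saved_sel = None
```

Note how both `checkpoint` and `restore` touch only the select vector, which is why CFC can keep the register data itself entirely inside BRAM.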
8.2.2 Non-Blocking Cache
In Chapter 7 we showed that not all of the capabilities a full-blown non-blocking cache
provides offer significant performance benefits. Conventional non-blocking caches are de-
signed to overlap any arbitrary combination of memory references for best performance.
To support all combinations of cache accesses, MSHRs are used. However, during
Runahead mode the processor does not have to execute all instructions; it can selectively
discard instructions that require complex
support. Accordingly, we choose NCOR as the data cache for SPREX. NCOR is an
FPGA-friendly non-blocking cache specialized for Runahead execution. It does not
service secondary misses, that is, misses to cache lines that already have another request
pending. In Chapter 7 we showed that supporting secondary misses has negligible
performance benefits. By removing this support, NCOR is able to replace MSHRs with
single bits stored alongside each cache line.
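As an illustration of this design point, here is a hedged Python sketch of a direct-mapped cache that replaces MSHRs with a single pending bit per line; a miss to a line that already has a request in flight (a secondary miss) is simply discarded. The class and field names are invented for illustration and do not reflect NCOR's actual structure.

```python
class Line:
    def __init__(self):
        self.valid = False    # line holds a filled block
        self.pending = False  # a fill request is in flight for this line
        self.tag = None
        self.data = None

class PendingBitCache:
    """Direct-mapped cache sketch: one pending bit per line, no MSHRs."""
    def __init__(self, num_lines=8):
        self.lines = [Line() for _ in range(num_lines)]
        self.outstanding = []               # addresses sent to memory

    def access(self, addr):
        line = self.lines[addr % len(self.lines)]
        tag = addr // len(self.lines)
        if line.valid and line.tag == tag:
            return "hit"
        if line.pending:
            return "discard"                # secondary miss: drop the request
        line.valid = False                  # primary miss: claim the line
        line.pending = True
        line.tag = tag
        self.outstanding.append(addr)
        return "miss"

    def fill(self, addr):
        """Memory returns the block: clear the pending bit."""
        line = self.lines[addr % len(self.lines)]
        line.valid, line.pending = True, False
        line.data = ("block", addr)
        self.outstanding.remove(addr)
```

The entire MSHR lookup collapses to testing one bit, which is the property that lets NCOR avoid CAMs.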
8.2.3 Extra Decoding
In Runahead mode, not all instructions should be executed. For example, instructions
that change the processor control registers, or that cause exceptions, must be discarded.
Therefore, a small decoder is added to the Decode stage to identify the instructions
that need to be flushed in Runahead mode.
8.2.4 Store Instructions
The processor runs speculatively when in Runahead mode. Therefore, no instruction
may make persistent changes to the processor state, including the data cache.
However, store instructions, if executed, would change data words in the data cache.
For store instructions, we considered the following options:
1. Discard stores altogether: discarding any instruction in Runahead mode is perfectly
safe and does not affect overall program execution correctness [3].
2. Discard stores that hit in the cache, but fetch the cache lines addressed by missing
stores without actually modifying them. (In conventional caches, stores must first
fetch the whole cache line and then modify the part they touch.)
3. Use a speculative store buffer to keep the store values produced in Runahead mode
and prevent them from modifying the memory hierarchy. With this option, subsequent
loads in the same Runahead episode that access the same address are serviced from
the buffered store data. This can potentially lead to a more precise execution in
Runahead mode, and hence more precise memory prefetches.
The first two options are simple to implement; the only difference is in the way
store instructions are serviced in Runahead mode. Performance-wise, the second option
may achieve higher performance by prefetching more cache lines than the first option.
However, it is also possible that such lines pollute the cache, hurting performance.
The third option requires an extra storage unit to keep the store values produced
in Runahead mode. Many store-buffer designs have been proposed in the past [46]. The
typical design contains an associative array, which does not map well onto FPGAs,
imposing area and complexity overheads. Alternative designs sacrifice performance to
shrink the associative array [50].
Section 8.3 shows that prefetching cache lines for stores results in more cache pollution
than useful prefetches. We conclude that the added complexity of using store buffers in
Runahead mode is not justified, and that it is not beneficial to prefetch cache lines
for stores executed in Runahead mode. Therefore, SPREX discards all store instructions
during Runahead mode.
8.2.5 Register Validity Tracking
In order to maximize the prefetching of useful cache lines, program execution must
be followed as accurately as possible in Runahead mode. However, not all data is
available to the processor in Runahead mode; for example, not all registers hold valid
data [25]. Registers end up with bogus values during Runahead mode for two reasons.
First, if the trigger miss is a load instruction, its destination register does not yet hold
valid data, as the load is still pending; hence, any instruction using that register, and
all instructions further down the dependency graph, produce bogus data. Second, the
destination registers of discarded instructions end up with bogus data as well.
Executing with bogus data may lead to prefetching bogus addresses, polluting the
cache. Since instructions execute speculatively in Runahead mode, correctness is
preserved, but bogus prefetches may hurt performance; indeed, Section 8.3 shows that
avoiding bogus prefetches leads to higher performance. Therefore, instructions accessing
bogus data are best identified and discarded. SPREX tracks register validity as it
executes instructions in Runahead mode.
Tracking data validity incurs a small overhead: one additional bit per register. An
instruction is discarded if any of its source registers is marked invalid. In addition,
if an instruction that produces a result is discarded, its destination register is marked
invalid as well. A register becomes valid again when an instruction writes valid data
into it.
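The tracking rules above can be sketched in a few lines of Python; the instruction encoding (source list, destination, discard flag) is a simplification invented for illustration.

```python
NUM_REGS = 32
valid = [True] * NUM_REGS  # one validity bit per architectural register

def runahead_execute(srcs, dst, must_discard=False):
    """Apply the validity-tracking rules to one Runahead-mode instruction."""
    if must_discard or any(not valid[s] for s in srcs):
        if dst is not None:
            valid[dst] = False   # a discarded instruction poisons its destination
        return "discarded"
    if dst is not None:
        valid[dst] = True        # writing valid data revalidates the register
    return "executed"
```

Invalidity thus propagates down the dependency graph automatically, and a later valid write clears it.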
8.3 Evaluation
8.3.1 Methodology
Given the number of parameters involved in the design space of Runahead, we used
software simulations to determine the best configuration and then implemented it in
hardware. We follow the methodology explained in Chapter 3. Table 8.1 reports the
architectural properties of the simulated and implemented processor. Our simulation
infrastructure uses a simplified DDR2 memory model and as a result, the performance
predicted by simulation does not completely match that measured on actual hardware.
In all experiments, we report speedup over a simple 5-stage in-order pipeline. We
use microbenchmarking to tune the base pipeline to match Nios II in terms of IPC.
After choosing the best Runahead configuration, we implement it in Verilog and
synthesize it for the FPGA. SPREX operates at a maximum clock speed of 146 MHz;
we run it at 133 MHz, conveniently half the clock speed of the DDR memory. The
timers and the UART run at 50 MHz.
Table 8.1: Architectural properties of the simulated and implemented processors.

    Pipeline stages:                 5
    Branch predictor:                Bimodal
    Bimodal entries:                 512
    I-cache:                         Blocking
    I-cache size:                    32 KB
    I-cache block size:              32 bytes
    D-cache:                         NCOR
    D-cache size:                    32 KB
    D-cache block size:              32 bytes
    NCOR outstanding requests:       32
    Cache associativity:             Direct-mapped
    Memory latency (simulation):     24 cycles
    Checkpointing:                   CFC
    CFC checkpoints:                 1
Chapter 5 and Chapter 7 showed, using software simulation, that NCOR and CFC
can support Runahead execution effectively [3, 1]. Here we first investigate, through
software simulation, the following additional key design choices for Runahead:
1. How to handle stores
2. Whether to track register validity during Runahead mode
3. How many outstanding requests the data cache should track
We finally measure performance on actual hardware and report the area and frequency
characteristics.
8.3.2 Stores During Runahead
Figure 8.2 compares three simulated Runahead architectures in terms of speedup over a
simple 5-stage pipeline. All three architectures use register validity tracking. The first
Figure 8.2: Store handling during Runahead mode: speedup comparison of the three
choices (discard stores, prefetch stores, prefetch + store buffer); see text for a
description.
architecture discards all store instructions in Runahead mode. The second architecture
prefetches cache lines for stores that miss in Runahead mode, but does not store any
data in the cache. The third architecture includes a store buffer in addition to
prefetching cache lines for stores.
In two cases, bzip2 and h264, we observe a significant loss of performance when store
instructions are included. Apart from a mild performance gain for astar, the other
benchmarks exhibit little to no sensitivity to the inclusion of stores in Runahead
execution. For the quantum benchmark, we observe a negligible performance loss (less
than 2%), which is the result of cache pollution. We conclude that, given the complexity
of store buffers, it is not beneficial to use them, nor to execute stores in Runahead
mode at all. Hence our final SPREX implementation discards store instructions in
Runahead mode, and for the rest of the evaluation we restrict our attention to this
first option.
Figure 8.3: Speedup with and without register validity tracking.
8.3.3 Register Validity Tracking
Since correctness is not required in Runahead mode, we have the option of executing
instructions on bogus data, so tracking register validity is not critical for correctness.
However, discarding bogus instructions can yield higher performance, as more useful
cache lines are prefetched. Figure 8.3 compares the speedup achieved with and without
register validity tracking. Performance is better by 4% on average when register
tracking is enabled. As register tracking comes with little overhead, we opt to include
it.
8.3.4 Number of Outstanding Requests
As memory latency increases, more time is spent in Runahead mode, and the processor
has a higher chance of finding and overlapping memory accesses. However, the number
of memory requests the processor can generate in Runahead mode is limited by the
number of outstanding requests the cache supports.
NCOR uses block RAMs to store information regarding outstanding memory requests.
Figure 8.4 shows NCOR’s block RAM and ALUT usage based on the number of out-
Figure 8.4: NCOR resource usage based on the number of outstanding requests.
Figure 8.5: Speedup comparison of architectures with various numbers of outstanding
requests.
standing requests. NCOR's block RAM usage is insensitive to the number of outstanding
requests in the range 2-64; ALUT usage, however, is directly affected. Figure 8.5
shows performance over the same range. The additional speedup obtained with more
than four outstanding requests is insignificant, so based on these results we use an
NCOR with four outstanding requests.
Figure 8.6: Memory bandwidth usage increase due to Runahead execution.
8.3.5 Memory Bandwidth
SPREX prefetches cache lines in the hope that they will be used in the near future. This
increases pressure on the memory subsystem, potentially increasing power dissipation.
Figure 8.6 reports the increase in memory bandwidth usage due to Runahead execution.
On average, memory bandwidth usage increases by 12%, peaking at 95% for the h264
benchmark. We expect that Runahead does not significantly increase memory bandwidth
usage in the system; however, the actual impact on power dissipation must be measured
to reach a conclusive result, which we leave for future work.
8.3.6 Branch Prediction Accuracy
SPREX encounters and predicts branch instructions in Runahead mode as well as in
normal execution mode. Executing branches in Runahead mode gives the branch predictor
an opportunity to be trained before the actual branch is encountered in normal
execution. Therefore, higher branch prediction accuracy is expected with Runahead
execution. Figure 8.7 compares the prediction accuracy with and without Runahead
execution. For all benchmarks except bzip2 and sjeng, prediction accuracy increases,
on average by 13%; it decreases by 11% and 1% for bzip2 and sjeng, respectively.
Figure 8.7: Comparison of branch prediction accuracy for normal and Runahead executions.
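As a concrete illustration of why Runahead can pre-train the predictor, the following Python sketch models a 2-bit bimodal predictor of the kind listed in Table 8.1; updating it for branches resolved in Runahead mode warms the counters before normal execution revisits the same branches. The modulo indexing and method names are assumptions for illustration, not SPREX's actual predictor logic.

```python
class BimodalPredictor:
    """2-bit saturating-counter bimodal predictor (512 entries in Table 8.1)."""
    def __init__(self, entries=512):
        self.ctr = [1] * entries          # initialized weakly not-taken

    def predict(self, pc):
        return self.ctr[pc % len(self.ctr)] >= 2   # True = predict taken

    def update(self, pc, taken):
        # Called for branches resolved in normal AND Runahead mode,
        # so Runahead episodes pre-train the table.
        i = pc % len(self.ctr)
        self.ctr[i] = min(3, self.ctr[i] + 1) if taken else max(0, self.ctr[i] - 1)
```

A branch first resolved during a Runahead episode flips its counter toward the correct direction, so the later non-speculative encounter is predicted correctly.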
8.3.7 Final Processor Performance
We compare our final SPREX implementation against a simple 5-stage pipeline, using
execution time for our comparison, that is, the number of processor cycles it takes to
execute one billion instructions. Figure 8.8 compares the two architectures. SPREX
consistently outperforms the baseline processor. SPREX's performance advantage is
much higher for bzip2, astar and xalanc; the speedup peaks at 36% for the astar
benchmark. Lower performance gains, ranging from 3% to 5%, are observed for the
other benchmarks.
8.3.8 Runahead Overhead
Runahead comes with overheads in both area and frequency. Table 8.2 reports the
area usage of the entire SPREX processor, including Runahead functionality. The table
Figure 8.8: Speedup gained with Runahead execution over normal execution on actual
FPGA hardware.
also reports the area usage of individual Runahead components. In the case of NCOR,
the numbers in parentheses indicate the overhead over a simple blocking cache. Runahead
requires a total of 324 additional logic elements, 279 registers and 4 block RAMs, which
amount to 19%, 18% and 57% of the processor's total logic element, register and block
RAM usage respectively. Considering the storage required for caches, the block RAM
overhead is much lower: for a SPREX with 32KB caches, the BRAM overhead is only
5%, that is, 4 BRAMs in addition to the 77 BRAMs used in the entire processor.
Previous research has shown that the components used in this work to support Runahead
are fast and area-efficient on FPGAs [1, 4, 3]. Future work can investigate critical
paths in SPREX and tune the architecture further to improve clock frequency and thus
performance.
8.4 Related Work
A few past works have focused on architectures targeting programs with unstructured
ILP, for example superscalar, out-of-order, or Runahead. The Santa Cruz Out-of-Order RISC
Table 8.2: Runahead processor hardware cost breakdown. Numbers in parentheses denote
the overhead for Runahead support.

                                 ALUTs       Registers   Block RAMs
    Entire SPREX                 1774        1518        7 + caches
    Extra decoder                4           -           -
    Register tracking            83          32          -
    CFC                          88          32          -
    NCOR                         412 (149)   323 (215)   4 + caches (4)
    Total Runahead overhead      324 (19%)   279 (18%)   4 (57%)
    Including cache storage      -           -           4 (5%)
Engine, SCOORE [27], is a project targeting a full-blown out-of-order soft processor
with large resource usage (over 100K LUTs). SCOORE shows why out-of-order
implementations do not map well onto FPGAs, resulting in expensive and inefficient
implementations; the primary goal of the SCOORE project is simulation acceleration.
Rosière et al. propose a multi-banked ROB implementation [52], a key component of
out-of-order architectures. Fytraki and Pnevmatikatos implement parts of an out-of-order
processor on an FPGA for the purpose of accelerating processor simulation [30]. To the
best of our knowledge, SPREX is the first soft processor architecture with Runahead
execution.
8.5 Conclusion
This chapter took a first step towards implementing a high-performance soft processor
targeting programs with unstructured ILP. We presented SPREX, an FPGA-friendly,
synthesizable Runahead soft processor architecture, and showed that Runahead provides
significant performance benefits in reconfigurable environments, by up to 36%. We
showed that by sacrificing less important functionality, we can achieve an efficient
architecture for FPGAs while maintaining Runahead's performance benefits. Our next
steps include understanding and eliminating frequency bottlenecks in our implementation
of the architecture; further optimization may allow the processor to run at a higher
clock frequency, possibly matching that of the memory controller, i.e., 266 MHz.
Chapter 9
Concluding Remarks
For embedded systems incorporating soft processors, many architectures have been
proposed for accelerating applications, including VLIW, vector processing, and SIMD.
However, these architectures target programs with regular parallelism that can be
extracted offline. As embedded systems grow in size and complexity, their software
evolves as well, leading to programs with unstructured parallelism that is inherently
difficult, and sometimes impossible, to extract offline.
This thesis considered microarchitectures designed for programs with irregular
parallelism, under a set of constraints unique to FPGAs. Superscalar, out-of-order, and
Runahead processing are the three main architectures proposed for such applications,
all of which have been extensively studied in the ASIC paradigm. This thesis
investigated the potential and feasibility of each architecture for FPGA implementation.
Superscalar processing was shown to be undesirable due to low clock frequency
and high area cost. A narrow out-of-order pipeline, on the other hand, was shown to
be promising. We redesigned and investigated many components of the OoO architecture,
including checkpointing, the register renamer, the instruction scheduler, and the
non-blocking cache. Although the potential for OoO processing on FPGAs was
demonstrated, a fully functioning core was left for future work. Finally, a complete
soft core with Runahead execution was introduced, which achieves high performance at
area costs comparable to those of off-the-shelf in-order soft processors.
This chapter presents the summary of the thesis and research contributions, followed
by directions for future research.
9.1 Thesis Summary
Implementing soft processors comes with various challenges, including maintaining high
clock frequency, low area cost, and low instruction cycle count. Despite their differences,
many processor microarchitectures are based on the conventional 5-stage pipeline.
Accordingly, this thesis studied the challenges in implementing a typical soft processor
on FPGAs and proposed solutions for each challenge faced.
Next, we considered an OoO architecture, which is suitable for accelerating programs
with unstructured parallelism. However, implementing an OoO soft processor comes with
additional challenges compared to a simple 5-stage in-order pipeline: such an architecture
employs additional components and mechanisms that have mostly been studied only for
ASIC implementation. Accordingly, we studied the feasibility of many OoO components
and mechanisms on FPGAs and proposed FPGA-friendly alternatives where conventional
designs mapped poorly to FPGAs. We exploited the unique characteristics of FPGA
resources, such as BRAMs, to maintain high clock frequency and low area cost while
providing the same functionality when redesigning OoO components.
We also studied Runahead execution as a simpler alternative to OoO, and showed that
it provides most of the benefits of OoO processing in an embedded environment. However,
Runahead still requires additional functionality on top of an in-order pipeline. This
thesis studied the requirements of Runahead execution and proposed novel techniques
to provide them while utilizing FPGA resources to achieve high clock frequency and
low area cost.
Finally, SPREX, a complete soft processor implementation with Runahead execution,
was introduced, which provides higher performance at an area cost comparable to that
of the simple in-order processors available today.
More specifically, the contributions of this thesis are as follows:
• This thesis investigated the challenges in implementing soft processors on FPGAs.
As many processor architectures are based on the typical 5-stage pipeline, the
challenges one faces in implementing them are similar. This thesis identified and
categorized various challenges designers face in implementing soft processors, for
example low clock frequency due to data forwarding in the pipeline and hazard
detection, and proposed solutions to overcome such challenges.
• This thesis introduced CFC, a novel copy-free checkpointing mechanism that takes
advantage of the LUT structure and BRAMs on FPGAs to achieve high perfor-
mance and low area cost. Conventional checkpointing mechanisms employ bit-
interleaving techniques to copy checkpoint data between multiple storage banks.
However, CFC avoids data copying and uses sophisticated data indexing to locate
the desired checkpoint data among the many versions stored.
• This thesis investigated the implementation of instruction schedulers for OoO pro-
cessing on FPGAs. It showed that considering the scheduler as part of the whole
processor pipeline, it is beneficial, both in terms of clock frequency and area cost,
to employ a small four-entry scheduler which utilizes a sophisticated, age-based
selection policy and fast, back-to-back scheduling.
• This thesis introduced NCOR, a non-blocking data cache tailored for Runahead
execution on FPGAs. NCOR does away with the CAMs used in conventional
non-blocking cache designs; instead, it stores the metadata used for tracking
pending cache lines in the cache itself. Compared to a full-blown non-blocking
cache, NCOR provides only the functionality that Runahead execution needs,
leading to a smaller and faster design.
• This thesis introduced SPREX, a complete soft processor with Runahead execution.
SPREX utilizes CFC and NCOR for checkpointing and non-blocking functionality
respectively, which are required for Runahead execution. Furthermore, SPREX
is shown to provide higher performance compared to off-the-shelf soft processors,
while using comparable FPGA resources.
9.2 Future Work
This thesis studied the implementation of a fast and small OoO soft processor on FPGAs,
a microarchitecture that had never been implemented on FPGAs with area and frequency
characteristics comparable to those of inorder processors. In this section we discuss
possible future research directions that are enabled by the research done in this thesis.
9.2.1 Out-of-Order Execution
This thesis demonstrated the potential for OoO execution on FPGAs. We showed that
a 1-way OoO soft processor is able to reach higher performance than a simple 5-stage
inorder pipeline. We proposed alternative, FPGA-friendly solutions for checkpointing,
renaming and non-blocking caches for OoO execution. Our solutions show that it is
possible to redesign conventional ASIC-oriented designs of such structures, making them
suitable for FPGA implementation. However, to achieve a complete OoO core, more
components are required and need to be investigated for FPGA implementation.
In OoO execution, Load-Store-Queues (LSQ) are employed to forward data between
memory instructions and to detect Read-After-Write dependency violations [46]. When
a load instruction executes, the LSQ forwards data to it from older uncommitted stores.
To find a matching store, the LSQ must be searched by address; therefore, for fast
access, LSQs are conventionally implemented using CAMs, which are slow and large on
FPGAs. One could investigate the feasibility of implementing LSQs on FPGAs and
possibly redesign them to remove the CAMs from their structure.
Reorder Buffers (ROBs) are another component used in OoO execution, responsible
for tracking the original ordering among the instructions being executed. Using the
ROB, the processor is able to commit instructions in the order they were fetched and
preserve correctness. Rosière et al. have proposed a multi-banked ROB implementation
tuned for FPGAs that is able to use BRAMs [52]. Future work can utilize this ROB
design to form a complete OoO processor.
The next step in this path is to implement a complete OoO soft processor. Even if
all the components of the processor are already designed, integrating them all into one
coherent design is not a trivial task. Future work can target forming a complete OoO
processor utilizing the components proposed in this thesis and achieve a fast and small
implementation. Based on the data gathered in this thesis through complete system sim-
ulations, we expect a complete OoO soft processor to achieve performance improvements
of up to 20% compared to an inorder pipeline.
9.2.2 Multi-Processor Designs
Over the last decade, computer architecture research has shifted towards
multi-processing due to the frequency wall [7]. Multi-processing exploits parallelism
in programs to achieve higher performance, while each processing element operates at
sustainable clock speeds and energy requirements. Proposals for such architectures
include simultaneous multithreading, multicore architectures, and the Cell processor
[60, 26, 48, 32].
The processing elements in a multi-processor architecture can have various microar-
chitectures, including OoO and Runahead. This work introduced SPREX, a complete
single core pipeline with Runahead execution. Future work in the multi-processing area
can utilize SPREX to form a multi-processor system with Runahead execution. However,
including a core with Runahead execution in a multi-processor system introduces new
and interesting challenges. For example, designing a coherent cache that can track
cache accesses in Runahead mode is challenging, as not all memory requests by a
Runahead core are initiated by the program itself. We leave investigating such
challenges to future work.
9.2.3 Power and Energy
This thesis focused on the performance and area cost trade-off when designing processor
components or an entire processor. However, power dissipation and energy consumption
are increasingly prohibitive in embedded systems, following the trends in ASIC
processor design [36]. Therefore, an important future direction is to study the
performance/area/power trade-off when implementing soft processors.
OoO processing requires additional components on top of an inorder pipeline. Every
additional component, even one with low runtime activity, dissipates power and hence
increases the processor's energy footprint. This thesis introduced FPGA-friendly
alternatives for various OoO components considering only performance and area. Future
work can reevaluate these components taking energy consumption into account and propose
solutions adhering to specific energy requirements. For example, in Chapter 6 we showed
that limiting the instruction scheduler's clock frequency changes the optimal design's
size and scheduling policy. Hence, it is reasonable to expect that introducing energy
constraints may also change the optimal design.
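As an illustration of how an energy budget can shift the chosen design point, the sketch below sweeps a handful of scheduler configurations and picks the fastest one that fits the budget. The configuration names and all numbers are invented for illustration; they are not measurements from this thesis.

```python
# Hypothetical design-space sweep: each scheduler design point has a
# maximum clock frequency, a sustained IPC, and an energy cost per
# scheduled operation. Pick the fastest point within the budget.

def best_design(points, energy_budget_nj):
    """Return the fastest design point whose energy fits the budget."""
    feasible = [p for p in points if p["energy_nj"] <= energy_budget_nj]
    if not feasible:
        return None
    # Performance proxy: clock frequency times sustained IPC.
    return max(feasible, key=lambda p: p["fmax_mhz"] * p["ipc"])

# Invented design points for illustration only.
points = [
    {"name": "8-entry age-ordered",  "fmax_mhz": 240, "ipc": 0.90, "energy_nj": 1.0},
    {"name": "16-entry age-ordered", "fmax_mhz": 200, "ipc": 1.00, "energy_nj": 1.6},
    {"name": "16-entry random",      "fmax_mhz": 230, "ipc": 0.95, "energy_nj": 1.3},
]
```

With a loose budget the larger scheduler wins on the frequency-times-IPC proxy, while a tight budget forces the smaller design: the same mechanism by which an energy constraint would change the optimal point identified in Chapter 6.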
In Runahead mode, the processor continues executing instructions while waiting for a
memory operation to complete. These instructions are executed for the sole purpose of
finding subsequent memory operations, and no instruction result is retained. Therefore,
the processor consumes extra energy in Runahead mode compared to an inorder pipeline.
Additionally, not all memory requests sent in Runahead mode are useful, yet they consume energy in the
processor and the memory controller. On the other hand, finding and overlapping
subsequent memory operations reduces overall execution time, and hence saves energy.
Future work can study this complex energy/performance trade-off in Runahead execution.
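A first-order way to frame this trade-off is a simple energy model: Runahead adds power while active but shortens total runtime. The sketch below is a back-of-the-envelope model; the function name and all parameters are hypothetical, not data from this thesis.

```python
# Back-of-the-envelope model of the Runahead energy trade-off: extra
# energy spent executing discarded instructions versus energy saved
# by shortening overall runtime.

def net_energy_joules(p_core_w, p_runahead_w, t_base_s, speedup, runahead_frac):
    """Energy with Runahead minus energy without it (negative = net saving).

    p_core_w      -- core power during normal execution (W)
    p_runahead_w  -- extra power while in Runahead mode (W)
    t_base_s      -- baseline (inorder) runtime (s)
    speedup       -- runtime reduction factor from Runahead (>= 1.0)
    runahead_frac -- fraction of runtime spent in Runahead mode
    """
    t_run = t_base_s / speedup                       # Runahead shortens runtime
    e_base = p_core_w * t_base_s                     # inorder baseline energy
    e_run = p_core_w * t_run + p_runahead_w * t_run * runahead_frac
    return e_run - e_base
```

In this model, Runahead saves energy whenever the speedup outweighs the extra power drawn while in Runahead mode, and wastes energy when the pre-executed requests fail to produce any speedup.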
Bibliography
[1] Kaveh Aasaraai and Andreas Moshovos. Towards a viable out-of-order soft
core: Copy-free, checkpointed register renaming. In 19th Intl. Conf. on Field
Programmable Logic and Applications (FPL), Prague, Czech Republic, September
2009.
[2] Kaveh Aasaraai and Andreas Moshovos. Design space exploration of instruction
schedulers for out-of-order soft processors. In the International Conference on
Field-Programmable Technology (Poster Presentation), 2010.
[3] Kaveh Aasaraai and Andreas Moshovos. NCOR: An FPGA-Friendly nonblocking
data cache for soft processors with runahead execution. International Journal of
Reconfigurable Computing, 2011.
[4] Kaveh Aasaraai and Andreas Moshovos. An efficient non-blocking data cache for soft
processors. In Proc. of the International Conference on ReConFigurable Computing
and FPGAs, December 2010.
[5] Kaveh Aasaraai and Andreas Moshovos. SPREX: A soft processor with runahead
execution. In Proc. of the International Conference on ReConFigurable Computing
and FPGAs, December 2012.
[6] Advanced Micro Devices Inc. AMD-K5 Processor Data Sheet. In Proceedings of the
Hot Chips VIII, 1997.
[7] Vikas Agarwal, M. S. Hrishikesh, Stephen W. Keckler, and Doug Burger. Clock rate
versus IPC: the end of the road for conventional microarchitectures. In Proceedings
of the 27th annual international symposium on Computer architecture, ISCA ’00,
pages 248–259, New York, NY, USA, 2000. ACM.
[8] Haitham Akkary, Ravi Rajwar, and Srikanth T. Srinivasan. Checkpoint pro-
cessing and recovery: Towards scalable large instruction window processors. In
Proceedings of the 36th International Symposium on Microarchitecture, pages 423–
434, 2003.
[9] Altera Corporation. Avalon Bus Specifications. http://www.altera.com/
literature/manual/mnl_avalon_spec.pdf.
[10] Altera Corporation. Embedded Peripherals IP. http://www.altera.com/
literature/ug/ug_embedded_ip.pdf.
[11] Altera Corporation. Functional Description - UniPHY. http://www.altera.com/
literature/hb/external-memory/emi_fd_uniphy.pdf.
[12] Altera Corporation. Logic Array Blocks and Adaptive Logic Modules in Stratix III
Devices.
[13] Altera Corporation. Nios II Processor Reference Handbook, May 2011.
[14] Altera Corporation. Nios II Performance Benchmarks, Dec. 2012.
[15] Arcturus Networks Inc. uClinux. http://www.uclinux.org/.
[16] Jean-Loup Baer and Tien-Fu Chen. Effective hardware-based data prefetching for
high-performance processors. IEEE Trans. Comput., 44(5):609–623, May 1995.
[17] Samson Belayneh and David R. Kaeli. A discussion on non-blocking/lockup-free
caches. SIGARCH Comput. Archit. News, 24(3):18–25, 1996.
[18] Derek Chiou, Dam Sunwoo, Joonsoo Kim, Nikhil A. Patil, William Reinhart, Dar-
rel Eric Johnson, Jebediah Keefe, and Hari Angepat. FPGA-accelerated simula-
tion technologies (fast): Fast, full-system, cycle-accurate simulators. In MICRO
40: Proceedings of the 40th Annual IEEE/ACM International Symposium on
Microarchitecture, pages 249–261, Washington, DC, USA, 2007. IEEE Computer
Society.
[19] Jongsok Choi, Kevin Nam, Andrew Canis, Jason Anderson, Stephen Brown, and
Tomasz Czajkowski. Impact of cache architecture and interface on performance and
area of FPGA-based processor/parallel-accelerator systems. In Proceedings of the 2012
IEEE 20th International Symposium on Field-Programmable Custom Computing
Machines, FCCM ’12, pages 17–24, Washington, DC, USA, 2012. IEEE Computer
Society.
[20] James Coole and Greg Stitt. Traversal caches: A framework for FPGA accelera-
tion of pointer data structures. International Journal of Reconfigurable Computing,
2010:16 pages, 2010.
[21] Altera Corp. Stratix III Device Handbook: Chapter 4. TriMatrix Embedded Memory
Blocks in Stratix III Devices., 2010.
[22] Control Data Corporation. CDC 6600 mainframe computer, 1964.
[23] Fredrik Dahlgren, Michel Dubois, and Per Stenstrom. Sequential hardware prefetch-
ing in shared-memory multiprocessors. IEEE Trans. Parallel Distrib. Syst., 6(7):733–
746, July 1995.
[24] Udit Dhawan and Andre DeHon. Area-efficient near-associative memories on
FPGAs. In Proceedings of the ACM/SIGDA international symposium on Field
programmable gate arrays, FPGA ’13, pages 191–200, New York, NY, USA, 2013.
ACM.
[25] James Dundas and Trevor Mudge. Improving data cache performance by pre-
executing instructions under a cache miss. In ICS ’97: Proc. of the 11th intl. conf.
on Supercomputing, pages 68–75, New York, NY, USA, 1997. ACM.
[26] Susan J. Eggers, Joel S. Emer, Henry M. Levy, Jack L. Lo, Rebecca L. Stamm,
and Dean M. Tullsen. Simultaneous multithreading: A platform for next-generation
processors. IEEE Micro, 17:12–19, 1997.
[27] F. J. Mesa-Martinez et al. SCOORE Santa Cruz Out-of-Order RISC Engine, FPGA
Design Issues. In Workshop on Architectural Research Prototyping (WARP), held
in conjunction with ISCA-33, pages 61–70, 2006.
[28] K. I. Farkas and N. P. Jouppi. Complexity/performance tradeoffs with non-blocking
loads. In Proceedings of the 21st Annual International Symposium on Computer
Architecture, ISCA ’94, pages 211–222, Los Alamitos, CA, USA, 1994. IEEE Com-
puter Society Press.
[29] Freescale. e600 PowerPC Core Reference Manual.
[30] S. Fytraki and D. Pnevmatikatos. RESIM: A trace-driven, reconfigurable ILP pro-
cessor simulator. In Design and Automation Europe, 2008.
[31] J.L. Hennessy and D.A. Patterson. Computer Architecture: A Quantitative
Approach. Computer Architecture, the Morgan Kaufmann Ser. in Computer Ar-
chitecture and Design Series. Elsevier Science, 2006.
[32] H. Peter Hofstee. Power efficient processor architecture and the cell processor. In
Proceedings of the 11th International Symposium on High-Performance Computer
Architecture, HPCA ’05, pages 258–262, Washington, DC, USA, 2005. IEEE Com-
puter Society.
[33] Engin Ipek, Onur Mutlu, José F. Martínez, and Rich Caruana. Self-optimizing
memory controllers: A reinforcement learning approach. In Proceedings of the 35th
Annual International Symposium on Computer Architecture, ISCA ’08, pages 39–50,
Washington, DC, USA, 2008. IEEE Computer Society.
[34] J. E. Smith. A study of branch prediction strategies. In 8th Annual Symposium on
Computer Architecture, pages 135–147, June 1981.
[35] Norman P. Jouppi. Cache write policies and performance. In Proceedings of the
20th annual international symposium on computer architecture, ISCA ’93, pages
191–201, New York, NY, USA, 1993. ACM.
[36] Stefanos Kaxiras and Margaret Martonosi. Computer Architecture Techniques for
Power-Efficiency. Morgan and Claypool Publishers, 1st edition, 2008.
[37] J. Keller. The Alpha 21264 microprocessor architecture. In Proceedings of the 9th
Annual Microprocessor Forum, 1996.
[38] D. Kroft. Lockup-free instruction fetch/prefetch cache organization. In Proceedings
of the 8th Annual International Symposium on Computer Architecture, 1981.
[39] Ashok Kumar. The HP PA-8000 RISC CPU: a high performance out-of-order pro-
cessor. In Proceedings of the Hot Chips VIII, 1996.
[40] Martin Labrecque, Mark C. Jeffrey, and J. Gregory Steffan. Application-specific
signatures for transactional memory in soft processors. ACM Trans. Reconfigurable
Technol. Syst., 4(3):21:1–21:14, August 2011.
[41] Charles Eric LaForest and J. Gregory Steffan. Efficient multi-ported memories for
FPGAs. In Proceedings of the 18th annual ACM/SIGDA international symposium
on Field programmable gate arrays, FPGA ’10, pages 41–50, New York, NY, USA,
2010. ACM.
[42] International Business Machines. IBM and LSI, PowerPC 476FP Embedded Proces-
sor Core and PowerPC 470S Synthesizable Core User’s Manual. http://www-03.
ibm.com/press/us/en/pressrelease/28399.wss.
[43] Francisco J. Mesa-Martínez, Michael C. Huang, and José Renau. Seed: scalable,
efficient enforcement of dependences. In PACT ’06: Proceedings of the 15th
international conference on Parallel architectures and compilation techniques, pages
254–264, New York, NY, USA, 2006. ACM.
[44] A. Moshovos. Checkpointing alternatives for high performance, power-aware proces-
sors. In Proceedings of the 2003 international symposium on Low power electronics
and design, pages 318–321, 2003.
[45] A. Moshovos and G. S. Sohi. Micro-Architectural Innovations: Boosting Processor
Performance Beyond Technology Scaling. Proceedings of the IEEE, 89(11), Novem-
ber 2001.
[46] Andreas Moshovos, Scott E. Breach, T.N. Vijaykumar, and Gurindar S. Sohi. Dynamic
speculation and synchronization of data dependencies. In Proceedings of the
24th International Symposium on Computer Architecture, 1997.
[47] Mayan Moudgill, Keshav Pingali, and Stamatis Vassiliadis. Register renaming and
dynamic speculation: an alternative approach. In Proceedings of the 26th annual
international symposium on Microarchitecture, MICRO 26, pages 202–213, Los
Alamitos, CA, USA, 1993. IEEE Computer Society Press.
[48] Kunle Olukotun, Basem A. Nayfeh, Lance Hammond, Ken Wilson, and Kunyung
Chang. The case for a single-chip multiprocessor. In Proceedings of the seventh
international conference on Architectural support for programming languages and
operating systems, ASPLOS VII, pages 2–11, New York, NY, USA, 1996. ACM.
[49] Subbarao Palacharla and J. E. Smith. Complexity-effective superscalar processors.
In Proceedings of the 24th Annual International Symposium on Computer
Architecture, pages 206–218, 1997.
[50] Il Park, Chong Liang Ooi, and T. N. Vijaykumar. Reducing design complexity of the
load/store queue. In Proceedings of the 36th annual IEEE/ACM International
Symposium on Microarchitecture, 2003.
[51] European Space Research and Technology Centre. LEON3 multiprocessing CPU core.
http://www.gaisler.com/doc/leon3_product_sheet.pdf/.
[52] M. Rosiere, J.-I. Desbarbieux, N. Drach, and F. Wajsburt. An out-of-order super-
scalar processor on FPGA: The reorder buffer design. In Design, Automation Test
in Europe Conference Exhibition (DATE), 2012, pages 1549–1554, March 2012.
[53] E. Safi, A. Moshovos, and A. Veneris. On the latency and energy of checkpointed
superscalar register alias tables. Very Large Scale Integration (VLSI) Systems, IEEE
Transactions on, 18(3):365–377, March 2010.
[54] J. E. Smith and G. Sohi. The Microarchitecture of Superscalar Processors.
Proceedings of the IEEE, 1995.
[55] SPARC International, Inc. The SPARC architecture manual (version
9). Prentice-Hall, Inc., Upper Saddle River, NJ, USA, 1994.
[56] Standard Performance Evaluation Corporation. SPEC CPU 2006. http://www.
spec.org/cpu2006/.
[57] T. N. Buti et al. Organization and implementation of the register-renaming mapper
for out-of-order IBM POWER4 processors. IBM Journal of Research and
Development, 49(1), 2005.
[58] Terasic Inc. Altera DE3 development system with Stratix III FPGA. http://
university.altera.com/materials/boards/de3/.
[59] R. M. Tomasulo. An efficient algorithm for exploiting multiple arithmetic units.
IBM J. Res. Dev., 11(1):25–33, January 1967.
[60] Dean M. Tullsen, Susan J. Eggers, and Henry M. Levy. Simultaneous multithread-
ing: maximizing on-chip parallelism. In 25 years of the international symposia on
Computer architecture (selected papers), ISCA ’98, pages 533–544, New York, NY,
USA, 1998. ACM.
[61] David W. Wall. Limits of instruction-level parallelism. In Proceedings of the fourth
international conference on Architectural support for programming languages and
operating systems, ASPLOS IV, pages 176–188, New York, NY, USA, 1991. ACM.
[62] Henry Wong, Vaughn Betz, and Jonathan Rose. Comparing FPGA vs. custom
CMOS and the impact on processor microarchitecture. In Proceedings of the 19th
ACM/SIGDA international symposium on Field programmable gate arrays, FPGA
’11, pages 5–14, New York, NY, USA, 2011. ACM.
[63] Di Wu, Kaveh Aasaraai, and Andreas Moshovos. Low-cost, high-performance
branch predictors for soft processors. In 23rd International Conference on Field
Programmable Logic and Applications (FPL), September 2013.
[64] Xilinx Inc. MicroBlaze Processor Reference Guide, Mar. 2012.
[65] K.C. Yeager. The MIPS R10000 Superscalar Microprocessor. IEEE Micro, 16(2):28–
40, 1996.
[66] P. Yiannacouras, J. G. Steffan, and J. Rose. VESPA: portable, scalable, and flexible
FPGA-based vector processors. In Proceedings of the 2008 International Conference
on Compilers, Architectures and Synthesis for Embedded Systems, pages 61–70,
2008.
[67] Peter Yiannacouras and Jonathan Rose. A parameterized automatic cache generator
for FPGAs. In Proc. Field-Programmable Technology (FPT), pages 324–327, 2003.
[68] Peter Yiannacouras, J. Gregory Steffan, and Jonathan Rose. Exploration and customization
of FPGA-based soft processors. IEEE Trans. on CAD of Integrated Circuits
and Systems, 26(2):266–277, 2007.