High Performance Soft Processor Architectures for Applications with Irregular Data- and Instruction-Level Parallelism
by
Kaveh Aasaraai
A thesis submitted in conformity with the requirements for the degree of Doctor of Philosophy
Graduate Department of Electrical and Computer Engineering
University of Toronto
© Copyright 2014 by Kaveh Aasaraai
Abstract
High Performance Soft Processor Architectures for Applications with Irregular Data-
and Instruction-Level Parallelism
Kaveh Aasaraai
Doctor of Philosophy
Graduate Department of Electrical and Computer Engineering
University of Toronto
2014
Embedded systems based on FPGAs frequently incorporate soft processors, owing to their flexibility and adaptability to the application. However, soft processors offer only moderate performance compared to hard cores and custom logic; hence, faster-performing soft processors are desirable.
Many soft processor architectures have been studied in the past, including Vector
processors and VLIWs. These architectures focus on regular applications in which it is
possible to extract data and/or instruction level parallelism offline. However, applications
with irregular parallelism only benefit marginally from such architectures. Targeting
such applications, we investigate superscalar, out-of-order, and Runahead execution on
FPGAs. Although these architectures have been investigated in the ASIC world, they
have not been studied thoroughly for FPGA implementations.
We start by investigating the challenges of implementing a typical in-order pipeline on
FPGAs and propose effective solutions to shorten the processor critical path. We then
show that superscalar processing is undesirable on FPGAs as it leads to low clock fre-
quency and high area cost due to wide datapaths. Accordingly, we focus on investigating
and proposing FPGA-friendly OoO and Runahead soft processors.
We propose FPGA-friendly alternatives for various mechanisms and components used
in OoO execution. We introduce CFC, a novel copy-free checkpointing mechanism that exploits
FPGA block RAMs for fast and dense storage. Using CFC, we propose an FPGA-friendly
register renamer and investigate the design and implementation of instruction schedulers
on FPGAs.
We then investigate Runahead execution and introduce NCOR, an FPGA-friendly
non-blocking cache tailored for FPGAs. NCOR removes CAM-based structures used in
conventional designs and achieves a high clock frequency of 278 MHz. Finally, we introduce SPREX, a complete Runahead soft core incorporating CFC and NCOR. Compared
to Nios II, SPREX provides as much as 38% higher performance for applications with
irregular data-level parallelism with minimal area overhead.
Acknowledgements
An important part of my studies was that they were more than just studying at school.
I interacted with many people and learned many life lessons from them, directly and
indirectly. I would like to acknowledge them all for their support, friendship, supervision,
and company. I hope those who have been omitted from these pages will forgive me, for
the omission is not intentional.
I never took my research supervisor and advisor, Prof. Andreas Moshovos, for granted.
He guided me through my research and supported me academically, financially, and
mentally. He managed to create the perfect balance between supervision and freedom of
work, which I truly appreciate. I would advise anyone looking for a Ph.D. supervisor to
make him their first choice.
Halfway through my studies I was accompanied by my now ex-wife, Monia Ghobadi.
Although our relationship ended before my studies did, I must admit she was always
supportive and helpful. I wish her well in her life and thank her for all her support.
My parents played a big role in forming my personality and helping me to get to this
point in my life. They are both academic people and throughout the years encouraged
me in my studies. My mother has always been my go-to person in times of despair and
hardship.
I’d like to thank my committee members, Professors Paul Chow, Greg Steffan, and
Jason Anderson, for their support in my studies. I had the pleasure of taking several
courses with them and finished some interesting projects under their advice. Throughout
my studies I encountered many technical difficulties, and with no hesitation I knew that
I could seek help from them. Prof. Anderson was kind enough to accept to be part of
my committee last minute, and I truly appreciate his support.
My friends have always been a big part of my life. My life during the past several
years has had many good and bad moments and I’m honored to have had such caring
friends to always be beside me. Soheil and Shabnam, the lovely couple who helped
me through my studies and relationship difficulties will always be my dear friends. My
best friend Paige has always been supportive in every respect, and generous with her
attention when I needed it. She encouraged me, pushed me, and picked me up whenever I
was going through difficult times. Diego and I essentially shared the lab space at school.
His company throughout the years has been very helpful and I’m glad to have made such
a good friend at school.
I’d like to thank all my colleagues at school who were always helpful with their
support and most importantly their constructive criticism. I’d like to thank Myrto for
her friendship and support. I’d also like to thank Maryam, Ian, Alhassan, Elias, Elham,
Jason, Patrick, Mitchel, Eric, Davor, Henry and many more who were always in the
lab! I particularly enjoyed having long and intellectually rich conversations about soft
processors with Henry.
My research was highly dependent on equipment primarily donated to our lab by
Altera Corp. I would like to thank them for their support and generosity, which greatly
facilitated my research in this field.
Many faculty members in our group helped me throughout my studies. I had many
interesting and thought-provoking conversations with Prof. Jonathan Rose. Prof. Vaughn
Betz also helped me through my studies and helped me in making connections to the
industry. Prof. Jason Anderson has always been my good friend and very supportive of
my studies, besides being a member of my committee.
My studies would not have been possible without financial support. I was fortunate to
be granted many awards, which helped me focus on my studies. I received support from
programs including OGSST, NSERC-CGS, DCA, and the Graduate Student Endowment
Fund awarded by the Dean of Graduate Studies. Additionally, Prof. Moshovos has
always been generous in supporting me financially to attend conferences and events. In
the latter part of my studies, when I had no financial support from the university, he
supported me completely.
I’d like to thank all the administrative staff at school. Kelly Chan has always been
cheerful and helpful. Jayne Leake coordinated my TAships and never complained about
all the hardship I caused her with late paperwork! Judith Levene and Darlene Gorzo
helped with all the school administrative work and were always available to answer my
never-ending questions!
Contents

1 Introduction
  1.1 Superscalar Execution
  1.2 Out-of-Order Execution
  1.3 Runahead Execution
  1.4 Superscalar vs. OoO and Runahead Execution
  1.5 Objectives
  1.6 Thesis Overview
    1.6.1 Soft-Processor Implementation Challenges
    1.6.2 Copy-Free Checkpointing
    1.6.3 Instruction Scheduling
    1.6.4 Non-Blocking Data Cache
    1.6.5 Soft Processor with Runahead Execution
  1.7 Thesis Contributions
2 Background and Motivation
  2.1 Superscalar Processing
  2.2 Out-of-Order Execution
  2.3 Runahead Execution
  2.4 Narrow vs. Wide Datapath
3 Experimental Methodology
  3.1 Comparison Metrics
    3.1.1 Area
    3.1.2 Frequency
    3.1.3 IPC
    3.1.4 IPS
  3.2 Software Setup
    3.2.1 Software Simulation
    3.2.2 Operating System
    3.2.3 Benchmarks
  3.3 Hardware Setup
    3.3.1 Verilog Implementation
    3.3.2 Component Isolation
    3.3.3 Inorder Processor Resembling Nios II
    3.3.4 The System
    3.3.5 System Bus
    3.3.6 Memory Controller
    3.3.7 Peripherals
4 Soft Processor Implementation Challenges
  4.1 Identifying Implementation Inefficiencies
  4.2 Processor Pipeline
    4.2.1 Fetch Stage
    4.2.2 Decode Stage
    4.2.3 Execute Stage
    4.2.4 Memory Stage
    4.2.5 Writeback Stage
  4.3 Methodology
  4.4 Critical Path Study
  4.5 Eliminating Critical Paths
    4.5.1 Multiplier and Shifter
    4.5.2 Branch Misprediction Detection
    4.5.3 Data Forwarding
    4.5.4 Fetch Address Selection
    4.5.5 Data Operand Specialization
  4.6 Performance
  4.7 Related Work
  4.8 Conclusion
5 CFC: Copy-Free Checkpointing
  5.1 The Need for Checkpointing
  5.2 Register Renaming
    5.2.1 Checkpointed RAT
  5.3 CFC
    5.3.1 The New RAT Structure
    5.3.2 RAT Operations
  5.4 FPGA Mapping
    5.4.1 Flattening
    5.4.2 Multiporting the RAT
    5.4.3 Dirty Flag Array
    5.4.4 Pipelining the CFC
  5.5 Evaluation
    5.5.1 Methodology
    5.5.2 LUT Usage
    5.5.3 Frequency
    5.5.4 Impact of Pipelining on IPC
    5.5.5 Performance
  5.6 Related Work
  5.7 Conclusion
6 Instruction Scheduler
  6.1 Instruction Scheduling
  6.2 CAM-Based Scheduler
    6.2.1 CAM on FPGAs
    6.2.2 CAM Performance
    6.2.3 Back-to-Back Scheduling
    6.2.4 Scheduling Policy
  6.3 Evaluation
    6.3.1 Methodology
    6.3.2 Area
    6.3.3 Frequency
    6.3.4 IPC
    6.3.5 Performance
  6.4 Related Work
  6.5 Conclusion
7 NCOR: Non-blocking Cache For Runahead Execution
  7.1 Introduction
  7.2 Conventional Non-Blocking Cache
  7.3 Making a Non-Blocking Cache FPGA-Friendly
    7.3.1 Eliminating MSHRs
    7.3.2 Making the Common Case Fast
  7.4 NCOR Architecture
    7.4.1 Cache Operation
    7.4.2 Lookup
    7.4.3 Request
    7.4.4 Bus
    7.4.5 Data and Tag Storage
    7.4.6 Request Queue
    7.4.7 Meta Data
  7.5 FPGA Implementation
    7.5.1 Storage Organization
    7.5.2 BRAM Port Limitations
    7.5.3 State Machine Complexity
    7.5.4 Latching the Address
  7.6 Evaluation
    7.6.1 Methodology
    7.6.2 Simplified MSHR-Based Non-Blocking Cache
    7.6.3 Resources
    7.6.4 Frequency
    7.6.5 MSHR-Based Cache Scalability
    7.6.6 Runahead Execution
    7.6.7 Cache Performance
    7.6.8 Secondary Misses
    7.6.9 Writeback Stall Effect
  7.7 Related Work
  7.8 Conclusion
8 SPREX: Soft Processor with Runahead EXecution
  8.1 Challenges of Runahead Execution in Soft Processors
  8.2 SPREX: An FPGA-Friendly Runahead Architecture
    8.2.1 Checkpointing
    8.2.2 Non-Blocking Cache
    8.2.3 Extra Decoding
    8.2.4 Store Instructions
    8.2.5 Register Validity Tracking
  8.3 Evaluation
    8.3.1 Methodology
    8.3.2 Stores During Runahead
    8.3.3 Register Validity Tracking
    8.3.4 Number of Outstanding Requests
    8.3.5 Memory Bandwidth
    8.3.6 Branch Prediction Accuracy
    8.3.7 Final Processor Performance
    8.3.8 Runahead Overhead
  8.4 Related Work
  8.5 Conclusion
9 Concluding Remarks
  9.1 Thesis Summary
  9.2 Future Work
    9.2.1 Out-of-Order Execution
    9.2.2 Multi-Processor Designs
    9.2.3 Power and Energy
Bibliography
List of Tables

3.1 SoinSim Parameters
4.1 Processor critical paths.
5.1 Architectural properties of the simulated processors.
5.2 LUT and BRAM usage and maximum frequency for 4 and 8 checkpoints on different platforms.
7.1 Architectural properties of simulated processors.
8.1 Architectural properties of the simulated and implemented processors.
8.2 Runahead processor hardware cost breakdown. Numbers in parentheses denote overhead for Runahead support.
List of Figures

2.1 A typical out-of-order pipeline using register renaming and a reorder buffer.
2.2 (a) In-order execution of instructions resulting in stalls on cache misses. (b) Overlapping memory requests in Runahead execution.
2.3 Area and maximum frequency of a minimalistic pipeline for 1-, 2-, and 4-way superscalar processors.
2.4 IPC performance of superscalar, out-of-order, and Runahead processors as a function of cache size.
4.1 The typical 5-stage pipeline implemented in this work. Dotted lines represent control signals.
4.2 Multiplication and shift/rotate operations before (a) and after (b) optimization.
4.3 Branch misprediction detection before (a) and after (b) optimization. Dashed boxes represent registers.
4.4 Forwarding data path before and after optimization in the pipeline. Dashed line is the added forwarding path.
4.5 Next address selection data path in the Fetch stage before (a) and after (b) optimization. Dashed boxes represent registers.
4.6 IPC and relative IPS improvement for the processor after removing critical paths.
5.1 Epochs illustrated in a sequence of instructions.
5.2 CFC main structure consists of c+1 tables and a dirty flag array.
5.3 Finding the most recent mapping: the most recent mapping for register R1 is in the second column (01), while for R2, it resides in the fourth (11).
5.4 Performance impact of an extra renaming stage.
5.5 Overall processor performance in terms of IPS using various checkpointing schemes.
6.1 An example sequence of instructions being scheduled. The current state of the processor is presumed as instruction A being in the memory stage, while instructions B and C are in the scheduler, waiting to be selected for execution.
6.2 CAM scheduler with back-to-back scheduling and compaction. OR gates provide back-to-back scheduling. The dashed gray lines show the shifting interconnect, which preserves the relative instruction order inside the scheduler for the age-based policy. The selection logic prioritizes instruction selection based on location, i.e., it is a priority encoder.
6.3 Number of ALUTs used by scheduler designs.
6.4 Maximum clock frequency of the scheduler designs.
6.5 Instructions per cycle achieved using four scheduler designs.
6.6 Overall performance in million instructions per second of four scheduler designs.
6.7 Overall performance of scheduler designs when the operating frequency is limited to 303 MHz.
7.1 Non-blocking cache structure.
7.2 The organization of the Data and Tag storage units.
7.3 Connections between Data and Tag storages and the Lookup and Bus components.
7.4 (a) Two-component cache controller. (b) Three-component cache controller.
7.5 Lookup and Request state machines. Double-lined states are initial states. Lookup waits for Request completion in the "wait" state. All black states generate requests targeted at the Bus controller.
7.6 Area comparison of NCOR and MSHR-based caches over various capacities.
7.7 BRAM usage of NCOR and MSHR-based caches over various capacities.
7.8 Clock frequency comparison of NCOR and of a four-entry MSHR-based cache over various cache capacities.
7.9 Area and clock frequency of a 32KB MSHR-based cache with various numbers of MSHRs. The left axis is ALUTs and the right axis is clock frequency.
7.10 Speedup gained by Runahead execution on 1- to 4-way superscalar processors. The lower parts of the bars show the IPC of the normal processors. The full bars show the IPC of the Runahead processor.
7.11 The impact of the number of outstanding requests on IPC. Speedup is measured over the first configuration with two outstanding requests.
7.12 Speedup gained by Runahead execution with two and 32 outstanding requests, with memory latencies of 26 and 100 cycles.
7.13 Performance comparison of Runahead with NCOR and an MSHR-based cache.
7.14 Average runtime in seconds for NCOR and an MSHR-based cache.
7.15 Cache hit ratio for both normal and Runahead execution.
7.16 Number of misses per 1000 instructions executed in both normal and Runahead execution.
7.17 Average number of secondary misses (misses only to different cache blocks) observed per invocation of Runahead execution in a 1-way processor.
7.18 IPC comparison of normal, Runahead, and Runahead with the worst-case scenario for write-back stalls.
8.1 Gray components form a typical 5-stage in-order pipeline. Black components are added to support Runahead execution.
8.2 Store handling during runahead mode: speedup comparison (see text for a description of the three choices).
8.3 Speedup with and without register validity tracking.
8.4 NCOR resource usage based on the number of outstanding requests.
8.5 Speedup comparison of architectures with various numbers of outstanding requests.
8.6 Memory bandwidth usage increase due to Runahead execution.
8.7 Comparison of branch prediction accuracy for normal and Runahead execution.
8.8 Speedup gained with Runahead execution over normal execution on an actual FPGA.
Chapter 1
Introduction
Embedded systems increasingly use FPGAs due to their superior cost and flexibility
compared to custom integrated circuits. There are several reasons why FPGA-based
systems often include processors: certain tasks are best implemented in processors, cost-
or performance-wise, and processor-based implementations can be faster and easier to
develop and debug than custom logic. If history is any indication of the future of
embedded systems, it is safe to expect that their applications will evolve, increasing in
complexity, footprint, and functionality (cell phone designs, for example, have followed
similar trends). Accordingly, it is important to develop higher-performing embedded
processors.
FPGA systems often incorporate two types of processors: soft and hard. Soft processors
are implemented using the FPGA fabric itself. Hard cores, on the other hand, are
fabricated separately and are either embedded in or external to the FPGA. Hard cores
can offer higher performance than soft processors. However, both options
have their shortcomings: Embedded hard cores are wasted when not needed and are
inflexible. External hard cores increase system cost and suffer from increased inter-chip
communication latency. Accordingly, there is a need to develop soft cores that provide
high performance.
Processor performance improvement techniques generally rely on increasing the concurrency of instruction processing. Such techniques include pipelining, superscalar [54],
Very Long Instruction Word (VLIW) [54], Single Instruction Multiple Data (SIMD), and
Vector execution [54, 66]. VLIW, SIMD, and Vector execution exploit instruction-level
parallelism that can be extracted by the programmer or the compiler. When this is
possible, each of these alternatives has specific advantages.
However, there are applications where parallelism is less structured and much more
difficult to extract. It is not possible to extract such irregular parallelism offline; rather,
a dynamic architecture is required. Such architectures dynamically identify and extract
instruction-level and data-level parallelism in the code at runtime. Examples of such
architectures are superscalar, out-of-order (OoO), and Runahead execution [25, 54]. The
next three sections review Superscalar, OoO, and Runahead architectures and comment
on their suitability for a soft-core implementation.
1.1 Superscalar Execution
Superscalar processors use multiple datapaths operating in parallel to increase instruction
throughput. They attempt to overlap the execution of two or more adjacent instructions.
An n-way superscalar processor can execute up to n consecutive instructions at the same
time. To do so, it effectively replicates the pipeline, including all control and data paths.
Superscalar processors are limited in the amount of parallelism they can extract from
the code because the instructions running in parallel must be spatially close to each other.
Furthermore, a wide datapath results in complex interconnect, which leads to inefficient
implementations on FPGAs, as we discuss in Section 1.4. In addition to the
datapath, the control plane also grows in complexity with the widened datapath and
leads to lower clock frequency.
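The adjacency constraint can be made concrete with a small sketch. The following Python model is an assumed, illustrative example (not the thesis's hardware): a hypothetical 2-way machine that pairs an instruction with its immediate successor only when the successor does not read the first instruction's result.

```python
# Illustrative model of 2-way superscalar pairing (assumed example):
# an instruction may dual-issue with its immediate neighbour only when
# the neighbour does not read the first instruction's destination.

instrs = [
    ("I1", "r1", ["r5"]),  # (name, destination, sources)
    ("I2", "r2", ["r1"]),  # reads r1, so it cannot pair with I1
    ("I3", "r3", ["r6"]),
    ("I4", "r4", ["r7"]),  # independent of I3, so I3 and I4 could pair
]

def dual_issue_pairs(instrs):
    """Greedily group adjacent instructions into issue pairs."""
    pairs, i = [], 0
    while i < len(instrs):
        # Pair only when the second instruction does not read the first's result.
        if i + 1 < len(instrs) and instrs[i][1] not in instrs[i + 1][2]:
            pairs.append((instrs[i][0], instrs[i + 1][0]))
            i += 2
        else:
            pairs.append((instrs[i][0],))
            i += 1
    return pairs

print(dual_issue_pairs(instrs))  # [('I1',), ('I2', 'I3'), ('I4',)]
```

Even though I3 and I4 are mutually independent, I1 issues alone because its only neighbour depends on it: parallelism that is not spatially adjacent is lost.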
1.2 Out-of-Order Execution
Out-of-Order (OoO) processors exploit instruction-level and data-level parallelism to
achieve high performance. OoO processors allow instructions to execute in any order
that does not violate program semantics [54, 45]. OoO can extract more parallelism than
superscalar execution because in OoO the instructions executing in parallel do not have
to be adjacent in the program order. Furthermore, OoO execution can extract more
parallelism using register renaming and speculative execution [47].
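As a toy illustration of this dynamic reordering (an assumed example, not the renaming hardware developed later in this thesis), the following Python sketch issues whichever pending instruction has all of its source operands available, oldest first:

```python
# Illustrative out-of-order issue (assumed example): instructions issue
# when their data dependences are satisfied, not in program order.
# Each instruction is (name, destination, sources).

program = [
    ("I1", "r1", ["r0"]),  # r0 is the result of a pending load: I1 must wait
    ("I2", "r2", ["r1"]),  # depends on I1
    ("I3", "r3", ["r6"]),  # independent of the load
]

produced = {"r6"}          # r6 is already available; r0 is still in flight
pending = list(program)
order = []
while pending:
    ready = [i for i in pending if all(s in produced for s in i[2])]
    if not ready:
        produced.add("r0")  # the outstanding load finally completes
        continue
    name, dest, _ = ready[0]  # oldest-ready-first (age-based selection)
    pending.remove(ready[0])
    produced.add(dest)
    order.append(name)

print(order)  # ['I3', 'I1', 'I2']: I3 issues ahead of program order
```

I3 issues while the load feeding I1 is still outstanding, even though I3 is younger in program order, which is exactly the parallelism a superscalar front end restricted to adjacent instructions cannot reach.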
OoO execution is orthogonal to superscalar processing. As such, when combined
with multiple datapaths, OoO execution can offer higher performance than superscalar
design alone. However, this thesis shows that even a 1-way OoO processor provides performance
comparable to that of a wide superscalar processor.
In mid- to high-end hard cores, OoO execution has been the architecture of choice
since the 1990s, but not so for soft cores [29, 55, 65, 39, 37, 6]. Implementing support
for OoO execution in FPGAs requires a prohibitively large amount of on-chip resources
relative to the potential gain in performance. Other techniques such as VLIW may
provide comparable performance at less expense in terms of on-chip resources. OoO
structures have been developed for Application Specific Integrated Circuits (ASIC) and
for this reason are not necessarily well-suited for the FPGA substrate. However, it may
be possible to port most of the benefits of OoO execution while using structures that are
a better fit to the FPGA substrate. Accordingly, this thesis investigates and develops
FPGA-friendly OoO components as a step toward a practical and efficient OoO soft core.
1.3 Runahead Execution
Runahead execution is a technique that allows the processor to exploit memory-level
parallelism to achieve higher performance. Runahead extends a conventional inorder
pipeline with the ability to continue execution when a memory operation misses in the
cache. With Runahead, the processor continues execution with the hope of finding more
useful misses that can be issued concurrently and thus finish earlier.
Runahead can be considered a lower-complexity alternative to OoO architectures.
In fact, Runahead has been shown to offer most of the benefits of OoO execution [25].
Runahead relies on the observation that often most of the performance benefits of OoO
execution result from allowing multiple outstanding main-memory requests.
Originally, Runahead’s effectiveness was demonstrated for high-end general-purpose
systems with main-memory latencies of a few hundred processor cycles [33]. This the-
sis demonstrates that Runahead remains effective even under the lower main memory
latencies of a few tens of cycles that are observed in FPGA-based systems today.
Runahead, as originally proposed, requires adding components to a basic in-order pipeline. In particular, Runahead relies on a non-blocking data cache, which does not map well onto FPGAs because conventional designs use highly-associative Content-Addressable Memories (CAMs) [25]. Implementing CAMs on FPGAs increases area and decreases clock frequency. This thesis proposes FPGA-friendly alternatives for Runahead components which deliver performance comparable to CAM-based
techniques without a significant increase in area and without a significant degradation in
clock frequency.
1.4 Superscalar vs. OoO and Runahead Execution
General-purpose processors combine OoO and Runahead with superscalar execution because resources are plentiful (more than a billion transistors per chip is common today).
When maintaining low resource usage is important, as it is on an FPGA substrate, OoO
or Runahead can be used on a narrow datapath. In Chapter 2 we demonstrate that single-datapath, or single-issue, OoO and Runahead execution have the potential to improve performance over wide superscalar processors, which require more area and
cause reductions in clock frequency when implemented in FPGAs. OoO, for example,
improves performance over simple pipelining by not stalling when an instruction requires
additional cycles to execute. Waiting for the main memory is a major source of delay
even for soft cores.
We also demonstrate that increasing the number of datapaths in the processor leads
to considerably larger area and lower clock frequencies. In fact, we show that by moving
from a 1-way superscalar to a 4-way superscalar processor, the area requirement increases
by a factor of 10, and clock frequency drops by 33%, while the gain in IPC is only 10%.
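The net effect of these figures is easy to check: IPS scales as the product of IPC and clock frequency. A back-of-the-envelope sketch using the numbers quoted above (illustrative arithmetic only, not new measurements):

```python
# Relative performance of a 4-way vs. a 1-way superscalar, using the
# figures quoted in the text (illustrative arithmetic, not new data).
def relative_ips(ipc_gain, freq_ratio):
    """IPS scales as IPC * frequency; return 4-way IPS relative to 1-way."""
    return (1.0 + ipc_gain) * freq_ratio

area_factor = 10.0  # the 4-way uses roughly 10x the LUTs of the 1-way
ips_4way = relative_ips(ipc_gain=0.10, freq_ratio=1.0 - 0.33)

# Despite 10x the area, the 4-way delivers less than 1x the IPS.
print(f"4-way relative IPS: {ips_4way:.2f}")  # ~0.74
print(f"IPS per unit area:  {ips_4way / area_factor:.3f}")
```

That is, the 10% IPC gain does not come close to compensating for the 33% frequency loss, let alone the tenfold area cost.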
We conclude that OoO and Runahead have the potential to improve performance
beyond the level of performance provided by simple pipelining, while avoiding the super-
linear costs of datapath replication. Several challenges remain for this potential to be
exploited effectively. First, performance depends not only on IPC but also on the oper-
ating frequency. Hence, the inclusion of OoO and Runahead must be done in a manner
that limits any reduction in the clock frequency. Second, OoO and Runahead introduce
additional structures into the implementation of a basic 1-way in-order pipeline. Such
additional resources in a single-datapath implementation must not increase the total area
beyond that of a multiple-datapath design; otherwise, the use of OoO and Runahead is no longer advantageous.
1.5 Objectives
Ideally, existing OoO and Runahead implementations would map easily onto FPGAs and
would achieve reasonable performance at reasonable resource cost. However, the FPGA substrate differs from that of ASICs and exhibits different trade-offs.
Accordingly, it is necessary to revisit conventional OoO and Runahead implementations
while taking the unique characteristics of FPGAs into consideration.
This thesis takes steps towards understanding whether with FPGA-friendly designs
it will be possible to build OoO cores that are performance- and resource-effective. The
goal is to revisit individual components involved in the OoO architecture and propose
FPGA-friendly alternatives.
One objective of this thesis is to propose a complete pipeline that supports Runahead
execution. The proposed design should not impose significant area overhead compared
to an in-order pipeline and it should offer reasonable speedup. Section 1.7 summarizes
the contributions of this thesis in more detail.
1.6 Thesis Overview
Chapter 2 provides background on superscalar, OoO, and Runahead execution, and motivates exploring narrow architectures as opposed to wide datapaths. Chapter 3 discusses
the experimental methodology followed in this thesis. Chapter 4 investigates soft pro-
cessor implementation challenges and provides solutions to remove most of the identified
difficulties. Chapter 5 proposes a novel checkpointing mechanism, a key component used
in both OoO and Runahead architectures. Chapter 6 studies instruction scheduler de-
signs for OoO architectures and proposes a configuration to be implemented on FPGAs,
offering the best performance for the least area cost. Chapter 7 proposes NCOR, a novel
non-blocking data cache optimized for Runahead execution on FPGAs. Chapter 8 in-
troduces SPREX, a complete soft processor with Runahead execution support. Finally,
Chapter 9 offers concluding remarks and outlines future research directions. The re-
mainder of this section offers an overview of each chapter and its corresponding technical
contribution.
1.6.1 Soft-Processor Implementation Challenges
Similar to any other embedded design, soft processors face their own unique challenges.
The first challenge in implementing soft processors is that the timing-critical components
of a typical pipeline must be identified. Chapter 4 investigates the challenges in imple-
menting a conventional 5-stage inorder pipeline on the FPGA substrate. It starts with a
straightforward soft-processor implementation. It then systematically identifies the crit-
ical paths of the implementation and classifies them into those of the control planes and
data planes.
There are two major challenges in identifying the critical path of a processor. First,
for various reasons, such as the inherent randomness in the synthesis and place-and-route algorithms, critical paths are inter-dependent and a single path may not always constitute the critical path of a design. Second, it is an open question how to properly identify the next critical path without removing the first critical path.
In this thesis, the choice is made to use the longest path reported by the timing-
analysis tool as the critical path for a particular implementation synthesized by a computer-
aided design tool. This approach enables the selection of only a single path in the presence
of many tightly-coupled paths. Next, in order to identify the next longest path, we artifi-
cially remove the current critical path by introducing registers in the middle of the path.
This technique allows us to remove the path without having to introduce extra logic into
the design. The insertion of registers causes the behavior of the implementation to differ
from the design specifications, and is therefore not strictly correct. Nonetheless, this
approach reflects the focus of this chapter on path identification.
Chapter 4 then moves on to propose solutions for eliminating the critical paths of the processor while preserving its correctness. It proposes various optimizations and shows that processor performance can be greatly improved by applying them.
1.6.2 Copy-Free Checkpointing
One of the key mechanisms used in almost all modern processor architectures is specu-
lative execution. Speculative execution allows the processor to continue execution when
the outcome of a particular operation, such as a branch instruction, takes multiple cy-
cles to be determined. The processor predicts the outcome of such an operation and
continues execution using the predicted outcome. Once the actual result is available, it
is compared with the prediction. A correct prediction allows the processor to continue
execution with no penalty. An incorrect prediction, on the other hand, introduces a
penalty that stems from having to discard the results of any speculative computations
and perform computations again based on the actual result.
In order to support speculative execution, many approaches have been proposed.
One popular approach is checkpointing, which dictates that a copy of the processor state
must be saved, i.e., checkpointed, for every prediction made. Later, if a prediction is
found to be incorrect, the processor state is restored from the checkpoint corresponding
to that prediction. The storage required to implement this technique is proportional to
the number of checkpoints, i.e., the scope of permissible speculation. To provide good
performance, checkpointing requires the saving and restoring of state to be performed
quickly.
For soft processors on FPGAs, checkpointing presents a specific implementation challenge. Rapid copying of the processor state (ideally in a single cycle) is complicated
by the typical use of FPGA block RAM (BRAM) to store the processor state instead of
flip-flops in logic blocks. BRAMs are high-speed, area-efficient memory arrays that signif-
icantly increase design efficiency. However, BRAM components have a limited number of
access ports. Consequently, copy operations to save or restore processor state would effec-
tively be serialized, resulting in poor performance for checkpointing. Chapter 5 proposes
CFC, a novel Copy-Free Checkpointing mechanism which provides the full functionality
of conventional checkpointing, while avoiding data copying. CFC is well-suited for FPGA
implementation as it addresses the port limitations of BRAMs.
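The port limitation can be illustrated with a simple cycle-count model of copy-based checkpointing (an illustrative sketch; the register and port counts below are assumed values, not taken from a specific design):

```python
def copy_checkpoint_cycles(num_registers, write_ports):
    """Cycles to copy the register file into checkpoint storage when at
    most write_ports values can be written per cycle (ceiling division)."""
    return -(-num_registers // write_ports)

# Flip-flop state with a fully parallel copy path: one cycle.
print(copy_checkpoint_cycles(num_registers=32, write_ports=32))  # 1
# BRAM-based state with two ports: the copy is serialized.
print(copy_checkpoint_cycles(num_registers=32, write_ports=2))   # 16
```

A 16-cycle stall per prediction would erase much of the benefit of speculation, which is what motivates avoiding the copy altogether.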
1.6.3 Instruction Scheduling
An OoO processor executes instructions in any order that does not violate data depen-
dencies. Instructions are placed in a pool, and those with available operands are chosen to
be executed. The instruction scheduler in an OoO pipeline is responsible for identifying
and issuing ready-to-execute instructions.
Chapter 6 investigates various instruction scheduler designs and proposes the best
configuration for FPGA implementation, considering both performance and area cost. It
considers a range of scheduling policies, number of entries, and the cost-effectiveness of
back-to-back scheduling on the FPGA substrate. It shows that in a practical implementation, the best performance is achieved with a four-entry scheduler that incorporates back-to-back scheduling and an age-based selection policy.
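The selected policy can be sketched behaviorally: from a small pool of scheduler entries, issue the oldest instruction whose operands are all ready (a simplified software model with an illustrative entry layout, not the hardware design):

```python
def select_oldest_ready(entries):
    """entries: one (age, ready) pair per scheduler slot; a smaller age
    value means an older instruction. Returns the index of the oldest
    ready instruction, or None if nothing can issue this cycle."""
    candidates = [(age, slot) for slot, (age, ready) in enumerate(entries)
                  if ready]
    if not candidates:
        return None
    return min(candidates)[1]

# Four-entry scheduler: slots 0, 2, 3 are ready; slot 2 is the oldest.
pool = [(5, True), (3, False), (1, True), (7, True)]
print(select_oldest_ready(pool))  # 2
```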
1.6.4 Non-Blocking Data Cache
Runahead execution exploits data-level parallelism to achieve high performance. It ex-
ploits data prefetching to pre-populate the data cache while a data cache miss is being
serviced. However, unlike data prefetchers, Runahead uses the program’s own instruction
stream to induce subsequent data cache misses that then generate additional memory
requests that act as prefetch operations [23, 16].
Runahead extends a simple inorder pipeline with the ability to continue instruction
execution after a miss in the data cache. Runahead continues instruction execution
speculatively and allows continued access to the data cache. This requires a non-blocking
data cache which is costly to implement on FPGAs because conventional non-blocking
designs use highly-associative CAMs for cache-line tracking [17].
Chapter 7 proposes NCOR, a novel non-blocking data cache optimized for Runahead execution on FPGAs. NCOR provides only the subset of a conventional non-blocking cache's features that is required for Runahead execution. Most importantly,
NCOR avoids using CAMs for tracking pending cache lines. Instead, it uses an in-cache
tracking system, in which metadata are stored along with the cache lines. NCOR’s simple
tracking system provides most of the benefits of a conventional tracking scheme, while
using negligible storage.
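The in-cache tracking idea can be caricatured as follows: the outstanding-miss state is stored in the indexed line's own metadata, so no associative search over pending misses is needed (an illustrative model, not NCOR's actual design; conflicting misses are handled naively here):

```python
class Line:
    """One cache line; the pending flag is metadata stored with the line."""
    def __init__(self):
        self.valid = False
        self.pending = False
        self.tag = None

def lookup(cache, index, tag):
    """Return 'hit', 'pending' (a miss is already outstanding), or 'miss'.
    No CAM search: the outstanding-miss state lives in the indexed line."""
    line = cache[index]
    if line.pending and line.tag == tag:
        return "pending"          # duplicate request suppressed
    if line.valid and line.tag == tag:
        return "hit"
    # New miss: mark it outstanding in the line itself. (A conflicting
    # miss simply replaces the pending state in this simplified model.)
    line.tag = tag
    line.pending = True
    line.valid = False
    return "miss"

cache = [Line() for _ in range(4)]
print(lookup(cache, 0, 0x12))  # miss: request issued, line marked pending
print(lookup(cache, 0, 0x12))  # pending: no duplicate memory request
```

When the fill returns, clearing the pending bit and setting the valid bit turns subsequent accesses to that line into hits.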
1.6.5 Soft Processor with Runahead Execution
Chapter 8 introduces SPREX, a complete soft processor implementation with Runahead
execution. SPREX extends simple pipelining and provides Runahead functionality with
minimal area and frequency penalty. SPREX exploits CFC and NCOR for checkpointing
processor state and providing continuous access to the data cache, respectively. SPREX
also tracks the dataflow graph to increase speedup. On average, SPREX offers 10%
speedup over a conventional inorder pipeline.
1.7 Thesis Contributions
The following are the contributions of this thesis:
• This thesis investigates processor implementation challenges on the FPGA sub-
strate and provides solutions to improve performance. Although many designs
require custom processor implementations, many designs also share common features, and so the challenges of interest are common as well. It is therefore important to be aware of these challenges and their workarounds when implementing a soft processor.
• This thesis proposes CFC, a novel checkpointing mechanism which avoids data
copying, to address the problem of serialized data copying due to BRAM port
limitations. Checkpointing is a widely used mechanism in modern architectures,
e.g., superscalar processing and thread-level speculation. However, conventional
checkpointing schemes use single-cycle parallel data copying to store/retrieve check-
points. Such data copying is feasible in ASIC implementations using techniques
such as bit interleaving. However, on FPGAs large storage is provided using
BRAMs which provide a limited number of ports to access the data. By avoid-
ing data copying, CFC is still able to use BRAMs for storage, yielding a highly
efficient design both in terms of area and frequency.
• This thesis investigates the best configuration, in terms of area, frequency and
instructions per cycle, for instruction scheduling in OoO processors on FPGAs.
Instruction scheduling is at the heart of OoO processors and reorders instructions
for execution to extract the most instruction- and data-level parallelism. How-
ever, a poor choice of scheduling policy can lead to low parallelism, hence lower
performance. Additionally, a scheduler with high clock frequency is desirable to
achieve high performance. Finally, the FPGA substrate offers different trade-offs
compared to ASICs. This thesis proposes the best scheduler configuration when all
such parameters are taken into account.
• This thesis proposes NCOR, a novel non-blocking data cache specialized for Runa-
head execution on FPGAs. Runahead execution relies heavily on continuous access
to the data cache even after an access misses in the cache. Conventional non-
blocking caches track pending cache misses using CAMs, which map poorly to
FPGAs. Accordingly, NCOR does away with CAMs and takes a simpler approach
to track pending misses. Instead of tracking all possible combinations, NCOR only
tracks a subset of miss combinations. Hence, NCOR is able to use in-cache tracking
which greatly reduces area cost and increases its clock frequency.
• This thesis introduces SPREX, a complete soft processor with Runahead execu-
tion. SPREX extends an in-order, 5-stage pipeline with Runahead execution which
offers, on average, 10% speedup. SPREX incorporates CFC and NCOR to pro-
vide Runahead functionality, while achieving high frequency and area efficiency.
Compared to off-the-shelf soft processors, SPREX offers higher performance with
comparable area usage.
Chapter 2
Background and Motivation
This chapter provides background on superscalar, Out-of-Order, and Runahead execu-
tion. It discusses the functionality of each architecture and their advantages over simple
pipelining. We also compare the cost and benefits of narrow, 1-way OoO and Runahead
execution to those of 2- and 4-way superscalar processing. We show that narrow OoO and Runahead architectures are more suitable options for FPGA implementation.
2.1 Superscalar Processing
An N-way superscalar processor can execute, in parallel, up to N instructions adjacent
in the original program order. To do so, most of the datapath components are replicated
N times. This includes the Arithmetic Logic Units (ALUs), the branch predictor, and the writeback logic. Furthermore, many components used in the pipeline must support multiple accesses
in the same cycle. These include the instruction cache, register file, and data cache.
Most importantly, to avoid unnecessary pipeline stalls, bypass paths are needed among
all N datapaths. Accordingly, superscalar resource costs increase super-linearly with the
number of ways.
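A first-order count of the operand-forwarding paths alone shows this super-linear growth (the constants below, two source operands and two forwarding stages, are assumptions about the pipeline, not figures from the thesis):

```python
def bypass_paths(n_ways, src_operands=2, fwd_stages=2):
    """First-order count of operand-forwarding paths: each of the n_ways
    consumers may receive each source operand from any of the n_ways
    producers, at each forwarding stage."""
    return src_operands * fwd_stages * n_ways * n_ways

for n in (1, 2, 4):
    print(n, bypass_paths(n))  # 4, 16, 64: quadratic in the issue width
```

The quadratic term shows up not only as extra wiring but as wider multiplexers on the operand inputs, which lengthen the critical path.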
2.2 Out-of-Order Execution
Out-of-order (OoO) processors exploit instruction- and data-level parallelism to achieve
high performance. OoO executes instructions in parallel by issuing multiple instructions
to multiple functional and memory units. Furthermore, instructions taking multiple
cycles to execute, e.g., missing loads, do not block the pipeline as subsequent independent
instructions are free to execute using other functional units [61].
OoO avoids stalling the pipeline by executing independent instructions past those
waiting to complete. Executing instructions out of order provides the opportunity to
overlap instruction execution and achieve higher performance. For example, if a load
instruction is waiting for its data from the main memory, subsequent instructions which
do not depend on the load data are free to execute.
Compared to in-order processors, OoO relies on additional mechanisms to reorder in-
structions while maintaining correctness. Using mechanisms such as Scoreboarding and
Tomasulo’s algorithm, OoO is able to reorder instructions and assign them to functional
units for execution [22, 59]. OoO also uses register renaming to remove false data dependencies in the program. False dependencies are a side effect of the limited number of architectural registers.
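Renaming can be illustrated with a minimal map-table model: each destination receives a fresh physical register, so later writes to the same architectural register no longer conflict with earlier reads (WAR) or writes (WAW), while true (RAW) dependencies are preserved through the mapping. A toy sketch, not the thesis's implementation:

```python
def rename(instrs, num_arch_regs=8):
    """instrs: (dst, src1, src2) architectural register numbers.
    Returns renamed tuples over an ever-growing physical register space
    (a toy model; real designs recycle physical registers)."""
    map_table = list(range(num_arch_regs))  # architectural -> physical
    next_phys = num_arch_regs
    renamed = []
    for dst, src1, src2 in instrs:
        ps1, ps2 = map_table[src1], map_table[src2]  # read current mapping
        map_table[dst] = next_phys  # fresh destination removes WAW/WAR
        renamed.append((next_phys, ps1, ps2))
        next_phys += 1
    return renamed

# r1 = r2 + r3 ; r1 = r4 + r5 : the same architectural destination
# becomes two independent physical destinations (p8 and p9).
print(rename([(1, 2, 3), (1, 4, 5)]))  # [(8, 2, 3), (9, 4, 5)]
```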
Figure 2.1 shows a typical OoO pipeline using register renaming. Fetch and Decode
stages are similar to those of an in-order pipeline. Next, instructions have their register
operands renamed in the Rename stage. Register renaming is the process of mapping
architectural registers to physical registers. Instructions then enter the instruction scheduler, where they wait to be assigned to functional and memory units.
After execution, e.g., loads reading data from the cache, instructions move to the Write
stage. Here instructions write their results, if they have any, into the register file. Finally,
instructions commit sequentially in the Commit stage.
Instructions are fetched, decoded, and renamed in the program order. After being
placed into the instruction scheduler, instructions are free to execute out of order. Any
Figure 2.1: A typical out-of-order pipeline using register renaming and reorder buffer.
ready instruction, i.e., one that has all its source operands ready, is free to execute.
Renaming also allows result writebacks to occur out-of-order as false dependencies have
been removed. However, instructions commit, i.e., apply their changes to the processor architectural state, in program order to preserve correctness. For example, store instructions apply their changes to the data cache only when they commit.
OoO uses a Re-Order Buffer (ROB) to preserve the original instruction ordering.
The ROB maintains the list of instructions in the order they were fetched. Later in
the Commit stage, instructions are retrieved from the ROB sequentially and committed.
Hence, instructions are committed in the original program order.
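The ROB's role can be sketched as a FIFO whose entries complete in any order but retire strictly from the head (a behavioral model with an illustrative structure, not the hardware design):

```python
from collections import deque

class ROB:
    """Entries complete out of order but retire only from the head."""
    def __init__(self):
        self.buf = deque()  # entries held in fetch (program) order

    def dispatch(self, tag):
        self.buf.append({"tag": tag, "done": False})

    def complete(self, tag):
        # Out-of-order completion: mark the entry wherever it sits.
        for entry in self.buf:
            if entry["tag"] == tag:
                entry["done"] = True

    def commit(self):
        """Retire completed entries from the head only."""
        retired = []
        while self.buf and self.buf[0]["done"]:
            retired.append(self.buf.popleft()["tag"])
        return retired

rob = ROB()
for tag in "ABC":
    rob.dispatch(tag)
rob.complete("B")    # B finishes first...
print(rob.commit())  # ...but nothing retires yet: []
rob.complete("A")
print(rob.commit())  # A, then B, retire in program order: ['A', 'B']
```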
2.3 Runahead Execution
Runahead is an extension of a simple in-order processor that maps well onto the FPGA
fabric. Runahead improves performance by avoiding stalls caused by cache misses, as
Figure 2.2(a) depicts. A conventional in-order processor stalls whenever a memory re-
quest misses in the cache. Even on reconfigurable platforms, a main memory request
may take several tens of soft processor cycles to complete, thereby limiting performance.
Figure 2.2: (a) In-order execution of instructions resulting in stalls on cache misses. (b) Overlapping memory requests in Runahead execution.
Main memory controllers, however, support multiple outstanding requests. Runahead exploits this capability and improves performance by requesting multiple data blocks from memory instead of stalling on each miss.
A pipeline with Runahead execution is similar to that of an in-order pipeline. Typ-
ically it consists of five stages of Fetch, Decode, Execute, Memory, and Writeback. In
an in-order pipeline, when a memory instruction is blocked in the Memory stage, all
subsequent instructions are blocked.
As Figure 2.2(b) shows, upon encountering a cache miss, termed the trigger miss, the processor continues to execute subsequent independent instructions instead of stalling the pipeline. This is done in the hope of encountering more cache misses to overlap with
the trigger miss. Effectively, Runahead uses the program itself to predict near-future
accesses that the program will perform, and overlaps their retrieval.
Although all results produced during Runahead mode are discarded, all valid memory
requests are serviced by the main memory and the data requested is eventually placed in
the processor cache. If the program subsequently accesses some of this data, performance
may improve because this data was prefetched (i.e., requested earlier from memory).
Provided that a sufficient number of instructions independent of the trigger miss are found during Runahead mode, the processor has a good chance of reaching other memory requests that miss. When enough useful requests are reached, performance improves, as the processor effectively prefetches their data into the cache.
When a cache miss is detected, the processor creates a checkpoint of its architectural
state (e.g., registers) and enters the Runahead execution mode. While the trigger miss is
being serviced, the processor continues executing subsequent independent instructions.
Upon the delivery of the trigger miss data, the processor uses the checkpoint and restores
all architectural state, so that the results produced in Runahead mode are not visible to
the program. The processor then resumes normal execution starting immediately after
the instruction that caused the trigger miss.
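The mode transitions just described can be summarized as a small state machine (a behavioral sketch; real hardware overlaps these steps with pipeline operation, and the event names are illustrative):

```python
def run(events):
    """events: 'miss' (trigger load misses in the cache), 'fill' (trigger
    miss data returns), or any other opcode string. Returns a trace of
    (mode, event) pairs showing which mode each event executed in."""
    mode, checkpoint, trace = "normal", None, []
    for ev in events:
        if mode == "normal" and ev == "miss":
            checkpoint = "saved-arch-state"  # checkpoint, enter Runahead
            mode = "runahead"
        elif mode == "runahead" and ev == "fill":
            assert checkpoint is not None    # restore, resume after trigger
            mode = "normal"
        trace.append((mode, ev))
    return trace

trace = run(["add", "miss", "load", "store", "fill", "add"])
print([mode for mode, _ in trace])
# ['normal', 'runahead', 'runahead', 'runahead', 'normal', 'normal']
```

Everything executed between the "miss" and "fill" events runs speculatively; its register results are discarded on restore, but its memory requests remain in flight.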
Performance trade-offs with Runahead are complex. On one hand, those memory
accesses that were initiated during Runahead mode and bring useful data into the cache
effectively prefetch data for subsequent instructions and reduce overall execution time.
On the other hand, memory accesses that bring useless data pollute the cache and con-
sume memory bandwidth and resources, e.g., they may evict useful data from the cache
or they may delay subsequent requests.
2.4 Narrow vs. Wide Datapath
In this section we compare narrow, 1-way OoO and Runahead execution with wide super-
scalar pipelines. We estimate the processor performance using a full-system simulator we
developed. We also implement, in Verilog, a trimmed-down 5-stage pipeline to estimate the effect of superscalar processing on FPGAs. For this experiment, each datapath uses a conventional five-stage pipeline with full bypass paths and pipeline latches. For simplicity, the ALU includes only a 32-bit adder. No other components are modeled.
Figure 2.3: Area and maximum frequency of a minimalistic pipeline for 1-, 2-, and 4-way superscalar processors.
See Chapter 3 for a more detailed explanation of the experimental methodology.
Figure 2.3 reports how the area and frequency of a superscalar pipeline scale as
the number of ways increases from one to four. The figure shows that the frequency of a wide pipeline is lower than that of a narrow one, while its area cost is significantly higher. The maximum frequency of the 4-way superscalar is 33%
less than that of the single-issue processor. The 4-way superscalar must extract sufficient
instruction level parallelism (ILP) to compensate for this frequency disadvantage.
Figure 2.4 compares superscalar processing with OoO and Runahead execution in terms of IPC. The figure shows that narrow OoO and Runahead processors come close to wide, 4-way superscalar in-order pipelines. The comparison is made for the performance
of 1-, 2-, and 4-way superscalar, and single-issue OoO and Runahead processors, for a
wide range of cache configurations. The cache size varies from 4KB up to 32KB (stacked
bars). The OoO pipeline outperforms both 1-way and 2-way superscalars for all cache
Figure 2.4: IPC performance of superscalar, out-of-order, and Runahead processors as a function of cache size.
sizes, while it performs worse than the 4-way superscalar only with the 32KB cache.
It should be noted that a more advanced compiler could improve the performance of the superscalar processor.
The data presented in this section demonstrate that narrow, 1-way OoO and Runahead execution have the potential to improve the performance of an in-order pipeline on
FPGAs. In addition, these architectures avoid the superlinear costs of datapath replica-
tion and can potentially achieve low area costs with high clock frequencies.
Following the data presented in this chapter, this thesis targets narrow OoO and
Runahead architectures for FPGA implementation, to avoid the superlinear costs of
superscalar processing on FPGAs.
Chapter 3
Experimental Methodology
In this chapter we explain our experimental methodology. We use a combination of
software simulation and actual hardware implementation to evaluate various designs we
propose. We use multiple performance metrics to measure the efficiency of a given design
and to compare different processor configurations.
3.1 Comparison Metrics
We measure an architecture’s performance using two different metrics: Instructions Per
Cycle (IPC) and Instructions Per Second (IPS). We use IPC for both simulations and
actual hardware implementations. After synthesis and placement-and-routing, we compare designs based on IPS. We also compare designs based on their area and frequency characteristics.
3.1.1 Area
We use a design’s area usage as a metric to measure its implementation efficiency on
FPGAs. We measure area usage based on LUT and BRAM usage reported by the
synthesis tool. We primarily use Altera Stratix III FPGAs, in which the basic building
blocks are Adaptive Logic Modules (ALMs). Each ALM contains two combinational
adaptive LUTs (ALUTs), two flip-flops and two full adders. ALMs can be configured to
implement logic functions, arithmetic functions, and register functions [12].
In Chapter 5, we also compare checkpointing schemes based on their silicon real
estate. We estimate the silicon area as the total equivalent area: the sum of the areas of all the ALUTs plus the areas of the BRAMs, following the scheme described in [41, 62].
3.1.2 Frequency
We compare designs based on their operating frequency, a property which can directly
affect their runtime performance. In order to reduce the effect of the inherent randomness
in the tool, we place-and-route every design multiple times using different random seeds.
We report the average of the maximum clock frequencies reported by the tool.
3.1.3 IPC
Before implementing a given design in hardware, we estimate its performance irrespective
of its implementation details. We use IPC, a frequency-independent performance metric, to compare designs before implementation. IPC is measured as the rate of instruction
execution per processor clock cycle. Using IPC we can compare the performance of two
architectures solely based on their architectural properties, avoiding implementation-
specific optimizations.
3.1.4 IPS
We use IPS to assess actual performance of a given design. Using IPS we can compare
two designs considering their architectural properties and their hardware implementation
limitations. In Chapter 8 where a complete processor is implemented in hardware, we
measure IPS by clocking the execution time of a specific number of instructions. However,
in the rest of the thesis, when we are designing individual components of the processor, we
estimate the IPS to provide insight into the processor performance. We implement that
particular component in hardware and assume the entire processor can operate at the
same clock frequency. We first use software simulation to measure the processor's IPC with the proposed micro-architecture, and then use Formula 3.1 to estimate the processor's IPS.
IPS = IPC × Frequency (3.1)
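For example, under this estimate a design with higher IPC can still deliver fewer instructions per second if it sacrifices too much clock frequency (illustrative numbers, not results from this thesis):

```python
def ips(ipc, freq_hz):
    """Formula 3.1: estimated instructions per second."""
    return ipc * freq_hz

# A higher-IPC design can still lose overall if it costs too much
# clock frequency (illustrative numbers, not thesis results).
fast_clock = ips(0.60, 200e6)   # 0.60 IPC at 200 MHz -> 120 MIPS
slow_clock = ips(0.75, 150e6)   # 0.75 IPC at 150 MHz -> 112.5 MIPS
print(fast_clock > slow_clock)  # True
```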
3.2 Software Setup
3.2.1 Software Simulation
In order to motivate and evaluate this work we use software simulations to pre-evaluate a
given design before going into the time-consuming process of hardware implementation.
We implement, in software, a model of the proposed hardware and use simulations to
estimate its performance in hardware.
We have developed SoinSim, an open-source, cycle-accurate, full-system simulator
for the Altera Nios II instruction set, written in the C language. SoinSim is capable of
modeling various superscalar, Runahead and OoO architectures with numerous detailed
parameters, some of which are listed in Table 3.1.
SoinSim is capable of booting and running the uCLinux operating system [15]. It
models a system consisting of a Nios-II-compatible processor, main memory, timer and
UART. All components are connected through a system bus.
Table 3.1: SoinSim Parameters

Common Properties
    Pipeline Stages            5, 6
    BPredictor Type            Bimodal, GShare
    Bimodal Entries            Configurable
    BTB Entries                Configurable
    Cache Size                 Configurable
    Cache Associativity        Configurable
    Memory Latency             Configurable
    Data Forwarding            Configurable

Superscalar Specific
    No. Ways                   Configurable

Out-of-Order Specific
    Pipeline Stages            7
    Scheduler Size             Configurable
    Scheduler Policy           Age / Random
    Scheduler Latency          Configurable
    ROB Size                   Configurable
    No. Physical Registers     Configurable
    Checkpoints                Configurable

Runahead Specific
    No. Outstanding Requests   Configurable
    Include Store Insts.       Configurable
    Track Registers            Configurable
System Bus
SoinSim connects all the components of the system using a bus that follows the Avalon
Bus specifications [9]. In the modeled system, the processor is the only Avalon master
component, capable of initiating bus requests.
Memory Model
In order to estimate main memory access latency in our simulations, we experiment on an Altera DE-3 board, accessing DDR2 memory clocked at 266MHz. We experiment
with various memory access patterns and find that a single memory request, on average,
takes 20 cycles to complete. We also find that the memory controller is capable of
pipelining memory requests, and back-to-back memory accesses are serviced faster. For
example, a continuous four-word request is serviced in 30 cycles, rather than 80 cycles if
requested separately.
Accordingly, we have developed a DDR2 memory model which models a fixed-latency,
pipelined memory controller. That is, every initial memory request takes a fixed number
of cycles to service. However subsequent requests that are received before the initial
request is serviced take fewer cycles to return.
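This model can be sketched directly from the measurements above: the first access pays the full latency, and each back-to-back access overlaps with it. The per-word overlap cost below is derived from the 30-versus-80-cycle example and is an approximation, not a parameter of the actual model:

```python
def burst_latency(num_words, first=20, per_extra=10 / 3):
    """Cycles to service num_words back-to-back requests: the first pays
    the full latency; later ones overlap with it in the pipeline."""
    if num_words == 0:
        return 0
    return first + per_extra * (num_words - 1)

print(burst_latency(1))         # a single request: 20 cycles
print(round(burst_latency(4)))  # four back-to-back: ~30, vs. 80 separately
```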
Peripherals
SoinSim models three memory-mapped peripherals which are accessible to the processor
through the system bus:
• UART: SoinSim models a UART following Altera’s JtagUART specifications [10].
• Timer: A programmable timer, modeled in software, that resembles Altera’s SOPC
Timer module [10].
• Performance Counter: This is a custom performance counter to facilitate measuring
various metrics.
3.2.2 Operating System
We boot and run the uCLinux operating system [15] on top of all simulated and hardware-
implemented processors. uCLinux is a simplified version of the Linux operating system
which is capable of running arbitrary applications cross-compiled for the Nios II ISA.
The uCLinux version we use does not support virtual memory, minimizing the overhead
of hardware and software memory management.
We use the ramdisk driver to create a memory-mapped disk available to applications.
We store benchmarking data files in this disk space.
3.2.3 Benchmarks
We estimate a given processor’s runtime performance by running a specific set of
benchmarks. We use benchmarks from the SPEC CPU 2006
benchmark suite that are typically used to evaluate the performance of desktop sys-
tems [56]. We use them as representative of applications that have unstructured data-
and instruction-level parallelism. We make an assumption, motivated by past experi-
ence, that in the future, embedded and FPGA-based systems will be called upon to run
demanding applications such as these.
We use a set of reference inputs for benchmarks as provided by the benchmark suite.
As we do not include floating point units in our processor architectures, as is the case
with Nios II, we use the integer subset of the benchmarks. We compile the benchmarks
using the gcc ported for Nios II by Altera Corp.
Due to the slow speed of simulations, ∼200KIPS on average, we use a sample of one
billion instructions per benchmark. We fast-forward the first billion instructions so as
to skip the initialization phase of each benchmark.
We compare designs based on the speedup they achieve in IPC over a baseline imple-
mentation. We run all the benchmarks on every design. To provide a single number as
the speedup for a design, we use the geometric mean of speedups over the execution of
all benchmarks.
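The single-number summary described above can be sketched as follows; the benchmark speedup values here are made up for illustration, not measured results.

```python
import math

def geomean_speedup(speedups):
    """Geometric mean of per-benchmark speedups over the baseline."""
    return math.prod(speedups) ** (1.0 / len(speedups))

# Hypothetical per-benchmark IPC speedups versus the baseline design.
speedups = [1.25, 0.95, 1.40, 1.10]
print(round(geomean_speedup(speedups), 3))
```

Unlike the arithmetic mean, the geometric mean treats a 2x speedup and a 0.5x slowdown as cancelling out, which is why it is the conventional choice for summarizing speedup ratios.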
3.3 Hardware Setup
3.3.1 Verilog Implementation
We implement proposed hardware designs in Verilog and deploy them on Altera Stratix-
III FPGAs. We use Quartus II for synthesis and place-and-route. Over the
course of this study, we used various Quartus II versions ranging from 8.1 to 12.1.
We use the Altera DE3 development board equipped with an Altera Stratix-III-150
FPGA [58]. The DE3 board has a SODIMM DDR2 slot, providing access to memory
capacities on the order of gigabytes, as needed to boot the operating system and run our
demanding benchmarks.
3.3.2 Component Isolation
In order to measure the area and frequency characteristics of a single component design,
we isolate it for placement and routing. This is done by synthesizing the design in a
top-level module containing only the design itself. In order to reduce the effect of pin
placement on the clock frequency (e.g., due to excessive global routing), all the inputs
and outputs are registered. These include the instruction and data buses, interrupt
lines, and the clock and reset signals. All inputs are fed with shift registers to minimize the
number of pins used. All wide outputs (e.g., the data bus writedata) are reduced to one-bit
signals with XOR reduction operations.
3.3.3 Inorder Processor Resembling Nios II
One of the main objectives of this thesis is to implement a complete soft processor in
hardware and compare it to the current state-of-the-art soft processors. We have chosen
to compare our work with the Nios II/f processor provided by Altera Corp [13]. As
the source code for Nios II is not disclosed, as is the case with most commercial soft
processors, we found it necessary to implement a baseline in-order pipelined processor
resembling, as accurately as possible, Nios II/f, the fastest variant of the Nios II
processor.
We have implemented Soin, a complete Nios II replica. We test Soin’s correctness
using micro-benchmarks. After initial testing, we boot uCLinux on the processor as a
thorough test case; the Linux boot process proved comprehensive, covering almost all
corner cases of the processor implementation.
3.3.4 The System
We use Qsys to create a complete system consisting of the processor, system bus, memory
controller and peripherals.
3.3.5 System Bus
All components in the Qsys system are connected through a memory-mapped Avalon
bus. The processor is the only master component on the bus.
3.3.6 Memory Controller
We use Altera’s UniPhy DDR2 memory interface to access the DDR2 slot on the DE3
board [11]. UniPhy is a commercially used, high performance DDR2 interface, capable
of pipelining memory requests.
3.3.7 Peripherals
The system implemented in hardware consists of the following peripherals which are
connected as Avalon slaves to the system bus.
• Jtag UART: We use Jtag UART to connect to the operating system’s console. Jtag
UART provides UART connectivity through the Jtag port available on the DE3
board.
• Timer: This is a programmable timer available in Altera’s IP library. The timer is used
by the operating system for task scheduling.
• Performance Counter: This is a custom-made performance counter to measure the
processor’s performance.
Chapter 4
Soft Processor Implementation
Challenges
General-purpose soft processors are a key component in reconfigurable computing since
they provide adequate performance, especially for workloads that have little parallelism,
and because they facilitate easy and quick development. Accordingly, many modern de-
signs incorporate multiple instances of general-purpose soft processors. The widespread
use of general-purpose soft processors has led to many designs both by the academic
community and industry. For example, Altera’s Nios II [13] and Xilinx’s Microblaze [64]
are two commonly used designs which provide adequate performance at a low cost. More
advanced soft processors, e.g., LEON3 [51], provide additional functionality and recon-
figurability at the expense of clock frequency and area.
Despite the popularity of soft processors and their widespread use, the implementation
inefficiencies of an entire pipeline as a whole have not been systematically explored.
Instead, several works have addressed specific implementation inefficiencies mostly on a
case-by-case basis. However, a processor pipeline is a complex system, which incorporates
a wide variety of components. Naïvely porting conventional designs that were originally
developed for custom logic implementation can easily lead to high complexity in the
processor’s data path and control logic. Accordingly, there is a need to systematically
characterize the sources of inefficiency in soft processor designs. Such a characterization
serves to deepen our understanding of FPGA implementation trade-offs and can serve as
the starting point for developing FPGA-friendly designs that achieve higher performance
and/or lower area.
This chapter systematically characterizes which circuit paths dominate the operat-
ing clock frequency when implementing a typical pipelined general purpose processor
on an FPGA. To do so, we first develop an implementation of a 5-stage pipelined pro-
cessor, a commonly used soft processor architecture. The baseline implementation is
representative of a “textbook” implementation of a 5-stage pipeline that is optimized for
custom logic implementation and that focuses on correctness, modularity, and speed of
development.
The two key questions this chapter then asks are:
1. Which circuit path dominates latency and thus determines the operating clock
frequency?
2. If this critical path were eliminated somehow, which path would dominate the
clock frequency next?
To answer these questions, this work follows an iterative approach by progressively
synthesizing the design and identifying its critical path. Once the current critical path
has been identified, it is “removed” and the design is synthesized again to identify the
next critical path. Section 4.1 elaborates on the challenges any systematic critical path
identification study faces and the best-effort approach this work follows. Once the various
critical paths are identified, this work proposes a set of optimizations that eliminate them,
improving overall processor frequency.
In summary, this chapter makes two contributions:
1. It identifies the sources of inefficiency in a typical implementation of a 5-stage
pipeline. This analysis focuses on operating frequency, identifying where and why
it suffers. The result of the analysis is an ordered list of critical paths.
2. It proposes several optimizations that eliminate the processor critical paths, im-
proving the operating frequency and performance. The optimizations demonstrate
the utility of the critical path analysis and improve the processor’s clock frequency
from 145MHz to 281MHz. Overall, actual instruction processing throughput in-
creases by 80%.
The goal of this chapter is not to develop the best possible soft processor, nor do we
claim that all the optimizations presented are novel. Rather, this is a step toward system-
atically understanding the sources of inefficiency in soft processor designs. Future work
may rely on the analysis presented here to improve soft processor designs and may follow
a similar methodology to characterize other soft processor designs and architectures.
The remainder of this chapter is organized as follows. Section 4.1 discusses the crit-
ical path identification methodology. Section 4.2 presents the baseline processor design.
Section 4.3 discusses details for the implementation and testing, and it also describes the
specific tools used during the critical path exploration. Section 4.4 presents the criti-
cal path analysis while Section 4.5 proposes several performance optimizations. Finally,
Section 4.6 measures how the processor’s overall performance improves after applying
various optimizations.
4.1 Identifying Implementation Inefficiencies
Given a pipelined processor implementation a designer can follow an iterative refinement
approach in order to improve the processor’s operating frequency and performance. At
each step of the process, the designer would identify the critical path that dominates
the clock frequency. Then they would proceed to develop, if possible, a circuit- or an
architectural-level technique to “remove” this path. If and when the current critical path
is eliminated, another path becomes the critical path and the process can be
repeated. Alternatively, the designer may completely rethink the processor architecture
and design. This work follows the first, iterative approach but the insights it offers are
useful should one decide to completely rethink the processor’s architecture.
A challenge with the iterative refinement approach is that at each step, specific
optimizations must be developed to eliminate the current critical path. Without actual
optimizations, the study would be of limited value, as it would only be able to identify a
single critical path. To overcome this limitation, this work uses a “best-effort” approach
where it artificially removes the critical path at each step. Section 4.4 explains the path
elimination heuristics used on a case-by-case basis. The approach followed in this work
represents a “what if the critical path was magically removed” scenario.
A limitation of the presented analysis is that actual optimizations may alter the rel-
ative importance of the various circuit paths or may give rise to other critical paths.
However, we believe that this analysis represents a meaningful and useful step in
identifying the sources of inefficiency in FPGA-based designs in the absence of actual
optimizations.
Moreover, this work goes beyond the critical path analysis and in Section 4.5 presents
specific optimizations that eliminate these paths, while preserving design correctness.
These optimizations demonstrate the utility of the presented path analysis. Section 4.3
discusses how the analysis methodology compensated for lower-level FPGA-specific chal-
lenges during the critical path identification analysis.
4.2 Processor Pipeline
This work implements a classic 5-stage processor pipeline [31] in Verilog. Fig. 4.1 shows
the block diagram of the processor including Fetch, Decode, Execute, Memory and Write-
back stages. The baseline implementation focuses on correctness, modularity and extensi-
Figure 4.1: The typical 5-stage pipeline implemented in this work. Dotted lines represent control signals.
bility rather than clock speed. This section describes the implementation of each pipeline
stage.
4.2.1 Fetch Stage
The Fetch stage is responsible for providing the instruction bits to the Decode stage. It
includes an instruction cache for speeding up instruction fetches as the main memory
latency is high. The instruction cache is capable of fetching one instruction per cycle if
the address hits in the cache.
The fetch stage also predicts the direction and target address of conditional branches
to avoid bubbles in the pipeline [31]. A bimodal branch direction predictor, a dy-
namic branch predictor comprising a table of two-bit saturating counters, predicts the
direction [34]. A Branch Target Buffer (BTB) predicts the target address for taken
branches [34]. The implementation uses a tagless BTB for simplicity and speed. Both
the bimodal predictor and the BTB have 256 entries which are indexed with a portion of
the PC. The BTB and bimodal entries are stored as pairs in one BRAM. It is possible to
use the same BRAM row to store a bimodal and a BTB entry, as they use the same in-
dexing scheme [63]. It has been shown that fusing BTB and bimodal predictor structures
into the same BRAM provides storage and frequency advantages on FPGAs [63].
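The two-bit saturating counter scheme described above can be sketched in software as follows. This is a behavioral model, not our Verilog; the table size matches the 256 entries used here, while the initial counter value and word-aligned indexing are illustrative assumptions.

```python
class BimodalPredictor:
    """Behavioral sketch of a bimodal predictor: one 2-bit saturating
    counter (0..3) per entry, indexed by low-order PC bits."""

    def __init__(self, entries=256):
        self.entries = entries
        self.table = [1] * entries  # start weakly not-taken (assumption)

    def _index(self, pc):
        return (pc >> 2) % self.entries  # drop byte offset, then index

    def predict(self, pc):
        return self.table[self._index(pc)] >= 2  # True = predict taken

    def update(self, pc, taken):
        i = self._index(pc)
        if taken:
            self.table[i] = min(3, self.table[i] + 1)
        else:
            self.table[i] = max(0, self.table[i] - 1)

bp = BimodalPredictor()
pc = 0x1000
for _ in range(3):            # train on a repeatedly taken branch
    bp.update(pc, taken=True)
assert bp.predict(pc)          # counter saturated at "strongly taken"
bp.update(pc, taken=False)     # a single not-taken outcome...
assert bp.predict(pc)          # ...does not flip the prediction (hysteresis)
```

The saturation is what gives the predictor hysteresis: a loop-closing branch that is taken many times and falls through once is still predicted taken on the next iteration.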
4.2.2 Decode Stage
The Decode stage is responsible for preparing all data and control signals for the Execute
stage. Depending on the instruction type and pipeline state, data operands may come
from the register file, forwarding lines, or they can be an immediate value from the
instruction bits.
The Decode stage is also responsible for detecting hazards in the pipeline. Hazards
can occur for multiple reasons, for example, if an instruction requires an operand whose
value is yet to be produced in the pipeline. When a hazard is detected, the pipeline must
either be stalled, which introduces penalty cycles as bubbles that perform no useful work,
or a technique must be applied that eliminates the need to stall while ensuring correct
execution semantics.
In order to avoid bubbles in the pipeline, data forwarding is used [31]. One method
of implementing data forwarding is to introduce paths that provide data generated in
later stages of the pipeline to a dependent instruction that is in the decode stage. A
multiplexer must be introduced for each input operand in the decode stage to select
between the normal operand value and the possible forwarding paths from other stages.
Additional logic must also be introduced to perform register-identifier comparisons that
determine the appropriate input to select for each multiplexer.
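The selection logic described above — compare register identifiers against each in-flight producer and prioritize the youngest match — can be sketched behaviorally. The function and list layout are illustrative, not our Verilog; the r0-is-always-zero check reflects the Nios II ISA.

```python
def select_operand(src_reg, regfile_value, producers):
    """Pick the operand value for source register `src_reg`.

    `producers` lists in-flight result-producing instructions
    youngest-first, e.g. [(exec_dest, exec_result),
    (mem_dest, mem_result), ...]. The youngest matching producer
    wins; otherwise fall back to the register file.
    """
    for dest_reg, result in producers:
        if dest_reg == src_reg and dest_reg != 0:  # r0 is hardwired to 0
            return result
    return regfile_value

# Hypothetical pipeline state: Execute is producing r5 = 42 while an
# older instruction in Memory is also writing r5 = 7.
producers = [(5, 42), (5, 7)]
assert select_operand(5, 99, producers) == 42  # youngest match wins
assert select_operand(6, 99, producers) == 99  # no match: register file
```

In hardware this loop becomes a priority multiplexer whose select lines are driven by the register-identifier comparators, which is exactly the structure that later shows up on the critical path.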
4.2.3 Execute Stage
The Execute stage includes an arithmetic and logic unit (ALU), which consists of a logical
operation unit, one comparator, one multiplier, one shifter, and two adders. One adder
is used for arithmetic operations and memory address calculations, while the other adder
is used for branch target calculation.
For branch instructions the Execute stage performs a series of operations. First, it
calculates the target address of the branch. In parallel, it determines the branch outcome,
i.e., whether the branch is taken or not. Finally, the calculated branch target is compared
to the predicted target address which was provided by the branch predictor during Fetch.
A misprediction signal is broadcast to all earlier pipeline stages if the two addresses do
not match, and the pipeline is flushed in the same clock cycle.
4.2.4 Memory Stage
The Memory stage includes a data cache to compensate for the long latency of accessing
the main memory. Load and store instructions look up their addresses in the data cache.
If the address hits in the cache, loads complete in a single cycle, while stores take two
cycles to complete. For stores, after determining a hit, i.e., in the second cycle, the actual
store operation happens. As a result, a load immediately following a store in the original
program order will have to wait for one additional cycle in the Execute stage.
The data cache is a 2KB blocking, write-back cache [35]. It consists of two storage
units implemented using BRAMs, one for tags and one for data. Loads and stores access
the data cache in the Memory stage. If the address misses in the data cache, the entire
pipeline is stalled, while the cache line is being retrieved from the main memory. For all
other instructions, the memory stage is a pass-through stage.
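The tag/data split and hit check described above can be sketched behaviorally. This is a software model under stated assumptions: the line size, line count, and direct-mapped organization are illustrative choices that happen to give a 2KB cache; the real cache's associativity is configurable, and the hardware also tracks valid and dirty bits that are omitted here.

```python
LINE_BYTES = 32        # illustrative line size
NUM_LINES  = 64        # 64 lines * 32 B = 2 KB (direct-mapped assumption)

def split_address(addr):
    """Split a byte address into (tag, index, offset) fields."""
    offset = addr % LINE_BYTES
    index  = (addr // LINE_BYTES) % NUM_LINES
    tag    = addr // (LINE_BYTES * NUM_LINES)
    return tag, index, offset

tags = [None] * NUM_LINES   # models the tag BRAM (data BRAM omitted)

def is_hit(addr):
    tag, index, _ = split_address(addr)
    return tags[index] == tag

tag, index, _ = split_address(0x1234)
tags[index] = tag            # fill the line
assert is_hit(0x1234)
# An address that maps to the same index but a different tag misses:
assert not is_hit(0x1234 + LINE_BYTES * NUM_LINES)
```

On a miss, the model's `None`/mismatching tag corresponds to the case where the real pipeline stalls while the line is fetched from main memory.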
4.2.5 Writeback Stage
The Writeback stage writes the result of instructions back to the register file.
4.3 Methodology
The entire processor under study is implemented in Verilog, and conforms to the Nios II
ISA. Following the same methodology explained in Chapter 3, we test the processor’s
functionality and measure its performance in terms of both IPC and IPS.
The Verilog design is synthesized using Quartus II 12.1 to a Stratix III chip. The
TimeQuest timing analyzer of the Quartus II software is used to measure the maximum
clock frequency at which the design can operate. The target clock speed is set to 333MHz
(3ns period) in the design constraint file. Our goal is to reach frequencies close to that
of Nios II/f, which is 270MHz on Stratix III devices [14].
There can be many different interfaces and devices the processor may connect to.
To identify the critical paths that are inherent to the processor design and to avoid
artifacts caused by external components, the processor design is isolated for placement
and routing. The isolation process is explained in more detail in Section 3.3.2.
In the critical path analysis, we synthesize the processor and locate the critical path,
that is, the circuit path with the longest cumulative delay. Most critical paths
are tightly coupled with other parallel paths; however, in the analysis we focus on
the top failing path reported by the synthesis tool. Once the critical path is identified,
we artificially eliminate it by introducing registers along the path, effectively
splitting it over two cycles. We then re-synthesize the core to find the next critical path
and continue the process as described. The goal of our analysis is to eliminate paths by
adding or removing as little logic as possible. Following this method to eliminate the paths
may result in a processor design that does not operate correctly. However, we believe
that this is a reasonable approach to determine the next design bottleneck in the absence
of an actual optimization for removing the current critical path. Another method is to
declare the top critical paths as false paths in the toolset, excluding them from timing
analysis. We choose the method of introducing registers as it operates at the architecture
level, as opposed to the false-path setting, which is at the circuit level. Section 4.5 demonstrates
the utility of our approach by presenting several optimizations.
4.4 Critical Path Study
Table 4.1 reports the critical paths found in each synthesis iteration. The table reports
the maximum operating frequency with the corresponding path included. The baseline
processor design can operate at 145MHz. The table reports the top 15 critical paths; if it
were possible to eliminate them all, the processor would operate at 281.68MHz. Removing
most paths results in a monotonic increase in operating frequency, except between paths
(D) and (E). Specifically, removing the critical path (D) in the fourth iteration improves
frequency more than removing the critical path (E) in the fifth iteration. Removing path
(D) results in an isolated, efficient routing configuration, mainly caused by the random
nature of the place-and-route process. We conclude that the list of the various paths
is more important than their relative order. The results also suggest that there
is no single path that, if eliminated, would result in a significant improvement in clock
frequency. Instead, the designer has to contend with multiple, tightly-spaced critical
paths.
The rest of this section discusses each path in more detail, also explaining how we
“eliminated” the path for the purpose of identifying the next important critical path. In
some cases the technique used to “eliminate” the critical path breaks correct functionality.
Section 4.5 presents proper ways of removing the critical paths that preserve correctness.
The goal of the analysis is to identify the various critical paths in order of importance,
in the absence of actual optimizations.
A: This path includes the multiplier and forwarding data path. It starts from the
data operand registers provided by the Decode stage, through the multiplier in the
Execute stage, routed back through the forwarding logic to the Decode stage and ends
at the data operand registers. This represents data computation and communication.
Table 4.1: Processor critical paths.
Path    Max. Freq. (MHz)    Main Component    Type
A 144.99 Multiplier Data
B 184.71 Branch Control
C 199.72 Branch Control
D 211.01 Shifter Data
E 200.84 Hazard Detection Control
F 201.90 Memory Stalls Control
G 206.95 Hazard Detection Control
H 211.46 Forwarding Control
I 214.68 Forwarding Data
J 230.95 ICache Hit Control
K 231.59 Forwarding Data
L 242.72 Multiplier Data
M 249.50 ICache Hit Control
N 249.75 DCache Hit Control
O 281.69 Memory Mux Data
In order to remove this path, we registered the output of the multiplier and allowed
bypassing only from the Memory stage.
B: This path includes the branch misprediction logic and pipeline redirection. It starts
from the data operands to the Execute stage, and continues in the ALU’s compara-
tor for branch outcome determination. It also includes the address comparator for
misprediction identification which signals the Fetch stage to redirect the program
counter. This path is for branch misprediction identification followed by fetch stage
redirection. We removed this path by registering the branch mispredict signal broad-
cast to the Fetch stage. This effectively delays branch misprediction detection by
one cycle.
C: The third critical path includes the branch misprediction logic and stall signal sent to
the Decode stage. When a branch is identified as mispredicted at the Execute stage,
the instruction currently at the Decode stage must be annulled. To remove this path
we registered the branch outcome signal. This signal determines whether a branch
is taken or not-taken. Similar to path (B), branch misprediction identification is
delayed by one cycle.
D: The fourth critical path includes the shifter in the ALU and the forwarding logic.
It starts from the data operand registers, follows through the shifter and forwarding
logic back to the data operand registers. This is another data computation and com-
munication path. We register the output of the shifter to eliminate this path. This
effectively eliminates one bypass path from the Execute stage back to the Decode
stage.
E: The hazard detection logic in the Decode stage dominates the fifth path. The hazard
signal is broadcast to the Fetch stage where it stalls the fetch process. We register
the fetch redirect signal to remove this path.
F: This path includes the stall signal from the Memory stage to the rest of the pipeline.
We remove this path by registering the memory stall signal.
G: This is another hazard detection dominated path. Hazards are identified in the
Decode stage by checking all forwarding lines. We register the forwarding selection
logic signals to remove this path.
H: The next critical path is in the data path including the forwarding data lines from
the Memory stage to the Decode stage and ending in the data operand registers. We
reduced the data operand multiplexer size by removing one of the inputs (immediate
value for shift operations).
I: The critical path is still through the forwarding logic from the Memory stage to
the Decode stage. We remove two more inputs from the data multiplexers in the
Memory stage (shift and multiplication results).
J: The instruction cache hit signal contributes to this path. This signal directs the
address multiplexer in the Fetch stage to select the next instruction address. We
remove this path by eliminating one input to the multiplexer.
K: The forwarding logic from the Memory stage to the Decode stage surfaces again. This
path includes the sign extension logic required after loading the data from the data
cache. We remove this path by eliminating load instructions from the forwarding
logic.
L: At this point the multiplier alone is the critical path. Both the inputs and the output
of the multiplier are registered. We remove this path by replacing the multiplier with
a simple XOR logic.
M: The path from the instruction cache hit signal to the fetch address selection surfaces
again. We remove this path by registering the ready signal from the instruction cache
to the Fetch stage.
N: This path includes the data cache’s lookup address selection logic. The lookup
address is either from load/store instructions or from the write-back logic. The
selection depends on the cache’s next state. We remove this path by using the
cache’s current state (a register) to select the address.
O: This path includes the multiplexer to select between shift, multiplication, loads from
data cache or all other instruction results in the Memory stage. The result is passed
on to the Writeback stage.
We stop critical path exploration at this point as the maximum clock frequency
reached (281 MHz) is higher than our target frequency (270 MHz).
The results of this analysis show there is no single path that dominates the clock
frequency. Instead, removing each problematic path results in a relatively small im-
provement. Only if several paths are eliminated can the operating frequency improve
substantially.
4.5 Eliminating Critical Paths
This section proposes solutions to eliminate some of the critical paths that Section 4.4
identified. All proposed solutions are confined to the processor implementation: they are
compiler-independent, and no compiler options are changed during optimization.
The proposed solutions preserve
the processor’s functionality while increasing its clock frequency. Some of the proposed
optimizations increase clock frequency at the expense of introducing pipeline bubbles
under certain scenarios. These bubbles may delay certain instructions, leading to lower
IPCs. However, as long as these delays are infrequent enough, the gain in frequency can
compensate for the loss in IPC. Section 4.6 measures the resulting performance in IPS,
considering both IPC and clock frequency.
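The trade-off can be made concrete with IPS = IPC x clock frequency. As a worked sketch using the chapter's endpoint frequencies, the reported 80% throughput gain at 281MHz implies the optimizations cost roughly 7% in IPC relative to the 145MHz baseline; the baseline IPC of 1.0 below is an illustrative placeholder, not a measured value.

```python
# Instructions per second combines cycle efficiency and clock rate.
def ips(ipc, freq_hz):
    return ipc * freq_hz

base = ips(ipc=1.0, freq_hz=145e6)      # illustrative baseline IPC

# At 281 MHz, an overall 1.80x throughput gain back-solves to an
# optimized relative IPC of 1.80 * 145 / 281, i.e. roughly 0.93.
opt = ips(ipc=1.80 * 145 / 281, freq_hz=281e6)
assert abs(opt / base - 1.80) < 1e-9
```

This is why infrequent extra bubbles are acceptable: a small IPC loss is more than compensated by the near-doubling of clock frequency.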
4.5.1 Multiplier and Shifter
The original processor implementation included a multiplier and a shifter in the Execute
stage. Although this reduces the number of cycles required for multiplication and shifting,
it also lowers the clock frequency, manifesting as critical paths (A), (D) and (L).
Instead, we propose to delay the forwarding of multiplication and shifting operations in
the pipeline by eliminating the bypass path from the execute stage back to the decode
stage. This will introduce bubbles in the pipeline when the next in order instruction in
the pipeline requires the result of the multiplier or shifter. Fig. 4.2 shows the pipeline
before and after this optimization.
Figure 4.2: Multiplication and shift/rotate operations before (a) and after (b) optimization.
4.5.2 Branch Misprediction Detection
The Fetch stage predicts the outcome and target address of branch instructions to avoid
stalling fetch on branches. However, when eventually the actual outcome of the branch
is computed in the Execute stage, it must be compared to the one predicted earlier. If a
mismatch is detected, any incorrectly introduced instructions must be flushed from the
pipeline and fetching must be redirected to the computed target address.
Branch misprediction detection includes three steps: 1) The outcome of the branch is
determined, i.e., whether the branch is taken or not-taken. 2) The target address of the
branch is calculated. 3) The actual target of the branch (either fall-through address or
Figure 4.3: Branch misprediction detection before (a) and after (b) optimization. Dashed boxes represent registers.
the target) is compared to the address predicted in the Fetch stage. Fig. 4.3-a shows
the block diagram of this mechanism.
In order to shorten the long combinatorial paths (B) and (C), we propose to delay
the branch misprediction detection by one clock cycle. As shown in Fig. 4.3-b, branch
outcome and target are calculated in the first clock cycle (Execute stage), and the com-
parison with the predicted target occurs in the next clock cycle (Memory stage). This
shortens the combinatorial path by introducing a register in the path. This optimization
increases branch misprediction recovery time by one clock cycle.
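The delayed detection can be sketched behaviorally as two pipeline steps: the Execute-stage cycle resolves direction and target and latches them, and the Memory-stage cycle performs the comparison. The function and field names are illustrative, not our Verilog.

```python
def execute_stage(taken, taken_target, fallthrough):
    """Cycle 1: resolve branch direction and target, register them."""
    actual_target = taken_target if taken else fallthrough
    return {"actual_target": actual_target}   # pipeline register contents

def memory_stage(regs, predicted_target):
    """Cycle 2: compare against the Fetch-stage prediction."""
    return regs["actual_target"] != predicted_target  # True = mispredict

# A taken branch whose target was predicted correctly:
regs = execute_stage(taken=True, taken_target=0x2000, fallthrough=0x1004)
assert memory_stage(regs, predicted_target=0x2000) is False
# With a wrong prediction, the flush fires one cycle later than before:
assert memory_stage(regs, predicted_target=0x1004) is True
```

The register between the two functions is exactly what breaks the long combinatorial path of Fig. 4.3-a in two.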
4.5.3 Data Forwarding
When an earlier instruction, already inside the processor pipeline, produces a result used
by a later instruction, its data must be forwarded to avoid a pipeline bubble [31]. Fig. 4.4-
a shows the datapath for forwarding data from various pipeline stages to the Decode stage
where data operands are prepared. Forwarding logic can end up on the critical path, as it
requires a large multiplexer and complex selection logic.
Full-blown forwarding logic forwards data from every pipeline stage after the Execute
stage, and requires a large multiplexer. Furthermore, the selection logic for the
forwarding multiplexer proves to be relatively complex. First, all instructions in the
pipeline producing a result for the same register must be identified through a set of
register-identifier comparisons. Among all matches, younger sources must be prioritized
over older ones. As the number of data sources increases, the complexity of the selection
process increases as well. We eliminate the data forwarding delay with the two
optimizations described next.
Two-Cycle Forwarding
The most critical path in data forwarding is due to the selection logic manifested in path (G). This logic is large and performs various operations sequentially. We shorten this long combinatorial path using the following observation: data source identification and data selection do not have to occur in the same cycle. Instead, they can be performed in two separate cycles. In the first cycle, the forwarding logic determines the source for a particular data operand. In the next clock cycle, the actual data selection occurs. This scheme effectively cuts the long path of data forwarding into two smaller paths.
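The split can be modeled at cycle level as a registered select signal, as in this hedged sketch (class and signal names are illustrative, not from the thesis RTL):

```python
class TwoCycleForward:
    """Model of splitting forwarding into identify and select cycles."""

    def __init__(self):
        self.sel_q = None   # select signal registered at the end of cycle 1

    def identify(self, src_reg, stage_dsts):
        """Cycle 1: compare src_reg against in-flight destination
        registers (youngest first) and register which source to use."""
        self.sel_q = None
        for i, dst in enumerate(stage_dsts):
            if dst == src_reg:
                self.sel_q = i
                break

    def select(self, stage_values, rfile_value):
        """Cycle 2: mux the operand using the registered select."""
        if self.sel_q is None:
            return rfile_value
        return stage_values[self.sel_q]
```

Only the small mux remains in the second cycle; the comparisons and priority encoding are paid for in the first.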
Delayed Data Forwarding
Delaying multiplication and shift operations by one cycle requires forwarding their results from the Memory stage to the Decode stage, as Fig. 4.4-a shows. Combined with the loads
Figure 4.4: Forwarding data path before (a) and after (b) optimization in the pipeline. Dashed line is the added forwarding path.
and ALU instructions, this requires a 4-to-1 multiplexer in the Memory stage. This multiplexer manifests in critical path (I) since it resides directly in the forwarding path. We propose to delay multiplication and shift results one more cycle and forward them from the Writeback stage. This reduces the multiplexer size to 2-to-1.
As Fig. 4.4-a shows, load data from memory passes through the sign-extension logic. This further prolongs the forwarding path, manifested in path (K). We propose to remove load data forwarding to the Decode stage, thereby eliminating the multiplexer in the Memory stage altogether. This further shortens the forwarding data path, as Fig. 4.4-b shows. Both optimizations may delay certain instruction combinations.
4.5.4 Fetch Address Selection
Although the baseline Fetch stage uses the branch predictor to guess the next instruction
address, it does not have to do so in all cases. More specifically, there are five options
for the next instruction address:
A1: Reset vector
A2: IRQ vector
A3: Redirect address due to branch misprediction
A4: Current PC due to instruction cache miss
A5: The predicted next address by the branch predictor
These options lead to a large 32-bit 5-to-1 multiplexer in the Fetch stage. Further-
more, the select signal depends on the following control signals: reset, interrupt, branch
misprediction, instruction cache miss, data hazard, and memory stall. Having a large
number of combinatorial signals as inputs, the multiplexer in the Fetch stage gives rise
to paths (E) and (J).
We propose reducing the size of the next address multiplexer to 3-to-1 as follows.
We observe that all the address options A1-A3 are redirection addresses. In addition,
we expect that A5 will be the common case, with A4 being less common and A1-A3
occurring infrequently. Accordingly, we propose delaying options A1, A2, and A3 by one
clock cycle. We introduce a redirect address register, holding the redirection address,
selected among options A1, A2, and A3. We use the redirect register to steer the fetch
accordingly in the next cycle.
We also include option A4 in the redirect register by observing that if the Fetch stage is allowed to advance the PC even when the instruction cache misses, returning to the previous fetch address can be treated as a redirection. Therefore, we can include option
Figure 4.5: Next address selection data path in the Fetch stage before (a) and after (b) optimization. Dashed boxes represent registers.
A4 in the redirect register, effectively removing the instruction cache miss signal from
the multiplexer select input. Fig. 4.5 shows this scheme in detail.
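The reduced selection can be summarized behaviorally as below. This is a sketch under the assumption that all redirection causes (A1-A4) have already been folded into the registered redirect address in the previous cycle:

```python
def next_fetch_address(redirect_valid, redirect_addr, stall, pc, prediction):
    """3-to-1 next-address selection after the optimization."""
    if redirect_valid:        # reset, IRQ, branch miss, or i-cache miss
        return redirect_addr  # replay, registered in the previous cycle
    if stall:                 # data hazard or memory stall: hold current PC
        return pc
    return prediction         # the common case: the branch predictor's guess
```

The common-case prediction path now sees a much smaller mux with far fewer combinatorial select inputs.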
4.5.5 Data Operand Specialization
In the Nios II ISA the second operand for shift/rotate operations can come from only
two sources: the register file or an immediate value from the instruction bits. However,
other instruction types have four options for the second operand. The original, modular
Verilog code of our processor implementation included all possible data sources for all
types of instructions. However, it is not necessary to use the same data multiplexer for
all instruction types. We propose to use a separate 2-to-1 multiplexer for shift/rotate
instructions, shortening path (H).
4.6 Performance
The optimizations proposed in Section 4.5 remove critical paths but may increase the
number of pipeline stalls. Overall processor performance depends on both the Instruc-
tions Per Cycle (IPC) rate and the clock frequency. This section studies the performance
of the processor pipeline taking both into account. Fig. 4.6 reports IPC along with the
instruction per second (IPS) throughput for the various processor designs shown along
the x axis. The baseline configuration is shown at the leftmost side. From left to right,
the graph reports instruction throughput as all paths listed along the x-axis are removed.
For example, configuration I has paths A through E and I removed. The IPS results show
that frequency gains due to optimizations more than compensate for loss in IPC. Pro-
cessor performance starts at 47 million IPS and reaches as high as 85 million IPS after
applying the optimizations, an 80% improvement.
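The trade-off rests on the throughput identity IPS = IPC x frequency: a small IPC loss is recovered whenever the clock frequency rises by more. A quick check against the figures quoted above:

```python
def ips(ipc, freq_hz):
    # instructions per second = instructions per cycle x cycles per second
    return ipc * freq_hz

# the improvement quoted above: 47 million IPS -> 85 million IPS
speedup = 85e6 / 47e6 - 1.0   # roughly 0.81, i.e. the ~80% improvement
```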
4.7 Related Work
To the best of our knowledge no previous work exists that systematically characterizes
the critical paths in a general purpose soft processor implementation. Several works that
propose optimizations for soft processor implementations exist. The analysis of this work
complements such works and serves as a guide for further optimizations. The closest work
is by Wong et al., who compare the area and delay of processors implemented on custom
CMOS and FPGA substrates [62]. They find that SRAMs and adders are efficient on
Figure 4.6: IPC and relative IPS improvement for the processor after removing critical paths.
FPGAs mainly due to having dedicated resources. However, CAMs and multiplexers
are found to be extremely inefficient. They also find that data forwarding is inefficient
on FPGAs compared to custom CMOS implementations. Our work complements this
past work as it looks at the architecture of a full processor design identifying specific
architecture components and techniques that are inefficient in an FPGA implementation.
Yiannacouras et al. explore the impact of soft processor customization on performance [68]. They consider various factors including pipeline depth, pipeline organization, data forwarding, and multi-cycle operations. They show that fine-grain microarchitectural customizations can yield higher overall performance compared to a few hand-picked optimizations. Furthermore, they show that by subsetting the ISA they can reach a modest frequency improvement of 4% for a 5-stage pipeline. They conclude that after removing logic from a given path, often another path of similar length remains, so it is unlikely that one can simply shorten all paths.
4.8 Conclusion
This chapter considered a typical pipelined processor design and implemented it on a
modern FPGA. The baseline implementation focused on correctness, development speed,
modularity, and extensibility. It then explored sources of inefficiency in the implemen-
tation and found that the major components limiting speed were branch misprediction
detection, data forwarding, fetch address selection, certain computations, and stall broad-
cast signals. Finally, this work proposed various optimizations to increase processor clock
frequency in order to achieve higher performance.
Chapter 5
CFC: Copy-Free Checkpointing
This chapter proposes CFC, Copy-Free Checkpointing, a novel checkpointing mechanism suitable for FPGA implementation. CFC avoids the data copying that would otherwise have to be performed serially due to the port limitations of BRAM storage in FPGAs. Here we discuss the need for checkpointing in OoO processors and show that conventional checkpointing mechanisms map poorly to FPGAs. We then demonstrate CFC for checkpointing the register rename table, a key component in OoO architectures. Finally, CFC is shown to map well to FPGAs while providing all the functionality of a conventional checkpointing scheme. The novel CFC scheme presented in this chapter has been published as [1].
5.1 The Need for Checkpointing
OoO processors use speculative execution to boost performance. In speculative execution the processor executes instructions without being certain that it should. A common form of speculative execution is based on control-flow prediction, where the processor executes instructions starting at the predicted target address of a branch. When the speculation is correct, performance may improve because the processor had a chance to execute instructions earlier than if it had waited for the branch to resolve its target. When the speculation fails, all changes made by the erroneously executed instructions must be undone. For this purpose, OoO processors rely on the Re-Order Buffer (ROB) [54]. The ROB records, in order, all the changes made by instructions as they execute. To recover from a mis-speculation, the processor processes the ROB in reverse order, reverting all erroneous changes to the processor state.
Recovery via the ROB is slow, requiring time proportional to the number of erroneously executed instructions. For this reason, many OoO processors employ checkpointing, a recovery mechanism with a fixed latency, often a single cycle [44]. For a storage-based component, a checkpoint is a complete snapshot of its contents. Checkpoints are expensive to build and increase latency. Accordingly, only a few of them are typically implemented [65, 8, 44].
When both checkpoints and an ROB are available, recovery can proceed as follows: if the mis-speculated instruction has a checkpoint, recovery uses that checkpoint alone. Otherwise, recovery first restores the closest subsequent checkpoint and then proceeds via the ROB to the relevant instruction [44]. Alternatively, the processor can recover to the closest preceding checkpoint at the expense of re-executing any intervening instructions [8]; in this case an ROB is unnecessary. It has been shown that a few checkpoints offer performance close to that possible with an infinite number of checkpoints [8, 44]. Accordingly, we limit our attention to four or eight checkpoints.
5.2 Register Renaming
An OoO processor reorders instructions to extract instruction- and data-level parallelism.
However, instruction reordering must preserve data dependencies, which can be catego-
rized into the following:
1. read-after-write (RAW)
2. write-after-read (WAR)
3. write-after-write (WAW)
The last two are also known as false dependencies since they are an artifact of re-
using a limited number of registers. Register renaming eliminates false dependencies by
mapping, at run time, the architectural registers referred to by instructions to a larger
set of physical registers implemented in hardware [59]. False dependencies are eliminated
by using a different physical register for each write to the same architectural register.
Typical implementations of register renaming use a Register Alias Table (RAT), which maps architectural to physical registers [54, 65, 57, 45]. The RAT is indexed with the architectural register name, and each entry provides a physical register name. Renaming an instruction for a three-operand instruction set such as that of Nios II proceeds as follows:
• The two source registers are renamed by reading their current mapping from the
RAT.
• A new mapping is created for the destination register, if any. A free list provides
the new physical register name. The processor recycles a physical register when
it is certain that no instruction will ever access its value (e.g., when a subsequent
instruction that overwrites the same architectural register commits).
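The two steps above can be sketched as a minimal software model of RAT-based renaming. The class and method names are ours, and checkpointing and register recycling are omitted:

```python
class SimpleRenamer:
    """Minimal RAT + free-list renamer for a three-operand ISA."""

    def __init__(self, num_arch, num_phys):
        self.rat = list(range(num_arch))             # arch -> phys mapping
        self.free = list(range(num_arch, num_phys))  # unallocated phys regs

    def rename(self, dst, src1, src2):
        p_src1 = self.rat[src1]        # read current source mappings
        p_src2 = self.rat[src2]
        prev_dst = self.rat[dst]       # previous mapping, kept for recovery
        new_dst = self.free.pop(0)     # fresh physical register
        self.rat[dst] = new_dst        # removes WAR/WAW hazards on dst
        return p_src1, p_src2, new_dst, prev_dst
```

Renaming two successive writes to the same architectural register yields two different physical registers, which is exactly how the false dependencies are eliminated.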
5.2.1 Checkpointed RAT
In order to support speculative execution, the RAT's contents need to be checkpointed. RAMRAT, a common RAT implementation, is a table indexed by architectural register names whose entries contain physical register names [65]. For the Nios II instruction set, this table needs three read ports: two for the source operands plus one for reading the previous mapping of the destination operand so it can be stored in the ROB. The table also needs one write port to write the new mapping for the destination register. A checkpoint is a snapshot of the table's contents and is stored in a separate table. Multiple checkpoints require multiple tables. Recovery amounts to copying back a checkpoint into the main
Figure 5.1: Epochs illustrated in a sequence of instructions.
table. A checkpoint is taken when the processor renames an instruction that initiates a new speculation (e.g., a branch). Such an instruction terminates an epoch comprising the instructions seen since the last preceding checkpoint, as shown in Figure 5.1. Recovering at a checkpoint effectively discards all the instructions of all subsequent epochs.
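For contrast with what follows, the conventional copy-based scheme can be sketched as below (an illustrative model, not the thesis RTL). Every checkpoint and every recovery copies the entire table, which is precisely the bulk copying that serializes poorly through BRAM ports:

```python
class CopyingRAT:
    """RAMRAT-style checkpointing: full table copies."""

    def __init__(self, num_arch):
        self.table = list(range(num_arch))   # arch -> phys mapping
        self.checkpoints = []

    def checkpoint(self):
        self.checkpoints.append(list(self.table))   # bulk copy out

    def restore(self, idx):
        self.table = list(self.checkpoints[idx])    # bulk copy back
        del self.checkpoints[idx:]                  # discard younger epochs
```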
5.3 CFC
CFC modifies RAMRAT so that it better matches an FPGA substrate. The key challenge when implementing RAMRAT on an FPGA is the implementation of the checkpoints. Creating a checkpoint requires copying all the bits of the main table into one of the checkpoint tables. In ASIC implementations the checkpoints are implemented as small queues embedded next to each RAT bit. However, such an implementation is expensive and inefficient on an FPGA because it cannot exploit BRAMs and uses LUTs exclusively (see Section 5.5.2).
In RAMRAT, the main table holds all the changes applied to the RAT by all the
instructions, both speculative and non-speculative. The advantage of this implementation
is that the most recent mapping for a register always appears at the corresponding
entry of the main table. Hence lookups are streamlined. Checkpoints, however, need
to take a complete snapshot of the main table and this results in an inefficient FPGA
implementation. Instead of storing updates always in the same main table, CFC uses
a set of BRAM-implemented tables and manages them as a circular queue. CFC stores
RAT updates done in each epoch in a different table. Therefore, recovering from a mis-
speculated epoch is as simple as discarding the corresponding table. This significantly
simplifies RAT updates and checkpoint operations. RAT lookups, however, turn into
ordered searches through all the tables to find the most recent mapping.
By eliminating the need for copying, CFC is able to exploit on-chip BRAMs to store
mappings. It uses a few LUTs to maintain the relative order among tables and to
implement the search logic involved in read operations. Implementing reads is inexpensive
on FPGAs as LUTs can efficiently implement complex logic.
5.3.1 The New RAT Structure
Figure 5.2 shows the organization of CFC. There are two main structures, the RAT tables
and the dirty flag array (DFA). Each RAT table contains one entry per architectural
register, which provides a physical register name. A total of c+1 tables exist, which
correspond to c checkpoints and the committed state of the RAT. Each checkpoint table
contains mappings introduced by instructions of an epoch. For simplicity, epochs and
tables use the same indexes. CFC uses two pointer registers, head and tail, to specify the
relative order of the tables, similar to a circular queue. The DFA tracks valid mappings
in each checkpoint table. Accordingly, DFA contains one bit per checkpoint table and
per architectural register.
The (c+1)th table, the committed table, represents the RAT's architectural state, i.e., the latest mappings applied by non-speculative (committed) instructions. The processor uses this table to recover from unexpected events, e.g., page faults or hardware interrupts.
5.3.2 RAT Operations
This section explains in detail how CFC performs various renaming operations.
Figure 5.2: CFC main structure consists of c+1 tables and a dirty flag array.
Figure 5.3: Finding the most recent mapping: the most recent mapping for register R1 is in the second column (01), while for R2 it resides in the fourth (11).
Finding the Most Recent Mapping
When renaming a source operand, the processor needs to identify the table holding
the most recent mapping. This is achieved by examining the DFA row indexed by the
architectural register name. Conceptually, this is done sequentially, looking for the first set dirty flag, starting from the head and moving backwards towards the tail. If no dirty flag is set, the committed table is used. Figure 5.3 shows two examples. In practice, this search is implemented as a lookup table whose inputs are the DFA row and the two pointers.
Creating a New Mapping
When renaming a destination register, CFC stores a new mapping into the most recent
table identified by the head pointer. CFC obtains the new mapping from a free list of
physical registers. It also sets the corresponding DFA bit indicating a valid mapping in
the corresponding entry of the table.
Creating a Checkpoint
CFC creates a checkpoint by simply advancing the head pointer, ensuring that all subsequent updates to the RAT go to a new table. As this table holds no valid mappings yet, CFC also clears all DFA bits of the corresponding column, identified by the new head pointer. Since all subsequent RAT updates are directed to this new table, the previous tables remain intact, and CFC uses them for recovery. Note that no data copying is necessary when creating a new checkpoint.
Committing a Checkpoint
Upon instruction commit, CFC places the destination register mapping into the com-
mitted table. Instead of copying all mappings of an epoch en masse from a checkpoint
table to the committed table, CFC commits changes progressively. CFC stores mappings
one-by-one into the committed table as individual instructions commit. Finally, CFC
advances the tail pointer upon the commit of an instruction that started an epoch and
allocated a checkpoint, e.g., a branch, effectively recycling the checkpoint.
Restoring from a Checkpoint
On a mis-speculation, e.g., a branch misprediction, the RAT must be restored to the state it was in before the mis-speculated instruction was renamed. All that is needed is to update the head pointer to the epoch number of the mis-speculated instruction. The intervening tables are effectively discarded, since only the columns between head and tail are considered during subsequent lookups. Notice that restoring a checkpoint does not involve any copying either.
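Taken together, the operations above can be collected into a small software model. This is an illustrative sketch of the mechanism (names and structure are ours, not the thesis's Verilog), with the free list omitted and commit simplified to a single mapping write:

```python
class CFC:
    """Copy-free checkpointing: one RAT table per epoch, circular queue."""

    def __init__(self, num_arch, num_checkpoints):
        self.n = num_checkpoints
        self.tables = [[None] * num_arch for _ in range(num_checkpoints)]
        self.dirty = [[False] * num_checkpoints for _ in range(num_arch)]
        self.committed = list(range(num_arch))   # architectural state
        self.head = 0                            # current (youngest) epoch
        self.tail = 0                            # oldest live epoch

    def _live_epochs_newest_first(self):
        e, out = self.head, []
        while True:
            out.append(e)
            if e == self.tail:
                return out
            e = (e - 1) % self.n

    def lookup(self, reg):
        # search from head back to tail; fall through to committed state
        for e in self._live_epochs_newest_first():
            if self.dirty[reg][e]:
                return self.tables[e][reg]
        return self.committed[reg]

    def write(self, reg, phys):
        self.tables[self.head][reg] = phys
        self.dirty[reg][self.head] = True

    def checkpoint(self):
        # copy-free: advance head and clear that table's dirty column
        self.head = (self.head + 1) % self.n
        for row in self.dirty:
            row[self.head] = False

    def restore(self, epoch):
        self.head = epoch   # later epochs are simply ignored by lookups

    def commit(self, reg, phys):
        self.committed[reg] = phys   # progressive, per-instruction commit
```

Note that both `checkpoint` and `restore` only move a pointer; no mapping is ever copied.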
5.4 FPGA Mapping
This section details how CFC is implemented on an FPGA. The implementation differs slightly from the organization described above, taking advantage of FPGA-specific properties. Most of the RAT state is stored in BRAMs, which are high-speed, area-efficient memory arrays.
5.4.1 Flattening
Selecting the most recent checkpoint table, as determined by the DFA logic, requires a C-to-1 multiplexer, C being the number of checkpoints. Such a multiplexer is area- and latency-inefficient. Flattening the RAT array, that is, storing all checkpoints sequentially in one table, eliminates this C-to-1 multiplexer. Accessing the new flattened RAT, however, requires a new index composed of two parts: a base index and an offset. The base index selects the checkpoint while the offset selects the entry within that checkpoint. The base index is determined by the dirty flags, whereas the offset is determined by the architectural register. As long as C is a power of two, calculating the index amounts to concatenating the architectural register name to the column index reported by the DFA logic.
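With C a power of two, the index computation is pure bit concatenation, as in this sketch (the 5-bit register width assumes the 32 architectural registers of Nios II):

```python
def flat_index(column, arch_reg, arch_reg_bits=5):
    """Index into the flattened RAT: checkpoint-column bits concatenated
    with the architectural register bits (32 registers -> 5 bits)."""
    return (column << arch_reg_bits) | arch_reg
```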
5.4.2 Multiporting the RAT
Two processor stages access the RAT structure in CFC: rename and commit. Renaming
an instruction requires reading at most three mappings, and changing one mapping.
Commit also writes a mapping into the committed copy. In total, the RAT structure
must have three read ports and two write ports. Unfortunately, BRAMs have only one read and one write port. Multiported BRAM-based storage designs have been proposed, but they come at the expense of area and frequency overheads that we would like to avoid [41].
We make the following observation to enable the use of BRAMs for storage. The
commit stage writes only to the committed table, while the rename stage writes only
to the checkpoint tables. Accordingly, we implement the committed table separately
to avoid needing two write ports to the same BRAM. On lookups, a 2-to-1 multiplexer
selects between the committed and the most recent checkpoint table. This multiplexer
does not add significant area or latency overhead, due to its fixed, minimal size.
To provide three read ports, we replicate each BRAM three times. The write ports of
the checkpoint BRAMs are connected to a single external write port so that all copies are
updated simultaneously. Similarly, the write ports of the three BRAMs implementing
the committed copy are also connected to a single write port.
5.4.3 Dirty Flag Array
The DFA is accessed one row at a time for lookups. However, for checkpoint creation,
an entire column is reset at the same time. Therefore, DFA and the associated logic are
described as a lookup table to the synthesis tools and are implemented using LUTs.
5.4.4 Pipelining the CFC
Compared to RAMRAT, CFC adds a level of indirection prior to accessing the table array. Before accessing the tables, CFC must determine the index of the table to be used, which involves accessing the DFA. Consequently, its latency can be longer than that of RAMRAT.
However, CFC’s clock frequency can be improved as it can be pipelined. Specifically, we
implement CFC as a two-stage pipeline as follows:
• In the first stage CFC decodes the dirty flags corresponding to the architectural
register being renamed. It generates a BRAM index based on the DFA row read
and the architectural register name. It also updates the DFA row if a new mapping
is being placed in the RAT.
• In stage two CFC accesses the checkpoint and committed BRAMs in parallel and
at the end selects the appropriate copy. At the end of stage 2, all BRAM updates
occur as well.
5.5 Evaluation
This section compares the performance and cost of CFC against FPGA implementations of two conventional renaming methods. Section 5.5.1 details the experimental methodology. Section 5.5.2 reports the LUT usage of each mechanism, while
Section 5.5.3 reports their operating frequencies. Section 5.5.4 measures the impact of
pipelining on IPC performance. Finally, Section 5.5.5 reports overall performance and
summarizes our findings taking into account LUT and BRAM usage.
5.5.1 Methodology
We compare CFC to two conventional methods which we call RAM and CAM. RAM
uses LUTs exclusively to checkpoint the RAT. CAM uses content-addressable-memories
Table 5.1: Architectural properties of the simulated processors.

Common Properties
    Branch Predictor Type:  Bimodal
    Bimodal Entries:        512
    BTB Entries:            512
    Cache Size (Bytes):     4K, 8K, 16K, 32K
    Cache Associativity:    Direct Mapped, 2-way
    Memory Latency:         20 Cycles
Superscalar Specific
    Pipeline Stages:        5
Out-of-Order Specific
    Pipeline Stages:        7
    Scheduler Size:         32
    ROB Size:               32
    Physical Registers:     64
    Checkpoints:            4
to provide checkpointing functionality at the expense of reduced clock frequency [54, 47].
We consider designs with four and eight checkpoints as past work has shown that this
number of checkpoints is sufficient [8, 44]. We implemented all three renaming schemes in
Verilog. We follow the same experimental methodology described in Chapter 3 to obtain
IPC, area and frequency characteristics of the designs. Table 5.1 details the architecture
of the processor simulated in this study.
5.5.2 LUT Usage
Table 5.2 reports the number of LUTs used by the three renaming mechanisms with four and eight checkpoints on two platforms. Because only the DFA and its associated logic use LUTs in CFC, its cost is considerably lower than that of CAM and RAM. For example, with eight checkpoints on Stratix III, CFC uses approximately 2x and 7x fewer resources than CAM and RAM, respectively. On Cyclone II with eight checkpoints, CFC uses 2.73% of the available LUTs, while CAM and RAM use 10.37% and 18.19%, respectively.
CFC uses six BRAMs, which is only a small fraction of the BRAMs available on either
platform. We conclude that CFC is superior in terms of resource usage.
Table 5.2: LUT and BRAM usage and maximum frequency for 4 and 8 checkpoints on different platforms.

                            RAM      CAM      CFC
LUT
    Cyclone II/4            3220     2378     501
    Cyclone II/8            6368     3631     964
    Stratix III/4           3002     1802     399
    Stratix III/8           7082     2327     996
BRAM
    Cyclone II/4            0        0        6
    Cyclone II/8            0        0        6
    Stratix III/4           0        0        6
    Stratix III/8           0        0        6
Silicon Tile Area (mm2)
    Stratix III/4           3.3022   1.9822   0.8199
    Stratix III/8           7.7902   2.5597   1.4766
Frequency (MHz)
    Cyclone II/4            122      85       137
    Cyclone II/8            82       71       104
    Stratix III/4           195      133      292
    Stratix III/8           196      105      220
We also compare designs based on their equivalent silicon real estate used. We calcu-
late equivalent area by summing the area of all the LUTs plus the silicon area of all the
BRAMs. As shown in Table 5.2, even after considering the BRAM area, CFC is still sig-
nificantly smaller compared to both RAM and CAM. Specifically, with four checkpoints,
CFC is 75% and 58% smaller than RAM and CAM, respectively. With eight checkpoints,
CFC is 81% and 42% smaller than RAM and CAM, respectively.
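The percentages follow directly from the Stratix III silicon-tile areas in Table 5.2; a quick arithmetic check:

```python
def reduction(cfc_area, other_area):
    # fractional area saving of CFC relative to the other scheme
    return 1.0 - cfc_area / other_area

# Stratix III silicon-tile areas (mm^2) from Table 5.2
r4_ram = reduction(0.8199, 3.3022)   # four checkpoints, vs RAM (~0.75)
r4_cam = reduction(0.8199, 1.9822)   # four checkpoints, vs CAM (~0.58)
r8_ram = reduction(1.4766, 7.7902)   # eight checkpoints, vs RAM (~0.81)
r8_cam = reduction(1.4766, 2.5597)   # eight checkpoints, vs CAM (~0.42)
```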
5.5.3 Frequency
Table 5.2 reports the maximum clock frequency for the three checkpointing mechanisms.
CFC outperforms both conventional schemes on both FPGA platforms, operating up to 118% and 50% faster than the CAM- and RAM-based schemes, respectively. This is because CFC exploits BRAMs for storage; using BRAMs leads to a less complex interconnect and higher routing efficiency, yielding higher clock frequencies.
Figure 5.4: Performance impact of an extra renaming stage.
5.5.4 Impact of Pipelining on IPC
CFC outperforms RAM and CAM in terms of clock frequency; however, it is pipelined in two stages, and pipelining imposes a runtime performance penalty. Figure 5.4 compares the IPC of single-cycle and two-cycle pipelined renaming (the base architecture has six stages, one of which is for renaming). The performance penalty incurred by the additional renaming stage is small, i.e., less than a 2% IPC drop. Coupled with the frequency advantage of CFC, we thus expect CFC to outperform both the RAM and CAM schemes.
5.5.5 Performance
Figure 5.5 reports the overall performance of processors using the different checkpointing schemes. The processors differ in operating clock frequency, set to the maximum achieved by each checkpointing scheme, and in average IPC. Compared to RAM and CAM, CFC is slightly slower in terms of IPC, on average 0.54 vs. 0.55, due to the added pipeline stage. However, overall performance, measured in IPS, is significantly higher due to the higher clock frequency.
Figure 5.5: Overall processor performance in terms of IPS using various checkpointing schemes.
5.6 Related Work
Mesa-Martinez et al. propose implementing an OoO soft core, SCOORE, on FPGAs [27]. They investigate OoO architectures for FPGA implementation and show that OoO architectures, in their conventional form, result in expensive and inefficient implementations, proposing several general remedies. The SCOORE project differs from the work in this thesis in that its primary goal is simulation acceleration.
Fytraki and Pnevmatikatos also implement parts of an OoO processor on an FPGA for the purpose of accelerating processor simulation [30]. Their work is motivated, in part, by the same inefficiencies that prior works identified. Our goal, however, is different: we aim to develop cost- and performance-effective, FPGA-friendly OoO components for use in embedded system applications.
5.7 Conclusion
In order to have a practical and efficient OoO soft processor on FPGAs, it is necessary to develop FPGA-friendly implementations of the various units that OoO execution requires. This chapter presented CFC, a novel checkpointing technique that avoids data copying so that it can exploit BRAMs on FPGAs. Using CFC, fast and area-efficient register renaming, a key component of OoO execution, is possible on FPGAs. The proposed copy-free checkpointed register renaming was shown to be much more resource-efficient than conventional alternatives, and it can be pipelined, offering superior performance.
Although this chapter focused on register renaming, checkpointing has many other applications. For example, checkpointing can be used to support alternative execution models such as transactional memory. The proposed copy-free checkpointing scheme has already been used for such applications [40].
Chapter 6
Instruction Scheduler
As an additional step toward an FPGA-friendly, single-issue OoO design, this chapter
studies instruction scheduler implementations. The instruction scheduler is the core of
OoO implementations which rely on reordering instructions to maximize instruction-
and data-level parallelism. The scheduler is where instructions wait for all their source
operands and execution resources to become available. This work starts with a conven-
tional, content-addressable-memory-based scheduler design [49] and studies its implemen-
tation on a modern FPGA. Specifically, performance and area are studied as a function
of the number of scheduler entries, the inclusion of back-to-back scheduling support, and
the use of age-based priority scheduling. The results of the study done in this chapter
have been published as [2].
This chapter shows that considering the scheduler in isolation, the best performance
is achieved with a two-entry scheduler without back-to-back scheduling and with the
simpler location-based selection policy. However, when the scheduler is considered as
part of the rest of the pipeline, it is shown that best performance is achieved with a four-
entry scheduler with back-to-back scheduling and age-based selection. This four-entry
configuration is inexpensive and fast. It uses 164 ALUTs and operates at 303MHz.
[Figure omitted: pipeline diagram with stages Fetch, Decode, Rename, Scheduler, Execute, Mem, Write, Commit, and the instruction sequence (A) ldw r1, 0(r2); (B) addi r3, r1, 1; (C) muli r4, r5, 3 progressing over time.]
Figure 6.1: An example sequence of instructions being scheduled. The current state of the processor is presumed to be: instruction A in the memory stage, with instructions B and C in the scheduler, waiting to be selected for execution.
6.1 Instruction Scheduling
An OoO processor can execute instructions in any order that does not violate data de-
pendencies. Instructions enter the instruction scheduler, a pool where they wait until
they become ready, that is until all their source operands are available. The instruction
scheduler identifies ready-to-execute instructions, then among those, selects W instruc-
tions to issue to functional units, W being the processor datapath width. This chapter
focuses on single-issue schedulers (W = 1) as it was shown in Chapter 2 that it is the
number of datapaths that dominates the area and frequency of FPGA implementations.
Figure 6.1 shows an example sequence of instructions that enter an OoO processor.
Instruction A is in the memory stage waiting to load data from the data cache, while
instructions B and C reside in the scheduler pool. Instruction B depends on instruction
A through register r1, thus it cannot execute before instruction A finishes. As soon
as instruction A finishes loading data from memory, instruction B can be chosen for
execution as all its source operands (r1) are now available. Encountering a cache miss
can delay the execution of instruction A for multiple cycles. While instruction B stalls
waiting for A, instruction C is free to execute.
An instruction scheduler comprises a wakeup unit and a selection unit. Wakeup
is responsible for identifying ready-to-execute instructions among those residing in the
scheduler pool. It observes instructions as they produce their results, notifying wait-
ing instructions accordingly. A waiting instruction becomes ready when all its source
operands have been produced. With back-to-back scheduling support, an instruction
can be scheduled for execution in the cycle immediately following the completion of
an instruction it depends on. In the case of multiple dependencies, the
last one to be resolved dictates the cycle in which the dependent instruction is scheduled
for execution.
At any given time, there can be more ready instructions than the number of available
functional units. All ready instructions request execution from the selection unit. For
example, in Figure 6.1, if instruction A finishes execution at the time C enters the scheduler,
both B and C become ready and request execution. In a single-issue OoO processor,
only one instruction can proceed to execution immediately. The selection unit is respon-
sible for selecting among the ready instructions the one that will execute. Typically the
selection unit uses a pre-specified selection policy for doing so. The selection policy can
be based on many factors, such as instruction age, instruction type, or availability of
functional units.
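The wakeup/select flow described above can be sketched as a small behavioral model. This is a Python sketch, not the thesis's Verilog implementation; the `Entry` fields and the age-based tie-break are illustrative.

```python
# Behavioral sketch of single-issue (W = 1) instruction scheduling:
# wakeup clears produced source operands; select issues one ready
# instruction per cycle, preferring the oldest.

class Entry:
    def __init__(self, name, sources, age):
        self.name = name
        self.pending = set(sources)  # source registers not yet produced
        self.age = age               # smaller = older

def wakeup(entries, produced_reg):
    """Mark a just-produced register as available in every waiting entry."""
    for e in entries:
        e.pending.discard(produced_reg)

def select(entries):
    """Pick one ready instruction; prefer the oldest (age-based policy)."""
    ready = [e for e in entries if not e.pending]
    if not ready:
        return None
    winner = min(ready, key=lambda e: e.age)
    entries.remove(winner)
    return winner.name

# Figure 6.1's example: B waits on A (through r1); C has no pending sources.
pool = [Entry("B", ["r1"], age=0), Entry("C", [], age=1)]
assert select(pool) == "C"   # C issues while B still waits on r1
wakeup(pool, "r1")           # A completes, broadcasting r1
assert select(pool) == "B"
```

In hardware the wakeup broadcast and the selection happen in parallel combinational logic, but the cycle-by-cycle behavior matches this model.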
6.2 CAM-Based Scheduler
A common scheduler design, referred to here simply as CAM, is built around content-addressable
memories [49]. Figure 6.2 depicts its structure. The wakeup part is an array with one row per
instruction. Each row contains one column per source operand. Each column contains the
source operand tag along with a ready bit indicating the operand’s availability. For the
Nios II ISA, every instruction can have up to two source operands. Each row is accompa-
nied by two comparators. When an earlier instruction finishes execution, its destination
register tag is broadcast over all entries and compared to source operands. All matching
entries mark their corresponding source operands as available. All instructions that have
both their source operands marked ready request execution. The selection logic selects
one among those ready instructions for execution. Figure 6.2 shows the ready signals as
inputs to the selection logic.
6.2.1 CAM on FPGAs
Despite CAM’s simple structure, it is expensive to build on FPGAs. As Section 6.3
will show, area and frequency degrade as the number of entries increases. By increasing
the number of entries, the network connecting the comparators and source operand tags
becomes more complex, leading to longer critical paths, and hence lower clock frequencies.
Because all entries are used for comparison in every clock cycle, BRAMs cannot be used
for storing the tags due to read/write port limitations. Additionally, there is a comparator
for each source operand of every instruction, resulting in a high resource usage.
6.2.2 CAM Performance
It is well documented that the ILP that can be extracted from a program increases with
the number of scheduler entries [49]. The resulting IPC benefits tend to level off after a
certain number of entries. The actual saturation point varies depending on the processor
architecture and also on system properties, such as memory latency. Furthermore, actual
performance depends not only on IPC but on the processor clock frequency as well. It
has been shown that scheduler frequency deteriorates with the number of entries [49]. As
a result there is a trade-off between scheduler size and performance in conventional CAM
implementations. Section 6.3 will explore this trade-off for FPGA-based implementations.
[Figure omitted: the CAM array, one row per instruction, each row holding tag-L/rdy-L and tag-R/rdy-R fields with comparators, OR and AND gates; the destination tag is broadcast to all rows, whose ready signals feed the selection logic.]
Figure 6.2: CAM scheduler with back-to-back scheduling and compaction. OR gates provide back-to-back scheduling. The dashed gray lines show the shifting interconnect which preserves the relative instruction order inside the scheduler for the age-based policy. The selection logic prioritizes instruction selection based on location, i.e., it is a priority encoder.
6.2.3 Back-to-Back Scheduling
To exploit more ILP it is desirable to execute dependent instructions in consecutive
cycles, or back-to-back, avoiding bubbles in the pipeline. In this regard, the CAM must
generate ready signals in the same clock cycle as the destination tag is broadcast by
earlier instructions. Figure 6.2 depicts a CAM with back-to-back scheduling. The OR
gates ensure that ready signals are produced by either the ready register bit or by the
result of the comparisons performed in the current clock cycle. In this design, the wakeup
and select units must operate in the same clock cycle. Although supporting back-to-back
scheduling increases processor IPC, it adversely affects operating frequency. Section 6.3
will show that back-to-back scheduling in fact increases latency on FPGAs. The reduced
clock frequency can overshadow any IPC advantage back-to-back scheduling has to offer.
Area-wise, adding the OR gates has a small overhead as will be shown in Section 6.3.
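The per-entry ready generation with back-to-back support can be sketched behaviorally. The field names mirror Figure 6.2 (tag-L/rdy-L, tag-R/rdy-R); the function itself is a hypothetical model, not the thesis's Verilog.

```python
# Sketch of the per-entry ready logic with back-to-back support:
# each source is ready if its stored ready bit was already set OR its
# tag matches the destination tag broadcast this cycle. The OR lets a
# dependent instruction issue in the very next cycle.

def entry_ready(tag_l, rdy_l, tag_r, rdy_r, broadcast_tag):
    match_l = (tag_l == broadcast_tag)   # left comparator
    match_r = (tag_r == broadcast_tag)   # right comparator
    # OR gates from Figure 6.2: stored bit or same-cycle match
    return (rdy_l or match_l) and (rdy_r or match_r)

# addi r3, r1, 1 waits on r1; its second source is an immediate,
# modeled here as already ready. The cycle r1 is broadcast, the entry
# becomes ready without first latching the bit.
assert entry_ready("r1", False, None, True, broadcast_tag="r1")
assert not entry_ready("r1", False, None, True, broadcast_tag="r7")
```

Without back-to-back support, the `match` terms would only set the ready bits for the *next* cycle, decoupling wakeup from selection and shortening the critical path.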
6.2.4 Scheduling Policy
In the event that more instructions are ready than there are available execution units, the
selection unit, based on a selection policy, determines which instructions to execute.
This policy can be based on various parameters such as instruction age or location inside
the scheduler. Previous work has shown that a selection policy based on instruction age
tends to perform better than other simple-to-implement heuristics.
One way to consider instruction age in the selection policy is to organize the scheduler
as a FIFO queue. FIFOs preserve instruction ordering and provide relative age infor-
mation based on the location in the queue. Using FIFOs, insertion of instructions is a
trivial queue push operation. However, removing instructions requires more than pop
operations. As instructions can be selected for execution from arbitrary positions of the
FIFO, additional functionality must be provided to maintain the relative positioning of
instructions inside the FIFO.
In order to remove instructions from arbitrary positions inside the scheduler once
they execute, compaction has been implemented in commercial designs to maintain FIFO
ordering [37]. In Figure 6.2, the interconnect between rows provides compaction capabil-
ity. Upon scheduling an instruction, all entries starting from its position to the bottom
(younger) are shifted towards the top (older). This ensures that at any point in time
older instructions are placed at the top. This design guarantees that an instruction’s
relative position also reflects its relative age. The selection logic then uses a priority
encoder which prioritizes based on instruction location, giving entries closer to the top
higher priority.
Compaction, as described above, requires extra connections between scheduler rows,
hence it impacts the area and latency of the scheduler design. Note, however, that
since we study only single-issue datapaths, at most one scheduler entry is freed per
cycle, so a single shift operation per cycle suffices to provide compaction.
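The compaction behavior can be sketched with a simple list-based model; the hardware uses a shifting interconnect, and the tuple representation below is illustrative.

```python
# Sketch of compaction: the scheduler is kept oldest-first, and when an
# entry issues, all younger entries shift up one slot. With a
# single-issue datapath at most one slot frees per cycle, so one shift
# suffices.

def select_and_compact(entries):
    """Priority-encode from the top (oldest) among ready entries,
    then close the gap by shifting younger entries up."""
    for i, (name, ready) in enumerate(entries):
        if ready:
            del entries[i]        # younger rows shift toward the top
            return name
    return None

sched = [("A", False), ("B", True), ("C", True)]  # top = oldest
assert select_and_compact(sched) == "B"           # oldest ready wins
assert sched == [("A", False), ("C", True)]       # C shifted up
```

Because relative position always reflects relative age, the selection logic reduces to a plain priority encoder over the ready signals.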
The simplest-to-implement alternative to the age-based policy is location-based
scheduling, effectively a random policy, where priority is given to instructions according
to where they are stored inside the scheduler. Upon scheduling an instruction, its position is marked as free and
can be filled with a future instruction. Over time, instruction location provides almost
no information about its relative age.
6.3 Evaluation
This section compares the aforementioned scheduler designs based on their area, op-
erating frequency, IPC, and overall performance, measured in instructions per second
(IPS).
6.3.1 Methodology
We implement all scheduler designs in Verilog following the methodology explained in
Chapter 3. We also use software simulations to estimate the performance of each sched-
uler design. The simulated OoO processor consists of seven pipeline stages, uses 32KB
direct-mapped caches, and a 512-entry bimodal branch predictor. We simulate 2- to
32-entry instruction schedulers.
We use the following notation: CAM-B and CAM refer to schedulers with and without
back-to-back scheduling, regardless of their entry count. CAM-[B]A and CAM-[B]L refer
[Plot omitted: ALUTs vs. entries (2 to 32) for CAM-BA, CAM-A, CAM-BL, and CAM-L.]
Figure 6.3: Number of ALUTs used by the scheduler designs.
to schedulers with age- and location-based (random) selection policies respectively.
6.3.2 Area
Figure 6.3 shows how the area of the various designs scales as a function of entry count
(x-axis is exponential). Area requirements grow at least linearly with the number of
entries. Back-to-back scheduling and selection policy have negligible impact on area. For
example, a 4-entry CAM-BA uses 164 ALUTs while CAM-A uses 161 ALUTs. The area
scaling is primarily determined by the number of scheduler entries, unlike conventional
custom-logic implementations, whose area scaling is wire-dominated.
A Nios II/f processor can be implemented using approximately 1500 ALUTs [13].
Given the results in Figure 6.3, using more than eight entries in a scheduler would
introduce an area overhead of 20% or more. Hence, from the area standpoint, schedulers
with more than eight entries are not advisable.
[Plot omitted: MHz vs. entries (2 to 32) for CAM-BA, CAM-A, CAM-BL, and CAM-L.]
Figure 6.4: Maximum clock frequency of the scheduler designs.
6.3.3 Frequency
Figure 6.4 reports the maximum frequency achieved by each design. Schedulers without
back-to-back scheduling consistently achieve higher frequencies because their wakeup and
selection circuitry form two separate combinational stages rather than one long
path. For example, the 8-entry CAM-A and CAM-L operate at 344MHz
and 404MHz respectively, while the same size CAM-BA and CAM-BL operate at 244MHz
and 264MHz, a difference of 29% and 35% respectively. We also observe frequency losses
when moving from the location- to the age-based policy, as the shifting interconnect is added
to support compaction. This drop is highest, at 20%, between the 16-entry CAM-L and
CAM-A, which operate at 330MHz and 265MHz respectively.
6.3.4 IPC
A lower frequency design is not necessarily a worse performing design. Performance de-
pends also on the number of instructions retired per cycle (IPC). Figure 6.5 reports IPC
[Plot omitted: IPC vs. entries (2 to 32) for CAM-BA, CAM-A, CAM-BL, and CAM-L.]
Figure 6.5: Instructions per cycle achieved using the four scheduler designs.
for the various schedulers. CAM-BA consistently outperforms the rest of the schedulers.
The highest difference observed is 7.5% between the two-entry CAM-BA and CAM-BL
schedulers. Back-to-back scheduling improves IPC as expected. CAM-BA and CAM-
BL are superior to CAM-A and CAM-L respectively. Similarly, the age-based selection
outperforms the location-based selection. Most of the IPC benefits come from back-to-
back scheduling rather than from age-based selection. CAM-BL consistently outperforms
CAM-A beyond the scheduler entry count of two. We conclude that to improve IPC we
need to have a scheduler with age-based selection and back-to-back scheduling. However,
should any of these features need to be sacrificed (e.g., due to frequency constraints), we
find it best to replace the age-based policy with the location-based one rather than
remove back-to-back scheduling support. The IPC advantage that back-to-back scheduling
provides is greater than that of the age-based selection policy.
[Plot omitted: IPS vs. entries (2 to 32) for CAM-BA, CAM-A, CAM-BL, and CAM-L.]
Figure 6.6: Overall performance, in million instructions per second, of the four scheduler designs.
6.3.5 Performance
This section compares the overall performance of various scheduler designs in terms of
instructions per second (IPS). IPS considers both clock frequency and IPC. Figure 6.6
compares the IPS of 2- to 32-entry schedulers. We observe that schedulers without
back-to-back scheduling perform consistently better at all entry counts. Although
these designs were shown to reach lower IPCs, their superior clock frequency provides
higher overall performance. Similarly, dropping age-based selection in favor of the simpler
location-based selection results in higher performance. Increasing the number of scheduler
entries reduces performance. Assuming that the entire processor can operate at the
scheduler speed, one would conclude that a very small scheduler with two entries would
be best.
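The trade-off behind this conclusion is simply that overall performance is the product of IPC and clock frequency. The sketch below illustrates it with the 8-entry frequencies reported in Section 6.3.3; the IPC values are illustrative placeholders in the range of Figure 6.5, not exact measurements.

```python
# Illustration of the IPS trade-off: a lower-IPC scheduler can still
# win on IPS if it clocks sufficiently faster.

def ips(ipc, mhz):
    return ipc * mhz * 1e6  # instructions per second

cam_ba = ips(0.31, 244)  # back-to-back + age-based: higher IPC, slower clock
cam_l  = ips(0.29, 404)  # no back-to-back, location-based: fast clock
assert cam_l > cam_ba    # the frequency advantage dominates
```

Once the clock is capped externally (e.g., at the 303MHz of the rename unit), the frequency advantage of the simpler schedulers disappears and their IPC deficit starts to matter, which is exactly what Figure 6.7 shows.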
Chapter 5 shows that an FPGA-friendly renaming unit, a crucial OoO component,
operates at 303MHz when implemented on the same platform. Thus, in Figure 6.7
we study the effect of limiting the processor clock frequency to 303MHz. In this case
[Plot omitted: IPS vs. entries (2 to 32) for CAM-BA, CAM-A, CAM-BL, and CAM-L, capped at 303MHz.]
Figure 6.7: Overall performance of the scheduler designs when the operating frequency is limited to 303MHz.
using a slightly larger scheduler proves to be better. The four-entry CAM-BA, eight-
entry CAM-A and 16-entry CAM-L are the top three designs. Comparing these three
designs we observe that the performance loss due to decreasing scheduler entry count is
effectively compensated by the age-based selection policy and back-to-back scheduling
support. Additionally, as lower entry counts are desirable considering area usage, we
conclude that the 4-entry CAM-BA is the best configuration to choose, both in terms of
area and performance.
6.4 Related Work
To avoid the low frequency and high area usage of content-addressable memories, Mesa-
Martinez et al. [43] propose SEED, Scalable Efficient Enforcement of Dependences. SEED
uses indexed tables to track instruction dependencies. It uses multi-banked structures
and is shown to scale well on ASICs. However, SEED’s scalability is shown to be poor
on FPGAs as routing overhead among multiple components becomes critical. Fytraki
and Pnevmatikatos [30] and Derek et al. [18] implemented parts of an OoO processor on
an FPGA for the purpose of accelerating processor simulations. This is the first work
that studies how the area, frequency and most importantly performance of CAM-based
instruction schedulers scale with the number of scheduler entries on an FPGA.
6.5 Conclusion
This chapter explored part of the design space of instruction schedulers for out-of-order
soft processors. It examined the effect of scheduler size, instruction selection policy,
and back-to-back scheduling on performance, area and frequency. It showed that in
isolation (no restrictions on the clock frequency), a two-entry scheduler with a location-
based selection policy and no back-to-back scheduling achieves maximum performance.
However, by limiting the processor frequency to 303MHz (the frequency that an FPGA-
friendly register renamer operates at) we showed that a four-entry scheduler with age-
based selection policy and back-to-back scheduling reaches the maximum performance.
The results of this chapter can be used to estimate the best scheduler design under various
operating frequency assumptions.
Chapter 7
NCOR: Non-blocking Cache For
Runahead Execution
7.1 Introduction
This chapter presents NCOR (Nonblocking Cache Optimized for Runahead execution),
an FPGA-friendly alternative to conventional non-blocking caches. NCOR is specifically
designed for Runahead execution on FPGAs. NCOR avoids content-addressable
memories, structures that map poorly to FPGAs. Instead, it judiciously sacrifices some of
the flexibility of a conventional non-blocking cache to achieve higher operating frequency
and thus superior performance when implemented on an FPGA. Specifically, NCOR
sacrifices the ability to issue secondary misses, that is, requests for memory blocks that
map onto a cache line with an outstanding request to memory. Ignoring secondary misses
enables NCOR to track outstanding misses within the cache frames themselves, avoiding
the need for associative lookups. This chapter demonstrates that this simplification
affects neither performance nor correctness under Runahead execution.
This chapter quantitatively demonstrates that the usage of CAMs in conventional
non-blocking caches leads to a low operating frequency and high area usage. It also pro-
vides a detailed description of NCOR and of the underlying design trade-offs. It explains
how NCOR avoids the inefficiencies of conventional designs. It compares the frequency
and area of conventional CAM-based non-blocking caches and NCOR. Finally, it measures
how often secondary misses, those that NCOR does not service, occur in Runahead
execution, showing that they are relatively infrequent. The NCOR cache architecture
proposed in this chapter has been published as [4, 3].
The rest of this chapter is organized as follows: Section 7.2 reviews conventional, CAM-based
non-blocking caches. Section 7.3 provides the rationale behind the optimizations
incorporated in NCOR. Section 7.4 presents the NCOR architecture. Section 7.5 dis-
cusses the FPGA-implementation of NCOR. Section 7.6 evaluates NCOR comparing it
to conventional CAM-based non-blocking cache implementations. Section 7.7 reviews
related work, and Section 7.8 summarizes our findings.
7.2 Conventional Non-Blocking Cache
Non-blocking caches are used to extract Memory Level Parallelism (MLP) and reduce
latency compared to conventional blocking caches that service cache miss requests one at a
time. In blocking caches, if a memory request misses in the cache, all subsequent memory
requests are blocked and are forced to wait for the outstanding miss to receive data from
the main memory. Blocked requests may include requests for data that is already in
the cache or that could be serviced concurrently by modern main memory devices. A
non-blocking cache does not block subsequent memory requests when a request misses.
Instead, these requests are allowed to proceed concurrently. Some may hit in the cache,
while others are sent to the main memory system. Overall, because multiple
requests are serviced concurrently, the total amount of time the program has to wait for
the memory to service its requests is reduced.
To keep track of outstanding requests and to make the cache available while a miss
is pending, Miss Status Holding Registers (MSHRs) are used to store information
regarding all outstanding requests [38]. MSHRs maintain the information necessary
to direct the data received from the main memory to its rightful destination, e.g.,
cache frame or a functional unit. MSHRs can also detect whether a memory request
is for a block for which a previous request is still pending. Such requests can be ser-
viced without issuing an additional main memory request. To detect these accesses and
to avoid duplicate requests, for every request missing in the cache, the entire array of
MSHRs is searched. A matching MSHR means the data has already been requested
from the memory. Such requests are queued and serviced when the data arrives. Search-
ing the MSHRs requires an associative lookup, which is implemented using a Content-
Addressable-Memory (CAM). CAMs map poorly to reconfigurable logic as Section 7.6
shows. As the number of MSHRs bounds the maximum number of outstanding requests,
more MSHRs are desirable to extract more MLP. Unfortunately, the area and latency of
the underlying CAM grow disproportionately with the number of MSHRs, making a
large number of MSHRs impractical.
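The associative MSHR lookup described above can be sketched behaviorally. In hardware the search over all entries is a CAM; the structure and function names below are illustrative, not a real cache's API.

```python
# Behavioral sketch of the MSHR lookup in a conventional non-blocking
# cache: every miss searches all MSHRs associatively. A hit means the
# block is already in flight, so the new request is queued on that
# MSHR instead of generating another memory request.

class MSHR:
    def __init__(self, block_addr):
        self.block_addr = block_addr
        self.waiters = []  # requests serviced when the data arrives

def handle_miss(mshrs, block_addr, request):
    for m in mshrs:                     # associative search (CAM in hardware)
        if m.block_addr == block_addr:
            m.waiters.append(request)   # secondary miss: no new fetch
            return "merged"
    mshrs.append(MSHR(block_addr))      # primary miss: go to memory
    return "sent_to_memory"

mshrs = []
assert handle_miss(mshrs, 0x40, "ld r1") == "sent_to_memory"
assert handle_miss(mshrs, 0x40, "ld r3") == "merged"
assert len(mshrs) == 1
```

It is precisely this all-entries search, done on every miss, that forces a CAM implementation and makes large MSHR files expensive on FPGAs.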
7.3 Making a Non-Blocking Cache FPGA-Friendly
Runahead execution is conceptually an extension to a simple in-order processor. The
simplicity of its architecture is one of the primary reasons that makes Runahead suitable
for reconfigurable fabrics. However, for Runahead to be feasible on these fabrics, the ex-
tensions must come with low overhead. As Section 7.6 shows, conventional non-blocking
cache designs based on MSHRs do not map well onto FPGAs. Accordingly there is a need
to design a low cost non-blocking cache suitable for FPGAs. This work observes that
Runahead execution does not need the full functionality of a conventional non-blocking
cache and exploits this observation to arrive to an FPGA-friendly non-blocking cache
design for Runahead execution.
Conventional non-blocking caches that use MSHRs do not map well on reconfigurable
fabrics. The primary reason is that MSHRs use a CAM to perform associative searches.
As Section 7.6 shows MSHRs lead to low clock frequencies and high area usage. In
addition to MSHRs, the controller of a non-blocking cache is considerably more complex
compared to the one in a blocking cache. The controller is responsible for a wide range of
concurrent operations resulting in large, complex state machines. This work presents the
Non-blocking Cache Optimized for Runahead execution, or NCOR. NCOR has an FPGA-
friendly design that revisits the conventional non-blocking cache design considering the
specific needs of Runahead execution. NCOR does away with MSHRs and incorporates
optimizations for the cache controller and data storage.
7.3.1 Eliminating MSHRs
Using the following observations, NCOR eliminates the MSHRs:
1) As originally proposed, Runahead executes all trigger-miss-independent instruc-
tions during Runahead mode. However, since the results produced in Runahead
mode are later discarded, the processor can choose not to execute some of these in-
structions as it finds necessary. This option of selective execution can be exploited
to reduce complexity by avoiding the execution of instructions that require additional
hardware support. One such class is instructions that cause secondary
misses, that is, misses on already-pending cache frames. Supporting secondary
misses is conventionally done via MSHRs, which do not map well to FPGAs.
2) In most cases servicing secondary misses offers no performance benefit. There are
two types of secondary misses: redundant and distinct. A redundant secondary
miss requests the same memory block as the trigger miss while a distinct secondary
miss requests a different memory block that happens to map to the same cache
frame as the trigger miss. Section 7.6 shows that distinct secondary misses are a
very small fraction of the memory accesses made in Runahead mode. It should be
noted that this fraction is larger in ASIC implementations in which many factors,
e.g., memory latency and pipeline depth, are different.
Servicing a redundant secondary miss cannot directly improve performance further,
as the trigger miss will bring the data into the cache. A redundant secondary miss may
be feeding another load that will miss and that could otherwise be prefetched.
However, this cannot happen: the trigger miss is serviced first, which switches the
processor back to normal execution. On the other hand, distinct secondary
misses could prefetch useful data, but as Section 7.6 shows, this has a negligible
impact on performance.
Based on these observations, the processor can simply discard instructions that cause
secondary misses during Runahead mode while retaining most, and often all, of the
performance benefits of Runahead execution. However, NCOR still needs to identify
secondary misses in order to discard them. NCOR identifies secondary misses by tracking
outstanding misses within the cache frames, using a single pending bit per frame. Whenever
an address misses in the cache, the corresponding cache frame is marked as pending.
Subsequent accesses to this frame observe the pending bit, are identified as
secondary misses, and are discarded by the processor. Effectively, NCOR embeds the
MSHRs in the cache, while judiciously simplifying their functionality to reduce complexity
and retain most of the performance benefits.
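The pending-bit mechanism can be sketched as follows. This is a behavioral model of the miss path only (hits on non-pending frames are not modeled), and the direct-mapped frame indexing and block size are illustrative assumptions.

```python
# Sketch of NCOR's replacement for MSHRs: one pending bit per cache
# frame. A trigger miss sets the frame's bit; any later access that
# finds the bit set is a secondary miss, which is discarded in Runahead
# mode and stalls the processor in normal mode.

NUM_FRAMES = 8  # illustrative; real caches have many more frames

def access(pending, addr, runahead):
    frame = (addr // 16) % NUM_FRAMES   # 16-byte blocks, direct-mapped
    if pending[frame]:
        # Secondary miss: no associative lookup needed, just one bit.
        return "discard" if runahead else "stall"
    pending[frame] = True               # trigger miss: mark frame
    return "miss_issued"

pending = [False] * NUM_FRAMES
assert access(pending, 0x100, runahead=True) == "miss_issued"
assert access(pending, 0x100, runahead=True) == "discard"   # secondary
assert access(pending, 0x100, runahead=False) == "stall"
```

Note that a single bit per frame cannot distinguish redundant from distinct secondary misses; both are treated the same, which is exactly the simplification the two observations above justify.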
7.3.2 Making the Common Case Fast
Ideally, the cache performs all operations in as few cycles as possible. In particular, it is
desirable to service cache hits in a single cycle, as hits are expected to be the common
case. In general, it is desirable to design the controller to favor the frequent operations
over the infrequent ones. Accordingly, NCOR uses a three-part cache controller which
[Figure omitted: block diagram with the Lookup, Request, and Bus controller components, the request Queue between Request and Bus, the Tag, Data, and MetaData storage units, and the system bus connection.]
Figure 7.1: Non-blocking cache structure.
favors the most frequent requests, i.e., cache hits, by dedicating a simple sub-controller
just for hits. Cache misses and all non-cacheable requests (e.g., I/O requests) are handled
by other sub-controllers which are triggered exclusively for such events and are off the
critical path for hits. These requests complete in multiple cycles. The next section
explains the NCOR cache controller architecture in detail.
7.4 NCOR Architecture
Figure 7.1 depicts the basic structure of NCOR. The cache controller comprises Lookup,
Request, and Bus components. NCOR also contains Data, Tag, Request and Metadata
storage units.
7.4.1 Cache Operation
NCOR functions as follows:
• Cache Hit: The address is provided to Lookup which determines, as explained in
Section 7.4.2, that this request is a hit. The data is returned in the same cycle for
Load operations, and is stored in the cache during the next cycle for Store operations.
Other soft processor caches, such as those of Altera Nios II, use two cycles for stores
as well [13].
• Trigger Cache Miss: If Lookup identifies a cache miss, it sends a signal to Request to
generate the necessary requests to handle the miss. Lookup blocks the cache interface
until Request signals back that it has generated all the necessary requests.
Request generates all the necessary requests directed at Bus to fulfil the pending mem-
ory operation. If a dirty line must be evicted, a write-back request is generated first.
Then a cache line read request is generated and placed in the Queue between Request
and Bus.
Bus receives requests through the Queue and sends the appropriate signals to the
system bus. The pending bit of the cache frame that will receive the data is set.
• Secondary Cache Miss in Runahead Mode: If Lookup identifies a secondary cache miss,
i.e., a miss on a cache frame with pending bit set, it discards the operation.
• Secondary Cache Miss in Normal Mode: If Lookup identifies a secondary cache miss in
normal execution mode, it blocks the pipeline until the frame’s pending bit is cleared.
It is possible to have a secondary miss in normal execution mode, as a memory access
initiated in Runahead mode may still be pending. In normal execution the processor
cannot discard operations and must wait for the memory request to be fulfilled.
The following subsections describe the function of each NCOR component.
7.4.2 Lookup
Lookup is the cache interface that communicates with the processor and receives memory
requests. Lookup performs the following operations:
• For cache accesses, Lookup compares the request address with the tag stored in the
Tag storage to determine whether this is a hit or a miss.
• For cache hits, on a load, Lookup reads the data from the Data storage and provides
it to the processor in the same cycle as the Tag access. Reading the Data storage
proceeds in parallel with the Tag access and comparison. Stores, on the other hand,
take two cycles to complete as writes to the Data storage happen in the cycle after the
hit is determined. In addition, the cache line is marked as dirty.
• For cache misses, Lookup marks the cache line as pending.
• For cache misses and non-cacheable requests, Lookup triggers Request to generate
the appropriate requests. In addition, for loads, it stores the instruction metadata,
including the destination register name, in the MetaData storage. Lookup blocks the
processor interface until Request signals it has generated all the necessary requests.
• For cache accesses, whether the request hits or misses in the cache, if the corresponding
cache line is pending, Lookup discards the request if the processor is in Runahead mode.
However, if the processor is in normal execution mode, Lookup stalls the processor.
Note that it is possible to incur a pending line in normal execution mode under the
following scenario. The processor incurs a cache miss and switches to Runahead mode.
In Runahead mode it incurs a second cache miss and initiates another memory request,
setting the corresponding cache line to pending. The initial miss request is returned
and the processor switches back to normal execution mode. While the second miss
initiated in Runahead mode is still pending, the processor incurs another cache miss
that maps to the same pending cache line, hence the processor must stall.
7.4.3 Request
Request is normally idle waiting for a trigger from Lookup. When triggered, it issues
the appropriate requests to Bus through request Queue. Request performs the following
operations:
• Waits in the idle state until triggered by Lookup.
• For cache misses, Request generates a cache line read request. In addition if the evicted
line is dirty, Request generates a cache line write-back request.
• For non-cacheable requests, depending on the operation, Request generates a single
read or write request.
• When all necessary requests are generated and queued, Request notifies Lookup of its
completion and returns to its idle state.
7.4.4 Bus
Bus is responsible for servicing bus requests generated by Request. Bus receives requests
through the request Queue and communicates through the system bus with the main
memory and peripherals. Bus consists of two internal modules:
Sender
Sender sends requests to the system bus. It removes requests from the request Queue
and, depending on the request type, sends the appropriate signals to the system bus. A
request can be of one of the following types:
• Cache Line Read: Read requests are sent to the system bus for each data word of the
cache line. The critical word (word originally requested by the processor) is requested
first. This ensures minimum wait time for data delivery to the processor.
• Cache Line Write-Back: Write requests are sent to the system bus for each data word
of the dirty cache line. Data words are retrieved from Data storage and sent to the
system bus.
• Single Read/Write: A single read/write request is sent to the memory/peripheral
through the system bus.
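The critical-word-first ordering used for cache line reads can be illustrated with a short sketch (Python; the wrap-around order after the critical word is an assumption, since the text only specifies that the originally requested word is sent first):

```python
def cwf_order(critical_word: int, words_per_line: int = 8) -> list:
    """Word indices for a cache-line read, critical word first,
    wrapping around the line (32-byte line, 4-byte words -> 8 words)."""
    return [(critical_word + i) % words_per_line for i in range(words_per_line)]
```

For example, if the processor requested word 3 of a line, the read requests would go out in the order 3, 4, 5, 6, 7, 0, 1, 2, so the processor's word arrives first.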
Receiver
Receiver handles the system bus responses. Depending on the processor’s original request
type, one of the following actions is taken:
• Load from Cache: Upon receipt of the first data word, Receiver signals request com-
pletion to the processor and provides the data. This is done by providing the corre-
sponding metadata from the MetaData storage to the processor. Receiver also stores
all the data words received in the Data storage. Upon receipt of the last word, it stores
the cache line tag in the corresponding entry in the Tag storage, sets the valid bit and
clears both dirty and pending bits.
• Store to Cache: The first data word received is the data required to perform the store.
Receiver combines the data provided by the processor with the data received from the
system bus and stores it in the Data storage. It also stores subsequent data words,
as they are received, in the Data storage. Upon the receipt of the last word, Receiver
stores the cache line tag in the corresponding entry in the Tag storage, sets both valid
and dirty bits and clears the pending bit.
• Non-Cacheable Load: Upon receipt of the data word, Receiver signals request com-
pletion to the processor and provides the data. It also provides the corresponding
metadata from the MetaData storage. Non-cacheable loads are operations, e.g., I/O operations, that the instruction specifies must bypass the data cache and go directly to the system bus.
7.4.5 Data and Tag Storage
The Data and Tag storage units are tables holding cache line data words, tags, and status
bits. Lookup and Bus both access Data and Tag .
7.4.6 Request Queue
Request Queue is a FIFO memory holding requests generated by Request and directed at Bus. Request Queue conveys requests in the order they are generated.
7.4.7 Meta Data
For outstanding load requests, i.e., load requests missing in the cache or non-cacheable
operations, the cache stores the metadata accompanying the request. This data includes
Program Counter and destination register for Load instructions. Eventually when the
request is fulfilled this information is provided to the processor along with the data loaded
from the memory or I/O. This information allows the loaded data to be written to the
register file. MetaData is designed as a queue so that requests are processed in the order
they were received. No information is placed in the MetaData for Stores as the processor
does not require acknowledgements for their completion.
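The MetaData queue described above can be sketched in a few lines. This is an illustrative Python model, not hardware; it assumes a (PC, destination register) tuple per outstanding load:

```python
from collections import deque

class MetaDataQueue:
    """Sketch of the MetaData storage: outstanding loads complete
    in the order their requests were issued (FIFO). Stores enqueue
    nothing, as the processor needs no acknowledgement for them."""
    def __init__(self):
        self.q = deque()

    def issue_load(self, pc: int, dest_reg: int) -> None:
        self.q.append((pc, dest_reg))

    def complete(self, data: int) -> dict:
        # Pair the returned data with the oldest outstanding load's
        # metadata so it can be written to the register file.
        pc, dest_reg = self.q.popleft()
        return {"pc": pc, "dest": dest_reg, "data": data}
```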
7.5 FPGA Implementation
This section presents the implementation of the non-blocking cache on FPGAs. It dis-
cusses the design challenges and the optimizations applied to improve clock frequency
and minimize the area. It first discusses the storage organization and usage and the
corresponding optimizations. It then discusses the complexity of the cache controller’s
state machine and how its critical path was shortened for the most common operations.
7.5.1 Storage Organization
Modern FPGAs contain dedicated Block RAM (BRAM) storage units that are fast and
take significantly less area compared to LUT-based storage. This subsection explains
the design choices that made it possible to use BRAMs for most of the cache storage
components.
Figure 7.2: The organization of the Data and Tag storage units. Each Tag entry packs {unused, valid, Tag} into its lower 24 bits and {unused, dirty, pending} into its upper 8 bits; each Data entry holds one word of a cache line.
Data
Figure 7.2 depicts the Data storage organization. As BRAMs have a limited port width,
the entire cache line does not fit in one entry. Consequently, cache line words are spread,
one word per entry, over multiple BRAM entries. This work targets the Nios-II ISA [13]
which supports byte, half-word, and word stores (one, two, and four bytes respectively).
These are implemented using the BRAM byte enable signal [21]. Using this signal avoids
two-stage writes (read-modify-write) which would increase area due to the added multi-
plexers.
Tag
Figure 7.2 depicts the Tag storage organization. Unlike cache line data, a tag fits in one
BRAM entry. In order to reduce BRAM usage, we store cache line status bits, i.e., valid,
dirty and pending bits, along with the tags.
Despite the savings in BRAM usage by storing cache line status bits along with the
tags, the following problem arises. Lookup makes changes only to the dirty and pending
bits and should not alter valid or Tag bits. In order to preserve valid and Tag bits while
performing a write, a two stage write could be used, in which bits are first read and then
written back. This read-modify-write sequence increases area and complexity and hurts
performance. We overcome this problem by using the byte enable signals. As Figure 7.2
shows, we store valid and Tag bits in the lower 24 bits, and dirty and pending bits in the
higher eight bits. Depending on the tag size, a number of bits are unused in the lower 24-bit portion. Using the byte enable signal, Lookup is able to change only the upper
byte, i.e., dirty and pending bits.
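The packing and byte-enabled update can be modeled as follows. This is a Python sketch; the exact bit positions of valid, dirty, and pending within their bytes are assumptions, as the text only fixes the 24/8-bit split:

```python
# Assumed bit layout of a 32-bit Tag entry (positions are illustrative):
# tag in bits 0-22, valid at bit 23, pending at bit 24, dirty at bit 25.
VALID, PENDING, DIRTY = 1 << 23, 1 << 24, 1 << 25

def pack(tag: int, valid: bool, dirty: bool, pending: bool) -> int:
    entry = (tag & 0x7FFFFF) | (VALID if valid else 0)
    entry |= (DIRTY if dirty else 0) | (PENDING if pending else 0)
    return entry

def write_upper_byte(entry: int, dirty: bool, pending: bool) -> int:
    """Emulate a byte-enabled write: only the upper byte changes, so
    valid and tag survive without a read-modify-write sequence."""
    upper = (DIRTY if dirty else 0) | (PENDING if pending else 0)
    return (entry & 0x00FFFFFF) | upper
```

The point of the byte-enable trick is visible in `write_upper_byte`: the lower 24 bits pass through untouched, which is exactly what the BRAM does in hardware.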
7.5.2 BRAM Port Limitations
Although BRAMs provide fast and area-efficient storage, they have a limited number
of ports. A typical BRAM in today’s FPGAs has two ports available for reading and
writing [21]. Figure 7.3 shows that both Lookup and Bus write and read to/from the
Data and Tag storages. This requires four ports. Our design uses only two ports based
on the following observations: BRAMs can be configured to provide two ports, each
providing both write and read operations over one address line. Although Lookup and
Bus both write and read to/from the Data and Tag at the same time, each only requires
one address line.
Tag
For every access from Lookup to the Tag storage, Lookup reads the Tag , valid , dirty and
pending bits for a given cache line. Lookup also writes to the Tag storage in order to
mark a line dirty or pending . However, reads and writes never happen at the same time
as marking a line dirty (for stores) or pending (for misses) happens one cycle after the
Figure 7.3: Connections between the Data and Tag storages and the Lookup and Bus components.
tag and other status bits are read. Bus only writes to the Tag storage when a cache line
is retrieved from the main memory. Therefore, dedicating one address line to Lookup and
one to Bus is sufficient to access the Tag storage.
Data
For every Lookup access to the Data storage, Lookup either reads or writes a single word, or part of a word. However, Bus may need to write to, or read from, the Data storage at the same time. This occurs if Bus is sending the words of a write-back request while previously requested data is being delivered by the system bus. To avoid this conflict, we restrict
Bus to send a write-back data word only when the system bus is not delivering any
data. Forward progress is guaranteed as outstanding write-back requests do not block
responses from the system bus. This restriction minimally impacts cache performance
as words are sent as soon as the system bus is idle. In Section 7.6.9 we show that even
in the worst case scenario, impact on performance is marginal. With this modification,
dedicating one address line to Lookup and one to Bus is sufficient for accessing the Data
storage.
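This arbitration policy amounts to a simple fixed-priority choice over the shared Data-storage port, sketched below (Python; the names and return values are illustrative, not from the thesis RTL):

```python
def bus_grant(response_valid: bool, writeback_pending: bool) -> str:
    """One-cycle arbitration sketch: a system-bus response always wins
    the Data-storage port; a write-back word is sent only on cycles
    when the bus is not delivering data."""
    if response_valid:
        return "receive_response"
    if writeback_pending:
        return "send_writeback_word"
    return "idle"
```

Because responses are never blocked by pending write-backs, forward progress is guaranteed, matching the argument in the text.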
Figure 7.4: (a) Two-component cache controller. (b) Three-component cache controller.
7.5.3 State Machine Complexity
The cache controller is responsible for cache lookups, performing loads and stores, generating bus requests, and handling bus transactions. Given the number of
operations that the controller handles, in many cases concurrently, it requires a large
and complex state machine. A centralized cache controller can be slow, and has the
disadvantage of treating all requests the same. However, we would like the controller to
respond as quickly as possible to those requests that are most frequent, i.e., requests that
hit in the cache. Accordingly, we partition the controller into sub-components. One could
partition the controller into two components of CPU-side and bus-side, as Figure 7.4(a)
shows. The CPU-side component would be responsible for looking up addresses in the
cache, performing loads and stores, handling misses and non-cacheable operations, and
sending necessary requests to the bus-side component. The bus-side component would
communicate with the main memory and system peripherals through the system bus.
Due to the variety of operations that the CPU-side component is responsible for, we
find that it still requires a non-trivial state machine. The state machine has numerous
Figure 7.5: Lookup and Request state machines. Double-lined states are initial states. Lookup waits for Request completion in the "wait" state. All black states generate requests targeted at the Bus controller.
input signals and this reduces performance. Among its inputs is the cache hit/miss signal,
a time-critical signal due to the large comparator used for tag comparison. As a result,
implementing the CPU-side component as one state machine leads to a long critical path.
Higher operating frequency is possible by further partitioning the CPU-side compo-
nent into two subcomponents, Lookup and Request , which cooperatively perform the
same set of operations. Figure 7.4(b) depicts the three-component cache controller. The
main advantage of this controller is that cache lookups that hit in the cache, the most
frequent operations, are handled only by Lookup and are serviced as fast as possible.
However, this organization has its own disadvantages. In order for Lookup and Request to communicate, e.g., in the case of cache misses, extra clock cycles are required. Fortunately, these actions are relatively rare. In addition, in such cases servicing the request takes on the order of tens of cycles. Therefore, adding one extra cycle delay to
the operation has little impact on performance. Figure 7.5 shows an overview of the two
state machines corresponding to Lookup and Request.
7.5.4 Latching the Address
We use BRAMs to store data and tags in the cache. As BRAMs are synchronous RAMs,
the input address needs to be available just before the appropriate clock edge (rising in
our design) of the cycle when cache lookup occurs. Therefore, in a pipelined processor,
the address has to be forwarded to the cache from the previous pipeline stage, e.g., the
execute stage in a typical 5-stage pipeline. After the first clock edge, the input address
to the cache changes as it is forwarded from the previous pipeline stage. However, the
input address is further required for various operations, e.g., tag comparison. Therefore,
the address must be latched.
Since some cache operations take multiple cycles to complete, the address must be
latched only when a new request is received. This occurs when Lookup’s state machine
is entering the lookup state. Therefore, the address register is clocked based on the next
state signal. This is a time-critical signal and using it to clock a wide register, as is the
case with the address register, negatively impacts performance.
To avoid using this time-critical signal we make the following observations: The
cache uses a latched address in two phases: In the first cycle for tag comparison, and
in subsequent cycles for writes to the Data storage and for request generation. Accordingly, we can use two separate registers, addr_always and addr_lookup, one per phase. At every clock cycle, we latch the input address into addr_always; this register is used for tag comparison in the first cycle. At the end of the first cycle, if Lookup is in the lookup state, the content of addr_always is copied into addr_lookup; this register is used for writes to the cache and for request generation. As a result, the addr_always register is unconditionally clocked every cycle. In addition, we use Lookup's current-state register, rather than its next-state combinational signal, to clock the addr_lookup register. This improves the design's operating frequency.
Table 7.1: Architectural properties of simulated processors.

No. Ways                  1-4
I-Cache Size (Bytes)      32K
D-Cache Size (Bytes)      4-32K
Cache Line Size           32 Bytes
Cache Associativity       Direct Mapped
Memory Latency            26 Cycles
BPredictor Type           GShare
BPredictor Entries        4096
BTB Entries               256
Pipeline Stages           5
No. Outstanding Misses    32
7.6 Evaluation
This section evaluates NCOR. It first compares the area and frequency of NCOR with
those of a conventional MSHR-based non-blocking cache. It then shows the potential
performance advantage that Runahead execution has over an in-order processor using a
non-blocking cache.
7.6.1 Methodology
We use software simulations to estimate the performance of various NCOR configurations.
We follow the methodology explained in Chapter 3. The processor models include a 5-
stage in-order pipelined processor with Runahead execution support. Table 7.1 details
the simulated processor micro-architecture. We also compare the area and frequency
characteristics of NCOR against a conventional, MSHR-based non-blocking cache.
Although NCOR’s architecture is applicable to set-associative caches as well, in this
study we only consider direct-mapped caches. We find that set-associativity substantially
increases cache’s architectural and implementation complexity [67]. Specifically, set-
associative caches require multiple comparison operations for every lookup, which leads
to low clock frequencies. We decide not to include set-associative caches in our study as
we expect substantial frequency loss by making the cache set associative.
7.6.2 Simplified MSHR-Based Non-Blocking Cache
NCOR was motivated as a higher-speed, lower-cost, and lower-complexity alternative to conventional, MSHR-based non-blocking caches. A comparison of the two designs is needed
to demonstrate the magnitude of these advantages. Our experience has been that the
complexity of a conventional non-blocking cache design quickly results in an impractically
slow and large FPGA implementation. This makes it necessary to seek FPGA-friendly
alternatives such as NCOR. For the purposes of demonstrating that NCOR is faster
and smaller than a conventional non-blocking cache, it is sufficient to compare against
a simplified non-blocking cache. This is sufficient, as long as the results demonstrate
the superiority of NCOR and provided that the simplified conventional cache is clearly
faster and smaller than a full-blown conventional non-blocking cache implementation.
The simplifications made to the conventional MSHR non-blocking cache are as follows:
• Requests mapping to a cache frame for which a request is already pending are not
supported. Allowing multiple pending requests targeting the same cache frame
substantially increases complexity.
• Each MSHR entry tracks a single processor memory request, as opposed to all processor requests for the same cache block [38]. This eliminates the need
for a request queue per MSHR entry which tracks individual processor requests,
some of which may map onto the same cache block. In this organization the MSHRs
serve as queues for both pending cache blocks and processor requests. Secondary
misses are disallowed.
• Partial (byte or half-word) loads/stores are not supported.
We use this simplified MSHR-based cache for FPGA resource and clock frequency
comparison with NCOR. In the performance simulations, we use a regular MSHR-based
cache.
7.6.3 Resources
FPGA resources include ALUTs, block RAMs (BRAMs), and the interconnect. In these designs interconnect usage is mostly tied to ALUT and BRAM usage. Accordingly, this
section compares the ALUT and BRAM usage of the two cache designs. Figure 7.6
reports the number of ALUTs used by NCOR and the MSHR-based cache for various
capacities. The conventional non-blocking cache uses almost three times as many ALUTs
compared to NCOR. There are two main reasons why this difference occurs:
1. The MSHRs in the MSHR-based cache must use ALUTs exclusively instead of a
mix of ALUTs and BRAMs due to the nature of CAMs included in their design.
2. The CAM structure of the MSHRs requires a large number of comparators, which consume many ALUTs.
While the savings in ALUTs are small compared to the capacity of today's high-end FPGAs (>100K ALUTs), such savings add up, for example in a multi-processor environment. Additionally, some designs require low-capacity FPGAs, for example in low-budget or low-power applications.
In NCOR, the bulk of the cache is implemented using BRAMs, hence the high area
density and efficiency of the cache design. The vast majority of the BRAMs contain the
cache’s data, tag and status bits. As expected, both caches experience a negligible change
in ALUT usage over different capacities, as most of the cache storage is implemented using
BRAMs.
Figure 7.7 shows the number of BRAMs used in each cache for various capacities.
Compared to the conventional cache, NCOR uses one more BRAM as it stores pending
memory requests in BRAMs rather than in MSHRs.
Figure 7.6: Area comparison of NCOR and MSHR-based caches over various capacities.
Figure 7.7: BRAM usage of NCOR and MSHR-based caches over various capacities.
7.6.4 Frequency
Figure 7.8 reports the maximum clock frequency at which NCOR and the MSHR-based cache can operate, for various capacities. NCOR is consistently faster. The difference is at its highest (58%) for the 4KB caches, with NCOR operating at 329MHz compared to 207MHz for the MSHR-based cache. For both caches, and in most cases, frequency
decreases as the cache capacity increases. At 32KB NCOR’s operating frequency is within
18% of the 4KB NCOR. Although increased capacity results in reduced frequency in most
cases, the 8KB MSHR-based cache is faster than its 4KB counterpart. As the cache
Figure 7.8: Clock frequency comparison of NCOR and a four-entry MSHR-based cache over various cache capacities.
capacity increases, more sets are used, and hence the tag size decreases. Accordingly,
this makes tag comparisons faster. At the same time, the rest of the cache becomes
slower. These two latencies combine to determine the operating frequency which is at
a local maximum at a capacity of 8KB for the MSHR-based cache. However, as cache
capacity continues to grow, any reduction in tag comparison latency is overshadowed by
the increase in latency in other components.
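The tag-size effect follows from direct-mapped address arithmetic: tag bits = address bits − index bits − block-offset bits, so doubling the number of sets removes one tag bit. A small worked sketch (Python; the 32-bit address width is an assumption consistent with the Nios II target):

```python
from math import log2

def tag_bits(capacity_bytes: int, line_bytes: int = 32, addr_bits: int = 32) -> int:
    """Tag width for a direct-mapped cache: address bits minus
    set-index bits minus block-offset bits."""
    sets = capacity_bytes // line_bytes          # direct-mapped: one line per set
    return addr_bits - int(log2(sets)) - int(log2(line_bytes))
```

For 32-byte lines, a 4KB cache has 128 sets (7 index bits, 20 tag bits) while a 32KB cache has 1024 sets (10 index bits, 17 tag bits), so the larger cache compares a narrower tag.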
7.6.5 MSHR-Based Cache Scalability
The NCOR studied in this work is capable of handling up to 32 outstanding requests.
Supporting more outstanding requests in NCOR comes essentially for free, as they are tracked in BRAMs. An MSHR-based cache, however, uses CAMs, and hence LUTs, for storage.
Figure 7.9 reports how the frequency and area of the MSHR-based cache scale with MSHR
entry count. As expected, as the number of MSHRs increases clock frequency drops and
area increases. With 32 MSHRs, the MSHR-based cache operates at only 126MHz and
requires 3269 ALUTs.
Figure 7.9: Area and clock frequency of a 32KB MSHR-based cache with various numbers of MSHRs. The left axis is ALUTs and the right axis is clock frequency.
7.6.6 Runahead Execution
Figure 7.10 reports the speedup achieved by Runahead execution on 1- to 4-way super-
scalar processors modeled in simulation. For this comparison, performance is measured
as the instructions per cycle (IPC) rate. IPC is a frequency-independent metric and thus is useful in determining the range of frequencies at which an implementation can operate and still outperform an alternative. Runahead is able to outperform the corresponding in-order processor by extracting memory-level parallelism, effectively hiding the high main memory latency. For a typical single-issue pipeline (1-way), on average, Runahead
improves IPC by 26%.
As the number of outstanding memory requests increases, higher memory level par-
allelism is extracted, hence higher performance. Figure 7.11 shows how the IPC scales
with increasing the number of outstanding requests. Moving from two outstanding re-
quests to 32, we gain, on average, 7% in IPC. The impact of the number of outstanding
requests is even greater as the memory latency increases, as is expected with the increas-
ing gap between FPGA and DDR clock speeds. We study memory latency impact in
Figure 7.12. When memory latency is lower, increasing the number of outstanding requests improves speedup only marginally, i.e., by 7%. However, with a high memory latency, moving from
Figure 7.10: Speedup gained by Runahead execution on 1- to 4-way superscalar processors. The lower parts of the bars show the IPC of the normal processors. The full bars show the IPC of the Runahead processor.
Figure 7.11: The impact of the number of outstanding requests on IPC. Speedup is measured over the first configuration with two outstanding requests.
two outstanding requests to 32, the speedup doubles, i.e., from 26% to 54%.
Next we compare the speedup gained with NCOR to that of a full-blown MSHR-
based cache. Figure 7.13 compares the IPC of Runahead execution with NCOR and
MSHR-based caches. NCOR achieves slightly lower IPC, less than 4% on average, as it
sacrifices memory level parallelism for lower complexity. However, in the case of sjeng,
MSHR performs worse. MSHR is more aggressive in prefetching cache lines, and in this
Figure 7.12: Speedup gained by Runahead execution with two and 32 outstanding requests, with memory latency of 26 and 100 cycles.
Figure 7.13: Performance comparison of Runahead with NCOR and MSHR-based cache.
case pollutes the cache rather than prefetching useful data.
Finally, we compare NCOR and MSHR-based caches based on both IPC and operating frequency. We simulate two processors with different caches and compare them in terms of runtime, in seconds, to complete the execution of our benchmark set.
Figure 7.14 compares the two systems over a range of cache sizes. NCOR performs
the same task up to 34% faster than MSHR. It should be noted that NCOR with 4KB
capacity performs faster than a 32KB MSHR-based cache.
Figure 7.14: Average runtime in seconds for NCOR and MSHR-based cache.
Figure 7.15: Cache hit ratio for both normal and Runahead execution.
7.6.7 Cache Performance
This section compares cache performance with and without Runahead execution. Figure 7.15 reports the hit ratio of a 32KB cache with and without Runahead execution. Runahead improves the cache hit ratio, by as much as 23% for hmmer and by
about 7% on average. We also report the number of cache Misses Per Kilo Instructions
(MPKI) in Figure 7.16. Runahead reduces MPKI, on average by 39% as it effectively
prefetches useful data into the cache.
Figure 7.16: Number of misses per 1000 instructions executed in both normal and Runahead execution.
7.6.8 Secondary Misses
Runahead execution tied with NCOR achieves high performance even though the cache
is unable to service secondary misses. This section provides additional insight on why
discarding secondary misses has little effect on performance. Figure 7.17 reports, on
average, how many times the cache observes a secondary miss (only misses to a different
memory block) while in Runahead mode. The graph shows that every time the processor
switches to Runahead mode only 0.1 secondary misses are encountered, on average over
all benchmarks. Even if the cache were able to service secondary misses, it would have
generated only 10 memory requests every 100 times that it switches to Runahead mode.
Therefore, discarding secondary misses does not take away a significant opportunity to
overlap memory requests. Even for hmmer which experiences a high number of secondary
misses, Runahead achieves a 28% speedup as Figure 7.10 reports. This shows that non-
secondary misses are in fact fetching useful data.
Figure 7.17: Average number of secondary misses (misses only to different cache blocks) observed per invocation of Runahead execution in a 1-way processor.
7.6.9 Writeback Stall Effect
In Section 7.5.2 we showed that the BRAM port limitation requires NCOR to delay write-backs when the system bus is responding to an earlier cache line read request. Unfortunately, studying the impact on IPC would require a highly accurate DDR2 model in software simulation, which our infrastructure does not include. Instead, we study the most pessimistic scenario, in which every write-back coincides with the data return of a pending cache line read, resulting in a write-back stall. Although possible, this scenario is unlikely to occur and represents the absolute worst case in this study. Figure 7.18 shows that even in this worst-case scenario, Runahead execution with NCOR remains effective, losing less than 2% performance on average.
7.7 Related Work
Related work in soft processor cache design includes work on automatic generation of
caches and synthesizable high performance caches, including non-blocking and traversal
caches. To the best of our knowledge, NCOR is the first FPGA-friendly non-blocking
data cache optimized for Runahead execution.
Figure 7.18: IPC comparison of normal, Runahead, and Runahead with the worst-case scenario for write-back stalls.
The technique used in NCOR for tracking pending cache lines is similar to that proposed by Franklin and Sohi, which stores the MSHR information in the cache line rather than in a separate structure [28]. They add a transit bit to each cache line, indicating that the line is being fetched from main memory. In their scheme, the data stored in a cache line marked as in-transit provides the MSHR information. NCOR, however, uses separate registers to store this information, as it only requires MSHR information for one cache line, hence the area overhead is low.
Yiannacouras and Rose created an automatic cache generation tool for FPGAs [67].
Their tool is capable of generating a wide range of caches based on a set of configuration
parameters, for example cache size, associativity, latency, and data width. The tool is
also useful in identifying the best cache configuration for a specific application.
The PowerPC 470S is a synthesizable soft-core implementation that is equipped
with non-blocking caches. This core is available under a non-disclosure agreement from
IBM [42]. A custom-logic implementation of this core, the PowerPC 476FP, has been produced by LSI and IBM [42]. However, the 470S is not tuned for FPGA implementation and its efficiency on such a platform remains to be studied.
Coole et al. present a traversal data cache framework for soft processors [20]. Traversal caches are suitable for applications with pointer-based data structures. It is shown that, for such applications, traversal caches may improve performance by as much as 27x. Traversal caches are orthogonal to NCOR.
Choi et al. study the design and implementation of multi-ported data caches on FPGAs [19]. They investigate various cache architectures for systems in which multiple components access the cache at the same time. They propose a multi-pumped cache that achieves high performance without partitioning the memory, so the entire cached memory space is available through all cache ports in a single cycle.
To improve area and frequency efficiency, NCOR avoids CAMs. Dhawan and DeHon propose dMHC, a near-associative memory architecture that exploits BRAMs to
store data and uses Bloom filters to track and match keys inside the memory [24]. dMHC
is shown to achieve higher performance compared to a naive, LUT-based implementation
of content addressable memories on FPGAs [24].
7.8 Conclusion
This chapter presented NCOR, an FPGA-friendly non-blocking data cache implementation for soft processors with Runahead execution. It showed that a conventional non-
blocking cache is expensive to build on FPGAs due to the CAM-based structures used
in its design. NCOR exploits the key properties of Runahead execution to avoid CAMs
and instead stores information about pending requests inside the cache itself. In addi-
tion, the cache controller is optimized by breaking its large and complex state machine
into multiple, smaller, and simpler sub-controllers. Such optimizations improve design
operating frequency. A 4KB NCOR operates at 329 MHz on Stratix III FPGAs while
it uses only 270 logic elements. A 32KB NCOR operates at 278 MHz using 269 logic
elements.
Chapter 8
SPREX: Soft Processor with Runahead EXecution
This chapter presents SPREX (Soft Processor with Runahead EXecution), an FPGA-
friendly, synthesizable soft processor with Runahead execution. Conventional Runahead
implementations were proposed for ASIC designs, whose constraints differ from those of
the FPGA fabric; they rely on structures, such as CAMs, that do not map well onto
FPGAs. SPREX avoids the inefficiencies of conventional Runahead designs by exploiting
CFC and NCOR. CFC avoids copying, allowing BRAMs to be used for storage while still
providing checkpointing functionality. NCOR does not use CAMs while it
provides non-blocking data cache functionality required by Runahead execution. In this
chapter we discuss the details of the SPREX implementation and the challenges of tuning
it to map well onto FPGAs. We implement SPREX in Verilog and show that, for our
benchmark set, it improves performance by 9% on average and by as much as 36%. The
architecture of SPREX and its performance study have been published in [5].
The rest of this chapter is organized as follows: Section 8.1 discusses the challenges
in implementing a Runahead processor on FPGAs. Section 8.2 presents the architecture
of SPREX. Section 8.3 presents our experimental evaluation of SPREX using both
software simulation and an actual hardware implementation. Section 8.4 presents related
work and, finally, Section 8.5 concludes this chapter.
8.1 Challenges of Runahead Execution in Soft Processors
A processor with Runahead execution requires additional functionality beyond simple
pipelining (Chapter 2 discussed Runahead execution in more detail), and this
functionality comes with area and frequency overheads. Conventional Runahead designs
were proposed for custom ASIC implementation, in which the implementation trade-offs
differ from those on FPGAs. For example, on FPGAs BRAMs have a limited number of
ports and discrete sizes, whereas arbitrary SRAMs can be implemented in ASICs [21].
One of the key mechanisms Runahead requires is register file checkpointing. Instructions
executed in Runahead mode must not alter the processor's architectural state,
including the register file contents. As the processor switches to Runahead mode, it
checkpoints the register file; that is, it saves a copy of the register file contents,
to be restored when the processor exits Runahead mode.
Saving the register file requires copying its contents to backup storage. Conventional
ASIC implementations checkpoint register files by interleaving checkpoint bits next to
each register file bit [53], which allows mass copying of data in a single cycle. On
FPGAs, however, register files are implemented using BRAMs for area efficiency, and
BRAMs are equipped with a limited number of ports, normally at most two [21].
Therefore, copying the entire register file takes multiple cycles. Multi-cycle checkpointing
delays the processor's entry into Runahead mode, diminishing any performance
benefits. An alternative would be to implement the register file using dedicated registers
in LUTs, which leads to a large and area-inefficient design.
A Runahead processor pipelines multiple memory requests to reduce total data retrieval
time. Consequently, the processor requires a non-blocking data cache. Conventional
non-blocking cache designs are based on CAMs, which map poorly onto FPGAs.
CAMs include an array of large comparators to perform associative lookups. The
resulting FPGA implementation stores the CAM data in dedicated registers in LUTs,
and further uses LUTs to implement a collection of multiplexers that select the matching
cell; the result is slow and large [3, 4].
8.2 SPREX: An FPGA-Friendly Runahead Architecture
This section describes SPREX, an FPGA-friendly Runahead architecture that has been
tailored to map well onto reconfigurable fabrics. SPREX is based on the Nios II ISA
and resembles a Nios II/s implementation [13]. SPREX revisits the conventional Runahead
architecture, considering which functions are needed, how well they map onto FPGAs,
and their corresponding performance benefit. As a result, SPREX keeps just those
functions that are absolutely necessary for Runahead while avoiding others that in most
cases yield negligible performance gains. Figure 8.1 shows a conventional in-order
processor architecture augmented with additional components for Runahead support.
8.2.1 Checkpointing
A Runahead processor uses checkpointing to preserve its architectural state while execut-
ing instructions in Runahead mode. For checkpointing the register file, we use Copy-Free
Checkpointing (CFC) as proposed in Chapter 5. CFC checkpoints the register file with-
out performing any copy operations, therefore is ideal for implementation using BRAMs.
CFC can support multiple checkpoints and could be used as a component for an out-of-
Figure 8.1: Gray components (Fetch, Decode, Execute, Memory, Write, register file)
form a typical 5-stage in-order pipeline. Black components (CFC, NCOR, register
tracking, Runahead control) are added to support Runahead execution.
order soft-core implementation. For Runahead, only one checkpoint of the register file
is required. SPREX is based on the Nios II ISA, which includes 32 registers of 32 bits
each, totaling 1024 bits of storage. With the checkpoint, the total storage needed is
2048 bits, which still fits in one block RAM (an M9K block on Stratix III devices).
Therefore, using CFC to checkpoint the register file incurs no storage overhead in
terms of block RAMs used.
CFC requires a small vector for checkpoint tracking. This vector is stored in dedicated
registers rather than in a block RAM as parallel access is required. Only one bit per
architectural register is needed, or 32 bits in total.
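To make CFC's copy-free behaviour concrete, the following Python sketch models one plausible single-checkpoint organization: each register has two storage slots (modelling a double-width BRAM word), a per-register select bit picks the live slot, and taking a checkpoint saves only the 32-bit select vector. The two-slot layout and all names here are illustrative assumptions, not the actual CFC design.

```python
NUM_REGS = 32

class CFCRegisterFile:
    """Behavioral sketch of copy-free checkpointing with one checkpoint."""
    def __init__(self):
        # Two slots per register, modelling a double-width BRAM word.
        self.slots = [[0, 0] for _ in range(NUM_REGS)]
        self.sel = [0] * NUM_REGS   # which slot holds the architectural value
        self.saved_sel = None       # the checkpoint tracking vector

    def read(self, r):
        return self.slots[r][self.sel[r]]

    def write(self, r, value):
        if self.saved_sel is not None:
            # In Runahead mode: never overwrite the checkpointed slot.
            target = 1 - self.saved_sel[r]
        else:
            target = self.sel[r]
        self.slots[r][target] = value
        self.sel[r] = target

    def checkpoint(self):
        # No data is copied: only the 32-bit select vector is saved.
        self.saved_sel = list(self.sel)

    def restore(self):
        # Exiting Runahead mode: revert to the checkpointed mapping.
        self.sel = self.saved_sel
        self.saved_sel = None
```

Note how both `checkpoint` and `restore` touch only the select vector, which is why CFC can keep the register data itself entirely inside BRAM.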
8.2.2 Non-Blocking Cache
In Chapter 7 we showed that not all of the capabilities a full-blown non-blocking cache
provides offer significant performance benefits. Conventional non-blocking caches are de-
signed to overlap any arbitrary combination of memory references for best performance.
To support all combinations of cache accesses, MSHRs are used. However, during
Runahead mode the processor does not have to execute all instructions; it can selectively
discard instructions that require complex
support. Accordingly, we choose NCOR as the data cache for SPREX. NCOR is an
FPGA-friendly non-blocking cache specialized for Runahead execution. It does not
service secondary misses, that is, misses to cache lines that already have another request
pending. In Chapter 7 we showed that supporting secondary misses has negligible
performance benefits. By removing this support, NCOR is able to replace MSHRs with
single bits stored alongside each cache line.
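As an illustration of this design point, here is a hedged Python sketch of a direct-mapped cache that replaces MSHRs with a single pending bit per line; a miss to a line that already has a request in flight (a secondary miss) is simply discarded. The class and field names are invented for illustration and do not reflect NCOR's actual structure.

```python
class Line:
    def __init__(self):
        self.valid = False    # line holds a filled block
        self.pending = False  # a fill request is in flight for this line
        self.tag = None
        self.data = None

class PendingBitCache:
    """Direct-mapped cache sketch: one pending bit per line, no MSHRs."""
    def __init__(self, num_lines=8):
        self.lines = [Line() for _ in range(num_lines)]
        self.outstanding = []               # addresses sent to memory

    def access(self, addr):
        line = self.lines[addr % len(self.lines)]
        tag = addr // len(self.lines)
        if line.valid and line.tag == tag:
            return "hit"
        if line.pending:
            return "discard"                # secondary miss: drop the request
        line.valid = False                  # primary miss: claim the line
        line.pending = True
        line.tag = tag
        self.outstanding.append(addr)
        return "miss"

    def fill(self, addr):
        """Memory returns the block: clear the pending bit."""
        line = self.lines[addr % len(self.lines)]
        line.valid, line.pending = True, False
        line.data = ("block", addr)
        self.outstanding.remove(addr)
```

The entire MSHR lookup collapses to testing one bit, which is the property that lets NCOR avoid CAMs.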
8.2.3 Extra Decoding
In Runahead mode, not all instructions should be executed. For example, instructions
that change the processor control registers, or that cause exceptions, must be discarded.
Therefore, a small decoder is added to the Decode stage to identify the instructions
that need to be flushed in Runahead mode.
8.2.4 Store Instructions
The processor runs speculatively when in Runahead mode. Therefore, no instruction
may make persistent changes to the processor state, including the data cache.
However, store instructions, if executed, would change data words in the data cache.
For store instructions, we considered the following options:
1. Discard stores altogether: discarding any instruction in Runahead mode is perfectly
safe and does not affect overall program execution correctness [3].
2. Discard stores that hit in the cache, but fetch the cache lines addressed by missing
stores without actually modifying them. (In conventional caches, stores must first
fetch the whole cache line and then modify the part they touch.)
3. Use a speculative store buffer to keep the store values produced in Runahead mode
and prevent them from modifying the memory hierarchy. With this option, subsequent
loads in the same Runahead episode that access the same address are serviced from
the buffered store data. This can potentially lead to a more precise execution in
Runahead mode, and hence more precise memory prefetches.
The first two options are simple to implement; the only difference is in the way
store instructions are serviced in Runahead mode. Performance-wise, the second option
may achieve higher performance by prefetching more cache lines than the first option.
However, it is also possible that such lines pollute the cache, hurting performance.
The third option requires an extra storage unit to keep the store values produced
in Runahead mode. Many store-buffer designs have been proposed in the past [46]. The
typical design contains an associative array, which does not map well onto FPGAs,
imposing area and complexity overheads. Alternative designs sacrifice performance to
shrink the associative array [50].
Section 8.3 shows that prefetching cache lines for stores results in more cache pollution
than useful prefetches. We conclude that the added complexity of using store buffers in
Runahead mode is not justified, and that it is not beneficial to prefetch cache lines
for stores executed in Runahead mode. Therefore, SPREX discards all store instructions
during Runahead mode.
8.2.5 Register Validity Tracking
In order to maximize the prefetching of useful cache lines, program execution must
be followed as accurately as possible in Runahead mode. However, not all data is
available to the processor in Runahead mode; for example, not all registers hold valid
data [25]. Registers end up with bogus values during Runahead mode for two reasons.
First, if the trigger miss is a load instruction, its destination register does not yet hold
valid data, as the load is still pending; hence, any instruction using that register, and
all instructions further down the dependency graph, produce bogus data. Second, the
destination registers of discarded instructions end up with bogus data as well.
Executing with bogus data may lead to prefetching bogus addresses, polluting the
cache. Since instructions execute speculatively in Runahead mode, correctness is
preserved, but bogus prefetches may hurt performance; indeed, Section 8.3 shows that
avoiding bogus prefetches leads to higher performance. Therefore, instructions accessing
bogus data are best identified and discarded. SPREX tracks register validity as it
executes instructions in Runahead mode.
Tracking data validity incurs a small overhead: one additional bit per register. An
instruction is discarded if any of its source registers is marked invalid. In addition,
if an instruction that produces a result is discarded, its destination register is marked
invalid as well. A register becomes valid again when an instruction writes valid data
into it.
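The tracking rules above can be sketched in a few lines of Python; the instruction encoding (source list, destination, discard flag) is a simplification invented for illustration.

```python
NUM_REGS = 32
valid = [True] * NUM_REGS  # one validity bit per architectural register

def runahead_execute(srcs, dst, must_discard=False):
    """Apply the validity-tracking rules to one Runahead-mode instruction."""
    if must_discard or any(not valid[s] for s in srcs):
        if dst is not None:
            valid[dst] = False   # a discarded instruction poisons its destination
        return "discarded"
    if dst is not None:
        valid[dst] = True        # writing valid data revalidates the register
    return "executed"
```

Invalidity thus propagates down the dependency graph automatically, and a later valid write clears it.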
8.3 Evaluation
8.3.1 Methodology
Given the number of parameters involved in the design space of Runahead, we used
software simulations to determine the best configuration and then implemented it in
hardware. We follow the methodology explained in Chapter 3. Table 8.1 reports the
architectural properties of the simulated and implemented processor. Our simulation
infrastructure uses a simplified DDR2 memory model and as a result, the performance
predicted by simulation does not completely match that measured on actual hardware.
In all experiments, we report speedup over a simple 5-stage in-order pipeline. We
use microbenchmarking to tune the base pipeline to match Nios II in terms of IPC.
After choosing the best Runahead configuration, we implement it in Verilog and
synthesize it for the FPGA. SPREX operates at a maximum clock speed of 146 MHz;
we run it at 133 MHz, conveniently half the clock speed of the DDR memory. The
timers and the UART run at 50 MHz.
Table 8.1: Architectural properties of the simulated and implemented processors.

    Pipeline stages:                 5
    Branch predictor:                Bimodal
    Bimodal entries:                 512
    I-cache:                         Blocking
    I-cache size:                    32 KB
    I-cache block size:              32 bytes
    D-cache:                         NCOR
    D-cache size:                    32 KB
    D-cache block size:              32 bytes
    NCOR outstanding requests:       32
    Cache associativity:             Direct-mapped
    Memory latency (simulation):     24 cycles
    Checkpointing:                   CFC
    CFC checkpoints:                 1
Chapter 5 and Chapter 7 showed, using software simulation, that NCOR and CFC
can support Runahead execution effectively [3, 1]. Here we first investigate, through
software simulation, the following additional key design choices for Runahead:
1. How to handle stores
2. Whether to track register validity during Runahead mode
3. How many outstanding requests the data cache should track
We finally measure performance on actual hardware and report the area and frequency
characteristics.
8.3.2 Stores During Runahead
Figure 8.2 compares three simulated Runahead architectures in terms of speedup over a
simple 5-stage pipeline. All three architectures use register validity tracking. The first
Figure 8.2: Store handling during Runahead mode: speedup comparison of the three
choices (discard stores, prefetch stores, prefetch + store buffer); see text for a
description.
architecture discards all store instructions in Runahead mode. The second architecture
prefetches cache lines for stores that miss in Runahead mode, but does not store any
data in the cache. The third architecture includes a store buffer in addition to
prefetching cache lines for stores.
In two cases, bzip2 and h264, we observe a significant loss of performance when store
instructions are included. Apart from a mild performance gain for astar, the other
benchmarks exhibit little to no sensitivity to the inclusion of stores in Runahead
execution. For the quantum benchmark, we observe a negligible performance loss (less
than 2%), which is the result of cache pollution. We conclude that, given the complexity
of store buffers, it is not beneficial to use them, nor to execute stores in Runahead
mode at all. Hence our final SPREX implementation discards store instructions in
Runahead mode, and for the rest of the evaluation we restrict our attention to this
first option.
Figure 8.3: Speedup with and without register validity tracking.
8.3.3 Register Validity Tracking
Since correctness is not required in Runahead mode, we have the option of executing
instructions on bogus data, so tracking register validity is not critical for correctness.
However, discarding bogus instructions can yield higher performance, as more useful
cache lines are prefetched. Figure 8.3 compares the speedup achieved with and without
register validity tracking. Performance is better by 4% on average when register
tracking is enabled. As register tracking comes with little overhead, we opt to include
it.
8.3.4 Number of Outstanding Requests
As memory latency increases, more time is spent in Runahead mode, and the processor
has a higher chance of finding and overlapping memory accesses. However, the number
of memory requests the processor can generate in Runahead mode is limited by the
number of outstanding requests the cache supports.
NCOR uses block RAMs to store information regarding outstanding memory requests.
Figure 8.4 shows NCOR’s block RAM and ALUT usage based on the number of out-
Figure 8.4: NCOR resource usage based on the number of outstanding requests.
Figure 8.5: Speedup comparison of architectures with various numbers of outstanding
requests.
standing requests. NCOR's block RAM usage is insensitive to the number of outstanding
requests in the range 2-64; ALUT usage, however, is directly affected. Figure 8.5
shows performance over the same range. The additional speedup obtained with more
than four outstanding requests is insignificant, so based on these results we use an
NCOR with four outstanding requests.
Figure 8.6: Memory bandwidth usage increase due to Runahead execution.
8.3.5 Memory Bandwidth
SPREX prefetches cache lines in the hope that they will be used in the near future. This
increases pressure on the memory subsystem, potentially increasing power dissipation.
Figure 8.6 reports the increase in memory bandwidth usage due to Runahead execution.
On average, memory bandwidth usage increases by 12%, peaking at 95% for the h264
benchmark. We expect that Runahead does not significantly increase memory bandwidth
usage in the system; however, the actual impact on power dissipation must be measured
to reach a conclusive result, which we leave for future work.
8.3.6 Branch Prediction Accuracy
SPREX encounters and predicts branch instructions in Runahead mode as well as in
normal execution mode. Executing branches in Runahead mode gives the branch predictor
an opportunity to be trained before the actual branch is encountered in normal
execution. Therefore, higher branch prediction accuracy is expected with Runahead
execution. Figure 8.7 compares the prediction accuracy with and without Runahead
execution. For all benchmarks except bzip2 and sjeng, prediction accuracy increases,
on average by 13%; it decreases by 11% and 1% for bzip2 and sjeng, respectively.
Figure 8.7: Comparison of branch prediction accuracy for normal and Runahead executions.
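As a concrete illustration of why Runahead can pre-train the predictor, the following Python sketch models a 2-bit bimodal predictor of the kind listed in Table 8.1; updating it for branches resolved in Runahead mode warms the counters before normal execution revisits the same branches. The modulo indexing and method names are assumptions for illustration, not SPREX's actual predictor logic.

```python
class BimodalPredictor:
    """2-bit saturating-counter bimodal predictor (512 entries in Table 8.1)."""
    def __init__(self, entries=512):
        self.ctr = [1] * entries          # initialized weakly not-taken

    def predict(self, pc):
        return self.ctr[pc % len(self.ctr)] >= 2   # True = predict taken

    def update(self, pc, taken):
        # Called for branches resolved in normal AND Runahead mode,
        # so Runahead episodes pre-train the table.
        i = pc % len(self.ctr)
        self.ctr[i] = min(3, self.ctr[i] + 1) if taken else max(0, self.ctr[i] - 1)
```

A branch first resolved during a Runahead episode flips its counter toward the correct direction, so the later non-speculative encounter is predicted correctly.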
8.3.7 Final Processor Performance
We compare our final SPREX implementation against a simple 5-stage pipeline, using
execution time for our comparison, that is, the number of processor cycles it takes to
execute one billion instructions. Figure 8.8 compares the two architectures. SPREX
consistently outperforms the baseline processor. SPREX's performance advantage is
much higher for bzip2, astar and xalanc; the speedup peaks at 36% for the astar
benchmark. Lower performance gains, ranging from 3% to 5%, are observed for the
other benchmarks.
8.3.8 Runahead Overhead
Runahead comes with overheads in both area and frequency. Table 8.2 reports the
area usage of the entire SPREX processor, including Runahead functionality. The table
Figure 8.8: Speedup gained with Runahead execution over normal execution on actual
FPGA hardware.
also reports the area usage of individual Runahead components. In the case of NCOR,
the numbers in parentheses indicate the overhead over a simple blocking cache. Runahead
requires a total of 324 additional logic elements, 279 registers and 4 block RAMs, which
amount to 19%, 18% and 57% of the processor's total logic element, register and block
RAM usage respectively. Considering the storage required for caches, the block RAM
overhead is much lower: for a SPREX with 32KB caches, the BRAM overhead is only
5%, that is, 4 BRAMs in addition to the 77 BRAMs used in the entire processor.
Previous research has shown that the components used in this work to support Runahead
are fast and area-efficient on FPGAs [1, 4, 3]. Future work can investigate critical
paths in SPREX and tune the architecture further to improve clock frequency and thus
performance.
8.4 Related Work
A few past works have focused on architectures targeting programs with unstructured
ILP, for example superscalar, out-of-order, or Runahead. The Santa Cruz Out-of-Order RISC
Table 8.2: Runahead processor hardware cost breakdown. Numbers in parentheses denote
the overhead for Runahead support.

                                 ALUTs       Registers   Block RAMs
    Entire SPREX                 1774        1518        7 + caches
    Extra decoder                4           -           -
    Register tracking            83          32          -
    CFC                          88          32          -
    NCOR                         412 (149)   323 (215)   4 + caches (4)
    Total Runahead overhead      324 (19%)   279 (18%)   4 (57%)
    Including cache storage      -           -           4 (5%)
Engine, SCOORE [27], is a project targeting a full-blown out-of-order soft processor
with large resource usage (over 100K LUTs). SCOORE shows why out-of-order
implementations do not map well onto FPGAs, resulting in expensive and inefficient
implementations; the primary goal of the SCOORE project is simulation acceleration.
Rosière et al. propose a multi-banked ROB implementation [52], a key component of
out-of-order architectures. Fytraki and Pnevmatikatos implement parts of an out-of-order
processor on an FPGA for the purpose of accelerating processor simulation [30]. To the
best of our knowledge, SPREX is the first soft processor architecture with Runahead
execution.
8.5 Conclusion
This chapter took a first step towards implementing a high-performance soft processor
targeting programs with unstructured ILP. We presented SPREX, an FPGA-friendly,
synthesizable Runahead soft processor architecture, and showed that Runahead provides
significant performance benefits in reconfigurable environments, by up to 36%. We
showed that by sacrificing less important functionality, we can achieve an efficient
architecture for FPGAs while maintaining Runahead's performance benefits. Our next
steps include understanding and eliminating frequency bottlenecks in our implementation
of the architecture; further optimization may allow the processor to run at a higher
clock frequency, possibly matching that of the memory controller, i.e., 266 MHz.
Chapter 9
Concluding Remarks
For embedded systems incorporating soft processors, many architectures have been
proposed for accelerating applications, including VLIW, vector processing, and SIMD.
However, these architectures target programs with regular parallelism that can be
extracted offline. As embedded systems grow in size and complexity, their software
evolves as well, leading to programs with unstructured parallelism that is inherently
difficult, and sometimes impossible, to extract offline.
This thesis considered microarchitectures designed for programs with irregular
parallelism, under a set of constraints unique to FPGAs. Superscalar, out-of-order, and
Runahead processing are the three main architectures proposed for such applications,
all of which have been extensively studied in the ASIC paradigm. This thesis
investigated the potential and feasibility of each architecture for FPGA implementation.
Superscalar processing was shown to be undesirable due to low clock frequency
and high area cost. A narrow out-of-order pipeline, on the other hand, was shown to
be promising. We redesigned and investigated many components of the OoO architecture,
including checkpointing, the register renamer, the instruction scheduler, and the
non-blocking cache. Although the potential for OoO processing on FPGAs was
demonstrated, a fully functioning core was left for future work. Finally, a complete
soft core with Runahead execution was introduced, which achieves high performance at
area costs comparable to those of off-the-shelf in-order soft processors.
This chapter presents the summary of the thesis and research contributions, followed
by directions for future research.
9.1 Thesis Summary
Implementing soft processors comes with various challenges, including maintaining high
clock frequency, low area cost, and low instruction cycle count. Despite their differences,
many processor microarchitectures are based on the conventional 5-stage pipeline.
Accordingly, this thesis studied the challenges in implementing a typical soft processor
on FPGAs and proposed solutions for each challenge faced.
Next, we considered an OoO architecture, which is suitable for accelerating programs
with unstructured parallelism. However, implementing an OoO soft processor comes with
additional challenges compared to a simple 5-stage in-order pipeline: such an architecture
employs additional components and mechanisms that have mostly been studied only for
ASIC implementation. Accordingly, we studied the feasibility of many OoO components
and mechanisms on FPGAs and proposed FPGA-friendly alternatives where conventional
designs mapped poorly to FPGAs. We exploited the unique characteristics of FPGA
resources, such as BRAMs, to maintain high clock frequency and low area cost while
providing the same functionality when redesigning OoO components.
We also studied Runahead execution as a simpler alternative to OoO, and showed that
it provides most of the benefits of OoO processing in an embedded environment. However,
Runahead still requires additional functionality on top of an in-order pipeline. This
thesis studied the requirements of Runahead execution and proposed novel techniques
to provide them while utilizing FPGA resources to achieve high clock frequency and
low area cost.
Finally, SPREX, a complete soft processor implementation with Runahead execution,
was introduced, which provides higher performance at an area cost comparable to that
of the simple in-order processors available today.
More specifically, the contributions of this thesis are as follows:
• This thesis investigated the challenges in implementing soft processors on FPGAs.
As many processor architectures are based on the typical 5-stage pipeline, the
challenges one faces in implementing them are similar. This thesis identified and
categorized various challenges designers face in implementing soft processors, for
example low clock frequency due to data forwarding in the pipeline and hazard
detection, and proposed solutions to overcome such challenges.
• This thesis introduced CFC, a novel copy-free checkpointing mechanism that takes
advantage of the LUT structure and BRAMs on FPGAs to achieve high perfor-
mance and low area cost. Conventional checkpointing mechanisms employ bit-
interleaving techniques to copy checkpoint data between multiple storage banks.
However, CFC avoids data copying and uses sophisticated data indexing to locate
the desired checkpoint data among the many versions stored.
• This thesis investigated the implementation of instruction schedulers for OoO pro-
cessing on FPGAs. It showed that considering the scheduler as part of the whole
processor pipeline, it is beneficial, both in terms of clock frequency and area cost,
to employ a small four-entry scheduler which utilizes a sophisticated, age-based
selection policy and fast, back-to-back scheduling.
• This thesis introduced NCOR, a non-blocking data cache tailored for Runahead
execution on FPGAs. NCOR does away with the CAMs used in conventional
non-blocking cache designs; instead, it stores the metadata used for tracking
pending cache lines in the cache itself. Compared to a full-blown non-blocking
cache, NCOR provides only the functionality that Runahead execution needs,
leading to a smaller and faster design.
• This thesis introduced SPREX, a complete soft processor with Runahead execution.
SPREX utilizes CFC and NCOR for checkpointing and non-blocking functionality
respectively, which are required for Runahead execution. Furthermore, SPREX
is shown to provide higher performance compared to off-the-shelf soft processors,
while using comparable FPGA resources.
9.2 Future Work
This thesis studied the implementation of a fast and small OoO soft processor on FPGAs,
a microarchitecture that had never been implemented on FPGAs with area and frequency
characteristics comparable to those of inorder processors. In this section we discuss
possible future research directions that are enabled by the research done in this thesis.
9.2.1 Out-of-Order Execution
This thesis demonstrated the potential for OoO execution on FPGAs. We showed that
a 1-way OoO soft processor is able to reach higher performance than a simple 5-stage
inorder pipeline. We proposed alternative, FPGA-friendly solutions for checkpointing,
renaming and non-blocking caches for OoO execution. Our solutions show that it is
possible to redesign conventional ASIC-oriented designs of such structures, making them
suitable for FPGA implementation. However, to achieve a complete OoO core, more
components are required and need to be investigated for FPGA implementation.
In OoO execution, Load-Store-Queues (LSQ) are employed to forward data between
memory instructions and to detect Read-After-Write dependency violations [46]. When
a load instruction executes, the LSQ forwards data to it from older uncommitted stores.
To find a matching store, the LSQ must be searched by address; therefore, for fast
access, LSQs are conventionally implemented using CAMs, which are slow and large on
FPGAs. One could investigate the feasibility of implementing LSQs on FPGAs and
possibly redesign them to remove the CAMs from their structure.
Reorder Buffers (ROBs) are another component used in OoO execution, responsible
for tracking the original ordering among the instructions being executed. Using the
ROB, the processor is able to commit instructions in the order they were fetched and
preserve correctness. Rosière et al. have proposed a multi-banked ROB implementation
tuned for FPGAs that is able to use BRAMs [52]. Future work can utilize this ROB
design to form a complete OoO processor.
The next step in this path is to implement a complete OoO soft processor. Even if
all the components of the processor are already designed, integrating them all into one
coherent design is not a trivial task. Future work can target forming a complete OoO
processor utilizing the components proposed in this thesis and achieve a fast and small
implementation. Based on the data gathered in this thesis through complete system sim-
ulations, we expect a complete OoO soft processor to achieve performance improvements
of up to 20% compared to an inorder pipeline.
9.2.2 Multi-Processor Designs
Over the last decade, computer architecture research has shifted towards
multi-processing due to the frequency wall [7]. Multi-processing exploits parallelism
in programs to achieve higher performance, while each processing element operates at
sustainable clock speeds and energy requirements. Proposals for such architectures
include simultaneous multithreading, multicore architectures, and the Cell processor
[60, 26, 48, 32].
The processing elements in a multi-processor architecture can have various microar-
chitectures, including OoO and Runahead. This work introduced SPREX, a complete
single core pipeline with Runahead execution. Future work in the multi-processing area
can utilize SPREX to form a multi-processor system with Runahead execution. However,
including a core with Runahead execution in a multi-processor system introduces new
and interesting challenges. For example, designing a coherent cache that can track
cache accesses in Runahead mode is challenging, as not all memory requests by a
Runahead core are initiated by the program itself. We leave investigating such
challenges to future work.
9.2.3 Power and Energy
This thesis focused on the performance and area cost trade-off when designing processor
components or an entire processor. However, power dissipation and energy consumption
are increasingly prohibitive in embedded systems, following the trends in ASIC
processor design [36]. Therefore, an important future direction is to study the
performance/area/power trade-off when implementing soft processors.
OoO processing requires additional components on top of an inorder pipeline. Every
additional component, even one with low runtime activity, dissipates power and hence
increases the processor's energy footprint. This thesis introduced FPGA-friendly
alternatives for various OoO components considering only performance and area. Future
work can reevaluate these components taking energy consumption into account and propose
solutions adhering to specific energy requirements. For example, in Chapter 6 we showed
that limiting the instruction scheduler's clock frequency changes the optimal design's
size and scheduling policy. Hence, it is reasonable to expect that introducing energy
constraints may also change the optimal design.
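As an illustration of how an energy budget can shift the chosen design point, the sketch below sweeps a handful of scheduler configurations and picks the fastest one that fits the budget. The configuration names and all numbers are invented for illustration; they are not measurements from this thesis.

```python
# Hypothetical design-space sweep: each scheduler design point has a
# maximum clock frequency, a sustained IPC, and an energy cost per
# scheduled operation. Pick the fastest point within the budget.

def best_design(points, energy_budget_nj):
    """Return the fastest design point whose energy fits the budget."""
    feasible = [p for p in points if p["energy_nj"] <= energy_budget_nj]
    if not feasible:
        return None
    # Performance proxy: clock frequency times sustained IPC.
    return max(feasible, key=lambda p: p["fmax_mhz"] * p["ipc"])

# Invented design points for illustration only.
points = [
    {"name": "8-entry age-ordered",  "fmax_mhz": 240, "ipc": 0.90, "energy_nj": 1.0},
    {"name": "16-entry age-ordered", "fmax_mhz": 200, "ipc": 1.00, "energy_nj": 1.6},
    {"name": "16-entry random",      "fmax_mhz": 230, "ipc": 0.95, "energy_nj": 1.3},
]
```

With a loose budget the larger scheduler wins on the frequency-times-IPC proxy, while a tight budget forces the smaller design: the same mechanism by which an energy constraint would change the optimal point identified in Chapter 6.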
In Runahead mode, the processor continues executing instructions while waiting for a
memory operation to complete. These instructions are executed for the sole purpose of
finding subsequent memory operations, and no instruction result is retained. Therefore,
the processor consumes extra energy in Runahead mode compared to an inorder pipeline.
Additionally, not all memory requests sent in Runahead mode are useful, yet they consume energy in the
processor and the memory controller. On the other hand, finding and overlapping
subsequent memory operations reduces overall execution time, and hence saves energy.
Future work can study this complex energy/performance trade-off in Runahead execution.
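A first-order way to frame this trade-off is a simple energy model: Runahead adds power while active but shortens total runtime. The sketch below is a back-of-the-envelope model; the function name and all parameters are hypothetical, not data from this thesis.

```python
# Back-of-the-envelope model of the Runahead energy trade-off: extra
# energy spent executing discarded instructions versus energy saved
# by shortening overall runtime.

def net_energy_joules(p_core_w, p_runahead_w, t_base_s, speedup, runahead_frac):
    """Energy with Runahead minus energy without it (negative = net saving).

    p_core_w      -- core power during normal execution (W)
    p_runahead_w  -- extra power while in Runahead mode (W)
    t_base_s      -- baseline (inorder) runtime (s)
    speedup       -- runtime reduction factor from Runahead (>= 1.0)
    runahead_frac -- fraction of runtime spent in Runahead mode
    """
    t_run = t_base_s / speedup                       # Runahead shortens runtime
    e_base = p_core_w * t_base_s                     # inorder baseline energy
    e_run = p_core_w * t_run + p_runahead_w * t_run * runahead_frac
    return e_run - e_base
```

In this model, Runahead saves energy whenever the speedup outweighs the extra power drawn while in Runahead mode, and wastes energy when the pre-executed requests fail to produce any speedup.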
Bibliography
[1] Kaveh Aasaraai and Andreas Moshovos. Towards a viable out-of-order soft
core: Copy-free, checkpointed register renaming. In 19th Intl. Conf. on Field
Programmable Logic and Applications (FPL), Prague, Czech Republic, September
2009.
[2] Kaveh Aasaraai and Andreas Moshovos. Design space exploration of instruction
schedulers for out-of-order soft processors. In the International Conference on
Field-Programmable Technology (Poster Presentation), 2010.
[3] Kaveh Aasaraai and Andreas Moshovos. NCOR: An FPGA-Friendly nonblocking
data cache for soft processors with runahead execution. International Journal of
Reconfigurable Computing, 2011.
[4] Kaveh Aasaraai and Andreas Moshovos. An efficient non-blocking data cache for soft
processors. In Proc. of the International Conference on ReConFigurable Computing
and FPGAs, December 2010.
[5] Kaveh Aasaraai and Andreas Moshovos. SPREX: A soft processor with runahead
execution. In Proc. of the International Conference on ReConFigurable Computing
and FPGAs, December 2012.
[6] Advanced Micro Devices Inc. AMD-K5 Processor Data Sheet. In Proceedings of the
Hot Chips VIII, 1997.
[7] Vikas Agarwal, M. S. Hrishikesh, Stephen W. Keckler, and Doug Burger. Clock rate
versus IPC: the end of the road for conventional microarchitectures. In Proceedings
of the 27th annual international symposium on Computer architecture, ISCA ’00,
pages 248–259, New York, NY, USA, 2000. ACM.
[8] Haitham Akkary, Ravi Rajwar, and Srikanth T. Srinivasan. Checkpoint pro-
cessing and recovery: Towards scalable large instruction window processors. In
Proceedings of the 36th International Symposium on Microarchitecture, pages 423–
434, 2003.
[9] Altera Corporation. Avalon Bus Specifications. http://www.altera.com/
literature/manual/mnl_avalon_spec.pdf.
[10] Altera Corporation. Embedded Peripherals IP. http://www.altera.com/
literature/ug/ug_embedded_ip.pdf.
[11] Altera Corporation. Functional Description - UniPHY. http://www.altera.com/
literature/hb/external-memory/emi_fd_uniphy.pdf.
[12] Altera Corporation. Logic Array Blocks and Adaptive Logic Modules in Stratix III
Devices.
[13] Altera Corporation. Nios II Processor Reference Handbook, May 2011.
[14] Altera Corporation. Nios II Performance Benchmarks, Dec. 2012.
[15] Arcturus Networks Inc. uClinux. http://www.uclinux.org/.
[16] Jean-Loup Baer and Tien-Fu Chen. Effective hardware-based data prefetching for
high-performance processors. IEEE Trans. Comput., 44(5):609–623, May 1995.
[17] Samson Belayneh and David R. Kaeli. A discussion on non-blocking/lockup-free
caches. SIGARCH Comput. Archit. News, 24(3):18–25, 1996.
[18] Derek Chiou, Dam Sunwoo, Joonsoo Kim, Nikhil A. Patil, William Reinhart, Dar-
rel Eric Johnson, Jebediah Keefe, and Hari Angepat. FPGA-accelerated simula-
tion technologies (fast): Fast, full-system, cycle-accurate simulators. In MICRO
40: Proceedings of the 40th Annual IEEE/ACM International Symposium on
Microarchitecture, pages 249–261, Washington, DC, USA, 2007. IEEE Computer
Society.
[19] Jongsok Choi, Kevin Nam, Andrew Canis, Jason Anderson, Stephen Brown, and
Tomasz Czajkowski. Impact of cache architecture and interface on performance and
area of FPGA-based processor/parallel-accelerator systems. In Proceedings of the 2012
IEEE 20th International Symposium on Field-Programmable Custom Computing
Machines, FCCM ’12, pages 17–24, Washington, DC, USA, 2012. IEEE Computer
Society.
[20] James Coole and Greg Stitt. Traversal caches: A framework for FPGA accelera-
tion of pointer data structures. International Journal of Reconfigurable Computing,
2010:16 pages, 2010.
[21] Altera Corp. Stratix III Device Handbook: Chapter 4. TriMatrix Embedded Memory
Blocks in Stratix III Devices., 2010.
[22] Control Data Corporation. CDC 6600 mainframe computer, 1964.
[23] Fredrik Dahlgren, Michel Dubois, and Per Stenstrom. Sequential hardware prefetch-
ing in shared-memory multiprocessors. IEEE Trans. Parallel Distrib. Syst., 6(7):733–
746, July 1995.
[24] Udit Dhawan and Andre DeHon. Area-efficient near-associative memories on
FPGAs. In Proceedings of the ACM/SIGDA international symposium on Field
programmable gate arrays, FPGA ’13, pages 191–200, New York, NY, USA, 2013.
ACM.
[25] James Dundas and Trevor Mudge. Improving data cache performance by pre-
executing instructions under a cache miss. In ICS ’97: Proc. of the 11th intl. conf.
on Supercomputing, pages 68–75, New York, NY, USA, 1997. ACM.
[26] Susan J. Eggers, Joel S. Emer, Henry M. Levy, Jack L. Lo, Rebecca L. Stamm,
and Dean M. Tullsen. Simultaneous multithreading: A platform for next-generation
processors. IEEE Micro, 17:12–19, 1997.
[27] F. J. Mesa-Martinez et al. SCOORE Santa Cruz Out-of-Order RISC Engine, FPGA
Design Issues. In Workshop on Architectural Research Prototyping (WARP), held
in conjunction with ISCA-33, pages 61–70, 2006.
[28] K. I. Farkas and N. P. Jouppi. Complexity/performance tradeoffs with non-blocking
loads. In Proceedings of the 21st Annual International Symposium on Computer
Architecture, ISCA ’94, pages 211–222, Los Alamitos, CA, USA, 1994. IEEE Com-
puter Society Press.
[29] Freescale. e600 PowerPC Core Reference Manual.
[30] S. Fytraki and D. Pnevmatikatos. RESIM: A trace-driven, reconfigurable ILP pro-
cessor simulator. In Design and Automation Europe, 2008.
[31] J.L. Hennessy and D.A. Patterson. Computer Architecture: A Quantitative
Approach. Computer Architecture, the Morgan Kaufmann Ser. in Computer Ar-
chitecture and Design Series. Elsevier Science, 2006.
[32] H. Peter Hofstee. Power efficient processor architecture and the cell processor. In
Proceedings of the 11th International Symposium on High-Performance Computer
Architecture, HPCA ’05, pages 258–262, Washington, DC, USA, 2005. IEEE Com-
puter Society.
[33] Engin Ipek, Onur Mutlu, José F. Martínez, and Rich Caruana. Self-optimizing
memory controllers: A reinforcement learning approach. In Proceedings of the 35th
Annual International Symposium on Computer Architecture, ISCA ’08, pages 39–50,
Washington, DC, USA, 2008. IEEE Computer Society.
[34] J. E. Smith. A study of branch prediction strategies. In 8th Annual Symposium on
Computer Architecture, pages 135–147, June 1981.
[35] Norman P. Jouppi. Cache write policies and performance. In Proceedings of the
20th annual international symposium on computer architecture, ISCA ’93, pages
191–201, New York, NY, USA, 1993. ACM.
[36] Stefanos Kaxiras and Margaret Martonosi. Computer Architecture Techniques for
Power-Efficiency. Morgan and Claypool Publishers, 1st edition, 2008.
[37] J. Keller. The Alpha 21264 microprocessor architecture. In Proceedings of the 9th
Annual Microprocessor Forum, 1996.
[38] D. Kroft. Lockup-free instruction fetch/prefetch cache organization. In Proceedings
of the 8th Annual International Symposium on Computer Architecture, 1981.
[39] Ashok Kumar. The HP PA-8000 RISC CPU: a high performance out-of-order pro-
cessor. In Proceedings of the Hot Chips VIII, 1996.
[40] Martin Labrecque, Mark C. Jeffrey, and J. Gregory Steffan. Application-specific
signatures for transactional memory in soft processors. ACM Trans. Reconfigurable
Technol. Syst., 4(3):21:1–21:14, August 2011.
[41] Charles Eric LaForest and J. Gregory Steffan. Efficient multi-ported memories for
FPGAs. In Proceedings of the 18th annual ACM/SIGDA international symposium
on Field programmable gate arrays, FPGA ’10, pages 41–50, New York, NY, USA,
2010. ACM.
[42] International Business Machines. IBM and LSI, PowerPC 476FP Embedded Proces-
sor Core and PowerPC 470S Synthesizable Core User’s Manual. http://www-03.
ibm.com/press/us/en/pressrelease/28399.wss.
[43] Francisco J. Mesa-Martínez, Michael C. Huang, and José Renau. Seed: scalable,
efficient enforcement of dependences. In PACT ’06: Proceedings of the 15th
international conference on Parallel architectures and compilation techniques, pages
254–264, New York, NY, USA, 2006. ACM.
[44] A. Moshovos. Checkpointing alternatives for high performance, power-aware proces-
sors. In Proceedings of the 2003 international symposium on Low power electronics
and design, pages 318–321, 2003.
[45] A. Moshovos and G. S. Sohi. Micro-Architectural Innovations: Boosting Processor
Performance Beyond Technology Scaling. Proceedings of the IEEE, 89(11), Novem-
ber 2001.
[46] Andreas Moshovos, Scott E. Breach, T.N. Vijaykumar, and Gurindar S. Sohi. Dynamic
speculation and synchronization of data dependencies. In Proceedings of the
24th International Symposium on Computer Architecture, 1997.
[47] Mayan Moudgill, Keshav Pingali, and Stamatis Vassiliadis. Register renaming and
dynamic speculation: an alternative approach. In Proceedings of the 26th annual
international symposium on Microarchitecture, MICRO 26, pages 202–213, Los
Alamitos, CA, USA, 1993. IEEE Computer Society Press.
[48] Kunle Olukotun, Basem A. Nayfeh, Lance Hammond, Ken Wilson, and Kunyung
Chang. The case for a single-chip multiprocessor. In Proceedings of the seventh
international conference on Architectural support for programming languages and
operating systems, ASPLOS VII, pages 2–11, New York, NY, USA, 1996. ACM.
[49] Subbarao Palacharla and J. E. Smith. Complexity-effective superscalar processors.
In Proceedings of the 24th Annual International Symposium on Computer
Architecture, pages 206–218, 1997.
[50] Il Park, Chong Liang Ooi, and T. N. Vijaykumar. Reducing design complexity of the
load/store queue. In Proceedings of the 36th annual IEEE/ACM International
Symposium on Microarchitecture, 2003.
[51] European Space Research and Technology Centre. LEON3 multiprocessing CPU core.
http://www.gaisler.com/doc/leon3_product_sheet.pdf/.
[52] M. Rosiere, J.-I. Desbarbieux, N. Drach, and F. Wajsburt. An out-of-order super-
scalar processor on FPGA: The reorder buffer design. In Design, Automation Test
in Europe Conference Exhibition (DATE), 2012, pages 1549–1554, March 2012.
[53] E. Safi, A. Moshovos, and A. Veneris. On the latency and energy of checkpointed
superscalar register alias tables. Very Large Scale Integration (VLSI) Systems, IEEE
Transactions on, 18(3):365–377, March 2010.
[54] J. E. Smith and G. Sohi. The Microarchitecture of Superscalar Processors.
Proceedings of the IEEE, 1995.
[55] SPARC International, Inc. The SPARC architecture manual (version
9). Prentice-Hall, Inc., Upper Saddle River, NJ, USA, 1994.
[56] Standard Performance Evaluation Corporation. SPEC CPU 2006. http://www.
spec.org/cpu2006/.
[57] T. N. Buti et al. Organization and implementation of the register-renaming mapper
for out-of-order IBM POWER4 processors. IBM Journal of Research and
Development, 49(1), 2005.
[58] Terasic Inc. Altera DE3 development system with Stratix III FPGA. http://
university.altera.com/materials/boards/de3/.
[59] R. M. Tomasulo. An efficient algorithm for exploiting multiple arithmetic units.
IBM J. Res. Dev., 11(1):25–33, January 1967.
[60] Dean M. Tullsen, Susan J. Eggers, and Henry M. Levy. Simultaneous multithread-
ing: maximizing on-chip parallelism. In 25 years of the international symposia on
Computer architecture (selected papers), ISCA ’98, pages 533–544, New York, NY,
USA, 1998. ACM.
[61] David W. Wall. Limits of instruction-level parallelism. In Proceedings of the fourth
international conference on Architectural support for programming languages and
operating systems, ASPLOS IV, pages 176–188, New York, NY, USA, 1991. ACM.
[62] Henry Wong, Vaughn Betz, and Jonathan Rose. Comparing FPGA vs. custom
CMOS and the impact on processor microarchitecture. In Proceedings of the 19th
ACM/SIGDA international symposium on Field programmable gate arrays, FPGA
’11, pages 5–14, New York, NY, USA, 2011. ACM.
[63] Di Wu, Kaveh Aasaraai, and Andreas Moshovos. Low-cost, high-performance
branch predictors for soft processors. In 23rd International Conference on Field
Programmable Logic and Applications (FPL), September 2013.
[64] Xilinx Inc. MicroBlaze Processor Reference Guide, Mar. 2012.
[65] K.C. Yeager. The MIPS R10000 Superscalar Microprocessor. IEEE Micro, 16(2):28–
40, 1996.
[66] P. Yiannacouras, J. G. Steffan, and J. Rose. VESPA: portable, scalable, and flexible
FPGA-based vector processors. In Proceedings of the 2008 International Conference
on Compilers, Architectures and Synthesis for Embedded Systems, pages 61–70,
2008.
[67] Peter Yiannacouras and Jonathan Rose. A parameterized automatic cache generator
for FPGAs. In Proc. Field-Programmable Technology (FPT), pages 324–327, 2003.
[68] Peter Yiannacouras, J. Gregory Steffan, and Jonathan Rose. Exploration and customization
of FPGA-based soft processors. IEEE Trans. on CAD of Integrated Circuits
and Systems, 26(2):266–277, 2007.