

Effective FPGA Debug for High-Level Synthesis Generated Circuits

Jeffrey Goeders and Steven J.E. Wilton

Department of Electrical and Computer Engineering
University of British Columbia
Vancouver, Canada
{jgoeders,stevew}@ece.ubc.ca

Abstract: High-level synthesis (HLS) promises to increase designer productivity in the face of steadily increasing FPGA sizes, and broaden the market of use, allowing software designers to reap the benefits of hardware implementation. One roadblock to HLS adoption is the lack of a debugging infrastructure. To debug, designers can run their source code on a processor; however, this does not capture interactions with other system components. The alternative is to debug using the RTL, which is beyond the expertise of software designers, and impractical for hardware designers as the RTL may not resemble the original source code.

This paper presents a new approach to debugging HLS produced circuits, which allows the user to debug in the context of the source code, while running the circuit in-situ. This is accomplished by automatically inserting debug instrumentation into the circuit, which allows a debugger application to start and stop the circuit, monitor variables and set breakpoints. The instrumentation contains trace buffers to record the control and data flow in real-time, allowing the debugger to retrieve this data and replay the execution.

As a proof of concept we integrated our approach into the LegUp HLS tool, and have made it publicly available. We present methods of optimizing the trace buffer usage, and show that we can replay 1243 lines of source code per 100Kb of memory allocated to trace buffers. On average, the instrumentation circuitry requires an 11% logic area overhead. This work enables real-time debugging of HLS circuits using a software-like debug interface, removing a major roadblock of HLS adoption.

I. INTRODUCTION

As the capacities of Field-Programmable Gate Arrays (FPGAs) grow, significant attention is being paid to designer productivity. FPGA vendors have invested heavily in high-level synthesis (HLS) technologies that automatically transform a software-like program (often written in C) into a register-transfer level (RTL) hardware circuit description. Compared to traditional software design, which involves running the algorithm on a processor, a hardware circuit can provide faster execution times and lower power. Compared to hardware design, this flow provides for significant designer productivity and time-to-market improvements, since the complexity of manual RTL descriptions can be avoided. More importantly, HLS tools may open the hardware realm to software designers, who would otherwise be unable to benefit from the performance advantages of custom hardware. Since software designers outnumber hardware designers by a factor of ten [1], HLS tools have the potential to provide huge growth for the FPGA industry.

In order to realize wide adoption of this technology, a framework for the effective debug of HLS-generated hardware is required. Basic functionality of stand-alone HLS-generated blocks can be tested and debugged by porting and running the design on a workstation. However, real designs typically contain many blocks, only some of which may be designed using HLS techniques. Other blocks may be legacy IP cores, or interconnect fabrics for which a C model is not available [2]. For these designs, the only option is to test and debug the HLS-generated block in-situ by executing the synthesized hardware directly on the FPGA along with other parts of the system.

Hardware verification can be performed with the help of debugging packages such as Altera SignalTap II [3], Xilinx ChipScope Pro [4] or Mentor’s Certus tool [5], all of which provide visibility into a hardware design. Unfortunately, these tools provide visibility that only has meaning to someone who understands the underlying hardware. A software designer typically views a design as a set of processes, each consisting of sequential control flow code, while the generated hardware consists of dataflow components operating in parallel across multiple clock cycles. HLS tools typically perform scheduling optimizations, moving operations across cycle boundaries, and allocation strategies that make the relationship between software variables and hardware entities difficult to understand. This mismatch between a software view of the design and the generated hardware makes debugging difficult. In the short-term, HLS will be used primarily by hardware designers seeking a more productive design environment, and the productivity advantages of HLS will be lost if these designers need to think about their design in terms of the underlying hardware. In the long term, HLS may be used by software designers; for these designers, reasoning about the behaviour of the hardware generated by HLS tools may be impossible.

Fig. 1. HLS Debug Design Flow (block labels: HLS, Source Code, HDL Circuit, Debug Mapping, FPGA, Debug)

In this paper, we propose techniques that allow software designers to understand the behaviour of a design that has been synthesized to hardware, allowing for effective debug. There are two novel aspects: (1) during HLS compilation, we establish a mapping between circuit state and source code instructions, as well as a mapping of source code variables to memory addresses, and (2) we combine a single-stepping debugging flow, which is familiar to software designers, with a real-time debugging flow, which is required to adequately debug many types of hardware circuits and the interaction between hardware blocks. Central to our approach is a small amount of instrumentation that we add to the user circuit. Minimizing the overhead of this instrumentation is critical: we show that by optimizing the instrumentation based on the structure of the generated circuit, we can achieve much better visibility compared to existing hardware-oriented approaches.

This paper is organized as follows. Section II describes the requirements of a debugging framework that allows HLS users to understand the behaviour of the generated hardware. Section III then presents an overview of our approach. One of the key contributions is the optimization of debug instrumentation; Section IV shows how the trace buffers can be optimized by understanding the nature of the generated hardware and quantifies the increase in visibility that these optimizations can provide. Compared to Altera’s SignalTap II, we obtain an improvement of 4.5X. Finally, Section V shows how we have integrated our techniques into a proof-of-concept tool built into the LegUp HLS framework [6]; the tool has been made publicly available at www.ece.ubc.ca/~jgoeders.

II. DEBUGGING HLS-GENERATED DESIGNS

A. Debugging and Validation Scenarios

When debugging and validating a design implemented using high-level synthesis techniques, there are three (often intertwined) tasks that an engineer needs to be concerned with:

1) The designer must ensure correct logical behaviour of the algorithm, and when incorrect behaviour is observed, identify the root cause. Typical errors include source-code mistakes, incorrect state transitions, or logic errors.

2) The designer must ensure correct interaction between various blocks in the system. A large design will typically contain many blocks, only some of which may be created using HLS-based techniques [2]. Other blocks may be hand-crafted hardware blocks, or even legacy designs for which full source-code is not available or not understood by the designer. In our experience debugging hardware designs, we have found that the interfaces between blocks are often much more susceptible to design errors than the internals of the blocks themselves.

3) Despite significant progress over the past decade, HLS tools are still relatively immature, and designers must have confidence that the hardware created by the HLS tool is correct. Providing a mechanism for designers to examine and understand the generated hardware is essential.

In this paper, we describe a framework that addresses each of these requirements.

B. Limitations of Debugging by Software Emulation

In HLS-based methodologies, designs are written in software-like languages (such as C). A straightforward debugging approach is to execute the software-like code on a stand-alone processor, and use standard software debugging technologies. This is attractive for two reasons: (1) software debugging technology is mature, and these debuggers provide numerous features to help pin-point problems, and (2) software developers (who are envisaged to be a primary user of HLS technologies) are already familiar with software-based debugging technologies, meaning they can use these tools with high confidence. We anticipate that many of the logical errors in designs can be uncovered in this way.

However, such an approach has limitations. First, it cannot be used to uncover errors in interfaces between blocks when some of these blocks are not developed using an HLS-based methodology. While it would be possible to create a C model of the legacy blocks, this is error-prone and it is difficult to ensure that the C model would faithfully describe the behaviour of the block in all possible modes (especially if the block was designed by someone else). Even if the other blocks are described using C, they often will communicate on-chip using a bus or other network (e.g. [2]), and the correctness of the interface between the C block and the bus may need to be examined. Second, for streaming applications, it is often necessary to debug while running in-situ, that is, in the target system, so the block can be evaluated using realistic input traffic. Again, it may be possible to create a C model of the input traffic; however, it would be difficult to ensure that such a model accurately reflects all possible traffic patterns for which the chip should work. Third, software execution is slower than hardware; this is the reason for accelerating software using hardware in the first place. This may make it impossible to run a sufficient number of tests on a software model of the system. Finally, such an approach assumes that the HLS tool is “perfect”; if the tool creates incorrect hardware, then debugging the software model will not help find the problem.

C. Hardware Debugging

For all these reasons, the ability to debug the hardware version of a design is essential. Effective debug of hardware can best be achieved through instrumentation, that is, by adding small amounts of logic to a circuit/program to provide visibility and controllability to the design. Commercial debugging packages (e.g. [3]–[5]) can be used to provide visibility into the design; however, as described in the introduction, these tools provide visibility that has meaning only in the context of the generated RTL hardware.

Fig. 2. Debugging System (block labels: External Debugger Application, Debug Instrumentation, Debug Manager, RS232, Clock & Breakpoint Unit, Memory Supervisor, Record & Replay Unit, Memory Controller, User Circuit, FPGA)

To provide effective debug productivity to HLS users, it is therefore essential to create a debugging framework that provides visibility into a hardware design, but in a way that is familiar and meaningful for software designers. We anticipate such a framework should combine the advantages of two flows:

1) In the live-stepping flow, the RTL circuit can be instrumented with hardware that allows for single-stepping and breakpointing. The user can step through the source code and the debugger controls the circuit state by enabling and disabling the circuit’s clock. While stopped, the user can inspect variables before continuing. This provides live debugging with minimal hardware overhead.

2) Many of the debugging scenarios outlined at the start of this section require real-time execution of the design. In the real-time replay flow, further instrumentation is added to record the control and data flow. This data is recorded in on-chip, circular buffers, until a certain breakpoint or trigger is encountered. At this point, the circuit is halted, and the debugger reads the data from the buffers. The user is able to replay the recorded execution, stepping through the source code, adding breakpoints, and reading variables. This allows users to debug the real-time interaction with other components in the system, although it requires greater hardware overhead. This is analogous to the hardware debug methods used in [7]–[10] where values on strategically selected hardware signals are recorded in trace buffers for later investigation. We will show that because the hardware is of a known structure (generated from an HLS tool), there are opportunities for significant optimization in the instrumentation required for this type of flow.

In the following sections, we show how these flows can be integrated into an existing HLS system, and measure the overhead of our techniques.

III. DEBUG FRAMEWORK OVERVIEW

A. Context

Although we anticipate our flow can work with any HLS tool suite, to make our proposal concrete, we describe it in the context of the academic open-source LegUp HLS tool [6]. LegUp accepts a source program in a subset of ANSI C (no dynamic memory allocation or recursion), and produces a Verilog circuit. The tool uses LLVM [11] to compile the C code to assembly language and perform optimizations, after which the assembly is compiled to Verilog. The LLVM compiler offers several levels of optimization (e.g. -O0 to -O3); our tool works for any of these optimization settings.

Fig. 3. Breakpoint Unit (labels: Breakpoint State, Circuit State, Memory Bus, Address, Value, Inequality Operation, Logic, Condition Satisfied, Breakpoint hit)

After optimization using LLVM, a Verilog module with a finite state machine is created for each function. Every assembly instruction in the function is transformed to equivalent Verilog code, and scheduled; instructions are scheduled in parallel when possible. LegUp maps variables to memories in the hardware circuit, and includes a single memory controller in the produced circuit. All modules have multiplexed access to the memory controller.

B. Overall Debug Flow

In our flow, the user compiles the source code to HDL as usual. During compilation, the HLS tool automatically inserts extra debugging circuitry (instrumentation) into the RTL. After implementing the circuit on the FPGA, the user launches the debug application on a workstation, and connects to the FPGA board. The application communicates with the debug circuitry and allows the user to debug the circuit operation in the context of the original source code (Figure 1). In one mode, the user can step through instructions, add breakpoints, and read/write variables similar to a normal software debugger. In a second mode, the user can run the circuit at speed, recording the activities of certain variables for later playback. During this latter mode, the user is able to step forwards and backwards through the replay window, insert breakpoints, and inspect variables.

We first describe the instrumentation inserted by the compiler. As shown in Figure 2, the instrumentation includes the following components:

Debug Manager: Communicates with the debugger application, forwarding requests to the other subsystems.

Clock and Breakpoint Unit: This unit contains a controllable clock buffer, which drives the user circuit, as well as a hardware breakpoint unit, shown in Figure 3. To facilitate breakpoints, the debugger configures the breakpoint unit to watch for a certain circuit state, which, when hit, will disable the clock. The debugger can also set conditions on the breakpoint by specifying one or more address/value inequalities. The breakpoint unit monitors the memory bus, and tracks whether the conditions are satisfied.

Memory Supervisor: Multiplexes access to the memory controller between the user circuit and the debug manager. This allows the debugger to access program variables.

Record and Replay Unit: This unit records a trace of the circuit execution by monitoring the circuit state and memory writes, and capturing the data into circular trace buffers. As the circuit is executing, data is constantly being written to the buffers, overwriting the oldest data in the buffer. This execution occurs at normal operating speeds, allowing the circuit to properly interact with other components in the system. Once the circuit is halted, either manually or via breakpoint, the debugger can retrieve the data stored in the trace buffers, and replay the circuit execution. Since the on-chip memory is limited, the user will only be able to capture and replay a portion of the program execution, which we refer to as the replay window. In the next section, we show how the architecture of the trace buffers can be optimized to maximize the size of the replay window.
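The conditional breakpoint behaviour of the Clock and Breakpoint Unit can be illustrated with a small software model. This is only a sketch of the behaviour described above, not the hardware generated by the tool: the class names, the set of comparison operators, and the rule that a condition reflects the most recent write to its address are all assumptions made for illustration.

```python
import operator
from dataclasses import dataclass, field

# Hypothetical set of comparison operations; the actual hardware encoding
# of the "inequality operation" is not specified here.
OPS = {"==": operator.eq, "!=": operator.ne, "<": operator.lt,
       "<=": operator.le, ">": operator.gt, ">=": operator.ge}


@dataclass
class Condition:
    address: int             # memory address to watch on the bus
    op: str                  # comparison applied to the written data
    value: int               # value to compare against
    satisfied: bool = False  # updated as writes are observed


@dataclass
class BreakpointUnit:
    state: int                                   # circuit state to break on
    conditions: list = field(default_factory=list)

    def observe_write(self, address, data):
        """Track memory-bus writes and update matching conditions."""
        for cond in self.conditions:
            if cond.address == address:
                cond.satisfied = OPS[cond.op](data, cond.value)

    def hit(self, current_state):
        """True when the clock should be disabled in this cycle."""
        return (current_state == self.state
                and all(c.satisfied for c in self.conditions))


bp = BreakpointUnit(state=7, conditions=[Condition(0x100, ">", 5)])
bp.observe_write(0x100, 3)    # condition not yet satisfied
early = bp.hit(7)
bp.observe_write(0x100, 9)    # condition now satisfied
late = bp.hit(7)
```

In this model, an unconditional breakpoint is simply a `BreakpointUnit` with an empty condition list, which halts as soon as the watched state is reached.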

Using the instrumentation, the debugger is able to monitor and capture control and data information. To relate this circuit-level information back to the original source code, the debugger requires information about the transformations performed by the HLS tool. We modify the HLS tool to retain both control and dataflow transformation information:

Control Flow: During synthesis, the HLS tool schedules each source instruction to a specific state in the circuit. The mapping output contains a list of instructions, their assigned state, and the source code line number.

Data Flow: During synthesis, each variable is mapped to an address in the memory space. The mapping output includes the memory address, the variable width and depth, details of the data type, and the source code location.
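To make the two mappings concrete, the following sketch shows one way a debugger could represent and query them. The record layouts, field names, example entries, and the byte-addressed memory assumption are hypothetical; the actual file format emitted by the modified HLS tool is not described here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InstructionMapping:
    instruction: str   # textual form of the scheduled instruction
    state: int         # circuit state the instruction was assigned to
    line: int          # source code line number


@dataclass(frozen=True)
class VariableMapping:
    name: str
    address: int       # base address in the memory space
    width_bits: int    # width of one element
    depth: int         # number of elements
    type_name: str     # data type details
    line: int          # source code location


def lines_for_state(control_map, state):
    """Source lines whose instructions execute in the given circuit state."""
    return sorted({m.line for m in control_map if m.state == state})


def variable_at(data_map, address):
    """Return (variable name, element index) for an address, or None.

    Assumes byte addressing and densely packed elements.
    """
    for var in data_map:
        element_bytes = var.width_bits // 8
        offset = address - var.address
        if 0 <= offset < element_bytes * var.depth:
            return var.name, offset // element_bytes
    return None


# Made-up example entries for illustration only.
control_map = [
    InstructionMapping("%1 = load i32* %buf", 3, 12),
    InstructionMapping("%2 = add i32 %1, 1", 3, 12),
    InstructionMapping("store i32 %2, i32* %buf", 4, 13),
]
data_map = [VariableMapping("buf", 0x1000, 32, 4, "int[4]", 5)]
```

With these tables, a captured circuit state is translated to source lines by `lines_for_state`, and a captured memory write is attributed to a source variable by `variable_at`.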

The final piece of the system is a debugger program, which runs on a workstation, and connects to the debug circuitry on the FPGA. It is essential that the GUI be familiar to software designers, so a single-step approach in which the user can step forward through programs and interrogate variables is provided.

IV. TRACE BUFFER OPTIMIZATION

As previously described, for many debugging tasks, it is necessary to run the circuit in real-time without interruption. To do so, trace buffers are required in the Record and Replay Unit to store a history of variable values which can be later interrogated to provide visibility into the operation of the circuit. These trace buffers consist of large memories; the size of these memories dictates how much history can be collected on-chip during each run.
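The relationship between buffer size and collected history can be seen in a minimal software model of a circular trace buffer. This is a sketch under simple assumptions (one entry recorded per call, a fixed depth); the class and method names are hypothetical and do not correspond to the generated hardware.

```python
class CircularTraceBuffer:
    """Fixed-depth buffer that always keeps the most recent entries."""

    def __init__(self, depth):
        self.depth = depth
        self.entries = [None] * depth
        self.next = 0      # index of the slot that will be written next
        self.count = 0     # number of valid entries, at most depth

    def record(self, entry):
        """Store an entry, overwriting the oldest one when full."""
        self.entries[self.next] = entry
        self.next = (self.next + 1) % self.depth
        self.count = min(self.count + 1, self.depth)

    def replay(self):
        """Return the stored entries from oldest to newest."""
        start = (self.next - self.count) % self.depth
        return [self.entries[(start + i) % self.depth]
                for i in range(self.count)]


buf = CircularTraceBuffer(depth=4)
for cycle in range(10):
    buf.record(cycle)
window = buf.replay()   # only the last four recorded cycles survive
```

In this model the replay window is exactly the buffer depth, which is why packing more cycles of history into each entry directly enlarges the window.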

Existing tools such as SignalTap II [3], ChipScope Pro [4], and Mentor’s Certus [5] contain trace buffers that can be used to record the history of selected signals. These are generally optimized in the context of a generic RTL-designed hardware circuit, in which datapaths, storage elements, and interconnects are handcrafted by a hardware designer. Circuits generated by an HLS tool, however, will have a known structure that is dictated by the designer of the HLS tool. In this section, we show that by optimizing the trace buffer architecture for the particular class of circuits generated by HLS tools, we can make significantly better use of the debugging infrastructure. Although the discussion focuses on circuits generated by the LegUp tool, many of the characteristics we describe will be common with other HLS tools, so we feel many of these optimizations can be generalized.

TABLE I
CHSTONE BENCHMARKS, PROPERTIES OF LEGUP PRODUCED CIRCUIT

Benchmark   Lines of C Code   State Bits   Variable Space (kbits)   Writes/Cycle
adpcm             896            12                 32                  0.29
aes              1105            14                 39                  0.09
blowfish         1310            12                152                  0.26
dfadd             638            13                 20                  0.24
dfdiv             466            13                 17                  0.22
dfmul             443            13                 15                  0.26
dfsin             827            13                 20                  0.23
gsm               596            13                 12                  0.18
mips              311            11                  6                  0.32
motion            843            11                 36                  0.39
sha               278            12                136                  0.29
GeoMean           629            12.4               27                  0.24

A. Split Trace Buffer Architecture

In the trace buffer architectures in [7]–[10], signals that contain information related to the state of the system (state bits) and signals that contain data flowing through the system (data bits) are treated the same, and stored in the trace buffer together. This provides for maximum flexibility: when debugging designs that are control intensive, more trace buffer capacity can be devoted to state bits, and when debugging designs that are data-intensive, more trace buffer capacity can be devoted to data bits. This is essential due to the vast diversity that is possible when creating hand-crafted RTL designs.

Designs produced by an HLS tool tend to have a predetermined structure with a smaller number of easily identifiable important state bits and data bits. As a specific example, in the LegUp tool we are using, the generated hardware always contains a central controller, and all variable accesses are routed to and from a memory controller (this is not strictly true for optimized code; however, even in that case, many variable accesses result in activity in the memory controller).

The fact that these important control and data signals are easily identifiable opens an opportunity not available to hardware-oriented trace buffers. Rather than a generic trace buffer architecture that can be used to record any signal, we use a split trace buffer architecture where we store control and data signals in separate buffers, as shown in Figure 5. We can then optimize these buffers separately; since control and data signals have very different access patterns, we would expect that the optimum architectures for the two buffers are different. In the following subsections, we will describe how this can be done, and the resulting impact on the replay window size.

Fig. 4. Memory efficiency of control buffer, when varying number of sequential counter bits. (Plot: x-axis Sequential Counter Bits, 0 to 10; y-axis Average Cycles per Bit, 0 to 1.6; one series per benchmark of Table I plus G.Mean)

In evaluating optimizations, we use the CHStone benchmarks [12], which were designed for C-based HLS tools. Information on these benchmarks, and the associated circuits produced by LegUp, is provided in Table I.

B. Control Trace Buffer Optimization

The control trace buffer is used to store the state bits in the global controller. In the baseline design, each line in the buffer stores the state of the controller in one cycle. Thus, if there are m lines in the control trace buffer, we can store state information for m cycles, limiting the replay window size to m cycles.

The first optimization is to reduce the number of bits required to store the program state (the width of the memory). In the baseline approach, we use 16 bits for the circuit state. However, by leveraging the state information within the HLS tool, we can generate trace logic that captures only the necessary state bits. This reduces the memory width from 16 bits to an average of 12.4 bits for the benchmark circuits shown in Table I.
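As a quick check of this figure, the 12.4-bit average can be reproduced from the State Bits column of Table I. The snippet below is only a worked calculation; treating the average as a geometric mean follows the GeoMean row of the table, and the derived percentage saving is our own arithmetic rather than a number reported in the text.

```python
import math

# State Bits column of Table I.
STATE_BITS = {
    "adpcm": 12, "aes": 14, "blowfish": 12, "dfadd": 13, "dfdiv": 13,
    "dfmul": 13, "dfsin": 13, "gsm": 13, "mips": 11, "motion": 11,
    "sha": 12,
}
BASELINE_BITS = 16  # width of the state field in the baseline design


def geometric_mean(values):
    """Geometric mean of a collection of positive numbers."""
    values = list(values)
    return math.exp(sum(math.log(v) for v in values) / len(values))


mean_bits = geometric_mean(STATE_BITS.values())
saving = 1 - mean_bits / BASELINE_BITS
print(f"{mean_bits:.1f} state bits on average, "
      f"{saving:.0%} narrower than the {BASELINE_BITS}-bit baseline")
```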

The second optimization is motivated by the fact that in most cases, as the circuit transitions from one state to the next, the circuit state value increments by one. To optimize for this, we add an extra field to the control flow buffer, sequence count, which counts the number of consecutive cycles in which the control flow progressed sequentially. For a sequence count width of s bits, each entry in the buffer can record the circuit state for up to 2^s cycles. Circuits with many sequential computations will benefit most from this optimization, while circuits with substantial control flow variations (function calls and branch statements) will see less benefit. To determine the optimal value of s, we simulated each benchmark in ModelSim, and extracted the entire control flow trace. Using this trace, we determine the average number of cycles that can be captured by each trace buffer entry for various values of s, and calculate the average cycles per bit. This data is provided in Figure 4. The best result occurs when s = 6, with an efficiency of 0.49 cycles per bit.
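The sequence count field and the cycles-per-bit metric can be illustrated with a small software model. This is a minimal sketch, not the instrumentation's actual logic: it assumes each entry holds a start state plus a count of additional sequential cycles, and the function names and example trace are hypothetical.

```python
def encode_control_trace(states, s):
    """Compress a per-cycle state trace into [start_state, seq_count] entries.

    seq_count is the number of extra cycles after start_state in which the
    state simply incremented by one. It is limited to what fits in s bits,
    so one entry covers at most 2**s cycles.
    """
    max_count = (1 << s) - 1
    entries = []
    for state in states:
        if entries:
            start, count = entries[-1]
            if state == start + count + 1 and count < max_count:
                entries[-1][1] = count + 1   # extend the current sequential run
                continue
        entries.append([state, 0])           # non-sequential step: new entry
    return entries


def cycles_per_bit(states, state_width, s):
    """Average number of recorded cycles per bit of control buffer memory."""
    if not states:
        return 0.0
    entries = encode_control_trace(states, s)
    return len(states) / (len(entries) * (state_width + s))
```

For the hypothetical trace `[1, 2, 3, 4, 10, 11, 12, 3, 4]` with s = 2, the encoder produces three entries for nine cycles. Sweeping s over such a trace shows the trade-off in Figure 4: wider counters capture longer runs but cost more bits per entry.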

Combining these two optimizations reduces the memory requirement of the control buffer from 16 bits per cycle to only 2 bits per cycle, an 88% reduction.

C. Data Trace Buffer Optimization

Hand-crafted RTL circuits typically contain the maximum amount of parallelism possible given algorithmic and resource constraints. Thus, hardware-debugging oriented trace buffers in [7]–[10] are typically used to store the value of signals cycle-by-cycle (each line in the trace buffer stores the value of signals for one cycle). Since it is infeasible to record the values of all signals each cycle, intelligent methods are used to automatically choose signals that are likely to have the most value during debug [13]–[15].

The problem with this organization is that data for only a small number of cycles can be stored in a fixed-size trace buffer. This is especially inefficient if the signals of interest do not change often. Although general-purpose compression schemes have been proposed [16], these systems still sample the signals of interest every cycle. We have found that in HLS circuits, signal activity tends to be lower than in hand-crafted RTL circuits. This is a result of the nature of the tools: uncovering parallelism in software-like sequential code is a known difficult problem, and it should be expected that automated tools cannot do as well as a good RTL designer.

Because of this, we employ a temporal organization for the data trace buffer. Rather than storing the values of signals each cycle, we store the value of signals only when they change. For low-activity signals, this may be expected to result in an increase in the amount of “useful” data that can be stored in a trace buffer; however, it must be balanced by the need to store additional information about the time at which each change occurs (since there is no longer a one-to-one mapping between trace buffer line and clock cycle).
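The trade-off in this paragraph can be made concrete with a simple cost model. The sketch below is an illustration under stated assumptions, not a formula from this work: it assumes every stored change costs the value width plus a fixed number of timing bits. The function names and the example widths are hypothetical.

```python
def bits_per_cycle_sampled(value_bits):
    """Cycle-by-cycle organization: one full-width line is stored every cycle."""
    return float(value_bits)


def bits_per_cycle_on_change(value_bits, time_bits, changes_per_cycle):
    """Temporal organization: a line (value plus timing info) per change only."""
    return changes_per_cycle * (value_bits + time_bits)


def break_even_activity(value_bits, time_bits):
    """Activity (changes per cycle) below which storing on change is cheaper."""
    return value_bits / (value_bits + time_bits)
```

With hypothetical 96-bit values and 32 bits of timing information, storing on change is cheaper whenever the signals change in fewer than 75% of cycles. At one change every four cycles it needs 32 bits per cycle instead of 96.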

In our HLS system, this is enabled by the fact that variable writes occur through a central memory controller. By tapping off the memory controller, we can identify when all memory writes occur, and store information about each of these writes in the trace buffer. Although we currently store information about all variable writes, it would be straightforward to store changes to only those variables that the user deems useful for debugging.

Fig. 5. Control and data flow trace buffers. [Diagram: the control flow buffer holds State and Sequence Count fields per entry; the data flow buffer holds Ctrl Idx, Address and Value fields per entry.]

Figure 5 shows our architecture. Memory writes are stored in the trace buffer; the address field indicates the variable that was updated, and the value field indicates the value of this variable (we will elaborate on this further below). The ctrl_idx field is used to track the index and sequence count in the control flow buffer that the memory write occurred at.
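A software model can make the data flow buffer entry format concrete. This is a minimal sketch under assumptions: the buffer is treated as circular (the oldest entry is overwritten when full), bit widths are not modeled, and ctrl_idx is represented as a hypothetical (entry index, sequence count) pair. The class and method names are illustrative only.

```python
from collections import deque, namedtuple

# One data flow buffer entry; field names follow the text, widths are not modeled.
WriteRecord = namedtuple("WriteRecord", ["ctrl_idx", "address", "value"])


class DataTraceBuffer:
    """Software model of a circular buffer that records memory writes only."""

    def __init__(self, depth):
        # A deque with maxlen discards its oldest item when full, mimicking
        # a trace buffer that overwrites its oldest line.
        self.entries = deque(maxlen=depth)

    def on_cycle(self, ctrl_idx, wr_en, address=None, value=None):
        """Call once per clock cycle; an entry is stored only if wr_en is set."""
        if wr_en:
            self.entries.append(WriteRecord(ctrl_idx, address, value))

    def dump(self):
        """Return the recorded writes, oldest first."""
        return list(self.entries)
```

Cycles without a write consume no buffer space in this model, which is why the ctrl_idx field is needed to place each write in time.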

As a further optimization, we use the information within the HLS tool to reduce the number of address bits that are captured, based on the memory space used by the benchmark. Based on an average write frequency of 0.24 writes/cycle for the benchmarks, the optimized data buffer requires only 22.7 bits per cycle, a 76% reduction over the baseline.

D. Balancing Control and Data Trace Buffer Sizes

The total amount of memory that can be allocated to the trace buffers depends on the total amount of RAM on the chip as well as the amount of memory used by the user circuit. Given a fixed total amount of memory, the tool automatically partitions the space into control and data trace buffers, attempting to balance the size of each of these trace buffers to maximize the number of lines of code that can be replayed. The number of cycles that can be replayed is the minimum of the number of cycles captured by the two buffers; for example, if the control buffer contains m cycles of data, and the data buffer contains n cycles of data, the user will only be able to replay the last min(m,n) cycles. Thus, it is important to properly balance the memory allocation.

Based on ModelSim simulation of the benchmarks, on average, the data buffer requires 11 times more memory per cycle than the control buffer. One approach is to always allocate 11 times more memory to the data buffer. However, the further the benchmark varies from this ratio, the more likely it will be that one buffer fills up faster than the other, wasting memory space. An improved method is to use a ratio that is tailored to the individual benchmark. This method increases the replay window size by 29%, although it requires simulation to obtain the circuit-specific ratio.
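The effect of the partitioning ratio can be sketched with a simple model. This illustration assumes each buffer consumes a constant average number of bits per cycle; the function names and the example benchmark figures (2 and 8 bits per cycle) are hypothetical and are not taken from the benchmarks.

```python
def partition_memory(total_bits, ctrl_bits_per_cycle, data_bits_per_cycle):
    """Split total_bits so that both buffers cover the same number of cycles."""
    per_cycle = ctrl_bits_per_cycle + data_bits_per_cycle
    ctrl_bits = total_bits * ctrl_bits_per_cycle / per_cycle
    return ctrl_bits, total_bits - ctrl_bits


def replay_cycles(ctrl_bits, data_bits, ctrl_bits_per_cycle, data_bits_per_cycle):
    """Replayable cycles: min(m, n) over the control and data buffers."""
    m = ctrl_bits / ctrl_bits_per_cycle
    n = data_bits / data_bits_per_cycle
    return min(m, n)


# Hypothetical benchmark that needs 2 control bits and 8 data bits per cycle.
fixed_split = partition_memory(100_000, 1, 11)     # generic 1:11 ratio
tailored_split = partition_memory(100_000, 2, 8)   # circuit-specific ratio
fixed_window = replay_cycles(*fixed_split, 2, 8)
tailored_window = replay_cycles(*tailored_split, 2, 8)
```

In this hypothetical case, the generic 1:11 split leaves the control buffer as the bottleneck, while the tailored split lets both buffers cover 10000 cycles.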

E. Improvement over SignalTap II

Table II summarizes the results of our optimizations on the storage requirements. The table also includes the expected number of C lines that can be debugged within the replay window, per 100Kb of memory allocated to the trace buffers.

For the baseline, we trace the state and data signals using SignalTap II. It requires 16 bits for the control signals, and 96 bits for the data signals per cycle, resulting in a replay window size of 275 lines per 100Kb of memory. The table lists the cumulative effect of each optimization on the replay window size, when applied to our custom instrumentation. With all optimizations, the replay window contains 1243 lines of C code per 100Kb of memory, a 4.5X improvement over SignalTap II. With modern FPGAs containing tens of megabits of on-chip memory, it is possible for the user to replay thousands of lines of source code using only a few percent of the memory resources.

TABLE II
CUMULATIVE IMPROVEMENTS TO REPLAY WINDOW SIZE FROM TRACE BUFFER OPTIMIZATIONS.

Optimization               Trace Buffer Bits/Cycle     Replay Window Size
                           Control       Data          (C lines/100Kb)

Baseline (SignalTap II)    16.0          96.0          275

Temporal Data Buffer       -             26.3          564
Addr Bit Trimming          -             22.7          615
State Bit Trimming         12.4          -             678
State Sequence Counter     2.0           -             963
Tailored Balancing         -             -             1243

Final                      2.0           22.7          1243
vs. SignalTap II           (-88%)        (-76%)        (+352%)

F. Variable Coverage

Ideally, the debugger would be able to determine the value of any variable at any point during the replay window. This would be trivial if we used a traditional trace buffer approach to store all values every cycle; however, since only the variable writes are stored, it requires extra consideration. Consider Figure 6, which illustrates the data flow capture of a simple benchmark with two variables, A and B. In this case, the circuit is run until t = 52, and then halted due to a breakpoint, resulting in a replay window of 35 < t < 52. The data flow buffer will contain the new values of A at t = 40 and t = 50. Using this data, we can determine the value of A for t >= 40. Unfortunately, it is impossible to determine the value of A when t < 40. Variable B presents a different case, where the trace buffer contains no writes to the variable. In this situation, since the circuit is halted, the debugger uses the memory controller to retrieve the current value of B.

If the system is modified to store the old value of a variable into the trace buffer, full observability can be obtained during the replay window. For example, if the trace buffer contains the old values of A at t = 40 and t = 50, then the trace data can be used to determine A when t <= 50, and the memory controller can be used for t > 50.

Although the latter method provides full coverage, it is only possible if the memory blocks on the FPGA are capable of providing the old value during a memory write. Xilinx Virtex FPGAs and certain generations of Altera Stratix FPGAs provide this feature [17]–[19].
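The two coverage cases can be sketched as lookup functions over the recorded writes of a single variable. This is an illustrative model, not the debugger's code: function names and data values are hypothetical, and the sketch assumes a write recorded at cycle c is visible from cycle c onward, so the exact boundary cycle is an assumption.

```python
def value_at(t, writes, current_value, window_start):
    """Value of one variable at cycle t when only NEW values are traced.

    writes: list of (cycle, new_value) inside the replay window, oldest first.
    current_value: value read through the memory controller after halting.
    Returns None when the value cannot be determined.
    """
    if t < window_start:
        return None
    if not writes:
        return current_value      # never written in the window (variable B)
    known = None
    for cycle, new_value in writes:
        if cycle <= t:
            known = new_value
    return known                  # None if t precedes the first traced write


def value_at_with_old(t, writes, current_value, window_start):
    """Same query when each entry is (cycle, old_value): full coverage."""
    if t < window_start:
        return None
    for cycle, old_value in writes:
        if t < cycle:
            return old_value      # value held until this write occurred
    return current_value          # at or after the last traced write
```

With writes to A at t = 40 and t = 50 as in the Figure 6 example, the first function cannot answer a query at t = 38, while the second returns the old value recorded by the write at t = 40.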

Fig. 6. Example of memory writes in a replay window. [Diagram: timeline in cycles (t axis marked from 30 to 50) for variables A and B, marking write events and the point at which the circuit is halted.]


TABLE III
INSTRUMENTATION AREA

Benchmark    Cyclone II Logic Elements
             Base Circuit    Live Stepping    Live Stepping + Replay

adpcm        37636           +3277 (9%)       +3753 (10%)
aes          76622           +1408 (2%)       +2269 (3%)
blowfish     55897           +2385 (4%)       +2903 (5%)
dfadd        29031           +3272 (11%)      +3789 (13%)
dfdiv        38478           +3088 (8%)       +3682 (10%)
dfmul        17684           +2567 (15%)      +3167 (18%)
dfsin        88432           +5680 (6%)       +6330 (7%)
gsm          25470           +3100 (12%)      +3710 (15%)
mips         7970            +1820 (23%)      +2392 (30%)
motion       33556           +3375 (10%)      +3968 (12%)
sha          15126           +2248 (15%)      +2751 (18%)

GeoMean      31376           +2755 (9%)       +3386 (11%)

V. PROOF OF CONCEPT IMPLEMENTATION

As a proof of concept we have implemented our techniques in the LegUp HLS tool. We modified the LegUp tool to automatically insert the instrumentation shown in Figure 2. Table III provides the area overhead of the extra circuitry. On average the debug circuitry to provide live-stepping debugging requires an extra 9% of logic elements. Further instrumenting the circuit to add the record and replay feature only increases the overhead marginally, to 11%.

To obtain control mapping information, we use an existing feature of LLVM (which is used as a front-end in LegUp). LLVM can optionally collect debug metadata by attaching debug information to each instruction in the intermediate representation (IR), specifying the corresponding source code file and line number. During optimizations, the IR instructions may be modified, reordered or eliminated; however, LLVM preserves the debug information throughout the optimizations [20].

Our debugger application, written in Python+Qt, runs as a standalone program on a workstation, and connects to an FPGA via RS232. Figure 7 provides a screenshot of the program. The debug application communicates with the instrumentation to obtain the circuit state, start and stop the application, read variables, set breakpoints, and extract data from the trace buffers. The GUI shows the source code, highlights the active instructions in green, and includes a Gantt chart to illustrate how the instructions are scheduled.
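One way to picture how per-instruction debug metadata lets a debugger highlight source lines is a lookup from controller state to source locations. The sketch below is purely illustrative and is not the debugger's implementation. It assumes the HLS tool can report, for each IR instruction, the state in which it is scheduled together with its source file and line; the function names and example data are hypothetical.

```python
from collections import defaultdict


def build_state_to_lines(schedule):
    """Build a mapping from controller state to source locations.

    schedule: iterable of (state, source_file, line) tuples, one per IR
    instruction that carries debug metadata and is scheduled in that state.
    """
    mapping = defaultdict(set)
    for state, source_file, line in schedule:
        mapping[state].add((source_file, line))
    return mapping


def active_lines(mapping, state):
    """Source locations to highlight while the circuit is in the given state."""
    return sorted(mapping.get(state, ()))
```

Because several instructions can be scheduled in the same state, a single state may map to more than one source line in this model.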

In live stepping mode, the user is able to single step, run to breakpoints, or run indefinitely. When stopped, the user can read and write variables, or configure breakpoints. Currently, our instrumentation only contains one hardware breakpoint, although it could easily be modified to include more. At any time the circuit is halted, the user can switch to replay mode, and the debugger retrieves the control and data flow history from the trace buffers. During replay mode, the user is able to step forwards and backwards through the replay window. In replay mode, breakpoints can be implemented in software, so they are no longer limited in number. A slider is provided to allow the user to quickly move through a large replay window.

Our proof-of-concept has several limitations which will be addressed in future work. One limitation is that we do not support Pthreads and OpenMP parallelism constructs which are available in the latest version of LegUp [21]. Second, our system only supports reading from and writing to variables that reside in memory. However, when compiler optimizations are enabled in LLVM, local stack variables can be optimized away, or be replaced by intermediary registers (global variables are not optimized away). The debugger is not able to determine the value of any variable that has been optimized away, or has been replaced by a register. In our experience, enabling full compiler optimizations (-O3) will optimize away many local variables, which may make debugging with optimizations difficult. We plan to address these limitations in future work.
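The replay-mode navigation described in this section (stepping forwards and backwards, with software breakpoints that are not limited in number) can be sketched as a cursor over the recorded state sequence. This is a minimal model under assumptions, not the debugger's code: it assumes the control trace has already been expanded to one state per cycle, and the class and method names are hypothetical.

```python
class ReplaySession:
    """Steps through a recorded list of controller states, one per cycle."""

    def __init__(self, states):
        if not states:
            raise ValueError("replay window is empty")
        self.states = list(states)
        self.pos = len(self.states) - 1   # start at the most recent cycle
        self.breakpoints = set()          # software breakpoints: any number

    def step(self, delta=1):
        """Move forwards (delta > 0) or backwards (delta < 0), clamped."""
        self.pos = max(0, min(len(self.states) - 1, self.pos + delta))
        return self.states[self.pos]

    def run(self, direction=1):
        """Step until a breakpoint state or the edge of the replay window."""
        while 0 <= self.pos + direction < len(self.states):
            self.pos += direction
            if self.states[self.pos] in self.breakpoints:
                break
        return self.states[self.pos]
```

A slider, as in the GUI, would correspond to setting the cursor position directly within the bounds of the recorded window.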

VI. RELATED WORK

There are several past works on circuit debug using trace buffers (e.g. [8], [22]–[24]), and instrumenting circuits with debug cores (e.g. [22], [25]–[27]). However, these techniques target general RTL circuits, whereas our approaches are targeted to HLS produced circuits.

The most similar work to this paper is the debugger [28] created for the academic Sea Cucumber HLS tool [29]. In this work they do not use trace buffers, but rather leverage the device readback feature of certain FPGAs, which allows all registers within the FPGA to be read externally. The major drawbacks of this approach are that it is very slow, requiring several seconds to read values from the FPGA [30], and that it can only be used for a live-stepping flow. In order to perform in-situ debugging, the circuit would need to be started and stopped for each instruction. Since this behavior is disruptive to interactions with other blocks in the system, only non-interacting circuits could be properly tested. As explained earlier, if the circuit has no external interactions, it often can more easily be tested using software debugging.

VII. CONCLUSION

In this paper, we presented a new approach to debugging HLS produced circuits, which allows the user to debug the circuit in the context of the original source software, supporting single-stepping, breakpointing and variable inspection. This is accomplished by modifying the HLS tool to automatically insert debug instrumentation into the produced HLS circuit.

A key feature of our approach is the record and replay mode, which allows the user to debug the circuit in-situ, running at full speed and interacting with other system components. Instrumented trace buffers record the relevant control and data signals during execution. Once a breakpoint is hit, the data is retrieved by the debugger, and the user is able to replay the recorded execution. We present methods for optimizing the trace buffers, allowing the user to replay an estimated 1243 lines of C code per 100Kb of memory allocated to the trace buffers, a 4.5X improvement over SignalTap II.

As a proof of concept we integrated our approach into the LegUp HLS tool, and have made it publicly available.


Fig. 7. Debugger application

REFERENCES

[1] United States Bureau of Labor Statistics, “Occupational Outlook Handbook,” 2012.

[2] K. Wakabayashi, “Reconfigurable chip advantage compared with GPGPU from the compiler perspective,” Keynote Speech, International Conference on Field-Programmable Technology, Dec 2013.

[3] Altera, “Quartus II Handbook Version 13.1 Volume 3: Verification; 13. Design Debugging Using the SignalTap II Logic Analyzer,” Nov 2013.

[4] Xilinx, “ChipScope Pro Software and Cores: User Guide,” Apr 2012.

[5] Mentor Graphics. (2014, Apr) Certus ASIC Prototyping Debug Solution. [Online]. Available: http://www.mentor.com/products/fv/certus

[6] A. Canis, J. Choi, M. Aldham, V. Zhang, A. Kammoona, T. Czajkowski, S. D. Brown, and J. H. Anderson, “LegUp: An Open-source High-level Synthesis Tool for FPGA-based Processor/Accelerator Systems,” ACM Transactions on Reconfigurable Technology and Systems, vol. 13, no. 2, pp. 24:1–24:27, 2013.

[7] J.-S. Yang and N. Touba, “Expanding Trace Buffer Observation Window for In-System Silicon Debug through Selective Capture,” in VLSI Test Symposium, April 2008.

[8] E. Hung and S. J. E. Wilton, “Speculative Debug Insertion for FPGAs,” in Field Programmable Logic and Applications, 2011.

[9] M. Riley, N. Chelstrom, M. Genden, and S. Sawamura, “Debug of the CELL Processor: Moving the Lab into Silicon,” in International Test Conference, Oct 2006.

[10] P. Graham, B. Nelson, and B. Hutchings, “Instrumenting Bitstreams for Debugging FPGA Circuits,” in Field-Programmable Custom Computing Machines, March 2001.

[11] C. Lattner and V. Adve, “LLVM: a compilation framework for lifelong program analysis & transformation,” in Code Generation and Optimization, March 2004.

[12] Y. Hara, H. Tomiyama, S. Honda, and H. Takada, “Proposal and quantitative analysis of the CHStone benchmark program suite for practical C-based high-level synthesis,” Journal of Information Processing, vol. 17, pp. 242–254, 2009.

[13] E. Hung and S. J. E. Wilton, “On evaluating signal selection algorithms for post-silicon debug,” in Quality Electronic Design, March 2011.

[14] X. Liu and Q. Xu, “Trace Signal Selection for Visibility Enhancement in Post-silicon Validation,” in Design, Automation and Test in Europe, 2009.

[15] H. F. Ko and N. Nicolici, “Automated trace signals selection using the RTL descriptions,” in International Test Conference, Nov 2010, pp. 1–10.

[16] E. Anis and N. Nicolici, “On using lossless compression of debug data in embedded logic analysis,” in International Test Conference, Oct 2007.

[17] Altera, “Internal Memory (RAM or ROM) User Guide,” Nov 2013.

[18] Xilinx, “Virtex-6 FPGA Memory Resources User Guide,” Feb 2014.

[19] Xilinx, “Virtex-5 FPGA User Guide,” Mar 2012.

[20] LLVM Compiler Infrastructure. (2014, Apr) Source Level Debugging with LLVM. [Online]. Available: http://llvm.org/docs/SourceLevelDebugging.html

[21] J. Choi, S. Brown, and J. Anderson, “From software threads to parallel hardware in high-level synthesis for FPGAs,” in Field-Programmable Technology, Dec 2013.

[22] B. Vermeulen and S. Goel, “Design for debug: catching design errors in digital chips,” IEEE Design & Test of Computers, vol. 19, no. 3, pp. 35–43, May 2002.

[23] B. Vermeulen, “Functional Debug Techniques for Embedded Systems,” IEEE Design & Test, vol. 25, no. 3, pp. 208–215, May 2008.

[24] C. MacNamee and D. Heffernan, “Emerging on-chip debugging techniques for real-time embedded systems,” Computing & Control Engineering Journal, vol. 11, no. 6, pp. 295–303, Dec 2000.

[25] B. Hutchings, B. Nelson, M. Wirthlin, and D. Wilde, “Unified Debug Environment for Adaptive Computing Systems,” Brigham Young University, Tech. Rep., Sep. 2003.

[26] Y. Iskander, C. Patterson, and S. Craven, “High-Level Abstractions and Modular Debugging for FPGA Design Validation,” ACM Transactions on Reconfigurable Technology and Systems, vol. 7, no. 1, pp. 2:1–2:22, Feb. 2014.

[27] B. Vermeulen, M. Urfianto, and S. Goel, “Automatic generation of breakpoint hardware for silicon debug,” in Design Automation Conference, July 2004, pp. 514–517.

[28] K. Hemmert, J. Tripp, B. Hutchings, and P. Jackson, “Source level debugger for the Sea Cucumber synthesizing compiler,” in Field-Programmable Custom Computing Machines, April 2003.

[29] J. L. Tripp, P. A. Jackson, and B. Hutchings, “Sea Cucumber: A Synthesizing Compiler for FPGAs,” in Field-Programmable Logic and Applications, 2002.

[30] Y. Iskander, C. Patterson, and S. Craven, “Improved Abstractions and Turnaround Time for FPGA Design Validation and Debug,” in Field Programmable Logic and Applications, Sept 2011.