
MEG: A RISC-V-Based System Simulation Infrastructure for Exploring Memory Optimization Using FPGAs and Hybrid Memory Cube

Jialiang Zhang, Yang Liu, Gaurav Jain, Yue Zha, Jonathan Ta and Jing Li
University of Wisconsin-Madison

Electrical and Computer Engineering Department

[email protected], {liu574,gjain6, yzha3, jta}@wisc.edu, [email protected]

Abstract—Emerging 3D memory technologies, such as the Hybrid Memory Cube (HMC) and High Bandwidth Memory (HBM), provide increased bandwidth and massive memory-level parallelism. Efficiently integrating these emerging memories into existing systems poses new challenges and requires detailed evaluation in a real computing environment. In this paper, we propose MEG, an open-source, configurable, cycle-exact, RISC-V-based full-system simulation infrastructure using FPGAs and HMC. MEG has three highly configurable design components: (i) an HMC adaptation module that not only enables communication between the HMC device and the processor cores but can also be extended to fit other memories (e.g., HBM, nonvolatile memory) with minimal effort, (ii) a reconfigurable memory controller, along with its OS support, that system designers can effectively leverage to perform software-hardware co-optimization, and (iii) a performance monitor module that improves the observability and debuggability of the system to guide performance optimization. We provide a prototype implementation of MEG on a Xilinx VCU110 board and demonstrate its capability, fidelity, and flexibility on real-world benchmark applications. We hope that our open-source release of MEG fills a gap in the space of publicly available FPGA-based full-system simulation infrastructures specifically targeting the memory system and inspires further collaborative software/hardware innovations.

I. INTRODUCTION

DDR SDRAM is the predominant technology used to build the main memory systems of modern computers. However, the slowdown in technology scaling and the quest for higher memory bandwidth, driven by emerging data-intensive applications such as machine learning, have led to the development of a new class of 3D memory [1]. The two best-known realizations of 3D memory are Micron's Hybrid Memory Cube (HMC) [2] and JEDEC High Bandwidth Memory (HBM) [3]. These memories exploit TSV-based stacking technology and re-architect the memory banks into multiple independently operated channels to achieve greater memory-level parallelism than DDR memory. For instance, a standalone HMC can provide an order of magnitude higher bandwidth while consuming 70% less energy per bit than DDR3 DRAM and occupying nearly 90% less space than today's RDIMMs [2].

However, integrating 3D memories into computer systems is non-trivial. It requires detailed evaluation of these memories within computers running realistic software stacks. Such evaluation would help computer architects comprehensively understand the relationship between micro-architecture, OS, and applications, and further drive software-hardware collaborative innovations for memory systems. There are early studies on HMC-based main memory [4], but the majority abstract the HMC simply as high-bandwidth, low-energy DDR SDRAM. This methodology does not fully exploit the unique architectural attributes of HMC for higher performance. Hardware-software co-design of HMC-based main memory is fundamentally hindered by the lack of a research evaluation infrastructure with high fidelity, flexibility, and performance.

There are decades of research into performance evaluation methods for computer systems [5][6][7][8]. Software-based simulators are flexible, but they suffer from low simulation speed and low fidelity when modeling real hardware, especially memory systems [7]. For instance, a typical DDR memory controller reorders and schedules requests multiple times while keeping track of dozens of timing parameters that are inherently challenging to model [9]. FPGA-accelerated simulation addresses most of these problems by providing low latency, high simulation throughput, and low cost per simulation cycle [10][11]. Yet the critical limitation of this approach has been modeling the DDR DRAM-based main memory system, as mapping the controller RTL, physical interface, and chip models onto an FPGA fabric is too complex and resource-intensive [12].

To address these challenges, we present MEG, an open-source, configurable, cycle-exact, RISC-V-based full-system simulation infrastructure using FPGAs and HMC. MEG comprises an FPGA-based Rocket processor [13] and an HMC-based main memory. It also includes a bootable Linux image for a realistic software flow, enabling cross-layer hardware-software co-optimization. We leverage a commercially available FPGA platform, the Xilinx VCU110 board equipped with a Xilinx Virtex UltraScale FPGA and a 2GB HMC module from Micron, to implement the very first version of MEG. The commodity nature of the platform effectively lowers the barrier for open-source adoption.

MEG comprises three highly configurable design components: an HMC adaptation module, a reconfigurable memory controller, and a flexible performance monitoring scheme. The HMC adaptation module handles the incompatible information flow between the host system and the HMC controller module.

Constructed to be flexible, the adaptation module can be easily extended to fit other memories (e.g., HBM, nonvolatile memory) with minimal effort. The design consists of a combination of an address translation unit, an HMC boot sequence generator, and an ID management unit; the detailed design is explained in Section IV-A. The reconfigurable memory controller, along with its OS support, is the key component that system designers can effectively leverage to perform software-hardware co-optimization. It provides architectural support and an interface that communicate high-level program semantics, i.e., data access patterns, from the application to both the underlying OS and the architecture (the unique addressing scheme of HMC) to achieve more intelligent data placement for memory system performance optimization. The detailed design and a simple use case for such cross-layer optimization are introduced in Section IV-B. Finally, a flexible performance monitoring scheme (Section IV-C) improves system observability and debuggability for performance optimization. It comprises a trace-based performance monitor along with a set of reconfigurable hardware counters, enabling users to flexibly monitor and collect runtime information, i.e., detailed full-system execution traces, from a workload and better comprehend memory system behavior, ultimately guiding performance optimization.

The paper is organized as follows. Section II provides background on HMC technology and RISC-V processors. Section III presents our experimental characterization of a native HMC device. Section IV presents the MEG system, including its three detailed design components. Section V presents evaluation results on synthetic and real-world benchmark applications, followed by a conclusion in Section VI.

II. BACKGROUND

In this section, we provide background on the emerging parallel memory architecture and on RISC-V processors.

A. Hybrid Memory Cube

The Hybrid Memory Cube (HMC) employs a new parallel memory architecture that stacks multiple DRAM dies on top of a CMOS logic layer using through-silicon via (TSV) technology [2]. Fig. 1 shows an overview of the HMC architecture. It consists of multiple DRAM layers, with each layer divided into multiple partitions and each partition comprising several memory banks. A vertical slice of stacked banks forms a structure known as a vault. Each vault has its own independent TSV bus and vault controller in the logic base layer. The vault controller manages the timing of the DRAM banks within its stack, independent of other vaults. A vault is therefore analogous to the notion of a channel in traditional DRAM-based memory systems, as it contains all components of a DRAM channel: a memory controller, several memory ranks (partitions), and a bi-directional bus. We can thus view these parallel memory architectures as devices that integrate numerous DRAM channels into a single package.

There are some major differences between traditional DDRx SDRAM and HMC. DDRx SDRAM has a relatively large row buffer, ranging from 512B to 2048B, and uses an open-page policy to benefit from row buffer locality. In contrast, HMC uses a much smaller row buffer (16B), and all data in the row buffer are read out for a single memory request. Hence, HMC uses a closed-page policy to minimize power consumption. In addition, HMC provides multiple address mapping modes as a way to control memory access scheduling. Fig. 2 shows the default address mapping scheme, which maps memory blocks across vaults, banks, and rows. The minimum access granularity of HMC is 16B, implying that the 4 LSBs are ignored. The 4-bit vault address (bits 6 to 9) combined with the 4-bit bank address (bits 10 to 13) select a bank. Besides the default addressing scheme, HMC supports other addressing schemes, such as schemes 1, 2, and 3 in Fig. 2. HMC's structure offers high potential bandwidth and energy efficiency. In Section IV-B, we demonstrate that these theoretical potentials can only be achieved with efficient utilization of fine-grain parallelism, i.e., careful selection of addressing modes with respect to different data access patterns.
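For concreteness, the default mapping can be expressed as a few bit-field extractions. The following is an illustrative C sketch based on the field positions described above, not code from the MEG release:

```c
#include <stdint.h>
#include <stdio.h>

/* Default scheme: 16B blocks, so bits 0-3 are the byte offset; bits 6-9
 * select one of the 16 vaults; bits 10-13 select the bank. Schemes 1-3
 * place the same fields at different bit positions (see Fig. 2). */
static unsigned hmc_vault(uint64_t paddr) { return (unsigned)(paddr >> 6)  & 0xF; }
static unsigned hmc_bank (uint64_t paddr) { return (unsigned)(paddr >> 10) & 0xF; }

int main(void)
{
    /* Consecutive 64B lines (0x000, 0x040, 0x080, ...) land in different
     * vaults, whereas addresses 1024B apart always hit vault 0 -- the
     * large-stride behavior measured in Section III. */
    uint64_t samples[] = { 0x000, 0x040, 0x080, 0x400, 0x800 };
    for (int i = 0; i < 5; i++)
        printf("0x%03llx -> vault %u, bank %u\n",
               (unsigned long long)samples[i],
               hmc_vault(samples[i]), hmc_bank(samples[i]));
    return 0;
}
```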

Fig. 1. A conceptual diagram illustrating the HMC architecture (vaults of stacked partitions over vault controllers, TSVs, logic and crossbar, SerDes buffers, and request/response links).

B. RISC-V Processors

To build an open-source and configurable FPGA simulation platform with Linux support, we chose the RISC-V architecture, as it is the only open-source soft-core processor family with Linux support. There are many widely used soft processors, such as the ARM Cortex-M1 [14], the Xilinx MicroBlaze [15], the Intel NIOS II [16], and the RISC-V family. While MicroBlaze and NIOS II both provide operating system support, e.g., Linux, their hardware designs are closed-source, leaving RISC-V as the only viable option for full-system evaluation.

The RISC-V specification [17] is both an open and a free standard, leading to the architecture's widespread adoption in both industry and academia. For example, Rocket Chip implements a RISC-V system-on-chip for architecture education and research. SiFive leverages Rocket Chip to implement the first Linux-capable RISC-V soft core, the U500. VectorBlox [18] has produced Orca to host the MXP vector processor. Finally, the VexRiscv project [19] and the Taiga project [20] have optimized RISC-V processors for FPGAs. We extended MEG from the SiFive U500 design for the following reasons: (i) it comes with Linux support; (ii) it is a Rocket-core-based design, which allows processor parameter modification via Chisel [21], a high-level hardware abstraction.

Fig. 2. Different HMC address mapping schemes.

Until processors with native support for parallel memory controllers become available, FPGA-based emulation plays a crucial role in enabling the evaluation and optimization of parallel memory subsystems.

III. EXPLOIT MEMORY PARALLELISM

As described in Section II-A, HMC provides massive memory parallelism at both the vault and bank levels; vault-level parallelism does not exist in conventional DDRx memory and is unique to HMC. Therefore, choosing an optimal memory address mapping scheme is critical to maximizing parallelism and memory bandwidth. While simple in theory, mapping scheme selection is not a trivial task. Specifically, the distinct memory access patterns generated by different applications make memory mapping optimization complex and multi-faceted. To better understand the relationship between the mapping scheme and memory performance, we perform several experiments on an FPGA-HMC platform and use the resulting observations to optimize system performance (Section IV-B).

Fig. 3. HMC streaming bandwidth with different levels of parallelism

In the FPGA-HMC platform, a 2GB HMC module is connected to a Xilinx UltraScale FPGA through two half-width (8-lane) 15G HMC links. We implement a dedicated module on the FPGA to generate the memory requests that are sent to the HMC module. Specifically, this module reads an array in HMC memory with different strides, where each element can vary in size between 32B, 64B, 128B, and 256B. Four representative memory address mapping schemes (Fig. 2) are evaluated in these experiments.
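The address stream issued by this generator can be modeled in software as follows. This is only an assumption-laden sketch, not the RTL request generator itself; it reuses the default-scheme bit fields from Section II-A to count how many vaults a given stride and element size touch:

```c
#include <stdint.h>
#include <stdio.h>

static unsigned vault_of(uint64_t paddr) { return (unsigned)(paddr >> 6) & 0xF; }

/* Read n_elems elements of elem_size bytes, stride elements apart, and
 * report how many distinct vaults the default mapping would involve. */
static void sweep(uint64_t base, size_t n_elems, size_t elem_size, size_t stride)
{
    uint16_t touched = 0;  /* bitmap over the 16 vaults */
    for (size_t i = 0; i < n_elems; i++) {
        uint64_t elem = base + (uint64_t)i * stride * elem_size;
        for (size_t off = 0; off < elem_size; off += 16)  /* 16B HMC blocks */
            touched |= (uint16_t)(1u << vault_of(elem + off));
    }
    printf("elem=%3zuB stride=%2zu -> %d vault(s) active\n",
           elem_size, stride, __builtin_popcount(touched));
}

int main(void)
{
    size_t sizes[] = { 32, 64, 128, 256 };
    for (int s = 0; s < 4; s++)
        for (size_t stride = 1; stride <= 16; stride *= 2)
            sweep(0, 1024, sizes[s], stride);
    return 0;
}
```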

Fig. 4. Bandwidth of accessing 32B elements in an array on HMC with different stride sizes.

Changing the address mapping scheme varies the number of banks and vaults used to store the data array and serve memory requests concurrently. This difference yields a measurable performance change, i.e., a memory bandwidth difference between mapping schemes. In particular, for streaming access (stride 1), the attained memory bandwidth can differ by a factor of ∼10× (Fig. 3). Further memory bandwidth measurements of the four schemes under varying strides and element sizes yield drastic differences in performance across the schemes. This observation implies that a fixed memory address mapping scheme cannot deliver the best performance across different scenarios. It is therefore necessary to select the optimal address mapping scheme based on the detected memory access pattern, i.e., stride and element size. We leverage the strong link between an application's high-level data structures and its memory access pattern to guide the optimal selection of the memory mapping scheme (Section IV-B).

Fig. 5. Bandwidth of accessing an array on HMC with a fixed stride of 16 and different element sizes.

IV. SYSTEM DESIGN

Fig. 6 shows an overview of the MEG system. The three major design components are highlighted: (i) an HMC adaptation module that both enables communication between the HMC device and the processor cores and is extensible to fit other memories (e.g., HBM, nonvolatile memory) with minimal effort, (ii) a reconfigurable memory controller, along with its OS support, that system designers can effectively leverage to perform software-hardware co-optimization, and (iii) a performance monitoring module that improves the observability and debuggability of the system to guide performance optimization. The following subsections present detailed descriptions of each design component.

Fig. 6. System diagram of MEG

A. HMC Adaptation Module

In this subsection, we provide a detailed description of the HMC adaptation module, which interconnects the HMC and the host system. The design consists of three components: an address translation unit, an HMC boot sequence generator, and an ID management unit.

Since the host system provides a memory port that expects an SDRAM-style interface, we provide an address translation module that modifies the address information to ensure compatibility with HMC requirements. The HMC memory module differs from standard SDRAM in two primary aspects: (a) it supports multiple outstanding transactions, and (b) it requires metadata specifying the transaction type and packet size in every transaction. To cater to the first, we use the ID field provided by the HMC module and configure the Rocket-core system accordingly to utilize the maximum achievable memory bandwidth. To address the second, we implement a forwarder module that inserts metadata specifying the transaction type into the user channel. In addition, we extend this module to perform clock domain crossing, as the HMC controller requires the data on the AXI4 port to be synchronized with its reference clock, which differs from the core operating frequency.
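The per-request information handled by the forwarder and the ID management logic can be illustrated with a simple data structure. This is a hypothetical C model; the real logic is RTL, and the field names below are ours:

```c
#include <stdbool.h>
#include <stdint.h>

typedef enum { HMC_CMD_READ, HMC_CMD_WRITE } hmc_cmd_t;

typedef struct {
    uint8_t   tag;        /* HMC ID field: lets responses be matched to
                             one of several outstanding requests        */
    hmc_cmd_t cmd;        /* transaction type inserted by the forwarder */
    uint8_t   size_flits; /* packet size in 16B FLITs (4 for a 64B line) */
    uint64_t  addr;       /* address after translation for the HMC part */
} hmc_request_t;

/* Tag a cache-line refill the way the forwarder would before handing it to
 * the HMC controller's AXI4 port (clock-domain crossing is not modeled). */
hmc_request_t forward(uint64_t addr, bool is_write, uint8_t next_tag)
{
    hmc_request_t r = {
        .tag        = next_tag,
        .cmd        = is_write ? HMC_CMD_WRITE : HMC_CMD_READ,
        .size_flits = 4,
        .addr       = addr,
    };
    return r;
}
```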

Aside from the interface differences, the HMC memory module requires a sequence of steps for initialization and power management. To alleviate this burden on users, we created a specialized boot sequence generator for the proper setup of the HMC module. The implemented module interfaces with the HMC configuration port and issues the appropriate initialization commands. After ensuring a stable reference voltage, we begin initialization by de-asserting the reset signal and subsequently configuring the reference clock and lane bandwidth. In the next phase, the HMC controller and memory module perform a synchronization process for link-layer configuration. After completion, we verify the physical layer and clock reset signals of the HMC memory to determine whether initialization completed successfully and accordingly propagate the ready signal to the other modules in the design.
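The initialization flow can be summarized as a small state machine. The following is an assumed, simplified C model of the sequence above; the actual generator is RTL driving the HMC configuration port:

```c
/* States of the boot-sequence generator, in the order the text describes. */
typedef enum {
    BOOT_WAIT_VREF,       /* wait for a stable reference voltage        */
    BOOT_DEASSERT_RESET,  /* release the HMC reset signal               */
    BOOT_CONFIG_CLK_LANE, /* program reference clock and lane bandwidth */
    BOOT_LINK_SYNC,       /* controller/cube link-layer synchronization */
    BOOT_VERIFY,          /* check PHY and clock-reset status           */
    BOOT_READY            /* propagate ready to the rest of the design  */
} boot_state_t;

/* Advance one phase; `status_ok` stands in for the per-phase check
 * (voltage stable, link trained, PHY verified, and so on). */
boot_state_t boot_step(boot_state_t s, int status_ok)
{
    if (!status_ok)
        return s;  /* stay in the current phase and keep polling */
    return (s == BOOT_READY) ? BOOT_READY : (boot_state_t)(s + 1);
}
```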

Fig. 7. Detailed implementation of the HMC adaptation module

To exploit the expansive bandwidth offered by the HMC memory module, we employ an ID management module and configure the cache in the Rocket core to be non-blocking, i.e., to have Miss Status Handling Registers (MSHRs), enabling it to issue multiple outstanding memory requests. MSHRs are special registers that hold the metadata, such as the tag and cache way, required to redirect a memory response to the appropriate location in the cache. Configuring the number of MSHRs in the Rocket core and using the HMC ID field enables the core to utilize the maximum achievable bandwidth. In our system, changing the number of MSHRs also impacts the resource utilization of subsequent modules in the memory hierarchy. In addition, systems with non-blocking caches require every memory request to be checked against the MSHR buffer so that the system can identify whether an outstanding request already targets the same memory location. Considering both the MSHR logic and area overheads, there exists a trade-off between the number of MSHRs in the system and the number of outstanding transactions supported by the memory hierarchy. We discuss the relationship between resource utilization and the number of MSHRs in the evaluation section.

Fig. 8. An example of selecting the optimal addressing scheme for accessing 64B elements with a stride of 16.
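The MSHR bookkeeping described above can be sketched as follows. This is an illustrative C model only; the actual structure is part of the Rocket core's non-blocking L1 cache, and the names are hypothetical:

```c
#include <stdbool.h>
#include <stdint.h>

#define N_MSHR 32  /* the largest configuration evaluated in Section V */

typedef struct {
    bool     valid;
    uint8_t  hmc_tag;    /* ID sent with the request, returned with the response */
    uint64_t block_addr; /* cache-block address of the pending miss              */
    uint8_t  way;        /* cache way reserved for the refill                    */
} mshr_t;

static mshr_t mshrs[N_MSHR];

/* Every new miss is checked against the outstanding ones: a match means a
 * previous request already covers this block; otherwise a free entry is
 * allocated. A return value of -1 means the MSHRs are exhausted and the
 * core must stall. */
int mshr_allocate(uint64_t block_addr, uint8_t way, uint8_t tag)
{
    int free_slot = -1;
    for (int i = 0; i < N_MSHR; i++) {
        if (mshrs[i].valid && mshrs[i].block_addr == block_addr)
            return i;                       /* merge with the pending miss */
        if (!mshrs[i].valid && free_slot < 0)
            free_slot = i;
    }
    if (free_slot >= 0)
        mshrs[free_slot] = (mshr_t){ true, tag, block_addr, way };
    return free_slot;
}
```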

B. Reconfigurable Memory Controller

The MEG system provides a reconfigurable memory controller to enable memory subsystem optimization. The major component added to the memory controller is a TCAM module used to select the optimal address mapping scheme. Despite the relatively simple hardware design, support from the programming model and runtime management is required to efficiently utilize this modified memory controller. In this subsection, we present a simple example illustrating how to optimize memory system performance with the modified controller. More comprehensive system solutions are left for future work.

Fig. 9. A use case illustrating how to apply the reconfigurable memory controller to perform memory system optimization.

Achieving the optimal selection of the memory address mapping scheme consists of three steps. The first step is an offline code profiling stage. In this step, user applications are analyzed to extract high-level semantics and access patterns, e.g., element size, stride, etc. This high-level information is then used to determine the optimal address mapping scheme. With the optimized data placement, the massive memory parallelism provided by HMC can be efficiently utilized, as consecutive memory requests are served by different vaults in parallel, as illustrated in Fig. 8.

The second step is a modified memory allocation process. Specifically, we modify the malloc function to take an additional parameter, the address mapping scheme, so that users can call this malloc function to assign the optimal mapping scheme to a specific memory allocation. Moreover, we alter the Linux kernel to store the physical address range, i.e., the page mask, and the corresponding mapping scheme in a TCAM module.
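A possible shape of the resulting allocation interface is sketched below. The function and constant names are hypothetical; the paper only specifies that malloc gains a mapping-scheme parameter and that the kernel records page-mask/scheme pairs in the TCAM:

```c
#include <stddef.h>
#include <stdlib.h>

typedef enum {
    MEG_SCHEME_DEFAULT = 0,   /* default HMC mapping in Fig. 2 */
    MEG_SCHEME_1,
    MEG_SCHEME_2,
    MEG_SCHEME_3
} meg_scheme_t;

/* Allocate a buffer and ask the kernel to program the TCAM so that the
 * pages backing it use the requested mapping scheme. Without the MEG
 * kernel patch this fallback simply forwards to malloc. */
void *meg_malloc(size_t size, meg_scheme_t scheme)
{
    (void)scheme;   /* placeholder: the real call would reach the kernel */
    return malloc(size);
}

/* Example: an array of 64B elements that will be read with a large stride
 * is placed under a scheme that spreads consecutive elements across vaults
 * (which of schemes 1-3 that is depends on the profiling step). */
void *alloc_strided_array(size_t n_elems)
{
    return meg_malloc(n_elems * 64, MEG_SCHEME_2);
}
```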

The last step is the runtime address mapping process. The memory controller is modified to apply the assigned mapping scheme to a specific physical address range. When the memory controller receives a memory request, the physical address of the request is used as the key to search the TCAM module and fetch the assigned address mapping scheme. The corresponding mapping scheme then converts the physical address into the hardware address used to access HMC memory.
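A behavioral model of this lookup is shown below as an illustration of the matching rule, not the TCAM RTL:

```c
#include <stdint.h>

#define TCAM_ENTRIES 8

typedef struct {
    int      valid;
    uint64_t base;   /* physical base of the region             */
    uint64_t mask;   /* page mask: address bits that must match */
    int      scheme; /* mapping scheme assigned to this region  */
} tcam_entry_t;

static tcam_entry_t tcam[TCAM_ENTRIES];

/* The physical address of an incoming request is the search key; the
 * matching entry's scheme selects the physical-to-hardware conversion,
 * and unmatched addresses fall back to the default scheme. */
int tcam_lookup(uint64_t paddr)
{
    for (int i = 0; i < TCAM_ENTRIES; i++)
        if (tcam[i].valid && (paddr & tcam[i].mask) == tcam[i].base)
            return tcam[i].scheme;
    return 0;  /* default mapping scheme */
}
```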

Fig. 10. The hardware components for the performance monitor scheme.

C. Performance Monitoring Scheme

As shown in Fig. 10, MEG comprises two complementary monitoring schemes: hardware counters and a trace collector. Depending on the desired level of detail, users can choose between the simple performance counters for monitoring event frequencies and the PCIe-based trace collection module for obtaining high-fidelity details at hardware-signal granularity. Through this module, we provide the user a flexible and configurable interface for monitoring any desired signal over a preferred duration. For configurability, users can write predefined values to a memory-mapped register to enable or disable the event recording module. In addition, the user can monitor the performance counters and execute trace collection via the JTAG and PCIe ports simultaneously. The JTAG port, used for monitoring the hardware counters, provides sufficient bandwidth for transferring counter data to the host machine. Trace collection, on the other hand, requires the high-bandwidth PCIe interface due to the large amount of data generated. Conventionally, this imposes a limitation on the number of signals the user can map onto the PCIe capture interface. In our design, instead of constraining the user to regenerate the bitstream for each change in the target capture signals, we implement a multiplexer scheme in which the user can map all potential signals and select among them using the provided memory-mapped register. This provides the user a programmable trace collection interface that can be controlled entirely from software.
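For concreteness, a user-space control sequence might look like the following. The register address, bit layout, and the /dev/mem access path are assumptions for illustration; the released Linux image may expose the register differently:

```c
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

#define MON_REG_PHYS 0x60000000UL  /* hypothetical address of the control register */

int main(void)
{
    int fd = open("/dev/mem", O_RDWR | O_SYNC);
    if (fd < 0) { perror("open /dev/mem"); return 1; }

    volatile uint32_t *reg = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                                  MAP_SHARED, fd, MON_REG_PHYS);
    if (reg == MAP_FAILED) { perror("mmap"); close(fd); return 1; }

    *reg = 1;   /* start: hardware counters and trace collection begin    */
    sleep(5);   /* run the workload or region of interest                 */
    *reg = 0;   /* stop: counters flush over JTAG, traces drain over PCIe */

    munmap((void *)reg, 4096);
    close(fd);
    return 0;
}
```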

Fig. 11. The control flow of the performance monitoring scheme.

As illustrated in Fig. 11, to integrate the monitoring module, the user selects and connects the target signals to the corresponding signal ports of our module and generates the bitstream with the modified design. We provide the user with a modified Linux image, which includes the memory-mapped register, to load onto the FPGA programmed with the generated bitstream. The register controls the monitoring duration: writing a one to it starts the monitoring process, and writing a zero stops it. At the hardware level, a polling mechanism monitors the value of the register. Once the register has been asserted, both the counter module and the trace module begin data collection. Capturing the counter data over the JTAG port does not require any modification on the host machine, and only a single command has to be executed to enable the host PCIe driver to accept the generated trace data. On executing the appropriate command, the host machine begins reading data from the designated on-board storage buffer. When the value of the memory-mapped register returns to zero, i.e., whenever the user decides to stop the monitoring process, the hardware counters and the trace collection process are halted. On halt, the counter data are transferred to the host machine, whereas the PCIe module waits for the on-board buffer to be drained before generating an interrupt to the host. This interrupt signals the end of the transfer, after which the host machine no longer accepts data from the FPGA. Subsequently, if the user decides to monitor a different set of signals, the memory-mapped register can be programmed with the corresponding IDs and the process rerun.

TABLE I
RESOURCE UTILIZATION

              LOGIC     BRAM   DSP
  Total       1072240   3780   1800
  Used        189229    205    60
  Utilization 17%       5%     3%

V. EVALUATIONS

A. Experimental Setup

We implement MEG on a Xilinx VCU110 board (Fig. 12) equipped with a Xilinx Virtex UltraScale FPGA and a 2GB HMC module from Micron. The design is implemented using Xilinx Vivado 2016.4 and Xilinx XHMC IP 1.1. The resource utilization is summarized in Table I.

Fig. 12. MEG platform, including a Xilinx Virtex UltraScale FPGA, a 2GB Micron HMC module, and an SD card to store the Linux image.

To evaluate the performance of MEG, we select four benchmarks: random access, strided array access, locality-sensitive hashing, and graph traversal. Random access and strided array access are synthetic benchmarks, while locality-sensitive hashing and graph traversal come from real-world applications. The random access benchmark is used to verify that the RISC-V cores can generate enough traffic to saturate the physical interface. The strided-array access benchmark is the same as the one used in Section III, but re-evaluated in MEG with the complete system stack. Locality-sensitive hashing and graph traversal are then used to evaluate the performance of MEG on real-world applications.

B. Evaluation Results

Fig. 13. The performance (in terms of GUPS) for random access when varying the number of MSHRs and the number of cores.

1) Random Access: To verify that our system can generate sufficient memory requests to saturate the HMC physical interface, we measure the random access performance in terms of GUPS (giga-updates per second) while varying the number of MSHRs (in-flight memory requests) and the number of cores. We adapted the code from the RandomAccess benchmark in the HPC Challenge suite [22] and optimized it for better in-order execution performance. In particular, instead of executing read-modify-update operations atomically, we combine several operations of the same type to hide memory access latency and reduce the number of stalls caused by serialized memory accesses. As shown in Fig. 13, using all 4 cores and 32 MSHRs generates enough memory traffic to saturate the HMC interface.
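The batching idea can be sketched as follows. This is an illustrative C version of the optimization, not the actual modified HPCC RandomAccess source; it assumes a power-of-two table size:

```c
#include <stddef.h>
#include <stdint.h>

#define BATCH 32  /* roughly one in-flight update per MSHR */

/* table_size must be a power of two (as in HPCC RandomAccess). */
void gups_batched(uint64_t *table, size_t table_size, size_t n_updates)
{
    uint64_t ran = 1;
    for (size_t i = 0; i < n_updates; i += BATCH) {
        uint64_t r[BATCH], idx[BATCH], val[BATCH];

        /* Phase 1: generate indices and issue the loads back to back so
         * the non-blocking cache keeps many misses outstanding at once. */
        for (int j = 0; j < BATCH; j++) {
            ran = (ran << 1) ^ ((int64_t)ran < 0 ? 0x7 : 0);  /* HPCC-style LFSR */
            r[j]   = ran;
            idx[j] = ran & (table_size - 1);
            val[j] = table[idx[j]];
        }

        /* Phase 2: apply the updates and write them back. */
        for (int j = 0; j < BATCH; j++)
            table[idx[j]] = val[j] ^ r[j];
    }
}
```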

Fig. 14. The performance (in terms of GUPS) for random access when different numbers of vaults/banks are used to store data.

We further evaluate the sensitivity of random access performance (in terms of GUPS) to parallelism. By changing the memory address mapping scheme, we can limit the number of vaults and banks used to store data. As shown in Fig. 14, the random access performance increases monotonically when data are stored in more vaults and banks. We note that no further performance improvement is achieved when more than 4 vaults are used to store data. This is due to the limited interconnect bandwidth (2 half-width physical links between the FPGA and the HMC) on our platform. The HMC module can be connected to an FPGA with 4 full-width physical links (4× the bandwidth provided by our platform); with such an interconnect, the random access performance could be further improved when data are stored in more than 4 vaults.

Fig. 15. The performance comparison between the default and optimal address mapping schemes when reading an array with a given stride and a fixed element size (64B).

Fig. 16. The performance comparison between the default and optimal address mapping schemes when reading an array with a given element size and a fixed stride (16).

2) Strided Memory Access: We extend the strided-array access experiments from Section III with different payload sizes and access strides. We first fix the element size to 64B (aligned to the RISC-V cache line size) and vary the stride. As shown in Fig. 15, the performance gap between the default address mapping and the optimal address mapping can be as large as 10×, especially for a stride of 16, where the data are stored in a single vault (limited parallelism) under the default address mapping scheme.

We then evaluate the performance with various element sizes and a fixed stride of 16. As shown in Fig. 16, the performance gap between the optimal address mapping scheme and the default one decreases as the element size increases. This is because when the element size exceeds 256B, more than 4 vaults are used to store a single data element; consequently, both the default and optimal address mapping schemes fully utilize the parallelism provided by the HMC module.

Fig. 17. The performance comparison between the default and optimal address mapping schemes under different LSH entry sizes.

3) LSH: Nearest-neighbor search is the key kernel in many applications (e.g., image search), and one of the most widely adopted methods is Locality-Sensitive Hashing (LSH) [23]. LSH hashes the entire dataset using a set of hash functions, each designed to statistically hash similar data into the same bucket. To serve a query, the query is first hashed using the same set of hash functions, and the matching buckets are traversed to find the best data entry (according to the computed distance between the query and each data entry). The data access pattern of LSH is therefore a combination of random access (reading the matching buckets is random) and sequential access (reading the data within one bucket is sequential).
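This mixed pattern can be illustrated with a simple query loop. The data layout below is assumed for illustration and is not the benchmark code used in Section V:

```c
#include <float.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    float  *entries;   /* contiguous data entries of this bucket */
    size_t  n_entries;
    size_t  entry_dim; /* floats per entry                       */
} bucket_t;

/* One random jump per hash table to reach the matching bucket, then a
 * sequential scan of that bucket's entries to find the closest one. */
float lsh_query(bucket_t **tables, size_t n_tables,
                const uint32_t *bucket_ids,  /* precomputed hashes of the query */
                const float *query, size_t dim)
{
    float best = FLT_MAX;
    for (size_t t = 0; t < n_tables; t++) {
        bucket_t *b = &tables[t][bucket_ids[t]];        /* random access     */
        for (size_t e = 0; e < b->n_entries; e++) {     /* sequential access */
            const float *entry = b->entries + e * b->entry_dim;
            float d = 0.0f;
            for (size_t k = 0; k < dim && k < b->entry_dim; k++) {
                float diff = entry[k] - query[k];
                d += diff * diff;
            }
            if (d < best)
                best = d;
        }
    }
    return best;
}
```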

In this experiment, we run the LSH application with four threads (one per core) and evaluate the performance of different address mapping schemes under different data entry sizes. As shown in Fig. 17, the performance gap between the optimal and default address mapping schemes increases with the data entry size. This is because the default address mapping scheme can lead to vault conflicts (different threads reading the same vault) for large data entries. Take the case of a 1kB data entry as an illustration: with the default address mapping scheme, one data entry is spread across all 16 vaults, so reading one data entry needs to access every vault. This means that memory requests issued by different threads/cores can only be served sequentially, leading to frequent pipeline stalls (low performance). In contrast, under the optimal address mapping scheme, one data entry is stored in 4 vaults, so the HMC module can serve up to four memory requests in parallel, leading to increased performance.

Fig. 18. The performance comparison for the BFS benchmark between the default and optimal address mapping schemes under different numbers of vertices.

4) Graph Traversal: Graph traversal is one of the most important kernels in many graph problems, including maximum flow [24], shortest path [25], and graph search [26]. It has a random access pattern, as the nodes traversed in one round are determined by the nodes traversed in the previous round. In this experiment, we use the level-synchronized BFS design proposed by Zhang et al. [27] to evaluate the performance of different address mapping schemes under different list sizes. Similar to the LSH case, different numbers of vaults are used to store one list under different address mapping schemes, and the number of memory requests that can be served by the HMC in parallel also varies across schemes. As shown in Fig. 18, the optimal address mapping scheme improves performance by more than 10% compared with the default one.

VI. CONCLUSION

This work introduces MEG, the first RISC-V-based full-system simulation infrastructure using FPGAs and the emerging HMC. We present a prototype implementation of MEG and demonstrate the capability, fidelity, and flexibility of the design on real-world benchmark applications. The techniques we propose can be applied more generally to other memory types, such as HBM and non-volatile memories. Based on our preliminary evaluation of a relatively simple case, we believe and hope that MEG, with its flexibility and fidelity, can enable further comprehensive studies and research ideas across all levels of the computing stack for memory systems, from microarchitecture to programming models, runtime management, operating systems, and applications. As this is the very first prototype implementation, there is plenty of room to improve the platform, such as its usability and an easy-to-use software interface. As a contribution to the RISC-V community, we plan to release MEG publicly as an open-source research platform, accessible to a wide range of software and hardware developers, to advance computer systems research.

ACKNOWLEDGEMENT

This work was supported in part by the Semiconductor Research Corporation (SRC).


REFERENCES

[1] J. Kim, A. J. Hong, S. M. Kim, K.-S. Shin, E. B. Song, Y. Hwang, F. Xiu, K. Galatsis, C. O. Chui, R. N. Candler et al., "A stacked memory device on logic 3D technology for ultra-high-density data storage," Nanotechnology, vol. 22, no. 25, p. 254006, 2011.

[2] J. T. Pawlowski, "Hybrid Memory Cube (HMC)," in 2011 IEEE Hot Chips 23 Symposium (HCS). IEEE, 2011, pp. 1–24.

[3] JEDEC Standard, "High Bandwidth Memory (HBM) DRAM," JESD235, 2013.

[4] Q. Guo, N. Alachiotis, B. Akin, F. Sadi, G. Xu, T. M. Low, L. Pileggi, J. C. Hoe, and F. Franchetti, "3D-stacked memory-side acceleration: Accelerator and system design," in Workshop on Near-Data Processing (WoNDP), held in conjunction with MICRO-47, 2014.

[5] M. M. K. Martin, D. J. Sorin, B. M. Beckmann, M. R. Marty, M. Xu, A. R. Alameldeen, K. E. Moore, M. D. Hill, and D. A. Wood, "Multifacet's general execution-driven multiprocessor simulator (GEMS) toolset," SIGARCH Comput. Archit. News, vol. 33, no. 4, pp. 92–99, Nov. 2005. [Online]. Available: http://doi.acm.org/10.1145/1105734.1105747

[6] N. L. Binkert, R. G. Dreslinski, L. R. Hsu, K. T. Lim, A. G. Saidi, and S. K. Reinhardt, "The M5 simulator: Modeling networked systems," IEEE Micro, vol. 26, no. 4, pp. 52–60, 2006.

[7] N. Binkert, B. Beckmann, G. Black, S. K. Reinhardt, A. Saidi, A. Basu, J. Hestness, D. R. Hower, T. Krishna, S. Sardashti et al., "The gem5 simulator," ACM SIGARCH Computer Architecture News, vol. 39, no. 2, pp. 1–7, 2011.

[8] F. Ryckbosch, S. Polfliet, and L. Eeckhout, "Fast, accurate, and validated full-system software simulation of x86 hardware," IEEE Micro, vol. 30, no. 6, pp. 46–56, 2010.

[9] D. Wang, B. Ganesh, N. Tuaycharoen, K. Baynes, A. Jaleel, and B. Jacob, "DRAMsim: A memory system simulator," ACM SIGARCH Computer Architecture News, vol. 33, no. 4, pp. 100–107, 2005.

[10] S. Karandikar, H. Mao, D. Kim, D. Biancolin, A. Amid, D. Lee, N. Pemberton, E. Amaro, C. Schmidt, A. Chopra et al., "FireSim: FPGA-accelerated cycle-exact scale-out system simulation in the public cloud," in Proceedings of the 45th Annual International Symposium on Computer Architecture. IEEE Press, 2018, pp. 29–42.

[11] H. Angepat, D. Chiou, E. S. Chung, and J. C. Hoe, "FPGA-accelerated simulation of computer systems," Synthesis Lectures on Computer Architecture, vol. 9, no. 2, pp. 1–80, 2014.

[12] J. D. Leidel and Y. Chen, "HMC-Sim-2.0: A simulation platform for exploring custom memory cube operations," in 2016 IEEE International Parallel and Distributed Processing Symposium Workshops. IEEE, 2016, pp. 621–630.

[13] R. Avizienis, J. Bachrach, S. Beamer, D. Biancolin, C. Celio, H. Cook, D. Dabbelt, J. Hauser, A. Izraelevitz et al., "The Rocket Chip generator," 2016.

[14] ARM, Arm Cortex-M1 DesignStart FPGA-Xilinx Edition.

[15] Xilinx, MicroBlaze Micro Controller System v3.0.

[16] Intel, Nios II Processor Reference Guide.

[17] RISC-V Foundation, The RISC-V Instruction Set Manual.

[18] A. Severance and G. G. Lemieux, "Embedded supercomputing in FPGAs with the VectorBlox MXP matrix processor," in Proceedings of the Ninth IEEE/ACM/IFIP International Conference on Hardware/Software Codesign and System Synthesis. IEEE Press, 2013, p. 6.

[19] SpinalHDL, VexRiscv CPU.

[20] E. Matthews, Z. Aguila, and L. Shannon, "Evaluating the performance efficiency of a soft-processor, variable-length, parallel-execution-unit architecture for FPGAs using the RISC-V ISA," in 2018 IEEE 26th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM). IEEE, 2018, pp. 1–8.

[21] J. Bachrach, H. Vo, B. Richards, Y. Lee, A. Waterman, R. Avizienis, J. Wawrzynek, and K. Asanovic, "Chisel: Constructing hardware in a Scala embedded language," in Design Automation Conference (DAC), 2012 49th ACM/EDAC/IEEE. IEEE, 2012, pp. 1212–1221.

[22] P. Luszczek, J. Dongarra, and J. Kepner, "Design and implementation of the HPC Challenge benchmark suite," CTWatch Quarterly, vol. 2, no. 4A, pp. 18–23, 2006.

[23] A. Andoni and P. Indyk, "Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions," in 47th Annual IEEE Symposium on Foundations of Computer Science (FOCS'06). IEEE, 2006, pp. 459–468.

[24] A. V. Goldberg and R. E. Tarjan, "A new approach to the maximum-flow problem," Journal of the ACM (JACM), vol. 35, no. 4, pp. 921–940, 1988.

[25] P. Harish and P. Narayanan, "Accelerating large graph algorithms on the GPU using CUDA," in International Conference on High-Performance Computing. Springer, 2007, pp. 197–208.

[26] I. Mani and E. Bloedorn, "Multi-document summarization by graph search and matching," arXiv preprint cmp-lg/9712004, 1997.

[27] J. Zhang, S. Khoram, and J. Li, "Boosting the performance of FPGA-based graph processor using hybrid memory cube: A case for breadth first search," in Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays. ACM, 2017, pp. 207–216.
