MEG: A RISCV-based System Simulation Infrastructure for Exploring Memory Optimization
Using FPGAs and Hybrid Memory Cube
Jialiang Zhang, Yang Liu, Gaurav Jain, Yue Zha, Jonathan Ta and Jing Li
University of Wisconsin-Madison
Electrical and Computer Engineering Department
[email protected], {liu574,gjain6, yzha3, jta}@wisc.edu, [email protected]
Abstract—Emerging 3D memory technologies, such as the Hybrid Memory Cube (HMC) and High Bandwidth Memory (HBM), provide increased bandwidth and massive memory-level parallelism. Efficiently integrating emerging memories into existing systems poses new challenges and requires detailed evaluation in a real computing environment. In this paper, we propose MEG, an open-source, configurable, cycle-exact, RISC-V based full system simulation infrastructure using FPGA and HMC. MEG has three highly configurable design components: (i) an HMC adaptation module that not only enables communication between the HMC device and the processor cores but also can be extended to fit other memories (e.g., HBM, nonvolatile memory) with minimal effort, (ii) a reconfigurable memory controller along with its OS support that can be effectively leveraged by system designers to perform software-hardware co-optimization, and (iii) a performance monitor module that effectively improves the observability and debuggability of the system to guide performance optimization. We provide a prototype implementation of MEG on the Xilinx VCU110 board and demonstrate its capability, fidelity, and flexibility on real-world benchmark applications. We hope that our open-source release of MEG fills a gap in the space of publicly available FPGA-based full system simulation infrastructures specifically targeting memory systems and inspires further collaborative software/hardware innovations.
I. INTRODUCTION
DDR SDRAM is the predominant technology used to build
the main memory systems of modern computers. However,
the slowdown in technology scaling and the quest for higher
memory bandwidth, driven by emerging data-intensive applications
such as machine learning, have led to the development
of a new class of 3D memory [1]. The two well-known
realizations of 3D memory are Micron's Hybrid Memory Cube
(HMC) [2] and JEDEC High Bandwidth Memory (HBM) [3].
These memories exploit TSV-based stacking technology and
re-architect the memory banks into multiple independently
operated channels to achieve greater memory-level parallelism
than DDR memory. For instance, a standalone HMC can
provide an order of magnitude higher bandwidth while using
70% less energy per bit than DDR3 DRAM and nearly 90%
less space than today's RDIMMs
[2]. However, integrating 3D memories into computer systems
is non-trivial. It requires detailed evaluation of these memo-
ries within computers running realistic software stacks. Such
evaluation would help computer architects in comprehensively
understanding the relationship between micro-architecture,
OS, and applications and further driving software-hardware
collaborative innovations for memory systems. Early studies
on HMC-based main memory offer useful insights [4], but
most abstract the HMC as a high-bandwidth, low-energy
equivalent of DDR SDRAM. This methodology does not
fully exploit the unique architectural attributes of HMC for
higher performance. Hardware-software co-design of HMC-
based main memory is fundamentally hindered by a lack of
research evaluation infrastructure with high fidelity, flexibility,
and performance.
There are decades of research into performance evaluation
methods for computer systems [5][6][7][8]. Software-based
simulators are flexible, but suffer from low simulation speeds
and low-fidelity simulations when modeling real hardware,
especially memory systems [7]. For instance, a typical DDR
memory controller reorders and schedules requests multiple
times while keeping track of dozens of timing parameters
that are inherently challenging to model [9]. FPGA-accelerated
simulations address the majority of these problems by provid-
ing low latency, high simulation throughput, and low cost per
simulation cycle [10][11]. Yet, the current critical limitation
to this approach has been modeling the DDR DRAM-based
main memory system, as mapping the controller RTL, physical
interface, and chip models into an FPGA fabric is too complex
and resource-intensive [12].
To address these challenges, we present MEG, an open-
source, configurable, cycle-exact, and RISC-V based full sys-
tem simulation infrastructure using FPGA and HMC. MEG
comprises an FPGA-based Rocket processor [13] and an HMC-
based main memory. It also includes a bootable Linux image
for a realistic software flow, enabling cross-layer hardware-
software co-optimization. We leverage a commercially avail-
able FPGA platform, the Xilinx VCU110 board, equipped
with a Xilinx Virtex UltraScale FPGA and a 2GB HMC module
from Micron - to implement the very first version of MEG.
The commodity nature of the platform effectively lowers the
barrier for open source adoption.
MEG comprises three highly configurable design components:
an HMC adaptation module, a reconfigurable memory
controller, and a flexible performance monitoring scheme. The
HMC adaptation module handles the incompatible information
2019 IEEE 27th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM)
2576-2621/19/$31.00 ©2019 IEEE. DOI 10.1109/FCCM.2019.00029
flow between the host system and the HMC controller module.
Constructed to be flexible, the adaptation module can be
easily extended to fit other memories (e.g., HBM, nonvolatile
memory) with minimal effort. The design consists of a com-
bination of an address translation unit, a HMC boot sequence
generator, and an ID management unit. The detailed design
will be explained in Section IV-A. The reconfigurable memory
controller, along with its OS support, is the key component that
system designers can effectively leverage to perform software-
hardware co-optimization. It provides architectural support
and an interface that communicates high-level program
semantics, i.e., data access patterns, from the application
down to the underlying OS and architecture (exploiting the unique
addressing scheme of HMC) to achieve more intelligent data placement
for memory system performance optimization. The detailed
design and a simple use case for such cross-layer optimization
will be introduced in Section IV-B. Finally, the addition of a
flexible performance monitoring scheme (Section IV-C) im-
proves system observability and debuggability for performance
optimization. It comprises a trace-based performance monitor
along with a set of reconfigurable hardware counters, enabling
users to flexibly monitor and collect runtime information,
i.e., detailed full system execution traces, from a workload
and better comprehend memory system behavior, ultimately
guiding performance optimization.
The paper is organized as follows. Section II provides
background on HMC technology and RISC-V processors.
Section III presents our experimental characterization on the
native HMC device. Section IV presents the MEG system in-
cluding three detailed design components. Section V presents
the evaluation results on synthetic and real-world benchmark
applications followed by a conclusion in Section VI.
II. BACKGROUND
In this section, we provide background on emerging
parallel memory architectures and RISC-V processors.
A. Hybrid Memory Cube
Hybrid memory cube (HMC) employs a new parallel
memory architecture that stacks multiple DRAM dies on
top of a CMOS logic layer using through-silicon-via (TSV)
technology [2]. Fig. 1 shows an overview of the HMC architecture.
It consists of multiple DRAM layers, with each layer
divided into multiple partitions and each partition comprising
several memory banks. A vertical slice of stacked banks
forms a structure known as a vault. Each vault has its independent
TSV bus and vault controller in the logic base layer.
The vault controller manages the timing of DRAM banks
within the stack, independent of other vaults. Therefore, a
vault is analogous to the notion of a channel in traditional
DRAM-based memory systems, as it contains all components
of a DRAM channel: a memory controller, several memory
ranks (partitions), and a bi-directional bus. We can then view
these parallel memory architectures as a device that integrates
numerous DRAM channels into a single device.
There are some major differences between traditional DDRx
SDRAM and HMC. DDRx SDRAM has a relatively large
row buffer, ranging from 512B to 2048B, and uses an
open-page policy to benefit from row buffer locality. In
contrast, HMC uses a much smaller row buffer (16B), and
all data in the row buffer are read out for a single memory
request. Hence, HMC uses a closed-page policy to minimize
power consumption. In addition, HMC provides multiple
address mapping modes as a way to control memory access
scheduling. We show the default address mapping scheme in
Fig. 2, which maps memory blocks across vaults, banks,
and rows. The minimum access granularity of HMC is 16B,
implying that the 4 LSBs are ignored. The 4-bit vault address
(bits 6 to 9) combined with the 4-bit bank address (bits 10 to 13)
is used to select a bank. Besides the default addressing scheme,
HMC supports other addressing schemes, such as schemes 1, 2,
and 3 in Fig. 2. HMC's structure offers high potential bandwidth
and energy efficiency. In Section IV-B, we demonstrate
that these theoretical potentials can be achieved only with
efficient utilization of fine-grained parallelism, i.e., careful
selection of addressing modes with respect to different data
access patterns.
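As a concrete illustration, the default scheme's field extraction can be written as simple shift-and-mask helpers. This is a sketch following the bit positions described above; the other schemes differ only in which address bits select the vault and bank.

```c
#include <stdint.h>

/* Default HMC address mapping, following the bit positions above:
 * bits [3:0]   byte offset within the 16B minimum access (ignored)
 * bits [9:6]   4-bit vault address
 * bits [13:10] 4-bit bank address
 */
static inline unsigned hmc_vault(uint64_t addr)
{
    return (unsigned)(addr >> 6) & 0xF;
}

static inline unsigned hmc_bank(uint64_t addr)
{
    return (unsigned)(addr >> 10) & 0xF;
}
```

Under this layout, consecutive 64B blocks land in consecutive vaults, which is why streaming accesses can engage many vaults at once.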
Fig. 1. A conceptual diagram illustrating the HMC architecture.
B. RISC-V Processors
To build an open-source and configurable FPGA simulation
platform with Linux support, we chose to use the RISC-V
architecture, as it is the only soft-core processor with Linux
support while being open-source. There are many widely-used
soft processors, such as the ARM Cortex-M1 [14], the Xilinx
MicroBlaze [15], the Intel NIOS II [16], and the RISC-V family.
While MicroBlaze and NIOS II both provide operating system
support, e.g., Linux, their hardware designs are closed-source,
leaving RISC-V as the only viable option for full-system
evaluation.
The RISC-V specification [17] is both an open and a free
standard, leading to the architecture's widespread adoption in
both industry and academia. For example, Rocket Chip implements
a RISC-V System-on-Chip for architecture education
and research. SiFive leverages Rocket Chip to implement the
first Linux-capable RISC-V soft core, the U500. VectorBlox [18]
has produced Orca to host the MXP vector processor. Finally,
the VexRiscv project [19] and the Taiga project [20] have
optimized the RISC-V processor for FPGAs. We extended
Fig. 2. Different HMC address mapping schemes.
MEG from the SiFive U500 design for the following reasons:
(i) it comes with Linux support; (ii) it is a rocket-core based
design, which allows processor parameter modification via
Chisel [21], a high-level abstraction.
Until processors with native support for parallel memory
controllers become available, FPGA-based emulation plays a
crucial role in enabling the evaluation and optimization of
parallel memory subsystems.
III. EXPLOITING MEMORY PARALLELISM
As described in Section II-A, HMC provides massive mem-
ory parallelism at both vault and bank levels, where vault-level
parallelism does not exist in conventional DDRx memory and
is unique to HMC. Therefore, choosing an optimal memory ad-
dress mapping scheme is critical to maximize parallelism and
memory bandwidth. While simple in theory, mapping scheme
selection is not a trivial task. Specifically, distinct memory
access patterns generated by different applications suggest that
memory mapping optimization is complex and multi-faceted.
To better understand the relationship between the mapping
scheme and the memory performance, we perform several
experiments on a FPGA-HMC platform and use the resulting
observations to optimize system performance (Section IV-B).
Fig. 3. HMC streaming bandwidth with different levels of parallelism
In the FPGA-HMC platform, a 2GB HMC module is
connected to a Xilinx UltraScale FPGA through two half-
width (8-lane) 15G HMC links. We implement a dedicated
module on the FPGA to generate the memory requests sent
to the HMC module. Specifically, this module reads an array
in HMC memory with different strides, where each element
can vary in size between 32B, 64B, 128B, and 256B. Four
representative memory address mapping schemes (Fig. 2) are
evaluated in these experiments.
Fig. 4. Bandwidth of accessing 32B elements in an array on HMC with different stride sizes
Changing the address mapping scheme varies the num-
ber of banks and vaults used to store the data array and
serve memory requests concurrently. This difference yields
a measurable performance change, i.e., memory bandwidth
difference between different mapping schemes. Particularly in
the case of streaming access, i.e., stride 1, the attained memory
bandwidth can differ by a factor of ∼10× (Fig. 3). Further
memory bandwidth measurements of the four schemes under
varying strides and element sizes yield drastic differences
in performance across the schemes. This observation implies
that a fixed memory address mapping scheme cannot deliver
the best performance across scenarios. It is therefore necessary
to select the optimal address mapping scheme based on the
detected memory access pattern, i.e., stride and element size.
We leverage the strong link between an application's high-level
data structures and its memory access pattern to guide
optimal selection of the memory mapping scheme (Section
IV-B).
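The measurement kernel can be sketched as a strided read loop of the following shape. This is an illustrative software reconstruction, not the RTL request generator actually used on the FPGA.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Read elements of elem_size bytes from array, visiting every stride-th
 * element out of n_elems; the checksum keeps the reads from being
 * optimized away by the compiler. */
uint64_t stride_read(const uint8_t *array, size_t n_elems,
                     size_t elem_size, size_t stride)
{
    uint64_t checksum = 0;
    for (size_t i = 0; i < n_elems; i += stride) {
        const uint8_t *elem = array + i * elem_size;
        for (size_t b = 0; b + sizeof(uint64_t) <= elem_size;
             b += sizeof(uint64_t)) {
            uint64_t word;
            memcpy(&word, elem + b, sizeof word);  /* one 8B read */
            checksum += word;
        }
    }
    return checksum;
}
```

Varying `elem_size` and `stride` while holding the array fixed reproduces the access patterns measured in Figs. 4 and 5.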
Fig. 5. Bandwidth of accessing an array on HMC with a fixed stride of 16 and different element sizes
IV. SYSTEM DESIGN
Fig. 6 shows an overview of our MEG system. The three
major design components are highlighted: (i) an
HMC adaptation module that both enables communication
between the HMC device and the processor cores and is extensible
to fit other memories (e.g., HBM, nonvolatile memory) with
minimal effort, (ii) a reconfigurable memory controller along
with its OS support that can be effectively leveraged by sys-
tem designers to perform software-hardware co-optimization,
and (iii) a performance monitoring module that effectively
improves the observability and debuggability of the system to
guide performance optimization. In the following subsections,
we present detailed descriptions of each design component.
Fig. 6. System diagram of MEG
A. HMC Adaptation Module
In this subsection, we provide a detailed description of the
HMC adaptation module which interconnects the HMC and
host system. This design consists of three components: an
address translation unit, an HMC boot sequence generator, and
an ID management unit.
Since the host system provides a memory port which can
only be interfaced via SDRAM, we have provided an address
translation module for modifying address information and
ensuring compatibility with HMC requirements. The HMC
memory module differs from standard SDRAM in two primary
aspects, (a) it provides the ability to have multiple outstanding
transactions, and (b) it requires metadata specifying the type of
transaction and packet size in every transaction. To cater to the
first, we explore the use of the ID field provided by the HMC
module and configure the rocket-core system accordingly
to utilize the maximum achievable memory bandwidth. To
address the second, we implement a forwarder module where
we insert metadata that specifies transaction type into the user
channel. In addition, we extend this module to perform clock
domain crossing, as the HMC controller requires the data on
the AXI4 port to be synchronized with its reference clock,
which is different from the core operating frequency.
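The metadata inserted into the user channel can be pictured as a small per-transaction record of roughly the following shape. Field names and widths here are our illustration, not the XHMC IP's actual user-channel layout.

```c
#include <stdint.h>

/* Illustrative per-transaction metadata that the forwarder attaches to
 * each request on the AXI4 user channel. */
struct hmc_user_meta {
    uint8_t cmd;   /* transaction type, e.g., read vs. write       */
    uint8_t size;  /* packet size, in 16B FLITs                    */
    uint8_t tag;   /* request ID, matched on out-of-order returns  */
};
```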
Aside from the interface difference, the HMC memory
module requires a sequence of steps for initialization and
power-management. To alleviate user burdens, we have created
a specialized boot sequence generator for the proper setup
of the HMC module. The implemented module interfaces
with the HMC configuration port and issues the appropriate
initialization commands. After ensuring a stable reference
voltage, we begin the initialization by de-asserting the reset
signal and subsequently configuring the reference clock and
the lane bandwidth. In the next phase, the HMC controller
and memory module perform a synchronization process for
link layer configuration. After completion, we verify the
physical layer and clock reset signals of the HMC memory to
identify whether the initialization completed successfully and
accordingly propagate the ready signal to the other modules
in the design.
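The initialization sequence described above can be summarized as a linear series of steps. The step names are our shorthand for the configuration-port commands, not the IP's register names.

```c
/* Boot-sequence steps, in the order the generator walks through them. */
enum hmc_boot_step {
    HMC_WAIT_VREF,      /* wait for a stable reference voltage           */
    HMC_DEASSERT_RESET, /* release the HMC reset signal                  */
    HMC_CONFIG_REFCLK,  /* configure the reference clock                 */
    HMC_CONFIG_LANES,   /* configure the lane bandwidth                  */
    HMC_LINK_TRAINING,  /* controller/module link-layer synchronization  */
    HMC_CHECK_STATUS,   /* verify PHY and clock reset signals            */
    HMC_ASSERT_READY    /* propagate ready to the rest of the design     */
};
```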
Fig. 7. Detailed implementation of the HMC adaptation module
To exploit the expansive bandwidth offered by the HMC
memory module, we employ an ID management module to
configure the cache in the rocket-core to be non-blocking,
i.e., have Miss Status Handling Registers (MSHRs), enabling
the module to issue multiple memory requests to the memory.
MSHRs are special registers that contain the required metadata
such as tag and cache-way to redirect the memory response to
an appropriate location in the cache. Configuring the number
of MSHRs in the Rocket core and using the HMC ID field
enables the core to utilize the maximum achievable bandwidth.
In our system, a change in the number of MSHRs also impacts
the resource utilization of subsequent modules in the memory
hierarchy. Also, systems with non-blocking caches require
every memory request to be checked against the MSHR buffer
so that the system can identify whether previously serviced
requests correspond to the same location in memory. Considering
both MSHR logic and area overheads, there exists a trade-off
between the number of MSHRs in the system and the
number of outstanding transactions supported by the memory
Fig. 8. An example of selecting the optimal addressing scheme for accessing 64B elements with a stride of 16
hierarchy. We discuss this relationship between the resource
utilization and number of MSHRs in the evaluation section.
B. Reconfigurable Memory Controller
The MEG system provides a reconfigurable memory controller
to enable memory subsystem optimization. The major com-
ponent added into the memory controller is a TCAM mod-
ule used for selecting the optimal address mapping scheme.
Despite its relatively simple hardware design, support from
both the programming model and runtime management is
required to efficiently utilize this modified memory controller.
In this subsection, we present a simple example to illustrate
how to optimize the memory system performance with this
modified controller. More comprehensive system solutions are
left to future work.
Fig. 9. A use case illustrating how to apply the reconfigurable memory controller to perform memory system optimization
Achieving the optimal selection of the memory address
mapping scheme consists of three steps. The first step is the
offline code profiling stage. In this step, user applications are
analyzed to extract high-level semantics and access patterns,
e.g., element size, stride, etc. This high-level information is
then used to determine the optimal address mapping scheme.
With the optimized data placement, the massive memory
parallelism provided by HMC can be efficiently utilized, as
continuous memory requests are served by different vaults in
parallel, as illustrated in Fig. 8.
The second step is a modified memory allocation pro-
cess. Specifically, we modify the malloc function to take
an additional parameter, i.e., the address mapping scheme.
Consequently, users can call this malloc function to assign
the optimal mapping scheme for a specific memory allocation.
We also alter the Linux kernel to store the physical
address range, i.e., the page mask, and the corresponding
mapping scheme in a TCAM module.
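A minimal sketch of such an allocation interface might look as follows. The function and type names are hypothetical, and the stub stands in for the kernel path that programs the page mask and scheme into the TCAM.

```c
#include <stdlib.h>

enum hmc_map_scheme { MAP_DEFAULT, MAP_SCHEME1, MAP_SCHEME2, MAP_SCHEME3 };

/* Stub for the kernel interface that records the (page mask, scheme)
 * pair in the TCAM; in MEG this path goes through the modified kernel. */
static int hmc_register_mapping(void *addr, size_t len, enum hmc_map_scheme s)
{
    (void)addr; (void)len; (void)s;
    return 0;
}

/* malloc variant taking the desired address mapping scheme as an
 * extra argument, per the modified allocation process above. */
void *hmc_malloc(size_t size, enum hmc_map_scheme scheme)
{
    void *p = malloc(size);
    if (p != NULL && hmc_register_mapping(p, size, scheme) != 0) {
        free(p);
        return NULL;
    }
    return p;
}
```

A caller that knows its access pattern (say, 64B elements with stride 16) would pass the scheme chosen during profiling, e.g., `hmc_malloc(len, MAP_SCHEME2)`.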
The last step is the runtime address mapping process. The
memory controller is modified to apply the assigned mapping
scheme for a specific physical address range. When the mem-
ory controller receives a memory request, the physical address
of this memory request is used as the key to search the TCAM
module and to fetch the assigned address mapping scheme.
The corresponding mapping scheme then converts physical
addresses into hardware addresses to read HMC memory.
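In software terms, the lookup behaves like the following model. The entry count and the match rule `(paddr & mask) == base` are our reading of the design; the hardware TCAM matches all entries in parallel rather than iterating.

```c
#include <stddef.h>
#include <stdint.h>

struct tcam_entry {
    uint64_t base;   /* physical base of the registered range         */
    uint64_t mask;   /* page mask: which address bits must match      */
    int scheme;      /* address mapping scheme assigned to this range */
    int valid;
};

#define TCAM_ENTRIES 16
static struct tcam_entry tcam[TCAM_ENTRIES];

/* Return the mapping scheme registered for paddr, or the default
 * scheme on a miss. */
int tcam_lookup(uint64_t paddr, int default_scheme)
{
    for (size_t i = 0; i < TCAM_ENTRIES; i++)
        if (tcam[i].valid && (paddr & tcam[i].mask) == tcam[i].base)
            return tcam[i].scheme;
    return default_scheme;
}
```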
Fig. 10. The hardware components for the performance monitor scheme.
C. Performance Monitoring Scheme
As shown in Fig. 10, MEG comprises two complementary
monitoring schemes: hardware counters and a trace collector.
Depending on the desired level of detail, users can choose
between the simple performance counters for monitoring
event frequency and the PCIe-based trace collection
module for obtaining high-fidelity details at the hardware
signal granularity. Through this module, we provide the user a
flexible and configurable interface for monitoring any desired
signal over a preferred duration. For configurability, users
can write predefined values to a memory-mapped register to
enable or disable the event recording module. In addition, the
user can monitor the performance counters and execute trace
collection via the JTAG and the PCIe ports simultaneously.
The JTAG port, used for monitoring the hardware counters,
provides sufficient bandwidth for transferring counter data to
the host machine. Trace collection on the other hand requires
the high bandwidth PCIe interface due to the large amount
of data generated. Conventionally, this imposes a limitation
on the number of signals the user can map onto the PCIe
capture interface. In our design, instead of constraining the
user to regenerate the bitstream for each change in the target
capture signals, we have implemented a multiplexer scheme
where the user can map all potential signals and control them
using the provided memory-mapped register. This provides the
user a programmable trace collection interface which can be
controlled using only software.
Fig. 11. The control flow of the performance monitoring scheme.
As illustrated in Fig. 11, in order to integrate the monitoring
module, the user has to select and connect the target signals
to the corresponding signal ports of our module and generate
the bitstream with the modified design. We provide the user
with a modified Linux image, which includes the memory
mapped register, to load into the FPGA programmed with
the generated bitstream. The register controls the monitoring
duration, where writing a one to it starts the monitoring
TABLE I
RESOURCE UTILIZATION

             LOGIC     BRAM   DSP
Total        1072240   3780   1800
Used         189229    205    60
Utilization  17%       5%     3%
process, and writing a zero stops the corresponding process. At
the hardware level, a polling mechanism monitors the value of
the register. Once the value of the register has been asserted,
both the counter module and the trace module begin data
collection. Capturing the counter data over the JTAG port
does not require any modification on the host machine, and
only a single command has to be executed to enable
the host PCIe driver to accept the generated trace data. On
executing the appropriate command, the host machine begins
reading data from the designated on-board storage buffer. If
the detected value of the memory mapped register is zero, i.e.,
whenever the user decides to stop the monitoring process, the
hardware counters and trace collection processes are halted.
On halt, the counter data are transferred to the host machine,
whereas the PCIe module waits for the on-board buffer to
be cleared before generating an interrupt to the host. This
interrupt signals the end of the transfer to the host, following
which the host machine no longer accepts data from the FPGA.
Subsequently, if the user decides to monitor a different set of
signals, the memory mapped register can be programmed with
the corresponding IDs and the process can be rerun.
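From the software side, the whole flow reduces to writes to the memory-mapped registers; a sketch follows, where the register split into a control register and a signal-select register is an assumption for illustration.

```c
#include <stdint.h>

/* Select which signal group feeds the PCIe trace path. */
static void monitor_select(volatile uint32_t *mux_reg, uint32_t signal_id)
{
    *mux_reg = signal_id;  /* steer the chosen signals onto the capture path */
}

/* Writing one starts monitoring; writing zero stops it, after which the
 * counter data are drained over JTAG and the trace buffer over PCIe. */
static void monitor_start(volatile uint32_t *ctrl_reg) { *ctrl_reg = 1; }
static void monitor_stop(volatile uint32_t *ctrl_reg)  { *ctrl_reg = 0; }
```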
V. EVALUATIONS
A. Experimental Setup
We implement MEG on a Xilinx VCU110 board (Fig. 12)
equipped with a Xilinx Virtex UltraScale FPGA and a 2GB
HMC module from Micron. The design is
implemented using Xilinx Vivado 2016.4 and Xilinx XHMC
IP 1.1. The resource utilization is summarized in Table I.
Fig. 12. MEG platform, including a Xilinx Virtex UltraScale FPGA, a 2GB Micron HMC module, and an SD card to store the Linux image
To evaluate the performance of MEG, we select 4 different
benchmarks: random access, stride array access, locality sen-
sitive hashing, and graph traversal, where both random access
and stride array access are synthetic benchmarks while both
locality sensitive hashing and graph traversal are from real-
world applications. The random access benchmark is used
to verify that the RISC-V cores can generate enough traffic
to saturate the physical interface. The stride-array access
benchmark is the same as the one used in Section III, but
re-evaluated in MEG with the complete system stack. Locality-sensitive
hashing and graph traversal are then used to evaluate
the performance of MEG on real-world applications.
B. Evaluation Results
Fig. 13. The performance (in terms of GUPS) for random access when varying the size of MSHR and the number of cores.
1) Random Access: To verify that our system can generate
sufficient memory requests to saturate the HMC physical
interface, we measure the random access performance in
terms of GUPS (giga-updates per second) while varying the
number of MSHRs (in-flight memory requests) and the number
of cores. We modified the code from the RandomAccess
benchmark in the HPC Challenge suite [22] and optimized it for
better in-order execution performance. In particular, instead
of executing read-modify-update operations atomically, we
combine several operations of the same type to hide memory
access latency and reduce the number of stalls caused by
serialized memory access. As shown in Fig. 13, using all
4 cores and 32 MSHRs generates enough memory traffic to
saturate the HMC interface.
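The batching idea can be sketched as follows; the batch size and table shape are illustrative, not the exact parameters of our modified benchmark.

```c
#include <stddef.h>
#include <stdint.h>

#define BATCH 8  /* issue up to BATCH reads before applying the updates */

/* Batched random updates: grouping the reads lets several cache misses be
 * outstanding at once (one per MSHR) instead of serializing each
 * read-modify-write. */
void random_update(uint64_t *table, size_t mask,
                   const uint64_t *indices, size_t n)
{
    for (size_t i = 0; i + BATCH <= n; i += BATCH) {
        uint64_t vals[BATCH];
        for (int j = 0; j < BATCH; j++)      /* reads issued back-to-back */
            vals[j] = table[indices[i + j] & mask];
        for (int j = 0; j < BATCH; j++)      /* updates applied afterwards */
            table[indices[i + j] & mask] = vals[j] ^ indices[i + j];
    }
}
```

Note that batching is equivalent to the serial loop only when the indices within a batch hit distinct locations; RandomAccess tolerates such relaxed update semantics.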
Fig. 14. The performance (in terms of GUPS) for random access when different numbers of vaults/banks are used to store data.
We further evaluate the sensitivity of random access per-
formance (in terms of GUPS) to parallelism. By changing the
memory address mapping schemes, we can limit the number of
vaults and banks used to store data. As shown in Fig.
14, the random access performance monotonically increases
when data are stored in more vaults and banks. We note
that no performance improvement is achieved when more
than 4 vaults are used to store data. This is because of the
limited interconnection bandwidth (2 half-width physical links
between FPGA and HMC) in our platform. The HMC module
can be connected to the FPGA with 4 full-width physical links
(4× bandwidth compared with that provided in our platform).
With this interconnection, the random access performance can
be further improved when data are stored in more than 4 vaults.
Fig. 15. The performance comparison between the default and optimal address mapping schemes when reading an array with a given stride and a fixed element size (64B).
Fig. 16. The performance comparison between the default and optimal address mapping schemes when reading an array with a given element size and a fixed stride (16).
2) Stride Memory Access: We extend the stride-array access
experiments in Section III with different payload sizes and
access strides. We first fix the element size to 64B (aligned
to RISC-V cache line size) and vary the stride. As shown in
Fig. 15, we can see that the performance gap between the
default address mapping and the optimal address mapping can
be as large as 10×, especially when using a stride of 16, where
the data are stored in one vault (limited parallelism) with the
default address mapping scheme.
We then evaluate the performance with various element
sizes and a fixed stride (16). As shown in Fig. 16, the perfor-
mance gap between the optimal address mapping scheme and
the default one decreases with an increased element size. This
is because when the element size is larger than 256B, more
than 4 vaults are used to store one data element. Consequently,
both default and optimal address mapping schemes have fully
utilized the parallelism provided by the HMC module.
Fig. 17. The performance comparison between the default and optimal address mapping schemes under different LSH entry sizes.
3) LSH: Nearest neighbor search is the key kernel for many
applications (e.g., image search), and one of the most widely
adopted methods is Locality Sensitive Hashing (LSH) [23].
LSH hashes the entire dataset using a set of hash functions,
and each function is designed to statistically hash similar data
into the same bucket. To serve a query, the query is first
hashed using the same hash function set, and the matching
buckets are traversed to find the optimal data entry (according
to the calculated distance between the query and data entry).
Therefore, the data access pattern of LSH is a combination
of random access (reading the matching buckets is random) and
sequential access (reading data within one bucket is sequential).
In this experiment, we run the LSH applications with four
threads (one on each core). We evaluate the performance of
different address mapping schemes under different data entry
sizes. As shown in Fig. 17, the performance gap between the
optimal and default address mapping schemes increases with
the data entry size. This is because the default address mapping
scheme can lead to vault conflicts (different threads reading
the same vault) with large data entries. Here, we take the
case where the data entry size is 1kB to illustrate this vault
conflict. Specifically, one data entry is stored across all
16 vaults when using the default address mapping scheme.
Therefore, reading one data entry needs to access all vaults.
This means that the memory requests issued by different
threads/cores can only be served sequentially, leading to
frequent pipeline stalls (low performance). On the contrary,
one data entry is stored in 4 vaults under the optimal address
mapping scheme. Consequently, the HMC module can serve
up to four memory requests in parallel, leading to increased
performance.
Fig. 18. The performance comparison for the BFS benchmark between the default and optimal address mapping schemes under different numbers of vertices.
4) Graph Traversal: Graph traversal is one of the most important
kernels in many graph problems, including maximum
flow [24], shortest path [25], and graph search [26]. It has
a random access pattern, as the nodes traversed in one round
are determined by the nodes traversed in the previous round.
In this experiment, we use the level-synchronized BFS design
proposed by Zhang et al. [27] to evaluate the performance of
different address mapping schemes under different list sizes.
Similar to the case of LSH, different vaults are used to store
one list under different address mapping schemes, and the
number of memory requests that can be served by HMC in
parallel also varies under different schemes. As shown in
Fig. 18, the optimal address
mapping scheme can improve the performance by more than
10% compared with the default one.
VI. CONCLUSION
This work introduces MEG, the first RISC-V based full system
simulation infrastructure using FPGAs and the emerging HMC. We
present a prototype implementation of MEG and demonstrate the
capability, fidelity, and flexibility of the design on real-world
benchmark applications. The techniques we propose can be applied
more generally to other memory types, such as HBM and non-volatile
memories. Based on our preliminary evaluation on a relatively
simple case, we believe and hope that MEG, with its flexibility and
fidelity, can enable further comprehensive studies and research
ideas across all levels of the computing stack for memory system
operations, from microarchitecture to programming models, runtime
management, operating systems, and applications. As this is the
very first prototype implementation, there remains ample room to
improve the platform, e.g., its usability and an easier-to-use
software interface. As a contribution to the RISC-V community, we
plan to release MEG publicly as an open-source research platform,
accessible to a wide range of software and hardware developers, to
advance computer systems research.
ACKNOWLEDGEMENT
This work was supported in part by Semiconductor Re-
search Corporation (SRC).
REFERENCES
[1] J. Kim, A. J. Hong, S. M. Kim, K.-S. Shin, E. B. Song, Y. Hwang, F. Xiu, K. Galatsis, C. O. Chui, R. N. Candler et al., "A stacked memory device on logic 3d technology for ultra-high-density data storage," Nanotechnology, vol. 22, no. 25, p. 254006, 2011.
[2] J. T. Pawlowski, "Hybrid memory cube (hmc)," in 2011 IEEE Hot Chips 23 Symposium (HCS). IEEE, 2011, pp. 1–24.
[3] JEDEC Standard, "High bandwidth memory (hbm) dram," JESD235, 2013.
[4] Q. Guo, N. Alachiotis, B. Akin, F. Sadi, G. Xu, T. M. Low, L. Pileggi, J. C. Hoe, and F. Franchetti, "3d-stacked memory-side acceleration: Accelerator and system design," in Workshop on Near-Data Processing (WoNDP), held in conjunction with MICRO-47, 2014.
[5] M. M. K. Martin, D. J. Sorin, B. M. Beckmann, M. R. Marty, M. Xu, A. R. Alameldeen, K. E. Moore, M. D. Hill, and D. A. Wood, "Multifacet's general execution-driven multiprocessor simulator (gems) toolset," SIGARCH Comput. Archit. News, vol. 33, no. 4, pp. 92–99, Nov. 2005. [Online]. Available: http://doi.acm.org/10.1145/1105734.1105747
[6] N. L. Binkert, R. G. Dreslinski, L. R. Hsu, K. T. Lim, A. G. Saidi, and S. K. Reinhardt, "The m5 simulator: Modeling networked systems," IEEE Micro, vol. 26, no. 4, pp. 52–60, 2006.
[7] N. Binkert, B. Beckmann, G. Black, S. K. Reinhardt, A. Saidi, A. Basu, J. Hestness, D. R. Hower, T. Krishna, S. Sardashti et al., "The gem5 simulator," ACM SIGARCH Computer Architecture News, vol. 39, no. 2, pp. 1–7, 2011.
[8] F. Ryckbosch, S. Polfliet, and L. Eeckhout, "Fast, accurate, and validated full-system software simulation of x86 hardware," IEEE Micro, vol. 30, no. 6, pp. 46–56, 2010.
[9] D. Wang, B. Ganesh, N. Tuaycharoen, K. Baynes, A. Jaleel, and B. Jacob, "Dramsim: a memory system simulator," ACM SIGARCH Computer Architecture News, vol. 33, no. 4, pp. 100–107, 2005.
[10] S. Karandikar, H. Mao, D. Kim, D. Biancolin, A. Amid, D. Lee, N. Pemberton, E. Amaro, C. Schmidt, A. Chopra et al., "Firesim: Fpga-accelerated cycle-exact scale-out system simulation in the public cloud," in Proceedings of the 45th Annual International Symposium on Computer Architecture. IEEE Press, 2018, pp. 29–42.
[11] H. Angepat, D. Chiou, E. S. Chung, and J. C. Hoe, "Fpga-accelerated simulation of computer systems," Synthesis Lectures on Computer Architecture, vol. 9, no. 2, pp. 1–80, 2014.
[12] J. D. Leidel and Y. Chen, "Hmc-sim-2.0: A simulation platform for exploring custom memory cube operations," in Parallel and Distributed Processing Symposium Workshops, 2016 IEEE International. IEEE, 2016, pp. 621–630.
[13] R. Avizienis, J. Bachrach, S. Beamer, D. Biancolin, C. Celio, H. Cook, D. Dabbelt, J. Hauser, A. Izraelevitz et al., "The rocket chip generator," 2016.
[14] ARM, Arm Cortex-M1 DesignStart FPGA-Xilinx edition.
[15] Xilinx, MicroBlaze Micro Controller System v3.0.
[16] Intel, Nios II Processor Reference Guide.
[17] RISC-V Foundation, The RISC-V Instruction Set Manual.
[18] A. Severance and G. G. Lemieux, "Embedded supercomputing in fpgas with the vectorblox mxp matrix processor," in Proceedings of the Ninth IEEE/ACM/IFIP International Conference on Hardware/Software Codesign and System Synthesis. IEEE Press, 2013, p. 6.
[19] SpinalHDL, VexRiscv CPU.
[20] E. Matthews, Z. Aguila, and L. Shannon, "Evaluating the performance efficiency of a soft-processor, variable-length, parallel-execution-unit architecture for fpgas using the risc-v isa," in 2018 IEEE 26th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM). IEEE, 2018, pp. 1–8.
[21] J. Bachrach, H. Vo, B. Richards, Y. Lee, A. Waterman, R. Avizienis, J. Wawrzynek, and K. Asanovic, "Chisel: constructing hardware in a scala embedded language," in Design Automation Conference (DAC), 2012 49th ACM/EDAC/IEEE. IEEE, 2012, pp. 1212–1221.
[22] P. Luszczek, J. Dongarra, and J. Kepner, "Design and implementation of the hpc challenge benchmark suite," CT Watch Quarterly, vol. 2, no. 4A, pp. 18–23, 2006.
[23] A. Andoni and P. Indyk, "Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions," in Foundations of Computer Science, 2006. FOCS'06. 47th Annual IEEE Symposium on. IEEE, 2006, pp. 459–468.
[24] A. V. Goldberg and R. E. Tarjan, "A new approach to the maximum-flow problem," Journal of the ACM (JACM), vol. 35, no. 4, pp. 921–940, 1988.
[25] P. Harish and P. Narayanan, "Accelerating large graph algorithms on the gpu using cuda," in International Conference on High-Performance Computing. Springer, 2007, pp. 197–208.
[26] I. Mani and E. Bloedorn, "Multi-document summarization by graph search and matching," arXiv preprint cmp-lg/9712004, 1997.
[27] J. Zhang, S. Khoram, and J. Li, "Boosting the performance of fpga-based graph processor using hybrid memory cube: A case for breadth first search," in Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays. ACM, 2017, pp. 207–216.