
JADE Heterogeneous System Simulation Platform

Version 4.0

OPTICS Lab

Big Data System Lab

Department of Electronic and Computer Engineering

Hong Kong University of Science and Technology

https://www.ece.ust.hk/~eexu

July 2018


CONTENTS

1 Motivation

2 Introduction
   2-A New features in JADE Version 4.0

3 JADE organization
   3-A Root directory
   3-B Workspace

4 Installation guideline
   4-A Requirements
   4-B Quick compilation
   4-C Advanced compilation

5 Recommended workflow

6 Running simulation
   6-A Quick guideline

7 Application model
   7-A Applications and processors
   7-B Mapping and Scheduling
   7-C Building COSMIC tools
   7-D Using COSMIC mapping
   7-E Additional COSMIC tools

8 Architectures
   8-A Architecture file
   8-B Network file
   8-C Custom network topologies
   8-D Memory hierarchy
   8-E Components configuration file
   8-F Power models

9 Output items

10 Agreement and License

11 Revision history

References

12 Appendix
   12-A Basic command line options
   12-B Advanced command line options


1. MOTIVATION

Recent advances in the computing industry towards multiprocessor technologies have shifted the dominant method of performance increase from frequency scaling to parallelism. Because the design space is huge, evaluating candidate multicore processors and tens or hundreds of such processor architectures in early design stages, when the number of variables is at its maximum, is challenging. Simulation plays an important role in estimating architecture performance, but evaluating how the system performs on average, as well as in boundary cases, requires many iterations to cover the various cases in the application input domain. Since simulation of heterogeneous systems with enough detail is naturally slow, exhaustively evaluating the system for all possible inputs requires a tremendous amount of time and resources. Although quite a few multiprocessor simulators are available, they often rely on individual input specification, which demands extensive input enumeration and many simulation runs and diminishes their effectiveness for evaluating complex systems. To fill this gap, we publicly release a heterogeneous multiprocessor system simulation platform called JADE, targeting fast initial architecture exploration. Unlike most simulators, JADE uses statistical models that follow distributions extracted from internal structures of the application, providing a more convenient and systematic approach to evaluating system performance. JADE simulation features include detailed electrical and optical interconnections for both single-chip and multi-chip systems, a detailed memory hierarchy infrastructure, and built-in energy analysis, allowing studies of a broad spectrum of systems.

2. INTRODUCTION

JADE is a cycle-accurate, event-driven simulator implemented in C++, and most of its configurations are templates in text form. Figure 1 shows an overview of JADE features. It receives as input a detailed description of the hardware architecture, such as network topology, memory hierarchy, cache coherence protocol and specific processor parameters. The application model and its mapping to the system are available from the heterogeneous COSMIC benchmark. JADE integrates power model libraries to provide holistic power analysis for various architecture configurations and technology nodes. The latest JADE release also includes power management functions: Dynamic Voltage and Frequency Scaling (DVFS), Energy Consumption Calculation and Power Delivery Systems (PDS). Along with memory access traces and holistic performance and power analysis, it is also possible to extract customized system behavior to monitor specific components. The following sections detail selected features implemented in JADE.

By performing cycle-level simulations, JADE can model the detailed behavior of each module and comprehensively profile the characteristics of the system components. A set of templates is already provided with different NoC and common communication topologies, such as Crossbar, Ring, Torus, Mesh, Folded-Torus, and Fat-tree, with various counts of cores, caches and memories. These templates can be easily modified or extended to accommodate a wide range of simulation schemes, allowing high flexibility in component placement and the creation of new regular and irregular topologies.

JADE uses the COSMIC Heterogeneous Multiprocessor Benchmark and the Application Performance at Extreme Scale (APEX) benchmark [1] as application models. Applications are modeled as a Task Communication Graph that includes execution time and memory accesses profiled on three different instruction set architectures: ARM-v8, RISC-V and x86-64. Each application is profiled in two modes: a statistical application model, where a proxy application is used to statistically mimic the behavior of the original application, and a recorded application model, in which real traces and execution times are used.

[Figure 1: block diagram of JADE. Inputs: Hardware Architecture (Processor Architecture, Memory Hierarchy, Coherence Protocol, Network Architecture: Optical, Electrical), COSMIC Benchmark (Statistical Application Model, Recorded Application Model, Mapping), and Power Architecture Template. JADE: Processor Library, Memory and Cache Coherence Library, Optical and Electrical Network Library; Processor subsystem, Network subsystem, Memory subsystem (RUBY). Output: Holistic Power Analysis, Holistic Performance Analysis, System Behavior, Memory Access Trace.]

Fig. 1: Overview of JADE.

A. New features in JADE Version 4.0

Version 4.0 of JADE includes important changes in the applications, CPU models, cache hierarchies and usability. This section briefly discusses the major changes.

The newest version of JADE includes applications profiled for more recent instruction sets. The newly available instruction sets are:

• aarch64: a 64-bit version of ARM.
• riscv: the open source RISC instruction set.
• x64: (also known as x86-64) the 64-bit version of the x86 ISA.

Unlike previous releases of JADE, in version 4.0 the applications have been profiled using QEMU. Since our profiling technique is almost architecture independent, we do not need to profile the applications using a cycle-accurate simulator. Also, QEMU is able to emulate more architectures than most of the available cycle-accurate simulators.

The new applications have been profiled with a superior technique for locality modeling, the Hierarchical Reuse Distance. This new profiling allows a better representation of the behavior of individual tasks. Task dependencies are represented as in previous versions, using a Task Communication Graph (TCG).

We redesigned our CPU model to better accommodate the new application model. The new source code is now simpler to use and to modify. The new CPU model has a constant IPC (default 1-IPC), but nothing prevents users from modifying it. Furthermore, a subtle change is that the base operating frequency of the system is now 2 GHz. You can tune it by setting the --base-period command line option.

There is a new cache coherence protocol for a 3-level cache hierarchy. In previous releases of JADE, the 3-level cache templates were exclusive between L1 and L2, and inclusive between L2 and L3. In this new release, we provide a three-level cache hierarchy with all levels inclusive. All cache hierarchies are provided inside the directory protocols.


The usability of the JADE simulator has been improved. All the parameters, including architecture parameters, can be easily tuned at the command line interface. To learn about the new usage, please see section 6.

3. JADE ORGANIZATION

This section briefly describes the directory organization of JADE.

A. Root directory

Table I briefly explains the JADE directory organization. In this manual, it is assumed that the JADE root folder is located at /home/JADE-v4.0 and that all commands are executed from this directory. The directories relevant to users are classified as User space; they hold all files required to run simulations with different topologies, applications, and components. Additional directories include the JADE source code and a set of tools for task manipulation and file generation.

TABLE I: JADE-v3.0 root directory organization and description

Class         Directory Name   Description
User space    applications     Contains all files related to applications, such as application TCG files, recorded traces, and task mapping files.
User space    architectures    Holds hardware models, for instance architecture description, network topology, and components configurations.
User space    workspace        Keeps all binary files and results generated from the simulation.
User space    documents        JADE user instructions.
Source code   common           Global headers and main definitions.
Source code   network          Network related files.
Source code   processor        Processing elements source code.
Source code   ruby             Cache subsystem model.
Source code   slicc            Code generation.
Source code   protocols        Coherence protocol files.
Source code   support          Power management files.
Source code   dramsim2         Detailed memory activity.
Tools         scripts          Helper scripts for compilation.
Tools         utils            Utility tools.

B. Workspace

The user workspace is composed of multiple directories, one for each JADE compilation, depending on the components configuration and the coherence protocol; see section 5 for further details. For each coherence protocol and configuration file used, a new workspace directory is created under the directory workspace, in which the final binary file jade.exec will reside. The workspace is named after the configuration file and the coherence protocol, using the following syntax:

<config>_<coherence_protocol>

Table II gives more detail on the purpose of each directory under the workspace.


TABLE II: Workspace directory description.

Class                Directory Name   Description
User Environment     bin              Keeps the compiled JADE executable file.
User Environment     result           Default destination for statistics results and system behavior output.
User Environment     html             Documentation generated by Ruby.
Compilation Output   src              Source code generated with respect to the protocol specified.
Compilation Output   obj              Object files generated by the compilation.

4. INSTALLATION GUIDELINE

A. Requirements

To compile JADE successfully, a few software packages and libraries must be present in your system. Most recent Linux distributions meet these requirements; we provide a reference here in case you encounter any problem. Table III shows a summarized list of JADE dependencies divided into two classes: mandatory and optional. Make sure all of the listed external software is reachable through your PATH environment variable.

TABLE III: External dependencies

Class       Software   Version   Purpose                  Quick reference
Mandatory   GCC        4.8.5     Compilation              https://gcc.gnu.org/
Mandatory   Python     2.7+      Code generation helper   https://www.python.org/
Mandatory   GZIP       1.3.5     Decompressing traces     http://www.gzip.org/
Mandatory   BISON      2.3       Parser                   https://www.gnu.org/software/bison/
Mandatory   FLEX       2.5.4     Lexical analyzer         http://flex.sourceforge.net/
Optional    Doxygen    1.8.10    Code documentation       http://www.stack.nl/~dimitri/doxygen/
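As a quick sanity check before compiling, a small shell function can report which of the tools in Table III are reachable through PATH. This is only a sketch: the function name check_tools is hypothetical, the command names (gcc, python, gzip, bison, flex) are assumed to match your installation (Python 2.7 may be installed as python2 on some systems), and versions are not verified.

```shell
# Report which of the given commands are reachable through PATH.
# Returns 0 if all of them are found, 1 if at least one is missing.
# (check_tools is a hypothetical helper, not part of JADE.)
check_tools() {
    missing=0
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "found:   $tool"
        else
            echo "missing: $tool"
            missing=1
        fi
    done
    return "$missing"
}
```

A typical call would be check_tools gcc python gzip bison flex; any line starting with "missing:" points to a dependency that still has to be installed or added to PATH.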

B. Quick compilation

All of the steps explained here assume that the JADE root folder is located at /home/JADE-v4.0.

The easiest way to compile JADE is to enter the JADE root directory and type make. This compiles JADE using the default coherence protocol (MSI_MOSI_CMP_directory) and the default components configuration (generic); the binary will be located at workspace/generic_msi_mosi_cmp_directory/bin/jade.exec.

$ cd /home/JADE-v4.0
$ make

When the compilation is over, a message similar to the one shown below should appear in the terminal:

$ JADE-v4.0 Location: workspace/generic_msi_mosi_cmp_directory/bin/jade.exec
$ JADE-v4.0 Generated Successfully!!

From this point, JADE is ready to be used, and one may jump to section 6 for more details about how to simulate applications.


C. Advanced compilation

If a different components configuration is required, type make with the flag SYSTEM_SETTINGS followed by the corresponding configuration name; refer to Table IV for the available sample configurations. The detailed settings behind each SYSTEM_SETTINGS value can also be changed in the configuration files described in subsection 8-E. The coherence protocol is selected with the flag PROTOCOL.

Example 4.1. JADE with generic components and MSI_MOSI_CMP_directory:

$ cd /home/JADE-v4.0
$ make SYSTEM_SETTINGS=generic PROTOCOL=MSI_MOSI_CMP_directory

For each pair <configuration, protocol> a new folder is created inside the workspace directory and named <configuration>_<protocol>, following the component configuration file name and the cache coherence protocol used. In example 4.1, the destination folder is /home/JADE-v4.0/workspace/generic_msi_mosi_cmp_directory.
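The naming rule can be sketched as a pair of shell helpers. The lowercase conversion is inferred from example 4.1, where PROTOCOL=MSI_MOSI_CMP_directory yields the folder generic_msi_mosi_cmp_directory, and the bin/jade.exec location is taken from the message printed after compilation. The helper names workspace_name and jade_binary are hypothetical and are not part of JADE.

```shell
# Derive the workspace folder name from the two make flags.
# $1 = SYSTEM_SETTINGS value, $2 = PROTOCOL value.
# Assumption: the folder name is the lowercased "<configuration>_<protocol>".
workspace_name() {
    printf '%s_%s\n' "$1" "$2" | tr '[:upper:]' '[:lower:]'
}

# Path of the executable relative to the JADE root directory.
jade_binary() {
    printf 'workspace/%s/bin/jade.exec\n' "$(workspace_name "$1" "$2")"
}
```

For instance, jade_binary generic MSI_MOSI_CMP_directory prints workspace/generic_msi_mosi_cmp_directory/bin/jade.exec, which matches the location reported by the default build.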

5. RECOMMENDED WORKFLOW

JADE allows customization of the components configuration, such as cache size, memory latency, cache latency, and many others. The default configuration is a generic configuration derived from the classical 5-stage MIPS pipeline. A few other configurations are provided under the directory ./architecture/configs. Table IV summarizes these configurations, with a brief comment on each and the corresponding .config file. Some important parameters of the configuration file are described in section 8-E. To use a different components configuration, users should compile JADE again using the flag SYSTEM_SETTINGS; refer to section 4 for usage.

TABLE IV: Summary of memory configuration of components configuration files provided (Two-level cache).

SYSTEM_SETTINGS   Parameter                  Value        Configuration file
generic*          Main Memory (each)         8 GBytes     generic.config
generic*          L2 Cache Size (per core)   512 KBytes   generic.config
generic*          L1 Cache Size (per core)   32 KBytes    generic.config
generic*          Cache Line Size            64 bytes     generic.config
riscv             Main Memory                8 GBytes     riscv.config
riscv             L2 Cache Size (per core)   2 MBytes     riscv.config
riscv             L1 Cache Size (per core)   64 KBytes    riscv.config
riscv             Cache Line Size            64 bytes     riscv.config
aarch64           Main Memory                8 GBytes     aarch64.config
aarch64           L2 Cache Size (per core)   512 KBytes   aarch64.config
aarch64           L1 Cache Size (per core)   32 KBytes    aarch64.config
aarch64           Cache Line Size            64 bytes     aarch64.config
x64               Main Memory                8 GBytes     x64.config
x64               L2 Cache Size (per core)   512 KBytes   x64.config
x64               L1 Cache Size (per core)   64 KBytes    x64.config
x64               Cache Line Size            64 bytes     x64.config

* Default Makefile configuration.


For each coherence protocol and configuration file used, a new workspace directory is created under the directory workspace, in which the final binary file jade.exec will reside. The workspace is named after the configuration file and the cache coherence protocol, using the following syntax:

<config>_<coherence_protocol>

Example 5.1. For the generic components configuration using the MSI_MOSI_CMP_directory coherence protocol, the workspace is named generic_msi_mosi_cmp_directory.

Figure 2 shows the recommended workflow of JADE. The first stage is to choose the coherence protocol and components configuration most appropriate for the simulation. Then compile JADE using make with the chosen coherence protocol and components configuration file, and execute JADE passing the appropriate command line arguments for the architecture, network topology and application. After analyzing the results, if any change to the protocol or the components configuration is needed, JADE must be recompiled. Simulating with a different architecture, network topology or application does not require recompiling the code, and users can freely select different inputs.

[Figure 2: workflow diagram with four stages: Before Compilation, Build, Execution, and Analysis. Before Compilation: Coherence Protocol and Components Configuration (JADE/architectures/config). Build: JADE Compilation (JADE/workspace). Execution: Application (-i, JADE/applications/app), Mapping (-g, JADE/applications/map), Architecture (-d, JADE/architectures/arc), Network Topology (-w, JADE/architectures/net), Power Library (-L, JADE/architectures/pwr), and output (-O). Analysis: Power & Performance and System Behavior.]

Fig. 2: Recommended outlined workflow.

6. RUNNING SIMULATION

This section explains how to simulate your desired system and applications using JADE. It is divided into two parts: the first explains the simplest way to get JADE running for the first time, and later a more detailed description of each part is provided.

A. Quick guideline

The simplest command line is shown below, where ARC, NETWORK, APP and MAP are the file locations of the architecture, network topology, statistical application model and mapping, respectively. Note that if a different configuration file is used, the workspace directory changes as explained in subsection 4-C.

$ ./workspace/generic_msi_mosi_cmp_directory/bin/jade.exec -d ARC -w NETWORK -i APP -g MAP

The architecture file is the first thing users have to decide on. It contains basic information about the architecture under evaluation, such as how many processors, memories and caches are present. Quite a few architecture files are provided in /home/JADE-v4.0/architectures/arc. For a given architecture, the placement and interconnection of the components are flexible and are determined by the network file, so for a given architecture file users should choose the desired network topology. We provide several topology templates for each of the architecture files, located at /home/JADE-v4.0/architectures/net. Users can also create their own network files or modify the existing ones to simulate more specific systems.

The second step is to decide which application should be used for the simulation. We provide two classes of applications, statistical application models (extension .sam) and recorded application models (extension .ram), all located in /home/JADE-v4.0/applications/app; refer to table VI for a brief description of the applications. Users should select the application file in accordance with the components configuration selected during compilation.

The next step is to map and schedule the application onto the underlying hardware architecture. We provide a few mappings for each application, located at /home/JADE-v4.0/applications/map. Refer to section 7-B to understand how to get the right mapping file for the corresponding application and topology. The flags required for each of the input files are described in table V. The following examples show how to simulate some applications on a given hardware architecture.

Example 6.1. Simulating FFT-1024-aarch64 on an 8x8 mesh topology, assuming an ARM build has already been set up:

./workspace/aarch64_msi_mosi_cmp_directory/bin/jade.exec -d ./architectures/arc/mesh_8x8.arc \
    -w ./architectures/net/mesh_8x8.txt -i ./applications/app/FFT-1024-aarch64.sam \
    -g ./applications/map/FFT-1024-aarch64-mesh_8x8.map

Example 6.2. Simulating LDPC-270x32-aarch64 on a 64-core fat-tree topology:

./workspace/aarch64_msi_mosi_cmp_directory/bin/jade.exec -d ./architectures/arc/fattree_64.arc \
    -w ./architectures/net/fattree_64_2mem.txt -i ./applications/app/LDPC-270x32-aarch64.sam \
    -g ./applications/map/LDPC-270x32-aarch64-fattree_64.map

Using the approaches presented in example 6.1 and example 6.2, the results are stored at /home/JADE-v4.0/workspace/aarch64_msi_mosi_cmp_directory/result/results.txt, overwriting any previously obtained result. To store the results in a different location, use the -O argument when launching JADE. Please refer to section 12-A for the full command line options.
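Because the default results file is overwritten on every run, it can help to give each run its own output file through the -O flag listed in table V. The sketch below only prints the command line (a dry run) so that it can be inspected first. The helper name jade_cmd and the output file name are hypothetical; the sketch also assumes that -O accepts a path and that none of the arguments contain spaces.

```shell
# Print (do not execute) a JADE command line with an explicit output file.
# $1 = workspace folder name   $2 = architecture file (.arc)
# $3 = network file            $4 = application model (.sam)
# $5 = mapping file (.map)     $6 = output file passed to -O
# (jade_cmd is a hypothetical helper, not part of JADE.)
jade_cmd() {
    printf './workspace/%s/bin/jade.exec -d %s -w %s -i %s -g %s -O %s\n' \
        "$1" "$2" "$3" "$4" "$5" "$6"
}
```

Once the printed line looks right, it can be pasted into the terminal from the JADE root directory.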

A brief description of the most commonly used command line arguments is shown in table V. The detailed usage is shown in appendix 12-A.

To simulate recorded application models, the user must first decompress the corresponding data and instruction traces located in /home/JADE-v4.0/applications/models using gzip, and then pass as arguments the application model (.ram file), the data trace (flag -D) and the instruction trace (flag -I). The example below shows how to extract the instruction trace, where APP is the desired application name.

$ gzip -d {APP}.itrc.gz
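Both traces can be extracted in one step with a small loop. This is a sketch under two assumptions: the data trace is compressed as {APP}.dtrc.gz, by analogy with the instruction trace (the .dtrc extension comes from section 7-A), and the function name decompress_traces is hypothetical. Note that gzip -d replaces each .gz file with its decompressed version.

```shell
# Decompress the instruction and data traces of one application.
# $1 = directory holding the compressed traces, $2 = application name.
# (decompress_traces is a hypothetical helper, not part of JADE.)
decompress_traces() {
    for ext in itrc dtrc; do
        archive="$1/$2.$ext.gz"
        # Skip traces that are absent or already decompressed.
        if [ -f "$archive" ]; then
            gzip -d "$archive"
        fi
    done
}
```

After extraction, the resulting .dtrc and .itrc files are the ones to pass to the -D and -I flags, respectively.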

7. APPLICATION MODEL

JADE uses the heterogeneous multiprocessor application benchmarks provided by COSMIC, where applications are partitioned into multiple tasks and modeled as a Task Communication Graph (TCG), representing task dependencies and the amount of communication between them. The APEX benchmarks are obtained in a similar fashion. For simplicity, the following descriptions do not distinguish COSMIC and APEX benchmarks unless otherwise stated. Additionally, statistical distributions are obtained by profiling individual tasks for a target processor, using considerable dataset inputs. The task dependencies and the statistical distributions are used to randomly generate task execution times and memory accesses, as an approximation of the overall joint behavior of application and processor. Due to its randomness, the statistical application model (SAM) is useful to replicate many possible behavior outcomes that could eventually happen. Unlike conventional simulation methods, which rely on individual input specification and require enumeration of a large number of inputs, SAM is a more convenient and systematic method to iteratively exercise the system and evaluate overall average system performance, as well as corner case conditions.

TABLE V: Simulation command line argument description

Class      Brief                    Flag   Required   Description
General    Help                     -h     No         Print help information.
General    Advanced help            -H     No         Print the advanced command line options to tune architecture parameters.
General    Random Seed              -r     No         Set the random seed. If not specified, the value from the configuration is used. Use -r random to use the current time as seed.
General    Iterations               -N     No         Set the number of iterations to be performed by repeatedly executing the statistical application (SAM). If running recorded application models (RAM), only one iteration is allowed. Default is 1.
Hardware   Network topology         -w     Yes        Full path to the file describing the topology of the system.
Hardware   Processor frequency      -F     No         Set the operating frequency in GHz. Default is 2 GHz.
Hardware   Inter-chip bandwidth     -B     No         Inter-chip interconnect bandwidth multiplier, 100 Gbps. Default is 1.
Hardware   Architecture details     -d     Yes        System architecture defined in arc files.
Hardware   Power model lib          -L     No         Specify the power model library filename. The default power library is located at architectures/pwr/pwr_lib.
Hardware   Technology Node          -A     No         Technology node (nm) used for the power report. Default is 7 nm FINFET. For example, for 22 nm use -A 22.
Software   Application model        -i     Yes        Statistical or recorded application TCG model, .sam or .ram files respectively.
Software   Mapping and Scheduling   -g     Yes        Application mapping and scheduling defined in .map files.
Software   Data trace               -D     No         Data memory traces, only required when using a recorded application model. Full path of the data trace file.
Software   Inst. Trace              -I     No         Instruction memory traces, only required when using a recorded application model. Full path of the instruction trace file.
Output     Output                   -O     No         Specify the desired output file name. If not specified, the result is created in result/results.txt under the respective workspace.

Applications are profiled using three different instruction sets and processor configurations: ARM-v8, RISC-V and x86-64. Users should select the application model that matches the components configuration selected at compilation time. Apart from the statistical models, JADE also includes the COSMIC recorded models, composed of the original traces of memory addresses. While the recorded application model is suited to deterministically replicating the application behavior for one input case, the statistical approach is useful to iteratively exercise the system, approximating the original application in the average case. Table VI shows the COSMIC application models currently available in JADE-v4.0; more applications are being developed and will soon be available.

The COSMIC benchmark is also provided as a stand-alone suite. Please visit https://www.ece.ust.hk/~eexu for a more detailed description of COSMIC.


TABLE VI: List of applications provided

Class              Benchmark       Description
Kernels            FFT             Fast Fourier Transform with 1024 complex number inputs.
Kernels            MD              Computer simulation of physical movements of atoms.
Kernels            LDPC            Low-Density Parity-Check code encoder with 270 input words.
Kernels            TURBO           Turbo code decoder with 40-bit data block and 1/3 coding rate.
Kernels            RSe             Reed-Solomon code encoder.
Kernels            RSd             Reed-Solomon code decoder.
Kernels            US              Ultrasound, medical diagnostic algorithm using 2D/3D ultrasound imaging.
Kernels            RT              Ray-tracing, 3D scene rendering algorithm.
Machine learning   ML-FMP          Machine Learning for Financial Market Prediction (ML-FMP) using fully connected neural networks.
Machine learning   ML-ALIP         Machine Learning for Automatic Linguistic Indexing of Pictures (ML-ALIP) using fully connected neural networks.
Machine learning   Cifar-train     Training phase of a Deep Convolutional Neural Network for the Cifar-10 dataset.
Machine learning   Facerec-train   Training phase of a Deep Convolutional Neural Network for face recognition.
Machine learning   AlexNet-inf     Inference phase of the DCNN with 5 convolutional layers.
Apex               HPCG            Conjugate gradient algorithm. Generates a linear system of a three-dimensional heat diffusion problem.
Apex               HPCG-prec       HPCG including computation of the preconditioned conjugate gradient.
Apex               Pennant         Pennant mini-app. Lagrangian staggered-grid hydrodynamics algorithm on a 2-D unstructured finite-volume mesh.
Apex               Snap            Performance modelling of a modern discrete ordinates neutral particle transport application.
Apex               Stream          Synthetic benchmark measuring the memory bandwidth and a corresponding computation rate for four simple vector kernels.

A. Applications and processors

Table VI summarizes each application provided in the current release of JADE. The application files are located under /home/JADE-v4.0/applications/app, and each application file name uses the following syntax:

<application>-<configuration>-<arch>.sam
<application>-<configuration>-<arch>.ram

Statistical application models have the extension .sam, while recorded application models have the extension .ram. The tag <application> is the shortened application name, for instance FFT, RSe, and ML-FMP for Fast Fourier Transform, Reed-Solomon Encoder and Machine Learning for Financial Market Prediction, respectively. Refer to Table VI for the shortened name of each application. The tag <configuration> is the respective application configuration mode, and <arch> is the architecture used for profiling: aarch64, x64 or riscv.

Recorded application models require two additional files, the traces for instruction and data memory accesses. They follow the same syntax as the application file, but with the extensions itrc and dtrc for the instruction trace and the data trace, respectively. The following examples show the file names related to some applications (statistical and recorded). Note that for recorded application models three files should be specified: the .ram file, containing the basic TCG structure of the application, and the instruction and data memory traces, as shown in example 7.2.

Example 7.1. Statistical Application Model for Ray Tracing with 20x20 image resolution for ARM-v8:

TCG application file: RT-20x20-aarch64.sam


Example 7.2. Recorded application model for Reed-Solomon Encoder for x86-64:

TCG application file: RSe-32x28x8-x64.ram

Data Trace: RSe-32x28x8-x64.dtrc

Instruction Trace: RSe-32x28x8-x64.itrc
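The naming syntax can be taken apart with plain shell parameter expansion. Since application names such as ML-FMP contain hyphens themselves, the sketch below splits from the right; it assumes that neither the <configuration> nor the <arch> tag contains a hyphen, which holds for the file names shown in this manual. The function name parse_app_name is hypothetical.

```shell
# Split "<application>-<configuration>-<arch>.<ext>" into its four parts.
# Prints: application configuration arch extension
# (parse_app_name is a hypothetical helper, not part of JADE.)
parse_app_name() {
    base=${1##*/}       # drop any leading directory
    ext=${base##*.}     # sam, ram, itrc or dtrc
    stem=${base%.*}     # file name without the extension
    arch=${stem##*-}    # text after the last hyphen
    rest=${stem%-*}     # everything before the last hyphen
    config=${rest##*-}  # text after the remaining last hyphen
    app=${rest%-*}      # what is left is the application name
    echo "$app $config $arch $ext"
}
```

For the file of example 7.1, parse_app_name RT-20x20-aarch64.sam prints "RT 20x20 aarch64 sam".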

B. Mapping and SchedulingIn order to run the application in the underlying hardware architecture, it is needed to map each task

of the application to a processing unit. Different mappings would lead to different performance results.COSMIC-v2.0 provides a generic optimization tool that maps and schedule the task for a given topologyaiming to minimize the communication traffic overhead and the total execution time.

We provided a few mappings for each individual application. However if a different mapping is requiredusers are encouraged to build the COSMIC tools and generate a mapping as needed. How to generatemappings are explained in the following topics. For each application and target topology, a differentmapping file is required, therefore when simulating JADE, it should be used the corresponding .mapfile for the corresponding pair <application, topology>, otherwise problems could occur duringruntime.

The provided mappings are located at /home/JADE-v4.0/applications/map. Each mapping file is named after the application and the topology it targets, using the following syntax:

<application-file-name>-<topology-file-name>.map

Here <application-file-name> follows the syntax described in section 7-A. A few examples are given below:

Example 7.3. FFT 1024 for ARM mapped to Mesh 4x4:

MAP file: FFT-1024-aarch64-mesh_4x4.map

Example 7.4. LDPC 270x32 for RISC-V mapped to Fattree 4:

MAP file: LDPC-270x32-riscv-fattree_4.map

For the full list of applications already mapped to a target architecture, see the folder /home/JADE-v4.0/applications/map. Each mapping file name follows the syntax explained above.
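The mapping file naming syntax can be expressed as a small Python helper. This is a sketch under assumptions: the function name map_file_name is hypothetical, and stripping the directory and extension from both inputs is our reading of "file name" in the syntax above, checked only against examples 7.3 and 7.4.

```python
from pathlib import PurePosixPath


def map_file_name(application_file, topology_file):
    """Build the mapping file name from an application and a topology file.

    Both arguments may include a directory and an extension; only the
    bare file names are kept (assumption, see the lead-in).
    """
    app = PurePosixPath(application_file).stem
    topology = PurePosixPath(topology_file).stem
    return app + "-" + topology + ".map"


print(map_file_name("FFT-1024-aarch64.sam", "mesh_4x4.arc"))
print(map_file_name("LDPC-270x32-riscv.sam", "fattree_4.arc"))
```

Deriving the name this way, rather than typing it by hand, reduces the risk of pairing an application with a mapping generated for another topology.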

C. Building COSMIC tools

To build the COSMIC tools, enter the tools directory /home/JADE-v4.0/utils/cosmicTools and type make, as shown in the commands below. The binary files are created under /home/JADE-v4.0/utils/cosmicTools/bin.

$ cd /home/JADE-v4.0/utils/cosmicTools
$ make

A message like the one below should appear in the terminal.

"Built all tools successfully."

D. Using COSMIC mapping

To map an application to a hardware architecture, use cosmic_mapping.exec. The command below shows the typical command line to generate a mapping file, in which APP is the path to the application model (.sam), ARC is the desired hardware architecture file (.arc), MAP is the output mapping file to be created and TIME_OUT is the time limit for the optimization, in seconds.

./utils/cosmicTools/bin/cosmic_mapping.exec -i APP -a ARC -m MAP -b /dev/null -t TIME_OUT

Note that the COSMIC tool generates not only the mapping file but also a buffer allocation file. Since the latter is not used by JADE-v4.0, the buffer output is discarded by passing -b /dev/null on the command line.

Example 7.5. To map Machine Learning for Image Classification (ML-ALIP-70x20x1) to a Torus 10x10, name the result NewMapping.map and limit the optimization to at most 2 hours (7200 seconds), type the following command in the terminal on a single line:

./utils/cosmicTools/bin/cosmic_mapping.exec -i ./applications/ML-ALIP-70x20x1.sam -a ./architectures/torus_10x10.arc -m ./applications/map/NewMapping.map -b /dev/null -t 7200
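When many mappings have to be generated, the command of example 7.5 can be assembled programmatically. The sketch below only builds the argument list; the function name cosmic_mapping_argv is hypothetical, and the flags (-i, -a, -m, -b, -t) and the tool path are copied from example 7.5 rather than from the tool's own documentation.

```python
import shlex


def cosmic_mapping_argv(app, arc, out_map, timeout_s,
                        tool="./utils/cosmicTools/bin/cosmic_mapping.exec"):
    """Return the argument list for one cosmic_mapping.exec run."""
    if timeout_s <= 0:
        raise ValueError("timeout must be a positive number of seconds")
    return [
        tool,
        "-i", app,          # application model (.sam)
        "-a", arc,          # hardware architecture file (.arc)
        "-m", out_map,      # mapping file to create
        "-b", "/dev/null",  # buffer allocation output, unused by JADE
        "-t", str(timeout_s),
    ]


argv = cosmic_mapping_argv(
    "./applications/ML-ALIP-70x20x1.sam",
    "./architectures/torus_10x10.arc",
    "./applications/map/NewMapping.map",
    7200,
)
print(shlex.join(argv))
```

The resulting list can be passed to subprocess.run(argv, check=True) from the JADE root directory once the COSMIC tools have been built.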

E. Additional COSMIC tools

Besides the mapping and scheduling tools, COSMIC includes a few other tools, summarized in table VII. For instance, cosmic_htcg.py (Hierarchical TCG) is a Python tool that helps create custom TCG applications by combining existing application models. Figure 3 shows an example of a hierarchical TCG, in which TCG1 is defined as the start application and TCG4 as the end application. To build a hierarchical application composed of existing application models, users should create a configuration script that imports the cosmic_htcg.py API and creates the top entity TCG. The top entity script should set the location of the application models to be used and the edges between these applications. A simple example is provided in /home/JADE-v4.0/utils/cosmicTools/htcg/example1.py. This example instantiates 4 FFT applications and connects them in series; see example 7.6 for how to execute it.

[Figure 3 shows four task graphs, TCG 1 to TCG 4, each drawn as a hierarchical task that contains its own tasks (A, B, C, ...), connected to one another by task dependencies.]

Fig. 3: Hierarchical TCG example.

Example 7.6. Generating 4 FFT applications connected in series (/home/JADE-v4.0/utils/cosmicTools/htcg/example1.py). The script generates the new TCG file (.sam), the buffer file (.buf) and the new mapping file (.map).


$ cp ./utils/cosmicTools/htcg/example1.py ./utils/cosmicTools/bin/example1.py
$ chmod +x ./utils/cosmicTools/bin/example1.py
$ cd ./utils/cosmicTools/bin
$ ./example1.py

COSMIC also incorporates an address synthesizer that generates memory addresses for its benchmark applications. By default the synthesizer emits address traces in the syntax defined by COSMIC, but it can also emit them in the syntax of a few trace-driven simulators, such as Dinero IV, DRAMsim2 and RUBY, when the proper command line flags are passed. Custom syntaxes can be added by modifying the synthesizer source code accordingly.

TABLE VII: Additional COSMIC tools.

Tool: cosmic_htcg.py
Syntax: Use the API provided. See example1.py for usage.
Comments: Creates a hierarchical TCG using existing TCG applications.

Tool: cosmic_synthesizer.exec
Syntax: synthesizer.exec samFile dataOut instOut [cosmicDataTrace] [cosmicInstTrace] [--dinero|--cosmic|--dramsim2|--gem5] [--unified]
Comments: Synthesizes address references for a given SAM. Several output syntaxes are available. Data addresses are written to the dataOut file and instruction addresses to instOut. If a unified (I/D) trace is desired, use the flag --unified and pass /dev/null as instOut.

8. ARCHITECTURES

A. Architecture file

The architecture file must be specified on the command line with the option -d. It describes how many processors, memories and L2 caches exist in the system, along with a few more parameters of the processing units. Several sample files are provided in /home/JADE-v4.0/architectures/arc, and the syntax is rather simple. Each architecture file has a corresponding network file that follows the network topology described.

B. Network file

The network file defines the connections among the different components of the system, such as L1, L2, L3, main memory and routers/switches. Note the cache naming convention. In a three-level cache architecture, the first level cache is named L0, the second level cache L1 and the last level cache L2; the L0 and L1 caches are private to each core, while the last level cache (L2) is shared by all cores. In a two-level cache hierarchy, the first level cache is named L1 and is private to each core, while the last level cache is named L2 and is shared among the cores.

We use directory-based cache coherence protocols. In the network file, each directory stands for a memory controller and its main memory. We provide various network files for different numbers of processors and network topologies, each named after its topology and number of memories. For instance, mesh_8x8_8mem.txt is an 8x8 mesh topology with 8 main memories. Detailed instructions on writing a network file can be found in [2].

JADE has full support for many-chip simulation. In JADE, each chip has its own private memory space. When simulating multi-chip systems, you can assign each chip a separate memory space and configure the relevant parameters. The number of chips is specified in the network file.


A short list of the network topologies already provided is shown in table VIII. For the full list of files, users are encouraged to look in the directory /home/JADE-v4.0/architectures/net. Each network file has a header that briefly describes its topology and its numbers of processors, memory controllers, L2 cache banks and LLC slices. In the multi-chip case, the total number of processors and the number of processors per chip must be specified accordingly in the header. The processing cores are evenly assigned to the chips based on the numbering of the L1 caches (each L1 cache stands for a core). Within each chip, the placement and the counts of L2/L3 cache banks and directories can be decided independently.

TABLE VIII: Examples of network files already provided.

Topology      File name                Number of    Number of   Number of memory
                                       processors   L2 banks    controllers
Crossbar      crossbar_1               1            1           1
              crossbar_4_1mem          4            4           1
Mesh          mesh_4x4                 16           16          16
              mesh_8x8_2L2Cache_2mem   64           2           2
Fattree       fattree_8_2mem           8            8           2
              fattree_16_2mem          16           16          2
Torus         torus_2x2_1mem           4            4           1
              torus_8x16_16mem         128          128         16
Folded-torus  folded-torus_4x4_16mem   16           16          16
              folded-torus_8x8_64mem   64           64          64
Ring          ring_16                  16           4           4
              ring_64                  64           16          4

C. Custom network topologies

Customized topologies with adjustable placement of components can be created in two ways. The first is to use the network generator script we provide, located at /home/JADE-v4.0/scripts/networkGenerator.py. The current implementation can generate Mesh, Torus, Folded-torus, Crossbar and Fattree topologies. The number of memory controllers can be selected, as well as their placement. The network generator syntax is detailed in table IX. Note that for crossbar and fattree a zero must be inserted right after the number of cores. For the other topologies, the number of cores is given by rows × columns.

TABLE IX: Network generator usage.

Topologies: Mesh, torus, folded-torus
Syntax: networkGenerator.py <topology> <rows> <columns> <number-of-memories> <list-of-memory-locations>

Topologies: Crossbar, fattree
Syntax: networkGenerator.py <topology> <cores> 0 <number-of-memories> <list-of-memory-locations>

The network generator script prints the generated network to stdout, so the standard output should be redirected to the desired destination file, as shown in examples 8.1 and 8.2.

Example 8.1. To generate a Mesh 4x4 with 1 memory located at node 0, the following command line should be used:

./scripts/networkGenerator.py mesh 4 4 1 0 > outputNetwork.txt

Example 8.2. To generate a Torus 2x2 with 2 memories located at nodes 0 and 3, the following command line should be used:

./scripts/networkGenerator.py torus 2 2 2 0 3 > outputNetwork.txt

(a) Example of network. [Diagram: two cores, each with L1 instruction and data caches, appear as components L1(0) and L1(1); together with L2(0) and the memory controller (MC) with Directory(0), they sit outside the network boundaries and attach through external links to routers R(0), R(1) and R(2), which are joined by internal links.]

# Initialize object for network
net = createNet(1, 2, 1, 1)

# External components placement
net += external('L1Cache', 0, 0, 1)
net += external('L1Cache', 1, 1, 1)
net += external('L2Cache', 0, 1, 1)
net += external('Directory', 0, 2, 1)

# Internal connection routers
net += link(0, 2, 1, 1)
net += link(0, 1, 1, 1)
net += link(1, 2, 1, 1)

(b) Example of code to generate network file.

Fig. 4: Example network generated using the Python API provided (networkApi.py).
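The argument order described in table IX and used in examples 8.1 and 8.2 can be captured in a small Python helper that builds the arguments following networkGenerator.py. This is a sketch under assumptions: the function name is hypothetical, and the lowercase spellings "folded-torus", "crossbar" and "fattree" are guesses based on the provided file names (only "mesh" and "torus" appear in the examples).

```python
GRID_TOPOLOGIES = {"mesh", "torus", "folded-torus"}   # sized by rows x columns
FLAT_TOPOLOGIES = {"crossbar", "fattree"}             # sized by number of cores


def network_generator_args(topology, size, memory_nodes):
    """Return the arguments that follow networkGenerator.py.

    size is a (rows, columns) pair for grid topologies and the number
    of cores for crossbar and fattree.
    """
    if topology in GRID_TOPOLOGIES:
        rows, columns = size
        dims = [rows, columns]
    elif topology in FLAT_TOPOLOGIES:
        dims = [size, 0]  # a literal zero follows the number of cores
    else:
        raise ValueError("unsupported topology: " + topology)
    return ([topology]
            + [str(d) for d in dims]
            + [str(len(memory_nodes))]
            + [str(node) for node in memory_nodes])


print(" ".join(network_generator_args("mesh", (4, 4), [0])))
print(" ".join(network_generator_args("torus", (2, 2), [0, 3])))
```

Because the script writes the network to stdout, a caller would still redirect the output to a file, for instance with the stdout argument of subprocess.run.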

To obtain more complex and irregular topologies, users can generate a regular topology with the script provided and then remove or insert specific elements or nodes. The second method, which eases the construction of very large topologies, is the Python API for generating network files, located at /home/JADE-v4.0/scripts/networkApi.py. A network object is created by calling createNet, passing the number of chips, the number of processors per chip, the number of L2 cache banks and the number of memory controllers. This object is printable, so users can write it to a specific file or to stdout using regular Python syntax.

Any element outside the network is treated as external. Processors, L2 caches and directories are therefore connected to the network with the external method, giving as arguments the component and its unique identifier (ID), the router node it is going to be connected to and the link latency. This simple model can incorporate off-chip components by setting the proper link latency. Each node in the network is implemented as a router. To connect routers to each other, use the link method, which takes the pair of routers to be connected, the latency between them and a weight used in routing decisions.

Figure 4 shows a simple example of how to use networkApi.py to generate a network topology. A system composed of two processors, one L2 cache and one memory controller (and its directory) is shown in 4a. The code used to generate the system is shown in 4b. The first step is to create the network object, then connect the external elements to routers (nodes), and finally create the desired internal structure by connecting the routers to each other.

D. Memory hierarchy

JADE comes with plenty of templates to model a diverse set of memory hierarchies. The templates are located inside the directory protocols. You can change the cache hierarchy by setting the PROTOCOL flag at compilation time. See section 5 for details.

We provide several examples of cache hierarchies with 2 levels and 3 levels. The templates are all based on SLICC (Specification Language for Implementing Cache Coherence), provided by the Ruby memory simulator [3]. The default cache hierarchy is a 2-level hierarchy with the directory-based MSI-MOSI cache coherence protocol MSI_MOSI_CMP_directory. We also provide templates for inclusive and exclusive 3-level cache hierarchies.

By default, the last level cache is shared across all cores while the other caches are private to each core. The corresponding parameters can be set in the configuration files (see section 8-E). You can use whatever cache hierarchy and protocol you like by specifying the protocol file list under the directory protocols when compiling JADE.

E. Components configuration file

Many components of JADE can be individually tuned. For instance, one can set the buffer sizes of NoC routers, link latencies, the sizes of memories and caches, cache access latencies and many other parameters. A few configuration files are already provided, located in /home/JADE-v4.0/architectures/configs. Although many of the parameters are explained inside the configuration files, Table X gives a more detailed description of some selected configuration items that are important for simulation. When a configuration file is changed, JADE must be rebuilt for the changes to take effect. For ease of use, some frequently used parameters can also be set on the command line, as listed in Table V. These settings take effect without rebuilding JADE, at the cost of a longer command line. Users can choose whichever way they prefer.

TABLE X: Selected configuration items

Parameter: g_RUNNING_CYCLES
Description: Number of cycles after which the simulation finishes.
Example: 0 means run the simulation to the end.

Parameter: g_INTERLEAVED_MEMORY
Description: The way an address is mapped to the memory controllers.
Example: True means interleaved mapping, False means continuous mapping.

Parameter: g_simple_NoC
Description: Specifies the NoC granularity.
Example: True means simplified NoC, False means detailed NoC.

Parameter: g_JADE_E_NETWORK
Description: Specifies whether the electrical architecture is used.
Example: True means the electrical network is used.

Parameter: g_MEMORY_SIZE_BYTES
Description: Main memory size in bytes, for each memory.
Example: For 4GB memories set 4294967296.

Parameter: g_DATA_BLOCK_BYTES
Description: Data block size in bytes. Corresponds to the cache line size in all levels.
Example: For a cache line size of 64, set 64.

Parameter: L1_CACHE_NUM_SETS_BITS
Description: Number of cache sets in bits per core.
Example: To obtain an L1 cache of 64 Kbytes, in which the data block is 64 bytes and the associativity is 4, set this variable to 8.

Parameter: L2_CACHE_NUM_SETS_BITS
Description: Number of cache sets in bits per core.
Example: To set a total L2 size of 2 MBytes, with a data block of 64 B and associativity 8, assuming there are 4 processors, set this variable to 10.
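The two cache examples in Table X follow from the usual relation sets = cache size / (block size × associativity), with the parameter holding log2(sets). The Python sketch below reproduces both values; the function name num_sets_bits is hypothetical, and dividing the total L2 size by the number of processors is our reading of "per core" in the table.

```python
def num_sets_bits(cache_bytes, block_bytes, associativity):
    """Return log2 of the number of sets for the given cache geometry."""
    sets = cache_bytes // (block_bytes * associativity)
    if sets <= 0 or sets * block_bytes * associativity != cache_bytes:
        raise ValueError("size must be a multiple of block size x associativity")
    if sets & (sets - 1):
        raise ValueError("the number of sets must be a power of two")
    return sets.bit_length() - 1


# L1: 64 Kbytes, 64-byte blocks, 4-way -> 256 sets -> 8 bits
l1_bits = num_sets_bits(64 * 1024, 64, 4)

# L2: 2 MBytes in total over 4 processors, 64-byte blocks, 8-way
# -> 512 Kbytes per core -> 1024 sets -> 10 bits
l2_bits = num_sets_bits(2 * 1024 * 1024 // 4, 64, 8)

print(l1_bits, l2_bits)
```

The same arithmetic explains the memory size example: 4 GB is 4 × 1024³ = 4294967296 bytes.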

Users should select the application model file that matches the components configuration used to compile JADE. For instance, if SYSTEM_SETTINGS=aarch64 was used to build JADE, users should use application files with the suffix *-aarch64.sam. Please refer to [4] for more descriptions of the memory parameters. We further provide a generic configuration, based on a generic MIPS implementation, that allows users to modify and tune each parameter individually.

F. Power models

In JADE-v4.0, we incorporate a holistic Power Management (PM) function for managing the system power and generating power statistics for the different components. The PM includes three subsystems: Dynamic Voltage and Frequency Scaling (DVFS), Energy Consumption Calculation and Power Delivery System (PDS) Models. The central PM controller in JADE is called the PMU (Power Management Unit), which is defined under the directory support. The PMU is responsible for the following functions:

• Initializing the Power Management (PM) states, which represent the voltage and frequency pairs
• Launching the power tools and getting the corresponding power consumption results for the target processor configurations and the defined PM states
• Maintaining and updating the real-time PM states of the different components (DVFS)
• Calculating the energy consumption of the different components (Core/L2)
• Creating the Power Delivery System model
• Printing the power results at the end of the simulation
• Generating per-core frequency, dynamic power and static power variation based on process variation

The PMU initialization workflow of a JADE simulation is shown in Figure 5. First, the PMU defines the frequency and voltage pairs for the different PMStates in VFLists for the different components. Second, it launches the power tool (currently McPAT) in parallel to compute the dynamic and static power for all the PMStates of the different components: a temporary configuration file is constructed, and the power tool is then launched directly by ./utils/mcpat/bin/powerWizard.py using this configuration file. Finally, when the power tool computation completes, the corresponding results are written to Power Vectors. These results can be further used for DVFS and energy consumption calculation. For all components, including processor cores, routers and caches, both dynamic and static energy are calculated.

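[Figure 5 shows the JADE PMU initiating the PMStates (voltage and frequency pairs), launching the power tool as several parallel McPAT instances, getting the power results into Power Vectors and creating the PDS model; the simulation block exchanges "Update INFO" and "Set PM State" with the PMU.]

Fig. 5: The PMU workflow in JADE.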

Currently the DVFS function supports the following components, and each component can have its own voltage and frequency settings:

• Processor cores: the PM state is changed by the function SetProcPMState(). Currently we apply DVFS at the beginning of task execution; users can also set the PMState on other occasions. The voltage/frequency scaling time (VScalingTime) is considered if the PDSModel is enabled, in which case it is generated by the PDSModel. The PMU also maintains the working state of each core (idle or running) for the PDSModel energy calculation. We use PMU::UpdateProcWorkingState(Processor* proc) to update the working state and compute the PDS efficiency at the same time.
• Network interfaces: use PMU::SetPMState_NI(w_NetworkInterface* NI) to change the PM state of an NI in its wakeup() function.
• Routers: use PMU::SetPMState_Router(w_router* router) to change the PM state of a router in its wakeup() function.

By default, DVFS for processors, routers and network interfaces is disabled. To use this feature, set "Flag_DVFS_proc", "Flag_DVFS_router" and "Flag_DVFS_NI" to true in /home/JADE-v4.0/support/PMU.h.

JADE also includes a process variation generator to simulate the variation of core frequency, average dynamic power and average static power. These variations stem from two fundamental device-level parameters: the effective gate length Leff and the threshold voltage Vth. We model Leff and Vth as spatially correlated normal distributions, and use a spherical function to generate the covariance between cores.
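To make the idea of spatially correlated normal parameters concrete, the following Python sketch draws one Vth value per core on a small grid. It is not JADE's implementation: the function names, the 2x2 core placement, the mean, sigma and range values are all illustrative assumptions, the spherical correlation is the standard geostatistical form, and sampling through a Cholesky factor is simply one common way to impose a given covariance.

```python
import math
import random


def spherical_correlation(distance, phi):
    """Spherical model: 1 at distance 0, decreasing to 0 at the range phi."""
    if distance >= phi:
        return 0.0
    r = distance / phi
    return 1.0 - 1.5 * r + 0.5 * r ** 3


def correlation_matrix(positions, phi):
    """Pairwise correlation between cores placed at the given (x, y) points."""
    return [[spherical_correlation(math.dist(p, q), phi) for q in positions]
            for p in positions]


def cholesky(matrix):
    """Lower-triangular factor of a symmetric positive definite matrix."""
    n = len(matrix)
    lower = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(lower[i][k] * lower[j][k] for k in range(j))
            if i == j:
                lower[i][j] = math.sqrt(matrix[i][i] - s)
            else:
                lower[i][j] = (matrix[i][j] - s) / lower[j][j]
    return lower


def sample_parameter(positions, mean, sigma, phi, rng):
    """Draw one correlated normal value per core."""
    lower = cholesky(correlation_matrix(positions, phi))
    z = [rng.gauss(0.0, 1.0) for _ in positions]
    return [mean + sigma * sum(lower[i][k] * z[k] for k in range(i + 1))
            for i in range(len(positions))]


cores = [(x, y) for y in range(2) for x in range(2)]  # illustrative 2x2 layout
rng = random.Random(1)
vth = sample_parameter(cores, mean=0.30, sigma=0.01, phi=3.0, rng=rng)
print(vth)
```

Neighbouring cores receive similar values because their correlation is high, while cores farther apart than the range phi vary independently.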

9. OUTPUT ITEMS

JADE provides several statistics, including holistic performance and power analysis, system behavior and memory access traces. For a given input application and architecture, several simulation iterations are executed, allowing the application behavior to be randomly constructed. From the simulation we use a configurable number of intermediate results, ignoring the initial and final iterations to allow for system warm-up and sink.
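The effect of ignoring the initial and final iterations can be illustrated with a short Python function that averages only the intermediate results. This is a sketch of the idea, not JADE code: the function name, the parameter names and the sample execution times are invented for illustration.

```python
def steady_state_mean(per_iteration, skip_first, skip_last):
    """Average the results that remain after dropping warm-up and sink iterations."""
    if skip_first < 0 or skip_last < 0:
        raise ValueError("skip counts must not be negative")
    kept = per_iteration[skip_first:len(per_iteration) - skip_last]
    if not kept:
        raise ValueError("no iterations left after trimming")
    return sum(kept) / len(kept)


# Made-up execution times for 5 iterations: the first one is inflated by
# warm-up effects and the last one by the system draining.
times = [130.0, 101.0, 99.0, 100.0, 112.0]
print(steady_state_mean(times, skip_first=1, skip_last=1))  # 100.0
```

With the trimming disabled, the same data would average 108.4, which shows how much the boundary iterations can bias the reported result.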

Displayed outputs include the overall application execution time and individual processor statistics, such as total busy time and waiting time, which depict the utilization ratio of the processors. The network counters shown include the packet injection rate of individual routers, average packet delay, throughput and flow control statistics. JADE also incorporates power library models for selected technology nodes, and reports power and energy usage for various components within the system.

The memory access trace is also recorded, supporting further investigation of the memory behavior of applications. Besides conventional outputs, researchers and designers often need a more detailed view of the behavior of a specific component in order to understand it better and later tune it. To meet such needs, custom output is also possible by adding customized recorders to the modules, enabling the extraction of specific metrics and behavior.

Although we intend to print the statistics of system behavior and performance in as much detail as possible, users might be interested in aspects that the current version does not cover. In that case, users may need to go through the source code and add probes to meet their own requirements.

10. AGREEMENT AND LICENSE

The JADE copyright notice can be found in its root directory. If you use JADE in your research, please cite the following paper (http://dx.doi.org/10.1145/2857058.2857066):

Rafael K. V. Maeda, Peng Yang, Xiaowen Wu, Zhe Wang, Jiang Xu, Zhehui Wang, Haoran Li, Luan H. K. Duong, Zhifei Wang, JADE: a Heterogeneous Multiprocessor System Simulation Platform Using Recorded and Statistical Application Models, HiPEAC Workshop on Advanced Interconnect Solutions and Technologies for Emerging Computing Systems, Prague, January 2016.

11. REVISION HISTORY


TABLE XI: Revision history

Version  Changes                                                        Date
2.1      Cache coherence protocols                                      1-Jun-16
2.2      Inter-chip communication and separate memory space             10-Oct-16
2.3      Optimized memory usage and compatible with APEX benchmarks     10-Dec-16
2.4      Power management unit                                          2-Feb-17
2.5      Process variation                                              1-Apr-17
2.6      Internal test and verification                                 1-May-17
3.0      Public release                                                 1-Aug-17
3.1      Adoption of COSMIC v3.0                                        1-Jul-18
3.2      Enhanced CPU models.                                           1-Aug-18
3.3      Adoption of more cache coherence protocols.                    1-Aug-18
4.0      Public release                                                 1-Sept-18

REFERENCES

[1] “Benchmark Distribution & Run Rules,” http://www.nersc.gov/research-and-development/apex/apex-benchmarks/.

[2] http://research.cs.wisc.edu/gems/doc/gems-wiki/moin.cgi/Understanding%20Network%20Files.

[3] M. M. K. Martin, D. J. Sorin, B. M. Beckmann, M. R. Marty, M. Xu, A. R. Alameldeen, K. E. Moore, M. D. Hill, and D. A. Wood, “Multifacet’s general execution-driven multiprocessor simulator (GEMS) toolset,” ACM SIGARCH Computer Architecture News, vol. 33, no. 4, pp. 92–99, 2005.

[4] http://research.cs.wisc.edu/gems/doc/gems-wiki/moin.cgi/.

12. APPENDIX

A. Basic command line options

There are two classes of command line options: the basic options and the advanced options. This section shows the basic options, whereas Appendix 12-B shows the advanced options. To list all basic command line options, run the command:

./workspace/aarch64_msi_mosi_cmp_directory/bin/jade.exec --help

The basic command line options are shown below:

JADE command line options:

Usage options:
  -h [ --help ]                    List command line options.
  -H [ --help-arch ]               List architecture parameters options.
  -v [ --version ]                 Print JADE version

Basic options (all required):
  -i [ --app ] arg                 Application file (.sam). Examples under
                                   directory: JADE/applications/app.
  -d [ --arc ] arg                 Architecture file (.arc). Examples under
                                   directory: JADE/architectures/arc.
  -w [ --net ] arg                 Topology file (.net). Examples under
                                   directory: JADE/architectures/net.
  -g [ --map ] arg                 Mapping file (.map). Examples under
                                   directory: JADE/applications/map.
  -O [ --out ] arg                 Results output file. Default file:
                                   workspace/dir/result/results.txt

Simulation control:
  -P [ --periodic-stats ] arg      Print results periodically. Should pass the
                                   period to print
  -l [ --sim-length ] arg          Maximum simulation length. Given in time unit
                                   (ps).
  -r [ --random ] arg              Random seed. Use 0 for time() and larger than
                                   0 for fixed seed.
  --verbose arg                    Will print some debugging informatio while
                                   simulating.
  -N [ --iterations ] arg          How many iterations the same application
                                   should be executed. The results will be
                                   accumulated.
  -f [ --number-workloads ] arg    Number of workloads. Fix-me - might not be
                                   working.
  -D [ --dtrace ] arg              Data trace file when running cycle-accurate.
                                   Examples under JADE/applications/app.
  -I [ --itrace ] arg              Instruction trace file when running
                                   cycle-accurate. Examples under
                                   JADE/applications/app.
  -y [ --cached-net ]              Use cached routing information to build the
                                   network. Saves simulation time.
  -Y [ --gen-net ]                 Do not simulate, only generate the cached
                                   routing information.

Basic system parameters.:
  --base-period-ps arg             Baseline period (in picoseconds) of all
                                   components. Default: 500ps = 2GHz.

Synthetic traffic options:
  -R [ --injection-rate ] arg      Injection rate for the synthetic-traffic.
  -T [ --dynamic-mtu ] arg         Dynamic MTU for inter-chip communication.
  -B [ --bandwidth-mult ] arg      Inter-chip interconnect bw multiplier
  -E [ --electrical-icn ]          Use electrical inter-chip network
  -F [ --performance-ratio ] arg   Set the processor performance ratio
  -M [ --integrated-nic ]          Integrated NIC

Power parameters.:
  -L [ --power-lib ] arg           Set the power library file.
  -A [ --tech-node ] arg           Set the technology node (nm).
  --enable-pds                     Activate PDS simulation, requires user to set
                                   the output file (pds-outfile)
  --pds-result-path arg            Set the path to write the PDS results.
                                   Default: workspace/dir/result/pds/*
  --pds-baseline-iterations arg    Baseline iterations to compute average pds
                                   stats.
  --pds-iterations-average arg     Number of iterations to compute average pds
                                   stats.
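As an illustration of how the required basic options fit together, the Python sketch below assembles a JADE command line and converts a clock frequency into the value expected by --base-period-ps. The function names are hypothetical and the file paths are placeholders chosen to resemble the names used in this manual; only the option names and the 500 ps = 2 GHz relation come from the listing above.

```python
import shlex


def period_ps(frequency_ghz):
    """Clock period in picoseconds for a frequency in GHz (1 GHz = 1000 ps)."""
    return round(1000.0 / frequency_ghz)


def jade_argv(binary, app, arc, net, mapping, out, frequency_ghz=None):
    """Return the argument list for one JADE run with all required options."""
    argv = [binary,
            "--app", app,      # application file (.sam)
            "--arc", arc,      # architecture file (.arc)
            "--net", net,      # topology file
            "--map", mapping,  # mapping file (.map)
            "--out", out]      # results output file
    if frequency_ghz is not None:
        argv += ["--base-period-ps", str(period_ps(frequency_ghz))]
    return argv


argv = jade_argv(
    "./workspace/aarch64_msi_mosi_cmp_directory/bin/jade.exec",
    "applications/app/FFT-1024-aarch64.sam",           # placeholder path
    "architectures/arc/mesh_4x4.arc",                  # placeholder path
    "architectures/net/mesh_4x4.net",                  # placeholder path
    "applications/map/FFT-1024-aarch64-mesh_4x4.map",  # placeholder path
    "results.txt",
    frequency_ghz=2.0,
)
print(shlex.join(argv))
```

Building the list in one place keeps the application, architecture, network and mapping files consistent with one another across a batch of runs.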

B. Advanced command line options

The advanced command line options set specific architecture parameters such as cache size, cache latencies and many more. Use these options carefully, since misconfiguring the parameters might cause unpredictable behavior in the simulation. You can list all advanced options by running JADE with the command:

./workspace/aarch64_msi_mosi_cmp_directory/bin/jade.exec --help-arch


The full list of advanced command line options is shown below:

Advanced architecture parameters. Use carefully:

System parameters.:
  --g_NUM_MEMORIES arg (uint)
  --g_PROCS_PER_CHIP arg (uint)
  --g_NUM_CHIPS arg (uint)
  --g_NUM_CHIP_BITS arg (uint)
  --g_NUM_PROCESSORS_BITS arg (uint)
  --g_PROCS_PER_CHIP_BITS arg (uint)

Network parameters:
  --g_adaptive_routing arg (boolean)
  --g_PRINT_TOPOLOGY arg (boolean)
  --g_JADE_E_NETWORK arg (boolean)
  --g_JADE_SUOR arg (boolean)
  --g_INIC arg (boolean)
  --g_simple_NoC arg (boolean)
  --g_DETAIL_NETWORK arg (boolean)
  --g_GARNET_NETWORK arg (boolean)
  --g_NETWORK_TESTING arg (boolean)
  --g_bash_bandwidth_adaptive_threshold arg (float)
  --g_ODB arg (float)
  --InjectRate arg (float)
  --g_topology_file arg (string)
  --g_NETWORK_TOPOLOGY arg (string)
  --g_DP arg (uint)
  --g_PF arg (uint)
  --g_D_MTU arg (uint)
  --NETWORK_LINK_LATENCY arg (uint)
  --COPY_HEAD_LATENCY arg (uint)
  --ON_CHIP_LINK_LATENCY arg (uint)
  --g_endpoint_bandwidth arg (uint)
  --NUMBER_OF_VIRTUAL_NETWORKS arg (uint)
  --g_SUOR_cluster_size arg (uint)
  --g_FLIT_SIZE arg (uint)
  --g_NUM_PIPE_STAGES arg (uint)
  --g_VCS_PER_CLASS arg (uint)
  --g_BUFFER_SIZE arg (uint)
  --g_SpecifiedGenerator arg (string)

Processor parameters.:
  --g_SYNTHETIC_DRIVER arg (boolean)
  --g_DETERMINISTIC_DRIVER arg (boolean)
  --PROFILE_HOT_LINES arg (boolean)
  --PROFILE_ALL_INSTRUCTIONS arg (boolean)
  --USER_MODE_DATA_ONLY arg (boolean)
  --TRANSACTION_TRACE_ENABLED arg (boolean)
  --g_SIMICS arg (boolean)
  --FINITE_BUFFERING arg (boolean)
  --g_trace_warmup_length arg (uint)
  --g_tester_length arg (uint)
  --ENABLE_DVFS arg (boolean)
  --g_synthetic_locks arg (uint)
  --g_deterministic_addrs arg (uint)
  --g_callback_counter arg (uint)
  --g_NUM_SMT_THREADS arg (uint)
  --SIMICS_RUBY_MULTIPLIER arg (uint)
  --OPAL_RUBY_MULTIPLIER arg (uint)
  --g_NUM_PROCESSORS arg (uint)
  --PROCESSOR_BUFFER_SIZE arg (uint)
  --SEQUENCER_TO_CONTROLLER_LATENCY arg (uint)

Memory parameters:
  --PERFECT_MEMORY_SYSTEM_LATENCY arg (uint)
  --L0_CACHE_ASSOC arg (uint)
  --L0_CACHE_NUM_SETS_BITS arg (uint)
  --L1_CACHE_ASSOC arg (uint)
  --L1_CACHE_NUM_SETS_BITS arg (uint)
  --L2_CACHE_ASSOC arg (uint)
  --L2_CACHE_NUM_SETS_BITS arg (uint)
  --g_DATA_BLOCK_BYTES arg (uint)
  --g_PAGE_SIZE_BYTES arg (uint)
  --g_NUM_L2_BANKS arg (uint)
  --g_MEMORY_SIZE_BITS arg (uint)
  --g_DATA_BLOCK_BITS arg (uint)
  --g_PAGE_SIZE_BITS arg (uint)
  --g_NUM_L2_BANKS_BITS arg (uint)
  --g_NUM_L2_BANKS_PER_CHIP_BITS arg (uint)
  --g_NUM_L2_BANKS_PER_CHIP arg (uint)
  --g_NUM_MEMORIES_BITS arg (uint)
  --g_NUM_MEMORIES_PER_CHIP arg (uint)
  --g_MEMORY_MODULE_BITS arg (uint)
  --DIRECTORY_CACHE_LATENCY arg (uint)
  --ISSUE_LATENCY arg (uint)
  --CACHE_RESPONSE_LATENCY arg (uint)
  --L2_RESPONSE_LATENCY arg (uint)
  --L2_TAG_LATENCY arg (uint)
  --L1_RESPONSE_LATENCY arg (uint)
  --L0_RESPONSE_LATENCY arg (uint)
  --MEMORY_RESPONSE_LATENCY_MINUS_2 arg (uint)
  --DIRECTORY_LATENCY arg (uint)
  --RECYCLE_LATENCY arg (uint)
  --L2_RECYCLE_LATENCY arg (uint)
  --TBE_RESPONSE_LATENCY arg (uint)
  --L0_REQUEST_LATENCY arg (uint)
  --L1_REQUEST_LATENCY arg (uint)
  --L2_REQUEST_LATENCY arg (uint)
  --L0CACHE_TRANSITIONS_PER_RUBY_CYCLE arg (uint)
  --L1CACHE_TRANSITIONS_PER_RUBY_CYCLE arg (uint)
  --L2CACHE_TRANSITIONS_PER_RUBY_CYCLE arg (uint)
  --DIRECTORY_TRANSITIONS_PER_RUBY_CYCLE arg (uint)
  --g_SEQUENCER_OUTSTANDING_REQUESTS arg (uint)
  --NUMBER_OF_TBES arg (uint)
  --NUMBER_OF_L0_TBES arg (uint)
  --NUMBER_OF_L1_TBES arg (uint)
  --NUMBER_OF_L2_TBES arg (uint)
  --PROTOCOL_BUFFER_SIZE arg (uint)
  --MEM_BUS_CYCLE_MULTIPLIER arg (uint)
  --BANKS_PER_RANK arg (uint)
  --RANKS_PER_DIMM arg (uint)
  --DIMMS_PER_CHANNEL arg (uint)
  --BANK_BIT_0 arg (uint)
  --RANK_BIT_0 arg (uint)
  --DIMM_BIT_0 arg (uint)
  --BANK_QUEUE_SIZE arg (uint)
  --BANK_BUSY_TIME arg (uint)
  --RANK_RANK_DELAY arg (uint)
  --READ_WRITE_DELAY arg (uint)
  --MEM_CTL_LATENCY arg (uint)
  --REFRESH_PERIOD arg (uint)
  --BASIC_BUS_BUSY_TIME arg (uint)
  --TFAW arg (uint)
  --MEM_RANDOM_ARBITRATE arg (uint)
  --MEM_FIXED_DELAY arg (uint)
  --PRINT_INSTRUCTION_TRACE arg (boolean)
  --REMOVE_SINGLE_CYCLE_DCACHE_FAST_PATH arg (boolean)
  --g_INTERLEAVED_MEMORY arg (boolean)
  --SyntheticTraffic arg (boolean)
  --MAP_L2BANKS_TO_LOWEST_BITS arg (boolean)
  --g_MEMORY_SIZE_BYTES arg (ulong)
  --g_MEMORY_MODULE_BLOCKS arg (ulong)
  --XACT_CONFLICT_RES arg (string)
  --PERFECT_MEMORY_SYSTEM arg (boolean)

Instrumentation parameters.:
  --RANDOMIZATION arg (boolean)
  --g_FILTERING_ENABLED arg (boolean)
  --PROTOCOL_DEBUG_TRACE arg (boolean)
  --PERIODIC_TIMER_WAKEUPS arg (boolean)
  --PERFECT_FILTER arg (boolean)
  --PERFECT_VIRTUAL_FILTER arg (boolean)
  --PERFECT_SUMMARY_FILTER arg (boolean)
  --ENABLE_MAGIC_WAITING arg (boolean)
  --ENABLE_WATCHPOINT arg (boolean)
  --DEBUG_FILTER_STRING arg (string)
  --DEBUG_VERBOSITY_STRING arg (string)
  --DEBUG_OUTPUT_FILENAME arg (string)
  --g_RANDOM_SEED arg (uint)
  --g_DEADLOCK_THRESHOLD arg (uint)
  --g_RETRY_THRESHOLD arg (uint)
  --g_FIXED_TIMEOUT_LATENCY arg (uint)
  --g_NUM_COMPLETIONS_BEFORE_PASS arg (uint)
  --g_think_time arg (uint)
  --g_hold_time arg (uint)
  --g_wait_time arg (uint)
  --g_DEBUG_CYCLE arg (uint)
  --NULL_LATENCY arg (uint)
  --TIMER_LATENCY arg (uint)
  --ABORT_RETRY_TIME arg (uint)
  --DEBUG_START_TIME arg (ulong)

Legacy parameters.:
  --g_DISTRIBUTED_PERSISTENT_ENABLED arg (boolean)
  --g_DYNAMIC_TIMEOUT_ENABLED arg (boolean)
  --XACT_ENABLE_VIRTUALIZATION_LOGTM_SE arg (boolean)
  --XACT_EAGER_CD arg (boolean)
  --XACT_LAZY_VM arg (boolean)
  --XACT_VISUALIZER arg (boolean)
  --XACT_NO_BACKOFF arg (boolean)
  --g_REPLACEMENT_POLICY arg (string)
  --BLOCK_STC arg (boolean)
  --PROFILE_EXCEPTIONS arg (boolean)
  --ATMTP_ENABLED arg (boolean)
  --ATMTP_ABORT_ON_NON_XACT_INST arg (boolean)
  --ATMTP_ALLOW_SAVE_RESTORE_IN_XACT arg (boolean)
  --READ_WRITE_FILTER arg (string)
  --VIRTUAL_READ_WRITE_FILTER arg (string)
  --SUMMARY_READ_WRITE_FILTER arg (string)
  --g_CACHE_DESIGN arg (string)
  --ATMTP_XACT_MAX_STORES arg (uint)
  --ATMTP_DEBUG_LEVEL arg (uint)
  --FAN_OUT_DEGREE arg (uint)
  --XACT_DEBUG_LEVEL arg (uint)
  --XACT_NUM_CURRENT arg (uint)
  --XACT_LAST_UPDATE arg (uint)
  --XACT_COMMIT_TOKEN_LATENCY arg (uint)
  --XACT_LOG_BUFFER_SIZE arg (uint)
  --XACT_STORE_PREDICTOR_HISTORY arg (uint)
  --XACT_STORE_PREDICTOR_ENTRIES arg (uint)
  --XACT_STORE_PREDICTOR_THRESHOLD arg (uint)
  --XACT_FIRST_ACCESS_COST arg (uint)
  --XACT_FIRST_PAGE_ACCESS_COST arg (uint)
  --XACT_LENGTH arg (uint)
  --XACT_SIZE arg (uint)
  --PROFILE_XACT arg (boolean)
  --PROFILE_NONXACT arg (boolean)
  --XACT_DEBUG arg (boolean)
  --XACT_MEMORY arg (boolean)
  --XACT_ENABLE_TOURMALINE arg (boolean)
  --XACT_ISOLATION_CHECK arg (boolean)