Architectural Vulnerability Factor Estimation through Fault
Injections
A Thesis Presented
by
Fritz Gerald Previlon
to
The Department of Electrical and Computer Engineering
in partial fulfillment of the requirements
for the degree of
Master of Science
in
Computer Engineering
Northeastern University
Boston, Massachusetts
April 2016
To my family!!
Contents
List of Figures
List of Tables
List of Acronyms
Acknowledgments
Abstract of the Thesis
1 Introduction
1.1 Introduction to GPU Programming
1.1.1 The Open Compute Language (OpenCL)
1.1.2 The OpenCL Platform Model
1.1.3 Architecture of the Evergreen family of GPUs
1.1.4 Register File
1.1.5 Local Data Storage (LDS)
1.2 Contributions of this Thesis
1.3 Organization of the Thesis

2 Background
2.1 Soft Error Overview
2.1.1 Faults vs Errors
2.1.2 Correctable Errors (CE)
2.1.3 Detected Unrecoverable Errors (DUE)
2.1.4 Silent Data Corruptions (SDC)
2.2 Transient Fault - Background and Terminology
2.2.1 Raw Circuit FIT Rate
2.2.2 Vulnerability Factor
2.3 Architectural Vulnerability Factor
2.3.1 Fault Injection
2.3.2 Architecturally Correct Execution (ACE) Analysis

3 Transient Faults on GPUs
3.1 Fault Injection Studies
3.1.1 GPU-Qin
3.1.2 SASSIFI
3.1.3 Multi2Sim Fault Injection [1]
3.2 ACE Analysis Studies
3.2.1 GPGPU-SODA

4 Methodology
4.1 Multi2Sim Simulation Model
4.2 Multi2Sim for Fault Injection
4.3 Statistical Significance
4.4 Post-Experiment Analysis
4.5 Evaluation Framework
4.5.1 Platform for Evaluation
4.5.2 Evaluated Benchmarks

5 Results and Analysis
5.1 Local Data Storage
5.2 Register File
5.3 How does vulnerability vary over time?
5.3.1 Case Study: LDS - RadixSort
5.3.2 Case Study: MatrixMultiplication

6 Conclusion

Bibliography
List of Figures
1.1 Clock rate and Power for Intel x86 microprocessors over eight generations and 25 years (Source [2])
1.2 OpenCL Programming Model and Evergreen Hardware Architecture.
4.1 Possible outcomes for each simulation run
4.2 This graph shows how the Architectural Vulnerability Factor (AVF) value changes based on the number of fault-injection experiments. We notice that the AVF value shows little variation and stabilizes after 5,000 injections.
4.3 Formula for the number of faults to select for injection
5.1 Results of fault injection experiments on the Local Data Share
5.2 Amount of local memory used by each application for each NDRange mapped to the compute device
5.3 Results of fault injection experiments on the General-Purpose Register File
5.4 Number of general-purpose registers used by each application for each NDRange mapped to the compute device
5.5 Intervals of vulnerability for RadixSort. This shows that the faults that lead to incorrect output fall only into specific intervals of time
5.6 Intervals for LDS accesses for RadixSort.
5.7 Intervals of vulnerability for MatrixMultiplication. The vulnerability of MatrixMultiplication shows a periodic behavior
5.8 Local memory accesses in MatrixMultiplication show a periodic behavior
List of Tables
4.1 The GPU configuration used in the experiments
4.2 The benchmarks used in the experiments
List of Acronyms
AVF Architectural Vulnerability Factor. The probability that a soft error will cause an error in a program output. The AVF of an architectural bit can be thought of as the fraction of time a bit matters for the final output of a program.

ACE Architecturally Correct Execution. ACE analysis is a method to derive an upper bound on AVF using performance simulation.

LDS Local Data Storage. The local memory module in a compute unit of a GPU. This module is shared between work-items in a compute unit and allows for communication between the work-items. It can be manipulated using explicit instructions.
Acknowledgments
Here I wish to thank those who have supported me during the process of the thesis work.

First I would like to thank my family and close friends who have encouraged me and believed in me. Their support was critical to the completion of this work.

I also want to thank the friends from NUCAR and my committee members (Dr. Vilas Sridharan and Prof. Ningfang Mi) who have provided important and constructive feedback throughout the progress of this thesis.

Lastly, my advisor David Kaeli has been a reliable and indispensable guide and support from the beginning of this work. Many thanks!
Abstract of the Thesis
Architectural Vulnerability Factor Estimation through Fault Injections
by
Fritz Gerald Previlon
Master of Science in Electrical and Computer Engineering
Northeastern University, April 2016
David Kaeli, Adviser
Given the large number of processing cores, as well as their impressive parallel processing capabilities, Graphics Processing Units (GPUs) have become the accelerator of choice across multiple domains. GPUs are able to accelerate processing in a wide range of applications including scientific computing, bio-informatics, and financial applications. Their presence in the world’s fastest supercomputers has been steadily growing over the last few years.

With technology scaling, soft-error reliability has become a major issue for hardware designers. Soft errors are non-permanent faults, where a bit flip occurs in a latch or memory cell. A recent study by the Department of Energy has identified soft errors as one of the top 10 barriers to exascale computing. The architecture research community needs to pursue solutions to address the challenges presented by the growing presence of soft errors. While some number of soft errors will not necessarily cause an error at the output of a program, many will corrupt vulnerable program state. Since GPUs are increasingly being used for compute instead of just graphics, their reliability has become a concern. Therefore, an important step in tackling soft errors in GPUs is to first assess the impact of soft errors and the robustness of GPUs in the presence of these faults.

In this thesis, we address this question using fault injection on a GPU from the AMD Evergreen family. In this study, we inject bit flips using a detailed architectural simulator. Our results indicate that a GPU can be a highly resilient device in the presence of soft errors. We present a study of trends that appear in common GPU programs when soft errors occur in the GPU memory hierarchy. These trends can be used to inform programmers, as well as system designers, when making decisions about how to increase the reliability of GPU software and hardware.
Chapter 1
Introduction
For more than 3 decades, frequency scaling - increasing a processor’s frequency for
performance - has been the driving force behind Moore’s law [3]. Processor frequencies have
increased from 1-8 MHz in the 1970s to 2-4 GHz today (approximately 4,000 times faster). How-
ever, power/thermal constraints have made it very challenging for us to continue increasing clock
frequencies of microprocessors. Figure 1.1 shows how both power and clock rate have increased
rapidly for decades, but have recently flattened [2]. The microprocessor industry has thus turned
to parallelism in order to obtain higher performance using the same frequency, though with only a
minimal increase in power consumption. For the past decade, general purpose compute applications
have started to leverage the parallelism provided through parallel computing hardware, as well as
sophisticated parallel programming interfaces.
As parallelism became more prevalent, the market saw an increase in multi-core processors,
able to take advantage of the parallelism inherent in general-purpose applications. New program-
ming interfaces have been developed in order to facilitate the development of parallel applications.
Developers have looked for ways to accelerate their performance-critical applications in order to
exploit the performance benefits offered by the parallelism in multi-core processors.
With hundreds of cores and streaming processing devices, Graphics Processing Units
(GPUs) have become an attractive parallel processing device. Originally, GPU acceleration was
limited due to the lack of programmable shaders, then the use of graphics-oriented programming
languages. Improved programmability has helped these devices become quite attractive for high-
performance computing and other data-intensive applications. Beyond their primary graphics role,
they are now used in a growing range of applications, including scientific computing [4], bio-informatics [5], molecular modeling [6] and financial applications [7].
Figure 1.1: Clock rate and Power for Intel x86 microprocessors over eight generations and 25 years (Source [2])
However, because their primary use has been for graphic processing, reliability (or more
specifically, soft error reliability) has never been a major concern for GPU designers. As expressed
by Sheaffer et al., a user is quite unlikely to care about or even perceive a single-bit error in a single
pixel for a single frame, when running traditional gaming programs [8].
To continue to exploit the GPU’s impressive parallel compute capabilities, and to expand the
use of GPUs to a wider range of markets and industries, it is imperative that reliability issues on
GPUs be rigorously addressed.
In a traditional CPU design, soft-error reliability is not a foreign concept. Reliability has
commonly been a key design trade-off considered during processor design. Soft errors are radiation-
induced errors and are caused by energetic particles (neutrons from cosmic rays, and alpha particles
from packaging materials) generating electron-hole pairs as they pass through semiconductor devices.
Soft-error reliability has been well studied on CPUs; numerous techniques have been developed
to characterize errors, and to protect microprocessors against these faults [9][10][11]. However,
little work has been done on the resiliency of GPUs in the presence of soft errors. We need to first
understand how vulnerability of these devices is dependent on underlying program characteristics.
In this thesis, we present an extensive fault injection study on soft error reliability in the memory
hierarchy of a class of GPUs.
We have found that, in general, a GPU is fairly resilient to soft errors when running typical
GPU applications. We have also found trends in the resiliency of a GPU, which can be exploited by
GPU application designers to make their software more robust against soft errors.
Next, we provide background information on General-Purpose Computing on Graphics
Processing Units (GPGPU).
1.1 Introduction to GPU Programming
GPUs were originally designed to efficiently render 3-D graphics, providing highly opti-
mized datapaths for generating frames and frames of pixel data. The research community recognized
that GPUs could also be used for massive data processing, and started executing floating-point
computations using graphics APIs such as OpenGL and DirectX. The applications that were first
ported to GPUs typically involved computations on matrices and vectors. Matrix multiplication was
one of the early CPU programs that performed significantly better when run on a graphics card [12].
However, porting these general-purpose applications to GPUs was a very complex and daunting task,
as it required the programmer to recast their algorithms in terms of the graphics APIs.
Industry leaders AMD and NVIDIA recognized this trend, and proposed general-purpose
programming languages that would allow GPUs to be used for a broader class of applications.
OpenCL [13] and CUDA [14] have emerged as two standard programming frameworks that allow
GPUs to be integrated in supercomputers and desktops as accelerators. Programmers were no longer
tied to the underlying graphics programming model. They could focus more on high-performance
computing, which attracted many more developers of general purpose applications to a GPU platform.
1.1.1 The Open Compute Language (OpenCL)
As GPU hardware vendors introduced programmable shaders, AMD and NVIDIA intro-
duced support for OpenCL and CUDA, respectively. These C-like parallel programming frame-
works provide a Software Development Kit (SDK) that includes a rich set of APIs and compil-
ers/runtimes/drivers. In this thesis we use programs written in OpenCL [13].
OpenCL is an emerging framework for programming heterogeneous devices. It is an
industry standard maintained by Khronos, a non-profit technology consortium. OpenCL has seen
increasing adoption by major vendors in industry, including Apple, AMD, ARM,
NVIDIA, Intel, Imagination Technologies, Qualcomm and S3. OpenCL provides a number of
abstraction models, allowing the model to be applied to a wide range of system architectures. In
the OpenCL terminology, the GPU is referred to as the device and the CPU as the host. Figure 1.2
presents an overview of the OpenCL programming model.

Figure 1.2: OpenCL Programming Model and Evergreen Hardware Architecture. a) Elements defined
in the OpenCL programming model: work-items running the same code form work-groups, which in
turn compose the whole ND-Range. b) Simplified block diagram of the Radeon HD 5870 hardware
architecture; this GPU belongs to the Evergreen family of AMD devices.
1.1.2 The OpenCL Platform Model
The platform model for OpenCL consists of a host connected to one or more OpenCL
devices. Each device consists of one or more compute units (CU) and each compute unit further
consists of one or more Processing Elements (PE). Within a device, the computations are performed
within the processing elements. An OpenCL application runs on the host and sends commands to the
device to be executed by the processing elements. OpenCL’s programming model emphasizes parallel
processing by assuming a Single Program Multiple Data (SPMD) paradigm, in which a piece of
code (called a kernel) maps to multiple subsets of input data, creating a massive number of parallel
threads.
The host program is the starting point for the OpenCL program and executes on the CPU.
The device kernel is written in OpenCL. In most cases, the OpenCL kernel runs on a GPU device
and is usually compiled during the execution of the OpenCL host program.
An instance of the kernel executing in a Processing Element is called a work-item and is
identified by a global ID. Each work-item executes the same code, but the specific execution path
can vary per work-item by querying its ID.
Work-items are organized into work-groups. Work-groups are assigned a work-group ID.
Work-items within a work-group are also assigned a unique local ID. A set of work-groups in turn
form an ND-Range, which is a grid of work-item groups that share a common global memory space.
1.1.3 Architecture of the Evergreen family of GPUs
In this thesis, we have worked with the AMD Evergreen family of GPUs to evaluate
soft-error resiliency in GPUs. The Evergreen family was an earlier flagship GPU developed by
AMD. This device was designed to target general-purpose data-intensive applications, along with
the primary graphics applications. While the Evergreen devices are a few years old, they support
OpenCL execution.
Figure 1.2b shows the general systems architecture of the Radeon HD 5870 GPU, a popular
GPU in the Evergreen family. The GPUs in this family have computational units called compute
units that can take advantage of data parallelism.
The Radeon 5870 has 20 compute units. Each compute unit has 16 stream cores. Each
stream core in a compute unit is devoted to the execution of one instance of an OpenCL kernel. The
stream cores also have access to a 32KB local data storage. Additionally, each stream core has 5
processing elements that execute the machine instructions.
Interestingly, the stream cores in Evergreen are time-multiplexed in 4 slots. This gives
the illusion that each stream core is running 4 different work-items simultaneously. Furthermore, the
Evergreen architecture has support for 5-way Very Long Instruction Word (VLIW) bundles of
arithmetic instructions. The hardware support is provided in each stream core in the form of the 5
processing elements, labeled x, y, z, w and t. As a result, the Radeon 5870 GPU has the ability to
issue up to 5 floating-point operations in one cycle.
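As a quick sanity check on the figures quoted in this subsection, the chip-wide totals multiply out as follows (illustrative arithmetic only, using the numbers stated in the text):

```python
# Radeon HD 5870 resources, multiplied out from the figures in the text.
COMPUTE_UNITS = 20        # compute units per chip
STREAM_CORES = 16         # stream cores per compute unit
PROCESSING_ELEMENTS = 5   # VLIW lanes (x, y, z, w, t) per stream core

total_pes = COMPUTE_UNITS * STREAM_CORES * PROCESSING_ELEMENTS
assert total_pes == 1600  # 1,600 processing elements chip-wide

# If every stream core issues a full 5-way VLIW bundle each cycle, the
# chip can retire up to 1,600 floating-point operations per cycle.
peak_fp_ops_per_cycle = COMPUTE_UNITS * STREAM_CORES * 5
assert peak_fp_ops_per_cycle == total_pes
```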
When an OpenCL kernel is launched on an Evergreen GPU, the ND-Range is initially
transferred to it. A dispatcher processes the ND-Range and assigns work-groups to any of the
available compute units in any order. Each compute unit contains a set of 16 stream cores, each
devoted to the execution of one work-item. All stream cores within the compute unit have access to a
common local data storage (LDS), used by the work-items to share data at the work-group level. The
LDS is the implementation of the local memory concept as defined in OpenCL. Finally, each stream
core contains 5 processing elements to execute Evergreen machine instructions in a work-item, plus
a file of general-purpose registers, which provides the support for the private memory concept as
defined in OpenCL.
The GPU memory hierarchy is divided into three memory scopes: 1) private memory (the
register file), 2) local memory (the local data storage), and 3) global memory. Access to each memory
scope is defined by software.
In this thesis, we focus on the vulnerability of the first two memory scopes in the GPU, the
Local Data Storage and the Vector Register File. These structures represent a large portion of a GPU
chip and can be directly addressed by a programmer. It is crucial for a programmer to understand
how to use these memory scopes when resilience is critical to an application. We provide a brief
description of each of these two structures in the following paragraphs.
1.1.4 Register File
The register file of a compute unit can be considered its private memory as defined by the
OpenCL programming model. The register file provides each work-item in a work-group that is
mapped to a compute unit at a given time with its own private copy of register values. The register file
can be accessed and modified by specific instructions, the Arithmetic Logic Unit (ALU) instructions
and the TEX (fetch through a texture cache) instructions during their read or write stages of execution.
The register file in a GPU is significantly larger than that of a CPU. Moreover, since GPUs
are throughput-oriented devices, they can usually have hundreds of threads (work-items) running
concurrently. GPUs utilize fine-grained scheduling among the individual threads to hide latencies
which can be associated with memory operations and dependencies from these threads. Having such
a large number of threads and being a throughput-oriented device, a GPU needs to have dedicated
hardware to support each running thread in the device. This explains the motivation for a large
register file (about 64 KB in the Radeon 5870).
1.1.5 Local Data Storage (LDS)
As previously explained, the register file provides private memory for the individual threads
running on a GPU. Each thread is provided with its own set of separate registers. Communication
between individual threads through the register file is therefore not allowed. However, the OpenCL
programming model supports the sharing of data between work-items within a work-group.
The GPU uses local memory in order to support this feature. Each compute unit contains
one local memory module that is accessible by all work-items that are running in the work-group.
The functionality of local memory is different from that of a CPU cache. In a GPU, data in local
memory is manipulated using explicit instructions, and the size of the local memory is comparable to
the register file size (32 KB in the Radeon 5870).
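A minimal Python model may make the contrast concrete: each work-item holds values in private registers, one LDS array is shared per work-group, and the classic LDS use case is a tree reduction. The function `workgroup_sum` is invented for illustration and only sketches the data flow (a real kernel would insert a barrier between LDS steps):

```python
# Illustrative model of the memory scopes described above: private
# registers per work-item, one LDS per work-group, one global memory.

def workgroup_sum(global_mem, group_id, local_size):
    """Model one work-group summing its slice of global memory via the LDS."""
    lds = [0] * local_size                     # local memory: shared per group
    base = group_id * local_size
    for local_id in range(local_size):         # each work-item...
        private = global_mem[base + local_id]  # ...loads into a private register
        lds[local_id] = private                # ...then stores it to the LDS
    stride = local_size // 2
    while stride > 0:                          # tree reduction inside the LDS
        for local_id in range(stride):         # (a barrier would sit here)
            lds[local_id] += lds[local_id + stride]
        stride //= 2
    return lds[0]                              # work-item 0 holds the result

data = list(range(8))
assert workgroup_sum(data, group_id=0, local_size=4) == 0 + 1 + 2 + 3
assert workgroup_sum(data, group_id=1, local_size=4) == 4 + 5 + 6 + 7
```

Because the LDS belongs to one compute unit, this sharing works only within a work-group; work-items in different work-groups would have to communicate through global memory.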
1.2 Contributions of this Thesis
The contributions of this thesis include:
1. We present a reliability study of the vector register file and the local data share of a GPU.
2. We simulate the presence of single-bit faults using fault injection and carry out a study by
simulating GPU workloads on a cycle-based simulation model of the AMD Radeon 5870.
3. We provide a characterization of the resiliency of a suite of OpenCL kernels to the effects of
particle-induced faults.
4. We observe how the vulnerability of applications changes over time and provide insights that
can be used by application developers to reduce vulnerability.
1.3 Organization of the Thesis
This thesis is organized as follows. In Chapter 2 we discuss prior work on reliability
modeling. In Chapter 3 we review the limited prior work on GPU reliability. Chapter 4 describes
the framework we use for our fault injection experiments, as well as the details of our fault model.
We also discuss the Architectural Vulnerability Factor of the applications that we use. Chapter 5
provides the results of our simulation study, Chapter 6 summarizes lessons learned in this thesis, and
discusses directions for future work.
Chapter 2
Background
In this chapter, we provide background information on Soft Errors and the methods used to
deal with these errors. We discuss techniques and paradigms used at the architectural level to assess
the error rate of a processor, then we discuss recent reliability work for general purpose GPUs.
2.1 Soft Error Overview
Soft errors are intermittent malfunctions of the hardware that cannot be reproduced. Soft
errors are dynamic and are changes to a cell’s contents, rather than a change in the circuitry. They
are caused by single event upsets (SEUs) which are most often the result of particle strikes on silicon
devices. Among the most common particles that produce Soft Errors are neutrons from cosmic rays
and alpha particles from packaging materials.
When these strikes occur, the particles are able to inject charge into the devices which can
alter values in the devices. Each cell in a device has a minimum charge needed to change the stored
value in the cell. This minimum charge is called the critical charge (Qcrit) for that cell. Following a
particle strike, if the accumulated charge exceeds the critical charge of the cell, a Soft Error occurs.
In short, particle strikes which generate a charge higher than Qcrit will cause a Soft Error.
2.1.1 Faults vs Errors
A fault is an undesired state change in hardware. A fault in a particular layer in the
computing stack may propagate to the next layer. The undesired state change in the next layer is
termed an error. In this thesis, we use the term transient faults for the soft errors defined in Section 2.1.
When a transient fault occurs in a bit, this bit can be overwritten to remove the fault. When the bit is
not overwritten, the incorrect state that results from this fault is termed an error.
Errors can be classified based on their impact on the system. We identify Correctable
Errors, Detected Unrecoverable Errors, and Silent Data Corruptions.
2.1.2 Correctable Errors (CE)
Correctable Errors are errors from which the system is able to recover and return to
normal operation. This is usually made possible through either hardware or software. Because the
system is able to recover from the effect of these errors, they are usually not a cause of concern.
Many vendors, however, use the reported rate of Correctable Errors as a warning that a system may
have an impending hardware problem [15].
2.1.3 Detected Unrecoverable Errors (DUE)
Detected Unrecoverable Errors are errors that are discovered by a program, the
operating system, or the hardware. These errors are typically reported to the system and very often the
system cannot recover. They often cause a system to crash.
2.1.4 Silent Data Corruptions (SDC)
Silent Data Corruptions are errors that alter data in a system without being
detected, and ultimately permanently corrupt program state or user data. Because they can cause a
program to produce incorrect results without the user’s knowledge, these are the most undesirable
errors.
2.2 Transient Fault - Background and Terminology
In order to deal with transient faults, microprocessor vendors often establish an error
budget for each design. Designers then perform extensive analysis to ensure that a design meets these
target budgets. Vendors express their error budget in terms of Mean Time Between Failures (MTBF).
For example, for its Power4 processor-based systems, IBM targets 1,000 years system MTBF for
SDC errors, 25 years system MTBF for DUE errors that lead to system crash and 10 years system
MTBF for DUE errors that lead to application crash [16].
Another commonly used measurement unit for error rates is Failures in Time or FIT, which
is inversely related to MTBF. One FIT corresponds to one failure in a billion hours. Therefore, 1,000
years MTBF equals 114 FIT (10^9/(1000 × 365 × 24) ≈ 114). Similarly, zero FIT means that there is
infinite time between failures (infinite MTBF). Designers prefer to work with FIT as opposed to
MTBF because it is an additive unit, unlike MTBF.
To evaluate whether a chip meets its FIT target, designers use sophisticated computer
models. The effective FIT rate of a structure in a chip is the product of two metrics, its raw circuit
FIT rate and its vulnerability factor.
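Both relations can be checked numerically; the raw FIT rate and vulnerability factor in the second half of this sketch are invented values for illustration only:

```python
# MTBF <-> FIT conversion, and effective FIT = raw FIT x vulnerability factor.
HOURS_PER_YEAR = 365 * 24  # 8,760

def mtbf_years_to_fit(mtbf_years):
    # 1 FIT = 1 failure per 10^9 device-hours
    return 1e9 / (mtbf_years * HOURS_PER_YEAR)

# IBM's 1,000-year SDC budget corresponds to roughly 114 FIT.
assert round(mtbf_years_to_fit(1000)) == 114

# Hypothetical structure: a raw circuit FIT of 200 derated by an overall
# vulnerability factor of 0.2 (both values assumed for this example).
raw_fit = 200.0
vulnerability_factor = 0.2
effective_fit = raw_fit * vulnerability_factor
assert effective_fit == 40.0
```

Note that FIT rates of individual structures add up to a chip-level FIT rate, which is why designers prefer FIT over MTBF.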
2.2.1 Raw Circuit FIT Rate
The raw circuit FIT rate of a cell, also called its intrinsic FIT rate, is its device-level transient
fault rate. It includes any extra derating, such as the derating that may be necessary for a dynamic cell.
2.2.2 Vulnerability Factor
Vulnerability factor (also called derating factor or soft error sensitivity factor) is an in-
dication of the probability that an internal fault in a device will result in an externally-visible
error.
Several vulnerability factors affect the FIT rate of a structure. Timing vulnerability factor
for example measures the percentage of time a fault in a structure will lead to an externally-visible
error. A strike on the stored bit of a level-sensitive latch may not cause an external error if the strike
occurs while the latch is accepting data, since the stored bit will be overwritten by the incoming data.
Assuming that the latch is receiving data 50% of the time, its Timing Vulnerability Factor is 50%.
Several other vulnerability factors affect the effective FIT rate of a structure. However, in
this work, we are assuming that all vulnerability factors, except the architectural vulnerability factor
are incorporated into the raw circuit FIT rate.
2.3 Architectural Vulnerability Factor
The Architectural Vulnerability Factor of a structure is defined as the percentage of bits in
the structure that are necessary for correct program execution over the lifetime of a simulation. It
expresses the probability that a bit flip in the structure will produce a visible incorrect result at the
output of a program.
Current predictions show that the overall raw FIT rate per bit will remain constant for
the next several technology generations. Therefore, it is crucial to focus efforts on reducing the
architectural vulnerability factor of a chip to make the chip more reliable and competitive.
There are two common methodologies for assessing AVF in silicon devices: fault injection
and ACE analysis [17][18][9]. These two methods help designers analyse the AVF of an architecture
in various stages of the design process.
2.3.1 Fault Injection
Fault injection is the most widespread method for assessing reliability. A fault injection
campaign compares the reference behavior of the circuit for a given workload (that is, the correct
behavior validated by the designer) with the behavior obtained in the presence of each fault in a
predetermined set [19].
In a fault injection campaign, a fault is injected in a hardware structure at a random time
and at a random location, while a workload is being executed on the device under test. The output of
the workload is then examined against a golden output to determine whether the injected fault caused
a visible failure. This process is then repeated a number of times and as the number of runs becomes
statistically significant, the ratio of the number of failing runs to the total number of runs will
converge towards the Architectural Vulnerability Factor of the structure. Hardware-implemented
fault injection and software-implemented fault injection are the most common approaches used to
perform fault injection.
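The campaign loop described above can be sketched with a toy workload (everything here, including the 8-bit accumulator and the function names, is invented for illustration; the thesis itself injects into a detailed GPU simulator):

```python
import random

# Toy "workload" with an injectable 8-bit accumulator. Only the low nibble
# of the accumulator reaches the output, so faults in the high four bits
# are masked while faults in the low four bits are visible.
def run_workload(data, flip_at=None, flip_bit=0):
    acc = 0
    for step, value in enumerate(data):
        acc = (acc + value) & 0xFF
        if step == flip_at:              # inject: flip one bit of live state
            acc ^= (1 << flip_bit)
    return acc & 0x0F                    # program output

def avf_estimate(data, num_injections, seed=0):
    rng = random.Random(seed)
    golden = run_workload(data)          # fault-free reference output
    failures = 0
    for _ in range(num_injections):
        t = rng.randrange(len(data))     # random time...
        b = rng.randrange(8)             # ...and random location, as in the text
        if run_workload(data, flip_at=t, flip_bit=b) != golden:
            failures += 1                # visible failure: output mismatch
    return failures / num_injections     # converges toward the AVF

avf = avf_estimate([3, 1, 4, 1, 5, 9, 2, 6], num_injections=2000)
assert 0.4 < avf < 0.6
```

In this toy, roughly half of the injected bit flips are masked by the output truncation, so the estimate converges near 0.5; the failing-runs-to-total-runs ratio is exactly the quantity the text describes.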
2.3.1.1 Hardware-implemented Fault Injection
In this method, faults are inserted into the actual device silicon either by using dedicated
custom hardware [20] or by injecting the faults into integrated circuits using heavy-ion radiation [21].
Because Hardware Fault Injection is done in actual hardware, there is no need to know the
internal details of the hardware and it really mimics what happens in real systems. It is therefore
very accurate; the effects of the operating system, the latency of I/O operations, and other
non-deterministic effects are already taken into account. Furthermore, since injections are done in the
actual hardware that is running the workloads, a fault injection campaign takes significantly less time
than software-implemented fault injection.
However, the disadvantages of a hardware fault injection campaign make it difficult to
adopt. First, hardware fault injection must be done post-silicon, since it requires at least a hardware
prototype. This is usually too late, considering that such reliability analysis is often needed during the
architectural exploration phase of a design. Nevertheless, the results can help guide reliability decisions
for future devices that use a similar technology or architecture. Second, it is very expensive and
time-consuming to build dedicated custom hardware or to expose a chip to a radiation beam.
2.3.1.2 Software-implemented Fault Injection
In software fault injection, faults are injected into a simulated model of the hardware under test.
Because this is done in a software implementation of the hardware, it can be done in a performance
simulator which is usually available during the architectural exploration phases of a microprocessor
design project. Therefore, the results of a software-implemented fault injection campaign can be used
to influence the design of a new chip. Moreover, since we are using a software implementation of
the hardware, we naturally have more visibility into the internals of the architecture under test. One
drawback of the software fault injection method is that simulation tends to be very slow compared to
the execution of a workload on the native hardware.
2.3.2 ACE Analysis
ACE analysis was first developed by Mukherjee et al. [9] to calculate the Architectural
Vulnerability Factor (AVF) of pipeline structures such as the instruction queue and the reorder buffer.
Traditional ACE analysis is implemented in simulation and determines the AVF of hardware
structures by executing a single pass through a program.
In ACE analysis, the AVF of hardware structures is estimated by tracking the hardware
state bits that are required for Architecturally Correct Execution (ACE). If any fault occurs in a
storage cell containing these ACE bits, and if there is no error correction technique present on the
system, there will be a visible error in the output of the program. The remaining state bits that are
not ACE are called un-ACE bits; they are not required for architecturally correct execution of the
program and a fault in a storage cell containing an un-ACE bit will not cause a visible error at the
output of the program.
The AVF for a single-bit storage cell is the fraction of time it holds an ACE bit. Conse-
quently, the AVF for a hardware structure is the average AVF of its storage cells. ACE analysis
on a structure starts by conservatively assuming that all bits in the structure are ACE bits, then
proceeds to identify bits that can be marked as un-ACE. Un-ACE bits can be categorized as either
architectural or microarchitectural un-ACE bits. Examples of architectural un-ACE bits include
bits from NOP instructions, performance-enhancing instructions (e.g., prefetches), predicated-false
instructions, dynamically-dead code, and logical masking. Examples of microarchitectural un-ACE
bits are idle or invalid bits, mis-speculated bits (wrong-path instructions or predictor structure bits),
and microarchitecturally dead bits.
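The definition above (the AVF of a cell is the fraction of time it holds an ACE bit) lends itself to a simple computation over an access trace. The sketch below, with an invented write/read trace, treats a cell as ACE from each write up to the last read before the next write, which is the standard single-bit liveness argument:

```python
def ace_fraction(events, total_cycles):
    """events: sorted list of (cycle, 'W' or 'R') for one storage cell.
    The cell is ACE from a write until the last read before the next write;
    everything else (dead or idle time) is un-ACE."""
    ace = 0
    last_write = None
    last_read = None
    for cycle, kind in events:
        if kind == 'W':
            if last_write is not None and last_read is not None:
                ace += last_read - last_write   # value was live over this span
            last_write, last_read = cycle, None
        else:  # 'R'
            last_read = cycle
    if last_write is not None and last_read is not None:
        ace += last_read - last_write
    return ace / total_cycles

# Hypothetical trace: written at cycle 10, last read at 40; rewritten at 60, read at 70.
trace = [(10, 'W'), (25, 'R'), (40, 'R'), (60, 'W'), (70, 'R')]
print(ace_fraction(trace, total_cycles=100))   # (40-10 + 70-60) / 100 = 0.4
```

Averaging this fraction over all cells of a structure gives the structure's AVF, as described above.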
Because ACE analysis is conservative by construction, the AVF value it produces often
overestimates the true vulnerability. It has been shown that even
a refined ACE analysis can overestimate the error vulnerability of a structure by 2-3x [22]. This
can result in overprotection of the structure, which makes a processor uncompetitive. Furthermore,
although ACE analysis gives more insight into the resilience of a structure, performing ACE analysis
on certain structures can be a very involved process.
Chapter 3
Transient Faults on GPUs
In this chapter, we review previous studies on the effects of transient faults on GPUs. We
also consider the tools that have been developed to study these effects. The previous studies can be
categorized into fault injection studies and ACE analysis studies.
3.1 Fault Injection Studies
Fault injection evaluates the impact of introducing a fault into the execution of a program.
The injection can be performed on live hardware (typically via radiation beam testing or a
software-based injector) or in a software-simulated microprocessor or memory system.
3.1.1 GPU-Qin
GPU-Qin [23] is a fault injection tool for GPUs. The tool is built to perform fault injection
studies on real GPUs running CUDA-based applications. It uses CUDA-GDB, the NVIDIA tool for
debugging GPU applications. The applications are first profiled, and then instructions are selected as
fault injection sites. At runtime, GPU-Qin injects a fault into the selected instructions.
The results of a fault injection campaign with GPU-Qin [23] showed that some applications
possess inherent resilience to transient faults. This should be taken into account
when protecting an application against transient faults. In addition, there was a wide variation in
the rates of Silent Data Corruptions (SDC) and crashes across the studied applications. However,
benchmarks with similar characteristics showed similar vulnerability behavior. For example, HashGPU-
sha1 and HashGPU-md5, which are respectively the SHA1 and MD5 hash implementations from StoreGPU [24], a
library that accelerates a number of hash-based primitives, exhibited similar vulnerability.
The variability in SDC rates across GPU applications is mainly related to the
applications' characteristics. For example, applications based on search algorithms are likely to have
lower SDC rates than applications that perform numerical computations such as linear algebra, because
a fault that affects a part of the search that will not lead to a match is unlikely to produce an
incorrect result. Applications based on "average out" algorithms, such as stencil codes [25], also have
a low SDC rate.
These applications have computations in which the final state is a product or average of
multiple temporary states. Because the final state will be an average or product of temporary states, a
fault affecting an intermediate state is likely to be masked and unlikely to affect the final state.
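A toy numerical experiment (not from the thesis) illustrates why averaging masks faults: a single-bit upset in the low-order mantissa of one intermediate value barely moves the final average.

```python
import struct

def flip_bit(x, bit):
    """Flip one bit of a double's IEEE-754 representation."""
    (i,) = struct.unpack('<Q', struct.pack('<d', x))
    (y,) = struct.unpack('<d', struct.pack('<Q', i ^ (1 << bit)))
    return y

values = [float(v) for v in range(1, 1001)]   # 1000 intermediate results
golden = sum(values) / len(values)            # fault-free average

faulty = values.copy()
faulty[123] = flip_bit(faulty[123], 20)       # flip a low-order mantissa bit
avg = sum(faulty) / len(faulty)

print(abs(avg - golden) / golden)             # tiny relative error
```

A flip in a high exponent bit would, of course, not be masked; the point is that for averaging computations most single-bit perturbations are negligible at the output.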
Another focus of the fault injection study with GPU-Qin was a technique to cluster
applications into five resilience categories based on their SDC rates. Because of the variability in
SDC rates across applications, and the similarity in resilience among similar algorithms, the
GPU-Qin authors found it useful to categorize the applications based on their SDC rate and the
operations they perform. From this clustering, each resilience category matched well
with one or more of the dwarfs defined by Asanovic et al. [26].
3.1.2 SASSIFI
SASSIFI [27] is another fault injection tool for NVIDIA GPUs. SASSIFI is based on
SASSI, a low-level, compiler-based assembly-language instrumentation framework that allows the
injection of code at specific points in a program [28]. SASSIFI injects faults in the destination values
of executing instructions of a running program at the architectural level. This allows for faster fault
injection, increased visibility into the applications and the possibility for a detailed study and analysis
of the magnitude of Silent Data Corruptions (SDC).
SASSIFI provides the user with the ability to trace an SDC all the way back to the
specific fault that produced it, and also to correlate program properties with program
vulnerabilities, which is key to developing low-cost error mitigation schemes. Because SASSIFI injects
faults at the architecture level (as opposed to the microarchitecture level), fault injection experiments
with SASSIFI can only measure the derating that occurs at the application level.
A fault injection study using SASSIFI and the Rodinia applications evaluated the variation
in the SDC rate of these applications. Further analysis was done on the injected faults that caused
different outcomes and it was observed that fault injection outcomes vary with different kernels of
the same program, and with different invocations of the same kernel.
SASSIFI is similar to GPU-Qin in that it injects single-bit faults into the destination values of
executing instructions of a program. One key difference is that SASSIFI is able to inject faults into
control and predicate registers. A drawback shared by both tools is that, because the instrumentation
instructions need to run on the GPU, code injection may perturb the workload running on the
GPU, possibly altering its behavior. This can lead to inaccuracy in the reported
reliability results.
Finally, it is worth mentioning that SASSIFI and GPU-Qin inject faults at a level above the
microarchitecture. This means that the derating that comes from hardware structures is not taken
into account in the results. Furthermore, the location for an injection is carefully selected in order to
reduce the population size and easily attain statistical significance.
3.1.3 Multi2Sim Fault Injection [1]
In this thesis we utilize the Multi2Sim simulation framework [29] to provide the basis
for our simulation model of the AMD 5870. We leverage prior work by Farazmand et al. [1] to
build our fault injection framework in Multi2Sim [29]. Multi2Sim is a simulation framework for
CPU-GPU heterogeneous computing. We provide a more detailed description on the environment
and infrastructure in Chapter 4.
In the prior work by Farazmand et al. [1], faults are injected into structures of the
microarchitecture. The results of this campaign show that a large number of
resources are not utilized by the GPU, especially for the small applications that were used. This
results in a very low rate of Silent Data Corruptions and crashes. For injections into utilized
resources, the GPU demonstrated high resilience, and in many cases the applications were able to
run to completion without any error in their output.
There were a few interesting implications from this fault injection study. The authors
observed that structures with similar functionality in the CPU and GPU were not necessarily similar
in terms of their vulnerability. In addition, given that very few injections into the register file led to
an error in the program outputs, it makes little sense to dedicate significant resources to protect this
structure. This prior work is the only other fault injection campaign that tries to compute the AVF
values for structures of the GPU.
3.2 ACE Analysis studies
3.2.1 GPGPU-SODA
GPGPU-SODA [30] is a framework to evaluate the vulnerability of a GPU to transient
faults. It is built on GPGPU-Sim, a cycle-accurate, open-source, publicly available simulator.
Using ACE analysis, GPGPU-SODA characterizes the architectural vulnerability factor (AVF)
of the major microarchitectural structures in a Streaming Multiprocessor.
Tan et al.’s study with GPGPU-SODA found that the GPU microarchitecture vulnerability
is highly related to workload characteristics such as the percentage of un-ACE instructions, the per-
block resource requirements and the degree of branch divergence. They also concluded that several
structures are highly susceptible to transient faults, and that the entire GPU should be considered for
protection.
Chapter 4
Methodology
As discussed in Chapter 2, fault injection can be performed on real or simulated
hardware. GPU-Qin and SASSIFI, for example, inject faults into physical GPUs under test.
Fault injection in real hardware yields very accurate results and can be significantly
faster than the alternative.
However, simulation-based fault injection is useful in that it can help GPU designers
explore potential design decisions. The simulator described in this chapter is built for architectural
exploration. Our fault injection mechanism, which is built into Multi2Sim [29], can be used to carry
out fault injection experiments while exploring different microarchitectural, compiler, and runtime
tradeoffs. In this chapter, we describe the framework that we used to perform the fault injection
campaign, as well as the post-experiment analyses.
4.1 Multi2Sim simulation model
We performed our fault injection campaign in an architectural simulator, Multi2Sim [29].
Multi2Sim is an open-source, modular and fully configurable simulation framework for CPU-GPU
computing. Multi2Sim provides a wide range of CPU and GPU choices. The specific framework used
in this thesis leverages a model of the AMD Evergreen family of GPUs. The Evergreen Instruction
Set Architecture has been used in the implementation of AMD's mainstream Radeon™ 5000 and
6000 series of GPUs. The model implemented in Multi2Sim is similar to the Radeon™ HD 5870 GPU.
Multi2Sim supports both functional simulation and architectural (or detailed) simulation
for the Evergreen family of GPUs. Functional simulation provides traces of Evergreen instructions;
architectural or detailed simulation tracks the execution time and architectural state of the GPU
hardware structures. Simulation of a program in the Evergreen model begins with the execution of the
CPU code, i.e., the host code of the OpenCL program. The host code is run using the CPU simulation
module of Multi2Sim. The OpenCL API calls from the host program are intercepted and used to
kick off the GPU simulation.
4.2 Multi2Sim for Fault Injection
The authors of Multi2Sim have introduced the ability to perform fault injection into the
execution of a program [31]. Using Multi2Sim, we are able to assess the reliability of individual
hardware structures. We can inject faults during any cycle of the runtime of any hardware structure
that is modelled by Multi2Sim. The fault injection mechanism is not specific to the Evergreen model,
and therefore can be implemented and applied to different GPU architectures in Multi2Sim. In
addition to the study presented in Chapter 3, the fault injection mechanism in Multi2Sim has been
extended and used in other fault injection studies [32] [33] [1].
When using Multi2Sim to inject a fault in a hardware structure during a simulation, a fault
definition file is fed to the simulator. This fault definition file contains the following information:
a) the targeted hardware structure, b) the specific fault location, and c) the injection time. The fault
location is the position of the bit within the hardware structure where the fault should be injected and
the injection time is the simulation cycle where the injection is performed.
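To illustrate, a fault definition conveying those three pieces of information might look like the following (the layout and field names here are illustrative, not Multi2Sim's exact syntax):

```ini
; Hypothetical fault definition: one bit flip in the vector register file.
[Fault]
Structure = VectorRegisterFile   ; (a) targeted hardware structure
Bit       = 4871                 ; (b) bit position within the structure
Cycle     = 152000               ; (c) simulation cycle of the injection
```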
Figure 4.1: Possible outcomes for each simulation run
A fault is represented by a bit flip in the simulated hardware structure at the specified cycle
and location. The faulty value is either propagated to other locations in the simulation model, or is
masked by the program. The programs used in our experiments have a self-check mechanism that
compares the output of the GPU program to a precalculated golden reference output. The
possible outcomes of a single simulation with a fault injection are presented in Figure 4.1.
For each structure and each application, a total of 10,000 single faults are injected. The
statistical significance of this number of experiments is discussed in Section 4.3. In order to calculate
the Architectural Vulnerability Factor, we compute the number of fault injections that result in a
program failure (SDCs) and divide by the total number of faults injected.
4.3 Statistical Significance
Figure 4.2: This graph shows how the AVF value changes based on the number of fault-injection experiments. We notice that the AVF value shows little variation and stabilizes after 5,000 injections.
Our goal is to statistically estimate the AVF of the workloads running on an AMD Evergreen
GPU. We want to choose a number of simulations that is large enough for our results to achieve
statistical significance, but not so large that performing the experiments becomes burdensome.
To determine this number, we computed the AVF of the structures with a
varying number of injections. The results of this experiment are shown in Figure 4.2. We found that
the AVF value varies significantly with a small number of injections. However, after the number of
injections passes 5,000, there is little variance in the AVF value.
Figure 4.3: Formula for the number of faults to select for injection
We also verify that the number of faults injected into every structure is statistically signifi-
cant using the methodology presented by Leveugle et al. [19]. According to their methodology, given
a confidence level, the sample size n, or number of faults to randomly select for injection, can be
computed with the formula in Figure 4.3. The variables in this formula are:
• N: initial population size. This is the number of all the potential injection sites.
• p: estimated probability of a fault resulting in an error. The authors demonstrated that p = 0.5 is
a sufficient value to use, since it maximizes the required sample size.
• e: margin of error. This is the most sensitive parameter in the formula; reducing it
increases the sample size very quickly. We chose 0.005 as our margin of error.
• t: cut-off point corresponding to the chosen confidence level. We chose a 95% confidence level.
Using this formula, we computed the required sample size for the vector register file of the
GPU while running the ScanLargeArrays workload. ScanLargeArrays runs for less than 7.5 million
cycles. To guarantee a margin of error of less than 0.05%, we would be required to inject 9,800
faults. Given that most of our applications run for less than 7.5 million cycles, and that the number
of potential injection sites in the local memory is far smaller than in the register file, the initial
population size is highest in the case of ScanLargeArrays. Thus, we can argue that
this case gives us the highest error margin. Consequently, with the 10,000 injections used in this
thesis, our results will have a margin of error of less than 0.5%.
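The formula of Leveugle et al. [19] is n = N / (1 + e^2 (N - 1) / (t^2 p (1 - p))), where t is the normal-distribution cut-off for the chosen confidence level (about 1.96 for 95%). The sketch below evaluates it; the population size N used here is made up for illustration and is not one of the thesis's measured values.

```python
import math

def sample_size(N, e, p=0.5, confidence_t=1.96):
    """Leveugle et al. sample-size formula:
    n = N / (1 + e^2 * (N - 1) / (t^2 * p * (1 - p)))"""
    return math.ceil(N / (1 + e**2 * (N - 1) / (confidence_t**2 * p * (1 - p))))

# Illustrative population of candidate (bit, cycle) injection sites.
N = 10_000_000
for e in (0.02, 0.01, 0.005):
    print(f"e = {e:.3f}  ->  n = {sample_size(N, e)}")
```

Note how halving the margin of error roughly quadruples the required sample size once N is large, which is why e is the most sensitive parameter.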
4.4 Post-Experiment Analysis
Fault-injection campaigns very often treat the system under evaluation as a black box.
After a campaign, researchers are left with a single number that measures the
resilience of the system. Our goal, instead, is to offer GPU application developers insight into
the vulnerability of their applications. Data obtained from our fault injection campaign allows us
to perform further analysis on the injected faults throughout the execution of the programs. Because
we record each fault and its outcome at the end of the simulation, we are able to
perform a per-time-interval vulnerability study of the applications. That is, we can track the
vulnerability of an application during the course of its execution. We hope that these results, coupled
with the applications' profiles, will be helpful to programmers when evaluating the vulnerability of
their applications.
4.5 Evaluation Framework
Number of compute units: 1
Number of Stream Cores: 16
Number of Vector Registers: 16384
Number of Memory Banks: 32
Table 4.1: The GPU configuration used in the experiments
4.5.1 Platform for Evaluation
The configuration used for the GPU model used in the experiments is presented in Table 4.1.
We have used the AMD RadeonTM5870 as the base configuration. An overview of the Evergreen
architecture is provided in Figure 1.2.
Because our applications are small compared to the size of the workloads that a GPU can
potentially run, we have used only one compute unit in order to maximize the occupancy of the GPU.
The faults can therefore be injected into usable resources of the GPU. We argue that focusing on the
reliability of a single CU should not significantly impact the fidelity of our AVF values.
4.5.2 Evaluated Benchmarks
The applications are taken from the AMD APP SDK [34] and are common general-
purpose GPU applications. These applications were chosen because they provide a representative
cross-section of common GPU applications. The list of selected applications is shown in Table 4.2.
BNSRCH (Binary Search): finds the position of a given element in a sorted array. Instead of halving the search space at every pass, it is divided into N segments (N'ary search). Computational complexity is log to base N.
BSRT (Bitonic Sort): a sorting network of n log² n comparators. Performs best when sorting a small number of elements.
DCT (Discrete Cosine Transform): a common transform for compression of 1D and 2D signals such as audio, images, and video.
HGRM (Histogram): calculates the histogram of an array.
MMUL (MatrixMultiplication): performs the multiplication of two matrices.
MTRNS (MatrixTranspose): matrix transpose optimized to coalesce accesses to shared memory and avoid bank conflicts.
PSUM (PrefixSum): computes an array of the running totals of the elements of the input array.
RDXS (RadixSort): a radix-based sorting algorithm; keys are treated as multi-digit numbers in which each digit is an integer ranging from 0 to m, where m is the radix.
SLA (ScanLargeArrays): a scan based on a prefix sum, performed block-wise; the blocks are then combined into a single result array.
Table 4.2: The benchmarks used in the experiments
Chapter 5
Results and Analysis
Multiple factors can affect the vulnerability of a structure. Vulnerability can differ based
on the kind of computation a program performs, whether or not the application can tolerate approximate
output values, and the level of occupancy of the structure where the fault was injected. The
microarchitecture can also mask faults: an identical fault injected in an alternative
microarchitecture can lead to very different outcomes.
One factor we examine in this thesis is the liveness of a structure. We will show how a
programmer can reduce the vulnerability of a structure by reducing the liveness of the data it holds.
5.1 Local Data Storage
The results derived from our fault injection study in the Local Data Share are presented
in Figure 5.1. From these results, we observe that some applications do not make use of the LDS
and therefore show no vulnerability. Other applications either partially or fully use the Local Data
Storage (LDS) and show a wide range of vulnerability. Figure 5.2 shows the maximum amount of
local memory used by each application.
Some applications, notably BitonicSort and BinarySearch, do not use the Local Data
Storage in their algorithm; there is no sharing between work-items within a workgroup.
Therefore, any fault injected into the LDS will not have an impact on the visible output of the program.
Because the locations for the bit flips were randomly chosen, and because most of the
applications are small and make minimal use of the LDS, many of the flipped bits fell
into unallocated portions of the LDS. This can be seen in applications that have a high percentage of
Figure 5.1: Results of Fault injection experiments on the Local Data Share
Figure 5.2: Amount of local memory used by each application for each NDRange mapped to thecompute device
NoFaults (benign faults). For example, in PrefixSum, DCT and ScanLargeArrays, a large number
of the faults were injected into non-utilized portions of the LDS.
Other applications, such as Histogram, MatrixMultiplication, RadixSort and
ScanLargeArrays, heavily use the LDS. However, the failure rate for these applications varies widely.
Histogram shows very little resilience to faults, while MatrixMultiplication is highly tolerant
of them. When running MatrixMultiplication, the probability that a fault in the LDS leads to a
visible error in the output matrix is around 1%. This suggests that the inherent resilience of
an application should be taken into account when protecting it from transient faults.
5.2 Register File
Figure 5.3: Results of Fault injection experiments on the General Purpose Register File
The results of the fault injection campaign in the register file of the GPU are shown
Figure 5.4: Number of General-Purpose registers used by each application for each NDRangemapped to the compute device
in Figure 5.3. These results show that the register file of the GPU is a highly resilient structure. A
good portion of the activated faults get masked, and so do not affect the output of the program.
Figure 5.4 shows the maximum number of registers used for each NDRange mapped to a
compute device per benchmark.
5.3 How does vulnerability vary over time?
To understand the inherent resilience of an application, we examine how its vulnerability
varies over time. Our goal is to identify points in time where the application is most (or least)
vulnerable and match them with specific architectural events that can be under
the control of a developer. This should provide insights to a programmer or application designer on
how to make their applications more fault-tolerant.
To identify points in time where an application is most vulnerable, we divide the application's
simulation time into cycle intervals. We then examine all the faults that were injected during each
interval and their outcomes. We also verify that the number of faults injected into each interval is
statistically significant, using the formula from Leveugle et al. [19] presented in Section 4.3. We
estimate that the number of faults needed to obtain 95% confidence in our per-interval results is
about 90. Because the faults are uniformly distributed over time, we are confident that our results
are statistically meaningful. We found some interesting trends which can be beneficial to application
designers. We cannot guarantee that the observed trends will appear in every application; however,
they can offer some guidelines when designing fault-tolerant applications. We present our findings
in the case studies below.
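The per-interval bookkeeping is straightforward. The sketch below (with invented fault records, not our campaign data) buckets each injected fault by its injection cycle and reports the SDC fraction and sample count per interval:

```python
from collections import defaultdict

def interval_avf(faults, interval_len):
    """faults: iterable of (injection_cycle, caused_sdc) pairs.
    Returns {interval_index: (sdc_fraction, sample_count)}."""
    buckets = defaultdict(lambda: [0, 0])        # interval -> [sdc, total]
    for cycle, caused_sdc in faults:
        b = buckets[cycle // interval_len]
        b[0] += int(caused_sdc)
        b[1] += 1
    return {i: (sdc / total, total) for i, (sdc, total) in sorted(buckets.items())}

# Invented campaign: faults in the first interval are masked, later ones mostly not.
faults = [(100, False), (450, False), (1200, True), (1700, True), (1900, False)]
print(interval_avf(faults, interval_len=1000))
```

The sample count per interval is what the statistical-significance check above is applied to.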
5.3.1 Case Study: LDS - RadixSort
The LDS-AVF variation over time for RadixSort is shown in Figure 5.5. This graph shows
that midway through the running of RadixSort, the application is highly vulnerable. Almost any fault
in these intervals will lead to an incorrect result. We conduct some analysis to examine the reasons
behind the high vulnerability of this application.
The RadixSort application has 3 phases: a histogram which runs on the GPU, a scan of the
generated histogram which runs on the CPU and a permute which runs again on the GPU. Figure 5.6
shows the in-flight accesses to the LDS during the course of the running of the application. The
Figure 5.5: Intervals of Vulnerability for RadixSort. This shows that the faults that lead to incorrect output fall only into specific intervals of time
Figure 5.6: Intervals for LDS accesses for RadixSort.
application loops four times, each time operating on a different part of the 32-bit integers, starting
with the least significant.
We see that the first iteration is not vulnerable at all, because this iteration operates
on the low-order bits of the integers. In the other iterations, the histogram kernel is the most vulnerable,
with the third iteration being almost 100% vulnerable.
5.3.2 Case Study: MatrixMultiplication
Figure 5.7: Intervals of Vulnerability for MatrixMultiplication. The vulnerability of MatrixMultiplication shows a periodic behavior
The LDS-AVF time variation for MatrixMultiplication is shown in Figure 5.7. This figure
shows that the vulnerability of the LDS with MatrixMultiplication presents a periodic behavior with
some short intervals of no vulnerability. This is usually explained by the fact that the application is
reading locations of memory in chunks and re-writing other locations. In Figure 5.8, we show how
the accesses to the LDS vary over time. The local memory accesses also show a periodic behavior,
Figure 5.8: Local Memory accesses in MatrixMultiplication show a periodic behavior
with brief spurts of writes followed by longer periods of memory reads. We see that the
intervals of low vulnerability of the LDS fall right before the memory writes. This is because
the application has finished using the values stored in memory right before it initiates the writes
that will overwrite those memory locations. Therefore, a fault in an LDS location right before the
location gets overwritten will not affect the result of the matrix multiplication.
Chapter 6
Conclusion
In this thesis, we conducted a thorough characterization of the effects of particle-induced
errors in the vector register file and the local data share of a GPU from the AMD Evergreen family
of GPUs. Our study shows that the vulnerability of the common GPU applications varies widely
depending on the implementation of the application.
We also identified a few trends that can be exploited by application developers to
make their applications more robust. We observed that some applications show a periodic behavior
in their vulnerability and are most vulnerable at specific intervals during their execution. Overall, the
longer and the more numerous the intervals of vulnerability, the more vulnerable the
application is. One general rule of thumb for a highly resilient application is to reduce the liveness of
useful data, i.e., reduce the time between writing to a structure and reading the written data. Also,
because caches are more likely to be protected, a GPU application developer seeking to make an
application more reliable should consider storing data in global memory, trading off some
performance.
In future work, we plan on developing a more comprehensive vulnerability analysis
framework. A fault injection campaign is a very lengthy process and provides very little detail with
regard to the sources of masking in a hardware structure. A comprehensive reliability framework will
help us identify the sources of vulnerability and provide better insights to both hardware designers
and application developers.
Bibliography
[1] N. Farazmand, R. Ubal, and D. Kaeli, “Statistical fault injection-based analysis of a GPU
architecture,” in Workshop on Silicon Errors in Logic - System Effects (SELSE), 2012.
[2] D. A. Patterson and J. L. Hennessy, Computer Organization and Design: The Hardware/Software
Interface (The Morgan Kaufmann Series in Computer Architecture and Design), 4th ed.
San Francisco, CA, USA: Morgan Kaufmann Publishers Inc., 2008.
[3] G. E. Moore, “Cramming more components onto integrated circuits,” Proceedings of the IEEE,
vol. 86, no. 1, pp. 82–85, Jan 1998.
[4] E. Alerstam, T. Svensson, and S. Andersson-Engels, “Parallel computing with graphics process-
ing units for high-speed Monte Carlo simulation of photon migration,” Journal of Biomedical
Optics, vol. 13, p. 060504, 2008.
[5] M. C. Schatz, C. Trapnell, A. L. Delcher, and A. Varshney, “High-throughput sequence
alignment using graphics processing units,” BMC Bioinformatics, vol. 8, no. 1, pp. 1–10, 2007.
[Online]. Available: http://dx.doi.org/10.1186/1471-2105-8-474
[6] J. E. Stone, J. C. Philips, P. L. Freddolino, D. J. Hardy, L. G. Trabuco, and K. Schulten, “Accel-
erating molecular modeling applications with graphics processors,” Journal of Computational
Chemistry, vol. 28, pp. 2618–2640, 2007.
[7] S. Grauer-Gray, W. Killian, R. Searles, and J. Cavazos, “Accelerating financial applications on
the gpu,” in Proceedings of the 6th Workshop on General Purpose Processor Using Graphics
Processing Units, ser. GPGPU-6. New York, NY, USA: ACM, 2013, pp. 127–136. [Online].
Available: http://doi.acm.org/10.1145/2458523.2458536
36
BIBLIOGRAPHY
[8] J. W. Sheaffer, D. P. Luebke, and K. Skadron, “The visual vulnerability spectrum:
Characterizing architectural vulnerability for graphics hardware,” in Proceedings of
the 21st ACM SIGGRAPH/EUROGRAPHICS Symposium on Graphics Hardware, ser.
GH ’06. New York, NY, USA: ACM, 2006, pp. 9–16. [Online]. Available:
http://doi.acm.org/10.1145/1283900.1283902
[9] S. Mukherjee, C. Weaver, J. Emer, S. Reinhardt, and T. Austin, “A systematic methodology
to compute the architectural vulnerability factors for a high-performance microprocessor,”
in Microarchitecture, 2003. MICRO-36. Proceedings. 36th Annual IEEE/ACM International
Symposium on, Dec 2003, pp. 29–40.
[10] D. T. Stott, B. Floering, D. Burke, Z. Kalbarczpk, and R. K. Iyer, “Nftape: a framework for
assessing dependability in distributed systems with lightweight fault injectors,” in Computer Per-
formance and Dependability Symposium, 2000. IPDS 2000. Proceedings. IEEE International,
2000, pp. 91–100.
[11] J. Aidemark, J. Vinter, P. Folkesson, and J. Karlsson, “Goofi: generic object-oriented fault in-
jection tool,” in Dependable Systems and Networks, 2001. DSN 2001. International Conference
on, July 2001, pp. 83–88.
[12] E. S. Larsen and D. McAllister, “Fast matrix multiplies using graphics hardware,” in Supercom-
puting, ACM/IEEE 2001 Conference, Nov 2001, pp. 43–43.
[13] T. K. G. T. O. Standard, www.khronos.org/opencl.
[14] J. Nickolls, I. Buck, M. Garland, and K. Skadron, “Scalable parallel programming
with cuda,” Queue, vol. 6, no. 2, pp. 40–53, Mar. 2008. [Online]. Available:
http://doi.acm.org/10.1145/1365490.1365500
[15] S. Mukherjee, Architecture Design for Soft Errors. San Francisco, CA, USA: Morgan
Kaufmann Publishers Inc., 2008.
[16] D. Bossen, “Cmos soft errors and server design,” IEEE 2002 Reliability Physics Tutorial Notes,
Reliability Fundamentals, vol. 121, pp. 07–1, 2002.
[17] M.-C. Hsueh, T. K. Tsai, and R. K. Iyer, “Fault injection techniques and tools,” IEEE Computer,
vol. 30, no. 4, pp. 75–82, Apr. 1997.
37
BIBLIOGRAPHY
[18] C. Constantinescu, M. Butler, and C. Weller, “Error injection-based study of soft error
propagation in amd bulldozer microprocessor module.” in DSN, R. S. Swarz, P. Koopman,
and M. Cukier, Eds. IEEE Computer Society, 2012, pp. 1–6. [Online]. Available:
http://dblp.uni-trier.de/db/conf/dsn/dsn2012.html
[19] R. Leveugle, A. Calvez, P. Maistri, and P. Vanhauwaert, “Statistical fault injection: Quantified
error and confidence,” in Design, Automation Test in Europe Conference Exhibition, 2009.
DATE ’09., April 2009, pp. 502–506.
[20] J. Arlat, Y. Crouzet, and J.-C. Laprie, “Fault injection for dependability validation of fault-
tolerant computing systems,” in Fault-Tolerant Computing, 1989. FTCS-19. Digest of Papers.,
Nineteenth International Symposium on, June 1989, pp. 348–355.
[21] J. Karlsson, P. Liden, P. Dahlgren, R. Johansson, and U. Gunneflo, “Using heavy-ion radiation
to validate fault-handling mechanisms,” Micro, IEEE, vol. 14, no. 1, pp. 8–23, Feb 1994.
[22] N. J. Wang, A. Mahesri, and S. J. Patel, “Examining ace analysis reliability estimates using
fault-injection,” in Proceedings of the 34th Annual International Symposium on Computer
Architecture, ser. ISCA ’07. New York, NY, USA: ACM, 2007, pp. 460–469. [Online].
Available: http://doi.acm.org/10.1145/1250662.1250719
[23] B. Fang, K. Pattabiraman, M. Ripeanu, and S. Gurumurthi, “Gpu-qin: A methodology for
evaluating the error resilience of gpgpu applications,” in Performance Analysis of Systems and
Software (ISPASS), 2014 IEEE International Symposium on, March 2014, pp. 221–230.
[24] S. Al-Kiswany, A. Gharaibeh, E. Santos-Neto, G. Yuan, and M. Ripeanu, “Storegpu:
Exploiting graphics processing units to accelerate distributed storage systems,” in Proceedings
of the 17th International Symposium on High Performance Distributed Computing, ser.
HPDC ’08. New York, NY, USA: ACM, 2008, pp. 165–174. [Online]. Available:
http://doi.acm.org/10.1145/1383422.1383443
[25] K. Datta, M. Murphy, V. Volkov, S. Williams, J. Carter, L. Oliker, D. Patterson, J. Shalf, and
K. Yelick, “Stencil computation optimization and auto-tuning on state-of-the-art multicore archi-
tectures,” in 2008 SC - International Conference for High Performance Computing, Networking,
Storage and Analysis, Nov 2008, pp. 1–12.
38
BIBLIOGRAPHY
[26] K. Asanovic, R. Bodik, B. C. Catanzaro, J. J. Gebis, P. Husbands, K. Keutzer, D. A.
Patterson, W. L. Plishker, J. Shalf, S. W. Williams, and K. A. Yelick, “The landscape
of parallel computing research: A view from berkeley,” EECS Department, University
of California, Berkeley, Tech. Rep. UCB/EECS-2006-183, Dec 2006. [Online]. Available:
http://www.eecs.berkeley.edu/Pubs/TechRpts/2006/EECS-2006-183.html
[27] S. K. S. H. Hari, T. Tsai, M. Stephenson, S. Keckler, and J. Emer, “SASSIFI: Evaluating
resilience of gpu applications,” in Workshop on Silicon Errors in Logic - System Effects
(SELSE), 2015.
[28] M. Stephenson, S. K. Sastry Hari, Y. Lee, E. Ebrahimi, D. R. Johnson, D. Nellans,
M. O’Connor, and S. W. Keckler, “Flexible software profiling of gpu architectures,” in
Proceedings of the 42Nd Annual International Symposium on Computer Architecture,
ser. ISCA ’15. New York, NY, USA: ACM, 2015, pp. 185–197. [Online]. Available:
http://doi.acm.org/10.1145/2749469.2750375
[29] R. Ubal, B. Jang, P. Mistry, D. Schaa, and D. Kaeli, “ Multi2Sim: A Simulation Framework for
CPU-GPU Computing ,” in Proc. of the 21st International Conference on Parallel Architectures
and Compilation Techniques, Sep. 2012.
[30] J. Tan, N. Goswami, T. Li, and X. Fu, “Analyzing soft-error vulnerability on gpgpu microarchi-
tecture,” in Workload Characterization (IISWC), 2011 IEEE International Symposium on, Nov
2011, pp. 226–235.
[31] R. Ubal, D. Schaa, P. Mistry, X. Gong, Y. Ukidave, Z. Chen, G. Schirner, and D. Kaeli,
“Exploring the heterogeneous design space for both performance and reliability,” in Proceedings
of the 51st Annual Design Automation Conference, ser. DAC ’14. New York, NY, USA: ACM,
2014, pp. 181:1–181:6. [Online]. Available: http://doi.acm.org/10.1145/2593069.2596680
[32] F. Previlon, M. Wilkening, V. Sridharan, S. Gurumurthi, and D. Kaeli, “Examining the impact
of ace interference on multi-bit avf estimates,” Proceedings of SELSE-8: Silicon Errors in
Logic-System Effects, 2015.
[33] M. Wilkening, V. Sridharan, S. Li, F. Previlon, S. Gurumurthi, and D. R. Kaeli, “Calculating
architectural vulnerability factors for spatial multi-bit transient faults,” in Proceedings of
the 47th Annual IEEE/ACM International Symposium on Microarchitecture, ser. MICRO-47.
39
BIBLIOGRAPHY
Washington, DC, USA: IEEE Computer Society, 2014, pp. 293–305. [Online]. Available:
http://dx.doi.org/10.1109/MICRO.2014.15
[34] “AMD Accelerated Parallel Processing (APP) Software Development Kit (SDK),”
http://developer.amd.com/sdks/AMDAPPSDK.
40