Reducing Cache Contention On GPUs - eGrove
University of Mississippi
eGrove
Electronic Theses and Dissertations Graduate School
2016
Reducing Cache Contention On GPUs
Kyoshin Choo University of Mississippi
Follow this and additional works at: https://egrove.olemiss.edu/etd
Part of the Computer Sciences Commons
Recommended Citation
Choo, Kyoshin, "Reducing Cache Contention On GPUs" (2016). Electronic Theses and Dissertations. 454. https://egrove.olemiss.edu/etd/454
This Dissertation is brought to you for free and open access by the Graduate School at eGrove. It has been accepted for inclusion in Electronic Theses and Dissertations by an authorized administrator of eGrove. For more information, please contact [email protected].
REDUCING CACHE CONTENTION ON GPUS
A Dissertation
presented in partial fulfillment of requirements
for the degree of Doctor of Philosophy
in the Department of Computer and Information Science
The University of Mississippi
by
Kyoshin Choo
August 2016
Copyright Kyoshin Choo 2016
ALL RIGHTS RESERVED
ABSTRACT
The usage of Graphics Processing Units (GPUs) as an application accelerator has become
increasingly popular because, compared to traditional CPUs, they are more cost-effective, their
highly parallel nature complements a CPU, and they are more energy efficient. With the popu-
larity of GPUs, many GPU-based compute-intensive applications (a.k.a., GPGPUs) present sig-
nificant performance improvement over traditional CPU-based implementations. Caches, which
significantly improve CPU performance, are introduced to GPUs to further enhance application
performance. However, the effect of caches is not significant for many cases in GPUs and even
detrimental for some cases. The massive parallelism of the GPU execution model and the resulting
memory accesses cause the GPU memory hierarchy to suffer from significant memory resource
contention among threads.
One cause of cache contention arises from column-strided memory access patterns that
GPU applications commonly generate in many data-intensive applications. When such access
patterns are mapped to hardware thread groups, they become memory-divergent instructions whose
memory requests are not GPU hardware friendly, resulting in serialized access and performance
degradation. Cache contention also arises from cache pollution caused by lines with low reuse.
For the cache to be effective, a cached line must be reused before its eviction. Unfortunately, the
streaming characteristic of GPGPU workloads and the massively parallel GPU execution model
increase the reuse distance, or equivalently reduce reuse frequency of data. In a GPU, the pollution
caused by data with a large reuse distance is significant. Memory request stall is another contention
factor. A stalled Load/Store (LDST) unit does not execute memory requests from any ready warps
in the issue stage. This stall forfeits the cache hits that those ready warps might otherwise obtain.
This dissertation proposes three novel architectural modifications to reduce the contention:
1) contention-aware selective caching detects the memory-divergent instructions caused by the
column-strided access patterns, calculates the contending cache sets and locality information and
then selectively caches; 2) locality-aware selective caching dynamically calculates the reuse fre-
quency with efficient hardware and caches based on the reuse frequency; and 3) memory request
scheduling queues the memory requests from a warp issuing stage, frees the LDST unit stall and
schedules items from the queue to the LDST unit by multiple probing of the cache. Through sys-
tematic experiments and comprehensive comparisons with existing state-of-the-art techniques, this
dissertation demonstrates the effectiveness of our aforementioned techniques and the viability of
reducing cache contention through architectural support. Finally, this dissertation suggests other
promising opportunities for future research on GPU architecture.
ACKNOWLEDGEMENTS
First and foremost, I would like to thank my wife, Jihyun, for fully and pleasantly support-
ing me with great patience and believing what I am passing through after I quit my job. I could
not have done any of this long journey without her. Second, I would like to thank my children for
their trust, understanding and always being with me to remind me of what is most important. My final
personal thanks go to my parents and parents-in-law for praying, always supporting me in every
aspect and encouraging me not to give up and to achieve my goal.
Professionally, I would like to thank my advisor Professor Byunghyun Jang for guiding me
with the right direction and pushing me to pursue high-quality research, giving me the freedom to
explore my own ideas and helping me develop them. His guidance over the last four years has been
invaluable and has led me to this goal. Special thanks to David Troendle for being a great colleague.
David’s depth of industrial knowledge and insightful thought has helped all the research I have
done. I also thank all of the HEROES (HEteROgEneous Systems research) lab members, Tuan
Ta, Esraa Abdelmageed, Chelsea Hu, Ajay Sharma, Andrew Henning, Md. Mainul Hassan, Oreva
Addoh, Mengshen Zhao, Leo Yi, Michael Ginn, Blake Adams, and Sampath Gowrishetty for their
help and invaluable discussion and comments. I would also like to thank Professor Jang’s former
colleagues, Dana Schaa and Rafael Ubal for giving insightful comments on the simulator.
I would also like to thank the committee members of my PhD qualifying exam and final
PhD dissertation defense: Professor Philip J. Rhodes, Professor Feng Wang and Professor Robert
J. Doerksen for their direction and suggested improvements to my dissertation. Last but not least,
I would like to thank all the faculty members of Computer and Information Science department
at the University of Mississippi: Professor Dawn E. Wilkins, Professor H. Conrad Cunningham,
Professor Yixin Chen, Professor J. Adam Jones, Professor P. Tobin Maginnis, former Professor
Jianxia Xue, Ms. Kristin Davidson, and Ms. Cynthia B. Zickos for their academic contribution
and friendly support.
To my wife Jihyun and our children, Esther, Samuel and Gracie.
TABLE OF CONTENTS
ABSTRACT
ACKNOWLEDGEMENTS
LIST OF FIGURES
LIST OF TABLES
LIST OF ABBREVIATIONS
INTRODUCTION
  1.1 Research Challenges
  1.2 Research Contribution
  1.3 Organization
BACKGROUND
  2.1 Summary of Terminology Usage
  2.2 Programming GPUs
  2.3 Abstract of GPU Architecture
  2.4 Warp Scheduling
  2.5 Modern GPU Global Memory Accesses
    2.5.1 Memory Hierarchy
    2.5.2 Memory Access Handling
    2.5.3 Memory Access Characteristics
CACHE CONTENTION
  3.1 Taxonomy of Memory Access Locality
  3.2 Taxonomy of Cache Contention
    3.2.1 Cache Miss Contention Classification
    3.2.2 Cache Resource Contention Classification
  3.3 Cache Contention Factors
    3.3.1 Limited Cache Resource
    3.3.2 Column-Strided Accesses
    3.3.3 Cache Pollution
    3.3.4 Memory Request Stall
CONTENTION-AWARE SELECTIVE CACHING
  4.1 Introduction
  4.2 Intra-Warp Cache Contention
    4.2.1 Impact of Memory Access Patterns on Memory Access Coalescing
    4.2.2 Coincident Intra-warp Contention Access Pattern
  4.3 Selective Caching
    4.3.1 Memory Divergence Detection
    4.3.2 Cache Index Calculation
    4.3.3 Locality Degree Calculation
  4.4 Experiment Methodology
    4.4.1 Simulation Setup
    4.4.2 Benchmarks
  4.5 Experimental Results
    4.5.1 Performance Improvement
    4.5.2 Effect of Warp Scheduler
    4.5.3 Cache Associativity sensitivity
  4.6 Related Work
    4.6.1 Cache Bypassing
    4.6.2 Memory Address Randomization
  4.7 Summary
LOCALITY-AWARE SELECTIVE CACHING
  5.1 Introduction
  5.2 Motivation
    5.2.1 Severe Cache Resource Contention
    5.2.2 Low Cache Line Reuse
  5.3 Locality-Aware Selective Caching
    5.3.1 Reuse Frequency Table Design and Operation
    5.3.2 Threshold Consideration
    5.3.3 Algorithm Features
    5.3.4 Contention-Aware Selective Caching Option
  5.4 Experiment Methodology
    5.4.1 Simulation Setup
    5.4.2 Benchmarks
  5.5 Experimental Results
    5.5.1 Performance Improvement
    5.5.2 Effect of Warp Scheduler
    5.5.3 Effect of Cache Associativity
  5.6 Related Work
    5.6.1 CPU Cache Bypassing
    5.6.2 GPU Cache Bypassing
  5.7 Summary
MEMORY REQUEST SCHEDULING
  6.1 Introduction
  6.2 Cache Contention
    6.2.1 Memory Request Stall due to Cache Resource Contention
  6.3 Memory Request Scheduling
    6.3.1 Memory Request Queuing and Scheduling
    6.3.2 Queue Depth
    6.3.3 Scheduling Policy
    6.3.4 Contention-Aware Selective Caching Option
  6.4 Experiment Methodology
    6.4.1 Simulation Setup
    6.4.2 Benchmarks
  6.5 Experimental Results
    6.5.1 Design Evaluation
    6.5.2 Performance Improvement
    6.5.3 Effect of Warp Scheduler
    6.5.4 Effect of Cache Associativity
  6.6 Conclusion
RELATED WORK
  7.1 Cache Bypassing
    7.1.1 CPU Cache Bypassing
    7.1.2 GPU Cache Bypassing
  7.2 Memory Address Randomization
  7.3 Warp Scheduling
  7.4 Warp Throttling
  7.5 Cache Replacement Policy
CONCLUSION AND FUTURE WORK
  8.1 Conclusion
  8.2 Future Work
    8.2.1 Locality-Aware Scheduling
    8.2.2 Locality-Aware Cache Replacement Policy
PUBLICATION CONTRIBUTIONS
BIBLIOGRAPHY
VITA
LIST OF FIGURES
2.1 A GPU kernel execution flow example - A GPU kernel is launched in host CPU, run on GPU, and then returned to the host CPU. An example CUDA code is shown on the right side.
2.2 Baseline GPU architecture.
2.3 A typical memory hierarchy in the baseline GPU architecture. L1D, L1T, and L1C stand for L1 data, L1 texture, and L1 constant caches, respectively.
2.4 A detailed memory hierarchy view.
2.5 Memory access handling procedure.
2.6 Coalescing examples of memory-convergent and memory-divergent instructions.
2.7 GPU memory access characteristics.
3.1 Memory access pattern for a thread and a warp.
3.2 Memory access pattern for a thread block.
3.3 Memory access pattern for an SM.
3.4 Classification of miss contentions at L1D cache in per kilocycle and in percentage.
3.5 Resource contentions at L1D cache in per kilocycle and in percentage.
3.6 Classification of cache misses (intra-warp (IW), cross-warp (XW), and cross-block (XB) miss) and comparison with different associativity (4-way and 32-way) caches. Left bar is with 4-way associativity and right with 32-way.
3.7 Block reuse percentage in the L1D cache. Reuse0 represents no-reuse until eviction.
3.8 LDST unit is in a stall. A memory request from ready warps cannot progress because the previous request is in stall in the LDST unit.
3.9 The average number of ready warps when cache resource contention occurs.
4.1 (Revisited) Coalescing example for memory-convergent instruction and memory-divergent instruction.
4.2 Example of contending set by column-strided accesses.
4.3 Example of BICG memory access pattern.
4.4 The task flow of the proposed selective caching algorithm in an LDST unit.
4.5 Different selective caching schemes with associativity size n when the memory divergence is detected.
4.6 Overall improvement - IPC improvement and L1D cache access reduction.
4.7 IPC improvement for different schedulers.
4.8 IPC improvement for different associativities.
5.1 Stall time percentage over simulation cycle time.
5.2 The number of addresses and instructions for load.
5.3 Reuse frequency table entry and operation.
5.4 Reuse frequency table update and caching decision.
5.5 IPC improvement with different threshold values for caching decision.
5.6 Overall improvement: IPC improvement and L1D cache access reduction.
5.7 IPC improvement with different schedulers.
5.8 IPC improvement with different associativities.
6.1 LDST unit in stall and the scheduling queue.
6.2 The average number of ready warps when cache resource contention occurs.
6.3 A procedure for the memory request scheduler.
6.4 A detailed view of the memory request queue and scheduler.
6.5 IPC improvement with different implementations.
6.6 Overall IPC improvement.
6.7 IPC improvement with different schedulers.
6.8 IPC improvement with different associativities.
LIST OF TABLES
2.1 GPU hardware and software terminology comparison between standards.
2.2 Baseline GPGPU-Sim configuration.
3.1 Cache capacity across modern multithreaded processors.
4.1 Baseline GPGPU-Sim configuration.
4.2 Benchmarks from PolyBench [31] and Rodinia [14].
5.1 Baseline GPGPU-Sim configuration.
5.2 Benchmarks from PolyBench [31] and Rodinia [14].
6.1 Baseline GPGPU-Sim configuration.
6.2 Benchmarks from PolyBench [31] and Rodinia [14].
LIST OF ABBREVIATIONS
GPU Graphics Processing Unit
GPGPU General Purpose Graphics Processing Unit
SIMT Single-Instruction, Multiple-Thread
SIMD Single-Instruction, Multiple-Data
MIMD Multiple-Instruction, Multiple-Data
APU Accelerated Processing Unit
AMD Advanced Micro Devices
IC Integrated Circuit
CPU Central Processing Unit
CMP Chip-MultiProcessor
PCI/PCIe Peripheral Component Interconnect / Peripheral Component Interconnect Express
CUDA Compute Unified Device Architecture
OpenCL Open Computing Language
API Application Programming Interface
DLP Data-Level Parallelism
TLP Thread-Level Parallelism
ILP Instruction-Level Parallelism
CU Compute Unit
SM Streaming Multiprocessor
PC Program Counter
LDST Load Store
MACU Memory Access Coalescing Unit
MSHR Miss Status Holding Register
L1D Level 1 Data
L1I Level 1 Instruction
L1T Level 1 Texture
L1C Level 1 Constant
L2 Level 2
LLC Last Level Cache
MC Memory Controller
FR-FCFS First-Ready First-Come-First-Serve
PKI Per Kilo (Thousand) Instructions
MPKI Misses Per Kilo (Thousand) Instructions
DRAM Dynamic Random Access Memory
GTO Greedy-Then-Oldest
LRR Loose Round Robin
CCWS Cache-Conscious Warp Scheduling
CHAPTER 1
INTRODUCTION
The success of General-Purpose computing on Graphics Processing Unit (GPGPU) tech-
nology has made high performance computing affordable and pervasive in platforms ranging from
workstations to hand-held devices. Its cost-effectiveness and power efficiency are unprecedented
in the history of parallel computing. Following their success in general-purpose computing, GPUs
have shown significant performance improvement in many compute-intensive scientific applications,
such as geoscience [79], molecular dynamics [4], DNA sequence alignment [84], large graph
processing [47], and physical simulations [62]. Furthermore, the advent
of the big data era has further stimulated the need to leverage the massive computation power
of GPUs in accelerating emerging data-intensive applications, such as data warehousing appli-
cations [7, 29, 30, 36, 83, 89] and big data processing frameworks [35, 15, 16, 34, 78]. The
GPU-based implementations can provide an order of magnitude performance improvement over
traditional CPU-based implementations [36, 83].
GPUs were originally introduced to efficiently handle computer graphics workloads in
specialized applications such as computer-generated imagery and video games. These traditional
GPU workloads involve a large amount of data streaming. To deliver high throughput, GPUs
rely on massive fine-grained multithreading to hide long-latency operations such as global memory
accesses [22]: when the threads being executed stall, the GPU issues instructions from different
threads. Given enough threads and enough memory bandwidth, the GPU’s data path
can be kept busy, increasing overall throughput at the expense of single-thread latency.
GPU memory systems have grown to include a multi-level cache hierarchy with both hard-
ware and software controlled cache structures. For instance, Nvidia Fermi GPUs introduced a
relatively large (up to 48 KB per core) and configurable L1 cache and 768 KB L2 cache [64].
Likewise, AMD GCN GPUs also offer a 16 KB L1 cache per core and 768 KB L2 cache [2].
While a throughput processor’s cache hierarchy exploits application-inherent locality and
increases overall performance, the massively parallel execution model of GPUs suffers from
resource contention. In particular, for applications whose performance is sensitive to caching
efficiency, such cache resource contention degrades the effectiveness of caches in exploiting
locality and causes a significant performance drop. Experiments show that caches are not
always beneficial to GPU applications [41]. In a suite of 12 Rodinia benchmark [14] applications
run on an Nvidia Tesla C2070 GPU with its 16 KB L1 caches turned on and off, only
two show substantial performance improvements from caching, and two show non-trivial
performance degradation. The effect of the cache is negligible in the other 8 benchmarks [41].
One of the main reasons for the performance degradation is cache contention in the memory
hierarchy, especially in the L1 data (L1D) cache. The limited resources that serve the massively
parallel memory requests destroy the application’s inherent memory access locality and create se-
vere cache contention. Software-based optimization could mitigate this problem, but the effort
required to revise the existing code to use cache efficiently is non-trivial [24]. Several warp
schedulers have also been proposed to reduce the contention by limiting the number of active
warps in an SM. However, these approaches reduce the utilization of the GPUs [49, 72, 57].
This dissertation proposes architectural support to reduce the cache contention. In particu-
lar, this dissertation identifies the causes of the contention and resolves it by detecting the memory
access patterns that significantly destroy the memory access locality, calculating reuse frequency
of the memory accesses, and probing cache under memory request stall. This dissertation also
demonstrates the effectiveness of these proposed techniques by extensively comparing them with
closely related state-of-the-art techniques.
1.1 Research Challenges
One cause of cache contention is the column-strided memory access pattern, which is
common in data-intensive GPU applications. Such access patterns
are compiled to warp instructions whose generated memory accesses are not well coalesced. We
call these instructions memory-divergent instructions. The resulting memory accesses are serially
processed in the LDST unit due to the resource contention, stress the L1D caches, and result in
serious performance degradation. Even though the impact of memory divergence can be alleviated
through various software techniques, architectural support for memory divergence mitigation is
still highly desirable because it eases the complexity in the programming and optimization of
GPU-accelerated data-intensive applications [72, 73].
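To make the effect of a column-strided access on coalescing and on the L1D cache concrete, the following host-side C++ sketch counts how many distinct cache lines and cache sets a single warp instruction touches. The warp size of 32, the 128-byte line, the 32 sets, the modulo set indexing, and the names WarpFootprint and warp_footprint are illustrative assumptions, not parameters or code taken from this dissertation.

```cpp
#include <cstdint>
#include <set>

// Illustrative parameters (assumptions, not taken from the text above).
constexpr int kWarpSize = 32;              // threads per warp
constexpr std::uint64_t kLineBytes = 128;  // cache line size in bytes
constexpr std::uint64_t kNumSets = 32;     // e.g. 16 KB, 4-way, 128 B lines

struct WarpFootprint {
    int lines;  // distinct cache lines = memory requests after coalescing
    int sets;   // distinct cache sets those lines map to
};

// Thread t of the warp accesses the byte address base + t * stride_bytes.
WarpFootprint warp_footprint(std::uint64_t base, std::uint64_t stride_bytes) {
    std::set<std::uint64_t> lines;
    std::set<std::uint64_t> sets;
    for (int t = 0; t < kWarpSize; ++t) {
        std::uint64_t line = (base + t * stride_bytes) / kLineBytes;
        lines.insert(line);
        sets.insert(line % kNumSets);  // simple modulo indexing (assumption)
    }
    return {static_cast<int>(lines.size()), static_cast<int>(sets.size())};
}
```

Under these assumptions, a unit-stride access to 4-byte elements (a stride of 4 bytes) touches a single line. A stride of 4096 bytes, such as walking down a column of a row-major matrix with 1024 4-byte elements per row, touches 32 lines that all map to the same set, so the 32 requests compete for the few ways of one set.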
Cache contention also arises from cache pollution caused by low reuse frequency data. For
the cache to be effective, a cached line must be reused before its eviction. However, the streaming
characteristic of GPGPU workloads and the massively parallel GPU execution model increase the
reuse distance, or equivalently reduce the reuse frequency, of data. If low reuse frequency data
is cached, it can evict high reuse frequency data from the cache before that data is reused. In a GPU,
the pollution caused by data with a low reuse frequency (i.e., a large reuse distance) is significant.
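The link between reuse distance and cache hits can be illustrated with a small host-side C++ sketch. It assumes the common stack-distance definition (the number of distinct lines touched between two accesses to the same line) and a fully associative LRU cache. Both the definition and the helper names reuse_distances and lru_hits are assumptions made for illustration; this is not the hardware mechanism proposed later in the dissertation.

```cpp
#include <cstddef>
#include <set>
#include <vector>

// Reuse distance of each access: the number of distinct lines touched since
// the previous access to the same line, or -1 for a first-time access.
std::vector<int> reuse_distances(const std::vector<int>& trace) {
    std::vector<int> out;
    for (std::size_t i = 0; i < trace.size(); ++i) {
        int dist = -1;
        // Scan backwards for the most recent access to the same line.
        for (std::size_t j = i; j-- > 0;) {
            if (trace[j] == trace[i]) {
                std::set<int> between(trace.begin() + j + 1, trace.begin() + i);
                dist = static_cast<int>(between.size());
                break;
            }
        }
        out.push_back(dist);
    }
    return out;
}

// Hits in a fully associative LRU cache holding `capacity` lines: an access
// hits exactly when its reuse distance is defined and smaller than capacity.
int lru_hits(const std::vector<int>& trace, int capacity) {
    int hits = 0;
    for (int d : reuse_distances(trace)) {
        if (d >= 0 && d < capacity) ++hits;
    }
    return hits;
}
```

For the trace 1, 10, 11, 12, 1, 13, 14, 15, 1, each reuse of line 1 has distance 3, so a two-line cache never hits. If the streaming lines 10 to 15 were kept out of the cache, the trace seen by the cache would be 1, 1, 1 and both reuses would hit.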
Memory request stall is another contention factor. A stalled LDST unit does not execute
memory requests from any ready warps in the preceding issue stage. The warp in the LDST unit
retries until the cache resource becomes available. During this stall, the private L1D cache is also
stalled, so no other warp’s requests can probe the cache. However, the L1D cache may hold data
that would produce a hit if other ready warps in the issue stage accessed the cache. The current
structure of the issue stage and L1D unit execution does not allow provisional cache probing in
such a case. The stall therefore forfeits potential hits for the ready warps.
1.2 Research Contribution
In this dissertation, we have thoroughly investigated three architectural challenges that can
severely degrade the performance of GPUs. For each challenge, this dissertation proposes a
solution that reduces cache contention: contention-aware selective caching, locality-aware
selective caching, and memory request scheduling. We compare the proposed solutions with
closely related state-of-the-art techniques. In particular,
this dissertation makes the following contributions.
• Using the application inherent memory access locality classification along with limitation
of resources, we classify cache miss contention into 3 categories, intra-warp (IW), cross-
warp (XW), and cross-block (XB) contention according to the cause of the misses. We also
classify the cache resource contention into 3 categories, LineAlloc fail, MSHR fail, and
MissQueue fail.
• We identify and quantify the factors of the contention such as a column-strided pattern and
its resulting memory-divergent instruction, cache pollution by low reuse frequency data, and
memory request stall.
• We propose a mechanism that detects the column-strided pattern and its resulting
memory-divergent instruction, which generates divergent memory accesses, calculates the
contending cache sets and locality information, and caches selectively. We demonstrate that
contention-aware selective caching can improve performance by more than 2.25x over the
baseline and reduce memory accesses.
• We propose a mechanism with low hardware complexity to detect the locality of memory
requests based on per-PC reuse frequency and cache selectively. We demonstrate that it
improves the performance by 1.39x alone and 2.01x along with contention-aware selective
caching over baseline, prevents 73% of the no-reuse data from caching and improves reuse
frequency in the cache by 27x.
• We propose a memory request schedule queue that holds ready warps’ memory requests
and a scheduler to effectively schedule them to increase the chances of a hit in the cache
lines. We demonstrate that there are 12 ready warps on average when the LDST unit is in
a stall, and that exploiting this potential improves the overall performance over baseline by
1.95x alone and by 2.06x along with contention-aware selective caching.
1.3 Organization
The rest of this dissertation is organized as follows:
• Chapter 2 summarizes the terminology used in this dissertation and describes the programming
model of GPUs, the baseline GPU architecture, warp scheduling, and memory access handling.
• Chapter 3 discusses the data locality, cache miss contention classification, cache resource
contention classification and the cache contention factors.
• Chapter 4 presents a contention-aware selective caching proposal to reduce intra-warp asso-
ciativity contention caused by memory-divergent instructions.
• Chapter 5 presents a locality-aware selective caching to measure the reuse frequency of the
instructions and prevent low reuse data from caching.
• Chapter 6 presents a memory request scheduling to better utilize the cache resource under
LDST unit stall.
• Chapter 7 discusses related work.
• Chapter 8 concludes the dissertation and discusses directions for future work.
CHAPTER 2
BACKGROUND
This chapter introduces background knowledge that serves as a foundation for the rest
of this dissertation. In particular, Section 2.1 summarizes the terminology usage in all chapters
throughout this dissertation. Section 2.2 describes the GPU programming flow and work-item
formation from the software point of view. Section 2.3 presents the abstract model of GPU archi-
tecture used in this dissertation. Section 2.4 and Section 2.5 explain how the warps are assigned to
the execution units and how global memory requests are handled in the GPU memory hierarchy.
More detailed background knowledge can be found in many other references [3, 66, 37, 52].
2.1 Summary of Terminology Usage
This Dissertation               CUDA [66]       OpenCL [52]
thread                          thread          work-item
warp                            warp            wavefront
thread block                    thread block    work-group
SIMD lane                       CUDA core       processing element
Streaming Multiprocessor (SM)   SM              Compute Unit (CU)
private memory                  local memory    private memory
local memory                    shared memory   local memory
global memory                   global memory   global memory
Table 2.1. GPU hardware and software terminology comparison between standards.
Table 2.1 summarizes the terminology used in this dissertation. We present this summary
to avoid confusion between multiple equivalent technical terms from multiple competing GPGPU
programming frameworks and industry standards. Detailed definitions of the terms are presented
in the following sections. A more thorough explanation of GPU terminology beyond what is
introduced here can be found in various programming guides and framework specifications
[51, 52, 66].
[Figure 2.1 diagram: the host CPU launches a kernel, waits for the GPU, and then resumes execution. On the GPU, the kernel grid consists of thread blocks Block(0,0) to Block(1,2), each made up of threads. Example CUDA code from the right side of the figure:]
    __global__ void kernel(...) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        int j = blockIdx.y;
        if (i<M && j<N) {
            ...
        }
        ...
    }
Figure 2.1. A GPU kernel execution flow example - A GPU kernel is launched in host CPU, run on GPU, and then returned to the host CPU. An example CUDA code is shown on the right side.
2.2 Programming GPUs
Programming environments and abstractions have been introduced to help software devel-
opers write GPU applications. Nvidia’s CUDA [66] and Khronos Group’s OpenCL [51, 52] are
popular frameworks. Despite the terminology differences summarized in Table 2.1, these frame-
works have very similar programming models, as introduced below.
Depending on the system configuration, a GPU can be connected via a PCI/PCIe bus or
reside on the same die with the CPU. Applications programmed in high-level programming lan-
guages, such as CUDA [66] and OpenCL [51, 52], begin execution on CPUs. The GPU portion
of the code is launched from the CPU code in the form of kernels. Depending on the system, data
to be used by a GPU are transferred through the PCI/PCIe bus, through an internal interconnection
between the CPU and GPU, or through pointer exchange.
A CUDA or OpenCL program running on a host CPU contains one or more kernel func-
tions. These kernels are invoked by the host program, offloaded to a GPU device, and executed
there. The kernel code specifies operations to be performed by the GPU from the perspective of
a single GPU thread. At kernel launch, the host specifies the total number of threads executing
the kernel and their thread grouping. As shown in Figure 2.1, a kernel divides its work into a grid
of identically sized thread blocks. The size of a thread block is the number of threads (e.g., 256)
in that block. From the programmer’s perspective, every instruction in the kernel is concurrently
executed by all threads in the same thread block. However, there is a limited number of hardware
lanes (e.g., 32) that an SM can execute concurrently on real hardware. Consequently, threads are
executed in groups of hardware threads called warps. The number of threads in a warp is called
the size of the warp.
GPU threads have access to various memory spaces. A thread can access its own private
memory, which is not accessible by other threads. Threads in the same block can share data and
synchronize via local memory, and all threads in a kernel can access global memory. The local
and global memories support atomic operations. Global memory is cached in on-chip private L1
data (L1D) cache and shared L2 cache. Much of this dissertation focuses on reducing contention
on this L1D cache.
Finally, because every thread executes the same binary instructions as other threads in the
same kernel, it must use its thread ID and thread block ID variables to determine its own identity
and operate on its own data accordingly. Those IDs can be one- or multi-dimensional, and they
are represented by consecutive integer values in each dimension of the block starting with the first
dimension, as shown in Figure 2.1.
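The identity calculation described above can be sketched on the host side. The following is a minimal illustration in plain C++ (which also compiles as CUDA host code), assuming a one-dimensional launch and the example warp size of 32; the struct and function names are hypothetical and are not part of any GPU API.

```cpp
#include <cassert>

// Illustrative only: mirrors "blockIdx.x * blockDim.x + threadIdx.x" from
// Figure 2.1 for a one-dimensional launch. All names are hypothetical.
struct ThreadIdentity {
    int global_index;   // unique index of the thread within the grid
    int warp_in_block;  // which warp of its thread block the thread belongs to
    int lane;           // position of the thread inside its warp
};

ThreadIdentity identify(int block_idx, int block_dim, int thread_idx,
                        int warp_size = 32) {
    assert(thread_idx >= 0 && thread_idx < block_dim);
    ThreadIdentity id;
    id.global_index = block_idx * block_dim + thread_idx;
    id.warp_in_block = thread_idx / warp_size;
    id.lane = thread_idx % warp_size;
    return id;
}

// Number of warps needed to cover one thread block (rounded up).
int warps_per_block(int block_dim, int warp_size = 32) {
    return (block_dim + warp_size - 1) / warp_size;
}
```

Under these assumptions, a block of 256 threads is split into 8 warps, and thread 70 of block 2 is lane 6 of warp 2 in its block.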
[Figure 2.2 diagram. Left: a thread block scheduler and an array of Streaming Multiprocessors connected through an interconnection network to memory partitions (each with an L2 cache and a memory controller, MC) and off-chip DRAM channels. Right: SM detail with an instruction cache, two warp schedulers, a register file, SIMD execution units, SFUs, LD/ST units, an interconnection network, and the L1 data cache, texture cache, constant cache, and local memory.]
Figure 2.2. Baseline GPU architecture.
2.3 Abstract of GPU Architecture
The modern GPU architecture is illustrated in Figure 2.2. A GPU consists of a thread block
scheduler, an array of Streaming Multiprocessors (SMs), an interconnection network between SMs
and memory modules, and global memory units. Off-chip DRAM memory (global memory) is
connected to the GPU through a memory bus. Each SM is a highly multi-threaded and pipelined
SIMD processor. SIMD lanes execute distinct threads, operate on a large register file, and progress
in lock-step with other threads in the SIMD thread group (warp).
The SM architecture is detailed in the right side of Figure 2.2. It consists of warp schedulers,
a register file, SIMD lanes, Load/Store (LDST) units, and various on-chip memories including
L1D cache, local memory, texture cache, and constant cache. LDST units manage accesses to
various memory spaces. Depending on the data being requested, GPU memory requests are sent to
either L1D cache, local memory, texture cache or constant cache. Each memory partition consists
of L2 cache and a memory controller (MC) that controls off-chip memory modules. An interconnection
network handles data transfer between the SMs and L2 caches, and the memory controller
handles data transfer between L2 and off-chip memory modules.
We use GPGPU-Sim [5] for detailed architectural simulation of the GPU architecture. The
details of the architectural specification can be found in the GPGPU-Sim 3.x manual [82]. We
present the details of our simulation setup and configurations in Table 2.2.
Number of SMs                   15
SM configuration                1400 MHz, SIMD width: 16
                                Warp size: 32 threads
                                Max threads per SM: 1536
                                Max warps per SM: 48 warps
                                Max blocks per SM: 8 blocks
Warp schedulers per core        2
Warp scheduler                  Greedy-Then-Oldest (GTO), default [73, 72]
                                Loose Round-Robin (LRR)
                                Two-Level (TL) [63]
Cache/SM                        L1 Data: 16KB 128B-line/4-way (default)
                                L1 Data: 16KB 128B-line/2-way
                                L1 Data: 16KB 128B-line/8-way
                                Replacement Policy: Least Recently Used (LRU)
                                Shared memory: 48KB
L2 unified cache                768KB, 128B line, 16-way
Memory partitions               6
Instruction dispatch            2 instructions per cycle
throughput per scheduler
Memory Scheduler                Out of Order (FR-FCFS)
DRAM memory timing              tCL=12, tRP=12, tRC=40, tRAS=28, tRCD=12, tRRD=6
DRAM bus width                  384 bits
Table 2.2. Baseline GPGPU-Sim configuration.
2.4 Warp Scheduling
As shown in Figure 2.2, each SM contains multiple physical warp lanes (SIMD) and two
warp schedulers, which independently manage warps with even and odd warp identifiers. In each
cycle, each warp scheduler picks one ready warp and issues its instruction into the SIMD pipeline
back-end [64, 65]. To determine the readiness of each decoded instruction, a ready bit is used to
track its dependency on other instructions. The ready bit is updated in the scoreboard by comparing
the instruction’s source and destination registers with those of other in-flight instructions of the
warp. Instructions are ready for scheduling when their ready bits are set (i.e., data dependencies
are cleared).
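A minimal sketch of such a readiness check is given below. It assumes one common scoreboard organization, namely a per-warp set of registers with pending writes; the class and member names are hypothetical, and the exact hardware structure is not specified in the text.

```cpp
#include <cassert>
#include <set>
#include <vector>

// Hypothetical decoded instruction: source registers and one destination.
struct Instruction {
    std::vector<int> src_regs;
    int dst_reg;  // -1 if the instruction writes no register
};

// Assumed per-warp scoreboard: tracks registers with in-flight writes.
class Scoreboard {
public:
    // Reserve the destination register when the instruction is issued.
    void issue(const Instruction& inst) {
        if (inst.dst_reg >= 0) pending_.insert(inst.dst_reg);
    }
    // Release the destination register when the result is written back.
    void writeback(const Instruction& inst) {
        if (inst.dst_reg >= 0) pending_.erase(inst.dst_reg);
    }
    // The "ready bit": set only if no source (read-after-write) and no
    // destination (write-after-write) register is still pending.
    bool ready(const Instruction& inst) const {
        for (int r : inst.src_regs)
            if (pending_.count(r) != 0) return false;
        return inst.dst_reg < 0 || pending_.count(inst.dst_reg) == 0;
    }

private:
    std::set<int> pending_;
};
```

In this model, an instruction that consumes the result of an outstanding load stays unready until the load writes back, which is the situation that makes a warp ineligible in the qualification stage.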
GPU scheduling logic consists of two stages: qualification and prioritization. In the qualifi-
cation stage, ready warps are selected based on the ready bit that is associated with each instruction.
In the prioritization stage, ready warps are prioritized for execution based on a chosen metric, such
as cycle-based round-robin [63, 44], warp age [73, 72], instruction age [12], or other statistics that
can maximize resource utilization or minimize stalls in memory hierarchy or in execution units.
For example, the Greedy-Then-Oldest (GTO) scheduler [73, 72] maintains the highest priority for
the currently prioritized warp until it is stalled. The scheduler then selects the oldest among ready
warps for scheduling. GTO is a commonly used scheduler because of its good performance in a
large variety of general purpose GPU benchmarks.
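The GTO policy can be sketched as a small selection function. This is a simplified illustration under the assumption that each warp carries an age stamp and a ready flag produced by the qualification stage; the names are hypothetical and the sketch is not the GPGPU-Sim implementation.

```cpp
#include <cassert>
#include <vector>

// Hypothetical per-warp scheduling state.
struct Warp {
    int id;
    long age;    // cycle at which the warp was assigned; smaller means older
    bool ready;  // qualification result for the warp's next instruction
};

// Returns the index in `warps` of the warp to issue, or -1 if none is ready.
// `current` is the index of the warp issued most recently (-1 if none).
int pick_gto(const std::vector<Warp>& warps, int current) {
    const int n = static_cast<int>(warps.size());
    // Greedy: keep issuing from the current warp while it is not stalled.
    if (current >= 0 && current < n && warps[current].ready) return current;
    // Then oldest: otherwise fall back to the oldest ready warp.
    int best = -1;
    for (int i = 0; i < n; ++i) {
        if (!warps[i].ready) continue;
        if (best < 0 || warps[i].age < warps[best].age) best = i;
    }
    return best;
}
```

Replacing the "greedy" branch with a rotating start index would turn the same loop into a round-robin policy, which is one way to see how the prioritization metric alone distinguishes the schedulers listed above.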
2.5 Modern GPU Global Memory Accesses
2.5.1 Memory Hierarchy
An SM contains physical memory units shared by all its concurrently executing threads.
Private memory mapped to registers is primarily used for threads to store their individual states
and contexts. Local memory mapped to programmer-managed scratch-pad memories [8] is used to
share data within a thread block. All cores share a large, off-chip DRAM to which global memory
maps. Modern GPUs also have a two-level cache hierarchy (L1D and L2) for global memory
accesses. Texture and constant caches (L1T and L1C) have existed since the early graphics-only
GPUs [33]. Figure 2.3 shows the described memory hierarchy in the baseline GPU architecture.
[Figure 2.3 diagram: each SM contains registers (private memory), scratchpad memory (local memory), an L1D cache, and L1T&C caches. The SMs connect through the interconnection network to the L2 cache and the off-chip global memory.]
Figure 2.3. A typical memory hierarchy in the baseline GPU architecture. L1D, L1T, and L1C stand for L1 data, L1 texture, and L1 constant caches, respectively.
Current Nvidia GPUs have configurable L1D caches whose size can be 16, 32, or 48 KB.
The L1D cache in each core shares the same total 64 KB of memory cells with local memory,
giving users dynamically configurable choices regarding how much storage to devote to the L1D
cache versus the local memory. AMD GPU L1D caches have a fixed size of 16 KB. Current
Nvidia GPUs have non-configurable 768 KB - 1.5 MB L2 caches, while AMD GPUs have 768 KB
L2 caches. L1T and L1C are physically separate from L1D and L2, and they are only accessed
through special constant and texture instructions. GPU programmers are usually encouraged to
declare and use local memory variables as much as possible because the on-chip local memories
have a much shorter latency and much higher bandwidth than the off-chip global memory. Small,
but repeatedly used data are good candidates to be declared as local memory objects. However,
because local memories have a limited capacity and a GPU core cannot access another core’s local
memory, many programs cannot use local memories due to their data access patterns. In these
situations, the global memory with its supporting cache hierarchy should be the choice.
[Figure 2.4 diagram: two warp schedulers select among warps W0 to W47 and feed the register file and the execution units (ALU/SFU). The LD/ST path goes through access generation and the MACU to the L1 data cache (L1D) with its MSHR and memory port, alongside the texture cache, constant cache, and local memory.]
Figure 2.4. A detailed memory hierarchy view.
Threads from the same thread block must execute on the same SM in order to use the SM’s
scratch-pad memory as a local memory. However, when thread blocks are small, multiple thread
blocks may execute on a single SM as long as core resources are sufficient. Specifically, four
hardware resources - the number of thread slots, the number of thread block slots, the number
of registers, and the local memory size - dictate the number of thread blocks that can execute
concurrently on an SM; we call this number an SM’s (thread) block concurrency. An interconnection
network connects SMs to the DRAM via a distributed set of L2 caches and memory modules. A
certain number of memory modules share one memory controller (MC) that sits in front of them
and schedules memory requests. This number varies from one system to another. The DRAM
scheduler queue size in each memory module impacts the capacity to hold outstanding memory
requests.
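As a rough illustration of block concurrency, the sketch below takes the minimum over the four resource limits named above. The thread, block, and local memory limits used in the usage note follow Table 2.2; the register file size is an assumed value that does not appear in the table, and real hardware allocates registers and local memory at a coarser granularity than this model does.

```cpp
#include <algorithm>
#include <cassert>

// Per-SM limits (hypothetical struct; field names are illustrative).
struct SmLimits {
    int max_threads;      // thread slots per SM
    int max_blocks;       // thread block slots per SM
    int registers;        // registers per SM (assumed, not from Table 2.2)
    int local_mem_bytes;  // local (CUDA shared) memory per SM
};

// Per-kernel resource usage (hypothetical struct).
struct KernelUsage {
    int threads_per_block;
    int regs_per_thread;
    int local_mem_per_block;  // bytes; 0 if the kernel uses none
};

// Block concurrency: the tightest of the four resource constraints.
int block_concurrency(const SmLimits& sm, const KernelUsage& k) {
    assert(k.threads_per_block > 0 && k.regs_per_thread > 0);
    int by_threads = sm.max_threads / k.threads_per_block;
    int by_regs = sm.registers / (k.regs_per_thread * k.threads_per_block);
    int by_mem = k.local_mem_per_block > 0
                     ? sm.local_mem_bytes / k.local_mem_per_block
                     : sm.max_blocks;  // no local memory: not a constraint
    return std::min({by_threads, sm.max_blocks, by_regs, by_mem});
}
```

For example, with 1536 thread slots, 8 block slots, 48 KB of local memory, and an assumed 32768 registers, a kernel with 256 threads per block and 32 registers per thread would be limited to 4 concurrent blocks by the register file.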
2.5.2 Memory Access Handling
Figure 2.4 shows a typical thread processing and memory hierarchy on an SM. An instruc-
tion is fetched and decoded for a group of threads called a warp. The size of warp can vary across
devices, but is typically 32 or 64. All threads in a warp execute in an SIMD fashion with each
thread using its unique thread ID to map to its data. A user-defined thread block is composed of
multiple warps. At issue stage, a warp scheduler selects one of the ready warps for execution.
[Figure 2.5 flowchart: a warp instruction issues 32 memory access requests to the MACU, which produces k actual requests. Each request probes the L1D cache. A hit ends the request, while a miss triggers resource checking (line, MSHR, miss queue), which either stalls or, when the resources are available, sends the request. The procedure iterates k times.]
Figure 2.5. Memory access handling procedure.
Figure 2.5 shows the detailed global memory access handling. A memory instruction is
issued on a per-warp basis, usually with 32 threads in Nvidia Fermi or 64 threads in AMD Southern
Islands architectures. Once a memory instruction for global memory is issued, it is sent to a Memory
Access Coalescing Unit (MACU) for memory request generation to the next lower layer of the
memory hierarchy. To minimize off-chip memory traffic, the MACU merges simultaneous per-
thread memory accesses to the same cache line. Depending on the stride of the memory addresses
among threads, the number of resulting memory requests varies. For example, when 4 threads in
a warp access 4 consecutive words, i.e., stride-1 access, in a cache line-aligned data block, the
MACU will generate only one memory access to L1D cache as shown in Figure 2.6a. Otherwise,
simultaneous multiple accesses are coalesced to a smaller number of memory accesses to L1D
cache to fetch all required data. In the worst case, the 4 memory accesses are not coalesced at all
and generate 4 distinct memory accesses to L1D cache as shown in Figure 2.6c. Therefore, a fully
memory-divergent instruction can generate as many accesses as the warp size. The working set
is defined as the amount of memory that a process requires in a given time interval [20]. When
many memory-divergent instructions are issuing memory requests, the working set size becomes
large. If it exceeds the cache size, it causes cache contention. In the rest of this dissertation,
the memory instructions that generate only one memory access after the MACU are called
memory-convergent instructions, and the others are called memory-divergent instructions. The
resultant memory accesses from an MACU are sequentially sent to L1D via a single 128-byte port [12].
[Figure 2.6 panels: (a) Fully convergent, (b) Partially divergent, (c) Fully divergent.]
Figure 2.6. Coalescing examples of memory-convergent and memory-divergent instructions.
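The coalescing behavior described above can be modeled in a few lines: the number of L1D accesses produced for one warp instruction equals the number of distinct cache lines touched by its threads. The sketch below is a simplified model, assuming the 128-byte line size of Table 2.2 and 4-byte words; the function names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <set>
#include <vector>

// Number of L1D accesses generated after coalescing: one per distinct
// cache line touched by the threads of the warp instruction.
std::size_t coalesced_requests(const std::vector<std::uint64_t>& thread_addrs,
                               std::uint64_t line_bytes = 128) {
    assert(line_bytes > 0);
    std::set<std::uint64_t> lines;
    for (std::uint64_t addr : thread_addrs) lines.insert(addr / line_bytes);
    return lines.size();
}

// Helper: byte addresses of `threads` lanes that read 4-byte words starting
// at `base`, with a stride given in words (stride 1 = consecutive words).
std::vector<std::uint64_t> strided_addrs(std::uint64_t base, int threads,
                                         std::uint64_t stride_words) {
    std::vector<std::uint64_t> addrs;
    for (int t = 0; t < threads; ++t)
        addrs.push_back(base + 4 * stride_words * static_cast<std::uint64_t>(t));
    return addrs;
}
```

With these assumptions, a 32-thread warp with stride-1 accesses from a line-aligned base yields one request (memory-convergent), while a stride of 32 words places every thread in its own line and yields 32 requests (fully memory-divergent).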
When a load memory request hits in L1D, the requested data is written back to the register
file and the request is dismissed. If it misses in L1D, the request checks whether it can allocate
enough resources to be processed. When the request acquires the resources, it is sent to the next
lower level of the memory hierarchy to fetch data. Otherwise, it retries at the next cycle to acquire
the resources. The resources to be checked by the request are a line in a cache set, a Miss Status
Holding Register (MSHR) entry, and a miss queue entry. An allocate-on-miss policy cache allocates
one cache line in the destination set of the cache if available. Otherwise, it fails allocation on the
cache and retries at the next cycle until the resource is ready. An allocate-on-fill policy cache skips
this process and attempts allocation on reception of the requested data from the lower memory
hierarchy. The MSHR is used to track in-flight memory requests and merge duplicate requests to
the same cache line. Upon MSHR allocation, a memory request is buffered into the memory port
for network transfer. An MSHR entry is released after its corresponding memory request is back
and all accesses to that block are serviced. Memory requests buffered in the memory port are
drained by the on-chip network in each cycle when the lower memory hierarchy is not saturated.
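The role of the MSHR described above (tracking in-flight lines, merging duplicates, and releasing an entry once the fill returns) can be sketched as follows. This is a simplified model with hypothetical names; it bounds only the number of entries and, unlike a real MSHR, does not bound the number of requests merged into one entry.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <vector>

// Simplified MSHR table keyed by cache line index (illustrative only).
class MshrTable {
public:
    enum class Result { NewEntry, Merged, Full };

    explicit MshrTable(std::size_t entries) : capacity_(entries) {
        assert(entries > 0);
    }

    // Record a missing request issued by `warp_id` for cache line `line`.
    Result add(std::uint64_t line, int warp_id) {
        auto it = table_.find(line);
        if (it != table_.end()) {  // line already in flight: merge
            it->second.push_back(warp_id);
            return Result::Merged;
        }
        if (table_.size() >= capacity_) return Result::Full;  // must retry
        table_[line].push_back(warp_id);  // first miss: request is sent down
        return Result::NewEntry;
    }

    // The data for `line` returned: hand back the waiters, free the entry.
    std::vector<int> fill(std::uint64_t line) {
        std::vector<int> waiters;
        auto it = table_.find(line);
        if (it != table_.end()) {
            waiters = it->second;
            table_.erase(it);
        }
        return waiters;
    }

    std::size_t in_flight() const { return table_.size(); }

private:
    std::size_t capacity_;
    std::map<std::uint64_t, std::vector<int>> table_;
};
```

Only a `NewEntry` result corresponds to a request that is buffered into the memory port; a `Merged` result generates no additional traffic, and a `Full` result corresponds to the retry path described above.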
Since L1D caches are not coherent across cores, every global memory store is treated as
a write-through transaction followed by invalidation on the copies in the L1D cache [2]. Store
instructions require no L1D cache resources and are directly buffered into the memory port to the
L1 cache. For this reason, only global memory loads, not stores, are taken into consideration in
this dissertation.
2.5.3 Memory Access Characteristics
Since GPUs have a different programming model and execution behavior from traditional
CPUs, their memory access also has unique characteristics different from traditional processors.
According to our evaluation and analysis of cache behavior and performance [19], GPU memory
access has the following characteristics.
Considerably low cache hit rate in L1D: As shown in Figure 2.7a, L1D cache hit rate on
GPUs is considerably lower than those on the CPUs (49% vs. 88% on average). This suggests the
reuse rate of the data in the cache is not high. This low cache hit rate in L1D results in increased
memory traffic to the lower levels of the memory hierarchy, L2 and off-chip global memory. It
leads to overall longer memory latency, and thus degrades the overall memory performance.
Compulsory misses dominate: The main cause of the low L1D cache hit rate is that
compulsory misses dominate in an L1D cache. A compulsory miss is sometimes referred to as a
cold miss or first-reference miss, since it occurs when a data block is brought into the cache for
the first time. From Figure 2.7b, compulsory misses account for about 65% of total misses on
average. This behavior contradicts the conventional wisdom that compulsory misses are negligibly
small on traditional multi-core processors [74]. This difference shows that the data in GPUs is not
reused much, that is, data reusability is very low. This is different from the CPU memory access
patterns where temporal and spatial localities are used for caching. Tian et al. [80] show that
zero-reuse blocks make up about 45% of the blocks in the L1 data cache on average across GPU
applications.
[Figure 2.7 plots, one bar group per benchmark (BS, DCT, RG, FW, FWT, MT, SLA, RED, RS, HIST, BLS, BO, AVG):
(a) Cache hit rate on L1D cache (y-axis: cache hit rate, 0 to 1; series: CPU, GPU).
(b) Compulsory miss rate in L1D cache (y-axis: percentage, 0 to 100).
(c) Coalescing degree in Memory Access Coalescing Unit (MACU) (y-axis: coalescing degree, 0 to 32).]
Figure 2.7. GPU memory access characteristics.
Coalescing occurs massively: As described in Section 2.2, GPUs achieve high performance
through massively parallel thread execution. Such massive thread execution subsequently issues
many memory requests to its private L1D cache. Since typical memory accesses are known to
be regular with strided patterns in GPUs, massive coalescing occurs at the MACU. Figure 2.7c
shows the coalescing degree for each benchmark. The coalescing degree is defined as the ratio of
the number of memory requests generated by warp instructions to the total L1D cache requests.
Since the warp size is 32, the maximum coalescing degree is 32. As can be seen in Figure 2.7c, on some
benchmarks such as bs, dct, and bo, the coalescing degree is very high, more than 16. On
average, approximately 7 memory requests are coalesced to the same cache line. In other words,
the memory traffic reduction by the MACU is 7 to 1.
CHAPTER 3
CACHE CONTENTION
This chapter introduces the taxonomy of memory access locality that GPU applications
inherently have. This chapter also introduces two other cache contention classifications, miss con-
tention and resource contention, followed by the factors that are involved in the cache contention.
As noted in Section 2.5.2, only global memory loads - not stores - are taken into consideration,
since global memory stores bypass the L1D cache and do not affect the cache contention described
in this dissertation.
3.1 Taxonomy of Memory Access Locality
A comprehensive understanding of GPU memory access characteristics, especially locality,
is essential to a better understanding of contention in the memory hierarchy of a GPU. We introduce
a four-category taxonomy of memory access locality: intra-thread locality, intra-warp locality,
cross-warp locality, and cross-block locality. The definition of each category follows.
• Intra-thread data locality (IT) applies to memory instructions being executed by one thread
in a warp. This category captures the temporal and spatial locality of a thread which has a
similar pattern to that of the CPU workload.
• Intra-warp data locality (IW) applies to memory instructions being executed by threads from
the same warp. Depending on the result of coalescing, the instructions are classified as
either memory-convergent (in which thread accesses are mapped to the same cache block)
or memory-divergent (in which thread accesses are mapped to two or more cache blocks).
When the instruction is memory-convergent, the spatial locality between threads is taken care
of by the coalescer (MACU), and therefore, the number of memory requests per instruction
becomes one, which is the same as the IT locality case. When the instructions are not
memory-convergent, each instruction generates multiple memory requests and the working
set size becomes larger.
• Cross-warp Intra-block data locality (XW) applies to memory instructions being executed
by threads from the same thread block, but from different warps in the thread block. If these
threads access data mapped to the same cache line, they have XW locality. Warp scheduler
and memory latency affect the locality.
• Cross-warp Cross-block data locality (XB) applies to memory instructions being executed
by threads from different thread blocks, but on the same SM. If these threads access data
mapped to the same cache line, they have XB locality. The thread block scheduler, warp
scheduler and memory latency affect the locality. XB locality between thread blocks mapped
to different SMs is not considered since they do not show locality in an SM.
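The four categories above can be read as a decision over the identities of two accesses that touch the same cache line. The sketch below is one possible encoding, assuming that warp and thread identifiers are local to their thread block; the type and field names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

enum class Locality { None, IT, IW, XW, XB };

// Hypothetical description of one memory access.
struct Access {
    int sm;              // SM on which the issuing thread block runs
    int block;           // thread block id
    int warp;            // warp id within the thread block
    int thread;          // thread id within the thread block
    std::uint64_t line;  // cache line index of the accessed address
};

// Classify the locality between two accesses according to the taxonomy.
Locality classify_locality(const Access& a, const Access& b) {
    // Accesses on different SMs or to different lines share no L1D locality.
    if (a.sm != b.sm || a.line != b.line) return Locality::None;
    const bool same_block = a.block == b.block;
    const bool same_warp = same_block && a.warp == b.warp;
    const bool same_thread = same_warp && a.thread == b.thread;
    if (same_thread) return Locality::IT;  // intra-thread
    if (same_warp) return Locality::IW;    // intra-warp
    if (same_block) return Locality::XW;   // cross-warp, intra-block
    return Locality::XB;                   // cross-warp, cross-block
}
```

The ordering of the checks reflects the nesting of the scopes: each category is only reached when the narrower scopes do not apply.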
Since more threads are involved in the memory access locality for GPUs as described in the
taxonomy above, we need to review the definition of temporal and spatial locality. The traditional
definitions of temporal locality and spatial locality are as follows [37].
Temporal locality: If a particular memory location is referenced at a particular time, then
it is likely that the same location will be referenced again in the near future. There is a temporal
proximity between the adjacent references to the same memory location. In this case, it is common
to make an effort to store a copy of the referenced data in special memory storage, which can be
accessed faster. Temporal locality is a special case of spatial locality, namely when the prospective
location is identical to the present location.
Spatial locality: If a particular memory location is referenced at a particular time, then
it is likely that nearby memory locations will be referenced in the near future. In this case, it is
common to attempt to guess the size and shape of the area around the current reference for which
it is worthwhile to prepare faster access.
Listing 3.1. SYRK benchmark kernel
    __global__ void syrk_kernel(...)
    {
        /* C := alpha*A*A' + beta*C , NxN matrix */
        int j = blockIdx.x * blockDim.x + threadIdx.x;
        int i = blockIdx.y * blockDim.y + threadIdx.y;

        if ((i<N) && (j<N)) {
            c[i*N+j] = c[i*N+j] * beta;                             /* load (1) */
            for(int k=0; k<N; k++) {
                c[i*N+j] = alpha * a[i*N+k] * a[j*N+k] + c[i*N+j];  /* loads (2), (3), (4), left to right */
            }
        }
    }
To visualize the above categories in detail, we choose a benchmark, syrk, from Polybench/GPU [31].
The syrk benchmark has intra-warp (IW) locality as well as cross-warp intra-block
(XW) and cross-warp cross-block (XB) locality. The example kernel code in Listing 3.1 and the
address distribution per category are given. For simplicity in this example, we assume N = 16,
a warp size of 4, and a thread block size of 16. There are 16 thread blocks in total, and an SM
holds at most 4 thread blocks.
The code listing above shows four load instructions, which are highlighted and labeled
1 to 4 in the listing. The load instruction execution sequence is {1, {2, 3, 4}^N}, where
{...}^N represents N repetitions of the sequence in the curly brackets, and N is 16 in this
case. Since N is 16, there are 1 + 3 x 16 = 49 load memory accesses per thread. The memory address
pattern for thread 0 (T0) is shown in Figure 3.1a, intra-thread locality (IT). Since load instructions
1 and 4 read the same memory address regardless of the k value, these load instructions have
temporal locality in a thread’s memory accesses. On the other hand, since load instructions 2 and 3
each access neighboring memory addresses as k advances, they show spatial locality. Hence, the
memory pattern for the thread T0 shows both temporal and spatial locality.
[Figure 3.1 plots (x-axis: access cycle, 0 to 50; y-axis: memory address, 0 to 300):
(a) Memory access pattern for a thread (T0) - Intra-thread (IT) locality, with the regions of temporal locality and spatial locality marked.
(b) Memory access pattern for a warp (W0 with T0, T1, T2, T3) - Intra-warp (IW) locality. Memory accesses within a rectangle can be coalesced, along the marked coalescing direction.]
Figure 3.1. Memory access pattern for a thread and a warp.
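The per-thread load sequence discussed above can be reproduced with a short host-side sketch. It assumes that loads 2 and 3 correspond to a[i*N+k] and a[j*N+k] respectively and that loads 1 and 4 are the two reads of c[i*N+j]; this mapping is inferred from the description of their locality, and the names below are hypothetical.

```cpp
#include <cassert>
#include <vector>

// One load issued by a thread of the syrk kernel (illustrative model).
struct Load {
    int label;   // load label 1..4 as used in the text
    char array;  // 'a' or 'c'
    int index;   // element index within that array
};

// Loads issued, in program order, by the thread that owns element (i, j).
std::vector<Load> syrk_loads(int i, int j, int N) {
    assert(i >= 0 && i < N && j >= 0 && j < N);
    std::vector<Load> seq;
    seq.push_back({1, 'c', i * N + j});      // c[i*N+j] * beta
    for (int k = 0; k < N; ++k) {
        seq.push_back({2, 'a', i * N + k});  // walks row i of a
        seq.push_back({3, 'a', j * N + k});  // walks row j of a
        seq.push_back({4, 'c', i * N + j});  // same element of c every time
    }
    return seq;
}
```

For N = 16 the sequence has 1 + 3 x 16 = 49 entries. Loads 1 and 4 always carry the same index (temporal locality), while the indices of loads 2 and 3 advance by one element per iteration (spatial locality).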
The next scope is intra-warp locality (IW). Figure 3.1b shows the memory address pattern
for the 4 threads, T0, T1, T2, and T3, in the same warp. Coalescing occurs at the intra-warp
level: accesses inside a red rectangle in the figure, that is, along the vertical direction of the
address distribution within a cycle, can be coalesced. As explained in Section 2.5.2, only
memory addresses within the cache block size are coalesced. Therefore, when the memory requests
are scattered too widely to be grouped into a cache line, e.g., when N is larger, the memory-divergent
instruction generates up to warp-size memory requests. The temporal locality and spatial locality
observed in the IT locality graph still hold.
The next scope is cross-warp intra-block (XW) locality as in Figure 3.2. From this scope
and beyond, the warp scheduler plays a very important role. The address pattern without the warp
scheduler effect is shown in Figure 3.2a. The pairs {W0, W2} and {W1, W3} show very similar
memory access patterns, and those patterns have very good spatial and temporal locality. When we
include the warp scheduler effect as shown in Figure 3.2b, the memory accesses are more scattered
in time. In this figure, by scheduling the locality pairs apart, for example {W0, W1} first and
{W2, W3} later, the locality shown in the code can be completely destroyed. As shown, the x-axis
of Figure 3.2b also stretches to about 1.5x as many cycles. With the real warp schedulers
introduced in Section 2.4, the memory access pattern is much more complex. As the number of
memory requests grows, the chance of thrashing rises because, in a short period of time, many
memory accesses contend to acquire the cache resources.
[Figure 3.2 plots (x-axis: access cycle; y-axis: memory address, 0 to 400; series: W0, W1, W2, W3):
(a) Memory access pattern for a thread block (TB0) - ideal - Cross-warp Intra-block (XW) locality.
(b) Memory access pattern for a thread block - skewed by warp scheduler effect - Cross-warp Intra-block (XW) locality.]
Figure 3.2. Memory access pattern for a thread block.
[Figure 3.3 plots (x-axis: access cycle; y-axis: memory address, 0 to 400; series: TB0, TB1, TB2, TB3):
(a) Memory access pattern for an SM - ideal - Cross-warp Cross-block (XB) locality.
(b) Memory access pattern for an SM - skewed by warp scheduler effect - Cross-warp Cross-block (XB) locality.]
Figure 3.3. Memory access pattern for an SM.
The largest scope is cross-warp cross-block (XB) locality. Figure 3.3 shows this case. The
working set of memory access grows even larger than in the XW case. Figure 3.3a shows the
memory access pattern without the effect of the warp scheduler and the memory latency. This still
shows good inter-block locality between the blocks without scheduler effect, while the scheduler
effect affects the locality pattern significantly.
In summary, GPU applications have inherent memory access locality either at the thread
level, warp level, cross-warp level or cross-block level. Since threads are executed as a warp in
GPUs, the thread level locality is absorbed to the warp level by coalescing at the MACU unit.
However, when the memory accesses with locality are executed in a GPU, they are competing with
each other to acquire the limited resources such as warp schedulers, LDST units, caches, and other
resources in the GPU. When the number of memory accesses at a certain time interval (working set)
is large, the contention becomes severe. This contention creates a serious performance bottleneck.
3.2 Taxonomy of Cache Contention
Due to the resource limitations of the memory hierarchy, as introduced in Section 3.1, the
inherent memory access locality can be dramatically reduced, causing cache miss contention
and cache resource contention. According to the cause of the contentions, we classify them as
follows.
[Figure 3.4 plots, one bar per benchmark (2dconv, 2mm, 3dconv, 3mm, atax, bicg, gesummv, mvt, syr2k, syrk, backprop, bfs, hotspot, lud, nw, srad1, srad2, mean); series: intra-warp contention, cross-warp contention, cross-block contention:
(a) Miss contention classification (per kilocycle).
(b) Miss contention classification (percentage).]
Figure 3.4. Classification of miss contentions at L1D cache in per kilocycle and in percentage.
3.2.1 Cache Miss Contention Classification
When a cache miss occurs, the new memory request must evict a previously residing cache
line (victim cache line). Depending on the relationship between the warp IDs and block IDs of the
inserted and evicted lines, the cache miss is classified into intra-warp, cross-warp, and cross-block
contention. We classify the GPU cache miss contention as follows:
• Intra-Warp contention (IW) refers to contention among memory requests from threads in the
same warp. The contention can occur among the memory requests from the same instruction
of the same warp or among the memory requests from different instructions of the same
warp. We define the former case as coincident IW contention because the contention is
between the requests occurring at the same time frame (during coalescing), and the latter
case as non-coincident IW contention.
• Cross-Warp Contention (XW) refers to contention among memory requests from threads in
different warps. Since many warps execute concurrently in an SM, the memory working set
size from those warps tends to be large. Hence, the cross-warp contention frequently causes
capacity misses. Data blocks are evicted frequently before any reuse occurs, especially when
the reuse distance is long. More importantly, memory requests with no reuse may evict cache
lines that have high reuse, resulting in cache pollution.
• Cross-Block Contention (XB) refers to contention among memory requests from threads in
different thread blocks. This is a subset of the XW contention where contention source is
from two different blocks.
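The classification above can be expressed as a comparison between the request that inserts a line and the request that originally brought in the victim line. The sketch below is a minimal illustration that reports the most specific category; it assumes warp ids are local to a thread block and uses a hypothetical dynamic instruction id to separate coincident from non-coincident IW contention.

```cpp
#include <cassert>

enum class MissContention { IWCoincident, IWNonCoincident, XW, XB };

// Hypothetical record of the request that owns a cache line.
struct LineOwner {
    int block;         // thread block id
    int warp;          // warp id within the thread block
    long instruction;  // dynamic instruction instance that issued the request
};

// Classify a miss by comparing the inserting request with the victim's owner.
MissContention classify_miss(const LineOwner& inserting,
                             const LineOwner& victim) {
    // Different thread blocks: cross-block (the XB subset of XW contention).
    if (inserting.block != victim.block) return MissContention::XB;
    // Same block, different warps: cross-warp contention.
    if (inserting.warp != victim.warp) return MissContention::XW;
    // Same warp: coincident if both requests come from the same instruction.
    return inserting.instruction == victim.instruction
               ? MissContention::IWCoincident
               : MissContention::IWNonCoincident;
}
```

Counting the returned categories over all evictions of a run would give a breakdown of the kind plotted in Figure 3.4, under the stated assumptions about how requests are identified.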
Figure 3.4 shows the miss contention classification in L1D cache depending on the cause
of miss contention. From the figure, about 45% of the cache misses on average are from intra-warp
contention. In particular, for benchmarks such as atax, bicg, gesummv, mvt, syr2k,
and syrk, intra-warp contention dominates the miss contention.
3.2.2 Cache Resource Contention Classification
When an incoming memory access request is serviced in the cache hierarchy after a miss,
it checks the cache to see if there are enough resources, such as a cache line, an MSHR entry, and
a miss queue entry, to serve the request. When the cache does not have enough resources, the
incoming request is stalled and retries until it acquires the cache resources it needs. Depending on
the type of the resource acquisition failure, we classify the resource contention into three categories:
• Line Allocation Fail occurs when there is no cache line available for allocation. When a
request misses in a cache, the request tries to find an available line in the cache for allocation
under Allocation-on-Miss policy. When all the lines are reserved, line allocation fail occurs
and the request retries until a line is available.
• MSHR Fail occurs when the request can reserve a line in a set but there is no entry available
in the MSHR. The MSHR holds an active request sent to the lower level cache until the
request returns with data.
• Miss Queue Fail occurs when a line is available to be allocated and there is a MSHR entry
available, but the miss queue to the lower level is full. It also means that the lower level
cache is blocked.
Figure 3.5. Resource contentions at L1D cache in per kilocycle and in percentage. (a) Resource
contention (per kilocycle). (b) Resource contention (percentage). [Bar charts per benchmark;
series: LineAlloc Fail, MissQueue Fail, MSHR Fail.]
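The order in which the three failures above are checked can be sketched as follows. This is a minimal model under the assumption that the checks happen in the order the list describes them (line, then MSHR, then miss queue); the function and parameter names are illustrative and are not taken from any simulator.

```python
def resource_check(free_lines_in_set, free_mshr_entries, free_miss_queue_slots):
    """Return the first reservation that fails, or "OK" if the miss can proceed."""
    if free_lines_in_set == 0:
        return "LineAllocFail"   # every line in the target set is reserved
    if free_mshr_entries == 0:
        return "MSHRFail"        # a line is available but the MSHR is full
    if free_miss_queue_slots == 0:
        return "MissQueueFail"   # line and MSHR are available, miss queue is full
    return "OK"
```

A request that receives anything other than "OK" would retry in a later cycle, which is the stall behavior described in the text.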
Figure 3.5 shows the metrics for each type of resource reservation failure. Approximately
80% of the resource contention is LineAllocFail, and resource contention takes about 500 cycles
per kilocycle on average for the benchmarks we tested.
Type  Processor        L1 cache  Threads / Core  Cache / Thread
CPU   Intel Haswell    32 KB     2               16 KB
      Intel Xeon-Phi   32 KB     4               8 KB
      AMD Kaveri       64 KB     2               32 KB
Type  Processor        L1 cache  Threads / Core  Cache / Thread  Cache / Warp
GPU   Nvidia Fermi     48 KB     1536            32 B            1 KB
      Nvidia Kepler    48 KB     2048            24 B            750 B
      AMD SI           16 KB     2560            6.4 B           410 B
Table 3.1. Cache capacity across modern multithreaded processors.
3.3 Cache Contention Factors
3.3.1 Limited Cache Resource
Modern GPUs have widely adopted hardware-managed cache hierarchies inspired by the
successful deployment in CPUs. However, traditional cache management strategies are mostly
designed for CPUs and sequential programs; replicating them directly on GPUs may not deliver
the expected performance as GPUs’ relatively smaller cache can be easily congested by thousands
of threads, causing serious contention and thrashing.
Table 3.1 lists the L1D cache capacity, thread volume, and per-thread and per-warp L1
cache size for several state-of-the-art multithreaded processors. For example, the Intel Haswell
CPU has 16 KB of cache available per thread per core, whereas the NVIDIA Fermi GPU has only
32 B of cache available per thread, which is significantly smaller. Even if we consider the cache
capacity per warp (i.e., the execution unit in a GPU), Fermi has only 1 KB per warp, which is
still far smaller than the per-thread share on the CPU. Generally, the per-thread or per-warp
cache share for GPUs is much smaller than for CPUs. This suggests that the useful data fetched
by one warp is very likely to be evicted by other warps before actual reuse. Likewise, the useful
data fetched by a thread can also be evicted by other threads in the same warp. Such eviction
and thrashing conditions destroy locality and impair performance. Moreover, excessive incoming
memory requests can lead to significant delay when threads are queuing for the limited resources
in caches (e.g., a certain cache set, MSHR entries, miss buffers, etc.). This is especially so
during an access burst (e.g., in the starting phase of a kernel) or under set-contending
coincident IW contention.
Figure 3.6. Classification of cache misses (intra-warp (IW), cross-warp (XW), and cross-block (XB)
miss) and comparison with different associativity (4-way and 32-way) caches. Left bar is with
4-way associativity and right with 32-way.
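The per-thread and per-warp shares quoted above follow from simple division. The sketch below reproduces the Haswell and Fermi figures; it assumes 1 KB = 1024 B and a warp of 32 threads, and the helper name `cache_share` is introduced here only for illustration.

```python
KB = 1024  # assumption: binary kilobytes

def cache_share(l1_bytes, threads_per_core, threads_per_warp=None):
    """Return (bytes per thread, bytes per warp or None) for one core or SM."""
    per_thread = l1_bytes / threads_per_core
    if threads_per_warp is None:
        return per_thread, None
    warps = threads_per_core // threads_per_warp
    return per_thread, l1_bytes / warps

haswell = cache_share(32 * KB, 2)        # 2 hardware threads share 32 KB
fermi = cache_share(48 * KB, 1536, 32)   # 1536 threads, i.e. 48 warps, share 48 KB
```

Under these assumptions the Haswell share is 16 KB per thread, while the Fermi share is 32 B per thread and 1 KB per warp.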
3.3.2 Column-Strided Accesses
The cache contention analysis in Section 3.2 shows that intra-warp contention accounts for
about 45% of the overall cache miss contention and that 80% of the resource contention is line
allocation fail. This suggests that intra-warp associativity contention has a large impact
on GPU performance. To illustrate the problem of intra-warp contention and quantify
its direct impact on GPU performance, we used two L1D configurations that have the same
total capacity (16 KB) but different cache associativities (4 vs. 32) to execute 17 benchmarks
from PolyBench [31] and Rodinia [14].
As shown in Figure 3.6, after increasing the associativity from 4 to 32, the intra-warp
misses in atax, bicg, gesummv, mvt, syr2k, and syrk are reduced significantly. About
45% of the misses in gesummv are still intra-warp (IW) misses because it has two fully
memory-divergent loads that contend for the L1D cache. Even though a 32-way cache is impractical
for real GPU architectures, this experiment shows that eliminating associativity conflicts is
critical for high performance in benchmarks with memory-divergent instructions.
In benchmarks with multidimensional data arrays, the column-strided access pattern is
prone to create this high intra-warp contention on associativity. The most common example of
this pattern is A[tid∗STRIDE+offset], where tid is the unique thread ID and STRIDE is the
user-defined stride size. With this pattern, each thread iterates over a stride of data
independently. In a conventional cache indexing function, the target set is computed as
set = (addr/blkSz) % nset, where addr is the target memory address, blkSz is the length of a
cache line, and nset is the number of cache sets. For example, in Listing 3.1, when the address
stride between two consecutive threads is equal to a multiple of blkSz ∗ nset, all blocks needed
by a single warp are mapped into the same cache set. When the stride size (STRIDE) is 4096
bytes as in kernel 1 below, the 32 consecutive intra-warp memory addresses, 0x00000, 0x01000,
0x02000, ..., 0x1F000, will be mapped into set 0 in our baseline L1D that has 4-way
associativity, 32 cache sets, and 128 B cache lines.
Since cache associativity is often much smaller than warp size, 4 (associativity) versus 32
(warp size) in this example, associativity conflict occurs within each single memory-divergent load
instruction and then the memory pipeline is congested by the burst of intra-warp memory accesses.
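The set mapping described above can be checked with a few lines of Python. The sketch uses the baseline parameters stated in the text (128 B lines, 32 sets, warp size 32); the function names are illustrative.

```python
BLK_SZ = 128     # cache line size in bytes
N_SET = 32       # number of cache sets
WARP_SIZE = 32   # threads per warp

def set_index(addr, blk_sz=BLK_SZ, n_set=N_SET):
    """Conventional indexing: set = (addr / blkSz) % nset."""
    return (addr // blk_sz) % n_set

def warp_set_indices(base, stride_bytes):
    """Set index touched by each thread of a warp for a strided access."""
    return [set_index(base + tid * stride_bytes) for tid in range(WARP_SIZE)]

strided = warp_set_indices(0x00000, 4096)   # stride is a multiple of blkSz * nset
unit = warp_set_indices(0x00000, 128)       # one cache line per thread
```

With a 4096-byte stride every entry of `strided` is set 0, so 32 distinct lines compete for the 4 ways of a single set, whereas the 128-byte stride in `unit` spreads the same 32 requests over all 32 sets.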
3.3.3 Cache Pollution
While GPGPU applications may exhibit good data reuse, the small size of the cache, the
heavy contention in the L1D cache, and the many active memory requests push the accesses that
could exploit that reuse farther apart, so that a later access to the same line results in a
miss. Figure 3.7 shows the distribution of reuse from the execution of each benchmark with a
16 KB, 32-set, 4-way, 128-byte block size cache. The reuse value is defined as the number of
accesses to a line between its insertion and its eviction. The initial load to the cache line
sets the value to zero. Each successive hit on the line increases the reuse value by one. Reuse0
means that the line is never reused after its insertion into the cache. As can be seen, Reuse0
dominates the distribution at about 85%. That is, many of the cache lines are polluted by
one-time-use data. If one-time-use data is detected before insertion and is not inserted, the
efficiency of the cache will increase. Also, if such lines do not pollute the cache, the reuse
frequency of other lines will increase.
Figure 3.7. Block reuse percentage in the L1D cache. Reuse0 represents no-reuse until eviction.
[Legend: Reuse 0, Reuse 1, Reuse 2, Reuse ≥ 3.]
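The reuse value defined above can be collected with a small set-associative LRU model. This is a simplified sketch and not the simulator's implementation: it ignores MSHRs and timing, buckets reuse values of 3 or more together as in the figure legend, and (as an assumption made here) also counts lines still resident when the trace ends.

```python
from collections import Counter, OrderedDict

def reuse_histogram(addresses, n_set=32, assoc=4, blk_sz=128):
    """Histogram of per-line reuse values; key 3 stands for 'reuse >= 3'."""
    sets = [OrderedDict() for _ in range(n_set)]  # block -> reuse value, in LRU order
    hist = Counter()
    for addr in addresses:
        block = addr // blk_sz
        lines = sets[block % n_set]
        if block in lines:
            lines[block] += 1            # each hit increases the reuse value by one
            lines.move_to_end(block)     # mark the line as most recently used
        else:
            if len(lines) == assoc:      # set is full: evict the LRU line
                _, reuse = lines.popitem(last=False)
                hist[min(reuse, 3)] += 1
            lines[block] = 0             # the initial load sets the value to zero
    for lines in sets:                   # assumption: flush resident lines at the end
        for reuse in lines.values():
            hist[min(reuse, 3)] += 1
    return hist
```

Feeding it a column-strided stream such as `[t * 4096 for t in range(8)]` yields only Reuse0 lines, which is the pollution case discussed in the text.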
3.3.4 Memory Request Stall
When the LDST unit becomes saturated, it does not accept new requests. When a request in
the LDST unit stalls on one of the cache resources (for example, a line, an MSHR entry, or a
miss queue entry), the request fully owns the LDST unit and retries for the resource. Once the
contended resource becomes free, the retried request acquires it and the LDST unit can accept
the next ready warp. While the stalled request retries, the LDST unit is blocked by the request
and cannot be preempted by a request from another ready warp. The retry time is usually long
because the active requests occupy the resource during the long memory latency needed to fetch
data from the lower level of the cache or the global memory.
During this stall, other ready warps that may have the data they need in the cache cannot
progress because the LDST unit is stalled by the previous request. This situation is
illustrated in Figure 3.8. Assume W0 is stalled in the LDST unit and there are 3 ready warps in
the issue stage. While W0 is in a stall, the 3 ready warps in the issue stage cannot issue any
request to the LDST unit. Even though W3 data is in the cache at that moment, it cannot be
accessed by W3. While W0, W1, and W2 are being serviced in the LDST unit, the W3 data in the
cache may be evicted by the other requests from W0, W1, or W2. When W3 finally accesses the
cache, the previous W3 data has already been evicted, so the W3 request misses in the L1D cache
and must send a fetch request to the lower level. If W3 had a chance to be scheduled to the
LDST unit during the stall, it would hit in the cache and save the extra cycles.
Figure 3.8. LDST unit is in a stall. A memory request from ready warps cannot progress because
the previous request is in stall in the LDST unit.
To estimate the potential opportunity for hits under this circumstance, we measured the
number of ready warps when the memory request stall occurs. This number does not dictate the
number of additional hits, but it gives us an estimate. Figure 3.9 shows that 12 warps on
average are ready to be issued when the LDST unit is in a stall.
Figure 3.9. The average number of ready warps when cache resource contention occurs.
CHAPTER 4
CONTENTION-AWARE SELECTIVE CACHING
4.1 Introduction
Our analysis in Chapter 3 shows that if the memory accesses from a warp do not coalesce,
L1D cache gets populated fast. The worst case scenario is a column-strided access pattern that
maps many accesses to the same cache set. This creates severe resource contention, resulting in
stalls in the memory pipeline. Furthermore, the simultaneous memory accesses from several in-
flight warps cause contention as well. The widely used associativity of L1D cache is 4-way and
it can be significantly smaller compared to the number of per-thread divergent memory accesses.
There could be as many as 32 threads generating divergent memory accesses. This contention puts
severe stress on the L1 data cache.
To distribute these concentrated accesses across cache sets, memory address randomization
techniques for GPUs have been proposed [75, 86]. They permute the cache index bits using XOR
operations to distribute the concentrated memory accesses uniformly over the entire cache.
However, dispersion over the entire cache does not work well because there are many in-flight
warps and the memory access patterns of the warps are similar. It does not effectively reduce
the active working set size.
To mitigate the contention generated by memory divergence, this chapter presents a
proactive contention detection mechanism and a selective caching algorithm that uses a
per-cache-set concentration measure and a Program Counter (PC)-based locality measure to
maximize the benefits of caching. In this chapter, we identify the problematic intra-warp
associativity contention in GPUs and analyze the cause of the problem in depth. Then, we present
a proactive contention detection mechanism for use when a contention-prone memory access pattern
occurs. We also propose a selective caching algorithm based on the per-cache-set concentration
measure and the per-PC locality measure. A thorough analysis of the experimental results follows.
4.2 Intra-Warp Cache Contention
4.2.1 Impact of Memory Access Patterns on Memory Access Coalescing
The MACU determines the number of resulting memory requests from the stride of the memory
addresses among threads. For example, when 4 threads of a warp access 4 consecutive words
(i.e., a stride of 1) in a cache-line-aligned data block, the MACU generates only one memory
request to the L1D cache. We call this case a memory-convergent instruction, as shown in
Figure 4.1a. Otherwise, the simultaneous requests are not fully coalesced and several memory
requests are generated to the L1D cache to fetch all demanded data. In the worst case, the
4 memory requests are not coalesced at all and generate 4 distinct memory requests to the L1D
cache. We call this case a fully memory-divergent instruction, as shown in Figure 4.1c. If the
number of generated memory requests is in between, we call it a partially memory-divergent
instruction, as shown in Figure 4.1b. We define the number of resulting memory requests as the
memory divergence degree.
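The memory divergence degree defined above can be illustrated with a short Python sketch. It assumes that requests are coalesced per 128-byte cache line, which is a simplification of real coalescing rules, and the function names are introduced here for illustration only.

```python
def divergence_degree(addresses, blk_sz=128):
    """Number of distinct cache lines touched by the threads of one instruction."""
    return len({addr // blk_sz for addr in addresses})

def classify_instruction(addresses, blk_sz=128):
    """Label an instruction by its memory divergence degree."""
    degree = divergence_degree(addresses, blk_sz)
    if degree == 1:
        return "memory-convergent"
    if degree == len(addresses):
        return "fully memory-divergent"
    return "partially memory-divergent"
```

For the 4-thread example in the text, addresses `[0, 4, 8, 12]` coalesce into one request, `[0, 4, 128, 132]` into two, and `[0, 128, 256, 384]` stay as four distinct requests.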
4.2.2 Coincident Intra-warp Contention Access Pattern
As described in Section 3.2.1, a large portion of the cache contention for the benchmarks
comes from coincident intra-warp contention. Listing 4.1 shows the bicg benchmark kernel code
and a column-strided access pattern. The column-major strided access pattern is prone to create
this high coincident intra-warp contention on associativity. The most common example of this
pattern is A[tid∗stride+offset], where tid is the unique thread ID and stride is the
user-defined stride size. With this pattern, each thread iterates over a stride of data
independently. In a conventional cache indexing function, the target set index is computed as
set = (addr/blkSz) % nset, where addr is the target memory address, blkSz is the length of a
cache line, and nset is the number of cache sets. From Listing 4.1 and Figure 4.2, when the
address stride between two consecutive threads is equal to a multiple of blkSz ∗ nset, all
blocks needed by a single warp are mapped into the same cache set. When the stride size, stride,
is 4096 words as in kernel 1, the 32 consecutive intra-warp memory addresses, 0x00100, 0x01100,
0x02100, ..., 0x1F100, will be mapped into set 2 in our baseline L1D that has 4-way
associativity, 32 cache sets, and 128 B cache lines.
Figure 4.1. (Revisited) Coalescing example for memory-convergent instruction and
memory-divergent instruction. (a) Fully convergent. (b) Partially divergent. (c) Fully divergent.
Figure 4.2. Example of contending set by column-strided accesses. [The 32 requests of a warp
(tid0 0x00100, tid1 0x01100, tid2 0x02100, ..., tid31 0x1F100) target a cache of n sets
(Set 0 to Set 31), each line of size blksize.]
Since cache associativity is often smaller than the warp size, 32 in this example, an
associativity conflict occurs within each single memory-divergent load instruction, and the
memory pipeline is then congested by the burst of intra-warp memory accesses. Figure 4.3a
Scenario A illustrates the case. When the fully divergent warp instruction is issued, it reaches
the MACU. Since it is a fully divergent instruction, it generates as many memory requests as the
warp size. Since, as assumed, this instruction has a column-strided pattern, the generated cache
set indexes are the same, namely set k. Cycle T0 in Figure 4.3a indicates the beginning of the
memory access. The generated memory requests begin by probing the L1D cache. With the
allocation-on-miss policy, the first 4 accesses, for threads 1 through 4, acquire cache lines in
set k between T0 and T1 and send requests to the lower level cache after allocating MSHR
entries. The MSHR entries are filled at T1. Until the response for a request arrives, its line
is allocated and marked as reserved, so that later attempts to acquire the line fail. Since 4
lines are already allocated by the first 4 memory accesses, the remaining requests, 5 through
32, must wait until at least one of the responses to the first 4 requests arrives from the lower
level. At T2, the response for the request of thread 1 arrives, fills the cache, and changes the
state of the line to valid. However, as soon as the response is cached, a waiting request evicts
the line right away and reserves it for its own request. The final request from the warp fills
the cache line as illustrated in Figure 4.3b at T7.
Figure 4.3. Example of BICG memory access pattern. (a) Illustration of the cache congestion in
L1D cache when column-strided pattern occurs (Scenario A: baseline) and the bypassed L1D accesses
(Scenario B: bypassing). Each arrow represents turnaround time of the request. (b) Cache contents
and MSHR contents at T0, T1, and T7.
Listing 4.1. BICG benchmark kernel 1 and kernel 2
    // Benchmark bicg's two kernels
    // Thread blocks: 1-dimensional 256 threads
    // A[] is a 4096-by-4096 2D matrix stored
    // as a 1D array
    // Code segment is simplified for demo

    __global__ void bicg_kernel_1(...) {
        int tid = blockIdx.x * blockDim.x + threadIdx.x;
        for (int i = 0; i < NX; i++)  // NX = 4096
            s[tid] += A[i * NY + tid] * r[i];
    }

    __global__ void bicg_kernel_2(...) {
        int tid = blockIdx.x * blockDim.x + threadIdx.x;
        for (int j = 0; j < NY; j++)  // NY = 4096
            q[tid] += A[tid * NY + j] * p[j];
    }
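The serialization in Scenario A can be approximated with a very coarse timing model. The sketch below is only a back-of-the-envelope estimate under assumptions made here: every miss has the same round-trip latency (`miss_latency`, a hypothetical value), requests to one set proceed in batches of `assoc`, and a bypassed burst is limited by a single round trip.

```python
import math

def completion_time(n_requests, assoc, miss_latency, bypass=False):
    """Cycles until the last request of one divergent instruction returns."""
    if bypass:
        # Bypassed requests do not wait for a free line in the contended set.
        return miss_latency
    # With allocation-on-miss, only `assoc` requests can hold lines at a time.
    return math.ceil(n_requests / assoc) * miss_latency
```

With 32 requests, a 4-way set, and an assumed 100-cycle miss latency, this model gives 800 cycles for the baseline against 100 cycles when the requests bypass the L1D cache. Real hardware overlaps responses and new allocations, so the numbers indicate the trend rather than exact cycle counts.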
4.3 Selective Caching
High cache contention on GPUs degrades cache behavior by destroying locality. Our
analysis shows that the majority of cache contention is caused by coincident intra-warp
contention. Therefore, we propose a locality- and contention-aware selective caching scheme
that avoids non-beneficial caching. Figure 4.4 shows the overall flow of the selective caching
proposed in this chapter. It consists of three major components: memory divergence detection,
cache index calculation, and locality degree calculation.
4.3.1 Memory Divergence Detection
When a group of memory addresses requested yields a linear access pattern [40], all mem-
ory accesses from a warp are coalesced into one memory request by MACU. Otherwise, it gener-
ates more than one memory request to the lower level cache. As described in Section 4.2.2, when
the memory divergence is greater than the size of the cache associativity, the LDST unit starts to
stall to acquire the cache resource as in Figure 4.3a. Therefore, detecting this memory divergence
degree plays a key role in avoiding coincident intra-warp contention. We implement this memory
divergence detector using a counter in each LDST unit per SM. The counter value for the detec-
tor varies from 1 (when fully coalesced) to the SIMD width which is typically 32 (when fully
divergent). The memory divergence detection is marked as 2 in Figure 4.4.
Once the memory divergence degree is detected, the next step is to decide whether to cache
or bypass (see footnote 1). As described in Section 4.2.2, when the memory divergence is greater
than the cache associativity, n, a stall can start and cause a serious bottleneck. We propose 3
decision schemes for when memory divergence is detected. Scheme a is to bypass all the requests
(no caching at all, Figure 4.5a). Scheme b is to partially cache the first n requests and bypass
all other requests (Figure 4.5b). The last scheme, scheme c, is to partially cache the last n
requests and bypass all other requests (Figure 4.5c). The motivation for schemes b and c is that
the cached lines can preserve intra-warp or cross-warp locality that may be exploited by
subsequent instructions. Scheme c leaves the same cache contents as the baseline after executing
the instruction, as shown in Scenario A of Figure 4.3a.
Footnote 1. Bypassing can be turned on and off by flipping a dedicated bit of memory instruction
bit stream on most modern GPUs.
Figure 4.4. The task flow of the proposed selective caching algorithm in an LDST unit.
[Flowchart with steps numbered 1 to 5; box labels: Memory Access Instruction; Divergent mem
inst?; Cache Index Calculation; Count per SetIndex greater than # assoc?; Locality degree less
than entries in the cache?; More requests?; More set indices?; outcomes Bypass and Cache; loops
marked 'per set iteration' and 'per req. iteration'.]
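The three decision schemes can be written down as a simple partition of the request list. This is a sketch for illustration: the function name and the list representation are assumptions, and caching everything when the divergence degree does not exceed the associativity is the interpretation adopted here.

```python
def split_requests(requests, assoc, scheme):
    """Return (cached, bypassed) for the requests of one divergent instruction."""
    if len(requests) <= assoc:
        return list(requests), []        # divergence fits in one set: cache all
    if scheme == "a":                    # scheme a: bypass everything
        return [], list(requests)
    if scheme == "b":                    # scheme b: cache the first n requests
        return list(requests[:assoc]), list(requests[assoc:])
    if scheme == "c":                    # scheme c: cache the last n requests
        return list(requests[-assoc:]), list(requests[:-assoc])
    raise ValueError("scheme must be 'a', 'b', or 'c'")
```

For eight requests and a 4-way cache, scheme b keeps requests 0 to 3 and scheme c keeps requests 4 to 7, which matches what the baseline leaves in the cache after the instruction completes.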
The example in Figure 4.3a Scenario B illustrates how bypassing the memory requests from
the memory-divergent instruction reduces the associativity congestion. Since the memory requests
from each thread are bypassed, there is no need to allocate a line in the cache set. Also, the
lines already in the cache do not need to be evicted, thus preserving the locality of the
previously allocated lines in the cache. As in Scenario B, when all requests are passed to the
lower level cache, they do not congest the cache resources. At T4, all the responses have
returned, and the number of cycles marked as ‘Saved Cycles’, T6 − T4, is saved as a result of
the selective caching.
4.3.2 Cache Index Calculation
When all the generated memory accesses map to the same cache set, schemes a, b, and c
described in Section 4.3.1 operate effectively and avoid cache congestion. However, when the
generated memory accesses fall into several different cache sets rather than a single set, we
need to extend the schemes explained in Section 4.3.1 to apply to each of the sets.
We calculate the cache set index for each request and count the number of requests for each
set. When the number of requests for a set is greater than the associativity n, then we cache the last
n and bypass the others for the set as described in Scheme c in Section 4.3.1. We iterate this for
the calculated sets. The cache index calculation is marked as 3 and 4 in Figure 4.4. The portion
marked with ‘per set iteration’ is iterated per set.
Assume that there are 8 memory requests total. After cache index calculation, 5 requests
map to the cache set i and 3 requests map to the cache set k in a cache with the associativity of 2
in Figure 4.5d. Since set associativity is 2 in this example, the last 2 out of 5 memory requests for
the cache set i are cached and the remaining 3 requests are bypassed. Likewise, the last 2 out of 3
memory requests for the set k are cached and the remaining 1 request is bypassed. Since the cache
index mapping logic simply extracts the address bits in the memory request to calculate the cache
set index, the additional hardware is negligible.
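The per-set extension of scheme c can be sketched in Python as follows. The sketch groups the requests of one instruction by set index and, for every set that receives more than n requests, caches the last n and bypasses the rest. The function name and default cache parameters are illustrative assumptions.

```python
from collections import defaultdict

def select_per_set(addresses, assoc, n_set=32, blk_sz=128):
    """Return (cached, bypassed) addresses after per-set selective caching."""
    by_set = defaultdict(list)
    for addr in addresses:
        by_set[(addr // blk_sz) % n_set].append(addr)   # cache index calculation
    cached, bypassed = [], []
    for reqs in by_set.values():                        # per set iteration
        if len(reqs) > assoc:
            bypassed.extend(reqs[:-assoc])              # bypass the earlier requests
            cached.extend(reqs[-assoc:])                # cache the last n (scheme c)
        else:
            cached.extend(reqs)
    return cached, bypassed

# The 8-request example from the text: 5 requests to one set, 3 to another, n = 2.
set_i = [1 * 128 + m * 4096 for m in range(5)]
set_k = [5 * 128 + m * 4096 for m in range(3)]
cached, bypassed = select_per_set(set_i + set_k, assoc=2)
```

In this example four requests are cached (the last two of each set) and four are bypassed (three from set i and one from set k), as in Figure 4.5d.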
4.3.3 Locality Degree Calculation
The selective caching scheme with the cache index calculation from Section 4.3.2 always
caches as many requests as the associativity whenever a memory-divergent instruction is
detected. When a new request replaces a cache line, if the evicted line is used more frequently
or actively by the same warp or other warps, the replacement destroys locality. If we use a
locality degree for each cache line when choosing the victim line to evict, we can preserve the
most frequently used lines.
Figure 4.5. Different selective caching schemes with associativity size n when the memory
divergence is detected. (a) Bypassing all. (b) Partial caching first n. (c) Partial caching
last n. (d) Partial caching by cache index.
The locality degree calculation is marked as 5 in Figure 4.4. The portion marked with
‘per req. iteration’ is iterated for every request. The locality degree can be calculated by
counting the re-references for each instruction. When the locality degree of the current memory
request is smaller than that of every existing line in the destination cache set, it is better
for the new request to bypass the cache. If any existing line has a smaller locality degree
than the current memory request, that line is evicted and the current memory request is cached.
This strategy keeps frequently used cache lines and preserves discovered locality.
To implement this locality degree feature, we need to add two things to the baseline: new
fields in each cache line and a re-reference table. For the cache lines, a 7-bit hashed PC and
a 4-bit re-reference counter are added. Since the number of distinct load instructions in
compute kernels is known to be small, 18 on average over all PolyBench [31] and Rodinia [14]
benchmarks, 7 hashed bits are enough to distinguish among the PCs of distinct load instructions.
The re-reference table contains two fields: the 7-bit hashed PC and the average re-reference
counter. Whenever a hit occurs on a cache line, its counter is incremented by one. Whenever a
line is evicted, its PC and re-reference counter pair is added to the re-reference table. When
added to the table, the re-reference counter is accumulated as a running sum.
When a request needs to be cached after memory divergence detection and cache index
calculation, the average re-reference counter of the new request indexed by its PC is compared
with the re-reference counter of the lines in the cache set. If the re-reference counter of the new
request is higher than any others in the cache set, then a victim line is selected and the new request
is cached into the victim line. Otherwise, it is bypassed.
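One possible reading of this mechanism is sketched below in Python. Several details are assumptions made here rather than statements from the text: the average is kept as a running sum plus a count, a PC that has not been seen yet is cached by default so that the table can learn about it, the victim is the resident line with the smallest counter, and the target set is assumed to be full. The class and function names are illustrative.

```python
class ReReferenceTable:
    """Per-PC average of re-reference counters observed at eviction time."""

    def __init__(self):
        self.total = {}   # hashed PC -> running sum of re-reference counters
        self.count = {}   # hashed PC -> number of evicted lines recorded

    def record_eviction(self, pc, rereference_counter):
        self.total[pc] = self.total.get(pc, 0) + rereference_counter
        self.count[pc] = self.count.get(pc, 0) + 1

    def average(self, pc):
        """Average re-reference counter for pc, or None if pc is unknown."""
        if pc not in self.count:
            return None
        return self.total[pc] / self.count[pc]


def choose_victim(table, new_pc, set_lines):
    """Return the index of the line to evict, or None to bypass the cache.

    set_lines is a non-empty list of (pc, rereference_counter) pairs describing
    the lines currently resident in the (full) destination set.
    """
    victim = min(range(len(set_lines)), key=lambda i: set_lines[i][1])
    new_degree = table.average(new_pc)
    if new_degree is None:                   # assumption: unknown PCs are cached
        return victim
    if new_degree > set_lines[victim][1]:    # new request shows more reuse
        return victim
    return None                              # resident lines are more valuable
```

For instance, if a PC has an average of 3 re-references and the least reused resident line has a counter of 1, that line is replaced, whereas a PC whose lines were never reused is bypassed.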
4.4 Experiment Methodology
4.4.1 Simulation Setup
We configured and modified GPGPU-Sim v3.2 [82], a cycle-accurate GPU architecture
simulator to find contention in current GPU memory hierarchy, and implemented the proposed
algorithms. The NVIDIA GTX480 hardware configuration is used for the system description. The
baseline GPGPU-Sim configurations for this chapter are summarized in Table 4.1.
Number of SMs                15
SM configuration             1400 MHz, SIMD Width: 16
                             Warp size: 32 threads
                             Max threads per SM: 1536
                             Max Warps per SM: 48 Warps
                             Max Blocks per SM: 8 blocks
Warp schedulers per core     2
Warp scheduler               Greedy-Then-Oldest (GTO) [73] (default)
                             Loose Round-Robin (LRR)
                             Two-Level (TL) [63]
Cache/SM                     L1 Data: 16 KB 128B-line/4-way (default)
                             L1 Data: 16 KB 128B-line/2-way
                             L1 Data: 16 KB 128B-line/8-way
                             Rep. Policy: Least Recently Used (LRU)
                             Shared memory: 48 KB
L2 unified cache             768 KB, 128 B line, 16-way
Memory partitions            6
Instruction dispatch         2 instructions per cycle
throughput per scheduler
Memory Scheduler             Out of Order (FR-FCFS)
DRAM memory timing           tCL=12, tRP=12, tRC=40, tRAS=28, tRCD=12, tRRD=6
DRAM bus width               384 bits
Table 4.1. Baseline GPGPU-Sim configuration.
4.4.2 Benchmarks
To perform our evaluations, we chose benchmarks from the Rodinia [14] and PolyBench/GPU [31].
We pruned our workload list by omitting the applications provided in the benchmark for
basic hardware profiling and graphics interoperability. We also omitted kernels that did not have
a grid size large enough to fill all the cores and whose grid size could not be increased without
significantly changing the application code. The benchmarks tested are listed in Table 4.2.
Name      Suite      Description
2MM       PolyBench  2 Matrix Multiplication
ATAX      PolyBench  Matrix Transpose and Vector Mult.
BICG      PolyBench  BiCG SubKernel of BiCGStab Linear Sol.
GESUMMV   PolyBench  Scalar, Vector and Matrix Multiplication
MVT       PolyBench  Matrix Vector Product and Transpose
SYR2K     PolyBench  Symmetric rank-2k operations
SYRK      PolyBench  Symmetric rank-k operations
HotSpot   Rodinia    Hot spot physics simulation
LUD       Rodinia    LU Decomposition
NW        Rodinia    Needleman-Wunsch
SRAD1     Rodinia    Speckle Reducing Anisotropic Diffusion
BTREE     Rodinia    B+tree
Table 4.2. Benchmarks from PolyBench [31] and Rodinia [14].
4.5 Experimental Results
4.5.1 Performance Improvement
Figure 4.6a shows the normalized IPC improvement of our proposed algorithm over the
baseline. The baseline graph is shown with a normalized IPC of 1. Assoc-32 has the same cache
size but 32-way associativity. BypassAll bypasses all the incoming load instructions.
IndSelCaching represents the selective caching with cache index calculation. LocSelCaching
represents the locality degree based selective caching. For performance comparison with other
techniques that reduce intra-warp contention, we chose an alternative indexing scheme,
IndXor [86], and memory request prioritization, MRPB [42].
The benchmarks that have severe coincident intra-warp contention in Figure 3.4, such as
atax, bicg, gesummv, mvt, syr2k, and syrk, show significant IPC improvement. For other
benchmarks that do not suffer much from intra-warp contention, the selective caching algorithm
does not provide much benefit. The average IPC improvement is calculated using the geometric
mean. Note that the simple bypassing algorithm, BypassAll, outperforms the baseline with the
given benchmark sets, confirming that GPGPU applications suffer from unnecessary caching. Our
proposed scheme, LocSelCaching, outperforms the baseline by 2.25x, and outperforms IndXor
and MRPB by 9% and 12%, respectively.
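For reference, the geometric mean used to average normalized IPC values can be computed as in the sketch below. The input values in the usage note are hypothetical and are not measurements from this work.

```python
import math

def geometric_mean(values):
    """Geometric mean of positive ratios such as normalized IPC values."""
    return math.exp(sum(math.log(v) for v in values) / len(values))
```

For two hypothetical speedups of 1.0x and 4.0x the geometric mean is 2.0x, whereas the arithmetic mean would report 2.5x. The geometric mean is less dominated by a single large ratio, which is why it is commonly used for normalized results.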
Figure 4.6b shows the reduction in L1D cache accesses. BypassAll, as expected, reduces
the L1D cache accesses the most (see footnote 2). However, its performance is not better than
the other schemes because simple bypassing does not take advantage of locality. IndXor does not
reduce the L1D cache accesses at all since it only changes the cache indexing scheme to reduce
intra-warp contention. MRPB and IndSelCaching reduce similar amounts of L1D cache traffic.
LocSelCaching reduces the L1D cache accesses by about 71%. The cache access reduction has a
direct positive impact on power consumption, average latency, and ultimately the overall
performance.
Footnote 2. Since we do not exclude the write accesses, even the BypassAll case has L1D accesses.
The L1D cache access reduction graph contains write cache accesses to show the overall reduction
rate.
Figure 4.6. Overall improvement - IPC improvement and L1D cache access reduction. (a) IPC
improvement. (b) L1D cache access reduction. [Bar charts per benchmark with geometric mean;
series: Baseline, Assoc-32, BypassAll, IndXor, MRPB, IndSelCaching, LocSelCaching.]
Figure 4.7. IPC improvement for different schedulers. [Normalized IPC per benchmark with
geometric mean; series: GTO, TwoLevel, RR.]
Figure 4.8. IPC improvement for different associativities. [Normalized IPC per benchmark with
geometric mean; series: Assoc-2, Assoc-4, Assoc-8.]
4.5.2 Effect of Warp Scheduler
Our experiment described in Section 4.5.1 uses the Greedy-Then-Oldest (GTO) warp
scheduler. Since our selective caching algorithm reduces coincident intra-warp contention, the
choice of scheduler should not affect the trend of overall performance improvement. Figure 4.7
shows the results with different warp schedulers, such as TwoLevel [63] and the basic RoundRobin
(RR) warp scheduler. Since GTO is designed to reduce cross-warp contention and TwoLevel is
designed to improve core utilization by considering branch diversity, the improvement is
smallest with the GTO scheduler and largest with the RR scheduler. Overall, the experiments
demonstrate that the selective caching scheme reduces coincident intra-warp contention
effectively with different schedulers.
4.5.3 Cache Associativity Sensitivity
Figure 4.8 shows the results with different cache associativities: 2-, 4-, and 8-way. Coincident
intra-warp contention tends to be much more severe with a lower-associativity cache, since
the ratio of associativity to warp size is even smaller. According to the analysis shown in Figure 4.3a,
memory requests become more serialized and the stall becomes worse than for the other associativity
cases. Therefore, the IPC improvement is largest for associativity 2 among the three
associativity cases. Even with associativity 8, the scheme still improves performance by about 1.77x.
4.6 Related Work
4.6.1 Cache Bypassing
Jia et al. [41] evaluated GPU L1D cache locality in current GPUs. To justify the effect of
the cache in GPUs, they showed simulation results with and without the L1D cache. The authors then
classified cache contention into three categories: within-warp, within-block, and program-wide.
Based on these categories, they proposed compile-time methods that analyze GPU programs,
determine whether caching is beneficial or detrimental, and use that result to control whether
accesses bypass the L1D cache.
Jia et al., in their MRPB paper [42], showed an improved algorithm to bypass L1D cache
accesses. When any resource-unavailable event occurs that may lead to a pipeline stall,
memory requests bypass the L1D cache until resources become available. They also
prioritize the memory accesses involved in cross-warp contention in order to minimize it.
They discovered that massive memory accesses from different warps worsen
memory resource usage. Therefore, instead of sending the memory requests from the SMs to
the L1D cache directly, they introduce a buffer that rearranges the incoming requests, prioritizes
per-warp accesses, and reduces stalls in the memory hierarchy. This technique reduces cross-warp contention;
however, without cooperation with the warp scheduler, the effect is not significant. Their bypassing
technique is triggered when it detects cache contention. While it reacts to the contention, our
algorithm proactively detects cache contention in advance by detecting memory divergence and
selectively caches depending on the locality information.
The most recent research by Li et al. [57] also exploits bypassing to reduce contention.
They extract locality information at compile time and throttle the warp scheduler to avoid
thrashing due to warp contention. Their work focuses on reducing cross-warp contention.
While the existing works base the bypassing decision on contention that has already occurred or on static
compile-time information, which may prove to be incorrect at runtime, our work focuses on proactively
detecting and avoiding cache contention. Also, our work incorporates dynamically obtained
locality information to preserve locality.
4.6.2 Memory Address Randomization
Memory address randomization techniques for CPU caches are well studied. Pseudo-random
cache indexing methods have been extensively studied to reduce conflict misses. Topham
et al. [81] use XOR to build a conflict-avoiding cache; Seznec and Bodin [77, 11] combine XOR
indexing and circular shift in a skewed associative cache to form a perfect shuffle across all cache
banks. XOR is also widely used for memory indexing [54, 69, 70, 91]. Khairy et al. [49] use the
XOR-based Pseudo Random Interleaving Cache (PRIC), which is also used in [81, 70].
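To illustrate why XOR-based indexing helps with conflict misses, the following minimal sketch compares conventional modulo indexing with a generic XOR-folded index. The folding function, the 4096-byte stride, and all names are assumptions chosen for illustration; this is not the exact function used by any of the cited designs.

```python
# Illustrative only: a generic XOR-folded set index, not the function of any cited design.
LINE_SIZE = 128   # bytes per cache line (assumed, Fermi-like L1D)
NUM_SETS = 32     # 16 KB / 128 B per line / 4 ways


def modulo_index(addr):
    """Conventional indexing: the low-order line-address bits select the set."""
    return (addr // LINE_SIZE) % NUM_SETS


def xor_index(addr):
    """XOR the low index bits with the next-higher group of line-address bits."""
    line = addr // LINE_SIZE
    return (line % NUM_SETS) ^ ((line // NUM_SETS) % NUM_SETS)


# Hypothetical column-strided pattern: 32 threads, each LINE_SIZE * NUM_SETS bytes apart.
stride = LINE_SIZE * NUM_SETS
addrs = [thread * stride for thread in range(32)]

modulo_sets = {modulo_index(a) for a in addrs}
xor_sets = {xor_index(a) for a in addrs}
print(len(modulo_sets), len(xor_sets))
```

Under modulo indexing all 32 addresses fall into a single set, while the XOR-folded index spreads them over all 32 sets. Other strides would need a different folding to get the same effect.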
Another common approach is to use a secondary indexing method for alternative cache sets
when conflicts happen. This category of work includes skewed-associative cache [77], column-
associative cache [1], and v-way cache [67].
Some works have also noticed that certain bits in the address are more critical in reducing
the cache miss rate. Givargis [27] uses off-line profiling to detect feature bits for embedded systems.
This scheme is only applicable to embedded systems, where workloads are often known prior to
execution. Ros et al. [75] propose ASCIB, a three-phase algorithm that tracks the changes in address
bits at runtime and dynamically discards the invariant bits for cache indexing. ASCIB needs to
flush certain cache sets whenever the cache indexing method changes, so it is best suited to a
direct-mapped cache. ASCIB also needs extra storage to track the changes in the address bits.
Wang et al. [86] applied an XOR-based indexing algorithm that finds the address range
that is more critical to reducing intra-warp cache contention.
The main purpose of the different cache indexing algorithms is to distribute the accesses to a
congested set over all cache sets to minimize cache thrashing. However, GPGPU workloads are
massively parallel, so requests from different warps can easily
overflow a cache set. For example, the NVIDIA Fermi architecture uses a 16 KB L1 data cache
with 128 cache lines (32 sets with 4 lines per set). The set-contending accesses of just 4 warps
quickly fill up the cache. Indexing may distribute the cache accesses throughout the whole cache;
however, caching a massive number of lines that may not be used more than once
pollutes the cache and evicts cache lines that have better locality.
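The Fermi numbers above can be checked with a few lines of arithmetic. This is a sketch of one reading of the "4 warps" remark, assuming a fully divergent warp in which each of the 32 threads touches a distinct 128-byte line; the variable names are illustrative.

```python
CACHE_BYTES = 16 * 1024   # 16 KB L1 data cache
LINE_BYTES = 128          # cache line size
WAYS = 4                  # lines per set
WARP_SIZE = 32            # threads per warp

num_lines = CACHE_BYTES // LINE_BYTES   # total cache lines
num_sets = num_lines // WAYS            # number of sets

# Assumption: a fully divergent warp requests one distinct line per thread.
lines_per_divergent_warp = WARP_SIZE
warps_to_fill_cache = num_lines // lines_per_divergent_warp

# If all lines of one such warp map to the same set, only WAYS of them fit.
lines_without_a_way = lines_per_divergent_warp - WAYS

print(num_lines, num_sets, warps_to_fill_cache, lines_without_a_way)
```

Under these assumptions, four divergent warps request as many lines as the whole cache holds, and a single warp whose accesses all map to one set needs eight times the set's capacity.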
4.7 Summary
This chapter showed that the massively parallel thread execution of GPUs causes significant
cache resource contention, because the cache resources are not sufficient to support this execution model when
memory access patterns are not hardware friendly. By observing the contention classification for
miss and resource contention, we also identified that coincident intra-warp contention is the
main culprit, and that the stall it causes severely impacts overall
performance.
In this chapter, we proposed locality- and contention-aware selective caching based on
memory access divergence to mitigate coincident intra-warp resource contention in the L1 data
(L1D) cache on GPUs. First, we detect the memory divergence degree of the memory instruction to
determine whether selective caching is needed. Second, we use a cache index calculation to
decide which cache sets are congested. Finally, we calculate the locality degree to find a better
victim cache line.
Our proposed scheme improves the IPC performance by 2.25x over the baseline. It outperforms
the two state-of-the-art mechanisms, IndXor and MRPB, by 9% and 12%, respectively. Our
algorithm also reduces L1D cache accesses by 71%, which in turn reduces power consumption.
CHAPTER 5
LOCALITY-AWARE SELECTIVE CACHING
5.1 Introduction
Unlike CPUs, GPUs run thousands of concurrent threads, greatly reducing the per-thread
cache capacity. Moreover, typical GPGPU workloads process a large amount of data that do not
fit into any reasonably sized caches. Streaming-style data accesses on GPUs tend to evict to-be-
referenced blocks in a cache. Such pollution of cache hierarchy by streaming data degrades the
system performance.
A simple solution is to increase the cache size. However, this is not a cost-effective approach
since caches are expensive components. Therefore, a GPU cache management technique
that dynamically detects streaming patterns and selectively caches the memory requests
for frequently used data is highly desirable. A technique that dynamically determines which blocks are
likely or unlikely to be reused can avoid polluting the cache, which can have a non-trivial impact
on overall performance.
To reduce the contention caused by placing streaming data into the cache, this chapter
presents a locality-aware selective caching mechanism. To track the locality of memory requests
dynamically, we propose a hardware-efficient reuse frequency table that maintains the average
reuse frequency per instruction. Since GPUs typically execute a relatively small number of instructions
per kernel (i.e., threads execute the same single sequence of instructions), the number
of memory instructions is usually quite small. Maintaining the reuse frequency table indexed by
a hashed Program Counter (PC) keeps the hardware inexpensive. By carefully investigating the
threshold for a caching decision, we reduce the representation of the average reuse frequency
to a single bit. This chapter makes the following contributions:
• By identifying the sources of cache resource contention in the GPU cache hierarchy (line
allocation failure, MSHR failure, and miss queue failure), we find that a large number of
no-reuse blocks pollute the cache.
• We propose a hardware-efficient locality tracking table, the reuse frequency table, for dynamically
maintaining the average reuse frequency per instruction. We also define the table
update procedure.
• We propose a selective caching algorithm based on the dynamic reuse frequency table, which
provides a per-instruction locality measure to effectively identify and bypass streaming accesses.
5.2 Motivation
5.2.1 Severe Cache Resource Contention
To reduce memory traffic and latency, modern GPUs have widely adopted hardware-managed
cache hierarchies inspired by their successful deployment in CPUs. However, traditional cache
management strategies are mostly designed for CPUs and sequential programs; replicating them
directly on GPUs may not deliver expected performance as GPUs’ relatively smaller cache can be
easily congested by thousands of threads, causing serious contention and thrashing.
For example, the Intel Haswell CPU has 16 KB of cache available per thread per core, whereas the
NVIDIA Fermi GPU has only 32 B of cache available per thread, which is significantly smaller. Even
if we consider the cache capacity per warp (i.e., the thread execution unit in a GPU), it is only 1 KB
per warp, which is still far smaller than the CPU cache. Generally, the per-thread or per-warp cache
share for GPUs is much smaller than for CPUs. This suggests that the useful data fetched by one warp
is very likely to be evicted by other warps before actual reuse. Likewise, the useful data fetched
[Figure 5.1 plot: stall time as a percentage (%) of simulation cycles per benchmark (2dconv, 2mm, atax, bicg, gesummv, mvt, syr2k, syrk, backprop, bfs, hotspot, lud, nw, srad1, srad2, mean).]
Figure 5.1. Stall time percentage over simulation cycle time.
by a thread can also be evicted by other threads in the same warp. Such eviction and thrashing
conditions destroy discovered locality and impair performance. Moreover, the excessive incoming
memory requests can lead to significant delay when threads are queuing for the limited resources
in caches (e.g., a certain cache set, MSHR entries, miss buffers, etc.).
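The per-thread and per-warp cache shares quoted above are consistent with the following arithmetic. The inputs are assumptions, since the text does not state them here: a 32 KB L1 data cache shared by two hardware threads on a Haswell core, and Fermi's 48 KB L1 configuration with 1536 resident threads (48 warps) per SM. The last line applies the same calculation to the 16 KB L1D configuration.

```python
def share_bytes(cache_kib, sharers):
    """Cache bytes available to each of `sharers` equal sharers."""
    return cache_kib * 1024 / sharers


# Assumed inputs (not stated in the text): see the lead-in above.
haswell_per_thread = share_bytes(32, 2)        # bytes per CPU hardware thread
fermi_per_thread = share_bytes(48, 1536)       # bytes per GPU thread
fermi_per_warp = share_bytes(48, 48)           # bytes per warp
fermi_16k_per_thread = share_bytes(16, 1536)   # bytes per thread with a 16 KB L1D

print(haswell_per_thread, fermi_per_thread, fermi_per_warp, round(fermi_16k_per_thread, 1))
```

With the 16 KB configuration the per-thread share drops to roughly a third of the 32 B figure, which only strengthens the argument.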
Resource contention here means that a request fails to acquire a resource it needs in order to be
serviced. Figure 3.5 shows which cache resources are sources of contention and how often
contention on each resource occurs for each benchmark. The fraction of total simulation cycles spent stalled
due to this resource contention is illustrated in Figure 5.1. On average across the benchmarks,
stall cycles account for about 55% of the total simulation cycles.
5.2.2 Low Cache Line Reuse
While GPGPU applications may exhibit good data reuse, the small size of the cache,
heavy contention in the L1D cache, and the many active memory requests push the
accesses that could exploit reuse farther apart, so that they result in misses. As shown in
Figure 3.7, Reuse0 dominates the distribution with about 85%. That is, many of the cache lines
are polluted by data that is used only once. If such data is detected before insertion and kept out of
the cache, the efficiency of the cache will increase.
5.3 Locality-Aware Selective Caching
Cache line pollution by no-reuse memory requests causes severe cache contention. Caching
lines likely to be reused and bypassing lines not likely to be reused is the key to locality-aware
selective caching.
Existing CPU cache bypass techniques use memory addresses for the bypass decision. The
decision is based on the hit rate of the memory access instructions [85], temporal locality [28], access
frequency of the cache blocks [46], reuse distance [39], or references and access intervals [50]. However,
due to the massively parallel thread execution of GPUs, using memory addresses for the bypass decision
in GPUs is impractical. Figure 5.2a shows the number of distinct memory addresses present
in the GPU benchmarks investigated. Hundreds of thousands of memory blocks are accessed during
the execution of these kernels. On average, the number of distinct memory addresses is about
320,000.
Compared to this, the number of load instructions in GPU kernels is relatively small. Due
to the SIMD nature of GPUs, the kernel code size is small and all threads share the same
code while executing with different data. Figure 5.2b shows that the number of distinct load
instructions, identified by their PCs, is small. The average number of distinct PCs is about 11 in the
benchmarks tested. Therefore, keeping track of the average reuse frequency per instruction, indexed
by PC, appears to be a manageable solution for making the cache/bypass decision.
5.3.1 Reuse Frequency Table Design and Operation
We propose a reuse frequency table, shown in Figure 5.3, to store each load instruction’s PC
and reuse frequency. It has a hashed-PC field that stores the memory instruction’s PC. As can
be seen from Figure 5.2b, 64 entries would suffice to distinguish instructions. Because the table is
indexed by a hash of the PC, a kernel with more than 64 instructions can still be accommodated by
the table. The other field, ReuseFrequencyValue, holds the moving average of reuse frequency for
[Figure 5.2a plot: number of distinct addresses (0 to 1,000,000) per benchmark (2dconv, 2mm, atax, bicg, gesummv, mvt, syr2k, syrk, backprop, bfs, hotspot, lud, nw, srad1, srad2, mean).]
(a) The number of distinct memory addresses during the execution of kernels.
[Figure 5.2b plot: number of distinct PCs (0 to 30) per benchmark (2dconv, 2mm, atax, bicg, gesummv, mvt, syr2k, syrk, backprop, bfs, hotspot, lud, nw, srad1, srad2, mean).]
(b) The number of distinct PCs of the load instructions.
Figure 5.2. The number of addresses and instructions for load.
caching decisions. We revisit this entry in Section 5.3.2 for optimization.
This table is maintained globally and shared by all the SMs, since thread blocks are randomly
distributed to the SMs and their average memory access behavior is similar. Since this
table is only updated when an eviction occurs in a cache, the frequency of updates is lower than
the frequency of cache accesses. Therefore, contention to update entries in this global structure is
manageable.
Figure 5.4 shows the algorithm in detail. It consists of two phases: the reuse frequency
table update and the cache/bypass decision, shown in Figure 5.4a and Figure 5.4b, respectively.
Reuse frequency table update (Figure 5.4a): when a memory access is requested, the cache is
probed for a hit or a miss (1). A miss requires a decision on whether a line has to be evicted
from the cache set (2). When no eviction is needed (4), the request simply allocates a line
[Figure 5.3 diagram: each line in an L1D cache set (Set k) holds {Hashed PC, ReuseFreqValue}; at eviction, the victim line's values update the AvgReuseFreq entry in the ReuseFreqTable.]
Figure 5.3. Reuse frequency table entry and operation.
[Figure 5.4a flowchart: Memory request -> Probe cache (1). HIT (3): ReuseFreqValue++ in the cache line. MISS: Evict? (2). NO (4): ReuseFreqValue set to 1 in the cache line. YES (5): update ReuseFreqValue @ ReuseFreqTable[PC]. End.]
(a) Reuse frequency table update procedure.
[Figure 5.4b flowchart: Memory request -> ReuseFreqTable hit and value <= threshold? (1). YES: Probe cache (2); on HIT, update ReuseFreqValue in the cache line (3); then Bypass (4). NO: ReuseFreqTable update procedure (5). End.]
(b) Caching decision.
Figure 5.4. Reuse frequency table update and caching decision.
with the hashed-PC and a ReuseFreqValue of 1. If an eviction is needed (5), the victim line’s
ReuseFreqValue updates the entry in the reuse frequency table indexed by the victim line’s hashed-PC.
If the request is a hit (3), the ReuseFreqValue in the cache line is increased by one.
In summary, a request updates the cache line 1) when it hits a cache line (reuse
frequency increased by 1) and 2) when it misses and no eviction occurs (reuse frequency
set to 1). A request updates the reuse frequency table with the evicted line’s PC and ReuseFreqValue
only when an eviction occurs in a cache.
Caching decision (Figure 5.4b): When a memory access is requested, it probes the reuse
frequency table to see whether it is a hit or a miss. When it is a hit and the ReuseFreqValue in the
entry is less than or equal to a threshold (1), the request is not likely to be referenced in
the future and is marked to be bypassed. However, if the request is simply bypassed, there is
no chance to update the reuse statistics for the PC. The PC’s reuse frequency then becomes stale,
and the LDST unit may falsely bypass memory requests of the PC. To avoid this situation, even when
the LDST unit decides to bypass the memory request, it probes the cache (2) and, if it is a hit,
updates the cache entry’s ReuseFreqValue (3). The request is bypassed after that (4). When the
memory request does not satisfy the bypass criterion, it follows the reuse frequency table update
procedure (5).
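The two phases described above can be sketched in software as follows. This is a minimal model under stated assumptions, not the hardware design: the cache is reduced to a single fully associative LRU set, the PC hash is a simple modulo, and the table keeps a moving average with a hypothetical forgetting factor ALPHA. The names SelectiveCache, Line, hash_pc, and ALPHA are illustrative and do not come from the original text.

```python
from collections import OrderedDict

TABLE_ENTRIES = 64   # table size suggested in the text
THRESHOLD = 1.0      # bypass when the stored average is <= THRESHOLD
ALPHA = 0.5          # hypothetical forgetting factor for the moving average


def hash_pc(pc):
    """Hypothetical hash: the text does not specify the hash function."""
    return pc % TABLE_ENTRIES


class Line:
    def __init__(self, hashed_pc):
        self.hashed_pc = hashed_pc
        self.reuse = 1  # ReuseFreqValue starts at 1 on allocation


class SelectiveCache:
    """A single fully associative LRU set plus a reuse frequency table."""

    def __init__(self, ways):
        self.ways = ways
        self.lines = OrderedDict()  # line address -> Line, least recent first
        self.table = {}             # hashed PC -> average reuse frequency

    def _table_update(self, victim):
        # At eviction, fold the victim's ReuseFreqValue into its PC's entry.
        old = self.table.get(victim.hashed_pc)
        if old is None:
            self.table[victim.hashed_pc] = float(victim.reuse)
        else:
            self.table[victim.hashed_pc] = ALPHA * old + (1 - ALPHA) * victim.reuse

    def _cached_access(self, addr, hashed_pc):
        # Update procedure (Figure 5.4a).
        line = self.lines.get(addr)
        if line is not None:                  # hit: count the reuse
            line.reuse += 1
            self.lines.move_to_end(addr)
            return "hit"
        if len(self.lines) >= self.ways:      # miss that needs an eviction
            _, victim = self.lines.popitem(last=False)
            self._table_update(victim)
        self.lines[addr] = Line(hashed_pc)    # allocate with ReuseFreqValue 1
        return "miss"

    def access(self, pc, addr):
        # Caching decision (Figure 5.4b).
        hashed_pc = hash_pc(pc)
        avg = self.table.get(hashed_pc)
        if avg is not None and avg <= THRESHOLD:
            line = self.lines.get(addr)
            if line is not None:              # probe anyway to avoid stale statistics
                line.reuse += 1
            return "bypass"
        return self._cached_access(addr, hashed_pc)


cache = SelectiveCache(ways=2)
# A streaming instruction (hypothetical PC 0x10) touches a new line every time.
stream = [cache.access(0x10, addr) for addr in (0, 128, 256, 384)]
print(stream)
```

In this run the third access evicts a line that was never reused, which records a reuse frequency of 1 for the instruction, so the fourth access is bypassed. The bypassed line is not allocated, which is the intended reduction in pollution.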
5.3.2 Threshold Consideration
When a line is evicted from the cache, the ReuseFreqValue of the evicted line updates
the AvgReuseFreq field in the reuse frequency table. The average value can be calculated either
as a true average or a moving average. A true average calculation needs to track the number of
evictions and the summation of all the reuse frequency values, while a moving average needs a
previous average and a forgetting factor α. This calculation needs either value accumulation or
floating point multiplication.
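For concreteness, the two options can be written as follows. This is a sketch under assumed notation that does not appear in the original text: $r_n$ is the ReuseFreqValue of the $n$-th evicted line for a given table entry, and $\bar{r}_n$ is the stored AvgReuseFreq after that eviction. The exponential form of the moving average is one common convention and is assumed here rather than taken from the source.

```latex
% True average: needs the eviction count n and the running sum.
\bar{r}_n = \frac{1}{n}\sum_{i=1}^{n} r_i

% Moving average with forgetting factor \alpha (0 < \alpha < 1):
% needs only the previous average, at the cost of a multiplication.
\bar{r}_n = \alpha\,\bar{r}_{n-1} + (1-\alpha)\,r_n
```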
Figure 5.5 shows the IPC improvement over the baseline with different thresholds. The results
indicate that all the simulated threshold values improve performance. Th1.0 shows the smallest
improvement, while Th1.5 shows the largest. However, when the threshold is
greater than 1.0, which means some requests that have potential reuse are bypassed, some benchmarks
such as 2dconv, 2mm, and backprop suffer performance degradation. To avoid such
degradation, we choose the threshold 1.0.
When a threshold value of 1.0 is used, we do not need to calculate an average of the reuse
frequency; we simply need to record a ReuseFreqValue of 0 or 1, where 0 indicates no-reuse
[Figure 5.5 plot: normalized IPC per benchmark (2dconv, 2mm, atax, bicg, gesummv, mvt, syr2k, syrk, backprop, bfs, hotspot, lud, nw, srad1, srad2, geomean). Series: BASELINE, Th1.0, Th1.1, Th1.2, Th1.5.]
Figure 5.5. IPC improvement with different threshold values for caching decision.
and 1 indicates reuse. We now simplify the ReuseFreqValue field implementation in a cache line
to 1 bit to hold the bypass-or-not information. Our reuse frequency table also can be simplified to
have 1 bit for AvgReuseFreq.
5.3.3 Algorithm Features
1 bit for cache/bypass decision: Based on the threshold simulation analysis in Section 5.3.2,
we reduce ReuseFreqValue and AvgReuseFreq to 1 bit each. The initial load
to a cache line sets the bit to 0, indicating no-reuse. Whenever reuse occurs, the bit is set to
1, indicating reuse in the cache line. When the Reuse Frequency Table is updated upon cache
line eviction, the current value in the table is ORed with the new value. This significantly
reduces design complexity and latency. We need neither several bits for the reuse
frequency value in a cache line nor a running-average calculation for the table entry.
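The 1-bit variant can be sketched as follows. The modulo hash and the function names are assumptions for illustration; only the 0/1 encoding and the OR update come from the text.

```python
TABLE_ENTRIES = 64  # table size suggested in the text

# hashed PC -> 1-bit flag: 0 = no-reuse, 1 = reuse
reuse_table = {}


def hash_pc(pc):
    """Hypothetical hash: the text does not specify the hash function."""
    return pc % TABLE_ENTRIES


def on_eviction(pc, line_reuse_bit):
    """OR the evicted line's reuse bit into the table entry for its PC."""
    entry = hash_pc(pc)
    reuse_table[entry] = reuse_table.get(entry, 0) | line_reuse_bit


def should_bypass(pc):
    """Bypass only on a table hit whose bit still says no-reuse."""
    return reuse_table.get(hash_pc(pc)) == 0


# PC 5: every evicted line was never reused.
on_eviction(5, 0)
on_eviction(5, 0)
streaming_bypassed = should_bypass(5)

# PC 7: mixed behavior, one evicted line had been reused.
on_eviction(7, 0)
on_eviction(7, 1)
on_eviction(7, 0)
mixed_bypassed = should_bypass(7)

# PC 69 aliases with PC 5 (69 % 64 == 5) and shows reuse.
on_eviction(69, 1)
aliased_bypassed = should_bypass(5)

print(streaming_bypassed, mixed_bypassed, aliased_bypassed)
```

Because the update is an OR, a single reused line is enough to stop bypassing for every PC that shares the entry. In this sketch a set bit also stays set; how and when entries are cleared is not specified in the text.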
Conservative bypassing: The selective caching decision using the reuse frequency value
in our proposed scheme is conservative. When memory requests from the same instruction (PC)
have different reuse characteristics, that is, some requests are no-reuse and others are reuse,
the entry in the reuse frequency table is set to reuse, and the memory requests
from that PC are not bypassed. Likewise, when multiple PCs are mapped to the
same hashed-PC entry, their requests are bypassed only when the requests of all of those PCs are no-reuse.
Otherwise, the requests are not bypassed. This guarantees no performance
degradation when the threshold value is 1.0.
Avoid stale ReuseFreqValue statistics: In Figure 5.4b, step (3) updates the ReuseFreqValue in the
cache line even if the request is to be bypassed. Without step (3), the ReuseFreqValue would
become stale once a bypass decision is made for a PC. Step (3) keeps updating the ReuseFreqValue
whenever a hit occurs. This changes the status of the PC from no-reuse to reuse so that requests
with that PC are not bypassed the next time.
5.3.4 Contention-Aware Selective Caching Option
While locality-aware selective caching increases IPC performance by reducing cache pollution,
there is still another cause for a large portion of the resource contention. The
contention-aware selective caching approach presented in Chapter 4 addresses one such source:
coincident intra-warp contention, which is mainly caused by
a column-strided access pattern. That chapter shows that coincident intra-warp contention severely
blocks the overall cache resources. Contention-aware selective caching detects the problematic memory
divergence, identifies the congested sets, and decides whether to cache. Since locality-aware selective
caching can be used along with contention-aware selective caching without the two interfering with each
other, a synergistic effect is expected.
5.4 Experiment Methodology
5.4.1 Simulation Setup
We configured and modified GPGPU-Sim v3.2 [82], a cycle-accurate GPU architecture
simulator, to find contention in the GPU memory hierarchy, and implemented the proposed algorithms.
The NVIDIA GTX480 hardware configuration is used for the system description. The
baseline GPGPU-Sim configurations for this chapter are summarized in Table 5.1.
Number of SMs                     15
SM configuration                  1400 MHz, SIMD width: 16
                                  Warp size: 32 threads
                                  Max threads per SM: 1536
                                  Max warps per SM: 48 warps
                                  Max blocks per SM: 8 blocks
Warp schedulers per core          2
Warp scheduler                    Greedy-Then-Oldest (GTO) [73] (default)
                                  Loose Round-Robin (LRR)
                                  Two-Level (TL) [63]
Cache/SM                          L1 Data: 16 KB 128B-line/4-way (default)
                                  L1 Data: 16 KB 128B-line/2-way
                                  L1 Data: 16 KB 128B-line/8-way
                                  Rep. Policy: Least Recently Used (LRU)
                                  Shared memory: 48 KB
L2 unified cache                  768 KB, 128 B line, 16-way
Memory partitions                 6
Instruction dispatch throughput   2 instructions per cycle
per scheduler
Memory scheduler                  Out of Order (FR-FCFS)
DRAM memory timing                tCL=12, tRP=12, tRC=40, tRAS=28, tRCD=12, tRRD=6
DRAM bus width                    384 bits
Table 5.1. Baseline GPGPU-Sim configuration.
5.4.2 Benchmarks
To perform our evaluations, we chose benchmarks from Rodinia [14] and PolyBench/GPU [31].
We pruned our workload list by omitting the applications provided in the benchmark suites for
basic hardware profiling and graphics interoperability. We also omitted kernels that did not have
a grid size large enough to fill all the cores and whose grid size could not be increased without
significantly changing the application code. The benchmarks tested are listed in Table 5.2.
Name       Suite       Description
2DCONV     PolyBench   2D Convolution
2MM        PolyBench   2 Matrix Multiplication
ATAX       PolyBench   Matrix Transpose and Vector Mult.
BICG       PolyBench   BiCG SubKernel of BiCGStab Linear Sol.
GESUMMV    PolyBench   Scalar, Vector and Matrix Multiplication
MVT        PolyBench   Matrix Vector Product and Transpose
SYR2K      PolyBench   Symmetric rank-2k operations
SYRK       PolyBench   Symmetric rank-k operations
BACKPROP   Rodinia     Back propagation
BFS        Rodinia     Breadth-First Search
HOTSPOT    Rodinia     Hot spot physics simulation
LUD        Rodinia     LU Decomposition
NW         Rodinia     Needleman-Wunsch
SRAD1      Rodinia     Speckle Reducing Anisotropic Diffusion
SRAD2      Rodinia     SRAD 2D
Table 5.2. Benchmarks from PolyBench [31] and Rodinia [14].
5.5 Experimental Results
5.5.1 Performance Improvement
Figure 5.6a shows the normalized IPC improvement of our proposed algorithm over the
baseline. The baseline is shown with a normalized IPC of one. LASC1.0 and LASC1.5 are
the proposed schemes with different threshold values for comparison. LASC1.0+SelCaching represents
the scheme combined with contention-aware selective caching. For performance comparison with
another state-of-the-art technique that uses bypassing, we choose the memory request prioritization
buffer, MRPB [42].
Note that there is no performance degradation with a threshold value of 1.0. When the threshold
value is greater than 1.0, some benchmarks suffer performance degradation due to excessive bypassing.
The LASC1.0 performance improvement over the baseline is about 1.39x. Our proposed scheme
combined with the contention-aware selective caching described in Chapter 4, LASC1.0+SelCaching,
outperforms the baseline by 2.01x and MRPB by 7%.
[Figure 5.6a plot: normalized IPC per benchmark (2dconv, 2mm, atax, bicg, gesummv, mvt, syr2k, syrk, backprop, bfs, hotspot, lud, nw, srad1, srad2, geomean). Series: BASELINE, LASC1.0, LASC1.5, 32W, MRPB, LASC1.0+SelCaching.]
(a) IPC improvement.
[Figure 5.6b plot: percentage (%) per benchmark (2dconv, 2mm, atax, bicg, gesummv, mvt, syr2k, syrk, backprop, bfs, hotspot, lud, nw, srad1, srad2, mean).]
(b) Reduction of no-reuse memory request in percentage.
[Figure 5.6c plot: normalized average reuse per benchmark (2dconv, 2mm, atax, bicg, gesummv, mvt, syr2k, syrk, backprop, bfs, hotspot, lud, nw, srad1, srad2, geomean). Off-scale bar labels: 75.4, 95.7, 31.3, 27.4, 116.9, 43.0, 33.5.]
(c) Reuse Frequency enhancement.
Figure 5.6. Overall improvement: IPC improvement and L1D cache access reduction.
Figure 5.6b shows the reduction of no-reuse memory requests. For most of the benchmarks
(atax, bicg, gesummv, mvt, syr2k, syrk, bfs, hotspot, nw, and srad1), more
than 99% of the no-reuse memory requests are bypassed and do not pollute the cache. On average,
73% of the no-reuse memory requests are bypassed. Fewer requests are removed in
benchmarks such as 2dconv, 2mm, backprop, lud, and srad2 because their no-reuse
requests are mixed with reuse requests within the same PCs. As described in Section 5.3.3,
our scheme bypasses conservatively, so those requests are not bypassed.
Figure 5.6c shows the improved reuse frequency when LASC1.0+SelCaching is used. With
the baseline average reuse frequency normalized to one, about 27x more references are made in the cache on
average. Notably, for the bicg benchmark, the average reuse frequency improves by 117x.
Since most of the no-reuse traffic has been bypassed, there are more chances for other requests to
be referenced.
5.5.2 Effect of Warp Scheduler
Our experiment used the Greedy-Then-Oldest (GTO) warp scheduler as the default warp
scheduler. Since our locality-aware selective caching algorithm improves performance based on
the reuse frequency, the choice of scheduler should not affect the trend of overall performance improvement.
Figure 5.7 shows the results with different warp schedulers, namely the TwoLevel [63]
and RoundRobin (RR) warp schedulers. Since GTO is designed to reduce cross-warp contention and
TwoLevel is designed to improve core utilization by considering branch diversity, the improvement
is smallest with the GTO scheduler and largest with the RR scheduler. Overall, the experiments
demonstrate that the locality-aware selective caching scheme improves performance
effectively regardless of scheduler.
[Figure 5.7 plot: normalized IPC per benchmark (2dconv, 2mm, atax, bicg, gesummv, mvt, syr2k, syrk, backprop, bfs, hotspot, lud, nw, srad1, srad2, geomean). Series: BASELINE, GTO, RR, TwoLevel.]
Figure 5.7. IPC improvement with different schedulers.
[Figure 5.8 plot: normalized IPC per benchmark (2dconv, 2mm, atax, bicg, gesummv, mvt, syr2k, syrk, backprop, bfs, hotspot, lud, nw, srad1, srad2, geomean). Series: BASELINE, Assoc-2, Assoc-4, Assoc-8.]
Figure 5.8. IPC improvement with different associativities.
5.5.3 Effect of Cache Associativity
Figure 5.8 shows the results with different cache associativities: 2-, 4-, and 8-way. Cache
contention tends to be much more severe with a lower-associativity cache. Generally, memory
requests become more congested with lower associativity and the stall becomes worse than with
higher associativity. Therefore, the IPC improvement is largest for associativity 2 among
the three associativity cases. Even with associativity 8, the scheme still improves performance
by about 1.63x.
5.6 Related Work
5.6.1 CPU Cache Bypassing
Much of the existing research focuses on CPU cache management techniques [28, 38, 39,
46, 71, 85, 87]. We discuss only the bypassing-related techniques here.
Tyson et al. proposed bypassing based on the
hit rate of the memory access instructions [85], while Johnson et al. proposed using the access
frequency of the cache blocks to predict bypassing [71]. Kharbutli and Solihin proposed using
counters of events such as the number of references and access intervals to make bypass predictions
in the CPU last-level cache [50]. All of these techniques use memory address-related information
to make the prediction, costing significant storage overhead that would be impractical for GPU
caches. Program counter trace-based dead block prediction [53] leveraged the fact that sequences
of memory instruction PCs tend to lead to the same behavior for different memory blocks. This
dead block prediction scheme is useful for making bypass predictions in CPUs. We show that GPU
kernels are small with few distinct memory instructions. Using only the PC of the last memory
instruction to access a block is sufficient for a GPU bypassing prediction. Cache Bursts [59] is
another dead block prediction technique that exploits bursts of accesses hitting the MRU position
to improve predictor efficiency. For GPU workloads that use scratch-pad memories, the majority
of re-references have been filtered. Gaur et al. [25] proposed bypass and insertion algorithms for
exclusive LLCs to adaptively avoid unmodified dead blocks being written into the exclusive LLC.
5.6.2 GPU Cache Bypassing
Jia et al. [41] examined the effect of the cache in GPUs. They presented simulation results showing
that the L1D cache may degrade overall system performance. They then proposed a static method
that analyzes GPU programs at compile time, determines whether caching is beneficial or detrimental,
and controls whether accesses bypass the L1D cache accordingly. Jia et al. [42] later proposed a memory
request prioritization buffer (MRPB) to improve GPU performance. MRPB prioritizes memory
requests in order to reduce the reuse distance within a warp. It also uses cache bypassing to
mitigate intra-warp contention. However, it blindly bypasses memory requests whenever
it detects resource contention, so some benchmarks suffer performance
degradation. Compared to MRPB, our locality-aware selective caching does not degrade
performance, since we measure the reuse frequency dynamically and conservatively decide whether to cache
according to the reuse frequency.
Rogers et al. proposed cache-conscious wavefront scheduling (CCWS) to improve GPU
cache efficiency by avoiding data thrashing that causes cache pollution [72]. CCWS estimates
working set size of active warps and dynamically restricts the number of warps. This may adversely
affect the ability to hide high memory access latency of GPUs. Our selective caching bypasses the
no-reuse blocks without under-utilizing the SIMD pipeline to reduce cache thrashing.
Lee and Kim proposed a thread-level-parallelism-aware cache management policy to improve
the performance of the shared last-level cache (LLC) in heterogeneous multi-core architectures [55].
They focus on shared LLCs that are dynamically partitioned between CPUs and GPUs.
Mekkat et al. proposed a similar idea for heterogeneous LLC management [60], to better partition
the LLC between GPUs and CPUs in a heterogeneous system.
Li et al. [57] exploit bypassing to reduce contention. They extract locality information
at compile time and throttle the warp scheduler to avoid thrashing due to warp contention.
Their work focuses on reducing cross-warp contention. However, static analysis does not reflect the dynamic
behavior of the application.
Tian et al. [80] also exploit the PC to predict bypassing. Their method maintains a table for
bypass prediction and uses a confidence count to control bypassing. Every reuse decreases the
confidence count and every miss increases it. When the confidence count for the PC of a memory
request is greater than a predetermined value, the request is bypassed. However,
this scheme needs training time before it actually bypasses a request, since it uses a confidence
counter and the tables are maintained for each L1D cache. To compensate for misprediction, they
use a bypassBit in the L2 cache. However, when the bit is set and reset by requests from multiple SMs,
its bypass decision is not accurate. Our scheme maintains a global reuse frequency table to reflect
the program’s overall behavior. Also, our scheme dynamically updates the bypass table, even on
bypassed requests, to avoid misprediction.
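The confidence-count mechanism described for Tian et al. [80] can be illustrated with a short Python sketch. It is not their implementation: the class name, the threshold of 2, and the floor of zero on the counter are assumptions made for illustration. The example shows the training time mentioned above, since a streaming PC is cached several times before the predictor starts bypassing it.

```python
from collections import defaultdict


class ConfidenceBypassPredictor:
    """Per-PC confidence counters: misses raise the count, reuses lower it."""

    def __init__(self, threshold=2):
        self.threshold = threshold          # assumed value, for illustration
        self.confidence = defaultdict(int)  # PC -> confidence count

    def should_bypass(self, pc):
        # Bypass only once the count exceeds the predetermined value.
        return self.confidence[pc] > self.threshold

    def on_miss(self, pc):
        self.confidence[pc] += 1

    def on_reuse(self, pc):
        # The floor at zero is an assumption; the text does not state one.
        self.confidence[pc] = max(0, self.confidence[pc] - 1)


predictor = ConfidenceBypassPredictor(threshold=2)
streaming_pc = 0x40                 # hypothetical PC of a no-reuse load
decisions = []
for _ in range(5):
    decisions.append(predictor.should_bypass(streaming_pc))
    predictor.on_miss(streaming_pc)
print(decisions)  # [False, False, False, True, True]
```

With these assumed parameters, the first three requests from the streaming PC still enter the cache before bypassing begins.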
5.7 Summary
This chapter shows that the massively parallel thread execution of GPUs can result in memory
access patterns that cause significant cache resource contention. By observing the resource
contention classification and reuse statistics, we also identify that streaming requests dominate
the memory requests and severely pollute the cache, resulting in severe degradation
of overall performance.
We proposed a locality-aware selective caching algorithm based on per-PC memory reuse
frequency in the L1 data (L1D) cache on GPUs. When we choose 1.0 for the caching threshold,
the design becomes very simple to deploy. Our scheme uses a conservative bypassing approach,
ensuring that no performance degradation occurs.
Our proposed scheme improves the IPC performance by 1.39x over the baseline. To-
gether with contention-aware selective caching, it improves the overall performance by 2.01x.
Our scheme outperforms the state-of-the-art bypassing mechanism, MRPB, by 7%. Our algorithm
also reduces the no-reuse memory requests by 73%, which helps avoid polluting the L1D cache, and
increases the average reuse frequency by 27x.
CHAPTER 6
MEMORY REQUEST SCHEDULING
6.1 Introduction
Column-strided memory access patterns are a source of cache contention commonly found
in data-intensive applications. The access pattern is compiled to warp instructions and generates
memory access patterns that are not well coalesced, causing resource contention. The resulting
memory accesses are serially processed in the LDST unit, stress the L1 data (L1D) caches, and
result in serious performance degradation.
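The difference between a coalesced access and a column-strided access can be made concrete by counting how many distinct 128-byte cache lines one 32-thread warp instruction touches. The following Python sketch is only an illustration: the 4-byte element size, the 1024-element row length, and the function name are assumptions.

```python
LINE_BYTES = 128   # L1D line size in the configuration used in this chapter
WARP_SIZE = 32     # threads per warp


def lines_touched(base, stride_bytes):
    """Number of distinct cache lines accessed by one warp instruction."""
    addresses = [base + lane * stride_bytes for lane in range(WARP_SIZE)]
    return len({addr // LINE_BYTES for addr in addresses})


ELEM_BYTES = 4      # assumed element size (e.g., float)
ROW_ELEMS = 1024    # assumed row length of a row-major matrix

row_access = lines_touched(0, ELEM_BYTES)               # consecutive elements
col_access = lines_touched(0, ELEM_BYTES * ROW_ELEMS)   # one element per row
print(row_access, col_access)  # 1 32
```

Under these assumptions, the row-wise access needs a single cache line, while the column-strided access generates 32 separate memory requests that the LDST unit must process one after another.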
Cache contention also arises from cache pollution caused by low-reuse-frequency data. For
the cache to be effective, a cached line must be reused before its eviction. However, the streaming
characteristic of GPGPU workloads and the massively parallel GPU execution model increase the
effective reuse distance or, equivalently, reduce the reuse frequency of data. In GPUs, the pollution
caused by low-reuse-frequency (i.e., large-reuse-distance) data is significant.
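To illustrate how parallel execution lengthens the effective reuse distance, the sketch below counts the distinct lines touched between two accesses to the same line. The traces and line names are hypothetical, and the definition used here (distinct intervening lines) is one common convention, assumed for this example.

```python
def reuse_distances(trace):
    """Distinct lines touched between consecutive accesses to the same line."""
    last_seen = {}
    distances = []
    for pos, line in enumerate(trace):
        if line in last_seen:
            between = set(trace[last_seen[line] + 1:pos])
            distances.append(len(between))
        last_seen[line] = pos
    return distances


one_warp = ["A0", "A0"]   # a single warp reusing its line immediately
# Four warps (A to D) interleaved: each reuse is separated by the others.
four_warps = ["A0", "B0", "C0", "D0", "A0", "B0", "C0", "D0"]
print(reuse_distances(one_warp))     # [0]
print(reuse_distances(four_warps))   # [3, 3, 3, 3]
```

The per-warp program is the same in both traces; only the interleaving with other warps increases the distance, so a line needs to survive more intervening insertions before it is reused.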
Due to the contention, the LDST unit becomes saturated and stalls. A stalled LDST unit
does not execute memory requests from any of the ready warps in the preceding issue stage. The warp
in the LDST unit retries until the cache resource becomes available. During this stall, the private
L1D cache is also stalled, so no other warp’s requests can probe the cache. However, the L1D cache
may hold data that other ready warps in the issue stage would hit if they could access the
cache. The current structure of the issue stage and the L1D unit does not allow such
provisional cache probing, so the stall eliminates potential hits for the ready warps.
This chapter proposes a memory request schedule queue that holds ready warps’ memory
requests and a scheduler that effectively schedules them in a manner that increases the chances of a
hit. In this chapter, we identify the problematic GPU LDST unit stalls and analyze the potential hit
in depth. Then, we detail the various design factors of a queue mechanism and scheduling policy.
A thorough analysis of the experimental results follows.
[Figure 6.1. LDST unit in stall and the scheduling queue. (a) LDST unit is in a stall. A memory request from ready warps cannot progress because the previous request is in stall in the LDST unit. (b) LDST unit with scheduling queue.]
6.2 Cache Contention
6.2.1 Memory Request Stall due to Cache Resource Contention
When the LDST unit becomes saturated, a new request to the LDST unit will not be accepted.
When the LDST unit is stalled on one of the cache resources, for example, a cache line, an
MSHR entry, or a miss queue entry, the stalled request fully owns the LDST unit and retries for the
resource. Once the contended resource becomes free, the retried request finally acquires the
resource and the LDST unit can accept the next ready warp. While the stalled request retries, the
LDST unit is blocked by the request and cannot be preempted by a request from another ready
warp. Usually, the retry time is long because the active requests occupy the resource during the
long memory latency needed to fetch data from the lower level of the cache or the global memory.
[Figure 6.2. The average number of ready warps when cache resource contention occurs. Bar chart: the x-axis lists the benchmarks (2dconv, 2mm, atax, bicg, gesummv, mvt, syr2k, syrk, backprop, hotspot, lud, nw, srad1, srad2) and their mean; the y-axis is the number of ready warps, from 0 to 30.]
During this stall, other ready warps which may have the data they need in the cache cannot
progress because the LDST unit is stalled by the previous request. This situation is illustrated in
Figure 6.1a. Assume W0 is stalled in the LDST unit and there are 3 ready warps in the
issue stage. While W0 is stalled, the 3 ready warps in the issue stage cannot issue any request to
the LDST unit. Even if the W3 data is in the cache at that moment, W3 cannot hit on it.
While W0, W1, and W2 are being serviced in the LDST unit, the W3 data in the cache may be
evicted by the other requests from W0, W1, or W2. If that happens, what would have been a W3
hit now misses in the L1D cache and must send a fetch request to the lower level to recall the data.
If W3 had a chance to be scheduled to the stalled LDST unit, it would hit and save the extra cycles
to fetch the data from the lower level.
To estimate the potential opportunity for a hit under these circumstances, we measured
the number of ready warps when a memory request stall occurs. This number does not directly
represent the increase in the number of hits, but it gives us a potential estimate. From Figure 6.2,
about 12 warps on average are ready to be issued when the LDST unit is in a stall.
[Figure 6.3. A procedure for the memory request scheduler. Flowchart: pick a request from the queue according to the scheduler and probe the cache; on a HIT, process it; on a MISS, process it if a cache resource is available (YES), otherwise (NO) pick another request; iterate k times until the end of the cycle.]
6.3 Memory Request Scheduling
6.3.1 Memory Request Queuing and Scheduling
To enable potential hits under an LDST unit stall, we introduce a memory request queue for
rescheduling memory requests upon a stall.
Queuing: When the issue unit issues a warp instruction, it does not monitor the status
of the LDST unit; instead, the memory requests from the instruction are pushed into the appropriate slot
according to the warp ID. When the slot for that warp is full, the warp backs off and retries
at the next cycle, since the LDST unit still has memory requests to schedule from the same warp.
Scheduling: In the LDST unit, one request from the queue is picked by a scheduler and
probes the cache for a hit. If it hits, the instruction is executed in the LDST unit. If it misses and
the resource is available to process the instruction, the instruction is also executed in the LDST
unit. However, when the resource is not available, the scheduler picks the next available warp
instruction and repeats the process, up to an iteration factor, k. This iteration factor, k, is
implemented as multiple cache probe units that probe the cache concurrently.
Our design allows the memory request queue to hold up to n requests generated
by a warp instruction per slot. The number of slots in the queue is the same as the maximum
number of warps per SM, which is 48 in our simulation setup. This scheme is illustrated in
Figure 6.1b and the flow is shown in Figure 6.3.
[Figure 6.4. A detailed view of the memory request queue and scheduler. The memory request queue has 48 slots (W0 to W47), each with a queue depth of n items; the scheduler feeds k cache probe units, which probe the L1D cache.]
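The scheduling procedure can be sketched in Python as a single scheduling cycle. This is a simplified model under stated assumptions, not the simulator code: the function name and the (warp_id, line) request representation are hypothetical, all cache resources are collapsed into one boolean, and the k probes are modeled as a sequential loop although the design performs them concurrently.

```python
def schedule_cycle(candidates, cached_lines, resource_available, k):
    """Return the (warp_id, line) request serviced this cycle, or None.

    candidates: requests in the order the scheduler would pick them.
    cached_lines: set of line addresses currently in the L1D cache.
    resource_available: True if a line, MSHR entry and miss queue entry are free.
    k: iteration factor (number of requests that may be probed).
    """
    for warp_id, line in candidates[:k]:
        if line in cached_lines:      # probe hits: service immediately
            return (warp_id, line)
        if resource_available:        # miss, but it can be sent onward
            return (warp_id, line)
        # miss with no free resource: try the next candidate
    return None                       # every probed request would stall


cache = {0x300}                       # only W3's data is cached
pending = [(0, 0x000), (1, 0x100), (2, 0x200), (3, 0x300)]
print(schedule_cycle(pending, cache, False, k=1))  # None
print(schedule_cycle(pending, cache, False, k=4))  # (3, 768)
```

With k = 1 the model behaves like the stalled LDST unit of Figure 6.1a, where W0 blocks every other warp. With k = 4 the request from W3 is found and serviced as a hit while W0 is still waiting for a resource.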
6.3.2 Queue Depth
When the issue unit issues a warp instruction, the generated memory requests are pushed
into the memory request queue. Since one warp instruction can generate as many memory access
requests as there are threads in a warp (32), the depth of the queue for a warp must be at least
32 items to fully queue the generated memory requests. If the depth is smaller than 32, the LDST
unit can still stall. The queue depth is illustrated in Figure 6.4.
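The effect of the queue depth can be sketched as follows. The class and method names are hypothetical, and partial acceptance of a warp instruction's requests (queue what fits, retry the rest) is an assumption of this sketch rather than a documented detail of the design.

```python
from collections import deque

NUM_SLOTS = 48   # one slot per warp (maximum warps per SM in the setup)


class MemoryRequestQueue:
    def __init__(self, depth):
        self.depth = depth
        self.slots = [deque() for _ in range(NUM_SLOTS)]

    def push(self, warp_id, requests):
        """Queue as many requests as fit; return those left over.

        A non-empty return value means the warp instruction must retry
        in a later cycle (partial acceptance is an assumption here).
        """
        slot = self.slots[warp_id]
        free = self.depth - len(slot)
        slot.extend(requests[:free])
        return requests[free:]


diverged = list(range(32))       # 32 uncoalesced requests from one warp
leftover_d1 = MemoryRequestQueue(depth=1).push(0, diverged)
leftover_d32 = MemoryRequestQueue(depth=32).push(0, diverged)
print(len(leftover_d1), len(leftover_d32))  # 31 0
```

With a depth of 32, a fully diverged warp instruction is queued in one attempt. With 48 slots of depth 32, the queue holds at most 48 x 32 = 1536 requests.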
6.3.3 Scheduling Policy
After memory requests are enqueued in the appropriate queue slot, a scheduling policy
imposes a service priority among these requests. One straightforward scheduling policy is
a fixed-order policy (FIXED), i.e., an item from queue i can be scheduled only if queues 1
to i − 1 have been emptied. Another policy is a round-robin policy (RR), which selects items from
queues in turn, i.e., one item per non-empty queue. A third policy is a grouped round-robin policy
(GRR), which allows k items from consecutive slots (one item per slot) in the queue to probe
the cache at the same time. The factor, k, depends on the hardware support. For example, with 4-port
cache support, four probes can be done at the same time.
The multiple cache probe units are illustrated in Figure 6.4. k items are chosen in a round-robin
fashion to probe the cache for potential hits. When multiple hits occur in a group, the earliest slot
in the group is serviced in that cycle, and the next k items are probed in the next cycle. The multiple
units require multiple L1D cache ports, but increasing the number of probe units reduces contention and
increases the probability of a hit.
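The three policies differ only in which slots they select for probing. The Python sketch below illustrates this selection step; the function names, the zero-based slot numbering, and the choice to skip empty slots inside a GRR group are assumptions of the sketch.

```python
def fixed_pick(slots):
    """FIXED: lowest-numbered non-empty slot."""
    for i, slot in enumerate(slots):
        if slot:
            return [i]
    return []


def round_robin_pick(slots, start):
    """RR: first non-empty slot at or after `start`, wrapping around."""
    n = len(slots)
    for offset in range(n):
        i = (start + offset) % n
        if slots[i]:
            return [i]
    return []


def grouped_rr_pick(slots, start, k):
    """GRR: the k consecutive slots from `start`; empty ones are skipped."""
    n = len(slots)
    group = [(start + offset) % n for offset in range(k)]
    return [i for i in group if slots[i]]


slots = [["a"], [], ["c"], ["d"], [], ["f"]]   # pending requests per slot
print(fixed_pick(slots))                # [0]
print(round_robin_pick(slots, 1))       # [2]
print(grouped_rr_pick(slots, 2, 4))     # [2, 3, 5]
```

FIXED and RR return one slot per step, whereas GRR returns up to k slots whose head items can be probed in the same cycle by the k cache probe units.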
6.3.4 Contention-Aware Selective Caching Option
While memory request scheduling increases the hit rate when the LDST unit stalls due to
resource contention, there are other causes of resource contention. Chapter 4 addresses resource
contention arising from coincident intra-warp contention, which is mainly caused by column-strided
access patterns. That chapter shows that coincident intra-warp contention severely blocks the overall
cache resources. It proposes selective caching, which detects the problematic memory divergence,
identifies the congested sets, and decides whether to cache using per-PC reuse count information.
Since the proposed memory request scheduling can be used along with selective caching without
the two interfering with each other, selective caching can further increase cache hits.
6.4 Experiment Methodology
6.4.1 Simulation Setup
We configured and modified GPGPU-Sim v3.2 [82], a cycle-accurate GPU architecture
simulator, to find miss contention and resource contention in the current GPU memory hierarchy, and
implemented the proposed algorithms. The NVIDIA GTX480 hardware configuration is used for
the system description. The baseline GPGPU-Sim configurations for this study are summarized in
Table 6.1.
Parameter                                      Value
Number of SMs                                  15
SM configuration                               1400 MHz, SIMD Width: 16
                                               Warp size: 32 threads
                                               Max threads per SM: 1536
                                               Max Warps per SM: 48 Warps
                                               Max Blocks per SM: 8 blocks
Warp schedulers per core                       2
Warp scheduler                                 Greedy-Then-Oldest (GTO) [73] (default)
                                               Loose Round-Robin (LRR)
                                               Two-Level (TL) [63]
Cache/SM                                       L1 Data: 16 KB 128B-line/4-way (default)
                                               L1 Data: 16 KB 128B-line/2-way
                                               L1 Data: 16 KB 128B-line/8-way
                                               Rep. Policy: Least Recently Used (LRU)
                                               Shared memory: 48 KB
L2 unified cache                               768 KB, 128 B line, 16-way
Memory partitions                              6
Instruction dispatch throughput per scheduler  2 instructions per cycle
Memory Scheduler                               Out of Order (FR-FCFS)
DRAM memory timing                             tCL=12, tRP=12, tRC=40, tRAS=28, tRCD=12, tRRD=6
DRAM bus width                                 384 bits
Table 6.1. Baseline GPGPU-Sim configuration.
6.4.2 Benchmarks
To perform our evaluations, we chose benchmarks from Rodinia [14] and PolyBench/
GPU [31]. We pruned our workload list by omitting the applications provided in the benchmark for
basic hardware profiling and graphics interoperability. We also omitted kernels that did not have
a grid size large enough to fill all the cores and whose grid size could not be increased without
significantly changing the application code. The benchmarks tested are listed in Table 6.2.
Name      Suite      Description
2DCONV    PolyBench  2D Convolution
2MM       PolyBench  2 Matrix Multiplication
ATAX      PolyBench  Matrix Transpose and Vector Mult.
BICG      PolyBench  BiCG SubKernel of BiCGStab Linear Sol.
GESUMMV   PolyBench  Scalar, Vector and Matrix Multiplication
MVT       PolyBench  Matrix Vector Product and Transpose
SYR2K     PolyBench  Symmetric rank-2k operations
SYRK      PolyBench  Symmetric rank-k operations
BACKPROP  Rodinia    Back propagation
HOTSPOT   Rodinia    Hot spot physics simulation
LUD       Rodinia    LU Decomposition
NW        Rodinia    Needleman-Wunsch
SRAD1     Rodinia    Speckle Reducing Anisotropic Diffusion
SRAD2     Rodinia    SRAD 2D
Table 6.2. Benchmarks from PolyBench [31] and Rodinia [14].
6.5 Experimental Results
6.5.1 Design Evaluation
Figure 6.5 shows the normalized IPC improvement using different design implementations.
Queue Depth: Figure 6.5a shows the results for queue depths of 1, 32, and
64. When the queue depth is 1, the warp scheduler cannot issue all the generated memory access
requests for the warp when more than 1 memory request is generated. Therefore, the scheduler
still stalls on that instruction. When the queue depth is 32 or larger, the queue can hold all generated
memory access requests and the warp scheduler is free to issue the next warp instruction.
Figure 6.5a shows that queue depths of 32 and 64 give similar results, since a depth of 32 is
large enough to free the warp scheduler.
Multiple Cache Probe Units: Multiple cache probe units add extra hardware complexity,
but save execution cycles by improving the probability of a hit. The results with different factors
(2, 3, and 4) are shown in Figure 6.5b. As expected, more probe units give better performance.
We choose a factor of 4 for our design.
[Figure 6.5. IPC improvement with different implementations. Each panel plots normalized IPC (0 to 6) per benchmark and the geometric mean (gmean). (a) IPC improvement with different queue depths: Baseline, SIZE_1, SIZE_32, SIZE_64. (b) IPC improvement with different cache probe unit counts: Baseline, 2 units, 3 units, 4 units. (c) IPC improvement with different queue schedulers: Baseline, RoundRobin, Fixed, GroupedRR.]
Memory Request Scheduling Policy: We experimented with multiple scheduling policies.
Figure 6.5c shows the three different schedulers studied: Fixed, RoundRobin, and GroupedRR.
The Fixed scheduler schedules an item from queue i only if queues 1 to i − 1 have been emptied.
The RoundRobin scheduler selects an item from the queues in turn, one item per non-empty queue.
GroupedRR picks items from queues i to i + k − 1 with the factor k. A factor of four implies four
cache probe units. As expected, GroupedRR outperforms the other schedulers with the
help of multiple probing units.
6.5.2 Performance Improvement
From the design evaluation in Section 6.5.1, we choose the queue depth of 32, 4 cache probe
units, and the GroupedRR scheduling policy. Figure 6.6 shows the normalized IPC improvement
of our proposed algorithm over the baseline. The baseline graph is shown with a normalized IPC
of 1. MemSched represents the proposed memory request scheduling. MemSched+SelCaching
represents the combined scheme with the proposed memory request scheduling and the selective
caching as in Section 6.3.4. For performance comparison with another technique, we choose mem-
ory request prioritization, MRPB [42].
MemSched improves performance over the baseline by about 1.95x. MemSched without
any bypassing scheme has similar performance to the state-of-the-art bypassing scheme MRPB.
The proposed scheme with the contention-aware selective caching described in Chapter 4,
MemSched+SelCaching, achieves a 2.06x improvement over the baseline and outperforms MRPB by 7%.
[Figure 6.6. Overall IPC improvement. Normalized IPC (0 to 7) per benchmark and the geometric mean for Baseline, MemSched, MRPB, and MemSched+SelCaching.]
6.5.3 Effect of Warp Scheduler
Our experiment uses the Greedy-Then-Oldest (GTO) warp scheduler as the default warp
scheduler. Since our memory request scheduling algorithm improves performance by finding
potential hits during an LDST unit stall, other warp schedulers should not affect the trend of overall
performance improvement. Figure 6.7 shows the results with different warp schedulers, such as
the TwoLevel [63] and RoundRobin warp schedulers. Overall, the experiments demonstrate that the
memory request scheduling scheme improves performance effectively regardless of the warp scheduler.
[Figure 6.7. IPC improvement with different schedulers. Normalized IPC (0 to 8) per benchmark and the geometric mean for Baseline, RoundRobin, TwoLevel, and GTO.]
6.5.4 Effect of Cache Associativity
Figure 6.8 shows the results with different cache associativities: 2-, 4-, and 8-way. Cache
contention tends to be much more severe with a lower-associativity cache. Generally, memory
requests in a lower-associativity cache become more congested and the stall becomes worse than in
higher-associativity cases. Therefore, the IPC improvement for associativity 2 is the largest among
the three associativity cases. With associativity 8, the scheme still improves performance
by about 1.51x.
[Figure 6.8. IPC improvement with different associativities. Normalized IPC (0 to 10) per benchmark and the geometric mean for Baseline, Assoc-2, Assoc-4, and Assoc-8.]
6.6 Conclusion
In this chapter, we identified that when the LDST unit is stalled, no other ready warps can
probe the cache, even though some of them would hit if they could proceed and probe the
cache. To address this issue, we proposed a memory request scheduling scheme that queues the
memory requests from warp instructions, schedules items in the queue to probe for potential hits
during an LDST unit stall, and processes the hit requests to use the cache efficiently. The proposed
scheme improves the IPC performance by 2.06x over the baseline. It also outperforms the
state-of-the-art algorithm, MRPB, by 7%.
CHAPTER 7
RELATED WORK
We introduced chapter-specific related work in earlier chapters. This chapter integrates
it with related work on other topics in order to serve as the single central place for all work
related to this dissertation.
7.1 Cache Bypassing
7.1.1 CPU Cache Bypassing
Much of the existing research focuses on CPU cache management techniques [28, 38, 39,
46, 71, 85, 87]. Among these, a selection of papers have explored bypassing in CPU caches.
Tyson et al. [85] proposed bypassing based on the hit rate of memory access instructions, while
Johnson et al. [46] proposed using the access frequency of the cache blocks to predict bypassing.
Kharbutli and Solihin [50] proposed using counters of events such as number of references and
access intervals to make bypass predictions in the CPU last-level cache. All of these techniques use
memory address-related information to make the prediction, costing significant storage overhead
that would be impractical for GPU caches.
Program counter trace-based dead block prediction [53] leveraged the fact that sequences
of memory instruction PCs tend to lead to the same behavior for different memory blocks. This
dead block prediction scheme is useful for making bypass predictions in CPUs. We show that GPU
kernels are small, containing only a few distinct memory instructions, so the PC of the instruction
accessing a block is sufficient for GPU bypass prediction.
Cache Bursts [59] is another dead block prediction technique that exploits bursts of ac-
cesses hitting the MRU position to improve predictor efficiency. For GPU workloads that use
scratch-pad memories, the majority of re-references have been filtered. Gaur et al. [25] proposed
bypass and insertion algorithms for exclusive LLCs that adaptively prevent unmodified dead blocks
from being written into the exclusive LLC.
7.1.2 GPU Cache Bypassing
Jia et al. [41] characterized the effect of the cache in GPUs. They presented simulation results
showing that the L1D cache may degrade overall system performance. They then proposed a static method
that analyzes GPU programs at compile time, determines whether caching is beneficial or detrimental
by calculating the access stride pattern, and uses the result to control whether to bypass the L1D cache.
Jia et al. [42] later proposed a memory request prioritization buffer (MRPB) to improve GPU
performance. MRPB prioritized the memory requests in order to reduce the reuse distance within
a warp. It also used cache bypassing to mitigate intra-warp contention. When bypassing, it blindly
bypassed memory requests whenever it detected resource contention. As a result, some
benchmarks suffer from performance degradation. Compared to MRPB, our locality-aware
selective caching does not degrade performance since we measure the reuse frequency dynamically
and conservatively decide caching according to the reuse frequency.
Rogers et al. proposed cache-conscious wavefront scheduling (CCWS) to improve GPU
cache efficiency by avoiding data thrashing that causes cache pollution [72]. CCWS estimates
working set size of active warps and dynamically restricts the number of warps. This may adversely
affect the ability to hide high memory access latency of GPUs. Our locality-aware selective caching
bypasses the no-reuse blocks without under-utilizing the SIMD pipeline to reduce cache thrashing.
Lee and Kim proposed a thread-level-parallelism-aware cache management policy to improve
performance of the shared last level cache (LLC) in heterogeneous multi-core architectures
[55]. They focus on shared LLCs that are dynamically partitioned between CPUs and GPUs.
Mekkat et al. proposed a similar idea for heterogeneous LLC management [60], to better partition
LLC for GPUs and CPUs in a heterogeneous system.
Li et al. [57] exploit bypassing to reduce contention. They extract locality information
during compile time and throttle the warp scheduler to avoid thrashing due to warp contention.
Their work focuses on reducing cross-warp contention. Static analysis does not reflect the dynamic
behavior of the application.
Tian et al. [80] also exploit the PC to predict bypassing. This method maintains a table
for bypass prediction. They use a confidence count to control bypassing. Every reuse decreases
the confidence count and every miss increases the confidence count. When the confidence count
is greater than the predetermined value for the PC of a memory request, the request is
bypassed. However, this scheme needs training time before it actually bypasses a request, since it
uses a confidence counter and the tables are maintained for each L1D cache. To compensate for
misprediction, they use a bypassBit in the L2 cache. However, when the bit is set and reset by
requests from multiple SMs, its bypass decision is not accurate. Our locality-aware selective bypassing
maintains a global reuse frequency table to reflect the overall behavior of the program. Also, our
scheme dynamically updates the bypass table, even on a bypassed request, to avoid misprediction.
While the existing works determine bypassing based on contention that has already occurred or on static
compile-time information, which may lead to incorrect decisions at runtime, our contention-aware
selective caching focuses more on proactively detecting and avoiding cache contention. Also, our work
incorporates dynamically obtained locality information to preserve locality.
7.2 Memory Address Randomization
Memory address randomization techniques for CPU caches are well studied. Pseudo-
random cache indexing methods have been extensively studied to reduce conflict misses. Topham
et al. [81] use XOR to build a conflict-avoiding cache; Seznec and Bodin [77, 11] combine XOR
indexing and circular shift in a skewed associative cache to form a perfect shuffle across all cache
banks. XOR is also widely used for memory indexing [54, 69, 70, 91]. Khairy et al. [49] use the
XOR-based Pseudo Random Interleaving Cache (PRIC), which is also used in [81, 70].
Another common approach is to use a secondary indexing method for alternative cache sets
when conflicts happen. This category of work includes skewed-associative cache [77], column-
associative cache [1], and v-way cache [67].
Some works have also noticed that certain bits in the address are more critical in reducing
cache miss rate. Givargis [27] uses off-line profiling to detect feature bits for embedded systems.
This scheme is only applicable for embedded systems where workloads are often known prior to
execution. Ros et al. [75] propose ASCIB, a three-phase algorithm, to track the changes in address
bits at runtime and dynamically discard the invariable bits for cache indexing. ASCIB needs to
flush certain cache sets whenever the cache indexing method changes, so it is best suited for a
direct-mapped cache. ASCIB also needs extra storage to track the changes in the address bits.
Wang et al. [86] applied an XOR-based indexing algorithm in their work, which finds the address range
that is more critical for reducing intra-warp cache contention.
The main purpose of the different cache indexing algorithms is to randomly distribute the
congested set over all cache sets to minimize cache thrashing. However, GPGPUs execute
massively parallel workloads, and requests from different warps can easily
overflow the capacity of the cache sets. For example, the NVIDIA Fermi architecture uses a 16KB L1 data
cache, which has 128 cache lines (32 sets with 4 lines per set). The set-contending accesses of just
4 warps quickly fill up the cache. Indexing may distribute the cache accesses throughout the whole
cache; however, caching a massive number of lines that may not be used more than once
pollutes the cache and evicts cache lines that have better locality.
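A small Python sketch can illustrate both points, using the Fermi L1D geometry given above (128-byte lines, 32 sets, 4 ways). The XOR hash shown is a generic illustrative choice, not the function used in any of the cited works, and the 4096-byte column stride is an assumption.

```python
LINE_BYTES = 128
NUM_SETS = 32      # 16 KB / 128 B = 128 lines = 32 sets x 4 ways
WAYS = 4


def modulo_index(addr):
    """Conventional set index: low bits of the line address."""
    return (addr // LINE_BYTES) % NUM_SETS


def xor_index(addr):
    # Illustrative hash: XOR the index bits with the next 5 address bits.
    line = addr // LINE_BYTES
    return (line ^ (line >> 5)) % NUM_SETS


STRIDE = 4096  # assumed column stride in bytes (1024 4-byte elements per row)
addrs = [lane * STRIDE for lane in range(32)]   # one warp, column-strided

print(len({modulo_index(a) for a in addrs}))    # 1
print(len({xor_index(a) for a in addrs}))       # 32
print(4 * 32 == NUM_SETS * WAYS)                # True
```

With conventional indexing, all 32 requests of the warp compete for the 4 lines of a single set, while the XOR hash spreads them over all 32 sets. The last line restates the capacity argument: four such warps already need 4 x 32 = 128 lines, the entire cache, so redistributing accesses alone does not remove the pollution caused by no-reuse data.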
7.3 Warp Scheduling
Warp scheduling plays a critical role in sustaining GPU performance and various schedul-
ing algorithms have been proposed based on different heuristics.
Some warp scheduling algorithms use a concurrency throttling technique to reduce contention
in an L1D cache. Static Warp Limiting (SWL) [72] statically limits the number of warps
that can be actively scheduled and needs to be tuned on a per-benchmark basis. Cache-Conscious
Wavefront Scheduling (CCWS) [72] relies on a dedicated victim cache and a 6-bit Warp ID field in
the tag of a cache block to detect intra-warp locality, and on other storage to track per-warp locality
changes. The warp that has the largest locality loss is exclusively prioritized. MASCAR [76]
exclusively prioritizes memory instructions from one “owner” warp when the memory subsystem
is saturated; otherwise, memory instructions of all warps are prioritized over any computation
instruction. MASCAR uses a re-execution queue to replay L1D accesses that are stalled due to
MSHR unavailability or network congestion. Saturation here means that the MSHR has only 1
entry left or the queue inside the memory part has only 1 slot left. On top of CCWS, Divergence-Aware
Warp Scheduling (DAWS) [73] actively schedules warps whose aggregate memory footprint does not
exceed the L1D capacity. The prediction of the memory footprint requires compiler support to mark loops in
the PTX ISA and other structures. Khairy et al. [49] proposed DWT-CS, which uses core sampling
to throttle concurrency. When the L1D Misses Per Kilo Instructions (MPKI) is above a given threshold,
DWT-CS samples all SMs with a different number of active warps and applies the best-performing
active warp count on all SMs.
Some other warp scheduling algorithms are designed to improve GPU resource utilization.
Fung et al. [24, 23] investigated the impact of warp scheduling on techniques aiming at branch
divergence reduction, i.e., dynamic warp formation and thread block compaction. Jog et al. [45]
proposed an orchestrated warp scheduling to increase the timeliness of GPU L1D prefetching.
Narasiman et al. [63] proposed a two-level round robin scheduler to prevent memory instructions
from being issued consecutively. By doing so, memory latency can be better overlapped by compu-
tations. Gebhart et al. [26] introduced another two-level warp scheduler to manage a hierarchical
register file design. On top of the two-level warp scheduling, Yu et al. [90] proposed a Stall-
Aware Warp Scheduling (SAWS) to adjust the fetch group size when pipeline stalls are detected.
SAWS mainly focuses on pipeline stalls. Kayiran et al. [48] proposed a dynamic Cooperative
Thread Array (CTA) scheduling mechanism to enable the optimal number of CTAs according to
application characteristics. It typically reduces concurrent CTAs for data-intensive applications
to reduce LD/ST stalls. Lee et al. [56] proposed two alternative CTA scheduling schemes. Lazy
CTA scheduling (LCS) utilizes a 3-phase mechanism to determine the optimal number of CTAs
per core, while Block CTA scheduling (BCS) launches consecutive CTAs onto the same cores to
exploit inter-CTA data locality. Jog et al. [44] proposed the OWL scheduler, which combines four
component scheduling policies to improve L1D locality and the utilization of off-chip memory
bandwidth.
The aforementioned warp scheduling techniques do not focus on the problem of LDST
stalls or on preserving L1D locality, especially cross-warp locality.
7.4 Warp Throttling
Bakhoda et al. [6] present data for several GPU configurations, each with a different maximum
number of CTAs that can be concurrently assigned to a core. They observe that some
workloads perform better when fewer CTAs are scheduled concurrently. The data they present
is for a GPU without an L1 data cache, running a round-robin warp scheduling algorithm. They
conclude that this increase in performance occurs because scheduling fewer concurrent CTAs on the
GPU reduces contention for the interconnection network and DRAM memory system.
Guz et al. [32] use an analytical model to quantify the “performance valley” that exists
when the number of threads sharing a cache is increased. They show that increasing the thread
count increases performance until the aggregate working set no longer fits in cache. Increasing
threads beyond this point degrades performance until enough threads are present to hide the
system’s memory latency.
Cheng et al. [18] introduce a thread throttling mechanism to reduce memory latency in mul-
tithreaded CPU systems. They propose an analytical model and memory task throttling mechanism
to limit thread interference in the memory stage. Their model relies on a stream programming lan-
guage which decomposes applications into separate tasks for computation and memory and their
technique schedules tasks at this granularity.
Ebrahimi et al. [21] examine the effect of disjointed resource allocation between the vari-
ous components of a chip-multiprocessor system, in particular in the cache hierarchy and memory
controller. They observe that uncoordinated fairness-based decisions made by disconnected
components could result in a loss of both performance and fairness. Their proposed technique seeks
to increase performance and improve fairness in the memory system by throttling the memory ac-
cesses generated by CMP cores. This throttling is accomplished by capping the number of MSHR
entries that can be used and constraining the rate at which requests in the MSHR are issued to the
L2.
These warp throttling techniques do not identify the memory access characteristics of GPUs;
instead, they try to resolve contention by dynamically throttling the number of thread blocks or
warps. Our work identifies the memory access characteristics, analyzes when caching is beneficial
and when it is not, and resolves contention by reducing its cause.
7.5 Cache Replacement Policy
There is a body of work attempting to increase cache hit rate by improving the replacement
or insertion policy [9, 13, 38, 43, 61, 68, 88]. All of these attempt to exploit different heuristics
of program behavior to predict a block's re-reference interval and mirror the Belady-optimal [10]
policy as closely as possible.
Li et al. [58] propose Priority-based Cache Allocation (PCAL) to tightly couple
the thread scheduling mechanism with the cache replacement policy such that GPU cache pollution
is minimized while off-chip memory throughput is enhanced. They prioritize a subset of
high-priority threads while simultaneously allowing lower-priority threads to execute without
contending for the cache. By tuning thread-level parallelism while optimizing both cache efficiency
and other shared resource usage, PCAL improves overall performance. Chen et al. [17] propose
G-Cache to alleviate cache thrashing. To detect thrashing, the tag array of the L2 cache is enhanced
with extra bits (victim bits) that provide the L1 cache with information about hot lines that have
been evicted before. The L1 cache uses an adaptive cache replacement policy to protect these
hot lines. However, these previous works do not incorporate locality information between warps or
thread blocks.
CHAPTER 8
CONCLUSION AND FUTURE WORK
8.1 Conclusion
Leveraging the massive computation power of GPUs to accelerate data-intensive applications
is a recent trend that embraces the arrival of the big data era. While a throughput processor’s
cache hierarchy exploits application-inherent locality and can increase overall performance,
the massively parallel execution model of GPUs suffers from cache contention. For applications
whose performance is sensitive to caching efficiency, such contention degrades the effectiveness of
caches in exploiting locality and thereby causes a significant performance drop.
This dissertation has categorized the contention into two categories depending
on its source and has examined the memory access request bottlenecks that cause
serious cache contention, such as memory-divergent instructions caused by column-strided access
patterns, cache pollution by no-reuse data blocks, and memory request stalls. This dissertation
embodies a collection of research efforts to reduce the performance impact of these bottlenecks at
their sources, including contention-aware selective caching, locality-based selective caching, and
memory request scheduling. Based on comprehensive experimental results and systematic
comparisons with state-of-the-art techniques, this dissertation has made the following three key
contributions:
Contention-aware Selective Caching is proposed to detect the column-strided pattern and
the resulting memory-divergent instruction, which generates divergent memory accesses, to calculate
the contending cache sets and locality information, and to cache selectively. We demonstrate that
contention-aware selective caching can improve system performance by more than 2.25x over the
baseline and reduce memory accesses.
Locality-aware Selective Caching is proposed to detect the locality of memory requests
based on per-PC reuse frequency and to cache selectively. We demonstrate that this
low-hardware-complexity technique outperforms the baseline by 1.39x alone and by 2.01x together
with contention-aware selective caching, prevents 73% of the no-reuse data from being cached, and
improves reuse frequency in the cache by 27x.
Memory Request Scheduling is proposed to address memory request stalls at the LDST
unit. It consists of a memory request schedule queue that holds ready warps’ memory requests and
a scheduler that schedules them so as to increase the chance of hits in the cache. We
demonstrate that, on average, 12 warps are ready when the LDST unit is stalled, and that exploiting
this potential improves overall performance by 1.95x over the baseline, and by 2.06x over the
baseline when combined with contention-aware selective caching.
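As an illustration of why the column-strided pattern targeted by the first contribution leads to contention, the following sketch counts how many distinct cache lines a single warp maps to each cache set. The cache geometry (128-byte lines, 32 sets), the plain modulo set index, the 4096-byte row size, and the helper name lines_per_set are assumptions made for this example, not the configuration evaluated in this dissertation.

```python
from collections import Counter

LINE_SIZE = 128   # bytes per cache line (assumed)
NUM_SETS = 32     # sets in the L1D (assumed)
WARP_SIZE = 32    # threads per warp


def lines_per_set(base, stride_bytes):
    """Map each set index to the number of distinct lines one warp touches in it."""
    lines = {(base + t * stride_bytes) // LINE_SIZE for t in range(WARP_SIZE)}
    # Plain modulo indexing (assumed); real hardware may hash the index.
    return Counter(line % NUM_SETS for line in lines)


# Row-wise access to 4-byte elements: the whole warp shares one cache line.
row_wise = lines_per_set(base=0, stride_bytes=4)
# Column-wise access to a row-major matrix with 4096-byte rows:
# 32 distinct lines, all mapped to set 0.
column_wise = lines_per_set(base=0, stride_bytes=4096)
```

With any associativity below 32 ways, the 32 lines of the column-wise case cannot reside in set 0 at the same time, so a single memory instruction evicts data that the same warp has just fetched.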
8.2 Future Work
This dissertation has also opened up opportunities for future architectural research on
optimizing the performance of the GPU memory subsystem. In particular, the following two topics
are immediate directions for future work.
8.2.1 Locality-Aware Scheduling
The memory access locality analysis in Section 3.1 shows that the warp scheduler can
significantly change the memory access pattern and thus plays an important role in system
performance. Depending on the choice of scheduler, the overall system performance improvement
can be around 20% on average [56]. For some benchmarks, the overall performance is about 3
times better. The technique used in locality-aware selective caching in Section 5 can be
exploited to schedule warps so as to preserve locality, minimize evictions from the cache, and
minimize memory resource contention in the LDST unit. Through the reuse frequency
analysis, a dynamic working set can also be calculated to aid the scheduler.
8.2.2 Locality-Aware Cache Replacement Policy
When a cache is full, a new request needs to find an entry to replace. Least
Recently Used (LRU) is a commonly used replacement policy. It selects as the victim
the least recently used item in the cache. The algorithm requires keeping track of what
was used and when, which is expensive if one wants to guarantee that it always discards
the least recently used item. General implementations of this technique keep “age bits”
for cache lines and track the least recently used cache line based on the age bits. Due to this
complexity, many implementations approximate LRU with pseudo-LRU.
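The age-based bookkeeping described above can be sketched for a single cache set as follows. This is a minimal software model, assuming unbounded integer ages in place of the fixed-width age bits used in hardware; the class name LRUSet and its interface are hypothetical.

```python
class LRUSet:
    """One cache set with true LRU replacement using per-line age counters."""

    def __init__(self, ways):
        self.ways = ways
        self.age = {}  # tag -> age; 0 means most recently used

    def access(self, tag):
        """Access `tag` and return (hit, evicted_tag_or_None)."""
        hit = tag in self.age
        evicted = None
        if not hit and len(self.age) == self.ways:
            # The victim is the resident line with the largest age.
            evicted = max(self.age, key=self.age.get)
            del self.age[evicted]
        # Age every other resident line, then mark this line as newest.
        for other in self.age:
            if other != tag:
                self.age[other] += 1
        self.age[tag] = 0
        return hit, evicted


lru = LRUSet(ways=2)
trace = [lru.access(t) for t in ["A", "B", "A", "C"]]
```

Every access updates the age of every line in the set, which is the per-access cost that motivates pseudo-LRU approximations.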
The GPU thread traffic pattern described in Figure 3.1a may fit the pseudo-LRU cache
replacement policy because of its small working set size, but when warps and thread blocks are
involved, pseudo-LRU may no longer efficiently reflect the locality of GPU memory accesses.
However, if the no-reuse blocks are filtered out by the locality-based selective caching scheme, the
resulting memory access pattern can be cached effectively in the L1D by an LRU policy with
reduced (that is, adjusted) block insertion and block promotion.
Therefore, a cache replacement policy that uses the locality information developed in Section 5 may
improve overall system performance.
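One possible reading of adjusted block insertion and block promotion is a recency stack in which new blocks enter at a configurable depth and hits move a block up by a configurable number of positions. The sketch below is an assumption about how such a policy could look, not a design evaluated in this dissertation; the class RecencyStack and the parameters insert_pos and promote_by are hypothetical.

```python
class RecencyStack:
    """One cache set kept as a recency stack: index 0 is MRU, the last index is the victim."""

    def __init__(self, ways, insert_pos, promote_by):
        self.ways = ways
        self.insert_pos = insert_pos  # 0 inserts at MRU; ways - 1 inserts at LRU
        self.promote_by = promote_by  # positions a block moves up on each hit
        self.stack = []

    def access(self, tag):
        """Access `tag`; return True on a hit and False on a miss."""
        if tag in self.stack:
            i = self.stack.index(tag)
            self.stack.pop(i)
            self.stack.insert(max(0, i - self.promote_by), tag)
            return True
        if len(self.stack) == self.ways:
            self.stack.pop()  # evict the block in the LRU position
        self.stack.insert(min(self.insert_pos, len(self.stack)), tag)
        return False


cache_set = RecencyStack(ways=4, insert_pos=3, promote_by=1)
for tag in ["A", "B", "C", "D", "E"]:
    cache_set.access(tag)
after_fill = list(cache_set.stack)
hit = cache_set.access("E")
after_hit = list(cache_set.stack)
```

With insertion near the LRU end, a newly arriving block displaces only the most recent arrival, while blocks that have already been promoted stay protected. Classic LRU corresponds to insert_pos = 0 with promotion all the way to the MRU position.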
PUBLICATION CONTRIBUTIONS
This dissertation has contributed to the following publications. Based on a thorough analysis
of GPU characteristics carried out with William Panlener and Dr. Byunghyun Jang, we published a
paper titled “Understanding and Optimizing GPU Cache Memory Performance for Compute
Workloads” at the IEEE 13th International Symposium on Parallel and Distributed Computing (ISPDC) in
2014.
Through the cache contention analysis in Chapter 3, we identified the major causes of cache
contention. From the first factor of cache contention, column-strided memory traffic patterns
and the resulting memory-divergent instruction, as introduced in Chapter 4, we submitted a
paper, “Contention-Aware Selective Caching to Mitigate Intra-Warp Contention on GPUs,” to IISWC
2016. From the second factor, cache pollution caused by caching no-reuse data, in Chapter 5, we
submitted a paper, “Locality-Aware Selective Caching on GPUs,” to SBAC-PAD 2016. From the
third factor, memory request stalls caused by the non-preemptive LDST unit, in Chapter 6, we are
preparing a paper titled “Memory Request Scheduling to Promote Potential Cache Hit on GPUs.” I would
like to thank my co-authors on these papers, David Troendle, Esraa Abdelmageed, and Dr. Byunghyun
Jang, for their continuous support, discussion of idea development, editing, and revising.
BIBLIOGRAPHY
[1] Agarwal, A., and S. D. Pudar (1993), Column-associative Caches: A Technique for Reducing the Miss Rate of Direct-mapped Caches, SIGARCH Comput. Archit. News, 21(2), 179–190, doi:10.1145/173682.165153.
[2] AMD, Inc. (2012), AMD Graphics Cores Next (GCN) Architecture, https://www.amd.com/Documents/GCN_Architecture_whitepaper.pdf.
[3] AMD, Inc. (2015), The OpenCL Programming Guide, http://amd-dev.wpengine.netdna-cdn.com/wordpress/media/2013/12/AMD_OpenCL_Programming_User_Guide2.pdf.
[4] Anderson, J. A., C. D. Lorenz, and A. Travesset (2008), General purpose molecular dynamics simulations fully implemented on graphics processing units, Journal of Computational Physics, 227(10), 5342–5359, doi:10.1016/j.jcp.2008.01.047.
[5] Bakhoda, A., G. Yuan, W. Fung, H. Wong, and T. Aamodt (2009), Analyzing CUDA workloads using a detailed GPU simulator, in Performance Analysis of Systems and Software, 2009. ISPASS 2009. IEEE International Symposium on, pp. 163–174, doi:10.1109/ISPASS.2009.4919648.
[6] Bakhoda, A., G. L. Yuan, W. W. L. Fung, H. Wong, and T. M. Aamodt (2009), Analyzing CUDA workloads using a detailed GPU simulator, in Performance Analysis of Systems and Software, 2009. ISPASS 2009. IEEE International Symposium on, pp. 163–174, doi:10.1109/ISPASS.2009.4919648.
[7] Bakkum, P., and K. Skadron (2010), Accelerating SQL database operations on a GPU with CUDA, in Proceedings of the 3rd Workshop on General-Purpose Computation on Graphics Processing Units, GPGPU-3, pp. 94–103, ACM, New York, NY, USA, doi:10.1145/1735688.1735706.
[8] Banakar, R., S. Steinke, B.-S. Lee, M. Balakrishnan, and P. Marwedel (2002), Scratchpad memory: a design alternative for cache on-chip memory in embedded systems, in Hardware/Software Codesign, 2002. CODES 2002. Proceedings of the Tenth International Symposium on, pp. 73–78, doi:10.1109/CODES.2002.1003604.
[9] Bansal, S., and D. S. Modha (2004), CAR: Clock with adaptive replacement, in Proceedings of the 3rd USENIX Conference on File and Storage Technologies, FAST ’04, pp. 187–200, USENIX Association, Berkeley, CA, USA.
[10] Belady, L. A. (1966), A study of replacement algorithms for a virtual-storage computer, IBM Systems Journal, 5(2), 78–101, doi:10.1147/sj.52.0078.
[11] Bodin, F., and A. Seznec (1997), Skewed associativity improves program performance and enhances predictability, Computers, IEEE Transactions on, 46(5), 530–544, doi:10.1109/12.589219.
[12] Brunie, N., S. Collange, and G. Diamos (2012), Simultaneous Branch and Warp Interweaving for Sustained GPU Performance, SIGARCH Comput. Archit. News, 40(3), 49–60, doi:10.1145/2366231.2337166.
[13] Chaudhuri, M. (2009), Pseudo-LIFO: The foundation of a new family of replacement policies for last-level caches, in 2009 42nd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), pp. 401–412, doi:10.1145/1669112.1669164.
[14] Che, S., M. Boyer, J. Meng, D. Tarjan, J. Sheaffer, S.-H. Lee, and K. Skadron (2009), Rodinia: A benchmark suite for heterogeneous computing, in Workload Characterization, 2009. IISWC 2009. IEEE International Symposium on, pp. 44–54, doi:10.1109/IISWC.2009.5306797.
[15] Chen, L., and G. Agrawal (2012), Optimizing MapReduce for GPUs with effective shared memory usage, in Proceedings of the 21st International Symposium on High-Performance Parallel and Distributed Computing, HPDC ’12, pp. 199–210, ACM, New York, NY, USA, doi:10.1145/2287076.2287109.
[16] Chen, L., X. Huo, and G. Agrawal (2012), Accelerating MapReduce on a coupled CPU-GPU architecture, in Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis, SC ’12, pp. 25:1–25:11, IEEE Computer Society Press, Los Alamitos, CA, USA.
[17] Chen, X., S. Wu, L.-W. Chang, W.-S. Huang, C. Pearson, Z. Wang, and W.-M. W. Hwu (2014), Adaptive Cache Bypass and Insertion for Many-core Accelerators, in Proceedings of International Workshop on Manycore Embedded Systems, MES ’14, pp. 1:1–1:8, ACM, New York, NY, USA, doi:10.1145/2613908.2613909.
[18] Cheng, H.-Y., C.-H. Lin, J. Li, and C.-L. Yang (2010), Memory latency reduction via thread throttling, in Proceedings of the 2010 43rd Annual IEEE/ACM International Symposium on Microarchitecture, MICRO ’43, pp. 53–64, IEEE Computer Society, Washington, DC, USA, doi:10.1109/MICRO.2010.39.
[19] Choo, K., W. Panlener, and B. Jang (2014), Understanding and Optimizing GPU Cache Memory Performance for Compute Workloads, in Parallel and Distributed Computing (ISPDC), 2014 IEEE 13th International Symposium on, pp. 189–196, doi:10.1109/ISPDC.2014.29.
[20] Denning, P. J. (1980), Working sets past and present, IEEE Trans. Softw. Eng., 6(1), 64–84, doi:10.1109/TSE.1980.230464.
[21] Ebrahimi, E., C. J. Lee, O. Mutlu, and Y. N. Patt (2010), Fairness via source throttling: A configurable and high-performance fairness substrate for multi-core memory systems, in Proceedings of the Fifteenth Edition of ASPLOS on Architectural Support for Programming Languages and Operating Systems, ASPLOS XV, pp. 335–346, ACM, New York, NY, USA, doi:10.1145/1736020.1736058.
[22] Fatahalian, K., and M. Houston (2008), A closer look at GPUs, Commun. ACM, 51(10), 50–57, doi:10.1145/1400181.1400197.
[23] Fung, W., and T. Aamodt (2011), Thread block compaction for efficient SIMT control flow, in High Performance Computer Architecture (HPCA), 2011 IEEE 17th International Symposium on, pp. 25–36, doi:10.1109/HPCA.2011.5749714.
[24] Fung, W., I. Sham, G. Yuan, and T. Aamodt (2007), Dynamic Warp Formation and Scheduling for Efficient GPU Control Flow, in Microarchitecture, 2007. MICRO 2007. 40th Annual IEEE/ACM International Symposium on, pp. 407–420, doi:10.1109/MICRO.2007.30.
[25] Gaur, J., M. Chaudhuri, and S. Subramoney (2011), Bypass and insertion algorithms for exclusive last-level caches, in Computer Architecture (ISCA), 2011 38th Annual International Symposium on, pp. 81–92.
[26] Gebhart, M., D. R. Johnson, D. Tarjan, S. W. Keckler, W. J. Dally, E. Lindholm, and K. Skadron (2011), Energy-efficient Mechanisms for Managing Thread Context in Throughput Processors, SIGARCH Comput. Archit. News, 39(3), 235–246, doi:10.1145/2024723.2000093.
[27] Givargis, T. (2003), Improved indexing for cache miss reduction in embedded systems, in Design Automation Conference, 2003. Proceedings, pp. 875–880, doi:10.1109/DAC.2003.1219143.
[28] Gonzalez, A., C. Aliagas, and M. Valero (1995), A Data Cache with Multiple Caching Strategies Tuned to Different Types of Locality, in Proceedings of the 9th International Conference on Supercomputing, ICS ’95, pp. 338–347, ACM, New York, NY, USA, doi:10.1145/224538.224622.
[29] Govindaraju, N., J. Gray, R. Kumar, and D. Manocha (2006), GPUTeraSort: High performance graphics co-processor sorting for large database management, in Proceedings of the 2006 ACM SIGMOD International Conference on Management of Data, SIGMOD ’06, pp. 325–336, ACM, New York, NY, USA, doi:10.1145/1142473.1142511.
[30] Govindaraju, N. K., B. Lloyd, W. Wang, M. Lin, and D. Manocha (2004), Fast computation of database operations using graphics processors, in Proceedings of the 2004 ACM SIGMOD International Conference on Management of Data, SIGMOD ’04, pp. 215–226, ACM, New York, NY, USA, doi:10.1145/1007568.1007594.
[31] Grauer-Gray, S., L. Xu, R. Searles, S. Ayalasomayajula, and J. Cavazos (2012), Auto-tuning a high-level language targeted to GPU codes, in Innovative Parallel Computing (InPar), 2012, pp. 1–10, doi:10.1109/InPar.2012.6339595.
[32] Guz, Z., E. Bolotin, I. Keidar, A. Kolodny, A. Mendelson, and U. C. Weiser (2009), Many-core vs. many-thread machines: Stay away from the valley, IEEE Computer Architecture Letters, 8(1), 25–28, doi:10.1109/L-CA.2009.4.
[33] Hakura, Z. S., and A. Gupta (1997), The Design and Analysis of a Cache Architecture for Texture Mapping, in Proceedings of the 24th Annual International Symposium on Computer Architecture, ISCA ’97, pp. 108–120, ACM, New York, NY, USA, doi:10.1145/264107.264152.
[34] Han, S., K. Jang, K. Park, and S. Moon (2010), PacketShader: A GPU-accelerated software router, SIGCOMM Comput. Commun. Rev., 40(4), 195–206, doi:10.1145/1851275.1851207.
[35] He, B., W. Fang, Q. Luo, N. K. Govindaraju, and T. Wang (2008), Mars: A MapReduce framework on graphics processors, in Proceedings of the 17th International Conference on Parallel Architectures and Compilation Techniques, PACT ’08, pp. 260–269, ACM, New York, NY, USA, doi:10.1145/1454115.1454152.
[36] He, B., M. Lu, K. Yang, R. Fang, N. K. Govindaraju, Q. Luo, and P. V. Sander (2009), Relational query coprocessing on graphics processors, ACM Trans. Database Syst., 34(4), 21:1–21:39, doi:10.1145/1620585.1620588.
[37] Hennessy, J. L., and D. A. Patterson (2011), Computer Architecture, Fifth Edition: A Quantitative Approach, 5th ed., Morgan Kaufmann Publishers Inc., San Francisco, CA, USA.
[38] Jaleel, A., K. B. Theobald, S. C. Steely, Jr., and J. Emer (2010), High Performance Cache Replacement Using Re-reference Interval Prediction (RRIP), SIGARCH Comput. Archit. News, 38(3), 60–71, doi:10.1145/1816038.1815971.
[39] Jalminger, J., and P. Stenstrom (2003), A novel approach to cache block reuse predictions, in Parallel Processing, 2003. Proceedings. 2003 International Conference on, pp. 294–302, doi:10.1109/ICPP.2003.1240592.
[40] Jang, B., D. Schaa, P. Mistry, and D. Kaeli (2011), Exploiting Memory Access Patterns to Improve Memory Performance in Data-Parallel Architectures, IEEE Transactions on Parallel and Distributed Systems, 22(1), 105–118, doi:10.1109/TPDS.2010.107.
[41] Jia, W., K. A. Shaw, and M. Martonosi (2012), Characterizing and Improving the Use of Demand-fetched Caches in GPUs, in Proceedings of the 26th ACM International Conference on Supercomputing, ICS ’12, pp. 15–24, ACM, New York, NY, USA, doi:10.1145/2304576.2304582.
[42] Jia, W., K. Shaw, and M. Martonosi (2014), MRPB: Memory request prioritization for massively parallel processors, in High Performance Computer Architecture (HPCA), 2014 IEEE 20th International Symposium on, pp. 272–283, doi:10.1109/HPCA.2014.6835938.
[43] Jiang, S., and X. Zhang (2002), LIRS: An efficient low inter-reference recency set replacement policy to improve buffer cache performance, SIGMETRICS Perform. Eval. Rev., 30(1), 31–42, doi:10.1145/511399.511340.
[44] Jog, A., O. Kayiran, N. Chidambaram Nachiappan, A. K. Mishra, M. T. Kandemir, O. Mutlu, R. Iyer, and C. R. Das (2013), OWL: Cooperative Thread Array Aware Scheduling Techniques for Improving GPGPU Performance, SIGPLAN Not., 48(4), 395–406, doi:10.1145/2499368.2451158.
[45] Jog, A., O. Kayiran, A. K. Mishra, M. T. Kandemir, O. Mutlu, R. Iyer, and C. R. Das (2013), Orchestrated Scheduling and Prefetching for GPGPUs, in Proceedings of the 40th Annual International Symposium on Computer Architecture, ISCA ’13, pp. 332–343, ACM, New York, NY, USA, doi:10.1145/2485922.2485951.
[46] Johnson, T. L., D. A. Connors, M. C. Merten, and W. M. W. Hwu (1999), Run-time cache bypassing, IEEE Transactions on Computers, 48(12), 1338–1354, doi:10.1109/12.817393.
[47] Katz, G. J., and J. T. Kider, Jr. (2008), All-pairs Shortest-paths for Large Graphs on the GPU, in Proceedings of the 23rd ACM SIGGRAPH/EUROGRAPHICS Symposium on Graphics Hardware, GH ’08, pp. 47–55, Eurographics Association, Aire-la-Ville, Switzerland.
[48] Kayiran, O., A. Jog, M. T. Kandemir, and C. R. Das (2013), Neither More nor Less: Optimizing Thread-level Parallelism for GPGPUs, in Proceedings of the 22nd International Conference on Parallel Architectures and Compilation Techniques, PACT ’13, pp. 157–166, IEEE Press, Piscataway, NJ, USA.
[49] Khairy, M., M. Zahran, and A. G. Wassal (2015), Efficient Utilization of GPGPU Cache Hierarchy, in Proceedings of the 8th Workshop on General Purpose Processing Using GPUs, GPGPU-8, pp. 36–47, ACM, New York, NY, USA, doi:10.1145/2716282.2716291.
[50] Kharbutli, M., and Y. Solihin (2008), Counter-Based Cache Replacement and Bypassing Algorithms, IEEE Transactions on Computers, 57(4), 433–447, doi:10.1109/TC.2007.70816.
[51] Khronos OpenCL Working Group (2012), The OpenCL Specification, https://www.khronos.org/registry/cl/specs/opencl-1.2.pdf.
[52] Khronos OpenCL Working Group (2016), The OpenCL Specification, https://www.khronos.org/registry/cl/specs/opencl-2.2.pdf.
[53] Lai, A.-C., C. Fide, and B. Falsafi (2001), Dead-block Prediction & Dead-block Correlating Prefetchers, in Proceedings of the 28th Annual International Symposium on Computer Architecture, ISCA ’01, pp. 144–154, ACM, New York, NY, USA, doi:10.1145/379240.379259.
[54] Lawrie, D. H., and C. Vora (1982), The Prime Memory System for Array Access, Computers, IEEE Transactions on, C-31(5), 435–442, doi:10.1109/TC.1982.1676020.
[55] Lee, J., and H. Kim (2012), TAP: A TLP-aware cache management policy for a CPU-GPU heterogeneous architecture, in IEEE International Symposium on High-Performance Comp Architecture, pp. 1–12, doi:10.1109/HPCA.2012.6168947.
[56] Lee, M., S. Song, J. Moon, J. Kim, W. Seo, Y. Cho, and S. Ryu (2014), Improving GPGPU resource utilization through alternative thread block scheduling, in High Performance Computer Architecture (HPCA), 2014 IEEE 20th International Symposium on, pp. 260–271, doi:10.1109/HPCA.2014.6835937.
[57] Li, A., G.-J. van den Braak, A. Kumar, and H. Corporaal (2015), Adaptive and transparent cache bypassing for GPUs, in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, p. 17, ACM.
[58] Li, D., M. Rhu, D. Johnson, M. O’Connor, M. Erez, D. Burger, D. Fussell, and S. Redder (2015), Priority-based cache allocation in throughput processors, in High Performance Computer Architecture (HPCA), 2015 IEEE 21st International Symposium on, pp. 89–100, doi:10.1109/HPCA.2015.7056024.
[59] Liu, H., M. Ferdman, J. Huh, and D. Burger (2008), Cache Bursts: A New Approach for Eliminating Dead Blocks and Increasing Cache Efficiency, in Proceedings of the 41st Annual IEEE/ACM International Symposium on Microarchitecture, MICRO 41, pp. 222–233, IEEE Computer Society, Washington, DC, USA, doi:10.1109/MICRO.2008.4771793.
[60] Mekkat, V., A. Holey, P.-C. Yew, and A. Zhai (2013), Managing Shared Last-level Cache in a Heterogeneous Multicore Processor, in Proceedings of the 22nd International Conference on Parallel Architectures and Compilation Techniques, PACT ’13, pp. 225–234, IEEE Press, Piscataway, NJ, USA.
[61] Meng, J., and K. Skadron (2009), Avoiding cache thrashing due to private data placement in last-level cache for manycore scaling, in Proceedings of the 2009 IEEE International Conference on Computer Design, ICCD’09, pp. 282–288, IEEE Press, Piscataway, NJ, USA.
[62] Mosegaard, J., and T. S. Sørensen (2005), Real-time deformation of detailed geometry based on mappings to a less detailed physical simulation on the GPU, in Proceedings of the 11th Eurographics Conference on Virtual Environments, EGVE’05, pp. 105–111, Eurographics Association, Aire-la-Ville, Switzerland, doi:10.2312/EGVE/IPT_EGVE2005/105-111.
[63] Narasiman, V., M. Shebanow, C. J. Lee, R. Miftakhutdinov, O. Mutlu, and Y. N. Patt (2011), Improving GPU Performance via Large Warps and Two-level Warp Scheduling, in Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture, MICRO-44, pp. 308–317, ACM, New York, NY, USA, doi:10.1145/2155620.2155656.
[64] NVIDIA Corporation (2009), NVIDIA Fermi white paper, http://www.nvidia.com/content/pdf/fermi_white_papers/nvidia_fermi_compute_architecture_whitepaper.pdf.
[65] NVIDIA Corporation (2012), NVIDIA Kepler GK110 white paper, https://www.nvidia.com/content/PDF/kepler/NVIDIA-Kepler-GK110-Architecture-Whitepaper.pdf.
[66] NVIDIA Corporation (2015), CUDA C Programming Guide, https://docs.nvidia.com/cuda/cuda-c-programming-guide/.
[67] Qureshi, M., D. Thompson, and Y. Patt (2005), The V-Way cache: demand-based associativity via global replacement, in Computer Architecture, 2005. ISCA ’05. Proceedings. 32nd International Symposium on, pp. 544–555, doi:10.1109/ISCA.2005.52.
[68] Qureshi, M. K., A. Jaleel, Y. N. Patt, S. C. Steely, and J. Emer (2007), Adaptive insertion policies for high performance caching, in Proceedings of the 34th Annual International Symposium on Computer Architecture, ISCA ’07, pp. 381–391, ACM, New York, NY, USA, doi:10.1145/1250662.1250709.
[69] Raghavan, R., and J. P. Hayes (1990), On Randomly Interleaved Memories, in Proceedings of the 1990 ACM/IEEE Conference on Supercomputing, Supercomputing ’90, pp. 49–58, IEEE Computer Society Press, Los Alamitos, CA, USA.
[70] Rau, B. R. (1991), Pseudo-randomly Interleaved Memory, in Proceedings of the 18th Annual International Symposium on Computer Architecture, ISCA ’91, pp. 74–83, ACM, New York, NY, USA, doi:10.1145/115952.115961.
[71] Rivers, J. A., E. S. Tam, G. S. Tyson, E. S. Davidson, and M. Farrens (1998), Utilizing Reuse Information in Data Cache Management, in Proceedings of the 12th International Conference on Supercomputing, ICS ’98, pp. 449–456, ACM, New York, NY, USA, doi:10.1145/277830.277941.
[72] Rogers, T. G., M. O’Connor, and T. M. Aamodt (2012), Cache-Conscious Wavefront Scheduling, in Proceedings of the 2012 45th Annual IEEE/ACM International Symposium on Microarchitecture, MICRO-45, pp. 72–83, IEEE Computer Society, Washington, DC, USA, doi:10.1109/MICRO.2012.16.
[73] Rogers, T. G., M. O’Connor, and T. M. Aamodt (2013), Divergence-aware Warp Scheduling, in Proceedings of the 46th Annual IEEE/ACM International Symposium on Microarchitecture, MICRO-46, pp. 99–110, ACM, New York, NY, USA, doi:10.1145/2540708.2540718.
[74] Roh, L., and W. Najjar (1995), Design of storage hierarchy in multithreaded architectures, in Microarchitecture, 1995., Proceedings of the 28th Annual International Symposium on, pp. 271–278, doi:10.1109/MICRO.1995.476836.
[75] Ros, A., P. Xekalakis, M. Cintra, M. E. Acacio, and J. M. García (2012), ASCIB: Adaptive Selection of Cache Indexing Bits for Removing Conflict Misses, in Proceedings of the 2012 ACM/IEEE International Symposium on Low Power Electronics and Design, ISLPED ’12, pp. 51–56, ACM, New York, NY, USA, doi:10.1145/2333660.2333674.
[76] Sethia, A., D. Jamshidi, and S. Mahlke (2015), Mascar: Speeding up GPU warps by reducing memory pitstops, in High Performance Computer Architecture (HPCA), 2015 IEEE 21st International Symposium on, pp. 174–185, doi:10.1109/HPCA.2015.7056031.
[77] Seznec, A. (1993), A Case for Two-way Skewed-associative Caches, in Proceedings of the 20th Annual International Symposium on Computer Architecture, ISCA ’93, pp. 169–178, ACM, New York, NY, USA, doi:10.1145/165123.165152.
[78] Stuart, J. A., and J. D. Owens (2011), Multi-GPU MapReduce on GPU clusters, in Proceedings of the 2011 IEEE International Parallel & Distributed Processing Symposium, IPDPS ’11, pp. 1068–1079, IEEE Computer Society, Washington, DC, USA, doi:10.1109/IPDPS.2011.102.
[79] Ta, T., K. Choo, E. Tan, B. Jang, and E. Choi (2015), Accelerating DynEarthSol3D on tightly coupled CPU-GPU heterogeneous processors, Computers & Geosciences, 79, 27–37, doi:10.1016/j.cageo.2015.03.003.
[80] Tian, Y., S. Puthoor, J. L. Greathouse, B. M. Beckmann, and D. A. Jimenez (2015), Adaptive GPU cache bypassing, in Proceedings of the 8th Workshop on General Purpose Processing using GPUs, pp. 25–35, ACM.
[81] Topham, N., A. Gonzalez, and J. Gonzalez (1997), The Design and Performance of a Conflict-avoiding Cache, in Proceedings of the 30th Annual ACM/IEEE International Symposium on Microarchitecture, MICRO 30, pp. 71–80, IEEE Computer Society, Washington, DC, USA.
[82] Tor M. Aamodt and Wilson W.L. Fung (2014), GPGPU-Sim 3.x Manual, http://gpgpu-sim.org/manual/index.php/GPGPU-Sim_3.x_Manual.
[83] Trancoso, P., D. Othonos, and A. Artemiou (2009), Data parallel acceleration of decision support queries using Cell/BE and GPUs, in Proceedings of the 6th ACM Conference on Computing Frontiers, CF ’09, pp. 117–126, ACM, New York, NY, USA, doi:10.1145/1531743.1531763.
[84] Trapnell, C., and M. C. Schatz (2009), Optimizing Data Intensive GPGPU Computations for DNA Sequence Alignment, Parallel Comput., 35(8-9), 429–440, doi:10.1016/j.parco.2009.05.002.
[85] Tyson, G., M. Farrens, J. Matthews, and A. R. Pleszkun (1995), A Modified Approach to Data Cache Management, in Proceedings of the 28th Annual International Symposium on Microarchitecture, MICRO 28, pp. 93–103, IEEE Computer Society Press, Los Alamitos, CA, USA.
[86] Wang, B., Z. Liu, X. Wang, and W. Yu (2015), Eliminating Intra-warp Conflict Misses in GPU, in Proceedings of the 2015 Design, Automation & Test in Europe Conference & Exhibition, DATE ’15, pp. 689–694, EDA Consortium, San Jose, CA, USA.
[87] Wierzbicki, A., N. Leibowitz, M. Ripeanu, and R. Wozniak (2004), Cache Replacement Policies Revisited: The Case of P2P Traffic.
[88] Wu, C.-J., A. Jaleel, W. Hasenplaugh, M. Martonosi, S. C. Steely, Jr., and J. Emer (2011), SHiP: Signature-based hit predictor for high performance caching, in Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture, MICRO-44, pp. 430–441, ACM, New York, NY, USA, doi:10.1145/2155620.2155671.
[89] Wu, H., G. Diamos, S. Cadambi, and S. Yalamanchili (2012), Kernel Weaver: Automatically fusing database primitives for efficient GPU computation, in 2012 45th Annual IEEE/ACM International Symposium on Microarchitecture, pp. 107–118, doi:10.1109/MICRO.2012.19.
[90] Yu, Y., W. Xiao, X. He, H. Guo, Y. Wang, and X. Chen (2015), A Stall-Aware Warp Scheduling for Dynamically Optimizing Thread-level Parallelism in GPGPUs, in Proceedings of the 29th ACM on International Conference on Supercomputing, ICS ’15, pp. 15–24, ACM, New York, NY, USA, doi:10.1145/2751205.2751234.
[91] Zhang, Z., Z. Zhu, and X. Zhang (2000), A Permutation-based Page Interleaving Scheme to Reduce Row-buffer Conflicts and Exploit Data Locality, in Proceedings of the 33rd Annual ACM/IEEE International Symposium on Microarchitecture, MICRO 33, pp. 32–41, ACM, New York, NY, USA, doi:10.1145/360128.360134.
VITA
Education
2016: Ph.D. in Computer Science, University of Mississippi, University, MS
2001: M.S. in Electrical Engineering Systems, University of Michigan, Ann Arbor, MI
2000: B.S. in EE and CS, Handong Global University, Pohang, Korea
Professional Experience
May-Aug 2015: Software engineering intern. Google Inc., Mountain View, CA
May-Aug 2014: Research intern. Samsung Research America, San Jose, CA
2014-2015: Graduate instructor. University of Mississippi, University, MS
2012-2016: Graduate research assistant. University of Mississippi, University, MS
2002-2012: Senior engineer / manager. Samsung Electronics, Suwon, Korea
2000-2001: Graduate research assistant. University of Michigan, Ann Arbor, MI
Publication List
Papers
1. K. Choo, W. Panlener, and B. Jang, Understanding and optimizing GPU cache memory performance for compute workloads, Parallel and Distributed Computing (ISPDC), 2014 IEEE 13th International Symposium on, pages 189-196, IEEE, 2014.
2. T. Ta, K. Choo, E. Tan, B. Jang, and E. Choi, Accelerating DynEarthSol3D on tightly coupled CPU-GPU heterogeneous processors, Computers & Geosciences, vol. 79, pp. 27-37, Jun. 2015.
3. K. Choo, D. Troendle, E. Abdelmageed, and B. Jang, Contention-Aware Selective Caching to Mitigate Intra-Warp Contention on GPUs, submitted to IISWC 2016 (from Chapter 4).
4. K. Choo, D. Troendle, E. Abdelmageed, and B. Jang, Locality-Aware Selective Caching on GPUs, submitted to SBAC-PAD 2016 (from Chapter 5).
5. K. Choo, D. Troendle, E. Abdelmageed, and B. Jang, Memory request scheduling to promote potential cache hit on GPU, to be submitted (from Chapter 6).
6. D. Troendle, K. Choo, and B. Jang, Recency Rank Tracking (RRT): A Scalable, Configurable, Low Latency Cache Replacement Policy, to be submitted.
Patents
1. E. Park, J.H. Lee, and K. Choo, Digital transmission system for transmitting additional data and method thereof. US 8,891,674, 2014.
2. S. Park, H.J. Jeong, S.J. Park, J.H. Lee, K. Kim, Y.S. Kwon, J.H. Jeong, G. Ryu, K. Choo, and K.R. Ji, Digital broadcasting transmitter, digital broadcasting receiver, and methods for configuring and processing a stream for same. US 8,891,465, 2014.
3. G. Ryu, Y.S. Kwon, J.H. Lee, C.S. Park, J. Kim, K. Choo, K.R. Ji, S. Park, and J.H. Kim, Digital broadcast transmitter, digital broadcast receiver, and methods for configuring and processing streams thereof. US 8,811,304, 2014.
4. J.H. Jeong, H.J. Lee, S.H. Myung, Y.S. Kwon, K.R. Ji, J.H. Lee, C.S. Park, G. Ryu, J. Kim, and K. Choo, Digital broadcasting transmitter, digital broadcasting receiver, and method for composing and processing streams thereof. US 8,804,805, 2014.
5. K. Choo and J.H. Lee, OFDM transmitting and receiving systems and methods. US 8,804,477, 2014.
6. Y.S. Kwon, G. Ryu, J.H. Lee, C.S. Park, J. Kim, K. Choo, K.R. Ji, S. Park, J.H. Kim, Digital broadcast transmitter, digital broadcast receiver, and methods for configuring and processing streams thereof. US 8,798,138, 2014.
7. Y.S. Kwon, G. Ryu, J.H. Lee, C.S. Park, J. Kim, K. Choo, K.R. Ji, S. Park, and J.H. Kim, Digital broadcast transmitter, digital broadcast receiver, and methods for configuring and processing digital transport streams thereof. US 8,787,220, 2014.
8. G. Ryu, S. Park, J.H. Kim, and K. Choo, Method and apparatus for transmitting broadcast, method and apparatus for receiving broadcast. US 8,717,961, 2014.
9. J.H. Lee, K. Choo, K. Ha, H.J. Jeong, Service relay device, service receiver for receiving the relayed service signal, and methods thereof. US 8,140,008, 2012.
10. E. Park, J. Kim, S.H. Yoon, K. Choo, K. Seok, Trellis encoder and trellis encoding device having the same. US 8,001,451, 2011.
Honors and Awards
• Graduate Student Achievement Award, Spring 2016, University of Mississippi
• President of ΥΠE (UPE), CS Honor Society, Fall 2015 - Spring 2016, University of Mississippi Chapter
• Doctoral Dissertation Fellowship Award, Spring 2016, University of Mississippi
• Computer Science SAP Scholarship Award, Spring 2016, University of Mississippi
• UPE Scholarship Award, Fall 2015, ΥΠE (UPE)
• Academic Excellence Award (2nd place in class of 2000), 2000, Handong Global University
• 4-year full Scholarship, 1996 - 2000, Handong Global University