TESTING AND EXPOSING WEAK GPU MEMORY MODELS
MS Thesis Defense
by Tyler Sorensen
Advisor: Ganesh Gopalakrishnan
May 30, 2014
• Joint work with: Jade Alglave (University College London), Daniel Poetzl (University of Oxford), Luc Maranget (Inria), Alastair Donaldson and John Wickerson (Imperial College London), Mark Batty (University of Cambridge)
Roadmap
• Background and Approach
• Prior Work
• Testing Framework
• Results
• CUDA Spin Locks
• Bulk Testing
• Future Work and Conclusion
GPU Background
Images from Wikipedia [16,17,18]
• GPU is a highly parallel co-processor
• Currently found in devices from tablets to top supercomputers (Titan)
• Not just used for visualization anymore!
GPU Programming
Explicit hierarchical concurrency model
• Thread Hierarchy:
  • Thread
  • Warp
  • CTA (Cooperative Thread Array)
  • Kernel (GPU program)
• Memory Hierarchy:
  • Shared Memory
  • Global Memory
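As a rough illustration of the thread hierarchy, the sketch below decomposes a linear thread index into its CTA, warp, and lane. The names `locate`, `Position`, and `cta_size` are hypothetical, and the warp width of 32 is an assumption based on NVIDIA GPUs to date rather than something stated on the slide.

```python
from typing import NamedTuple

WARP_SIZE = 32  # assumption: warp width on NVIDIA GPUs to date


class Position(NamedTuple):
    cta: int    # which CTA within the kernel launch
    warp: int   # which warp within that CTA
    lane: int   # which thread within that warp


def locate(global_tid: int, cta_size: int) -> Position:
    """Decompose a linear thread index into (CTA, warp, lane)."""
    if cta_size <= 0 or global_tid < 0:
        raise ValueError("cta_size must be positive and global_tid non-negative")
    cta, tid_in_cta = divmod(global_tid, cta_size)
    warp, lane = divmod(tid_in_cta, WARP_SIZE)
    return Position(cta, warp, lane)


print(locate(0, 128))    # Position(cta=0, warp=0, lane=0)
print(locate(37, 128))   # Position(cta=0, warp=1, lane=5)
print(locate(130, 128))  # Position(cta=1, warp=0, lane=2)
```

Threads in the same CTA can share shared memory; threads in different CTAs of the same kernel can only communicate through global memory.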
GPU Programming
• GPUs are SIMT (Single Instruction, Multiple Thread)
• NVIDIA GPUs may be programmed using CUDA or OpenCL
Weak Memory Models
• Consider the test known as Store Buffering (SB)
• Initial State: x and y are memory locations
• Thread IDs
• Program: for each thread ID
• Assertion: question about the final state of registers
• Can this assertion be satisfied?
• The assertion cannot be satisfied by any interleaving of the threads' instructions
• Restricting executions to interleavings is known as sequential consistency (or SC) [1]
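The interleaving argument can be checked mechanically. The sketch below enumerates every interleaving of the SB test and collects the final register values. The test itself appears only as an image on the slides, so the shape used here (each thread stores 1 to one location and then loads the other, with both locations initially 0 and the assertion r0 = 0 and r1 = 0) is the classic SB form and should be read as an assumption; `sc_outcomes` is a hypothetical helper name.

```python
from itertools import permutations

# Classic SB litmus test (assumed shape; the slide shows it as an image):
#   T0: x = 1; r0 = y        T1: y = 1; r1 = x
PROGRAM = {
    0: [("store", "x", 1), ("load", "y", "r0")],
    1: [("store", "y", 1), ("load", "x", "r1")],
}


def sc_outcomes(program):
    """Enumerate final (r0, r1) values over all interleavings."""
    # A schedule lists thread IDs, one entry per instruction to execute.
    slots = [tid for tid, instrs in program.items() for _ in instrs]
    outcomes = set()
    for schedule in set(permutations(slots)):
        memory = {"x": 0, "y": 0}   # assumption: both locations start at 0
        regs = {}
        pc = {tid: 0 for tid in program}
        for tid in schedule:
            op, loc, arg = program[tid][pc[tid]]
            pc[tid] += 1
            if op == "store":
                memory[loc] = arg
            else:
                regs[arg] = memory[loc]
        outcomes.add((regs["r0"], regs["r1"]))
    return outcomes


outcomes = sc_outcomes(PROGRAM)
print(sorted(outcomes))     # [(0, 1), (1, 0), (1, 1)]
print((0, 0) in outcomes)   # False
```

Under interleaving semantics at least one store always precedes the other thread's load, so the outcome r0 = 0 and r1 = 0 never appears.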
Weak Memory Models
• Can we assume the assertion will never pass? No!
Weak Memory Models
• Executing this test with the Litmus tool [2] on an Intel i7 x86 processor for 1,000,000 iterations, we get the following histogram of results:
[Histogram: image on the original slide]
Weak Memory Models
• What happened?
• Architectures implement weak memory models, where the hardware is allowed to reorder certain memory instructions
• On x86 architectures, the hardware is allowed to reorder write instructions with program-order-later read instructions [3]
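One common way to picture write-to-read reordering is a per-thread FIFO store buffer: a store sits in the buffer before reaching memory, and a later load can run in the meantime. The sketch below explores every execution of the SB test under such a model. It is a simplified illustration and not a statement of the x86 specification; the SB shape, the initial values of 0, and the name `tso_outcomes` are assumptions.

```python
# SB shape (assumed): T0: x = 1; r0 = y        T1: y = 1; r1 = x
PROGRAM = (
    (("store", "x", 1), ("load", "y", "r0")),
    (("store", "y", 1), ("load", "x", "r1")),
)


def tso_outcomes(program):
    """Explore every execution where each thread has a FIFO store buffer."""
    n = len(program)
    # State: (program counters, store buffers, memory, registers)
    start = ((0,) * n, ((),) * n, (("x", 0), ("y", 0)), ())
    outcomes, seen, stack = set(), set(), [start]
    while stack:
        state = stack.pop()
        if state in seen:
            continue
        seen.add(state)
        pcs, bufs, mem, regs = state
        if all(pcs[t] == len(program[t]) for t in range(n)):
            final = dict(regs)
            outcomes.add((final["r0"], final["r1"]))
            continue
        for t in range(n):
            # Step 1: thread t executes its next instruction.
            if pcs[t] < len(program[t]):
                op, loc, arg = program[t][pcs[t]]
                new_pcs = pcs[:t] + (pcs[t] + 1,) + pcs[t + 1:]
                if op == "store":
                    new_buf = bufs[t] + ((loc, arg),)
                    stack.append((new_pcs, bufs[:t] + (new_buf,) + bufs[t + 1:], mem, regs))
                else:
                    # A load sees the thread's own newest buffered store first.
                    own = [v for l, v in bufs[t] if l == loc]
                    value = own[-1] if own else dict(mem)[loc]
                    stack.append((new_pcs, bufs, mem, regs + ((arg, value),)))
            # Step 2: the oldest buffered store of thread t drains to memory.
            if bufs[t]:
                (loc, value), rest = bufs[t][0], bufs[t][1:]
                new_mem = tuple(sorted({**dict(mem), loc: value}.items()))
                stack.append((pcs, bufs[:t] + (rest,) + bufs[t + 1:], new_mem, regs))
    return outcomes


outcomes = tso_outcomes(PROGRAM)
print(sorted(outcomes))     # [(0, 0), (0, 1), (1, 0), (1, 1)]
print((0, 0) in outcomes)   # True
```

With both stores still buffered, both loads read 0, which is exactly the outcome that no interleaving can produce.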
GPU Memory Models
• What type of memory model do current GPUs implement?
• Documentation is sparse
  • CUDA has 1 page + 1 example [4]
  • PTX has 1 page + 0 examples [5]
• No specifics about which instructions are allowed to be reordered
• We need to know if we are to write correct GPU programs!
Our Approach
• Empirically explore the memory model implemented on deployed NVIDIA GPUs
• Achieved by developing a memory model testing tool for NVIDIA GPUs with specialized heuristics
• We analyze classic memory model properties and CUDA applications in this framework, with unexpected results
• We test large families of tests on GPUs as a basis for modeling and bug hunting
Our Approach
• Disclaimer: Testing is not guaranteed to reveal all behaviors
Prior Work
• Testing Memory Models:
  • Pioneered by Bill Collier in ARCHTEST in 1992 [6]
  • TSOtool in 2004 [7]
  • Litmus in 2011 [2]
    • We extend this tool
Prior Work (GPU Memory Models)
• June 2013:
  • Hower et al. proposed an SC-for-race-free memory model for GPUs [8]
  • Sorensen et al. proposed an operational weak GPU memory model based on available documentation [9]
• 2014:
  • Hower et al. proposed two SC-for-race-free memory models for GPUs, HRF-direct and HRF-indirect [10]
It remains unclear what memory model deployed GPUs implement
Testing Framework
• GPU litmus test
• PTX instructions
• What memory region (shared or global) are x and y in?
• Are T0 and T1 in the same CTA? Or different CTAs?
Testing Framework
• We consider three different GPU configurations for tests:
  • D-warp:S-cta-Shared: Different warp, Same CTA, targeting shared memory
  • D-warp:S-cta-Global: Different warp, Same CTA, targeting global memory
  • D-cta:S-ker-Global: Different CTA, Same kernel, targeting global memory
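To make the three configurations concrete, the sketch below classifies how two testing threads relate and checks a placement against a configuration name. The `CONFIGS` table, `thread_scope`, and `matches` are hypothetical names introduced for illustration, and the warp width of 32 is an assumption; this is not the tool's actual interface.

```python
WARP_SIZE = 32  # assumption: warp width on NVIDIA GPUs to date

# Hypothetical encoding of the three configurations: (thread scope, memory region)
CONFIGS = {
    "D-warp:S-cta-Shared": ("D-warp:S-cta", "shared"),
    "D-warp:S-cta-Global": ("D-warp:S-cta", "global"),
    "D-cta:S-ker-Global": ("D-cta:S-ker", "global"),
}


def thread_scope(t0: int, t1: int, cta_size: int) -> str:
    """Classify how two linear thread indices of one kernel relate."""
    cta0, in0 = divmod(t0, cta_size)
    cta1, in1 = divmod(t1, cta_size)
    if cta0 != cta1:
        return "D-cta:S-ker"
    if in0 // WARP_SIZE != in1 // WARP_SIZE:
        return "D-warp:S-cta"
    return "S-warp"  # same warp: not one of the three configurations


def matches(config: str, t0: int, t1: int, cta_size: int) -> bool:
    """True if placing the testing threads at t0 and t1 fits the configuration."""
    scope, _region = CONFIGS[config]
    return thread_scope(t0, t1, cta_size) == scope


print(thread_scope(0, 32, 128))                       # D-warp:S-cta
print(thread_scope(0, 128, 128))                      # D-cta:S-ker
print(matches("D-cta:S-ker-Global", 0, 128, 128))     # True
```

There is no inter-CTA shared-memory configuration because shared memory is private to a CTA.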
Testing Framework
• Given a GPU litmus test, produce an executable in either:
  • CUDA, or
  • OpenCL
Testing Framework
• Host (CPU) generated code
• Kernel generated code
Testing Framework
• Basic framework shows NO weak behaviors
• We develop heuristics (which we dub incantations) to encourage weak behaviors to appear
Testing Framework
• General bank conflict incantation
• Each access in the test is exclusively one of: Optimal, Broadcast, Bank Conflict
• One possible general bank conflict scheme for a given test labels its accesses: Bank Conflict, Optimal, Optimal, Broadcast
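The three labels can be illustrated by classifying the addresses that the threads of one warp touch in a single access. The sketch below assumes 32 banks with successive 4-byte words mapped to successive banks, which matches common descriptions of NVIDIA shared memory but is not stated on the slide; `classify_access` is a hypothetical helper, not part of the testing tool.

```python
NUM_BANKS = 32   # assumption: 32 shared-memory banks
WORD_BYTES = 4   # assumption: successive 4-byte words map to successive banks


def classify_access(byte_addresses):
    """Label one warp-wide access as 'optimal', 'broadcast', or 'bank conflict'."""
    words = {addr // WORD_BYTES for addr in byte_addresses}
    if len(words) == 1 and len(byte_addresses) > 1:
        return "broadcast"            # every thread touches the same word
    banks = [word % NUM_BANKS for word in words]
    if len(banks) == len(set(banks)):
        return "optimal"              # distinct words all fall in distinct banks
    return "bank conflict"            # two different words share a bank


warp = range(32)
print(classify_access([4 * lane for lane in warp]))    # optimal
print(classify_access([0 for _ in warp]))              # broadcast
print(classify_access([128 * lane for lane in warp]))  # bank conflict
```

A scheme in the sense of the slide would then pick, for each access in the litmus test, which of the three address patterns the surrounding warp uses.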
Testing Framework
• Two critical incantations (without them we observe no weak executions):
  • General Bank Conflicts (shown previously)
  • Memory Stress: all non-testing threads read/write to memory
Testing Framework
• Two extra incantations:
  • Sync: testing threads synchronize before test
  • Randomization: testing thread IDs are randomized
Traditional Tests
• We show the results for these tests, which have been studied for CPUs in [3]:
  • MP (Message Passing): can stale values be read in a handshake idiom?
  • SB (Store Buffering): can stores be buffered after loads?
  • LD (Load Delaying): can loads be delayed after stores?
• Results are from running 100,000 iterations on 3 chips: Tesla C2075 (Fermi), GTX Titan (Kepler), and GTX 750 (Maxwell)
Message Passing
• Tests how to implement a handshake idiom
• The test has flag accesses and data accesses; the outcome of interest is reading stale data
Message Passing
• How do we disallow reading stale data?
• PTX gives 2 fences for intra-device ordering [5, p. 165]:
  • membar.cta: gives ordering properties intra-CTA
  • membar.gl: gives ordering properties over the device
• Test amended with a parameterizable fence
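The effect of a fence in the handshake can be sketched with a toy model in which a thread's buffered stores to different locations may reach memory in any order, and a fence waits until the issuing thread's buffer is empty. Everything here is an assumption made for illustration: the MP shape (writer stores data x then flag y, reader loads y then x), the initial values of 0, and the model itself, which relaxes only the writer side and says nothing about the scopes of membar.cta and membar.gl. It is not the model implemented by any GPU.

```python
def outcomes(program, locations=("x", "y")):
    """Explore all executions; buffered stores may drain in any order."""
    n = len(program)
    start = ((0,) * n, ((),) * n, tuple(sorted((loc, 0) for loc in locations)), ())
    results, seen, stack = set(), set(), [start]
    while stack:
        state = stack.pop()
        if state in seen:
            continue
        seen.add(state)
        pcs, bufs, mem, regs = state
        if all(pcs[t] == len(program[t]) for t in range(n)):
            results.add(tuple(sorted(regs)))
            continue
        for t in range(n):
            if pcs[t] < len(program[t]):
                instr = program[t][pcs[t]]
                new_pcs = pcs[:t] + (pcs[t] + 1,) + pcs[t + 1:]
                if instr[0] == "store":
                    _, loc, value = instr
                    new_buf = bufs[t] + ((loc, value),)
                    stack.append((new_pcs, bufs[:t] + (new_buf,) + bufs[t + 1:], mem, regs))
                elif instr[0] == "load":
                    _, loc, reg = instr
                    own = [v for l, v in bufs[t] if l == loc]
                    value = own[-1] if own else dict(mem)[loc]
                    stack.append((new_pcs, bufs, mem, regs + ((reg, value),)))
                elif instr[0] == "fence" and not bufs[t]:
                    # A fence may only execute once this thread's buffer is empty.
                    stack.append((new_pcs, bufs, mem, regs))
            # Any buffered store may drain next, unless an older buffered
            # store to the same location is still pending.
            for i, (loc, value) in enumerate(bufs[t]):
                if any(l == loc for l, _ in bufs[t][:i]):
                    continue
                new_mem = tuple(sorted({**dict(mem), loc: value}.items()))
                rest = bufs[t][:i] + bufs[t][i + 1:]
                stack.append((pcs, bufs[:t] + (rest,) + bufs[t + 1:], new_mem, regs))
    return results


def mp(fence):
    """MP shape (assumed): writer stores data x then flag y; reader loads y then x."""
    writer = [("store", "x", 1)] + ([("fence",)] if fence else []) + [("store", "y", 1)]
    reader = [("load", "y", "r0"), ("load", "x", "r1")]
    return (tuple(writer), tuple(reader))


STALE = (("r0", 1), ("r1", 0))   # flag seen as set, data still stale
print(STALE in outcomes(mp(fence=False)))  # True
print(STALE in outcomes(mp(fence=True)))   # False
```

Because this toy model keeps loads in program order, a fence on the writer alone is enough here; it should not be read as guidance on where real GPU code needs fences.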
Store Buffering
• Can stores be delayed after loads?
Load Delaying
• Can loads be delayed after stores?
CoRR Test
• Coherence is SC per memory location [11, p. 14]
• Modern processors (ARM, POWER, x86) implement coherence
• All language models require coherence (C++11, OpenCL 2.0)
• Coherence violations have been observed and confirmed as bugs in ARM chips [3, 12]
CoRR Test
• Coherence of Read-Read test
• Can loads from the same location return stale values?
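The property under test can be stated as a small check: given the order in which values were written to one location, two program-ordered reads by the same thread must not observe those writes backwards. The sketch below assumes the usual CoRR shape (one thread writes x, another reads x twice); `violates_corr` is a hypothetical helper name.

```python
def violates_corr(coherence_order, reads):
    """True if program-ordered reads of one location observe writes out of order.

    coherence_order: values written to the location, oldest first
                     (values assumed distinct, with the initial value first).
    reads:           values returned by one thread's loads, in program order.
    """
    position = {value: i for i, value in enumerate(coherence_order)}
    seen = [position[v] for v in reads]
    return any(later < earlier for earlier, later in zip(seen, seen[1:]))


# CoRR shape (assumed): T0: x = 1        T1: r1 = x; r2 = x
print(violates_corr([0, 1], [0, 0]))  # False
print(violates_corr([0, 1], [0, 1]))  # False
print(violates_corr([0, 1], [1, 1]))  # False
print(violates_corr([0, 1], [1, 0]))  # True: new value first, then a stale one
```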
CoRR Test
• Test amended with a parameterized fence
Results Take Away
• Current GPUs implement observably weak memory models with scoped properties
• Without formal docs, how can developers know what behaviors to rely on?
• This is biting developers even now (discussed next)
GPU Spin Locks
• Inter-CTA lock presented in the book CUDA By Example [13]
• Distilled to a litmus test (y is mutex, x is data):
  • The mutex is initially locked by T0
  • T0 performs its critical section (CS) and unlocks
  • T1 locks and performs its critical section (CS)
  • Outcome of interest: T1 observes a stale value
• Do we observe stale data in the critical section? Yes!
• Spin lock test amended with fences
• Now test with fences:
  • Is membar.cta enough? No! It is an inter-CTA lock!
  • Is membar.gl enough? Yes!
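The scoping argument can be mimicked in a toy model. In the sketch below, a buffered store first becomes visible to the issuing thread's CTA and only later to the whole device; a "cta" fence waits for the first step and a "gl" fence for the second, and atomics act directly on device memory. All of this is an invented simplification for illustration, not NVIDIA's hardware or the PTX specification. The litmus shape (T0: store x, fence, exchange y to 0; T1: compare-and-swap y from 0 to 1, fence, load x; initially x = 0 and y = 1) is likewise an assumption, since the slides show the test only as an image.

```python
def lock_outcomes(fence, same_cta):
    """Final (r0, r1) pairs of the distilled lock test under the toy model.

    fence:    None, "cta" or "gl" (after T0's store and after T1's CAS)
    same_cta: whether T0 and T1 belong to the same CTA
    """
    fences = [("fence", fence)] if fence else []
    program = (
        [("store", "x", 1)] + fences + [("exch", "y", 0)],            # T0: CS, unlock
        [("cas", "y", 0, 1, "r0")] + fences + [("load", "x", "r1")],  # T1: lock, CS
    )
    cta = (0, 0) if same_cta else (0, 1)
    # State: (pcs, pending stores per thread as (loc, value, level), memory, regs)
    # level 0: visible to the issuing thread only; level 1: visible to its CTA.
    start = ((0, 0), ((), ()), (("x", 0), ("y", 1)), ())
    results, seen, stack = set(), set(), [start]

    def put(seq, i, item):
        return seq[:i] + (item,) + seq[i + 1:]

    while stack:
        state = stack.pop()
        if state in seen:
            continue
        seen.add(state)
        pcs, bufs, mem, regs = state
        memory = dict(mem)
        if all(pcs[t] == len(program[t]) for t in (0, 1)):
            final = dict(regs)
            results.add((final["r0"], final["r1"]))
            continue
        for t in (0, 1):
            if pcs[t] < len(program[t]):
                op, *args = program[t][pcs[t]]
                nxt = put(pcs, t, pcs[t] + 1)
                if op == "store":
                    loc, value = args
                    stack.append((nxt, put(bufs, t, bufs[t] + ((loc, value, 0),)), mem, regs))
                elif op == "load":
                    loc, reg = args
                    own = [v for l, v, _ in bufs[t] if l == loc]
                    peer = [v for u in (0, 1) if u != t and cta[u] == cta[t]
                            for l, v, level in bufs[u] if l == loc and level == 1]
                    value = own[-1] if own else peer[-1] if peer else memory[loc]
                    stack.append((nxt, bufs, mem, regs + ((reg, value),)))
                elif op == "exch":   # assumption: atomics act on device memory
                    loc, value = args
                    new_mem = tuple(sorted({**memory, loc: value}.items()))
                    stack.append((nxt, bufs, new_mem, regs))
                elif op == "cas":
                    loc, expected, new, reg = args
                    old = memory[loc]
                    updated = {**memory, loc: new} if old == expected else memory
                    stack.append((nxt, bufs, tuple(sorted(updated.items())), regs + ((reg, old),)))
                elif op == "fence":
                    (scope,) = args
                    done = not bufs[t] if scope == "gl" else all(e[2] == 1 for e in bufs[t])
                    if done:
                        stack.append((nxt, bufs, mem, regs))
            for i, (loc, value, level) in enumerate(bufs[t]):
                if level == 0:   # become visible to the rest of the CTA
                    stack.append((pcs, put(bufs, t, put(bufs[t], i, (loc, value, 1))), mem, regs))
                else:            # become visible to the whole device
                    new_mem = tuple(sorted({**memory, loc: value}.items()))
                    stack.append((pcs, put(bufs, t, bufs[t][:i] + bufs[t][i + 1:]), new_mem, regs))
    return results


STALE = (0, 0)   # T1 acquired the lock (r0 = 0) yet read stale data (r1 = 0)
print(STALE in lock_outcomes(None, same_cta=False))   # True
print(STALE in lock_outcomes("cta", same_cta=False))  # True
print(STALE in lock_outcomes("gl", same_cta=False))   # False
print(STALE in lock_outcomes("cta", same_cta=True))   # False
```

In this toy model the CTA-scoped fence only helps when both threads share a CTA, which mirrors the slide's point that an inter-CTA lock needs the device-wide fence.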
GPU Spin Locks
• More examples without fences, which have similar issues:
  • Mutex in Efficient Synchronization Primitives for GPUs [14]
  • Non-blocking GPU deque in GPU Computing Gems Jade Edition [15]
• GPU applications must use fences!!!
Bulk Testing
• Daniel Poetzl (University of Oxford) is developing GPU extensions to DIY test generation [3]
• Test generation is based on critical cycles
• Used for validating models, finding bugs, and gaining intuition about observable behaviors
Image used with permission from [3]
Bulk Testing
• We have generated over 8000 tests across intra/inter-CTA interactions, targeting both shared and global memory
• Tests include memory barriers (e.g. membar.{cta,gl,sys}) and dependencies (data, address, and control)
• Tested 5 chips across 3 generations:
  • GTX 540m (Fermi), Tesla C2075 (Fermi), GTX 660 (Kepler), GTX Titan (Kepler), GTX 750 Ti (Maxwell)
Future Work
• Test more complicated GPU configurations (e.g. both shared and global in the same test)
• Example: the intra-CTA Store Buffering (SB) test is observable on Maxwell only with mixed shared and global memory locations
Future Work
• Axiomatic memory model in Herd [3]
• New scoped relations:
  • Internal-CTA: contains pairs of instructions that are in the same CTA
• Can easily compare model to observations
• Based on acyclic relations
Image used with permission from [3]
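A scoped relation of this kind is just a set of pairs. The sketch below builds an Internal-CTA relation over a few events; the `Event` type and `internal_cta` function are hypothetical, the choice to leave out pairs of an event with itself is an assumption, and Herd itself expresses such relations in its own model language rather than in Python.

```python
from itertools import product
from typing import NamedTuple


class Event(NamedTuple):
    name: str     # label of the instruction, e.g. "a"
    thread: int   # issuing thread
    cta: int      # CTA that the issuing thread belongs to


def internal_cta(events):
    """All ordered pairs of distinct events whose threads share a CTA."""
    return {(a.name, b.name) for a, b in product(events, repeat=2)
            if a != b and a.cta == b.cta}


events = [Event("a", 0, 0), Event("b", 1, 0), Event("c", 2, 1)]
print(sorted(internal_cta(events)))  # [('a', 'b'), ('b', 'a')]
```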
Conclusion
• Current GPUs have observably weak memory models, which are largely undocumented
• GPU programming is proceeding without adequate guidelines, which results in buggy code (development of reliable GPU code is impossible without specs)
• Rigorous documentation, testing, and verification of GPU programs based on formal tools is the way forward for developing reliable GPU applications
References[1] L. Lamport, "How to make a multiprocessor computer that correctly executes multi-process programs," IEEE Trans. Comput., pp. 690-691, Sep. 1979.
[2] J. Alglave, L. Maranget, S. Sarkar, and P. Sewell, "Litmus: Running tests against hardware," ser. TACAS'11. Springer-Verlag, pp. 41-44.
[3] J. Alglave, L. Maranget, and M. Tautschnig, "Herding cats: modelling, simulation, testing, and data-mining for weak memory," 2014, to appear in TOPLAS.
[4] NVIDIA, "CUDA C programming guide, version 6," http://docs.nvidia.com/cuda/pdf/CUDA C Programming Guide.pdf, July 2014.
[5] NVIDIA, "Parallel Thread Execution ISA: Version 4.0 (Feb. 2014)," http://docs.nvidia.com/cuda/parallel-thread-execution.
[6] W. W. Collier, Reasoning About Parallel Architectures. Prentice-Hall, Inc., 1992.
[7] S. Hangal, D. Vahia, C. Manovit, and J.-Y. J. Lu, "TSOtool: A program for verifying memory systems using the memory consistency model," ser. ISCA '04. IEEE Computer Society, 2004, pp. 114.
![Page 112: TESTING AND EXPOSING WEAK GPU MEMORY MODELS](https://reader036.fdocuments.us/reader036/viewer/2022062408/5681375b550346895d9ee99f/html5/thumbnails/112.jpg)
112
References[8] D. R. Hower, B. M. Beckmann, B. R. Gaster, B. A. Hechtman, M. D. Hill, S. K. Reinhardt, and D. A. Wood, "Sequential consistency for heterogeneous-race-free," ser. MSPC'13. ACM, 2013.
[9] T. Sorensen, G. Gopalakrishnan, and V. Grover, "Towards shared memory consistency models for GPUs," ser. ICS'13. ACM, 2013, pp. 489-490.
[10] D. R. Hower, B. A. Hechtman, B. M. Beckmann, B. R. Gaster, M. D. Hill, S. K. Reinhardt, and D. A. Wood, "Heterogeneous-race-free memory models," ser. ASPLOS'14. ACM, 2014, pp. 427-440.
[11] D. J. Sorin, M. D. Hill, and D. A. Wood, A Primer on Memory Consistency and Cache Coherence, ser. Synthesis Lectures on Computer Architecture. Morgan & Claypool Publishers, 2011.
[12] ARM, "Cortex-A9 MPCore, programmer advice notice, read-after-read hazards," ARM Reference 761319. http://infocenter.arm.com/help/topic/com.arm.doc.uan0004a/UAN0004A_a9_read_read.pdf, accessed: May 2014.
[13] J. Sanders and E. Kandrot, CUDA by Example: An Introduction to General-Purpose GPU Programming. Addison-Wesley Professional, 2010.
![Page 113: TESTING AND EXPOSING WEAK GPU MEMORY MODELS](https://reader036.fdocuments.us/reader036/viewer/2022062408/5681375b550346895d9ee99f/html5/thumbnails/113.jpg)
113
References
[14] J. A. Stuart and J. D. Owens, "Efficient synchronization primitives for GPUs," CoRR, 2011, http://arxiv.org/pdf/1110.4623.pdf.
[15] W.-m. W. Hwu, GPU Computing Gems Jade Edition. Morgan Kaufmann Publishers Inc., 2011.
[16] http://en.wikipedia.org/wiki/Samsung_Galaxy_S5
[17] http://en.wikipedia.org/wiki/Titan_(supercomputer)
[18] http://en.wikipedia.org/wiki/Barnes_Hut_simulation
![Page 114: TESTING AND EXPOSING WEAK GPU MEMORY MODELS](https://reader036.fdocuments.us/reader036/viewer/2022062408/5681375b550346895d9ee99f/html5/thumbnails/114.jpg)
114
Acknowledgements
• Advisor: Ganesh Gopalakrishnan
• Committee: Zvonimir Rakamaric, Mary Hall
• UK Group: Jade Alglave (University College London), Daniel Poetzl (University of Oxford), Luc Maranget (Inria), John Wickerson, Alastair Donaldson (Imperial College London), Mark Batty (University of Cambridge)
• Mohammed for feedback on practice runs
![Page 115: TESTING AND EXPOSING WEAK GPU MEMORY MODELS](https://reader036.fdocuments.us/reader036/viewer/2022062408/5681375b550346895d9ee99f/html5/thumbnails/115.jpg)
115
Thank You
![Page 116: TESTING AND EXPOSING WEAK GPU MEMORY MODELS](https://reader036.fdocuments.us/reader036/viewer/2022062408/5681375b550346895d9ee99f/html5/thumbnails/116.jpg)
116
Prior Work (GPU Memory Models)
• June 2010: Feng and Xiao revisit their GPU device-wide synchronization method [?] to repair it with fences [?]
• Speaking about weak behaviors, they state:
In practice, it is infinitesimally unlikely that this will ever happen given the amount of time that is spent spinning at the barrier, e.g., none of our thousands of experimental runs ever resulted in an incorrect answer. Furthermore, no existing literature has been able to show how to trigger this type of error.
![Page 117: TESTING AND EXPOSING WEAK GPU MEMORY MODELS](https://reader036.fdocuments.us/reader036/viewer/2022062408/5681375b550346895d9ee99f/html5/thumbnails/117.jpg)
117
Testing Framework
• Evaluate inter-CTA incantations using these tests:
• MP: checks whether stale values can be read in a handshake idiom
• LD: checks whether loads can be delayed after stores
• SB: checks whether stores can be delayed after loads
• Results show the average of running 100,000 iterations over 3 chips: Tesla C2075 (Fermi), GTX Titan (Kepler), and GTX 750 (Maxwell)
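To make "weak behavior" concrete, the following host-side C++ sketch enumerates every sequentially consistent interleaving of a two-thread test and checks whether a given outcome is reachable. It is not the GPU testing framework from the talk. The encodings of MP, LD, and SB below are the standard two-location shapes and are an assumption (in particular, LD is taken to be the pattern usually called load buffering); the names `st`, `ld`, and `sc_allows` are hypothetical helpers.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <set>
#include <vector>

// Two memory locations and two registers are enough for MP, LD, and SB.
constexpr int X = 0, Y = 1;
using Mem  = std::array<int, 2>;
using Regs = std::array<int, 2>;

struct Instr {
    bool is_store;  // true: mem[loc] = val; false: regs[reg] = mem[loc]
    int loc;        // memory location index (X or Y)
    int val;        // value written (stores only)
    int reg;        // destination register (loads only)
};
using Thread = std::vector<Instr>;

inline Instr st(int loc, int val) { return {true, loc, val, 0}; }
inline Instr ld(int reg, int loc) { return {false, loc, 0, reg}; }

// Execute one instruction against the shared memory and register file.
static void apply(const Instr& in, Mem& mem, Regs& regs) {
    assert(in.loc >= 0 && in.loc < 2 && in.reg >= 0 && in.reg < 2);
    if (in.is_store) mem[in.loc] = in.val;
    else             regs[in.reg] = mem[in.loc];
}

// Explore every interleaving that preserves each thread's program order.
static void explore(const Thread& t0, const Thread& t1,
                    std::size_t i, std::size_t j,
                    Mem mem, Regs regs, std::set<Regs>& out) {
    if (i == t0.size() && j == t1.size()) { out.insert(regs); return; }
    if (i < t0.size()) {
        Mem m = mem; Regs r = regs;
        apply(t0[i], m, r);
        explore(t0, t1, i + 1, j, m, r, out);
    }
    if (j < t1.size()) {
        Mem m = mem; Regs r = regs;
        apply(t1[j], m, r);
        explore(t0, t1, i, j + 1, m, r, out);
    }
}

// True if some sequentially consistent execution ends with these registers.
bool sc_allows(const Thread& t0, const Thread& t1, Regs outcome) {
    std::set<Regs> outcomes;
    explore(t0, t1, 0, 0, Mem{0, 0}, Regs{0, 0}, outcomes);
    return outcomes.count(outcome) != 0;
}
```

For example, with MP encoded as `{st(X, 1), st(Y, 1)}` against `{ld(0, Y), ld(1, X)}`, the outcome `{1, 0}` (flag seen, data stale) is rejected by `sc_allows`. An outcome that no interleaving can produce is exactly what a hardware run counts as a weak behavior.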
![Page 118: TESTING AND EXPOSING WEAK GPU MEMORY MODELS](https://reader036.fdocuments.us/reader036/viewer/2022062408/5681375b550346895d9ee99f/html5/thumbnails/118.jpg)
118
Inter-CTA interactions
![Page 119: TESTING AND EXPOSING WEAK GPU MEMORY MODELS](https://reader036.fdocuments.us/reader036/viewer/2022062408/5681375b550346895d9ee99f/html5/thumbnails/119.jpg)
119
Without Critical Incantations, No Weak Behaviors Are Observed
Inter-CTA interactions
![Page 120: TESTING AND EXPOSING WEAK GPU MEMORY MODELS](https://reader036.fdocuments.us/reader036/viewer/2022062408/5681375b550346895d9ee99f/html5/thumbnails/120.jpg)
120
Inter-CTA interactions
![Page 121: TESTING AND EXPOSING WEAK GPU MEMORY MODELS](https://reader036.fdocuments.us/reader036/viewer/2022062408/5681375b550346895d9ee99f/html5/thumbnails/121.jpg)
121
Most Effective Incantations
Inter-CTA interactions
![Page 122: TESTING AND EXPOSING WEAK GPU MEMORY MODELS](https://reader036.fdocuments.us/reader036/viewer/2022062408/5681375b550346895d9ee99f/html5/thumbnails/122.jpg)
122
Testing Framework
• Evaluate intra-CTA incantations using these tests*:
• MP-Global: Message Passing tests targeting the global memory region
• MP-Shared: Message Passing tests targeting the shared memory region
* The previous tests (LD, SB) are not observable intra-CTA
![Page 123: TESTING AND EXPOSING WEAK GPU MEMORY MODELS](https://reader036.fdocuments.us/reader036/viewer/2022062408/5681375b550346895d9ee99f/html5/thumbnails/123.jpg)
123
Intra-CTA interactions
![Page 124: TESTING AND EXPOSING WEAK GPU MEMORY MODELS](https://reader036.fdocuments.us/reader036/viewer/2022062408/5681375b550346895d9ee99f/html5/thumbnails/124.jpg)
124
Intra-CTA interactions
Without Critical Incantations, No Weak Behaviors Are Observed
![Page 125: TESTING AND EXPOSING WEAK GPU MEMORY MODELS](https://reader036.fdocuments.us/reader036/viewer/2022062408/5681375b550346895d9ee99f/html5/thumbnails/125.jpg)
125
Intra-CTA interactions
![Page 126: TESTING AND EXPOSING WEAK GPU MEMORY MODELS](https://reader036.fdocuments.us/reader036/viewer/2022062408/5681375b550346895d9ee99f/html5/thumbnails/126.jpg)
126
Intra-CTA interactions
Most Effective Incantations
![Page 127: TESTING AND EXPOSING WEAK GPU MEMORY MODELS](https://reader036.fdocuments.us/reader036/viewer/2022062408/5681375b550346895d9ee99f/html5/thumbnails/127.jpg)
127
Bulk Testing
• Invalidated GPU memory model from [?]
• Model disallows behaviors observed on hardware
• Model orders inter-CTA load operations too strongly
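The invalidation argument reduces to a set check: a model is unsound with respect to hardware as soon as one observed outcome is not in the set the model allows. The sketch below illustrates only that check; the outcome labels and the function name `violations` are hypothetical, and no real model or hardware data is encoded.

```cpp
#include <cassert>
#include <set>
#include <string>
#include <vector>

// An outcome is identified here by a label such as "r1=1,r2=0" (illustrative).
using Outcome = std::string;

// Returns the observed outcomes that the model forbids.
// An empty result means the observations do not contradict the model.
std::vector<Outcome> violations(const std::set<Outcome>& allowed_by_model,
                                const std::set<Outcome>& observed_on_hardware) {
    // The comparison is only meaningful once at least one run was recorded.
    assert(!observed_on_hardware.empty());
    std::vector<Outcome> bad;
    for (const Outcome& o : observed_on_hardware) {
        if (allowed_by_model.count(o) == 0) bad.push_back(o);
    }
    return bad;
}
```

A single entry in the returned list is enough to invalidate a model, whereas an empty list only means that no counterexample has been found so far.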
![Page 129: TESTING AND EXPOSING WEAK GPU MEMORY MODELS](https://reader036.fdocuments.us/reader036/viewer/2022062408/5681375b550346895d9ee99f/html5/thumbnails/129.jpg)
129
GPU Hardware
• Multiple SMs (Streaming Multiprocessors)
• SMs contain CUDA Cores
• Each SM has an L1 cache
• All SMs share an L2 cache and DRAM
• Warp scheduler executes threads in groups of 32
![Page 130: TESTING AND EXPOSING WEAK GPU MEMORY MODELS](https://reader036.fdocuments.us/reader036/viewer/2022062408/5681375b550346895d9ee99f/html5/thumbnails/130.jpg)
130
GPU Hardware
![Page 131: TESTING AND EXPOSING WEAK GPU MEMORY MODELS](https://reader036.fdocuments.us/reader036/viewer/2022062408/5681375b550346895d9ee99f/html5/thumbnails/131.jpg)
131
GPU Programming to Hardware
• Threads in the same CTA are mapped to the same SM
• Shared memory is in L1 (Maxwell is an exception)
• Global memory is in DRAM and cached in L2 (Fermi is an exception)
• Warp scheduler executes threads in groups of 32
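The mapping above determines whether two test threads interact within a warp, within a CTA, or across CTAs. Here is a minimal host-side C++ sketch of that classification, assuming one-dimensional CTAs whose threads are numbered linearly from 0. The type and function names are hypothetical; only the warp size of 32 comes from the slides.

```cpp
#include <cassert>

constexpr int kWarpSize = 32;  // threads are scheduled in groups of 32

struct ThreadId {
    int cta;  // index of the CTA (thread block)
    int tid;  // linear thread index inside that CTA
};

// Warp index within the CTA and lane index within the warp.
inline int warp_of(ThreadId t) { assert(t.tid >= 0); return t.tid / kWarpSize; }
inline int lane_of(ThreadId t) { assert(t.tid >= 0); return t.tid % kWarpSize; }

enum class Scope { SameWarp, SameCtaDifferentWarp, InterCta };

// Classify the relationship between two threads of a test.
inline Scope scope_of(ThreadId a, ThreadId b) {
    if (a.cta != b.cta) return Scope::InterCta;
    if (warp_of(a) != warp_of(b)) return Scope::SameCtaDifferentWarp;
    return Scope::SameWarp;
}
```

Under these assumptions, thread 70 of a CTA sits in warp 2, lane 6. Threads 0 and 32 of the same CTA share an SM but are in different warps, and threads in different CTAs are classified as inter-CTA.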
![Page 132: TESTING AND EXPOSING WEAK GPU MEMORY MODELS](https://reader036.fdocuments.us/reader036/viewer/2022062408/5681375b550346895d9ee99f/html5/thumbnails/132.jpg)
132
Testing Framework
![Page 133: TESTING AND EXPOSING WEAK GPU MEMORY MODELS](https://reader036.fdocuments.us/reader036/viewer/2022062408/5681375b550346895d9ee99f/html5/thumbnails/133.jpg)
133
Testing Framework
• Initial value of shared memory locations
![Page 134: TESTING AND EXPOSING WEAK GPU MEMORY MODELS](https://reader036.fdocuments.us/reader036/viewer/2022062408/5681375b550346895d9ee99f/html5/thumbnails/134.jpg)
134
Testing Framework
• Thread IDs
![Page 135: TESTING AND EXPOSING WEAK GPU MEMORY MODELS](https://reader036.fdocuments.us/reader036/viewer/2022062408/5681375b550346895d9ee99f/html5/thumbnails/135.jpg)
135
Testing Framework
• Programs (written in NVIDIA PTX)
![Page 136: TESTING AND EXPOSING WEAK GPU MEMORY MODELS](https://reader036.fdocuments.us/reader036/viewer/2022062408/5681375b550346895d9ee99f/html5/thumbnails/136.jpg)
136
Testing Framework
• Assertion about final state of system
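The four parts of a test shown on these slides (initial values, thread IDs, programs, final-state assertion) can be pictured as one record. The C++ sketch below is an illustration of that layout only and is not the framework's actual input format. The struct, the field names, `make_mp`, and the PTX-style instruction strings are all assumptions.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <string>
#include <vector>

using State = std::map<std::string, int>;  // name -> value

struct Placement {
    int cta;  // CTA index of the testing thread
    int tid;  // thread index inside that CTA
};

struct LitmusTest {
    std::string name;
    State init;                                      // initial memory values
    std::vector<Placement> threads;                  // thread IDs
    std::vector<std::vector<std::string>> programs;  // one program per thread
    std::function<bool(const State&)> weak;          // assertion on final state
};

// Message-passing test with the two threads placed in different CTAs.
inline LitmusTest make_mp() {
    LitmusTest t;
    t.name = "MP";
    t.init = State{{"x", 0}, {"y", 0}};
    t.threads = std::vector<Placement>{Placement{0, 0}, Placement{1, 0}};
    // Illustrative PTX-style text; not checked against the PTX grammar here.
    t.programs.push_back(std::vector<std::string>{
        "st.cg.s32 [x], 1;", "st.cg.s32 [y], 1;"});
    t.programs.push_back(std::vector<std::string>{
        "ld.cg.s32 r1, [y];", "ld.cg.s32 r2, [x];"});
    // Weak outcome: the flag y is seen as set while the data x is still stale.
    t.weak = [](const State& s) { return s.at("r1") == 1 && s.at("r2") == 0; };
    assert(t.threads.size() == t.programs.size());
    return t;
}
```

A harness would run the programs many times from `init` and count how often `weak` holds on the final register values.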
![Page 137: TESTING AND EXPOSING WEAK GPU MEMORY MODELS](https://reader036.fdocuments.us/reader036/viewer/2022062408/5681375b550346895d9ee99f/html5/thumbnails/137.jpg)
137
GPU Terminology
We Use