PARALLEL PROGRAMMING MANY-CORE COMPUTING: HARDWARE (2/5)
Realism of modern GPUs
http://www.youtube.com/watch?v=bJDeipvpjGQ&feature=player_embedded#t=49s
Schedule
1. Introduction, performance metrics & analysis
2. Many-core hardware, low-level optimizations
3. CUDA class 1: basics
4. CUDA class 2: advanced
5. Case study: LOFAR telescope with many-cores
Hierarchical systems
Grid
Cluster
Node
Multiple GPUs per node
Multiple chips per GPU
Streaming multiprocessors
Hardware threads
...
(this course focuses on the node level and below)
Multi-core CPUs
General Purpose Processors
Architecture
Few fat cores
Vectorization: Streaming SIMD Extensions (SSE), Advanced Vector Extensions (AVX)
Homogeneous
Stand-alone
Memory
Shared, multi-layered
Per-core cache and shared cache
Programming
Multi-threading
OS scheduler
Coarse-grained parallelism
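The multi-threading model above can be made concrete with a minimal POSIX-threads sketch (an illustrative example, not from the slides; worker and NR_THREADS are hypothetical names). Each thread is a coarse-grained unit of work that the OS scheduler maps onto a core:

#include <pthread.h>
#include <stdio.h>

#define NR_THREADS 4 // illustrative thread count

// Each thread runs one coarse-grained piece of work; the OS
// scheduler decides which core executes which thread.
static void* worker(void* arg) {
    int id = *(int*) arg;
    printf("thread %d: working on its share\n", id);
    return NULL;
}

int main(void) {
    pthread_t threads[NR_THREADS];
    int ids[NR_THREADS];
    for (int i = 0; i < NR_THREADS; i++) {
        ids[i] = i;
        pthread_create(&threads[i], NULL, worker, &ids[i]);
    }
    for (int i = 0; i < NR_THREADS; i++)
        pthread_join(threads[i], NULL);
    return 0;
}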
Intel
AMD Magny-Cours
Two 6-core processors on a single chip
Up to four of these chips in a single compute node
48 cores in total
Non-uniform memory access
Per-core cache
Per-chip cache
Local memory
Remote memory (HyperTransport)
AWARI on the Magny-Cours

| | DAS-2 | Magny-Cours |
|---|---|---|
| Runtime | 51 hours | 45 hours |
| Machines / cores | 72 machines / 144 cores | 1 machine / 48 cores |
| RAM | 72 GB in total | 128 GB in 1 machine |
| Disk | 1.4 TB in total | 4.5 TB in 1 machine |

Less than 12 hours with a new algorithm (needs more RAM)
Multi-core CPU programming
Threads
Pthreads, Java threads, …
OpenMP
MPI
OpenCL
Vectorization
Streaming SIMD Extensions (SSE)
Advanced Vector Extensions (AVX)
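As a taste of these models, a minimal OpenMP sketch (illustrative, not from the slides; vectorAddOmp is a hypothetical name). A single pragma distributes the loop iterations over the cores; compile with gcc -fopenmp:

// The pragma splits the iterations across the available cores;
// the OpenMP runtime creates and schedules the threads.
void vectorAddOmp(int size, float* a, float* b, float* c) {
    #pragma omp parallel for
    for (int i = 0; i < size; i++) {
        c[i] = a[i] + b[i];
    }
}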
Vectorizing with SSE
Assembly instructions
16 registers
C or C++: intrinsics
Name the instruction, not the registers
Work on variables, not registers
Declare vector variables
Vectorizing with SSE examples
float data[1024];
// init: data[0] = 0.0, data[1] = 1.0, data[2] = 2.0, etc.
init(data);
// Set all elements in my vector to zero.
__m128 myVector0 = _mm_setzero_ps();
// Load the first 4 elts of the array into my vector.
__m128 myVector1 = _mm_load_ps(data);
// Load the second 4 elts of the array into my vector.
__m128 myVector2 = _mm_load_ps(data+4);
// Resulting vector contents:
// myVector0 = [0.0, 0.0, 0.0, 0.0]
// myVector1 = [0.0, 1.0, 2.0, 3.0]
// myVector2 = [4.0, 5.0, 6.0, 7.0]
Vectorizing with SSE examples
// Add vectors 1 and 2; instruction performs 4 FLOPs.
__m128 myVector3 = _mm_add_ps(myVector1, myVector2);
// Multiply vectors 1 and 2; instruction performs 4 FLOPs.
__m128 myVector4 = _mm_mul_ps(myVector1, myVector2);
// _mm_shuffle_ps(vec1, vec2, _MM_SHUFFLE(i3,i2,i1,i0)) selects
// elements i0 and i1 from vec1 and elements i2 and i3 from vec2.
__m128 myVector5 = _mm_shuffle_ps(myVector1, myVector2,
_MM_SHUFFLE(2, 3, 0, 1));
// Resulting vector contents, with myVector1 = [0.0, 1.0, 2.0, 3.0]
// and myVector2 = [4.0, 5.0, 6.0, 7.0]:
// myVector3 = [4.0, 6.0, 8.0, 10.0] (element-wise sum)
// myVector4 = [0.0, 5.0, 12.0, 21.0] (element-wise product)
// myVector5 = [1.0, 0.0, 7.0, 6.0] (elements 1, 0 of myVector1,
//                                   then elements 3, 2 of myVector2)
Vector add
void vectorAdd(int size, float* a, float* b, float* c) {
for(int i=0; i<size; i++) {
c[i] = a[i] + b[i];
}
}
Vector add with SSE: unroll loop
void vectorAdd(int size, float* a, float* b, float* c) {
for(int i=0; i<size; i += 4) {
c[i+0] = a[i+0] + b[i+0];
c[i+1] = a[i+1] + b[i+1];
c[i+2] = a[i+2] + b[i+2];
c[i+3] = a[i+3] + b[i+3];
}
}
Vector add with SSE: vectorize loop
void vectorAdd(int size, float* a, float* b, float* c) {
for(int i=0; i<size; i += 4) {
__m128 vecA = _mm_load_ps(a + i); // load 4 elts from a
__m128 vecB = _mm_load_ps(b + i); // load 4 elts from b
__m128 vecC = _mm_add_ps(vecA, vecB); // add four elts
_mm_store_ps(c + i, vecC); // store four elts
}
}
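Two caveats on this version: _mm_load_ps and _mm_store_ps require 16-byte-aligned addresses, and the loop silently skips the last size % 4 elements when size is not a multiple of four. A sketch of a variant that handles both, using the unaligned intrinsics and a scalar tail loop (vectorAddSafe is a hypothetical name):

#include <xmmintrin.h>

void vectorAddSafe(int size, float* a, float* b, float* c) {
    int i = 0;
    // Vectorized main loop: 4 floats per iteration; the unaligned
    // loads/stores drop the 16-byte alignment requirement.
    for (; i + 4 <= size; i += 4) {
        __m128 vecA = _mm_loadu_ps(a + i);
        __m128 vecB = _mm_loadu_ps(b + i);
        _mm_storeu_ps(c + i, _mm_add_ps(vecA, vecB));
    }
    // Scalar tail: the remaining size % 4 elements.
    for (; i < size; i++) {
        c[i] = a[i] + b[i];
    }
}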
The Cell Broadband Engine
Cell/B.E.
Architecture
Heterogeneous
1 PowerPC (PPE)
8 vector-processors (SPEs)
Programming
User-controlled scheduling
6 levels of parallelism, all under user control
Fine- and coarse-grain parallelism
Cell/B.E. memory
“Normal” main memory
PPE: normal read / write
SPEs: asynchronous manual transfers (DMA)
Per-core fast memory: the Local Store (LS)
Application-managed cache
256 KB
128 × 128-bit vector registers
Roadrunner (IBM)
Los Alamos National Laboratory
#1 of the top500, June 2008 – November 2009
Now #10
122,400 cores, 1.4 petaflops
First petaflops system
PowerXCell 8i 3.2 GHz / Opteron DC 1.8 GHz
The Cell’s vector instructions
Differences from SSE:
SPEs execute only vector instructions
More advanced shuffling
Not 16, but 128 registers!
Fused Multiply-Add support
FMA instruction
Multiply-Add (MAD): D = A × B + C; the product A × B is rounded (digits truncated) before the addition.
Fused Multiply-Add (FMA): D = A × B + C; all digits of the product are retained, the addition uses the full-precision product, and the result is rounded only once (no loss of precision in the intermediate product).
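To make the rounding difference concrete, a small sketch using C99's fmaf (an illustration; the SPEs expose FMA through their own intrinsics). The inputs are chosen so that the exact product 1 - 2^-26 is not representable in single precision; compile with -ffp-contract=off so the compiler does not itself fuse the MAD line:

#include <math.h>
#include <stdio.h>

int main(void) {
    float a = 1.0f + 0x1.0p-13f; // 1 + 2^-13
    float b = 1.0f - 0x1.0p-13f; // 1 - 2^-13
    float c = -1.0f;
    // a*b = 1 - 2^-26 exactly, which rounds to 1.0f in single precision.
    float mad = a * b + c;     // product rounded first: 1.0f - 1.0f = 0
    float fma = fmaf(a, b, c); // one rounding at the end: -2^-26
    printf("MAD: %g FMA: %g\n", mad, fma); // MAD: 0 FMA: -1.49012e-08
    return 0;
}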
Cell Programming models
IBM Cell SDK
C + MPI
OpenCL
Many models from academia...
Cell SDK
Threads, but only on the PPE
Distributed memory
Local stores = application-managed cache!
DMA transfers
Signaling and mailboxes
Vectorization
Direct Memory Access (DMA)
Start an asynchronous DMA: mfc_get(local store address, main memory address, #bytes, tag)
Wait for the DMA to finish: mfc_write_tag_mask(1 << tag); mfc_read_tag_status_all();
DMA lists
Overlap communication with useful work
Double buffering
Vector sum
float vectorSum(int size, float* vector) {
float result = 0.0;
for(int i=0; i<size; i++) {
result += vector[i];
}
return result;
}
Parallelization strategy
Partition problem into 8 pieces
(Assuming a chunk fits in the Local Store)
PPE starts 8 SPE threads
Each SPE processes 1 piece
Has to load data from PPE with DMA
PPE adds the 8 sub-results
Vector sum SPE code (1)
float vectorSum(int size, float* PPEVector) {
float result = 0.0;
int chunkSize = size / NR_SPES; // Partition the data.
float localBuffer[chunkSize]; // Allocate a buffer in
// my private local store.
int tag = 42;
// Points to my chunk in PPE memory.
float* myRemoteChunk = PPEVector + chunkSize * MY_SPE_NUMBER;
Vector sum SPE code (2)
// Copy the input data from the PPE (the DMA size is in bytes).
mfc_get(localBuffer, myRemoteChunk, chunkSize * sizeof(float), tag);
mfc_write_tag_mask(1 << tag);
mfc_read_tag_status_all();
// The real work.
for(int i=0; i<chunkSize; i++) {
result += localBuffer[i];
}
return result;
}
Can we optimize this strategy?
Vectorization (see the sketch after this list)
Overlap communication and computation
Double buffering
Strategy:
Split into more chunks than SPEs
Let each SPE download the next chunk while processing the current chunk
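For the vectorization part, a sketch of the SPE inner loop with the SPU intrinsics (sumChunk is a hypothetical helper; it assumes chunkSize is a multiple of 4 and that localBuffer is 16-byte aligned):

#include <spu_intrinsics.h>

// Sum a chunk 4 floats at a time, then reduce the 4 vector lanes.
float sumChunk(float* localBuffer, int chunkSize) {
    vector float acc = spu_splats(0.0f);
    vector float* v = (vector float*) localBuffer;
    for (int i = 0; i < chunkSize / 4; i++) {
        acc = spu_add(acc, v[i]); // 4 additions per instruction
    }
    // Horizontal reduction of the 4 partial sums.
    return spu_extract(acc, 0) + spu_extract(acc, 1)
         + spu_extract(acc, 2) + spu_extract(acc, 3);
}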
DMA double buffering example (1)
float vectorSum(float* PPEVector, int size, int nrChunks) {
float result = 0.0;
int chunkSize = size / nrChunks;
int chunksPerSPE = nrChunks / NR_SPES;
int firstChunk = MY_SPE_NUMBER * chunksPerSPE;
int lastChunk = firstChunk + chunksPerSPE;
// Allocate two buffers in my private local store.
float localBuffer[2][chunkSize];
int currentBuffer = 0;
// Start asynchronous DMA of first chunk.
float* myRemoteChunk = PPEVector + firstChunk * chunkSize;
mfc_get(localBuffer[currentBuffer], myRemoteChunk,
chunkSize * sizeof(float), currentBuffer);
DMA double buffering example (2)
for (int chunk = firstChunk; chunk < lastChunk; chunk++) {
// Prefetch next chunk asynchronously.
if(chunk != lastChunk - 1) {
float* nextRemoteChunk = PPEVector + (chunk+1) * chunkSize;
mfc_get(localBuffer[!currentBuffer], nextRemoteChunk,
chunkSize * sizeof(float), !currentBuffer);
}
// Wait for the current buffer's DMA to finish.
mfc_write_tag_mask(1 << currentBuffer); mfc_read_tag_status_all();
// The real work.
for(int i=0; i<chunkSize; i++)
result += localBuffer[currentBuffer][i];
currentBuffer = !currentBuffer;
}
return result;
}
Double and triple buffering
Read-only data
Double buffering
Read-write data
Triple buffering!
Work buffer
Prefetch buffer, asynchronous download
Finished buffer, asynchronous upload
General technique
On-chip networks
GPUs (PCI-e)
MPI (cluster)
…
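A platform-neutral sketch of the triple-buffering rotation (startDownload, startUpload, waitTransfer and compute are hypothetical placeholders for the platform's asynchronous transfer calls, e.g. mfc_get/mfc_put plus tag waits on the Cell):

#include <stdio.h>

// Placeholders: a real implementation would issue asynchronous
// transfers here and wait on their tags in waitTransfer.
static void startDownload(int buf, int chunk) { printf("download chunk %d into buffer %d\n", chunk, buf); }
static void startUpload(int buf, int chunk) { printf("upload buffer %d as chunk %d\n", buf, chunk); }
static void waitTransfer(int buf) { /* wait for buf's pending transfer, if any */ }
static void compute(int buf) { printf("work on buffer %d\n", buf); }

void processChunks(int nrChunks) {
    int cur = 0, next = 1, prev = 2; // rotating buffer roles
    startDownload(cur, 0); // fetch the first chunk
    for (int chunk = 0; chunk < nrChunks; chunk++) {
        if (chunk + 1 < nrChunks)
            startDownload(next, chunk + 1); // prefetch buffer
        waitTransfer(cur);       // current chunk finished downloading
        waitTransfer(prev);      // previous upload done: prev is reusable
        compute(cur);            // work buffer: the real work, in place
        startUpload(cur, chunk); // finished buffer: write results back
        int tmp = prev; prev = cur; cur = next; next = tmp;
    }
    waitTransfer(prev); // wait for the last upload
}

Each buffer has at most one transfer in flight at a time, so one tag per buffer suffices, exactly as in the double-buffering code above.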
Intel’s many-core platforms
Intel Single-chip Cloud Computer
Architecture
Tile-based many-core (48 cores)
A tile is a dual-core
Stand-alone
Memory
Per-core and per-tile
Shared off-chip
Programming
Multi-processing with message passing
User-controlled mapping/scheduling
Gain performance through:
Coarse-grained parallelism
Multi-application workloads (cluster-like)
Intel SCC Tile
2 cores
16 KB L1 cache per core
256 KB L2 cache per core
8 KB message passing buffer
On-chip network router
Intel's Larrabee
GPU based on x86 architecture
Hardware multithreading
Wide SIMD
Achieved 1 TFLOPS sustained application performance (SC09)
Canceled in Dec 2009, re-targeted to the HPC market
Intel's Many Integrated Core (MIC)
May 2010: Larrabee + 80-core research chip + SCC → MIC
x86 vector cores
Knights Ferry: 32 cores, 128 threads, 1.2 GHz, 8 MB shared cache
Knights Corner: 22 nm, 50+ cores
GPU hardware introduction
CPU vs GPU
Movie: The Mythbusters (Jamie Hyneman & Adam Savage, Discovery Channel), appearance at NVIDIA’s NVISION 2008