
Page 1: GMAC Global Memory for Accelerators

GMAC: Global Memory for Accelerators

Isaac Gelado, John E. Stone, Javier Cabezas, Nacho Navarro and Wen-mei W. Hwu

GTC 2010

Page 2: GMAC Global Memory for Accelerators

GMAC in a nutshell

• GMAC: Unified Virtual Address Space for CUDA
  – Simplifies the CPU code
  – Exploits advanced CUDA features for free

• Vector addition example
  – Really simple kernel code
  – But, what about the CPU code?

__global__ void vector(float *c, float *a, float *b, size_t size)
{
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    if(idx < size) c[idx] = a[idx] + b[idx];
}


Page 3: GMAC Global Memory for Accelerators

CPU CUDA code (I)

• Read from disk, transfer to GPU and compute

int main(int argc, char *argv[])
{
    float *h_a, *h_b, *h_c, *d_a, *d_b, *d_c;
    size_t size = LENGTH * sizeof(float);

    assert((h_a = (float *)malloc(size)) != NULL);
    assert((h_b = (float *)malloc(size)) != NULL);
    assert((h_c = (float *)malloc(size)) != NULL);

    assert(cudaMalloc((void **)&d_a, size) == cudaSuccess);
    assert(cudaMalloc((void **)&d_b, size) == cudaSuccess);
    assert(cudaMalloc((void **)&d_c, size) == cudaSuccess);

    read_file(argv[A], h_a);
    read_file(argv[B], h_b);

    assert(cudaMemcpy(d_a, h_a, size, cudaMemcpyHostToDevice) == cudaSuccess);
    assert(cudaMemcpy(d_b, h_b, size, cudaMemcpyHostToDevice) == cudaSuccess);


Page 4: GMAC Global Memory for Accelerators

CPU CUDA code (II)

• Read from disk, transfer to GPU and compute

    dim3 Db(BLOCK_SIZE);
    dim3 Dg(LENGTH / BLOCK_SIZE);
    if(LENGTH % BLOCK_SIZE) Dg.x++;
    vector<<<Dg, Db>>>(d_c, d_a, d_b, LENGTH);
    assert(cudaThreadSynchronize() == cudaSuccess);

    assert(cudaMemcpy(h_c, d_c, LENGTH * sizeof(float), cudaMemcpyDeviceToHost) == cudaSuccess);

    save_file(argv[C], h_c);

    free(h_a); cudaFree(d_a);
    free(h_b); cudaFree(d_b);
    free(h_c); cudaFree(d_c);

    return 0;
}


Page 5: GMAC Global Memory for Accelerators

CPU GMAC code


int main(int argc, char *argv[])
{
    float *a, *b, *c;
    size_t size = LENGTH * sizeof(float);

    assert(gmacMalloc((void **)&a, size) == gmacSuccess);
    assert(gmacMalloc((void **)&b, size) == gmacSuccess);
    assert(gmacMalloc((void **)&c, size) == gmacSuccess);

    read_file(argv[A], a);
    read_file(argv[B], b);

    dim3 Db(BLOCK_SIZE);
    dim3 Dg(LENGTH / BLOCK_SIZE);
    if(LENGTH % BLOCK_SIZE) Dg.x++;
    vector<<<Dg, Db>>>(c, a, b, LENGTH);
    assert(gmacThreadSynchronize() == gmacSuccess);

    save_file(argv[C], c);

    gmacFree(a); gmacFree(b); gmacFree(c);

    return 0;
}

There is no memory copy

Page 6: GMAC Global Memory for Accelerators

Getting GMAC

• GMAC is at http://adsm.googlecode.com/
• Debian / Ubuntu binary and development .deb files
• UNIX (also MacOS X) source code package
• Experimental versions from the mercurial repository


Page 7: GMAC Global Memory for Accelerators

Outline

• Introduction
• GMAC Memory Model
  – Asymmetric Memory
  – Global Memory
• GMAC Execution Model
  – Multi-threading
  – Inter-thread communication
• Conclusions


Page 8: GMAC Global Memory for Accelerators

GMAC Memory Model

• Unified CPU / GPU virtual address space
• Asymmetric address space accessibility (see the sketch below)

[Figure: CPU and GPU sharing one memory; shared data is visible to both, CPU data only to the CPU]
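The asymmetry can be seen in code. A minimal sketch, assuming the gmacMalloc/gmacFree/gmacThreadSynchronize API introduced on the following slides: the shared allocation is usable by both the CPU and the GPU through the same pointer, while an ordinary malloc() buffer is CPU data that the GPU cannot use.

#include <gmac.h>
#include <stdlib.h>

__global__ void touch(float *shared, size_t size)
{
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    if(idx < size) shared[idx] += 1.0f;   /* the GPU reads and writes shared data */
}

int main(void)
{
    float *cpu_only = (float *)malloc(256 * sizeof(float)); /* CPU data: not visible to the GPU */
    float *shared = NULL;
    if(cpu_only == NULL) return -1;
    if(gmacMalloc((void **)&shared, 256 * sizeof(float)) != gmacSuccess) return -1;

    for(int i = 0; i < 256; i++) shared[i] = cpu_only[i] = (float)i; /* the CPU sees both */

    touch<<<1, 256>>>(shared, 256);       /* the same shared pointer works on the GPU */
    gmacThreadSynchronize();

    gmacFree(shared);
    free(cpu_only);
    return 0;
}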


Page 9: GMAC Global Memory for Accelerators

GMAC Consistency Model

• Implicit acquire / release primitives at accelerator call / return boundaries (see the sketch below)


[Figure: CPU and accelerator timelines with the implicit acquire / release at the accelerator call and return boundaries]
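A minimal sketch of what the implicit acquire / release means for application code, assuming the gmacMalloc and gmacThreadSynchronize calls shown on the following slides; the comments mark the consistency points.

#include <gmac.h>

__global__ void inc(int *v, size_t size)
{
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    if(idx < size) v[idx]++;
}

int main(void)
{
    int *v = NULL;
    if(gmacMalloc((void **)&v, 256 * sizeof(int)) != gmacSuccess) return -1;

    for(int i = 0; i < 256; i++) v[i] = i;  /* CPU updates shared data              */

    inc<<<1, 256>>>(v, 256);                /* call boundary: the CPU releases, the */
                                            /* accelerator acquires the shared data */
    gmacThreadSynchronize();                /* return boundary: the accelerator     */
                                            /* releases, the CPU acquires           */

    int ok = (v[0] == 1);                   /* the CPU sees the kernel's results,   */
                                            /* with no explicit copies              */
    gmacFree(v);
    return ok ? 0 : -1;
}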

Page 10: GMAC Global Memory for Accelerators

GMAC Memory API

• Allocate shared memory
  gmacError_t gmacMalloc(void **ptr, size_t size)
  – Allocated memory address (returned by reference)
  – Size of the data to be allocated
  – Returns an error code, gmacSuccess if no error

• Example usage

#include <gmac.h>

int main(int argc, char *argv[])
{
    float *foo = NULL;
    gmacError_t error;
    if((error = gmacMalloc((void **)&foo, FOO_SIZE)) != gmacSuccess)
        FATAL("Error allocating memory %s", gmacErrorString(error));
    . . .
}


Page 11: GMAC Global Memory for Accelerators

GMAC Memory API

• Release shared memory
  gmacError_t gmacFree(void *ptr)
  – Memory address to be released
  – Returns an error code, gmacSuccess if no error

• Example usage

#include <gmac.h>

int main(int argc, char *argv[])
{
    float *foo = NULL;
    gmacError_t error;
    if((error = gmacMalloc((void **)&foo, FOO_SIZE)) != gmacSuccess)
        FATAL("Error allocating memory %s", gmacErrorString(error));
    . . .
    gmacFree(foo);
}


Page 12: GMAC Global Memory for Accelerators

GMAC Unified Address Space

• Use fixed-size segments to map accelerator memory
• Implement and export Accelerator Virtual Memory

[Figure: system memory and accelerator memory mapped at the same virtual address (0x00100000) on the CPU and the accelerator]


Page 13: GMAC Global Memory for Accelerators

GMAC Memory API

• Translate shared memory addresses (multi-GPU)
  void *gmacPtr(void *ptr)
  template<typename T> T *gmacPtr(T *ptr)
  – Receives a CPU memory address
  – Returns the corresponding GPU memory address

• Example usage

#include <gmac.h>

int main(int argc, char *argv[])
{
    . . .
    kernel<<<Dg, Db>>>(gmacPtr(buffer), size);
    . . .
}


Page 14: GMAC Global Memory for Accelerators

GMAC Example Code (I)

int fdtd(FILE *fpMat, FILE *fpMed, int N)
{
    /* Read and create data structures */
    MaterialList materials;
    if(readMaterials(fpMat, materials) == 0) return -1;
    Media media;
    if(readMedia(fpMed, media) == 0) return -1;
    Field field;
    if(createField(media.dim, field) == 0) return -1;

    for(int n = 0; n < N; n++) {
        . . .
        updateElectric<<<Dg, Db>>>(materials, media, field);
        . . .
        n++;
        updateMagnetic<<<Dg, Db>>>(materials, media, field);
        . . .
    }
}


Page 15: GMAC Global Memory for Accelerators

GMAC Example Code (II)

typedef struct { float Ke[3][3], km[3][3]; } Material;
typedef struct { size_t n; Material *data; } MaterialList;

/* Read materials from disk */
size_t readMaterials(FILE *fp, MaterialList &list)
{
    uint16_t n = 0;
    fread(&n, sizeof(n), 1, fp);
    gmacError_t ret = gmacMalloc((void **)&list.data, n * sizeof(Material));
    if(ret != gmacSuccess) return 0;
    fread(list.data, sizeof(Material), n, fp);
    return n;
}

/* Read media description from file */
typedef struct { dim3 dim; uint16_t *data; } Media;
void readMedia(FILE *fp, Media &media);

/* Allocate an electromagnetic field */
typedef struct { dim3 dim; float3 *e; float3 *h; float3 *p; float3 *m; } Field;
void allocateField(Field &f, dim3 dim);


Page 16: GMAC Global Memory for Accelerators

GMAC I/O Handling

• Functions overridden (interposition) by GMAC:
  – Memory: memset(), memcpy()
  – I/O: fread(), fwrite(), read(), write()
  – MPI: MPI_Send(), MPI_Recv()

• Get advanced CUDA features for free (see the sketch below)
  – Asynchronous data transfers
  – Pinned memory


[Callouts: asynchronous copies to device memory; pinned memory for I/O transfers]
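Because these calls are interposed, a shared buffer can be handed straight to the standard I/O routines and GMAC does the staging with pinned memory and asynchronous copies. A minimal sketch, assuming the GMAC API from the earlier slides; the file names are made up for illustration.

#include <gmac.h>
#include <stdio.h>
#include <assert.h>

int main(void)
{
    float *data = NULL;
    size_t n = 1 << 20;
    assert(gmacMalloc((void **)&data, n * sizeof(float)) == gmacSuccess);

    /* fread() is interposed: the file contents land in shared memory and are
       staged to the accelerator using pinned buffers and asynchronous copies */
    FILE *in = fopen("input.bin", "rb");    /* hypothetical input file */
    assert(in != NULL);
    fread(data, sizeof(float), n, in);
    fclose(in);

    /* ... launch kernels on data and synchronize ... */

    /* fwrite() is interposed too: results are brought back as needed */
    FILE *out = fopen("output.bin", "wb");  /* hypothetical output file */
    assert(out != NULL);
    fwrite(data, sizeof(float), n, out);
    fclose(out);

    gmacFree(data);
    return 0;
}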

Page 17: GMAC Global Memory for Accelerators

GMAC Example Code (III)

__global__ void updateElectric(MaterialList mats, Media media, Field f)
{
    int Idx = threadIdx.x + blockDim.x * blockIdx.x;
    int Idy = threadIdx.y + blockDim.y * blockIdx.y;

    for(int Idz = 0; Idz < f.dim.z; Idz++) {
        int pos = Idx + Idy * f.dim.x + Idz * f.dim.x * f.dim.y;
        float3 E = f.e[pos];
        Material m = mats.data[media.data[pos]];
        float3 P;
        P.x = E.x * m.Ke[0][0] + E.y * m.Ke[0][1] + E.z * m.Ke[0][2];
        P.y = E.x * m.Ke[1][0] + E.y * m.Ke[1][1] + E.z * m.Ke[1][2];
        P.z = E.x * m.Ke[2][0] + E.y * m.Ke[2][1] + E.z * m.Ke[2][2];
        f.p[pos] = P;
    }
}


Page 18: GMAC Global Memory for Accelerators

Outline

• Introduction
• GMAC Memory Model
  – Asymmetric Memory
  – Global Memory
• GMAC Execution Model
  – Multi-threading
  – Inter-thread communication
• Conclusions


Page 19: GMAC Global Memory for Accelerators

GMAC Global Memory

• For multi-GPU systems
• Data accessible by all accelerators, but owned by the CPU

[Figure: global shared data in CPU memory accessible by two GPUs]


Page 20: GMAC Global Memory for Accelerators

GMAC Global Memory API

• Allocate global shared memory
  gmacError_t gmacGlobalMalloc(void **ptr, size_t size)
  – Allocated memory address (returned by reference)
  – Size of the data to be allocated
  – Returns an error code, gmacSuccess if no error

• Example usage

#include <gmac.h>

int main(int argc, char *argv[])
{
    float *foo = NULL;
    gmacError_t error;
    if((error = gmacGlobalMalloc((void **)&foo, FOO_SIZE)) != gmacSuccess)
        FATAL("Error allocating memory %s", gmacErrorString(error));
    . . .
}

Page 21: GMAC Global Memory for Accelerators

GMAC Example Code (I)

typedef struct { float Ke[3][3], km[3][3]; } Material;
typedef struct { size_t n; Material *data; } MaterialList;

/* Read materials from disk */
size_t readMaterials(FILE *fp, MaterialList &list)
{
    uint16_t n = 0;
    fread(&n, sizeof(n), 1, fp);
    gmacError_t ret = gmacGlobalMalloc((void **)&list.data, n * sizeof(Material));
    if(ret != gmacSuccess) return 0;
    fread(list.data, sizeof(Material), n, fp);
    return n;
}

/* Read media description from file */
typedef struct { dim3 dim; uint16_t *data; } Media;
void readMedia(FILE *fp, Media &media);

/* Allocate an electromagnetic field */
typedef struct { dim3 dim; float3 *e; float3 *h; float3 *p; float3 *m; } Field;
void allocateField(Field &f, dim3 dim);


Page 22: GMAC Global Memory for Accelerators

Outline

• Introduction
• GMAC Memory Model
  – Asymmetric Memory
  – Global Memory
• GMAC Execution Model
  – Multi-threading
  – Inter-thread communication
• Conclusions


Page 23: GMAC Global Memory for Accelerators

GMAC and Multi-threading

• In the past, one host thread had one CPU
• In GMAC, each host thread has:
  – One CPU
  – One GPU

• A GMAC thread runs on the GPU or on the CPU, but not on both at the same time

• Create threads using what you already know (see the sketch below)
  – pthread_create(...)
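A minimal sketch of the threading model, assuming the GMAC calls from the earlier slides: each POSIX thread gets its own CPU and its own (virtual) GPU, so it can allocate shared memory, launch kernels and synchronize independently. The scale kernel and the sizes are made up for illustration.

#include <gmac.h>
#include <pthread.h>

__global__ void scale(float *v, size_t size)
{
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    if(idx < size) v[idx] *= 2.0f;
}

/* Each thread works on its own GPU: allocate, launch, synchronize, release */
void *worker(void *arg)
{
    (void)arg;
    float *data = NULL;
    if(gmacMalloc((void **)&data, 1024 * sizeof(float)) != gmacSuccess) return NULL;
    for(int i = 0; i < 1024; i++) data[i] = (float)i;
    scale<<<4, 256>>>(data, 1024);
    gmacThreadSynchronize();
    gmacFree(data);
    return NULL;
}

int main(void)
{
    pthread_t t[2];
    for(int i = 0; i < 2; i++) pthread_create(&t[i], NULL, worker, NULL);
    for(int i = 0; i < 2; i++) pthread_join(t[i], NULL);
    return 0;
}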


Page 24: GMAC Global Memory for Accelerators

GMAC and Multi-threading

• Virtual memory accessibility:
  – Complete address space in CPU mode
  – Partial address space in GPU mode

[Figure: two CPU threads and two GPUs over one memory; each GPU maps only part of the address space]


Page 25: GMAC Global Memory for Accelerators

Getting Full-duplex PCIe

• Use multi-threading to fully utilize the PCIe bus
  – One CPU thread launches kernels
  – One CPU thread writes to shared memory
  – One CPU thread reads from shared memory


[Figure: system memory and GPU memory connected by PCIe, with transfers flowing in both directions]

Page 26: GMAC Global Memory for Accelerators

Outline

• Introduction
• GMAC Memory Model
  – Asymmetric Memory
  – Global Memory
• GMAC Execution Model
  – Multi-threading
  – Inter-thread communication
• Conclusions


Page 27: GMAC Global Memory for Accelerators

GPU Handoff and Copying

• GPU handoff:
  – Send the thread's virtual GPU to another thread
  – Do not move data, move computation

• API calls (see the sketch below)
  – Virtual GPU sending: gmacError_t gmacSend(thread_id dest)
  – Virtual GPU receiving: gmacError_t gmacReceive()
  – Virtual GPU copying: gmacError_t gmacCopy(thread_id dest)
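A hedged sketch of a two-stage handoff built from the calls above. It assumes gmacReceive() blocks until a virtual GPU arrives, and it assumes (not stated in the slides) that a thread_id for gmacSend() can be derived from the pthread_t returned by pthread_create(); the stage kernels are made up for illustration.

#include <gmac.h>
#include <pthread.h>

__global__ void stage1(float *v, size_t size)
{
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    if(idx < size) v[idx] = (float)idx;
}

__global__ void stage2(float *v, size_t size)
{
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    if(idx < size) v[idx] *= 2.0f;
}

static float *data;                      /* shared buffer, allocated with gmacMalloc */

void *consumer(void *arg)
{
    (void)arg;
    gmacReceive();                       /* wait for the producer's virtual GPU      */
    stage2<<<4, 256>>>(data, 1024);      /* same data, no copies: computation moved  */
    gmacThreadSynchronize();
    gmacFree(data);
    return NULL;
}

int main(void)
{
    if(gmacMalloc((void **)&data, 1024 * sizeof(float)) != gmacSuccess) return -1;

    pthread_t tid;
    pthread_create(&tid, NULL, consumer, NULL);

    stage1<<<4, 256>>>(data, 1024);
    gmacThreadSynchronize();
    gmacSend((thread_id)tid);            /* assumption: thread_id built from pthread_t */

    pthread_join(tid, NULL);
    return 0;
}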


Page 28: GMAC Global Memory for Accelerators

Virtual GPUs Use Case

• Exploit data locality in the CPU and GPU
• Example: MPEG-4 encoder
  – Each GMAC thread executes one stage
  – Then, moves to the GPU where the input data is


[Figure: MPEG-4 encoder stages (Motion Estimation, DCT and Quantization, Dequantization and IDCT, Motion Compensation), each mapped to a GPU]

Page 29: GMAC Global Memory for Accelerators

Outline

• Introduction
• GMAC Memory Model
  – Asymmetric Memory
  – Global Memory
• GMAC Execution Model
  – Multi-threading
  – Inter-thread communication
• Conclusions


Page 30: GMAC Global Memory for Accelerators

GMAC Performance


Page 31: GMAC Global Memory for Accelerators

GMAC on Actual Applications (I)

• Reverse Time Migration (BSC / Repsol)
  – Six months, one programmer
  – Currently in use by Repsol

• Single-GPU using the CUDA run-time
  – Can live with it: double allocations, memory consistency
  – Nightmare: overlapping GPU computation and data transfers
    (CUDA streams and double-buffering with pinned memory)

• Multi-GPU using the CUDA run-time (cancelled)
  – Can live with it: lack of an IDE for Linux
  – Nightmare: everything else



Page 32: GMAC Global Memory for Accelerators

GMAC on Actual Applications (II)

• Multi-GPU using GMAC:
  – Double-buffering and pinned memory for free
    • Disk transfers
    • GPU to GPU (inter-domain) communication
    • MPI communication
  – Clean threading model
    • One task per CPU thread
    • Well-known synchronization primitives

• It took less time than the single-GPU version


Page 33: GMAC Global Memory for Accelerators

Conclusions

• Single virtual address space for CPUs and GPUs
• Use advanced CUDA features
  – Automatic overlap of data communication and computation
  – Access to any GPU from any CPU thread

• Get more performance from your application more easily

• Go: http://adsm.googlecode.com


Page 34: GMAC Global Memory for Accelerators

Future Features

• OpenCL and Windows 7 support coming soon
• Data-dependence tracking:
  – Avoid transferring data to the GPU when not used by kernels
  – Avoid transferring data to the CPU when not modified by kernels
• Global shared memory partitioning between multiple GPUs


Page 35: GMAC Global Memory for Accelerators

GMAC: Global Memory for Accelerators

http://adsm.googlecode.com

Page 36: GMAC Global Memory for Accelerators

Backup Slides

Page 37: GMAC Global Memory for Accelerators

GMAC Advanced Free Features

• Get advanced CUDA features for free
  – Asynchronous data transfers
  – Pinned memory


[Callouts: asynchronous copies to device memory; pinned memory for I/O transfers]

Page 38: GMAC Global Memory for Accelerators

GMAC Unified Address Space

• When allocating memory (see the sketch below):
  1. Allocate accelerator memory
  2. Allocate CPU memory at the same virtual address

[Figure: system memory and accelerator memory mapped at the same virtual address on the CPU and the accelerator]
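A conceptual sketch of these two steps, not GMAC's actual implementation: allocate accelerator memory first, then ask the OS for host memory at the same virtual address so a single pointer is valid on both sides. The mmap() hint and the error handling are assumptions for illustration only.

#include <cuda_runtime.h>
#include <sys/mman.h>
#include <stddef.h>

void *shared_alloc(size_t size)
{
    void *dev = NULL;
    if(cudaMalloc(&dev, size) != cudaSuccess) return NULL;  /* 1. accelerator memory */

    /* 2. anonymous host memory at the same address; without MAP_FIXED the
       address is only a hint, so check that we really got it (GMAC uses
       fixed-size segments to make this reliable) */
    void *host = mmap(dev, size, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if(host != dev) {
        if(host != MAP_FAILED) munmap(host, size);
        cudaFree(dev);
        return NULL;
    }
    return host;   /* same numeric value as the accelerator pointer */
}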


Page 39: GMAC Global Memory for Accelerators

Lazy Update Data Transfers

• Avoid unnecessary data copies
• Lazy-update (see the sketch below):
  – Call: transfer modified data
  – Return: transfer when needed


[Figure: modified data moving from system memory to accelerator memory under lazy-update]
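A minimal sketch of where lazy-update would move data, assuming the GMAC API from the main slides; the kernel and sizes are made up for illustration and the comments mark the conceptual transfer points.

#include <gmac.h>
#include <string.h>

__global__ void scale(float *v, size_t size)
{
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    if(idx < size) v[idx] *= 2.0f;
}

int main(void)
{
    float *v = NULL;
    if(gmacMalloc((void **)&v, 1024 * sizeof(float)) != gmacSuccess) return -1;

    memset(v, 0, 1024 * sizeof(float)); /* CPU writes: no transfer yet              */
    v[0] = 1.0f;                        /* still no transfer                        */

    scale<<<4, 256>>>(v, 1024);         /* call: the modified data is copied to the */
                                        /* accelerator                              */
    gmacThreadSynchronize();

    float r = v[0];                     /* return: data comes back only when the    */
                                        /* CPU actually needs it                    */
    gmacFree(v);
    return (r == 2.0f) ? 0 : -1;
}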

Page 40: GMAC Global Memory for Accelerators

Rolling Update Data Transfers

• Overlap CPU execution and data transfers
• Minimal transfer on-demand
• Rolling-update:
  – Memory-block size granularity


[Figure: memory blocks streaming between system memory and accelerator memory while the CPU keeps executing]

Page 41: GMAC Global Memory for Accelerators

GMAC: Global Memory for Accelerators

http://adsm.googlecode.com