Siggraph Asia 2012 - Wil Braithwaite, NVIDIA Applied Engineering · 2012-12-13 · Persistent CUDA context thread


Page 2

Using CUDA contexts in Maya

• CUDA context
  • We must create one, but where?
• Choices:
  1. Share the existing context.
  2. Save & restore any previous CUDA context (using the driver API).
  3. Create our own "persistent" thread with a dedicated CUDA context.

Page 3

Sharing the CUDA context – Code

// Initialize plug-in...
void Plugin::_initialize()
{
    cudaSetDevice(_cudaDeviceIndex);
}

// On every update of the plug-in...
void Plugin::_update()
{
    // do update here...
}

// Clean-up plug-in...
void Plugin::_destroy()
{
    // Oh dear! This could be bad: it resets the primary context shared by every plug-in.
    cudaDeviceReset();
}

Page 4

Sharing the CUDA context

• Since CUDA 4.0, Runtime-API uses Primary CUDA Contexts.

• Primary contexts created once per device, per process.

• Other plug-ins (or “future” Maya) might change context configuration.

• cudaDeviceSynchronize()/cudaDeviceReset() affect everyone!

• cudaSetDevice(i+1) will switch the primary context for all plug-ins.

• cudaDeviceSet???(...) changes configuration for every plug-in.

• If assert() fails in a kernel then the context is tainted.

• Vigilant error-checking required!

Page 5

Save & Restore CUDA contexts

• Use the CUDA Driver-API.

• Context is sandboxed, but...

• Every push/pop requires a context switch.

• Each context uses host and device memory, and spawns host threads.

• If a device's compute-mode is “exclusive” then only one context can be active on that device at a time.

• Using only the Driver-API means no Runtime-API DLL dependency when deploying.

Page 6

Save & Restore CUDA contexts - Code

// Initialize plug-in...
void Plugin::_initialize()
{
    CUdevice dev;
    cuDeviceGet(&dev, _cudaDeviceIndex);
    cuGLCtxCreate(&_cuCxt, 0, dev);  // _cuCxt is a Plugin member (CUcontext)
    cuCtxPopCurrent(0);  // pop our new context; it is pushed again on each update.
}

// On every update of the plug-in...
void Plugin::_update()
{
    cuCtxPushCurrent(_cuCxt);
    // do update here...
    cuCtxPopCurrent(0);
}

// Clean-up plug-in...
void Plugin::_destroy()
{
    cuCtxPushCurrent(_cuCxt);
    cuCtxDestroy(_cuCxt);
}

Page 7

Save & Restore CUDA contexts

• Using the CUDA Driver-API:
  • cuFuncSetCacheConfig(...)
  • cuFuncSetSharedMemConfig(...)
  • ... will only affect this context.

• Stay away from dangerous cudaDevice*() calls.

• Since CUDA 4.1, you can use Runtime-API with a context created by the Driver-API.

Page 8

Persistent CUDA context thread

• Create a thread for each CUDA device.

• Jobs are dispatched from the host to a thread-safe queue.

• All my plug-ins share my “Multi-GPU Job Dispatcher”.

• Use different streams for each plug-in and use cudaStreamSynchronize().

• If we want GL interoperability, Maya’s GL context must be passed into our persistent thread via OpenGL context sharing.

• This may reduce performance on GeForce, but Quadro works much better in multithreaded workstation apps.

Page 9

Persistent CUDA context thread – Code (win7)

// Initialize from within Maya thread.
void Plugin::_initialize()
{
    HDC   display   = getDisplay();
    HGLRC mayaGlCxt = getGlContext();
    HGLRC glCxt     = wglCreateContext(display);
    wglShareLists(mayaGlCxt, glCxt);
    _hdl = runThread(_threadFn, display, glCxt, _cudaDeviceIndex);
    launchJob(jobInit);
    wglMakeCurrent(display, mayaGlCxt);
}

// Update from within Maya thread.
void Plugin::_update()
{
    launchJob(jobUpdate);
}

// Clean-up from within Maya thread.
void Plugin::_destroy()
{
    launchJob(jobExit);
    killThread(_hdl);
}

Page 10

Persistent CUDA context thread – Code (win7)

void jobInit(Data*)
{
    // initialize CUDA things...
}

void jobUpdate(Data*)
{
    // update CUDA things...
}

void jobExit(Data*)
{
    // clean-up CUDA things...
}

// Our persistent thread function.
void _threadFn(HDC display, HGLRC glCxt, int i)
{
    wglMakeCurrent(display, glCxt);
    CUdevice dev;
    cuDeviceGet(&dev, i);
    CUcontext cuCxt;
    cuGLCtxCreate(&cuCxt, 0, dev);
    while (_isRunning) {
        // pop & run jobs from the thread-safe queue.
    }
    wglMakeCurrent(0, 0);
    wglDeleteContext(glCxt);
    cuCtxDestroy(cuCxt);
}

Page 11

Our Particle pipeline

Page 12

Particle Emission

• Upload emitted data using nParticles as offset.

• nParticles += nEmitted

• We use fast host-to-device transfer for emission data.

• i.e. page-locked & write-combined host memory.

• For large emissions we can double-buffer and use Async API.

[Diagram: particle attribute buffer Pattr (slots 0..5, filled up to nParticles) with the emission attribute buffer Eattr (slots 0..2, nEmitted) appended at offset nParticles.]

Page 13

• Maya plug-in callbacks:

• Plugin::compute(...) (called during data computation)

• Upload & Simulate the particles using CUDA.

• Blit simulated data to VBO.

• Plugin::draw(...) (called during Maya’s scene-graph traversal)

• Render particle VBO.

The App Pipeline - (Autodesk Maya)

[Diagram: over two frames, Maya interleaves the plug-ins' compute() calls (A.compute, B.compute, C.compute) with their draw() calls (A.draw, B.draw, C.draw) on the OpenGL context.]

Page 14

Particle Initialization

• Simulation & Rendering initialization

struct Data
{
    int                   nParticles;   // number of particles
    GLuint                vboPositions; // OpenGL VBO of particle positions
    cudaGraphicsResource* vboRes;       // CUDA registered VBO
    cudaStream_t          stream;       // CUDA stream
    float4*               d_positions;  // CUDA buffer of particle positions
};

void jobInit(Data* d)
{
    cudaStreamCreate(&d->stream);
    cudaMalloc((void**)&d->d_positions, d->nParticles * sizeof(float4));
    glGenBuffers(1, &d->vboPositions);
    glBindBuffer(GL_ARRAY_BUFFER, d->vboPositions);
    glBufferData(GL_ARRAY_BUFFER, d->nParticles * sizeof(float4), 0, GL_DYNAMIC_DRAW);
    glBindBuffer(GL_ARRAY_BUFFER, 0);
    cudaGraphicsGLRegisterBuffer(&d->vboRes, d->vboPositions, cudaGraphicsRegisterFlagsNone);
    cudaGraphicsResourceSetMapFlags(d->vboRes, cudaGraphicsMapFlagsWriteDiscard);
}

Page 15

Particle Simulation

• CUDA kernel operates on CUDA buffers.

void jobUpdate(Data* d)
{
    // ...
    somethingCrazyKernel<<<dimGrid, dimBlock, 0, d->stream>>>(d->nParticles, d->d_positions);
}

Page 16

Particle Render Blit

• Blit CUDA simulation data into the GL render buffers.

void jobBlit(Data* d)
{
    void*  vboPtr = 0;
    size_t nBytes = 0;
    cudaGraphicsMapResources(1, &d->vboRes, d->stream);
    cudaGraphicsResourceGetMappedPointer(&vboPtr, &nBytes, d->vboRes);
    cudaMemcpyAsync(vboPtr, d->d_positions, nBytes, cudaMemcpyDeviceToDevice, d->stream);
    cudaGraphicsUnmapResources(1, &d->vboRes, d->stream);
    cudaStreamSynchronize(d->stream); // now we can be sure the data is copied.
}

Page 17

Particle Render

void Plugin::draw()
{
    waitForAllJobsToFinish(); // this stalls our main thread.
    Data* d = &_data;
    glBindBuffer(GL_ARRAY_BUFFER, d->vboPositions);
    glVertexPointer(4, GL_FLOAT, 0, 0);
    glEnableClientState(GL_VERTEX_ARRAY);
    glDrawArrays(GL_POINTS, 0, d->nParticles);
    glBindBuffer(GL_ARRAY_BUFFER, 0);
    glDisableClientState(GL_VERTEX_ARRAY);
}

Page 18

Example 1: Synchronous CUDA thread

• Sync with CUDA thread before we draw the buffers.

• use a semaphore to signal when the stream is synchronized...

• ... or synchronize the CUDA thread so all the jobs are complete.

[Timeline: over frames 1-2, the Maya thread runs compute1, draw1, compute2 while the CUDA thread runs sim1, blit1, sim2; draw waits for the blit to finish before render1 on the OpenGL context.]

Page 19

Example 1: Synchronous CUDA thread

• Sync with CUDA thread before we draw the buffers.

• What about a slow GL render?

[Timeline: over frames 2-3, render1 now overlaps the CUDA thread's blit2, so the blit may affect the buffers currently being drawn.]

Page 20

Example 1: Synchronous CUDA thread

• Sync with CUDA thread before we draw the buffers.

• CUDA GL-interop will wait until GL is complete.

[Timeline: the blit's "unmap" call waits for the GL context to finish render1 before writing, stalling sim3 on the CUDA thread.]

Page 21

Particle Render Blit – Explicit GL sync

• Wait for GL to finish rendering from the vbo.

void jobBlit(Data* d)
{
    d->blitMutex.claim();
    glClientWaitSync(d->glVboSync, GL_SYNC_FLUSH_COMMANDS_BIT, GL_FOREVER);
    void*  vboPtr = 0;
    size_t nBytes = 0;
    cudaGraphicsMapResources(1, &d->vboRes, d->stream);
    cudaGraphicsResourceGetMappedPointer(&vboPtr, &nBytes, d->vboRes);
    cudaMemcpyAsync(vboPtr, d->d_positions, nBytes, cudaMemcpyDeviceToDevice, d->stream);
    cudaGraphicsUnmapResources(1, &d->vboRes, d->stream);
    cudaStreamSynchronize(d->stream);
    d->blitMutex.release();
}

Page 22

Particle Render – Explicit GL sync

• Explicitly wait for GL to finish rendering from the vbo.

void Plugin::draw()
{
    Data* d = &_data;
    d->blitMutex.claim();
    glBindBuffer(GL_ARRAY_BUFFER, d->vboPositions);
    glVertexPointer(4, GL_FLOAT, 0, 0);
    glEnableClientState(GL_VERTEX_ARRAY);
    glDrawArrays(GL_POINTS, 0, d->nParticles);
    glBindBuffer(GL_ARRAY_BUFFER, 0);
    glDisableClientState(GL_VERTEX_ARRAY);
    d->glVboSync = glFenceSync(GL_SYNC_GPU_COMMANDS_COMPLETE, 0);
    d->blitMutex.release();
}

Page 23

Example 1: Synchronous CUDA thread

• Advantages:
  • Simple to implement.
  • Sandboxed context protects our plug-in and the host app.
• Disadvantages:
  • We are relying on Maya calling our update at the start, and our render at the end!

Page 24

• Allow latency and overlap simulation with host app.

• Ping-pong into double-buffered VBOs.

• (We still require sync objects if render[0] overlaps with blit[2]!)

• or use triple-buffering!

Example 2: Double-buffered CUDA thread

[Timeline: draw1 renders the previous frame's buffer (render0, render1) on the Maya/OpenGL side while the CUDA thread runs sim2, blit2, sim3; the blit waits for the previous frame to complete.]

Page 25

Particle Render Blit - doublebuffered

void jobBlit(Data* d)
{
    d->blitMutex.claim();
    glClientWaitSync(d->glVboSync[d->current], GL_SYNC_FLUSH_COMMANDS_BIT, GL_FOREVER);
    void*  vboPtr = 0;
    size_t nBytes = 0;
    cudaGraphicsMapResources(1, &d->vboRes[d->current], d->stream);
    cudaGraphicsResourceGetMappedPointer(&vboPtr, &nBytes, d->vboRes[d->current]);
    cudaMemcpyAsync(vboPtr, d->d_positions, nBytes, cudaMemcpyDeviceToDevice, d->stream);
    cudaGraphicsUnmapResources(1, &d->vboRes[d->current], d->stream);
    d->current = (d->current + 1) % 2; // ping-pong
    d->blitMutex.release();
}

Page 26

Particle Render - doublebuffered

void Plugin::draw()
{
    Data* d = &_data;
    d->blitMutex.claim();
    glBindBuffer(GL_ARRAY_BUFFER, d->vboPositions[d->current]);
    glVertexPointer(4, GL_FLOAT, 0, 0);
    glEnableClientState(GL_VERTEX_ARRAY);
    glDrawArrays(GL_POINTS, 0, d->nParticles);
    glBindBuffer(GL_ARRAY_BUFFER, 0);
    glDisableClientState(GL_VERTEX_ARRAY);
    d->glVboSync[d->current] = glFenceSync(GL_SYNC_GPU_COMMANDS_COMPLETE, 0);
    d->blitMutex.release();
}

Page 27

Particle Render Blit – Aux-CUDA method

• Create an auxiliary CUDA context in the main thread on the GL device.

• Register the VBO with the auxiliary context, and blit from pinned host memory.

void jobBlit(Data* d)
{
    cudaMemcpyAsync(d->h_positions[d->current], d->d_positions,
                    d->nParticles * sizeof(float4), cudaMemcpyDeviceToHost, d->stream);
    ...
}

void Plugin::_blit(Data* d)
{
    ...
    void*  vboPtr = 0;
    size_t nBytes = 0;
    cudaGraphicsMapResources(1, &vboRes);
    cudaGraphicsResourceGetMappedPointer(&vboPtr, &nBytes, vboRes);
    cudaMemcpy(vboPtr, d->h_positions[(d->current + 1) % 2], nBytes, cudaMemcpyHostToDevice);
    cudaGraphicsUnmapResources(1, &vboRes);
}

Page 28

Asynchronous CUDA streams

• Batching the data means…

• We hide the cost of data transfer between device and host.

• Extension to Multi-GPU is now trivial.

• NB. Not all algorithms can batch data.

• Each batch’s stream requires resource allocation.

• Be sure to do this up-front (before OpenGL gets it all!)

• Number of batches can be chosen based on available resources.

Page 29

NVIDIA Maximus

• Quadro K5000 & Tesla K20

• Uses GL-interop for efficient data movement to the Quadro GPU

• OpenGL Insights, Patrick Cozzi, Christophe Riccio, 2012. ISBN 1439893764.

• www.openglinsights.com

• DCC apps like Maya already use many GPU resources.

• Sharing data between the two cards allows for larger simulations.

• Multi-GPU is scalability for the future.

• Batch simulations on your workstation.

Page 30

Demo