8/12/2019 CUDA.NET_2.0
1/13
Page | 1 All rights reserved 2008. Company for Advanced Supercomputing Solutions
CUDA.NET Manual
Reference for programmers
Written by: Mordechai Butrashvily
Date: 17/08/2008
E-mail: [email protected]
Website: http://www.gass-ltd.co.il/products/cuda.net

Revision  Writers                Date        Changes
1.1       Mordechai Butrashvily  17/08/2008  2nd revision, final version for CUDA.NET 1.1
1.0       Mordechai Butrashvily  10/08/2008  First revision
Notice
ALL COMPANY'S DESIGN SPECIFICATIONS, REFERENCE BOARDS, FILES, DRAWINGS, DIAGNOSTICS,
LISTS, AND OTHER DOCUMENTS (TOGETHER AND SEPARATELY, MATERIALS) ARE BEING PROVIDED
AS IS. THE COMPANY MAKES NO WARRANTIES, EXPRESSED, IMPLIED, STATUTORY, OR OTHERWISE
WITH RESPECT TO THE MATERIALS, AND EXPRESSLY DISCLAIMS ALL IMPLIED WARRANTIES OF
NONINFRINGEMENT, MERCHANTABILITY, AND FITNESS FOR A PARTICULAR PURPOSE.
Information furnished is believed to be accurate and reliable. However, the company assumes no
responsibility for the consequences of use of such information or for any infringement of patents or
other rights of third parties that may result from its use. No license is granted by implication or
otherwise under any patent or patent rights of the company. Specifications mentioned in this
publication are subject to change without notice. This publication supersedes and replaces all
information previously supplied. Company's products are not authorized for use as critical
components in life support devices or systems without express written approval of the company.
Trademarks
NVIDIA is a trademark or registered trademark of NVIDIA Corporation. Other company and product names may be trademarks of the respective companies with which they are associated.
Copyright 2008 - Company for Advanced Supercomputing Solutions Ltd, All rights reserved.
Bosmat 2a Street
Shoham, 73142,
Israel
http://www.gass-ltd.co.il
Contents
Introduction
CUDA.NET basic objects
    Driver Objects
    Data Types
    Working with devices
    Working with device memory
Launching CUDA code
    Working with modules
    Working with functions
    Setting function parameters
    Setting execution configuration
Working with CUFFTDriver
Higher Level Objects
New Object Model
    CUDA Object
    Working with CUFFT
    Working with CUBLAS
Introduction
CUDA.NET is a library that provides the same functionality as the CUDA driver (exposed through a C interface) to .NET based applications.
This document relates to CUDA.NET 1.1. As the version number implies, the API of CUDA 1.1 is supported.
The library has been tested and runs without problems on CUDA 2.0, but new features
that are available as of CUDA 2.0 are not yet supported by CUDA.NET.
In practical terms, it wraps the functionality of CUDA for .NET:
- Device enumeration
- Context management
- Memory allocation and transfer (including array management)
- Texture management
- Asynchronous data transfer and execution through streams
In addition, it provides access to the other routines provided by CUDA:
- FFT (1D to 3D)
- BLAS routines
To simplify development of .NET based applications, the library includes data types that
correspond to CUDA specifications, especially vector types:
CUDA.NET                            CUDA
Char1, Char2, Char3, Char4          char1, char2, char3, char4
UChar1, UChar2, UChar3, UChar4      uchar1, uchar2, uchar3, uchar4
Short1, Short2, Short3, Short4      short1, short2, short3, short4
UShort1, UShort2, UShort3, UShort4  ushort1, ushort2, ushort3, ushort4
Int1, Int2, Int3, Int4              int1, int2, int3, int4
UInt1, UInt2, UInt3, UInt4          uint1, uint2, uint3, uint4
Long1, Long2, Long3, Long4          long1, long2, long3, long4
ULong1, ULong2, ULong3, ULong4      ulong1, ulong2, ulong3, ulong4
Float1, Float2, Float3, Float4      float1, float2, float3, float4
The basic primitive types are supported as well (the CUDA.NET column uses C# syntax):
CUDA.NET       CUDA
sbyte, byte    char, unsigned char
short, ushort  short, unsigned short
int, uint      int, unsigned int
long, ulong    long, unsigned long
float          float
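One point worth checking when relying on this mapping: the .NET primitive types have fixed sizes (sbyte 1, short 2, int 4, long 8, float 4 bytes), while the size of a C type depends on the compiler and platform. The sketch below is plain C, not CUDA.NET code, and all of its names are hypothetical. It assumes the C side is built with the host compiler and compares the host sizes against the .NET sizes.

```c
#include <stddef.h>

/* Fixed sizes (in bytes) of the .NET primitive types from the table above. */
enum {
    DOTNET_SBYTE_SIZE = 1,
    DOTNET_SHORT_SIZE = 2,
    DOTNET_INT_SIZE   = 4,
    DOTNET_LONG_SIZE  = 8,
    DOTNET_FLOAT_SIZE = 4
};

/* Returns 1 when a C type of c_size bytes has the same width as the
   .NET type of dotnet_size bytes, 0 otherwise. */
int same_width(size_t c_size, size_t dotnet_size)
{
    return c_size == dotnet_size;
}

/* Returns 1 when every host C type in the table matches its .NET partner.
   `long` is the usual exception: it is 4 bytes under some compilers
   (for example on 64-bit Windows) and 8 bytes under others. */
int host_types_match_dotnet(void)
{
    return same_width(sizeof(char),  DOTNET_SBYTE_SIZE)
        && same_width(sizeof(short), DOTNET_SHORT_SIZE)
        && same_width(sizeof(int),   DOTNET_INT_SIZE)
        && same_width(sizeof(long),  DOTNET_LONG_SIZE)
        && same_width(sizeof(float), DOTNET_FLOAT_SIZE);
}
```

If host_types_match_dotnet() returns 0, the most likely cause is the width of long, and data exchanged through the long-based types deserves a closer look on that platform.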
CUDA.NET basic objects
As stated in the previous section, CUDA.NET is a wrapper over the CUDA driver.
To ease development and migration of existing CUDA applications written in C to .NET, the
same API was preserved.
The driver API of CUDA can be accessed from .NET through the CUDADriver object of
CUDA.NET.
All methods are static, thus allowing direct access to the same functions.
For example, let's consider the following CUDA application written in C:

    #include <cuda.h>

    int main()
    {
        // Initialize the driver.
        cuInit(0);
        return 0;
    }

The same code with CUDA.NET looks like this:

    using GASS.CUDA;

    namespace CUDATest
    {
        class Test
        {
            static void Main(string[] args)
            {
                // Initialize the driver.
                CUDADriver.cuInit(0);
            }
        }
    }

The same approach can be applied to all other functions of the CUDA driver API.
Driver Objects
The basic wrapper objects provided by CUDA.NET are:
- CUDADriver: provides access to the CUDA API
- CUFFTDriver: provides access to the CUFFT API
- CUBLASDriver: provides access to the CUBLAS API and routines
Data Types
Looking into GASS.CUDA.Types namespace reveals some types that were created to support
all features of CUDA from a .NET application:
- CUdevice: represents a pointer to a device object
- CUdeviceptr: represents a pointer to device memory
- CUcontext: represents a pointer to a context object
- CUmodule: represents a pointer to a loaded module object
- CUfunction: represents a pointer to a function in a module
- CUarray: represents a pointer to an allocated array in device memory
- CUtexref: represents a pointer to a texture in device memory
- CUevent: represents a pointer to an event
- CUstream: represents a pointer to a stream that can be used for asynchronous operations
All these objects conform to the declarations in CUDA.
Working with devices
Before performing any CUDA operations, we must initialize the driver and select a device
to work with. A device is selected by creating a context on it.
For example:

    static void Main(string[] args)
    {
        // Initialize the driver. This call must come before any other CUDA operation!
        CUDADriver.cuInit(0);
        // Get the first device from the driver.
        CUdevice dev = new CUdevice();
        CUDADriver.cuDeviceGet(ref dev, 0);
        // Create a new context with default flags.
        CUcontext ctx = new CUcontext();
        CUDADriver.cuCtxCreate(ref ctx, 0, dev);
    }

By creating a context we tell the driver that this is the one to be used throughout all CUDA
operations (the attach and detach functions can be used later to manage the context we work
with).
Note that a context is always related to a single device.
Working with device memory
Using pointers from .NET code (with unsafe semantics) is discouraged, which is why all
functions that accept pointers to device memory receive an object of type "CUdeviceptr"
instead.
This way we keep .NET code clean and maintain compatibility with the C API of CUDA, since
all these objects are declared in the C environment as well.
An example of allocating device memory from .NET:

    static void Main(string[] args)
    {
        // Assuming the driver was initialized and a context was created.
        CUdeviceptr p1 = new CUdeviceptr();
        // Allocate 1K of data in device memory.
        CUDADriver.cuMemAlloc(ref p1, 1
[...]

    CUmodule mod = new CUmodule();
    CUDADriver.cuModuleLoad(ref mod, "compute.cubin");

* It is highly recommended to use a full path for the module file name.
After executing the code above we end up with a module that is loaded by the driver.
The next step is to get a function to execute from that module.
Working with functions
In the previous section we said that the CUDA driver can load modules at run time. The same
holds for functions, although functions are hosted by modules.
Once we have a loaded module and a reference to its object, we can get a reference to one
of its global functions in the following way, using CUDA.NET:

    CUfunction func = new CUfunction();
    CUDADriver.cuModuleGetFunction(ref func, mod, "compute");

We used the module we loaded previously to get a function named compute.
At this point you can understand why the declaration of the function in the compute.cu file
uses extern "C". nvcc is a C++ compiler, so by default it emits symbols with C++ name
mangling. To keep loading simple, we want to refer to the function by its plain name.
We now have a function in hand that is almost ready for execution.
The next step is to set the function's parameters dynamically.
Setting function parameters
Once we have a function, and before it is executed on the GPU, we need to specify its
parameters and some configuration information.
Looking at the signature of the function we used ("compute"), we find that it accepts one
parameter, which is a pointer to device memory:

    extern "C" __global__ void compute(float4* data);

Before we set parameter information, it is necessary to allocate the memory on the device:

    Float4[] data = new Float4[100];
    CUdeviceptr ptr = new CUdeviceptr();
    // Marshal.SizeOf cannot measure an array, so multiply the element size by the length.
    CUDADriver.cuMemAlloc(ref ptr, (uint)(Marshal.SizeOf(typeof(Float4)) * data.Length));
    // Copy the data to the device

Setting parameters using CUDA.NET for this function looks like this:

    CUDADriver.cuParameterSeti(func, 0, (uint)ptr.Pointer);

But that is not enough. We still need to tell the driver how much memory to reserve for the
function's parameters:

    CUDADriver.cuParameterSetSize(func, 4);

* NOTE: On 32 bit systems, with CUDA code compiled for 32 bit, a pointer parameter is
always 4 bytes long. On 64 bit systems, specifically when CUDA code is compiled for 64 bit,
pointer parameters are 8 bytes long, so the last parameter of cuParameterSetSize varies with
the platform. The pointer size can be obtained at run time from the size of the IntPtr
object in .NET (IntPtr.Size).
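The value 4 above covers a single 4-byte pointer. As a rough illustration of how the size might be worked out when there are several parameters, here is a minimal sketch in plain C (not CUDA.NET code). It assumes each parameter is placed at an offset aligned to its own size, which is the convention NVIDIA's driver API documentation describes; the function names are hypothetical.

```c
#include <stddef.h>

/* Rounds offset up to the next multiple of alignment (a power of two). */
size_t align_up(size_t offset, size_t alignment)
{
    return (offset + alignment - 1) & ~(alignment - 1);
}

/* Computes the total size, in bytes, of a parameter block that holds n
   parameters laid out in order, each at an offset aligned to its own size.
   Assumes every parameter size is a power of two (4 or 8 here). */
size_t param_block_size(const size_t *sizes, size_t n)
{
    size_t offset = 0;
    size_t i;
    for (i = 0; i < n; ++i) {
        offset = align_up(offset, sizes[i]);  /* start of parameter i */
        offset += sizes[i];                   /* end of parameter i   */
    }
    return offset;
}

/* Size of a single pointer parameter on the host platform
   (the C counterpart of reading IntPtr.Size in .NET). */
size_t pointer_param_size(void)
{
    return sizeof(void *);
}
```

Under these assumptions a lone pointer gives 4 or 8 bytes depending on the platform, while a 4-byte integer followed by an 8-byte pointer gives 16 bytes, because the pointer has to start at offset 8.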
Setting execution configuration
One last step before executing the code on the GPU: we need to set the execution
configuration for the function to be executed.
In CUDA, execution is divided into grids, which are divided into blocks, which in turn are
divided into threads (the basic execution element).
It is not the goal of this document to describe this approach, as it is widely covered in the
documentation provided by NVIDIA for CUDA.
The driver API provides functions to set each of these parameters:
- Grid size, in blocks
- Block size, in threads
To set the thread count for every block of execution:

    CUDADriver.cuFunctionSetBlockShape(func, 64, 8, 1);

This way, we set the block size to 64 threads in the X axis, 8 threads in the Y axis and 1 in
the Z axis, for a total of 512 threads in each block.
To use only some of the axes, set the size of the remaining axes to 1.
To launch the function in a grid:

    CUDADriver.cuLaunchGrid(func, 512, 512);

The code above actually executes the function on the GPU with a configuration of 512 blocks
in each of the X and Y axes, for a total of 262,144 blocks and 134,217,728 threads.
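The totals quoted for this configuration can be checked with a few multiplications. The following plain C sketch (hypothetical helper names, no CUDA calls) computes the number of threads per block, the number of blocks per grid and the overall thread count. It also shows why an unused block axis has to be 1 rather than 0: the thread count is a product, so a single zero makes the whole block empty.

```c
/* Threads in one block of x * y * z threads. */
unsigned long long threads_per_block(unsigned x, unsigned y, unsigned z)
{
    return (unsigned long long)x * y * z;
}

/* Blocks in a two-dimensional grid of width * height blocks. */
unsigned long long blocks_per_grid(unsigned width, unsigned height)
{
    return (unsigned long long)width * height;
}

/* Total threads launched for the given block shape and grid size. */
unsigned long long total_threads(unsigned x, unsigned y, unsigned z,
                                 unsigned width, unsigned height)
{
    return threads_per_block(x, y, z) * blocks_per_grid(width, height);
}
```

For a 64 x 8 x 1 block and a 512 x 512 grid these functions give 512 threads per block, 262,144 blocks and 134,217,728 threads in total.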
Working with CUFFTDriver
The CUFFT routines provided by CUDA allow a programmer to perform FFT calculations on the
GPU.
The same API exposed by including cufft.h is used in CUDA.NET.
For example, let's consider the following code given in the official documentation of CUDA
(written by NVIDIA):
1D Complex-to-Complex Transform

    #define NX 256
    #define BATCH 10

    cufftHandle plan;
    cufftComplex *data;
    cudaMalloc((void**)&data, sizeof(cufftComplex) * NX * BATCH);
    /* Create a 1D FFT plan. */
    cufftPlan1d(&plan, NX, CUFFT_C2C, BATCH);
    /* Use the CUFFT plan to transform the signal in place. */
    cufftExecC2C(plan, data, data, CUFFT_FORWARD);
    /* Inverse transform the signal in place. */
    cufftExecC2C(plan, data, data, CUFFT_INVERSE);
    /* Destroy the CUFFT plan. */
    cufftDestroy(plan);
    cudaFree(data);

Performing the same operations with CUDA.NET looks like this:

    using GASS.CUDA;
    using GASS.CUDA.Types;
    using GASS.CUDA.FFT;
    using GASS.CUDA.FFT.Types;
    using System.Runtime.InteropServices;

    namespace CUFFTTest
    {
        class Test
        {
            const int NX = 256;
            const int BATCH = 10;

            static void Main(string[] args)
            {
                // Assume the driver is initialized and a context was created.
                // Allocate data for the array.
                CUdeviceptr data = new CUdeviceptr();
                CUDADriver.cuMemAlloc(ref data,
                    (uint)(Marshal.SizeOf(typeof(cufftComplex)) * NX * BATCH));
                /* Create a 1D plan. */
                cufftHandle plan = new cufftHandle();
                CUFFTDriver.cufftPlan1D(ref plan, NX,
                    CUFFTType.ComplexToComplex, BATCH);
                /* Perform a forward FFT. */
                CUFFTDriver.cufftExecC2C(plan, data, data,
                    CUFFTDirection.Forward);
                /* Perform an inverse FFT. */
                CUFFTDriver.cufftExecC2C(plan, data, data,
                    CUFFTDirection.Inverse);
                /* Clean resources and free memory. */
                CUFFTDriver.cufftDestroy(plan);
                CUDADriver.cuMemFree(data);
            }
        }
    }

The approach can be used with other types of FFT.
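Both versions allocate sizeof(cufftComplex) * NX * BATCH bytes for the signal. To make that number concrete without pulling in the CUDA headers, the plain C sketch below uses a hypothetical stand-in structure, complex_f, for cufftComplex (one single-precision float each for the real and imaginary parts) and computes the buffer size.

```c
#include <stddef.h>

/* Stand-in for cufftComplex: two single-precision floats
   (real and imaginary part). Defined here so the sketch needs
   no CUDA headers; it is not the library's own declaration. */
typedef struct {
    float x;
    float y;
} complex_f;

/* Bytes needed for `batch` signals of `nx` complex samples each. */
size_t fft_buffer_bytes(size_t nx, size_t batch)
{
    return sizeof(complex_f) * nx * batch;
}
```

With NX = 256 and BATCH = 10 as in the example, and 8 bytes per complex sample, the buffer comes to 20,480 bytes.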
Higher Level Objects
With the final release of CUDA.NET, three objects were added to simplify development
with CUDA.NET:
- CUDA: provides all CUDA functionality
- CUFFT: provides CUFFT functionality with simplified functions
- CUBLAS: simplifies working with CUBLAS routines
All new objects use the respective driver, so backward compatibility with previous versions
is maintained.
To provide better feedback about what happened in the driver, each object throws a
runtime exception that is specific to its class when an error occurs, that is, when the
return value from the relevant driver function is different from CUResult.Success:
- CUDA: CUDAException
- CUFFT: CUFFTException
- CUBLAS: CUBLASException
This behavior is controlled through the UseRuntimeExceptions property, which is true by
default.
To turn off runtime exceptions, set this property to false; it can be turned on again later.
New Object Model
The major change was in CUDA, to allow programmers to work more easily with CUDA and devices.
A new object oriented approach was introduced for this purpose. For example, it is possible
to enumerate the devices recognized by CUDA simply by accessing the following
property of CUDA:

    CUDA cuda = new CUDA(true);
    foreach (Device dev in cuda.Devices)
    {
        Console.WriteLine("{0} -> {1}", dev.Ordinal, dev.Name);
    }

The rationale behind the object model was to provide the same API with better syntax and
function names, and to add some useful functions that make programming quicker.
CUDA Object
This object was created to provide simpler access to CUDA functions, without using
ref keywords or an API that is too low-level.
Most of the functionality supported by CUDADriver is available through this object,
although some functions were not included. They will be added in future releases if the
need arises.
Let's consider the case of memory allocation.
Memory can be allocated simply through:

    CUdeviceptr ptr = cuda.Allocate(128);

This fragment of code allocates 128 bytes of device memory and returns the
appropriate pointer.
Note that all functions can still operate with low-level driver objects, to allow
interoperability with the CUDADriver object.
Allocating memory for a .NET array can be done like this:

    UInt3[] data = new UInt3[128];
    CUdeviceptr ptr = cuda.Allocate(data);

The fragment above allocates enough memory for 128 elements of the UInt3 vector type, for a
total of 1536 bytes.
Using generic code and some explicit reflection code, the functions compute the amount of
memory to allocate, so there is no need to provide such details, only the array to allocate
memory for.
To ease programming, further functions allow allocating memory and copying data to device
memory in a single call:

    UInt3[] data = new UInt3[256];
    CUdeviceptr ptr = cuda.CopyHostToDevice(data);

This code fragment allocates device memory for 256 elements of UInt3 (a total of 3072 bytes)
and copies the array to device memory. This mechanism can of course be used with
other types of arrays and with the primitives supported by CUDA.NET.
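The byte counts quoted above follow from the element size: a UInt3 holds three 32-bit unsigned integers, so each element takes 12 bytes. The plain C sketch below reproduces the calculation with a hypothetical stand-in structure, uint3_host, and assumes a platform where unsigned int is 4 bytes wide and the structure has no padding.

```c
#include <stddef.h>

/* Stand-in for the UInt3 / uint3 vector type: three unsigned ints.
   A hypothetical host-side definition used only for this size check. */
typedef struct {
    unsigned int x;
    unsigned int y;
    unsigned int z;
} uint3_host;

/* Bytes needed to hold `count` elements of the stand-in type. */
size_t uint3_array_bytes(size_t count)
{
    return sizeof(uint3_host) * count;
}
```

For 128 elements this gives 1536 bytes and for 256 elements 3072 bytes, matching the figures in the text.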
Working with CUFFT
With the CUFFT object, the API supports the older functions with nicer usage, and also allows
performing most FFT operations in a single call.
Creating a 1D plan can be done by:

    CUFFT cufft = new CUFFT(new CUDA(true));
    cufftHandle plan = cufft.Plan1D(nx, type, batch);

It is also possible to run any of the 1D FFT routines by calling:

    cufftReal[] realData = new cufftReal[256];
    cufftComplex[] cmlxData = new cufftComplex[256];
    cufft.Execute1D(realData, cmlxData, nx, batch);

The function handles memory management by itself and executes the appropriate FFT based
on the provided parameters.
The same holds for all other types of FFT.
Working with CUBLAS
The CUBLAS object was created to simplify working with vector and
matrix memory, while all other operations are still accessible from the CUBLASDriver object.
Future versions may bring all supported functions into CUBLAS as well, with
simpler signatures.
An example of initializing a vector:

    CUDA cuda = new CUDA(true);
    CUBLAS blas = new CUBLAS(cuda);
    blas.Init();
    float[] data = new float[] { 0.0f, 1.5f, 2.5f, 5.224f };
    CUdeviceptr vector = blas.Allocate(data);
    blas.SetVector(data, vector);
    blas.Free(vector);
    blas.Shutdown();

The example above demonstrates how to create a vector in device memory and copy data to
it, for use by one of the CUBLAS routines.