
CUDA.NET Manual

    Reference for programmers

    Written by: Mordechai Butrashvily

    Date: 17/08/2008

E-mail: [email protected]

Website: http://www.gass-ltd.co.il/products/cuda.net

Revision   Writers                 Date         Changes
1.1        Mordechai Butrashvily   17/08/2008   2nd revision, final version for CUDA.NET 1.1
1.0        Mordechai Butrashvily   10/08/2008   First revision


Notice

ALL COMPANY'S DESIGN SPECIFICATIONS, REFERENCE BOARDS, FILES, DRAWINGS, DIAGNOSTICS, LISTS, AND OTHER DOCUMENTS (TOGETHER AND SEPARATELY, MATERIALS) ARE BEING PROVIDED AS IS. THE COMPANY MAKES NO WARRANTIES, EXPRESSED, IMPLIED, STATUTORY, OR OTHERWISE WITH RESPECT TO THE MATERIALS, AND EXPRESSLY DISCLAIMS ALL IMPLIED WARRANTIES OF NONINFRINGEMENT, MERCHANTABILITY, AND FITNESS FOR A PARTICULAR PURPOSE.

Information furnished is believed to be accurate and reliable. However, the company assumes no responsibility for the consequences of use of such information or for any infringement of patents or other rights of third parties that may result from its use. No license is granted by implication or otherwise under any patent or patent rights of the company. Specifications mentioned in this publication are subject to change without notice. This publication supersedes and replaces all information previously supplied. Company's products are not authorized for use as critical components in life support devices or systems without express written approval of the company.

Trademarks

NVIDIA is a trademark or registered trademark of NVIDIA Corporation. Other company and product names may be trademarks of the respective companies with which they are associated.

    Copyright 2008 - Company for Advanced Supercomputing Solutions Ltd, All rights reserved.

    Bosmat 2a Street

    Shoham, 73142,

    Israel

    http://www.gass-ltd.co.il


Contents

Introduction
CUDA.NET basic objects
    Driver Objects
    Data Types
    Working with devices
    Working with device memory
Launching CUDA code
    Working with modules
    Working with functions
    Setting function parameters
    Setting execution configuration
Working with CUFFTDriver
Higher Level Objects
New Object Model
    CUDA Object
    Working with CUFFT
    Working with CUBLAS


    Introduction

CUDA.NET is a library that provides the same functionality as the CUDA driver (exposed through a C interface) to .NET based applications.

The version of CUDA.NET this document relates to is CUDA.NET 1.1, which means that the API of CUDA 1.1 is supported. The library has been tested and runs without problems on CUDA 2.0, but new features that are available as of CUDA 2.0 are not yet supported by CUDA.NET.

As such, it wraps all the functionality of CUDA for .NET, practically speaking:

Device enumeration
Context management
Memory allocation and transfer (including array management)
Texture management
Asynchronous data transfer and execution, through streams

In addition, it provides access to all other routines provided by CUDA:

FFT (1D to 3D)
BLAS routines

To simplify development of .NET based applications, the library includes data types that correspond to CUDA specifications, especially vector types:

CUDA.NET                              CUDA
Char1, Char2, Char3, Char4            char1, char2, char3, char4
UChar1, UChar2, UChar3, UChar4        uchar1, uchar2, uchar3, uchar4
Short1, Short2, Short3, Short4        short1, short2, short3, short4
UShort1, UShort2, UShort3, UShort4    ushort1, ushort2, ushort3, ushort4
Int1, Int2, Int3, Int4                int1, int2, int3, int4
UInt1, UInt2, UInt3, UInt4            uint1, uint2, uint3, uint4
Long1, Long2, Long3, Long4            long1, long2, long3, long4
ULong1, ULong2, ULong3, ULong4        ulong1, ulong2, ulong3, ulong4
Float1, Float2, Float3, Float4        float1, float2, float3, float4

That is, while supporting the basic primitive types (CUDA.NET syntax conforms to C#):

CUDA.NET        CUDA
sbyte, byte     char, unsigned char
short, ushort   short, unsigned short
int, uint       int, unsigned int
long, ulong     long, unsigned long
float           float
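These vector types are plain value types, so host-side arrays of them can be prepared in C# and later copied to device memory. A small sketch follows; the x, y, z, w field names are an assumption, mirroring CUDA's float4:

using GASS.CUDA.Types;

// Host array matching "float4 data[64]" on the device side.
Float4[] points = new Float4[64];

// Field names assumed to mirror CUDA's float4 (x, y, z, w).
points[0].x = 1.0f;
points[0].y = 2.0f;
points[0].z = 0.5f;
points[0].w = 1.0f;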


    CUDA.NET basic objects

As stated in the previous section, CUDA.NET is a wrapper over the CUDA driver. To ease development and migration of existing CUDA applications written in C to .NET, the same API was preserved.

Accessing the driver API of CUDA from .NET can be done by using the CUDADriver object of CUDA.NET. All its methods are static, thus allowing direct access to the same functions.

    For example, let's consider the following CUDA application written in C:

#include <cuda.h>

int main()
{
    // Initialize the driver.
    cuInit(0);

    return 0;
}

    The same code with CUDA.NET looks like this:

using GASS.CUDA;

namespace CUDATest
{
    class Test
    {
        static void Main(string[] args)
        {
            // Initialize the driver.
            CUDADriver.cuInit(0);
        }
    }
}

    The same approach can be applied to all other functions of the CUDA driver API.

Driver Objects

The set of basic wrapper objects provided by CUDA.NET are:

CUDADriver - provides access to the CUDA API
CUFFTDriver - provides access to the CUFFT API
CUBLASDriver - provides access to the CUBLAS API and routines

    Data Types

Looking into the GASS.CUDA.Types namespace reveals some types that were created to support all features of CUDA from a .NET application:

CUdevice - Represents a pointer to a device object
CUdeviceptr - Represents a pointer to device memory


CUcontext - Represents a pointer to a context object
CUmodule - Represents a pointer to a loaded module object
CUfunction - Represents a pointer to a function in a module
CUarray - Represents a pointer to an allocated array in device memory
CUtexref - Represents a pointer to a texture in device memory
CUevent - Represents a pointer to an event
CUstream - Represents a pointer to a stream that can be used for asynchronous operations

All these objects conform to the declarations in CUDA.

    Working with devices

Before starting to perform CUDA operations, we must initialize the driver and select a device to work with. Selecting a device is done by creating a context on it.

An example might be:

static void Main(string[] args)
{
    // Initialize the driver. This call must come before any other CUDA operation!
    CUDADriver.cuInit(0);

    // Get the first device from the driver.
    CUdevice dev = new CUdevice();
    CUDADriver.cuDeviceGet(ref dev, 0);

    // Create a new context with default flags.
    CUcontext ctx = new CUcontext();
    CUDADriver.cuCtxCreate(ref ctx, 0, dev);
}

By creating a context we tell the driver that this is the one to be used throughout all CUDA operations (we can use attach and detach functions later to manage the context we work with). It should be noted that a context is always related to a single device.
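It can also be useful to query how many CUDA devices are present before selecting one. The C driver API exposes cuDeviceGetCount for this; the exact CUDA.NET signature below is an assumption, following the same ref-parameter convention as the calls above:

// Assumed wrapper for cuDeviceGetCount(int*), following the ref convention shown above.
int count = 0;
CUDADriver.cuDeviceGetCount(ref count);
Console.WriteLine("CUDA devices found: " + count);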

    Working with device memory

Using pointers from .NET code (with unsafe semantics) is discouraged; that is why all functions that accept pointers to device memory receive an object of type CUdeviceptr instead.

This way we keep .NET code clean and maintain compatibility with the C API of CUDA, since all these objects are declared in the C environment as well.

    An example for allocating device memory from .NET:


static void Main(string[] args)
{
    // Assuming the driver was initialized and a context was created.
    CUdeviceptr p1 = new CUdeviceptr();

    // Allocate 1K (1024 bytes) of data in device memory.
    CUDADriver.cuMemAlloc(ref p1, 1024);
}

Launching CUDA code

Working with modules

The CUDA driver can load compiled modules (cubin files produced by nvcc) at run-time. Loading a module with CUDA.NET looks like this:

CUmodule mod = new CUmodule();
CUDADriver.cuModuleLoad(ref mod, "compute.cubin");

* It is highly encouraged to use a full path for the module file name.

After executing the code above we end up with a module that is loaded by the driver. The next step will be to get a function to execute from that module.

    Working with functions

In the previous section we said that the CUDA driver can load modules at run-time; the same holds for functions too, although functions are hosted by modules.

Once we have a loaded module, and a reference to its object, we can get a reference to one of its global functions in the following way, using CUDA.NET:

CUfunction func = new CUfunction();
CUDADriver.cuModuleGetFunction(ref func, mod, "compute");

We used the module we loaded previously to get a function named compute.

At this point you can understand why the declaration of the function in the compute.cu file involved the extern "C" keyword: nvcc is a C++ compiler, so it emits symbols with C++ name mangling, but when we wish to load a function we want to be able to specify its plain name.

    At this point we have a function in hand that is almost ready for execution.

The next step will be to set the function's parameters dynamically.

    Setting function parameters

So after we have a function, and before it is executed on the GPU, we need to specify some parameters and configuration information.

Investigating the function signature we used ("compute"), we find that it accepts one parameter, which is a pointer to device memory:

extern "C" __global__ void compute(float4* data);

Before we set parameter information, it is necessary to allocate the memory on the device:

Float4[] data = new Float4[100];
CUdeviceptr ptr = new CUdeviceptr();

// Allocate enough device memory for all 100 Float4 elements.
CUDADriver.cuMemAlloc(ref ptr, (uint)(Marshal.SizeOf(typeof(Float4)) * data.Length));

// Copy the data to the device.
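The copy itself is not shown in the listing above. Assuming CUDA.NET exposes an overload of cuMemcpyHtoD that accepts a managed array (an assumption, not confirmed by this manual), it might look like this:

// Hypothetical overload taking a managed Float4[] source and a byte count.
CUDADriver.cuMemcpyHtoD(ptr, data, (uint)(Marshal.SizeOf(typeof(Float4)) * data.Length));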

    Setting parameters using CUDA.NET for this function looks like this:

    CUDADriver.cuParameterSeti(func, 0, (uint)ptr.Pointer);


But that is not enough. We still need to give the driver a hint that indicates how much memory to reserve for function parameters:

CUDADriver.cuParameterSetSize(func, 4);

* NOTE: On 32-bit systems, when compiling CUDA code for 32 bit, a pointer will always be 4 bytes long. On 64-bit systems, specifically when compiling CUDA code for 64 bit, pointers are 8 bytes long, so the last parameter of cuParameterSetSize varies with the platform. It is possible to get the pointer size at run-time using the size of the IntPtr type in .NET.
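A portable way to express this in the call above is to pass IntPtr.Size, which is 4 in a 32-bit process and 8 in a 64-bit process. This is a sketch; the cast assumes the parameter is an unsigned size, as in the C API:

// IntPtr.Size is 4 on 32-bit and 8 on 64-bit processes.
CUDADriver.cuParameterSetSize(func, (uint)IntPtr.Size);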

    Setting execution configuration

One last step before executing the code on the GPU is to set the execution configuration for our context (meaning the function to be executed).

As already known, CUDA execution is divided into grids, which in turn are divided into blocks, which are divided into threads (the basic execution element). It is not the goal of this document to describe this model, as it is widely covered in the documentation provided by NVIDIA for CUDA.

The driver API provides functions to set each of these parameters:

Grid size, in terms of blocks
Block size, in terms of threads

To set the thread count for every block of execution:

CUDADriver.cuFunctionSetBlockShape(func, 64, 8, 1);

This way, we set the block size to be 64 threads in the X axis, 8 threads in the Y axis and 1 in the Z axis, for a total of 512 threads in each block. It is possible to set only one of the axes.

To launch the function in a grid:

CUDADriver.cuLaunchGrid(func, 512, 512);

The code above really executes the function in the GPU with a configuration of 512 blocks in the X and Y axes respectively, for a total of 262,144 blocks and 134,217,728 threads to be executed.
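For reference, the calls from this and the previous section combine into the following launch sequence, a sketch reusing the func and ptr objects from the earlier examples:

CUDADriver.cuParameterSeti(func, 0, (uint)ptr.Pointer);   // first (and only) kernel parameter
CUDADriver.cuParameterSetSize(func, (uint)IntPtr.Size);   // total parameter size: one pointer
CUDADriver.cuFunctionSetBlockShape(func, 64, 8, 1);       // 64 x 8 x 1 = 512 threads per block
CUDADriver.cuLaunchGrid(func, 512, 512);                  // 512 x 512 = 262,144 blocks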

    Working with CUFFTDriver

The CUFFT routines provided by CUDA allow a programmer to perform FFT calculations on the GPU. The same API exposed by including cufft.h is used in CUDA.NET.

For example, let's consider the following code given in the official documentation of CUDA (written by NVIDIA):

1D Complex-to-Complex Transform


#define NX 256
#define BATCH 10

cufftHandle plan;
cufftComplex *data;
cudaMalloc((void**)&data, sizeof(cufftComplex) * NX * BATCH);

/* Create a 1D FFT plan. */
cufftPlan1d(&plan, NX, CUFFT_C2C, BATCH);

/* Use the CUFFT plan to transform the signal in place. */
cufftExecC2C(plan, data, data, CUFFT_FORWARD);

/* Inverse transform the signal in place. */
cufftExecC2C(plan, data, data, CUFFT_INVERSE);

/* Destroy the CUFFT plan. */
cufftDestroy(plan);
cudaFree(data);

Performing the same operations with CUDA.NET looks like this:

using GASS.CUDA;
using GASS.CUDA.Types;
using GASS.CUDA.FFT;
using GASS.CUDA.FFT.Types;
using System.Runtime.InteropServices;

namespace CUFFTTest
{
    class Test
    {
        const int NX = 256;
        const int BATCH = 10;

        static void Main(string[] args)
        {
            // Assume the driver is initialized and a context was created.

            // Allocate data for the array.
            CUdeviceptr data = new CUdeviceptr();
            CUDADriver.cuMemAlloc(ref data,
                (uint)(Marshal.SizeOf(typeof(cufftComplex)) * NX * BATCH));

            /* Create a 1D plan. */
            cufftHandle plan = new cufftHandle();
            CUFFTDriver.cufftPlan1D(ref plan, NX,
                CUFFTType.ComplexToComplex, BATCH);

            /* Perform a forward FFT. */
            CUFFTDriver.cufftExecC2C(plan, data, data,
                CUFFTDirection.Forward);


            /* Perform an inverse FFT. */
            CUFFTDriver.cufftExecC2C(plan, data, data,
                CUFFTDirection.Inverse);

            /* Clean resources and free memory. */
            CUFFTDriver.cufftDestroy(plan);
            CUDADriver.cuMemFree(data);
        }
    }
}

    The approach can be used with other types of FFT.

Higher Level Objects

As of the final release of CUDA.NET, three objects were added to simplify development with CUDA.NET:

CUDA - provides all CUDA functionality
CUFFT - provides CUFFT functionality with simplified functions
CUBLAS - simplifies working with CUBLAS routines

All new objects use the respective driver, so backward compatibility is maintained with previous versions.

To provide better feedback on what happened in the driver, each object throws a runtime exception specific to the class itself when an error occurs, that is, when the return value from the relevant driver function is different from CUResult.Success:

CUDA - CUDAException
CUFFT - CUFFTException
CUBLAS - CUBLASException

This behavior can be controlled through the UseRuntimeExceptions property, which is true by default. To turn off runtime exceptions, simply set this property to false; it can be turned on again later.
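For illustration, a minimal sketch of catching such an exception and then switching the object to manual status checking; the allocation call is just an example of an operation that may fail:

CUDA cuda = new CUDA(true);

try
{
    // Any call routed through the object may throw a CUDAException on failure.
    CUdeviceptr ptr = cuda.Allocate(128);
}
catch (CUDAException ex)
{
    Console.WriteLine("CUDA error: " + ex.Message);
}

// Disable runtime exceptions; errors must then be checked manually.
cuda.UseRuntimeExceptions = false;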

New Object Model

The major change was in the CUDA object, to allow programmers to work easily with CUDA and devices. A new object oriented approach was suggested for this purpose. For example, it is possible to enumerate the devices that are recognized by CUDA simply by accessing the Devices property of the CUDA object:

CUDA cuda = new CUDA(true);


foreach (Device dev in cuda.Devices)
{
    Console.WriteLine("{0} -> {1}", dev.Ordinal, dev.Name);
}

The rationale behind the object model was to provide the same API with better syntax and function names, and to add some useful functions that will improve programming agility.

    CUDA Object

This object was created to provide simpler access to CUDA functions, without using ref keywords or an overly low-level API.

Most of the functionality that is supported by CUDADriver is available through this object, although some functions didn't find their way in yet; they will be added in future releases if there is a need for them.

Let's consider the case of memory allocation. We can simply allocate memory through:

CUdeviceptr ptr = cuda.Allocate(128);

This fragment of code simply allocates 128 bytes of device memory and returns the appropriate pointer.

It should be noted at this point that all functions can still operate with low-level driver objects, to allow interoperability with the CUDADriver object.

Allocating memory for a .NET array can be done like this:

UInt3[] data = new UInt3[128];
CUdeviceptr ptr = cuda.Allocate(data);

The fragment above allocates enough memory for 128 elements of the UInt3 vector type, for a total of 1536 bytes.

Using generic code and some explicit reflection, the amount of memory to allocate is computed by these functions, so there is no need to provide such details; only the array to allocate memory for.

To ease programming, some further functions were provided to allow allocating memory and copying data to device memory in a single call:

UInt3[] data = new UInt3[256];
CUdeviceptr ptr = cuda.CopyHostToDevice(data);

This code fragment allocates device memory for 256 elements of UInt3 (a total of 3072 bytes) and copies the array to device memory. Of course, this mechanism can be used with other types of arrays and CUDA.NET supported primitives.
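Because the returned pointers are ordinary CUdeviceptr values, they can be mixed freely with CUDADriver calls, as stated above. A short sketch; it assumes the CUDA instance already has an active context:

CUDA cuda = new CUDA(true);                       // assumes this leaves an active context for the calls below

UInt3[] data = new UInt3[256];
CUdeviceptr ptr = cuda.CopyHostToDevice(data);    // allocate device memory and copy in one call

// ... launch kernels that consume ptr ...

CUDADriver.cuMemFree(ptr);                        // the same pointer is valid for the low-level driver API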


    Working with CUFFT

With the CUFFT object the API still supports the older functions with nicer usage, but also allows performing most FFT operations in a single call.

Creating a 1D plan can be done by:

CUFFT cufft = new CUFFT(new CUDA(true));
cufftHandle plan = cufft.Plan1D(nx, type, batch);

But it is also possible to run any of the 1D FFT routines by calling:

cufftReal[] realData = new cufftReal[256];
cufftComplex[] cmlxData = new cufftComplex[256];

cufft.Execute1D(realData, cmlxData, nx, batch);

The function handles memory management by itself and executes the appropriate FFT based on the provided parameters. The same holds for all other types of FFT.

    Working with CUBLAS

The CUBLAS object was created to provide better usage when working with vector and matrix memory, while all other operations are still accessible from the CUBLASDriver object.

It is possible that in future versions all supported functions will enter CUBLAS as well, with simpler signatures.

    An example for initializing a vector:

CUDA cuda = new CUDA(true);
CUBLAS blas = new CUBLAS(cuda);
blas.Init();

float[] data = new float[] { 0.0f, 1.5f, 2.5f, 5.224f };
CUdeviceptr vector = blas.Allocate(data);
blas.SetVector(data, vector);

blas.Free(vector);
blas.Shutdown();

The example above demonstrates how to create a vector in device memory and copy data to be used by one of the CUBLAS routines.