
  • GPU multiprocessing

    Manuel Ujaldón Martínez, Computer Architecture Department

    University of Malaga (Spain)

  • Outline

    1. Multichip solutions [10 slides]
    2. Multicard solutions [2 slides]
    3. Multichip + multicard [3 slides]
    4. Performance on matrix decompositions [2 slides]
    5. CUDA programming [5 slides]
    6. Scalability on 3DFD [4 slides]

  • A world of possibilities

    From lower to higher cost, we have:
    1. Multichip: Voodoo5 (3Dfx), 3D1 (Gigabyte).
    2. Multicard: SLI (Nvidia) / CrossFire (ATI).
    3. Combination: two chips/card and/or two cards/connector.

    [Product photos: Gigabyte (2005), NVIDIA (2007), ATI (2007), Evans & Sutherland (2004), NVIDIA (2008)]

  • I. Multichip solutions


  • First choice: Multichip. A retrospective:

    - Voodoo 5 5500, 3Dfx (1999)
    - Rage Fury Maxx, ATI (2000)
    - Volari V8 Duo, XGI (2002)
    - Dual Radeon 9800 (prototype), Sapphire (2003)

  • First choice: Multichip. Example 1: 3D1 (Gigabyte, 2005)

    A double GeForce 6600 GT GPU on the same card (December 2005).

    Each GPU is endowed with 128 MB of memory and a 128-bit bus.

  • First choice: Multichip. Example 2: GeForce 7950 GX2 (Nvidia, 2006)

  • First choice: Multichip. Example 3: GeForce 9800 GX2 (Nvidia, 2008)

    Two GeForce 8800 GPUs, two printed circuit boards, and 512 MB of video memory per GPU, all behind a single PCI-express connector.

  • First choice: Multichip. 3D1 (Gigabyte). Cost and performance

    Card                      3DMark 2003            3DMark 2005
                              1024x768   1600x1200   1024x768   1600x1200
    GeForce 6600 GT               8234        2059       3534        2503
    3D1 using a single GPU        8529        2063       3572        2262
    GeForce 6800 GT              11493        3846       4858        3956
    GeForce 6600 GT SLI          14049        3924       6122        3542
    3D1 using two GPUs           14482        4353       6307        3609

    Cost: row 3 > row 4 > row 5 > row 2 > row 1

  • First choice: Multichip. 3D1 (Gigabyte). Analysis

    As compared to a single GeForce 6800 GT, the 3D1 has:
    - Lower cost.
    - Higher arithmetic performance, especially at lower resolutions and on newer software features (shaders).
    - Similar bandwidth.
    - Less memory space and usability: vertices and textures must be replicated, and a GPU cannot see the memory of its twin.

    As compared to two GeForce 6600 GT cards connected through SLI:
    - Slightly lower cost.
    - Greater performance without demanding CPU bandwidth.
    - Less versatile regarding future expansion and/or single-card use.

  • First choice: Multichip. GeForce 7950 GX2 (2006)

    GPU developed by Nvidia in June 2006, designed from the start as a twin (duality affects the design).

    Clocks are slower than on the single-GPU model:
    - GPU: 500 MHz (twin) versus 650 MHz (stand-alone).
    - Memory: 2x600 MHz (twin) versus 2x800 MHz (stand-alone).

    Drivers were released almost a year later, which initially penalized the popularity of this card.

    It offers 48 pixel processors (24 on each GPU) and 1 GB of video memory (512 MB connected to each GPU through a pair of 256-bit buses).

  • First choice: Multichip (2006). Transistors

    A smaller chip with smaller transistors allows scaling by replicating the GPU.

  • First choice: Multichip (2006). Frequency

    A dual GPU allows relaxed clocks, with less heat and power consumption.

  • First choice: Multichip (2006). Bandwidth

    Two GPUs placed on parallel planes make it easier to double the bus width to 512 bits.

  • II. Multicard solutions


  • Second choice: Multicard. A couple of GPUs

    SLI (Nvidia, on GeForces) or CrossFire (ATI, on Radeons).

  • Second choice: Multicard. SLI (Nvidia). Elements

    - The motherboard must have several PCI-express 2.0 x16 slots.
    - The power supply must deliver at least 700 watts.
    - Performance issues: a twin card may increase performance by 60-80%, but a new generation of GPUs may increase it even more, so the time frame becomes crucial!

  • III. Multichip + multicard


  • 1+2 choice: Multichip + multicard

    First solution available on the marketplace: Gigabyte (2005), based on GeForce 6 GPUs. It allows heterogeneous graphics cards, but workload balancing gets complicated.

  • 1+2 choice: Multichip + multicard. Implementation details

  • 1+2 choice: Multichip + multicard. Newer designs

    [Figure: 2-GPU, 4-GPU, and 8-GPU configurations]

    It combines a number of GeForce 9800 GX2 cards and a multi-slot motherboard to configure up to quad-SLI: 2 GPUs/card x up to 4 cards = 8 GPUs.

  • IV. Performance on matrix decompositions


  • Multicard performance versus a newer generation (LU decomposition)

    A second (twin) GPU improves performance by 1.6x, but does not reach that of a single card from the next generation.

  • CPU+GPU performance versus a single quad-core CPU (more on this later)

    The benchmark is composed of three popular matrix decompositions used in linear algebra.

  • V. CUDA programming for multi-GPU applications


  • Device Management

    The CPU can query and select GPU devices:
    - cudaGetDeviceCount( int *count )
    - cudaSetDevice( int device )
    - cudaGetDevice( int *current_device )
    - cudaGetDeviceProperties( cudaDeviceProp* prop, int device )
    - cudaChooseDevice( int *device, cudaDeviceProp* prop )

    Multi-GPU setup:
    - Device 0 is used by default.
    - One CPU thread can control only one GPU.
    - Multiple CPU threads can control the same GPU; calls are serialized by the driver.
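    A minimal host-side sketch of how these calls fit together (the printed properties and the choice of device 0 are illustrative, not from the slides):

        #include <cstdio>
        #include <cuda_runtime.h>

        int main() {
          int count = 0;
          cudaGetDeviceCount(&count);            // how many CUDA devices are visible
          for (int d = 0; d < count; ++d) {
            cudaDeviceProp prop;
            cudaGetDeviceProperties(&prop, d);   // query each device
            printf("Device %d: %s, %d multiprocessors\n",
                   d, prop.name, prop.multiProcessorCount);
          }
          cudaSetDevice(0);                      // explicit, though 0 is the default
          return 0;
        }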


  • Multiple CPU Threads and CUDA

    CUDA resources allocated by a CPU thread can be consumed only by CUDA calls from the same CPU thread.

    Violation example: CPU thread 2 allocates GPU memory and stores its address in p; thread 3 issues a CUDA call that accesses memory via p.
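    A minimal sketch of the legal pattern under this rule, assuming two GPUs and using std::thread for the host threads (names are illustrative):

        #include <cuda_runtime.h>
        #include <thread>

        void worker(int dev) {
          cudaSetDevice(dev);          // bind this CPU thread to one GPU
          float *p = nullptr;
          cudaMalloc(&p, 1 << 20);     // allocated by THIS thread...
          // ...so every kernel or copy that touches p must also be
          // issued from this thread, never from a sibling thread.
          cudaFree(p);
        }

        int main() {
          std::thread t0(worker, 0), t1(worker, 1);
          t0.join(); t1.join();
          return 0;
        }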


  • When using several GPUs, the implementation gets complicated

    GPUs don't share video memory, so the programmer must move data around through PCI-express (even when the GPUs belong to the same graphics card, as in the GeForce 9800 GX2).

    Steps to follow (see the sketch below):
    1. Copy data from GPU A to CPU thread A.
    2. Copy data from CPU thread A to CPU thread B using MPI.
    3. Copy data from CPU thread B to GPU B.

    We can use asynchronous copies to overlap kernel execution on the GPU with data copies, and pinned memory to share copies among CPU threads (use cudaHostAlloc()).
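    A minimal sketch of these three steps for two MPI ranks, using cudaHostAlloc() for the pinned staging buffer (function and variable names are assumptions):

        #include <cuda_runtime.h>
        #include <mpi.h>

        // Move n floats from GPU A (rank 0) to GPU B (rank 1).
        void gpu_to_gpu(float *d_buf, int n, int rank) {
          float *h_buf;
          cudaHostAlloc(&h_buf, n * sizeof(float), cudaHostAllocDefault);
          if (rank == 0) {
            cudaMemcpy(h_buf, d_buf, n * sizeof(float),
                       cudaMemcpyDeviceToHost);                  // GPU A -> CPU A
            MPI_Send(h_buf, n, MPI_FLOAT, 1, 0, MPI_COMM_WORLD); // CPU A -> CPU B
          } else if (rank == 1) {
            MPI_Recv(h_buf, n, MPI_FLOAT, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            cudaMemcpy(d_buf, h_buf, n * sizeof(float),
                       cudaMemcpyHostToDevice);                  // CPU B -> GPU B
          }
          cudaFreeHost(h_buf);
        }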


  • Host Synchronization

    All kernel launches are asynchronous:
    - Control returns to the CPU immediately.
    - The kernel executes after all previous CUDA calls have completed.

    cudaMemcpy is synchronous:
    - Control returns to the CPU after the copy completes.
    - The copy starts after all previous CUDA calls have completed.

    cudaThreadSynchronize() blocks until all previous CUDA calls complete.
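    A minimal sketch of these semantics, assuming a toy kernel scale() and pre-allocated buffers:

        __global__ void scale(float *x, int n) {
          int i = blockIdx.x * blockDim.x + threadIdx.x;
          if (i < n) x[i] *= 2.0f;
        }

        void run(float *d_x, float *h_x, int n) {
          // Launch is asynchronous: control returns here immediately.
          scale<<<(n + 255) / 256, 256>>>(d_x, n);
          // cudaMemcpy starts only after the kernel completes, and it
          // blocks the CPU until the copy itself completes.
          cudaMemcpy(h_x, d_x, n * sizeof(float), cudaMemcpyDeviceToHost);
          cudaThreadSynchronize();   // explicit barrier for any pending calls
        }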


  • CPU-GPU interactions: Conclusions

    CPU-GPU memory bandwidth is much lower than GPU memory bandwidth:
    - Use page-locked host memory (cudaMallocHost()) for maximum CPU-GPU bandwidth: 3.2 GB/s is common on PCI-e x16, and ~4 GB/s has been measured on nForce 680i chipsets (8 GB/s for PCI-e 2.0).
    - Be cautious, however, since allocating too much page-locked memory can reduce overall system performance.

    Minimize CPU-GPU data transfers by moving more code from the CPU to the GPU:
    - Even if that means running kernels with low parallelism.
    - Intermediate data structures can be allocated, operated on, and deallocated without ever copying them to CPU memory.

    Group data transfers: one large transfer is much better than many small ones (see the sketch below).
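    A minimal sketch of grouping transfers through a page-locked staging buffer allocated with cudaMallocHost() (names and sizes are assumptions):

        #include <cuda_runtime.h>

        // Pack nchunks small host arrays into one pinned buffer and
        // upload them with a single large cudaMemcpy.
        void upload_grouped(const float *chunks[], int nchunks,
                            int chunk_len, float *d_dst) {
          float *h_staging;
          size_t bytes = (size_t)nchunks * chunk_len * sizeof(float);
          cudaMallocHost(&h_staging, bytes);       // page-locked host memory
          for (int c = 0; c < nchunks; ++c)        // pack on the CPU...
            for (int i = 0; i < chunk_len; ++i)
              h_staging[c * chunk_len + i] = chunks[c][i];
          cudaMemcpy(d_dst, h_staging, bytes,      // ...then one big copy
                     cudaMemcpyHostToDevice);
          cudaFreeHost(h_staging);
        }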

  • VI. Scalability for 3DFD (Nvidia code)


  • Example: Multi-GPU implementation of 3DFD

    3DFD is a finite-differences code for the discretization of the seismic wave equation: 8th order in space, 2nd order in time, using a regular mesh.

    The X and Y dimensions are fixed while Z varies; data is partitioned among GPUs along the Z axis (see the sketch below).

    Computation increases with Z, while communication (per node) stays constant.

    A GPU has to exchange 4 xy-planes (ghost nodes) with each of its neighbors.

    Executed on a cluster with 2 GPUs per node and an InfiniBand SDR network.
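    A minimal sketch of the partition arithmetic (mesh dimensions and helper names are assumptions, not values from the benchmark):

        // 8th order in space => stencil radius 4, so each boundary
        // exchanges 4 xy-planes of ghost nodes with a neighbor.
        const int RADIUS = 4;

        // Interior z-planes owned by GPU g when NZ planes are split
        // across ngpus (remainder spread over the first GPUs).
        int local_nz(int NZ, int ngpus, int g) {
          return NZ / ngpus + (g < NZ % ngpus ? 1 : 0);
        }

        // Values moved per exchange with one neighbor: grows with the
        // fixed X and Y dimensions, but is independent of Z.
        size_t ghost_count(int NX, int NY) {
          return (size_t)RADIUS * NX * NY;
        }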


  • Performance for a couple of GPUs

    Linear scaling is achieved when computation time exceeds communication time.


  • Three or more cluster nodes

    Times are per cluster node. At least one cluster node needs two MPI communications, one with each of its neighbors.


  • Performance with 8 GPUs

    The 8x improvement factor is sustained for Z > 1300, exactly where computation exceeds communication.
