
  • Research Presentation for Computer Modelling Group Ltd.

    Malcolm Roberts

    University of Strasbourg

    2016-04-26

    [email protected], www.malcolmiwroberts.com

  • Outline

    - Convolutions
      - Implicitly dealiased FFT-based convolutions
      - Shared-memory implementation
      - Parallel OpenMP/MPI implementation
      - Pseudospectral simulations
    - GPU programming
      - OpenCL
      - schnaps
      - Performance analysis


  • FFT-based convolutions

    The convolution of $\{F_k\}_{k=0}^{m-1}$ and $\{G_k\}_{k=0}^{m-1}$ is

    $$(F \ast G)_k = \sum_{\ell=0}^{k} F_\ell G_{k-\ell}, \qquad k = 0, \ldots, m-1. \tag{1}$$

    [Figure: an example pair of sequences F and G and their convolution F ∗ G.]
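    Since the slide's example figures are not reproduced in this transcript, here is a minimal C sketch of equation (1) with hypothetical data:

        #include <stdio.h>

        /* Direct evaluation of (F * G)_k = sum_{l=0}^{k} F_l G_{k-l},
           for k = 0..m-1, on arbitrary example data. */
        int main(void) {
            const int m = 4;
            double F[] = {1, 2, 3, 4}, G[] = {4, 3, 2, 1}, H[4];
            for (int k = 0; k < m; ++k) {
                H[k] = 0.0;
                for (int l = 0; l <= k; ++l)
                    H[k] += F[l] * G[k - l];
            }
            for (int k = 0; k < m; ++k)
                printf("(F*G)_%d = %g\n", k, H[k]);
            return 0;
        }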


  • FFT-based convolutions

    Applications:

    - Signal processing
    - Machine learning: convolutional neural networks
    - Image processing
    - Particle image velocimetry
    - Pseudospectral simulations of nonlinear PDEs

    The convolution theorem:

    $$\mathcal{F}[F \ast G] = \mathcal{F}[F] \odot \mathcal{F}[G], \tag{2}$$

    where $\odot$ denotes the element-wise product.

    Using FFTs improves speed and accuracy.


  • FFT-based convolutions

    Let $\zeta_m = \exp(2\pi i/m)$. Forward and backward Fourier transforms are given by

    $$f_j = \sum_{k=0}^{m-1} \zeta_m^{jk} F_k, \qquad F_k = \frac{1}{m}\sum_{j=0}^{m-1} \zeta_m^{-kj} f_j. \tag{3}$$

    We will use the identity

    $$\sum_{j=0}^{m-1} \zeta_m^{\ell j} = \begin{cases} m & \text{if } \ell = sm \text{ for } s \in \mathbb{Z}, \\[4pt] \dfrac{1 - \zeta_m^{\ell m}}{1 - \zeta_m^{\ell}} = 0 & \text{otherwise.} \end{cases} \tag{4}$$
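    To see the second case of (4): for integer $\ell \neq sm$ the sum is a geometric series whose numerator vanishes,

    $$\sum_{j=0}^{m-1} \zeta_m^{\ell j} = \frac{1 - \zeta_m^{\ell m}}{1 - \zeta_m^{\ell}} = \frac{1 - e^{2\pi i \ell}}{1 - \zeta_m^{\ell}} = 0,$$

    while for $\ell = sm$ every term equals $1$, giving $m$.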


  • FFT-based convolutions

    The convolution theorem works because

    $$\begin{aligned}
    \sum_{j=0}^{m-1} f_j g_j \zeta_m^{-jk}
    &= \sum_{j=0}^{m-1} \zeta_m^{-jk} \left(\sum_{p=0}^{m-1} \zeta_m^{jp} F_p\right) \left(\sum_{q=0}^{m-1} \zeta_m^{jq} G_q\right) \\
    &= \sum_{p=0}^{m-1} F_p \sum_{q=0}^{m-1} G_q \sum_{j=0}^{m-1} \zeta_m^{j(-k+p+q)} \\
    &= m \sum_s \sum_{p=0}^{m-1} F_p G_{k-p+sm}. \tag{5}
    \end{aligned}$$

    The terms with $s \neq 0$ are aliases; they are bad.


  • Conventional dealiasing: zero padding

    Let $\tilde F = \{F_0, F_1, \ldots, F_{m-2}, F_{m-1}, \underbrace{0, \ldots, 0}_{m}\}$. Then,

    $$\begin{aligned}
    \left(\tilde F \ast_{2m} \tilde G\right)_k
    &= \sum_{\ell=0}^{2m-1} \tilde F_{\ell \bmod 2m}\, \tilde G_{(k-\ell) \bmod 2m} \\
    &= \sum_{\ell=0}^{m-1} F_\ell\, \tilde G_{(k-\ell) \bmod 2m}
    = \sum_{\ell=0}^{k} F_\ell G_{k-\ell}. \tag{6}
    \end{aligned}$$

    There is also a "2/3" padded version for pseudospectral simulations, where the input $\{F_k\}_{k=-m}^{m-1}$ is padded to length $3m$.
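    A minimal sketch (not from the slides) of explicit dealiasing per equation (6), assuming the FFTW3 C API; the function name explicit_conv is hypothetical:

        #include <complex.h>
        #include <fftw3.h>

        /* Dealias a length-m convolution by explicitly zero-padding to 2m:
           transform to physical space (+ sign), multiply pointwise, transform
           back, and divide by 2m for FFTW's unnormalized transforms. */
        void explicit_conv(int m, const fftw_complex *F, const fftw_complex *G,
                           fftw_complex *H) {
            const int n = 2 * m;
            fftw_complex *f = fftw_alloc_complex(n);
            fftw_complex *g = fftw_alloc_complex(n);
            fftw_plan to_f   = fftw_plan_dft_1d(n, f, f, FFTW_BACKWARD, FFTW_ESTIMATE);
            fftw_plan to_g   = fftw_plan_dft_1d(n, g, g, FFTW_BACKWARD, FFTW_ESTIMATE);
            fftw_plan from_f = fftw_plan_dft_1d(n, f, f, FFTW_FORWARD,  FFTW_ESTIMATE);

            for (int k = 0; k < n; ++k) {   /* copy inputs and zero-pad */
                f[k] = k < m ? F[k] : 0.0;
                g[k] = k < m ? G[k] : 0.0;
            }
            fftw_execute(to_f);             /* transform padded F to physical space */
            fftw_execute(to_g);
            for (int j = 0; j < n; ++j)     /* pointwise product in physical space */
                f[j] *= g[j];
            fftw_execute(from_f);           /* back to Fourier space */
            for (int k = 0; k < m; ++k)     /* keep the m dealiased modes */
                H[k] = f[k] / n;

            fftw_destroy_plan(to_f);
            fftw_destroy_plan(to_g);
            fftw_destroy_plan(from_f);
            fftw_free(f);
            fftw_free(g);
        }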


  • Dealiasing with conventional zero-padding

    [Figure: schematic of conventional zero-padding: F and G are padded, transformed to f and g, multiplied pointwise as fg, and transformed back to F ∗ G.]

  • Dealiasing with implicit zero-padding

    We modify the FFT to account for the zeros implicitly.

    Let $\zeta_n = \exp(-2\pi i/n)$. The Fourier transform of $\tilde F$ is

    $$f_x = \sum_{k=0}^{2m-1} \zeta_{2m}^{xk} \tilde F_k = \sum_{k=0}^{m-1} \zeta_{2m}^{xk} F_k. \tag{7}$$

    We can compute this using two discontiguous buffers:

    $$f_{2x} = \sum_{k=0}^{m-1} \zeta_m^{xk} F_k, \qquad f_{2x+1} = \sum_{k=0}^{m-1} \zeta_m^{xk} \left(\zeta_{2m}^{k} F_k\right). \tag{8}$$
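    As a sanity check (not part of the slides), a short FFTW sketch comparing the explicitly padded transform (7) against the even/odd decomposition (8); FFTW_FORWARD uses the same sign convention as $\zeta_n = \exp(-2\pi i/n)$:

        #include <complex.h>
        #include <math.h>   /* M_PI (POSIX) */
        #include <stdio.h>
        #include <fftw3.h>

        int main(void) {
            const int m = 8, n = 2 * m;
            fftw_complex *pad  = fftw_alloc_complex(n);
            fftw_complex *even = fftw_alloc_complex(m);
            fftw_complex *odd  = fftw_alloc_complex(m);
            fftw_plan pn = fftw_plan_dft_1d(n, pad,  pad,  FFTW_FORWARD, FFTW_ESTIMATE);
            fftw_plan pe = fftw_plan_dft_1d(m, even, even, FFTW_FORWARD, FFTW_ESTIMATE);
            fftw_plan po = fftw_plan_dft_1d(m, odd,  odd,  FFTW_FORWARD, FFTW_ESTIMATE);

            for (int k = 0; k < m; ++k) {
                double complex Fk = k + 1;                 /* arbitrary test data */
                pad[k] = Fk;
                pad[k + m] = 0.0;                          /* explicit zero-padding */
                even[k] = Fk;                              /* m-point FFT gives f_{2x} */
                odd[k] = cexp(-2 * M_PI * I * k / n) * Fk; /* pre-twiddle: f_{2x+1} */
            }
            fftw_execute(pn);
            fftw_execute(pe);
            fftw_execute(po);
            for (int x = 0; x < m; ++x)   /* both differences should be ~0 */
                printf("%.2e %.2e\n", cabs(pad[2*x] - even[x]),
                                      cabs(pad[2*x+1] - odd[x]));
            return 0;
        }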


  • Dealiasing with implicit zero-padding

    [Figure (animation): starting from F and G, the even and odd half-transforms FFT⁻¹ₓ{F} and FFT⁻¹ₓ{G} are computed, multiplied, and reassembled into the even and odd parts of FFT⁻¹ₓ{F ∗ G}, yielding F ∗ G.]

  • Shared-memory implementation

    - Implicit dealiasing requires less memory.
    - We avoid FFTs on zero data.
    - By using discontiguous buffers, we can use multiple NUMA nodes.
    - SSE2 vectorization instructions.
    - Additional threads require additional sub-dimensional work buffers.
    - We use strides instead of transposes because we need to multi-thread.


  • Multi-threaded performance: 1D

    [Figure: performance, m log₂ m/time (ns⁻¹), vs m from 10² to 10⁶; curves: Implicit T=1, Implicit T=4, Explicit T=1, Explicit T=4.]

  • Multi-threaded performance: 2D

    [Figure: performance, m² log₂ m²/time (ns⁻¹), vs m from 10² to 10³; curves: Implicit T=1, Implicit T=4, Explicit T=1, Explicit T=4.]

  • Multi-threaded performance: 3D

    [Figure: performance, m³ log₂ m³/time (ns⁻¹), vs m from 10¹ to 10²; curves: Implicit T=1, Implicit T=4, Explicit T=1, Explicit T=4.]

  • Multi-threaded speedup: 3D

    [Figure: relative speed (1–7) vs m from 10¹ to 10²; curves: T=1, T=4.]

  • Distributed-memory implementation

    - Implicit dealiasing requires less communication.
    - By using discontiguous buffers, we can overlap communication and computation.
    - We use a hybrid OpenMP/MPI parallelization for clusters of multi-core machines.
    - 2D MPI data decomposition.
    - We make use of the hybrid transpose algorithm.


  • Hybrid MPI Transpose

    Matrix transpose is an essential primitive of high-performance computing: it localizes data on one process so that shared-memory algorithms can be applied.

    I will discuss two algorithms for transposes:

    - Direct transpose
    - Recursive transpose

    We combine these into a hybrid transpose.


  • Direct (AlltoAll) Transpose

    - Efficient for P ≪ m (large messages).
    - Most direct method.
    - Many small messages when P ≈ m.

    Implementations:

    - MPI_Alltoall
    - MPI_Send, MPI_Recv


  • Direct (AlltoAll) Transpose

    [Figure (animation): processes 0–7 exchange blocks directly, each process sending one message to every other process.]

  • Recursive Transpose

    - Efficient for P ≈ m (grouping yields larger messages).
    - Recursively subdivides the transpose into smaller block transposes.
    - log m phases.
    - Communications are grouped to reduce latency.
    - Requires intermediate communication.

    Implementations:

    - FFTW


  • Recursive Transpose

    [Figure (animation): processes 0–7 transpose in stages, first exchanging blocks within subgroups, then between subgroups.]

  • Hybrid Transpose

    - Recursive, but with just one level.
    - Use the empirical properties of the cluster to determine the best parameters.
    - Optionally group messages to reduce latency.

    Implementation:

    - FFTW++

    Direct transpose communication cost: $\frac{P-1}{P^2}\,m^2$, using $P$ messages.

    Hybrid cost with $P = ab$: $\frac{(a-1)b\,m^2}{P^2} + \frac{(b-1)a\,m^2}{P^2}$, using $a + b$ messages.


  • Hybrid Transpose

    Let $\tau_\ell$ be the message latency, and $\tau_d$ the time to send one element. The time to send $n$ elements is

    $$\tau_\ell + n \tau_d. \tag{9}$$

    The time required to do a direct transpose is

    $$T_D = \tau_\ell (P-1) + \tau_d \frac{P-1}{P^2} m^2 = (P-1)\left(\tau_\ell + \tau_d \frac{m^2}{P^2}\right). \tag{10}$$

    The time for a block transpose is

    $$T_B(a) = \tau_\ell \left(a + \frac{P}{a} - 2\right) + \tau_d \left(2P - a - \frac{P}{a}\right) \frac{m^2}{P^2}. \tag{11}$$
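    To show how this model can drive tuning (a sketch, not FFTW++'s actual tuner; the $\tau_\ell$, $\tau_d$ values below are placeholders to be measured per cluster), one can evaluate (10) and (11) over the factorizations $P = ab$ and pick the cheapest; setting $dT_B/da = 0$ suggests $a \approx \sqrt{P}$:

        #include <stdio.h>

        /* Direct-transpose time, equation (10). */
        static double TD(int P, double m, double tau_l, double tau_d) {
            return (P - 1) * (tau_l + tau_d * m * m / ((double)P * P));
        }

        /* Block-transpose time, equation (11), with P = a*b. */
        static double TB(int P, int a, double m, double tau_l, double tau_d) {
            double b = (double)P / a;
            return tau_l * (a + b - 2)
                 + tau_d * (2.0 * P - a - b) * m * m / ((double)P * P);
        }

        int main(void) {
            const int P = 96;        /* number of MPI processes */
            const double m = 4096;   /* matrix dimension */
            const double tau_l = 1e-5, tau_d = 1e-9; /* placeholder timings */
            double best = TD(P, m, tau_l, tau_d);
            int best_a = 1;          /* a = 1 means use the direct transpose */
            for (int a = 2; a < P; ++a) {
                if (P % a != 0) continue;    /* need an exact factorization */
                double t = TB(P, a, m, tau_l, tau_d);
                if (t < best) { best = t; best_a = a; }
            }
            printf("best a = %d, predicted time = %g s\n", best_a, best);
            return 0;
        }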


  • Hybrid Transpose

    [Figure: modeled communication cost (10⁴–10⁶) vs P (10¹–10³); curves: Zero Latency, Direct, Block.]

  • Hybrid Transpose

    [Figure: transpose time (µs, 2000–3000) vs nodes × threads (1024×1, 512×2, 256×4, 128×8) for a 4096² matrix; curves: FFTW, hybrid.]

  • Hybrid Transpose

    The hybrid transpose

    - Uses a direct transpose for large message sizes.
    - Uses a block transpose for small message sizes.
    - Offers a performance advantage when P ≈ m.
    - Can be tuned based upon the values of $\tau_\ell$ and $\tau_d$ for the cluster.

    We use the hybrid transpose in computing convolutions using implicit dealiasing on clusters.


  • MPI Convolution: 2D performance

    [Figure: performance, m² log₂ m²/time (ns⁻¹), vs m from 10² to 10⁴; curves: Implicit and Explicit for P=24, 48, 96, each with T=1.]

  • MPI Convolution: 2D performance

    [Figure: relative speed (3–12) vs m from 10² to 10⁴; curves: P=24 T=1, P=48 T=1, P=96 T=1.]

  • MPI Convolution: multithreaded 2D performance

    [Figure: performance, m² log₂ m²/time (ns⁻¹), vs m from 10² to 10⁴; curves: Implicit and Explicit for P=1, 2, 4, each with T=24.]

  • MPI Convolution: 3D performance

    [Figure: performance, m³ log₂ m³/time (ns⁻¹), vs m from 10² to 10³; curves: Implicit and Explicit for P=24, 48, 96, each with T=1.]

  • MPI Convolution: 3D performance

    [Figure: relative speed (2.5–3.5) vs m near 10²; curves: P=24 T=1, P=48 T=1, P=96 T=1.]

  • MPI Convolution: multithreaded 3D performance

    [Figure: performance, m³ log₂ m³/time (ns⁻¹), vs m from 10² to 10³; curves: Implicit and Explicit for P=1, 2, 4, each with T=24.]

  • MPI Convolution: 3D scaling

    [Figure: speedup (1–64) vs number of cores (24, 48, 96, 768, 1536) for problem sizes 512², 1024², 2048², 4096², 8192², 16384², 32768².]

  • Application: Pseudospectral simulation


  • Application: Pseudospectral simulation

    [Video: robertsetal_tubulent_helical_mhd.avi (video/avi)]

  • Convolutions Summary

    Implicitly dealiased convolutions:

    - use less memory,
    - have lower communication costs,
    - and are faster than conventional zero-padding techniques.

    The hybrid transpose is faster for small message sizes.

    Collaboration with John Bowman, University of Alberta.

    Implementation in the open-source project FFTW++:

    fftwpp.sf.net

    We have around 13 000 downloads (plus clones).


  • Running on GPUs

    Computing on general-purpose GPUs has two advantages:

    - High performance
    - Low energy consumption

    There are a variety of options for running on GPUs:

    - CUDA: libraries and tools available; Nvidia-only.
    - OpenMP 4.0: pragma-based, high-level.
    - OpenACC: being rolled into OpenMP.
    - OpenCL: similar to CUDA, but released later.
      - Works with all vendors; very flexible.
      - Runs on GPUs, CPUs, and MICs (Xeon Phi).


  • OpenCL

    One writes a normal program in which the code for the GPU is contained in a string.

    At run-time, the program:

    1. Selects the OpenCL platform(s) and device(s).

    2. Creates an OpenCL context and queue.

    3. Compiles the programs into kernels.

    4. Allocates buffers on the device.

    5. Launches kernels in the queue: managed with events.
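    A minimal host-side sketch of these five steps (an illustration assuming the standard OpenCL C API; error checks and cleanup are omitted for brevity):

        #include <CL/cl.h>
        #include <stdio.h>

        int main(void) {
            const char *src =
                "#pragma OPENCL EXTENSION cl_khr_fp64 : enable\n"
                "__kernel void mykernel(__global double *a, __global double *b) {\n"
                "    int i = get_global_id(0);\n"
                "    a[i] *= b[i];\n"
                "}\n";
            /* 1. Select a platform and a device. */
            cl_platform_id platform;
            cl_device_id device;
            clGetPlatformIDs(1, &platform, NULL);
            clGetDeviceIDs(platform, CL_DEVICE_TYPE_DEFAULT, 1, &device, NULL);
            /* 2. Create a context and a command queue. */
            cl_context ctx = clCreateContext(NULL, 1, &device, NULL, NULL, NULL);
            cl_command_queue q = clCreateCommandQueue(ctx, device, 0, NULL);
            /* 3. Compile the program and extract the kernel. */
            cl_program prog = clCreateProgramWithSource(ctx, 1, &src, NULL, NULL);
            clBuildProgram(prog, 1, &device, NULL, NULL, NULL);
            cl_kernel k = clCreateKernel(prog, "mykernel", NULL);
            /* 4. Allocate buffers on the device. */
            size_t n = 16;
            double a[16], b[16];
            for (size_t i = 0; i < n; ++i) { a[i] = (double)i; b[i] = 2.0; }
            cl_mem da = clCreateBuffer(ctx, CL_MEM_READ_WRITE | CL_MEM_COPY_HOST_PTR,
                                       n * sizeof(double), a, NULL);
            cl_mem db = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                                       n * sizeof(double), b, NULL);
            /* 5. Launch the kernel in the queue, then read back the result. */
            clSetKernelArg(k, 0, sizeof(cl_mem), &da);
            clSetKernelArg(k, 1, sizeof(cl_mem), &db);
            clEnqueueNDRangeKernel(q, k, 1, NULL, &n, NULL, 0, NULL, NULL);
            clEnqueueReadBuffer(q, da, CL_TRUE, 0, n * sizeof(double), a,
                                0, NULL, NULL);
            printf("a[1] = %g\n", a[1]);  /* expect 2 */
            return 0;
        }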


  • OpenCL

    Kernels are the code from the interior of loops. Example: the C code

        void myfunc(double *a, double *b, int n) {
            for (int i = 0; i < n; ++i) {
                a[i] *= b[i];
            }
        }

    becomes:

        __kernel void mykernel(__global double *a, __global double *b) {
            int i = get_global_id(0);
            a[i] *= b[i];
        }


  • OpenCL

    Since the kernel has no loop dependencies, everything is vectorized: even RAM buffers are aligned to the vector width.

    The __global keyword specifies that one uses the global device memory.

    One has access to the cache with __local; if one wants to have data in the cache, then one writes a loop to put it there, as in the sketch below.
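    For instance, a hypothetical kernel (not from the slides) that stages a tile of b into __local memory before using it; the host passes the tile as a NULL-pointer argument of the appropriate size via clSetKernelArg:

        __kernel void scale_with_tile(__global double *a,
                                      __global const double *b,
                                      __local double *tile) {
            int gid = get_global_id(0);
            int lid = get_local_id(0);
            tile[lid] = b[gid];               /* stage one element per work-item */
            barrier(CLK_LOCAL_MEM_FENCE);     /* wait until the tile is complete */
            a[gid] *= tile[lid];              /* use the cached copy */
        }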

    Coalescent memory access is crucial.

    So, one has a lot of control, but there’s a bit more work.

    But the performance is good!


  • OpenCL

    We developed a discontinuous-Galerkin code for solving hyperbolic conservation laws:

    schnaps

    Solver for Conservative Hyperbolic Non-linear systems Applied to PlasmaS

    $$\frac{\partial w}{\partial t} + \sum_{k=1}^{d} \partial_k F^k(w) = S \tag{12}$$
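    As an illustration (not from the slides), the 1D inviscid Burgers equation is the instance $w = u$, $d = 1$, $F^1(w) = w^2/2$, $S = 0$:

    $$\frac{\partial u}{\partial t} + \partial_x\!\left(\frac{u^2}{2}\right) = 0.$$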


  • schnaps


  • schnaps

    Discontinuous Galerkin method:

    - Deals well with complex geometries.
    - Local refinement: non-uniform grid.

    OpenCL implementation:

    - Hexahedral elements for coalescent memory access.
    - Macrocell/subcell formulation.
    - Array of structs of arrays: yet more coalescence.


  • schnaps


  • schnaps

    But, is it fast?


  • Performance analysis of schnaps

    [Figure: schnaps run time (seconds, 10⁻¹–10⁰) vs refinement (near 10¹); curves: C, OpenCL.]

  • Performance analysis of schnaps

    clFFT is an FFT library written in OpenCL by AMD.

    [Figure: clFFT time (seconds, 10⁻⁴–10⁻¹) vs problem size (10²–10³); curves: CPU, GPU.]

  • Performance analysis of schnaps

    [Figure: schnaps run time (seconds, near 10⁰) vs refinement (near 10¹); curves: CPU, GPU.]

  • Performance analysis of schnaps

    schnaps works well on the Xeon Phi.

    [Figure: time (s, 10¹–10³) vs N (10¹–10³); curves: 12 CPU, 1 GPU, 1 MIC.]

  • schnaps summary

    We observe that:

    1. The C code makes use of all the cores.

    2. The C and OpenCL code speeds on the CPU are close for large problem sizes.

    3. The performance difference of schnaps between the CPU and GPU is near what we should expect.

    Thus, we claim that our code makes effective use of the GPU.

    We can further improve the code by profiling.

    Collaboration with Philippe Helluy and TONUS, University of Strasbourg.


  • Example simulation: Maxwell’s equations


  • Conclusion

    I presented two projects:

    - FFTW++
      - Implicitly dealiased convolutions: faster, less memory.
      - OpenMP and/or MPI implementation.
      - Hybrid MPI transpose.
      - Application to a wide variety of situations.
    - schnaps
      - OpenCL implementation of the discontinuous Galerkin method.
      - Good performance on the CPU, GPU, and MIC.

    Thank you for your attention!


  • Timing statistics

    [Figure: histogram of convolution timings: frequency (10⁰–10⁴) vs time (5×10⁻⁵ to 7×10⁻⁵ s).]