-
Research Presentation for Computer
Modelling Group Ltd.
Malcolm Roberts
University of Strasbourg
2016-04-26
[email protected], www.malcolmiwroberts.com
-
Outline
- Convolutions
  - Implicitly dealiased FFT-based convolutions
  - Shared-memory implementation
  - Parallel OpenMP/MPI implementation
  - Pseudospectral simulations
- GPU programming
  - OpenCL
  - schnaps
  - Performance analysis
Malcolm Roberts malcolmiwroberts.com 2
malcolmiwroberts.com
-
FFT-based convolutions
The convolution of {F_k}_{k=0}^{m-1} and {G_k}_{k=0}^{m-1} is

(F \ast G)_k = \sum_{\ell=0}^{k} F_\ell G_{k-\ell}, \qquad k = 0, \dots, m-1. \tag{1}

[Figure: example input sequences F and G, and the resulting convolution F ∗ G.]
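As a sanity check, the truncated linear convolution of eq. (1) can be written out directly (a minimal Python sketch; the function name is mine):

```python
def convolve(F, G):
    """Truncated linear convolution: (F*G)_k = sum_{l=0}^{k} F_l * G_{k-l}."""
    m = len(F)
    return [sum(F[l] * G[k - l] for l in range(k + 1)) for k in range(m)]

# First m coefficients of the product of 1 + 2x + 3x^2 and 4 + 5x + 6x^2:
assert convolve([1, 2, 3], [4, 5, 6]) == [4, 13, 28]
```

This O(m²) loop is what the FFT-based method replaces.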
-
FFT-based convolutions
Applications:
- Signal processing
- Machine learning: convolutional neural networks
- Image processing
- Particle image velocimetry
- Pseudospectral simulations of nonlinear PDEs
The convolution theorem:
\mathcal{F}[F \ast G] = \mathcal{F}[F] \odot \mathcal{F}[G]. \tag{2}
Using FFTs improves speed and accuracy.
-
FFT-based convolutions
Let ζ_m = exp(2πi/m). The forward and backward Fourier transforms are given by

f_j = \sum_{k=0}^{m-1} \zeta_m^{jk} F_k, \qquad F_k = \frac{1}{m} \sum_{j=0}^{m-1} \zeta_m^{-kj} f_j. \tag{3}

We will use the identity

\sum_{j=0}^{m-1} \zeta_m^{\ell j} =
\begin{cases} m & \text{if } \ell = sm \text{ for } s \in \mathbb{Z}, \\ \frac{1-\zeta_m^{\ell m}}{1-\zeta_m^{\ell}} = 0 & \text{otherwise.} \end{cases} \tag{4}
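Identity (4) is easy to verify numerically (a throwaway sketch; names are mine):

```python
import cmath

def zeta_sum(m, ell):
    """sum_{j=0}^{m-1} zeta_m^{ell*j}, with zeta_m = exp(2*pi*i/m)."""
    zeta = cmath.exp(2j * cmath.pi / m)
    return sum(zeta ** (ell * j) for j in range(m))

# ell a multiple of m: the sum is m; otherwise it vanishes.
assert abs(zeta_sum(8, 16) - 8) < 1e-9
assert abs(zeta_sum(8, 3)) < 1e-9
```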
-
FFT-based convolutions
The convolution theorem works because
\sum_{j=0}^{m-1} f_j g_j \zeta_m^{-jk}
= \sum_{j=0}^{m-1} \zeta_m^{-jk} \left( \sum_{p=0}^{m-1} \zeta_m^{jp} F_p \right) \left( \sum_{q=0}^{m-1} \zeta_m^{jq} G_q \right)
= \sum_{p=0}^{m-1} F_p \sum_{q=0}^{m-1} G_q \sum_{j=0}^{m-1} \zeta_m^{j(-k+p+q)}
= m \sum_{s} \sum_{p=0}^{m-1} F_p G_{k-p+sm}. \tag{5}

The terms with s ≠ 0 are aliases; they are bad.
-
Conventional dealiasing: zero padding
Let \tilde{F} = \{F_0, F_1, \dots, F_{m-2}, F_{m-1}, \underbrace{0, \dots, 0}_{m}\}. Then

(\tilde{F} \ast_{2m} \tilde{G})_k
= \sum_{\ell=0}^{2m-1} \tilde{F}_{\ell \bmod 2m} \tilde{G}_{(k-\ell) \bmod 2m}
= \sum_{\ell=0}^{m-1} F_\ell \tilde{G}_{(k-\ell) \bmod 2m}
= \sum_{\ell=0}^{k} F_\ell G_{k-\ell}. \tag{6}

There is also a “2/3” padded version for pseudospectral simulations, where the input {F_k}_{k=-m}^{m-1} is padded to length 3m.
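Eq. (6) can be checked in a few lines: the length-m cyclic convolution is contaminated by aliases, while padding to 2m recovers the truncated linear convolution (a plain-Python sketch; names are mine):

```python
def cyclic_convolve(F, G):
    """Cyclic (circular) convolution of two equal-length sequences."""
    n = len(F)
    return [sum(F[l] * G[(k - l) % n] for l in range(n)) for k in range(n)]

F = [1, 2, 3]
G = [4, 5, 6]
padded = cyclic_convolve(F + [0] * 3, G + [0] * 3)[:3]   # dealiased
aliased = cyclic_convolve(F, G)                          # contains s != 0 terms
assert padded == [4, 13, 28]   # matches the truncated linear convolution
assert aliased != padded       # the length-m cyclic result is contaminated
```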
-
Dealiasing with conventional zero-padding
[Figure: conventional dealiasing: F and G are zero-padded, transformed to f and g, multiplied pointwise to give fg, and transformed back to yield F ∗ G.]
-
Dealiasing with implicit zero-padding
We modify the FFT to account for the zeros implicitly.
Let ζ_n = exp(−2πi/n). The Fourier transform of F̃ is

f_x = \sum_{k=0}^{2m-1} \zeta_{2m}^{xk} \tilde{F}_k = \sum_{k=0}^{m-1} \zeta_{2m}^{xk} F_k. \tag{7}

We can compute this using two discontiguous buffers:

f_{2x} = \sum_{k=0}^{m-1} \zeta_m^{xk} F_k, \qquad f_{2x+1} = \sum_{k=0}^{m-1} \zeta_m^{xk} \left( \zeta_{2m}^{k} F_k \right). \tag{8}
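Eq. (8) can be checked against an explicit 2m-point transform of the padded input, here with a naive DFT (a plain-Python sketch using the exp(+2πi/n) sign convention; names are mine):

```python
import cmath

def dft(F):
    """Naive O(m^2) DFT: f_j = sum_k F_k * exp(2*pi*i*j*k/m)."""
    m = len(F)
    return [sum(F[k] * cmath.exp(2j * cmath.pi * j * k / m)
                for k in range(m)) for j in range(m)]

m = 4
F = [1.0, 2.0, 3.0, 4.0]
# Explicit: one 2m-point transform of the zero-padded input.
full = dft(F + [0.0] * m)
# Implicit: two m-point transforms, as in eq. (8).
zeta2m = cmath.exp(2j * cmath.pi / (2 * m))
even = dft(F)                                      # f_{2x}
odd = dft([zeta2m ** k * F[k] for k in range(m)])  # f_{2x+1}
for x in range(m):
    assert abs(full[2 * x] - even[x]) < 1e-9
    assert abs(full[2 * x + 1] - odd[x]) < 1e-9
```

No zeros are ever stored or transformed; this is the point of the implicit approach.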
-
Dealiasing with implicit zero-padding
[Figure (animation): starting from F and G, the sub-transforms FFT⁻¹ₓ{F} and FFT⁻¹ₓ{G} are computed for nₓ even and nₓ odd, the results are multiplied, and the even and odd parts of FFT⁻¹ₓ{F ∗ G} are recombined to yield F ∗ G.]
-
Shared-memory implementation
- Implicit dealiasing requires less memory.
- We avoid FFTs on zero data.
- By using discontiguous buffers, we can use multiple NUMA nodes.
- SSE2 vectorization instructions.
- Additional threads require additional sub-dimensional work buffers.
- We use strides instead of transposes because we need to multi-thread.
-
Multi-threaded performance: 1D
[Figure: performance, m log₂ m / time, (ns)⁻¹, vs m = 10² to 10⁶: Implicit T=1, Implicit T=4, Explicit T=1, Explicit T=4.]
-
Multi-threaded performance: 2D
[Figure: performance, m² log₂ m² / time, (ns)⁻¹, vs m = 10² to 10³: Implicit T=1, Implicit T=4, Explicit T=1, Explicit T=4.]
-
Multi-threaded performance: 3D
[Figure: performance, m³ log₂ m³ / time, (ns)⁻¹, vs m = 10¹ to 10²: Implicit T=1, Implicit T=4, Explicit T=1, Explicit T=4.]
-
Multi-threaded speedup: 3D
[Figure: relative speed (1 to 7) vs m = 10¹ to 10², comparing T=1 and T=4.]
-
Distributed-memory implementation
- Implicit dealiasing requires less communication.
- By using discontiguous buffers, we can overlap communication and computation.
- We use a hybrid OpenMP/MPI parallelization for clusters of multi-core machines.
- 2D MPI data decomposition.
- We make use of the hybrid transpose algorithm.
-
Hybrid MPI Transpose
The matrix transpose is an essential primitive of high-performance computing.
Transposes localize data on one process so that shared-memory algorithms can be applied.
I will discuss two algorithms for transposes:
- Direct transpose
- Recursive transpose
We combine these into a hybrid transpose.
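To fix ideas, the data movement of a direct distributed transpose can be modelled in a few lines, with each "process" holding one row of sub-blocks (a sketch with hypothetical names; a real implementation exchanges the blocks with MPI_Alltoall):

```python
def transpose(A):
    """Out-of-place transpose of a rectangular matrix (nested lists)."""
    return [list(row) for row in zip(*A)]

def direct_transpose(blocks):
    """blocks[p][q] is the (p, q) sub-block held by process p.
    The all-to-all exchange delivers block (p, q) to process q,
    which transposes it locally: result[q][p] = T(blocks[p][q])."""
    P = len(blocks)
    return [[transpose(blocks[p][q]) for p in range(P)] for q in range(P)]

# A 4x4 matrix split into 2x2 blocks over "processes" 0 and 1 (block rows).
blocks = [[[[1, 2], [5, 6]], [[3, 4], [7, 8]]],
          [[[9, 10], [13, 14]], [[11, 12], [15, 16]]]]
out = direct_transpose(blocks)
assert out[0][0] == [[1, 5], [2, 6]]     # diagonal block, transposed in place
assert out[0][1] == [[9, 13], [10, 14]]  # received from process 1
```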
-
Direct (AlltoAll) Transpose
- Most direct method.
- Efficient for P ≪ m (large messages).
- Many small messages when P ≈ m.
Implementations:
- MPI_Alltoall
- MPI_Send, MPI_Recv
-
Direct (AlltoAll) Transpose
[Figure (animation): processes 0-7 exchange blocks directly, each process sending one block to every other process.]
-
Recursive Transpose
- Efficient when P ≈ m (small messages are grouped into larger ones).
- Recursively subdivides the transpose into smaller block transposes.
- log m phases.
- Communications are grouped to reduce latency.
- Requires intermediate communication.
Implementations:
- FFTW
-
Recursive Transpose
[Figure (animation): processes 0-7 exchange blocks in stages, first within sub-groups of processes and then between them.]
-
Hybrid Transpose
- Recursive, but with just one level.
- Uses the empirical properties of the cluster to determine the best parameters.
- Optionally groups messages to reduce latency.
Implementation:
- FFTW++
Direct transpose communication cost: (P − 1) m²/P², with P − 1 messages.
Hybrid cost with P = ab: (a − 1) b m²/P² + (b − 1) a m²/P², with a + b − 2 messages.
-
Hybrid Transpose
Let τ_ℓ be the message latency and τ_d the time to send one element. The time to send n elements is

\tau_\ell + n \tau_d. \tag{9}

The time required for a direct transpose is

T_D = \tau_\ell (P-1) + \tau_d \frac{(P-1) m^2}{P^2} = (P-1) \left( \tau_\ell + \tau_d \frac{m^2}{P^2} \right). \tag{10}

The time for a block transpose is

T_B(a) = \tau_\ell \left( a + \frac{P}{a} - 2 \right) + \tau_d \left( 2P - a - \frac{P}{a} \right) \frac{m^2}{P^2}. \tag{11}
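Given eqs. (10) and (11), tuning reduces to minimizing T_B over the divisors a of P and comparing against T_D (a sketch; the cluster constants below are made up for illustration):

```python
def T_direct(P, m, tl, td):
    # Eq. (10): (P-1) messages of m^2/P^2 elements each.
    return (P - 1) * (tl + td * m * m / P**2)

def T_block(P, a, m, tl, td):
    # Eq. (11): block transpose with P = a*b, b = P/a.
    return tl * (a + P / a - 2) + td * (2 * P - a - P / a) * m * m / P**2

def best_block(P, m, tl, td):
    """Divisor a of P minimizing the block-transpose time."""
    divisors = [a for a in range(2, P) if P % a == 0]
    return min(divisors, key=lambda a: T_block(P, a, m, tl, td))

# Hypothetical constants: when P is comparable to m, latency dominates
# and the block transpose beats the direct one.
P, m, tl, td = 64, 64, 1e-4, 1e-8
a = best_block(P, m, tl, td)
assert T_block(P, a, m, tl, td) < T_direct(P, m, tl, td)
```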
-
Hybrid Transpose
[Figure: communication cost (10⁴ to 10⁶) vs P = 10¹ to 10³, comparing zero latency, direct, and block transposes.]
-
Hybrid Transpose
[Figure: transpose time (µs) vs nodes × threads (1024×1, 512×2, 256×4, 128×8) for a 4096² matrix: FFTW vs hybrid.]
-
Hybrid Transpose
The hybrid transpose
- uses a direct transpose for large message sizes,
- uses a block transpose for small message sizes,
- offers a performance advantage when P ≈ m,
- can be tuned based upon the values of τ_ℓ and τ_d for the cluster.
We use the hybrid transpose when computing convolutions with implicit dealiasing on clusters.
-
MPI Convolution: 2D performance
[Figure: performance, m² log₂ m² / time, (ns)⁻¹, vs m = 10² to 10⁴: Implicit and Explicit for P=24, 48, 96, T=1.]
-
MPI Convolution: 2D performance
[Figure: relative speed (3 to 12) vs m = 10² to 10⁴ for P=24, 48, 96, T=1.]
-
MPI Convolution: multithreaded 2D performance
[Figure: performance, m² log₂ m² / time, (ns)⁻¹, vs m = 10² to 10⁴: Implicit and Explicit for P=1, 2, 4, T=24.]
-
MPI Convolution: 3D performance
[Figure: performance, m³ log₂ m³ / time, (ns)⁻¹, vs m = 10² to 10³: Implicit and Explicit for P=24, 48, 96, T=1.]
-
MPI Convolution: 3D performance
[Figure: relative speed (2.5 to 3.5) vs m ≈ 10² for P=24, 48, 96, T=1.]
-
MPI Convolution: multithreaded 3D performance
[Figure: performance, m³ log₂ m³ / time, (ns)⁻¹, vs m = 10² to 10³: Implicit and Explicit for P=1, 2, 4, T=24.]
-
MPI Convolution: 3D scaling
[Figure: speedup (1 to 64) vs number of cores (24, 48, 96, 768, 1536) for problem sizes 512², 1024², 2048², 4096², 8192², 16384², and 32768².]
-
Application: Pseudospectral simulation
-
Application: Pseudospectral simulation
[Media file: robertsetal_tubulent_helical_mhd.avi (video/avi)]
-
Convolutions Summary
Implicitly dealiased convolutions:
- use less memory,
- have lower communication costs,
- and are faster than conventional zero-padding techniques.
The hybrid transpose is faster for small message sizes.
Collaboration with John Bowman, University of Alberta.
Implementation in the open-source project FFTW++: fftwpp.sf.net
We have around 13 000 downloads (plus clones).
-
Running on GPUs
Computing on general-purpose GPUs has two advantages:
- High performance
- Low energy consumption
There are a variety of options for running on GPUs:
- CUDA: libraries and tools available; Nvidia-only.
- OpenMP 4.0: pragma-based, high-level.
- OpenACC: being rolled into OpenMP.
- OpenCL: similar to CUDA, but released later.
  - Works with all vendors; very flexible.
  - Runs on GPUs, CPUs, and MICs (Xeon Phi).
-
OpenCL
One writes a normal program in which the code for the GPU is contained in a string.
At run time, the program:
1. Selects the OpenCL platform(s) and device(s).
2. Creates an OpenCL context and queue.
3. Compiles the programs into kernels.
4. Allocates buffers on the device.
5. Launches kernels in the queue, managed with events.
-
OpenCL
Kernels are the code from the interior of loops.
Example: the C code

    void myfunc(double *a, double *b, int n) {
      for (int i = 0; i < n; ++i) {
        a[i] *= b[i];
      }
    }

becomes:

    __kernel void mykernel(__global double *a,
                           __global double *b) {
      int i = get_global_id(0);
      a[i] *= b[i];
    }
-
OpenCL
Since the kernel has no loop dependencies, everything is vectorized: even RAM buffers are aligned to the vector width.
The __global keyword specifies that one uses the global device memory.
One has access to the cache with __local; if one wants data in the cache, one writes a loop to put it there.
Coalescent memory access is crucial.
So one has a lot of control, but there is a bit more work.
But the performance is good!
-
OpenCL
We developed a discontinuous Galerkin code for solving hyperbolic conservation laws:
schnaps
Solver for Conservative Hyperbolic Non-linear systems Applied to PlasmaS

\frac{\partial w}{\partial t} + \sum_{k=1}^{d} \frac{\partial}{\partial x^k} F^k(w) = S. \tag{12}
-
schnaps
-
schnaps
Discontinuous Galerkin method:
- Deals well with complex geometries.
- Local refinement: non-uniform grids.
OpenCL implementation:
- Hexahedral elements for coalescent memory access.
- Macrocell/subcell formulation.
- Arrays of structs of arrays: yet more coalescence.
-
schnaps
-
schnaps
But, is it fast?
-
Performance analysis of schnaps
[Figure: time (seconds, 10⁻¹ to 10⁰) vs refinement (≈10¹): C vs OpenCL.]
-
Performance analysis of schnaps
clFFT, an FFT library written in OpenCL by AMD.
[Figure: time (seconds, 10⁻⁴ to 10⁻¹) vs problem size (10² to 10³) for clFFT on the CPU and GPU.]
-
Performance analysis of schnaps
[Figure: time (seconds, ≈10⁰) vs refinement (≈10¹): CPU vs GPU.]
-
Performance analysis of schnaps
schnaps works well on the Xeon Phi.
[Figure: time (s, 10¹ to 10³) vs N (10¹ to 10³): 12 CPUs, 1 GPU, 1 MIC.]
-
schnaps summary
We observe that:
1. The C code makes use of all the cores.
2. The C and OpenCL code speeds on the CPU are close for large problem sizes.
3. The performance difference of schnaps between the CPU and GPU is near what we should expect.
Thus, we claim that our code makes effective use of the GPU.
We can further improve the code by profiling.
Collaboration with Philippe Helluy and TONUS, University of Strasbourg.
-
Example simulation: Maxwell’s equations
-
Conclusion
I presented two projects:
- FFTW++
  - Implicitly dealiased convolutions: faster, less memory.
  - OpenMP and/or MPI implementation.
  - Hybrid MPI transpose.
  - Application to a wide variety of situations.
- schnaps
  - OpenCL implementation of the discontinuous Galerkin method.
  - Good performance on the CPU, GPU, and MIC.
Thank you for your attention!
-
Timing statistics
[Figure: timing histogram: frequency (10⁰ to 10⁴) vs time (s), roughly 5×10⁻⁵ to 7×10⁻⁵.]