GPU Implementation of CG Solver for MILC
Guochun Shi
Innovative Systems Laboratory
Lattice QCD: solving a large sparse linear system (the lattice Dirac equation M x = b)
The Wilson dslash operator
BU code: dslash reference implementation on the CPU
BU code: GPU kernel code (+X direction)
Disclaimer
The source code is from Boston University's QUDA package.
The diagrams and formulas are from two papers:
C. Bernard, C. DeTar, S. Gottlieb, U. M. Heller, J. Hetrick, N. Ishizuka, L. Karkkainen, S. R. Lantz, K. Rummukainen, R. Sugar, D. Toussaint and M. Wingate, "Lattice QCD on the IBM Scalable POWERparallel Systems SP2"
K. Z. Ibrahim, F. Bodin, "Efficient SIMDization and Data Management of the Lattice QCD Computation on the Cell Broadband Engine"
CG in BU code
Standard CG procedure from "An Introduction to the Conjugate Gradient Method Without the Agonizing Pain": solve A x = b (where A is the preconditioned matrix)
Variable names used in the BU code: r2, r2_old, p
A minimal sketch of the loop with these names is given below.
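The following is an illustrative sketch of that standard CG loop using the slide's variable names (r2, r2_old, p); it is not the actual BU/QUDA code, and matvec() is a caller-supplied stand-in for the preconditioned matrix A.

// Minimal CG sketch with the slide's variable names (r2, r2_old, p).
// matvec() stands in for the preconditioned matrix A; not the BU/QUDA code.
static float dot(const float *a, const float *b, int n) {
  float s = 0.0f; for (int i = 0; i < n; i++) s += a[i] * b[i]; return s;
}

void cg_sketch(void (*matvec)(float *out, const float *in, int n),
               const float *b, float *x, int n, int max_iter, float tol) {
  float *r = new float[n], *p = new float[n], *Ap = new float[n];
  matvec(Ap, x, n);                                   // Ap = A x0
  for (int i = 0; i < n; i++) { r[i] = b[i] - Ap[i]; p[i] = r[i]; }
  float r2 = dot(r, r, n);                            // r2 = |r|^2
  for (int k = 0; k < max_iter && r2 > tol; k++) {
    matvec(Ap, p, n);                                 // Ap = A p
    float alpha = r2 / dot(p, Ap, n);
    for (int i = 0; i < n; i++) { x[i] += alpha * p[i]; r[i] -= alpha * Ap[i]; }
    float r2_old = r2;                                // keep the old |r|^2
    r2 = dot(r, r, n);                                // new |r|^2
    float beta = r2 / r2_old;
    for (int i = 0; i < n; i++) p[i] = r[i] + beta * p[i];
  }
  delete[] r; delete[] p; delete[] Ap;
}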
CG code in BU code
Different solution types solve different equations:
QUDA_MAT_SOLUTION
QUDA_MATPC_SOLUTION
QUDA_MATPCDAG_MATPC_SOLUTION
Staggered dslash reference implementation
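As a hedged stand-in for that reference code, the fragment below sketches the core of a CPU staggered dslash for one site and one direction (the ±X fat-link hop). Neighbor indexing and boundary wrap-around are left to the caller, the long-link (third-neighbor) terms follow the same pattern, and all names are illustrative.

// Illustrative CPU fragment: accumulate the +X and -X fat-link contributions
// for one output site of the staggered dslash,
//   out(x) += U_x(x) * chi(x+X)  -  U_x(x-X)^dagger * chi(x-X).
// Complex numbers are (re, im) float pairs; a link is 18 floats (3x3 complex,
// row major), a spinor is 6 floats (1x3 complex).
static void su3_mat_vec(const float *u, const float *v, float *out) {
  for (int r = 0; r < 3; r++) {                  // out[r] = sum_c u[r][c] * v[c]
    float re = 0.0f, im = 0.0f;
    for (int c = 0; c < 3; c++) {
      const float *m = u + 2 * (3 * r + c);
      re += m[0] * v[2 * c] - m[1] * v[2 * c + 1];
      im += m[0] * v[2 * c + 1] + m[1] * v[2 * c];
    }
    out[2 * r] = re; out[2 * r + 1] = im;
  }
}

static void su3_adj_mat_vec(const float *u, const float *v, float *out) {
  for (int r = 0; r < 3; r++) {                  // out[r] = sum_c conj(u[c][r]) * v[c]
    float re = 0.0f, im = 0.0f;
    for (int c = 0; c < 3; c++) {
      const float *m = u + 2 * (3 * c + r);
      re += m[0] * v[2 * c] + m[1] * v[2 * c + 1];
      im += m[0] * v[2 * c + 1] - m[1] * v[2 * c];
    }
    out[2 * r] = re; out[2 * r + 1] = im;
  }
}

// fwd_link/back_link: fat links at x and x-X; fwd_spinor/back_spinor:
// neighbor spinors at x+X and x-X; out: 6-float accumulator for site x.
static void dslash_x_hop(const float *fwd_link, const float *back_link,
                         const float *fwd_spinor, const float *back_spinor,
                         float *out) {
  float tmp[6];
  su3_mat_vec(fwd_link, fwd_spinor, tmp);
  for (int i = 0; i < 6; i++) out[i] += tmp[i];
  su3_adj_mat_vec(back_link, back_spinor, tmp);
  for (int i = 0; i < 6; i++) out[i] -= tmp[i];
}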
Staggered dslash GPU implementation
Similar to the Wilson dslash
Link representation is the same: 5 float4 represent a 3x3 complex matrix (18 floats used, 2 floats unused)
But the staggered dslash has 2 kinds of links (fat and long); the Wilson dslash has only 1
Spinor representations differ slightly:
  Wilson: 6 float4 represent a 4x3 complex matrix (24 floats total)
  Staggered: 3 float2 represent a 1x3 complex vector (6 floats total)
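A hedged illustration of those packings (the struct names are mine, not the actual QUDA declarations):

// Illustrative packings for the representations listed above.
#include <cuda_runtime.h>

// 3x3 complex link: 18 floats padded to 5 float4 (20 floats, 2 unused),
// so every load is a full 128-bit float4.
struct PackedLink            { float4 chunk[5]; };

// Wilson spinor (4x3 complex): 24 floats = 6 float4.
struct PackedWilsonSpinor    { float4 c[6]; };

// Staggered spinor (1x3 complex): 6 floats = 3 float2.
struct PackedStaggeredSpinor { float2 c[3]; };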
Preliminary results
GPU results and the CPU reference do not match (yet)
Flops per site, with CM (complex multiplication) = 6 flops and CA (complex addition) = 2 flops:
  3*(3*CM + 2*CA)*8*2 + 15*(3*CA) = 1146 flops
In/out bytes per site (12-reconstruct): ((8*6*2) + 6 + (8*12*2)) * sizeof(float) = 1176 bytes
NVIDIA GTX 280: 106.7 GFLOPS
Bandwidth: 102.0 GB/s
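A quick check of that per-site arithmetic (plain host code, nothing GPU-specific; the breakdown in the comments is my reading of the formula above):

// Per-site cost quoted above: 16 SU(3) matrix-vector products (8 directions
// x fat+long links) plus the final sum of 16 three-component complex vectors.
#include <cstdio>

int main() {
  const int CM = 6, CA = 2;                      // flops per complex mul / add
  int flops = 3 * (3 * CM + 2 * CA) * 8 * 2      // 16 matrix-vector products
            + 15 * (3 * CA);                     // 15 adds of complex 3-vectors
  int bytes = ((8 * 6 * 2)                       // 16 neighbor spinors read
             + 6                                 // 1 result spinor written
             + (8 * 12 * 2))                     // 16 links, 12 floats each
             * (int)sizeof(float);
  printf("flops/site = %d, bytes/site = %d\n", flops, bytes);  // 1146, 1176
  return 0;
}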
Preliminary results (single precision only)
GPU and CPU results agree
Fixed small errors in both the CPU and GPU code
Conjugate Gradient (CG) works: it solves A x = b, where A is the preconditioned matrix
93 Gflops with the GTX 280
What's next
Optimize the single precision version on the GPU
Make other flavors work:
  8-reconstruct
  double precision / half precision, especially double precision because of the next GPU architecture
Multi-GPU / multi-node implementation for large lattice sizes
Incorporate the code into MILC (?)
Staggered dslash: CPU data layout
Each site contains:
  1 spinor (1x3 complex)
  4 fat links (3x3 complex)
  4 long links (3x3 complex)
Sites are divided into even and odd sites; for site (x,y,z,t):
  (x+y+z+t) % 2 == 0: even site
  (x+y+z+t) % 2 == 1: odd site
Total number of sites V = dimX * dimY * dimZ * dimT; half of the sites Vh = V/2
Fat links: an array of 4 pointers, one per direction (+X, +Y, +Z, +T); each link contains 18 floats (3x3x2)
Long links: same layout as the fat links
Spinor: one array of 6*V floats (each spinor is 6 = 3*2 floats), with the even sites stored first and the odd sites after them
A sketch of this layout is given below.
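A hedged sketch of that CPU-side layout (the lattice dimensions and identifiers are illustrative, not the MILC/BU names):

// Illustrative CPU layout for the staggered dslash fields; only the shapes
// follow the description above, the names and dimensions are made up.
#include <stdlib.h>

enum { DIM_X = 8, DIM_Y = 8, DIM_Z = 8, DIM_T = 8 };
enum { V = DIM_X * DIM_Y * DIM_Z * DIM_T, Vh = V / 2 };

float *fatlink[4];   // one pointer per direction (+X,+Y,+Z,+T); each points
                     // to 18*V floats (a 3x3 complex link for every site)
float *longlink[4];  // same layout as fatlink
float *spinor;       // 6*V floats: all even sites first, then all odd sites

static int parity(int x, int y, int z, int t) {  // 0 = even site, 1 = odd site
  return (x + y + z + t) & 1;
}

static void alloc_fields(void) {
  for (int d = 0; d < 4; d++) {
    fatlink[d]  = (float *)malloc(18 * V * sizeof(float));
    longlink[d] = (float *)malloc(18 * V * sizeof(float));
  }
  spinor = (float *)malloc(6 * V * sizeof(float));
}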
Spinor CPU -> GPU mapping
The CPU spinor array holds 6*V floats (even sites first, then odd sites); a CPU parity spinor holds 6*Vh floats.
On the GPU a parity spinor is stored as Vh float2 elements per color component, and the GPU kernel reads one spinor as three float2 loads; a sketch follows below.
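A hedged CUDA sketch of that spinor read (names and the exact ordering are illustrative, not the actual QUDA kernel):

// One thread per site of one parity.  The 6*Vh spinor floats are stored as
// three planes of Vh float2 each, so consecutive threads read consecutive
// float2 values and the loads coalesce.
__global__ void read_spinor_sketch(const float2 *in, float2 *out, int Vh) {
  int sid = blockIdx.x * blockDim.x + threadIdx.x;   // site index within the parity
  if (sid >= Vh) return;
  float2 c0 = in[sid + 0 * Vh];                      // color component 0 (re, im)
  float2 c1 = in[sid + 1 * Vh];                      // color component 1
  float2 c2 = in[sid + 2 * Vh];                      // color component 2
  // ... the dslash arithmetic would go here; this sketch just copies through
  out[sid + 0 * Vh] = c0;
  out[sid + 1 * Vh] = c1;
  out[sid + 2 * Vh] = c2;
}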
Link CPU -> GPU mapping
The CPU links are first reordered into an intermediate data format (the +X links and the Y/Z/T links are handled separately), then copied to the GPU, where each link is stored as float4 chunks with a stride of Vh (Vh * float4 per chunk).
With 12-reconstruct, only part of each 3x3 matrix is stored and the rest is rebuilt in the GPU kernel that reads the link; a sketch follows below.
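A hedged CUDA sketch of a 12-reconstruct link load (array layout and names are illustrative). Rebuilding the third row this way assumes the link is a true SU(3) matrix, which is why the long links later cause trouble (see the MILC interface slide near the end).

// 12-reconstruct: only rows 0 and 1 of the 3x3 matrix are stored
// (12 floats = 3 float4 per link); row 2 is rebuilt as the complex
// conjugate of the cross product of rows 0 and 1, valid only for SU(3).
__device__ void load_link_12(const float4 *gauge, int sid, int Vh,
                             float2 link[3][3]) {
  float4 a = gauge[sid + 0 * Vh];   // row 0: c00.re, c00.im, c01.re, c01.im
  float4 b = gauge[sid + 1 * Vh];   //        c02.re, c02.im, c10.re, c10.im
  float4 c = gauge[sid + 2 * Vh];   //        c11.re, c11.im, c12.re, c12.im
  link[0][0] = make_float2(a.x, a.y);  link[0][1] = make_float2(a.z, a.w);
  link[0][2] = make_float2(b.x, b.y);  link[1][0] = make_float2(b.z, b.w);
  link[1][1] = make_float2(c.x, c.y);  link[1][2] = make_float2(c.z, c.w);
  for (int i = 0; i < 3; i++) {        // row2[i] = conj((row0 x row1)[i])
    float2 u = link[0][(i + 1) % 3], v = link[1][(i + 2) % 3];
    float2 w = link[0][(i + 2) % 3], z = link[1][(i + 1) % 3];
    float re = (u.x * v.x - u.y * v.y) - (w.x * z.x - w.y * z.y);
    float im = (u.x * v.y + u.y * v.x) - (w.x * z.y + w.y * z.x);
    link[2][i] = make_float2(re, -im);
  }
}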
Progress in the last week
8-reconstruct works (for the long link); full load for the fat link works
  The long link is loaded using n float4 (n = 2 or 3)
  The fat link is loaded using m float2 (m = 9), so no bandwidth is wasted
Performance (8-reconstruct for the long link, full load for the fat link):
  Dslash: 97 Gflops, bandwidth achieved 97.9 GB/s
  CG: 86.7 Gflops
Optimization using shared memory
Links are not shared in the dslash computation
Each spinor is shared 16 times
Since the majority of the bandwidth requirement comes from the links, there is an upper limit even if we share the spinors perfectly, i.e. each spinor is loaded only once
Normal data requirement for each site (fat links loaded in full, long links 12-reconstruct): (8*6*2 + 6) + 8*18 + 8*12 = 342 floats
The best spinor-sharing strategy can reduce that to (1*6 + 6) + 8*18 + 8*12 = 252 floats, a 26.3% improvement (the arithmetic is worked out after this list)
Shared memory size is limited, so the number of spinors in shared memory is limited (16 KB can hold 682 spinors, approximately 6^4/2 sites)
Need to rearrange the data; probably need a 4-D tile to scan through the spinors
Implementation is nontrivial
Low-priority task
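The numbers above, checked with plain host arithmetic (nothing GPU-specific):

// Floats moved per site: 16 neighbor spinors + 1 result spinor, 8 fat links
// loaded in full (18 floats) and 8 long links with 12-reconstruct.
#include <cstdio>

int main() {
  int no_sharing = (8 * 6 * 2 + 6) + 8 * 18 + 8 * 12;  // 342 floats
  int perfect    = (1 * 6 + 6)     + 8 * 18 + 8 * 12;  // 252 floats
  printf("no sharing %d, perfect sharing %d, saving %.1f%%\n",
         no_sharing, perfect,
         100.0 * (no_sharing - perfect) / no_sharing);  // 26.3%
  printf("spinors per 16 KB of shared memory: %d\n",
         16 * 1024 / (6 * (int)sizeof(float)));         // 682
  return 0;
}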
Progress in the last week
Double/single/half precision all work
Need to know the range of the long link values in order to implement half precision; for now assume [-1, 1]
Mixed precision for spinor/gauge should work, but is not yet completely tested
The sloppy precision should also work in CG, but is not yet completely tested (the idea is sketched below)
Bug fix: feedback sent to BU
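A hedged sketch of the sloppy-precision idea, written as the generic defect-correction pattern (this is not the BU/QUDA implementation; matvec and cg_sloppy are caller-supplied stand-ins):

// Mixed-precision CG sketch: iterate in a sloppy (low) precision and
// periodically recompute the true residual in full precision, restarting
// the sloppy solve on the remaining defect.
static double norm2(const double *v, int n) {
  double s = 0.0; for (int i = 0; i < n; i++) s += v[i] * v[i]; return s;
}

void mixed_cg_sketch(void (*matvec)(double *out, const double *in, int n),
                     void (*cg_sloppy)(float *x, const float *b, int n),
                     const double *b, double *x, int n, double tol) {
  double *r  = new double[n];
  float  *rf = new float[n], *dxf = new float[n];
  for (int restart = 0; restart < 100; restart++) {
    matvec(r, x, n);                                   // r = A x in full precision
    for (int i = 0; i < n; i++) r[i] = b[i] - r[i];    // true residual
    if (norm2(r, n) < tol) break;
    for (int i = 0; i < n; i++) { rf[i] = (float)r[i]; dxf[i] = 0.0f; }
    cg_sloppy(dxf, rf, n);                             // approximate solve of A dx = r
    for (int i = 0; i < n; i++) x[i] += (double)dxf[i];  // accumulate the correction
  }
  delete[] r; delete[] rf; delete[] dxf;
}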
Dslash performance, GFLOPS (bandwidth in GB/s in parentheses):
                Double        Single        Half
8-reconstruct   17.4 (35.1)   97.1 (97.9)   152.5 (76.9)
12-reconstruct  32 (71.1)     87.6 (97.4)   143.8 (80)

CG performance (GFLOPS):
                Double   Single   Half
8-reconstruct   16.6     87.8     134.7
12-reconstruct  30       78.3     126.5

Convergence steps: 63 (double), 64 (single), 90 (half)

All tests run on a 24^3 * 32 lattice with a GTX 280.
CG performance
(spinor, link, recon, spinor_sloppy, link_sloppy, recon_sloppy): 108 combinations in total
Some typical flavor performance is shown in the table below
The residual is determined by the higher-accuracy spinor/link/recon
The Gflops and iteration counts are determined by the sloppy spinor/link/recon
Spinor  Link    Recon  Spinor sloppy  Link sloppy  Recon sloppy  Residual  Gflops  Iterations
double  double  12     double         double       12            1.88e-12  29.97   63
double  double  12     single         single       8             1.88e-12  79.58   64
double  double  12     half           half         8             2.02e-12  116.46  69
single  single  8      single         single       8             3.29e-07  86.68   64
single  single  8      half           half         8             3.30e-07  130.61  72
half    half    8      half           half         8             1.6e-03   134.91  90
CG in MILC
Standard CG procedure from "An Introduction to the Conjugate Gradient Method Without the Agonizing Pain": solve A x = b (where M = D + 2m)
Variable names used in the MILC code: rsq, oldrsq, cg_p, ttt, resid, a, beta, dest, src
The same iteration, written with these names, is sketched below.
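An illustrative version of one CG iteration written with those MILC variable names (this mirrors the earlier sketch and is not the actual ks_congrad code):

// One CG iteration with the MILC names: dest = solution, resid = residual,
// cg_p = search direction, ttt = matrix times cg_p, a = step length,
// beta = direction-update factor, rsq/oldrsq = new/old |resid|^2.
void cg_iteration_milc_names(void (*matvec)(float *out, const float *in, int n),
                             float *dest, float *resid, float *cg_p, float *ttt,
                             float *rsq, float *oldrsq, int n) {
  matvec(ttt, cg_p, n);                               // ttt = A * cg_p
  float pAp = 0.0f;
  for (int i = 0; i < n; i++) pAp += cg_p[i] * ttt[i];
  float a = *rsq / pAp;                               // step length
  for (int i = 0; i < n; i++) { dest[i] += a * cg_p[i]; resid[i] -= a * ttt[i]; }
  *oldrsq = *rsq;
  *rsq = 0.0f;
  for (int i = 0; i < n; i++) *rsq += resid[i] * resid[i];
  float beta = *rsq / *oldrsq;
  for (int i = 0; i < n; i++) cg_p[i] = resid[i] + beta * cg_p[i];
}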
Interface function to MILC
The right-hand side b is passed in @src
The initial guess for the solution is in @dest
Solve the system and return the solution in @dest
Direct CG performance (I)
Solves the system directly instead of the preconditioned form used above
Some typical flavor performance is shown in the table below
Some combinations do not converge within the maximum number of iterations (9999), e.g. (--sprec double --gprec double --recon 12 --sprec_sloppy half --gprec_sloppy double --recon_sloppy 12)
All non-converging runs involve half precision
CPU precision  Spinor  Link    Recon  Spinor sloppy  Link sloppy  Recon sloppy  Residual  Gflops  Iterations
double         double  double  12     double         double       12            8.34e-13  29.98   88
double         double  double  12     single         single       8             9.96e-13  78.94   88
double         double  double  12     half           half         8             1.13e-12  130.04  1808
double         single  single  8      single         single       8             1.70e-07  83.60   88
double         single  single  8      half           half         8             2.93e-07  131.09  1999
double         half    half    8      half           half         8             9.63e-04  131.65  3534
Direct CG performance (II)
CPU in single precision
The CPU precision has no effect on the accuracy
CPU precision  Spinor  Link    Recon  Spinor sloppy  Link sloppy  Recon sloppy  Residual  Gflops  Iterations
single         single  single  8      single         single       8             4.27e-07  83.66   88
single         single  single  8      half           half         8             4.27e-07  130.74  1692
single         half    half    8      half           half         8             9.62e-04  131.41  2508
Interface function to MILC
int ks_congrad_parity_gpu(su3_vector *t_src, su3_vector *t_dest,
                          quark_invert_control *qic, Real mass,
                          ferm_links_t *fn)
Replaces the function ks_congrad_parity() in MILC 7.6.3
The program runs
Results do not match the CPU
Reason: the long link is not reconstructed correctly
How to do it correctly?
Multi-mass CG solver
The standalone test program works for all precisions
  All solution precisions are good
Mixed-precision CG solver:
  Only the first solution's accuracy is good; the rest of the solutions are only as accurate as the sloppy precision
Interface function to MILC written but not yet tested
  A small test input is needed