Linear Algebra on GPUs
Vasily Volkov
GPU Architecture Features
• SIMD architecture
  – Don't be confused by the scalar ISA, which is only a programming model
    • We use a vector thread programming model instead (see the sketch after this slide)
  – Similar to SSE/Cell SPE/vector units, not superscalar units
  – Native vector length is 32; larger lengths can be simulated (thread blocks)
• Multithreading using register windows
  – A context switch never flushes registers to memory
  – If more threads than fit, some don't start until others finish
• Huge register files, small high-latency caches
  – A fundamental difference from CPUs; similar to vector processors
  – Caches are there to save memory bandwidth, not to reduce latency
  – Use register blocking, not cache blocking
• Large startup costs (≈5 μs per kernel launch)
  – Not good for fine-grain computations
  – Use global barriers instead? (≈1.5 μs)
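As a hedged illustration of the vector thread view above (not code from the talk): in the sketch below, one thread block of 64 threads acts as a single vector thread of simulated vector length 64, i.e. two native 32-wide warps. The kernel name and the SAXPY example are illustrative only.

```cuda
// Illustrative only: one thread block = one "vector thread"; its 64 scalar
// threads are the lanes of a simulated vector of length 64 (two warps of 32).
__global__ void saxpy_vector_thread(int n, float a, const float *x, float *y)
{
    // blockIdx.x selects the vector thread, threadIdx.x is the lane (0..63)
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        y[i] = a * x[i] + y[i];   // one elementwise vector operation
}

// launch: one vector thread per 64 elements
// saxpy_vector_thread<<<(n + 63) / 64, 64>>>(n, a, d_x, d_y);
```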
Relation to Other Multicores

  Processor       Core2 SSE, 2.66GHz   Cell SPEs   G80/G84
  Gflop/s/core    21.3                 25.6        19-23
  Vector length   4                    4           32
  # of cores      2-4                  8           4-16

• All offer both multithreading and SIMD capabilities
• Use CUDA to program all of them?
Pointer Chase Benchmark, 8800GTX

• Run k = A[k] in a loop in a scalar thread (latency-bound; see the sketch after this slide)
• Larger latency at a cache miss reveals the cache sizes
• 8-word cache line
• L1: 8 x 5 kB, each 20-way associative
• L2: 6 x 32 kB, each 4-way associative
• 512 kB memory pages; TLB: 16 entries, fully associative
• 8800GTX/8800GTS/8600GTS/FX5600: different numbers of similar caches
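A minimal sketch of the pointer-chase benchmark described above, assuming the host pre-fills A so that A[k] holds the index of the next element at the stride under test; the kernel and parameter names are illustrative.

```cuda
// Illustrative sketch: one scalar thread chases pointers through A.
__global__ void pointer_chase(const int *A, int niters, int *out)
{
    int k = 0;
    for (int i = 0; i < niters; ++i)
        k = A[k];        // each load depends on the previous one: pure latency
    *out = k;            // keep the result live so the loop is not optimized away
}

// run in a single scalar thread and time it:
// pointer_chase<<<1, 1>>>(d_A, niters, d_out);
// the average time per iteration gives the access latency; jumps as the stride
// and footprint grow expose the cache-line size, cache sizes, and TLB reach.
```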
GPU Memory System (8800GTX)
Matrix-Matrix Multiply, C = C + AB
• 64x16 blocks in C, rank-4 updates
  – Ensures that our code is compute-bound
• Each vector thread (thread block) computes one block of C
  – ijk/jik form; other choices produce a race condition on C
• Keep A's and C's blocks in vector registers
  – Similarly done on the IBM 3090 VF and the Cray X1
• Keep B's block in local memory
  – Others keep it in scalar registers (n/a on the GPU) or in caches (slow)
• Use compiler options to enforce a tight register budget
  – To hide memory latencies better by multithreading
• Use prefetching to hide latencies even better
  – Performance is then not bound by the latency or bandwidth of reading the blocks of A and B!
  – It is bound by instruction issue and local memory ops (230 Gflop/s)
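A minimal CUDA sketch of the scheme above, assuming column-major storage and matrix dimensions divisible by the tile sizes. It shows the 64x16 register-blocked C tile, the B block staged in local (shared) memory, and rank-4 updates, but it is not the actual CUBLAS 2.0 kernel (no prefetching, no register-budget tuning), and all names (TILE_M, TILE_N, KSTEP, sgemm_sketch) are illustrative.

```cuda
#include <cuda_runtime.h>

#define TILE_M 64   // rows of the C block per vector thread (thread block)
#define TILE_N 16   // columns of the C block
#define KSTEP  16   // width of the B panel staged in local (shared) memory

// One thread block of 64 threads computes a 64x16 block of C = C + A*B.
// Thread t owns row t of the block and keeps its 16 partial sums in registers.
__global__ void sgemm_sketch(int k,
                             const float *A, int lda,
                             const float *B, int ldb,
                             float *C, int ldc)
{
    __shared__ float Bs[KSTEP][TILE_N];            // B's block in local memory

    int row  = blockIdx.x * TILE_M + threadIdx.x;  // row of C owned by this thread
    int col0 = blockIdx.y * TILE_N;                // first column of this C block

    float c[TILE_N];                               // C's block in registers
    for (int j = 0; j < TILE_N; ++j)
        c[j] = C[row + (col0 + j) * ldc];

    for (int kk = 0; kk < k; kk += KSTEP) {
        // Cooperatively stage a 16x16 block of B: 64 threads load 4 words each.
        for (int i = threadIdx.x; i < KSTEP * TILE_N; i += TILE_M)
            Bs[i % KSTEP][i / KSTEP] =
                B[(kk + i % KSTEP) + (col0 + i / KSTEP) * ldb];
        __syncthreads();

        // Rank-4 updates: fetch 4 elements of A into registers, multiply into c[].
        for (int p = 0; p < KSTEP; p += 4) {
            float a0 = A[row + (kk + p + 0) * lda];
            float a1 = A[row + (kk + p + 1) * lda];
            float a2 = A[row + (kk + p + 2) * lda];
            float a3 = A[row + (kk + p + 3) * lda];
            for (int j = 0; j < TILE_N; ++j)
                c[j] += a0 * Bs[p + 0][j] + a1 * Bs[p + 1][j]
                      + a2 * Bs[p + 2][j] + a3 * Bs[p + 3][j];
        }
        __syncthreads();
    }

    for (int j = 0; j < TILE_N; ++j)               // write the updated block back
        C[row + (col0 + j) * ldc] = c[j];
}

// launch: dim3 grid(m / TILE_M, n / TILE_N); sgemm_sketch<<<grid, TILE_M>>>(...);
```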
Performance of SGEMM
CUBLAS 1.1:
• keeps square blocks of A and B in local memory
• uses long vectors (poor instruction mix)
• exposing too much data parallelism may cost you

Our SGEMM is now in CUBLAS 2.0 beta.

SGEMM, 8800GTX, k = 1024:
• Constant work per vector thread (a function of k)
• The optimized version does better load balancing by computing partial sums
Panel Factorization
• CPU: runtime on a Core2 Duo 2.66 GHz with Intel MKL 10.0 (includes the CPU-GPU transfer!)
• GPU: estimated for the 8800GTX as
    Time ≈ 5 μs × (number of kernel launches) + (bandwidth required) / (75 GB/s)
Design of Matrix Factorizations
• Right-looking scheme = most parallelism = best on 16-core GPUs
• Crout scheme = least bandwidth = best on 4-core GPUs and if using CUBLAS 1.1
• Left-looking = half of the work in triangular solve = limited parallelism = inefficient
• 2-level blocking
  – Both levels are right-looking + premature exit from the finer level to keep …
  – Up to 6% speedup only, at large matrices (n ≈ 10,000)
• Block size on the GPU is 64 (same as in matrix multiply)
  – Autotuning in QR (up to 7% speedup)
• Row-major layout on the GPU in the LU decomposition
  – Since gathers with a large stride are inefficient
  – Requires transposition at every CPU-GPU transfer
  – >2x speedup!
• Panel factorization on the CPU overlapped with BLAS3 on the GPU (use lookahead)
• Multiply by the inverse (GEMM) instead of triangular solve (TRSM); see the sketch below
  – TRSM vs. GEMM is 13 Gflop/s vs. 160 Gflop/s when the matrix is 64x64
  – The parallelism in TRSM is not embarrassing enough
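A hedged sketch of the multiply-by-inverse trick above: the 64x64 triangular diagonal block is inverted once (e.g. on the CPU) and then applied with a single GEMM, so the wide update runs at matrix-multiply speed. The function and pointer names are illustrative, and the cuBLAS v2 host API is assumed rather than the original CUBLAS 1.x interface.

```cuda
#include <cublas_v2.h>

// Computes X = inv(L11) * A12 in place of a TRSM, where L11 is the 64x64 lower
// triangular diagonal block (already inverted and stored in d_invL11) and A12
// is a 64 x n panel; all matrices are column-major and resident on the GPU.
void apply_inverse_instead_of_trsm(cublasHandle_t handle,
                                   const float *d_invL11, int ld_inv,
                                   const float *d_A12,    int lda12,
                                   float *d_X,            int ldx,
                                   int n)
{
    const float one = 1.0f, zero = 0.0f;
    // A single GEMM exposes far more parallelism than a 64-wide triangular
    // solve (13 vs. 160 Gflop/s in the numbers quoted above).
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                64, n, 64,
                &one,  d_invL11, ld_inv,
                       d_A12,    lda12,
                &zero, d_X,      ldx);
}
```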
Test Platform
• GPU
  – Core2 Duo 2.67 GHz + GeForce 8800 GTX
  – Core2 Duo 2.67 GHz + two GeForce 8800 GTX
• CPU
  – Core2 Duo 2.67 GHz
  – Core2 Quad 2.4 GHz
Summary of Performance
Speedups vs. CPU
Summary of Speedups
             8800GTX    Core2 Duo           Core2 Quad
             Gflop/s    Gflop/s   speedup   Gflop/s   speedup
  Cholesky   183        23.0      7.4       34.9      5.5
  LU         179        22.7      7.7       59.2      3.0
  QR         192        22.5      8.3       44.8      4.3
  SGEMM      206        25.9      8.0       69.8      3.0
Speedup using 2 GPUs
Using column-cyclic layout
Breakdown of runtime (LU)
What if we omit one optimization in LU?
Other Work Done
• Tridiagonal eigenvalue solver (bisection); see the first sketch after this slide
  – Most work: factorize A − λᵢI = LDL^T and count the signs in D (compute-bound)
  – Done for many shifts λᵢ in parallel (this is how it is traditionally vectorized)
  – If more parallelism is needed, do multisection instead of bisection
    • But it increases the total flop count
  – The rest is difficult to parallelize and is not worth it
  – Our solution:
    • Run the vectorized loops on the GPU, the rest (least work) on the CPU
    • Autotune to decide the optimal redundancy and when to involve the CPU
    • Use features of IEEE arithmetic to save another 15-30% of runtime
    • Up to 156x faster than LAPACK on a Pentium 4
• Tridiagonal eigenvector solver (inverse iteration)
  – Most work: solve (A − λᵢI)x_{k+1} = x_k for a fixed λᵢ (bandwidth-bound)
  – Factorize A − λᵢI = LDL^T once, keep D only; reconstruct L when needed
    • Reconstruction is overlapped with memory access; still bandwidth-bound
  – Don't pivot; recompute using safe code if it fails (do it on the CPU)
  – Up to 120x faster than LAPACK on a Core2 Duo so far
  – More complicated when eigenvalues are clustered
• Stencil computation (7-point on a 3D grid); see the second sketch after this slide
  – Blocks in registers and local memory
  – Bandwidth-bound; runs at up to 66% of pin bandwidth
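First sketch, for the bisection solver above: a hedged illustration of the sign-count step, where each thread factorizes A − λᵢI = LDL^T for one shift and counts the negative pivots. Array and kernel names are illustrative, and the guard-free recurrence leans on IEEE behavior of division by zero, in the spirit of the IEEE-arithmetic trick mentioned above (the production code is more careful).

```cuda
// Illustrative sketch: the tridiagonal matrix has diagonal d[0..n-1] and
// off-diagonal e[0..n-2]; thread i computes the number of eigenvalues below
// the shift sigma[i] by counting negative pivots of A - sigma[i]*I = L*D*L^T.
__global__ void sturm_count(int n, const float *d, const float *e,
                            const float *sigma, int *count, int nshifts)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= nshifts) return;

    float s = sigma[i];
    float pivot = d[0] - s;                 // first diagonal entry of D
    int neg = (pivot < 0.0f);
    for (int j = 1; j < n; ++j) {
        // next pivot of the LDL^T recurrence; a zero pivot turns the division
        // into +/-inf under IEEE arithmetic, which usually keeps the count
        // correct without branching
        pivot = d[j] - s - e[j - 1] * e[j - 1] / pivot;
        neg += (pivot < 0.0f);
    }
    count[i] = neg;   // eigenvalues of A that are smaller than sigma[i]
}
```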
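Second sketch, for the stencil item above: a minimal 7-point stencil on a 3D grid, assuming a plain x-fastest layout and leaving out the register/local-memory blocking that the actual bandwidth-bound kernel uses; names and coefficients are illustrative.

```cuda
// Illustrative sketch: each thread handles one (x, y) column and marches
// along z; a blocked version would keep a few z-planes in registers and a
// tile of the current plane in local (shared) memory.
__global__ void stencil7(int nx, int ny, int nz,
                         const float *in, float *out, float c0, float c1)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < 1 || x >= nx - 1 || y < 1 || y >= ny - 1) return;  // skip boundary

    long plane = (long)nx * ny;
    for (int z = 1; z < nz - 1; ++z) {
        long i = z * plane + (long)y * nx + x;
        out[i] = c0 * in[i]
               + c1 * (in[i - 1]     + in[i + 1]        // x neighbors
                     + in[i - nx]    + in[i + nx]       // y neighbors
                     + in[i - plane] + in[i + plane]);  // z neighbors
    }
}

// launch over the interior, e.g.:
// dim3 block(16, 16), grid((nx + 15) / 16, (ny + 15) / 16);
// stencil7<<<grid, block>>>(nx, ny, nz, d_in, d_out, c0, c1);
```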
Future Work
• Analysis of the architecture
  – Find the best parallels in past architectures to reuse their methods
  – Catch up with newer GPUs
  – More micro-benchmarking to get better performance models
• More scientific kernels
  – CUFFT runs at ≈50 Gflop/s; we can do better (e.g. by not computing sin/cos in the inner loop)
• More LAPACK
  – Two-sided factorizations used in eigensolvers and the SVD
    • LAPACK does 50% of the work in BLAS1/BLAS2
    • A mostly-BLAS3 algorithm is known, but it requires more flops if eigenvectors are needed
    • May use divide-and-conquer instead
  – MRRR (an improved inverse iteration algorithm, also rich in parallelism)
  – Non-symmetric eigensolvers such as QR iterations
    • Currently fine-grained; can we do better?
  – Iterative refinement for the eigenvalue problem?
• ScaLAPACK (distributed-memory LAPACK)
  – One-sided factorizations on a GPU cluster