The view from space Last weekend in Los Angeles, a few miles from my apartment…
Estimating fill accurately and efficiently
• Idea: Sample the matrix
  • Fraction of matrix to sample: s ∈ [0,1]
  • Cost ~ O(s · nnz)
  • Control run-time cost by controlling s
• Control s by observing statistical confidence intervals
  • Idea: Monitor variance automatically
• Cost of tuning
  • Lower bound: convert matrix in 5 to 40 unblocked SpMVs
  • Heuristic: 1 to 11 SpMVs
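The sampling idea above can be sketched in C as follows. This is a minimal illustration under stated assumptions, not the library's implementation: the function name `estimate_fill`, the CSR array names `ptr`/`ind`, and the deterministic strided sampling of block rows are all hypothetical choices made for this example. A randomized sample would be needed to attach confidence intervals to the estimate.

```c
#include <assert.h>
#include <stdlib.h>

/* Estimate the fill ratio (stored entries / true nonzeros) of an r x c
 * blocking of an m x n CSR matrix by visiting roughly a fraction s of its
 * block rows. ptr/ind are 0-based CSR row pointers and column indices.
 * Assumption: strided (not random) sampling, chosen for reproducibility. */
double estimate_fill(int m, int n, const int *ptr, const int *ind,
                     int r, int c, double s)
{
    assert(m >= 0 && n > 0 && r > 0 && c > 0 && s > 0.0 && s <= 1.0);
    int num_block_rows = (m + r - 1) / r;
    int num_block_cols = (n + c - 1) / c;
    int stride = (int)(1.0 / s);      /* visit every stride-th block row */
    if (stride < 1) stride = 1;

    /* seen[J] == I + 1 means block column J was already counted in block row I */
    int *seen = calloc((size_t)num_block_cols, sizeof *seen);
    if (!seen) return -1.0;

    long blocks = 0, nnz = 0;
    for (int I = 0; I < num_block_rows; I += stride) {
        int row_end = (I + 1) * r < m ? (I + 1) * r : m;
        for (int i = I * r; i < row_end; ++i) {
            for (int k = ptr[i]; k < ptr[i + 1]; ++k) {
                int J = ind[k] / c;
                if (seen[J] != I + 1) { seen[J] = I + 1; ++blocks; }
                ++nnz;
            }
        }
    }
    free(seen);
    /* Each r x c block stores r*c entries, including explicit zeros. */
    return nnz ? (double)(blocks * r * c) / (double)nnz : 1.0;
}
```

The work is proportional to the number of nonzeros in the sampled block rows, which matches the O(s · nnz) cost quoted on the slide.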
Empirical model evaluation
• Tuning loop
  • Compute a “tuning time budget” based on workload
  • While (time remains and no tuning chosen): try a heuristic
• Heuristic for blocked SpMV: choose r x c to minimize
  $$\text{predicted time}(A,r,c) = \frac{\text{estimated flops}(A,r,c)}{\text{benchmark Mflop/s}(r,c)}$$
• Tuning for workloads
  • Weighted sums of empirical models
  • Dynamic programming for alternatives
  • Example: Combined y = A^T A x vs. separate (w = Ax, y = A^T w)
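The block-size heuristic above can be sketched as a small search over candidate (r, c) pairs. The sketch assumes that the estimated flops are proportional to the estimated fill ratio (flops = 2 · nnz · fill), so the constant 2 · nnz is dropped. The function name `choose_block_size` and the flat row-major table layout are hypothetical, introduced only for illustration.

```c
#include <assert.h>

/* Choose the block size (r, c), 1 <= r <= max_r and 1 <= c <= max_c, that
 * minimizes fill(r,c) / mflops(r,c), which is proportional to predicted time.
 * fill[(r-1)*max_c + (c-1)]   : estimated fill ratio under r x c blocking
 * mflops[(r-1)*max_c + (c-1)] : benchmarked Mflop/s of the r x c kernel
 * Both tables are assumptions of this sketch (filled in by the caller). */
void choose_block_size(const double *fill, const double *mflops,
                       int max_r, int max_c, int *best_r, int *best_c)
{
    assert(max_r > 0 && max_c > 0);
    double best_cost = -1.0;
    for (int r = 1; r <= max_r; ++r) {
        for (int c = 1; c <= max_c; ++c) {
            int idx = (r - 1) * max_c + (c - 1);
            assert(mflops[idx] > 0.0);
            double cost = fill[idx] / mflops[idx];
            if (best_cost < 0.0 || cost < best_cost) {
                best_cost = cost;
                *best_r = r;
                *best_c = c;
            }
        }
    }
}
```

A larger block wins only when its kernel speed outweighs the extra explicit zeros it stores, which is why neither table alone is enough to pick a block size.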
The cost of tuning
• Non-trivial run-time cost: up to ~40 mat-vecs
  • Dominated by conversion time (~80%)
• Design point: user calls “tune” routine explicitly
  • Exposes cost
  • Tuning time limited using estimated workload, provided by user or inferred by library
• User may save tuning results
  • To apply on future runs with similar matrix
  • Stored in “human-readable” format
Related Work
• Code generation
  • Generative & generic programming
  • Sparse compilers
  • Domain-specific generators
• Empirical search-based tuning
  • Kernel-centric: linear algebra, signal processing, sorting, MPI, …
  • Compiler-centric: profiling + FDO, iterative compilation, superoptimizers, autotuning compilers, continuous program optimization
• Tuning-free cache-oblivious algorithms
Bug hunting in MPI programs
• Motivation: MPI is a large, complex API
• Bug pattern detectors
  • Check basic API usage
  • Adapt existing tools: MPI-CHECK; FindBugs; Farchi, et al. VC’05
• Tasks requiring deeper program analysis
  • Properly matched sends/receives, barriers, collectives
  • Buffer errors, e.g., overruns, read before non-blocking op completes
  • Temporal usage properties
  • See error survey by DeSouza, Kuhn, & de Supinski ’05
  • Extend existing analyses by Shires, et al., PDPTA’99; Strout, et al., ICPP’06
Outline
• Motivation
• OSKI: An autotuned sparse kernel library
• Application-specific optimization “in the wild”
• Toward end-to-end application autotuning
• Summary and future work
Tour of application-specific optimizations
• Five case studies
• Common characteristics
  • Complex code
  • Heavy use of abstraction
  • Use generated code (e.g., SWIG C++/Python bindings)
  • Benefit from extensive code and data restructuring
  • Multiple bottlenecks
[1] Loop transformations for SMG2000
• SMG2000 implements semi-coarsening multigrid on structured grids (ASC Purple benchmark)
• Residual computation has an SpMV bottleneck
• Loop below looks simple but is non-trivial to extract

    for (si = 0; si < NS; ++si)
      for (k = 0; k < NZ; ++k)
        for (j = 0; j < NY; ++j)
          for (i = 0; i < NX; ++i)
            r[i + j*JR + k*KR] -= A[i + j*JA + k*KA + SA[si]]
                                * x[i + j*JX + k*KX + Sx[si]];
[1] Before transformation

    for (si = 0; si < NS; si++) {          /* Loop1 */
      for (kk = 0; kk < NZ; kk++) {        /* Loop2 */
        for (jj = 0; jj < NY; jj++) {      /* Loop3 */
          for (ii = 0; ii < NX; ii++) {    /* Loop4 */
            r[ii + jj*Jr + kk*Kr] -= A[ii + jj*JA + kk*KA + SA[si]]
                                   * x[ii + jj*Jx + kk*Kx + Sx[si]];
          } /* Loop4 */
        } /* Loop3 */
      } /* Loop2 */
    } /* Loop1 */
[1] After transformation, including interchange, unrolling, and prefetching

    for (kk = 0; kk < NZ; kk++) {                /* Loop2 */
      for (jj = 0; jj < NY; jj++) {              /* Loop3 */
        for (si = 0; si < NS; si++) {            /* Loop1 */
          double* rp = r + kk*Kr + jj*Jr;
          const double* Ap = A + kk*KA + jj*JA + SA[si];
          const double* xp = x + kk*Kx + jj*Jx + Sx[si];
          for (ii = 0; ii <= NX-3; ii += 3) {    /* core Loop4 */
            _mm_prefetch (Ap + PFD_A, _MM_HINT_NTA);
            _mm_prefetch (xp + PFD_X, _MM_HINT_NTA);
            rp[0] -= Ap[0] * xp[0];
            rp[1] -= Ap[1] * xp[1];
            rp[2] -= Ap[2] * xp[2];
            rp += 3; Ap += 3; xp += 3;
          } /* core Loop4 */
          for ( ; ii < NX; ii++) {               /* fringe Loop4 */
            rp[0] -= Ap[0] * xp[0];
            rp++; Ap++; xp++;
          } /* fringe Loop4 */
        } /* Loop1 */
      } /* Loop3 */
    } /* Loop2 */
[1] Loop transformations for SMG2000
• 2x speedup on kernel from specialization, loop interchange, unrolling, prefetching
  • But only 1.25x overall: multiple bottlenecks
• Lesson: Need complex sequences of transformations
  • Use profiling to guide
  • Inspect run-time data for specialization
  • Transformations are automatable
• Research topic: Automated specialization of hypre?
[1] SMG2000 demo
[2] Slicing and dicing 3P
• Accelerator design code from SLAC
• calcBasis() very expensive
• Scaling problems as |Eigensystem| grows
• In principle, loop interchange or precomputation via slicing possible
• Challenges in practice
  • “Loop nest” ~ 500+ LOC
  • 150+ LOC to calcBasis()
  • calcBasis() in 6-deep call chain, 4-deep loop nest, 2 conditionals
  • File I/O
  • Changes must be unobtrusive

    /* Post-processing phase */
    foreach mode in Eigensystem
      foreach elem in Mesh
        // { …
        b = calcBasis (elem)
        // }
        f = calcField (b, mode)
        writeDataToFiles (…);
[2] 3P: Impact and lessons
• 4-5x speedup for post-processing step; 1.5x overall
• Changes “checked-in”
• Lesson: Need clean source-level transformations
  • To automate, need robust program analysis and developer guidance
• Research: Annotation framework for developers [w/ Quinlan, Schordan, Yi: POHLL’06]
[3] Structure splitting
• Convert (array of structs) into (struct of arrays)
  • Improve spatial locality through increased stride-1 accesses
  • Make code hardware-prefetch and vector/SIMD unit “friendly”

Before:

    struct Type {
      double p;
      double x, y, z;
      double E;
      int k;
    } X[N], Y[N];

    for (i = 0; i < N; i++)
      Y[i].E += Y[X[i].k].p;

After:

    double Xp[N];
    double Xx[N], Xy[N], Xz[N];
    double XE[N];
    int Xk[N];
    // … same for Y …

    for (i = 0; i < N; i++)
      YE[i] += Yp[Xk[i]];
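To make the array-of-structs to struct-of-arrays conversion concrete, here is a minimal sketch that copies the slide's `Type` records into per-field arrays and runs the same update on both layouts. The names `TypeSoA`, `split_struct`, `update_aos`, and `update_soa` are hypothetical, and the copy step stands in for the "overhead of copying" incurred when the split is applied only locally.

```c
#include <assert.h>
#include <stddef.h>

/* Array-of-structs record, as on the slide. */
struct Type { double p; double x, y, z; double E; int k; };

/* Struct-of-arrays view: one contiguous, caller-allocated array per field. */
struct TypeSoA { double *p, *x, *y, *z, *E; int *k; };

/* Copy n records from the AoS layout into the SoA arrays. */
void split_struct(const struct Type *in, struct TypeSoA *out, size_t n)
{
    assert(in && out);
    for (size_t i = 0; i < n; ++i) {
        out->p[i] = in[i].p;
        out->x[i] = in[i].x;
        out->y[i] = in[i].y;
        out->z[i] = in[i].z;
        out->E[i] = in[i].E;
        out->k[i] = in[i].k;
    }
}

/* Original kernel: each access drags a whole 48-byte record into cache. */
void update_aos(struct Type *Y, const struct Type *X, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        Y[i].E += Y[X[i].k].p;
}

/* Split kernel: YE and Xk are read with stride 1; only Yp is gathered. */
void update_soa(double *YE, const double *Yp, const int *Xk, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        YE[i] += Yp[Xk[i]];
}
```

Both kernels compute the same values; only the memory layout, and therefore the access stride, differs.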
[3] Structure splitting: Impact and challenges
• 2x speedup on a KULL benchmark (suggested by Brian Miller)
• Implementation challenges
  • Potentially affects entire code
  • Can apply only locally, at a cost
    • Extra storage
    • Overhead of copying
  • Tedious to do by hand
• Lesson: Extensive data restructuring may be necessary
• Research: When and how best to split?
[4] Finding a loop-fusion needle in a haystack
• Interprocedural loop fusion finder [w/ B. White : Cornell U.]
  • Known example had 2x speedup on benchmark (Miller)
  • Built “abstraction-aware” analyzer using ROSE
    • First pass: Associate “loop signatures” with each function
    • Second pass: Propagate signatures through call chains

    for (Zone::iterator z = zones.begin (); z != zones.end (); ++z)
      for (Corner::iterator c = (*z).corners().begin (); …)
        for (int s = 0; s < c->sides().size(); s++)
          …
[4] Finding a loop-fusion needle in a haystack
• Found 6 examples of 3- and 4-deep nested loops
• “Analysis-only” tool
  • Finds, though does not verify/transform
• Lesson: “Classical” optimizations relevant to abstraction use
• Research
  • Recognizing and optimizing abstractions [White’s thesis, on-going]
  • Extending traditional optimizations to abstraction use
[5] Aggregating messages (on-going)
• Idea: Merge sends (suggested by Miller)
• Implementing a fully automated translator to find and transform
• Research: When and how best to aggregate?

Before:

    DataType A;
    // … operations on A …
    A.allToAll();
    // …
    DataType B;
    // … operations on B …
    B.allToAll();

After:

    DataType A;
    // … operations on A …
    // …
    DataType B;
    // … operations on B …
    bulkAllToAll(A, B);
Summary of application-specific optimizations
• Like library-based approach, exploit knowledge for big gains
  • Guidance from developer
  • Use run-time information
• Would benefit from automated transformation tools
  • Real code is hard to process
  • Changes may become part of software re-engineering
  • Need robust analysis and transformation infrastructure
  • Range of tools possible: analysis and/or transformation
• No silver bullets or magic compilers
Outline
• Motivation
• OSKI: An autotuned sparse kernel library
• “Real world” optimization
• Toward end-to-end application autotuning
• Summary and future work
A framework for performance tuning
Source: SciDAC Performance Engineering Research Institute (PERI)
OSKI’s place in the tuning framework
Creating structure: Traveling Salesman-based Reordering
• Application: Stanford accelerator design (Omega3P)
• Idea: Reorder by approximately solving TSP [Pinar ’97]
  • Nodes = columns of A
  • Weights(u, v) = no. of nz u, v have in common
  • Tour = ordering of columns
  • Choose maximum weight tour
• Also: symmetric storage, register blocking
• Manually selected optimizations
• Just an idea
  • High cost of computing approximate solution to TSP in practice
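The graph construction above can be illustrated with a small C sketch. It computes Weights(u, v) as the number of rows in which columns u and v both hold a nonzero, then builds a column ordering with a simple greedy nearest-neighbor walk. The greedy walk is a stand-in chosen for brevity, not the TSP approximation used in the cited work, and the function names and the dense n x n weight table are assumptions suitable only for small examples.

```c
#include <assert.h>
#include <stdlib.h>

/* W[u*n + v] = number of rows in which columns u and v both have a nonzero.
 * The matrix is m x n in CSR (ptr, ind), with no duplicate entries per row. */
void column_overlap_weights(int m, int n, const int *ptr, const int *ind, int *W)
{
    assert(m >= 0 && n > 0);
    for (int i = 0; i < n * n; ++i) W[i] = 0;
    for (int i = 0; i < m; ++i)
        for (int a = ptr[i]; a < ptr[i + 1]; ++a)
            for (int b = ptr[i]; b < ptr[i + 1]; ++b)
                if (ind[a] != ind[b]) W[ind[a] * n + ind[b]] += 1;
}

/* Greedy heavy-edge tour: start at column 0 and repeatedly move to the
 * unvisited column sharing the most rows with the current one.
 * order[t] receives the t-th column of the new ordering. Returns 0 on success. */
int greedy_tour(int n, const int *W, int *order)
{
    assert(n > 0);
    char *visited = calloc((size_t)n, 1);
    if (!visited) return -1;
    int cur = 0;
    visited[0] = 1;
    order[0] = 0;
    for (int t = 1; t < n; ++t) {
        int best = -1;
        for (int v = 0; v < n; ++v)
            if (!visited[v] && (best < 0 || W[cur * n + v] > W[cur * n + best]))
                best = v;
        visited[best] = 1;
        order[t] = best;
        cur = best;
    }
    free(visited);
    return 0;
}
```

Columns that share many rows end up adjacent in the ordering, which is what creates dense blocks for register blocking to exploit.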
100x100 Submatrix Along Diagonal
“Microscopic” Effect of Combined RCM+TSP Reordering
Before: Green + Red
After: Green + Blue
Interfaces to performance tools
• Mark-up AST with data, analysis, to identify optimizable target(s)
  • gprof
  • HPCToolkit [Mellor-Crummey : Rice]
  • VizzAnalyzer / Vizz3D [Panas : LLNL]
  • In progress: Open SpeedShop [Schulz : LLNL]
• Needed: Analysis to identify targets
Outlining
• Outline target into dynamically loadable library routine
  • Extends initial implementations by Liao [U. Houston], Jula [TAMU]
• Handles many details of C & C++
  • Wraps up variables, inserts declarations, generates call
  • Produces suitable interfaces for dynamic loading
  • Handles non-local control flow

    void OUT_38725__ (double* r, int JR, int KR, const double* A, …) {
      int si, j, k, i;
      for (si = 0; si < NS; si++)
        …
          r[i + j*JR + k*KR] -= A[i + …
Making a benchmark
• Make “benchmark” by inserting checkpoint library calls
  • Measure application behavior “in context”
  • Use ckpt (user-level) [Zander : U. Wisc.]
  • Insert timing code (cycle counter)
  • May insert arbitrary code to distinguish calling contexts
• Reasonably fast in practice
  • Checkpoint read/write bandwidth: 500 MB/s on my Pentium-M
  • For SMG2000: Problem consuming ~500 MB footprint takes ~30s to run
• Needed
  • Best procedure to get accurate and fair comparisons?
  • Do restarts resume in comparable states?
Example of “benchmark” (pseudo)code

    static int num_calls = 0;           // no. of invocations of outlined code
    if (!num_calls) {
      ckpt ();                          // Checkpoint/resume
      OUT_38725__ = dlsym (…);          // Load an implementation
      startTimer ();
    }

    OUT_38725__ (…);                    // outlined call-site

    if (++num_calls == CALL_LIMIT) {    // Measured CALL_LIMIT calls
      stopTimer ();
      outputTime ();
      exit (0);
    }
SMG2000 kernel POET instantiation

    for (kk = 0; kk < NZ; kk++) {                /* L4 */
      for (jj = 0; jj < NY; jj++) {              /* L3 */
        for (si = 0; si < NS; si++) {            /* L1 */
          double* rp = r + kk*Kr + jj*Jr;
          const double* Ap = A + kk*KA + jj*JA + SA[si];
          const double* xp = x + kk*Kx + jj*Jx + Sx[si];
          for (ii = 0; ii <= NX-3; ii += 3) {    /* core L2 */
            _mm_prefetch (Ap + PFD_A, _MM_HINT_NTA);
            _mm_prefetch (xp + PFD_X, _MM_HINT_NTA);
            rp[0] -= Ap[0] * xp[0];
            rp[1] -= Ap[1] * xp[1];
            rp[2] -= Ap[2] * xp[2];
            rp += 3; Ap += 3; xp += 3;
          } /* core L2 */
          for ( ; ii < NX; ii++) {               /* fringe L2 */
            rp[0] -= Ap[0] * xp[0];
            rp++; Ap++; xp++;
          } /* fringe L2 */
        } /* L1 */
      } /* L3 */
    } /* L4 */
Search
• We are search-engine agnostic
• Many possible hybrid modeling/search techniques
Summary of autotuning compiler approach
• End-to-end framework leverages existing work
  • ROSE provides a heavy-duty (robust) source-level infrastructure
  • Assemble stand-alone components
• Current and future work
  • Assembling a more complete end-to-end example
  • Interfaces between components?
  • Extending basic ROSE infrastructure, particularly program analysis
Compiler-based testing tools
• Instrumentation and dynamic analysis to measure coverage [IBM]
• Measurement-unit validation via Osprey [Jiang and Su, UC Davis]
• Numerical interval/bounds analysis [Sun]
• Interface to MOPS model-checker [Collingbourne, Imperial College]
• Interactive program visualization via VizzAnalyzer [Panas, LLNL]
SpMV trends, using pre-2007 data
SpMV trends, pre-2007: Fraction of peak
Motivation: The Difficulty of Tuning SpMV

    // y <-- y + A*x
    for all A(i,j):
      y(i) += A(i,j) * x(j)

    // Compressed sparse row (CSR)
    for each row i:
      t = y[i]
      for k = ptr[i] to ptr[i+1]-1:
        t += A[k] * x[J[k]]
      y[i] = t

• Exploit 8x8 dense blocks
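For reference, the CSR pseudocode above translates almost line for line into C. This is a plain baseline sketch, assuming 0-based indices and the array names `ptr`, `J`, and `A` from the slide; the function name `spmv_csr` is introduced here for illustration.

```c
#include <assert.h>

/* y <- y + A*x for an m-row sparse matrix in compressed sparse row format.
 * ptr : row pointers, length m+1
 * J   : column index of each stored nonzero
 * A   : value of each stored nonzero */
void spmv_csr(int m, const int *ptr, const int *J, const double *A,
              const double *x, double *y)
{
    assert(m >= 0);
    for (int i = 0; i < m; ++i) {
        double t = y[i];                    /* accumulate into the old y[i] */
        for (int k = ptr[i]; k < ptr[i + 1]; ++k)
            t += A[k] * x[J[k]];            /* indirect access to x */
        y[i] = t;
    }
}
```

The indirect access `x[J[k]]` and the short, data-dependent inner loop are what make this kernel hard to tune.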
Speedups on Itanium 2: The Need for Search
[Figure: Mflop/s by block size. Labels: Reference Mflop/s (7.6%); Mflop/s (31.1%); Best: 4x2]
SpMV Performance: raefsky3
Better, worse, or about the same?
[Figure: Pentium 4, 1.5 GHz vs. Xeon, 3.2 GHz]
* Faster, but relative improvement increases (20% to ~50%) *
Problem-Specific Performance Tuning
Problem-Specific Optimization Techniques
• Optimizations for SpMV
  • Register blocking (RB): up to 4x over CSR
  • Variable block splitting: 2.1x over CSR, 1.8x over RB
  • Diagonals: 2x over CSR
  • Reordering to create dense structure + splitting: 2x over CSR
  • Symmetry: 2.8x over CSR, 2.6x over RB
  • Cache blocking: 3x over CSR
  • Multiple vectors (SpMM): 7x over CSR
  • And combinations…
• Sparse triangular solve
  • Hybrid sparse/dense data structure: 1.8x over CSR
• Higher-level kernels
  • A·A^T·x, A^T·A·x: 4x over CSR, 1.8x over RB
  • A^ρ·x: 2x over CSR, 1.5x over RB
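The gain for the combined A^T·A·x kernel comes from reading each row of A once instead of twice. A minimal CSR sketch of that idea follows; the function name `ata_x_combined` and the array names are assumptions for illustration, and no blocking is applied.

```c
#include <assert.h>

/* y <- A^T * (A * x) for an m x n CSR matrix, touching each row of A once.
 * For row i with entries a_i: t = a_i . x, then y += t * a_i^T.
 * The separate version (w = A*x, then y = A^T*w) streams A through memory twice. */
void ata_x_combined(int m, int n, const int *ptr, const int *ind,
                    const double *val, const double *x, double *y)
{
    assert(m >= 0 && n >= 0);
    for (int j = 0; j < n; ++j) y[j] = 0.0;
    for (int i = 0; i < m; ++i) {
        double t = 0.0;
        for (int k = ptr[i]; k < ptr[i + 1]; ++k)   /* t = (A*x)[i] */
            t += val[k] * x[ind[k]];
        for (int k = ptr[i]; k < ptr[i + 1]; ++k)   /* y += t * row i, row still in cache */
            y[ind[k]] += val[k] * t;
    }
}
```

The second inner loop reuses the row entries while they are still in cache or registers, which is the reuse the combined kernel is designed to capture.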
BCSR Captures Regularly Aligned Blocks
• n = 21216
• nnz = 1.5 M
• Source: NASA structural analysis problem
• 8x8 dense substructure
• Reduces storage
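A register-blocked kernel for the smallest interesting case, 2x2 BCSR, shows where the storage and speed gains come from. This is a sketch under assumptions: the names `spmv_bcsr_2x2`, `bptr`, `bind`, and `bval` are hypothetical, blocks are stored row-major with explicit zeros, and the matrix dimensions are multiples of 2.

```c
#include <assert.h>

/* y <- y + A*x with A stored in 2x2 block CSR (BCSR).
 * mb   : number of block rows
 * bptr : block-row pointers, length mb+1
 * bind : block-column index of each stored block
 * bval : 4 values per block, row-major, explicit zeros included */
void spmv_bcsr_2x2(int mb, const int *bptr, const int *bind,
                   const double *bval, const double *x, double *y)
{
    assert(mb >= 0);
    for (int I = 0; I < mb; ++I) {
        double y0 = y[2 * I], y1 = y[2 * I + 1];    /* kept in registers */
        for (int k = bptr[I]; k < bptr[I + 1]; ++k) {
            const double *b = bval + 4 * k;
            double x0 = x[2 * bind[k]];
            double x1 = x[2 * bind[k] + 1];
            y0 += b[0] * x0 + b[1] * x1;
            y1 += b[2] * x0 + b[3] * x1;
        }
        y[2 * I] = y0;
        y[2 * I + 1] = y1;
    }
}
```

Only one column index is stored per block rather than one per nonzero, and the block multiply is fully unrolled, so x and y values are reused from registers.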
Problem: Forced Alignment Implies UBCSR
• BCSR(2x2): Stored / true nz = 1.24
• BCSR(3x3): Stored / true nz = 1.46
  • Forces i mod 3 = j mod 3 = 0
• Unaligned BCSR format: Store row indices
The Speedup Gap: BCSR vs. CSR
[Figure: Speedup (BCSR/CSR) by machine; 1.1x to 1.5x gap]
Approach: Splitting + Relaxed Block Alignment
• Goal: Close the gap between FEM classes
• Our approach: Capture actual structure more precisely
  • Split: A = A1 + A2 + … + As
  • Store each Ai in unaligned BCSR (UBCSR) format
    • Relax both row and column alignment
    • Buttari, et al. (2005) show improvements from relaxed column alignment
  • 2.1x over no blocking, 1.8x over blocking
  • When not faster than BCSR, may still reduce storage
Variable Block Row (VBR) Analysis
Partition by grouping consecutive rows/columns having same pattern
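The row half of this partitioning step can be sketched as a single pass over a CSR matrix. The function name `group_rows` and the output convention are assumptions for illustration; column indices within each row are assumed sorted, and the analogous pass over columns is omitted.

```c
#include <assert.h>

/* Group consecutive rows that have identical column patterns.
 * starts must have room for m+1 entries. On return, group g covers rows
 * starts[g] .. starts[g+1]-1. Returns the number of groups. */
int group_rows(int m, const int *ptr, const int *ind, int *starts)
{
    assert(m >= 0);
    int num_groups = 0;
    for (int i = 0; i < m; ++i) {
        int same = 0;
        if (i > 0) {
            int len = ptr[i + 1] - ptr[i];
            same = (len == ptr[i] - ptr[i - 1]);    /* same number of nonzeros? */
            for (int k = 0; same && k < len; ++k)   /* same column indices? */
                same = (ind[ptr[i] + k] == ind[ptr[i - 1] + k]);
        }
        if (!same) starts[num_groups++] = i;        /* row i opens a new group */
    }
    starts[num_groups] = m;
    return num_groups;
}
```

The sizes of the resulting groups give the natural block row heights present in the matrix.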
From VBR, Identify Multiple Natural Block Sizes
VBR with Fill
• Can also pad by matching rows/columns with nearly similar patterns
• Define VBR(θ) = VBR where consecutive rows are grouped when “similarity” ≥ θ, with 0 ≤ θ ≤ 1
Fill of 1%
A Complex Tuning Problem
• Many parameters need “tuning”
  • Fill threshold, 0.5 ≤ θ ≤ 1
  • Number of splittings, 2 ≤ s ≤ 4
  • Ordering of block sizes, r_i × c_i; r_s × c_s = 1×1
• See paper in HPCC 2005 for proof-of-concept experiments based on a semi-exhaustive search
  • Heuristic in progress (uses Buttari, et al. (2005) work)
FEM 2 Matrices

| Matrix | Application | Dimension | # non-zeros | Dominant blocks |
|---|---|---|---|---|
| 10-ct20stif | Engine block | 52k | 2.7M | 6x6 (39%), 3x3 (15%) |
| 12-raefsky4 | Buckling | 20k | 1.3M | 3x3 (96%) |
| 13-ex11 | Fluid flow | 16k | 1.1M | 1x1 (38%), 3x3 (23%) |
| 15-Vavasis3 | 2D PDE | 41k | 1.7M | 2x1 (81%), 2x2 (19%) |
| 17-rim | Fluid flow | 23k | 1.0M | 1x1 (75%), 3x1 (12%) |
| A-bmw7st_1 | Car chassis | 141k | 7.3M | 6x6 (82%) |
| B-cop20k_m | Accel. cavity | 121k | 4.8M | 2x1 (26%), 1x2 (26%), 1x1 (26%), 2x2 (22%) |
| C-pwtk | Wind tunnel | 218k | 11.6M | 6x6 (94%) |
| D-rma10 | Charleston Harbor | 47k | 2.4M | 2x2 (17%), 3x2 (15%), 2x3 (15%), 4x2 (9%), 2x4 (9%) |
| E-s3dkqm4 | Cylindrical shell | 90k | 4.8M | 6x6 (99%) |
Power 4 Performance
Storage Savings
Traveling Salesman Problem-based Reordering
• Application: Stanford accelerator design problem (Omega3P)
• Reorder by approximately solving TSP [Pinar & Heath ’97]
  • Nodes = columns of A
  • Weights(u, v) = no. of nz u, v have in common
  • Tour = ordering of columns
  • Choose maximum weight tour
  • See [Pinar & Heath ’97]
• Also: symmetric storage, register blocking
• Manually selected optimizations
• Problem: High cost of computing approximate solution to TSP
100x100 Submatrix Along Diagonal
![Page 68: The view from space Last weekend in Los Angeles, a few miles from my apartment…](https://reader036.fdocuments.us/reader036/viewer/2022062511/5518a181550346c31f8b48c8/html5/thumbnails/68.jpg)
The view from space
“Microscopic” Effect of Combined RCM+TSP Reordering
Before: Green + RedAfter: Green + Blue
![Page 69: The view from space Last weekend in Los Angeles, a few miles from my apartment…](https://reader036.fdocuments.us/reader036/viewer/2022062511/5518a181550346c31f8b48c8/html5/thumbnails/69.jpg)
![Page 70: The view from space Last weekend in Los Angeles, a few miles from my apartment…](https://reader036.fdocuments.us/reader036/viewer/2022062511/5518a181550346c31f8b48c8/html5/thumbnails/70.jpg)
Inter-Iteration Sparse Tiling (1/3)
[Figure: graph on nodes y1 to y5, t1 to t5, x1 to x5]
Idea: Strout, et al., ICCS 2001
Let A be 5x5 tridiagonal
Consider y = A^2 x: t = Ax, y = At
Nodes: vector elements
Edges: matrix elements a_ij
![Page 71: The view from space Last weekend in Los Angeles, a few miles from my apartment…](https://reader036.fdocuments.us/reader036/viewer/2022062511/5518a181550346c31f8b48c8/html5/thumbnails/71.jpg)
Inter-Iteration Sparse Tiling (2/3)
[Figure: graph on nodes y1 to y5, t1 to t5, x1 to x5]
Idea: Strout, et al., ICCS 2001
Let A be 5x5 tridiagonal
Consider y = A^2 x: t = Ax, y = At
Nodes: vector elements
Edges: matrix elements a_ij
Orange = everything needed to compute y1
  Reuse a11, a12
![Page 72: The view from space Last weekend in Los Angeles, a few miles from my apartment…](https://reader036.fdocuments.us/reader036/viewer/2022062511/5518a181550346c31f8b48c8/html5/thumbnails/72.jpg)
Inter-Iteration Sparse Tiling (3/3)
Idea: Strout, et al., ICCS 2001
Let A be 5x5 tridiagonal
Consider y = A^2 x: t = Ax, y = At
Nodes: vector elements
Edges: matrix elements a_ij
Orange = everything needed to compute y1
  Reuse a11, a12
Grey = y2, y3
  Reuse a23, a33, a43
[Figure: graph on nodes y1 to y5, t1 to t5, x1 to x5]
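The tiling idea on these three slides can be illustrated for the tridiagonal case only. The sketch below is not the general algorithm of Strout, et al.; it assumes A is stored as three diagonals (`lower`, `diag`, `upper`, all hypothetical names, 0-based) and interleaves the two products so that each y entry is computed as soon as the t entries it depends on exist, while the corresponding matrix rows are still likely to be in cache. With a tile width of 2 on a 5x5 matrix it produces the same grouping as the slide: first t1, t2 and y1, then t3, t4 and y2, y3.

```c
#include <assert.h>

/* Row i of (tridiagonal A) times vector v. lower[0] and upper[n-1] are unused. */
static double tridiag_row(int n, const double *lower, const double *diag,
                          const double *upper, const double *v, int i)
{
    double s = diag[i] * v[i];
    if (i > 0)
        s += lower[i] * v[i - 1];
    if (i + 1 < n)
        s += upper[i] * v[i + 1];
    return s;
}

/* Compute t = A*x and y = A*t tile by tile instead of in two full sweeps.
 * `tile` is the number of t entries produced per tile (an assumed parameter). */
void tridiag_square_tiled(int n, const double *lower, const double *diag,
                          const double *upper, const double *x,
                          double *t, double *y, int tile)
{
    assert(n > 0 && tile > 0);
    int y_done = 0;                       /* y[0 .. y_done) is finished */
    for (int start = 0; start < n; start += tile) {
        int end = (start + tile < n) ? start + tile : n;
        for (int i = start; i < end; i++)
            t[i] = tridiag_row(n, lower, diag, upper, x, i);
        /* y[i] needs t[i-1], t[i], t[i+1]; it is ready once t[i+1] exists,
         * or when i is the last row. */
        int y_ready = (end == n) ? n : end - 1;
        for (; y_done < y_ready; y_done++)
            y[y_done] = tridiag_row(n, lower, diag, upper, t, y_done);
    }
}
```

The result is identical to running the two matrix-vector products back to back; only the order of operations changes.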
![Page 73: The view from space Last weekend in Los Angeles, a few miles from my apartment…](https://reader036.fdocuments.us/reader036/viewer/2022062511/5518a181550346c31f8b48c8/html5/thumbnails/73.jpg)
Serial Sparse Tiling Performance (Itanium 2)
![Page 74: The view from space Last weekend in Los Angeles, a few miles from my apartment…](https://reader036.fdocuments.us/reader036/viewer/2022062511/5518a181550346c31f8b48c8/html5/thumbnails/74.jpg)
OSKI Software Architecture and API
![Page 75: The view from space Last weekend in Los Angeles, a few miles from my apartment…](https://reader036.fdocuments.us/reader036/viewer/2022062511/5518a181550346c31f8b48c8/html5/thumbnails/75.jpg)
Interface supports legacy app migration
int* ptr = …, *ind = …; double* val = …; /* Matrix A, in CSR format */
double* x = …, *y = …; /* Vectors */
/* Compute y = beta·y + alpha·A·x, 500 times */
for( i = 0; i < 500; i++ )
    my_matmult( ptr, ind, val, alpha, x, beta, y );
r = ddot (x, y); /* Some dense BLAS op on vectors */
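For readers who have not seen a legacy CSR kernel, here is a minimal sketch of what a routine like `my_matmult` might look like. The slide does not show its body, so this is an assumption: the name `csr_matmult`, the explicit `num_rows` argument, and 0-based indexing are all illustrative choices.

```c
#include <assert.h>

/* y = beta*y + alpha*A*x for A in compressed sparse row (CSR) format.
 * ptr[i] .. ptr[i+1]-1 index the nonzeros of row i in ind (column indices)
 * and val (values). Indices are assumed to be 0-based. */
void csr_matmult(int num_rows, const int *ptr, const int *ind,
                 const double *val, double alpha, const double *x,
                 double beta, double *y)
{
    assert(num_rows >= 0);
    for (int i = 0; i < num_rows; i++) {
        double sum = 0.0;
        for (int k = ptr[i]; k < ptr[i + 1]; k++)
            sum += val[k] * x[ind[k]];   /* indirect access into x */
        y[i] = beta * y[i] + alpha * sum;
    }
}
```

The indirect access `x[ind[k]]` is what makes this loop hard to optimize statically, and is the part a tuned data structure aims to improve.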
![Page 76: The view from space Last weekend in Los Angeles, a few miles from my apartment…](https://reader036.fdocuments.us/reader036/viewer/2022062511/5518a181550346c31f8b48c8/html5/thumbnails/76.jpg)
Interface supports legacy app migration
int* ptr = …, *ind = …; double* val = …; /* Matrix A, in CSR format */
double* x = …, *y = …; /* Vectors */
/* Step 1: Create OSKI wrappers */
oski_matrix_t A_tunable = oski_CreateMatCSR(ptr, ind, val, num_rows, num_cols, SHARE_INPUTMAT, …);
oski_vecview_t x_view = oski_CreateVecView(x, num_cols, UNIT_STRIDE);
oski_vecview_t y_view = oski_CreateVecView(y, num_rows, UNIT_STRIDE);
/* Compute y = beta·y + alpha·A·x, 500 times */
for( i = 0; i < 500; i++ )
    my_matmult( ptr, ind, val, alpha, x, beta, y );
r = ddot (x, y);
![Page 77: The view from space Last weekend in Los Angeles, a few miles from my apartment…](https://reader036.fdocuments.us/reader036/viewer/2022062511/5518a181550346c31f8b48c8/html5/thumbnails/77.jpg)
Interface supports legacy app migration
int* ptr = …, *ind = …; double* val = …; /* Matrix A, in CSR format */
double* x = …, *y = …; /* Vectors */
/* Step 1: Create OSKI wrappers */
oski_matrix_t A_tunable = oski_CreateMatCSR(ptr, ind, val, num_rows, num_cols, SHARE_INPUTMAT, …);
oski_vecview_t x_view = oski_CreateVecView(x, num_cols, UNIT_STRIDE);
oski_vecview_t y_view = oski_CreateVecView(y, num_rows, UNIT_STRIDE);
/* Step 2: Call tune (with optional hints) */
oski_SetHintMatMult (A_tunable, …, 500);
oski_TuneMat (A_tunable);
/* Compute y = beta·y + alpha·A·x, 500 times */
for( i = 0; i < 500; i++ )
    my_matmult( ptr, ind, val, alpha, x, beta, y );
r = ddot (x, y);
![Page 78: The view from space Last weekend in Los Angeles, a few miles from my apartment…](https://reader036.fdocuments.us/reader036/viewer/2022062511/5518a181550346c31f8b48c8/html5/thumbnails/78.jpg)
Interface supports legacy app migration
int* ptr = …, *ind = …; double* val = …; /* Matrix A, in CSR format */
double* x = …, *y = …; /* Vectors */
/* Step 1: Create OSKI wrappers */
oski_matrix_t A_tunable = oski_CreateMatCSR(ptr, ind, val, num_rows, num_cols, SHARE_INPUTMAT, …);
oski_vecview_t x_view = oski_CreateVecView(x, num_cols, UNIT_STRIDE);
oski_vecview_t y_view = oski_CreateVecView(y, num_rows, UNIT_STRIDE);
/* Step 2: Call tune (with optional hints) */
oski_SetHintMatMult (A_tunable, …, 500);
oski_TuneMat (A_tunable);
/* Compute y = beta·y + alpha·A·x, 500 times */
for( i = 0; i < 500; i++ )
    oski_MatMult (A_tunable, OP_NORMAL, alpha, x_view, beta, y_view); /* Step 3 */
r = ddot (x, y);
![Page 79: The view from space Last weekend in Los Angeles, a few miles from my apartment…](https://reader036.fdocuments.us/reader036/viewer/2022062511/5518a181550346c31f8b48c8/html5/thumbnails/79.jpg)
Quick-and-dirty Parallelism: OSKI-PETSc
Extend PETSc’s distributed memory SpMV (MATMPIAIJ)
[Figure: matrix distributed across processes p0, p1, p2, p3]
PETSc: Each process stores diag (all-local) and off-diag submatrices
OSKI-PETSc: Add OSKI wrappers
  Each submatrix tuned independently
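The diag/off-diag split can be sketched without PETSc or OSKI. The routine below is an illustrative assumption, not PETSc's code: it takes one process's rows in CSR form with global column indices, plus the column range `[col_begin, col_end)` that the process owns, and separates the entries into a diagonal block (columns renumbered locally) and an off-diagonal block. For simplicity the off-diagonal block keeps global column indices, and the caller is assumed to have allocated all output arrays large enough. The name `split_diag_offdiag` is hypothetical.

```c
#include <assert.h>

/* Split the local rows of a distributed CSR matrix into:
 *   diag block:     entries whose column lies in [col_begin, col_end)
 *   off-diag block: all other entries
 * Outputs are two CSR structures (d_* and o_*). */
void split_diag_offdiag(int num_rows, const int *ptr, const int *ind,
                        const double *val, int col_begin, int col_end,
                        int *d_ptr, int *d_ind, double *d_val,
                        int *o_ptr, int *o_ind, double *o_val)
{
    assert(col_begin <= col_end);
    int nd = 0, no = 0;                 /* running nonzero counts */
    for (int i = 0; i < num_rows; i++) {
        d_ptr[i] = nd;
        o_ptr[i] = no;
        for (int k = ptr[i]; k < ptr[i + 1]; k++) {
            if (ind[k] >= col_begin && ind[k] < col_end) {
                d_ind[nd] = ind[k] - col_begin;  /* local column index */
                d_val[nd] = val[k];
                nd++;
            } else {
                o_ind[no] = ind[k];              /* global column index */
                o_val[no] = val[k];
                no++;
            }
        }
    }
    d_ptr[num_rows] = nd;
    o_ptr[num_rows] = no;
}
```

Each of the two resulting blocks is an ordinary CSR matrix, which is why each can be wrapped and tuned on its own.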
![Page 80: The view from space Last weekend in Los Angeles, a few miles from my apartment…](https://reader036.fdocuments.us/reader036/viewer/2022062511/5518a181550346c31f8b48c8/html5/thumbnails/80.jpg)
OSKI-PETSc Proof-of-Concept Results
Matrix 1: Accelerator cavity design (R. Lee @ SLAC)
  N ~ 1 M, ~40 M non-zeros
  2x2 dense block substructure
  Symmetric
Matrix 2: Linear programming (Italian Railways)
  Short-and-fat: 4k x 1M, ~11M non-zeros
  Highly unstructured
  Big speedup from cache-blocking: no native PETSc format
Evaluation machine: Xeon cluster
  Peak: 4.8 Gflop/s per node
![Page 81: The view from space Last weekend in Los Angeles, a few miles from my apartment…](https://reader036.fdocuments.us/reader036/viewer/2022062511/5518a181550346c31f8b48c8/html5/thumbnails/81.jpg)
Accelerator cavity matrix from SLAC’s T3P code
![Page 82: The view from space Last weekend in Los Angeles, a few miles from my apartment…](https://reader036.fdocuments.us/reader036/viewer/2022062511/5518a181550346c31f8b48c8/html5/thumbnails/82.jpg)
Additional Features: OSKI-Lua
Embedded scripting language for selecting customized, complex transformations
Mechanism to save/restore transformations
# In file, "my_xform.txt"
# Compute A_fast = P*A*P^T using Pinar’s reordering algorithm
A_fast, P = reorder_TSP(InputMat);
# Split A_fast = A1 + A2, where A1 in 2x2 block format, A2 in CSR
A1, A2 = A_fast.extract_blocks(2, 2);
return transpose(P)*(A1+A2)*P;
/* In "my_app.c" */
fp = fopen("my_xform.txt", "rt");
n = fread(buffer, 1, BUFSIZE - 1, fp); /* read the whole script, not only its first line */
buffer[n] = '\0';
fclose(fp);
oski_ApplyMatTransform(A_tunable, buffer);
oski_MatMult(A_tunable, …);
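To make the "A1 in 2x2 block format" part of the script concrete, here is a minimal sketch of a multiply for a matrix stored in 2x2 block CSR. It is an illustration under stated assumptions, not OSKI's internal kernel: blocks are assumed to be stored row-major, four values per block, with 0-based block column indices, and the name `bcsr2x2_matmult` is hypothetical. The routine accumulates y += A1·x, so the CSR part A2 could be applied to the same y afterwards.

```c
#include <assert.h>

/* y += A*x for A stored as 2x2 blocks in block CSR.
 * bptr[I] .. bptr[I+1]-1 index the blocks of block row I;
 * bind[k] is the block column of block k;
 * bval[4*k .. 4*k+3] holds block k in row-major order. */
void bcsr2x2_matmult(int num_block_rows, const int *bptr, const int *bind,
                     const double *bval, const double *x, double *y)
{
    assert(num_block_rows >= 0);
    for (int I = 0; I < num_block_rows; I++) {
        double y0 = 0.0, y1 = 0.0;       /* two outputs held in registers */
        for (int k = bptr[I]; k < bptr[I + 1]; k++) {
            const double *b = bval + 4 * k;
            const double *xp = x + 2 * bind[k];
            /* one column index is loaded per four matrix values */
            y0 += b[0] * xp[0] + b[1] * xp[1];
            y1 += b[2] * xp[0] + b[3] * xp[1];
        }
        y[2 * I] += y0;
        y[2 * I + 1] += y1;
    }
}
```

Compared with plain CSR, the block format stores one index per block instead of one per nonzero and reuses the two x entries across both rows of the block.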
![Page 83: The view from space Last weekend in Los Angeles, a few miles from my apartment…](https://reader036.fdocuments.us/reader036/viewer/2022062511/5518a181550346c31f8b48c8/html5/thumbnails/83.jpg)
Current Work and Future Directions
![Page 84: The view from space Last weekend in Los Angeles, a few miles from my apartment…](https://reader036.fdocuments.us/reader036/viewer/2022062511/5518a181550346c31f8b48c8/html5/thumbnails/84.jpg)
Current and Future Work on OSKI
OSKI 1.0.1 at bebop.cs.berkeley.edu/oski
“Pre-alpha” version of OSKI-PETSc available; “Beta” for Kokkos (Trilinos)
Future work
  Evaluation on full solves/apps
    Bay area lithography shop - 2x speedup in full solve
  Code generators
  Studying use of higher-level OSKI kernels
  Port to additional architectures (e.g., vectors, SMPs)
  Additional heuristics [Buttari, et al. (2005)]
Many BeBOP projects on-going
  SpMV benchmark for HPC-Challenge [Gavari & Hoemmen]
  Evaluation of Cell [Williams]
  Higher-level kernels, solvers [Hoemmen, Nishtala]
  Tuning collective communications [Nishtala]
  Cache-oblivious stencils [Kamil]
![Page 85: The view from space Last weekend in Los Angeles, a few miles from my apartment…](https://reader036.fdocuments.us/reader036/viewer/2022062511/5518a181550346c31f8b48c8/html5/thumbnails/85.jpg)
ROSE: A Compiler-Based Approach to Tuning General Applications
ROSE: Tool for building customized source-to-source tools (Quinlan, et al.)
  Full support for C and C++; Fortran90 in development
  Targets users with little or no compiler background
Focus on performance optimization for scientific computing
  Domain-specific analysis and optimizations
  Object-oriented abstraction recognition
  Rich loop-transformation support
  Annotation language support
  Additional infrastructure support for s/w assurance, testing, and debugging
Toward an end-to-end empirical tuning compiler
  Combines profiling, checkpointing, analysis, parameterized code generation, search
  Joint work with Qing Yi (University of Texas at San Antonio)
Sponsored by U.S. Department of Energy
![Page 86: The view from space Last weekend in Los Angeles, a few miles from my apartment…](https://reader036.fdocuments.us/reader036/viewer/2022062511/5518a181550346c31f8b48c8/html5/thumbnails/86.jpg)
ROSE Architecture
[Figure: ROSE architecture diagram. Labels: Application Library Interface; Front-end (EDG-based); Mid-end; Back-end; Transformed application source; Source fragment; AST fragment; AST; Annotations; Tools: Abstraction Recognition, Abstraction Aware Analysis, Abstraction Elimination, Extended Traditional Optimizations; Source+AST Transformations]