1
Hardware Aware Programming
Exploiting the memory hierarchy and parallel multicore processors
Lehrstuhl für Informatik 10 (Systemsimulation), Universität Erlangen-Nürnberg
www10.informatik.uni-erlangen.de
Canberra, July 2008
U. Rüde (LSS Erlangen, [email protected])
joint work with
J. Götz, M. Stürmer, K. Iglberger, S. Donath, C. Feichtinger, T. Preclik, T. Gradl, C. Freundl, H. Köstler, T. Pohl, D. Ritter, D. Bartuschat, P. Neumann,
G. Wellein, G. Hager, T. Zeiser, J. Habich (RRZE), N. Thürey (ETH Zürich)
2
Overview
To PetaScale and Beyond
Optimizing Memory Access and Cache-Aware Programming
Massively Parallel Multigrid: Performance Results
MultiCore Architectures
Case Study: Lattice Boltzmann Methods for Flow Simulation on the PlayStation
Conclusions
3
Part I
Towards PetaScale and Beyond
4
HHG Motivation I: Structured vs. Unstructured Grids (on Hitachi SR 8000)
[Bar chart, MFlop/s from 0 to 8000: matrix-vector multiplication on one node, structured stencil code versus sparse matrix in JDS format, for problem sizes 729, 4,913, 35,937, 274,625, and 2,146,689 unknowns.]
Structured versus sparse matrix: many emerging architectures have similar properties (Cell, GPU).
Extinct dinosaur HLRB-I: Hitachi SR 8000
No. 5 in the TOP 500 in 2000, 2 TFlop/s
5
HHG Motivation II: The DiMe Project
Started jointly with Linda Stals in 1996 in Augsburg!
Cache optimizations for sparse matrix/stencil codes (1996-2007): efficient, hardware-optimized
multigrid solvers and Lattice Boltzmann CFD
with free surface flow and fluid-structure interaction.
www10.informatik.uni-erlangen.de/de/Research/Projects/DiME/
DiMe: Data-Local Iterative Methods for the Efficient Solution of Partial Differential Equations (1996-2007)
6
Evolution of Semiconductor Technology
The ITRS roadmap collects trends in semiconductor technology; see http://www.itrs.net/reports.html
7
Where Does Computer Architecture Go?
Computer architects have capitulated: it may no longer be possible to exploit progress in semiconductor technology for automatic performance improvements.
Even today a single-core CPU is a highly parallel system: superscalar execution, complex pipelines, ... and additional tricks. This internal parallelism is a major reason for the performance increases so far, but there is only a limited amount of parallelism that can be exploited automatically.
Multicore systems concede the architects' defeat: architects can no longer build faster single-core CPUs even when given more transistors, and clock rates increase only slowly (due to power considerations).
Therefore architects have started to put several cores on a chip: programmers must use them directly.
8
What are the consequences?
For application developers, "the free lunch is over":
without explicitly parallel algorithms, the performance potential cannot be used any more.
For HPC: CPUs will have 2, 4, 8, 16, ..., 128, ..., ??? cores, maybe sooner than we are ready for it. We will have to deal with systems with millions of cores.
9
Memory access as a major bottleneck
25+ years ago, the Telefunken TR440 had 16,000 words of memory:
the memory filled one rack of about 0.8 m x 2 m, with 8 x 20 drawers, each holding 100 data cards.
Today, the HLRB-II (Altix 4700) has 5 x 10^12 words:
at the same density, the memory would fill a rack 2 m high ranging roughly from the earth to the moon,
or, better organized, a rack system 500 m wide (500 rows of racks), 2 m high (20 drawers with 100 cards each), and 500 km long (5,000,000 columns).
10
Part II
Optimizing Memory Access and Cache-Aware Programming
11
Increasing single-CPU performance by optimizing data locality
Caches work due to the locality of memory accesses (instructions + data)
(Numerically intensive) codes should exhibit:
Spatial locality: data items accessed within a short time period are located close to each other in memory.
Temporal locality: data that has been accessed recently is likely to be accessed again in the near future.
Goal: increase spatial and temporal locality in order to enhance cache utilization (cache-aware programming).
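The effect of spatial locality can be made concrete with a small sketch (not from the slides): traversing a 2D array along its storage order touches consecutive addresses, while the swapped loop order strides through memory and wastes most of each loaded cache line. Both orders compute the same result; only the access pattern differs.

```python
# Spatial locality: traverse a 2D array in the order it is stored.
# In a row-major layout, iterating j in the inner loop gives unit
# stride; swapping the loops jumps by a full row on every access.

n = 512
a = [[i * n + j for j in range(n)] for i in range(n)]

def sum_row_major(a):
    s = 0
    for i in range(len(a)):         # rows outer
        for j in range(len(a[0])):  # columns inner: unit stride
            s += a[i][j]
    return s

def sum_col_major(a):
    s = 0
    for j in range(len(a[0])):      # columns outer
        for i in range(len(a)):     # rows inner: stride n
            s += a[i][j]
    return s

print(sum_row_major(a) == sum_col_major(a))  # True: same result,
                                             # different cache behaviour
```

On large arrays in a compiled language, the row-major version runs markedly faster; in Fortran (column-major), the fast order is the opposite one.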
12
Cache performance optimizations
Data layout optimizations: change the data layout in memory to enhance spatial locality.
Data access optimizations: change the order of data accesses to enhance spatial and temporal locality.
These transformations preserve the numerical results, and their introduction can (in theory) be automated!
13
Data access optimizations: Loop fusion
Example: red/black Gauss-Seidel iteration in 2D
14
Data access optimizations: Loop fusion (cont'd)
Code before applying the loop fusion technique (standard implementation with efficient loop ordering; Fortran array semantics, so the first index j varies fastest and belongs innermost):

for it = 1 to numIter do
  // Red nodes
  for i = 1 to n-1 do
    for j = 1+(i+1)%2 to n-1 by 2 do
      relax(u(j,i))
    end for
  end for
15
Data access optimizations: Loop fusion (cont'd)

  // Black nodes
  for i = 1 to n-1 do
    for j = 1+i%2 to n-1 by 2 do
      relax(u(j,i))
    end for
  end for
end for

This requires two sweeps through the whole data set per single GS iteration!
16
Data access optimizations: Loop fusion (cont'd)
How the fusion technique works:
17
Data access optimizations: Loop fusion (cont'd)
Code after applying the loop fusion technique:

for it = 1 to numIter do
  // Update red nodes in first grid row
  for j = 1 to n-1 by 2 do
    relax(u(j,1))
  end for
18
Data access optimizations: Loop fusion (cont'd)

  // Update red and black nodes in pairs
  // (starting at i = 2, since red row 1 is already done and row 0 is boundary)
  for i = 2 to n-1 do
    for j = 1+(i+1)%2 to n-1 by 2 do
      relax(u(j,i))
      relax(u(j,i-1))
    end for
  end for
19
Data access optimizations: Loop fusion (cont'd)

  // Update black nodes in last grid row
  for j = 2 to n-1 by 2 do
    relax(u(j,n-1))
  end for
end for

The solution vector u now passes through the cache only once instead of twice per GS iteration!
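The fused sweep can be checked against the two-sweep version with a short Python sketch (for verifying the transformation, not for performance). relax() is the 5-point update; n is assumed even, matching the j = 2 start in the last-row cleanup; the pair loop starts at i = 2 so that boundary row 0 is never touched.

```python
# Red/black Gauss-Seidel on an (n+1) x (n+1) grid, boundary fixed.

def relax(u, j, i):
    u[j][i] = 0.25 * (u[j-1][i] + u[j+1][i] + u[j][i-1] + u[j][i+1])

def gs_two_sweeps(u, n):            # two full passes over the grid
    for i in range(1, n):           # red nodes
        for j in range(1 + (i + 1) % 2, n, 2):
            relax(u, j, i)
    for i in range(1, n):           # black nodes
        for j in range(1 + i % 2, n, 2):
            relax(u, j, i)

def gs_fused(u, n):                 # one fused pass
    for j in range(1, n, 2):        # red nodes in first grid row
        relax(u, j, 1)
    for i in range(2, n):           # red row i and black row i-1 in pairs
        for j in range(1 + (i + 1) % 2, n, 2):
            relax(u, j, i)
            relax(u, j, i - 1)
    for j in range(2, n, 2):        # black nodes in last grid row
        relax(u, j, n - 1)

n = 8
grid = lambda: [[float(j * j + i) for i in range(n + 1)] for j in range(n + 1)]
a, b = grid(), grid()
gs_two_sweeps(a, n)
gs_fused(b, n)
assert a == b                       # bitwise identical results
```

Every red update still sees only old black values and every black update sees only new red values, so the fused variant reproduces the two-sweep result exactly.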
20
Data access optimizations: Loop split
The inverse transformation of loop fusion: divide the work of one loop into two to make the loop body less complicated,
leverage compiler optimizations, and enhance instruction cache utilization.
21
Data access optimizations: Loop blocking
Loop blocking = loop tiling:
Divide the data set into subsets (blocks) that are small enough to fit in cache.
Perform as much work as possible on the data in cache before moving to the next block.
This is not always easy to accomplish because of data dependencies.
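Before turning to the (harder) temporal blocking of Gauss-Seidel iterations on the next slides, here is loop tiling in its simplest, dependency-free form, sketched on an out-of-place matrix transpose: the untiled loop streams through one of the arrays with a large stride, while the tiled loop keeps both arrays inside small T x T blocks.

```python
# Loop tiling illustrated on a matrix transpose; a sketch, not the
# Gauss-Seidel blocking of the following slides.

T = 4  # tile size; in practice chosen so both tiles fit in cache

def transpose(a, n):
    b = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):       # b is written with stride n
            b[j][i] = a[i][j]
    return b

def transpose_tiled(a, n):
    b = [[0] * n for _ in range(n)]
    for ii in range(0, n, T):            # loop over tiles
        for jj in range(0, n, T):
            for i in range(ii, min(ii + T, n)):   # loop inside one tile
                for j in range(jj, min(jj + T, n)):
                    b[j][i] = a[i][j]
    return b

n = 10
a = [[i * n + j for j in range(n)] for i in range(n)]
assert transpose_tiled(a, n) == transpose(a, n)
```

The min() calls handle matrices whose size is not a multiple of the tile size.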
22
Data access optimizations: Loop blocking
Example: 1D blocking for red/black GS, respect the data dependencies!
23
Data access optimizations: Loop blocking
Code after applying the 1D blocking technique (B = number of GS iterations to be blocked/combined):

for it = 1 to numIter/B do
  // Special handling: rows 1, ..., 2B-1
  // Not shown here ...
24
Data access optimizations: Loop blocking

  // Inner part of the 2D grid
  for k = 2*B to n-1 do
    for i = k to k-2*B+1 by -2 do
      for j = 1+(k+1)%2 to n-1 by 2 do
        relax(u(j,i))
        relax(u(j,i-1))
      end for
    end for
  end for
25
Data access optimizations: Loop blocking

  // Special handling: rows n-2B+1, ..., n-1
  // Not shown here ...
end for

Result: data is loaded into the cache only once per B Gauss-Seidel iterations, provided that 2B+2 grid rows fit in the cache simultaneously.
If grid rows are too large, 2D blocking can be applied.
26
Data access optimizations: Loop blocking
More complicated blocking schemes exist. Illustration: 2D square blocking.
27
Part III
Towards Scalable FE Software
28
Multigrid: V-Cycle
Goal: solve A_h u_h = f_h using a hierarchy of grids.
[V-cycle diagram: relax on the fine grid, form the residual, restrict it to the next coarser grid, solve there (by recursion down to the coarsest grid), interpolate the correction back, correct, and relax again.]
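The V-cycle can be written out in a few lines of Python for the 1D Poisson problem -u'' = f with zero Dirichlet boundaries. This is a self-contained illustration (weighted Jacobi smoothing, full-weighting restriction, linear interpolation), not the DiMEPACK or HHG implementation.

```python
# V(2,2) cycle for -u'' = f on (0,1), n = 2^L - 1 interior points,
# stencil (2u_i - u_{i-1} - u_{i+1}) / h^2, arrays with ghost zeros.

OMEGA = 2.0 / 3.0  # weighted-Jacobi damping

def residual(u, f, n, h):
    return [0.0] + [f[i] - (2*u[i] - u[i-1] - u[i+1]) / h**2
                    for i in range(1, n + 1)] + [0.0]

def smooth(u, f, n, h, sweeps=2):
    for _ in range(sweeps):
        r = residual(u, f, n, h)
        for i in range(1, n + 1):
            u[i] += OMEGA * r[i] * h**2 / 2   # diagonal of A is 2/h^2

def vcycle(u, f, n, h):
    if n == 1:                                # coarsest grid: solve directly
        u[1] = f[1] * h**2 / 2
        return
    smooth(u, f, n, h)                        # pre-smoothing: relax
    r = residual(u, f, n, h)                  # residual
    nc = (n - 1) // 2
    rc = [0.0] + [0.25*r[2*j-1] + 0.5*r[2*j] + 0.25*r[2*j+1]
                  for j in range(1, nc + 1)] + [0.0]   # restrict
    ec = [0.0] * (nc + 2)
    vcycle(ec, rc, nc, 2 * h)                 # solve by recursion
    for j in range(1, nc + 1):                # interpolate and correct
        u[2*j] += ec[j]
        u[2*j-1] += 0.5 * ec[j]
        u[2*j+1] += 0.5 * ec[j]
    smooth(u, f, n, h)                        # post-smoothing: relax

n = 63
h = 1.0 / (n + 1)
f = [1.0] * (n + 2)
u = [0.0] * (n + 2)
for _ in range(10):
    vcycle(u, f, n, h)
print(max(abs(x) for x in residual(u, f, n, h)))  # far below the initial 1.0
```

The residual shrinks by a roughly constant factor per cycle, independently of the grid size: the property that makes multigrid O(N).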
29
Cache-optimized multigrid: the DiMEPACK library
DFG project DiME: data-local iterative methods.
Fast algorithm + fast implementation:
correction scheme (V-cycles, FMG), rectangular domains,
constant 5-/9-point stencils, Dirichlet/Neumann boundary conditions.
http://www10.informatik.uni-erlangen.de/dime
30
V(2,2) cycle - the bottom line

MFlop/s  For what
13       Standard 5-point operator
56       Cache optimized (loop orderings, data merging, simple blocking)
150      Constant coefficients + skewed blocking + padding
220      Eliminating the rhs if it is 0 everywhere but the boundary
31
Parallel High Performance FE Multigrid
Parallelize "plain vanilla" multigrid:
partition the domain, parallelize all operations on all grids, use clever data structures.
Do not worry (so much) about coarse grids: idle processors? short messages? sequential dependency in the grid hierarchy?
Why we do not use conventional domain decomposition (DD):
DD without a coarse grid does not scale (algorithmically) and is suboptimal for large problems / many processors.
DD with coarse grids may be as efficient as multigrid, but is just as difficult to parallelize (the difficulty lies in parallelizing the coarse grid).
32
Hierarchical Hybrid Grids (HHG)
Unstructured input grid: resolves the geometry of the problem domain.
Patch-wise regular refinement generates nested grid hierarchies naturally suitable for geometric multigrid algorithms.
New: modify storage formats and operations on the grid to exploit the regular substructures.
Does an unstructured grid with 10^12 elements make sense?
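The arithmetic behind that question: each regular refinement step splits every tetrahedron into 8 children, so element counts grow by 8 per level. The input-grid size below is an assumed, illustrative number, not an HHG benchmark.

```python
# Patch-wise regular refinement: E tetrahedra become E * 8^L after L levels.
E = 10_000  # tetrahedra in the unstructured input grid (assumed)
for L in range(11):
    print("level", L, ":", E * 8 ** L, "elements")
# level 10 already gives about 1.07e13 elements: a grid of this size
# only makes sense if the structure inside the patches is exploited
```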
HHG - Ultimate Parallel FE Performance!
33
HHG refinement example
Input Grid
34
HHG Refinement example
Refinement Level one
35
HHG Refinement example
Refinement Level Two
36
HHG Refinement example
Structured Interior
37
HHG Refinement example
Structured Interior
38
HHG Refinement example
Edge Interior
39
HHG Refinement example
Edge Interior
40
Parallel HHG Framework: Design Goals
To realize good parallel scalability:
minimize latency by reducing the number of messages that must be sent,
optimize for high-bandwidth interconnects ⇒ large messages,
avoid local copying into MPI buffers.
41
HHG for Parallelization: use regular HHG patches for partitioning the domain.
42
HHG Parallel Update Algorithm

for each vertex do
  apply operation to vertex
end for
// update vertex primary dependencies

for each edge do
  copy from vertex interior
  apply operation to edge
  copy to vertex halo
end for
// update edge primary dependencies

for each element do
  copy from edge/vertex interiors
  apply operation to element
  copy to edge/vertex halos
end for
// update secondary dependencies
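The interior/halo copy pattern underlying this schedule can be reduced to a 1D sketch: split an array into two patches with one-cell halos, relax each patch on local data only, and the combined result matches an unsplit sweep. This is only a model of the communication pattern, not the actual HHG code.

```python
# Ghost/halo sketch in 1D: each patch owns an interior and carries
# halo copies of its neighbour's boundary values.

def sweep(u):  # 3-point Jacobi sweep, endpoints fixed
    return [u[0]] + [(u[i-1] + u[i+1]) / 2 for i in range(1, len(u) - 1)] + [u[-1]]

u = [float(i * i % 7) for i in range(10)]

# patch-local version: interiors u[1..4] and u[5..8], one halo cell each side
left  = u[0:6]    # owns u[1..4], halo cells u[0] and u[5]
right = u[4:10]   # owns u[5..8], halo cells u[4] and u[9]
new_left, new_right = sweep(left), sweep(right)

# combining the owned interiors reproduces the global sweep exactly;
# in HHG, the "copy to halo" steps refresh the halo cells afterwards
combined = [u[0]] + new_left[1:5] + new_right[1:5] + [u[9]]
assert combined == sweep(u)
```

Because each patch reads only its own halo copies, all patches can be relaxed in parallel between two exchange steps.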
43
Towards Scalable FE Software
Performance Results
44
Node Performance is Difficult! (B. Gropp)
DiMe project: cache-aware multigrid (1996-...)

grid size     17^3  33^3  65^3  129^3  257^3  513^3
standard      1072  1344   715    677    490    579
no blocking   2445  1417   995   1065    849    819
2x blocking   2400  1913  1312   1319   1284   1282
3x blocking   2420  2389  2167   2140   2134   2049

Performance of a 3D multigrid smoother for the 7-point stencil, in MFlop/s on a 1.4 GHz Itanium.
Array padding, temporal blocking in EPIC assembly language,
software pipelining in the extreme (M. Stürmer, J. Treibig).
Node Performance is Possible!
45
Single Processor HHG Performance on Itanium forRelaxation of a Tetrahedral Finite Element Mesh
46
#Proc   #unknowns x 10^6   Ph. 1: sec   Ph. 2: sec   Time to sol.
    4          134.2          3.16        6.38*         37.9
    8          268.4          3.27        6.67*         39.3
   16          536.9          3.35        6.75*         40.3
   32        1,073.7          3.38        6.80*         40.6
   64        2,147.5          3.53        4.92          42.3
  128        4,295.0          3.60        7.06*         43.2
  252        8,455.7          3.87        7.39*         46.4
  504       16,911.4          3.96        5.44          47.6
 2040       68,451.0          4.92        5.60          59.0
 3825      128,345.7          6.90                      82.8
 4080      136,902.0                      5.68
 6102      205,353.1                      6.33
 8152      273,535.7                      7.43*
 9170      307,694.1                      7.75*

Parallel scalability of a scalar elliptic problem discretized by tetrahedral finite elements.
Times for 12 V(2,2) cycles on SGI Altix (Itanium 2, 1.6 GHz).
Largest problem solved to date: 3.07 x 10^11 DOFs on 9170 processors, 7.8 s per V(2,2) cycle.
B. Bergen, F. Hülsemann, U. Rüde, G. Wellein: ISC Award 2006; also: "Is 1.7 x 10^10 unknowns the largest finite element system that can be solved today?", SuperComputing, Nov. 2005.
47
So what?
With scalable algorithms, well implemented, we can solve (scalar) PDEs with
more than 10 million unknowns on a desktop, and
more than 300 billion unknowns on a TOP-50 class machine (HLRB-II, 63 TFlop/s peak, 40 TByte memory).
In the future we will be able to handle
around 2010: about 5 trillion unknowns on a PetaScale machine (assuming 1 PByte of memory), and
around 2012-2015: about 50 trillion unknowns on a machine delivering a petaflop for real applications (assuming 10 PByte of memory).
This is, for example, sufficient to resolve all of earth's atmosphere with
a 10 km grid resolution (current desktop),
a 250 m mesh (current supercomputer),
a 100 m mesh (peak-PetaScale system in 2010?), or
a 50 m mesh (application-PetaScale system in 2015?).
This is a building block for many other applications.
48
Programming techniques
Seemingly conflicting goals:
Portability/flexibility: the code should run on a variety of (parallel) target platforms, including PC clusters, NUMA machines, etc.
Efficiency: the code should run as efficiently as possible on each target platform.
How can this conflict be resolved?
49
Part IV
Multicore Architectures
50
The STI Cell Processor
A hybrid multicore processor based on the IBM Power architecture.
A (simplified) PowerPC core
runs the operating system and controls the execution of programs.
Multiple co-processors (8; on the Sony PS3 only 6 are available)
operate on fast, private on-chip memory and are optimized for computation.
A DMA controller copies data from/to main memory:
multi-buffering can hide main memory latencies completely for streaming-like applications,
loading local copies has low and known latencies, and
memory with multiple channels and links can be exploited if many memory transactions are in flight.
51
The STI Cell Broadband Engine
52
Cell LBM Simulations
Goal: demanding (flow) simulations at moderate cost but very fast, e.g. simulation of blood flow in an aneurysm for therapy and surgery planning.
Available Cell systems:
blades and the PlayStation 3.
53
Synergistic Processor Unit
A "very small computer of its own":
128 all-purpose 128-bit registers,
operates on 256 kB of Local Store (LS).
Nearly all operations are SIMD:
a single scalar operation is more expensive than a SIMD one, and
only loads and stores of 16 naturally aligned bytes from/to the LS are possible.
25.6 GFlop/s in single precision (fused multiply-add),
but only truncating rounding; fast double precision will be available soon.
No dynamic branch prediction, only hints in software,
but around 20 cycles of branch miss penalty.
No system calls or privileged operations.
54
Memory Flow Controller
Communication interface (to the PPE and the other SPEs):
mailboxes and signal notification,
memory mapping of the Local Store and register file,
used by the PPU to upload programs and control the SPU.
Asynchronous data transfers (DMA):
LS <-> main memory, other LSes, or devices,
up to 16 DMAs in flight,
list transfers possible (scatter/gather),
only naturally aligned transfers of 1, 2, 4, 8, or n x 16 bytes,
usually multiple transfers on multiple MFCs are necessary to saturate the main memory bandwidth.
All interaction with the SPU goes through the channel interface.
55
Programming the Cell BE
The hard way:
control the SPEs using management libraries,
issue DMAs via language extensions,
do address calculations manually,
exchange main memory addresses, array sizes etc.,
synchronize using mailboxes, signals, or libraries.
Frameworks:
Accelerated Library Framework (ALF) and Data, Communication, and Synchronization (DaCS) by IBM,
RapidMind SDK.
Accelerated libraries.
Single-source compiler:
IBM's xlc-cbe-sse is in alpha stage and uses OpenMP.
56
Naive SPU implementation: A[] = A[]*c

volatile vector float ls_buffer[8] __attribute__((aligned(128)));

void scale( unsigned long long gs_buffer,  // main memory address of vector
            int number_of_chunks,          // number of chunks of 32 floats
            float factor ) {               // scaling factor
  vector float v_fac = spu_splats(factor); // create SIMD vector with all
                                           // four elements being factor
  for ( int i = 0; i < number_of_chunks; ++i ) {
    mfc_get( ls_buffer, gs_buffer, 128, 0, 0, 0 );   // DMA reading i-th chunk
    mfc_write_tag_mask( 1 << 0 );                    // wait for DMA...
    mfc_read_tag_status_all();                       // ...to complete
    for ( int j = 0; j < 8; ++j )
      ls_buffer[j] = spu_mul( ls_buffer[j], v_fac ); // scale local copy using SIMD
    mfc_put( ls_buffer, gs_buffer, 128, 0, 0, 0 );   // DMA writing i-th chunk
    mfc_write_tag_mask( 1 << 0 );                    // wait for DMA...
    mfc_read_tag_status_all();                       // ...to complete
    gs_buffer += 128;                                // advance main memory pointer
  }
}
57
Remove latencies using multi-buffering

volatile vector float ls_buffer[3][8] __attribute__((aligned(128)));
...
mfc_get( ls_buffer[0], gs_buffer, 128, 0, 0, 0 );  // request first chunk
for ( int i = 0; i < number_of_chunks; ++i ) {
  int cur  = ( i ) % 3;  // buffer no. and DMA tag for i-th chunk
  int next = (i+1) % 3;  //   "   for the (i-2)-th and (i+1)-th chunk
  if ( i < number_of_chunks-1 ) {
    mfc_write_tag_mask( 1 << next );  // make sure the (i-2)-th chunk...
    mfc_read_tag_status_all();        // ...has been stored
    mfc_get( ls_buffer[next], gs_buffer+128, 128, next, 0, 0 );  // request (i+1)-th chunk
  }
  mfc_write_tag_mask( 1 << cur );     // wait until i-th chunk...
  mfc_read_tag_status_all();          // ...is available
  for ( int j = 0; j < 8; ++j )
    ls_buffer[cur][j] = spu_mul( ls_buffer[cur][j], v_fac );
  mfc_put( ls_buffer[cur], gs_buffer, 128, cur, 0, 0 );  // store i-th chunk
  gs_buffer += 128;
}
mfc_write_tag_mask( 1 | 2 | 4 );  // wait for any...
mfc_read_tag_status_all();        // ...outstanding DMA
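The index scheme above (buffer and DMA tag i % 3 hold chunk i) is the essential part; it can be modelled in a few lines of Python, with the "DMA" replaced by a plain list copy. Only the rotation logic is real here; latencies and tag-mask waits are of course not simulated.

```python
# Pure-Python model of the triple-buffer rotation: while chunk i is
# processed, chunk i+1 is fetched and chunk i-2's store drains.

def scale_chunks(chunks, factor):
    bufs = [None, None, None]   # three local-store buffers = three DMA tags
    bufs[0] = list(chunks[0])   # "mfc_get": request the first chunk
    out = []
    for i in range(len(chunks)):
        cur, nxt = i % 3, (i + 1) % 3
        if i + 1 < len(chunks):
            # buffer nxt last held chunk i-2; on the SPU the tag-mask wait
            # guarantees its store has completed before the buffer is reused
            bufs[nxt] = list(chunks[i + 1])          # prefetch chunk i+1
        out.append([x * factor for x in bufs[cur]])  # compute on chunk i
    return out

chunks = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
assert scale_chunks(chunks, 2.0) == [[2.0, 4.0], [6.0, 8.0],
                                     [10.0, 12.0], [14.0, 16.0]]
```

Because (i+1) % 3 == (i-2) % 3, waiting on tag `next` before the prefetch is exactly what makes the reuse of that buffer safe.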
58
Part V
Case study: Lattice Boltzmann Methods for Flow Simulation on the PlayStation
59
Example: OpenMP-parallel Flow Animation
Resolution: 880 x 880 x 336; 260M cells, 6.5M active on average
60
Simulation of Metal Foams
Joint work with C. Körner, WTM Erlangen
61
Aneurysms
• Aneurysms are local dilatations of blood vessels
• Localized mostly at large arteries in soft tissue (e.g. aorta, brain vessels)
• Can be diagnosed by modern imaging techniques (e.g. MRT, DSA)
• Can be treated e.g. by clipping or coiling
62
A data structure for simulating flow in blood vessels
• In a brain geometry only about 3-10% of the nodes are fluid
• We decompose the domain into equally sized blocks, so-called patches, and only allocate patches containing fluid cells
• This reduces the memory requirements and the computational time significantly
• For the Cell processor we use patches of size 8x8x8, fitting into the SPU Local Store
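The allocation idea can be sketched in Python: cover the bounding box with 8x8x8 patches, but create storage only for patches that contain at least one fluid cell. The function and the toy "vessel" geometry below are illustrative, not the actual simulation code.

```python
# Sparse patch allocation: only patches containing fluid get memory.

P = 8  # patch edge length, matching the 8x8x8 SPU patches

def allocate_patches(is_fluid, nx, ny, nz):
    patches = {}
    for x in range(nx):
        for y in range(ny):
            for z in range(nz):
                if is_fluid(x, y, z):
                    key = (x // P, y // P, z // P)
                    if key not in patches:   # first fluid cell in this patch
                        patches[key] = [[[0.0] * P for _ in range(P)]
                                        for _ in range(P)]
    return patches

# toy geometry: a thin channel along the x axis inside a 64^3 box
vessel = lambda x, y, z: abs(y - 32) < 3 and abs(z - 32) < 3
patches = allocate_patches(vessel, 64, 64, 64)
total = (64 // P) ** 3
print(len(patches), "of", total, "patches allocated")  # 32 of 512
```

For geometries where only a few percent of the cells are fluid, the savings in memory and work scale accordingly.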
63
Results
Velocity near the wall in an aneurysm
Oscillatory shear stress near the wall in an aneurysm
64
Pulsating Blood Flow in an Aneurysm
[Diagram: patient data set; collaboration between Neuro-Radiology (Prof. Dörfler, Dr. Richter) and Computer Science; imaging, simulation, CFD]
65
LBM Optimized for Cell
Memory layout:
optimized for DMA transfers;
information propagating between patches is reordered on the SPE and stored sequentially in memory for simple and fast exchange.
Code optimization:
kernels hand-optimized in assembly code,
SIMD-vectorized streaming and collision,
branch-free handling of bounce-back boundary conditions.
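Branch-free bounce-back can be sketched on a 1D toy lattice with two directions: instead of an "if obstacle: reflect" test, a 0/1 mask blends the streamed and reflected contributions arithmetically, so the kernel has no data-dependent branch (which matters on the SPU, where a mispredicted branch costs around 20 cycles). The actual Cell kernel is SIMD assembly; this Python version only shows the masking idea.

```python
# Branch-free bounce-back on a periodic 1D lattice.
# f_e / f_w: east- and west-moving populations, solid: 0/1 obstacle mask.

def stream(f_e, f_w, solid):
    n = len(solid)
    g_e, g_w = [0.0] * n, [0.0] * n
    for i in range(n):
        se = solid[(i + 1) % n]   # is the east neighbour solid?
        sw = solid[(i - 1) % n]   # is the west neighbour solid?
        # east-moving value streams to i+1 if that cell is fluid,
        # otherwise it bounces back into the west-moving direction at i
        g_e[(i + 1) % n] += (1 - se) * f_e[i]
        g_w[i]           += se * f_e[i]
        # west-moving value, symmetric
        g_w[(i - 1) % n] += (1 - sw) * f_w[i]
        g_e[i]           += sw * f_w[i]
    return g_e, g_w

solid = [0, 0, 1, 0]               # cell 2 is an obstacle
f_e = [1.0, 2.0, 0.0, 4.0]
f_w = [5.0, 6.0, 0.0, 8.0]
g_e, g_w = stream(f_e, f_w, solid)
assert sum(g_e) + sum(g_w) == sum(f_e) + sum(f_w)   # mass is conserved
```

In the SIMD kernel the same blend is done with select/mask instructions on four lattice values at a time.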
66
Performance Results
[Bar chart, scale 0 to 50.0: straight-forward C code vs. SIMD-optimized assembly on Xeon 5160, PPE, and SPE* (*measured on Local Store without DMA transfers); recoverable bar values: 49.0, 2.0, 4.8, 10.4.]
67
Performance Results
30,0
47,5
65,0
82,5
100,0
1 2 3 4 5 6
95949493
81
42
68
Performance Results
[Bar chart, scale 0 to 50.0: Xeon 5160* vs. PlayStation 3, 1 core vs. 1 CPU; recoverable bar values: 43.8 and 11.7 (Xeon 5160*), 21.1 and 9.1 (PlayStation 3).]
*performance-optimized code by LB-DC
Other work: LBM on Graphics Hardware
See also the work by Jonas Tölke and M. Krafczyk at TU Braunschweig.
Master thesis by J. Habich (co-supervised jointly with G. Wellein, RRZE Erlangen):
an nVidia GeForce 8800 GTX (G80 processor) reaches up to 250 Fluid MLUP/s, after careful tuning!
69
Multigrid on the Cell Processor
Master thesis by Daniel Ritter:
A Fast Multigrid Solver for Molecular Dynamics on the Cell Broadband Engine.
Performance is limited by the available bandwidth; the Local Store is too small (?) for blocking techniques.
70
71
Part VI
Conclusions
72
What have we learned?
The future is parallel, on multicore CPUs.
Memory bandwidth per core will be a severe bottleneck ("inverse Moore's law").
Programming current leading-edge multicore architectures to exploit their performance potential requires expert knowledge of the architecture:
better tool and system support is needed to master the complexity of these architectures.
73
An HPC Tutorial! Getting Supercomputer Performance is Easy!
If parallel efficiency is bad, choose a slower serial algorithm: it is probably easier to parallelize, and it will make your speedups look much more impressive.
Introduce the "CrunchMe" variable for getting high Flop rates. Advanced method: disguise CrunchMe by using an inefficient (but compute-intensive) algorithm from the start.
Introduce the "HitMe" variable to get good cache hit rates. Advanced version: disguise HitMe within "clever data structures" that introduce a lot of overhead.
Never cite "time to solution": who cares whether you solve a real-life problem anyway; it is the MachoFlops that interest the people who pay for your research.
Never waste your time trying to use a complicated algorithm in parallel (such as multigrid):
the more primitive the algorithm, the easier it is to maximize your MachoFlops.
74
Talk is Over
Questions?