Invasive Compute Balancing for Applications with Hybrid Parallelization … · 2013-11-21 ·...
Technische Universität München
SBAC-PAD’2013
Invasive Compute Balancing for Applications with Hybrid Parallelization
M. Schreiber, C. Riesinger, T. Neckel, H.-J. Bungartz
Technische Universität München
October 25, 2013
Christoph Riesinger: Invasive Compute Balancing for Applications with Hybrid Parallelization
SBAC-PAD’2013, October 25, 2013 1
Topics
Motivation
Methodology & Background
    Hybrid Parallelization
    Compute Migration
    Invasive Computing
Application: Tsunami Simulation
Results
    Artificial Workload
    Tsunami Simulation
Conclusion
Motivation
HPC Simulations with Dynamic Adaptive Mesh Refinement (DAMR)
• Use high resolution grids in feature-rich areas
• Save computations in feature-poor areas
• Efficient parallelization of DAMR is challenging
Compute imbalances
• Changing number of cells per compute unit over simulation time
• Several approaches available to tackle imbalances
Goal
Maximize efficiency of DAMR simulations
Hybrid Parallelization
Exploit the best of both worlds: Distributed and shared memory parallelization
Distributed memory parallelization
• Communication with messages over buffers
• Mandatory to program big clusters
• Possible overhead due to data migration
Shared memory parallelization
• Same memory accessible for all threads
• Trend towards thousands of cores (Xeon Phi, GPUs, etc.)
+                         -
common address space      access conflicts for shared resources
cache coherency           false sharing for caches
avoids data migration     management tables
Hybrid Parallelization in our context
• We use hybrid parallelization on cache-coherent memory systems
• We start a constant number of MPI ranks
• We start a constant number of threads (e.g. one per core)
[Figure: physical layer (cores, cache-coherent shared memory, bus system) and logical layer (worker threads grouped into MPI rank 0 and MPI rank 1; logical separation of applications)]
Compute Migration
To overcome the issue of load imbalances due to dynamic adaptive grids, we use compute migration instead of data migration.
• Instead of copying data between MPI ranks, we assign threads to MPI ranks ⇒ avoids copy operations
• The number of threads per MPI rank is not fixed but variable over runtime ⇒ satisfies dynamic demands
• Number of threads per MPI rank is relatively small ⇒ lower runtime overhead
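The idea of a variable number of threads per MPI rank can be sketched as follows. This is only an illustration, not the implementation used in this work: the function name `distribute_cores` and the proportional-share heuristic are assumptions. A fixed pool of cores is assigned to the ranks in proportion to their current cell counts, and every rank keeps at least one core.

```python
def distribute_cores(cells_per_rank, total_cores):
    """Assign cores to ranks proportionally to their workload (hypothetical heuristic)."""
    num_ranks = len(cells_per_rank)
    if total_cores < num_ranks:
        raise ValueError("need at least one core per rank")

    # Every rank keeps one core so that it can always make progress.
    cores = [1] * num_ranks
    spare = total_cores - num_ranks
    total_cells = sum(cells_per_rank)

    if total_cells == 0:
        # No work anywhere: spread the spare cores evenly.
        for i in range(spare):
            cores[i % num_ranks] += 1
        return cores

    # Ideal (fractional) share of the spare cores for each rank.
    shares = [spare * c / total_cells for c in cells_per_rank]
    for r in range(num_ranks):
        cores[r] += int(shares[r])

    # Hand out the cores lost to rounding, largest fractional remainder first.
    leftover = total_cores - sum(cores)
    by_remainder = sorted(range(num_ranks),
                          key=lambda r: shares[r] - int(shares[r]),
                          reverse=True)
    for r in by_remainder[:leftover]:
        cores[r] += 1
    return cores


# A balanced load keeps the even split; a skewed load moves cores, not data.
balanced = distribute_cores([100, 100, 100, 100], 40)
skewed = distribute_cores([1000, 0, 0, 0], 40)
```

In this sketch only the core counts change between calls; the cell data of each rank stays where it is, which is the point of compute migration.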
Invasive Computing
• Compute migration realized with Invasive Computing paradigms
• Processes can specify varying resource requirements during runtime
• Requirements are specified by application developer
Interfaces
• invade: Resources are exclusively requested depending on particular application-specific requirements
• infect: After invading specific resources, the program uses them for certain computations
• retreat: Release computation resources
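The three interfaces can be illustrated with a toy example. The class `ResourceManager`, the method signatures, and the parameters `min_cores` and `max_cores` are hypothetical and chosen for illustration only; they are not the API of the invasive computing framework.

```python
from concurrent.futures import ThreadPoolExecutor


class ResourceManager:
    """Toy manager handing out exclusive core ids (hypothetical, for illustration)."""

    def __init__(self, num_cores):
        self.free = set(range(num_cores))

    def invade(self, min_cores, max_cores):
        """Claim between min_cores and max_cores exclusively; return the claimed ids."""
        if len(self.free) < min_cores:
            raise RuntimeError("not enough free cores")
        count = min(max_cores, len(self.free))
        claim = set(sorted(self.free)[:count])
        self.free -= claim
        return claim

    def retreat(self, claim):
        """Give the claimed cores back to the pool."""
        self.free |= claim


def infect(claim, kernel, work_items):
    """Run kernel over work_items with one worker thread per claimed core.

    Note: the threads are not pinned to the claimed cores in this sketch.
    """
    with ThreadPoolExecutor(max_workers=len(claim)) as pool:
        return list(pool.map(kernel, work_items))


manager = ResourceManager(num_cores=8)
claim = manager.invade(min_cores=2, max_cores=4)      # request resources
squares = infect(claim, lambda x: x * x, range(10))   # compute on them
manager.retreat(claim)                                # release resources
```

The sketch keeps the three phases separate so that the claim can be changed between two computation phases without touching the data.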
Invasive Computing: Resource Manager
Properties
• Resources are dynamically assigned during runtime to cope with dynamically changing demands
• Global decisions are based on Performance Graphs
• Performance Graphs have to be provided by the application
• Implemented as a separate thread
Advantages
• Finds global optimum of computing resource utilization
• Allows running different applications with varying load
• Avoids cache thrashing due to core multiplexing
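How a resource manager might use application-provided performance data can be sketched as follows. The slides do not specify the format of the Performance Graphs, so this example assumes that each application supplies a list whose entry `n` is its expected throughput on `n` cores. The function name `allocate` and the greedy strategy are assumptions as well; the greedy choice is only guaranteed to reach the global optimum if every graph is concave (diminishing returns).

```python
def allocate(performance_graphs, total_cores):
    """Distribute cores greedily by marginal throughput gain.

    performance_graphs[a][n] is the assumed throughput of application a
    on n cores (entry 0 is the throughput with no cores, i.e. 0).
    """
    alloc = [0] * len(performance_graphs)
    for _ in range(total_cores):
        best_app, best_gain = None, 0.0
        for a, graph in enumerate(performance_graphs):
            n = alloc[a]
            if n + 1 < len(graph):
                gain = graph[n + 1] - graph[n]
                if gain > best_gain:
                    best_app, best_gain = a, gain
        if best_app is None:
            break  # no application benefits from another core
        alloc[best_app] += 1
    return alloc


# Two applications with diminishing returns share four cores.
graphs = [[0, 10, 18, 24, 28], [0, 5, 9, 12, 14]]
result = allocate(graphs, 4)
```

With these example graphs the first application receives three cores and the second one core, because the fourth core gains more on the second application than on the first.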
Application: Tsunami Simulation with SWE 1/3
Governing equations: Shallow Water Equations (SWE)
Homogeneous form given by conservation law of hyperbolic equations:
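\[
\frac{\partial U(x, y, t)}{\partial t} + \frac{\partial G(U(x, y, t))}{\partial x} + \frac{\partial H(U(x, y, t))}{\partial y} = 0
\]
with
\[
U = (h, hu, hv)^T, \quad
G(U) = \begin{pmatrix} hu \\ hu^2 + \tfrac{1}{2} g h^2 \\ huv \end{pmatrix}, \quad
H(U) = \begin{pmatrix} hv \\ huv \\ hv^2 + \tfrac{1}{2} g h^2 \end{pmatrix}
\]
h: Height of water relative to ground sea level
u: Velocity in x-direction
v: Velocity in y-direction
U: Conserved quantities
G, H: Flux functions describe the change of conserved quantities over time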
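The two flux functions can be written down directly from the definitions above. This is a minimal sketch: the function names `flux_g` and `flux_h` are hypothetical, `g` is taken to be the gravitational acceleration with an assumed default of 9.81 m/s², and the water height `h` must be positive because the velocities are recovered by dividing by it.

```python
G_ACCEL = 9.81  # gravitational acceleration in m/s^2 (assumed value)


def flux_g(U, g=G_ACCEL):
    """Flux in x-direction: (hu, hu^2 + g h^2 / 2, huv)."""
    h, hu, hv = U
    u = hu / h  # velocity in x-direction, requires h > 0
    return (hu, hu * u + 0.5 * g * h * h, hv * u)


def flux_h(U, g=G_ACCEL):
    """Flux in y-direction: (hv, huv, hv^2 + g h^2 / 2)."""
    h, hu, hv = U
    v = hv / h  # velocity in y-direction, requires h > 0
    return (hv, hu * v, hv * v + 0.5 * g * h * h)


# Example state: h = 2, u = 2, v = 3, evaluated with g = 10 for round numbers.
state = (2.0, 4.0, 6.0)
gx = flux_g(state, g=10.0)
hy = flux_h(state, g=10.0)
```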
Application: Tsunami Simulation 2/3
Step 1: Weak form
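By multiplying the equation with a test function ϕ_i and applying the divergence theorem we get the weak form:
\[
\underbrace{\int_T U_t \, \varphi_i}_{\text{mass term}}
- \underbrace{\int_T \left( G(U) \cdot \frac{\partial \varphi_i}{\partial x} + H(U) \cdot \frac{\partial \varphi_i}{\partial y} \right)}_{\text{stiffness term}}
+ \underbrace{\oint_T F(U) \, \varphi_i \cdot \vec{n}}_{\text{flux term}}
= 0
\]
ϕ_i: Test function
T: Triangle grid cell
\vec{n}(x, y): Outward pointing normal of the grid cell
Step 2: Approximation
\[
U(x, y, t) \approx \tilde{U}(x, y, t) = \sum_i U_i(t) \, \varphi_i(x, y)
\]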
Application: Tsunami Simulation 3/3
Step 3: Rearrangement
• G and H are evaluated node-wise with a Lagrange reconstruction of a polynomial
• Explicit Euler time stepping
• Rearrange so that computations are based on matrix/matrix and vector/matrix operations
\[
U_i^{t+\Delta t} = U_i^{t} + \Delta t \, M^{-1} \left( S_x U(t) + S_y U(t) + F(U^-(t), U^+(t)) \right)
\]
F: Flux via the boundaries (e.g. Lax-Friedrichs flux)
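The two ingredients of the update formula, a boundary flux and an explicit Euler step, can be sketched in plain Python. This is a simplified illustration and not the matrix-based formulation of the simulation code: the function names are hypothetical, the edge normal is assumed to point in the +x direction, the dissipation coefficient `alpha` is assumed to be the largest local wave speed |u| + sqrt(g h), and `rhs` stands in for the whole bracket multiplied by M^{-1}.

```python
import math


def flux_g(U, g):
    """Shallow water flux in x-direction for the state U = (h, hu, hv)."""
    h, hu, hv = U
    u = hu / h
    return (hu, hu * u + 0.5 * g * h * h, hv * u)


def lax_friedrichs_x(U_minus, U_plus, g=9.81):
    """Lax-Friedrichs flux across an edge whose normal points in +x direction."""
    def speed(U):
        h, hu, _ = U
        return abs(hu / h) + math.sqrt(g * h)

    # Largest wave speed of both sides controls the numerical dissipation.
    alpha = max(speed(U_minus), speed(U_plus))
    f_minus, f_plus = flux_g(U_minus, g), flux_g(U_plus, g)
    return tuple(0.5 * (fm + fp) - 0.5 * alpha * (up - um)
                 for fm, fp, um, up in zip(f_minus, f_plus, U_minus, U_plus))


def euler_step(U, rhs, dt):
    """Explicit Euler: U^{t+dt} = U^t + dt * rhs(U^t), component-wise."""
    return tuple(u + dt * r for u, r in zip(U, rhs(U)))


state = (2.0, 4.0, 6.0)
same_side_flux = lax_friedrichs_x(state, state, g=10.0)
next_state = euler_step((1.0, 2.0, 3.0), lambda U: (1.0, 0.0, -1.0), dt=0.5)
```

If both sides of the edge hold the same state, the dissipation term vanishes and the numerical flux reduces to the physical flux, which is a quick consistency check for any implementation.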
Results: Artificial Workload
System
• 4× Intel Xeon E7-4850 @ 2.00 GHz
• 4× 10 physical cores plus hyper-threading
• 256 GB memory accessible by all cores
• Threading implemented using TBB
Results & interpretation
[Figure: Invasive vs. non-invasive scenario with different workload sizes. x-axis: problem size; y-axis: invasive runtime normalized by non-invasive runtime (0 to 10); the break-even point is marked.]
⇒ With big problem sizes, Invasive Computing using compute balancing outperforms an implementation with work distributed equally to all ranks
⇒ With a problem size of 131072 (triangles), the simulation run-time was improved by 53%
Results: Tsunami Simulation 1/2
Setup
• Initial refinement depth of 14, thus creating (2 × 2)^14 grid cells
• The square domain is split along the diagonals and one quarter is assigned to one MPI rank during the whole simulation time
⇒ Due to the propagating wave and the resulting grid refinement, load imbalances occur
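For orientation, the initial cell count implied by this setup works out as follows, assuming a uniform refinement to depth 14 and an even split of the cells over the four quarters:

\[
(2 \times 2)^{14} = 4^{14} = 2^{28} = 268{,}435{,}456,
\qquad
\frac{4^{14}}{4} = 4^{13} = 67{,}108{,}864 \ \text{cells per quarter.}
\]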
Results
[Figure: stacked core-to-rank scheduling (y-axis, 0 to 40 cores) over real time (x-axis) for Rank 0 to Rank 3]
Results: Tsunami Simulation 2/2
[Figure: simulation time in seconds (y-axis, 0 to 300) for the cores / MPI ranks configurations 40 / 2, 40 / 4, 20 / 2 and 20 / 4 (x-axis); bars: non-invasive vs. invasive]
Interpretation
• Computational efficiency mostly improved by invasive compute migration
• The higher the number of ranks, the higher the potential improvement
Conclusion
• Compute migration as alternative solution for load imbalances (which can, e.g., result from dynamic adaptive grids)
• Extension of the invasive paradigm to support compute balancing
• Clear interfaces (invade, infect, retreat) for application developer to dynamically manage resources
• Explicit scaling data for resource manager to find global optimum
• Also applicable with independent applications running on the same system
• Robust optimizations in performance for simulations executed with hybrid parallelization on shared memory systems
Final slide
This work was supported by the German Research Foundation (DFG) as part of the Transregional Collaborative Research Centre "Invasive Computing" (SFB/TR 89).