Design of Data Management for Multi SPMD Workflow Programming Model
T. Dufaud*, M. Tsuji**, M. Sato**
*University of Versailles, France
**RIKEN Center for Computational Science, Japan
ESPM2: Fourth International IEEE Workshop on Extreme Scale Programming Models and Middleware
SC18, Dallas, Texas, USA, Monday, November 12th, 2018
Agenda
Motivation and background
Data exchange issues and model
Implementation through YML user interface
Development of Block Gauss-Jordan Algorithm
Multi SPMD (mSPMD) for large-scale computing
Why Multi SPMD?
- Flat MPI reaches a limit
- A simulation code may consist of several parallel programs
- Each SPMD is an independent program ⇒ component approach, reusability
Programming environment
- Dominant execution-model approaches: X + MPI, MPI + X
- Coupling between models and apps involves I/O
Proposed approach for a 2-level programming model
- Workflow programming with task dependencies
- Single Program Multiple Data at the fine layer
- Separation of data and computation (N. Emad, O. Delannoy, M. Dandouna, AICCSA 2010)
Programming environment: mSPMD
- XMP: a distributed-parallel PGAS language, http://www.xcalablemp.org
- YML: workflow, a graph of tasks where each task is described by a component, http://www.yml.prism.uvsq.fr
Multi SPMD (mSPMD) Programming Model
[Figure: the mSPMD software stack — a YML parallel workflow of <TASK> components, each an XMP SPMD program spanning several nodes, over StarPU / OpenMP / OpenACC etc. at the process, core and accelerator level]
- Coarse-grained tasks in a workflow
- Moderate-sized SPMD programs
Multi SPMD (mSPMD) Programming Model
[Figure: mSPMD software stack]
◼ introduce "parallelism" into tasks by XMP
◼ "heavy" tasks can be executed in parallel
Multi SPMD (mSPMD) Programming Model
[Figure: mSPMD software stack]
◼ divide a large parallel program into sub-programs to avoid the cost of communication in large systems
Multi SPMD (mSPMD) Programming Model
[Figure: mSPMD software stack]
◼ compose complex applications by combining parallel applications and libraries
Multi SPMD (mSPMD) Programming Model
[Figure: mSPMD software stack]
◼ Fault detection and recovery without code modification
Miwako Tsuji, Serge Petiton, Mitsuhisa Sato: "Fault Tolerance Features of a New Multi-SPMD Programming/Execution Environment", Proceedings of the First International Workshop on Extreme Scale Programming Models and Middleware, SC15 (2015)
I/O based Data Exchange in mSPMD
[Figure: matrix A[N][N] distributed on 4 nodes; <TASK 1> exports it to the file system with MPI_File_write() and <TASK 2> imports it with MPI_File_read()]
Each task: data import w/ MPI-IO, computation w/ n procs, data export w/ MPI-IO
Pros
- Easy for application developers (the MPI-IO functions are automatically generated)
- Portability, checkpointing
Cons
- Speed and performance instability
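The import/compute/export pattern can be mimicked in a serial Python sketch. This is only an illustration of the exchange-through-storage pattern: `np.save`/`np.load` stand in for the generated MPI_File_write()/MPI_File_read() calls, and the function names are hypothetical, not part of the actual framework, where each task is a distributed SPMD program.

```python
import os
import tempfile
import numpy as np

def task1(path, n):
    """TASK 1: compute a matrix, then export it through the file system."""
    A = np.random.rand(n, n)   # "computation w/ n procs" in the real system
    np.save(path, A)           # stands in for MPI_File_write()
    return A

def task2(path):
    """TASK 2: import the matrix from the file system, then compute on it."""
    A = np.load(path)          # stands in for MPI_File_read()
    return A @ A.T             # some follow-up computation

path = os.path.join(tempfile.mkdtemp(), "A.npy")
A = task1(path, 100)
result = task2(path)           # the two tasks share data only via storage
```

The tasks never communicate directly, which is what makes this strategy portable and checkpoint-friendly but sensitive to file-system speed.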
Agenda
Motivation and background
Data exchange issues and model
Implementation through YML user interface
Development of Block Gauss-Jordan Algorithm
Data exchange in mSPMD: context and choices
Parallel programming model at each level
- Graph of tasks at the higher level
- SPMD at the fine level
Data definition
- Component approach at the higher level (large parallel tasks)
- Definition of data separated from computation
- High level ⇒ global data (to be imported/exported)
- Low level ⇒ group of local data
- Import/export are synchronous, from/to a data repository
Data placement and dependencies
- Data flow deduced from the workflow
Data persistency
- The data repository can enable persistent storage
Quality
- High performance in case of intense computation
- Fault tolerance, persistency of important data
Key points of the system
Main design
- Detection of data distribution (given by the execution model)
- Performance of communication according to data distribution... and to the architecture and middleware
- Adapt the strategy at runtime (considering the size and distribution of the data)
- Gain the benefit of both dataflow and workflow knowledge (drawing our inspiration from M. Hugues et al., ASIODS 2011)
- Consider a task scheduler and a data scheduler; interaction between the schedulers for optimization ⇒ anticipate data migration, remapping of parallel data
Remarks
- Component approach + ability to get information from interfaces ⇒ enables integration of automatic decision tools
Target system
[Figure: target system architecture]
Agenda
Motivation and background
Data exchange issues and model
Implementation through YML user interface
Development of Block Gauss-Jordan Algorithm
First-step implementation enabled by the YML front end
Figure: integrating the new data management through the YML front end
Import/export with 2 strategies in YML-XMP
[Figure: matrix A[N][N] distributed on 4 nodes; <TASK 1> imports data w/ MPI-IO, computes w/ n procs, and exports to a PDR or to NFS; <TASK 2> imports from the PDR or NFS, computes w/ n procs, and exports w/ MPI-IO. The file-system path uses MPI_File_write() / MPI_File_read(); the PDR path uses component_export, remap and component_import]
<TASK PDR>: a data server written in MPI, encapsulated in an XMP component
Current Implementation of Parallel Data Repositories

    par
      # Start PDRs
      par(i := 0; nPDRs-1) do
        compute startPDR(i); notify(begin[i]);
      enddo
    //
      # Workflow Application
      ...
      notify(end);
    //
      # Stop PDRs
      wait(end);
      par(i := 0; nPDRs-1) do
        compute stopPDR(i);
      enddo
    endpar

The PDRs are implemented as tasks in a workflow.
[Figure: the workflow graph, with the PDR tasks running from (begin) to (end) alongside the application tasks]
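The repository interface can be sketched as a toy in-memory class in plain Python. This is an illustration only: the class and method names (`PDR`, `export_block`, `remap`, `import_block`) are hypothetical stand-ins for the MPI data server and its `component_export` / `remap` / `component_import` operations, and there is no actual distribution here.

```python
import numpy as np

class PDR:
    """Toy, in-memory stand-in for a Parallel Data Repository: tasks export
    named, block-distributed data and later import it, possibly after a
    remap to a different block grid."""
    def __init__(self):
        self._store = {}  # name -> {(block row, block col): array}

    def export_block(self, name, coords, block):
        """Store one block of the named distributed object."""
        self._store.setdefault(name, {})[coords] = np.asarray(block)

    def remap(self, name, new_grid):
        """Reassemble the blocks of `name` and re-split them on `new_grid`."""
        blocks = self._store[name]
        p = max(i for i, _ in blocks) + 1
        q = max(j for _, j in blocks) + 1
        full = np.block([[blocks[(i, j)] for j in range(q)] for i in range(p)])
        r, c = new_grid
        self._store[name] = {
            (i, j): sub
            for i, row in enumerate(np.split(full, r, axis=0))
            for j, sub in enumerate(np.split(row, c, axis=1))
        }

    def import_block(self, name, coords):
        """Fetch one block of the named object under the current mapping."""
        return self._store[name][coords]
```

In the real system the server is an MPI program and remapping moves data between process grids; here it only re-partitions an array, which is enough to show the export / remap / import contract.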
Preliminary Experiment: Performance of our model implementation
Key points of the evaluation
1. Import/export by one task, increasing load
2. Weak scaling in the case of one task
3. Multiple accesses (N tasks, one PDR)
Platform
- 92-node IBM cluster (iDataPlex dx360 M4 servers)
- 2 Sandy Bridge E5-2670 CPUs (2.60 GHz), 8 cores per CPU / 16 cores per node
- 32 GB RAM per node
Experiment (1): Import/export by one task, increasing load
Performance for import/export of a real matrix
- 1 client: n x n matrix, 16 procs
- 1 PDR: 16 procs
- MPI-IO: current mSPMD implementation using MPI-IO
- MPI_Send/Recv: No I/O; uses MPI_Comm_connect (etc.) and MPI_Send / MPI_Recv
[Figure: time (sec) to perform 100 import/exports vs. local matrix size (10x10 to 1000x1000), for MPI-IO and MPI_S/R]
MPI_S/R is bounded by connection time
Experiment (2): Weak scaling in the case of one task
- 1 client: a 1000x1000 matrix for each process; the client is executed by 4x4 to 7x7 procs
- 1 PDR: executed on 4x4 to 7x7 procs
- MPI-IO: current mSPMD implementation using MPI-IO
- MPI_Send/Recv: No I/O; uses MPI_Comm_connect (etc.) and MPI_Send / MPI_Recv
[Figure: time (sec) to perform 100 import/exports vs. number of processes (4x4 to 7x7), for MPI-IO and MPI_S/R]
Experiment (3): Multiple accesses (N tasks, one PDR)
n clients vs. 1 PDR
- 2 to 24 clients: each client uses 16 procs and holds a 1000x1000 matrix
- 1 PDR: 16 procs
[Figure: maximum time (sec) in the case of multiple accesses to one data repository vs. number of clients (2 to 24), for MPI-IO and MPI_S/R]
Stable up to 8 clients ⇒ we need 3 PDRs for 24 clients
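The sizing rule behind that conclusion is simple ceiling arithmetic; a small sketch (the helper name and default are illustrative, with the 8-client stability limit taken from the measurement above):

```python
import math

def pdrs_needed(n_clients, clients_per_pdr=8):
    """Number of data repositories needed if one PDR stays stable
    up to `clients_per_pdr` concurrent clients."""
    return math.ceil(n_clients / clients_per_pdr)

# 24 clients with stability up to 8 clients per PDR -> 3 PDRs
print(pdrs_needed(24))
```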
Agenda
Motivation and background
Data exchange issues and model
Implementation through YML user interface
Development of Block Gauss-Jordan Algorithm
The Block Gauss-Jordan (BGJ) Algorithm
1. Pivot
2. Update row and column
3. Update A and Update B
Algorithm from M. Hugues, S. Petiton, "A Matrix Inversion Method with YML/OmniRPC on a Large Scale Platform", VECPAR.
Input: A (partitioned into p × p blocks)
Output: B = A⁻¹
For k = 0 to p−1
    B_kk = (A_kk)⁻¹                        // 1. Pivot
    For i = k+1 to p−1                     // 2. Update row
        A_ki = B_kk × A_ki
    End For
    For i = 0 to p−1                       // 2. Update column
        If i ≠ k Then B_ik = −A_ik × B_kk
        If i < k Then B_ki = B_kk × B_ki
    End For
    For i = 0 to p−1, i ≠ k                // 3. Update A and Update B
        For j = k+1 to p−1
            A_ij = A_ij − A_ik × A_kj
        End For
        For j = 0 to k−1
            B_ij = B_ij − A_ik × B_kj
        End For
    End For
End For
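The algorithm translates directly into a serial NumPy sketch. Each block operation below corresponds to one task of the mSPMD workflow; the sequential loops ignore the task parallelism and data distribution that the real implementation exploits.

```python
import numpy as np

def block_gauss_jordan(A_full, p):
    """Invert a matrix with the Block Gauss-Jordan algorithm.
    A_full is (n, n) with n divisible by p, partitioned into p x p blocks."""
    n = A_full.shape[0]
    bs = n // p
    # Split into a p x p grid of bs x bs blocks.
    A = [[A_full[i*bs:(i+1)*bs, j*bs:(j+1)*bs].copy() for j in range(p)]
         for i in range(p)]
    B = [[np.zeros((bs, bs)) for _ in range(p)] for _ in range(p)]
    for k in range(p):
        B[k][k] = np.linalg.inv(A[k][k])      # 1. pivot (getrf_tr task)
        for i in range(k + 1, p):             # 2. update row (gemm tasks)
            A[k][i] = B[k][k] @ A[k][i]
        for i in range(p):                    # 2. update column (gemm tasks)
            if i != k:
                B[i][k] = -A[i][k] @ B[k][k]
            if i < k:
                B[k][i] = B[k][k] @ B[k][i]
        for i in range(p):                    # 3. update A and B (gemm tasks)
            if i != k:
                for j in range(k + 1, p):
                    A[i][j] = A[i][j] - A[i][k] @ A[k][j]
                for j in range(k):
                    B[i][j] = B[i][j] - A[i][k] @ B[k][j]
    return np.block(B)
```

For p = 2 this reduces to the familiar block-inverse formula via the Schur complement S = A₁₁ − A₁₀A₀₀⁻¹A₀₁, which is a quick way to convince oneself of the update order.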
Illustration of BGJ steps
[Figure: the three steps on the p × p block grid at iteration k]
1. Pivot: getrf_tr
2. Update row and column: gemm for step 2
3. Update A and Update B: gemm for step 3
3 import/export strategies are compared
- MPI-IO
- Mix: MPI-IO and PDRs
  - After "Update row and column", blocks are exported to NFS
  - Otherwise, blocks are sent to a PDR
- PDR
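The mixed strategy amounts to a per-step dispatch rule; a one-function sketch (the function name and step labels are illustrative, not part of the framework):

```python
def export_destination(step):
    """Mixed strategy: blocks produced by 'update row and column' go to NFS
    (providing a checkpoint); all other blocks go to a PDR (no file I/O)."""
    return "NFS" if step == "update row and column" else "PDR"

# e.g. export_destination("update A") -> "PDR"
```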
Large block size
NO I/O strategy = faster (1.73x speedup), more stable / Mixed = compromise (1.36x), checkpointing, stable
A 4500x4500 matrix is divided into 5x5 blocks
- Each task uses 3x3 procs
- (maximum) 16 tasks in parallel
- 5x5 PDRs, each PDR with 3x3 procs ⇒ 2x(5x5)x(3x3) = 450 cores for PDRs
- 10 independent runs
- 2.0x speedup relative to the worst case
- 1.7x speedup relative to the average
Small block size
The mixed strategy improves stability and limits the loss of performance
A 3000x3000 matrix is divided into 5x5 blocks
- Each task uses 3x3 procs
- (maximum) 16 tasks in parallel
- 5x5 PDRs, each PDR with 3x3 procs ⇒ 2x(5x5)x(3x3) = 450 cores for PDRs
- 10 independent runs
Conclusion
Design
- Design of data exchange for Multi SPMD: graph of tasks + PGAS
- The design makes it possible to extract and use information from the parallel programming model, the execution model and the architecture
Toward implementation
- mSPMD and data exchange through a data repository
- Task scheduler
- The YML interface can be used for PDRs
- Choices: MPI-IO, MPI server (No I/O), or a mix
- Speedup up to 2.0x for the benchmark application (BGJ)