Scheduling of Tiled Nested Loops onto a Cluster with a Fixed Number of SMP Nodes
Maria Athanasaki, Evangelos Koukis, Nectarios Koziris
National Technical University of Athens, School of Electrical and Computer Engineering
Computing Systems Laboratory
PDP 2004
Previous work
M. Athanasaki, A. Sotiropoulos, G. Tsoukalas, N. Koziris, "Pipelined Scheduling of Tiled Nested Loops onto Clusters of SMPs using Memory Mapped Network Interfaces", SuperComputing Conference on High Performance Networking and Computing (SC2002), Baltimore, Maryland, November 16-22, 2002.
G. Goumas, A. Sotiropoulos and N. Koziris, "Minimizing Completion Time for Loop Tiling with Computation and Communication Overlapping", Proceedings of the 2001 International Parallel and Distributed Processing Symposium (IPDPS2001), IEEE Press, San Francisco, California, April 2001.
Overview
Tiling for parallelization
Non-overlapping vs. overlapping execution scheme
Grouping
Application on a cluster of SMPs with a fixed number of nodes
Experimental and simulation results
Nested For-Loops
for (i1=l1; i1<=u1; i1++)
  for (i2=l2; i2<=u2; i2++)
    … … … … …
      for (in=ln; in<=un; in++)
      {
        Loop Body
      }
Dependence Vectors
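[Figure: two-dimensional iteration space with axes i1 and i2]
for (i1=0; i1<=7; i1++)
  for (i2=0; i2<=7; i2++)
    A[i1][i2] = A[i1-1][i2] + A[i1][i2-1];
Dependence vectors: (1,0) and (0,1)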
Tiling
[Figure: tiled iteration space with axes i1 and i2; tiles labeled Processor 0 and Processor 1]
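To make the tiling transformation concrete, here is a minimal C sketch of the two-dimensional example loop in untiled and tiled form. The array size, the shift of the loop bounds to 1..N (so that row 0 and column 0 can hold boundary values), the rectangular tile shape, and all function names are assumptions made for illustration; they are not taken from the slides.

```c
#include <assert.h>

#define N 8   /* N x N iteration space, matching the 8 x 8 example */

/* Boundary row 0 and column 0 are set to 1, the interior to 0. */
void init_boundary(long A[N + 1][N + 1])
{
    for (int i = 0; i <= N; i++)
        for (int j = 0; j <= N; j++)
            A[i][j] = (i == 0 || j == 0) ? 1 : 0;
}

/* Untiled reference loop: A[i1][i2] = A[i1-1][i2] + A[i1][i2-1]. */
void run_untiled(long A[N + 1][N + 1])
{
    for (int i1 = 1; i1 <= N; i1++)
        for (int i2 = 1; i2 <= N; i2++)
            A[i1][i2] = A[i1 - 1][i2] + A[i1][i2 - 1];
}

/* Tiled loop: the two outer loops enumerate tiles, the two inner loops
   enumerate the iterations inside one tile. Rectangular tiles are legal
   here because both dependence vectors, (1,0) and (0,1), are non-negative. */
void run_tiled(long A[N + 1][N + 1], int tile1, int tile2)
{
    assert(tile1 > 0 && tile2 > 0);
    for (int t1 = 1; t1 <= N; t1 += tile1)
        for (int t2 = 1; t2 <= N; t2 += tile2) {
            /* Clip partial tiles at the edge of the iteration space. */
            int end1 = (t1 + tile1 - 1 < N) ? t1 + tile1 - 1 : N;
            int end2 = (t2 + tile2 - 1 < N) ? t2 + tile2 - 1 : N;
            for (int i1 = t1; i1 <= end1; i1++)
                for (int i2 = t2; i2 <= end2; i2++)
                    A[i1][i2] = A[i1 - 1][i2] + A[i1][i2 - 1];
        }
}

/* Returns 1 when the tiled loop produces the same array as the untiled one. */
int tiled_matches_untiled(int tile1, int tile2)
{
    long ref[N + 1][N + 1], til[N + 1][N + 1];
    init_boundary(ref);
    init_boundary(til);
    run_untiled(ref);
    run_tiled(til, tile1, tile2);
    for (int i = 0; i <= N; i++)
        for (int j = 0; j <= N; j++)
            if (ref[i][j] != til[i][j])
                return 0;
    return 1;
}
```

Calling tiled_matches_untiled(4, 4) checks an even split into four tiles, while tiled_matches_untiled(3, 5) exercises the clipping of partial tiles. In a parallel setting each tile (or column of tiles) would be handed to a different processor; this sketch runs them sequentially in an order that respects the dependences.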
Non-Overlapping Scheme
[Figure: tiled iteration space with axes i1 and i2, executed by Processor 0, Processor 1 and Processor 2 under the non-overlapping scheme]
Non-Overlapping vs. Overlapping Scheme
[Figure: two diagrams, each showing processors P0, P1, P2 and P3, one per execution scheme]
Overlapping Scheme
[Figure: tiled iteration space with axes i1 and i2, executed by Processor 0, Processor 1 and Processor 2 under the overlapping scheme]
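The difference between the two execution schemes can be sketched with a simple per-step cost model. This is an assumed simplification, not a formula from the slides: in the non-overlapping scheme a processor receives, computes and sends one after the other, while in the overlapping scheme communication for neighboring tiles proceeds concurrently with computation, so a step lasts as long as the slower of the two. The function names and the parameters t_recv, t_comp and t_send are hypothetical. The number of time steps is left as an input because it depends on the scheduling vector, which this sketch does not model.

```c
#include <assert.h>

/* Non-overlapping scheme (assumed model): receive, compute and send
   happen back to back within one time step. */
double step_cost_non_overlapping(double t_recv, double t_comp, double t_send)
{
    return t_recv + t_comp + t_send;
}

/* Overlapping scheme (assumed model): communication runs concurrently
   with computation, so the step takes the longer of the two. */
double step_cost_overlapping(double t_recv, double t_comp, double t_send)
{
    double t_comm = t_recv + t_send;
    return (t_comp > t_comm) ? t_comp : t_comm;
}

/* Total time for a schedule of `steps` equal-length time steps. */
double completion_time(long steps, double step_cost)
{
    assert(steps >= 0);
    return (double)steps * step_cost;
}
```

Under this model the overlapping scheme pays off most when computation and communication per tile take similar time; an overlapping schedule may need more time steps than a non-overlapping one, which is why both factors have to be compared.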
Generalization to SMPs – “Grouping”
[Figure: four SMP nodes, SMP0 to SMP3, each with two CPUs, CPU0 and CPU1]
Example: Grouping + Non-overlapping Communication Scheme
[Figure: tile space and group space for SMP node0 and SMP node1]
Scheduling vector Π=(1,0)
Example: Grouping + Overlapping Communication Scheme
[Figure: tile space and group space for SMP node0 and SMP node1]
Scheduling vector Π=(1,1)
Scheduling onto a Fixed Number of SMPs
Dynamic scheduling by the operating system
  Run-time overhead for generating a lot of processes
  Context switching slows down the execution
Static scheduling at compile time
Scheduling onto a Fixed Number of SMPs
Cyclic Assignment Schedule
Mirror Assignment Schedule
Cluster Assignment Schedule
Retiling
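The first three schedules can be read as mappings from a tile index to a CPU. The C sketch below is one interpretation of the labels in the assignment figures, not code from the presentation: CPUs are numbered globally (CPU0 and CPU1 of SMP0 are 0 and 1, CPU0 and CPU1 of SMP1 are 2 and 3), tiles are numbered along the mapped dimension, and all function names are hypothetical.

```c
#include <assert.h>

/* Cyclic: tiles are dealt out round-robin over all CPUs. */
int cyclic_owner(int tile, int num_cpus)
{
    assert(tile >= 0 && num_cpus > 0);
    return tile % num_cpus;
}

/* Mirror: like cyclic, but every second chunk of num_cpus tiles
   visits the CPUs in reverse order. */
int mirror_owner(int tile, int num_cpus)
{
    assert(tile >= 0 && num_cpus > 0);
    int chunk = tile / num_cpus;   /* which chunk the tile falls in */
    int pos = tile % num_cpus;     /* position inside that chunk */
    return (chunk % 2 == 0) ? pos : num_cpus - 1 - pos;
}

/* Cluster: consecutive tiles form one block per CPU. */
int cluster_owner(int tile, int num_tiles, int num_cpus)
{
    assert(tile >= 0 && tile < num_tiles && num_cpus > 0);
    int block = (num_tiles + num_cpus - 1) / num_cpus;  /* ceiling division */
    return tile / block;
}

/* Split a global CPU index into its SMP node and its CPU within the node. */
int smp_node_of(int cpu, int cpus_per_node)
{
    assert(cpu >= 0 && cpus_per_node > 0);
    return cpu / cpus_per_node;
}

int local_cpu_of(int cpu, int cpus_per_node)
{
    assert(cpu >= 0 && cpus_per_node > 0);
    return cpu % cpus_per_node;
}
```

With 8 tiles on 2 SMP nodes with 2 CPUs each, the cyclic mapping yields the CPU sequence 0,1,2,3,0,1,2,3, the mirror mapping 0,1,2,3,3,2,1,0, and the cluster mapping 0,0,1,1,2,2,3,3. Retiling is not a mapping of existing tiles, so it is left out here.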
Cyclic Assignment
[Figure: cyclic assignment on 2 SMP nodes with 2 CPUs each; tiles are labeled CPU0, CPU1 for SMP0 and SMP1 in turn, and the labeling repeats in every chunk]
Cyclic Assignment – Non Overlapping Communication
[Figure: time diagram (axis t) of CPU0 and CPU1 on SMP0 and SMP1; cyclic assignment on 2 SMP nodes with 2 CPUs each]
Cyclic Assignment - Overlapping Communication
[Figure: time diagram (axis t) of CPU0 and CPU1 on SMP0 and SMP1; cyclic assignment on 2 SMP nodes with 2 CPUs each]
Cyclic Assignment - Communication
[Figure: communication under cyclic assignment on 2 SMP nodes with 2 CPUs each; two chunks, each labeled CPU0, CPU1 for SMP0 and SMP1]
Mirror Assignment
[Figure: mirror assignment on 2 SMP nodes with 2 CPUs each; the first chunk is labeled SMP0 (CPU0, CPU1), SMP1 (CPU0, CPU1) and the next chunk in mirrored order SMP1 (CPU1, CPU0), SMP0 (CPU1, CPU0)]
Mirror Assignment – Non Overlapping Communication
[Figure: time diagram (axis t) of CPU0 and CPU1 on SMP0 and SMP1; mirror assignment on 2 SMP nodes with 2 CPUs each]
Mirror Assignment - Overlapping Communication
[Figure: time diagram (axis t) of CPU0 and CPU1 on SMP0 and SMP1; mirror assignment on 2 SMP nodes with 2 CPUs each]
Mirror Assignment - Communication
[Figure: communication under mirror assignment on 2 SMP nodes with 2 CPUs each; chunks labeled SMP0 (CPU0, CPU1), SMP1 (CPU0, CPU1), then SMP1 (CPU1, CPU0), SMP0 (CPU1, CPU0)]
Cluster Assignment
[Figure: cluster assignment on 2 SMP nodes with 2 CPUs each (SMP0, SMP1; CPU0, CPU1); figure labels: tiles, “TILE”, TILES, GROUPS]
Cluster Assignment – Non Overlapping Communication
[Figure: time diagram (axis t) of CPU0 and CPU1 on SMP0 and SMP1; cluster assignment on 2 SMP nodes with 2 CPUs each]
Cluster Assignment – Overlapping Communication
[Figure: time diagram (axis t) of CPU0 and CPU1 on SMP0 and SMP1; cluster assignment on 2 SMP nodes with 2 CPUs each]
Cluster Assignment - Communication
[Figure: communication under cluster assignment on 2 SMP nodes with 2 CPUs each (SMP0, SMP1; CPU0, CPU1); figure labels: TILES, GROUPS]
Retiling
[Figure: retiling on 2 SMP nodes with 2 CPUs each (SMP0, SMP1; CPU0, CPU1); old tiles are replaced by new tiles]
New tiles retain the computation volume of a tile
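One possible reading of retiling, sketched in C below: the new tiles are made wide enough that each CPU receives exactly one column of tiles, and their height is adjusted so that width times height, the computation volume of a tile, stays unchanged. This is an assumption made for illustration; the slides do not give the exact construction, and the struct, the function name and the divisibility requirements are hypothetical.

```c
#include <assert.h>

/* Extent of a two-dimensional tile, in iterations. */
struct tile_shape {
    long width;   /* extent along the dimension mapped to CPUs */
    long height;  /* extent along the other dimension */
};

/* Reshape a tile so that exactly num_cpus tile columns cover space_width
   while width * height is preserved. Assumes the divisions are exact,
   which keeps the example free of rounding questions. */
struct tile_shape retile(struct tile_shape old_tile, long space_width, long num_cpus)
{
    assert(old_tile.width > 0 && old_tile.height > 0);
    assert(num_cpus > 0 && space_width > 0 && space_width % num_cpus == 0);

    struct tile_shape new_tile;
    long volume = old_tile.width * old_tile.height;

    new_tile.width = space_width / num_cpus;   /* one tile column per CPU */
    assert(volume % new_tile.width == 0);
    new_tile.height = volume / new_tile.width; /* same computation volume */
    return new_tile;
}
```

For example, an iteration space 80 iterations wide that was tiled with 10 x 30 tiles (8 tile columns) could be retiled for 4 CPUs by passing the old shape, 80 and 4. The resulting shape differs from the original one, which is the cache-related drawback listed for retile in the comparison of the schedules.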
Retiling – Non Overlapping Communication
[Figure: time diagram (axis t) of CPU0 and CPU1 on SMP0 and SMP1; retiling on 2 SMP nodes with 2 CPUs each]
Retiling – Overlapping Communication
[Figure: time diagram (axis t) of CPU0 and CPU1 on SMP0 and SMP1; retiling on 2 SMP nodes with 2 CPUs each]
Retiling - Communication
[Figure: communication under retiling on 2 SMP nodes with 2 CPUs each (SMP0, SMP1; CPU0, CPU1)]
Experimental Platform
Linux SMP (Symmetric Multi-Processors) cluster
2 nodes
  1GB RAM
  2 Pentium III 1266MHz
Myrinet high-performance interconnect
GM low-level message passing system
The Myrinet interconnect
User-level networking
  Based on the GM message passing interface
  All message exchange using DMA
  Directly to/from pinned userspace buffers
  Communication is offloaded to the NIC
Programmable NIC
  LANai RISC processor @ 133-333MHz
  2-8MB SRAM
  2+2Gbps full duplex fiber links
GM Architecture
Comprised of three main parts
  User library
  Kernel driver
  Firmware on NIC
OS bypass design
  Regions of NIC memory mapped to the VM of a process
[Figure: GM layers. User: Application, GM library. Kernel: GM kernel module. NIC: GM firmware]
Sending and Receiving messages over Myrinet/GM
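[Figure: message path over Myrinet/GM. Sending side: sending application, host (buffer, event queue), NIC (send queue, LANai, host DMA, send DMA, recv DMA). Receiving side: receiving application, host (buffer, event queue), NIC (recv queue, LANai, host DMA, send DMA, recv DMA)]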
Initial Code
for (i=1; i<=X; i++)
  for (j=1; j<=Y; j++)
    for (k=1; k<=Z; k++) {
      A[i][j][k] = func(A[i-1][j][k],
                        A[i][j-1][k],
                        A[i][j][k-1]);
    }
Experimental results
[Figure: two plots of speedup / # processors (0.5 to 1) versus height of iteration space (500 to 3500), one for the non-overlapping execution scheme and one for the overlapping execution scheme; curves: cyclic, mirror, cluster, retile]
Simulation results
[Figures, two slides: plots of speedup / # processors (0.3 to 1) versus height of iteration space (0 to 20000) for the non-overlapping execution scheme and the overlapping execution scheme; curves: cyclic, mirror, cluster, retile]
Advantages - Disadvantages
cyclic
  + fast pipeline filling
  - communication
mirror
  + better communication than cyclic
  - idle time steps
  - worse communication than cluster, retile
cluster
  + communication: 1) little volume of data to be transferred 2) data combined in fewer messages
  - slow pipeline filling
retile
  + fast pipeline filling
  + communication: little volume of data to be transferred
  - reorganizes tiles, which annuls the optimal tile shape for cache hits
The End
Cyclic Assignment - Overlapping Communication
[Figure: equivalent schedulings for CPU0 and CPU1 of SMP0 and SMP1, shown as two diagrams with axes P and t. Scheduling on a fixed number of processors: the pipeline runs empty while waiting for the necessary data to become available. Scheduling on an unlimited number of processors.]