Job Startup at Exascale: Challenges and Solutions


Transcript of Job Startup at Exascale: Challenges and Solutions

Page 1: Job Startup at Exascale: Challenges and Solutions
sc16.supercomputing.org/sc-archive/src_poster/poster_files/spost129s2-file1.pdf

Current Trends in HPC
§ Tremendous increase in system and job sizes
§ Dense many-core systems becoming popular
§ Less memory available per process
§ Fast and scalable job startup is essential

Importance of Fast Job Startup
§ Development and debugging
§ Regression/acceptance testing
§ Checkpoint-restart

Performance Bottlenecks

Static Connection Setup
§ Setting up O(num_procs^2) connections is expensive
§ OpenSHMEM, UPC, and other PGAS libraries lack on-demand connection management

Network Address Exchange over PMI
§ Limited scalability, no potential for overlap
§ Not optimized for symmetric exchange

Global Barriers
§ Unnecessary synchronization and connection setup

Memory Scalability Issues
§ Each node requires O(number of processes × processes per node) memory for storing remote endpoint information

Proposed Solutions

On-demand Connection Management
§ Exchange information and establish a connection only when two peers are trying to communicate [1]

PMIX_Ring Extension
§ Move the bulk of the data exchange to high-performance networks [2]

Non-blocking PMI Collectives
§ Overlap the PMI exchange with other tasks [3]

Shared-memory based PMI Get/Allgather
§ All clients access data directly from the launcher daemon through shared memory regions [4]

Summary
§ Near-constant MPI and OpenSHMEM initialization time at any process count
§ 10x and 30x improvement in startup time of MPI and OpenSHMEM, respectively, at 16,384 processes
§ Reduce memory consumption by O(ppn)
§ 1 GB of memory saved @ 1M processes and 16 ppn

References
[1] On-demand Connection Management for OpenSHMEM and OpenSHMEM+MPI. (Chakraborty et al., HIPS ’15)
[2] PMI Extensions for Scalable MPI Startup. (Chakraborty et al., EuroMPI/Asia ’14)
[3] Non-blocking PMI Extensions for Fast MPI Startup. (Chakraborty et al., CCGrid ’15)
[4] SHMEMPMI – Shared Memory based PMI for Improved Performance and Scalability. (Chakraborty et al., CCGrid ’16)

More Information
§ Available in the latest MVAPICH2 and MVAPICH2-X
§ http://mvapich.cse.ohio-state.edu/downloads/
§ https://go.osu.edu/mvapich-startup


Results – Non-blocking PMI [3]
§ Near-constant MPI_Init at any scale
§ MPI_Init with Iallgather performs 288% better than the default based on Fence
§ Blocking Allgather is 38% faster than blocking Fence

[Figure: Performance of MPI_Init. Time taken (seconds) vs. number of processes (64 to 16K); series: Fence, Ifence, Allgather, Iallgather.]

[Figure: Performance comparison of Fence and Allgather. Time taken (seconds) vs. number of processes (64 to 16K); series: PMI2_KVS_Fence, PMIX_Allgather.]

PMIX_KVS_Ifence
§ Non-blocking version of PMI2_KVS_Fence
§ int PMIX_KVS_Ifence(PMIX_Request *request)

PMIX_Iallgather
§ Optimized for symmetric data movement
§ Reduces data movement by up to 30%: 286 KB → 208 KB with 8,192 processes
§ int PMIX_Iallgather(const char value[], char buffer[], PMIX_Request *request)

PMIX_Wait
§ Wait for the specified request to be completed
§ int PMIX_Wait(PMIX_Request request)

Non-blocking PMI Collectives
§ PMI operations are progressed by the separate processes that handle process management
§ The MPI library is not involved in progressing PMI communication
§ Similar to Functional Partitioning approaches
§ Can be overlapped with other initialization tasks

PMIX_Request
§ Non-blocking collectives return before the operation is completed
§ They return an opaque handle to a request object that can be used to check for completion (see the usage sketch below)
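To make the overlap concrete, here is a minimal sketch in C of how the proposed non-blocking extensions could be used during startup. The PMIX_KVS_Ifence and PMIX_Wait signatures are the ones listed above; the header names, the PMI2_KVS_Put call pattern, and the placeholder "other initialization work" are assumptions for illustration, not MVAPICH2's actual code.

    /* Sketch: overlap the PMI address exchange with other startup work
     * using the proposed non-blocking extensions.  Header names and the
     * shape of the overlapped work are assumptions; only the PMIX_*
     * signatures above and PMI2_KVS_Put come from the poster / PMI2 API. */
    #include <pmi2.h>        /* PMI2_KVS_Put (standard PMI2 API) */
    #include "pmix_ext.h"    /* hypothetical header for the proposed extensions */

    static void setup_intra_node_shared_memory(void)
    {
        /* placeholder for initialization that needs no remote addresses */
    }

    void exchange_endpoints(const char *my_key, const char *my_endpoint)
    {
        PMIX_Request req;

        PMI2_KVS_Put(my_key, my_endpoint);  /* publish local endpoint info */
        PMIX_KVS_Ifence(&req);              /* start the exchange, return immediately */

        setup_intra_node_shared_memory();   /* overlapped initialization task */

        PMIX_Wait(req);                     /* complete before the first Get */
    }

When the exchange is symmetric (every process contributes the same amount of data), PMIX_Iallgather can be used in place of PMIX_KVS_Ifence to reduce the data movement further, as noted above.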

Results – On-demand Connection [1]
§ 29.6 times faster initialization time
§ Hello World performs 8.31 times better
§ Execution time of NAS benchmarks improved by up to 35% with 256 processes and Class B data

On-demand Connection Establishment

Static Connection Setup
§ Setting up connections takes over 85% of the total startup time with 4,096 processes
§ RDMA operations require exchanging information about memory segments registered with the HCA (see the registration sketch below)
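As a reminder of why the (address, size, rkey) triple has to be exchanged at all, here is a minimal sketch, assuming a plain malloc'd buffer, of registering a segment with the HCA through the standard verbs API; the buffer, sizes, and access flags are illustrative.

    #include <infiniband/verbs.h>
    #include <stdlib.h>

    /* Register a memory segment with the HCA.  A remote peer needs the
     * triple (address, size, rkey) before it can target this segment with
     * RDMA reads/writes, which is why the triple appears in the connect
     * request/reply messages. */
    struct ibv_mr *register_segment(struct ibv_pd *pd, size_t size, void **addr_out)
    {
        void *buf = malloc(size);   /* illustrative: e.g., a symmetric-heap slice */
        struct ibv_mr *mr = ibv_reg_mr(pd, buf, size,
                                       IBV_ACCESS_LOCAL_WRITE  |
                                       IBV_ACCESS_REMOTE_READ  |
                                       IBV_ACCESS_REMOTE_WRITE);
        *addr_out = buf;
        /* advertise buf (address), size, and mr->rkey to peers */
        return mr;
    }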


Results – PMI Ring Extension [2]
§ MPI_Init based on PMIX_Ring performs 34% better compared to the default PMI2_KVS_Fence
§ Hello World runs 33% faster with 8K processes
§ Up to 20% improvement in total execution time of NAS parallel benchmarks

[Figure: Performance of MPI_Init and Hello World with PMIX_Ring. Time taken (seconds) vs. number of processes (16 to 8K); series: Hello World (Fence), Hello World (Ring), MPI_Init (Fence), MPI_Init (proposed).]

[Figure: NAS benchmarks with 1K processes, Class B data. Time taken (seconds) for EP, MG, CG, FT, BT, SP; series: Fence, Ring.]

New Collective – PMIX_Ring
§ A ring can be formed by exchanging data with only the left and the right neighbors
§ Once the ring is formed, data can be exchanged over high-speed networks like InfiniBand (see the sketch below)
§ int PMIX_Ring(char value[], char left[], char right[], …)
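The poster truncates the PMIX_Ring signature; the sketch below assumes a hypothetical completion with a maximum value length and rank/size out-parameters (marked as such in the code), purely to illustrate the intended pattern: pass in your own endpoint string, get back the left and right neighbors' strings, and use those to bootstrap the rest of the exchange over InfiniBand.

    #include <stdio.h>

    #define EP_LEN 64   /* illustrative endpoint-string length */

    /* Hypothetical completion of the truncated signature shown above;
     * only (value, left, right) appear on the poster, the remaining
     * parameters are assumed for illustration. */
    int PMIX_Ring(char value[], char left[], char right[],
                  int maxlen, int *rank, int *size);

    void bootstrap_ring(const char *my_endpoint)
    {
        char value[EP_LEN], left[EP_LEN], right[EP_LEN];
        int rank, size;

        snprintf(value, sizeof(value), "%s", my_endpoint);
        PMIX_Ring(value, left, right, EP_LEN, &rank, &size);

        /* 'left' and 'right' now hold the neighbors' endpoint strings:
         * connect to those two peers and run the full address exchange
         * (e.g., an allgather) over InfiniBand instead of over the
         * PMI sockets. */
    }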

[Figure: PMI Fence with different numbers of Puts. Time taken (seconds) vs. number of processes (16 to 16K); series: 100% Put + Fence, 50% Put + Fence, Fence only.]

[Figure: Time taken by PMIX_Ring. Time taken (milliseconds) vs. number of processes (16 to 16K).]

Shortcomings of the Current PMI Design
§ Puts and Gets are local operations
§ Fence consumes most of the time
§ Time taken for Fence grows approximately linearly with the amount of data transferred (number of keys); the conventional exchange is sketched below
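For context, here is a minimal sketch of the conventional full address exchange over the PMI2 API: every rank publishes one key, Fence makes all keys globally visible, and every rank then pulls all entries. The key format, buffer sizes, and the NULL/-1 arguments to PMI2_KVS_Get (meant as "current job, unspecified source") are illustrative assumptions, not code from the poster.

    #include <stdio.h>
    #include <pmi2.h>

    /* Conventional all-to-all endpoint exchange over PMI2.  The Fence
     * cost grows with the number of keys, and every client ends up
     * caching O(nprocs) values, which is the scalability problem the
     * proposed extensions attack. */
    void full_exchange(int rank, int nprocs, const char *my_endpoint)
    {
        char key[32], val[64];
        int len;

        snprintf(key, sizeof(key), "ep-%d", rank);
        PMI2_KVS_Put(key, my_endpoint);
        PMI2_KVS_Fence();                 /* dominant cost at scale */

        for (int peer = 0; peer < nprocs; peer++) {
            snprintf(key, sizeof(key), "ep-%d", peer);
            PMI2_KVS_Get(NULL /* this job */, -1 /* any source */,
                         key, val, sizeof(val), &len);
            /* record val as peer's endpoint address */
        }
    }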

[Figure: Breakdown of MPI_Init. Time taken (seconds) vs. number of processes (32 to 8K); components: PMI exchanges, shared memory, other.]

[Figure: Time spent in Put, Fence, and Get. Time taken (seconds) vs. number of processes (16 to 16K); series: Fence, Put, Gets.]

[Diagram: On-demand connection establishment between Process 1 and Process 2, each with a main thread and a connection-manager thread. When Process 1's main thread issues a Put/Get targeting P2, the send is enqueued and the connection manager creates a QP, moves it to INIT, and sends a connect request carrying (LID, QPN) and (address, size, rkey). Process 2's connection manager creates its own QP, brings it through INIT and RTR, and answers with a connect reply carrying its (LID, QPN) and (address, size, rkey). Both QPs are then advanced to RTS, the connection is marked established on both sides, and the queued send is dequeued.]
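To make the QP state transitions in the diagram concrete, here is a minimal sketch, using the standard verbs API, of what a connection-manager thread does once it has the peer's (LID, QPN) from the connect request or reply. The attribute values (MTU, PSNs, timeouts, access flags) are illustrative defaults rather than the values MVAPICH2 actually uses, and error handling is reduced to early returns.

    #include <infiniband/verbs.h>
    #include <stdint.h>
    #include <string.h>

    /* Bring a reliable-connected QP to RTS, given the peer's LID and QPN
     * exchanged in the connect request/reply.  Attribute values are
     * illustrative. */
    static int connect_qp(struct ibv_qp *qp, uint16_t peer_lid,
                          uint32_t peer_qpn, uint8_t port)
    {
        struct ibv_qp_attr attr;

        memset(&attr, 0, sizeof(attr));                 /* INIT */
        attr.qp_state        = IBV_QPS_INIT;
        attr.pkey_index      = 0;
        attr.port_num        = port;
        attr.qp_access_flags = IBV_ACCESS_REMOTE_READ | IBV_ACCESS_REMOTE_WRITE;
        if (ibv_modify_qp(qp, &attr, IBV_QP_STATE | IBV_QP_PKEY_INDEX |
                          IBV_QP_PORT | IBV_QP_ACCESS_FLAGS))
            return -1;

        memset(&attr, 0, sizeof(attr));                 /* RTR: needs peer info */
        attr.qp_state           = IBV_QPS_RTR;
        attr.path_mtu           = IBV_MTU_1024;
        attr.dest_qp_num        = peer_qpn;
        attr.rq_psn             = 0;
        attr.max_dest_rd_atomic = 1;
        attr.min_rnr_timer      = 12;
        attr.ah_attr.dlid       = peer_lid;
        attr.ah_attr.port_num   = port;
        if (ibv_modify_qp(qp, &attr, IBV_QP_STATE | IBV_QP_AV | IBV_QP_PATH_MTU |
                          IBV_QP_DEST_QPN | IBV_QP_RQ_PSN |
                          IBV_QP_MAX_DEST_RD_ATOMIC | IBV_QP_MIN_RNR_TIMER))
            return -1;

        memset(&attr, 0, sizeof(attr));                 /* RTS */
        attr.qp_state      = IBV_QPS_RTS;
        attr.timeout       = 14;
        attr.retry_cnt     = 7;
        attr.rnr_retry     = 7;
        attr.sq_psn        = 0;
        attr.max_rd_atomic = 1;
        if (ibv_modify_qp(qp, &attr, IBV_QP_STATE | IBV_QP_TIMEOUT |
                          IBV_QP_RETRY_CNT | IBV_QP_RNR_RETRY |
                          IBV_QP_SQ_PSN | IBV_QP_MAX_QP_RD_ATOMIC))
            return -1;

        return 0;   /* connection established; queued sends can be flushed */
    }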

[Figure: Execution time of OpenSHMEM NAS parallel benchmarks (BT, EP, MG, SP), seconds; series: Static, On-demand.]

[Figure: Performance of OpenSHMEM initialization and Hello World. Time taken (seconds) vs. number of processes (16 to 8K); series: Hello World - Static, start_pes - Static, Hello World - On-demand, start_pes - On-demand.]

[Figure: Breakdown of OpenSHMEM initialization. Time taken (seconds) vs. number of processes (32 to 4K); components: connection setup, PMI exchange, memory registration, shared memory setup, other.]

Application   Processes   Average Peers
BT            64          8.7
BT            1024        10.6
EP            64          3.0
EP            1024        5.0
MG            64          9.5
MG            1024        11.9
SP            64          8.8
SP            1024        10.7
2D Heat       64          5.3
2D Heat       1024        5.4

Results – Shared Memory based PMI [4]
§ PMI Get takes 0.25 ms with 32 ppn
§ 1,000 times reduction in PMI Get latency compared to the default socket-based protocol
§ Memory footprint reduced by O(processes per node): ≈ 1 GB @ 1M processes, 16 ppn
§ Backward compatible, negligible overhead
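The ≈ 1 GB figure is consistent with the O(processes × ppn) footprint if each cached endpoint entry takes on the order of 64 bytes; the 64 B entry size is an assumption for illustration, not a number taken from the poster.

    \[
      \underbrace{10^{6}}_{\text{processes}}
      \times
      \underbrace{16}_{\text{ppn}}
      \times
      \underbrace{64~\text{B}}_{\text{per-entry (assumed)}}
      \;\approx\; 1~\text{GB per node}
    \]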

Shared Memory (shmem) based PMI
§ Open a shared memory channel between the server and the clients
§ A hash table is suitable for Fence, while Allgather only requires an array of values
§ Use a hash table based on two shmem regions for efficient insertion and merge, and for compactness (a layout sketch follows below)
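Here is a minimal sketch in C of the two-region layout described above and in the data-structure diagram further down (a bucket table with head/tail links plus an append-only key-value store with a top pointer). Field names, sizes, and the hash function are illustrative, not SHMEMPMI's actual definitions, and the two regions are shown inside one struct for brevity.

    #include <stdint.h>
    #include <stdio.h>

    #define NBUCKETS 1024   /* illustrative sizes */
    #define KEY_LEN  64
    #define VAL_LEN  64

    /* Region 1: the hash table proper, one head/tail pair per bucket
     * (-1 marks an empty bucket). */
    struct bucket { int32_t head, tail; };

    /* Region 2: the append-only key-value store; 'next' chains entries
     * that hash to the same bucket, 'top' is the next free slot. */
    struct kvs_entry {
        char    key[KEY_LEN];
        char    val[VAL_LEN];
        int32_t next;
    };

    struct shm_kvs {
        struct bucket    table[NBUCKETS];
        int32_t          top;
        struct kvs_entry kvs[];          /* flexible array in the shmem segment */
    };

    /* Server-side insert: append the entry, then link it into its bucket.
     * Clients map the same regions and read keys directly, so no data is
     * copied into each client's private memory. */
    static void kvs_put(struct shm_kvs *s, const char *key, const char *val)
    {
        uint32_t h = 5381;                      /* illustrative string hash */
        for (const char *p = key; *p; p++)
            h = h * 33u + (uint8_t)*p;
        h %= NBUCKETS;

        int32_t slot = s->top++;
        snprintf(s->kvs[slot].key, KEY_LEN, "%s", key);
        snprintf(s->kvs[slot].val, VAL_LEN, "%s", val);
        s->kvs[slot].next = -1;

        if (s->table[h].head < 0)
            s->table[h].head = slot;
        else
            s->kvs[s->table[h].tail].next = slot;
        s->table[h].tail = slot;
    }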

Memory Scalability in PMI
§ PMI communication between the server and the clients is based on local sockets
§ Latency is high with a large number of clients
§ Copying data into each client's memory causes large memory overhead

Sourav Chakraborty, Dhabaleswar K. Panda (Advisor), The Ohio State University

[Figure: Time taken by one PMI_Get with the default socket-based protocol. Time (milliseconds) vs. number of processes (1 to 32).]

[Figure: PMI memory usage per node (MB, log scale) vs. number of processes; series: PPN = 16, PPN = 32, PPN = 64.]

[Figure: Time taken by one PMI_Get. Time (milliseconds) vs. number of processes (1 to 32); series: Default, Shmem.]

[Figure: PMI memory usage per node (MB, log scale) vs. number of processes (32K to 1M); series: Fence - Default, Allgather - Default, Fence - Shmem, Allgather - Shmem.]

[Diagram: Shmem-based PMI data structures. The hash table (Table) holds per-bucket Empty/Head/Tail entries; the key-value store (KVS) holds Key/Value/Next records, with a Top pointer marking the next free slot.]

[Figure: Overview. Job startup performance plotted against the memory required to store endpoint information. P = PGAS (state of the art), M = MPI (state of the art), O = PGAS/MPI (optimized). The labeled steps a–e from the state of the art to the optimized point correspond to on-demand connection (a), PMIX_Ring, PMIX_Ibarrier, PMIX_Iallgather, and shmem-based PMI.]