Distributed communication-aware load balancing with TreeMatch in Charm++
The 11th workshop of the Joint Laboratory for Petascale Computing, Sophia-Antipolis
Emmanuel Jeannot, Guillaume Mercier, Francois Tessier
In collaboration with the Charm++ team from the PPL:
Esteban Meneses-Rojas, Gengbin Zheng, Sanjay Kale
June 9, 2014
Francois Tessier TreeMatch in Charm++ 1/ 19
Introduction
Scalable execution of parallel applications
Number of cores is increasing
But memory per core is decreasing
Applications will need to communicate even more than they do now
Our solution
Process placement should take process affinity into account. Here: load balancing in Charm++ considering:
CPU load
process affinity (or affinity of other communicating objects)
topology: network and intra-node
Charm++
Features
Parallel object-oriented programming language based on C++
Programs are decomposed into a number of cooperating message-driven objects called chares. In general, there are more chares than processing units
Chares are mapped to physical processors by an adaptive runtime system
Load balancers can be called to migrate chares
Charm++ can use MPI for inter-process communication
Process Placement
Why we should consider it
Many current and future parallel platforms have several levels of hierarchy
Application chares/processes do not exchange the same amount of data (affinity). The process placement policy may have an impact on performance
Cache hierarchy, memory bus, high-performance network...
Figure: Hierarchical machine topology: switch, cabinets, nodes, processors, cores.
Problems
Given
The parallel machine topology
The application communication pattern
Map application processes to physical resources (cores) so as to reduce the communication costs (an NP-complete problem)
Figure: Communication matrix (zeus16.map): amount of data exchanged per sender rank and receiver rank.
TreeMatch
The TreeMatch Algorithm
Algorithm and environment to compute process placement based on process affinities and the NUMA topology
Input:
the communication pattern of the application (preliminary execution with a monitored MPI implementation for static placement; dynamic recovery on iterative applications with Charm++)
a model (tree) of the underlying architecture: hwloc can provide this
Output:
a process permutation σ such that σ(i) is the core number on which process i has to be bound
TreeMatch can only work on tree topologies. How do we deal with a 3D torus?
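To make the matching idea concrete, here is a toy version for 4 processes on a two-level tree (2 nodes of 2 cores): enumerate the possible pairings, keep the one that puts the heaviest-communicating pairs on the same node, and emit the permutation σ. Real TreeMatch works recursively on arbitrary trees; this brute-force pairing and its communication matrix are our simplification.

```cpp
#include <array>
#include <vector>

// Toy TreeMatch-like matching: 4 processes, cores 0-1 on node 0 and
// cores 2-3 on node 1. Pick the pairing maximizing intra-node traffic,
// then return sigma (sigma[i] = core assigned to process i).
std::vector<int> toyTreeMatch(const std::array<std::array<double, 4>, 4>& comm) {
    // the three ways to split {0,1,2,3} into two pairs
    const std::array<std::array<int, 4>, 3> pairings = {{
        {0, 1, 2, 3},   // {0,1} share a node, {2,3} share a node
        {0, 2, 1, 3},   // {0,2} and {1,3}
        {0, 3, 1, 2}    // {0,3} and {1,2}
    }};
    int best = 0;
    double bestGain = -1.0;
    for (int p = 0; p < 3; ++p) {
        double gain = comm[pairings[p][0]][pairings[p][1]]
                    + comm[pairings[p][2]][pairings[p][3]];
        if (gain > bestGain) { bestGain = gain; best = p; }
    }
    std::vector<int> sigma(4);
    for (int k = 0; k < 4; ++k)
        sigma[pairings[best][k]] = k;   // k-th slot of the pairing = core k
    return sigma;
}
```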
Network placement
libtopomap
T. Hoefler and M. Snir, "Generic Topology Mapping Strategies for Large-Scale Parallel Architectures," Proc. Int'l Conf. Supercomputing (ICS), pp. 75-84, 2011.
Library that enables mapping processes onto various network topologies
Used in TreeMatchLB to take the Blue Waters 3D torus into account
Figure: 3d Torus and a Cray Gemini router
Load balancing
Principle
Iterative applications
Load balancer called at regular intervals
Migrate chares in order to optimize several criteria
The Charm++ runtime system provides:
chare loads
chare affinities
etc.
Constraints
Dealing with complex modern architectures
Taking into account communications between elements
Some other communication-aware load-balancing algorithms
[L. L. Pilla et al., 2012] NUCOLB, shared memory machines
[L. L. Pilla et al., 2012] HwTopoLB
Some "built-in" Charm++ load balancers: RefineCommLB, GreedyCommLB...
Several issues raised
Not so easy...
Several issues raised!
Scalability of TreeMatch
Need to find a relevant compromise between process affinities and load balancing
What about load balancing time?
The next slides present our load balancer, relying on TreeMatch and libtopomap, which performs a parallel and distributed communication-aware load balancing.
Strategy for Charm++ - Network Placement
First step: minimize the communication cost on the network
libtopomap reorders the processes of a communicator. How can we use it to reorder groups of processes (or chares)? Example: groups of chares on nodes
Charm++ uses MPI: full access to the MPI API. New MPI communicator with MPI_Comm_split
Figure: Nodes 0 to 3 on the network (3D torus, tree, ...) gathered into a new communicator.
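A sketch of this splitting step: every rank on the same node computes the same "color", so that MPI_Comm_split yields one communicator per node. Only the color computation is shown as runnable code (with hypothetical hostnames); the MPI call itself, shown in a comment, needs a running MPI job.

```cpp
#include <map>
#include <string>
#include <vector>

// Assign one color per distinct node, in order of first appearance.
// hostnameOfRank[r] is the node that rank r runs on.
std::vector<int> nodeColors(const std::vector<std::string>& hostnameOfRank) {
    std::map<std::string, int> colorOfHost;
    std::vector<int> color(hostnameOfRank.size());
    for (size_t rank = 0; rank < hostnameOfRank.size(); ++rank) {
        auto it = colorOfHost.find(hostnameOfRank[rank]);
        if (it == colorOfHost.end())
            it = colorOfHost.emplace(hostnameOfRank[rank],
                                     (int)colorOfHost.size()).first;
        color[rank] = it->second;
    }
    return color;
}

// Each rank would then build its per-node communicator with:
//   MPI_Comm nodecomm;
//   MPI_Comm_split(MPI_COMM_WORLD, color[myrank], myrank, &nodecomm);
```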
Strategy for Charm++ - Intra-node placement
TreeMatch load balancer
1st step: remap groups of chares on nodes according to the communication on the network
libtopomap (example: part of a 3D torus)
2nd step: reorder chares inside each node (distributed)
Apply TreeMatch on the NUMA topology and the chares' communication pattern
Bind chares according to their load (leveling toward the least loaded)
Each node carries out its own placement in parallel
Figure: Part of a 3D torus attributed by the resource manager; groups of chares are assigned to nodes, then to cores within each node according to their CPU load, over the network (3D torus, hierarchical, ...).
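The load-leveling part of the second step can be sketched as a greedy heuristic (our simplification of the "leveling" bullet, not the exact TreeMatchLB code): heaviest chare first, always onto the currently least loaded core of the node.

```cpp
#include <algorithm>
#include <functional>
#include <numeric>
#include <queue>
#include <utility>
#include <vector>

// Greedy longest-processing-time leveling: returns coreOfChare[i],
// the core of this node that chare i is bound to.
std::vector<int> levelLoads(const std::vector<double>& chareLoad, int ncores) {
    std::vector<int> order(chareLoad.size());
    std::iota(order.begin(), order.end(), 0);
    std::sort(order.begin(), order.end(), [&](int a, int b) {
        return chareLoad[a] > chareLoad[b];   // heaviest first
    });
    // min-heap of (accumulated load, core id)
    using Entry = std::pair<double, int>;
    std::priority_queue<Entry, std::vector<Entry>, std::greater<Entry>> cores;
    for (int c = 0; c < ncores; ++c) cores.push({0.0, c});
    std::vector<int> coreOfChare(chareLoad.size());
    for (int chare : order) {
        auto [load, core] = cores.top();      // least loaded core so far
        cores.pop();
        coreOfChare[chare] = core;
        cores.push({load + chareLoad[chare], core});
    }
    return coreOfChare;
}
```

Because each node only levels its own chares, all nodes can run this step in parallel, which is what makes the load balancer distributed.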
Results
kNeighbor
Benchmark application designed to generate intensive communication between processes
Experiments on 8 nodes with 8 cores each (Intel Xeon 5550) - PlaFRIM cluster
Particularly compared to RefineCommLB, which:
takes into account load and communication
minimizes migrations
Figure: Execution time (in seconds) of kNeighbor on 64 cores with 1 MB messages, for DummyLB, GreedyCommLB, GreedyLB, RefineCommLB and TMLB_TreeBased: 128 elements (left) and 256 elements (right).
Results
kNeighbor
Experiments on 16 nodes with 8 cores each (Intel Xeon 5550) - PlaFRIM cluster
1 MB messages - 100 iterations - 7-Neighbor
Figure: Average time per 7-kNeighbor iteration (in ms) vs. number of chares per core, for DummyLB and TreeMatchLB.
Results
kNeighbor
Experiments on 16 nodes with 32 cores each (AMD Interlagos 6276) - Blue Waters cluster
1 MB messages - 100 iterations - 7-Neighbor
Poor performance...
Figure: Average time per 7-kNeighbor iteration (in ms) vs. number of chares per core, for DummyLB and TreeMatchLB.
Results
Stencil3D
Three-dimensional stencil with regular communication between fixed neighbors
One chare per core: balancing considers only communications
Only one load balancing step, after 10 iterations
Experiments on 8 nodes with 8 cores each (Intel Xeon 5550)
Figure: Execution time (in seconds) of Stencil3D on 64 cores with 64 elements, for DummyLB, GreedyCommLB, GreedyLB, RefineCommLB and TMLB_TreeBased.
Results
What about the load balancing time?
Comparison between the sequential and the distributed versions of TreeMatchLB
The master node distributes the data to each node, which then computes the placement of its own chares. This data distribution can be done in parallel (around 20% improvement)
Figure: Time (in seconds) spent in each step of the load balancing process (initialization, sequential TreeMatch, parallel TreeMatch), sequential vs. distributed versions, for 4096, 8192 and 16384 chares.
Figure: Timeline of one distributed load balancing step with 4096 chares: Init, Process results, Distribute, Calculate and Return phases on the master and on each node.
Results
What about the load balancing time?
Load balancing time grows linearly as the number of chares is doubled
TreeMatchLB is slower than the other greedy strategies
RefineCommLB, which provides good results for communication-bound applications, is not scalable (fails beyond 8192 chares)
Figure: Load balancing time (in ms, log scale, running on 128 cores) of GreedyCommLB, GreedyLB, RefineCommLB and TreeMatchLB vs. number of chares (128 to 8192) for the kNeighbor application.
Future work and Conclusion
Topology is not flat!
Process affinities are not homogeneous
Taking this information into account when mapping chares yields improvements
Algorithm adapted to large problems (Distributed)
JLPC collaborations
10 days during August 2013 at the PPL
Paper accepted at IEEE Cluster 2013
Future work
Find a better way to gather the topology (hwloc?)
Poor performance on Blue Waters... need to understand why (the architecture?)
Perform more large scale experiments
Hybrid architecture? Intel MIC?
Evaluate our solution on other applications
Compare to other load balancers (NUCOLB, application-specific LBs)
The End
Thanks for your attention!
Any questions?