NCSA Wiki - Présentation PowerPoint · 2010. 11. 23. · Title:...
![Page 1: NCSA Wiki - Présentation PowerPoint · 2010. 11. 23. · Title: Présentation PowerPoint Author: Franck Cappello Created Date: 11/21/2010 5:24:18](https://reader033.fdocuments.us/reader033/viewer/2022051822/5fec689e27df263a766bcf24/html5/thumbnails/1.jpg)
Charm++ on NUMA Platforms:
the impact of SMP Optimizations
and a NUMA-aware Load Balancer
Laércio Pilla (UFRGS/Brazil - INRIA)
Christiane Pousa (INRIA)
Daniel Cordeiro (USP/Brazil - INRIA)
Jean-François Méhaut (INRIA)
Outline
• Introduction
• Performance Evaluation of SMP Optimizations of
Charm++ on NUMA Machines
• NUMA-aware Load Balancer on Charm++
• Conclusion and Future Work
Motivation for NUMA platforms
• The number of cores per processor is
increasing
• Hierarchical shared memory multiprocessors
• ccNUMA is coming back (NUMA factor)
[Figure: four NUMA nodes (Node 0 to Node 3), each with its own memory (M0 to M3), shown first with one CPU per node and then ("Now") with four cores per node]
NUMA problems
• Remote access
• Optimization of latency
[Figure: eight NUMA nodes (Node 0 to Node 7), each with its own memory (M0 to M7) and four cores]
NUMA problems
• Memory contention
• Optimization of bandwidth
[Figure: eight NUMA nodes (Node 0 to Node 7), each with its own memory (M0 to M7) and four cores]
NUMA problems
On NUMA machines,
data distribution
matters!
Charm++ Parallel
Programming System
• Platform independent
• Both shared and distributed memory
• Architecture abstraction
• Programmer productivity
From charm++ site: http://charm.cs.uiuc.edu/research/charm/
Charm++ Parallel
Programming System
• Communications originally implemented
with message passing
• Even on SMP machines
• Currently uses optimizations for SMP
systems
• Chao Mei et al., “Optimizing a parallel runtime system
for multicore clusters: a case study”, in TG ’10
Charm++ & NUMA
• How do these optimizations work on
NUMA machines?
• Our evaluation
• How can we use knowledge about the
NUMA system to improve performance
on Charm++?
• NUMA-aware load balancer
Outline
• Introduction
• Performance Evaluation of SMP Optimizations
of Charm++ on NUMA Machines
• NUMA-aware Load Balancer on Charm++
• Conclusion and Future Work
Evaluation of SMP optimizations
• Different Charm++ versions
• With optimizations
• Without optimizations
• Different architecture compilations (flavors)
• net-linux: distributed memory
• multicore: shared memory
NUMA machines
• AMD Opteron
• 8 nodes x 2 cores
• @ 2.2 GHz
• 2 MB L2 cache
• 32 GB main memory
• Low latency for local memory access
• Crossbar
• NUMA factor: 1.2 – 1.5
• Linux 2.6.32.6
[Figure: eight NUMA nodes (Node 0 to Node 7), each with its own memory and two cores, numbered C0-C1 on Node 0 through C14-C15 on Node 7]
NUMA machines
• Intel Xeon X7560
• 4 nodes x 8 cores
• @ 2.27 GHz
• 24 MB shared L3 cache
• 64 GB main memory
• QuickPath
• NUMA factor: 2 - 2.6
• Linux 2.6.32
[Figure: four NUMA nodes (Node 0 to Node 3), each with its own memory (M0 to M3) and eight cores: C0-C7 on Node 0, C8-C15 on Node 1, C16-C23 on Node 2, C24-C31 on Node 3]
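The NUMA factor quoted for both machines is commonly defined as the ratio of remote to local memory access latency. A minimal sketch of that calculation follows; the function names and the latency table are hypothetical and are not measurements from these machines.

```python
def numa_factor(remote_latency_ns: float, local_latency_ns: float) -> float:
    """Ratio of remote to local memory access latency (1.0 means uniform access)."""
    if local_latency_ns <= 0:
        raise ValueError("local latency must be positive")
    return remote_latency_ns / local_latency_ns


def numa_factor_matrix(latency_ns):
    """Build NF[j][i] = latency(j -> i) / local latency of node j."""
    n = len(latency_ns)
    return [
        [numa_factor(latency_ns[j][i], latency_ns[j][j]) for i in range(n)]
        for j in range(n)
    ]


# Hypothetical 2-node latency table in nanoseconds (row = accessing node).
latencies = [
    [100.0, 150.0],
    [150.0, 100.0],
]
factors = numa_factor_matrix(latencies)
```

With this illustrative table, a remote access costs 1.5 times a local one, which would sit at the upper end of the range reported for the Opteron machine.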
Experimental setup
• Exclusive access to the machines
• Minimum of 10 executions
• Low standard deviation (< 5%)
• Different numbers of cores
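The repetition criterion above can be checked mechanically. The sketch below assumes hypothetical timing samples and interprets the 5% bound as the sample standard deviation relative to the mean; the slides do not state which definition was used, and `summarize_runs` is an illustrative name.

```python
import statistics


def summarize_runs(times_s, max_rel_std=0.05, min_runs=10):
    """Return (mean, relative std dev, accepted) for a list of run times."""
    if len(times_s) < min_runs:
        raise ValueError(f"need at least {min_runs} runs, got {len(times_s)}")
    mean = statistics.mean(times_s)
    rel_std = statistics.stdev(times_s) / mean
    return mean, rel_std, rel_std < max_rel_std


# Hypothetical iteration times in seconds for ten executions.
samples = [2.00, 2.02, 1.98, 2.01, 1.99, 2.00, 2.03, 1.97, 2.00, 2.00]
mean, rel_std, ok = summarize_runs(samples)
```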
Benchmark: Jacobi2D
• Iterative benchmark
• Computations over 2D matrix
• Communications with 4 neighbors
• Stencil (CPU bound)
• Imbalanced
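To make the four-neighbor pattern concrete, here is a minimal sequential sketch of one Jacobi sweep. It is not the Charm++ benchmark itself: it only shows the per-cell update, assumes fixed boundary values, and uses a hypothetical `jacobi_step` helper. In a parallel version, the same neighbor dependency is what forces each block to exchange data with its four neighbors.

```python
def jacobi_step(grid):
    """One Jacobi sweep: each interior cell becomes the mean of its 4 neighbors."""
    rows, cols = len(grid), len(grid[0])
    new = [row[:] for row in grid]  # boundary cells are copied unchanged
    for r in range(1, rows - 1):
        for c in range(1, cols - 1):
            new[r][c] = 0.25 * (
                grid[r - 1][c] + grid[r + 1][c] + grid[r][c - 1] + grid[r][c + 1]
            )
    return new


# Hypothetical 4x4 grid: hot top edge (1.0), everything else cold (0.0).
grid = [[1.0] * 4] + [[0.0] * 4 for _ in range(3)]
after_one = jacobi_step(grid)
```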
Jacobi2D on Opteron Machine
[Figure: average iteration time (s) versus number of cores (2, 4, 8, 16) on the Opteron machine. Series: with optim. multicore, without optim. multicore, with optim. net_linux, without optim. net_linux. Annotation: no noticeable difference.]
Jacobi2D on Xeon Machine
[Figure: average iteration time (s) versus number of cores (2, 4, 8, 16, 32) on the Xeon machine. Series: with optim. multicore, without optim. multicore, with optim. net_linux, without optim. net_linux. Annotation: inside the error margin.]
Benchmark: kNeighbor
• Synthetic benchmark
• Completely communication bound
• Each chare communicates with k
neighbors
• k = 3
• Message size = 1024 B
kNeighbor on Opteron Machine
[Figure: average iteration time (µs) versus number of cores (2, 4, 8, 16) on the Opteron machine. Series: with optim. multicore, without optim. multicore, with optim. net_linux, without optim. net_linux. Annotations: speedup of 9; speedup of 1.2.]
kNeighbor on Xeon Machine
[Figure: average iteration time (µs) versus number of cores (2, 4, 8, 16, 32) on the Xeon machine. Series: with optim. multicore, without optim. multicore, with optim. net_linux, without optim. net_linux. Annotations: speedup of 2; speedup of 4.7.]
Partial conclusions
• Times can differ by up to 50%
between Charm++ versions
• Times up to 90% smaller when using
multicore instead of net-linux
• Impact proportional to the amount of
communications
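The percentages above and the speedups annotated on the kNeighbor plots are two views of the same ratio: a speedup of s corresponds to a time reduction of 1 − 1/s. A short sketch of that conversion, with illustrative helper names:

```python
def time_reduction(speedup: float) -> float:
    """Fraction of time saved for a given speedup (t_old / t_new)."""
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return 1.0 - 1.0 / speedup


def speedup_from_reduction(reduction: float) -> float:
    """Inverse conversion: speedup implied by a fractional time reduction."""
    if not 0 <= reduction < 1:
        raise ValueError("reduction must be in [0, 1)")
    return 1.0 / (1.0 - reduction)


# Speedups annotated on the kNeighbor plots.
for s in (1.2, 2.0, 4.7, 9.0):
    print(f"speedup {s:>4}: time reduced by {time_reduction(s):.0%}")
```

Read this way, a speedup of 2 is a 50% reduction and a speedup of 9 is a reduction of about 89%, which is consistent with the rounded figures on this slide.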
Outline
• Introduction
• Performance Evaluation of SMP Optimizations of
Charm++ on NUMA Machines
• NUMA-aware Load Balancer on Charm++
• Conclusion and Future Work
NUMA-aware Load Balancer
• Use knowledge about the system
• NUMA-factor among nodes
• Collected through libarchtopo
• Communication history
• No knowledge about the chare’s memory
• Improve performance by reducing
communication latency
• Avoid too many chare migrations
NUMA-aware Load Balancer
Calculate processors’ load
Sort chares by decreasing load
While there are migratable chares
    Pick most loaded chare k
    Compute W(k,i) for all processors i
    Migrate k to the processor with the smallest W(k,i)
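The steps above can be sketched as a greedy loop. This is an illustration under stated assumptions, not the authors' implementation: the function and parameter names are hypothetical, the cost W(k,i) is passed in as a callable, a chare's own load is removed from its current processor before candidates are evaluated, and ties are broken in favor of the current processor to limit migrations.

```python
def numa_aware_balance(chare_load, placement, num_procs, cost):
    """Greedy pass: move each chare to the processor with the smallest cost.

    chare_load: dict chare -> load
    placement:  dict chare -> current processor (not modified)
    cost:       callable (chare, proc, proc_load, placement) -> W(k, i)
    Returns the new placement.
    """
    new_placement = dict(placement)
    # Calculate processors' load from the current placement.
    proc_load = [0.0] * num_procs
    for chare, proc in placement.items():
        proc_load[proc] += chare_load[chare]

    # Visit chares from most to least loaded.
    for chare in sorted(chare_load, key=chare_load.get, reverse=True):
        current = new_placement[chare]
        proc_load[current] -= chare_load[chare]  # assumption: exclude own load
        # Smallest W(k, i); ties prefer the current processor, then the lowest id.
        best = min(
            range(num_procs),
            key=lambda i: (cost(chare, i, proc_load, new_placement), i != current, i),
        )
        proc_load[best] += chare_load[chare]
        new_placement[chare] = best
    return new_placement


# Hypothetical example: cost is the candidate's load only (no communication term).
loads = {"a": 4.0, "b": 3.0, "c": 2.0, "d": 1.0}
start = {"a": 0, "b": 0, "c": 0, "d": 0}
result = numa_aware_balance(loads, start, 2, lambda k, i, load, place: load[i])
```

In this toy run all four chares start on processor 0; the pass moves two of them and ends with a load of 5.0 on each processor.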
NUMA-aware Load Balancer
$$W(k,i) = L(i) + \alpha \left( -M(k,i) + \sum_{j=1,\; j \neq i}^{N} M(k,j) \cdot NF(j,i) \right)$$
• L(i): load on candidate processor (core)
• α: communication weight (constant)
• −M(k,i): intra-core communications (extended for intra-NUMA node)
• Σ M(k,j)·NF(j,i): inter-core communications (extended for inter-NUMA node)
• NF(j,i): NUMA factor
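A minimal sketch of evaluating this cost for one chare k, assuming hypothetical inputs: `load[i]` plays the role of L(i), `msgs[j]` the role of M(k,j), `nf[j][i]` the role of NF(j,i), and `alpha` is the communication weight. The data layout and the example values are illustrative only.

```python
def placement_cost(i, load, msgs, nf, alpha):
    """W(k, i) = L(i) + alpha * (-M(k, i) + sum_{j != i} M(k, j) * NF(j, i))."""
    remote = sum(msgs[j] * nf[j][i] for j in range(len(load)) if j != i)
    return load[i] + alpha * (-msgs[i] + remote)


# Hypothetical 3-processor example for a single chare k.
load = [5.0, 2.0, 2.0]   # L(i)
msgs = [0.0, 8.0, 1.0]   # M(k, j): k talks mostly to chares on processor 1
nf = [                   # NF(j, i): NUMA factor between processors j and i
    [1.0, 1.5, 2.0],
    [1.5, 1.0, 1.5],
    [2.0, 1.5, 1.0],
]
costs = [placement_cost(i, load, msgs, nf, alpha=0.5) for i in range(3)]
best = min(range(3), key=costs.__getitem__)
```

Processors 1 and 2 carry the same load here, but processor 1 gets the lowest cost because most of the chare's communication partners are already placed there.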
Load Balancer Evaluation
• Benchmarks
• Imbalance
• Jacobi2D
• Poisson3D
• Comparison with different load balancers
• GreedyLB
• GreedyCommLB
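For context, the greedy strategy behind GreedyLB can be summarized as: assign the heaviest remaining object to the currently least loaded processor, without regard to where objects were placed before. The sketch below is a simplified illustration with hypothetical names, not the Charm++ source. It also counts how many chares end up on a different processor, which is the migration cost that a balancer may want to limit.

```python
import heapq


def greedy_assign(chare_load, placement, num_procs):
    """Heaviest chare first, each onto the least loaded processor so far."""
    heap = [(0.0, p) for p in range(num_procs)]  # (accumulated load, processor)
    heapq.heapify(heap)
    new_placement = {}
    for chare in sorted(chare_load, key=chare_load.get, reverse=True):
        load, proc = heapq.heappop(heap)
        new_placement[chare] = proc
        heapq.heappush(heap, (load + chare_load[chare], proc))
    migrations = sum(1 for c in placement if new_placement[c] != placement[c])
    return new_placement, migrations


# Hypothetical, already balanced placement: greedy still reshuffles it.
loads = {"a": 3.0, "b": 3.0, "c": 1.0, "d": 1.0}
before = {"a": 1, "b": 0, "c": 1, "d": 0}
after, moved = greedy_assign(loads, before, 2)
```

In this toy case both placements give each processor a load of 4.0, yet every chare changes processor, so the rebalancing step pays migration cost without improving balance.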
Benchmark: Imbalance
• By Isaac Dooley
• Based on Fractography3D
• Iterative benchmark
• Imbalance increases with computations
• Computations over 2D array of chares
• Communications with 4 neighbors
Imbalance on Opteron Machine
[Figure: total time speedup versus number of cores (8, 16) on the Opteron machine. Series: No LB, GreedyCommLB, GreedyLB, NumaLB. Annotations: 15%; 5%.]
Imbalance on Xeon Machine
[Figure: total time speedup versus number of cores (8, 16, 32) on the Xeon machine. Series: No LB, GreedyCommLB, GreedyLB, NumaLB. Annotation: ~5% (inside error margin).]
Benchmark: Jacobi2D
• Iterative benchmark
• Computations over 2D matrix
• Communications with 4 neighbors
• Stencil (CPU bound)
• Imbalanced
Jacobi2D on Opteron Machine
[Figure: iteration time speedup versus number of cores (8, 16) on the Opteron machine. Series: No LB, GreedyCommLB, GreedyLB, NumaLB.]
Jacobi2D on Xeon Machine
[Figure: iteration time speedup versus number of cores (8, 16, 32) on the Xeon machine. Series: No LB, GreedyCommLB, GreedyLB, NumaLB. Annotations: 17%; 3%.]
Benchmark: Poisson3D
• By Xavier Besseron and Thierry Gautier
• Solves the Poisson equation on a 3D
domain
• Parallelized by domain decomposition
• Well balanced
Poisson3D on Opteron Machine
[Figure: total time speedup versus number of cores (8, 16) on the Opteron machine. Series: No LB, GreedyCommLB, GreedyLB, NumaLB. Data labels shown: 0.900, 0.797, 0.999, 0.989.]
Poisson3D on Xeon Machine
[Figure: total time speedup versus number of cores (8, 16, 32) on the Xeon machine. Series: No LB, GreedyCommLB, GreedyLB, NumaLB. Data labels shown: 0.914, 0.837, 0.711 and 0.995, 0.993, 0.983.]
GreedyLB performance decreased due to migration overhead
Outline
• Introduction
• Performance Evaluation of SMP Optimizations of
Charm++ on NUMA Machines
• NUMA-aware Load Balancer on Charm++
• Conclusion and Future Work
Conclusion
• SMP optimizations do affect the
performance on NUMA machines
• Up to 50% between versions and 90%
between architecture-specific compilations
• Gains with NUMA LB
• Speedups of up to 2.8 (compared to no LB)
• Performance near GreedyLB
• Avoid migrations
Future Work
• Evolution of NUMA-aware LB
• Consider topology
• Number of hops
• Cache hierarchy
• Memory per chare
• Improve NUMA information discovery
• Initialization overheads
• Run experiments with communication-intensive
benchmarks
• Interface for a memory LB?
Charm++ on NUMA Platforms:
the impact of SMP Optimizations
and a NUMA-aware Load Balancer
Thank you