![Page 1: A Study of Garbage Collector Scalability on Multicores LokeshGidra, Gaël Thomas, JulienSopena and Marc Shapiro INRIA/University of Paris 6.](https://reader038.fdocuments.us/reader038/viewer/2022110319/56649c725503460f949241a3/html5/thumbnails/1.jpg)
A Study of Garbage Collector Scalability on Multicores
Lokesh Gidra, Gaël Thomas, Julien Sopena and Marc Shapiro
INRIA/University of Paris 6
![Page 2: A Study of Garbage Collector Scalability on Multicores LokeshGidra, Gaël Thomas, JulienSopena and Marc Shapiro INRIA/University of Paris 6.](https://reader038.fdocuments.us/reader038/viewer/2022110319/56649c725503460f949241a3/html5/thumbnails/2.jpg)
Garbage collection on multicore hardware
14 of the 20 most popular languages have GC, but GC doesn't scale on multicore hardware
Parallel Scavenge/HotSpot scalability on a 48-core machine
[Figure: GC throughput (GB/s, higher is better) vs. number of GC threads; SPECjbb2005 with 48 application threads/3.5 GB]
Degrades after 24 GC threads
Lokesh Gidra
A Study of the Scalability of Garbage Collectors on Multicores
![Page 3: A Study of Garbage Collector Scalability on Multicores LokeshGidra, Gaël Thomas, JulienSopena and Marc Shapiro INRIA/University of Paris 6.](https://reader038.fdocuments.us/reader038/viewer/2022110319/56649c725503460f949241a3/html5/thumbnails/3.jpg)
Scalability of GC is a bottleneck
With more cores, the application creates more garbage per unit of time
Without GC scalability, the time spent in GC increases
Lusearch [PLOS'11]: ~50% of the time is spent in the GC at 48 cores
![Page 4: A Study of Garbage Collector Scalability on Multicores LokeshGidra, Gaël Thomas, JulienSopena and Marc Shapiro INRIA/University of Paris 6.](https://reader038.fdocuments.us/reader038/viewer/2022110319/56649c725503460f949241a3/html5/thumbnails/4.jpg)
Where is the problem?
Probably not related to GC design: the problem exists in ALL the GCs of HotSpot 7 (both stop-the-world and concurrent GCs)
What has really changed:
Multicores are now distributed architectures, no longer centralized ones
![Page 5: A Study of Garbage Collector Scalability on Multicores LokeshGidra, Gaël Thomas, JulienSopena and Marc Shapiro INRIA/University of Paris 6.](https://reader038.fdocuments.us/reader038/viewer/2022110319/56649c725503460f949241a3/html5/thumbnails/5.jpg)
From centralized architectures to distributed ones
A few years ago: uniform memory access machines
[Diagram: cores and a single memory connected by a system bus]
Now: non-uniform memory access machines
[Diagram: Nodes 0 to 3, each with its own cores and memory, connected by an interconnect]
![Page 6: A Study of Garbage Collector Scalability on Multicores LokeshGidra, Gaël Thomas, JulienSopena and Marc Shapiro INRIA/University of Paris 6.](https://reader038.fdocuments.us/reader038/viewer/2022110319/56649c725503460f949241a3/html5/thumbnails/6.jpg)
From centralized architectures to distributed ones
Our machine: AMD Magny-Cours with 8 nodes and 48 cores
6 cores per node, 12 GB per node
[Diagram: Nodes 0 to 3, each with its own memory]
Local access: ~130 cycles
Remote access: ~350 cycles
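To make the cost of remote accesses concrete, the two latencies above can be combined into an average cost per access. This is only a back-of-the-envelope sketch: it assumes a single remote latency for every remote node and, for the uniform case, that memory is spread evenly over the 8 nodes. The function name `expected_latency` is hypothetical.

```python
from fractions import Fraction

LOCAL_CYCLES = 130   # approximate local access latency from the slide
REMOTE_CYCLES = 350  # approximate remote access latency from the slide
NODES = 8            # NUMA nodes on the machine

def expected_latency(local_fraction):
    """Average cycles per access when `local_fraction` of accesses are local."""
    return local_fraction * LOCAL_CYCLES + (1 - local_fraction) * REMOTE_CYCLES

# Assumption: data spread uniformly over the nodes, so 1/8 of accesses are local.
uniform = expected_latency(Fraction(1, NODES))
# Best case: every access is local.
all_local = expected_latency(Fraction(1))

print(float(uniform), float(all_local))
```

Under these assumptions, uniformly spread data costs about 2.5 times as much per access as fully local data.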
![Page 7: A Study of Garbage Collector Scalability on Multicores LokeshGidra, Gaël Thomas, JulienSopena and Marc Shapiro INRIA/University of Paris 6.](https://reader038.fdocuments.us/reader038/viewer/2022110319/56649c725503460f949241a3/html5/thumbnails/7.jpg)
From centralized architectures to distributed ones
[Figure: completion time (ms, lower is better) for a fixed number of parallel reads, with #cores = #threads]
![Page 8: A Study of Garbage Collector Scalability on Multicores LokeshGidra, Gaël Thomas, JulienSopena and Marc Shapiro INRIA/University of Paris 6.](https://reader038.fdocuments.us/reader038/viewer/2022110319/56649c725503460f949241a3/html5/thumbnails/8.jpg)
From centralized architectures to distributed ones
[Figure: completion time (ms, lower is better) for a fixed number of parallel reads, with #cores = #threads; curve: local access]
![Page 9: A Study of Garbage Collector Scalability on Multicores LokeshGidra, Gaël Thomas, JulienSopena and Marc Shapiro INRIA/University of Paris 6.](https://reader038.fdocuments.us/reader038/viewer/2022110319/56649c725503460f949241a3/html5/thumbnails/9.jpg)
From centralized architectures to distributed ones
[Figure: completion time (ms, lower is better) for a fixed number of parallel reads, with #cores = #threads; curves: local access, random access]
![Page 10: A Study of Garbage Collector Scalability on Multicores LokeshGidra, Gaël Thomas, JulienSopena and Marc Shapiro INRIA/University of Paris 6.](https://reader038.fdocuments.us/reader038/viewer/2022110319/56649c725503460f949241a3/html5/thumbnails/10.jpg)
From centralized architectures to distributed ones
[Figure: completion time (ms, lower is better) for a fixed number of parallel reads, with #cores = #threads; curves: local access, random access, single-node access]
![Page 11: A Study of Garbage Collector Scalability on Multicores LokeshGidra, Gaël Thomas, JulienSopena and Marc Shapiro INRIA/University of Paris 6.](https://reader038.fdocuments.us/reader038/viewer/2022110319/56649c725503460f949241a3/html5/thumbnails/11.jpg)
Parallel Scavenge Heap Space
Kernel's lazy first-touch page allocation policy
[Diagram: the Parallel Scavenge heap in the virtual address space, under the first-touch allocation policy]
![Page 12: A Study of Garbage Collector Scalability on Multicores LokeshGidra, Gaël Thomas, JulienSopena and Marc Shapiro INRIA/University of Paris 6.](https://reader038.fdocuments.us/reader038/viewer/2022110319/56649c725503460f949241a3/html5/thumbnails/12.jpg)
Parallel Scavenge Heap Space
Kernel's lazy first-touch page allocation policy ⇒ the initial sequential phase maps most pages on the first node
[Diagram: the initial application thread touching the Parallel Scavenge heap under the first-touch allocation policy]
![Page 13: A Study of Garbage Collector Scalability on Multicores LokeshGidra, Gaël Thomas, JulienSopena and Marc Shapiro INRIA/University of Paris 6.](https://reader038.fdocuments.us/reader038/viewer/2022110319/56649c725503460f949241a3/html5/thumbnails/13.jpg)
Parallel Scavenge Heap Space
Kernel's lazy first-touch page allocation policy ⇒ the initial sequential phase maps most pages on its node
[Diagram: the initial application thread touching the Parallel Scavenge heap under the first-touch allocation policy]
But the mapping stays as it is for the whole execution (the virtual space is reused by the GC)
A severe problem for generational GCs
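The first-touch effect described above can be illustrated with a toy model. The sketch below is not the kernel's implementation: it only records, for each page, the node of the first thread that touches it. The workload (100 pages, 95 of them touched by a start-up thread on node 0) and the name `first_touch` are hypothetical, chosen to show how a sequential start-up phase pins most pages to one node.

```python
from collections import Counter

NODES = 8

def first_touch(touches):
    """Map each page to the node of the first thread that touches it.

    `touches` is a sequence of (page, node_of_touching_thread) pairs in
    program order. A page that is already mapped keeps its node.
    """
    mapping = {}
    for page, node in touches:
        mapping.setdefault(page, node)
    return mapping

# Hypothetical workload: a start-up thread on node 0 touches 95 of 100 pages.
startup = [(page, 0) for page in range(95)]
# Later, threads on nodes 1..7 touch every page, but mapped pages do not move.
parallel = [(page, 1 + page % (NODES - 1)) for page in range(100)]

mapping = first_touch(startup + parallel)
per_node = Counter(mapping.values())
print(per_node[0], "of", len(mapping), "pages are on node 0")
```

Because the GC reuses the same virtual address range, later collections in this model would see the same skewed mapping.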
![Page 14: A Study of Garbage Collector Scalability on Multicores LokeshGidra, Gaël Thomas, JulienSopena and Marc Shapiro INRIA/University of Paris 6.](https://reader038.fdocuments.us/reader038/viewer/2022110319/56649c725503460f949241a3/html5/thumbnails/14.jpg)
Parallel Scavenge Heap Space
Parallel Scavenge, first-touch allocation policy: 95% on a single node
Bad balance
Bad locality
[Figure: GC throughput (GB/s, higher is better) vs. number of GC threads on SpecJBB; curve: PS]
![Page 15: A Study of Garbage Collector Scalability on Multicores LokeshGidra, Gaël Thomas, JulienSopena and Marc Shapiro INRIA/University of Paris 6.](https://reader038.fdocuments.us/reader038/viewer/2022110319/56649c725503460f949241a3/html5/thumbnails/15.jpg)
NUMA-aware heap layouts

| | Parallel Scavenge | Interleaved | Fragmented |
|---|---|---|---|
| Allocation policy | First-touch | Round-robin | Node-local object allocation and copy |
| Targets | | Balance | Locality |
| Balance | Bad | | |
| Locality | Bad | | |
| Notes | 95% on a single node | | |

[Figure: GC throughput (GB/s, higher is better) vs. number of GC threads on SpecJBB; curve: PS]
![Page 16: A Study of Garbage Collector Scalability on Multicores LokeshGidra, Gaël Thomas, JulienSopena and Marc Shapiro INRIA/University of Paris 6.](https://reader038.fdocuments.us/reader038/viewer/2022110319/56649c725503460f949241a3/html5/thumbnails/16.jpg)
Interleaved heap layout analysis

| | Parallel Scavenge | Interleaved | Fragmented |
|---|---|---|---|
| Allocation policy | First-touch | Round-robin | Node-local object allocation and copy |
| Balance | Bad | Perfect | |
| Locality | Bad | Bad | |
| Notes | 95% on a single node | 7/8 remote accesses | |

[Figure: GC throughput (GB/s, higher is better) vs. number of GC threads on SpecJBB; curves: PS, Interleaved]
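The "perfect balance" and "7/8 remote accesses" figures for the interleaved layout follow directly from round-robin placement over 8 nodes. The sketch below checks both, assuming a hypothetical heap of 8000 pages and a thread that accesses every page equally often. The helper names are illustrative only.

```python
from fractions import Fraction

NODES = 8

def interleaved_node(page):
    """Round-robin placement: consecutive pages go to consecutive nodes."""
    return page % NODES

def remote_fraction(thread_node, pages):
    """Fraction of `pages` that live on a node other than `thread_node`."""
    remote = sum(1 for p in pages if interleaved_node(p) != thread_node)
    return Fraction(remote, len(pages))

pages = range(8 * 1000)  # hypothetical heap of 8000 pages

# Balance: how many pages each node holds.
pages_per_node = [sum(1 for p in pages if interleaved_node(p) == n)
                  for n in range(NODES)]
# Locality: share of remote pages seen from a thread on each node.
remote_per_node = [remote_fraction(n, pages) for n in range(NODES)]

print(pages_per_node)
print(remote_per_node[0])
```

Every node holds the same number of pages, and from any node 7 of every 8 pages are remote, which is why this layout targets balance rather than locality.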
![Page 17: A Study of Garbage Collector Scalability on Multicores LokeshGidra, Gaël Thomas, JulienSopena and Marc Shapiro INRIA/University of Paris 6.](https://reader038.fdocuments.us/reader038/viewer/2022110319/56649c725503460f949241a3/html5/thumbnails/17.jpg)
Fragmented heap layout analysis

| | Parallel Scavenge | Interleaved | Fragmented |
|---|---|---|---|
| Allocation policy | First-touch | Round-robin | Node-local object allocation and copy |
| Balance | Bad | Perfect | Good |
| Locality | Bad | Bad | Average |
| Notes | 95% on a single node | 7/8 remote accesses | Bad balance if a single thread allocates for the others; 7/8 remote scans; 100% local copies |

[Figure: GC throughput (GB/s, higher is better) vs. number of GC threads on SpecJBB; curves: PS, Interleaved, Fragmented]
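The balance behaviour of the fragmented layout can be sketched with a toy model in which the space is split into one fragment per node and each thread allocates into the fragment of the node it runs on. The class `FragmentedSpace` and the 800-object workloads are hypothetical and are not HotSpot code. They only illustrate why balance is good when all threads allocate and bad when a single thread allocates for the others.

```python
NODES = 8

class FragmentedSpace:
    """Toy model: one fragment per NUMA node, filled by node-local threads."""

    def __init__(self, nodes=NODES):
        self.fragments = [[] for _ in range(nodes)]

    def allocate(self, thread_node, obj):
        """Place `obj` in the fragment of the allocating thread's node."""
        self.fragments[thread_node].append(obj)
        return thread_node  # node that now holds the object

    def load(self):
        """Number of objects held by each node."""
        return [len(fragment) for fragment in self.fragments]

# Balanced case: threads on every node allocate their own objects.
balanced = FragmentedSpace()
for i in range(800):
    balanced.allocate(i % NODES, f"obj{i}")

# Imbalanced case from the slide: one thread (on node 0) allocates for all.
skewed = FragmentedSpace()
for i in range(800):
    skewed.allocate(0, f"obj{i}")

print(balanced.load())
print(skewed.load())
```

In the same model, a GC thread that copies a live object would call `allocate` with its own node, which corresponds to the "100% local copies" note, while the objects it scans may sit in any fragment.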
![Page 18: A Study of Garbage Collector Scalability on Multicores LokeshGidra, Gaël Thomas, JulienSopena and Marc Shapiro INRIA/University of Paris 6.](https://reader038.fdocuments.us/reader038/viewer/2022110319/56649c725503460f949241a3/html5/thumbnails/18.jpg)
Synchronization optimizations
Removed a barrier between the GC phases
Replaced the GC task-queue with a lock-free one
The synchronization optimization takes effect under high contention
[Figure: GC throughput (GB/s, higher is better) vs. number of GC threads on SpecJBB; curves: PS, Interleaved, Fragmented, Fragmented + synchro]
![Page 19: A Study of Garbage Collector Scalability on Multicores LokeshGidra, Gaël Thomas, JulienSopena and Marc Shapiro INRIA/University of Paris 6.](https://reader038.fdocuments.us/reader038/viewer/2022110319/56649c725503460f949241a3/html5/thumbnails/19.jpg)
Effect of Optimizations on the App (GC excluded)
A good balance greatly improves application time
Locality has only a marginal effect on the application, even though the fragmented space increases locality for the application over the interleaved space (recently allocated objects are the most accessed)
[Figure: application time (lower is better) vs. number of GC threads; XML Transform from SPECjvm; curves: PS, other heap layouts]
![Page 20: A Study of Garbage Collector Scalability on Multicores LokeshGidra, Gaël Thomas, JulienSopena and Marc Shapiro INRIA/University of Paris 6.](https://reader038.fdocuments.us/reader038/viewer/2022110319/56649c725503460f949241a3/html5/thumbnails/20.jpg)
Overall effect (both GC and application)
Optimizations double the application throughput of SPECjbb
Pause time is roughly halved (105 ms to 49 ms)
[Figure: application throughput (ops/ms, higher is better) vs. number of GC threads on SpecJBB; curves: PS, Interleaved, Fragmented, Fragmented + synchro]
![Page 21: A Study of Garbage Collector Scalability on Multicores LokeshGidra, Gaël Thomas, JulienSopena and Marc Shapiro INRIA/University of Paris 6.](https://reader038.fdocuments.us/reader038/viewer/2022110319/56649c725503460f949241a3/html5/thumbnails/21.jpg)
GC scales well with memory-intensive applications
[Figure: panels labelled 3.5 GB, 1 GB, 2 GB, 512 MB, 1 GB, 2 GB; curves: PS, Fragmented + synchro]
![Page 22: A Study of Garbage Collector Scalability on Multicores LokeshGidra, Gaël Thomas, JulienSopena and Marc Shapiro INRIA/University of Paris 6.](https://reader038.fdocuments.us/reader038/viewer/2022110319/56649c725503460f949241a3/html5/thumbnails/22.jpg)
Take Away
Previous GCs do not scale because they are not NUMA-aware
Existing mature GCs can scale with standard parallel programming techniques
Using NUMA-aware memory layouts should be useful for all GCs (concurrent GCs included)
Most important NUMA effects
1. Balancing memory access
2. Memory locality only helps at high core count
![Page 23: A Study of Garbage Collector Scalability on Multicores LokeshGidra, Gaël Thomas, JulienSopena and Marc Shapiro INRIA/University of Paris 6.](https://reader038.fdocuments.us/reader038/viewer/2022110319/56649c725503460f949241a3/html5/thumbnails/23.jpg)
Take Away
Previous GCs do not scale due to NUMA obliviousness
Existing mature GCs can scale with standard parallel programming techniques
Using NUMA-aware memory layouts should be useful for all GCs (concurrent GCs included)
Most important NUMA effects
1. Balancing memory access
2. Memory locality at high core count
Thank You