Hoard: A Scalable Memory Allocator for Multithreaded Applications
Emery Berger, Kathryn McKinley, Robert Blumofe, Paul Wilson
Presented by Ivan Jibaja
(Some slides adapted from Emery Berger’s presentation)
Outline
• Motivation
• Problems in allocator design
  – False sharing
  – Fragmentation
• Existing approaches
• Hoard design
• Experimental evaluation
Motivation
• Parallel multithreaded programs are prevalent
  – Web servers, search engines, DB managers, etc.
  – Run on CMP/SMP for high performance
• Memory allocation is a bottleneck
  – Prevents scaling with the number of processors
Desired allocator attributes on a multiprocessor system
• Speed
  – Competitive with uniprocessor allocators on 1 CPU
• Scalability
  – Performance linear with the number of processors
• Fragmentation (= max allocated / max in use)
  – High fragmentation → poor data locality → paging
• False sharing avoidance
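As a rough illustration of the fragmentation metric above (max allocated / max in use), the following Python sketch computes it from a trace of samples. The function name `fragmentation` and the sample trace are hypothetical and are not taken from the slides.

```python
def fragmentation(samples):
    """Return max allocated / max in use for a trace.

    samples: list of (bytes_held_by_allocator, bytes_in_use_by_program)
    pairs, one per point in time. The trace format is an assumption.
    """
    max_allocated = max(allocated for allocated, _ in samples)
    max_in_use = max(in_use for _, in_use in samples)
    return max_allocated / max_in_use


# Hypothetical trace: the allocator peaks at 8192 bytes held,
# while the program never uses more than 4096 bytes at once.
trace = [(4096, 1000), (8192, 4096), (8192, 2048)]
ratio = fragmentation(trace)  # 8192 / 4096 = 2.0
```

A ratio of 1.0 would mean the allocator never held more memory than the program needed at its peak.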
The problem of false sharing
• Program causes false sharing
  – Allocates a number of objects in a cache line, passes objects to different threads
• Allocators cause false sharing!
  – Actively: malloc satisfies different thread requests from the same cache line
  – Passively: free allows a future malloc to produce false sharing
Figure: processor 1 runs x1 = malloc(s) and processor 2 runs x2 = malloc(s); both objects land in one cache line, which thrashes between the two processors.
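To make actively induced false sharing concrete, here is a minimal Python sketch that checks which cache lines end up holding objects owned by more than one thread. The 64-byte line size, the addresses, and the helper names (`cache_line`, `shared_lines`) are assumptions chosen for illustration.

```python
CACHE_LINE = 64  # assumed cache line size in bytes


def cache_line(addr, line_size=CACHE_LINE):
    """Index of the cache line that contains the given address."""
    return addr // line_size


def shared_lines(owners, line_size=CACHE_LINE):
    """owners maps address -> thread id; return lines used by 2+ threads."""
    threads_by_line = {}
    for addr, tid in owners.items():
        threads_by_line.setdefault(cache_line(addr, line_size), set()).add(tid)
    return {line for line, tids in threads_by_line.items() if len(tids) > 1}


# One shared heap: consecutive 8-byte objects go to alternating threads,
# so both threads write into the same cache line.
single_heap = {8 * i: i % 2 for i in range(8)}

# Separate regions per thread (hypothetical 4096-byte spacing):
# no cache line holds objects from both threads.
per_thread = {tid * 4096 + 8 * i: tid for tid in range(2) for i in range(4)}

falsely_shared = shared_lines(single_heap)   # line 0 is shared
not_shared = shared_lines(per_thread)        # empty set
```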
The problem of fragmentation
• Blowup:
  – Increase in memory consumption when the allocator reclaims memory freed by the program but fails to use it for future requests
  – Mainly a problem of concurrent allocators
  – Unbounded (worst case) or bounded (O(P))
Example: Pure Private Heaps Allocator
• Pure private heaps: one heap per processor
  – malloc gets memory from the processor's heap or the system
  – free puts memory on the processor's heap
• Avoids heap contention
• Examples: STL, Cilk
Figure: processor 1 and processor 2 issue x1 = malloc(s), x2 = malloc(s), x3 = malloc(s), x4 = malloc(s), free(x1), and free(x2). Legend: blocks allocated by heap 1; blocks free, on heap 2.
How to Break Pure Private Heaps: Fragmentation
• Pure private heaps: memory consumption can grow without bound!
• Producer-consumer:
  – processor 1 allocates
  – processor 2 frees
  – Memory always unavailable to producer
Figure: processor 1 calls x1 = malloc(s), x2 = malloc(s), x3 = malloc(s); processor 2 calls free(x1), free(x2), free(x3).
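The producer-consumer pattern above can be sketched with a small counting model. This is a simplified illustration, not code from any of the allocators named in the slides: the class `PurePrivateHeaps` and the function `producer_consumer` are hypothetical, and each heap is reduced to a count of free blocks.

```python
class PurePrivateHeaps:
    """Counting model of pure private heaps (one free list per processor)."""

    def __init__(self, nprocs):
        self.free_blocks = [0] * nprocs  # free blocks on each heap
        self.from_os = 0                 # blocks requested from the system

    def malloc(self, proc):
        if self.free_blocks[proc] > 0:
            self.free_blocks[proc] -= 1  # reuse a block from the local heap
        else:
            self.from_os += 1            # local heap empty: go to the system

    def free(self, proc):
        # The block lands on the heap of the processor that frees it.
        self.free_blocks[proc] += 1


def producer_consumer(rounds):
    heaps = PurePrivateHeaps(2)
    for _ in range(rounds):
        heaps.malloc(0)  # processor 1 allocates
        heaps.free(1)    # processor 2 frees
    return heaps.from_os, heaps.free_blocks


blocks_from_os, free_lists = producer_consumer(1000)
```

In this model only one block is live at a time, yet every round fetches a fresh block from the system, because the freed blocks pile up on the consumer's heap where the producer cannot reach them.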
Example II: Private Heaps with Ownership
• free puts memory back on the originating processor's heap.
• Avoids unbounded memory consumption
• Examples: ptmalloc, LKmalloc
Figure: processor 1 and processor 2 issue x1 = malloc(s), free(x1), x2 = malloc(s), and free(x2).
How to Break Private Heaps with Ownership: Fragmentation
• Memory consumption can blow up by a factor of P.
• Round-robin producer-consumer:
  – processor i allocates
  – processor i+1 frees
• Program requires 1 (K) blocks, allocator gets 3 (P*K) blocks
Figure: processors 1, 2, and 3 call x1 = malloc(s), x2 = malloc(s), and x3 = malloc(s); each block is freed (free(x1), free(x2), free(x3)) by the next processor.
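The P-fold blowup can be reproduced with a counting model in which a freed block always returns to the heap that allocated it. This is a sketch under simplifying assumptions: `OwnershipHeaps` and `round_robin` are hypothetical names, and a block is represented only by its owner's index.

```python
class OwnershipHeaps:
    """Counting model of private heaps with ownership."""

    def __init__(self, nprocs):
        self.free_blocks = [0] * nprocs
        self.from_os = 0

    def malloc(self, proc):
        if self.free_blocks[proc] > 0:
            self.free_blocks[proc] -= 1
        else:
            self.from_os += 1
        return proc  # the block remembers its owning heap

    def free(self, owner):
        # Whoever calls free, the block goes back to its owner's heap.
        self.free_blocks[owner] += 1


def round_robin(nprocs, k, cycles):
    """Processor i allocates k blocks, processor i+1 frees them, repeated."""
    heaps = OwnershipHeaps(nprocs)
    for _ in range(cycles):
        for i in range(nprocs):
            blocks = [heaps.malloc(i) for _ in range(k)]
            for owner in blocks:  # freed by processor (i + 1) % nprocs
                heaps.free(owner)
    return heaps.from_os


total_blocks = round_robin(nprocs=3, k=1, cycles=10)
```

With P = 3 and K = 1 the model fetches 3 blocks from the system even though only 1 is ever live, matching the 1 (K) versus 3 (P*K) figures above; consumption stays bounded because later cycles reuse each heap's own blocks.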
Existing approaches
Uniprocessor Allocators on Multiprocessors
• Fragmentation: Excellent
  – Very low for most programs [Wilson & Johnstone]
• Speed & Scalability: Poor
  – Heap contention: a single lock protects the heap
• Can exacerbate false sharing
  – Different processors can share cache lines
Existing Multiprocessor Allocators
• Speed:
  – One concurrent heap (e.g., concurrent B-tree):
    • O(log (#size-classes)) cost per memory operation
    • Too many locks/atomic updates
  – Fast allocators use multiple heaps
• Scalability:
  – Allocator-induced false sharing
  – Other bottlenecks (e.g., nextHeap global in Ptmalloc)
• Fragmentation:
  – P-fold increase or even unbounded
Hoard as the solution
Hoard Overview
• P per-processor heaps & 1 global heap
• Each thread accesses only its local heap & the global heap
• Manages memory in page-sized superblocks of same-sized objects (LIFO free-list)
  – Avoids false sharing by not carving up cache lines
  – Avoids heap contention: local heaps allocate & free small blocks from their superblocks
• Avoids blowup by
  – Moving superblocks to the global heap when the fraction of free memory exceeds some threshold
Superblock management
Emptiness threshold: (u_i ≥ (1 − f)*a_i) ∨ (u_i ≥ a_i − K*S)
where u_i is the memory in use in heap i, a_i is the memory held by heap i, and S is the superblock size
f = ¼, K = 0
• Multiple heaps → avoid actively induced false sharing
• Block coalescing → avoid passively induced false sharing
• Superblocks transferred are usually empty and transfer is infrequent
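The emptiness threshold can be read as a predicate on a heap's statistics. The sketch below negates the invariant to decide when a heap has become too empty; the function name `violates_emptiness_invariant` is hypothetical, and the defaults (S = 8192, f = ¼, K = 0) are illustrative values, not required ones.

```python
def violates_emptiness_invariant(u_i, a_i, S=8192, f=0.25, K=0):
    """True when heap i is too empty, i.e. the invariant
    (u_i >= (1 - f) * a_i) or (u_i >= a_i - K * S) no longer holds.

    u_i: bytes in use in heap i; a_i: bytes held by heap i.
    """
    return u_i < a_i - K * S and u_i < (1 - f) * a_i


# A heap holding four 8 KB superblocks (32768 bytes):
held = 4 * 8192
at_threshold = violates_emptiness_invariant(24576, held)     # exactly 3/4 in use
below_threshold = violates_emptiness_invariant(24568, held)  # one 8-byte block fewer
```

With K = 0 the first clause only excludes a completely full heap, so the decision is driven by the fraction f; a larger K would let a heap keep up to K superblocks' worth of free memory before any transfer is considered.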
Hoard pseudo-code
malloc(sz)
1. If sz > S/2, allocate the superblock from the OS and return it.
2. i ← hash(current thread)
3. Lock heap i
4. Scan heap i's list of superblocks from most full to least full (for the size class of sz)
5. If there is no superblock with free space {
6.   Check heap 0 (global) for a superblock
7.   If there is none {
8.     Allocate S bytes as superblock s & set owner to heap i
9.   } Else {
10.    Transfer the superblock s to heap i
11.    u_0 ← u_0 − s.u; u_i ← u_i + s.u
12.    a_0 ← a_0 − S; a_i ← a_i + S
13.  }
14. }
15. u_i ← u_i + sz; s.u ← s.u + sz
16. Unlock heap i
17. Return a block from the superblock
free(ptr)
1. If the block is "large"
2.   Free superblock to OS and return
3. Find the superblock s this block comes from
4. Lock s
5. Lock heap i, the superblock's owner
6. Deallocate the block from the superblock
7. u_i ← u_i − block size
8. s.u ← s.u − block size
9. If (i = 0) unlock heap i and superblock s, and return
10. If (u_i < a_i − K*S) and (u_i < (1 − f)*a_i) {
11.   Transfer a mostly-empty superblock s1 to heap 0 (global)
12.   u_0 ← u_0 + s1.u; u_i ← u_i − s1.u
13.   a_0 ← a_0 + S; a_i ← a_i − S
14. }
15. Unlock heap i and superblock s
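The bookkeeping in the pseudo-code can be traced with a small model. This is a sketch under strong simplifying assumptions, not Hoard's implementation: it is single-threaded (no locks), handles one size class only, ignores large objects, and represents a block by the superblock it came from. The names `Superblock`, `HoardModel`, and `round_robin_demo` are hypothetical.

```python
class Superblock:
    def __init__(self, owner, capacity):
        self.owner = owner        # index of the owning heap
        self.capacity = capacity  # number of blocks it can hold
        self.used = 0             # number of blocks currently allocated


class HoardModel:
    """Single-threaded, single-size-class model; heap 0 is the global heap."""

    def __init__(self, nprocs, S=8192, block_size=8, f=0.25, K=0):
        self.S, self.bs, self.f, self.K = S, block_size, f, K
        self.heaps = [[] for _ in range(nprocs + 1)]
        self.u = [0] * (nprocs + 1)  # bytes in use per heap
        self.a = [0] * (nprocs + 1)  # bytes held per heap
        self.from_os = 0             # superblocks obtained from the OS

    def _transfer(self, s, src, dst):
        self.heaps[src].remove(s)
        self.heaps[dst].append(s)
        s.owner = dst
        self.u[src] -= s.used * self.bs
        self.u[dst] += s.used * self.bs
        self.a[src] -= self.S
        self.a[dst] += self.S

    def malloc(self, i):
        with_space = [s for s in self.heaps[i] if s.used < s.capacity]
        if with_space:
            s = max(with_space, key=lambda sb: sb.used)  # most full first
        elif self.heaps[0]:
            s = self.heaps[0][0]        # reuse a superblock from the global heap
            self._transfer(s, 0, i)
        else:
            s = Superblock(i, self.S // self.bs)  # fresh superblock from the OS
            self.heaps[i].append(s)
            self.a[i] += self.S
            self.from_os += 1
        s.used += 1
        self.u[i] += self.bs
        return s  # stands in for the block: the superblock it lives in

    def free(self, s):
        i = s.owner
        s.used -= 1
        self.u[i] -= self.bs
        if i == 0:
            return
        if (self.u[i] < self.a[i] - self.K * self.S
                and self.u[i] < (1 - self.f) * self.a[i]):
            emptiest = min(self.heaps[i], key=lambda sb: sb.used)
            self._transfer(emptiest, i, 0)


def round_robin_demo(nprocs=3, cycles=100):
    model = HoardModel(nprocs)
    for _ in range(cycles):
        for i in range(1, nprocs + 1):
            model.free(model.malloc(i))  # allocate on heap i, then free
    return model.from_os
```

In this model the round-robin pattern needs a single superblock from the OS: each free leaves the owning heap below the emptiness threshold, so the superblock moves to the global heap, where the next processor's malloc picks it up.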
Heap contention
• Per-processor heap contention
  – 1 allocator thread / multiple threads free
    • Inherently unscalable
  – Pairs of producer/consumer threads
    • malloc/free calls serialized
    • At most 2X slowdown (undesirable but scalable)
  – Empirically only a small fraction of memory is freed by another thread → contention expected to be low
Heap contention (2)
• Global heap contention
  – Measure the number of global heap (GH) lock acquisitions as an upper bound
  – Growing phase:
    • Each thread makes at most k/(f*S/s) acquisitions for k mallocs
  – Shrinking phase:
    • Pathological case where the program frees (1-f) of each superblock and then frees every block in the superblock one at a time
  – Empirically: no excessive shrinking and gradual growth of memory usage → low overall contention
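For a sense of scale, the growing-phase bound k/(f*S/s) can be evaluated directly. The sketch below assumes s denotes the object size in bytes, which the slide does not state; the function name and the example parameters are hypothetical.

```python
def max_global_heap_acquisitions(k, S=8192, s=8, f=0.25):
    """Upper bound k / (f * S / s) on global heap lock acquisitions
    for k mallocs in the growing phase (s assumed to be the object size)."""
    blocks_per_acquisition = f * S / s
    return k / blocks_per_acquisition


# 100,000 mallocs of 8-byte objects with 8 KB superblocks and f = 1/4:
# f * S / s = 256, so the bound is 100000 / 256.
bound = max_global_heap_acquisitions(100_000)  # 390.625
```

Under these assumed parameters a thread touches the global heap lock at most about once per 256 allocations.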
Experimental Evaluation
• Dedicated 14-processor Sun Enterprise
  – 400 MHz UltraSPARC
  – 2 GB RAM, 4 MB L2 cache
  – Solaris 7
  – Superblock size = 8K, f = ¼
• Comparison between
  – Hoard
  – Ptmalloc (GNU libc, multiple heaps & ownership)
  – Mtmalloc (Solaris multithreaded allocator)
  – Solaris (default system allocator)
Benchmarks
Speed
Size classes need to be handled more cleverly
Scalability - threadtest
278% faster than Ptmalloc on 14 cpus
t threads allocate/deallocate 100,000/t 8-byte objects
Scalability – Larson
• “Bleeding” typical in server applications
• Mainly stays within empty fraction during execution
• 18X faster than next best allocator on 14 CPUs
Scalability - BEMengine
• Few times below empty fraction → low synchronization
False sharing behavior
• Active-false: each thread allocates a small object, writes to it a few times, and frees it
• Passive-false: allocate objects, hand them to threads that free them, then proceed as in Active-false
• Both illustrate the effects of contention in the cache-coherence mechanism
Fragmentation results
A large number of size classes remain live for the duration of the program and are scattered across blocks
Within 20% of Lea’s allocator
Hoard Conclusions
• Speed: Excellent
  – As fast as a uniprocessor allocator on one processor
  – Amortized O(1) cost
  – 1 lock for malloc, 2 for free
• Scalability: Excellent
  – Scales linearly with the number of processors
  – Avoids false sharing
• Fragmentation: Very good
  – Worst case is provably close to ideal
  – Actual observed fragmentation is low
Discussion Points
• If we had to re-evaluate Hoard today which benchmarks would we use?
• Are there any changes needed to make it work with languages like Java?