Automatic Pool Allocation: Improving Performance by Controlling Data Structure Layout in the Heap
Paper by: Chris Lattner and Vikram Adve, University of Illinois at Urbana-Champaign
Best Paper Award at PLDI 2005
Presented by: Jeff Da Silva, CARG - Aug 2nd 2005
Motivation
Computer architecture and compiler research has primarily focused on analyzing & optimizing memory access patterns for dense arrays rather than for pointer-based data structures.
e.g. caches, prefetching, loop transformations, etc.
Why? Compilers have precise knowledge of the runtime layout and traversal patterns associated with arrays. The layout of heap-allocated data structures and their traversal patterns can be difficult to statically predict (generally it's not worth the effort).
Intuition
Improving a dynamic data structure's spatial locality through program analysis (like shape analysis) is probably too difficult and might not be the best approach.
What if you could somehow influence the layout dynamically so that the data structure is allocated intelligently and possibly appears like a dense array?
A new approach: develop a new technique that operates at the "macroscopic" level, i.e. at the level of the entire data structure rather than individual pointers or objects.
What is the problem?
What the program creates: List 1 nodes, List 2 nodes, and Tree nodes, each a distinct data structure (figure).
What the compiler sees: the same nodes interleaved arbitrarily in one heap (figure).
What we want the program to create and the compiler to see: each data structure's nodes segregated in memory (figure).
Their Approach: Segregate the Heap
Step #1: Memory Usage Analysis
Build context-sensitive points-to graphs for the program using a fast unification-based algorithm
Step #2: Automatic Pool Allocation
Segregate memory based on points-to graph nodes
Find lifetime bounds for memory with escape analysis
Preserve the points-to-graph-to-pool mapping
Step #3: Follow-on pool-specific optimizations
Use segregation and the points-to graph for later optimizations
Why Segregate Data Structures?
Primary goal: better compiler information & control
Compiler knows where each data structure lives in memory
Compiler knows order of data in memory (in some cases)
Compiler knows type info for heap objects (from points-to info)
Compiler knows which pools point to which other pools
Second goal: better performance
Smaller working sets
Improved spatial locality, especially if allocation order matches traversal order
Sometimes converts irregular strides to regular strides
Contributions of this Paper
1. First "region inference" technique for C/C++: previous work required type-safe programs (ML, Java) and focused on memory management
2. Region inference driven by pointer analysis: enables handling non-type-safe programs, simplifies handling imperative programs, simplifies further pool+ptr transformations
3. New pool-based optimizations: exploit per-pool and pool-specific properties
4. Evaluation of impact on the memory hierarchy: we show that pool allocation reduces working sets
Outline
Introduction & Motivation
Automatic Pool Allocation Transformation
Pool Allocation-Based Optimizations
Pool Allocation & Optimization Performance Impact
Conclusion
Example

struct list { list *Next; int *Data; };

list *createnode(int *Data) {
  list *New = malloc(sizeof(list));
  New->Data = Data;
  return New;
}

void splitclone(list *L, list **R1, list **R2) {
  if (L == 0) { *R1 = *R2 = 0; return; }
  if (some_predicate(L->Data)) {
    *R1 = createnode(L->Data);
    splitclone(L->Next, &(*R1)->Next, R2);
  } else {
    *R2 = createnode(L->Data);
    splitclone(L->Next, R1, &(*R2)->Next);
  }
}
Example

void processlist(list *L) {
  list *A, *B, *tmp;
  // Clone L, splitting nodes into lists A and B.
  splitclone(L, &A, &B);
  processPortion(A); // Process first list
  processPortion(B); // Process second list
  // free A list
  while (A) { tmp = A->Next; free(A); A = tmp; }
  // free B list
  while (B) { tmp = B->Next; free(B); B = tmp; }
}

Note that lists A and B use distinct heap memory; it would therefore be beneficial if they were allocated using separate pools of memory.
Pool Alloc Runtime Library Interface

void poolcreate(Pool *PD, uint Size, uint Align);
  Initializes a pool descriptor (obtains one or more pages of memory using malloc).
void pooldestroy(Pool *PD);
  Releases pool memory and destroys the pool descriptor.
void *poolalloc(Pool *PD, uint numBytes);
void poolfree(Pool *PD, void *ptr);
void *poolrealloc(Pool *PD, void *ptr, uint numBytes);

The interface also uses: poolinit_bp(..), poolalloc_bp(..), pooldestroy_bp(..).
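The interface above might be implemented along these lines. This is a minimal sketch assuming fixed-size objects, a simple slab chain, and a per-pool free list; SLAB_OBJECTS and all internals here are illustrative assumptions, not the paper's runtime (which grows slabs exponentially and handles variable sizes).

```c
#include <stdlib.h>

typedef struct FreeNode { struct FreeNode *next; } FreeNode;
typedef struct Slab     { struct Slab *next; }     Slab;

typedef struct Pool {
    Slab     *slabs;    /* chain of large blocks obtained from malloc */
    FreeNode *freelist; /* objects returned by poolfree, reused first */
    unsigned  chunk;    /* per-object size, rounded up to alignment */
    char     *bump;     /* next unused byte in the current slab */
    unsigned  avail;    /* bytes remaining in the current slab */
} Pool;

enum { SLAB_OBJECTS = 64 };  /* objects fetched per slab: arbitrary choice */

void poolcreate(Pool *PD, unsigned Size, unsigned Align) {
    /* Align must be a power of two; objects must hold a free-list link. */
    unsigned c = Size < sizeof(FreeNode) ? (unsigned)sizeof(FreeNode) : Size;
    PD->chunk = (c + Align - 1) & ~(Align - 1);
    PD->slabs = NULL;
    PD->freelist = NULL;
    PD->bump = NULL;
    PD->avail = 0;
}

void *poolalloc(Pool *PD, unsigned numBytes) {
    (void)numBytes;  /* sketch: assumes every request is the declared Size */
    if (PD->freelist) {              /* reuse a freed object first */
        FreeNode *n = PD->freelist;
        PD->freelist = n->next;
        return n;
    }
    if (PD->avail < PD->chunk) {     /* current slab exhausted: get another */
        Slab *S = malloc(sizeof(Slab) + (size_t)PD->chunk * SLAB_OBJECTS);
        S->next = PD->slabs;
        PD->slabs = S;
        PD->bump = (char *)(S + 1);
        PD->avail = PD->chunk * SLAB_OBJECTS;
    }
    void *p = PD->bump;
    PD->bump += PD->chunk;
    PD->avail -= PD->chunk;
    return p;
}

void poolfree(Pool *PD, void *ptr) {  /* push onto the per-pool free list */
    FreeNode *n = ptr;
    n->next = PD->freelist;
    PD->freelist = n;
}

void pooldestroy(Pool *PD) {          /* release every slab in one sweep */
    while (PD->slabs) {
        Slab *S = PD->slabs;
        PD->slabs = S->next;
        free(S);
    }
    PD->freelist = NULL;
    PD->avail = 0;
}
```

Because a pool's objects come from contiguous slabs, objects of one data structure end up adjacent in memory, which is the layout property the transformation is after.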
Algorithm Steps
1) Generate a DS Graph (points-to graph) for each function
2) Insert code to create and destroy pool descriptors for DS nodes whose lifetime does not escape a function
3) Add pool descriptor arguments for every DS node that escapes a function
4) Replace malloc and free with calls to poolalloc and poolfree
5) Further refinements and optimizations
Points-To Graph – DS Graph
Builds a points-to graph for each function in Bottom-Up [BU] order
Context-sensitive naming of heap objects: more advanced than the traditional "allocation callsite"
A unification-based approach: allows for a fast and scalable analysis, and ensures every pointer points to one unique node
Field sensitive: added accuracy
Also used to compute escape info
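The unification property ("every pointer points to one unique node") can be sketched with a union-find structure: whenever the analysis discovers that two pointers may refer to the same memory, their target nodes are merged into one. This is a generic sketch of the idea, not the actual DSA implementation.

```c
/* Minimal union-find over points-to graph nodes. A node whose parent is
   itself is a representative; unify() merges two nodes so that every
   pointer keeps exactly one (representative) target node. Sketch only. */
typedef struct Node { struct Node *parent; } Node;

static Node *find(Node *n) {
    while (n->parent != n) {
        n->parent = n->parent->parent;  /* path halving for speed */
        n = n->parent;
    }
    return n;
}

static void unify(Node *a, Node *b) {
    a = find(a);
    b = find(b);
    if (a != b)
        b->parent = a;  /* merge: b's objects now share a's node */
}
```

Unification trades precision for speed: once two nodes are merged, all objects summarized by the merged node will later share one pool.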
Example – DS Graph

list *createnode(int *Data) {
  list *New = malloc(sizeof(list));
  New->Data = Data;
  return New;
}
Example – DS Graph

void splitclone(list *L, list **R1, list **R2) {
  if (L == 0) { *R1 = *R2 = 0; return; }
  if (some_predicate(L->Data)) {
    *R1 = createnode(L->Data);
    splitclone(L->Next, &(*R1)->Next, R2);
  } else {
    *R2 = createnode(L->Data);
    splitclone(L->Next, R1, &(*R2)->Next);
  }
}

[DS graph with heap nodes P1 and P2]
Example – DS Graph

void processlist(list *L) {
  list *A, *B, *tmp;
  // Clone L, splitting nodes into lists A and B.
  splitclone(L, &A, &B);
  processPortion(A); // Process first list
  processPortion(B); // Process second list
  // free A list
  while (A) { tmp = A->Next; free(A); A = tmp; }
  // free B list
  while (B) { tmp = B->Next; free(B); B = tmp; }
}

[DS graph with heap nodes P1 and P2]
Example – Transformation

list *createnode(Pool *PD, int *Data) {
  list *New = poolalloc(PD, sizeof(list));
  New->Data = Data;
  return New;
}
Example – Transformation

void splitclone(Pool *PD1, Pool *PD2, list *L, list **R1, list **R2) {
  if (L == 0) { *R1 = *R2 = 0; return; }
  if (some_predicate(L->Data)) {
    *R1 = createnode(PD1, L->Data);
    splitclone(PD1, PD2, L->Next, &(*R1)->Next, R2);
  } else {
    *R2 = createnode(PD2, L->Data);
    splitclone(PD1, PD2, L->Next, R1, &(*R2)->Next);
  }
}
Example – TransformationExample – Transformationvoid void processlistprocesslist(list *L) {(list *L) {
list *A, *B, *tmp;list *A, *B, *tmp;
Pool PD1, PD2;Pool PD1, PD2;poolcreate(&PD1, sizeof(list), 8);poolcreate(&PD1, sizeof(list), 8);poolcreate(&PD2, sizeof(list), 8);poolcreate(&PD2, sizeof(list), 8);
splitclonesplitclone((&PD1, &PD2,&PD1, &PD2, L, &A, &B); L, &A, &B);
processPortionprocessPortion(A); // Process first list(A); // Process first listprocessPortionprocessPortion(B); // Process second list(B); // Process second list// free A list// free A listwhile (A) {tmp = A->Next; while (A) {tmp = A->Next; poolfreepoolfree((&PD1&PD1, A); A = tmp; }, A); A = tmp; }// free B list// free B listwhile (B) {tmp = B->Next; while (B) {tmp = B->Next; poolfreepoolfree((&PD2&PD2, B); B = tmp; }, B); B = tmp; }
pooldestroy(&PD1);pooldestroy(&PD1);pooldestroy(&PD2);pooldestroy(&PD2);
}}
P1 P2
More Algorithm Details
Indirect function call handling:
Partition functions into equivalence classes: if F1 and F2 have a common call-site, they are in the same class
Merge points-to graphs for each equivalence class
Apply the previous transformation unchanged
Global variables pointing to memory nodes:
Use a global pool variable rather than passing them around through function args
More Algorithm Details
poolcreate / pooldestroy placement:
Move calls earlier/later by analyzing the pool's lifetime
Reduces memory usage
Enables poolfree elimination
poolfree elimination:
Eliminate unnecessary poolfree calls: safe when there are no allocations between the poolfree and pooldestroy
Behaves like static garbage collection
Example – poolcreate/pooldestroy placement

Before:

void processlist(list *L) {
  list *A, *B, *tmp;
  Pool PD1, PD2;
  poolcreate(&PD1, sizeof(list), 8);
  poolcreate(&PD2, sizeof(list), 8);
  splitclone(&PD1, &PD2, L, &A, &B);
  processPortion(A); // Process first list
  processPortion(B); // Process second list
  // free A list
  while (A) { tmp = A->Next; poolfree(&PD1, A); A = tmp; }
  // free B list
  while (B) { tmp = B->Next; poolfree(&PD2, B); B = tmp; }
  pooldestroy(&PD1);
  pooldestroy(&PD2);
}

After (pooldestroy(&PD1) moved up to the end of PD1's lifetime):

void processlist(list *L) {
  list *A, *B, *tmp;
  Pool PD1, PD2;
  poolcreate(&PD1, sizeof(list), 8);
  poolcreate(&PD2, sizeof(list), 8);
  splitclone(&PD1, &PD2, L, &A, &B);
  processPortion(A); // Process first list
  processPortion(B); // Process second list
  // free A list
  while (A) { tmp = A->Next; poolfree(&PD1, A); A = tmp; }
  pooldestroy(&PD1);
  // free B list
  while (B) { tmp = B->Next; poolfree(&PD2, B); B = tmp; }
  pooldestroy(&PD2);
}
Example – poolfree Elimination

Before:

void processlist(list *L) {
  list *A, *B, *tmp;
  Pool PD1, PD2;
  poolcreate(&PD1, sizeof(list), 8);
  poolcreate(&PD2, sizeof(list), 8);
  splitclone(&PD1, &PD2, L, &A, &B);
  processPortion(A); // Process first list
  processPortion(B); // Process second list
  // free A list
  while (A) { tmp = A->Next; poolfree(&PD1, A); A = tmp; }
  pooldestroy(&PD1);
  // free B list
  while (B) { tmp = B->Next; poolfree(&PD2, B); B = tmp; }
  pooldestroy(&PD2);
}

After poolfree elimination (pooldestroy reclaims all the memory anyway):

void processlist(list *L) {
  list *A, *B, *tmp;
  Pool PD1, PD2;
  poolcreate(&PD1, sizeof(list), 8);
  poolcreate(&PD2, sizeof(list), 8);
  splitclone(&PD1, &PD2, L, &A, &B);
  processPortion(A); // Process first list
  processPortion(B); // Process second list
  // free A list
  while (A) { tmp = A->Next; A = tmp; }
  pooldestroy(&PD1);
  // free B list
  while (B) { tmp = B->Next; B = tmp; }
  pooldestroy(&PD2);
}

After the now-dead list traversals are removed:

void processlist(list *L) {
  list *A, *B, *tmp;
  Pool PD1, PD2;
  poolcreate(&PD1, sizeof(list), 8);
  poolcreate(&PD2, sizeof(list), 8);
  splitclone(&PD1, &PD2, L, &A, &B);
  processPortion(A); // Process first list
  processPortion(B); // Process second list
  pooldestroy(&PD1);
  pooldestroy(&PD2);
}
Outline
Introduction & Motivation
Automatic Pool Allocation Transformation
Pool Allocation-Based Optimizations
Pool Allocation & Optimization Performance Impact
Conclusion
PAOpts (1/4) and (2/4)
Selective Pool Allocation
Don't pool allocate when not profitable: avoids creating and destroying a pool descriptor (minor), and avoids significant wasted space when the object is much smaller than the smallest internal page
PoolFree Elimination
Remove explicit deallocations that are not needed
Looking closely: Anatomy of a heap
Fully general malloc-compatible allocator:
Supports malloc/free/realloc/memalign, etc.
Standard malloc overheads: object header, alignment
Allocates slabs of memory with exponential growth
By default, all returned pointers are 8-byte aligned
In memory, things look like this (16-byte allocs):
[Figure: consecutive 16-byte user-data objects, each preceded by a 4-byte object header and 4-byte padding for user-data alignment, shown against one 32-byte cache line]
PAOpts (3/4): Bump Pointer Optzn
If a pool has no poolfree's:
Eliminate the per-object header
Eliminate freelist overhead (faster object allocation)
Eliminates 4 bytes of inter-object padding
Pack objects more densely in the cache
Interacts with poolfree elimination (PAOpt 2/4)! If poolfree elimination deletes all frees, BumpPtr can apply
[Figure: 16-byte user-data objects packed back-to-back; two objects now fit in one 32-byte cache line]
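A header-free bump-pointer pool might look like the sketch below. The `_bp` names echo the interface slide, but the body is an illustrative assumption, not the actual runtime: with no per-object header and no free list, allocation is a pointer increment, and memory can only be reclaimed all at once by pooldestroy_bp.

```c
#include <stdlib.h>

typedef struct BPSlab { struct BPSlab *next; } BPSlab;

typedef struct BPool {
    BPSlab *slabs;   /* chain of slabs, bulk-freed at pool destruction */
    char   *bump;    /* next free byte in the current slab */
    size_t  avail;   /* bytes remaining in the current slab */
} BPool;

void poolinit_bp(BPool *P) {
    P->slabs = NULL;
    P->bump  = NULL;
    P->avail = 0;
}

void *poolalloc_bp(BPool *P, size_t n) {
    /* A real implementation would round n up to preserve alignment. */
    if (P->avail < n) {
        size_t sz = n > 4096 ? n : 4096;   /* slab size: arbitrary choice */
        BPSlab *S = malloc(sizeof(BPSlab) + sz);
        S->next  = P->slabs;
        P->slabs = S;
        P->bump  = (char *)(S + 1);
        P->avail = sz;
    }
    void *obj = P->bump;   /* allocation is just a pointer increment */
    P->bump  += n;
    P->avail -= n;
    return obj;
}

void pooldestroy_bp(BPool *P) {
    while (P->slabs) {     /* bulk-free every slab */
        BPSlab *S = P->slabs;
        P->slabs = S->next;
        free(S);
    }
    P->bump  = NULL;
    P->avail = 0;
}
```

Because consecutive allocations are adjacent, allocation order directly becomes memory layout, which is what can make list traversals look like dense-array scans.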
PAOpts (4/4): Alignment Analysis
Malloc must return 8-byte aligned memory: it has no idea what types will be used in the memory. Some machines bus-error, and others suffer performance problems, on unaligned memory.
Type-safe pools infer a type for the pool:
Use 4-byte alignment for pools we know don't need more
Reduces inter-object padding
[Figure: 16-byte user-data objects separated only by a 4-byte object header, packed against 32-byte cache lines]
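A quick worked example of the padding arithmetic: each object's footprint is its size rounded up to the pool's alignment, so dropping from 8-byte to 4-byte alignment saves 4 bytes on, e.g., a 12-byte object. (The `padded` helper is just for illustration.)

```c
#include <stddef.h>

/* Footprint of one object when every object must start on an
   align-byte boundary (align must be a power of two). */
static size_t padded(size_t objsize, size_t align) {
    return (objsize + align - 1) & ~(align - 1);
}
/* padded(12, 8) == 16: 8-byte alignment wastes 4 bytes per object.
   padded(12, 4) == 12: 4-byte alignment wastes none. */
```

Over a pool of thousands of small objects, those saved bytes translate directly into more objects per cache line.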
Outline
Introduction & Motivation
Automatic Pool Allocation Transformation
Pool Allocation-Based Optimizations
Pool Allocation & Optimization Performance Impact
Conclusion
Implementation & Infrastructure
Link-time transformation using the LLVM Compiler Infrastructure
Uses the LLVM-to-C back-end; the resulting code is compiled with GCC 3.4.2 -O3
Evaluated on an AMD Athlon MP 2100+ (64KB L1, 256KB L2)
Simple Pool Allocation Statistics
Programs from the SPEC CINT2K, Ptrdist, FreeBench & Olden suites, plus unbundled programs
Table 1
Simple Pool Allocation Statistics
DSA is able to infer that most static pools are type-homogeneous
Table 1
Compile Time
Compilation overhead is less than 3%
Table 3
Pool Allocation Speedup
[Bar chart: runtime ratio (smaller is faster) for 175.vpr, 197.parser-b, 300.twolf, bc, ft, analyzer, llu-bench, chomp, fpgrowth, espresso, povray31, bisort, health, mst, perimeter, tsp]
Several programs unaffected by pool allocation (see paper)
Sizable speedup across many pointer-intensive programs
Some programs (ft, chomp) an order of magnitude faster
Most programs are 0% to 20% faster with pool allocation alone
Two are 10x faster; one is almost 2x faster
Pool Optimization Speedup (FullPA)
[Bar chart: runtime ratio (1.0 = base pool allocation) for the same benchmarks, comparing No Pool Allocation vs. All Optimizations]
Baseline 1.0 = run time with pool allocation
Optimizations help all of these programs: despite being very simple, they make a big impact
Most are 5-15% faster with optimizations than with pool allocation alone
Figure 9 (with different baseline)
One is 44% faster; another is 29% faster
Pool optimizations help some programs that pool allocation itself doesn't
The pool optimizations' effect can be additive with the pool allocation effect
Cache/TLB miss reduction
[Figure 10: L1, L2, and TLB cache miss ratios for each benchmark]
Sources: defragmented heap, reduced inter-object padding, segregating the heap!
Miss rates measured with perfctr on an AMD Athlon 2100+
Pool Optimization Statistics
Table 2
Optimization Contribution
Figure 11
Pool Allocation Conclusions
1. Segregate the heap based on the points-to graph
Improved memory hierarchy performance
Gives the compiler some control over layout
Gives the compiler information about locality
2. Optimize pools based on per-pool properties
Very simple (but useful) optimizations proposed here
Optimizations could be applied to other systems
The End

Backup Slides
Table 4
Table 5
list *makeList(int Num) {
  list *New = malloc(sizeof(list));
  New->Next = Num ? makeList(Num-1) : 0;
  New->Data = Num;
  return New;
}

int twoLists( ) {
  list *X = makeList(10);
  list *Y = makeList(100);
  GL = Y;
  processList(X);
  processList(Y);
  freeList(X);
  freeList(Y);
}
Pool Allocation: Example
[Animation overlays the transformation on the code above: twoLists declares pools and calls poolinit(&P1) (likewise P2) and pooldestroy at the end of each pool's lifetime; makeList gains a Pool* parameter and calls poolalloc(P); call sites pass &P1 or P2 as appropriate.]
Change calls to free into calls to poolfree (retain explicit deallocation)
Pool Specific Optimizations
Different data structures have different properties
Pool allocation segregates the heap, roughly into logical data structures
Optimize using pool-specific properties
Examples of properties we look for:
Pool is type-homogeneous
Pool contains data that only requires 4-byte alignment
Opportunities to reduce allocation overhead
[Figure: one pool's build/traverse/destroy lifetime vs. a complex allocation pattern; DS-graph list nodes ("list: HMR", fields list*, int) with head pointers]
Benchmarks
Pointer-intensive SPECINT 2000, Ptrdist, Olden, and FreeBench suites
povray, espresso, fpgrowth, llu-bench, chomp
Benchmarks with custom allocators are not evaluated, except for parser, which they hand-modified