Optimizing Memory Allocation in C++ · 2018. 11. 23. · Each line corresponds to a cache line (64...
basics detection Fix
Optimizing Memory Allocation in C++
Sebastien Ponce
CERN
February 5th 2018
S. Ponce Optimizing Memory Allocation in C++ 1 / 41
Context
We are spending (far) too much time allocating and deallocating memory
Initially 25% of the total HLT1 time !
Why ?
Main issue
We are allocating too many small bits
We should allocate large chunks
Source of the problem
Our object model, full of containers of pointers
Plus our bad coding, not reserving the space
The solution
What to do
use containers of objects
and references, move semantics, emplace, ... so that things are never copied
reserve the full size of your container at creation
Why is it hard ?
Lots of code to be adapted
Non-trivial C++ concepts at work
Outline
1 Basics of memory allocation
2 Detect suboptimal allocations
3 How to improve
Change containers
Adapt creation code
Adapt insertion code
Adapt read accesses
Foreword
the concepts and techniques presented here are generic
they apply to basically all containers
for simplicity, I’ll show them on vectors and maps
Basics of memory allocation
Basic container in memory
Simple vector case
std::vector<int> v;
x0 x1 x2 x3 x4 x5 x6 x7 x8 x9 ...
Vector of objects
struct A { float x, y, z; };
std::vector<A> v;
[ x0 y0 z0 ][ x1 y1 z1 ][ x2 y2 z2 ] ...
     A0          A1          A2
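The layout drawn above can be checked directly. The following is a minimal sketch, assuming the struct A from the slide; the helper names stored_back_to_back and byte_offset are made up for illustration. It verifies that each element of a std::vector<A> starts exactly sizeof(A) bytes after the previous one.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct A { float x, y, z; };

// Returns true when every element sits exactly sizeof(A) bytes after the
// previous one, i.e. the objects form one packed block without gaps.
bool stored_back_to_back(const std::vector<A>& v) {
    for (std::size_t i = 1; i < v.size(); ++i) {
        const char* prev = reinterpret_cast<const char*>(&v[i - 1]);
        const char* curr = reinterpret_cast<const char*>(&v[i]);
        if (curr - prev != static_cast<std::ptrdiff_t>(sizeof(A))) return false;
    }
    return true;
}

// Byte offset of element i from the start of the vector's buffer.
std::size_t byte_offset(const std::vector<A>& v, std::size_t i) {
    assert(i < v.size());
    return static_cast<std::size_t>(reinterpret_cast<const char*>(&v[i]) -
                                    reinterpret_cast<const char*>(v.data()));
}
```

For a std::vector<A> v(10), stored_back_to_back(v) holds and byte_offset(v, 3) equals 3 * sizeof(A), which is what makes the whole content readable as one block.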
Container of pointers
Naive view
struct A { float x, y, z; };
std::vector<A*> v;
ptr0 ptr1 ptr2 ptr3 ptr4 ptr5 ptr6 ptr7 ptr8 ptr9 ...
Realistic view
ptr0 ptr1 ptr2 ptr3 ptr4 ptr5 ptr6 ptr7 ptr8 ptr9 ...
[Figure: each pointer refers to a separately allocated object (x0 y0 z0, x1 y1 z1, ..., x9 y9 z9), and these objects are scattered across memory]
Consequences : memory allocations
Number of allocations
vector<A> -> optimally 1 allocation
vector<A*> -> minimum n+1 allocations
What is an allocation ?
finding an empty piece of memory
going through the free lists/maps held by the memory allocator, which in turn asks the Linux kernel for more memory when it runs out
and often taking a lock to make it thread safe
So allocations are costly !
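The two allocation counts can be observed with a counting allocator. This is only a sketch: CountingAllocator, g_allocations and the two functions are hypothetical helpers, not part of the original slides, and the per-object allocation through the allocator stands in for a plain `new A`. The exact counts assume a standard library that allocates nothing beyond the element buffer, as libstdc++ and libc++ do.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <new>
#include <vector>

struct A { float x, y, z; };

// Global counter bumped on every allocation request (hypothetical helper).
static std::size_t g_allocations = 0;

// Minimal allocator that counts requests and forwards to std::allocator.
template <class T>
struct CountingAllocator {
    using value_type = T;
    CountingAllocator() = default;
    template <class U>
    CountingAllocator(const CountingAllocator<U>&) noexcept {}
    T* allocate(std::size_t n) {
        ++g_allocations;
        return std::allocator<T>{}.allocate(n);
    }
    void deallocate(T* p, std::size_t n) noexcept {
        std::allocator<T>{}.deallocate(p, n);
    }
};
template <class T, class U>
bool operator==(const CountingAllocator<T>&, const CountingAllocator<U>&) noexcept { return true; }
template <class T, class U>
bool operator!=(const CountingAllocator<T>&, const CountingAllocator<U>&) noexcept { return false; }

// vector<A>: one buffer holds all n objects.
std::size_t allocations_with_objects(std::size_t n) {
    const std::size_t before = g_allocations;
    std::vector<A, CountingAllocator<A>> v;
    v.reserve(n);
    for (std::size_t i = 0; i < n; ++i) v.push_back(A{1.f, 2.f, 3.f});
    assert(v.size() == n);
    return g_allocations - before;
}

// vector<A*>: one buffer for the pointers plus one allocation per object.
std::size_t allocations_with_pointers(std::size_t n) {
    const std::size_t before = g_allocations;
    CountingAllocator<A> object_alloc;  // stands in for `new A`
    std::vector<A*, CountingAllocator<A*>> v;
    v.reserve(n);
    for (std::size_t i = 0; i < n; ++i) {
        A* p = object_alloc.allocate(1);
        ::new (static_cast<void*>(p)) A{1.f, 2.f, 3.f};
        v.push_back(p);
    }
    const std::size_t used = g_allocations - before;
    for (A* p : v) object_alloc.deallocate(p, 1);  // A is trivially destructible
    return used;
}
```

With n = 100, the first function should report 1 allocation and the second 101, matching the 1 versus n+1 figures above.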
More consequences : reading data
Memory view for vector<A>
Each line corresponds to a cache line (64 bytes, 16 floats)
0x0000  x0 y0 z0 x1 y1 z1 x2 y2 z2 x3 y3 z3 x4 y4 z4 x5
0x0040  y5 z5 x6 y6 z6 x7 y7 z7 x8 y8 z8 x9 y9 z9 .  .
0x0080  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .
0x00C0  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .
One read from RAM to Level 1 Cache is enough (2 lines in one go)
More consequences : reading data
Memory view for vector<A*>
Each line corresponds to a cache line (64 bytes, 16 floats)
[Figure: cache lines 0x0000 to 0x0240. The first line holds the pointers p0 p1 p2 p3 p4 p5 p6 p7 p8 p9; the ten A objects are scattered over the following lines, at most two per line: (x0 y0 z0, x1 y1 z1), (x2 y2 z2), (x3 y3 z3, x4 y4 z4), (x5 y5 z5, x6 y6 z6), (x7 y7 z7), (x8 y8 z8, x9 y9 z9)]
You need to read many lines, in several accesses
Remember a RAM access is about 100 cycles
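To make the difference between the two read patterns concrete, here is a small sketch of the same traversal over both layouts. The function name sum_x is an assumption chosen for illustration; the code only shows where the extra indirection happens and does not measure cache behaviour.

```cpp
#include <cassert>
#include <vector>

struct A { float x, y, z; };

// Contiguous layout: consecutive iterations touch consecutive addresses,
// so one fetched cache line serves several elements.
float sum_x(const std::vector<A>& v) {
    float total = 0.f;
    for (const A& a : v) total += a.x;
    return total;
}

// Pointer layout: each iteration first loads the pointer, then follows it
// to wherever the object happens to live, possibly on another cache line.
float sum_x(const std::vector<A*>& v) {
    float total = 0.f;
    for (const A* a : v) {
        assert(a != nullptr);
        total += a->x;
    }
    return total;
}
```

Both overloads compute the same result; only the memory access pattern differs, which is what a profiler would report as cache misses.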
Detect suboptimal allocations
The main tools
vtune
uses internal processor counters to estimate what is going on in real execution
and in particular the (estimated) number of cycles spent in memory allocations
as well as the cache misses
best used in opt mode
callgrind
simulates a processor, which allows counting what is going on
and in particular the (estimated) number of cycles spent in memory allocations
as well as the cache misses
best used in dbg mode
vtune in practice
allow vtune to be found in your environment :
source /cvmfs/projects.cern.ch/intelsw/psxe/linux/18-all-setup.sh
run your program with a command line like this :
amplxe-cl -collect hotspots -start-paused -- \
./Brunel/run gaudirun.py MiniBrunelHLT1fast.py
-start-paused lets vtune start only when needed, but requires the option mbrunel.IntelProfile = True in the MiniBrunel config file
use enough events, typically >10000 for MiniBrunel HLT1 only
you will get a directory called r000hs
visualize results with
amplxe-gui r000hs
vtune in practice
you get a summary with hotspots, among them new
go to bottom-up tab
vtune in practice
new is indeed in the top CPU consumers
click on the triangle on the left to see who calls it
vtune in practice
many culprits ! Choose yours and go to the caller-callee tab
find it on the right (Ctrl-f) and click on it
vtune in practice
I’ve chosen FTMeasurementProvider::measurement
one can see where it's called, and where it spends its time
vtune in practice
And the culprit is...
unsigned int HLT1Fitter::fit(LHCb::Track& ...) {
...
// Store results of the Kalman fit
std::vector<LHCb::Measurement*> measurements;
...
}
Callgrind in practice
run your program with a command line like this :
./Brunel/run \
/cvmfs/lhcbdev.cern.ch/tools/valgrind/3.12.0/x86_64-centos7/bin/valgrind \
--tool=callgrind --instr-atstart=no --dump-instr=yes --cache-sim=yes \
python $(./Brunel/run which gaudirun.py) MiniBrunelHLT1fast.py
--instr-atstart=no lets callgrind start only when needed, but requires the option
mbrunel.CallgrindProfile = True
in the MiniBrunel config file
use only a few events, typically <1000
you will get 2 files, callgrind.out.<tid> and callgrind.out.<tid>.1
open the ’.1’ one with kcachegrind
callgrind in practice
you get a list of functions and time spent in each
search for new on top left and click on operator new
look at the bottom right panel for who calls new
callgrind in practice
one can double click on any item to see their callers/callees
one can display source code on the upper right panel
let’s look at FTMeasurementProvider::measurement
callgrind in practice
show where new is called
one can check the call graph
callgrind in practice
all comes from the Fitter, where you find the wrong container
unsigned int HLT1Fitter::fit(LHCb::Track& ...) const {
// Store results of the Kalman fit
std::vector<LHCb::Measurement*> measurements;
How to improve
Step 1 : change container to container of objects
prefer standard containers
in particular, stay away from KeyedContainer
be aware of the variety of containers and their specificities
e.g. do you know of flat_map or small_vector ?
practically, you often only need to drop a * in the container definition. In our case :
std::vector<LHCb::Measurement*> measurements;
becomes
std::vector<LHCb::Measurement> measurements;
Step 1 : possible impact
in case the container is a member of a class, you can then probably get rid of quite some code
destructor
copy/move constructors
assignment operators
all will now be defaulted, while before they needed to release the content of the container
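The code that disappears can be illustrated with a before/after pair. This is a sketch under assumptions: Measurement, TrackWithPointers and TrackWithObjects are simplified, hypothetical stand-ins and not the actual LHCb classes.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct Measurement { double value; };  // hypothetical stand-in

// Before: the class owns raw pointers, so it has to manage them by hand.
class TrackWithPointers {
public:
    TrackWithPointers() = default;
    TrackWithPointers(const TrackWithPointers& other) {
        m_measurements.reserve(other.m_measurements.size());
        for (const Measurement* m : other.m_measurements)
            m_measurements.push_back(new Measurement(*m));
    }
    TrackWithPointers& operator=(TrackWithPointers other) {  // copy-and-swap
        m_measurements.swap(other.m_measurements);
        return *this;
    }
    ~TrackWithPointers() {
        for (Measurement* m : m_measurements) delete m;
    }
    void add(double value) { m_measurements.push_back(new Measurement{value}); }
    std::size_t size() const { return m_measurements.size(); }
    double at(std::size_t i) const { assert(i < size()); return m_measurements[i]->value; }
private:
    std::vector<Measurement*> m_measurements;
};

// After: the vector owns the objects, so the compiler-generated destructor,
// copy/move constructors and assignment operators are already correct.
class TrackWithObjects {
public:
    void add(double value) { m_measurements.push_back(Measurement{value}); }
    std::size_t size() const { return m_measurements.size(); }
    double at(std::size_t i) const { assert(i < size()); return m_measurements[i].value; }
private:
    std::vector<Measurement> m_measurements;
};
```

Both classes behave the same when copied; the second one simply has nothing left to get wrong.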
Step 2 : deal with container creation
should be trivial, using default
that’s ok for static containers where the size is fixed
std::array typically
not enough for growing containers
including std::vector, std::map, ...
How does a vector grow ?
struct A { float x, y, z; };
std::vector<A> v;
Construction
Default constructor creates an empty vector, with no storage
(start, finish, end of storage) = (0x0, 0x0, 0x0)
First push
allocates storage for the first element only !
(start, finish, end of storage) = (0x1234, 0x1240, 0x1240)
storage: x0 y0 z0
How does a vector grow ?
Second push
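old storage, (start, finish, end of storage) = (0x1234, 0x1240, 0x1240): x0 y0 z0
new storage, (start, finish, end of storage) = (0x5678, 0x5684, 0x5684): x0 y0 z0 x1 y1 z1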
1 allocate new piece of memory for 2 items
2 copy existing content
3 write new content
4 update pointers
5 Deallocate original piece of memory
How does a vector grow ?
Third push
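old storage, (start, finish, end of storage) = (0x5678, 0x5684, 0x5684): x0 y0 z0 x1 y1 z1
new storage, (start, finish, end of storage) = (0x9ABC, 0x9AC8, 0x9ACC): x0 y0 z0 x1 y1 z1 x2 y2 z2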
1 allocate new piece of memory for 4 items
double size at each iteration (the usual libstdc++ behaviour; the growth factor is implementation-defined)
2 copy existing content
3 write new content
4 update pointers
5 Deallocate original piece of memory
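The growth steps listed above can be observed from the outside through capacity(). A minimal sketch, assuming the struct A from the slides; capacity_steps is a hypothetical helper name.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct A { float x, y, z; };

// Pushes n elements one by one and records the capacity every time it
// changes, i.e. every time the vector had to reallocate its storage.
std::vector<std::size_t> capacity_steps(std::size_t n) {
    std::vector<std::size_t> steps;
    std::vector<A> v;
    std::size_t last = v.capacity();
    for (std::size_t i = 0; i < n; ++i) {
        v.push_back(A{0.f, 0.f, 0.f});
        if (v.capacity() != last) {
            last = v.capacity();
            steps.push_back(last);
        }
    }
    assert(v.size() == n);
    return steps;
}
```

On an implementation that starts at one element and doubles, capacity_steps(1000) would contain 1, 2, 4, ..., 1024; other standard libraries may use a different growth factor and therefore a different sequence.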
How to make proper allocation for vectors
Why avoid the default ?
content of vectors is reallocated and copied when they grow
the first item of a 1000-element vector will be copied 10 times !
when reaching 1000 items, you will have copied 1023 items in total and allocated 11 pieces of memory, releasing 10
Solution
you can avoid all that thanks to reserve
std::vector<int> v;
v.reserve(1000);
ensures single allocation, no copies, no reallocations
(start, finish, end of storage) = (0x1234, 0x1234, 0x1dec)
storage: ... (reserved, still empty)
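The effect of reserve can be checked by counting capacity changes while filling the vector. This is a sketch only; capacity_changes is a hypothetical helper, and each capacity change is taken as a proxy for one allocation of a new buffer.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Counts how many times the capacity changes while n ints are appended,
// optionally calling reserve(n) first. The reserve itself counts as one.
std::size_t capacity_changes(std::size_t n, bool reserve_first) {
    std::vector<int> v;
    std::size_t last = v.capacity();
    std::size_t changes = 0;
    if (reserve_first) {
        v.reserve(n);
        if (v.capacity() != last) {
            last = v.capacity();
            ++changes;
        }
    }
    for (std::size_t i = 0; i < n; ++i) {
        v.push_back(static_cast<int>(i));
        if (v.capacity() != last) {
            last = v.capacity();
            ++changes;
        }
    }
    assert(v.size() == n);
    return changes;
}
```

With reserve_first set, a single change is expected for n = 1000; without it, the count depends on the growth factor of the standard library in use (11 when the capacity starts at 1 and doubles).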
Step 3 : deal with insertions
What happens by default ?
1 std::vector<A> v;
2 v.reserve(10);
3 A tmp{args};
4 v.push_back(tmp);
What actually happens :
allocate space in the vector (line 2)
allocate space for the temporary A object (line 3)
call A constructor (line 3)
call copy constructor for A (line 4)
deallocate temporary A (end of scope)
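The sequence above can be made visible with an instrumented type. The following sketch uses a hypothetical struct Tracked in place of A, with counters for each kind of construction; the names Tracked, Counts and push_back_a_named_object are illustrative assumptions.

```cpp
#include <cassert>
#include <vector>

// Instrumented stand-in for A that counts how it gets constructed.
struct Tracked {
    static int constructed, copied, moved;
    float x, y, z;
    Tracked(float x_, float y_, float z_) : x(x_), y(y_), z(z_) { ++constructed; }
    Tracked(const Tracked& o) : x(o.x), y(o.y), z(o.z) { ++copied; }
    Tracked(Tracked&& o) noexcept : x(o.x), y(o.y), z(o.z) { ++moved; }
};
int Tracked::constructed = 0;
int Tracked::copied = 0;
int Tracked::moved = 0;

struct Counts { int constructed, copied, moved; };

// Same four lines as on the slide, followed by a readout of the counters.
Counts push_back_a_named_object() {
    Tracked::constructed = Tracked::copied = Tracked::moved = 0;
    std::vector<Tracked> v;
    v.reserve(10);
    Tracked tmp{1.f, 2.f, 3.f};
    v.push_back(tmp);  // tmp is an lvalue, so the copy constructor runs
    assert(v.size() == 1);
    return {Tracked::constructed, Tracked::copied, Tracked::moved};
}
```

The counters should come back as one regular construction, one copy and no move, which is the extra work the following slides try to remove.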
Ok, but we have move
What is the default ?
std::vector<A> v;
v.reserve(10);
A tmp{args};
v.push_back(std::move(tmp));
What actually happens :
allocate space in the vector (line 2)
allocate space for the temporary A object (line 3)
call A constructor (line 3)
call move constructor for A (line 4)
can be much better than copy, e.g. for vectors
can be identical to copy, e.g. for plain objects
deallocate temporary A (end of scope)
We would like to completely avoid the temporary object
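To illustrate why a move can be much cheaper than a copy for vector-like members, here is a minimal sketch. `Payload` is a hypothetical type that owns a heap buffer; the check relies on the usual behaviour of `std::vector`'s move constructor with the default allocator, which takes over the source's buffer instead of duplicating it.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical payload type (not from the slides): owns a heap buffer
// through a std::vector member and relies on the implicit move constructor.
struct Payload {
    std::vector<double> samples;
    explicit Payload(std::size_t n) : samples(n, 0.0) {}
};

// Returns true when the element stored in the vector reuses tmp's heap
// buffer, i.e. the 1000 doubles were not copied.
bool move_reuses_buffer() {
    std::vector<Payload> v;
    v.reserve(10);
    Payload tmp(1000);
    const double* buffer = tmp.samples.data();
    v.push_back(std::move(tmp));   // move constructor: steals the buffer
    return v.back().samples.data() == buffer;
}
```

The temporary `tmp` still exists and is still destroyed at the end of the scope; only the cost of transferring its content has gone down.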
Proper solution for vectors
In place construction
std::vector<A> v;
v.reserve(10);
v.emplace_back(args);
What actually happens :
allocate space in the vector
call constructor for A
using args as the constructor arguments
using the space allocated in the vector
For the record, this is using variadic templates, new in C++11
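A quick way to convince oneself that no temporary is involved is to count constructor calls. The sketch below uses a hypothetical instrumented `A` with a two-argument constructor; the values `7` and `1.5` stand in for `args`.

```cpp
#include <vector>

// Hypothetical stand-in for the slide's class A, counting regular
// constructions separately from copy/move constructions.
struct A {
    static int constructions;
    static int copies_or_moves;
    int id;
    double weight;
    A(int i, double w) : id(i), weight(w) { ++constructions; }
    A(const A& o) : id(o.id), weight(o.weight) { ++copies_or_moves; }
    A(A&& o) noexcept : id(o.id), weight(o.weight) { ++copies_or_moves; }
};
int A::constructions = 0;
int A::copies_or_moves = 0;

void emplace_back_in_place() {
    std::vector<A> v;
    v.reserve(10);            // space already available, so no reallocation
    v.emplace_back(7, 1.5);   // arguments are forwarded to A(int, double),
                              // which runs directly in the vector's storage
}
```

After one call, `A::constructions` is 1 and `A::copies_or_moves` stays at 0.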
In place construction and maps
Naïve code
1 std::map<int,A> m;
2 std::pair<int,A> item(5, A(args));
3 m.insert(item);
Problems :
move (or copy, if A has no move constructor) of the temporary A into the pair on line 2
copy on insertion on line 3
With emplace
std::map<int,A> m;
m.emplace(5, A(args));
Problem :
the pair is constructed in place, not A
still a move/copy for A
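Both variants can be compared by counting constructor calls. This is a sketch under assumptions: `A` is a hypothetical instrumented stand-in, `42` replaces `args`, and the exact number of copies in the naïve version may differ between standard library implementations, so the comments only state lower bounds.

```cpp
#include <map>
#include <utility>

// Hypothetical stand-in for the slide's class A with call counters.
struct A {
    static int constructions;
    static int copies;
    static int moves;
    int value;
    explicit A(int v) : value(v) { ++constructions; }
    A(const A& o) : value(o.value) { ++copies; }
    A(A&& o) noexcept : value(o.value) { ++moves; }
    static void reset() { constructions = copies = moves = 0; }
};
int A::constructions = 0;
int A::copies = 0;
int A::moves = 0;

// Naive version: build a pair, then insert it.
void naive_insert() {
    std::map<int, A> m;
    std::pair<int, A> item(5, A(42));  // temporary A transferred into the pair
    m.insert(item);                    // item is an lvalue: A is copied into the map
}

// emplace with a temporary A: the pair is built in place, A is not.
void emplace_temporary() {
    std::map<int, A> m;
    m.emplace(5, A(42));               // temporary A still has to be moved into the pair
}
```

In both cases `A` is constructed once and then transferred at least once more, which is what the next step removes.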
In place construction and maps
piecewise_construct
To solve this problem, std::pair has a dedicated constructor
this constructor takes a std::piecewise_construct tag followed by 2 tuples, holding the arguments to construct the key and the value
std::map<int,A> m;
m.emplace(std::piecewise_construct,
          std::make_tuple(5),
          std::make_tuple(args));
so A is now constructed in place, in the pair inside the map
Is that optimal?
close, but not quite
we are now copying args into the tuple!
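The in-place construction of `A` can be checked with the same counting approach. In this sketch `A` is a hypothetical instrumented type and `42` stands in for `args`; since the argument is a plain `int`, copying it into the tuple costs nothing noticeable here.

```cpp
#include <map>
#include <tuple>
#include <utility>

// Hypothetical stand-in for the slide's class A with call counters.
struct A {
    static int constructions;
    static int copies_or_moves;
    int value;
    explicit A(int v) : value(v) { ++constructions; }
    A(const A& o) : value(o.value) { ++copies_or_moves; }
    A(A&& o) noexcept : value(o.value) { ++copies_or_moves; }
};
int A::constructions = 0;
int A::copies_or_moves = 0;

// Returns the value stored under key 5 after a piecewise emplace.
int emplace_piecewise() {
    std::map<int, A> m;
    m.emplace(std::piecewise_construct,
              std::make_tuple(5),     // arguments for the key
              std::make_tuple(42));   // arguments for A's constructor
    return m.at(5).value;
}
```

After one call, `A::constructions` is 1 and `A::copies_or_moves` is 0: no `A` object exists outside the map.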
In place construction and maps
piecewise_construct + forward_as_tuple
this can be solved by using a tuple of references
std::map<int,A> m;
m.emplace(std::piecewise_construct,
          std::make_tuple(5),
          std::forward_as_tuple(args));
std::forward_as_tuple creates a tuple of references
this avoids copying args twice
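The difference between the two tuple helpers only shows up when the arguments are expensive to copy. The sketch below assumes a hypothetical argument type `Arg` that counts its own copies and moves, and a hypothetical `A` that is built from it; neither comes from the slides.

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <tuple>
#include <utility>

// Hypothetical argument type that counts its own copies and moves.
struct Arg {
    static int copies;
    static int moves;
    std::string text;
    explicit Arg(std::string t) : text(std::move(t)) {}
    Arg(const Arg& o) : text(o.text) { ++copies; }
    Arg(Arg&& o) noexcept : text(std::move(o.text)) { ++moves; }
};
int Arg::copies = 0;
int Arg::moves = 0;

// Hypothetical value type: built from an Arg, keeps only its length.
struct A {
    std::size_t length;
    explicit A(const Arg& arg) : length(arg.text.size()) {}
};

// make_tuple stores a copy of arg inside the tuple.
std::size_t emplace_with_make_tuple(const Arg& arg) {
    std::map<int, A> m;
    m.emplace(std::piecewise_construct,
              std::make_tuple(5),
              std::make_tuple(arg));
    return m.at(5).length;
}

// forward_as_tuple stores only a reference to arg.
std::size_t emplace_with_forward_as_tuple(const Arg& arg) {
    std::map<int, A> m;
    m.emplace(std::piecewise_construct,
              std::make_tuple(5),
              std::forward_as_tuple(arg));
    return m.at(5).length;
}
```

With `forward_as_tuple`, `Arg::copies` and `Arg::moves` both stay at 0; with `make_tuple`, at least one copy of `Arg` is made before `A` is constructed.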
Step 3 : deal with accesses
The easy part
std::vector<A*> v;
void bar(const A* a);
...
v[n]->foo();
bar(v[n]);
becomes
std::vector<A> v;
void bar(const A& a); // now const is real !
...
v[n].foo();
bar(v[n]);
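One possible reading of the "now const is real" remark is that constness does not propagate through pointers stored in a container, whereas it does reach objects stored by value. The sketch below illustrates that reading; `A`, `touch_through_pointers` and `total_hits` are hypothetical names introduced for the example.

```cpp
#include <vector>

// Hypothetical stand-in for the slide's class A.
struct A {
    int hits = 0;
    void foo() { ++hits; }   // non-const member: modifies the object
};

// Compiles even though the vector is const: only the pointers are const,
// the A objects they point to can still be modified.
void touch_through_pointers(const std::vector<A*>& v) {
    for (A* a : v) {
        a->foo();
    }
}

// With a container of objects, constness reaches the elements:
// calling the non-const foo() on `a` here would be a compile error.
int total_hits(const std::vector<A>& v) {
    int total = 0;
    for (const A& a : v) {
        total += a.hits;
    }
    return total;
}
```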
Conclusion
Memory allocations and deallocations are not cheap
Optimizing them will lead to a ∼20% gain in HLT1
New code should take that into account
keywords are:
container of objects
reserve
emplace
references