Post on 16-Jul-2015
Memory Unmanglement With Perl
How to do what you do without getting hit in the memory.
Steven Lembark
Workhorse Computing
In Our Last Episode...
● We saw our hero battling the forces of ram-bloat in long-running, heavily-forked, or large-scale processes.
● Learned the golden rule: Nothing Shrinks.
● Observed memory benchmarks using Devel::Peek, Devel::Size, and perl -d.
  ● peek() shows the structure & hash efficiency.
  ● size() & total_size() show memory usage.
Time vs. Space
● The classic tradeoff is handled in favor of time in the perl implementations.
● More efficient data structures can help both sides.
● Avoiding wasted space can help avoid thrashing, heap management, and system call overhead.
● Faster access for arrays can make them more compact and faster than hashes in some situations.
● Benchmarks are not only for time: include checks of size(), total_size(), and peek() to see what is really going on.
Nothing Ever Shrinks
● perl maintains strings and arrays as pointers to memory allocations.
  ● Adjusting the size of a scalar with substr or a regex changes its start and length.
  ● shift and pop adjust an array's initial offset and count.
● None of these will reduce the memory overhead of the 'scaffolding' perl uses to manage the data.
Look Deep Into Your Memory
● Devel::Peek
  ● peek() at the structure.
  ● Shows efficiency of hashing.
● Devel::Size
  ● size() shows memory usage of “scaffolding”.
  ● total_size() includes contents along with skeleton.
● size() can be useful in loops for managing size of recycled buffers.
Size & Structure
● Scalars
  ● Reference allocations for strings with offset & length.
  ● size() of the scalar is small, total_size() can be large.
● Arrays
  ● Allocated list of Scalars, also with offset & length.
  ● size() reports space for list, total_size() includes contents.
● Hashes
  ● Hash chains are an array of arrays with min. 8 chains.
  ● size() reports space for hash chains.
Taming the Beast
● There are tools for managing the memory, most of which involve some sort of time/space tradeoff.
  ● undef can help – probably less than you think.
  ● You can manage the lifetime of variables with lexical or local values.
  ● Recycling buffers localizes the bloat to one structure.
  ● Adapting your code to use more effective data structures offers the best solution for large data.
● Here are some ideas.
undef() is somewhat helpful
● Marks the variable for reclamation.
  ● Space may not be immediately reclaimed – up to perl whether to add heap or recycle the undefed variables.
● Structures are discarded, not reduced.
  ● This can have a significant performance overhead on nested, reused data structures.
● Tradeoff: space saved vs. time spent rebuilding the skeleton of discarded structures.
● Most useful for recycling single-level structures.
undefing an Array Doesn't Zero It
● For a large, nested structure this may not save the amount of space you expect.
    use Devel::Size qw( size total_size );
    my @a = ();
    $#a = 999_999;
    print "Size \@a:\t", size( \@a ), "\n";
    undef @a;
    print "Size \@a:\t", size( \@a ), "\n";
● The contents are discarded & reallocated:
    Full @a: 4000200
    Post @a: 100
Recycling Buffers
● Use size() to discard and reallocate the buffer if it grows too large.
● Preallocate to avoid margin-of-error added by perl when the initial allocation grows.
● Decent tradeoff between reallocating a buffer frequently and having it grow without bounds.
● Avoids one record botching the entire processing cycle.
Scalar Buffer
● Recycle buffer, clean it up, then copy by value.
● Easiest with scalars since they don't have any nested structure.
    while( $buffer = get_data )
    {
        $buffer =~ s/^\s+//;
        ...
        push @data, $buffer;
        if( size( $buffer ) > $max_buff )
        {
            undef $buffer;
            $buffer = ' ' x $max_buff;
        }
    }
Array Buffer
● This works well for single-level buffers; multilevel buffers often require too much work to rebuild.
    my @buff = ();
    $#buff = $buff_count;
    while( @buff = get_data )
    {
        ...                             # clean up buffer
        $data{ $key } = [ @buff ];      # store values
        if( size( \@buff ) > $buff_max )
        {
            undef @buff;
            $#buff = $buff_count;
        }
    }
Assign Arrays Single-Pass
● Say you have to store a large number of items:
    my @a = ();
    my @b = ();
    push @a, "" for( 1 .. 1_000_000 );
    @b = map { "" } ( 1 .. 1_000_000 );
    print 'Size of @a: ', size( \@a ), "\n";
    print 'Size of @b: ', size( \@b ), "\n";
● Push ends up with a larger structure:
    Size of @a: 4194388
    Size of @b: 4000100
Hashes are Huge
● Incremental assignment doesn't make hashes larger: they are 8x larger than arrays in both cases.
    my %a = ();
    my %b = ();
    $a{ $_ } = "" for ( 1 .. 1_000_000 );
    %b = map { $_ => "" } ( 1 .. 1_000_000 );
    print 'Size of %a: ', size( \%a ), "\n";
    print 'Size of %b: ', size( \%b ), "\n";

    Size of %a: 32083244    # vs. 4000100
    Size of %b: 32083244    # in an array!
Two Ways of Storing Nothing
● There are two common ways of storing nothing in the values of a hash:
  ● Assign an empty list: $hash{ $key } = ();
  ● Assign an empty string: $hash{ $key } = "";
● Question:
Which would take less space: empty list or empty string?
TMTOWTDN
    my %a = ();
    my %b = ();
    $a{ $_ } = () for( 'aaa' .. 'zzz' );
    $b{ $_ } = '' for( 'aaa' .. 'zzz' );
    print "Size of %a:\t", size( \%a ), "\n";
    print "Size of %b:\t", size( \%b ), "\n";
● size() gives the same result for both values. Why?
    Size of %a: 570516    # same size for "" & ()?
    Size of %b: 570516
TMTOWTDN
    my %a = ();
    my %b = ();
    $a{ $_ } = () for( 'aaa' .. 'zzz' );
    $b{ $_ } = '' for( 'aaa' .. 'zzz' );
    print "Size of %a:\t", size( \%a ), "\n";
    print "Size of %b:\t", size( \%b ), "\n";
    print "Total in %a:\t", total_size( \%a ), "\n";
    print "Total in %b:\t", total_size( \%b ), "\n";

    Size of %a: 570516    # size() doesn't always
    Size of %b: 570516    # matter!
● total_size() benchmarks the values:
    Total in %a: 851732
    Total in %b: 1203252
Replace Hashes With Arrays
● The smartmatch operator (“~~”) is fast.
● Pushing onto an array:
    $a ~~ @uniq or push @uniq, $a
  uses about 1/8 the space of assigning hash keys:
    $uniq{ $a } = ();
    ...
    keys %uniq
● The extra space used by array growth in push is dwarfed by the savings of an array over a hash.
● sort @uniq is much faster than sort keys %uniq.
Example: Taxonomy Trees
● The NCBI Taxonomy is delivered with each entry having a full tree.
● These must be reduced to a single tree for data entry and validation.
● There are several ways to do this...
Worst Solution: Parent tree.
● Since the tree is often used from the bottom up, some people store it as a child:parent relationship:
$parentz{ $child_id } = $parent_id;
● Unfortunately, this allocates a full hash table for each 1:1 relationship between a child and parent.
Another Bad Solution: Child Tree
● Another alternative is storing the children in a hash for each parent:
$childz{ $parent_id }{ $child_id } = ();
$childz{ '' } = [ $root_id ];
● This works via depth-first search to generate the trees and has space to store the tree depth.
● Hashes are bulky and slow for storing a single-level structure like this.
Another Solution: Single-Level Hash
● One oft-forgotten bit of Perly lore in the age of references: multipart hash keys.
$childz{ $parent_id, $child_id } = $depth;
$childz{ "" } = [ $root_id ];
● Trades wasted space in thousands of anon hashes for split /$;/o, $key and grep's.
● Usable for moderate trees.
● Obviously painful for really large trees.
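A minimal sketch of the multipart-key idea. The edge data and the children_of() helper are made up for illustration; the only thing taken from perl itself is that a list in a hash subscript gets joined with $; into one flat key.

```perl
use strict;
use warnings;

# Hypothetical sample data: [ parent_id, child_id, depth ].
my @edges =
(
    [ 1, 2, 1 ],
    [ 1, 3, 1 ],
    [ 2, 4, 2 ],
);

my %childz = ();

# The list in the subscript is joined with $; (default "\034")
# into a single key, so no nested hashes are allocated.
$childz{ $_->[0], $_->[1] } = $_->[2] for @edges;

# The price: finding the children of one parent means
# splitting and grepping every key in the hash.
sub children_of
{
    my $parent_id = shift;

    return
    sort { $a <=> $b }
    map  { $_->[1] }
    grep { $_->[0] == $parent_id }
    map  { [ split /$;/, $_ ] }
    keys %childz;
}

my @kids = children_of( 1 );

print "Children of 1: @kids\n";
```

The full scan in children_of() is the "painful for really large trees" part: every lookup walks all of the keys.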
Q: Why Nest Hashes?
● Hashes are nice for the top-level lookup, but why nest them?
● Arrays save about 85% of the overhead below the top level.
● Any wasted space from the arrays growing via push is more than saved by avoiding hashes.
● The arrays only need to be sorted once if the tree is used multiple times.
my $c = $childz{ $parent_id } ||= [];
$new_id ~~ $c or push @$c, $new_id;
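A sketch of the same hash-of-arrays layout, assuming numeric ids. Smartmatch has been experimental since perl 5.18 and deprecated since 5.38, so this version swaps in any() from List::Util for the membership test; add_child() and the sample pairs are hypothetical.

```perl
use strict;
use warnings;
use List::Util qw( any );

my %childz = ();

# Hash for the top-level lookup, plain array below it.
sub add_child
{
    my ( $parent_id, $new_id ) = @_;

    my $c = $childz{ $parent_id } ||= [];

    # Stand-in for "$new_id ~~ $c": push only if not already present.
    any { $_ == $new_id } @$c
    or push @$c, $new_id;

    return;
}

# Hypothetical ( parent, child ) pairs, with one duplicate.
add_child( @$_ ) for [ 1, 3 ], [ 1, 2 ], [ 1, 3 ], [ 2, 4 ];

# Sort each list once, after loading, rather than on every use.
@$_ = sort { $a <=> $b } @$_ for values %childz;

print "$_: @{ $childz{ $_ } }\n" for sort { $a <=> $b } keys %childz;
```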
Nested Lists
● List::Util has first() which saves grepping entire lists.
● A key and payload on an array can be handled quickly.
    first { $_->[0] eq $key } @data;
● For shorter lists this saves space and can be faster than a hash.
● This is best for numerics, which don't have to be converted to text in order to be hashed: $_->[0] == $value is the least amount of work to compare integers.
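A small sketch of the key-and-payload lookup with numeric keys. The @data contents and the lookup() wrapper are invented for the example; first() is the real List::Util function and stops at the first match.

```perl
use strict;
use warnings;
use List::Util qw( first );

# Each entry is [ key, payload ].
my @data =
(
    [ 10, 'ten'    ],
    [ 20, 'twenty' ],
    [ 30, 'thirty' ],
);

sub lookup
{
    my $value = shift;

    # Numeric compare: no stringifying or hashing of the key.
    my $found = first { $_->[0] == $value } @data;

    return $found ? $found->[1] : undef;
}

print "20 => ", lookup( 20 ), "\n";
```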
Manage Lifespans
● Lexical variables are an obvious place.
● Local values are another.
  ● Saves reallocating a set of values within tight loops in the called code.
  ● Local hash keys are a good way to manage storage in reused hashes handled with recursion.
● Use delete to remove hash keys in multilevel structures instead of assigning an empty list or “”.
  ● This preserves the skeleton for recycling.
  ● Saves storing the keys.
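One way the local-hash-key and delete ideas might look in practice. The tree, the walk() routine, and the %cache layout are all hypothetical; the point is only that a localized key disappears when the call returns, and that delete drops a key while leaving the enclosing structure in place.

```perl
use strict;
use warnings;

# Hypothetical tree: node name => list of child names.
my %tree =
(
    root => [ qw( a b ) ],
    a    => [ qw( c ) ],
    b    => [],
    c    => [],
);

my %on_path = ();   # one hash reused by every level of the recursion
my @depths  = ();

sub walk
{
    my $node = shift;

    # The key lives only until this call returns; perl removes it
    # again, so the hash never holds more keys than the tree depth.
    local $on_path{ $node } = 1;

    push @depths, $node . '=' . scalar keys %on_path;

    walk( $_ ) for @{ $tree{ $node } };

    return;
}

walk( 'root' );

print "@depths\n";

# delete removes the key and its value but keeps the outer skeleton.
my %cache = ( big => { x => [ 1 .. 1000 ] }, small => { y => 1 } );

delete $cache{ big }{ x };
```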
Use Simpler Objects
● If you're using inside-out objects, why bless a hash?
  ● Users aren't supposed to diddle around inside your objects anyway.
  ● The only thing you care about is the address.
● Bless something smaller:
my $obj = bless \(my $a), $package;
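A minimal inside-out sketch built on a blessed scalar. The Counter class and its methods are invented for illustration; refaddr() comes from the core Scalar::Util module and supplies the address used as the storage key.

```perl
package Counter;
use strict;
use warnings;
use Scalar::Util qw( refaddr );

# Inside-out storage: the data lives here, keyed by object address.
my %count = ();

sub new
{
    my $package = shift;

    # The object is just a blessed reference to an empty scalar.
    my $obj = bless \( my $anon ), $package;

    $count{ refaddr( $obj ) } = 0;

    return $obj;
}

sub bump  { return ++$count{ refaddr( $_[0] ) } }
sub value { return   $count{ refaddr( $_[0] ) } }

# Inside-out objects have to clean up their own storage.
sub DESTROY { delete $count{ refaddr( $_[0] ) }; return }

package main;

my $obj = Counter->new;

$obj->bump for 1 .. 3;

print "Count: ", $obj->value, "\n";
```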
Use Linked Lists for Queues
● Automatically frees discarded nodes without having to modify the entire list.
● Based on an array they don't use much extra data:
    $node = [ $ref_to_next, @node_data ];
● Walking the list is simple enough:
    ( $node, my @data ) = @$node;
● So is removing a node:
    $node->[0] = $node->[0][0];
● These are quite convenient for threading.
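The pieces above can be put together into a small queue. This is a sketch only: enqueue(), dequeue(), the empty placeholder head node, and the job data are assumptions made for the example.

```perl
use strict;
use warnings;

# Each node is [ $ref_to_next, @node_data ]. $head is an empty
# placeholder so removing the first node needs no special case.
my $head = [];
my $tail = $head;

sub enqueue
{
    my $node = [ undef, @_ ];

    $tail->[0] = $node;
    $tail      = $node;

    return;
}

sub dequeue
{
    my $node = $head->[0]
    or return;

    $head->[0] = $node->[0];        # i.e. $head->[0] = $head->[0][0]
    $head->[0] or $tail = $head;    # queue is empty again

    my ( undef, @data ) = @$node;

    # $node goes out of scope here and perl frees it.
    return @data;
}

enqueue( $_, "job-$_" ) for 1 .. 3;

my @first = dequeue();

# Walk whatever is left without modifying the list.
my @left = ();
my $node = $head->[0];

while( $node )
{
    ( $node, my @data ) = @$node;
    push @left, $data[1];
}

print "First: @first\n";
print "Left:  @left\n";
```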
Use Hashes for Sparse Arrays
● OK, time to stop beating up on hashes.
● They beat out arrays for sparse lists.
● Even lists of integers.
● Say a collection of DNA runs from 15 to 10_000 bases, filling about 10% of the actual values.
● You could store it as:
    $dnaz[ $length ] = [ qw( dna dna dna ) ];
● But this is probably better stored in a hash:
    $dnaz{ $length } = [ qw( dna dna dna ) ];
Accessing Hash Keys: Integer Slices
● Numeric sequences work fine as hash keys.
● Say you want to find all of the sequences within +/- 10% of the current length:
    my $min = 0.9 * $length;
    my $max = 1.1 * $length;
    my @found = grep { $_ } @dnaz{ ( $min .. $max ) };
● For nontrivial, sparse lists this saves scaffolding by only storing the structure necessary.
● This doesn't change the data storage, just the overhead for accessing it by length.
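A runnable sketch of the integer-slice lookup with made-up sequences. The range operator truncates its endpoints to integers; the explicit int() calls here only make that visible. Reading a hash slice does not create the missing keys, so the hash stays sparse.

```perl
use strict;
use warnings;

# Hypothetical data: sequence length => list of sequences.
my %dnaz =
(
    90  => [ qw( acgt ) ],
    100 => [ qw( ccgg ttaa ) ],
    108 => [ qw( gatc ) ],
    150 => [ qw( tttt ) ],
);

my $length = 100;
my $min    = int( 0.9 * $length );
my $max    = int( 1.1 * $length );

# Lengths with no entry come back as undef and are dropped by grep.
my @found = grep { $_ } @dnaz{ $min .. $max };

my @seqs = map { @$_ } @found;

print "Within 10% of $length: @seqs\n";
```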
Store Upper-triangular Comparisons
● Saves more than half the space.
● Accessor can look for $i > $j ? [$i][$j] : [$j][$i] and get the same results.
● Requires designing symmetric comparison algorithms (values can be returned as-is or just negated).
● Also saves about half the processing time to only generate a single comparison for each pair.
● Requires access to the algorithm.
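A sketch of the one-triangle storage using the accessor shape from the slide. The compare() function (absolute difference) and the item list are stand-ins for a real symmetric comparison.

```perl
use strict;
use warnings;

# Hypothetical symmetric comparison: compare( a, b ) == compare( b, a ).
sub compare { return abs( $_[0] - $_[1] ) }

my @items = ( 3, 8, 1, 5 );
my @cmp   = ();

# Compute and store each pair once, larger index first.
for my $i ( 0 .. $#items )
{
    for my $j ( 0 .. $i )
    {
        $cmp[ $i ][ $j ] = compare( $items[ $i ], $items[ $j ] );
    }
}

# The accessor swaps the indexes when asked for the missing half.
sub stored
{
    my ( $i, $j ) = @_;

    return $i > $j ? $cmp[ $i ][ $j ] : $cmp[ $j ][ $i ];
}

print "0 vs 1: ", stored( 0, 1 ), "\n";
print "1 vs 0: ", stored( 1, 0 ), "\n";
```

For comparisons that change sign when the arguments are swapped, the accessor would negate the stored value in the swapped case instead of returning it as-is.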
Example: DNA Analysis
● Our W-curve analysis is used to compare large groups of DNA to one another.
● The original algorithm compared the curves until the first one was exhausted.
● Changing that to use the longer sequence in all cases saved us over half the comparison time.
Summary
● Devel::Size can be useful in your code.
● Managing the lifespan of values helps.
● Using efficient structures helps even more.
● Use arrays instead of hash structures where they make sense.
● Bless smaller structures: scalars, regexen, globs make perfectly good objects and take less space than hashes.
● Use XS or Inline where necessary.
● And, yes, size() still matters.