Reducing Garbage Collector Cache Misses Shachar Rubinstein Garbage Collection Seminar.

Posted: 20-Dec-2015

Page 1:

Reducing Garbage Collector Cache Misses

Shachar Rubinstein

Garbage Collection Seminar

Page 2:

The End!

Page 3:

The general problem

CPUs are getting faster and faster.
Main memory speed lags behind.
Result: the cost of accessing main memory is increasing.

Page 4:

Solutions

Hardware and software techniques:
– Memory hierarchy
– Prefetching
– Multithreading
– Non-blocking caches
– Dynamic instruction scheduling
– Speculative execution

Page 5:

Great Solutions?

Complex hardware and compilers.
Ineffective for many programs.
They attack the manifestation (= memory latency) and not the source (= poor reference locality).

Not exactly…

Page 6:

Previous work

Improving cache locality in dense matrices using loop transformations.
Other profile-driven, compiler-directed approaches.

Page 7:

The GC problem

Little temporal locality: each live object is usually read only once during the mark phase.
Most reads are likely to miss.
The new cache contents are unlikely to be used more than once.

Page 8:

The GC problem – cont.

The sweep phase, like the mark phase, also touches each object once.
That's because the free-list pointers are maintained in the objects themselves.
Unlike the mark phase, the sweep phase is more sequential.

Page 9:

The GC problem – cont.

The sweep is less likely to use cache contents left by the marker.
The allocator is likely to miss again when the object is reallocated.

Page 10:

The GC problem - previous work

Older work concentrated on paging performance.
Growing memory sizes led to abandoning this goal.
But larger memories also brought huge cache-miss penalties.
The largest cache is still smaller than the heap, so this problem is unavoidable.

Page 11:

Previous work

Reducing sweep time for a nearly empty heap

Compiler-based prefetching for recursive data structures

Page 12:

How am I going to improve the situation?

Do some magic! Well, no…
Use real-time information to improve the program's cache locality.
The mark and sweep phases offer invaluable opportunities for improvement:
– Bring objects into the cache earlier
– Reuse freed objects for reallocation

Page 13:

Some numbers

Relative to a copying GC:
– Cache miss rates reduced by 21-42%
– Program performance improved by 14-37%

Relative to a page-level GC:
– Cache miss rates reduced by 20-41%
– Program performance improved by 18-31%

Page 14:

Road map

Cache-conscious data placement using generational GC:
– Overview
– Short generational GC reminder
– Real-time data profiling
– Object affinity graph
– Combining the affinity graph with GC
– Experimental evaluation

Other methods and their experimental results

Page 15:

Overview

A program is instrumented to profile its access patterns.
The data is used in the same execution, not the next one.
The data is turned into an object affinity graph.
A new copying algorithm uses the graph to lay out the data while copying.

Page 16:

Generational GC – A reminder

The heap is divided into generations.
GC activity concentrates on young objects, which typically die faster.
Objects that survive one or more scavenges are moved to the next generation.

Page 17:

Implementation notes

The authors used the UM GC toolkit.
The toolkit supports several steps per generation; the authors used a single step per generation for simplicity.
Each step consists of fixed-size blocks.
The blocks are not necessarily contiguous in memory.

Page 18:

Implementation notes - steps

Page 19:

Implementation notes - steps

The steps are used to encode the objects' age.
An object which survives a scavenge is moved to the next step.

Page 20:

Implementation notes – moving between generations

The scavenger collects a generation g and all younger generations.
It starts from objects that are:
– In g
– Reachable from the roots
Moving an object means copying it into the TO space.
The FROM space can then be reused.

Page 21:

Copying algorithm – a reminder

Cheney's algorithm:
– TO and FROM spaces are switched
– Starts from the root set
– Objects are traversed breadth-first using a queue
– Objects are copied to the TO space
– Terminates when the queue is empty
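The reminder above can be sketched in a few lines of Python (an illustrative model, not the toolkit's code: `Obj`, the list-based TO space, and the `scan` index are stand-ins for real semispace pointers):

```python
class Obj:
    def __init__(self, *fields):
        self.fields = list(fields)   # references to other Obj (or None)
        self.forward = None          # forwarding address, set once copied

def cheney_collect(roots):
    """Copy everything reachable from roots into a fresh TO space.
    The region between the 'scan' index and the end of to_space plays
    the role of Cheney's queue. Updates the roots list in place."""
    to_space = []                    # the TO semispace

    def copy(obj):
        if obj.forward is None:      # not copied yet
            clone = Obj(*obj.fields)
            obj.forward = clone      # leave forwarding address in FROM space
            to_space.append(clone)   # bump the free pointer
        return obj.forward

    roots[:] = [copy(r) for r in roots]
    scan = 0
    while scan < len(to_space):      # queue is empty when scan catches free
        o = to_space[scan]
        o.fields = [copy(f) if f is not None else None for f in o.fields]
        scan += 1
    return to_space
```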

Page 22:

Copying algorithm – the queue trick

Page 23:

The algorithm

Page 24:

Did you get it?

Page 25:

Real time data profiling

An earlier program run's profile is not good enough.
Real-time data eliminates:
– A separate profiling run
– Finding inputs
But the overhead must be low!

Great!

Page 26:

Profiling data access patterns

Trace every load and store to the heap?
Huge overhead (a factor of 10!)

Page 27:

Reducing overhead

Use properties of object-oriented programs:

1. Most objects are small, often less than 32 bytes
– No need to distinguish between fields, since cache blocks are bigger

Page 28:

Reducing overhead – cont.

2. Most object accesses are not lightweight
– Profiling instrumentation will not incur a large overhead

Don't believe it? Stay awake.

Page 29:

Collecting profiling data

Instrument “load”s of base object addresses.
Uses a modified compiler.
The compiler retains object type information for selective loads.

Page 30:

Code instrumentation

Page 31:

Collecting profiling data - cont

The base object address is entered into an object access buffer.

Page 32:

Implementation note

Uses a page trap to detect buffer overflow.
A trap causes the graph to be built.
Recommended buffer size: 15,000 entries (60KB).

Page 33:

Affinity?

Main Entry: af·fin·i·ty
Pronunciation: &-'fi-n&-tE
Function: noun
Inflected Form(s): plural -ties
Etymology: Middle English affinite, from Middle French or Latin; Middle French afinité, from Latin affinitas, from affinis bordering on, related by marriage, from ad- + finis end, border
Date: 14th century
1 : relationship by marriage
2 a : sympathy marked by community of interest : KINSHIP b (1) : an attraction to or liking for something <people with an affinity to darkness -- Mark Twain> <pork and fennel have a natural affinity for each other -- Abby Mandel> (2) : an attractive force between substances or particles that causes them to enter into and remain in chemical combination c : a person especially of the opposite sex having a particular attraction for one
3 a : likeness based on relationship or causal connection <found an affinity between the teller of a tale and the craftsman -- Mary McCarthy> <this investigation, with affinities to a case history, a psychoanalysis, a detective story -- Oliver Sacks> b : a relation between biological groups involving resemblance in structural plan and indicating a common origin

Page 34:

The object affinity graph

Page 35:

The object affinity graph

Nodes – objects
Edges – temporal affinity between objects
An undirected graph

Page 36:

Building the graph

Page 37:

Inserting an object to the queue

Page 38:

Incrementing edges’ weight

Page 39:

All clear?

Page 40:

Demonstration

[Figures, pages 40-49: the object access buffer holds the trace A, B, A, C, D, D, A. Each entry is consumed in turn: edge weights between the accessed object and the objects currently in the locality queue are incremented in the affinity graph, and the accessed object joins the queue tail. By the last step the graph connects A, B, C and D with the accumulated temporal-affinity weights.]

Page 50:

Implementation notes

A separate affinity graph is built for each generation, except the first.
This uses the fact that an object's generation is encoded in its address.
This method prevents placing objects from different generations in the same cache block. (Explanation later on.)

Page 51:

Implementation notes – queue size

The locality queue size is important:
Too small -> missed temporal relationships
Too big -> a huge graph and long processing time
Recommended size: 3.
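The profiling mechanism above (object access buffer feeding a bounded locality queue) can be sketched as follows. The exact update rule is an assumption reconstructed from the slides' demonstration, and `build_affinity_graph` is a hypothetical name:

```python
from collections import defaultdict

def build_affinity_graph(access_buffer, queue_size=3):
    """Build an undirected object affinity graph from a trace of base-object
    addresses. For every access, the edge weight between the accessed object
    and each *other* object currently in the bounded locality queue is
    incremented; then the object joins the queue tail. queue_size=3 follows
    the recommendation in the slides."""
    weights = defaultdict(int)       # frozenset({x, y}) -> edge weight
    queue = []                       # recent objects, oldest first
    for obj in access_buffer:
        for other in queue:
            if other != obj:         # undirected graph, no self edges
                weights[frozenset((obj, other))] += 1
        if obj in queue:
            queue.remove(obj)        # refresh its position in the queue
        queue.append(obj)
        if len(queue) > queue_size:
            queue.pop(0)             # evict the oldest entry
    return dict(weights)
```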

Page 52:

Implementation notes

Re-create or update the graph? Depends on the application:
– Applications with distinct access phases should re-create
– Uniform behavior should update
In this article: re-create before each scavenge.

Page 53:

Stop!

Our goal: produce a cache-conscious data layout, so that related objects are likely to reside in the same cache block.
In English: place objects with high temporal affinity next to each other.
The method: use the profiling information we've collected in the copying process.

Page 54:

GC + Real-time profiling

Use the object affinity graph in the Copying algorithm.

Page 55:

Example – object affinity graph

Page 56:

Example – before step 1

Page 57:

Step 1 – using the graph

Flip the TO and FROM roles.
Initialize the free and unprocessed pointers to the beginning of the TO space.
Pick a node that is:
– In the root set
– and
– In the affinity graph, with the highest edge weight
Perform a greedy DFS on the graph.

Page 58:

Step 1 – cont.

Copy each visited object to the TO space.
Increment the free pointer.
Store a forwarding address in the FROM space.

Page 59:

Example – After step 1

Page 60:

Step 2 – continues Cheney’s way

Process all objects between the unprocessed and the free pointers, as in Cheney’s algorithm

Page 61:

Example – After step 2

Page 62:

Step 3 - cleanup

Ensure all roots are in the TO space.
If not, process them using Cheney's algorithm.
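Step 1's greedy traversal can be sketched like this (a Python illustration; the graph encoding and the name `affinity_copy_order` are assumptions, and steps 2-3 would then continue with plain Cheney scanning):

```python
def affinity_copy_order(graph, roots):
    """Greedy DFS over the object affinity graph: start from the root with
    the heaviest incident edge, and always follow the heaviest edge to an
    unvisited neighbour first. Returns the order in which objects would be
    copied to the TO space. `graph` maps frozenset({x, y}) -> weight."""
    adj = {}                          # node -> list of (weight, neighbour)
    for edge, w in graph.items():
        x, y = tuple(edge)
        adj.setdefault(x, []).append((w, y))
        adj.setdefault(y, []).append((w, x))

    def heaviest_edge(n):
        return max((w for w, _ in adj.get(n, [])), default=0)

    start = max((r for r in roots if r in adj), key=heaviest_edge, default=None)
    if start is None:
        return []
    order, visited, stack = [], set(), [start]
    while stack:                      # the object access buffer can double
        n = stack.pop()               # as this DFS stack in the collector
        if n in visited:
            continue
        visited.add(n)
        order.append(n)               # i.e. copy n to the TO space now
        # push light edges first so the heaviest neighbour is popped next
        for w, m in sorted(adj.get(n, [])):
            if m not in visited:
                stack.append(m)
    return order
```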

Page 63:

Example – After step 3

Page 64:

Implementation notes

The object access buffer can be used as a stack for the DFS

Page 65:

Inaccurate results(?)

The object affinity graph may retain objects that are no longer reachable (= garbage).
They will be incorrectly promoted at most once.
Efforts are focused on longer-lived objects, not on the youngest generation.

Page 66:

Experimental evaluation

Methodology – if we have the time
Object-oriented programs manipulate small objects
Real-time data profiling overhead
The algorithm's impact on performance

Page 67:

Size of heap objects

Page 68:

But that’s not the point!

Small objects often die fast

Page 69:

Surviving heap objects

Page 70:

Real-time data profiling overhead

Page 71:

Overall execution time

Page 72:

Overall execution time - notes

No impact on the L1 cache, because its blocks are only 16B.

Page 73:

Compared to WLM algorithm

Page 74:

Comparison notes

WLM (Wilson-Lam-Moher) improves a program's virtual memory locality.
It performed worse than or close to Cheney's because the machine's 2GB of memory eliminated paging.

Page 75:

What else?

Page 76:

Other methods

Two methods that can be used together with the previous one:
– Prefetch on grey
– Lazy sweeping

Page 77:

Assumptions

A non-moving mark-sweep collector.
For simplicity, the collector segregates objects by size: each block contains objects of a single size.
The collector's data structures are outside the user-visible heap.
A mark bit is reserved for each word in the block.

Page 78:

Advantages of “outside the heap” data

The mark phase does not need to examine (= bring into the cache) pointer-free objects.
Sequences of small unreachable objects can be reclaimed as a group:
– A single instruction is enough to examine their sequence of mark bits
– This is used when a heap block turns out to be empty
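The "single instruction" group reclaim can be modeled by treating each block's mark bits as one word (illustrative Python; in the real collector this is a machine-word compare against zero):

```python
def classify_blocks(mark_words):
    """Because mark bits live outside the heap, a whole block of small dead
    objects can be reclaimed by examining one word of mark bits: if the word
    is zero, nothing in the block survived marking. Returns two lists of
    block indices: wholly free blocks, and blocks that still need sweeping."""
    free, to_sweep = [], []
    for i, word in enumerate(mark_words):
        if word == 0:
            free.append(i)       # reclaim without touching the block itself
        else:
            to_sweep.append(i)   # some objects are live; sweep this one later
    return free, to_sweep
```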

Page 79:

The mark phase – a reminder

Ensure that all objects are white.
Grey all objects pointed to by a root.
While there is a grey object g:
– Blacken g
– For each pointer p in g: if p points to a white object, grey that object.

Page 80:

The mark phase – colors

1 mark bit:
– 0 is white
– 1 is grey/black
Stack:
– In the stack – grey
– Removed from the stack – black

Page 81:

The mark GC problem

A significant fraction of time is spent retrieving the first pointer p from each grey object.
About a third of the marker's execution time goes this way.
This time is expected to increase on future machines.

Page 82:

Prefetching

A modern CPU instruction.
A program can prefetch data into the cache for future use.

Page 83:

Prefetching – cont.

But the object reference must be predicted soon enough.
For example, if the object is in main memory, it must be prefetched hundreds of cycles before its use.
Prefetch instructions are mostly inserted by compiler optimizations.

Page 84:

Prefetch on grey

When? Prefetch as soon as p is found to be a likely pointer.
What? Prefetch the first cache line of the object.

Page 85:

To improve performance

The last pointer to be pushed on the mark stack is prefetched first.
This minimizes the cases in which a just-greyed object is immediately examined.

Page 86:

And to improve more

Prefetch a few cache lines ahead when scanning an object.
This helps with large objects, and prefetches additional objects when the object isn't that large.
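A sketch of the marking loop with prefetch on grey (a Python stand-in: `prefetch` models a hardware prefetch instruction such as GCC's `__builtin_prefetch` and here only records the order in which cache lines would be requested; the lines-ahead scanning prefetch is not modeled):

```python
class Node:
    def __init__(self, *fields):
        self.fields = list(fields)   # references to other Node (or None)

def mark_with_prefetch(roots, prefetch):
    """Tri-colour marking with prefetch on grey. An object on the mark stack
    is grey; once popped and scanned it is black. A prefetch is issued the
    moment an object is greyed, long before the marker scans it."""
    stack, marked = [], set()        # marked: the single grey/black bit
    for r in reversed(roots):
        if r is not None and r not in marked:
            marked.add(r)
            prefetch(r)              # prefetch the object's first cache line
            stack.append(r)
    while stack:
        obj = stack.pop()            # obj turns black
        children = []
        for c in obj.fields:
            if c is not None and c not in marked:
                marked.add(c)        # white -> grey
                prefetch(c)          # prefetch on grey, in discovery order
                children.append(c)
        for c in reversed(children):
            stack.append(c)          # the child popped next (pushed last)
                                     # was prefetched first: maximum lead time
    return marked
```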

Page 87:

The sweep GC problem

If (reclaimed memory > cache size):
– Objects are likely to be evicted from the cache by the allocator or mutator
Thus, the allocator will miss again when reusing the reclaimed memory.

Page 88:

Lazy sweeping

Originally used to reduce page faults.
Delay the sweeping until the allocator needs the memory.
Memory will then be reused instead of evicted from the cache.

Page 89:

A reminder

A mark bit is kept for each word in a block.
A mark bit is used only if its word is the beginning of an object.

Page 90:

Cache lazy sweeping – the collector

Scans each block's mark bits.
If all bits are unmarked, the block is added to the free-block pool without touching it.
If some bits are marked, the block is added to a queue of blocks waiting to be swept.
There are several queues, one or more for each object size.

Page 91:

Cache lazy sweeping – the allocator

Maps the request to the appropriate object free list.
Returns the first object from the list.
If the list is empty:
– It sweeps the queue of the right size for a block with some available objects
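The collector/allocator split above can be sketched as one small class (an assumed shape for illustration, not the collector's actual code):

```python
from collections import defaultdict

class LazyAllocator:
    """Cache lazy sweeping: wholly empty blocks go straight to a free-block
    pool; partially marked blocks wait, unswept, in a per-size queue. A block
    is swept only when the allocator's free list for that size runs dry, so
    freshly reclaimed memory is reused while it may still be in the cache."""

    def __init__(self):
        self.free_blocks = []                 # empty blocks, never touched
        self.free_lists = defaultdict(list)   # object size -> free slots
        self.unswept = defaultdict(list)      # object size -> blocks to sweep

    def add_block(self, size, slots, mark_bits):
        """Called by the collector after marking."""
        if not any(mark_bits):                # the single-word zero check
            self.free_blocks.append(slots)
        else:
            self.unswept[size].append((slots, mark_bits))

    def allocate(self, size):
        """Sweep lazily: one queued block, only when the free list is empty."""
        if not self.free_lists[size] and self.unswept[size]:
            slots, marks = self.unswept[size].pop(0)
            self.free_lists[size].extend(
                s for s, m in zip(slots, marks) if not m)   # dead slots only
        if self.free_lists[size]:
            return self.free_lists[size].pop()
        return None   # would carve a fresh block from free_blocks here
```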

Page 92:

Experimental results

Measured on two platforms.
The second platform provides some calibration on architecture variation.

Page 93:

Pentium III/500 results

Page 94:

HP PA-8000/180 based results

Page 95:

Results conclusions

Prefetch on grey eliminates from a third to almost all of the cache-miss overhead in the marker.
But the benefit depends on the data structures used in the program.

Page 96:

Results conclusions – cont.

Collector performance is determined by the marker.
Sweep performance is architecture dependent.

Page 97:

Conclusions

Be concerned about cache locality, or have a method that does it for you.

Page 98:

Conclusions – cont.

Real-time data profiling is feasible.
A cache-conscious data layout can be produced using that information.
It may help reduce the performance gap between high-level and low-level languages.

Page 99:

Conclusions – cont.

Prefetch on grey and lazy sweeping are cheap to implement and should be in future garbage collectors

Page 100:

Bibliography

Using Generational Garbage Collection To Implement Cache-Conscious Data Placement - Trishul M. Chilimbi and James R. Larus

Reducing Garbage Collector Cache Misses - Hans-J. Boehm

Page 101:

Further reading

Look at the articles.
Garbage Collection: Algorithms for Automatic Dynamic Memory Management – Richard Jones & Rafael Lins

Page 102:

Further reading – cont.

Cecil:
– Craig Chambers. “Object-oriented multi-methods in Cecil.” In Proceedings ECOOP’92, LNCS 615, Springer-Verlag, pages 33–56, June 1992.
– Craig Chambers. “The Cecil language: Specification and rationale.” University of Washington Seattle, Technical Report TR-93-03-05, Mar. 1993.

Hyperion by Dan Simmons

Page 103:

Page 104:

Items

Large objects
Inter-generational object placement
Why explicitly build free lists?
Experimental methodology
Second experimental methodology

Page 105:

Large objects

Ungar and Jackson:
– There is an advantage in not copying large objects (>= 256 bytes) of the same age
A large object is never copied.
Each step has an associated set of large objects.

Page 106:

Large objects – cont.

A large object is linked into a doubly linked list.
If it survives a collection, it is removed from its list and inserted into the TO space's list.
No compaction is done on large objects.

Page 107:

Large objects – cont.

Read more in David Ungar and Frank Jackson. “An adaptive tenuring policy for generation scavengers.” ACM Transactions on Programming Languages and Systems, 14(1):1–27, January 1992

Page 108:

Two generations, one cache block

How important is the co-location of inter-generation objects?
The way to achieve it is to demote or promote objects.

Page 109:

Two generations, one cache block – cont.

Intra-generation pointers are not tracked.
To demote an object safely, its original generation must be collected.
Result: long collection times.

Page 110:

Two generations, one cache block – cont.

Promotion can be done safely:
– The young generation is being collected and its pointers updated
– Pointers from old to young are tracked
But the locality benefit starts only when the old generation is collected.
Premature promotion.

Page 111:

Why explicitly build free lists?

Allocation is fast.
Heap scanning for unmarked objects can be fast using the mark bits.
Little additional space overhead is required.

Page 112:

Experimental methodology

Vortex compiler infrastructure.
Vortex supports GGC only for Cecil.
Cecil – a dynamically typed, purely object-oriented language.
Used the Cecil benchmarks.
Repeated each experiment 5 times and reported the average.

Page 113:

Cecil benchmarks

Page 114:

Cecil benchmarks – cont.

Compiled at highest (o2) optimization level

Page 115:

The platform

Sun Ultraserver E5000
12 167MHz UltraSPARC processors
2GB memory – to prevent page faults
Solaris 2.5.1

Page 116:

The platform - memory

L1 – 16KB direct-mapped, 16B blocks
L2 – 1MB unified, direct-mapped, 64B blocks
64-entry iTLB and 64-entry dTLB, fully associative

Page 117:

The platform – memory costs

L1 data cache hit – 1 cycle
L1 miss, L2 hit – 6 cycles
L2 miss – an additional 64 cycles

Page 118:

Second experimental methodology

Two platforms.
All benchmarks except one are C programs.

Page 119:

Pentium measurements

Dual-processor 500MHz Pentium III (but only one processor used)
100MHz bus
512KB L2 cache
Physical memory > 300MB (why keep it a secret?), which prevented paging and kept the whole executable in memory
RedHat 6.1
Benchmarks compiled using gcc with -O2

Page 120:

RISC measurements

A single PA-8000/180MHz processor
Running HP/UX 11
Single-level I and D caches, 1MB each

Page 121:

Benchmarks

Execution time measurements are an average of five runs.
The division between sweep and mark times is arbitrary.
The Pentium III's prefetcht0 introduced a new overhead, so prefetchnta was used instead. It was less effective at eliminating cache misses, though.

Page 122:

?

Page 123:

The end

Lectured by: Shachar Rubinstein

[email protected]

GC seminar – Mooly Sagiv

Audience:

You

Thanks:

For your patience

The Powerpoint XP effects

My parents

No animals were harmed during this production (except for annoying mosquitoes)

Thank you for listening! (and staying awake…)