D2.3 (WP2): Prototype Scalable Runtime System...


ICT-287510 RELEASE

A High-Level Paradigm for Reliable Large-Scale Server Software
A Specific Targeted Research Project (STReP)

D2.3 (WP2): Prototype Scalable Runtime System Architecture

Due date of deliverable: September 30, 2013
Actual submission date: September 23, 2013

Start date of project: 1st October 2011
Duration: 36 months

Lead contractor: Uppsala University
Revision: 0.1 (23rd September 2013)

Purpose: To describe additions and improvements made to key components of the Erlang Virtual Machine that improve its performance, scalability and responsiveness on big multicore machines.

Results: The main results presented in this deliverable are:

• Efficient functionality that allows the Erlang runtime system to determine thread progress.

• Infrastructure for memory allocation and deallocation by multiple scheduler threads that requires less locking, and support for memory carrier migration.

• Better organizations of process and port tables.

• More scalable ways to manage processes and handle port signals.

• Non-blocking mechanisms for code loading and trace setting.

• An algorithm that preserves term sharing in copying and message passing and its low-level implementation on the Erlang VM.

Conclusion: The set of changes described in this deliverable has allowed many of the key components of the Erlang runtime system to become more efficient and scalable and has eliminated many of the bottlenecks that hindered the scalability and responsiveness of its VM.

Project funded under the European Community Framework 7 Programme (2011-14)
Dissemination Level

PU Public ✓
PP Restricted to other programme participants (including the Commission Services)
RE Restricted to a group specified by the consortium (including the Commission Services)
CO Confidential, only for members of the consortium (including the Commission Services)


Prototype Scalable Runtime System Architecture

Konstantinos Sagonas <[email protected]>

Sverker Eriksson <[email protected]>

Rickard Green <[email protected]>

Kenneth Lundin <[email protected]>

Nikolaos Papaspyrou <[email protected]>

Contents

1 Executive Summary
2 Introduction
3 Thread Progress
   3.1 Problems
   3.2 Solution
   3.3 Implementation
      3.3.1 Requirements
      3.3.2 API
      3.3.3 The Actual Implementation
4 Delayed Deallocation
   4.1 Problem
   4.2 Solution
   4.3 Implementation
      4.3.1 Tail
      4.3.2 Head
      4.3.3 Empty List
      4.3.4 Contention
      4.3.5 Schedulers and the Locked Allocator Instance
   4.4 Benchmark
5 Carrier Migration
   5.1 Problem
   5.2 Solution
   5.3 Implementation
      5.3.1 Management of Free Blocks
      5.3.2 Carrier Pool
      5.3.3 Migration
      5.3.4 Result
   5.4 Related Work
   5.5 Future Work
6 Process and Port Tables
   6.1 Problems
   6.2 Solution
   6.3 Implementation
      6.3.1 Lookup
      6.3.2 Insertion and Deletion
      6.3.3 Iteration over the Table
      6.3.4 Future Improvements
   6.4 Benchmarks
7 Process Management
   7.1 Problems
   7.2 Solution and Implementation
      7.2.1 Rearranging the Process Structure
      7.2.2 Less Locking on the Run Queues
      7.2.3 Combined Modifications
   7.3 Benchmark
8 Port Signals
   8.1 Problems
   8.2 Solution
   8.3 Implementation
      8.3.1 Scheduling of Port Signals
      8.3.2 Preparation of Signal Send
      8.3.3 Preserving Low Latency
      8.3.4 Signal Operations
   8.4 Benchmarks
9 Code Loading
   9.1 Problem
   9.2 Solution
   9.3 Implementation
      9.3.1 The Load Phases
      9.3.2 The Finishing Sequence
10 Trace Setting
   10.1 Problem
   10.2 Solution
   10.3 Implementation
      10.3.1 Atomicity Without Atomic Operations
      10.3.2 Adding a New Breakpoint
      10.3.3 Updating and Removing Breakpoints
      10.3.4 Global Tracing
      10.3.5 Future Work
11 Term Sharing
   11.1 Problems
   11.2 Solution
   11.3 Implementation
      11.3.1 Erlang/OTP’s Tagging Scheme
      11.3.2 Copying and Term Sharing
   11.4 Benchmarks
      11.4.1 Stress Tests
      11.4.2 Shootout Benchmarks
12 Concluding Remarks
A Scalability Improvements in Erlang/OTP Releases
   A.1 Improvements in Erlang/OTP R15B (2011-12-14)
   A.2 Improvements in Erlang/OTP R15B01 (2012-04-02)
   A.3 Improvements in Erlang/OTP R15B02 (2012-09-03)
   A.4 Improvements in Erlang/OTP R15B03 (2012-12-06)
   A.5 Improvements in Erlang/OTP R16B (2013-02-25)
   A.6 Improvements in Erlang/OTP R16B01 (2013-06-18)
   A.7 Improvements in Erlang/OTP R16B02 (2013-09-18)


1 Executive Summary

This document presents the third deliverable of Work Package 2 (WP2) of the RELEASE project. WP2 is concerned with improving the Erlang Virtual Machine (VM) by re-examining its runtime system architecture, identifying possible bottlenecks that affect its performance, scalability and responsiveness, designing and implementing improvements and changes to its components and, whenever improvements without major changes are not possible, proposing alternative mechanisms able to eliminate these bottlenecks. Towards this goal we have gradually improved the implementation of several key components of the Erlang runtime system architecture and complemented them with some new components providing additional functionality. In this document we describe these changes, additions and improvements. More specifically, this report presents:

• Efficient functionality added to the Erlang runtime system that allows it to determine thread progress.

• Improved infrastructure for memory allocation and deallocation by multiple scheduler threads that requires less locking and is more scalable, and support for carrier migration.

• A better scheme for the organization of process and port tables.

• More scalable ways to do the internal management of processes and handle port signals.

• Non-blocking mechanisms for code loading and trace setting that improve the responsiveness of the system.

• An algorithm that preserves term sharing in copying and message passing and its low-level implementation on the Erlang VM.

These changes, with the sole exception of the last one, are already part of the Erlang/OTP system, are robust enough to be used by its programming community, and have significantly improved both the performance and scalability of the Erlang VM across its releases. The last change is in the process of being integrated into the main development branch and is expected to appear in a future release of Erlang/OTP.

2 Introduction

The main goal of the RELEASE project is to investigate extensions of the Erlang language and improve aspects of its implementation technology in order to increase the performance of Erlang applications and allow them to achieve better scalability when run on big multicores or clusters of multicore machines. Work Package 2 (WP2) of RELEASE aims to improve the Erlang VM. The lead site of WP2 is Uppsala University. The task of WP2 pertaining to this deliverable is:

Task 2.3: “... investigate alternative runtime system architectures that reduce the need for copying data sent as messages between processes and scheduler extensions that reduce inter-process communication and support fine-grained parallelism.”

Towards fulfilling this task, this deliverable (D2.3), due exactly at the end of the second year of the project, presents the implementation of key components of a scalable runtime system architecture for the Erlang VM. More specifically, it describes in detail various performance and scalability improvements that either have been already incorporated into a publicly available release of the Erlang/OTP system or are planned to be incorporated into it during the duration of the RELEASE project. The last deliverable of WP2 (D2.4) will document the complete implementation of a scalable runtime system architecture in which components that are still in a prototype phase will be robust and more efficient than the current ones.

The remaining eight sections, which form the main body of this document, describe in detail supporting infrastructure that is crucial to achieving scalability on multicores, and changes to key components of the Erlang VM that lifted scalability and availability bottlenecks. Each section starts with an account of problems that needed to be addressed, a sketch of the solution we adopted, followed by a detailed description of the solution’s implementation and, where appropriate, some benchmark results. Specifically these sections describe:

• the thread progress functionality of the runtime system (i.e., the mechanism with which the runtime system determines that a thread has completed access to some data structure and ensures that all modifications to this data structure are consistently observed) and its efficient implementation;

• how memory allocation and deallocation by multiple threads can happen in a way that reduces lock contention and makes the underlying implementation more scalable;

• the support for memory carrier migration that was recently added in Erlang/OTP;

• the internal data structures of process and port tables, their organization and the changes that allow for efficient and scalable lookups, insertions and deletions by multiple scheduler threads;

• the internal management of Erlang processes by the runtime system and mechanisms that reduce the need for locking when inserting or migrating processes from the run queues of scheduler threads;

• how the handling of port signals has been changed so as to happen asynchronously and be scheduled in a way that enables its parallel execution by multiple scheduler threads;

• how code loading can occur in a way that does not block the execution of already running Erlang processes;

• how trace breakpoints can be set without blocking the entire VM; and

• how term sharing can be preserved during message passing or when copying terms and the performance benefits that this brings to a fundamental operation of the system and the Erlang language.

The report ends with a brief section with some concluding remarks. The appendix lists changes and scalability improvements which have made it into the Erlang/OTP system in one of its releases during the first twenty-four months of the project.

The work for this deliverable has been done by researchers from Ericsson AB (EAB), the Institute of Communication and Computer Systems (ICCS), and Uppsala University (UU). The bulk of the work described was done by the Ericsson team; researchers from ICCS and UU did the design and implementation of the algorithm that preserves term sharing in copying and message passing.

3 Thread Progress

3.1 Problems

Knowing when Threads have Completed Accesses to a Data Structure

When multiple threads access the same data structure, the runtime system needs to know when all threads have completed their accesses. This is needed, for example, in order to know when it is safe to deallocate the data structure. One simple way to accomplish this is to reference count all accesses to the data structure. The problem with this approach is that the cache line where the reference counter is located needs to be communicated between all involved processors. Such communication can become extremely expensive and will scale poorly if the reference counter is frequently accessed. In short, we want some approach other than reference counting for keeping track of threads.

Knowing that Modifications of Memory are Consistently Observed

Different hardware architectures have different memory models. Some architectures allow very aggressive reordering of memory accesses while others only reorder a few specific cases. Common to all modern hardware is, however, that some type of reordering will occur. When using locks to protect memory accesses from multiple threads, such reorderings will not be visible. The locking primitives will ensure that the memory accesses will be ordered. When using lock-free algorithms, however, one has to take into account this reordering made by the hardware.

Hardware memory barriers or memory fences are instructions that can be used to enforce order between memory accesses. Different hardware architectures provide different memory barriers. Lock-free algorithms need to use memory barriers in order to ensure that memory accesses are not reordered in such ways that the algorithm breaks down. Memory barriers are also expensive instructions, so we typically want to minimize their use.

3.2 Solution

The thread progress functionality that we added to the Erlang VM is used to address these problems. The name thread progress was chosen since we want to use it to determine when all threads in a set of threads have made such progress that two specific events have taken place for all of them.

We call the set of threads that we are interested in managed threads. The managed threads are the only threads that we get any information about, and these threads have to frequently report progress. Not all threads in the system are able to frequently report progress. Such threads cannot be allowed in the set of managed threads and are called unmanaged threads. An example of unmanaged threads are the threads in the asynchronous thread pool, which can be blocked for a very long time and thereby be prevented from frequently reporting progress. Currently, only scheduler threads and a couple of other threads are managed threads.

Thread Progress Events

Any thread in the system may use the thread progress functionality in order to determine when the following events have occurred at least once in all managed threads:

1. The thread has returned from other code to a known state in the thread progress functionality, which is independent of any other code.

2. The thread has executed a full memory barrier.

These events, of course, need to occur ordered with respect to other memory operations. The operation of determining this begins by initiating the thread progress operation. After this, the thread that initiated the thread progress operation polls for its completion. Both of these events must occur at least once after the thread progress operation has been initiated, and at least once before the operation has completed, in each managed thread. This is ordered using communication via memory, which makes it possible to draw conclusions about the memory state after the thread progress operation has completed. Let’s call the progress made from initiation to completion thread progress.

Assuming that the thread progress functionality is efficient, many algorithms can be both simplified and made more efficient compared with the first approach that comes to mind. A couple of examples follow.

By being able to determine when the first event above has occurred we can easily know when all managed threads have completed accesses to a data structure. This can be determined in the following way. We have an implementation of some functionality F using a data structure D. The reference to D is always looked up before D is being accessed, and the references to D are always dropped before we leave the code implementing F. If we remove the possibility to look up D and then wait until the first event has occurred in all managed threads, no managed threads can have any references to the data structure D. This could for example have been achieved by using reference counting, but the cache line containing the reference counter would in this case be ping-ponged between all processors accessing D at every access.

By being able to determine when the second event has occurred it is quite easy to do complex modifications of memory that need to be seen consistently by other threads without having to resort to locking. By doing the modifications, then issuing a full memory barrier, then waiting until the second event has occurred in all managed threads, and then publishing the modifications, we know that all managed threads reading this memory will get a consistent view of the modifications. Managed threads reading this will not have to issue any extra memory barriers at all.

3.3 Implementation

3.3.1 Requirements

In order to be able to determine when all managed threads have reached the states that we are interested in, we need to communicate between all involved threads. Naturally, we want to minimize this communication. We also want threads to be able to determine when thread progress has been made relatively fast. That is, we need some balance between communication overhead and time to complete the operation.

3.3.2 API

We only present the most important functions in the API here.

• ErtsThrPrgrVal erts_thr_progress_later(void);

A function that initiates the operation. The thread progress value returned can be used for testing for the operation’s completion.

• int erts_thr_progress_has_reached(ErtsThrPrgrVal val);

Returns a non-zero value when we have reached the thread progress value passed as argument. That is, when a non-zero value is returned the operation has completed.

When a thread executes my_val = erts_thr_progress_later(); and subsequently waits for erts_thr_progress_has_reached(my_val) to return a non-zero value, it knows that thread progress has been made.

While waiting for erts_thr_progress_has_reached to return a non-zero value, we typically do not want to block waiting, but instead to continue working with other tasks. If we run out of work, we typically do want to block waiting until we have reached the thread progress value that we are waiting for. In order to be able to do this, we provide functionality for waking up a thread when a certain thread progress value has been reached:

• void erts_thr_progress_wakeup(ErtsSchedulerData *esdp, ErtsThrPrgrVal val);

A function that requests a wake up. The calling thread will be woken when thread progress has reached val.

Managed threads frequently need to update their thread progress by calling the following functions:

• int erts_thr_progress_update(ErtsSchedulerData *esdp);

Updates thread progress. If a non-zero value is returned, erts_thr_progress_leader_update has to be called without any locks held.

• int erts_thr_progress_leader_update(ErtsSchedulerData *esdp);

Performs the leader’s update of thread progress.

Unmanaged threads can delay thread progress from being made:

• ErtsThrPrgrDelayHandle erts_thr_progress_unmanaged_delay(void);

Delays thread progress.

• void erts_thr_progress_unmanaged_continue(ErtsThrPrgrDelayHandle handle);

Lets thread progress continue.


Figure 1: The communication pattern in the thread progress functionality; leader thread is leftmost.

Scheduler threads can schedule an operation to be executed by the scheduler itself when thread progress has been made:

• void erts_schedule_thr_prgr_later_op(void (*funcp)(void *), void *argp, ErtsThrPrgrLaterOp *memp);

Schedules a call to funcp. The call (*funcp)(argp) will be executed when thread progress has been made since the call to erts_schedule_thr_prgr_later_op was made.
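To make the intended usage concrete, the following sketch shows the common pattern of unpublishing a shared structure and deferring its deallocation until thread progress has been made. It assumes the declarations listed above are in scope; the my_table_t type and the my_table_unlink() and release_table_memory() helpers are hypothetical names used only for illustration, not actual ERTS code.

/* Sketch only (not actual ERTS code): deferring the deallocation of a shared
 * structure until all managed threads have made thread progress. */

typedef struct {
    ErtsThrPrgrLaterOp later_op;   /* storage used by the later-op machinery */
    /* ... actual payload of the structure ... */
} my_table_t;

static void
free_table(void *vtab)
{
    /* Runs on the scheduler once thread progress has been made, i.e., when
     * no managed thread can still hold a reference to the table. */
    release_table_memory((my_table_t *) vtab);
}

static void
delete_table(my_table_t *tab)
{
    my_table_unlink(tab);   /* 1. unpublish: no new lookups can find it */
    /* 2. defer the actual free until thread progress has been made */
    erts_schedule_thr_prgr_later_op(free_table, tab, &tab->later_op);
}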

3.3.3 The Actual Implementation

To determine when all events have happened, we use a global counter that is incremented when all managed threads have called erts_thr_progress_update (or erts_thr_progress_leader_update). This could naively be implemented using a “thread confirmed” counter. This would however cause an explosion of communication where all involved processors would need to communicate with each other at each update.

Instead of confirming at a global location, each thread confirms that it accepts an increment of the global counter in its own cache line. These confirmation cache lines are located in sequence in an array, and each cache line will only be written by one and only one thread. One of the managed threads always has the leader responsibility. This responsibility may jump between threads but, as long as there is some activity in the system, one of the threads will always have the leader responsibility. The leader has the responsibility to call erts_thr_progress_leader_update, which will check that all other threads have confirmed an increment of the global counter before doing the increment of the global counter. The leader is the only thread reading the confirmation cache lines.

Doing it this way we get a communication pattern of information going from the leader thread out to all other managed threads and then back from the other threads to the leader thread; see Figure 1. This is because only the leader thread will write to the global counter while all other threads only read it, and because each confirmation cache line will only be written by one specific thread and only read by the leader thread. When the managed threads are distributed over different processors, the communication between processors will be a reflection of this communication pattern between threads.
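The sketch below illustrates this confirmation scheme using C11 atomics. It is a simplification for illustration only, not the actual ERTS data structures: each managed thread writes its confirmation into its own cache-line-aligned slot, and only the leader reads the slots and publishes the new global counter value.

/* Illustrative sketch of the confirmation scheme; the real ERTS
 * implementation differs in detail. */
#include <stdatomic.h>
#include <stdbool.h>

#define CACHE_LINE 64
#define MAX_MANAGED_THREADS 1024

typedef struct {
    _Alignas(CACHE_LINE) atomic_ulong confirmed; /* written only by its owner */
} confirm_slot_t;

static confirm_slot_t slots[MAX_MANAGED_THREADS];
static _Alignas(CACHE_LINE) atomic_ulong global_counter;
static int nr_managed_threads;          /* set at start-up */

/* Called frequently by every managed thread (cf. erts_thr_progress_update). */
void thread_confirm(int tid)
{
    unsigned long global = atomic_load_explicit(&global_counter,
                                                memory_order_acquire);
    /* Confirm that this thread has seen 'global' and accepts an increment. */
    atomic_store_explicit(&slots[tid].confirmed, global,
                          memory_order_release);
}

/* Called only by the current leader (cf. erts_thr_progress_leader_update). */
bool leader_try_increment(void)
{
    unsigned long global = atomic_load_explicit(&global_counter,
                                                memory_order_acquire);
    for (int i = 0; i < nr_managed_threads; i++)
        if (atomic_load_explicit(&slots[i].confirmed,
                                 memory_order_acquire) != global)
            return false;               /* someone has not confirmed yet */
    atomic_store_explicit(&global_counter, global + 1,
                          memory_order_release);
    return true;
}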

Figure 2 shows what the communication pattern in the thread progress functionality would have been if we stored the internal data in only one common cache line. (Again, the leader thread is the leftmost one in this figure.)


Figure 2: An example of a bad communication pattern in the thread progress functionality.

The value returned from erts_thr_progress_later equals the latest value confirmed by this thread plus two. The global value may be the latest confirmed value or the latest confirmed value minus one. In order to be certain that all other managed threads will call erts_thr_progress_update at least once before we reach the value returned from erts_thr_progress_later, the global counter plus one is not enough. This is because all other threads may already have confirmed the current global value plus one at the time when we call erts_thr_progress_later. They are however guaranteed not to have confirmed global value plus two at this time.

The implementation described above minimizes the communication needed before we can increment the global counter. The amount of communication in the system due to the thread progress functionality however also depends on the frequency with which managed threads call erts_thr_progress_update. Today each scheduler thread calls erts_thr_progress_update each time an Erlang process is scheduled out. One way to further reduce communication due to the thread progress functionality is to only call erts_thr_progress_update every second or third time an Erlang process is scheduled out, or even less frequently. However, by doing updates of thread progress less frequently, all operations depending on the thread progress functionality will also take a longer time.

Delay of Thread Progress by Unmanaged Threads

In order to implement delay of thread progress from unmanaged threads we use two reference counters: current and waiting. When an unmanaged thread wants to delay thread progress it increments current and gets a handle back to the reference counter it incremented. When it later wants to enable continuation of thread progress it uses the handle to decrement the reference counter it previously incremented.

When the leader thread is about to increment the global thread progress counter it verifies that the waiting counter is zero before doing so. If it is not zero, the leader is not allowed to increment the global counter, but instead needs to wait. When it is or becomes zero, it swaps the waiting and current counters before increasing the global counter. From now on the new waiting counter will decrease, so that it eventually will reach zero, making it possible to increment the global counter the next time. If we only used one reference counter it would potentially be held above zero forever by different unmanaged threads.

When an unmanaged thread increments the current counter it will not prevent the next increment of the global counter, but instead the increment after that. This is sufficient since the global counter needs to be incremented two times before thread progress has been made. It is also desirable not to prevent the first increment, since the probability increases that the delay is withdrawn before any increment of the global counter is delayed. That is, the operation will cause as little disruption as possible.

However, this feature of delaying thread progress from unmanaged threads should preferably be used as little as possible, since heavy use of it will cause contention on the reference counter cache lines. The functionality is however very useful in code which normally only executes in managed threads, but which may, under some infrequent circumstances, be executed in other threads.
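A minimal sketch of the two-counter scheme follows. It is a simplification that ignores the retry logic the real implementation needs when the counters are swapped concurrently, and the function names are made up: unmanaged threads increment the current counter, and the leader refuses to increment the global counter until the waiting counter has drained to zero, at which point the roles of the two counters are swapped.

/* Illustrative sketch of delaying thread progress from unmanaged threads. */
#include <stdatomic.h>
#include <stdbool.h>

static atomic_long delay_counters[2];
static atomic_int  current_ix;          /* index of the 'current' counter */

/* cf. erts_thr_progress_unmanaged_delay(): the returned handle is simply
 * the index of the counter that was incremented. */
int unmanaged_delay(void)
{
    int ix = atomic_load_explicit(&current_ix, memory_order_acquire);
    atomic_fetch_add_explicit(&delay_counters[ix], 1, memory_order_acq_rel);
    return ix;
}

/* cf. erts_thr_progress_unmanaged_continue(): decrement the same counter. */
void unmanaged_continue(int handle)
{
    atomic_fetch_sub_explicit(&delay_counters[handle], 1, memory_order_acq_rel);
}

/* Leader side: the global counter may only be incremented when the counter
 * currently playing the 'waiting' role has drained to zero; the roles are
 * then swapped so that the new 'waiting' counter can eventually drain too. */
bool leader_may_increment_global(void)
{
    int cur     = atomic_load_explicit(&current_ix, memory_order_acquire);
    int waiting = 1 - cur;
    if (atomic_load_explicit(&delay_counters[waiting],
                             memory_order_acquire) != 0)
        return false;                   /* thread progress is delayed */
    atomic_store_explicit(&current_ix, waiting, memory_order_release);
    return true;                        /* caller may increment the global counter */
}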

Overhead

The overhead caused by the thread progress functionality is more or less fixed for a given number of schedulers, regardless of how much the functionality is used. Already today quite a lot of functionality uses it, and we plan to use it even more. Rewriting an old implementation of ERTS internal functionality to use the thread progress functionality only makes sense if it removes communication present in the old implementation; otherwise there is simply no point in the rewrite. Since the thread progress overhead is more or less fixed, such a rewrite will cause a reduction of the total communication in the system.

An Example

The main structure of an ETS table was originally managed using reference counting. Already a long time ago we replaced this strategy since the reference counter caused contention on each access of the table. The solution used was to schedule “confirm deletion” jobs on each scheduler in order to know when it was safe to deallocate the table structure of a removed table. These confirm deletion jobs needed to be allocated. That is, we had to allocate and deallocate as many blocks as there are schedulers in order to deallocate one block. This of course was quite an expensive operation, but we only needed to do it once when removing a table. It was more important to get rid of the contention on the reference counter which was present on every operation on the table.

When the thread progress functionality had been introduced, we could remove the code implementing the “confirm deletion” jobs, and instead just schedule a thread progress later operation which deallocates the structure. Besides simplifying the code a lot, we got an increase of more than 10% in the number of transactions per second handled on the mnesia tpcb benchmark executing on a quad-core machine.

4 Delayed Deallocation

4.1 Problem

An easy way to handle memory allocation in a multi-threaded environment is to protect the memory allocator with a global lock which threads performing memory allocations or deallocations have to hold during the whole operation. Of course, this scheme scales very poorly, due to heavy lock contention. An improved version of this scheme is to use multiple, thread-specific instances of such an allocator. That is, each thread allocates in its own allocator instance which is protected by a lock. In the general case references to memory need to be passed between threads. In the case where a thread needs to deallocate memory that originates from another thread’s allocator instance a lock conflict is possible. In a system like the Erlang VM, where memory allocation and deallocation are frequent and references to memory are also passed around between threads, this solution will also scale poorly due to lock contention.

4.2 Solution

In order to reduce contention due to locking of allocator instances we introduced completely lock-free instances tied to each scheduler thread, and an extra locked instance for other threads; the scheme is shown in Figure 3. The scheduler threads in the system are expected to do the major part of the work. Other threads may still be needed but should not perform any major and/or time-critical work. The limited amount of contention that appears on the locked allocator instance can more or less be disregarded.


Figure 3: The memory allocator infrastructure after introduction of delayed deallocation.

Since we still need to be able to pass references to memory between scheduler threads we need some way to manage this. An allocator instance belonging to one scheduler thread is only allowed to be manipulated by that scheduler thread. When other threads need to deallocate memory originating from a foreign allocator instance, they only pass the memory block to a “message box” containing deallocation jobs attached to the originating allocator instance. When a scheduler thread detects such deallocation jobs it performs the actual deallocation.

4.3 Implementation

The “message box” is implemented using a lock-free singly-linked list through the memory blocks to deallocate. The order of the elements in this list is not important. Insertion of new free blocks will be made somewhere near the end of this list. Requiring that new blocks be inserted exactly at the end would cause unnecessary contention when large amounts of memory blocks are inserted simultaneously by multiple threads.

The data structure referring to this singly-linked list occupies two cache lines: one cache line containing information about the head of the list and one containing information about the tail of the list. In order to reduce cache line ping-ponging of this data structure, the head of the list is only manipulated by the thread owning the allocator instance, and the tail is manipulated by other threads inserting deallocation jobs.

4.3.1 Tail

In the tail part of the data structure we find a pointer to the last element of the list, or at least something that is near the end of the list. In the uncontended case the tail will point to the end of the list, but when simultaneous insert operations are performed it will point to something near the end of the list.

When inserting an element we try to write a pointer to the new element in the next pointer of the element pointed to by the last pointer. This is done using an atomic compare-and-swap instruction that expects the next pointer to be NULL. If this succeeds, the thread performing this operation moves the last pointer to point to the newly inserted element. If the compare-and-swap fails, the last pointer did not point to the last element. In this case we need to insert the new element somewhere in between the element that the last pointer pointed to and the actual last element. If we do it this way, the last pointer will eventually end up at the last element when threads stop adding new elements. When trying to insert somewhere near the end and we fail to do so, the inserting thread sometimes moves to the next element and sometimes tries with the same element again. This is done in order to spread the inserted elements during heavy contention. That is, we try to spread the modifications of memory to different locations instead of letting all threads continuously try to modify the same location in memory.
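The insertion step can be sketched with C11 atomics as follows. This is an illustration of the technique, not the actual ERTS code, and it simplifies the spreading heuristic to always moving forward on failure.

/* Illustrative sketch: lock-free insertion near the tail of the
 * delayed-deallocation list. */
#include <stdatomic.h>
#include <stddef.h>

typedef struct dd_block {
    _Atomic(struct dd_block *) next;
    /* ... the memory block to be deallocated ... */
} dd_block_t;

typedef struct {
    _Atomic(dd_block_t *) last;   /* points at, or near, the last element */
} dd_tail_t;

void dd_enqueue(dd_tail_t *tail, dd_block_t *blk)
{
    dd_block_t *expected, *node;

    atomic_store_explicit(&blk->next, NULL, memory_order_relaxed);
    node = atomic_load_explicit(&tail->last, memory_order_acquire);

    for (;;) {
        expected = NULL;
        /* Try to hook 'blk' onto what we believe is the last element. */
        if (atomic_compare_exchange_strong_explicit(&node->next, &expected, blk,
                                                    memory_order_release,
                                                    memory_order_acquire)) {
            /* Success: advance the (approximate) last pointer. */
            atomic_store_explicit(&tail->last, blk, memory_order_release);
            return;
        }
        /* Failure: 'node' was not the last element; 'expected' now holds
         * node->next. Moving forward (or sometimes retrying, in the real
         * implementation) spreads insertions over different cache lines. */
        node = expected;
    }
}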

4.3.2 Head

The head contains pointers to the beginning of the list (head.first) and to the first block which other threads may refer to (head.unref_end). Blocks between these pointers are only referred to by the head part of the data structure, which is only used by the thread owning the allocator instance. When these two pointers are not equal the thread owning the allocator instance deallocates block after block until head.first reaches head.unref_end.

We of course periodically need to move head.unref_end closer to the end in order to be able to continue deallocating memory blocks. Since all threads inserting new elements in the linked list will enter the list using the last pointer, we can use this knowledge. If we call erts_thr_progress_later and wait until we have reached that thread progress, we know that no managed threads can refer to the elements up to the element pointed to by the last pointer at the time when we called erts_thr_progress_later. This is because all managed threads must have left the code implementing this at least once, and they always enter the list via the last pointer. The tail.next field contains information about the next head.unref_end pointer and the thread progress that needs to be reached before we can move head.unref_end.

Unfortunately, not only threads managed by the thread progress functionality may insert memory blocks. Other threads also need to be taken care of. However, other threads are not as frequent users of this functionality as managed threads, so using a less efficient scheme for them is not that big of a problem. In order to handle unmanaged threads we use two reference counters. When an unmanaged thread enters this implementation it increments the reference counter currently used, and when leaving the implementation it decrements the same reference counter. When the consumer thread calls erts_thr_progress_later in order to determine when it is safe to move head.unref_end, it also swaps reference counters for unmanaged threads. The previous current counter represents outstanding references from the time up to this point. The new current counter represents future references following this point. When the consumer thread detects both that we have reached the desired thread progress and that the previous current reference counter has reached zero, it is safe to move head.unref_end.

The reason for using two reference counters is that we need to know that the reference counter will eventually reach zero. If we only used one reference counter it could potentially be held above zero forever by different unmanaged threads.
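The consumer side can be sketched as follows, reusing the dd_block_t type from the previous sketch; deallocate_block() is a hypothetical helper standing in for handing the block back to the allocator instance.

/* Illustrative sketch: the owning scheduler deallocates blocks that no
 * other thread can reference any longer. Not the actual ERTS code. */
typedef struct {
    dd_block_t *first;       /* head of the list; only touched by the owner  */
    dd_block_t *unref_end;   /* first block other threads may still refer to */
} dd_head_t;

void dd_deallocate_safe_blocks(dd_head_t *head)
{
    /* Blocks in [first, unref_end) can no longer be referenced by any other
     * thread, so the owning scheduler can free them without synchronization. */
    while (head->first != head->unref_end) {
        dd_block_t *blk = head->first;
        head->first = atomic_load_explicit(&blk->next, memory_order_acquire);
        deallocate_block(blk);   /* hypothetical: return the block to the
                                  * allocator instance */
    }
    /* Moving unref_end further requires both that thread progress has been
     * made and that the previous 'current' reference counter for unmanaged
     * threads has reached zero. */
}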

4.3.3 Empty List

If no new memory blocks are inserted into the list, the list should eventually be emptied. All pointers to the list however expect to always point to something. This is solved by inserting an empty “marker” element, whose only purpose is to be there in the absence of other elements. That is, when the list is empty it only contains this “marker” element.

4.3.4 Contention

When elements are continuously inserted by threads not owning the allocator instance, the thread owning the allocator instance will be able to work more or less undisturbed by other threads at the head end of the list. At the tail end large amounts of simultaneous inserts may cause contention, but we reduce such contention by spreading inserts of new elements near the end instead of requiring all new elements to be inserted at the end.


4.3.5 Schedulers and the Locked Allocator Instance

The locked allocator instance for use by non-scheduler threads also has a message box for deallocation jobs, just as all the other allocator instances. The reason for this is that other threads may allocate memory and pass it to a scheduler that then needs to deallocate it. We do not want the scheduler to have to wait for the lock on this locked instance. Since locked instances also have message boxes for deallocation jobs, the scheduler can just insert the job and avoid the locking.

4.4 Benchmark

When running the ehb benchmark, a large number of messages is passed around between schedulers. All message passing will in some way or another cause memory allocation and deallocation. Since messages are passed between different schedulers we will get contention on the allocator instances where the messages were allocated. By the introduction of the delayed deallocation feature, we got a speedup of between 25% and 45%, depending on the configuration of the benchmark, when running on a relatively new machine with an Intel i7 quad-core processor with hyper-threading (i.e., the Erlang/OTP system was using 8 schedulers).

5 Carrier Migration

The memory allocators of the Erlang runtime system manage memory blocks in two types of raw memory chunks. We call these chunks of raw memory carriers: single-block carriers, which contain only one large block, and multi-block carriers, which contain multiple blocks. On Unix systems a carrier is typically created using mmap(). However, how a carrier is created is of minor importance. An allocator instance typically manages a mixture of single- and multi-block carriers.

5.1 Problem

When a carrier is empty, i.e., when it contains only one large free block, it is deallocated. Since multi-block carriers can contain both allocated and free blocks at the same time, an allocator instance might be stuck with a large number of poorly utilized carriers if the memory load decreases. After a peak in memory usage, it is often the case that not all memory can be returned, since the blocks which are still allocated are likely to be dispersed over multiple carriers. Such poorly utilized carriers can usually be reused if the memory load increases again. However, since each scheduler thread manages its own set of allocator instances and memory load is not necessarily related to CPU load, we might get into a situation where there are lots of poorly utilized multi-block carriers on some allocator instances while we need to allocate new multi-block carriers on other allocator instances. In scenarios like these, the demand for multi-block carriers in the system might increase at the same time as the actual memory demand in the system has decreased, which is both unwanted and quite unexpected for the end user.

5.2 Solution

In order to prevent scenarios such as these, we have implemented support for migration of multi-block carriers between allocator instances of the same type. This support was introduced in Erlang/OTP R16B01 and was further refined in R16B02.

5.3 Implementation

5.3.1 Management of Free Blocks

In order to be able to remove a carrier from one allocator instance and add it to another we need to be able to move references to the free blocks of the carrier between the allocator instances. The data structure referring to the free blocks that each allocator instance manages often refers to the same carrier from multiple places. For example, when the address order best fit strategy is used, this data structure is a binary search tree spanning all carriers that the allocator instance manages. Free blocks in one specific carrier can be referred to from potentially every other managed carrier, and the amount of such references can be huge. That is, the work of removing the free blocks of such a carrier from the search tree can become a bottleneck. One way of solving this could be to not migrate carriers that contain lots of free blocks, but this would prevent us from migrating carriers that potentially need to be migrated in order to solve the problem we set out to solve.

Figure 4: The memory allocator infrastructure extended by the introduction of a pool of carriers.

By using one data structure of free blocks in each carrier and an allocator-instance-wide data structure of carriers managed by the allocator instance, the work needed in order to remove and add carriers can be kept to a minimum. When migration of carriers is enabled for a specific allocator type, we require that an allocation strategy with such an implementation is used. Currently we have implemented this for three different allocation strategies. All of these strategies use a search tree of carriers sorted so that we can find the carrier with the lowest address that can satisfy the request. Internally in carriers, we use yet another search tree that implements either address order first fit, address order best fit, or best fit. The abbreviations used for these different allocation strategies are aoff, aoffcaobf, and aoffcbf.
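The resulting data layout can be pictured roughly as follows. This is a sketch of the idea, not the actual ERTS structures or field names: each carrier owns a search tree of its own free blocks, while the allocator instance only keeps a search tree of carriers, so abandoning a carrier never requires visiting its individual free blocks.

/* Illustrative sketch of the two-level free-block organization. */
#include <stddef.h>

struct allocator_instance;            /* opaque in this sketch */

typedef struct free_block {
    struct free_block *left, *right;  /* per-carrier search tree of free blocks */
    size_t size;
} free_block_t;

typedef struct carrier {
    struct carrier *left, *right;     /* allocator-wide search tree of carriers,
                                       * sorted by carrier address */
    free_block_t *free_blocks;        /* root of this carrier's own free-block tree */
    size_t max_free_size;             /* enough information to tell whether this
                                       * carrier can satisfy an allocation request */
    struct allocator_instance *owner; /* allocator instance currently owning it */
} carrier_t;

/* Abandoning a carrier is then just unlinking one node from the carrier tree;
 * the carrier's own free-block tree travels with it into the pool. */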

5.3.2 Carrier Pool

In order to migrate carriers between allocator instances we move them through a pool of carriers. The memory allocator infrastructure of Figure 3 extended with the pool of carriers is shown in Figure 4. In order for a carrier migration to complete, one scheduler needs to move the carrier into the pool, and another scheduler needs to take the carrier out of the pool.

The pool is implemented as a lock-free, circular, doubly-linked list. The list contains a sentinel that is used as the starting point when inserting into, or fetching from, the pool. Carriers in the pool are elements in this list.

The list can be modified by all scheduler threads simultaneously. During modifications the list is allowed to get a bit “out of shape”. For example, following the next pointer to the next element and then following the prev pointer does not always take us back to where we started. The following, however, are always true:

• Repeatedly following next pointers will eventually take us to the sentinel.

• Repeatedly following prev pointers will eventually take us to the sentinel.

• Following a next or a prev pointer will take us to either an element in the pool, or an element that used to be in the pool.


When inserting a new element we search for a place to insert the element by only following next pointers, and we always begin by skipping the first element encountered. When trying to fetch an element we do the same thing, but instead only follow prev pointers.

By following different directions when inserting and fetching, we avoid contention between threads inserting and threads fetching as much as possible. By skipping one element when we begin searching, we preserve the sentinel unmodified as much as possible. This is beneficial since all search operations need to read the content of the sentinel. If we were to modify the sentinel, the cache line containing the sentinel would be bounced unnecessarily between processors.

The prev and next fields in the elements of the list contain the value of the pointer, a modification marker, and a deleted marker. Memory operations on these fields are done using atomic memory operations. When a thread has set the modification marker in a field, no thread except the one that set the marker is allowed to modify the field. If multiple modification markers need to be set, we always begin with next fields followed by prev fields, in the order following the actual pointers. This guarantees that no deadlocks will occur.

When a carrier is being removed from a pool, we mark it with a thread progress value that needs to be reached before we are allowed to modify the next and prev fields. That is, until we reach this thread progress we are not allowed to insert the carrier into the pool again, and we are not allowed to deallocate the carrier. This ensures that threads inspecting the pool will always be able to traverse the pool and reach valid elements. Once we have reached the thread progress value that the carrier was tagged with, we know that no threads may have references to the carrier via the pool.
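The field encoding can be sketched like this (illustrative only; the concrete bit layout in ERTS may differ): since list elements are word-aligned, the low bits of each pointer-sized field are free to carry the modification and deleted markers, and the fields are read and written with atomic operations.

/* Illustrative sketch of a pool-list field carrying a pointer plus
 * modification and deleted markers in its low bits. */
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define POOL_FLG_MOD  ((uintptr_t) 0x1)  /* modification marker */
#define POOL_FLG_DEL  ((uintptr_t) 0x2)  /* deleted marker */
#define POOL_FLG_MASK (POOL_FLG_MOD | POOL_FLG_DEL)

typedef atomic_uintptr_t pool_field_t;   /* pointer value | marker bits */

static inline void *pool_ptr(uintptr_t val)    { return (void *) (val & ~POOL_FLG_MASK); }
static inline bool  pool_is_mod(uintptr_t val) { return (val & POOL_FLG_MOD) != 0; }
static inline bool  pool_is_del(uintptr_t val) { return (val & POOL_FLG_DEL) != 0; }

/* Try to take the modification marker on a field. Only the thread that
 * succeeds may change the pointer value, which keeps concurrent writers out. */
static inline bool pool_try_set_mod(pool_field_t *field, uintptr_t *observed)
{
    uintptr_t old = atomic_load_explicit(field, memory_order_acquire);
    if (pool_is_mod(old))
        return false;                        /* someone else is modifying it */
    if (!atomic_compare_exchange_strong_explicit(field, &old, old | POOL_FLG_MOD,
                                                 memory_order_acq_rel,
                                                 memory_order_acquire))
        return false;                        /* lost the race; caller retries */
    *observed = old;                         /* value at the time we marked it */
    return true;
}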

5.3.3 Migration

There exists one pool for each allocator type, enabling migration of carriers between scheduler-specific allocator instances of the same allocator type.

Each allocator instance keeps track of the current utilization of its multi-block carriers. When the utilization falls below the “abandon carrier utilization limit” it starts to inspect the utilization of the current carrier when deallocations are made. If the utilization of that carrier also falls below the “abandon carrier utilization limit”, it unlinks the carrier from its data structure of available free blocks and inserts the carrier into the pool.

Since the carrier has been unlinked from the data structure of available free blocks, no more allocations will be made in the carrier. The allocator instance putting the carrier into the pool, however, still has the responsibility of performing deallocations in it while it remains in the pool.

Each carrier has a field containing information about the allocator instance owning the carrier, a flag indicating if the carrier is in the pool or not, and a flag indicating if it is busy or not. When the carrier is in the pool, the owning allocator instance needs to mark it as busy while operating on it. If another thread inspects it in order to try to fetch it from the pool, it will abort the fetch if the carrier is busy. When fetching the carrier from the pool, ownership will be changed and further deallocations in the carrier will be redirected to the new owner using the delayed deallocation functionality described in the previous section.

If a carrier in the pool becomes empty, it will be withdrawn from the pool. All carriers that become empty are also always passed to their originating allocator instance for deallocation using the delayed deallocation functionality. Since carriers will always be deallocated by the allocator instance that allocated them, the underlying functionality of allocating and deallocating carriers can remain simple and does not need to bother about multiple threads. In a NUMA system we will also not mix carriers originating from different NUMA nodes.

When an allocator instance needs more carrier space, it always begins by inspecting its own carriers that are waiting for thread progress before they can be deallocated. If no such carrier can be found, it then inspects the pool. If no carrier can be fetched from the pool, the allocator instance will allocate a new carrier. Regardless of where the allocator instance gets the carrier from, it just links the carrier into its data structure of free blocks.
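Expressed as pseudocode-like C, the order in which carrier space is obtained looks roughly as follows; all helper functions are hypothetical and carrier_t is the type sketched in Section 5.3.1.

/* Illustrative sketch of how an allocator instance obtains more carrier
 * space; not the actual ERTS code. */
carrier_t *get_more_carrier_space(struct allocator_instance *ai, size_t need)
{
    carrier_t *crr;

    /* 1. Prefer own carriers that are merely waiting for thread progress
     *    before they could be deallocated: cheapest to reuse.            */
    crr = reuse_carrier_awaiting_deallocation(ai, need);
    if (crr)
        return link_into_free_block_tree(ai, crr);

    /* 2. Otherwise try to fetch an abandoned carrier from the pool of
     *    this allocator type (skipping carriers marked busy).            */
    crr = fetch_carrier_from_pool(ai, need);
    if (crr)
        return link_into_free_block_tree(ai, crr);

    /* 3. As a last resort, allocate a brand new carrier.                 */
    crr = allocate_new_carrier(ai, need);
    return link_into_free_block_tree(ai, crr);
}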


5.3.4 Result

The use of this strategy of abandoning carriers with poor utilization and reusing them in allocator instances with an increased carrier demand is extremely effective and completely eliminates the problems that otherwise sometimes occurred when CPU load dropped while memory load did not.

When using the aoffcaobf or aoff strategies compared to gf or bf, we lose some performance since we get more modifications in the data structure of free blocks. This performance penalty is however reduced using the aoffcbf strategy. A trade-off between memory consumption and performance is however inevitable, and it is up to the user to decide what is most important.

5.4 Related Work

Using multiple areas in order to separate concurrent operations made by different threads is a widely used trick by memory allocators trying to perform well in multi-threaded environments. The malloc implementation of glibc [7] used in Linux uses multiple areas called arenas. The jemalloc [5] implementation also uses a multiple arena approach. Hoard [1, 2] uses areas called superblocks which are part of processor-specific heaps. The Intel Threading Building Blocks (Intel TBB) scalable allocator [10] uses an approach similar to Hoard.

Common to the glibc, Hoard, and jemalloc implementations is that threads lock areas while operating on them. This is similar to the approach we used before delayed deallocation was introduced. Both glibc and jemalloc use a finer-grained approach than we do, at least if one assumes lots of small areas, by having a lock per area instead of a lock protecting multiple areas. There is however always a risk of a large number of threads needing to operate on the same area during heavy loads. When that happens, memory allocation operations all of a sudden take a much longer time. We therefore do not think that such an approach is good enough.

The Intel TBB allocator tries to minimize lock contention by using separate free lists for the owner thread and for foreign threads in its heaps. We tried a similar approach during the development of delayed deallocation, but did not find it satisfactory because of the likelihood that the owner thread may suffer during heavy contention.

The Hoard and the Intel TBB allocators solve issues with unused memory due to uneven memory loads in different areas specific to different threads by passing memory areas through a locked global heap. In situations where lots of areas need to pass through the global heap one can of course expect heavy lock contention. Our carrier pools also cause contention if lots of carriers are passed through the pools, but the contention on the pool is negligible compared to using a locked global area approach.

TCMalloc [6] uses a bit of a different approach with thread-specific, lock-free free lists. In order to be able to reuse blocks previously deallocated by other threads, blocks can also be moved to a locked global list. One downside of this approach is that allocations from different threads are mixed in the same area, which increases the risk of false sharing. A negative effect of this is also described by Jason Evans [5]. In order to avoid this, one can allocate blocks in units of the cache line size. A common cache line size is 64 bytes, so for small blocks this approach would waste a lot of memory. TCMalloc, however, does not seem to do this.

5.5 Future Work

It is quite easy to extend the implementation to allow migration of multi-block carriers between all allocator types. The only obstacle is the maintenance of statistics information.

6 Process and Port Tables

6.1 Problems

The process table is a mapping from process identifiers to process structure pointers. The process structure contains miscellaneous information about a process, for example pointers to its heap, message queue, etc. When the runtime system needs to operate on a process, it looks up the process structure in the process table using the process identifier. An example of when this happens is when passing a message to a process.

The process table has, for a very long time, been just an array of pointers to process structures. Since, internally in the runtime system, process identifiers are 28-bit integers, it is quite easy to map a process identifier to an index into the array. The 28 bits were divided into two sets. The least significant set of bits was used as an index into the array. The most significant set of bits was only used to be able to distinguish between a number of identifiers which map to the same index in the array. With this scheme, as long as process table sizes that are a power of two are used, there are 2^28 unique process identifiers.

When the first SMP support was implemented, the table was kept more or less the same way, but protected by two types of locks: one lock that protected the whole table against modifications, and an array of locks protecting different parts of the table. The exact locking strategy previously used is not interesting. What is interesting is that it suffered from heavy lock contention, especially when lots of modifications were being made, but also when only performing lookups.

In order to be able to detect when it is safe to deallocate a previously used process structure, reference counting of the structure was used. This was also problematic, since simultaneous lookups needed to modify the reference counter, which caused contention on the cache line where the reference counter was located, as all modifications need to be communicated between all involved processors.

The port table is very similar to the process table. The major difference, at least conceptually, is that it is a mapping from port identifiers to port structures. It had a similar implementation, but with some differences. Instead of being an array of pointers it was an array of structures, and instead of being protected by two types of locks it was only protected by one global lock. This table also suffered from lock contention in various situations.

6.2 Solution

The process table was the major problem to address since processes are much more frequently used than ports. The first implementation only provided a solution for processes, but since the port table is very similar and susceptible to very similar problems, the process table implementation was later generalized so that it could also be used for the implementation of the port table. For simplicity, we only describe the solution for the process table in the following text, but the same solution applies to the port table unless otherwise stated.

If we disregard the locking issues, the original process table organization is very appealing. The mapping from process identifier to index into the array is very fast, and this property is something we would like to keep. The vast majority of operations on these tables are lookups, so optimizing for lookups is what we want to do.

6.3 Implementation

6.3.1 Lookup

Using a set of bits in the process identifier as an index into an array seems hard to beat. By replacing the array of pointers with an array of our pointer-sized atomic data type, a lookup will consist of the following three steps:

1. Map the 28-bit integer to an index into the array.

More about this mapping later.

2. Read the pointer using an atomic memory operation at the determined index in the array.

On all platforms where we provide atomic memory operations, this is just a volatile read, preventing the compiler from using values in registers and forcing the read to be from memory instead.


3. Depending on use, issue an appropriate memory barrier.

A common barrier used is a barrier with acquire semantics. On x86/x86_64 platforms, this maps to a compiler barrier preventing the compiler from reordering instructions, but on other hardware some kind of light-weight hardware memory barrier is often also needed.

When comparing with a locked approach, at least one heavy-weight memory barrier will be issued when locking the lock on most, if not all, hardware architectures (including x86/x86_64), and often some kind of light-weight memory barrier will be issued when unlocking the lock.

When looking at this very simple solution with very little overhead, one might wonder why it was not implemented this way from the beginning. It all boils down to the read operation of the pointer. We need some way to know that it is safe to access the memory pointed to. One way of doing this is to place a reference counter in the process structure. Incrementing the reference counter at lookup needs to be done atomically with the lookup. A lock can typically provide this service for us, which was the approach we previously used. Another approach could be to co-locate the reference counter with the pointer in the table. The major problem with this approach is the modifications of the reference counter: since these modifications would have to be communicated between all involved processors, they would cause contention on the cache line containing the reference counter. The new lookup approach above is possible since we can use the thread progress functionality in order to determine when it is safe to deallocate the process structure. We will get back to this point when describing deletion from the table.

Using this new lookup approach we do not modify any memory at all, which is important. A lookup conceptually only reads memory, and with this approach this is true also in the implementation, which is important from a scalability perspective. The previous implementation modified the cache line containing the reference counter twice, and the cache line containing the corresponding lock twice, at each lookup.
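A minimal sketch of this lookup, written with C11 atomics, is shown below. The table size, the simplified index mapping, and the names are illustrative assumptions; the actual implementation in the runtime system differs in detail.

#include <stdatomic.h>
#include <stddef.h>

#define TABLE_SIZE (1 << 18)                 /* hypothetical table size      */

typedef struct process Process;              /* opaque process structure     */

static _Atomic(Process *) ptab[TABLE_SIZE];  /* array of pointer-sized atomics */

static size_t data2pix(unsigned pid_data)
{
    return pid_data & (TABLE_SIZE - 1);      /* placeholder for the real mapping */
}

Process *ptab_lookup(unsigned pid_data)
{
    size_t ix = data2pix(pid_data);          /* 1. map identifier to index   */
    /* 2+3. atomic read of the slot with acquire semantics; on x86 this is
     * only a compiler barrier, on weaker architectures a light-weight
     * hardware barrier is emitted as well. */
    return atomic_load_explicit(&ptab[ix], memory_order_acquire);
}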

6.3.2 Insertion and Deletion

A light-weight lookup operation was the most important goal to achieve, but we also wanted to improve modifications. The process table is modified when a new process is spawned, i.e., a new pointer is inserted into the table, or when a process terminates, i.e., a pointer is removed from the table.

Assuming that we spawn fewer processes than the maximum number of unique process identifiers in the system, we can easily determine the order of process creation just by comparing process identifiers. If PidX is larger than PidY, then PidX was created after PidY, assuming both identifiers originate from the same node. However, since we have a quite limited number of unique identifiers (currently only 2^28), this property cannot be relied upon if we create a large number of processes. But nevertheless, this is a property the system has always had.

With a larger number of unique identifiers available, it would have been tempting to drop or modify this ordering property. The ordering property could, for example, be based on the scheduler performing the spawn operation. It would have been possible to reserve large ranges of identifiers exclusively for each scheduler thread, which could be used to minimize the need for communication when allocating identifiers. The number of identifiers we have to work with today is, however, not even close to a number that allows for such an approach.

Since we have a limited number of unique identifiers, we need to be careful not to waste them. If previously used identifiers are reused too quickly, identifiers originating from terminated processes will refer to newly created processes and mix-ups will occur. The previously used approach was quite good at not wasting identifiers. Using a modified version of the same approach also lets us keep the ordering property that we have always had.

Insertion The original approach is more or less to search for the next free index or slot in the array. The search starts from the last slot allocated. If we reach the end of the array we increase a “wrapped counter” and then continue the search. The process identifier is constructed by writing the index to the least significant set of bits, and the “wrapped counter” to the most significant set of bits. The number of bits in each set is decided at boot time, so that the maximum index will just fit into the least significant set of bits.

In the lock-free version of this approach that we implemented, we more or less do it the same way, but with some important modifications trying to avoid unnecessary contention when multiple schedulers create processes simultaneously. Since multiple threads might be searching for the next free slot at the same time from the same starting point, multiple schedulers simultaneously writing new pointers into the table are very likely to write into adjacent slots. If adjacent slots are located in the same cache line, all modifications of this cache line need to be communicated between all involved processors, which is very expensive and scales poorly. We therefore want adjacent slots to be located in different cache lines: then only true conflicts trigger communication between involved processors, and false sharing is avoided.

A cache line is larger than a pointer, typically 8 or 16 times larger, so using one cache line for each slot (i.e., for one pointer) would be a waste of space. Each cache line therefore holds a fixed number of slots. The first slot of the table is the first slot of the first cache line, the second slot of the table is the first slot of the second cache line, and so on until we reach the end of the array. The next slot after that will be the second slot of the first cache line, etc., moving forward one cache-line-internal slot each time we wrap. With this scheme, we are able to fit the same number of pointers into an array of the same size while always keeping adjacent slots in different cache lines.

With this scheme the mapping from identifier to slot, or index into the array, gets a bit more complicated. Instead of a shift and a bit-wise and, we get two shifts, two bit-wise ands, and an add; see the implementation of erts_ptab_data2pix in the file erl_ptab.h. However, by storing this information optimized for lookup, we only need a shift and a bit-wise and on 32-bit platforms. On 64-bit platforms we have enough room for the 28-bit identifier in the least significant halfword and the index in the most significant halfword; in other words, we just need to read the most significant halfword to get the index. That is, this operation is as fast as, or faster than, before. The downside is that on 32-bit platforms we need to convert this information into the 28-bit identifier number when printing, or when ordering identifiers from the same node. These operations are, however, extremely infrequent compared to lookups.
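The sketch below shows one way such a mapping can be expressed, assuming power-of-two sizes so that the division and multiplication compile down to shifts. The constants and the function name are illustrative assumptions; the real erts_ptab_data2pix in erl_ptab.h differs in detail.

#include <stdint.h>

#define CACHE_LINE_SLOTS 8                      /* e.g. 64-byte line / 8-byte pointer */
#define TABLE_SLOTS      (1 << 18)              /* hypothetical table size            */
#define CACHE_LINES      (TABLE_SLOTS / CACHE_LINE_SLOTS)

/* Map the low bits of an identifier to a physical slot index so that
 * logically adjacent slots end up in different cache lines.            */
static inline uint32_t ptab_data2pix(uint32_t data)
{
    uint32_t ix   = data & (TABLE_SLOTS - 1);   /* logical slot number       */
    uint32_t line = ix & (CACHE_LINES - 1);     /* which cache line          */
    uint32_t off  = ix / CACHE_LINES;           /* slot within that line     */
    return line * CACHE_LINE_SLOTS + off;       /* physical index in array   */
}

With these definitions, logical slot 0 maps to the first slot of the first cache line, logical slot 1 to the first slot of the second cache line, and so on, exactly as described above.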

When we insert a new element in the table we do the following:

1. We begin by reserving space in the table by atomically incrementing a counter of processes in the table. If our increment brings the counter above the maximum size of the table, the operation fails and a system_limit exception is raised.

2. The table contains a 64-bit atomic variable holding the last identifier used. Only the least significant bits will be used when actually creating the identifier. This identifier is where the search begins.

3. We increment the last identifier value used. In order to determine the slot that corresponds to this identifier we call erts_ptab_data2pix, which maps identifiers to slots. We read the content of the slot. If the slot is free we try to write a reservation marker using an atomic compare-and-swap. If this fails we repeat this step until it succeeds.

4. We change the table variable holding the last identifier used. Since multiple writes might occur at the same time, this value may already have been changed to an identifier larger than the one we got. In this case we can continue; otherwise, we need to change it to the identifier we got.

5. We now do some initializations of the process structure that cannot be done before we know the process identifier, and have to be done before we publish the structure in the table. This, for example, includes storing the identifier in the process structure.

6. Now we can publish the structure in the table by writing the pointer to the process structure in the slot previously reserved in step 3.
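A condensed sketch of this insertion procedure, written with C11 atomics, is given below. The constants, the reservation marker, the simplified index mapping, and the error handling are illustrative assumptions only; the real code is considerably more involved.

#include <stdatomic.h>
#include <stdint.h>
#include <stddef.h>

#define TABLE_SIZE (1 << 18)                     /* hypothetical table size     */

typedef struct process Process;                  /* opaque process structure    */

#define RESERVED   ((Process *) 1)               /* reservation marker          */

static _Atomic(Process *)   ptab[TABLE_SIZE];
static atomic_int           ptab_count;          /* processes in the table      */
static atomic_uint_fast64_t last_data;           /* last identifier data used   */

static size_t data2pix(uint_fast64_t d) { return (size_t)(d & (TABLE_SIZE - 1)); }

int ptab_insert(Process *p, uint_fast64_t *pid_data)
{
    /* 1. reserve space; fail (system_limit) if the table is full            */
    if (atomic_fetch_add(&ptab_count, 1) >= TABLE_SIZE) {
        atomic_fetch_sub(&ptab_count, 1);
        return -1;
    }
    /* 2. the search starts at the last identifier used                      */
    uint_fast64_t data = atomic_load(&last_data);
    /* 3. advance the identifier and try to reserve its slot with a CAS      */
    for (;;) {
        Process *expected = NULL;
        data++;
        if (atomic_compare_exchange_strong(&ptab[data2pix(data)],
                                           &expected, RESERVED))
            break;
    }
    /* 4. update last_data, unless someone already stored a larger value     */
    for (uint_fast64_t seen = atomic_load(&last_data); seen < data; )
        if (atomic_compare_exchange_weak(&last_data, &seen, data))
            break;
    /* 5. identifier-dependent initialization of *p would go here            */
    *pid_data = data;
    /* 6. publish the structure in the slot reserved in step 3               */
    atomic_store(&ptab[data2pix(data)], p);
    return 0;
}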


Using this approach we were able to preserve the property of identifier ordering and can reuse identifiers, while at the same time improving performance and scalability. This approach has one flaw, though: there is no guarantee that the operation will terminate. This can quite easily be fixed, though, and will be fixed in a future release. We will get back to this below.

Deletion When a process terminates, we mark the process as terminated in the process structure, decrease the counter of processes in the table, and remove the reference to the process structure by writing a NULL pointer into the corresponding slot. The scheduler thread performing this then schedules a thread progress later job which will do the final cleanup and deallocate the process structure. The thread progress functionality ensures that this job will not execute until it is certain that all managed threads have dropped all references to the process structure.

6.3.3 Iteration over the Table

The erlang:processes/0 and erlang:ports/0 BIFs iterate over the tables and return the corresponding identifiers. These BIFs should return a consistent snapshot of the table content taken at some point in time while the BIF is executing. In order to implement this we use locking in an unusual way: we use an inverted rwlock.

When performing lookups in the table we do not need to bother about locking at all, but when modifying the table we read-lock the rwlock protecting the table, which allows for multiple writers during normal operation. When the BIF that iterates over the table needs access to the table, it write-locks the rwlock and reads the content of the table. The BIF does not read the whole table in one go but instead reads small chunks at a time, only write-locking while reading. The actual implementation of the BIFs is outside the scope of this document.

An out-of-the-box rwlock will typically suffer from contention on the single cache line containing the state of the rwlock, even in the case where we are only read-locking. Instead of using such an rwlock, we created our own implementation of a reader-optimized rwlock, which keeps track of reader threads in separate thread-specific cache lines. This is done in order to avoid contention on single cache lines. As long as we only do read-lock operations, threads only need to read a global cache line and modify their own cache line, thereby minimizing communication between involved processors. The iterating BIFs are normally used very infrequently, so in the normal case we will only do read-lock operations on the global rwlock of the table.
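The sketch below illustrates the general idea of such a reader-optimized rwlock using C11 atomics. It is a simplified illustration only (fixed number of reader threads, busy-waiting, no fairness handling); the actual implementation in the runtime system differs.

#include <stdatomic.h>
#include <stdalign.h>

#define MAX_READERS 64                  /* assumed maximum number of reader threads */

typedef struct { alignas(64) atomic_int locked; } reader_slot_t;

typedef struct {
    reader_slot_t readers[MAX_READERS]; /* one cache line per reader thread   */
    atomic_int    writer;               /* set while a writer holds the lock  */
} brlock_t;

void read_lock(brlock_t *l, int tid)
{
    for (;;) {
        atomic_store(&l->readers[tid].locked, 1); /* modify only our own cache line */
        if (!atomic_load(&l->writer))             /* plus one read of a shared line */
            return;
        atomic_store(&l->readers[tid].locked, 0); /* back off while a writer is active */
        while (atomic_load(&l->writer))
            ;
    }
}

void read_unlock(brlock_t *l, int tid)
{
    atomic_store(&l->readers[tid].locked, 0);
}

void write_lock(brlock_t *l)
{
    int expected = 0;
    while (!atomic_compare_exchange_weak(&l->writer, &expected, 1))
        expected = 0;                             /* wait for other writers     */
    for (int i = 0; i < MAX_READERS; i++)
        while (atomic_load(&l->readers[i].locked))
            ;                                     /* wait for readers to drain  */
}

void write_unlock(brlock_t *l)
{
    atomic_store(&l->writer, 0);
}

The point of this design is that an uncontended read-lock operation only writes to the reader's own cache line and reads one shared line, which keeps inter-processor communication to a minimum.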

6.3.4 Future Improvements

The first planned improvement is to provide a fix that guarantees that insert operations will always terminate. Currently, when the operation starts we verify that there actually exists a free slot that we can use. The problem is that we might not find it, since it may move when multiple threads modify the table at the same time as we are trying to find the slot. The easy fix is to abort the operation if an empty slot could not be found in a finite number of steps, and then restart the operation under a write lock. This will be implemented in a future Erlang/OTP release, but further work is needed in order to find a better solution.

The current, but also the previous, implementation does not work well when the table is nearly full. We get both long search times for free slots, and we reuse identifiers more frequently since we wrap more often during the search. These tables work best when the table is much larger than the number of simultaneously existing processes. One easy improvement is to always have room in the table for more processes than we allow. This will also be implemented in a future release, but further work should be done to find an even better solution.

It would also be nice to get rid of the rwlock altogether. The use of a reader-optimized rwlock makes sure we do not get any contention on the lock, but unnecessary memory barriers will still be issued due to the lock. The main issue here is to modify the iterating BIFs so that they do not require exclusive access to the table while reading a sequence of slots. In principle this should be rather easy, since the code can already handle sequences of variable sizes, so shrinking the sequence size of slots to one would solve the problem. This, however, requires some tweaks and modifications of non-trivial code and is something that will be looked at in the future.

By increasing the size of identifiers, at least on 64-bit machines (which is not as easy as it might first seem), we get further room for improvement. Besides the obvious improvement of not reusing identifiers as quickly as we currently do, it becomes possible to further avoid contention when inserting elements in the table (at least if we drop the ordering property, which is not that useful anyway).

6.4 Benchmarks

In order to test modifications of the process table we ran a couple of benchmarks where lots of processes are spawned and terminated simultaneously, and got a speedup of about 150-200% compared to the previous implementation. Running a similar benchmark but with ports we got a speedup of about 130%.

The BIF erlang:is_process_alive/1 is the closest we can get to performing only a process table lookup. The BIF looks up the process corresponding to the process identifier passed as argument and then checks if it is alive. In a benchmark that runs multiple processes looping over this BIF checking the same process, we got a speedup of about 20,000-23,000%. Conceptually this operation only involves read operations. Indeed, in the implementation used in Erlang/OTP R16B only read operations are performed, while the previous implementations needed to lock structures in order to read the data, suffering from both lock contention and contention due to modifications of cache lines used by lock-internal data structures and by the reference counter of the process being looked up.

These benchmarks were run on a relatively new machine with an Intel i7 quad-core processor with hyper-threading, using 8 schedulers. On a machine with more communication overhead and/or a larger number of logical processors, the speedups are expected to be even larger.

7 Process Management

7.1 Problems

Early versions of Erlang's runtime system with SMP support completely relied on locking in order to protect data accesses from multiple threads. In some cases this is not problematic, but in other cases it really is. It complicates the code that has to ensure that all needed locks are actually held and are acquired in such an order that no deadlocks can occur. Acquiring locks in the right order often also involves releasing locks already held, forcing threads to re-read data already read. (Besides being inefficient, this is also a good recipe for introducing subtle bugs.) Trying to use more fine-grained locking in order to increase the possible parallelism in the system makes the complexity situation even worse. Also, having to acquire a bunch of locks when doing operations often causes heavy lock contention, which in turn results in poor scalability.

In past releases of Erlang/OTP, the internal management of processes suffered from these problems. When changing the state of a process, for example from waiting to runnable, a lock on the process had to be acquired. When inserting a process into a run queue, the lock protecting the run queue had to be acquired. When migrating a process from one run queue to another, locks on both run queues and on the process had to be acquired. Actually, the last example is a quite common case during normal operation. For example, when a scheduler thread runs out of work it tries to steal work from another scheduler thread's run queue. When searching for a victim to steal from there was a lot of juggling of run queue locks involved, and the actual theft was performed by locking both run queues and the process. This is problematic, because when one scheduler runs out of work, often others do too, causing lots of lock contention.


7.2 Solution and Implementation

7.2.1 Rearranging the Process Structure

In order to avoid these situations, we wanted to be able to do most of the fundamental operations on a process without having to acquire a lock on the process. Some examples of such fundamental operations are: moving a process between run queues, detecting if we need to insert it into a run queue or not, and detecting if a process is alive or not.

All of the information needed by these operations was protected in the process structure by the process status lock, but it was spread across a number of fields. The fields used were typically state fields that could contain a small number of different states. By reorganizing this information a bit we could easily fit it into a 32-bit wide field of bit flags (only twelve flags were needed). Furthermore, by moving this information we could remove five 32-bit wide fields and one pointer field from the process structure! The move also enabled us to easily read and change the state using atomic memory operations.
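The sketch below illustrates the idea of such a combined state word manipulated with atomic operations. The flag names and the transition shown are illustrative assumptions, not the actual flags of the runtime system.

#include <stdatomic.h>
#include <stdint.h>

/* Illustrative state flags; the real VM uses twelve flags with other names. */
enum {
    PS_ACTIVE    = 1u << 0,   /* process has work to do                 */
    PS_IN_RUNQ   = 1u << 1,   /* process is enqueued in a run queue     */
    PS_EXITING   = 1u << 2,
    PS_SUSPENDED = 1u << 3
};

typedef struct {
    atomic_uint_fast32_t state;  /* one 32-bit word instead of many fields */
    /* ... rest of the process structure ...                               */
} Process;

/* Mark the process active.  The return value tells the caller whether it is
 * responsible for enqueueing the process (it was neither active nor already
 * in a run queue); no process lock is needed for this decision.            */
int make_active(Process *p)
{
    uint_fast32_t prev =
        atomic_fetch_or(&p->state, PS_ACTIVE | PS_IN_RUNQ);
    return !(prev & (PS_ACTIVE | PS_IN_RUNQ));
}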

7.2.2 Less Locking on the Run Queues

As with processes, we wanted to be able to do the most fundamental operations on a run queue without having to acquire a lock on it. The most important operation is determining if we should enqueue a process in a specific run queue or not. This involves being able to read the queue's actual load and load balancing information.

The load balancing functionality is triggered at repeated fixed intervals. Load balancing more or less strives to even out run queue lengths over the system. When balancing is triggered, information about every run queue is gathered, and migration paths and run queue length limits are set up. (Migration paths and limits stay fixed until the next balancing.) The most important information about each run queue is the maximum run queue length since the last balancing. All of this information was previously stored in the run queues themselves.

When a process has become runnable, for example due to the reception of a message, we need to determine which run queue to enqueue it in. Previously this involved locking the run queue that the process was currently assigned to while holding the status lock on the process. Depending on the load, we sometimes also had to acquire a lock on another run queue in order to determine whether the process should be migrated to that run queue or not.

In order to be able to decide which run queue to use without having to lock any run queues, we moved all fixed balancing information (i.e., migration paths and run queue limits) out of the run queues into a global memory block. Information that needs to be frequently updated, for example the maximum run queue length, was kept in the run queue, but instead of operating on this information under locks we can now use atomic memory operations when accessing it. This made it possible to first determine which run queue to use, without locking any run queues, and only then lock the chosen run queue and insert the process.

Fixed Balancing Information When determining which run queue to choose we need to read the fixed balancing information that we moved out of the run queues. This information is global, read-only between load balancing operations, but changed during a load balancing. Naturally, we did not want to introduce a global lock that needs to be acquired when accessing this information. A reader-optimized rwlock could avoid some of the overhead since the data is mostly read, but it would unavoidably cause disruption during load balancing, since this information is read very frequently. (The probability of a large disruption due to this also increases as the number of schedulers grows.)

The solution we implemented was that, instead of using a global lock protecting modifications of this information, we write a completely new version of it at each load balancing. The new version is written in a different memory block than the previous one, and is published by issuing a write memory barrier and then storing a pointer to the new memory block in a global variable using an atomic write operation. When schedulers need to read this information, they read the pointer to the currently used information using an atomic read operation, and then issue a data dependency read barrier, which on most architectures is a no-op. That is, on most architectures, getting access to this information happens with very little overhead.
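In C11 terms, this publication scheme corresponds roughly to a release store paired with a consume (in practice often acquire) load, as in the sketch below; the BalanceInfo contents and the names are illustrative assumptions.

#include <stdatomic.h>

typedef struct {
    /* migration paths, run queue limits, ... (illustrative placeholder) */
    int placeholder;
} BalanceInfo;

static _Atomic(BalanceInfo *) balance_info;

/* Load balancer: build a fresh block, then publish it.  The release store
 * plays the role of the write barrier followed by the atomic pointer write. */
void publish_balance_info(BalanceInfo *fresh)
{
    atomic_store_explicit(&balance_info, fresh, memory_order_release);
    /* the previous block is reused later, once thread progress guarantees
     * that no scheduler still references it */
}

/* Schedulers: an atomic read of the pointer plus a data-dependency barrier,
 * which is a no-op on most architectures. */
const BalanceInfo *read_balance_info(void)
{
    return atomic_load_explicit(&balance_info, memory_order_consume);
}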

Instead of allocating and deallocating memory blocks for the different versions of the balancing information, we keep old memory blocks and reuse them when it is safe to do so. In order to be able to determine when it is safe to reuse a block we use the thread progress functionality, ensuring that no threads have any references to the memory block when we reuse it.

Extra Optimizations We also implemented a test version using lock-free run queues. However, this implementation did not perform as well as the version using one lock per run queue. The reason for this was not investigated enough to draw any safe conclusions. Since the version with locks performed better we kept it, at least for the time being. The lock-free version, however, forced us to also consider other improvements, which we ended up keeping.

Previously, when a process that was in a run queue got suspended, we removed it from the queue straight away. This involved locking the process, locking the run queue, and then unlinking it from the doubly linked list implementing the queue. Removing a process from a lock-free queue gets really complicated. Instead of removing it from the queue, we just leave it in the queue and mark it as suspended. When it is later selected for execution, we check if the process is suspended, and if so we just drop it. However, during its time in the queue, it might also get resumed again, in which case we can immediately execute it when it gets selected for execution.

By keeping this part when reverting back to the implementation with the lock, we could remove a pointer field in each process structure and avoid unnecessary operations on the process and the queue which might cause contention.

7.2.3 Combined Modifications

By combining the modifications of the process state management and the run queue management, we can do large parts of the process management work related to scheduling and migration without holding any locks at all. In these situations we previously had to hold multiple locks. This of course required a lot of rewrites across large parts of the Erlang runtime system, but the rewrite both simplified code and eliminated locking in a number of places. The major benefit is, of course, reduced contention and better scalability.

7.3 Benchmark

When running the chameneosredux benchmark, schedulers frequently run out of work and try to steal work from each other; that is, they either succeed in migrating processes or try to migrate them, which is exactly the scenario we wanted to optimize. With these improvements we got a speedup of 25-35% when running this benchmark on a relatively new machine with an Intel i7 quad-core processor with hyper-threading, using 8 schedulers.

8 Port Signals

8.1 Problems

Erlang ports are conceptually very similar to Erlang processes. Erlang processes execute Erlang code in the virtual machine, while Erlang ports execute native code typically used for communication with the outside world. For example, when an Erlang process wants to communicate using TCP over the network, it communicates via an Erlang port implementing the TCP socket interface in native code. Both Erlang processes and ports communicate using asynchronous signaling. The native code executed by an Erlang port is a collection of callback functions, called a driver. Roughly, each callback implements the code of a signal to or from the port.


Even though processes and ports have conceptually always been very similar, their implementations have been very different. Originally, port signals were handled synchronously at the time they occurred. Very early in the development of the SMP support for the runtime system it was recognized that this was a big problem for signals between ports and the outside world, that is, for I/O events to and from the outside world, or I/O signals for short. This was one of the first things that had to be rewritten in order to be able to do I/O in parallel at all. The solution was to implement scheduling of these signals. I/O signals corresponding to different ports could then be executed in parallel on different scheduler threads. Signals from processes to ports were not as big a problem as I/O signals, and their implementation was left unchanged.

Each port has its own lock to protect against simultaneous execution in multiple threads. Previously, when a process executing on a scheduler thread sent a signal to a port, it tried to grab the port lock and synchronously executed the code corresponding to the signal. If the lock was busy, the scheduler thread blocked, waiting until it could grab the lock. If multiple processes executing simultaneously on different scheduler threads sent signals to the same port, schedulers suffered from heavy lock contention. Such contention could also occur between I/O signals for the port executing on one scheduler thread, and a signal from a process to the port executing on another scheduler thread. This scheme, besides suffering from contention issues, also loses the potential to execute work in parallel on different scheduler threads, as the process sending the asynchronous signal is blocked while the code implementing the signal is executed synchronously.

8.2 Solution

In order to prevent multiple schedulers from trying to execute signals to/from the same port simultaneously, we need to be able to ensure that all signals to/from a port are executed in sequence on one scheduler. One (perhaps the only) way to do this is to schedule all types of signals. Signals corresponding to a port can then be executed in sequence by a single scheduler thread. If only one thread tries to execute the port, no contention will happen on the port lock. Besides getting rid of the contention, processes sending signals to the port can also continue execution of their own Erlang code on other schedulers at the same time as the signaling code is executing on another scheduler.

8.3 Implementation

When implementing this scheme there are a couple of important properties that we either need or want to preserve:

• Signal ordering guarantee. Signals from process X to port Y must be delivered to Y in the same order as they were sent from X.

• Signal latency. Due to the previous synchronous implementation, the latency of signals sent from processes to ports has usually been very low. (During periods of contention the latency is of course increased.) Since users expect the latency of these signals to be low, a sudden increase in latency would not be appreciated.

• Compatible flow control. For a very long time ports have had the possibility to use the busy port functionality when implementing flow control. One may argue that this functionality fits very badly with the conceptually completely asynchronous signaling, but the functionality has been there for ages and is expected by Erlang users. When a port sets itself into a busy state, command signals should not be delivered, and senders of such signals should be suspended until the port sets itself into a not-busy state.

8.3.1 Scheduling of Port Signals

Each run queue has four queues for processes of different priorities and one queue for ports. The scheduler thread associated with the run queue switches evenly between execution of processes and execution of ports while both processes and ports exist in the queue. (Actually, this is not completely true, but how the implementation differs from this description is not important for what we discuss.) A port that is in a run queue also has a queue of tasks to execute. Each task corresponds to an in- or out-going signal. When the port is selected for execution, each task will be executed in sequence. The run queue locks not only protect the queues of ports, but also the queues of port tasks.

Since we go from a state where I/O signals are the only port-related signals scheduled to a state where potentially all port-related signals may be scheduled, we may drastically increase the load on the run queue lock. The number of scheduled port tasks depends on the Erlang application executing, which we do not control, and we do not want increased contention on the run queue locks. We therefore need another approach to protecting the port task queue.

Task Queue We chose a “semi-locked” approach, with one public locked task queue, and a private, lock-free, queue-like, task data structure. This semi-locked approach is similar to how the message boxes of processes are managed. The lock is port-specific and only used for the protection of port tasks, so the run queue lock now works in more or less the same way for ports as for processes. This ensures that we will not see increased lock contention on run queue locks due to this rewrite of the port functionality.

When an executing port runs out of work to execute in the private task data structure, it moves the public task queue into the private task data structure while holding the lock. Once tasks have been moved to the private data structure no lock protects them. This way the port can continue working on tasks in the private data structure without having to fight for the lock.

I/O signals may, however, be aborted. This could be solved by letting the port-specific scheduling lock also protect the private task data structure, but then the port would very frequently have to fight with others enqueueing new tasks. In order to handle this while keeping the private task data structure lock-free, we use an approach similar to the one we use when handling processes that get suspended while in the run queue. Instead of removing the aborted port task, we just mark it as aborted using an atomic memory operation. When a task is selected for execution, we first verify that it has not been aborted. If it has, we just drop the task.

A task that can be aborted is referenced via another data structure from other parts of the system, so that a thread that needs to abort the task can reach it. In order to safely deallocate a task that is no longer used, we first clear this reference and then use the thread progress functionality to ensure that no references to the task remain. Unfortunately, unmanaged threads may also abort tasks. This happens very infrequently, but it may occur. It could be handled locally for each port, but it would require extra information in each port structure that would be used very infrequently. Instead of implementing this in each port, we implemented general functionality that can be used from unmanaged threads to delay thread progress.

The private “queue-like” task data structure could have been an ordinary queue if it were not for the busy port functionality. When the port has flagged itself as busy, command signals are not allowed to be delivered and need to be blocked. Other signals sent from the same sender following a command signal that has been blocked also have to be blocked; otherwise, we would violate the ordering guarantee. At the same time, other signals that have no dependencies on blocked command signals are expected to be delivered.

The above requirements make the private task data structure rather complex. It has a queue of unprocessed tasks and a busy queue. The busy queue contains blocked tasks corresponding to command signals and tasks with dependencies on such tasks. The busy queue is accompanied by a table of blocked tasks, indexed by sender, with a reference to the last task in the busy queue from each specific sender. This is needed since we have to check for dependencies when new tasks are processed in the queue of unprocessed tasks. When a new task that needs to be blocked is processed, it is not enqueued at the end of the busy queue, but instead directly after the last task with the same sender. This makes it easy to detect when there are tasks that no longer have any dependencies on tasks corresponding to command signals and should be moved out of the busy queue. When the port executes, it switches between processing tasks from the busy queue and processing directly from the unprocessed queue, based on its busy state. When processing directly from the unprocessed queue, the port might of course have to move a task into the busy queue instead of executing it.

Busy Port Queue Since it is the port itself which decides when it is time to enter a busy state, it needs to be executing in order to enter this state. As a result of command signals being scheduled, we may get into a situation where the port gets flooded by a huge number of command signals before it even gets a chance to set itself into a busy state, because it has not yet been scheduled for execution. That is, under these circumstances the busy port functionality loses the flow control properties it was intended to provide.

In order to solve this problem, we introduced a new busy feature, named busy port queue. Each port has a limit on the amount of command data that is allowed to be enqueued in the task queue. When this limit is reached, the port automatically enters a busy port queue state. While in this state, senders of command signals will be suspended, but command signals will still be delivered to the port, unless it is also in a busy port state. This limit is known as the high limit.

There is also a low limit. When the amount of queued command data falls below this limit and the port is in a busy port queue state, the busy port queue state is automatically disabled. The low limit should typically be significantly lower than the high limit in order to prevent frequent oscillation around the busy port queue state.

With the introduction of this new busy state we can provide the old flow control, which means that old drivers do not have to be changed. The limits can, however, be configured and even disabled by the port. By default the high limit is 8 KB and the low limit is 4 KB.
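The watermark behaviour can be summarized by the sketch below; the field names and the helper function are illustrative, and in the real system the limits are configurable per port rather than compile-time constants.

#include <stddef.h>

#define HIGH_LIMIT (8 * 1024)   /* enter the busy port queue state  */
#define LOW_LIMIT  (4 * 1024)   /* leave the busy port queue state  */

typedef struct {
    size_t queued_cmd_bytes;    /* command data waiting in the task queue       */
    int    busy_port_queue;     /* 1: senders of command signals are suspended  */
} Port;

void update_busy_port_queue(Port *p)
{
    if (!p->busy_port_queue && p->queued_cmd_bytes >= HIGH_LIMIT)
        p->busy_port_queue = 1;       /* flow control kicks in                  */
    else if (p->busy_port_queue && p->queued_cmd_bytes < LOW_LIMIT)
        p->busy_port_queue = 0;       /* well below the high limit: resume      */
}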

8.3.2 Preparation of Signal Send

Previously, all operations sending signals to ports began by acquiring the port lock, then performed preparations for sending the signal, and finally sent the signal. The preparations typically included inspecting the state of the port and preparing the data to pass along with the signal. The preparation of data is frequently quite time consuming and does not really depend on the port. That is, we would like to do it without holding the port lock.

In order to improve this, we reorganized the state information in the port structure so that we can access it using atomic memory operations. This, together with the new port table implementation, enabled us to look up the port and inspect its state before acquiring the port lock, which in turn made it possible to prepare the signal data before acquiring the port lock.

8.3.3 Preserving Low Latency

If we disregard the contended cases, we will inevitably get a higher latency when scheduling signals for later execution than when executing the signal immediately. In order to preserve the low latency, we now first check whether this is a contended case or not. If it is, we schedule the signal for later execution; otherwise, we execute the signal immediately. We are in a contended case if other signals are already scheduled on the port, or if we fail to acquire the port lock. That is, we never block waiting for the lock.
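A sketch of this decision is shown below; all helper functions are hypothetical names used only to illustrate the control flow described above.

typedef struct Port Port;

/* Hypothetical helpers, assumed for the sketch only. */
extern int  port_has_scheduled_tasks(Port *p);
extern int  port_trylock(Port *p);           /* nonzero on success            */
extern void port_unlock(Port *p);
extern void execute_signal_now(Port *p, void *sig);
extern void schedule_signal(Port *p, void *sig);

void deliver_port_signal(Port *p, void *sig)
{
    /* Contended case: tasks already scheduled, or the lock is taken;
     * never block waiting for the port lock. */
    if (port_has_scheduled_tasks(p) || !port_trylock(p)) {
        schedule_signal(p, sig);
        return;
    }
    execute_signal_now(p, sig);   /* uncontended: keep the old low latency */
    port_unlock(p);
}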

With this implementation we preserve the low latency at the expense of lost potential for parallel execution of the signal and of other code in the process sending the signal. This default behaviour can, however, be changed on a per-port basis or system wide, forcing scheduling of all signals from processes to ports, except those that are part of a synchronous communication, that is, an unconditional request/response pair of asynchronous signals. In that case, there is no potential for parallelism and thus no point in forcing scheduling of the request signal.

The immediate execution of signals may also cause a scheduler that is about to execute scheduled tasks to block waiting for the port lock. This is, however, more or less the only scenario where a scheduler needs to wait for the port lock. The maximum time it has to wait is the time it takes to execute one signal, since we always schedule signals when contention occurs.


8.3.4 Signal Operations

Besides implementing the functionality enabling the scheduling and the preparation of signal data without the port lock, each operation sending signals to ports had to be quite extensively rewritten. This was done in order to move all sub-operations that can be done without the lock to before the lock is acquired, and also because signals are now sometimes executed immediately and sometimes scheduled for execution at a later time, which puts different requirements on the data passed along with the signal.

8.4 Benchmarks

When running some simple benchmarks where contention only occurs due to I/O signals contending with signals from one single process, we got a speedup of 5-15%. When multiple processes send signals to one single port the improvements can be much larger, but the scenario with one process contending with I/O is very common. The benchmarks were run on a relatively new machine with an Intel i7 quad-core processor with hyper-threading, using 8 schedulers.

9 Code Loading

9.1 Problem

Earlier, when an Erlang code module was loaded, all other execution in the VM was halted while the load operation was carried out in single-threaded mode. This might not have been a big problem for the initial loading of modules during VM boot, but it could be a severe problem for availability when upgrading modules or adding new code on a VM with running payload. This problem grows with the number of cores, as both the time it takes to wait for all schedulers to stop and the potential amount of halted ongoing work increase.

9.2 Solution

Starting with Erlang/OTP R16B (April 2013), modules are loaded without blocking the VM. Erlang processes may continue executing undisturbed in parallel during the entire load operation. The code loading is carried out by a normal Erlang process that is scheduled like all the others. The load operation is completed by making the loaded code visible to all processes in a consistent way with one single atomic instruction. Non-blocking code loading improves the real-time characteristics of applications when modules are loaded or upgraded on a running SMP system.

9.3 Implementation

9.3.1 The Load Phases

The loading of a module is divided into two phases: a prepare phase and a finishing phase. The prepare phase consists of reading the BEAM file format and doing all the preparations of the loaded code that can easily be done without interfering with the running code. The finishing phase makes the loaded (and prepared) code accessible from the running code. Old module versions (replaced or deleted) are also made inaccessible by the finishing phase.

The prepare phase is designed to allow several “loader” processes to prepare separate modules in parallel, while the finishing phase can only be done by one loader process at a time. A second loader process trying to enter the finishing phase will be suspended until the first loader is done. This only blocks the process; the scheduler is free to schedule other work while the second loader is waiting. (See erts_try_seize_code_write_permission and erts_release_code_write_permission.)

The ability to prepare several modules in parallel is not currently used, as almost all code loading is serialized by the code server process. The BIF interface is, however, prepared for this. The Erlang API of the load phases is shown below.


Figure 5: An upgraded BEAM module just before it is committed. The figure shows the current and the newly prepared BEAM code for module foo, the active and staging export tables, and the global atomic variables the_active_code_index and the_staging_code_index.

erlang:prepare_loading(Module, Code) -> LoaderState

erlang:finish_loading([LoaderState])

The idea is that prepare_loading could be called in parallel for different modules and returns a “magic binary” containing the internal state of each prepared module. Function finish_loading takes a list of such states and does the finishing of all of them in one go. Currently we use the legacy BIF erlang:load_module, which is now implemented in Erlang by calling the above two functions in sequence. Function finish_loading is currently limited to accepting only a list with the loader state of one module, as we do not yet use the multiple-module loading feature.

9.3.2 The Finishing Sequence

During VM execution, code is accessed through a number of data structures. These code access structures are:

• Export table. Contains one entry for every exported function.

• Module table. Contains one entry for each loaded module.

• “beam_catches”. Identifies jump destinations for catch instructions.

• “beam_ranges”. Maps code addresses to functions and lines in the source file.

The export table is the most frequently used of these structures, since it is accessed at run time for every executed external function call to get the address of the callee. For performance reasons, we want to access all these structures without any overhead from thread synchronization. Earlier this was solved with an emergency brake: stop the entire VM to mutate these code access structures, and otherwise treat them as read-only.

The solution we implemented in Erlang/OTP R16B is instead to replicate the code access structures. We have one set of active structures read by the running code; see Figure 5. When new code is loaded, the active structures are copied, the copy is updated to include the newly loaded module, and then a switch is made to make the updated copy the new active set. The active set is identified by a single global atomic variable called the_active_code_index. The switch can thus be made by a single atomic write operation. The running code has to read this atomic variable when using the active access structures, which means that it must execute one atomic read operation per external function call. The performance penalty from this extra atomic read is, however, very small, as it can be done without any memory barriers at all; see below. With this solution we also preserve the transactional feature of a load operation: running code will never see the intermediate result of a partly loaded module.

The finishing phase is carried out in the following sequence by the BIF erlang:finish_loading:

1. Seize exclusive code write permission (suspend the process, if needed, until the permission is granted).

2. Make a full copy of all the active access structures. This copy is called the staging area and is identified by the global atomic variable the_staging_code_index.

3. Update all access structures in the staging area to include the newly prepared module.

4. Schedule a thread progress event, that is, a time in the future when all schedulers have yielded and executed a full memory barrier.

5. Suspend the loader process.

6. After thread progress, commit the staging area by assigning the value of the_staging_code_index to the_active_code_index.

7. Release the code write permission allowing other processes to stage new code.

8. Resume the loader process, allowing it to return from erlang:finish_loading.

Thread Progress The waiting for thread progress in steps 4-6 is necessary in order for processes to read the_active_code_index atomically during normal execution without any expensive memory barriers. When we write a new value into the_active_code_index in step 6, we know that all schedulers will see an updated and consistent view of all the new active access structures once they become reachable through the_active_code_index.

The total lack of memory barriers when reading the_active_code_index has one interesting consequence, however. Different processes may see the new code at different points in time, depending on when different cores happen to refresh their hardware caches. This may sound unsafe, but it actually does not matter. The only property we must guarantee is that the ability to see the new code must spread with process communication. After receiving a message that was triggered by new code, the receiver must be guaranteed to also see the new code. This is guaranteed, as all types of process communication involve memory barriers in order for the receiver to be sure to read what the sender has written. This implicit memory barrier will then also make sure that the receiver reads the new value of the_active_code_index and thereby also sees the new code. This is true for all kinds of inter-process communication (TCP, ETS, process name registering, tracing, drivers, NIFs, etc.), not just Erlang messages.
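The sketch below illustrates the barrier-free read of the active code index on an external function call, using C11 atomics; the structure layout and the names are simplified assumptions.

#include <stdatomic.h>

#define NUM_CODE_IX 3                  /* three generations, reused round robin */

typedef struct {
    void *addressv[NUM_CODE_IX];       /* one code address per generation       */
} Export;

static atomic_int the_active_code_index;

/* Executed for every external function call: one relaxed atomic read of the
 * active code index (no memory barrier), followed by an indexed load.        */
void *export_get_address(const Export *e)
{
    int ix = atomic_load_explicit(&the_active_code_index,
                                  memory_order_relaxed);
    return e->addressv[ix];
}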

Code Index Reuse To optimize the copy operation in step 2, code access structures are reused. In the current solution we have three sets of code access structures, identified by a code index of 0, 1 and 2. These indexes are used in a round-robin fashion. Instead of having to initialize a completely new copy of all access structures for every load operation, we just have to update it with the changes that have happened since the last two code load operations. We could get by with only two code indexes (0 and 1), but that would require yet another round of waiting for thread progress before step 2 in the finish_loading sequence. We cannot start reusing a code index as staging area until we know that no lingering scheduler thread is still using it as the active code index. With three generations of code indexes, the waiting for thread progress in steps 4-6 gives us this guarantee. Thread progress will wait for all running schedulers to reschedule at least one time. No ongoing execution can still be reading code access structures reached from an old value of the_active_code_index after a second round of thread progress.

The design choice between two or three generations of code access structures is a trade-off between memory consumption and code loading latency.


Ensuring a Consistent Code View Some native BIFs may need to get a consistent snapshot view of the active code. To do this, it is important to read the_active_code_index only once and then use that index value for all code accesses during the BIF. If another load operation is executed in parallel, reading the_active_code_index a second time might result in a different value, and thereby a different view of the code.

10 Trace Setting

10.1 Problem

Earlier, when trace settings were changed by erlang:trace_pattern, all other execution in the VM was put to a halt while the trace operation was carried out in single-threaded mode. As with code loading, this can be a severe problem for availability, and the problem grows with the number of cores.

10.2 Solution

Starting with Erlang/OTP R16B, trace breakpoints are set without blocking the VM, and Erlang processes may continue executing undisturbed in parallel during the entire operation. The same base technique as for code loading is used, namely that a staging area of breakpoints is prepared and then made active with a single atomic operation.

10.3 Implementation

To make it easier to manage breakpoints without needing to resort to single-threaded mode, a redesign of the breakpoint mechanism has been made. The old breakpoint wheel data structure, which has been in the runtime system since even before it was extended with SMP support, was a circular doubly-linked list of breakpoints for each instrumented function. To support it in the SMP emulator it was essentially expanded to one breakpoint wheel per scheduler. As more breakpoint types were added, its implementation had become messy and hard to understand and maintain.

In the new design the old breakpoint wheel was dropped and replaced by one structure (GenericBp) that holds the data for all types of breakpoints for each instrumented function. A bit-flag field is used to indicate which types of break actions are enabled.

Even though trace_pattern uses the same technique as the non-blocking code loading, i.e., one with replicated generations of data structures and an atomic switch, the two implementations are quite separate from each other. One initial idea was to use the existing mechanism of code loading to do a dummy load operation that would make a copy of the affected modules. That copy could then be instrumented with breakpoints and made reachable with the same atomic switch that code loading uses. This approach seems straightforward but has a number of shortcomings, one being the large memory footprint when many modules are instrumented. Another problem is how execution is supposed to reach the new instrumented code. Normally, loaded code can only be reached through external function calls. Trace settings must be activated instantaneously without the need of external function calls.

The solution for tracing we ended up implementing is instead to apply the technique of replication to the data structures for breakpoints. Two generations of breakpoints are kept, identified by indexes 0 and 1. The global atomic variable erts_active_bp_index determines which generation of breakpoints will be used by running code.
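
The following C declarations sketch this double-generation layout, following the fields shown in Figure 6 (orig_instr, flags[2], data[2]). They are simplified and illustrative; the GenericBpData placeholder and the declaration of erts_active_bp_index do not match the exact ERTS definitions.

    /* Simplified sketch of the double-generation breakpoint data. */
    #include <stdatomic.h>

    typedef unsigned long BeamInstr;

    typedef struct {
        int enabled;    /* placeholder for counters, tracer, match specs, ... */
    } GenericBpData;

    typedef struct {
        BeamInstr     orig_instr;  /* the original first instruction of the function */
        unsigned int  flags[2];    /* enabled break actions, one set per generation  */
        GenericBpData data[2];     /* breakpoint settings, one set per generation    */
    } GenericBp;

    /* Which generation running code uses; switched atomically on commit. */
    static _Atomic unsigned erts_active_bp_index;   /* 0 or 1 */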

10.3.1 Atomicity Without Atomic Operations

Not using the code loading generations (or any other code duplication) means that trace_pattern must at some point write to the active BEAM code in order for running processes to reach the staged breakpoint structures. This can be done with one single atomic write operation per instrumented function. The BEAM bytecode instructions are, however, read with normal memory loads and not through the atomic API. The only guarantee we need is that the written instruction word is seen as atomic: either fully written or not at all. Conveniently, this is true for word-aligned write operations on all hardware architectures we currently use.

Without a breakpoint, a function starts with its func_info header followed by the first and second instruction words:

    [-4] Unused
    [-3] Module name
    [-2] Function name
    [-1] Function arity
         1st instruction word
         2nd instruction word

With a breakpoint, the first instruction word is replaced by op_i_generic_breakpoint and the otherwise unused header word holds a pointer to the breakpoint data:

    [-4] GenericBp*  -->  { BeamInstr orig_instr; int flags[2]; GenericBpData data[2]; }
    [-3] Module name
    [-2] Function name
    [-1] Function arity
         op_i_generic_breakpoint
         2nd instruction word

Figure 6: BEAM instruction layout of a function without (left) and with (right) a breakpoint.

10.3.2 Adding a New Breakpoint

This is a simplified sequence describing what the implementation of function erlang:trace_pattern does when adding a new breakpoint.

1. Seize exclusive code write permission (possibly suspending the process until the permission is granted).

2. Allocate breakpoint structure GenericBp including both generations. Set the active part as disabled with a flag field containing zeros. Save the original instruction word in the breakpoint; see also Figure 6.

3. Write a pointer to the breakpoint at offset -4 from the first instruction in the function. This is an otherwise unused word in the func_info header.

4. Set the staging part of the breakpoint as enabled with the specified breakpoint data.

5. Wait for thread progress.

6. Write an op_i_generic_breakpoint as the first instruction for the function. This instruction will execute the breakpoint that it finds at offset -4.

7. Wait for thread progress.

8. Commit the breakpoint by switching erts_active_bp_index.

9. Wait for thread progress.

10. Prepare for the next call to trace_pattern by updating the new staging part (the old active) of the breakpoint to be identical to the new active part.

11. Release the code write permission and return from trace_pattern.

The code write permission “lock” seized in step 1 is the same as the one used by code loading. It ensures that only one process at a time can stage new trace settings, but it also prevents concurrent code loading and makes sure we see a consistent view of the BEAM code during the entire sequence.

Note that between steps 6 and 8, running processes might execute the op_i_generic_breakpoint instruction written in step 6. They will get the breakpoint structure written in step 3, read erts_active_bp_index and execute the corresponding part of the breakpoint. Before the switch in step 8 becomes visible, however, they will execute the disabled part of the breakpoint structure and do nothing other than executing the saved original instruction.
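
To make this window concrete, here is a hedged sketch of what executing op_i_generic_breakpoint conceptually does, using the types from the earlier GenericBp sketch; the function name and the way actions are dispatched are illustrative, not the exact ERTS code.

    /* Sketch: dispatching a generic breakpoint.  The GenericBp* is found at
     * offset -4 from the function's first instruction word (Figure 6). */
    static BeamInstr hit_generic_breakpoint(GenericBp *bp)
    {
        unsigned ix    = atomic_load_explicit(&erts_active_bp_index,
                                              memory_order_relaxed);
        unsigned flags = bp->flags[ix];

        if (flags != 0) {
            /* perform the enabled break actions for generation ix,
             * using bp->data[ix] (call trace, call count, call time, ...) */
        }

        /* Between steps 6 and 8 the active generation is still the disabled
         * one (flags all zero), so nothing happens except continuing with
         * the saved original instruction. */
        return bp->orig_instr;
    }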


10.3.3 Updating and Removing Breakpoints

The above sequence only described how a new breakpoint is added. For updating the settings of an existing breakpoint, we perform basically the same sequence, except that steps 2, 3 and 6 can be skipped as they have already been done.

To remove a breakpoint some more steps are needed. The idea is to first stage the breakpoint as disabled, do the switch, wait for thread progress, and then remove the disabled breakpoint by restoring the original BEAM bytecode instruction.

Here is a more complete sequence that covers adding, updating and removing breakpoints.

1. Seize exclusive code write permission (possibly suspending the process until the permission is granted).

2. Allocate new breakpoint structures with a disabled active part and the original BEAM instruction. Write a pointer to the breakpoint in the func_info header at offset -4.

3. Update the staging part of all affected breakpoints. Disable breakpoints that are to be removed.

4. Wait for thread progress.

5. Write an op_i_generic_breakpoint as the first instruction for all functions with new breakpoints.

6. Wait for thread progress.

7. Commit all staged breakpoints by switching erts_active_bp_index.

8. Wait for thread progress.

9. Restore the original BEAM instruction for disabled breakpoints.

10. Wait for thread progress.

11. Prepare for the next call to trace_pattern by updating the new staging area (the old active) for all enabled breakpoints.

12. Deallocate disabled breakpoint structures.

13. Release code write permission and return from trace_pattern.

The reader might be wondering about the four rounds of waiting for thread progress in the above sequence. In the code loading sequence we accepted the memory overhead of a third generation in order to avoid a second round of thread progress. The latency of trace_pattern should not be such a big problem for applications, however, as it is normally not part of production code.

The waiting in step 4 is needed to ensure that all threads will see an updated view of the breakpoint structures once they become reachable through the op_i_generic_breakpoint instruction written in step 5. The waiting in step 6 is there to make the activation of the new trace settings as atomic as possible. Different cores might see the new value of erts_active_bp_index at different times, as it is read without any memory barrier, but this is the best we can do without more expensive thread synchronization. The waiting in step 8 ensures that we do not restore the original BEAM instructions for disabled breakpoints until we know that no thread is still accessing the old enabled part of a disabled breakpoint. Finally, the waiting in step 10 ensures that no lingering thread is still accessing disabled breakpoint structures that will be deallocated in step 12.

10.3.4 Global Tracing

Call tracing with the global option only affects external function calls. This was earlier handled by inserting a special trace instruction in export entries, without the use of breakpoints. With the new non-blocking tracing we want to avoid special handling for global tracing and make use of the staging and atomic switching within the breakpoint mechanism. The solution was to create the same type of breakpoint structure for a global call trace. The difference to local tracing is that we insert the op_i_generic_breakpoint instruction (with its pointer at offset -4) in the export entry rather than in the code.


10.3.5 Future Work

Despite these improvements, the runtime system still enters single-threaded mode when new code is loaded for a module that is traced, or when loading code while a default trace pattern is set. This is a limitation that is possible to fix in a future Erlang/OTP release, but it requires much closer cooperation between the tracing and the loader BIFs.

11 Term Sharing

11.1 Problems

In programming language implementations, one of the most important design decisions concerns the underlying representation of terms. In functional languages with immutable terms, the runtime system can choose to preserve sharing of subterms or destroy sharing and expand terms to their flattened representation during certain key operations. Both options have pros and cons. The implementation of Erlang in the Erlang/OTP system has so far opted for an implementation where sharing of subterms is not preserved when terms are copied (e.g., when sent from one process to another or when used as arguments in spawns).

User experience supports the argument that indiscriminately flattening terms during copying is not a good idea for a language like Erlang. In extreme cases, this causes the Erlang compiler to crash, as the flattening of constant terms is included as an optimization. In less extreme cases, adding an io:format call in a program (which would print only a few characters) results in an “out of memory” exception.1 In all cases, the loss of sharing of common terms when messages are sent is a waste of memory and may introduce arbitrary performance slowdowns. An Erlang/OTP designed with scalability in mind would have as one of its first priorities to optimize the implementation of message passing (the only mechanism for process communication); this clearly suggests that the preservation of term sharing is very important.

11.2 Solution

In a recent publication that reports on work accomplished within WP2 [14], we proposed a mechanism for a sharing-preserving copying algorithm in Erlang/OTP and described in detail its implementation, which for the moment is publicly available from [email protected]:nickie/otp.git (branch preserve-sharing). We quantified its overhead in extreme cases where no sharing is involved and in a variety of benchmark programs, and showed that the implementation has a reasonable overhead which is negligible in practice.

The new copying algorithm takes advantage of Erlang/OTP’s tagging scheme, i.e., the way in which Erlang terms are represented in the runtime system, to squeeze information about sharing within the terms themselves, as they are being copied. In this way, sharing-preserving copying is possible with only very little extra memory. The algorithm works in two passes, each one traversing the term to be copied and visiting each subterm exactly once. During the first traversal, the size of the term is measured and sharing information is collected; this information is stored in the original term, to save memory. During the second traversal, subterms are actually copied, based on the sharing information; furthermore, the original term is restored to its proper form, with the sharing information being deleted.

11.3 Implementation

11.3.1 Erlang/OTP’s Tagging Scheme

We begin by describing how terms are represented in the Erlang Runtime System (ERTS). In this description, we only go into as much detail as necessary for making this section self-contained. A detailed description of the staged tagging scheme that is used by ERTS, although a bit outdated, is given in a technical report by Pettersson [15], who also provides a brief rationale and shows the historical evolution of the tagging scheme.

1We refer the interested readers to the published paper [14] for detailed code examples showing such situations.

(*) The representation of binaries is more complicated and not accurately depicted in “Anything else”.

Figure 7: The representation of Erlang terms.

An Erlang term is represented in ERTS as a word.2 The two least significant bits of this word are the term’s primary tag. They are used in the following way (see Figure 7):

• If the primary tag is 11, the term is an immediate value. The remaining bits of the term provide the value itself, using a secondary tag (and possibly a tertiary one). The most common immediate values are: small integer numbers, atoms, process identifiers, port identifiers, and the empty list [].

• If the primary tag is 01, the term is a cons cell (a list element). The remaining bits of the term (after clearing the primary tag) provide a pointer to the memory location where the cons cell is stored. The size of the cons cell is two words, in which two terms are stored: the head and the tail of the list. Notice that the tail need not be a list: Erlang supports improper lists, e.g., [1|2] makes a cons cell that contains 1 and 2 in its two words.

• If the primary tag is 10, the term is a boxed object. The remaining bits of the term (after clearing the primary tag) provide a pointer to the memory location where the boxed object is stored. The contents of this memory depend on what the boxed object really is, but the first word always contains the object’s header. Boxed objects are used for representing the following: tuples, big integer numbers and floating-point numbers, binaries, external process identifiers, ports and references, and function closures.

Notice that the primary tag of an Erlang term cannot be 00; this tag is only used in the header word of boxed objects.3 Notice also that there is a special header word, called THE_NON_VALUE, which is not used as a header in boxed objects and has one more interesting property: it does not represent a legal pointer to a memory location.
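
A small C sketch of these primary-tag conventions (00 header, 01 cons cell, 10 boxed, 11 immediate); the macro names are illustrative and do not match the exact ERTS definitions.

    /* Illustrative primary-tag helpers; an Eterm is one machine word. */
    typedef unsigned long Eterm;

    #define PRIMARY_TAG(t)    ((t) & 0x3UL)
    #define TAG_HEADER        0x0UL   /* only in header words of boxed objects */
    #define TAG_CONS          0x1UL   /* pointer to a two-word cons cell       */
    #define TAG_BOXED         0x2UL   /* pointer to a boxed object             */
    #define TAG_IMMEDIATE     0x3UL   /* value stored in the word itself       */

    /* Clearing the primary tag yields the pointer for cons cells and boxes. */
    #define UNTAG_POINTER(t)  ((Eterm *)((t) & ~0x3UL))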

2This is not entirely accurate. The so-called “halfword” virtual machine uses only half a word (32 bits) for term representation on 64-bit computer architectures, to provide faster execution [11]. However, this is a technical issue which does not affect the results we present here, although it complicates the implementation slightly.

3Strictly speaking, a primary tag of 00 has some more uses in words that are not Erlang terms (e.g., various pointers in stack frames) and is also used during garbage collection.


The representation of boxed objects is probably the most complicated part of term representation in Erlang/OTP. The next four least significant bits, after the primary tag of 00, form a secondary tag in the header word which reveals the nature of the boxed object. The remaining bits are used to represent the boxed object’s size,4 which is a natural number n. The secondary tag is used in the following way:

• If the secondary tag is 0000, the object is a tuple of size n. Its elements are Erlang terms and are stored in n words, following the header.

• If the secondary tag is 0101, the object is a function closure. The next n words contain information about the function to be called (e.g., the address of its code in memory). They also contain one word that represents the number m of free variables used by the function closure. The next m + 1 words contain: in the first word, an Erlang term that is the process identifier of the process that created the function closure, and in the next m words, Erlang terms that contain the values of the m free variables of the closure.

• The other values of the secondary tag correspond to boxed objects that do not contain Erlang subterms in them. In most cases (a notable exception is binaries, whose representation is quite complicated) the size of the boxed object is n + 1.
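
A sketch of decoding a boxed object’s header along these lines (2-bit primary tag, 4-bit secondary tag, size n in the remaining bits); the exact bit positions and tag values in ERTS differ in detail, so the constants below are illustrative only.

    /* Illustrative header decoding for boxed objects. */
    #define SECONDARY_TAG(hdr)  (((hdr) >> 2) & 0xFUL)
    #define HEADER_SIZE(hdr)    ((hdr) >> 6)          /* the size n */

    #define TAG_TUPLE           0x0UL   /* 0000: n element words follow the header */
    #define TAG_FUN             0x5UL   /* 0101: function closure                  */

    /* e.g. a tuple of size n occupies 1 + n words: the header plus n elements. */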

11.3.2 Copying and Term Sharing

When copying a term, the Erlang VM traverses the term twice. During the first traversal, the flat size of the term is calculated (function size_object in erts/emulator/beam/copy.c). Then, the space necessary for holding the copy is allocated (e.g., on the heap of the recipient process, on the heap of a newly spawned process or in a temporary message buffer). Finally, a second traversal creates a flat copy of the term in the allocated space (function copy_struct in erts/emulator/beam/copy.c).5

The new algorithm for creating a sharing-preserving copy of a term again uses two traversals, in the same spirit. To be precise, it uses two traversals that visit each shared subterm only once, in contrast to flat traversals. In our new implementation, the first one is implemented by function copy_shared_calculate and the second by function copy_shared_perform.

Design Issues For creating a sharing-preserving copy, we need to know which subterms are shared. This can be accomplished by keeping track of visited subterms, when traversing, and using forwarding pointers instead of copying anew when a subterm is revisited. The idea is obviously not new: copying garbage collectors work in a similar way [4], as do the implementations of marshalling/serialization routines for data structures [3, 8, 9, 12, 13, 16, 17].

During traversal, two pieces of information must be kept for each subterm: (a) whether the subterm has been visited or not, and (b) the forwarding pointer that will be used for avoiding multiple copies. A major design issue in the copying algorithm is where this information will be stored. One option is to store it in a separate lookup table, e.g., a hash map. This option unfortunately requires extra memory proportional to the size of the original term and imposes a (probably non-negligible) run-time overhead.

Another option is to store (parts of) the information inside the subterms, if the term representation permits; this is what copying garbage collectors usually do. By storing information inside the subterms when copying, however, we are altering the original term and this causes two problems:

1. We obviously have to restore the term after the copying takes place, and therefore a second traversal is unavoidable. (Copying garbage collectors do not have this problem, as the original term is discarded after being copied.)

2. We have to make sure that no part of the original term will be accessed during the time that we copy it (e.g., by another process running simultaneously or by the garbage collector).

4The Erlang/OTP code calls arity what we prefer to call size in this section.

5In principle, one traversal suffices for creating a flat copy of a term. However, such an implementation would have to allocate the necessary space incrementally, during the traversal, checking all the time whether there is enough space. Whenever space does not suffice, the implementation needs to ensure that the garbage collector never sees partially copied terms. The Erlang runtime system implementation has opted for the simpler (and arguably more efficient) strategy that uses two traversals.

We can easily deal with the first problem, as we will be traversing the term twice anyway. During the first traversal, we count the term’s size and at the same time we flag subterms as visited and store sharing information in them. During the second traversal, we copy the term and at the same time restore the original contents.

The second problem is much harder to deal with in a highly concurrent language like Erlang, at least in a general way. Let us identify a set of facts and assumptions that currently hold for Erlang/OTP, which are important for the validity of our approach. Let P be the process that copies the term t, e.g., the one sending t as a message to another process.

A1. All tagged pointers contained in t and all of its subterms will point either to objects in P’s heap, or to objects outside P’s heap that are globally accessible and do not need to be copied, e.g., constants in the module’s constant pool.

A2. The heap of process P cannot be accessed by any other process, running concurrently.

A3. Copying takes place atomically per scheduler, i.e., when P starts copying a term t it cannot be stopped before the copying finishes. Also, during the copying, the heap of P cannot be garbage collected.

Based on these assumptions, we have devised a copying algorithm that only alters subterms located in P’s heap (we will call these subterms and the objects that they point to “local”) and avoids copying subterms that are outside of it; for all such non-local subterms, only pointers are copied. In this way, better sharing of subterms is achieved: constants can be shared between different copied terms.

Term Mangling After making sure that altering the original term cannot do harm to program execution, we are faced with the problem of how exactly to alter the original term, in order to incorporate sharing information. It is clear that, given Erlang/OTP’s representation of terms, it is not possible to find room for storing forwarding pointers inside heap objects and then to be able to restore their original contents. We will instead have to store forwarding pointers in an external lookup table (we will call it the sharing table) but, for efficiency reasons, we would like to use space in that table only for subterms that are really shared, not to store forwarding pointers for every copied subterm.

During the two traversals, each heap object (cons cell or boxed object) can be in one of four different states at any given time:

• original: the first traversal has not yet visited the object;

• visited: the first traversal has visited the object exactly once;

• shared, unprocessed: the first traversal has visited the object at least twice but the second traversal has not yet visited the object;

• shared, processed: the first traversal has visited the object at least twice and the second traversal at least once.

The state of a heap object can change during the copying according to the transition diagram shown in Figure 8. Each transition is labeled by the traversal during which it may occur. Between the two traversals, all objects corresponding to local subterms will be either visited or shared and unprocessed. After the second traversal, all objects will have been restored to the original state.
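
For reference, the four states can be summarized in a small enum; this is purely descriptive, since the actual encoding of the states inside heap objects is the mangling scheme of Figure 9.

    /* Descriptive only: the states a local heap object goes through during
     * the two traversals (see the transition diagram in Figure 8). */
    enum copy_state {
        ORIGINAL,             /* not yet visited by the first traversal       */
        VISITED,              /* visited exactly once by the first traversal  */
        SHARED_UNPROCESSED,   /* visited at least twice, copy not yet created */
        SHARED_PROCESSED      /* visited at least twice, copy already created */
    };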

For each local object, we need to distinguish between these four states and we would be happy to store these (two bits of) information inside the object itself; if we achieve this, then only the forwarding pointers of shared and processed objects will have to be stored in the sharing table, during the second traversal. We will refer to this squeezing of two bits of information inside a heap object as “mangling”.

Figure 8: State transition diagram for heap objects.

Mangling boxed objects is relatively easy, as the object header is bound to have a primary tag of 00 in the original state. We can use the other three combinations of the primary tag’s two bits to denote the other three states, as shown in Figure 9. For visited objects, the remaining part of the header will be left unchanged. However, when a shared object is found, we can allocate a new entry for it in the sharing table and replace the remaining contents of the header with a pointer to this entry. Of course, we must store the original contents of the header in the entry itself, so as to be able to restore it later.

The situation is much more complicated for cons cells, which are simply too small for easily squeezing in them the extra two bits of information. We only need to be able to distinguish between original, visited and shared; once a cons cell becomes shared, we can allocate an entry in the sharing table and then we have all the space we need. At first look, there is no room for this information. Looking closer, however, we see that neither of the two terms in the cons cell (the head or the tail) can have a primary tag of 00. This observation gives us the mangling scheme in Figure 9. Cons cells whose tail is a list or an immediate value should be the most common (the cells of all proper lists are like this). We can encode the visited state for such cells by replacing the primary tag of the CAR or that of the CDR with 00, without losing information. This leaves us with only one option for visited cells with a boxed tail: to replace the primary tags of both the CAR and the CDR with 00. In this way, we lose two bits of information: when we restore a visited cell, we won’t know the original primary tag of the CAR. We must again store these two bits of information externally. However, we do not really need a table for this purpose; a stored sequence of bits is enough, as long as we make sure that, during the second traversal, the order in which we visit subterms is the same as in the first traversal. We call this sequence of bits the “bit store” and we notice that its size should normally be very small: just two bits for each subterm that happens to be a list with a boxed tail, e.g., something like [1|{ok,2}] or [3|<<4>>] in Erlang syntax.
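
A minimal sketch of such a FIFO bit store, with hypothetical names; it records the lost primary tag of the CAR (two bits at a time) in first-traversal order and replays it in the same order during the second traversal. Growing and zero-initializing the byte array is omitted.

    /* Sketch of a tiny FIFO bit store; hypothetical names, allocation omitted. */
    #include <stddef.h>

    typedef struct {
        unsigned char *bits;    /* zero-initialized, growable byte array */
        size_t write_pos;       /* next bit position to write            */
        size_t read_pos;        /* next bit position to read             */
    } BitStore;

    /* First traversal: remember the CAR's original primary tag (two bits).
     * Positions advance by 2, so a pair never straddles a byte boundary. */
    static void bitstore_push2(BitStore *b, unsigned two_bits)
    {
        b->bits[b->write_pos / 8] |= (two_bits & 0x3u) << (b->write_pos % 8);
        b->write_pos += 2;
    }

    /* Second traversal: read the bits back in exactly the same order. */
    static unsigned bitstore_pop2(BitStore *b)
    {
        unsigned v = (b->bits[b->read_pos / 8] >> (b->read_pos % 8)) & 0x3u;
        b->read_pos += 2;
        return v;
    }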

Using the bit store, we are now able to distinguish between original and visited cons cells. To encode the shared ones, we can use the special term THE_NON_VALUE in the CDR, as we know it cannot appear there in any other way, mangled or not. (Remember that this term is tagged with 00 and does not correspond to a valid pointer.) A shared cell corresponds to an entry in the sharing table; we can store a pointer to this entry in the CAR and use its primary tag for encoding whether it has been processed or not. This gives us the complete picture for mangling cons cells.

For each entry in the sharing table, four words are necessary. The first two hold the information that we had to erase in the original object (both the CAR and the CDR in the case of cons cells). The second word will be THE_NON_VALUE if and only if the entry corresponds to a boxed object. The third word will contain the forwarding pointer, once the shared object is processed. The fourth word will contain a reverse pointer to the original shared heap object; it will be required for restoring the original state of shared objects.
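
A four-word entry of this kind could be sketched in C as follows; the struct and field names are illustrative, not the ones used in the implementation.

    /* Illustrative four-word entry in the sharing table. */
    typedef unsigned long Eterm;

    typedef struct {
        Eterm  first;      /* saved CAR (cons cell) or saved header (boxed object)   */
        Eterm  second;     /* saved CDR (cons cell) or THE_NON_VALUE (boxed object)  */
        Eterm *forward;    /* forwarding pointer, filled in when the copy is made    */
        Eterm *original;   /* reverse pointer to the shared heap object, for restore */
    } SharingEntry;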

The Copying Algorithm Algorithms 1 and 2 describe the two traversals that implement the copying of a term, preserving the sharing of subterms. Algorithm 1 corresponds to function copy_shared_calculate in our implementation.


Algorithm 1 Size calculation and mangling.

Input: A term t
Output: The real size (in words) of t, contained in variable size
Output: The bit store B
Output: The sharing table T

    B := an empty bit store
    T := an empty sharing table
    Q := an empty queue of terms
    size := 0
    obj := t
    loop
        switch (primary tag of obj)
            case 01: {cons cell}
                if the pointer of obj is local then
                    if the object is visited then
                        add an entry e for it in T
                        make the object shared and unprocessed (Figure 9)
                    else if the object is original then
                        make the object visited: may add to B (Figure 9)
                        size := size + 2
                        add the head of the list to Q
                        obj := the tail of the list
                        continue {with the next iteration of the loop}
                    end if
                end if
            case 10: {boxed object}
                if the pointer of obj is local then
                    if the object is visited then
                        add an entry e for it in T
                        make the object shared and unprocessed (Figure 9)
                    else if the object is original then
                        make the object visited (Figure 9)
                        n := the size of the object stored in the header
                        size := size + 1 + n
                        switch (secondary tag of the header)
                            case 0000: {tuple}
                                add the n elements of the tuple to Q
                            case 0101: {function closure}
                                m := the number of free variables
                                size := size + 1 + m
                                add the process creator of the closure to Q
                                add the m free variables of the closure to Q
                        end switch
                    end if
                end if
        end switch
        if Q is empty then
            return
        else
            remove a term from Q and store it in obj
        end if
    end loop


Boxed objects (header word):

    original              x |00
    visited               x |01
    shared, unprocessed   e |10
    shared, processed     e |11

Cons cells (CAR, CDR):

    original              visited
    x |01   y |01         x |01   y |00
    x |11   y |01         x |11   y |00
    x |10   y |01         x |10   y |00
    x |01   y |11         x |00   y |01
    x |11   y |11         x |00   y |11
    x |10   y |11         x |00   y |10
    x |01   y |10         x |00*  y |00
    x |11   y |10         x |00*  y |00
    x |10   y |10         x |00*  y |00

    shared, unprocessed   e |00   NONV
    shared, processed     e |01   NONV

Entries in the sharing table:

    object kind     first    second   forwarding pointer   reverse pointer
    cons cells      x |Tx    y |Ty    ptr?                 ptr
    boxed objects   x |Tx    NONV     ptr?                 ptr

Memorandum

    Primary tags in original objects: 00 (header), 01 (cons cell), 10 (boxed object), 11 (immediate value).
    NONV  The special THE_NON_VALUE term, tagged with 00.
    *     The primary tag of the CAR is placed in the bit store data structure.
    e     The corresponding entry in the shared subterms table.

Figure 9: Mangling of heap objects.

It is responsible for efficiently calculating the real size of a term and for identifying the shared subterms; in the process, it mangles the term. Algorithm 2 corresponds to function copy_shared_perform in our implementation. It is responsible for creating the actual copy and, at the same time, for unmangling the original term. Notice that the two algorithms, as presented here, do not handle the case of binaries, as this would make the presentation much longer and more complicated; our implementation, of course, supports the copying of binaries.

The two algorithms communicate by means of the bit store B and the sharing table T. The bit store is created by Algorithm 1, which fills it with the missing two bits of information for all subterms that are improper lists with a boxed tail, as explained in the previous section. It is then used by Algorithm 2, which reads the missing bits in order to restore the original term. The sharing table is created by Algorithm 1, which stores there the information that has been removed from the term (CAR and CDR for cons cells, the header for boxed objects) and also the reverse pointer. It is then used by Algorithm 2, which stores and uses the forwarding pointer for each shared subterm and, before finishing, restores the shared subterms to their original state.

Both algorithms use a queue of terms Q to implement a breadth-first traversal of the original term. (Lists are treated as a special case for reasons of efficiency; as a result, the tail of an improper list is visited before the list’s elements.) The maximum size of Q is proportional to the height of the term to be copied; in the worst case this will be equal to the size of the term, however, in most cases it will be much smaller. In our implementation, the memory allocated for the Q of Algorithm 1 is then reused for the Q of Algorithm 2 (which requires exactly the same size).

There are some non-trivial issues in Algorithm 2 that are worth explaining. The variable obj contains the current subterm of the original term that is being processed. On the other hand, the variable addr contains the address of the term variable where the copy of obj has to be placed. Initially, obj is equal to the term t which must be copied and addr points to a term variable t′ which will hold the final result.

As the outer loop iterates, heap objects (i.e., cons cells and boxed objects) are copied to the preallocated memory space that is pointed to by hp and the pointer hp advances. The copies of these objects are placed in the terms pointed to by addr; it is therefore necessary each time to find the next target of addr. To accomplish this, we use variables scan and remaining and a special value HOLE that does not correspond to a valid Erlang term (in our implementation, we are using a NULL pointer tagged as a list with 01). Initially, scan = hp and remaining = 0. The algorithm maintains the following invariants:

1. it always holds that scan ≤ hp;

2. scan + remaining is either equal to hp or points to the start of a heap object;

3. addr points either to the result term (initially) or to a term that contains a HOLE and is part of a heap object located before scan;

4. the heap objects that are located at addresses before scan do not contain HOLE terms, with the only possible exception of the term pointed to by addr; and

5. the heap objects that are located at addresses between scan and hp contain exactly as many HOLE terms as the size of Q.

During the copying, visited objects are unmangled. Shared objects are unmangled too when they are first processed, except that this unmangling takes place inside the entry of T that corresponds to them; the reason is that shared objects must continue to be distinguishable from unshared ones. The final loop completes the unmangling of shared terms by restoring their original contents.

11.4 Benchmarks

It is obviously very easy to come up with benchmarks showing that an implementation which preserves the sharing of subterms when copying is arbitrarily faster than one that does not. Both in execution time and memory usage, simple benchmarks can be written that exhibit exponential behaviour using flat copying and linear behaviour using the sharing-preserving one.

In this section, we study the performance of “average” Erlang applications which are not expected to exchange messages with a lot of sharing very often. We classify our benchmarks in two categories:

• “Stress tests” for the copying algorithm: a set of simple programs that create Erlang terms of various sizes that do not share any subterms and copy them around.

• “Shootout benchmarks”, that come from “The Computer Language Benchmarks Game”.6 These are programs created to compare performance across a variety of programming languages and implementations.

Figures 10 and 11 summarize the results of executing the benchmark programs in the working “master” branch of vanilla Erlang/OTP during June 2012 (the one that eventually became R16B), as well as in our version (derived from the same branch but nowadays rebased onto the “master” branch of Erlang/OTP that will become R17B) that implements the sharing-preserving copying of terms. The experiments were performed on a machine with four Intel Xeon E7340 CPUs (2.40 GHz), having a total of 16 cores and 16 GB of RAM, running Linux 2.6.32-5-amd64 and GCC 4.4.5.

6Available from http://shootout.alioth.debian.org/.


Algorithm 2 Copying and unmangling.

Input: A term t
Input: The bit store B
Input: The sharing table T
Input: A pointer hp to the memory where t must be copied
Output: A term t′ that is a copy of t

    Q := an empty queue of terms
    obj := t
    addr := the address of the result term t′
    scan := hp
    remaining := 0
    loop {the actual copying}
        switch (primary tag of obj)
            case 01: {cons cell}
                if the pointer of obj is local then
                    if the object is shared and processed then
                        fwd := the forwarding pointer from the entry in T
                        the term pointed to by addr := fwd |01
                    else
                        if the object is shared and unprocessed then
                            find the head and tail of the list from T
                            store hp as the forwarding pointer in T
                        end if
                        make the object (or its copy in T) original: may use B (Figure 9)
                        add the head of the list to Q
                        store HOLE to the CAR of hp
                        obj := the tail of the list
                        the term pointed to by addr := hp |01
                        addr := the address of the CDR of hp
                        hp := hp + 2
                        continue {next iteration of the loop}
                    end if
                else
                    the term pointed to by addr := obj
                end if
            case 10: {boxed object}
                if the pointer of obj is local then
                    if the object is shared and processed then
                        fwd := the forwarding pointer from the entry in T
                        the term pointed to by addr := fwd |10
                    else
                        if the object is shared and unprocessed then
                            find the header of the boxed object from T
                            store hp as the forwarding pointer in T
                        end if
                        make the object (or its copy in T) original (Figure 9)
                        n := the size of the object stored in the header
                        the term pointed to by addr := hp |10
                        store the header in hp
                        switch (secondary tag of the header)
                            case 0000: {tuple}
                                for i := 1 to n do
                                    add the i-th element of the tuple to Q
                                    store HOLE to the i-th element of hp
                                end for
                                hp := hp + 1 + n
                            case 0101: {function closure}
                                m := the number of free variables
                                add the process creator of the closure to Q
                                copy n words to their place in hp
                                store HOLE to the process creator of hp
                                for i := 1 to m do
                                    add the i-th free variable to Q
                                    store HOLE to the i-th free variable of hp
                                end for
                                hp := hp + 2 + n + m
                            default: {anything else}
                                copy n words to their place in hp
                                hp := hp + 1 + n
                        end switch
                    end if
                else
                    the term pointed to by addr := obj
                end if
            case 11: {immediate value}
                the term pointed to by addr := obj
        end switch
        if Q is empty then
            break {exit the copying loop}
        else
            remove a term from Q and store it in obj
            loop {find the next addr}
                if remaining = 0 then
                    if scan points to a word containing a HOLE then
                        {this is the CAR of a cons cell}
                        addr := scan
                        scan := scan + 2
                        break {exit the loop, addr found}
                    else
                        {this is the header of a boxed object}
                        n := the size stored in the header
                        switch (secondary tag of the header)
                            case 0000: {tuple}
                                remaining := n
                                scan := scan + 1
                            case 0101: {function closure}
                                remaining := 1 + the number of free variables
                                scan := scan + 1 + n
                            default: {anything else}
                                scan := scan + 1 + n
                        end switch
                    end if
                else if scan points to a word containing a HOLE then
                    addr := scan
                    scan := scan + 1
                    remaining := remaining − 1
                    break {exit the loop, addr found}
                else
                    scan := scan + 1
                    remaining := remaining − 1
                end if
            end loop
        end if
    end loop

    {unmangle shared subterms}
    for all e in T do
        unmangle the object corresponding to e (Figure 9)
    end for


Benchmark            Iter × Size     Without sharing   With sharing   Overhead (%)
mklist(25)           1 × 134M        4.146             6.457          55.76
mktuple(25)          1 × 101M        2.742             3.840          40.02
mkfunny(47)          1 × 101M        2.754             3.896          41.44
mkimfunny1(52)       1 × 130M        4.421             7.610          72.12
mkimfunny2(32)       1 × 109M        3.204             4.436          38.42
mkimfunny3(23)       1 × 112M        3.976             5.963          49.97
mkimfunny4(60M)      1 × 120M        2.412             2.974          23.31
mkimfunny5(72)       1 × 120M        4.472             6.103          36.47
mkcls(53)            1 × 131M        4.790             7.386          54.21
42                   10M × 0         3.470             2.649          −23.66
[]                   10M × 0         3.866             2.674          −30.82
ok                   10M × 0         4.142             2.600          −37.21
[42]                 10M × 2         3.323             2.894          −12.90
{42}                 10M × 2         3.376             2.849          −15.62
<<>>                 10M × 2         3.300             2.830          −14.24
<<42>>               10M × 3         3.415             2.850          −16.54
<<17, 42>>           10M × 3         3.414             2.816          −17.53
list:seq(1, 20)      10M × 40        5.736             7.775          35.57
mklist(5)            5M × 124        6.755             9.147          35.41
mktuple(5)           5M × 93         7.163             8.567          19.60
mkcls(3)             2.5M × 220      6.685             8.230          23.11
list:seq(1, 250)     1M × 500        2.964             5.036          69.91
mklist(8)            0.5M × 1020     4.691             6.617          41.06
mktuple(8)           0.5M × 765      4.764             6.045          26.88
mkcls(6)             0.25M × 1640    4.493             6.465          43.88

Figure 10: The results of the “stress test” benchmarks.

(Similar results, not reported here, were obtained by running the same benchmarks on a quad-core 2.5 GHz Intel (Q8300), with 4 GB of RAM and 2x2 MB of L2 cache, running a Linux 2.6.26-2-686 kernel.) All times are in seconds and were taken by executing the benchmark program 15 times and taking the median value.

11.4.1 Stress Tests

The code of the benchmarks that were used as stress tests can be found in our repository, publicly available at [email protected]:nickie/otp.git, in stress.erl. These benchmarks are worst-case scenarios for our implementation, which tries to locate shared subterms in terms of various sizes that are bound to contain none; on the other hand, the vanilla implementation of Erlang/OTP takes this for granted and uses a far more efficient traversal.

The stress tests are classified in three categories: (a) those that copy a single very large term once, (b) those that copy an extremely small term 10 million times, and (c) those that copy small (but non-trivial) terms many times. (Copying a term is done by sending it to another waiting process.) In Figure 10 the first column describes the term that is copied and the second column contains the number of iterations and the term’s size in words.

For the first category of tests, we notice that there is an average overhead of 45.75% (ranging from 23.31% to 72.12%), which is due to the more costly checks that our implementation performs, as well as the mangling and unmangling of terms. A smaller average overhead of 36.93% (ranging from 19.60% to 69.91%) is observed for the third category of tests, for the same reasons.

In the case of the second category, however, we had a surprising result. Our implementation turned out to be on average 21.06% faster (ranging from 12.90% to 37.21%).


Benchmark            Without sharing   With sharing   Overhead (%)
binary-trees         48.774            44.892         −7.96
chameneos-redux      75.810            73.154         −3.50
fannkuch-redux       8.972             9.552          6.46
k-nucleotide         149.152           151.870        1.82
mandelbrot           5.273             5.074          −3.77
pidigits             5.296             5.330          0.64
regex-dna            24.169            22.633         −6.36
reverse-complement   13.229            13.476         1.87
spectral-norm        12.908            11.675         −9.55
threadring           4.124             3.986          −3.35

Figure 11: The results of the “shootout” benchmarks.

The biggest gain was when copying terms of zero size (i.e., immediate values like 42, [] and ok). Of course, this has nothing to do with the identification of shared subterms, as with these kinds of terms there is no traversal to be done; it seems that our implementation, unlike the code of vanilla Erlang/OTP R15B01, takes a shortcut in erts_alloc_message_heap_state and does not try to allocate heap space of size zero. On the other hand, when copying a term such as [42] of non-zero size, the difference that is observed is due to the fact that the vanilla implementation always copies this term (to a new cons cell on the heap), whereas our implementation avoids copying it when the compiler has identified this term as a constant and put it in the constant pool, outside the process heap.

11.4.2 Shootout Benchmarks

The code of the shootout benchmarks that we used can also be found in our repository, in the directory shootout. We only considered benchmarks that spawn processes and use message passing. From Figure 11, we immediately observe that these applications are not penalized by the sharing-preserving implementation of copying. In fact, performance is slightly better in some of them, not because sharing is involved but because only very small terms are copied and, therefore, the same reasons as explained before apply.

12 Concluding Remarks

We have described changes and improvements to several key components of the Erlang runtime system architecture that completely eliminate scalability bottlenecks or significantly reduce their effects, and improve the performance and responsiveness of the Erlang VM. As mentioned, most of these changes have already found their place in the Erlang/OTP system (cf. also the Appendix), and are used by the Erlang community. The last one, related to preserving the sharing of terms in copying and in message passing, is still in the development stage; it is expected to also become part of a future Erlang/OTP release.

Acknowledgments

We thank Björn Gustavsson from the Ericsson Erlang/OTP team for designing and implementing the support for trace setting in the Erlang VM (Section 10).

Page 45: D2.3 (WP2): Prototype Scalable Runtime System Architecturerelease-project.softlab.ntua.gr/documents/D2.3.pdf · ponents of the Erlang runtime system to become more e cient and scalable

ICT-287510 (RELEASE) 23rd September 2013 44

Change Log

Version Date Comments

0.1 23/9/2013 First Version Submitted to the Commission Services

References

[1] E. D. Berger. The Hoard memory allocator, 2013. URL http://www.hoard.org/.

[2] E. D. Berger, K. S. McKinley, R. D. Blumofe, and P. R. Wilson. Hoard: A scalable memory allocator for multithreaded applications. In Proceedings of the Ninth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS-IX), pages 117–128. ACM Press, Nov. 2000.

[3] Boost Framework. The serialization library, release 1.50.0, 2008. URL http://www.boost.org/doc/libs/1_50_0/libs/serialization/doc/.

[4] C. J. Cheney. A nonrecursive list compacting algorithm. Commun. ACM, 13(11):677–678, 1970. doi: 10.1145/362790.362798.

[5] J. Evans. A scalable concurrent malloc(3) implementation for FreeBSD, 2006. URL http://people.freebsd.org/~jasone/jemalloc/bsdcan2006/jemalloc.pdf.

[6] S. Ghemawat and P. Menage. TCMalloc: Thread-caching malloc. URL http://goog-perftools.sourceforge.net/doc/tcmalloc.html.

[7] GNU libc. malloc source, 2013. URL http://ftp.gnu.org/gnu/glibc/.

[8] HackageDB. The binary package, version 0.5.1, 2012. URL http://hackage.haskell.org/package/binary.

[9] HackageDB. The cereal package, version 0.3.5.2, 2012. URL http://hackage.haskell.org/package/cereal.

[10] A. Kukanov and M. J. Voss. The foundations for scalable multi-core software in Intel Threading Building Blocks. Intel Technology Journal, 11(4):309–322, 2007. URL http://noggin.intel.com/sites/default/files/vol11_iss04.pdf.

[11] P. Nyblom. The “halfword” virtual machine. Talk given at the Erlang User Conference, Nov. 2011. Available from http://www.erlang-factory.com/conference/ErlangUserConference2011/speakers/PatrikNyblom.

[12] OCaml Standard Library. The Marshal module, version 3.12, 2011. URL http://caml.inria.fr/pub/docs/manual-ocaml/libref/Marshal.html.

[13] Oracle. Java object serialization specification, version 1.7.0.5, 2012. URL http://docs.oracle.com/javase/7/docs/platform/serialization/spec/serialTOC.html.

[14] N. Papaspyrou and K. Sagonas. On preserving term sharing in the Erlang virtual machine. In T. Hoffman and J. Hughes, editors, Proceedings of the 11th ACM SIGPLAN Erlang Workshop, pages 11–20, Copenhagen, Denmark, Sept. 2012. ACM. doi: 10.1145/2364489.2364493.

[15] M. Pettersson. A staged tag scheme for Erlang. Technical Report 2000-029, Department of Information Technology, Uppsala University, Nov. 2000.

[16] Python Standard Library. The pickle module, version 2.7.3, 2012. URL http://docs.python.org/library/pickle.html.

[17] Ruby Standard Library. The Marshal package, version 1.9.3, 2012. URL http://www.ruby-doc.org/core-1.9.3/Marshal.html.


A Scalability Improvements in Erlang/OTP Releases

In this appendix we list changes and improvements that affect the scalability and responsiveness of the Erlang Runtime System, and which have made it into the Erlang/OTP system in one of its releases during the first twenty-four months of the RELEASE project. Naturally, major releases of the system (R15B and R16B) contain significantly more changes and improvements than minor releases (R15B01, R15B02, R15B03, R16B01, and R16B02).

A.1 Improvements in Erlang/OTP R15B (2011-12-14)

• A number of memory allocation optimizations have been implemented. Most of them reduce contention caused by synchronization between threads during allocation and deallocation of memory. Most notably:

– Synchronization of memory management in scheduler specific allocator instances has been rewritten to use lock-free data structures.

– Synchronization of memory management in scheduler specific pre-allocators has been rewritten to use lock-free data structures.

– The mseg_alloc memory segment allocator now uses scheduler specific instances instead of one global instance. Apart from reducing contention, this also ensures that memory allocators always create memory segments on the local NUMA node on NUMA systems.

• The API of the ethread atomic memory operations used by the runtime system has been extended and improved. The library now also performs runtime tests for the presence of hardware features, such as for example SSE2 instructions, instead of requiring this to be determined at compile time. All uses of the old deprecated atomic API in the runtime system have been replaced with the use of the new atomic API, a change which in many places implies a relaxation of the memory barriers used.

• The Erlang Runtime System (ERTS) internal system block functionality has been replaced by new functionality for blocking the system. The old system block functionality had contention and complexity issues. The new functionality piggy-backs on the thread progress tracking functionality needed by the newly introduced lock-free synchronization in the runtime system. When the functionality for blocking the system is not used, there is practically no overhead, since the functionality for tracking thread progress is there and needed anyway.

• An ERTS internal, generic, many-to-one, lock-free queue for communication between threads has been introduced. The many-to-one scenario is very common in ERTS, so it can be used in a lot of places in the future. Currently it is used for scheduling of certain jobs and by the asynchronous thread pool, but more uses are planned for the future.

– Drivers using the driver_async functionality are not automatically locked to the system anymore, and can be unloaded like any dynamically linked-in driver.

– Scheduling of ready asynchronous jobs is now also interleaved in between other jobs. Previously all ready asynchronous jobs were performed at once.

• The runtime system does not bind schedulers to logical processors by default anymore. The rationale for this change is the following: If the Erlang runtime system is the only operating system process that binds threads to logical processors, this improves the performance of the runtime system. However, if other operating system processes (as for example another Erlang runtime system) also bind threads to logical processors, there might be a performance degradation instead. In some cases this degradation might be severe. Due to this, the default setting was changed so that the user is required to make an active decision in order to bind schedulers.


A.2 Improvements in Erlang/OTP R15B01 (2012-04-02)

• Added erlang:statistics(scheduler_wall_time) to ensure correct determination of scheduler utilization. Measuring scheduler utilization is strongly preferred over CPU utilization, since CPU utilization gives very poor indications of actual scheduler/VM usage.

A.3 Improvements in Erlang/OTP R15B02 (2012-09-03)

• A new scheduler wake up strategy has been implemented. For more information see the documentation of the +sws command line argument of erl.

• A switch for configuration of the busy wait length for scheduler threads has been added. For more information see the documentation of the +sbwt command line argument of erl.

A.4 Improvements in Erlang/OTP R15B03 (2012-12-06)

• The frequency with which sleeping schedulers are woken due to outstanding memory deallocation jobs has been reduced.

A.5 Improvements in Erlang/OTP R16B (2013-02-25)

• Various process optimizations have been implemented. The most notable of them are:

– New internal process table implementation allowing for both parallel reads as well as writes. Especially read operations have become really cheap. This reduces contention in various situations (e.g., when spawning or terminating processes, sending messages, etc.).

– Optimizations of run queue management reducing contention.

– Optimizations of process state changes reducing contention.

• Non-blocking code loading. Earlier, when an Erlang module was loaded, all other execution in the VM was halted while the load operation was carried out in single-threaded mode. Now modules are loaded without blocking the VM. Processes may continue executing undisturbed in parallel during the entire load operation. The load operation is completed by making the loaded code visible to all processes in a consistent way with one single atomic instruction. Non-blocking code loading improves the real-time characteristics of applications when modules are loaded or upgraded on a running SMP system.

• Major port improvements. The most notable of them are:

– New internal port table implementation allowing for both parallel reads as well as writes. Especially read operations have become really cheap. This reduces contention in various situations, for example when creating ports, terminating ports, etc.

– Dynamic allocation of port structures. This allows for a much larger default maximum number of ports. The previous default of 1024 has been raised to 65536. The maximum number of ports can be set using the +Q command line flag of erl.

– Major rewrite of the scheduling of port tasks. Major benefits of the rewrite are reduced contention on run queue locks, and a reduced amount of memory allocation operations needed. The rewrite was also necessary in order to make it possible to schedule signals from processes to ports.

– Improved internal thread progress functionality for easy management of unmanaged threads. This improvement was necessary for the rewrite of the port task scheduling.

– Rewrite of all process-to-port signal implementations in order to make it possible to schedule those operations. All port operations can now be scheduled, which allows for reduced lock contention on the port lock as well as truly asynchronous communication with ports.


– Optimized lookup of port handles from drivers.

– Optimized driver lookup when creating ports.

– Preemptable erlang:ports/0 BIF.

These changes imply changes of the characteristics of the system. The most notable are:

Order of signal delivery. The previous implementations of the VM delivered signals fromprocesses to ports in a synchronous fashion, which was stricter than required by thelanguage. Starting with Erlang/OTP R16B, signals are truly asynchronously delivered.The order of signal delivery still adheres to the requirements of the language, but onlyto those. That is, some signal sequences that previously always were delivered in onespecific order may now from time to time be delivered in different orders.

Latency of signals sent from processes to ports. Signals from processes to ports were previously always delivered immediately. This kept the latency of such communication to a minimum, but it could cause lock contention which was very expensive for the system as a whole. In order to keep this latency low also in the future, most signals from processes to ports are by default still delivered immediately, as long as no conflicts occur. An example of such a conflict is not being able to acquire the port lock. When such a conflict occurs, the signal is scheduled for delivery at a later time. A scheduled signal delivery may cause a higher latency for this specific communication, but it improves the overall performance of the system since it reduces lock contention between schedulers. The default behavior of only scheduling delivery of these signals on conflict can be changed by passing the +spp command line flag to erl. The behavior can also be changed on a per-port basis using the parallelism option of the open_port/2 BIF (see the sketch after this list).

Execution time of erlang:ports/0. Since the erlang:ports/0 BIF can now be preempted, the responsiveness of the system as a whole has been improved. A call to erlang:ports/0 may, however, take much longer to complete than before. How much longer heavily depends on the system load.

Reduction cost of calling driver callbacks. Calling a driver callback is quite costly. This was previously not reflected in the reduction cost at all. Since the reduction cost has now increased, a process performing lots of direct driver calls will be scheduled out more frequently than before.
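As an illustration of the per-port setting mentioned in the latency item above, the following is a small sketch (ours, not from the deliverable). It opens a port towards the external program cat with the parallelism option set, which hints the VM to schedule signals to this particular port rather than trying to deliver them immediately, independently of the global +spp flag.

    %% Open a port with the parallelism scheduler hint for this port only.
    %% "cat" is just an example external program.
    Port = open_port({spawn, "cat"}, [binary, {parallelism, true}]),
    Port ! {self(), {command, <<"hello\n">>}},
    receive
        {Port, {data, Data}} -> Data
    after 1000 ->
        timeout
    end.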

• The default reader group limit has been increased from 8 to 64. This limit can be set using the +rg command line argument of erl. This change of default value reduces lock contention on Erlang Term Storage (ETS) tables using the read_concurrency option, at the expense of increased memory consumption when the number of schedulers and logical processors is between 8 and 64.

• Increased potential concurrency in ETS for the write_concurrency option. The number of internal table locks has been increased from 16 to 64. This makes it four times less likely that two concurrent processes writing to the same table collide and thereby get serialized. The cost is an increased constant memory footprint for tables using write_concurrency. The memory consumption per inserted record is not affected. The increased footprint can be particularly large if write_concurrency is combined with read_concurrency.
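For reference, a brief sketch (the table name is ours) showing where the two options discussed in the previous items are specified; both are given when the table is created with ets:new/2.

    %% Create a table tuned for many concurrent readers and writers.
    Tab = ets:new(demo_tab, [set, public,
                             {read_concurrency, true},
                             {write_concurrency, true}]),
    true = ets:insert(Tab, {some_key, some_value}).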

• The scheduler wake-up strategy implemented in Erlang/OTP R15B02 is now used by default. This strategy is not as quick to forget about previous overload as the old one. This change alters the system's characteristics. Most notably: when a small overload appears and then disappears repeatedly, the system will be willing to keep schedulers awake for slightly longer than before. Timing in the system will also change as a result.

• The +stbt command line argument of erl was added. This argument can be used to try to set the scheduler bind type. On failure, unbound schedulers will be used.
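As a usage sketch (ours, and assuming +stbt accepts the same bind-type names as the documented +sbt flag, such as db for the default bind order):

    $ erl +stbt db    # try to bind schedulers using the default bind type;
                      # fall back to unbound schedulers if binding fails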


A.6 Improvements in Erlang/OTP R16B01 (2013-06-18)

• Introduced support for migration of memory carriers between memory allocator instances. This feature is not enabled by default and does not affect the characteristics of the system. However, when enabled, it has the effect of reducing the memory footprint when the memory load is unevenly distributed between scheduler-specific allocator instances.
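As an illustration only, and under the assumption that the feature is controlled by the abandon carrier utilization limit (acul) subswitch described in the erts_alloc documentation, enabling it for all allocators might look like the line below; the exact switch names should be checked against the erts_alloc manual page of the OTP release in use.

    $ erl +Muacul de    # assumed switch: enable carrier abandonment/migration
                        # with default utilization limits for all allocators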

A.7 Improvements in Erlang/OTP R16B02 (2013-09-18)

• New allocator strategy aoffcbf (address order first fit, carrier best fit). Supports carrier migration but with better CPU performance than aoffcaobf.

• Added a command line option to set schedulers by percentages. For applications that show enhanced performance from the use of a non-default number of emulator scheduler threads, having to accurately set the right number of scheduler threads across multiple hosts, each with a different number of logical processors, is difficult because the +S option requires absolute numbers of scheduler threads and scheduler threads online to be specified. To address this issue, a +SP command line option was added to erl, similar to the existing +S option but allowing the number of scheduler threads and scheduler threads online to be set as percentages of the logical processors configured and the logical processors available, respectively. For example, +SP 50:25 sets the number of scheduler threads to 50% of the logical processors configured, and the number of scheduler threads online to 25% of the logical processors available. The +SP option also interacts with any settings specified with the +S option, such that the combination of options +S 4:4 +SP 50:50 (in either order) results in two scheduler threads and two scheduler threads online.
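As a concrete sketch (the host below is hypothetical): on a machine with 32 logical processors configured and 32 available, starting the emulator with the percentage option and inspecting the result from the shell would look roughly as follows.

    $ erl +SP 50:25
    1> erlang:system_info(schedulers).
    16
    2> erlang:system_info(schedulers_online).
    8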