arXiv:1810.05600v1 [cs.OS] 12 Oct 2018

Compact NUMA-aware Locks

Dave Dice, Alex Kogan
Oracle Labs
{first.last}@oracle.com

Abstract

Modern multi-socket architectures exhibit non-uniform memory access (NUMA) behavior, where access by a core to data cached locally on a socket is much faster than access to data cached on a remote socket. Prior work offers several efficient NUMA-aware locks that exploit this behavior by keeping the lock ownership on the same socket, thus reducing remote cache misses and inter-socket communication. Virtually all those locks, however, are hierarchical in nature and thus require space proportional to the number of sockets. The increased memory cost renders NUMA-aware locks unsuitable for systems that are sensitive to the space requirements of their synchronization constructs, with the Linux kernel being the chief example.

In this work, we present a compact NUMA-aware lock that requires only one word of memory, regardless of the number of sockets in the underlying machine. The new lock is a variant of an efficient (NUMA-oblivious) MCS lock, and inherits its performant features, such as local spinning and a single atomic instruction in the acquisition path. Unlike MCS, the new lock organizes waiting threads in two queues, one composed of threads running on the same socket as the current lock holder, and another composed of threads running on a different socket (or sockets).

We integrated the new lock into the Linux kernel's qspinlock, one of the major synchronization constructs in the kernel. Our evaluation using both user-space and kernel benchmarks shows that the new lock matches the single-thread performance of MCS, but significantly outperforms the latter under contention, achieving a level of performance similar to that of other state-of-the-art NUMA-aware locks that require substantially more space.

ACM Reference format:
Dave Dice and Alex Kogan. 2016. Compact NUMA-aware Locks. In Proceedings of ACM Conference, Washington, DC, USA, July 2017 (Conference'17), 13 pages.
DOI: 10.1145/nnnnnnn.nnnnnnn

1 Introduction

Locks are used by concurrently running processes (or threads) to acquire exclusive access to shared data. Since their invention in the mid-sixties [11], locks have remained a topic of extensive research and the most popular synchronization technique in parallel software. Prior research has shown that the performance of such software often depends directly on the efficiency of the locks it employs [8, 12, 13].

The evolution of locks is tightly coupled with the evolution of computing architectures. Modern architectures feature an increasing number of nodes (or sockets), each comprising locally attached memory, a fast local cache and multiple processing units (or cores). Accesses by a core to local memory or a local cache are significantly faster than accesses to remote memory or to cache lines residing on another node [20], a characteristic known as NUMA (Non-Uniform Memory Access). As a result, researchers have proposed multiple designs for NUMA-aware locks, which try to keep the lock ownership within the same socket [3, 4, 9, 10, 15, 18, 20]. This approach decreases remote cache misses and the associated inter-socket communication, as it increases the chance that the lock data, as well as the shared data subsequently accessed in a critical section, will be cached locally to the socket on which the lock holder is running.

While performance evaluations show that NUMA-aware locks perform substantially better than their NUMA-oblivious counterparts, one particular issue hampers the adoption of the former in practice. Specifically, while NUMA-oblivious locks can be implemented using a single memory word (or even a bit), virtually all NUMA-aware locks are hierarchical in nature, built of a set of local (typically, per-socket) locks, each mediating threads running on the same socket, and a global lock synchronizing threads that hold a local lock [3, 4, 9, 10, 15, 18].

The hierarchical structure of NUMA-aware locks is problematic for three reasons. First, even when the lock is uncontended, a thread still has to perform multiple atomic operations to acquire multiple low-level locks of the hierarchy before it can enter a critical section. This often results in suboptimal single-thread performance when compared to efficient NUMA-oblivious locks. Second, to ensure portability, NUMA-aware locks have to be initialized dynamically, as the number of sockets of the underlying system is unknown until run time. Third, and perhaps most importantly in the context of this work, hierarchical NUMA-aware locks require space proportional to the number of sockets. In certain environments, such an increase in the space requirement is prohibitively expensive. One example is systems that feature numerous (e.g., millions of) locks, such as database systems or an operating system kernel. The Linux kernel, for instance, strictly limits the size of its spin lock to 4 bytes. Among many use cases, this lock is embedded in the page structure, which in turn is instantiated for every 4K-sized physical page [5]. As a result, any increase to the size of the lock would be unacceptable [5, 6].
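To get a feel for why even a few extra bytes matter, the following back-of-the-envelope sketch counts 4 KB pages for a given amount of RAM and multiplies that by a hypothetical growth of the per-page lock. The function names are illustrative, and the sketch assumes the page structure grows by exactly the lock's growth (ignoring alignment and padding), which is a simplification and not something the text states.

```c
#include <stdint.h>

#define PAGE_SIZE_BYTES 4096u   /* the 4K page size mentioned in the text */

/* Number of physical pages (and hence page structures) for a given RAM size. */
uint64_t page_count(uint64_t ram_bytes) {
    return ram_bytes / PAGE_SIZE_BYTES;
}

/* Extra memory consumed if every per-page lock grew from old_size to
 * new_size bytes. Assumption: no padding effects in the enclosing struct. */
uint64_t extra_lock_bytes(uint64_t ram_bytes, unsigned old_size, unsigned new_size) {
    return page_count(ram_bytes) * (uint64_t)(new_size - old_size);
}
```

Under these assumptions, a machine with 1 TiB of RAM has 2^28 pages, so growing the lock from 4 to 8 bytes would cost about 1 GiB.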

This work proposes a compact NUMA-aware lock, called CNA, that requires only one word, regardless of the number of sockets of the underlying machine. Moreover, CNA requires only one atomic instruction per lock acquisition. The CNA lock is a variant of the popular and highly efficient MCS lock [19], in which threads waiting for the lock form a queue and spin on a local cache line until the lock becomes available to them. CNA attempts to pass the lock to a successor running on the same socket, rather than to the thread that happens to be next in the queue. While looking for the successor, the lock holder moves waiting threads running on a different socket (or sockets) to a separate, secondary queue, so they do not interfere with subsequent lock handovers. Threads in the secondary queue are moved back to the main queue under one of two conditions: (a) when the main queue does not have any waiting threads running on the same socket as the current lock holder, or (b) after a certain number of local handovers. The latter condition ensures long-term fairness and avoids starvation (of threads in the secondary queue).

We implemented the CNA lock as a stand-alone dynamically linked library conforming to the POSIX pthread API. We also modified the Linux kernel spin lock implementation (qspinlock) to use CNA. We evaluated the CNA lock using a user-space microbenchmark, multiple real applications and several kernel microbenchmarks. The results show that, unlike many NUMA-aware locks, CNA does not introduce any overhead in single-thread runs over the MCS lock. At the same time, it significantly outperforms MCS under contention (typically, by over 40% on a two-socket system and by over 100% on a four-socket system), achieving a level of performance similar to that of other state-of-the-art NUMA-aware locks that require substantially more space.

2 Related Work

A test-and-set lock [2] is one of the simplest spin locks. To acquire this lock, a thread repeatedly executes an atomic test-and-set instruction until it returns a value indicating that the state of the lock has been changed from unlocked to locked. While this lock can be implemented with just one memory word (or even a bit), it employs global spinning, that is, all threads trying to acquire the lock spin on the same memory location. This leads to excessive coherence traffic during lock handovers. Furthermore, this lock does not provide any fairness guarantees, as a thread that just released the lock can return and bypass threads that have been waiting for the lock for a long time.
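As an illustration of the test-and-set lock described above, here is a minimal sketch using C11 atomics. The type and function names (tas_lock_t, tas_try_lock, and so on) are ours and not from any cited implementation.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* The whole lock is a single flag: set means locked, clear means unlocked. */
typedef struct { atomic_flag locked; } tas_lock_t;

void tas_init(tas_lock_t *l) { atomic_flag_clear(&l->locked); }

/* One test-and-set attempt; returns true if this call changed the
 * state from unlocked to locked. */
bool tas_try_lock(tas_lock_t *l) {
    return !atomic_flag_test_and_set_explicit(&l->locked, memory_order_acquire);
}

/* Global spinning: every waiter hammers the same memory location. */
void tas_lock(tas_lock_t *l) {
    while (!tas_try_lock(l)) { /* spin */ }
}

void tas_unlock(tas_lock_t *l) {
    atomic_flag_clear_explicit(&l->locked, memory_order_release);
}
```

Nothing in this sketch orders the waiters, which is why a thread that has just released the lock can immediately reacquire it ahead of older waiters.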

Queue locks address those challenges by organizing threads waiting for the lock in a FIFO queue. In MCS [19], one of the most popular queue locks, the shared state of the lock consists of a pointer to the tail of the queue. Each thread has a record (queue node) that it inserts into the queue (by atomically swapping the tail), and then spins locally on a flag inside its record, which will be set by its predecessor when the latter unlocks the lock. In general, queue spin locks provide faster lock handover under contention compared to locks with global spinning [2]. Those locks, however, are not friendly to NUMA systems, since they cause increased coherence traffic as the lock data (and the data accessed in the critical section) can migrate from one socket to another with every lock handover.
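The MCS protocol just described can be sketched with C11 atomics as follows. This is a simplified illustration using default (sequentially consistent) atomics, with no padding or pause instruction; the names are ours.

```c
#include <stdatomic.h>
#include <stddef.h>

typedef struct mcs_node {
    _Atomic(struct mcs_node *) next;  /* successor in the queue */
    atomic_int locked;                /* flag this thread spins on locally */
} mcs_node_t;

/* Shared state: just a pointer to the tail of the queue. */
typedef struct { _Atomic(mcs_node_t *) tail; } mcs_lock_t;

void mcs_lock(mcs_lock_t *lock, mcs_node_t *me) {
    atomic_store(&me->next, (mcs_node_t *)NULL);
    atomic_store(&me->locked, 0);
    /* Enqueue by atomically swapping the tail. */
    mcs_node_t *prev = atomic_exchange(&lock->tail, me);
    if (prev == NULL) return;              /* queue was empty: lock acquired */
    atomic_store(&prev->next, me);         /* link behind the predecessor */
    while (!atomic_load(&me->locked)) { }  /* local spinning */
}

void mcs_unlock(mcs_lock_t *lock, mcs_node_t *me) {
    mcs_node_t *succ = atomic_load(&me->next);
    if (succ == NULL) {
        /* No visible successor: try to mark the queue empty. */
        mcs_node_t *expected = me;
        if (atomic_compare_exchange_strong(&lock->tail, &expected, (mcs_node_t *)NULL))
            return;
        /* Someone swapped the tail but has not linked in yet; wait for it. */
        while ((succ = atomic_load(&me->next)) == NULL) { }
    }
    atomic_store(&succ->locked, 1);        /* hand the lock to the successor */
}
```

The handover always goes to whichever thread is next in the queue, regardless of its socket, which is the NUMA-unfriendly behavior the paragraph above points out.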

In order to keep the lock on the same socket, Radovic and Hagersten propose a hierarchical backoff lock (HBO) [20], which requires only one word of memory. The idea is to use that word to store the socket number of the lock holder. When a thread finds the lock busy, it sets its backoff to a small value if it runs on the same socket as the lock holder and to a larger value if it runs on a different socket. This approach, however, poses the same challenges as spin locks with global spinning [18]. In particular, threads spinning on other sockets may be starved, and they create contention on the lock state even if they check the state less frequently. (To mitigate those issues, the authors considered several extensions of the basic HBO algorithm, using some extra words of memory, yet these do not eliminate the issues completely [20].) Moreover, backoff timeouts require tuning for optimal performance, and this tuning is known to be challenging [14, 16].
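The core HBO idea can be sketched as below. This is a deliberately simplified illustration, not the algorithm from [20]: the backoff constants are arbitrary placeholders, there is no exponential growth or cap, and all names are ours.

```c
#include <stdatomic.h>

#define HBO_FREE (-1)   /* sentinel meaning "no holder" (our choice) */

/* Hypothetical backoff lengths, in busy-wait iterations; real values need tuning. */
enum { BACKOFF_LOCAL = 64, BACKOFF_REMOTE = 1024 };

/* The single lock word stores the socket number of the current holder. */
typedef struct { atomic_int holder_socket; } hbo_lock_t;

/* One acquisition attempt. Returns 0 on success; otherwise the backoff to
 * apply, chosen by comparing the holder's socket with ours. */
int hbo_try_lock(hbo_lock_t *l, int my_socket) {
    int seen = HBO_FREE;
    if (atomic_compare_exchange_strong(&l->holder_socket, &seen, my_socket))
        return 0;
    return (seen == my_socket) ? BACKOFF_LOCAL : BACKOFF_REMOTE;
}

void hbo_lock(hbo_lock_t *l, int my_socket) {
    int delay;
    while ((delay = hbo_try_lock(l, my_socket)) != 0) {
        for (volatile int i = 0; i < delay; i++) { }  /* back off, then retry */
    }
}

void hbo_unlock(hbo_lock_t *l) {
    atomic_store(&l->holder_socket, HBO_FREE);
}
```

Because remote threads retry less often, a local thread is more likely to win the next attempt, but every waiter still polls the same word, which is the global-spinning drawback noted above.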

Subsequent designs of NUMA-aware locks use a hierarchy of synchronization constructs (e.g., queue locks), with per-socket constructs serving for intra-socket synchronization and a construct at the top of the hierarchy synchronizing between threads on different sockets. This idea was realized in [9, 18]. Dice et al. [10] generalize this approach in the Lock Cohorting technique, which constructs a NUMA-aware lock out of any two spin locks L and G that have certain properties. Chabbi et al. [3] generalize the Lock Cohorting technique further to multiple levels of hierarchy, addressing architectures with a deeper NUMA hierarchy. They present HMCS, a hierarchical MCS lock, in which each level of the hierarchy is protected by an MCS lock.

Hierarchical NUMA-aware locks have a memory footprint proportional to the number of sockets (or more precisely, to the number of nodes in the hierarchy tree). Besides, hierarchical locks tend to perform poorly under no or light contention, since a lock operation involves multiple atomic instructions (to acquire multiple locks). To address some of those challenges, Kashyap et al. [15] present a hierarchical NUMA-aware CST lock, which defers the allocation of per-socket locks until the moment each such lock is first accessed. This is useful in environments where threads are restricted to a subset of sockets, yet the memory footprint of the CST lock grows linearly with the number of sockets in the general case. A different angle is taken by Chabbi and Mellor-Crummey in [4], where the authors augment the HMCS lock [3] with a fast path. This path enables threads to bypass multiple levels of the hierarchy in HMCS when there is no contention and compete directly for the lock at the top of the hierarchy. This results in a contention-conscious hierarchical lock that performs similarly to MCS under no contention, but increases the memory cost of the HMCS lock even further, as it has to maintain metadata that helps threads estimate the current level of contention.

In a different, but highly related area, Dice [8] explored the scalability collapse phenomenon, in which the throughput of a system drops abruptly due to lock contention. He suggests modifying the admission policy of a lock, limiting the number of distinct threads circulating through the lock. In the case of MCS, excess threads are removed from the MCS lock queue and placed into a separate list. While the resulting lock (called MCSCR [8]) is shown to perform well under contention, and in particular on over-subscribed systems, it is NUMA-oblivious and uses multiple words of memory (to keep track of the multiple queues/lists). As a future direction, Dice mentions the possibility of constructing MCSCRN, a NUMA-aware version of MCSCR. In addition to the state of MCSCR, MCSCRN is conceived to also include two extra fields: the identity of the current preferred socket and a pointer to the list of “remote” threads running on a different socket (or sockets) [8].

3 Background (Linux Kernel Spin Lock)

Being one of the major synchronization constructs in the Linux kernel and having “a great deal of influence over the safety and performance of the kernel” [7], the spin lock implementation has evolved over the course of the years. The current implementation of spin locks uses a multi-path approach, with a fast path implemented as a test-and-set lock and a slow path implemented as an MCS lock [17]. More specifically, a four-byte lock word is divided into three parts: the lock value, the pending bit and the queue tail. A thread acquiring a spin lock first tries to atomically flip the value of the lock word from 0 to 1, and if successful, it has the lock. Otherwise, it checks whether there is any contention on the lock. Contention is indicated by any other bit in the lock word being set (the pending bit or the bits of the queue tail). If there is no contention, the thread attempts to atomically set the pending bit, and if successful, spins and waits until the current lock holder releases the lock. If contention is detected, the thread switches to a slow path, in which it enters an MCS queue.

Once a thread t enters the MCS queue (by atomically swapping the queue tail in the lock word with an encoded pointer to its queue node), it waits until it reaches the head of the queue. This happens if t enters an empty queue, or when t's predecessor in the queue writes into a flag in t's queue node on which t spins. At that point, t waits for the lock holder and a thread spinning on the pending bit (if such a thread exists) to go away. This in turn happens when t finds both the lock value and the pending bit clear. At that time, t claims the lock (by non-atomically setting the lock value to 1), and writes into the flag of its successor s in the MCS queue (if such a successor exists), notifying the latter that s is now the thread at the head of the MCS queue.

A thread releasing a spin lock simply sets the lock value to 0. It is interesting to note that unlike the original MCS lock [19], this design avoids the need to carry a queue node from lock to unlock, since the release of the spin lock does not involve queue nodes. Furthermore, the Linux kernel limits the number of contexts that can nest and in which a spin lock can be acquired (the limit is four). This allows all queue nodes to be statically preallocated and, at the same time, enables a space- and time-efficient encoding of the tail pointer that makes the entire lock word fit into four bytes [17].
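The following sketch shows one way a four-byte lock word could pack the lock value, the pending bit and an encoded tail. The exact bit positions are an assumption made for illustration (the text only states that the word has three parts and that there are four preallocated nodes per CPU); consult the kernel sources for the authoritative layout. All macro and function names are ours.

```c
#include <stdint.h>

/* Assumed, illustrative layout:
 *   bits  0-7  : lock value
 *   bit   8    : pending bit
 *   bits 16-17 : index of the per-CPU queue node (0..3, one per nesting context)
 *   bits 18-31 : CPU number + 1 (so that 0 means "no tail")               */
#define Q_LOCKED_MASK    0x000000ffu
#define Q_PENDING_BIT    0x00000100u
#define Q_TAIL_IDX_SHIFT 16
#define Q_TAIL_CPU_SHIFT 18
#define Q_TAIL_MASK      0xffff0000u

/* Encode "node idx of CPU cpu" as a tail value instead of a full pointer. */
uint32_t q_encode_tail(unsigned cpu, unsigned idx) {
    return ((uint32_t)(cpu + 1) << Q_TAIL_CPU_SHIFT) |
           ((uint32_t)idx << Q_TAIL_IDX_SHIFT);
}

/* Decoders; only meaningful when the tail bits are non-zero. */
unsigned q_tail_cpu(uint32_t word) { return (word >> Q_TAIL_CPU_SHIFT) - 1; }
unsigned q_tail_idx(uint32_t word) { return (word >> Q_TAIL_IDX_SHIFT) & 3u; }

/* Contention: any bit other than the lock value is set. */
int q_is_contended(uint32_t word) { return (word & ~Q_LOCKED_MASK) != 0; }
```

Because nodes are statically preallocated per CPU, a (CPU, index) pair identifies a node uniquely, which is what lets the tail fit in the upper half of the word rather than needing a pointer-sized field.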

The CNA lock is used to replace the slow path of the kernel spin lock, leaving the fast path as well as the unlock procedure intact. Thus, the change to the kernel is minimal and localized to just a few (two) files. The design of the CNA lock is detailed next.

4 Design Overview

The CNA lock can be seen as a variant of the MCS lock with a few notable differences. MCS organizes threads waiting to acquire the lock in one queue. CNA organizes threads into two queues, the “main” queue composed of threads running on the same socket as the lock holder, and a “secondary” queue composed of threads running on a different socket (or sockets). When a thread attempts to acquire the CNA lock, it always joins the main queue first; it might be moved to the secondary queue by a lock holder running on a different socket, as detailed below and exemplified in Figure 1.

Despite managing two queues of waiting threads, the shared state of the CNA lock consists of a single word, tail, which is a pointer to the tail of the main queue. Like the MCS lock, CNA acquisition requires an auxiliary node, which a thread t inserts into the main queue by atomically swapping the tail pointer with a pointer to its node. (This is the only atomic instruction performed in the lock acquisition procedure.) The auxiliary node contains the next field, pointing to the next node in the (main or secondary) queue, and the spin field, on which t spins waiting for its value to change (e.g., from 0 to 1). This change happens when t's predecessor in the queue exits its critical section and executes the unlock function, passing the ownership of the lock to t.

While the lock function in CNA is almost identical to that in MCS, the unlock function is where the CNA lock fundamentally differs. Instead of simply passing the lock to its successor, the current lock holder h traverses the main queue and looks for a thread running on the same socket (the socket of each thread is recorded in its node). If it finds one, say a thread h′, it moves all threads (or more precisely, all nodes) between h and h′ from the main queue to the secondary queue, and passes the ownership to h′ (by changing the spin field in the node belonging to h′). See Figure 1 (b) for an example.

(a) The main queue consists of six threads, with t1, t4 and t5 running on socket 0 and the rest running on socket 1. Thread t1 has the lock. The secondary queue is empty.

(b) t1 exits its critical section, traverses the main queue and finds that t4 runs on the same socket. Therefore, t1 moves t2 and t3 into the secondary queue, setting the secondaryTail field in t2's node to t3. Then, t1 passes the lock to t4 by writing the pointer to the head of the secondary queue into t4's spin field.

(c) t1 returns and enters the main queue.

(d) When t4 exits its critical section, it discovers that the next thread in the main queue (t5) runs on the same socket. t4 passes the lock to t5 by simply copying the value in t4's spin field into t5's spin field.

(e) t7 running on socket 1 arrives and enters the main queue.

(f) t5 exits its critical section, moves t6 to the end of the secondary queue (and updates the secondaryTail field in t2's node), and passes the lock to t1.

(g) t1 exits its critical section and finds no threads on socket 0 in the main queue. Thus, t1 moves nodes from the secondary queue back to the main one, putting them before its successor t7, and passes the lock to t2.

Figure 1. A running example for CNA lock handovers on a 2-socket machine. Empty cells represent NULL pointers.

Note that h needs to pass the pointer to the head of the secondary queue to h′ (which the latter will have to pass to its successor, and so on). This is needed so that h′ (or one of its successors) is able to pass the lock to a thread in the secondary queue, e.g., if the main queue becomes empty. One natural place to record this pointer is in the lock structure, but that would increase the size of the lock structure. An alternative is to record the pointer in a separate field of h′'s queue node, and then have h′ copy the pointer to its successor, and so on. That would require an extra store instruction (and, potentially, an extra cache miss) during lock handover. Our approach is to reuse the spin field for that purpose, that is, instead of handing the lock to h′ by writing 1, h writes the pointer to the head of the secondary queue into h′'s spin. We assume here that a valid pointer cannot have the value 1, which is true for most systems.
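The three meanings of the spin field described above (still waiting, lock held with an empty secondary queue, lock held with a pointer to the secondary queue's head) can be made explicit with a few helpers. The helper names are ours; the node layout follows the paper's structure definition.

```c
#include <stdint.h>
#include <stddef.h>

typedef struct cna_node {
    uintptr_t spin;            /* 0: waiting; 1: lock held, no secondary queue;
                                  >1: lock held, value is the secondary head */
    int socket;
    struct cna_node *secTail;
    struct cna_node *next;
} cna_node_t;

/* A thread owns the lock as soon as its spin field becomes non-zero. */
int cna_has_lock(const cna_node_t *n) { return n->spin != 0; }

/* Decode the head of the secondary queue, or NULL if that queue is empty.
 * Relies on the assumption that no valid pointer has the value 1. */
cna_node_t *cna_secondary_head(const cna_node_t *n) {
    return n->spin > 1 ? (cna_node_t *)n->spin : NULL;
}

/* Hand the lock to succ, piggybacking the secondary head on the same store. */
void cna_pass_lock(cna_node_t *succ, cna_node_t *sec_head) {
    succ->spin = sec_head ? (uintptr_t)sec_head : 1;
}
```

The handover thus remains a single store into the successor's node, the same cost as in plain MCS.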

We are left to discuss when threads in the secondary queue are able to return to the main queue and acquire the lock. That happens under one of two conditions. First, if the current lock holder h cannot find a thread in the main queue running on the same socket, it links the last node in the secondary queue (i.e., the secondary tail) to h's immediate successor in the main queue, so that the whole secondary queue precedes that successor, and passes the lock to the first node in the secondary queue (cf. Figure 1 (g)). This effectively empties the secondary queue. Note that in order to find the secondary tail, one could scan that queue starting from the head, but that would be inefficient if the queue gets long. As an optimization, we store the pointer to the tail of the secondary queue in the node belonging to the head of that queue. We also use (and update) that pointer when moving nodes from the main queue (in the unlock function) into the (non-empty) secondary queue.
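The constant-time splice made possible by caching the secondary tail in the secondary head can be sketched as follows. The function name cna_flush_secondary is ours; the body mirrors the three assignments the unlock procedure performs in this case, and the sketch ignores concurrency.

```c
#include <stdint.h>
#include <stddef.h>

typedef struct cna_node {
    uintptr_t spin;
    int socket;
    struct cna_node *secTail;  /* valid only in the head of the secondary queue */
    struct cna_node *next;
} cna_node_t;

/* Move the whole secondary queue in front of me's main-queue successor and
 * hand the lock to the secondary head. The caller must ensure me->spin > 1,
 * i.e. that the secondary queue is non-empty. Returns the new lock holder. */
cna_node_t *cna_flush_secondary(cna_node_t *me) {
    cna_node_t *head = (cna_node_t *)me->spin;
    head->secTail->next = me->next;  /* O(1): no scan for the tail */
    head->spin = 1;                  /* secondary queue is now empty */
    return head;
}
```

For the situation in Figure 1 (g), with t2, t3 and t6 in the secondary queue and t7 as t1's successor, the call leaves the main queue as t2, t3, t6, t7 with t2 holding the lock.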

The second condition for moving nodes from the secondary to the main queue deals with long-term fairness guarantees. As described so far, if threads running on the same socket repeatedly try to acquire the lock (and enter the main queue), CNA might starve all other threads running on a different socket (or sockets). To avoid this case, CNA periodically moves threads from the secondary queue to the main one, effectively passing the lock ownership to a thread on a different socket. To this end, we employ a lightweight pseudo-random number generator, and empty the secondary queue with a low (but positive) probability.
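A possible shape for this probabilistic fairness check is sketched below. The paper does not specify the generator; xorshift32 is used here purely as a stand-in, and the seed and thread-local placement are our assumptions. With a 16-bit mask, the masked value is zero (meaning "do not keep the lock local") in roughly 1 out of 65536 calls.

```c
#include <stdint.h>

#define THRESHOLD 0xffffu   /* larger mask: rarer flushes of the secondary queue */

/* Per-thread generator state; any non-zero seed works for xorshift. */
static _Thread_local uint32_t prng_state = 2463534242u;

/* xorshift32: a stand-in for the "lightweight pseudo-random number
 * generator" mentioned in the text (assumption, not the authors' choice). */
uint32_t pseudo_rand(void) {
    uint32_t x = prng_state;
    x ^= x << 13;
    x ^= x >> 17;
    x ^= x << 5;
    prng_state = x;
    return x;
}

/* Non-zero: prefer a successor on the same socket.
 * Zero (low probability): hand the lock to the secondary queue instead. */
int keep_lock_local(void) {
    return (pseudo_rand() & THRESHOLD) != 0;
}
```

Tuning THRESHOLD trades throughput against fairness: a smaller mask flushes the secondary queue more often and gives remote threads the lock sooner.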

5 Implementation Details

In this section, we provide the C-style pseudo-code for the CNA lock structures and implementation. We assume sequential consistency for clarity. Our actual implementation uses volatile keywords and memory fences as well as padding (to avoid false sharing) where necessary.

```c
typedef struct cna_node {
    uintptr_t spin;
    int socket;
    struct cna_node *secTail;
    struct cna_node *next;
} cna_node_t;

typedef struct {
    cna_node_t *tail;
} cna_lock_t;
```

Figure 2. Lock and node structures

```
 1  int cna_lock(cna_lock_t *lock, cna_node_t *me) {
 2      me->next = 0;
 3      me->socket = -1;
 4      me->spin = 0;
 5      /* Add myself to the main queue */
 6      cna_node_t *tail = SWAP(&lock->tail, me);
 7      /* No one there? */
 8      if (!tail) return 0;
 9      /* Someone there, need to link in */
10      me->socket = current_numa_node();
11      tail->next = me;
12      /* Wait for the lock to become available */
13      while (!me->spin) { CPU_PAUSE(); }
14      return 0;
15  }
```

Figure 3. Lock procedure. SWAP stands for an atomic exchange instruction, while CPU_PAUSE is a no-op used for polite busy waiting.

The CNA lock structure as well as the structure of the CNA node are provided in Figure 2. We note that the CNA queue node structure contains a few extra fields compared to MCS (secondaryTail, named secTail in Figure 2, and socket). The space taken by queue node structures, however, is almost never a practical concern, since those structures can be reused for different lock acquisitions, and between different locks. Normally, each thread would have a (small) number of such structures preallocated in thread-local storage. In the Linux kernel, for example, each thread (or more precisely, each CPU) has exactly four such statically preallocated structures [17]. Nevertheless, we discuss an optimization in Section 6 that eliminates the socket field, albeit mostly for performance considerations. Note that the CNA lock instance itself requires one word only: a pointer to the tail of the main queue.


```
16  void cna_unlock(cna_lock_t *lock, cna_node_t *me) {
17      /* Is there a successor in the main queue? */
18      if (!me->next) {
19          /* Is there a node in the secondary queue? */
20          if (me->spin <= 1) {
21              /* If not, try to set tail to NULL, indicating that
22                 both main and secondary queues are empty */
23              if (CAS(&lock->tail, me, NULL) == me) return;
24          } else {
25              /* Otherwise, try to set tail to the last node in
26                 the secondary queue */
27              cna_node_t *secHead = (cna_node_t *) me->spin;
28              if (CAS(&lock->tail, me, secHead->secTail) == me) {
29                  /* If successful, pass the lock to the head of
30                     the secondary queue */
31                  secHead->spin = 1;
32                  return;
33              }
34          }
35          /* Wait for successor to appear */
36          while (me->next == NULL) { CPU_PAUSE(); }
37      }
38      /* Determine the next lock holder and pass the lock by
39         setting its spin field */
40      cna_node_t *succ = NULL;
41      if (keep_lock_local() && (succ = find_successor(me))) {
42          succ->spin = me->spin ? me->spin : 1;
43      } else if (me->spin > 1) {
44          succ = (cna_node_t *) me->spin;
45          succ->secTail->next = me->next;
46          succ->spin = 1;
47      } else {
48          me->next->spin = 1;
49      }
50  }
```

Figure 4. Unlock procedure. CAS stands for an atomic compare-and-swap instruction.

�e details of the lock procedure are given in Figure 3.�ose familiar with the MCS lock implementation will notea strike similarity between the lock procedure of CNA to theone of MCS. �e only di�erences are in Line 3 and Line 10,which initialize and record the current socket number of athread in its node structure. We note that on the x64 platform,one could use an e�cient rdtscp instruction for the la�er. Ifa platform does not support a lightweight way to retrieve thecurrent socket number, one could store this information in athread local variable, and refresh it periodically, e.g., every1K lock acquisitions. Note that if a thread is migrated and itsactual socket number is di�erent from the one recorded inits node structure, this might have only performance impli-cations but not correctness. Finally, we note that recording

51 cna_node_t *find_successor(cna_node_t *me) {
52   cna_node_t *next = me->next;
53   int mySocket = me->socket;
54   if (mySocket == -1) mySocket = current_numa_node();
55   /* Check if my immediate successor is on the same socket */
56   if (next->socket == mySocket) return next;
57   cna_node_t *secHead = next;
58   cna_node_t *secTail = next;
59   cna_node_t *cur = next->next;
60   /* Traverse the main queue */
61   while (cur) {
62     /* Check if cur is running on my socket */
63     if (cur->socket == mySocket) {
64       if (me->spin > 1)
65         ((cna_node_t *)(me->spin))->secTail->next = secHead;
66       else me->spin = (uintptr_t)secHead;
67       secTail->next = NULL;
68       ((cna_node_t *)(me->spin))->secTail = secTail;
69       return cur;
70     }
71     secTail = cur;
72     cur = cur->next;
73   }
74   return NULL;
75 }
76 /* Long-term fairness threshold */
77 #define THRESHOLD (0xffff)
78 int keep_lock_local() { return pseudo_rand() & THRESHOLD; }

Figure 5. Auxiliary functions called from the Unlock procedure.

the socket number takes place only if the thread finds another node in the (main) queue (cf. Line 8). In other words, when the lock is not contended, this line does not add any overhead to the Lock procedure.
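The periodic refresh of a cached socket number mentioned above could look roughly like the following sketch. The names current_socket_cached, REFRESH_PERIOD and stub_socket_query are hypothetical and do not come from the paper's code. The query callback stands in for whatever platform mechanism (e.g., rdtscp on x64) reports the current socket, and "every 1K lock acquisitions" is taken to mean 1024 calls.

```c
#include <assert.h>
#include <stddef.h>

/* Refresh period (assumption: "every 1K lock acquisitions" read as 1024). */
#define REFRESH_PERIOD 1024u

/* Hypothetical callback type: returns the socket the caller is running on. */
typedef int (*socket_query_fn)(void);

static _Thread_local int cached_socket = -1;       /* -1 = not yet known */
static _Thread_local unsigned uses_since_refresh = 0;

/* Return the cached socket number, re-querying once per REFRESH_PERIOD calls. */
int current_socket_cached(socket_query_fn query) {
    assert(query != NULL);
    if (cached_socket < 0 || uses_since_refresh >= REFRESH_PERIOD) {
        cached_socket = query();
        uses_since_refresh = 0;
    }
    uses_since_refresh++;
    return cached_socket;
}

/* Stand-in query for experimentation: reports socket 0 and counts its calls. */
static unsigned stub_query_calls = 0;
int stub_socket_query(void) {
    stub_query_calls++;
    return 0;
}
```

A stale value between refreshes only means that a migrated thread may briefly be treated as if it were still on its old socket, which, as noted above, affects performance but not correctness.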

The pseudo-code for the unlock procedure is presented in Figure 4, and the auxiliary functions it uses are in Figure 5. Like the unlock procedure of the MCS lock, it starts by checking whether there are other nodes in the main queue (Line 18). If not, it attempts to set the tail of the main queue to either NULL (Line 23) or the tail of the secondary queue if the latter is not empty (Line 28). Note that, similarly to MCS, an atomic compare-and-swap (CAS) instruction is needed to make sure no thread has enqueued itself into the main queue between the previous if-statement and the update of the tail pointer. If that does happen, the lock holder h waits for its successor to update the next pointer in h's node (Line 36).
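The empty-main-queue path described above can be sketched with C11 atomics as follows. This is an illustration under stated assumptions, not the code of Figure 4: the type and function names (node_t, lock_t, release_if_no_successor) are hypothetical, the spin field is assumed to hold 0 (wait), 1 (go), or a pointer to the head of the secondary queue, and the head of the secondary queue is assumed to be granted the lock by storing 1 into its spin field once the CAS succeeds.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef struct node {
    _Atomic(struct node *) next;  /* successor in the main queue */
    _Atomic uintptr_t spin;       /* 0 = wait, 1 = go, >1 = secondary-queue head */
    struct node *sec_tail;        /* tail of the secondary queue (valid in its head) */
} node_t;

typedef struct {
    _Atomic(node_t *) tail;       /* the single lock word: tail of the main queue */
} lock_t;

/* Called by the lock holder `me` when it has observed no successor.
 * Returns true if the lock was released (or handed to the secondary queue).
 * Returns false if another thread enqueued itself concurrently; the caller
 * must then wait for me->next to be set and hand over the lock normally. */
bool release_if_no_successor(lock_t *lock, node_t *me) {
    assert(atomic_load(&me->next) == NULL);
    node_t *expected = me;
    uintptr_t spin = atomic_load(&me->spin);

    if (spin <= 1) {
        /* Secondary queue is empty: try to mark the lock as free. */
        return atomic_compare_exchange_strong(&lock->tail, &expected,
                                              (node_t *)NULL);
    }

    /* Secondary queue is not empty: it becomes the main queue. */
    node_t *sec_head = (node_t *)spin;
    if (atomic_compare_exchange_strong(&lock->tail, &expected,
                                       sec_head->sec_tail)) {
        atomic_store(&sec_head->spin, (uintptr_t)1);  /* pass the lock */
        return true;
    }
    return false;
}
```

In both branches a failed CAS means that the tail no longer points to the holder's node, which is exactly the race with a newly arriving thread that the paragraph above describes.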

If the main queue is not empty, the lock holder h determines the next lock holder by invoking the find_successor function. There, it traverses the main queue and searches for a thread running on the same socket as h. If it finds one, it moves all the threads between h and that successor to the secondary queue (Lines 64–68). Otherwise, it returns NULL without updating the secondary queue (Line 74).

If a successor on the same socket is found (and the keep_lock_local function returns a positive number), the lock is handed over to that successor (Line 42). (Note that at this point me->spin can contain 0 if h has entered an empty queue (cf. Line 8); therefore, in Line 42 we make sure to store a non-zero value into succ->spin.) Otherwise, it is handed to the first node in the secondary queue, after connecting the tail of that queue to h's successor in the main queue (Lines 45–46). (Note that it would be correct to set succ->secTail to NULL after Line 45, but this is unnecessary since succ gets the lock and will not read this field. This explains why in Figure 1 (g) t2's secondaryTail continues to point to t6.) If the secondary queue is empty, the lock is handed over to h's successor in the main queue (Line 48).
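To make the queue manipulation concrete, the following single-threaded sketch transcribes the traversal into plain C so that it can be run on a small example. It is a simplification under stated assumptions: there is no concurrency, the names qnode_t and find_successor_sketch are hypothetical, and the socket of the lock holder is assumed to be already recorded (the -1 case is omitted).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct qnode {
    struct qnode *next;      /* successor in the queue */
    struct qnode *sec_tail;  /* tail of the secondary queue (valid in its head) */
    uintptr_t spin;          /* 0/1, or pointer to the secondary-queue head */
    int socket;              /* socket recorded when the thread enqueued */
} qnode_t;

/* Find the first waiter on the holder's socket. Waiters skipped on the way
 * are appended to the secondary queue reachable through me->spin. */
qnode_t *find_successor_sketch(qnode_t *me) {
    assert(me != NULL && me->next != NULL);
    qnode_t *next = me->next;
    if (next->socket == me->socket) return next;

    qnode_t *sec_head = next;
    qnode_t *sec_tail = next;
    for (qnode_t *cur = next->next; cur != NULL; cur = cur->next) {
        if (cur->socket == me->socket) {
            if (me->spin > 1)   /* secondary queue exists: append to it */
                ((qnode_t *)me->spin)->sec_tail->next = sec_head;
            else                /* otherwise the skipped run becomes it */
                me->spin = (uintptr_t)sec_head;
            sec_tail->next = NULL;
            ((qnode_t *)me->spin)->sec_tail = sec_tail;
            return cur;
        }
        sec_tail = cur;
    }
    return NULL;  /* no waiter on the holder's socket; nothing was moved */
}
```

For a holder on socket 0 followed by waiters on sockets 1, 1, 0 and 1, the function returns the third waiter and leaves the first two in the secondary queue, while the last waiter stays in the main queue behind the new holder.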

6 Optimizations

The implementation of the CNA lock admits a few simple optimizations. First, we can encode the socket of a thread in the next pointer of its predecessor in the queue (cf. Line 11). This would obviate the access to socket in find_successor (cf. Line 56 and Line 63), and thus avoid one or more cache misses on the critical path. This encoding is straightforward on machines with a small number (two or four) of sockets, in which pointers are (4-byte) word-aligned. On bigger machines with more sockets, one can allocate queue node structures with a proper alignment to make space for the socket encoding.
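The pointer encoding described in the first optimization can be illustrated with the following sketch, which assumes 4-byte-aligned queue nodes and therefore two spare low-order bits, enough for up to four sockets. The helper names (pack_next, next_ptr, next_socket) are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Two low-order bits are free when nodes are at least 4-byte aligned. */
#define SOCKET_BITS 2u
#define SOCKET_MASK ((uintptr_t)((1u << SOCKET_BITS) - 1u))

/* Combine a node address and the socket of that node into one word. */
uintptr_t pack_next(void *node, unsigned socket) {
    uintptr_t p = (uintptr_t)node;
    assert((p & SOCKET_MASK) == 0);   /* node must be suitably aligned */
    assert(socket <= SOCKET_MASK);    /* socket must fit in the spare bits */
    return p | (uintptr_t)socket;
}

/* Recover the node address from a packed word. */
void *next_ptr(uintptr_t packed) {
    return (void *)(packed & ~SOCKET_MASK);
}

/* Recover the socket number from a packed word. */
unsigned next_socket(uintptr_t packed) {
    return (unsigned)(packed & SOCKET_MASK);
}
```

With such an encoding, the lock holder learns the socket of its successor from its own node, without touching the successor's cache line. Supporting more sockets amounts to aligning nodes more strictly and widening SOCKET_BITS accordingly.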

Second, when the contention on the lock is light, the (small) overhead of moving threads in and out of the secondary queue might not justify the gain of keeping the lock local to the same socket. As a result, the performance of the CNA lock might suffer in this case. To avoid that, we can reduce the probability of moving threads to the secondary queue (or more precisely, of calling the find_successor function) when the secondary queue is empty, and simply hand over the lock to the successor of the lock holder in the main queue. In other words, we can introduce the so-called shuffle reduction optimization with the following lines of pseudo-code placed between Line 37 and Line 38 in cna_unlock (cf. Figure 4):

if (me->spin <= 1 && (pseudo_rand() & 0xffff) < THRESHOLD2) {
  me->next->spin = 1;
  return;
}

Finally, as already mentioned, the socket number can be cached in a thread-local variable and refreshed periodically. Similarly, instead of drawing a pseudo-random number in every invocation of keep_lock_local, a thread can store the drawn number in a thread-local variable and decrement it with every lock handover. Once the number reaches 0, the thread would draw a new number and have keep_lock_local return zero.
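A countdown variant of keep_lock_local along these lines might look as follows. This is a sketch only: the function name keep_lock_local_countdown, the draw callback and the stub generator are hypothetical, and the mask value repeats the threshold from Figure 5 as an assumption.

```c
#include <assert.h>
#include <stddef.h>

/* Long-term fairness threshold (assumed to match Figure 5). */
#define THRESHOLD 0xffffu

/* Hypothetical callback type: returns a pseudo-random number. */
typedef unsigned (*draw_fn)(void);

/* Number of local handovers still allowed before the next redraw. */
static _Thread_local unsigned handovers_left = 0;

/* Returns non-zero to keep the lock local. Returns zero (and redraws the
 * budget) once the stored number has been counted down to 0. */
int keep_lock_local_countdown(draw_fn draw) {
    assert(draw != NULL);
    if (handovers_left == 0) {
        handovers_left = draw() & THRESHOLD;
        return 0;
    }
    handovers_left--;
    return 1;
}

/* Stand-in generator for experimentation: always draws 3. */
unsigned stub_draw_three(void) {
    return 3;
}
```

The generator is now invoked once per budget rather than once per handover. Note that the resulting distribution of local handovers is not identical to the per-call test in Figure 5, so the threshold would likely need retuning.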

While we experiment with and report initial findings on the shuffle reduction optimization in Section 7, the other optimizations are fairly straightforward engineering tweaks left for future work.

7 Evaluation

For user-space benchmarks, we integrated CNA into LiTL [13], an open-source project¹ that provides implementations of dozens of locks, including the MCS lock as well as several state-of-the-art NUMA-aware locks. All locks in LiTL, as well as our CNA lock, are implemented as dynamic libraries conforming to the pthread mutex lock API defined by the POSIX standard. This allows interposing those locks with any software that uses the standard API (through the LD_PRELOAD mechanism) without modifying or even recompiling the software code. For kernel experiments, we integrated the CNA lock into kernel version 4.15.18. The resulting qspinlock uses the same fast path (based on the test-and-set lock; see Section 3) and the CNA lock as the slow path. In other words, we modified the slow path acquisition function (queued_spin_lock_slowpath) to use CNA instead of MCS. This required modifying just two files in the kernel and adding a few dozen lines of code.

In user-space experiments, we compared the CNA lock to the MCS lock, as well as to several state-of-the-art NUMA-aware locks, namely the C-BO-MCS, C-PTL-TKT and C-TKT-TKT locks (all variants of Cohort locks [10]), the HMCS lock [3] and the HYSHMCS lock [15]. All NUMA-aware locks were configured with similar fairness settings, that is, keeping the lock local to a socket for a similar number of lock handovers before passing the lock to a thread running on another socket. We note that in all our experiments, HMCS and HYSHMCS produced similar performance except where noted, while C-BO-MCS typically performed best among all Cohort lock variants. Therefore, for clarity of presentation only, we omit the results of HYSHMCS and of all Cohort lock variants except C-BO-MCS in subsequent plots.

In the kernel, we compare the existing MCS-based qspinlock implementation to the new one based on CNA. Note that we could not integrate any of the state-of-the-art NUMA-aware locks into the kernel as they require substantially more than 4 bytes of space.

We run our experiments on a system with two Intel Xeon E5-2699 v3 sockets featuring 18 hyperthreaded cores each (72 logical CPUs in total) and running Ubuntu 18.04. We do not pin threads to cores, relying on the OS to make its choices. In all user-space experiments, we employ a scalable memory allocator [1]. We also disable the turbo mode to avoid the effects that mode may have (which vary with the number of threads) on the results. We vary the number of threads in each experiment from 1 to 70, leaving a few spare logical CPUs for any occasional kernel activity that might otherwise skew the results by preempting the lock holder thread. Each reported experiment has been run 3 times in exactly the same configuration. Presented results are the average of the results reported by those 3 runs.

¹ https://github.com/multicore-locks/litl

Our evaluation is designed to answer the following questions: how the CNA lock compares to MCS and NUMA-aware locks in a user-space microbenchmark in terms of throughput and cache miss rates, and how its long-term fairness compares to that of other locks (Section 7.1.1). We also demonstrate that our conclusions extend beyond two-socket systems (Section 7.1.1). Next, we evaluate the CNA lock with a few user-land benchmarks (Section 7.1.2 and Section 7.1.3) and finally explore the impact it has on Linux kernel performance through several kernel microbenchmarks (Section 7.2).

7.1 User-space benchmarks

7.1.1 Key-value map microbenchmark

We consider a simple key-value map implemented on top of an AVL tree protected with a single lock. The benchmark includes operations for inserting, removing and looking up keys (and associated values) stored in the key-value map (tree). After an initial warmup, not included in the measurement interval, all threads start running at the same time, and apply operations chosen uniformly at random from the given operation mix, with keys chosen uniformly at random from the given range. At the end of the measured time period (lasting 10 seconds), the total number of operations is calculated, and the throughput is reported. The key-value map is pre-initialized to contain roughly half of the key range. The benchmark allows varying the key range (effectively controlling the initial size of the key-value map) as well as the amount of external work, i.e., the duration of a non-critical section (simulated by a pseudorandom number calculation loop) between operations on the map.
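One simple way to draw an operation from a given mix is sketched below. The function and type names are hypothetical and the benchmark's actual selection code is not shown in the paper. The sketch assumes, for illustration, that the mix is given as a percentage of updates and that updates are split evenly between inserts and removes.

```c
#include <assert.h>

typedef enum { OP_LOOKUP = 0, OP_INSERT = 1, OP_REMOVE = 2 } op_t;

/* Map a pseudo-random number r to an operation, given the percentage of
 * updates in the mix. Updates alternate between inserts and removes so
 * that they are split evenly; everything else is a lookup. */
op_t choose_op(unsigned r, unsigned update_pct) {
    assert(update_pct <= 100);
    unsigned bucket = r % 100;            /* 0..99 */
    if (bucket >= update_pct) return OP_LOOKUP;
    return (bucket % 2 == 0) ? OP_INSERT : OP_REMOVE;
}
```

A worker thread would call choose_op with a fresh pseudo-random number before each operation, for instance choose_op(r, 20) for a mix with 20% updates.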

Figure 6 (a) shows the results of an experiment with a key range of 1024 and an operation mix of 80% lookups and 20% updates (split evenly between inserts and removes). In this experiment, threads do not perform any external work, which results in substantial contention on the lock protecting the tree and absolutely no scalability. The performance of the MCS lock drops abruptly between one and two threads, as the lock becomes contended by threads running on different sockets. This happens to all other locks except C-BO-MCS. In the case of C-BO-MCS, the global (high-level) lock is a backoff test-and-set lock, which in this case performs well since the same thread manages to acquire the lock repeatedly and effectively starve the other one. We note that backoff-based locks are known to be unfair [18, 20], with backoff timeouts hard to tune for optimal performance [14, 16].

Beyond two threads, the performance of MCS stays mostly flat. At the same time, the CNA lock matches the performance of MCS for 1 and 2 threads, and then improves as it starts taking advantage of its NUMA-awareness. Notably, with more than 4 threads CNA outperforms all variants of Cohort locks and only lags behind HMCS (and HYSHMCS) by a narrow margin. Overall, it achieves a 40% speedup over MCS at 70 threads.

Figure 6 (b) shows the results of the same experiment with an operation mix that does not include lookup operations. As a result, threads perform more writes to shared data in their critical sections. Therefore, we expect that a NUMA-aware lock admission policy would yield even higher benefits in this case, as it would keep the shared data (in addition to the lock word itself) from migrating frequently between sockets. The results in Figure 6 (b) confirm this hypothesis: while generally following the same pattern as Figure 6 (a), all NUMA-aware locks perform better relative to the MCS lock. In particular, CNA achieves a speedup of 51% over MCS at 70 threads.

Figure 7 (a) provides data on the LLC load miss rates (as measured by perf) for the evaluated locks. (The presented data is for the workload with 20% updates shown in Figure 6 (a). Unless specified otherwise, the rest of the section considers this workload only, for brevity.) The rates correlate with the throughput results in Figure 6 (a). We note the sharp increase in the LLC load miss rate between one and two threads (corresponding to the performance collapse over exactly the same interval). Afterwards, MCS exhibits a high LLC load miss rate, unlike all the NUMA-aware locks, including CNA.

Naturally, one may ask how the long-term fairness of all the locks is affected by their NUMA-awareness (or the lack of it). There are several ways to measure the fairness of a lock admission policy; in this work we present the data in the form of a fairness factor. To calculate this factor, we sort the number of operations performed by each thread as reported at the end of the experiment. Then we divide the total number of operations performed by the first half of the threads (in decreasing order of their operation counts) by the total number of operations performed by all threads. Thus, the resulting fairness factor is a number in the range from 0.5 to 1, with a strictly fair lock yielding a factor of 0.5 and a strictly unfair lock yielding a factor close to 1.
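The fairness factor defined above can be computed as in the following sketch. The function name and signature are hypothetical. For an odd number of threads, the "first half" is assumed here to be the first floor(n/2) threads, a detail the text does not specify.

```c
#include <assert.h>
#include <stdlib.h>

/* Comparator for qsort: sorts operation counts in decreasing order. */
static int cmp_desc(const void *a, const void *b) {
    unsigned long x = *(const unsigned long *)a;
    unsigned long y = *(const unsigned long *)b;
    return (x < y) - (x > y);
}

/* Fairness factor: share of all operations performed by the busier half of
 * the threads. Sorts `ops` in place. Returns 0.5 if no operation was run. */
double fairness_factor(unsigned long *ops, size_t nthreads) {
    assert(nthreads > 0);
    qsort(ops, nthreads, sizeof ops[0], cmp_desc);

    unsigned long top = 0, total = 0;
    for (size_t i = 0; i < nthreads; i++) {
        total += ops[i];
        if (i < nthreads / 2) top += ops[i];
    }
    return total ? (double)top / (double)total : 0.5;
}
```

Four threads that each complete the same number of operations yield exactly 0.5, while a run in which a single thread performs all operations yields 1.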

Figure 7 (b) shows the fairness factor for each of the locks. As expected, MCS with its strict FIFO admission policy maintains a fairness factor of 0.5 across all thread counts. HMCS has a fairness factor close to that, while C-BO-MCS has a factor close to 1. The latter is an example of the starvation behavior of a backoff test-and-set lock mentioned above. The CNA lock achieves a slightly higher fairness factor than MCS, yet its factor stays well below 0.6, suggesting that it preserves long-term fairness. We note that, like other state-of-the-art NUMA-aware locks, the CNA lock provides a knob to tune the fairness-vs-throughput tradeoff (cf. keep_lock_local in Figure 5).

Figure 6. Total throughput for the key-value map microbenchmark: (a) 20% updates; (b) 100% updates.

Figure 7. Performance stats for the key-value map microbenchmark (20% updates): (a) LLC load miss rate; (b) long-term fairness.

Figure 8. Total throughput for the key-value microbenchmark with non-critical work.

In the next experiment, we add some amount of external (non-critical section) work that reduces the lock contention and allows the benchmark to scale up to a small number of threads. The results are presented in Figure 8. Indeed, the performance of the benchmark when using the MCS lock increases between 1 and 2 threads, and then stays mostly flat. All NUMA-aware locks scale up to 8-16 threads due to a reduced cache miss rate and less coherence traffic, and then maintain a substantial margin with respect to MCS. Notably, CNA performs slightly worse (by 12%) than MCS at 4 threads. This is because at this thread count the contention is enough to occasionally shuffle waiting threads into the secondary queue, but not high enough to keep them there. Intuitively, it results in continuously paying the (modest) price of accessing (and recording) socket information and restructuring the waiting queue without reducing cache misses in return. We note that the LLC load miss rate for CNA (not shown for this experiment) matches the one for MCS at 4 threads (but drops dramatically after that). As the number of threads increases beyond 4, CNA reaches and maintains a speedup of about 40–45% over MCS, with a performance level between C-BO-MCS and HMCS.

Figure 9. Total throughput for the key-value microbenchmark on a 4-socket machine (same workload as in Figure 6 (a)).

Figure 10. Total throughput for the leveldb benchmark: (a) pre-filled DB; (b) empty DB.

In the context of this experiment, we explore the potential of the shuffle reduction optimization described in Section 6. We implement a variant of CNA in which, if the secondary queue is empty, the lock is handed over to the immediate successor in the main queue (without searching for another thread running on the same socket) with high probability. This variant is denoted as CNA (opt) in Figure 8. With this optimization in place, CNA (opt) closes the gap to (and in fact, outperforms) MCS at 4 threads, and matches the performance of CNA (without the optimization) at all other thread counts. The superior performance over MCS at 4 threads is the result of the shuffling that does take place once in a while, organizing threads' arrivals at the lock in a way that reduces inter-socket lock migration without the need to continuously modify the main queue. This is confirmed by LLC load miss rates, which are lower for CNA (opt) compared to MCS (and CNA) at 4 threads. We also collected statistics on how many times the main waiting queue is altered in CNA, and confirmed that the shuffle reduction optimization indeed reduces this number by almost a factor of ten at 4 threads (and has no impact at other thread counts). We note, however, that the extent to which this occasional reorganization of waiting threads would have a positive effect on performance is expected to be application-dependent.

To validate our conclusions beyond two sockets, we also run experiments on a system with four Intel Xeon E7-8895 v3 sockets featuring 144 logical CPUs in total. Figure 9 shows the results of the experiment with the same workload as the one shown in Figure 6 (a). While qualitatively all evaluated locks exhibited the same behavior, quantitatively the performance of CNA (and other NUMA-aware locks) under contention improved relative to MCS. As an example, at 142 threads CNA performs better than MCS by 107%. We believe this is because the ratio of the cost of a remote cache miss (fetching data from the LLC on another socket) to the cost of a local cache miss (fetching data from the LLC on the same socket) on this machine is higher than on the two-socket machine. This can be seen in the drop in performance between 1 and 2 threads: on the two-socket machine the throughput of the MCS lock goes down from 5.3 ops/us to 1.6 ops/us (cf. Figure 6 (a)), while on the four-socket machine it goes down from 7.1 ops/us to 1.6 ops/us (cf. Figure 9).

7.1.2 leveldb

In the next two sections we explore how the microbenchmark results discussed in Section 7.1.1 extend to real applications. We start with leveldb, an open-source key-value storage library developed by Google². Our experiments were done with the latest release (1.20) of the library, which also includes a built-in benchmark (db_bench).

We used db_bench to create a database with the default 1M key-value pairs. This database was used subsequently in the readrandom mode of db_bench. As its name suggests, the readrandom mode is composed of Get operations on the database with random keys. Each Get operation acquires a global database lock in order to take a consistent snapshot of pointers to internal database structures (and increment reference counters to prevent the deletion of those structures while Get is running). The search operation itself, however, executes without holding the database lock, but acquires locks protecting the (sharded) LRU cache as it seeks to update the cache structure with the accessed key. While the central database lock and internal LRUCache locks are known to be contended [8], the contention is spread over multiple locks.

² https://github.com/google/leveldb

We modified the readrandom mode to run for a fixed time (rather than run a certain number of operations), so we could better control the runtime of each experiment. The reported numbers are the aggregated throughput for runs of 30 seconds in the readrandom mode. Figure 10 (a) presents the results, which show that as long as the benchmark scales, all locks yield similar performance (with only C-BO-MCS lagging slightly behind at 8 threads). However, once the scaling slows down, CNA outperforms MCS (and C-BO-MCS). Overall, the performance pattern is somewhat similar to the one in Figure 8. At the largest thread count, CNA outperforms MCS by 40%.

We also explore how increased contention on the database lock affects the speedup achieved by CNA. To that end, we run the same readrandom mode with an empty database. In this case, the work outside of the critical sections (searching for a key) is minimal and does not involve acquiring any LRU cache lock. The results are shown in Figure 10 (b), and in general echo the microbenchmark results with no external work in Figure 6.

7.1.3 Kyoto Cabinet

We detail the results of the experiment with Kyoto Cabinet, another open-source key-value storage engine³. We use its built-in kccachetest benchmark run in the wicked mode, which exercises an in-memory database with a random mix of operations. Similarly to [8], we modified the benchmark to use the standard POSIX pthread mutex locks, which we interpose with the evaluated locks from the LiTL library. Similarly to leveldb, we modified the benchmark to run for a fixed time and report the aggregate work completed. We also note that, originally, the benchmark sets the key range depending on the number of threads, which makes performance comparison across different thread counts challenging. Therefore, we fixed the key range at a constant 10M elements. Note that all those changes were also applied in [8]. The length of each run was 60 seconds.

The results are presented in Figure 11. The performance of the kccachetest benchmark does not scale, and in fact becomes worse as the contention grows. Therefore, the best performance is achieved with a single thread, and CNA is the only lock to match the performance of MCS at that thread count. As the number of threads grows beyond 4, CNA (and other NUMA-aware locks) exploit the increasing lock contention and outperform MCS. At 70 threads, CNA performs better than MCS by 34%.

³ http://fallabs.com/kyotocabinet

Figure 11. Total throughput for Kyoto Cabinet.

Figure 12. Performance results for locktorture.

7.2 Linux kernel

7.2.1 locktorture

The locktorture benchmark is distributed as a part of the Linux kernel⁴. It is implemented as a loadable kernel module, and according to its documentation, "runs torture tests on core kernel locking primitives", including qspinlock, the kernel spin lock. It creates a given number of threads that repeatedly acquire and release the lock, with occasional short delays (citing the comment in the source code, "to emulate likely code") and occasional long delays ("to force massive contention") inside the critical section. At the end of a run (lasting 30 seconds in our case), the total number of lock operations performed by all threads is reported.

The performance results for the locktorture benchmark are given in Figure 12. The unmodified Linux kernel is denoted as stock, while the version with CNA integrated into the qspinlock slow path is denoted as patched. Overall, we see that patched either matches or outperforms, at times substantially, the stock version. Specifically, as long as the stock version scales (up to 18 threads), patched performs on par with (and at times slightly better than) stock. However, once stock reaches its peak at 18 threads, its performance fades as the number of threads (and the lock contention) grows. The patched version exploits this contention (by reducing inter-socket lock handovers) and manages to maintain nearly peak performance. At 70 threads, patched outperforms stock by 39%.

⁴ https://www.kernel.org/doc/Documentation/locking/locktorture.txt

7.2.2 will-it-scale

Finally, we present the results of will-it-scale, a suite of user-space microbenchmarks designed to stress various kernel sub-systems⁵. We experimented with several microbenchmarks in will-it-scale, and based on lock_stat statistics⁶ identified a few that create severe contention on various spin locks in the kernel. We note that for microbenchmarks that did not show signs of spin lock contention in lock_stat, the patched version performed similarly to stock. On the other hand, microbenchmarks with contention on one (or more) spin locks in the kernel exhibited a similar performance pattern, as demonstrated by a few examples included below.

Figure 13 presents the results for four microbenchmarks from will-it-scale. In the first pair, threads repeatedly lock and unlock a file lock through the fcntl command, with the only difference being that in lock1_threads each thread works with a separate file, while in lock2_threads all threads operate on a lock associated with the same file. In the second pair, threads repeatedly open and close a file (a separate file for each thread) in either the same directory (open1_threads) or a different one for each thread (open2_threads). A summary of the points of contention in each of those benchmarks is given in Table 1.

Overall, the charts in Figure 13 show behavior similar to that of the key-value map microbenchmark presented in Figure 8. Specifically, the patched version matches the performance of stock as long as they both scale (and where the contention on the spin locks detailed in Table 1 does not exist yet). At the performance peak, the patched version slightly underperforms stock (by about 10%) as it pays for the overhead of restructuring the queue of waiting threads without reaping the benefit. However, as the contention on the respective spin lock(s) increases, the performance of stock degrades, while the patched version maintains a close-to-peak performance level. This allows the patched version to outperform stock by 38–63% at 70 threads.

In addition to the patched version, Figure 13 includes the results of a variant with the shuffle reduction optimization in place (cf. Section 6), denoted as patched (opt). This variant reduces the slowdown at the peak performance to around 5% or less while maintaining speedups of up to over 60% under contention.

⁵ https://github.com/antonblanchard/will-it-scale
⁶ https://www.kernel.org/doc/Documentation/locking/lockstat.txt

Benchmark       Contended spin locks         Call sites
lock1_threads   files_struct.file_lock       alloc_fd, fcntl_setlk
lock2_threads   file_lock_context.flc_lock   posix_lock_inode
open1_threads   files_struct.file_lock       alloc_fd, close_fd
                lockref.lock                 dput, d_alloc, lockref_get_not_zero, lockref_get_not_dead
open2_threads   files_struct.file_lock       alloc_fd, close_fd

Table 1. Contention in the will-it-scale benchmarks.

8 Conclusion

The paper presents the construction of CNA, a compact NUMA-aware queue spin lock. Unlike state-of-the-art NUMA-aware locks, which are hierarchical in their nature and thus have a memory cost proportional to the number of sockets, CNA's state requires only one word of memory. This feature, paired with the fact that CNA requires only one atomic instruction for acquisition (and at most one for release), makes CNA an attractive alternative to any (NUMA-oblivious or NUMA-aware) lock.

We implemented CNA as a stand-alone POSIX API-compliant library and also integrated it into the Linux kernel, replacing the implementation of the spin lock slow path in the latter. Our evaluation using both user-space and kernel benchmarks shows that CNA matches MCS in single-thread performance, but outperforms it (typically, by at least 40%) under contention. This is achieved by reducing the number of remote cache misses in lock handovers and in accesses to data in critical sections protected by the lock. At the same time, the admission policy of CNA achieves long-term fairness comparable to that of MCS. When compared to state-of-the-art NUMA-aware locks, CNA achieves a similar level of performance despite requiring substantially less space.

In future work, we aim to explore the effect of the various optimizations described in this paper on the performance of CNA, and also to evaluate CNA in other user-space and kernel benchmarks.

Figure 13. Performance results for the will-it-scale benchmarks: (a) lock1_threads; (b) lock2_threads; (c) open1_threads; (d) open2_threads.

References

[1] Yehuda Afek, Dave Dice, and Adam Morrison. Cache index-aware memory allocation. In Proceedings of the International Symposium on Memory Management (ISMM), pages 55–64, 2011.

[2] T. E. Anderson. The performance of spin lock alternatives for shared-memory multiprocessors. IEEE Trans. Parallel Distrib. Syst., 1(1):6–16, January 1990.

[3] Milind Chabbi, Michael Fagan, and John Mellor-Crummey. High performance locks for multi-level NUMA systems. In Proceedings of the ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP), pages 215–226, 2015.

[4] Milind Chabbi and John Mellor-Crummey. Contention-conscious, locality-preserving locks. In Proceedings of the ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP), 2016.

[5] Jonathan Corbet. Cramming more into struct page. https://lwn.net/Articles/565097, August 28, 2013. Accessed: 2018-10-01.

[6] Jonathan Corbet. MCS locks and qspinlocks. https://lwn.net/Articles/590243, March 11, 2014. Accessed: 2018-09-12.

[7] Jonathan Corbet. Range reader/writer locks for the kernel. https://lwn.net/Articles/267968, February 6, 2008. Accessed: 2018-09-28.

[8] Dave Dice. Malthusian locks. In Proceedings of the Twelfth European Conference on Computer Systems (EuroSys), pages 314–327, 2017.

[9] Dave Dice, Virendra J. Marathe, and Nir Shavit. Flat-combining NUMA locks. In Proceedings of the Twenty-third Annual ACM Symposium on Parallelism in Algorithms and Architectures (SPAA), pages 65–74, 2011.

[10] Dave Dice, Virendra J. Marathe, and Nir Shavit. Lock cohorting: A general technique for designing NUMA locks. ACM Trans. Parallel Comput., 1(2):13:1–13:42, 2015.

[11] E. W. Dijkstra. Solution of a problem in concurrent programming control. Commun. ACM, 8(9), 1965.

[12] Stijn Eyerman and Lieven Eeckhout. Modeling critical sections in Amdahl's law and its implications for multicore design. In Proceedings of the Annual International Symposium on Computer Architecture (ISCA), pages 362–370, 2010.

[13] Hugo Guiroux, Renaud Lachaize, and Vivien Quéma. Multicore locks: The case is not closed yet. In Proceedings of the USENIX Annual Technical Conference (ATC), pages 649–662, 2016.

[14] Ryan Johnson, Manos Athanassoulis, Radu Stoica, and Anastasia Ailamaki. A New Look at the Roles of Spinning and Blocking. In Proceedings of the International Workshop on Data Management on New Hardware (DaMoN). ACM, 2009. URL: http://doi.acm.org/10.1145/1565694.1565700.

[15] Sanidhya Kashyap, Changwoo Min, and Taesoo Kim. Scalable NUMA-aware blocking synchronization primitives. In Proceedings of the USENIX Conference on Usenix Annual Technical Conference (ATC), pages 603–615, 2017.

[16] Beng-Hong Lim and Anant Agarwal. Waiting Algorithms for Synchronization in Large-scale Multiprocessors. ACM Transactions on Computing Systems, 1993. URL: http://doi.acm.org/10.1145/152864.152869.

[17] Waiman Long. qspinlock: Introducing a 4-byte queue spinlock implementation. https://lwn.net/Articles/561775, July 31, 2013. Accessed: 2018-09-19.

[18] Victor Luchangco, Dan Nussbaum, and Nir Shavit. A hierarchical CLH queue lock. In Proceedings of the 12th International Conference on Parallel Processing (EuroPar), pages 801–810, 2006.

[19] John M. Mellor-Crummey and Michael L. Scott. Algorithms for scalable synchronization on shared-memory multiprocessors. ACM Trans. Comput. Syst., 9(1):21–65, 1991.

[20] Z. Radovic and E. Hagersten. Hierarchical backoff locks for nonuniform communication architectures. In International Symposium on High-Performance Computer Architecture (HPCA), pages 241–252, 2003.