This paper is included in the Proceedings of the 2015 USENIX Annual Technical Conference (USENIX ATC ’15), July 8–10, 2015, Santa Clara, CA, USA. ISBN 978-1-931971-225. Open access to the Proceedings of USENIX ATC ’15 is sponsored by USENIX.

https://www.usenix.org/conference/atc15/technical-session/presentation/lepers

Thread and Memory Placement on NUMA Systems: Asymmetry Matters

Baptiste Lepers, Simon Fraser University

Vivien Quéma, Grenoble INP

Alexandra Fedorova, Simon Fraser University

Abstract

It is well known that the placement of threads and memory plays a crucial role in performance on NUMA (Non-Uniform Memory-Access) systems. The conventional wisdom is to place threads close to their memory, to collocate on the same node threads that share data, and to segregate on different nodes threads that compete for memory bandwidth or cache resources. While many studies addressed thread and data placement, none of them considered a crucial property of modern NUMA systems that is likely to prevail in the future: asymmetric interconnect. When the nodes are connected by links of different bandwidth, we must consider not only whether the threads and data are placed on the same or different nodes, but also how these nodes are connected.

We study the effects of asymmetry on a widely available x86 system and find that performance can vary by more than 2× under the same distribution of threads and data across the nodes but different inter-node connectivity. The key new insight is that the best-performing connectivity is the one with the greatest total bandwidth, as opposed to the smallest number of hops. Based on our findings, we designed and implemented a dynamic thread and memory placement algorithm in Linux that delivers similar or better performance than the best static placement and up to 218% better performance than when the placement is chosen randomly.

1 Introduction

Typical modern CPU systems are structured as several CPU/memory nodes connected via an interconnect. These architectures are usually characterized by non-uniform memory access times (NUMA), meaning that the latency of data access depends on where (which CPU cache or memory node) the data is located. For this reason, the placement of threads and memory plays a crucial role in performance. This property inspired many NUMA-aware algorithms for operating systems. Their insight is to place threads close to their memory [19, 12, 9], to spread the memory pages across the system to avoid overloading memory controllers and interconnect links [12], to collocate data-sharing threads on the same node [30, 31] while avoiding memory controller contention [7, 31, 10], and to segregate threads competing for cache and memory bandwidth on different nodes [34].

Further, modern operating systems aim to reduce the number of hops used for thread-to-thread and thread-to-memory communication. When balancing the load across CPUs, Linux first uses CPUs on the same node, then those one hop apart, and lastly those two or more hops apart. These techniques assume that the interconnect between nodes is symmetric: given any pair of nodes connected via a direct link, the links have the same bandwidth and the same latency. On modern NUMA systems this is not the case.

Figure 1 depicts an AMD Bulldozer NUMA machine with eight nodes (each hosting eight cores). Interconnect links exhibit many disparities: (i) Links have different bandwidths: some are 16-bit wide, some are 8-bit wide. (ii) Some links can send data faster in one direction than in the other (i.e., one side sends data at 3/4 the speed of a 16-bit link, while the other side can only send data at the speed of an 8-bit link); we call these links 16/8-bit links. (iii) Links are shared differently: for instance, the link between nodes 4 and 3 is used only by these two nodes, while the link between nodes 2 and 3 is shared by nodes 0, 1, 2, 3, 6 and 7. (iv) Some links are unidirectional: for instance, node 7 sends requests directly to node 3, but node 3 routes its answers via node 2. This creates an asymmetry in read/write bandwidth: node 7 can write at 4 GB/s to node 3, but can only read at 2 GB/s.

The asymmetry of interconnect links has dramatic and at times surprising effects on performance. Figure 2 shows the performance of 20 different applications on Machine A (Figure 1); additional details about the applications and the machine are provided in Section 5. Each application runs with 24 threads, and so it needs three nodes to run on. We vary which three nodes are assigned to the application and hence the connectivity between the nodes. The relative placement of threads and memory on those nodes is identical in all configurations; the only difference is how the chosen nodes are connected. The figure shows the performance on the best-performing and the worst-performing subset of nodes for that application compared to the average (obtained by measuring the performance on all 336 unique subsets of nodes and computing the mean). We make several observations. First, the performance on the best subset is up to 88% faster than the average, and the performance on the worst subset is up to 44% slower. Second, the maximum performance difference between the best and the worst subsets is 237% (for facerec). Finally, the mean difference between the best and worst subsets is 40% and the median 14%. In the following section we demonstrate that these performance differences are caused by the asymmetry of the interconnect between the nodes.

This work makes the following contributions:

• We quantify and characterize the effects of asymmetric interconnect on a commercial x86 NUMA system. The key insight is that the best-performing connectivity is the one with the greatest total bandwidth, as opposed to the smallest number of hops.

• We design, implement and evaluate a new algorithm that dynamically picks the best subset of nodes for applications requiring more than one node. This algorithm places the clusters of threads and their memory to ensure that the most intensive CPU-to-CPU or CPU-to-memory communication occurs between the best-connected nodes. Our evaluation shows that this algorithm performs as well as or better than the best set of nodes chosen statically.

• Our implementation revealed limitations in hardware counters, which prevented us from having certain flexibility in the algorithm. We discuss them and make suggestions for improvements.

The paper is structured as follows. Section 2 studies the impact of interconnect asymmetry and discusses challenges in catering to this phenomenon in an operating system. Section 3 discusses current architectural trends and shows that machines are becoming increasingly asymmetric. Section 4 presents our algorithm, and Section 5 reports on the evaluation. Section 6 discusses related work, and Section 7 provides a summary.


2 The Impact of Interconnect Asymmetry on Performance

To explain the reasons behind the performance reported in Figure 2, Figure 3 shows the memory latency measured when the application runs on the best and worst node subsets relative to the latency averaged across all 336 possible subsets. We can see that memory accesses performed by facerec are approximately 600 cycles faster when running on the best subset of nodes relative to the average, and 1400 cycles faster relative to the worst. The latency differences are tightly correlated with the performance differences between configurations: the applications that are the most affected by the choice of nodes on which to run are also those with the highest difference in memory latencies.

To further understand the cause of very high latencies on “bad” configurations, we analyzed streamcluster, an application from the Parsec [26] benchmark suite that was among the most affected by the placement of its threads and memory. In the following experiment we run streamcluster with 16 threads on two nodes. Table 1 presents the salient metrics for each possible two-node subset. Depending on which two nodes we chose, we observe large (up to 133%) disparities in performance. The data in Table 1 leads to several crucial observations:

• As shown earlier, performance is correlated with the latency of memory accesses.

• Surprisingly, the latency of memory accesses is not correlated with the number of hops between the nodes: some two-hop configurations (see Table 1) are faster than one-hop configurations.

• The latency of memory accesses is actually correlated with the bandwidth between the nodes. Note that this makes sense: the difference between one-hop and two-hop latency is only 80 cycles when the interconnect is nearly idle, so a higher number of hops alone cannot explain the latency differences of thousands of cycles.

Bandwidth between the nodes matters more than the distance between them. So the problem of choosing a “good” subset of nodes is essentially the problem of placing threads and memory pages on a well-connected subset of nodes. When an application executes on only two nodes on a machine similar to the one used in the aforementioned experiments, the placement on the nodes connected with the widest (16-bit) link is always the best, because it maximizes the bandwidth and minimizes the latency between the nodes. However, when an application needs more than two nodes to run, no configuration exists with 16-bit links between every pair of nodes, so we must decide which nodes to pick.


[Topology diagram omitted. Two rows of nodes (0, 4, 5, 1 on top; 6, 2, 3, 7 below) connected by 8-bit, 16-bit and 16/8-bit links (Machines A and B).]

Figure 1: Modern NUMA systems with eight nodes. The width of links varies, some paths are unidirectional (e.g., between 7 and 3) and links may be shared by multiple nodes. Machine A has 64 cores (8 cores per node, not represented in the picture) and machine B has 48 cores (6 cores per node). Not shown in the picture: the links between nodes 4 and 1 and between nodes 2 and 7 are bidirectional on machine B. This changes the routing of requests from node 7 to 2 and node 1 to 4.

[Bar chart omitted. Y-axis: performance improvement relative to average placement (%); bars for the worst and best placement of each application: bt.B.x, cg.C.x, ep.C.x, ft.C.x, is.D.x, lu.B.x, mg.C.x, sp.A.x, ua.B.x, swaptions, kmeans, matrixmultiply, wc, wr, wrmem, graph500, specjbb, streamcluster, pca, facerec.]

Figure 2: Performance difference between the best and worst thread placement with respect to the average thread placement on Machine A. Applications run with 24 threads on three nodes. Graph500, specjbb, streamcluster, pca and facerec are highly affected by the choice of nodes and are shown separately with a different y-axis range.

[Bar chart omitted. Y-axis: latency of memory accesses compared to average placement (cycles); bars for the worst and best placement of each application, same applications as in Figure 2.]

Figure 3: Difference in latency of memory accesses between the best and worst thread placement with respect to the average node configuration on Machine A. Positive numbers mean that memory accesses are faster than the average.


Nodes | Master thread node | Execution time (s) | Diff. with 0-1 (%) | Latency of memory accesses in cycles (compared to 0-1) | % accesses via 2-hop links | Bandwidth to the "master" node (MB/s)
0-1 | - | 148 | 0% | 750 | 0 | 5598
0-4 | - | 228 | 56% | 1169 (56%) | 0 | 2999
0-2 | 0 | 228 | 56% | 1179 (57%) | 0 | 2973
0-2 | 2 | 168 | 15% | 855 (14%) | 0 | 4329
0-3 | 0 | 340 | 133% | 1527 (104%) | 98 | 1915
0-3 | 3 | 185 | 27% | 1040 (39%) | 98 | 3741
0-5 | 0 | 340 | 133% | 1601 (113%) | 98 | 1903
0-5 | 5 | 228 | 56% | 1206 (61%) | 98 | 2884
3-7 | 3 | 185 | 27% | 1020 (36%) | 0 | 3748
3-7 | 7 | 338 | 132% | 1614 (115%) | 98 | 1928
5-1 | 1 | 338 | 132% | 1612 (115%) | 98 | 1891
5-1 | 5 | 230 | 58% | 1200 (60%) | 0 | 2880
2-7 | 2 | 167 | 15% | 867 (16%) | 98 | 3748
2-7 | 7 | 225 | 54% | 1220 (63%) | 0 | 3014
4-1 | 4 | 230 | 58% | 1205 (60%) | 0 | 2959
4-1 | 1 | 226 | 55% | 1203 (60%) | 98 | 2880

Table 1: Performance of streamcluster executing with 16 threads on 2 nodes on machine A. The performance depends on the connectivity between the nodes on which streamcluster is executing and on the node on which the master thread is executing. Some 2-hop configurations (98% of accesses via 2-hop links) are as fast as or faster than some 1-hop configurations, e.g., nodes 2-7 with the master thread on node 2.

When there is more than one application running, we also need to decide how to allocate the nodes among the applications.

Nodes | streamcluster | SPECjbb
0, 1, 3, 4, and 7 | -64% | 0% (best)
2, 3, 4, 5, and 6 | 0% (best) | -9.4%

Table 2: Performance of streamcluster and SPECjbb on two different sets of nodes on machine A, relative to the best set of nodes for the respective application.

In this paper, we present a new thread and memory placement algorithm. Designing such an algorithm for asymmetrically connected NUMA systems is challenging for the following reasons:

Efficient online measurement of communication patterns is challenging: The algorithm must measure the volume of CPU-to-CPU and CPU-to-memory communication for different threads in order to determine the best placement when we cannot run the entire application on the best-connected nodes. This measurement process must be very efficient, because it must be done continuously in order to adapt to phase changes.

Changing the placement of threads and memory may incur high overhead: Frequent migration of threads may be costly because of the associated CPU overhead, but most importantly because cache affinity is not preserved. Moreover, when threads are migrated to "better" nodes, it might be necessary to migrate their memory in order to avoid the overhead of remote accesses and overloaded memory controllers. Migrating large amounts of memory can be extremely costly. Thus, thread migration must be done in a way that minimizes memory migration.

Accommodating multiple applications simultaneously is challenging: Applications have different communication patterns and are thus differently impacted by the connectivity between the nodes they run on. As an illustration, Table 2 presents the performance of streamcluster and SPECjbb executing on two different sets of five nodes (the best set of nodes for each of the two applications, respectively). The two applications behave differently on these two sets of nodes: streamcluster is 64% slower on the best set of nodes for SPECjbb than on its own best set. The algorithm must, therefore, determine the best set of nodes for every application. Furthermore, it cannot always place each application on its best set of nodes, because applications may have conflicting preferences.

Selecting the best placement is combinatorially difficult: The number of possible application placements on an eight-node machine is very large (e.g., 5040 possible configurations for four applications executing on two nodes each). So, (i) it is not possible to try all configurations online by migrating threads and then choosing the best configuration, and (ii) doing even the simplest computation involving "all possible placements" can still add significant overhead to a placement algorithm.

Before describing how we addressed these challenges, we briefly discuss architectural trends and the increasing impact of interconnect asymmetry.

3 Architectural trends

Asymmetric interconnect is not a new phenomenon. Nevertheless, we show in this section that its effects on performance are increasing as machines are built with more nodes and cores. For that purpose, we measured the performance of streamcluster on four different asymmetric machines: two recent machines with 64 and 48 cores respectively and 8 nodes (Machines A and B, Figure 1), and two older machines with 24 and 16 cores respectively and 4 nodes (Machines C and D, not depicted). Machines A and B have highly asymmetric interconnects; Table 1 lists all possible interconnect configurations between 2 nodes. Machines C and D have a less pronounced asymmetry. Machine C has full connectivity, but two of the links are slower than the rest. Machine D has links with equal bandwidth, but two nodes do not have a link between them.

Table 3 shows the performance of streamcluster with 16 threads on the best-performing and the worst-performing set of nodes on each machine. The performance difference between the best and worst configurations increases with the number of cores in the machine: from 3% for the 16-core machine to 133% for the 64-core machine. We explain this as follows: (i) On the 16-core Machine D, the only difference between configurations is the longer wire delay between the nodes that are not connected via a direct link. This delay is not significant compared to the extra latency induced by bandwidth contention on the interconnect. (ii) The CPUs of the 24-core Machine C have a low frequency compared to the other machines; as a result, the impact of longer memory latency is not as pronounced. More importantly, the network on this machine is still a fully connected mesh, so there is less asymmetry than on Machines A and B. (iii) The 48- and 64-core Machines B and A offer a wider range of bandwidth configurations, which increases the difference between the best and the worst placements. The 64-core machine is more affected than the 48-core machine because it has more cores per node, which increases the effects of bandwidth contention.

If this trend holds across different machines and architectures, then it is clear that the effects of asymmetry can no longer be ignored.

Machine | Best time | Worst time | Difference
A (64 cores) | 148 s | 340 s | 133%
B (48 cores) | 149 s | 277 s | 85%
C (24 cores) | 171 s | 229 s | 33%
D (16 cores) | 255 s | 262 s | 3%

Table 3: Performance of streamcluster executing on 2 nodes on machines A, B, C, and D. The performance of streamcluster depends on the placement of its threads. The impact of thread placement is larger on recent machines (A and B) than on older ones (C and D).

4 Solution

4.1 Overview

We designed AsymSched, a thread and memory placement algorithm that takes into account the bandwidth asymmetry of NUMA systems. AsymSched's goal is to maximize the bandwidth for CPU-to-CPU communication, which occurs between threads that exchange data, and CPU-to-memory communication, which occurs between a CPU and a memory node upon a cache miss. To that end, AsymSched places threads that perform extensive communication on relatively well-connected nodes, and places the frequently accessed memory pages such that data requests are either local or travel across high-bandwidth paths.

AsymSched is implemented as a user-level process and interacts with the kernel and the hardware using system calls and the /proc file system, but it could also easily be integrated with the kernel scheduler if needed.

AsymSched continuously monitors hardware counters to detect opportunities for better thread placements. The thread placement decision occurs every second, and AsymSched only migrates threads when the benefit of migration is expected to exceed its overhead. The placement of memory pages follows the placement of threads.
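To make the control flow concrete, the following is a minimal sketch of this outer loop in C. The helper functions (update_counters, decide, migration_worth_it, apply_placement) are hypothetical names of ours, not taken from the released implementation.

    /* Minimal sketch of AsymSched's outer loop, assuming hypothetical
     * helpers; decisions are taken once per second, and a placement is
     * applied only when its benefit is expected to exceed its cost. */
    #include <unistd.h>

    struct placement;                           /* opaque here               */
    void update_counters(void);                 /* measurement component     */
    struct placement *decide(void);             /* best placement, or NULL   */
    int migration_worth_it(struct placement *); /* benefit > overhead check  */
    void apply_placement(struct placement *);   /* migrate threads, then mem */

    void asymsched_loop(void) {
        for (;;) {
            update_counters();
            struct placement *p = decide();
            if (p && migration_worth_it(p))
                apply_placement(p);
            sleep(1);                           /* decision every second     */
        }
    }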

AsymSched relies on three main techniques to manage threads and memory: (i) Thread migration: changing the node where a thread is running. (ii) Full memory migration: migrating all pages of an application from one node to another; full memory migration is performed using a new system call that we present in Section 4.3. (iii) Dynamic memory migration: migrating only the pages that an application actively accesses. Dynamic memory migration uses Instruction-Based Sampling (IBS), a profiling mechanism available in AMD processors (Intel processors have a similar mechanism called Precise Event-Based Sampling, PEBS), to sample memory accesses and to identify the most frequently accessed pages. The pages that are not shared are then migrated to the node that accesses them, while shared pages are spread across multiple nodes. We use the same algorithm and techniques described in [12]; we therefore omit further details on dynamic memory migration.
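For reference, full memory migration through the stock kernel interface can be expressed with libnuma's wrapper around the migrate_pages system call, as in the sketch below (link with -lnuma). This is only the slow baseline that the custom system call of Section 4.3 replaces, not the paper's implementation.

    /* Baseline full memory migration via libnuma's migrate_pages wrapper.
     * Moves every page of `pid` residing on `from_node` to `to_node`. */
    #include <numa.h>
    #include <sys/types.h>

    int full_migration(pid_t pid, int from_node, int to_node) {
        struct bitmask *from = numa_allocate_nodemask();
        struct bitmask *to   = numa_allocate_nodemask();
        numa_bitmask_setbit(from, from_node);
        numa_bitmask_setbit(to, to_node);
        int ret = numa_migrate_pages(pid, from, to);  /* -1 on error */
        numa_free_nodemask(from);
        numa_free_nodemask(to);
        return ret < 0 ? -1 : 0;
    }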

4.2 Algorithm

AsymSched relies on three components. The measurement component continuously computes salient metrics. The decision component uses these metrics to periodically compute the best thread placements. The migration component migrates threads and memory. Table 4 presents the definitions relevant to AsymSched, and Algorithm 1 summarizes the algorithm.



Per-cluster (C) statistics:
Crbw: remote bandwidth, i.e., the number of memory accesses performed by threads in the cluster to another node (remote accesses).
Cweight: "weight" of the cluster. Clusters with the highest weights are scheduled on the nodes with the highest interconnect bandwidth. By default Cweight = log(Crbw).
Cbw(P): maximum bandwidth of C's threads on placement P.

Per-placement (P) statistics:
Pwbw: weighted total bandwidth of P, equal to the sum of Cbw(P) * Cweight over every placed cluster C.
Pmm: amount of memory that has to be migrated to use this placement.

Per-application (A) statistics:
Atm: time already spent migrating memory.
Att: dynamic running time of the application.
Amm[node]: resident set size of the application, per node.
AoldA: percentage of memory accesses performed on nodes on which the application was scheduled but is no longer scheduled.

Table 4: Definitions relevant to AsymSched.

Measurement. AsymSched continuously gathers the metrics characterizing the volume of CPU-to-CPU and CPU-to-memory communication. On our experimental system there is a single counter that captures both: it measures the number of data accesses performed by a CPU to a given node, and includes both accesses to cached data (CPU-to-CPU communication) and to data located in RAM (CPU-to-memory communication). Ideally, we would like to measure the communication volume from every CPU to every other CPU; however, the counters available on AMD systems do not offer this opportunity. One alternative is to use AMD's Instruction-Based Sampling (IBS). Unfortunately, to accurately track CPU-to-CPU communication, IBS requires a high sampling rate, and that introduces too much overhead. Lightweight Profiling (LWP), a new profiling facility of AMD processors, has a smaller overhead, but is only partially implemented in current processors. Despite these limitations, we believe that it is only a matter of time until they are addressed in mainstream hardware, so for the time being we use the following work-around.

The algorithm described below relies on detecting which threads share data. Since we can only practically measure the communication between a CPU and a remote node, but not CPU-to-CPU communication (either across or within nodes), we make the following simplifying assumptions: (a) a thread may share data with any other thread running on the same node; (b) if there is a high volume of communication between a CPU and a node, a thread running on that CPU may share data with any thread of the same application on that node. To reduce the occurrence of situations where we assume data sharing while in reality there is none, we initially collocate threads from the same application on the same node, to the extent possible. Data sharing is far more common between threads from the same application than between threads from different applications.

The downside of this simplifying assumption is that we may unnecessarily keep a group of threads collocated on the same node even if they do not share data. But there is also an important benefit: characterizing the communication in terms of CPU-to-node keeps the number of sharing relationships to consider small and reduces the complexity of the algorithm.
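A per-CPU counter of this kind can be sampled from user level with perf_event_open, as in the rough sketch below. The raw event encoding (0x1E0) is a placeholder of ours: the actual "CPU to node" event number and the PMU it lives on are family-specific and must be taken from the processor documentation, and the paper does not specify which interface its implementation uses.

    /* Sketch: read a per-CPU counter of accesses to a remote node via
     * perf_event_open(). The raw event code is a PLACEHOLDER; adjust it
     * for the actual CPU family before use. */
    #define _GNU_SOURCE
    #include <linux/perf_event.h>
    #include <sys/syscall.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <unistd.h>

    static int open_counter(int cpu, uint64_t raw_event) {
        struct perf_event_attr attr = {0};
        attr.size   = sizeof(attr);
        attr.type   = PERF_TYPE_RAW;   /* raw, CPU-specific event        */
        attr.config = raw_event;       /* placeholder encoding           */
        /* system-wide on one CPU: pid = -1, cpu = target CPU */
        return syscall(__NR_perf_event_open, &attr, -1, cpu, -1, 0);
    }

    int main(void) {
        uint64_t count;
        int fd = open_counter(0, 0x1E0);   /* hypothetical event number  */
        if (fd < 0) { perror("perf_event_open"); return 1; }
        sleep(1);                          /* measurement interval       */
        read(fd, &count, sizeof(count));
        printf("cpu0 remote accesses: %llu\n", (unsigned long long)count);
        return 0;
    }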

Decision. The following description relies on the definitions in Table 4. Step 1: AsymSched groups threads of the same application that share data into virtual clusters. A cluster is simply a list of threads that share data. It then assigns a weight Cweight to each cluster; clusters with the highest weights will be scheduled on the nodes with the best connectivity. By default, clusters are weighted by the logarithm of the number of remote memory accesses performed by their threads (Cweight = log(Crbw)). The logarithm deemphasizes small differences in Crbw between the clusters while preserving large differences. This makes it much easier for the algorithm to pick out the clusters with a relatively high Crbw and place them on well-connected nodes.

Step 2: AsymSched computes possible placements for all the clusters. A placement is an array mapping clusters to nodes. It works at the node granularity, so the number of possible placements is equal to the number of node permutations (i.e., migrating all threads of node X to node Y and vice versa). As this number can be very large, it is important that AsymSched not test all possible placements; Section 4.3 details how AsymSched avoids doing so. For each placement P, AsymSched computes the maximum bandwidth Cbw(P) that each cluster C would receive if it were put in this placement. Each placement is assigned a performance metric Pwbw, the weighted bandwidth of P, defined as Pwbw = Σ_{C ∈ clusters} Cbw(P) * Cweight. The higher Pwbw, the higher the bandwidth available to clusters that perform a lot of remote communication. The definition of Cweight implies that our algorithm aims to optimize the overall communication bandwidth across all applications. The algorithm can be easily changed to optimize a different metric, e.g., one that takes into account application priorities, by changing the definition of Cweight.

Step 3: AsymSched filters placements to keep only those that have a weighted bandwidth value at least equal to 90% of the maximum weighted bandwidth. Among these remaining placements, AsymSched chooses the one that minimizes the number of page migrations.
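Steps 2 and 3 can be summarized with the C sketch below. The data structures and the cluster_bw estimator are assumptions of ours for illustration; only the scoring rule (Pwbw, the 90% filter, and the minimum-migration tie-break) comes from the text, and Crbw is assumed to be greater than 1 so that its logarithm is well defined.

    #include <math.h>
    #include <stddef.h>

    #define MAX_CLUSTERS 8

    struct cluster { double rbw; };          /* C_rbw: remote accesses     */
    struct placement {
        int    node_of[MAX_CLUSTERS];        /* cluster -> node assignment */
        double wbw;                          /* P_wbw, computed below      */
        double mm;                           /* P_mm: memory to migrate    */
    };

    /* Bandwidth cluster c would get under placement p; in reality derived
     * from the machine topology, stubbed here for illustration. */
    static double cluster_bw(const struct placement *p, int c) {
        (void)p; (void)c;
        return 1.0;
    }

    /* P_wbw = sum over clusters of C_bw(P) * C_weight, C_weight = log(C_rbw) */
    static double weighted_bw(const struct placement *p,
                              const struct cluster *cl, int nc) {
        double wbw = 0;
        for (int c = 0; c < nc; c++)
            wbw += cluster_bw(p, c) * log(cl[c].rbw);
        return wbw;
    }

    /* Keep placements within 90% of the best P_wbw, then pick the one
     * requiring the least memory migration (lowest P_mm). */
    const struct placement *choose_placement(struct placement *ps, size_t np,
                                             const struct cluster *cl, int nc) {
        double max_wbw = -1;
        for (size_t i = 0; i < np; i++) {
            ps[i].wbw = weighted_bw(&ps[i], cl, nc);
            if (ps[i].wbw > max_wbw) max_wbw = ps[i].wbw;
        }
        const struct placement *best = NULL;
        for (size_t i = 0; i < np; i++)
            if (ps[i].wbw >= 0.9 * max_wbw && (!best || ps[i].mm < best->mm))
                best = &ps[i];
        return best;
    }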

Step 4: For each application, AsymSched estimates the overhead of memory migration assuming a cost of 0.3 s per GB, which was derived on our system using simple experiments. If the overhead is deemed too high, the new placement will not be applied. Another goal here is to avoid migrating applications back and forth because of recurring changes in communication patterns, thereby accumulating a high overhead. To that end, AsymSched keeps track of the total time already spent doing memory migration for the application, Atm. If that time plus the estimated cost of additional migration (Σ_{n ∈ migrated nodes} Amm[n] * 0.3) exceeds 5% of the running time of the application (Att), then AsymSched does not apply the new thread placement. We chose 5% as a reasonable maximum overhead value; in practice, the highest overhead we observed was around 3%.

Migration. Step 1: AsymSched migrates threads using system calls that are available in the Linux kernel.
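From user level, moving a thread to a node boils down to restricting its CPU affinity to that node's cores, for example with sched_setaffinity and libnuma's topology queries, as in the sketch below (the paper does not specify which of the available system calls it uses):

    /* Sketch: migrate a thread to a node by pinning it to all CPUs of
     * that node. Requires libnuma (-lnuma) for the node -> CPU mapping. */
    #define _GNU_SOURCE
    #include <numa.h>
    #include <sched.h>
    #include <sys/types.h>

    int migrate_thread_to_node(pid_t tid, int node) {
        struct bitmask *cpus = numa_allocate_cpumask();
        cpu_set_t set;
        CPU_ZERO(&set);
        if (numa_node_to_cpus(node, cpus) < 0) return -1;
        for (unsigned i = 0; i < cpus->size; i++)
            if (numa_bitmask_isbitset(cpus, i))
                CPU_SET(i, &set);            /* allow every CPU of the node */
        numa_free_cpumask(cpus);
        return sched_setaffinity(tid, sizeof(set), &set);
    }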

Step 2: AsymSched relies on dynamic migration to migrate the subset of pages that the application uses. If, after two seconds, the application still performs more than 90% of its memory accesses on the nodes where it was previously running (AoldA > 90%), then AsymSched concludes that dynamic migration was not able to migrate the working set of the application and performs a full memory migration.

The Measurement, Decision and Migration phases described above are performed continuously to account for phase changes in applications and other dynamics.

4.3 Optimizations and tricks

We integrated several optimizations within AsymSched to ensure that it runs accurately and with low overhead.

Fast memory migration. When AsymSched performs full memory migration, all the pages located on one node are migrated to another node. The applications we tested have large working sets (up to 15 GB per node), and migrating pages is costly. We measured that migrating 10 GB of data using the standard migrate_pages system call takes 51 seconds on average, making migration of large applications impractical.

Therefore, we designed a new system call for memory migration. This system call performs memory migration without locks in most cases, and exploits the parallelism available on multicore machines. Using our system call, migrating memory between two nodes is on average 17× faster than using the default Linux system call, and is only limited by the bandwidth available on interconnect links. Unlike the Linux system call, our system call can migrate memory from multiple nodes simultaneously. So if we are migrating memory simultaneously between two pairs of nodes that do not use the same interconnect path, our system call will run about 34 times faster.

Algorithm 1 AsymSched algorithm
1: if threads of nodes N1 and N2 access a common memory controller and threads of N1 and N2 have the same pid then
2:     put all threads running on N1 and N2 in a cluster C and increase Crbw
3: end if
4: compute relevant cluster placements
5: Maxwbw = 0
6: for all P ∈ computed placements do
7:     Pwbw = Σ_{C ∈ clusters} Cbw(P) * Cweight
8:     Maxwbw = max(Maxwbw, Pwbw)
9: end for
10: for all P ∈ computed placements do
11:     skip if Pwbw < 90% * Maxwbw
12:     compute Pmm
13: end for
14: choose the placement with the lowest Pmm
15: for all A ∈ migrated applications do
16:     if Atm + Σ_{n ∈ migrated nodes} Amm[n] * 0.3 > 0.05 * Att then
17:         do not change thread placement
18:     end if
19: end for
20: migrate threads
21: use dynamic memory migration
22: after 2 seconds:
23: for all A ∈ migrated applications do
24:     if AoldA > 90% then
25:         fully migrate memory of A
26:     end if
27: end for

Fast migration works as follows. (i) First, we "freeze" the application by sending SIGSTOP to all its threads. Freezing the application ensures that it does not allocate or free pages during migration. This allows removing many locks taken by the Linux memory migration mechanism; since up to 80% of migration time can be wasted waiting on locks, the resulting performance improvements are significant. (ii) Second, we parse the memory map of the application and store all pages in an array. We then launch worker threads on the node(s) on which the application is scheduled. Worker threads process pages stored in the array in chunks of 30 thousand: each old page is unmapped, its data is copied to a new page, the new page is remapped, and the old page is freed. Shared pages and pages that are currently swapped out are ignored.
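The freeze step can be illustrated at user level as in the sketch below; SIGSTOP halts every thread of the target process, which is what makes the lock-free copy safe. The actual page-copy loop lives in the custom kernel system call, so the call to full_migration here (from the earlier sketch) merely stands in for it.

    /* Sketch of fast migration's freeze/thaw envelope. SIGSTOP cannot be
     * caught or ignored and stops all threads of the process, so no page
     * can be allocated or freed while the copy is in flight. */
    #include <signal.h>
    #include <sys/types.h>

    int full_migration(pid_t pid, int from_node, int to_node); /* earlier sketch */

    int frozen_migration(pid_t pid, int from_node, int to_node) {
        if (kill(pid, SIGSTOP) < 0)     /* freeze all threads */
            return -1;
        /* In the real system call: parse the memory map, queue pages in an
         * array, and let per-node worker threads process ~30,000-page
         * chunks (unmap old page, copy, remap new page, free old page). */
        int ret = full_migration(pid, from_node, to_node);
        kill(pid, SIGCONT);             /* thaw */
        return ret;
    }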

Avoiding evaluation of all possible placements. The number of all possible thread placements on a machine can be very large. We use two techniques to avoid computing all thread placements: (i) Many thread placement configurations are "obviously" bad; for instance, when a communication-intensive application uses two nodes, we only consider configurations with nodes connected by a 16-bit link. (ii) Several configurations are equivalent (e.g., the bandwidth between nodes 0 and 1 and between nodes 2 and 3 is the same). To avoid estimating the bandwidth of all placements, we create a hash for each placement. The hash is computed so that equivalent configurations have the same hash. Using simple dynamic programming techniques, we only perform computations on non-equivalent configurations.

These two techniques allow skipping between 67% and 99% of computations in all tested configurations with clusters of 2, 3 or 5 nodes (e.g., with 4 clusters of 2 nodes, we only evaluate 20 configurations out of 5040).
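One plausible way to realize such a hash, sketched below under our own assumptions (the paper does not give the exact function), is to hash the sorted multiset of pairwise link bandwidths among the chosen nodes, so that topologically equivalent node sets collide on purpose:

    #include <stdint.h>
    #include <stdlib.h>

    #define NODES 8

    /* Pairwise bandwidth between nodes, filled in from the machine
     * topology (placeholder values here). */
    static int bw[NODES][NODES];

    static int cmp_int(const void *a, const void *b) {
        return *(const int *)a - *(const int *)b;
    }

    /* nodes[0..n-1]: the nodes used by a placement. Equivalent node sets
     * (same multiset of pairwise bandwidths) hash to the same value. */
    uint64_t placement_hash(const int *nodes, int n) {
        int pairs[NODES * NODES], k = 0;
        for (int i = 0; i < n; i++)
            for (int j = i + 1; j < n; j++)
                pairs[k++] = bw[nodes[i]][nodes[j]];
        qsort(pairs, k, sizeof(int), cmp_int);   /* order-independent */
        uint64_t h = 14695981039346656037ULL;    /* FNV-1a 64-bit     */
        for (int i = 0; i < k; i++) {
            h ^= (uint64_t)pairs[i];
            h *= 1099511628211ULL;
        }
        return h;
    }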

5 Evaluation

Our goal is to evaluate the impact of asymmetry-aware thread placement in isolation from other effects, such as those stemming purely from collocating threads that share data on the same node. The performance benefits of sharing-aware thread clustering are well known [30]. AsymSched clusters threads that share data as described in Section 4; the Linux thread scheduler, however, does not. We experimentally observed that Linux performed worse than clustered configurations: e.g., when graph500 and specjbb are scheduled simultaneously, both run 23% slower on Linux than on an average clustered placement. Since comparing Linux to AsymSched would not be meaningful because of that, we instead compare AsymSched (with thread clusters initially placed on a randomly chosen set of nodes) to the best and the worst static placements of data-sharing thread clusters. We also compare against the average performance achieved under all static placements that are unique in terms of connectivity. We obtain all unique static placements with respect to connectivity by examining the topology of the machine: there are 336 placements for single-application scenarios and 560 placements for multi-application scenarios.

Further, we want to isolate the effects of thread placement with AsymSched from the effects of dynamic memory migration. To that end, we compare AsymSched to the subset of our algorithm that performs the dynamic placement of memory only, turning off the parts performing thread placement.

5.1 Experimental platform

We evaluate AsymSched on machine A. It is equipped with four AMD Opteron 6272 processors, each with two NUMA nodes and 8 cores per node (64 cores in total). The machine has 256 GB of RAM and uses HyperTransport 3.0. It runs Linux 3.9.

We used several benchmark suites and applications: the NAS Parallel Benchmarks suite [6], which is composed of numeric kernels; MapReduce benchmarks from Metis [25]; parallel applications from Parsec [26]; Graph500 [1], a graph processing application, with a problem size of 21; FaceRec from the ALPBench benchmark suite [11]; and SPECjbb [2] running on OpenJDK 7. From the NAS and Parsec benchmark suites we picked the benchmarks that run for at least 15 seconds and that can be executed with arbitrary numbers of threads. The memory usage of the benchmarks ranges from 518 MB (EP from the NAS suite) to 34,291 MB (IS from NAS). Except for SPECjbb, we use the execution time of applications as the performance indicator. SPECjbb runs for a fixed amount of time; we use its throughput (measured in SPECjbb bops) as the performance indicator.

5.2 Single-application workloads

The results are presented in Figure 4. AsymSched always performs close to the best static thread placement; in the few cases where it does not, the difference is not statistically significant. For applications that produce the highest degree of contention on the interconnect links (streamcluster, pca, and facerec), AsymSched achieves much better performance than the best thread placement, because the dynamic memory migration component balances memory accesses across nodes, thus reducing contention on interconnect links and memory controllers.

We also observe that dynamic memory migration without the migration of threads is not sufficient to achieve the best performance. More precisely, dynamic memory migration alone often achieves performance close to the average. Moreover, it produces a high standard deviation for many benchmarks, the minimum and maximum performance often being the same as that of the worst and best static thread placements. For instance, on SPECjbb, the difference between the minimum and maximum performance with dynamic memory migration alone is 91%.

In contrast, AsymSched produces a very low standard deviation for most benchmarks. Two exceptions are is.D and SPECjbb.


[Bar chart omitted. Y-axis: performance improvement relative to average placement (%); four bars per application (worst placement, best placement, dynamic memory placement only, AsymSched), same applications as in Figure 2.]

Figure 4: Performance difference between the best and worst static thread placement, dynamic memory placement, AsymSched and the average thread placement on machine A. Applications run with 24 threads on 3 nodes.

[Bar chart omitted. Y-axis: latency of memory accesses compared to average placement (cycles); four bars per application (worst placement, best placement, dynamic memory placement only, AsymSched).]

Figure 5: Memory latency under the best and worst static thread placement, dynamic memory placement, AsymSched and the average thread placement on machine A. Applications run with 24 threads on 3 nodes.

This is because in both cases AsymSched migrates a large amount of memory. Both applications become memory intensive after an initialization phase, and AsymSched starts migrating memory only after the entire working set has been allocated. For instance, in the case of is.D, AsymSched migrates between 0 GB and 20 GB, depending on the initial placement of threads.

Figure 5 shows the latency of memory accesses compared to the average. For most applications, the dynamics of latency closely matches that of performance. A few exceptions are is.D, lu.B and kmeans. For is.D, the latency is drastically improved by AsymSched, but the impact on performance is not visible because of the time lost performing memory migrations. Lu.B is extremely memory intensive during its first seconds of execution but performs very few memory accesses thereafter; AsymSched improves this initial phase but has no impact on the rest of the running time. Kmeans is very bursty; placing its threads has a huge impact on the latency of memory accesses performed during bursts of memory accesses, but not on the rest of the execution.

5.3 Multi-application workloads

We evaluate several multi-application workloads using the applications studied in Section 5.2. We chose four applications that benefit to various degrees from AsymSched: streamcluster (benefits to a high degree), SPECjbb (benefits to a moderate degree), graph500 (benefits to a small degree), and matrixmultiply (does not benefit). Some of these applications have different phases during their execution; for instance, streamcluster processes its input set in five distinct rounds, and SPECjbb spends a significant amount of time initializing data before emulating a three-tier client/server system.

Figure 6 presents the performance on multi-application workloads. We chose two different clustering configurations: (i) three applications executing on three, three and two nodes, respectively; (ii) two applications executing on five and three nodes, respectively.

In all workloads, AsymSched achieves performance that is close to or better than the best static thread placement on Linux. Furthermore, it produces a very low standard deviation. In contrast, dynamic memory migration alone exhibits a high standard deviation and, as with single-application workloads, is unable to improve performance for Graph500 and SPECjbb.

AsymSched significantly improves the latency of applications that benefit from thread and memory migration (Figure 7), in particular streamcluster. This is because AsymSched chooses configurations in which the links used by streamcluster are not shared with any other application.


[Bar chart omitted. Y-axis: performance improvement relative to average placement (%); four bars (worst thread placement, best thread placement, dynamic memory placement, AsymSched) per workload: specjbb-3 + graph500-3 + matrixmultiply-2; streamcluster-3 + graph500-3 + specjbb-2; streamcluster-3 + streamcluster-3 + streamcluster-2; specjbb-5 + matrixmultiply-3; specjbb-5 + streamcluster-3.]

Figure 6: Performance difference between the best thread placement, the worst thread placement, dynamic memory placement, AsymSched and the average thread placement of applications on machine A. The numbers appended to the names of applications specify the number of nodes on which the application runs.

[Bar chart omitted. Y-axis: latency of memory accesses compared to average placement (cycles); same bars and workloads as in Figure 6.]

Figure 7: Difference in latency of memory accesses between the best thread placement, the worst thread placement, dynamic memory placement, AsymSched and the average thread placement of applications on machine A.

Benchmark | cg.B | ft.C | is.D | sp.A | streamcluster | graph500 | specJBB
Migrated memory (GB) | 0.17 | 2.5 | 20 | 0.1 | 0.15 | 0.3 | 10
Average time, Linux syscall (ms) | 860 | 12700 | 101000 | 490 | 750 | 1500 | 50500
Average time, fast migration (ms) | 51 | 380 | 3050 | 30 | 45 | 90 | 1500

Table 5: Average amount of migrated memory for various applications running on 3 nodes, and the time required to perform the migration using the standard Linux system call and using fast memory migration.

5.4 Overhead

The main overhead of AsymSched is due to memory migration; this is why we implemented a custom system call (see Section 4.3). Table 5 compares the migration time when using the standard Linux system call and when using our custom system call. For instance, for is.D, migration takes 101 seconds using the Linux system call (50% overhead), but only 3 seconds using our custom system call (1.5% overhead). To keep the overhead low, AsymSched performs migrations only if the predicted overhead is below 5%; in practice, the maximum migration overhead we observed was 3%.

The cost of collecting metrics and computing cluster placements is below 0.5% on all studied applications. Moreover, AsymSched requires less than 2 MB of RAM.

The overhead of thread migration is negligible, and we did not observe any noticeable effect of thread migrations on cache misses.

Finally, when dynamic memory placement is used, IBS sampling incurs a light overhead (within 2% in our experiments), and the statistics on memory accesses are stored in about 20 MB of RAM.

5.5 Discussion: applicability to future NUMA machines

We believe that the findings and the solution presented in this paper are likely to be applicable to future NUMA systems. First, we believe that the clustering and placement techniques used in AsymSched can scale to machines with a much larger number of nodes. With very simple heuristics we were able to avoid computing up to 99% of the possible thread placements. Such optimizations will likely remain possible on future machines, as machines are usually made of multiple identical cores/sockets (e.g., our 64-core machine has 4 identical sockets). On machines that offer a wider diversity of thread placements, a possibility is to use statistical approaches, such as that of Radojkovic et al. [27], to find good thread placements with a bounded overhead.

Furthermore, AsymSched can easily be adapted to different optimization goals. On current NUMA machines, maximizing the bandwidth between threads was the key to achieving good performance, but our solution could easily be adapted to take other metrics into account.

6 Related Work

NUMA optimizations and contention management: Optimizing thread and memory placement on NUMA systems has been extensively studied [8, 9, 20, 33, 19, 12, 7, 31, 5, 23, 22, 10]. However, as shown in [19, 12, 5, 23], contention on interconnect links and memory controllers remains a major source of inefficiencies on modern NUMA machines. Our work complements these previous studies by minimizing contention on interconnect links on asymmetrically connected NUMA systems. We adopt the dynamic memory management algorithm presented in [12] for memory placement, but the key contribution of our work is the algorithm that efficiently computes the placement of threads on nodes to maximize the bandwidth between communicating threads.

Several extensions to Linux improve data-access locality on NUMA systems, but do not improve the bandwidth for communicating threads and do not address interconnect asymmetry. For example, Sched/NUMA [32] adds the notion of a "home node": when scheduling threads, Linux will try to collocate threads and data on the "home node" of the corresponding process. While this may improve communication bandwidth for applications whose number of threads does not exceed the number of cores in a node, it does not address applications spanning several nodes. Another extension, called AutoNUMA [4], implements locality optimizations by migrating pages to the nodes from which they are accessed. AsymSched uses a similar dynamic algorithm to migrate memory, but unlike AutoNUMA it also places threads so as to optimize communication bandwidth.

Several studies addressed contention for the memory hierarchy of UMA systems [16, 24, 34] by segregating competing threads on different nodes. None of these systems, however, addressed contention on the interconnect.

Scheduling on asymmetric architectures: Several thread schedulers catered to the asymmetry of CPUs [29, 15, 21, 17]. They optimize thread placement on processors with asymmetric characteristics (e.g., different frequencies or different hardware features). The techniques used to address processor asymmetry are fundamentally different from those needed to address interconnect asymmetry, so there is no overlap with our study.

Thread clustering: Pusukuri et al. [18] cluster threads based on lock contention and memory access latencies. Kamali [14] and Tam [30] proposed algorithms that cluster threads that share data on the same shared cache. AsymSched uses a similar high-level idea: it samples hardware counters to detect communicating threads and places them onto well-connected nodes. However, the problem addressed by AsymSched (asymmetric interconnect) and the specific algorithm proposed are quite different from those in the aforementioned studies.

Radojkovic et al. [28] present a scheduler that takes into account resource sharing inside a processor. They model the benefits and drawbacks of data and instruction cache sharing between threads, and they schedule threads on the set of cores that will maximize performance. Their solution explores all possible thread placements. Their follow-up work [27] refines the solution to use a statistical approach to find the optimal placement. Our solution could benefit from a similar technique on machines with a much larger number of dissimilar nodes.

Network optimizations: Network traffic optimization is a well-studied problem. Machines are often interconnected with asymmetric Ethernet links, and optimizing bandwidth on asymmetric NUMA systems shares many similarities with the optimization problems found in such systems. For instance, Volley [3] proposed an algorithm to place data used by Cloud services. Like AsymSched, this algorithm takes into account the available bandwidth between nodes (which are geographically distributed computers in their case) in order to optimize performance.

Hermenier et al. [13] present a consolidation manager for distributed systems. The goal of their system is to minimize energy consumption in a cluster. To this end, they place virtual machines on the smallest possible number of physical machines while meeting certain performance constraints. They model the problem as the multiple knapsack problem and use a constraint-satisfaction solver to find good placements. AsymSched could use a similar technique, but we found that on existing systems a much simpler solution was sufficient.

7 Conclusion

We showed that the asymmetry of the interconnect in modern NUMA systems drastically impacts performance. We found that performance is more affected by the bandwidth between nodes than by the distance between them. We developed AsymSched, a new thread and memory placement algorithm that maximizes the bandwidth for communicating threads.

As the number of nodes in NUMA systems increases, the interconnect is less likely to remain symmetric. AsymSched's design principles will, therefore, be of growing importance in the future.


References

[1] Graph500 reference implementation. http://www.graph500.org/referencecode.

[2] SPECjbb2005: industry-standard server-side Java benchmark (J2SE 5.0). http://www.spec.org/jbb2005/.

[3] Sharad Agarwal, John Dunagan, Navendu Jain, Stefan Saroiu, Alec Wolman, and Harbinder Bhogan. Volley: Automated data placement for geo-distributed cloud services. In NSDI, 2010.

[4] AutoNUMA: the other approach to NUMA scheduling. LWN.net, March 2012. http://lwn.net/Articles/488709/.

[5] M. Awasthi, D. Nellans, K. Sudan, R. Balasubramonian, and A. Davis. Handling the problems and opportunities posed by multiple on-chip memory controllers. In PACT, 2010.

[6] D. H. Bailey, E. Barszcz, J. T. Barton, D. S. Browning, R. L. Carter, L. Dagum, R. A. Fatoohi, P. O. Frederickson, T. A. Lasinski, R. S. Schreiber, H. D. Simon, V. Venkatakrishnan, and S. K. Weeratunga. The NAS parallel benchmarks: summary and preliminary results. In Supercomputing, 1991.

[7] S. Blagodurov, S. Zhuravlev, M. Dashti, and A. Fedorova. A case for NUMA-aware contention management on multicore systems. In USENIX ATC, 2011.

[8] W. Bolosky, R. Fitzgerald, and M. Scott. Simple but effective techniques for NUMA memory management. In SOSP, 1989.

[9] Timothy Brecht. On the importance of parallel application placement in NUMA multiprocessors. In USENIX SEDMS, 1993.

[10] J. Mark Bull and Chris Johnson. Data distribution, migration and replication on a ccNUMA architecture. In Proceedings of the Fourth European Workshop on OpenMP, 2002.

[11] CSU Face Identification Evaluation System. http://www.cs.colostate.edu/evalfacerec/index10.php.

[12] Mohammad Dashti, Alexandra Fedorova, Justin Funston, Fabien Gaud, Renaud Lachaize, Baptiste Lepers, Vivien Quéma, and Mark Roth. Traffic management: A holistic approach to memory placement on NUMA systems. In ASPLOS, 2013.

[13] Fabien Hermenier, Xavier Lorca, Jean-Marc Menaud, Gilles Muller, and Julia Lawall. Entropy: A consolidation manager for clusters. In VEE, 2009.

[14] Ali Kamali. Sharing-aware scheduling on multicore systems. MSc thesis, Simon Fraser University, 2010.

[15] Vahid Kazempour, Ali Kamali, and Alexandra Fedorova. AASH: An asymmetry-aware scheduler for hypervisors. In VEE, 2010.

[16] Rob Knauerhase, Paul Brett, Barbara Hohlt, Tong Li, and Scott Hahn. Using OS observations to improve performance in multicore systems. IEEE Micro, 28(3):54-66, 2008.

[17] David Koufaty, Dheeraj Reddy, and Scott Hahn. Bias scheduling in heterogeneous multi-core architectures. In EuroSys, 2010.

[18] Kishore Kumar Pusukuri, Rajiv Gupta, and Laxmi N. Bhuyan. ADAPT: A framework for coscheduling multithreaded programs. ACM Transactions on Architecture and Code Optimization, 9(4), January 2013.

[19] Renaud Lachaize, Baptiste Lepers, and Vivien Quéma. MemProf: A memory profiler for NUMA multicore systems. In USENIX ATC, 2012.

[20] Richard P. LaRowe, Jr., Carla Schlatter Ellis, and Mark A. Holliday. Evaluation of NUMA memory management through modeling and measurements. IEEE TPDS, 3:686-701, 1991.

[21] Tong Li, Dan Baumberger, David A. Koufaty, and S. Hahn. Efficient operating system scheduling for performance-asymmetric multi-core architectures. In Supercomputing (SC), 2007.

[22] Zoltan Majo and Thomas R. Gross. Memory management in NUMA multicore systems: Trapped between cache contention and interconnect overhead. In ISMM, 2011.

[23] Zoltan Majo and Thomas R. Gross. Memory system performance in a NUMA multicore multiprocessor. In SYSTOR, 2011.

[24] Andreas Merkel, Jan Stoess, and Frank Bellosa. Resource-conscious scheduling for energy efficiency on multicore processors. In EuroSys, 2010.

[25] Metis MapReduce Library. http://pdos.csail.mit.edu/metis/.

[26] PARSEC Benchmark Suite. http://parsec.cs.princeton.edu/.

[27] Petar Radojkovic, Vladimir Cakarevic, Miquel Moreto, Javier Verdu, Alex Pajuelo, Francisco J. Cazorla, Mario Nemirovsky, and Mateo Valero. Optimal task assignment in multithreaded processors: A statistical approach. SIGARCH Computer Architecture News, 40(1):235-248, March 2012.

[28] Petar Radojkovic, Vladimir Cakarevic, Javier Verdu, Alex Pajuelo, Francisco J. Cazorla, Mario Nemirovsky, and Mateo Valero. Thread to strand binding of parallel network applications in massive multi-threaded systems. In PPoPP, 2010.

[29] Juan Carlos Saez, Manuel Prieto, Alexandra Fedorova, and Sergey Blagodurov. A comprehensive scheduler for asymmetric multicore systems. In EuroSys, 2010.

[30] David Tam, Reza Azimi, and Michael Stumm. Thread clustering: Sharing-aware scheduling on SMP-CMP-SMT multiprocessors. In EuroSys, 2007.

[31] Lingjia Tang, J. Mars, Xiao Zhang, R. Hagmann, R. Hundt, and E. Tune. Optimizing Google's warehouse-scale computers: The NUMA experience. In HPCA, 2013.

[32] Toward better NUMA scheduling. Linux Weekly News, March 2012. http://lwn.net/Articles/486858/.

[33] Ben Verghese, Scott Devine, Anoop Gupta, and Mendel Rosenblum. Operating system support for improving data locality on CC-NUMA compute servers. In ASPLOS, 1996.

[34] Sergey Zhuravlev, Sergey Blagodurov, and Alexandra Fedorova. Addressing contention on multicore processors via scheduling. In ASPLOS, 2010.