
Transcript of Domain-Specialized Cache Management for Graph Analytics

In Proceedings of the 26th International Symposium on High-Performance Computer Architecture (HPCA'20)

    Domain-Specialized Cache Management for Graph Analytics

Priyank Faldu, The University of Edinburgh

    [email protected]

Jeff Diamond, Oracle Labs

    [email protected]

Boris Grot, The University of Edinburgh

    [email protected]

Abstract—Graph analytics power a range of applications in areas as diverse as finance, networking and business logistics. A common property of graphs used in the domain of graph analytics is a power-law distribution of vertex connectivity, wherein a small number of vertices are responsible for a high fraction of all connections in the graph. These richly-connected, hot, vertices inherently exhibit high reuse. However, this work finds that state-of-the-art hardware cache management schemes struggle to capitalize on their reuse due to the highly irregular access patterns of graph analytics.

In response, we propose GRASP, domain-specialized cache management at the last-level cache for graph analytics. GRASP augments existing cache policies to maximize reuse of hot vertices by protecting them against cache thrashing, while maintaining sufficient flexibility to capture the reuse of other vertices as needed. GRASP keeps hardware cost negligible by leveraging lightweight software support to pinpoint hot vertices, thus eliding the need for the storage-intensive prediction mechanisms employed by state-of-the-art cache management schemes. On a set of diverse graph-analytic applications with large high-skew graph datasets, GRASP outperforms prior domain-agnostic schemes on all datapoints, yielding an average speed-up of 4.2% (max 9.4%) over the best-performing prior scheme. GRASP remains robust on low-/no-skew datasets, whereas prior schemes consistently cause a slowdown.

Keywords-graph analytics; graph reordering; skew; last-level cache; cache management; domain-specialized;

I. INTRODUCTION

Graph analytics is an exciting and rapidly growing field with applications spanning diverse areas such as optimizing routes, uncovering latent relationships, pinpointing influencers in social graphs, and many more. Graph analytics commonly processes large graph datasets whose main-memory footprints span tens to hundreds of gigabytes [4, 54]. When processing such large graphs, graph-analytic applications exhibit a lack of cache locality, leading to frequent misses in on-chip caches that compromise application performance [2, 3, 19, 20, 30, 31, 36].

A distinguishing property of graph datasets common in many graph-analytic applications is that the vertex degrees follow a skewed power-law distribution, in which a small fraction of vertices have many connections while the majority of vertices have relatively few connections [2, 3, 19, 45, 65, 66]. Graphs characterized by such a distribution are known as natural or scale-free graphs and are prevalent in a variety of domains, including social networks, computer networks, financial networks, semantic networks, and airline networks.

We observe that the skew in the degree distribution means that a small set of vertices with a large fraction of connections is responsible for a major share of off-chip memory accesses. The fact that these richly-connected vertices, hot vertices, comprise a small fraction of the overall footprint while exhibiting high reuse makes them prime candidates for caching. Meanwhile, the rest of the vertices, cold vertices, comprise a large fraction of the overall footprint while exhibiting low or no reuse.

We find that existing caches struggle to exploit the high reuse inherent in hot vertices for the following two reasons. First, graph-analytic applications are notorious for exhibiting irregular access patterns that cause severe cache thrashing when processing large graphs. Accesses to a large number of cold vertices are responsible for thrashing, often forcing hot vertices out of the cache. Second, hot vertices are sparsely distributed throughout the memory space, exhibiting a lack of spatial locality. When hot vertices share the same cache block with cold vertices, valuable cache space is underutilized. Existing software techniques [2, 3, 19] solve the latter problem by reordering vertices in the memory space such that hot vertices share cache blocks with other hot vertices. However, the former problem of protecting hot vertices from premature eviction remains an open challenge, even for state-of-the-art thrash-resistant hardware cache management schemes.

Almost all prior works on hardware cache management (i.e., cache replacement) that target cache thrashing are domain-agnostic. These hardware schemes aim to perform two tasks: (1) identify cache blocks that are likely to exhibit high reuse, and (2) protect high-reuse cache blocks from cache thrashing. To accomplish the first task, these schemes deploy either probabilistic or prediction-based hardware mechanisms [5, 10, 13, 26, 28, 29, 41, 49, 51, 52, 53, 57, 58, 59, 60]. However, our work finds that graph-dependent irregular access patterns prevent these schemes from correctly learning which cache blocks to preserve, rendering them deficient for the broad domain of graph analytics. Meanwhile, to accomplish the second task, recent work proposes pinning high-reuse cache blocks in the LLC to ensure that these blocks are not evicted [7]. However, we find that pinning-based schemes are overly rigid and result in sub-optimal utilization of cache capacity.

To overcome the limitations of existing hardware cache management schemes, we propose GRASP – GRAph-SPecialized Last-Level Cache (LLC) management. To the best of our knowledge, this is the first work to introduce domain-specialized cache management for the domain of graph analytics. GRASP augments existing cache insertion and hit-promotion policies to provide preferential treatment to cache blocks containing hot vertices, shielding them from thrashing. To cater to the irregular access patterns, GRASP policies are designed to be flexible enough to cache other blocks exhibiting reuse. Unlike pinning, GRASP maximizes cache efficiency based on observed access patterns.

GRASP relies on lightweight software support to accurately pinpoint hot vertices amidst irregular access patterns, in contrast to state-of-the-art schemes that rely on storage-intensive hardware mechanisms. By leveraging existing vertex reordering techniques, GRASP enables a lightweight software-hardware interface comprising only a few configurable registers, which are programmed by software using its knowledge of the graph data structures.

GRASP requires minimal changes to the existing microarchitecture, as it only augments existing cache policies and its interface is lightweight. GRASP does not require additional metadata in the LLC or storage-intensive prediction tables. Thus, GRASP can easily be integrated into commodity server processors, enabling domain-specific acceleration for graph analytics at minimal hardware cost.

To summarize, our contributions are as follows:

• We qualitatively and quantitatively show that a wide range of state-of-the-art domain-agnostic cache management schemes, despite their sophisticated prediction mechanisms, are inefficient for the domain of graph analytics.

• We introduce GRASP, graph-specialized LLC management for graph analytics processing natural graphs. GRASP augments existing cache policies to protect hot vertices against cache thrashing while also maintaining the flexibility to capture reuse in other cache blocks. GRASP leverages a lightweight software interface to pinpoint hot vertices amidst irregular accesses, which eliminates the need for prediction metadata storage at the LLC, keeping the existing cache structure largely unchanged.

• Our evaluation on several multi-threaded graph-analytic applications operating on large, high-skew datasets shows that GRASP outperforms state-of-the-art domain-agnostic schemes on all datapoints, yielding an average speed-up of 4.2% (max 9.4%) over the best-performing prior scheme. GRASP is also robust on low-/no-skew datasets, whereas prior schemes consistently cause a slowdown.

    II. MOTIVATION

    A. Skew in Natural Graphs

A distinguishing property of natural graphs is the skew in their degree distribution [2, 3, 19, 45, 65, 66]. The skew follows a power-law, with the vast majority of the vertices having relatively few edges and a small fraction of vertices featuring a large number of edges. Such a skewed distribution is prevalent in many domains and found, for instance, in nodes in large-scale communication networks (e.g., the internet), web pages in the web graph, and individuals in social graphs.

Table I
ROWS #2 AND #4 SHOW THE PERCENTAGE OF VERTICES HAVING DEGREE EQUAL TO OR GREATER THAN THE AVERAGE (I.E., HOT VERTICES), WITH RESPECT TO IN-EDGES AND OUT-EDGES, RESPECTIVELY; THE HIGHER THE SKEW, THE LOWER THE PERCENTAGE. ROWS #3 AND #5 SHOW THE PERCENTAGE OF IN-EDGES AND OUT-EDGES CONNECTED TO THE HOT VERTICES, RESPECTIVELY; THE HIGHER THE SKEW, THE HIGHER THE PERCENTAGE.

Dataset                           lj   pl   tw   kr   sd
In-Edges   Hot Vertices (%)       25   16   12    9   11
In-Edges   Edge Coverage (%)      81   83   84   93   88
Out-Edges  Hot Vertices (%)       26   13   10    9   13
Out-Edges  Edge Coverage (%)      82   88   83   93   88

Table I quantifies the skew for the datasets evaluated in this work (more details of the datasets in Table V). For example, in the Twitter [54] dataset (labeled tw), 12% of total vertices are classified as hot vertices in terms of their in-degree distribution (10% for out-degree). These hot vertices are connected to 84% of all in-edges (83% of all out-edges) in the graph. Similarly, in other datasets, 9-26% of vertices are classified as hot vertices, and they are connected to 81-93% of all edges. In the following sections, we explain how this skew can be leveraged to improve cache efficiency.

    B. Graph Processing Basics

The majority of shared-memory graph frameworks are based on a vertex-centric model, in which an application computes some information for each vertex based on the properties of its neighbouring vertices [32, 35, 42, 43, 46, 55]. Applications may perform pull- or push-based computations. In pull-based computations, a vertex pulls updates from its in-neighbors. In push-based computations, a vertex pushes updates to its out-neighbors. This process may be iterative, and all or only a subset of vertices may participate in a given iteration.

The Compressed Sparse Row (CSR) format is commonly used to represent graphs in a storage-efficient manner. CSR uses a pair of arrays, Vertex and Edge, to encode the graph. CSR encodes in-edges for pull-based computations and out-edges for push-based computations. In this discussion, we focus on pull-based computations and note that the observations hold for push-based computations. For every vertex, the Vertex Array maintains an index that points to its first in-edge in the Edge Array. The Edge Array stores all in-edges, grouped by destination vertex ID. For each in-edge, the Edge Array entry stores the associated source vertex ID. A code sketch of this layout and its pull-based traversal follows below.
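To make the layout concrete, the sketch below (our illustration; the array and function names are ours, not the paper's or Ligra's) encodes the two CSR arrays and one pull-based iteration over them. Note how the Vertex and Edge Arrays are streamed sequentially, while the Property Array is indexed by source vertex ID, which produces the irregular accesses discussed in Sec. II-C.

```cpp
#include <cstdint>
#include <vector>

// Illustrative CSR layout for pull-based computation: the in-edges of
// vertex v occupy edges[vertexIndex[v] .. vertexIndex[v+1]).
struct CSRGraph {
  std::vector<uint32_t> vertexIndex; // |V|+1 offsets into 'edges' (Vertex Array)
  std::vector<uint32_t> edges;       // source vertex ID per in-edge (Edge Array)
};

// One pull-based iteration: each vertex reads the properties of its
// in-neighbors. vertexIndex and edges are each touched once (streaming),
// while 'property' is accessed irregularly, with reuse concentrated on
// the high out-degree (hot) vertices.
void pullIteration(const CSRGraph& g, const std::vector<double>& property,
                   std::vector<double>& next) {
  for (uint32_t v = 0; v + 1 < g.vertexIndex.size(); ++v) {
    double acc = 0.0;
    for (uint32_t e = g.vertexIndex[v]; e < g.vertexIndex[v + 1]; ++e)
      acc += property[g.edges[e]]; // irregular access into the Property Array
    next[v] = acc;
  }
}
```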

The graph applications use an additional Property Array(s) to hold partial or final results for every vertex. For example, the Pagerank application maintains two ranks for every vertex: one computed from the previous iteration and one being computed in the current iteration. Implementations may use either two separate arrays (each storing one rank per vertex) or one array (storing two ranks per vertex). Fig. 1(a) and 1(b) show a simple graph and its CSR representation for pull-based computations, along with one Property Array.

[Figure 1 graphic: a six-vertex example graph and its CSR arrays; Vertex: 0 1 4 6 9 10; Edge: 3 2 0 5 1 5 4 5 2 5; Property: P0-P5; the elements accessed for vertices ID-1 and ID-3 are highlighted.]

Figure 1. (a) An example graph. (b) CSR format encoding in-edges. Elements of the same color in all arrays correspond to the same destination vertex. The number of bars (labeled Reuse) below each element of the Property Array shows the number of times that element is accessed in one full iteration, where the color of a bar indicates the vertex making the access.

[Figure 2 graphic: stacked bars for BC, SSSP, PR, PRD and Radii on the pl and tw datasets, splitting LLC accesses and misses into within/outside the Property Array.]

Figure 2. Classification of LLC accesses and misses (normalized to total accesses) for the pl and tw datasets for the applications from Table III.

    C. Cache Behavior in Graph Analytics

At the most fundamental level, a graph application computes a property for a vertex based on the properties of its neighbours. To find the neighbouring vertices, an application traverses the portion of the Edge Array corresponding to a given vertex, and then accesses the elements of the Property Array corresponding to these neighbouring vertices. Fig. 1(b) highlights the elements accessed during the computations for vertices ID-1 and ID-3.

As the figure shows, each element in the Vertex and the Edge Array is accessed exactly once during an iteration, exhibiting no temporal locality at the LLC. These arrays may exhibit high spatial locality, but it is filtered by the L1-D cache, leading to a streaming access pattern at the LLC.

In contrast, the Property Array does exhibit temporal reuse. However, reuse is not consistent across all elements. Specifically, reuse is proportional to the number of out-edges for pull-based algorithms. Thus, the elements corresponding to high out-degree vertices exhibit high reuse. Fig. 1(b) shows the reuse for the high out-degree (i.e., hot) vertices P2 and P5 of the Property Array assuming pull-based computations; other elements do not exhibit reuse. The same observation applies to high in-degree vertices in push-based algorithms.

Finally, Fig. 2 quantifies the LLC behavior of the graph applications listed in Table III on the pl and tw datasets as representative examples (refer to Sec. IV for methodology details). The figure differentiates all LLC accesses and misses as falling either within or outside the Property Array. Unsurprisingly, the Property Array accounts for 78-94% of all LLC accesses. However, despite the high reuse, the Property Array is also responsible for a large fraction of LLC misses, the reasons for which are explained next.

    D. Challenges in Caching the Property Array

As discussed in the previous section, elements in the Property Array corresponding to the hot vertices exhibit high reuse. Unfortunately, on-chip caches struggle to capitalize on this high reuse for the following two reasons:

1) Lack of spatial locality: hot vertices are sparsely distributed throughout the memory space of the Property Array. Moreover, the size of each element in the Property Array is much smaller than the size of a cache block; for example, an 8-byte property element occupies just one eighth of a 64-byte cache block. Thus, hot vertices inevitably share space in a cache block with cold vertices. This leads to cache underutilization, as a considerable fraction of a cache block's capacity is occupied by cold vertices.

2) Thrashing in the LLC: the access pattern to the Property Array is highly irregular, being heavily dependent on both graph structure and application. Between a pair of accesses to a given hot vertex in the Property Array, a number of other, unrelated, cache blocks may be accessed, leading to thrashing. Any block allocated by these unrelated accesses will trigger an eviction at the LLC, potentially displacing blocks holding hot vertices.

Overcoming the former problem requires improving cache block utilization by focusing on intra-block reuse, which is effectively addressed by existing software techniques [2, 3, 19]. The latter problem requires retaining high-reuse blocks in the LLC by focusing on reuse across cache blocks; as explained below, this remains an open challenge not addressed by prior works. We note that the two problems are orthogonal in nature; solving one does not solve the other.

We next discuss the most relevant state-of-the-art techniques in both software and hardware and their shortcomings in addressing cache thrashing for graph analytics.

    E. Prior Software Schemes

Prior works have proposed leveraging application-visible properties, such as vertex connectivity or vertex degree, to improve cache locality. This is accomplished by reordering vertices in memory such that vertices that will be frequently accessed together are kept close by. These techniques are particularly attractive because they require no modifications to the graph algorithms [2, 3, 15, 19, 21, 30, 47, 48, 69, 70, 71]. To be effective, these techniques face two constraints during reordering. First, they must keep the reordering cost to a minimum to improve end-to-end application performance. Second, they should minimize disruption to the underlying graph structure, specifically for graphs that exhibit community structure. Prior works have noted that the vertex order of many real-world graph datasets closely follows the underlying community structure, meaning vertices from the same community are ordered close by in memory, exhibiting good spatio-temporal locality that should be preserved [2, 3, 19].

While finding an optimal vertex ordering is NP-hard, techniques like Gorder attempt to approximate such an ordering by comprehensively analyzing the graph structure based on a likely traversal order [30]. However, recent works show that while such techniques are effective in reducing LLC misses, they incur a staggering reordering cost that is often multiple orders of magnitude higher than the application runtime, rendering them impractical [2].

Consequently, the same works argue for lightweight skew-aware techniques that provide application speed-up even after accounting for their reordering cost. To keep the reordering cost low, these techniques reorder vertices solely based on vertex degree, with the goal of improving spatial locality by ensuring that hot vertices share cache blocks only with other hot vertices. To achieve this effect, skew-aware techniques (e.g., HubSort [19] and DBG [2]) rely on some variant of degree-based sorting to segregate hot vertices from the cold ones (see the sketch below). While effective in improving spatial locality (as explained in Sec. II-D), these techniques still suffer from cache thrashing, stemming from the fact that the footprint of the hot vertices alone exceeds the available cache capacity [2].
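As a concrete illustration of the segregation step, the sketch below implements a simplified degree-based partition; it is an approximation in the spirit of HubSort/DBG rather than either algorithm's actual grouping logic, so treat the details as our own assumptions.

```cpp
#include <algorithm>
#include <cstdint>
#include <numeric>
#include <vector>

// Simplified skew-aware reordering: vertices with degree >= average are
// deemed hot and receive the lowest new IDs, segregating them in a
// contiguous region. std::stable_partition preserves the relative order
// within the hot and cold groups, retaining much of the original structure.
std::vector<uint32_t> segregateHotVertices(const std::vector<uint32_t>& degree) {
  uint64_t total = std::accumulate(degree.begin(), degree.end(), uint64_t{0});
  uint32_t avg = static_cast<uint32_t>(total / degree.size());

  std::vector<uint32_t> order(degree.size());
  std::iota(order.begin(), order.end(), 0);
  std::stable_partition(order.begin(), order.end(),
                        [&](uint32_t v) { return degree[v] >= avg; });

  std::vector<uint32_t> newID(degree.size());
  for (uint32_t i = 0; i < order.size(); ++i) newID[order[i]] = i;
  return newID;
}
```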

    F. Prior Hardware Schemes

Prior hardware cache management schemes targeting cache thrashing can be broadly classified into three categories:

1) History-agnostic lightweight schemes use simple heuristics to manage the cache [41, 51, 52, 57, 58, 60]. RRIP [52] is the state-of-the-art technique in this category; it relies on a probabilistic approach to classify a cache block as low- or high-reuse at the time of inserting a new block into the cache. As these techniques do not record the past reuse behavior of cache blocks, they are limited in their ability to accurately identify high-reuse blocks.

2) History-based predictive schemes such as the state-of-the-art Hawkeye [26] and many others [5, 10, 13, 28, 29, 49, 53] learn the past reuse behavior of cache blocks by employing sophisticated, storage-intensive prediction mechanisms. A large body of recent work focuses on history-based schemes as they generally provide higher performance than the lightweight schemes for a wide range of applications. However, for graph analytics, we find that graph-dependent irregular access patterns prevent these history-based schemes from correctly learning which cache blocks to preserve. For example, most history-based schemes rely on a PC-based (Program Counter) reuse correlation¹ to learn which set of PC addresses access high-reuse cache blocks, in order to prioritize these blocks for caching over others. Meanwhile, we observe that the reuse for elements of the Property Array, which are the prime target for LLC caching in graph analytics (Sec. II-C), does not correlate with the PC, because the same PC accesses hot and cold vertices alike.

3) Pinning-based schemes such as XMem [7] dedicate partial or full cache capacity by pinning high-reuse blocks in the cache. Hardware ensures that the pinned blocks cannot be evicted by other cache blocks and thus are protected from cache thrashing. Such an approach is only feasible when the high-reuse working set fits in the available cache capacity. Unfortunately, for large graph datasets, even with high skew, it is unlikely that all hot vertices will fit in the LLC; recall from Table I that hot vertices account for up to 26% of the total vertices. Moreover, some of the colder vertices might also exhibit short-term temporal reuse, particularly in graphs with community structure.

These observations call for a new LLC management scheme that employs (1) a reliable mechanism to identify hot vertices amidst irregular access patterns and (2) flexible cache policies that maximize reuse among hot vertices by protecting them in the cache without denying colder vertices the ability to be cache resident if they exhibit reuse.

    III. GRASP: CACHING IN ON THE SKEW

This work introduces GRASP, graph-specialized LLC management for graph analytics processing natural graphs. GRASP augments existing cache management schemes with simple additions to their insertion and hit-promotion policies that provide preferential treatment to cache blocks containing hot vertices to protect them from thrashing. GRASP policies are sufficiently flexible to capture the reuse of other blocks as needed.

GRASP's domain-specialized design is influenced by the following two challenges faced by existing hardware cache management schemes. First, hardware alone cannot enforce spatial locality, which is dictated by vertex placement in the memory space and is under software control. Second, domain-agnostic hardware cache management schemes struggle to pinpoint hot vertices under cache thrashing due to the irregular access patterns endemic to graph analytics.

To overcome both challenges, GRASP relies on existing skew-aware software reordering techniques to induce spatial locality by segregating hot vertices in a contiguous memory region [2, 19]. While these techniques offer different trade-offs in terms of reordering cost and their ability to preserve graph structure, they all work by isolating hot vertices from the cold ones. Fig. 3(a) shows a logical view of the placement of hot vertices in the Property Array after reordering by such a technique.

¹ All seven schemes [8, 9, 12, 14, 16, 17, 18] presented at the Cache Replacement Championship'17 [11] rely on PC-based reuse correlation.

Figure 3. GRASP overview. (a) Software applies vertex reordering, which segregates hot vertices at the beginning of the array. (b) The GRASP interface exposes an ABR pair per Property Array, to be configured with the bounds of the array. (c) GRASP identifies regions exhibiting different reuse based on the LLC size.

[Figure 4 graphic: the CPU core issues a virtual address to the MMU/TLB and, in parallel, to the ABRs & classification logic; the resulting LLC request carries the physical address plus a 2-bit Reuse Hint that drives the GRASP policies.]

Figure 4. Block diagram of GRASP and other hardware components with which it interacts. GRASP components are shown in color. For brevity, the figure shows only one CPU core.

GRASP subsequently leverages the contiguity among hot vertices in the memory space to (1) pinpoint them via a lightweight interface and (2) protect them from thrashing. The GRASP design consists of the following three hardware components.

A) Software-hardware interface: the GRASP interface is minimal, consisting of a few configurable registers that software populates with the bounds of the Property Array during the initialization of an application (see Fig. 3(b)). Once populated, GRASP does not rely on any further intervention from software.

B) Classification logic: GRASP logically partitions the Property Array into different regions based on expected reuse (see Fig. 3(c)). GRASP implements simple comparison-based logic which, at runtime, classifies whether a cache request belongs to one of these regions.

C) Specialized cache policies: GRASP specializes cache policies for each region to ensure hot vertices are protected from thrashing while retaining flexibility in caching other blocks. The classification logic informs the choice of which policy to apply to a given cache block.

Fig. 4 shows how GRASP interacts with other hardware components in the system. In the following sections, we describe each of GRASP's components in detail.

    A. Software-Hardware Interface

GRASP's interface consists of one pair of Address Bound Registers (ABR) per Property Array; recall from Sec. II-B that an application may maintain more than one Property Array, each of which requires a dedicated ABR pair. ABRs are part of the application context and are exposed to the software. At application start-up, the graph framework populates each ABR pair with the start and end virtual addresses of the entire Property Array (Fig. 3(b)), as in the sketch below. Setting these registers activates the custom cache management for graph analytics. When the ABRs are not set by the software (i.e., the default case for other applications), the specialized cache management is effectively disabled.
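A minimal sketch of the software-side setup follows. The paper specifies only that software writes each Property Array's virtual-address bounds into an ABR pair at start-up; write_abr_pair below is a hypothetical stand-in for whatever instruction or register interface a real implementation would expose.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical primitive that writes one ABR pair (an assumption, not a real API).
extern void write_abr_pair(int pairIndex, uintptr_t start, uintptr_t end);

// Called by the graph framework at application start-up for each Property
// Array; once the ABRs are set, GRASP needs no further software intervention.
template <typename T>
void registerPropertyArray(int pairIndex, const T* array, size_t numVertices) {
  const uintptr_t start = reinterpret_cast<uintptr_t>(array);
  write_abr_pair(pairIndex, start, start + numVertices * sizeof(T));
}
```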

The use of virtual addresses keeps the GRASP interface independent of the existing TLB design, allowing GRASP to perform address classification (described next) in parallel with the usual virtual-to-physical address translation carried out by the TLB (see Fig. 4). Prior works have used similar approaches to pass data-structure bounds to aid microarchitectural mechanisms [7, 20, 27, 39].

    B. Classification Logic

This component of GRASP is responsible for reliably identifying cache blocks containing hot vertices in hardware by leveraging the bounds of the Property Array(s) available in the ABRs, as explained in the following paragraphs.

Identifying Hot Vertices: In theory, all hot vertices should be cached. In practice, it is unlikely that all hot vertices will fit in the LLC for large datasets. In such a case, providing preferential treatment to all hot vertices is not beneficial, as they can thrash each other in the LLC. To avoid this problem, GRASP prioritizes cache blocks containing only a subset of hot vertices, comprised of only the hottest vertices based on the available LLC capacity. Conveniently, the hottest vertices are located at the beginning of the Property Array in a contiguous region, thanks to the application of skew-aware reordering, as seen in Fig. 3(a).

Pinpointing the High Reuse Region: GRASP labels two LLC-sized sub-regions within the Property Array: the LLC-sized memory region at the start of the Property Array is labeled the High Reuse Region; another LLC-sized memory region starting immediately after the High Reuse Region is labeled the Moderate Reuse Region (Fig. 3(c)). Finally, if an application specifies more than one Property Array, GRASP divides the LLC size by the number of Property Arrays before labeling the regions.

Classifying LLC Accesses: At runtime, GRASP classifies a memory address making an LLC access as High-Reuse if the address belongs to the High Reuse Region of any Property Array; GRASP determines this by comparing the address with the bounds of the High Reuse Region of each Property Array. Similarly, an address is classified as Moderate-Reuse if the address belongs to the Moderate Reuse Region. All other LLC accesses are classified as Low-Reuse. For non-graph applications, the ABRs are not initialized and all accesses are classified as Default, effectively disabling domain-specialized cache management. GRASP encodes the classification result (High-Reuse, Moderate-Reuse, Low-Reuse or Default) as a 2-bit Reuse Hint and forwards it to the LLC along with each cache request, as shown in Fig. 4, to guide the specialized insertion and hit-promotion policies described next. The sketch below illustrates this classification.
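The sketch below reconstructs the comparison-based classification from the description above; the types and names are ours, and the region sizing follows the per-array division of the LLC size described earlier in this section.

```cpp
#include <cstddef>
#include <cstdint>

enum class ReuseHint : uint8_t { Default, HighReuse, ModerateReuse, LowReuse };

struct ArrayBounds { uintptr_t start, end; }; // one ABR pair per Property Array

// Classify the virtual address of an LLC access. With no registered arrays
// (non-graph application) everything is Default; otherwise the first two
// (llcBytes / numArrays)-sized regions of each Property Array map to High-
// and Moderate-Reuse, and everything else is Low-Reuse.
ReuseHint classify(uintptr_t vaddr, const ArrayBounds* arrays, int numArrays,
                   size_t llcBytes) {
  if (numArrays == 0) return ReuseHint::Default;
  const size_t regionSize = llcBytes / numArrays;
  for (int i = 0; i < numArrays; ++i) {
    if (vaddr < arrays[i].start || vaddr >= arrays[i].end) continue;
    if (vaddr < arrays[i].start + regionSize) return ReuseHint::HighReuse;
    if (vaddr < arrays[i].start + 2 * regionSize) return ReuseHint::ModerateReuse;
    return ReuseHint::LowReuse;
  }
  return ReuseHint::LowReuse; // graph application, outside any Property Array
}
```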

    C. Specialized Cache Policies

This component of GRASP implements specialized cache policies that protect the cache blocks associated with High-Reuse LLC accesses against thrashing. One naive way of doing so is to pin the High-Reuse cache blocks in the LLC. However, pinning would sacrifice any opportunity to exploit temporal reuse that may be exposed by other cache blocks (e.g., Moderate-Reuse cache blocks).

To overcome this challenge, GRASP adopts a flexible approach by augmenting an existing cache replacement policy with a specialized insertion policy for LLC misses and a hit-promotion policy for LLC hits. GRASP's specialized policies provide preferential treatment to High-Reuse blocks while maintaining flexibility in exploiting temporal reuse in other cache blocks, as discussed next.

Insertion Policy: Accesses tagged as High-Reuse, comprising the set of the hottest vertices belonging to the High Reuse Region, are inserted into the cache at the MRU position to protect them from thrashing. Accesses tagged as Moderate-Reuse, which likely exhibit lower reuse compared to the High Reuse Region, are inserted near the LRU position. Such an insertion policy gives Moderate-Reuse cache blocks an opportunity to experience a hit without causing thrashing. Finally, accesses tagged as Low-Reuse, comprising the rest of the graph dataset, including the long tail of the Property Array containing cold vertices, are inserted at the LRU position, making them immediate candidates for replacement while still providing them with an opportunity to experience a hit and be promoted using the specialized policy described next.

Hit-Promotion Policy: Cache blocks associated with High-Reuse LLC accesses are immediately promoted to the MRU position on a hit to protect them from thrashing. LLC hits to blocks classified as Moderate-Reuse or Low-Reuse make for an interesting case. On the one hand, the likelihood of these blocks seeing further reuse is quite limited, which means they should not be promoted directly to the MRU position. On the other hand, by experiencing at least one hit, these blocks have demonstrated temporal locality, which cannot be completely ignored. GRASP takes a middle ground for such blocks by gradually promoting them towards the MRU position on every hit.

Eviction Policy: GRASP's eviction policy does not utilize the hint to differentiate blocks at replacement time; hence, it is unmodified from the baseline scheme. This is a key factor that keeps cache management flexible under GRASP. By not prioritizing candidates for eviction, GRASP ensures that blocks classified as High-Reuse but not referenced for a long time can yield cache space to other blocks that do exhibit reuse. Because the unchanged eviction policy does not need to differentiate between blocks with High-Reuse and other hints, cache blocks do not need to explicitly store the Reuse Hint as additional LLC metadata.

Table II
POLICY COLUMNS SHOW HOW GRASP UPDATES THE PER-BLOCK 3-BIT RRPV COUNTER OF RRIP (BASE SCHEME) FOR A GIVEN REUSE HINT. A HIGHER RRPV VALUE INDICATES HIGHER EVICTION PRIORITY.

Reuse Hint      Insertion Policy  Hit Policy
High-Reuse      RRPV = 0          RRPV = 0
Moderate-Reuse  RRPV = 6          if RRPV > 0: RRPV--
Low-Reuse       RRPV = 7          if RRPV > 0: RRPV--
Default         RRPV = 6 or 7     RRPV = 0

Table II shows the specialized cache policies for all Reuse Hints under GRASP; a code sketch of these updates follows below. While the table, and our evaluation, assumes RRIP [52] as the base replacement scheme, we note that GRASP is not fundamentally dependent on RRIP and can be implemented over many other schemes including, but not limited to, LRU, Pseudo-LRU and DIP [60].
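For concreteness, the sketch below renders the Table II updates in code, assuming a 3-bit RRPV counter as in the evaluated DRRIP baseline; insertNearLRU() is our stand-in for the base scheme's probabilistic choice between the two default insertion positions.

```cpp
#include <cstdint>

enum class ReuseHint : uint8_t { Default, HighReuse, ModerateReuse, LowReuse };

constexpr uint8_t kMaxRRPV = 7; // 3-bit counter; higher = closer to eviction

bool insertNearLRU(); // assumed base-RRIP probabilistic insertion choice

uint8_t rrpvOnInsertion(ReuseHint hint) {
  switch (hint) {
    case ReuseHint::HighReuse:     return 0;            // MRU: protect hot vertices
    case ReuseHint::ModerateReuse: return kMaxRRPV - 1; // near LRU: one shot at a hit
    case ReuseHint::LowReuse:      return kMaxRRPV;     // LRU: immediate victim candidate
    default:                       return insertNearLRU() ? kMaxRRPV - 1 : kMaxRRPV;
  }
}

uint8_t rrpvOnHit(ReuseHint hint, uint8_t rrpv) {
  switch (hint) {
    case ReuseHint::HighReuse: return 0;                       // straight to MRU
    case ReuseHint::ModerateReuse:
    case ReuseHint::LowReuse:  return rrpv > 0 ? rrpv - 1 : 0; // gradual promotion
    default:                   return 0;                       // base RRIP hit behavior
  }
}
```

Note that the eviction side needs no hint: the victim is chosen exactly as in baseline RRIP, which is what lets stale High-Reuse blocks eventually age out and yield space to other blocks.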

    D. Benefits of GRASP over Prior Schemes

The state-of-the-art history-based schemes [5, 10, 13, 26, 28, 29, 49, 53] require intrusive modifications to the cache structure in the form of metadata embedded in cache blocks and/or dedicated predictor tables. These schemes also require propagating a PC signature through the core pipeline all the way to the LLC, which so far has hindered their commercial adoption.

In comparison, GRASP is implemented within the same hardware structure required by the base lightweight scheme (e.g., RRIP). GRASP propagates only a 2-bit Reuse Hint to the LLC on each cache access to guide cache policy decisions. By relying on lightweight software support, GRASP reliably pinpoints hot vertices in hardware without requiring costly prediction tables and/or additional per-cache-block metadata.

When compared to pinning-based schemes, GRASP policies protect hot vertices from thrashing while remaining flexible enough to capture the reuse of other blocks as needed. Combining robust cache policies with minimal hardware modifications makes GRASP feasible for commercial adoption while also providing higher LLC efficiency.

    IV. METHODOLOGY

    A. Graph Processing Framework

For the evaluation, we use Ligra [43], a widely used graph processing framework that supports both pull- and push-based computations, including switching from pull to push (and vice versa) at the start of every iteration. We combine the five diverse applications listed in Table III with the five high-skew graph datasets listed in Table V, resulting in 25 benchmarks. To test the robustness of GRASP on adversarial workloads, we use two additional datasets with low-/no-skew.


Table III
GRAPH-ANALYTIC APPLICATIONS.

Application: Betweenness Centrality (BC)
finds the most central vertices in a graph by using a BFS kernel to count the number of shortest paths passing through each vertex from a given root vertex.

Application: Single Source Shortest Path (SSSP)
computes the shortest distance for vertices in a weighted graph from a given root vertex using the Bellman-Ford algorithm.

Application: Pagerank (PR)
is an iterative algorithm that calculates ranks of vertices based on the number and quality of incoming edges to them [68].

Application: PageRank-Delta (PRD)
is a faster variant of PageRank in which vertices are active in an iteration only if they have accumulated enough change in their PageRank score.

Application: Radii Estimation (Radii)
estimates the radius of each vertex by performing multiple parallel BFS's from a small sample of vertices [56].

Table IV
EFFECT OF OUR OPTIMIZATION ON THE ORIGINAL LIGRA IMPLEMENTATION FOR DIFFERENT APPLICATIONS. PR APPLIES PULL-BASED COMPUTATIONS WHEREAS SSSP APPLIES PUSH-BASED COMPUTATIONS THROUGHOUT THE EXECUTION; THE REST OF THE APPLICATIONS SWITCH BETWEEN PULL AND PUSH BASED ON THE NUMBER OF ACTIVE VERTICES IN A GIVEN ITERATION.

Application  Merging opportunity?  Speed-up
BC           No                    -
SSSP         Yes                   3-8%
PR           Yes                   40-52%
PRD          Yes                   14-49%
Radii        No                    -

We obtained the source code for the graph applications from Ligra [43] and applied a simple data-structure optimization to improve locality in the baseline implementation, as follows. As explained in Sec. II-C, graph applications exhibit irregular accesses to the Property Array, and applications may maintain more than one such array. When multiple Property Arrays are used, the elements corresponding to a given vertex may need to be sourced from all of the arrays. We merge these arrays to induce spatial locality (see the sketch below), which reduces the number of misses and, in turn, improves performance on all datasets for PR, PRD and SSSP (see Table IV). We use the optimized implementation of these three applications as a stronger baseline for our evaluation. The optimized applications are available at https://github.com/faldupriyank/grasp. We note that GRASP does not mandate merging arrays, as the GRASP design can accommodate multiple arrays. Nevertheless, merging does reduce the number of arrays that need to be tracked.
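A minimal sketch of the merging follows, using hypothetical PageRank-style properties (Ligra's actual data structures differ): two parallel Property Arrays become a single array of structs, so that the properties of a vertex land in the same cache block.

```cpp
#include <vector>

// Hypothetical per-vertex properties, e.g., the two PageRank ranks.
struct VertexProps {
  double prevRank; // rank from the previous iteration
  double currRank; // rank being computed in the current iteration
};

// Before: std::vector<double> prevRank, currRank;  // two irregular streams
// After: one irregular stream; reading props[v] pulls both properties of
// vertex v into the cache with a single block fill.
std::vector<VertexProps> props;
```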

For PRD, two versions of the algorithm are provided with Ligra: push-based and pull-push. In the baseline implementation, the push-based version is faster. However, after merging the Property Arrays, the pull-push variant performs better, and that is what we use for the evaluation.

    B. Methodology for Software Evaluation

Server Configuration: Native experiments (Sec. V-C) are performed on a dual-socket server with two Broadwell-based Intel Xeon CPU E5-2630 [25] processors, each with 10 cores clocked at 2.2GHz and a 25MB shared LLC. Hyper-threading is enabled, exposing 40 hardware execution contexts across both CPUs. The machine has 128GB of DRAM provided by eight DIMMs clocked at 2133MHz. All experiments were run using 40 threads, and we pinned the software threads to avoid performance variations due to OS scheduling. To further reduce sources of performance variation, we also disabled the turbo boost DVFS features. Finally, we enabled Transparent Huge Pages to reduce TLB misses.

Table V
PROPERTIES OF THE GRAPH DATASETS. THE TOP FIVE DATASETS ARE USED IN THE MAIN EVALUATION WHEREAS THE BOTTOM TWO DATASETS ARE USED AS ADVERSARIAL DATASETS.

Dataset                Vertex Count  Edge Count  Avg. Degree
LiveJournal (lj) [38]      5M            68M         14
PLD (pl) [4]              43M           623M         15
Twitter (tw) [54]         62M         1,468M         24
Kron (kr) [32]            67M         1,323M         20
SD1-ARC (sd) [4]          95M         1,937M         20
Friendster (fr) [23]      64M         2,147M         33
Uniform (uni) [62]        50M         1,000M         20

Evaluation of the software reordering techniques is carried out on the server described above. We report the performance speed-up over the entire application runtime (including reordering cost) but exclude the graph loading time from disk. For the iterative applications, PR and PRD, we run them until convergence and consider the runtime over all iterations. For the root-dependent traversal applications, SSSP and BC, we run them from eight different root vertices for each input dataset and consider the runtime over all eight traversals. Finally, we run six executions for each application-dataset pair and report the geometric mean of the last five trials, excluding the timing of the first trial to allow the caches to warm up. We note that runtime is relatively stable across executions; for each reported datapoint, the coefficient of variation is below 2%.

Reordering Techniques: We evaluate the following reordering techniques, using the source code for the skew-aware techniques from https://github.com/faldupriyank/dbg and for Gorder from https://github.com/datourat/Gorder.

Sort reorders vertices in the memory space by sorting them in descending order of their degree.

HubSort [19] segregates hot vertices in a contiguous region by assigning them a continuous range of vertex IDs in descending order of degree. In doing so, HubSort essentially sorts all hot vertices, while largely preserving the structure of the cold vertices.

DBG [2], unlike Sort and HubSort, does not rely on sorting to segregate hot vertices. Instead, DBG coarsely partitions all vertices into a small number of groups based on their degree. Like Sort and HubSort, DBG is effective at improving spatial locality; however, unlike the other two techniques, DBG is able to largely preserve the existing graph structure.



Table VI
PARAMETERS OF THE SIMULATED SYSTEM FOR EVALUATION OF THE HARDWARE SCHEMES.

Core          OoO @ 2.66GHz, 4-wide front-end
L1-I/D Cache  4/8-way 32KB, 4-cycle access latency, stride-based prefetchers with 16 streams
L2 Cache      Unified, 8-way 256KB, 6-cycle access latency
L3 Cache      16-way 16MB NUCA (2MB slice per core), 10-cycle bank access latency
NOC           Ring network with 2 cycles per hop
Memory        50ns latency, 2 on-chip memory controllers

Gorder [30] is evaluated as a representative of complex techniques. As Gorder is only available in a single-threaded implementation, when reporting the net runtime of Gorder for a given dataset, we optimistically divide the reordering time by 40 (the maximum number of threads supported on the server) to provide a fair comparison with the skew-aware techniques, whose reordering implementations are fully parallelized.

    C. Methodology for Hardware Evaluation

Simulation Infrastructure: We use the Sniper [37] simulator, modeling 8 OoO cores. Table VI lists the parameters of the simulated system. The applications are evaluated in multi-threaded mode with 8 threads.

We find that the graph applications spend a significant fraction of time (86% on average in our evaluations) in push-based iterations for SSSP, or in pull-based iterations for all other evaluated applications. Thus, we simulate a Region of Interest (ROI) covering only push- or pull-based iterations (whichever dominates) for the respective applications. Because simulating all iterations of a graph-analytic application in a detailed microarchitectural simulator is prohibitive, time-wise, we instead simulate the one iteration that has the highest number of active vertices. To validate the soundness of our methodology, we also simulated one more randomly chosen iteration, with at least 20% of vertices active, for each application-dataset pair and observed trends similar to the ones reported in the paper.

Hardware Cache Management Schemes: We evaluate GRASP and compare it with the state-of-the-art thrash-resistant cache management schemes described below.

RRIP [52] is the state-of-the-art lightweight scheme that does not depend on history-based learning. RRIP is the most appropriate comparison point given that GRASP builds upon RRIP as the base scheme (Sec. III-C). We implement RRIP (specifically, DRRIP) based on the source code from the cache replacement championship (CRC1) [50], and use a 3-bit counter per cache block. We use RRIP as a high-performance baseline and report speed-ups for all hardware schemes over the RRIP baseline (except for the studies in Sec. V-D, which use an LRU baseline).

Signature-based Hit Predictor (SHiP) [49] is the state-of-the-art insertion policy, which builds on RRIP [52]. Due to the shortcomings of PC-based reuse correlation for graph applications, as explained in Sec. II-F, we evaluate a SHiP-MEM variant that correlates a block's reuse with the block's memory region. We use 16KB memory regions as in the original proposal. The predictor table is provisioned with an unlimited number of entries to assess the maximum potential of the scheme. Each table entry consists of a 3-bit saturating counter that tracks the re-reference behavior of cache blocks of the memory region associated with that entry.

Hawkeye [26] is the state-of-the-art cache management scheme and winner of the cache replacement championship (CRC2) [11]. Hawkeye trains its predictor table by applying Belady's MIN algorithm to past LLC accesses to infer a block's cache friendliness. We use the source code from CRC2 for Hawkeye, which improves upon the prefetcher-agnostic design of Hawkeye from [26]. We appropriately scale the number of sampling sets and predictor table entries for a 16MB cache.

Leeway [10] is a history-based cache management scheme that applies dead-block prediction based on a metric called Live Distance, which conservatively captures the reuse interval of a cache block. We use the most recent version of the source code for Leeway from https://github.com/faldupriyank/leeway. We appropriately scale the number of sampling sets and predictor table entries for a 16MB cache.

XMem [7] is a pinning-based scheme, originally proposed for algorithms that benefit from cache tiling. A cache block, once pinned, cannot be evicted until explicitly unpinned by the software, usually when the processing of a tile is complete. In the original proposal, XMem reserves 75% of LLC capacity to pin tile data, whereas the remaining capacity is managed by the base replacement scheme for the rest of the data. In this work, we explore four configurations of XMem, labeled PIN-X, where X refers to the percentage (25%, 50%, 75% or 100%) of LLC capacity reserved for pinning. We adapt the XMem design for graph analytics and identify the cache blocks from the High Reuse Region that benefit from pinning using the GRASP interface. Finally, XMem requires an additional bit per cache block to identify whether the block is pinned, along with an additional mechanism to track how much of the capacity is used by pinned cache blocks at any given time.

GRASP is the proposed domain-specialized cache management scheme for graph analytics. We instrument the applications to communicate the address bounds of the Property Arrays to the simulated GRASP hardware. For each evaluated application, we needed to instrument at most two arrays. Finally, GRASP uses RRIP as the base cache scheme with a 3-bit saturating counter and does not add any further storage to per-block metadata.

    V. EVALUATION

We first evaluate hardware cache management schemes on top of a skew-aware software reordering technique (Sec. V-A & V-B). Due to long simulation times, evaluating all hardware schemes on top of all four software reordering techniques would be prohibitive. Thus, without loss of generality, we evaluate the hardware schemes on top of DBG, which consistently outperforms the other reordering techniques (Sec. V-C). In Sec. V-C, we also evaluate GRASP with the other reordering techniques to show GRASP's generality.

Figure 5. LLC miss reduction for GRASP and state-of-the-art history-based cache management schemes over the RRIP baseline. [Per-dataset bars for BC, SSSP, PR, PRD and Radii on lj, pl, tw, kr and sd, plus the geometric mean.]

Figure 6. Speed-up for GRASP and state-of-the-art history-based cache management schemes over the RRIP baseline. [Same layout as Fig. 5.]

    A. History-based Predictive Schemes

In this section, we compare GRASP with the state-of-the-art hardware schemes SHiP-MEM [49], Hawkeye [26] and Leeway [10]. Prior cache management proposals typically used LRU as the baseline, which is known to be inefficient against thrashing access patterns. Thus, we use RRIP, the state-of-the-art history-agnostic caching scheme, as a strong baseline. Finally, we use DBG as the software baseline; thus, all speed-ups reported in this section are over and above DBG.

Miss reduction: Fig. 5 shows the miss reduction over the RRIP baseline. GRASP consistently reduces misses on all datapoints, eliminating 6.4% of LLC misses on average and up to 14.2% in the best case (on the lj dataset for the Radii application). The domain-specialized design allows GRASP to accurately identify the high-reuse working set (i.e., hot vertices), which GRASP is able to retain in the cache through its specialized policies, effectively exploiting the temporal reuse.

Among the prior techniques, Leeway is the only one that reduces misses, albeit marginally, with an average miss reduction of 1.1% over the RRIP baseline. The other two techniques are not effective for graph applications, with SHiP-MEM and Hawkeye increasing misses across datapoints, yielding average miss reductions of -4.8% and -22.7%, respectively, over the baseline. This is a new result, as prior works show that Hawkeye and SHiP-MEM outperform RRIP on a wide range of applications [26, 49]. The result indicates that the learning mechanisms of the state-of-the-art domain-agnostic schemes are deficient in retaining the high-reuse working set for graph applications, which ends up hurting application performance, as discussed next.

Application speed-up: Fig. 6 shows the speed-up for the hardware schemes over the RRIP baseline. Overall, performance correlates well with the change in LLC misses; GRASP consistently provides a speed-up across datapoints, with an average speed-up of 5.2% and up to 10.2% in the best case (on the pl dataset for the SSSP application) over the baseline. When compared to the same baseline, SHiP-MEM and Hawkeye consistently cause slowdowns, with average speed-ups of -5.5% and -16.2%, respectively, whereas Leeway yields a marginal speed-up of 0.9%. Finally, when compared to the prior works directly, GRASP yields 4.2%, 5.2%, 11.2% and 25.5% average speed-ups over Leeway, RRIP, SHiP-MEM and Hawkeye, respectively, while not causing a slowdown on any datapoint.

We also evaluated the prior schemes without applying any vertex reordering. Leeway, SHiP-MEM and Hawkeye yield average speed-ups of -0.8%, -5.7% and -14.8%, respectively, over RRIP on the datasets with no reordering. A detailed study is omitted due to space constraints.

Dissecting Performance of SHiP-MEM: SHiP-MEM is a history-based scheme that predicts the reuse of a cache block based on the fine-grained memory region to which it belongs. Thus, SHiP-MEM relies on homogeneous cache behavior for all blocks belonging to the same memory region. In theory, DBG (or another skew-aware technique) should allow SHiP-MEM to identify the memory regions containing the hottest vertices (corresponding to the High Reuse Region from Fig. 3(c)). In practice, irregular access patterns to these regions and thrashing by cache blocks from other regions impede learning. Thus, despite leveraging software and utilizing a sophisticated, storage-intensive prediction mechanism in hardware, SHiP-MEM underperforms domain-specialized GRASP.

Dissecting Performance of Hawkeye: Hawkeye is the state-of-the-art history-based scheme that uses PC-based reuse correlation to predict whether a cache block has cache-friendly or cache-averse behavior based on past LLC accesses. Thus, Hawkeye fundamentally relies on homogeneous cache behavior for all blocks accessed by the same PC address. When Hawkeye is employed for graph analytics, it struggles to learn the behavior of cache blocks in the Property Array, as hot vertices exhibit cache-friendly behavior while cold vertices exhibit cache-averse behavior, yet all vertices are accessed by the same PC address. To make matters worse, if a block incurs a hit and Hawkeye predicts the PC making the access to be cache-averse, the cache block is prioritized for eviction instead of being promoted to MRU, as is done in the baseline. Thus, Hawkeye performs even worse than the baseline for all combinations of graph applications and datasets. While not evaluated, other schemes (e.g., [49, 53]) that rely on a PC-based correlation would also struggle on graph applications for the same reason.

Dissecting Performance of Leeway: Leeway, like Hawkeye, also relies on a PC-based reuse correlation, and thus is not expected to provide significant speed-ups for graph analytics. However, Leeway successfully avoids a slowdown on 10 of the 25 datapoints and significantly limits the slowdown on the rest (max slowdown of 2.1% vs. 13.6% for SHiP-MEM and 30.2% for Hawkeye). The reasons why Leeway performs better than prior PC-based schemes can be attributed to (1) the conservative nature of the Live Distance metric, which Leeway uses to determine whether a cache block is dead, and (2) adaptive reuse-aware policies that control the rate of predictions based on the observed access patterns. Because of these two factors, the performance of Leeway remains close to that of the base replacement scheme in the presence of variability in cache block reuse behavior.

Dissecting Performance of GRASP: The performance of GRASP over its base scheme, RRIP, can be attributed to three features: software hints, insertion policy and hit-promotion policy. Fig. 7 shows the performance impact of each of these features. RRIP inserts every new cache block at one of two positions (as specified in the Default Reuse Hint of Table II): a cache block is inserted at the LRU position with a high probability or near the LRU position with a low probability. RRIP+Hints is identical to RRIP except for how a new cache block is assigned these positions; RRIP+Hints uses software hints (similar to GRASP) to guide the insertion. A cache block with a High-Reuse hint is inserted near the LRU position, and all other blocks are inserted at the LRU position. GRASP (Insertion-Only) refers to the scheme that applies the insertion policy of GRASP as specified in Table II, while the hit-promotion policy is unchanged from RRIP. Finally, GRASP (Hit-Promotion) refers to the scheme that applies the hit-promotion policy of GRASP along with its insertion policy, which is essentially the full GRASP design. Note that each successive scheme adds a new feature on top of the features incorporated by the previous ones. For example, GRASP (Insertion-Only) features a new insertion policy in addition to the software hints.

Figure 7. Impact of GRASP features on performance (speed-up (%) over the RRIP baseline for RRIP+Hints, GRASP (Insertion-Only) and GRASP (Hit-Promotion)).

As the figure shows, RRIP+Hints yields an average speed-up of 3.3% over probabilistic RRIP, confirming the utility of software hints. GRASP (Insertion-Only) further increases performance, yielding an average speed-up of 5.0%. GRASP (Insertion-Only) provides additional protection to the High-Reuse cache blocks in comparison to RRIP+Hints by inserting High-Reuse cache blocks directly at the MRU position. Finally, GRASP (Hit-Promotion) yields an average speed-up of 5.2%. The difference between GRASP (Hit-Promotion) and GRASP (Insertion-Only) is marginal, as the hit-promotion policy of GRASP has a negative effect on slightly less than half the datapoints. The results are in line with observations from prior work showing that the value-add of hit-promotion policies over insertion policies is low in the presence of cache thrashing [22].
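For concreteness, the sketch below expresses these policies as RRPV assignments in a 2-bit RRIP implementation (0 = MRU, 3 = LRU). The insertion probability and exact RRPV values are our assumptions, not Table II verbatim.

```cpp
#include <cstdint>
#include <random>

// Our interpretation of the schemes compared in Fig. 7; the insertion
// bias and exact RRPV values are assumptions.
enum class Hint { kHighReuse, kDefault };

uint8_t insert_rrpv(Hint h, bool grasp_insertion, std::mt19937& rng) {
  if (h == Hint::kHighReuse)
    return grasp_insertion ? 0   // GRASP (Insertion-Only): straight to MRU
                           : 2;  // RRIP+Hints: near the LRU position
  // Baseline RRIP for all other blocks: LRU with high probability,
  // near-LRU with low probability (bias value assumed).
  std::bernoulli_distribution at_lru(0.97);
  return at_lru(rng) ? 3 : 2;
}

// GRASP (Hit-Promotion): on a hit, keep High-Reuse blocks at MRU while
// promoting other blocks only gradually (assumed single-step promotion).
uint8_t promote_rrpv(Hint h, uint8_t rrpv) {
  return (h == Hint::kHighReuse) ? 0 : (rrpv > 0 ? rrpv - 1 : 0);
}
```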

Summary: Hardware cache management is an established, difficult problem, which is reflected in the small average speed-ups (usually 1%-5%) achieved by state-of-the-art cache management schemes over the prior best schemes [5, 26, 28, 29, 49, 52, 53]. Our work shows that graph applications present a particularly challenging workload for these schemes, in many cases leading to significant performance slowdowns. In this light, GRASP is quite successful in improving the performance of graph applications, yielding an average speed-up of 5.2% (max 10.2%) while not causing a slowdown on any datapoint. Moreover, unlike state-of-the-art schemes, GRASP achieves this without requiring storage-intensive metadata.

    B. Pinning-based Schemes

In this section, we show the benefit of flexible GRASP policies over rigid pinning-based approaches. We first present results on the high-skew datasets and then on the low-/no-skew datasets to test resilience in adversarial scenarios.

High-skew datasets: Fig. 8 shows speed-ups for four XMem configurations (PIN-25, PIN-50, PIN-75 and PIN-100) and GRASP over the RRIP baseline on high-skew datasets. GRASP outperforms all XMem configurations on 24 of 25 datapoints with an average speed-up of 5.2%. In comparison, PIN-25, PIN-50, PIN-75 and PIN-100 yield 0.4%, 1.1%, 2.0% and 2.5%, respectively.

Figure 8. Speed-up for GRASP and pinning-based schemes over the RRIP baseline on high-skew datasets.

Figure 9. Speed-up over the RRIP baseline on fr, a low-skew dataset, and uni, a no-skew dataset.

PIN-100 outperforms the other three XMem configurations because, for those configurations, a significant fraction of the capacity can still be occupied by cold vertices, which causes thrashing in the unreserved capacity. Nevertheless, PIN-100 causes a slowdown on many datapoints (e.g., for the BC, PR and PRD applications on the tw and sd datasets). Moreover, PIN-100 cannot capitalize on reuse from the Moderate Reuse Region, as pinned vertices cannot be evicted even when they have stopped exhibiting reuse. Thus, PIN-100 provides only a marginal speed-up on many datapoints (e.g., the Radii application on the lj, tw and kr datasets).
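The rigidity of pinning is visible directly in victim selection. The sketch below (our construction, not XMem's actual implementation) shows a PIN-style policy that excludes pinned ways from the victim pool, so a pinned block that has stopped exhibiting reuse still occupies capacity.

```cpp
#include <cstdint>

// PIN-style victim filter (illustrative): pinned ways are excluded from
// the victim pool outright. GRASP instead only biases block ages, letting
// blocks that stop exhibiting reuse eventually age out.
int pick_victim(const uint8_t* rrpv, const bool* pinned, int ways) {
  int victim = -1;
  for (int w = 0; w < ways; ++w) {
    if (pinned[w]) continue;  // rigid: pinned blocks are never candidates
    if (victim < 0 || rrpv[w] > rrpv[victim]) victim = w;
  }
  return victim;  // -1 means every way is pinned (request must bypass)
}
```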

PIN-75 and PIN-100, the two best-performing XMem configurations, while yielding only marginal speed-ups, still outperform the state-of-the-art history-based schemes SHiP-MEM, Leeway and Hawkeye (Figs. 6 & 8). This confirms that utilizing software knowledge for cache management is a promising direction over storage-intensive domain-agnostic designs for the challenging access patterns of graph analytics.

Low-/No-skew datasets: Next, we evaluate the robustness of GRASP and the pinning-based schemes (PIN-75 and PIN-100) on adversarial datasets with low/no skew. Naturally, these schemes are not expected to provide a significant speed-up in the absence of high skew; however, a robust scheme would reduce or avoid the slowdown. Fig. 9 shows the speed-up of these schemes over the RRIP baseline for a low-skew dataset, fr, and a no-skew dataset, uni.

GRASP provides a net speed-up on 9 out of 10 datapoints even for low-/no-skew datasets. On the low-skew dataset fr, GRASP yields a speed-up between 0.4% and 4.3%, whereas on the no-skew dataset uni, GRASP yields a speed-up between -0.1% and 2.4%. In contrast, PIN-75 and PIN-100 cause a slowdown on almost all datapoints.

In the absence of high skew, cache blocks belonging to the High Reuse Region do not dominate the overall LLC accesses. Thus, pinning these blocks throughout the execution is counter-productive for PIN-75 and PIN-100. In contrast, GRASP adopts a flexible approach, wherein high-priority cache blocks from the High Reuse Region can make way for other blocks that observe some reuse, as needed. Thus, GRASP successfully limits the slowdown, and even provides a reasonable speed-up on some datapoints, for such highly adversarial datasets.

Finally, combining the results on all 7 datasets (5 datasets from Fig. 8 and 2 from Fig. 9), GRASP yields an average speed-up of 4.1%. In comparison, PIN-75 and PIN-100 provide a marginal speed-up of only 0.5% and 0.1%, respectively. PIN-75 and PIN-100 cause slowdowns of up to 5.3% and 14.2%, whereas the max slowdown for GRASP is only 0.1%.

    C. Reordering Techniques and GRASP

Thus far, we have evaluated GRASP on graph applications processing datasets reordered using DBG. In this section, we compare the performance of vertex reordering techniques, followed by an evaluation of GRASP on top of these techniques, demonstrating GRASP's generality.

Effectiveness of Reordering Techniques: We first show that skew-aware techniques can improve the performance of graph applications even as a standalone software optimization, thus justifying their existence. We evaluate three skew-aware techniques – Sort, HubSort [19] and DBG [2] – and a complex vertex reordering technique – Gorder [30]. We perform these studies on a real machine with 40 hardware threads, as mentioned in Sec. IV-B.

Fig. 10(a) shows the speed-up for these existing software techniques, after accounting for their reordering cost, over the baseline with no reordering. All skew-aware techniques are effective on the largest of the datasets (e.g., kr and sd) and on long iterative applications (e.g., PR). As these techniques rely on a low-cost approach to reordering, the reordering cost is amortized quickly when the application runtime is high, making these solutions practically attractive. Averaged across all application and dataset pairs, skew-aware techniques yield a net speed-up of 2.6% for Sort, 0.6% for HubSort and 10.8% for DBG.

Unsurprisingly, Gorder causes a significant slowdown on all datapoints due to its large reordering cost, yielding an average speed-up of -85.4%. Thus, Gorder is less practical when compared to simple yet effective skew-aware techniques, corroborating prior work [2].

Figure 10. Reordering techniques + GRASP. (a) Net speed-up for existing software reordering techniques (Sort, HubSort, DBG, Gorder) after accounting for their reordering cost on a real machine. (b) Application speed-up of GRASP over the RRIP baseline on top of different reordering techniques. In each plot, the left group shows speed-up for a dataset across all applications, while the right group shows speed-up for an application across all datasets.

Generality of GRASP: As software vertex reordering techniques offer different trade-offs in preserving graph structure and reducing reordering cost, it is important for GRASP not to be coupled to any one software technique. In this section, we evaluate GRASP with different reordering techniques, both skew-aware and complex ones. While skew-aware techniques are readily compatible with GRASP, Gorder requires a simple tweak, as follows.

After applying Gorder to an original dataset, we apply DBG to further reorder vertices, which results in a vertex order that retains most of the Gorder ordering while also segregating hot vertices in a contiguous region, making Gorder compatible with GRASP.
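As a minimal sketch of this composition, assume each pass is exposed as a permutation new_id[old_id] (function and variable names are ours); applying DBG's permutation, computed on the Gorder-ordered graph, on top of Gorder's yields the final vertex order.

```cpp
#include <cstddef>
#include <vector>

// Sketch of the Gorder+DBG composition: gorder_perm maps original vertex
// IDs to Gorder IDs, and dbg_perm_on_gordered maps Gorder IDs to the
// final IDs, which segregate hot vertices contiguously while retaining
// most of Gorder's locality.
std::vector<int> compose(const std::vector<int>& gorder_perm,
                         const std::vector<int>& dbg_perm_on_gordered) {
  std::vector<int> final_perm(gorder_perm.size());
  for (std::size_t v = 0; v < gorder_perm.size(); ++v)
    final_perm[v] = dbg_perm_on_gordered[gorder_perm[v]];
  return final_perm;
}
```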

Fig. 10(b) shows the speed-up for GRASP over RRIP on top of the same reordering technique as the baseline. As with DBG, GRASP consistently provides a speed-up across datasets and applications on top of the other reordering techniques as well. On average, GRASP yields a speed-up of 4.4%, 4.2%, 5.2% and 5.0% on top of Sort, HubSort, DBG and Gorder, respectively. The result confirms that GRASP complements a broad class of existing software reordering techniques.

    D. GRASP vs Optimal Replacement (OPT)

In this section, we compare GRASP with Belady's optimal replacement policy (OPT) [72]. As OPT requires perfect knowledge of the future, we generate traces of LLC accesses (up to 2 billion accesses per trace) for the applications processing graph datasets reordered using DBG on the simulation baseline configuration specified in Sec. IV-C. We apply OPT to each trace for five different LLC sizes – 1MB, 4MB, 8MB, 16MB and 32MB – to obtain the minimum number of misses for a given cache size, and report the percentage of misses eliminated over LRU for the same LLC size.

Table VII
PERCENTAGE OF MISSES ELIMINATED OVER LRU FOR DIFFERENT LLC SIZES

Scheme   1MB     4MB     8MB     16MB    32MB
RRIP     15.9%   16.4%   15.7%   15.2%   16.2%
GRASP    15.4%   17.0%   18.1%   19.7%   21.2%
OPT      27.5%   32.2%   33.3%   34.3%   34.5%

Figure 11. Percentage of misses eliminated over LRU (RRIP, GRASP and OPT, per dataset and per application).
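For concreteness, the sketch below applies Belady's policy to a block-address trace. It is a minimal fully-associative illustration (set indexing and bypass are elided, and the evaluated LLC is set-associative); the helper name opt_misses is ours.

```cpp
#include <cstddef>
#include <cstdint>
#include <limits>
#include <unordered_map>
#include <vector>

// Belady's OPT on a block-address trace: on a miss with a full cache,
// evict the resident block reused farthest in the future, which is the
// offline-optimal choice.
std::size_t opt_misses(const std::vector<uint64_t>& trace,
                       std::size_t capacity) {
  const std::size_t kNever = std::numeric_limits<std::size_t>::max();
  // Backward pass: next_use[i] = index of the next access to trace[i].
  std::vector<std::size_t> next_use(trace.size(), kNever);
  std::unordered_map<uint64_t, std::size_t> last_seen;
  for (std::size_t i = trace.size(); i-- > 0;) {
    auto it = last_seen.find(trace[i]);
    if (it != last_seen.end()) next_use[i] = it->second;
    last_seen[trace[i]] = i;
  }
  // Forward pass: simulate the cache.
  std::unordered_map<uint64_t, std::size_t> resident;  // block -> next use
  std::size_t misses = 0;
  for (std::size_t i = 0; i < trace.size(); ++i) {
    if (resident.erase(trace[i]) == 0) {  // not resident: a miss
      ++misses;
      if (resident.size() == capacity) {  // evict farthest-future block
        auto victim = resident.begin();
        for (auto it = resident.begin(); it != resident.end(); ++it)
          if (it->second > victim->second) victim = it;
        resident.erase(victim);
      }
    }
    resident[trace[i]] = next_use[i];     // (re)install with updated next use
  }
  return misses;
}
```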

Miss reduction on a 16MB LLC: Fig. 11 shows the results for OPT along with RRIP and GRASP for a 16MB LLC. OPT eliminates 34.3% of total misses over LRU. In comparison, GRASP eliminates 19.7% of misses (vs 15.2% for RRIP). Overall, GRASP is 57.5% as effective in eliminating misses as OPT, an offline technique with perfect knowledge of the future. While GRASP is the most effective among the online techniques, the results also show that the remaining opportunity (the gap between OPT and GRASP) is still significant, which warrants further research in this direction.

Sensitivity of GRASP to LLC size: Table VII shows the average percentage of misses eliminated over LRU by RRIP, GRASP and OPT for different LLC sizes. As the LLC size increases, GRASP becomes more effective at eliminating misses over LRU (average miss reduction of 15.4% for 1MB vs 21.2% for 32MB). This is expected, as a larger LLC allows GRASP to provide preferential treatment to more hot vertices. In general, yet larger LLCs are expected to benefit even more from GRASP, until the LLC becomes large enough to accommodate all hot vertices.

    VI. RELATED WORK

Shared-memory graph frameworks: A significant amount of research has focused on designing high-performance shared-memory frameworks for graph applications. The majority of these frameworks are vertex-centric [32, 35, 42, 43, 46, 55] and use CSR or its variants to encode a graph, making GRASP readily compatible with these frameworks. More generally, GRASP requires classification of only the Property Array(s), making it independent of the specific data structure used to represent the graph, which further increases compatibility across the spectrum of frameworks. Thus, we expect GRASP to reduce misses across frameworks, though absolute speed-ups will likely vary.
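For context, the classification GRASP needs can be as simple as an address-range check against the bounds of the hot portion of a Property Array; the sketch below is a minimal illustration of such an interface, with names of our choosing rather than the exact interface of Sec. III.

```cpp
#include <cstdint>

// Minimal illustration: software registers the address bounds of the hot
// region of a Property Array, and hardware classifies a block with a
// simple range check on its physical address.
struct HotRegion {
  uint64_t base;  // first byte of the hot (high-reuse) region
  uint64_t end;   // one past the last byte
};

inline bool is_high_reuse(uint64_t paddr, const HotRegion& r) {
  return paddr >= r.base && paddr < r.end;
}
```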

Distributed-memory graph frameworks: These frameworks can also benefit from GRASP. For example, PGX [33] and PowerGraph [45] proposed duplicating high-degree vertices in graph partitions to reduce the high communication overhead across computing nodes. These optimizations are largely orthogonal to GRASP cache management. As such, GRASP can be applied to distributed graph processing by caching high-degree vertices within each node's LLC to improve node-level cache behavior.

Streaming graph frameworks: In this work, we have assumed that graphs are static. In practice, graphs may evolve over time, and a stream of graph updates (i.e., additions or removals of vertices or edges) is interleaved with graph-analytic queries (e.g., computing the pagerank of vertices or computing shortest paths from different root vertices). For such deployment settings, a CSR-based structure is infeasible. Instead, researchers have proposed various data structures for graph encoding that can accommodate fast graph updates and allow space-efficient versioning [1, 34, 44]. Meanwhile, each graph query is performed on a consistent view (i.e., a static snapshot) of the graph. For example, Aspen [1], a recent graph-streaming framework, uses Ligra (a static graph-processing framework) in the back-end to run graph-analytic queries. Thus, the observations made in this paper regarding cache thrashing due to the irregular access patterns of the Property Array, as well as skew-aware reordering and GRASP being complementary in combating cache thrashing, are also relevant for dynamic graphs.

For static graphs, the vertex reordering cost is amortized over multiple graph traversals for a single graph query (as shown in Fig. 10(a)). For dynamic graphs, however, the reordering cost can be further amortized over multiple graph queries. Intuitively, the addition or deletion of some vertices or edges in a large graph will not drastically change the degree distribution, and is thus unlikely to change which vertices are classified hot within a short time window. Therefore, skew-aware reordering can be applied at periodic intervals to improve cache behavior after a series of updates has been made to a graph, amortizing the reordering cost over multiple graph queries.
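As a sketch of this amortization, a framework might trigger re-reordering only after a threshold of accumulated updates; the controller below is a hypothetical illustration, with the threshold as an assumed knob (e.g., a fraction of the edge count).

```cpp
#include <cstdint>

// Hypothetical controller for periodic re-reordering: re-run the
// skew-aware pass only once enough updates have accumulated, so the cost
// is amortized over the queries executed in between.
struct ReorderController {
  uint64_t updates_since_reorder = 0;
  uint64_t threshold;  // assumed knob

  bool should_reorder(uint64_t batch_size) {
    updates_since_reorder += batch_size;
    if (updates_since_reorder < threshold) return false;
    updates_since_reorder = 0;  // reorder now; restart the count
    return true;
  }
};
```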

Software hints through profiling: Prior works have proposed ISA changes that embed load/store instructions with reuse hints to improve cache replacement decisions [40, 61, 63, 64, 67]. These works perform program analysis via an additional compiler pass and/or runtime profiling to identify data that are unlikely to be referenced again. The compiler uses a custom memory instruction tagged with a low-reuse hint to access such data so that the cache hardware can prioritize the associated cache blocks for eviction. Such mechanisms, however, are largely effective only for programs dominated by loops with regular access patterns. For example, Pacman [40] fails when the loop index variable and the reuse distance of the element accessed in a given loop iteration do not exhibit a linear correlation (e.g., indirect memory accesses, the dominant type of memory access in graph analytics). In contrast, GRASP leverages the vertex placement in memory after a skew-aware reordering pass to correctly identify high-reuse vertices, without requiring custom instructions or a priori program analysis.

Hardware prefetchers: Modern processors typically employ prefetchers that target stride-based access patterns and are thus not amenable to graph analytics. Researchers have proposed custom prefetchers at the L1-D that specifically target the indirect memory access patterns of graph analytics [20, 36]. Nevertheless, prefetching can only hide memory access latency. Unlike cache replacement, prefetching cannot reduce memory bandwidth pressure or DRAM energy expenditure. Indeed, prior work observes that even a 100% accurate prefetcher for graph analytics is bottlenecked by memory bandwidth [36]. In contrast, GRASP reduces bandwidth pressure by reducing LLC misses, and is thus complementary to prefetching.

Traversal scheduling: Mukkara et al. proposed HATS [6], a hardware accelerator implementing locality-aware scheduling to exploit cache locality for graphs exhibiting community structure. While effective, it requires intrusive hardware changes, including a specialized hardware unit with each core and an ISA change on the host core. In contrast, GRASP requires a minimal hardware interface and trivial changes to the cache policy, while largely utilizing the existing cache hardware.

Graph slicing: Researchers have proposed slicing, a software optimization that slices the graph into LLC-sized partitions and processes one partition at a time to reduce irregular memory accesses. Specifically, Graphicionado [24] uses slicing to fit a working set into a large 64MB on-chip scratchpad, while Zhang et al. use CSR Segmenting to break the vertices into segments that fit in the LLC [19].
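As a concrete illustration of slicing, the sketch below processes edges pre-grouped by the segment of their source vertex; the data layout is our assumption rather than the exact scheme of Graphicionado [24] or CSR Segmenting [19].

```cpp
#include <vector>

// Illustrative slicing sketch: edges are pre-grouped by the segment of
// their source vertex, so within one pass the irregular reads into
// src_rank stay inside an LLC-sized address range.
struct Edge { int src, dst; };

void sliced_gather(const std::vector<std::vector<Edge>>& edges_by_segment,
                   const std::vector<double>& src_rank,
                   std::vector<double>& dst_acc) {
  for (const auto& segment : edges_by_segment)  // one LLC-sized slice at a time
    for (const Edge& e : segment)
      dst_acc[e.dst] += src_rank[e.src];        // src reads confined to the slice
}
```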

While generally effective, slicing has two important limitations. First, it requires invasive framework changes to form the slices (which may include replicating vertices to avoid out-of-slice accesses) and to manage them at runtime. Second, for a given cache size, the number of slices increases with the size of the graph, resulting in greater processing overheads for creating and maintaining partitions for larger graphs.

In comparison, GRASP is a hardware scheme that intelligently leverages lightweight software support. GRASP requires minimal changes in a graph framework and does not require any changes to the graph algorithms. That said, GRASP is complementary to slicing: by preserving the critical working set (i.e., hot vertices) in the cache, GRASP could allow Graphicionado to reduce the number of slices by making intra-slice cache misses infrequent.

VII. CONCLUSION

This work explores how hardware cache management should be designed to tackle cache thrashing at the LLC for graph-analytic applications. We show that state-of-the-art cache management schemes are deficient in the presence of cache thrashing stemming from the irregular access patterns of graph applications processing large graphs. In response, we introduce GRASP – specialized LLC cache management for graph analytics on power-law graphs. GRASP's specialized cache policies exploit the high reuse inherent in hot vertices while retaining the flexibility to capture reuse in other cache blocks. GRASP leverages existing software reordering optimizations to enable a lightweight interface that allows hardware to pinpoint hot vertices amidst irregular access patterns. In doing so, GRASP avoids the need for a storage-intensive prediction mechanism or additional metadata storage in the LLC. GRASP requires minimal hardware support, making it attractive for integration into a commodity server processor to enable acceleration for the domain of graph analytics. Finally, GRASP delivers consistent performance gains on high-skew datasets, while preventing slowdowns on low-skew datasets.

ACKNOWLEDGMENT

This work was supported in part by a research grant from Oracle Labs. We thank Amna Shahab, Antonios Katsarakis, Arpit Joshi, Artemiy Margaritov, Avadh Patel, Dmitrii Ustiugov, Rakesh Kumar, Vasilis Gavrielatos, Vijay Nagarajan, and the anonymous reviewers for their valuable feedback on an earlier draft of this work.

REFERENCES

[1] L. Dhulipala, G. E. Blelloch, and J. Shun. “Low-latency Graph Streaming Using Compressed Purely-functional Trees”. In: International Conference on Programming Language Design and Implementation (PLDI). 2019.

[2] P. Faldu, J. Diamond, and B. Grot. “A Closer Look at Lightweight Graph Reordering”. In: International Symposium on Workload Characterization (IISWC). 2019.

[3] V. Balaji and B. Lucia. “When is Graph Reordering an Optimization? Studying the Effect of Lightweight Graph Reordering Across Applications and Input Graphs”. In: International Symposium on Workload Characterization (IISWC). 2018.

[4] Hyperlink Graphs. http://webdatacommons.org/hyperlinkgraph. Web Data Commons, 2018.

[5] A. Jain and C. Lin. “Rethinking Belady’s Algorithm to Accommodate Prefetching”. In: International Symposium on Computer Architecture (ISCA). 2018.

[6] A. Mukkara, N. Beckmann, M. Abeydeera, X. Ma, and D. Sanchez. “Exploiting Locality in Graph Analytics through Hardware-Accelerated Traversal Scheduling”. In: International Symposium on Microarchitecture (MICRO). 2018.

[7] N. Vijaykumar, A. Jain, D. Majumdar, K. Hsieh, G. Pekhimenko, E. Ebrahimi, N. Hajinazar, P. B. Gibbons, and O. Mutlu. “A Case for Richer Cross-Layer Abstractions: Bridging the Semantic Gap with Expressive Memory”. In: International Symposium on Computer Architecture (ISCA). 2018.

[8] J. Díaz, P. Ibáñez, T. Monreal, V. Viñals, and J. Llabería. “ReD: A Policy Based on Reuse Detection for a Demanding Block Selection in Last-Level Caches”. In: International Workshop on Cache Replacement Championship (CRC), co-located with ISCA. 2017.

[9] P. Faldu and B. Grot. “Reuse-Aware Management for Last-Level Caches”. In: International Workshop on Cache Replacement Championship (CRC), co-located with ISCA. 2017.

[10] P. Faldu and B. Grot. “Leeway: Addressing Variability in Dead-Block Prediction for Last-Level Caches”. In: International Conference on Parallel Architectures and Compilation Techniques (PACT). 2017.

[11] P. V. Gratz, J. Kim, A. R. Alameldeen, C. Wilkerson, S. Pugsley, A. Jaleel, B. Falsafi, M. Sutherland, and J. Picorel. The 2nd Cache Replacement Championship (CRC), co-located with ISCA. http://crc2.ece.tamu.edu. 2017.

[12] A. Jain and C. Lin. “Hawkeye Cache Replacement: Leveraging Belady’s Algorithm for Improved Cache Replacement”. In: International Workshop on Cache Replacement Championship (CRC), co-located with ISCA. 2017.

[13] D. A. Jiménez and E. Teran. “Multiperspective Reuse Prediction”. In: International Symposium on Microarchitecture (MICRO). 2017.

[14] D. A. Jiménez. “Multiperspective Reuse Prediction”. In: International Workshop on Cache Replacement Championship (CRC), co-located with ISCA. 2017.

[15] K. Lakhotia, S. Singapura, R. Kannan, and V. Prasanna. “ReCALL: Reordered Cache Aware Locality Based Graph Processing”. In: International Conference on High Performance Computing (HiPC). 2017.

[16] A. Vakil-Ghahani, S. Mahdizadeh-Shahri, M. Lotfi-Namin, M. Bakhshalipour, P. Lotfi-Kamran, and H. Sarbazi-Azad. “Cache Replacement Policy Based on Expected Hit Count”. In: International Workshop on Cache Replacement Championship (CRC), co-located with ISCA. 2017.

[17] J. Wang, L. Zhang, R. Panda, and L. John. “Less is More: Leveraging Belady’s Algorithm with Demand-based Learning”. In: International Workshop on Cache Replacement Championship (CRC), co-located with ISCA. 2017.

[18] V. Young, C. Chou, A. Jaleel, and M. K. Qureshi. “SHiP++: Enhancing Signature-Based Hit Predictor for Improved Cache Performance”. In: International Workshop on Cache Replacement Championship (CRC), co-located with ISCA. 2017.

[19] Y. Zhang, V. Kiriansky, C. Mendis, S. Amarasinghe, and M. Zaharia. “Making Caches Work for Graph Analytics”. In: International Conference on Big Data (Big Data). 2017.

[20] S. Ainsworth and T. M. Jones. “Graph Prefetching Using Data Structure Knowledge”. In: International Conference on Supercomputing (ICS). 2016.

[21] J. Arai, H. Shiokawa, T. Yamamuro, M. Onizuka, and S. Iwamura. “Rabbit Order: Just-in-Time Parallel Reordering for Fast Graph Analysis”. In: International Parallel and Distributed Processing Symposium (IPDPS). 2016.

[22] P. Faldu and B. Grot. “LLC Dead Block Prediction Considered Not Useful”. In: International Workshop on Duplicating, Deconstructing and Debunking (WDDD), co-located with ISCA. 2016.

[23] Friendster network dataset – KONECT. http://konect.uni-koblenz.de/networks/friendster. The Koblenz Network Collection, 2016.

[24] T. J. Ham, L. Wu, N. Sundaram, N. Satish, and M. Martonosi. “Graphicionado: A High-Performance and Energy-Efficient Accelerator for Graph Analytics”. In: International Symposium on Microarchitecture (MICRO). 2016.

[25] Intel Xeon Processor E5-2630 v4. https://ark.intel.com/products/92981/Intel-Xeon-Processor-E5-2630-v4-25M-Cache-2_20-GHz. Intel Corporation, 2016.

[26] A. Jain and C. Lin. “Back to the Future: Leveraging Belady’s Algorithm for Improved Cache Replacement”. In: International Symposium on Computer Architecture (ISCA). 2016.

[27] A. Mukkara, N. Beckmann, and D. Sanchez. “Whirlpool: Improving Dynamic Cache Management with Static Data Classification”. In: International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS). 2016.

[28] E. Teran, Y. Tian, Z. Wang, and D. A. Jiménez. “Minimal Disturbance Placement and Promotion”. In: International Symposium on High Performance Computer Architecture (HPCA). 2016.

[29] E. Teran, Z. Wang, and D. A. Jiménez. “Perceptron Learning for Reuse Prediction”. In: International Symposium on Microarchitecture (MICRO). 2016.

[30] H. Wei, J. X. Yu, C. Lu, and X. Lin. “Speedup Graph Processing by Graph Ordering”. In: International Conference on Management of Data (SIGMOD). 2016.

[31] S. Beamer, K. Asanovic, and D. Patterson. “Locality Exists in Graph Processing: Workload Characterization on an Ivy Bridge Server”. In: International Symposium on Workload Characterization (IISWC). 2015.

[32] S. Beamer, K. Asanovic, and D. A. Patterson. “The GAP Benchmark Suite”. In: CoRR (2015). http://arxiv.org/abs/1508.03619.

[33] S. Hong, S. Depner, T. Manhardt, J. V. D. Lugt, M. Verstraaten, and H. Chafi. “PGX.D: A Fast Distributed Graph Processing Engine”. In: International Conference for High Performance Computing, Networking, Storage and Analysis (SC). 2015.

[34] P. Macko, V. J. Marathe, D. W. Margo, and M. I. Seltzer. “LLAMA: Efficient Graph Analytics Using Large Multiversioned Arrays”. In: International Conference on Data Engineering (ICDE). 2015.

[35] N. Sundaram, N. Satish, M. M. A. Patwary, S. R. Dulloor, M. J. Anderson, S. G. Vadlamudi, D. Das, and P. Dubey. “GraphMat: High Performance Graph Analytics Made Productive”. In: The VLDB Endowment (2015).

[36] X. Yu, C. J. Hughes, N. Satish, and S. Devadas. “IMP: Indirect Memory Prefetcher”. In: International Symposium on Microarchitecture (MICRO). 2015.

[37] T. E. Carlson, W. Heirman, S. Eyerman, I. Hur, and L. Eeckhout. “An Evaluation of High-Level Mechanistic Core Models”. In: ACM Transactions on Architecture and Code Optimization (TACO) (2014).

[38] J. Leskovec and A. Krevl. SNAP Datasets: Stanford Large Network Dataset Collection. http://snap.stanford.edu/data. 2014.

[39] X. Tong and A. Moshovos. “BarTLB: Barren Page Resistant TLB for Managed Runtime Languages”. In: International Conference on Computer Design (ICCD). 2014.

[40] J. Brock, X. Gu, B. Bao, and C. Ding. “Pacman: Program-assisted Cache Management”. In: International Symposium on Memory Management (ISMM). 2013.

[41] D. A. Jiménez. “Insertion and Promotion for Tree-based PseudoLRU Last-Level Caches”. In: International Symposium on Microarchitecture (MICRO). 2013.

[42] D. Nguyen, A. Lenharth, and K. Pingali. “A Lightweight Infrastructure for Graph Analytics”. In: International Symposium on Operating Systems Principles (SOSP). 2013.

[43] J. Shun and G. E. Blelloch. “Ligra: A Lightweight Graph Processing Framework for Shared Memory”. In: International Symposium on Principles and Practice of Parallel Programming (PPoPP). 2013.

[44] D. Ediger, R. McColl, J. Riedy, and D. A. Bader. “STINGER: High Performance Data Structure for Streaming Graphs”. In: International Conference on High Performance Extreme Computing (HPEC). 2012.

[45] J. E. Gonzalez, Y. Low, H. Gu, D. Bickson, and C. Guestrin. “PowerGraph: Distributed Graph-parallel Computation on Natural Graphs”. In: International Conference on Operating Systems Design and Implementation (OSDI). 2012.

[46] A. Kyrola, G. Blelloch, and C. Guestrin. “GraphChi: Large-scale Graph Computation on Just a PC”. In: International Conference on Operating Systems Design and Implementation (OSDI). 2012.

[47] I. Stanton and G. Kliot. “Streaming Graph Partitioning for Large Distributed Graphs”. In: International Conference on Knowledge Discovery and Data Mining (KDD). 2012.

[48] U. Kang and C. Faloutsos. “Beyond ‘Caveman Communities’: Hubs and Spokes for Graph Compression and Mining”. In: International Conference on Data Mining (ICDM). 2011.

[49] C.-J. Wu, A. Jaleel, W. Hasenplaugh, M. Martonosi, S. C. Steely Jr., and J. Emer. “SHiP: Signature-based Hit Predictor for High Performance Caching”. In: International Symposium on Microarchitecture (MICRO). 2011.

[50] A. R. Alameldeen, A. Jaleel, M. K. Qureshi, and J. Emer. JILP Workshop on Cache Replacement Championship (CRC). http://www.jilp.org/jwac-1. 2010.

[51] H. Gao and C. Wilkerson. “A Dueling Segmented LRU Replacement Algorithm with Adaptive Bypassing”. In: JILP Workshop on Computer Architecture Competitions: Cache Replacement Championship (CRC). 2010.

[52] A. Jaleel, K. B. Theobald, S. C. Steely Jr., and J. Emer. “High Performance Cache Replacement Using Re-reference Interval Prediction (RRIP)”. In: International Symposium on Computer Architecture (ISCA). 2010.

[53] S. M. Khan, Y. Tian, and D. A. Jiménez. “Sampling Dead Block Prediction for Last-Level Caches”. In: International Symposium on Microarchitecture (MICRO). 2010.

[54] H. Kwak, C. Lee, H. Park, and S. Moon. “What is Twitter, a Social Network or a News Media?” In: International Conference on World Wide Web (WWW). 2010.

[55] Y. Low, J. Gonzalez, A. Kyrola, D. Bickson, C. Guestrin, and J. M. Hellerstein. “GraphLab: A New Framework For Parallel Machine Learning”. In: The Conference on Uncertainty in Artificial Intelligence (UAI). 2010.

[56] C. Magnien, M. Latapy, and M. Habib. “Fast Computation of Empirically Tight Bounds for the Diameter of Massive Graphs”. In: Journal of Experimental Algorithmics (2009).

[57] Y. Xie and G. H. Loh. “PIPP: Promotion/Insertion Pseudo-partitioning of Multi-core Shared Caches”. In: International Symposium on Computer Architecture (ISCA). 2009.

[58] A. Jaleel, W. Hasenplaugh, M. K. Qureshi, J. Sebot, S. Steely Jr., and J. Emer. “Adaptive Insertion Policies for Managing Shared Caches”. In: International Conference on Parallel Architectures and Compilation Techniques (PACT). 2008.

[59] M. Kharbutli and Y. Solihin. “Counter-Based Cache Replacement and Bypassing Algorithms”. In: IEEE Transactions on Computers (2008).

[60] M. K. Qureshi, A. Jaleel, Y. N. Patt, S. C. Steely, and J. Emer. “Adaptive Insertion Policies for High Performance Caching”. In: International Symposium on Computer Architecture (ISCA). 2007.

[61] K. Beyls and E. H. D’Hollander. “Generating Cache Hints for Improved Program Efficiency”. In: Journal of Systems Architecture (2005).

[62] D. Chakrabarti, Y. Zhan, and C. Faloutsos. “R-MAT: A Recursive Model for Graph Mining”. In: SIAM International Conference on Data Mining (SDM). 2004.

[63] Z. Wang, K. S. McKinley, A. L. Rosenberg, and C. C. Weems. “Using the Compiler to Improve Cache Replacement Decisions”. In: International Conference on Parallel Architectures and Compilation Techniques (PACT). 2002.

[64] P. Jain, S. Devadas, D. Engels, and L. Rudolph. “Software-assisted Cache Replacement Mechanisms for Embedded Systems”. In: International Conference on Computer Aided Design (ICCAD). 2001.

[65] A.-L. Barabási and R. Albert. “Emergence of Scaling in Random Networks”. In: Science (1999).

[66] M. Faloutsos, P. Faloutsos, and C. Faloutsos. “On Power-law Relationships of the Internet Topology”. In: The Conference on Applications, Technologies, Architectures, and Protocols for Computer Communication (SIGCOMM). 1999.

[67] A. R. Lebeck, D. R. Raymond, C.-L. Yang, and M. Thottethodi. “Annotated Memory References: A Mechanism for Informed Cache Management”. In: International Euro-Par Conference on Parallel Processing (Euro-Par). 1999.

[68] L. Page, S. Brin, R. Motwani, and T. Winograd. The PageRank Citation Ranking: Bringing Order to the Web. Technical Report 1999-66. http://ilpubs.stanford.edu:8090/422. Stanford InfoLab, 1999.

[69] G. Karypis and V. Kumar. “Multilevel k-way Partitioning Scheme for Irregular Graphs”. In: Journal of Parallel and Distributed Computing (1998).

[70] J. Banerjee, W. Kim, S. Kim, and J. F. Garza. “Clustering a DAG for CAD Databases”. In: IEEE Transactions on Software Engineering (1988).

[71] E. Cuthill and J. McKee. “Reducing the Bandwidth of Sparse Symmetric Matrices”. In: The National Conference. 1969.

[72] L. A. Belady. “A Study of Replacement Algorithms for a Virtual-storage Computer”. In: IBM Systems Journal (1966).
