
A Case for Unlimited Watchpoints

Joseph L. Greathouse†  Hongyi Xin‡∗  Yixin Luo†§  Todd Austin†

[email protected]  [email protected]  [email protected]  [email protected]

†Advanced Computer Architecture Laboratory, University of Michigan, Ann Arbor, MI, USA
‡SAFARI Research Group, Carnegie Mellon University, Pittsburgh, PA, USA
§UM-SJTU Joint Institute, Shanghai Jiao Tong University, Shanghai, China

Abstract

Numerous tools have been proposed to help developers fix software errors and inefficiencies. Widely-used techniques such as memory checking suffer from overheads that limit their use to pre-deployment testing, while more advanced systems have such severe performance impacts that they may require special-purpose hardware. Previous works have described hardware that can accelerate individual analyses, but such specialization stymies adoption; generalized mechanisms are more likely to be added to commercial processors.

This paper demonstrates that the ability to set an unlimited number of fine-grain data watchpoints can reduce the runtime overheads of numerous dynamic software analysis techniques. We detail the watchpoint capabilities required to accelerate these analyses while remaining general enough to be useful in the future. We describe a hardware design that stores watchpoints in main memory and utilizes two different on-chip caches to accelerate performance. The first is a bitmap lookaside buffer that stores fine-grained watchpoints, while the second is a range cache that can efficiently hold large contiguous regions of watchpoints. As an example of the power of such a system, it is possible to use watchpoints to accelerate read/write set checks in a software data race detector by nearly 9×.

Categories and Subject Descriptors B.3.2 [Memory Structures]: Design Styles—cache memories; C.0 [General]: hardware/software interfaces; D.2.0 [Software Engineering]: General—protection mechanisms; D.2.5 [Software Engineering]: Testing and Debugging—debugging aids, testing tools

General Terms Design, Performance

Keywords Watchpoints, Data Race Detection, Deterministic Concurrent Execution, Taint Analysis, Demand-Driven Analysis

1. Introduction

Billions of dollars and millions of man-hours are spent each year attempting to build correct programs. As Bessey et al.

∗ Author was at the University of Michigan and Shanghai Jiao Tong University while working on this project.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.

ASPLOS’12, March 3–7, 2012, London, England, UK. Copyright © 2012 ACM 978-1-4503-0759-8/12/03...$10.00

stated, “Assuming you have a reasonable [software analysis] tool, if you run it over a large, previously unchecked system, you will always find bugs” [7]. The fact that developers looking to increase performance on multiprocessors must explicitly utilize concurrency only adds to this problem.

Numerous tools exist to help developers make more robust software. Valgrind’s MemCheck, for instance, checks dynamic memory operations for errors such as memory leaks [37]. While such dynamic tools are powerful, they suffer from large computational overheads that limit their adoption to small groups of developers and dedicated testers. MemCheck can cause the original program to run 30× slower, while more heavyweight tools, such as Larson and Austin’s symbolic execution engine, see slowdowns of over 200× [24]. More insidiously, because dynamic analyses can only observe bugs on executed paths, the power of these tools is further limited by slowdowns that reduce the number of situations observable in a reasonable amount of time.

Application-specific hardware is one way of solving this problem. Such mechanisms are custom-designed to accelerate individual analyses, but none offers a comprehensive way of accelerating many tools. As such, they are unlikely to meet the stringent requirements needed to be integrated into a modern commercial microprocessor; their benefits are too narrowly defined compared to their high area, design, and verification costs.

Commercial processors tend to favor generic solutions, where costs can be amortized across multiple uses. As an example, the performance counters available on modern microprocessors are used to find performance-degrading hotspots in software [39], but they are also used during hardware bring-up to identify correctness issues [48]. Similarly, while Sun added hardware into their prototype Rock processor in order to support transactional memory [10], the same mechanisms were also used to support runahead execution and speculative execution [11].

1.1 Contributions of This Paper

We theorize that a general acceleration mechanism should alert programmers when specified locations in memory are being accessed. In other words, having the ability to set a large number of data watchpoints (WPs) would benefit a wide range of analyses.

As the original program executes, many runtime analysis systems also perform actions on shadow values. In taint analysis, for example, each location in memory has a shadow value that marks it as trusted or untrusted. Checking this meta-data, in order to decide if analyses should occur, is slow. As an example, we previously demonstrated that checking read/write sets using software routines was 3-10× slower than using hardware to do the same [19].
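To make the shadow-value cost concrete, the per-access metadata check in a software-only taint analysis looks roughly like the following sketch. The dict-based shadow map and hook name are our illustration, not the implementation of any particular tool; real systems use heavily optimized shadow-memory structures.

```python
# Minimal sketch of a software shadow-memory check, as performed by
# taint-analysis tools on EVERY load of the original program. The
# shadow map and hook name here are hypothetical.

TAINTED, UNTAINTED = 1, 0
shadow = {}  # virtual address -> taint bit (the "shadow value")

def on_load(addr, size):
    """Instrumentation hook: runs before each program load.
    Checking metadata for every accessed byte is what makes
    software-only analysis slow; watchpoint hardware would let
    unwatched accesses skip this check entirely."""
    return any(shadow.get(addr + i, UNTAINTED) == TAINTED
               for i in range(size))

shadow[0x1000] = TAINTED           # e.g., byte written from untrusted input
assert on_load(0x1000, 4) is True  # access touches tainted data -> analyze
assert on_load(0x2000, 4) is False # clean access still pays the check cost
```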

[Figure 1 is a qualitative plot of watchpoint granularity versus number of watchpoints, placing debug registers, virtual memory watchpoints, fine-grained memory protection, and Mondriaan-like systems around our proposal.]

Figure 1: Existing Watchpoint Systems Are Inadequate. WP registers are too few in number. VM’s granularity is too coarse. Mondriaan-like systems cannot quickly change many WPs, and other systems are often only useful for small regions.

This paper makes a case for the hardware-supported ability to set a virtually unlimited number of fine-grained data watchpoints. We will show that this mechanism can accelerate numerous software tools and is more general than the application-specific hardware often touted in the literature.

This is an improvement over existing memory-watching mechanisms, as qualitatively illustrated in Figure 1. Hardware watchpoint registers are too limited in number for the advanced systems we wish to accelerate. Virtual memory watchpoints, which are not constrained by hardware resources, are limited by the coarse granularity of pages. Finally, a number of proposals for fine-grained hardware memory protection and tagged memory systems exist in the literature, but they too are not general enough.

We present a mechanism that avoids these limitations. Our hardware/software hybrid watchpoint system stores per-thread virtual address watchpoints in main memory, avoiding hardware limitations on the total number of watchpoints. It makes use of two on-chip caches to hold these watchpoints closer to the pipeline. The first is a modified version of the range cache proposed by Tiwari et al. [40], which efficiently encodes continuous regions of watched memory. The second is a bitmapped lookaside cache, which increases the number of cached WP ranges when they are small.

By combining these mechanisms with the abilities to take fast WP faults and set WPs with user-level instructions, we can greatly accelerate tools that work on shadow data while remaining general enough to be useful for other memory-watching tasks. Our simulated results show, for instance, that a data race detector built using our technique can check read/write sets up to 9× faster than one built entirely with binary instrumentation and 3× faster than one using other fine-grained memory protection systems.

This paper presents the following novel contributions:

• We design hardware that allows software to set a virtually unlimited number of byte-accurate watchpoints.
• We study numerous dynamic software analysis tools and show how they can utilize watchpoints to run more accurately and much faster.
• We demonstrate that this design performs better on a wide range of tools and applications than other state-of-the-art memory monitoring technologies.

We detail the design of our watchpoint hardware in Section 2 and discuss software systems that could be built with it in Section 3. We compare our system to previous fine-grain memory protection works while running a variety of software analysis tasks in Section 4. Finally, we review other related works in Section 5 and conclude in Section 6.

Table 1: Hardware Watchpoint Support in Modern ISAs. Most ISAs support a small number of hardware-assisted watchpoints. While they reduce debugger overheads, their small numbers and reach are usually inadequate for more complex tools.

ISA        #    Known As                   Possible Size
x86 [-64]  4    Debug Register             1, 2, 4, [8] bytes
ARMv7      16   Debug Register             1-8 bytes, or up to 2GB using low-order masking
ePOWER     2    Data Address Compare       1 byte, or a 64-bit address with any bit masked, or a range up to 2^64 bytes
Itanium    4    Data Breakpoint Register   1 byte to 64PB using low-order masking
MIPS       8    WatchLo/Hi                 8 bytes, naturally aligned
POWER      1    DABR                       1-8 bytes
SPARC      2    Watchpoint Reg.            1-8 bytes
z/Arch     1    PER                        RO range up to 2^64 bytes

2. Fast Unlimited Watchpoints

This section describes a system that allows software to set a virtually unlimited number of byte-accurate watchpoints. We first review existing WP hardware and list what properties are needed to effectively accelerate a variety of software. We then present a design that meets these needs.

2.1 Existing HW Watchpoint Support

Watchpoints, also known as data breakpoints, are debugging mechanisms that allow a developer to demarcate memory regions and take interrupts whenever they are accessed [22, 27]. Using software to check each memory access can cause slowdowns, so most processors include some form of hardware support for watchpoints.

As Wahbe discussed, existing support can be broken down into specialized hardware watchpoint mechanisms and virtual memory [43]. The first commonly takes the form of watchpoint registers that hold individual addresses and raise exceptions when these addresses are touched. Unfortunately, as Table 1 shows, no modern ISA offers more than a small handful of these registers. This makes them difficult (if not impossible) to use for many analyses [14].

The second method marks pages containing watched data as unavailable or read-only in the virtual memory system. The kernel then checks the offending address against a list of watchpoints during each page fault. Though this system has been implemented on existing processors [4, 36], it has a number of restrictions that limit its usefulness.

Individual threads within a process cannot easily have different VM watchpoints. Additionally, the large size of pages reduces their effectiveness. Faults taken when accessing unwatched data on pages that also contain watched data can result in unacceptable performance overheads; we measured pathological cases on x86 Solaris that showed slowdowns of over 10,000×. Ho et al. also observed this problem and claimed that they “anticipate that using [finer-granularity] techniques would greatly improve performance” [21].

ECC memory can be used to set watchpoints at a finer granularity [32, 34]. By setting a value in memory with ECC enabled, then disabling ECC, writing a scrambled version of the data into the same location, and finally re-enabling error correction, it is possible to take a fault whenever a watched memory line is accessed. However, changing watchpoints, as well as taking watchpoint faults, is extremely slow in such a system. We do not explore this further.

In all, existing hardware is inadequate to support the varied needs of the wide range of dynamic analysis tools.

2.2 Unlimited Watchpoint Requirements

While this tells us what current systems do not offer, we must still answer the question of what a WP system should offer. Section 3 will detail watchpoint-based algorithms for numerous dynamic analysis systems, but in the vein of Appel and Li’s paper on virtual memory primitives [2], we first list the properties which should be made available:

• Large number: Some systems watch gigabytes of data to observe program behavior.
• Byte/word granularity: Many tools use watchpoints at a very fine granularity to reduce false faults.
• Fast fault handler: Some applications take many faults, so this would greatly increase their performance.
• Fast watchpoint changes: Numerous tools frequently change WPs in response to the program’s actions.
• Per-thread: Separate watchpoints on threads within a single process would allow tools to use watchpoints in parallel programs without taking false faults.
• Set ranges: Many tools require the ability to watch large ranges of addresses without needing to mark every component byte individually.
• Break ranges: It is also important to be able to quickly remove sections in the middle of ranges without rewriting every byte’s WP. This is often used to carve out a working set of unwatched data.
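A software model of these properties can help fix intuition. The sketch below, with function names and range representation of our own invention (the paper's actual interface is an ISA, not a library), shows per-thread tables with set-range and break-range operations at byte granularity:

```python
# Illustrative model of the required watchpoint properties: per-thread
# tables ("Per-thread"), byte granularity, and whole-range set/break
# operations ("Set ranges"/"Break ranges"). Names are hypothetical.

class WatchpointTable:
    """One table per thread; ranges are {(start, end): (r, w)}."""
    def __init__(self):
        self.ranges = {}

    def set_range(self, start, end, r=True, w=True):
        # "Set ranges": watch [start, end] with a single entry, any size.
        self.ranges[(start, end)] = (r, w)

    def break_range(self, start, end, hole_lo, hole_hi):
        # "Break ranges": carve an unwatched hole out of an existing
        # range without rewriting every byte's WP.
        r, w = self.ranges.pop((start, end))
        if hole_lo > start:
            self.ranges[(start, hole_lo - 1)] = (r, w)
        if hole_hi < end:
            self.ranges[(hole_hi + 1, end)] = (r, w)

    def is_watched(self, addr, is_write):
        for (lo, hi), (r, w) in self.ranges.items():
            if lo <= addr <= hi:
                return w if is_write else r
        return False

t = WatchpointTable()
t.set_range(0x1000, 0x1FFF)                     # watch a 4KB region
t.break_range(0x1000, 0x1FFF, 0x1800, 0x18FF)   # carve out a working set
assert not t.is_watched(0x1850, is_write=True)  # inside the hole
assert t.is_watched(0x1700, is_write=False)     # still watched
```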

2.3 Efficient Watchpoint Hardware

This section describes a hardware mechanism that operates in parallel with the virtual memory system in order to deliver on these requirements. Each watchpoint is defined by two bits that indicate whether it is read- and/or write-watched. To allow a large number of these watchpoints, the full list of watched addresses is stored in main memory. Accessing memory for each check would be prohibitively slow, so virtual addresses are first sent from the address generation unit (AGU) to an on-chip WP cache that is accessed in parallel to the data translation lookaside buffer (DTLB), as shown in Figure 2.

To deliver these watchpoints at byte granularity, the cache compares each virtual address that the instruction touches and outputs a logical OR of their watched statuses. This check need not complete until the instruction attempts to commit, and so it may be pipelined.
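The commit-time check just described can be sketched as follows. The 2-bit per-byte statuses and the OR across touched bytes come from the text; the bit-encoding constants are our choice:

```python
# Sketch of the byte-granularity fault check: each byte has a 2-bit
# status (read-watched, write-watched); an access faults if the logical
# OR of the statuses of all touched bytes matches the access type.

R_WATCHED, W_WATCHED = 0b01, 0b10  # illustrative encoding

def access_faults(wp_bits, addr, size, is_write):
    """wp_bits: per-byte 2-bit statuses (as delivered by the WP cache)."""
    ored = 0
    for b in range(addr, addr + size):
        ored |= wp_bits.get(b, 0)   # OR of all touched bytes' statuses
    return bool(ored & (W_WATCHED if is_write else R_WATCHED))

bits = {0x103: R_WATCHED}            # a single read-watched byte
assert access_faults(bits, 0x100, 4, is_write=False)      # 4B load overlaps it
assert not access_faults(bits, 0x100, 4, is_write=True)   # store: W bit unset
```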

If the WP unit indicates a fault, a precise exception is raised upon attempting to commit the offending instruction. This could either be a user-level fault, which is treated as a mispredicted branch, or a kernel fault. We assume the former in order to provide a fast fault handler. Using such handlers for kernel-controlled watchpoints, or cross-process watchpoints, is beyond the scope of this paper.

In order to yield fast watchpoint changes, it is important that the on-chip caches be able to hold dirty data and only write back to main memory on dirty evictions. This, and the fact that the watchpoints are stored as virtual addresses, means that watchpoints can be changed with simple user-level instructions instead of system calls.

In order to support per-thread (rather than per-core) watchpoints, any dirty state in the WP cache will necessarily be part of process state. This can be switched lazily, only being saved if another process utilizing watchpoints is loaded, to increase performance. Additionally, each core will require a thread ID register so that individual threads can be targeted with watchpoint changes.

[Figure 2 is a pipeline diagram: the AGU feeds the L1D/DTLB and, in parallel, the watchpoint cache (range cache plus WLB), which is backed by WP storage in main memory and outputs a watched/not-watched decision alongside the physical address.]

Figure 2: Watchpoint Unit in the Pipeline. Because watchpoints are set on virtual addresses, the WP system is accessed in parallel with the DTLB. This also allows WPs to be set with user-level instructions while not affecting the CPU’s critical path.

Watchpoints are stored on-chip in three different forms. They are first stored in a range cache (RC), which holds the start and end addresses of each cached watchpoint and status values that hold each range’s 2-bit watchpoint. This system, which is further explained in Section 2.3.1, helps support setting and breaking ranges.

Small ranges can negatively impact the reach, or the number of addresses covered by all entries, of the RC. In this case, the WPs within a contiguous region of memory can be held in a single bitmap stored in main memory, rather than as multiple small ranges. The RC then stores the boundary addresses for the entire bitmapped region, and the status bits associated with that entry will point to the base of the bitmap. Accessing this bitmap would normally require a load from main memory, so we also include a Watchpoint Lookaside Buffer (WLB) that is searched in parallel with the RC. This is detailed further in Section 2.3.2.

Finally, because creating bitmaps can be slow (requiring kilobytes of data to be written to main memory), it can be useful to store small bitmaps directly on chip. By using the storage that normally holds a pointer to a bitmap, it is possible to also make an On-Chip Bitmap (OCBM) that stores the watchpoints for a region of memory that is larger than a single byte but smaller than a main memory bitmap. This is described in Section 2.3.3.

2.3.1 Range Cache

The first mechanism for storing watchpoints within the core is a modified version of the range cache proposed by Tiwari et al. [40]. They made the observation that “many dataflow tracking applications exhibit very significant range locality, where long blocks of memory addresses all store the same tag value,” and we have found this statement even more accurate when dealing with 2-bit watchpoints rather than many-bit tags. This cache, shown as part of Figure 3, stores the boundary addresses for numerous ranges within the virtual memory space of a thread as well as the 2-bit watchpoint status associated with each of these regions.

Virtual addresses are sent to the RC in parallel with DTLB lookups, and the boundary addresses of each access are compared with those of the cached ranges. When any address that a memory instruction is attempting to access overlaps with a range stored in the RC, the hardware checks that range’s watchpoint bits. If an overlapping range is marked as watched for this type of access, the instruction is set to cause a watchpoint fault when it commits. This will cause execution to jump to a software fault handler.
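A behavioral model of this RC lookup might look like the following (the entry tuple format is our invention; for simplicity the model returns the first overlapping range's status rather than OR-ing across multiple overlaps):

```python
# Behavioral sketch of a range-cache lookup: compare an access's start
# and end addresses against each cached range's boundaries, and check
# the 2-bit watchpoint status of an overlapping range.

def rc_lookup(entries, acc_start, acc_end, is_write):
    """entries: list of (range_start, range_end, r_watched, w_watched).
    Returns (hit, fault); hit=False means an RC miss, which would
    fault to the backing-store handler."""
    for lo, hi, r, w in entries:
        if acc_start <= hi and acc_end >= lo:   # the boundary comparators
            return True, (w if is_write else r)
    return False, False

rc = [(0x4000, 0x4FFF, True, True),    # a watched range
      (0x5000, 0x7FFF, False, False)]  # an explicitly unwatched range
assert rc_lookup(rc, 0x4FF0, 0x4FF7, is_write=True) == (True, True)
assert rc_lookup(rc, 0x6000, 0x6003, is_write=False) == (True, False)
assert rc_lookup(rc, 0x9000, 0x9003, is_write=False)[0] is False  # RC miss
```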

If there are no watchpoints set on a region of memory, the RC will hold a region with R- and W-watched bits both ‘0’. This means that if the range cache misses on any lookup, it should attempt to retrieve that range from main memory.

[Figure 3 is a block diagram: input virtual addresses feed, in parallel, the WP lookaside buffer (address tags with 64B of WP status per line) and the range cache (start/end address comparators per entry feeding an encoder). A cached status entry is either a uniform watchpoint (R-watched/W-watched bits), an on-chip bitmap, or a pointer to an off-chip bitmap; main memory holds the full set of ranges and bitmaps.]

Figure 3: Unlimited Watchpoint Architecture. This system uses a combination of a range cache, center, with bitmaps and a lookaside buffer, left, to accelerate accesses to watchpoints that are stored into main memory, right. (1) On a RC access miss, a software handler loads new ranges from main memory. A hit causes the watchpoint system to check the associated status entry. (2) This range could be a uniform watchpoint, an off-chip bitmap, or an on-chip bitmap. In the case of an off-chip bitmap, the output of the lookaside buffer, which is accessed in parallel to the RC, is consulted. (3) If the WLB misses, the pointer from the RC status entry is used by the hardware to load a line in from the main memory bitmap.

The original RC design brought in 64 byte chunks from a two-level trie whenever it missed. We found this method inefficient for our tools, as it required a large number of writes to save non-aligned ranges. Instead, our RC causes a fault to a software handler on a miss. This handler loads an entire range from the storage system in main memory. We implemented a backing store handler that keeps a balanced tree of non-overlapping ranges, which was modeled after the watchpoint data structure in the OpenSolaris kernel.
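The backing store can be modeled with a sorted list of non-overlapping ranges; here we use Python's `bisect` as a stand-in for the kernel-style balanced tree (insertion assumes ranges never overlap, matching the structure's invariant):

```python
import bisect

# Model of the backing store: non-overlapping (start, end, status)
# ranges kept sorted by start address, standing in for the balanced
# tree modeled after the OpenSolaris watchpoint structure.

class BackingStore:
    def __init__(self):
        self.starts, self.ranges = [], []

    def insert(self, start, end, status):
        # Assumes the caller never inserts an overlapping range.
        i = bisect.bisect_left(self.starts, start)
        self.starts.insert(i, start)
        self.ranges.insert(i, (start, end, status))

    def lookup(self, addr):
        """Find the whole range containing addr (the RC miss handler
        loads this entire range into the range cache)."""
        i = bisect.bisect_right(self.starts, addr) - 1
        if i >= 0:
            start, end, status = self.ranges[i]
            if start <= addr <= end:
                return (start, end, status)
        return None

bs = BackingStore()
bs.insert(0x0000, 0x0FFF, 0b00)   # unwatched range
bs.insert(0x1000, 0x3FFF, 0b11)   # R+W watched range
assert bs.lookup(0x2000) == (0x1000, 0x3FFF, 0b11)
assert bs.lookup(0x4000) is None  # gap: no range covers this address
```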

Programs set watchpoints on their own memory space using range-based instructions, which are described in Section 2.3.4. These instructions can set or remove ranges by directly inserting the new watchpoint tag into the range cache, which uses a dirty-bit write-back policy to avoid taking a fault on every WP change. The RC employs a pseudo-LRU replacement policy, which keeps track of the most-likely candidate for eviction. If the range cache overflows and the LRU entry is dirty, the cache will cause a user-level fault that reroutes execution to the software backing store handler.

Updating a watchpoint range is more complicated than setting or removing, as it may require loading in ranges from the backing store to find the value that is to be modified.

2.3.2 Bitmap and Watchpoint Lookaside Buffer

The RC is designed to quickly handle large ranges of watchpoints, and its write-back policy reduces the number of writes to main memory. However, its reach can be limited if it contains many short ranges. In the worst case, a 128-entry RC, which takes up an area roughly equal to 4KB of L1D, may only have a reach of 128 bytes. We utilize a second hardware structure to handle these cases.

Bitmaps are a compact way of storing small watchpoint regions, because they only take 2 bits per byte within the region, as demonstrated in Figure 4. Instead of storing the start and end address for each small range, we therefore choose to sometimes store into the range cache the first and last address of a large region whose small watchpoints are contained in a bitmap in memory. This increases the required amount of status storage in the range cache, which originally only held the read-watched and write-watched bits. It instead requires 32 or 64 bits of storage to hold the pointer to the bitmap and one bit to denote whether an entry is a bitmap pointer or a range. The two watched bits can be mapped onto the pointer bits to save space.
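The 2-bits-per-byte cost can be sketched directly; the packing of four bytes' worth of status per bitmap byte is our choice (the paper specifies only the 2-bit-per-byte cost):

```python
# Sketch of a main-memory watchpoint bitmap: 2 status bits per watched
# byte (bit0 = R-watched, bit1 = W-watched), four bytes' worth of
# status packed into each bitmap byte.

def set_wp(bitmap, region_base, addr, status):
    off = addr - region_base
    byte, shift = divmod(off, 4)
    bitmap[byte] &= ~(0b11 << (shift * 2))         # clear old 2-bit status
    bitmap[byte] |= (status & 0b11) << (shift * 2)  # write new status

def get_wp(bitmap, region_base, addr):
    off = addr - region_base
    byte, shift = divmod(off, 4)
    return (bitmap[byte] >> (shift * 2)) & 0b11

bitmap = bytearray(128 // 4)            # 128-byte region -> 32 bytes of bitmap
set_wp(bitmap, 0x2000, 0x2005, 0b10)    # write-watch a single byte
assert get_wp(bitmap, 0x2000, 0x2005) == 0b10
assert get_wp(bitmap, 0x2000, 0x2006) == 0
assert len(bitmap) == 32                # matches Figure 4's 32-byte examples
```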

We found it difficult to design a hardware-based algorithm to decide when to change a collection of ranges into a bitmap. Doing so in an intelligent manner requires knowing how many watchpoints are contained within a particular area of memory. This knowledge may best be gathered by the backing store software, as it can see the entire state of a process’s watchpoints. We therefore leave it to the range cache miss handler to decide when to toggle a region of memory between a bitmap and ranges. The algorithm modeled in this paper moves a naturally aligned 4KB region from a range to a bitmap if the number of internal ranges exceeds an upper threshold of 16. The dirty eviction handler changes a bitmap back to ranges if this number falls below 4. This is illustrated in Figure 5.
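The handler policy with its 16/4 thresholds reduces to a few lines; the function name is ours, and the gap between the two thresholds provides hysteresis so regions do not thrash between representations:

```python
# Model of the software handlers' policy: a naturally aligned 4KB
# region is promoted to a bitmap above 16 internal ranges (on an RC
# load miss) and demoted back to ranges below 4 (on a dirty eviction).

PROMOTE_THRESHOLD = 16   # ranges -> bitmap when exceeded
DEMOTE_THRESHOLD = 4     # bitmap -> ranges when fallen below

def storage_choice(currently_bitmap, num_internal_ranges):
    if not currently_bitmap and num_internal_ranges > PROMOTE_THRESHOLD:
        return "bitmap"
    if currently_bitmap and num_internal_ranges < DEMOTE_THRESHOLD:
        return "ranges"
    return "bitmap" if currently_bitmap else "ranges"

assert storage_choice(False, 20) == "bitmap"   # promote on load miss
assert storage_choice(True, 10) == "bitmap"    # hysteresis: stay bitmapped
assert storage_choice(True, 3) == "ranges"     # demote on dirty eviction
```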

While this mechanism increases the reach of the range cache, it could adversely affect performance if every access to a bitmapped range required checking main memory. This is especially true because WPs are set on virtual addresses, meaning that each access to the bitmap would wait on the TLB. In order to accelerate this process, our system includes a Watchpoint Lookaside Buffer (WLB). This cache is accessed in parallel to the RC, and any watchpoint status it returns is consulted if the RC indicates that this location is stored in a bitmap. On a miss in the WLB, a 64-byte line of WP bits is loaded by a hardware engine from the bitmap pointed to by the RC entry. Changes to the watchpoints in a bitmapped region cause a WLB eviction.

This WLB could be replaced with extended cache line bits, such as iWatcher uses to store its bitmapped watchpoints [50]. Sentry’s power-saving method of only storing unwatched lines in the L1 (and thus only checking the WLB on cache misses) may also be useful [38]. We leave a more in-depth analysis of these tradeoffs for future work, and focus on a system with a separate watchpoint cache, much like MemTracker uses [42].

If a bitmapped range is evicted from the RC to main memory, the modeled software handler stores the entire range as a single entry in the balanced tree, along with the pointer. It then brings the entire bitmap range back into the range cache on the next miss, though it may also need to update the bitmap to handle changes that occurred in the range cache but have not yet been evicted.

[Figure 4 compares ranges against a bitmap for three 128-byte watchpoint layouts: 1 on-chip entry = 8 bytes vs. 256 bits = 32 bytes; 128 entries = 1024 bytes vs. 256 bits = 32 bytes; 4 entries = 32 bytes vs. 256 bits = 32 bytes.]

Figure 4: Different Watchpoint Storage Methods. This shows three examples of watched ranges. In the first, a range cache can hold the entire region in a single on-chip entry. The second, however, would take 1KB of on-chip storage if it were held in a range cache, while a bitmap method would only take 32 bytes. The final entry shows a break-even point between the two methods.

[Figure 5 shows two flowcharts. (a) On an RC load miss: if the missed region is a bitmap region, load the bitmap entry into the RC; otherwise, if the naturally aligned 4KB region contains more than 16 ranges, transition it to a bitmap, else load the range into the RC. (b) On an RC dirty eviction: if the entry is not a bitmap range, store the range in the tree; if it is and it still contains more than 4 ranges, store the bitmapped range into the tree, else convert it back to ranges.]

Figure 5: Software Algorithm for Bitmapped Ranges. Because hardware has a myopic view of the global status of watchpoints, we rely on the software fault handlers to choose when a region of watchpoints should toggle between bitmap and range.

2.3.3 On-Chip Bitmap

There are also situations where a collection of small ranges within a large region prematurely displace useful data from the range cache. Though the backing store handler may later fix this by changing the region to a bitmapped one, we also devised a mechanism to better utilize the pointer bits in a range cache entry when it is not bitmapped. It is possible to store a small bitmap for a region ≤16 bytes (on a 32-bit machine) within these bits, allowing the RC to slightly increase its reach. This on-chip bitmap (OCBM) range is stored much like a normal range, with an arbitrary start and end address. However, unlike a uniformly tagged range, the watchpoint statuses of OCBM entries are stored as a bitmap in the range cache’s status tag bits, which normally hold either a pointer to a main memory bitmap or the 2-bit tag of a range. The low-order bits of the address are used to index into this bitmap, allowing a single range cache entry to hold up to 16 consecutive small ranges.

Unlike the bitmaps stored in main memory, the transition to OCBM requires little knowledge. Simply put, if a range becomes small enough to be held in a single OCBM entry, it is converted to one by the hardware. Returning to a uniform range will occur when the hardware detects that all of the watchpoints within the OCBM are equal. If an OCBM is chosen to be written back to memory, our backing store handler will store it as individual ranges. If two OCBMs cover adjacent ranges, the hardware will merge them only if their total size would still result in an OCBM.
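To make the encoding concrete, the OCBM can be modeled in a few lines of Python. This is a behavioral sketch, not the authors' hardware: the class and method names are our own, and it assumes the 32 reused tag bits hold sixteen 2-bit per-byte statuses indexed by the low-order address offset.

```python
# Behavioral sketch of an on-chip bitmap (OCBM) range cache entry.
# Assumption: the 32-bit pointer/tag field is reused to hold sixteen
# 2-bit watchpoint statuses (none=0b00, R=0b01, W=0b10, RW=0b11).

R, W = 0b01, 0b10

class OCBMEntry:
    MAX_BYTES = 16  # 32 tag bits / 2 status bits per byte

    def __init__(self, start, end):
        assert end - start + 1 <= self.MAX_BYTES
        self.start, self.end = start, end
        self.bits = 0  # the reused 32-bit status tag field

    def set_status(self, addr, status):
        # Low-order address bits select a 2-bit slot in the tag field.
        shift = 2 * (addr - self.start)
        self.bits = (self.bits & ~(0b11 << shift)) | (status << shift)

    def status(self, addr):
        return (self.bits >> (2 * (addr - self.start))) & 0b11

    def is_uniform(self):
        # When every byte has the same status, hardware can convert the
        # entry back to an ordinary uniformly tagged range.
        first = self.status(self.start)
        return all(self.status(a) == first
                   for a in range(self.start, self.end + 1))

e = OCBMEntry(0x1000, 0x100F)
e.set_status(0x1003, R | W)   # RW-watch a single byte
assert e.status(0x1003) == 0b11 and e.status(0x1004) == 0
```

The `is_uniform()` check corresponds to the hardware's rule for collapsing an OCBM back into a uniform range.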

2.3.4 ISA Changes

Interfacing with this watchpoint system requires modifications to the ISA. For instance, we must add instructions that can set or remove ranges from the RC, as well as instructions that allow the backing store handler to directly talk to the RC hardware. Table 2 lists the instructions that must be added and describes what each does.

The most complicated instruction semantics relate to modifying watchpoints in multi-threaded programs. The WLB, for example, must be able to respond to shootdown requests in order to remain synchronized across multiple threads. Most importantly, instructions must exist to add, remove, and update ranges of watchpoints for sibling processors. To do this quickly, our system must send these requests

without going through the OS. One method of doing this involves broadcasting to all cores a process ID (e.g., the CR3 register in x86) and thread ID along with the request to perform a global update. Each core can then update its own cache entry if it matches the target process and thread ID. This instruction must be a memory fence, however, to maintain consistency between watchpoints and normal requests to watched memory locations. After the remote threads update their watchpoint cache, they must also send back acknowledgments. If any target thread ID does not return an acknowledgment within some timeout period, it may not be running, and its watchpoints may be saved in memory. The source processor must then raise an interrupt and allow the OS to update the unscheduled process's watchpoints.
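The broadcast-and-acknowledge protocol above can be sketched as follows. This is an illustrative model only: the `Core` class, the update encoding, and the OS fallback function are hypothetical names, and the hardware timeout is collapsed into a simple "no core matched" check.

```python
# Sketch of the inter-core watchpoint update broadcast described above.
# Names are illustrative; the real mechanism is an ISA instruction that
# acts as a memory fence and collects hardware acknowledgments.

class Core:
    def __init__(self, pid, tid):
        self.pid, self.tid = pid, tid      # process (CR3-like) and thread IDs
        self.watchpoints = {}              # byte address -> status

    def apply(self, update):
        start, end, status = update
        for addr in range(start, end + 1):
            self.watchpoints[addr] = status

def fallback_to_os(pid, tid, update):
    # No acknowledgment arrived: the thread is unscheduled, so the OS
    # patches its watchpoints saved in memory instead.
    print(f"OS interrupt: patch saved watchpoints for pid={pid} tid={tid}")

def update_remote_watchpoints(cores, pid, tid, update):
    # Broadcast (pid, tid, update); each matching core updates its own
    # cache entry and acknowledges.
    acked = [c for c in cores if c.pid == pid and c.tid == tid]
    for core in acked:
        core.apply(update)
    if not acked:
        fallback_to_os(pid, tid, update)

cores = [Core(pid=7, tid=0), Core(pid=7, tid=1)]
update_remote_watchpoints(cores, pid=7, tid=1, update=(0x100, 0x1FF, "rw"))
assert cores[1].watchpoints[0x100] == "rw" and not cores[0].watchpoints
```

In hardware, the fence semantics would also stall watched-memory accesses on the remote cores until the update is acknowledged.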

3. Watchpoint Applications

This section analyzes potential applications for this watchpoint system. We detail how each requires some subset of the requirements listed in Section 2.1 and then develop watchpoint-based algorithms that can accelerate each analysis. Space limitations prevent us from detailing more than what we tested in our experiments, but Table 3 covers other tools, which are also briefly discussed in Section 5.

3.1 Dynamic Dataflow Analysis

Dynamic dataflow analyses associate shadow values with program data, propagate them alongside the execution of the program, and perform a variety of checks on them to find errors. This meta-data can represent myriad details about the associated memory location such as trustworthiness [12], symbolic limits [24], or identification tags [29]. Unfortunately, these systems suffer from high runtime overheads, as every memory access must first check its associated meta-data before calculating any propagation logic.

Ho et al. described a method for dynamically disabling a taint analysis tool when it is not operating on tainted variables, allowing the majority of memory operations to proceed without any analysis overheads [21]. Pages that contain any tainted data are marked unavailable in the virtual memory system. Programs will execute unencumbered when operating on untainted data, but will cause a page fault if they access data on a tainted page. At this point, the program can be moved into the slow analysis tool.

Table 2: ISA Additions for Unlimited Watchpoints. Watchpoint modifications must be memory fences, and instructions working on remote cores cause a shootdown in the remote WLB. Instructions used by the backing store handler can set all bits in a RC entry, and are also used during task switching.

Watchpoint Modifications

set local wp start, end, {r,w,rw,0} – Adds a R/W/RW/not-watched range into this CPU's RC. Overwrites any overlapping ranges.

add/rm local wp start, end, {r,w} – Updates an entry in the RC of this CPU. If anything between start and end is not in the RC, it must be read from the backing store to properly update it.

set remote wp start, end, tid, {r,w,rw,0} – Adds an entry into the RC of the CPU with TID register tid.

add/rm remote wp start, end, tid, {r,w} – Updates an entry in the RC of the CPU with TID register tid.

Backing Store Handler

read rc entry n – Reads the nth entry of the RC (LRU order). This allows the oldest entries to be sent to the backing store. A bulk-read instruction could be used for task-switching.

store rc entry start, end, status bits – Allows writing an entire range (including bitmap and OCBM status bits) into a range cache entry. Used for reading in a watchpoint on a RC miss.

Watchpoint System Interface

enable/disable wp – Enable or disable the watchpoint system on this core. Kernel-level instruction.

set cpu tid tid – Set the TID register on this CPU at thread switch time.

enable/disable wp thread tid – Toggle WP operation for any core with the same PID as this core's and TID equal to tid.

set handler addr addr – Set the address to jump to when a watchpoint-related fault occurs. If faults are user-level, then this instruction is user-level.

get cpu tid – Find the value in this processor's TID register.

Ho et al. discussed the problem of false tainting, where the relatively coarse granularity of pages causes unnecessary page faults when touching untainted data. Ideally, such a demand-driven taint analysis tool would utilize byte-accurate watchpoints. Additionally, although they partially mitigated the slowdowns caused by the lack of a fast fault handler by remaining inside their analysis tool for long periods of time, this can limit performance. Their tool conservatively remains enabled while performing no useful analysis.

A WP-based algorithm for this type of system can be summarized as setting RW watchpoints on any data that is tainted and running until a WP fault occurs. The fault indicates that a tainted value is either being written over or read from, and the propagation logic or instrumented code should be called from the WP handler. The tool should also remain enabled while tainted data exists in registers, as those values cannot be covered by watchpoints.
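A minimal sketch of this algorithm, assuming a hypothetical byte-granularity watchpoint API (the `watched` set stands in for the hardware RC/WLB, and the propagation rule shown handles only copies):

```python
# Sketch of watchpoint-based demand-driven taint analysis: every
# tainted byte is RW-watched, so untainted code runs at full speed and
# the propagation logic only runs inside the WP fault handler.

taint = {}        # byte address -> tainted?
watched = set()   # addresses with RW watchpoints set (models the HW)

def set_taint(addr):
    taint[addr] = True
    watched.add(addr)      # RW-watch every tainted byte

def clear_taint(addr):
    taint.pop(addr, None)
    watched.discard(addr)  # untainted data needs no watchpoint

def on_wp_fault(op, dst, src):
    # Called from the fast user-level fault handler on any access to
    # watched (tainted) data; runs propagation on demand.
    if op == "copy":
        set_taint(dst) if taint.get(src) else clear_taint(dst)

set_taint(0x2000)                          # e.g., a byte read from the network
on_wp_fault("copy", dst=0x3000, src=0x2000)
assert 0x3000 in watched                   # taint and its watchpoint propagated
```

A full tool would also track taint held in registers, since, as noted above, registers cannot be covered by watchpoints.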

3.2 Deterministic Concurrent Execution

Deterministic concurrent execution systems attempt to make more robust parallel programs by guaranteeing that the outputs of concurrent regions are the same each time a program is run with a particular input [5]. This can be accomplished by allowing parallel threads to run unhindered when they are not communicating, but only committing their memory speculatively. If one thread concurrently modifies the data being used by another, they must be serialized in some manner.

Grace, one example of this type of system, splits fork/join parallel programs into multiple processes and marks potentially shared heap regions as watched in the virtual memory system [6]. Any time the program reads or writes a new heap page, a watchpoint fault is taken, whereupon the page is put into that process's read or write set and marked as read-only or available, respectively. As the processes merge while joining, any conflicting write updates cause one of the threads to roll back and reexecute.

While Berger et al. demonstrated the effectiveness of this system for a collection of fork/join parallel programs, using virtual memory watchpoints in such a way limits the applicability of their system. First, it only works on programs that can easily be split into multiple processes, which can lead to performance and portability problems in some operating systems. Additionally, it can only look for conflicting accesses at the page granularity. While many developers work to limit false cache line sharing between threads, it is much less likely that they care to limit false sharing at the page granularity. Such a mismatch can lead to many unnecessary rollbacks, again reducing performance.

Our watchpoint-based deterministic execution algorithm is similar to Grace, except that it works on a per-thread basis and sets watchpoints at a 64B cache line granularity. This could also be done at the byte granularity, but with higher overheads.
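The line-granularity read/write-set tracking can be sketched as below. This is a simplified model under our own naming: serialization and the rewatching of lines after commit (which the full algorithm requires) are elided.

```python
# Sketch of cache-line-granularity deterministic execution: each thread
# starts with all shared memory RW-watched; watchpoint faults populate
# per-thread read and write sets used for conflict checks at join time.

LINE = 64  # watchpoint granularity in bytes

class Thread:
    def __init__(self, shared_lines):
        # "rw" = fully watched, "w" = write-watched only (after a read)
        self.watch = {ln: "rw" for ln in shared_lines}
        self.read_set, self.write_set = set(), set()

    def on_access(self, addr, is_write):
        ln = addr // LINE
        mode = self.watch.get(ln, "")
        if ("w" if is_write else "r") in mode:  # watchpoint fault
            if is_write:
                self.write_set.add(ln)
                del self.watch[ln]         # unwatch the line locally
            else:
                self.read_set.add(ln)
                self.watch[ln] = "w"       # still catch a later write

def conflicts(a, b):
    # Serialize (or roll back) if one thread wrote lines the other used.
    return a.write_set & (b.read_set | b.write_set)

t0, t1 = Thread(range(16)), Thread(range(16))
t0.on_access(0x40, is_write=True)    # line 1 enters t0's write set
t1.on_access(0x50, is_write=False)   # line 1 also enters t1's read set
assert conflicts(t0, t1) == {1}
```

Because the sets are built from 64-byte lines rather than 4KB pages, false conflicts of the kind Grace suffers are far less likely.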

3.3 Data Race Detection

Unordered accesses to shared memory locations by multiple threads, or data races, can allow variables to be changed in undesired orders, potentially causing data corruption and crashes [31]. Dynamic data race detectors can help programmers build parallel programs by informing them whenever shared memory is accessed in a racy way [35].

We previously showed a data race detection mechanism that operated on similar principles to demand-driven taint analysis, though it required the careful usage of performance monitoring hardware to observe cache events [19]. That system could run into performance problems due to false sharing and may miss some races due to cache size limitations, among other issues.

It is possible to build such a demand-driven analysis system using per-thread watchpoints. Similar to how Grace operates, all regions of shared memory are initially watched for each thread. As a thread executes, it will take a number of watchpoint faults in order to fill its read and write sets in a byte-accurate manner. In our implementation, reads that take a watchpoint fault remove the local read watchpoint and set a write watchpoint on all other threads (in order to catch WAR sharing), while writes completely remove the local watchpoint and set RW watchpoints on all other cores (to catch RAW and WAW sharing). Any time an instruction takes a WP fault, the software race detector can assume that it was caused by inter-thread data sharing, and should

Table 3: Applications of Watchpoints. This is a sample of software systems that could utilize hardware-supported watchpoints to perform more efficiently and accurately. High-level algorithms that focus on how such systems would interact with the watchpoint hardware are given for each, as well as an overview of the watchpoint capabilities that would be useful or needed for each algorithm.

Software System – High Level Algorithm – Capabilities marked (of: Large Number, Byte Granular, Fast Handler, Fast Changes, Per Thread, Set Ranges, Break Ranges)

Demand-Driven Dataflow Analysis – Set shadowed values as RW watched. Enable analysis tool only on watchpoint faults. (X X X X)

Deterministic Execution – Start with shared memory RW watched in all threads. On local access fault: check for write conflict between threads; if so, serialize. Unwatch (R or RW, depending on access) cache line locally. Rewatch cache lines on other threads after serialization. (X X X X)

Demand-Driven Data Race Detection – Start with shared memory RW watched in all threads. On local access fault: run software race detector on this access. Unwatch (R or RW, depending on access) address locally. Rewatch address on other threads. (X X X X X X X)

Bounds Checking – Set W-watchpoints on canaries, return addresses, and heap meta-data. (X X)

Speculative Program Optimization – Mark data regions as W watched in speculative and normal threads. On faults: save list of modified regions (perhaps larger than pages); mark page available in this thread. Compare values in modified regions when verifying speculation. (X X X X)

Hybrid Transactional Memory – At start of transaction, set local memory as RW watched. On local access fault: save original value for rollback; check for conflicts with other transactions; unwatch (R or RW, depending on access) address locally. Unwatch memory for this thread on transaction commit. (X X X X X X)

Semi-space Garbage Collection – During from-space/to-space switch: mark all memory in from-space as RW watched to executing threads. Update dereferenced pointer to be consistent on faults. (X X X)

send the access through its slow race detection algorithm. A similar WP-based race detection method was recently demonstrated in DataCollider, though they are only able to concurrently watch four variables due to WP register limitations [16].

This tool is a prime example of the need to break large ranges, as the program initially starts by watching large ranges and slowly splits them to form a local working set.
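The per-thread watchpoint transfers described above can be sketched directly. This is an illustrative model using our own names; the `watch` map stands in for each thread's hardware watchpoint state at byte granularity.

```python
# Sketch of the demand-driven race detector's watchpoint manipulation:
# a faulting read keeps WAR detection alive on other threads, while a
# faulting write keeps RAW and WAW detection alive.

def on_read_fault(tid, addr, watch):
    watch[tid][addr].discard("r")          # unwatch reads locally
    for other in watch:
        if other != tid:
            watch[other][addr].add("w")    # catch later writes (WAR)

def on_write_fault(tid, addr, watch):
    watch[tid][addr].clear()               # fully unwatch locally
    for other in watch:
        if other != tid:
            watch[other][addr].update("rw")  # catch RAW and WAW

# Two threads, one shared byte, initially RW-watched everywhere.
watch = {0: {0x10: {"r", "w"}}, 1: {0x10: {"r", "w"}}}
on_write_fault(0, 0x10, watch)
assert watch[0][0x10] == set() and watch[1][0x10] == {"r", "w"}
```

Every fault taken by either handler is also handed to the software race detection algorithm, since it implies inter-thread sharing.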

4. Experiments

Any slowdowns caused by issues such as WP cache misses should not outweigh the performance gains a WP system offers software tools. To that end, this section presents experiments that evaluate the performance of our system and a collection of other fine-grained memory protection systems when they are used to accelerate dynamic analysis tools.

4.1 Experimental Setup

We implemented a high-level simulation environment using Pin [25] running on x86-64 Linux hosts. The software analysis tools were implemented as pintools and were used to analyze 32-bit x86 benchmark applications. These pintools communicated with a simulator that modeled the watchpoint hardware. It also kept track of events that would cause slowdowns. The overheads of events that were fully exposed to the pipeline (e.g. faults to the kernel) were calculated by multiplying the event count by values derived from running lmbench [28] on a collection of x86 Linux systems. These values are listed in Table 4.

Events that are not fully exposed, such as the work done by software handlers, were logged and run through a trace-based timing tool on a 2.4GHz Intel Core 2 processor running Red Hat Enterprise Linux 6.1. These actions are listed in Table 5. The runtime of the timing tool is recorded and used to derive the cycle overhead of such events.

While this setup is likely to have inaccuracies (due to, for instance, caches and branch predictors being in different states), it is still useful in giving rough estimates of the performance of numerous different systems across a multitude of tools and benchmarks. It is worth noting that even “cycle-accurate” simulators have inaccuracies compared to real hardware [47]. The larger testing space available to our fast simulation can still lead to useful design decisions.

4.1.1 Hardware Designs Modeled

To compare our design to previous works that could also offer watchpoints, we built models for a number of other hardware memory protection mechanisms. We assume that every system except virtual memory has user-level faults so as not to bias our results away from other hardware designs.

Virtual Memory – This models the traditional way of offering large numbers of watchpoints [2]. Touching a page with watched data causes a page fault, whereupon the kernel looks through a list of byte-accurate watchpoints to decide if this was a true watchpoint fault. If so, a signal is sent to the user code. If not, the page is marked available, the next instruction single-stepped, and the page is then marked unavailable again after the subsequent return to the kernel. The overheads of this system can be estimated as: (# true faults × (Tkernel + Tsignal)) + (# false faults × Tkernel × 2) + (# WP changes × Tsyscall) + TSWcheck + TSWset + TVM.
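As a worked example, the exposed portion of this model can be evaluated with the Table 4 latencies. The event counts below are hypothetical, and the workload-dependent terms TSWcheck, TSWset, and TVM are omitted.

```python
# Worked example of the virtual-memory overhead model, using the
# exposed-latency values from Table 4 (cycles). Event counts are
# made up purely for illustration.

T_KERNEL, T_SIGNAL, T_SYSCALL = 700, 3000, 400

def vm_overhead(true_faults, false_faults, wp_changes):
    # (# true faults x (Tkernel + Tsignal))
    #   + (# false faults x Tkernel x 2) + (# WP changes x Tsyscall)
    return (true_faults * (T_KERNEL + T_SIGNAL)
            + false_faults * T_KERNEL * 2
            + wp_changes * T_SYSCALL)

# 1,000 true faults, 50,000 false (page-granularity) faults, and
# 10,000 watchpoint changes: the false faults dominate.
print(vm_overhead(1_000, 50_000, 10_000))  # 77,700,000 cycles
```

Even with these modest counts, page-granularity false faults account for the bulk of the exposed cycles, which is the behavior the byte-accurate systems below are designed to avoid.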

MemTracker – This implements a design that has a lookaside buffer that is separate from the L1D cache (called the “Taint L1” in FlexiTaint [41] and “State L1” in MemTracker [42]). In these tests, the State L1 (SL1) was a 4KB, 4-way set associative cache with 64-byte lines. The original MemTracker did not have user-level faults, so their backing store was a bitmap in main memory. Our initial tests showed that the vast majority of time was spent writing data to this bitmap, so we changed the system to use a software-controlled backing store handler similar to our design. The overheads of this system can be estimated as: ((# of faults + # SL1 misses) × Tuser) + TSWcheck + TSWset.

Table 4: Exposed Latency Values. These events are counted in our simulator, and each event is estimated to take the listed number of cycles, on average.

Event – Added Cycles – Symbol
Kernel fault – 700 – Tkernel
Syscall entry – 400 – Tsyscall
Signal from kernel – 3000 – Tsignal
User-level fault time – 20 – Tuser

Mondriaan Memory Protection – MMP utilizes a trie in main memory to store the watchpoints for individual bytes [46]. Upper levels of the table can be set to mark large, aligned regions as watched in one action. The protection lookaside buffer (PLB) is 256 entries and can hold these higher-level entries (using low order don't-care bits). We did not utilize the mini-SST optimization because Witchel later described how such entries can yield significant slowdowns if permission changes are frequent [45]. The overheads for this system are: (# of faults × Tuser) + THWcheck + TMLPT.

Range Cache – This system is a 128-entry range cache, where each entry holds 2 bits of WP data. Tiwari et al. estimated that a 128-entry range cache with 32-bit entries would take up nearly the same amount of space as 4KB of data cache [40]. The overheads are: ((# of faults + # RC misses + # write-backs) × Tuser) + (complex range updates × 64 cycles) + TSWcheck + TSWset.

RC + Bitmap – Our technique adds bitmapped ranges, OCBMs, and a 2-way set-associative, 2KB WLB to the range cache design. The size of the RC is reduced to 64 entries because of the extra area needed for these features. This system can have further overhead-causing events besides those of the range cache: (THWcheck on WLB miss) + (time to decide to switch to/from bitmap ranges) + Tbitmap.

4.1.2 Software Test Clients

The software analyses discussed in Section 3 are utilized as clients for the simulated hardware watchpoint system. The overheads caused by the tools themselves are common amongst all watchpoint designs and are not modeled. In other words, the reported performance differences are relative to the meta-data checks only, not the shadow propagation, serialization algorithms, or race detection logic.

Demand-Driven Taint Analysis – This tool performs taint analysis on target applications, marking data read from disk and the network as tainted. As a baseline, we compare against two state-of-the-art analysis systems. The first, MINEMU 0.3, is an extremely fast taint analysis tool that utilizes multiple non-portable techniques (such as using SSE registers to hold taint bits) in order to run as fast as possible [9]. Because this system is a full taint analysis tool, its measured overheads do include taint propagation and checks. The second baseline system is Umbra, a more general shadow memory system built on top of DynamoRIO [49]. Its overheads come from calculating shadow value locations and accessing them on each memory operation. Similar to the tests done by Tiwari et al. [40], we test these systems using the SPEC CPU2000 integer benchmarks.

Table 5: Pipelined Events. These are pipelined events that cause overlapping slowdowns. They are logged and their overheads estimated using a trace-based timing simulator.

Event – Symbol
Load watchpoints using SW handler. – TSWcheck
Store watchpoints using SW handler. – TSWset
Load watchpoints using hardware. – THWcheck
Change page table protection bits. – TVM
Create MLPT entries in MMP. – TMLPT
Write data into bitmaps. – Tbitmap

Deterministic Concurrent Execution – We test the deterministic execution tool in a similar manner to Grace by using the benchmark tests included in the Phoenix shared-memory MapReduce suite [33]. We also tested the SPEC OMP2001 benchmarks, another set of programs with the fork-join parallelism that Grace is designed to support.

Demand-Driven Data Race Detection – The default tool we compare against is a commercial data race detector that performs this sharing analysis in software. We test this system in a similar manner to our previous work [19] by running the Phoenix [33] and PARSEC [8] suites.

4.2 Results for Taint Analysis

Figure 6 details the average number of memory operations between each WP cache miss for all of the systems that use some cache mechanism (i.e., not virtual memory). Because these programs have large regions of unwatched data, this tool shows the power of the range cache to cover nearly the entire working set of a program. Similarly, because the MMP PLB can hold aligned ranges, it has a larger reach than MemTracker's cache, which can only hold small sections of per-byte bitmaps. Nonetheless, Figure 7 shows that MMP has worse performance, on average, than MT. This is primarily because some benchmarks cause MMP to spend a great deal of time deciding whether to set or remove upper levels of the trie, in order to best utilize ranges in its PLB.

The taint analysis tool rarely causes our hybrid system to transition from using normal to bitmapped ranges. Because the size of the RC is reduced from 128 to 64 entries, its miss rate is slightly higher than the RC-only system's. The only counterexample is 255.vortex, which taints a large number of small regions. However, when WLB misses are taken into account (the line “RC+Bitmap” only accounts for range cache misses), the total hit rate drops precipitously. This is because the benchmark has a large number of WP changes, and each change into a bitmapped region causes a WLB eviction.

Figure 7 compares the performance of the HW-assisted watchpoint systems against the two software-based shadow value analysis tools. The lower hit rate in the hybrid system means that it is about 7% slower than one with a larger RC. Nonetheless, the high hit rates of both systems mean that nearly every instruction that is not tainted suffers no slowdown, yielding large speedups over the software analysis tools. It is important to note, however, that MINEMU is also performing taint propagation, overheads which we do not analyze for our hardware systems. Nonetheless, using hardware-supported watchpoints can still result in large performance gains over an always-on tool like Umbra.

[Figure 6: log-scale chart of memory operations per cache miss for MT, MMP, RC, RC+Bitmap, and RC+B w/ WLB across the SPEC CPU2000 integer benchmarks.]

Figure 6: Memory Operations per Watchpoint Cache Miss for Taint Analysis. Most programs store a relatively small number of tainted regions, leading the range cache to have a very high hit rate (some of the benchmarks never miss). “RC+B w/ WLB” counts both range cache misses and the WLB misses, though the latter are filled much more quickly than the former.

[Figure 7: slowdown (×) of MINEMU, Umbra, VM, MT, MMP, RC, and RC+Bitmap across the SPEC CPU2000 integer benchmarks; one clipped bar is annotated at 19×.]

Figure 7: Performance of Demand-Driven Taint Analysis. The patterned bars are binary instrumentation systems that perform full taint analysis and shadow value lookup, respectively. The maximum slowdowns of the VM system are not shown, as they can go as high as 1400×. The RC + Bitmap solution performs poorly on 255.vortex because each WP change to a bitmapped region causes an eviction from the WLB, and these are quite frequent. Nonetheless, this system easily outperforms manual shadow value lookups.

This performance benefit is dampened in 255.vortex by its extremely low cache hit rates. All of the demand-driven analysis systems performed poorly for this benchmark, and would probably continue to do so even with high cache hit rates, since the software analysis tool would rarely be disabled – almost 10% of the memory operations in 255.vortex operate on tainted values. Demand-driven tools are poor at accelerating applications such as this, and would need mechanisms such as sampling to run faster [18].

4.3 Results for Deterministic Execution

Figures 8 and 9 show the cache miss rates and performance, respectively, of the WP-based deterministic execution system. The performance graph is normalized to a Grace-like system that removes watchpoints one page, rather than one cache line, at a time. The normalized performance of the more fine-grained mechanisms will be lower than 1.

Because this system operates on more watchpoints than the taint analysis tool, the cache hit rates are lower. Very few of the watchpoints used in this tool can be held in higher levels of the trie, so MMP's PLB has a much worse hit rate. On average, MemTracker's hit rate is higher than MMP's due to its larger cache. Despite the increased number of watchpoints, the range cache maintains a high hit rate. This can partially be attributed to the tool using watchpoints of 64 bytes in length at minimum, which increases the reach of the range cache in highly fragmented cases.

As Figure 9 illustrates, attempting to use finer-grained watchpoints in the virtual memory system reduces performance significantly (it averages out to 670× slower than using 4KB watchpoints). All of the systems designed for fine-grained watchpoints handle this change much better. The systems that utilize ranges execute at about 90% of the speed of the VM-based system that uses 4KB watchpoints, with the bitmapped range system edging slightly higher. Because this tool primarily works on ranges of data, both range-based systems perform better than the other hardware systems that operate solely on bitmaps.

4.4 Results for Data Race Detection

The cache miss rate for a demand-driven data race detector is shown in Figure 10. This tool deals with a much greater number of watchpoints, as it slowly unwatches data touched by individual threads at a byte granularity. The effect this has on the range cache can be easily observed in the Phoenix suite, as it falls from 100,000 memory operations between each range cache miss (in Grace) to 10,000 in this tool. The PARSEC benchmark canneal is particularly egregious, as none of the hardware caches go more than an average of 100 memory operations before missing. Because most of these benchmarks have a large number of relatively small ranges, however, this suite shows the benefit of the bitmapped ranges. In PARSEC, the geometric mean of the range cache hit rate is 5× higher when small ranges can be stored as bitmaps, even though the RC itself is half the size.

The performance of the hardware-assisted sharing detectors, shown in Figure 11, is almost always higher than when using software. In particular, the range-based systems, even with low hit rates, can still outperform a software system that must check every memory access. The exception is canneal, where range-based hardware systems suffer high overheads from the backing store fault handler. In total, however, our hybrid system is able to demonstrate a 9× performance improvement, on average.

4.5 Experimental Takeaways

On the whole, the benefits of the range cache are pronounced. Its hit rate is significantly higher than caches that only store WP bitmaps. Perhaps even more importantly, it acts as a write filter. Many of the slowdowns seen by MemTracker are due to writing to the backing store repeatedly. In a similar vein, the algorithm for filling out the trie in MMP can take up a great deal of time when there are many WP changes, and breaking ranges apart can cause a large amount of memory traffic.

Nonetheless, we’ve found that the addition of a bitmapsystem to the range cache is beneficial when the watchpointsare small. There are numerous applications that create alarge number of small ranges. dedup within a race detector,for instance, is significantly helped by the increased reach ofthe RC when using bitmapped ranges. We found, however,that the WLB miss rate was quite high in many cases. Thisis partially because writes to bitmapped regions currentlyevict matching entries from the WLB. It is also the case that

[Figure 8: log-scale charts of memory operations per cache miss for MT, MMP, RC, RC+Bitmap, and RC+B w/ WLB; panel (a) shows the Phoenix suite, panel (b) SPEC OMP2001.]

Figure 8: Memory Operations per WP Cache Miss for Deterministic Execution. The increased number of WPs causes the RC to have lower hit rates than it had for taint analysis. Because watchpoints are set and removed on cache line sized regions, the addition of a bitmap is only somewhat useful. It does not, however, increase the reach of the system as a whole since the size of the RC is reduced. Nonetheless, both systems take many fewer misses than other watchpoint hardware systems.

[Figure 9: performance normalized to Grace (0 to 1.0) for VM, MT, MMP, RC, and RC+Bitmap; panel (a) shows the Phoenix suite, panel (b) SPEC OMP2001.]

Figure 9: Deterministic Execution Performance: Cache Line vs. Page Granularity Watchpoints. The baseline system sets and removes watchpoints at the 4KB page granularity, while these systems operate on 64 byte lines. On average, both range-based systems operate at 90% of the baseline speed while maintaining more accurate checks.

[Figure 10: log-scale charts of memory operations per cache miss for MT, MMP, RC, RC+Bitmap, and RC+B w/ WLB; panel (a) shows the Phoenix suite, panel (b) PARSEC.]

Figure 10: Memory Operations per WP Cache Miss for Data Race Detection. This system sets a large number of fine-grained watchpoints, making the bitmap addition especially useful. It increases the number of instructions between each RC miss by over 5× in PARSEC. The total miss rate with WLB included is higher, but those misses are filled much faster than RC misses.

[Figure 11: speedup (×) over binary instrumentation for VM, MT, MMP, RC, and RC+Bitmap; panel (a) shows the Phoenix suite, panel (b) PARSEC.]

Figure 11: Performance of Demand-Driven Data Race Detection vs. Binary Instrumentation. Because the binary instrumentation tool is slow, most of the systems see significant speedups. The WLB miss handler is much faster than the RC miss handler, so the bitmapped range system is noticeably faster than the RC system, even though it takes many WLB misses.

bitmapped ranges are often enabled in sections of code that are difficult to cache. This warrants future research to perhaps find a better way to store these bitmapped ranges.

Though we did not illustrate the tests we did on the OCBM, we found that it was a net benefit, on average. However, its benefit was generally measured in single-digit percentages, rather than the larger gains we often saw with the full bitmap system.

Perhaps the most important area that could be improved in our system is the backing store handler. Range cache misses or overflows required hundreds to thousands of extra cycles to handle (depending on the size of the backing store tree and the complexity of the insertion). With overheads this high, the range cache needs a particularly good hit rate to maintain high performance. One of the benefits we found with the bitmapped ranges was that the WLB miss handler was simpler and faster. It is probably possible to build more efficient backing store algorithms, such that they would switch between storage methods depending on the miss rate, eviction rate, and number of watchpoints in the system.

In summary, the hardware-assisted watchpoint systems allowed sizable performance improvements in demand-driven tools. For the deterministic execution system, the finer-grained watchpoints resulted in minor slowdowns, but allowed more accurate sharing analyses than can currently be performed at such a speed. This is the crux of the argument that hardware-assisted watchpoints are a generalized mechanism to improve software tools.

5. Related Works

This section explores works related to watchpoints and their uses. We first discuss hardware proposals that could be used to provide unlimited watchpoints, but which have limitations that keep them from being as general as we would like. We then list other applications that could potentially utilize an unlimited watchpoint system.

5.1 Memory Protection Systems

There have been numerous proposals for hardware that provides watchpoint-like mechanisms. Capabilities, for instance, can be used to control software's access to particular regions in memory [15]. Few capability-based systems were built, and even fewer still exist. Most recent publications focus instead on fine-grain memory protection.

One of the most widely cited works in this area is Mondriaan (also spelled Mondrian) Memory Protection [46]. MMP was designed with fine-grained inter-process protection mechanisms in mind, and is optimized for applications that do not perform frequent updates. The protection information is stored in main memory and is cached in a protection lookaside buffer (PLB). MMP utilized a ternary CAM for this cache, allowing naturally aligned ranges to be compactly stored. The first method Witchel proposed for storing protection regions in memory was a sorted segment table (SST), a list sorted by starting address. Though this allows O(log n) lookups and can efficiently store ranges, it is unsuitable for frequent updates. The second, a multi-level permission table (MLPT), is a trie that holds a bitmap of word-accurate permissions in its lowest level. It is beneficial to use upper levels of this table whenever possible in order to increase the PLB's reach. Checking ranges of permissions in order to "promote" a region can cause significant update slowdowns, however. In general, MMP is designed to work with tools that, while needing memory protection, do not perform frequent updates.
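The shape of an MLPT lookup can be sketched in a few lines of C. This is an illustrative simplification with only two levels and a single fixed region size, not Witchel's exact table layout; all names here are ours:

```c
#include <stdint.h>

#define WORDS_PER_LEAF 256

/* Each upper-level entry either grants one permission to its whole
 * region (increasing the PLB's effective reach) or points at an
 * array of per-word permissions, standing in for the lowest-level
 * bitmap of the real MLPT. */
typedef struct {
    int      is_uniform;    /* nonzero: whole region shares one value */
    uint8_t  uniform_perm;
    uint8_t *word_perms;    /* per-word permissions when not uniform */
} mlpt_entry;

static uint8_t mlpt_lookup(const mlpt_entry *table, uintptr_t word_index) {
    const mlpt_entry *e = &table[word_index / WORDS_PER_LEAF];
    if (e->is_uniform)
        return e->uniform_perm;   /* upper-level ("promoted") hit */
    return e->word_perms[word_index % WORDS_PER_LEAF];
}
```

The uniform case is what makes promotion attractive: one entry answers for every word in the region, but demoting it back to per-word permissions requires materializing the whole leaf, which is where the update cost comes from.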

One common method of storing protection data is to put it alongside cache lines [30, 32, 50]. iWatcher, for example, stores per-word watchpoints alongside the cache lines that contain the watched data [50]. These bits are initially set by hardware and are temporarily stored into a victim table on cache evictions. The hardware falls back to virtual memory watchpoints if this table overflows. iWatcher can watch a small number of ranges, which must be pinned in physical memory. If this range hardware overflows, the system falls back to setting a large number of per-word watchpoints. In general, this system is inadequate for tools that require more than a small number of large ranges. Of note, Zhou et al. correctly state the usefulness of user-level faults in handling frequently-touched watchpoints. UFO, a similar system, used watchpoints to accelerate speculative software optimization [30]. Unfortunately, because it stores watchpoints in memory by updating the ECC bits of individual lines, it does not allow large ranges to be quickly set or removed. SafeMem uses a similar scheme [32].

MemTracker holds analysis meta-data (which is analogous to watchpoints) in a separate L1 cache, similar to our WLB. It also offers the option of using a hardware state machine to perform meta-data propagation [42]. This means that it can have even less overhead for some types of analyses than a system with a user-level fault handler. However, because its meta-data is stored as a packed array in main memory, it is time-consuming to set or remove watchpoints on large ranges. FlexiTaint, a follow-on work to MemTracker, has the same issues [41].

In order to reduce the number of accesses to the memory protection hardware, Sentry stores its protection data at the cache-line granularity [38]. Every line in the L1 cache is unwatched, and the on-chip protection structure is only checked on L1D misses. Their paper gives an excellent analysis of the type of hardware needed to perform watchpoints. However, their system has large overheads when performing frequent updates, because it must go to a software handler on every change and evict cache lines accordingly.

5.2 Other Uses for Watchpoints

Watchpoints can also be used to accelerate software systems beyond the three examined in the experiments. Table 3 lists algorithms for the tools discussed in this section.

Hill et al. previously made an argument for deconstructed hardware transactional memory (TM) systems [20]. They pushed for HTM additions that could be used for other things, like watchpoints; we instead argue that watchpoints can be used to make (among other things) faster TM systems. A WP-based algorithm to accelerate software TM, as described by Baugh et al., sets WPs on data touched by transactional code; any time another thread touches this data, a conflict will be caught [3]. If transactions are particularly long, forcibly setting watchpoints after every memory access may be slow. In this case, carving working sets out of large watched regions (as we demonstrated for deterministic execution and data race detection) may be faster, as the initial fault times are amortized across multiple memory operations.
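The conflict check at the heart of that algorithm can be sketched in plain C. Here a small bitmap stands in for the hardware watchpoint state, and the check the hardware would make on every access is an explicit call; all names are ours, not from Baugh et al. or any real TM implementation:

```c
#include <stdint.h>

/* Illustrative sketch of WP-based conflict detection for software TM:
 * one watchpoint bit per word of a small heap, plus the transaction
 * that owns each watchpoint. */
#define HEAP_WORDS 1024
static uint8_t wp_bits[HEAP_WORDS / 8];
static int     wp_owner[HEAP_WORDS];   /* transaction that set each WP */

/* A transaction sets a WP on every word it reads or writes. */
static void txn_touch(int word, int txn) {
    wp_bits[word / 8] |= (uint8_t)(1u << (word % 8));
    wp_owner[word] = txn;
}

/* Returns 1 if an access by `txn` hits another transaction's WP,
 * i.e., the fault the hardware would raise to signal a conflict. */
static int access_conflicts(int word, int txn) {
    if (wp_bits[word / 8] & (1u << (word % 8)))
        return wp_owner[word] != txn;
    return 0;
}
```

In the real system the check is free on the common path: unwatched accesses proceed without any software involvement, and only a conflicting access takes a fault.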

Beyond correctness and debugging analyses, speculative parallelization systems allow some pieces of sequential code to run concurrently, and only later check if the result was indeed correct. In essence, watchpoints can be set on values created by the speculative code so that they are verified before being used elsewhere in the program. Fast Track performs these checks in software using virtual memory watchpoints [23].

Because they allow write faults to be taken on specified read-only regions of memory, watchpoints also enable security systems both as common as bounds checking [14] and as esoteric as kernel rootkit protection [44]. The former requires little explanation; as for the latter, it is useful to note that its authors lamented the "protection granularity gap," or the inability to set fine-grained watchpoints.
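As a concrete illustration of the bounds-checking use, the sketch below simulates the red-zone idea in plain C: the word just past each allocation is recorded as watched, so the first out-of-bounds access trips the check that real hardware (a debug register in Chiueh's system) would deliver as a fault. Everything here is illustrative; the names are ours:

```c
#include <stddef.h>
#include <stdint.h>

/* Illustrative red-zone sketch of watchpoint-based bounds checking.
 * A real implementation would program watchpoint hardware; here the
 * guard addresses are kept in a small table and checked explicitly. */
#define MAX_GUARDS 64
static uintptr_t guard_addrs[MAX_GUARDS];
static int       n_guards;

/* Watch the first byte past the end of an allocation. */
static void watch_buffer_end(const void *buf, size_t len) {
    guard_addrs[n_guards++] = (uintptr_t)buf + len;
}

/* Returns 1 when an access touches a guard, i.e., when the
 * hardware would raise a watchpoint fault on a buffer overrun. */
static int access_faults(const void *addr) {
    for (int i = 0; i < n_guards; i++)
        if ((uintptr_t)addr == guard_addrs[i])
            return 1;
    return 0;
}
```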

Watchpoints can even be used, in a fashion, for garbage collection. A semi-space collector will move all reachable objects from one memory space to another at collection time [13, 17]. To implement this efficiently, the program will continue to execute while data is moved, pausing if the program attempts to access data that has not yet been appropriately moved. Appel et al. did this by setting virtual memory watchpoints on each memory location that is to be moved [1]. This is similar to the mechanism that Lyu et al. use to perform online software updates [26].
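A toy version of that read barrier can be written with an explicit check in place of the fault. This is purely illustrative: Appel et al. use page protection, not the per-object flag used here, and the names are ours:

```c
/* Illustrative sketch of a semi-space collector's lazy copy.  A
 * per-object "moved" flag stands in for the watchpoint/page fault:
 * the first access to a not-yet-copied object pauses to move it,
 * just as the fault handler would, then execution continues. */
#define N_OBJS 8
static int from_space[N_OBJS];
static int to_space[N_OBJS];
static int moved[N_OBJS];       /* 1 once object i has been copied */

static int *gc_access(int i) {
    if (!moved[i]) {            /* "watchpoint fault": copy on demand */
        to_space[i] = from_space[i];
        moved[i] = 1;
    }
    return &to_space[i];        /* the program only ever sees to-space */
}
```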

6. Conclusion and Future Work

In this paper we presented a hardware design that allows software to utilize a virtually unlimited number of low-overhead watchpoints. By combining the ability of a range cache to store long watched regions with a bitmap's fast access and succinct encoding for small watchpoints, we were able to demonstrate an effective system for a collection of software tools. We used this system to demonstrate that the software community can benefit from generic primitives provided by hardware.

Though our results show that a design of this nature is promising, a number of avenues for future research remain open. Our range cache's software-controlled miss and eviction handler would be unusable without fast fault support. It may therefore be advantageous to look into hardware designs for operating on the backing store. There are also open questions as to the best algorithms for moving from ranges to bitmaps (and vice versa). We presented a simple algorithm that showed decent results, but it is probably not optimal either in its runtime or its decisions.

Finally, there is potentially promising research to be done into new types of software tools that could be built on top of watchpoint systems. Software developers and researchers have been valiantly extending the uses of the virtual memory system for decades. This ingenuity may very well yield novel uses for a byte-granularity watchpoint system in the future.

Acknowledgments

We wish to thank Debapriya Chatterjee, Andrea Pellegrini, and the anonymous reviewers, whose suggestions greatly improved this work, as well as Evelyn Duesterwald from IBM Research, for assistance in the final revision. Thanks also to Qin Zhao for providing the Umbra code used in our experiments. The authors acknowledge the support of the Gigascale Systems Research Center, one of five research centers funded under the Focus Center Research Program, a Semiconductor Research Corporation program.

References

[1] A. W. Appel, J. R. Ellis, and K. Li. Real-time Concurrent Collection on Stock Multiprocessors. In the Proc. of the Conference on Programming Language Design and Implementation (PLDI), 1988.

[2] A. W. Appel and K. Li. Virtual Memory Primitives for User Programs. In the Proc. of the International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 1991.

[3] L. Baugh, N. Neelakantam, and C. Zilles. Using Hardware Memory Protection to Build a High-Performance, Strongly-Atomic Hybrid Transactional Memory. In the Proc. of the International Symposium on Computer Architecture (ISCA), 2008.

[4] B. Beander. VAX DEBUG: An Interactive, Symbolic, Multilingual Debugger. In the Proc. of the Software Engineering Symposium on High-level Debugging, 1983.

[5] T. Bergan, J. Devietti, N. Hunt, and L. Ceze. The Deterministic Execution Hammer: How Well Does it Actually Pound Nails? In the Workshop on Determinism and Correctness in Parallel Programming (WoDet), 2011.

[6] E. D. Berger, T. Yang, T. Liu, and G. Novark. Grace: Safe Multithreaded Programming for C/C++. In the Proc. of the International Conference on Object-Oriented Programming, Systems, Languages, and Applications (OOPSLA), 2009.

[7] A. Bessey, K. Block, B. Chelf, A. Chou, B. Fulton, S. Hallem, C. Henri-Gros, A. Kamsky, S. McPeak, and D. Engler. A Few Billion Lines of Code Later: Using Static Analysis to Find Bugs in the Real World. Communications of the ACM, 53(2):66–75, February 2010.

[8] C. Bienia, S. Kumar, J. P. Singh, and K. Li. The PARSEC Benchmark Suite: Characterization and Architectural Implications. In the Proc. of the International Conference on Parallel Architectures and Compilation Techniques (PACT), 2008.

[9] E. Bosman, A. Slowinska, and H. Bos. Minemu: The World's Fastest Taint Tracker. In the Proc. of the International Symposium on Recent Advances in Intrusion Detection (RAID), 2011.

[10] S. Chaudhry, R. Cypher, M. Ekman, M. Karlsson, A. Landin, S. Yip, H. Zeffer, and M. Tremblay. Rock: A High-Performance Sparc CMT Processor. IEEE Micro, 29(2):6–16, 2009.

[11] S. Chaudhry, R. Cypher, M. Ekman, M. Karlsson, A. Landin, S. Yip, H. Zeffer, and M. Tremblay. Simultaneous Speculative Threading: A Novel Pipeline Architecture Implemented in Sun's ROCK Processor. In the Proc. of the International Symposium on Computer Architecture (ISCA), 2009.

[12] S. Chen, J. Xu, N. Nakka, Z. Kalbarczyk, and R. K. Iyer. Defeating Memory Corruption Attacks via Pointer Taintedness Detection. In the Proc. of the International Conference on Dependable Systems and Networks (DSN), 2005.

[13] C. J. Cheney. A Nonrecursive List Compacting Algorithm. Communications of the ACM, 13(11):677–678, November 1970.

[14] T. Chiueh. Fast Bounds Checking Using Debug Registers. In the Proc. of the International Conference on High Performance Embedded Architectures & Compilers (HiPEAC), 2008.

[15] D. R. Ditzel. Architectural Support for Programming Languages in the X-Tree Processor. In Spring COMPCON, 1980.

[16] J. Erickson, M. Musuvathi, S. Burckhardt, and K. Olynyk. Effective Data-Race Detection for the Kernel. In the Proc. of the Symposium on Operating Systems Design and Implementation (OSDI), 2010.

[17] R. R. Fenichel and J. C. Yochelson. A LISP Garbage-Collector for Virtual-Memory Computer Systems. Communications of the ACM, 12(11):611–612, November 1969.

[18] J. L. Greathouse, C. LeBlanc, T. Austin, and V. Bertacco. Highly Scalable Distributed Dataflow Analysis. In the Proc. of the Symposium on Code Generation and Optimization (CGO), 2011.

[19] J. L. Greathouse, Z. Ma, M. I. Frank, R. Peri, and T. Austin. Demand-Driven Software Race Detection using Hardware Performance Counters. In the Proc. of the International Symposium on Computer Architecture (ISCA), 2011.

[20] M. D. Hill, D. Hower, K. E. Moore, M. M. Swift, H. Volos, and D. A. Wood. A Case for Deconstructing Hardware Transactional Memory Systems. Technical report, University of Wisconsin-Madison, 2007.

[21] A. Ho, M. Fetterman, C. Clark, A. Warfield, and S. Hand. Practical Taint-Based Protection using Demand Emulation. In the Proc. of the European Conference on Computer Systems (EuroSys), 2006.

[22] M. S. Johnson. Some Requirements for Architectural Support of Software Debugging. In the Proc. of the International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 1982.

[23] K. Kelsey, T. Bai, C. Ding, and C. Zhang. Fast Track: A Software System for Speculative Program Optimization. In the Proc. of the Symposium on Code Generation and Optimization (CGO), 2009.

[24] E. Larson and T. Austin. High Coverage Detection of Input-Related Security Faults. In the Proc. of the USENIX Security Symposium, 2003.

[25] C.-K. Luk, R. Cohn, R. Muth, H. Patil, A. Klauser, G. Lowney, S. Wallace, V. J. Reddi, and K. Hazelwood. Pin: Building Customized Program Analysis Tools with Dynamic Instrumentation. In the Proc. of the Conference on Programming Language Design and Implementation (PLDI), 2005.

[26] J. Lyu, Y. Kim, Y. Kim, and I. Lee. A Procedure-Based Dynamic Software Update. In the Proc. of the International Conference on Dependable Systems and Networks (DSN), 2001.

[27] R. E. McLear, D. M. Scheibelhut, and E. Tammaru. Guidelines for Creating a Debuggable Processor. In the Proc. of the International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 1982.

[28] L. McVoy and C. Staelin. lmbench: Portable Tools for Performance Analysis. In the Proc. of the USENIX Annual Technical Conference, 1996.

[29] S. Mysore, B. Mazloom, B. Agrawal, and T. Sherwood. Understanding and Visualizing Full Systems with Data Flow Tomography. In the Proc. of the International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2008.

[30] N. Neelakantam and C. Zilles. UFO: A General-Purpose User-Mode Memory Protection Technique for Application Use. Technical report, University of Illinois at Urbana-Champaign, 2007.

[31] R. H. B. Netzer and B. P. Miller. What Are Race Conditions? Some Issues and Formalizations. ACM Letters on Programming Languages and Systems, 1(1):74–88, March 1992.

[32] F. Qin, S. Lu, and Y. Zhou. SafeMem: Exploiting ECC-Memory for Detecting Memory Leaks and Memory Corruption During Production Runs. In the Proc. of the International Symposium on High-Performance Computer Architecture (HPCA), 2005.

[33] C. Ranger, R. Raghuraman, A. Penmetsa, G. Bradski, and C. Kozyrakis. Evaluating MapReduce for Multi-core and Multiprocessor Systems. In the Proc. of the International Symposium on High-Performance Computer Architecture (HPCA), 2007.

[34] I. Schoinas, B. Falsafi, A. R. Lebeck, S. K. Reinhardt, J. R. Larus, and D. A. Wood. Fine-grain Access Control for Distributed Shared Memory. In the Proc. of the International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 1994.

[35] E. Schonberg. On-The-Fly Detection of Access Anomalies. In the Proc. of the Conference on Programming Language Design and Implementation (PLDI), 1989.

[36] E. Schrock. Watchpoints 101. http://blogs.oracle.com/eschrock/entry/watchpoints_101, July 2004.

[37] J. Seward and N. Nethercote. Using Valgrind to Detect Undefined Value Errors with Bit-precision. In the Proc. of the USENIX Annual Technical Conference, 2005.

[38] A. Shriraman and S. Dwarkadas. Sentry: Light-Weight Auxiliary Memory Access Control. In the Proc. of the International Symposium on Computer Architecture (ISCA), 2010.

[39] B. Sprunt. The Basics of Performance-Monitoring Hardware. IEEE Micro, 22(4):64–71, 2002.

[40] M. Tiwari, B. Agrawal, S. Mysore, J. Valamehr, and T. Sherwood. A Small Cache of Large Ranges: Hardware Methods for Efficiently Searching, Storing, and Updating Big Dataflow Tags. In the Proc. of the International Symposium on Microarchitecture (MICRO), 2008.

[41] G. Venkataramani, I. Doudalis, Y. Solihin, and M. Prvulovic. FlexiTaint: A Programmable Accelerator for Dynamic Taint Propagation. In the Proc. of the International Symposium on High-Performance Computer Architecture (HPCA), 2008.

[42] G. Venkataramani, B. Roemer, Y. Solihin, and M. Prvulovic. MemTracker: Efficient and Programmable Support for Memory Access Monitoring and Debugging. In the Proc. of the International Symposium on High-Performance Computer Architecture (HPCA), 2007.

[43] R. Wahbe. Efficient Data Breakpoints. In the Proc. of the International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 1992.

[44] Z. Wang, X. Jiang, W. Cui, and P. Ning. Countering Kernel Rootkits with Lightweight Hook Protection. In the Proc. of the Conference on Computer and Communications Security (CCS), 2009.

[45] E. Witchel. Considerations for Mondriaan-like Systems. In the Workshop on Duplicating, Deconstructing, and Debunking (WDDD), 2009.

[46] E. Witchel, J. Cates, and K. Asanovic. Mondrian Memory Protection. In the Proc. of the International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2002.

[47] M. T. Yourst. PTLsim: A Cycle Accurate Full System x86-64 Microarchitectural Simulator. In the Proc. of the International Symposium on Performance Analysis of Systems and Software (ISPASS), 2007.

[48] M. Zagha, B. Larson, S. Turner, and M. Itzkowitz. Performance Analysis Using the MIPS R10000 Performance Counters. In the Proc. of the ACM/IEEE Conference on Supercomputing, 1996.

[49] Q. Zhao, D. Bruening, and S. Amarasinghe. Umbra: Efficient and Scalable Memory Shadowing. In the Proc. of the Symposium on Code Generation and Optimization (CGO), 2010.

[50] P. Zhou, F. Qin, W. Liu, Y. Zhou, and J. Torrellas. iWatcher: Efficient Architectural Support for Software Debugging. In the Proc. of the International Symposium on Computer Architecture (ISCA), 2004.