A Resistive TCAM Accelerator for Data-Intensive Computing


Qing Guo²  Xiaochen Guo¹  Yuxin Bai¹  Engin İpek¹,²

¹Department of Electrical and Computer Engineering
²Department of Computer Science

University of Rochester

Rochester, NY 14627 USA

{qing.guo, yuxin.bai, xiaochen.guo, engin.ipek}@rochester.edu

ABSTRACT

Power dissipation and off-chip bandwidth restrictions are critical challenges that limit microprocessor performance. Ternary content addressable memories (TCAM) hold the potential to address both problems in the context of a wide range of data-intensive workloads that benefit from associative search capability. Power dissipation is reduced by eliminating instruction processing and data movement overheads present in a purely RAM based system. Bandwidth demand is lowered by processing data directly on the TCAM chip, thereby decreasing off-chip traffic. Unfortunately, CMOS-based TCAM implementations are severely power- and area-limited, which restricts the capacity of commercial products to a few megabytes, and confines their use to niche networking applications.

This paper explores a novel resistive TCAM cell and array architecture that has the potential to scale TCAM capacity from megabytes to gigabytes. High-density resistive TCAM chips are organized into a DDR3-compatible DIMM, and are accessed through a software library with zero modifications to the processor or the motherboard. On applications that do not benefit from associative search, the TCAM DIMM is configured to provide ordinary RAM functionality. By tightly integrating TCAM with conventional virtual memory, and by allowing a large fraction of the physical address space to be made content-addressable on demand, the proposed memory system improves average performance by 4× and average energy consumption by 10× on a set of evaluated data-intensive applications.

Categories and Subject Descriptors

B.3 [Memory Structures]

General Terms

Design, Performance

Keywords

Resistive memory, TCAM, Accelerator

*This work was supported in part by NSF CAREER award CCF-1054179 and gifts from Qualcomm.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
MICRO'11 December 3-7, 2011, Porto Alegre, Brazil
Copyright 2011 ACM 978-1-4503-1053-6/11/12 ...$10.00.

1. INTRODUCTION

As CMOS scaling continues into the billion-transistor era, power dissipation and off-chip bandwidth limitations are threatening to bring an end to microprocessor performance growth. Data-intensive applications such as data mining, information retrieval, video processing, and image coding demand significant computational power and generate substantial memory traffic, which places a heavy strain on both off-chip bandwidth and overall system power. Device, circuit, and architecture innovations are needed to surmount this problem.

Ternary content addressable memories (TCAM) are an attractive solution to curb both power dissipation and off-chip bandwidth demand in a wide range of data-intensive applications. When associative lookups are implemented using TCAM, data is processed directly on the TCAM chip, which decreases off-chip traffic and lowers bandwidth demand. Often, a TCAM-based system also improves energy efficiency by eliminating instruction processing and data movement overheads that are present in a purely RAM based system. Unfortunately, even an area-optimized, CMOS-based TCAM cell is over 90× larger than a DRAM cell at the same technology node, which limits the capacity of commercially available TCAM parts to a few megabytes, and confines their use to niche networking applications.

This paper explores a new technique that aims at cost-effective, modular integration of a high-capacity TCAM system within a general-purpose computing platform. TCAM density is improved by more than 20× over existing, CMOS-based parts through a novel, resistive TCAM cell and array architecture. High-capacity resistive TCAM chips are placed on a DDR3-compatible DIMM, and are accessed through a user-level software library with zero modifications to the processor or the motherboard. The modularity of the resulting memory system allows TCAM to be selectively included in systems running workloads that are amenable to TCAM-based acceleration; moreover, when executing an application or a program phase that does not benefit from associative search capability, the TCAM DIMM can be configured to provide ordinary RAM functionality. By tightly integrating TCAM with conventional virtual memory, and by allowing a large fraction of the physical address space to be made content-addressable on demand, the proposed memory system improves average performance by 4× and average energy consumption by 10× on a set of evaluated data-intensive applications.


2. BACKGROUND AND MOTIVATION

We review CMOS-based ternary CAMs to motivate resistive TCAM systems, and provide background on resistive memories as applicable to memory system design.

2.1 Associative Computing and Ternary Content Addressable Memories

An effective way of addressing power and bandwidth limitations in many data-intensive workloads is to use memories that are accessed based on content (also known as association [28]), rather than index. Associative computing, which leverages memory systems that store and retrieve data by association, has been broadly applied to both software and hardware design. The best-known software solution is a hash table, whereby data is located by an address computed through a hash function. A hash table has O(1) lookup time and has proven more efficient than other data structures in handling sparse data (i.e., when the number of keys in use is far less than the total number of possible keys). On the hardware side, a simple example of associative computing is the use of content addressable memory (CAM), which has seen wide use since the 1970s [9]. Nowadays, CAMs are commonly used in highly associative caches, translation lookaside buffers (TLBs), and microarchitectural queues (e.g., issue queues). Compared to a hash table, a CAM has no collision problems and offers better storage utilization and shorter search time; moreover, a CAM avoids the software overhead of rehashing and chaining.

A ternary CAM (TCAM) is a special type of associative memory that allows for both storing and searching with a wildcard (X) in addition to a logic zero or one. A wildcard, when part of either the search key or the stored data word, matches against both binary states (as well as another wildcard). This flexibility can be exploited by a surprisingly large number of applications.

TCAM has been widely used in networking for decades. The primary commercial application of earlier-generation TCAM was in routers [38], where it was used to store the routing table and to perform fast lookups through longest prefix matching [52]. With technology scaling over the last 20 years, TCAM capacity has increased from 64Kb [57] to 9Mb [26], while typical array width has grown from 72 bits to 576 bits. Numerous networking applications have emerged to leverage the benefits of TCAM, including packet classification [29], access control list filtering [39], and network intrusion detection [11].

Table 1 shows a comparison among CMOS-based TCAM, SRAM, and DRAM chips. State-of-the-art TCAM devices are as fast as SRAM, but the cost-per-bit is 8× higher due to the lower density of TCAM cells. Furthermore, TCAM is 10× more power hungry than SRAM and DRAM.

An example TCAM cell is shown in Figure 1: the two pairs of cross-coupled inverters act as bistable storage elements that hold the cell's value, and the two access transistors M1 and M2 are used to write a new value to the cell. On a search, the cross-coupled inverters supply the cell's contents, and the bottom four NMOS transistors (M3-M6) compare the search key to the data stored in the cell.

When searching for a 0 or 1, the searchlines supply the complementary search values SL and SL̄; when searching with a wildcard (X), both SL and SL̄ are driven low. On a mismatch, one of the pull-down paths (M3-M5 or M4-M6) is activated; on a match, all pull-down paths are inactive. A CMOS TCAM cell is relatively large: even with an area-efficient implementation that uses half-latches instead of cross-coupled inverters [8], its area is 541F² (where F is the feature size), which is 3.8× as large as an SRAM cell, and over 90× as large as a DRAM cell.

D | D̄ | SL | SL̄ | Match
0 | 1 | 0 | 1 | 1
0 | 1 | 1 | 0 | 0
1 | 0 | 0 | 1 | 0
1 | 0 | 1 | 0 | 1
1 | 1 | - | - | 1
- | - | 0 | 0 | 1

Figure 1: Illustrative example of a CMOS TCAM cell and its truth table.
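The truth table can be captured in a few lines of code. The following is a minimal C sketch of ternary match semantics (a behavioral model, not the circuit), using a hypothetical trit encoding:

```c
#include <stdbool.h>

/* A trit stores 0, 1, or X (wildcard). */
typedef enum { TRIT_0, TRIT_1, TRIT_X } trit;

/* Reproduces the truth table of Figure 1: a wildcard on either side
   always matches; two binary values match only when equal. */
static bool trit_match(trit stored, trit key) {
    return stored == TRIT_X || key == TRIT_X || stored == key;
}

/* A row matches only if every cell matches its key trit. */
static bool row_match(const trit *row, const trit *key, int n) {
    for (int i = 0; i < n; i++)
        if (!trit_match(row[i], key[i]))
            return false;
    return true;
}
```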

2.2 Resistive Memory Technologies

As CMOS scales to 22nm and beyond, charge-based memory technologies such as DRAM, Flash, and SRAM are starting to experience scalability problems [24]. In response, the industry is exploring resistive memory technologies that can serve as scalable, long-term alternatives to charge-based memories. Resistive memories, which include phase-change memory (PCM) and spin-torque transfer magnetoresistive RAM (STT-MRAM), store information by modulating the resistance of nanoscale storage elements, and are expected to scale to much smaller geometries than charge-based memories. Moreover, resistive memories are non-volatile, which provides near-zero leakage power and immunity to radiation-induced transient faults. However, because resistive memories need to change material states to update stored data, they generally suffer from high write energy, long write latency, and limited write endurance.

PCM. PCM [30] is arguably the most mature of the resistive memory technologies, as evidenced by 128Mb parts that are currently in production [16], as well as gigabit array prototypes [51, 12]. A PCM cell is formed by sandwiching a chalcogenide phase-change material such as Ge2Sb2Te5 (GST) between two electrodes. PCM resistance is determined by the atomic ordering of this chalcogenide storage element, and can be changed from less than 10KΩ in the crystalline state to greater than 1MΩ in the amorphous state [51, 24]. On a write, a high-amplitude current pulse is applied to the cell to induce Joule heating. A slow reduction in write current causes the device to undergo a fast annealing process, whereby the material reverts to a crystalline state; conversely, an abrupt reduction in current causes the device to retain its amorphous state. On a read, a sensing current lower than the write current is passed through the cell, and the resulting voltage is sensed to infer the cell's content. Since the ratio of the high (RHI) and low (RLO) resistances is as high as 100, a large sensing margin is possible; however, the absolute resistance is in the megaohm range, which leads to large RC delays, and hence, slow reads [24]. PCM also suffers from finite write endurance; the typical number of writes that can be performed before a cell wears out is 10^6-10^8 [24]. Consequently, several architectural techniques have been proposed to alleviate PCM's wear-out problem [47, 23].

STT-MRAM. As a universal embedded memory candidate, STT-MRAM has a read speed as fast as SRAM [62], practically unlimited write endurance [4], and favorable delay and energy scaling characteristics [24]. Multi-megabit array prototypes at competitive technology nodes (e.g., 45nm, 65nm) have already been demonstrated [56, 33], and the ITRS projects STT-MRAM to be in production by 2013 [24]. In STT-MRAM, information is stored by modulating the magnetoresistance of a thin-film stack called a magnetic tunnel junction (MTJ). An MTJ is typically implemented using two ferromagnetic Co40Fe40B20 layers and an MgO tunnel barrier that separates the two layers. One of the ferromagnetic layers, the pinned layer, has a fixed magnetic spin, while the magnetic spin of the free layer can be altered by applying a high-amplitude current pulse through the MTJ. Depending on the direction of the current, the magnetic polarity of the free layer can be made either parallel or antiparallel to that of the pinned layer. In the case of antiparallel alignment, the MTJ exhibits high resistance (12.5KΩ at 22nm [24]); in the case of parallel alignment, a low resistance (5KΩ) is observed [24]. Although the typical RHI/RLO ratio (2.5) is lower than that of PCM (100), it is still relatively easy to sense the state of a single bit [54].

Memory | Single-chip Capacity | $/chip | $/MByte | Access Speed (ns) | Watts/chip | Watts/MByte
DRAM | 128MB | $10-$20 | $0.08-$0.16 | 40-80 | 1-2 | 0.008-0.016
SRAM | 9MB | $50-$70 | $5.5-$7.8 | 3-5 | 1.5-3 | 0.17-0.33
TCAM | 4.5MB | $200-$300 | $44.5-$66.7 | 4-5 | 15-20 | 3.33-4.44

Table 1: Comparison of high-speed memory technologies [18].

3. OVERVIEW

This paper proposes a novel resistive TCAM chip, which can be integrated on a DDR3-compatible DIMM and selectively placed on the memory bus. Figure 2 presents an example computer system with the proposed TCAM DIMM. A multicore processor connects to main memory through an on-chip memory controller. The TCAM DIMM sits side-by-side with DRAM on the DDR3 bus. An on-DIMM TCAM controller serves as the interface to DDR3, and is in charge of DIMM control. The processor communicates with the controller through a set of memory-mapped control registers (for configuring functionality) and a memory-mapped key store that resides with the controller (for buffering the search key). A 2KB result store on the controller die buffers search results for multiple processes. All TCAM chips share the on-DIMM command, address, and data buses; however, a search operation is restricted to a single chip due to power constraints. Each TCAM chip has 8 banks; a bank comprises a set of arrays that are searched against the query key, as well as a hierarchical reduction network for counting the number of matches and picking the highest-priority matching row.

Figure 2: Illustrative example of a computer system with the proposed resistive TCAM DIMM.

4. CIRCUIT-LEVEL BUILDING BLOCKS

Building a high-performance, low-power TCAM system requires the development of a high-density resistive TCAM cell, as well as an appropriate row organization with attendant sensing and writing mechanisms. However, achieving the required cell density is complicated by the fact that cells need to be written as well as searched, which, in a naïve implementation, would require multiple access transistors per storage element to update cell contents. A TCAM design that leverages the matchlines to write the cells is proposed next to address this challenge.

4.1 Resistive TCAM Cell

The area of a TCAM cell not only affects the cost per bit in a multi-megabit array, but also has a profound effect on speed and energy, since it determines the length (and thus, capacitance) of the matchlines and searchlines that need to be charged and discharged on every access. Figure 3 demonstrates the proposed area-efficient resistive TCAM cell. In the figure, a resistor represents a resistive storage element, which could be a GST stack or an MTJ (the impact of the exact technology on row organization, array architecture, and performance is explained later in Sections 4.2, 5, and 8). A TCAM cell consists of three pairs of resistors and access transistors. The first two resistors store the data bit and its complement; the third resistor is permanently programmed to RHI. To store a logic 1 or 0, the leftmost resistor is programmed to store the data bit (D), while the resistor in the middle is programmed to store the complement of the bit (D̄). For example, when storing a logic 1 (Figure 4-a), the resistor on the left is programmed to RHI, and the resistor in the middle is programmed to RLO. To store a wildcard (X), the two leftmost resistors are both programmed to RHI.

Figure 3: Illustrative example of a resistive TCAM cell. (ML: matchline; SL: searchline; SX: search-X line.)

To search for a logic 0 or 1, SL and SL̄ are driven with the search bit and its complement, respectively, turning one of the access transistors on, and the other off. A match is decided based on the effective resistance between the matchline and ground. If a resistor in its high-resistance state is in series with the on transistor—adding a resistance of RHI between the matchline and ground—the search results in a match; conversely, a resistance of RLO connected to the matchline indicates a mismatch. To search for a wildcard (X), SL and SL̄ are disabled and SX is driven high; hence, a resistor in its RHI state is connected to the matchline regardless of the value stored in the cell. Examples are shown in Figure 4: (a) demonstrates a mismatch case, where the search bit is 0 and the stored data is 1; (b) presents a match scenario, which searches for a 0 when a 0 is stored; (c) shows a case where a wildcard (X) is stored in the cell, in which case searching for either a 0 or a 1 will result in a match; and (d) illustrates a search for a wildcard (X). The proposed TCAM cell consists of three 1T1R cells sharing matchline and GND contacts. At 22nm, this amounts to an area of 27F² (3× the 1T1R PCM cell size projected by ITRS [24]), which is 1/20 of a CMOS TCAM cell's area (Section 2.1).

Figure 4: Illustrative example of a resistive TCAM cell on a match (b, c, d) and on a mismatch (a). Inactive circuit paths are shown in light grey.

4.2 Row Organization

A resistive TCAM row has an organization similar to that of a CMOS TCAM, where cells are cascaded to connect to a matchline in parallel. An example of the proposed row organization is shown in Figure 5. The key idea is that, on a match, each cell will connect the matchline to ground through an RHI path, whereas at least one cell will provide an RLO path to ground on a mismatch. Hence, the effective parallel resistance to ground (and hence, the matchline voltage) will always be higher in the case of a match.

Figure 5: Illustrative example of a resistive TCAM row.

A precharge-low sensing scheme is employed on the matchline, where the matchline is discharged in the precharge phase and actively driven in the evaluation phase. The input of the inverter is also charged high in the precharge phase. On a match, the gate voltage of Msense is higher than it is on a mismatch. Msense and the keeper PMOS Mk are sized so that on a match, the gate-to-source voltage of Msense is large enough to win against Mk, pulling the inverter input low and in turn driving the matchline output (MLso) high. On a mismatch, the input to the inverter stays high and MLso outputs a zero.

Searching. Searching a TCAM row involves distinguishing the effective matchline-to-ground resistance on a word match, where all bits stored in a row match the corresponding search bits, from the effective resistance on a word mismatch, where one or more bits mismatch the search key. In a TCAM row, each matching bit contributes a parallel resistance of RHI + RON, and each mismatching bit contributes RLO + RON to the parallel resistance of the row, where RON represents the on-resistance of an access transistor. On a word match, the total resistance between the matchline and ground is

Rmatch = (RHI + RON) / N,

where N is the number of bits in a word. On a worst-case mismatch (N-1 matching bits plus one mismatching bit), the matchline-to-ground resistance is

Rmismatch = (RLO + RON)(RHI + RON) / ((N-1)(RLO + RON) + (RHI + RON)).

Figure 6: Illustrative examples of a TCAM row on a word mismatch (top) and on a word match (bottom). Inactive circuit paths are shown in light grey.

An example is shown in Figure 6: on the top is a mismatch case, which has one resistor in its low-resistance state connected to the matchline, and on the bottom is a match case, in which all resistors connected to the matchline are in their high-resistance state. The ratio between the match and worst-case mismatch resistances is

Rmatch / Rmismatch = 1 + (RHI - RLO) / (N(RLO + RON)).

This Rmatch/Rmismatch ratio must be high enough to tolerate process-voltage-temperature (PVT) variations, which affect both the transistors and the resistive devices. Therefore, a resistive memory technology with a greater difference between its high and low resistances (e.g., PCM) allows a larger number of bits to be sensed in parallel, whereas a technology with a lower RHI/RLO ratio (e.g., STT-MRAM) has a tighter constraint on the number of bits that can be searched at the same time.
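To make these expressions concrete, the sketch below plugs in the PCM and STT-MRAM parameters later listed in Table 6; the access-transistor on-resistance RON is an assumed placeholder (10KΩ), not a value given in the paper. Under these assumptions, a 128-bit parallel search yields a ratio near 1.3 for PCM but barely above 1.0 for STT-MRAM, which is why the search width must be constrained much more tightly for STT-MRAM.

```c
#include <stdio.h>

/* Matchline-to-ground resistance on a word match and on a worst-case
   mismatch (N-1 matching bits plus one mismatching bit). */
static double r_match(double r_hi, double r_on, int n) {
    return (r_hi + r_on) / n;
}

static double r_mismatch(double r_hi, double r_lo, double r_on, int n) {
    double hi = r_hi + r_on, lo = r_lo + r_on;
    return (lo * hi) / ((n - 1) * lo + hi);
}

int main(void) {
    const double R_ON = 10e3;  /* assumption: ~10 KOhm on-resistance */
    const int N = 128;         /* bits searched in parallel (one segment) */
    struct { const char *name; double r_hi, r_lo; } tech[] = {
        { "PCM",      1.0e6,  15e3 },  /* Table 6 */
        { "STT-MRAM", 12.5e3,  5e3 },  /* Table 6 */
    };
    for (int i = 0; i < 2; i++) {
        double rm  = r_match(tech[i].r_hi, R_ON, N);
        double rmm = r_mismatch(tech[i].r_hi, tech[i].r_lo, R_ON, N);
        printf("%-9s Rmatch = %8.1f  Rmismatch = %8.1f  ratio = %.3f\n",
               tech[i].name, rm, rmm, rm / rmm);
    }
    return 0;
}
```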

At the evaluation stage, current is supplied through the enabling PMOS transistor (Men), and the voltage drop on the TCAM cells is sensed by a matchline sense amplifier. Men is sized appropriately to limit the current through the resistors so that a search will not accidentally change cell states.

Figure 7: Sensing margin as a function of key width.

Figure 7 shows Cadence (Spectre) simulation results on PCM and STT-MRAM (details on the experimental setup are presented in Section 7). As the number of simultaneously searched bits increases, both STT-MRAM and PCM encounter a significant drop in the voltage difference between a match and a mismatch, but the drop is much steeper in the case of STT-MRAM. Taking PVT variations into account, a smaller voltage difference (i.e., sensing margin) results in lower yield; hence, to tolerate variations and to improve yield, the number of bits that can be searched simultaneously must be limited. Given its superior sensing margin, the rest of this paper assumes a 22nm PCM technology to implement the resistive storage elements. We have confirmed that the TCAM array functions correctly in the face of extreme variations in RHI (1MΩ-500KΩ) and RLO (15KΩ-30KΩ). These resistance ranges cover 98% of all devices fabricated in a recent 45nm PCM prototype [51]; nevertheless, if the variation were to be higher, the search width could be reduced to improve the margin.

Writing. In order to maintain the density advantage of resistive memories in TCAM design, it is important not to unduly increase the number of transistors in a cell. Writing would normally require a write port, which would need additional access transistors and wires. Instead, we propose bulk-sequential writes, a new write mechanism that leverages the matchlines for writing, in order to eliminate the need for extra write ports.

Figure 8 shows an example of how a TCAM row gets written under bulk-sequential writes. During a write, the sense amplifier and all search-X lines are disabled. Writing takes place in two phases. In the first phase, all resistors to be updated to RLO are connected to the matchline by enabling the relevant searchlines, and are programmed by applying the appropriate switching current through the matchline. Next, resistors to be updated to RHI are written in two steps. In the first step, the leftmost resistor in each cell to be updated with a 1 or X is programmed; in the second step, the middle resistor in every cell to be updated with a 0 or X is written. This two-step procedure when writing RHI limits the maximum number of resistors to be programmed simultaneously to 128. Minimum-pitch local wires at 22nm can carry enough current to write these 128 bits in parallel, and a 200F² write driver supplies adequate switching current.
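The write ordering above can be summarized as a schedule of searchline enables. The following C sketch plans the three programming steps for one row; the data-structure layout is purely illustrative (in hardware, the schedule is implied by the searchline drivers).

```c
#include <stdint.h>

typedef enum { TRIT_0, TRIT_1, TRIT_X } trit;

/* sl[i] enables the leftmost resistor of cell i (stores D);
   slb[i] enables the middle resistor (stores the complement of D). */
struct write_step { uint8_t sl[128], slb[128]; int program_hi; };

static void plan_bulk_write(const trit *data, int n, struct write_step step[3]) {
    for (int i = 0; i < n; i++) {
        /* Phase 1: program R_LO (leftmost for a stored 0, middle for a 1). */
        step[0].sl[i]  = (data[i] == TRIT_0);
        step[0].slb[i] = (data[i] == TRIT_1);
        /* Phase 2, step 1: leftmost resistors to R_HI for cells storing 1 or X. */
        step[1].sl[i]  = (data[i] == TRIT_1 || data[i] == TRIT_X);
        step[1].slb[i] = 0;
        /* Phase 2, step 2: middle resistors to R_HI for cells storing 0 or X. */
        step[2].sl[i]  = 0;
        step[2].slb[i] = (data[i] == TRIT_0 || data[i] == TRIT_X);
    }
    step[0].program_hi = 0;
    step[1].program_hi = step[2].program_hi = 1;
}
```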

Reading. Since the proposed TCAM architecture is ultimately integrated on a discrete, DDR3-compatible DIMM, it can be selectively included in systems that run workloads that can leverage associative search. Nevertheless, even for such systems, it is desirable to use the TCAM address space as ordinary RAM when needed. Fortunately, this is straightforward to accomplish with the proposed TCAM array.

When the TCAM chip is configured as a RAM module, data is stored and retrieved column-wise in the TCAM array by the TCAM controller, and the searchlines SL, SL̄, and SX serve as word lines. The key observation that makes such configurability possible is that reading a cell is equivalent to searching the corresponding row with a 1 at the selected column position, and with wildcards at all other bit positions. An illustration of the array and additional details on the required array-level support to enable configurability are given in Section 5.
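That observation reduces a RAM-mode read to a one-hot search. A minimal sketch, building on the trit helpers from the earlier example (the stuck-at-X caveat noted in the comment is handled by the error-detection procedure below):

```c
/* Read back cell (r, c) of an array configured as RAM: search row r with
   a 1 at column c and wildcards elsewhere. The row matches iff the cell
   stores 1 (a cell storing X would also match; write verification rules
   that case out). Uses trit, TRIT_1/TRIT_X, and row_match from the
   earlier sketch. */
static int read_bit(const trit bank[][1024], int r, int c, int width) {
    trit key[1024];
    for (int i = 0; i < width; i++)
        key[i] = TRIT_X;
    key[c] = TRIT_1;
    return row_match(bank[r], key, width) ? 1 : 0;
}
```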

Detecting errors. If the cells are implemented with a resistive memory technology that suffers from finite write endurance (e.g., PCM), it is necessary to verify the correctness of every write to the array. Error detection takes place in two steps. In the first step, the enable bits (Section 5) of all rows in the array are flash-cleared, the enable bit of the newly written row is set high, and the array is searched twice—once with 1s in place of wildcards, and once with 0s—to ensure that all match cases function correctly. In the second step, each newly written RLO value is checked one cell at a time (128 array searches maximum) to prevent an inadvertent wildcard from being stored to the row. Because a checker search accesses only one of 1024 arrays, the energy overhead of error detection is 1/8th of a full search.

The proposed error detection technique allows for detecting permanent write failures. If, on the other hand, the technology suffers from transient faults, the TCAM controller periodically puts the TCAM DIMM into RAM mode and refreshes the contents (similar to DRAM scrubbing).
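A behavioral sketch of this two-step check; the controller hooks (flash_clear_enables, set_enable, array_search) are hypothetical names, not the paper's interface:

```c
#include <stdbool.h>

typedef enum { TRIT_0, TRIT_1, TRIT_X } trit;

/* Hypothetical controller hooks. */
void flash_clear_enables(void);
void set_enable(int row);
bool array_search(const trit *key, int width);  /* any enabled row match? */

static bool verify_row(int row, const trit *data, int width) {
    trit key[1024];
    flash_clear_enables();
    set_enable(row);                     /* isolate the freshly written row */

    /* Step 1: the row must match its own data, with X replaced first by 1s
       and then by 0s. */
    for (int pass = 0; pass < 2; pass++) {
        for (int i = 0; i < width; i++)
            key[i] = (data[i] == TRIT_X) ? (pass ? TRIT_0 : TRIT_1) : data[i];
        if (!array_search(key, width))
            return false;                /* a match case failed */
    }

    /* Step 2: each binary cell must MISmatch the opposite bit; a match here
       means an R_LO write failed and the cell holds an inadvertent X. */
    for (int i = 0; i < width; i++) {
        if (data[i] == TRIT_X) continue;
        for (int j = 0; j < width; j++) key[j] = TRIT_X;
        key[i] = (data[i] == TRIT_0) ? TRIT_1 : TRIT_0;
        if (array_search(key, width))
            return false;                /* stuck-at-wildcard detected */
    }
    return true;
}
```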

Figure 8: Illustrative example of writing the three-bit pattern "10X" to a row by (1) programming all low resistors, (2) programming all leftmost high resistors, and (3) programming all middle high resistors. Inactive circuit paths are shown in light grey.

5. ARRAY ARCHITECTURE

Figure 9 shows an example 1K×1K TCAM array. Cells (C) are arranged in a 2D organization with horizontal matchlines (ML) and vertical searchlines (SL). Searchlines are driven by search key registers and drivers, while matchlines connect to sense amplifiers (SA). A hierarchical reduction network facilitates counting the number of matching lines and selecting one of multiple matching rows (Figure 10).

Figure 9: Illustrative example of a 1K×1K TCAM array with 128b-wide matchline segmentation. (C: TCAM cell; ML: matchline; SL: searchline; SX: search-X line; SAL: sense and latch. Each matchline's sense amplifier feeds a helper flip-flop whose enable and partial-result outputs drive the local population count and priority index logic.)


An important design decision that affects area efficiency, power, speed, and reliability is the size of an array. Larger arrays improve area efficiency by amortizing the area overhead of the peripheral circuitry over more cells; however, the sensing margin deteriorates with the number of bits that are searched in parallel (Figure 7). Table 2 compares absolute area and area efficiency for 1Gb TCAM chips constructed from arrays of different sizes (see Section 7 for the experimental setup). As array size increases from 128×128 to 1K×1K, overall chip area reduces by more than 2×.

Array Size | Total Chip Area (mm²) | Area Efficiency
1K×1K | 36.00 | 38.98%
512×512 | 41.04 | 34.19%
256×256 | 51.49 | 27.25%
128×128 | 77.51 | 18.10%

Table 2: Total chip area and area efficiency with different array sizes.

Interestingly, it is possible to enjoy the high area efficiency of a large array while also delivering the large sensing margin of a smaller array through matchline segmentation. The key idea is to build a wide row, only a part (segment) of which can be searched simultaneously, providing additional peripheral circuitry to support iterative searching for a key larger than the size of a segment. Figure 9 shows an example 1K×1K array, where each row is partitioned into 128b segments. To perform a full 1024b-wide search across a row, the array is accessed eight times, once per segment. On each access, searchlines connected to the relevant segment are driven with the corresponding bits from the search key, while all other searchlines are driven low. Each row is augmented with a helper flip-flop that stores the partial result of the ongoing search operation as different segments are accessed. This helper flop is initially set high; at the end of each cycle, its content is ANDed with the sense amplifier's output, and the result overwrites the old content of the helper flop. Hence, at the end of eight cycles, the helper flop contains a logic 1 if and only if all segments within the row match. To permanently disable a row, an additional "enable" flop is added to the design.
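A behavioral C sketch of segmented search, with the helper flip-flop modeled as a per-row boolean and the per-segment sensing left as a hypothetical hook:

```c
#include <stdbool.h>

#define ROWS     1024
#define SEGMENTS 8                    /* 8 x 128b segments = 1024b row */

/* Hypothetical hook: sense one 128b segment of one row against the
   corresponding key slice (all other searchlines held low). */
bool segment_match(int row, int seg, const unsigned char key_slice[16]);

/* helper[r] plays the role of row r's helper flip-flop. */
static void segmented_search(const unsigned char key[SEGMENTS][16],
                             bool helper[ROWS]) {
    for (int r = 0; r < ROWS; r++)
        helper[r] = true;                        /* flop initially set high */
    for (int seg = 0; seg < SEGMENTS; seg++)     /* one segment per cycle */
        for (int r = 0; r < ROWS; r++)
            helper[r] = helper[r] && segment_match(r, seg, key[seg]);
    /* After eight cycles, helper[r] is 1 iff every segment of row r matched. */
}
```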

The proposed resistive TCAM is capable of providing two different results on a search operation: (1) a population count indicating the total number of matching rows in the system, and (2) a priority index indicating the address of the matching row with the lowest index. Once the search is complete, the array's local population count and priority logic start operating on the results (Figure 10).

Priority index logic. The highest-priority row among all matching rows is selected in a hierarchical fashion. 32 priority encoders are used in the first level of the hierarchy, where each encoder selects one valid matchline out of 32 helper flops; also selected are the 5b addresses of the corresponding matchlines. Then, 32 valid bits with corresponding addresses enter the second-level priority encoder. Finally, an 11b result, containing 1 valid bit and a 10b address, is selected and forwarded to the reduction network.
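Behaviorally, the two-level encode can be sketched as follows (32 first-level encoders over 32 helper flops each, then a second-level pick of the lowest-index valid group):

```c
#include <stdbool.h>

/* Returns the 10b address of the lowest-index matching row and sets
   *valid; mirrors the 32 x 32 hierarchical organization. */
static int priority_index(const bool helper[1024], bool *valid) {
    bool l1_valid[32];
    int  l1_addr[32];                       /* lower 5b address per group */
    for (int g = 0; g < 32; g++) {          /* first level: 32 encoders */
        l1_valid[g] = false;
        l1_addr[g]  = 0;
        for (int i = 0; i < 32; i++)
            if (helper[g * 32 + i]) { l1_valid[g] = true; l1_addr[g] = i; break; }
    }
    for (int g = 0; g < 32; g++)            /* second level: lowest valid group */
        if (l1_valid[g]) {
            *valid = true;
            return (g << 5) | l1_addr[g];   /* upper 5b group | lower 5b index */
        }
    *valid = false;
    return 0;
}
```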

Population count logic. The local population count for the array is computed iteratively, accumulating the result of a 64b population count each cycle (itself computed using 16 four-input lookup tables and a carry-save adder tree). It takes a total of 16 cycles to walk the array and accumulate the population count, after which the result is sent to the reduction network for further processing.
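The same logic in a short C sketch; a compiler builtin stands in for the 16 four-input lookup tables and the carry-save adder tree:

```c
#include <stdint.h>

/* Local population count over 1024 matchline results, 64 bits per cycle;
   match_bits[c] holds the helper-flop outputs sampled in cycle c. */
static int local_pop_count(const uint64_t match_bits[16]) {
    int count = 0;
    for (int cycle = 0; cycle < 16; cycle++)   /* 16 cycles to walk the array */
        count += __builtin_popcountll(match_bits[cycle]);
    return count;
}
```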

5.1 Reduction Network

The reduction network is a quad tree whose leaves are the results of different arrays' population count and priority logic. Each internal node of the quad tree takes the outputs of its four children, processes them by counting or priority encoding, and then forwards the result to its parent (Figure 10). Once the results of a search propagate to the root of the quad tree, final results are obtained and placed in an SRAM-based result store that resides with the TCAM controller. For a fixed-capacity chip, the size of an array affects the size, and hence, the latency and energy of the reduction network. Table 3 shows the energy and delay when searching for a 128b key in an eighth of a 1Gbit chip constructed from different-size arrays. As array size increases, delay and energy within an array increase due to longer searchlines and matchlines; however, the delay and energy consumed in the reduction network decrease because there are fewer levels of priority and population count computation in the hierarchy. Since search energy dominates total energy and reduction network delay dominates total delay, enlarging the array results in higher energy and lower delay. Considering that area efficiency is highest with the largest array size (Table 2), the rest of this paper focuses on 1K×1K arrays.
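One internal node of the quad tree can be sketched as below; the message layout and the address-concatenation detail are illustrative (the address field widens by two bits at each level of the hierarchy):

```c
#include <stdbool.h>

struct reduce_msg {
    int  pop_count;  /* number of matches in this subtree */
    bool valid;      /* does this subtree contain any match? */
    int  addr;       /* address of the lowest-index match */
};

/* Children are ordered by index, so the first valid child already holds
   the highest-priority (lowest-address) match of the subtree. */
static struct reduce_msg reduce_node(const struct reduce_msg c[4],
                                     int child_addr_bits) {
    struct reduce_msg out = { 0, false, 0 };
    for (int i = 0; i < 4; i++) {
        out.pop_count += c[i].pop_count;   /* iterative accumulation in HW */
        if (!out.valid && c[i].valid) {
            out.valid = true;
            out.addr  = (i << child_addr_bits) | c[i].addr;
        }
    }
    return out;
}
```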

5.2 Result Store

Search throughput can be improved significantly by pipelining the reduction network, the local population count, and the priority computations; however, this requires the ability to buffer the results of in-flight search operations and to retrieve them at a later time.

To facilitate such buffering, the on-DIMM TCAM controller provides a 2KB SRAM mapped to the system's physical address space. As part of the search operation, the application sets up a number of memory-mapped control registers, one of which indicates the location where the search results should be placed. In this way, the application overlaps multiple search operations in a pipelined fashion, and retrieves the results through indexed reads from the result store.

6. SYSTEM ARCHITECTURE

The level of the memory hierarchy at which the TCAM is placed can have a significant impact on overall system cost, performance, and flexibility. On the one hand, an integrated solution that places the TCAM on the processor die would result in the shortest possible communication latency, at the expense of much-desired modularity. On the other hand, treating the TCAM as an I/O device and placing it on the PCI or PCI Express bus would result in a modular system architecture (in which the TCAM can be selectively included), but the high latency of I/O buses would limit performance. Instead, this paper explores an intermediate solution that places the TCAM on a DDR3-compatible DIMM with an on-DIMM TCAM controller. The resulting design requires no changes to existing processors, memory controllers, or motherboards, while providing modularity to enable selective inclusion of TCAM in systems that benefit from it. Moreover, with the ability to configure the TCAM as a regular RAM, users can also leverage the TCAM DIMM as a byte-addressable, persistent memory module.

6.1 Processor Interface

Software is given access to a TCAM accelerator by mapping the TCAM address range to the system's physical address space.

Figure 10: Architecture of the population count logic, priority index logic, and the reduction network. (AL: address line; VL: valid line.)

Array Size | Energy of Search (nJ) | Energy with Priority Index (nJ) | Energy with Pop. Count (nJ) | Delay of Local Search (ns) | Delay with Priority Index (ns) | Delay with Pop. Count (ns)
1K×1K | 244.97 | 248.59 | 249.64 | 2.50 | 21.57 | 60.28
512×512 | 176.16 | 180.22 | 181.30 | 1.75 | 21.91 | 61.48
256×256 | 96.21 | 100.85 | 102.36 | 1.00 | 23.78 | 64.63
128×128 | 55.57 | 61.78 | 63.33 | 0.50 | 28.29 | 67.67

Table 3: Energy and delay comparison of different array sizes.

In addition, the on-DIMM TCAM controller maintains a 2KB RAM array (Section 5.2) to implement memory-mapped control registers, search key buffers, and the result store. All accesses issued to TCAM are uncacheable, and subject to strong ordering, which prevents all of the requests to the memory-mapped TCAM address range from being reordered. (This can be accomplished, for example, by marking the corresponding physical pages strong-uncacheable in the page attribute table of any Intel x86 processor since the Intel 386 [22].)

Communication between the processor and TCAM takes place in four ways, as illustrated by the sketch after this list:

1. Device configuration. The processor configures the TCAM system by writing to memory-mapped control registers. Configurable system attributes include key length, required result type (population count, priority index, or both), and whether the module should operate in RAM or TCAM mode.

2. Content update. The processor stores data into TCAM control registers, after which the TCAM controller updates the TCAM array.

3. Search. The processor stores the query key into the memory-mapped TCAM key buffer, which resides with the TCAM controller. As soon as the last word of the key is received, the TCAM controller initiates a search operation whose results are written to the appropriate words within the memory-mapped result store.

4. Read. After a search, the processor loads the outcome from the result store.
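Because everything is memory-mapped and strong-uncacheable, all four interactions are plain loads and stores through volatile pointers. The sketch below illustrates the pattern; the register layout and field names are hypothetical, not the actual controller map:

```c
#include <stdint.h>

/* Hypothetical layout of the memory-mapped TCAM control page. */
struct tcam_regs {
    volatile uint32_t mode;         /* 1. device configuration        */
    volatile uint64_t store_data;   /* 2. content update staging      */
    volatile uint64_t store_mask;
    volatile uint64_t key[16];      /* 3. search key buffer (<=1024b) */
    volatile uint64_t result[256];  /* 4. 2KB result store            */
};

static void tcam_config(struct tcam_regs *t, uint32_t mode) {
    t->mode = mode;                        /* uncacheable: reaches the DIMM */
}

static void tcam_write_key(struct tcam_regs *t, const uint64_t *k, int words) {
    for (int i = 0; i < words; i++)
        t->key[i] = k[i];                  /* last word triggers the search */
}

static uint64_t tcam_result(struct tcam_regs *t, int idx) {
    return t->result[idx];                 /* read the buffered outcome */
}
```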

6.2 TCAM Controller

To plug into an existing DIMM socket with no modifications to the memory controller, the TCAM's memory bus interface should be fully compatible with DDRx and its timing constraints. Here we discuss the TCAM interface in the context of a modern DDR3 protocol and an FR-FCFS [50] based memory controller, but the memory controller's scheduling policy is orthogonal to the TCAM DIMM, since a DDR3-compatible TCAM DIMM is, by definition, compatible with any DDR3-compatible memory controller. In DDR3, a complete read (write) transaction includes a row activation (activate a row in a bank and move data to a row buffer), a column read (write) involving data transfer, and a precharge (write data back to the row and precharge the bitlines). When consecutive requests hit an open row, no activate or precharge operation is needed. The minimum number of cycles between a write and a consecutive read in DDR3 is less than the time needed to perform a search; hence, without additional support, it is possible for the memory controller to issue a read from the result store before the corresponding search operation is complete.

To avert this problem, the software library used for accessing TCAM inserts an appropriate number of dummy writes between the search and the subsequent read from the result store, generating the required number of idle cycles on the DDR3 bus for the search operation to complete. (Recall that all requests to TCAM control registers are processed in order, since the corresponding physical pages are marked strong-uncacheable.) This is accomplished by issuing writes to a known location in physical memory, which allows the TCAM controller to detect a dummy write and silently drop it upon receipt.
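A sketch of the padding idiom; the sacrificial address and the cycle arithmetic are illustrative assumptions:

```c
#include <stdint.h>

/* Hypothetical well-known physical location; the TCAM controller detects
   writes to it and silently drops them. */
static volatile uint64_t *const DUMMY_ADDR = (uint64_t *)0xF0000000u;

/* Keep the DDR3 bus occupied until the search completes; the number of
   writes needed depends on key width and on whether searches are pipelined. */
static void pad_until_search_done(int idle_cycles_needed, int cycles_per_write) {
    for (int c = 0; c < idle_cycles_needed; c += cycles_per_write)
        *DUMMY_ADDR = 0;   /* strongly ordered, so issued back-to-back */
}
```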

Although injecting dummy writes (i.e., bubbles) into the DDR3 bus guarantees correctness, it also lowers the utilization of the bus and wastes precious bandwidth. To improve search throughput, the TCAM chip is pipelined (Section 5) so that shortly after an array search is complete, the next search is allowed to begin, largely overlapping the array search operation of a later request with the reduction network and result store update of an earlier request. Consequently, when performing multiple search operations, it becomes possible to pipeline search and read operations, thereby reducing the number of required dummy writes. Figure 11 shows an example of a pipelined search with a 128b key, where pipelining improves performance by almost 3×.

6.3 Software Support

A user-level library serves as the API to the programmer, and implements four functions.

Figure 11: A consecutive write-read-write command sequence in DDR3 (a); a consecutive search-read command sequence in TCAM without pipelining (b); and a consecutive search-read command sequence in TCAM with pipelining (c). Dummy writes are labeled "NOP".

Create. This function is called at the beginning of an application to map a virtual address range to a portion of the TCAM's physical address space. Physical pages in TCAM are marked strong-uncacheable (Section 6.1), and the OS is signaled to allocate content-addressable physical pages by leveraging one of the reserved flags of the mmap system call [55]. Upon successful termination, create returns a pointer to the newly allocated TCAM space, which helps distinguish independent TCAM regions.

Destroy. This function releases the allocated TCAM space.

Store. This function updates the TCAM array. The function accepts a mask indicating the bit positions that should be set to wildcards (X), as well as the data to be stored in the remaining positions. Internally, the function communicates with the TCAM controller by writing two memory-mapped control registers, and the TCAM controller writes the data into the TCAM array. To hide the long latency of writing to TCAM, the library function distributes consecutive writes to different TCAM banks. In the case of a bank conflict, an appropriate number of dummy writes are inserted by the library to cover the busy period. (In the evaluated benchmarks, the database is stored in sequential order; thus, no bank conflicts were observed.)

Search. This function performs a search operation in three steps, by (1) storing the search key into the memory-mapped TCAM query key register, (2) issuing enough dummy writes to pad the waiting period between the search operation and the subsequent read from the result store, and (3) reading the results from the memory-mapped result store. Two different flavors of the search function are provided: a "single" search call performs a search with a single key, whereas a "batch" search call searches for multiple independent keys in a pipelined fashion to improve throughput (keys are stored in memory, and are passed to the batch search function through a pointer). A pointer to the TCAM region to be searched is given as an argument to the search call, and the library ensures that only results from the specified region are returned by storing a (searchable) unique region ID with each TCAM row.
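Taken together, the library can be used as in the sketch below. The prototypes are assumptions consistent with the description above, not the library's published signatures:

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed user-level API. */
void    *tcam_create(size_t bytes);
void     tcam_destroy(void *region);
void     tcam_store(void *region, const void *data,
                    const void *wildcard_mask, size_t bytes);
uint64_t tcam_search(void *region, const void *key, size_t key_bytes);
void     tcam_search_batch(void *region, const void *keys, size_t key_bytes,
                           int nkeys, uint64_t *results);

void lookup_all(const uint8_t db[][16], int nrows,
                const uint8_t keys[][16], int nkeys, uint64_t *out) {
    void *t = tcam_create((size_t)nrows * 16);
    const uint8_t no_wildcards[16] = { 0 };
    for (int i = 0; i < nrows; i++)              /* sequential stores spread   */
        tcam_store(t, db[i], no_wildcards, 16);  /* across banks by the library */
    tcam_search_batch(t, keys, 16, nkeys, out);  /* pipelined batch search */
    tcam_destroy(t);
}
```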

6.3.1 Multiprogramming

Supporting multiprogramming with the proposed TCAM system requires mechanisms to prevent the search results of one application from being affected by data that belongs to another application. Hence, although conventional virtual memory solves protection problems in the case of TCAM reads and writes, search operations need additional support.

To enable multiprogramming, each process is assigned an address space identifier (ASID); when the process allocates space in the TCAM address range, the OS records the ASID in a memory-mapped control register within the same physical page that contains the key buffer and the result store for that process. On a write, the TCAM controller stores the corresponding ASID along with each word in a given row. On a search, after the process writes the key buffer, the TCAM controller appends the ASID to the search key; as a result, all rows that belong to other processes result in mismatches, and do not affect search results.

6.3.2 Handling Misfits

Although the OS naturally supports virtualization of the physical address space, searching a data structure larger than the allocated TCAM space requires extra care. Specifically, two types of misfit are possible: (1) a horizontal misfit, in which the key is larger than the width of a TCAM array (e.g., a 2Kb key for a 1K×1K array), and (2) a vertical misfit, in which the number of rows to be searched exceeds the capacity of the allocated TCAM region.

To solve the horizontal misfit problem, the word is broken into a series of 1Kb-wide subwords, each of which is stored in a consecutive row of an array (if the final subword is narrower than 1Kb, it is padded with enough wildcards (X) to cover the row). On a search, the TCAM controller partitions the key in units of 1Kb subwords, and searches the array one row after another. At the end of a search operation, the content of the helper flip-flop that connects to a row is shifted into the helper flip-flop of the next row, and is ANDed with the outcome of the next row's search operation. Hence, the helper flop of the final row contains the complete search result.
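Behaviorally, the shifted helper flop turns a wide search into an AND across consecutive rows, as in this sketch (row_search is a hypothetical hook for one row-wide array access):

```c
#include <stdbool.h>

/* Hypothetical hook: search one physical row with one 1Kb key subword. */
bool row_search(int row, const unsigned char key_subword[128]);

/* A logical word spanning `subwords` consecutive rows matches iff every
   row matches its subword; `helper` models the shifted helper flip-flop. */
static bool wide_search(int first_row, const unsigned char key[][128],
                        int subwords) {
    bool helper = true;
    for (int s = 0; s < subwords; s++)
        helper = helper && row_search(first_row + s, key[s]);
    return helper;     /* final row's helper flop holds the full result */
}
```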


Core | 8 cores, 4.0 GHz, 4-issue
Functional units | Int/FP/Ld/St/Br units 2/2/2/2/2, Int/FP Mult 1/1
IQ, LSQ, ROB size | IssueQ 32, LoadQ/StoreQ 24/24, ROB 96
Physical registers | Int/FP 96/96
Branch predictor | Hybrid, local/global/meta 2K/2K/8K, 512-entry direct-mapped BTB, 32-entry RAS
IL1 cache (per core) | 32KB, direct-mapped, 32B block, 2-cycle hit time
DL1 cache (per core) | 32KB, 4-way, LRU, 32B block, hit/miss delay 3/3, MESI protocol
L2 cache (shared) | 4MB, 8-way, LRU, 64B block, 24-cycle hit time
Memory controller | 4-channel, 64-entry queue, FR-FCFS, page interleaving
DRAM subsystem | DDR3-1066 MHz
Timing (DRAM cycles) [41] | tRCD: 7, tCL: 7, tWL: 6, tCCD: 4, tWTR: 4, tWR: 8, tRTP: 4, tRP: 7, tRRD: 4, tRAS: 20, tRC: 27, tBURST: 4, tFAW: 20

Table 4: Core parameters.

Benchmarks | Input | Description | TCAM Content | TCAM Search Keys
Apriori [5] | 95,554 transactions, 1000 items, 2.7MB | association rule mining | transaction database | candidate itemsets
BitCount [20] | 75,000 array size, 64 bits per element | non-recursive bit count by bytes | integer data array | N-bit vectors with a single one and N-1 wildcards
Histogram [60] | 34,843,392 pixels, 104MB | pixel value distribution in bitmap image | pixel values | distinct RGB values
ReverseIndex | 78,355 files, 14,025 folders, 1.01GB | extract links and compile an index from links to files | URLs | URLs
StringMatch | 50MB non-encrypted file | string search on encrypted file and non-encrypted keys | encrypted strings | encrypted keys
Vortex [27] | 3 inter-related databases, 200MB | insert, delete and lookup operations | unique item ID | unique item ID and valid bit
WordCount | 10MB text file | count frequencies of distinct words | English words | distinct keywords

Table 5: Evaluated applications.

On a vertical misfit, the search is staged over multiple local search operations, in which the missing pages are transferred from DRAM to TCAM. This process is transparent to the user and is handled by the TCAM library. Since data transfer between TCAM and DRAM can be expensive, the search is optimized to minimize data movement. For example, if the search region is larger than the capacity of the TCAM, the library function partitions the search region to fit the TCAM space, and performs a batch search (Section 6.3) in each subregion. The final results are calculated by merging all of the partial results. This is considerably cheaper than performing each single search over the entire region, which would generate constant data movement.
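The staging loop can be sketched as follows; the helper names are hypothetical. The point of the loop order is that each chunk of the region is loaded into TCAM exactly once and searched with every key, rather than reloading the region once per key:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helpers. */
void tcam_load(void *tcam, const void *rows, size_t bytes);  /* DRAM -> TCAM */
void tcam_search_batch(void *tcam, const void *keys, size_t key_bytes,
                       int nkeys, uint64_t *results);
void merge_partial(uint64_t *final_res, const uint64_t *partial, int nkeys);

static void staged_search(void *tcam, size_t tcam_bytes,
                          const uint8_t *region, size_t region_bytes,
                          const void *keys, size_t key_bytes, int nkeys,
                          uint64_t *final_res, uint64_t *partial) {
    for (size_t off = 0; off < region_bytes; off += tcam_bytes) {
        size_t chunk = region_bytes - off < tcam_bytes
                     ? region_bytes - off : tcam_bytes;
        tcam_load(tcam, region + off, chunk);                     /* load once */
        tcam_search_batch(tcam, keys, key_bytes, nkeys, partial); /* all keys  */
        merge_partial(final_res, partial, nkeys);                 /* combine   */
    }
}
```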

7. EXPERIMENTAL SETUP

We evaluate the proposed TCAM accelerator on seven data-intensive applications from existing benchmark suites, running on a model of an eight-core processor with a DDR3-1066 memory system.

Circuits. We model the TCAM array and sensing circuitry using BSIM-4 predictive technology models (PTM) [61] of NMOS and PMOS transistors at 22nm, and conduct circuit simulations using Cadence (Spectre). Resistances and capacitances on searchlines, matchlines, and the H-tree are modeled based on interconnect projections from ITRS [24]. All peripheral circuits (decoders, population count logic, priority index logic, and reduction network nodes) are synthesized using Cadence Encounter RTL Compiler [1] with FreePDK [2] at 45nm, and the results are scaled to 22nm (relevant parameters are shown in Table 7). Resistive memory parameters, based on ITRS projections, are listed in Table 6.

Architecture. We use a heavily modified version of the SESC simulator [49] to model an eight-core system with a 1GB TCAM DIMM. Energy results for the cores and the DRAM subsystem are evaluated using McPAT [32]. Details of the experiments are shown in Table 4.

Technology | RHI | RLO | Cell Size | Write Current
PCM | 1MΩ | 15KΩ | 8.4F² | 48µA
STT-MRAM | 12.5KΩ | 5KΩ | 8F² | 35µA

Table 6: Resistive memory parameters [24].

Technology | Voltage | FO4 Delay
45nm | 1.1 V | 20.25ps
22nm | 0.83 V | 11.75ps

Table 7: Technology parameters [24, 61].

Applications. We modify seven data-intensive applications from the MiBench [20], NU-MineBench [43], SPEC2000 [27], and Phoenix [60] benchmark suites to take advantage of the proposed TCAM system. Table 5 presents a summary of each benchmark, as well as a description of how it is adapted to TCAM. Aside from BitCount and Vortex—which are sequential applications—all baselines are simulated with eight threads. TCAM-accelerated versions of all codes use one thread.

We found many other application domains amenable to TCAM acceleration, albeit with no readily available, representative benchmarks in most cases. These applications include sorting and searching [52], similarity search [53], subset and superset queries [18], decision tree training [25], search engines [13], spell checking [20], sequential pattern mining [31], packet classification [29], IP routing [38], parametric curve extraction [40], Hough transformation [42], Huffman encoding [35], Lempel-Ziv compression [58], image coding [45], and logic minimization [6]. Evaluation of these and other applications is left for future work.

8. EVALUATION

We first evaluate the contribution of different hardware structures to search energy, search delay, and overall area. We then present the performance and energy improvements of a single-threaded, TCAM-accelerated version of each application over a baseline parallel execution with eight threads.


Figure 12: TCAM performance (top) and energy (bottom) normalized to the RAM-based baseline. All baseline applicationsexcept BitCount and Vortex (sequential codes) execute with eight threads.

8.1 TCAM Delay, Energy, and Area: Where are the Bottlenecks?

Figure 13 shows the breakdown of search delay, search energy, and die area over the reduction network, local population count/priority logic, and array search operations for a 1Gb TCAM chip. Delay and energy results correspond to a chip-wide search with a 128b key; in each case, results are reported for both population count and priority index configurations of the system.

Figure 13: Delay, energy, and area breakdown.

Overall delay is dominated by the delay of the reduction network in both the priority index and population count configurations. This is because each node of the reduction network, which is implemented as a quad tree, depends on its children. To improve area efficiency, a reduction network node implements population count using an accumulator that iteratively sums the counts sent from four lower-level nodes; this adds three clock cycles per network node. As a result, obtaining the global population count takes over 3× longer than the priority index.

Although searching the array contributes a small fraction of the overall delay, it almost entirely dominates energy consumption. This is because all matchlines are activated on every search operation. Since the array search—which is needed for both the priority and population count configurations—dominates energy, both configurations of the system are equally energy-hungry.

The area efficiency of TCAM is 39%, which is competitive, but less than the projected DRAM area efficiency (50% [24]) at 22nm. Nearly half the area is devoted to the reduction network, which is responsible for distributing the 128b key to all arrays and for aggregating the results.

8.2 System Performance and Energy

Figure 12 shows performance and energy evaluations on seven data-intensive benchmarks. The TCAM accelerator achieves significant performance improvements over the baseline multicore system on six of the seven benchmarks (all except ReverseIndex), with an average speedup of 4×. The highest speedups are obtained on BitCount (71.7×) and Histogram (24.3×), where over 99% of the baseline runtime is spent in search and comparisons—operations amenable to TCAM acceleration. ReverseIndex performs considerable preprocessing, whereby a set of HTML files is parsed to extract links to other files. This portion of the application cannot leverage the TCAM accelerator; we have measured the maximum theoretical (i.e., Amdahl's Law) speedup on a single-threaded version of this benchmark, finding it to be only 1.06×. The TCAM accelerator achieves a 1.04× speedup over this sequential version, but is nearly 5× slower than the parallel version of the benchmark, which uses eight threads (recall that the TCAM-enabled versions of the benchmarks are single-threaded).

Figure 12 also shows overall system energy with the TCAM accelerator, normalized to the eight-core baseline. Six of the seven benchmarks achieve significant energy reduction, with an overall average of 10x. The efficiency gains are due to two factors. First, TCAM eliminates off-chip data movement and instruction processing overheads by processing data directly on the chip; second, the faster execution time leads to lower leakage energy. As one would expect, energy savings on ReverseIndex are lower than on the other applications due to the limited applicability of TCAM in this benchmark.

9. RELATED WORK

This paper builds upon existing work in resistive memories, TCAM, and processing in memory.

Resistive Memories. Recent research on TCAM is seeking opportunities to utilize the low leakage power and high scalability of resistive memories to improve power efficiency. PCM-based 2T2R (two pairs of transistors and resistors in parallel) TCAM cells have been demonstrated in prior work [14, 48]. In a patent [3], a 2T2R PCM TCAM is proposed with two additional bit lines. Matsunaga et al. [36] propose a bit-serial 2T2R TCAM design using MTJ devices. Xu et al. [59] propose an STT-MRAM CAM design with high sensing and search speed. Matsunaga et al. [37] later present a 6T2R cell, which adds assist circuitry to each cell to increase the sensing margin and search width. Alibart et al. [7] propose a TCAM cell with a pair of memristors and demonstrate how layout can be done in such a way that it takes full advantage of the potential for memristor densities. Eshraghian et al. [17] evaluate four types of memristor-based CAM designs. Existing resistive TCAM proposals focus on circuit design, whereas this paper explores a 3T3R TCAM cell, its array organization, and the system architecture for an on-DIMM TCAM accelerator. Moreover, prior work does not explore bulk-sequential writes, a parallel writing mechanism with no area overhead added to the cell (Section 4.2). Other proposals for improving TCAM power efficiency include a stacked 3D-TCAM design [34] and a scheme for virtualizing TCAM [10].

TCAM. TCAMs are commonly used in networking [46, 29]. Recent work applies TCAM to a wider range of applications: Panigrahy et al. use TCAM in sorting and searching [52], Goel and Gupta solve set query problems [18], and Shinde et al. study similarity search and locality sensitive hashing [53]. Hashmi and Lipasti [21] propose a TCAM accelerator as a functional unit. Other applications of TCAM include decision tree training [25], search engines [13], spell checking [20], sequential pattern mining [31], packet classification [29], IP routing [46], parametric curve extraction [40], Hough transformation [42], Huffman encoding [35], Lempel-Ziv compression [58], image coding [45], and logic minimization [6].

Processing in memory. Processing in memory has been proposed in prior work to reduce memory bandwidth demand. Elliott et al. [15] build a computational RAM, which adds the processing elements (PEs) of a SIMD machine directly to the sense amplifiers of a 4Mb DRAM chip. Gokhale et al. [19] propose the processor-in-memory (PIM) chip, which can be configured in conventional memory mode or in SIMD mode to speed up massively parallel SIMD applications. Oskin et al. [44] propose Active Pages, which adds reconfigurable logic elements to each DRAM subarray to process data. None of this prior work proposes the use of a resistive TCAM chip on a DRAM channel to accelerate search operations and to reduce memory bandwidth demand.

10. CONCLUSIONS

We have presented a new resistive TCAM cell and array architecture that deliver a 20x density improvement over existing CMOS-based solutions. We have explored a modular memory system that places resistive TCAM chips on a DDR3-compatible DIMM, and accesses the DIMM through a user-level software library with zero modifications to the processor or the motherboard. By tightly integrating TCAM with conventional virtual memory, we have observed an average performance improvement of 4x and an average energy reduction of 10x on a set of data-intensive applications that can exploit associative search capability. We believe this work is part of a larger trend toward leveraging resistive memory technologies in designing memory systems with qualitatively new capabilities. With technology scaling, power and bandwidth restrictions will become more stringent while resistive memories move to the mainstream, making such systems increasingly appealing and viable.

11. ACKNOWLEDGMENTS

The authors would like to thank Eby Friedman, Seung Kang, and the anonymous reviewers for their useful feedback.

12. REFERENCES

[1] Encounter RTL compiler. http://www.cadence.com/products/ld/rtl_compiler/.
[2] FreePDK: an open-access based PDK for the 45nm technology node. http://www.eda.ncsu.edu/wiki/FreePDK.
[3] Phase-change based TCAM architectures and cell. http://ip.com/IPCOM/000187746.
[4] A. Driskill-Smith (Grandis, Inc.). Latest advances and future prospects of STT-MRAM. In Non-Volatile Memories Workshop, University of California, San Diego, Apr. 2010.
[5] R. Agrawal and R. Srikant. Fast algorithms for mining association rules. In Proceedings of the 20th Very Large Databases Conference, Santiago de Chile, Chile, Sept. 1994.
[6] S. Ahmad and R. Mahapatra. An efficient approach to on-chip logic minimization. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 15(9):1040–1050, Sept. 2007.
[7] F. Alibart, T. Sherwood, and D. Strukov. Hybrid CMOS/nanodevice circuits for high throughput pattern matching applications. In NASA/ESA Conference on Adaptive Hardware and Systems (AHS), June 2011.
[8] I. Arsovski, T. Chandler, and A. Sheikholeslami. A ternary content-addressable memory (TCAM) based on 4T static storage and including a current-race sensing scheme. Journal of Solid-State Circuits, 38(1):155–158, Jan. 2003.
[9] K. E. Batcher. STARAN parallel processor system hardware. In Proceedings of the National Computer Conference and Exposition, AFIPS, 1974.
[10] S. Bhattacharya and K. Gopinath. Virtually cool ternary content addressable memory. In Proceedings of the 13th USENIX Conference on Hot Topics in Operating Systems, May 2011.
[11] Y.-K. Chang, M.-L. Tsai, and C.-C. Su. Improved TCAM-based pre-filtering for network intrusion detection systems. In 22nd International Conference on Advanced Information Networking and Applications, Mar. 2008.
[12] H. Chung et al. A 58nm 1.8V 1Gb PRAM with 6.4MB/s program BW. In IEEE International Solid-State Circuits Conference Digest of Technical Papers, Feb. 2011.
[13] O. de Kretser and A. Moffat. Needles and Haystacks: a search engine for personal information collections. In 23rd Australasian Computer Science Conference, 2000.
[14] N. Derharcobian and C. N. Murphy. Phase-change memory (PCM) based universal content-addressable memory (CAM) configured as binary/ternary CAM. United States Patent US 7,675,765 B2, Agate Logic, Inc., Mar. 2010.
[15] D. Elliott, W. Snelgrove, and M. Stumm. Computational RAM: A memory-SIMD hybrid and its application to DSP. In Proceedings of the IEEE Custom Integrated Circuits Conference, pages 30.6.1–30.6.4, May 1992.
[16] Micron Technology, Inc. http://www.micron.com/products/pcm.
[17] K. Eshraghian et al. Memristor MOS content addressable memory (MCAM): Hybrid architecture for future high performance search engines. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 19(8):1407–1417, Aug. 2011.
[18] A. Goel and P. Gupta. Small subset queries and bloom filters using ternary associative memories, with applications. In Proceedings of the ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Systems, SIGMETRICS, 2010.
[19] M. Gokhale et al. Processing in memory: the Terasys massively parallel PIM array. Computer, 28(4):23–31, Apr. 1995.
[20] M. R. Guthaus et al. MiBench: A free, commercially representative embedded benchmark suite. In IEEE 4th Annual Workshop on Workload Characterization, Austin, TX, Dec. 2001.
[21] A. Hashmi and M. Lipasti. Accelerating search and recognition with a TCAM functional unit. In IEEE International Conference on Computer Design, Oct. 2008.
[22] Intel Corporation. Intel 64 and IA-32 Architectures Software Developer's Manual Volume 3A: System Programming Guide, Part 1, May 2011. http://www.intel.com/products/processor/manuals.
[23] E. Ipek et al. Dynamically replicated memory: Building reliable systems from nanoscale resistive memories. In International Conference on Architectural Support for Programming Languages and Operating Systems, Mar. 2010.
[24] ITRS. International Technology Roadmap for Semiconductors: 2010 Update. http://www.itrs.net/links/2010itrs/home2010.htm.
[25] M. Joshi, G. Karypis, and V. Kumar. ScalParC: A new scalable and efficient parallel classification algorithm for mining large datasets. In Proceedings of the International Parallel Processing Symposium, pages 573–579, 1998.
[26] G. Kasai et al. 200MHz/200MSPS 3.2W at 1.5V Vdd, 9.4Mbits ternary CAM with new charge injection match detect circuits and bank selection scheme. In Proceedings of the IEEE Custom Integrated Circuits Conference, pages 387–390, Sept. 2003.
[27] A. KleinOsowski and D. J. Lilja. MinneSPEC: A new SPEC benchmark workload for simulation-based computer architecture research. Computer Architecture Letters, 1, 2002.
[28] T. Kohonen. Content-Addressable Memories. Springer-Verlag, 1980.
[29] K. Lakshminarayanan, A. Rangarajan, and S. Venkatachary. Algorithms for advanced packet classification with ternary CAMs. In Proceedings of the Conference on Applications, Technologies, Architectures, and Protocols for Computer Communications, SIGCOMM, pages 193–204, 2005.
[30] B. Lee et al. Architecting phase-change memory as a scalable DRAM alternative. In International Symposium on Computer Architecture, Austin, TX, June 2009.
[31] M. Leleu, C. Rigotti, and J.-F. Boulicaut. GO-SPADE: Mining sequential patterns over datasets with consecutive repetitions. In Proceedings of the International Conference on Machine Learning and Data Mining in Pattern Recognition (MLDM), pages 293–306, 2003.
[32] S. Li et al. McPAT: An integrated power, area, and timing modeling framework for multicore and manycore architectures. In International Symposium on Computer Architecture, pages 469–480, 2009.
[33] C. J. Lin et al. 45nm low power CMOS logic compatible embedded STT-MRAM utilizing a reverse-connection 1T/1MTJ cell. In IEEE International Electron Devices Meeting, Dec. 2009.
[34] M. Lin, J. Luo, and Y. Ma. A low-power monolithically stacked 3D-TCAM. In IEEE International Symposium on Circuits and Systems, pages 3318–3321, May 2008.
[35] L.-Y. Liu et al. CAM-based VLSI architectures for dynamic Huffman coding. In Consumer Electronics, Digest of Technical Papers, IEEE International Conference on, June 1994.
[36] S. Matsunaga et al. Standby-power-free compact ternary content-addressable memory cell chip using magnetic tunnel junction devices. Applied Physics Express, 2(2):23004, 2009.
[37] S. Matsunaga et al. Fully parallel 6T-2MTJ nonvolatile TCAM with single-transistor-based self match-line discharge control. In 2011 Symposium on VLSI Circuits (VLSIC), June 2011.
[38] A. J. McAuley and P. Francis. Fast routing table lookup using CAMs. In Proceedings of the Twelfth Annual Joint Conference of the IEEE Computer and Communications Societies, 1993.
[39] R. McGeer and P. Yalagandula. Minimizing rulesets for TCAM implementation. In Proceedings of the 28th Conference on Computer Communications, Apr. 2009.
[40] M. Meribout, T. Ogura, and M. Nakanishi. On using the CAM concept for parametric curve extraction. IEEE Transactions on Image Processing, 9(12):2126–2130, Dec. 2000.
[41] Micron Technology, Inc. 1Gb DDR3 SDRAM, 2006. http://www.micron.com//get-document/?documentId=425.
[42] M. Nakanishi and T. Ogura. A real-time CAM-based Hough transform algorithm and its performance evaluation. In Proceedings of the 13th International Conference on Pattern Recognition, Aug. 1996.
[43] R. Narayanan et al. MineBench: A benchmark suite for data mining workloads. In IEEE International Symposium on Workload Characterization, Oct. 2006.
[44] M. Oskin, F. Chong, and T. Sherwood. Active Pages: a computation model for intelligent memory. In The 25th Annual International Symposium on Computer Architecture, 1998.
[45] S. Panchanathan and M. Goldberg. A content-addressable memory architecture for image coding using vector quantization. IEEE Transactions on Signal Processing, 39(9):2066–2078, Sept. 1991.
[46] T.-B. Pei and C. Zukowski. VLSI implementation of routing tables: tries and CAMs. In Proceedings of the Tenth Annual Joint Conference of the IEEE Computer and Communications Societies, Apr. 1991.
[47] M. K. Qureshi et al. Enhancing lifetime and security of PCM-based main memory with start-gap wear leveling. In Proceedings of the 42nd Annual IEEE/ACM International Symposium on Microarchitecture, 2009.
[48] B. Rajendran et al. Demonstration of CAM and TCAM using phase change devices. In 3rd IEEE International Memory Workshop, May 2011.
[49] J. Renau et al. SESC simulator, Jan. 2005. http://sesc.sourceforge.net.
[50] S. Rixner et al. Memory access scheduling. In Proceedings of the 27th Annual International Symposium on Computer Architecture, May 2000.
[51] G. Servalli. A 45nm generation phase change memory technology. In 2009 IEEE International Electron Devices Meeting.
[52] S. Sharma and R. Panigrahy. Sorting and searching using ternary CAMs. In Proceedings of the 10th Symposium on High Performance Interconnects, Dec. 2002.
[53] R. Shinde et al. Similarity search and locality sensitive hashing using ternary content addressable memories. In Proceedings of the 2010 International Conference on Management of Data, June 2010.
[54] J. M. Slaughter. Materials for magnetoresistive random access memory. Annual Review of Materials Research, 39:277–296, 2009.
[55] R. W. Stevens and S. A. Rago. Advanced Programming in the UNIX(R) Environment (2nd Edition). Addison-Wesley Professional, 2005.
[56] K. Tsuchida et al. A 64Mb MRAM with clamped-reference and adequate-reference schemes. In Proceedings of the IEEE International Solid-State Circuits Conference, 2010.
[57] J. Wade and C. Sodini. Dynamic cross-coupled bitline content addressable memory cell for high density arrays. In International Electron Devices Meeting, 1985.
[58] B. Wei et al. A single chip Lempel-Ziv data compressor. In IEEE International Symposium on Circuits and Systems, 1993.
[59] W. Xu, T. Zhang, and Y. Chen. Design of spin-torque transfer magnetoresistive RAM and CAM/TCAM with high sensing and search speed. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 18(1):66–74, Jan. 2010.
[60] R. M. Yoo, A. Romano, and C. Kozyrakis. Phoenix rebirth: Scalable MapReduce on a large-scale shared-memory system. In Proceedings of the 2009 IEEE International Symposium on Workload Characterization, 2009.
[61] W. Zhao and Y. Cao. New generation of predictive technology model for sub-45nm design exploration. In International Symposium on Quality Electronic Design, 2006.
[62] W. Zhao, C. Chappert, and P. Mazoyer. Spin transfer torque (STT) MRAM-based runtime reconfiguration FPGA circuit. ACM Transactions on Embedded Computing Systems, 9(2):14:1–14:16, Oct. 2009.