Ancs6819 Bloom Filter Packet Classification


Fast Packet Classification Using Bloom Filters

Sarang Dharmapurikar, Haoyu Song, Jonathan Turner, and John Lockwood

Dept. of Computer Science and Engineering, Washington University in St. Louis, St. Louis, MO, 63130

(sarang, hs1, jst, lockwood)@arl.wustl.edu

    ABSTRACT

Ternary Content Addressable Memory (TCAM), although widely used for general packet classification, is an expensive and high power-consuming device. Algorithmic solutions which rely on commodity memory chips are relatively inexpensive and power-efficient but have not been able to match the generality and performance of TCAMs. Therefore, the development of fast and power-efficient algorithmic packet classification techniques continues to be a research subject. In this paper we propose a new approach to packet classification which combines architectural and algorithmic techniques. Our starting point is the well-known crossproduct algorithm, which is fast but has significant memory overhead due to the extra rules needed to represent the crossproducts. We show how to modify the crossproduct method in a way that drastically reduces the memory requirement without compromising on performance. Unnecessary accesses to the off-chip memory are avoided by filtering them through on-chip Bloom filters. For packets that match p rules in a rule set, our algorithm requires just 4 + p + ε independent memory accesses to return all matching rules, where ε ≪ 1 is a small constant that depends on the false positive rate of the Bloom filters. Using two commodity SRAM chips, a throughput of 38 million packets per second can be achieved. For rule set sizes ranging from a few hundred to several thousand filters, the average rule set expansion factor attributable to the algorithm is just 1.2 to 1.4. The average memory consumption per rule is 32 to 45 bytes.

    Categories and Subject Descriptors

    C.2.6 [Internetworking]: Routers

    General Terms

    Algorithms, Design, Performance

Keywords

Packet Classification, TCAM

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
ANCS'06, December 3-5, 2006, San Jose, California, USA.
Copyright 2006 ACM 1-59593-580-0/06/0012 ...$5.00.

1. INTRODUCTION

The general packet classification problem has received a

great deal of attention over the last decade. The ability to classify packets into flows based on their packet headers is important for QoS, security, virtual private networks (VPN) and packet filtering applications. Conceptually, a packet classification system must compare each packet header received on a link against a large set of rules, and return the identity of the highest priority rule that matches the packet header (or in some cases, all matching rules). Each rule can match a large number of packet headers, since the rule specification supports address prefixes, wild cards and port number ranges. Much of the research to date has concentrated on algorithmic techniques which use hardware or software lookup engines to access data structures stored in commodity memory. However, none of the algorithms developed to date has been able to displace TCAMs in practical applications.

TCAMs offer consistently high performance, which is independent of the characteristics of the rule set, but they are relatively expensive and use large amounts of power. A TCAM requires a deterministic time for each lookup, and recent devices can classify more than 100 million packets per second. Although TCAMs are a favorite choice of network equipment vendors, alternative solutions are still being sought, primarily due to the high cost of the TCAM devices and their high power consumption. The cost per bit of a high performance TCAM is about 15 times that of a comparable SRAM [2], [1], and TCAMs consume more than 50 times as much power per access [14], [11]. This gap between SRAM and TCAM cost and power consumption makes it worthwhile to continue to explore better algorithmic solutions.

We propose a memory efficient and fast algorithm based on Bloom filters. Our starting point is the naive crossproduct algorithm [9]. The naive crossproduct suffers an exorbitant memory overhead due to the extra crossproduct rules it introduces. To reduce memory consumption, we divide the rules into multiple subsets and then construct a crossproduct table for each subset. This reduces the overall crossproduct overhead drastically. Since the rules are divided into multiple subsets, we need to perform a lookup in each subset. However, we can use Bloom filters to avoid lookups in subsets that contain no matching rules, making it possible to sustain high throughput. In particular, we demonstrate a method, based on Bloom filters and hash tables, that can classify a packet in 4 + p + ε memory accesses, where ε is a small constant ≪ 1 determined by the false positive probability of the Bloom filters. The first four memory accesses are required to perform Longest Prefix Matching (LPM) on the source/destination addresses and the source/destination ports. The next p memory accesses are required to look up the p matching rules for a given packet. Furthermore, the LPM phase and the rule lookup phase can be pipelined with two independent memory chips such that the memory accesses per packet can be reduced to max{4, p}. We leverage some of the existing work on Bloom filters and hardware implementation to design our packet classification system. Our results show that our architecture can handle large rule sets, containing hundreds of thousands of rules, efficiently with an average memory consumption of 32 to 45 bytes per rule.

The rest of the paper is organized as follows. In the next section we discuss the related work. We describe the naive crossproduct algorithm in more detail in Section 3. In Section 4, we discuss our Multi-Subset Crossproduct Algorithm. In Section 5 we describe our heuristics for intelligent partitioning of the rules into subsets to reduce the overall crossproducts. Finally, in Section 6, we evaluate and discuss the performance of our algorithm in terms of memory requirement and throughput. Section 7 concludes the paper.

2. RELATED WORK

There is a vast body of literature on packet classification.

An excellent survey and taxonomy of the existing algorithms and architectures can be found in [11]. Here, we discuss only the algorithms that are closely related to our work. Our algorithm, which can provide deterministic lookup throughput, is somewhat akin to the basic crossproduct algorithm [9]. The basic idea of the crossproduct algorithm is to perform a lookup on each field first and then combine the results to form a key to index a crossproduct table. The best-matched rule can be retrieved from the crossproduct table in only one memory access. The single field lookup can be performed by direct table lookup as in the RFC algorithm [5] or by using any range searching or LPM algorithm. The BV [6] and ABV [3] algorithms use bit vector intersections to replace the crossproduct table lookup. However, the width of a bit vector equals the number of rules, and each unique value of each field needs to store such a bit vector. Hence, the storage requirement is significant, which limits scalability.

Using a similar reduction tree as in RFC, the DCFL [10] algorithm uses hash tables rather than direct lookup tables to implement the crossproduct tables at each tree level. However, depending on the lookup results from the previous level, each hash table needs to be queried multiple times and multiple results are retrieved. For example, at the first level, if a packet matches m nested source IP address prefixes and n nested destination IP address prefixes, we need m × n hash queries to the hash table with the keys that combine these two fields, and the lookups typically result in multiple valid outputs that require further lookups. For multi-dimensional packet classification, this incurs a large performance penalty.

TCAMs are widely used for packet classification. The latest TCAM devices also include a banking mechanism to reduce power consumption by selectively turning off the unused banks. Traditionally, TCAM devices needed to expand the range values into prefixes for storing a rule with range specifications. A recently introduced algorithm, DIRPE [7], uses a clever technique to encode ranges differently, which results in less overall rule expansion compared to the traditional method. The authors also recognized that in modern security applications it is not sufficient to stop the matching process after the first match is found; all the matching rules for a packet must be reported. They devised a multi-match scheme with TCAMs which involves multiple TCAM accesses.

Yu et al. described a different algorithm for multi-match packet classification based on geometric intersection of rules [13]. A packet can match multiple rules because the rules overlap. However, if the rules are broken into smaller sub-rules such that all the rules are mutually exclusive, then the packet can match only one rule at a time. This overlap-free rule set is obtained through geometric intersection. Unfortunately, the rule set expansion due to the rules newly introduced by the intersection can be very large. In [14], they describe a modified algorithm called SSA which reduces the overall expansion. They observe that if the rules are partitioned into multiple subsets in order to reduce the overlap, then the resulting expansion will be small. At the same time, one would need to probe each subset independently to search for a matching rule. Our algorithm is similar to SSA in that we also try to reduce the overlap between the rules by partitioning them into multiple subsets and thus reduce the overall expansion. However, while SSA only considers an overlap in all the dimensions, our algorithm considers an overlap in any dimension for the purpose of partitioning. Hence the partitioning techniques are different. Moreover, SSA is a TCAM-based algorithm whereas ours is memory-based. Finally, SSA must probe all the subsets formed, one by one, requiring as many TCAM accesses, whereas our algorithm needs only p memory accesses, just as many as there are matching rules per packet.

    3. NAIVE CROSSPRODUCT ALGORITHM

Before we explain our multi-subset crossproduct algorithm, we explain the basic crossproduct algorithm. The basic crossproduct algorithm first performs a longest prefix match (LPM) on each field. Let vi be the longest matching prefix of field fi. It then looks up the key ⟨v1, v2, . . . , vk⟩ in a crossproduct rule table (implemented as a hash table). If there is a matching key, then the associated rule IDs are returned. For this algorithm to work, the rule table needs to be modified, which can be explained through the following example. We will assume only two fields, f1 and f2, for the purpose of illustration. Let each field be 4 bits wide. Consider a rule set with three rules, r1: ⟨1*, *⟩, r2: ⟨1*, 00*⟩ and r3: ⟨101*, 100*⟩. We can represent these rules with a trie built for each field, as shown in Figure 1. In the figure, the nodes corresponding to the valid prefixes of a field are colored black. A connection between the nodes of the two fields represents a rule. It is important to note that a match for r2: ⟨1*, 00*⟩ also means a match for r1: ⟨1*, *⟩, because the prefix * of the second field is a prefix of 00*. Hence, r2 is more specific than r1; or, put differently, r2 is contained in r1. Therefore, when r2 matches, both r2 and r1 should be returned as the matching rule IDs. By a similar argument, the matching rule IDs associated with r3 are r3 and r1.

[Figure 1: Illustration of the basic crossproduct algorithm. A trie is built for the prefixes of each field, f1 and f2; each rule is a connection between trie nodes labeled with its matching rule IDs. Original rules: r1: ⟨1*, *⟩, r2: ⟨1*, 00*⟩, r3: ⟨101*, 100*⟩. Additional pseudo-rules: p1, p2 and p3.]

Suppose a packet arrives whose field f1 has the longest matching prefix 101* and field f2 has 00*. There is no original rule ⟨101*, 00*⟩. However, note that 1* is a prefix of 101*. Therefore, a match for the more specific prefix 101* automatically implies a match for the less specific prefix 1*. Hence, if we look up the key ⟨101*, 00*⟩, we should get a match for the rule r2: ⟨1*, 00*⟩. To maintain the correctness of the search, we add a pseudo-rule p1: ⟨101*, 00*⟩ to the rule table and associate the rule ID r2 with it. Similarly, if 1* is the longest matching prefix for f1 and 100* for f2, then although there is no original rule ⟨1*, 100*⟩, it implies a match for r1: ⟨1*, *⟩. Hence, we need to add a pseudo-rule p2: ⟨1*, 100*⟩ to the table and associate the rule ID r1 with it. To summarize, a match for a prefix is also a match for its shorter prefixes. Since the search on individual fields stops after the longest matching prefix has been found, these prefixes must also account for the rules formed by their sub-prefixes. If a rule formed by the longest matching prefixes does not exist in the table, it must be artificially added to the set, and the original rule IDs that should match are associated with this new pseudo-rule. It is not hard to see that if we build such a rule table, then the only additional pseudo-rules required are p1, p2 and p3, as shown in Figure 1.

Let P.fi denote the value of field i in packet P. The packet classification process can be summarized in the following pseudo-code.

ClassifyPacket(P)
1. for each field i
2.     vi ← LPM(P.fi)
3. {match, {Id}} ← HashLookup(v1, . . . , vk)

As the pseudo-code describes, we first execute LPM on each field value. Then we look up the key formed by all the longest matching prefixes in the hash table. The result of this lookup indicates whether any rule matched and outputs the set of matching rule IDs associated with the matching rule.
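As a concrete sketch of this construction and lookup, the following Python fragment (a hypothetical illustration, not the paper's implementation) builds the crossproduct table with pseudo-rules for the two-field example above; prefixes are strings and the empty string stands for the wildcard *:

```python
from itertools import product

# The example rule set: each key is a (field1, field2) prefix pair ("" means *).
rules = {("1", ""): "r1", ("1", "00"): "r2", ("101", "100"): "r3"}

def matching_ids(key, rules):
    """All rule IDs whose prefixes are prefixes of the key's prefixes."""
    return [rid for (p1, p2), rid in rules.items()
            if key[0].startswith(p1) and key[1].startswith(p2)]

def build_crossproduct_table(rules):
    """Cross every distinct field-1 prefix with every distinct field-2 prefix;
    each combination that matches at least one rule becomes a table entry
    listing all matching rule IDs. Entries whose key is not an original rule
    are the pseudo-rules."""
    f1 = {p1 for p1, _ in rules}
    f2 = {p2 for _, p2 in rules}
    table = {}
    for key in product(f1, f2):
        ids = matching_ids(key, rules)
        if ids:
            table[key] = ids
    return table

table = build_crossproduct_table(rules)
pseudo = sorted(k for k in table if k not in rules)

# ClassifyPacket: after LPM yields 101* and 00*, a single hash lookup on the
# pseudo-rule entry returns r2 together with the less specific r1.
ids = table[("101", "00")]
```

Running this on the three-rule example yields exactly the three pseudo-rules p1, p2 and p3 described in the text.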

It is evident that the crossproduct algorithm is efficient in terms of memory accesses: memory accesses are required only for the LPM on each field and the final hash table lookup to search for the rule in the crossproduct table. For 5-tuple classification, we don't need to perform LPM on the protocol field; it can be a direct lookup in a small on-chip table. Moreover, if we use the Bloom filter based LPM technique described in [4], we need approximately one memory access per LPM. Therefore, the entire classification process takes five memory accesses, with very high probability, to classify a packet. However, the overhead of pseudo-rules can be very large. We consider another example rule set, as shown in Figure 2(A), and add pseudo-rules to it. As can be seen, this rule set with six rules requires seven more pseudo-rules. If each field has 100 unique prefixes in the rule set (ignoring the protocol field), then the expanded rule set can potentially be as large as 100^4 rules, making it impractical to scale to larger rule sets. We analyzed several real and synthetic rule sets and found that the naive crossproduct expands the rule set by a factor of 200 to 5.7 × 10^6! Clearly, due to the very large overhead of pseudo-rules, this algorithm is impractical. We modify the algorithm to reduce this expansion overhead to just a factor of 1.2 on average while retaining the speed.

[Figure 2: An example of naive crossproduct. (A) The rule set and its trie-based representation: r1: ⟨1*, *⟩, r2: ⟨1*, 00*⟩, r3: ⟨01*, 100*⟩, r4: ⟨101*, 100*⟩, r5: ⟨101*, 11*⟩, r6: ⟨00*, *⟩. (B) The rule set after adding the pseudo-rules p1 through p7.]

4. MULTI-SUBSET CROSSPRODUCT ALGORITHM

In the naive scheme we require just one hash table access to get the list of matching rules. However, if we allow ourselves multiple hash table accesses, then we can split the rule set into multiple smaller subsets and take the crossproduct within each of them. With this arrangement, the total number of pseudo-rules can be reduced significantly compared to the naive scheme. This is illustrated in Figure 3. We divide the rule set into three subsets. How to form these subsets is discussed in Section 5; for now, let's assume an arbitrary partitioning. Within each subset, we take a crossproduct. This results in inserting pseudo-rule p7 in subset 1 (G1) and pseudo-rule p2 in subset 2 (G2). All the other pseudo-rules shown in Figure 2(B) vanish and the overhead is significantly reduced.

[Figure 3: Dividing the rules into separate subsets G1, G2 and G3 to reduce overlap, with the corresponding LPM tables for each field.]

Why does the number of pseudo-rules reduce so drastically? This is because the crossproduct is inherently multiplicative in nature. If, due to the rule set partitioning, the number of overlapping prefixes of each field i is reduced by a factor of xi, the resulting reduction in the crossproduct rules is of the order of the product of the xi, and hence large. Now, an independent hash table can be maintained for each rule subset and an independent rule lookup can be performed in each. The splitting introduces two extra memory access overheads: 1) the entire LPM process on all the fields needs to be repeated for each subset, and 2) a separate hash table access per subset is needed to look up the final rule. We now describe how to avoid the first overhead and reduce the second.
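To see the effect numerically, the sketch below counts pseudo-rules for the Figure 2 rule set with and without partitioning. The code and the three-way partition are our own illustration; the grouping shown is one partition consistent with the pseudo-rules p2 and p7 named in the text, not necessarily the paper's exact grouping:

```python
from itertools import product

def pseudo_rule_count(rules):
    """Number of crossproduct entries that match some rule but are not
    themselves original rules, i.e. the pseudo-rules ("" stands for *)."""
    f1 = {p1 for p1, _ in rules}
    f2 = {p2 for _, p2 in rules}
    count = 0
    for k1, k2 in product(f1, f2):
        matches = [(p1, p2) for (p1, p2) in rules
                   if k1.startswith(p1) and k2.startswith(p2)]
        if matches and (k1, k2) not in rules:
            count += 1
    return count

# Figure 2 rule set.
all_rules = {"r1": ("1", ""), "r2": ("1", "00"), "r3": ("01", "100"),
             "r4": ("101", "100"), "r5": ("101", "11"), "r6": ("00", "")}

# Naive crossproduct over the whole set: seven pseudo-rules, as in Figure 2(B).
naive = pseudo_rule_count(set(all_rules.values()))

# Crossproducting each subset separately leaves only one pseudo-rule in G1
# (p7) and one in G2 (p2) under this assumed partition.
groups = [{"r2", "r4", "r5"}, {"r1", "r3"}, {"r6"}]
split = sum(pseudo_rule_count({all_rules[r] for r in g}) for g in groups)
```

The multiplicative effect is visible even at this tiny scale: 7 pseudo-rules drop to 2.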

With reference to our example in Figure 3, due to the partitioning of rules into subsets G1, G2 and G3, the sets of valid prefixes of the first field are {1*, 101*} for G1, {1*, 01*} for G2 and {00*} for G3. Hence, the longest prefix in one subset might not be the longest prefix in another subset, requiring a separate LPM for each subset. However, an independent LPM for each subset can easily be avoided by modifying the LPM data structure. For each field, we maintain only one global prefix table which contains the unique prefixes of that field from all the subsets. When we perform the LPM on a field, the matching prefix is the longest one across all the subsets. Therefore, the longest prefix for any subset is either the prefix that matches or one of its sub-prefixes. With each prefix in the LPM table, we maintain a list of sub-prefixes, each of which is the longest prefix within a subset. For instance, consider field 1 of our example. If we find that the longest matching prefix for this field is 101*, then we know that the longest matching prefix within each group must be either this prefix or one of its sub-prefixes. In G1, the longest matching prefix is 101*, in G2 it is 1*, and in G3 it is NULL. Let ti denote an arbitrary entry in the LPM table of field i. ti is a record that consists of a prefix ti.v, which is the lookup key portion of that entry, and the corresponding sub-prefixes for the g subsets, ti.u[1] . . . ti.u[g]. Each ti.u[j] is a prefix of ti.v, and ti.u[j] is the longest matching prefix of field i in subset j. If ti.u[j] == NULL, then there is no prefix of ti.v which is the longest prefix in that subset.

After a global LPM on the field, we have all the information we need about the matching prefixes in the individual subsets. Secondly, since ti.u[j] is a prefix of ti.v, we do not need to maintain the complete prefix ti.u[j] but just its length. The prefix ti.u[j] can always be obtained by taking the correct number of bits of ti.v.

The LPM table corresponding to the example in Figure 3 is also shown there. In this example, since we have three subsets, each prefix has three entries. For instance, the table for field 1 tells us that if the longest matching prefix of this field in the packet is 101*, then there is a sub-prefix of 101* of length 3 (which is 101* itself) that is the longest prefix in G1; there is a sub-prefix of length 1 (which is 1*) that is the longest prefix in G2; and there is no sub-prefix of 101* (indicated by an empty entry) that is the longest prefix in G3. Thus, after finding the longest prefix of a field, we can read the list of longest prefixes for all the subsets and use it to probe the hash tables. For example, if 101* is the longest matching prefix for field 1 and 100* for field 2, then we will probe the G1 rule hash table with the key ⟨101*, 100*⟩ and the G2 rule hash table with the key ⟨1*, 100*⟩, and we don't need to probe the G3 hash table. The classification algorithm is described below.

ClassifyPacket(P)
1. for each field i
2.     ti ← LPM(P.fi)
3. for each subset j
4.     for each field i
5.         if (ti.u[j] == NULL) skip this subset
6.     {match, {Id}} ← HashLookupj(t1.u[j], . . . , tk.u[j])

Thus, even after splitting the rule set into multiple subsets, only one LPM is required for each field (lines 1-2). Hence we maintain performance similar to the naive crossproduct algorithm as far as LPM is concerned. After the LPM phase, the individual rule subset tables are probed one by one with the keys formed from the longest matching prefixes within that subset (lines 3-6). A probe is not required for a subset if there is no sub-prefix corresponding to at least one field within that subset; in this case, we simply move to the next subset (lines 4-5). However, for the purpose of analysis, we will stick to the conservative assumption that all the fields have some sub-prefix available for each subset and hence all g subsets need to be probed. We will now explain how

we can avoid probing all these subsets by using Bloom filters. If a packet can match at most p rules and all these rules reside in distinct hash tables, then only p of these g hash table probes will be successful and return a matching rule. The other memory accesses are unnecessary and can be avoided using on-chip Bloom filters. We maintain one Bloom filter in on-chip memory corresponding to each off-chip rule subset hash table. We first query the Bloom filters with the keys to be looked up in the subsets. If a filter shows a match, we look up the key in the corresponding off-chip hash table. With very high probability, only those Bloom filters containing a matching rule show a match. The flow of our algorithm is illustrated in Figure 4.

[Figure 4: Illustration of the flow of the algorithm. First, LPM is performed on each field. The result is used to form a set of g tuples, each of which indicates how many prefix bits to use for constructing the key corresponding to each subset. The keys are looked up in the on-chip Bloom filters first; only the keys matched in the Bloom filters are used to query the corresponding rule subset hash tables kept in off-chip memory.]
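This flow can be sketched in Python (a hypothetical illustration; the filter size, hash construction, and table contents below are our own choices, not the paper's hardware design):

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter sketch: an m-bit array set by k salted hashes."""
    def __init__(self, m=1024, k=4):
        self.m, self.k, self.bits = m, k, bytearray(m)

    def _positions(self, key):
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = 1

    def __contains__(self, key):
        # May report false positives, never false negatives.
        return all(self.bits[pos] for pos in self._positions(key))

# One on-chip filter per off-chip rule-subset hash table (assumed contents,
# loosely following the Figure 3/4 example; "" stands for *).
subset_tables = [{("101", "100"): ["r4", "r1"]},   # G1
                 {("1", "100"): ["r1"]},           # G2 (pseudo-rule p2)
                 {("00", ""): ["r6"]}]             # G3
filters = [BloomFilter() for _ in subset_tables]
for bf, table in zip(filters, subset_tables):
    for key in table:
        bf.add(key)

def classify(subset_keys):
    """subset_keys[j] is the key for subset j built from the per-subset
    sub-prefix lengths (None if some field has no sub-prefix in subset j)."""
    matches = []
    for key, bf, table in zip(subset_keys, filters, subset_tables):
        if key is not None and key in bf:      # on-chip prefilter
            matches += table.get(key, [])      # off-chip probe only on a match
    return sorted(set(matches))

result = classify([("101", "100"), ("1", "100"), None])
```

A false positive from a filter only costs one wasted off-chip probe; it never adds a wrong rule, because the hash table itself returns nothing for an absent key.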

The average number of memory accesses for LPM on field i using the LPM technique discussed in [4] can be expressed as

ti = 1 + (Wi − 1)f    (1)

where Wi is the width of field i in bits and f is the false positive probability of each Bloom filter (assuming that they have been tuned to exhibit the same false positive probability). For IPv4, we need to perform LPM on the source and destination IP addresses (32 bits each) and the source and destination ports (16 bits each). The protocol field can be looked up in a 256-entry direct lookup array kept in the on-chip registers, so no memory accesses are needed for the protocol field lookup. We can use a set of 32 Bloom filters to store the source and destination IP address prefixes of different lengths. While storing a prefix, we tag it with its type to create a unique key (for instance, source IP type = 1, destination IP type = 2, etc.). While querying a Bloom filter with a prefix, we create the key by combining the prefix with its type. Similarly, the same set of Bloom filters can be used to store the source and destination port prefixes as well: Bloom filters 1 to 16 can store the source port prefixes and 17 to 32 the destination port prefixes. Hence the total number of hash table accesses required for LPM on all four of these fields can be expressed as

    Tlpm = (1 + 31f) + (1 + 31f) + (1 + 15f) + (1 + 15f)

    = 4 + 92f (2)

We need g more Bloom filters for storing the rules of each subset. During the rule lookup phase, when we query the Bloom filters of all the g subsets, we will have up to p true matches, and each of the remaining g − p Bloom filters can show a match with false positive probability f. Hence the hash probes required in the rule matching phase are

Tg = p + (g − p)f    (3)

The total number of hash table probes required in the entire process of packet classification is

T = Tg + Tlpm = 4 + p + (92 + g − p)f = 4 + p + ε    (4)

where ε = (92 + g − p)f. By keeping the value of f small (e.g. 0.0005), ε can be made negligibly small, giving a total of 4 + p accesses. It should be noted that so far we have dealt with the number of hash table accesses and not memory accesses; a carefully constructed hash table requires close to one memory access for a single hash table lookup. Secondly, our algorithm is a multi-match algorithm as opposed to a priority rule match: the priorities associated with all the matching rules need to be explicitly compared to pick the highest priority

match. As Equation 4 shows, the efficiency of the algorithm depends on how small g and f are. In the next section, we explore the trade-offs involved in minimizing the values of these two system parameters.
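Plugging illustrative numbers into Equations (2)-(4) makes the trade-off concrete (the values of p, g and f below are our own examples, not the paper's):

```python
def expected_probes(p, g, f):
    """Expected hash-table probes per packet, following Equations (2)-(4):
    Tlpm = 4 + 92f for two 32-bit and two 16-bit fields, Tg = p + (g - p)f."""
    t_lpm = 2 * (1 + 31 * f) + 2 * (1 + 15 * f)    # Eq. (2)
    t_g = p + (g - p) * f                          # Eq. (3)
    return t_lpm + t_g                             # Eq. (4): 4 + p + epsilon

# With f = 0.0005, g = 16 subsets and p = 2 matching rules,
# epsilon = (92 + 16 - 2) * 0.0005 = 0.053, so the total stays near 4 + p.
```

Even a modest false positive rate keeps the overhead ε well below one extra probe; shrinking f further trades on-chip Bloom filter memory for fewer wasted off-chip accesses.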

5. INTELLIGENT GROUPING

If we try to create fewer subsets from a given rule set, then it is possible that within each subset there is still a significant number of crossproducts. On the other hand, we do not want a very large number of subsets, because that would need a large number of Bloom filters, requiring more hardware resources. Hence we would like to limit g to a moderately small value. The key to reducing the overhead of pseudo-rules is to divide the rule set into subsets intelligently, so as to minimize the crossproducts. Pseudo-rules are required only when rules within the same subset have overlapping prefixes. So, is there an overlap-free decomposition into subsets such that we don't need to insert any pseudo-rules at all? Alternatively, we would also like to know: given a fixed number of subsets, how can we create them with minimum pseudo-rules? We address these questions in this section.

5.1 Overlap-free Grouping

We illustrate a simple technique based on the concept of the Nested Level Tuple which gives a partition of the rules into subsets such that no subset needs to generate crossproduct rules. We call the resulting algorithm Nested Level Tuple Space Search (NLTSS). We then illustrate a technique to reduce the number of subsets formed using NLTs to a predetermined fixed number by merging some of the subsets and producing the crossproduct for them.

Nested Level Tuple Space Search (NLTSS) Algorithm. We begin by constructing an independent binary prefix-trie with the prefixes of each field in the given rule set, just as shown in Figure 1(B). We will use the formal definitions given below.

Nested Level: The nested level of a marked node in a binary trie is the number of ancestors of the node which are also marked. We always assume that the root node is marked. For example, the nested levels of nodes m2 and m3 are 1, and the nested level of node m4 is 2.

Nested Level Tree: Given a binary trie with marked nodes, we construct a Nested Level Tree by removing the unmarked nodes and connecting each marked node to its nearest marked ancestor. Figure 5 illustrates a nested level tree for field f1 in our example rule set.

[Figure 5: Illustration of a Nested Level Tree: node m1 at nested level 0, nodes m2 and m3 at nested level 1, and node m4 at nested level 2.]

Nested Level Tuple (NLT): For each field involved in the rule set, we create a Nested Level Tree (see Figure 6). The Nested Level Tuple (NLT) associated with a rule r is the tuple of the nested levels associated with each field prefix of that rule. For instance, in Figure 6, the NLT for r6 is [1,0] and for r4 is [2,1].

From the definition of the nested level, it is clear that among the nodes at the same nested level, none is an ancestor of another. Therefore, the prefixes represented by the nodes at the same nested level in a tree do not overlap with each other. Hence the set of rules contained in the same Nested Level Tuple does not need any crossproduct rules. This is illustrated in Figure 6. For instance, in NLT [1,0] there are two rules, r1 and r6. Their prefixes for field 1 are 1* and 00*, neither of which is more specific than the other (hence overlap-free). Likewise, they share the same prefix, namely *, for field 2. Therefore, no crossproduct is required.
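The grouping can be sketched as follows (hypothetical code; the rule set is the one from Figure 2, with "" standing for the wildcard prefix *):

```python
def nested_level(prefix, marked):
    """Number of marked proper ancestors of `prefix`; the root "" counts as
    always marked."""
    return sum(1 for q in marked if q != prefix and prefix.startswith(q))

# Figure 2 rule set.
rules = {"r1": ("1", ""), "r2": ("1", "00"), "r3": ("01", "100"),
         "r4": ("101", "100"), "r5": ("101", "11"), "r6": ("00", "")}

# Marked nodes per field: the root plus every prefix used by some rule.
marked = [{""} | {r[i] for r in rules.values()} for i in range(2)]

# Group the rules by their Nested Level Tuple: rules sharing an NLT are
# overlap-free in every field, so no pseudo-rules are needed within a group.
nlt_groups = {}
for rid, prefixes in rules.items():
    key = tuple(nested_level(p, marked[i]) for i, p in enumerate(prefixes))
    nlt_groups.setdefault(key, []).append(rid)
```

On this rule set the sketch reproduces the groupings named in the text: r1 and r6 fall into NLT [1,0] and r4 into NLT [2,1].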

The number of NLTs gives us one bound on the number of subsets such that each subset contains overlap-free rules. We experimented with our rule sets to obtain the number of NLTs in each of them. The numbers are presented in Table 1. While a consistent relationship cannot be derived between the number of rules and the number of NLTs, it is clear that even a large rule set containing several thousand rules can map to fewer than 200 NLTs. The maximum number of NLTs was found to be 151, for about 25,000 rules.

    A property of the NLT rule subset is that a prefix doesnot have any sub-prefix within the same subset. Using thisproperty, the LPM entry data structure shown in Figure 4can be compressed further. Associated with each prefix, wecan maintain a bit-map with as many bits as there are NLTs.A bit corresponding to a NLT in this bit-map is set if theprefix or its sub-prefixes belongs to a rule that is containedin this NLT. We can take an intersection of the bit-mapsassociated with the longest matching prefix of each field for

    pruning the rule subsets to look up.However, given an NLT, we just know the nested level as-

    sociated with each prefix. We dont know the exact prefixlength to form our query key for that NLT rule set. There-fore, we need to maintain another bit map with each prefixwhich gives a prefix length to nested level mapping. We callthis bit-map a PL/NL bit-map. For instance, for an IP ad-dress prefix, we would maintain PL/NL bit-map of 32 bitsin which a bit set at a position indicates that the prefix ofthe corresponding length is present in the rule set. Givena particular bit that is set in the PL/NL bit-map, we cancalculate the nested level of the corresponding prefix just by

counting the number of set bits up to and including the given bit. We illustrate this with an example. Consider an 8-bit IP address and its associated PL/NL bit-map:

IP address   : 10110110
PL/NL bit-map: 10010101

Thus, the prefixes of this IP address available in the rule set are 1* (nested level 1), 1011* (nested level 2), 101101* (nested level 3) and 10110110 (nested level 4). To get the nested level of the prefix 101101*, we just sum all the set bits in the bit-map up to and including the bit corresponding to this prefix. Conversely, if we are interested in the prefix length at a particular nested level, we keep adding the bits in the PL/NL bit-map until the running count matches the specified nested level, and return the position of that set bit as the prefix length. Thus, we can construct the prefix length tuple for an NLT using the PL/NL bit-maps associated with the involved prefixes. The prefix length tuple tells us which bits to use to construct the key while probing the associated rule set (or Bloom filter). The modified data structures and the flow of the algorithm are shown in Figure 7. As the figure shows, each prefix entry in the LPM tables has a PL/NL bit-map and an NLT bit-map. For instance, prefix 101* of field 1 has a PL/NL bit-map of 1010, which indicates that the sub-prefixes associated with the prefix are of length 1 (i.e. prefix 1*) and 3 (i.e. prefix 101* itself). Therefore, the nested level associated with prefix 1* is 1 and with 101* is 2. The second bit-map, the NLT bit-map, contains as many bits as the number of NLTs; the bits corresponding to the NLTs to which the prefix and its sub-prefixes belong are set. Thus 101* belongs to all three NLTs, whereas 1* belongs to NLTs 1 and 2. After the longest matching prefixes are read out, the associated NLT bit-maps are intersected to find the common set of NLTs to which all the prefixes belong. As the figure shows, since the prefixes belong to all the NLTs, the intersection contains all the NLTs. From this intersection bit-map we obtain the indices of the NLTs to check, and from the NLT table we obtain the actual NLTs. Combining this with the PL/NL bit-maps of each field, we convert the nested levels to prefix lengths and obtain the list of prefix length tuples. This list tells us how many bits of each field to use to form the probe key. Each probe is first filtered through the on-chip Bloom filters, and only the successful ones are used to query the off-chip rule tables. As the example shows, the key (1, 100) gets filtered out and does not need an off-chip memory access.
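The bit-map arithmetic described above can be sketched as follows (our own illustrative helpers, with bit-maps written as bit strings whose leftmost bit corresponds to prefix length 1):

```python
def nested_level_of_length(plnl, length):
    # Nested level of the prefix of the given length: the number of
    # set bits in the PL/NL bit-map up to and including that position.
    assert plnl[length - 1] == '1', 'no prefix of this length in the rule set'
    return plnl[:length].count('1')

def length_of_nested_level(plnl, level):
    # Inverse mapping: scan the PL/NL bit-map, counting set bits until
    # the running count reaches the requested nested level, and return
    # that bit position as the prefix length.
    count = 0
    for pos, bit in enumerate(plnl, start=1):
        if bit == '1':
            count += 1
            if count == level:
                return pos
    raise ValueError('no prefix at this nested level')

def candidate_nlts(nlt_bitmaps):
    # Intersect the NLT bit-maps of the longest matching prefixes of
    # all fields; return the indices of the NLTs that every field's
    # prefix belongs to.
    width = len(nlt_bitmaps[0])
    return [i for i in range(width)
            if all(bm[i] == '1' for bm in nlt_bitmaps)]

# The paper's example: PL/NL bit-map 10010101 for address 10110110.
print(nested_level_of_length('10010101', 6))   # 101101* is at nested level 3
print(length_of_nested_level('10010101', 2))   # nested level 2 -> length 4 (1011*)
```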

Note that the bit-map technique can be used instead of the prefix length array only because there is a unique nested level, and hence prefix length, associated with a subset for a particular field. For the generic multi-subset crossproduct, we cannot use the bit-map technique, since there can be multiple sub-prefixes of the same prefix associated with the same subset. In that case, we need to list the individual prefix lengths, just as shown in Figure 4.

5.2 Limiting the Number of Subsets

While the NLT-based grouping works fine in practice, we might ask: is there still room for improvement? Can the number of subsets be reduced further? This brings us back to the second question that we set out to explore at the beginning of this section: how can we limit the number of subsets to a desired value? While the NLT technique gives us crossproduct-free subsets of rules, we can still improve upon


comes the bottleneck. While NLTMC shows better throughput for all configurations compared to NLTSS, it also incurs the associated rule expansion cost, as discussed before. Therefore, there is a trade-off between throughput and memory consumption. Depending upon the requirements, the appropriate algorithm and configuration can be chosen.

Secondly, a packet can match p rules when there are p overlapping rules that cover a common region in the 5-dimensional space. Although [7] and [5] report that the number of matching rules can be as high as 8, instances of such overlaps are usually very few. A simple solution to improve throughput in such cases is to take the overlapping rules out and keep them in on-chip memory, so that p stays below a given threshold. In each rule set we experimented with, we encountered only one or two instances where there was an overlap beyond four. Therefore, it is very easy to achieve the throughput of p ≤ 4 by removing this overlap.
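A sketch of this pruning step (our own simplification, not the paper's procedure: here two rules "overlap" when, in every field, one rule's prefix is an ancestor of the other's, which over-approximates the number of rules that can match a single packet):

```python
def prefixes_overlap(p, q):
    # Bit-string prefixes ('' = wildcard) overlap iff one is an
    # ancestor of the other or they are equal.
    return p.startswith(q) or q.startswith(p)

def rules_overlap(r, s):
    # Two rules can match a common packet only if their prefixes
    # overlap in every field.
    return all(prefixes_overlap(p, q) for p, q in zip(r, s))

def cap_overlap(rules, threshold=4):
    # Greedily move the most-overlapping rule to an on-chip exception
    # list until no rule overlaps more than `threshold` rules
    # (counting itself), bounding p for the off-chip lookup.
    kept, onchip = list(rules), []
    while kept:
        counts = [sum(rules_overlap(r, s) for s in kept) for r in kept]
        worst = max(range(len(kept)), key=counts.__getitem__)
        if counts[worst] <= threshold:
            break
        onchip.append(kept.pop(worst))
    return kept, onchip
```

For example, with the nested chain (*, *), (1*, *), (11*, *), (111*, *) and a threshold of 3, one call moves the wildcard rule on-chip and the remaining three rules stay off-chip.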

7. CONCLUSIONS

Due to the excessive power consumption and the high cost of TCAM devices, algorithmic solutions that are cost-effective, fast and power-efficient are still of great interest. In this paper, we propose an efficient solution that meets all of the above criteria to a great extent. Our solution combines Bloom filters implemented in high-speed on-chip memories with our Multi-Subset Crossproduct Algorithm. Our algorithm can classify a single packet in only 4 + p memory accesses on average, where p is the number of rules a given packet can match. The classification reports all p matching rules; hence, our solution is naturally a multi-match algorithm. Furthermore, a pipelined implementation of our algorithm can classify packets in max{4, p} memory accesses.

Due to its primary reliance on memory, our algorithm is power-efficient. It consumes an average of 32 to 45 bytes of memory per rule (on-chip and off-chip combined); hence, rule sets as large as 128K rules can easily be supported in less than 5MB of SRAM. Using two 300MHz 36-bit wide SRAM chips, packets can be classified at the rate of 38 million packets per second (OC-192 is equivalent to 31 Mpps with 40-byte packets).

8. REFERENCES

[1] IDT Generic Part: 71P72604. http://www.idt.com/?catID=58745&genID=71P72604.

[2] IDT Generic Part: 75K72100. http://www.idt.com/?catID=58523&genID=75K72100.

[3] Florin Baboescu and George Varghese. Scalable Packet Classification. In ACM SIGCOMM, 2001.

[4] Sarang Dharmapurikar, P. Krishnamurthy, and Dave Taylor. Longest Prefix Matching using Bloom Filters. In ACM SIGCOMM, August 2003.

[5] Pankaj Gupta and Nick McKeown. Packet Classification on Multiple Fields. In ACM SIGCOMM, 1999.

[6] T. V. Lakshman and D. Stiliadis. High-Speed Policy-Based Packet Forwarding Using Efficient Multi-Dimensional Range Matching. In ACM SIGCOMM, 1998.

[7] K. Lakshminarayanan, Anand Rangarajan, and Srinivasan Venkatachary. Algorithms for Advanced Packet Classification Using Ternary CAM. In ACM SIGCOMM, 2005.

[8] Haoyu Song, Sarang Dharmapurikar, Jonathan Turner, and John Lockwood. Fast Hash Table Lookup Using Extended Bloom Filter: An Aid to Network Processing. In ACM SIGCOMM, 2005.

[9] V. Srinivasan, George Varghese, Subhash Suri, and Marcel Waldvogel. Fast and Scalable Layer Four Switching. In ACM SIGCOMM, 1998.

[10] David Taylor and Jon Turner. Scalable Packet Classification Using Distributed Crossproducting of Field Labels. In IEEE INFOCOM, July 2005.

[11] David E. Taylor. Survey and Taxonomy of Packet Classification Techniques. Washington University Technical Report, WUCSE-2004, 2004.

[12] George Varghese. Network Algorithmics: An Interdisciplinary Approach to Designing Fast Networked Devices. 2005.

[13] Fang Yu and Randy H. Katz. Efficient Multi-Match Packet Classification with TCAM. In IEEE Hot Interconnects, August 2003.

[14] Fang Yu, T. V. Lakshman, Martin Austin Motoyama, and Randy H. Katz. SSA: A Power and Memory Efficient Scheme to Multi-Match Packet Classification. In ANCS '05: Proceedings of the 2005 Symposium on Architecture for Networking and Communications Systems, 2005.