Content Addressable Memories Vahid Tabatabaee Fall 2007
ENTS689L: Packet Processing and Switching
Commercial Network Processor Architectures
Content Addressable Memories
Vahid Tabatabaee
Fall 2007
References
Panos C. Lekkas, “Network Processors: Architectures, Protocols, and Platforms,” McGraw-Hill.
Kostas Pagiamtzis, Ali Sheikholeslami, “Content-Addressable Memory (CAM) Circuits and Architectures: A Tutorial and Survey,” IEEE Journal of Solid-State Circuits, vol. 41, no. 3, March 2006.
NetLogic MicroSystems Application Note, “Intradevice Configuration of Network Search Engines”.
NetLogic MicroSystems Application Note, “High Performance Layer 3 Forwarding”.
IDT White Paper, “Taking Packet Processing to the Next Level”.
Classification and Search Engines
A classification engine receives streams of packets as its input.
It continuously applies a set of application-specific sorting rules and policies to the packets.
It ends up compiling a series of new parallel packet streams in queues of packets.
For classification, the NP must consult a memory bank, a lookup table, or even a database where the rules are stored.
Search engines are used to consult a lookup table or a database, based on rules and policies, for correct classification.
Search engines are mostly based on associative memory, which is also known as CAM.
What is CAM?
Content Addressable Memory is a special kind of memory!
Read operation in a traditional memory:
The input is the address of the content we are interested in.
The output is the content stored at that address.
In a CAM it is the reverse:
The input is a word associated with something stored in the memory.
The output is the location where the associated content is stored.
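[Figure: the same four stored words accessed as a CAM and as a traditional memory]
Address   Stored word
0 0       1 0 1 X X
0 1       0 1 1 0 X
1 0       0 1 1 X X
1 1       1 0 0 1 1
Content Addressable Memory: input (search word) 0 1 1 0 1, output (address) 0 1
Traditional Memory: input (address) 0 1, output (content) 0 1 1 0 X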
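The two lookups in the figure can be sketched in a few lines of Python. This is only an illustrative model, not how the hardware is built: the stored words are taken from the figure, 'X' stands for a don't-care bit, and the function names are made up for this example.

```python
# Minimal sketch of the two lookups in the figure; 'X' marks a don't-care bit.
STORED = ["101XX", "0110X", "011XX", "10011"]  # words at addresses 00, 01, 10, 11

def ram_read(address):
    """Traditional memory: address in, content out."""
    return STORED[address]

def word_matches(stored, key):
    """A stored word matches when every non-X bit equals the key bit."""
    return all(s in ("X", k) for s, k in zip(stored, key))

def cam_search(key):
    """CAM: content in, address of the first matching word out (None if no match)."""
    for address, stored in enumerate(STORED):
        if word_matches(stored, key):
            return address
    return None

print(ram_read(0b01))        # 0110X
print(cam_search("01101"))   # 1
```

Note that 0 1 1 0 1 also matches the word at address 10; the sketch returns the first match, as a priority encoder would.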
CAM for Routing Table Implementation
A CAM can be used as a search engine when we want to find matching contents in a database or table. Example: a routing table.
Source: http://pagiamtzis.com/cam/camintro.html
Simplified CAM Block Diagram
The input to the system is the search word.
The search word is broadcast on the search lines.
Each match line indicates whether there is a match between the search word and the stored word.
An encoder specifies the match location.
If there are multiple matches, a priority encoder selects the first match.
A hit signal flags the case in which there is no match.
The search word is long, ranging from 36 to 144 bits.
Table size ranges from a few hundred entries to 32K.
Address space: 7 to 15 bits.
Source: K. Pagiamtzis, A. Sheikholeslami, “Content-Addressable Memory (CAM) Circuits and Architectures: A Tutorial and Survey,” IEEE J. of Solid-state circuits. March 2006
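As a rough software analogy for the encoder stage, the sketch below takes a list of match-line values and returns a hit flag plus the first matching address. It also checks the address widths quoted on the slide. The function names are hypothetical, and 128 is simply the entry count implied by 7 address bits.

```python
def priority_encode(match_lines):
    """Return (hit, address): address of the first asserted match line."""
    for address, matched in enumerate(match_lines):
        if matched:
            return True, address
    return False, None

def address_bits(entries):
    """Number of address bits needed to index `entries` locations."""
    return (entries - 1).bit_length()

print(priority_encode([False, True, True, False]))  # (True, 1)
print(address_bits(128), address_bits(32 * 1024))   # 7 15
```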
CAM Memory Size
The largest available CAM is around 18 Mbit (single chip).
Rule of thumb: the largest CAM chip is about half the size of the largest available SRAM chip, since a typical CAM cell consists of two SRAM cells.
CAM size has been growing at an exponential rate.
Source: K. Pagiamtzis, A. Sheikholeslami, “Content-Addressable Memory (CAM) Circuits and Architectures: A Tutorial and Survey,” IEEE J. of Solid-state circuits. March 2006
CAM Basics
The search-data word is loaded into the search-data register.
All match-lines are pre-charged to high (temporary match state).
Search line drivers broadcast the search word onto the differential search lines.
Each CAM core cell compares its stored bit against the bit on the corresponding search lines.
Match lines of words that have at least one mismatching bit discharge to ground.
Source: K. Pagiamtzis, A. Sheikholeslami, “Content-Addressable Memory (CAM) Circuits and Architectures: A Tutorial and Survey,” IEEE J. of Solid-state circuits. March 2006
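The search sequence above can be mimicked step by step in software. The sketch below is a behavioral model of a binary CAM only: the hardware compares all cells in parallel, whereas this loop visits them one at a time, and the function name is made up for this example.

```python
def search_cycle(stored_words, search_word):
    """Model one search: precharge, broadcast, compare, discharge on mismatch."""
    # Precharge: every match line starts high (temporary match state).
    match_lines = [True] * len(stored_words)
    # Broadcast each search bit onto its search line.
    for bit_pos, search_bit in enumerate(search_word):
        for row, word in enumerate(stored_words):
            # A single mismatching cell pulls its word's match line to ground.
            if word[bit_pos] != search_bit:
                match_lines[row] = False
    return match_lines

print(search_cycle(["10011", "01101", "01100"], "01101"))
# [False, True, False]
```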
Type of CAMs
Binary CAM (BCAM) stores only 0s and 1s.
Applications: MAC table consultation, layer 2 security-related VPN segregation.
Ternary CAM (TCAM) stores 0s, 1s, and don’t cares.
Applications: cases where we need wildcards, such as layer 3 and 4 classification for QoS and CoS purposes, and IP routing (longest prefix matching).
Available sizes: 1 Mb, 2 Mb, 4.7 Mb, 9.4 Mb, and 18.8 Mb.
CAM entries are structured as multiples of 36 bits rather than 32 bits.
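To make the TCAM use case concrete, here is a minimal sketch of longest prefix matching with ternary entries. It assumes each entry is a (value, care mask) pair and that longer prefixes are stored at lower addresses so the first match wins; the routes and port names are invented for illustration.

```python
import ipaddress

def prefix_to_ternary(prefix):
    """Turn an IPv4 prefix such as '10.1.0.0/16' into a (value, care_mask) pair.

    care_mask has 1s on the bits that must match; 0s are don't-care bits.
    """
    net = ipaddress.ip_network(prefix)
    return int(net.network_address), int(net.netmask)

def tcam_lookup(entries, address):
    """Return the next hop of the first entry whose cared-for bits match."""
    key = int(ipaddress.ip_address(address))
    for value, care_mask, next_hop in entries:
        if (key & care_mask) == value:
            return next_hop
    return None

# Longest prefixes are stored first so the first match is the longest one.
routes = [("10.1.2.0/24", "port C"), ("10.1.0.0/16", "port B"), ("10.0.0.0/8", "port A")]
table = [prefix_to_ternary(p) + (hop,) for p, hop in routes]

print(tcam_lookup(table, "10.1.2.7"))   # port C
print(tcam_lookup(table, "10.1.9.9"))   # port B
print(tcam_lookup(table, "10.7.0.1"))   # port A
```

A BCAM corresponds to the special case where every care mask is all 1s, which is why it suits exact-match tables such as MAC lookups.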
CAM Advantages
They associate the input (comparand) with their memory contents in one clock cycle.
They are configurable in multiple width and depth formats of search data, which allows searches to be conducted in parallel.
CAMs can be cascaded to increase the size of the lookup tables that they can store.
We can add new entries to their tables so that they learn what they did not know before.
They are one of the appropriate solutions for higher speeds.
CAM Disadvantages
They cost several hundred dollars per CAM, even in large quantities.
They occupy a relatively large footprint on a card.
They consume excessive power.
Generic system engineering problems:
Interface with the network processor.
Simultaneous table update and lookup requests.
CAM structure
The comparand bus is 72 bytes wide and bidirectional.
The result bus is an output.
The command bus enables instructions to be loaded into the CAM.
It has 8 configurable banks of memory.
The NPU issues a command to the CAM.
CAM then performs exact match or uses wildcard characters to extract relevant information.
There are two sets of mask registers inside the CAM.
[Figure: CAM block diagram. Blocks: CAM control, global mask registers, 72 bits x 131072 CAM array (72 bits x 16K x 8 structures), empty bit, priority encoder, flag control, output port control, control & status registers, I/O port control, decoder, pipeline execution control (command bus).]
Mixable with: 72 bits x 16384, 144 bits x 8192, 288 bits x 4096, 576 bits x 2048.
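The array dimensions in the figure can be cross-checked with a little arithmetic. The sketch below assumes each of the 8 banks holds 72 bits x 16K and that a wider word format simply trades depth for width within a bank; the helper names are made up.

```python
BANKS = 8
BANK_BITS = 72 * 16 * 1024           # one bank: 72 bits x 16K entries

def total_bits():
    """Capacity of the whole array in bits."""
    return BANK_BITS * BANKS

def entries_per_bank(width_bits):
    """How many words of the given width fit in one bank."""
    return BANK_BITS // width_bits

print(total_bits(), total_bits() // 72)          # 9437184 131072
for width in (72, 144, 288, 576):
    print(width, entries_per_bank(width))        # 16384, 8192, 4096, 2048
```

The total of 9,437,184 bits is roughly 9.4 Mb, and 131072 is the number of 72-bit words across all banks.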
CAM structure
There are global mask registers, which can remove specific bits from the comparison, and a mask register that is present in each memory location.
The search result can be a single output (the highest-priority match) or a burst of successive results.
The output port is 24 bytes wide.
Flag and control signals specify status of the banks of the memory.
They also enable us to cascade multiple chips.
CAM Features
CAM cascading:
We can cascade up to 8 devices without incurring a performance penalty in search time (72 bits x 512K).
We can cascade up to 32 devices with performance degradation (72 bits x 2M).
Terminology:
Initializing the CAM: writing the table into the memory.
Learning: updating specific table entries.
Writing a search key to the CAM: the search operation.
Handling wider keys: most CAMs support 72-bit keys. They can support wider keys in native hardware.
Shorter keys: can be handled more efficiently at the system level.
CAM Latency
The clock rate is between 66 and 133 MHz.
The clock speed determines the maximum search capacity.
Factors affecting search performance: key size, table size.
For the system designer, the total latency to retrieve data from the SRAM connected to the CAM is what matters.
By using pipelining and multithreading techniques for resource allocation, we can ease the CAM speed requirements.
Source: IDT
Packet Search Speed Requirements
Source: IDT article in CommsDesign: http://www.commsdesign.com/showArticle.jhtml?articleID=16501972
Source: IDT
Management of Tables Inside a CAM
It is important to squeeze as much information as we can into a CAM.
Example from NetLogic application notes:
We want to store 4 tables of 32-bit-wide IP destination addresses.
The CAM is 128 bits wide.
If we store each entry directly in its own slot, 96 bits are wasted.
We can arrange the 32-bit-wide tables next to each other.
Every 128-bit slot is partitioned into four 32-bit slots.
These are the 3rd, 2nd, 1st, and 0th tables, going from left to right.
We use the global mask register to access only one of the tables.
MASK 3: 00000000 FFFFFFFF FFFFFFFF FFFFFFFF
MASK 2: FFFFFFFF 00000000 FFFFFFFF FFFFFFFF
MASK 1: FFFFFFFF FFFFFFFF 00000000 FFFFFFFF
MASK 0: FFFFFFFF FFFFFFFF FFFFFFFF 00000000
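The mask patterns above can be reproduced with a short sketch. It reads the masks as "0 = compare this bit, 1 = ignore this bit", which is inferred from the patterns rather than stated in the application note; the slot contents and helper names are invented for illustration.

```python
FIELD_BITS = 32
SLOT_BITS = 128
ALL_ONES = (1 << SLOT_BITS) - 1

def global_mask(table):
    """Mask with 0s over the selected table's field and 1s (ignore) elsewhere."""
    field = ((1 << FIELD_BITS) - 1) << (table * FIELD_BITS)
    return ALL_ONES & ~field

def pack_slot(t3, t2, t1, t0):
    """Place four 32-bit entries side by side: table 3 leftmost, table 0 rightmost."""
    return (t3 << 96) | (t2 << 64) | (t1 << 32) | t0

def search(slots, table, key):
    """Return the index of the first slot whose selected field equals key."""
    care = ALL_ONES & ~global_mask(table)          # bits that take part in the compare
    comparand = key << (table * FIELD_BITS)
    for index, slot in enumerate(slots):
        if (slot & care) == (comparand & care):
            return index
    return None

slots = [pack_slot(0x0A000001, 0xC0A80001, 0xAC100001, 0x08080808),
         pack_slot(0x0A000002, 0xC0A80002, 0xAC100002, 0x08080404)]

print(f"{global_mask(3):032X}")         # 00000000FFFFFFFFFFFFFFFFFFFFFFFF
print(search(slots, 2, 0xC0A80002))     # 1
```

Searching table 1 for the same key returns no match, because only the selected 32-bit field takes part in the comparison.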
Example Continued
We can still use the per-entry mask register (not the global mask register) to do maximum prefix length matching.
[Figure: CAM entries with bit positions 127 down to 0, a comparand register, and a global mask register; the search result is MATCH FOUND.]
Table Aggregation
We can use tag bits to aggregate multiple tables in a single CAM.
Example:
We want to use a single CAM (NL85721) for IPv4 packet classification and forwarding.
We want to filter packets based on other parameters such as VPN.
Without tag bits, we can get an undesired match when we want to do a classification.
CAM word 0 does not match, but the destination address matches CAM word 1.
Source: http://www.netlogicmicro.com/pdf/ncs12_rev_0_8.pdf
Tag bits to avoid undesired matches
Tag bits can be used to differentiate between tables.
Tag bits should not be masked.
For packet classification the tag bit is 0, and for packet forwarding it is 1.
Source: http://www.netlogicmicro.com/pdf/ncs12_rev_0_8.pdf
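A small model shows how an unmasked tag bit keeps the two tables apart. The layout here (one tag bit placed above a 32-bit address field, prefix-style care masks) and the example rules are assumptions for illustration, not the NL85721 word format.

```python
TAG_CLASSIFY, TAG_FORWARD = 0, 1
ADDR_BITS = 32
TAG_SHIFT = ADDR_BITS                      # tag sits just above the address field

def make_entry(tag, addr, prefix_len):
    """Build (value, care_mask); the tag bit is always compared, never masked."""
    addr_care = ((1 << prefix_len) - 1) << (ADDR_BITS - prefix_len)
    care = (1 << TAG_SHIFT) | addr_care
    value = ((tag << TAG_SHIFT) | addr) & care
    return value, care

def search(entries, tag, addr):
    """Return the index of the first entry matching both the tag and the address."""
    key = (tag << TAG_SHIFT) | addr
    for index, (value, care) in enumerate(entries):
        if (key & care) == value:
            return index
    return None

entries = [
    make_entry(TAG_FORWARD, 0x0A010000, 16),   # word 0: forwarding rule for 10.1.0.0/16
    make_entry(TAG_CLASSIFY, 0x0A010200, 24),  # word 1: classification rule for 10.1.2.0/24
]

print(search(entries, TAG_CLASSIFY, 0x0A010203))  # 1, word 0 is skipped by its tag
print(search(entries, TAG_FORWARD, 0x0A010203))   # 0
```

If the tag bit were masked, the classification search would hit word 0 first, which is the undesired match the tag is meant to prevent.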
Vertically Oriented Table Aggregation
We can use validity bits to support multiple tables with different numbers of entries.
We need one validity bit for each table.
When the validity bit in a slot is 1, the corresponding table has a valid entry in that slot.
In the comparand register, only the validity bit of the table being searched should be 1.
Source: http://www.netlogicmicro.com/pdf/ncs12_rev_0_8.pdf
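The role of the validity bits can be illustrated with a toy model in which each slot holds one (validity bit, field) pair per table. It assumes the columns of the tables not being searched are masked out, which the slide does not spell out; the data values are invented.

```python
def search(slots, table, key):
    """Search one table; slots are lists of (valid_bit, field) pairs, one per table.

    Only the searched table's validity bit and field are compared; the other
    tables' columns are assumed to be masked out (an assumption of this sketch).
    """
    for index, slot in enumerate(slots):
        valid, field = slot[table]
        if valid == 1 and field == key:
            return index
    return None

# Table 0 has three entries, table 1 has one; unused fields hold stale zeros.
slots = [
    [(1, 0x0A000001), (1, 0xC0A80001)],
    [(1, 0x0A000002), (0, 0x00000000)],
    [(1, 0x0A000003), (0, 0x00000000)],
]

print(search(slots, 0, 0x0A000003))  # 2
print(search(slots, 1, 0x00000000))  # None, the zero fields are not valid entries
```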
System Design Issues (multiple searches)
For deep packet inspection, several searches must occur simultaneously.
For example: MAC table, IP table, rules table, flow-management table.
Question: do we use 4 CAMs or just 1 CAM with 4 partitions?
If we use only 1 CAM:
Some tables are very large and some are small.
This approach wastes expensive partitions.
If we use 4 CAMs:
It suffers when smaller tables do not justify using separate CAMs.
The overall cost also increases, since we have to use separate SRAMs too.
[Figure: packet processing environment. A network processor or custom-designed ASIC connected to four CAMs, each paired with an SRAM.]
System Design Issues (shorter and longer search keys)
We showed how we can implement 36-bit search tables in a 72-bit-wide CAM.
This approach halves the speed, since we need to search twice for each key.
Some CAMs are hardwired to support both 36- and 72-bit-wide search keys, but they are more expensive.
For longer search keys there are two choices:
We can use a double data rate (DDR) bus and load meaningful bits on both the rising and falling edges of the clock.
We can double the clock frequency at which the comparands are loaded.
System Design Issues (simultaneous update and search)
A CAM cannot update a location and perform a search at the same time.
When we do an update, packets cannot be forwarded and they are backlogged.
We can have a backup CAM for updates while searches are done on the other CAM.
Some designs offer a third port for table maintenance without inhibiting search operations (SiberCore is an example).
This increases the pin count, the board real estate, and the number of signals to be routed on the board.
System Design Issues (CIDR table update)
Recall that CIDR works based on the longest prefix match (LPM).
CAM segments are created based on the prefix length.
Some empty slots are left in each segment to accommodate new entries.
If a segment is suddenly filled up, the table must be taken offline to reshuffle the entries.
A read and write operation is needed for each entry that must be relocated. We may need a read and write for the mask word too.
Source: http://www.netlogicmicro.com/pdf/cidr_white_paper.pdf
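The cost of a full segment can be estimated with a simple counting model. It assumes one segment per prefix length (32 for IPv4) and that a free slot is borrowed from the nearest segment that has one, at the price of one entry move per segment crossed. This is a simplification for illustration, not the procedure from the white paper.

```python
def insert_prefix(free_slots, length):
    """Insert one prefix of the given length; return the number of entry moves.

    free_slots[i] is the number of empty slots left in the segment for prefix
    length i + 1. If the target segment is full, a free slot is borrowed from
    the nearest segment that has one, moving one entry per segment crossed.
    """
    target = length - 1
    donors = [i for i, free in enumerate(free_slots) if free > 0]
    if not donors:
        raise MemoryError("CAM is full")
    donor = min(donors, key=lambda i: abs(i - target))
    free_slots[donor] -= 1
    return abs(donor - target)

# 32 segments, and only the /32 segment has room: inserting a /1 crosses 31 segments.
free = [0] * 31 + [1]
print(insert_prefix(free, 1))   # 31
```

When the target segment still has an empty slot, the same function returns 0 moves, which is why leaving spare slots in every segment keeps most updates cheap.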
CIDR table update: worst case analysis
What is the worst-case scenario?
All segments but one are full.
A new entry may need up to 31 move operations.
Each move requires 4 clock cycles, for a total of 4 x 31 = 124 clock cycles.
We have 3000 routing updates per second: 3000 x 124 = 372000 clock cycles per second.
If the NP clock rate is 100 MHz, the cycle time is 10 nsec.
How much time do the updates consume? 372000 cycles x 10 nsec per cycle = 3.72 msec in every second.
At OC-192 rate, we have around 20 to 30 MPPS.
Therefore, 74,400 to 111,600 packets per second will not be classified and should be discarded.
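The worst-case numbers can be re-derived in a few lines. All constants come from the slide; integer arithmetic in nanoseconds is used to avoid rounding, and the function name is made up.

```python
MOVES_PER_UPDATE = 31        # worst case: all segments but one are full
CYCLES_PER_MOVE = 4
UPDATES_PER_SEC = 3000
CYCLE_TIME_NS = 10           # 100 MHz clock

cycles_per_update = MOVES_PER_UPDATE * CYCLES_PER_MOVE        # 124
cycles_per_sec = cycles_per_update * UPDATES_PER_SEC          # 372000
busy_ns_per_sec = cycles_per_sec * CYCLE_TIME_NS              # 3720000 ns = 3.72 ms

def packets_lost_per_sec(mpps):
    """Packets arriving while the table is busy, at `mpps` million packets/s."""
    # mpps * 1e6 packets/s * busy_ns * 1e-9 s  ==  mpps * busy_ns / 1000
    return mpps * busy_ns_per_sec // 1000

print(busy_ns_per_sec / 1e6, "ms")                          # 3.72 ms
print(packets_lost_per_sec(20), packets_lost_per_sec(30))   # 74400 111600
```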
Reproaches against CAM based search engines (POWER)
There is a misconception that the power consumption of CAMs is increasing!
It does not make sense to compare the power consumption of a 2 Mb CAM clocked at 66 MHz and capable of 66 Msps with that of a 9 Mb CAM clocked at 150 MHz and capable of 125 Msps.
Power consumption is the result of multiple factors, such as:
Semiconductor manufacturing process.
Number of searches per second.
Storage density.
The smaller the process, the larger the capacity; a smaller process can also bring a lower supply voltage and a higher clock rate.
A 0.18μ process uses 50% less power than 0.25μ, with a further 30% improvement at 0.15μ.
The absolute power consumption is increasing because of:
Larger tables.
Wider search keys for deep packet classification.
Increased wire speed.
Make sure to consider worst-case scenarios, not the data sheet values.
Reproaches against CAM based search engines
Table maintenance and management is a software-related problem.
The third port (Synchronous Maintenance Interface [SMI]) on SiberCore CAMs is an interesting way of doing table maintenance without affecting the ongoing search processes.
Sort-free CAMs do not need to be partitioned.
Density and footprint (not a real issue).
Example: the three members of the family, the CYNSE10512, 10256, and 10128, provide address tables of 512k, 256k, and 128k entries (18 Mbits, 9 Mbits, and 4.5 Mbits), respectively.
All three devices are housed in 388-contact BGA packages.
Price: $75, $135, $275.
A 1,000,000-entry IPv4 table can be handled in two 18 Mbit CAMs.
Reproaches against CAM based search engines
Inflexibility with table configurations:
This is a real issue.
Some applications need flexible table sizes and widths.
More research and development is needed.
Price:
In absolute terms they are expensive.
They are sophisticated, complex products that are indispensable in most designs.
So they should be expensive!