A Resource Efficient Content Inspection System for Next Generation Smart NICs
Karthikeyan Sabhanatarajan, Ann Gordon-Ross*
The Energy Efficient Internet Project
High-performance Computing & Simulation Research Lab
ECE Department, University of Florida, Gainesville
This work was supported by the U.S. National Science Foundation
* Also affiliated with NSF Center for High Performance Reconfigurable Computing
Introduction
• The Internet has grown at an alarming rate: 305% between 2000 and 2008.
• Edge devices are left idle 75% of the time, with power management features disabled to maintain network connectivity.
• A solution to save power on idle devices is power proxying.
• The idle PC is allowed to sleep.
• The PC delegates responsibility for handling network traffic to the NIC.
• Additionally, NICs can enhance network security through network intrusion detection.
• Next-generation interfaces, also known as Smart NICs (SNICs), are expected to take on increased network responsibility.
• Key requirement: packet inspection. A packet consists of a header and a payload; header inspection examines the header, and content inspection examines the payload.
• This presentation focuses on content inspection: the process of searching the packet payload for occurrences of a known set of patterns called signatures.
Motivation
Existing Methodologies
• Software: Boyer-Moore, Aho-Corasick, Wu-Manber
• Hardware: FPGAs, TCAMs, Bloom filters
• Software techniques cannot support high-speed links with large signature sets.
• FPGAs exploit parallelism, but their price, area, and power are prohibitive for wide-scale deployment.
• TCAMs are a popular option with O(1) performance; however, existing implementations have prohibitive energy, price, and auxiliary data structure requirements. Auxiliary data structures such as SRAM are used to store pattern combinations that help determine a pattern match.
• Bloom filters are energy efficient with moderate throughput, but false positives require further inspection on a payload match, and they impose parallelism limits (scalability).
Background – TCAM Methodology
• Sample signature: A B C D E F G H A B C D J K L M E F G
• When w = 4:
  Prefix pattern: A B C D
  Suffix patterns: E F G H | A B C D | J K L M | E F G *
• TCAM entries (w = 4): A B C D | E F G H | J K L M | E F G *
• TCAMs are attractive candidates for pattern matching due to their inherent simplicity in pattern matching, small lookup time, high throughput, high density, and scalability.
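The chunking of a signature into width-w patterns can be sketched in a few lines of Python. This is only an illustration, assuming signatures are plain character strings and that a short final chunk is padded with `*` as the don't-care symbol; the names `split_signature` and `build_tcam` are hypothetical and do not come from the presentation.

```python
def split_signature(signature, w=4, wildcard="*"):
    """Split a signature into one w-wide prefix pattern and its suffix patterns."""
    if not signature:
        raise ValueError("signature must be non-empty")
    chunks = [signature[i:i + w] for i in range(0, len(signature), w)]
    # Pad the last chunk with don't-care symbols so every entry is w wide.
    chunks[-1] = chunks[-1].ljust(w, wildcard)
    return chunks[0], chunks[1:]


def build_tcam(signatures, w=4):
    """Collect the unique w-wide patterns in first-seen order (one shared TCAM)."""
    entries = []
    for sig in signatures:
        prefix, suffixes = split_signature(sig, w)
        for pattern in [prefix] + suffixes:
            if pattern not in entries:
                entries.append(pattern)
    return entries


prefix, suffixes = split_signature("ABCDEFGHABCDJKLMEFG")
tcam = build_tcam(["ABCDEFGHABCDJKLMEFG"])
```

For the sample signature, the repeated chunk A B C D occupies a single entry in the shared table, which is why the single TCAM holds four entries for five chunks.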
• The TCAM (w = 4) holds the entries A B C D | E F G H | J K L M | E F G *. Example payload: A B C D E F G H J K L M E F G U I
• Auxiliary SRAM structures contain several pattern permutations to identify valid patterns.
• O(N^2): auxiliary SRAM structure space requirement of the technique proposed by Lakshman et al.
• Gao et al. reduced this requirement to O(N log N) by storing address permutations.
• Auxiliary SRAM structures: Combined Pattern Table, Matching Table, and Partial Hit List (with a matched index). Between them, these structures:
• store information on the type of matched pattern, i.e., prefix or suffix
• store the valid combinations of all possible prefix and suffix entries
• record the index of the constructed prefix pattern
Proposed Solution
TCAM techniques are:
• The simplest and fastest technique, with O(1) lookup
• Able to match future link speeds of 10 Gbps
• Highly scalable, with no parallelism limits
• Able to accommodate signatures of varying length and different signature set sizes with ease
However, they suffer from:
• Increased energy consumption
• Prohibitive price
• Increased auxiliary data structure requirements
This makes them unsuitable for wide-scale deployment in SNICs.
We propose a hybrid TCAM-based solution. Our technique provides:
• Energy efficiency through a partitioned architecture
• Reduced auxiliary data structure requirements, using a Bloom filter or software techniques
• Further reduction in power consumption through caching that exploits network locality
• Throughput that meets the requirements of high-speed links such as 1 Gbps and 10 Gbps with ease
• Better suitability for wide-scale deployment, due to high energy efficiency and reduced memory requirements
Hybrid TCAM Methodology
• Partition the single TCAM into a prefix TCAM (PTCAM) and a suffix TCAM (STCAM).
• Store signatures in the PTCAM and STCAM accordingly. The signature is then expressed as a permutation of PTCAM and STCAM addresses.
• Sample signature: A B C D E F G H A B C D J K L M E F G
  Address permutation: P0 S0 S1 S2 S3
• PTCAM (w = 4): A B C D
• STCAM (w = 4): E F G H | A B C D | J K L M | E F G *
• This permutation is then stored in a Bloom filter or in software.
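The partitioning step can be sketched as follows. This is a minimal sketch under stated assumptions: signatures are character strings, the first w-wide chunk goes to the PTCAM and all later chunks go to the STCAM, addresses are assigned in first-seen order, and `*` pads a short final chunk. The function name `partition_signatures` and the `P<n>`/`S<n>` string encoding are hypothetical choices for illustration.

```python
def partition_signatures(signatures, w=4, wildcard="*"):
    """Return (ptcam, stcam, permutations) for a list of signature strings."""
    ptcam, stcam, permutations = [], [], []

    def address(table, pattern):
        # Reuse the address of an existing entry, otherwise append a new one.
        if pattern not in table:
            table.append(pattern)
        return table.index(pattern)

    for sig in signatures:
        if not sig:
            raise ValueError("signature must be non-empty")
        chunks = [sig[i:i + w].ljust(w, wildcard) for i in range(0, len(sig), w)]
        perm = ["P%d" % address(ptcam, chunks[0])]
        perm += ["S%d" % address(stcam, chunk) for chunk in chunks[1:]]
        permutations.append(perm)
    return ptcam, stcam, permutations


ptcam, stcam, perms = partition_signatures(["ABCDEFGHABCDJKLMEFG"])
```

For the sample signature this yields one PTCAM entry, four STCAM entries, and the address sequence P0 S0 S1 S2 S3. Note that A B C D now appears in both tables, since the two TCAMs no longer share entries.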
Exploiting Signature Locality
• Our experiments indicate that network traces exhibit sufficient locality.
• To reduce unwanted switching, we exploit this property and introduce a cache between the PTCAM and STCAM.
[Figure: rule ID (0 to 2000) versus malicious packet ID (0 to 2000)]
[Figure: PTCAM (w = 4; entry A B C D) and STCAM (w = 4; entries E F G H | A B C D | J K L M | E F G *), with a suffix cache and cache controller ($Ctrl)]
[Figure: PTCAM and STCAM (w = 4) with a suffix cache and cache controller ($Ctrl). An activator right-shifts an enable bit (1 0 0 0) through an enable buffer with positions 0 to (w-1); hit, miss, enable, and pause signals connect the units; the payload A B C D E F G H J K L M E F G U I is left-shifted into the system.]
• Payload is fed to the inspection system, shifted at a rate of 1 byte per clock.
• The cache is activated (w-1) clock cycles after a TCAM hit.
• A cache miss pauses shifting to allow the suffix TCAM to be searched for the pattern.
• The cache controller ($Ctrl) updates the suffix cache.
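The hit and miss behavior of the suffix cache can be illustrated with a small behavioral model. This is a sketch under several assumptions, not the simulator used in this work: the STCAM is modeled as a dictionary from pattern to address, matching is exact (don't-care bits are ignored), clock-level timing and the pause signal are not modeled, and eviction uses random replacement, the policy named in the conclusion. The class name `SuffixCache` and its fields are hypothetical.

```python
import random


class SuffixCache:
    """Small cache in front of the STCAM with random replacement."""

    def __init__(self, capacity, seed=0):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        self.entries = {}                 # pattern -> STCAM address
        self.rng = random.Random(seed)    # seeded for repeatable runs
        self.hits = 0
        self.misses = 0

    def lookup(self, window, stcam):
        """Return the STCAM address for window, or None if nothing matches."""
        if window in self.entries:
            self.hits += 1                # served by the cache, STCAM stays idle
            return self.entries[window]
        self.misses += 1
        addr = stcam.get(window)          # a miss falls back to the full STCAM
        if addr is not None:
            if len(self.entries) >= self.capacity:
                victim = self.rng.choice(list(self.entries))
                del self.entries[victim]  # evict a randomly chosen entry
            self.entries[window] = addr
        return addr


stcam = {"EFGH": 0, "ABCD": 1, "JKLM": 2}
cache = SuffixCache(capacity=2)
results = [cache.lookup(p, stcam) for p in ["EFGH", "EFGH", "ABCD", "JKLM", "ZZZZ"]]
```

Each cache hit stands for one STCAM search that was avoided, so the ratio `hits / (hits + misses)` corresponds to the hit rate reported in the caching results.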
[Figure: PTCAM, STCAM, suffix cache, and enable logic, with match outputs (11 P1, 01 S1, 00, …) that are left-shifted into the sequence P1 S1 …]
• The sequence of matched addresses (P1 S1 …) is sent to the Bloom filter or software unit to verify the combination.
[Figure: PTCAM, STCAM, suffix cache, and enable logic, with a contention resolution unit that receives the hit signal and match address (MatchAddr) from each TCAM]
• A contention resolution unit handles contention between identical PTCAM and STCAM patterns. Preference is given to a PTCAM match over an STCAM match.
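The priority rule can be written as a short function. This is a simplified sketch, assuming both TCAMs are modeled as dictionaries from pattern to match address and that matching is exact; the function name `resolve_contention` and the `("P", addr)` / `("S", addr)` result encoding are hypothetical.

```python
def resolve_contention(window, ptcam, stcam):
    """Return ('P', addr), ('S', addr), or None for a w-wide payload window."""
    if window in ptcam:
        return ("P", ptcam[window])   # a PTCAM match takes priority
    if window in stcam:
        return ("S", stcam[window])
    return None


ptcam = {"ABCD": 0}
stcam = {"EFGH": 0, "ABCD": 1, "JKLM": 2}
both = resolve_contention("ABCD", ptcam, stcam)
suffix_only = resolve_contention("EFGH", ptcam, stcam)
no_match = resolve_contention("ZZZZ", ptcam, stcam)
```

Here A B C D is stored in both tables, so the PTCAM address is reported and the STCAM match is dropped.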
Experimental Setup
• Packet traces: malicious traces from MIT-LL and from the Capture the Flag contest at DEFCON.
• No power proxying traces are available; this is ongoing research.
• A custom C-based simulator was written to behaviorally simulate the entire system.
• Packets are reassembled and fed to the simulator.
• STCAM accesses are saved to analyze the effect of caching.
• TCAM energy consumption is obtained from the TCAM modelling tool of Agarwal et al.
• SNORT and ClamAV are used as signature sets.
Results – Signature Distribution
[Figure: cumulative signature distribution (0 to 60000) versus signature length (0 to 185) for Snort 2.4, Snort 2.8, and ClamAV]
• ClamAV and SNORT rule sets: SNORT has smaller patterns (70% <= 4 bytes); ClamAV has medium-sized patterns (72% <30 bytes & >100 bytes).
Results – Effect of Partitioning on Size
[Figure: TCAM size (bytes, 0 to 5000) versus TCAM width (4 to 128 bytes) for P_TCAM, S_TCAM, combined TCAMs, and non-partitioned TCAMs]
• Partitioning circumvents natural TCAM compression. However, the increase in TCAM size is negligible.
Results – EDP Reduction
[Figure: EDP reduction (-10% to 70%) versus TCAM width (4 to 128 bytes) for SNORT v2.8 and ClamAV]
• Partitioning reduces the Energy-Delay Product (EDP).
• Two smaller TCAMs are faster than one single big TCAM.
• EDP savings are higher for widths of 8 and 16 bytes.
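As a reminder of how the metric is computed, the energy-delay product is the energy per access multiplied by the access delay, and the reduction is taken relative to the non-partitioned baseline. The sketch below uses made-up input values purely to show the arithmetic; they are not measurements from this work, and the function names are hypothetical.

```python
def edp(energy, delay):
    """Energy-delay product for one access (any consistent units)."""
    return energy * delay


def edp_reduction(base_energy, base_delay, part_energy, part_delay):
    """Fractional EDP reduction of a partitioned design versus a baseline."""
    base = edp(base_energy, base_delay)
    if base <= 0:
        raise ValueError("baseline EDP must be positive")
    return 1.0 - edp(part_energy, part_delay) / base


# Hypothetical values only: baseline 10 energy units and 2 delay units,
# partitioned design 6 energy units and 1.5 delay units.
reduction = edp_reduction(10.0, 2.0, 6.0, 1.5)
```

A negative result means the partitioned design has a larger EDP than the baseline, which is why the reduction axis can extend below 0%.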
Results – Energy Savings
[Figure: energy reduction (0% to 100%) versus TCAM width (4 to 64 bytes) for SNORT v2.8 and ClamAV on the MIT_1, MIT_2, and DEFCON traces]
1. Energy reduction for a partitioned system compared to a non-partitioned system versus TCAM width, for real-time traffic traces.
2. Energy savings range from 6% to 69% (SNORT) and 6% to 87% (ClamAV).
3. Smaller TCAM widths give greater energy savings.
4. Larger TCAM accesses use more "don't care" bits.
Results – Effect of Caching – Hit Rate
[Figure: cache hit rate (0% to 100%) versus number of cache entries (10 to 90) for SNORT v2.8 and ClamAV on the MIT_1, MIT_2, and DEFCON traces]
1. Caching is analyzed for an STCAM width of 4 bytes.
2. Hit rates range from 28% to 88% for cache sizes of only 40 to 60 entries.
3. A cache containing 40 to 60 entries represents only 0.002% to 0.004%, respectively, of the S_TCAM entries.
Results – Effect of Caching – Energy Savings
[Figure: energy savings (0% to 70%) versus number of cache entries (10 to 90) for SNORT v2.8 and ClamAV on the MIT_1, MIT_2, and DEFCON traces]
• Energy savings for a partitioned TCAM system (w = 4) with a suffix cache, compared to a partitioned TCAM system with no suffix cache, for a varying number of cache entries.
• 13% to 64% additional savings.
Conclusion
1. Developed an energy-efficient, partitioned TCAM-based content inspection system for SNICs.
2. The system is energy and throughput aware.
3. Energy-Delay Product improvements of up to 62% compared to previous non-partitioned TCAM systems.
4. Up to 87% energy savings (average) compared to a non-partitioned TCAM system.
5. A simple cache with a random replacement policy further reduces the energy consumption by up to 64% compared to a partitioned TCAM system.
6. Caching incurs a throughput reduction of 5.5%.
Future Work
1. Evaluating the proposed Bloom filter based architecture.
2. Improved caching techniques.
3. Attack robustness to counter maliciously engineered packets.
4. A pipelined architecture to hide cache misses and improve throughput.