CprE / ComS 583 Reconfigurable Computing
-
Upload
stephan-andrew -
Category
Documents
-
view
20 -
download
1
description
Transcript of CprE / ComS 583 Reconfigurable Computing
CprE / ComS 583Reconfigurable Computing
Prof. Joseph ZambrenoDepartment of Electrical and Computer EngineeringIowa State University
Lecture #8 – Reconfigurable Networking
CprE 583 – Reconfigurable ComputingSeptember 13, 2007 Lect-08.2
Recap – Genetic Pattern Matching
• Comparing strings by edit distance• Motivation: The Human Genome Project
• Do two genetic strings match?• How are they related?
• When biologists characterize a new sequence, they want to compare it to the (growing) database of known sequences
• Abstraction:• What is the cost of transforming s into t • Given – costs for insertion, deletion, substitution
CprE 583 – Reconfigurable ComputingSeptember 13, 2007 Lect-08.3
Alphabet and Costs
• Alphabet• Letters in the string. For DNA, there are four:
• A (Adenine)• C (Cytosine)• T (Thymine)• G (Guanine)
• Transformation Costs• Insert:1, Delete:1, Substitute:2, match:0
• Type of comparison• One target to many sources• One target to one source
CprE 583 – Reconfigurable ComputingSeptember 13, 2007 Lect-08.4
Substitution Example
baboon
CostMoveWord
bab|on
bobon
boubon
bourbon
Delete ‘o’
Substitute ‘o’
Insert ‘u’
Insert ‘r’
Match?
Total cost:
1
2
1
1
0
5bourbon
CprE 583 – Reconfigurable ComputingSeptember 13, 2007 Lect-08.5
Dynamic Programming Solution
• Source sequence: s1, s2, … sm
• Target sequence: t1, t2, … tn
• di,j = distance between subsequence s1, s2, …si and subsequence t1, t2, …tj, where
d0,0 = 0
di,0 = di-1,0 + Delete(si)
d0,j = d0,j-1 + Insert(tj)
di-1,j + Delete(si)
di,j = min di,j-1 + Insert(tj)
di-1,j-1 + Substitute(si, tj)• Distance(Source, Target) = dm,n
CprE 583 – Reconfigurable ComputingSeptember 13, 2007 Lect-08.6
Dynamic Programming Example
baboon
0123456
b1012234
o2122223
u3233334
r4344345
b5344455
o6455456
n7566565
CprE 583 – Reconfigurable ComputingSeptember 13, 2007 Lect-08.7
Parallelism on the Anti-Diagonal
baboon
0123456
b1012234
o2122223
u3233334
r4344345
b5344455
o6455456
n7566565 di,j
di-1,j
di,j-1
di-1,j-1
CprE 583 – Reconfigurable ComputingSeptember 13, 2007 Lect-08.8
SCin
SDin
TCout
TDout
If (SCin != 0) and (TCin != 0)
PEDist + Substitute(SCin, TCin)
PEDist min TDin + Delete(SCin)
SDin + Insert(TCin)
else-if (SCin != 0)
PEDist SDin
else-if (TCin != 0)
PEDist TDin
endif
SCout SCin
TCout TCin
SDout PEDist
TDout PEDist
Bidirectional PE Example
BidirPE
PEDist
BidirPE
PEDist
BidirPE
PEDist
SCout
SDout
TCin
TDin
CprE 583 – Reconfigurable ComputingSeptember 13, 2007 Lect-08.9
Bidirectional Summary
• 16 CLBs/PE• 384 PEs/Board• 2,100 Million Cells/sec• Requires 2*(m+n) PEs• Uses only half the processors at any one time• Must stream both source and target for each
comparison• Makes comparison against large DB impractical
CprE 583 – Reconfigurable ComputingSeptember 13, 2007 Lect-08.10
Genetic Search Performance
• Nearly linear scaling in cell updates per second (CUPS)
• Need to reuse array for large patterns
Hardware CUPS
Splash 2 x16 43,000M
Splash 2 3,000M
Splash 1 370M
P-NAC (34) 500M
λ
0.60μ
0.60μ
0.60μ
2.0μ
CM-2 (64K) 150M
CM-5 (32) 33M
SPARC 10 1.2M
SPARC 1 0.87M
?
?
0.40μ
0.75μ
Area
500Mλ2x17x16
500Mλ2x16
420Mλ2x32
7.8Mλ2x34
1.6GMλ2
273Mλ2
CUP/λ2s
0.32
0.38
0.028
1.9
0.00075
0.0032
CprE 583 – Reconfigurable ComputingSeptember 13, 2007 Lect-08.11
Outline
• Recap – Pattern Matching on Splash-2• The Field-Programmable Port Extender (FPX) • FPX Architecture• FPX Programming Model• FPX Applications
• Pattern Matching• Packet Classification• Rule Processing
CprE 583 – Reconfigurable ComputingSeptember 13, 2007 Lect-08.12
Application – Network Processing
• Networking applications well-suited for reconfigurable hardware• Target signatures change often• Massive quantities of stream-based data• Repetitive operations
• Connecting up to a realistic networking environment is hard• Washington University experimental setup one of the
best• Shows importance of both memory and processing
capability• Numerous experiments performed over the past five
years
CprE 583 – Reconfigurable ComputingSeptember 13, 2007 Lect-08.13
Network Routing with the FPX
• FPX Modules distributed across each port of a switch • IP packets (over ATM) enter and depart line card• Packet fragments processed by modules• Advantages:
• New protocols implemented directly in silicon• Easy to upgrade in the field
IP Packets
IP Packets
IPP
IPP OPP
OPP
CardOC3/OC12/OC48
Line FPX
ExtenderPort
programmableField-
FPX
ExtenderPort
Field-programmable
CardOC3/OC12/OC48
Line
SwitchFabric
Gigabit
IPP
IPP
OPP
OPP
CprE 583 – Reconfigurable ComputingSeptember 13, 2007 Lect-08.14
FPX Hardware Device
CprE 583 – Reconfigurable ComputingSeptember 13, 2007 Lect-08.15
FPX Hardware in a WUGS-20 Switch
CprE 583 – Reconfigurable ComputingSeptember 13, 2007 Lect-08.16
FPGA-based Router
• FPX module contains two FPGAs• NID – network interface device
• Performs data queuing• RAD – reprogrammable application device
• Specialized control sequences
OSC100MHz10 MHz
OSC
SDRAM
(backside)
8Mbit ZBT
SRAM
8Mbit ZBT
SRAM
DeviceInterfaceNetworkNIDRAD
ReprogrammableApplication
Device
Virtex1000E fg680
RAD/NID Status
SDRAM
(backside)
SRAMProgram
RAD
Reprog
OC
3 / O
C12
/ O
C48
Lin
ecar
d C
onne
ctor
EPROM
NID
62.5 MHz
VR
M:
1.8
V S
witc
her
JTAG
WU
GS
Sw
itch
Bac
kpla
ne C
onne
ctor
FPXWashU / ARL
July, 2000JL / MR
OSC
CprE 583 – Reconfigurable ComputingSeptember 13, 2007 Lect-08.17
Reprogrammable Application Device
• Spatial Re-use of FPGA Resources• Modules implemented using FPGA logic• Module logic can be individually reprogrammed
• Shared Access to off-chip resources• Memory Interfaces to SRAM and SDRAM• Common Datapath to send and receive data
Mo
du
le
Mo
du
le
Network Interfaces to NID
RAD
SRAMData
SRAM
SDRAM SDRAM
Data Data
Data
CprE 583 – Reconfigurable ComputingSeptember 13, 2007 Lect-08.18
Architecture of the FPX
• RAD• Large Xilinx FPGA• Attaches to SRAM and
SDRAM• Reprogrammable over
network• Provides two user-defined
Module Interfaces
• NID • Provides Utopia
Interfaces between switch & line card
• Forwards cells to RAD• Programs RAD
SRAM
EC
Mo
du
le
EC
NID
Switch LineCard
Mo
du
le
RAD
VC VC
VCVC
RAD
Program
SDRAM SDRAM
Data Data
SRAM
Data
SRAM
Data
CprE 583 – Reconfigurable ComputingSeptember 13, 2007 Lect-08.19
Architecture of the FPX (cont.)
PR
OM
Cache
Prog
ram
Flo
wB
uffer
Ro
ute
Filter
Exte
ns
ible
Mo
du
les
Layered Protocol Wrappers
Switch
SD
RA
MS
RA
M
SD
RA
MS
RA
M
Co
nfig
NID (FPGA)
Memory
Network Interface
RAD (FPGA)
FPX Block Diagram FPX Photo
PR
OM
Cache
Prog
ramC
acheP
rogram
Flo
wB
uffer
Ro
ute
Filter
Exte
ns
ible
Mo
du
les
Layered Protocol Wrappers
Switch
SD
RA
MS
RA
M
SD
RA
MS
RA
M
Co
nfig
NID (FPGA)
Memory
Network Interface
RAD (FPGA)
FPX Block Diagram FPX Photo
CprE 583 – Reconfigurable ComputingSeptember 13, 2007 Lect-08.20
FPX SRAM
• Provide low latency for fast table-lookups• Zero Bus Turnaround (ZBT) allows back-to-back
read / write operations every 10ns• Dual, Independent Memories • 36-bit wide bus
CprE 583 – Reconfigurable ComputingSeptember 13, 2007 Lect-08.21
FPX SDRAM
• Dual, independent SDRAM memories• 64-bit wide, 100 MHz• 64Mb / Module : 128 Mb total [expandable]• Burst-based transactions [1-8 word transfers]• Latency of 14 cycles to Read/Write 8-word burst
CprE 583 – Reconfigurable ComputingSeptember 13, 2007 Lect-08.22
Routing Traffic Flows
• Functions• Check packets for errors• Process commands
• Control, status, & reprogramming
• Implement per-flow forwarding
NID
LineCardSwitch
EC ECEC
VC VC
VC VC
VC
ccp
ccp
• Traffic flows routed among• Switch• Line Card• RAD.Switch• RAD.Linecard
CprE 583 – Reconfigurable ComputingSeptember 13, 2007 Lect-08.23
Typical Flow Configurations
VC
EC
VC
ccpEC
VC VC
RADSwitch
RADLineCard
LineCardSwitchDefault Flow Action
(Bypass)
VC
EC
VC
ccpEC
VC VC
RADSwitch
RADLineCard
LineCardSwitchIngress Processing
(IP Routing)
VC
EC
VC
ccpEC
VC VC
RADSwitch
RADLineCard
LineCardSwitchFull RAD Processing
(Packet Routing and Reassembly)
VC
EC
VC
ccpEC
VC VC
RADSwitch
RADLineCard
LineCardSwitch
(Per-flow Output Queueing)Egress Processing
VC
EC
VC
ccpEC
VC VC
LineCardSwitch
RADSwitch
RADLineCard
Full Loopback Testing(System Test)
VC
EC
VC
ccpEC
VC VC
RADSwitch
RADLineCard
LineCardSwitchPartial Loopback Testing(Egress Flow Processing Test)
CprE 583 – Reconfigurable ComputingSeptember 13, 2007 Lect-08.24
Reprogramming Logic
• NID programs at boot from EPROM• Switch Controller writes RAD configuration memory to NID
• Configuration file for RAD arrives transmitted over network via control cells
• Switch Controller issues {Full/Partial} reconfigure command• NID reads RAD config memory to program RAD
• Performs complete or partial reprogramming of RAD
R A DReprogrammable
ApplicationDevice
Virtex1000E fg680
S D R A M
(ba cks id e)
8Mb
it Z
BT
SR
AM
8Mb
it Z
BT
SR
AM
S D R A M
(ba cks id e)
O SC6 2.5 M H z
O SC1 00 M H z
DeviceInterfaceNetworkN ID
R ADP rogram
FIFO E PR O MN ID
WU
GS
Sw
itch
Bac
kpla
ne C
onn
ect
or
OC
3 /
OC
12
/ OC
48 L
inec
ard
Con
nect
or
(backside)
V R M
V R M
1.8V
2.5V(backside)
P C B Trace D e ns ity
Switch Element
IPPIPPIPPIPPIPPIPPIPP
OPPOPPOPPOPPOPPOPPOPPOPPIPP
LCLCLCLCLCLCLCLC
LCLCLCLCLCLCLCLC
CprE 583 – Reconfigurable ComputingSeptember 13, 2007 Lect-08.25
FPX Interfaces Provides
• Well defined Interface• Utopia-like 32-bit fast data interface• Flow control allows back-pressure
• Flow Routing• Arbitrary permutations of packet flows through ports
• Dynamically Reprogrammable• Other modules continue to operate even while new
module is being reprogrammed
• Memory Access• Shared access to SRAM and SDRAM• Request/Grant protocol
CprE 583 – Reconfigurable ComputingSeptember 13, 2007 Lect-08.26
Pattern Matching using the FPX
• Use Hardware to detect a pattern in data• Modify packet based on match• Pipeline operation to maximize throughput
CprE 583 – Reconfigurable ComputingSeptember 13, 2007 Lect-08.27
“Hello, World” Module Function
CprE 583 – Reconfigurable ComputingSeptember 13, 2007 Lect-08.28
Logical Implementation
VCI Match
Append“WORLD”to payload
NewCell
CprE 583 – Reconfigurable ComputingSeptember 13, 2007 Lect-08.29
The Wrapper Concept
App
Wrapper
Wrapper
CprE 583 – Reconfigurable ComputingSeptember 13, 2007 Lect-08.30
AAL5 Encapsulation
ATM Header
Padding
AAL5 Trailer
Payload
0
0
1
CRC-32
options length
Payload
• Payload is packed in cells
• Padding may be added
• 64 bit Trailer at end of cell
• Trailer contains CRC-32
• Last Cell indication bit (last bit of PTI field)
CprE 583 – Reconfigurable ComputingSeptember 13, 2007 Lect-08.31
HelloBob Module
UDP Processor
Echo
UDP Hello Bob
IP Processor
Cell Processor
Frame Processor
SRAMInterface
OutputInput
CprE 583 – Reconfigurable ComputingSeptember 13, 2007 Lect-08.32
Results: Performance • Operating Frequency: 119 MHz.
• 8.4ns critical path• Well within the 10ns period RAD's clock.• Targeted to RAD’s V1000E-FG680-7
• Maximum packet processing rate:• 7.1 Million packets per second.
• (100 MHz)/(14 Clocks/Cell)• Circuit handles back-to-back packets
• Slice utilization: • 0.4% (49/12,288 slices)
• Less than one half of one percent of chip resources
• Search technique can be adapted for other types of data matching and modification
• Regular expressions• Parsing image content …
CprE 583 – Reconfigurable ComputingSeptember 13, 2007 Lect-08.33
CAM-based Packet Matching
• Sample Packet:• Source Address = 128.252.5.5 (dotted.decimal)
• Destination Address = 141.142.2.2 (dotted.decimal)
• Source Port = 4096 (decimal)
• Destination Port = 50 (decimal)
• Protocol = TCP (6)
• Payload = “Consolidate your loans. CALL NOW”• Payload Lists = { General SPAM (0), Save Money SPAM (1) }• Content Vector = “00000011” (binary) = x”03” (hex)
7103 3971
Src IP (hex) =80FC0505
Dest IP (hex) =8D8E0202
SrcPort = 1000
Dest Port =0050
Proto= 06
084072
Con-tent= 03
111 104
CprE 583 – Reconfigurable ComputingSeptember 13, 2007 Lect-08.34
Sample Filter• Source Address = 128.252.0.0 / 16 • Destination Address = 141.142.0.0 / 16• Source Port = Don’t Care• Destination Port = 50• Protocol = TCP (6)• Payload includes general SPAM (List 0)
7103 3971
Src IP (hex) =80FC0505
Dest IP (hex) =8D8E0202
SrcPort = 1000
Dest Port =0050
Proto= 06
084072
Src IP value =80FC0000
Dest IP (hex) =8D8E0000
SrcPort = 0000
Dest Port =
50
Proto= 06
Src IP (hex) =FFFF0000
Dest IP (hex) =FFFF0000
SrcPort = 0000
Dest Port =FFFF
Proto= FF
Value
Mask: 1=care0=don’t care
IP Packet
Con-ten t=
01
Con-ten t=
01
Con-tent=
03
DROP the packet : It matches the filter
CprE 583 – Reconfigurable ComputingSeptember 13, 2007 Lect-08.35
Packet Classifier with FlowIDCAM MASK [1]
CAM VALUE [1]
CAM MASK [2]
CAM VALUE [2]
CAM MASK [3]
CAM VALUE [3]
CAM MASK [N]
CAM VALUE [N]
Flow ID [1]112 bits
Flow ID [2]
Flow ID [3]
Flow ID [N]
Flow ID
. . .. . .
. . .
16 bits
Value Comparators
Mask Matchers
Priority Encoder
Resulting Flow
Identifier
Flow List
Source Address Destination Address
16 bits
Payload Match Bits
Source Port
Dest.Port
Protocol
- - CAM Table - -
Bits in IP Header
CprE 583 – Reconfigurable ComputingSeptember 13, 2007 Lect-08.36
Fast IP Lookup Algorithm
• Function• Search for best matching prefix using
Trie algorithm
Prefix Next Hop*
01*47
10* 2110* 90001* 11011* 000110* 501011* 3
0 1
0
0 0
0
0
0
1 1
1
1 1
1
1
1
1
CprE 583 – Reconfigurable ComputingSeptember 13, 2007 Lect-08.37
Hardware Implementation in the FPX
SRAM1
SRAM2
IP LookupEngine
counter
On-Chip Cell Store
SRAM1 Interface
Control CellProcessor
PacketReassembler
RAD FPGA
NID FPGA
ExtractIP Headers
Remap VCIsfor IP packets
LC SW
RequestGrant0 1
0
0 0
0
0
0
1 1
1
1 1
1
1
1
1
CprE 583 – Reconfigurable ComputingSeptember 13, 2007 Lect-08.38
Time (cycles)
Generate Address
Pipelined FIPL Operations
• Throughput : Optimized by interleaving memory accesses• Operate 5 parallel lookups• t_pipelined_lookup = 550ns / 5 = 110 ns • Throughput = 9.1 Million packets / second
Latch ADDRinto SRAM
SRAMD < M[A]
Latch Datainto FPGA
Compute
Time (cycles)
Space
(Parallel lookup units on FPGA)
Generate Address
CprE 583 – Reconfigurable ComputingSeptember 13, 2007 Lect-08.39
Other Modules Implemented
• IPv4 CAM Filter• 104 Bit header matching
• Fast IP Lookup (FIPL)• Longest Prefix Match• MAE-West at 10M
pkts/second
• Packet Content Scanner• Reg. Expression Search
• Data Queueing• Per-flow queue in
SDRAM
• IPv6 Tunneling Module• Tunnels IPv6 over IPv4
• Statistics Module• Event counter
• Traffic Generator• Per-flow mixing
• Video Recoder• Motion JPEG
• Embedded Processor• KCPSM
CprE 583 – Reconfigurable ComputingSeptember 13, 2007 Lect-08.40
Summary
• Field Programmable Port Extender (FPX)• Network-accessible Hardware• Reprogrammable Application Device
• Module Deployment• Modules implement fast processing on data flow• Network allows Arbitrary Topologies of distributed systems
• Project Website• http://www.arl.wustl.edu/arl/projects/fpx/