Reconfigurable Computing Aspects of the Cray XD1 Sandia National Laboratories / California Craig...
-
Upload
sybil-perry -
Category
Documents
-
view
214 -
download
0
Transcript of Reconfigurable Computing Aspects of the Cray XD1 Sandia National Laboratories / California Craig...
Reconfigurable Computing Aspects of the Cray XD1
Sandia National Laboratories / California
Craig [email protected]
Cray User Group (CUG 2005)May 16, 2005
Sandia is a multiprogram laboratory operated by Sandia Corporation, a Lockheed Martin Company, for the United States Department of Energy’s National Nuclear Security Administration under contract DE-AC04-94AL85000.
Craig Ulmer, Ryan Hilles, and David Thompson SNL/CAKeith Underwood and K. Scott Hemmert SNL/NM
Overview
• Cray XD1– Dense multiprocessor
– Commodity parts
– Good HPC performance
• FPGA Accelerators in XD1– Close to host memory
– Reconfigurable Computing
– Our early experiences 0
200
400
600
800
1,000
1,200
1,400
1,600
1,800
1 10 100 1,000 10,000 100,000 1,000,000 10,000,000
Message Size (Bytes)B
an
dw
idth
(M
illio
n B
yte
s/s)
MPI Bandwidth
PCI-X 133
1.6GB/s HT
Outline
• The Cray XD1 & Reconfigurable Computing
• Four simple applications
– DMA, MD5, Sort, and Floating Point
• Observations
• Conclusions
Cray XD1 Architecture
Cray XD1
• Dense multiprocessor system– 12 AMD Opterons on 6 blades
– 6 Xilinx Virtex-II/Pro FPGAs
– InfiniBand-like interconnect
– 6 SATA hard drives
– 4 PCI-X slots
– 3U Rack
XD1 Blade Architecture
CPU CPU MainMemory
MainMemory
NISRAM
Expansion Module
SRAM
SRAM
SRAM
FPGANI
Rapid Array Fabric
1.6 GB/s HT
1+1 GB/s “4x IB”
3.2 GB/s HT
1.6 GB/s “HT”
Reconfigurable Computing (RC)
• Use reconfigurable hardware devices to implement key computations in hardware
double doX( double *a, int n) {int i;double x;
x=0;for(i=0;i<n;i+=3){
x+= a[i] * a[i+1] + a[i+2];…
}…
return x;}
* +
+
a[i] a[i+1]
Z -1
a[i+2]
Software Hardware FPGA
Reconfigurable Computing and the XD1
• There have always been three challenges in RC:1. System Integration
2. Capacity
3. Development Environment
• XD1 is interesting because of (1) and possibly (2)– Describe our early experiences
– Four simple applications to expose performance
Application 1: Data Exchange
Data Exchange
• How fast can we move data into/out of FPGA?– Cray provides basic HyperTransport interface
– Host/FPGA initiated transfers
• HT-Compliance– 64B burst boundaries
– 64b words
• FPGA-Initiated DMA Engine– Read/Write large blocks of data
– Based on memory interface
FPGA
User’s Circuits
Host HT OpDMA
FPGA-Initiated Transfers
Data Transfer Performance
0
200
400
600
800
1,000
1,200
1,400
1,600
1 10 100 1,000 10,000 100,000 1,000,000
Transfer Size (Bytes)
Ban
dw
idth
(M
iByt
es/s
)
Host-to-FPGA (Host-Initiated)
Host-to-FPGA (FPGA-Initiated)
FPGA-to-Host (FPGA-Initiated)
FPGA-to-Host (Host-Initiated)
FPGA Frequency Affects Performance
0
200
400
600
800
1,000
1,200
1,400
1 10 100 1,000 10,000 100,000 1,000,000
Transfer Size (Bytes)
Ban
dw
idth
(M
iByt
es/s
)
200 MHz
175 MHz
150 MHz
125 MHz
100 MHz
75 MHz
Application 2: Data Hashing
Data Hashing
• Generate fingerprint for block of data
• MD5 Message Digest function– 512b blocks of data
– 64 computations per block
– Computation: Boolean, add, and rotate
• Two methods– Single-cycle computation (64)
– Multi-cycle computation (320)
MD5 Performance
• MD5 is serial– Read/Modify fingerprint
– Host does better
• FPGA clock rates– Multi-cycle: 190
MHz
– Single-cycle: 66 MHz
• Multi-cycle has 5x stages but only 3x clock rate of single-cycle
1
10
100
1,000
10,000
100,000
1 10 100 1,000 10,000 100,000 1,000,000
Message Size (Bytes)
Pro
cess
ing
Tim
e (m
icro
seco
nds)
FPGA Multi-Cycle
FPGA Single-Cycle
Host
Application 3: Sorting
Sorting
• Sort a matrix of 64b values
• Hardware implementation– Sorting element– Array of elements– Data transfer engine
• Exploits parallelism– Do n comparisons each clock– Requires 3n clock cycles
• Limited to n values– Use as building block
Sorting Array
AddressWr Req
Rd ReqDMA
EngineSortingControl
Double-BufferedOutgoing Memory
Reg
>
Sorting Element
11.532.994.188.779.434.539.254.2
24.354.14.235.1
14.83.019.791.4
17.581.39.431.394.188.732.911.579.454.239.234.5
54.135.124.34.2
91.419.714.83.0
81.331.317.59.4
unsorted sorted
Sorting Performance
• V2P50-7 FPGA– 128 elements– 110 MHz
• Data transfer by FPGA– Read/write concurrency– Pinned memory
• 2-4x Rate of Quicksort(128)
0
5
10
15
20
25
30
35
1 10 100 1,000
Matrix Rows
Sor
ting
Rat
e (M
iWor
ds/s
)
FPGA (No Copy)
FPGA (Pre/Post Copy)
CPU
Application 4: 64b Floating Point
Floating Point Computations
• Floating point has been weakness for FPGAs– Complex to implement, consume many gates
• Keith Underwood and K. Scott Hemmert– Developed FP library at SNL/NM
• FP Library– Highly-optimized and compact
– Single/Double Precision
– Add, Multiply, Divide, Square Root
– V2P50: 10-20 DP cores, 130+ MHz
Distance Computation
• Compute Pythagorean Theorem: c=sqrt(a2 + b2)– Implemented as simple pipeline (88 stages)
– Fetch two inputs, store one output each clock
– Host pushes data in, FPGA pushes data out
IncomingSRAM
OutgoingSRAM
DMAEngine
Control
IncomingSRAM
FromHost
ToHost
Distance Performance
• FPGA Design– 159 MHz clock rate– 39% of V2P50
• CPU and FPGA converge– 75 MiOperations/s– 1,200 MiB/s of input
• For Square Root and Divide, FPGA can be competitive 0
10
20
30
40
50
60
70
80
90
100
100 1,000 10,000 100,000 1,000,000
Distance Computations
Pro
cess
ing
Rat
e (M
iOpe
ratio
ns/s
)
CPU
FPGA
Observations
• FPGAs have high-bandwidth access to host memory– FPGAs can get 1,300 MiB/s – 10x PCI
– Still need planning for data transfers
– Note: host’s HT links are 2x
• Some applications work better than other– Need to exploit parallelism
• V2P50 is approaching a capacity for interesting work– Small number of FP cores
– Encourage use of larger FPGAs
Conclusions
• XD1 is an interesting platform for RC research– FPGAs require careful planning
– High-bandwidth access to memory
– Sign of interesting HT architectures
• Acknowledgments– Steve Margerm of Cray Canada for XD1 Support
– Keith Underwood and K. Scott Hemmert SNL/NM for FP library