Allen Michalski
CSE Department – Reconfigurable Computing Lab
University of South Carolina

Microprocessors with FPGAs: Implementation and Workload Partitioning of the DARPA HPCS Integer Sort Benchmark within the SRC-6e Reconfigurable Computer
Page 2 – MAPLD 2005/253 – Michalski
Outline

- Reconfigurable Computing – Introduction
  - SRC-6e architecture, programming model
- Sorting Algorithms
  - Design guidelines
- Testing Procedures, Results
- Conclusions, Future Work
  - Lessons learned
What is a Reconfigurable Computer?

Combination of:
- Microprocessor workstation for frontend processing
- FPGA backend for specialized coprocessing
- Typical PC bus for communications
What is a Reconfigurable Computer?

PC characteristics:
- High clock speed
- Superscalar, pipelined
- Out-of-order issue
- Speculative execution
- High-level language programming

FPGA characteristics:
- Low clock speed
- Large number of configurable elements: LUTs, Block RAMs, CPAs, multipliers
- HDL programming
What is the SRC-6e?

- SRC = Seymour R. Cray
- RC with high-throughput memory interface:
  - 1,415 MB/s for SNAP writes, 1,280 MB/s for SNAP reads
  - For comparison, PCI-X (1.0) = 1.064 GB/s
SRC-6e Development

- Programming does not require knowledge of hardware design
- C code can compile to hardware
SRC Design Objectives

FPGA considerations:
- Superscalar design: parallel, pipelined execution

SRC considerations:
- High overall data throughput: streaming versus non-streaming data transfer?
- Reduction of FPGA data-processing stalls due to data dependencies and data read/write delays: FPGA Block RAM versus SRC OnBoard Memory?

Evaluate software/hardware partitioning:
- Algorithm partitioning
- Data size partitioning
Sorting Algorithms

Traditional algorithms:
- Comparison sorts – Θ(n lg n) at best:
  - Insertion sort, merge sort, heapsort, quicksort
- Counting sorts:
  - Radix sort: Θ(d(n + k))

HPCS FORTRAN code baseline:
- Radix sort in combination with heapsort

This research focuses on 128-bit operands:
- Simplifies SRC data transfer and management
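The Θ(d(n + k)) bound above comes from running d counting-sort passes, one per digit of k possible values. As an illustration (a serial software sketch, not the HPCS baseline code; the function name and choice of 8-bit digits are assumptions), a radix-8 sort of 64-bit keys makes d = 8 stable passes of Θ(n + 256) each:

```c
#include <stdint.h>
#include <string.h>
#include <stddef.h>

/* LSD radix sort with 8-bit digits: 8 stable counting passes
   over 64-bit keys. tmp must hold n values. */
static void radix8_sort(uint64_t *a, uint64_t *tmp, size_t n) {
    for (int shift = 0; shift < 64; shift += 8) {
        size_t count[257] = {0};
        /* histogram of this digit, offset by one for prefix sums */
        for (size_t i = 0; i < n; i++)
            count[((a[i] >> shift) & 0xFF) + 1]++;
        /* prefix sums turn counts into bucket start offsets */
        for (int b = 0; b < 256; b++)
            count[b + 1] += count[b];
        /* stable scatter into tmp, then copy back */
        for (size_t i = 0; i < n; i++)
            tmp[count[(a[i] >> shift) & 0xFF]++] = a[i];
        memcpy(a, tmp, n * sizeof *a);
    }
}
```

A wider digit (radix-16, i.e. 16-bit digits) halves the pass count but needs 65,536 buckets, which is why, as the results later show, it pays off only for large partitions.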
Sorting – SRC FPGA Implementation

Memory constraints:
- SRC OnBoard Memory: 6 banks x 4 MB, pipelined read or write access, 5-clock latency
- FPGA BRAM: 144 blocks of 18 Kbit each, 1-clock read and write latency

Initial choices:
- Parallel insertion sort (bubblesort):
  - Produces sorted blocks
  - Uses OnBoard Memory pipelined processing to minimize data-access stalls
- Parallel heapsort:
  - Random-access merge of sorted lists
  - Uses BRAM for low-latency access; good for random data access
Parallel Insertion Sort (BubbleSort)

- Systolic array of cells
- Pipelined SRC processing from OnBoard Memory
- Each cell keeps the highest value and passes the other values on
- Latency: 2x the number of cells
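The cell behavior above can be modeled serially in C (a software sketch of the data movement only, with illustrative names; in the FPGA design every cell is a pipeline stage and all cells step in parallel): each cell retains the larger of its stored and incoming values and forwards the smaller downstream, so after streaming a list in, cell 0 holds the maximum, cell 1 the next largest, and so on.

```c
#include <stdint.h>

/* One comparator cell: keep the larger value, pass the smaller on. */
static void cell_step(uint64_t *store, int *valid, uint64_t in,
                      uint64_t *out, int *out_valid) {
    if (!*valid) {                 /* empty cell absorbs the value */
        *store = in; *valid = 1;
        *out = 0; *out_valid = 0;
        return;
    }
    if (in > *store) { *out = *store; *store = in; }
    else             { *out = in; }
    *out_valid = 1;
}

/* Serial model of the systolic array: n cells sort n values.
   Assumes n <= 64 for the fixed-size cell storage. */
static void systolic_sort(uint64_t *data, int n) {
    uint64_t store[64];
    int valid[64] = {0};
    for (int t = 0; t < n; t++) {
        uint64_t v = data[t];
        int have = 1;
        for (int c = 0; c < n && have; c++) {
            uint64_t out; int ov;
            cell_step(&store[c], &valid[c], v, &out, &ov);
            v = out; have = ov;
        }
    }
    /* drain: cell c holds the (c+1)-th largest value */
    for (int c = 0; c < n; c++)
        data[c] = store[n - 1 - c];
}
```

This also shows why results emerge in reverse order of comparison: the smallest values travel the farthest down the array before settling.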
Parallel Insertion Sort (BubbleSort)

- Systolic array of cells
- Results are passed out in reverse order of comparison
  - N = number of comparator cells
- Sorts a list of length L completely in Θ(L²)
- Limit sort size to some number a < L (list size):
  - Create multiple sorted lists
  - Each list sorted in Θ(a)
Parallel Insertion Sort (BubbleSort)

#include <libmap.h>

void parsort_test(int arraysize, int sortsize, int transfer,
                  uint64_t datahigh_in[], uint64_t datalow_in[],
                  uint64_t datahigh_out[], uint64_t datalow_out[],
                  int64_t *start_transferin, int64_t *start_loop,
                  int64_t *start_transferout, int64_t *end_transfer,
                  int mapno) {
  OBM_BANK_A (a, uint64_t, MAX_OBM_SIZE)
  OBM_BANK_B (b, uint64_t, MAX_OBM_SIZE)
  OBM_BANK_C (c, uint64_t, MAX_OBM_SIZE)
  OBM_BANK_D (d, uint64_t, MAX_OBM_SIZE)

  DMA_CPU(CM2OBM, a, MAP_OBM_stripe(1, "A"), datahigh_in, 1, arraysize*8, 0);
  wait_DMA(0);
  ...
  while (arrayindex < arraysize) {
    endarrayindex = arrayindex + sortsize - 1;
    if (endarrayindex > arraysize - 1)
      endarrayindex = arraysize - 1;
    while (arrayindex < endarrayindex) {
      for (i = arrayindex; i <= endarrayindex; i++) {
        data_high_in = a[i];
        data_low_in = b[i];
        parsort(i == endarrayindex, data_high_in, data_low_in,
                &data_high_out, &data_low_out);
        c[i] = data_high_out;
        d[i] = data_low_out;
Parallel Heapsort

- Tree structure of cells
- Asynchronous operation with acknowledged data transfer
- Merges sorted lists in Θ(n lg n)
- Designed for independent BRAM block accesses
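The merge step above can be sketched in software as a k-way merge through a min-heap of list heads (a serial illustration of the idea, not the hardware cell tree; the `head_t` type and function names are assumptions). Each output value costs one heap adjustment, giving the Θ(n lg k) merge behavior the tree of cells implements in parallel:

```c
#include <stdint.h>
#include <stddef.h>

/* A sorted input list and the cursor into it. */
typedef struct { const uint64_t *list; size_t pos, len; } head_t;

/* Restore the min-heap property on the current head values. */
static void sift_down(head_t *h, size_t k, size_t i) {
    for (;;) {
        size_t l = 2 * i + 1, r = l + 1, m = i;
        if (l < k && h[l].list[h[l].pos] < h[m].list[h[m].pos]) m = l;
        if (r < k && h[r].list[h[r].pos] < h[m].list[h[m].pos]) m = r;
        if (m == i) return;
        head_t t = h[i]; h[i] = h[m]; h[m] = t;
        i = m;
    }
}

/* Merge k nonempty sorted lists into out. */
static void kway_merge(head_t *h, size_t k, uint64_t *out) {
    for (size_t i = k; i-- > 0;)      /* heapify all heads */
        sift_down(h, k, i);
    size_t n = 0;
    while (k > 0) {
        out[n++] = h[0].list[h[0].pos++];   /* emit the smallest head */
        if (h[0].pos == h[0].len)           /* list exhausted: drop it */
            h[0] = h[--k];
        sift_down(h, k, 0);
    }
}
```

In the FPGA version each tree level is a separate cell with its own BRAM, so the "sift" at one level overlaps with emissions at the root.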
Parallel Heapsort

BRAM limitations:
- 144 Block RAMs @ 512 x 32-bit values = not a whole lot of 128-bit values

OnBoard Memory:
- SRC constraint – up to 64 reads and 8 writes in one MAP C file
- Cascading clock delays as the number of reads increases
- Explore the use of MUXed access: search and update only 6 of 48 leaf nodes at a time in round-robin fashion
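The round-robin MUXed access above trades read ports for service latency: with 48 leaves in groups of 6, each group is visited once every 8 cycles. A minimal sketch of that schedule (names and constants mirror the slide; this is an illustration, not the MAP C source):

```c
/* 48 leaf nodes serviced 6 at a time through a shared MUX:
   8 groups, so each leaf is searched/updated once per 8 cycles. */
#define LEAVES 48
#define GROUP   6
#define GROUPS (LEAVES / GROUP)

/* Advance the round-robin group pointer. */
static int next_group(int group) {
    return (group + 1) % GROUPS;
}

/* Leaf indices serviced while a given group is selected. */
static void group_leaves(int group, int idx[GROUP]) {
    for (int i = 0; i < GROUP; i++)
        idx[i] = group * GROUP + i;
}
```

Only 6 read paths need to meet timing per cycle, avoiding the cascading clock delays of 48 simultaneous reads.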
FPGA Initial Results

Baseline: one V26000; PAR options: -ol high -t 1

Bubblesort results – 100 cells:
- 29,354 slices (86%), 37,131 LUTs (54%)
- 13.608 ns = 73 MHz (verified operational at 100 MHz)

Heapsort results – 95 cells (48 leaves):
- 21,011 slices (62%), 24,467 LUTs (36%)
- 11.770 ns = 85 MHz (verified operational at 100 MHz)
Testing Procedures

- All tests utilize one chip for baseline results
- Evaluate the fastest software radix of operation
- Hardware/software partitioning: five cases – Case 5 utilizes FPGA reconfiguration
- Data size partitioning – 100, 500, 1000, 5000, 10000
- 10 runs for each test case / data partitioning combination
- List size: 500,000 values
Results

Fastest software operations (baseline) – comparison of radixsort and heapsort combinations:
- Radix 4, 8 and 16 evaluated
- Minimum time: radix-8 radixsort + heapsort (size = 5000 or 10000)
- Radix-16 has too many buckets for the sort-size partitions evaluated
- Heapsort comparisons are faster than radixsort index updates
[Chart: Software Datasize Partitioning – Radixsort vs. Radixsort + Heapsort. Time (sec.) vs. test case/radix (4, 8, 16) for radixsort alone and radix + heap at list sizes 100, 500, 1000, 5000 and 10000; stacked series: HeapSort, RadixSort.]
Results

- Fastest SW-only time = 3.41 sec.
- Fastest time including HW = 3.89 sec.: Bubblesort (HW) + Heapsort (SW), partition list size of 1000
- Heapsort times are dominated by data access and significantly slower than software
[Chart: SRC Software/Hardware Executions (500K data). Time (sec.) vs. data partition/test case (S-S, H-S, S-H, H-H at partitions 100, 500, 1000, 5000, 10000); series: Heapsort (HW), Heapsort Config (HW), Heapsort (SW), Bubblesort (HW), Bubblesort Config (HW), Radixsort (SW).]
Results – Bubblesort vs. Radixsort

- Some cases where HW is faster than SW: list sizes < 5000, with SRC pipelined data access
- Fastest SW case was for list size = 10000
[Chart: Radixsort (SW) vs. Bubblesort (HW). Time (sec.) vs. data size/test case (100, 500, 1000, 5000, 10000); stacked series: HW – Data Transfer Out, HW – Data Processing, HW – Data Transfer In, SW – Only.]
- MAP data transfer time is less significant than data processing time
- For size = 1000: input (11.3%), analyze (76.9%), output (11.5%)
Results – Limitations

- Heapsort is limited by the overhead of input servicing:
  - Random accesses of OBM are not ideal
  - Overhead of loop search; sequentially dependent processing
- Bubblesort is limited by the number of cells:
  - Can increase by approximately 13 cells
  - Two-chip streaming
- Reconfiguration time is assumed to be a one-time setup factor:
  - Reconfiguration case is the exception – solve by having a core per V26000
Conclusions

- Pipelined, systolic designs are needed to overcome the speed advantage of the microprocessor
- Bubblesort works well on small data sets
- Heapsort's random data access cannot exploit SRC benefits
- SRC high-throughput data transfer and high-level data abstraction provide a good framework for implementing systolic designs
Future Work

- Heapsort's random data access cannot exploit SRC benefits; look for possible speedups using BRAM:
  - Unroll leaf memory access
  - Exploit the SRC "periodic macro" paradigm
- Currently evaluating radix sort in hardware; it works better than bubblesort for larger sort sizes
- Compare MAP-C to VHDL where baseline VHDL is faster than SW