Darmstadt University of Technology, Computer Architecture Group
Implementing Cellular Automata in FPGA Logic
Mathias Halbach, Rolf Hoffmann
1. Introduction
2. Implementing Cellular Automata in Software (C)
3. Hardware Prototype Implementation
4. Comparison Hardware vs. Software
5. Conclusion
1. Introduction
Cellular Automata – Pioneers
John von Neumann (1903-1957)
Konrad Zuse (1910-1995) with "Rechnender Raum" (computing space)
Cellular Automata (CA)
Optimal model for applications with an inherently local neighborhood:
physical fields, lattice-gas models, models of growth, moving particles, fluid flow, logic simulation, numerical algorithms, routing problems, picture processing, genetic algorithms, cellular neural networks.
Hardware Platform CEPRA and CDL
CEPRA: Cellular Processing Architecture
– CEPRA-S, 2001 (see next slide)
– CEPRA-3D, 1997 (2 FPGAs, 3 dimensional)
– CEPRA-1D, 1996 (1 FPGA)
– CEPRA-1X, 1996 (1 FPGA)
– CEPRA-8D, 1995 (8 DSPs)
– CEPRA-8L, 1994 (8 FPGAs)
CDL: Cellular Description Language
CEPRA-S
2 FPGAs
8 Data Memories
1 Program Memory
1 Special Memory
Prototyping Platform
Altera Flex 10k Evaluation Board
– FPGA: Flex EPF10K70RC240-4 with 3756 logic cells
MAX+plus II Tools
– AHDL, Verilog (targeting FPGA logic)
Question
If Cellular Automata are implemented in FPGA-Logic:
How high is the speed-up in comparison to a software implementation on a PC?
CA Rules used for Comparison
1. Rule: Belousov-Zhabotinsky Reaction
– describes an oscillating chemical process
– this rule is neither very simple nor very complex
function fBZR(Cell, North, East, South, West);
const n=127, g=11;
begin
  b := (North=n) + (East=n) + (South=n) + (West=n);
  a := (0 [...]
Rule MinMax
2. Rule: MinMax, which is more complex
function MinMax (Cell, North, East, South, West);
const s1=40, s2=215;
A := max (Cell, North, East, South, West);
B := min (Cell, North, East, South, West);
small := (Cell [...] s2);
fMinMax := A - B + 8 * small - 8 * big;
2. Implementing Cellular Automata in Software (C)
Naive Implementation
Compute-Phase
for(x=1; x [...]
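The listing on this slide is cut off after the loop header. As a minimal sketch of what a naive two-phase implementation typically looks like (not the authors' code), the following C fragment computes all new states into a second array and then copies them back in an update phase. The field size N, the array names A and B, and the placeholder rule (mean of the four neighbours) are assumptions chosen purely for illustration.

```c
#include <assert.h>
#include <stdint.h>

#define N 8  /* hypothetical field size, including a one-cell border */

/* Placeholder rule (assumption): mean of the four neighbours. */
uint8_t rule(uint8_t c, uint8_t n, uint8_t e, uint8_t s, uint8_t w)
{
    (void)c;  /* the centre cell is unused by this placeholder rule */
    return (uint8_t)((n + e + s + w) / 4);
}

/* One generation: compute phase into B, then update phase back into A. */
void naive_step(uint8_t A[N][N], uint8_t B[N][N])
{
    int x, y;

    assert(A != B);  /* a synchronous update needs a second array */

    /* Compute phase: x is the row, y the column; border cells
       (index 0 and N-1) are only read, never written. */
    for (x = 1; x < N - 1; x++)
        for (y = 1; y < N - 1; y++)
            B[x][y] = rule(A[x][y], A[x - 1][y], A[x][y + 1],
                           A[x + 1][y], A[x][y - 1]);

    /* Update phase: copy the new states back into A. */
    for (x = 1; x < N - 1; x++)
        for (y = 1; y < N - 1; y++)
            A[x][y] = B[x][y];
}
```

Calling naive_step(A, B) once advances the field in A by one generation; the two-dimensional indexing and the separate update phase are the overheads that the optimizations on the next slide target.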
Code Optimizing Methods
Use a one-dimensional field to reduce address calculations.
Use pointers instead of indices.
Write the calculation function inline in the compute phase.
Implement border with additional memory instead of if-statements.
Delete the update-phase by using two pointers, which point to A and B in one computation and interchange the pointers for the next computation.
Change the loop iteration index from 1..n to n-1..0 (faster loop-termination check).
Reuse variables and intermediate results.
Reduce conditional statements, e.g. by using tables.
Use “case” (switch) instead of multiple if-statements, also nest if necessary.
Keep the cache filled.
Copy whole lines with a fast copy procedure (memcpy) into a three-line buffer and operate on these lines instead of accessing the whole field directly.
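Several of the methods listed above can be combined in a few lines. The sketch below is one possible illustration, not the authors' optimized code: it stores the field in a one-dimensional array with an extra ring of border cells (so no if-statements are needed at the edges), walks it with pointers, counts the loops down, writes a placeholder rule inline, and swaps two pointers instead of running an update phase. The names W, STRIDE, ca_step and ca_run, and the placeholder rule itself, are assumptions.

```c
#include <assert.h>
#include <stdint.h>

#define W 6                 /* hypothetical kernel width and height */
#define STRIDE (W + 2)      /* row length including the border cells */

/* Compute one generation from src into dst.
   Both buffers hold STRIDE * STRIDE cells. */
void ca_step(const uint8_t *src, uint8_t *dst)
{
    int row, col;

    assert(src != dst);  /* a synchronous update needs two buffers */

    for (row = W; row > 0; row--) {                 /* count down */
        const uint8_t *c = src + row * STRIDE + 1;  /* first kernel cell */
        uint8_t *out = dst + row * STRIDE + 1;

        for (col = W; col > 0; col--, c++, out++) {
            /* Placeholder rule (assumption), written inline:
               mean of the north, east, south and west neighbours. */
            *out = (uint8_t)((c[-STRIDE] + c[1] + c[STRIDE] + c[-1]) / 4);
        }
    }
}

/* Run several generations; return the buffer holding the final state. */
uint8_t *ca_run(uint8_t *a, uint8_t *b, int generations)
{
    while (generations-- > 0) {
        uint8_t *tmp;

        ca_step(a, b);
        tmp = a;  /* swap the pointers instead of copying the field */
        a = b;
        b = tmp;
    }
    return a;
}
```

Because the border cells are never written, both buffers need the same border values before the first call. After an odd number of generations the result is in the second buffer, which is why ca_run returns a pointer.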
Computation Time BZR (optimized C code)
Size n*n       #cells   generations   time per cell   clock cycles
                        per second    operation       per cell op.
128 * 128      16 k     5794          10 ns           25
256 * 256      65 k     1533          10 ns           24
512 * 512      256 k    333           11 ns           28
1024 * 1024    1 M      83            11 ns           28
2048 * 2048    4 M      21            11 ns           28
4096 * 4096    16 M     5             12 ns           29
8192 * 8192    64 M     1.2           13 ns           31
Compiler: GNU gcc 3.3.1 with optimization parameter -O3
Pentium 4, 2.4 GHz (Fujitsu/Siemens)
Windows XP, Cygwin 1.5.5c
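The last two columns of the table follow from the measured generation rate. As a sketch of the arithmetic (the function names are illustrative, not from the slides): the time per cell operation is 1 / (generations per second × number of cells), and multiplying it by the 2.4 GHz clock rate gives the clock cycles per cell operation.

```c
#include <assert.h>

/* Time per cell operation in nanoseconds for an n*n field. */
double ns_per_cell(int n, double generations_per_second)
{
    assert(n > 0 && generations_per_second > 0.0);
    return 1e9 / (generations_per_second * (double)n * (double)n);
}

/* Clock cycles per cell operation at a CPU clock rate given in GHz. */
double cycles_per_cell(double ns, double clock_ghz)
{
    return ns * clock_ghz;
}
```

For the 128 * 128 row, ns_per_cell(128, 5794) is about 10.5 ns, which at 2.4 GHz corresponds to about 25 clock cycles, in line with the table.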
Computation Time MinMax (optimized C code)
Size n*n       #cells   generations   time per cell   clock cycles
                        per second    operation       per cell op.
128 * 128      16 k     998           60 ns           143
256 * 256      65 k     258           60 ns           142
512 * 512      256 k    62            61 ns           147
1024 * 1024    1 M      16            62 ns           148
2048 * 2048    4 M      4             61 ns           147
4096 * 4096    16 M     1             60 ns           144
8192 * 8192    64 M     0.3           60 ns           143
Compiler: GNU gcc 3.3.1 with optimization parameter -O3
Pentium 4, 2.4 GHz (Fujitsu/Siemens)
Windows XP, Cygwin 1.5.5c
3. Hardware Prototype Implementation
module rule(clk, cnew, n, e, c, w, s);
input clk;
input [7:0] c,n,e,s,w;
output [7:0] cnew;
parameter s1 = 42, s2 = 213;
wire [2:0] as1 = (n [...] s2);
wire [7:0] as1_8bit = as1;
wire [7:0] as2_8bit = as2;
wire [7:0] as1_as2 = (as1_8bit [...]
Hardware Array Size
6 x 6 cells in total (kernel cells surrounded by border cells)
4 x 4 = 16 kernel cells
16 rules in parallel
Rule BZR synthesized (16 rules parallel)
pipeline   Area (0) ..   % used      # used        max.   time per    speed
stages     Speed (10)    chip area   logic cells   MHz    cell op.    up
0          0             48          1833          11.8   5.30 ns     1.9
0          5             48          1833          11.7   5.34 ns     1.9
0          10            59          2233          12.1   5.17 ns     1.9
2          0             49          1869          18.2   3.43 ns     2.9
2          5             49          1869          17.6   3.55 ns     2.8
2          10            56          2119          21.4   2.92 ns     3.4
Pipeline stages
– do not change the number of used logic cells significantly
– increase the speed-up
Approx. 2119/16 = 132 logic cells needed per CA cell
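The "time per cell op." and "speed up" columns can be reproduced from the maximum clock rate, assuming that each of the 16 parallel rule units delivers one new cell state per clock and taking the roughly 10 ns per cell measured for the optimized C code of rule BZR as the software reference. The helper names below are illustrative.

```c
#include <assert.h>

/* Hardware time per cell operation in ns: p cells finish every clock. */
double hw_ns_per_cell(double f_max_mhz, int p)
{
    assert(f_max_mhz > 0.0 && p > 0);
    return 1000.0 / (f_max_mhz * (double)p);
}

/* Speed-up of hardware over software for given per-cell times. */
double speed_up(double sw_ns, double hw_ns)
{
    assert(hw_ns > 0.0);
    return sw_ns / hw_ns;
}
```

With 21.4 MHz and p = 16, hw_ns_per_cell gives about 2.92 ns, and speed_up(10.0, 2.92) is about 3.4, matching the fastest configuration of rule BZR.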
Rule MinMax synthesized (16 rules parallel)
pipeline   Area (0) ..   % used      # used        max.   time per    speed
stages     Speed (10)    chip area   logic cells   MHz    cell op.    up
0          0             65          2460          6.6    9.47 ns     6
0          5             65          2460          6.5    9.62 ns     6
0          10            80          3016          6.4    9.77 ns     6
3          0             60          2259          17.7   3.53 ns     17
3          5             60          2259          19.8   3.16 ns     19
3          10            68          2554          22.8   2.74 ns     22
Pipeline stages show the same behavior
Approx. 2554/16 = 160 logic cells needed per CA cell
4. Comparison Hardware vs. Software
The computation time for one new cell state in software is
$t_S = k_S \cdot T_S$
– $k_S$ = number of clock cycles per rule
– $T_S$ = time for one computer clock.
The computation time for one new cell state in FPGA hardware is
$t_H = k_H \cdot T_H / p$
– $k_H$ = number of clock cycles per rule
– $T_H$ = time for one FPGA clock.
– $p$ = degree of parallelism in hardware.
The speed-up is
$S = t_S / t_H = (k_S / k_H) \cdot (T_S / T_H) \cdot p$
Prototype platform: $k_S / k_H = 24..142 / 1$, $p = 16$.
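Expressed in code, with clock rates in place of clock periods (T = 1/f), the speed-up computation could look like the sketch below. The function and parameter names are illustrative; the 2400 MHz PC clock and the FPGA clock rates used in the check come from the measurement slides.

```c
#include <assert.h>

/* Speed-up S = t_S / t_H with t_S = k_S * T_S and t_H = k_H * T_H / p,
   where T_S = 1 / f_S and T_H = 1 / f_H. */
double ca_speed_up(double k_s, double f_s_mhz,
                   double k_h, double f_h_mhz, int p)
{
    double t_s, t_h;

    assert(k_s > 0 && k_h > 0 && f_s_mhz > 0 && f_h_mhz > 0 && p > 0);
    t_s = k_s / f_s_mhz;                /* software time per cell (us) */
    t_h = k_h / (f_h_mhz * (double)p);  /* hardware time per cell (us) */
    return t_s / t_h;
}
```

For the prototype, ca_speed_up(24, 2400, 1, 21.4, 16) is about 3.4 for rule BZR and ca_speed_up(142, 2400, 1, 22.8, 16) is about 22 for rule MinMax, which agrees with the synthesis tables.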
2004 Altera FPGA Devices
47 times larger than the prototype device used
Operating speed 300-500 MHz: 400/25 = 16 times faster
Performance for 2004 Technology FPGAs
Assumptions (using the EP2S180)
– Total number of logic cells: 179,400
– 50% = 89,700 are used to implement the rule; 50% remain free for additional logic (interface to memories, input/output)
– FPGA clock rate is ≈ 1/15 (pessimistic assumption) of the current PC clock rate (3.5 GHz), i.e. 233 MHz
Degree of parallelism (number of parallel rules)
– p(Rule BZR) = 89,700/132 = 680
– p(Rule MM) = 89,700/160 = 561
Speed-up
– S(Rule BZR) = 24 * 1/15 * 680 = 1,088
– S(Rule MM) = 142 * 1/15 * 561 = 5,311
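The projected figures can be checked with a few lines of arithmetic. The sketch below follows the slide's assumptions (89,700 usable logic cells, FPGA clock at 1/15 of the PC clock, one clock per rule in hardware); the function names are illustrative, and rounding to the nearest integer is an assumption that happens to reproduce the slide's values.

```c
#include <assert.h>

/* Number of rules that fit: usable logic cells / logic cells per CA cell,
   rounded to the nearest integer. */
int parallel_rules(int usable_logic_cells, int logic_cells_per_ca_cell)
{
    assert(usable_logic_cells > 0 && logic_cells_per_ca_cell > 0);
    return (usable_logic_cells + logic_cells_per_ca_cell / 2)
           / logic_cells_per_ca_cell;
}

/* Projected speed-up: k_S * (f_H / f_S) * p, with k_H = 1. */
double projected_speed_up(int k_s, double clock_ratio, int p)
{
    return (double)k_s * clock_ratio * (double)p;
}
```

parallel_rules(89700, 132) yields 680 and parallel_rules(89700, 160) yields 561; projected_speed_up(24, 1.0 / 15.0, 680) is 1,088 and projected_speed_up(142, 1.0 / 15.0, 561) is about 5,311.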
5. Conclusion (Altera Flex 10k → Stratix II vs. SW)
CA architectures can be implemented in parallel
– Rule BZR: p = 16 → 680
– Rule MM: p = 16 → 561
Number of clock cycles to execute the rule
– Rule BZR: kS = 24 (in software), kH = 1 (in hardware)
– Rule MM: kS = 142 (in software), kH = 1 (in hardware)
Speed-up
– Rule BZR: S = 3.4 → 1,088
– Rule MM: S = 22 → 5,311
Advantages of the FPGA solution
– Much less hardware (1 FPGA ⇔ thousands of PCs)
– Less power consumption, less space
– Programming: easier than managing thousands of PCs?
Thank You For Your Attention
Darmstadt University of Technology
Computer Architecture Group
Prof. R. Hoffmann
Hochschulstrasse 10
D-64289 Darmstadt
Germany
http://www.ra.informatik.tu-darmstadt.de/
Mathias Halbach