Implementing Cellular Automata in FPGA Logic...Darmstadt University of Technology Computer...
-
Darmstadt University of Technology, Computer Architecture Group
Implementing Cellular Automata in FPGA Logic
Mathias Halbach, Rolf Hoffmann
1. Introduction
2. Implementing Cellular Automata in Software (C)
3. Hardware Prototype Implementation
4. Comparison Hardware vs. Software
5. Conclusion
-
1. Introduction
Cellular Automata – Pioneers
John von Neumann (1903-1957)
Konrad Zuse (1910-1995) with "Rechnender Raum" (computing space)
-
Cellular Automata (CA)
Optimal model for applications with an inherent local neighborhood, such as:
physical fields, lattice-gas models, models of growth, moving particles, fluid flow, logic simulation, numerical algorithms, routing problems, picture processing, genetic algorithms, cellular neural networks.
-
Hardware Platform CEPRA and CDL
CEPRA: Cellular Processing Architecture
– CEPRA-S, 2001 (see next slide)
– CEPRA-3D, 1997 (2 FPGAs, 3 dimensional)
– CEPRA-1D, 1996 (1 FPGA)
– CEPRA-1X, 1996 (1 FPGA)
– CEPRA-8D, 1995 (8 DSPs)
– CEPRA-8L, 1994 (8 FPGAs)
CDL: Cellular Description Language
-
CEPRA-S
2 FPGAs
8 Data Memories
1 Program Memory
1 Special Memory
-
Prototyping Platform
Altera Flex 10k Evaluation Board
– FPGA: Flex EPF10K70RC240-4 with 3756 logic cells
MAX+plus II Tools – AHDL, VERILOG FPGA-Logic
-
Question
If Cellular Automata are implemented in FPGA-Logic:
How high is the speed-up in comparison to a software implementation on a PC?
-
CA Rules used for Comparison
1. Rule: Belousov-Zhabotinsky Reaction
– describes an oscillating chemical process
– This rule is neither very simple nor very complex
function fBZR(Cell, North, East, South, West);
const n=127, g=11;
begin
b := (North=n) + (East=n) + (South=n) + (West=n);
a := (0
-
Rule MinMax
2. Rule: MinMax is more complex
function fMinMax (Cell, North, East, South, West);
const s1=40, s2=215;
A:= max (Cell, North, East, South, West);
B:= min (Cell, North, East, South, West);
small:= (Cells2);
fMinMax := A - B + 8 * small - 8 * big;
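The MinMax rule could be written in C roughly as follows. This is a sketch under stated assumptions, not the authors' code: the slide's definitions of small and big did not survive extraction, so the code assumes small = (Cell < s1) and big = (Cell > s2). The names f_min_max, max5 and min5 are illustrative, and the result is returned as a plain int because the slide does not show how values outside 0..255 are handled.

```c
/* Thresholds as given on the slide. */
enum { S1 = 40, S2 = 215 };

/* Largest of five values (illustrative helper, not from the slide). */
static int max5(int a, int b, int c, int d, int e)
{
    int m = a;
    if (b > m) m = b;
    if (c > m) m = c;
    if (d > m) m = d;
    if (e > m) m = e;
    return m;
}

/* Smallest of five values (illustrative helper, not from the slide). */
static int min5(int a, int b, int c, int d, int e)
{
    int m = a;
    if (b < m) m = b;
    if (c < m) m = c;
    if (d < m) m = d;
    if (e < m) m = e;
    return m;
}

/* MinMax rule: spread of the neighborhood, pushed up by 8 for small
 * cells and down by 8 for big cells.
 * ASSUMPTION: small = (cell < S1), big = (cell > S2). */
int f_min_max(int cell, int north, int east, int south, int west)
{
    int a = max5(cell, north, east, south, west);
    int b = min5(cell, north, east, south, west);
    int small = cell < S1;   /* 1 if true, 0 otherwise */
    int big = cell > S2;     /* 1 if true, 0 otherwise */
    return a - b + 8 * small - 8 * big;
}
```

For a mid-range cell the result is simply max minus min of the five values; the two thresholds only add or subtract the constant 8.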
-
2. Implementing Cellular Automata in Software (C)
Naive Implementation
Compute-Phase
for(x=1; x
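The slide's code is cut off after the first loop header, so the following is only a sketch of what a naive two-phase implementation typically looks like. The field size N, the type cell_t, the placeholder rule (a neighbor average) and the choice of which index direction counts as north are all assumptions made for illustration.

```c
#include <stdint.h>

#define N 8                 /* hypothetical field size, border included */

typedef uint8_t cell_t;

/* Placeholder rule (hypothetical): average of the four neighbors.
 * A real run would call the BZR or MinMax rule here instead. */
static cell_t rule(cell_t c, cell_t n, cell_t e, cell_t s, cell_t w)
{
    (void)c;                /* the placeholder ignores the center cell */
    return (cell_t)((n + e + s + w) / 4);
}

/* One generation on field a, using b as scratch space.
 * ASSUMPTION: x is the row index, so a[x-1][y] is the northern cell. */
void naive_step(cell_t a[N][N], cell_t b[N][N])
{
    /* Compute phase: new states of all kernel cells go into b. */
    for (int x = 1; x < N - 1; x++)
        for (int y = 1; y < N - 1; y++)
            b[x][y] = rule(a[x][y], a[x - 1][y], a[x][y + 1],
                           a[x + 1][y], a[x][y - 1]);

    /* Update phase: copy the new states back into a. */
    for (int x = 1; x < N - 1; x++)
        for (int y = 1; y < N - 1; y++)
            a[x][y] = b[x][y];
}
```

The separate update phase touches every cell a second time, and the two-dimensional indexing costs an address calculation per access; both are targets of the optimizations on the next slide.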
-
Code Optimizing Methods
Use a one-dimensional field to reduce address calculations.
Use pointers instead of indices.
Write the calculation function inline in the compute phase.
Implement the border with additional memory instead of if-statements.
Eliminate the update phase by using two pointers that point to A and B in one computation, and interchange the pointers for the next computation.
Change the loop iteration index from 1..n to n-1..0 (faster loop termination check).
Reuse variables and intermediate results.
Reduce conditional statements, e.g. by using tables.
Use "case" (switch) instead of multiple if-statements, nested if necessary.
Keep the cache filled.
Copy whole lines with a fast copy procedure (memcpy) into a three-line buffer and operate on those lines instead of accessing the whole field directly.
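Several of these methods can be combined in one short sketch: a one-dimensional field, pointer access, an inlined rule, a border held in extra memory, down-counting loops, and a pointer swap in place of the update phase. The struct field_t, the function names and the placeholder rule (a neighbor average) are assumptions for illustration, not the code that was measured.

```c
#include <stdint.h>
#include <stdlib.h>

typedef uint8_t cell_t;

/* Hypothetical container: two (n+2)*(n+2) buffers, border included. */
typedef struct {
    int n;          /* kernel is n * n cells */
    int stride;     /* n + 2: one border cell on each side */
    cell_t *cur;    /* current generation */
    cell_t *next;   /* generation being computed */
} field_t;

/* Allocate both buffers zero-filled; returns 0 on success, -1 on failure. */
int field_init(field_t *f, int n)
{
    size_t size = (size_t)(n + 2) * (size_t)(n + 2);
    f->n = n;
    f->stride = n + 2;
    f->cur = calloc(size, sizeof(cell_t));
    f->next = calloc(size, sizeof(cell_t));
    if (f->cur == NULL || f->next == NULL) {
        free(f->cur);
        free(f->next);
        return -1;
    }
    return 0;
}

void field_free(field_t *f)
{
    free(f->cur);
    free(f->next);
}

/* One generation. Border cells are never written, so no if-statements
 * are needed inside the loops. */
void field_step(field_t *f)
{
    const int stride = f->stride;
    for (int row = f->n; row >= 1; row--) {          /* count down */
        const cell_t *src = f->cur + row * stride + 1;
        cell_t *dst = f->next + row * stride + 1;
        for (int i = f->n; i > 0; i--, src++, dst++) {
            /* Placeholder rule, written inline: average of N, E, S, W. */
            *dst = (cell_t)((src[-stride] + src[1] +
                             src[stride] + src[-1]) / 4);
        }
    }
    /* Swap the pointers instead of copying the field back. */
    cell_t *tmp = f->cur;
    f->cur = f->next;
    f->next = tmp;
}
```

After field_step returns, f->cur always points at the newest generation, so callers never need to know which of the two buffers is currently in use.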
-
Computation Time BZR (optimized C code)
Size n*n       #cells   generations   time per cell   clock cycles
                        per second    operation       per cell op.
128 * 128      16 k     5794          10 ns           25
256 * 256      65 k     1533          10 ns           24
512 * 512      256 k    333           11 ns           28
1024 * 1024    1 M      83            11 ns           28
2048 * 2048    4 M      21            11 ns           28
4096 * 4096    16 M     5             12 ns           29
8192 * 8192    64 M     1.2           13 ns           31
Compiler: GNU gcc 3.3.1 with optimization parameter -O3.
Pentium 4, 2.4 GHz (Fujitsu/Siemens), Windows XP, Cygwin 1.5.5c
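The last two table columns follow from the measured generations per second. As a worked check, assuming that the time per cell operation is 1 / (generations per second * number of cells) and that the cycle count is that time multiplied by the 2.4 GHz clock, the relation can be sketched in C (function names are illustrative):

```c
/* Time for one cell operation in nanoseconds, derived from the
 * measured generation rate and the number of cells per generation. */
double ns_per_cell_op(double generations_per_second, double cells)
{
    return 1e9 / (generations_per_second * cells);
}

/* Clock cycles per cell operation; ns * GHz gives cycles directly. */
double cycles_per_cell_op(double ns_per_op, double clock_ghz)
{
    return ns_per_op * clock_ghz;
}
```

For the 128 * 128 row, ns_per_cell_op(5794, 128.0 * 128.0) gives about 10.5 ns, and cycles_per_cell_op of that value at 2.4 GHz gives about 25 cycles, in line with the table.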
-
Computation Time MinMax (optimized C code)
Size n*n       #cells   generations   time per cell   clock cycles
                        per second    operation       per cell op.
128 * 128      16 k     998           60 ns           143
256 * 256      65 k     258           60 ns           142
512 * 512      256 k    62            61 ns           147
1024 * 1024    1 M      16            62 ns           148
2048 * 2048    4 M      4             61 ns           147
4096 * 4096    16 M     1             60 ns           144
8192 * 8192    64 M     0.3           60 ns           143
Compiler: GNU gcc 3.3.1 with optimization parameter -O3.
Pentium 4, 2.4 GHz (Fujitsu/Siemens), Windows XP, Cygwin 1.5.5c
-
3. Hardware Prototype Implementation
module rule(clk, cnew, n, e, c, w, s);
input clk;
input [7:0] c,n,e,s,w;
output [7:0] cnew;
parameter s1 = 42, s2 = 213;
wire [2:0] as1 = (ns2);
wire [7:0] as1_8bit = as1;
wire [7:0] as2_8bit = as2;
wire [7:0] as1_as2 = (as1_8bit
-
Hardware Array Size
kernel cells
border cells
6 x 6 cells
4 x 4 = 16 kernel cells
16 rules in parallel
-
Rule BZR synthesized (16 rules parallel)
pipeline   Area (0) ..   % used      # used        max.   time per   speed
stages     Speed (10)    chip area   logic cells   MHz    cell op.   up
0          0             48          1833          11.8   5.30 ns    1.9
0          5             48          1833          11.7   5.34 ns    1.9
0          10            59          2233          12.1   5.17 ns    1.9
2          0             49          1869          18.2   3.43 ns    2.9
2          5             49          1869          17.6   3.55 ns    2.8
2          10            56          2119          21.4   2.92 ns    3.4
Pipeline stages
– do not change the number of used logic cells significantly
– increase the speed-up
approx. 2119/16 = 132 logic cells per CA cell needed
-
Rule MinMax synthesized (16 rules parallel)
pipeline   Area (0) ..   % used      # used        max.   time per   speed
stages     Speed (10)    chip area   logic cells   MHz    cell op.   up
0          0             65          2460          6.6    9.47 ns    6
0          5             65          2460          6.5    9.62 ns    6
0          10            80          3016          6.4    9.77 ns    6
3          0             60          2259          17.7   3.53 ns    17
3          5             60          2259          19.8   3.16 ns    19
3          10            68          2554          22.8   2.74 ns    22
Pipeline stages have the same behavior
approx. 2554/16 = 160 logic cells per CA cell needed
-
4. Comparison Hardware vs. Software
The computation time for one new cell state in software is
$t_S = k_S \cdot T_S$
– $k_S$ = number of clock cycles per rule
– $T_S$ = time for one computer clock.
The computation time for one new cell state in FPGA hardware is
$t_H = k_H \cdot T_H / p$
– $k_H$ = number of clock cycles per rule
– $T_H$ = time for one FPGA clock.
– $p$ = degree of parallelism in hardware.
The speed-up is
$S = t_S / t_H = (k_S / k_H) \cdot (T_S / T_H) \cdot p$
Prototype platform: $k_S / k_H = 24..142 / 1$, $p = 16$
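The speed-up relation can be checked numerically. The sketch below assumes t_S = k_S * T_S and t_H = k_H * T_H / p, which is the reading consistent with the later arithmetic S = 24 * 1/15 * 680; the function name and the use of clock frequencies in MHz instead of clock periods are choices made here for illustration.

```c
/* Speed-up of hardware over software.
 * k_s, k_h: clock cycles per rule in software and hardware
 * f_s_mhz, f_h_mhz: PC and FPGA clock frequencies in MHz
 * p: number of rules computed in parallel in hardware
 * T_S / T_H equals f_H / f_S because each period is 1 / frequency. */
double speed_up(double k_s, double k_h,
                double f_s_mhz, double f_h_mhz, double p)
{
    return (k_s / k_h) * (f_h_mhz / f_s_mhz) * p;
}
```

With the prototype figures (2.4 GHz PC, p = 16, k_H = 1), speed_up(24, 1, 2400, 21.4, 16) is about 3.4 for rule BZR and speed_up(142, 1, 2400, 22.8, 16) is about 22 for rule MinMax, in agreement with the synthesis tables.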
-
2004 Altera FPGA Devices
47 times larger than the prototype device used
Operation speed 300-500 MHz: 400/25 = 16 times faster
-
Performance for 2004 Technology FPGAs
Assumptions (to use EP2S180)
– Total # of logic cells: 179,400
– 50% = 89,700 will be used to implement the rule; 50% are free for additional logic (interface to memories, input/output)
– FPGA clock rate is ≈1/15 (pessimistic assumption) of the actual PC clock rate (3.5 GHz) → 233 MHz
Degree of parallel rules
– p(Rule BZR) = 89,700/132 = 680
– p(Rule MM) = 89,700/160 = 561
Speed-Up
– S(Rule BZR) = 24 * 1/15 * 680 = 1,088
– S(Rule MM) = 142 * 1/15 * 561 = 5,311
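The projection above is plain arithmetic and can be reproduced in a few lines. The sketch assumes k_H = 1 and rounds the degree of parallelism to the nearest integer, as the slide's figures appear to do; the function names are illustrative.

```c
/* Number of rule instances that fit into the logic-cell budget,
 * rounded to the nearest integer (the slide's figures match rounding,
 * not truncation). */
long parallel_rules(long logic_cells_available, long logic_cells_per_rule)
{
    return (logic_cells_available + logic_cells_per_rule / 2)
           / logic_cells_per_rule;
}

/* Projected speed-up for k_H = 1:
 * software cycles per rule * (FPGA clock / PC clock) * parallelism. */
double projected_speed_up(double k_s, double clock_ratio, long p)
{
    return k_s * clock_ratio * (double)p;
}
```

With 89,700 logic cells, parallel_rules gives 680 for rule BZR (132 logic cells per CA cell) and 561 for rule MM (160 logic cells per CA cell). At a clock ratio of 1/15, the projected speed-ups come out at 1,088 and about 5,311.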
-
5. Conclusion (Altera Flex 10k → Stratix II vs. SW)
CA architectures can be implemented in parallel
– Rule BZR: p = 16 → 680
– Rule MM: p = 16 → 561
Number of clock cycles to execute the rule
– Rule BZR: kS = 24 (in Software), kH = 1 (in Hardware)
– Rule MM: kS = 142 (in Software), kH = 1 (in Hardware)
Speed-Up
– Rule BZR: S = 3.4 → 1,088
– Rule MM: S = 22 → 5,311
Advantage of FPGA solution
– Much less hardware (1 FPGA ⇔ thousands of PCs)
– Less power consumption, less space
– Programming: Easier than managing thousands of PCs?
-
Thank You For Your Attention
Darmstadt University of Technology
Computer Architecture Group
Prof. R. Hoffmann
Hochschulstrasse 10
D-64289 Darmstadt
Germany
http://www.ra.informatik.tu-darmstadt.de/
Mathias Halbach