Darmstadt University of Technology, Computer Architecture Group

Implementing Cellular Automata in FPGA Logic

Mathias Halbach, Rolf Hoffmann

1. Introduction

2. Implementing Cellular Automata in Software (C)

3. Hardware Prototype Implementation

4. Comparison Hardware vs. Software

5. Conclusion

1. Introduction

Cellular Automata – Pioneers

John von Neumann (1903-1957)

Konrad Zuse (1910-1995) with "Rechnender Raum" (computing space)

Cellular Automata (CA)

An optimal model for applications with an inherent local neighborhood:

physical fields, lattice-gas models, models of growth, moving particles, fluid flow, logic simulation, numerical algorithms, routing problems, picture processing, genetic algorithms, cellular neural networks.

Hardware Platform CEPRA and CDL

CEPRA: Cellular Processing Architecture
– CEPRA-S, 2001 (see next slide)
– CEPRA-3D, 1997 (2 FPGAs, 3 dimensional)
– CEPRA-1D, 1996 (1 FPGA)
– CEPRA-1X, 1996 (1 FPGA)
– CEPRA-8D, 1995 (8 DSPs)
– CEPRA-8L, 1994 (8 FPGAs)

CDL: Cellular Description Language

CEPRA-S

2 FPGAs

8 Data Memories

1 Program Memory

1 Special Memory

Prototyping Platform

Altera Flex 10k Evaluation Board
– FPGA: Flex EPF10K70RC240-4 with 3756 logic cells

MAX+plus II Tools
– AHDL, Verilog (FPGA logic)

Question

If Cellular Automata are implemented in FPGA-Logic:

How high is the speed-up in comparison to a software implementation on a PC?

CA Rules used for Comparison

1. Rule: Belousov-Zhabotinsky Reaction (BZR)
– describes an oscillating chemical process
– This rule is neither very simple nor very complex

function fBZR(Cell, North, East, South, West);
const n=127, g=11;
begin
b := (North=n) + (East=n) + (South=n) + (West=n);
a := (0 [...]

Rule MinMax

2. Rule: MinMax
– This rule is more complex

function fMinMax (Cell, North, East, South, West);
const s1=40, s2=215;
A := max (Cell, North, East, South, West);
B := min (Cell, North, East, South, West);
small := (Cell [...] s2);
fMinMax := A - B + 8 * small - 8 * big;
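
The MinMax rule can be sketched in C as follows. This is a minimal sketch under stated assumptions, not the authors' code: the comparisons that define `small` and `big` are garbled in the slide, so the sketch assumes `small = (Cell < s1)` and `big = (Cell > s2)`. It also assumes 8-bit cell states that wrap around modulo 256. The names `max5`, `min5` and `f_minmax` are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Thresholds s1 and s2 as given on the slide. */
enum { S1 = 40, S2 = 215 };

/* Largest of the five cell states. */
static uint8_t max5(uint8_t c, uint8_t n, uint8_t e, uint8_t s, uint8_t w)
{
    uint8_t m = c;
    if (n > m) m = n;
    if (e > m) m = e;
    if (s > m) m = s;
    if (w > m) m = w;
    return m;
}

/* Smallest of the five cell states. */
static uint8_t min5(uint8_t c, uint8_t n, uint8_t e, uint8_t s, uint8_t w)
{
    uint8_t m = c;
    if (n < m) m = n;
    if (e < m) m = e;
    if (s < m) m = s;
    if (w < m) m = w;
    return m;
}

/* New cell state: A - B + 8 * small - 8 * big. */
uint8_t f_minmax(uint8_t cell, uint8_t north, uint8_t east,
                 uint8_t south, uint8_t west)
{
    int a = max5(cell, north, east, south, west);
    int b = min5(cell, north, east, south, west);
    int small = (cell < S1);   /* ASSUMPTION: comparison is garbled in the source */
    int big = (cell > S2);     /* ASSUMPTION: comparison is garbled in the source */

    assert(a >= b);            /* the maximum can never be below the minimum */

    /* ASSUMPTION: the result wraps modulo 256 like an 8-bit register. */
    return (uint8_t)(a - b + 8 * small - 8 * big);
}
```

For example, `f_minmax(10, 20, 30, 5, 15)` has maximum 30 and minimum 5, and the cell is below `s1`, so under these assumptions the new state is 30 - 5 + 8 = 33.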

2. Implementing Cellular Automata in Software (C)

Naive Implementation

Compute-Phase
for(x=1; x [...]
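
The slide's code is cut off after the start of the compute phase, so the following is only a sketch of what a naive two-phase implementation might look like, not the original listing. It assumes two two-dimensional fields `A` and `B` with one ring of border cells, a compute phase that writes `B` from `A`, and an update phase that copies `B` back to `A`. The field size `N`, the function `step_naive` and the stand-in rule `rule_max` are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define N 8  /* hypothetical field size; rows/columns 0 and N+1 are border */

typedef uint8_t (*rule_fn)(uint8_t cell, uint8_t north, uint8_t east,
                           uint8_t south, uint8_t west);

/* Stand-in rule for illustration only (not one of the rules on the slides). */
uint8_t rule_max(uint8_t cell, uint8_t north, uint8_t east,
                 uint8_t south, uint8_t west)
{
    uint8_t m = cell;
    if (north > m) m = north;
    if (east > m) m = east;
    if (south > m) m = south;
    if (west > m) m = west;
    return m;
}

/* Computes one generation of the field A, using B as scratch space. */
void step_naive(uint8_t A[N + 2][N + 2], uint8_t B[N + 2][N + 2], rule_fn rule)
{
    int x, y;

    assert(rule != 0);

    /* Compute phase: every new state is derived from the old field A. */
    for (x = 1; x <= N; x++)
        for (y = 1; y <= N; y++)
            B[x][y] = rule(A[x][y], A[x][y - 1], A[x + 1][y],
                           A[x][y + 1], A[x - 1][y]);

    /* Update phase: copy the new generation back into A. */
    for (x = 1; x <= N; x++)
        for (y = 1; y <= N; y++)
            A[x][y] = B[x][y];
}
```

The second field is needed because all cells must see the old generation. Writing results directly into `A` would mix old and new states.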

Code Optimizing Methods

Use a one-dimensional field to reduce address calculations.

Use pointers instead of indices.

Write the calculation function inline into the compute phase.

Implement border with additional memory instead of if-statements.

Eliminate the update phase by using two pointers that point to A and B during one computation, and swap the pointers for the next computation.

Change the loop iteration index from 1..n to n-1..0 (faster loop termination check).

Reuse variables and intermediate results.

Reduce conditional statements, e.g. by using tables.

Use “case” (switch) instead of multiple if-statements, also nest if necessary.

Keep the cache filled.

Copy whole lines with a fast copy procedure (memcpy) into a three-line buffer and operate on these lines instead of accessing the whole field directly.
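
Several of the methods above can be combined as in the following sketch: a one-dimensional field, pointers instead of indices, an inlined rule, border memory instead of if-statements, a count-down loop, and pointer swapping instead of an update phase. This is an illustration under assumptions, not the authors' optimized code. The function names `step_fast` and `run` are hypothetical, and the inlined rule (maximum of the cell and its four neighbors) is a stand-in.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Computes one generation from src into dst. Both fields are one-dimensional
   arrays of (n + 2) * (n + 2) cells; the outer ring is border memory that is
   only read, never written, so it must be initialized in both fields. */
void step_fast(const uint8_t *src, uint8_t *dst, size_t n)
{
    const size_t stride = n + 2;
    size_t x, y;

    assert(src != dst);

    for (y = n; y > 0; y--) {                    /* count down to zero */
        const uint8_t *cur = src + y * stride + 1;
        const uint8_t *north = cur - stride;
        const uint8_t *south = cur + stride;
        uint8_t *out = dst + y * stride + 1;

        for (x = n; x > 0; x--) {
            /* Rule written inline (stand-in rule: maximum of 5 cells). */
            uint8_t m = *cur;
            if (*north > m) m = *north;
            if (cur[1] > m) m = cur[1];          /* east */
            if (*south > m) m = *south;
            if (cur[-1] > m) m = cur[-1];        /* west */
            *out++ = m;
            cur++;
            north++;
            south++;
        }
    }
}

/* Runs the given number of generations and returns the field that holds
   the final state. The pointers are swapped instead of copying B to A. */
uint8_t *run(uint8_t *a, uint8_t *b, size_t n, unsigned generations)
{
    while (generations-- > 0) {
        uint8_t *tmp;
        step_fast(a, b, n);
        tmp = a;
        a = b;
        b = tmp;
    }
    return a;
}
```

Because the pointers are swapped after every generation, the caller has to use the pointer returned by `run` rather than assume the result is in the first field.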

Computation Time BZR (optimized C code)

Size n*n       #cells   generations   time per cell   clock cycles
                        per second    operation       per cell op.
128 * 128      16 k     5794          10 ns           25
256 * 256      65 k     1533          10 ns           24
512 * 512      256 k    333           11 ns           28
1024 * 1024    1 M      83            11 ns           28
2048 * 2048    4 M      21            11 ns           28
4096 * 4096    16 M     5             12 ns           29
8192 * 8192    64 M     1.2           13 ns           31

Compiler: GNU gcc 3.3.1 with optimization parameter -O3
Pentium 4, 2.4 GHz (Fujitsu/Siemens), Windows XP, Cygwin 1.5.5c
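
The last two table columns follow from the measured generations per second. The sketch below shows that arithmetic, assuming the time per cell operation is 1 / (generations per second * n * n) and the cycle count is that time multiplied by the 2.4 GHz clock rate. The function names are hypothetical.

```c
#include <assert.h>

/* Time per cell operation in nanoseconds for an n x n field. */
double time_per_cell_ns(unsigned n, double generations_per_second)
{
    assert(n > 0 && generations_per_second > 0.0);
    return 1e9 / (generations_per_second * (double)n * (double)n);
}

/* Clock cycles per cell operation; ns * GHz gives cycles directly. */
double cycles_per_cell(double time_ns, double clock_ghz)
{
    assert(time_ns > 0.0 && clock_ghz > 0.0);
    return time_ns * clock_ghz;
}
```

For the first row, `time_per_cell_ns(128, 5794.0)` gives about 10.5 ns and `cycles_per_cell` at 2.4 GHz gives about 25 cycles, in line with the rounded table entries.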

Computation Time MinMax (optimized C code)

Size n*n       #cells   generations   time per cell   clock cycles
                        per second    operation       per cell op.
128 * 128      16 k     998           60 ns           143
256 * 256      65 k     258           60 ns           142
512 * 512      256 k    62            61 ns           147
1024 * 1024    1 M      16            62 ns           148
2048 * 2048    4 M      4             61 ns           147
4096 * 4096    16 M     1             60 ns           144
8192 * 8192    64 M     0.3           60 ns           143

Compiler: GNU gcc 3.3.1 with optimization parameter -O3
Pentium 4, 2.4 GHz (Fujitsu/Siemens), Windows XP, Cygwin 1.5.5c

3. Hardware Prototype Implementation

module rule(clk, cnew, n, e, c, w, s);
input clk;
input [7:0] c,n,e,s,w;
output [7:0] cnew;
parameter s1 = 42, s2 = 213;

wire [2:0] as1 = (n [...] s2);
wire [7:0] as1_8bit = as1;
wire [7:0] as2_8bit = as2;
wire [7:0] as1_as2 = (as1_8bit [...]

Hardware Array Size

6 x 6 cells (kernel cells surrounded by border cells)

4 x 4 = 16 kernel cells

16 rules in parallel

Rule BZR synthesized (16 rules parallel)

pipeline   Area (0) ..   % used      # used        max.    time per    speed
stages     Speed (10)    chip area   logic cells   MHz     cell op.    up
0          0             48          1833          11.8    5.30 ns     1.9
0          5             48          1833          11.7    5.34 ns     1.9
0          10            59          2233          12.1    5.17 ns     1.9
2          0             49          1869          18.2    3.43 ns     2.9
2          5             49          1869          17.6    3.55 ns     2.8
2          10            56          2119          21.4    2.92 ns     3.4

Pipeline stages
– do not change the number of used logic cells significantly
– increase the speed-up

Approx. 2119/16 = 132 logic cells per CA cell needed
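
The derived columns of the table can be reproduced with a few lines of C. The sketch assumes that each of the p = 16 parallel rule instances delivers one new cell state per clock cycle, so the time per cell operation is 1 / (f_max * p). It also assumes the speed-up is measured against the roughly 10 ns per cell operation of the optimized C code for the BZR rule. The function names are hypothetical.

```c
#include <assert.h>

/* Time per cell operation in ns when p rule instances each finish one
   cell per clock cycle at fmax_mhz. */
double hw_time_per_cell_ns(double fmax_mhz, unsigned p)
{
    assert(fmax_mhz > 0.0 && p > 0);
    return 1000.0 / (fmax_mhz * (double)p);
}

/* Speed-up as the ratio of software to hardware time per cell operation. */
double speed_up_from_times(double sw_ns, double hw_ns)
{
    assert(sw_ns > 0.0 && hw_ns > 0.0);
    return sw_ns / hw_ns;
}

/* Average number of logic cells spent on one CA cell. */
double logic_cells_per_ca_cell(unsigned logic_cells, unsigned p)
{
    assert(p > 0);
    return (double)logic_cells / (double)p;
}
```

For the last table row, 21.4 MHz and 16 rules give about 2.92 ns per cell operation, and 10 ns / 2.92 ns gives a speed-up of about 3.4.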

Rule MinMax synthesized (16 rules parallel)

pipeline   Area (0) ..   % used      # used        max.    time per    speed
stages     Speed (10)    chip area   logic cells   MHz     cell op.    up
0          0             65          2460          6.6     9.47 ns     6
0          5             65          2460          6.5     9.62 ns     6
0          10            80          3016          6.4     9.77 ns     6
3          0             60          2259          17.7    3.53 ns     17
3          5             60          2259          19.8    3.16 ns     19
3          10            68          2554          22.8    2.74 ns     22

Pipeline stages show the same behavior

Approx. 2554/16 = 160 logic cells per CA cell needed

4. Comparison Hardware vs. Software

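The computation time for one new cell state in software is

$t_S = k_S \cdot T_S$

– $k_S$ = number of clock cycles per rule
– $T_S$ = time for one computer clock cycle

The computation time for one new cell state in FPGA hardware is

$t_H = \frac{k_H \cdot T_H}{p}$

– $k_H$ = number of clock cycles per rule
– $T_H$ = time for one FPGA clock cycle
– $p$ = degree of parallelism in hardware

The speed-up is

$S = \frac{t_S}{t_H} = \frac{k_S}{k_H} \cdot \frac{T_S}{T_H} \cdot p$

Prototype platform: $k_S / k_H = 24..142 / 1$, $p = 16$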
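
As a worked check of the speed-up relation, the sketch below evaluates S = (kS / kH) * (TS / TH) * p for the prototype. It uses the 2.4 GHz PC clock and the best synthesized FPGA clock rates from the earlier slides (21.4 MHz for BZR, 22.8 MHz for MinMax). The function name `speed_up` and its parameter names are hypothetical.

```c
#include <assert.h>

/* Speed-up S = (k_s / k_h) * (T_s / T_h) * p.
   T_s / T_h equals f_h / f_s because a clock period is 1 / frequency. */
double speed_up(double k_s, double k_h, double f_s_mhz, double f_h_mhz,
                unsigned p)
{
    assert(k_h > 0.0 && f_s_mhz > 0.0);
    return (k_s / k_h) * (f_h_mhz / f_s_mhz) * (double)p;
}
```

With these inputs, `speed_up(24, 1, 2400, 21.4, 16)` is about 3.4 for BZR and `speed_up(142, 1, 2400, 22.8, 16)` is about 22 for MinMax, matching the measured prototype speed-ups.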

2004 Altera FPGA Devices

47 times larger than the prototype device

Operating speed 300-500 MHz: 400/25 = 16 times faster

Performance for 2004 Technology FPGAs

Assumptions (using the EP2S180)
– Total number of logic cells: 179,400
– 50% = 89,700 are used to implement the rule; 50% remain free for additional logic (interface to memories, input/output)
– FPGA clock rate is ≈ 1/15 (pessimistic assumption) of the current PC clock rate (3.5 GHz) → 233 MHz

Degree of parallel rules
– p(Rule BZR) = 89,700/132 = 680
– p(Rule MM) = 89,700/160 = 561

Speed-Up
– S(Rule BZR) = 24 * 1/15 * 680 = 1,088
– S(Rule MM) = 142 * 1/15 * 561 = 5,311
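
The projection can be retraced in C. This sketch assumes, as the slide's numbers suggest, that the degree of parallelism is rounded to the nearest integer (a strict floor would give 679 and 560). It takes kH = 1 and an FPGA-to-PC clock ratio of 1/15. The function names are hypothetical.

```c
#include <assert.h>

/* Number of parallel rule instances that fit into the usable logic cells,
   rounded to the nearest integer as on the slide. */
unsigned parallel_rules(unsigned long usable_cells, unsigned cells_per_ca_cell)
{
    assert(cells_per_ca_cell > 0);
    return (unsigned)((usable_cells + cells_per_ca_cell / 2) / cells_per_ca_cell);
}

/* Projected speed-up S = (k_s / k_h) * clock_ratio * p, where clock_ratio
   is the FPGA clock rate divided by the PC clock rate. */
double projected_speed_up(double k_s, double k_h, double clock_ratio, unsigned p)
{
    assert(k_h > 0.0 && clock_ratio > 0.0);
    return (k_s / k_h) * clock_ratio * (double)p;
}
```

With 89,700 usable logic cells this yields p = 680 for BZR and p = 561 for MinMax, and speed-ups of 1,088 and about 5,311.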

5. Conclusion (Altera Flex 10k → Stratix II vs. SW)

CA architectures can be implemented in parallel
– Rule BZR: p = 16 → 680
– Rule MM: p = 16 → 561

Number of clock cycles to execute the rule
– Rule BZR: kS = 24 (in software), kH = 1 (in hardware)
– Rule MM: kS = 142 (in software), kH = 1 (in hardware)

Speed-Up
– Rule BZR: S = 3.4 → 1,088
– Rule MM: S = 22 → 5,311

Advantages of the FPGA solution
– Much less hardware (1 FPGA ⇔ thousands of PCs)
– Less power consumption, less space
– Programming: easier than managing thousands of PCs?

Thank You For Your Attention

Darmstadt University of Technology
Computer Architecture Group
Prof. R. Hoffmann
Hochschulstrasse 10
D-64289 Darmstadt
Germany

http://www.ra.informatik.tu-darmstadt.de/

Mathias Halbach