Implementing Cellular Automata in FPGA Logic...Darmstadt University of Technology Computer...

Darmstadt University of Technology Computer Architecture Group 1 Implementing Cellular Automata in FPGA Logic Mathias Halbach, Rolf Hoffmann 1. Introduction 2. Implementing Cellular Automata in Software (C) 3. Hardware Prototype Implementation 4. Comparison Hardware vs. Software 5. Conclusion


  • Implementing Cellular Automata in FPGA Logic

    Mathias Halbach, Rolf Hoffmann

    1. Introduction

    2. Implementing Cellular Automata in Software (C)

    3. Hardware Prototype Implementation

    4. Comparison Hardware vs. Software

    5. Conclusion

  • 1. Introduction: Cellular Automata Pioneers

    John von Neumann (1903-1957)

    Konrad Zuse (1910-1995) with "Rechnender Raum" (computing space)

  • Cellular Automata (CA)

    optimal model for applications with inherent local neighborhood

    physical fields, lattice-gas models, models of growth, moving particles, fluid flow, logic simulation, numerical algorithms, routing problems, picture processing, genetic algorithms, cellular neural networks.

  • Hardware Platform CEPRA and CDL

    CEPRA: Cellular Processing Architecture
      – CEPRA-S, 2001 (see next slide)
      – CEPRA-3D, 1997 (2 FPGAs, 3-dimensional)
      – CEPRA-1D, 1996 (1 FPGA)
      – CEPRA-1X, 1996 (1 FPGA)
      – CEPRA-8D, 1995 (8 DSPs)
      – CEPRA-8L, 1994 (8 FPGAs)

    CDL: Cellular Description Language

  • CEPRA-S

    2 FPGAs

    8 Data Memories

    1 Program Memory

    1 Special Memory

  • Prototyping Platform

    Altera Flex 10k Evaluation Board
      – FPGA: Flex EPF10K70RC240-4 with 3756 logic cells

    MAX+plus II Tools – AHDL, VERILOG FPGA-Logic

  • Question

    If Cellular Automata are implemented in FPGA-Logic:

    How large is the speed-up compared to a software implementation on a PC?

  • CA Rules used for Comparison

    1. Rule: Belousov-Zhabotinsky Reaction
      – describes an oscillating chemical process
      – this rule is neither very simple nor very complex

    function fBZR(Cell, North, East, South, West);
    const n=127, g=11;
    begin
      b := (North=n) + (East=n) + (South=n) + (West=n);
      a := (0

  • Rule MinMax

    2. Rule: MinMax (more complex)

    function fMinMax(Cell, North, East, South, West);
    const s1=40, s2=215;
    A := max(Cell, North, East, South, West);
    B := min(Cell, North, East, South, West);
    small := (Cell<s1);
    big := (Cell>s2);
    fMinMax := A - B + 8 * small - 8 * big;

  • 2. Implementing Cellular Automata in Software (C)

    Naive Implementation

    Compute-Phase

    for(x=1; x

  • Code Optimizing Methods

    Use a one-dimensional field to reduce address calculations.

    Use pointers instead of indices.

    Write the calculation function inline in the compute phase.

    Implement border with additional memory instead of if-statements.

    Eliminate the update phase by using two pointers that point to fields A and B during one computation and are swapped for the next.

    Change the loop iteration index from 1..n to n-1..0 (faster loop-termination check).

    Reuse variables and intermediate results.

    Reduce conditional statements, e.g. by using tables.

    Use “case” (switch) instead of multiple if-statements, also nest if necessary.

    Keep the cache filled.

    Copy whole lines with a fast copy procedure (memcpy) into a three-line buffer and operate on these lines instead of accessing the whole field directly.

  • Computation Time BZR (optimized C code)

    Size n*n       #cells   generations   time per cell   clock cycles
                            per second    operation       per cell op.
    128 * 128      16 k     5794          10 ns           25
    256 * 256      65 k     1533          10 ns           24
    512 * 512      256 k    333           11 ns           28
    1024 * 1024    1 M      83            11 ns           28
    2048 * 2048    4 M      21            11 ns           28
    4096 * 4096    16 M     5             12 ns           29
    8192 * 8192    64 M     1.2           13 ns           31

    Compiler: GNU gcc 3.3.1 with optimization parameter -O3
    Pentium 4, 2.4 GHz (Fujitsu/Siemens)
    Windows XP, Cygwin 1.5.5c

  • Computation Time MinMax (optimized C code)

    Size n*n       #cells   generations   time per cell   clock cycles
                            per second    operation       per cell op.
    128 * 128      16 k     998           60 ns           143
    256 * 256      65 k     258           60 ns           142
    512 * 512      256 k    62            61 ns           147
    1024 * 1024    1 M      16            62 ns           148
    2048 * 2048    4 M      4             61 ns           147
    4096 * 4096    16 M     1             60 ns           144
    8192 * 8192    64 M     0.3           60 ns           143

    Compiler: GNU gcc 3.3.1 with optimization parameter -O3
    Pentium 4, 2.4 GHz (Fujitsu/Siemens)
    Windows XP, Cygwin 1.5.5c

  • 3. Hardware Prototype Implementation

    module rule(clk, cnew, n, e, c, w, s);
      input clk;
      input [7:0] c, n, e, s, w;
      output [7:0] cnew;
      parameter s1 = 42, s2 = 213;

      wire [2:0] as1 = (ns2);
      wire [7:0] as1_8bit = as1;
      wire [7:0] as2_8bit = as2;
      wire [7:0] as1_as2 = (as1_8bit

  • Hardware Array Size

    kernel cells

    border cells

    6 x 6 cells

    4 x 4 = 16 kernel cells

    16 rules in parallel

  • Rule BZR synthesized (16 rules parallel)

    pipeline   Area (0) ..   % used      # used        max.    time per   speed
    stages     Speed (10)    chip area   logic cells   MHz     cell op.   up
    0          0             48          1833          11.8    5.30 ns    1.9
    0          5             48          1833          11.7    5.34 ns    1.9
    0          10            59          2233          12.1    5.17 ns    1.9
    2          0             49          1869          18.2    3.43 ns    2.9
    2          5             49          1869          17.6    3.55 ns    2.8
    2          10            56          2119          21.4    2.92 ns    3.4

    Pipeline stages
      – do not change the number of used logic cells significantly
      – increase the speed-up

    approx. 2119/16 = 132 logic cells per CA cell needed

  • Rule MinMax synthesized (16 rules parallel)

    pipeline   Area (0) ..   % used      # used        max.    time per   speed
    stages     Speed (10)    chip area   logic cells   MHz     cell op.   up
    0          0             65          2460          6.6     9.47 ns    6
    0          5             65          2460          6.5     9.62 ns    6
    0          10            80          3016          6.4     9.77 ns    6
    3          0             60          2259          17.7    3.53 ns    17
    3          5             60          2259          19.8    3.16 ns    19
    3          10            68          2554          22.8    2.74 ns    22

    Pipeline stages show the same behavior

    approx. 2554/16 = 160 logic cells per CA cell needed

  • 4. Comparison Hardware vs. Software

    The computation time for one new cell state in software is

    $t_S = k_S \cdot T_S$

      – $k_S$ = number of clock cycles per rule
      – $T_S$ = time for one computer clock cycle

    The computation time for one new cell state in FPGA hardware is

    $t_H = k_H \cdot T_H / p$

      – $k_H$ = number of clock cycles per rule
      – $T_H$ = time for one FPGA clock cycle
      – $p$ = degree of parallelism in hardware

    The speed-up is

    $S = t_S / t_H = (k_S / k_H) \cdot (T_S / T_H) \cdot p$

    Prototype platform: $k_S / k_H = 24..142 / 1$, $p = 16$
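    As a plausibility check, the prototype figures can be plugged into the speed-up taken as S = (k_S / k_H) * (T_S / T_H) * p. This is a sketch that assumes T_S / T_H equals the ratio of the FPGA clock rate to the PC clock rate, using the 2.4 GHz Pentium 4 and the fastest synthesized configurations (21.4 MHz for BZR, 22.8 MHz for MinMax) reported in this talk.

```latex
S_{\mathrm{BZR}} = \frac{k_S}{k_H}\cdot\frac{T_S}{T_H}\cdot p
  = \frac{24}{1}\cdot\frac{21.4\,\mathrm{MHz}}{2400\,\mathrm{MHz}}\cdot 16 \approx 3.4
\qquad
S_{\mathrm{MM}} = \frac{142}{1}\cdot\frac{22.8\,\mathrm{MHz}}{2400\,\mathrm{MHz}}\cdot 16 \approx 21.6
```

    Both values agree, after rounding, with the speed-ups of 3.4 and 22 listed for the synthesized rules.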

  • 2004 Altera FPGA Devices

    47 times larger than the prototype used

    operation speed 300-500 MHz: 400/25 = 16 times faster

  • Performance for 2004 Technology FPGAs

    Assumptions (using the EP2S180)
      – Total # of logic cells: 179,400
      – 50% = 89,700 will be used to implement the rule; 50% are free for additional logic (interface to memories, input/output)
      – FPGA clock rate is ≈1/15 (pessimistic assumption) of the current PC clock rate (3.5 GHz): 233 MHz

    Degree of parallel rules
      – p(Rule BZR) = 89,700/132 = 680
      – p(Rule MM) = 89,700/160 = 561

    Speed-Up
      – S(Rule BZR) = 24 * 1/15 * 680 = 1,088
      – S(Rule MM) = 142 * 1/15 * 561 = 5,311

  • 5. Conclusion (Altera Flex 10k → Stratix II vs. SW)

    CA architectures can be implemented in parallel
      – Rule BZR: p = 16 → 680
      – Rule MM: p = 16 → 561

    Number of clock cycles to execute the rule
      – Rule BZR: kS = 24 (in software), kH = 1 (in hardware)
      – Rule MM: kS = 142 (in software), kH = 1 (in hardware)

    Speed-Up
      – Rule BZR: S = 3.4 → 1,088
      – Rule MM: S = 22 → 5,311

    Advantages of the FPGA solution
      – Much less hardware (1 FPGA ⇔ thousands of PCs)
      – Less power consumption, less space
      – Programming: easier than managing thousands of PCs?

  • Thank You For Your Attention

    Darmstadt University of Technology
    Computer Architecture Group
    Prof. R. Hoffmann
    Hochschulstrasse 10
    D-64289 Darmstadt
    Germany

    http://www.ra.informatik.tu-darmstadt.de/

    Mathias [email protected]