Implementing Cellular Automata in FPGA Logic. Mathias Halbach, Rolf Hoffmann. Darmstadt University of Technology, Computer Architecture Group

Transcript
  • Darmstadt University of Technology, Computer Architecture Group

    Implementing Cellular Automata in FPGA Logic

    Mathias Halbach, Rolf Hoffmann

    1. Introduction

    2. Implementing Cellular Automata in Software (C)

    3. Hardware Prototype Implementation

    4. Comparison Hardware vs. Software

    5. Conclusion

  • 1. Introduction

    Cellular Automata: Pioneers

    John von Neumann (1903-1957)

    Konrad Zuse (1910-1995) with "Rechnender Raum" (computing space)

  • Cellular Automata (CA)

    An optimal model for applications with inherent local neighborhood, for example:

    physical fields, lattice-gas models, models of growth, moving particles, fluid flow, logic simulation, numerical algorithms, routing problems, picture processing, genetic algorithms, cellular neural networks.

  • Hardware Platform CEPRA and CDL

    CEPRA: Cellular Processing Architecture
    – CEPRA-S, 2001 (see next slide)
    – CEPRA-3D, 1997 (2 FPGAs, 3 dimensional)
    – CEPRA-1D, 1996 (1 FPGA)
    – CEPRA-1X, 1996 (1 FPGA)
    – CEPRA-8D, 1995 (8 DSPs)
    – CEPRA-8L, 1994 (8 FPGAs)

    CDL: Cellular Description Language

  • CEPRA-S

    2 FPGAs

    8 Data Memories

    1 Program Memory

    1 Special Memory

  • Prototyping Platform

    Altera Flex 10k Evaluation Board
    – FPGA: Flex EPF10K70RC240-4 with 3756 logic cells

    MAX+plus II Tools – AHDL, VERILOG FPGA-Logic

  • Question

    If Cellular Automata are implemented in FPGA-Logic:

    How high is the speed-up in comparison to a software implementation on a PC?

  • CA Rules used for Comparison

    1. Rule: Belousov-Zhabotinsky Reaction (BZR)
    – describes an oscillating chemical process
    – this rule is neither very simple nor very complex

    function fBZR(Cell, North, East, South, West);
    const n=127, g=11;
    begin
      b := (North=n) + (East=n) + (South=n) + (West=n);
      a := (0 [...]

  • Rule MinMax

    2. Rule: MinMax is more complex

    function fMinMax (Cell, North, East, South, West);
    const s1=40, s2=215;
      A := max (Cell, North, East, South, West);
      B := min (Cell, North, East, South, West);
      small := (Cell<s1);
      big := (Cell>s2);
      fMinMax := A - B + 8 * small - 8 * big;
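
The MinMax rule can be tried out as a plain C function. This is only a sketch under stated assumptions: `small` and `big` are taken to be the 0/1 results of comparing the center cell with the thresholds `s1` and `s2`, the result is truncated to 8 bits like the cell states in the hardware module, and the name `f_min_max` is illustrative rather than taken from the slides.

```c
#include <stdint.h>

enum { S1 = 40, S2 = 215 };  /* thresholds s1 and s2 from the slide */

/* MinMax rule for one cell and its four neighbors.
 * Assumption: small/big test only the center cell against S1/S2.
 * Assumption: the result wraps modulo 256 (8-bit cell state). */
static uint8_t f_min_max(uint8_t cell, uint8_t north, uint8_t east,
                         uint8_t south, uint8_t west)
{
    const uint8_t v[5] = { cell, north, east, south, west };
    uint8_t a = v[0];  /* running maximum */
    uint8_t b = v[0];  /* running minimum */

    for (int i = 1; i < 5; i++) {
        if (v[i] > a) a = v[i];
        if (v[i] < b) b = v[i];
    }

    int small = cell < S1;  /* 1 if the center cell is below s1 */
    int big = cell > S2;    /* 1 if the center cell is above s2 */

    return (uint8_t)(a - b + 8 * small - 8 * big);
}
```

With these assumptions, a center cell of 100 with neighbors 50, 200, 30 and 120 yields 200 - 30 = 170, since neither threshold applies.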

  • 2. Implementing Cellular Automata in Software (C)

    Naive Implementation

    Compute-Phase
    for(x=1; x [...]
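
The loop on this slide is cut off in the transcript. As a rough sketch of what a naive two-phase implementation usually looks like, the following C code runs a compute phase into a second array and then an update phase that copies the result back. The field size `N`, the fixed border and the placeholder rule `rule_sum` are assumptions for illustration and are not taken from the slides.

```c
#include <stdint.h>

#define N 8  /* hypothetical field size, including a one-cell border */

/* Placeholder rule (not from the slides): sum of the five cells modulo 256. */
static uint8_t rule_sum(uint8_t c, uint8_t n, uint8_t e, uint8_t s, uint8_t w)
{
    return (uint8_t)(c + n + e + s + w);
}

/* One generation: compute phase into b, then update phase from b back to a. */
static void naive_step(uint8_t a[N][N], uint8_t b[N][N])
{
    /* Compute phase: evaluate the rule for every inner cell. */
    for (int x = 1; x < N - 1; x++)
        for (int y = 1; y < N - 1; y++)
            b[x][y] = rule_sum(a[x][y], a[x - 1][y], a[x][y + 1],
                               a[x + 1][y], a[x][y - 1]);

    /* Update phase: copy the new states back into the field. */
    for (int x = 1; x < N - 1; x++)
        for (int y = 1; y < N - 1; y++)
            a[x][y] = b[x][y];
}
```

The two-dimensional indexing and the separate copy loop are exactly the kind of overhead that hand optimization of such code tries to remove.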

  • Code Optimizing Methods

    Use a one-dimensional field to reduce address calculations.

    Use pointers instead of indices.

    Write the calculation function inline in the compute phase.

    Implement the border with additional memory instead of if-statements.

    Eliminate the update phase by using two pointers that point to the fields A and B during one computation, and swap the pointers for the next computation.

    Change the loop iteration index from 1..n to n-1..0 (faster loop termination check).

    Reuse variables and intermediate results.

    Reduce conditional statements, e.g. by using tables.

    Use “case” (switch) instead of multiple if-statements, and nest them if necessary.

    Keep the cache filled.

    Copy whole lines with a fast copy procedure (memcpy) into a three-line buffer and operate on these lines instead of accessing the whole field directly.
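
Several of the methods listed above can be combined in a few lines of C. The sketch below uses a one-dimensional field, pointers instead of indices, an inlined rule, border cells as padding, down-counting loops and a pointer swap in place of the update phase. The row length `W`, the helper names and the placeholder rule `rule_sum` are assumptions for illustration, not the authors' code.

```c
#include <stdint.h>

#define W 8              /* hypothetical row length, including border cells */
#define CELLS (W * W)    /* total number of cells in the 1-D field */

/* Placeholder rule (not from the slides): sum of the five cells modulo 256. */
static inline uint8_t rule_sum(uint8_t c, uint8_t n, uint8_t e,
                               uint8_t s, uint8_t w)
{
    return (uint8_t)(c + n + e + s + w);
}

/* One generation from src into dst. The border cells serve as padding,
 * so the inner loop needs no if-statements. */
static void fast_step(const uint8_t *src, uint8_t *dst)
{
    for (int row = W - 2; row >= 1; row--) {
        const uint8_t *p = src + row * W + 1;  /* first inner cell of the row */
        uint8_t *q = dst + row * W + 1;

        /* Count down to zero: W - 2 inner cells per row. */
        for (int i = W - 2; i > 0; i--, p++, q++)
            *q = rule_sum(*p, p[-W], p[1], p[W], p[-1]);
    }
}

/* Swap the roles of the two fields instead of copying one into the other. */
static void swap_fields(uint8_t **a, uint8_t **b)
{
    uint8_t *t = *a;
    *a = *b;
    *b = t;
}
```

A caller would keep two buffers, call `fast_step(cur, next)` and then `swap_fields(&cur, &next)` once per generation.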

  • Computation Time BZR (optimized C code)

    Size n*n      #cells   generations   time per cell   clock cycles
                           per second    operation       per cell op.
    128 * 128     16 k     5794          10 ns           25
    256 * 256     65 k     1533          10 ns           24
    512 * 512     256 k    333           11 ns           28
    1024 * 1024   1 M      83            11 ns           28
    2048 * 2048   4 M      21            11 ns           28
    4096 * 4096   16 M     5             12 ns           29
    8192 * 8192   64 M     1.2           13 ns           31

    Compiler: GNU gcc 3.3.1 with optimization parameter -O3.
    Pentium 4, 2.4 GHz (Fujitsu/Siemens), Windows XP, Cygwin 1.5.5c
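
The last two columns of the table follow from the first three. As a small worked example, assuming the time per cell operation is 1 / (cells * generations per second) and the cycle count is that time multiplied by the 2.4 GHz clock, the helper functions below (names are illustrative) reproduce the 128 * 128 row to within rounding.

```c
/* Time for one cell operation in nanoseconds. */
static double ns_per_cell_op(double cells, double generations_per_second)
{
    return 1e9 / (cells * generations_per_second);
}

/* Clock cycles per cell operation for a CPU clock given in GHz. */
static double cycles_per_cell_op(double ns_per_op, double clock_ghz)
{
    return ns_per_op * clock_ghz;
}
```

For 128 * 128 = 16384 cells at 5794 generations per second this gives roughly 10.5 ns and about 25 clock cycles per cell operation.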

  • Computation Time MinMax (optimized C code)

    Size n*n      #cells   generations   time per cell   clock cycles
                           per second    operation       per cell op.
    128 * 128     16 k     998           60 ns           143
    256 * 256     65 k     258           60 ns           142
    512 * 512     256 k    62            61 ns           147
    1024 * 1024   1 M      16            62 ns           148
    2048 * 2048   4 M      4             61 ns           147
    4096 * 4096   16 M     1             60 ns           144
    8192 * 8192   64 M     0.3           60 ns           143

    Compiler: GNU gcc 3.3.1 with optimization parameter -O3.
    Pentium 4, 2.4 GHz (Fujitsu/Siemens), Windows XP, Cygwin 1.5.5c

  • 3. Hardware Prototype Implementation

    module rule(clk, cnew, n, e, c, w, s);
    input clk;
    input [7:0] c,n,e,s,w;
    output [7:0] cnew;
    parameter s1 = 42, s2 = 213;

    wire [2:0] as1 = (n [...] s2);
    wire [7:0] as1_8bit = as1;
    wire [7:0] as2_8bit = as2;
    wire [7:0] as1_as2 = (as1_8bit [...]

  • Hardware Array Size

    kernel cells

    border cells

    6 x 6 cells
    4 x 4 = 16 kernel cells
    16 rules in parallel

  • Rule BZR synthesized (16 rules parallel)

    pipeline   Area (0) ..   % used      # used        max.   time per   speed
    stages     Speed (10)    chip area   logic cells   MHz    cell op.   up
    0          0             48          1833          11.8   5.30 ns    1.9
    0          5             48          1833          11.7   5.34 ns    1.9
    0          10            59          2233          12.1   5.17 ns    1.9
    2          0             49          1869          18.2   3.43 ns    2.9
    2          5             49          1869          17.6   3.55 ns    2.8
    2          10            56          2119          21.4   2.92 ns    3.4

    Pipeline stages
    – do not change the number of used logic cells significantly
    – increase the speed-up

    approx. 2119/16 = 132 logic cells per CA cell needed
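
The "time per cell op." figures appear to follow from the maximum clock rate and the 16 rules working in parallel. The sketch below assumes one rule evaluation per clock cycle in each of the parallel rule instances; the function names are illustrative.

```c
/* Hardware time per cell operation in ns, assuming each of the parallel
 * rule instances delivers one new cell state per clock cycle. */
static double hw_ns_per_cell_op(double fmax_mhz, int parallel_rules)
{
    return 1e3 / (fmax_mhz * parallel_rules);
}

/* Average number of logic cells needed per CA cell. */
static double logic_cells_per_ca_cell(int used_logic_cells, int ca_cells)
{
    return (double)used_logic_cells / ca_cells;
}
```

For 21.4 MHz and 16 parallel rules this gives about 2.92 ns, and 2119 logic cells spread over 16 CA cells give roughly 132 logic cells each.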

  • Rule MinMax synthesized (16 rules parallel)

    pipeline   Area (0) ..   % used      # used        max.   time per   speed
    stages     Speed (10)    chip area   logic cells   MHz    cell op.   up
    0          0             65          2460          6.6    9.47 ns    6
    0          5             65          2460          6.5    9.62 ns    6
    0          10            80          3016          6.4    9.77 ns    6
    3          0             60          2259          17.7   3.53 ns    17
    3          5             60          2259          19.8   3.16 ns    19
    3          10            68          2554          22.8   2.74 ns    22

    Pipeline stages have the same behavior.

    approx. 2554/16 = 160 logic cells per CA cell needed

  • 4. Comparison Hardware vs. Software

    The computation time for one new cell state in software is $t_S = k_S \cdot T_S$

    – $k_S$ = number of clock cycles per rule
    – $T_S$ = time for one computer clock.

    The computation time for one new cell state in FPGA hardware is $t_H = k_H \cdot T_H / p$

    – $k_H$ = number of clock cycles per rule
    – $T_H$ = time for one FPGA clock.
    – $p$ = degree of parallelism in hardware.

    The speed-up is $S = t_S / t_H = (k_S / k_H) \cdot (T_S / T_H) \cdot p$

    Prototype platform: $k_S / k_H = 24..142 / 1$, $p = 16$
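
The speed-up figures in this talk have the form S = (kS / kH) * (TS / TH) * p, where TS / TH equals the ratio of FPGA clock rate to PC clock rate. As a hedged sketch, the function below (its name and parameter order are illustrative) evaluates this expression for the prototype, using the 2.4 GHz Pentium 4 and the best synthesized clock rates of 21.4 MHz (BZR) and 22.8 MHz (MinMax).

```c
/* Speed-up of hardware over software.
 * k_s, k_h: clock cycles per rule in software and in hardware
 * f_s_mhz, f_h_mhz: clock rates of the PC and of the FPGA in MHz
 * p: number of rules evaluated in parallel in hardware */
static double speed_up(double k_s, double k_h,
                       double f_s_mhz, double f_h_mhz, double p)
{
    return (k_s / k_h) * (f_h_mhz / f_s_mhz) * p;
}
```

With kS = 24, kH = 1 and p = 16 this gives about 3.4 for rule BZR; with kS = 142 it gives about 22 for rule MinMax, which agrees with the speed-up columns of the synthesis tables.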

  • 2004 Altera FPGA Devices

    47 times larger than the prototype used

    Operation speed 300-500 MHz: 400/25 = 16 times faster

  • Performance for 2004 Technology FPGAs

    Assumptions (to use EP2S180)
    – Total # of logic cells: 179,400
    – 50% = 89,700 will be used to implement the rule, 50% are free for additional logic (interface to memories, input/output)
    – FPGA clock rate is ≈ 1/15 (pessimistic assumption) of the actual PC clock rate (3.5 GHz), i.e. 233 MHz

    Degree of parallel rules
    – p(Rule BZR) = 89,700/132 = 680
    – p(Rule MM) = 89,700/160 = 561

    Speed-Up
    – S(Rule BZR) = 24 * 1/15 * 680 = 1,088
    – S(Rule MM) = 142 * 1/15 * 561 = 5,311
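
The projection can be retraced step by step. The sketch below assumes that the degree of parallelism is the usable logic cell count divided by the logic cells per CA cell, rounded to the nearest integer as on the slide, and that kH = 1; the function names are illustrative.

```c
/* Number of rules that fit into the usable logic cells,
 * rounded to the nearest integer (the slide rounds 679.5 up to 680). */
static long parallel_rules(long usable_logic_cells, long cells_per_rule)
{
    return (usable_logic_cells + cells_per_rule / 2) / cells_per_rule;
}

/* Projected speed-up for k_h = 1: software cycles per rule, times the
 * FPGA-to-PC clock ratio, times the degree of parallelism. */
static double projected_speed_up(double k_s, double clock_ratio, long p)
{
    return k_s * clock_ratio * (double)p;
}
```

With 89,700 usable logic cells this gives p = 680 for rule BZR and p = 561 for rule MinMax, and with a clock ratio of 1/15 the speed-ups come out at 1,088 and about 5,311.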

  • 5. Conclusion (Altera Flex 10k → Stratix II vs. SW)

    CA-Architectures can be implemented in parallel
    – Rule BZR: p = 16 → 680
    – Rule MM: p = 16 → 561

    Number of clock cycles to execute the rule
    – Rule BZR: kS = 24 (in software), kH = 1 (in hardware)
    – Rule MM: kS = 142 (in software), kH = 1 (in hardware)

    Speed-Up
    – Rule BZR: S = 3.4 → 1,088
    – Rule MM: S = 22 → 5,311

    Advantage of FPGA solution
    – Much less hardware (1 FPGA ⇔ thousands of PCs)
    – Less power consumption, less space
    – Programming: Easier than managing thousands of PCs?

  • Thank You For Your Attention

    Darmstadt University of Technology
    Computer Architecture Group
    Prof. R. Hoffmann
    Hochschulstrasse 10
    D-64289 Darmstadt
    Germany

    http://www.ra.informatik.tu-darmstadt.de/

    Mathias Halbach