Configuration Bitstream Reduction for SRAM-based FPGAs by Enumerating LUT Input Permutations The...

38
Configuration Bitstream Reduction for SRAM-based FPGAs by Enumerating LUT Input Permutations The University of British Columbia © 2011 Guy Lemieux Ameer Abdelhadi Guy Lemieux

Transcript of Configuration Bitstream Reduction for SRAM-based FPGAs by Enumerating LUT Input Permutations The...

Page 1: Configuration Bitstream Reduction for SRAM-based FPGAs by Enumerating LUT Input Permutations The University of British Columbia© 2011 Guy Lemieux Ameer.

Configuration Bitstream Reductionfor SRAM-based FPGAs

by Enumerating LUT Input Permutations

The University ofBritish Columbia © 2011 Guy Lemieux

Ameer Abdelhadi Guy Lemieux

Page 2: Configuration Bitstream Reduction for SRAM-based FPGAs by Enumerating LUT Input Permutations The University of British Columbia© 2011 Guy Lemieux Ameer.

Modern FPGAs are HUGE!

• Altera Stratix IV “4SGX530” device (45nm)– 200k 6-LUTs in one die

• FPGAs keep getting bigger– Moore’s Law

• 2x every 18-24 months• 28nm devices announced (Stratix V, Virtex 7)

– More than Moore• Xilinx 1,200k 6-LUTs using stacked silicon

2

Page 3: Configuration Bitstream Reduction for SRAM-based FPGAs by Enumerating LUT Input Permutations The University of British Columbia© 2011 Guy Lemieux Ameer.

Configuration Bits

• Growing bitstream storage & load time– Altera Stratix IV (4SGX, largest device)

• 4 seconds to boot up using serial EEPROM• 35 seconds to configure using USB 2.0• 20 MByte bitstream

• Multiple configurations per FPGA– Even more bitstream storage

3

Page 4: Configuration Bitstream Reduction for SRAM-based FPGAs by Enumerating LUT Input Permutations The University of British Columbia© 2011 Guy Lemieux Ameer.

Bitstream Compression

• Obvious solution: bitstream compression– Many methods in research

• Architecture-aware: switch blocks (eg, avoid 5:1 muxes), logic blocks (eg, ULMs, coarse-grained) , readback, frame reordering, wildcard techniques

• Architecture-agnostic: general compression (eg: huffman, LZ, arithmetic), “don’t care”

– Not yet applied in practice

• Can we do better?4

Page 5: Configuration Bitstream Reduction for SRAM-based FPGAs by Enumerating LUT Input Permutations The University of British Columbia© 2011 Guy Lemieux Ameer.

Main Idea

• Bitstream Reduction– Throw away redundant bits

• Do not store them in configuration EEPROM• Do not transmit them

– Regenerate missing bits (in the FPGA)• Done on-the-fly, during bitstream load• Does not reduce # SRAM bits inside FPGA

• This is not bitstream compression– Can still compress after reduction

5

Page 6: Configuration Bitstream Reduction for SRAM-based FPGAs by Enumerating LUT Input Permutations The University of British Columbia© 2011 Guy Lemieux Ameer.

Main Idea

• Where to find redundant bits?– LUT configuration bits (this paper)– Interconnect (future work)

• Eg, 4-input LUT:– 16 config bits (2^k)– 65,536 configurations (2^(2^k))– 222 unique (npn-equivalent) 4-input functions– Theoretically, 8 config bits is sufficient

6

Page 7: Configuration Bitstream Reduction for SRAM-based FPGAs by Enumerating LUT Input Permutations The University of British Columbia© 2011 Guy Lemieux Ameer.

Equivalence Classes

• What is npn-equivalent ? First n = input negation (inversion), p = input permutations,Second n = output negation

• Eg, 4-input LUT–4! = 24 input permutations for the same function–2^k = 16 input inversions–2^1 = 2 output inversions

7

Page 8: Configuration Bitstream Reduction for SRAM-based FPGAs by Enumerating LUT Input Permutations The University of British Columbia© 2011 Guy Lemieux Ameer.

Example: LUT input ordering

8

Page 9: Configuration Bitstream Reduction for SRAM-based FPGAs by Enumerating LUT Input Permutations The University of British Columbia© 2011 Guy Lemieux Ameer.

Example: LUT input ordering

9

Page 10: Configuration Bitstream Reduction for SRAM-based FPGAs by Enumerating LUT Input Permutations The University of British Columbia© 2011 Guy Lemieux Ameer.

Example: LUT input ordering

10

Page 11: Configuration Bitstream Reduction for SRAM-based FPGAs by Enumerating LUT Input Permutations The University of British Columbia© 2011 Guy Lemieux Ameer.

Example: LUT input ordering

11

Page 12: Configuration Bitstream Reduction for SRAM-based FPGAs by Enumerating LUT Input Permutations The University of British Columbia© 2011 Guy Lemieux Ameer.

Example: LUT input ordering

12

Page 13: Configuration Bitstream Reduction for SRAM-based FPGAs by Enumerating LUT Input Permutations The University of British Columbia© 2011 Guy Lemieux Ameer.

Example: LUT input ordering

13

Page 14: Configuration Bitstream Reduction for SRAM-based FPGAs by Enumerating LUT Input Permutations The University of British Columbia© 2011 Guy Lemieux Ameer.

Example: LUT input ordering

14

Page 15: Configuration Bitstream Reduction for SRAM-based FPGAs by Enumerating LUT Input Permutations The University of British Columbia© 2011 Guy Lemieux Ameer.

Example: LUT input ordering

15

Page 16: Configuration Bitstream Reduction for SRAM-based FPGAs by Enumerating LUT Input Permutations The University of British Columbia© 2011 Guy Lemieux Ameer.

Example 2-LUT

16

Page 17: Configuration Bitstream Reduction for SRAM-based FPGAs by Enumerating LUT Input Permutations The University of British Columbia© 2011 Guy Lemieux Ameer.

Example 2-LUT

17

Bit e0

-- removed from bitstream-- regenerated at load-time-- storage bit still exists in LUT

Page 18: Configuration Bitstream Reduction for SRAM-based FPGAs by Enumerating LUT Input Permutations The University of British Columbia© 2011 Guy Lemieux Ameer.

Example 2-LUT

18

Bit e0

-- removed from bitstream-- regenerated at load-time-- storage bit still exists in LUT

Regeneration circuit-- shared by all LUTs-- regenerates missing bits

Page 19: Configuration Bitstream Reduction for SRAM-based FPGAs by Enumerating LUT Input Permutations The University of British Columbia© 2011 Guy Lemieux Ameer.

Example 2-LUT

19

Bit e0

-- removed from bitstream-- regenerated at load-time-- storage bit still exists in LUT

Regeneration circuit-- shared by all LUTs-- regenerates missing bits

Problem: changing order of inputs reorders LUT contents

changes value of bit e0

Page 20: Configuration Bitstream Reduction for SRAM-based FPGAs by Enumerating LUT Input Permutations The University of British Columbia© 2011 Guy Lemieux Ameer.

Configuration Chaining

20

Problem: changing order of inputs reorders LUT contents

changes value of bit e0

Solution:

Page 21: Configuration Bitstream Reduction for SRAM-based FPGAs by Enumerating LUT Input Permutations The University of British Columbia© 2011 Guy Lemieux Ameer.

Configuration Chaining

21

1. Copy value ofbit e0 to be saved

Problem: changing order of inputs reorders LUT contents

changes value of bit e0

Solution:

Page 22: Configuration Bitstream Reduction for SRAM-based FPGAs by Enumerating LUT Input Permutations The University of British Columbia© 2011 Guy Lemieux Ameer.

Configuration Chaining

22

1. Copy value ofbit e0 to be saved

Problem: changing order of inputs reorders LUT contents

changes value of bit e0

Solution:

2. Encode value ofbit e0 in the next LUT

Page 23: Configuration Bitstream Reduction for SRAM-based FPGAs by Enumerating LUT Input Permutations The University of British Columbia© 2011 Guy Lemieux Ameer.

General Approach

Produce a LUT chain

• Input ordering of next LUT generates removed bits for current LUT

Approach

1. Enumerate # of input permutations of LUTk=2 2! = 2 k=3 3! = 6 k=4 4! = 24

2. Define number of bits (pk) to be replaced

k=2 log2(2)=1 k=3 log2(6)=2 k=4 log2(24)=4

• For each LUT– Copy P = pk bits from the current LUT

– Re-order next LUT inputs according to value of P

– Adjust next LUT configuration bits to match new input ordering

– Remove pk configuration bits from current LUT

23

Page 24: Configuration Bitstream Reduction for SRAM-based FPGAs by Enumerating LUT Input Permutations The University of British Columbia© 2011 Guy Lemieux Ameer.

LUT Savingsk # LUT bits # bits Saved

= Log2(k!)% saved

2 4 1 25%

3 8 2 25%

4 16 4 25%

5 32 6 19%

6 64 9 14%

7 128 12 9.4%

8 256 15 5.9%

24

How complex is bit regeneration?

Page 25: Configuration Bitstream Reduction for SRAM-based FPGAs by Enumerating LUT Input Permutations The University of British Columbia© 2011 Guy Lemieux Ameer.

General Structure

25

Page 26: Configuration Bitstream Reduction for SRAM-based FPGAs by Enumerating LUT Input Permutations The University of British Columbia© 2011 Guy Lemieux Ameer.

Regeneration Circuits (k=3,4,5)

26

k=3

e9e8e5 e7e6 e14e13e12e11e10e3e2e1e0 e4

Σ6 Σ5 Σ4 Σ3 Σ2Σ7

π0 π1 π2 π3 π4 π5 π6 π7

e6e5e4e3 e11e10

Σ6 Σ5 Σ4 Σ3 Σ2

π0 π1 π2 π3 π4 π5 π6

e3e2e1e0

Σ5 Σ4 Σ3 Σ2

π0 π1 π2 π3 π4 π5

e2e0 e1e1e0

Σ4 Σ3 Σ2

π0 π1 π2 π3 π4

Σ3 Σ2

π0 π1 π2 π3

Σ2

π0 π1 π2

k=4

k=5

k=6

k=7

k=8

e7 e8 e9e6e5e4 e7 e8e3e2 e5e4

e1e0 e3e2e1e0

• 3 x 6b comparators• 2b (half) adder• Saves 2 bits (25%)

Page 27: Configuration Bitstream Reduction for SRAM-based FPGAs by Enumerating LUT Input Permutations The University of British Columbia© 2011 Guy Lemieux Ameer.

Regeneration Circuits (k=3,4,5)

27

k=3

e9e8e5 e7e6 e14e13e12e11e10e3e2e1e0 e4

Σ6 Σ5 Σ4 Σ3 Σ2Σ7

π0 π1 π2 π3 π4 π5 π6 π7

e6e5e4e3 e11e10

Σ6 Σ5 Σ4 Σ3 Σ2

π0 π1 π2 π3 π4 π5 π6

e3e2e1e0

Σ5 Σ4 Σ3 Σ2

π0 π1 π2 π3 π4 π5

e2e0 e1e1e0

Σ4 Σ3 Σ2

π0 π1 π2 π3 π4

Σ3 Σ2

π0 π1 π2 π3

Σ2

π0 π1 π2

k=4

k=5

k=6

k=7

k=8

e7 e8 e9e6e5e4 e7 e8e3e2 e5e4

e1e0 e3e2e1e0

• 3 x 6b comparators• 2b (half) adder• Saves 2 bits (25%)

• 6 x 6b comparators• 3b (full) adder• 2b (half) adder• Saves 4 bits (25%)

Page 28: Configuration Bitstream Reduction for SRAM-based FPGAs by Enumerating LUT Input Permutations The University of British Columbia© 2011 Guy Lemieux Ameer.

Regeneration Circuits (k=3,4,5)

28

k=3

e9e8e5 e7e6 e14e13e12e11e10e3e2e1e0 e4

Σ6 Σ5 Σ4 Σ3 Σ2Σ7

π0 π1 π2 π3 π4 π5 π6 π7

e6e5e4e3 e11e10

Σ6 Σ5 Σ4 Σ3 Σ2

π0 π1 π2 π3 π4 π5 π6

e3e2e1e0

Σ5 Σ4 Σ3 Σ2

π0 π1 π2 π3 π4 π5

e2e0 e1e1e0

Σ4 Σ3 Σ2

π0 π1 π2 π3 π4

Σ3 Σ2

π0 π1 π2 π3

Σ2

π0 π1 π2

k=4

k=5

k=6

k=7

k=8

e7 e8 e9e6e5e4 e7 e8e3e2 e5e4

e1e0 e3e2e1e0

• 3 x 6b comparators• 2b (half) adder• Saves 2 bits (25%)

• 6 x 6b comparators• 3b (full) adder• 2b (half) adder• Saves 4 bits (25%)

• 10 x 6b comparators• 2 x 3b (full) adder• 3 x 2b (half) adder• Saves 6 bits (25%)

Page 29: Configuration Bitstream Reduction for SRAM-based FPGAs by Enumerating LUT Input Permutations The University of British Columbia© 2011 Guy Lemieux Ameer.

Regeneration Circuits (k=3,4,5)

29

k=3

e9e8e5 e7e6 e14e13e12e11e10e3e2e1e0 e4

Σ6 Σ5 Σ4 Σ3 Σ2Σ7

π0 π1 π2 π3 π4 π5 π6 π7

e6e5e4e3 e11e10

Σ6 Σ5 Σ4 Σ3 Σ2

π0 π1 π2 π3 π4 π5 π6

e3e2e1e0

Σ5 Σ4 Σ3 Σ2

π0 π1 π2 π3 π4 π5

e2e0 e1e1e0

Σ4 Σ3 Σ2

π0 π1 π2 π3 π4

Σ3 Σ2

π0 π1 π2 π3

Σ2

π0 π1 π2

k=4

k=5

k=6

k=7

k=8

e7 e8 e9e6e5e4 e7 e8e3e2 e5e4

e1e0 e3e2e1e0

• 3 x 6b comparators• 2b (half) adder• Saves 2 bits (25%)

• 6 x 6b comparators• 3b (full) adder• 2b (half) adder• Saves 4 bits (25%)

• 10 x 6b comparators• 2 x 3b (full) adder• 3 x 2b (half) adder• Saves 6 bits (25%)

Note: area cost is only once per device

(shared across all LUTs in entire FPGA)

Page 30: Configuration Bitstream Reduction for SRAM-based FPGAs by Enumerating LUT Input Permutations The University of British Columbia© 2011 Guy Lemieux Ameer.

Regeneration Circuits (k=6)

30

k=3

e9e8e5 e7e6 e14e13e12e11e10e3e2e1e0 e4

Σ6 Σ5 Σ4 Σ3 Σ2Σ7

π0 π1 π2 π3 π4 π5 π6 π7

e6e5e4e3 e11e10

Σ6 Σ5 Σ4 Σ3 Σ2

π0 π1 π2 π3 π4 π5 π6

e3e2e1e0

Σ5 Σ4 Σ3 Σ2

π0 π1 π2 π3 π4 π5

e2e0 e1e1e0

Σ4 Σ3 Σ2

π0 π1 π2 π3 π4

Σ3 Σ2

π0 π1 π2 π3

Σ2

π0 π1 π2

k=4

k=5

k=6

k=7

k=8

e7 e8 e9e6e5e4 e7 e8e3e2 e5e4

e1e0 e3e2e1e0

Page 31: Configuration Bitstream Reduction for SRAM-based FPGAs by Enumerating LUT Input Permutations The University of British Columbia© 2011 Guy Lemieux Ameer.

Regeneration Circuits (k=7)

31

k=3

e9e8e5 e7e6 e14e13e12e11e10e3e2e1e0 e4

Σ6 Σ5 Σ4 Σ3 Σ2Σ7

π0 π1 π2 π3 π4 π5 π6 π7

e6e5e4e3 e11e10

Σ6 Σ5 Σ4 Σ3 Σ2

π0 π1 π2 π3 π4 π5 π6

e3e2e1e0

Σ5 Σ4 Σ3 Σ2

π0 π1 π2 π3 π4 π5

e2e0 e1e1e0

Σ4 Σ3 Σ2

π0 π1 π2 π3 π4

Σ3 Σ2

π0 π1 π2 π3

Σ2

π0 π1 π2

k=4

k=5

k=6

k=7

k=8

e7 e8 e9e6e5e4 e7 e8e3e2 e5e4

e1e0 e3e2e1e0

Page 32: Configuration Bitstream Reduction for SRAM-based FPGAs by Enumerating LUT Input Permutations The University of British Columbia© 2011 Guy Lemieux Ameer.

Regeneration Circuits (k=8)

32

k=3

e9e8e5 e7e6 e14e13e12e11e10e3e2e1e0 e4

Σ6 Σ5 Σ4 Σ3 Σ2Σ7

π0 π1 π2 π3 π4 π5 π6 π7

e6e5e4e3 e11e10

Σ6 Σ5 Σ4 Σ3 Σ2

π0 π1 π2 π3 π4 π5 π6

e3e2e1e0

Σ5 Σ4 Σ3 Σ2

π0 π1 π2 π3 π4 π5

e2e0 e1e1e0

Σ4 Σ3 Σ2

π0 π1 π2 π3 π4

Σ3 Σ2

π0 π1 π2 π3

Σ2

π0 π1 π2

k=4

k=5

k=6

k=7

k=8

e7 e8 e9e6e5e4 e7 e8e3e2 e5e4

e1e0 e3e2e1e0

Page 33: Configuration Bitstream Reduction for SRAM-based FPGAs by Enumerating LUT Input Permutations The University of British Columbia© 2011 Guy Lemieux Ameer.

Regeneration Circuits

All designs are freelyavailable for download.

33

Page 34: Configuration Bitstream Reduction for SRAM-based FPGAs by Enumerating LUT Input Permutations The University of British Columbia© 2011 Guy Lemieux Ameer.

Cost of Regeneration – Summaryk # LUT

bits# bits Saved

= Log2(k!)

# Compare (6 bit

compare)

# Full Adders

# Half Adders

# 2-input Gates

2 4 1 1 0 1 0

3 8 2 3 0 2 0

4 16 4 6 1 1 0

5 32 6 10 2 3 0

6 64 9 15 4 4 4

7 128 12 21 7 5 11

8 256 15 28 11 5 11

34

Note: area cost is only once per device

(shared across all LUTs in entire FPGA)

Page 35: Configuration Bitstream Reduction for SRAM-based FPGAs by Enumerating LUT Input Permutations The University of British Columbia© 2011 Guy Lemieux Ameer.

Implementation

• 65nm standard cell implementation– Use minimum-size drive strength (not timing critical)

• Regeneration circuits are very small– Shared across entire device

35

k=3 4 5 6 7 8

Area (μm2) 64 129 220 333 502 702

Transistors 198 404 694 1,058 1,576 2,208

Page 36: Configuration Bitstream Reduction for SRAM-based FPGAs by Enumerating LUT Input Permutations The University of British Columbia© 2011 Guy Lemieux Ameer.

Future Workk # LUT bits # bits Saved

= Log2(k!)% saved

2 4 1 25%

3 8 2 25%

4 16 4 25%

5 32 6 19%

6 64 9 14%

7 128 12 9.4%

8 256 14 5.5%

36

Can we reach 25% savings with 5+ inputs ?

Page 37: Configuration Bitstream Reduction for SRAM-based FPGAs by Enumerating LUT Input Permutations The University of British Columbia© 2011 Guy Lemieux Ameer.

Future Workk # LUT bits # bits Saved

= Log2(k!)% saved

2 4 1 25%

3 8 2 25%

4 16 4 25%

5 32 6 19%

6 64 9 14%

7 128 12 9.4%

8 256 14 5.5%

37

Can we reach 25% savings with 5+ inputs ?

Eg, combine two 4-LUTs to make one 5-LUT

Page 38: Configuration Bitstream Reduction for SRAM-based FPGAs by Enumerating LUT Input Permutations The University of British Columbia© 2011 Guy Lemieux Ameer.

Thank you!

38