Algorithm Efficiency in Hardware with an Emphasis on Skein By Phil Doughty.

Algorithm Efficiency in Hardware with an Emphasis on Skein

By Phil Doughty

Outline

• Purpose of this Presentation
• Full Custom (ASIC) Design
  – Digital Hardware Implementation Basics
  – Gates
  – Arithmetic
• Field Programmable Gate Arrays (FPGAs)
  – Layout
  – How FPGAs are used
• Skein Hashing Algorithm

Purpose

• Touch upon basic hardware elements
• Inform future cryptographers and designers of cryptographic algorithms of the benefits and limitations of hardware
• Present Skein as an algorithm with pretty good hardware compatibility

Full Custom (ASIC) Design

Image contributed from Dr. Shaaban, CE Dept.

Digital Logic Gates

• Basic operation block
• 1 or more input voltages, and exactly 1 output voltage
  – Voltage is either High or Low (1 or 0)
• TTL (Bipolar Junction Transistors)
• CMOS (Complementary Metal Oxide Semiconductor Field Effect Transistors)

Primary Gates

• INVERT, AND, OR
  – INVERT isn’t always necessary depending on underlying technology
• NAND and NOR
  – NAND is an AND gate with INVERTed Output
  – NOR is an OR gate with INVERTed Output
  – Schematic is similar to AND and OR, but with a bubble on the output (representing inverse)
  – Either can be solely used to build any logic
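As a quick check of the last point, the sketch below builds INVERT, AND, and OR from NAND alone. It is a Python model of the logic, not hardware, and the function names are illustrative rather than taken from the slides.

```python
def nand(a: int, b: int) -> int:
    """2-input NAND on single bits (0 or 1)."""
    return 1 - (a & b)


def invert(a: int) -> int:
    # Tying both NAND inputs together gives an inverter.
    return nand(a, a)


def and_gate(a: int, b: int) -> int:
    # AND is a NAND followed by an inverter.
    return invert(nand(a, b))


def or_gate(a: int, b: int) -> int:
    # De Morgan: A + B = (A'B')'
    return nand(invert(a), invert(b))
```

The same construction works for NOR by exchanging the roles of AND and OR.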

Inverter

[Schematic]

Truth Table

Input A   Output Y
0         1
1         0

Algebraic Notation

Y = A’

AND Gate

[Schematic]

Truth Table

Input A   Input B   Output Y
0         0         0
0         1         0
1         0         0
1         1         1

Algebraic Notation

Y = AB

OR Gate

[Schematic]

Truth Table

Input A   Input B   Output Y
0         0         0
0         1         1
1         0         1
1         1         1

Algebraic Notation

Y = A + B

XOR Gate

[Schematic]

Truth Table

Input A   Input B   Output Y
0         0         0
0         1         1
1         0         1
1         1         0

Algebraic Notation

Y = A ⊕ B

XOR Gate (Continued)

• Can be composed of INVERT, AND, & OR
  – A ⊕ B = A’B + AB’
• But it can be easily implemented in hardware using faster methods
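The identity A ⊕ B = A’B + AB’ can be checked exhaustively over the four input combinations. The short Python sketch below does that; the function name is illustrative only.

```python
def xor_from_primitives(a: int, b: int) -> int:
    """XOR built only from INVERT, AND, and OR on single bits."""
    not_a = 1 - a  # A'
    not_b = 1 - b  # B'
    return (not_a & b) | (a & not_b)  # A'B + AB'


# Compare against Python's built-in XOR for every input pair.
truth_table = [(a, b, xor_from_primitives(a, b)) for a in (0, 1) for b in (0, 1)]
```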

Gate Delay

• Gates are not instantaneous
• There is a delay between the time an input changes and the time an output changes

Arithmetic Operations

• Addition/Subtraction
• Multiplication
• Division/Modulus

Addition and Subtraction

• Ripple-Carry Adder
  – Easiest to analyze
  – Faster adders are used in industry
    • Naffziger (Intel Core 2)
    • Carry Look-ahead Adders, etc.
• Uses two components, Half Adder and Full Adder
• Full Adder has a third input for Carry-In compared to the Half Adder
• Subtraction is just addition by a negative number in 2’s complement notation
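To illustrate the last bullet, the sketch below subtracts by adding the 2’s complement of the second operand. The 8-bit default width and the function name are assumptions made for the example.

```python
def subtract_twos_complement(a: int, b: int, n_bits: int = 8) -> int:
    """Compute (a - b) mod 2**n_bits using only inversion and addition."""
    mask = (1 << n_bits) - 1
    neg_b = (~b + 1) & mask   # invert every bit of b, then add 1
    return (a + neg_b) & mask  # the carry out of the top bit is discarded


difference = subtract_twos_complement(5, 3)   # 2
wrapped = subtract_twos_complement(3, 5)      # 254, the 8-bit pattern for -2
```

In hardware the "+ 1" is usually free: the inverted operand goes into the adder and the Carry-In of the lowest stage is set to 1.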

Ripple Carry Adder Algorithm

• Similar to manual addition
• Least Significant Bits (A0 and B0) are added together to produce a Sum Bit and a Carry Bit (S0 and C1).
• The next pair of bits (A1 and B1) are added together along with the previous Carry Bit (C1) to produce a Sum Bit and a Carry Bit (S1 and C2).
• The process repeats
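The steps above can be sketched as a bit-level Python model. This is a minimal sketch, assuming bit lists are ordered least significant bit first and that every stage is a Full Adder with an explicit C0 input; the names are illustrative.

```python
def half_adder(a: int, b: int) -> tuple[int, int]:
    """Return (sum, carry) for two input bits."""
    return a ^ b, a & b


def full_adder(a: int, b: int, carry_in: int) -> tuple[int, int]:
    """Return (sum, carry_out); built from two half adders and an OR."""
    partial_sum, carry_1 = half_adder(a, b)
    total, carry_2 = half_adder(partial_sum, carry_in)
    return total, carry_1 | carry_2


def ripple_carry_add(a_bits: list[int], b_bits: list[int], c0: int = 0):
    """Add two equal-length bit lists (LSB first); return (sum_bits, final_carry)."""
    carry = c0
    sum_bits = []
    for a, b in zip(a_bits, b_bits):
        s, carry = full_adder(a, b, carry)  # each stage waits on the previous carry
        sum_bits.append(s)
    return sum_bits, carry


# 3 + 5 = 8 on four bits, LSB first.
result = ripple_carry_add([1, 1, 0, 0], [1, 0, 1, 0])
```

The loop is sequential in software for the same reason the hardware is slow: stage i cannot settle until stage i-1 has produced its carry.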

Ripple-Carry Adder Components

Half Adder
• 1 Gate Delay for both the Sum bit and the Carry bit

Full Adder
• 2 Gate Delays for Sum bit
• 3 Gate Delays for Carry bit
• 1 Gate Delay to change the Sum Bit if the incoming Carry bit changes
• 2 Gate Delays to change the Carry bit if the incoming Carry bit changes

Ripple-Carry Adder

Ripple-Carry Adder Worst Case

• Worst Case Scenario is when C0 is 0, A is all 1’s and B is all 0’s, and then C0 changes to 1

• The Carry has to propagate through all of the Full Adder Blocks

• For an n-bit Ripple-Carry Adder
  – 2(n-1) + 1 gate delays to change the final Sum bit
  – 2n gate delays to change the final Carry bit
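Plugging numbers into these two formulas shows how the delay grows with word size. The helper below is only a calculator for the expressions on this slide; its name is made up for the example.

```python
def ripple_carry_worst_case(n_bits: int) -> dict[str, int]:
    """Worst-case gate delays after C0 changes, per the slide's formulas."""
    return {
        "final_sum": 2 * (n_bits - 1) + 1,
        "final_carry": 2 * n_bits,
    }


delays_4 = ripple_carry_worst_case(4)    # {'final_sum': 7, 'final_carry': 8}
delays_64 = ripple_carry_worst_case(64)  # {'final_sum': 127, 'final_carry': 128}
```

For the 64-bit additions that Skein relies on, a plain Ripple-Carry Adder therefore needs up to 128 gate delays for the carry to settle.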

Multiplication

• Generic Multiplier
  – Any two numbers can be multiplied together
    • A * B = Y
  – n-bit inputs produce a 2n-bit output
• Constant Coefficient Multiplier
  – Multiplication by a constant
    • A * 5 = Y
  – Easier to implement
  – Used in Finite Impulse Response (FIR) Filters
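One common way to build the A * 5 example is to note that 5 = 4 + 1, so the product is a wired shift plus a single addition. The sketch below assumes that decomposition; the slides do not say how the constant multiplier is built.

```python
def times_five(a: int) -> int:
    """Multiply by the constant 5 as (a << 2) + a: one shift and one adder."""
    return (a << 2) + a


products = [times_five(a) for a in range(4)]  # [0, 5, 10, 15]
```

The shift costs no gates in hardware since it is only a rewiring, which is why a constant multiplier is so much cheaper than a generic one.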

Generic Multipliers

• O(n^2) gate delays for an n-bit Generic Multiplier
• Very slow compared to addition
• Uses many resources compared to addition

Optimized 3-bit Generic Multiplier

At most 11 gate delays

Optimized 8-bit Generic Multiplier

At most 53 Gate Delays

Division/Modulus

• More complex than Multiplication
• Can be implemented as a series of subtractions
• Sequential logic may be better suited
  – Uses Registers and a Clock signal
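A shift-and-subtract (restoring) divider is one well-known way to realize division as a series of subtractions in sequential logic. The sketch below models it in Python with one loop iteration standing in for one clock cycle; the choice of this particular scheme and the 8-bit default width are assumptions, not something the slides specify.

```python
def divide_shift_subtract(dividend: int, divisor: int, n_bits: int = 8):
    """Return (quotient, remainder) for unsigned n_bits-wide operands."""
    if divisor == 0:
        raise ZeroDivisionError("divisor must be non-zero")
    remainder = 0  # models the remainder register
    quotient = 0   # models the quotient register
    for i in range(n_bits - 1, -1, -1):  # one iteration per clock cycle
        # Shift the next dividend bit (MSB first) into the remainder.
        remainder = (remainder << 1) | ((dividend >> i) & 1)
        quotient <<= 1
        if remainder >= divisor:  # trial subtraction succeeds
            remainder -= divisor
            quotient |= 1
    return quotient, remainder


q, r = divide_shift_subtract(100, 7)  # (14, 2)
```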

Shortcuts

• Multiplication
  – If multiplying by a power of 2, shift left by the power
• Division
  – If dividing by a power of 2, shift right by the power
• Modulus
  – If taking a modulus of a power of 2, AND the bits with the (modulus – 1)
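The three shortcuts can be written out directly. This sketch assumes unsigned (non-negative) values, and the function names are illustrative.

```python
def mul_pow2(x: int, k: int) -> int:
    """x * 2**k as a left shift."""
    return x << k


def div_pow2(x: int, k: int) -> int:
    """x // 2**k as a right shift (unsigned x)."""
    return x >> k


def mod_pow2(x: int, k: int) -> int:
    """x % 2**k by ANDing with (2**k - 1)."""
    return x & ((1 << k) - 1)


examples = (mul_pow2(13, 3), div_pow2(104, 3), mod_pow2(77, 4))  # (104, 13, 13)
```

Addition mod 2^64 falls out of the same idea: a 64-bit adder simply drops the carry out of the top bit.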

Full Custom

Benefits
• Best Possible Performance
  – Can be specially designed for low power consumption (embedded systems) or for high speed (PC expansion card)
• No restrictions on logic
• No restrictions on routing

Drawbacks
• Expensive to design
• Expensive to test
  – Fabrication takes months

FPGA

Image contributed from Dr. Shaaban, CE Dept.

What is an FPGA?

• Field Programmable Gate Array (FPGA)
  – It is an array of gates that can be programmed
• A good compromise between General Purpose Processors and Full Custom

Layout of an FPGA

• Input and Output (I/O) Blocks interface with the outside world
  – LED display
  – Switches, buttons, etc.
• Logic Blocks usually take 3-4 input signals and generate the desired output signal
  – Data can be registered
• Interconnects can be programmed to connect logic blocks and I/O blocks together (Logic -> Logic, I/O -> Logic, Logic -> I/O, I/O -> I/O)
  – Usually a special Clock network to avoid Clock skew problems
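A logic block of this kind is commonly a small look-up table (LUT): programming the FPGA fills in the table, and the inputs simply select one entry. The sketch below models a 4-input LUT in Python; `make_lut` is a hypothetical helper written for this example and does not correspond to any vendor tool.

```python
def make_lut(func, n_inputs: int = 4):
    """Precompute func over all input combinations, like programming a LUT."""
    table = [
        func(*(((index >> k) & 1) for k in range(n_inputs)))
        for index in range(1 << n_inputs)
    ]

    def lut(*inputs: int) -> int:
        # The input bits form the address into the stored truth table.
        index = sum(bit << k for k, bit in enumerate(inputs))
        return table[index]

    return lut


# "Program" the LUT as a 4-input XOR (parity) function.
parity_lut = make_lut(lambda a, b, c, d: a ^ b ^ c ^ d)
output = parity_lut(1, 0, 1, 1)  # 1
```

Any function of four bits costs the same single LUT, which is why wide operations such as a 64-bit adder or a 512-bit XOR are counted in how many LUTs they consume.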

Image contributed from Dr. Łukowiak, CE Dept.

How are FPGAs actually used?

• They use a “programming language”
  – VHDL -> VHSIC Hardware Description Language
    • VHSIC -> Very High Speed Integrated Circuit
  – Verilog -> C-like Language
• Programs are NOT Top-Down like C, BASIC, etc.
  – The programs describe the hardware
  – Very parallel with some sequential parts running in parallel

Step 1: Simulation

• The programs run through a simulator which applies the correct input and generates the output

• Once the simulator produces the desired output, THE TASK IS NOT OVER YET!

Step 2: Synthesis

• The Compiler will try to Synthesize the code into the appropriate logic blocks
  – (Previous Multiplier Schematic was Synthesized from VHDL)
• Not all VHDL statements are Synthesizable
  – while loop, wait statements, etc.
• Many times the program has to be adjusted to use only synthesizable commands… back to Simulation

Step 3: Place & Route

• The compiler now figures out where to place each logic block, and how the logic blocks are interconnected

• Sometimes more hardware is needed than is actually on the specific FPGA device
  – Buy a bigger FPGA
  – Redesign the program to reuse more hardware, or to route data differently… back to Simulation

Step 4: Download to FPGA

• Download the program onto the FPGA
• Run the program and make sure the correct results are obtained
• If logic is too complex, then the clock frequency may have to be scaled down
  – Gate delay exceeds clock period
• If everything works, then done
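The clock-scaling point can be made concrete with a rough estimate: the clock period must be at least as long as the slowest chain of gates between two registers. The sketch below assumes a uniform per-gate delay and ignores routing and register setup time; the 0.1 ns figure is a made-up illustration value, not a measurement.

```python
def max_clock_mhz(gate_delays_on_path: int, gate_delay_ns: float) -> float:
    """Upper bound on clock frequency when one path must settle per cycle."""
    critical_path_ns = gate_delays_on_path * gate_delay_ns
    return 1000.0 / critical_path_ns  # a 1 ns period corresponds to 1000 MHz


# A 64-bit ripple carry needs 2n = 128 gate delays in the worst case.
# ASSUMPTION: 0.1 ns per gate, chosen only to have a number to work with.
f_max = max_clock_mhz(128, 0.1)  # about 78 MHz
```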

FPGA

Benefits
• Better performance than General Purpose Processors
  – Even though clock frequency may be 50-200 MHz
• Easier to design than Full Custom
• Easier to test than Full Custom
• Good for prototyping Full Custom

Drawbacks
• Not a Production-Grade piece of hardware
  – No application uses 100% of everything available on the FPGA
  – Some FPGAs reset on power loss, and need to be reprogrammed

Skein Hashing Algorithm

• Different versions depending on the internal state and output size
  – Skein-512-1024 has a 512-bit internal state, and 1024 output bits
  – Skein-512-512 is the default proposal
    • Skein-512-512 will be examined in this presentation
  – Only 256-, 512-, and 1024-bit internal states supported
    • Any output size may be used
  – Skein-256 and Skein-512 have 72 rounds; Skein-1024 has 80 rounds
• Based on the Threefish Block Cipher (introduced alongside Skein)
  – Threefish Block Cipher has 3 components
    • MIX
    • Permute
    • Add Subkey
  – Skein wraps a 512-bit XOR around Threefish to create a UBI block, which is chained together

Threefish Block Cipher

• Encryption starts with eight 64-bit Subkey additions (one per 64-bit word of the 512-bit state)

• Then there are 4 rounds of MIX and Permute followed by the next Subkey addition

• There are a total of 72 rounds

• The Cipher ends with the 18th Subkey addition
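The round structure described above can be listed step by step. The sketch below only generates the order of operations, assuming subkeys are numbered from 0; it does not perform any encryption, and the step labels are illustrative.

```python
def threefish_schedule(rounds: int = 72, rounds_per_subkey: int = 4):
    """List the order of subkey additions and MIX/Permute rounds."""
    steps = []
    for d in range(rounds):
        if d % rounds_per_subkey == 0:
            # A subkey is added before every group of four rounds.
            steps.append(("add_subkey", d // rounds_per_subkey))
        steps.append(("mix_permute", d))
    # One last subkey addition closes the cipher.
    steps.append(("add_subkey", rounds // rounds_per_subkey))
    return steps


schedule = threefish_schedule()
subkey_indices = [index for name, index in schedule if name == "add_subkey"]
```

With 72 rounds the subkey indices run from 0 to 18, so the closing addition uses subkey number 18 and 19 subkeys are needed in total.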

The MIX Function

• One 64-bit addition
• One 64-bit rotate
• One 64-bit XOR
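Those three operations combine as in the sketch below, which follows the MIX definition in the Skein paper: the sum becomes the first output word, and the second input word is rotated left and XORed with that sum. The rotation amount depends on the round and word position in the real cipher; here it is a plain parameter and no specific constant is assumed.

```python
MASK64 = (1 << 64) - 1


def rotl64(x: int, r: int) -> int:
    """Rotate a 64-bit word left by r bits (pure rewiring in hardware)."""
    r %= 64
    return ((x << r) | (x >> (64 - r))) & MASK64


def mix(x0: int, x1: int, rotation: int) -> tuple[int, int]:
    """One MIX: a 64-bit add, a 64-bit rotate, and a 64-bit XOR."""
    y0 = (x0 + x1) & MASK64          # addition mod 2^64
    y1 = rotl64(x1, rotation) ^ y0   # rotate, then XOR with the sum
    return y0, y1


outputs = mix(1, 1, 1)  # (2, 0)
```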

MIX Function Hardware Analysis

• 64-bit Addition
  – Full Custom (ASIC) isn’t too bad
  – FPGAs can handle a few of these
• Bit Rotation
  – Simply a wire-mapping
• 64-bit XOR
  – Even easier than Addition
    • 1 Gate Delay

The Permute Function

• 64-bit words are swapped between MIX functions

Permute Function Hardware Analysis

• Entirely wire mappings
  – Not an issue

Subkey Addition

Subkey Hardware Analysis

• 8 XORs chained together
  – 8 Gate Delays
• Subkey Index mod 9 (and 3)
  – Full Custom (ASIC) can be hard-coded
  – Creative methods must be done in FPGA
• Two 64-bit Additions chained together
• Additions taken mod 2^64
  – Our only good news!
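The pieces listed above (the XOR chain, the mod 9 and mod 3 indexing, and the mod 2^64 additions) fit together roughly as in the sketch below. It follows the Threefish-512 key schedule as described in the Skein paper, but the parity constant is left as a parameter because its value is not given in these slides, and the word positions that receive the tweak and the subkey number should be checked against the paper before relying on them.

```python
MASK64 = (1 << 64) - 1


def extend_key_and_tweak(key_words, tweak_words, parity_const):
    """Append the extra key word (8 chained XORs) and the extra tweak word."""
    extra_key = parity_const  # ASSUMPTION: caller supplies the constant
    for word in key_words:    # 8 XORs chained together
        extra_key ^= word
    extra_tweak = tweak_words[0] ^ tweak_words[1]
    return list(key_words) + [extra_key], list(tweak_words) + [extra_tweak]


def subkey(ext_key, ext_tweak, s):
    """Build the eight 64-bit words of subkey number s."""
    n_words = len(ext_key) - 1  # 8 for Threefish-512
    # Key words are picked with an index taken mod 9.
    words = [ext_key[(s + i) % (n_words + 1)] for i in range(n_words)]
    # Tweak words are picked with an index taken mod 3.
    words[n_words - 3] = (words[n_words - 3] + ext_tweak[s % 3]) & MASK64
    words[n_words - 2] = (words[n_words - 2] + ext_tweak[(s + 1) % 3]) & MASK64
    # The last word also gets the subkey number itself.
    words[n_words - 1] = (words[n_words - 1] + s) & MASK64
    return words


ext_key, ext_tweak = extend_key_and_tweak(list(range(8)), [10, 20], parity_const=0)
subkey_1 = subkey(ext_key, ext_tweak, 1)
```

Adding such a subkey word to the state is the second addition in the chain: the state word is added to a value that is itself the sum of a key word and a tweak word (or the subkey number).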

Subkey Hardware Analysis Continued

• Eight 64-bit Additions happen “logically” in parallel
  – Each of those eight is really two 64-bit Additions chained together, as mentioned previously
  – To actually do this in parallel is a large hardware commitment
  – To save on hardware, each addition should happen serially using the same Logic Blocks (FPGA)
    • This may require external memory I/O between additions to swap out the addends
      – VERY SLOW

UBI Blocks

UBI Block Hardware Analysis

• One 512-bit XOR
  – OK for Full Custom (ASIC), but a major pain
  – Trouble for FPGA
    • Wire-routing nightmare
• Chaining is no big deal
  – ~640-bit register (512-bit “key”, 128-bit “tweak”)
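The chaining can be sketched as follows: each block is encrypted under the previous chaining value (used as the key) and a tweak, and the result is XORed with the message block. The block cipher is passed in as a function so that the sketch stays independent of Threefish itself; `toy_cipher` is a stand-in invented for the example and is not secure, and the tweak values are taken as given rather than derived as Skein does.

```python
MASK512 = (1 << 512) - 1


def ubi_chain(block_cipher, start_value, tweaks, message_blocks):
    """Chain UBI blocks: next = E(key=previous, tweak, block) XOR block."""
    chaining = start_value
    for tweak, block in zip(tweaks, message_blocks):
        # The 512-bit XOR wrapped around the block cipher.
        chaining = block_cipher(chaining, tweak, block) ^ block
    return chaining


def toy_cipher(key, tweak, block):
    # ASSUMPTION: placeholder standing in for Threefish-512, for illustration only.
    return (key + tweak + block) & MASK512


final_value = ubi_chain(toy_cipher, 0, [1, 2], [5, 6])  # 13 with the toy cipher
```

Only the chaining value and the tweak have to be held between blocks, which matches the roughly 640-bit register noted above.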

FPGA Stats on a Spartan 2 for Skein

Number of Slices:              3494  out of  2352   148% (*)
Number of Slice Flip Flops:    4604  out of  4704    97%
Number of 4 input LUTs:        6262  out of  4704   133% (*)
Number of IOs:                   62
Number of bonded IOBs:           44  out of   140    31%
   IOB Flip Flops:                4
Number of GCLKs:                  2  out of     4    50%

Changes Necessary to Fit

• Complete redesign of the underlying components
  – Specifically Subkey
• Minimize routing
  – More utilization of external memory module
• Buy a bigger FPGA
  – Spartan 3?

Any Questions, Comments or Concerns?

References

• Dr. Łukowiak, C.E. Department
• Dr. Shaaban, C.E. Department
• http://en.wikipedia.org/wiki/Logic_gate
  – Images
• http://en.wikipedia.org/wiki/Adder_(electronics)
  – Images
• http://www.skein-hash.info/downloads
  – Images, Paper