Algorithm Efficiency in Hardware with an Emphasis on Skein By Phil Doughty.


  • Slide 1

Algorithm Efficiency in Hardware with an Emphasis on Skein
By Phil Doughty

  • Slide 2

Outline
  - Purpose of this Presentation
  - Full Custom (ASIC) Design
  - Digital Hardware Implementation Basics
    - Gates
    - Arithmetic
  - Field Programmable Gate Arrays (FPGAs)
    - Layout
    - How FPGAs are used
  - Skein Hashing Algorithm

  • Slide 3

Purpose
  - Touch upon basic hardware elements
  - Inform future cryptographers and designers of cryptographic algorithms of the benefits and limitations of hardware
  - Present Skein as an algorithm with pretty good hardware compatibility

  • Slide 4

Full Custom (ASIC) Design
Image contributed from Dr. Shaaban, CE Dept.

  • Slide 5

Digital Logic Gates
  - Basic operation block
  - 1 or more input voltages, and exactly 1 output voltage
  - Voltage is either High or Low (1 or 0)
  - TTL (Bipolar Junction Transistors)
  - CMOS (Complementary Metal Oxide Semiconductor Field Effect Transistors)

  • Slide 6

Primary Gates
  - INVERT, AND, OR
    - INVERT isn't always necessary, depending on the underlying technology
  - NAND and NOR
    - NAND is an AND gate with INVERTed Output
    - NOR is an OR gate with INVERTed Output
    - Schematic is similar to AND and OR, but with a bubble on the output (representing inverse)
    - Either one alone can be used to build any logic

  • Slide 7

Inverter
[Schematic]
Truth Table
  Input A | Output Y
     0    |    1
     1    |    0
Algebraic Notation: Y = A' (NOT A)

  • Slide 8

AND Gate
[Schematic]
Truth Table
  Input A | Input B | Output Y
     0    |    0    |    0
     0    |    1    |    0
     1    |    0    |    0
     1    |    1    |    1
Algebraic Notation: Y = AB

  • Slide 9

OR Gate
[Schematic]
Truth Table
  Input A | Input B | Output Y
     0    |    0    |    0
     0    |    1    |    1
     1    |    0    |    1
     1    |    1    |    1
Algebraic Notation: Y = A + B

  • Slide 10

XOR Gate
[Schematic]
Truth Table
  Input A | Input B | Output Y
     0    |    0    |    0
     0    |    1    |    1
     1    |    0    |    1
     1    |    1    |    0
Algebraic Notation: Y = A ⊕ B

  • Slide 11

XOR Gate (Continued)
  - Can be composed of INVERT, AND, & OR
    - A ⊕ B = A'B + AB'
  - But it can be easily implemented in hardware using faster methods

  • Slide 12

Gate Delay
  - Gates are not instantaneous
  - There is a delay between the time an input changes and the time the output changes

  • Slide 13

Arithmetic Operations
  - Addition/Subtraction
  - Multiplication
  - Division/Modulus

  • Slide 14

Addition and Subtraction
  - Ripple-Carry Adder
    - Easiest to analyze
    - Faster adders are used in industry
      - Naffziger (Intel Core 2)
      - Carry Look-ahead Adders, etc.
    - Uses two components, Half Adder and Full Adder
      - Full Adder has a third input for Carry-In compared to the Half Adder
  - Subtraction is just addition of a negative number in 2's complement notation

  • Slide 15

Ripple Carry Adder Algorithm
  - Similar to manual addition
  - The Least Significant Bits (A0 and B0) are added together to produce a Sum Bit and a Carry Bit (S0 and C1).
  - The next pair of bits (A1 and B1) is added together along with the previous Carry Bit (C1) to produce a Sum Bit and a Carry Bit (S1 and C2).
  - The process repeats

  • Slide 16

Ripple-Carry Adder Components
  - Half Adder
    - 1 Gate Delay for both the Sum bit and the Carry bit
  - Full Adder
    - 2 Gate Delays for Sum bit
    - 3 Gate Delays for Carry bit
    - 1 Gate Delay to change the Sum bit if the incoming Carry bit changes
    - 2 Gate Delays to change the Carry bit if the incoming Carry bit changes

  • Slide 17

Ripple-Carry Adder

  • Slide 18

Ripple-Carry Adder Worst Case
  - Worst Case Scenario is when C0 is 0, A is all 1's and B is all 0's, and then C0 changes to 1
    - The Carry has to propagate through all of the Full Adder Blocks
  - For an n-bit Ripple-Carry Adder
    - 2(n-1) + 1 gate delays to change the final Sum bit
    - 2n gate delays to change the final Carry bit

  • Slide 19

Multiplication
  - Generic Multiplier
    - Any two numbers can be multiplied together
    - A * B = Y
    - n-bit inputs produce a 2n-bit output
  - Constant Coefficient Multiplier
    - Multiplication by a constant
    - A * 5 = Y
    - Easier to implement
    - Used in Finite Impulse Response (FIR) Filters

  • Slide 20

Generic Multipliers
  - O(n^2) gate delays for an n-bit Generic Multiplier
  - Very slow compared to addition
  - Uses many resources compared to addition

  • Slide 21

Optimized 3-bit Generic Multiplier
  - At most 11 gate delays

  • Slide 22

Optimized 8-bit Generic Multiplier
  - At most 53 gate delays

  • Slide 23

Division/Modulus
  - More complex than Multiplication
  - Can be implemented as a series of subtractions
  - Sequential logic may be better suited
    - Uses Registers and a Clock signal

  • Slide 24

Shortcuts
  - Multiplication
    - If multiplying by a power of 2, shift left by the power
  - Division
    - If dividing by a power of 2, shift right by the power
  - Modulus
    - If taking a modulus by a power of 2, AND the bits with (modulus - 1)

  • Slide 25

Full Custom
  - Benefits
    - Best Possible Performance
    - Can be specially designed for low power consumption (embedded systems) or for high speed (PC expansion card)
    - No restrictions on logic
    - No restrictions on routing
  - Drawbacks
    - Expensive to design
    - Expensive to test
    - Fabrication takes months

  • Slide 26

FPGA
Image contributed from Dr. Shaaban, CE Dept.

  • Slide 27

What is an FPGA?
  - Field Programmable Gate Array (FPGA)
  - It is an array of gates that can be programmed
  - A good compromise between General Purpose Processors and Full Custom

  • Slide 28

Layout of an FPGA
  - Input and Output (I/O) Blocks
    - Interface with the outside world
    - LED display
    - Switches, buttons, etc.
  - Logic Blocks usually take 3-4 input signals and generate the desired output signal
    - Data can be registered
  - Interconnects can be programmed to connect logic blocks and I/O blocks together (Logic -> Logic, I/O -> Logic, Logic -> I/O, I/O -> I/O)
  - Usually a special Clock network to avoid Clock skew problems
Image contributed from Dr. Łukowiak, CE Dept.

  • Slide 29

How are FPGAs actually used?
  - They use a programming language
    - VHDL -> VHSIC Hardware Description Language
      - VHSIC -> Very High Speed Integrated Circuit
    - Verilog -> C-like Language
  - Programs are NOT Top-Down like C, BASIC, etc.
    - The programs describe the hardware
    - Very parallel, with some sequential parts running in parallel

  • Slide 30

Step 1: Simulation
  - The programs run through a simulator, which applies the correct input and generates the output
  - Once the simulator produces the desired output, THE TASK IS NOT OVER YET!
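In the spirit of the simulation step, the ripple-carry adder from Slides 15-18 can be modelled in software and checked against expected outputs before any hardware is involved. The following is only a minimal sketch, assuming Python as a stand-in for an HDL simulator; the function names (half_adder, full_adder, ripple_carry_add, worst_case_delays, exhaustive_check) are illustrative and do not come from the presentation, and the full adder is assumed to be built from two half adders plus an OR gate.

```python
from itertools import product


def half_adder(a, b):
    """Return (sum, carry) for two input bits."""
    return a ^ b, a & b


def full_adder(a, b, carry_in):
    """Return (sum, carry_out); assumed here to be two half adders and an OR."""
    s1, c1 = half_adder(a, b)
    s2, c2 = half_adder(s1, carry_in)
    return s2, c1 | c2


def ripple_carry_add(a, b, n, carry_in=0):
    """Add two n-bit integers one bit at a time, least significant bit first."""
    carry = carry_in
    total = 0
    for i in range(n):
        s, carry = full_adder((a >> i) & 1, (b >> i) & 1, carry)
        total |= s << i
    return total, carry


def worst_case_delays(n):
    """Worst-case gate delays for an n-bit adder, using the Slide 18 formulas."""
    return {"final_sum": 2 * (n - 1) + 1, "final_carry": 2 * n}


def exhaustive_check(n):
    """Play the role of the simulator: apply every input and compare outputs."""
    for a, b, c in product(range(2 ** n), range(2 ** n), (0, 1)):
        total, carry = ripple_carry_add(a, b, n, c)
        assert total + (carry << n) == a + b + c
    return True
```

Calling ripple_carry_add(0b1111, 0b0000, 4, carry_in=1) reproduces the worst case from Slide 18: the carry ripples through every stage, leaving a sum of 0 and a final carry of 1. This model only checks function, not timing, which is one reason a passing simulation does not finish the task.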
  • Slide 31

Step 2: Synthesis
  - The Compiler will try to Synthesize the code into the appropriate logic blocks (Previous Multiplier Schematic was Synthesized from VHDL)
  - Not all VHDL statements are Synthesizable
    - while loop, wait statements, etc.
  - Many times the program has to be adjusted to use only synthesizable commands
    - back to Simulation

  • Slide 32

Step 3: Place & Route
  - The compiler now figures out where to place each logic block, and how the logic blocks are interconnected
  - Sometimes more hardware is needed than is actually on the specific FPGA device
    - Buy a bigger FPGA
    - Redesign the program to reuse more hardware, or to route data differently
      - back to Simulation

  • Slide 33

Step 4: Download to FPGA
  - Download the program onto the FPGA
  - Run the program and make sure the correct results are obtained
  - If logic is too complex, then the clock frequency may have to be scaled down
    - Gate delay exceeds clock period
  - If everything works, then done

  • Slide 34

FPGA
  - Benefits
    - Better performance than General Purpose Processors
      - Even though clock frequency may be 50-200 MHz
    - Easier to design than Full Custom
    - Easier to test than Full Custom
    - Good for prototyping Full Custom
  - Drawbacks
    - Not a Production-Grade piece of hardware
    - No application uses 100% of everything available on the FPGA
    - Some FPGAs reset on power loss, and need to be reprogrammed

  • Slide 35

Skein Hashing Algorithm
  - Different versions depending on the internal state and output size
    - Skein-512-1024 has a 512-bit internal state, and 1024 output bits
    - Skein-512-512 is the default proposal
      - Skein-512-512 will be examined in this presentation
    - Only 256-, 512-, and 1024-bit internal states are supported
    - Any output size may be used
    - Skein-256 and Skein-512 have 72 rounds; Skein-1024 has 80 rounds
  - Based on the Threefish Block Cipher (introduced alongside Skein)
    - Threefish Block Cipher has 3 components
      - MIX
      - Permute
      - Add Subkey
  - Skein wraps a 512-bit XOR around Threefish to create a UBI block; UBI blocks are chained together

  • Slide 36

Threefish Block Cipher
  - Encryption starts with eight 64-bit Subkey additions
  - Then there are 4 rounds of MIX and Permute followed by the next Subkey addition
  - There are a total of 72 rounds
  - The Cipher ends with the 18th Subkey addition

  • Slide 37

The MIX Function
  - One 64-bit addition
  - One 64-bit rotate
  - One 64-bit XOR

  • Slide 38

MIX Function Hardware Analysis
  - 64-bit Addition
    - Full Custom (ASIC) isn't too bad
    - FPGAs can handle a few of these
  - Bit Rotation
    - Simply a wire-mapping
  - 64-bit XOR
    - Even easier than Addition
    - 1 Gate Delay

  • Slide 39

The Permute Function
  - 64-bit words are swapped between MIX functions

  • Slide 40

Permute Function Hardware Analysis
  - Entirely wire mappings
  - Not an issue

  • Slide 41

Subkey Addition

  • Slide 42

Subkey Hardware Analysis
  - 8 XORs chained together
    - 8 Gate Delays
  - Subkey Index mod 9 (and 3)
    - Full Custom (ASIC) can be hard-coded
    - Creative methods must be used in FPGA
  - Two 64-bit Additions chained together
    - Additions taken mod 2^64
      - Our only good news!

  • Slide 43

Subkey Hardware Analysis Continued
  - Eight 64-bit Additions happen logically in parallel
    - Each of those eight is really two 64-bit Additions chained together, as mentioned previously
    - To actually do this in parallel is a large hardware commitment
  - To save on hardware, each addition should happen serially using the same Logic Blocks (FPGA)
    - This may require external memory I/O between additions to swap out the addends
      - VERY SLOW

  • Slide 44

UBI Blocks

  • Slide 45

UBI Block Hardware Analysis
  - One 512-bit XOR
    - OK for Full Custom (ASIC), but a major pain
    - Trouble for FPGA
      - Wire-routing nightmare
  - Chaining is no big deal
    - ~640-bit register (512-bit key, 128-bit tweak)

  • Slide 46

FPGA Stats on a Spartan 2 for Skein
  Resource                      Used   Available   Utilization
  Number of Slices              3494   2352        148% (*)
  Number of Slice Flip Flops    4604   4704        97%
  Number of 4 input LUTs        6262   4704        133% (*)
  Number of IOs                 62
  Number of bonded IOBs         44     140         31%
  IOB Flip Flops                4
  Number of GCLKs               2      4           50%

  • Slide 47

Changes Necessary to Fit
  - Complete redesign of the underlying components
    - Specifically Subkey
  - Minimize routing
  - More utilization of external memory module
  - Buy a bigger FPGA
    - Spartan 3?

  • Slide 48

Any Questions, Comments or Concerns?

  • Slide 49

References
  - Dr. Łukowiak, C.E. Department
  - Dr. Shaaban, C.E. Department
  - Images Images Images, Paper
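As a supplement to Slides 37 and 38, the three operations of the MIX function (one 64-bit addition, one 64-bit rotate, one 64-bit XOR) can be sketched in software. This is a minimal sketch under stated assumptions, not the presenter's hardware design: the names MASK64, rotl64 and mix are illustrative, and the rotation amount r is left as a plain parameter, whereas Threefish takes it from a fixed table of constants that is not reproduced here.

```python
MASK64 = (1 << 64) - 1  # keeps results within one 64-bit word (arithmetic mod 2^64)


def rotl64(x, r):
    """Rotate a 64-bit word left by r bits; in hardware this is only wiring."""
    r %= 64
    return ((x << r) | (x >> (64 - r))) & MASK64


def mix(x0, x1, r):
    """One MIX on a pair of 64-bit words.

    r is a placeholder rotation amount; the real cipher selects it per round
    and per word pair from its constant table.
    """
    y0 = (x0 + x1) & MASK64   # the one 64-bit addition
    y1 = rotl64(x1, r) ^ y0   # the one rotate and the one XOR
    return y0, y1
```

The addition is the only step with a carry chain, which matches the slides' point that the rotate is a wire mapping and the XOR costs a single gate delay.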