Post on 12-Jul-2020
CESEL: Flexible Crypto Acceleration
Kevin KininghamDan Boneh, Mark Horowitz, Philip Levis
Cryptography
• Mathematical operations to secure data
• Fundamental for building secure systems
• Computationally intensive: takes time and energy
Example: Emergency Response Drone
• Secure connections for real-time control
• Authentication
• Real-time control means these operations must be efficient and fast• Want feedback loops on 1-10ms scale• Some algorithms take 10-100ms
• For other use cases, energy efficiency is also important
Hardware Acceleration• Solution today: fixed hardware crypto accelerators• Implement crypto algorithm directly in silicon• Huge energy/latency improvement, but inflexible
• Problem: Security requirements change over time• Requirements change• Vulnerabilities emerge
• Current accelerators can’t adapt to new crypto • New cipher requires physical replacement of device• Can’t replace drones every 2 years
Flexible Acceleration• We need flexible acceleration• Support wide range of crypto• “Good enough” power reduction
Energy
Flexibility
ASIC
Software
CESEL
Outline• Motivation
• Common Operations in Cryptography
• Features for Flexible Acceleration• Bitslicing and Permutation• Flexible SIMD
• CESEL Architecture
• Evaluation
Crypto BackgroundSymmetric
Ciphers AES, DES, …
Hash Functions
SHA, BLAKE, …
Asymmetric Ciphers
RSA, ECC, …
Crypto BackgroundSymmetric
Ciphers AES, DES, …
Hash Functions
SHA, BLAKE, …
Asymmetric Ciphers
RSA, ECC, …
• Based around idea of “diffusion”• Single input bit change should randomize output
• Traditionally done with substitution tables and bit or byte permutation networks• Easy to construct, very efficient in hardware
• Some recent ciphers use arithmetic operations and bitshifting• More efficient in software, easier to parallelize
Crypto BackgroundSymmetric
Ciphers AES, DES, …
Hash Functions
SHA, BLAKE, …
Asymmetric Ciphers
RSA, ECC, …
• Based around a cryptographically hard problem• E.g. discrete log, factoring
• Many “long-word” arithmetic operations• E.g. RSA typically uses words with >1024 bits
• Much more computationally expensive than symmetric
Commonalities Between Crypto Algorithms
• Investigated 38 crypto algorithms• Drawn from major protocols/libraries/competitions
• Symmetric ciphers/Hash functions• Bit/Byte permutations• Internally parallel• Operation width <= 64 bits
• Asymmetric ciphers• Long-word arithmetic operations (>128 bits)
Outline• Motivation
• Common Operations in Cryptography
• Features for Flexible Acceleration • Bitslicing and Permutation• Flexible SIMD
• CESEL Architecture
• Evaluation
Efficient Permutations
• Problem: How to support efficient permutations?• Arbitrary bit-level permutations are expensive
• Insight: Byte permutations are common case• Especially true for software targeted ciphers• Let’s support those first
• Supported using a crossbar network
Efficient Permutations
2 3 1 4
Inputs
Outputs
Efficient Permutations
2 3 1 4
1
Inputs
Outputs
Efficient Permutations
2 3 1 4
2
3
1
4
Inputs
Outputs
N^2 = 16 connections
Efficient Permutations• Can also reduce size of network by splitting permutation over
multiple cycles
• In CESEL, 32-byte permutation in 4 cycles• Additional optimizations allow word (4-byte) permutations in
single cycle• “Best” network depends on area/perf. constraints
2 3 1 4
1
2
2 3 1 4
3
4
Cycle #1 Cycle #2
(N^2) / 2 = 8 Connections
Bitslicing
• Problem: Crossbar does not scale to bit permutations• 256-bit permutation requires 65k connections
• Insight: Vast majority of ciphers don’t require arbitrary permutations, just same permutation applied in parallel
• Use bitslicing to convert parallel bit permutations into word permutations• Commonly used in software crypto implementations• Very fast in hardware
Bitslicing Example1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4
1 1 1 1 2 2 2 2 3 3 3 3 44 4 4
• Bitslicing groups bits by their position in each word
• After regrouping, word permutation is performed
• Finally, bits are regrouped into original positions
• Allows cheap parallel application of bit permutations
Bitslicing Example1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4
1 1 1 1 2 2 2 2 3 3 3 3 44 4 4
• Bitslicing groups bits by their position in each word
• After regrouping, word permutation is performed
• Finally, bits are regrouped into original positions
• Allows cheap parallel application of bit permutations
1 1 1 13 3 3 3 2 2 2 244 4 4
Bitslicing Example1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4
1 1 1 1 2 2 2 2 3 3 3 3 44 4 4
• Bitslicing groups bits by their position in each word
• After regrouping, word permutation is performed
• Finally, bits are regrouped into original positions
• Allows cheap parallel application of bit permutations
1 1 1 13 3 3 3 2 2 2 244 4 4
3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2
Flexible SIMD
• Problem 2: Ciphers have a wide range of bitwidths for their “fundamental” operations• E.g 8-bits: AES, 32-bits: DES, ChaCha, 64-bits:
SHA512• Asymmetric algorithms require very large bitwidths
• Solution flexible SIMD• Allow large number of lane widths in ISA• CESEL supports between 8 and 256-bits
• Long bitwidths supported by mapping to smaller widths and executing over multiple cycles
Outline• Motivation
• Common Operations in Cryptography
• Features for Flexible Acceleration• Bitslicing and Permutation• Flexible SIMD
• CESEL Architecture
• Evaluation
• In order SIMD architecture with 256-bit data path
• Designed to execute as co-processor
• Split into Frontend (fetch/decode) and Backend (execute/writeback)
• Number of lanes and lane width is flexible
CESEL Overview
CESEL Frontend• No data dependent control flow• Not needed in vast majority of crypto code• Simplifies implementation
• Loops use hardware “Loop Stack”• Hardware keeps track of loop iteration +
boundary• Fetch stage can always predict next
instruction
• 16-bit instructions• Minimizes energy cost of instruction
fetches (significant in practice)
CESEL Backend
• SIMD lane width is flexible• 32x8 bits up to 1x256 bits• Set by special instruction during
execution
• Internally, each operation mapped onto 16-bit execution units
• Fast byte permutation + bitslice
Programming Model
• Stream Based Coprocessor• Allows computing over very large inputs• E.g. perform integrity check of all of flash
• Main processor never needs to read keys
• Also allows easy integration with DMA• E.g. encrypt every ADC read
Outline• Motivation
• Common Operations in Cryptography
• Features for Flexible Acceleration• Bitslicing and Permutation• Flexible SIMD
• CESEL Architecture
• Evaluation
• Implemented CESEL in 180nm TSMC
• Baseline was 32-bit RISC-V CPU optimized for similar area/cycle time
• Measured total system energy for fair comparison
Experimental Setup
Results
• 3.8-60x improvement vs RISC-V baseline
• ~20x more energy than dedicated ASIC
Type ASIC CESEL RISC-VBitsliced
AESSymmetric Encryption
7.07 (0.05x)
147.5 (1x)
9,036 (60x)
SHA2 Hash - 3,279 (1x)
12,360 (3.8x)
ChaCha Symmetric Encryption
15.3 (0.05x)
302.4 (1x)
2,021 (6.7x)
Curve25519
Key Exchange - 8,163
(1x)40,454 (5.0x)
RSA Signatures - 16,840 (1x)
87,401 (5.2x)
R-LWE Post-Quantum - 19,822
(1x)109,502 (5.5x)
Total estimated energy in nJ
Results• Does this improvement matter?
• Estimated energy savings in real IoT application• Sensor collecting/transmitting ~1KB over BLE• Curve25519 key exchange once per week
No Crypto
With CESEL
RISC-V Only
Application 3100 3100 3100Crypto 0 460 2300
Total 3100(1x)
3560(1.14x)
5400(1.74x)
Total estimated energy in nJ
Recap
• IoT needs crypto acceleration + flexibility
• Our solution: CESEL• Wide SIMD + long word support• Special instructions (permute, bitslice)• No data-dependent control flow
• Significant energy savings compared to software• ~5x for most ciphers• 1.5x longer deployment time