BCRYPT ECC-Day 2008
Requirements, Algorithms, Architectures
The design space of ECC hardware
2
Contents
Applications of ECC Hardware
Existing Solutions
Design of ECC Hardware
Details of ECC Hardware
3
Motivation
ECC Hardware: What for?
  - Acceleration
  - Power efficiency
  - Implementation security (side-channel resistance)
Competitors of ECC hardware
  - RSA hardware
  - Software implementation: very fast on PC, but very slow on 8-bit µC
Application: Server
  - High throughput: > 100 signatures / sec
Application: Smartcard
  - Low latency: 100 ms per signature
  - Low die size
Application: RFID
  - Low power consumption
  - Low die size
4
ECC Hardware: Application
Different requirements for ECC applications
  - Smartcard: acceptable latency, implementation security, one EC curve sufficient
  - Server acceleration: throughput (not latency), complete offloading
[Figure: customers and clients; e-commerce server, e-government server; ECC]
5
ECC Hardware: Server Acceleration
GF(2^191) hardware accelerator
  - No GF(2^m) support in processors (x86, PPC, …)
  - FPGA (programmable HW) as platform
  - Optimized for one curve
  - Complete EC operation in HW
[Figure: block diagram with PCI chipset, FPGA, Infineon PITA-2, interface, register file, arithmetic unit, ECC control unit]
GF(2^191), fClk = 66 MHz

Multiplier [radix]   k·P [cycles]   fCLK,max [MHz]   k·P / sec [ops]
W = 8 bit            40,210         74.6             1641
W = 16 bit           23,820         71.3             2770
W = 32 bit           15,623         70.4             4224
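The throughput column of the table can be cross-checked from the other two figures. The short Python sketch below assumes that the k·P / sec values were computed at the stated 66 MHz system clock (not at fCLK,max); the names CYCLES_PER_KP, F_CLK_HZ and kp_per_second are illustrative only.

```python
# Cycle counts per point multiplication k·P, as listed in the table.
CYCLES_PER_KP = {8: 40_210, 16: 23_820, 32: 15_623}
F_CLK_HZ = 66_000_000  # 66 MHz clock stated on the slide


def kp_per_second(cycles, f_clk_hz=F_CLK_HZ):
    """Point multiplications per second for a given cycle count and clock."""
    return f_clk_hz / cycles


for width, cycles in CYCLES_PER_KP.items():
    print(f"W = {width:2d} bit: {kp_per_second(cycles):7.1f} k·P per second")
```

Under that assumption the computed rates, rounded down, match the table entries, which suggests the table reports whole operations per second at 66 MHz.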
6
ECC Hardware: Smartcards
Infineon SLE88CFX4000P
SLE 88
  - 32-bit platform
  - 1408-bit RSA co-processor
RSA coprocessor
  - Local memory (704 bytes)
  - Scalable word width
  - Support for ECC: GF(p), GF(2^m)
[Photo: smartcard, © Infineon Technologies]
7
ECC Hardware: Smartcards
NXP Smart MX P5CC072
Smart MX
  - 8-bit smartcard
  - FameXE coprocessor
FameXE
  - RSA, ECC: GF(p), GF(2^m)
  - 2.5 kB local RAM
  - Word width < 4096 bits
[Photo: smartcard, © NXP]
8
ECC Hardware: RFID Authentication
Challenge-response authentication in RFID
Minimization of power consumption
Trading performance for power
  - Lower clock speed
  - Reduced word size
[Figure: RFID tag block diagram with antenna, analog front-end, Vdd, digital front-end, RFID controller, NVRAM, and ECC processor (ECC controller, register file, ALU)]
9
Hardware Design: CMOS Circuits
CMOS: complementary metal-oxide semiconductor
Silicon circuit: up to 2·10^6 transistors per IC
Digital hardware: standard-cell circuits
  - Flip-flops, full adders, muxes, gates (XOR, AND, …)
10
Hardware Design: Top → Down
Top-down design methodology
  - From specification to working silicon
  - "First time right"
Design process
  - Refinement of models
  - Early estimates of area, power, performance
  - Design iterations when constraints are not met
[Figure: efficiency and effort across the design levels: algorithm, circuit level, architecture, system level]
11
Hardware Design: Design Flow
Abstraction levels and tools
1. System level: defining functionality and constraints
2. Algorithmic level: high-level model
3. Architectural level: paper + pencil
4. Register-transfer level: HDL description
5. Circuit level: schematic + layout
12
Challenges of ECC Hardware
EC algorithms (ladder, EC point operation, point representation)
  - Define number of multiplications
  - Define storage requirements
  - Define implementation security
Multiplication: determines performance
Storage: determines circuit size
Control: determines HDL complexity
Dos
  - Fix EC parameters
  - Fixed field size
  - Separate storage and computation
Don'ts
  - Trading increased storage for lower computation
  - Optimization of negligible things: inversion
13
Approaches to ECC Hardware
EC-processor
  - Computing full point multiplication
  - No external interaction necessary
Co-processor
  - Acceleration of finite-field operation
  - (Limited local memory)
  - External interaction needed for point ladder and point operation
ISE (instruction-set extension)
  - Enhancement of existing instruction set
  - Acceleration of core operations: multiply-accumulate instructions, support of polynomial arithmetic
14
Algorithms for ECC
Bit-serial multiplication
  - a in full precision; b bitwise
  - Faster: digit-serial (w bits of b)
Modular reduction without division
  - NIST reduction: for trinomials / pentanomials, for Mersenne-like primes
  - Montgomery multiplication: combines a*b and mod p, for arbitrary moduli
MulSer(a, b) = a*b
  c = 0
  for i = n-1 downto 0 do
    c = 2·c + a·b_i

MonMul(a, b) = a·b·R^-1 mod p
  Pre-comp: R = 2^(n+2) mod p, R^2 mod p, p' = (-p)^-1 mod 2
  c = 0
  for i = 0 to n+1 do
    q = ((c_0 + a_0·b_i) mod 2)·p'
    c = (c + p·q + a·b_i) / 2
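The two loops above can be tried out in software. The following Python sketch is one possible reading of them, not the hardware described in the talk: it assumes an odd modulus p < 2^n, so that p' = (-p)^-1 mod 2 is simply 1, and it assumes the accumulator is halved in every Montgomery step, which is what makes the result equal a·b·R^-1 with R = 2^(n+2). The function names mul_ser and mon_mul are illustrative.

```python
def mul_ser(a, b, n):
    """Bit-serial multiplication: a in full precision, b scanned MSB first."""
    c = 0
    for i in range(n - 1, -1, -1):
        b_i = (b >> i) & 1
        c = 2 * c + a * b_i
    return c


def mon_mul(a, b, p, n):
    """Radix-2 Montgomery multiplication with R = 2^(n+2).

    Returns a value congruent to a*b*R^-1 mod p. For inputs below 2p the
    result stays below 2p, so no final subtraction is performed here.
    """
    assert p % 2 == 1 and p < (1 << n)
    c = 0
    for i in range(n + 2):
        b_i = (b >> i) & 1
        q = (c + a * b_i) & 1           # (c_0 + a_0*b_i) mod 2, with p' = 1
        c = (c + q * p + a * b_i) >> 1  # the sum is even, so the shift is exact
    return c


# Usage: map into the Montgomery domain with R^2 mod p, multiply, map back.
p, n = (1 << 61) - 1, 61
R = 1 << (n + 2)
R2 = (R * R) % p
a, b = 123456789012345, 987654321098765
a_m, b_m = mon_mul(a, R2, p, n), mon_mul(b, R2, p, n)
product = mon_mul(mon_mul(a_m, b_m, p, n), 1, p, n) % p
print(product == (a * b) % p)
```

Choosing R = 2^(n+2) rather than 2^n costs two extra iterations but keeps intermediate values below 2p, which is one common way to avoid a conditional subtraction after every multiplication.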
[Figure: NIST reduction, splitting the product into a high part ah and a low part al; labels 1, 2^64, 2^192]
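As an illustration of reduction without division for a Mersenne-like prime, the Python sketch below reduces a double-length product modulo p = 2^192 - 2^64 - 1. The slide does not name the modulus, so this choice is an assumption (it is the NIST P-192 prime, and it fits the 2^64 and 2^192 labels of the figure). The name nist_reduce_p192 is illustrative.

```python
P192 = (1 << 192) - (1 << 64) - 1   # assumed modulus: p = 2^192 - 2^64 - 1
MASK64 = (1 << 64) - 1


def nist_reduce_p192(c):
    """Reduce 0 <= c < 2^384 modulo P192 using only shifts and additions.

    Uses 2^192 = 2^64 + 1 (mod p) to fold the three high 64-bit words
    c3, c4, c5 back onto the three low words c0, c1, c2.
    """
    w = [(c >> (64 * i)) & MASK64 for i in range(6)]    # c0 .. c5
    s1 = (w[2] << 128) | (w[1] << 64) | w[0]            # (c2, c1, c0)
    s2 = (w[3] << 64) | w[3]                            # (0,  c3, c3)
    s3 = (w[4] << 128) | (w[4] << 64)                   # (c4, c4, 0)
    s4 = (w[5] << 128) | (w[5] << 64) | w[5]            # (c5, c5, c5)
    r = s1 + s2 + s3 + s4
    while r >= P192:                                    # at most a few subtractions
        r -= P192
    return r


a = 0x0123456789ABCDEF_FEDCBA9876543210_0F1E2D3C4B5A6978
b = P192 - 12345
print(nist_reduce_p192(a * b) == (a * b) % P192)
```

The reduction costs three additions and a short correction loop, which is why a fixed prime of this shape is attractive when the field is known in advance.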
15
Modular Multiplication in HW
GF(2^191) example
Digit-serial multiplication
  - c(x) = a(x)*m(x) mod f(x)
  - a(x): full precision
  - m(x): w-bit digits, digit size w = 8, 16, 32
Alignment of intermediate result
Interleaved NIST reduction: small intermediate results
Squaring as own operation: simple when irreducible polynomial f(x) is fixed
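The digit-serial scheme can be modelled in a few lines of Python, with polynomials over GF(2) stored as integers (bit i is the coefficient of x^i). This is a behavioural sketch only. The slide does not state f(x); the trinomial x^191 + x^9 + 1 is assumed here, and a generic bit-by-bit reduction stands in for the interleaved NIST reduction. All function names are illustrative.

```python
M = 191
F = (1 << 191) | (1 << 9) | 1   # assumed f(x) = x^191 + x^9 + 1


def clmul(a, d):
    """Carry-less product a(x) * d(x): XOR one shifted copy of a per set bit of d."""
    r, i = 0, 0
    while d:
        if d & 1:
            r ^= a << i
        d >>= 1
        i += 1
    return r


def reduce_f(c):
    """Reduce c(x) mod f(x) by cancelling the leading term until deg c < M."""
    while c.bit_length() > M:
        c ^= F << (c.bit_length() - 1 - M)
    return c


def gf2m_mul_digit_serial(a, m, w=8):
    """c(x) = a(x) * m(x) mod f(x), consuming m(x) in w-bit digits, top digit first."""
    digits = -(-M // w)                      # ceil(M / w)
    mask = (1 << w) - 1
    c = 0
    for i in range(digits - 1, -1, -1):
        m_i = (m >> (w * i)) & mask
        # shift by w (alignment), add partial product, reduce immediately
        c = reduce_f((c << w) ^ clmul(a, m_i))
    return c


a = (0x1234567890ABCDEF << 100) | 0xFEDCBA987654321
b = (1 << 190) | 0xDEADBEEF
reference = reduce_f(clmul(a, b))            # multiply fully, then reduce once
print(all(gf2m_mul_digit_serial(a, b, w) == reference for w in (8, 16, 32)))
```

Because the reduction runs after every digit, the accumulator never grows beyond roughly 191 + w bits, which corresponds to the "small intermediate results" noted on the slide.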
[Figure: digit-serial GF(2^191) multiplier datapath computing a(x) + b(x) mod f(x) and a(x) · m_i(x); inputs a(x), b(x), din; multiplexers muxm, muxb; shift by w (<< w); squaring (^2); outputs c(x), dout]
16
Multiplier in HW
Partial product generation a(x) * m_i
  - Simply 191 AND gates
  - Amplification of m_i crucial
Aligning intermediate results
  - Simple: fixed shift operation
Accumulation of partial products (PP)
  - Array or tree adder
Modular reduction
  - 200 bits -> 191 bits
[Figure: partial products p_0 ... p_191 formed from a_0 ... a_190 and the digit bits m_0, m_1; modular reduction folding the upper bits a_191, a_192, a_193 back into the lower positions]
17
GF(p) Multiplier
Radix-4 multiplier
  - A in full precision
  - B: 2 bits / cycle
Montgomery multiplication
  - Orup's optimization
Redundant number representation
  - Carry-save (CS)
  - More storage
  - Shorter critical path
  - Red2bin: CSA reuse
Booth recoding (Benc)
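The carry-save idea can be illustrated with a small Python model. It is a sketch under simplifying assumptions: it performs plain integer multiplication (no Montgomery reduction, no Booth recoding), processes two bits of B per step, and keeps the running result as a (sum, carry) pair. The names csa, red2bin and mul_radix4_cs are illustrative; reading "Red2bin: CSA reuse" as repeated passes through the same adder is an interpretation.

```python
def csa(x, y, z):
    """Carry-save adder (3:2 compressor): one full adder per bit, no carry chain."""
    s = x ^ y ^ z
    c = ((x & y) | (x & z) | (y & z)) << 1
    return s, c


def red2bin(s, c):
    """Convert the redundant (sum, carry) pair to binary by reusing the CSA."""
    while c:
        s, c = csa(s, c, 0)
    return s


def mul_radix4_cs(a, b, n):
    """Multiply a by an n-bit b, two bits of b per step, accumulating in carry-save form."""
    s = c = 0
    for i in range(0, n, 2):
        d = (b >> i) & 3                 # radix-4 digit of B (0..3)
        s, c = csa(s, c, (a * d) << i)   # partial product enters the CSA
    return red2bin(s, c)


print(mul_radix4_cs(0xDEADBEEF, 0xC0FFEE, 24) == 0xDEADBEEF * 0xC0FFEE)
```

The digit value 3 would need an extra addition (3·A) in hardware; Booth recoding replaces the digit set {0, 1, 2, 3} by {-2, ..., 2} so that only shifts and negations of A are required.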
[Figure: radix-4 GF(p) multiplier datapath with Booth encoder (Benc), Qenc, partial-product generators (PPG, PPG'), carry-save adders built from full and half adders (CSA FA, CSA HA), registers A, B, C, S, M with >> 2 shifts, and control]
18
Dual-field Support
Application: e.g. ECDSA
  - ECC over GF(2^m)
  - Protocol: GF(p), with Mul, Add, Inv mod n (n: base point order)
Architecture
  - ~GF(p) multiplier
  - CSA for GF(p)
  - XOR for GF(2^m): carries blocked
GF(p) versus GF(2^m)
  - GF(2^m) faster … GF(p) needs reg. C
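The "carries blocked" idea can be shown with a one-function Python model of a dual-field adder cell. This is a sketch of the principle only, not the datapath of the talk; the name dual_field_csa and the binary_field flag are illustrative.

```python
def dual_field_csa(x, y, z, binary_field):
    """Carry-save adder whose carry output can be forced to zero.

    binary_field=False: integer (GF(p)) mode, s + c equals x + y + z.
    binary_field=True:  GF(2^m) mode, carries are blocked and s is the XOR sum.
    """
    s = x ^ y ^ z
    carry = ((x & y) | (x & z) | (y & z)) << 1
    c = 0 if binary_field else carry
    return s, c


s, c = dual_field_csa(0b1011, 0b0110, 0b1100, binary_field=False)
print(s + c == 0b1011 + 0b0110 + 0b1100)     # integer addition
s, c = dual_field_csa(0b1011, 0b0110, 0b1100, binary_field=True)
print(s == 0b1011 ^ 0b0110 ^ 0b1100, c == 0)  # polynomial addition over GF(2)
```

In GF(2^m) mode the carry output is always zero, which is consistent with the slide's remark that only GF(p) needs the carry register C.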
[Figure: dual-field multiplier datapath with two carry-save adders; operands a / a(x) and p / p(x); selectable multiples (-a, a, 0; 2c, c, 0; 2s, s, 0; p, p/2, 0; c/2; s/2); registers B, C and S; control]
ECDSA
  e = SHA-1(Message)
  k = random(1, n-1)
  R = k*(Px,Py) = (Rx,Ry)
  r = Rx mod n
  s = k^-1·(e + d·r) mod n
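The signing steps can be followed end to end on a toy curve. The Python sketch below is for illustration only and is not secure: it assumes the textbook-sized curve y^2 = x^3 + 2x + 2 over GF(17) with base point (5, 1) of order n = 19, a hypothetical private key d, and it reduces the SHA-1 digest mod n instead of truncating it as a real ECDSA implementation would. None of these parameters come from the talk.

```python
import hashlib
import secrets

P_MOD, A = 17, 2      # toy curve y^2 = x^3 + 2x + 2 over GF(17) (assumption)
G = (5, 1)            # base point (Px, Py)
N = 19                # order n of the base point


def inv(x, m):
    """Modular inverse for a prime modulus m (Fermat's little theorem)."""
    return pow(x % m, m - 2, m)


def add(p1, p2):
    """Affine point addition; None represents the point at infinity."""
    if p1 is None:
        return p2
    if p2 is None:
        return p1
    (x1, y1), (x2, y2) = p1, p2
    if x1 == x2 and (y1 + y2) % P_MOD == 0:
        return None
    if p1 == p2:
        lam = (3 * x1 * x1 + A) * inv(2 * y1, P_MOD) % P_MOD
    else:
        lam = (y2 - y1) * inv(x2 - x1, P_MOD) % P_MOD
    x3 = (lam * lam - x1 - x2) % P_MOD
    return x3, (lam * (x1 - x3) - y1) % P_MOD


def mul(k, pt):
    """Double-and-add scalar multiplication k*pt, most significant bit first."""
    acc = None
    for bit in bin(k)[2:]:
        acc = add(acc, acc)
        if bit == "1":
            acc = add(acc, pt)
    return acc


def hash_to_int(msg):
    return int.from_bytes(hashlib.sha1(msg).digest(), "big") % N


def sign(msg, d):
    e = hash_to_int(msg)
    while True:
        k = secrets.randbelow(N - 1) + 1       # k = random(1, n-1)
        r = mul(k, G)[0] % N                   # r = Rx mod n
        s = inv(k, N) * (e + d * r) % N        # s = k^-1 (e + d r) mod n
        if r and s:                            # retry on the degenerate cases
            return r, s


def verify(msg, sig, q):
    r, s = sig
    w = inv(s, N)
    pt = add(mul(hash_to_int(msg) * w % N, G), mul(r * w % N, q))
    return pt is not None and pt[0] % N == r


d = 7                                          # hypothetical private key
Q = mul(d, G)                                  # public key
print(verify(b"hello", sign(b"hello", d), Q))
```

Note the split of the arithmetic: the point multiplication runs in the curve's field, while k^-1, d·r and the final sum are computed mod n, which is the reason a dual-field unit is useful even when the curve itself is defined over GF(2^m).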
19
ECC for RFID
Problem: very constrained power budget
  - P = E/t = I·U = fclk·CL·Vdd^2
Problem analysis: where is power consumed?
  - Mostly for storage: clocking of registers
New idea
  - Fewer registers; more combinational logic
  - Smaller datapaths: no computation at full word size
  - Adoption of ISE techniques: MAC operation
  - Simple HDL implementation
[Figure: RAM, datapath (labelled 16), control]
20
Control
Task of control logic
  - Generate control signals for 60,000 to 6 million clock cycles
Separation of control and datapath
  - Registered control signals
  - For performance and power efficiency
  - Avoiding critical path
Hierarchical control
  - Complex control
Options
  - Hardwired: state machine
  - Micro-program: counter + ROM
  - Micro-controller: software
[Figure: Elliptic-Curve Processor with scalar control, point control (init, final, double, add), multiplier control, ROM1, ROM2, ALU, modular reduction, RAM, register, mux]
21
Results
Server acceleration
  - For GF(2^191)
  - Size: 1500 slices on Xilinx FPGA
  - > 1000 EC ops / sec @ 66 MHz clock
Smartcard coprocessor
  - Dual-field capability
  - 192-bit ECC: 23k GE (gate equivalents), 400k to 700k cycles
  - 256-bit ECC: 31k GE, 600k to 900k cycles
ECC for RFID
  - 163-bit ECC: 12k GE, 400k cycles
  - 192-bit ECC: 18k GE, 850k cycles
  - Storage: 75% of area
  - ISE-datapath: 75% of power
  - Realistic on <130 nm CMOS
  - Power constraint ~15 µA
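The smartcard cycle counts can be related to the 100 ms latency target from the Motivation slide. The Python sketch below assumes, purely for illustration, that the cycle counts describe one complete operation and that the whole 100 ms budget is available for it; the slides do not say exactly what the cycle counts cover. The names LATENCY_S, CYCLES and min_clock_hz are illustrative.

```python
LATENCY_S = 0.100     # smartcard target: 100 ms per signature

CYCLES = {            # cycle ranges from the Results slide
    "192-bit ECC": (400_000, 700_000),
    "256-bit ECC": (600_000, 900_000),
}


def min_clock_hz(cycles, latency_s=LATENCY_S):
    """Lowest clock frequency that finishes `cycles` within `latency_s`."""
    return cycles / latency_s


for name, (low, high) in CYCLES.items():
    print(f"{name}: {min_clock_hz(low) / 1e6:.0f} to {min_clock_hz(high) / 1e6:.0f} MHz")
```

Under these assumptions a clock in the range of a few MHz is enough to meet the latency target.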
22
Conclusions
Different applications require different ECC hardware
Fixed parameters (EC params, field) allow more efficient implementation
  - Squaring in GF(2^m)
  - NIST reduction
ECC for RFID
  - Seems possible