High-Performance FV Somewhat Homomorphic …...High-Performance FV Somewhat Homomorphic Encryption...

20
High-Performance FV Somewhat Homomorphic Encryption on GPUs: An Implementation using CUDA Ahmad Al Badawi [email protected] National University of Singapore (NUS) Sept 10 th 2018 – CHES 2018

Transcript of High-Performance FV Somewhat Homomorphic …...High-Performance FV Somewhat Homomorphic Encryption...

Page 1: High-Performance FV Somewhat Homomorphic …...High-Performance FV Somewhat Homomorphic Encryption on GPUs: An Implementation using CUDA Ahmad Al Badawi ahmad@u.nus.edu National University

High-Performance FV Somewhat Homomorphic Encryption on GPUs:

An Implementation using CUDA

Ahmad Al Badawi [email protected]

National University of Singapore (NUS)

Sept 10th 2018 – CHES 2018

Page 2: High-Performance FV Somewhat Homomorphic …...High-Performance FV Somewhat Homomorphic Encryption on GPUs: An Implementation using CUDA Ahmad Al Badawi ahmad@u.nus.edu National University

FHE – The holy grail of Cryptography

• FHE enables computing on encrypted data without decryption [GB2009]

• Challenge: requires enormous computation

Client Cloud Server

Homomorphic evaluation of 𝑓

𝐸𝑛𝑐(𝑥)

𝐸𝑛𝑐(𝑓(𝑥))

Encryption/Decryption

Ahmad Al Badawi - [email protected] 2

Page 3: High-Performance FV Somewhat Homomorphic …...High-Performance FV Somewhat Homomorphic Encryption on GPUs: An Implementation using CUDA Ahmad Al Badawi ahmad@u.nus.edu National University

How the problem is being tackled? • Algorithmic methods:

– New FHE schemes

– Plaintext packing (1D, 2D, …)

– Encoding schemes

– Approximated computing

– Squashing the target function

– DAG optimizations for the target circuit

• Acceleration methods: – Speedup FHE basic primitives (KeyGen, Enc, Dec, Add, Mul)

– Modular Algorithms

– Parallel Implementations

– Hardware implementations: GPUs, FPGAs and probably ASICs

Ahmad Al Badawi - [email protected] 3

Page 4: High-Performance FV Somewhat Homomorphic …...High-Performance FV Somewhat Homomorphic Encryption on GPUs: An Implementation using CUDA Ahmad Al Badawi ahmad@u.nus.edu National University

Our Contributions 1. Implementation of FV RNS on GPUs

2. Introducing a set of CUDA optimizations

3. Benchmarking with state-of-the-art implementations

Ahmad Al Badawi - [email protected] 4

Page 5: High-Performance FV Somewhat Homomorphic …...High-Performance FV Somewhat Homomorphic Encryption on GPUs: An Implementation using CUDA Ahmad Al Badawi ahmad@u.nus.edu National University

Why GPUs for FHE?

Ahmad Al Badawi - [email protected] 5

• GPU +

– Naturally available

– many computing cores

– Developer friendly (FPGA,

ASIC)

• FHE +

– Huge level of parallelism

“If you were plowing a field, which would you rather use? Two strong oxen or 1024 chickens?”

Seymour Cray 1925-1996

Page 6: High-Performance FV Somewhat Homomorphic …...High-Performance FV Somewhat Homomorphic Encryption on GPUs: An Implementation using CUDA Ahmad Al Badawi ahmad@u.nus.edu National University

Textbook FV

Ahmad Al Badawi - [email protected] 6

• Basic mathematical structure is 𝑅: ℤ 𝑥 /(𝑥𝑛 + 1)

– Plaintext space: 𝑅𝑡: ℤ𝑡 𝑥 /(𝑥𝑛 + 1)

– Ciphertext space: 𝑅𝑞: ℤ𝑞 𝑥 /(𝑥𝑛 + 1)

• Public key: 𝑝𝑘0, 𝑝𝑘1 ∈ 𝑅𝑞

• Secret key: 𝑠𝑘 ∈ 𝑅𝑞

• 𝑐 = 𝐸𝑛𝑐(𝑚): (𝑞

𝑡𝑚+ 𝑝𝑘0𝑢 + 𝑒0

𝑞, 𝑝𝑘1𝑢 + 𝑒1 𝑞)

• 𝑚 = 𝐷𝑒𝑐 𝑐 :𝑡

𝑞𝑐0 + 𝑐1𝑠𝑘 𝑞

𝑡

• 𝑐+ = 𝐴𝑑𝑑 𝑐0, 𝑐1 : ( 𝑐00 + 𝑐10 𝑞 , 𝑐01 + 𝑐11 𝑞)

Page 7: High-Performance FV Somewhat Homomorphic …...High-Performance FV Somewhat Homomorphic Encryption on GPUs: An Implementation using CUDA Ahmad Al Badawi ahmad@u.nus.edu National University

Textbook FV (cont.)

Ahmad Al Badawi - [email protected] 7

• 𝑐× = 𝑀𝑢𝑙 𝑐0, 𝑐1, 𝑒𝑣𝑘 :

1. Tensor product:

𝑣0 = ⌊𝑡

𝑞 𝑐00𝑐10⌉

𝑞 , 𝑣2 = ⌊

𝑡

𝑞 𝑐01𝑐11⌉

𝑞

𝑣1 = ⌊𝑡

𝑞(𝑐00𝑐11 + 𝑐01𝑐10)⌉

𝑞

2. Base decomposition:

𝑣2 = 𝑣2(𝑖)𝑤𝑖

𝑙

𝑖=0

, 𝑙 =

2. Relinearization:

𝑐× = 𝑣𝑗 + 𝑒𝑣𝑘𝑖𝑗 ⋅ 𝑣2(𝑖)

𝑙

𝑖=0𝑞

, 𝑗 ∈ {0,1}

Page 8: High-Performance FV Somewhat Homomorphic …...High-Performance FV Somewhat Homomorphic Encryption on GPUs: An Implementation using CUDA Ahmad Al Badawi ahmad@u.nus.edu National University

Implementation Requirements

Ahmad Al Badawi - [email protected] 8

• Polynomial arithmetic in cyclotomic rings

• Large polynomial degree (a few thousands)

– Power-of-2 cyclotomic

– Addition/Subtraction: 𝒪(𝑛)

– Multiplication: 𝒪(n log 𝑛)

• Large coefficients ∈ ℤ𝑞 (a few hundreds of bits)

– Modular algorithms (RNS)

• Extra non-trivial operations:

– Scaling-and-round ⌊𝑡

𝑞𝑥⌉

– Base decomposition

Page 9: High-Performance FV Somewhat Homomorphic …...High-Performance FV Somewhat Homomorphic Encryption on GPUs: An Implementation using CUDA Ahmad Al Badawi ahmad@u.nus.edu National University

Polynomial Arithmetic

Ahmad Al Badawi - [email protected] 9

𝑘 =log2 𝑞

log2 𝑝

𝑛

CRT:

𝑞 = 𝑝𝑖𝑘−1𝑖=0 , where 𝑝𝑖 is a prime

RNS / CRT

log2 𝑝-bit number

. . .

. . .

. . .

Addition/Subtraction: component-wise add/sub modulo 𝑝𝑖

𝑝0 𝑝1

𝑝𝑘−1

Page 10: High-Performance FV Somewhat Homomorphic …...High-Performance FV Somewhat Homomorphic Encryption on GPUs: An Implementation using CUDA Ahmad Al Badawi ahmad@u.nus.edu National University

Polynomial Arithmetic (cont.)

Ahmad Al Badawi - [email protected] 10

Addition/Subtraction/Multiplication: component-wise add/sub/mul modulo 𝑝𝑖

𝑝0

𝑝1

𝑝𝑘−1

𝑛

RNS / CRT

𝑛

32-bit number NTT (DGT)

. . .

. . .

. . . . . .

. . .

NTT

NTT−1

. . . 𝑝0

𝑝1

𝑝𝑘−1

𝑘 =log2 𝑞

log2 𝑝

Page 11: High-Performance FV Somewhat Homomorphic …...High-Performance FV Somewhat Homomorphic Encryption on GPUs: An Implementation using CUDA Ahmad Al Badawi ahmad@u.nus.edu National University

DFT, NTT, DWT, DGT…?

Ahmad Al Badawi - [email protected] 11

Pros Cons

DFT - Well-established - Several efficient libraries to use

- Floating point errors increase as (𝑛 & 𝑝𝑖′s) increase

- Reduce precision (smaller 𝑝𝑖′s) => longer RNS matrix

=> more DFTs

NTT - Exact - Transform length (2𝑛)

DWT - Exact - Only power-of-2 cyclotomics - Transform length (𝑛)

DGT - Exact

- Transform length (𝑛

2)

- 50% Less interaction with memory

- Only power-of-2 cyclotomics - Gaussian Arithmetic (larger number of multiplications ~(30% - 40%)

• We use DGT in our implementation

Page 12: High-Performance FV Somewhat Homomorphic …...High-Performance FV Somewhat Homomorphic Encryption on GPUs: An Implementation using CUDA Ahmad Al Badawi ahmad@u.nus.edu National University

Efficient DGT/NTT/DWT on GPU?

Ahmad Al Badawi - [email protected] 12

𝐴 = NTT(𝑎) s.t.

𝐴𝑗 = 𝑎𝑖𝑤𝑗𝑖

𝑛−1

𝑖=0

mod 𝑞

𝑎 = NTT−1(𝐴) s.t.

𝑎𝑖 = 𝑛−1 𝐴𝑗𝑤−𝑖𝑗

𝑛−1

𝑗=0

mod 𝑞

• Better to store 𝑤𝑗𝑖 in lookup table.

– LUT can be stored in GPU texture

memory (which is limited on GPU)

– DWT LUT are 𝒪(𝑛)

– DGT LUT are 𝒪(𝑛

2)

• Compute in 𝐺𝐹(𝑝 ) or in 𝐺𝐹(𝑝𝑖)?

– We found it is better to do it 𝐺𝐹(𝑝𝑖).

– Why? (see next)

Page 13: High-Performance FV Somewhat Homomorphic …...High-Performance FV Somewhat Homomorphic Encryption on GPUs: An Implementation using CUDA Ahmad Al Badawi ahmad@u.nus.edu National University

Compute in 𝐺𝐹(𝑝 ) or in 𝐺𝐹(𝑝𝑖)?

Ahmad Al Badawi - [email protected] 13

𝑝0

𝑝1

𝑝𝑘−1

𝑛 . . .

. . . . . . 𝑘 =

log2 𝑞

log2 𝑝

𝑮𝑭(𝒑 )

𝑝 : 64-bit prime (should fit in one word)

𝑝 ≤𝑝

2𝑛 (one multiplication)

𝑛 212 213 214 215 216

log2 𝑝 26 25 25 24 24

- Longer RNS matrix => more NTTs - Size double (32-bit => 64-bit) - Supports limited number of operations

in NTT domain

𝑮𝑭(𝒑𝒊)

𝑝: word-size prime (can be 64-bit)

- Shorter RNS matrix => Less NTTs - No size doubling - Supports unlimited number of

operations in NTT domain

Page 14: High-Performance FV Somewhat Homomorphic …...High-Performance FV Somewhat Homomorphic Encryption on GPUs: An Implementation using CUDA Ahmad Al Badawi ahmad@u.nus.edu National University

But, is NTT/DWT/DGT performance-critical?

Ahmad Al Badawi - [email protected] 14

Toy Settings

NTT

RNS Base Extension

RNS Scaling

others

Medium Settings

NTT

RNS Base Extension

RNS Scaling

others

Large Settings

NTT

RNS Base Extension

RNS Scaling

others

Breakdown of homomorphic multiplication (AND) in the BFV FHE scheme

Halevi, Shai, Yuriy Polyakov, and Victor Shoup. "An Improved RNS Variant of the BFV Homomorphic Encryption Scheme." (2018).

Page 15: High-Performance FV Somewhat Homomorphic …...High-Performance FV Somewhat Homomorphic Encryption on GPUs: An Implementation using CUDA Ahmad Al Badawi ahmad@u.nus.edu National University

Computing CRT on GPU?

Ahmad Al Badawi - [email protected] 15

- CRT 𝑎, {𝑝𝑖} :

(𝑎0, … , 𝑎𝑘−1) = 𝑎 mod 𝑝𝑖

- CRT−1(𝑎0, … , 𝑎𝑘−1) = 𝑎 s.t.

𝑎 = 𝑞

𝑝𝑖

𝑘−1

𝑖=0

𝑞

𝑝𝑖

−1

𝑎𝑖 (mod 𝑝𝑖) (mod 𝑞)

where 𝑞 = 𝑝𝑖𝑘−1𝑖=0

• At least two methods:

– Classic algorithm

– Garner’s algorithm

Classic Garners

LUT 𝑘2 𝑘 𝑘 − 1

2

Thread Divergence Non tractable Nil

• Is CRT critical to performance?

– No!

Page 16: High-Performance FV Somewhat Homomorphic …...High-Performance FV Somewhat Homomorphic Encryption on GPUs: An Implementation using CUDA Ahmad Al Badawi ahmad@u.nus.edu National University

RNS tools

Ahmad Al Badawi - [email protected] 16

• Useful to:

– Remain in RNS representation

– No costly multi-precision arithmetic

• Two basic operations:

– Scale-and-round

– Base decomposition

• Adopted from (BEHZ2016*) scheme

• Are RNS tools critical to performance?

– Extremely critical

* Bajard, Jean-Claude, et al. "A full RNS variant of FV like somewhat homomorphic encryption schemes." International Conference on Selected Areas in Cryptography. Springer, Cham, 2016.

Page 17: High-Performance FV Somewhat Homomorphic …...High-Performance FV Somewhat Homomorphic Encryption on GPUs: An Implementation using CUDA Ahmad Al Badawi ahmad@u.nus.edu National University

FV_RNS Homomorphic Multiplication

Ahmad Al Badawi - [email protected] 17

Page 18: High-Performance FV Somewhat Homomorphic …...High-Performance FV Somewhat Homomorphic Encryption on GPUs: An Implementation using CUDA Ahmad Al Badawi ahmad@u.nus.edu National University

Benchmarking Results

Ahmad Al Badawi - [email protected] 18

0.000

200.000

400.000

600.000

800.000

1000.000

1200.000

1400.000

1600.000

(11,62) (12,186) (13,372) (14,744)

Tim

e (

ms)

Key Generation

GPU-FV

SEAL

NFLlib-FV

0.000

5.000

10.000

15.000

20.000

25.000

30.000

35.000

(11,62) (12,186) (13,372) (14,744)

Tim

e (

ms)

Enc

GPU-FV

SEAL

NFLlib-FV

0.000

2.000

4.000

6.000

8.000

10.000

12.000

14.000

16.000

18.000

(11,62) (12,186) (13,372) (14,744)

Tim

e (

ms)

Dec

GPU-FV

SEAL

NFLlib-FV

0.000

50.000

100.000

150.000

200.000

250.000

300.000

350.000

400.000

450.000

500.000

(11,62) (12,186) (13,372) (14,744)

Tim

e (

ms)

HomoMul + Relinearization

GPU-FV

SEAL

NFLlib-FV

Page 19: High-Performance FV Somewhat Homomorphic …...High-Performance FV Somewhat Homomorphic Encryption on GPUs: An Implementation using CUDA Ahmad Al Badawi ahmad@u.nus.edu National University

Which FV RNS variant to Implement?

Ahmad Al Badawi - [email protected] 20

• Two RNS variants of FV

– BEHZ

– HPS

• Answer can be found in:

– Al Badawi, Ahmad, et al. "Implementation and Performance Evaluation of RNS

Variants of the BFV Homomorphic Encryption Scheme." IACR Cryptology ePrint

Archive 2018 (2018): 589.

Page 20: High-Performance FV Somewhat Homomorphic …...High-Performance FV Somewhat Homomorphic Encryption on GPUs: An Implementation using CUDA Ahmad Al Badawi ahmad@u.nus.edu National University

21

Thank You

Questions? Ahmad Al Badawi

[email protected]