Post on 16-Jul-2015
AES encryption using GPU architectures
Politehnica University of Bucharest
Automatic Control and Computers Faculty
Computer Science Department
Author(s): Grigore Lupescu
Scientific Advisor: Emil Slusanschi
Scientific Student Projects Session - May 2014
AES Encryption (1)
Mode of operation: an algorithm that repeatedly applies a block cipher (e.g. AES) to the input plaintext
Most modes of operation require an initialization vector (IV)
Most widely used cipher modes: Cipher-block chaining (CBC), Counter (CTR)
Other cipher modes: Electronic codebook (ECB), Output feedback (OFB)
Why use ECB?
Simple, fast, highly parallelizable, maximum throughput
Provides a good estimate of how CTR would perform, since CTR also processes each block independently
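A minimal C sketch of why ECB parallelizes well and why its throughput is a reasonable proxy for CTR: in both modes each 16-byte block is processed independently, so block b could be handled by GPU work-item b. The function `toy_encrypt_block` is a hypothetical stand-in (a plain XOR with the key), not AES, and the counter layout in `ctr_encrypt` (8-byte nonce followed by an 8-byte big-endian block index) is an assumption made for illustration only.

```c
#include <stddef.h>
#include <stdint.h>

#define BLOCK_SIZE 16  /* AES block size in bytes */

/* Hypothetical stand-in for a real block cipher: XOR with the key.
   NOT secure and NOT AES; it only keeps the sketch short. */
void toy_encrypt_block(const uint8_t in[BLOCK_SIZE],
                       const uint8_t key[BLOCK_SIZE],
                       uint8_t out[BLOCK_SIZE])
{
    for (int i = 0; i < BLOCK_SIZE; i++)
        out[i] = (uint8_t)(in[i] ^ key[i]);
}

/* ECB: every block is encrypted on its own, with no IV and no chaining.
   Iterations are independent, so each one could be a separate work-item. */
void ecb_encrypt(const uint8_t *in, uint8_t *out, size_t nblocks,
                 const uint8_t key[BLOCK_SIZE])
{
    for (size_t b = 0; b < nblocks; b++)
        toy_encrypt_block(in + b * BLOCK_SIZE, key, out + b * BLOCK_SIZE);
}

/* CTR: encrypt a per-block counter and XOR the result into the data.
   Blocks are still independent; applying the function twice with the
   same key and nonce restores the input. */
void ctr_encrypt(const uint8_t *in, uint8_t *out, size_t nblocks,
                 const uint8_t key[BLOCK_SIZE], uint64_t nonce)
{
    for (size_t b = 0; b < nblocks; b++) {
        uint8_t ctr[BLOCK_SIZE], ks[BLOCK_SIZE];
        for (int i = 0; i < 8; i++) {
            ctr[i]     = (uint8_t)(nonce >> (56 - 8 * i));        /* assumed layout */
            ctr[8 + i] = (uint8_t)((uint64_t)b >> (56 - 8 * i));  /* block index */
        }
        toy_encrypt_block(ctr, key, ks);
        for (int i = 0; i < BLOCK_SIZE; i++)
            out[b * BLOCK_SIZE + i] = (uint8_t)(in[b * BLOCK_SIZE + i] ^ ks[i]);
    }
}
```

In ECB, two identical plaintext blocks yield identical ciphertext blocks; in CTR they do not, because the counter differs per block. The per-block work is otherwise the same, one block cipher call.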
AES Encryption (2)
KeyExpansion: round keys are derived from the cipher key.
InitialRound: AddRoundKey.
Rounds (9 for AES-128):
SubBytes: substitution step where each byte is replaced with another according to the S-box table.
ShiftRows: transposition step where the last three rows of the state are cyclically shifted by 1, 2 and 3 bytes.
MixColumns: a mixing operation on the columns of the state; addition and multiplication are defined in the Galois field GF(2^8).
AddRoundKey: bitwise XOR of each byte of the state with the round key.
FinalRound: SubBytes, ShiftRows, AddRoundKey (no MixColumns).
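The steps above can be sketched in plain C for a single AES-128 block. This is an illustrative reference version, not the GPU kernel from the talk: the function names are hypothetical, the S-box is computed at start-up instead of being stored as a literal table to keep the listing short, and the state is held column-major in a 16-byte array.

```c
#include <stdint.h>

static uint8_t sbox[256];   /* filled by aes_init_sbox() */

/* Multiply by 2 in GF(2^8), reducing by x^8 + x^4 + x^3 + x + 1. */
static uint8_t xtime(uint8_t a)
{
    return (uint8_t)((a << 1) ^ ((a & 0x80) ? 0x1B : 0x00));
}

/* General GF(2^8) product, used here only to find inverses. */
static uint8_t gmul(uint8_t a, uint8_t b)
{
    uint8_t p = 0;
    for (; b; b >>= 1, a = xtime(a))
        if (b & 1) p ^= a;
    return p;
}

static uint8_t rotl8(uint8_t x, int n)
{
    return (uint8_t)((x << n) | (x >> (8 - n)));
}

/* S-box: multiplicative inverse in GF(2^8), then the AES affine map. */
void aes_init_sbox(void)
{
    for (int x = 0; x < 256; x++) {
        uint8_t inv = 0;
        for (int y = 1; x != 0 && y < 256; y++)
            if (gmul((uint8_t)x, (uint8_t)y) == 1) { inv = (uint8_t)y; break; }
        sbox[x] = (uint8_t)(inv ^ rotl8(inv, 1) ^ rotl8(inv, 2) ^
                            rotl8(inv, 3) ^ rotl8(inv, 4) ^ 0x63);
    }
}

/* KeyExpansion: 11 round keys of 16 bytes each (call aes_init_sbox first). */
void aes128_expand_key(const uint8_t key[16], uint8_t rk[176])
{
    uint8_t rcon = 1;
    for (int i = 0; i < 16; i++) rk[i] = key[i];
    for (int i = 16; i < 176; i += 4) {
        uint8_t t[4] = { rk[i - 4], rk[i - 3], rk[i - 2], rk[i - 1] };
        if (i % 16 == 0) {   /* RotWord, SubWord, XOR round constant */
            uint8_t t0 = t[0];
            t[0] = (uint8_t)(sbox[t[1]] ^ rcon);
            t[1] = sbox[t[2]];
            t[2] = sbox[t[3]];
            t[3] = sbox[t0];
            rcon = xtime(rcon);
        }
        for (int j = 0; j < 4; j++) rk[i + j] = (uint8_t)(rk[i + j - 16] ^ t[j]);
    }
}

static void add_round_key(uint8_t s[16], const uint8_t *rk)
{
    for (int i = 0; i < 16; i++) s[i] ^= rk[i];
}

static void sub_bytes(uint8_t s[16])
{
    for (int i = 0; i < 16; i++) s[i] = sbox[s[i]];
}

/* Row r of the column-major state is rotated left by r positions. */
static void shift_rows(uint8_t s[16])
{
    uint8_t t[16];
    for (int c = 0; c < 4; c++)
        for (int r = 0; r < 4; r++)
            t[r + 4 * c] = s[r + 4 * ((c + r) % 4)];
    for (int i = 0; i < 16; i++) s[i] = t[i];
}

/* Each column is multiplied by the fixed matrix (2 3 1 1 / 1 2 3 1 / ...). */
static void mix_columns(uint8_t s[16])
{
    for (int c = 0; c < 4; c++) {
        uint8_t *p = s + 4 * c;
        uint8_t a0 = p[0], a1 = p[1], a2 = p[2], a3 = p[3];
        p[0] = (uint8_t)(xtime(a0) ^ xtime(a1) ^ a1 ^ a2 ^ a3);
        p[1] = (uint8_t)(a0 ^ xtime(a1) ^ xtime(a2) ^ a2 ^ a3);
        p[2] = (uint8_t)(a0 ^ a1 ^ xtime(a2) ^ xtime(a3) ^ a3);
        p[3] = (uint8_t)(xtime(a0) ^ a0 ^ a1 ^ a2 ^ xtime(a3));
    }
}

void aes128_encrypt_block(const uint8_t in[16], const uint8_t rk[176],
                          uint8_t out[16])
{
    uint8_t s[16];
    for (int i = 0; i < 16; i++) s[i] = in[i];
    add_round_key(s, rk);                         /* InitialRound */
    for (int round = 1; round <= 9; round++) {    /* Rounds */
        sub_bytes(s); shift_rows(s); mix_columns(s);
        add_round_key(s, rk + 16 * round);
    }
    sub_bytes(s); shift_rows(s);                  /* FinalRound */
    add_round_key(s, rk + 160);
    for (int i = 0; i < 16; i++) out[i] = s[i];
}
```

Typical use is `aes_init_sbox()`, then `aes128_expand_key(key, rk)` once per key, then `aes128_encrypt_block(in, rk, out)` for each block.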
Target System (1)
SoC CPU: AMD A4 4000K (2 cores @ 3.0 GHz, Richland architecture, AES-NI), cores denoted by BLUE
SoC integrated GPU HD7480 (iGPU): 2 SIMD units of 64 cores each (VLIW4 architecture), SIMD units denoted by RED
Discrete GPU AMD R7 250 (dGPU): 6 SIMD units of 64 cores each (GCN architecture), PCIe 2.0 x16 bus, SIMD units denoted by RED
Data to be encrypted denoted by GREEN
Software – C/C++/OpenCL, Linux Ubuntu 14.04 x64
Target System (2)
Algorithm Opt_1
• Array “indata” resides in global device memory (__global)
• Variable “state”, which holds the intermediate transformations, is kept in on-chip local memory (__local), referred to here as GPU cache
• Simple operation “ShiftRows” is implemented with vector component addressing (state.s05AF49E38.. )
• Simple operation “AddRoundKey” is a single XOR (state ^ key)
• Complex operation “SubBytes” uses a precomputed S-box table stored in constant memory
• Complex operation “MixColumns” uses precomputed Galois field multiplication tables stored in constant memory
• Host sample code below (simple blocking enqueues)
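while (!done()) {
    writeData(32MB, &offset);   // blocking write of the next 32 MB chunk to the device
    execKernel(32MB, &offset);  // run the AES kernel on that chunk
    readData(32MB, &offset);    // blocking read of the encrypted chunk back to the host
}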
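The table-driven approach of Opt_1 can be sketched in plain C rather than OpenCL. ShiftRows becomes a fixed permutation of the 16 state bytes, which is what a vector swizzle expresses; the standard permutation for a column-major state begins 0, 5, A, F, 4, 9, E, 3, 8, in line with the swizzle prefix shown above. MixColumns reads precomputed GF(2^8) products. The names `SHIFT_ROWS_PERM`, `mul2`, `mul3` and `init_tables` are hypothetical; in an OpenCL kernel the tables would be the ones placed in constant memory.

```c
#include <stdint.h>

/* ShiftRows as a fixed byte permutation of the column-major state:
   output byte i is input byte SHIFT_ROWS_PERM[i]. */
static const uint8_t SHIFT_ROWS_PERM[16] = {
    0, 5, 10, 15, 4, 9, 14, 3, 8, 13, 2, 7, 12, 1, 6, 11
};

/* Precomputed GF(2^8) products by 2 and by 3 (hypothetical table names). */
uint8_t mul2[256], mul3[256];

void init_tables(void)
{
    for (int x = 0; x < 256; x++) {
        /* Doubling, reduced by the AES polynomial 0x11B on overflow. */
        mul2[x] = (uint8_t)((x << 1) ^ ((x & 0x80) ? 0x11B : 0));
        mul3[x] = (uint8_t)(mul2[x] ^ x);   /* 3*x = 2*x + x */
    }
}

void shift_rows_perm(const uint8_t in[16], uint8_t out[16])
{
    for (int i = 0; i < 16; i++)
        out[i] = in[SHIFT_ROWS_PERM[i]];
}

/* MixColumns where every multiplication is a table lookup. */
void mix_columns_table(uint8_t s[16])
{
    for (int c = 0; c < 4; c++) {
        uint8_t *p = s + 4 * c;
        uint8_t a0 = p[0], a1 = p[1], a2 = p[2], a3 = p[3];
        p[0] = (uint8_t)(mul2[a0] ^ mul3[a1] ^ a2 ^ a3);
        p[1] = (uint8_t)(a0 ^ mul2[a1] ^ mul3[a2] ^ a3);
        p[2] = (uint8_t)(a0 ^ a1 ^ mul2[a2] ^ mul3[a3]);
        p[3] = (uint8_t)(mul3[a0] ^ a1 ^ a2 ^ mul2[a3]);
    }
}
```

Each MixColumns column costs eight table reads in this form, so the memory in which the tables live has a direct effect on throughput.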
Results Opt_1
• AMD CodeXL profiling, initial results: iGPU A4 4000, ~100 MB/s AES-128 ECB
Algorithm Opt_2
• Array “indata” will reside in global device memory (__global)
• Variable “state” which holds transformations will be in GPU cache (__local)
• Simple operation “ShiftRows” - unchanged
• Simple operation “AddRoundKey” – unchanged
• Complex operation “SubBytes” uses a precomputed S-box table, now stored in local memory (__local) instead of constant memory
• Complex operation “MixColumns” computes its values on the fly instead of reading precomputed tables (using an optimized version of MixColumns)
• Host sample code – unchanged
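One common way to compute MixColumns without tables is sketched below in plain C. The slides do not say which optimized version was used, so this formulation is an assumption: it needs only XORs and four doublings per column, by reusing the XOR of the whole column. The names `gf_double` and `mix_columns_compute` are hypothetical.

```c
#include <stdint.h>

/* Multiply by 2 in GF(2^8): shift left, then reduce with 0x1B if the
   top bit was set (AES polynomial x^8 + x^4 + x^3 + x + 1). */
uint8_t gf_double(uint8_t a)
{
    return (uint8_t)((a << 1) ^ ((a & 0x80) ? 0x1B : 0x00));
}

/* MixColumns computed on the fly, no lookup tables.
   For the first row: 2*a0 + 3*a1 + a2 + a3
                    = a0 + (a0 + a1 + a2 + a3) + 2*(a0 + a1),
   where + is XOR; the other rows follow by rotation. */
void mix_columns_compute(uint8_t s[16])
{
    for (int c = 0; c < 4; c++) {
        uint8_t *p = s + 4 * c;
        uint8_t a0 = p[0], a1 = p[1], a2 = p[2], a3 = p[3];
        uint8_t all = (uint8_t)(a0 ^ a1 ^ a2 ^ a3);
        p[0] = (uint8_t)(a0 ^ all ^ gf_double((uint8_t)(a0 ^ a1)));
        p[1] = (uint8_t)(a1 ^ all ^ gf_double((uint8_t)(a1 ^ a2)));
        p[2] = (uint8_t)(a2 ^ all ^ gf_double((uint8_t)(a2 ^ a3)));
        p[3] = (uint8_t)(a3 ^ all ^ gf_double((uint8_t)(a3 ^ a0)));
    }
}
```

This trades the table reads for a handful of ALU operations per column, which tends to suit GPUs where arithmetic is cheap relative to memory access.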
Results Opt_2
• Profiling, Opt_1: iGPU A4 4000, ~100 MB/s AES-128 ECB
• Profiling, Opt_2: iGPU A4 4000, ~210 MB/s AES-128 ECB (roughly 2.1x faster than Opt_1)
Algorithm Opt_3
• Array “indata” will reside in global device memory (__global)
• Variable “state” which holds transformations will be in GPU cache (__local)
• Simple operation “ShiftRows” - unchanged
• Simple operation “AddRoundKey” – unchanged
• Complex operation “SubBytes” – unchanged
• Complex operation “MixColumns” - unchanged
• Host sample code: overlap kernel execution with I/O by using multiple command queues (R = read, W = write, E = execute)
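The benefit of overlapping can be estimated with a small timing model, sketched here in plain C. It is an idealized model and not the OpenCL host code: it assumes every chunk has the same write, execute and read durations, that the three queues never compete for bandwidth, and that enough device buffers exist. The type `stage_times` and both function names are hypothetical, and the durations are arbitrary units, not measurements.

```c
#include <stddef.h>

/* Per-chunk stage durations in arbitrary time units (hypothetical). */
typedef struct { double write, exec, read; } stage_times;

/* Blocking enqueues: write, kernel and read of each chunk run back to back. */
double serial_time(size_t nchunks, stage_times t)
{
    return (double)nchunks * (t.write + t.exec + t.read);
}

static double max2(double a, double b) { return a > b ? a : b; }

/* Three queues (W, E, R): a stage starts once the previous stage of the
   same chunk is done and its own queue has finished the previous chunk. */
double overlapped_time(size_t nchunks, stage_times t)
{
    double w_done = 0.0, e_done = 0.0, r_done = 0.0;
    for (size_t i = 0; i < nchunks; i++) {
        w_done = w_done + t.write;
        e_done = max2(w_done, e_done) + t.exec;
        r_done = max2(e_done, r_done) + t.read;
    }
    return r_done;   /* completion time of the last read */
}
```

With the made-up values write = 1, exec = 8, read = 1 and 10 chunks, the model gives 100 units for the serial schedule and 82 for the overlapped one. In this model the gain is bounded by the slowest stage, so overlap helps most when transfer time is a noticeable share of the total.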
Algorithm Opt_3 (2)
Results Opt_3
• Right figure: AES-128 ECB throughput in MB/s, serial (Opt_2) vs overlapped (Opt_3)
• Figure below: 3 OpenCL queues (R, W, E) used for asynchronous enqueues, to overlap execution with I/O
Conclusions
iGPU AES performance is good: faster than the CPU without AES-NI, although the CPU with AES-NI remains fastest
Prefer local memory (__local, the GPU cache) over constant memory for lookup tables
Where possible, compare precomputed tables against computing values on the fly
Overlapping execution with I/O could improve iGPU performance by 10-20%
The share of the x86 SoC die occupied by the iGPU grows with each generation, so its contribution to AES throughput should grow as well
Memory transfers are expected to improve with each new generation, and CPU/iGPU performance with them