Final Presentation
Encryption on Embedded System
Supervisor: Ina Rivkin
Students: Chen Ponchek, Liel Shoshan
Spring 2014, Part B
Motivation
• Nowadays, many portable storage devices with large memories (such as disk-on-key drives, tablets, etc.) hold valuable data.
• There is therefore a concrete need for portable cryptography systems suited to such devices.
• Our project aims to provide a system that answers this need.
Project Goal
Main goal: implement an efficient embedded data-encryption system using the AES algorithm, and find a suitable architecture for a portable system.
Project Specifications
• Implemented on a Xilinx Zynq SoC.
• Suitable for portable systems (disk-on-key, tablets, etc.): a low-power system.
• Transparent system (while storing/loading files): the cryptography system won’t create traffic bottlenecks.
• Finding the best architecture according to the requirements above:
  • Profiling the AES algorithm.
  • Finding the balance between using the ARM processor and using the FPGA (the hardware accelerator needs more power).
AES Algorithm
• The Advanced Encryption Standard, also known as “Rijndael”, is a block cipher.
• The cipher is iterative, fast, and convenient to implement in both software and hardware, and it does not have high memory requirements.
• Most of the AES calculations are performed in 10 rounds.
• The Key Expansion schedule derives 10 round keys from the initial cipher key (the cipher key itself is used for the initial AddRoundKey).
• In each round the state block is described as a 2D, 4x4 array of bytes.
• Each round consists of 4 steps (the final round omits MixColumns):
1. SubBytes
2. ShiftRows
3. MixColumns
4. AddRoundKey
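The round structure above can be sketched in a few lines of Python. This is only an illustration, not the project's code: the function names (`bytes_to_state`, `shift_rows`, `round_steps`) are hypothetical, and the state is assumed to be filled column by column, as in the AES standard.

```python
def bytes_to_state(block: bytes) -> list[list[int]]:
    """Arrange 16 bytes into a 4x4 state, filled column by column (state[row][col])."""
    if len(block) != 16:
        raise ValueError("AES operates on 16-byte blocks")
    return [[block[row + 4 * col] for col in range(4)] for row in range(4)]


def shift_rows(state: list[list[int]]) -> list[list[int]]:
    """ShiftRows: rotate row r of the state left by r positions."""
    return [row[r:] + row[:r] for r, row in enumerate(state)]


def round_steps(num_rounds: int = 10) -> list[list[str]]:
    """List the step names per round; entry 0 is the initial key addition."""
    steps = [["AddRoundKey"]]
    for rnd in range(1, num_rounds + 1):
        names = ["SubBytes", "ShiftRows"]
        if rnd != num_rounds:  # the final round skips MixColumns
            names.append("MixColumns")
        names.append("AddRoundKey")
        steps.append(names)
    return steps


state = bytes_to_state(bytes(range(16)))
shifted = shift_rows(state)
steps = round_steps()
```

With this layout, `round_steps()` yields nine full rounds followed by a shorter final round, which matches the "x 9" loop in the integrated flow diagram.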
System Top View
[Block diagram of the ZedBoard: DDR, ARM PS (software), Programmable Logic (hardware), UART, AXI4 bus, BRAM]
Software Implementation
• Each step is implemented as a separate function.
• Each function is independent of the other functions.
• Code optimizations improved performance significantly.
• The encryption rate we achieved was 323 KB/s.
  • About 1.5 times slower than the typical maximum data rate in USB (typical rates are around 0.5 MB/s).
• Conclusion: a hardware accelerator is needed.
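The "1.5 times slower" figure follows from the two rates quoted above. A small worked check, assuming 1 MB = 1000 KB and 1 KB = 1000 bytes (the slide does not say which convention it uses); the per-block time is derived here and is not a number from the presentation:

```python
SW_RATE_KB_S = 323.0    # measured software encryption rate (from the slide)
USB_RATE_KB_S = 500.0   # "around 0.5 MB/s", assuming 1 MB = 1000 KB
BLOCK_BYTES = 16        # AES block size in bytes


def slowdown(target_kb_s: float, achieved_kb_s: float) -> float:
    """How many times slower the achieved rate is than the target rate."""
    return target_kb_s / achieved_kb_s


def block_time_us(rate_kb_s: float, block_bytes: int = BLOCK_BYTES) -> float:
    """Average time per block in microseconds, assuming 1 KB = 1000 bytes."""
    return block_bytes / (rate_kb_s * 1000.0) * 1e6


ratio = slowdown(USB_RATE_KB_S, SW_RATE_KB_S)   # about 1.55
per_block = block_time_us(SW_RATE_KB_S)         # roughly 50 microseconds per block
```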
Software Profiling
[Chart: distribution of the software’s running time by function]
Software Implementation Profiling
[Chart: encryption time split]
Hardware/Software Balancing
• The most time-consuming function is Mix Columns.
• Concurrency can be achieved by running Key Expansion and the encryption process simultaneously.
• To minimize data traffic between the PS and the PL, Add Round Key should be implemented in hardware.
Integrated System Block Diagram
[Block diagram of the ZedBoard: DDR, ARM PS (software), Programmable Logic (hardware), AXI4 bus, UART; function blocks: Sub Bytes, Shift Rows, Mix Columns, Add Round Key, Key Expansion]
Integrated System Flow Diagram
[Flow diagram splitting the AES steps between the ARM PS (software) and the Programmable Logic (hardware). Labels: Key, State, KeyExpansion, AddRoundKey, SubBytes, ShiftRows, MixColumns, x 9]
Integrated System Block Diagram
[Block diagram of the ZedBoard: DDR, ARM Processing System, UART, and Programmable Logic containing Key Expansion and Mixor (Mix Column + Add Round Key), with BRAMs attached over the AXI4 bus]
Handshake
• Synchronizes the ARM processor with the hardware modules.
• Communication protocol via BRAM.
• Processor side:
  • The processor writes data to the BRAM.
  • The processor raises the flag at a designated address in the BRAM.
• PL side:
  • Waits for the flag by continuously reading the designated address.
  • Executes.
  • Resets the flag.
• There is no need for synchronization in the opposite direction: the hardware always completes its run before the processor needs the data.
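The handshake steps can be modeled in software as a rough sketch. The BRAM is represented by a dictionary. The flag address 0x0 follows the state-machine slides, where 0x0 is written to address 0x0 to reset the flag. The data address 0x4 and all names (`Bram`, `processor_send`, `pl_step`) are hypothetical choices for this illustration.

```python
FLAG_ADDR = 0x0  # flag word (the state machines reset it by writing 0x0 here)
DATA_ADDR = 0x4  # hypothetical data address, chosen only for this sketch


class Bram:
    """A shared memory of 32-bit words, modeled as a dictionary."""

    def __init__(self) -> None:
        self.words: dict[int, int] = {}

    def read(self, addr: int) -> int:
        return self.words.get(addr, 0)

    def write(self, addr: int, value: int) -> None:
        self.words[addr] = value & 0xFFFFFFFF


def processor_send(bram: Bram, value: int) -> None:
    """Processor side: write the data first, then raise the flag."""
    bram.write(DATA_ADDR, value)
    bram.write(FLAG_ADDR, 1)


def pl_step(bram: Bram, operation) -> bool:
    """PL side: one polling iteration. Returns True if the operation ran."""
    if bram.read(FLAG_ADDR) != 1:
        return False                      # flag not raised: keep waiting
    result = operation(bram.read(DATA_ADDR))
    bram.write(DATA_ADDR, result)         # execute and store the result
    bram.write(FLAG_ADDR, 0)              # reset the flag
    return True
```

Writing the data before raising the flag matters: the PL only reads the data after it sees the flag, so it never reads a half-written input.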
Hardware Implementation: Key Expansion
• The key expansion schedule takes the initial cipher key as its only argument and outputs the expanded key.
• It reads the cipher key from the BRAM, where the PS wrote it.
• The output is written to a different BRAM.
• The procedure is independent of the other functions, so it can run as a background task, concurrently with the rest of the code.
• Implementing it in hardware made it possible for the ARM and the FPGA to run concurrently.
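For reference, the computation the Key Expansion module performs can be sketched in software. This is a minimal Python model of the standard AES-128 key schedule, not the project's VHDL or C code. The S-box is computed rather than tabulated to keep the block short. The function names are hypothetical, and the big-endian packing of bytes into words is an assumption. `write_to_bram` mirrors the Write2BRAM state, which places word i at address 0x20 + 4i.

```python
def xtime(a: int) -> int:
    """Multiply by 2 in GF(2^8) modulo the AES polynomial 0x11B."""
    a <<= 1
    return (a ^ 0x11B) & 0xFF if a & 0x100 else a


def build_sbox() -> list[int]:
    """Compute the AES S-box: multiplicative inverse, then the affine map."""
    exp, log = [0] * 255, [0] * 256
    x = 1
    for i in range(255):
        exp[i], log[x] = x, i
        x ^= xtime(x)                     # multiply by the generator 3
    sbox = []
    for a in range(256):
        inv = 0 if a == 0 else exp[(255 - log[a]) % 255]
        s = inv
        for shift in range(1, 5):         # XOR with four left rotations
            s ^= ((inv << shift) | (inv >> (8 - shift))) & 0xFF
        sbox.append(s ^ 0x63)
    return sbox


SBOX = build_sbox()


def sub_word(word: int) -> int:
    """Apply the S-box to each byte of a 32-bit word."""
    return int.from_bytes(bytes(SBOX[b] for b in word.to_bytes(4, "big")), "big")


def expand_key(key: bytes) -> list[int]:
    """Expand a 16-byte cipher key into 44 words (11 round keys of 4 words)."""
    if len(key) != 16:
        raise ValueError("this sketch handles 128-bit keys only")
    words = [int.from_bytes(key[4 * i:4 * i + 4], "big") for i in range(4)]
    rcon = 0x01
    for i in range(4, 44):
        temp = words[i - 1]
        if i % 4 == 0:
            temp = ((temp << 8) | (temp >> 24)) & 0xFFFFFFFF   # RotWord
            temp = sub_word(temp) ^ (rcon << 24)               # SubWord, Rcon
            rcon = xtime(rcon)
        words.append(words[i - 4] ^ temp)
    return words


def write_to_bram(words: list[int], base: int = 0x20) -> dict[int, int]:
    """Place word i at byte address base + 4*i, as in the Write2BRAM state."""
    return {base + 4 * i: w for i, w in enumerate(words)}
```

The 44 words correspond to the 1408-bit `key_out` signal in the state machine (44 x 32 bits), with index i running from 0 to 43.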
Key Expansion state machine flow
State        Outputs
idle         address_sig 0x0
RdCol1       address_sig 0x10
RdCol2       address_sig 0x14
RdCol3       address_sig 0x18
RdCol4       address_sig 0x1C
SaveCol4     address_sig 0x1C
InitFlag     address_sig 0x0, data_out_sig 0x0, BRAM_WE_B 1111
Expand       ena_key 1
Write2BRAM   address_sig 0x20 + 4i, data_out_sig key_out[1407-32i downto 1407-32(i+1)+1], BRAM_WE_B 1111, i := i + 1
FINISH       address_sig 0x0, data_out_sig 0x0, BRAM_WE_B 1111
Transition conditions shown in the diagram: flag = 0 / flag = 1; valid = 0 / valid = 1; i < 43 / i = 43
Key Expansion ChipScope waveform
• Reading the cipher key from BRAM
• Expanding the key and writing to BRAM
[Waveform signals: DATA_IN, ADDRESS, DATA_OUT]
Hardware Implementation: Mix Columns and Add Round Key
• Mixor is a combined module that implements both Mix Columns and Add Round Key.
• Both the round key and the state block are inputs to the module.
• It reads the state block from a BRAM shared with the PS.
• It reads the round key from a BRAM, where the Key Expansion module wrote it.
• The output is written to the shared BRAM, from which the PS reads the current state block.
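A software model of what Mixor computes for one column may help: the standard MixColumns transform followed by an XOR with the round-key column. This is a sketch under assumptions, not the project's hardware description. The function names are hypothetical, and `round_key_address` simply evaluates the ADDRESS_KEY expression from the Mixor state machine, 0x20 + 4 x [num_col + 4 x (round + 1)].

```python
def xtime(a: int) -> int:
    """Multiply by 2 in GF(2^8) modulo the AES polynomial 0x11B."""
    a <<= 1
    return (a ^ 0x11B) & 0xFF if a & 0x100 else a


def mix_column(col: list[int]) -> list[int]:
    """MixColumns on one 4-byte column: out[r] = 2*a[r] ^ 3*a[r+1] ^ a[r+2] ^ a[r+3]."""
    out = []
    for r in range(4):
        a0, a1, a2, a3 = (col[(r + k) % 4] for k in range(4))
        out.append(xtime(a0) ^ (xtime(a1) ^ a1) ^ a2 ^ a3)
    return out


def mixor(state_col: list[int], key_col: list[int]) -> list[int]:
    """Combined step: (col_mixed) xor (col_in_key), as in the Mix state."""
    return [m ^ k for m, k in zip(mix_column(state_col), key_col)]


def round_key_address(num_col: int, rnd: int, base: int = 0x20) -> int:
    """Byte address of a round-key column: base + 4 * (num_col + 4 * (rnd + 1))."""
    return base + 4 * (num_col + 4 * (rnd + 1))


mixed = mix_column([0xDB, 0x13, 0x53, 0x45])
```

Merging the two steps means the mixed column never has to travel back to the PS before the key is added, which fits the stated goal of minimizing PS/PL traffic.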
Mixor state machine flow
State      Outputs
idle       ADDRESS_DATA 0x0
RdCol      ADDRESS_DATA 0x4, ADDRESS_KEY 0x20 + 4 * [num_col + 4 * (round + 1)]
SaveCol    (none shown)
Mix        ADDRESS_DATA 0x8, DATA_OUT_DATA (col_mixed) xor (col_in_key), BRAM_WE_B_{num_col} 1111
InitFlag   ADDRESS_DATA 0x0, DATA_OUT_DATA 0x0, BRAM_WE_B_{num_col} 1111
Transition conditions shown in the diagram: flag = 0 / flag = 1
Mixor ChipScope waveform
• The Mixor module’s execution over the 1st column
[Waveform signals: data_in_data1, bram_we_1, data_out_data, data_in_key, address_key, col_mixed, address_data]
Hardware Blocks Implementation: Performance
• Mixor
  • HW implementation: 24 cycles = 0.24 µsec
  • SW implementation: 2.545 µsec
  • ~10 times faster
• Key Expansion
  • HW implementation: 93 cycles = 0.93 µsec
  • SW implementation: 15 µsec
  • ~15 times faster
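The figures above can be cross-checked with simple arithmetic. The cycle counts and times imply a 100 MHz clock in the PL (10 ns per cycle). This clock rate is inferred here from the numbers and is not stated on the slides. The speedups are the SW/HW time ratios; note that 15 / 0.93 comes out slightly above the rounded "~15" on the slide.

```python
def clock_mhz(cycles: int, time_us: float) -> float:
    """Clock frequency in MHz implied by a cycle count and its duration."""
    return cycles / time_us


def speedup(sw_time_us: float, hw_time_us: float) -> float:
    """How many times faster the hardware block is than the software function."""
    return sw_time_us / hw_time_us


mixor_clock = clock_mhz(24, 0.24)        # about 100 MHz
keyexp_clock = clock_mhz(93, 0.93)       # about 100 MHz
mixor_speedup = speedup(2.545, 0.24)     # about 10.6
keyexp_speedup = speedup(15.0, 0.93)     # about 16.1
```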
Conclusions
• The hardware modules are much faster than the software functions.
• The data-transfer overhead between the PS and the PL significantly reduces the system’s speed and causes a severe slowdown in performance: 68% of the running time.
• Main conclusion
  • The integrated system is best suited to algorithms with intensive calculations and low data traffic.
  • The AES algorithm has high data traffic, and therefore the hardware accelerator did not bring a significant performance improvement.
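One way to see why the fast hardware blocks did not help much is an Amdahl's-law style estimate. The sketch below assumes that the 68% figure is the share of the integrated system's total running time spent on PS/PL data transfer, and that this share is unaffected by faster hardware modules. Both are interpretations, not claims from the slides, and the function names are hypothetical.

```python
def overall_speedup(fixed_fraction: float, accel: float) -> float:
    """Overall speedup when only the non-fixed part of the run time gets `accel` times faster."""
    if not 0.0 < fixed_fraction <= 1.0 or accel <= 0.0:
        raise ValueError("fixed_fraction must be in (0, 1] and accel positive")
    return 1.0 / (fixed_fraction + (1.0 - fixed_fraction) / accel)


def max_speedup(fixed_fraction: float) -> float:
    """Upper bound on the speedup as the accelerated part approaches zero time."""
    if not 0.0 < fixed_fraction <= 1.0:
        raise ValueError("fixed_fraction must be in (0, 1]")
    return 1.0 / fixed_fraction


TRANSFER_FRACTION = 0.68                         # from the slide (interpretation assumed)
bound = max_speedup(TRANSFER_FRACTION)           # about 1.47
with_10x = overall_speedup(TRANSFER_FRACTION, 10.0)   # about 1.40
```

Under these assumptions, even arbitrarily fast computation could improve the integrated system by less than 1.5 times, so reducing the transfer overhead would matter more than further accelerating the modules.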
Demonstration