Final Presentation: Encryption on Embedded System

Supervisor: Ina Rivkin

Students: Chen Ponchek, Liel Shoshan

Spring 2014, Part B

Motivation

• Nowadays there are many portable storage systems with large memories that contain valuable data (such as disk-on-key drives, tablets, etc.).

• Therefore there is a concrete need for portable cryptography systems suited to such devices.

• In our project we aim to provide a system that answers this need.

Project Goal

Main goal: implementing an efficient embedded data-encryption system using the AES algorithm, and finding a suitable architecture for a portable system.

Project Specifications

• Implementation on a Xilinx Zynq SoC.

• Suitable for portable systems (disk-on-key, tablets, etc.): a low-power system.

• Transparent system (while storing/loading files): the cryptography system won’t create traffic bottlenecks.

• Finding the best architecture according to the requirements above:

  • Profiling the AES algorithm.

  • Finding the balance between using the ARM processor and using the FPGA (the hardware accelerator needs more power).

AES Algorithm

• The Advanced Encryption Standard, also known as “Rijndael”, is a block cipher.

• The cipher is iterative, fast and convenient to implement in both software and hardware, and it doesn’t have high memory requirements.

• Most of the AES calculations are performed in 10 rounds (the round count for a 128-bit key).

• The Key Expansion schedule derives 10 round keys from the initial cipher key; the cipher key itself is used for the initial AddRoundKey.

• In each round the state block is described as a 2D, 4x4 array of bytes.

• Each round consists of 4 steps (the final round omits MixColumns):

1. SubBytes

2. ShiftRows

3. MixColumns

4. AddRoundKey

[Figure: AES flow diagram; labels: KeyExpansion, Key]
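To make the 4x4 state layout concrete, here is a minimal Python sketch of how a 16-byte block might be arranged into the state and how ShiftRows acts on it. The function names are illustrative and are not taken from the project's code; the column-by-column layout follows the AES standard.

```python
# Sketch only: helper names are hypothetical, not the project's own.

def bytes_to_state(block):
    """Arrange 16 bytes column by column into a 4x4 state: state[row][col]."""
    assert len(block) == 16
    return [[block[row + 4 * col] for col in range(4)] for row in range(4)]

def state_to_bytes(state):
    """Flatten the state back into 16 bytes, column by column."""
    return bytes(state[row][col] for col in range(4) for row in range(4))

def shift_rows(state):
    """ShiftRows: rotate row r to the left by r byte positions."""
    return [row[r:] + row[:r] for r, row in enumerate(state)]

block = bytes(range(16))
state = bytes_to_state(block)
shifted = shift_rows(state)
```

Row 0 is left unchanged, while rows 1 to 3 are rotated by 1 to 3 positions, so each output column mixes bytes from all four input columns.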

System Top View

[Figure: ZedBoard top-level view; labels: DDR, ARM PS (software), Programmable Logic (hardware), UART, AXI4 bus, BRAM]

Software Implementation

• Each step is implemented as a separate function.
• Each function is independent of the other functions.
• Code optimizations improved performance significantly.
• The encryption rate we achieved was 323 KB/s.
  • About 1.5 times slower than the typical maximum data rate of USB (typical rates are around 0.5 MB/s).
• Conclusion: a hardware accelerator is needed.
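The rates above can be turned into a per-block time budget. The short Python sketch below assumes 1 KB = 1000 bytes and a 16-byte AES block; both the helper name and the unit convention are assumptions made for illustration.

```python
BLOCK_BYTES = 16  # AES block size in bytes

def block_time_us(rate_bytes_per_s):
    """Time available (or needed) per 16-byte block at a given byte rate, in microseconds."""
    return BLOCK_BYTES / rate_bytes_per_s * 1e6

sw_rate = 323e3    # measured software rate: 323 KB/s (assuming 1 KB = 1000 bytes)
usb_rate = 0.5e6   # typical USB rate quoted on the slide: 0.5 MB/s

sw_block_us = block_time_us(sw_rate)    # about 49.5 microseconds per block
usb_block_us = block_time_us(usb_rate)  # about 32 microseconds per block
slowdown = usb_rate / sw_rate           # about 1.5
```

Under these assumptions the software needs roughly 49.5 µs per block, while keeping up with the USB rate would allow only about 32 µs.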

Software Profiling

[Figure: distribution of the software’s running time by function]

Software Implementation Profiling

[Figure: encryption time split; labels: KeyExpansion, Key]

Hardware/Software Balancing

• The most time-consuming function is Mix Columns.
• Concurrency can be achieved by running Key Expansion and the encryption process simultaneously.
• To minimize data traffic between PS and PL, Add Round Key should be implemented in hardware.

Integrated System Block Diagram

[Figure: ZedBoard block diagram; labels: DDR, ARM PS (software), Programmable Logic (hardware), AXI4 bus, UART, Add Round Key, Shift Rows, Key Expansion, Mix Columns, Sub Bytes]

Integrated System Flow Diagram

[Figure: flow diagram dividing the AES steps between the ARM PS (software) and the Programmable Logic (hardware). Labels: State, Key, KeyExpansion, AddRoundKey; SubBytes, ShiftRows, MixColumns, AddRoundKey (x 9); SubBytes, ShiftRows, AddRoundKey]

Integrated System Block Diagram

[Figure: ZedBoard block diagram; labels: DDR, ARM Processing System, UART, Programmable Logic, BRAMs on the AXI4 bus, Key Expansion, Mixor (Mix Column, Add Round Key)]

Handshake

• Synchronizes the ARM processor and the hardware modules.
• Communication protocol via BRAM.
• Processor side:
  • The processor writes data to the BRAM.
  • The processor raises the flag at a designated address in the BRAM.
• PL side:
  • Waits for the flag by continuously reading the designated address.
  • Executes.
  • Resets the flag.
• There is no need for synchronization in the opposite direction: the hardware always completes its run before the processor needs the data.
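The handshake steps above can be sketched as a small software model. This is only a simulation under stated assumptions: the BRAM is modelled as a list of 32-bit words, the flag lives at word 0 and the data at word 1 (both addresses are hypothetical), and the PL "execution" is a stand-in function.

```python
FLAG_ADDR = 0  # hypothetical word index of the flag
DATA_ADDR = 1  # hypothetical word index of the data

class Bram:
    """Toy model of a shared BRAM holding 32-bit words."""
    def __init__(self, size=16):
        self.words = [0] * size

    def read(self, addr):
        return self.words[addr]

    def write(self, addr, value):
        self.words[addr] = value & 0xFFFFFFFF

def ps_send(bram, value):
    """Processor side: write the data first, then raise the flag."""
    bram.write(DATA_ADDR, value)
    bram.write(FLAG_ADDR, 1)

def pl_poll(bram, operation):
    """PL side, one polling iteration: run only if the flag is raised."""
    if bram.read(FLAG_ADDR) != 1:
        return False                       # keep waiting
    result = operation(bram.read(DATA_ADDR))
    bram.write(DATA_ADDR, result)          # execute and store the result
    bram.write(FLAG_ADDR, 0)               # reset the flag
    return True

def invert(word):
    """Stand-in for the hardware operation."""
    return word ^ 0xFFFFFFFF

bram = Bram()
idle_result = pl_poll(bram, invert)   # flag not raised yet: nothing happens
ps_send(bram, 0x00001234)
busy_result = pl_poll(bram, invert)   # flag raised: PL executes and clears it
```

Writing the data before raising the flag matters: the PL may read the data as soon as it sees the flag, so the order of the two writes is what makes the one-directional handshake safe.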

[Figure: ARM and PL connected through BRAMs on the AXI4 bus; PL modules: Key Expansion, Mixor (Mix Column, Add Round Key)]

Hardware Implementation: Key Expansion

• The key expansion schedule takes the initial cipher key as its only argument and outputs the expanded key.
• It reads the cipher key from the BRAM, where the PS wrote it.
• The output is written to a different BRAM.
• The procedure is independent of the other functions, so it can run as a background task, concurrently with the rest of the code.
• Concurrency of the ARM and the FPGA was achieved by implementing it in hardware.
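For reference, the computation that the hardware module performs can be sketched in software. The Python below is a generic AES-128 key expansion following the published standard, not the project's VHDL or C code; the S-box is generated rather than tabulated to keep the sketch short, and all function names are illustrative.

```python
def xtime(b):
    """Multiply by x (i.e. by 2) in GF(2^8) with the AES polynomial."""
    b <<= 1
    return (b ^ 0x11B) if b & 0x100 else b

def make_sbox():
    """Build the AES S-box: multiplicative inverse followed by the affine map."""
    exp, log = [0] * 256, [0] * 256
    x = 1
    for i in range(255):          # powers of the generator 3
        exp[i], log[x] = x, i
        x ^= xtime(x)             # x * 3 = x * 2 xor x
    sbox = []
    for a in range(256):
        inv = 0 if a == 0 else exp[(255 - log[a]) % 255]
        s = inv
        for k in range(1, 5):     # xor with the four left rotations
            s ^= ((inv << k) | (inv >> (8 - k))) & 0xFF
        sbox.append(s ^ 0x63)
    return sbox

SBOX = make_sbox()
RCON = [0x01, 0x02, 0x04, 0x08, 0x10, 0x20, 0x40, 0x80, 0x1B, 0x36]

def sub_word(word):
    """Apply the S-box to each byte of a 32-bit word."""
    return int.from_bytes(bytes(SBOX[b] for b in word.to_bytes(4, "big")), "big")

def expand_key(key):
    """Return the 44 32-bit words of the AES-128 expanded key."""
    assert len(key) == 16
    w = [int.from_bytes(key[4 * i:4 * i + 4], "big") for i in range(4)]
    for i in range(4, 44):
        temp = w[i - 1]
        if i % 4 == 0:
            temp = ((temp << 8) | (temp >> 24)) & 0xFFFFFFFF   # RotWord
            temp = sub_word(temp) ^ (RCON[i // 4 - 1] << 24)    # SubWord, Rcon
        w.append(w[i - 4] ^ temp)
    return w

words = expand_key(bytes.fromhex("2b7e151628aed2a6abf7158809cf4f3c"))
```

The key used here is the example key from the AES standard's key-expansion appendix. Because each word depends only on the cipher key and earlier words, the whole schedule can run without any input from the encryption rounds, which is what allows it to work in the background.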

[Figure: ARM/PL block diagram with a detail view of the Key Expansion module; labels: BRAM, AXI4 bus, Key Expansion, BRAM]

Key Expansion state machine flow

[Figure: state machine diagram. States and their outputs:]

• idle: address_sig = 0x0
• RdCol4: address_sig = 0x1C
• RdCol3: address_sig = 0x18
• SaveCol4: address_sig = 0x1C
• Expand: ena_key = 1
• Write2BRAM: address_sig = 0x20 + 4i; data_out_sig = key_out[1407-32i downto 1407-32(i+1)+1]; BRAM_WE_B = 1111; i := i + 1
• FINISH: address_sig = 0x0; BRAM_WE_B = 1111; data_out_sig = 0x0
• InitFlag: address_sig = 0x0; BRAM_WE_B = 1111; data_out_sig = 0x0

Transition conditions shown in the diagram: flag = 0, flag = 1, valid = 0, valid = 1, i < 43, i = 43
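The Write2BRAM state maps word i of the expanded key to a BRAM address and a 32-bit slice of key_out. A minimal Python sketch of that mapping follows, assuming key_out is a 1408-bit vector (44 words, i = 0..43) with bit 1407 as the most significant bit; the function names are hypothetical.

```python
BASE_ADDR = 0x20    # first expanded-key word, from the state machine
TOTAL_BITS = 1408   # assumed width of key_out: 44 words of 32 bits

def word_address(i):
    """Byte address written for word i: 0x20 + 4i."""
    return BASE_ADDR + 4 * i

def word_slice(key_out, i):
    """Bits [1407-32i downto 1407-32(i+1)+1] of key_out, as an integer."""
    low_bit = TOTAL_BITS - 32 * (i + 1)   # equals 1407 - 32(i+1) + 1
    return (key_out >> low_bit) & 0xFFFFFFFF

# Build a test vector whose word i simply holds the value i + 1.
key_out = 0
for i in range(44):
    key_out |= (i + 1) << (TOTAL_BITS - 32 * (i + 1))

first_word = word_slice(key_out, 0)
last_word = word_slice(key_out, 43)
last_address = word_address(43)
```

With this layout word 0 occupies the top 32 bits of key_out and lands at address 0x20, while word 43 occupies the bottom 32 bits and lands at address 0xCC.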

Key Expansion ChipScope waveform

• Reading the cipher key from BRAM
• Expanding the key and writing to BRAM

[Figure: ChipScope waveforms; signals: DATA_IN, ADDRESS, DATA_OUT]

Hardware Implementation: Mix Columns and Add Round Key

• Mixor is a combined module that implements both Mix Columns and Add Round Key.
• Both the round key and the state block are inputs to the module.
• It reads the state block from a BRAM shared with the PS.
• It reads the round key from a BRAM, where the Key Expansion module wrote it.
• The output is written to the shared BRAM, from which the PS reads the current state block.
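What Mixor computes for one column can be sketched in a few lines of Python: the standard MixColumns transform on a 4-byte column, XORed with the matching round-key column. This is a software illustration of the operation, not the project's hardware description, and the function names are assumptions.

```python
def xtime(b):
    """Multiply by 2 in GF(2^8) with the AES reduction polynomial."""
    b <<= 1
    return (b ^ 0x11B) if b & 0x100 else b

def mix_column(col):
    """MixColumns on one column: multiply by the fixed matrix (2 3 1 1 / ...)."""
    a0, a1, a2, a3 = col
    return [
        xtime(a0) ^ (xtime(a1) ^ a1) ^ a2 ^ a3,
        a0 ^ xtime(a1) ^ (xtime(a2) ^ a2) ^ a3,
        a0 ^ a1 ^ xtime(a2) ^ (xtime(a3) ^ a3),
        (xtime(a0) ^ a0) ^ a1 ^ a2 ^ xtime(a3),
    ]

def mixor(col, key_col):
    """Combined step: (col_mixed) xor (col_in_key)."""
    return [m ^ k for m, k in zip(mix_column(col), key_col)]

mixed = mix_column([0xDB, 0x13, 0x53, 0x45])
combined = mixor([0xDB, 0x13, 0x53, 0x45], [0xFF, 0xFF, 0xFF, 0xFF])
```

Merging the two steps means the mixed column never has to travel back to the PS before the key is added, which is consistent with the goal of reducing PS-PL data traffic.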

[Figure: ARM/PL block diagram with a detail view of the Mixor module (Mix Column, Add Round Key) between two BRAMs]

Mixor state machine flow

[Figure: state machine diagram. States and their outputs:]

• idle: ADDRESS_DATA = 0x0
• RdCol: ADDRESS_DATA = 0x4; ADDRESS_KEY = 0x20 + 4 x [num_col + 4 x (round + 1)]
• SaveCol
• Mix: ADDRESS_DATA = 0x8; DATA_OUT_DATA = (col_mixed) xor (col_in_key); BRAM_WE_B_{num_col} = 1111
• InitFlag: ADDRESS_DATA = 0x0; DATA_OUT_DATA = 0x0; BRAM_WE_B_{num_col} = 1111

Transition conditions shown in the diagram: flag = 0, flag = 1
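The ADDRESS_KEY expression in the RdCol state can be checked with a short Python sketch. It assumes that round counts from 0 and num_col from 0 to 3, and that the expanded key starts at byte address 0x20 with one 32-bit word per column; the function name is hypothetical.

```python
KEY_BASE = 0x20  # address of expanded-key word 0, from the state machine

def round_key_address(num_col, round_index):
    """ADDRESS_KEY = 0x20 + 4 * [num_col + 4 * (round + 1)]."""
    return KEY_BASE + 4 * (num_col + 4 * (round_index + 1))

first = round_key_address(0, 0)   # first column of the first derived round key
word_index = (first - KEY_BASE) // 4
```

Under these assumptions, round 0 and column 0 select word 4 of the expanded key, i.e. the first word after the original cipher key, which matches the "round + 1" offset in the formula.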

Mixor ChipScope waveform

• Execution of the Mixor module over the 1st column

[Figure: ChipScope waveforms; signals: data_in_data1, bram_we_1, data_out_data, data_in_key, address_key, col_mixed, address_data]

Hardware Blocks Implementation: Performance

• Mixor
  • HW implementation: 24 cycles = 0.24 µsec
  • SW implementation: 2.545 µsec
  • ~10 times faster
• Key Expansion
  • HW implementation: 93 cycles = 0.93 µsec
  • SW implementation: 15 µsec
  • ~15 times faster
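The figures above can be cross-checked with a few lines of Python. The clock frequency is not stated on the slide; it is inferred here from "24 cycles = 0.24 µsec", and the speedups are simple ratios of the quoted (possibly rounded) times.

```python
mixor_cycles, mixor_hw_us, mixor_sw_us = 24, 0.24, 2.545
keyexp_cycles, keyexp_hw_us, keyexp_sw_us = 93, 0.93, 15.0

# Cycles per microsecond is the clock frequency in MHz (inferred, not stated).
implied_clock_mhz = mixor_cycles / mixor_hw_us
keyexp_clock_mhz = keyexp_cycles / keyexp_hw_us

mixor_speedup = mixor_sw_us / mixor_hw_us      # about 10.6
keyexp_speedup = keyexp_sw_us / keyexp_hw_us   # about 16.1
```

Both modules imply the same 100 MHz PL clock, and the ratios come out at roughly 10.6 and 16.1, in line with the approximate speedups quoted on the slide.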

Conclusions

• The hardware modules are much faster than the software functions.
• The overhead of data transmission between PS and PL significantly decreases the system’s speed and causes a severe slowdown in performance: 68% of the running time.
• Main conclusion
  • The integrated system is best suited to computation-intensive algorithms with low data traffic.
  • The AES algorithm has high data traffic, and therefore the hardware accelerator did not yield a significant performance improvement.
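An Amdahl-style estimate illustrates why the transfer overhead dominates. The sketch below assumes that the 68% figure is the share of the integrated system's running time spent on PS-PL transfers, and that this share is unaffected by how fast the hardware modules compute; both are interpretations rather than statements from the slides.

```python
def max_further_speedup(fixed_fraction):
    """Upper bound on speedup when only the non-fixed part can be accelerated."""
    return 1.0 / fixed_fraction

def speedup(fixed_fraction, accel_factor):
    """Speedup when the remaining part runs accel_factor times faster."""
    return 1.0 / (fixed_fraction + (1.0 - fixed_fraction) / accel_factor)

transfer_fraction = 0.68                         # from the conclusions above
bound = max_further_speedup(transfer_fraction)   # about 1.47
ten_times = speedup(transfer_fraction, 10.0)     # about 1.40
```

Under these assumptions, even infinitely fast computation would improve the integrated system by less than 1.5x, so further gains would have to come from reducing the data traffic itself.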

Demonstration
