CprE / ComS 583 Reconfigurable Computing

CprE / ComS 583Reconfigurable Computing

Prof. Joseph ZambrenoDepartment of Electrical and Computer EngineeringIowa State University

Lecture #21 – HW/SW Codesign

CprE 583 – Reconfigurable ComputingNovember 2, 2006 Lect-21.2

Quick Points

• Midterm graded and returned• Average – 84.5• Median – 85.0• Maximum – 95.0• Minimum – 72.0• Standard Deviation – 6.65


HW #4 Discussion

• Problem 1 – did just a simple adder work?• Problem 2 – how did you implement the

permutation table?• Problem 3 – did you use a counter?


AES-128E Algorithm12

8-b

itp

lain

text

128-

bit

key

Round Transformation

round++

round = 10?

SubBytes

ShiftRows

AddRoundKey

128-bitciphertext

Yes

No

Key

Exp

ansi

onMixColumns


S-box

0 1 2 3 4 5 6 7 8 9 a b c d e f0123456789abcdef

63 7c 77 7b f2 6b 6f c5 30 01 67 2b fe d7 ab 76ca 82 c9 7d fa 09 21 25 c5 c3 ee 1e a6 4f 8c 20b7 fd 93 26 36 fa 25 3a 46 b5 aa 0c a6 3a 77 5d04 c7 23 c3 18 42 dd dc b5 95 8c 57 69 84 00 dc09 83 2c 1a 1b c7 20 8c fa e1 cb 6d 0d fa 43 7353 d1 00 ed 20 43 41 99 6d 0d 44 5a 8e 2a 60 4ad0 ef aa fb 43 1b 12 4a e1 f2 ff 4a bc fd 0f a651 a3 40 8f 92 cc 3a a6 fc 09 c7 ef 00 25 89 aacd 0c 13 ec 5f 4a 35 af 4f 12 dc ed 51 3b 1b dc60 81 4f dc 2a 4a 46 81 55 43 88 bc 67 13 15 bae0 32 3a 0a 06 6f 25 60 23 00 c3 60 60 48 b5 49e7 c8 37 6d d5 12 ed 45 43 a6 25 4f 19 e1 ed 4fba 25 25 2e a6 2c 34 a6 97 ee 36 33 33 1f 6f 5170 b5 b5 66 40 3f 1b bb b5 63 1b b5 6f 25 ba 4ae1 98 11 11 56 af b5 b5 20 aa b5 2a 4a 00 60 228c 89 0d 0d 82 00 ba 8c 4f 43 17 0d 04 b5 09 09

y

x

Overview of AES (cont.)

• 128-bit input is copied into a two-dimensional (4x4) byte array referred to as the state• Round transformations operate on the state array• Final state copied back into 128-bit output

• AES makes use of a non-linear substitution function that operates on a single byte• Can be simplified as a look-up table (S-box)


AES-128E Modules: SubBytes

Key

Exp

ansi

on

128-

bit

pla

inte

xt12

8-b

itk

ey


round++

round = 10?

SubBytes

ShiftRows

AddRoundKey

128-bitciphertext

Yes

No

MixColumns

SubBytes

• S-box transformation performed independently on each byte of the state

S-boxS0,0 S0,1 S0,2 S0,3

S1,0 S1,2 S1,3

S2,0 S2,1 S2,2 S2,3

S3,0 S3,1 S3,2 S3,3

Sr,c

S'0,0 S'0,1 S'0,2 S'0,3

S'1,0 S'1,2 S'1,3

S'2,0 S'2,1 S'2,2 S'2,3

S'3,0 S'3,1 S'3,2 S'3,3

S'r,cstate[i] state'[i]


AES-128E Modules: ShiftRows

Key

Exp

ansi

on

128-

bit

pla

inte

xt12

8-b

itk

ey


round++

round = 10?

SubBytes

ShiftRows

AddRoundKey

128-bitciphertext

Yes

No

MixColumns

ShiftRows

• Bytes in the last three rows of the state are shifted cyclically over variable offsets

S0,0 S0,1 S0,2 S0,3

S1,0 S1,1 S1,2 S1,3

S2,0 S2,1 S2,2 S2,3

S3,0 S3,1 S3,2 S3,3

S'0,0 S'0,1 S'0,2 S'0,3

S'1,1 S'1,2 S'1,3 S'1,0

S'2,2 S'2,3 S'2,0 S'2,1

S'3,3 S'3,0 S'3,1 S'3,2

state[i] state'[i]


AES-128E Modules: MixColumns

Key

Exp

ansi

on

128-

bit

pla

inte

xt12

8-b

itk

ey


round++

round = 10?

SubBytes

ShiftRows

AddRoundKey

128-bitciphertext

Yes

No

MixColumns

MixColumns

• Modulo polynomial-basis multiplication performed on each column of the state

• Can be simplified as series of AND and XOR operations

state[i] state'[i]

{03h}

{02h}

S0,0 S0,2 S0,3

S1,0 S1,2 S1,3

S2,0 S2,2 S2,3

S3,0 S3,2 S3,3

S0,1

S1,1

S2,1

S3,1

S'0,0 S'0,2 S'0,3

S'1,0 S'1,2 S'1,3

S'2,0 S'2,2 S'2,3

S'3,0 S'3,2 S'3,3

S'0,1

S'1,1

S'2,1

S'3,1


MixColumns Implementation

entity MixColumns is

port (STATE_IN : in STATEtype; RNUM_IN : in RNUMtype; STATE_OUT : out STATEtype);

end MixColumns;

architecture behavior of MixColumns is signal tSTATE : STATEtype;

begin

process(STATE_IN) variable t1, t2 : std_logic_vector(7

downto 0); begin for i in 0 to 3 loop for j in 0 to Nb-1 loop

-- Multiply by 2 t1 := STATE_IN(i mod 4)(j)(6 downto 0) & '0'; if (STATE_IN(i mod 4)(j)(7) = '1') then t1 := t1 xor x"1b"; end if;

-- Multiply by 3 t2 := STATE_IN((i+1) mod 4)(j)(6 downto 0) & '0'; if (STATE_IN((i+1) mod 4)(j)(7) = '1') then t2 := t2 xor x"1b"; end if; t2 := t2 xor STATE_IN((i+1) mod 4)(j);

tSTATE(i)(j) <= t1 xor t2 xor STATE_IN((i+2) mod 4)(j) xor STATE_IN((i+3) mod 4)(j); end loop;end loop;

end process;


AES-128E Modules: AddRoundKey

Key

Exp

ansi

on

128-

bit

pla

inte

xt12

8-b

itk

ey


round++

round = 10?

SubBytes

ShiftRows

AddRoundKey

128-bitciphertext

Yes

No

MixColumns

AddRoundKey

• Words from the round-specific key are XORed into columns of the state

S0,0 S0,2 S0,3

S1,0 S1,2 S1,3

S2,0 S2,2 S2,3

S3,0 S3,2 S3,3

S'0,0 S'0,2 S'0,3

S'1,0 S'1,2 S'1,3

S'2,0 S'2,2 S'2,3

S'3,0 S'3,2 S'3,3

S0,1

S1,1

S2,1

S3,1

S'0,1

S'1,1

S'2,1

S'3,1

Rkey[i]

w[0] w[2] w[3]w[1]

state[i] state'[i]


AddRoundKey Implementation

entity AddRoundKey is port(STATE_IN : in STATEtype; KEY_IN : in KEYtype; STATE_OUT : out STATEtype);end AddRoundKey;

architecture behavior of AddRoundKey isbegin

process(STATE_IN, KEY_IN) begin

for j in 0 to (Nb-1) loop STATE_OUT(0)(j) <= STATE_IN(0)(j) xor KEY_IN(j)(31 downto 24); STATE_OUT(1)(j) <= STATE_IN(1)(j) xor KEY_IN(j)(23 downto 16); STATE_OUT(2)(j) <= STATE_IN(2)(j) xor KEY_IN(j)(15 downto 8); STATE_OUT(3)(j) <= STATE_IN(3)(j) xor KEY_IN(j)(7 downto 0); end loop;

end process;

end behavior;


AES-128E Modules: KeyExpansion

Key

Exp

ansi

on

128-

bit

pla

inte

xt12

8-b

itk

ey


round++

round = 10?

SubBytes

ShiftRows

AddRoundKey

128-bitciphertext

Yes

No

MixColumns

KeyExpansion

• Initial 128-bit key is converted into separate keys for each of the 10 required rounds

• Consists of Sbox transformations and some XORs

128-

bit

key

Rkey[1]

Rkey[2]

Rkey[3]

Rkey[4]

Rkey[5]

Rkey[6]

Rkey[7]

Rkey[8]

Rkey[9]

Rkey[10]

S

S

S

S

rcon

w[0]

w[1]

w[2]

w[3]

w[4]

w[5]

w[6]

w[7]


Design Decisions

• Online/offline key generation• Inter-round layout decisions

• Round unrolling• Round pipelining

• Intra-round layout decisions• Transformation pipelining• Transformation partitioning

• Technology mapping decisions• S-box synthesis as Block SelectRAM,

distributed ROM primitives, or logic gates


Round Unrolling / Pipelining

• Unrolling replaces a loop body (round) with N copies of that loop body

• AES-128E algorithm is a loop that iterates 10 times – N є [1, 10]• N = 1 corresponds to original looping case• N = 10 is a fully unrolled implementation

• Pipelining is a technique that increases the number of blocks of data that can be processed concurrently• Pipelining in hardware can be implemented by

inserting registers• Unrolled rounds can be split into a certain number of

pipeline stages• These transformations will increase throughput but

increase area and latency


Unrolling factor = 10Unrolling factor = 2Unrolling factor = 1Unrolling factor = 5

Round Unrolling / Pipelining (cont.)In

pu

tp

lain

text

R1 Ou

tpu

tC

iph

erte

xtR2 R3 R4 R5

R6R7R8R9R10

Round pipelining = ON


Transformation Partitioning/Pipelining

• FPGA maximum clock frequency depends on critical logic path• Inter-round transformations can’t improve

critical path• Individual transformations can be pipelined with

registers similar to the rounds• Transformations that are part of the maximum

delay path can be partitioned and pipelined as well

• Can result in large gains in throughput with only minimal area increases


Transformation pipelining = ON

Partitioning / Pipelining (cont.)

Transformation partitioning = ON

SubBytes ShiftRows MixColumns

KeyExpansion

AddRoundKey

KeyExpansionB KeyExpansionCKeyExpansionA


S-box Technology Mapping

• With synthesis primitives, can map the S-box lookup tables to different hardware components

• Two S-boxes can fit on a single Block SelectRAM

constant SSYNROMSTYLE: string := “select_rom”; -- {logic, select_rom}entity Sbox is

port(BYTE_IN : in std_logic_vector(7 downto 0); BYTE_OUT : out std_logic_vector(7 downto 0));

attribute syn_romstyle : string; attribute syn_romstyle of BYTE_OUT : signal is SSYNROMSTYLE;

end Sbox;...

Sample VHDL code


Recap – Retiming


weight(e) = weight(e) + lag(head(e)) - lag(tail(e))

Recap – Retiming (cont.)


Retiming and Pipelining

• Can use this retiming to pipeline• Assume have enough (infinite supply) of

registers at edge of circuit• Retime them into circuit• See [WeaMar03A] for details


Recap – Retiming and Covering


Outline

• HW #4 Discussion• Recap• HW/SW Codesign

• Motivation• Specification• Partitioning• Automation


Hardware/Software Codesign

• Definition 1 – the concurrent and co-operative design of hardware and software components of an embedded system

• Definition 2 – A design methodology supporting the cooperative and concurrent development of hardware and software (co-specification, co-development, and co-verification) in order to achieve shared functionality and performance goals for a combined system [MicGup97A]


Motivation

• Not possible to put everything in hardware due to limited resources

• Some code more appropriate for sequential implementation

• Desirable to allow for parallelization, serialization

• Possible to modify existing compilers to perform the task


Why put CPUs on FPGAs?

• Shrink a board to a chip• What CPUs do best:

• Irregular code• Code that takes advantage of a highly optimized

datapath

• What FPGAs do best:• Data-oriented computations• Computations with local control


• Most recent work addressing this problem assumes relatively slow bus interface

• FPGA has direct interface to memory in this model

General-PurposeProcessor

Memory

FPGA

Memory bus

Computational Model


Hardware/Software Partitioning

CPUHW

Accelerator

if (foo < 8) {for (i=0; i<N; i++)

x[i] = y[i]*z[i];}


Methodology

• Separation between function, and communication

• Unified refinable formal specification model

• Facilitates system specification

• Implementation independent

• Eases HW/SW trade-off evaluation and partitioning

• From a more practical perspective:• Measure the application• Identify what to put onto the accelerator• Build interfaces


Informal Specification,Constraints

System model

Architecture design

HW/SW implementation

PrototypeTest

Implementation

Fail

Success

Componentprofiling

Performanceevaluation

System-Level Methodology


Concurrency

• Concurrent applications provide the most speedup

CPU accelerator

if (a > b) ... x[i] = y[i] * z[i]

No data dependencies


Process 2

Process 3

Process 1

Partitioning

• Can divide the application into several processes that run concurrently

• Process partitioning exposes opportunities for parallelismif (i>b) …

for (i=0; i<N; i++) …

for (j=0; j<N; j++) ...


process (a, b, c) in port a, b; out port c;{ read(a); … write(c);}

Specification

Line (){ a = … … detach}

Processor

Capture

Model FPGA

Partition

Synthesize

Interface

Automating System Partitioning

• Good partitioning mechanism:1)Minimize communication across bus

2)Allows parallelism both hardware (FPGA) and processor operating concurrently

3)Near peak processor utilization at all times (performing useful work)


task

Software Hardware

List of tasks List of tasks

Partitioning Algorithms

• Assume everything initially in software• Select task for swapping• Migrate to hardware and evaluate cost

• Timing, hardware resources, program and data storage, synchronization overhead

• Cost evaluation and move evaluation similar to what we’ve seen regarding mincut and simulated annealing


Multi-threaded Systems

• Single thread:

• Multi-thread:


Performance Analysis

• Single threaded:• Find longest possible

execution path

• Multi-threaded with no synchronization:• Find the longest of

several execution paths

• Multi-threaded with synchronization:• Find the worst-case

synchronization conditions


Multi-threaded Performance Analysis

• Synchronization causes the delay along one path to affect the delay along another

synchronization point

ta tb

tc td

Delay = max(ta, tb) + td


Control

• Need to signal between CPU and accelerator• Data ready• Complete

• Implementations:• Shared memory• Handshake

• If computation time is very predictable, a simpler communication scheme may be possible


Application Program

Operating System

I/O driver

I/O bus

Application hardware (custom)

I/O driver

I/O bus

Send, Receive, Wait

Register reads/writes

Interrupt service

Bus transactionsInterrupts

Communication Levels

• Easier to program at application level• (send, receive, wait) but difficult to predict

• More difficult to specify at low level• Difficult to extract from program but timing and

resources easier to predict


d1

d2d3

p1 p2 p3

r2

r3

FPGAControl/Data FIFO

Other Interface Models

• Synchronization through a FIFO• FIFO can be implemented either in hardware or in

software• Effectively reconfigure hardware (FPGA) to allocate

buffer space as needed• Interrupts used for software version of FIFO


Debugging

• Hard to test a CPU/accelerator system:• Hard to control and observe the accelerator

without the CPU• Software on CPU may have bugs

• Build separate test benches for CPU code, accelerator

• Test integrated system after components have been tested


Graphical EFSMGraphical EFSM ESTERELESTEREL ................................

CompilersCompilers

PartitioningPartitioning

Sw SynthesisSw Synthesis

FormalFormal

VerificationVerification

Sw Code + Sw Code +

RTOSRTOSLogic NetlistLogic NetlistSimulationSimulation

Hw SynthesisHw SynthesisIntfc + RTOSIntfc + RTOS

SynthesisSynthesis

CFSMsCFSMs

Rapid prototypingRapid prototyping

POLIS Codesign Methodology


Codesign Finite State Machines

• POLIS uses an FSM model for• Uncommitted• Synthesizable• VerifiableControl-dominated HW/SW specification

• Translators from • State diagrams, • Esterel, ECL, ReactiveJava• HDLsInto a single FSM-based language


CFSM behavior

• Four-phase cycle: Idle Detect input events Execute one transition Emit output events

• Software response could take a long time:• Unbounded delay assumption

• Need efficient hw/sw communication primitive:• Event-based point-to-point communication


CFSM2CFSM2

CFSM3CFSM3

C=>GC=>G

CFSM1CFSM1

C=>FC=>FB=>CB=>C

F^(G==1)F^(G==1)

(A==0)=>B(A==0)=>B

C=>AC=>ACFSM1CFSM1 CFSM2CFSM2

C=>BC=>B

FF

GG

CCCC

BBAA

C=>GC=>G

C=>BC=>B

• Globally Asynchronous, Locally Synchronous (GALS) model

Network of CFSMs


Summary

• Hardware/software codesign complicated and limited by performance estimates

• Algorithms not generally as good as human partitioning

• Other interesting issues include dual processors, special memory interfaces

• Will likely evolve at faster rate as compilers evolve

CprE / ComS 583 Reconfigurable Computing

Documents

Transcript of CprE / ComS 583 Reconfigurable Computing