HJ94, Spring 2006, Ingrid Verbauwhede, lecture 9
Domain specific processors Lecture 9
Ingrid Verbauwhede
Departement Elektrotechniek, afdeling ESAT/COSIC
Overview
• Lecture 1: what is a system-on-chip
• Lecture 2: terminology for the different steps
• Lecture 3: models of computation
• Lecture 4: two MOCs: SDFG & control flow
• Lecture 5: control flow & FIR example
• Lecture 6: fixed point refinement
• Lecture 7: architecture exploration
• Lecture 8: DSP processors
• Lecture 9 (DSP): domain specific processors
HJ94 goal: Skiing down a mountain
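[Figure: design flow drawn as ski runs from the Specification at the top down to the implementation targets: ASIC, special purpose, retargetable coprocessor, DSP processor, DSP-RISC, RISC.]
Steps on the way down:
• Algorithm transformations
• Memory transformations and optimizations
• Floating-point to fixed-point
Other labels in the figure: SPW, Matlab, C++; pipelining, unrolling; loop merging, compaction; 40 bit accumulator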
• DSP = one class of domain specific processors
References
• The origins:
– E.A. Lee, “Programmable DSP Processors,” Part I, IEEE ASSP Magazine, October 1988, pp. 4-19.
– Part II, IEEE ASSP Magazine, January 1989, pp. 4-14.
• Good overview:
– P. Lapsley, J. Bier, A. Shoham, E.A. Lee, “DSP Processor Fundamentals: Architectures and Features,” IEEE Press, 1998.
• Domain specific processor:
– I. Verbauwhede, “Low Power DSPs,” Chapter 19 in Low Power Electronics and Design, edited by Christian Piguet, CRC Press, 2005.
– Other domains: security and cryptography, wireless communications
DSP processors
• Last lecture: DSP = domain specific processor
– Highly optimized for wireless communication
– EVERY component of the processor:
  • Datapath = MAC
  • Memory = Harvard or Modified Harvard
  • Address arithmetic: indirect, modulo, bit reverse (FFT)
  • Control: CISC with specialized instruction set
– Example of FIR calculation
• Today:
– More domain specific processors
– Types of co-processors
Application domain: wireless communications
[Figure: block diagram of a cellular phone. RF board: receiver, transmit, synthesize, PA, TCXO. Baseband board: digital ASIC, microprocessor, DSP, analog ASIC, audio codec, external memories, power supply, battery pack, keypad and display.]
DSP is example of “domain specific” processor
Performance requirements: digital cellular phone
[Figure: communication application chain. Receive path: RF receive, demodulation, channel decoder, speech decoder. Send path: speech encoder, channel encoder, modulation, RF send.]
Goal: Minimum “MIPS” to get the job done.
Application Domain: compute intensive functions
Source encoder/decoder = speech coders
Advanced vocoders for improved speech quality & higher capacity. Example: ACELP derivatives for GSM and IS136A
• Digital filtering (FIR, IIR)
• Vector quantization, code book search (square distance computation)
Channel encoder/decoder = error correcting
Complex wireless modems:
• Galois field arithmetic
• Convolution coders based on Viterbi trellis search
• Turbo coders
Compute intensive functions: evolution of DSPs
Simple FIR example
Square distance for speech processing
Speed-up of FIR example
Viterbi acceleration for communication algorithms
Evolution of DSPs follows these examples
The Viterbi Decoding (Introduction)
• Error correcting decoding algorithm for convolutional codes
• Trellis representation
• Maximum likelihood decoding algorithm
• GSM system
Convolutional Code (ex. Wyner-Ash Code)
• Generator matrix G(D) = [ 1  1+D ]
• Input sequence u(D) = 1, 1, 0, 1, 0, …
• Output sequence c(D) = u(D)G(D) = 11, 10, 01, 11, 01, …
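The output sequence above can be reproduced with a few lines of Python. This is only a minimal sketch: the function name `wyner_ash_encode` is made up for illustration, and the encoder is assumed to start with its delay element D cleared to 0.

```python
def wyner_ash_encode(bits):
    """Rate-1/2 encoder for G(D) = [1, 1+D]; assumes the delay element starts at 0."""
    previous = 0                      # content of the single delay element D
    coded = []
    for u in bits:
        first = u                     # generator 1: the input bit itself
        second = u ^ previous         # generator 1+D: input XOR delayed input
        coded.append(f"{first}{second}")
        previous = u
    return coded


print(wyner_ash_encode([1, 1, 0, 1, 0]))  # ['11', '10', '01', '11', '01']
```

Each input bit yields two coded bits, which is the rate 1/2 discussed on the next slide.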
[Figure: encoder circuit with one delay element D]
Constraint length K and Rate
• v = 1, K = 2, 2 states
• Rate = 1/2: one input bit generates two coded output bits.
[Figure: encoder with one delay element D and its state diagram. States 0 and 1; edges labeled input,output: 0,00 (0 to 0), 1,11 (0 to 1), 1,10 (1 to 1), 0,01 (1 to 0).]
Trellis Representation
• Example: G(D) = [ 1+D^2  1+D+D^2 ], v = 2, K = 3, 4 states
• Instead of drawing a state diagram, unroll it in time as a trellis.
[Figure: encoder with two delay elements D, and the trellis for t = 0 to 4 with states S00, S10, S01, S11. Branches are labeled with the coded output bits: 00 and 11 leaving S00, 01 and 10 leaving S10, 11 and 00 leaving S01, 10 and 01 leaving S11.]
Efficiency of Viterbi decoding
• Identifies the most likely path through the trellis
– Selects a survivor path for each state by calculating the Hamming distance
• The total number of paths grows exponentially with the number of states
– As K increases, hardware complexity increases exponentially but the error rate decreases
Viterbi Decoding Algorithm (1)
• Assume N = 7 blocks
[Figure: trellis with states S00, S10, S01, S11 for t = 0 to 7; branches labeled with the coded output bits.]
Block               1   2   3   4   5   6   7
Information data    0   1   1   0   1   0   0
Convolution codes   00  11  10  10  00  01  11
Error sequence      00  01  10  00  00  10  00
Received data       00  10  00  10  00  11  11
Tail bit: marked at the end of the information data.
Viterbi Decoding Algorithm (2)
• Calculate the Hamming distance (choose the smaller one)
[Figure: the same trellis for t = 0 to 7, annotated with the accumulated Hamming distance at each state for the first steps. The information data, convolution codes, error sequence and received data rows are the same as on slide (1).]
Viterbi Decoding Algorithm (3)
• Selecting the optimal path
[Figure: the same trellis for t = 0 to 7 with the accumulated Hamming distance at every state and the optimal path marked. The information data, convolution codes, error sequence and received data rows are the same as on slide (1).]
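The three steps above (build the trellis, accumulate Hamming distances, select the optimal path) can be tried out on the N = 7 example. The sketch below is one possible hard-decision implementation, not the lecture's own code. It assumes the K = 3 code G(D) = [1+D^2, 1+D+D^2] from the trellis slide, a start in S00, and tail bits that bring the encoder back to S00, so traceback starts there. The names `branch`, `encode` and `viterbi` are illustrative.

```python
INF = float("inf")


def branch(state, u):
    """One trellis branch. state = 2*u[t-1] + u[t-2]; returns (next_state, (c1, c2))."""
    s1, s2 = state >> 1, state & 1
    c1 = u ^ s2            # generator 1 + D^2
    c2 = u ^ s1 ^ s2       # generator 1 + D + D^2
    return (u << 1) | s1, (c1, c2)


def encode(bits):
    """Convolutional encoder starting in S00."""
    state, coded = 0, []
    for u in bits:
        state, c = branch(state, u)
        coded.append(c)
    return coded


def viterbi(received):
    """Hard-decision Viterbi decoding; assumes the path starts and ends in S00."""
    metric = [0, INF, INF, INF]
    history = []                      # per step: survivor (previous state, input bit)
    for r in received:
        new_metric = [INF] * 4
        survivor = [None] * 4
        for s in range(4):
            if metric[s] == INF:
                continue              # state not reachable yet
            for u in (0, 1):
                nxt, c = branch(s, u)
                d = metric[s] + (c[0] ^ r[0]) + (c[1] ^ r[1])   # add Hamming distance
                if d < new_metric[nxt]:                         # compare and select
                    new_metric[nxt] = d
                    survivor[nxt] = (s, u)
        metric = new_metric
        history.append(survivor)
    state, bits = 0, []               # traceback from S00 (tail bits)
    for survivor in reversed(history):
        state, u = survivor[state]
        bits.append(u)
    return bits[::-1], metric[0]


received = [(0, 0), (1, 0), (0, 0), (1, 0), (0, 0), (1, 1), (1, 1)]
print(viterbi(received))  # ([0, 1, 1, 0, 1, 0, 0], 3)
```

With the received data of the example, the decoder recovers the information data 0 1 1 0 1 0 0, and the final path metric of 3 equals the number of bit errors in the error sequence.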
Traceback
• For some applications we cannot wait for the end of the sequence
• The amount of “delay” is called the traceback depth LD.
– A larger LD gives better performance but needs more memory and complexity
Viterbi in GSM
• Full-rate speech channel, 22.8 kbps: Rate = 1/2, K = 5
• Half-rate speech channel, 11.4 kbps: Rate = 1/3, K = 7
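To get a feel for what these constraint lengths mean for the decoder, the small sketch below computes the trellis size. It assumes the usual relation of 2^(K-1) states, which matches K = 3 giving 4 states in the trellis example, and two states per butterfly. The function name `trellis_size` is illustrative.

```python
def trellis_size(K):
    """Return (number of states, number of butterflies) for constraint length K."""
    states = 2 ** (K - 1)
    return states, states // 2


for name, K in (("full-rate", 5), ("half-rate", 7)):
    states, butterflies = trellis_size(K)
    print(f"{name}: K = {K}, {states} states, {butterflies} butterflies per trellis step")
```

Under this assumption the full-rate channel needs a 16-state decoder and the half-rate channel a 64-state decoder.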
Required Performance
Compute Intensive function 2: Viterbi
[Figure: Viterbi butterfly. Old states i and i + s/2 connect to new states 2i and 2i+1 with branch metrics +a, -a, -a, +a.]
i = state index
s = # of states = 2^(k-1)
w = decoding window
Basic equations:
d(2i) = min { d(i) + a, d(i + s/2) - a }
d(2i + 1) = min { d(i) - a, d(i + s/2) + a }
IS-95: k = 8, w = 192, corresponds to 2^7 x 192 x (cycles for one ACS)
Basic algorithm in Viterbi channel decoders, modified version in turbo decoders.
Key operation: Add-Compare-Select (ACS)
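The butterfly equations can be written out as one Add-Compare-Select pass over all states. The sketch below is only an illustration: the name `acs_stage`, the per-butterfly branch metric list `a`, and the decision bit convention (1 when the path from state i + s/2 wins) are assumptions, not taken from the slides.

```python
def acs_stage(d, a):
    """One trellis step: d holds the s old path metrics, a[i] the branch metric of butterfly i."""
    half = len(d) // 2                # s/2 butterflies
    new_d = [0] * len(d)
    decision = [0] * len(d)
    for i in range(half):
        candidates = (
            (d[i] + a[i], d[i + half] - a[i]),   # add: two paths into state 2i
            (d[i] - a[i], d[i + half] + a[i]),   # add: two paths into state 2i+1
        )
        for j, (upper, lower) in enumerate(candidates):
            new_d[2 * i + j] = min(upper, lower)        # compare and select the minimum
            decision[2 * i + j] = int(lower < upper)    # remember which path survived
    return new_d, decision


print(acs_stage([0, 4, 2, 6], [1, 3]))  # ([1, -1, 3, 1], [0, 0, 1, 0])
```

The decision bits are what a traceback step later reads to recover the surviving path, which is why the processors on the following slides store a path indicator next to the shortest distance.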
Viterbi on Atmel’s Lode
Two MAC units & ALU: Add-Compare-Select
• DMAC operates as dual add/subtract unit
• ALU finds minimum
• Shortest distance saved
• Path indicator saved
• 4 cycles / butterfly
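[Figure: datapath with data buses DB0(16) and DB1(16), adders in MAC0 and MAC1 forming Γ1 + µ1 and Γ2 + µ2, accumulators A0 to A3, and the ALU Min() producing Γ and a decision bit that goes to memory.]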
Γ = min [(Γ1 + µ1), (Γ2 + µ2)]
Viterbi on TI C54x
ALU and CSSU: Add-Compare-Select
• ALU splits in 16 bit halves
• ACC splits in half
• Shortest distance saved
• CSSU compares halves
• Path indicator saved
• 4 cycles / butterfly
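[Figure: datapath with data buses DB0(16) and DB1(16), TREG, the ALU split into two 16-bit adders forming Γ1 + µ1 and Γ2 + µ2, the accumulator holding both halves, the compare unit with MSW/LSW select putting Γ on data bus EB to memory, and the decision bit going to the TRN register.]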
Γ = min [(Γ1 + µ1), (Γ2 + µ2)]
BUT: DSP Software Development
• Complex DSP architecture not amenable to compiler technology
• Algorithms are modeled in high level language (e.g. C++)
• Solutions are implemented and debugged in hand-optimized assembler - large development effort with minimal tool support
[Figure: development flow. HLL algorithmic model, then prototype code, then production code in hand coded assembler, with an optimize & debug loop.]
Long, frustrating time to market
Fragile legacy code
Widely used in handhelds, but basestations are changing to VLIW
2G Basestation Baseband Processing
• Multiple DSPs used for baseband processing
• RISC microcontroller for timing, framing, I/O control
• Software upgradable over the network
• DSPs dominate cost and power consumption
[Figure: Tx/Rx baseband processing board for a 2-carrier GSM basestation. Multiple DSPs, a RISC microcontroller, an ASIC, RAM, I/O blocks, a T1/E1 interface and two AFEs with Tx and Rx paths; functions shown are channel equalization, channel de/coding and encryption.]
Future trend: integrate baseband processing for a low cost Pico BTS
Compiler Driven VLIW
Large orthogonal register set, regular interconnect
[Figure: data memory and register array connected through an interconnect to execution units ex1 (alu), ex2 (alu), ex3 (mpy), ex4 (ld/st), ..., exn (ld/st).]
Instruction format: cond/branch | ex1 | ex2 | ex3 | ... | exn
Atomic RISC-like operations => heavily pipelined, high frequency clock
Explicitly Parallel Instruction Computing
Execution Clusters
[Figure: data memory shared by two clusters, each with its own register array and interconnect; execution units ex1 (alu), ex2 (alu), ex3 (ld/st), ex4 (alu), ex5 (mpy), ex6 (ld/st).]
Execution Sets
[Figure: a fetch set divided into execution sets, with the bit pattern 1 1 1 0 1 0 1 0 shown alongside.]
Texas Instruments ‘C6201
[Figure: two clusters, each with ALU, shift, mpy and add units, on Register Bank A (16 x 32) and Register Bank B (16 x 32); Instruction Dispatch & Decode fed over a 256 bit path from Program Memory (16K x 32); Data Memory (32K x 16).]
• 8-way VLIW with two execution clusters
• 256 bit (8x32) instruction fetch with variable length execute set
• Each 32 bit instruction individually predicated
• 11 stage pipeline
• 1600 MIPS, 400 MMACs @ 200 MHz
FIR Filter on TI ‘C6x
loop:
ldw .d1t1 *a4++,a5
|| ldw .d2t2 *b4++,b5
||[b0] sub .s2 b0,1,b0
||[b0] b .s1 loop
|| mpy .m1x a5,b5,a6
|| mpyh .m2x a5,b5,b6
|| add .l1 a7,a6,a7
|| add .l2 b7,b6,b7
Hand-coded assembly: 32-tap FIR filter
• Outer loop: 23 cycles, 180 bytes
– 1 cycle in inner loop
• All 8 exec units used in inner loop: maximum efficiency
– 2 MACs per cycle
Assembly syntax is more difficult to learn.
It is hard to get full use of all 8 execution units at once.
Software pipelining is difficult to implement and requires a longer prolog/epilog (larger code size).
Courtesy: Gareth Hughes: Bell Labs Australia
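The "2 MACs per cycle" of the inner loop come from loading two 16-bit samples and two 16-bit coefficients with each pair of ldw instructions, multiplying one half-word pair with mpy and the other with mpyh, and keeping two running sums (a7 and b7 in the listing). The Python sketch below mimics only that dataflow for one output sample. It does not model the pipeline or the loop prolog/epilog, and the function name `fir_output_dual_mac` is made up for illustration.

```python
def fir_output_dual_mac(samples, coeffs):
    """One FIR output computed with two accumulators, two taps per loop iteration."""
    assert len(samples) == len(coeffs) and len(samples) % 2 == 0
    acc_a = 0                         # plays the role of accumulator a7
    acc_b = 0                         # plays the role of accumulator b7
    for k in range(0, len(samples), 2):
        acc_a += samples[k] * coeffs[k]            # one half-word product (mpy)
        acc_b += samples[k + 1] * coeffs[k + 1]    # the other half-word product (mpyh)
    return acc_a + acc_b              # the two partial sums are combined after the loop


print(fir_output_dual_mac([1, 2, 3, 4], [5, 6, 7, 8]))  # 70
```

For the 32-tap filter of the slide this loop body would run 16 times per output sample.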
Viterbi on TI ‘C6x
Cycle 0 2 4 6 8 10 12 14 16 18 20 22 24 26 28 30
.D1 STH new 8 LDH old0 STH new 0 STH new 8 LDH old0 STH new 0 STH new 8 LDH old0 STH new 0 STH new 8 LDH old0 STH new 0 STH new 8 LDH sd1 STH m[2] STH m[3]
.D2 ADD tr LDH old1 LDH mj ADD tr LDH old1 LDH mj ADD tr LDH old1 LDH mj ADD tr LDH old1 LDH mj SUB m LDH sd0 STH m[5] STH m[4]
.M1 MPY mj *MPY b0 MPY a0 MPY mj *MPY b0 MPY a0 MPY mj *MPY b0 MPY a0 MPY mj *MPY b0 MPY a0
.M2 MPY a8 *MPY b8 MPY a8 *MPY b8 MPY a8 *MPY b8 MPY a8 *MPY b8
.L1 CMPGT t0 SUB b0 CMPGT t0 SUB b0 CMPGT t0 SUB b0 CMPGT t0 SUB b0 ADD m0 SUB -m0
.L2 CMPGT t8 ADD b8 SUB a8 CMPGT t8 ADD b8 SUB a8 CMPGT t8 ADD b8 SUB a8 CMPGT t8 ADD b8 SUB a8 SUB old SUB -m1 SUB m1 SUB I
.S1 B JLOOP ADD a0 SUB k B JLOOP ADD a0 SUB k B JLOOP ADD a0 SUB k B JLOOP ADD a0 SUB k
.S2 SUB j SHL tr *ADD t0,t8 SUB j SHL tr *ADD t0,t8 SUB j SHL tr *ADD t0,t8 SUB j SHL tr *ADD t0,t8 ADD tr B JLOOP MVK j
Cycle 1 3 5 7 9 11 13 15 17 19 21 23 25 27 29 31
.D1 STH new 0 STH new 8 LDH old0 STH new 0 STH new 8 LDH old0 STH new 0 STH new 8 LDH old0 STH new 0 STH new 8 LDH old0 STH new 0 STH m[0] STH m[1] LDH old1
.D2 LDH mj ADD tr LDH old1 LDH mj ADD tr LDH old1 LDH mj ADD tr LDH old1 LDH mj ADD tr LDH old1 STH trans STH m[1] STH m[6] LDH old0
.M1 MPY a0 MPY mj *MPY b0 MPY a0 MPY mj *MPY b0 MPY a0 MPY mj *MPY b0 MPY a0 MPY mj *MPY b0
.M2 *MPY b8 MPY a8 *MPY b8 MPY a8 *MPY b8 MPY a8 *MPY b8 MPY a8 MPY mj
.L1 CMPGT t0 SUB b0 CMPGT t0 SUB b0 CMPGT t0 SUB b0 CMPGT t0 SUB b0 SUB new ADD old ADD SP
.L2 SUB a8 CMPGT t8 ADD b8 SUB a8 CMPGT t8 ADD b8 SUB a8 CMPGT t8 ADD b8 SUB a8 CMPGT t8 ADD b8
.S1 SUB k B JLOOP ADD a0 SUB k B JLOOP ADD a0 SUB k B JLOOP ADD a0 SUB k B JLOOP ADD a0 MVK k
.S2 *ADD t0,t8 SUB j SHL tr *ADD t0,t8 SUB j SHL tr *ADD t0,t8 SUB j SHL tr *ADD t0,t8 SUB j SHL tr B JLOOP
Utilization of execution units in Viterbi decoder
• 16-state Viterbi decoder for GSM from the TI WWW site: ftp://ftp.ti.com/pub/tms320bbs/c62xfiles/vitgsm.asm
– 3 cycles per butterfly
– 32 cycles per GSM timeslot (8 butterflies)
– MPY instructions used to move data
[Figure label: inner loop pattern x 8]
Viterbi on TI ‘C6x
3-cycle 2-ACS Inner-Loop
(loop executed x 8)
LOOP:
  [b1]  b     .s1   LOOP
||[b1]  sub   .s2   b1,1,b1
||[!a2] sth   .d1   b12,*+a6[8]
||[!a2] add   .d2   b0,b14,b14
||      cmpgt .l1   a11,a10,a1
||      cmpgt .l2   b11,b10,b0
||      mpy   .m1x  1,b5,a4

  [a2]  sub   .s1   a2,1,a2
||[!a2] sth   .d1   a12,*a6++
||[a1]  add   .s2   2,b0,b0
||[b0]  mpy   .m2   1,b11,b12
||      mpy   .m1   1,a10,a12
||      sub   .l2x  a7,b5,b10
||      ldh   .d2   *++b9,b5

        shl   .s2   b14,2,b14
||[a1]  mpy   .m1   1,a11,a12
||      add   .s1   a7,a4,a10
||      sub   .l1x  b13,a4,a11
||      add   .l2   b13,b5,b11
||      mpy   .m2   1,b10,b12
||      ldh   .d2   *b4++[2],a7
||      ldh   .d1   *a5++[2],b13
; end of LOOP
Lucent / Motorola Star*Core SC140
• 6-way VLIW with 128 bit (8x16) instruction fetch
• Prefix instructions for high performance without sacrificing code density
• Each execution set (parallel instructions + prefix) predicated
• 5 stage pipeline
• 1800 MIPS, 1200 MMACs @ 300 MHz
[Figure: Program / Data Memory, Program Sequencer, Instruction Dispatcher, Address Registers (27) with two AAUs, Data Registers (16) with four data units each containing MAC, ALU and BFU.]
Viterbi on Star*Core
• Hardware support for the Viterbi algorithm:
– max2vit instruction
– vsl instruction
• 1 cycle per butterfly through software pipelining
• Decision bits are manually stored using the Viterbi Shift Left (VSL) instruction
GSM (K=5, 16 states), x 4:
[ move.2l (r0)+,d0:d1   move.2l (r1)+,d1:d2 ]
[ add2 d0,d4   sub2 d6,d2
  sub2 d4,d0   add2 d2,d6 ]
[ max2vit d4,d2   max2vit d0,d6 ]
[ vsl.4w d2:d6:d1:d3,(r2)+n0
  vsl.4f d2:d6:d1:d3,(r3)+n0 ]
[Figure: data flow of max2vit d4,d2 / max2vit d0,d6 and vsl.4w d2:d6:d1:d3,(r2)+n0. Labels: SR, D1, D3, D2, D6, decisions, path metrics, results written to memory.]
Courtesy: Gareth Hughes: Bell Labs Australia
SOC
Energy-Efficient SoC are distributed
[‘Under the Hood’, EET, D. Carey, 9/5/02]
[Figure: T-Mobile PocketPC Phone teardown. Components: TI baseband DSP, HTC interface ASIC, TI power management, Intel 32Mb flash, Intel 128Mb flash, Winbond 128Mb SDRAM, TI RF synth, TI RF TX/RX, Conexant power amp, Intel StrongArm, Sony LCD interface, Sony 240x320 color LCD, Philips audio codec, touchscreen, SIM, MMIC Expansion.]
PalmPilot i705: architecture tuned to application
[Figure: PalmPilot i705 teardown. Components: display, AD7873 digitizer, Motorola DragonBall, 8M SDRAM, 4M FLASH, FPGA, Philips USB, Maxim transceivers, Agere POM baseband, Motorola transceiver, RF Micro power amp, Maxim control, driver, memory card slot.]
OMAP 2420 platform (TI)
OMAP 2420 features
• Intended for high-volume wireless handset manufacturers
• Application processor “all-in-one entertainment”
• Supports all wireless standards
• Dual core: ARM11 (330 MHz) + DSP C55x (220 MHz)
• 2D/3D graphics accelerator at 2 Mega-Polygon/s for gaming applications
• Image & video accelerator for 4 Megapixel cameras and 30 frame/s VGA video support
• 5 Mbits SRAM to boost streaming media performance
Energy-flexibility trade-off
[Figure: chart with axes labeled Power and Cost. Labels: General Purpose, Platform, Application, Fixed, ASIC, ???.]
Also general purpose architectures become heterogeneous.
[Figure: Xilinx device with embedded blocks: IBM PowerPC® RISC CPU, synchronous dual-port RAM, SelectIO-Ultra™, SystemIO™ & XCITE™, Conexant 3.125Gb serial, XtremeDSP™. Source: Xilinx webpage]
Question
• Energy and flexibility are opposite demands!
• How to navigate in this jungle?
• 3D design space:
– Computational abstraction level
– Reconfigurable feature
– Binding rate
• Next question: how to map (or compile) an application onto such an architecture?
Flexibility (1) - Abstraction level
Computational Abstraction Level
• Instruction set level = “programmable”
• CLB level = “reconfigurable”
Flexibility (2) - Reconfigurable feature
• Basic components: computation, storage, communication
Reconfigurable feature per computational abstraction level:
Level                          Computation                   Storage            Communication
Implementation                 CLB                           RAM details        Switches, muxes
Micro-architecture             Execution unit type           Register file      Cross-bar, busses
Instruction set architecture   Custom instructions           Register set       Size of address/data bus
Systems                        Number & type of processes    Memory hierarchy   Interconnect network
Flexibility (3) - Binding rate
Binding rate: compare processing to binding
• Configurable (“compile-time”)
• Re-configurable
• Dynamic reconfigurable (“adaptive”)
SOC architecture: RINGS
[Figure: RINGS SoC architecture. CPU, MEMORY, RF, baseband processing, video engine and DSP connected by a reconfigurable interconnect; domain-specific hardware. Each domain is drawn with its own stack of abstraction levels. Networking: software, networking, medium access, baseband proc, µarchitecture, circuit. Video: standard, algorithm, architecture, µarchitecture, circuit. Signal proc: algorithm, architecture, µarchitecture, assembly.]
Instruction set extension
• Instruction set extension
• Register mapped
• Tightly coupled
• Experiment: DFT
1000 iterations   SW on embedded proc.   SW with HW datapath   Improvement
Energy            67.6 mJ                5.76 mJ               12.5 times
Co-processor
• Memory mapped
• Loosely coupled
• Experiment: AES
[Figure: co-processor with local memory]
175 iterations   SW on emb. proc.   SW with HW datapath   Improvement
Energy           89.2 mJ            13.5 mJ               25 times
Independent IP
• Loosely coupled
• Network on chip connected
• Flexible interconnect
• Experiment: TCP/IP checksum
[Figure: independent IP block attached to routers]
100 packets   SW on emb. proc.   HW datapath   Improvement
Energy        17.0 mJ            0.20 mJ       84 times
Example: The Security Pyramid
[Figure: security pyramid with the levels Protocol, Algorithm, Architecture, Micro-Architecture, Circuit. Labels: identification, confidentiality, integrity; Kasumi, Rijndael, RC4, MD5, …; Java, JCA, JVM; CPU, Crypto, MEM; D, Q, CLK, Vcc.]
Example: AES Coprocessor
[Figure: AES coprocessor CORE with Input FSM, Proc FSM and Output FSM, an Encrypt unit and a Key Schedule unit connected by the roundkey, instruction and handshake signals; bus widths 16, 16, 256, 256.]
[DAC 2002]
Design options: AES acceleration: Gbits/Joule
AES 128bit key, 128bit data   Throughput      Power    Figure of Merit (Gb/s/W)
0.18µm CMOS                   2 Gbits/sec     56 mW    35.7 (1/1)
FPGA [1]                      1.32 Gbit/sec   490 mW   2.7 (1/11)
Asm Pentium III [2]           648 Mbits/sec   41.4 W   0.015 (1/1900)
C Emb. Sparc [3]              133 Kbits/sec   120 mW   0.0011 (1/33000)
Java [4] Emb. Sparc           450 bits/sec    120 mW   0.0000037 (1/9.600.000)
[1] Amphion CS5230 on Virtex2 + Xilinx Virtex2 Power Estimator
[2] Helger Lipmaa PIII assembly handcoded + Intel Pentium III (1.13 GHz) Datasheet
[3] gcc, 1 mW/MHz @ 120 MHz Sparc, assumes 0.25 µ CMOS
[4] Java on KVM (Sun J2ME, non-JIT) on 1 mW/MHz @ 120 MHz Sparc, assumes 0.25 µ CMOS
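The figure of merit on this slide is throughput divided by power. Gb/s per W is the same unit as Gbits per Joule, which is why the slide title speaks of Gbits/Joule. The short sketch below recomputes it from the throughput and power numbers quoted on the slide; the function name `figure_of_merit` and the list layout are illustrative only.

```python
def figure_of_merit(bits_per_second, watts):
    """Energy efficiency in Gb/s/W, i.e. Gbits per Joule."""
    return bits_per_second / 1e9 / watts


# (design, throughput in bits/s, power in W), values as quoted on the slide
designs = [
    ("0.18um CMOS", 2e9, 0.056),
    ("FPGA", 1.32e9, 0.490),
    ("Asm Pentium III", 648e6, 41.4),
    ("C Emb. Sparc", 133e3, 0.120),
    ("Java Emb. Sparc", 450.0, 0.120),
]

for name, throughput, power in designs:
    print(f"{name:16s} {figure_of_merit(throughput, power):.7f} Gb/s/W")
```

The computed values agree with the figure-of-merit column of the slide after rounding; the gap between the dedicated CMOS design and Java on the embedded processor is about seven orders of magnitude.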
Conclusion
[Figure: applications mapped onto architectures, together with design methods = low power!]