Hardware platforms for Embedded computing
![Page 1: Hardware platforms for Embedded computing](https://reader035.fdocuments.us/reader035/viewer/2022062310/5681633e550346895dd3cda3/html5/thumbnails/1.jpg)
Hardware platforms for Embedded computing
![Page 2: Hardware platforms for Embedded computing](https://reader035.fdocuments.us/reader035/viewer/2022062310/5681633e550346895dd3cda3/html5/thumbnails/2.jpg)
The energy/flexibility conflict: Intrinsic Power Efficiency
[Figure: operations per watt (MOPS/mW, log scale from 0.01 to 10) versus technology (1.0µ, 0.5µ, 0.25µ, 0.13µ, 0.07µ) for hardwired muxed ASICs, reconfigurable computing, DSP-ASIPs, and processors (µPs). Annotations: "Ambient Intelligence" and "poor design techniques".]
Necessary to optimize HW/SW; otherwise the price for software flexibility cannot be paid!
[H. de Man, Keynote, DATE'02; T. Claasen, ISSCC99]
![Page 3: Hardware platforms for Embedded computing](https://reader035.fdocuments.us/reader035/viewer/2022062310/5681633e550346895dd3cda3/html5/thumbnails/3.jpg)
Architectural Choices
[Figure: architectural choices plotted by flexibility versus 1/efficiency (power, speed). Choices shown: general-purpose µP (software), programmable DSP, reconfigurable processor (hardware reconfigurable), and dedicated logic (direct-mapped hardware). Building blocks shown: µP, program memory, MAC unit, address generator, satellite processors.]
![Page 4: Hardware platforms for Embedded computing](https://reader035.fdocuments.us/reader035/viewer/2022062310/5681633e550346895dd3cda3/html5/thumbnails/4.jpg)
The Processor Design Space
[Figure: cost versus performance.]
• Microcontrollers: cost is everything
• Embedded processors: application-specific architectures for performance
• Microprocessors: performance is everything & software rules
![Page 5: Hardware platforms for Embedded computing](https://reader035.fdocuments.us/reader035/viewer/2022062310/5681633e550346895dd3cda3/html5/thumbnails/5.jpg)
Area of processor cores = Cost
Nintendo processor Cellular phones
![Page 6: Hardware platforms for Embedded computing](https://reader035.fdocuments.us/reader035/viewer/2022062310/5681633e550346895dd3cda3/html5/thumbnails/6.jpg)
Another figure of merit: computation per unit area
Nintendo processor Cellular phones???
![Page 7: Hardware platforms for Embedded computing](https://reader035.fdocuments.us/reader035/viewer/2022062310/5681633e550346895dd3cda3/html5/thumbnails/7.jpg)
Embedded vs. general-purpose processors
• Embedded processors may be optimized for a category of applications.
  • Customization may be narrow or broad.
• We may judge embedded processors using different metrics:
  • Code size.
  • Memory system performance.
  • Predictability.
![Page 8: Hardware platforms for Embedded computing](https://reader035.fdocuments.us/reader035/viewer/2022062310/5681633e550346895dd3cda3/html5/thumbnails/8.jpg)
Microcontrollers
A single chip:
• CPU
• Memory: ROM, RAM
• I/O
• Subsystems: timers, counters, analog interfaces, I/O interfaces
![Page 9: Hardware platforms for Embedded computing](https://reader035.fdocuments.us/reader035/viewer/2022062310/5681633e550346895dd3cda3/html5/thumbnails/9.jpg)
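Microcontroller Architectures
[Figure: Von Neumann architecture: the CPU reaches one memory holding program + data (addresses 0 to 2^n) over an address bus and a data bus. Harvard architecture: the CPU reaches a program memory over an address bus and a fetch bus, and a separate data memory over an address bus and a data bus.]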
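The practical difference between the two organizations can be illustrated with a toy bus-cycle count. This is only a sketch under idealized assumptions that are not on the slide: every instruction fetch and every data access takes exactly one bus cycle, and in the Harvard case a fetch overlaps fully with a data access. The function names and the `data_accesses` list (data accesses per instruction) are hypothetical.

```python
def von_neumann_cycles(data_accesses):
    """One shared bus: each fetch and each data access needs its own cycle."""
    return sum(1 + d for d in data_accesses)


def harvard_cycles(data_accesses):
    """Separate program and data buses: a fetch overlaps with a data access,
    so each instruction costs the longer of the two (idealized)."""
    return sum(max(1, d) for d in data_accesses)


if __name__ == "__main__":
    # Four instructions; the second and third each touch data memory once.
    program = [0, 1, 1, 0]
    print(von_neumann_cycles(program), harvard_cycles(program))
```

Real devices add caches, wait states, and pipelining, so the numbers only show the direction of the effect.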
![Page 10: Hardware platforms for Embedded computing](https://reader035.fdocuments.us/reader035/viewer/2022062310/5681633e550346895dd3cda3/html5/thumbnails/10.jpg)
MCS-51 “Family” of Microcontrollers
• 8051 introduced by Intel in late 1970s
• Now produced by many companies in many variations
• The most popular microcontroller: about 40% of market share
• 8-bit microcontroller
![Page 11: Hardware platforms for Embedded computing](https://reader035.fdocuments.us/reader035/viewer/2022062310/5681633e550346895dd3cda3/html5/thumbnails/11.jpg)
“Original” 8051 Microcontroller
[Figure: block diagram. Blocks on the internal data bus: 8051 CPU; oscillator and timing; 4096 bytes program memory; 128 bytes data memory; two 16-bit timer/event counters; 64 KByte bus expansion control; programmable I/O; programmable serial port (full-duplex UART, synchronous shifter). Signals: external interrupts, subsystem interrupts, control, parallel ports / address-data bus / I/O pins, serial input, serial output.]
![Page 12: Hardware platforms for Embedded computing](https://reader035.fdocuments.us/reader035/viewer/2022062310/5681633e550346895dd3cda3/html5/thumbnails/12.jpg)
Microcontrollers: MHS 80C51 as an example
Features for Embedded Systems
• 8-bit CPU optimised for control applications
• Extensive Boolean processing capabilities
• 64 k Program Memory address space
• 64 k Data Memory address space
• 4 k bytes of on chip Program Memory
• 128 bytes of on chip data RAM
• 32 bi-directional and individually addressable I/O lines
• Two 16-bit timers/counters
• Full duplex UART
• 6 sources/5-vector interrupt structure with 2 priority levels
• On chip clock oscillators
• Very popular CPU with many different variations
![Page 13: Hardware platforms for Embedded computing](https://reader035.fdocuments.us/reader035/viewer/2022062310/5681633e550346895dd3cda3/html5/thumbnails/13.jpg)
RISC processors
• RISC generally means highly pipelinable, one instruction per cycle.
• Pipelines of embedded RISC processors have grown over time:
  • ARM7 has 3-stage pipeline.
  • ARM9 has 5-stage pipeline.
  • ARM11 has 8-stage pipeline.
[Figure: ARM11 pipeline [ARM05].]
![Page 14: Hardware platforms for Embedded computing](https://reader035.fdocuments.us/reader035/viewer/2022062310/5681633e550346895dd3cda3/html5/thumbnails/14.jpg)
RISC processor families
• ARM: ARM7 is relatively simple, no memory management; ARM11 has memory management, other features.
• MIPS: MIPS32 4K has 5-stage pipeline; 4KE family has DSP extension; 4KS is designed for security.
• PowerPC: 400 series includes several embedded processors; MPC7410 is two-issue machine; 970FX has 16-stage pipeline.
![Page 15: Hardware platforms for Embedded computing](https://reader035.fdocuments.us/reader035/viewer/2022062310/5681633e550346895dd3cda3/html5/thumbnails/15.jpg)
DSP Applications
• Audio applications: MPEG audio, portable audio
• Digital cameras
• Wireless: cellular telephones, base station
• Networking: cable modems, ADSL, VDSL
![Page 16: Hardware platforms for Embedded computing](https://reader035.fdocuments.us/reader035/viewer/2022062310/5681633e550346895dd3cda3/html5/thumbnails/16.jpg)
Another Look at DSP Applications
• High-end: wireless base station (TMS320C6000), cable modem gateways
• Mid-end: cellular phone (TMS320C540), fax/voice server
• Low end: storage products (TMS320C27), digital camera (TMS320C5000), portable phones, wireless headsets, consumer audio, automobiles, toasters, thermostats, ...
[Figure axes: increasing cost; increasing volume.]
![Page 17: Hardware platforms for Embedded computing](https://reader035.fdocuments.us/reader035/viewer/2022062310/5681633e550346895dd3cda3/html5/thumbnails/17.jpg)
DSP vs. General Purpose MPU
• The “MIPS/MFLOPS” of DSPs is speed of Multiply-Accumulate (MAC).
  • DSPs are judged by whether they can keep the multipliers busy 100% of the time.
• The "SPEC" of DSPs is 4 algorithms:
  • Infinite Impulse Response (IIR) filters
  • Finite Impulse Response (FIR) filters
  • FFT
  • convolvers
• In DSPs, algorithms are king!
  • Binary compatibility is not an issue.
• Software is not (yet) king in DSPs.
  • People still write in assembly language for a product to minimize the die area for ROM in the DSP chip.
![Page 18: Hardware platforms for Embedded computing](https://reader035.fdocuments.us/reader035/viewer/2022062310/5681633e550346895dd3cda3/html5/thumbnails/18.jpg)
Architectural Features of DSPs
• Data path configured for DSP
  • Fixed-point arithmetic
  • MAC: multiply-accumulate
• Multiple memory banks and buses
  • Harvard architecture
  • Multiple data memories
• Specialized addressing modes
  • Bit-reversed addressing
  • Circular buffers
• Specialized instruction set and execution control
  • Zero-overhead loops
  • Support for MAC
• Specialized peripherals for DSP
THE ULTIMATE IN BENCHMARK DRIVEN ARCHITECTURE DESIGN!!!
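Bit-reversed addressing, listed among the specialized addressing modes, is what a radix-2 FFT uses to reorder its data. A DSP address generator does this in hardware; the sketch below only shows the index mapping in software. The function name `bit_reverse` and its parameters are illustrative, not taken from any DSP toolchain.

```python
def bit_reverse(index, bits):
    """Return `index` with its lowest `bits` bits in reversed order."""
    result = 0
    for _ in range(bits):
        result = (result << 1) | (index & 1)  # shift in the lowest bit
        index >>= 1
    return result


if __name__ == "__main__":
    # Access order for an 8-point FFT (3 address bits).
    print([bit_reverse(i, 3) for i in range(8)])
```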
![Page 19: Hardware platforms for Embedded computing](https://reader035.fdocuments.us/reader035/viewer/2022062310/5681633e550346895dd3cda3/html5/thumbnails/19.jpg)
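Domain-oriented architectures
Application: $y[j] = \sum_{i=0}^{n-1} x[j-i] \cdot a[i]$
$\forall i: 0 \le i \le n-1: \quad y_i[j] = y_{i-1}[j] + x[j-i] \cdot a[i]$
Architecture: Example: data path ADSP210x
- Parallelism
- Dedicated registers
[Figure: data path. A multiplier unit (*, +, -) with input registers MX, MY and result registers MR, MF forms x[j-i]*a[i] from x[j-i] and a[i] and accumulates onto y_{i-1}[j]; an ALU (+, -, ..) with input registers AX, AY and result registers AR, AF sits beside it. An address generation unit (AGU) with address registers A0, A1, A2, .. supplies the addresses (i+1, j-i+1) for the arrays a and x.]
MR:=0; A1:=1; A2:=n-2; MX:=x[n-1]; MY:=a[0];
for (j:=1 to n) {MR:=MR+MX*MY; MY:=a[A1]; MX:=x[A2]; A1++; A2--}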
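The multiply-accumulate loop above can be mirrored in software. The following is a minimal sketch, not the ADSP210x instruction sequence: `fir_mac` is a hypothetical name, `mr` stands in for the accumulator register MR, and `a1`/`a2` stand in for the AGU address registers that walk up through the coefficients and down through the samples.

```python
def fir_mac(x, a, j):
    """Compute y[j] = sum_{i=0}^{n-1} x[j-i] * a[i] with one MAC per tap."""
    n = len(a)
    if j < n - 1 or j >= len(x):
        raise ValueError("need n-1 <= j < len(x) so every x[j-i] exists")
    mr = 0   # accumulator (role of MR)
    a1 = 0   # coefficient address, incremented each step
    a2 = j   # sample address, decremented each step
    for _ in range(n):
        mr += x[a2] * a[a1]  # multiply-accumulate
        a1 += 1
        a2 -= 1
    return mr


if __name__ == "__main__":
    samples = [1, 2, 3, 4]
    coeffs = [1, 1, 1]  # 3-tap moving sum
    print(fir_mac(samples, coeffs, 3))
```

On the DSP, the operand fetches, the address updates, and the MAC of one iteration run in parallel; here they are sequential statements.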
![Page 20: Hardware platforms for Embedded computing](https://reader035.fdocuments.us/reader035/viewer/2022062310/5681633e550346895dd3cda3/html5/thumbnails/20.jpg)
DSP - Features (1)
• Multiply/accumulate (MAC) and zero-overhead loop (ZOL) instructions (as shown)
• Heterogeneous registers (as shown)
• Separate address generation units (AGUs) (as in ADSP 210x)
![Page 21: Hardware platforms for Embedded computing](https://reader035.fdocuments.us/reader035/viewer/2022062310/5681633e550346895dd3cda3/html5/thumbnails/21.jpg)
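DSP - Features (2)
• Modulo addressing: Am++ means Am:=(Am+1) mod n (implements ring or circular buffer in memory)
[Figure: sliding window over the samples of x along the time axis t. Memory at t=t1 holds .., x[n-2], x[n-1], x[0], x[1], ..; memory at t2=t1+1 holds .., x[n-3], x[n-2], x[n-1], x[n], x[1], with x[n] stored where x[0] was.]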
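Modulo addressing can be sketched in a few lines. This is an illustration under the assumption of a software buffer; on a DSP the wrap-around happens in the address generation unit at no extra cost. The class name `CircularBuffer` and the attribute `am` (after the address register Am on the slide) are illustrative.

```python
class CircularBuffer:
    """Fixed-size ring buffer: the newest sample overwrites the oldest."""

    def __init__(self, n):
        self.data = [0] * n
        self.am = 0  # write address, plays the role of register Am

    def push(self, sample):
        self.data[self.am] = sample
        self.am = (self.am + 1) % len(self.data)  # Am := (Am + 1) mod n

    def window(self):
        """Return the stored samples from oldest to newest."""
        n = len(self.data)
        return [self.data[(self.am + k) % n] for k in range(n)]


if __name__ == "__main__":
    buf = CircularBuffer(3)
    for s in [1, 2, 3, 4]:
        buf.push(s)
    print(buf.data, buf.window())
```

No sample is ever moved in memory; only the address wraps, which is why the sliding window costs one store per new sample.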
![Page 22: Hardware platforms for Embedded computing](https://reader035.fdocuments.us/reader035/viewer/2022062310/5681633e550346895dd3cda3/html5/thumbnails/22.jpg)
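Multiple memory banks or memories
[Figure: the data path from the previous example (multiplier with registers MX, MY, MR, MF; ALU with registers AX, AY, AR, AF) and the address generation unit (AGU) with address registers A0, A1, A2, .., connected to multiple memories.]
Simplifies parallel fetches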
![Page 23: Hardware platforms for Embedded computing](https://reader035.fdocuments.us/reader035/viewer/2022062310/5681633e550346895dd3cda3/html5/thumbnails/23.jpg)
Very long instruction word (VLIW) processors
Key idea: detection of possible parallelism to be done by compiler, not by hardware at run-time (inefficient).
VLIW: parallel operations (instructions) encoded in one long word (instruction packet), each instruction controlling one functional unit. E.g.:
![Page 24: Hardware platforms for Embedded computing](https://reader035.fdocuments.us/reader035/viewer/2022062310/5681633e550346895dd3cda3/html5/thumbnails/24.jpg)
The Texas Instruments TMS 320C6xx as an example
Bit in each instruction encodes end of parallel execution
[Figure: seven 32-bit instructions (bits 31..0), each shown with its bit 0: A: 0, B: 1, C: 1, D: 0, E: 1, F: 1, G: 0.]
Cycle   Instruction
1       A
2       B C D
3       E F G
Instructions B, C and D use disjoint functional units, cross paths and other data path resources. The same is also true for E, F and G.
Parallel execution cannot span several packets.
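The grouping rule can be sketched as follows. The sketch assumes the convention that matches the cycle table on this slide: a bit value of 1 means the next instruction executes in the same cycle, and 0 ends the group. The function name `group_into_cycles` and the tuple encoding are hypothetical; packet boundaries are not modelled.

```python
def group_into_cycles(instructions):
    """Split (name, p_bit) pairs into groups that issue in the same cycle."""
    cycles, current = [], []
    for name, p_bit in instructions:
        current.append(name)
        if p_bit == 0:          # this instruction ends the parallel group
            cycles.append(current)
            current = []
    if current:                 # trailing group without a closing 0 bit
        cycles.append(current)
    return cycles


if __name__ == "__main__":
    stream = [("A", 0), ("B", 1), ("C", 1), ("D", 0),
              ("E", 1), ("F", 1), ("G", 0)]
    for cycle, group in enumerate(group_into_cycles(stream), start=1):
        print(cycle, " ".join(group))
```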
![Page 25: Hardware platforms for Embedded computing](https://reader035.fdocuments.us/reader035/viewer/2022062310/5681633e550346895dd3cda3/html5/thumbnails/25.jpg)
Partitioned register files
• Many memory ports are required to supply enough operands per cycle.
• Memories with many ports are expensive.
Registers are partitioned into (typically 2) sets, e.g. for TI C60x:
[Figure: data path A with register file A and functional units L1, S1, M1, D1; data path B with register file B and functional units D2, M2, S2, L2; both attached to the data bus and the address bus.]
![Page 26: Hardware platforms for Embedded computing](https://reader035.fdocuments.us/reader035/viewer/2022062310/5681633e550346895dd3cda3/html5/thumbnails/26.jpg)
Instruction types are mapped to functional unit types
There are 4 functional unit (FU) types:
• M: Memory Unit
• I: Integer Unit
• F: Floating-Point Unit
• B: Branch Unit
Instruction types correspond to FU types, except type A (mapping to either I or M functional units).
![Page 27: Hardware platforms for Embedded computing](https://reader035.fdocuments.us/reader035/viewer/2022062310/5681633e550346895dd3cda3/html5/thumbnails/27.jpg)
Large # of delay slots, a problem of VLIW processors
[Figure: instruction packets in flight: add sub and or / sub mult xor div / ld st mv beq.]
The execution of many instructions has been started before it is realized that a branch was required.
Nullifying those instructions would waste compute power.
• Executing those instructions is declared a feature, not a bug.
• How to fill all “delay slots” with useful instructions?
• Avoid branches wherever possible.
![Page 28: Hardware platforms for Embedded computing](https://reader035.fdocuments.us/reader035/viewer/2022062310/5681633e550346895dd3cda3/html5/thumbnails/28.jpg)
Predicated execution: Implementing IF-statements “branch-free”
Conditional instruction “[c] I” consists of:
• condition c
• instruction I
c = true => I executed
c = false => NOP
![Page 29: Hardware platforms for Embedded computing](https://reader035.fdocuments.us/reader035/viewer/2022062310/5681633e550346895dd3cda3/html5/thumbnails/29.jpg)
Predicated execution: Implementing IF-statements “branch-free”: TI C6x

if (c) {
    a = x + y;
    b = x + z;
} else {
    a = x - y;
    b = x - z;
}

Conditional branch (max. 12 cycles):
[c] B L1
    NOP 5
    B L2
    NOP 4
    SUB x,y,a || SUB x,z,b
L1: ADD x,y,a || ADD x,z,b
L2:

Predicated execution (1 cycle):
   [c]  ADD x,y,a
|| [c]  ADD x,z,b
|| [!c] SUB x,y,a
|| [!c] SUB x,z,b
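The semantics of the predicated version can be modelled in software. The sketch below is an illustration only: it evaluates all four operations and commits a result only where the predicate holds, which is how a predicated instruction behaves as a NOP when its condition is false. The function name `predicated_if` and the register dictionary are hypothetical, and no cycle timing is modelled.

```python
def predicated_if(c, x, y, z):
    """Model one execute packet of four predicated instructions."""
    packet = [
        (c,     "a", x + y),   # [c]  ADD x,y,a
        (c,     "b", x + z),   # [c]  ADD x,z,b
        (not c, "a", x - y),   # [!c] SUB x,y,a
        (not c, "b", x - z),   # [!c] SUB x,z,b
    ]
    regs = {}
    for predicate, dest, value in packet:
        if predicate:          # false predicate: the instruction acts as a NOP
            regs[dest] = value
    return regs["a"], regs["b"]


if __name__ == "__main__":
    print(predicated_if(True, 5, 2, 1), predicated_if(False, 5, 2, 1))
```

Exactly one instruction per destination register commits, so the four instructions can share a cycle without conflicting writes.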
![Page 30: Hardware platforms for Embedded computing](https://reader035.fdocuments.us/reader035/viewer/2022062310/5681633e550346895dd3cda3/html5/thumbnails/30.jpg)
Architecture Evolution
Roadmap continues: 90 → 65 → 45 nm
“Traditional” bus-based SoCs fit in one tile!!
Communication demand is staggering, but unevenly distributed, because of architectural heterogeneity
[Figure: tiled SoC with processing elements (PE; each tile with CPU, I/O, and a local memory hierarchy), SRAM and DRAM blocks, I/O and I/O peripherals, and 3D stacked main memory.]
![Page 31: Hardware platforms for Embedded computing](https://reader035.fdocuments.us/reader035/viewer/2022062310/5681633e550346895dd3cda3/html5/thumbnails/31.jpg)
Multicores Are Here!
[Figure: number of cores per chip (log scale, 1 to 512) versus year (1970 to 20??). Chips plotted: 4004, 8008, 8080, 8086, 286, 386, 486, Pentium, P2, P3, P4, Itanium, Itanium 2, Athlon, Raw, Power4, Opteron, Power6, Niagara, Yonah, PExtreme, Tanglewood, Cell, Intel Tflops, Xbox360, Cavium Octeon, Raza XLR, PA-8800, Cisco CSR-1, Picochip PC102, Broadcom 1480, Opteron 4P, Xeon MP, Ambric AM2045.]
[Amarasinghe06]
![Page 32: Hardware platforms for Embedded computing](https://reader035.fdocuments.us/reader035/viewer/2022062310/5681633e550346895dd3cda3/html5/thumbnails/32.jpg)
MPSoC – 2005 ITRS roadmap
[Figure: number of processing engines (right axis, 0 to 1200) together with total logic size and total memory size (both normalized to 2005, left axis, 0 to 60) versus year, 2005 to 2020.]
Number of processing engines by year:
2005: 16, 2006: 23, 2007: 32, 2008: 46, 2009: 63, 2010: 79, 2011: 101, 2012: 133, 2013: 161, 2014: 212, 2015: 268, 2016: 348, 2017: 424, 2018: 526, 2019: 669, 2020: 878
[Martin06]
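The processing-engine counts on this slide imply a compound growth of roughly 30% per year. The short sketch below makes the arithmetic explicit; the list `PE_COUNTS` is copied from the slide (2005 to 2020), while the function name `annual_growth` is illustrative.

```python
# Processing engines per year, 2005 through 2020, as read off the slide.
PE_COUNTS = [16, 23, 32, 46, 63, 79, 101, 133,
             161, 212, 268, 348, 424, 526, 669, 878]


def annual_growth(counts):
    """Compound annual growth factor between the first and last entry."""
    years = len(counts) - 1
    return (counts[-1] / counts[0]) ** (1 / years)


if __name__ == "__main__":
    print(f"growth factor per year: {annual_growth(PE_COUNTS):.3f}")
```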
![Page 33: Hardware platforms for Embedded computing](https://reader035.fdocuments.us/reader035/viewer/2022062310/5681633e550346895dd3cda3/html5/thumbnails/33.jpg)
Power is the Challenge!
[Figure: power (W) and power density (W/cm²), 0 to 1400, for a 10 mm die at 90nm, 65nm, 45nm, 32nm, 22nm, and 16nm, broken down into SiO2 leakage, SD leakage, and active power.]
Technology, Circuits, and Architecture to constrain the power
![Page 34: Hardware platforms for Embedded computing](https://reader035.fdocuments.us/reader035/viewer/2022062310/5681633e550346895dd3cda3/html5/thumbnails/34.jpg)
Near Term Solutions
• Move away from frequency alone to deliver performance
• More on-die memory
• Multi-everywhere
  • Multi-threading
  • Chip level multi-processing
• Throughput oriented designs
• Performance by higher level of integration
![Page 35: Hardware platforms for Embedded computing](https://reader035.fdocuments.us/reader035/viewer/2022062310/5681633e550346895dd3cda3/html5/thumbnails/35.jpg)
Architecture Techniques
• Increase on-die memory
  [Figure: cache % of total area (0% to 100%) across 1u, 0.5u, 0.25u, 0.13u, 65nm for 486, Pentium®, Pentium® III, Pentium® 4, Pentium® M.]
• Multi-threading
  [Figure: a single thread (ST) stalls while waiting for memory; with multi-threading, threads MT1, MT2, MT3 run during each other's memory waits, giving full HW utilization.]
  Improved performance, no impact on thermals & power delivery
• Chip multi-processing
  [Figure: four cores C1 to C4 with a shared cache versus one large core; relative performance (1 to 3.5) versus die area and power (1 to 4) for multi core and single core.]
![Page 36: Hardware platforms for Embedded computing](https://reader035.fdocuments.us/reader035/viewer/2022062310/5681633e550346895dd3cda3/html5/thumbnails/36.jpg)
Multi-Core
[Figure: power and performance bars for a large core with cache and for a small core. Relative to the large core, the small core has power = 1/4 and performance = 1/2. Four small cores C1 to C4 share a cache.]
Multi-Core: Power efficient
Better power and thermal management
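The power-efficiency argument can be checked with simple arithmetic. The sketch below takes the two ratios on the slide (small core: power 1/4, performance 1/2 of the large core) and adds one assumption that the slide does not state: the workload scales perfectly across cores. The function name `multicore_totals` is illustrative.

```python
def multicore_totals(cores, power_per_core, perf_per_core):
    """Total power and throughput, assuming perfect parallel scaling."""
    return cores * power_per_core, cores * perf_per_core


if __name__ == "__main__":
    # Units are relative to one large core (power = 1, performance = 1).
    power, perf = multicore_totals(4, 0.25, 0.5)
    print(f"4 small cores: power {power}, throughput {perf}")
```

Under that assumption four small cores fit the power budget of one large core while offering twice its throughput; serial code sections reduce the gain.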
![Page 37: Hardware platforms for Embedded computing](https://reader035.fdocuments.us/reader035/viewer/2022062310/5681633e550346895dd3cda3/html5/thumbnails/37.jpg)
Embedded vs. General Purpose

Embedded Applications
• Asymmetric Multi-Processing
  • Differentiated processors
• Specific tasks known early
  • Mapped to dedicated processors
• Configurable and extensible processors: performance, power efficiency
• Communication
  • Coherent memory
  • Shared local memories
  • HW FIFOs, other direct connections
• Dataflow programming models
• Classical example: smart mobile (RISC + DSP + media processors)

Server Applications
• Symmetric Multi-Processing
  • Homogeneous cores
• General tasks known late
  • Tasks run on any core
• High-performance, high-speed microprocessors
• Communication
  • Large coherent memory space on multi-core die or bus
• SMT programming models (Simultaneous Multi-Threading)
• Examples: large server chips (e.g. Sun Niagara 8x4 threads), scientific multi-processors
![Page 38: Hardware platforms for Embedded computing](https://reader035.fdocuments.us/reader035/viewer/2022062310/5681633e550346895dd3cda3/html5/thumbnails/38.jpg)
MPSoC architectures
![Page 39: Hardware platforms for Embedded computing](https://reader035.fdocuments.us/reader035/viewer/2022062310/5681633e550346895dd3cda3/html5/thumbnails/39.jpg)
Example system platforms
• Generic
• Automotive
• Wireless
• Multimedia
![Page 40: Hardware platforms for Embedded computing](https://reader035.fdocuments.us/reader035/viewer/2022062310/5681633e550346895dd3cda3/html5/thumbnails/40.jpg)
PC-based platform
Basic hardware components: CPU; memory; timers; DMA; minimal I/O devices.
Basic software: BIOS.
![Page 41: Hardware platforms for Embedded computing](https://reader035.fdocuments.us/reader035/viewer/2022062310/5681633e550346895dd3cda3/html5/thumbnails/41.jpg)
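PC-style hardware architecture
[Figure: CPU, memory, DMA controller, timers, and a bus interface on the system bus; the bus interface leads to a high-speed bus with I/O devices; a bridge connects the high-speed bus to a low-speed bus with further I/O devices.]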
![Page 42: Hardware platforms for Embedded computing](https://reader035.fdocuments.us/reader035/viewer/2022062310/5681633e550346895dd3cda3/html5/thumbnails/42.jpg)
StrongARM
StrongARM system includes:
• CPU chip (3.686 MHz clock)
• system control module (32.768 kHz clock):
  • real-time clock;
  • operating system timer;
  • general-purpose I/O;
  • interrupt controller;
  • power manager controller;
  • reset controller.
![Page 43: Hardware platforms for Embedded computing](https://reader035.fdocuments.us/reader035/viewer/2022062310/5681633e550346895dd3cda3/html5/thumbnails/43.jpg)
Pros and cons
• Plentiful hardware options.
• Simple programming semantics.
• Good software development environments.
• Performance-limited.
![Page 44: Hardware platforms for Embedded computing](https://reader035.fdocuments.us/reader035/viewer/2022062310/5681633e550346895dd3cda3/html5/thumbnails/44.jpg)
TI Open Wireless Multimedia Applications Platform
Dual-processor shared memory system:
[Figure: a general-purpose processor running the GPP OS and the DSP manager, and a DSP running the DSP OS and a DSP task, connected through a bridge and a memory & I/O controller to external memory.]
http://www.ti.com/sc/docs/apps/wireless/omap/overview.htm
![Page 45: Hardware platforms for Embedded computing](https://reader035.fdocuments.us/reader035/viewer/2022062310/5681633e550346895dd3cda3/html5/thumbnails/45.jpg)
TI OMAP™ Hardware platform
• ARM9 core: 16KB I-cache, 8KB D-cache, 2-way set associative, 150 MHz
• C55x DSP core: 16KB I-cache, 8KB RAM set, 2-way set associative, 200 MHz
[Figure: RISC core with I-MMU, D-MMU, I-cache, and D-cache; DSP core + application coprocessors with MMU, I-cache, and internal RAM/ROM; DMA; memory & traffic controller leading to program memory and SDRAM; peripherals: LCD controller, interrupt handlers, timers, GPIO, UARTs, ...]
![Page 46: Hardware platforms for Embedded computing](https://reader035.fdocuments.us/reader035/viewer/2022062310/5681633e550346895dd3cda3/html5/thumbnails/46.jpg)
OMAPI Standard (ST/TI)
Goal: standardize the interfaces between application processor and peripheral devices in a mobile product
Provide standard services (APIs) in the OS that can be used by application developers
![Page 47: Hardware platforms for Embedded computing](https://reader035.fdocuments.us/reader035/viewer/2022062310/5681633e550346895dd3cda3/html5/thumbnails/47.jpg)
STMicro Nomadik platform
[Figure: main core, memory system, HW accelerators, I/Os.]
![Page 48: Hardware platforms for Embedded computing](https://reader035.fdocuments.us/reader035/viewer/2022062310/5681633e550346895dd3cda3/html5/thumbnails/48.jpg)
Nomadik SW platform
Compliant with OMAPI standard
![Page 49: Hardware platforms for Embedded computing](https://reader035.fdocuments.us/reader035/viewer/2022062310/5681633e550346895dd3cda3/html5/thumbnails/49.jpg)
Philips Digital Video Nexperia™ Platform
• General-purpose scalable RISC processor (MIPS™): 50 to 300+ MHz, 32-bit or 64-bit
• Scalable VLIW media processor (TriMedia™): 100 to 300+ MHz, 32-bit or 64-bit
• System buses: 32-128 bit
• Library of device IP blocks: image coprocessors, DSPs, UART, 1394, USB, …and more
[Figure: DVP system silicon. A MIPS CPU (PRxxxx, with I$ and D$) and a TriMedia CPU (TM-xxxx, with I$ and D$), each with device IP blocks on a PI bus, connect over the DVP memory bus and the MMI to SDRAM.]
![Page 50: Hardware platforms for Embedded computing](https://reader035.fdocuments.us/reader035/viewer/2022062310/5681633e550346895dd3cda3/html5/thumbnails/50.jpg)
Nexperia™-DVP Software
[Figure: software stack: applications; middleware (JavaTV, TVPAK, OpenTV, MHP/Java, proprietary ...); streaming and platform software beside the kernel (pSOS, Win-CE, JavaOS); Nexperia hardware.]
• Nexperia™-DVP Software Architecture
  • Supports multiple OSs and middleware software
  • Abstracts platform functionality via consistent APIs
• Nexperia™-DVP Streaming Software
  • Encapsulates implementation of streaming media components (hardware and software)
• Nexperia™ Platform Software
  • OS independent device drivers for on-chip and off-chip devices
![Page 51: Hardware platforms for Embedded computing](https://reader035.fdocuments.us/reader035/viewer/2022062310/5681633e550346895dd3cda3/html5/thumbnails/51.jpg)
Infineon Automotive Platform
TC1166
Applications
• High performance drives / servo drives
• Industrial control
• Robotics
Features
• 32-bit super-scalar TriCore™ V1.3 CPU, 4 stage pipeline
  • Fully integrated DSP capabilities
  • Single precision floating point unit (FPU)
  • 80 MHz at full industrial temperature range
• 32-bit peripheral control processor with single cycle instruction (PCP2)
• Memories
  • 1.5 MByte embedded program flash with ECC
  • 32 KByte data flash - EEPROM emulation
  • 56 KB SRAM, 8 KB I$, 16 KB Imem
• 8-channel DMA controller
• Interrupt system with 2 x 255 hardware priority arbitration levels serviced by CPU and PCP2 coprocessor
• Triple bus structure: 64-bit local memory buses to internal flash and data memory, 32-bit system peripheral bus, 32-bit remote peripheral bus
![Page 52: Hardware platforms for Embedded computing](https://reader035.fdocuments.us/reader035/viewer/2022062310/5681633e550346895dd3cda3/html5/thumbnails/52.jpg)
MOSAIC SW Architecture & Components for Automotive Dashboard and Body Control
[Figure: layered software architecture.]
• Application platform layer (10% of total SW)
• Application Programming Interface
• SW platform layer (> 60% of total SW)
• HW layer: Nec78k, HC12, HC08, H8S26, MB90
Blocks shown: application-specific software (speedometer, tachometer, water temp., odometer, ...), application libraries, customer libraries, controllers library, OSEK RTOS, OSEK COM, I/O drivers & handlers (> 20 configurable modules), boot loader, sys. config., transport, KWP 2000, CCP.
SW platform reuse > 70% of total SW
![Page 53: Hardware platforms for Embedded computing](https://reader035.fdocuments.us/reader035/viewer/2022062310/5681633e550346895dd3cda3/html5/thumbnails/53.jpg)
Architecture trends
[Figure: architectures placed between "high performance for wide application field" and "high performance for narrow application field". Architectures shown: general-purpose CPU; homogeneous chip-multiprocessor (multiple cores); tile processor; configurable processor (special instructions); heterogeneous multiprocessor (multiple cores); DSP; special-purpose processors (stream processor, graphic processor, network processor); dynamically reconfigurable processors; programmable hardware (FPGA, reconfigurable systems); dedicated hardware.]
![Page 54: Hardware platforms for Embedded computing](https://reader035.fdocuments.us/reader035/viewer/2022062310/5681633e550346895dd3cda3/html5/thumbnails/54.jpg)
Task Specific (configurable) Processors
[Figure: design flow. Applications and SysC specs feed a processor model (ISA, DP). From the model, an HDL generator produces an HDL model that goes through RTL synthesis to silicon; a retargetable compiler produces machine code and µcode (MACD; APAC; SACH Y,1; NEG; LAR AR3,#X; …); an instruction set simulator offers break and step. Courtesy: Target Compilers T.]
RWTH Aachen LISATek (CoWare); IMEC Target Compiler T; ARM OptimoDE; Philips Silicon Hive; Tensilica; PicoChip…
![Page 55: Hardware platforms for Embedded computing](https://reader035.fdocuments.us/reader035/viewer/2022062310/5681633e550346895dd3cda3/html5/thumbnails/55.jpg)
Parallelism at Three Levels in Extensible Instructions
Three forms of instruction-set parallelism:
• Very Long Instruction Word (VLIW)
• Single Instruction Multiple Data (SIMD) aka “vectors”
• Fused operations aka “complex operations”
[Figure: multi-issue instruction: L operations packed in one long instruction. SIMD operation: M copies of storage and function. Fused operation: N dependent operations implemented as single fused operation, with register and constant inputs.]
Parallelism: L x M x N
Example: 3 x 4 x 3 = 36 ops/cycle
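Because the three forms of parallelism are independent, their factors multiply. The one-line sketch below just reproduces that product; the function name `ops_per_cycle` and its parameter names are illustrative, chosen to match L (issue slots), M (SIMD lanes), and N (fused operations).

```python
def ops_per_cycle(issue_slots, simd_lanes, fused_ops):
    """Peak operations per cycle: L issue slots x M lanes x N fused ops."""
    return issue_slots * simd_lanes * fused_ops


if __name__ == "__main__":
    print(ops_per_cycle(3, 4, 3))
```

This is a peak figure; it is reached only when every slot, lane, and fused stage does useful work in the same cycle.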
![Page 56: Hardware platforms for Embedded computing](https://reader035.fdocuments.us/reader035/viewer/2022062310/5681633e550346895dd3cda3/html5/thumbnails/56.jpg)
Example: SAD (sum of absolute differences)
Original C Code
short total = 0;
char *p1, *p2;
for (i = 0; i < m; i++)
    for (j = 0; j < n; j++)
        total += abs(*p1++ - *p2++);
[Figure: dataflow of the inner loop: two l8ui loads with addi pointer updates feed sub, abs, add. Vectorize? NO / YES. Vectorized form: two liu9x8 loads feed sub9x8, abs9x8, cvt9_16, add16x8; the arithmetic chain is grouped into a single fusion operation.]
Sample Software Pipelined Schedule (Vector + Fusion + FLIX Configuration), slots: SLOT 2, SLOT 1, SLOT 0
loop j = 1, n/8 by 2:
    liu9x8[j]; liu9x8[j]; fusion[j-2]
    liu9x8[j+1]; liu9x8[j+1]; fusion[j-1]
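The effect of vectorizing the SAD loop can be sketched in software. The code below compares a scalar reference with a version that keeps eight partial sums, one per lane, and reduces them at the end, which mirrors the 8-wide vector operations on the slide. It is only a model: the function names are hypothetical, the inputs are flat lists of pixel values, and the lane version assumes the length is a multiple of 8.

```python
def sad_scalar(p1, p2):
    """Reference: sum of absolute differences, one element at a time."""
    return sum(abs(a - b) for a, b in zip(p1, p2))


def sad_lanes8(p1, p2):
    """Same result with eight independent partial sums (one per lane)."""
    if len(p1) != len(p2) or len(p1) % 8 != 0:
        raise ValueError("inputs must have equal length, a multiple of 8")
    lanes = [0] * 8
    for base in range(0, len(p1), 8):      # one vector step per 8 elements
        for k in range(8):                 # these 8 updates are independent
            lanes[k] += abs(p1[base + k] - p2[base + k])
    return sum(lanes)                      # final reduction across lanes


if __name__ == "__main__":
    block_a = list(range(16))
    block_b = [0] * 16
    print(sad_scalar(block_a, block_b), sad_lanes8(block_a, block_b))
```

The inner `for k` loop carries no dependence between lanes, which is what lets hardware execute all eight updates in one operation.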
![Page 57: Hardware platforms for Embedded computing](https://reader035.fdocuments.us/reader035/viewer/2022062310/5681633e550346895dd3cda3/html5/thumbnails/57.jpg)
Dynamically Reconfigurable Processors
• Reconfigurable systems → Previous lesson
  • Flexible, but dynamic reconfiguration takes tens of milliseconds.
• Dynamically Reconfigurable Processors
  • Improve area efficiency by changing hardware structure.
  • IPs used in various SoCs.
• History
  • Reconfigurable co-processors: Garp (1997), CHIMAERA (2000)
  • Multicontext reconfigurable devices: WASMII (1992), Time-multiplexing FPGA (1997), PipeRench (1998), DRL (1998)
  • Functional-level synthesis
• Various commercial products have been available since 2000
  • IPFlex DAPDNA-2, NEC Electronics DRP-1, PACT XPP, Elixent D-Fabrix
  • SONY’s VME (Virtual Mobile Engine) is embedded in Network Walkman and PSP
• Recently, many Japanese vendors have started to develop commercial products
  • Fujitsu, Hitachi, Lucent, Sanyo, Toshiba (MeP + D-Fabrix)
![Page 58: Hardware platforms for Embedded computing](https://reader035.fdocuments.us/reader035/viewer/2022062310/5681633e550346895dd3cda3/html5/thumbnails/58.jpg)
What is Configurable Computing?
Spatially-programmed connection of processing elements
• “Hardware” customized to specifics of problem.
• Direct map of problem specific dataflow, control.
• Circuits “adapted” as problem requirements change.
![Page 59: Hardware platforms for Embedded computing](https://reader035.fdocuments.us/reader035/viewer/2022062310/5681633e550346895dd3cda3/html5/thumbnails/59.jpg)
Spatial vs. Temporal Computing
Spatial Temporal
![Page 60: Hardware platforms for Embedded computing](https://reader035.fdocuments.us/reader035/viewer/2022062310/5681633e550346895dd3cda3/html5/thumbnails/60.jpg)
Processor vs. FPGA Area
![Page 61: Hardware platforms for Embedded computing](https://reader035.fdocuments.us/reader035/viewer/2022062310/5681633e550346895dd3cda3/html5/thumbnails/61.jpg)
Processing Element
• Specialized for media/stream processing
• Coarse grain ⇔ Fine grain: LUT of FPGAs
• Components
  • ALU
  • Shifter + Mask unit
  • Multiplexers
  • Registers
• Operations and interconnection between components are changeable
• No instruction fetch mechanism: a part of a large datapath
![Page 62: Hardware platforms for Embedded computing](https://reader035.fdocuments.us/reader035/viewer/2022062310/5681633e550346895dd3cda3/html5/thumbnails/62.jpg)
Reconfigurable HW (DSP fabric)
• Target signal processing and arithmetic intensive applications
• Reconfigurable array of simple DSP cores (CNode)
• Low power architecture
  • Hierarchical clock gating
  • Distributed leakage control (fine grain power gating)
• Programmable DMA engine
• Reconfigurable at run time, multi task
![Page 63: Hardware platforms for Embedded computing](https://reader035.fdocuments.us/reader035/viewer/2022062310/5681633e550346895dd3cda3/html5/thumbnails/63.jpg)
Mapping Flow
• ALUs execute a cyclic micro-sequence
• Data exchanges through hierarchical clustered interconnect
• Configuration step is sequence loading and interconnect programming
Behavioral code:
Procedure(In,Out,inout)
Constant A,b,c,…;
Begin
X=a-in[0];
……..
End;
[Figure: flow from behavioral code to a DFG (data in to data out) via ILP + software pipelining, then partitioning/static scheduling, then coarse grained configuration onto clusters (level 0) connected by level 1 and level 2 multiplexers, with ports N0_i/N0_o, N1_i/N1_o, N2_i/N2_o.]
![Page 64: Hardware platforms for Embedded computing](https://reader035.fdocuments.us/reader035/viewer/2022062310/5681633e550346895dd3cda3/html5/thumbnails/64.jpg)
Mapping Flow
• 3D optimization problem (place/route/schedule)
• Traditional scheduling techniques for VLIW or clustered VLIW don’t apply
  • The solutions don’t take into account the spatial dimension of the problem
• Traditional P&R used in FPGAs doesn’t apply either, because it doesn’t consider the time dimension
![Page 65: Hardware platforms for Embedded computing](https://reader035.fdocuments.us/reader035/viewer/2022062310/5681633e550346895dd3cda3/html5/thumbnails/65.jpg)
Putting it all together
• Constant SoC die size
• Slow evolution of peripherals (area decrease)
• GP CPU sub-system complexity 2x at each node (constant area)
• Embedded memory capacity 2x at each node (constant area)
• Loosely coupled DSP sub-system complexity increases by 30% at each node (30% area decrease)

Year:                          2004  2006  2008  2010  2012
Technology node (nm):            90    65    45    32    22
Loosely coupled sub-systems:      2     4     6     8    12
General purpose CPU:           Single → Multiple
Hardware accelerator:          Hardwired → Reconfigurable
![Page 66: Hardware platforms for Embedded computing](https://reader035.fdocuments.us/reader035/viewer/2022062310/5681633e550346895dd3cda3/html5/thumbnails/66.jpg)
What can fit in 45mm² in 45nm
[Figure: floorplan. Host Core 1 and Host Core 2 with L1 and L2; 4MB multi-port embedded memory; interconnect; six DSP sub-systems, each with L1, DSP, HW, and DMA; programmable multimedia accelerator; 192 CNode (40 GOPS); imaging H/W; video H/W; peripherals & analog.]