Post on 25-Jan-2022
Communication Protocols
• Layering
– Lower levels provide services to higher levels
– Easier to design
– Physical layer
• Lowest level in hierarchy
• Medium to carry data from one actor (device or node) to another
• Protocols: real-time or best effort
– Parallel
– Serial
– Wireless
Parallel communication
• Multiple data, control, and power wires
– One bit per wire
• High data throughput with short distances
• Typically used when connecting devices on same IC or same circuit board
– Bus must be kept short
• Long parallel wires result in high capacitance, which requires more time to charge/discharge
• Data misalignment between wires increases as length increases
• Higher cost, bulky
Parallel Protocols: PCI Bus
• PCI Bus (Peripheral Component Interconnect)
– High-performance bus designed by Intel in the 1990s
– Interconnects CPUs, expansion boards, memory
– Data transfer rates up to 1 GB/s for 64-bit addresses
– Synchronous bus architecture
– Multiplexed data/address lines
• PCI express
– Serial, point-to-point protocol
Source: http://computer.howstuffworks.com
Parallel Protocols: ARM Bus
• ARM Bus– Designed and used internally by ARM Corporation
– Interfaces with ARM line of processors
– Many IC design companies have own bus protocol
– Data transfer rate is a function of clock speed
– 32-bit addressing
Serial Communication
• Single data wire – transmit one bit at a time
• Higher data throughput with long distances
– Less average capacitance, so more bits per unit of time
• Complex protocol and interfacing logic
– Sender needs to decompose word into bits
– Receiver needs to recompose bits into word
– Control signals often on the same wire -> increasing protocol complexity
Serial Communication
[Figure: timing of one asynchronous character over time: idle line (no char), start bit, bit 0, bit 1, ..., bit n-1, stop bit]
• Parameters:
– Baud (bit) rate.
– Number of bits per character.
– Parity/no parity.
– Even/odd parity.
– Length of stop bit (1, 1.5, 2 bits).
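The decompose/recompose steps and the frame parameters above can be sketched in a few lines of Python. This is a minimal illustration, not a model of any particular UART: the function names `uart_frame` and `uart_decode` are hypothetical, data bits are assumed to be sent least-significant bit first, and only whole stop bits are modeled (1.5 stop bits would need a time-based model).

```python
def uart_frame(byte, data_bits=8, parity="even", stop_bits=1):
    """Return the line levels (1 = idle/mark, 0 = space) for one character."""
    if parity not in ("none", "even", "odd"):
        raise ValueError("parity must be 'none', 'even', or 'odd'")
    # Sender decomposes the word into bits, least-significant bit first.
    data = [(byte >> i) & 1 for i in range(data_bits)]
    frame = [0] + data                  # start bit is always 0
    if parity != "none":
        ones = sum(data)
        # Even parity: data bits plus parity bit contain an even number of 1s.
        frame.append(ones % 2 if parity == "even" else (ones + 1) % 2)
    frame += [1] * stop_bits            # stop bit(s) return the line to idle
    return frame


def uart_decode(frame, data_bits=8):
    """Receiver recomposes the data word from a frame built by uart_frame."""
    data = frame[1:1 + data_bits]
    return sum(bit << i for i, bit in enumerate(data))
```

With the defaults, one 8-bit character occupies 11 bit times on the wire (start, 8 data, parity, stop), so the usable data rate is 8/11 of the baud rate.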
Serial Protocol: 8251 UART
• Universal asynchronous receiver transmitter
• Takes parallel data and transmits serially at up to 450 Kbps
• 8251 chip functions are integrated into the standard PC interface chip
[Figure: CPU connected to the 8251 through an 8-bit status register and an 8-bit data register; the 8251 drives the serial port (xmit/rcv)]
Serial Protocols: I2C
• I2C (Inter-IC)
– Two-wire serial bus protocol developed by Philips Semiconductors in the early 1980s
– Enables peripheral ICs to communicate using simple communication hardware
• Appropriate for peripherals where simplicity and low manufacturing cost are more important than speed
– Normal (standard) mode: 100 Kbps with 7-bit addresses
– High-speed mode: up to 3.4 Mbps; 10-bit addresses are also supported
– Common devices capable of interfacing to I2C bus:
• EPROMs, Flash, and some RAM memory, real-time clocks, watchdog timers, and microcontrollers
• Raspberry Pi
Serial Protocols: USB
• USB (Universal Serial Bus)
– Easier connection between PC and peripherals
– USB 1.1 has 2 data rates:
• 12 Mbps for increased bandwidth devices
• 1.5 Mbps for lower-speed devices (joysticks, game pads)
– USB 2.0 runs at 480 Mbps; USB 3.1 up to 10 Gbps
– Tiered star topology can be used
• One USB device (hub) connected to PC
• Up to 127 USB devices can be connected to hub
– USB host controller
• Manages and controls bandwidth and driver software required by each peripheral
• Dynamically allocates power downstream according to devices connected/disconnected
PCI Express (PCIe)
• Serial, point-to-point protocol
• Bandwidth is very scalable: 1x-16x links
• Max 6.4GBps in either direction on x16
• Switches for connecting different devices
Source: http://computer.howstuffworks.com
Real-Time Communication & Protocol Examples
Class Overview
• What we've covered until now in SW:
– Real-time scheduling, RTOS, RTIO started
• Where we are going today:
– RTIO, HW/SW codesign
• Due today:
– Article on RTIO
• Upcoming:
– HW2 assigned
– Individual project part 2 deadline extended to end of the day Sunday, 2/19
Real-time Comm. Requirements
– Real-time behavior
– Efficient, economical (e.g. centralized power supply)
– Appropriate bandwidth and communication delay
– Robustness
– Fault tolerance
– Maintainability
– Diagnosability
– Security
– Safety
Real-time IO
• Field bus:
– A family of industrial computer network protocols used for real-time distributed control
• Carrier-sense multiple-access/collision-detection (CSMA/CD); used in Ethernet & CAN
• Alternatives:
– Token rings, token busses
– Carrier-sense multiple-access/collision-avoidance (CSMA/CA)
• Each partner gets an ID (priority). After each bus transfer, all partners try setting their ID on the bus; partners detecting a higher ID disconnect themselves from the bus. The highest-priority partner gets a guaranteed response time; others communicate only if they are given a chance.
Event vs. time triggered
• Event Triggered (ET):
– Computation/communication triggered by an external event
– Events are primarily generated by changes in the environment
– Efficient — only do things when they need to be done; rest and save energy/cpu time/bandwidth
– High peak-load if multiple events happen at once
– Hard to analyze due to asynchronous nature of events
• Time Triggered (TT):
– Computation/communication triggered by the system clock
– Events happen according to a fixed schedule
– Inefficient: does things periodically, whether needed or not
– Enhanced analyzability due to easily characterizable load, predictable interaction sequences, bus use, etc.
Time division multiple access
• Each node is assigned a fixed time slot:
http://www.ece.cmu.edu/~koopman/jtdma/jtdma.html#classical
Master sends sync
Some waiting time
Each slave transmits in its time slot
Variations (truncating unused slots, several slots per slave) exist
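The classical TDMA cycle described above (master sync, some waiting time, then one slot per slave) can be sketched as a simple schedule calculation. The function name `tdma_schedule` and the time units are illustrative assumptions; the variations mentioned above (truncated or multiple slots) are not modeled.

```python
def tdma_schedule(slot_lengths, sync_time, gap_time):
    """Return ({node: (start, end)}, cycle_length) for one TDMA cycle.

    slot_lengths maps each slave to the length of its fixed time slot;
    slots are laid out in the order the mapping was built.
    """
    t = sync_time + gap_time            # master sends sync, then some waiting time
    schedule = {}
    for node, length in slot_lengths.items():
        schedule[node] = (t, t + length)    # each slave transmits in its own slot
        t += length
    return schedule, t
```

Because every slot position is fixed, a node that has just missed its slot waits at most one full cycle, which is the kind of deterministic bound that priority-driven schemes do not give to low-priority nodes.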
Advantages of TDMA busses over priority-driven schemes
– Can provide QoS guarantees
– TDMA resources support temporal composability, by separating resource access of different subsystems
– TDMA resources have a very deterministic timing behavior
– Can be made fault tolerant
– Support for error detection
– Support for error containment
• a faulty subsystem does not affect the correct behavior of the remaining system
[Ernesto Wandeler, Lothar Thiele: Optimal TDMA Time Slot and Cycle Length Allocation for Hard Real-Time Systems, ASP-DAC, 2006]
Field busses: Profibus
• Process Field Bus (Profibus):
– PROFIBUS DP (Decentralized Peripherals) is used to operate sensors and actuators via a centralized controller in factory automation apps; runs at 9.6 kbps – 12 Mbps; RS485 allows max 126 devices, but expansion is possible
– PROFIBUS PA (Process Automation) is used to monitor measuring equipment via a process control system in process automation apps; runs at 31.25 kbps; same message format as DP
– Focus on safety; 20% market share for field busses
– Integration with Ethernet via Profinet
[http://www.profibus.com/]
Profibus: Application & Data Layers
• Application layer:
– DP-V0 for cyclic exchange of data and diagnosis
– DP-V1 for acyclic data exchange and alarm handling
– DP-V2 for slave-to-slave communication and broadcast data exchange
• Data link:
– FDL (Field bus Data Link) combines token passing with a master-slave method for Profibus-DP
• Each byte uses even parity and is transferred asynchronously with a start and stop bit
• Master signals the start of a new telegram with a SYN pause of at least 33 bits
• Various messages possible:
– Token
– Variable data length
– Fixed data length
– No data
– Brief ack
OSI Layer        PROFIBUS
7 Application    DP-V0, DP-V1, DP-V2; Management
6 Presentation   --
5 Session        --
4 Transport      --
3 Network        --
2 Data Link      FDL
1 Physical       EIA-485, Optical, MBP
Controller area network (CAN)
– Designed by Bosch and Intel in 1981
– Key concept:
• Every device can be connected by a single set of wires, and every connected device can freely exchange data with any other device
– Originally designed for cars; now also used for:
• elevator controllers, copiers, telescopes, production-line control systems, and medical instruments
– Binary countdown arbitration (CSMA/CD)
• Start from MSB, transmit each bit of priority
• Highest priority wins
– Throughput: 10 kbit/s - 1 Mbit/s
– Low and high-priority signals
• Maximum latency of 134 µs for high priority
www.can.bosch.com
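Binary countdown arbitration can be sketched as a bit-by-bit elimination. The sketch below assumes the usual CAN convention that a 0 bit is dominant on a wired-AND bus, so the numerically lowest identifier carries the highest priority; the function name `arbitrate` and the 11-bit default width are illustrative choices, not part of the slides.

```python
def arbitrate(ids, width=11):
    """Return the identifier that wins bitwise arbitration.

    Assumes a wired-AND bus: a 0 (dominant) bit overrides a 1 (recessive)
    bit, so the lowest identifier has the highest priority.
    """
    contenders = set(ids)
    for bit in range(width - 1, -1, -1):            # start from the MSB
        sent = {i: (i >> bit) & 1 for i in contenders}
        bus = min(sent.values())                    # wired-AND of all sent bits
        # A node that sent recessive (1) but reads dominant (0) backs off.
        contenders = {i for i in contenders if sent[i] == bus}
    if len(contenders) != 1:
        raise ValueError("identifiers must be unique within the given width")
    return contenders.pop()
```

No bus time is lost during arbitration: the winner's identifier is already on the bus when the losers drop out, and they retry once the bus is idle again.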
Aircraft communication systems
– Information exchange
• information: many bytes of data, e.g. digital map, flight plan, etc.
• exchange: a response is expected, at minimum an acknowledgment
• higher speed data link needed
– Control platform: sampling and data transmission
• data: digital value of an analog parameter, e.g. speed, height, etc.
• No response is expected, but:
– Time, integrity and availability are the key drivers.
– The stability of the flight relies on this transmission
• Aeronautical response: ARINC 429 protocol
ARINC 429 overview
• Developed by Aeronautical Radio, Incorporated (ARINC)
• Commonly used standard for aircraft
• Electrical and data format standard for a 2-wire serial bus with one sender and many listeners.
• Each data item is individually identified (by a label) and sent
[Figure: protocol stack (Physical connection, DataLink/MAC, Network, Transport, Application) with A429 carrying label + data; the 32-bit A429 word consists of label, data, and parity]
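The 32-bit A429 word (label, data, parity) can be sketched as a simple bit-packing routine. This is a simplified illustration under stated assumptions: the label occupies the low 8 bits, the remaining 23 bits are treated as one opaque payload (the real format subdivides them further), the parity bit is the top bit and makes the overall parity odd, and bit ordering on the wire is not modeled. The names `a429_word` and `a429_check` are hypothetical.

```python
def a429_word(label, payload):
    """Pack an 8-bit label and a 23-bit payload into a 32-bit word with odd parity."""
    if not 0 <= label < (1 << 8):
        raise ValueError("label must fit in 8 bits")
    if not 0 <= payload < (1 << 23):
        raise ValueError("payload must fit in 23 bits")
    word = label | (payload << 8)               # bits 0..30
    ones = bin(word).count("1")
    parity = 0 if ones % 2 == 1 else 1          # make the total number of 1s odd
    return word | (parity << 31)


def a429_check(word):
    """Return True if the 32-bit word has odd parity."""
    return bin(word & 0xFFFFFFFF).count("1") % 2 == 1
```

A listener can use the parity check to discard a corrupted word, but with one sender and many listeners there is no path to request a retransmission.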
Information system requirements
• Ensure that the information is transmitted without any error.
– Data needs to be acknowledged
– Messages can be sent again in case of error
• Past aircraft used A429 with an added acknowledgement (A429 Williamsburg).
[Figure: protocol stack (Physical connection, DataLink/MAC, Network, Transport, Application) showing where A429 and A429 Williamsburg sit]
ARINC 629
• Multi-transmitter protocol where many units share the same bus; originally designed for Boeing 777.
• Based on "waiting room" protocol:
– Each node is assigned a unique number of mini slots that must elapse with silence on the channel before the data transmission begins
• Three (groups of) time-out parameters:
– SG: synchronization gap controlling access to the waiting room
– TGi: terminal gap, the personal time-out of node i
– TI: transmit interval preventing monopolization of channel
– TI > SG > max{TGi}
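The ordering of the three time-outs can be checked mechanically when configuring a bus. The helper below is only a sketch; the name `check_629_timeouts` and the idea of passing terminal gaps as a per-node mapping are assumptions for illustration.

```python
def check_629_timeouts(ti, sg, tg):
    """Check TI > SG > max(TGi) and that every node's terminal gap is unique.

    tg maps each node to its terminal gap TGi (same time unit as ti and sg).
    """
    gaps = list(tg.values())
    unique = len(set(gaps)) == len(gaps)    # each node needs its own gap length
    return unique and ti > sg > max(gaps)
```

Unique terminal gaps are what prevent two nodes in the waiting room from starting to transmit at the same moment.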
TTP (Time-Triggered Protocol)
Sources: Dr. Insup Lee & H. Kopetz
TTP – more than just a protocol
– Network protocol
– Operating system scheduling philosophy
– Fault tolerance approach
Time-Triggered approach
– Simple to implement
– Stable time base
– Cyclic schedules
TTP versions
• TTP/A (Automotive Class A = soft real time)
– A scaled-down version of TTP
– A cheaper master/slave variant
• Distributed master slave is expensive
• TTP/C (Automotive Class C = hard real time)
– A full version of TTP
– A fault-tolerant distributed variant
Protocol Layer in TTP/A
TTP/A: Polling
• Operation
– Master polls the other nodes (slaves)
– Non-master nodes transmit messages when they are polled
– Inter-slave communication through the master
Polling Tradeoffs
• Advantages
– Simple protocol to implement
– Historically very popular
– Bounded latency for real-time applications
• Disadvantages
– Single point of failure from centralized master
– Polling consumes bandwidth
– Network size is fixed during installation
• Master can also discover nodes during reconfiguration
TTP/C
• TTP/C
– A time-triggered communication protocol for safety-critical (fault-tolerant) distributed real-time control systems
– Based on a TDMA media access strategy
• Clock synchronization: Each node measures the difference between the expected and the observed arrival time of a message to calculate the difference between the sender’s & receiver’s clocks
– Fail silence
• A subsystem is fail-silent if it either produces correct results or no results at all, i.e., it is quiet in case it cannot deliver the correct service
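The clock synchronization step above (comparing expected and observed arrival times) can be sketched as follows. How the individual differences are combined is not stated in the slides; this sketch assumes a fault-tolerant average that discards the `k` largest and `k` smallest differences before averaging, and the function name `clock_correction` is hypothetical.

```python
def clock_correction(expected, observed, k=1):
    """Estimate the local clock correction from message arrival times.

    expected[i] and observed[i] are the scheduled and measured arrival times
    of the message from sender i. The k most extreme differences on each
    side are dropped so that up to k faulty senders cannot skew the result.
    """
    diffs = sorted(o - e for e, o in zip(expected, observed))
    if len(diffs) <= 2 * k:
        raise ValueError("need more than 2*k measurements")
    trimmed = diffs[k:len(diffs) - k]
    return sum(trimmed) / len(trimmed)
```

In the example data used for testing, one sender that is 50 time units off is simply ignored, which illustrates why the a-priori known TDMA schedule is useful beyond media access.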
[Figure: TTP/C protocol layers: Host layer (application software in host), FTU CNI, FTU layer (FTU membership), RM layer (redundancy management), SRU layer (SRU membership), Basic CNI, Data Link/Physical layer; clock synchronization and media access (TDMA) are shown among the lower-layer services. FTU = fault tolerance unit, CNI = communication network interface, RM = redundancy management, SRU = smallest replaceable unit]
TTP/C Protocol Layer
FTU Layer
Group two or more nodes into FTUs
RM Layer
Provide the mechanisms for the cold start of a TTP/C cluster
SRU Layer
Store the data fields of the received frames
Data Link/Physical Layer
Provide the means to exchange frames between the nodes
(a) Two active nodes, two shadow nodes
(b) Triple modular redundancy: three active nodes with one shadow
(c) Two active nodes without a shadow node
FTU Configuration Examples in TTP/C
Controller to run protocol
DPRAM (dual ported RAM)
Used for memory-mapped network interface
BG (Bus Guard)
Hardware watchdog to ensure “fail silent”
HW must use highly accurate time sources
Dual redundant crystal oscillators are used for Boeing 777
TTP/C Frame
• I-Frames used for initialization
• N-Frames used for normal messages
Cycle in TTP/C
• TDMA Cycle
– One FTU sends results twice
– Then next FTU sends results
– And so on, until back to the next message from the first FTU
• Cluster Cycle
– Cluster cycle involves scheduling all messages and tasks
TTP/A vs. TTP/C
Service                          TTP/A                 TTP/C
Clock Synchronization            Central Multimaster   Distributed, Fault-Tolerant
Mode Switches                    yes                   yes
Communication Error Detection    Parity                16/24 bit CRC
Membership Service               simple                full
External Clock Synchronization   yes                   yes
Time-Redundant Transmission      yes                   yes
Duplex Nodes                     no                    yes
Duplex Channels                  no                    yes
Redundancy Management            no                    yes
Shadow Node                      no                    yes
Pros and Cons of TTP
• Advantages
– Simple protocol to implement
– Deterministic response time
– No wasted time for master polling messages
• Disadvantages
– Wasted bandwidth when some nodes are idle
– Requires stable clocks
– Fixed network size during installation
FlexRay
• Robust, scalable, deterministic, and fault-tolerant digital serial bus system designed for use in automotive applications
• Developed by consortium: BMW, Ford, Bosch, Daimler-Chrysler…
– Specified in SDL; finalized in 2009
• Built as extension to TTP and Byteflight protocols
• Improved error tolerance and time-determinism
• Meets requirements with transfer rates >> CAN
– initially targeted for ~10 Mbit/s
– design allows much higher data rates
• TDMA (Time Division Multiple Access) protocol: fixed time slot with exclusive access to the bus
• Cycle subdivided into a static and a dynamic segment
TDMA in FlexRay
• Exclusive bus access enabled for short time in each case
• Dynamic segment for transmission of variable-length information
• Fixed priorities in dynamic segment: minislots for each potential sender
• Bandwidth used only when it is actually needed
http://www.tzm.de/FlexRay/FlexRay_Introduction.html
Structure of FlexRay networks
• Bus Guardian (BG) protects the system against failing processors by gating access to the Bus Driver (BD)
Comparison of real-time protocols
FIP = Flexible time triggered protocol; statically scheduled with centralized arbitration
LON = for building automation; uses TDMA with CSMA/CA and dynamically varies the number of slots per device for each schedule
Wireless communication
• Infrared (IR)
– Frequencies just below visible light spectrum
– Diode emits infrared light to generate signal
– Infrared transistor detects signal
– Cheap to build but needs line of sight, limited range
– Data transfer rates between 9.6 kbps and 4 Mbps
• Radio frequency (RF)
– Electromagnetic wave frequencies in radio spectrum
– Analog circuitry and antenna needed on both sides
– Line of sight not needed, transmitter power determines range
RFID
• Use of EM field to transfer data, for identifying and tracking tags attached to objects; no need for line of sight
• Active vs. passive tags
– Active tags transmit their ID; they are low power (~10-100 µA) but higher cost ($10-$200/unit retail)
– Passive tags can be read by RF: no intrinsic power consumption (powered by EM induction) and cheaper ($0.20-0.40)
• Readers
– $100+ to $1000s, range from read and report to smart tracking, etc.
• Using RFID for real-time location systems (RTLS)
– Only active tags work, with range 100 m+ in line of sight, or 1-20 m obstructed
– Battery: up to years on a single charge @ <1 Hz transmission rate
– Location accuracy as close as 30 cm with reader presence
Bluetooth, BLE, ZigBee
Bluetooth
• IEEE 802.15.1
• Developed and licensed by the Bluetooth Special Interest Group (SIG)
BLE
• Bluetooth Low Energy Technology
• Adopted into Bluetooth specification
ZigBee
• IEEE 802.15.4
• Maintained and published by the ZigBee Alliance
Side By Side Comparison
                            Bluetooth                      BLE                            ZigBee
Band                        2.4 GHz                        2.4 GHz                        2.4 GHz, 868 MHz, 915 MHz
Antenna/HW                  Shared                         Shared                         Independent
Power                       100 mW                         ~10 mW                         30 mW
Battery Life                Days – months                  1-2 years                      6 months – 2 yrs
Range                       10-30 m                        10 m                           10-75 m
Data Rate                   1-3 Mbps                       1 Mbps                         25-250 Kbps
Network Topologies          Ad hoc, point to point, star   Ad hoc, point to point, star   Mesh, ad hoc, star
Time to Wake and Transmit   3 s                            3 ms                           15 ms
Security                    128-bit encryption             128-bit encryption             128-bit encryption
Wireless Protocols: 802.11
• IEEE 802.11
– Standard for wireless LANs
– Specifies parameters for PHY and MAC layers of network
• PHY layer
– handles transmission of data between nodes
– data transfer rates up to 600 Mbit/s for 802.11n
– operates in 2.4 / 5 GHz frequency band (RF)
• MAC layer
– medium access control layer
– protocol responsible for maintaining order in shared medium
– collision avoidance/detection
Summary
• Interfacing: on & off chip
• Real-time IO
– Profibus
– CAN
– ARINC
– TTP/A & TTP/C
– FlexRay
• Wireless
– IR, BLE, ZigBee, RFID, 802.11
Hardware/Software Codesign
Tajana Simunic Rosing
Department of Computer Science and Engineering
University of California, San Diego.
[Figure: ES design flow; labels: Hardware components, Hardware, Verification and Validation]
System Architecture: Yesterday
PCB design
[Figure: board-level system: processor, cache, cache/DRAM controller, DRAM, graphics with VRAM, motion video with VRAM and DRAM, audio, LAN, SCSI/IDE and I/O on a PCI bus; external bus and add-in board on ISA/EISA]
A System Architecture: Today
HW/SW Codesign of a SoC
[Figure: single chip containing processor core, DSP processor core, memory, cache/SRAM, VRAM, graphics, video, motion, encryption/decryption, glue logic, and PCI, EISA, I/O, LAN and SCSI interfaces]
System Design Problem Areas
[Figure: processor, ASIC, memory, DMA, analog I/O and interface blocks, annotated with the problem areas below]
1. Design environment, co-simulation, constraint analysis
2. HDL modeling, architectural synthesis, logic synthesis, physical synthesis
3. Software synthesis, optimization, retargetable code generation, debugging & programming environment
4. Test issues
HW-centric view of a Platform
[Figure: platform built around a HW-SW kernel (MEM, FPGA, CPU; processor(s), RTOS(es) and SW architecture) plus reference design, serving an application space. Labels: scaleable bus, test, power, IO, clock, timing architectures; programmable; SW IP; hardware IP; pre-qualified/verified foundation IP; foundry-specific HW qualification; reconfigurable hardware region (FPGA, LPGA, …); SW architecture characterisation]
IP can be:
• HW or SW
• hard, soft or 'firm' (HW)
• source or object (SW)
Source: Grant Martin and Henry Chang, “Platform-Based Design:
A Tutorial,” ISQED 2002, 18 March 2002, San Jose, CA.
SW-Centric View of Platforms
[Figure: software platform (application software, platform API, RTOS, BIOS, device drivers, network communication) layered through an API on the hardware platform (input devices, output devices, I/O, network)]
Source: Grant Martin and Henry Chang, “Platform-Based Design:
A Tutorial,” ISQED 2002, 18 March 2002, San Jose, CA.
HW/SW Codesign: Motivations
• Benefit from both HW and SW
–HW:
• Parallelism -> better performance, lower power
• Higher implementation cost
–SW
• Sequential implementation -> great for some problems
• Lower implementation cost, but often slower and higher power
Software or hardware?
Decision based on hardware/software partitioning
[Figure: hardware/software codesign: a specification is mapped onto processor P1, processor P2, and hardware]
System Partitioning
– Good partitioning mechanism:
1) Minimize communication across bus
2) Allows parallelism -> both HW & CPU operating concurrently
3) Near peak processor utilization at all times
process (a, b, c)
in port a, b;
out port c;
{
read(a);
…
write(c);
}
Specification
Line ()
{
a = …
…
detach
}
Processor
Capture
Model HW
Partition
Synthesize
Interface
Determining Communication Level
– Easier to program at application level
• (send, receive, wait) but difficult to predict
– More difficult to specify at low level
• Difficult to extract from program but timing and resources easier to predict
[Figure: communication levels between an application program (over operating system, I/O driver, I/O bus) and custom application hardware (over I/O driver, I/O bus): send, receive, wait; register reads/writes and interrupt service; bus transactions and interrupts]
Partitioning Costs
• Software Resources
– Performance and power consumption
– Lines of code – development and testing cost
– Cost of components
• Hardware Resources
– Fixed number of gates, limited memory & I/O
– Difficult to estimate timing for custom hardware
– Recent design shift towards IP
• Well-defined resource and timing characteristics
[Figure: Software Cost Analysis Process. Labels: functional blocks, feature points, source lines of code (SLOC), calibration, language conversion, equivalent SLOC including reuse, software development effort, software maintenance effort, software schedule, software development and testing cost]
[Figure: Hardware Cost Analysis Process. Labels: I/O count, I/O format, gate count, S/G ratio, Rent's rule, feature size, interconnect length, core area, die area, wafer characteristics, die yield, number up, productivity and reuse, design cost, tooling cost, test development cost, wafer fabrication and sawing cost, die cost, single-chip-package cost, chip hardware cost]
Hardware/Software Partitioning
[Figure: processor, memory, and two ASICs on a shared bus]
Simple architectural model: CPU + 1 or more ASICs on a bus
• Properties of classic partitioning algorithms
– Single rate; Single-thread: CPU waits for ASIC
– Type of CPU is known; ASIC is synthesized
HW/SW Partitioning Styles
• HW first approach
– start with all-ASIC solution which satisfies constraints
– migrate functions to software to reduce cost
• SW first approach
– start with all-software solution which does not satisfy constraints
– migrate functions to hardware to meet constraints
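The SW-first style can be sketched as a greedy loop under the classic single-thread model above, where the CPU waits for the ASIC and execution times simply add up. Everything here is illustrative: the function name `sw_first_partition`, the task tuple layout, and the greedy criterion (largest time saved per unit of hardware cost, with hardware costs assumed to be positive) are assumptions rather than a method taken from the slides.

```python
def sw_first_partition(tasks, deadline):
    """Greedy SW-first partitioning.

    tasks: {name: (sw_time, hw_time, hw_cost)} with hw_cost > 0.
    Returns (set of tasks moved to hardware, total time, total hardware cost).
    """
    in_hw = set()

    def total_time():
        # Single-thread model: CPU waits for the ASIC, so times add up.
        return sum(hw if name in in_hw else sw
                   for name, (sw, hw, _) in tasks.items())

    # Start with the all-software solution and migrate until the deadline is met.
    while total_time() > deadline:
        candidates = [n for n in tasks
                      if n not in in_hw and tasks[n][0] > tasks[n][1]]
        if not candidates:
            raise ValueError("deadline cannot be met by moving tasks to hardware")
        best = max(candidates,
                   key=lambda n: (tasks[n][0] - tasks[n][1]) / tasks[n][2])
        in_hw.add(best)
    return in_hw, total_time(), sum(tasks[n][2] for n in in_hw)
```

A HW-first variant would run the same loop in reverse: start with every task in hardware and move the task with the largest cost saving back to software while the deadline still holds.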
Codesign Verification
• Run SW on the CPU
• Simulate HW (Verilog)
[Figure: co-simulation setup: software processes 1 and 2 communicate over Unix sockets with a Verilog simulator, which models the application-specific hardware (hardware processes and bus interface) and is attached through the Verilog PLI]
[Figure: Cost Analysis (Ghost): a SpecC model derived from Foresight provides HW inputs (gate count, I/O count, number up, die size, fab. cost, test cost, SCP cost) and SW inputs (lines of code, dev. cost, dev. schedule, maintenance cost); outputs are system performance metrics and system cost]
Co-Design Process
[Figure: Foresight Co-Design Integrated Toolset: system requirements capture (functional behavior block diagram, state machines, mini-specs, library elements, user-defined reusables) and resource specification (architecture block diagram, data flow, monitors, system characteristics)]
Industry Initiatives
• Seamless Co-Verification Environment (CVE)
• Proridium (Foresight)
– Customers: Boeing, Microsoft, Raytheon, Oracle etc.
• CoWare (now in Synopsys)
– Cosimulation and IP integration
– One of founding members of SystemC (language)
• New FPGA synthesis tools incorporate CPUs
• Platform-based design
– Platform: predesigned architecture that designers can use to build systems for a given range of applications
ILP for HW/SW Partitioning
Ingredients:
• Cost function
• Constraints
Involving linear expressions of integer variables from a set X
Cost function: $C = \sum_{x_i \in X} a_i x_i$, with $a_i \in \mathbb{R}$, $x_i \in \mathbb{N}$   (1)
Constraints: $\forall j \in J: \sum_{x_i \in X} b_{i,j} x_i \ge c_j$, with $b_{i,j}, c_j \in \mathbb{R}$   (2)
Def.: The problem of minimizing (1) subject to the constraints (2) is called an integer programming (IP) problem.
If all $x_i$ are constrained to be either 0 or 1, the IP problem is said to be a 0/1 integer programming problem.
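A tiny 0/1 integer programming problem can be solved by plain enumeration, which makes the definition concrete. The sketch below is not how commercial solvers work; it assumes constraints of the form "weighted sum >= bound" and uses a hypothetical function name `solve_01_ip`. The example instance used in the usage note is made up for illustration.

```python
from itertools import product


def solve_01_ip(a, constraints):
    """Minimize sum(a[i] * x[i]) over x in {0,1}^n.

    constraints is a list of (b, c) pairs, each meaning
    sum(b[i] * x[i]) >= c. Returns (best_cost, best_x) or None if infeasible.
    """
    best = None
    for x in product((0, 1), repeat=len(a)):        # all 2^n assignments
        feasible = all(sum(bi * xi for bi, xi in zip(b, x)) >= c
                       for b, c in constraints)
        if feasible:
            cost = sum(ai * xi for ai, xi in zip(a, x))
            if best is None or cost < best[0]:
                best = (cost, x)
    return best
```

For instance, minimizing C = 5x1 + 6x2 + 4x3 subject to x1 + x2 + x3 >= 2 is written as `solve_01_ip([5, 6, 4], [([1, 1, 1], 2)])`. The loop visits 2^n assignments, which is exactly the exponential growth that makes heuristics necessary for large models.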
FAQ on integer programming
Maximizing the cost function can be done by setting C' = -C
Integer programming is NP-complete.
Running times increase exponentially with problem size
Commercial solvers can solve for thousands of variables
IP models are a good starting point for modelling even if in the end heuristics have to be used to solve them.
IP model for HW/SW partitioning
Notation:
• Index set I denotes task graph nodes.
• Index set L denotes task graph node types, e.g. square root, DCT or FFT.
• Index set KH denotes hardware component types, e.g. hardware components for the DCT or the FFT.
• Index set J denotes hardware component instances.
• Index set KP denotes processors. All processors are assumed to be of the same type.
• T is a mapping from task graph nodes to their types: $T: I \to L$
Therefore:
• $X_{i,k} = 1$ if node $v_i$ is mapped to HW component type $k \in KH$, and 0 otherwise.
• $Y_{i,k} = 1$ if node $v_i$ is mapped to processor $k \in KP$, and 0 otherwise.
• $NY_{\ell,k} = 1$ if at least one node of type $\ell$ is mapped to processor $k \in KP$, and 0 otherwise.
Constraints
Operation assignment constraints
$\forall i \in I: \sum_{k \in KH} X_{i,k} + \sum_{k \in KP} Y_{i,k} = 1$
All task graph nodes have to be mapped either in software or in hardware.
Variables are assumed to be integers.
Additional constraints to guarantee they are either 0 or 1:
$\forall k \in KH: \forall i \in I: X_{i,k} \le 1$
$\forall k \in KP: \forall i \in I: Y_{i,k} \le 1$
Operation assignment constraints
$\forall \ell \in L, \forall i: T(v_i) = c_\ell, \forall k \in KP: NY_{\ell,k} \ge Y_{i,k}$
• For all types ℓ of operations and for all nodes i of this type:
– if i is mapped to some processor k, then that processor must implement the functionality of ℓ.
• Decision variables must also be 0/1 variables:
$\forall \ell \in L, \forall k \in KP: NY_{\ell,k} \le 1$
Resource & design constraints
• $\forall k \in KH$: the cost for components of that type should not exceed its maximum.
• $\forall k \in KP$: the cost for the associated data storage area should not exceed its maximum.
• $\forall k \in KP$: the cost for storing instructions should not exceed its maximum.
• The total cost of HW components ($k \in KH$) should not exceed its maximum.
• The total cost of data memories ($k \in KP$) should not exceed its maximum.
• The total cost of instruction memories ($k \in KP$) should not exceed its maximum.
Scheduling
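[Figure: task graph with nodes v1 ... v11 and blocks FIR1 and FIR2 mapped to processor p1 and ASIC h1, with edges e3 and e4 on communication channel c1. Timelines over t show the alternative orders to be decided: v7 before v8 or v8 before v7 on p1; v3 before v4 or v4 before v3 for FIR2 on h1; e3 before e4 or e4 before e3 on c1]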
Scheduling / precedence constraints
• For all nodes $v_{i1}$ and $v_{i2}$ that are potentially mapped to the same processor or hardware component instance, introduce a binary decision variable $b_{i1,i2}$ with
$b_{i1,i2} = 1$ if $v_{i1}$ is executed before $v_{i2}$, and $b_{i1,i2} = 0$ otherwise.
Define constraints of the type
(end-time of $v_{i1}$) $\le$ (start time of $v_{i2}$) if $b_{i1,i2} = 1$, and
(end-time of $v_{i2}$) $\le$ (start time of $v_{i1}$) if $b_{i1,i2} = 0$
• Ensure that the schedule for executing operations is consistent with the precedence constraints in the task graph.
• Timing constraints need to be met
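The conditional ordering constraints can be written as ordinary linear constraints. One common way, shown here only as a sketch, is the big-M formulation: the symbols $s_i$ (start time of $v_i$), $d_i$ (execution time of $v_i$), and the constant $M$ (larger than any possible schedule length) are assumptions introduced for illustration and do not appear in the slides.

```latex
\begin{align}
  s_{i1} + d_{i1} &\le s_{i2} + M \,(1 - b_{i1,i2}) \\
  s_{i2} + d_{i2} &\le s_{i1} + M \, b_{i1,i2}
\end{align}
```

For $b_{i1,i2} = 1$ the first inequality forces $v_{i1}$ to finish before $v_{i2}$ starts while the second is trivially satisfied; for $b_{i1,i2} = 0$ the roles are swapped.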
Example
• HW types H1, H2 and H3 with costs of 20, 25, and 30.
• Processors of type P.
• Tasks T1 to T5.
• Execution times:
T    H1    H2    H3    P
1    20    -     -     100
2    -     20    -     100
3    -     -     12    10
4    -     -     12    10
5    20    -     -     100
Operation assignment constraint
$\forall i \in I: \sum_{k \in KH} X_{i,k} + \sum_{k \in KP} Y_{i,k} = 1$
X1,1 + Y1,1 = 1 (task 1 mapped to H1 or to P)
X2,2 + Y2,1 = 1
X3,3 + Y3,1 = 1
X4,3 + Y4,1 = 1
X5,1 + Y5,1 = 1
Operation assignment constraint
• Assume types of tasks are ℓ = 1, 2, 3, 3, and 1.
$\forall \ell \in L, \forall i: T(v_i) = c_\ell, \forall k \in KP: NY_{\ell,k} \ge Y_{i,k}$
Functionality 3 is to be implemented on the processor if node 4 is mapped to it.
Other equations
• Time constraint: application-specific hardware is required for time constraints under 100 time units.
Cost function:
C = 20 #(H1) + 25 #(H2) + 30 #(H3) + cost(processor) + cost(memory)
Result
• For a time constraint of 100 time units and cost(P) < cost(H3):
Solution:
T1 → H1
T2 → H2
T3 → P
T4 → P
T5 → H1
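The result can be reproduced by enumerating all 32 mappings. This is a sketch under simplifying assumptions that are not stated in the slides: tasks are treated as independent (no precedence or communication), each resource executes its tasks back to back, one instance per used hardware type is counted, the processor is paid for only if used, and memory cost is ignored. The function name `best_partition` and the processor cost of 25 in the usage note are hypothetical.

```python
from itertools import product

HW_COST = {"H1": 20, "H2": 25, "H3": 30}
# task -> (hardware type, time on that hardware, time on processor P)
TASKS = {"T1": ("H1", 20, 100), "T2": ("H2", 20, 100), "T3": ("H3", 12, 10),
         "T4": ("H3", 12, 10), "T5": ("H1", 20, 100)}


def best_partition(cost_p, deadline=100):
    """Return (cost, mapping) of the cheapest mapping that meets the deadline."""
    best = None
    names = list(TASKS)
    for choice in product((True, False), repeat=len(names)):   # True = hardware
        load = {}
        for name, on_hw in zip(names, choice):
            hw, t_hw, t_p = TASKS[name]
            res, t = (hw, t_hw) if on_hw else ("P", t_p)
            load[res] = load.get(res, 0) + t    # tasks on one resource run in sequence
        if max(load.values()) > deadline:
            continue                            # time constraint violated
        cost = sum(cost_p if r == "P" else HW_COST[r] for r in load)
        if best is None or cost < best[0]:
            mapping = {n: (TASKS[n][0] if h else "P") for n, h in zip(names, choice)}
            best = (cost, mapping)
    return best
```

Calling `best_partition(25)` yields the mapping listed above (T3 and T4 in software, the rest in hardware). With a processor more expensive than H3, for example `best_partition(35)`, the all-hardware mapping becomes cheaper under these assumptions, which matches the condition cost(P) < cost(H3).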
Separation of scheduling and partitioning
• Combined scheduling/partitioning is very complex
• Heuristic:
1. Compute estimated schedule
2. Perform partitioning for estimated schedule
3. Perform final scheduling
4. If final schedule does not meet time constraint, go to 1 using a reduced overall timing constraint.
[Figure: 1st iteration: approximate execution time vs. actual execution time against the specification on time axis t; 2nd iteration: the same comparison against the new (reduced) specification]
Summary
• HW/SW codesign is complicated and limited by performance estimates
• Algorithms are in research and development,
– much of the work is still done by expert designers
Sources and References
• Peter Marwedel, “Embedded Systems Design,” 2004.
• Giovanni De Micheli @ EPFL
• Vincent Mooney @ Gatech
• Nikil Dutt @ UCI
CMOS VLSI Trends
[Figure: evolution of design styles.
Yesterday (1980s): memory, gate arrays, ASICs, processors.
Today: memory, struc. ASIC, ASICs, processors, reconfigurable, SoC.
Tomorrow: memory, ASICs, processors, reconfigurable (no processor), platform SoC, custom SoC, struc. ASIC (no processor), struc. SoC]
Increasing Customization Cost
Example: design with 80 M transistors in 100 nm technology
Estimated cost: $85 M - $90 M
12 – 18 months
Top cost drivers
– Verification (40%)
– Architecture design (26%)
– Embedded design: 1400 man months (SW), 1150 man months (HW)
– HW/SW integration
*Handel H. Jones, ”How to Slow the Design Cost Spiral,” Electronics Design Chain, September 2002, www.designchain.com
Responses to Increasing Cost
• General purpose ISA
– Universality → high volumes and reuse
– Abstraction → compilation technologies and high application/development productivity
• Custom silicon for embedded platforms in sufficiently high volumes
– Domain specific ISAs, e.g., DSPs
– Application Specific Standard Products
– Reconfigurable hardware
• HW/SW Codesign
HW/SW Codesign Issues
• Task level concurrency management
– Which tasks in the final system?
• High level transformations
– Transformations outside the scope of traditional compilers
• Hardware/software partitioning
– Which operation mapped to hardware, which to software?
• Compilation
– Hardware-aware compilation
• Scheduling
– Performed several times, with varying precision
• Design space exploration
– Set of possible designs, not just one
Partitioning Analysis
• Result of compilation is synthesizable HDL and assembly code for the processor
• Compiler & profiler determine dependence and rough performance estimates
HW & SW Foundries
• HW1
– LSI Logic ASIC Wafer Foundry Data
• 0.18 µm feature size
• 8 inch wafers
• 6 layers
– TSMC 018 Wafer Processing
• HW2
– Samsung Semiconductor ASIC Wafer Foundry Data
• 0.35 µm feature size
• 6 inch wafers
• 4 layers
– TSMC 035 Wafer Processing
• SW1
– Nominal to High development effort
• SW2
– Low to Nominal development effort
[Figure: stacked bar chart of percent of total cost (0% to 100%) versus production quantity (1,000; 10,000; 100,000) and level of reuse (none, 20%, 40%), broken down into software development, packaging, fabrication, tooling, design, and testing; recurring costs are marked]
MIXED Implementation Using HW1 and SW1
Reuse of:
• Gate-level IP
• Code
[Figure: Total Cost Per Chip, 10,000 units: total cost ($/chip, 0 to 45) versus percent custom hardware (0 to 100) for HW1/SW1, HW1/SW2, HW2/SW1, and HW2/SW2]
Co-simulation for HW & SW
• Transistor-level accurate
– post-layout SPICE model
• Gate-level accurate
– precise HDL gate delay model
• Cycle accurate
– correct transitions at clock edges
– timing information between edges is thrown away
• Bus accurate
– cycle accurate bus model
– behavioral model of processor, hardware
• Instruction set accurate
– instruction set simulator used for processors
– used for early design space exploration