Advanced Bus

STATE-OF-THE-ART INTERCONNECT FABRICS AND COMMUNICATION PROTOCOLS

AHB: critical overview

• Protocol
  - Lacks parallelism
    - In-order completion
    - Address of the next transaction is merely anticipated on the bus
    - No multiple outstanding transactions: cannot hide slave wait states effectively
  - High arbitration overhead (min. 2 cycles on single transfers)
  - Bus-centric vs. transaction-centric
    - Initiators and targets are exposed to the bus architecture (e.g. arbiter)
    - No decoupling, instance-specific bus components
• Topology
  - Scalability limitation of the shared bus solution!

Bus evolution

• Protocol: toward improved utilization of the topology (throughput, latency)
• Topology: toward enhanced parallelism

Topology evolution

• Shared bus with unidirectional request and response lanes
• Crossbar with unidirectional request and response lanes
• Partial crossbar with unidirectional request and response lanes
• Multi-layer bus architecture

[Figure: multi-layer bus architecture: several shared buses, each with its own P, T and M nodes, joined by a crossbar (xbar) to S0-S4]

The communication bottleneck

[Figure: system interconnect linking LX, IP blocks (IP 1, IP 2, IP 3, IP 5), many IPTG blocks and the off-chip memory controller]

• Today: multi-layer topology
• Jeopardizing design predictability, feasibility and cost!


                        4-ary 2-mesh   2-ary 4-mesh   2-ary 2-mesh
Switches                     16             16              4
Bisection bandwidth           4              8              2
Tiles x Switch                1              1              4
Switch Arity                  6              6             10
Max. Hops                     6              4              2

2-ary 2-mesh: low latency

[Figures: tile and switch layout of each mesh]
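The table entries can be reproduced from the usual k-ary n-mesh properties. The following is a minimal sketch, not taken from the slides: the function name `mesh_metrics` is hypothetical, and it assumes that each tile occupies two switch ports (for example one initiator and one target port), which is the assumption under which the arity column comes out as 6, 6 and 10.

```python
def mesh_metrics(k, n, tiles_per_switch=1, ports_per_tile=2):
    """Figures of merit for a k-ary n-mesh (k switches per dimension, n dimensions).

    ports_per_tile=2 is an assumption made for this sketch.
    """
    switches = k ** n
    # Halving the mesh across one dimension cuts one link for every
    # switch position in the remaining n-1 dimensions (k even).
    bisection = k ** (n - 1)
    # An interior switch has 2 neighbours per dimension; with k == 2
    # every switch is on an edge and has only 1 neighbour per dimension.
    neighbours = n * (2 if k > 2 else 1)
    arity = neighbours + tiles_per_switch * ports_per_tile
    # Worst case: cross k-1 links in each of the n dimensions.
    max_hops = n * (k - 1)
    return {
        "switches": switches,
        "bisection": bisection,
        "arity": arity,
        "max_hops": max_hops,
    }


mesh_4ary_2 = mesh_metrics(4, 2)
mesh_2ary_4 = mesh_metrics(2, 4)
mesh_2ary_2 = mesh_metrics(2, 2, tiles_per_switch=4)
```

With 16 tiles in every case, grouping 4 tiles per switch trades switch count and hop count against a larger switch arity.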

Split transactions

A split-transaction bus is a bus where the request and response phases are split and independent, to improve bus utilization
- Master must arbitrate for the request phase
- Slave must arbitrate for the response phase

[Figure: master/slave timeline: the bus is busy during the request, released while the slave works, busy again during the response, then released]
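The benefit of releasing the bus between request and response can be estimated with a small cycle count. This is an illustrative sketch only: the function names are hypothetical, each master addresses a different slave, and the model assumes that all requests are issued back to back before any response is granted the bus.

```python
def unsplit_cycles(n, wait, req=1, resp=1):
    """Bus held for the whole transaction: request, slave wait, response."""
    return n * (req + wait + resp)


def split_cycles(n, wait, req=1, resp=1):
    """Bus released while each slave works; responses re-arbitrate for it."""
    bus_free = n * req  # assumption: all requests go out first
    for i in range(n):
        ready = (i + 1) * req + wait   # slave i has its response ready
        start = max(ready, bus_free)   # wait for the bus if it is busy
        bus_free = start + resp
    return bus_free


blocking = unsplit_cycles(n=4, wait=3)
overlapped = split_cycles(n=4, wait=3)
```

In this toy setting the slave wait states of the four transactions overlap instead of adding up.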

Multiple outstanding transactions

[Figure: master with a queue of pending requests; slave with a queue of pending responses]

• The master needs to associate each response with one of its pending requests
• The initiator should support multiple outstanding transactions too

Out-of-order completion

[Figure: over time, a master sends requests first to S1 (slow), then to S2 (fast); each slave has a queue of pending requests; the response from S2 returns before the response from S1]

• Association between requests and responses is more challenging
• The typical case for out-of-order completion is when a fast slave is addressed after a slow slave. The fast slave will return its response earlier.
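One common way to associate responses with requests is to tag each request and keep a table of pending tags. The sketch below is hypothetical (class and method names are not from the slides) and only illustrates the bookkeeping on the master side.

```python
class Master:
    """Tracks pending requests by tag so responses may return in any order."""

    def __init__(self):
        self.pending = {}   # tag -> (slave, address)
        self.next_tag = 0

    def issue(self, slave, address):
        tag = self.next_tag
        self.next_tag += 1
        self.pending[tag] = (slave, address)
        return tag

    def complete(self, tag, data):
        # Look up (and retire) the request this response belongs to.
        slave, address = self.pending.pop(tag)
        return slave, address, data


master = Master()
tag_slow = master.issue("S1", 0x100)   # slow slave addressed first
tag_fast = master.issue("S2", 0x200)   # fast slave addressed second
first = master.complete(tag_fast, "fast data")   # S2 answers first
second = master.complete(tag_slow, "slow data")
```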

Out-of-order completion

[Figure: a master sends requests S11 and S12 to the same slave S1, which holds a queue of pending requests; S12 is anticipated, so the response of S12 returns before the response of S11]

• Out-of-order completion occurs even when multiple outstanding transactions are addressed to the same complex slave
• A complex slave may use local optimizations and change the processing order of incoming requests (e.g., serve accesses to an open row first in an SDRAM device)
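The open-row optimization can be sketched as a reordering of the slave's request queue. This is a deliberately simplified, hypothetical policy (real SDRAM controllers weigh more factors): requests that hit the currently open row are served first, and arrival order is kept otherwise.

```python
def open_row_first(requests, open_row):
    """requests: list of (name, row) in arrival order."""
    hits = [r for r in requests if r[1] == open_row]
    misses = [r for r in requests if r[1] != open_row]
    return hits + misses


# S11 arrives first but targets a closed row; S12 hits the open row.
queue = [("S11", 7), ("S12", 3)]
order = open_row_first(queue, open_row=3)
```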

Bus-centric architecture

[Figure: master interface and slave interface attached directly to the bus architecture]

• Internal bus components are directly exposed to the connected master and slave interfaces
• The bus architecture is instance-specific and lacks modularity

Transaction-centric architecture

[Figure: master and slave interfaces attached through a point-to-point communication protocol to the bus architecture, whose components are hidden]

• Internal bus components are hidden behind bus interfaces
• Modular architecture
• Orthogonalization of concerns
• Internal bus architecture can freely evolve without impacting the interfaces
• The only objective of interfaces: specifying communication transactions! (communication abstraction)

But what is there on the market?

AMBA Multi-layer AHB

• Enables parallel access paths between multiple masters and slaves
• Fully compatible with AHB wrappers
• It is a topology (not protocol) evolution
• Pure combinational matrix (scales poorly with the number of I/Os)

[Figure: Master1 and Master2, each on its own AHB layer, connected through the interconnect matrix to the slaves]

Multi-layer AHB implementation

• The matrix is completely flexible and can be adapted
• MUXes are point arbitration stages
• AHB layer can be AHB-lite: single master, no req/grant, no split/retry
• A layer losing arbitration is waited (stalled) by means of HREADY
• When a layer is waited, the input stage samples the pipelined address and control signals
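The point arbitration at each slave port can be sketched as follows. This is a hypothetical model, not the AMBA reference design: it assumes fixed priority (lower layer index wins), and the stalled list stands for the layers that would see HREADY held low.

```python
def arbitrate(requests):
    """requests: dict layer_index -> requested slave name (or None if idle).

    Returns (grants, stalled): grants maps each slave to the winning layer,
    stalled lists the layers that lost arbitration in this cycle.
    """
    grants = {}
    stalled = []
    for layer in sorted(requests):      # assumption: fixed priority order
        slave = requests[layer]
        if slave is None:
            continue
        if slave not in grants:
            grants[slave] = layer       # slave port is free: grant it
        else:
            stalled.append(layer)       # port taken: layer is waited
    return grants, stalled


grants, stalled = arbitrate({0: "S1", 1: "S1", 2: "S2"})
```

Layers 0 and 2 proceed in parallel because they address different slaves; only layer 1 is waited.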

Hierarchical systems

• Slaves accessed only by masters on a given layer can be made local to the layer

Multiple slaves

Multiple slaves appear as a single slave to the matrix
• combine low-bandwidth slaves
• group slaves accessed only by one master (e.g. DMA controller)

Alternatively, a slave can be an AHB-to-APB bridge, thus allowing connection to multiple low-bandwidth slaves

Multiple masters per layer

Combine masters that have low bandwidth requirements

Putting it all together…

Interconnect matrix and Slave4 are used for across-layer communication

[Figure label: HW semaphores]

Dual-port slaves

Common for off-chip SDRAM controllers
• Layer1: bandwidth-limited, high-priority traffic with low latency requirements (e.g., processor cores)
• Layer2: bandwidth-critical traffic (e.g., hardware accelerators)

The dual-port slave may even be connected to the matrix

AMBA 3.0 (AMBA AXI)

• High bandwidth – low latency designs

• High frequency operation

• Flexibility in the implementation

• Backward compatible with AHB and APB

This is an evolution of the communication protocol

Novel features with respect to AHB

• Burst-based transactions with only the first address issued
• Address information can be issued before/after the actual write data transfer
• Multiple outstanding addresses
• Out-of-order transaction completion
• Easy addition of register stages for timing closure

Design paradigm change

[Figure: masters and slaves attached to the communication architecture through AXI initiator and target interfaces]

• Point-to-point interface specification
• Independent of the implementation of the communication architecture
• Communication architecture can freely evolve (be customized)
• Transaction-based specification of the interface
• Open Core Protocol (OCP) is another example of this paradigm

Transaction-centric bus

AXI can be used to interconnect:
- an initiator to the bus
- a target to the bus
- an initiator with a target

The interface definition allows a variety of different interconnect implementations

[Figure: master (initiator) and slave (target) connected through AXI]

Interconnect approaches

[Figure: masters and slaves connected through an AXI crossbar and through an AXI shared bus; one approach is labelled "Most common"]

Most systems use one of three interconnect approaches:
- Shared address and data buses
- Shared address buses and multiple data buses
- Multilayer, with multiple address and data buses

Channel-based Architecture

• Five groups of signals
  - Read Address: “AR” signal name prefix
  - Read Data: “R” signal name prefix
  - Write Address: “AW” signal name prefix
  - Write Data: “W” signal name prefix
  - Write Response: “B” signal name prefix

[Figure: R. ADDRESS, READ DATA, W. ADDRESS, WRITE DATA and RESPONSE channels]

Channels are independent and asynchronous with respect to each other

Read transaction

Single address for burst transfers

Write transaction

Single response for an entire burst

Channels - One way flow

Write address channel: AWVALID, AWADDR, AWLEN, AWSIZE, AWBURST, AWLOCK, AWCACHE, AWPROT, AWID, AWREADY
Read data channel: RVALID, RLAST, RDATA, RRESP, RID, RREADY
Write data channel: WVALID, WLAST, WDATA, WSTRB, WID, WREADY
Write response channel: BVALID, BRESP, BID, BREADY

• Channel: a set of unidirectional information signals
• Valid/Ready handshake mechanism
• READY is the only return signal
• Valid: source IF has valid data/control signals
• Ready: destination IF is ready to accept data
• Last: indicates last word of a burst transaction

Valid-ready handshake
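A transfer on a channel takes place in a cycle where VALID and READY are both high; the source keeps VALID (and its data) stable until then. The following is a minimal cycle-level sketch with hypothetical names, where the READY pattern of the destination is given as an input list.

```python
def transfer_cycles(items, ready_pattern):
    """Return the cycle index at which each item crosses the channel."""
    pending = list(items)
    cycles = []
    for cycle, ready in enumerate(ready_pattern):
        valid = len(pending) > 0   # source holds VALID while it has data
        if valid and ready:        # handshake: both high in the same cycle
            pending.pop(0)
            cycles.append(cycle)
    return cycles


# Destination is not ready in cycles 0, 3 and 4.
done = transfer_cycles(["D11", "D12", "D13"], [0, 1, 1, 0, 0, 1])
```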

AMBA 2.0 AHB Burst

[Timing diagram]
ADDRESS: A11 A12 A13 A14 A21 A22 A23
DATA:    D11 D12 D13 D14 D21 D22 D23 D31

• AHB burst
  - Address and data are locked together
  - Two pipeline stages
  - HREADY controls pipeline operation

AXI - One Address for Burst

[Timing diagram]
ADDRESS: A11 A21
DATA:    D11 D12 D13 D14 D21 D22 D23 D31

• AXI burst
  - One address for the entire burst

AXI - Outstanding Transactions

[Timing diagram]
ADDRESS: A11 A21
DATA:    D11 D12 D13 D14 D21 D22 D23 D31

• AXI burst
  - One address for the entire burst
  - Allows multiple outstanding addresses

Problem: slow slave

[Timing diagram]
ADDRESS: A11 A21 A31
DATA:    D11 D12

• If one slave is very slow, all data is held up.

Out-of-Order Completion

[Timing diagram]
ADDRESS: A11 A21
DATA:    D21 D22 D23 D11 D12 D13 D14 D31

• Out-of-order completion allowed
  - Fast slaves may return data ahead of slow slaves
  - Complex slaves may serve requests out of order
• Each transaction has an ID attached (given by the master IF)
  - Channels have ID signals (AWID, ARID, RID, etc.)
  - Transactions with the same ID must be ordered
• The interconnect in a multi-master system must append another tag to the ID to make each master’s ID unique
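Appending a master tag to the transaction ID can be sketched as simple bit concatenation. The helper names and the 4-bit ID width are assumptions made for this example; the actual widths are implementation choices.

```python
def extend_id(master_index, transaction_id, id_bits=4):
    """Prefix the master's index to its transaction ID (id_bits is assumed)."""
    return (master_index << id_bits) | transaction_id


def split_id(system_id, id_bits=4):
    """Recover (master_index, transaction_id) to route a response back."""
    return system_id >> id_bits, system_id & ((1 << id_bits) - 1)


# Two masters both use ID 3; the interconnect keeps them distinct.
id_m0 = extend_id(0, 3)
id_m1 = extend_id(1, 3)
origin = split_id(id_m1)
```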

Ordering restrictions

Simple rules

A simple master can issue transactions with the same ID (implicitly forcing in-order delivery)

A simple slave can serve requests in the order they arrive, regardless of the ID tag
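The rule that transactions with the same ID must stay ordered, while different IDs may overtake each other, can be expressed as a small checker. This is an illustrative sketch with hypothetical names, not part of the AXI specification text.

```python
def respects_id_ordering(issued_ids, completion_order):
    """issued_ids[i] is the ID of the i-th issued transaction;
    completion_order lists transaction indices in the order they complete."""
    last_index = {}
    for index in completion_order:
        tid = issued_ids[index]
        if index < last_index.get(tid, -1):
            return False            # an older same-ID transaction was overtaken
        last_index[tid] = index
    return True


issued = [0, 1, 0]                                   # IDs of transactions 0, 1, 2
ok = respects_id_ordering(issued, [1, 0, 2])         # ID 1 overtakes ID 0
bad = respects_id_ordering(issued, [2, 0, 1])        # ID 0 reordered
single_id = respects_id_ordering([5, 5, 5], [0, 1, 2])
```

A master that always uses one ID, as in the last call, can only ever observe in-order completion.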

AXI - Data Interleaving

[Timing diagram: ADDRESS carries A11 and A21; the DATA beats D11-D14, D21-D23 and D31 are returned interleaved]

• Returned data can even be interleaved
• Gives maximum use of the data bus
• Note: data within a burst is always in order

Burst read

Valid high until ready high

The valid-ready handshake regulates data transfer

This is clearly a split transaction bus!

Overlapping burst read

Address of second burst issued: true outstanding transactions

Burst write

Register slices for max frequency

• Channels are asynchronous
• Register slices can be applied across any channel
• Allows maximum frequency of operation by changing delay into latency

[Figure: register slice on the write data channel (WID, WDATA, WSTRB, WLAST, WVALID, WREADY)]
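"Changing delay into latency" can be illustrated with simple arithmetic. The numbers and the function name below are made up for the example, and the sketch assumes that register slices split a combinational path into equal segments.

```python
def with_register_slices(path_delay_ns, slices):
    """Return (max clock frequency in MHz, added latency in cycles).

    Assumption: the slices cut the path into slices + 1 equal segments.
    """
    stage_delay_ns = path_delay_ns / (slices + 1)
    return 1000.0 / stage_delay_ns, slices


unsliced = with_register_slices(4.0, slices=0)   # one long 4 ns path
sliced = with_register_slices(4.0, slices=1)     # two 2 ns segments
```

The sliced channel clocks twice as fast, at the price of one extra cycle of latency per transfer.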

Other AXI features

• No early burst termination, but fine-granularity specification of burst beats (1-16)
• Burst types:
  - Fixed (FIFO-like)
  - Incremental
  - Wrapping
• Support for system caches
  - Bufferable vs. cacheable transactions
• Support for:
  - Privileged transactions vs. normal ones
  - Secure vs. non-secure transactions
• Support for exclusive accesses
  - Read exclusive, followed by write exclusive
• Support for locked accesses
  - Terminated by an unlocked access
• Write data interleaving (of transactions with different IDs)
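Since only the first address of a burst is issued, the remaining beat addresses are derived from the burst type. The sketch below uses a hypothetical helper and assumes a start address aligned to the beat size; for wrapping bursts it also assumes a power-of-two number of beats.

```python
def burst_addresses(start, beats, size, burst="INCR"):
    """Address of every beat, given only the first address of the burst.

    size is the number of bytes per beat; start is assumed size-aligned.
    """
    if not 1 <= beats <= 16:
        raise ValueError("burst length must be 1-16 beats")
    if burst == "FIXED":            # FIFO-like: same address every beat
        return [start] * beats
    if burst == "INCR":             # consecutive addresses
        return [start + i * size for i in range(beats)]
    if burst == "WRAP":             # wraps at a (beats * size) boundary
        span = beats * size
        base = (start // span) * span
        return [base + (start - base + i * size) % span for i in range(beats)]
    raise ValueError("unknown burst type: " + burst)


fixed = burst_addresses(0x100, 4, 4, "FIXED")
incr = burst_addresses(0x100, 4, 4, "INCR")
wrap = burst_addresses(0x38, 4, 4, "WRAP")
```

The wrapping case is the typical cache line fill: the critical word at 0x38 comes first and the burst wraps back to the start of the 16-byte line.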

Comparison

[Figure: timing comparison of AHB, STBus (low buffering), STBus (high buffering) and AXI; Init1-Init3 access Mem1-Mem3 (2 wait states memories) over the bus]

• AHB: it is impossible to hide slave response latency
• While the previous response phase is in progress, a new request can be processed by the next addressed slave
• More data is pre-accessed while the previous response phase is in progress
• Interleaving support in interfaces and interconnect allows better interconnect exploitation

Scalability

• Highly parallel benchmark (no slave bottlenecks)
• 1 memory wait state

[Figure: relative execution time (%) of AHB, AXI, STBus and STBus (B) for 2, 4, 6 and 8 cores; two panels: 1 kB cache (low bus traffic) and 256 B cache (high bus traffic)]

Scalability

[Figure: two bar charts for 2, 4, 6 and 8 cores: interconnect usage efficiency (%) and interconnect busy (%)]

• Increasing contention: AXI, STBus show 80%+ efficiency, AHB < 50%
• Saturation of shared bus architectures

Networks-on-Chip (NoCs)

Same paradigm as Wide Area Networks and large-scale multiprocessors

[Figure: NoC in which IP cores (masters and slaves) attach through network interfaces (NI) to switches; a packet consists of header, payload and tail and travels as a sequence of flits]

• Clean separation at session layer
  - Core issues end-to-end transactions (through AXI, OCP, ...)
  - Network deals with lower-level issues
• Modularity at HW level
  - Only 2 building blocks: network interface, switch
• Physical design aware
  - Path segmentation
  - Regular routing
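The packet structure in the figure (header, payload, tail, carried as flits) can be sketched as the packetization step a network interface performs. The flit layout, the 4-byte flit size and the function name are assumptions made for illustration; real NoCs define their own formats.

```python
def packetize(header, payload, flit_bytes=4):
    """Split one transaction into a list of (kind, data) flits.

    Hypothetical format: one HEAD flit, BODY flits of flit_bytes each,
    and an empty TAIL flit that closes the packet.
    """
    chunks = [payload[i:i + flit_bytes]
              for i in range(0, len(payload), flit_bytes)]
    flits = [("HEAD", header)]
    flits += [("BODY", chunk) for chunk in chunks]
    flits.append(("TAIL", b""))
    return flits


flits = packetize(b"\x01", b"abcdefgh")
```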

Shared buses vs NoCs: NoC pros

- Bus: each integrated IP core adds bus load capacitance
+ NoC: only point-to-point one-way links are used
- Bus: timing problems in deep sub-micron designs
+ NoC: better suited for the GALS paradigm
- Bus: arbiter delay grows with the number of masters; instance-specific arbiter
+ NoC: distributed routing decisions; reinstantiable switches
- Bus: bandwidth is shared among all masters
+ NoC: bandwidth scales with network dimension

Shared buses vs NoCs: NoC cons

+ Bus: after the bus is granted, bus access latency is null
- NoC: unpredictable latency due to network congestion problems
+ Bus: very low silicon cost
- NoC: high area cost
+ Bus: simple bus-IP core interface
- NoC: network-IP core interface can be very complex (e.g. packetization)
+ Bus: design guidelines are well known
- NoC: design guidelines are only starting to consolidate
