NoC: MPSoC Communication Fabric, Interconnection Networks (ELE 580). Shougata Ghosh, 18th Apr 2006.


Page 1: Title

NoC: MPSoC Communication Fabric
Interconnection Networks (ELE 580)
Shougata Ghosh, 18th Apr 2006

Page 2: Outline

MPSoC
Network-on-Chip
Case studies: IBM CoreConnect, CrossBow IPs, Sonics Silicon Backplane

Page 3: What are MPSoCs?

MPSoC: Multiprocessor System-on-Chip
Most SoCs today use multiple processing cores
MPSoCs are characterised by heterogeneous multiprocessors: CPUs, IPs (intellectual property blocks), DSP cores, memory, communication handlers (USB, UART, etc.)

Page 4: Where are MPSoCs used?

Cell phones
Network processors (used by telecom and networking equipment to handle high data rates)
Digital television and set-top boxes
High-definition television
Video games (e.g. the PS2 Emotion Engine)

Page 5: Challenges

All MPSoC designs must balance the following requirements:
Speed
Power
Area
Application performance
Time to market

Page 6: Why reinvent the wheel?

Why not use a uniprocessor (3.4 GHz!)? PDAs are usually uniprocessor
A uniprocessor cannot keep up with real-time processing requirements; it is too slow for real-time data
Real-time processing requires "real" concurrency
Uniprocessors provide only "apparent" concurrency through multitasking (the OS)
Multiprocessors can provide the concurrency required to handle real-time events

Page 7: Need multiple processors

Why not SMPs?
+ SMPs are cheaper (design reuse)
+ Easier to program
- Unpredictable delays (e.g. snoopy cache coherence)
- Need buffering to handle the unpredictability

Page 8: Area concerns

Configured SMPs would have unused resources
Special-purpose PEs don't need to support unwanted processes:
Faster
Area efficient
Power efficient
Can exploit known memory access patterns: smaller caches (area savings)

Page 9: MPSoC Architecture

Page 10: Components

Hardware:
Multiple processors
Non-programmable IPs
Memory
Communication interface (interfaces the heterogeneous components to the communication network)
Communication network: hierarchical (busses) or NoC

Page 11: Design Flow

System-level synthesis:
Top-down approach
A synthesis algorithm derives the SoC architecture + SW model from the system-level specs
Platform-based design:
Starts with a functional system spec + a predesigned platform
Mapping & scheduling of functions onto HW/SW
Component-based design:
Bottom-up approach

Page 12: Platform-Based Design

Start with the functional spec: task graphs
Task graph:
Nodes: tasks to complete
Edges: communication and dependence between tasks
Execution time is annotated on the nodes
Data communicated is annotated on the edges (a small data-structure sketch follows below)
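The node and edge annotations above map naturally onto a small data structure. A minimal sketch in C++ (not from the slides; all names and units are assumptions):

    #include <cstddef>
    #include <string>
    #include <vector>

    // One task (node): a name plus its execution time.
    struct Task {
        std::string name;
        double exec_time_us;          // execution time annotated on the node
    };

    // One dependence (edge): producer -> consumer plus the data it carries.
    struct Dependence {
        int from, to;                 // indices into TaskGraph::tasks
        std::size_t bytes;            // data communicated, annotated on the edge
    };

    // The task graph used as input to platform-based mapping and scheduling.
    struct TaskGraph {
        std::vector<Task> tasks;
        std::vector<Dependence> edges;
    };

    int main() {
        // Tiny example: capture -> encode, 64 KB of pixels on the edge.
        TaskGraph g;
        g.tasks = { {"capture", 120.0}, {"encode", 450.0} };
        g.edges = { {0, 1, 64 * 1024} };
        return 0;
    }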

Page 13: Map tasks onto predesigned HW

Map the tasks onto the predesigned hardware
Use an extended task graph for SW and communication

Page 14: Mapping onto HW

Gantt chart: scheduling of task execution & timing analysis
Extended task graph: communication nodes (reads and writes)
ILP and heuristic algorithms schedule tasks and communication onto HW and SW (see the list-scheduling sketch below)
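The slides name ILP and heuristic algorithms but do not give them. As one hedged illustration of the heuristic side, the following greedy list scheduler (all names and the simplified cost model are assumptions) assigns each ready task to the processing element that lets it finish earliest, which is exactly the kind of assignment a Gantt chart would visualise:

    #include <algorithm>
    #include <cstddef>
    #include <vector>

    struct TaskNode {
        double exec_time;             // execution time (simplified: same on every PE)
        std::vector<int> deps;        // indices of tasks that must finish first
    };

    // Greedy list scheduling over a task DAG: repeatedly pick a ready task and
    // place it on the PE that lets it finish earliest. Returns finish times.
    std::vector<double> list_schedule(const std::vector<TaskNode>& tasks, int num_pes) {
        std::vector<double> finish(tasks.size(), -1.0);   // -1 = not yet scheduled
        std::vector<double> pe_free(num_pes, 0.0);        // when each PE becomes free
        std::size_t scheduled = 0;
        while (scheduled < tasks.size()) {
            for (std::size_t t = 0; t < tasks.size(); ++t) {
                if (finish[t] >= 0.0) continue;           // already placed
                double ready = 0.0;                       // earliest start = max over deps
                bool deps_done = true;
                for (int d : tasks[t].deps) {
                    if (finish[d] < 0.0) { deps_done = false; break; }
                    ready = std::max(ready, finish[d]);
                }
                if (!deps_done) continue;
                int best = 0;                             // PE with earliest availability
                for (int p = 1; p < num_pes; ++p)
                    if (pe_free[p] < pe_free[best]) best = p;
                double start = std::max(ready, pe_free[best]);
                finish[t] = start + tasks[t].exec_time;
                pe_free[best] = finish[t];
                ++scheduled;
            }
        }
        return finish;
    }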

Page 15: Component-Based Design

Conceptual MPSoC platform: SW, processors, IPs, communication fabric
Parallel development using APIs
Quicker time to market

Page 16: Design Flow Schematic

Page 17: Communication Fabric

Has mostly been bus based: IBM CoreConnect, Sonics Silicon Backplane, etc.
Busses are not scalable: usually around 5 processors, rarely more than 10
The number of cores keeps increasing, pushing designs towards NoC

Page 18: NoC

NoC NoC NoC-ing on Heaven's Door!!
A typical (regular) network-on-chip

Page 19: Regular NoC

An array of tiles
Each tile has input (inject into network) and output (receive from network) ports
Input port: 256-bit data + 38-bit control
The network handles both static and dynamic traffic
Static: e.g. the flow of data from a camera to an MPEG encoder
Dynamic: e.g. a memory request from a PE (or CPU)
Static traffic uses dedicated VCs; dynamic traffic goes through arbitration

Page 20: Control Bits

Control bit fields (a bit-layout sketch follows below):
Type (2 bits): head, body, tail, idle
Size (4 bits): data size, from 0 (1 bit) to 8 (256 bits)
VC mask (8 bits): mask selecting the VC (out of 8); can also be used to prioritise
Route (16 bits): source routing
Ready (8 bits): signal from the network indicating it is ready to accept the next flit (why 8? presumably one ready bit per VC)
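A minimal sketch of how these five fields could be packed into the 38-bit control word; the widths come from the slide, but the field order and the use of C++ bitfields are assumptions:

    #include <cstdint>

    // 2 + 4 + 8 + 16 + 8 = 38 control bits per flit (field order assumed).
    struct ControlWord {
        std::uint64_t type    : 2;   // head, body, tail, idle
        std::uint64_t size    : 4;   // 0 (1 bit) .. 8 (256 bits)
        std::uint64_t vc_mask : 8;   // candidate virtual channels, one bit per VC
        std::uint64_t route   : 16;  // source route, consumed hop by hop
        std::uint64_t ready   : 8;   // per-VC ready signal back from the network
    };

    enum FlitType : unsigned { IDLE = 0, HEAD = 1, BODY = 2, TAIL = 3 };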

Page 21: Flow Control

Virtual-channel flow control
The router has an input controller and an output controller
The input controller has a buffer and state for each VC
The input controller strips the routing info from the head flit
The flit then arbitrates for an output VC
Each output VC has a buffer for a single flit, used to hold a flit waiting for an input buffer at the next hop (a per-VC state sketch follows below)
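As a rough illustration of the per-VC state such an input controller keeps, here is a sketch; the field names and the fixed count of 8 VCs (taken from the 8-bit VC mask) are assumptions:

    #include <array>
    #include <cstdint>
    #include <deque>

    struct Flit { std::uint64_t control; /* plus the 256-bit payload in hardware */ };

    enum class VcState { Idle, Routing, WaitingOutputVc, Active };

    // Per-virtual-channel bookkeeping inside one input controller.
    struct InputVc {
        VcState state = VcState::Idle;  // where this VC is in the pipeline
        std::deque<Flit> buffer;        // flit buffer for this VC
        int out_port = -1;              // output port taken from the head flit's route
        int out_vc   = -1;              // output VC won during VC allocation
    };

    // One input controller: a set of VCs sharing the physical input port.
    struct InputController {
        std::array<InputVc, 8> vcs;     // 8 VCs, matching the 8-bit VC mask
    };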

Page 22: Input and Output Controllers

Page 23: NoC Issues

Basic differences between NoC and inter-chip or inter-board networks:
Wires and pins are abundant on chip
Buffer space is limited on chip
On-chip "pins" per tile can number around 24,000, compared to roughly 1,000 for inter-chip designs
Designers can trade wiring resources for network performance
Channel widths: on-chip around 300 bits vs. 8-16 bits inter-chip

Page 24: Topology

The previous design used a folded torus
A folded torus has twice the wire demand and twice the bisection bandwidth of a mesh (a worked example follows below)
It converts plentiful wires into bandwidth (performance)
Not hard to implement on-chip, but can be more power hungry
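As a rough worked example of that factor of two (the numbers are assumed, not from the slides): in a k x k layout with channel width w, a bisecting cut crosses about k channels in a mesh but about 2k in a torus, because the wrap-around links cross the cut as well. With k = 4 tiles per side and the roughly 300-bit on-chip channels mentioned on the previous slide, that is about 4 x 300 = 1,200 bits per cycle of bisection bandwidth for the mesh versus 8 x 300 = 2,400 for the folded torus, paid for with roughly twice the wiring across the cut.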

Page 25: Flow Control Decision

Area is scarce in on-chip designs, and buffers use up a lot of area
Flow control schemes that need fewer buffers are favourable, but this must be balanced against performance
Dropping-packet flow control requires the least buffering, but at the expense of performance
Misrouting is an option when there is enough path diversity

Page 26: High-Performance Circuits

Wiring is regular and known at design time, so it can be accurately modelled (R, L, C)
This enables:
Low-swing signalling: 100 mV instead of 1 V, a huge power saving
Overdrive gives about 3x the signal velocity of full-swing drivers
Overdrive also increases repeater spacing, again saving significant power

Page 27: Heterogeneous NoC

Regular topologies facilitate modular design and scale up easily by replication
However, for heterogeneous systems, regular topologies lead to overdesign
Heterogeneous NoCs can optimise around local bottlenecks
Solution: a complete application-specific NoC synthesis flow, with a customised topology and NoC building blocks

Page 28: xPipes Lite

Application-specific NoC library; creates application-specific NoCs
Uses a library of NIs, switches, and links
Parameterised library modules optimised for frequency and low latency
Packet-switched communication
Source routing
Wormhole flow control
Topologies: torus, mesh, B-tree, butterfly

Page 29: NoC Architecture Block Diagram

Page 30: xPipes Lite

Uses OCP to communicate with cores
OCP advantages:
Industry-wide standard communication protocol between cores and the NoC
Allows parallel development of cores and NoC
Smoother development of modules
Faster time to market

Page 31: xPipes Lite - Network Interface

Bridges the OCP interface and the NoC switching fabric
Functions:
Synchronisation between OCP and xPipes timing
Packetising OCP transactions into flits
Route calculation
Flit buffering to improve performance

Page 32: NI

Uses 2 registers to interface with OCP:
Header register stores the address (sent once)
Payload register stores the data (sent multiple times for burst transfers)
Flits are generated from the registers:
Header flit from the header register
Body/payload flits from the payload register
Routing info goes in the header flit; the route is determined from a LUT using the destination address (see the packetisation sketch below)
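A minimal sketch of that packetisation step; only the header/payload split and the LUT lookup come from the slide, while the route LUT type, field widths, address layout, and function names are assumptions:

    #include <cstddef>
    #include <cstdint>
    #include <unordered_map>
    #include <vector>

    struct Flit {
        bool is_head = false;
        bool is_tail = false;
        std::uint16_t route = 0;       // source route, carried in the head flit only
        std::uint64_t data  = 0;       // address (head) or payload word (body)
    };

    // Build a packet from the two OCP-side registers: one head flit carrying the
    // address and the LUT-derived route, then one body flit per payload beat.
    std::vector<Flit> packetise(std::uint64_t header_reg,
                                const std::vector<std::uint64_t>& payload_beats,
                                const std::unordered_map<std::uint32_t, std::uint16_t>& route_lut) {
        std::vector<Flit> flits;
        std::uint32_t dest = static_cast<std::uint32_t>(header_reg >> 32);  // assumed address layout
        Flit head;
        head.is_head = true;
        head.route = route_lut.at(dest);   // route determined from the LUT by destination
        head.data = header_reg;
        flits.push_back(head);
        for (std::size_t i = 0; i < payload_beats.size(); ++i) {
            Flit body;
            body.data = payload_beats[i];
            body.is_tail = (i + 1 == payload_beats.size());  // last beat is the tail flit
            flits.push_back(body);
        }
        return flits;
    }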

Page 33: Network Interface

Bidirectional NI
Output stage identical to the xPipes switches
Input stage uses dual-flit buffers
Uses the same flow control as the switches

Page 34: Switch Architecture

The xPipes switch is the basic building block of the switching fabric
2-cycle latency
Output-queued router
Fixed and round-robin priority arbitration on the input lines
Flow control: ACK/NACK with Go-Back-N semantics
CRC error checking

Page 35: Switch

The allocator module performs arbitration for the head flit and holds the path until the tail flit
The routing info requests the output port
The switch is parameterisable in: number of inputs/outputs, arbitration policy, output buffer sizes (a configuration sketch follows below)
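As a small sketch, those three degrees of freedom could be captured in a configuration record like the following; the names are assumptions, not the actual xPipes generator interface:

    // Parameters exposed per switch instance (sketch; not the real xPipes API).
    enum class Arbitration { Fixed, RoundRobin };

    struct SwitchConfig {
        int num_inputs;                // number of input ports
        int num_outputs;               // number of output ports
        Arbitration policy;            // fixed-priority or round-robin arbitration
        int output_buffer_flits;       // depth of each output buffer, in flits
    };

    // Example: 5x5 switch (4 mesh directions + local port), round-robin, 4-flit buffers.
    const SwitchConfig kExample{5, 5, Arbitration::RoundRobin, 4};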

Page 36: Switch flow control

An input flit is dropped if:
The requested output port is held by a previous packet
The output buffer is full
It lost the arbitration
A NACK is sent back, and all subsequent flits of that packet are dropped until the header flit reappears (Go-Back-N flow control)
The switch also updates the routing info for the next switch (see the sketch below)
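A minimal sketch of that per-input behaviour, combining the drop conditions, the NACK, and the source-route update; the signal names and the 2-bit-per-hop route encoding are assumptions, only the behaviour itself comes from the slide:

    #include <cstdint>

    struct InFlit {
        bool is_head = false;
        std::uint16_t route = 0;     // source route; low bits select this switch's output port
        std::uint64_t payload = 0;
    };

    struct InputPortState {
        bool discarding = false;     // true after a drop, until the head flit is retransmitted
    };

    // Returns true (ACK) if the flit was accepted, false (NACK) if it was dropped.
    bool handle_flit(InputPortState& st, InFlit& f,
                     bool out_port_held, bool out_buffer_full, bool lost_arbitration) {
        if (st.discarding && !f.is_head)
            return false;                          // Go-Back-N: drop everything until the head reappears
        if (out_port_held || out_buffer_full || lost_arbitration) {
            st.discarding = true;                  // start dropping this packet's flits
            return false;                          // NACK goes back to the upstream switch
        }
        st.discarding = false;                     // packet (re)accepted from its head flit
        if (f.is_head)
            f.route >>= 2;                         // consume this hop's bits so the next switch sees its own port (assumed encoding)
        return true;                               // ACK
    }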

Page 37: xPipes Lite - Links

The links are pipelined to overcome the interconnect-delay problem
xPipes Lite uses shallow pipelines for all modules (NI, switch):
Low latency
Fewer buffers required
Area savings
Higher frequency

Page 38: xPipes Lite Design Flow

Page 39: IBM CoreConnect

Page 40: CoreConnect Bus Architecture

An open 32-, 64-, and 128-bit core on-chip bus standard
Communication fabric for IBM Blue Logic cores and other non-IBM devices
Provides high bandwidth with a hierarchical bus structure:
Processor Local Bus (PLB)
On-Chip Peripheral Bus (OPB)
Device Control Register bus (DCR)

Page 41: Performance Features

Page 42: CoreConnect Components

PLB
OPB
DCR
PLB arbiter
OPB arbiter
PLB-to-OPB bridge
OPB-to-PLB bridge

Page 43: PLB

Page 44: Processor Local Bus

Fully synchronous, supports up to 8 masters
32-, 64-, and 128-bit architecture versions; extendable to 256 bits
Separate read and write data buses enable overlapped transfers and higher data rates
High-bandwidth capabilities:
Burst transfers, variable and fixed length
Pipelining
Split transactions
DMA transfers
No on-chip tri-states required
Cache-line transfers
Overlapped arbitration, programmable priority fairness

Page 45: Processor Local Bus (cont'd.)