wl 2020 11.1
SOC architecture and design
• system-on-chip (SOC)
  – processors become components in a system
• SOC covers many topics
  – processor: pipelined, superscalar, VLIW, array, vector
  – storage: cache, embedded and external memory
  – interconnect: buses, network-on-chip
  – impact: time, area, power, reliability, configurability
  – customisability: specialized processors, reconfiguration
  – productivity/tools: model, explore, re-use, synthesise, verify
  – examples: crypto, graphics, media, network, comm, security
  – future: autonomous SOC, self-optimising/verifying design
• our focus
  – overview, processor, memory
wl 2020 11.2
iPhone SOC
[Figure: iPhone SOC die, a 1 GHz ARM Cortex A8 processor alongside memory and several I/O blocks. Source: UC Berkeley]
wl 2020 11.3
Basic system-on-chip model
wl 2020 11.4
AMD’s Barcelona Multicore
[Figure: die layout, Cores 1-4, each with its own 512KB L2 cache; 2MB shared L3 cache; integrated Northbridge]
4 out-of-order cores
1.9 GHz clock rate
65nm technology
3 levels of caches
integrated Northbridge
http://www.techwarelabs.com/reviews/processors/barcelona/
wl 2020 11.5
SOC vs processors on chip
• with lots of transistors, designs move in 2 ways:
– complete system on a chip
– multi-core processors with lots of cache
                System on chip                    Processors on chip
processor       multiple, simple, heterogeneous   few, complex, homogeneous
cache           one level, small                  2-3 levels, extensive
memory          embedded, on chip                 very large, off chip
functionality   special purpose                   general purpose
interconnect    wide, high bandwidth              often through cache
power, cost     both low                          both high
operation       largely stand-alone               need other chips
wl 2020 11.6
Processor types: overview
Processor type   Architecture / implementation approach
SIMD             single instruction applied to multiple functional units
Vector           single instruction applied to multiple pipelined registers
VLIW             multiple instructions issued each cycle under compiler control
Superscalar      multiple instructions issued each cycle under hardware control
wl 2020 11.7
Sequential and parallel machines
• basic single stream processors
– pipelined: overlap operations in basic sequential
– superscalar: transparent concurrency
– VLIW: compiler-generated concurrency
• multiple streams, multiple functional units
– array processors
– vector processors
• multiprocessors
wl 2020 11.8
Pipelined processor
[Figure: four instructions flow through the IF, ID, AG, DF, EX, WB pipeline stages, each instruction starting one cycle after the previous; time runs horizontally]
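The overlap shown in the figure can be quantified: with six stages (IF, ID, AG, DF, EX, WB) and n instructions, an ideal pipeline finishes in n + 6 - 1 cycles rather than 6n. A minimal sketch of this hazard-free timing model (an illustration, not a cycle-accurate simulator):

```python
# stage names as in the figure: fetch, decode, address generate,
# data fetch, execute, writeback
STAGES = ["IF", "ID", "AG", "DF", "EX", "WB"]

def pipelined_cycles(n_instructions, n_stages=len(STAGES)):
    """Cycles to finish n instructions when a new one enters every cycle."""
    return n_instructions + n_stages - 1

def sequential_cycles(n_instructions, n_stages=len(STAGES)):
    """Cycles with no overlap: each instruction occupies the whole datapath."""
    return n_instructions * n_stages

assert pipelined_cycles(4) == 9      # the four instructions in the figure
assert sequential_cycles(4) == 24    # versus 24 cycles with no overlap
```

For long instruction streams the speedup approaches the pipeline depth, which is why even simple SOC processors are pipelined.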
wl 2020 11.9
Superscalar and VLIW processors
[Figure: two instructions issued per cycle (#1 and #2, then #3 and #4, then #5 and #6), each overlapping in the IF, ID, AG, DF, EX, WB stages; time runs horizontally]
wl 2020 11.10
[Figure: superscalar vs VLIW, superscalar uses hardware for parallelism control; VLIW relies on the compiler]
wl 2020 11.11
Array processors
• perform op if condition = mask
• operand can come from neighbour
• instruction format: mask | op | dest | sr1 | sr2
[Figure: one instruction issued to all PEs; n PEs, each with memory and neighbour communications]
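The mask field predicates execution: every PE evaluates the operation in lockstep, but only PEs whose mask bit is set update the destination. A rough software model of one such instruction (field and function names are illustrative):

```python
import operator

def masked_op(op, mask, dest, sr1, sr2):
    """One array-processor instruction: every PE computes op(sr1, sr2)
    in lockstep, but only PEs whose mask bit is set update dest."""
    return [op(a, b) if m else d
            for m, d, a, b in zip(mask, dest, sr1, sr2)]

# 4 PEs; the mask enables only PEs 0 and 2.
result = masked_op(operator.add, [1, 0, 1, 0],
                   [9, 9, 9, 9], [1, 2, 3, 4], [10, 20, 30, 40])
assert result == [11, 9, 33, 9]
```

Predication like this lets a single instruction stream handle data-dependent control flow without branches.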
wl 2020 11.12
Vector processors
• vector registers, eg 8 sets x 64 elements x 64 bits
• vector instructions: VR3 = VR2 VOP VR1
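A vector instruction such as VR3 = VR2 VOP VR1 applies one operation across every element of the named registers. A toy model with 64-element registers, matching the example above (a sketch, not how real vector hardware pipelines the elements):

```python
VLEN = 64  # elements per vector register, matching the 8 x 64 x 64-bit example

def vop(vr2, vr1, op):
    """Model VR3 = VR2 VOP VR1: one instruction, elementwise over full registers."""
    assert len(vr1) == len(vr2) == VLEN
    return [op(a, b) for a, b in zip(vr2, vr1)]

vr1 = list(range(VLEN))   # 0, 1, ..., 63
vr2 = [2] * VLEN
vr3 = vop(vr2, vr1, lambda a, b: a * b)
assert vr3[0] == 0 and vr3[63] == 126
```

One instruction thus replaces a 64-iteration loop, which is the source of a vector machine's efficiency: fetch/decode cost is amortised over the whole register.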
wl 2020 11.13
Memory addressing: three levels
(each segment contains pages for a program/process)
wl 2020 11.14
User view of memory: addressing
• a program: process address (offset + base + index)
  – virtual address: formed from page address and process/user id
• segment table: process base and bound (for each process)
  – system address: process base + page address
• pages: active localities in main/real memory
  – virtual address: page table lookup yields physical address
  – page miss: virtual page not in page table
• TLB (translation look-aside buffer): caches recent translations
  – TLB entry: a real address and the corresponding (virtual, id) address
  – a few hashed virtual address bits select a TLB entry
  – if the (virtual, id) pair matches the TLB entry's tag, use its translation
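The lookup described above can be sketched directly: a few bits of the virtual address index a small table, and the stored (virtual, id) tag must match before the cached translation is used. A direct-mapped toy model (real TLBs are usually set-associative; all names here are illustrative):

```python
TLB_BITS = 4                       # a few virtual-address bits -> 16 entries
tlb = [None] * (1 << TLB_BITS)     # each entry: ((vpage, pid), real_page)

def tlb_index(vpage):
    return vpage & ((1 << TLB_BITS) - 1)   # hash: low bits of virtual page

def tlb_lookup(vpage, pid):
    """Return the real page on a hit, None on a miss (then walk the page table)."""
    entry = tlb[tlb_index(vpage)]
    if entry and entry[0] == (vpage, pid):  # tag match: same (virtual, id)
        return entry[1]
    return None

def tlb_fill(vpage, pid, rpage):
    """Record a recent translation, evicting whatever shared this index."""
    tlb[tlb_index(vpage)] = ((vpage, pid), rpage)

tlb_fill(vpage=0x2A, pid=7, rpage=0x91)
assert tlb_lookup(0x2A, 7) == 0x91   # hit
assert tlb_lookup(0x2A, 8) is None   # same page, different process id: miss
```

The process id in the tag is what lets two processes map the same virtual page to different physical pages without flushing the TLB on a context switch.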
wl 2020 11.15
TLB and paging: address translation
[Figure: a virtual address is translated to a physical address via the TLB (recent translations); otherwise the process base (find process) and a page lookup (find page) form the system address, which locates the physical address]
wl 2020 11.16
SOC interconnect
• interconnecting multiple active agents requires
– bandwidth: capacity to transmit information (bps)
– protocol: logic for non-interfering message transmission
• bus
– AMBA (Adv. Microcontroller Bus Architecture) from ARM,
widely used for SOC
– bus performance: can determine system performance
• network on chip
– array of switches
– statically switched: eg mesh
– dynamically switched: eg crossbar
– adopted in the latest FPGAs to support AI and 5G
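For a statically switched mesh, a common scheme is dimension-ordered (XY) routing: a message travels along the x dimension first, then y, which keeps routing simple and deadlock-free on a mesh. A small sketch (illustrative only; real NoCs add flow control, virtual channels, and arbitration):

```python
def xy_route(src, dst):
    """Dimension-ordered routing on a 2D mesh: correct x first, then y.
    Returns the list of switch coordinates visited from src to dst."""
    x, y = src
    path = [src]
    while x != dst[0]:
        x += 1 if dst[0] > x else -1
        path.append((x, y))
    while y != dst[1]:
        y += 1 if dst[1] > y else -1
        path.append((x, y))
    return path

path = xy_route((0, 0), (2, 1))
assert path == [(0, 0), (1, 0), (2, 0), (2, 1)]
assert len(path) - 1 == 3   # hop count = |dx| + |dy|, the Manhattan distance
```

A crossbar, by contrast, connects any input to any output in one hop at the cost of O(n^2) switch area, which is the static/dynamic trade-off the slide alludes to.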
wl 2020 11.17
Adaptive Compute Acceleration Platform
• adaptive: diverse workloads in milliseconds; future-proof for new algorithms
• compute acceleration engines, connected by Network-on-Chip: Scalar Engines, Adaptable Engines, AI Engines
• platform: development tools, hardware/software libraries, run-time stack, software programmable silicon infrastructure
• enabling data scientists, software developers, hardware developers
Source: Xilinx
wl 2020 11.18
Versal Architecture (7nm technology)
• Scalar Engines: platform control, edge compute
• Adaptable Engines: 2X compute density
• AI Engines: AI compute, diverse DSP workloads
• Network-on-Chip: guaranteed bandwidth, enables software programmability
• DDR Memory: 3200-DDR4, 4200-LPDDR4, 2X bandwidth/pin
• PCIe & CCIX: 2X PCIe & DMA bandwidth, cache-coherent interface to accelerators
• Programmable I/O: any interface or sensor, includes 4.2Gb/s MIPI
• Transceivers: broad range, 25G → 112G; 58G in mainstream devices
Source: Xilinx
wl 2020 11.19
Platform Management Controller: Bringing the Platform to Life & Keeping it Safe & Secure
Boot & Configuration
˃ Boots the platform in milliseconds (any engine first)
˃ 8 times faster dynamic reconfiguration
˃ Advanced power & thermal management
Security, Safety & Reliability Enclave
˃ Hardware Root of Trust
˃ Cryptographic acceleration, confidentiality
˃ Enhanced diagnostics, system monitoring, anti-tamper
˃ Error mitigation, detection, management for safety
Integrated Platform Interfaces & High Speed Debug
˃ Integrated flash, system & debug interfaces
˃ High-speed non-invasive, chip-wide debug
[Figure: boot flow completing in 10s of milliseconds; blocks for boot & config, debug, safety, security]
Source: Xilinx
wl 2020 11.20
AI Engine: overview
[Figure: array of AI Engine tiles, each pairing an AI core with local memory]
• workloads: signal processing and artificial intelligence (CNN, LSTM/MLP, computer vision)
• 1GHz+ multi-precision vector processor
• high bandwidth extensible memory
• up to 400 AI Engines per device
• 8 times compute density
• 40% lower power consumption
Software programmable, deterministic, efficient
Source: Xilinx
wl 2020 11.21
AI Engine: tile-based architecture
[Figure: tile containing an ISA-based vector processor with AI and 5G vector extensions, local memory, data movers and interconnect; the tile array connects to the PL and to PS I/O]
• ISA-based vector processor: software programmable (e.g., C/C++); AI and 5G vector extensions
• data mover: non-neighbour data communication; integrated synchronization primitives
• non-blocking interconnect: up to 200+ GB/s bandwidth per tile
• local memory: multi-bank implementation, shared across neighbour cores
• cascade interface: partial results to next core
Source: Xilinx
wl 2020 11.22
AI Engine: processor core
[Figure: core datapath with instruction fetch & decode unit, three AGUs, Load Unit A, Load Unit B, a store unit, and memory and stream interfaces]
• local, shareable memory: 32KB local, 128KB addressable
• scalar unit: 32-bit scalar RISC processor with scalar register file, scalar ALU, non-linear functions
• vector processor: vector register file, fixed-point vector unit, floating-point vector unit; 512-bit SIMD datapath
• up to 128 MACs / clock cycle per core (INT8); highly parallel
• 7+ operations / clock cycle: 2 vector loads, 1 multiply, 1 store; 2 scalar ops / stream access
• instruction parallelism: VLIW; data parallelism: SIMD
• multiple vector lanes: vector datapath with 8/16/32-bit & SPFP operands
Source: Xilinx
wl 2020 11.23
AI Engine: processor core
• program memory per tile
  – 16KB: 128-bit wide, 1K words deep, single port
  – instruction compression, ECC protection + reporting
• 32KB data memory per tile
  – 8 single-port banks, 256-bit wide, 128 words deep
  – 5 cycle access latency
  – error detection (parity) + reporting
• independent DMA per tile, 2D strided access to north, south, east, west
• 3 AGUs: 2 load, 1 store
• 32-bit scalar RISC
  – with 32x32 scalar multiplier
  – sin/cos, square root, inverse square root
• 512-bit fixed-point vector unit
• single-precision floating-point vector unit
Source: Xilinx
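The sizes quoted on this and the previous slide are mutually consistent, which is quick to check. (The reading of the 128KB addressable figure as four 32KB memory modules, a core's own plus three shared neighbours, is an assumption, not stated on the slide.)

```python
# 32KB data memory per tile: 8 banks x 256 bits x 128 words
banks, width_bits, depth = 8, 256, 128
bytes_per_tile = banks * (width_bits // 8) * depth
assert bytes_per_tile == 32 * 1024

# program memory: 128-bit wide x 1K words = 16KB
assert (128 // 8) * 1024 == 16 * 1024

# 128KB addressable = four 32KB memory modules (assumed reading)
assert 4 * bytes_per_tile == 128 * 1024
```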
wl 2020 11.24
Multi-precision support
[Table: supported AI data types and signal processing data types. Source: Xilinx]
wl 2020 11.25
Design cost: product economics
• increasingly, product cost is determined by
  – design costs, including verification
  – not the marginal cost to produce
• manage complexity in die technology by
  – engineering effort
  – engineering cleverness
• design effort often dictated by product volume
[Figure: balance between basic physical tradeoffs and design time and effort; the balance point depends on n, the number of units]
wl 2020 11.26
Design complexity
[Figure: growth in design complexity of processors over time]
wl 2020 11.27
Cost: product program vs engineering
Product cost
├ Manufacturing costs (variable costs)
├ Engineering (fixed costs)
│ ├ Engineering costs: chip design, CAD support, software, verify & test
│ └ Fixed project costs: mask costs, capital equipment, CAD programs, labor costs
└ Marketing, sales, administration (fixed costs)
wl 2020 11.28
Example: two scenarios
• fixed costs Kf, support costs 0.1 x function(n), and variable costs Kv x n
• design gets more complex, while production costs decrease
  – Kf increases while Kv decreases
  – at the same price, higher volumes are required to break even
• compared with 1995, in 2015
  – Kf increased by 10 times
  – Kv decreased by the same amount
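The effect of these shifts on the break-even volume can be made concrete. With the cost model above and a fixed selling price p, break-even is the smallest n with p·n ≥ Kf + 0.1·f(n) + Kv·n. A sketch taking f(n) = n for simplicity; all the numbers are hypothetical, only the 10x shifts between 1995 and 2015 come from the slide:

```python
import math

def breakeven_volume(Kf, Kv, price, support_rate=0.1):
    """Smallest n with price*n >= Kf + support_rate*n + Kv*n,
    taking the support cost's function(n) to be simply n (an assumption)."""
    margin = price - Kv - support_rate
    assert margin > 0, "price must exceed per-unit costs"
    return math.ceil(Kf / margin)

# illustrative numbers only
n_1995 = breakeven_volume(Kf=1e6, Kv=10.0, price=20.0)  # low NRE, high unit cost
n_2015 = breakeven_volume(Kf=1e7, Kv=1.0, price=20.0)   # Kf x10, Kv /10, same price
assert n_2015 > n_1995   # higher NRE forces a higher break-even volume
```

The lower Kv widens the per-unit margin, but the 10x larger Kf dominates, so the break-even volume still rises, which is why high-NRE chips only make sense at high volume (or as reusable IP).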
wl 2020 11.29
More recent: higher NRE
[Figure: cost curves for 1995 and 2015; the 2015 curve has much higher NRE (non-recurring engineering) cost but lower per-unit cost]
wl 2020 11.30
IP: Intellectual Property
wl 2020 11.31
Summary
• physical customisation: FPGA vs ASIC
• customisation techniques
  – parametric description: pointwise, pointfree descriptions
  – patterns of composition: series, parallel, chain, row, grid
  – transformations: retiming, slowdown, state machines
• system-on-chip (SOC)
  – processors, memory, interconnect, design costs, IP
• why exciting?
  – foundation of everything else in computing: theory + practice
  – Microsoft adopts FPGAs in data centres; Intel bought Altera
  – you can be part of it: projects, internship, research, start-up…
wl 2020 11.32
Answers to Unassessed Coursework 6
1. rdl₁ R = snd [·]⁻¹ ; R
   rdlₙ₊₁ R = snd aprₙ⁻¹ ; rsh ; fst (rdlₙ R) ; R
2. P0 = rdlₙ Pcell ; π₁
   ⟨⟨s, x⟩, a⟩ Pcell ⟨sx + a, x⟩
3. rdlₙ R = rowₙ (R ; π₂⁻¹) ; π₂
   P1 = loop (rowₙ Pcell1 ; fst mapₙ D) ; π₁
   ⟨⟨s, x⟩, a⟩ Pcell1 ⟨a, ⟨sx + a, x⟩⟩
4. loop (rowₙ R) = (loop R)ⁿ
   Proof: induction on n
   (see www.doc.ic.ac.uk/~wl/papers/scp90.pdf)
   P1 = P2 ; [D, D]⁻ⁿ
   P2 = (loop (Pcell1 ; [D, [D, D]]))ⁿ