wl 2020 11.1
SOC architecture and design
• system-on-chip (SOC)
  – processors become components in a system
• SOC covers many topics
  – processor: pipelined, superscalar, VLIW, array, vector
  – storage: cache, embedded and external memory
  – interconnect: buses, network-on-chip
  – impact: time, area, power, reliability, configurability
  – customisability: specialized processors, reconfiguration
  – productivity/tools: model, explore, re-use, synthesise, verify
  – examples: crypto, graphics, media, network, comm, security
  – future: autonomous SOC, self-optimising/verifying design
• our focus
  – overview, processor, memory
wl 2020 11.2
iPhone SOC
[Figure: iPhone SOC die, a 1 GHz ARM Cortex A8 processor alongside memory and several I/O blocks. Source: UC Berkeley]
wl 2020 11.3
Basic system-on-chip model
wl 2020 11.4
AMD’s Barcelona Multicore
[Figure: die layout, Cores 1-4, each with its own 512KB L2 cache; 2MB shared L3 cache; integrated Northbridge]
4 out-of-order cores
1.9 GHz clock rate
65nm technology
3 levels of caches
integrated Northbridge
http://www.techwarelabs.com/reviews/processors/barcelona/
wl 2020 11.5
SOC vs processors on chip
• with lots of transistors, designs move in 2 ways:
– complete system on a chip
– multi-core processors with lots of cache
                System on chip                    Processors on chip
processor       multiple, simple, heterogeneous   few, complex, homogeneous
cache           one level, small                  2-3 levels, extensive
memory          embedded, on chip                 very large, off chip
functionality   special purpose                   general purpose
interconnect    wide, high bandwidth              often through cache
power, cost     both low                          both high
operation       largely stand-alone               need other chips
wl 2020 11.6
Processor types: overview
Processor type   Architecture / implementation approach
SIMD             single instruction applied to multiple functional units
Vector           single instruction applied to multiple pipelined registers
VLIW             multiple instructions issued each cycle under compiler control
Superscalar      multiple instructions issued each cycle under hardware control
wl 2020 11.7
Sequential and parallel machines
• basic single stream processors
– pipelined: overlap operations in basic sequential
– superscalar: transparent concurrency
– VLIW: compiler-generated concurrency
• multiple streams, multiple functional units
– array processors
– vector processors
• multiprocessors
wl 2020 11.8
Pipelined processor
[Figure: four instructions flow through the IF, ID, AG, DF, EX, WB pipeline stages, each instruction starting one cycle after the previous; time runs horizontally]
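The overlap shown in the figure can be quantified: with six stages (IF, ID, AG, DF, EX, WB) and n instructions, an ideal pipeline finishes in n + 6 - 1 cycles rather than 6n. A minimal sketch of this hazard-free timing model (an illustration, not a cycle-accurate simulator):

```python
# stage names as in the figure: fetch, decode, address generate,
# data fetch, execute, writeback
STAGES = ["IF", "ID", "AG", "DF", "EX", "WB"]

def pipelined_cycles(n_instructions, n_stages=len(STAGES)):
    """Cycles to finish n instructions when a new one enters every cycle."""
    return n_instructions + n_stages - 1

def sequential_cycles(n_instructions, n_stages=len(STAGES)):
    """Cycles with no overlap: each instruction occupies the whole datapath."""
    return n_instructions * n_stages

assert pipelined_cycles(4) == 9      # the four instructions in the figure
assert sequential_cycles(4) == 24    # versus 24 cycles with no overlap
```

For long instruction streams the speedup approaches the pipeline depth, which is why even simple SOC processors are pipelined.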
wl 2020 11.9
Superscalar and VLIW processors
[Figure: two instructions issued per cycle (#1 and #2, then #3 and #4, then #5 and #6), each overlapping in the IF, ID, AG, DF, EX, WB stages; time runs horizontally]
wl 2020 11.10
[Figure: superscalar vs VLIW, superscalar uses hardware for parallelism control; VLIW relies on the compiler]
wl 2020 11.11
Array processors
• perform op if condition = mask
• operand can come from neighbour
• instruction format: mask | op | dest | sr1 | sr2
[Figure: one instruction issued to all PEs; n PEs, each with memory and neighbour communications]
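The mask field predicates execution: every PE evaluates the operation in lockstep, but only PEs whose mask bit is set update the destination. A rough software model of one such instruction (field and function names are illustrative):

```python
import operator

def masked_op(op, mask, dest, sr1, sr2):
    """One array-processor instruction: every PE computes op(sr1, sr2)
    in lockstep, but only PEs whose mask bit is set update dest."""
    return [op(a, b) if m else d
            for m, d, a, b in zip(mask, dest, sr1, sr2)]

# 4 PEs; the mask enables only PEs 0 and 2.
result = masked_op(operator.add, [1, 0, 1, 0],
                   [9, 9, 9, 9], [1, 2, 3, 4], [10, 20, 30, 40])
assert result == [11, 9, 33, 9]
```

Predication like this lets a single instruction stream handle data-dependent control flow without branches.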
wl 2020 11.12
Vector processors
• vector registers, eg 8 sets x 64 elements x 64 bits
• vector instructions: VR3 = VR2 VOP VR1
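A vector instruction such as VR3 = VR2 VOP VR1 applies one operation across every element of the named registers. A toy model with 64-element registers, matching the example above (a sketch, not how real vector hardware pipelines the elements):

```python
VLEN = 64  # elements per vector register, matching the 8 x 64 x 64-bit example

def vop(vr2, vr1, op):
    """Model VR3 = VR2 VOP VR1: one instruction, elementwise over full registers."""
    assert len(vr1) == len(vr2) == VLEN
    return [op(a, b) for a, b in zip(vr2, vr1)]

vr1 = list(range(VLEN))   # 0, 1, ..., 63
vr2 = [2] * VLEN
vr3 = vop(vr2, vr1, lambda a, b: a * b)
assert vr3[0] == 0 and vr3[63] == 126
```

One instruction thus replaces a 64-iteration loop, which is the source of a vector machine's efficiency: fetch/decode cost is amortised over the whole register.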
wl 2020 11.13
Memory addressing: three levels
(each segment contains pages for a program/process)
wl 2020 11.14
User view of memory: addressing
• a program: process address (offset + base + index)
  – virtual address: formed from page address and process/user id
• segment table: process base and bound (for each process)
  – system address: process base + page address
• pages: active localities in main/real memory
  – virtual address: page table lookup yields physical address
  – page miss: virtual page not in page table
• TLB (translation look-aside buffer): caches recent translations
  – TLB entry: a real address and the corresponding (virtual, id) address
  – a few hashed virtual address bits select a TLB entry
  – if the (virtual, id) pair matches the TLB entry's tag, use its translation
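The lookup described above can be sketched directly: a few bits of the virtual address index a small table, and the stored (virtual, id) tag must match before the cached translation is used. A direct-mapped toy model (real TLBs are usually set-associative; all names here are illustrative):

```python
TLB_BITS = 4                       # a few virtual-address bits -> 16 entries
tlb = [None] * (1 << TLB_BITS)     # each entry: ((vpage, pid), real_page)

def tlb_index(vpage):
    return vpage & ((1 << TLB_BITS) - 1)   # hash: low bits of virtual page

def tlb_lookup(vpage, pid):
    """Return the real page on a hit, None on a miss (then walk the page table)."""
    entry = tlb[tlb_index(vpage)]
    if entry and entry[0] == (vpage, pid):  # tag match: same (virtual, id)
        return entry[1]
    return None

def tlb_fill(vpage, pid, rpage):
    """Record a recent translation, evicting whatever shared this index."""
    tlb[tlb_index(vpage)] = ((vpage, pid), rpage)

tlb_fill(vpage=0x2A, pid=7, rpage=0x91)
assert tlb_lookup(0x2A, 7) == 0x91   # hit
assert tlb_lookup(0x2A, 8) is None   # same page, different process id: miss
```

The process id in the tag is what lets two processes map the same virtual page to different physical pages without flushing the TLB on a context switch.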
wl 2020 11.15
TLB and paging: address translation
[Figure: a virtual address is translated to a physical address via the TLB (recent translations); otherwise the process base (find process) and a page lookup (find page) form the system address, which locates the physical address]
wl 2020 11.16
SOC interconnect
• interconnecting multiple active agents requires
– bandwidth: capacity to transmit information (bps)
– protocol: logic for non-interfering message transmission
• bus
– AMBA (Adv. Microcontroller Bus Architecture) from ARM,
widely used for SOC
– bus performance: can determine system performance
• network on chip
– array of switches
– statically switched: eg mesh
– dynamically switched: eg crossbar
– adopted in the latest FPGAs to support AI and 5G
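For a statically switched mesh, a common scheme is dimension-ordered (XY) routing: a message travels along the x dimension first, then y, which keeps routing simple and deadlock-free on a mesh. A small sketch (illustrative only; real NoCs add flow control, virtual channels, and arbitration):

```python
def xy_route(src, dst):
    """Dimension-ordered routing on a 2D mesh: correct x first, then y.
    Returns the list of switch coordinates visited from src to dst."""
    x, y = src
    path = [src]
    while x != dst[0]:
        x += 1 if dst[0] > x else -1
        path.append((x, y))
    while y != dst[1]:
        y += 1 if dst[1] > y else -1
        path.append((x, y))
    return path

path = xy_route((0, 0), (2, 1))
assert path == [(0, 0), (1, 0), (2, 0), (2, 1)]
assert len(path) - 1 == 3   # hop count = |dx| + |dy|, the Manhattan distance
```

A crossbar, by contrast, connects any input to any output in one hop at the cost of O(n^2) switch area, which is the static/dynamic trade-off the slide alludes to.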
wl 2020 11.17
Adaptive Compute Acceleration Platform
• adaptive: diverse workloads in milliseconds; future-proof for new algorithms
• compute acceleration engines, connected by Network-on-Chip: Scalar Engines, Adaptable Engines, AI Engines
• platform: development tools, hardware/software libraries, run-time stack, software programmable silicon infrastructure
• enabling data scientists, software developers, hardware developers
Source: Xilinx
wl 2020 11.18
Versal Architecture (7nm technology)
• Scalar Engines: platform control, edge compute
• Adaptable Engines: 2X compute density
• AI Engines: AI compute, diverse DSP workloads
• Network-on-Chip: guaranteed bandwidth, enables software programmability
• DDR Memory: 3200-DDR4, 4200-LPDDR4, 2X bandwidth/pin
• PCIe & CCIX: 2X PCIe & DMA bandwidth, cache-coherent interface to accelerators
• Programmable I/O: any interface or sensor, includes 4.2Gb/s MIPI
• Transceivers: broad range, 25G → 112G; 58G in mainstream devices
Source: Xilinx
wl 2020 11.19
Platform Management Controller: Bringing the Platform to Life & Keeping it Safe & Secure
Boot & Configuration
˃ Boots the platform in milliseconds (any engine first)
˃ 8 times faster dynamic reconfiguration
˃ Advanced power & thermal management
Security, Safety & Reliability Enclave
˃ Hardware Root of Trust
˃ Cryptographic acceleration, confidentiality
˃ Enhanced diagnostics, system monitoring, anti-tamper
˃ Error mitigation, detection, management for safety
Integrated Platform Interfaces & High Speed Debug
˃ Integrated flash, system & debug interfaces
˃ High-speed non-invasive, chip-wide debug
[Figure: boot flow completing in 10s of milliseconds; blocks for boot & config, debug, safety, security]
Source: Xilinx
wl 2020 11.20
AI Engine: overview
[Figure: array of AI Engine tiles, each pairing an AI core with local memory]
• workloads: signal processing and artificial intelligence (CNN, LSTM/MLP, computer vision)
• 1GHz+ multi-precision vector processor
• high bandwidth extensible memory
• up to 400 AI Engines per device
• 8 times compute density
• 40% lower power consumption
Software programmable, deterministic, efficient
Source: Xilinx
wl 2020 11.21
AI Engine: tile-based architecture
[Figure: tile containing an ISA-based vector processor with AI and 5G vector extensions, local memory, data movers and interconnect; the tile array connects to the PL and to PS I/O]
• ISA-based vector processor: software programmable (e.g., C/C++); AI and 5G vector extensions
• data mover: non-neighbour data communication; integrated synchronization primitives
• non-blocking interconnect: up to 200+ GB/s bandwidth per tile
• local memory: multi-bank implementation, shared across neighbour cores
• cascade interface: partial results to next core
Source: Xilinx
wl 2020 11.22
AI Engine: processor core
[Figure: core datapath with instruction fetch & decode unit, three AGUs, Load Unit A, Load Unit B, a store unit, and memory and stream interfaces]
• local, shareable memory: 32KB local, 128KB addressable
• scalar unit: 32-bit scalar RISC processor with scalar register file, scalar ALU, non-linear functions
• vector processor: vector register file, fixed-point vector unit, floating-point vector unit; 512-bit SIMD datapath
• up to 128 MACs / clock cycle per core (INT8); highly parallel
• 7+ operations / clock cycle: 2 vector loads, 1 multiply, 1 store; 2 scalar ops / stream access
• instruction parallelism: VLIW; data parallelism: SIMD
• multiple vector lanes: vector datapath with 8/16/32-bit & SPFP operands
Source: Xilinx
wl 2020 11.23
AI Engine: processor core
• program memory per tile
  – 16KB: 128-bit wide, 1K words deep, single port
  – instruction compression, ECC protection + reporting
• 32KB data memory per tile
  – 8 single-port banks, 256-bit wide, 128 words deep
  – 5 cycle access latency
  – error detection (parity) + reporting
• independent DMA per tile, 2D strided access to north, south, east, west
• 3 AGUs: 2 load, 1 store
• 32-bit scalar RISC
  – with 32x32 scalar multiplier
  – sin/cos, square root, inverse square root
• 512-bit fixed-point vector unit
• single-precision floating-point vector unit
Source: Xilinx
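The sizes quoted on this and the previous slide are mutually consistent, which is quick to check. (The reading of the 128KB addressable figure as four 32KB memory modules, a core's own plus three shared neighbours, is an assumption, not stated on the slide.)

```python
# 32KB data memory per tile: 8 banks x 256 bits x 128 words
banks, width_bits, depth = 8, 256, 128
bytes_per_tile = banks * (width_bits // 8) * depth
assert bytes_per_tile == 32 * 1024

# program memory: 128-bit wide x 1K words = 16KB
assert (128 // 8) * 1024 == 16 * 1024

# 128KB addressable = four 32KB memory modules (assumed reading)
assert 4 * bytes_per_tile == 128 * 1024
```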
wl 2020 11.24
Multi-precision support
[Table: supported AI data types and signal processing data types. Source: Xilinx]
wl 2020 11.25
Design cost: product economics
• increasingly, product cost is determined by
  – design costs, including verification
  – not the marginal cost to produce
• manage complexity in die technology by
  – engineering effort
  – engineering cleverness
• design effort often dictated by product volume
[Figure: balance between basic physical tradeoffs and design time and effort; the balance point depends on n, the number of units]
wl 2020 11.26
Design complexity
[Figure: growth in design complexity of processors over time]
wl 2020 11.27
Cost: product program vs engineering
Product cost
├ Manufacturing costs (variable costs)
├ Engineering (fixed costs)
│ ├ Engineering costs: chip design, CAD support, software, verify & test
│ └ Fixed project costs: mask costs, capital equipment, CAD programs, labor costs
└ Marketing, sales, administration (fixed costs)
wl 2020 11.28
Example: two scenarios
• fixed costs Kf, support costs 0.1 x function(n), and variable costs Kv x n
• design gets more complex, while production costs decrease
  – Kf increases while Kv decreases
  – at the same price, higher volumes are required to break even
• compared with 1995, in 2015
  – Kf increased by 10 times
  – Kv decreased by the same amount
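The effect of these shifts on the break-even volume can be made concrete. With the cost model above and a fixed selling price p, break-even is the smallest n with p·n ≥ Kf + 0.1·f(n) + Kv·n. A sketch taking f(n) = n for simplicity; all the numbers are hypothetical, only the 10x shifts between 1995 and 2015 come from the slide:

```python
import math

def breakeven_volume(Kf, Kv, price, support_rate=0.1):
    """Smallest n with price*n >= Kf + support_rate*n + Kv*n,
    taking the support cost's function(n) to be simply n (an assumption)."""
    margin = price - Kv - support_rate
    assert margin > 0, "price must exceed per-unit costs"
    return math.ceil(Kf / margin)

# illustrative numbers only
n_1995 = breakeven_volume(Kf=1e6, Kv=10.0, price=20.0)  # low NRE, high unit cost
n_2015 = breakeven_volume(Kf=1e7, Kv=1.0, price=20.0)   # Kf x10, Kv /10, same price
assert n_2015 > n_1995   # higher NRE forces a higher break-even volume
```

The lower Kv widens the per-unit margin, but the 10x larger Kf dominates, so the break-even volume still rises, which is why high-NRE chips only make sense at high volume (or as reusable IP).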
wl 2020 11.29
More recent: higher NRE
[Figure: cost curves for 1995 and 2015; the 2015 curve has much higher NRE (non-recurring engineering) cost but lower per-unit cost]
wl 2020 11.30
IP: Intellectual Property
wl 2020 11.31
Summary
• physical customisation: FPGA vs ASIC
• customisation techniques
  – parametric description: pointwise, pointfree descriptions
  – patterns of composition: series, parallel, chain, row, grid
  – transformations: retiming, slowdown, state machines
• system-on-chip (SOC)
  – processors, memory, interconnect, design costs, IP
• why exciting?
  – foundation of everything else in computing: theory + practice
  – Microsoft adopts FPGAs in data centres; Intel bought Altera
  – you can be part of it: projects, internship, research, start-up…
wl 2020 11.32
Answers to Unassessed Coursework 6
1. rdl₁ R = snd [·]⁻¹ ; R
   rdlₙ₊₁ R = snd aprₙ⁻¹ ; rsh ; fst (rdlₙ R) ; R
2. P0 = rdlₙ Pcell ; π₁
   ⟨⟨s, x⟩, a⟩ Pcell ⟨sx + a, x⟩
3. rdlₙ R = rowₙ (R ; π₂⁻¹) ; π₂
   P1 = loop (rowₙ Pcell1 ; fst mapₙ D) ; π₁
   ⟨⟨s, x⟩, a⟩ Pcell1 ⟨a, ⟨sx + a, x⟩⟩
4. loop (rowₙ R) = (loop R)ⁿ
   Proof: induction on n
   (see www.doc.ic.ac.uk/~wl/papers/scp90.pdf)
   P1 = P2 ; [D, D]⁻ⁿ
   P2 = (loop (Pcell1 ; [D, [D, D]]))ⁿ