SoC Design Lecture 12: MPSoC (Multi-Processor System-on-Chip)
Shaahin Hessabi
Department of Computer Engineering
Sharif University of Technology
SoC Design Lecture 12: MPSoC
Multi-Processor System-on-Chip
The Premises
Hessabi © Sharif University of Technology
The System-on-Chip (SoC) today:
- Heterogeneous: ~10 IPs
- Homogeneous (multi-processor): ~10 µPs
- On-chip bus (AMBA, CoreConnect, Wishbone, ...)
- IPs and µPs are sold with proprietary bus interfaces

Near- and long-term forecast: ~100 IPs/µPs
- Buses are not scalable!
- Physical design issues: signal integrity, power consumption, timing closure
- Clock issues: is it time for the Globally Asynchronous paradigm? (still Locally Synchronous)
- Need for more regular design
Today's Heterogeneous SoC
[Figure: a heterogeneous SoC with CPU, DSP, memory, embedded FPGA, and dedicated IP blocks, plus I/O, connected via an interconnection network (bus).]
Maya (Rabaey’00)
The Cell Processor
- Started in mid-2000 by Sony, Toshiba, and IBM:
  - Sony has the PS2 architecture and needs a chip for the PS3
  - Toshiba has memory experience and needs chips for HDTV
  - IBM has technical knowledge in processor manufacturing
- Billions of dollars have been invested in a high-throughput, multi-purpose processor.
- One of the earliest NoC processors developed to address high-performance distributed computing: natural human interactions (including photorealistic graphics), predictable real-time response, and virtualized resources for concurrent activities.
Heterogeneous Multiprocessing: 9-Core Processor
- First prototype: 90 nm SOI, 8 copper layers
- 241 million transistors, 235 mm² (Rev. DD2)
- 60-80 W (prototype)
- Only 6-7 SPEs enabled (manufacturing defects)
- 1.1 V, >4 GHz
Cell Processor Architecture
- One Power Processor Element (PPE): a 64-bit, dual-threaded processor based on the Power Architecture; contains the PXU (Power execution unit) and the L1 and L2 caches.
- 8 Synergistic Processor Elements (SPEs): each SPE contains an independent processor (the SXU, synergistic execution unit) and a 256-KB local store (LS); 21M transistors each (14M SRAM, 7M logic).
- The Cell processor can handle 10 simultaneous threads.
- One Element Interconnect Bus (EIB): a coherent bus, organized as four 16-byte-wide rings.
- One Memory Interface Controller (MIC)
- One Bus Interface Controller (BIC)
- One Pervasive Unit (PU)
- One Power Management Unit (PMU)
- One Thermal Management Unit (TMU)
Cell Processor Architecture Components
PU: Pervasive Unit (not shown in figure) contains all of the global logic needed for:
- Basic chip functions:
  - Serial peripheral interface (SPI): communicates with an external controller during normal operation
  - Phase-locked loop (PLL): clock generation and distribution logic
  - Power-on reset (POR): systematically initializes all units of the processor
- Lab debug:
  - Fault isolation registers: allow the OS to quickly determine which unit generated an error condition
  - Performance monitor (PFM)
  - Trace logic analyzer (TLA): captures/stores internal signals while the chip is running to assist debug
- Manufacturing test: 11 different test modes, including array BIST, memory BIST, and logic BIST
Cell Processor Architecture Components (Cont’d)
PMU and TMU manage chip power to avoid permanent damage to the chip from overheating.
- PMU (Power Management Unit): allows software to reduce chip power when full processing capability is not needed.
- TMU (Thermal Management Unit, not shown): monitors each of the 10 digital thermal sensors (diodes) distributed across the chip to track temperatures in hot spots; controls the chip temperature dynamically and interrupts the PPE when a temperature specified for an element is observed.
- Software controls the TMU by setting four temperature values and the amount of throttling for each sensor:
  1. the 1st value specifies when throttling of an element stops
  2. the 2nd value specifies when throttling starts
  3. the 3rd value specifies when the element is completely stopped
  4. the 4th value specifies when the chip's clocks are shut down
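The four-threshold scheme above can be sketched as a small decision function. This is only an illustration of the policy described on the slide; the threshold values, names, and the hysteresis detail between the first two thresholds are assumptions, not the actual TMU design.

```python
# Illustrative four-threshold thermal policy, loosely modeled on the TMU
# description above. Threshold values (in deg C) are made up for the example.
T_STOP_THROTTLE = 65.0   # 1st value: throttling of an element stops
T_START_THROTTLE = 75.0  # 2nd value: throttling starts
T_ELEMENT_STOP = 85.0    # 3rd value: the element is completely stopped
T_CLOCK_SHUTDOWN = 95.0  # 4th value: the chip's clocks are shut down

def tmu_action(temp_c, currently_throttling):
    """Return the action for one sensor given its temperature reading."""
    if temp_c >= T_CLOCK_SHUTDOWN:
        return "shutdown_clocks"
    if temp_c >= T_ELEMENT_STOP:
        return "stop_element"
    if temp_c >= T_START_THROTTLE:
        return "throttle"
    if temp_c <= T_STOP_THROTTLE:
        return "run_full_speed"
    # Between the stop and start thresholds: keep the current state
    # (hysteresis avoids oscillating at a single threshold).
    return "throttle" if currently_throttling else "run_full_speed"
```

Having separate start and stop thresholds gives a hysteresis band, so an element does not flip between throttled and full speed on every small temperature fluctuation.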
Cell’s Element Interconnect Bus
From the trenches (D. Krolak, IBM): "Well, in the beginning, early in the development process, several people were pushing for a crossbar switch, and the way the bus is architected, you could actually pull out the EIB and put in a crossbar switch if you were willing to devote more silicon space on the chip to wiring. We had to find a balance between connectivity and area, and there just wasn't enough room to put a full crossbar switch in. So we came up with this ring structure which we think is very interesting. It fits within the area constraints and still has very impressive bandwidth."
- 4 rings (2 clockwise + 2 counter-clockwise)
- Not token rings; request/grant arbitration is still used
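With rings in both directions, a transfer can take whichever direction reaches its destination in fewer hops. A minimal sketch of that choice (an illustration, not the EIB's actual arbitration; the 12-element count is an assumption based on the EIB's participants — PPE, 8 SPEs, MIC, and the bus interface):

```python
# Sketch: on a bidirectional ring interconnect, pick the direction
# (clockwise or counter-clockwise) with the fewer hops to the destination.
N_ELEMENTS = 12  # assumed number of ring stops, for illustration

def best_direction(src, dst, n=N_ELEMENTS):
    """Return (direction, hop_count) for the shorter ring direction."""
    cw_hops = (dst - src) % n    # hops going clockwise
    ccw_hops = (src - dst) % n   # hops going counter-clockwise
    return ("cw", cw_hops) if cw_hops <= ccw_hops else ("ccw", ccw_hops)
```

With two rings per direction, two such transfers can proceed concurrently in each direction, which is how the ring structure recovers much of a crossbar's bandwidth in far less area.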
Homogeneous SoC (Multiprocessor)
[Figure: a homogeneous SoC with eight CPU+MEM tiles connected via an interconnection network (bus or crossbar).]
Multiprocessor SoC: Cisco CRS-1 Router
The CRS-1 router uses 188 extensible network processors per "Silicon Packet Processor" chip:
- 16 clusters of 12 PPEs each (16 × 12 = 192 PPEs on chip, of which 188 are active)
Multi-Processor Architectures
1. Tightly-coupled multiprocessor systems: contain multiple CPUs connected at the bus level.
   - CPUs may have access to a central shared memory:
     - SMP (symmetric multiprocessor): systems that treat all CPUs equally
     - ASMP (asymmetric multiprocessor)
   - or may participate in a memory hierarchy with both local and shared memory:
     - NUMA: non-uniform memory access
     - CC-NUMA: cache-coherent NUMA
2. Loosely-coupled multiprocessor systems: often referred to as clusters; based on multiple standalone single- or dual-processor commodity computers interconnected via a high-speed communication system, such as Gigabit Ethernet.
[Figure: SMP (CPUs sharing one memory over a bus) vs. NUMA (CPUs each with local memory).]
Multiprocessor Communication Architectures
Message passing:
- Separate address space for each processor
- Processors have private memories
- Processors communicate explicitly via message passing, using communication APIs such as send() or receive()
- Creates extra communication overhead
[Figure: N processors, each with a private cache, connected through an interconnection network to M memory modules.]
Shared memory:
- Processors communicate through a shared address space, implicitly via memory reads/writes
- Lower latency; widely used in many of today's high-performance MPSoCs
- SMP or NUMA:
  - SMP (shared memory processor, or uniform memory access): access to all memory occurs at the same speed for all processors
  - NUMA (non-uniform memory access, or distributed shared memory): the interconnect is typically a grid or hypercube; access to some parts of memory is faster for some processors than to other parts. Harder to program, but scales to more processors.
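The two communication styles above can be contrasted with a small thread sketch. This is only an analogy in software (real MPSoC communication is hardware, and the names here are invented): shared memory communicates implicitly through a common variable, message passing explicitly through send()/receive().

```python
# Contrast of shared-memory vs. message-passing communication,
# sketched with Python threads (illustrative only).
import queue
import threading

# --- Shared memory: implicit communication through a shared variable ---
shared = {"value": 0}
lock = threading.Lock()

def shared_writer():
    with lock:                 # writes land in a common address space
        shared["value"] = 42   # the reader just reads the same location

# --- Message passing: explicit communication over a channel ---
channel = queue.Queue()

def send(msg):
    channel.put(msg)           # explicit communication API

def receive():
    return channel.get()       # blocks until a message arrives

t1 = threading.Thread(target=shared_writer)
t2 = threading.Thread(target=send, args=("hello",))
t1.start(); t2.start(); t1.join(); t2.join()
```

Note the overhead difference the slide mentions: the shared-memory reader pays only a memory access (plus synchronization), while the message passer pays for constructing, enqueuing, and dequeuing each message.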
Shared-Memory Multiprocessor: Bus-Based UMA
(a) Simplest MP: more than one processor on a single bus connected to memory; bus bandwidth becomes a bottleneck.
(b) Each processor has a cache to reduce the need to access memory.
(c) To further scale the number of processors, each processor is given private local memory.
NUMA
All memories can be addressed by all processors, but access to a processor’s own local memory is faster than access to another processor’s remote memory.
Looks like a distributed machine, but the interconnection network is usually custom-designed switches and/or buses.
What is MPSoC?
- Multiprocessor SoC: heterogeneous processors.
- Buses are currently used to interconnect modules (processors, memories, etc.), but NoCs are projected to replace buses in future systems.
- MPSoCs are not chip multiprocessors: chip multiprocessors take advantage of increased transistor density to put more processors on a single chip, but they don't try to leverage application needs. MPSoCs are custom architectures that balance the constraints of VLSI technology with an application's needs.
MPSoC vs. Competitors
Uniprocessor:
- Needs task-level parallelism for performance: real concurrency, not the apparent concurrency of a multitasking OS running on a uniprocessor.

Symmetric multiprocessor (SMP) advantages:
- Chips could be manufactured in even larger volumes, lowering price
- Uniform platforms and richer tool sets make software development easier
- Symmetry makes it easier to map an application onto the architecture

However, the scientific-computing model cannot be applied directly to SoCs. SoCs must obey constraints that do not apply to scientific computation:
1. They must perform real-time computations.
2. They must be area-efficient.
3. They must be energy-efficient.
4. They must provide the proper I/O connections.
1. Real-Time Performance
- More than high-performance computing: results must be available at a predictable rate. Rate variations can often be absorbed by adding buffer memory, but memory incurs both area and energy costs.
- Producing results at predictable times requires careful design of the hardware (instruction set, memory system, and system bus) and of the software (to take advantage of hardware features, and to avoid common problems like excessive reliance on buffering).
- Many mechanisms provide performance at the expense of making it less predictable. Snooping caches dynamically manage cache coherency at the cost of less predictable delays, since the time required for a memory access depends on the state of several caches.
- One way to provide predictable and high performance: use a mechanism specialized to the needs of the application, such as specialized memory systems or application-specific instructions.
- Different tasks in an application have different characteristics, so different parts of the architecture need different hardware structures.
2. Area Efficiency
- Heterogeneous multiprocessors are more area-efficient than SMPs.
- Task-level parallelism is inherently heterogeneous: each block does something different and has different computational requirements. A special-purpose PE or a specialized CPU is faster and smaller than a general programmable processor.
- Matching the CPU datapath width to the native data sizes of the application saves area. Choosing a cache size and organization to match the application can greatly improve performance.
- Memory specialization is an important technique for designing efficient architectures:
  - If some aspects of the application's memory behavior can be predicted, the system architect can reflect those characteristics in the architecture.
  - Example: a smaller cache can be used when the application has regular memory access patterns.
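The example above can be made concrete with a tiny cache model: a regular streaming access pattern achieves a high hit rate even in a very small cache, since only the first access to each line misses. The cache parameters below are arbitrary illustration values, not from any real design.

```python
# Tiny direct-mapped cache model illustrating the point above: with a
# regular (sequential) access pattern, even a small cache hits almost
# always. Sizes are arbitrary example values.
def hit_rate(addresses, n_lines=16, line_bytes=32):
    """Simulate a direct-mapped cache; return the fraction of hits."""
    tags = [None] * n_lines
    hits = 0
    for addr in addresses:
        block = addr // line_bytes       # which cache line the address is in
        index = block % n_lines          # which slot it maps to
        if tags[index] == block:
            hits += 1
        else:
            tags[index] = block          # miss: fill the line
    return hits / len(addresses)

# Regular streaming pattern: every byte of a 4-KB buffer, in order.
# Only the first access to each 32-byte line misses: 31/32 hit rate,
# from a cache of just 16 lines (512 bytes).
stream = list(range(4096))
```

An irregular pattern over the same buffer would thrash this 512-byte cache, which is why the predictability of the access pattern, not just the working-set size, drives how small the cache can be.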
3. Energy Efficiency
Most SoC designs are power-sensitive, due to:
- environmental considerations (heat dissipation), or
- system requirements (battery power).

Specialization saves power by stripping away unnecessary features; this is particularly true for leakage power consumption.

SoCs are mass-market devices due to the economics of VLSI manufacturing, so the cost of designing power-saving features for a particular architecture is amortized over the many copies produced during manufacturing.
4. Proper I/O Connections
- An SoC must provide a complete system.
- Can I/O devices be implemented in a generic fashion, given enough transistors? To some extent, this is done for FPGA I/O pads.
- Due to the variety of physical interfaces, it is difficult to create effective customizable I/O devices.
Example: MPSoC from Philips Research
- Targets the communication needs of consumer-electronics SoCs with real-time requirements (e.g., set-top boxes).
- Mi: memories; Pi: programmable dedicated processors; MIi: external memory interfaces; Ri: routers; Ni: network interfaces.
Ref: B. Vermeulen et al., IEEE Communications Magazine, Sept. 2003
Design and Manufacturing Challenges
Software development:
- Software shipped as part of a chip must be extremely reliable.
- It must meet design constraints typically reserved for hardware, e.g., hard timing constraints (real-time operation) and energy consumption.
- MPSoCs are heterogeneous, and thus harder to program than traditional symmetric multiprocessors.
- A customized development environment is needed, including compilers, debuggers, simulators, etc.

NoCs resemble external networks, but differ from them in crucial ways:
- Extensive wiring resources: what topologies can best exploit them?
- Buffers are a scarce resource because of area overhead: what flow-control methods reduce buffer count and router overhead?
- What circuits (e.g., transceivers) can best exploit the structured wiring of on-chip networks?
Challenges (Cont’d.)
- Determining the FPGA vs. software programmability tradeoff: FPGA fabrics can be used as cores to provide an alternative means of programmability, but tools for using FPGAs in the design environment are not yet well developed.
- Security issues, particularly when MPSoC devices connect to the Internet: security breaches can cause malfunctions and must be considered during HW/SW codesign.
- MPSoCs connected into a network of chips, e.g., in automotive/avionics applications: there is no control over the external network state (node failures, reconfiguration), yet current MPSoC design is essentially carried out in a closed environment.
- Silicon debug: design validation and testing are increasingly insufficient to remove all bugs before first silicon; the design cycle may require expensive respins.
Why Multiprocessors?
- Microprocessors are the fastest CPUs: collecting several CPUs is much easier than redesigning one.
- Multiple users, multiple applications, multi-tasking within an application.
- Responsiveness and/or throughput; hardware shared between CPUs.
- Complexity of current microprocessors: do we have enough ideas to sustain 1.5x/year improvement? Can we deliver such complexity on schedule?
- Slow (but steady) improvement in parallel software (scientific apps, databases, OS).
- Emergence of the embedded market driving microprocessors in addition to desktops; embedded functional parallelism.
What Level Parallelism?
- Bit-level parallelism: 1970 to ~1985 (4-bit, 8-bit, 16-bit, 32-bit microprocessors)
- Instruction-level parallelism (ILP): ~1985 through today
  - Pipelining
  - Superscalar
  - VLIW
  - Out-of-order execution
  - Limits to the benefits of ILP?
- Process-level or thread-level parallelism: mainstream for general-purpose computing?
  - Servers are parallel
  - High-end desktop dual-processor PCs
- Program-level parallelism, or even distributed computing
Popular Categories
- SISD (Single Instruction, Single Data): uniprocessors
- MISD (Multiple Instruction, Single Data): multiple processors on a single data stream
- SIMD (Single Instruction, Multiple Data): examples: Illiac-IV, CM-2
  - Simple programming model, low overhead, flexibility
  - All custom integrated circuits
  - (The phrase was reused by Intel marketing for media instructions, which are roughly vector operations)
- MIMD (Multiple Instruction, Multiple Data): flexible
  - MIMD is the current winner for MPSoC
Major MIMD Styles
Centralized shared memory: uniform memory access (UMA) time, also called a shared memory processor (SMP).
Major MIMD Styles (Cont’d)
Distributed memory (a memory module with each CPU):
- More memory bandwidth, lower (local) memory latency
- Drawback: longer communication latency
- Drawback: more complex software model
OS Option 1
Each CPU has its own OS:
- Statically allocate physical memory to each CPU
- Each CPU runs its own independent OS
- Peripherals are shared
- Each CPU handles its own processes' system calls
- Used in early multiprocessor systems
- Simple to implement; avoids concurrency issues by not sharing
- Issues:
  1. Each processor has its own scheduling queue and its own memory partition.
  2. Consistency is an issue with independent disk buffer caches and potentially shared files.
OS Option 2
Master-slave multiprocessors:
- The OS mostly runs on a single fixed CPU; user-level applications run on the other CPUs.
- All system calls are passed to the master CPU for processing.
- Very little synchronization required; simple to implement.
- A single centralized scheduler keeps all processors busy.
- Memory can be allocated as needed to all CPUs.
- Issue: the master CPU becomes a bottleneck.
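The master-slave scheme above can be sketched with a queue: slaves forward every system call to the one CPU running the OS and wait for the reply. This is a toy illustration; the CPU ids, call names, and queue mechanism are all invented for the example.

```python
# Sketch of the master-slave model: slave CPUs forward system calls to
# the master over a queue, and only the master executes kernel work.
import queue
import threading

syscall_queue = queue.Queue()

def master_cpu(n_calls):
    """The one CPU that runs the OS: services every forwarded syscall."""
    handled = []
    for _ in range(n_calls):
        cpu_id, name, reply = syscall_queue.get()   # next forwarded call
        handled.append((cpu_id, name))
        reply.put(f"{name}: ok")                    # result back to the slave
    return handled

def slave_cpu(cpu_id, name):
    """A slave runs user code; any syscall is passed to the master."""
    reply = queue.Queue()
    syscall_queue.put((cpu_id, name, reply))
    return reply.get()                              # wait for the master
```

The bottleneck is visible in the structure: every syscall from every slave funnels through the single `syscall_queue`, so the master's service rate caps the whole system.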
OS Option 3
Symmetric multiprocessors (SMP):
- The OS kernel runs on all processors, with load and resources balanced among them.
- One alternative: a single mutex (mutual exclusion object) makes the entire kernel one large critical section, so only one CPU is in the kernel at a time; this is only slightly better than master-slave.
- Better alternative: identify independent parts of the kernel and make each its own critical section, which allows parallelism in the kernel.
- Issues: this is a difficult task; the code is mostly similar to uniprocessor code, and the hard part is identifying independent parts that don't interfere with each other.
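The two locking alternatives above can be sketched side by side. This is an illustration only; the subsystem names are invented, and real kernels split locks far more finely than this.

```python
# Sketch of the two SMP kernel-locking alternatives described above.
import threading

# Alternative 1: one "big kernel lock" serializes all kernel work,
# so only one CPU is ever inside the kernel at a time.
big_kernel_lock = threading.Lock()

def kernel_call_big_lock(work):
    with big_kernel_lock:              # the whole kernel is one critical section
        return work()

# Alternative 2: independent kernel parts each get their own lock,
# so e.g. the scheduler and the filesystem can run on different CPUs
# at the same time.
subsystem_locks = {
    "scheduler": threading.Lock(),
    "filesystem": threading.Lock(),
}

def kernel_call_fine_grained(subsystem, work):
    with subsystem_locks[subsystem]:   # only this subsystem is serialized
        return work()
```

The hard part the slide mentions is exactly choosing the keys of `subsystem_locks`: the subsystems must be genuinely independent, or fine-grained locking introduces races and deadlocks that the big lock never had.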
Example: Quad-Processor Pentium Pro
- SMP, bus interconnection.
- 4 × 200 MHz Intel Pentium Pro processors.
- 8 KB + 8 KB L1 cache per processor.
- 512 KB L2 cache per processor.
- Snoopy cache coherence.
- Employed in systems from Compaq, HP, IBM, and NetPower.
- OS: Windows NT, Solaris, Linux, etc.
MPSoC Design Goals
1. Fast design time
- Very important in typical MPSoC applications: game/network processors, high-definition video encoding, multimedia hubs, base-band telecom circuits, etc., which have particularly tight time-to-market and time-window constraints.

2. Higher-level abstractions: system-level modeling
- Hardware side: RTL models are too time-consuming to design and verify for MPSoCs (cores and associated peripherals). At the RTL abstraction, designers produce the equivalent of 4 to 10 gates per line of RTL code. A 100-million-gate MPSoC built only from RTL code, even with 90% code reuse, requires over 1 million lines of code for the remaining 10 million gates — unrealistic for most MPSoC target markets. A higher abstraction level is needed on the hardware side.
- Software side: MPSoCs use hundreds of thousands of lines of dedicated software and complex software development environments; mostly low-level programming languages cannot be used anymore. Higher-level abstractions are needed on the software side too.
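The slide's productivity estimate can be checked with a quick back-of-the-envelope calculation:

```python
# Back-of-the-envelope check of the RTL productivity estimate above.
total_gates = 100_000_000   # a 100 million-gate MPSoC
reuse_pct = 90              # 90% of the design is reused code

new_gates = total_gates * (100 - reuse_pct) // 100   # 10 million new gates

# At the optimistic end of the 4-10 gates/line range:
lines_at_10 = new_gates // 10   # 1,000,000 lines of new RTL
# At the pessimistic end:
lines_at_4 = new_gates // 4     # 2,500,000 lines of new RTL
```

So even with 90% reuse and the best-case productivity, the remaining hand-written RTL is on the order of a million lines, which supports the slide's conclusion that pure-RTL design is unrealistic at this scale.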
MPSoC Design Goals (cont’d)
3. Predictability of results
- High-level abstractions hide precise circuit behavior (timing information).
- MPSoCs are mostly targeted at real-time applications, so accurate performance information must be available at design time.

4. Meeting design metrics
- High-level design metrics and performance estimation are essential parts of MPSoC design methodologies.
- A system's design metrics are not easy to compose from the design metrics of its components.
MPSoC Design Methodologies
Design steps:
1. Design space exploration: hardware/software partitioning, selection of the architectural platform and components.
2. Architecture design: design of components, hardware/software interface design.

- The design process must consider TTM (time to market), system performance, power, and cost.
- Reuse of predesigned components is necessary for reducing design time, but their integration into a system is challenging.
- A complete design flow requires multiple capabilities and tools because of the complexity and diversity of applications.
MPSoC Design Methodologies (cont’d)
Competing EDA approaches to improve productivity:

1. Top-down approaches start with an architectural solution, target architecture, or architectural platform:
- Synthesis from system-level models:
  - COSYMA environment for hardware/software co-synthesis
  - POLIS for hardware/software co-design of embedded systems
  - SpecC, SystemC
  - ODYSSEY
- Platform-based design

2. Bottom-up (component-based) approaches start with a set of components and provide a set of primitives to build application-specific architectures and communication APIs.
- Goal: allow the integration of heterogeneous processors and communication protocols by using abstract interconnections.
- Behavior and communication must be separated in the system specification; system communication can then be described at a higher level and refined independently of the system behavior.
- Two approaches described previously: standard bus protocol, standard component protocol.
Synthesis from System Level Models
1. Start with an informal model of the application.
2. Build a more formal SoC specification (one capable of being validated). The system architecture is fixed and HW/SW partitioning is decided. This produces a golden architecture model: the specification of the HW components is fixed, along with the global structure of the on-chip network.
3. Design the SW.
4. Design the HW components.
5. Interconnect the HW and SW components while respecting the constraints described in the golden architecture model.

This gives a full design flow from a system-level specification to the RTL architecture.