SoC Design Lecture 12: MPSoC (Multi-Processor System-on-Chip)
Shaahin Hessabi
Department of Computer Engineering
Sharif University of Technology
SoC Design Lecture 12: MPSoC
Multi-Processor System-on-Chip
The Premises
Hessabi © Sharif University of Technology
The System-on-Chip (SoC) today:
- Heterogeneous: ~10 IPs
- Homogeneous (multi-processor): ~10 µPs
- On-chip bus (AMBA, CoreConnect, Wishbone, ...)
- IPs and µPs are sold with proprietary bus interfaces

Near- and long-term forecast: ~100 IPs/µPs
- Buses are not scalable!
- Physical design issues: signal integrity, power consumption, timing closure
- Clock issues: is it time for the Globally Asynchronous paradigm? (still Locally Synchronous)
- Need for more regular design
Today's Heterogeneous SoC
[Figure: a heterogeneous SoC with CPU, DSP, memory, embedded FPGA, and dedicated IP blocks, plus I/O, connected via an interconnection network (bus).]
Maya (Rabaey’00)
The Cell Processor
- Started in mid-2000 by Sony, Toshiba, and IBM:
  - Sony has the PS2 architecture and needs a chip for the PS3
  - Toshiba has memory experience and needs chips for HDTV
  - IBM has technical knowledge in processor manufacturing
- Billions of dollars have been invested in a high-throughput, multi-purpose processor.
- One of the earliest NoC processors developed to address high-performance distributed computing: natural human interactions (including photorealistic graphics), predictable real-time response, and virtualized resources for concurrent activities.
Heterogeneous Multiprocessing: 9-Core Processor
- First prototype: 90 nm SOI, 8 copper layers
- 241 million transistors, 235 mm² (Rev. DD2)
- 60-80 W (prototype)
- Only 6-7 SPEs enabled (manufacturing defects)
- 1.1 V, >4 GHz
Cell Processor Architecture
- One Power Processor Element (PPE): a 64-bit, dual-threaded processor based on the Power Architecture; contains the PXU (Power execution unit) and the L1 and L2 caches.
- 8 Synergistic Processor Elements (SPEs): each SPE contains an independent processor (the SXU, synergistic execution unit) and a 256-KB local store (LS); 21M transistors each (14M SRAM, 7M logic).
- The Cell processor can handle 10 simultaneous threads.
- One Element Interconnect Bus (EIB): a coherent bus, organized as four 16-byte-wide rings.
- One Memory Interface Controller (MIC)
- One Bus Interface Controller (BIC)
- One Pervasive Unit (PU)
- One Power Management Unit (PMU)
- One Thermal Management Unit (TMU)
Cell Processor Architecture Components
PU: Pervasive Unit (not shown in figure) contains all of the global logic needed for:
- Basic chip functions:
  - Serial peripheral interface (SPI): communicates with an external controller during normal operation
  - Phase-locked loop (PLL): clock generation and distribution logic
  - Power-on reset (POR): systematically initializes all units of the processor
- Lab debug:
  - Fault isolation registers: allow the OS to quickly determine which unit generated an error condition
  - Performance monitor (PFM)
  - Trace logic analyzer (TLA): captures/stores internal signals while the chip is running to assist debug
- Manufacturing test: 11 different test modes, including array BIST, memory BIST, and logic BIST
Cell Processor Architecture Components (Cont’d)
PMU and TMU manage chip power to avoid permanent damage to the chip from overheating.
- PMU (Power Management Unit): allows software to reduce chip power when full processing capability is not needed.
- TMU (Thermal Management Unit, not shown): monitors each of the 10 digital thermal sensors (diodes) distributed across the chip to track temperatures in hot spots; controls the chip temperature dynamically and interrupts the PPE when a temperature specified for an element is observed.
- Software controls the TMU by setting four temperature values and the amount of throttling for each sensor:
  1. the 1st value specifies when throttling of an element stops
  2. the 2nd value specifies when throttling starts
  3. the 3rd value specifies when the element is completely stopped
  4. the 4th value specifies when the chip's clocks are shut down
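The four-threshold scheme above can be sketched as a small decision function. This is only an illustration of the policy described on the slide; the threshold values, names, and the hysteresis detail between the first two thresholds are assumptions, not the actual TMU design.

```python
# Illustrative four-threshold thermal policy, loosely modeled on the TMU
# description above. Threshold values (in deg C) are made up for the example.
T_STOP_THROTTLE = 65.0   # 1st value: throttling of an element stops
T_START_THROTTLE = 75.0  # 2nd value: throttling starts
T_ELEMENT_STOP = 85.0    # 3rd value: the element is completely stopped
T_CLOCK_SHUTDOWN = 95.0  # 4th value: the chip's clocks are shut down

def tmu_action(temp_c, currently_throttling):
    """Return the action for one sensor given its temperature reading."""
    if temp_c >= T_CLOCK_SHUTDOWN:
        return "shutdown_clocks"
    if temp_c >= T_ELEMENT_STOP:
        return "stop_element"
    if temp_c >= T_START_THROTTLE:
        return "throttle"
    if temp_c <= T_STOP_THROTTLE:
        return "run_full_speed"
    # Between the stop and start thresholds: keep the current state
    # (hysteresis avoids oscillating at a single threshold).
    return "throttle" if currently_throttling else "run_full_speed"
```

Having separate start and stop thresholds gives a hysteresis band, so an element does not flip between throttled and full speed on every small temperature fluctuation.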
Cell’s Element Interconnect Bus
From the trenches (D. Krolak, IBM): "Well, in the beginning, early in the development process, several people were pushing for a crossbar switch, and the way the bus is architected, you could actually pull out the EIB and put in a crossbar switch if you were willing to devote more silicon space on the chip to wiring. We had to find a balance between connectivity and area, and there just wasn't enough room to put a full crossbar switch in. So we came up with this ring structure which we think is very interesting. It fits within the area constraints and still has very impressive bandwidth."
- 4 rings (2 clockwise + 2 counter-clockwise)
- Not token rings; request/grant arbitration is still used
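With rings in both directions, a transfer can take whichever direction reaches its destination in fewer hops. A minimal sketch of that choice (an illustration, not the EIB's actual arbitration; the 12-element count is an assumption based on the EIB's participants — PPE, 8 SPEs, MIC, and the bus interface):

```python
# Sketch: on a bidirectional ring interconnect, pick the direction
# (clockwise or counter-clockwise) with the fewer hops to the destination.
N_ELEMENTS = 12  # assumed number of ring stops, for illustration

def best_direction(src, dst, n=N_ELEMENTS):
    """Return (direction, hop_count) for the shorter ring direction."""
    cw_hops = (dst - src) % n    # hops going clockwise
    ccw_hops = (src - dst) % n   # hops going counter-clockwise
    return ("cw", cw_hops) if cw_hops <= ccw_hops else ("ccw", ccw_hops)
```

With two rings per direction, two such transfers can proceed concurrently in each direction, which is how the ring structure recovers much of a crossbar's bandwidth in far less area.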
Homogeneous SoC (Multiprocessor)
[Figure: a homogeneous SoC with eight CPU+MEM tiles connected via an interconnection network (bus or crossbar).]
Multiprocessor SoC: Cisco CRS-1 Router
The CRS-1 router uses 188 extensible network processors per "Silicon Packet Processor" chip:
- 16 clusters of 12 PPEs each (16 × 12 = 192 PPEs on chip, of which 188 are active)
Multi-Processor Architectures
1. Tightly-coupled multiprocessor systems: contain multiple CPUs connected at the bus level.
   - CPUs may have access to a central shared memory:
     - SMP (symmetric multiprocessor): systems that treat all CPUs equally
     - ASMP (asymmetric multiprocessor)
   - or may participate in a memory hierarchy with both local and shared memory:
     - NUMA: non-uniform memory access
     - CC-NUMA: cache-coherent NUMA
2. Loosely-coupled multiprocessor systems: often referred to as clusters; based on multiple standalone single- or dual-processor commodity computers interconnected via a high-speed communication system, such as Gigabit Ethernet.
[Figure: SMP (CPUs sharing one memory over a bus) vs. NUMA (CPUs each with local memory).]
Multiprocessor Communication Architectures
Message passing:
- Separate address space for each processor
- Processors have private memories
- Processors communicate explicitly via message passing, using communication APIs such as send() or receive()
- Creates extra communication overhead
[Figure: N processors, each with a private cache, connected through an interconnection network to M memory modules.]
Shared memory:
- Processors communicate through a shared address space, implicitly via memory reads/writes
- Lower latency; widely used in many of today's high-performance MPSoCs
- SMP or NUMA:
  - SMP (shared memory processor, or uniform memory access): access to all memory occurs at the same speed for all processors
  - NUMA (non-uniform memory access, or distributed shared memory): the interconnect is typically a grid or hypercube; access to some parts of memory is faster for some processors than to other parts. Harder to program, but scales to more processors.
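The two communication styles above can be contrasted with a small thread sketch. This is only an analogy in software (real MPSoC communication is hardware, and the names here are invented): shared memory communicates implicitly through a common variable, message passing explicitly through send()/receive().

```python
# Contrast of shared-memory vs. message-passing communication,
# sketched with Python threads (illustrative only).
import queue
import threading

# --- Shared memory: implicit communication through a shared variable ---
shared = {"value": 0}
lock = threading.Lock()

def shared_writer():
    with lock:                 # writes land in a common address space
        shared["value"] = 42   # the reader just reads the same location

# --- Message passing: explicit communication over a channel ---
channel = queue.Queue()

def send(msg):
    channel.put(msg)           # explicit communication API

def receive():
    return channel.get()       # blocks until a message arrives

t1 = threading.Thread(target=shared_writer)
t2 = threading.Thread(target=send, args=("hello",))
t1.start(); t2.start(); t1.join(); t2.join()
```

Note the overhead difference the slide mentions: the shared-memory reader pays only a memory access (plus synchronization), while the message passer pays for constructing, enqueuing, and dequeuing each message.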
Shared-Memory Multiprocessor: Bus-Based UMA
(a) Simplest MP: more than one processor on a single bus connected to memory; bus bandwidth becomes a bottleneck.
(b) Each processor has a cache to reduce the need to access memory.
(c) To further scale the number of processors, each processor is given private local memory.
NUMA
All memories can be addressed by all processors, but access to a processor’s own local memory is faster than access to another processor’s remote memory.
Looks like a distributed machine, but the interconnection network is usually custom-designed switches and/or buses.
What is MPSoC?
- Multiprocessor SoC: heterogeneous processors.
- Buses are currently used to interconnect modules (processors, memories, etc.), but NoCs are projected to replace buses in future systems.
- MPSoCs are not chip multiprocessors: chip multiprocessors take advantage of increased transistor density to put more processors on a single chip, but they don't try to leverage application needs. MPSoCs are custom architectures that balance the constraints of VLSI technology with an application's needs.
MPSoC vs. Competitors
Uniprocessor:
- Needs task-level parallelism for performance: real concurrency, not the apparent concurrency of a multitasking OS running on a uniprocessor.

Symmetric multiprocessor (SMP) advantages:
- Chips could be manufactured in even larger volumes, lowering price
- Uniform platforms and richer tool sets make software development easier
- Symmetry makes it easier to map an application onto the architecture

However, the scientific-computing model cannot be applied directly to SoCs. SoCs must obey constraints that do not apply to scientific computation:
1. They must perform real-time computations.
2. They must be area-efficient.
3. They must be energy-efficient.
4. They must provide the proper I/O connections.
1. Real-Time Performance
- More than high-performance computing: results must be available at a predictable rate. Rate variations can often be absorbed by adding buffer memory, but memory incurs both area and energy costs.
- Producing results at predictable times requires careful design of the hardware (instruction set, memory system, and system bus) and of the software (to take advantage of hardware features, and to avoid common problems like excessive reliance on buffering).
- Many mechanisms provide performance at the expense of making it less predictable. Snooping caches dynamically manage cache coherency at the cost of less predictable delays, since the time required for a memory access depends on the state of several caches.
- One way to provide predictable and high performance: use a mechanism specialized to the needs of the application, such as specialized memory systems or application-specific instructions.
- Different tasks in an application have different characteristics, so different parts of the architecture need different hardware structures.
2. Area Efficiency
- Heterogeneous multiprocessors are more area-efficient than SMPs.
- Task-level parallelism is inherently heterogeneous: each block does something different and has different computational requirements. A special-purpose PE or a specialized CPU is faster and smaller than a general programmable processor.
- Matching the CPU datapath width to the native data sizes of the application saves area. Choosing a cache size and organization to match the application can greatly improve performance.
- Memory specialization is an important technique for designing efficient architectures:
  - If some aspects of the application's memory behavior can be predicted, the system architect can reflect those characteristics in the architecture.
  - Example: a smaller cache can be used when the application has regular memory access patterns.
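The example above can be made concrete with a tiny cache model: a regular streaming access pattern achieves a high hit rate even in a very small cache, since only the first access to each line misses. The cache parameters below are arbitrary illustration values, not from any real design.

```python
# Tiny direct-mapped cache model illustrating the point above: with a
# regular (sequential) access pattern, even a small cache hits almost
# always. Sizes are arbitrary example values.
def hit_rate(addresses, n_lines=16, line_bytes=32):
    """Simulate a direct-mapped cache; return the fraction of hits."""
    tags = [None] * n_lines
    hits = 0
    for addr in addresses:
        block = addr // line_bytes       # which cache line the address is in
        index = block % n_lines          # which slot it maps to
        if tags[index] == block:
            hits += 1
        else:
            tags[index] = block          # miss: fill the line
    return hits / len(addresses)

# Regular streaming pattern: every byte of a 4-KB buffer, in order.
# Only the first access to each 32-byte line misses: 31/32 hit rate,
# from a cache of just 16 lines (512 bytes).
stream = list(range(4096))
```

An irregular pattern over the same buffer would thrash this 512-byte cache, which is why the predictability of the access pattern, not just the working-set size, drives how small the cache can be.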
3. Energy Efficiency
Most SoC designs are power-sensitive, due to:
- environmental considerations (heat dissipation), or
- system requirements (battery power).

Specialization saves power by stripping away unnecessary features; this is particularly true for leakage power consumption.

SoCs are mass-market devices due to the economics of VLSI manufacturing, so the cost of designing power-saving features for a particular architecture is amortized over the many copies produced during manufacturing.
4. Proper I/O Connections
- An SoC must provide a complete system.
- Can I/O devices be implemented in a generic fashion, given enough transistors? To some extent, this is done for FPGA I/O pads.
- Due to the variety of physical interfaces, it is difficult to create effective customizable I/O devices.
Example: MPSoC from Philips Research
- Targets the communication needs of consumer-electronics SoCs with real-time requirements (e.g., set-top boxes).
- Mi: memories; Pi: programmable dedicated processors; MIi: external memory interfaces; Ri: routers; Ni: network interfaces.
Ref: B. Vermeulen et al., IEEE Communications Magazine, Sept. 2003
Design and Manufacturing Challenges
Software development:
- Software shipped as part of a chip must be extremely reliable.
- It must meet design constraints typically reserved for hardware, e.g., hard timing constraints (real-time operation) and energy consumption.
- MPSoCs are heterogeneous, and thus harder to program than traditional symmetric multiprocessors.
- A customized development environment is needed, including compilers, debuggers, simulators, etc.

NoCs resemble external networks, but differ from them in crucial ways:
- Extensive wiring resources: what topologies can best exploit them?
- Buffers are a scarce resource because of area overhead: what flow-control methods reduce buffer count and router overhead?
- What circuits (e.g., transceivers) can best exploit the structured wiring of on-chip networks?
Challenges (Cont’d.)
- Determining the FPGA vs. software programmability tradeoff: FPGA fabrics can be used as cores to provide an alternative means of programmability, but tools for using FPGAs in the design environment are not yet well developed.
- Security issues, particularly when MPSoC devices connect to the Internet: security breaches can cause malfunctions and must be considered during HW/SW codesign.
- MPSoCs connected into a network of chips, e.g., in automotive/avionics applications: there is no control over the external network state (node failures, reconfiguration), yet current MPSoC design is essentially carried out in a closed environment.
- Silicon debug: design validation and testing are increasingly insufficient to remove all bugs before first silicon; the design cycle may require expensive respins.
Why Multiprocessors?
- Microprocessors are the fastest CPUs: collecting several CPUs is much easier than redesigning one.
- Multiple users, multiple applications, multi-tasking within an application.
- Responsiveness and/or throughput; hardware shared between CPUs.
- Complexity of current microprocessors: do we have enough ideas to sustain 1.5x/year improvement? Can we deliver such complexity on schedule?
- Slow (but steady) improvement in parallel software (scientific apps, databases, OS).
- Emergence of the embedded market driving microprocessors in addition to desktops; embedded functional parallelism.
What Level Parallelism?
- Bit-level parallelism: 1970 to ~1985 (4-bit, 8-bit, 16-bit, 32-bit microprocessors)
- Instruction-level parallelism (ILP): ~1985 through today
  - Pipelining
  - Superscalar
  - VLIW
  - Out-of-order execution
  - Limits to the benefits of ILP?
- Process-level or thread-level parallelism: mainstream for general-purpose computing?
  - Servers are parallel
  - High-end desktop dual-processor PCs
- Program-level parallelism, or even distributed computing
Popular Categories
- SISD (Single Instruction, Single Data): uniprocessors
- MISD (Multiple Instruction, Single Data): multiple processors on a single data stream
- SIMD (Single Instruction, Multiple Data): examples: Illiac-IV, CM-2
  - Simple programming model, low overhead, flexibility
  - All custom integrated circuits
  - (The phrase was reused by Intel marketing for media instructions, which are roughly vector operations)
- MIMD (Multiple Instruction, Multiple Data): flexible
  - MIMD is the current winner for MPSoC
Major MIMD Styles
Centralized shared memory: uniform memory access (UMA) time, also called a shared memory processor (SMP).
Major MIMD Styles (Cont’d)
Distributed memory (a memory module with each CPU):
- More memory bandwidth, lower (local) memory latency
- Drawback: longer communication latency
- Drawback: more complex software model
OS Option 1
Each CPU has its own OS:
- Statically allocate physical memory to each CPU
- Each CPU runs its own independent OS
- Peripherals are shared
- Each CPU handles its own processes' system calls
- Used in early multiprocessor systems
- Simple to implement; avoids concurrency issues by not sharing
- Issues:
  1. Each processor has its own scheduling queue and its own memory partition.
  2. Consistency is an issue with independent disk buffer caches and potentially shared files.
OS Option 2
Master-slave multiprocessors:
- The OS mostly runs on a single fixed CPU; user-level applications run on the other CPUs.
- All system calls are passed to the master CPU for processing.
- Very little synchronization required; simple to implement.
- A single centralized scheduler keeps all processors busy.
- Memory can be allocated as needed to all CPUs.
- Issue: the master CPU becomes a bottleneck.
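The master-slave scheme above can be sketched with a queue: slaves forward every system call to the one CPU running the OS and wait for the reply. This is a toy illustration; the CPU ids, call names, and queue mechanism are all invented for the example.

```python
# Sketch of the master-slave model: slave CPUs forward system calls to
# the master over a queue, and only the master executes kernel work.
import queue
import threading

syscall_queue = queue.Queue()

def master_cpu(n_calls):
    """The one CPU that runs the OS: services every forwarded syscall."""
    handled = []
    for _ in range(n_calls):
        cpu_id, name, reply = syscall_queue.get()   # next forwarded call
        handled.append((cpu_id, name))
        reply.put(f"{name}: ok")                    # result back to the slave
    return handled

def slave_cpu(cpu_id, name):
    """A slave runs user code; any syscall is passed to the master."""
    reply = queue.Queue()
    syscall_queue.put((cpu_id, name, reply))
    return reply.get()                              # wait for the master
```

The bottleneck is visible in the structure: every syscall from every slave funnels through the single `syscall_queue`, so the master's service rate caps the whole system.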
OS Option 3
Symmetric multiprocessors (SMP):
- The OS kernel runs on all processors, with load and resources balanced among them.
- One alternative: a single mutex (mutual exclusion object) makes the entire kernel one large critical section, so only one CPU is in the kernel at a time; this is only slightly better than master-slave.
- Better alternative: identify independent parts of the kernel and make each its own critical section, which allows parallelism in the kernel.
- Issues: this is a difficult task; the code is mostly similar to uniprocessor code, and the hard part is identifying independent parts that don't interfere with each other.
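The two locking alternatives above can be sketched side by side. This is an illustration only; the subsystem names are invented, and real kernels split locks far more finely than this.

```python
# Sketch of the two SMP kernel-locking alternatives described above.
import threading

# Alternative 1: one "big kernel lock" serializes all kernel work,
# so only one CPU is ever inside the kernel at a time.
big_kernel_lock = threading.Lock()

def kernel_call_big_lock(work):
    with big_kernel_lock:              # the whole kernel is one critical section
        return work()

# Alternative 2: independent kernel parts each get their own lock,
# so e.g. the scheduler and the filesystem can run on different CPUs
# at the same time.
subsystem_locks = {
    "scheduler": threading.Lock(),
    "filesystem": threading.Lock(),
}

def kernel_call_fine_grained(subsystem, work):
    with subsystem_locks[subsystem]:   # only this subsystem is serialized
        return work()
```

The hard part the slide mentions is exactly choosing the keys of `subsystem_locks`: the subsystems must be genuinely independent, or fine-grained locking introduces races and deadlocks that the big lock never had.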
Example: Quad-Processor Pentium Pro
- SMP, bus interconnection.
- 4 × 200 MHz Intel Pentium Pro processors.
- 8 KB + 8 KB L1 cache per processor.
- 512 KB L2 cache per processor.
- Snoopy cache coherence.
- Employed in systems from Compaq, HP, IBM, and NetPower.
- OS: Windows NT, Solaris, Linux, etc.
MPSoC Design Goals
1. Fast design time
- Very important in typical MPSoC applications: game/network processors, high-definition video encoding, multimedia hubs, base-band telecom circuits, etc., which have particularly tight time-to-market and time-window constraints.

2. Higher-level abstractions: system-level modeling
- Hardware side: RTL models are too time-consuming to design and verify for MPSoCs (cores and associated peripherals). At the RTL abstraction, designers produce the equivalent of 4 to 10 gates per line of RTL code. A 100-million-gate MPSoC built only from RTL code, even with 90% code reuse, requires over 1 million lines of code for the remaining 10 million gates — unrealistic for most MPSoC target markets. A higher abstraction level is needed on the hardware side.
- Software side: MPSoCs use hundreds of thousands of lines of dedicated software and complex software development environments; mostly low-level programming languages cannot be used anymore. Higher-level abstractions are needed on the software side too.
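The slide's productivity estimate can be checked with a quick back-of-the-envelope calculation:

```python
# Back-of-the-envelope check of the RTL productivity estimate above.
total_gates = 100_000_000   # a 100 million-gate MPSoC
reuse_pct = 90              # 90% of the design is reused code

new_gates = total_gates * (100 - reuse_pct) // 100   # 10 million new gates

# At the optimistic end of the 4-10 gates/line range:
lines_at_10 = new_gates // 10   # 1,000,000 lines of new RTL
# At the pessimistic end:
lines_at_4 = new_gates // 4     # 2,500,000 lines of new RTL
```

So even with 90% reuse and the best-case productivity, the remaining hand-written RTL is on the order of a million lines, which supports the slide's conclusion that pure-RTL design is unrealistic at this scale.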
MPSoC Design Goals (cont’d)
3. Predictability of results
- High-level abstractions hide precise circuit behavior (timing information).
- MPSoCs are mostly targeted at real-time applications, so accurate performance information must be available at design time.

4. Meeting design metrics
- High-level design metrics and performance estimation are essential parts of MPSoC design methodologies.
- A system's design metrics are not easy to compose from the design metrics of its components.
MPSoC Design Methodologies
Design steps:
1. Design space exploration: hardware/software partitioning, selection of the architectural platform and components.
2. Architecture design: design of components, hardware/software interface design.

- The design process must consider TTM (time to market), system performance, power, and cost.
- Reuse of predesigned components is necessary for reducing design time, but their integration into a system is challenging.
- A complete design flow requires multiple capabilities and tools because of the complexity and diversity of applications.
MPSoC Design Methodologies (cont’d)
Competing EDA approaches to improve productivity:

1. Top-down approaches start with an architectural solution, target architecture, or architectural platform:
- Synthesis from system-level models:
  - COSYMA environment for hardware/software co-synthesis
  - POLIS for hardware/software co-design of embedded systems
  - SpecC, SystemC
  - ODYSSEY
- Platform-based design

2. Bottom-up (component-based) approaches start with a set of components and provide a set of primitives to build application-specific architectures and communication APIs.
- Goal: allow the integration of heterogeneous processors and communication protocols by using abstract interconnections.
- Behavior and communication must be separated in the system specification; system communication can then be described at a higher level and refined independently of the system behavior.
- Two approaches described previously: standard bus protocol, standard component protocol.
Synthesis from System Level Models
1. Start with an informal model of the application.
2. Build a more formal SoC specification (one capable of being validated). The system architecture is fixed and HW/SW partitioning is decided. This produces a golden architecture model: the specification of the HW components is fixed, along with the global structure of the on-chip network.
3. Design the SW.
4. Design the HW components.
5. Interconnect the HW and SW components while respecting the constraints described in the golden architecture model.

This gives a full design flow from a system-level specification to the RTL architecture.