High Performance Embedded Systems MPSoCs


High Performance Embedded Systems

July 2020

Electronics Engineering Department

Electronics Master Program

MPSoCs

Outline

2

• Multiprocessors Architecture and Taxonomy

• Parallel Execution Mechanism

• Multiprocessors Design Techniques

• Memory Systems

• Processors Symmetry

• Co-processing

3

Multiprocessors Architecture and Taxonomy

Taken from: https://arstechnica.com/gadgets/2020/05/intels-comet-lake-desktop-cpus-are-here/

Intel 4004 Core i9??

4

Multiprocessors Architecture and Taxonomy

Taken from: https://arstechnica.com/gadgets/2020/05/intels-comet-lake-desktop-cpus-are-here/

Intel 4004 Core i9

5

Multiprocessors Architecture and Taxonomy

Taken from: https://arstechnica.com/gadgets/2020/05/intels-comet-lake-desktop-cpus-are-here/

Exynos 7420 finFET transistors

6

Multiprocessors Architecture and Taxonomy

Taken from: https://arstechnica.com/gadgets/2020/05/intels-comet-lake-desktop-cpus-are-here/

Exynos 7420 finFET transistors

7

Multiprocessors Architecture and Taxonomy

Taken from: https://www.researchgate.net/publication/257711815_Where_Photovoltaics_Meets_Microelectronics/figures?lo=1

8

Multiprocessors Architecture and Taxonomy

Taken from: https://www.semiconductor-digest.com/2020/03/10/transistor-count-trends-continue-to-track-with-moores-law/

9

Multiprocessors Architecture and Taxonomy

Taken from: https://www.elprocus.com/difference-between-soc-system-on-chip-single-board-computer/

SoC

10

Multiprocessors Architecture and Taxonomy

Taken from: http://soc.inha.ac.kr/index.php/Project

2-Parallel Radix-2^4 FFT/IFFT Processor Chip for MB-OFDM UWB communications

11

Multiprocessors Architecture and Taxonomy

Taken from: PrSoC: Programmable System-on-chip (SoC) for silicon prototyping IEEE 2008

12

Multiprocessors Architecture and Taxonomy

Taken from: https://www.elprocus.com/difference-between-soc-system-on-chip-single-board-computer/

SoC

MPSoC

13

Multiprocessors Architecture and Taxonomy

Taken from: https://commons.wikimedia.org/wiki/File:ARM-Cortex-A9.gif

MPSoCs?

14

Multiprocessors Architecture and Taxonomy

SoC

Taken from: W. Wolf Multiprocessor Systems-On-Chip

• An SoC is an integrated circuit that implements most or all of the functions of a complete electronic system.

• The most fundamental characteristic of

an SoC is complexity.

15

Multiprocessors Architecture and Taxonomy

SoC

Taken from: W. Wolf Multiprocessor Systems-On-Chip

Many product categories:

• Cell phones.

• Telecommunications and networking.

• Digital television.

• Video games.

• …..

16

Multiprocessors Architecture and Taxonomy

SoC Example

Taken from: W. Wolf Multiprocessor Systems-On-Chip

Processing Elements

17

Multiprocessors Architecture and Taxonomy

SoC Example

Taken from: W. Wolf Multiprocessor Systems-On-Chip

Memory

18

Multiprocessors Architecture and Taxonomy

SoC Example

Taken from: W. Wolf Multiprocessor Systems-On-Chip

Communications

19

Multiprocessors Architecture and Taxonomy

SoC Example

Taken from: W. Wolf Multiprocessor Systems-On-Chip

MPSoCs?

20

Multiprocessors Architecture and Taxonomy

MPSoCs?

Wait!

What is a Parallel Architecture?

21

Multiprocessors Architecture and Taxonomy

Parallel Architecture

“A large collection of processing elements that communicate and cooperate to

solve large problems fast”. - Almasi.

Taken from: M. Aguilar MPSoCs

22

Multiprocessors Architecture and Taxonomy

Parallel Architecture

“A large collection of processing elements that communicate and cooperate to

solve large problems fast”. - Almasi.

Taken from: M. Aguilar MPSoCs

23

Multiprocessors Architecture and Taxonomy

Parallel Architecture

“A large collection of processing elements that communicate and cooperate to

solve large problems fast”. - Almasi.

Taken from: M. Aguilar MPSoCs

SoC

HW+SW

24

Multiprocessors Architecture and Taxonomy

Parallel Architecture

“A large collection of processing elements that communicate and cooperate to

solve large problems fast”. - Almasi.

Taken from: M. Aguilar MPSoCs

SoC

HW+SW

Technology kept advancing

25

Multiprocessors Architecture and Taxonomy

Parallel Architecture

“A large collection of processing elements that communicate and cooperate to

solve large problems fast”. - Almasi.

Taken from: M. Aguilar MPSoCs

SoC

HW+SW

Technology kept advancing

26

Multiprocessors Architecture and Taxonomy

Parallel Architecture

“A large collection of processing elements that communicate and cooperate to

solve large problems fast”. - Almasi.

Taken from: M. Aguilar MPSoCs

SoC

HW+SW

Technology kept advancing → MPSoCs

27

Multiprocessors Architecture and Taxonomy

Taken from: M. Aguilar MPSoCs

Serial Communication

Parallel Communication

28

Multiprocessors Architecture and Taxonomy

Here we go

What are MPSoCs?

Taken from: W. Wolf Multiprocessor Systems-On-Chip

29

Multiprocessors Architecture and Taxonomy

What are MPSoCs?

“Are the latest incarnation of very large-scale integration (VLSI)

technology”

Taken from: W. Wolf Multiprocessor Systems-On-Chip

???

30

Multiprocessors Architecture and Taxonomy

What are MPSoCs?

“Are the latest incarnation of very large-scale integration (VLSI)

technology”

Taken from: W. Wolf Multiprocessor Systems-On-Chip

???

• Silicon

• Power

• Area

• …

31

Multiprocessors Architecture and Taxonomy

What are MPSoCs?

“Are the latest incarnation of very large-scale integration (VLSI)

technology”

“A single integrated circuit can contain over

100 million transistors, and the International Technology Roadmap

for Semiconductors predicts that chips with a billion transistors are

within reach”

Taken from: W. Wolf Multiprocessor Systems-On-Chip

32

Multiprocessors Architecture and Taxonomy

Taken from: M. Aguilar MPSoCs

MPSoCs

“The multiprocessor System-on-Chip (MPSoC) is a system-on-a-chip

(SoC) which uses multiple processors (see multi-core), usually

targeted for embedded applications”.

SoC

HW+SW

MPSoCs Understood!!

33

Multiprocessors Architecture and Taxonomy

Taken from: M. Aguilar MPSoCs

MPSoCs

“The multiprocessor system-on-chip (MPSoC) uses multiple CPUs

along with other hardware subsystems to implement a system”. -

Wayne Wolf.

Multiprocessor = Multicore?

34

Multiprocessors Architecture and Taxonomy

General Structure of MPSoCs

Processing Elements (PE)

• Relation with application context and requirements.

• Homogeneous MPSoCs.

• Heterogeneous MPSoCs.

• Interconnection Element

• Buses.

• NoCs (Networks on Chip). More information here.

Taken from: M. Aguilar MPSoCs

35

Multiprocessors Architecture and Taxonomy

Taken from: M. Aguilar MPSoCs

Advantages of MPSoCs

• Performance

• Powerful platform (Cores).

• Users.

• Applications.

• Tasks within the same application.

Power Consumption

• Lower power thanks to the parallel approach.

36

Multiprocessors Architecture and Taxonomy

Taken from: M. Aguilar MPSoCs

37

Multiprocessors Architecture and Taxonomy

Taken from: M. Aguilar MPSoCs

MPSoCs Benefits

• Wireless.

• Multimedia: video and audio.

• Health.

• Military.

• Avionics.

• Aerospace.

38

Multiprocessors Architecture and Taxonomy

Taken from: M. Aguilar MPSoCs

Multiprocessor = Multicore?

39

Multiprocessors Architecture and Taxonomy

Taken from: M. Aguilar MPSoCs

Multiprocessor

• Platform with several CPUs.

• Uses a parallel approach.

Multicore

• Platform with a single CPU package.

• Multiple cores inside that CPU.

40

Multiprocessors Architecture and Taxonomy

Taken from: M. Aguilar MPSoCs

MPSoCs Software

41

Multiprocessors Architecture and Taxonomy

Taken from: M. Aguilar MPSoCs

Parallel Approaches

Parallel Approaches

42

Multiprocessors Architecture and Taxonomy

Taken from: M. Aguilar MPSoCs

Parallel Approaches

Parallel Approaches: Bits, Instructions, Data, Threads, Tasks

43

Multiprocessors Architecture and Taxonomy

Taken from: M. Aguilar MPSoCs

MPSoCs Architecture?

44

Multiprocessors Architecture and Taxonomy

Taken from: M. Aguilar MPSoCs

MPSoCs

PEs

45

Multiprocessors Architecture and Taxonomy

Taken from: M. Aguilar MPSoCs

MPSoCs

Homogeneous Heterogeneous

PEs

46

Multiprocessors Architecture and Taxonomy

Taken from: M. Aguilar MPSoCs

MPSoCs Heterogeneous

• Different PEs, for example:

• GPUs (Graphics Processing Units).

• DSPs.

• HW accelerators.

• NoC infrastructure.

• Better performance and power consumption.

• Used in embedded systems:

• Portable systems.

• Power consumption.

47

Multiprocessors Architecture and Taxonomy

Taken from: M. Aguilar MPSoCs

MPSoCs Homogeneous

• Identical PEs make up the SoC.

• The PE is instantiated several times.

• Instances are connected by the communication infrastructure.

• Flexibility and scalability.

• Worse power consumption.

48

Multiprocessors Architecture and Taxonomy

Taken from: M. Aguilar MPSoCs

MPSoCs Taxonomy?

49

Multiprocessors Architecture and Taxonomy

Taken from: M. Aguilar MPSoCs

Processor Organization

• Serial: SISD
  • Uniprocessor (overlapped operations, multiple ALUs)
• Parallel: SIMD, MISD, MIMD
  • SIMD: vector processor, array processor
  • MIMD, tightly coupled (shared memory): symmetric multiprocessor (SMP), nonuniform memory access (NUMA)
  • MIMD, loosely coupled (distributed memory): clusters

50

Multiprocessors Architecture and Taxonomy

Taken from: M. Aguilar MPSoCs

Where are MPSoCs located?

51

Multiprocessors Architecture and Taxonomy

Taken from: M. Aguilar MPSoCs

Processor Organization

• Serial: SISD
  • Uniprocessor (overlapped operations, multiple ALUs)
• Parallel: SIMD, MISD, MIMD
  • SIMD: vector processor, array processor
  • MIMD, tightly coupled (shared memory): symmetric multiprocessor (SMP), nonuniform memory access (NUMA)
  • MIMD, loosely coupled (distributed memory): clusters

MPSoCs sit in the MIMD branch.

52

Multiprocessors Architecture and Taxonomy

Taken from: M. Aguilar MPSoCs and Parallel Computing Lectures Notes

MIMD

• This architecture executes different operations over different data streams.

• The multiprocessing approach and MPSoCs fall into this category.

53

Multiprocessors Architecture and Taxonomy

Taken from: M. Aguilar MPSoCs

MPSoCs Architecture

• PEs: homogeneous, heterogeneous.

• Memory access: uniform access (UMA), non-uniform access (NUMA).

• Processors symmetry: SMP (symmetric multi-processing), AMP (asymmetric multi-processing).

• Memory architecture: shared memory, distributed memory.

54

Multiprocessors Architecture and Taxonomy

Taken from: M. Aguilar MPSoCs

ARM Cortex A9

55

Multiprocessors Architecture and Taxonomy

Taken from: M. Aguilar MPSoCs

Analog Devices - Blackfin

56

Multiprocessors Architecture and Taxonomy

Taken from: M. Aguilar MPSoCs

TI Davinci DM355

57

Multiprocessors Architecture and Taxonomy

Taken from: M. Aguilar MPSoCs

TI OMAP5

58

Multiprocessors Architecture and Taxonomy

Taken from: M. Aguilar MPSoCs

STMicroelectronics Nomadik

59

Multiprocessors Architecture and Taxonomy

Taken from: M. Aguilar MPSoCs

Nexperia

60

Multiprocessors Architecture and Taxonomy

Taken from: http://linuxgizmos.com/new-arm-cortex-a72-nearly-twice-as-fast-as-cortex-a57/

Cortex-A72

Outline

61

• Multiprocessors Architecture and Taxonomy

• Parallel Execution Mechanism

• Multiprocessors Design Techniques

• Memory Systems

• Processors Symmetry

• Co-processing

62

Parallel Execution Mechanism

Taken from: Parallel Computing Lectures Notes

63

Parallel Execution Mechanism

Taken from: Parallel Computing Lectures Notes

Consider the following approaches:

• Shared memory.

• Threads.

• Message Passing.

• Data Parallel.

• Hybrid.

• Others

All these can be implemented on any architecture.

64

Parallel Execution Mechanism

Taken from: Parallel Computing Lectures Notes

Consider the following approaches:

• Shared memory.

• Threads.

• Message Passing.

• Data Parallel.

• Hybrid.

• Others

All these can be implemented on any architecture.

65

Parallel Execution Mechanism

Taken from: Parallel Computing Lectures Notes

Shared Memory

• Tasks share a common address space, which they read and write

asynchronously.

• Various mechanisms such as locks/semaphores may be used to control access to the shared memory.

• Advantage

• No need to explicitly communicate data between tasks, which simplifies programming.

• Disadvantages

• Care is needed when managing memory, to avoid synchronization conflicts.

• Harder to control data locality.

66

Parallel Execution Mechanism

Taken from: Parallel Computing Lectures Notes

In Hardware

• Shared memory systems use:

• UMA (Uniform Memory Access)

• NUMA (Non- Uniform Memory

Access)

• COMA (Cache-only memory

architecture)

In Software

• Inter-process communication (IPC).

• Virtual memory mapping.
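As a concrete illustration of these software mechanisms, here is a minimal sketch that combines POSIX shared memory (IPC) with virtual memory mapping and a process-shared semaphore; the object name "/mpsoc_demo" and the counter layout are arbitrary choices for this example, and error handling is omitted for brevity.

/* Minimal POSIX shared-memory sketch: processes mapping "/mpsoc_demo"
 * share a counter protected by a process-shared semaphore.
 * Compile with: cc shm_demo.c -lrt -pthread
 * (In a real setup, only the creating process should call sem_init.)  */
#include <fcntl.h>
#include <semaphore.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

struct shared_block {
    sem_t lock;        /* guards the counter against concurrent writers */
    long  counter;
};

int main(void) {
    int fd = shm_open("/mpsoc_demo", O_CREAT | O_RDWR, 0600);
    ftruncate(fd, sizeof(struct shared_block));
    struct shared_block *blk = mmap(NULL, sizeof(*blk),
                                    PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    sem_init(&blk->lock, 1, 1);          /* 1 = shared between processes */

    sem_wait(&blk->lock);                /* critical section */
    blk->counter++;
    sem_post(&blk->lock);

    printf("counter = %ld\n", blk->counter);
    munmap(blk, sizeof(*blk));
    close(fd);
    return 0;
}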

67

Parallel Execution Mechanism

Taken from: Parallel Computing Lectures Notes

Consider the following approaches:

• Shared memory.

• Threads.

• Message Passing.

• Data Parallel.

• Hybrid.

• Others

All these can be implemented on any architecture.

68

Parallel Execution Mechanism

Taken from: Parallel Computing Lectures Notes

Threads

• A thread can be considered as a

subroutine in the main program.

• Threads communicate with each other

through the global memory.

• Commonly associated with shared

memory architectures and operating

systems.

• POSIX Threads (pthreads).

• OpenMP.
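A minimal pthreads sketch of this model (the worker function and iteration count are illustrative only): two threads communicate through a global counter, guarded by a mutex.

/* Minimal pthreads sketch: two threads share a global counter.
 * Compile with: cc threads.c -pthread                           */
#include <pthread.h>
#include <stdio.h>

static long counter = 0;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg) {
    (void)arg;
    for (int i = 0; i < 100000; i++) {
        pthread_mutex_lock(&lock);   /* global memory guarded by a mutex */
        counter++;
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, worker, NULL);
    pthread_create(&t2, NULL, worker, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("counter = %ld\n", counter);   /* expected 200000 */
    return 0;
}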

69

Parallel Execution Mechanism

Taken from: Parallel Computing Lectures Notes

Threads

Advantages

• Responsiveness.

• Faster execution.

• Lower resource consumption.

• Better system utilization.

• Simplified sharing and communication.

• Parallelization.

Drawbacks

• Synchronization.

• A crashing thread can bring down the whole process.

70

Parallel Execution Mechanism

Taken from: Parallel Computing Lectures Notes

Consider the following approaches:

• Shared memory.

• Threads.

• Message Passing.

• Data Parallel.

• Hybrid.

• Others.

All these can be implemented on any architecture.

71

Parallel Execution Mechanism

Taken from: Parallel Computing Lectures Notes

Message Passing

• A set of tasks that use their own local memory

during computation.

• Data exchange through sending and receiving

messages.

• Data transfer usually requires cooperative

operations to be performed by each process.

• For example, a send operation must have a

matching receive operation.

• MPI (Message Passing Interface).

• A minimal example is sketched below.
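A minimal MPI sketch in C (the tag, payload, and two-rank setup are arbitrary choices for illustration): rank 0 issues a send and rank 1 posts the matching receive, each task using only its own local memory.

/* Minimal MPI sketch: rank 0 sends, rank 1 receives.
 * Compile and run with: mpicc mpi_demo.c -o mpi_demo && mpirun -np 2 ./mpi_demo */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, value = 0;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        value = 42;                                   /* data in local memory */
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);                  /* matching receive */
        printf("rank 1 received %d\n", value);
    }

    MPI_Finalize();
    return 0;
}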

72

Parallel Execution Mechanism

Taken from: Parallel Computing Lectures Notes

Consider the following approaches:

• Shared memory.

• Threads.

• Message Passing.

• Data Parallel.

• Hybrid.

• Others.

All these can be implemented on any architecture.

73

Parallel Execution Mechanism

Taken from: Parallel Computing Lectures Notes

Data Parallel

• Consider the following characteristics:

• Parallel work performs operations on a data set,

organized into a common structure.

• Tasks work collectively on the same data structure, with each task working on a different partition.

• Tasks perform the same operation on their partition.

• On shared memory architectures, all tasks may have access to the data structure through global memory.

• On distributed memory architectures, the data structure is split up and resides as “chunks” in the local memory of each task.

• More information here.
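A data-parallel sketch using OpenMP in C (the array size and the scaling operation are arbitrary for illustration): every thread applies the same operation to its own partition of the array.

/* Data-parallel sketch with OpenMP: same operation, different partitions.
 * Compile with: cc saxpy.c -fopenmp                                      */
#include <omp.h>
#include <stdio.h>

#define N 1000000

static float x[N], y[N];

int main(void) {
    for (int i = 0; i < N; i++) { x[i] = 1.0f; y[i] = 2.0f; }

    /* The loop iterations are partitioned among the available threads. */
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        y[i] = 2.0f * x[i] + y[i];

    printf("y[0] = %f (threads available: %d)\n", y[0], omp_get_max_threads());
    return 0;
}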

74

Parallel Execution Mechanism

Taken from: Parallel Computing Lectures Notes

Consider the following approaches:

• Shared memory.

• Threads.

• Message Passing.

• Data Parallel.

• Hybrid.

• Others

All these can be implemented on any architecture.

75

Parallel Execution Mechanism

Taken from: Parallel Computing Lectures Notes

Hybrid

• Combines several models (for example, MPI + OpenMP).

• Single Program Multiple Data (SPMD)

• Single program is executed by all tasks simultaneously.

• Multiple Program Multiple Data (MPMD)

• Uses multiple executables; each task may execute the same or a different program than the other tasks.
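A hybrid MPI + OpenMP sketch (the work loop is only a placeholder): MPI distributes work across ranks, while OpenMP threads share memory within each rank.

/* Hybrid sketch: MPI across ranks, OpenMP threads inside each rank.
 * Compile with: mpicc hybrid.c -fopenmp                             */
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    long local_sum = 0;
    /* Each rank parallelizes its own chunk of work with threads. */
    #pragma omp parallel for reduction(+:local_sum)
    for (long i = 0; i < 1000000; i++)
        local_sum += i % 7;

    long global_sum = 0;
    MPI_Reduce(&local_sum, &global_sum, 1, MPI_LONG, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("global sum = %ld\n", global_sum);

    MPI_Finalize();
    return 0;
}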

76

Parallel Execution Mechanism

Taken from: Parallel Computing Lectures Notes

Consider the following approaches:

• Shared memory.

• Threads.

• Message Passing.

• Data Parallel.

• Hybrid.

• Others. (Depends on the architecture)

All these can be implemented on any architecture.

77

Parallel Execution Mechanism

Taken from: Parallel Computing Lectures Notes

Others

• MCAPI (Multicore Association)

• Poly-Platform

• CUDA

78

Parallel Execution Mechanism

Taken from: Parallel Computing Lectures Notes

Others

• MCAPI (Multicore Association)

• Poly-Platform

• CUDA

79

Parallel Execution Mechanism

Taken from: https://en.wikipedia.org/wiki/Multicore_Association

MCAPI (Multicore Association)

• Founded in 2005

• Its first specification is referred to as MCAPI.

• Based on message passing.

• Targets systems that are heterogeneous in hardware, toolchain, and programming language.

• Active working groups:

• MCAPI.

• Virtualization.

• Open Asymmetric Multiprocessing (OpenAMP).

80

Parallel Execution Mechanism

Taken from: Parallel Computing Lectures Notes

Others

• MCAPI (Multicore Association)

• Poly-Platform

• CUDA

81

Parallel Execution Mechanism

Taken from: http://polycoresoftware.com/poly-platform

Poly-Platform

• A collection of productivity tools.

• Supports the migration process (e.g., from single-core to multicore).

• Mainly targets multicore platforms.

• Provides support for several SoCs, OSs, and transport layers.

82

Parallel Execution Mechanism

Taken from: Parallel Computing Lectures Notes

Others

• MCAPI (Multicore Association)

• Poly-Platform

• CUDA

83

Parallel Execution Mechanism

Taken from: https://en.wikipedia.org/wiki/CUDA

CUDA

• Initial release 2007.

• Parallel computing platform and

application programming interface.

• Created by NVIDIA.

• GPU-based approach.

• Supported on Windows, Linux, and macOS.

Outline

84

• Multiprocessors Architecture and Taxonomy

• Parallel Execution Mechanism

• Multiprocessors Design Techniques

• Memory Systems

• Processors Symmetry

• Co-processing

85

Multiprocessors Design Techniques

Taken from: W.Wolf High-Performance Embedded Computing

Embedded Systems Design Flows

• Co-design flows.

• Platform-based design.

• Two-stage process.

• Programming platforms.

• Standards-Based design.

MPSoCs?

86

Multiprocessors Design Techniques

Challenges

• Software development is a major challenge for MPSoC designers.

• Software that runs on the multiprocessor must be high performance, real time,

and low power.

• Each MPSoC requires its own software development environment: compiler,

debugger, simulator, and other tools.

• Better understanding of how to abstract tasks properly to capture the essential

characteristics of their low-level behavior for system-level analysis.

Taken from: W. Wolf Multiprocessor Systems-on-Chips

87

Multiprocessors Design Techniques

Taken from: W. Wolf Multiprocessor Systems on Chip

Challenges

• Networks-on-chips have emerged over the past few years as an architectural

approach to the design of single-chip multiprocessors.

• FPGAs have emerged as a viable alternative to application-specific integrated

circuits (ASICs) in many markets. FPGA fabrics are also starting to be

integrated into SoCs.

88

Multiprocessors Design Techniques

Taken from: SoC Lectures Notes

Challenges

• Sequential C code is not easy to replace.

• Algorithm specifications contain parallel specifications (models of computation: KPN, SDF, etc.).

• New programming languages are not readily adopted.

• Automatic parallelization and parallel programming.

• Platform-based design (SW synthesis) or SW and HW synthesis.

89

Multiprocessors Design Techniques

Taken from: MPSoCs https://slideplayer.com/slide/8773117/

Challenges

All MPSoC designs have the following requirements:

• Speed.

• Power.

• Area.

• Application Performance.

• Time to market.

90

Multiprocessors Design Techniques

Taken from: SoC Lectures Notes

MPSoCs Programming

• Task mapping to processors or cores.

• Inter-processor communication management.

• Data transfer engine management.

• Shared resource management.

• Memory management.

• Debugging.

91

Multiprocessors Design Techniques

Taken from: SoC Lectures Notes

MPSoCs Exploration

• Separate computation from communication.

92

Multiprocessors Design Techniques

Taken from: SoC Lectures Notes

Virtual Processing Unit VPU

• Load simulator: It is a high-level simulation of

the core behavior.

• Functional simulator: native execution of tasks; scheduling is handled by the VPU OS.

93

Multiprocessors Design Techniques

Taken from: SoC Lectures Notes

Virtual Processing Unit VPU

Allows spatial and temporal modeling of task mapping to PEs

94

Multiprocessors Design Techniques

Taken from: SoC Lectures Notes

Virtual Platform

• It is a software model that allows the exploration of hardware and software.

• It allows hardware platform exploration and optimization.

• Software development, debugging and optimization.

• Concurrent hardware and software design.

95

Multiprocessors Design Techniques

Taken from: SoC Lectures Notes

Virtual Platform

• Requirements:

• High speed in terms of simulation process.

• Compromise between simulation speed and precision.

• Flexibility.

• Usability by developers who are not hardware experts.

96

Multiprocessors Design Techniques

Design Techniques

• Core-based Strategy.

• Wrappers.

• System-level design flow.

• Platform-based design.

• Component-based design.

Taken from: W.Wolf High-Performance Embedded Computing

97

Multiprocessors Design Techniques

Design Techniques

• Core-based Strategy.

• Wrappers.

• System-level design flow.

• Platform-based design.

• Component-based design.

Taken from: W.Wolf High-Performance Embedded Computing

98

Multiprocessors Design Techniques

Core-based Strategy

• Core-based synthesis strategy for the IBM CoreConnect bus.

• Coral tool automates many of the tasks required to stitch together multiple

cores using virtual components.

• Each virtual component describes the interfaces for a class of real

components.

• Coral can synthesize some combinational logic.

• Coral also checks the connections between cores using Boolean decision

diagrams.

Taken from: W.Wolf High-Performance Embedded Computing

99

Multiprocessors Design Techniques

Core-based Strategy

CoreConnect provides three types of buses:

• A high-speed processor local bus (PLB).

• An on-chip peripheral bus (OPB).

• A device control register (DCR) bus for configuration and status information.

Taken from: W.Wolf High-Performance Embedded Computing

100

Multiprocessors Design Techniques

Taken from: SoC Lectures Notes

Core-based Strategy

101

Multiprocessors Design Techniques

Design Techniques

• Core-based Strategy.

• Wrappers.

• System-level design flow.

• Platform-based design.

• Component-based design.

Taken from: W.Wolf High-Performance Embedded Computing

102

Multiprocessors Design Techniques

Wrappers

• Treats both hardware and software as

components.

• A wrapper is a design unit that interfaces a

module to another module.

• A wrapper can be hardware or software

and may include both.

• The wrapper performs only low-level adaptations, such as protocol transformation.

Taken from: W.Wolf High-Performance Embedded Computing

103

Multiprocessors Design Techniques

Wrappers

Heterogeneous multiprocessors introduce several types of problems:

• Many chips have multiple communication networks to match the network to

the processing needs. Synchronizing communication across network

boundaries is more difficult than communicating within a network.

• Specialized hardware is often needed to accelerate interprocess

communication and free the CPU for more interesting computations.

• The communication primitives should be at a higher level of abstraction than

shared memory.

Taken from: W.Wolf High-Performance Embedded Computing

104

Multiprocessors Design Techniques

Wrappers

When a dedicated CPU is added to the system, its software must be adapted in several ways:

1. The software must be updated to support the platform’s communication

primitives.

2. Optimized implementations of the host processor’s communication

functions must be provided for interprocessor communication.

3. Synchronization functions must be provided.

Taken from: W.Wolf High-Performance Embedded Computing

105

Multiprocessors Design Techniques

Design Techniques

• Core-based Strategy.

• Wrappers.

• System-level design flow.

• Platform-based design.

• Component-based design.

Taken from: W.Wolf High-Performance Embedded Computing

106

Multiprocessors Design Techniques

System-Level Design

• An abstract platform is created from a combination of system requirements,

models of the software, and models of the hardware components.

• Abstract platform is analyzed to determine the application’s performance

and power/energy consumption.

• Based on the results of this analysis, software is allocated and scheduled

onto the platform.

• The result is a golden abstract architecture that can be used to build the implementation.

Taken from: W.Wolf High-Performance Embedded Computing

107

Multiprocessors Design Techniques

System-Level Design

Taken from: W.Wolf High-Performance Embedded Computing

108

Multiprocessors Design Techniques

System-Level Design

Major elements of an abstract architecture:

1. Software tasks are described by their data and

scheduling dependencies; they

interface to an API.

2. Hardware components consist of a core and an

interface.

3. The hardware/software integration is modeled by

the communication network that connects the CPUs

that run the software and the hardware IP

cores.

Taken from: W.Wolf High-Performance Embedded Computing

109

Multiprocessors Design Techniques

Design Techniques

• Core-based Strategy.

• Wrappers.

• System-level design flow.

• Platform-based design.

• Component-based design.

Taken from: W.Wolf High-Performance Embedded Computing

110

Multiprocessors Design Techniques

Platform-based Design

• Design space: platform selection

• Platform programming

• Multi-CPUs

• Concurrency

• Real-Time

• The platform developer must provide tools (compilers, editors, debuggers, simulators, etc.).

Taken from: Introduction to Embedded Systems

111

Multiprocessors Design Techniques

Platform-based Design

• Start with functional specifications

• Task graphs.

• Nodes: Task to complete

• Edges: Communication and

dependence between tasks

• Execution time on the nodes.

• Data communicated on the edges.

Taken from: MPSoCs https://slideplayer.com/slide/8773117/

112

Multiprocessors Design Techniques

Platform-based Design

• Map tasks onto pre-designed HW.

• Use an extended task graph for SW and communication.

Taken from: MPSoCs https://slideplayer.com/slide/8773117/

113

Multiprocessors Design Techniques

Platform-based Design

• Map tasks onto pre-designed HW.

• Use an extended task graph for SW and communication.

Taken from: MPSoCs https://slideplayer.com/slide/8773117/

114

Multiprocessors Design Techniques

Design Techniques

• Core-based Strategy.

• Wrappers.

• System-level design flow.

• Platform-based design.

• Component-based design.

Taken from: W.Wolf High-Performance Embedded Computing

115

Multiprocessors Design Techniques

Component Based Design

• Conceptual MPSoC platform.

• SW, processors, IP, communication fabric.

• Parallel development.

• Use of APIs.

• Quicker time to market.

Taken from: MPSoCs https://slideplayer.com/slide/8773117/

116

Multiprocessors Design Techniques

Component Based Design

Taken from: MPSoCs https://slideplayer.com/slide/8773117/

117

Multiprocessors Design Techniques

MPSoC Application Programming Studio (MAPS)

• Developed at RWTH Aachen University in Germany.

• It is a platform that offers tools and technologies for MPSoC programming.

• Main features are:

• Sequential C code partitioning.

• Parallel programming model.

• Mapping and scheduling.

• Different types of applications.

• Functional Verification (Virtual Platform).

• Multiple applications environment.

• Easy-to-use IDE.

Taken from: M. Aguilar SoC Lectures Notes

118

Multiprocessors Design Techniques

MAPS Flow

Taken from: M. Aguilar SoC Lectures Notes

119

Multiprocessors Design Techniques

MAPS Flow

Taken from: M. Aguilar SoC Lectures Notes

120

Multiprocessors Design Techniques

MAPS Programming Model: C for Process Networks (CPN)

• Embedded systems have traditionally been programmed in C.

• CPN is a language developed as an extension of ANSI C in order to describe process networks (KPN and SDF).

• A compiler called cpn-cc performs a source-to-source transformation to convert CPN code into standard C code with the APIs of the target architecture.

Taken from: M. Aguilar SoC Lectures Notes

121

Multiprocessors Design Techniques

MAPS Programming Model: C for Process Networks (CPN)

Taken from: M. Aguilar SoC Lectures Notes

122

Multiprocessors Design Techniques

MAPS Virtual Platform (MVP)

• MAPS Virtual Platform (MVP)

• High level: abstract PEs based on SystemC.

• Low level: (Instruction Set Simulators) ISS-based virtual platform.

• “mPhone”: a virtual smartphone.

Taken from: M. Aguilar SoC Lectures Notes

123

Multiprocessors Design Techniques

Virtual Processing Element

• It is a parameterizable processing element.

• Clock frequency.

• Type (RISC, VLIW, DSP, etc).

• Scheduling algorithm (Round robin, EDF, based on priorities, etc).

Taken from: M. Aguilar SoC Lectures Notes

Outline

124

• Multiprocessors Architecture and Taxonomy

• Parallel Execution Mechanism

• Multiprocessors Design Techniques

• Memory Systems

• Processors Symmetry

• Co-processing

125

Memory Systems

Memory Systems

Taken from: W. Wolf High-Performance Embedded Computing

Memory Systems

126

Memory Systems

Memory Systems

• The memory system is a traditional bottleneck in computing.

• Not only are memories slower than processors, but processor clock rates

are increasing much faster than memory cycle times.

Taken from: W. Wolf High-Performance Embedded Computing and

https://www.taringa.net/+serviciotecnico/consulta-cuello-de-botella-cpu-debil-en-gpu-potente_15casq

127

Memory Systems

Memory Systems

Taken from: Multi-core architectures

128

Memory Systems

Memory Systems

Taken from: MPSoCs Hardware platforms Lectures Notes

129

Memory Systems

Memory Systems

• Start with a look at parallel memory systems in scientific multiprocessors.

• Consider models for memory and motivations for heterogeneous memory

systems.

• Look at what sorts of consistency mechanisms are needed in embedded

multiprocessors.

Taken from: W. Wolf High-Performance Embedded Computing

130

Memory Systems

Memory Systems

Taken from: W. Wolf High-Performance Embedded Computing

Memory Systems

Homogeneous Heterogenous

131

Memory Systems

Memory Systems

Taken from: W. Wolf High-Performance Embedded Computing

Memory Systems

Homogeneous Heterogenous

132

Memory Systems

Memory Systems

To understand memory systems, consider the following case study:

• Scientific processors traditionally use parallel, homogeneous memory

systems to increase system performance.

• Multiple memory banks allow several memory accesses to occur

simultaneously.

Taken from: W. Wolf High-Performance Embedded Computing

133

Memory Systems

Memory Systems

• Each bank is separately addressable.

Taken from: W. Wolf High-Performance Embedded Computing

134

Memory Systems

Memory Systems

• If the memory system has n banks,

then n accesses can be performed in

parallel.

• This is known as the peak access

rate.

Taken from: W. Wolf High-Performance Embedded Computing

135

Memory Systems

Memory Systems

• Cannot keep the memory busy all of

the time.

• A simple statistical model lets us

estimate performance of a random-

access program.

Taken from: W. Wolf High-Performance Embedded Computing

136

Memory Systems

Memory Systems

• Assume that the program accesses a

certain number of sequential

locations, then moves to some other

location.

• Where:

• λ describes the probability of a nonsequential memory access (a branch in the code or a jump to a nonconsecutive data location).

• k describes the number of consecutive sequential accesses.

Taken from: W. Wolf High-Performance Embedded Computing

137

Memory Systems

Memory Systems

• Where:

• p(k) = λ(1 − λ)^(k−1)

• And the mean length of a sequential access sequence, for an m-bank memory, is:

• L_b = (1 − (1 − λ)^m) / λ
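A quick worked example (numbers chosen purely for illustration, reading m as the number of banks): with λ = 0.1 and m = 8, L_b = (1 − 0.9^8) / 0.1 ≈ 5.7, so an average sequential run keeps only about 5.7 of the 8 banks busy before a nonsequential access breaks the streak.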

Taken from: W. Wolf High-Performance Embedded Computing

138

Memory Systems

Memory Systems

• Use program statistics to estimate the average probability of nonsequential accesses, and design the memory system accordingly.

• Use software techniques to

maximize the length of access

sequences wherever possible.

Taken from: W. Wolf High-Performance Embedded Computing

139

Memory Systems

Memory Systems

Taken from: W. Wolf High-Performance Embedded Computing

Memory Systems

Homogeneous Heterogenous

140

Memory Systems

Memory Systems

• Embedded systems can make use of multiple-bank memory systems, but they

also make use of more heterogeneous memory architectures.

• They do so to improve the real-time performance and lower the power

consumption of the memory system.

Taken from: W. Wolf High-Performance Embedded Computing

141

Memory Systems

Memory Systems

Why do heterogeneous memory systems

improve real-time performance?

Taken from: W. Wolf High-Performance Embedded Computing

142

Memory Systems

Memory Systems

• The energy required to perform a memory access depends in part on the size of

the memory block being accessed.

• A heterogeneous memory may be able to use smaller memory blocks, reducing

the access time.

• Energy per access also depends on the number of ports on the memory block.

• By reducing the number of units that can access a given part of memory, the

heterogeneous memory system can reduce the energy required to access that

part of the memory space.

Taken from: W. Wolf High-Performance Embedded Computing

143

Memory Systems

Memory Systems

Taken from: W. Wolf High-Performance Embedded Computing

Memory Systems

Homogeneous Heterogenous

Consistent Memory Systems

144

Memory Systems

Memory Systems

Taken from: W. Wolf High-Performance Embedded Computing

Consistent Memory Systems: shared variables, cache consistency, snooping caches.

145

Memory Systems

Memory Systems

• Shared variables

• We have to worry about whether two processors see the same state of a shared variable.

• If the reads and writes of two processors are interleaved, one processor may overwrite a value just written by another, causing a processor to assume an erroneous value of the variable.

• Critical sections, guarded by semaphores, ensure that critical operations occur in the right order.

• Atomic test-and-set operations (often called spin locks) can be used to guard small pieces of memory.

Taken from: W. Wolf High-Performance Embedded Computing
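A minimal spin-lock sketch using C11 atomics (the protected counter is just an example): atomic_flag provides the atomic test-and-set behavior described above.

/* Spin lock built on an atomic test-and-set (C11 atomics). */
#include <stdatomic.h>
#include <stdio.h>

static atomic_flag lock = ATOMIC_FLAG_INIT;
static int shared_counter = 0;          /* the small piece of guarded memory */

static void lock_acquire(void) {
    /* test-and-set: spin until the previous value was "clear" */
    while (atomic_flag_test_and_set(&lock))
        ;                                /* busy-wait */
}

static void lock_release(void) {
    atomic_flag_clear(&lock);
}

int main(void) {
    lock_acquire();
    shared_counter++;                    /* critical section */
    lock_release();
    printf("counter = %d\n", shared_counter);
    return 0;
}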

146

Memory Systems

Memory Systems

• Cache consistency

• If two processors access the same

memory location, then each may have

a copy of the location in its own cache.

• If one processing element writes that

location, then the other will not

immediately see the change and will

make an incorrect computation.

Taken from: W. Wolf High-Performance Embedded Computing

147

Memory Systems

Memory Systems

• Snooping Cache

• This type of cache contains extra

logic that watches the

multiprocessor interconnect for

memory transactions.

• When it sees a write to a location

that it currently contains, it

invalidates that location.

Taken from: W. Wolf High-Performance Embedded Computing

148

Memory Systems

Memory Systems

Taken from: W. Wolf High-Performance Embedded Computing

Memory Systems Architecture: shared memory, distributed memory, hybrid memory.

149

Memory Systems

Memory Systems

Taken from: W. Wolf High-Performance Embedded Computing

Memory Systems Architecture: shared memory, distributed memory, hybrid memory.

150

Memory Systems

Memory Systems

• Shared Memory

• Shared memory parallel computers vary

widely, but generally have in common the

ability for all processors to access all

memory as global address space.

• Multiple processors can operate

independently but share the same memory

resources.

Taken from: W. Wolf High-Performance Embedded Computing,

https://en.wikipedia.org/wiki/Shared_memory#/media/File:Shared_memory.svg,

https://computing.llnl.gov/tutorials/parallel_comp/#MemoryArch

151

Memory Systems

Memory Systems

• Shared Memory

• Changes in a memory location effected by

one processor are visible to all other

processors.

• Historically, shared memory machines

have been classified as UMA and NUMA,

based upon memory access times.

Taken from: W. Wolf High-Performance Embedded Computing,

https://en.wikipedia.org/wiki/Shared_memory#/media/File:Shared_memory.svg,

https://computing.llnl.gov/tutorials/parallel_comp/#MemoryArch

152

Memory Systems

Memory Systems

• Shared Memory (Uniform Memory

Access UMA)

• Most commonly represented today by

Symmetric Multiprocessor (SMP)

machines.

• Identical processors.

Taken from: W. Wolf High-Performance Embedded Computing,

https://computing.llnl.gov/tutorials/parallel_comp/#MemoryArch

153

Memory Systems

Memory Systems

• Shared Memory (Uniform Memory

Access UMA)

• Equal access and access times to

memory.

Taken from: W. Wolf High-Performance Embedded Computing,

https://computing.llnl.gov/tutorials/parallel_comp/#MemoryArch

154

Memory Systems

Memory Systems

• Shared Memory (Uniform Memory Access

UMA)

• Sometimes called CC-UMA - Cache

Coherent UMA. Cache coherent means if one

processor updates a location in shared

memory, all the other processors know about

the update. Cache coherency is accomplished

at the hardware level.

Taken from: W. Wolf High-Performance Embedded Computing,

https://computing.llnl.gov/tutorials/parallel_comp/#MemoryArch

155

Memory Systems

Memory Systems

• Shared Memory (Non-Uniform Memory

Access NUMA)

• Often made by physically linking two or

more SMPs.

• One SMP can directly access memory of

another SMP.

Taken from: W. Wolf High-Performance Embedded Computing,

https://computing.llnl.gov/tutorials/parallel_comp/#MemoryArch

156

Memory Systems

Memory Systems

• Shared Memory (Non-Uniform Memory

Access NUMA)

• Not all processors have equal access time to

all memories.

• Memory access across link is slower

• If cache coherency is maintained, then may

also be called CC-NUMA - Cache Coherent

NUMA.

Taken from: W. Wolf High-Performance Embedded Computing,

https://computing.llnl.gov/tutorials/parallel_comp/#MemoryArch

157

Memory Systems

Memory Systems

• Shared Memory

• Advantages

• Global address space provides a user-friendly programming perspective to memory.

• Data sharing between tasks is both fast

and uniform due to the proximity of

memory to CPUs.

Taken from: W. Wolf High-Performance Embedded Computing,,

https://computing.llnl.gov/tutorials/parallel_comp/#MemoryArch

158

Memory Systems

Memory Systems

• Shared Memory

• Disadvantages

• The primary disadvantage is the lack of scalability between memory and CPUs. Adding more CPUs can geometrically increase traffic on the shared memory-CPU path and, for cache-coherent systems, geometrically increase the traffic associated with cache/memory management.

Taken from: W. Wolf High-Performance Embedded Computing,,

https://computing.llnl.gov/tutorials/parallel_comp/#MemoryArch

159

Memory Systems

Memory Systems

• Shared Memory

• Disadvantages

• Programmer responsibility for

synchronization constructs that ensure

"correct" access of global memory.

Taken from: W. Wolf High-Performance Embedded Computing,

https://computing.llnl.gov/tutorials/parallel_comp/#MemoryArch

160

Memory Systems

Memory Systems

Taken from: W. Wolf High-Performance Embedded Computing

Memory Systems Architecture: shared memory, distributed memory, hybrid memory.

161

Memory Systems

Memory Systems

• Distributed Memory

• Like shared memory systems, distributed

memory systems vary widely but share a

common characteristic.

• Distributed memory systems require a communication network to connect inter-processor memory.

Taken from: W. Wolf High-Performance Embedded Computing,

https://computing.llnl.gov/tutorials/parallel_comp/#MemoryArch

162

Memory Systems

Memory Systems

• Distributed Memory

• Processors have their own local memory.

Memory addresses in one processor do not

map to another processor, so there is no

concept of global address space across all

processors.

Taken from: W. Wolf High-Performance Embedded Computing,

https://computing.llnl.gov/tutorials/parallel_comp/#MemoryArch

163

Memory Systems

Memory Systems

• Distributed Memory

• Because each processor has its own local

memory, it operates independently.

Changes it makes to its local memory have

no effect on the memory of other

processors. Hence, the concept of cache

coherency does not apply.

Taken from: W. Wolf High-Performance Embedded Computing,

https://computing.llnl.gov/tutorials/parallel_comp/#MemoryArch

164

Memory Systems

Memory Systems

• Distributed Memory

• When a processor needs access to data in

another processor, it is usually the task of

the programmer to explicitly define how

and when data is communicated.

Synchronization between tasks is likewise

the programmer's responsibility.

Taken from: W. Wolf High-Performance Embedded Computing,

https://computing.llnl.gov/tutorials/parallel_comp/#MemoryArch

165

Memory Systems

Memory Systems

• Distributed Memory

• The network "fabric" used for data transfer

varies widely, though it can be as simple as

Ethernet.

Taken from: W. Wolf High-Performance Embedded Computing,

https://computing.llnl.gov/tutorials/parallel_comp/#MemoryArch

166

Memory Systems

Memory Systems

• Distributed Memory

• Advantages

• Memory is scalable with the number

of processors. Increase the number of

processors and the size of memory

increases proportionately.

Taken from: W. Wolf High-Performance Embedded Computing,

https://en.wikipedia.org/wiki/Shared_memory#/media/File:Shared_memory.svg,

https://computing.llnl.gov/tutorials/parallel_comp/#MemoryArch

167

Memory Systems

Memory Systems

• Distributed Memory

• Advantages

• Each processor can rapidly access its

own memory without interference and

without the overhead incurred with

trying to maintain global cache

coherency.

Taken from: W. Wolf High-Performance Embedded Computing,

https://computing.llnl.gov/tutorials/parallel_comp/#MemoryArch

168

Memory Systems

Memory Systems

• Distributed Memory

• Advantages

• Cost effectiveness: can use

commodity, off-the-shelf processors

and networking.

Taken from: W. Wolf High-Performance Embedded Computing,

https://computing.llnl.gov/tutorials/parallel_comp/#MemoryArch

169

Memory Systems

Memory Systems

• Distributed Memory

• Disadvantages

• The programmer is responsible for

many of the details associated with data

communication between processors.

• It may be difficult to map existing data

structures, based on global memory, to

this memory organization.

Taken from: W. Wolf High-Performance Embedded Computing,

https://en.wikipedia.org/wiki/Shared_memory#/media/File:Shared_memory.svg,

https://computing.llnl.gov/tutorials/parallel_comp/#MemoryArch

170

Memory Systems

Memory Systems

• Distributed Memory

• Disadvantages

• Non-uniform memory access times -

data residing on a remote node takes

longer to access than node local data.

Taken from: W. Wolf High-Performance Embedded Computing,

https://computing.llnl.gov/tutorials/parallel_comp/#MemoryArch

171

Memory Systems

Memory Systems

Taken from: W. Wolf High-Performance Embedded Computing

Memory Systems Architecture: shared memory, distributed memory, hybrid memory.

172

Memory Systems

Memory Systems

• Hybrid Memory

• The largest and fastest computers in the

world today employ both shared and

distributed memory architectures.

• The shared memory component can be a

shared memory machine and/or graphics

processing units (GPU).

Taken from: W. Wolf High-Performance Embedded Computing,

https://computing.llnl.gov/tutorials/parallel_comp/#MemoryArch

173

Memory Systems

Memory Systems

• Hybrid Memory

• The distributed memory component is

the networking of multiple shared

memory/GPU machines, which know

only about their own memory - not the

memory on another machine. Therefore,

network communications are required to

move data from one machine to another.

Taken from: W. Wolf High-Performance Embedded Computing,

https://computing.llnl.gov/tutorials/parallel_comp/#MemoryArch

174

Memory Systems

Memory Systems

• Hybrid Memory

• Current trends seem to indicate that this

type of memory architecture will

continue to prevail and increase at the

high end of computing for the

foreseeable future.

Taken from: W. Wolf High-Performance Embedded Computing,

https://computing.llnl.gov/tutorials/parallel_comp/#MemoryArch

175

Memory Systems

Memory Systems

• Hybrid Memory

• Advantages and Disadvantages

• Whatever is common to both shared and

distributed memory architectures.

• Increased scalability is an important

advantage.

• Increased programmer complexity is an

important disadvantage.

Taken from: W. Wolf High-Performance Embedded Computing,

https://computing.llnl.gov/tutorials/parallel_comp/#MemoryArch

176

Memory Systems

Design Memory Systems?

Taken from: W. Wolf High-Performance Embedded Computing,

177

Memory Systems

Design Memory Systems

A simple model of memory components for parallel memory design would include

three major parameters of a memory component of a given size.

• Area: The physical size of the logical component. This is most important in chip design, but it also

relates to cost in board design.

• Performance: The access time of the component. There may be more than one parameter, with

variations for read and write times, page mode accesses, and so on.

• Energy: The energy required per access. If performance is characterized by multiple modes, energy

consumption will exhibit similar modes.

Taken from: W. Wolf High-Performance Embedded Computing,

178

Memory Systems

Design Memory Systems

Taken from: W. Wolf High-Performance Embedded Computing,

179

Memory Systems

Memory Systems

Taken from: https://www.xataka.com/ordenadores/el-cuello-de-botella-de-la-ley-de-moore-no-esta-en-los-procesadores-sino-en-las-memorias

180

Memory Systems

Memory Systems

Taken from: https://www.xataka.com/ordenadores/el-cuello-de-botella-de-la-ley-de-moore-no-esta-en-los-procesadores-sino-en-las-memorias

Outline

181

• Multiprocessors Architecture and Taxonomy

• Parallel Execution Mechanism

• Multiprocessors Design Techniques

• Memory Systems

• Processors Symmetry

• Co-processing

182

Processors Symmetry

Taken from: W. Wolf High-Performance Embedded Computing

Multi-processing: symmetric (SMP), asymmetric (AMP).

183

Processors Symmetry

Taken from: W. Wolf High-Performance Embedded Computing

Multi-processing: symmetric (SMP), asymmetric (AMP).

184

Processors Symmetry

Taken from: M. Aguilar SoCs

Symmetric Multi-processing (SMP)

• A system with multiple processors or cores that communicate through a single shared memory and are controlled by a single operating system.

185

Processors Symmetry

Taken from: https://www.geeksforgeeks.org/what-is-smp-symmetric-multi-processing/

Symmetric Multi-processing (SMP)

• Identical: All the processors are treated equally i.e. all are identical.

• Communication: Shared memory is the mode of communication among

processors.

• Complexity: Complex in design, as all units share the same memory and data bus.

• Expensive: They are costlier in nature.

• Unlike asymmetric multiprocessing, where OS tasks are handled only by the master processor, here the operating system tasks are handled by each processor individually.
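As a small illustration of the "any task on any processor" property, the Linux-specific sketch below (a GNU extension; the core index 1 is an arbitrary choice) pins the calling thread to one core of an SMP system, overriding the scheduler's default freedom to place it anywhere.

/* Pin the calling thread to CPU core 1 on a Linux SMP system.
 * Compile with: cc affinity.c -pthread                          */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>

int main(void) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(1, &set);                       /* core index chosen for the example */

    /* By default the SMP scheduler may run this thread on any core;
     * setting an affinity mask restricts it to the cores in "set".   */
    if (pthread_setaffinity_np(pthread_self(), sizeof(set), &set) != 0)
        fprintf(stderr, "failed to set affinity\n");
    else
        printf("thread pinned to core 1\n");
    return 0;
}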

186

Processors Symmetry

Taken from: https://www.geeksforgeeks.org/what-is-smp-symmetric-multi-processing/

Symmetric Multi-processing (SMP)

• Applications

• This concept finds application in parallel processing, where time-sharing systems (TSS) assign tasks to different processors running in parallel, and also in TSS that use multithreading, i.e., multiple threads running simultaneously.

187

Processors Symmetry

Taken from: https://www.geeksforgeeks.org/what-is-smp-symmetric-multi-processing/

Symmetric Multi-processing (SMP)

• Advantages

• Throughput: Since tasks can be run by any of the processors (unlike in asymmetric systems), the throughput (processes executed per unit time) increases.

• Reliability: A failing processor does not bring down the whole system, since all processors are equally capable, although throughput does drop a little.

188

Processors Symmetry

Taken from: https://www.geeksforgeeks.org/what-is-smp-symmetric-multi-processing/

Symmetric Multi-processing (SMP)

• Disadvantages

• Complex design: Since all the processors are treated equally by the OS, designing and managing such an OS becomes difficult.

• Costlier: As all the processors share the common main memory, a larger memory is required, which makes the system more expensive.

189

Processors Symmetry

Taken from: https://www.enea.com/globalassets/downloads/operating-systems/enea-oseck/enea-smp-platform-for-xilinx-zynq-datasheet.pdf

Symmetric Multi-processing (SMP)

190

Processors Symmetry

Taken from: https://www.enea.com/globalassets/downloads/operating-systems/enea-oseck/enea-smp-platform-for-xilinx-zynq-datasheet.pdf

Symmetric Multi-processing (SMP)

More information here

191

Processors Symmetry

Taken from: W. Wolf High-Performance Embedded Computing

Multi-processing: symmetric (SMP), asymmetric (AMP).

192

Processors Symmetry

Taken from: M. Aguilar SoC Lectures Notes

Asymmetric Multi-processing (AMP)

• A system with multiple processors or cores that communicate through a single shared memory, where each processor or core is controlled by an independent operating system (the same or a different one).

193

Processors Symmetry

Asymmetric Multi-processing (AMP)

• Characteristics

• Processors are not treated equally.

• Tasks of the operating system are done by master processor.

• No direct communication between processors, as they are controlled by the master processor.

• Processors follow a master-slave relationship.

• Systems are cheaper.

• Systems are easier to design.

Taken from: https://www.geeksforgeeks.org/what-is-smp-symmetric-multi-processing/

194

Processors Symmetry

Taken from: https://www.openampproject.org/old_website/docs/mca/BKK19%20OpenAMP%20Introduction.pdf

Asymmetric Multi-processing (AMP)

195

Processors Symmetry

Taken from: https://www.openampproject.org/old_website/docs/mca/BKK19%20OpenAMP%20Introduction.pdf

Asymmetric Multi-processing (AMP)

196

Processors Symmetry

Taken from: https://www.openampproject.org/old_website/docs/mca/BKK19%20OpenAMP%20Introduction.pdf

Asymmetric Multi-processing (AMP)

197

Processors Symmetry

Asymmetric Multi-processing (AMP)

Taken from: https://github.com/OpenAMP/open-amp

198

Processors Symmetry

Asymmetric Multi-processing (AMP)

Taken from: https://github.com/OpenAMP/open-amp

199

Processors Symmetry

Taken from: https://www.openampproject.org/old_website/docs/mca/BKK19%20OpenAMP%20Introduction.pdf

Asymmetric Multi-processing (AMP)

Outline

200

• Multiprocessors Architecture and Taxonomy

• Parallel Execution Mechanism

• Multiprocessors Design Techniques

• Memory Systems

• Processors Symmetry

• Co-processing

201

Co-processing

Taken from: https://www.researchgate.net/publication/250840737_Automatic_Generation_of_Application-

Specific_Architectures_for_Heterogeneous_MPSoC_through_Combination_of_Processors/figures

202

Co-processing

Taken from: https://www.researchgate.net/publication/250840737_Automatic_Generation_of_Application-

Specific_Architectures_for_Heterogeneous_MPSoC_through_Combination_of_Processors/figures

203

Co-processing

Taken from: https://www.researchgate.net/publication/250840737_Automatic_Generation_of_Application-

Specific_Architectures_for_Heterogeneous_MPSoC_through_Combination_of_Processors/figures

204

Co-processing

Taken from: http://www.cecs.uci.edu/~papers/esweek06/codes/p288.pdf

205

Co-processing

Taken from: https://www.researchgate.net/publication/221656884_A_Generic_Wrapper_Architecture_for_Multi-

Processor_SoC_Cosimulation_and_Design/figures?lo=1

206

Co-processing

Taken from: https://link.springer.com/chapter/10.1007/978-3-319-01113-4_1

207

Co-processing

What is a coprocessor?

208

Co-processing

A coprocessor is:

• A computer processor used to supplement the functions of the primary processor.

• Several operations can be performed by the coprocessor, such as:

• Floating Point (FPU).

• Graphics Processing.

• Signal Processing.

• Cryptography.

• Etc.

Taken from: https://youtu.be/xrMUv9ZVKY0

209

Co-processing

A coprocessor is:

• By offloading processor-intensive tasks from the main processor, a coprocessor can accelerate system performance.

• Coprocessors allow a line of computers to be customized, so that customers who

do not need extra performance need not pay for it.

Taken from: https://youtu.be/xrMUv9ZVKY0

210

Co-processing

Functions

• A coprocessor may not be a general-purpose processor.

• Coprocessors cannot fetch instructions from memory, execute program flow control instructions, do input/output operations, manage memory, and so on.

• The coprocessor requires the host (main) processor to fetch the coprocessor

instructions and handle all other operations aside from the coprocessor functions.

• In some architectures the coprocessor is a more general-purpose computer but

carries out only a limited range of functions under the close control of a

supervisory processor.

Taken from: https://youtu.be/xrMUv9ZVKY0

211

Co-processing

Taken from: https://www.doulos.com/knowhow/arm/using_your_c_compiler_to_exploit_neon/Resources/using_your_c_compiler_to_exploit_neon.pdf

Coprocessor

212

Co-processing

NEON Arm

• With the ARMv7-A architecture, ARM introduced a powerful SIMD implementation called NEON™.

• NEON is a coprocessor which comes with its own instruction set for vector

operations.

• Most vector operations carry out the same operation on all elements of their

operand vector(s) in parallel.

• Using your C compiler to exploit NEON™ Advanced SIMD.

Taken from: https://youtu.be/xrMUv9ZVKY0

213

Co-processing

NEON Arm

• The goal of NEON is to provide a powerful, yet comparatively easy to program

SIMD instruction set that covers integer data types of up to 64-bit width as well

as single precision floating point (32 bit).

• It shares its sixteen 128-bit registers with the vector floating-point unit.

• Executed on the same processor core, NEON performance is influenced by

context switching overhead, non-deterministic memory access latency

(cache/MMU access) and interrupt handling.

Taken from: https://youtu.be/xrMUv9ZVKY0
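A small NEON intrinsics sketch in C (the array contents are arbitrary): the same add is applied to four float lanes at once, which is the kind of vector operation described above.

/* NEON intrinsics sketch: add two 4-lane float vectors in one operation.
 * Compile for ARMv7-A with: cc neon_add.c -mfpu=neon                     */
#include <arm_neon.h>
#include <stdio.h>

int main(void) {
    float a[4] = {1.0f, 2.0f, 3.0f, 4.0f};
    float b[4] = {10.0f, 20.0f, 30.0f, 40.0f};
    float c[4];

    float32x4_t va = vld1q_f32(a);       /* load 4 floats into a 128-bit register */
    float32x4_t vb = vld1q_f32(b);
    float32x4_t vc = vaddq_f32(va, vb);  /* one instruction, four additions        */
    vst1q_f32(c, vc);                    /* store the result back to memory        */

    printf("%f %f %f %f\n", c[0], c[1], c[2], c[3]);
    return 0;
}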

214

Co-processing

NEON Arm

Taken from: https://youtu.be/xrMUv9ZVKY0

215

Co-processing

NEON Arm

Taken from: https://youtu.be/xrMUv9ZVKY0

216

Co-processing

NEON Arm

Taken from: https://youtu.be/xrMUv9ZVKY0

217

Co-processing

NEON Arm

Taken from: https://youtu.be/xrMUv9ZVKY0

218

Co-processing

NEON Arm

Taken from: https://youtu.be/xrMUv9ZVKY0

219

Co-processing

NEON Arm

Taken from: https://youtu.be/xrMUv9ZVKY0

220

Co-processing

DSPs

Taken from: Introducción a los Sistemas Empotrados (Introduction to Embedded Systems) Lecture Notes

221

Co-processing

DSPs

Taken from: M. Aguilar SoC Lectures Notes

222

Co-processing

DSPs

Taken from: M. Aguilar SoC Lectures Notes

223

Co-processing

GPU

Taken from: https://www.anandtech.com/show/14101/nvidia-announces-jetson-nano

224

Co-processing

GPU

Taken from: https://www.anandtech.com/show/14101/nvidia-announces-jetson-nano

225

Co-processing

Flight controller UAV

Taken from: https://cdn.sparkfun.com/assets/d/d/9/9/3/Pixhawk4-DataSheet.pdf

226

Co-processing

Flight controller UAV

Taken from: https://cdn.sparkfun.com/assets/d/d/9/9/3/Pixhawk4-DataSheet.pdf

227

References

[1] Lecture Notes, Tecnológico de Costa Rica, SoC Course.

[2] W. Wolf. High-Performance Embedded Computing: Architectures, Applications

and Methodologies. Elsevier, United States of America, 2007.

[3] E. A. Lee and S. A. Seshia. Introduction to Embedded Systems, 2017.

Lectures notes and materials are available in TEC-Digital and web portal

www.ie.tec.ac.cr/sarriola/HPEC

www.ie.tec.ac.cr/joaraya

228