Computer Architecture 1DT016: Multiprocessing and...


Multiprocessing and operating systems from a computer architecture perspective

Computer Architecture

1DT016 distance, Fall 2017

http://xyx.se/1DT016/index.php

Per Foyer

Mail: [email protected]


In this session


Challenges in parallel computing


”Nine women cannot give birth to one child in one month, no matter how hard they try”

The fundamental challenges of parallel computing:

• Not all problems can be parallelized. Some tasks must be executed in sequence.

• Tasks that have parallelizable algorithms are not infinitely scalable

• There is little compiler support for parallel programming

• Some parallel algorithms are plagued with massive load imbalance due to non-uniform data distribution

• Parallel distributed algorithms are not always easy to synchronize and debug
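The first two points can be made concrete with Amdahl's law, a standard model that the slides do not name. The sketch below assumes a task where a fixed fraction of the work is parallelizable; the function name and the 90 % figure are illustrative choices, not values from the course.

```python
def amdahl_speedup(parallel_fraction: float, workers: int) -> float:
    """Upper bound on speedup when only part of the work can run in parallel."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / workers)

# Assumed example: 90 % of the work is parallelizable.
for workers in (1, 2, 8, 64, 1024):
    print(workers, round(amdahl_speedup(0.9, workers), 2))
```

However many workers are added, the speedup in this example stays below 10, because the serial 10 % must still be executed in sequence.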

Embedded systems


• Can perform independent or distributed tasks

• Networking over CAN-bus, I2C and even TCP/IP

• May operate under real-time constraints

• If powerful enough, can be used as very low cost computing nodes in distributed systems such as grids or clusters

System on a Chip (SoC)


BCM 2835 Raspberry Pi SoC

The Raspberry Pi has an intricate boot sequence. Stages one to four are executed by the GPU (!)

Stage 1: The boot code is in the GPU on-chip ROM. It loads Stage 2 into the L2 cache.

Stage 2: bootcode.bin from the SD card. Enables SDRAM and loads Stage 3.

Stage 3: loader.bin. Knows about the .elf format and loads start.elf.

Stage 4: start.elf loads kernel.img and hands it to the ARM CPU.

Stage 5: kernel.img runs on the ARM CPU and loads the OS.

GPU: Graphics Processing Unit

ELF: Executable and Linkable Format

FPGAs and Soft Cores


Field Programmable Gate Array (FPGA)

• LUTs - LookUp Tables (~ truth tables)
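To illustrate why a LUT is roughly a truth table, here is a minimal sketch in which the "configuration" of a two-input LUT is just four stored bits. The helper name `make_lut` and the bit ordering are assumptions made for this example, not how any particular FPGA vendor encodes its LUTs.

```python
def make_lut(truth_table_bits: int, n_inputs: int):
    """Return a function that looks up its output in a stored truth table."""
    def lut(*inputs: int) -> int:
        assert len(inputs) == n_inputs
        index = 0
        for bit in inputs:              # first input is the most significant index bit
            index = (index << 1) | (bit & 1)
        return (truth_table_bits >> index) & 1
    return lut

# Stored bits 0b0110: rows 00->0, 01->1, 10->1, 11->0, which is XOR.
xor_gate = make_lut(0b0110, 2)
# Stored bits 0b1000: only row 11 gives 1, which is AND.
and_gate = make_lut(0b1000, 2)
```

Changing the stored bits changes the logic function without changing the hardware, which is the idea behind reprogramming an FPGA.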

ARM Cortex-M0 processor now available free of charge from ARM Holdings.

…several ARM clones available (OpenCores.org)

FPGA development workflow:

• HDL (Verilog / VHDL)

• Compile

• Synthesize / Verify

• Bitstream

Unique: The Propeller CPU


Round-Robin Scheduler between active Cogs

Boot sequence: x86 / x86_32 / x86_64


[1] 1 MB max. 640 kB DOS. 16-bit instructions, now probably microcoded

[2] 4 GB max. Supervisor/User modes, memory protection, Virtual x86 (16-bit) support

[3] 2^64 ≈ 1.845 x 10^19 B
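The three limits follow directly from the number of address bits. A small sketch of the arithmetic, assuming 20 address bits for the original 16-bit x86 real mode, 32 bits for x86_32 and the full 64 bits for x86_64 (real x86_64 CPUs implement fewer physical address lines):

```python
def address_space_bytes(address_bits: int) -> int:
    """Number of distinct byte addresses reachable with the given address width."""
    return 2 ** address_bits

for name, bits in (("real mode, 20-bit", 20), ("x86_32, 32-bit", 32), ("x86_64, 64-bit", 64)):
    size = address_space_bytes(bits)
    print(f"{name}: {size} B ({size:.3e} B)")
```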

AMD/Intel protection features


To safeguard a multiprocessor environment, both AMD and Intel provide some essential features in hardware:

Function                                Intel   AMD
Virtualization Technology Extensions    VT-x    AMD-V
Physical Address Extension [1]          PAE     PAE
Execution Protection (data) [2]         XD      NX
Streaming SIMD Extension                SSE     SSE

Acronyms:

NX: No eXecute, XD: eXecute Disable

[1] Makes it possible to address more than 4 GB in 32-bit mode. In 32-bit mode, NX/XD needs PAE to be active.

[2] Prevents exploits like executing malicious code in the data area (buffer overflow attacks, malware, …)

Note: x86 is a von Neumann (vN) architecture. A Harvard machine doesn’t need this kind of protection.

Multicore processor boot sequence


[Figure: four cores labelled C0, A1, A2 and U3, connected to Memory]

Booting an operating system from cold up to fully running applications:

Intel model for x86_32 and x86_64:

• C0 performs initial loading from the low-level hardware interface in 16-bit x86 real mode

• C0 switches to protected supervisor mode (x86_64) and loads the operating system

• C0 (the OS) allocates resources for the application cores and starts them

• One or more cores may be allocated for utility processing (U0)

Note: C0 is always the boot processor

Whether it is a Harvard or a von Neumann configuration doesn’t matter. The principles are the same.

Windows task manager


BIOS / UEFI / U-boot


Frankly, a very scary technology (considering the potential security ramifications) included in all modern Intel CPUs

Intel AMT / ME / IE:

• Is independent of the main CPU

• Based on the MINIX operating system [1]

• Executes in Ring -3

• Can access host memory via DMA (with restrictions)

• Dedicated link to the NIC, and its filtering capabilities

• Can force the host OS to reboot at any time (and boot the system from the emulated CDROM)

• Active even in S3 (suspended mode) sleep!

• Exploited at the Black Hat Europe conference on December 6th, 2017

Some hypervisors (e.g. Xen) use Intel VT-d to protect themselves, so that, for example, malicious software is not able to access the hypervisor’s memory. Or so it’s believed…

[1] Professor Andrew S. Tanenbaum, the MINIX OS creator, is very angry about this

Intel AMT / ME / IE


AMT = Active Management Technology

ME = Management Engine

IE = Innovation Engine (whatever that is… undocumented)

Tightly coupled distributed system


[Figure: CPU entities (C) all connected to one shared memory]

C = CPU entity

Multiprocessor (multicore or SMP). Latency: ns

Closely coupled distributed system


[Figure: CPU entities (C), each with local memory (M), connected through an inter-connect]

C = CPU entity

M = Local memory

Multicomputer. Latency: µs

Loosely coupled distributed system


[Figure: complete systems (C+), each with its own memory configuration (M), connected over a wide area network]

C+ = Complete system

M = Memory configuration

Multisystem. Latency: ms

Grid computing


[Figure: a grid controller connected to nodes (C, C+) over a local or wide area network]

• Node availability and capacity are not known or guaranteed beforehand

• Nodes “phone home” to the grid controller

• Nodes may be homogeneous or heterogeneous

Good for tasks that are easy to parallelize or split

Famous grid example: seti@home


Search for Extraterrestrial Intelligence. Active since 1999. Run by UC Berkeley (https://setiathome.berkeley.edu)

Computer clusters


[Figure: a cluster controller and a load balancer in front of the computing nodes, each a complete system (C+) with memory (M)]

Cluster controller

• Uses a cluster-aware OS

Load balancer:

• Passive: Round-robin task distribution

• Active: Measures load on nodes before task distribution

The load balancer may be transparent to the cluster controller
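The difference between a passive and an active load balancer can be sketched as follows. The class names, node names and task costs are made up for illustration; a real balancer would measure load over the network rather than read it from a dictionary.

```python
import itertools

class PassiveBalancer:
    """Round-robin: hands out nodes in a fixed rotation, ignoring their load."""
    def __init__(self, nodes):
        self._rotation = itertools.cycle(nodes)

    def pick(self, load):
        return next(self._rotation)

class ActiveBalancer:
    """Looks at the current load first and picks the least loaded node."""
    def __init__(self, nodes):
        self._nodes = list(nodes)

    def pick(self, load):
        return min(self._nodes, key=lambda node: load[node])

def distribute(balancer, task_costs, nodes):
    """Assign each task to the node chosen by the balancer; return load per node."""
    load = {node: 0 for node in nodes}
    for cost in task_costs:
        load[balancer.pick(load)] += cost
    return load

nodes = ["n1", "n2", "n3"]
tasks = [9, 1, 1, 9, 1, 1]          # assumed task costs, unevenly sized
passive = distribute(PassiveBalancer(nodes), tasks, nodes)
active = distribute(ActiveBalancer(nodes), tasks, nodes)
```

With these assumed costs the passive balancer sends both heavy tasks to the same node, while the active one spreads them out.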

A cluster can be homogeneous (same architecture) or heterogeneous (mixed architecture)

Other connection schemes (1)


Traffic routing between independent nodes in parallel computing is normally not trivial. It may impose a burden on the operating system(s) causing overhead in scheduling due to routing calculations.

Some configurations for sending data from one (independent) node to another:

Ring: Hopsmax = n/2

Complete mesh: Hopsmax = 1

Cube: Hopsmax = log2 n

What happens if one node fails?

F = Frontend processor (FEP)
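The hop counts for ring, complete mesh and cube can be checked by building each topology as a small graph and measuring the longest shortest path with a breadth-first search. This is only a sketch; the function names and the choice of n = 8 are assumptions for the example.

```python
from collections import deque

def max_hops(neighbours: dict[int, list[int]]) -> int:
    """Longest shortest path (in hops) between any two nodes, found by BFS."""
    worst = 0
    for start in neighbours:
        distance = {start: 0}
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for nxt in neighbours[node]:
                if nxt not in distance:
                    distance[nxt] = distance[node] + 1
                    queue.append(nxt)
        worst = max(worst, max(distance.values()))
    return worst

def ring(n):
    return {i: [(i - 1) % n, (i + 1) % n] for i in range(n)}

def complete_mesh(n):
    return {i: [j for j in range(n) if j != i] for i in range(n)}

def cube(dimensions):
    # Nodes are numbered by bit pattern; neighbours differ in exactly one bit.
    return {i: [i ^ (1 << b) for b in range(dimensions)] for i in range(2 ** dimensions)}

print(max_hops(ring(8)), max_hops(complete_mesh(8)), max_hops(cube(3)))
```

For eight nodes this gives n/2 = 4 hops for the ring, 1 for the complete mesh and log2 8 = 3 for the cube.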

Other connection schemes (2)


[Figure: a balanced binary tree with nodes numbered 1 to 7, and a hypercube, each with a frontend processor F]

Balanced binary tree: Hopsmax = 2 * ⌊log2 n⌋

What happens if one node fails?

Hypercube: Hopsmax = ⌊log2 n⌋

F = Frontend processor (FEP)
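One way to explore the question "what happens if one node fails?" is to remove a node and see what is still reachable. The sketch below assumes the seven-node tree has node 4 as root with children 2 and 6 (the slide only shows the numbers), and uses a three-dimensional hypercube for comparison.

```python
from collections import deque

def reachable(neighbours, start, failed=frozenset()):
    """Hop count from start to every node still reachable when `failed` nodes are down."""
    distance = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in neighbours[node]:
            if nxt not in failed and nxt not in distance:
                distance[nxt] = distance[node] + 1
                queue.append(nxt)
    return distance

# Assumed layout: 4 is the root, 2 and 6 its children, odd numbers are leaves.
tree = {4: [2, 6], 2: [4, 1, 3], 6: [4, 5, 7],
        1: [2], 3: [2], 5: [6], 7: [6]}

# Three-dimensional hypercube: neighbours differ in exactly one bit.
hypercube = {i: [i ^ (1 << b) for b in range(3)] for i in range(8)}

tree_hops = reachable(tree, 1)                          # leaf to everything
cube_hops = reachable(hypercube, 0)
tree_after_failure = reachable(tree, 1, failed={4})     # the root fails
cube_after_failure = reachable(hypercube, 0, failed={1})
```

In this example the tree splits in two when the root fails, while the hypercube still reaches every surviving node because it has alternative routes.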

Super computing by architecture


Multicomputing redundancy


[Figure: three computing systems (M), labelled OL, HS1 and HS2, connected to a database (DB), with an intercommunication protocol between the nodes]

OL: On-line

HS: Hot standby

The system consists of one computing system and a database. There are two hot-standby systems ready to take over if the on-line system fails. How is failure determined?

• If OL fails, HS1 immediately takes over control and becomes OL

• In mission-critical systems where a node doesn’t produce the same results as the others, the faulty node will be disconnected and another takes over.
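Detecting a node that "doesn't produce the same results as the others" is commonly done by majority voting. The following is a minimal sketch of that idea, assuming each node reports one comparable result; the function name and the example values are hypothetical.

```python
from collections import Counter

def vote(results: dict[str, int]) -> tuple[int, list[str]]:
    """Return the majority result and the nodes that disagreed with it."""
    majority_value, count = Counter(results.values()).most_common(1)[0]
    if count <= len(results) // 2:
        raise RuntimeError("no majority: cannot tell which node is faulty")
    faulty = [node for node, value in results.items() if value != majority_value]
    return majority_value, faulty

# Assumed example: HS2 computes a different answer than OL and HS1.
value, faulty = vote({"OL": 42, "HS1": 42, "HS2": 41})
```

With three nodes a single faulty node can be outvoted; with only two nodes a disagreement cannot be resolved this way.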

Redundancy design mistake (1)


[Figure: two application nodes (M), OL and HS, connected to a database server (DB)]

Simple heartbeat protocol between application nodes

OL: On-line node

HS: Hot-standby node

DB: Database server

Communication between application nodes and database server is based on TCP/IP

Theory: If one of the application nodes fails, the heartbeat will cease and the other one takes over.

WRONG: There is no MUTEX guarantee here. If the heartbeat line fails but both application nodes are OK, BOTH think their neighbor has failed. The result is a “split brain” disaster where both application nodes access the database and almost certainly destroy data and cause inconsistencies.

MUTEX = MUTual EXclusion

Redundancy design mistake (2)


[Figure: two application nodes (M), OL and HS, connected to a database server (DB)]

Simple heartbeat protocol between application nodes

OL: On-line node

HS: Hot-standby node

DB: Database server

Communication between application nodes and database server is based on TCP/IP

How can the “split brain” problem on the previous slide be resolved?

Use the database disk control hardware AND heartbeat tests between OL, HS and DB to guarantee MUTEX at any one time.
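One possible way to picture this is a lock held on the database side that only one application node can own at a time. The sketch below is an assumption about how such an arbiter could look, not the mechanism used by any particular product; the class name, the lease length and the explicit `now` parameter (used instead of a real clock so the example is deterministic) are all hypothetical.

```python
import threading

class DatabaseArbiter:
    """Hypothetical DB-side lock: grants the on-line role to one node at a time."""
    def __init__(self, lease_seconds: float):
        self._lock = threading.Lock()
        self._lease_seconds = lease_seconds
        self._holder = None
        self._expires_at = 0.0

    def request_online_role(self, node: str, now: float) -> bool:
        with self._lock:                          # one decision at a time
            lease_free = self._holder is None or now >= self._expires_at
            if lease_free or self._holder == node:
                self._holder = node
                self._expires_at = now + self._lease_seconds
                return True
            return False                          # another node is still on-line

arbiter = DatabaseArbiter(lease_seconds=5.0)
ol_ok = arbiter.request_online_role("OL", now=0.0)      # OL starts on-line
# The heartbeat line breaks at t=2: HS believes OL is dead and asks to take over.
hs_split = arbiter.request_online_role("HS", now=2.0)   # refused, OL's lease is valid
ol_renew = arbiter.request_online_role("OL", now=4.0)   # OL renews; lease runs to t=9
# OL really crashes and stops renewing, so HS succeeds once the lease has expired.
hs_takeover = arbiter.request_online_role("HS", now=10.0)
```

Because the decision is made in one place, a broken heartbeat line alone can no longer put both nodes on-line at once.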

Virtual machines (1)


• VM technology allows several virtual machines to run on a single physical machine

• VM is not about simulation. The guest OS must follow the underlying hardware architecture (e.g. Intel x86_64, SPARC, etc.)

• The guest OS has no knowledge that it is executing in a VM

[Figure: layers from bottom to top: Hardware; Virtual Machine Monitor (VMM) / Hypervisor; three VMs; one Guest OS per VM; Apps]

Virtual machines (2)


[Figure: layers from bottom to top: Hardware; Virtual Machine Monitor (VMM) / Hypervisor; three VMs; Guest OS]

VM supplies the guest with complete virtual hardware

VMM optimizes the utilization of the underlying physical hardware

Guest OS uses device drivers that match the virtual hardware

With paravirtualization a VM can execute very close to physical hardware speed. The VMM distributes load over physical hardware CPUs and/or CPU cores.

VMM: XenServer


Uses paravirtualization very close to the physical hardware

Can pre-allocate resources such as memory and CPUs/cores

Completely free at: xenserver.org

Executes directly above the hardware level

XenServer VMM is an OS in itself

VMM: VirtualBox


Completely free at virtualbox.org

Executes within a host OS (Windows, macOS, Linux) with good performance

Operating systems


If there is no support in software for hardware with multiprocessing capabilities, that hardware will be useless!

Programs, Processes and Threads


Program: Binary containing executable code and data segments. Needs an OS to load and run.

Process: Executing entity having its own context (code and resources). Has been scheduled by the OS.

Thread (software): Lightweight process executing in a “host process context”, sharing the host’s resources.

Thread (hardware, Hyper-threading): Presents a number of logical CPUs to the OS. E.g., a hyper-threaded single core appears as two virtual CPUs to the OS.

If one virtual CPU is waiting, the other can borrow its resources. The OS doesn’t know about this. It sees two cores (or more).

Operating system layers


[Figure: operating system layers, from top to bottom]

Programs (Prog): A program may use several interconnected processes

System libraries: Common application high-level routines

APIs: Application to operating system SW interface

OS core services: File systems, timed events, high-level resource management

Kernel: Process scheduler, low-level resource management and protection

Device drivers: Low-level SW to HW interface

Hardware

Operating system execution rings


There’s more to this, depending on hardware…

Ring -1 (minus one):

• (HW) Hypervisor mode

• Can pre-empt ring 0

Ring -2 (minus two):

• (HW) System Management Mode (SMM)

• Can pre-empt ring -1

Ring -3 (x86) (minus three):

• Separate processing unit inside Intel CPUs

• BIG controversy (MINIX)

• Very little is known about this mode

• Intel ME/IE

• THIS IS SCARY !!!

OS: Scheduler


The scheduler becomes more complex for each computing element added: CPU cores, multiple CPUs, distributed nodes

OS: The context switch


The Context switch is the single most time critical part of an operating system

It switches execution context between processes

It has to preserve the CPU registers and other low-level state used by each process

The context switch is very often written in assembler for maximum speed

When switching:

1. Freeze execution of current process

2. Save state for current process (save registers, private stack pointer, …)

3. Load (frozen) state for next process (restore registers, …)

4. Resume execution of next process
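The four switching steps can be imitated at a very high level with Python generators, where the interpreter saves and restores the execution state that a real kernel would handle in assembler. This is only an analogy under that assumption; the toy processes switch voluntarily at each `yield`, whereas a real OS can also pre-empt them.

```python
def process(name, steps):
    """A toy process; every `yield` is a point where it gets frozen."""
    for i in range(1, steps + 1):
        yield f"{name} step {i}"

def run_with_context_switches(processes):
    ready = list(processes)
    trace = []
    while ready:
        current = ready.pop(0)            # load the saved state of the next process
        try:
            trace.append(next(current))   # resume it; it runs until its next yield
        except StopIteration:
            continue                      # the process has finished and is dropped
        ready.append(current)             # its state stays saved in the generator object
    return trace

trace = run_with_context_switches([process("P1", 2), process("P2", 3)])
```

Each process continues exactly where it was frozen, which is the property a context switch has to guarantee.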

In an overloaded system a situation called thrashing may occur:

The number of context switches per time unit is so high that the operating system spends more time on switching context than on executing processes.

OS process states


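[Figure: process state diagram with the following transitions]

New → Ready: Admitted

Ready → Running: Scheduler dispatch

Running → Ready: Interrupt

Running → Waiting: I/O or event wait

Waiting → Ready: I/O or event completion

Running → Terminated: Exit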
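The process states and their transitions can be written down as a small table-driven state machine. The sketch assumes the usual five-state reading of the diagram; the event strings are simply the labels from the figure, and the helper names are made up for the example.

```python
TRANSITIONS = {
    ("New", "admitted"): "Ready",
    ("Ready", "scheduler dispatch"): "Running",
    ("Running", "interrupt"): "Ready",
    ("Running", "I/O or event wait"): "Waiting",
    ("Waiting", "I/O or event completion"): "Ready",
    ("Running", "exit"): "Terminated",
}

def step(state: str, event: str) -> str:
    """Return the next state, or raise if the event is not allowed in this state."""
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise ValueError(f"{event!r} is not allowed in state {state!r}") from None

def run(events, state="New"):
    """Apply a sequence of events and return every state that was visited."""
    history = [state]
    for event in events:
        state = step(state, event)
        history.append(state)
    return history

history = run(["admitted", "scheduler dispatch", "I/O or event wait",
               "I/O or event completion", "scheduler dispatch", "exit"])
```

Note that a waiting process never goes straight back to Running: it must first become Ready and be dispatched by the scheduler.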

Scheduling


Assume processes P1, P2 and P3 and one time frame

Round Robin (with pre-emption)

[Figure: timeline in which P1, P2 and P3 each receive an equal slice of time frame t1, and again of t2]

Execution time for each process is always 1/3 * t. Simple, but wasteful if some processes are waiting and don’t need to be scheduled.

Priority scheduling

[Figure: timeline with P1, P2 and P3 in time frame t1 and only P1 and P2 in t2; P3 enters wait for a resource, or exits, at t2]

If P2 has higher priority than P1, P2 can be given more execution time in the next time frame.
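A round-robin scheduler with pre-emption can be sketched in a few lines. This version is an assumption-laden simplification: it drops a process from the ready queue as soon as its work is done (avoiding the waste mentioned above), and the burst times and quantum are invented for the example.

```python
from collections import deque

def round_robin(burst_times: dict[str, int], quantum: int) -> list[str]:
    """Return the order in which processes receive time slices."""
    remaining = dict(burst_times)
    ready = deque(burst_times)
    timeline = []
    while ready:
        name = ready.popleft()
        timeline.append(name)
        remaining[name] -= min(quantum, remaining[name])
        if remaining[name] > 0:       # pre-empted: goes to the back of the queue
            ready.append(name)
    return timeline

# Assumed workload: P3 only needs one time slice, then exits.
timeline = round_robin({"P1": 3, "P2": 3, "P3": 1}, quantum=1)
```

After P3 exits, the remaining slices alternate between P1 and P2 only.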

OS: pipes


A pipe is a mechanism that allows for asynchronous communication between two processes (bi-directional when a pair of pipes or a duplex pipe is used)

Pipe operation is controlled by the OS scheduler. Data can only flow when a process is in its running state.

Pipes are mainly used where latency is low, e.g. in tightly coupled systems
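As a small illustration of two processes talking through pipes, the sketch below starts a child Python process and connects to it with one pipe in each direction using the standard `subprocess` module. The child's one-line program and the function name `ask_child` are assumptions made for the example.

```python
import subprocess
import sys

# The child process reads everything from its input pipe and writes the
# upper-cased text to its output pipe.
CHILD_CODE = "import sys; sys.stdout.write(sys.stdin.read().upper())"

def ask_child(message: str) -> str:
    completed = subprocess.run(
        [sys.executable, "-c", CHILD_CODE],
        input=message,          # parent -> child through one pipe
        capture_output=True,    # child -> parent through another pipe
        text=True,
        check=True,
    )
    return completed.stdout

reply = ask_child("hello through a pipe")
```

The parent never shares memory with the child; all data passes through the pipes, which the operating system buffers and schedules.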

OS: Semaphores


A semaphore is an operating system mechanism that is used to protect a shared resource.

The resource can be shared by two or more processes.

The OS guarantees that one, and only one, process can access the shared resource at any one time (MUTEX).

MUTEX stands for MUTual EXclusion

Single-core processor systems sometimes use spin-locks while waiting for MUTEX.
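Mutual exclusion with a semaphore can be demonstrated with a shared counter. The sketch uses threads inside one Python process rather than separate OS processes, which is an assumption made to keep the example self-contained; the principle of guarding the shared resource is the same.

```python
import threading

shared = {"counter": 0}                 # the shared resource
semaphore = threading.Semaphore(1)      # one permit: behaves as a MUTEX

def worker(increments: int) -> None:
    for _ in range(increments):
        with semaphore:                 # only one thread at a time gets past here
            shared["counter"] += 1

threads = [threading.Thread(target=worker, args=(10_000,)) for _ in range(4)]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()
```

Because every update happens while holding the single permit, no increment is lost and the counter ends at 4 x 10 000.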

OS: Deadlocks


A deadlock can occur if two processes are waiting for each other or if several processes are in a circular wait.

It may also happen if one process holding a shared resource stops or dies

[Figure: P1 has one resource and waits for the other; P2 has that other resource and waits for the first]

P: Process

R: Resource

Ways for a kernel to break a deadlock:

• Forced process pre-emption and rescheduling

• Process termination

• Forced resource release
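Before a kernel can break a deadlock it has to find one. A circular wait can be detected by following "waits for" and "has" relations until the chain returns to its starting process. The sketch below assumes each process waits for at most one resource and each resource has at most one owner; the assignment of R1 to P1 and R2 to P2 is an assumed reading of the figure.

```python
def find_deadlock(holds: dict[str, str], waits_for: dict[str, str]) -> list[str]:
    """Return the processes in a circular wait, or [] if there is none.

    holds maps resource -> owning process; waits_for maps process -> resource.
    """
    for start in waits_for:
        chain = [start]
        current = start
        while current in waits_for:
            owner = holds.get(waits_for[current])
            if owner is None:
                break                   # the resource is free: no deadlock on this chain
            if owner == start:
                return chain            # back at the starting process: circular wait
            if owner in chain:
                break                   # a cycle that does not include `start`
            chain.append(owner)
            current = owner
    return []

holds = {"R1": "P1", "R2": "P2"}        # P1 has R1, P2 has R2
waits_for = {"P1": "R2", "P2": "R1"}    # each waits for the other's resource
deadlocked = find_deadlock(holds, waits_for)
```

Any of the listed remedies works by removing one link from the detected cycle.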

OS: Message queues (Mailboxes)


Message queues are used for interprocess communication.

Processes can set message priorities, which are handled by the OS

The OS guarantees MUTEX on queues

[Figure: a client and a server exchanging messages through a queue]
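A prioritized mailbox can be sketched with Python's standard `queue.PriorityQueue`, which handles its own locking. The example runs in one process for simplicity, which is an assumption; an OS message queue would sit between separate processes. The priority values and message texts are invented.

```python
import queue

mailbox = queue.PriorityQueue()     # thread-safe: locking is handled internally

# Lower number = higher priority. The sequence number keeps first-in,
# first-out order among messages of equal priority.
messages = [
    (5, "log: client connected"),
    (1, "alarm: disk failure"),
    (5, "log: request served"),
]
for sequence, (priority, text) in enumerate(messages):
    mailbox.put((priority, sequence, text))

received = []
while not mailbox.empty():
    priority, sequence, text = mailbox.get()
    received.append(text)
```

The alarm is delivered first even though it was sent second, while the two log messages keep their original order.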


That’s it! The show is over! ;-)
