Computer Architecture 1DT016: Multiprocessing and...
[email protected] 2017 1
Multiprocessing and
Operating systems from a Computer architecture perspective
Computer Architecture
1DT016 distance, Fall 2017
http://xyx.se/1DT016/index.php
Per Foyer
Mail: [email protected]
In this session
Challenges in parallel computing
“Nine women cannot give birth to one child in one month, no matter how hard they try”
The fundamental challenges of parallel computing:
• Not all problems can be parallelized. Some tasks must be executed in sequence.
• Tasks that have parallelizable algorithms are not infinitely scalable
• There is little compiler support for parallel programming
• Some parallel algorithms are plagued with massive load imbalance due to non-uniform data distribution
• Parallel distributed algorithms are not always easy to synchronize and debug
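One common way to quantify the first two points is Amdahl's law, which is not named on the slide and is added here only as an illustration: if a fraction p of the work can be parallelized over n processors, the speedup is 1 / ((1 - p) + p / n). A minimal sketch; the function name and the sample values are assumptions.

```python
def amdahl_speedup(parallel_fraction: float, processors: int) -> float:
    """Theoretical speedup when only part of a task can run in parallel."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / processors)


if __name__ == "__main__":
    # Assumed example: 90 % of the work is parallelizable
    for n in (1, 2, 8, 64, 1024):
        print(f"{n:5d} processors: speedup {amdahl_speedup(0.9, n):.2f}")
```

With 90 % parallelizable work the speedup stays below 10 no matter how many processors are added, because the sequential 10 % always remains.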
Embedded systems
• Can perform independent or distributed tasks
• Networking over CAN-bus, I2C and even TCP/IP
• May operate under real-time constraints
• If powerful enough, can be used as very low cost computing nodes in distributed systems such as grids or clusters
System on a Chip (SoC)
BCM 2835 Raspberry Pi SoC
The Raspberry Pi has an intricate boot sequence. Stages one to four are executed by the GPU (!)
Stage 1: Boot code is in the GPU on-chip ROM. Loads Stage 2 into the L2 cache
Stage 2: bootcode.bin from the SD card. Enables SDRAM and loads Stage 3
Stage 3: loader.bin. Knows about the .elf format and loads start.elf
Stage 4: start.elf loads kernel.img firmware into ARM CPU.
Stage 5: kernel.img is run on the ARM, which loads the OS
GPU: Graphics Processing Unit
ELF: Executable and Linkable Format
FPGAs and Soft Cores
Field Programmable Gate Array
• LUTs - LookUp Tables (~Truth tables)
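To make the "LUT is roughly a truth table" remark concrete, here is a minimal software sketch of a 2-input lookup table. The class name and the bit ordering are assumptions made for illustration; real FPGA LUTs usually have more inputs and are configured by the bitstream.

```python
class LUT2:
    """A 2-input lookup table: four stored bits decide the output."""

    def __init__(self, truth_table):
        # truth_table[i] is the output for inputs (a, b) with i = a*2 + b
        if len(truth_table) != 4:
            raise ValueError("a 2-input LUT needs exactly 4 entries")
        self.bits = list(truth_table)

    def __call__(self, a, b):
        # The inputs only form an address; no logic gates are evaluated
        return self.bits[(a << 1) | b]


# The same hardware element becomes a different gate by changing its contents
AND = LUT2([0, 0, 0, 1])
XOR = LUT2([0, 1, 1, 0])
```

Reprogramming the FPGA amounts to loading new table contents, which is why one fabric can implement arbitrary logic.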
ARM Cortex-M0 processor now available free of charge from ARM Holdings
… several ARM clones available (OpenCores.org)
FPGA development workflow
• HDL (Verilog / VHDL)
• Compile
• Synthesize / Verify
• Bitstream
Boot sequence: x86 / x86_32 / x86_64
[1] 1 MB max. 640 kB DOS. 16-bit instructions now probably microcoded
[2] 4 GB max. Supervisor/User modes, memory protection, Virtual x86 (16-bit) support
[3] 2^64 ≈ 1.845 x 10^19 B
AMD/Intel protection features
To safeguard a multiprocessor environment, both AMD and Intel provide some essential features in hardware:
Function                                Intel   AMD
Virtualization Technology Extensions    VT-x    AMD-V
Physical Address Extension [1]          PAE     PAE
Execution Protection (data) [2]         XD      NX
Streaming SIMD Extensions               SSE     SSE
Acronyms:
NX: No eXecute, XD: eXecution Disable
[1] Makes it possible to address more than 4 GB in 32-bit mode. Needs NX/XD to be active
[2] Prevents exploits like executing malicious code in the data area (buffer overflow attacks, malware, …)
Note: x86 is a von Neumann (vN) architecture. A Harvard machine doesn’t need this kind of protection.
Multicore processor boot sequence
[Figure: four cores labelled C0, A1, A2 and U3, and Memory]
Booting an operating system from cold up to fully running applications:
Intel model for x86_32 and x86_64:
• C0 performs initial loading from the low-level hardware interface in 16-bit x86 real mode
• C0 switches to protected supervisor mode x86_64 and loads the operating system
• C0 (the OS) allocates resources for the application cores and starts them
• One or more cores may be allocated for utility processing (U0)
Note: C0 is always the boot processor
Whether it’s a Harvard or von Neumann configuration doesn’t matter. The principles are the same.
Windows task manager
BIOS / UEFI / U-boot
Frankly a very scary technology (when looking at the potential security ramifications) included in all modern Intel CPUs
Intel AMT / ME / IE:
• Is independent of main CPU
• Based on the MINIX operating system [1]
• Executes in Ring -3
• Can access host memory via DMA (with restrictions)
• Dedicated link to NIC, and its filtering capabilities
• Can force host OS to reboot at any time (and boot the system from the emulated CDROM)
• Active even in S3 (suspended mode) sleep!
• Exploited at Black Hat Europe conference on December 6th, 2017
Some hypervisors (e.g. Xen) use Intel VT-d to protect themselves, so that malicious software, for example, is not able to access the memory of such hypervisors. Or so it’s believed…
[1] Professor Andrew S. Tanenbaum, the MINIX OS creator, is very angry about this
Intel AMT / ME / IE
AMT = Active Management Technology
ME = Management Engine
IE = Innovation Engine (whatever that is… undocumented)
Tightly coupled distributed system
[Figure: CPU entities (C) connected to a shared memory]
C = CPU entity
Multiprocessor (multicore or SMP)
Latency: ns
Closely coupled distributed system
[Figure: CPU entities (C), each with local memory (M), joined by an interconnect]
C = CPU entity
M = Local memory
Multicomputer
Latency: µs
Loosely coupled distributed system
[Figure: complete systems (C+), each with its own memory configuration (M), connected over a wide area network]
C+ = Complete system
M = Memory configuration
Multisystem
Latency: ms
Grid computing
[Figure: a grid controller and nodes (C, C+) connected over a local or wide area network]
• Node availability and capacity are not known or guaranteed beforehand
• Nodes “phone home” to grid controller
• Nodes may be homogeneous or heterogeneous
Good for tasks that are easy to parallelize or split
Famous grid example: SETI@home
Search for Extraterrestrial Intelligence
Active since 1999. Driven by UC Berkeley (https://setiathome.berkeley.edu)
Computer clusters
[Figure: cluster controller, load balancer and computing nodes (complete systems C+ with memory M)]
Cluster controller
• Uses cluster aware OS
Load balancer:
Passive: Round-robin task distribution
Active: Measures load on nodes before task distribution
The load balancer may be transparent to the cluster controller
A cluster can be homogeneous (same architecture) or heterogeneous (mixed architecture)
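The two load balancer styles on this slide can be sketched as follows. This is a minimal illustration, not a real cluster API: the class names are hypothetical, and "load" is assumed to be a simple count of outstanding tasks per node.

```python
from itertools import cycle


class PassiveBalancer:
    """Round-robin: hands out nodes in a fixed rotation, ignoring their load."""

    def __init__(self, nodes):
        self._rotation = cycle(nodes)

    def pick(self):
        return next(self._rotation)


class ActiveBalancer:
    """Measures load first and picks the least loaded node."""

    def __init__(self, load_by_node):
        self.load = dict(load_by_node)  # node name -> outstanding tasks (assumed metric)

    def pick(self):
        node = min(self.load, key=self.load.get)
        self.load[node] += 1  # the chosen node now has one more task
        return node
```

The passive variant needs no feedback from the nodes, while the active variant requires the nodes to report their load before each distribution decision.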
Other connection schemes (1)
Traffic routing between independent nodes in parallel computing is normally not trivial. It may impose a burden on the operating system(s) causing overhead in scheduling due to routing calculations.
Some configurations for sending data from one (independent) node to another:
Ring: Hops_max = n/2
Complete mesh: Hops_max = 1
Cube: Hops_max = log2 n
What happens if one node fails?
F = Frontend processor (FEP)
Other connection schemes (2)
[Figure: balanced binary tree with nodes numbered 1 to 7]
Balanced binary tree: Hops_max = 2 * ⌊log2 n⌋
What happens if one node fails?
Hypercube: Hops_max = ⌊log2 n⌋
F = Frontend processor (FEP)
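The maximum hop counts quoted for the connection schemes can be checked by building each topology and measuring it with a breadth-first search. A minimal sketch; the builder functions, the node numbering and the assumption that the hypercube size is a power of two are all choices made for this illustration.

```python
from collections import deque
from math import log2


def max_hops(neighbors):
    """Largest shortest-path hop count between any two nodes (BFS from every node)."""
    worst = 0
    for start in neighbors:
        dist = {start: 0}
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for nxt in neighbors[node]:
                if nxt not in dist:
                    dist[nxt] = dist[node] + 1
                    queue.append(nxt)
        worst = max(worst, max(dist.values()))
    return worst


def ring(n):
    return {i: [(i - 1) % n, (i + 1) % n] for i in range(n)}


def complete_mesh(n):
    return {i: [j for j in range(n) if j != i] for i in range(n)}


def hypercube(n):
    dims = int(log2(n))  # n is assumed to be a power of two
    return {i: [i ^ (1 << d) for d in range(dims)] for i in range(n)}


def balanced_binary_tree(n):
    # Heap-style numbering (assumed): node i has children 2i+1 and 2i+2
    neighbors = {i: [] for i in range(n)}
    for child in range(1, n):
        parent = (child - 1) // 2
        neighbors[child].append(parent)
        neighbors[parent].append(child)
    return neighbors
```

For example, `max_hops(ring(8))` gives 4 (n/2) and `max_hops(balanced_binary_tree(7))` gives 4 (2 * ⌊log2 7⌋). Removing a node from the dictionaries before measuring is one way to explore the "what happens if one node fails?" question.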
Supercomputing by architecture
Multicomputing redundancy
[Figure: nodes OL, HS1 and HS2, each with memory (M), a database (DB), and an intercommunication protocol between nodes]
OL: On-line
HS: Hot standby
The system consists of one computing system and a database. There are two hot-standby systems ready to take over if the on-line system fails. How is failure determined?
• If OL fails, HS1 immediately takes over control and becomes OL
• In mission critical systems where a node doesn’t produce the same results as the others, the faulty node will be disconnected and another takes over.
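One possible way to decide which node "doesn't produce the same results as the others" is a majority vote over the node outputs. The slide does not say how the comparison is made, so the voting scheme and the function name below are assumptions.

```python
from collections import Counter


def find_faulty_nodes(results):
    """Return (majority_value, list of nodes that disagree with the majority)."""
    majority_value, votes = Counter(results.values()).most_common(1)[0]
    if votes <= len(results) // 2:
        # Without a strict majority there is no way to tell who is wrong
        raise ValueError("no majority: cannot tell which node is faulty")
    faulty = [node for node, value in results.items() if value != majority_value]
    return majority_value, faulty
```

With three nodes, `find_faulty_nodes({"OL": 42, "HS1": 42, "HS2": 41})` singles out HS2. With only two nodes a disagreement cannot be resolved, which is one reason such systems use at least three.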
Redundancy design mistake (1)
[Figure: application nodes OL and HS, each with memory (M), connected to the database server (DB)]
Simple heartbeat protocol between application nodes
OL: On-line node
HS: Hot-standby node
DB: Database server
Communication between application nodes and database server is based on TCP/IP
Theory: If one of the application nodes fails, the heartbeat will cease and the other one takes over.
WRONG: There is no MUTEX guarantee here. If the heartbeat line fails but both application nodes are OK, BOTH think their neighbor has failed. The result is a “split brain” disaster where both application nodes access the database and almost certainly destroy data and cause inconsistencies.
MUTEX = MUTual EXclusion
Redundancy design mistake (2)
[Figure: application nodes OL and HS, each with memory (M), connected to the database server (DB)]
Simple heartbeat protocol between application nodes
OL: On-line node
HS: Hot-standby node
DB: Database server
Communication between application nodes and database server is based on TCP/IP
How can the “split brain” problem on the previous slide be resolved?
Use the database disk control hardware AND heartbeat tests between OL, HS and DB to guarantee MUTEX at any one time.
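The core idea, that a single point at the database side decides who is on-line, can be sketched in a few lines. This is only a software model under stated assumptions: `DatabaseArbiter` is a hypothetical stand-in for the disk control hardware, and the heartbeat tests themselves are left out.

```python
import threading


class DatabaseArbiter:
    """Hypothetical stand-in for the DB-side control that grants the on-line role."""

    def __init__(self):
        self._lock = threading.Lock()
        self.owner = None

    def try_become_online(self, node):
        # Atomic test-and-set: at most one node can hold the role at any time
        with self._lock:
            if self.owner is None or self.owner == node:
                self.owner = node
                return True
            return False

    def release(self, node):
        with self._lock:
            if self.owner == node:
                self.owner = None
```

If the heartbeat line fails and both nodes try to take over, only the first call to `try_become_online` succeeds, so the two nodes never write to the database at the same time.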
Virtual machines (1)
• VM technology allows virtual machines to run on a single physical machine
• VM is not about simulation. The guest OS must follow the underlying hardware architecture (e.g. Intel x86_64, SPARC, etc.)
• The guest OS is not aware that it is executing in a VM
[Figure: layer stack, bottom to top: Hardware; Virtual Machine Monitor (VMM) / Hypervisor; VMs; a Guest OS in each VM; Apps on each Guest OS]
Virtual machines (2)
[Figure: layer stack, bottom to top: Hardware; Virtual Machine Monitor (VMM) / Hypervisor; VMs; Guest OS]
VM supplies guest with complete virtual hardware
VMM optimizes the utilization of the underlying physical hardware
Guest OS uses device drivers that match the virtual hardware
With paravirtualization a VM can execute very close to physical hardware speed.
The VMM distributes load over physical hardware CPUs and/or CPU cores
VMM: XenServer
Uses paravirtualization very close to the physical hardware
Can pre-allocate resources such as memory and CPUs/cores
Completely free at: xenserver.org
Executes directly above the hardware level
XenServer VMM is an OS in itself
VMM: VirtualBox
Completely free at virtualbox.org
Executes within a host OS (Windows, macOS, Linux) with good performance
Operating systems
If there is no support in software for hardware with multiprocessing capabilities, that hardware will be useless!
Programs, Processes and Threads
Program: Binary containing executable code and data segments. Needs an OS to load and run.
Process: Executing entity having its own context (code and resources). Has been scheduled by the OS.
Thread (software): Lightweight process executing in a “host process context”, sharing the host resources.
Thread (hardware, Hyper-threading): Presents a number of logical CPUs to the OS. E.g., a hyper-threaded single core appears as two virtual CPUs to the OS.
If one virtual CPU is waiting, the other can borrow its resources. The OS doesn’t know about this. It sees two cores (or more).
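The statement that software threads share the resources of their host process can be seen directly: all threads below append to the same list object. A minimal sketch; the names and counts are assumptions.

```python
import threading

shared_log = []               # lives in the host process, visible to all its threads
log_lock = threading.Lock()   # protects the shared list


def worker(name, count):
    for i in range(count):
        with log_lock:
            shared_log.append((name, i))


def run_workers(thread_count=4, items_per_thread=3):
    threads = [
        threading.Thread(target=worker, args=(f"T{n}", items_per_thread))
        for n in range(thread_count)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return len(shared_log)
```

Separate processes would each get their own copy of `shared_log`; threads do not, which is why they are cheap to create but need synchronization.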
Operating system layers
Layer (top to bottom)    Role
Programs (Prog)          A program may use several interconnected processes
System libraries         Common application high-level routines
APIs                     Application to operating system SW interface
OS core services         File systems, timed events, high-level resource management
Kernel                   Process scheduler, low-level resource management and protection
Device drivers           Low-level SW to HW interface
Hardware
Operating system execution rings
There’s more to this… depending on hardware
Ring -1 (minus one):
• (HW) Hypervisor mode
• Can pre-empt ring 0
Ring -2 (minus two):
• (HW) System Management Mode (SMM)
• Can pre-empt ring -1
Ring -3 (x86) (minus three):
• Separate processing unit inside Intel CPUs
• BIG controversy (MINIX)
• Very little is known about this mode
• Intel ME/IE
• THIS IS SCARY !!!
OS: Scheduler
The scheduler becomes more complex for each computing element added: CPU cores, multiple CPUs, distributed nodes
OS: The context switch
The Context switch is the single most time critical part of an operating system
It switches execution context between processes
It has to protect CPU-registers etc that are used by processes on a low level
The context switch is very often written in assembler for maximum speed
When switching:
1. Freeze execution of current process
2. Save state for current process (save registers, private stack pointer, …)
3. Load (frozen) state for next process (restore registers, …)
4. Resume execution of next process
In an overly loaded system a situation called thrashing may occur:
The number of context switches per time unit is so high that the operating system spends more time on switching context than executing processes.
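The four switching steps can be modelled in software. This is only a toy model under loud assumptions: a real context switch is done by the kernel, usually in assembler, and the register set shown here (pc, sp, acc) is invented for the illustration.

```python
class CPU:
    def __init__(self):
        # Assumed, simplified register set
        self.registers = {"pc": 0, "sp": 0, "acc": 0}


class Process:
    def __init__(self, name, pc=0, sp=0):
        self.name = name
        # Frozen state, kept while the process is not running
        self.saved = {"pc": pc, "sp": sp, "acc": 0}


def context_switch(cpu, current, nxt):
    # Steps 1-2: freeze the current process and save its state
    current.saved = dict(cpu.registers)
    # Step 3: load the frozen state of the next process
    cpu.registers = dict(nxt.saved)
    # Step 4: execution resumes at the restored program counter
    return nxt
```

Every call copies the full register set twice, which hints at why a system that switches too often ends up thrashing.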
OS process states
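States: New, Ready, Running, Waiting, Terminated
New → Ready: Admitted
Ready → Running: Scheduler dispatch
Running → Ready: Interrupt
Running → Waiting: I/O or event wait
Waiting → Ready: I/O or event completion
Running → Terminated: Exit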
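The state diagram on this slide can be read as a small state machine. The transition table below is one interpretation of the diagram labels (the usual five-state process model), and the function names are assumptions.

```python
TRANSITIONS = {
    ("New", "admitted"): "Ready",
    ("Ready", "scheduler dispatch"): "Running",
    ("Running", "interrupt"): "Ready",
    ("Running", "I/O or event wait"): "Waiting",
    ("Waiting", "I/O or event completion"): "Ready",
    ("Running", "exit"): "Terminated",
}


def step(state, event):
    """Return the next state, or raise if the event is not allowed in this state."""
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise ValueError(f"illegal transition: {event!r} in state {state}") from None


def run(events, state="New"):
    for event in events:
        state = step(state, event)
    return state
```

Note that a waiting process never goes straight back to Running: it must first become Ready and then be dispatched by the scheduler.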
Scheduling
Assume processes P1, P2 and P3 and one time frame

Round Robin (with pre-emption)
[Timeline from t1: P1 P2 P3 P1 P2 P3]
Execution time for each process is always 1/3 * t
Simple but wasteful if some processes are in wait and don’t need to be scheduled

Priority scheduling
[Timeline from t1 to t2: P1 P2 P3 P1 P2. P3 enters wait for a resource, or exits, at t2]
If P2 has higher priority than P1, P2 can be given more execution time in the next time frame
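Round-robin with pre-emption can be simulated with a simple ready queue. A minimal sketch; the burst times and the quantum are assumed values, and waiting for resources is not modelled (a process only leaves the queue when it has finished).

```python
from collections import deque


def round_robin(burst_times, quantum):
    """Return the order in which processes get the CPU, one entry per time slice."""
    ready = deque(burst_times.items())
    timeline = []
    while ready:
        name, remaining = ready.popleft()
        timeline.append(name)
        remaining -= quantum
        if remaining > 0:
            # Pre-empted: go back to the end of the ready queue
            ready.append((name, remaining))
    return timeline
```

`round_robin({"P1": 2, "P2": 2, "P3": 1}, quantum=1)` yields P1, P2, P3, P1, P2: once P3 has exited, its slice is no longer handed out.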
OS: pipes
A pipe is a mechanism that allows for bi-directional asynchronous communication between two processes
Pipe operation is controlled by the OS scheduler. Data can only flow when a process is in its running state.
Pipes are mainly used where latency is low, e.g. in tightly coupled systems
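A bi-directional pipe can be tried out with Python's `multiprocessing.Pipe`, which is duplex by default. To keep the sketch self-contained, a thread stands in for the second process; that substitution and the function names are assumptions (the same connection objects can be handed to a `multiprocessing.Process` instead).

```python
import threading
from multiprocessing import Pipe


def echo_upper(conn):
    # Stand-in for the second process: read one message, answer on the same pipe
    message = conn.recv()
    conn.send(message.upper())
    conn.close()


def demo():
    parent_end, child_end = Pipe()  # duplex (bi-directional) by default
    peer = threading.Thread(target=echo_upper, args=(child_end,))
    peer.start()
    parent_end.send("hello through the pipe")  # data flows one way...
    reply = parent_end.recv()                  # ...and back the other way
    peer.join()
    return reply
```

Both `recv` calls block until data arrives, so the receiving side simply waits until the sender has been scheduled and has written its message.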
OS: Semaphores
A semaphore is an operating system mechanism that is used to protect a shared resource.
The resource can be shared by two or more processes.
The OS guarantees that one, and only one, process can access the shared resource at any one time (MUTEX).
MUTEX stands for MUTual EXclusion
Single-core processor systems sometimes use spin-locks while waiting for MUTEX.
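A binary semaphore guarding a shared resource can be sketched with Python's `threading.Semaphore`. Threads are used instead of processes to keep the example short; the shared "resource" (a counter) and all names are assumptions.

```python
import threading

mutex = threading.Semaphore(1)  # binary semaphore: at most one holder at a time
balance = 0                     # the shared resource


def deposit(times):
    global balance
    for _ in range(times):
        with mutex:             # acquire before, release after touching the resource
            balance += 1


def run_deposits(workers=4, times=10_000):
    global balance
    balance = 0
    threads = [threading.Thread(target=deposit, args=(times,)) for _ in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return balance
```

Because every update happens while the semaphore is held, no increment is lost and the final balance equals workers * times.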
OS: Deadlocks
A deadlock can occur if two processes are waiting for each other or if several processes are in a circular wait.
It may also happen if one process holding a shared resource stops or dies
[Figure: P1 and P2 each hold (“Has”) one of the resources R1 and R2 and wait for (“Waits for”) the resource held by the other]
P: Process
R: Resource
Ways for a kernel to break a deadlock:
• Forced process pre-emption and rescheduling
• Process termination
• Force resource release
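Before a kernel can break a deadlock it has to find it. One common approach, not described on the slide and shown here only as an illustration, is to build a wait-for graph from the "has" and "waits for" relations and look for a cycle. The function names and the one-resource-per-process simplification are assumptions.

```python
def wait_for_graph(holds, wants):
    """Map each waiting process to the process holding the resource it wants."""
    owner = {resource: process for process, resource in holds.items()}
    return {proc: owner[res] for proc, res in wants.items() if res in owner}


def find_deadlock(waits_for):
    """Return a list of processes forming a circular wait, or None."""
    for start in waits_for:
        path, node = [], start
        while node in waits_for and node not in path:
            path.append(node)
            node = waits_for[node]
        if node in path:
            # We walked back onto the path: everything from there on is the cycle
            return path[path.index(node):]
    return None
```

For the situation in the figure, with P1 holding R1 and waiting for R2 while P2 holds R2 and waits for R1, the graph is P1 → P2 → P1 and `find_deadlock` reports both processes.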
OS: Message queues (Mailboxes)
Message queues are used for interprocess communication.
Processes can set message priorities, which are handled by the OS
The OS guarantees MUTEX on queues
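A prioritized mailbox can be sketched with Python's thread-safe `queue.PriorityQueue`, which does its own locking internally. The `Mailbox` class, the "lower number means higher priority" convention and the default priority are assumptions made for the illustration.

```python
import itertools
import queue


class Mailbox:
    """Priority message queue; a lower number means a higher priority."""

    def __init__(self):
        self._queue = queue.PriorityQueue()  # thread-safe: locking is internal
        self._order = itertools.count()      # tie-breaker keeps FIFO order per priority

    def send(self, message, priority=10):
        self._queue.put((priority, next(self._order), message))

    def receive(self):
        _, _, message = self._queue.get()    # blocks until a message is available
        return message
```

Messages are delivered by priority rather than arrival order, so an urgent message sent last can still be received first.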
Client Server
That's it! The music is over! ;-)