Multi-core systems System Architecture COMP25212

Multi-core systems

System Architecture COMP25212

Daniel Goodman, Advanced Processor Technologies Group

Applications on Multi-cores

Processes – operating system level processes e.g. separate applications – in many cases do not share any data – separate virtual memory spaces

Threads – parallel parts of the same application sharing the same memory – this is where the problems lie – assume we are talking about threads

An Example

// Initially a = 0 & b = 0

    Core 0                  Core 4
    while (a == 0) {        a = 1
    }                       while (b == 0) {
    b = 1                   }

Will this always terminate?

Why? Registers, Reorder Buffers, out-of-order execution

Memory Coherency/Consistency

Coherency: Hardware ensuring that all copies of the same memory location (for example in different caches) hold the same value

This can become very expensive

It is not sufficient to address all of the problems in the last example

Consistency: The model presented to the programmer of when writes become visible to other cores

Sequential Consistency

L. Lamport: “the result of any execution is the same as if the operations of all the processors were executed in some sequential order, and the operations of each individual processor appear in this sequence in the order specified by its program.”

Informally
• memory operations appear to execute one at a time
• operations of a single core appear to execute in the order described by the program
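The definition can be checked on a small case by brute force. The sketch below is an illustration only and is not taken from the slides: it assumes a hypothetical two-core test in which Core 0 stores to x then loads y, and Core 1 stores to y then loads x. It enumerates every interleaving that keeps each core's program order and collects the register values that sequential consistency allows.

```python
from itertools import combinations

# Hypothetical two-core test (not from the slides). A store writes 1 to a
# shared variable; a load copies a shared variable into a private register.
CORE_0 = [("store", "x", None), ("load", "y", "r1")]
CORE_1 = [("store", "y", None), ("load", "x", "r2")]


def interleavings(a, b):
    """Yield every merge of a and b that preserves each list's own order."""
    total = len(a) + len(b)
    for slots in combinations(range(total), len(a)):
        ia, ib = iter(a), iter(b)
        yield [next(ia) if i in slots else next(ib) for i in range(total)]


def run(schedule):
    """Execute one interleaving, one operation at a time, on shared memory."""
    memory = {"x": 0, "y": 0}
    regs = {}
    for kind, var, reg in schedule:
        if kind == "store":
            memory[var] = 1
        else:
            regs[reg] = memory[var]
    return regs["r1"], regs["r2"]


def sc_outcomes():
    """All (r1, r2) results reachable under sequential consistency."""
    return {run(s) for s in interleavings(CORE_0, CORE_1)}


print(sorted(sc_outcomes()))
```

The result (0, 0) never appears in the set: in any single sequential order, at least one store precedes the other core's load. A relaxed model may allow it, which is why sequential consistency matches what most programmers expect.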

Memory Consistency

Sequential Consistency is not the most stringent memory model.

It provides the behaviour that most software developers expect

Computer architectures and Java use relaxed consistency models

The compiler has to insert special instructions in order to maintain the program semantics
• fence, membar

Synchronization

How do we implement a lock?
• Regular read and write operations

    Label: read flag
           if (flag == 0) {
               // lock is free
               write flag 1
           } else {
               // wait until lock is free
               goto Label
           }

Does it work? Do we have everything that we need?
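One way to explore the question is to replay the read and the write as separate steps and let two cores interleave them. The sketch below is a deterministic simulation, not real concurrent code; the schedule format and the function name naive_lock are made up for this illustration.

```python
def naive_lock(schedule):
    """Replay (core, step) pairs against one flag; return who took the lock."""
    flag = 0         # shared memory word, 0 means the lock is free
    last_read = {}   # each core's private copy from its most recent read
    holders = []     # cores that believe they now hold the lock
    for core, step in schedule:
        if step == "read":
            last_read[core] = flag
        elif step == "write" and last_read.get(core) == 0:
            # The core saw the lock free earlier, so it claims it now.
            flag = 1
            holders.append(core)
    return holders


# Core 0 finishes its read and write before core 1 starts.
serial = [(0, "read"), (0, "write"), (1, "read"), (1, "write")]
# Both cores read the flag before either one writes it.
racy = [(0, "read"), (1, "read"), (0, "write"), (1, "write")]

print(naive_lock(serial))
print(naive_lock(racy))
```

In the second schedule both cores see flag == 0 and both write 1, so both enter the critical section. The read and the write need to happen as one indivisible step, which plain loads and stores do not provide.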

ISA support for Synchronization

Atomic compare and swap instruction
• Parameters x, old_value, new_value
• If [x] == old_value then [x] = new_value
• Return [x]

Load-linked and store-conditional instructions
• LL x – hardware locks the cache line corresponding to x and returns its contents
• SC x, new_value – hardware checks whether any instruction has modified x since LL; if intact the store succeeds. Otherwise, it leaves the contents of x unmodified.
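A spin lock built on compare and swap can be sketched as follows. Python does not expose a hardware compare and swap, so the Cell class below emulates the atomicity with a threading.Lock; that class, the SpinLock wrapper and count_with_lock are all hypothetical names for this illustration. The sketch also assumes the instruction returns the value [x] held before the operation, which is what lets the caller tell success from failure.

```python
import threading
import time


class Cell:
    """A memory word with an emulated atomic compare and swap."""

    def __init__(self, value=0):
        self.value = value
        self._atomic = threading.Lock()  # stands in for hardware atomicity

    def compare_and_swap(self, old_value, new_value):
        with self._atomic:
            observed = self.value
            if observed == old_value:
                self.value = new_value
            return observed  # assumption: the value before the operation


class SpinLock:
    def __init__(self):
        self.flag = Cell(0)  # 0 means free, 1 means held

    def acquire(self):
        # Retry until this thread is the one that changes 0 into 1.
        while self.flag.compare_and_swap(0, 1) != 0:
            time.sleep(0)  # give other threads a chance to run

    def release(self):
        self.flag.value = 0


def count_with_lock(n_threads=4, increments=500):
    """Increment a shared counter from several threads under the spin lock."""
    lock = SpinLock()
    counter = [0]

    def worker():
        for _ in range(increments):
            lock.acquire()
            counter[0] += 1
            lock.release()

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter[0]


print(count_with_lock())
```

Because the test and the write happen inside one indivisible operation, only one thread can observe 0 and install 1, so no increments are lost.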

Transactional Memory

The Need for Networks

Any multi-core system must clearly contain the means for cores to communicate
• With memory
• With each other (probably)

We have briefly discussed busses as the standard multi-core network

Others are possible
• But have different characteristics
• May provide different functionality

Evaluating Networks

Bandwidth
• Amount of data that can be moved per second
• Highest bandwidth device: Container ship full of hard disks

Latency
• How long it takes a given piece of the message to traverse the network
• Container ship (several weeks), ADSL (microseconds)

Congestion
• The effect on bandwidth and latency of the utilisation of the network by other processors

Bandwidth vs. Latency

Definitely not the same thing

A truck carrying one million 16Gbyte flash memory cards to London
• Latency = 4 hours (14,400 secs)
• Bandwidth = 8Tbit/sec (8 * 10^12 /sec)

A broadband internet connection
• Latency = 100 microsec (10^-4 sec)
• Bandwidth = 10Mbit/sec (10 * 10^6 /sec)
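The figures above can be reproduced with a few lines of arithmetic. This is a rough sketch that assumes 16Gbyte means 16 * 10^9 bytes and that a broadband transfer takes its latency plus the payload divided by the bandwidth; the constant and function names are made up for the illustration.

```python
TRUCK_LATENCY_SEC = 4 * 60 * 60                  # 4 hours = 14,400 secs
TRUCK_PAYLOAD_BITS = 1_000_000 * 16 * 10**9 * 8  # one million 16Gbyte cards

BROADBAND_LATENCY_SEC = 100e-6                   # 100 microsec
BROADBAND_BITS_PER_SEC = 10e6                    # 10Mbit/sec


def truck_bandwidth():
    """Bits delivered per second of journey time."""
    return TRUCK_PAYLOAD_BITS / TRUCK_LATENCY_SEC


def broadband_time(payload_bits):
    """Seconds until the last bit arrives over the broadband link."""
    return BROADBAND_LATENCY_SEC + payload_bits / BROADBAND_BITS_PER_SEC


print(f"truck bandwidth: {truck_bandwidth():.2e} bit/sec")
print(f"1 kbyte over broadband: {broadband_time(8_000):.2e} sec")
print(f"truck load over broadband: {broadband_time(TRUCK_PAYLOAD_BITS):.2e} sec")
```

For a small message the broadband link wins by many orders of magnitude, while for the full truck load the truck's four hours are far shorter than the broadband transfer.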

Bus

Common wire interconnection

Usually parallel wires – address + data

Only single usage at any point in time

Controlled by clock – divided into time slots

Sender must ‘grab’ a slot (via arbitration)

Then transmit (address + data)

Often ‘split transaction’
• E.g. send memory address in one slot
• Data returned by memory in later slot
• Intervening slots free for use by others

Crossbar

E.g. to connect N cores to N memories

Can achieve ‘any to any’ (disjoint) in parallel
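The difference from a bus can be sketched by counting time slots. The model below is a simplification, not a description of any real interconnect: it assumes each core issues one request, a bus carries one transfer per slot, and a crossbar only serialises requests that target the same memory. The function names are hypothetical.

```python
from collections import Counter


def bus_slots(requests):
    """A bus allows a single usage at a time, so each request needs a slot."""
    return len(requests)


def crossbar_slots(requests):
    """Disjoint core-to-memory pairs proceed together; clashes must queue."""
    if not requests:
        return 0
    per_memory = Counter(requests.values())
    return max(per_memory.values())


# requests maps core number -> memory number
disjoint = {0: 0, 1: 1, 2: 2, 3: 3}
clashing = {0: 2, 1: 2, 2: 3, 3: 0}

print(bus_slots(disjoint), crossbar_slots(disjoint))
print(bus_slots(clashing), crossbar_slots(clashing))
```

With disjoint targets the crossbar finishes in one slot where the bus needs four; when two cores want the same memory the crossbar still has to serialise those two.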

Ring

Simple but
• Low bandwidth
• Variable latency
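The variable latency can be seen by counting hops between nodes. The sketch below assumes a bidirectional ring in which a message takes the shorter way round; that assumption and the function names are made up for the illustration.

```python
def ring_hops(src, dst, nodes):
    """Hops between two nodes on a bidirectional ring, taking the shorter way."""
    clockwise = (dst - src) % nodes
    return min(clockwise, nodes - clockwise)


def hop_spread(nodes):
    """Minimum, maximum and mean hop count from node 0 to every other node."""
    hops = [ring_hops(0, dst, nodes) for dst in range(1, nodes)]
    return min(hops), max(hops), sum(hops) / len(hops)


print(hop_spread(8))
print(hop_spread(64))
```

The nearest neighbour is always one hop away, but the farthest node is half the ring away, so the worst case grows linearly with the number of nodes.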

Tree

Variable bandwidth (Switched vs Hubs) (Depth of the Tree)

Variable Latency

Reliability?

Mesh / Grid

Switched

Reasonable bandwidth

Variable Latency

Convenient for very large systems physical layout

‘wrap around’ – becomes a toroid (ring doughnut)
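The effect of the wrap-around links on latency can be sketched by counting hops. The model below assumes messages travel along rows and columns by a shortest path; the coordinate scheme and function names are hypothetical.

```python
def mesh_hops(src, dst):
    """Hops between two (x, y) nodes on a mesh without wrap-around links."""
    (x1, y1), (x2, y2) = src, dst
    return abs(x1 - x2) + abs(y1 - y2)


def torus_hops(src, dst, width, height):
    """Hops when each row and column wraps around to form a ring."""
    (x1, y1), (x2, y2) = src, dst
    dx, dy = abs(x1 - x2), abs(y1 - y2)
    return min(dx, width - dx) + min(dy, height - dy)


corner, opposite = (0, 0), (3, 3)
print(mesh_hops(corner, opposite))
print(torus_hops(corner, opposite, 4, 4))
```

On a 4 by 4 grid the two opposite corners are six hops apart without wrap-around and two hops apart with it, while neighbouring nodes stay one hop apart in both cases.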

Connecting on-chip

[Figure: on-chip organisation. Labels: two cores, each with L1 Inst and L1 Data caches and an L2 Cache; L3 Shared Cache; Memory Controller; QPI or HT; On Chip.]

AMD Opteron (Istanbul)

AMD Magny-cours

Non-Uniform Memory Access (NUMA)

Amdahl’s Law

Speed up = (S + P) / (S + (P / #Processors))

S = Fraction of the code which is serial

P = Fraction of the code which can be parallel
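A small calculation makes the limit visible. The sketch below reads the law as (S + P) / (S + P / #Processors), with S and P as defined above; the function name and the chosen values are made up for the illustration.

```python
def amdahl_speedup(serial, parallel, processors):
    """Speed up for serial fraction S and parallel fraction P on n processors."""
    return (serial + parallel) / (serial + parallel / processors)


for parallel in (0.5, 0.75, 0.95, 0.99):
    serial = 1.0 - parallel
    row = [amdahl_speedup(serial, parallel, n) for n in (1, 16, 1024)]
    print(parallel, [round(s, 2) for s in row])
```

As the number of processors grows, the speed up approaches (S + P) / S, so a program that is 5% serial can never run more than 20 times faster however many cores are added.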

Amdahl’s Law

[Figure: Speedup (axis ticks 1, 10, 100, 1000) plotted against Cores (axis ticks 0.1024 to 1024), with curves labelled 0.5, 0.75, 0.95, 0.99 and 1.]