Computer Architecture
ELEC3441
Lecture 13 – Multi-Core Processors
Dr. Hayden Kwok-Hay So
Department of Electrical and Electronic Engineering
End of an Era …

[Figure: Growth in processor performance relative to the VAX-11/780, 1978–2012, log scale from 1 to 100,000. Performance grew ~25%/year until 1986, ~52%/year from 1986 to 2003, and only ~22%/year afterwards, limited by power, ILP, and memory speed. Milestones run from the VAX-11/780 (5 MHz) through RISC machines (MIPS, Sun-4, IBM RS6000, HP 9000, Digital Alpha), the Pentium III/4 and Athlon era, to multi-core Intel Core 2, Core i7, and Xeon processors.]

HKUEEE ENGG3441 - HS 2
Ways to Achieve Parallelism
n Instruction Level Parallelism (ILP)
• Parallel operations come from instructions that execute in parallel
• Dynamic: superscalar processors, out-of-order execution
• Static: VLIW
n Data Level Parallelism (DLP)
• Parallel operations come from concurrent operations on independent data
• Vector machines, SIMD extensions
n Thread Level Parallelism (TLP)
• Parallel operations come from multiple concurrent threads of execution
Multiprocessor Systems on a Chip
n Machines with more than one processor were popular among servers and supercomputers in the 80s and 90s
n Uniprocessor speed growth has come to a halt due to the power wall
n All major processor vendors have moved to multi-core designs
Connecting Cores

[Figure: three organizations — a board-level multi-processor (CPUs sharing memory across a board), a chip multi-processor with cores attached to a shared memory, and a chip multi-processor with cores connected by an on-chip direct network.]
Multi-processor System-on-Chip
Direct Connections
n Usually in the form of a low-latency, high-throughput, point-to-point network between processors
• Bypasses the I/O subsystem
n Allows low-latency communication between neighboring processors
• Sometimes with dedicated machine instructions
n Multi-hop routing reaches more distant processors
• Topology of the network plays an important role
• e.g. ring, torus, mesh, …
n Often tied to a distributed memory system
n Often a proprietary design
n Commercial examples:
• AMD: HyperTransport
• Intel: QuickPath Interconnect
Network Topology

[Figure: example topologies — ring, mesh, torus.]
On-chip Network
n The study of building networks within a system-on-chip
• A complete computer system on a chip
• Including graphics, peripheral and memory controllers, accelerators
n MPSoC: multi-processor system-on-chip
• Multiple compute cores in the system
• Possibly different types of cores
n Mostly proprietary
n Some examples of on-chip interconnects:
• Advanced Microcontroller Bus Architecture (AMBA): on-chip interconnect developed by ARM
• Wishbone: OpenCores standard
Shared Memory Cores
n Common topology for commercial multi-core processors
n Various combinations of shared and private cache/memory

[Figure: two organizations. Left: each CPU core has private L1 I$ and L1 D$ backed by a shared L2$ and main memory (e.g. Intel Core, Core 2). Right: each CPU core has private L1 I$/D$ and a private L2$, backed by a shared L3$ and main memory (e.g. Intel Nehalem, Sandy Bridge, Ivy Bridge).]
Symmetric Multiprocessors

symmetric:
• All memory is equally far away from all processors
• Any processor can do any I/O (e.g., set up a DMA transfer)

[Figure: two processors on a CPU–memory bus with memory, graphics output, and a bridge to an I/O bus hosting I/O controllers for networks and other devices.]
Synchronization

The need for synchronization arises whenever there are concurrent processes in a system (even in a uniprocessor system).

Two classes of synchronization:
• Producer-Consumer: a consumer process must wait until the producer process has produced data
• Mutual Exclusion: ensure that only one process uses a resource at a given time

[Figure: a producer feeding a consumer; two processes P1 and P2 contending for a shared resource.]
A Producer-Consumer Example

The program is written assuming instructions are executed in order.

buf* tail;
buf* head;

Producer posting item x:
        Load Rtail, 0(tail)
        Store 0(Rtail), x
        Rtail = Rtail + 1
        Store 0(tail), Rtail

Consumer:
        Load Rhead, 0(head)
  spin: Load Rtail, 0(tail)
        if Rhead == Rtail goto spin
        Load R, 0(Rhead)
        Rhead = Rhead + 1
        Store 0(head), Rhead
        process(R)

Problems?
A Producer-Consumer Example (continued)

Can the tail pointer get updated before the item x is stored?

Producer posting item x:
        Load Rtail, 0(tail)
    1:  Store 0(Rtail), x
        Rtail = Rtail + 1
    2:  Store 0(tail), Rtail

Consumer:
        Load Rhead, 0(head)
  spin:
    3:  Load Rtail, 0(tail)
        if Rhead == Rtail goto spin
    4:  Load R, 0(Rhead)
        Rhead = Rhead + 1
        Store 0(head), Rhead
        process(R)

The programmer assumes that if 3 happens after 2, then 4 happens after 1.
Problem sequences are: 2, 3, 4, 1 and 4, 1, 2, 3.
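To make the hazard concrete, here is a minimal Python sketch (my own illustration, not the lecture's code; `mem` and `consume` are invented names) that runs the consumer between the producer's two stores when they become visible in the problem order 2, 3, 4, 1:

```python
mem = {"head": 0, "tail": 0, "buf": [None] * 8}

def consume():
    """Consumer: return the next item, or None if head == tail."""
    rhead = mem["head"]
    if rhead == mem["tail"]:          # (3) spin test against tail
        return None
    r = mem["buf"][rhead]             # (4) load the item
    mem["head"] = rhead + 1
    return r

# In-order producer: item store (1) before tail update (2) -- safe.
rtail = mem["tail"]
mem["buf"][rtail] = "item-A"          # (1) Store 0(Rtail), x
mem["tail"] = rtail + 1               # (2) Store 0(tail), Rtail
ok = consume()                        # consumer sees a complete item

# Reordered visibility 2, 3, 4, 1: tail advances before the item exists.
rtail = mem["tail"]
mem["tail"] = rtail + 1               # (2) becomes visible first
got = consume()                       # (3), (4) run in the gap
mem["buf"][rtail] = "item-B"          # (1) arrives too late
print(ok, got)                        # item-A None
```

In the reordered run the consumer passes the spin test but reads an unwritten slot, exactly the failure the slide asks about.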
Sequential Consistency: A Memory Model

"A system is sequentially consistent if the result of any execution is the same as if the operations of all the processors were executed in some sequential order, and the operations of each individual processor appear in the order specified by the program."
— Leslie Lamport

Sequential consistency = arbitrary order-preserving interleaving of memory references of sequential programs

[Figure: several processors P sharing a single memory M.]
Sequential Consistency

Sequential concurrent tasks: T1, T2
Shared variables: X, Y (initially X = 0, Y = 10)

T1:                       T2:
Store (X), 1   # X ← 1    Load R1, (Y)
Store (Y), 11  # Y ← 11   Store (Y'), R1  # Y' ← Y
                          Load R2, (X)
                          Store (X'), R2  # X' ← X

What are the legitimate answers for X' and Y'?
(X', Y') ∈ {(1,11), (0,10), (1,10), (0,11)}?
If Y' is 11 then X' cannot be 0.
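The claim that (0, 11) is impossible can be checked by brute force. A small Python sketch (my own, not part of the lecture) enumerates every order-preserving interleaving of T1's two stores with T2's four operations:

```python
from itertools import combinations

def run(schedule):
    """Execute one interleaving; schedule is a list of task ids (1 or 2)."""
    m = {"X": 0, "Y": 10, "X'": None, "Y'": None}
    r = {}
    t1 = [lambda: m.__setitem__("X", 1),         # Store (X), 1
          lambda: m.__setitem__("Y", 11)]        # Store (Y), 11
    t2 = [lambda: r.__setitem__("R1", m["Y"]),   # Load R1, (Y)
          lambda: m.__setitem__("Y'", r["R1"]),  # Store (Y'), R1
          lambda: r.__setitem__("R2", m["X"]),   # Load R2, (X)
          lambda: m.__setitem__("X'", r["R2"])]  # Store (X'), R2
    it1, it2 = iter(t1), iter(t2)
    for who in schedule:
        next(it1 if who == 1 else it2)()         # run the next op of that task
    return (m["X'"], m["Y'"])

# Choose which 2 of the 6 slots T1's ops occupy; program order is preserved.
outcomes = set()
for pos in combinations(range(6), 2):
    outcomes.add(run([1 if i in pos else 2 for i in range(6)]))

print(sorted(outcomes))   # [(0, 10), (1, 10), (1, 11)] -- (0, 11) never appears
```

Only three of the four candidate outcomes are reachable: seeing Y = 11 means T1's earlier store to X is already visible, so (0, 11) is excluded under SC.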
Sequential Consistency

Sequential consistency imposes more memory ordering constraints than those imposed by uniprocessor program dependencies. What are the additional SC requirements in our example?

T1:                       T2:
Store (X), 1   # X ← 1    Load R1, (Y)
Store (Y), 11  # Y ← 11   Store (Y'), R1  # Y' ← Y
                          Load R2, (X)
                          Store (X'), R2  # X' ← X

Does (can) a system with caches or out-of-order execution capability provide a sequentially consistent view of the memory? (more on this later)
Issues in Implementing Sequential Consistency

Implementation of SC is complicated by two issues:

• Out-of-order execution capability
  Load(a); Load(b)     yes
  Load(a); Store(b)    yes if a ≠ b
  Store(a); Load(b)    yes if a ≠ b
  Store(a); Store(b)   yes if a ≠ b
• Caches
  Caches can prevent the effect of a store from being seen by other processors

[Figure: several processors P sharing a single memory M.]

No common commercial architecture has a sequentially consistent memory model!
Memory Fences: Instructions to Serialize Memory Accesses

Processors with relaxed or weak memory models (i.e., that permit loads and stores to different addresses to be reordered) need to provide memory fence instructions to force the serialization of memory accesses.

Examples of processors with relaxed memory models:
• Sparc V8 (TSO, PSO): Membar
• Sparc V9 (RMO): Membar #LoadLoad, Membar #LoadStore, Membar #StoreLoad, Membar #StoreStore
• PowerPC (WO): Sync, EIEIO
• ARM: DMB (Data Memory Barrier)
• x86/64: mfence (global memory barrier)

Memory fences are expensive operations; however, one pays the cost of serialization only when it is required.
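In Python terms (a toy model of my own, not the real fence semantics of any ISA listed above), a fence between the producer's item store and tail store rules out the one visibility order that breaks the consumer:

```python
from itertools import permutations

def consumer_sees_garbage(visible_order):
    """Commit the producer's two stores in `visible_order`, polling the
    consumer after each commit; True if it ever reads an unwritten slot."""
    mem = {"tail": 0, "buf": [None] * 4}
    for ev in visible_order:
        if ev == "store_item":
            mem["buf"][0] = "x"       # (1) Store 0(Rtail), x
        else:
            mem["tail"] = 1           # (2) Store 0(tail), Rtail
        # consumer polls here
        if mem["tail"] > 0 and mem["buf"][0] is None:
            return True               # tail advanced but the item is missing
    return False

events = ["store_item", "store_tail"]   # program order: (1) then (2)

# Without a fence, hardware may commit the two stores in either order:
unfenced_bad = any(consumer_sees_garbage(p) for p in permutations(events))
# A fence between (1) and (2) forces program order, the only safe order:
fenced_bad = consumer_sees_garbage(events)
print(unfenced_bad, fenced_bad)         # True False
```

The fence buys correctness by forbidding exactly one reordering, which is why its serialization cost is paid only where the ordering actually matters.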
Memory Coherence in SMPs

[Figure: CPU-1 and CPU-2 on a CPU–memory bus; cache-1, cache-2, and memory all hold A = 100.]

Suppose CPU-1 updates A to 200.
• write-back: memory and cache-2 have stale values
• write-through: cache-2 has a stale value

Do these stale values matter?
What is the view of shared memory for programming?
Write-back Caches & SC

prog T1:        prog T2:
ST X, 1         LD Y, R1
ST Y, 11        ST Y', R1
                LD X, R2
                ST X', R2

• T1 executed:           cache-1: X = 1, Y = 11;  memory: X = 0, Y = 10;  cache-2: empty
• cache-1 writes back Y: memory: X = 0, Y = 11
• T2 executed:           cache-2: Y = 11, Y' = 11, X = 0, X' = 0  (X read from stale memory)
• cache-1 writes back X: memory: X = 1, Y = 11
• cache-2 writes back:   memory: X' = 0, Y' = 11

X' and Y' are inconsistent: under sequential consistency, Y' = 11 would force X' = 1.
Write-through Caches & SC

prog T1:        prog T2:
ST X, 1         LD Y, R1
ST Y, 11        ST Y', R1
                LD X, R2
                ST X', R2

• Initially:   cache-1: X = 0, Y = 10;  memory: X = 0, Y = 10;  cache-2: X = 0 (cached earlier)
• T1 executed: cache-1: X = 1, Y = 11;  memory: X = 1, Y = 11;  cache-2 still holds the stale X = 0
• T2 executed: cache-2: Y = 11, Y' = 11, X = 0 (stale hit), X' = 0;  memory: X' = 0, Y' = 11

Write-through caches don't preserve sequential consistency either.
Maintaining Cache Coherence

§ Hardware support is required such that
– only one processor at a time has write permission for a location
– no processor can load a stale copy of the location after a write
⇒ cache coherence protocols
Cache Coherence vs. Memory Consistency

§ A cache coherence protocol ensures that all writes by one processor are eventually visible to other processors, for one memory address
– i.e., updates are not lost
§ A memory consistency model gives the rules on when a write by one processor can be observed by a read on another, across different addresses
– Equivalently, what values can be seen by a load
§ A cache coherence protocol is not enough to ensure sequential consistency
– But if the system is sequentially consistent, then caches must be coherent
§ The combination of a cache coherence protocol and the processor's memory reorder buffer is used to implement a given architecture's memory consistency model
Snoopy Cache (Goodman 1983)

§ Idea: Have the cache watch (or snoop upon) DMA transfers, and then "do the right thing"
§ Snoopy cache tags are dual-ported

[Figure: the cache's tags-and-state array has two ports — a processor-side port (address, data, R/W), also used to drive the memory bus when the cache is bus master, and a snoopy read port (address, R/W) attached to the memory bus.]
Shared Memory Multiprocessor

Use the snoopy mechanism to keep all processors' view of memory coherent.

[Figure: processors M1, M2, M3, each behind a snoopy cache, share a memory bus with physical memory and a DMA engine connected to disks.]
Snoopy Cache Coherence Protocols

write miss: the address is invalidated in all other caches before the write is performed
read miss: if a dirty copy is found in some cache, a write-back is performed before the memory is read
Cache State Transition Diagram: The MSI Protocol

Each cache line has state bits (M: Modified, S: Shared, I: Invalid) alongside its address tag.

Transitions for a line in processor P1's cache:
• I → M: write miss (P1 gets line from memory)
• I → S: read miss (P1 gets line from memory)
• S → M: P1 intent to write
• M → S: other processor reads (P1 writes back)
• M → I: other processor intent to write (P1 writes back)
• S → I: other processor intent to write
• M → M: P1 reads or writes
• S → S: read by any processor
Two Processor Example (reading and writing the same cache line)

P1 and P2 each run the MSI state machine on their own copy of the line; an "intent to write" from one processor invalidates the line in the other, and a read of a modified line forces a write-back.

Access sequence and resulting states:
• P1 reads:  read miss          → P1: S, P2: I
• P1 writes: P1 intent to write → P1: M, P2: I
• P2 reads:  P1 writes back     → P1: S, P2: S
• P2 writes: P2 intent to write → P1: I, P2: M
• P1 writes: P2 writes back     → P1: M, P2: I
• P2 writes: P1 invalidated     → P1: I, P2: M
• P1 reads:  P2 writes back     → P1: S, P2: S
• P1 writes: P1 intent to write → P1: M, P2: I
Observation

§ If a line is in the M state then no other cache can have a copy of the line!
§ Memory stays coherent: multiple differing copies cannot exist
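The observation can be checked with a minimal MSI simulator (my own Python sketch; the lecture gives no code). Two caches track one line; a write invalidates the other copy, and a read of a modified line forces a write-back to shared:

```python
class Cache:
    def __init__(self):
        self.state = "I"   # each line starts Invalid

def access(requester, other, op):
    """Apply one 'read' or 'write' by `requester` to the shared line."""
    if op == "read":
        if requester.state == "I":   # read miss
            if other.state == "M":
                other.state = "S"    # other processor writes back, keeps S copy
            requester.state = "S"
        # read hit in M or S: no state change
    else:                            # write miss or intent to write
        if other.state != "I":
            other.state = "I"        # invalidate (write back first if it was M)
        requester.state = "M"

p1, p2 = Cache(), Cache()
trace = [("P1", "read"), ("P1", "write"), ("P2", "read"), ("P2", "write"),
         ("P1", "write"), ("P2", "write"), ("P1", "read"), ("P1", "write")]
for who, op in trace:
    req, oth = (p1, p2) if who == "P1" else (p2, p1)
    access(req, oth, op)
    # Observation: M in one cache implies I in the other.
    assert not (req.state == "M" and oth.state != "I")
    print(who, op, "-> P1:", p1.state, " P2:", p2.state)
```

The assertion inside the loop encodes the invariant: whenever a cache holds the line in M, every other copy is invalid, so differing copies cannot coexist.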
MESI: An Enhanced MSI Protocol
(increased performance for private data)

Each cache line has state bits (M: Modified Exclusive, E: Exclusive but unmodified, S: Shared, I: Invalid) alongside its address tag.

Transitions for a line in processor P1's cache:
• I → M: write miss
• I → E: read miss, not shared
• I → S: read miss, shared
• E → M: P1 write (no bus transaction needed, since no other cache has a copy)
• E → S: other processor reads
• S → M: P1 intent to write
• M → S: other processor reads (P1 writes back)
• M → I: other processor intent to write (P1 writes back)
• E → I: other processor intent to write
• S → I: other processor intent to write
• M → M: P1 writes or reads
• E → E: P1 read
• S → S: read by any processor
Optimized Snoop with Level-2 Caches

[Figure: four CPUs, each with a private L1 and L2 cache; a snooper sits between each L2 and the shared bus.]

• Processors often have two-level caches
• small L1, large L2 (usually both on chip now)
• Inclusion property: entries in L1 must be in L2
  ⇒ invalidation in L2 ⇒ invalidation in L1
• Snooping on L2 does not affect CPU–L1 bandwidth

What problem could occur?
Intervention

[Figure: CPU-1's cache holds A = 200 (modified); CPU-2's cache is empty; memory holds stale data A = 100.]

When a read miss for A occurs in cache-2, a read request for A is placed on the bus:
• Cache-1 needs to supply the data and change its state to shared
• The memory may respond to the request also! Does memory know it has stale data?
• Cache-1 needs to intervene through the memory controller to supply the correct data to cache-2
False Sharing

state | line addr | data0 | data1 | ... | dataN

A cache line contains more than one word.
Cache coherence is done at the line level and not the word level.
Suppose M1 writes word_i and M2 writes word_k, and both words have the same line address. What can happen?
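What happens is that the line ping-pongs between the two caches even though the processors never touch the same word. A Python sketch of my own (line size and trace are illustrative choices) counts line-ownership changes:

```python
WORDS_PER_LINE = 8   # illustrative line size

def coherence_events(writes):
    """Count invalidations caused by a sequence of (cpu, word_address) writes,
    tracking ownership at line granularity, as a snoopy protocol does."""
    owner = {}           # line number -> cpu holding the line in M state
    invalidations = 0
    for cpu, addr in writes:
        line = addr // WORDS_PER_LINE
        if owner.get(line) not in (None, cpu):
            invalidations += 1          # line stolen from the other cache
        owner[line] = cpu
    return invalidations

# M1 writes word 0, M2 writes word 1: same line address -> false sharing.
same_line = [("M1", 0), ("M2", 1)] * 100
# Same access pattern, but the two words placed in different lines.
diff_line = [("M1", 0), ("M2", 8)] * 100

print(coherence_events(same_line))   # 199: the line ping-pongs on every write
print(coherence_events(diff_line))   # 0: each CPU keeps its own line in M
```

Padding or realigning data so that independently written words land in different lines removes the coherence traffic entirely, which is the standard fix for false sharing.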
Out-of-Order Loads/Stores & CC

Blocking caches: one request at a time + CC ⇒ SC
Non-blocking caches: multiple requests (to different addresses) concurrently + CC ⇒ relaxed memory models

CC ensures that all processors observe the same order of loads and stores to an address.

[Figure: a CPU with load/store buffers in front of a cache holding line states (I/S/E); a snooper on the CPU/memory interface exchanges shared/exclusive requests and replies (S-req, E-req, S-rep, E-rep), pushout write-backs (Wb-rep), and Wb-req, Inv-req, Inv-rep messages with memory.]
Acknowledgements

n These slides contain material developed and copyright by:
• Arvind (MIT)
• Krste Asanovic (MIT/UCB)
• Joel Emer (Intel/MIT)
• James Hoe (CMU)
• John Kubiatowicz (UCB)
• David Patterson (UCB)
• John Lazzaro (UCB)
n MIT material derived from course 6.823
n UCB material derived from courses CS152 and CS252