Computer Architecture
ELEC3441
Lecture 13 – Multi-Core Processors
Dr. Hayden Kwok-Hay So
Department of Electrical and Electronic Engineering
End of an Era …

[Figure: Growth in processor performance relative to the VAX-11/780, 1978–2012, log scale from 1 to 100,000. Performance grew ~25%/year until 1986, ~52%/year from 1986 to 2003, and only ~22%/year afterwards, limited by power, ILP, and memory speed. Milestones run from the VAX-11/780 (5 MHz) through RISC machines (MIPS, Sun-4, IBM RS6000, HP 9000, Digital Alpha), the Pentium III/4 and Athlon era, to multi-core Intel Core 2, Core i7, and Xeon processors.]

HKUEEE ENGG3441 - HS 2
Ways to Achieve Parallelism
n Instruction Level Parallelism (ILP)
• Parallel operations come from instructions that execute in parallel
• Dynamic: superscalar processors, out-of-order execution
• Static: VLIW
n Data Level Parallelism (DLP)
• Parallel operations come from concurrent operations on independent data
• Vector machines, SIMD extensions
n Thread Level Parallelism (TLP)
• Parallel operations come from multiple concurrent threads of execution
Multiprocessor Systems on a Chip
n Machines with more than one processor were popular among servers and supercomputers in the 80s and 90s
n Uniprocessor speed growth has come to a halt due to the power wall
n All major processor vendors have moved to multi-core designs
Connecting Cores

[Figure: three organizations — a board-level multi-processor (CPUs sharing memory across a board), a chip multi-processor with cores attached to a shared memory, and a chip multi-processor with cores connected by an on-chip direct network.]
Multi-processor System-on-Chip
Direct Connections
n Usually in the form of a low-latency, high-throughput, point-to-point network between processors
• Bypasses the I/O subsystem
n Allows low-latency communication between neighboring processors
• Sometimes with dedicated machine instructions
n Multi-hop routing reaches more distant processors
• Topology of the network plays an important role
• e.g. ring, torus, mesh, …
n Often tied to a distributed memory system
n Often a proprietary design
n Commercial examples:
• AMD: HyperTransport
• Intel: QuickPath Interconnect
Network Topology

[Figure: example topologies — ring, mesh, torus.]
On-chip Network
n The study of building networks within a system-on-chip
• A complete computer system on a chip
• Including graphics, peripheral and memory controllers, accelerators
n MPSoC: multi-processor system-on-chip
• Multiple compute cores in the system
• Possibly different types of cores
n Mostly proprietary
n Some examples of on-chip interconnects:
• Advanced Microcontroller Bus Architecture (AMBA): on-chip interconnect developed by ARM
• Wishbone: OpenCores standard
Shared Memory Cores
n Common topology for commercial multi-core processors
n Various combinations of shared and private cache/memory

[Figure: two organizations. Left: each CPU core has private L1 I$ and L1 D$ backed by a shared L2$ and main memory (e.g. Intel Core, Core 2). Right: each CPU core has private L1 I$/D$ and a private L2$, backed by a shared L3$ and main memory (e.g. Intel Nehalem, Sandy Bridge, Ivy Bridge).]
Symmetric Multiprocessors

symmetric:
• All memory is equally far away from all processors
• Any processor can do any I/O (e.g., set up a DMA transfer)

[Figure: two processors on a CPU–memory bus with memory, graphics output, and a bridge to an I/O bus hosting I/O controllers for networks and other devices.]
Synchronization

The need for synchronization arises whenever there are concurrent processes in a system (even in a uniprocessor system).

Two classes of synchronization:
• Producer-Consumer: a consumer process must wait until the producer process has produced data
• Mutual Exclusion: ensure that only one process uses a resource at a given time

[Figure: a producer feeding a consumer; two processes P1 and P2 contending for a shared resource.]
A Producer-Consumer Example

The program is written assuming instructions are executed in order.

buf* tail;
buf* head;

Producer posting item x:
        Load Rtail, 0(tail)
        Store 0(Rtail), x
        Rtail = Rtail + 1
        Store 0(tail), Rtail

Consumer:
        Load Rhead, 0(head)
  spin: Load Rtail, 0(tail)
        if Rhead == Rtail goto spin
        Load R, 0(Rhead)
        Rhead = Rhead + 1
        Store 0(head), Rhead
        process(R)

Problems?
A Producer-Consumer Example (continued)

Can the tail pointer get updated before the item x is stored?

Producer posting item x:
        Load Rtail, 0(tail)
    1:  Store 0(Rtail), x
        Rtail = Rtail + 1
    2:  Store 0(tail), Rtail

Consumer:
        Load Rhead, 0(head)
  spin:
    3:  Load Rtail, 0(tail)
        if Rhead == Rtail goto spin
    4:  Load R, 0(Rhead)
        Rhead = Rhead + 1
        Store 0(head), Rhead
        process(R)

The programmer assumes that if 3 happens after 2, then 4 happens after 1.
Problem sequences are: 2, 3, 4, 1 and 4, 1, 2, 3.
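To make the hazard concrete, here is a minimal Python sketch (my own illustration, not the lecture's code; `mem` and `consume` are invented names) that runs the consumer between the producer's two stores when they become visible in the problem order 2, 3, 4, 1:

```python
mem = {"head": 0, "tail": 0, "buf": [None] * 8}

def consume():
    """Consumer: return the next item, or None if head == tail."""
    rhead = mem["head"]
    if rhead == mem["tail"]:          # (3) spin test against tail
        return None
    r = mem["buf"][rhead]             # (4) load the item
    mem["head"] = rhead + 1
    return r

# In-order producer: item store (1) before tail update (2) -- safe.
rtail = mem["tail"]
mem["buf"][rtail] = "item-A"          # (1) Store 0(Rtail), x
mem["tail"] = rtail + 1               # (2) Store 0(tail), Rtail
ok = consume()                        # consumer sees a complete item

# Reordered visibility 2, 3, 4, 1: tail advances before the item exists.
rtail = mem["tail"]
mem["tail"] = rtail + 1               # (2) becomes visible first
got = consume()                       # (3), (4) run in the gap
mem["buf"][rtail] = "item-B"          # (1) arrives too late
print(ok, got)                        # item-A None
```

In the reordered run the consumer passes the spin test but reads an unwritten slot, exactly the failure the slide asks about.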
Sequential Consistency: A Memory Model

"A system is sequentially consistent if the result of any execution is the same as if the operations of all the processors were executed in some sequential order, and the operations of each individual processor appear in the order specified by the program."
— Leslie Lamport

Sequential consistency = arbitrary order-preserving interleaving of memory references of sequential programs

[Figure: several processors P sharing a single memory M.]
Sequential Consistency

Sequential concurrent tasks: T1, T2
Shared variables: X, Y (initially X = 0, Y = 10)

T1:                       T2:
Store (X), 1   # X ← 1    Load R1, (Y)
Store (Y), 11  # Y ← 11   Store (Y'), R1  # Y' ← Y
                          Load R2, (X)
                          Store (X'), R2  # X' ← X

What are the legitimate answers for X' and Y'?
(X', Y') ∈ {(1,11), (0,10), (1,10), (0,11)}?
If Y' is 11 then X' cannot be 0.
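The claim that (0, 11) is impossible can be checked by brute force. A small Python sketch (my own, not part of the lecture) enumerates every order-preserving interleaving of T1's two stores with T2's four operations:

```python
from itertools import combinations

def run(schedule):
    """Execute one interleaving; schedule is a list of task ids (1 or 2)."""
    m = {"X": 0, "Y": 10, "X'": None, "Y'": None}
    r = {}
    t1 = [lambda: m.__setitem__("X", 1),         # Store (X), 1
          lambda: m.__setitem__("Y", 11)]        # Store (Y), 11
    t2 = [lambda: r.__setitem__("R1", m["Y"]),   # Load R1, (Y)
          lambda: m.__setitem__("Y'", r["R1"]),  # Store (Y'), R1
          lambda: r.__setitem__("R2", m["X"]),   # Load R2, (X)
          lambda: m.__setitem__("X'", r["R2"])]  # Store (X'), R2
    it1, it2 = iter(t1), iter(t2)
    for who in schedule:
        next(it1 if who == 1 else it2)()         # run the next op of that task
    return (m["X'"], m["Y'"])

# Choose which 2 of the 6 slots T1's ops occupy; program order is preserved.
outcomes = set()
for pos in combinations(range(6), 2):
    outcomes.add(run([1 if i in pos else 2 for i in range(6)]))

print(sorted(outcomes))   # [(0, 10), (1, 10), (1, 11)] -- (0, 11) never appears
```

Only three of the four candidate outcomes are reachable: seeing Y = 11 means T1's earlier store to X is already visible, so (0, 11) is excluded under SC.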
Sequential Consistency

Sequential consistency imposes more memory ordering constraints than those imposed by uniprocessor program dependencies. What are the additional SC requirements in our example?

T1:                       T2:
Store (X), 1   # X ← 1    Load R1, (Y)
Store (Y), 11  # Y ← 11   Store (Y'), R1  # Y' ← Y
                          Load R2, (X)
                          Store (X'), R2  # X' ← X

Does (can) a system with caches or out-of-order execution capability provide a sequentially consistent view of the memory? (more on this later)
Issues in Implementing Sequential Consistency

Implementation of SC is complicated by two issues:

• Out-of-order execution capability
  Load(a); Load(b)     yes
  Load(a); Store(b)    yes if a ≠ b
  Store(a); Load(b)    yes if a ≠ b
  Store(a); Store(b)   yes if a ≠ b
• Caches
  Caches can prevent the effect of a store from being seen by other processors

[Figure: several processors P sharing a single memory M.]

No common commercial architecture has a sequentially consistent memory model!
Memory Fences: Instructions to Serialize Memory Accesses

Processors with relaxed or weak memory models (i.e., that permit loads and stores to different addresses to be reordered) need to provide memory fence instructions to force the serialization of memory accesses.

Examples of processors with relaxed memory models:
• Sparc V8 (TSO, PSO): Membar
• Sparc V9 (RMO): Membar #LoadLoad, Membar #LoadStore, Membar #StoreLoad, Membar #StoreStore
• PowerPC (WO): Sync, EIEIO
• ARM: DMB (Data Memory Barrier)
• x86/64: mfence (global memory barrier)

Memory fences are expensive operations; however, one pays the cost of serialization only when it is required.
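In Python terms (a toy model of my own, not the real fence semantics of any ISA listed above), a fence between the producer's item store and tail store rules out the one visibility order that breaks the consumer:

```python
from itertools import permutations

def consumer_sees_garbage(visible_order):
    """Commit the producer's two stores in `visible_order`, polling the
    consumer after each commit; True if it ever reads an unwritten slot."""
    mem = {"tail": 0, "buf": [None] * 4}
    for ev in visible_order:
        if ev == "store_item":
            mem["buf"][0] = "x"       # (1) Store 0(Rtail), x
        else:
            mem["tail"] = 1           # (2) Store 0(tail), Rtail
        # consumer polls here
        if mem["tail"] > 0 and mem["buf"][0] is None:
            return True               # tail advanced but the item is missing
    return False

events = ["store_item", "store_tail"]   # program order: (1) then (2)

# Without a fence, hardware may commit the two stores in either order:
unfenced_bad = any(consumer_sees_garbage(p) for p in permutations(events))
# A fence between (1) and (2) forces program order, the only safe order:
fenced_bad = consumer_sees_garbage(events)
print(unfenced_bad, fenced_bad)         # True False
```

The fence buys correctness by forbidding exactly one reordering, which is why its serialization cost is paid only where the ordering actually matters.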
Memory Coherence in SMPs

[Figure: CPU-1 and CPU-2 on a CPU–memory bus; cache-1, cache-2, and memory all hold A = 100.]

Suppose CPU-1 updates A to 200.
• write-back: memory and cache-2 have stale values
• write-through: cache-2 has a stale value

Do these stale values matter?
What is the view of shared memory for programming?
Write-back Caches & SC

prog T1:        prog T2:
ST X, 1         LD Y, R1
ST Y, 11        ST Y', R1
                LD X, R2
                ST X', R2

• T1 executed:           cache-1: X = 1, Y = 11;  memory: X = 0, Y = 10;  cache-2: empty
• cache-1 writes back Y: memory: X = 0, Y = 11
• T2 executed:           cache-2: Y = 11, Y' = 11, X = 0, X' = 0  (X read from stale memory)
• cache-1 writes back X: memory: X = 1, Y = 11
• cache-2 writes back:   memory: X' = 0, Y' = 11

X' and Y' are inconsistent: under sequential consistency, Y' = 11 would force X' = 1.
Write-through Caches & SC

prog T1:        prog T2:
ST X, 1         LD Y, R1
ST Y, 11        ST Y', R1
                LD X, R2
                ST X', R2

• Initially:   cache-1: X = 0, Y = 10;  memory: X = 0, Y = 10;  cache-2: X = 0 (cached earlier)
• T1 executed: cache-1: X = 1, Y = 11;  memory: X = 1, Y = 11;  cache-2 still holds the stale X = 0
• T2 executed: cache-2: Y = 11, Y' = 11, X = 0 (stale hit), X' = 0;  memory: X' = 0, Y' = 11

Write-through caches don't preserve sequential consistency either.
Maintaining Cache Coherence

§ Hardware support is required such that
– only one processor at a time has write permission for a location
– no processor can load a stale copy of the location after a write
⇒ cache coherence protocols
Cache Coherence vs. Memory Consistency

§ A cache coherence protocol ensures that all writes by one processor are eventually visible to other processors, for one memory address
– i.e., updates are not lost
§ A memory consistency model gives the rules on when a write by one processor can be observed by a read on another, across different addresses
– Equivalently, what values can be seen by a load
§ A cache coherence protocol is not enough to ensure sequential consistency
– But if the system is sequentially consistent, then caches must be coherent
§ The combination of a cache coherence protocol and the processor's memory reorder buffer is used to implement a given architecture's memory consistency model
Snoopy Cache (Goodman 1983)

§ Idea: Have the cache watch (or snoop upon) DMA transfers, and then "do the right thing"
§ Snoopy cache tags are dual-ported

[Figure: the cache's tags-and-state array has two ports — a processor-side port (address, data, R/W), also used to drive the memory bus when the cache is bus master, and a snoopy read port (address, R/W) attached to the memory bus.]
Shared Memory Multiprocessor

Use the snoopy mechanism to keep all processors' view of memory coherent.

[Figure: processors M1, M2, M3, each behind a snoopy cache, share a memory bus with physical memory and a DMA engine connected to disks.]
Snoopy Cache Coherence Protocols

write miss: the address is invalidated in all other caches before the write is performed
read miss: if a dirty copy is found in some cache, a write-back is performed before the memory is read
Cache State Transition Diagram: The MSI Protocol

Each cache line has state bits (M: Modified, S: Shared, I: Invalid) alongside its address tag.

Transitions for a line in processor P1's cache:
• I → M: write miss (P1 gets line from memory)
• I → S: read miss (P1 gets line from memory)
• S → M: P1 intent to write
• M → S: other processor reads (P1 writes back)
• M → I: other processor intent to write (P1 writes back)
• S → I: other processor intent to write
• M → M: P1 reads or writes
• S → S: read by any processor
Two Processor Example (reading and writing the same cache line)

P1 and P2 each run the MSI state machine on their own copy of the line; an "intent to write" from one processor invalidates the line in the other, and a read of a modified line forces a write-back.

Access sequence and resulting states:
• P1 reads:  read miss          → P1: S, P2: I
• P1 writes: P1 intent to write → P1: M, P2: I
• P2 reads:  P1 writes back     → P1: S, P2: S
• P2 writes: P2 intent to write → P1: I, P2: M
• P1 writes: P2 writes back     → P1: M, P2: I
• P2 writes: P1 invalidated     → P1: I, P2: M
• P1 reads:  P2 writes back     → P1: S, P2: S
• P1 writes: P1 intent to write → P1: M, P2: I
Observation

§ If a line is in the M state then no other cache can have a copy of the line!
§ Memory stays coherent: multiple differing copies cannot exist
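The observation can be checked with a minimal MSI simulator (my own Python sketch; the lecture gives no code). Two caches track one line; a write invalidates the other copy, and a read of a modified line forces a write-back to shared:

```python
class Cache:
    def __init__(self):
        self.state = "I"   # each line starts Invalid

def access(requester, other, op):
    """Apply one 'read' or 'write' by `requester` to the shared line."""
    if op == "read":
        if requester.state == "I":   # read miss
            if other.state == "M":
                other.state = "S"    # other processor writes back, keeps S copy
            requester.state = "S"
        # read hit in M or S: no state change
    else:                            # write miss or intent to write
        if other.state != "I":
            other.state = "I"        # invalidate (write back first if it was M)
        requester.state = "M"

p1, p2 = Cache(), Cache()
trace = [("P1", "read"), ("P1", "write"), ("P2", "read"), ("P2", "write"),
         ("P1", "write"), ("P2", "write"), ("P1", "read"), ("P1", "write")]
for who, op in trace:
    req, oth = (p1, p2) if who == "P1" else (p2, p1)
    access(req, oth, op)
    # Observation: M in one cache implies I in the other.
    assert not (req.state == "M" and oth.state != "I")
    print(who, op, "-> P1:", p1.state, " P2:", p2.state)
```

The assertion inside the loop encodes the invariant: whenever a cache holds the line in M, every other copy is invalid, so differing copies cannot coexist.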
MESI: An Enhanced MSI Protocol
(increased performance for private data)

Each cache line has state bits (M: Modified Exclusive, E: Exclusive but unmodified, S: Shared, I: Invalid) alongside its address tag.

Transitions for a line in processor P1's cache:
• I → M: write miss
• I → E: read miss, not shared
• I → S: read miss, shared
• E → M: P1 write (no bus transaction needed, since no other cache has a copy)
• E → S: other processor reads
• S → M: P1 intent to write
• M → S: other processor reads (P1 writes back)
• M → I: other processor intent to write (P1 writes back)
• E → I: other processor intent to write
• S → I: other processor intent to write
• M → M: P1 writes or reads
• E → E: P1 read
• S → S: read by any processor
Optimized Snoop with Level-2 Caches

[Figure: four CPUs, each with a private L1 and L2 cache; a snooper sits between each L2 and the shared bus.]

• Processors often have two-level caches
• small L1, large L2 (usually both on chip now)
• Inclusion property: entries in L1 must be in L2
  ⇒ invalidation in L2 ⇒ invalidation in L1
• Snooping on L2 does not affect CPU–L1 bandwidth

What problem could occur?
Intervention

[Figure: CPU-1's cache holds A = 200 (modified); CPU-2's cache is empty; memory holds stale data A = 100.]

When a read miss for A occurs in cache-2, a read request for A is placed on the bus:
• Cache-1 needs to supply the data and change its state to shared
• The memory may respond to the request also! Does memory know it has stale data?
• Cache-1 needs to intervene through the memory controller to supply the correct data to cache-2
False Sharing

state | line addr | data0 | data1 | ... | dataN

A cache line contains more than one word.
Cache coherence is done at the line level and not the word level.
Suppose M1 writes word_i and M2 writes word_k, and both words have the same line address. What can happen?
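What happens is that the line ping-pongs between the two caches even though the processors never touch the same word. A Python sketch of my own (line size and trace are illustrative choices) counts line-ownership changes:

```python
WORDS_PER_LINE = 8   # illustrative line size

def coherence_events(writes):
    """Count invalidations caused by a sequence of (cpu, word_address) writes,
    tracking ownership at line granularity, as a snoopy protocol does."""
    owner = {}           # line number -> cpu holding the line in M state
    invalidations = 0
    for cpu, addr in writes:
        line = addr // WORDS_PER_LINE
        if owner.get(line) not in (None, cpu):
            invalidations += 1          # line stolen from the other cache
        owner[line] = cpu
    return invalidations

# M1 writes word 0, M2 writes word 1: same line address -> false sharing.
same_line = [("M1", 0), ("M2", 1)] * 100
# Same access pattern, but the two words placed in different lines.
diff_line = [("M1", 0), ("M2", 8)] * 100

print(coherence_events(same_line))   # 199: the line ping-pongs on every write
print(coherence_events(diff_line))   # 0: each CPU keeps its own line in M
```

Padding or realigning data so that independently written words land in different lines removes the coherence traffic entirely, which is the standard fix for false sharing.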
Out-of-Order Loads/Stores & CC

Blocking caches: one request at a time + CC ⇒ SC
Non-blocking caches: multiple requests (to different addresses) concurrently + CC ⇒ relaxed memory models

CC ensures that all processors observe the same order of loads and stores to an address.

[Figure: a CPU with load/store buffers in front of a cache holding line states (I/S/E); a snooper on the CPU/memory interface exchanges shared/exclusive requests and replies (S-req, E-req, S-rep, E-rep), pushout write-backs (Wb-rep), and Wb-req, Inv-req, Inv-rep messages with memory.]
Acknowledgements

n These slides contain material developed and copyright by:
• Arvind (MIT)
• Krste Asanovic (MIT/UCB)
• Joel Emer (Intel/MIT)
• James Hoe (CMU)
• John Kubiatowicz (UCB)
• David Patterson (UCB)
• John Lazzaro (UCB)
n MIT material derived from course 6.823
n UCB material derived from courses CS152 and CS252