EECS 470 Lecture 24: Chip Multiprocessors and Simultaneous Multithreading
Source: web.eecs.umich.edu/~twenisch/470_F07/lectures/24.pdf
© Wenisch 2007 -- Portions © Falsafi, Hill, Hoe, Lipasti, Martin, Roth, Shen, Smith, Sohi, Tyson, Vijaykumar
EECS 470
Lecture 24
Chip Multiprocessors and Simultaneous Multithreading
Fall 2007
Prof. Thomas Wenisch
http://www.eecs.umich.edu/courses/eecs470
Slides developed in part by Profs. Falsafi, Hill, Hoe, Lipasti, Martin, Roth, Shen, Smith, Sohi, and Vijaykumar of Carnegie Mellon University, Purdue University, University of Pennsylvania, and University of Wisconsin.
Announcements
• HW 6 posted, due 12/7
• Project due 12/10; in-class presentations (~8 minutes + questions)
Base Snoopy Organization
[Figure: a processor-side controller and a bus-side (snoop) controller share the cache data RAM; tags and state are duplicated, one copy for the processor and one for the snoop, each with its own comparator; a write-back buffer holds the address, command, and data of an evicted block on its way to the system bus.]
Non-Atomic State Transitions
Operations involve multiple actions:
Look up cache tags
Bus arbitration
Check for writeback
Even if the bus is atomic, the overall set of actions is not
Race conditions among multiple operations
Suppose P1 and P2 attempt to write cached block A
Each decides to issue BusUpgr to allow S → M
Issues:
Must handle requests for other blocks while waiting to acquire the bus
Must handle requests for this block A
Non-Atomicity → Transient States
Two types of states:
• Stable (e.g., MESI)
• Transient or intermediate
Increases complexity
[Figure: MESI state diagram augmented with transient states S→M, I→M, and I→S,E; edges are labeled with processor actions (PrRd, PrWr) and bus requests/events (BusReq, BusGrant, BusRd, BusRd(S), BusRdX, BusUpgr, Flush, Flush′).]
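To make the BusUpgr race from the previous slide concrete, here is a minimal sketch of the transient-state logic, with illustrative names (not any real controller's code): a controller sitting in S→M that snoops another writer's BusRdX or BusUpgr must fall back to I→M, since its S copy has just been invalidated and a plain upgrade no longer suffices.

```python
# Minimal sketch of transient-state handling for the BusUpgr race
# (illustrative names and structure, not a real controller implementation).
class SnoopController:
    def __init__(self):
        self.state = {}  # block -> 'M','E','S','I' or transient 'S->M','I->M'

    def pr_write(self, block):
        """Processor write to a block held in S: request the bus."""
        if self.state.get(block, 'I') == 'S':
            self.state[block] = 'S->M'          # wait for bus grant

    def snoop(self, block, bus_event):
        """Observe other processors' bus requests while waiting."""
        if self.state.get(block) == 'S->M' and bus_event in ('BusRdX', 'BusUpgr'):
            # The other writer won the race: our S copy is now invalid,
            # so an upgrade is no longer enough -- we must re-fetch the data.
            self.state[block] = 'I->M'

    def bus_grant(self, block):
        """Bus acquired: issue the request matching our transient state."""
        request = 'BusUpgr' if self.state[block] == 'S->M' else 'BusRdX'
        self.state[block] = 'M'
        return request

# P1 and P2 both hold A in S and write "simultaneously"; P1 wins arbitration.
p1, p2 = SnoopController(), SnoopController()
p1.state['A'] = p2.state['A'] = 'S'
p1.pr_write('A'); p2.pr_write('A')      # both enter S->M
print(p1.bus_grant('A'))                # P1 granted first: 'BusUpgr'
p2.snoop('A', 'BusUpgr')                # P2 observes it, moves to I->M
print(p2.bus_grant('A'))                # P2 granted next: 'BusRdX'
```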
Scalability Problems of Snoopy Coherence
• Prohibitive bus bandwidth
Required bandwidth grows with # CPUs…
…but available bandwidth per bus is fixed
Adding busses makes serialization/ordering hard
• Prohibitive processor snooping bandwidth
All caches do a tag lookup when ANY processor accesses memory
Inclusion limits this to the L2, but still lots of lookups
• Upshot: bus-based coherence doesn't scale beyond 8–16 CPUs
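A back-of-the-envelope illustration of that bandwidth wall; both numbers below are assumptions chosen only to show the trend, not measurements from the slides:

```python
# Required bus traffic grows with CPU count; a single bus's bandwidth is fixed.
PER_CPU_TRAFFIC_GBS = 0.8   # assumed coherence/miss traffic per CPU (GB/s)
BUS_BANDWIDTH_GBS = 12.8    # assumed fixed bus bandwidth (GB/s)

for n in (2, 4, 8, 16, 32, 64):
    need = n * PER_CPU_TRAFFIC_GBS
    verdict = "fits" if need <= BUS_BANDWIDTH_GBS else "exceeds bus"
    print(f"{n:2d} CPUs: {need:5.1f} GB/s needed -> {verdict}")
# With these (made-up) numbers the bus saturates right around 16 CPUs.
```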
Scalable Cache Coherence
• Scalable cache coherence: two-part solution
• Part I: bus bandwidth
Replace the non-scalable bandwidth substrate (bus)…
…with a scalable-bandwidth one (point-to-point network, e.g., mesh)
• Part II: processor snooping bandwidth
Interesting: most snoops result in no action
Replace the non-scalable broadcast protocol (spam everyone)…
…with a scalable directory protocol (only spam processors that care)
Directory Coherence Protocols
• Observe: physical address space statically partitioned
+ Can easily determine which memory module holds a given line
That memory module is sometimes called the "home"
– Can't easily determine which processors have the line in their caches
Bus-based protocol: broadcast events to all processors/caches
± Simple and fast, but non-scalable
• Directories: non-broadcast coherence protocol
Extend memory to track caching information
For each physical cache line whose home this is, track:
Owner: which processor has a dirty copy (i.e., M state)
Sharers: which processors have clean copies (i.e., S state)
Processor sends coherence event to the home directory
Home directory only sends events to processors that care (see the sketch below)
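A minimal sketch of this bookkeeping; the field names, node count, and interleaving function are illustrative assumptions, not the slides' design:

```python
# Sketch of per-line directory state kept at the home node (illustrative).
from dataclasses import dataclass, field
from typing import Optional, Set

@dataclass
class DirEntry:
    owner: Optional[int] = None                 # processor with the dirty (M) copy
    sharers: Set[int] = field(default_factory=set)  # processors with clean (S) copies

# The home module for a physical address is a static function of the
# address; simple cache-line interleaving is one common choice.
NUM_NODES, LINE_BYTES = 16, 64

def home_node(paddr: int) -> int:
    return (paddr // LINE_BYTES) % NUM_NODES
```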
Read Processing
[Figure: Node #1 issues "Read A (miss)" to the directory at A's home; the directory records "A: Shared, #1" and returns the data to Node #1.]
Write Processing
[Figure: continuing the example, Node #2 now misses on a write to A; the directory invalidates Node #1's copy, and its entry changes from "A: Shared, #1" to "A: Mod., #2".]
Trade-offs:
• Longer accesses (3-hop between processor, directory, other processor)
• Lower bandwidth: no snoops necessary
Makes sense either for CMPs (lots of L1 miss traffic) or large-scale servers (shared-memory MP > 32 nodes)
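Continuing the DirEntry sketch from the directory slide, the read and write flows above might look like this at the home node; this is an illustrative simplification (no transient states, and the message functions are stubs I've added), not a real protocol engine:

```python
# Illustrative handlers at the home directory. The three functions below are
# placeholders standing in for real interconnect messages.
def send_data(node, line): pass
def invalidate(node, line): pass
def fetch_writeback(node, line): pass

def handle_read_miss(directory, line, requester):
    e = directory.setdefault(line, DirEntry())
    if e.owner is not None:                  # dirty elsewhere: 3-hop access
        fetch_writeback(e.owner, line)
        e.sharers.add(e.owner)
        e.owner = None
    e.sharers.add(requester)                 # e.g., "A: Shared, #1"
    send_data(requester, line)

def handle_write_miss(directory, line, requester):
    e = directory.setdefault(line, DirEntry())
    for p in e.sharers - {requester}:        # only spam processors that care
        invalidate(p, line)
    if e.owner not in (None, requester):
        fetch_writeback(e.owner, line)
    e.sharers.clear()
    e.owner = requester                      # e.g., "A: Mod., #2"
    send_data(requester, line)
```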
Chip Performance Scalability: History & Expectations
[Figure: log-scale plot of chip performance (in MIPS) vs. year, 1970–2010, tracing first micro → RISC → first superscalar → out-of-order → SMT → SMP on chip → cores → "NUMA on chip?"]
Goal: 1 Tera inst/sec by 2010!
Piranha: Performance and Managed Complexity
Large-scale server based on CMP nodes
CMP architecture:
excellent platform for exploiting thread-level parallelism
inherently emphasizes replication over monolithic complexity
Design methodology reduces implementation complexity:
novel simulation methodology
use ASIC physical design process
Piranha: 2x performance advantage with a team size of approximately 20 people
Piranha Processing Node
Next few slides from Luiz Barroso's ISCA 2000 presentation of "Piranha: A Scalable Architecture Based on Single-Chip Multiprocessing"
Piranha Processing Node (built up incrementally over the next several slides)
[Figure: eight CPUs, each with private L1 I$ and D$, connected by an intra-chip switch to shared L2 banks, memory controllers, two protocol engines, and a router.]
Alpha core: 1-issue, in-order, 500MHz
L1 caches: I&D, 64KB, 2-way
Intra-chip switch (ICS): 32GB/sec, 1-cycle delay
L2 cache: shared, 1MB, 8-way
Memory Controller (MC): RDRAM, 12.8GB/sec (8 banks @ 1.6GB/sec)
Protocol Engines (HE & RE): μprogrammed, 1K μinstructions, even/odd interleaving
System Interconnect: 4-port Xbar router, topology independent, 4 links @ 8GB/s, 32GB/sec total bandwidth
Single-Chip Piranha Performance
[Figure: normalized execution time, broken into CPU / L2 hit / L2 miss components, on OLTP and DSS for P1 (500 MHz, 1-issue), INO (1 GHz, 1-issue), OOO (1 GHz, 4-issue), and P8 (eight 500 MHz 1-issue cores).]
Piranha's performance margin: 3x for OLTP and 2.2x for DSS
Piranha has more outstanding misses → better utilizes the memory system
Performance And Utilization
• Performance (IPC) important
• Utilization (actual IPC / peak IPC) important too
• Even moderate superscalars (e.g., 4-way) not fully utilized
Average sustained IPC: 1.5–2 → <50% utilization, due to:
Mis-predicted branches
Cache misses, especially L2
Data dependences
• Multi-threading (MT)
Improve utilization by multiplexing multiple threads on a single CPU
One thread cannot fully utilize the CPU? Maybe 2, 4 (or 100) can
Latency vs Throughput
• MT trades (single-thread) latency for throughput
– Sharing the processor degrades latency of individual threads
+ But improves aggregate latency of both threads
+ Improves utilization
• Example (worked through in the sketch after this list)
Thread A: individual latency=10s, latency with thread B=15s
Thread B: individual latency=20s, latency with thread A=25s
Sequential latency (first A then B or vice versa): 30s
Parallel latency (A and B simultaneously): 25s
– MT slows each thread by 5s
+ But improves total latency by 5s
• Different workloads have different parallelism
SpecFP has lots of ILP (can use an 8-wide machine)
Server workloads have TLP (can use multiple threads)
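The example above, as plain arithmetic with the slide's own numbers:

```python
# MT costs each thread 5s of latency but wins 5s of aggregate latency.
a_alone, b_alone = 10, 20       # seconds, each thread running alone
a_mt, b_mt = 15, 25             # seconds, running together

sequential = a_alone + b_alone  # 30s: first A then B (or vice versa)
parallel = max(a_mt, b_mt)      # 25s: A and B simultaneously
print(sequential, parallel)     # 30 25
```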
Core Sharing
Time sharing:
Run one thread
On a long-latency operation (e.g., cache miss), switch
Also known as "switch-on-miss" multithreading
E.g., Niagara (UltraSPARC T1/T2)
Space sharing:
Across pipeline depth: fetch and issue each cycle from a different thread
Both across pipeline width and depth: fetch and issue each cycle from multiple threads; the policy deciding which to fetch gets complicated
Also known as "simultaneous" multithreading
E.g., Alpha 21464, IBM POWER5
Instruction Issue
[Figure: occupied vs. empty issue slots over time for a single-issue pipeline.]
Reduced function unit utilization due to dependencies
Superscalar Issue
[Figure: issue slots over time with multiple slots per cycle, many left empty.]
Superscalar leads to more performance, but lower utilization
Predicated Issue
[Figure: issue slots over time with predicated operations filling otherwise-empty slots.]
Adds to function unit utilization, but results are thrown away
Chip Multiprocessor
[Figure: issue slots over time, split across multiple cores.]
Limited utilization when only running one thread
Coarse-grain Multithreading
[Figure: issue slots over time, switching threads on long stalls.]
Preserves single-thread performance, but can only hide long latencies (i.e., main memory accesses)
Coarse-Grain Multithreading (CGMT)
+ Sacrifices very little single-thread performance (of one thread)
– Tolerates only long latencies (e.g., L2 misses)
Thread scheduling policy:
Designate a "preferred" thread (e.g., thread A)
Switch to thread B on a thread A L2 miss
Switch back to A when A's L2 miss returns
Pipeline partitioning: none, flush on switch
– Can't tolerate latencies shorter than twice the pipeline depth
Need a short in-order pipeline for good performance
Example: IBM Northstar/Pulsar (see the sketch below)
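A minimal sketch of the switch-on-miss policy just described; names and structure are illustrative, not Northstar's actual logic:

```python
# CGMT thread selection: stay on the preferred thread until it misses in L2.
def select_thread(preferred, threads, miss_pending):
    """threads: list of thread ids; miss_pending: set of stalled thread ids."""
    if preferred not in miss_pending:
        return preferred                    # common case: no switch
    for t in threads:                       # preferred thread missed: switch
        if t not in miss_pending:
            return t                        # (and flush the pipeline)
    return None                             # everyone stalled: idle

# e.g., thread 0 preferred but waiting on an L2 miss -> run thread 1
print(select_thread(0, [0, 1], miss_pending={0}))   # 1
```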
CGMT
[Figure: pipeline with per-thread register files and a thread scheduler; an "L2 miss?" signal from the D$ drives the thread switch. I$, branch predictor, and D$ are shared.]
Fine-Grained Multithreading
[Figure: issue slots over time, rotating to a different thread each cycle.]
Saturated workload → lots of threads
Unsaturated workload → lots of stalls
Intra-thread dependencies still limit performance
Fine-Grain Multithreading (FGMT)
– Sacrifices significant single-thread performance
+ Tolerates all latencies (e.g., L2 misses, mispredicted branches, etc.)
Thread scheduling policy:
Switch threads every cycle (round-robin), L2 miss or no (see the sketch below)
Pipeline partitioning:
Dynamic, no flushing
Length of pipeline doesn't matter
– Need a lot of threads
Extreme example: Denelcor HEP
So many threads (100+), it didn't even need caches
Failed commercially
Not popular today: many threads → many register files
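A matching sketch of FGMT's cycle-by-cycle round-robin selection (illustrative only):

```python
# FGMT: a different thread fetches every cycle, round-robin over ready threads.
def next_thread(ready, last):
    """ready: list of booleans, one per thread; last: thread fetched last cycle."""
    n = len(ready)
    for step in range(1, n + 1):
        t = (last + step) % n
        if ready[t]:
            return t
    return None

# 4 threads, thread 2 stalled, thread 1 fetched last cycle -> thread 3 next
print(next_thread([True, True, False, True], last=1))   # 3
```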
Fine-Grain Multithreading
• FGMT: (many) more threads, multiple threads in the pipeline at once
[Figure: pipeline with many per-thread register files selected by the thread scheduler, in front of shared I$, D$, and branch predictor.]
Simultaneous Multithreading
[Figure: issue slots over time, filled by multiple threads within the same cycle.]
Maximum utilization of function units by independent operations
Simultaneous Multithreading (SMT)
• Can we multithread an out-of-order machine?
Don't want to give up performance benefits
Don't want to give up natural tolerance of D$ (L1) miss latency
• Simultaneous multithreading (SMT)
+ Tolerates all latencies (e.g., L2 misses, mispredicted branches)
± Sacrifices some single-thread performance
Thread scheduling policy: round-robin (just like FGMT)
Pipeline partitioning: dynamic, hmmm…
Example: Pentium4 (hyper-threading): 5-way issue, 2 threads
Another example: Alpha 21464: 8-way issue, 4 threads (canceled)
Simultaneous Multithreading (SMT)
• SMT: replicate the map table, share the physical register file
[Figure: pipeline with per-thread map tables and a thread scheduler in front of a shared physical register file, I$, D$, and branch predictor.]
Issues for SMT
• Cache interference
General concern for all MT variants
Can the working sets of multiple threads fit in the caches?
Shared-memory SPMD threads help here:
+ Same insns → share I$
+ Shared data → less D$ contention
MT is good for "server" workloads
To keep miss rates low, SMT might need a larger L2 (which is OK)
Out-of-order tolerates L1 misses
• Large map table and physical register file
#mt-entries = (#threads * #arch-regs)
#phys-regs = (#threads * #arch-regs) + #in-flight insns
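Plugging in illustrative, loosely Alpha-like numbers (the values are assumptions, the formulas are the slide's):

```python
# Sizing the map tables and physical register file per the formulas above.
threads, arch_regs, in_flight = 4, 32, 256     # assumed machine parameters

mt_entries = threads * arch_regs               # 128 map-table entries
phys_regs = threads * arch_regs + in_flight    # 384 physical registers
print(mt_entries, phys_regs)                   # 128 384
```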
SMT Resource Partitioning
• How are the ROB/MOB and RS partitioned in SMT?
Depends on what you want to achieve
• Static partitioning
Divide ROB/MOB, RS into T static equal-sized partitions
+ Ensures that low-IPC threads don't starve high-IPC ones
– Low utilization: low-IPC threads stall and occupy ROB/MOB, RS slots
• Dynamic partitioning
Divide ROB/MOB, RS into dynamically resizing partitions
Let threads fight amongst themselves
+ High utilization
– Possible starvation
ICOUNT: fetch policy prefers the thread with the fewest in-flight insns
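The two partitioning policies reduce to different allocation checks. A minimal sketch, assuming a 64-entry shared structure and 4 threads (both numbers illustrative):

```python
# Static vs. dynamic partitioning of a shared structure (e.g., the RS).
TOTAL_ENTRIES, NUM_THREADS = 64, 4

def can_allocate_static(used, t):
    # Hard per-thread quota: no starvation, but stalled threads waste slots.
    return used[t] < TOTAL_ENTRIES // NUM_THREADS

def can_allocate_dynamic(used, t):
    # Shared pool: high utilization, but one thread can starve the others.
    return sum(used) < TOTAL_ENTRIES

used = [16, 2, 1, 0]   # thread 0 is stalled, holding 16 entries
print(can_allocate_static(used, 0))   # False: thread 0 hit its quota
print(can_allocate_dynamic(used, 0))  # True: it can keep hogging entries
```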
Power Implications of MT
• Is MT (of any kind) power efficient?
Static power? Yes: it's dissipated regardless of utilization
Dynamic power? Less clear, but probably yes
Highly utilization dependent
Major factor is additional cache activity
Some debate here
Overall? Yes, and static power's relative share keeps increasing
SMT vs. CMP
• If you wanted to run multiple threads, would you build a…
Chip multiprocessor (CMP): multiple separate pipelines?
A multithreaded processor (SMT): a single larger pipeline?
• Both will get you throughput on multiple threads
CMP will be simpler, possibly faster clock
SMT will get you better performance (IPC) on a single thread
SMT is basically an ILP engine that converts TLP to ILP
CMP is mainly a TLP engine
• Again, do both
Sun's Niagara (UltraSPARC T1)
8 processors, each with 4 threads (coarse-grained threading)
1GHz clock, in-order, short pipeline (6 stages or so)
Designed for power-efficient "throughput computing"
Niagara: 32 Threads on Chip
8 six-stage in-order pipelines
Each pipeline, 4-way fine-grain multithreaded
Small L1 write-through caches
Shared L2
Screams for Web, OLTP
Shipping @ 1.2GHz clock
Alpha 21464 Architecture Overview [slides from Shubu Mukherjee @ Intel]
8-wide out-of-order superscalar
Large on-chip L2 cache
Direct RAMBUS interface
On-chip router for system interconnect
Glueless, directory-based, ccNUMA for up to 512-way multiprocessing
4-way simultaneous multithreading (SMT)
Basic Out-of-order Pipeline
[Figure: Fetch → Decode/Map → Queue → Reg Read → Execute → Dcache/Store Buffer → Reg Write → Retire, with PC, Icache, Register Map, Regs, and Dcache underneath.]
Every stage is thread-blind.
SMT Pipeline
[Figure: the same stages, but with multiple PCs and register maps feeding the shared queue, register files, and caches.]
Changes for SMT
Basic pipeline – unchanged
Replicated resources:
Program counters
Register maps
Shared resources:
Register file (size increased)
Instruction queue
First and second level caches
Translation buffers
Branch predictor
Fetch Policy (Key)
Simple fetch (e.g., round-robin) policies don't work. Why?
• Instructions from slow threads hog the pipeline
• Once instructions are placed in the IQ, they cannot be removed until they execute
Fetch optimizations:
• i-count: favor faster threads
• Count retirement rate and bias accordingly
• Must avoid starvation (tricky); see the sketch below
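A minimal sketch of the i-count heuristic just mentioned (illustrative, not the 21464's implementation):

```python
# ICOUNT: fetch from the thread with the fewest instructions in flight.
# Fast-moving threads have low counts, so they win fetch slots; stalled
# threads stop accumulating IQ entries, which also limits starvation.
def icount_pick(in_flight):
    return min(range(len(in_flight)), key=lambda t: in_flight[t])

print(icount_pick([12, 3, 30, 9]))   # thread 1 fetches this cycle
```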
Multiprogrammed Workload
[Figure: bar chart of relative performance for 1T–4T on SpecInt, SpecFP, and mixed Int/FP workloads; y-axis 0–250%.]
Decomposed SPEC95 Applications
[Figure: bar chart of relative performance for 1T–4T on Turb3d, Swm256, and Tomcatv; y-axis 0–250%.]
Multithreaded Applications
[Figure: bar chart of relative performance for 1T, 2T, and 4T on Barnes, Chess, Sort, and TP; y-axis 0–300%.]