Multi-This, Multi-That, …
Transcript of Multi-This, Multi-That, …
Limits on IPC
• Lam92
  – This paper focused on the impact of control flow on ILP
  – Speculative execution can expose 10-400 IPC
  – assumes no machine limitations except for control dependencies and actual dataflow dependencies
• Wall91
  – This paper looked at limits more broadly
    • No branch prediction, no register renaming, no memory disambiguation: 1-2 IPC
    • ∞-entry bpred, 256 physical registers, perfect memory disambiguation: 4-45 IPC
    • perfect bpred, register renaming and memory disambiguation: 7-60 IPC
  – This paper did not consider “control independent” instructions
Practical Limits
• Today, 1-2 IPC sustained
  – far from the 10’s-100’s reported by limit studies
• Limited by:
  – branch prediction accuracy
  – underlying DFG (influenced by algorithms, compiler)
  – memory bottleneck
  – design complexity (implementation, test, validation, manufacturing, etc.)
  – power
  – die area
Differences Between Real Hardware and Limit Studies?
• Real branch predictors aren’t 100% accurate
• Memory disambiguation is not perfect
• Physical resources are limited
  – can’t have infinite register renaming w/o an infinite PRF
  – would need an infinite-entry ROB, RS and LSQ
  – would need 10’s-100’s of execution units for 10’s-100’s of IPC
• Bandwidth/latencies are limited
  – studies assumed single-cycle execution
  – infinite fetch/commit bandwidth
  – infinite memory bandwidth (perfect caching)
Bridging the Gap
[Figure: log-scale IPC (1, 10, 100) for Single-Issue Pipelined, Superscalar Out-of-Order (Today), a hypothetical aggressive Superscalar Out-of-Order, and the Limits. Diminishing returns w.r.t. larger instruction window and higher issue-width; power (Watts) has been growing exponentially as well.]
Past the Knee of the Curve?
[Figure: Performance vs. “Effort” curve through Scalar In-Order, Moderate-Pipe Superscalar/OOO, and Very-Deep-Pipe Aggressive Superscalar/OOO. It made sense to go Superscalar/OOO (good ROI), but past the knee of the curve there is very little gain for substantial effort.]
So how do we get more Performance?
• Keep pushing IPC and/or frequency?
  – possible, but too costly
    • design complexity (time to market), cooling (cost), power delivery (cost), etc.
• Look for other parallelism
  – ILP/IPC: fine-grained parallelism
  – Multi-programming: coarse-grained parallelism
    • assumes multiple user-visible processing elements
    • all parallelism up to this point was user-invisible
User Visible/Invisible
• All microarchitecture performance gains up to this point were “free”
  – in that no user intervention was required beyond buying the new processor/system
  – recompilation/rewriting could provide even more benefit, but you get some even if you do nothing
• Multi-processing pushes the problem of finding the parallelism above the ISA interface
Workload Benefits
[Figure: runtime comparison with two tasks. A 4-wide OOO CPU running Tasks A and B back-to-back is only slightly faster than a 3-wide OOO CPU (small benefit); two 3-wide OOO CPUs run Tasks A and B in parallel and finish much sooner; even two 2-wide OOO CPUs beat the single wider core. This assumes you have two tasks/programs to execute…]
… If Only One Task
[Figure: the same configurations with only Task A to run. The 4-wide CPU gives a small benefit over the 3-wide; with two 3-wide CPUs, one sits idle (no benefit over 1 CPU); with two 2-wide CPUs, Task A runs on a single narrower core: performance degradation!]
Sources of (Coarse) Parallelism
• Different applications
  – MP3 player in background while you work on Office
  – Other background tasks: OS/kernel, virus check, etc.
  – Piped applications
    • gunzip -c foo.gz | grep bar | perl some-script.pl
• Within the same application
  – Java (scheduling, GC, etc.)
  – Explicitly coded multi-threading
    • pthreads, MPI, etc.
(Execution) Latency vs. Bandwidth
• Desktop processing
  – typically want an application to execute as quickly as possible (minimize latency)
• Server/Enterprise processing
  – often throughput oriented (maximize bandwidth)
  – latency of an individual task is less important
    • ex. Amazon processing thousands of requests per minute: it’s OK if an individual request takes a few seconds longer, so long as the total number of requests is processed in time
Benefit of MP Depends on Workload
• Limited number of parallel tasks to run on a PC
  – adding more CPUs than tasks provides zero performance benefit
• Even for parallel code, Amdahl’s Law will likely result in sub-linear speedup
[Figure: runtime bars for 1, 2, 3 and 4 CPUs; only the parallelizable portion shrinks as CPUs are added, while the serial portion stays fixed.]
• In practice, the parallelizable portion may not be evenly divisible
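Amdahl’s Law makes the sub-linear speedup concrete. A small sketch (the 75% parallel fraction is an assumed example, not from the slides):

```python
def amdahl_speedup(p, n):
    """Speedup with parallel fraction p on n CPUs: 1 / ((1 - p) + p / n)."""
    return 1.0 / ((1.0 - p) + p / n)

# Assume 75% of the work is parallelizable.
for n in (1, 2, 3, 4):
    print(n, round(amdahl_speedup(0.75, n), 2))
# prints 1.0, 1.6, 2.0, 2.29 -- already well short of 4x on 4 CPUs
```

Even with an infinite number of CPUs, the speedup here is capped at 1/(1-0.75) = 4x.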
Basic Models for Parallel Programs
• Shared Memory
  – there’s some portion of memory (maybe all) which is shared among all cooperating processes (threads)
  – communicate by reading/writing to shared locations
• Message Passing
  – no memory is shared
  – explicitly send a message to/from threads to communicate data/control
Shared Memory Model
• That’s basically it…
  – need to fork/join threads, synchronize (typically with locks)
[Figure: CPU0 writes X and CPU1 reads X through a shared Main Memory.]
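A minimal shared-memory sketch using Python threads in place of the two CPUs (the variable name `X` matches the figure; the lock stands in for the synchronization the slide mentions):

```python
import threading

X = 0                       # the shared location from the figure
lock = threading.Lock()
result = []

def writer():
    global X
    with lock:              # synchronize: only one thread touches X at a time
        X = 42              # "Write X"

def reader():
    with lock:
        result.append(X)    # "Read X" observes the shared location

t0 = threading.Thread(target=writer)
t1 = threading.Thread(target=reader)
t0.start(); t0.join()       # join before reading so the write is visible
t1.start(); t1.join()
print(result)               # [42]
```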
Cache Coherency
[Figure: CPU0 writes X and CPU1 reads X, each through a private L2. After the write, other copies (if any) are now stale: the protocol must either invalidate or update the other copies. Normally main memory need not be modified; requests are serviced by other CPUs (memory is only updated on evict/writeback). CPU0 has the most recent value, so it services the load request instead of going to main memory.]
Cache Coherency Protocols
• Not covered in this course
  – Milos’ 8803 next semester will cover this in more detail
• Many different protocols
  – different number of states
  – different bandwidth/performance/complexity tradeoffs
  – current protocols usually referred to by their states
    • ex. MESI, MOESI, etc.
Message Passing Protocols
• Explicitly send data from one thread to another
  – need to track the IDs of other CPUs
  – broadcast may need multiple send’s
  – each CPU has its own memory space
• Hardware: send/recv queues between CPUs
[Figure: CPU0’s Send queue feeding CPU1’s Recv queue.]
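The send/recv queues can be sketched with Python threads standing in for the two CPUs (the queue object models the hardware queue between them; nothing is shared except the channel itself):

```python
import queue
import threading

to_cpu1 = queue.Queue()     # the send/recv queue from CPU0 to CPU1
received = []

def cpu0():
    to_cpu1.put(("X", 42))          # Send(X, 42): no shared variables

def cpu1():
    name, value = to_cpu1.get()     # Recv(): blocks until a message arrives
    received.append((name, value))

t1 = threading.Thread(target=cpu1)
t0 = threading.Thread(target=cpu0)
t1.start(); t0.start()
t0.join(); t1.join()
print(received)                     # [('X', 42)]
```

A broadcast would need one `put` per destination queue, matching the “broadcast may need multiple send’s” bullet above.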
You Can Fake One on the Other
[Figure: Faking shared memory on message-passing CPUs: CPU0 writes X=42 into its own Memory0 and also Send(X,42) to CPU1; CPU1’s Recv() learns that “X” was written and updates Memory1. Apart from communication delays, each individual memory space has the same contents.]
[Figure: Faking message passing on shared-memory CPUs: common memory is partitioned into “Memory0” and “Memory1”; CPU0’s Send(42) becomes Write Q1=42 and CPU1’s Recv() becomes Read Q1 on an agreed-upon queue location.]
Shared Memory Focus
• Most small-medium multi-processors (these days) use some sort of shared memory
  – shared memory doesn’t scale as well to larger numbers of nodes
    • communications are broadcast based
    • bus becomes a severe bottleneck
  – message passing doesn’t need a centralized bus
    • can arrange the multi-processor like a graph
      – nodes = CPUs, edges = independent links/routes
    • can have multiple communications/messages in transit at the same time
SMP Machines
• SMP = Symmetric Multi-Processing
  – Symmetric = all CPUs are “equal”
  – Equal = any process can run on any CPU
    • contrast with older parallel systems with a master CPU and multiple worker CPUs
[Figure: four equal CPUs, CPU0-CPU3. Pictures found from Google Images.]
Hardware Modifications for SMP
• Processor
  – mainly support for cache coherence protocols
    • includes caches, write buffers, LSQ
    • control complexity increases, as memory latencies may be substantially more variable
• Motherboard
  – multiple sockets (one per CPU)
  – datapaths between CPUs and the memory controller
• Other
  – Case: larger for a bigger mobo, better airflow
  – Power: bigger power supply for N CPUs
  – Cooling: need to remove N CPUs’ worth of heat
Chip-Multiprocessing
• Simple SMP on the same chip
[Figure: Intel “Smithfield” block diagram and AMD dual-core Athlon FX. Pictures found from Google Images.]
Shared Caches
• Resources can be shared between CPUs
  – ex. IBM POWER5
[Figure: CPU0 and CPU1 share the L2 cache (no need to keep two copies coherent); the L3 cache is also shared (only tags are on-chip; data are off-chip).]
Benefits?
• Cheaper than mobo-based SMP
  – all/most interface logic integrated onto the main chip (fewer total chips, single CPU socket, single interface to main memory)
  – less power than mobo-based SMP as well (communication on-die is more power-efficient than chip-to-chip communication)
• Performance
  – on-chip communication is faster
• Efficiency
  – potentially better use of hardware resources than trying to make a wider/more-OOO single-threaded CPU
Performance vs. Power
• 2x CPUs is not necessarily 2x performance
• 2x CPUs ⇒ ½ power for each
  – maybe a little better than ½ if resources can be shared
• Back-of-the-envelope calculation:
  – 3.8 GHz CPU at 100W
  – Dual-core: 50W per CPU
  – P ∝ V³: V_orig³/V_CMP³ = 100W/50W ⇒ V_CMP ≈ 0.8 V_orig
  – f ∝ V: f_CMP ≈ 3.0 GHz
  – Benefit of SMP: full power budget per socket!
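The back-of-the-envelope calculation above, worked out numerically (P ∝ V³ follows from P ∝ f·V² with f ∝ V; the 3.8 GHz / 100 W / 50 W figures are the slide’s):

```python
# CMP power scaling: halve the per-core power budget, solve for V and f.
f_orig, p_orig = 3.8, 100.0          # GHz, Watts (single core)
p_cmp = 50.0                         # Watts per core in the dual-core budget

v_ratio = (p_cmp / p_orig) ** (1 / 3)   # V_CMP / V_orig, since P ∝ V^3
f_cmp = f_orig * v_ratio                # frequency scales linearly with V

print(round(v_ratio, 2))   # ~0.79, the slide's "0.8 V_orig"
print(round(f_cmp, 1))     # ~3.0 GHz per core
```

So two cores at ~3.0 GHz fit in the same 100 W envelope as one core at 3.8 GHz.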
Simultaneous Multi-Threading
• Uni-Processor: 4-6 wide, lucky if you get 1-2 IPC
  – poor utilization
• SMP: 2-4 CPUs, but need independent tasks
  – else poor utilization as well
• SMT: the idea is to use a single large uni-processor as a multi-processor
SMT (2)
[Figure: a regular CPU; a CMP at 2x HW cost; SMT (4 threads) at approximately 1x HW cost.]
Overview of SMT Hardware Changes
• For an N-way (N threads) SMT, we need:
  – Ability to fetch from N threads
  – N sets of registers (including PCs)
  – N rename tables (RATs)
  – N virtual memory spaces
• But we don’t need to replicate the entire OOO execution engine (schedulers, execution units, bypass networks, ROBs, etc.)
SMT Fetch
• Duplicate fetch logic
[Figure: PC0-PC2 each with its own fetch unit into a shared I$, feeding Decode, Rename, Dispatch and the RS.]
• Cycle-multiplexed fetch logic
[Figure: PC0-PC2 muxed by (cycle % N) into a single fetch unit and the I$, then Decode, etc., and the RS.]
• Alternatives
  – Other-multiplexed fetch logic
  – Duplicate the I$ as well
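The cycle-multiplexed option can be sketched as a toy model (the PC values and straight-line fetch are assumed for illustration):

```python
# Cycle-multiplexed SMT fetch: each cycle, the single fetch unit
# selects thread (cycle % N) and fetches from that thread's PC.
N = 3
pcs = [0x400, 0x800, 0xC00]      # hypothetical per-thread PCs
fetched = []                     # (thread-id, fetch address) per cycle

for cycle in range(6):
    tid = cycle % N              # round-robin thread select
    fetched.append((tid, pcs[tid]))
    pcs[tid] += 4                # advance that thread's PC (no branches)

print(fetched[:3])   # [(0, 1024), (1, 2048), (2, 3072)]
```

Real designs often replace the pure round-robin select with a policy such as ICOUNT that favors threads with fewer instructions in flight.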
SMT Rename
• Thread #1’s R12 != Thread #2’s R12
  – separate name spaces
  – need to disambiguate
[Figure: two options. Left: one RAT per thread (RAT0, RAT1), indexed by Thread0’s or Thread1’s register number, both mapping into a shared PRF. Right: a single RAT indexed by the concatenation of Thread-ID and register number, mapping into the PRF.]
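The single-RAT option can be sketched with a table keyed on (thread-id, architected register), modeling the Thread-ID/register-number concatenation (the table sizes and free-list policy here are illustrative assumptions):

```python
# Single RAT indexed by (thread-id, arch reg): Thread 0's R12 and
# Thread 1's R12 land in different entries, hence different PRF registers.
rat = {}                           # (tid, arch_reg) -> physical register
free_list = list(range(64))        # assumed 64-entry PRF

def rename(tid, arch_reg):
    preg = free_list.pop(0)        # allocate a fresh physical register
    rat[(tid, arch_reg)] = preg    # key = "concat" of thread-id and reg #
    return preg

p0 = rename(0, 12)    # Thread 0's R12
p1 = rename(1, 12)    # Thread 1's R12: separate name space
print(p0 != p1)       # True
```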
SMT Issue, Exec, Bypass, …
• No change needed
[Figure: after renaming, both threads’ instructions share RS entries. Thread 0’s code — Add R1 = R2 + R3; Sub R4 = R1 - R5; Xor R3 = R1 ^ R4; Load R2 = 0[R3] — becomes Add T12 = T20 + T8; Sub T19 = T12 - T16; Xor T14 = T12 ^ T19; Load T23 = 0[T14]. Thread 1’s identical code becomes Add T17 = T29 + T3; Sub T5 = T17 - T2; Xor T31 = T17 ^ T5; Load T25 = 0[T31].]
SMT Cache
• Each process has its own virtual address space
  – TLB must be thread-aware
    • translate (thread-id, virtual page) → physical page
  – Virtual portion of caches must also be thread-aware
    • a VIVT cache must now be (virtual addr, thread-id)-indexed and (virtual addr, thread-id)-tagged
    • similar for a VIPT cache
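A thread-aware TLB lookup can be sketched as a table keyed on (thread-id, virtual page), so the same virtual page in two threads can translate to different frames (the page numbers below are made up for illustration):

```python
# Thread-aware TLB: the lookup key includes the thread-id.
tlb = {}                            # (tid, vpage) -> ppage

def tlb_fill(tid, vpage, ppage):
    tlb[(tid, vpage)] = ppage

def translate(tid, vpage):
    return tlb.get((tid, vpage))    # None models a TLB miss

tlb_fill(0, 0x10, 0xA0)   # thread 0: vpage 0x10 -> frame 0xA0
tlb_fill(1, 0x10, 0xB7)   # thread 1: same vpage, different frame

print(translate(0, 0x10) != translate(1, 0x10))   # True
```

Without the thread-id in the key, thread 1’s fill would silently evict thread 0’s mapping and translations would alias across address spaces.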
SMT Commit
• One “Commit PC” per thread
• Register file management
  – ARF/PRF organization: need one ARF per thread
  – Unified PRF: need one “architected RAT” per thread
• Need to maintain interrupts, exceptions, faults on a per-thread basis
  – just as OOO needs to appear to the outside world that it is in-order, SMT needs to appear as if it is actually N CPUs
SMT Design Space
• Number of threads
• Full-SMT vs. hard-partitioned SMT
  – full-SMT: ROB entries can be allocated arbitrarily between the threads
  – hard-partitioned: if only one thread, use all ROB entries; if two threads, each is limited to one half of the ROB (even if the other thread uses only a few entries); possibly similar for RS, LSQ, PRF, etc.
• Amount of duplication
  – Duplicate I$, D$, fetch engine, decoders, schedulers, etc.?
  – There’s a continuum of possibilities between SMT and CMP
    • ex. could have a CMP where the FP unit is shared SMT-style
SMT Performance
• When it works, it fills idle “issue slots” with work from other threads; throughput improves
• But sometimes it can cause performance degradation!
  Time(finish one task, then do the other) < Time(do both at the same time using SMT)
How?
• Cache thrashing
[Figure: Thread0 just fits in the Level-1 caches (I$, D$) and executes reasonably quickly due to high cache hit rates. After a context switch, Thread1 also fits nicely in the caches. Under SMT, the caches were just big enough to hold one thread’s data, but not two threads’ worth: now both threads have significantly higher cache miss rates and spill into the L2.]
This is all combinable
• Can have a system that supports SMP, CMP and SMT at the same time
• Take a dual-socket SMP motherboard…
• Insert two chips, each a dual-core CMP…
• …where each core supports two-way SMT
• This example provides 8 threads’ worth of execution, shared on 4 actual “cores”, split across two physical packages
OS Confusion
• SMT/CMP is supposed to look like multiple CPUs to the software/OS
[Figure: two cores (either SMP/CMP), each 2-way SMT, appear to the OS as CPU0-CPU3. Say the OS has two tasks to run and schedules them to (virtual) CPUs: if tasks A and B land on the two SMT contexts of the same core (A/B sharing one core, the other core idle), performance is worse than if SMT were turned off and the 2-way SMP alone were used.]
OS Confusion (2)
• Asymmetries in the MP hierarchy can be very difficult for the OS to deal with
  – need to break the abstraction: the OS needs to know which CPUs are real physical processors (SMP), which are shared in the same package (CMP), and which are virtual (SMT)
  – Distinct applications should be scheduled to physically different CPUs
    • no cache contention, no power contention
  – Cooperative applications (different threads of the same program) should maybe be scheduled to the same physical chip (CMP)
    • reduce latency of inter-thread communication; possibly reduce duplication if a shared L2 is used
  – Use SMT as the last choice
Multi-* is Happening
• Just a question of exactly how: number of cores, support for SMT/SoeMT, asymmetric cores, etc.?