Advanced Topics in Pipelining - SMT and Single-Chip Multiprocessor
Priya Govindarajan, CMPE 200
Introduction
Researchers have proposed two alternative microarchitectures that exploit multiple threads of control: simultaneous multithreading (SMT) [1] and chip multiprocessors (CMP) [2].
CMP vs. SMT
Why software and hardware trends will favor the CMP microarchitecture.
Conclusions from the performance comparison of simulated superscalar, SMT, and CMP microarchitectures.
SMT Discussion Outline
Introduction
Multithreading (MT)
Approaches to multithreading
Motivation for introducing SMT
Implementation of SMT
CPU performance estimates
Architectural abstraction
Introduction to SMT
SMT processors augment wide superscalar processors (those issuing many instructions at once) with hardware that allows the processor to execute instructions from multiple threads of control concurrently.
Dynamically selecting and executing instructions from many active threads simultaneously.
Higher utilization of the processor’s execution resources
Provides latency tolerance in case a thread stalls due to cache misses or data dependencies.
When multiple threads are not available, however, the SMT simply looks like a conventional wide-issue superscalar.
Introduction to SMT
SMT uses the insight that a dynamically scheduled processor already has many of the hardware mechanisms needed to support the integrated exploitation of TLP through multithreading.
Multithreading can be built on top of an out-of-order processor by adding per-thread register renaming and program counters, and by providing the capability for instructions from multiple threads to commit.
Multithreading: Exploiting Thread-Level Parallelism
Multithreading allows multiple threads to share the functional units of a single processor in an overlapping fashion.
The processor must duplicate the independent state of each thread (register file, PC, page table).
Memory can be shared through the virtual memory mechanisms, which already support multiprocessing.
Hardware support is needed for switching between threads.
Multithreading….
Two main approaches to multithreading:
Fine-grained multithreading
Coarse-grained multithreading
Fine-grained vs. coarse-grained multithreading
Fine-grained: switches between threads on each instruction, interleaving their execution. Interleaving is typically round-robin, skipping any threads that are stalled.
Coarse-grained: switches threads only on costly stalls.
Fine-grained multithreading
Advantages: hides the throughput losses that arise from both short and long stalls.
Disadvantages: slows down the execution of an individual thread, since a thread that is ready to execute without stalls is delayed by instructions from other threads.
Coarse-grained multithreading
Advantages: relieves the need for thread switching to be essentially free, and is much less likely to slow down the execution of an individual thread.
Coarse-grained multithreading
Disadvantages: throughput losses, especially from shorter stalls. Because coarse-grained multithreading issues instructions from a single thread, the pipeline must be emptied or frozen when a stall occurs, and a new thread that begins executing after the stall must fill the pipeline before instructions can complete.
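The trade-off above can be sketched with a toy scheduler (my own illustration, not from the slides: the two threads, the 3-cycle stall, and the 2-cycle refill penalty are all assumed numbers):

```c
/* Toy single-issue pipeline model contrasting fine-grained and
 * coarse-grained multithreading. stall[t][i] > 0 means instruction i
 * of thread t stalls the thread for that many extra cycles. */
#include <assert.h>

#define NTHREADS 2
#define NINSTR   4

static const int stall[NTHREADS][NINSTR] = {
    {0, 3, 0, 0},   /* thread 0: one 3-cycle stall (e.g. a cache miss) */
    {0, 0, 0, 0},   /* thread 1: never stalls */
};

/* Fine-grained: every cycle, issue from the next ready thread
 * (round-robin, skipping stalled threads). Returns total cycles. */
int fine_grained_cycles(void) {
    int pc[NTHREADS] = {0}, busy_until[NTHREADS] = {0};
    int cycle = 0, done = 0, t = 0;
    while (done < NTHREADS * NINSTR) {
        for (int tried = 0; tried < NTHREADS; tried++) {
            int cand = (t + tried) % NTHREADS;
            if (pc[cand] < NINSTR && busy_until[cand] <= cycle) {
                busy_until[cand] = cycle + 1 + stall[cand][pc[cand]++];
                done++;
                break;
            }
        }
        t = (t + 1) % NTHREADS;   /* rotate the round-robin pointer */
        cycle++;
    }
    return cycle;
}

/* Coarse-grained: stay on one thread; on a costly stall (or when the
 * thread finishes) switch, paying an assumed 2-cycle pipeline refill. */
int coarse_grained_cycles(void) {
    const int REFILL = 2;
    int pc[NTHREADS] = {0}, busy_until[NTHREADS] = {0};
    int cycle = 0, done = 0, t = 0;
    while (done < NTHREADS * NINSTR) {
        if (pc[t] < NINSTR && busy_until[t] <= cycle) {
            busy_until[t] = cycle + 1 + stall[t][pc[t]++];
            done++;
            cycle++;
            if (busy_until[t] > cycle) {       /* stall: switch threads */
                t = (t + 1) % NTHREADS;
                cycle += REFILL;
            }
        } else if (pc[t] == NINSTR) {          /* finished: switch */
            t = (t + 1) % NTHREADS;
            cycle += REFILL;
        } else {
            cycle++;                           /* pipeline frozen on stall */
        }
    }
    return cycle;
}
```

In this toy run, fine-grained interleaving hides the 3-cycle stall entirely (8 instructions in 8 cycles), while coarse-grained switching pays refill penalties and finishes later.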
Simultaneous Multithreading
A variation on multithreading that uses the resources of a multiple-issue, dynamically scheduled processor to exploit TLP at the same time it exploits ILP.
Why? Modern multiple-issue processors often have more functional-unit parallelism available than a single thread can effectively use.
With register renaming and dynamic scheduling, multiple instructions from independent threads can be issued without any dependences among them.
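A minimal sketch of this slot-filling idea (hypothetical numbers: a 4-wide issue stage, where each thread exposes only a few independent instructions per cycle):

```c
/* Sketch of SMT issue-slot filling. ready[t] is the number of
 * independent instructions thread t can offer this cycle. */
#include <assert.h>

#define WIDTH 4   /* assumed issue width */

/* A single thread alone can fill at most its own ready count. */
int slots_filled_single(const int *ready, int nthreads) {
    (void)nthreads;
    return ready[0] < WIDTH ? ready[0] : WIDTH;
}

/* SMT fills the same slots from all threads together. */
int slots_filled_smt(const int *ready, int nthreads) {
    int filled = 0;
    for (int t = 0; t < nthreads && filled < WIDTH; t++) {
        int take = ready[t];
        if (take > WIDTH - filled)
            take = WIDTH - filled;   /* cap at remaining slots */
        filled += take;
    }
    return filled;
}
```

With threads offering {2, 1, 3} independent instructions, a single thread fills 2 of 4 slots while SMT fills all 4, which is exactly the utilization argument above.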
Basic Out-of-order Pipeline
SMT Pipeline
Challenges for an SMT processor
Dealing with the larger register file needed to hold multiple contexts.
Maintaining low overhead on the clock cycle, particularly in instruction issue and completion.
Ensuring that cache conflicts generated by the simultaneous execution of multiple threads do not cause significant performance degradation.
SMT
SMT will significantly enhance multistream performance across a wide range of applications without significant hardware cost and without major architectural changes
Instruction Issue
Reduced function unit utilization due to dependencies
Superscalar Issue
Superscalar leads to more performance, but lower utilization
Maximum utilization of function units by independent operations
Simultaneous Multithreading
Fine Grained Multithreading
Intra-thread dependencies still limit performance
Interleaving – no empty slot
Architectural Abstraction
1 CPU with 4 Thread Processing Units (TPUs); shared hardware resources.
System Block Diagram
Changes for SMT
Basic pipeline: unchanged.
Replicated resources: program counters, register maps.
Shared resources: register file (size increased), instruction queue, first- and second-level caches, translation buffers, branch predictor.
Multithreaded application performance
Single-Chip Multiprocessor
CMPs use relatively simple single-thread processor cores to exploit only moderate amounts of parallelism within any one thread, while executing multiple threads in parallel across multiple processor cores.
If an application cannot be effectively decomposed into threads, CMPs will be underutilized.
Comparing Alternative Architectures
Superscalar architecture
Issues up to 12 instructions per cycle.
SMT architecture
8 separate PCs; executes instructions from 8 different threads concurrently.
Multi-banked caches.
Chip multiprocessor architecture
8 small 2-issue superscalar processors; depends on TLP.
SMT and Memory
SMT places large demands on memory; it requires more bandwidth from the primary cache, since multithreading allows more loads and stores in flight.
To allow this, the design uses 128-KByte caches.
It needs a complex MESI (modified, exclusive, shared, invalid) cache-coherence protocol.
CMP and Memory
The eight cores are independent and integrated with their individual pairs of caches, another form of clustering, which leads to a high-frequency design for the primary cache system.
The small cache size and tight connection to these caches allow single-cycle access.
A simpler coherence scheme suffices.
Quantitative Performance: CPU Cores
To keep the processor's execution units busy, SMT features advanced branch prediction, register renaming, out-of-order issue, and non-blocking data caches, which make it inherently complex: the number of registers increases, and the number of ports on each register file must increase.
The CMP approach keeps the hardware simple: it exploits parallelism using more processors instead of large issue widths within a single processor.
SMT Approach: Longer Cycle Times
Long, high-capacitance I/O wires span the large buffers, queues, and register files.
Extensive use of multiplexers and crossbars to interconnect these units adds more capacitance.
The delays associated with these structures dominate the delay along the CPU's critical path.
The cycle-time impact of these structures can be mitigated by careful design: by deep pipelining, or by breaking the structures into small, fast clusters of closely related components connected by short wires.
But deep pipelining increases branch misprediction penalties, and clustering tends to reduce the processor's ability to find and exploit instruction-level parallelism.
CMP Solution
A short cycle time can be targeted with relatively little design effort, since the CMP's hardware is naturally clustered: each of the small CPUs is already a very small, fast cluster of components.
Since the OS allocates a single software thread of control to each processor, the partitioning of work among the "clusters" is natural and requires no hardware to dynamically allocate instructions to different clusters.
Heavy reliance on software to direct instructions to clusters limits the amount of ILP a CMP can exploit, but allows the clusters within the CMP to be small and fast.
SMT and CMP
From an architectural point of view, the SMT processor's flexibility makes it superior.
However, the need to limit the effects of interconnect delays, which are becoming much slower than transistor gate delays, will also drive billion-transistor chip design.
Interconnect delays will force the microarchitecture to be partitioned into small, localized processing elements.
CMP is much more promising because it is already partitioned into individual processing cores.
Because these cores are relatively simple, they are amenable to speed optimization and can be designed relatively easily.
Compiler Support for SMT and CMP
Programmers must find TLP in order to maximize CMP performance.
SMT also requires programmers to explicitly divide code into threads to get maximum performance, but unlike CMP, it can dynamically find more ILP if TLP is limited.
With multithreaded operating systems, these problems should prove less daunting.
Having all eight CPUs on a single chip allows designers to exploit TLP even when threads communicate frequently.
Performance results
A comparison of three architectures indicates that a multiprocessor on a chip will be easiest to implement while still offering excellent performance.
Disadvantages of CMP
When code cannot be multithreaded, only one processor can be targeted to the task.
However, a single 2-issue processor of the CMP is only moderately slower than the superscalar or SMT designs, since applications with little thread-level parallelism also tend to lack ILP.
Conclusion on CMP
CMP is a promising candidate for a billion-transistor architecture.
It offers superior performance using simple hardware.
For code that can be parallelized into multiple threads, the small CMP cores perform comparably or better.
It is easier to design and optimize.
SMTs use resources more efficiently than CMPs, but more execution units can be included in a CMP of similar area, since less die area need be devoted to wide-issue logic.
References
D. Tullsen, S. Eggers, and H. Levy, "Simultaneous Multithreading: Maximizing On-Chip Parallelism," Proc. 22nd Ann. Int'l Symp. on Computer Architecture, ACM Press, 1995, pp. 392-403.
J. Borkenhagen, R. Eickemeyer, and R. Kalla, "A Multithreaded PowerPC Processor for Commercial Servers," IBM Journal of Research and Development, Vol. 44, No. 6, Nov. 2000, pp. 885-898.
J. Lo, S. Eggers, J. Emer, H. Levy, R. Stamm, and D. Tullsen, "Converting Thread-Level Parallelism into Instruction-Level Parallelism via Simultaneous Multithreading," ACM Transactions on Computer Systems, Vol. 15, No. 2, Aug. 1997.
K. Olukotun, B. A. Nayfeh, L. Hammond, K. Wilson, and K. Chang, "The Case for a Single-Chip Multiprocessor," Proc. 7th Int'l Conf. on Architectural Support for Programming Languages and Operating Systems (ASPLOS), Oct. 1996, pp. 2-11.
L. Hammond, B. A. Nayfeh, and K. Olukotun, "A Single-Chip Multiprocessor," IEEE Computer, Sept. 1997.
M. Gulati and N. Bagherzadeh, "Performance Study of a Multithreaded Superscalar Microprocessor," Proc. 2nd Int'l Symp. on High-Performance Computer Architecture, Feb. 1996, pp. 291-301.
K. Park, S.-H. Choi, Y. Chung, W.-J. Hahn, and S.-H. Yoon, "On-Chip Multiprocessor with Simultaneous Multithreading," http://etrij.etri.re.kr/etrij/pdfdata/22-04-02.pdf
B. A. Nayfeh, L. Hammond, and K. Olukotun, "Evaluation of Design Alternatives for a Multiprocessor Microprocessor," Proc. 23rd Ann. Int'l Symp. on Computer Architecture, May 1996, pp. 67-77.
L. Hammond, B. A. Hubbert, M. Siu, M. K. Prabhu, M. Chen, and K. Olukotun, "The Stanford Hydra CMP," IEEE Micro, Vol. 20, No. 2, Mar./Apr. 2000.
S. Eggers, J. Emer, H. Levy, J. Lo, R. Stamm, and D. Tullsen, "Simultaneous Multithreading: A Platform for Next-Generation Processors," IEEE Micro, Sept./Oct. 1997, pp. 12-18.
V. Krishnan and J. Torrellas, "Hardware and Software Support for Speculative Execution of Sequential Binaries on a Chip-Multiprocessor," Proc. ACM Int'l Conf. on Supercomputing (ICS '98), June 1998, pp. 85-92.
Further links: goethe.ira.uka.de/people/ungerer/proc-arch/EUROPAR-tutorial-slides.ppt; http://www.acm.uiuc.edu/banks/20/6/page4.html; Simultaneous Multithreading home page: http://www.cs.washington.edu/research/smt/
The Stanford Hydra Chip Multiprocessor
Kunle Olukotun, The Hydra Team
Computer Systems Laboratory
Stanford University
Technology Architecture
Transistors are cheap, plentiful and fast: Moore's law, 100 million transistors by 2000.
Wires are cheap, plentiful and slow: wires get slower relative to transistors; long cross-chip wires are especially slow.
Architectural implications: plenty of room for innovation; single-cycle communication requires localized blocks of logic; high communication bandwidth across the chip is easier to achieve than low latency.
Exploiting Program Parallelism
[Figure: levels of parallelism (instruction, loop, thread, process) plotted against grain size, from 1 to 1M instructions]
Hydra Approach
A single-chip multiprocessor architecture composed of simple, fast processors.
Multiple threads of control exploit parallelism at all levels.
Memory renaming and thread-level speculation make it easy to develop parallel programs.
The design is kept simple by taking advantage of a single-chip implementation.
Outline
Base Hydra architecture
Performance of base architecture
Speculative thread support
Speculative thread performance
Improving speculative thread performance
Hydra prototype design
Conclusions
The Base Hydra Design
Single-chip multiprocessor: four processors, separate primary caches, write-through data caches to maintain coherence.
Shared 2nd-level cache; low-latency interprocessor communication (10 cycles); separate read and write buses.
[Block diagram: four CPUs, each with an L1 instruction cache, L1 data cache, and memory controller, connected by a 64-bit write-through bus and a 256-bit read/replace bus to the on-chip L2 cache; a Rambus memory interface to DRAM main memory; an I/O bus interface to I/O devices; and centralized bus arbitration mechanisms]
Hydra vs. Superscalar
ILP only: the superscalar is 30-50% better than a single Hydra processor.
ILP and fine-grained threads: the superscalar and Hydra are comparable.
ILP and coarse-grained threads: Hydra is 1.5-2x better.
("The Case for a Single-Chip Multiprocessor," ASPLOS '96)
[Chart: speedup (0-4) of a 6-way issue superscalar vs. Hydra (4 x 2-way issue) on compress, m88ksim, eqntott, MPEG2, applu, apsi, swim, tomcatv, pmake, and OLTP]
Problem: Parallel Software
Parallel software is limited: hand-parallelized applications and auto-parallelized dense-matrix FORTRAN applications.
Traditional auto-parallelization of C programs is very difficult: threads have data dependencies and need synchronization; pointer disambiguation is difficult and expensive; compile-time analysis is too conservative.
How can hardware help? Remove the need for pointer disambiguation; allow the compiler to be aggressive.
Solution: Data Speculation
Data speculation enables parallelization without regard for data dependencies: loads and stores follow the original sequential semantics; speculation hardware ensures correctness; synchronization is added only for performance; loop parallelization is now easily automated.
Other ways to parallelize code: break code into arbitrary threads (e.g. speculative subroutines); parallel execution with sequential commits.
Data speculation support: the Wisconsin Multiscalar; Hydra provides low-overhead support on a CMP.
Data Speculation Requirements I
Forward data between parallel threads; detect violations when reads occur too early.
[Figure: an original sequential loop vs. a speculatively parallelized loop over time; a write to X in iteration i forwarded to a read of X in iteration i+1 is FORWARDING, while a read of X in iteration i+1 that occurs before iteration i's write is a VIOLATION]
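These two cases can be modeled with a tiny sketch (an assumed model, loosely following the per-line "read" tag bits described for Hydra below): each speculative thread records the addresses it has read, and a write from a less-speculative thread to an already-read address signals a violation; otherwise the write is simply forwarded.

```c
/* Toy model of RAW-violation detection for one speculative thread. */
#include <assert.h>
#include <string.h>

#define ADDRS 16

typedef struct {
    int read_bit[ADDRS];  /* set when this thread reads the address */
    int violated;         /* set when a too-early read is detected */
} spec_thread;

void spec_reset(spec_thread *s) { memset(s, 0, sizeof *s); }

/* The speculative thread reads addr: mark it. */
void spec_read(spec_thread *s, int addr) { s->read_bit[addr] = 1; }

/* An earlier (less speculative) thread writes addr. If we already read
 * it, we consumed stale data: violation. Otherwise the new value is
 * just forwarded to us and no harm is done. */
void earlier_write(spec_thread *s, int addr) {
    if (s->read_bit[addr])
        s->violated = 1;
}
```

Reading X and then seeing an earlier write to X triggers a violation (the thread must restart); seeing the write first and reading afterwards is plain forwarding.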
Data Speculation Requirements II
Safely discard bad state after violation Correctly retire speculative state
[Figure: over time, writes from iterations after a violation are trashed, while writes after successful iterations are retired as permanent state]
Data Speculation Requirements III
Maintain multiple “views” of memory
[Figure: multiple memory "views" — iterations i, i+1, and i+2 each read and write X, and each must see its own correct version of the location]
Hydra Speculation Support
Write bus and L2 buffers provide forwarding.
"Read" L1 tag bits detect violations.
"Dirty" L1 tag bits and write buffers provide backup.
Write buffers reorder and retire speculative state.
Separate L1 caches with pre-invalidation and smart L2 forwarding maintain the memory "views".
Speculation coprocessors control the threads.
[Block diagram: the base Hydra design augmented with a CP2 speculation coprocessor per CPU, speculation bits in each L1 data cache, and per-CPU speculation write buffers (#0-#3) that retire into the on-chip L2 cache over the write-through and read/replace buses]
Speculative Reads
L1 hit: the read bits are set.
L1 miss: the L2 cache and the write buffers are checked in parallel; the newest bytes written to a line are pulled in by priority encoders on each byte (priority A-D).
[Figure: on an L1 miss, CPU #i ("me") merges bytes from the write buffers of the speculative earlier CPUs #i-1 and #i-2 and the nonspeculative "head" CPU (priorities A-C) and from the L2 cache (priority D); the speculative later CPU #i+1 is ignored]
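The per-byte priority selection can be sketched as follows (assumed data layout, not Hydra's actual hardware): for each byte of the missing line, the nearest earlier CPU's write buffer that has written that byte wins, and the L2 copy is used otherwise.

```c
/* Sketch of per-byte priority merging on a speculative L1 miss. */
#include <assert.h>
#include <string.h>

#define LINE 8   /* bytes per cache line (toy size) */
#define NBUF 3   /* earlier CPUs' write buffers; [0] = nearest (priority A) */

typedef struct {
    unsigned char data[LINE];
    unsigned char valid[LINE];  /* 1 if this buffer wrote the byte */
} wbuf;

/* Build the line seen by the requesting CPU: start from the L2 copy,
 * then overlay older buffers first and the nearest buffer last, so the
 * newest (highest-priority) byte wins. */
void merge_line(const wbuf buf[NBUF], const unsigned char *l2,
                unsigned char *out) {
    for (int b = 0; b < LINE; b++) {
        out[b] = l2[b];                      /* default: L2 copy (D) */
        for (int i = NBUF - 1; i >= 0; i--)  /* C, B, then A overlays */
            if (buf[i].valid[b])
                out[b] = buf[i].data[b];
    }
}
```

A byte written by both a distant and the nearest earlier buffer comes from the nearest one; a byte no buffer wrote falls through to the L2 value.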
Speculative Writes
A CPU writes to its L1 cache & write buffer
“Earlier” CPUs invalidate our L1 & cause RAW hazard checks
“Later” CPUs just pre-invalidate our L1
Non-speculative write buffer drains out into the L2
[Figure: CPU #i ("me") writes to its L1 cache and write buffer and onto the write bus; writes from speculative earlier CPUs trigger invalidations and RAW detection, writes from the speculative later CPU trigger pre-invalidations, and the nonspeculative "head" CPU's write buffer drains into the L2 cache]
Speculation Runtime System
Software handlers control speculative threads through the CP2 interface, track the order of all speculative threads, and use exception routines to recover from data dependency violations.
This adds more overhead to speculation than a pure hardware approach, but is more flexible and simpler to implement.
A complete description appears in "Data Speculation Support for a Chip Multiprocessor" (ASPLOS '98) and "Improving the Performance of Speculatively Parallel Applications on the Hydra CMP" (ICS '99).
Creating Speculative Threads
Speculative loops: for- and while-loop iterations, typically one speculative thread per iteration.
Speculative procedures: execute the code after a procedure call speculatively; procedure calls generate a speculative thread.
Compiler support: a C source-to-source translator provides pfor and pwhile, analyzes the loop body, and globalizes any local variables that could cause loop-carried dependencies.
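A hypothetical example of this "globalization" step (my illustration; the function and variable names are invented, not from the Hydra translator): a scalar that carries a value across iterations is moved from a register into memory, where loads and stores become visible to the speculation hardware for forwarding and violation detection.

```c
#include <assert.h>

/* Original loop: the accumulator lives in a register, so a loop-carried
 * dependence through it is invisible to memory-based speculation. */
int sum_local(const int *a, int n) {
    int sum = 0;                 /* register-allocated local */
    for (int i = 0; i < n; i++)
        sum += a[i];
    return sum;
}

/* After globalizing the accumulator: every update is a load/store to a
 * memory location the speculation hardware can track. */
static int g_sum;                /* hypothetical globalized variable */

int sum_global(const int *a, int n) {
    g_sum = 0;
    for (int i = 0; i < n; i++)
        g_sum += a[i];           /* load/store visible to speculation h/w */
    return g_sum;
}
```

Both versions compute the same result; the transformation only changes where the carried value lives.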
Base Speculative Thread Performance
Entire applications, compiled with GCC 2.7.2 -O2, on 4 single-issue processors, with accurate modeling of all aspects of the Hydra architecture and the real runtime system.
[Chart: base speedup (0-4) on compress, eqntott, grep, m88ksim, wc, ijpeg, mpeg2, alvin, cholesky, ear, simplex, and sparse1.3]
Improving the Speculative Runtime System
Procedure support adds overhead to loops: threads are not created sequentially, so dynamic thread scheduling is necessary; starting and ending a loop costs 75 cycles, and ending an iteration costs 80 cycles.
Performance: the best-performing speculative applications use loops; procedure speculation often lowers performance; the RTS needs to be optimized for the common case.
Lowered speculative overheads: starting and ending a loop now costs 25 cycles, and ending an iteration costs 12 cycles (almost a factor of 7 better); procedure speculation is limited to specific procedures.
Improved Speculative Performance
Improves performance of all applications
Most improvement for applications with fine-grained threads
Eqntott uses procedure speculation
[Chart: speedup (0-4) of the base RTS vs. the optimized RTS on compress, eqntott, grep, m88ksim, wc, ijpeg, mpeg2, alvin, cholesky, ear, simplex, and sparse1.3]
Optimizing Parallel Performance
Cache-coherent shared memory: no explicit data movement, but 100+ cycle communication latency; need to optimize for data locality; look at cache misses (MemSpy, Flashpoint).
Speculative threads: no explicit data independence; frequent dependence violations limit performance; need to optimize to reduce the frequency and impact of violations; dependence prediction can help; look at violation statistics (requires some hardware support).
Feedback and Code Transformations
Feedback tool: collects violation statistics (PCs, frequency, work lost); correlates read and write PC values with source code.
Synchronization: synchronize frequently occurring violations; use non-violating loads.
Code motion: find dependent load-store pairs; move loads down and stores up within a thread; rearrange reads and writes to increase parallelism; delay reads and advance writes; create local copies to allow earlier data forwarding.
[Figure: in the original loop, iteration i+1's early read of x conflicts with iteration i's late write of x; after code motion, a local copy x' lets iteration i forward x earlier and iteration i+1 read x' without violating]
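A hypothetical before/after example of this transformation (invented names; the shared variable x stands in for the loop-carried dependence): moving the independent work ahead of the read-modify-write of x shrinks the window in which a later iteration can violate.

```c
#include <assert.h>

int x;  /* shared across speculative iterations (loop-carried) */

/* Before: x is read at the top and written at the bottom, so the whole
 * iteration sits between the dependent read and write. */
void iteration_before(const int *work, int n) {
    int t = x;                    /* early read of x */
    for (int i = 0; i < n; i++)
        t += work[i];             /* long, independent work */
    x = t;                        /* late write of x */
}

/* After: a local copy accumulates the independent work first; the read
 * and write of x end up adjacent, exposing more parallelism. */
void iteration_after(const int *work, int n) {
    int local = 0;
    for (int i = 0; i < n; i++)
        local += work[i];         /* independent work runs first */
    x = x + local;                /* read and write of x together */
}
```

Both versions leave x with the same final value; only the placement of the dependent read and write changes.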
Optimized Speculative Performance
Base performance
Optimized RTS with no manual intervention
Violation statistics used to manually transform code
[Chart: speedup (0-4) on compress, eqntott, grep, m88ksim, wc, ijpeg, mpeg2, alvin, cholesky, ear, simplex, and sparse1.3 for the base system, the optimized RTS, and the manually transformed code]
Size of Speculative Write State
The maximum size determines the write buffer size needed for maximum performance.
A non-head processor stalls when its write buffer fills up.
Small write buffers (< 64 lines) will achieve good performance.

Maximum number of lines of write state (32-byte cache lines):
compress 24, eqntott 40, grep 11, m88ksim 28, wc 8, ijpeg 32, mpeg 56, alvin 158, cholesky 4, ear 82, simplex 14
Hydra Prototype
Design based on the Integrated Device Technology (IDT) RC32364; 88 mm2 in a 0.25 µm process, with 8-KB instruction and data caches and a 128-KB L2 cache.
[Die plot: 8 mm x 11 mm]
Conclusions
Hydra offers a new way to design microprocessors: a single-chip MP exploits parallelism at all levels, with low-overhead support for speculative parallelism.
It provides high performance on applications with medium- to large-grain parallelism, and allows a performance-optimization migration path for difficult-to-parallelize fine-grain applications.
Prototype implementation: works out implementation details, provides a platform for application and compiler development, and enables realistic performance evaluation.
Hydra Team
Team: Monica Lam, Lance Hammond, Mike Chen, Ben Hubbert, Manohar Prabhu, Mike Siu, Melvyn Lim and Maciek Kozyrczak (IDT)
URL: http://www-hydra.stanford.edu