Final Review
Dr. Bernard Chen Ph.D., University of Central Arkansas
Fall 2010
Overcome Data Hazards with Dynamic Scheduling
Key idea: allow instructions behind a stall to proceed
DIV F0 <- F2/F4
ADD F10 <- F0+F8
SUB F12 <- F8-F14
Overcome Data Hazards with Dynamic Scheduling
Key idea: allow instructions behind a stall to proceed
DIV F0 <- F2/F4
SUB F12 <- F8-F14
ADD F10 <- F0+F8
Enables out-of-order execution and allows out-of-order completion (e.g., SUB)
In a dynamically scheduled pipeline, all instructions still pass through issue stage in order (in-order issue)
Overcome Data Hazards with Dynamic Scheduling
However, dynamic execution creates WAR and WAW hazards and makes exceptions harder
Name dependence: two instructions use the same register or memory location (called a name), but there is no flow of data between the instructions associated with that name
There are 2 versions of name dependence:
WAR (anti-dependence): InstrJ writes operand before InstrI reads it. If the anti-dependence causes a hazard in the pipeline, it is called a Write After Read (WAR) hazard
I: sub r4,r1,r3
J: add r1,r2,r3
K: mul r6,r1,r7
WAW (output dependence): InstrJ writes operand before InstrI writes it. If the output dependence causes a hazard in the pipeline, it is called a Write After Write (WAW) hazard
I: sub r1,r4,r3
J: add r1,r2,r3
K: mul r6,r1,r7
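The two name dependences, together with the true (RAW) data dependence, can be checked mechanically by comparing destination and source registers. The following is a minimal sketch, not part of the original slides; the `Instr` tuple and the `dependences` helper are hypothetical names introduced only for illustration, and the instructions are the ones from the two examples above.

```python
from typing import NamedTuple, Tuple, List


class Instr(NamedTuple):
    """Hypothetical three-operand instruction: op dest, src1, src2."""
    op: str
    dest: str
    srcs: Tuple[str, ...]


def dependences(i: Instr, j: Instr) -> List[str]:
    """Classify dependences from earlier instruction i to later instruction j."""
    found = []
    if i.dest in j.srcs:
        found.append("RAW")  # true data dependence: j reads what i writes
    if j.dest in i.srcs:
        found.append("WAR")  # anti-dependence: j overwrites a name i still reads
    if j.dest == i.dest:
        found.append("WAW")  # output dependence: both write the same name
    return found


# WAR example: I reads r1, J writes r1
sub = Instr("sub", "r4", ("r1", "r3"))
add = Instr("add", "r1", ("r2", "r3"))
mul = Instr("mul", "r6", ("r1", "r7"))
print(dependences(sub, add))   # ['WAR']
print(dependences(add, mul))   # ['RAW']

# WAW example: I and J both write r1
sub_waw = Instr("sub", "r1", ("r4", "r3"))
print(dependences(sub_waw, add))  # ['WAW']
```

Only the register names are compared, so the sketch reports potential hazards; whether a hazard actually occurs depends on the pipeline and on the order in which the instructions execute.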
Thread-level parallelism (TLP)
Thread: process with its own instructions and data
A thread may be a process that is part of a parallel program of multiple processes, or it may be an independent program
Each thread has all the state (instructions, data, PC, register state, and so on) necessary to allow it to execute
(Ch4: Data-Level Parallelism: perform identical operations on data, and lots of data)
New Approach: Multithreaded Execution
Multithreading: multiple threads share the functional units of 1 processor via overlapping
Processor must duplicate the independent state of each thread, e.g., a separate copy of the register file, a separate PC, and, for running independent programs, a separate page table
New Approach: Multithreaded Execution
When to switch?
Alternate instruction per thread (fine grain)
When a thread is stalled, perhaps for a cache miss, another thread can be executed (coarse grain)
Fine-Grained Multithreading
Switches between threads on each instruction, causing the execution of multiple threads to be interleaved
Usually done in a round-robin fashion, skipping any stalled threads
CPU must be able to switch threads every clock
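The round-robin policy with skipping can be sketched as a small scheduler. This is an illustrative model only, assuming one issue slot per clock; the function name `fine_grained_schedule` and the `stalled_at` input format (cycle number mapped to the set of stalled thread ids) are hypothetical choices made for this sketch.

```python
def fine_grained_schedule(num_threads, stalled_at, cycles):
    """Return the thread issued in each cycle (None means an idle slot).

    stalled_at maps a cycle number to the set of thread ids stalled in that
    cycle (a hypothetical input format chosen for this sketch).
    """
    schedule = []
    next_thread = 0
    for cycle in range(cycles):
        stalled = stalled_at.get(cycle, set())
        issued = None
        # Try each thread once, starting from the round-robin pointer.
        for offset in range(num_threads):
            candidate = (next_thread + offset) % num_threads
            if candidate not in stalled:
                issued = candidate
                next_thread = (candidate + 1) % num_threads
                break
        schedule.append(issued)
    return schedule


# Thread 1 is stalled during cycles 1 and 2, so it is skipped there.
print(fine_grained_schedule(3, {1: {1}, 2: {1}}, 6))  # [0, 2, 0, 1, 2, 0]
```

A slot is left idle only when every thread is stalled in the same cycle.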
Coarse-Grained Multithreading
Switches threads only on costly stalls, such as L2 cache misses
Advantages:
Relieves need to have very fast thread-switching
Doesn’t slow down thread, since instructions from other threads are issued only when the thread encounters a costly stall
Coarse-Grained Multithreading
Disadvantage: hard to overcome throughput losses from shorter stalls, due to pipeline start-up costs
Since the CPU issues instructions from 1 thread, when a stall occurs, the pipeline must be emptied or frozen
New thread must fill pipeline before instructions can complete
Because of this start-up overhead, coarse-grained multithreading is better for reducing the penalty of high-cost stalls, where pipeline refill << stall time
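The trade-off between refill cost and stall time can be made concrete with a deliberately simplified cost model. The sketch below assumes that another thread is always ready and that a switch hides the whole stall at the price of one pipeline refill; the function name, the threshold policy, and the stall lengths are all hypothetical.

```python
def lost_cycles(stalls, refill, threshold):
    """Total cycles lost when switching threads only on stalls >= threshold.

    Simplifying assumptions of this sketch: another thread is always ready,
    a switch hides the entire stall, and each switch costs one pipeline refill.
    """
    total = 0
    for stall in stalls:
        if stall >= threshold:
            total += refill   # switch: pay the refill, hide the stall
        else:
            total += stall    # stay on the thread: the stall is fully exposed
    return total


stalls = [3, 200, 5, 150]  # hypothetical stall lengths in cycles
print(lost_cycles(stalls, refill=10, threshold=100))  # 28: switch on long stalls only
print(lost_cycles(stalls, refill=10, threshold=0))    # 40: switch on every stall
```

With a refill of 10 cycles, switching on the 3- and 5-cycle stalls costs more than waiting them out, which is why the policy pays off only where pipeline refill << stall time.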
Multithreaded Categories
[Figure: issue slots over time (processor cycles) for Superscalar, Fine-Grained, and Coarse-Grained (2 clock cycle) multithreading. Legend: Thread 1, Thread 2, Thread 3, Thread 4, Thread 5, Idle slot]
Flynn’s Taxonomy
M.J. Flynn, "Very High-Speed Computing Systems", Proc. of the IEEE, V 54, 1901-1909, Dec. 1966.
Single Instruction Single Data (SISD) (Uniprocessor)
Single Instruction Multiple Data (SIMD) (single PC/Server)
Multiple Instruction Single Data (MISD) (????)
Multiple Instruction Multiple Data (MIMD) (Clusters, SMP servers)
Back to Basics
“A parallel computer is a collection of processing elements that cooperate and communicate to solve large problems fast.”
Parallel Architecture = Computer Architecture + Communication Architecture
2 classes of multiprocessors WRT memory:
1. Centralized Memory Multiprocessor
• < few dozen processor chips
• Small enough to share single, centralized memory
2. Physically Distributed-Memory Multiprocessor
• Larger number of chips and cores
• BW demands: memory distributed among processors
2 Models for Communication and Memory Architecture
The first kind: communication occurs through a shared address space.
Centralized memory multiprocessors use this type of communication and are called symmetric shared-memory multiprocessors
2 Models for Communication and Memory Architecture
The first kind: communication occurs through a shared address space
Even physically separate memories can be addressed as one logically shared address space, meaning that a memory reference can be made by any processor to any memory location (assuming it has the access rights)
These multiprocessors are called distributed shared memory (DSM)
2 Models for Communication and Memory Architecture
1. Communication occurs through a shared address space (via loads and stores): shared memory multiprocessors, either
• symmetric shared memory (centralized memory MP)
• distributed shared memory (distributed memory MP)
2. Communication occurs by explicitly passing messages among the processors: message-passing multiprocessors (distributed memory MP)
Multiprocessor Performance
Amdahl’s Law
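The slide names Amdahl's Law without stating it. In its usual form, if a fraction f of the work can be spread over n processors and the rest stays serial, the speedup is 1 / ((1 - f) + f / n). A minimal sketch follows; the function and parameter names are illustrative, not taken from the slides.

```python
def amdahl_speedup(parallel_fraction, processors):
    """Speedup predicted by Amdahl's Law for a given parallelizable fraction."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / processors)


# With 90% of the work parallelizable, the serial 10% dominates quickly.
print(round(amdahl_speedup(0.9, 10), 2))    # 5.26
print(round(amdahl_speedup(0.9, 100), 2))   # 9.17
```

No matter how many processors are added, the speedup stays below 1 / (1 - f), which is 10 in this example.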
2 Classes of Cache Coherence Protocols
1. Snooping: every cache with a copy of data also has a copy of the sharing status of the block, but no centralized state is kept
2. Directory based: sharing status of a block of physical memory is kept in just one location, the directory
Snooping
Write through: the information is written to both the block in the cache and the block in the lower-level memory
Write back: the information is written only to the block in the cache. The modified cache block is written to main memory only when it is replaced or needed
Snooping (write back)
Time | Processor activity | Bus activity       | Contents of CPU A's cache | Contents of CPU B's cache | Contents of memory X
0    |                    |                    |                           |                           | 0
1    | CPU A reads X      | Cache miss for X   | 0                         |                           | 0
2    | CPU B reads X      | Cache miss for X   | 0                         | 0                         | 0
3    | CPU A writes 1 to X | Invalidation for X | 1                         |                           | 0
4    | CPU B reads X      | Cache miss for X   | 1                         | 1                         | 1
Snooping (write through)
Time | Processor activity | Bus activity       | Contents of CPU A's cache | Contents of CPU B's cache | Contents of memory X
0    |                    |                    |                           |                           | 0
1    | CPU A reads X      | Cache miss for X   | 0                         |                           | 0
2    | CPU B reads X      | Cache miss for X   | 0                         | 0                         | 0
3    | CPU A writes 1 to X | Invalidation for X | 1                         |                           | 1
4    | CPU B reads X      | Cache miss for X   | 1                         | 1                         | 1
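The two snooping traces differ only in when memory is updated. The sketch below replays the same four operations on a single location X under a write-invalidate protocol with either policy. It is a simplified model, not the slides' implementation: `snoop_trace`, the `(cpu, action, value)` tuples, and the single `dirty` marker are hypothetical names, and every write is assumed to broadcast an invalidation.

```python
def snoop_trace(ops, write_through):
    """Replay (cpu, action, value) ops on one location X with write-invalidate.

    Returns one (bus activity, cache contents, memory value) tuple per op.
    """
    memory = 0
    caches = {"A": None, "B": None}  # None means no valid copy
    dirty = None                     # CPU holding a modified copy (write back only)
    trace = []
    for cpu, action, value in ops:
        if action == "read":
            if caches[cpu] is None:
                bus = "cache miss for X"
                if dirty is not None:
                    # The owner supplies the block and memory is updated.
                    memory = caches[dirty]
                    dirty = None
                caches[cpu] = memory
            else:
                bus = ""             # read hit: no bus traffic
        else:
            bus = "invalidation for X"
            for other in caches:
                if other != cpu:
                    caches[other] = None
            caches[cpu] = value
            if write_through:
                memory = value       # memory is updated on every write
            else:
                dirty = cpu          # memory is updated later, on demand
        trace.append((bus, dict(caches), memory))
    return trace


ops = [("A", "read", None), ("B", "read", None), ("A", "write", 1), ("B", "read", None)]
for policy in (False, True):
    label = "write through" if policy else "write back"
    print(label, [memory for _, _, memory in snoop_trace(ops, write_through=policy)])
# write back [0, 0, 0, 1]
# write through [0, 0, 1, 1]
```

The printed lists are the "Contents of memory X" column for times 1 to 4: with write back, memory still holds 0 after CPU A's write and is updated only when CPU B's miss forces the modified block out.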
Directory-Based Cache Coherence Protocols
To implement the operations, a directory must track the state of each cache block:
Shared (S): one or more processors have the block cached, and the value is up-to-date
Uncached (U): no processor has a copy of the cache block
Modified/Exclusive (E): exactly one processor has a copy of the cache block. That processor is called the owner of the block
Directory-based Protocol
[Figure sequence: CPU 0, CPU 1, and CPU 2, each with a cache, connected through an interconnection network to memories and directories. The directory entry for X holds a state and a bit vector with one bit per CPU. The values shown on each slide are:]
Step                             | Directory entry for X (state, bit vector) | Cached copies of X  | Memory X
Initial                          | U 0 0 0                                   | none                | 7
CPU 0 reads X                    | S 1 0 0                                   | CPU 0: 7            | 7
CPU 2 reads X                    | S 1 0 1                                   | CPU 0: 7, CPU 2: 7  | 7
CPU 0 writes 6 to X              | E 1 0 0                                   | CPU 0: 6            | 7
CPU 1 reads X                    | S 1 1 0                                   | CPU 0: 6, CPU 1: 6  | 6
CPU 2 writes 5 to X (write back) | E 0 0 1                                   | CPU 2: 5            | 6
CPU 0 writes 4 to X              | E 1 0 0                                   | CPU 0: 4            | 5
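The sequence of directory states above can be reproduced with a small simulation of one block X. This is a sketch under simplifying assumptions, not the protocol as specified in the slides: messages on the interconnection network are not modeled, the owner's write-back happens instantly, and `directory_trace` and its tuple format are hypothetical names.

```python
def directory_trace(ops, num_cpus=3, initial=7):
    """Replay (cpu, action, value) ops on block X and record the directory state.

    Each trace entry is (state, bit vector, memory value, cache contents).
    """
    memory = initial
    state = "U"                    # U = uncached, S = shared, E = exclusive
    bits = [0] * num_cpus          # one presence bit per CPU
    caches = [None] * num_cpus     # None means no valid copy
    trace = []
    for cpu, action, value in ops:
        if state == "E" and not bits[cpu]:
            # Another CPU owns the block: the owner writes it back first.
            owner = bits.index(1)
            memory = caches[owner]
            state = "S"
        if action == "read":
            if not bits[cpu]:
                caches[cpu] = memory
                bits[cpu] = 1
                state = "S"
        else:
            # A write invalidates every other copy and makes the writer owner.
            for other in range(num_cpus):
                if other != cpu:
                    caches[other] = None
                    bits[other] = 0
            caches[cpu] = value
            bits[cpu] = 1
            state = "E"
        trace.append((state, tuple(bits), memory, tuple(caches)))
    return trace


ops = [(0, "read", None), (2, "read", None), (0, "write", 6),
       (1, "read", None), (2, "write", 5), (0, "write", 4)]
for state, bits, memory, _ in directory_trace(ops):
    print(state, bits, "memory X =", memory)
```

Run on the six operations from the slides, the sketch steps through S 1 0 0, S 1 0 1, E 1 0 0, S 1 1 0, E 0 0 1, and E 1 0 0, with memory holding 7, 7, 7, 6, 6, and 5 in turn.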
Evaluating Switch Topologies
Diameter: distance between the farthest two nodes
Bisection width: min. number of edges in a cut which roughly divides a network in two halves; determines the min. bandwidth of the network
Degree = number of edges / node; constant degree board can be mass produced
Constant edge length? (yes/no)
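Diameter and degree are easy to compute for a concrete network by treating it as a graph. The sketch below builds adjacency lists for two of the topologies covered in this review, a 2-D mesh (assumed here to have no wraparound links) and a hypercube, and measures them with breadth-first search. All function names are hypothetical, and bisection width is left out because it is not a simple graph traversal.

```python
from collections import deque
from itertools import product


def diameter(adj):
    """Longest shortest-path distance between any two nodes (BFS from each node)."""
    best = 0
    for start in adj:
        dist = {start: 0}
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for nxt in adj[node]:
                if nxt not in dist:
                    dist[nxt] = dist[node] + 1
                    queue.append(nxt)
        best = max(best, max(dist.values()))
    return best


def max_degree(adj):
    """Largest number of edges at any node."""
    return max(len(neighbors) for neighbors in adj.values())


def mesh_2d(n):
    """n x n mesh without wraparound links (an assumption of this sketch)."""
    adj = {}
    for r, c in product(range(n), repeat=2):
        steps = [(r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)]
        adj[(r, c)] = [(a, b) for a, b in steps if 0 <= a < n and 0 <= b < n]
    return adj


def hypercube(d):
    """d-dimensional hypercube: neighbors differ in exactly one address bit."""
    return {v: [v ^ (1 << bit) for bit in range(d)] for v in range(2 ** d)}


print(diameter(mesh_2d(4)), max_degree(mesh_2d(4)))      # 6 4
print(diameter(hypercube(4)), max_degree(hypercube(4)))  # 4 4
```

Both networks have 16 nodes and a maximum degree of 4, but the hypercube's diameter is 4 against 6 for the mesh. The hypercube's degree grows with its dimension, whereas the mesh's stays constant.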
2-D Mesh Network
Binary Tree Network
Hypercube
2 x 2 x … x 2 mesh
[Figure: Hypercubes Illustrated. A 4-dimensional hypercube with 16 nodes labeled 0000 through 1111]
Butterfly Network
[Figure: butterfly network with nodes 0 to 7 and four ranks of switch nodes (Rank 0 to Rank 3), each rank with 8 nodes labeled (rank, column), from 0,0 to 3,7]