ACA April2011 Solved
8/4/2019 ACA April2011 Solved
http://slidepdf.com/reader/full/aca-april2011-solved 1/15
April 2011
Explain the Dynamic scheduling mechanism with an example
Dynamic scheduling is used to handle cases where dependences are unknown at compile time. The
hardware rearranges instruction execution to reduce stalls while maintaining data flow and exception
behavior. It also allows code that was compiled with one pipeline in mind to run efficiently on a
different pipeline.
For example, consider this code:
DIV.D F0,F2,F4
ADD.D F10,F0,F8
SUB.D F12,F8,F14
The SUB.D instruction cannot execute because the dependence of ADD.D on DIV.D causes the pipeline
to stall; yet SUB.D is not data dependent on anything in the pipeline. This hazard creates a performance
limitation that can be eliminated by not requiring instructions to execute in program order.
To allow us to begin executing the SUB.D in the above example, we must separate the issue process into
two parts: checking for any structural hazards and waiting for the absence of a data hazard. We can still
check for structural hazards when we issue the instruction; thus, we still use in-order instruction issue
(i.e., instructions issue in program order), but we want an instruction to begin execution as soon as its
data operand is available. Thus, this pipeline does out-of-order execution, which implies out-of-order
completion.
To allow out-of-order execution, we essentially split the ID pipe stage of our simple five-stage pipeline
into two stages:
• Issue—Decode instructions, check for structural hazards.
• Read operands—Wait until no data hazards, then read operands.
In a dynamically scheduled pipeline, all instructions pass through the issue stage in order (in-order
issue); however, they can be stalled or bypass each other in the second stage (read operands) and thus
enter execution out of order. Scoreboarding is a technique for allowing instructions to execute out of
order when there are sufficient resources and no data dependences; it is named after the CDC 6600
scoreboard, which developed this capability.
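The split between in-order issue and hazard-gated execution can be sketched in Python. This is a toy model, not a cycle-accurate scoreboard: latencies are ignored, only RAW hazards are checked, and every instruction started in one pass is assumed to complete before the next pass.

```python
# Toy sketch of out-of-order execution behind in-order issue.
# Instruction encodings and the completion model are simplifications.

def schedule(instrs):
    """Issue in program order; an instruction begins execution as soon
    as none of its source registers has a pending write (RAW check)."""
    pending = set()          # destination registers still being written
    order = []               # execution start order
    waiting = list(instrs)
    while waiting:
        started = []
        for ins in list(waiting):
            op, dst, srcs = ins
            if not (set(srcs) & pending):   # no RAW hazard -> start
                order.append(op)
                pending.add(dst)
                started.append(ins)
                waiting.remove(ins)
        # toy assumption: everything started completes before next pass
        for op, dst, srcs in started:
            pending.discard(dst)
        if not started:
            break
    return order

code = [("DIV.D", "F0",  ["F2", "F4"]),
        ("ADD.D", "F10", ["F0", "F8"]),   # RAW on F0: waits for DIV.D
        ("SUB.D", "F12", ["F8", "F14"])]  # independent: may start early
print(schedule(code))
```

As in the text, SUB.D begins execution before ADD.D because it has no data dependence on the stalled DIV.D/ADD.D pair.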
What is Flynn’s Taxonomy?
The idea of using multiple processors both to increase performance and to improve availability
was proposed by Flynn [1966], using a simple model for categorizing all computers. He looked at
the parallelism in the instruction and data streams called for by the instructions at the most
constrained component of the multiprocessor, and placed all computers into one of four
categories:
1. Single instruction stream, single data stream (SISD)—This category is the uniprocessor.
2. Single instruction stream, multiple data streams (SIMD)—The same instruction is executed by
multiple processors using different data streams. SIMD computers exploit data-level parallelism
by applying the same operations to multiple items of data in parallel. Each processor has its
own data memory (hence multiple data), but there is a single instruction memory and control
processor, which fetches and dispatches instructions. For applications that display significant
data-level parallelism, the SIMD approach can be very efficient. The multimedia extensions
discussed in Appendices B and C are a form of SIMD parallelism. Vector architectures are the
largest class of SIMD architectures.
3. Multiple instruction streams, single data stream (MISD)—No commercial multiprocessor of
this type has been built to date.
4. Multiple instruction streams, multiple data streams (MIMD)—Each processor fetches its own
instructions and operates on its own data. MIMD computers exploit thread-level parallelism,
since multiple threads operate in parallel. In general, thread-level parallelism is more flexible
than data-level parallelism and thus more generally applicable.
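The SIMD idea, one instruction stream driving many data streams, can be illustrated with a minimal Python sketch; the data values here are arbitrary:

```python
# One shared "instruction" applied in lockstep to several data streams,
# mimicking the SIMD model (single instruction, multiple data).
data_streams = [[1, 2], [3, 4], [5, 6]]   # one operand pair per "processor"
instruction = lambda a, b: a + b          # the single shared instruction

# Every "processor" executes the same instruction on its own data.
results = [instruction(a, b) for a, b in data_streams]
print(results)
```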
Explain the shared memory architectures?
Existing MIMD multiprocessors fall into two classes, depending on the number of processors
involved, which in turn dictates a memory organization and interconnect strategy. Shared-memory
architectures are categorized into centralized shared-memory and distributed shared-memory.
Centralized-Shared Memory – For multi-processors with small processor counts, it is possible
for the processors to share a single centralized memory. With large caches, a single memory,
possibly with multiple banks, can satisfy the memory demands of a small number of processors.
By using multiple point-to-point connections, or a switch, and adding additional memory banks,
a centralized shared-memory design can be scaled to a few dozen processors. Because there is
a single main memory that has a symmetric relationship to all processors and a uniform access
time from any processor, these multiprocessors are most often called symmetric (shared-
memory) multiprocessors (SMPs), and this style of architecture is sometimes called uniform
memory access (UMA), arising from the fact that all processors have a uniform latency from
memory, even if the memory is organized into multiple banks.
8/4/2019 ACA April2011 Solved
http://slidepdf.com/reader/full/aca-april2011-solved 3/15
Distributed-Shared Memory – It consists of multiprocessors with physically distributed
memory. To support larger processor counts, memory must be distributed among the
processors rather than centralized; otherwise the memory system would not be able to
support the bandwidth demands of a larger number of processors without incurring excessively
long access latency.
Both direct networks (i.e., switches) and indirect networks (typically multi-dimensional
meshes) are used to interconnect a large number of processors. Distributing the memory among
the nodes has two major benefits.
Advantages
• It is a cost-effective way to scale the memory bandwidth if most of the accesses are to
the local memory in the node.
• It reduces the latency for accesses to the local memory. These two advantages make
distributed memory attractive at smaller processor counts as processors get ever faster
and require more memory bandwidth and lower memory latency.
Disadvantages
The disadvantages of a distributed-memory architecture are that communicating data between
processors becomes somewhat more complex, and that it requires more effort in the software to
take advantage of the increased memory bandwidth afforded by distributed memories.
Differentiate between shared memory and message passing architectures
Shared memory
• The physically separate memories can be addressed as one logically shared address
space.
• These multiprocessors are called distributed shared-memory (DSM) architectures.
• The same physical address on two processors refers to the same location in memory
• DSM multiprocessors are also called NUMAs (non-uniform memory access), since the
access time depends on the location of a data word in memory.
Message Passing
• The address space can consist of multiple private address spaces that are logically disjoint and
cannot be addressed by a remote processor.
• In such multiprocessors, the same physical address on two different processors refers to two
different locations in two different memories.
• Each processor-memory module is essentially a separate computer.
• In a multiprocessor with multiple address spaces, communication of data is done by explicitly
passing messages among the processors. Therefore, these multiprocessors are often called
message-passing multiprocessors.
• Clusters inherently use message passing.
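A minimal Python sketch of the message-passing style: threads and queues stand in for processors and the interconnect (the doubling computation is arbitrary). Each "processor" keeps private state and communicates only by explicit send and receive, never through shared variables.

```python
import threading
import queue

# Sketch of message passing: private state, explicit send/receive.
def worker(inbox, outbox):
    local = inbox.get()          # receive a private copy of the data
    outbox.put(local * 2)        # send the result back as a message

inbox, outbox = queue.Queue(), queue.Queue()
t = threading.Thread(target=worker, args=(inbox, outbox))
t.start()
inbox.put(21)                    # explicit send to the other "processor"
result = outbox.get()            # explicit receive of the reply
t.join()
print(result)
```

In a shared-memory design the two sides would instead read and write the same address; here all communication is visible as queue operations.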
Write short notes on shared Virtual Memory and Static Interconnection Networks
Shared Virtual memory
A global virtual address space is shared among processors residing at a large number of loosely coupled
processing nodes. The idea is to implement coherent shared memory on a network of processors
without physically shared memory. Each virtual address space can be as large as a single node can
provide and is shared by all nodes in the system. The SVM address space is organized in pages and can
be accessed by any node in the system. A memory mapping manager on each node views its local
memory as a large cache of pages for its associated processors.
Static Interconnection Networks
• Linear Array
o N nodes connected by n-1 links (not a bus). Segments between different pairs of nodes
can be used in parallel
o Internal nodes have a degree of 2; end nodes have a degree of 1
o For small n this is economical; for large n it is inappropriate
• Ring and Chordal Ring
o Like a linear array, but the two end nodes are connected by an nth link; the ring can be uni- or bi-
directional.
o By adding additional links the node degree is increased, and we obtain a chordal ring.
o In the limit, we obtain a fully connected network, with a node degree of n-1 and a diameter
of 1
• Barrel Shifter
o Like a ring, but with additional links between all pairs of nodes whose distance is a power
of 2
o With a network of size N = 2^n, each node has degree d = 2n-1, and the network has diameter
D = n/2
o Barrel shifter connectivity is greater than that of any chordal ring of lower node degree.
o The barrel shifter is much less complex than a fully interconnected network.
• Tree and Star
o A k-level completely balanced binary tree will have N = 2^k - 1 nodes, with a maximum node
degree of 3 and a network diameter of 2(k-1).
o The balanced binary tree is scalable, since it has a constant maximum node degree.
o A star is a two-level tree with a node degree d = N-1 and a constant diameter of 2
• Fat Tree
o A fat tree is a tree in which the number of edges between nodes increases closer to the root.
o The edges represent communication channels, and since communication traffic
increases as root is approached, it seems logical to increase the number of channels
there.
• Mesh and Torus
o Pure Mesh – N = n^k nodes with links between each adjacent pair of nodes in a row or
column. This is not a symmetric network; interior node degree d = 2k, diameter = k(n-1)
o Illiac Mesh – Wraparound is allowed, thus reducing the network diameter to about half
that of the equivalent pure mesh.
o A torus has ring connections in each dimension and is symmetric. An n x n binary torus
has a node degree of 4 and a diameter of 2⌊n/2⌋
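The degree and diameter figures quoted above can be checked with a small breadth-first-search sketch in Python; the ring topology and n = 8 are chosen only as an example.

```python
from collections import deque

def diameter(adj):
    """Longest shortest path over all node pairs, via BFS from each node.
    adj maps each node to its list of neighbors."""
    best = 0
    for s in adj:
        dist = {s: 0}
        q = deque([s])
        while q:
            u = q.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    q.append(v)
        best = max(best, max(dist.values()))
    return best

# A ring connects node i to nodes (i-1) mod n and (i+1) mod n.
n = 8
ring = {i: [(i - 1) % n, (i + 1) % n] for i in range(n)}
print(diameter(ring))   # a ring's diameter is floor(n/2)
```

The same `diameter` helper works for any of the static topologies above once its adjacency map is written down, which makes it easy to compare, say, a chordal ring against a plain ring.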
Draw a cacheline and bus state diagram of MESI cache coherence protocol.
Explain with example?
The MESI protocol is a widely used cache coherence protocol. It is the most common protocol that
supports write-back caches. Its use in personal computers became widespread with the introduction of
Intel's Pentium processor, to "support the more efficient write-back cache in addition to the
write-through cache previously used by the Intel 486 processor."
Every cache line is marked with one of the four following states:

Modified: The cache line is present only in the current cache, and is dirty; it has been modified from
the value in main memory. The cache is required to write the data back to main memory at some time in
the future, before permitting any other read of the (no longer valid) main memory state. The
write-back changes the line to the Exclusive state.

Exclusive: The cache line is present only in the current cache, but is clean; it matches main memory.
It may be changed to the Shared state at any time, in response to a read request. Alternatively, it may
be changed to the Modified state when writing to it.

Shared: Indicates that this cache line may be stored in other caches of the machine and is "clean"; it
matches the main memory. The line may be discarded (changed to the Invalid state) at any time.

Invalid: Indicates that this cache line is invalid.
For any given pair of caches, the permitted states of a given cache line are as follows (a Modified or
Exclusive copy in one cache forces the other cache's copy to be Invalid):

        M    E    S    I
   M    -    -    -    ok
   E    -    -    -    ok
   S    -    -    ok   ok
   I    ok   ok   ok   ok

Operation
In a typical system, several caches share a common bus to main memory. Each also has an attached CPU
which issues read and write requests. The caches' collective goal is to minimize the use of the shared
main memory.

A cache may satisfy a read from any state except Invalid. An Invalid line must be fetched (to the
Shared or Exclusive state) to satisfy a read.

A write may only be performed if the cache line is in the Modified or Exclusive state. If it is in the
Shared state, all other cached copies must be invalidated first. This is typically done by a broadcast
operation known as Request For Ownership (RFO).
A cache may discard a non-Modified line at any time, changing to the Invalid state. A Modified line must
be written back first.
A cache that holds a line in the Modified state must snoop (intercept) all attempted reads (from all of
the other caches in the system) of the corresponding main memory location and insert the data that it
holds. This is typically done by forcing the read to back off (i.e., retry later), then writing the data to main
memory and changing the cache line to the Shared state.
A cache that holds a line in the Shared state must listen for invalidate or request-for-ownership
broadcasts from other caches, and discard the line (by moving it into the Invalid state) on a match.
A cache that holds a line in the Exclusive state must also snoop all read transactions from all other
caches, and move the line to the Shared state on a match.
The Modified and Exclusive states are always precise: i.e., they match the true cache line ownership
situation in the system. The Shared state may be imprecise: if another cache discards a Shared line,
this cache may become the sole owner of that cache line, but it will not be promoted to the Exclusive
state. Other caches do not broadcast notices when they discard cache lines, and this cache could not
use such notifications without maintaining a count of the number of shared copies.

In that sense the Exclusive state is an opportunistic optimization: if the CPU wants to modify a cache
line that is in state S, a bus transaction is necessary to invalidate all other cached copies. State E
enables modifying a cache line with no bus transaction.
What are instruction dependencies? Explain how these dependencies resolved
using pipeline stall mechanism in 3-stage pipeline?
(1) Data Hazards
(i) Caused by data (RAW, WAW, WAR) dependences
(ii) Require
(A) Pipeline interlock (stall) mechanism to detect dependences and generate machine stall
cycles
(i) Reg-id comparators between instrs in REG stage and instrs in EXE/WRB stages
(iii) Stalls due to RAW hazards can be reduced by bypass network
(A) Reg-id comparators + data bypass paths + mux
(2) Structural Hazards
(i) Caused by resource constraints
(ii) Require pipeline stall mechanism to detect structural constraints
(3) Control (Branch) Hazards
(i) Caused by branches
(ii) Instruction fetch of the next instruction has to wait until the target (including the branch
condition) of the current branch instruction is resolved
(iii) Use
(A) Pipeline stall to delay the fetch of the next instruction
(B) Predict the next target address (branch prediction) and if wrong, flush all the speculatively
fetched instructions from the pipeline
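The reg-id comparison behind the pipeline interlock in (1)(ii)(A) amounts to a set intersection, sketched here in Python with register names taken from the earlier DIV.D/ADD.D/SUB.D example:

```python
# Sketch of the interlock's reg-id comparators: stall the instruction in
# the REG stage if it reads a register that an older instruction still
# in EXE/WRB is going to write (a RAW hazard).
def needs_stall(srcs_in_reg, dests_in_flight):
    """srcs_in_reg: source registers of the instruction in REG;
    dests_in_flight: destination registers of instructions in EXE/WRB."""
    return bool(set(srcs_in_reg) & set(dests_in_flight))

print(needs_stall(["F0", "F8"], ["F0"]))    # ADD.D vs pending DIV.D
print(needs_stall(["F8", "F14"], ["F0"]))   # SUB.D is independent
```

A bypass network removes some of these stalls by forwarding the result directly, but the comparator logic shown here is still needed to decide when forwarding (or stalling) applies.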
Define Amdahl’s law, CPI and Execution time
The performance gain that can be obtained by improving some portion of a computer can be calculated
using Amdahl’s Law. Amdahl’s Law states that the performance improvement to be gained from using
some faster mode of execution is limited by the fraction of the time the faster mode can be used.
Amdahl’s Law defines the speedup that can be gained by using a particular feature.
Speedup is the Ratio
Speedup= Performance for entire task using enhancement when possible / Performance for entire task
without using enhancement
Speedup= Execution Time for entire task without using enhancement / Execution Time for entire task
using enhancement when possible
Speedup tells us how much faster a task will run using the machine with the enhancement as opposed
to the original machine.
CPI – The average number of clock cycles per instruction, which can be calculated if we know the
number of clock cycles and the instruction count.
CPI is computed as
CPI = CPU clock cycles per program / Instruction count
Execution Time – If we know the instruction count (IC), the CPI (clock cycles per instruction), and
the clock cycle time, then execution time is defined as
CPU Time = Instruction count X Cycles per instruction X Clock cycle time
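A worked instance of the formula, with made-up numbers: 10^9 instructions, a CPI of 2, and a 1 ns cycle time (i.e., a 1 GHz clock).

```python
# CPU Time = instruction count x CPI x clock cycle time
instruction_count = 1_000_000_000
cpi = 2.0
clock_cycle_time = 1e-9          # seconds per cycle (1 GHz clock)

cpu_time = instruction_count * cpi * clock_cycle_time
print(cpu_time)   # seconds
```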
Explain prefix computation and loop unrolling method with example
Loop Unrolling
A simple scheme for increasing the number of instructions relative to the branch and overhead
instructions is loop unrolling. Unrolling simply replicates the loop body multiple times, adjusting the loop
termination code.
Loop unrolling can also be used to improve scheduling. Because it eliminates the branch, it allows
instructions from different iterations to be scheduled together.
Example: Show our loop unrolled so that there are four copies of the loop body, assuming R1 - R2 (that
is, the size of the array) is initially a multiple of 32, which means that the number of loop iterations is a
multiple of 4. Eliminate any obviously redundant computations and do not reuse any of the registers.
Answer: Here is the result after merging the DADDUI instructions and dropping the unnecessary BNE
operations that are duplicated during unrolling. Note that R2 must now be set so that 32(R2) is the
starting address of the last four elements.
Loop: L.D F0,0(R1)
ADD.D F4,F0,F2
S.D F4,0(R1) ;drop DADDUI & BNE
L.D F6,-8(R1)
ADD.D F8,F6,F2
S.D F8,-8(R1) ;drop DADDUI & BNE
L.D F10,-16(R1)
ADD.D F12,F10,F2
S.D F12,-16(R1) ;drop DADDUI & BNE
L.D F14,-24(R1)
ADD.D F16,F14,F2
S.D F16,-24(R1)
DADDUI R1,R1,#-32
BNE R1,R2,Loop
We have eliminated three branches and three decrements of R1. The addresses on the loads and stores
have been compensated to allow the DADDUI instructions on R1 to be merged.
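The same transformation can be sketched in Python: the unrolled version handles four elements per iteration with a single index update and, assuming the array length is a multiple of 4 as in the example, produces the same result as the rolled loop.

```python
# Rolled loop: one element, one loop-overhead update per iteration.
def add_scalar_rolled(a, s):
    out = []
    for x in a:
        out.append(x + s)
    return out

# Unrolled by 4: four copies of the body, one index update, one branch.
def add_scalar_unrolled(a, s):
    # assumes len(a) is a multiple of 4, as in the text's example
    out = [0.0] * len(a)
    i = 0
    while i < len(a):
        out[i]     = a[i]     + s   # copy 1
        out[i + 1] = a[i + 1] + s   # copy 2
        out[i + 2] = a[i + 2] + s   # copy 3
        out[i + 3] = a[i + 3] + s   # copy 4
        i += 4                      # merged increment (the DADDUI)
    return out

data = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
print(add_scalar_unrolled(data, 0.5) == add_scalar_rolled(data, 0.5))
```

In Python the unrolling buys nothing, of course; the point is only to show the shape of the transformation: the loop body replicated, the addresses offset, and the overhead instructions merged.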
Explain Hardware and Software Parallelism
September 2010
Write a neat functional structure of SIMD array processor with concurrent scalar
processor in control unit
Functional Structure of SIMD Array Processor
Functional structure of MIMD Array Processor
Define Clock rate, CPI, MIPS rate and throughput rate
Clock Rate
• The CPU is driven by a clock with a constant cycle time. The cycle time is represented by τ, in
nanoseconds
• The inverse of the cycle time is the clock rate
f = 1/τ in megahertz
• The size of a program is determined by its instruction count (Ic), in terms of the number of machine
instructions to be executed in the program
CPI – The average number of clock cycles per instruction, which can be calculated if we know the
number of clock cycles and the instruction count.
CPI is computed as
CPI = CPU clock cycles per program / Instruction count
MIPS:
One alternative to time as the metric is MIPS (Million Instructions Per Second):
MIPS = Instruction count / (Execution time x 10^6)
This MIPS measurement is also called native MIPS to distinguish it from some alternative definitions of
MIPS.
MIPS Rate: The rate at which instructions are executed, in millions of instructions per second.
Throughput: The total amount of work done in a given time.
Throughput rate: The rate at which work is completed per unit time.
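A worked instance of the MIPS formula, with made-up numbers: a program of 200 million instructions finishing in 0.5 s.

```python
# MIPS = instruction count / (execution time x 10^6)
instruction_count = 200_000_000
execution_time = 0.5            # seconds

mips = instruction_count / (execution_time * 1_000_000)
print(mips)   # millions of instructions per second
```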
Define Amdahl’s law and explain
The performance gain that can be obtained by improving some portion of a computer can be calculated
using Amdahl’s Law. Amdahl’s Law states that the performance improvement to be gained from using
some faster mode of execution is limited by the fraction of the time the faster mode can be used.
Amdahl’s Law defines the speedup that can be gained by using a particular feature.
Speedup is the Ratio
Speedup=Performance for entire task using enhancement when possible / Performance for entire task
without using enhancement
Alternatively,
Speedup=Execution Time for entire task without using enhancement / Execution Time for entire task
using enhancement when possible
Speedup tells us how much faster a task will run using the machine with the enhancement as opposed
to the original machine. Amdahl’s Law gives us a quick way to find the speedup from some
enhancement, which depends on two factors:
1. The fraction of the computation time in the original machine that can be converted to take advantage
of the enhancement—For example, if 20 seconds of the execution time of a program that takes 60
seconds in total can use an enhancement, the fraction is 20/60.
This value, which we will call Fraction enhanced, is always less than or equal to 1.
2. The improvement gained by the enhanced execution mode; that is, how much faster the task would
run if the enhanced mode were used for the entire program— This value is the time of the original mode
over the time of the enhanced mode: If the enhanced mode takes 2 seconds for some portion of the
program that can completely use the mode, while the original mode took 5 seconds for the same
portion, the improvement is 5/2. We will call this value, which is always greater than 1,
Speedupenhanced.
The execution time using the original machine with the enhanced mode will be the time spent using the
unenhanced portion of the machine plus the time spent using the enhancement:

Execution time_new = Execution time_old x ((1 - Fraction_enhanced) + Fraction_enhanced / Speedup_enhanced)

The overall speedup is the ratio of the execution times:

Speedup_overall = 1 / ((1 - Fraction_enhanced) + Fraction_enhanced / Speedup_enhanced)
Amdahl’s Law can serve as a guide to how much an enhancement will improve performance and how to
distribute resources to improve cost/performance.
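Amdahl's Law can be checked numerically with the example above: 20 s of a 60 s program can be enhanced (fraction = 1/3), and the enhanced mode is 5/2 = 2.5 times faster.

```python
# Speedup_overall = 1 / ((1 - F) + F / S), per Amdahl's Law.
def amdahl_speedup(fraction_enhanced, speedup_enhanced):
    return 1.0 / ((1.0 - fraction_enhanced)
                  + fraction_enhanced / speedup_enhanced)

f = 20 / 60   # fraction of original time that can use the enhancement
s = 5 / 2     # speedup of the enhanced mode (5 s -> 2 s)
print(round(amdahl_speedup(f, s), 3))
```

Note the diminishing returns the law implies: even with an infinitely fast enhanced mode (s approaching infinity), the overall speedup here could never exceed 1 / (1 - 1/3) = 1.5.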
Write the VAX 8600 CISC processor Architecture
The VAX 8600 was introduced by Digital Equipment Corporation in 1985. This machine implements a
typical CISC architecture with microprogrammed control. The instruction set contains about
300 instructions with 20 different addressing modes. As shown in Figure 5.2.3(a), the VAX 8600
executes the same instruction set, runs the same VMS operating system, and interfaces with the same I/
O buses (such as SBI and Unibus) as the VAX 11/780.
The CPU in the VAX 8600 consists of two functional units for concurrent execution of integer and
floating-point instructions. The unified cache is used for holding both instructions and data. There are 16
GPRs in the instruction unit. Instruction pipelining has been built with six stages in the VAX 8600, as in
most CISC machines. The instruction unit prefetches and decodes instructions, handles branching
operations, and supplies operands to the two functional units in a pipelined fashion.
A translation look aside buffer (TLB) is used in the memory control unit for fast generation of a physical
address from a virtual address. Both integer and floating-point units are pipelined. The CPI of a VAX
8600 instruction varies within a wide range from 2 cycles to as high as 20 cycles. For example, both
multiply and divide may tie up the execution unit for a large number of cycles. This is caused by the use
of long sequences of microinstructions to control hardware operations.
What are reservation tables in pipelining? Mention its advantages?
• A reservation table displays the time-space flow of data through the pipeline for one function
evaluation
• A static pipeline is specified by a single reservation table
• A dynamic pipeline may be specified by multiple reservation tables
• The number of columns in a reservation table is called the evaluation time of a given function.
• The checkmarks in a row correspond to the time instants (cycles) that a particular stage will be used.
• Multiple checkmarks in a row indicate repeated usage of the same stage in different cycles
• Contiguous checkmarks indicate extended usage of a stage over more than one cycle
• Multiple checkmarks in one column indicate that multiple stages are used in parallel
• A dynamic pipeline may allow different initiations to follow a mix of reservation tables
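A reservation table can be represented directly and analyzed in Python. The sketch below lists the forbidden latencies (the column distances between checkmarks in the same row, i.e., initiation gaps that would cause a stage collision) for a hypothetical 3-stage table.

```python
# A reservation table as one string per stage (row); 'X' marks the
# cycle in which that stage is used. Forbidden latencies are the
# distances between two X's in the same row.
def forbidden_latencies(table):
    forbidden = set()
    for row in table:
        used = [c for c, mark in enumerate(row) if mark == "X"]
        for i in range(len(used)):
            for j in range(i + 1, len(used)):
                forbidden.add(used[j] - used[i])
    return sorted(forbidden)

# Hypothetical static 3-stage pipeline, evaluation time = 5 columns.
table = ["X...X",   # stage 1 used in cycles 0 and 4
         ".X.X.",   # stage 2 used in cycles 1 and 3
         "..X.."]   # stage 3 used in cycle 2
print(forbidden_latencies(table))
```

Initiating a new function evaluation at any latency in this list would schedule two evaluations onto the same stage in the same cycle; any other latency is collision-free for this table.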