Parallelism in Uniprocessor System and Granularity
CS-421 Parallel Processing BE (CIS) Batch 2004-05 Handout_2
Parallelism in a Uniprocessor System
1. Multiprogramming & Timesharing
a. In multiprogramming, several processes reside in main memory, and the CPU switches from one process (say
P1) to another (say P2) when the currently running process (P1) blocks for an I/O operation. The I/O operation
for P1 is handled by a DMA unit while the CPU runs P2.
b. In timesharing, processes are assigned slices of the CPU’s time. The CPU executes the processes in
round-robin fashion, as shown below.
[Figure: round-robin scheduling. Labels: Processes Queue, Time quantum expires, Job finishes]
• It appears that every user (process) has its own processor (multiple virtual processors).
• Prevents a single computation-intensive process from monopolizing the CPU, as can happen in pure multiprogramming.
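The round-robin behaviour described in item (b) can be illustrated with a small simulation. This is only a sketch under stated assumptions: the process names, burst times, and quantum are hypothetical values chosen for illustration, and context-switch overhead is ignored.

```python
from collections import deque

def round_robin(burst_times, quantum):
    """Simulate round-robin timesharing.

    burst_times: dict mapping process name -> total CPU time needed (hypothetical values).
    quantum: length of one time slice.
    Returns (finish_order, timeline), where timeline holds (process, start, end) tuples.
    """
    queue = deque(burst_times.items())   # the processes queue
    timeline = []
    finish_order = []
    clock = 0
    while queue:
        name, remaining = queue.popleft()
        run = min(quantum, remaining)
        timeline.append((name, clock, clock + run))
        clock += run
        if remaining > run:
            # Time quantum expired: the process goes back to the end of the queue.
            queue.append((name, remaining - run))
        else:
            # Job finished: the process leaves the system.
            finish_order.append(name)
    return finish_order, timeline

finish_order, timeline = round_robin({"P1": 5, "P2": 2, "P3": 3}, quantum=2)
for name, start, end in timeline:
    print(f"{name}: {start}-{end}")
print("finish order:", finish_order)
```

In this run the long process P1 finishes last, but it never holds the CPU for more than one quantum at a time, which is the point made in the second bullet above.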
2. Multiplicity of Functional Units
The use of multiple functional units, such as multiple adders, multipliers or even multiple ALUs, to provide
concurrency is not a new idea in a uniprocessor environment. It has been around for decades.
3. Harvard Architecture
a. This provides separate memory units for instructions and data. This effectively doubles the memory
bandwidth, saving CPU time. An example is a split cache, in which instructions are kept in the I-cache and data in the
D-cache.
b. In contrast, when instructions and data are kept in the same memory, the architecture is called the Princeton
Architecture. Examples are a unified cache, main memory, etc.
4. Memory Hierarchy
A parallel processing mechanism supported by memory hierarchy is the simultaneous transfer of
instructions/data between (CPU, cache) and (MM, secondary memory)
5. Pipelining
a. Arithmetic Pipelining
[Figure: 4-stage FP adder with inputs X and Y. Stages: Exponent Comparison, Mantissa Alignment, Significand Addition, Normalize result]
b. Instruction Pipelining
[Figure: 5-stage instruction pipeline in the CPU. Stages: IF, ID, EX, M, WB]
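The four stages of the FP adder can be sketched as four functions, one per stage. This is a simplified illustration, not the handout's design: numbers are modelled as non-negative (mantissa, exponent) pairs with value mantissa × 2^exponent, the significand width `P` is an assumed 4 bits, and alignment simply truncates shifted-out bits. In a real pipelined adder each stage would work on a different operand pair in the same clock cycle.

```python
P = 4  # assumed significand width in bits (hypothetical, for illustration)

def exponent_comparison(x, y):
    """Stage 1: find the operand with the larger exponent and the exponent difference."""
    (mx, ex), (my, ey) = x, y
    if ex >= ey:
        return (mx, ex), (my, ey), ex - ey
    return (my, ey), (mx, ex), ey - ex

def mantissa_alignment(big, small, shift):
    """Stage 2: shift the smaller operand's mantissa right so both share one exponent."""
    return big, (small[0] >> shift, big[1])  # shifted-out bits are truncated

def significand_addition(big, small):
    """Stage 3: add the aligned significands."""
    return big[0] + small[0], big[1]

def normalize(m, e):
    """Stage 4: renormalize so the mantissa has exactly P bits (top bit set)."""
    if m == 0:
        return 0, 0
    while m >= (1 << P):        # overflowed the significand: shift right
        m >>= 1
        e += 1
    while m < (1 << (P - 1)):   # leading zeros: shift left
        m <<= 1
        e -= 1
    return m, e

def fp_add(x, y):
    """Run one operand pair through the four stages in order."""
    big, small, shift = exponent_comparison(x, y)
    big, small = mantissa_alignment(big, small, shift)
    m, e = significand_addition(big, small)
    return normalize(m, e)

m, e = fp_add((12, 0), (8, 2))   # 12 + 32
print(m, e, m * 2 ** e)
```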
A time-space diagram is used to describe the progress of instructions through the pipeline.
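| Stage | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
|-------|---|---|---|---|---|---|---|---|---|----|
| WB | | | | | I1 | I2 | I3 | I4 | I5 | I6 |
| M | | | | I1 | I2 | I3 | I4 | I5 | I6 | I7 |
| EX | | | I1 | I2 | I3 | I4 | I5 | I6 | I7 | I8 |
| ID | | I1 | I2 | I3 | I4 | I5 | I6 | I7 | I8 | I9 |
| IF | I1 | I2 | I3 | I4 | I5 | I6 | I7 | I8 | I9 | I10 |

Columns are clock cycles (Pipelined Execution); rows are pipeline stages.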
We’ve assumed that every stage takes one clock cycle and there are no hazards in the instruction stream.
Instruction Latency = 5 cycles
Instruction Throughput = 6/10 IPC = 0.6 IPC
To gain a better appreciation of pipelined execution, we draw the time-space diagram for non-pipelined
execution, as shown below:
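| Stage | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
|-------|---|---|---|---|---|---|---|---|---|----|
| WB | | | | | I1 | | | | | I2 |
| M | | | | I1 | | | | | I2 | |
| EX | | | I1 | | | | | I2 | | |
| ID | | I1 | | | | | I2 | | | |
| IF | I1 | | | | | I2 | | | | |

Columns are clock cycles (Non-Pipelined Execution); rows are pipeline stages.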
Instruction Latency = 5 cycles
Instruction Throughput = 2/10 IPC = 0.2 IPC
It is evident that pipelined execution improves instruction throughput. However, it doesn’t improve
instruction latency. (In practice, pipelining increases instruction latency due to the delay of the pipeline registers,
as will be explained subsequently.)
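The two time-space diagrams, and the throughput figures of 0.6 IPC and 0.2 IPC, can be reproduced with a short sketch. It relies on the same assumptions as the text (one cycle per stage, no hazards); the function names are illustrative only.

```python
STAGES = ["IF", "ID", "EX", "M", "WB"]

def pipelined_schedule(stages, n_cycles):
    """Map (stage, cycle) -> instruction number for an ideal pipeline."""
    table = {}
    for s, stage in enumerate(stages):
        for cycle in range(1, n_cycles + 1):
            instr = cycle - s          # instruction i occupies stage index s in cycle i + s
            if instr >= 1:
                table[(stage, cycle)] = instr
    return table

def non_pipelined_schedule(stages, n_cycles):
    """Map (stage, cycle) -> instruction number when instructions do not overlap."""
    k = len(stages)
    table = {}
    for cycle in range(1, n_cycles + 1):
        instr, s = divmod(cycle - 1, k)  # each instruction takes k consecutive cycles
        table[(stages[s], cycle)] = instr + 1
    return table

def completed(table, last_stage):
    """Count instructions that have passed through the final stage."""
    return sum(1 for (stage, _cycle) in table if stage == last_stage)

cycles = 10
for label, schedule in (("pipelined", pipelined_schedule),
                        ("non-pipelined", non_pipelined_schedule)):
    table = schedule(STAGES, cycles)
    done = completed(table, "WB")
    print(f"{label}: {done} instructions in {cycles} cycles, "
          f"throughput = {done / cycles} IPC")
```

In both schedules an individual instruction still needs five cycles from IF to WB, which matches the observation that pipelining raises throughput without reducing latency.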
Speedup
Suppose that a k-stage instruction pipeline executes a program containing n instructions. Let τ be the
cycle time.
Execution time on a non-pipelined computer is given as
$$t_{np} = n k \tau \tag{1}$$
Execution time on a pipelined computer is given as
$$t_p = (k - 1 + n)\tau \tag{2}$$
where k – 1 cycles are required to fill up the pipeline (also called pipeline setup time).
By definition, speedup S of pipelined execution over non-pipelined execution is given as
$$S = \frac{\text{time before enhancement}}{\text{time after enhancement}} = \frac{t_{np}}{t_p} = \frac{n k \tau}{(k - 1 + n)\tau} = \frac{nk}{k - 1 + n} \tag{3}$$
Clearly, for a given pipeline, greater speedup is achieved, as more and more instructions are executed. We
can compute the upper bound on speedup as follows:
$$S_{ideal} = \lim_{n \to \infty} \frac{k}{\frac{k - 1}{n} + 1} = k$$
We regard it as ideal speedup because its derivation is based on the assumption of no pipeline hazards. As
can be seen, even ideal speedup cannot go beyond pipeline depth (i.e. number of pipeline stages).
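A quick numerical check of the speedup formula shows how S approaches, but never exceeds, the pipeline depth k as n grows. The values of n and k = 5 below are illustrative choices, not figures from the handout (apart from n = 6, which matches the 10-cycle diagram above).

```python
def speedup(n, k):
    """Ideal speedup S = nk / (k - 1 + n) of a k-stage pipeline running n instructions."""
    return (n * k) / (k - 1 + n)

k = 5
for n in (1, 6, 100, 10_000, 1_000_000):
    print(f"n = {n:>9}: S = {speedup(n, k):.5f}")
```

With a single instruction there is no overlap and S = 1; the bound S = k is only reached in the limit.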
Instruction Throughput
Instruction throughput ω is defined as the number of instructions executed per unit time. This is
calculated as:
$$\omega = \frac{n}{(k - 1 + n)\tau} \tag{4}$$
Multiplying numerator and denominator of (4) by k, we can express ω in terms of speedup S as:
$$\omega = \frac{S}{k\tau}$$
The upper bound on ω is similarly found:
$$\omega_{ideal} = \lim_{n \to \infty} \frac{1}{\left(\frac{k - 1}{n} + 1\right)\tau} = \frac{1}{\tau}$$
CPI
Cycles per instruction (CPI) of pipelined execution can be found as:
$$CPI = \frac{k - 1 + n}{n} = \frac{k - 1}{n} + 1$$
The lower bound on CPI is
$$CPI_{ideal} = \lim_{n \to \infty} \left(\frac{k - 1}{n} + 1\right) = 1$$
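The throughput and CPI expressions of this section can be checked numerically in the same way. The sketch below assumes the formulas as reconstructed from the definitions in (1) to (3); the cycle time of 2 ns is a hypothetical value used only to show the units working out.

```python
def throughput(n, k, tau):
    """Instructions per unit time: n / ((k - 1 + n) * tau)."""
    return n / ((k - 1 + n) * tau)

def cpi(n, k):
    """Cycles per instruction: (k - 1 + n) / n."""
    return (k - 1 + n) / n

def speedup(n, k):
    """Speedup S = nk / (k - 1 + n), repeated here so the block is self-contained."""
    return (n * k) / (k - 1 + n)

k, tau = 5, 2e-9  # 5 stages; tau = 2 ns is an assumed cycle time
for n in (6, 1_000, 1_000_000):
    w = throughput(n, k, tau)
    # The identity omega = S / (k * tau) should hold for every n.
    assert abs(w - speedup(n, k) / (k * tau)) < 1e-6 * w
    print(f"n = {n:>9}: throughput = {w:.3e} instr/s, CPI = {cpi(n, k):.6f}")
print(f"upper bound on throughput: {1 / tau:.3e} instr/s")
```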
Multiple Issue Architectures
These architectures are able to execute multiple instructions in one clock cycle (i.e. performance beyond
just pipelining). An N-way or N-issue architecture can achieve an ideal CPI of 1/N.
There are two major methods of implementing a multiple issue processor.
• Static multiple issue
• Dynamic multiple issue
Static Multiple Issue Architecture
• The scheduling of instructions into issue slots is done by the compiler.
• We can think of the instructions issued in a given clock cycle as forming an issue packet.
• It is useful to think of the issue packet as a single instruction allowing several operations in
predefined fields. This was the reason behind the original name for this approach: Very Long
Instruction Word (VLIW) architecture.
• Intel has its own name for this technique, namely EPIC (Explicitly Parallel Instruction Computing), used in
the Itanium series.
• If it is not possible to find operations that can be done at the same time for all functional units, then
the instruction may contain a NOP in the group of fields for unneeded units.
• Because most instruction words contain some NOPs, VLIW programs tend to be very long.
• The VLIW architecture requires the compiler to be very knowledgeable of implementation details
of the target computer, and may require a program to be recompiled if moved to a different
implementation of the same architecture.
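How a compiler might fill issue packets, and where the NOPs come from, can be sketched with a greedy in-order packer. Everything specific here is an assumption made for illustration: the three functional units in `SLOTS`, the operation format, single-cycle latencies, and the greedy strategy itself (real VLIW compilers use far more sophisticated scheduling).

```python
SLOTS = ("ALU", "MUL", "MEM")  # hypothetical functional units, one field per unit

def pack(ops):
    """Pack operations, in program order, into issue packets.

    ops: list of (label, unit, dest_register, source_registers).
    A new packet is started when the needed unit's field is already taken or the
    operation reads or writes a register written earlier in the current packet.
    Fields left empty are filled with NOP.
    """
    packets, current, written = [], {}, set()
    for label, unit, dest, sources in ops:
        depends = dest in written or any(src in written for src in sources)
        if unit in current or depends:
            packets.append(current)
            current, written = {}, set()
        current[unit] = label
        written.add(dest)
    if current:
        packets.append(current)
    return [tuple(p.get(slot, "NOP") for slot in SLOTS) for p in packets]

program = [
    ("load", "MEM", "r1", ("r0",)),
    ("add",  "ALU", "r2", ("r3", "r4")),
    ("mul",  "MUL", "r5", ("r1", "r2")),   # needs r1 and r2 from the first packet
    ("sub",  "ALU", "r6", ("r5", "r2")),   # needs r5 from the mul
]
for packet in pack(program):
    print(packet)
```

In this example four operations occupy three packets of three fields each, so more than half of the fields are NOPs, which is the code-size effect noted in the list above.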
Dynamic Multiple Issue Architecture
– Also known as superscalars
– The processor (rather than the compiler) decides whether zero, one, or more instructions can be
issued in a given clock cycle.
– Support from compilers is even more crucial for the performance of superscalars because a
superscalar processor can only look at a small window of the program. A good compiler schedules code
in a way that facilitates scheduling decisions by the processor.
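The effect of compiler scheduling on a superscalar can be illustrated with a toy model of in-order issue. The model is an assumption for illustration only: a 2-way processor that issues consecutive instructions in the same cycle until it meets one that depends on a register written earlier in that cycle, with every instruction taking one cycle. Real superscalars are considerably more elaborate.

```python
def issue_groups(instrs, width):
    """Group instructions by issue cycle for a toy in-order, width-way processor.

    instrs: list of (dest_register, source_registers).
    Returns a list of lists of instruction indices, one inner list per clock cycle.
    """
    groups, i = [], 0
    while i < len(instrs):
        group, written = [], set()
        while i < len(instrs) and len(group) < width:
            dest, sources = instrs[i]
            if dest in written or any(src in written for src in sources):
                break  # dependence on this cycle's results: wait for the next cycle
            group.append(i)
            written.add(dest)
            i += 1
        groups.append(group)
    return groups

# Two dependent chains, r0 -> r1 -> r2 and r0 -> r3 -> r4, in two orderings.
original  = [("r1", ("r0",)), ("r2", ("r1",)), ("r3", ("r0",)), ("r4", ("r3",))]
scheduled = [("r1", ("r0",)), ("r3", ("r0",)), ("r2", ("r1",)), ("r4", ("r3",))]

for name, code in (("original", original), ("scheduled", scheduled)):
    groups = issue_groups(code, width=2)
    print(f"{name}: {len(groups)} cycles, CPI = {len(groups) / len(code)}")
```

Interleaving the two independent chains lets this toy processor issue two instructions every cycle, reaching the ideal CPI of 1/N for N = 2, whereas the original order leaves issue slots unused.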
******
Granularity
Qualitative Definition
The level on which work is done in parallel
Examples
• Job / Program Level
- The highest level of parallelism, conducted among programs through multiprogramming/timesharing,
multiprocessing
- Coarsest granularity
• Task / Procedure Level
- Conducted among tasks of a common program (problem), e.g. multithreading
• Interinstruction Level
- Conducted among instructions through superscalar techniques
• Intrainstruction Level
- Conducted among different phases of an instruction through pipelining
- Finest granularity
Quantitative Definition
Granularity = (Time spent on Computation)/ (Time spent on Communication)
• Fine-Grained Applications
- Low granularity, i.e. more communication and less computation
- Less opportunity for performance enhancement
- Facilitates load balancing
• Coarse-Grained Applications
- High granularity, i.e. a large number of instructions between synchronization and communication points
- More opportunity for performance enhancement
- Harder to balance load
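The quantitative definition can be made concrete with a small calculation. The figures are hypothetical: a fixed 1000 ms of computation is split evenly among a number of tasks, and each task is assumed to spend 5 ms communicating regardless of its size.

```python
def granularity(computation_time, communication_time):
    """Granularity = time spent on computation / time spent on communication."""
    return computation_time / communication_time

TOTAL_COMPUTATION_MS = 1000.0   # hypothetical total work
COMM_PER_TASK_MS = 5.0          # hypothetical fixed communication cost per task

for tasks in (10, 100, 1000):
    per_task = TOTAL_COMPUTATION_MS / tasks
    g = granularity(per_task, COMM_PER_TASK_MS)
    print(f"{tasks:>4} tasks: {per_task:6.1f} ms compute each, granularity = {g}")
```

Under these assumptions, splitting the work into more, smaller tasks lowers the granularity: the many small tasks are easier to spread evenly across processors, but a growing share of time goes to communication.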
******