Ch 2 SymmShared Performance Issues
-
7/30/2019 Ch 2 SymmShared Performance Issues
1/37
Multiprocessors and Thread-Level Parallelism
UNIT IV
-
Performance of Symmetric
Shared-Memory Multiprocessors
-
Performance of Symmetric Shared Memory
In a bus-based multiprocessor using an invalidation protocol, overall cache performance is a combination of:
Uniprocessor cache miss traffic
Traffic caused by communication, which results in invalidations and subsequent cache misses
Changing the processor count, cache size, and block size can affect these two components of the miss rate.
-
Performance of Symmetric Shared Memory
Uniprocessor miss rates are:
1. Compulsory
2. Capacity
3. Conflict
Communication miss rate: coherence misses
True sharing misses + false sharing misses
-
Coherence Misses
Def: The misses that arise from interprocessor communication.
They can be broken down into 2 separate sources:
True sharing misses
False sharing misses
-
1. True Sharing Misses
These arise from the communication of data through the cache coherence mechanism.
The first write by a PE to a shared cache block causes an invalidation to establish ownership of that block.
When another PE attempts to read a modified word in that cache block, a miss occurs and the resultant block is transferred.
-
2. False Sharing Misses
These arise from the use of an invalidation-based coherence algorithm with a single valid bit per cache block.
They occur when a block is invalidated (and a subsequent reference causes a miss) because some word in the block, other than the one being read, is written into.
The invalidation does not cause a new value to be communicated, but only causes an extra cache miss.
The block is shared, but no word in the cache is actually shared.
-
True and False Sharing Miss Example
Assume that words x1 and x2 are in the same cache block, which is in the shared state in the caches of P1 and P2. Assuming the following sequence of events, identify each miss as a true sharing miss or a false sharing miss.

Time   P1         P2
1      Write x1
2                 Read x2
3      Write x1
4                 Write x2
5      Read x2
-
Example Result
1: True sharing miss (invalidate P2)
2: False sharing miss
   x2 was invalidated by the write of x1 in P1, but that value of x1 is not used in P2
3: False sharing miss
   The block containing x1 is marked shared due to the read in P2, but P2 did not read x1. A write miss is required to obtain exclusive access to the block
4: False sharing miss
5: True sharing miss
-
Solution Explanation
1. This event is a true sharing miss, since x1 was read by P2 and needs to be invalidated from P2.
2. This event is a false sharing miss, since x2 was invalidated by the write of x1 in P1, but that value of x1 is not used in P2.
3. This event is a false sharing miss, since the block containing x1 is marked shared due to the read in P2, but P2 did not read x1. A write miss is required to obtain exclusive access to the block.
4. This event is a false sharing miss for the same reason as step 3.
5. This event is a true sharing miss, since the value being read was written by P2.
-
Performance Measurements
The following are the different performance measurements of symmetric shared memory multiprocessors:
1. Commercial workload
2. Multiprogramming & OS workload
3. Scientific / Technical workload
-
1. Performance Measurements: Commercial Workload
The following models are used for the performance measurements of the commercial workload:
Alphaserver 4100
Configurable simulator model
-
Performance Measurements Commercial Workload - Alphaserver 4100
The Alphaserver 4100 has four processors. Each processor has a three-level cache hierarchy:
1. L1 consists of a pair of 8 KB direct-mapped on-chip caches,
   One for instructions
   One for data
2. L2 is a 96 KB on-chip unified 3-way set associative cache with a 32-byte block size, using write-back.
3. L3 is an off-chip, combined, direct-mapped 2 MB cache with 64-byte blocks, also using write-back.
-
Performance Measurements Commercial Workload - Alphaserver 4100 (Cont)
The latency for an access to
L2 = 7 cycles
L3 = 21 cycles
Main memory = 80 clock cycles
Execution time breaks down into
Instruction execution time
Cache access time
Memory access time and
Other stalls
-
Performance Measurements Commercial Workload - Alphaserver 4100 (Cont)
Performance of the DSS (Decision Support System) and Altavista workloads is reasonable
Performance of the OLTP (OnLine Transaction Processing) workload is poor
The impact on the OLTP benchmark of
L3 cache size
Processor count
Block size
is studied because the OLTP workload makes heavy demands on the memory system, with large numbers of L3 misses
-
Execution Time of Commercial Workload
Poor performance on OLTP is due to L3 misses (i.e., poor performance of the memory hierarchy)
-
Two-way Set Associative Caches
The following diagram shows the effect of increasing the cache size, using 2-way set associative caches, which reduces the large number of conflict misses.
Execution time improves as the L3 cache grows, due to the reduction in L3 misses
Idle time also grows as the L3 cache grows, reducing some of the performance gains
-
Relative Performance of the OLTP w.r.t. the Size of L3 (two-way set associative)
-
Contributing Causes of Memory Access Cycles
The following diagram displays the number of memory access cycles contributed per instruction from 5 sources:
True sharing
False sharing
Instruction
Capacity / Conflict
Cold
-
Contributing Causes of Memory Access Cycles with Increasing Size of L3
-
L3 Miss Rate versus Block Size
-
Explanations
True sharing & false sharing are unchanged going from 1 MB to 8 MB (L3 cache)
Uniprocessor cache misses improve with cache size increase (Instruction, Capacity/Conflict, Compulsory)
The L3 cache is simulated as two-way set associative. The cold, false sharing and true sharing misses are unaffected by L3 cache size
The contribution to memory access cycles increases as the processor count increases, primarily due to increased true sharing
The increase in the true sharing miss rate leads to an overall increase in memory access cycles per instruction
-
2. Performance Measurements of the Multiprogramming and OS Workload
Two independent copies of the compile phase of the Andrew benchmark
A parallel make using 8 processors
Runs for 5.24 seconds on 8 processors, creating 203 processes & performing 787 disk requests on 3 different file systems
Run with 128 MB of memory & no paging activity
-
Performance Measurements of the Multiprogramming and OS Workload (Cont)
3 distinct phases:
Compile: substantial compute activity
Install the object files in a library: dominated by I/O
Remove the object files: dominated by I/O & 2 PEs are active
-
Performance Measurements of the Multiprogramming and OS Workload (Cont)
Measure CPU idle time & I-cache performance
L1 I-Cache: 32 KB, 2-way set associative with 64-byte block, 1 CC hit time
L1 D-Cache: 32 KB, 2-way set associative with 32-byte block, 1 CC hit time
L2 Cache: 1 MB unified, 2-way set associative with 128-byte block, 10 CC hit time
Main memory: Single memory on a bus with an access time of 100 CC
Disk System: Fixed-access latency of 3 ms (less than normal to reduce idle time)
-
Distribution of Execution Time in the Multiprogrammed Parallel Make Workload

                          User       Kernel     Synchronization  CPU idle
                          execution  execution  wait             for I/O
% instructions executed   27         3          1                69
% execution time          27         7          2                64

A significant I-cache performance loss (at least for the OS)
I-cache miss rate in the OS for a 64-byte block size, 2-way set associative: 1.7% (32 KB), 0.2% (256 KB)
I-cache miss rate at user level: 1/6 of the OS rate
-
Data Miss Rate vs Data Cache Size
The misses can be broken into 3 significant classes:
1. Compulsory misses represent the first access to this block by this processor & are significant in this workload
2. Coherence misses represent misses due to invalidations
3. Normal capacity misses include misses caused by interference between the OS & user processes & between multiple user processes
-
Data Miss Rate vs Data Cache Size
User miss rate drops by a factor of 3
Kernel miss rate drops by a factor of 1.3
-
Reasons
The reasons why the behavior of the OS is more complex than that of the user processes are:
1. The kernel initializes all pages before allocating them to a user
2. The kernel shares data and thus has a non-trivial coherence miss rate
-
Components of Kernel Miss Rate
High rate of compulsory and coherence misses
-
Components of Kernel Miss Rate (Cont)
Compulsory miss rate stays constant
Capacity miss rate drops by more than a factor of 2, including the conflict miss rate
Coherence miss rate nearly doubles
The probability of a miss being caused by an invalidation increases with cache size
-
Kernel and User Behavior
Kernel behavior
Initializes all pages before allocating them to the user: compulsory misses
Kernel actually shares data: coherence misses
User process behavior
Causes coherence misses only when the process is scheduled on a different processor: small miss rate
-
Miss Rate vs. Block Size
32 KB 2-way set associative data cache
User miss rate drops by a factor of just under 3
Kernel miss rate drops by a factor of 4
-
Miss Rate vs. Block Size for Kernel
Compulsory miss rate drops significantly
Coherence miss rate stays roughly constant
-
Miss Rate vs. Block Size (Cont)
Compulsory & capacity misses can be reduced with larger block sizes
The largest improvement is the reduction of the compulsory miss rate
The absence of large increases in the coherence miss rate as block size is increased means that false sharing effects are insignificant