Ch 2 SymmShared Performance Issues

Transcript of Ch 2 SymmShared Performance Issues

  • 7/30/2019 Ch 2 SymmShared Performance Issues (1/37)

    Multiprocessors and Thread-Level Parallelism

    UNIT IV

  • 2/37

    Performance of Symmetric Shared-Memory Multiprocessors

  • 3/37

    Performance of Symmetric Shared Memory

    In a bus-based multiprocessor using an invalidation protocol, overall cache performance is a combination of:

    Uniprocessor cache miss traffic

    Traffic caused by communication, which results in invalidations and subsequent cache misses

    Changing the processor count, cache size, and block size can affect these two components of the miss rate.

  • 4/37

    Performance of Symmetric Shared Memory

    Uniprocessor miss rates are:

    1. Compulsory

    2. Capacity

    3. Conflict

    Communication miss rate: coherence misses

    True sharing misses + false sharing misses
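    Under this taxonomy, the overall miss rate is simply the sum of the uniprocessor components and the coherence components. A minimal sketch of the decomposition, where every per-reference miss rate is a hypothetical placeholder chosen for illustration, not a measurement from this chapter:

    ```python
    # Overall miss rate = uniprocessor misses + coherence misses,
    # following the taxonomy above. All rates below are hypothetical
    # per-reference miss rates, not measured values.

    uniprocessor = {
        "compulsory": 0.002,  # first-reference (cold) misses
        "capacity":   0.010,  # working set exceeds cache size
        "conflict":   0.005,  # set-mapping collisions
    }

    coherence = {
        "true_sharing":  0.004,  # communicated words actually used
        "false_sharing": 0.003,  # unrelated words in the same block
    }

    uniprocessor_rate = sum(uniprocessor.values())
    coherence_rate = sum(coherence.values())
    overall_rate = uniprocessor_rate + coherence_rate

    print(f"uniprocessor miss rate: {uniprocessor_rate:.3f}")
    print(f"coherence miss rate:    {coherence_rate:.3f}")
    print(f"overall miss rate:      {overall_rate:.3f}")
    ```

    Changing the processor count, cache size, or block size moves these components independently: growing the cache shrinks the capacity term, while adding processors typically grows the true sharing term.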

  • 5/37

    Coherence Misses

    Def: the misses that arise from interprocessor communication.

    They can be broken down into 2 separate sources:

    True sharing misses

    False sharing misses

  • 6/37

    1. True Sharing Misses

    These arise from the communication of data through the cache coherence mechanism.

    The first write by a PE to a shared cache block causes an invalidation to establish ownership of that block.

    When another PE attempts to read a modified word in that cache block, a miss occurs and the resultant block is transferred.

  • 7/37

    2. False Sharing Misses

    These arise from the use of an invalidation-based coherence algorithm with a single valid bit per cache block.

    They occur when a block is invalidated (and a subsequent reference causes a miss) because some word in the block, other than the one being read, is written into.

    The invalidation does not cause a new value to be communicated, but only causes an extra cache miss.

    The block is shared, but no word in the cache is actually shared.

  • 8/37

    True and False Sharing Miss Example

    Assume that words x1 and x2 are in the same cache block, which is in the shared state in the caches of P1 and P2. Assuming the following sequence of events, identify each miss as a true sharing miss or a false sharing miss.

    Time  P1        P2
    1     Write x1
    2               Read x2
    3     Write x1
    4               Write x2
    5     Read x2

  • 9/37

    Example Result

    1: True sharing miss (invalidate P2)

    2: False sharing miss

    x2 was invalidated by the write of P1, but that value of x1 is not used in P2

    3: False sharing miss

    The block containing x1 is marked shared due to the read in P2, but P2 did not read x1. A write miss is required to obtain exclusive access to the block

    4: False sharing miss

    5: True sharing miss

  • 10/37

    Solution Explanation

    1. This event is a true sharing miss, since x1 was read by P2 and needs to be invalidated from P2.

    2. This event is a false sharing miss, since x2 was invalidated by the write of x1 in P1, but that value of x1 is not used in P2.

    3. This event is a false sharing miss, since the block containing x1 is marked shared due to the read in P2, but P2 did not read x1. A write miss is required to obtain exclusive access to the block.

    4. This event is a false sharing miss for the same reason as step 3.

    5. This event is a true sharing miss, since the value being read was written by P2.
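    The classification above can be reproduced with a small simulation. This is only a sketch under simplifying assumptions: a single cache block holding the words x1 and x2, MSI-style states, an upgrade of a Shared block counted as a coherence miss (as in the example), and a word-level heuristic for "true vs. false" rather than the full textbook definition. Both caches' access histories are seeded with x1 and x2 to reflect the starting condition that the block is shared in both caches.

    ```python
    # Toy two-processor invalidation-protocol simulator that classifies
    # each coherence event as a true or false sharing miss.
    # Assumptions: one block, words "x1"/"x2", MSI-style states
    # ("M", "S", "I"); upgrading a Shared copy counts as a miss.

    def classify(events, procs=("P1", "P2")):
        state = {p: "S" for p in procs}              # block starts Shared everywhere
        accessed = {p: {"x1", "x2"} for p in procs}  # words used while holding the block
        pending = {p: set() for p in procs}          # words written elsewhere since p lost it
        results = []

        for p, op, w in events:
            others = [q for q in procs if q != p]
            if op == "read":
                if state[p] != "I":
                    accessed[p].add(w)               # hit: no miss recorded
                    continue
                # Read miss: true sharing iff the word was written elsewhere
                results.append("true" if w in pending[p] else "false")
                pending[p].clear()
                accessed[p] = {w}
                state[p] = "S"
                for q in others:
                    if state[q] == "M":
                        state[q] = "S"               # writer downgrades on a remote read
            else:  # write
                if state[p] == "M":
                    pass                             # hit in Modified: no miss
                elif state[p] == "S":
                    # Upgrade: true sharing iff another valid copy used this word
                    used_elsewhere = any(
                        state[q] != "I" and w in accessed[q] for q in others)
                    results.append("true" if used_elsewhere else "false")
                else:
                    # Write miss: true sharing iff the word was written elsewhere
                    results.append("true" if w in pending[p] else "false")
                    pending[p].clear()
                    accessed[p] = {w}
                # The write invalidates all other copies
                for q in others:
                    state[q] = "I"
                    pending[q].add(w)
                state[p] = "M"
                accessed[p].add(w)

        return results

    events = [("P1", "write", "x1"), ("P2", "read", "x2"),
              ("P1", "write", "x1"), ("P2", "write", "x2"),
              ("P1", "read", "x2")]
    print(classify(events))  # → ['true', 'false', 'false', 'false', 'true']
    ```

    The output matches the five classifications in the solution: only events 1 and 5 actually communicate a value; the other three misses are artifacts of block-level invalidation.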

  • 11/37

    Performance Measurements

    The following are the different performance measurements of symmetric shared-memory multiprocessors:

    1. Commercial workload

    2. Multiprogramming & OS workload

    3. Scientific / Technical workload

  • 12/37

    1. Performance Measurements: Commercial Workload

    The following models are used for the performance measurements of the commercial workload:

    Alphaserver 4100

    Configurable simulator model

  • 13/37

    Performance Measurements: Commercial Workload - Alphaserver 4100

    The Alphaserver 4100 has four processors. Each processor has a three-level cache hierarchy:

    1. L1 consists of a pair of 8 KB direct-mapped on-chip caches, one for instructions and one for data.

    2. L2 is a 96 KB on-chip unified 3-way set associative cache with a 32-byte block size, using write-back.

    3. L3 is an off-chip, combined, direct-mapped 2 MB cache with 64-byte blocks, also using write-back.

  • 14/37

    Performance Measurements: Commercial Workload - Alphaserver 4100 (Cont.)

    The latency for an access to:

    o L2 = 7 cycles

    o L3 = 21 cycles

    o Main memory = 80 clock cycles

    Execution time breaks down into:

    Instruction execution time

    Cache access time

    Memory access time

    Other stalls
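    The three latencies above combine into an average penalty for an L1 miss: try L2, then L3, then main memory. A minimal sketch; the 7/21/80-cycle latencies come from the slide, while the local hit rates are hypothetical placeholders, not measured Alphaserver figures.

    ```python
    # Average cost of servicing an L1 miss in the three-level
    # hierarchy described above. Latencies (cycles) are from the
    # slide; the hit rates passed in are illustrative assumptions.

    L2_LATENCY, L3_LATENCY, MEM_LATENCY = 7, 21, 80

    def avg_l1_miss_penalty(l2_hit_rate, l3_hit_rate):
        """Average cycles per L1 miss: L2 hit, else L3 hit, else memory."""
        l2_miss = 1.0 - l2_hit_rate
        l3_miss = 1.0 - l3_hit_rate
        return (l2_hit_rate * L2_LATENCY
                + l2_miss * (l3_hit_rate * L3_LATENCY
                             + l3_miss * MEM_LATENCY))

    # Hypothetical local hit rates: 80% of L1 misses hit in L2,
    # 70% of L2 misses hit in L3.
    penalty = avg_l1_miss_penalty(l2_hit_rate=0.8, l3_hit_rate=0.7)
    print(f"average L1 miss penalty: {penalty:.2f} cycles")  # → 13.34
    ```

    This memory-access term is one of the four components of execution time listed above; the OLTP results that follow show it dominating when the L3 miss rate is high.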

  • 15/37

    Performance Measurements: Commercial Workload - Alphaserver 4100 (Cont.)

    Performance of the DSS (Decision Support System) and AltaVista workloads is reasonable.

    Performance of the OLTP (OnLine Transaction Processing) workload is poor.

    The impact on the OLTP benchmark of

    L3 cache size

    Processor count

    Block size

    is the focus, because the OLTP workload places demands on the memory system with large numbers of

  • 16/37

    Execution Time of Commercial Workload

    Poor performance on OLTP is due to L3 misses (due to poor performance of the memory hierarchy)

  • 17/37

    Two-Way Set Associative Caches

    The following diagram shows the effect of increasing the cache size, using 2-way set associative caches, which reduces the large number of conflict misses.

    Execution time improves as the L3 cache grows, due to the reduction in L3 misses.

    Idle time also increases, which reduces some of the performance gains.

  • 18/37

    Relative Performance of the OLTP Workload w.r.t. the Size of L3 (two-way set associative)

  • 19/37

    Contributing Causes of Memory Access Cycles

    The following diagram displays the number of memory access cycles contributed per instruction from 5 sources: true sharing, false sharing, instruction, capacity/conflict, and cold misses.

  • 20/37

    Contributing Causes of Memory Access Cycles with Increasing Size of L3

  • 21/37

  • 22/37

    L3 Miss Rate versus Block Size

  • 23/37

    Explanations

    True sharing & false sharing are unchanged going from 1 MB to 8 MB of L3 cache.

    Uniprocessor cache misses improve with cache size increase (instruction, capacity/conflict, compulsory).

    The L3 cache is simulated as two-way set associative. The cold, false sharing, and true sharing misses are unaffected by the L3 cache size.

    The contribution to memory access cycles increases as processor count increases, primarily due to increased true sharing.

    The increase in the true sharing miss rate leads to an overall increase in memory access cycles per instruction.

  • 24/37

    2. Performance Measurements of the Multiprogramming and OS Workload

    2 independent copies of the compile phase of the Andrew benchmark

    o A parallel make using 8 processors

    o Runs for 5.24 seconds on 8 processors, creating 203 processes & performing 787 disk requests on 3 different file systems

    o Runs with 128 MB of memory & no paging activity

  • 25/37

    Performance Measurements of the Multiprogramming and OS Workload (Cont.)

    3 distinct phases:

    o Compile: substantial compute activity

    o Install the object files in a binary: dominated by I/O

    o Remove the object files: dominated by I/O & 2 PEs are active

  • 26/37

    Performance Measurements of the Multiprogramming and OS Workload (Cont.)

    Measure CPU idle time & I-cache performance

    L1 I-Cache: 32 KB, 2-way set associative with 64-byte blocks, 1 CC hit time

    L1 D-Cache: 32 KB, 2-way set associative with 32-byte blocks, 1 CC hit time

    L2 Cache: 1 MB unified, 2-way set associative with 128-byte blocks, 10 CC hit time

    Main memory: single memory on a bus with an access time of 100 CC

    Disk system: fixed-access latency of 3 ms (less

  • 27/37

    Distribution of Execution Time in the Multiprogrammed Parallel Make Workload

                               User        Kernel      Synchronization  CPU idle
                               execution   execution   wait             for I/O
    % instructions executed    27          3           1                69
    % execution time           27          7           2                64

    A significant I-cache performance loss (at least for the OS)

    I-cache miss rate in the OS for a 64-byte block size, 2-way set associative: 1.7% (32 KB), 0.2% (256 KB)

    I-cache miss rate at user level: 1/6 of the OS rate

  • 28/37

    Data Miss Rate vs. Data Cache Size

    The misses can be broken into 3 significant classes:

    1. Compulsory misses represent the first access to a block by a processor & are significant in this workload.

    2. Coherence misses represent misses due to invalidations.

    3. Normal capacity misses include misses caused by interference between the OS & user processes & between multiple user processes.

  • 29/37

    Data Miss Rate vs. Data Cache Size

    User drops: a factor of 3

    Kernel drops: a factor of 1.3

  • 30/37

    Reasons

    The reasons why the behavior of the OS is more complex than that of the user processes:

    1. The kernel initializes all pages before allocating them to a user.

    2. The kernel shares data and has a non-trivial coherence miss rate.

  • 31/37

    Components of Kernel Miss Rate

    High rate of compulsory and coherence misses

  • 32/37

    Components of Kernel Miss Rate

    The compulsory miss rate stays constant.

    The capacity miss rate (including the conflict miss rate) drops by more than a factor of 2.

    The coherence miss rate nearly doubles: the probability of a miss being caused by an invalidation increases with cache size.

  • 33/37

    Kernel and User Behavior

    Kernel behavior

    Initializes all pages before allocating them to the user: compulsory misses

    Kernel actually shares data: coherence misses

    User process behavior

    Causes coherence misses only when the process is scheduled on a different processor: small miss rate

  • 34/37

    Miss Rate vs. Block Size

    32 KB 2-way set associative data cache

    User drops: a factor of under 3

    Kernel drops: a factor of 4

  • 35/37

    Miss Rate vs. Block Size for Kernel

    Compulsory misses drop significantly

    Coherence misses stay roughly constant

  • 36/37

    Miss Rate vs. Block Size

    Compulsory & capacity misses can be reduced with larger block sizes.

    The largest improvement is the reduction of the compulsory miss rate.

    The absence of large increases in the coherence miss rate as block size is increased means that false sharing effects are insignificant.
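    Why larger blocks cut compulsory misses can be illustrated with a toy direct-mapped cache simulator. The cache size, block sizes, and sequential word trace below are invented for illustration and do not model the kernel workload; with a purely sequential trace every miss is compulsory, so doubling the block size halves the miss count.

    ```python
    # Toy direct-mapped cache: count misses on a sequential word trace
    # for several block sizes. Parameters are illustrative only.

    def count_misses(trace, cache_words=1024, block_words=8):
        num_sets = cache_words // block_words
        tags = [None] * num_sets          # one resident block tag per set
        misses = 0
        for addr in trace:
            block = addr // block_words   # which block this word belongs to
            idx = block % num_sets        # direct-mapped set index
            if tags[idx] != block:        # miss: fetch the whole block
                tags[idx] = block
                misses += 1
        return misses

    trace = list(range(4096))             # sequential sweep of 4096 words
    for b in (4, 8, 16, 32):
        print(f"block size {b:2d} words: {count_misses(trace, block_words=b)} misses")
    ```

    Coherence misses behave differently: as the slide notes, they do not grow much with block size here, which is the evidence that false sharing (misses on unrelated words packed into one larger block) is insignificant in this workload.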
