Ch 2 SymmShared Performance Issues

Transcript of Ch 2 SymmShared Performance Issues

  • 7/30/2019 Ch 2 SymmShared Performance Issues (1/37)

    Multiprocessors and Thread-Level Parallelism

    UNIT IV

  • 2/37

    Performance of Symmetric Shared-Memory Multiprocessors

  • 3/37

    Performance of Symmetric Shared Memory

    In a bus-based multiprocessor using an invalidation protocol, overall cache performance is a combination of:

    Uniprocessor cache miss traffic

    Traffic caused by communication, which results in invalidations and subsequent cache misses

    Changing the processor count, cache size, and block size can affect these two components of the miss rate.

  • 4/37

    Performance of Symmetric Shared Memory

    Uniprocessor miss rates are:

    1. Compulsory

    2. Capacity

    3. Conflict

    Communication miss rate: coherence misses

    True sharing misses + false sharing misses
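    Under this taxonomy, the overall miss rate is simply the sum of the uniprocessor components and the coherence components. A minimal sketch of the decomposition, where every per-reference miss rate is a hypothetical placeholder chosen for illustration, not a measurement from this chapter:

    ```python
    # Overall miss rate = uniprocessor misses + coherence misses,
    # following the taxonomy above. All rates below are hypothetical
    # per-reference miss rates, not measured values.

    uniprocessor = {
        "compulsory": 0.002,  # first-reference (cold) misses
        "capacity":   0.010,  # working set exceeds cache size
        "conflict":   0.005,  # set-mapping collisions
    }

    coherence = {
        "true_sharing":  0.004,  # communicated words actually used
        "false_sharing": 0.003,  # unrelated words in the same block
    }

    uniprocessor_rate = sum(uniprocessor.values())
    coherence_rate = sum(coherence.values())
    overall_rate = uniprocessor_rate + coherence_rate

    print(f"uniprocessor miss rate: {uniprocessor_rate:.3f}")
    print(f"coherence miss rate:    {coherence_rate:.3f}")
    print(f"overall miss rate:      {overall_rate:.3f}")
    ```

    Changing the processor count, cache size, or block size moves these components independently: growing the cache shrinks the capacity term, while adding processors typically grows the true sharing term.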

  • 5/37

    Coherence Misses

    Def: the misses that arise from interprocessor communication.

    They can be broken down into 2 separate sources:

    True sharing misses

    False sharing misses

  • 6/37

    1. True Sharing Misses

    These arise from the communication of data through the cache coherence mechanism.

    The first write by a PE to a shared cache block causes an invalidation to establish ownership of that block.

    When another PE attempts to read a modified word in that cache block, a miss occurs and the resultant block is transferred.

  • 7/37

    2. False Sharing Misses

    These arise from the use of an invalidation-based coherence algorithm with a single valid bit per cache block.

    They occur when a block is invalidated (and a subsequent reference causes a miss) because some word in the block, other than the one being read, is written into.

    The invalidation does not cause a new value to be communicated, but only causes an extra cache miss.

    The block is shared, but no word in the cache is actually shared.

  • 8/37

    True and False Sharing Miss Example

    Assume that words x1 and x2 are in the same cache block, which is in the shared state in the caches of P1 and P2. Assuming the following sequence of events, identify each miss as a true sharing miss or a false sharing miss.

    Time  P1        P2
    1     Write x1
    2               Read x2
    3     Write x1
    4               Write x2
    5     Read x2

  • 9/37

    Example Result

    1: True sharing miss (invalidate P2)

    2: False sharing miss

    x2 was invalidated by the write of P1, but that value of x1 is not used in P2

    3: False sharing miss

    The block containing x1 is marked shared due to the read in P2, but P2 did not read x1. A write miss is required to obtain exclusive access to the block

    4: False sharing miss

    5: True sharing miss

  • 10/37

    Solution Explanation

    1. This event is a true sharing miss, since x1 was read by P2 and needs to be invalidated from P2.

    2. This event is a false sharing miss, since x2 was invalidated by the write of x1 in P1, but that value of x1 is not used in P2.

    3. This event is a false sharing miss, since the block containing x1 is marked shared due to the read in P2, but P2 did not read x1. A write miss is required to obtain exclusive access to the block.

    4. This event is a false sharing miss for the same reason as step 3.

    5. This event is a true sharing miss, since the value being read was written by P2.
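    The classification above can be reproduced with a small simulation. This is only a sketch under simplifying assumptions: a single cache block holding the words x1 and x2, MSI-style states, an upgrade of a Shared block counted as a coherence miss (as in the example), and a word-level heuristic for "true vs. false" rather than the full textbook definition. Both caches' access histories are seeded with x1 and x2 to reflect the starting condition that the block is shared in both caches.

    ```python
    # Toy two-processor invalidation-protocol simulator that classifies
    # each coherence event as a true or false sharing miss.
    # Assumptions: one block, words "x1"/"x2", MSI-style states
    # ("M", "S", "I"); upgrading a Shared copy counts as a miss.

    def classify(events, procs=("P1", "P2")):
        state = {p: "S" for p in procs}              # block starts Shared everywhere
        accessed = {p: {"x1", "x2"} for p in procs}  # words used while holding the block
        pending = {p: set() for p in procs}          # words written elsewhere since p lost it
        results = []

        for p, op, w in events:
            others = [q for q in procs if q != p]
            if op == "read":
                if state[p] != "I":
                    accessed[p].add(w)               # hit: no miss recorded
                    continue
                # Read miss: true sharing iff the word was written elsewhere
                results.append("true" if w in pending[p] else "false")
                pending[p].clear()
                accessed[p] = {w}
                state[p] = "S"
                for q in others:
                    if state[q] == "M":
                        state[q] = "S"               # writer downgrades on a remote read
            else:  # write
                if state[p] == "M":
                    pass                             # hit in Modified: no miss
                elif state[p] == "S":
                    # Upgrade: true sharing iff another valid copy used this word
                    used_elsewhere = any(
                        state[q] != "I" and w in accessed[q] for q in others)
                    results.append("true" if used_elsewhere else "false")
                else:
                    # Write miss: true sharing iff the word was written elsewhere
                    results.append("true" if w in pending[p] else "false")
                    pending[p].clear()
                    accessed[p] = {w}
                # The write invalidates all other copies
                for q in others:
                    state[q] = "I"
                    pending[q].add(w)
                state[p] = "M"
                accessed[p].add(w)

        return results

    events = [("P1", "write", "x1"), ("P2", "read", "x2"),
              ("P1", "write", "x1"), ("P2", "write", "x2"),
              ("P1", "read", "x2")]
    print(classify(events))  # → ['true', 'false', 'false', 'false', 'true']
    ```

    The output matches the five classifications in the solution: only events 1 and 5 actually communicate a value; the other three misses are artifacts of block-level invalidation.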

  • 11/37

    Performance Measurements

    The following are the different performance measurements of symmetric shared-memory multiprocessors:

    1. Commercial workload

    2. Multiprogramming & OS workload

    3. Scientific / Technical workload

  • 12/37

    1. Performance Measurements: Commercial Workload

    The following models are used for the performance measurements of the commercial workload:

    Alphaserver 4100

    Configurable simulator model

  • 13/37

    Performance Measurements: Commercial Workload - Alphaserver 4100

    The Alphaserver 4100 has four processors. Each processor has a three-level cache hierarchy:

    1. L1 consists of a pair of 8 KB direct-mapped on-chip caches, one for instructions and one for data.

    2. L2 is a 96 KB on-chip unified 3-way set associative cache with a 32-byte block size, using write-back.

    3. L3 is an off-chip, combined, direct-mapped 2 MB cache with 64-byte blocks, also using write-back.

  • 14/37

    Performance Measurements: Commercial Workload - Alphaserver 4100 (Cont.)

    The latency for an access to:

    o L2 = 7 cycles

    o L3 = 21 cycles

    o Main memory = 80 clock cycles

    Execution time breaks down into:

    Instruction execution time

    Cache access time

    Memory access time

    Other stalls
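    The three latencies above combine into an average penalty for an L1 miss: try L2, then L3, then main memory. A minimal sketch; the 7/21/80-cycle latencies come from the slide, while the local hit rates are hypothetical placeholders, not measured Alphaserver figures.

    ```python
    # Average cost of servicing an L1 miss in the three-level
    # hierarchy described above. Latencies (cycles) are from the
    # slide; the hit rates passed in are illustrative assumptions.

    L2_LATENCY, L3_LATENCY, MEM_LATENCY = 7, 21, 80

    def avg_l1_miss_penalty(l2_hit_rate, l3_hit_rate):
        """Average cycles per L1 miss: L2 hit, else L3 hit, else memory."""
        l2_miss = 1.0 - l2_hit_rate
        l3_miss = 1.0 - l3_hit_rate
        return (l2_hit_rate * L2_LATENCY
                + l2_miss * (l3_hit_rate * L3_LATENCY
                             + l3_miss * MEM_LATENCY))

    # Hypothetical local hit rates: 80% of L1 misses hit in L2,
    # 70% of L2 misses hit in L3.
    penalty = avg_l1_miss_penalty(l2_hit_rate=0.8, l3_hit_rate=0.7)
    print(f"average L1 miss penalty: {penalty:.2f} cycles")  # → 13.34
    ```

    This memory-access term is one of the four components of execution time listed above; the OLTP results that follow show it dominating when the L3 miss rate is high.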

  • 15/37

    Performance Measurements: Commercial Workload - Alphaserver 4100 (Cont.)

    Performance of the DSS (Decision Support System) and AltaVista workloads is reasonable.

    Performance of the OLTP (OnLine Transaction Processing) workload is poor.

    The impact on the OLTP benchmark of

    L3 cache size

    Processor count

    Block size

    is the focus, because the OLTP workload places demands on the memory system with large numbers of

  • 16/37

    Execution Time of Commercial Workload

    Poor performance on OLTP is due to L3 misses (due to poor performance of the memory hierarchy)

  • 17/37

    Two-Way Set Associative Caches

    The following diagram shows the effect of increasing the cache size, using 2-way set associative caches, which reduces the large number of conflict misses.

    Execution time improves as the L3 cache grows, due to the reduction in L3 misses.

    Idle time also increases, which reduces some of the performance gains.

  • 18/37

    Relative Performance of the OLTP Workload w.r.t. the Size of L3 (two-way set associative)

  • 19/37

    Contributing Causes of Memory Access Cycles

    The following diagram displays the number of memory access cycles contributed per instruction from 5 sources: true sharing, false sharing, instruction, capacity/conflict, and cold misses.

  • 20/37

    Contributing Causes of Memory Access Cycles with Increasing Size of L3

  • 21/37

  • 22/37

    L3 Miss Rate versus Block Size

  • 23/37

    Explanations

    True sharing & false sharing are unchanged going from 1 MB to 8 MB of L3 cache.

    Uniprocessor cache misses improve with cache size increase (instruction, capacity/conflict, compulsory).

    The L3 cache is simulated as two-way set associative. The cold, false sharing, and true sharing misses are unaffected by the L3 cache size.

    The contribution to memory access cycles increases as processor count increases, primarily due to increased true sharing.

    The increase in the true sharing miss rate leads to an overall increase in memory access cycles per instruction.

  • 24/37

    2. Performance Measurements of the Multiprogramming and OS Workload

    2 independent copies of the compile phase of the Andrew benchmark

    o A parallel make using 8 processors

    o Runs for 5.24 seconds on 8 processors, creating 203 processes & performing 787 disk requests on 3 different file systems

    o Runs with 128 MB of memory & no paging activity

  • 25/37

    Performance Measurements of the Multiprogramming and OS Workload (Cont.)

    3 distinct phases:

    o Compile: substantial compute activity

    o Install the object files in a binary: dominated by I/O

    o Remove the object files: dominated by I/O & 2 PEs are active

  • 26/37

    Performance Measurements of the Multiprogramming and OS Workload (Cont.)

    Measure CPU idle time & I-cache performance

    L1 I-Cache: 32 KB, 2-way set associative with 64-byte blocks, 1 CC hit time

    L1 D-Cache: 32 KB, 2-way set associative with 32-byte blocks, 1 CC hit time

    L2 Cache: 1 MB unified, 2-way set associative with 128-byte blocks, 10 CC hit time

    Main memory: single memory on a bus with an access time of 100 CC

    Disk system: fixed-access latency of 3 ms (less

  • 27/37

    Distribution of Execution Time in the Multiprogrammed Parallel Make Workload

                               User        Kernel      Synchronization  CPU idle
                               execution   execution   wait             for I/O
    % instructions executed    27          3           1                69
    % execution time           27          7           2                64

    A significant I-cache performance loss (at least for the OS)

    I-cache miss rate in the OS for a 64-byte block size, 2-way set associative: 1.7% (32 KB), 0.2% (256 KB)

    I-cache miss rate at user level: 1/6 of the OS rate

  • 28/37

    Data Miss Rate vs. Data Cache Size

    The misses can be broken into 3 significant classes:

    1. Compulsory misses represent the first access to a block by a processor & are significant in this workload.

    2. Coherence misses represent misses due to invalidations.

    3. Normal capacity misses include misses caused by interference between the OS & user processes & between multiple user processes.

  • 29/37

    Data Miss Rate vs. Data Cache Size

    User drops: a factor of 3

    Kernel drops: a factor of 1.3

  • 30/37

    Reasons

    The reasons why the behavior of the OS is more complex than that of the user processes:

    1. The kernel initializes all pages before allocating them to a user.

    2. The kernel shares data and has a non-trivial coherence miss rate.

  • 31/37

    Components of Kernel Miss Rate

    High rate of compulsory and coherence misses

  • 32/37

    Components of Kernel Miss Rate

    The compulsory miss rate stays constant.

    The capacity miss rate (including the conflict miss rate) drops by more than a factor of 2.

    The coherence miss rate nearly doubles: the probability of a miss being caused by an invalidation increases with cache size.

  • 33/37

    Kernel and User Behavior

    Kernel behavior

    Initializes all pages before allocating them to the user: compulsory misses

    Kernel actually shares data: coherence misses

    User process behavior

    Causes coherence misses only when the process is scheduled on a different processor: small miss rate

  • 34/37

    Miss Rate vs. Block Size

    32 KB 2-way set associative data cache

    User drops: a factor of under 3

    Kernel drops: a factor of 4

  • 35/37

    Miss Rate vs. Block Size for Kernel

    Compulsory misses drop significantly

    Coherence misses stay roughly constant

  • 36/37

    Miss Rate vs. Block Size

    Compulsory & capacity misses can be reduced with larger block sizes.

    The largest improvement is the reduction of the compulsory miss rate.

    The absence of large increases in the coherence miss rate as block size is increased means that false sharing effects are insignificant.
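    Why larger blocks cut compulsory misses can be illustrated with a toy direct-mapped cache simulator. The cache size, block sizes, and sequential word trace below are invented for illustration and do not model the kernel workload; with a purely sequential trace every miss is compulsory, so doubling the block size halves the miss count.

    ```python
    # Toy direct-mapped cache: count misses on a sequential word trace
    # for several block sizes. Parameters are illustrative only.

    def count_misses(trace, cache_words=1024, block_words=8):
        num_sets = cache_words // block_words
        tags = [None] * num_sets          # one resident block tag per set
        misses = 0
        for addr in trace:
            block = addr // block_words   # which block this word belongs to
            idx = block % num_sets        # direct-mapped set index
            if tags[idx] != block:        # miss: fetch the whole block
                tags[idx] = block
                misses += 1
        return misses

    trace = list(range(4096))             # sequential sweep of 4096 words
    for b in (4, 8, 16, 32):
        print(f"block size {b:2d} words: {count_misses(trace, block_words=b)} misses")
    ```

    Coherence misses behave differently: as the slide notes, they do not grow much with block size here, which is the evidence that false sharing (misses on unrelated words packed into one larger block) is insignificant in this workload.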
