SE-292 High Performance Computing Memory Hierarchy R. Govindarajan govind@serc.
SE-292 High Performance Computing: Memory Hierarchy
R. Govindarajan
govind@serc
Memory Hierarchy
Memory Organization: Memory Hierarchy
• CPU registers: few in number (typically 16/32/128); sub-cycle access time (nsec)
• Cache memory: on-chip memory, 10's of KBytes (to a few MBytes) of locations; access time of a few cycles
• Main memory: 100's of MBytes of storage; access time of several 10's of cycles
• Secondary storage (like disk): 100's of GBytes of storage; access time in msec
Cache Memory; Memory Hierarchy
Recall: in discussing pipelining, we assumed that memory latency would be hidden so that memory appears to operate at processor speed.
Cache memory: the hardware that makes this happen.
Design principle: Locality of Reference
• Temporal locality: recently referenced locations are likely to be referenced again in the near future (equivalently, least recently used objects are least likely to be referenced)
• Spatial locality: neighbours of recently referenced locations are likely to be referenced in the near future
Cache Memory Exploits This
Cache: a hardware structure that provides the memory contents the processor references directly (most of the time), and fast.
[Figure: CPU connected to Cache, which is connected to Main Memory, via address and data lines.]
Cache Design
Given an address A, the cache consists of:
• `Do I Have It?' logic: lookup logic over a table of `addresses I have' (the cache directory)
• Fast memory holding the data (the cache RAM); typical size: 32 KB
How to do Fast Lookup?
• Search algorithms
• Hashing: a hash table, indexed into using a hash function
• Hash function on address A: which bits?
Using the msbs, everything in a small program would index into the same place (collision). Address A and its neighbours possibly differ only in the lsbs, and should be treated as one unit.
Summing up
The cache is organized in terms of blocks: memory locations that share the same address bits other than the lsbs. Main memory is organized the same way.
The address is used as: tag | index into directory | block offset
How It Works (Direct Mapping)
The address is split as tag | index | offset. The index selects a cache directory entry, and the stored tag is compared with the address tag:
Case 1: Cache hit -- the stored tag is the same as the address tag
Case 2: Cache miss -- the stored tag is not the same; the block is fetched from main memory
Cache Terminology
• Cache hit: a memory reference where the required data is found in the cache
• Cache miss: a memory reference where the required data is not found in the cache
• Hit ratio: # of hits / # of memory references
• Miss ratio: 1 - hit ratio
• Hit time: time to access data in the cache
• Miss penalty: time to bring a block into the cache
Cache Organizations
1. Where can a block be placed in the cache? Direct mapped, set associative
2. How to identify a block in the cache? Tag, valid bit, tag-checking hardware
3. Replacement policy? LRU, FIFO, random, ...
4. What happens on writes?
 • Hit: when is main memory updated? Write-back, write-through
 • Miss: what happens on a write miss? Write-allocate, write-no-allocate
Block Placement: Direct Mapping
A memory block goes to the unique cache block given by
(memory block number) mod (# cache blocks)
Example: with 16 memory blocks (0-15) and 8 cache blocks (0-7), memory blocks 6 and 14 both map to cache block 6.
Identifying a Memory Block (DM Cache)
Assume a 32-bit address space, a 16 KB cache, and a 32-byte cache block size.
• Offset field identifies bytes within a cache line: offset bits = log2(32) = 5
• No. of cache blocks = 16 KB / 32 B = 512; index bits = log2(512) = 9
• Tag identifies which memory block is in this cache block: the remaining bits, 32 - 9 - 5 = 18
Address fields: Tag: 18 bits | Index: 9 bits | Offset: 5 bits
Accessing a Block (DM Cache)
Address fields: Tag: 18 bits | Index: 9 bits | Offset: 5 bits
The index selects one directory entry, holding (Tag, V, D). The stored tag is compared (=) with the 18-bit address tag; the result, ANDed with the valid bit V, produces Cache Hit, and the 5-bit offset selects the data within the block.
Block Placement: Set Associative
A memory block goes to a unique set, and within the set to any cache block.
Example: 8 cache blocks organized as 4 sets (Set 0 to Set 3) of 2 blocks each; memory blocks 3, 7, 11, and 15 all map to Set 3.
Identifying a Memory Block (Set-Associative Cache)
Assume a 32-bit address space, a 16 KB cache, a 32-byte cache block size, and 4-way set-associativity.
• Offset bits = log2(32) = 5
• No. of sets = cache blocks / 4 = 512 / 4 = 128; index bits = log2(128) = 7
• Tag: the remaining bits, 32 - 7 - 5 = 20
Address fields: Tag: 20 bits | Index: 7 bits | Offset: 5 bits
Accessing a Block (2-way Set-Associative)
Address fields: Tag: 19 bits | Index: 8 bits | Offset: 5 bits
The index selects one set, i.e., two directory entries (each with Tag, V, D). Both stored tags are compared (=) with the address tag in parallel; the results are ORed to produce Cache Hit, and the matching way supplies the data.
Block Replacement
• Direct mapped: no choice is required
• Set-associative: replacement strategies
 First-In-First-Out (FIFO): simple to implement
 Least Recently Used (LRU): complex, but based on (temporal) locality, hence gives more hits
 Random: simple to implement
Block Replacement...
• Hardware must keep track of LRU information: each directory entry gains an LRU (L) bit alongside Tag, V, and D
• Separate valid bits for each word (or sub-block) of a cache line (e.g., 4 bits for a 4-word block) can speed up access to the required word on a cache miss
(Address fields as before: Tag: 19 bits | Index: 8 bits | Offset: 5 bits; the two ways' tag comparisons are ORed to produce Cache Hit.)
Write Policies
When is main memory updated on a write hit?
Write through: writes are performed both in the cache and in main memory.
+ Cache and memory copies are kept consistent
-- Multiple writes to the same location/block cause higher memory traffic
-- Writes must wait for the longer (memory-write) time
Solution: use a write buffer to hold these write requests and allow the processor to proceed immediately.
Write Policies…
Write back: writes are performed only in the cache. Modified blocks are written back to memory on replacement.
o Needs a dirty bit with each cache block
+ Writes are faster than with write through
+ Reduced traffic to memory
-- Cache & main memory copies are not always the same
-- Higher miss penalty due to write-back time
Write Policies...
What happens on a write miss?
• Write-allocate: allocate a block in the cache and load the block from memory into the cache.
• Write-no-allocate: write directly to main memory.
Write allocate/no-allocate is orthogonal to the write-through/write-back policy. Typical pairings: write-allocate with write-back; write-no-allocate with write-through, which is ideal if the data is mostly read and rarely written.
What Drives Computer Architecture?
[Figure: performance of CPU vs. DRAM, 1980-2000 ("Moore's Law"). Processor performance grows ~60%/yr (2x every 1.5 years); memory (DRAM) performance grows ~9%/yr (2x every 10 years); the processor-memory performance gap grows ~50% per year.]
Cache and Programming
Objective: learn how to assess cache-related performance issues for the important parts of our programs.
We will look at several examples of programs. We consider only the data cache, assuming separate instruction and data caches.
Data cache configuration: direct-mapped, 16 KB, write-back, 32 B block size.
Address fields: Tag: 18 b | Index: 9 b | Offset: 5 b
Example 1: Vector Sum Reduction
double A[2048]; sum = 0.0;
for (i = 0; i < 2048; i++)
    sum = sum + A[i];
To do the analysis, we must view the program close to its machine-code form (to see the loads and stores):
Loop: FLOAD F0, 0(R1)
      FADD  F2, F0, F2
      ADDI  R1, R1, 8
      BLE   R1, R3, Loop
Example 1: Vector Sum Reduction
To do the analysis:
• Observe that the loop index i, sum, and &A[i] are implemented in registers and are not loaded/stored inside the loop
• Only A[i] is loaded from memory
• Hence we will consider only the accesses to array elements
Example 1: Reference Sequence
load A[0], load A[1], load A[2], ..., load A[2047]
Assume the base address of A (i.e., the address of A[0]) is 0xA000, binary 1010 0000 0000 0000; its cache index bits (bits 13-5) are 1 0000 0000, value 256.
The size of an array element (double) is 8 B, so 4 consecutive array elements fit into each 32 B cache block:
• A[0]-A[3] (0xA000, 0xA008, 0xA010, 0xA018) have index 256
• A[4]-A[7] (0xA020, 0xA028, 0xA030, 0xA038) have index 257, and so on
Example 1: Cache Misses and Hits
A[0]     0xA000  256  Miss (cold start)
A[1]     0xA008  256  Hit
A[2]     0xA010  256  Hit
A[3]     0xA018  256  Hit
A[4]     0xA020  257  Miss (cold start)
A[5]     0xA028  257  Hit
A[6]     0xA030  257  Hit
A[7]     0xA038  257  Hit
..       ..      ..   ..
A[2044]  0xDFE0  255  Miss (cold start)
A[2045]  0xDFE8  255  Hit
A[2046]  0xDFF0  255  Hit
A[2047]  0xDFF8  255  Hit
Cold start miss: we assume the cache is initially empty. Also called a compulsory miss.
The hit ratio of our loop is 75%: there are 1536 hits out of 2048 memory accesses. This is entirely due to spatial locality of reference.
If this loop were preceded by a loop that accessed all array elements, the hit ratio of our loop would be 100%: 25% due to temporal locality and 75% due to spatial locality.
Example 1 with double A[4096]
Why should the array size make a difference? Consider the case where the loop is preceded by another loop that accesses all array elements in order.
The entire array no longer fits into the cache: cache size 16 KB, array size 32 KB.
After execution of the previous loop, only the second half of the array is in the cache, so our loop sees the same misses as before.
These are called capacity misses, as they would not be misses if the cache had been big enough.
Example 2: Vector Dot Product
double A[2048], B[2048], sum = 0.0;
for (i = 0; i < 2048; i++)
    sum = sum + A[i] * B[i];
• Reference sequence: load A[0], load B[0], load A[1], load B[1], ...
• Again, array elements are 8 B, so 4 consecutive elements fit into each cache block.
Assume the base addresses of A and B are 0xA000 (binary 1010 0000 0000 0000) and 0xE000 (binary 1110 0000 0000 0000): both have cache index 256.
Example 2: Cache Hits and Misses
A[0]     0xA000  256  Miss (cold start)
B[0]     0xE000  256  Miss (cold start)
A[1]     0xA008  256  Miss (conflict)
B[1]     0xE008  256  Miss (conflict)
A[2]     0xA010  256  Miss (conflict)
B[2]     0xE010  256  Miss (conflict)
A[3]     0xA018  256  Miss (conflict)
B[3]     0xE018  256  Miss (conflict)
..       ..      ..   ..
B[1023]  0xFFF8  511  Miss (conflict)
Conflict miss: a miss due to conflicts in cache-block requirements between memory accesses of the same program.
The hit ratio for our program is 0%. The source of the problem: elements of A and B that are accessed together have the same cache index, so each access evicts the block the other array just loaded.
The hit ratio would be better if the base address of B were such that these cache indices differ.
Example 2 with Padding
• Assume the compiler assigns addresses in the order variables are encountered in declarations.
• To shift the base address of B enough to make the cache index of B[0] differ from that of A[0]:
double A[2052], B[2048];
• The base address of B is now 0xE020; 0xE020 is binary 1110 0000 0010 0000, so the cache index of B[0] is 257, and B[0] and A[0] no longer compete for the same cache block.
• The base address of A is still 0xA000, binary 1010 0000 0000 0000, with cache index 256.
The hit ratio of our loop would then be 75%.
Example 2 with Array Merging
What if we re-declare the arrays as
struct {double A, B;} array[2048];
for (i = 0; i < 2048; i++)
    sum += array[i].A * array[i].B;
Hit ratio: 75%
Example 3: DAXPY
Double precision Y = aX + Y, where X and Y are vectors and a is a scalar.
double X[2048], Y[2048], a;
for (i = 0; i < 2048; i++)
    Y[i] = a*X[i] + Y[i];
Reference sequence: load X[0], load Y[0], store Y[0], load X[1], load Y[1], store Y[1], ...
Hits and misses: assuming the base addresses of X and Y don't conflict in the cache, each group of 4 elements causes 12 accesses (4 loads of X, 4 loads of Y, 4 stores of Y) and 2 misses (one new block each of X and Y), for a hit ratio of 83.3%.
Example 4: 2-d Matrix Sum
double A[1024][1024], B[1024][1024];
for (j = 0; j < 1024; j++)
    for (i = 0; i < 1024; i++)
        B[i][j] = A[i][j] + B[i][j];
Reference sequence: load A[0,0], load B[0,0], store B[0,0], load A[1,0], load B[1,0], store B[1,0], ...
Question: in what order are the elements of a multidimensional array stored in memory?
Storage of Multi-dimensional Arrays
• Row major order: the elements of the first row of the array are followed by those of the 2nd row, the 3rd row, and so on. This is what C uses.
• Column major order: the array is stored column by column in memory. Used in FORTRAN.
Example 4: Hits and Misses
The reference order and the storage order of the arrays are not the same, so our loop shows no spatial locality. Assume padding has been used to eliminate conflict misses due to the base addresses.
Reference sequence: load A[0,0], load B[0,0], store B[0,0], load A[1,0], load B[1,0], store B[1,0], ...
Per array element: Miss (cold), Miss (cold), Hit. Hit ratio: 33.3%.
Question: will A[0,1] still be in the cache when it is required?
Example 4 with Loop Interchange
double A[1024][1024], B[1024][1024];
for (i = 0; i < 1024; i++)
    for (j = 0; j < 1024; j++)
        B[i][j] = A[i][j] + B[i][j];
Reference sequence: load A[0,0], load B[0,0], store B[0,0], load A[0,1], load B[0,1], store B[0,1], ...
Hit ratio: 83.3%
Is Loop Interchange Always Safe?
for (i = 1; i < 2048; i++)
    for (j = 1; j < 2048; j++)
        A[i][j] = A[i+1][j-1] + A[i][j-1];
With i as the outer loop, the updates execute in the order:
A[1,1] = A[2,0] + A[1,0]
A[1,2] = A[2,1] + A[1,1]
...
A[2,1] = A[3,0] + A[2,0]
With the loops interchanged (j outer, i inner), the order becomes:
A[1,1] = A[2,0] + A[1,0]
A[2,1] = A[3,0] + A[2,0]
...
A[1,2] = A[2,1] + A[1,1]
In the original order, A[1,2] reads the old value of A[2,1]; after interchange, A[2,1] has already been rewritten when A[1,2] is computed. The interchange changes the result, so it is not safe here. Another candidate order is to reverse the outer loop: for (i = 2047; i > 1; i--).
Example 5: Matrix Multiplication
double X[N][N], Y[N][N], Z[N][N];
for (i = 0; i < N; i++)
    for (j = 0; j < N; j++)
        for (k = 0; k < N; k++)
            X[i][j] += Y[i][k] * Z[k][j];   /* dot-product inner loop */
Reference sequence (i-j-k order; each inner loop is the dot product of a row of Y with a column of Z):
Y[0,0], Z[0,0], Y[0,1], Z[1,0], Y[0,2], Z[2,0], ... X[0,0],
Y[0,0], Z[0,1], Y[0,1], Z[1,1], Y[0,2], Z[2,1], ... X[0,1],
...
Y[1,0], Z[0,0], Y[1,1], Z[1,0], Y[1,2], Z[2,0], ... X[1,0],
...
With the i and j loops interchanged, X is instead produced down a column:
Y[0,0], Z[0,0], Y[0,1], Z[1,0], Y[0,2], Z[2,0], ... X[0,0],
Y[1,0], Z[0,0], Y[1,1], Z[1,0], Y[1,2], Z[2,0], ... X[1,0],
Y[2,0], Z[0,0], Y[2,1], Z[1,0], Y[2,2], Z[2,0], ... X[2,0],
...
With Loop Interchanging
• The 3 loops can be interchanged in any way here.
• Example: interchange the i and k loops:
double X[N][N], Y[N][N], Z[N][N];
for (k = 0; k < N; k++)
    for (j = 0; j < N; j++)
        for (i = 0; i < N; i++)
            X[i][j] += Y[i][k] * Z[k][j];
• In the inner loop, Z[k][j] is invariant: it can be loaded into a register once for each (k, j), reducing the number of memory references.
Let's try some Loop Unrolling Instead
double X[N][N], Y[N][N], Z[N][N];
for (i = 0; i < N; i++)
    for (j = 0; j < N; j++)
        for (k = 0; k < N; k += 2)    /* unroll the k loop */
            X[i][j] += Y[i][k]*Z[k][j] + Y[i][k+1]*Z[k+1][j];
Does this exploit spatial locality for array Z? No: Z[k][j] and Z[k+1][j] are a full row apart in memory.
Let's try some Loop Unrolling Instead
double X[N][N], Y[N][N], Z[N][N];
for (i = 0; i < N; i++)
    for (j = 0; j < N; j += 2)        /* unroll the j loop */
        for (k = 0; k < N; k++) {
            X[i][j]   += Y[i][k]*Z[k][j];
            X[i][j+1] += Y[i][k]*Z[k][j+1];
        }
This exploits spatial locality for array Z: Z[k][j] and Z[k][j+1] are adjacent in memory.
Let's try some Loop Unrolling Instead
double X[N][N], Y[N][N], Z[N][N];
for (i = 0; i < N; i++)
    for (j = 0; j < N; j += 2)        /* unroll the j loop */
        for (k = 0; k < N; k += 2) {  /* unroll the k loop */
            X[i][j]   += Y[i][k]*Z[k][j]   + Y[i][k+1]*Z[k+1][j];
            X[i][j+1] += Y[i][k]*Z[k][j+1] + Y[i][k+1]*Z[k+1][j+1];
        }
This exploits spatial locality for array Z and temporal locality for array Y. Generalizing this idea gives Blocking or Tiling.
Blocked Matrix Multiplication
for (J = 0; J < N; J += B)
    for (K = 0; K < N; K += B)
        for (i = 0; i < N; i++)
            for (j = J; j < min(J+B, N); j++) {
                for (k = K, r = 0; k < min(K+B, N); k++)
                    r += Y[i][k] * Z[k][j];
                X[i][j] += r;
            }
[Figure: a B-wide band of columns K of Y and the BxB tile of Z at (K, J) are reused across the i loop.]
Some Homework
1. Read N9, N10.
2. Implementing matrix multiplication
 • Objective: the best programs for multiplying 1024x1024 double matrices on any 2 different machines that you normally use
 • Techniques: loop interchange, blocking, etc.
 • Criterion: execution time
 • Report: program and execution times
Reality Check
Question 1: Are real caches built to work on virtual addresses or physical addresses?
Question 2: What about multiple levels in caches?
Question 3: Do modern processors use pipelining of the kind that we studied?
Virtual Memory System
• Supports memory management when multiple processes are running concurrently: page based, segment based, ...
• Ensures protection across processes.
• Address space: the range of memory addresses a process can address (includes program (text), data, heap, and stack); a 32-bit address gives 4 GB.
• With VM, the address generated by the processor is a virtual address.
Page-Based Virtual Memory
A process’ address space is divided into a no. of pages (of fixed size).
A page is the basic unit of transfer between secondary storage and main memory.
Different processes share the physical memory
Virtual addresses must be translated to physical addresses.
Virtual Pages to Physical Frame Mapping
[Figure: the virtual pages of Process 1 ... Process k are mapped to frames scattered throughout main memory.]
Page Mapping Info (Page Table)
A page can be mapped to any frame in main memory. Where is the mapping stored and accessed? In the page table. Each process has its own page table!
Address translation: virtual-to-physical address translation using the page table.
Address Translation
The virtual address (issued by the processor) is translated to a physical address (used to access memory).
Virtual address: Virtual Page No. (18 bits) | Offset (14 bits)
The virtual page number indexes the page table; each entry holds the physical page number plus Pr (protection), D (dirty), and V (valid) bits.
Physical address: Physical Frame No. (18 bits) | Offset (14 bits)
Memory Hierarchy: Secondary to Main Memory
Analogous to the main memory to cache relationship.
• When the required virtual page is in main memory: page hit.
• When the required virtual page is not in main memory: page fault.
• The page-fault penalty is very high (~10's of msec) as it involves disk (secondary storage) access, so the page-fault ratio should be very low.
• Page faults are handled by the OS.
Page Placement
• A virtual page can be placed anywhere in physical memory (fully associative).
• The page table keeps track of the page mapping; there is a separate page table for each process.
• The page table size is quite large! Assume a 32-bit address space and a 16 KB page size:
 # of entries in the page table = 2^32 / 2^14 = 2^18 = 256K
 Page table size = 256K * 4 B = 1 MB = 64 pages!
• The page table itself may be paged (multi-level page tables)!
Page Identification
Use the virtual page number to index into the page table. Accessing the page table causes one extra memory access!
Virtual address: Virtual Page No. (18 bits) | Offset (14 bits); the page-table entry (Phy. Page #, Pr, D, V) supplies the frame number for the physical address: Physical Frame No. (18 bits) | Offset (14 bits).
Page Replacement
Page replacement can use more sophisticated policies than a cache can:
• Least Recently Used
• Second-chance algorithm
• Recency vs. frequency
Write policies: write-back, write-allocate.
Translation Look-Aside Buffer
• Accessing the page table causes one extra memory access!
• To reduce translation time, use a translation look-aside buffer (TLB), which caches recent address translations.
• TLB organization is similar to cache organization (direct-mapped, set-, or fully-associative).
• The TLB is small (4 to 512 entries); the TLB is important for fast translation.
Translation using TLB
Assume a 128-entry, 4-way set-associative TLB (32 sets).
The virtual page number (18 bits) is split into a Tag (13 bits) and an Index (5 bits); the offset is 14 bits.
The index selects a TLB set; each entry holds (Tag, Phy. Page #, Pr, V, D). The four stored tags are compared (=) with the address tag, qualified by V, to produce TLB Hit; the matching entry's physical page number is concatenated with the offset to form the physical address: Physical Frame No. (18 bits) | Offset (14 bits).
Q1: Caches and Address Translation
• Physical addressed cache: the virtual address goes through the MMU first; the cache is accessed with the physical address (which also goes to main memory on a cache miss).
• Virtual addressed cache: the cache is accessed directly with the virtual address; only on a cache miss does the MMU translate it to a physical address for main memory.
Which is less preferable?
Physical addressed cache:
• Hit time is higher (cache access only after translation)
Virtual addressed cache:
• Data/instructions of different processes with the same virtual address may be in the cache at the same time: flush the cache on a context switch, or include a process id in each cache directory entry
• Synonyms: virtual addresses that translate to the same physical address; more than one copy in the cache can lead to a data consistency problem
Another possibility: Overlapped operation
Index into the cache directory using the virtual address while the MMU translates the virtual address to the physical address in parallel; then do the tag comparison using the physical address.
This is a virtual indexed, physical tagged cache.
Addresses and Caches
`Physical Addressed Cache': Physical Indexed, Physical Tagged
`Virtual Addressed Cache': Virtual Indexed, Virtual Tagged
Overlapped cache indexing and translation: Virtual Indexed, Physical Tagged
 (Physical Indexed, Virtual Tagged is also conceivable (?))
Physical Indexed Physical Tagged Cache
Assume a 16KB page size (14-bit page offset) and a 64KB direct mapped cache with 32B blocks.
Virtual address: 18-bit virtual page number + 14-bit page offset.
The MMU first translates the virtual page number into an 18-bit physical page number.
The physical address is then split for the cache access: 5-bit block offset, 11-bit cache index, and 16-bit cache tag, which is compared against the stored tag.
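The address split used above follows directly from the cache parameters. The helper below (our own illustration, not part of the lecture material) derives the offset, index and tag widths for any power-of-two cache:

```c
#include <stdint.h>

/* log2 of a power-of-two value */
static unsigned log2u(unsigned x) {
    unsigned n = 0;
    while (x > 1) { x >>= 1; n++; }
    return n;
}

typedef struct {
    unsigned offset_bits;  /* byte within a block */
    unsigned index_bits;   /* which block frame (direct mapped) */
    unsigned tag_bits;     /* remaining high-order address bits */
} Split;

Split cache_split(unsigned cache_bytes, unsigned block_bytes,
                  unsigned addr_bits)
{
    Split s;
    s.offset_bits = log2u(block_bytes);
    s.index_bits  = log2u(cache_bytes / block_bytes);
    s.tag_bits    = addr_bits - s.index_bits - s.offset_bits;
    return s;
}
```

For the slide's configuration (64KB cache, 32B blocks, 32-bit physical address) this yields a 5-bit offset, 11-bit index and 16-bit tag, matching the figure.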
Virtual Indexed Virtual Tagged Cache
The cache is both indexed and tagged with the virtual address: the 11-bit cache index and 5-bit block offset come from the low-order virtual address bits, and the 18-bit virtual page number is stored and compared as the tag to signal Hit/Miss.
The MMU is needed only on a miss, to produce the 18-bit physical page number (combined with the 14-bit page offset) for accessing main memory.
Virtual Indexed Physical Tagged Cache
The 11-bit cache index and 5-bit block offset are taken from the virtual address, so cache indexing begins immediately.
In parallel, the MMU translates the 18-bit virtual page number; the resulting physical address bits supply the 16-bit cache tag, which is compared against the stored tag.
Multi-Level Caches
Small L1 cache: gives a low hit time, and hence a faster CPU cycle time.
Large L2 cache: reduces the L1 cache miss penalty.
The L2 cache is typically set-associative, to reduce the L2 cache miss ratio.
Typically, L1 is direct mapped, with separate I and D caches; L2 is unified and set-associative.
L1 and L2 are on-chip; L3 is also moving on-chip.
Multi-Level Caches
[Figure: the CPU and MMU access split L1 I- and D-Caches (access time 2 - 4 ns); L1 misses go to the L2 Unified Cache (access time 16 - 30 ns); L2 misses go to Memory (access time 100 ns).]
Cache Performance
One-level cache:
 AMAT = Hit Time_L1 + Miss Rate_L1 x Miss Penalty_L1
Two-level caches:
 AMAT = Hit Time_L1 + Miss Rate_L1 x Miss Penalty_L1
 Miss Penalty_L1 = Hit Time_L2 + Miss Rate_L2 x Miss Penalty_L2
 so AMAT = Hit Time_L1 + Miss Rate_L1 x (Hit Time_L2 + Miss Rate_L2 x Miss Penalty_L2)
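The two-level AMAT formula translates directly into code. The function below is a literal transcription; the example numbers in the comment are our own, chosen only for illustration:

```c
/* Average memory access time for a two-level cache hierarchy,
 * straight from the formula above. All times in cycles,
 * miss rates as fractions.
 *
 * Example (illustrative values): hit_l1 = 1, miss_rate_l1 = 0.05,
 * hit_l2 = 10, miss_rate_l2 = 0.2, miss_penalty_l2 = 100
 *   -> AMAT = 1 + 0.05 * (10 + 0.2 * 100) = 2.5 cycles
 */
double amat_two_level(double hit_l1, double miss_rate_l1,
                      double hit_l2, double miss_rate_l2,
                      double miss_penalty_l2)
{
    double miss_penalty_l1 = hit_l2 + miss_rate_l2 * miss_penalty_l2;
    return hit_l1 + miss_rate_l1 * miss_penalty_l1;
}
```

Note how a small L1 miss rate keeps the large L2 (and memory) latencies from dominating the average, which is the whole point of the hierarchy.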
Putting it Together: Alpha 21264
48-bit virtual and 44-bit physical addresses.
64KB 2-way associative L1 I-Cache with 64-byte blocks (512 sets).
The L1 I-Cache is virtually indexed and virtually tagged (address translation required only on a miss).
8-bit ASID for each process (to avoid a cache flush on context switch).
Alpha 21264
8KB page size (13-bit page offset)
128-entry fully associative TLB
8MB direct mapped unified L2 cache, 64B block size
Critical word (16B) first
Prefetch the next 64B into the instruction prefetcher
21264 Data Cache
The L1 data cache uses a virtual address index, but a physical address tag.
Address translation proceeds in parallel with cache access.
64KB 2-way associative L1 data cache, with write back.
Q2: High Performance Pipelined Processors
Pipelining overlaps the execution of consecutive instructions, improving processor performance.
Current processors use more aggressive techniques for more performance.
Some exploit Instruction Level Parallelism: often, many consecutive instructions are independent of each other and can be executed in parallel (at the same time).
Instruction Level Parallelism Processors
Challenge: identifying which instructions are independent.
Approach 1: build processor hardware to analyze and keep track of dependences.
 Superscalar processors: Pentium 4, RS6000, …
Approach 2: the compiler does the analysis and packs suitable instructions together for parallel execution by the processor.
 VLIW (very long instruction word) processors: Intel Itanium
ILP Processors (contd.)
[Figure: pipeline timing diagrams, one IF-ID-EX-MEM-WB row per instruction.
Pipelined: one instruction enters the pipeline per cycle.
Superscalar: multiple instructions enter the pipeline per cycle, in parallel rows.
VLIW/EPIC: each long instruction carries multiple EX operations that proceed through the pipeline together.]
Multicores
Multiple cores in a single die.
Early efforts utilized multiple cores for multiple programs: throughput-oriented rather than speedup-oriented!
Can they be used by parallel programs?
Assignment #2
1. Learn about the loop unrolling that gcc can do for you. We have unrolled the DAXPY loop 2 times to perform the computation of 2 elements in each loop iteration. Study the effects of increasing the degree of loop unrolling.
DAXPY Loop:
 double a, X[16384], Y[16384], Z[16384];
 for (i = 0; i < 16384; i++)
  Z[i] = a * X[i] + Y[i];
2. Understand the static instruction scheduling performed by the compiler in the above code, with and without optimization flags.
3. Do Problem 5.18 (pages 351-352) in the H&P Computer Architecture book, 4th edition.
4. Implement matrix multiplication of 4096x4096 double matrices on any 2 different machines that you normally use. Apply loop interchange, blocking, etc. to reduce the execution time.
(Due: Oct. 14, 2010)
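For reference, the 2x unrolling mentioned in question 1 can be sketched as below. This is our illustrative version (function name and signature are ours); it assumes the trip count is even, which holds for the 16384-element arrays above:

```c
#include <stddef.h>

/* DAXPY unrolled by 2: each iteration computes two elements,
 * halving the loop-control overhead and exposing independent
 * operations that the scheduler can overlap. n must be even. */
void daxpy_unroll2(size_t n, double a,
                   const double *X, const double *Y, double *Z)
{
    for (size_t i = 0; i < n; i += 2) {
        Z[i]     = a * X[i]     + Y[i];
        Z[i + 1] = a * X[i + 1] + Y[i + 1];
    }
}
```

Higher unroll degrees continue this pattern (with a cleanup loop when n is not a multiple of the unroll factor); gcc can do the same transformation automatically with `-funroll-loops`.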