Cache Performance II
cr4bd/3330/S2017/... · 3/30/2017
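cache operation (associative)
[diagram: lookup in a 2-way set-associative cache with 2 sets]
set   way 0: valid tag data     way 1: valid tag data
0            1     10  00 11           1     00  AA BB
1            1     11  B4 B5           1     01  33 44
address bits shown: 10011 1
index: selects the set
tag: compared (=) against the stored tag of each way
each comparison is ANDed with that way's valid bit
OR of the per-way results: is hit? (1)
offset: selects the byte within the matching block: data (B5)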
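The lookup in the diagram can be sketched in C. This is only an illustration under assumptions: the names `struct line`, `cache`, and `cache_lookup` are made up here, the pairing of each data column with a way is one reading of the flattened table, and the caller passes the tag, index, and offset already separated.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

enum { WAYS = 2, SETS = 2, BLOCK_BYTES = 2 };

/* One cache line: valid bit, tag, and the cached bytes. */
struct line {
    bool valid;
    uint8_t tag;
    uint8_t data[BLOCK_BYTES];
};

/* Contents follow the slide's example (assumed way/data pairing). */
static const struct line cache[SETS][WAYS] = {
    { {true, 0x2, {0x00, 0x11}}, {true, 0x0, {0xAA, 0xBB}} },
    { {true, 0x3, {0xB4, 0xB5}}, {true, 0x1, {0x33, 0x44}} },
};

/* Returns true on a hit and stores the selected byte in *out. */
bool cache_lookup(uint8_t tag, unsigned index, unsigned offset, uint8_t *out)
{
    assert(index < SETS && offset < BLOCK_BYTES);
    for (int way = 0; way < WAYS; ++way) {
        const struct line *l = &cache[index][way];
        if (l->valid && l->tag == tag) {  /* valid AND tag match, per way */
            *out = l->data[offset];       /* offset picks the byte */
            return true;                  /* any way matching means a hit */
        }
    }
    return false;                         /* no way matched: miss */
}
```

With tag 11, index 1, and offset 1, the first way of set 1 matches and the byte returned is B5, as in the diagram.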
Tag-Index-Offset formulas
m: memory address bits (Y86-64: 64)
E: number of blocks per set ("ways")
S = 2^s: number of sets
s: (set) index bits
B = 2^b: block size
b: (block) offset bits
t = m − (s + b): tag bits
C = B × S × E: cache size (excluding metadata)
Tag-Index-Offset exercise
My desktop:
L1 Data Cache: 32 KB, 8 blocks/set, 64 byte blocks
L2 Cache: 256 KB, 4 blocks/set, 64 byte blocks
L3 Cache: 8 MB, 16 blocks/set, 64 byte blocks
Divide the address 0x34567 into tag, index, offset for each cache.
T-I-O exercise: L1
quantity: value for L1
block size (given): B = 64 bytes
B = 2^b (b: block offset bits)
block offset bits: b = 6
blocks/set (given): E = 8
cache size (given): C = 32 KB = E × B × S
S = C / (B × E) (S: number of sets)
number of sets: S = 32 KB / (64 bytes × 8) = 64
S = 2^s (s: set index bits)
set index bits: s = log2(64) = 6
T-I-O results
                     L1    L2     L3
sets                 64    1024   8192
block offset bits    6     6      6
set index bits       6     10     13
tag bits             (the rest)
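The same arithmetic can be written as a short C sketch. The names `struct cache_geometry`, `log2_exact`, and `compute_geometry` are hypothetical, and the address width is passed in as a parameter (64 is used below only because the formulas slide gives 64 bits for Y86-64).

```c
#include <assert.h>
#include <stdint.h>

/* Cache geometry derived from size, associativity, and block size. */
struct cache_geometry {
    uint64_t sets;         /* S = C / (B * E) */
    unsigned offset_bits;  /* b = log2(B) */
    unsigned index_bits;   /* s = log2(S) */
    unsigned tag_bits;     /* t = m - (s + b) */
};

/* log2 of a power of two, found by counting right shifts. */
unsigned log2_exact(uint64_t x)
{
    assert(x != 0 && (x & (x - 1)) == 0);  /* must be a power of two */
    unsigned bits = 0;
    while (x > 1) {
        x >>= 1;
        ++bits;
    }
    return bits;
}

struct cache_geometry compute_geometry(uint64_t cache_bytes, uint64_t ways,
                                       uint64_t block_bytes,
                                       unsigned address_bits)
{
    struct cache_geometry g;
    g.sets = cache_bytes / (block_bytes * ways);
    g.offset_bits = log2_exact(block_bytes);
    g.index_bits = log2_exact(g.sets);
    g.tag_bits = address_bits - (g.index_bits + g.offset_bits);
    return g;
}
```

For example, `compute_geometry(32 * 1024, 8, 64, 64)` describes the L1 data cache from the exercise: 64 sets, 6 offset bits, 6 index bits, and the remaining 52 bits as tag.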
T-I-O: splitting
0x34567:   3    4    5    6    7
        0011 0100 0101 0110 0111
bits 0-5 (all offsets): 100111 = 0x27
L1: bits 6-11 (L1 set): 01 0101 = 0x15; bits 12- (L1 tag): 0x34
L2: bits 6-15 (set for L2): 01 0001 0101 = 0x115; bits 16- (L2 tag): 0x3
L3: bits 6-18 (set for L3): 0 1101 0001 0101 = 0xD15; bits 19- (L3 tag): 0x0
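Splitting an address is a matter of masking and shifting. The sketch below assumes the index and offset widths are already known; `struct tio` and `split_address` are names invented for this example.

```c
#include <assert.h>
#include <stdint.h>

/* The three fields of an address, as plain integers. */
struct tio {
    uint64_t tag;
    uint64_t index;
    uint64_t offset;
};

struct tio split_address(uint64_t addr, unsigned index_bits,
                         unsigned offset_bits)
{
    assert(index_bits + offset_bits < 64);
    struct tio r;
    /* lowest offset_bits bits */
    r.offset = addr & ((UINT64_C(1) << offset_bits) - 1);
    /* next index_bits bits */
    r.index = (addr >> offset_bits) & ((UINT64_C(1) << index_bits) - 1);
    /* everything above the index */
    r.tag = addr >> (offset_bits + index_bits);
    return r;
}
```

Calling `split_address(0x34567, 6, 6)` gives tag 0x34, index 0x15, offset 0x27, which is the L1 split worked out by hand; passing 10 or 13 index bits gives the L2 and L3 splits.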
matrix squaring
$B_{ij} = \sum_{k=1}^{n} A_{ik} \times A_{kj}$

/* version 1: inner loop is k, middle is j */
for (int i = 0; i < N; ++i)
    for (int j = 0; j < N; ++j)
        for (int k = 0; k < N; ++k)
            B[i*N+j] += A[i*N+k] * A[k*N+j];

/* version 2: outer loop is k, middle is i */
for (int k = 0; k < N; ++k)
    for (int i = 0; i < N; ++i)
        for (int j = 0; j < N; ++j)
            B[i*N+j] += A[i*N+k] * A[k*N+j];
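As a minimal sketch, the two versions can be wrapped in functions to confirm they compute the same matrix. The function names and the `double` element type are assumptions for this example (the slides do not state the element type), and B is zeroed inside each function so the `+=` starts from a clean result.

```c
#include <assert.h>

/* version 1: inner loop is k, middle is j */
void square_k_inner(int N, const double *A, double *B)
{
    assert(N > 0);
    for (int x = 0; x < N * N; ++x)
        B[x] = 0.0;
    for (int i = 0; i < N; ++i)
        for (int j = 0; j < N; ++j)
            for (int k = 0; k < N; ++k)
                B[i*N+j] += A[i*N+k] * A[k*N+j];
}

/* version 2: outer loop is k, middle is i */
void square_k_outer(int N, const double *A, double *B)
{
    assert(N > 0);
    for (int x = 0; x < N * N; ++x)
        B[x] = 0.0;
    for (int k = 0; k < N; ++k)
        for (int i = 0; i < N; ++i)
            for (int j = 0; j < N; ++j)
                B[i*N+j] += A[i*N+k] * A[k*N+j];
}
```

Both orders add the terms for a given element of B in increasing k, so the results agree exactly; only the order in which memory is touched differs.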
performance
[figure: two plots against N (0 to 600), each comparing k inner and k outer: billions of instructions (0.0 to 1.2) and billions of cycles (0.0 to 1.0)]
alternate view 1: cycles/instruction
[figure: cycles/instruction (0.2 to 0.9) against N (0 to 600)]
alternate view 2: cycles/operation
[figure: cycles per multiply or add (1.0 to 3.5) against N (0 to 600)]
loop orders and locality
loop body: $B_{ij} \mathrel{+}= A_{ik} A_{kj}$
kij order: $B_{ij}$, $A_{kj}$ have spatial locality
kij order: $A_{ik}$ has temporal locality
… better than …
ijk order: $A_{ik}$ has spatial locality
ijk order: $B_{ij}$ has temporal locality
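One way to see why kij wins is to look at how far each access moves between consecutive iterations of the innermost loop. The sketch below is an illustration, not part of the slides: `inner_loop_strides` and its types are hypothetical names, and strides are counted in elements. A stride of 0 means the same element is reused (temporal locality), 1 means the neighboring element (spatial locality), and N means a jump of a whole row.

```c
#include <assert.h>

enum loop_var { VAR_I, VAR_J, VAR_K };

/* Element stride of each access when the innermost variable advances by 1. */
struct strides {
    long b_ij;  /* B[i*N+j] */
    long a_ik;  /* A[i*N+k] */
    long a_kj;  /* A[k*N+j] */
};

struct strides inner_loop_strides(enum loop_var inner, long N)
{
    assert(N > 0);
    struct strides s = {0, 0, 0};
    switch (inner) {
    case VAR_I:  /* i is innermost */
        s.b_ij = N; s.a_ik = N; s.a_kj = 0;
        break;
    case VAR_J:  /* j is innermost (kij order) */
        s.b_ij = 1; s.a_ik = 0; s.a_kj = 1;
        break;
    case VAR_K:  /* k is innermost (ijk order) */
        s.b_ij = 0; s.a_ik = 1; s.a_kj = N;
        break;
    }
    return s;
}
```

With j innermost every access has stride 0 or 1. With k innermost, `A[k*N+j]` moves a full row per iteration, which is the access the ijk lines above leave unmentioned.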
L1 misses
[figure: read misses/1K instructions (0 to 140) against N (0 to 600), k inner vs. k outer]
L1 miss detail
[figure: read misses/1K instructions (0 to 140) against N (0 to 200)]
annotations on the plot:
matrix smaller than L1 cache
N = 93; 93 × 11 ≈ 2^10
N = 114; 114 × 9 ≈ 2^10
N = 2^7
addresses
A[k*114+j] is at 10 0000 0000 0100
A[k*114+j+1] is at 10 0000 0000 1000
A[(k+1)*114+j] is at 10 0001 1100 1100
A[(k+2)*114+j] is at 10 0011 1001 0100
…
A[(k+9)*114+j] is at 11 0000 0000 1100
recall: 6 index bits, 6 block offset bits (L1)
conflict misses
powers of two: lower order bits unchanged
A[k*93+j] and A[(k+11)*93+j]: 1023 elements apart (4092 bytes; 63.9 cache blocks)
64 sets in L1 cache: usually maps to same set
A[k*93+(j+1)] will not be cached (next i loop), even if in same block as A[k*93+j]
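The set mapping behind these conflicts can be checked with a small sketch. It assumes the L1 parameters recalled earlier (6 offset bits, 6 index bits) and 4-byte elements, which matches 1023 elements being 4092 bytes; `l1_set_index` and `rows_sharing_set` are hypothetical helper names, and `base` stands for wherever the array happens to start.

```c
#include <assert.h>
#include <stdint.h>

enum { L1_OFFSET_BITS = 6, L1_INDEX_BITS = 6 };

/* Set index of an address in a cache with 64-byte blocks and 64 sets. */
unsigned l1_set_index(uint64_t addr)
{
    return (unsigned)((addr >> L1_OFFSET_BITS) & ((1u << L1_INDEX_BITS) - 1));
}

/* Counts the rows k in [0, N) for which A[k*N + j] maps to the same
   L1 set as A[0*N + j], for an array of 4-byte elements at `base`. */
int rows_sharing_set(uint64_t base, int N, int j)
{
    assert(N > 0 && j >= 0 && j < N);
    unsigned target = l1_set_index(base + 4u * (uint64_t)j);
    int count = 0;
    for (int k = 0; k < N; ++k) {
        uint64_t addr = base + 4u * ((uint64_t)k * (uint64_t)N + (uint64_t)j);
        if (l1_set_index(addr) == target)
            ++count;
    }
    return count;
}
```

For N = 128 a row is exactly 512 bytes, so every 8th row lands in the same set: `rows_sharing_set(0, 128, 0)` is 16, twice what an 8-way set can hold. Walking down a column therefore evicts blocks before their neighboring elements are used.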
L2 misses
[figure: L2 misses/1K instructions (0 to 10) against N (0 to 600), ijk vs. kij]
systematic approach (1)
for (int k = 0; k < N; ++k)
    for (int i = 0; i < N; ++i)
        for (int j = 0; j < N; ++j)
            B[i*N+j] += A[i*N+k] * A[k*N+j];

goal: get the most out of each cache miss
if N is larger than the cache:
    miss for $B_{ij}$: 1 computation
    miss for $A_{ik}$: N computations
    miss for $A_{kj}$: 1 computation
effectively caching just 1 element
'flat' 2D arrays and cache blocks
[diagram: the flat array drawn as rows, with A[0] and A[N] marked and row x running from $A_{x0}$ to $A_{xN}$]
array usage: kij order
[diagram: row x runs from $A_{x0}$ to $A_{xN}$; marked are $A_{ik}$, the row $A_{k0}$ to $A_{kN}$ containing $A_{kj}$, and the row $B_{i0}$ to $B_{iN}$ containing $B_{ij}$]
for all k: for all i: for all j: $B_{ij} \mathrel{+}= A_{ik} \times A_{kj}$
N calculations for $A_{ik}$
1 for $A_{kj}$, $B_{ij}$
$A_{ik}$ reused in innermost loop (over j): definitely cached (plus rest of cache block)
$A_{kj}$ reused in next middle loop (over i): cached only if entire row fits
$B_{ij}$ reused in next outer loop: probably not still in cache next time (but, at least some spatial locality)
inefficiencies
if N is large enough that a row doesn't fit in cache:
    keeping one block in cache
    accessing one element of that block
if N is small enough that a row does fit in cache:
    keeping one block in cache for $A_{ik}$, using one element
    keeping row in cache for $A_{kj}$, using N times
systematic approach (2)
for (int k = 0; k < N; ++k) {
    for (int i = 0; i < N; ++i) {
        /* A_ik loaded once in this loop (N^2 times): */
        for (int j = 0; j < N; ++j)
            /* B_ij, A_kj loaded each iteration (if N big): */
            B[i*N+j] += A[i*N+k] * A[k*N+j];
    }
}
N^3 multiplies, N^3 adds
about 1 load per operation
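The "about 1 load per operation" figure follows from counting, assuming N is large enough that every access to $B_{ij}$ and $A_{kj}$ in the inner loop counts as a load, and leaving stores out of the count:

```latex
\begin{align*}
\text{loads} &\approx \underbrace{N^2}_{A_{ik}} + \underbrace{N^3}_{B_{ij}} + \underbrace{N^3}_{A_{kj}} = 2N^3 + N^2 \\
\text{operations} &= \underbrace{N^3}_{\text{multiplies}} + \underbrace{N^3}_{\text{adds}} = 2N^3 \\
\frac{\text{loads}}{\text{operations}} &\approx \frac{2N^3 + N^2}{2N^3} = 1 + \frac{1}{2N} \approx 1
\end{align*}
```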
a transformation
for (int kk = 0; kk < N; kk += 2)
    for (int k = kk; k < kk + 2; ++k)
        for (int i = 0; i < N; ++i)
            for (int j = 0; j < N; ++j)
                B[i*N+j] += A[i*N+k] * A[k*N+j];

split the loop over k: should behave exactly the same (assuming even N)
simple blocking
for (int kk = 0; kk < N; kk += 2)
    /* was here: for (int k = kk; k < kk + 2; ++k) */
    for (int i = 0; i < N; ++i)
        for (int j = 0; j < N; ++j)
            for (int k = kk; k < kk + 2; ++k)
                B[i*N+j] += A[i*N+k] * A[k*N+j];

now reorder split loop
simple blocking – expanded
for (int kk = 0; kk < N; kk += 2) {
    for (int i = 0; i < N; ++i) {
        for (int j = 0; j < N; ++j) {
            /* process a "block": */
            B[i*N+j] += A[i*N+kk] * A[kk*N+j];
            B[i*N+j] += A[i*N+kk+1] * A[(kk+1)*N+j];
        }
    }
}
Temporal locality in the $B_{ij}$s
More spatial locality in $A_{ik}$
Still have good spatial locality in $A_{kj}$, $B_{ij}$
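A minimal sketch for checking that blocking over k leaves the result unchanged. The function names and the `double` element type are assumptions for this example, N is assumed even as on the slides, every row i is visited, and B is zeroed inside each function.

```c
#include <assert.h>

/* Unblocked kij order, used as the reference. */
void square_kij(int N, const double *A, double *B)
{
    assert(N > 0);
    for (int x = 0; x < N * N; ++x)
        B[x] = 0.0;
    for (int k = 0; k < N; ++k)
        for (int i = 0; i < N; ++i)
            for (int j = 0; j < N; ++j)
                B[i*N+j] += A[i*N+k] * A[k*N+j];
}

/* Blocked over k in steps of 2, with the two-iteration k loop expanded. */
void square_k_blocked(int N, const double *A, double *B)
{
    assert(N > 0 && N % 2 == 0);
    for (int x = 0; x < N * N; ++x)
        B[x] = 0.0;
    for (int kk = 0; kk < N; kk += 2) {
        for (int i = 0; i < N; ++i) {
            for (int j = 0; j < N; ++j) {
                /* process a "block": both k values for this B element */
                B[i*N+j] += A[i*N+kk] * A[kk*N+j];
                B[i*N+j] += A[i*N+kk+1] * A[(kk+1)*N+j];
            }
        }
    }
}
```

Each element of B still receives its terms in increasing k, so the two functions produce identical output; only the number of times each B element is loaded changes.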
improvement in read misses
[figure: read misses/1K instructions of unblocked (0 to 20) against N (0 to 600), blocked (kk+=2) vs. unblocked]
simple blocking (2)
same thing for i in addition to k?
for (int kk = 0; kk < N; kk += 2) {
    for (int ii = 0; ii < N; ii += 2) {
        for (int j = 0; j < N; ++j) {
            /* process a "block": */
            for (int k = kk; k < kk + 2; ++k)
                for (int i = ii; i < ii + 2; ++i)
                    B[i*N+j] += A[i*N+k] * A[k*N+j];
        }
    }
}
simple blocking – expanded
for (int k = 0; k < N; k += 2) {
    for (int i = 0; i < N; i += 2) {
        for (int j = 0; j < N; ++j) {
            /* process a "block": */
            B[(i+0)*N+j] += A[(i+0)*N+(k+0)] * A[(k+0)*N+j];
            B[(i+0)*N+j] += A[(i+0)*N+(k+1)] * A[(k+1)*N+j];
            B[(i+1)*N+j] += A[(i+1)*N+(k+0)] * A[(k+0)*N+j];
            B[(i+1)*N+j] += A[(i+1)*N+(k+1)] * A[(k+1)*N+j];
        }
    }
}
Now $A_{kj}$ is reused in the inner loop: more calculations per load!
![Page 60: Cache Performance IIcr4bd/3330/S2017/... · 3/30/2017 · A[k*114+j] isat 10 0000 0000 0100 A[k*114+j+1] isat 10 0000 0000 1000 A[(k+1)*114+j] isat 10 0011 1001 0100 A[(k+2)*114+j]](https://reader035.fdocuments.us/reader035/viewer/2022081409/608b3b3b36fec255d73638b5/html5/thumbnails/60.jpg)
array usage (better)
$A_{ik}$ to $A_{i+1,k+1}$
$A_{k0}$ to $A_{k+1,N}$
$B_{i0}$ to $B_{i+1,N}$
more temporal locality:
  N calculations for each $A_{ik}$
  2 calculations for each $B_{ij}$ (for k, k + 1)
  2 calculations for each $A_{kj}$ (for i, i + 1)
more spatial locality:
  calculate on each $A_{i,k}$ and $A_{i,k+1}$ together
  both in same cache block: same number of cache loads
32
![Page 62: Cache Performance IIcr4bd/3330/S2017/... · 3/30/2017 · A[k*114+j] isat 10 0000 0000 0100 A[k*114+j+1] isat 10 0000 0000 1000 A[(k+1)*114+j] isat 10 0011 1001 0100 A[(k+2)*114+j]](https://reader035.fdocuments.us/reader035/viewer/2022081409/608b3b3b36fec255d73638b5/html5/thumbnails/62.jpg)
generalizing cache blocking
for (int kk = 0; kk < N; kk += K) {
  for (int ii = 0; ii < N; ii += I) {
    /* with I by K block of A cached: */
    for (int jj = 0; jj < N; jj += J) {
      /* with K by J block of A, I by J block of B cached: */
      /* for i, j, k in I by J by K block: */
      for (int k = kk; k < kk + K; ++k)
        for (int i = ii; i < ii + I; ++i)
          for (int j = jj; j < jj + J; ++j)
            B[i * N + j] += A[i * N + k] * A[k * N + j];
    }
  }
}
$B_{ij}$ used K times for one miss
$A_{ik}$ used > J times for one miss
$A_{kj}$ used I times for one miss
catch: IK + KJ + IJ elements must fit in cache
33
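A minimal runnable sketch of the blocked loop, checked against the unblocked one. It assumes the block sizes divide N evenly (no fringe handling). The function names `square_naive` and `square_blocked` and the parameters `bi`, `bj`, `bk` (standing in for I, J, K) are illustrative and not from the slides.

```c
#include <assert.h>

/* Unblocked reference: B += A * A for N x N row-major matrices. */
void square_naive(const float *A, float *B, int N) {
    for (int k = 0; k < N; ++k)
        for (int i = 0; i < N; ++i)
            for (int j = 0; j < N; ++j)
                B[i * N + j] += A[i * N + k] * A[k * N + j];
}

/* Blocked version; bi, bj, bk play the roles of I, J, K.
   Assumption: each block size divides N evenly. */
void square_blocked(const float *A, float *B, int N,
                    int bi, int bj, int bk) {
    assert(N % bi == 0 && N % bj == 0 && N % bk == 0);
    for (int kk = 0; kk < N; kk += bk)
        for (int ii = 0; ii < N; ii += bi)
            for (int jj = 0; jj < N; jj += bj)
                /* all work for one bi x bj x bk block */
                for (int k = kk; k < kk + bk; ++k)
                    for (int i = ii; i < ii + bi; ++i)
                        for (int j = jj; j < jj + bj; ++j)
                            B[i * N + j] += A[i * N + k] * A[k * N + j];
}
```

Both functions add the products for a given B element in the same k order, so they produce the same result; only the order in which memory is touched changes.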
![Page 67: Cache Performance IIcr4bd/3330/S2017/... · 3/30/2017 · A[k*114+j] isat 10 0000 0000 0100 A[k*114+j+1] isat 10 0000 0000 1000 A[(k+1)*114+j] isat 10 0011 1001 0100 A[(k+2)*114+j]](https://reader035.fdocuments.us/reader035/viewer/2022081409/608b3b3b36fec255d73638b5/html5/thumbnails/67.jpg)
keeping values in cache
can’t explicitly ensure values are kept in cache
…but reusing values effectively does this
cache will try to keep recently used values
cache optimization idea: choose what’s in the cache
34
![Page 68: Cache Performance IIcr4bd/3330/S2017/... · 3/30/2017 · A[k*114+j] isat 10 0000 0000 0100 A[k*114+j+1] isat 10 0000 0000 1000 A[(k+1)*114+j] isat 10 0011 1001 0100 A[(k+2)*114+j]](https://reader035.fdocuments.us/reader035/viewer/2022081409/608b3b3b36fec255d73638b5/html5/thumbnails/68.jpg)
array usage: block
$A_{ik}$ block (I × K)
$A_{kj}$ block (K × J)
$B_{ij}$ block (I × J)
inner loop keeps “blocks” from A, B in cache
$B_{ij}$ calculation uses strips from A
  K calculations for one load (cache miss)
$A_{ik}$ calculation uses strips from A, B
  J calculations for one load (cache miss)
(approx.) KIJ fully cached calculations
  for KI + IJ + KJ loads
  (assuming everything stays in cache)
35
![Page 72: Cache Performance IIcr4bd/3330/S2017/... · 3/30/2017 · A[k*114+j] isat 10 0000 0000 0100 A[k*114+j+1] isat 10 0000 0000 1000 A[(k+1)*114+j] isat 10 0011 1001 0100 A[(k+2)*114+j]](https://reader035.fdocuments.us/reader035/viewer/2022081409/608b3b3b36fec255d73638b5/html5/thumbnails/72.jpg)
cache blocking efficiency
load I × K elements of $A_{ik}$:
  do > J multiplies with each
load K × J elements of $A_{kj}$:
  do I multiplies with each
load I × J elements of $B_{ij}$:
  do K adds with each
bigger blocks — more work per load!
catch: IK + KJ + IJ elements must fit in cache
36
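The counts on this slide can be combined into a single work-per-load estimate. A sketch, under the same idealization used here (each element of the three blocks is loaded once and then stays cached); the symbol $b$ for a common block size is introduced only for illustration:

```latex
\frac{\text{calculations}}{\text{loads}}
  \approx \frac{IJK}{IK + KJ + IJ},
\qquad
I = J = K = b
  \;\Longrightarrow\;
  \frac{b^3}{3b^2} = \frac{b}{3}
```

So with square blocks the work per load grows linearly with the block size, up to the point where the $3b^2$ elements no longer fit in the cache.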
![Page 73: Cache Performance IIcr4bd/3330/S2017/... · 3/30/2017 · A[k*114+j] isat 10 0000 0000 0100 A[k*114+j+1] isat 10 0000 0000 1000 A[(k+1)*114+j] isat 10 0011 1001 0100 A[(k+2)*114+j]](https://reader035.fdocuments.us/reader035/viewer/2022081409/608b3b3b36fec255d73638b5/html5/thumbnails/73.jpg)
cache blocking rule of thumb
fill most of the cache with useful data
and do as much work as possible from that
example: my desktop's 32KB L1 cache
I = J = K = 48 uses $48^2 \times 3$ elements, or 27KB (with 4-byte floats).
assumption: conflict misses aren’t important
37
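The arithmetic behind this rule of thumb can be sketched in a few lines of C. The helper names `block_footprint` and `max_square_block` are hypothetical, and the model is deliberately crude: it counts only the three blocks and, like the slide, ignores conflict misses and anything else competing for the cache.

```c
#include <stddef.h>

/* Bytes needed to keep an I x K block, a K x J block, and an I x J block
   cached at once, for elements of elem_bytes bytes each. */
size_t block_footprint(size_t I, size_t J, size_t K, size_t elem_bytes) {
    return (I * K + K * J + I * J) * elem_bytes;
}

/* Largest b such that three b x b blocks fit in cache_bytes.
   Returns 0 if elem_bytes is 0 or nothing fits. */
size_t max_square_block(size_t cache_bytes, size_t elem_bytes) {
    if (elem_bytes == 0)
        return 0;
    size_t b = 0;
    while (block_footprint(b + 1, b + 1, b + 1, elem_bytes) <= cache_bytes)
        ++b;
    return b;
}
```

With 4-byte floats, `block_footprint(48, 48, 48, 4)` is 27648 bytes (27KB), and `max_square_block(32768, 4)` returns 52, so a block size of 48 leaves some headroom in a 32KB cache.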
![Page 74: Cache Performance IIcr4bd/3330/S2017/... · 3/30/2017 · A[k*114+j] isat 10 0000 0000 0100 A[k*114+j+1] isat 10 0000 0000 1000 A[(k+1)*114+j] isat 10 0011 1001 0100 A[(k+2)*114+j]](https://reader035.fdocuments.us/reader035/viewer/2022081409/608b3b3b36fec255d73638b5/html5/thumbnails/74.jpg)
view 2: divide and conquer
partial_square(float *A, float *B,
               int startI, int endI, ...) {
  for (int i = startI; i < endI; ++i) {
    for (int j = startJ; j < endJ; ++j) {
      ...
}
square(float *A, float *B, int N) {
  for (int ii = 0; ii < N; ii += BLOCK)
    ...
      /* segment of A, B in use fits in cache! */
      partial_square(
          A, B,
          ii, ii + BLOCK,
          jj, jj + BLOCK, ...);
}
38
![Page 75: Cache Performance IIcr4bd/3330/S2017/... · 3/30/2017 · A[k*114+j] isat 10 0000 0000 0100 A[k*114+j+1] isat 10 0000 0000 1000 A[(k+1)*114+j] isat 10 0011 1001 0100 A[(k+2)*114+j]](https://reader035.fdocuments.us/reader035/viewer/2022081409/608b3b3b36fec255d73638b5/html5/thumbnails/75.jpg)
cache blocking ugliness — fringe
[Figure labels: full block, partial block]
39
![Page 76: Cache Performance IIcr4bd/3330/S2017/... · 3/30/2017 · A[k*114+j] isat 10 0000 0000 0100 A[k*114+j+1] isat 10 0000 0000 1000 A[(k+1)*114+j] isat 10 0011 1001 0100 A[(k+2)*114+j]](https://reader035.fdocuments.us/reader035/viewer/2022081409/608b3b3b36fec255d73638b5/html5/thumbnails/76.jpg)
cache blocking ugliness — fringe
for (int kk = 0; kk < N; kk += K) {
  for (int ii = 0; ii < N; ii += I) {
    for (int jj = 0; jj < N; jj += J) {
      for (int k = kk; k < min(kk+K, N); ++k) {
        // ...
      }
    }
  }
}
40
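A runnable sketch of the clamping approach, extended to all three inner loops so that any N works. The names `min_int` and `square_blocked_clamped`, and the parameters `bi`, `bj`, `bk` (standing in for I, J, K), are assumptions made for this example.

```c
static int min_int(int a, int b) { return a < b ? a : b; }

/* B += A * A for N x N row-major matrices, any N >= 0.
   Partial blocks at the fringe are handled by clamping each
   inner loop bound to N. */
void square_blocked_clamped(const float *A, float *B, int N,
                            int bi, int bj, int bk) {
    for (int kk = 0; kk < N; kk += bk)
        for (int ii = 0; ii < N; ii += bi)
            for (int jj = 0; jj < N; jj += bj)
                for (int k = kk; k < min_int(kk + bk, N); ++k)
                    for (int i = ii; i < min_int(ii + bi, N); ++i)
                        for (int j = jj; j < min_int(jj + bj, N); ++j)
                            B[i * N + j] += A[i * N + k] * A[k * N + j];
}
```

The clamp costs a comparison per loop bound; a compiler may hoist it out of the loop, but that depends on the optimization settings.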
![Page 77: Cache Performance IIcr4bd/3330/S2017/... · 3/30/2017 · A[k*114+j] isat 10 0000 0000 0100 A[k*114+j+1] isat 10 0000 0000 1000 A[(k+1)*114+j] isat 10 0011 1001 0100 A[(k+2)*114+j]](https://reader035.fdocuments.us/reader035/viewer/2022081409/608b3b3b36fec255d73638b5/html5/thumbnails/77.jpg)
cache blocking ugliness — fringe
for (kk = 0; kk + K <= N; kk += K) {
  for (ii = 0; ii + I <= N; ii += I) {
    for (jj = 0; jj + J <= N; jj += J) {
      // ...
    }
    for (; jj < N; ++jj) {
      // handle remainder
    }
  }
  for (; ii < N; ++ii) {
    // handle remainder
  }
}
for (; kk < N; ++kk) {
  // handle remainder
}
41
![Page 78: Cache Performance IIcr4bd/3330/S2017/... · 3/30/2017 · A[k*114+j] isat 10 0000 0000 0100 A[k*114+j+1] isat 10 0000 0000 1000 A[(k+1)*114+j] isat 10 0011 1001 0100 A[(k+2)*114+j]](https://reader035.fdocuments.us/reader035/viewer/2022081409/608b3b3b36fec255d73638b5/html5/thumbnails/78.jpg)
cache blocking and miss rate
[Figure: read misses per multiply or add (y-axis, 0.00 to 0.09) versus N (x-axis, 0 to 600); series: blocked, unblocked]
42
![Page 79: Cache Performance IIcr4bd/3330/S2017/... · 3/30/2017 · A[k*114+j] isat 10 0000 0000 0100 A[k*114+j+1] isat 10 0000 0000 1000 A[(k+1)*114+j] isat 10 0011 1001 0100 A[(k+2)*114+j]](https://reader035.fdocuments.us/reader035/viewer/2022081409/608b3b3b36fec255d73638b5/html5/thumbnails/79.jpg)
what about performance?
[Figure, two panels, each with series unblocked and blocked:
 cycles per multiply/add [less optimized loop] (y-axis, 0.0 to 2.0) versus N (x-axis, 0 to 600);
 cycles per multiply/add [optimized loop] (y-axis, 0.0 to 0.5) versus N (x-axis, 0 to 1000)]
43
![Page 80: Cache Performance IIcr4bd/3330/S2017/... · 3/30/2017 · A[k*114+j] isat 10 0000 0000 0100 A[k*114+j+1] isat 10 0000 0000 1000 A[(k+1)*114+j] isat 10 0011 1001 0100 A[(k+2)*114+j]](https://reader035.fdocuments.us/reader035/viewer/2022081409/608b3b3b36fec255d73638b5/html5/thumbnails/80.jpg)
performance for big sizes
[Figure: cycles per multiply or add (y-axis, 0.0 to 1.0) versus N (x-axis, 0 to 10000); series: unblocked, blocked; annotation: matrix in L3 cache]
44
![Page 81: Cache Performance IIcr4bd/3330/S2017/... · 3/30/2017 · A[k*114+j] isat 10 0000 0000 0100 A[k*114+j+1] isat 10 0000 0000 1000 A[(k+1)*114+j] isat 10 0011 1001 0100 A[(k+2)*114+j]](https://reader035.fdocuments.us/reader035/viewer/2022081409/608b3b3b36fec255d73638b5/html5/thumbnails/81.jpg)
optimized loop???
performance difference wasn’t visible at small sizes
until I optimized arithmetic in the loop
(by supplying better options to GCC)
1: loading $B_{i,j}$ through $B_{i,j+7}$ with one instruction
2: doing adds and multiplies with fewer instructions
3: simplifying address computations
but… how can that make cache blocking better???
45
![Page 83: Cache Performance IIcr4bd/3330/S2017/... · 3/30/2017 · A[k*114+j] isat 10 0000 0000 0100 A[k*114+j+1] isat 10 0000 0000 1000 A[(k+1)*114+j] isat 10 0011 1001 0100 A[(k+2)*114+j]](https://reader035.fdocuments.us/reader035/viewer/2022081409/608b3b3b36fec255d73638b5/html5/thumbnails/83.jpg)
overlapping loads and arithmetic
[Figure: timeline (x-axis: time) of loads overlapping with multiplies and adds]
speed of load might not matter if these are slower
46
![Page 84: Cache Performance IIcr4bd/3330/S2017/... · 3/30/2017 · A[k*114+j] isat 10 0000 0000 0100 A[k*114+j+1] isat 10 0000 0000 1000 A[(k+1)*114+j] isat 10 0011 1001 0100 A[(k+2)*114+j]](https://reader035.fdocuments.us/reader035/viewer/2022081409/608b3b3b36fec255d73638b5/html5/thumbnails/84.jpg)
register reuse
for (int k = 0; k < N; ++k)
  for (int i = 0; i < N; ++i)
    for (int j = 0; j < N; ++j)
      B[i*N+j] += A[i*N+k] * A[k*N+j];
// optimize into:
for (int k = 0; k < N; ++k)
  for (int i = 0; i < N; ++i) {
    float Aik = A[i*N+k]; // hopefully keep in register!
                          // faster than even cache hit!
    for (int j = 0; j < N; ++j)
      B[i*N+j] += Aik * A[k*N+j];
  }
can the compiler do this for us?
47
![Page 85: Cache Performance IIcr4bd/3330/S2017/... · 3/30/2017 · A[k*114+j] isat 10 0000 0000 0100 A[k*114+j+1] isat 10 0000 0000 1000 A[(k+1)*114+j] isat 10 0011 1001 0100 A[(k+2)*114+j]](https://reader035.fdocuments.us/reader035/viewer/2022081409/608b3b3b36fec255d73638b5/html5/thumbnails/85.jpg)
can compiler do register reuse?
Not easily — What if A = B?
for (int k = 0; k < N; ++k)
  for (int i = 0; i < N; ++i) {
    // want to preload A[i*N+k] here!
    for (int j = 0; j < N; ++j) {
      // but if A = B, modifying here!
      B[i*N+j] += A[i*N+k] * A[k*N+j];
    }
  }
48
![Page 86: Cache Performance IIcr4bd/3330/S2017/... · 3/30/2017 · A[k*114+j] isat 10 0000 0000 0100 A[k*114+j+1] isat 10 0000 0000 1000 A[(k+1)*114+j] isat 10 0011 1001 0100 A[(k+2)*114+j]](https://reader035.fdocuments.us/reader035/viewer/2022081409/608b3b3b36fec255d73638b5/html5/thumbnails/86.jpg)
Automatic register reuse
Compiler would need to generate overlap check:
if ((B > A + N * N || B < A) &&
    (B + N * N > A + N * N ||
     B + N * N < A)) {
  for (int k = 0; k < N; ++k) {
    for (int i = 0; i < N; ++i) {
      float Aik = A[i*N+k];
      for (int j = 0; j < N; ++j) {
        B[i*N+j] += Aik * A[k*N+j];
      }
    }
  }
} else { /* other version */ }
49
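An alternative to a generated overlap check is for the programmer to promise that the arrays never overlap. A hedged sketch using the C99 `restrict` qualifier; the function name `square_restrict` is illustrative, and whether a particular compiler then keeps `A[i*N+k]` in a register still depends on its optimization settings.

```c
/* restrict promises that, inside this function, memory written through B
   is not also accessed through A. Calling this with overlapping A and B
   is undefined behavior. */
void square_restrict(const float *restrict A, float *restrict B, int N) {
    for (int k = 0; k < N; ++k) {
        for (int i = 0; i < N; ++i) {
            float Aik = A[i * N + k];  /* loaded once per (k, i) pair */
            for (int j = 0; j < N; ++j)
                B[i * N + j] += Aik * A[k * N + j];
        }
    }
}
```

The trade-off is that correctness now rests on the caller: the A = B case from the previous slide must never be passed to this function.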
![Page 87: Cache Performance IIcr4bd/3330/S2017/... · 3/30/2017 · A[k*114+j] isat 10 0000 0000 0100 A[k*114+j+1] isat 10 0000 0000 1000 A[(k+1)*114+j] isat 10 0011 1001 0100 A[(k+2)*114+j]](https://reader035.fdocuments.us/reader035/viewer/2022081409/608b3b3b36fec255d73638b5/html5/thumbnails/87.jpg)
“register blocking”
for (int k = 0; k < N; ++k) {
  for (int i = 0; i < N; i += 2) {
    float Ai0k = A[(i+0)*N + k];
    float Ai1k = A[(i+1)*N + k];
    for (int j = 0; j < N; j += 2) {
      float Akj0 = A[k*N + j+0];
      float Akj1 = A[k*N + j+1];
      B[(i+0)*N + j+0] += Ai0k * Akj0;
      B[(i+1)*N + j+0] += Ai1k * Akj0;
      B[(i+0)*N + j+1] += Ai0k * Akj1;
      B[(i+1)*N + j+1] += Ai1k * Akj1;
    }
  }
}
50
![Page 88: Cache Performance IIcr4bd/3330/S2017/... · 3/30/2017 · A[k*114+j] isat 10 0000 0000 0100 A[k*114+j+1] isat 10 0000 0000 1000 A[(k+1)*114+j] isat 10 0011 1001 0100 A[(k+2)*114+j]](https://reader035.fdocuments.us/reader035/viewer/2022081409/608b3b3b36fec255d73638b5/html5/thumbnails/88.jpg)
cache blocking: summary
reorder calculation to reduce cache misses:
make explicit choice about what is in cache
perform calculations in cache-sized blocks
get more spatial and temporal locality
temporal locality: reuse values in many calculations
  before they are replaced in the cache
spatial locality: use adjacent values in calculations
  before cache block is replaced
51
![Page 89: Cache Performance IIcr4bd/3330/S2017/... · 3/30/2017 · A[k*114+j] isat 10 0000 0000 0100 A[k*114+j+1] isat 10 0000 0000 1000 A[(k+1)*114+j] isat 10 0011 1001 0100 A[(k+2)*114+j]](https://reader035.fdocuments.us/reader035/viewer/2022081409/608b3b3b36fec255d73638b5/html5/thumbnails/89.jpg)
avoiding conflict misses
problem — array is scattered throughout memory
observation: 32KB cache can store 32KB contiguous array
contiguous array is split evenly among sets
solution: copy block into contiguous array
52
![Page 90: Cache Performance IIcr4bd/3330/S2017/... · 3/30/2017 · A[k*114+j] isat 10 0000 0000 0100 A[k*114+j+1] isat 10 0000 0000 1000 A[(k+1)*114+j] isat 10 0011 1001 0100 A[(k+2)*114+j]](https://reader035.fdocuments.us/reader035/viewer/2022081409/608b3b3b36fec255d73638b5/html5/thumbnails/90.jpg)
avoiding conflict misses (code)
process_block(ii, jj, kk) {
  float B_copy[I * J];
  /* pseudocode for loop to save space */
  for i = ii to ii + I, j = jj to jj + J:
    B_copy[(i - ii) * J + (j - jj)] = B[i * N + j];
  for i = ii to ii + I, j = jj to jj + J, k = kk to kk + K:
    B_copy[(i - ii) * J + (j - jj)] += A[k * N + j] * A[i * N + k];
  for i = ii to ii + I, j = jj to jj + J:
    B[i * N + j] = B_copy[(i - ii) * J + (j - jj)];
}
53
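The pseudocode can be turned into runnable C along these lines. This is a sketch under stated assumptions: the compile-time constants `BI`, `BJ`, `BK` stand in for I, J, K, the block is assumed to lie fully inside the N x N matrices, and only the B block is copied (the A blocks are still read in place, as on the slide).

```c
/* Illustrative compile-time block sizes standing in for I, J, K. */
enum { BI = 4, BJ = 4, BK = 4 };

/* Process one block starting at (ii, jj, kk), accumulating into a
   contiguous copy of the BI x BJ block of B. */
void process_block(const float *A, float *B, int N,
                   int ii, int jj, int kk) {
    float B_copy[BI * BJ];

    /* copy the block of B into contiguous storage */
    for (int i = 0; i < BI; ++i)
        for (int j = 0; j < BJ; ++j)
            B_copy[i * BJ + j] = B[(ii + i) * N + (jj + j)];

    /* do the block's multiply-adds on the copy */
    for (int k = kk; k < kk + BK; ++k)
        for (int i = 0; i < BI; ++i)
            for (int j = 0; j < BJ; ++j)
                B_copy[i * BJ + j] +=
                    A[(ii + i) * N + k] * A[k * N + (jj + j)];

    /* copy the results back */
    for (int i = 0; i < BI; ++i)
        for (int j = 0; j < BJ; ++j)
            B[(ii + i) * N + (jj + j)] = B_copy[i * BJ + j];
}
```

A caller would loop over `kk`, `ii`, `jj` in steps of `BK`, `BI`, `BJ`. The two copies are extra work, so this only pays off when the conflict misses it avoids cost more than the copying.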
![Page 91: Cache Performance IIcr4bd/3330/S2017/... · 3/30/2017 · A[k*114+j] isat 10 0000 0000 0100 A[k*114+j+1] isat 10 0000 0000 1000 A[(k+1)*114+j] isat 10 0011 1001 0100 A[(k+2)*114+j]](https://reader035.fdocuments.us/reader035/viewer/2022081409/608b3b3b36fec255d73638b5/html5/thumbnails/91.jpg)
prefetching
processors detect sequential access patterns
e.g. accessing memory address 0, 8, 16, 24, …?
processor will prefetch 32, 40, etc.
another way to take advantage of spatial locality
part of why miss rate is so low
54
![Page 92: Cache Performance IIcr4bd/3330/S2017/... · 3/30/2017 · A[k*114+j] isat 10 0000 0000 0100 A[k*114+j+1] isat 10 0000 0000 1000 A[(k+1)*114+j] isat 10 0011 1001 0100 A[(k+2)*114+j]](https://reader035.fdocuments.us/reader035/viewer/2022081409/608b3b3b36fec255d73638b5/html5/thumbnails/92.jpg)
matrix sum
int sum1(int matrix[4][8]) {
  int sum = 0;
  for (int i = 0; i < 4; ++i) {
    for (int j = 0; j < 8; ++j) {
      sum += matrix[i][j];
    }
  }
  return sum;
}
access pattern:
matrix[0][0], [0][1], [0][2], …, [1][0] …
55
![Page 93: Cache Performance IIcr4bd/3330/S2017/... · 3/30/2017 · A[k*114+j] isat 10 0000 0000 0100 A[k*114+j+1] isat 10 0000 0000 1000 A[(k+1)*114+j] isat 10 0011 1001 0100 A[(k+2)*114+j]](https://reader035.fdocuments.us/reader035/viewer/2022081409/608b3b3b36fec255d73638b5/html5/thumbnails/93.jpg)
matrix sum: spatial locality
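matrix in memory (4 bytes/row), 8-byte cache block?

| element | iteration | result |
|---|---|---|
| [0][0] | iter. 0 | miss |
| [0][1] | iter. 1 | hit (same block as before) |
| [0][2] | iter. 2 | miss |
| [0][3] | iter. 3 | hit (same block as before) |
| [0][4] | iter. 4 | miss |
| [0][5] | iter. 5 | hit |
| [0][6] | iter. 6 | … |
| [0][7] | iter. 7 | |
| [1][0] | iter. 8 | |
| [1][1] | iter. 9 | |
| … | … | |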
56
![Page 96: Cache Performance IIcr4bd/3330/S2017/... · 3/30/2017 · A[k*114+j] isat 10 0000 0000 0100 A[k*114+j+1] isat 10 0000 0000 1000 A[(k+1)*114+j] isat 10 0011 1001 0100 A[(k+2)*114+j]](https://reader035.fdocuments.us/reader035/viewer/2022081409/608b3b3b36fec255d73638b5/html5/thumbnails/96.jpg)
block size and spatial locality
larger blocks — exploit spatial locality
… but larger blocks means fewer blocks for same size
less good at exploiting temporal locality
57
![Page 97: Cache Performance IIcr4bd/3330/S2017/... · 3/30/2017 · A[k*114+j] isat 10 0000 0000 0100 A[k*114+j+1] isat 10 0000 0000 1000 A[(k+1)*114+j] isat 10 0011 1001 0100 A[(k+2)*114+j]](https://reader035.fdocuments.us/reader035/viewer/2022081409/608b3b3b36fec255d73638b5/html5/thumbnails/97.jpg)
alternate matrix sum
int sum2(int matrix[4][8]) {
  int sum = 0;
  // swapped loop order
  for (int j = 0; j < 8; ++j) {
    for (int i = 0; i < 4; ++i) {
      sum += matrix[i][j];
    }
  }
  return sum;
}
access pattern:
matrix[0][0], [1][0], [2][0], …, [0][1], …
58
![Page 98: Cache Performance IIcr4bd/3330/S2017/... · 3/30/2017 · A[k*114+j] isat 10 0000 0000 0100 A[k*114+j+1] isat 10 0000 0000 1000 A[(k+1)*114+j] isat 10 0011 1001 0100 A[(k+2)*114+j]](https://reader035.fdocuments.us/reader035/viewer/2022081409/608b3b3b36fec255d73638b5/html5/thumbnails/98.jpg)
matrix sum: bad spatial locality
matrix in memory (4 bytes/row), 8-byte cache block?

| element | iteration |
|---|---|
| [0][0] | iter. 0 |
| [0][1] | iter. 4 |
| [0][2] | iter. 8 |
| [0][3] | iter. 12 |
| [0][4] | iter. 16 |
| [0][5] | iter. 20 |
| [0][6] | iter. 24 |
| [0][7] | iter. 28 |
| [1][0] | iter. 1 |
| [1][1] | iter. 5 |
| … | … |

miss unless value not evicted for 4 iterations
59
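To make the contrast between the two loop orders concrete, here is a toy cache model that counts misses for each. Everything about the cache is an assumption chosen for illustration: direct-mapped, 8-byte blocks (as on the slide), only 4 sets, 4-byte ints, and the matrix starting at address 0. The name `count_misses` is hypothetical.

```c
#include <stdbool.h>

enum { ROWS = 4, COLS = 8, ELEM_BYTES = 4, BLOCK_BYTES = 8, NUM_SETS = 4 };

/* Count misses for one pass over an int[ROWS][COLS] matrix in a toy
   direct-mapped cache. row_major selects the sum1 order (true) or the
   swapped sum2 order (false). */
int count_misses(bool row_major) {
    long tags[NUM_SETS];
    bool valid[NUM_SETS] = { false };
    int misses = 0;
    int outer_n = row_major ? ROWS : COLS;
    int inner_n = row_major ? COLS : ROWS;

    for (int outer = 0; outer < outer_n; ++outer) {
        for (int inner = 0; inner < inner_n; ++inner) {
            int i = row_major ? outer : inner;
            int j = row_major ? inner : outer;
            long addr = (long)(i * COLS + j) * ELEM_BYTES;
            long block = addr / BLOCK_BYTES;
            int set = (int)(block % NUM_SETS);
            long tag = block / NUM_SETS;
            if (!valid[set] || tags[set] != tag) {
                ++misses;              /* miss: bring the block in */
                valid[set] = true;
                tags[set] = tag;
            }
        }
    }
    return misses;
}
```

Under these assumptions `count_misses(true)` gives 16 misses out of 32 accesses (every other access hits), while `count_misses(false)` gives 32: the rows are 32 bytes apart, so all four map to the same set and each block is evicted before its second element is used.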
![Page 100: Cache Performance IIcr4bd/3330/S2017/... · 3/30/2017 · A[k*114+j] isat 10 0000 0000 0100 A[k*114+j+1] isat 10 0000 0000 1000 A[(k+1)*114+j] isat 10 0011 1001 0100 A[(k+2)*114+j]](https://reader035.fdocuments.us/reader035/viewer/2022081409/608b3b3b36fec255d73638b5/html5/thumbnails/100.jpg)
cache organization and miss rate
depends on program; one example:
SPEC CPU2000 benchmarks, 64B block size
LRU replacement policies
data cache miss rates:

| Cache size | direct-mapped | 2-way | 8-way | fully assoc. |
|---|---|---|---|---|
| 1KB | 8.63% | 6.97% | 5.63% | 5.34% |
| 2KB | 5.71% | 4.23% | 3.30% | 3.05% |
| 4KB | 3.70% | 2.60% | 2.03% | 1.90% |
| 16KB | 1.59% | 0.86% | 0.56% | 0.50% |
| 64KB | 0.66% | 0.37% | 0.10% | 0.001% |
| 128KB | 0.27% | 0.001% | 0.0006% | 0.0006% |

Data: Cantin and Hill, “Cache Performance for SPEC CPU2000 Benchmarks”
http://research.cs.wisc.edu/multifacet/misc/spec2000cache-data/
60
![Page 102: Cache Performance IIcr4bd/3330/S2017/... · 3/30/2017 · A[k*114+j] isat 10 0000 0000 0100 A[k*114+j+1] isat 10 0000 0000 1000 A[(k+1)*114+j] isat 10 0011 1001 0100 A[(k+2)*114+j]](https://reader035.fdocuments.us/reader035/viewer/2022081409/608b3b3b36fec255d73638b5/html5/thumbnails/102.jpg)
is LRU always better?
least recently used exploits temporal locality
61
![Page 103: Cache Performance IIcr4bd/3330/S2017/... · 3/30/2017 · A[k*114+j] isat 10 0000 0000 0100 A[k*114+j+1] isat 10 0000 0000 1000 A[(k+1)*114+j] isat 10 0011 1001 0100 A[(k+2)*114+j]](https://reader035.fdocuments.us/reader035/viewer/2022081409/608b3b3b36fec255d73638b5/html5/thumbnails/103.jpg)
making LRU look bad
(* = least recently used)

| access | direct-mapped (2 sets) | fully-associative (1 set) |
|---|---|---|
| read 0 | miss: mem[0]; (empty) | miss: mem[0], (empty)* |
| read 1 | miss: mem[0]; mem[1] | miss: mem[0]*, mem[1] |
| read 3 | miss: mem[0]; mem[3] | miss: mem[3], mem[1]* |
| read 0 | hit: mem[0]; mem[3] | miss: mem[3]*, mem[0] |
| read 2 | miss: mem[2]; mem[3] | miss: mem[2], mem[0]* |
| read 3 | hit: mem[2]; mem[3] | miss: mem[2]*, mem[3] |
| read 1 | miss: mem[2]; mem[1] | miss: mem[1], mem[3]* |
| read 2 | hit: mem[2]; mem[1] | miss: mem[1]*, mem[2] |
62
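The trace on this slide can be replayed with a small simulation. This is a sketch with assumed details: each block holds a single address, addresses are non-negative, the direct-mapped cache picks the set as `address % 2`, and the function names are made up for this example.

```c
/* Direct-mapped cache with 2 sets of one block each:
   address a maps to set a % 2. */
int misses_direct_mapped(const int *trace, int n) {
    int set[2] = { -1, -1 };   /* -1 marks an empty set */
    int misses = 0;
    for (int t = 0; t < n; ++t) {
        int s = trace[t] % 2;
        if (set[s] != trace[t]) {
            ++misses;
            set[s] = trace[t];
        }
    }
    return misses;
}

/* Fully associative cache with 2 blocks and LRU replacement. */
int misses_lru(const int *trace, int n) {
    int mru = -1, lru = -1;    /* most / least recently used; -1 is empty */
    int misses = 0;
    for (int t = 0; t < n; ++t) {
        int a = trace[t];
        if (a == mru)
            continue;          /* hit, recency order unchanged */
        if (a != lru)
            ++misses;          /* miss: the LRU block is replaced by a */
        lru = mru;             /* either way, a becomes most recent */
        mru = a;
    }
    return misses;
}
```

Replaying the trace 0, 1, 3, 0, 2, 3, 1, 2 through both functions shows the LRU cache missing more often than the direct-mapped one, even though it is more flexible about placement.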
![Page 104: Cache Performance IIcr4bd/3330/S2017/... · 3/30/2017 · A[k*114+j] isat 10 0000 0000 0100 A[k*114+j+1] isat 10 0000 0000 1000 A[(k+1)*114+j] isat 10 0011 1001 0100 A[(k+2)*114+j]](https://reader035.fdocuments.us/reader035/viewer/2022081409/608b3b3b36fec255d73638b5/html5/thumbnails/104.jpg)
cache operation (associative)
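| valid | tag | data | valid | tag | data |
|---|---|---|---|---|---|
| 1 | 10 | 00 11 | 1 | 00 | AA BB |
| 1 | 11 | B4 B5 | 1 | 01 | 33 44 |

[Figure: address 10011 1 split into tag, index, and offset; the index selects a set, each way's stored tag is compared (=) with the address tag and ANDed with its valid bit, and the results are ORed to give "is hit? (1)"; the offset selects the data (B5)]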
63