
Memory Operation and Performance

This lecture explains the memory architecture so that you can write programs that take advantage of it and run faster. It covers: Memory Systems, Caches, Virtual Memory (VM)

Caches – fast memory between the CPU and main memory

Cache Design Parameters

A Diversity of Caches – different levels of caches

Looking at the Caches

Cache-Aware Programming (column-major vs. row-major order; traversing in row-major order makes the program faster)

A Diversity of Caches

Multiple Levels of Caches (L1, L2, and L3; L1 is faster than L2, and L2 is faster than L3, but L1 is more expensive per byte than L2)

On-Chip Caches (the level of cache inside the CPU chip itself, i.e. L1)

Instruction and Data Caches (instructions and data are kept in separate caches; an instruction is very likely to be reused soon after it is fetched)

Instruction and Data Cache – from http://www.kids-online.net/learn/clickjr/details/cpu.html

On the right are two photos of a CPU (Central Processing Unit). The bottom photo shows the CPU chip from the outside. The top photo is an enlarged map of the inside of the chip, showing the data cache and instruction cache.

Multiple Levels of Caches

Modern computer systems don't have just one cache between the CPU and memory; there is usually a hierarchy of caches. The caches are usually called L1, L2, etc., which is shorthand for Level 1, Level 2, and so on. The L1 cache is the cache within the CPU, and is therefore the fastest and smallest, but the most expensive per byte. The last cache (usually L2 or L3) is the one that loads data directly from the DRAM main memory, and is less expensive.

L1 and L2 Cache

Level 1 cache memory is memory included in the CPU itself. Level 2 cache memory is memory outside the CPU. The photo below shows Level 2 cache memory on the processor.

Example – shows the benefit of cache memory at different levels

/* Assumes n is a power of two. The idea is to divide the data into two
   halves, sort each half recursively, and merge the sorted halves. */
void merge_sort(int *data, int n) {
    int half = n >> 1;
    if (n == 1)
        return;
    merge_sort(data, half);          /* sort the first half  */
    merge_sort(data + half, half);   /* sort the second half */
    merge(data, data + half, half);  /* merge the two halves */
}

// no need to memorise this code
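The merge helper is not shown in the notes; below is a minimal sketch of what it could look like, assuming the two sorted halves are adjacent in memory (the temporary VLA buffer is an illustrative choice, not from the notes):

#include <string.h>

/* Merges two adjacent sorted runs of length half (left and left + half)
   into one sorted run of length 2 * half. Illustrative sketch only. */
void merge(int *left, int *right, int half) {
    int tmp[2 * half];                 /* C99 variable-length array */
    int i = 0, j = 0, k = 0;
    while (i < half && j < half)
        tmp[k++] = (left[i] <= right[j]) ? left[i++] : right[j++];
    while (i < half) tmp[k++] = left[i++];
    while (j < half) tmp[k++] = right[j++];
    memcpy(left, tmp, sizeof tmp);     /* works because right == left + half */
}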

Graph of Merge Sort – shows the access times in nanoseconds (ns) for the L1 cache (T1), L2 cache (T2), L3 cache (T3), and main memory (Tm).

Merge sort in a fast, small cache (data fits in L1 only): look at the total time.

Merge sort in a slow cache (data fits only in L3): look at the total time; it is longer.
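As a rough worked illustration (the access times here are assumed for the sake of the arithmetic, not taken from the slide): if T1 = 1 ns and T3 = 20 ns, and the sort performs the same number of memory accesses in both cases, then a run whose data fits in L1 pays about 1 ns per access while a run whose data only fits in L3 pays about 20 ns per access, so the L3-bound total time is roughly 20 times longer.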

On-Chip Caches

Multilevel caches can improve the performance of a computer. However, there is usually no major difference between having a single L3-sized cache and three levels of caches; that difference is not as significant as the difference between a single large cache and a single small one.

Instruction and Data Caches

Programs access and fetch instructions in much more predictable ways than they do data.

For instance, instruction fetches exhibit much more spatial locality than data accesses, because an instruction fetch is very likely to be followed soon by a fetch of the instruction next to it. For example, if the program is executing a++ below, there is a high chance it will execute b++ and then c = a + b*3 immediately afterwards:

a++;
b++;
c = a + b * 3;

Even when a branch or jump instruction makes this untrue, it is very likely that the instruction fetched next will be one that has already been fetched recently.

Multiple levels of caches. Note that L1 is within the CPU chip.

Looking at the Cache Design

We can deduce many things about the cache design of a particular computer by carefully examining its memory performance. We can design a benchmark program whose locality we control, such as:

int data[MAXSIZE];

for (r = 0; r < repeat; r++) {
    for (i = 0; i < N; i++) {
        dummy = data[i];
    }
}

Explanation of the program

This loop accesses a chunk of memory repeatedly. By varying N, we vary the temporal locality of the accesses. For example, for N == 4, each value data[i] is accessed every 4 iterations, but if N is 16, each data[i] is accessed only every 16 iterations. A cache of size 16 would make the benchmark perform much more poorly for N == 32 than for N == 8, because for N == 32 each data[i] would have been evicted (removed) from the cache before it was accessed again.
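Below is a self-contained, runnable version of this benchmark; MAXSIZE, the repeat counts, and the use of clock() for timing are illustrative choices, not from the notes:

#include <stdio.h>
#include <time.h>

#define MAXSIZE (1 << 22)          /* 4M ints; an assumed size */
static int data[MAXSIZE];
static volatile int dummy;         /* volatile so the reads are not optimised away */

int main(void) {
    for (int N = 1 << 10; N <= MAXSIZE; N <<= 1) {
        long repeat = (1L << 28) / N;     /* keep the total access count constant */
        clock_t start = clock();
        for (long r = 0; r < repeat; r++)
            for (int i = 0; i < N; i++)
                dummy = data[i];
        double secs = (double)(clock() - start) / CLOCKS_PER_SEC;
        printf("N = %8d ints: %.3f s\n", N, secs);
    }
    return 0;
}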

Controlling the spatial locality

Here, stride controls the amount of spatial locality:

int data[MAXSIZE];

for (r = 0; r < repeat; r++) {
    for (i = 0; i < N; i += stride) {
        dummy = data[i];
    }
}

Result of the benchmark

Transfer rate in MB/s. You can see that the performance is not proportional to the L1 cache size. Why? The cache is effective between 512 bytes and 4K.

Interpretation of the result

We immediately notice that memory performance comes in three discrete steps. In the best-performing step, the program accesses so little data that all of its references fit in the L1 cache, and the rest of the hierarchy is almost never required. In the next step down, the references no longer fit in L1 but fit in the L2 cache, and access to main memory is almost never required. The lesson: try to make your working set fit in L1, then L2, and so on. For example (sizes assumed for illustration), if the transfer rate drops once the array exceeds 32 KB and again past 256 KB, we can deduce an L1 of about 32 KB and an L2 of about 256 KB.

Graph showing size of L1 and performance

Performance: Transfer rate

The effect of stride (steps)

Cache-Aware Programming

That is, how to optimise performance by avoiding the following problems:

Instruction Cache Overflow

Cache Collisions

Unused Cache Lines

Insufficient Temporal Locality

Example (1) – 4ms (assume 1M)

Example (2) – 3ms (assume 1M)

Example (3) – 3 ms (assume 512K)

Example (4) – 2.5 ms (assume 512K)

Example (5) – 2.3 ms (Assume 256K)


Instruction cache – a program with a complicated for loop

Below is a program involving three complicated operations:

for (i = 0; i < MAX; i++) {
    <Complicated operation on A[i]>
    <Complicated operation on B[i]>
    <Complicated operation on C[i]>
}

It is better to separate it into three loops, so that each complicated operation on its own fits in (and so makes the most of) the instruction cache:

for (i = 0; i < MAX; i++) {
    <Complicated operation on A[i]>
}
for (i = 0; i < MAX; i++) {
    <Complicated operation on B[i]>
}
for (i = 0; i < MAX; i++) {
    <Complicated operation on C[i]>
}

Cache Collisions

Cache collisions can also cause our programs to execute slowly.

A cache collision occurs when a cache line is evicted (switched out) even though the cache is not full. It happens when a cache set is full and the system has to decide which line in that set to remove (switch out).

Program showing a cache collision

Below is a program involving the arrays a, b, and c:

int a[N];
<other stuff...>
int b[N];
<other stuff...>
int c[N];

for (i = 0; i < N; i++) {
    c[i] = a[i] + b[i];
}

Reason for the cache collision

It is possible that the compiler allocates a, b, and c to memory addresses that map to the same cache set.

In this case, the assignment c[i] = a[i] + b[i] will cause three cache misses in every iteration of the loop, because the cache will be constantly evicting the cache line that the CPU requires next.

Each iteration therefore causes three evictions, since a[i], b[i], and c[i] all map to the same cache set.
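To make the mapping concrete, here is a minimal sketch of how a direct-mapped cache chooses a set; the 32-byte line size and 256-set geometry are assumptions for illustration, not the slide's figures:

/* Illustrative direct-mapped cache geometry (assumed values). */
#define LINE_SIZE 32u             /* bytes per cache line */
#define NUM_SETS  256u            /* number of sets in the cache */

/* The set an address maps to: strip the line offset, keep the low set bits. */
unsigned set_index(unsigned long addr) {
    return (addr / LINE_SIZE) % NUM_SETS;
}

/* If a, b, and c happen to start a multiple of NUM_SETS * LINE_SIZE (8 KB)
   apart, then set_index of &a[i], &b[i], and &c[i] is the same for every i,
   and each access evicts the very line the next access needs. */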

Graph showing the Cache collision

The solution is to offset the memory locations:

#define CACHELINESIZE <Cache line size of system>
#define COFFSET ((2 * CACHELINESIZE) / sizeof(int))

int a[N];
<other stuff...>
int b[N];
<other stuff...>
int c[N + COFFSET];

for (i = 0; i < N; i++) {
    c[i + COFFSET] = a[i] + b[i];
}

(For a 32-byte cache line and 4-byte ints, COFFSET = (2 * 32) / 4 = 16 ints, i.e. c's accesses are shifted by two cache lines.)
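For concreteness, here is a compilable version of the same offset fix; the 32-byte line size and the value of N are assumptions for illustration:

#include <stdio.h>

#define CACHELINESIZE 32                        /* assumed line size in bytes */
#define COFFSET ((2 * CACHELINESIZE) / sizeof(int))
#define N 1024                                  /* assumed array length */

int a[N];
int b[N];
int c[N + COFFSET];                             /* over-allocate by the offset */

int main(void) {
    for (size_t i = 0; i < N; i++)
        c[i + COFFSET] = a[i] + b[i];           /* c's lines land in different sets */
    printf("%d\n", c[COFFSET]);
    return 0;
}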

Graph showing the cache after the change

Under-used Cache Lines

Suppose the cache line is 32 bytes wide, as it often is. If a program is reading contiguous 4-byte integers, the reference to the first will cause the first eight integers (integers 0–7) to be loaded into the cache. The reference to the 9th will cause integers 8–15 to be loaded, and so on. The hit ratio, even on a cold cache, will be at least 7/8, or 0.875. Now consider a program that reads integers with a stride of eight or more: it reads the first integer, then the 9th (or beyond), then the 17th, etc. Every access now touches a new cache line, so the other seven integers loaded with each line are never used, and the cold-cache hit ratio drops to zero.
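A minimal sketch that demonstrates the effect; the array size, the equal-access-count trick, and the use of clock() are my illustrative choices:

#include <stdio.h>
#include <time.h>

#define SIZE (1 << 24)                     /* assumed array size: 16M ints */
static int data[SIZE];
static volatile long sink;                 /* keep the sums from being optimised away */

static double time_stride(int stride) {
    clock_t t0 = clock();
    long sum = 0;
    for (int rep = 0; rep < stride; rep++) /* equalise the total access count */
        for (int i = 0; i < SIZE; i += stride)
            sum += data[i];
    sink = sum;
    return (double)(clock() - t0) / CLOCKS_PER_SEC;
}

int main(void) {
    printf("stride 1: %.3f s\n", time_stride(1));
    printf("stride 8: %.3f s\n", time_stride(8));  /* touches a new line per read */
    return 0;
}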

Graph showing the effect: cache misses

Example of a matrix

int data[M][N];

for (i = 0; i < N; i++) {
    for (j = 0; j < M; j++) {
        sum += data[j][i];
    }
}

Row-major and Column-major

The loop above accesses the matrix in column-major order. Accessing the data row by row instead (see the sketch below) is faster, as it touches [0][0], [0][1], [0][2], ... which are already loaded into the cache line: reading [0][0] loads everything up to [1][3].
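A sketch of the cache-friendly loop order for the same sum; only the nesting order changes:

/* Row-by-row traversal: the inner loop walks along a row, so
   consecutive accesses fall in the same cache line. */
for (j = 0; j < M; j++) {
    for (i = 0; i < N; i++) {
        sum += data[j][i];
    }
}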

Changing the order of the iterations is not always better. Below is an example.

/* Writing transposed[i][j] in row-major order forces reads of
   original[j][i] in column-major order: we have fixed the access
   pattern of transposed, but not of original. */
int original[M][N];
int transposed[N][M];

for (i = 0; i < N; i++) {
    for (j = 0; j < M; j++) {
        transposed[i][j] = original[j][i];
    }
}

Effect of transposing the shape

This is the effect of the previous program: it rotates (transposes) the image.

Insufficient Temporal Locality

int original[M][N];
int transposed[N][M];

/* Blocked transpose: process the matrix in m-by-n blocks so that
   each block fits in the cache. */
for (k = 0; k < N / m; k++) {
    for (l = 0; l < M / n; l++) {
        for (i = k * m; i < (k + 1) * m; i++) {
            for (j = l * n; j < (l + 1) * n; j++) {
                transposed[i][j] = original[j][i];
            }
        }
    }
}

The blocked transpose gets around the cache misses. The block dimensions m and n are chosen from the cache line size (say 32 bytes) so that each m-by-n block fits into the cache; for 32-byte lines and 4-byte ints a line holds 8 ints, so an 8 x 8 block is a natural choice.

Virtual Memory (VM)

The term virtual memory refers to a combination of hardware and operating-system software that solves several computing problems.

It receives a single name because it is a single mechanism, but it meets several goals:

To simplify memory management and program loading by providing virtual addresses.

To allow multiple large programs to be run without the need for large amounts of RAM, by providing virtual storage.

Virtual Addresses

Segmentation – groups pages together into segments of different sizes.

Memory Protection – because more than ONE process is supported, each process's memory must be protected from corruption by the others.

Paging – uses the same fixed page size on disk and in memory, loading pages into memory and writing them back to disk as needed; this lets computers hold several programs in memory at the same time.

Virtual Memory – Explanation

The figure shows the sequence of virtual-memory operation when the program is larger than main memory: pages move between main memory and the disk.

Two contradictory facts about VM:

The compiler determines the address at which a program will execute, by hard-wiring a lot of addresses of variables and instructions into the machine code it generates.

The location of the program is not determined until the program is executed and may be anywhere in main memory.

Solution to the contradictory facts

Code Relocation: Have the compiler generate addresses relative to a base address, and change the base address when the program is executed. The address of each reference is then calculated explicitly by adding the relative address to the base address (physical = base + relative). Drawback: every memory reference pays for this extra addition at run time.

Address Translation: At run time, provide programs the illusion that there are no other programs in memory. Compilers can then generate any absolute address they wish. Two programs may contain references to the same address without interference.

Virtual and Physical Addresses

The addresses issued by the compiler are called virtual addresses.

The addresses that result from the translation are called physical addresses, because they refer to an actual memory chip.
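As a hedged sketch of how this translation might work under paging (the 4 KB page size and the flat page_table array are assumptions for illustration, not a description of any particular system):

#define PAGE_SIZE 4096UL                   /* assumed page size */
#define PAGE_BITS 12                       /* log2(PAGE_SIZE)   */

/* Translate a virtual address to a physical one via a flat page table
   that maps virtual page numbers to physical frame numbers. */
unsigned long translate(unsigned long vaddr, const unsigned long *page_table) {
    unsigned long vpn    = vaddr >> PAGE_BITS;        /* virtual page number */
    unsigned long offset = vaddr & (PAGE_SIZE - 1);   /* offset within page  */
    return (page_table[vpn] << PAGE_BITS) | offset;   /* frame + offset      */
}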

Multiple programs without relocation

Program A shares some memory locations belonging to Program B.

Relocatable code can share memory

Program A uses only the memory locations belonging to itself.

Summary

Caches: L1 (within the CPU), L2, and L3

Data cache and instruction cache

Programming: column-major vs. row-major access; row-major access enhances performance

Virtual memory: when main memory is too small to hold the whole program, pages are loaded into memory on demand