1 Lecture 6 Performance Measurement and Improvement.
Date posted: 19-Dec-2015
3
Rationale for this unit
This lecture is about making programs run fast. Speed is usually not the most important concern when writing a program.
The professional programmer is usually most concerned with making a program that is easy to
write, debug, and maintain.
A programmer's job is more than just coding.
4
Reason on simple program (1)
A correct program, even if it is slow, computes right answers faster than a program that is not correct. It is often better to use a simple but slow algorithm.
A program that is finished computes right answers much faster than a program that is not.
Fast programs often take much more time to develop, and they are useless until they are finished.
Simple but fast program
5
Reason on simple program (2)
Computer performance doubles roughly every 18 months. Computer technology changes so fast that improvements in speed can often be obtained simply by waiting for the next generation of hardware.
Speed improvements of less than a factor of two are barely noticeable to users in an interactive setting.
7
Measurement and Profiling
First, how to measure a program's performance:
What to Measure (execution speed)
Timing Mechanisms (use wall clock, such as your watch)
8
What to Measure (CPU time and Wall clock)
The most common thing to measure is CPU time.
CPU time is the time a process spends executing instructions.
It does not count any time spent executing other programs or just waiting.
9
What to Measure (Wall clock)
An alternative is to measure real time, or "wall clock time". This is the time an ordinary clock on the wall or a wristwatch shows.
The difference between CPU time and wall time can give some indication of the time spent waiting for I/O.
[Figure: wall time, made up of CPU time and I/O time]
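As a minimal sketch of this difference, the helpers below read both clocks around a short sleep that stands in for an I/O wait. This assumes a POSIX system for `clock_gettime` and `nanosleep`, which the slides do not mention; the function names are hypothetical.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 199309L   /* for clock_gettime and nanosleep */
#endif
#include <assert.h>
#include <time.h>

/* CPU time used by this process so far, in seconds (standard C clock()). */
double cpu_seconds(void)
{
    return (double)clock() / CLOCKS_PER_SEC;
}

/* Wall-clock time in seconds from a monotonic clock (POSIX, an assumption). */
double wall_seconds(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec + (double)ts.tv_nsec / 1e9;
}

/* Sleep for ms milliseconds, then return elapsed wall time minus CPU time
   used, i.e. roughly the time spent waiting rather than executing. */
double waiting_seconds_during_sleep(long ms)
{
    struct timespec req;
    double wall_start, cpu_start;

    assert(ms >= 0);
    req.tv_sec = ms / 1000;
    req.tv_nsec = (ms % 1000) * 1000000L;

    wall_start = wall_seconds();
    cpu_start = cpu_seconds();
    nanosleep(&req, NULL);          /* stands in for an I/O wait */
    return (wall_seconds() - wall_start) - (cpu_seconds() - cpu_start);
}
```

Calling `waiting_seconds_during_sleep(50)` should return a little over 0.05, because the process uses almost no CPU time while it sleeps.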
10
CPU time
CPU time can be divided between user time, the time spent directly executing your program code, and
system time, the time spent by the operating system on behalf of your program.
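One way to see the split is to ask the operating system for it. The sketch below assumes a POSIX system and uses `getrusage`, which is not named in the slides; `cpu_time_split` and `tv_seconds` are hypothetical helper names.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/resource.h>
#include <sys/time.h>

/* Convert a struct timeval to seconds. */
static double tv_seconds(struct timeval tv)
{
    return (double)tv.tv_sec + (double)tv.tv_usec / 1e6;
}

/* Fill *user and *sys with the CPU time this process has used so far.
   Returns 0 on success and -1 on failure, like getrusage itself. */
int cpu_time_split(double *user, double *sys)
{
    struct rusage ru;

    assert(user != NULL && sys != NULL);
    if (getrusage(RUSAGE_SELF, &ru) != 0)
        return -1;
    *user = tv_seconds(ru.ru_utime);   /* time executing program code */
    *sys = tv_seconds(ru.ru_stime);    /* time the OS spent on our behalf */
    return 0;
}
```

A program that mostly computes tends to show a large user time, while one that makes many system calls tends to show more system time.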
11
Timing Mechanisms
There are two ways to measure the timing behavior of a program. The most obvious is direct measurement with a timer (wall clock: the difference between start and end times).
An alternative to using timers directly is to use statistical sampling (profiling). A timer periodically interrupts the program and records the program counter or increments a counter.
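The sampling approach can be sketched in a few lines. This is an illustration under assumptions, not how any particular profiler works: it assumes a POSIX system (`setitimer`, `sigaction`, `SIGPROF`), it increments a per-region counter instead of recording the program counter, and every name in it is hypothetical.

```c
#ifndef _XOPEN_SOURCE
#define _XOPEN_SOURCE 700   /* for sigaction, setitimer, SIGPROF */
#endif
#include <assert.h>
#include <signal.h>
#include <string.h>
#include <sys/time.h>
#include <time.h>

enum { REGION_OTHER, REGION_HOT, REGION_COUNT };

static volatile sig_atomic_t current_region = REGION_OTHER;
static volatile sig_atomic_t samples[REGION_COUNT];

/* Timer interrupt: charge one sample to whichever region is running now. */
static void on_sample(int signo)
{
    (void)signo;
    samples[current_region]++;
}

/* Start sampling once every interval_us microseconds of CPU time. */
int sampling_start(long interval_us)
{
    struct sigaction sa;
    struct itimerval it;

    assert(interval_us > 0 && interval_us < 1000000L);
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_sample;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = SA_RESTART;
    if (sigaction(SIGPROF, &sa, NULL) != 0)
        return -1;

    it.it_interval.tv_sec = 0;
    it.it_interval.tv_usec = interval_us;
    it.it_value = it.it_interval;
    return setitimer(ITIMER_PROF, &it, NULL);
}

/* Stop sampling by disarming the timer. */
int sampling_stop(void)
{
    struct itimerval off;

    memset(&off, 0, sizeof off);
    return setitimer(ITIMER_PROF, &off, NULL);
}

/* Burn roughly `seconds` of CPU time while marked as the hot region. */
void run_hot_region(double seconds)
{
    clock_t end = clock() + (clock_t)(seconds * CLOCKS_PER_SEC);
    volatile double x = 0.0;

    current_region = REGION_HOT;
    while (clock() < end)
        x += 1.0;
    current_region = REGION_OTHER;
}

/* Number of samples charged to a region so far. */
long samples_in(int region)
{
    assert(region >= 0 && region < REGION_COUNT);
    return (long)samples[region];
}
```

After `sampling_start(1000)`, a call to `run_hot_region(0.3)`, and `sampling_stop()`, the count for `REGION_HOT` should dominate. The ratio between region counts approximates the ratio of CPU time spent in each.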
12
High-Resolution on Pentium Systems
Typical operating system clocks are not very precise because they rely on hardware to interrupt the processor once every clock period.
The operating system then increments a counter.
Intel Pentium processors (among others) have a very high-speed internal 64-bit counter that can be accessed by special instructions.
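A hedged sketch of reading such a counter: it assumes GCC or Clang on an x86 processor, where the `__rdtsc()` intrinsic from `<x86intrin.h>` issues the RDTSC instruction. On other platforms it falls back to POSIX `clock_gettime`, so the unit of a "tick" differs between the two paths. The function names are hypothetical.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 199309L   /* for the clock_gettime fallback */
#endif
#include <assert.h>
#include <stdint.h>
#include <time.h>

#if defined(__x86_64__) || defined(__i386__)
#include <x86intrin.h>   /* GCC/Clang intrinsics header (an assumption) */
#endif

/* Read a high-resolution 64-bit tick counter. */
uint64_t read_ticks(void)
{
#if defined(__x86_64__) || defined(__i386__)
    return __rdtsc();                    /* processor time-stamp counter */
#else
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts); /* fallback: nanoseconds */
    return (uint64_t)ts.tv_sec * 1000000000u + (uint64_t)ts.tv_nsec;
#endif
}

/* Ticks consumed by calling fn once. */
uint64_t ticks_for(void (*fn)(void))
{
    uint64_t start, end;

    assert(fn != NULL);
    start = read_ticks();
    fn();
    end = read_ticks();
    return end - start;
}

/* A small workload to time (arbitrary, for illustration only). */
void sample_work(void)
{
    volatile unsigned long sum = 0;
    unsigned long i;

    for (i = 0; i < 100000UL; i++)
        sum += i;
}
```

Tick counts from `ticks_for(sample_work)` are only meaningful relative to other counts taken on the same machine; converting them to seconds requires knowing the counter frequency.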
15
Principles - Performance
The 80/20 Rule: 80% of the CPU time is spent in 20% of the program.
In this case, you get the biggest performance gain by improving that 20%.
Amdahl's Law: in parallel processing, the achievable speedup is limited by the sequential part of the program.
16
Explanation
Suppose the program really spends 80% of its time in one spot, and suppose you can rewrite this spot to take a negligible amount of time.
The program will now execute in 20% of its original time, meaning that it now runs 5 times as fast.
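The same arithmetic works for any fraction and any improvement. As a small sketch (the function name is hypothetical), if a fraction f of the run time is made s times faster, the new run time is (1 - f) + f / s of the original, and the overall speedup is the reciprocal of that:

```c
#include <assert.h>

/* Overall speedup when a fraction f of the run time becomes s times faster.
   new_time / old_time = (1 - f) + f / s */
double overall_speedup(double f, double s)
{
    assert(f >= 0.0 && f <= 1.0 && s > 0.0);
    return 1.0 / ((1.0 - f) + f / s);
}
```

With f = 0.8 and a very large s, the result approaches 5, matching the case above. With f = 0.8 and s = 2, the overall speedup is only about 1.67, because the untouched 20% still has to run.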
17
Example of 80/20: 10% on one module means 2% as a whole
A program consists of 5 modules
Before: 20 ms, 20 ms, 20 ms, 20 ms, 20 ms (total 100 ms)
After: 20 ms, 18 ms, 20 ms, 20 ms, 20 ms (total 98 ms)
18
Example of 80/20: 10% on one module means about 5% as a whole
A program consists of 5 modules
Before: 10 ms, 50 ms, 10 ms, 10 ms, 10 ms (total 90 ms)
After: 10 ms, 45 ms, 10 ms, 10 ms, 10 ms (total 85 ms)
Conclusion: focus on the module that uses the most CPU time
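The two module examples can be checked with a short calculation. This is only a sketch with hypothetical function names: it takes the per-module times and reports what percentage of the whole run is saved when one module's time is cut by a given percentage.

```c
#include <assert.h>
#include <stddef.h>

/* Sum of per-module times in milliseconds. */
double total_ms(const double *modules, size_t n)
{
    double sum = 0.0;
    size_t i;

    assert(modules != NULL);
    for (i = 0; i < n; i++)
        sum += modules[i];
    return sum;
}

/* Percentage of the whole run saved when the time of module `which`
   is cut by `percent_cut` percent. */
double whole_program_saving(const double *modules, size_t n,
                            size_t which, double percent_cut)
{
    double total = total_ms(modules, n);

    assert(which < n && total > 0.0);
    return 100.0 * (modules[which] * percent_cut / 100.0) / total;
}
```

For five modules of 20 ms each, a 10% cut in one module saves 2% overall. For modules of 10, 50, 10, 10, and 10 ms, the same 10% cut applied to the 50 ms module saves 5 ms out of 90 ms, which is about 5.6% overall.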
21
Example – a simple for loop
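#include <stdio.h>
#include <stdlib.h>
int main(void) {
    /* print each value together with its square (i * i, since ^ is XOR in C) */
    for (int i = 0; i < 1000; i++)
        printf("The value is %d, its square is %d\n", i, i * i);
    return 0;
}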
28
Example – a simple while loop
#include <stdio.h>
#include <stdlib.h>
int main(void) {
    int i = 0;
    while (i < 1000) {
        /* i * i is the square; ^ would be XOR in C */
        printf("The value is %d, its square is %d\n", i, i * i);
        i++;
    }
    return 0;
}
32
A program that can be used to determine MFLOPS
// This is matrix multiplication
#include <stdio.h>
#include <stdlib.h>
// static arrays are zero-initialized and kept off the stack
static float a[250][250], b[250][250], c[250][250];
int main(void)
{
    int i, j, k;
    for (i = 0; i < 250; i++)
        for (j = 0; j < 250; j++)
            for (k = 0; k < 250; k++)
                c[i][j] += a[i][k] * b[k][j]; // matrix multiplication
    return 0;
}
34
Determination of Mega Flop
The time it takes on my machine is 349 ms.
This program involves 250^3 steps, each including two floating-point operations, an add and a multiply: 250 x 250 x 250 = 15,625,000 steps.
The performance for this loop is 15,625,000 / 349 ms = 15.625 x 10^6 / 0.349 s, or about 44.8 million steps per second. Counting each step as one operation gives about 44 MFLOPS (million floating-point operations per second); counting both the add and the multiply gives about 89.5 MFLOPS.
Note that for a supercomputer, the value is about 1000 MFLOPS. You can try this on your computer at home to determine your machine's performance.
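The calculation can be written as a small helper so it is easy to repeat with your own timing. This is a sketch with a hypothetical function name; the number of operations counted per step is a parameter because the result depends on whether the add and the multiply are counted separately.

```c
#include <assert.h>

/* Millions of floating-point operations per second for an n x n x n
   triple loop that performs flops_per_step operations in its inner body. */
double mflops(long n, int flops_per_step, double seconds)
{
    double steps;

    assert(n > 0 && flops_per_step > 0 && seconds > 0.0);
    steps = (double)n * (double)n * (double)n;
    return steps * flops_per_step / seconds / 1e6;
}
```

With n = 250 and 0.349 seconds, `mflops(250, 1, 0.349)` is about 44.8 and `mflops(250, 2, 0.349)` is about 89.5.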
35
Same output but change the program
#include <stdio.h>
#include <stdlib.h>
// this program uses a temporary variable r
// to store the value
static float a[250][250], b[250][250], c[250][250]; // zero-initialized
int main(void)
{
    int i, j, k;
    float r;
    for (i = 0; i < 250; i++) {
        for (j = 0; j < 250; j++) {
            r = 0.0f; // restart the sum for each element of c
            for (k = 0; k < 250; k++) {
                r += a[i][k] * b[k][j]; // this is matrix multiplication
            }
            c[i][j] = r;
        }
    }
    return 0;
}
36
Same machine – 254ms, why?
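The inner loop now adds into the local variable r, which the compiler can keep in a register, instead of reading and writing c[i][j] in memory on every step. This is related to the cache memory effect, which will be explained later.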
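To try the comparison yourself, the sketch below times both versions with the standard `clock()` function and checks that they produce the same result. The input values and all names are arbitrary choices for illustration, and the measured times depend heavily on the compiler and optimization level, so no particular ratio should be expected.

```c
#include <assert.h>
#include <time.h>

#define N 250

static float a[N][N], b[N][N], c_direct[N][N], c_temp[N][N];

/* Fill a and b with small repeatable values (arbitrary, for illustration). */
void fill_inputs(void)
{
    int i, j;

    for (i = 0; i < N; i++)
        for (j = 0; j < N; j++) {
            a[i][j] = (float)((i + j) % 7) * 0.5f;
            b[i][j] = (float)((i * 3 + j) % 5) * 0.25f;
        }
}

/* Version 1: update c_direct[i][j] in memory on every inner step.
   Returns the CPU time used, in seconds. */
double time_direct(void)
{
    int i, j, k;
    clock_t start = clock();

    for (i = 0; i < N; i++)
        for (j = 0; j < N; j++) {
            c_direct[i][j] = 0.0f;
            for (k = 0; k < N; k++)
                c_direct[i][j] += a[i][k] * b[k][j];
        }
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}

/* Version 2: accumulate in the local variable r, store once at the end.
   Returns the CPU time used, in seconds. */
double time_with_temp(void)
{
    int i, j, k;
    float r;
    clock_t start = clock();

    for (i = 0; i < N; i++)
        for (j = 0; j < N; j++) {
            r = 0.0f;
            for (k = 0; k < N; k++)
                r += a[i][k] * b[k][j];
            c_temp[i][j] = r;
        }
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}

/* Return 1 if both result matrices agree within tolerance, else 0. */
int results_match(double tolerance)
{
    int i, j;

    assert(tolerance >= 0.0);
    for (i = 0; i < N; i++)
        for (j = 0; j < N; j++) {
            double d = (double)c_direct[i][j] - (double)c_temp[i][j];
            if (d < 0.0)
                d = -d;
            if (d > tolerance)
                return 0;
        }
    return 1;
}
```

Call `fill_inputs()` once, then compare the values returned by `time_direct()` and `time_with_temp()`. Running each several times and taking the smallest time gives a steadier comparison.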
37
Summary
It is better to write a simple but fast program. The procedure is to write a program that works, then make it faster.
There is a rule called 80/20, which means 80% of CPU time is spent in 20% of the program. You should focus on that 20%.
To measure performance, use profiling to determine which part of the program causes the delay.