Performance
Transcript of Performance
1
IT401 Computer Organization and Architecture
Prasun Ghosal
Department of Information Technology
Bengal Engineering and Science University, Shibpur
2
Outline
•How to measure, report and summarize performance?
•What are the major factors that determine the performance of a computer?
•Execution time is the only adequate measure of performance
•Benchmarks, what are they, and how are they used to evaluate performance
3
Why study Performance?
•Hardware performance is often key to the effectiveness of an entire system of hardware and software
•The goal is not just to assess performance but to understand what affects the performance of a machine
•To improve the performance of software, understand how hardware affects system performance:
How well does a program use the instructions of the machine?
How well does the underlying HW implement those instructions?
How well do the memory and I/O systems perform?
4
How to define performance?
Airplanes example
•Passenger capacity
•Cruising range (miles)
•Cruising speed (m.p.h)
•Passenger throughput (passengers * m.p.h)
Which airplane has the best performance?
•Highest cruising speed
•Longest range
•Largest capacity
•Highest throughput
Run a program on two different workstations: which is faster?
•User: response time (execution time)
•Computer center manager: throughput
(how many tasks were performed during a time interval)
•Relationship between response time and throughput
5
Performance
Use response time or execution time. To maximize performance, minimize execution time for some task
\[ \text{Performance} = \frac{1}{\text{Execution Time}} \]
What does it mean that Performance(X) is greater than Performance(Y)?
\[ \text{Performance}(X) > \text{Performance}(Y) \]
\[ \frac{1}{\text{Execution Time}(X)} > \frac{1}{\text{Execution Time}(Y)} \]
\[ \text{Execution Time}(Y) > \text{Execution Time}(X) \]
X is n times faster than Y:
\[ \frac{\text{Performance}(X)}{\text{Performance}(Y)} = \frac{\text{Execution Time}(Y)}{\text{Execution Time}(X)} = n \]
6
Performance Example
Machine A runs a program in 10 seconds and machine B runs the same program in 15 seconds. How much faster is A than B?
\[ \frac{\text{Performance}(A)}{\text{Performance}(B)} = \frac{\text{Execution Time}(B)}{\text{Execution Time}(A)} = \frac{15}{10} = 1.5 \]
A is 1.5 times faster than B
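The ratio above can be checked with a few lines of Python. This is a minimal sketch; the helper name `times_faster` is hypothetical and not part of the slides.

```python
def times_faster(exec_time_x: float, exec_time_y: float) -> float:
    """Return n such that X is n times faster than Y.

    Performance is 1 / execution time, so
    Performance(X) / Performance(Y) = ExecutionTime(Y) / ExecutionTime(X).
    """
    if exec_time_x <= 0 or exec_time_y <= 0:
        raise ValueError("execution times must be positive")
    return exec_time_y / exec_time_x


# Machine A takes 10 s and machine B takes 15 s (values from the example).
n = times_faster(exec_time_x=10.0, exec_time_y=15.0)
print(f"A is {n:.1f} times faster than B")
```

Note that the slower machine's time goes in the numerator, because performance and execution time are reciprocals.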
7
Measuring Performance 1/4
Time is the measure of computer performance (sec per program)
•Response time or elapsed time
Total time to complete a task including everything (disk access, memory access, operating system overhead, …)
•CPU execution time (CPU time)
Time CPU spends computing for this task and does not include time spent waiting for I/O or running other programs (some computers are timeshared)
•CPU execution time can be divided into
User CPU time: CPU time spent in the program
System CPU time: CPU time spent in operating system performing tasks on behalf of the program
8
Measuring Performance 2/4
Example of user CPU time and System CPU time
Output of Unix time command
90.7u 12.9s 2:39 65%
User CPU time 90.7 sec
System CPU time 12.9 sec
Elapsed time 2:39 = ( 2 minutes and 39 sec) = 159 sec
% of elapsed time that is CPU time = (90.7 + 12.9)/159 = 65%
Then 100 – 65 = 35% of elapsed time was spent doing something else
(waiting for I/O, running other programs, …)
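The percentage reported by the time command can be reproduced from the three measured values. The sketch below assumes only the numbers shown in the example; the function name `cpu_share` is a hypothetical helper.

```python
def cpu_share(user_cpu_s: float, system_cpu_s: float, elapsed_s: float) -> float:
    """Fraction of elapsed time that was CPU time (user + system)."""
    if elapsed_s <= 0:
        raise ValueError("elapsed time must be positive")
    return (user_cpu_s + system_cpu_s) / elapsed_s


# Values from the example: 90.7u 12.9s 2:39
elapsed = 2 * 60 + 39                 # 2 minutes 39 seconds, in seconds
share = cpu_share(90.7, 12.9, elapsed)

print(f"CPU time: {90.7 + 12.9:.1f} s of {elapsed} s elapsed ({share:.0%})")
print(f"Other activity (I/O wait, other programs): {1 - share:.0%}")
```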
9
Measuring Performance 3/4
Express CPU execution time in terms of other metrics that relate to how fast the HW can perform basic functions
•Computers are governed by a clock that runs at a constant rate and determines when events happen in HW
•The length of a clock period is the clock cycle time (measured in nanoseconds (10^-9 sec) or picoseconds (10^-12 sec))
•Clock rate is 1/(clock cycle time) (measured in megahertz (MHz = 10^6 Hz) or gigahertz (GHz = 10^9 Hz))
•1 Hertz is 1 cycle/sec
CPU execution time = CPU clock cycles for a program * clock cycle time
CPU execution time = CPU clock cycles for a program / clock rate
How to improve CPU execution time?
10
Measuring Performance 4/4
Relating to Software
•Express CPU clock cycles in terms of program instructions
•CPU clock cycles = Instructions for a program * Average clock cycles per instruction
•Clock cycles per instruction (average number of cycles each instruction takes to execute) is abbreviated as CPI
•CPI can be used to compare two implementations of the same instruction set architecture (since instruction count for a program will remain the same)
11
Measuring Performance 1/5
CPU clock cycles = Instructions for a program * CPI
CPU time = CPU clock cycles * clock cycle time
CPU time = Instruction count * CPI * clock cycle time
CPU time = Instruction count * CPI/clock rate
\[ \text{Time} = \frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Clock cycles}}{\text{Instruction}} \times \frac{\text{Seconds}}{\text{Clock cycle}} \]
Basic Performance Components
•CPU execution time
•Instruction count
•CPI
•Clock cycle time
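The two equivalent forms of the CPU time equation can be sketched as follows. The instruction count, CPI, and clock rate used here are hypothetical values chosen for illustration; they do not come from the slides.

```python
def cpu_time_from_rate(instruction_count: float, cpi: float, clock_rate_hz: float) -> float:
    """CPU time = Instruction count * CPI / clock rate."""
    return instruction_count * cpi / clock_rate_hz


def cpu_time_from_cycle(instruction_count: float, cpi: float, cycle_time_s: float) -> float:
    """CPU time = Instruction count * CPI * clock cycle time."""
    return instruction_count * cpi * cycle_time_s


# Hypothetical program: 2 billion instructions, average CPI of 1.5, 1 GHz clock.
instructions = 2e9
cpi = 1.5
clock_rate = 1e9                  # Hz
cycle_time = 1.0 / clock_rate     # seconds per cycle (1 ns)

t_rate = cpu_time_from_rate(instructions, cpi, clock_rate)
t_cycle = cpu_time_from_cycle(instructions, cpi, cycle_time)
print(f"CPU time via clock rate: {t_rate:.2f} s")
print(f"CPU time via cycle time: {t_cycle:.2f} s")
```

Both forms give the same result because clock cycle time is the reciprocal of clock rate; reducing any one of the three factors reduces CPU time.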
12
Measuring Performance 2/5
How to determine values of performance components
•CPU execution time: measurement
•Clock cycle time: published as part of documentation for a machine
•Instruction count:
Software tools to profile execution, or use a simulator of the architecture
Hardware counters, if available, to measure the number of instructions executed
•CPI: varies by application, as well as among implementations of the same instruction set. Obtained through detailed simulation or by combining HW counters and simulation
CPI can be calculated if the instruction counts for the different instruction classes and their individual clock cycle counts are known
13
Measuring Performance 3/5
\[ \text{CPU clock cycles} = \sum_{i=1}^{n} (CPI_i \times C_i) \]
Ci: number of instructions of class i executed
CPIi: average number of cycles per instruction for that instruction class
n: number of instruction classes
Overall program CPI depends on
•Number of cycles for each instruction type
•Frequency of each instruction type in the program execution
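The per-class sum and the resulting overall CPI can be sketched as below. The instruction classes, counts, and per-class CPI values are hypothetical and serve only to illustrate the formula.

```python
def total_cycles(class_counts: dict, class_cpi: dict) -> float:
    """CPU clock cycles = sum over classes i of CPI_i * C_i."""
    return sum(class_cpi[name] * count for name, count in class_counts.items())


def overall_cpi(class_counts: dict, class_cpi: dict) -> float:
    """Overall CPI = total clock cycles / total instruction count."""
    return total_cycles(class_counts, class_cpi) / sum(class_counts.values())


# Hypothetical instruction mix (counts C_i) and per-class CPI_i.
counts = {"alu": 50, "load_store": 30, "branch": 20}
cpi_per_class = {"alu": 1, "load_store": 2, "branch": 3}

cycles = total_cycles(counts, cpi_per_class)
cpi = overall_cpi(counts, cpi_per_class)
print(f"CPU clock cycles = {cycles}, overall CPI = {cpi:.2f}")
```

Changing either the per-class cycle counts or the mix of instructions changes the overall CPI, which is why CPI varies between programs on the same machine.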
14
Measuring Performance 4/5
CPU clock cycles = Instructions for a program * CPI
\[ \text{Performance} = \frac{1}{\text{Execution Time}} \]
CPU time = CPU clock cycles * clock cycle time
CPU time = CPU clock cycles for a program / clock rate
CPU time = Instruction count * CPI * clock cycle time
CPU time = Instruction count * CPI/clock rate
\[ \text{CPU clock cycles} = \sum_{i=1}^{n} (CPI_i \times C_i) \]
15
Benchmarks 1/3
Concept of Workload
•Informally, set of programs that the user runs day in and day out
Benchmarks
•Programs specifically chosen to measure performance
•Form a workload that the user hopes will predict the performance of the actual workload
•Best benchmark types are real programs
•Use of benchmarks whose performance depends on small code segments encourages optimizations in either the architecture or compiler
•A problem: Compilers with special-purpose optimizations targeted at specific benchmarks. Will such optimizations produce good or correct code with a real application?
16
Benchmarks 2/3
COPYRIGHT 1998 MORGAN KAUFMANN PUBLISHERS, INC. ALL RIGHTS RESERVED
•matrix300 in the SPEC suite in 1989
•SPEC is System Performance Evaluation Cooperative
•For matrix300, the enhanced compiler improves performance by a factor of more than 9, although it yields far less improvement on the other benchmarks.
•SPEC benchmark web site http://www.specbench.org
17
Benchmarks 3/3
Why are real programs not used to measure performance?
•Small size of benchmark (easier compilation and simulation)
•Compilers might not be available for a new machine
•Numerous published performance results are available for small benchmarks
•Benchmarks are OK for the initial design phase, but a working computer system should be evaluated with a real program
Writing Performance reports
•Reproducibility
•Include everything needed to be able to duplicate the experiment
18
Comparing and Summarizing Performance 1/4
Selected benchmark
Agreed to use response time or throughput
How to summarize performance of a group of benchmarks?
         M/C A   M/C B
P1       1       10
P2       1000    100
Total    1001    110
A is 10 times faster than B for P1
B is 10 times faster than A for P2
What is the relative performance of A & B?
Use Total Execution Time
\[ \frac{\text{Performance}(B)}{\text{Performance}(A)} = \frac{\text{Execution Time}(A)}{\text{Execution Time}(B)} = \frac{1001}{110} = 9.1 \]
19
Comparing and Summarizing Performance 2/4
•B is 9.1 times faster than A for P1 and P2 together
•This gives one figure as a summary of performance, directly proportional to execution time
•If the workload consists of running P1 and P2 an equal number of times, this statement would predict the relative execution times for the workload on each machine
•The average of execution times that is directly proportional to total execution time is the arithmetic mean (AM)
\[ AM = \frac{1}{n} \sum_{i=1}^{n} \text{Time}(i) \]
Time(i): execution time for the ith program
n: total number of programs in the workload
A smaller mean indicates a smaller average execution time and thus better performance
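The total execution time and arithmetic mean for the P1/P2 table can be computed as follows. This is a minimal sketch using the run times from the table; the variable and function names are illustrative.

```python
def arithmetic_mean(times):
    """AM = (1/n) * sum of Time(i) over the n programs in the workload."""
    return sum(times) / len(times)


# Execution times of P1 and P2 on machines A and B (from the table).
run_times = {"A": [1, 1000], "B": [10, 100]}

totals = {machine: sum(times) for machine, times in run_times.items()}
means = {machine: arithmetic_mean(times) for machine, times in run_times.items()}
ratio = totals["A"] / totals["B"]

print(f"Total time: A = {totals['A']}, B = {totals['B']}")
print(f"Arithmetic mean: A = {means['A']}, B = {means['B']}")
print(f"B is {ratio:.1f} times faster than A for P1 and P2 together")
```

The ratio of the arithmetic means equals the ratio of the totals, since both machines run the same number of programs.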
20
Comparing and Summarizing Performance 3/4
•Arithmetic mean is proportional to execution time if the programs in the workload are each run an equal number of times. What happens if that is not the case?
•Assign a weighting factor w(i) to each program to indicate frequency of the program in the workload
•Weighted arithmetic mean
•AM special case of weighted AM when all weights are equal
\[ \text{Weighted AM} = \sum_{i=1}^{n} w(i) \times \text{Time}(i) \]
21
Comparing and Summarizing Performance 4/4
Program   M/C A   M/C B   M/C C
P1        1       10      20
P2        1000    100     20
Table shows runtimes of P1 and P2 on three machines A, B, and C
Workload consists of P1 and P2.
P1 is run 10 times as often as P2
Which machine is fastest for this workload, and by how much?
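One way to work this exercise is with the weighted arithmetic mean. The sketch below assumes the weights are normalized so that they sum to 1 (10/11 for P1 and 1/11 for P2); that normalization is an assumption, not something stated on the slide.

```python
def weighted_arithmetic_mean(times, weights):
    """Weighted AM = sum of w(i) * Time(i), with weights normalized to sum to 1."""
    if len(times) != len(weights):
        raise ValueError("need exactly one weight per program")
    total_weight = sum(weights)
    return sum(w * t for w, t in zip(weights, times)) / total_weight


# Run times of (P1, P2) on each machine, from the table.
run_times = {"A": [1, 1000], "B": [10, 100], "C": [20, 20]}
# P1 is run 10 times as often as P2.
weights = [10, 1]

means = {m: weighted_arithmetic_mean(t, weights) for m, t in run_times.items()}
fastest = min(means, key=means.get)

for machine, mean in means.items():
    print(f"{machine}: weighted AM = {mean:.2f} "
          f"({mean / means[fastest]:.2f}x the time of {fastest})")
```

Under this weighting B has the smallest weighted mean (about 18.18), so it is about 1.1 times faster than C (20) and about 5.05 times faster than A (about 91.82).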
22
SPEC95 Benchmarks
•CPU benchmark
•Created by a set of computer companies in 1989
•SPEC95 (8 integer and 10 floating point programs). Figure 2.6
•SPEC95 web site (http://www.specbench.org/osg/cpu95/news/cpu95descr.html)
•SPEC ratio for xxx.benchmark =
xxx.benchmark reference time /xxx.benchmark run time
Normalized measure. Higher results indicate faster performance
•Reference machine is a Sun SPARCstation 10/40
•SPECint95 or SPECfp95 summary measurement is obtained by taking the geometric mean of the SPEC ratios
\[ \text{Geometric mean} = \sqrt[n]{\prod_{i=1}^{n} \text{SPECratio}(i)} \]
where \( \prod_{i=1}^{n} a_i \) is the product a_1 * a_2 * ... * a_n
23
SPEC95 Benchmark results for Pentium and Pentium Pro
•At same clock rate, Pentium Pro is 1.4 to 1.5 times faster
•When clock rate increased by a certain factor, processor performance increases by a lower factor
•Pentium clock rate increased from 100 to 200 MHz: SPECint95 performance improves by a factor of only 1.7 (Why?)
24
SPEC95 Benchmark results for Pentium and Pentium Pro
•At same clock rate, Pentium Pro is 1.7 to 1.8 times faster
•Clock rate increased from 100 to 200 MHz: SPECfp95 performance improves by a factor of only 1.4 (Why?)
•The memory system becomes a bottleneck as processor speed increases; the effect is more evident on the floating-point benchmarks because of their size.
25
Performance Summary Example 1/2
      M/C A   M/C B
P1    1482    139
P2    2266    254
P3    6206    690
Which machine is faster according to total execution time? And by how much?
Total Execution Time (A) = 1482 + 2266 + 6206 = 9954
Total Execution Time (B) = 139 + 254 + 690 = 1083
Machine B is faster by a factor of 9954/1083 = 9.19
26
Performance Summary Example 2/2
      M/C A   M/C B
P1    1482    139
P2    2266    254
P3    6206    690
Which machine is faster by the geometric mean measure?
Remember how SPEC reported performance?
•Normalize in reference to one machine
•Choose A as reference machine
•Obtain Execution time ratios (ET Ratio)
ET Ratio(P1) = ET(A)/ET(B) = 1482/139 = 10.66
ET Ratio (P2) = 2266/254 = 8.92
ET Ratio(P3) = 6206/690 = 8.99
Geometric Mean = (Ratio(P1) * Ratio(P2) * Ratio(P3))^(1/3)
Geometric Mean = 9.49
Machine B is 9.49 times faster than A according to geometric mean measure
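The geometric mean calculation can be reproduced with the sketch below, which uses the execution times from the table. The function name is illustrative, and `math.prod` requires Python 3.8 or later.

```python
import math


def geometric_mean(values):
    """n-th root of the product of n positive values."""
    values = list(values)
    if not values or any(v <= 0 for v in values):
        raise ValueError("need at least one value, all positive")
    return math.prod(values) ** (1.0 / len(values))


# Execution times of P1, P2, P3 on machines A and B (from the table).
time_a = [1482, 2266, 6206]
time_b = [139, 254, 690]

# Execution time ratios ET(A)/ET(B), one per program.
ratios = [a / b for a, b in zip(time_a, time_b)]
gm = geometric_mean(ratios)

print("ET ratios:", [round(r, 2) for r in ratios])
print(f"Geometric mean = {gm:.2f}")
```

Taking the ratios the other way round, ET(B)/ET(A), gives the reciprocal of this geometric mean, so the conclusion about which machine is faster stays the same.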
27
Amdahl’s Law 1/3
Pitfall: Expecting the improvement of one aspect of a machine to increase performance by an amount proportional to the size of the improvement
Program runs in 100 sec on a machine
Multiply operations responsible for 80 sec of time
How much do we need to improve the speed of multiplication if program is to run 5 times faster?
Execution time after improvement =
(Execution time affected by improvement / Amount of improvement) + Execution time unaffected
To run 5 times faster, execution time after improvement must be 100/5 = 20 sec
20 = 80/n + (100 - 80) = 80/n + 20
80/n = 0, so no n can be found to achieve the requested improvement
Make the common case fast
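The same reasoning can be expressed as a small function that solves for the required improvement factor n, or reports that none exists. This is a sketch; the function name and the choice to return `None` for the impossible case are assumptions made for illustration.

```python
def required_improvement(total_time, affected_time, target_speedup):
    """Solve target_time = affected_time / n + unaffected_time for n.

    Returns None when the unaffected time alone already meets or exceeds
    the target time, in which case no finite n achieves the target.
    """
    target_time = total_time / target_speedup
    unaffected_time = total_time - affected_time
    remaining = target_time - unaffected_time
    if remaining <= 0:
        return None
    return affected_time / remaining


# Program takes 100 s, of which 80 s are multiply operations.
print("5x faster:", required_improvement(100, 80, 5))   # no solution
print("4x faster:", required_improvement(100, 80, 4))   # n = 16
```

Running 4 times faster is still possible (multiplication must become 16 times faster), but 5 times faster is not, because the 20 sec of unaffected time already equals the target.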
28
Amdahl’s Law 2/3
Another form of Amdahl’s Law (to yield Speedup)
Speedup = Performance after improvement/Performance before
Speedup = Execution time before/Execution time after improvement
Assume new hardware added to machine
f = fraction of all operations which use new hardware
s = speedup of those operations using new hardware
Execution time with new hardware is Tnew
Execution time without new hardware is Told
Tnew = f* Told/s + (1-f) * Told
Overall speedup S = Told/Tnew
Speedup = s / (s - f * (s - 1))
Overall speedup for different values of f and s:
s     f = 0.1    f = 0.5    f = 0.9    f = 0.99
2     1.052632   1.333333   1.818182   1.980198
5     1.086957   1.666667   3.571429   4.807692
10    1.098901   1.818182   5.263158   9.174312
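The speedup values can be regenerated directly from Tnew = f * Told/s + (1 - f) * Told. The sketch below is illustrative; the function name `amdahl_speedup` is not from the slides.

```python
def amdahl_speedup(f: float, s: float) -> float:
    """Overall speedup Told/Tnew when a fraction f of the work is sped up by s."""
    if not 0.0 <= f <= 1.0 or s <= 0:
        raise ValueError("need 0 <= f <= 1 and s > 0")
    return 1.0 / (f / s + (1.0 - f))


fractions = [0.1, 0.5, 0.9, 0.99]
speedups = [2, 5, 10]

print("s".ljust(6) + "".join(f"f = {f}".ljust(12) for f in fractions))
for s in speedups:
    row = "".join(f"{amdahl_speedup(f, s):.6f}".ljust(12) for f in fractions)
    print(str(s).ljust(6) + row)
```

Even with s = 10, the overall speedup stays close to 1 when f is small, which is the point of "make the common case fast".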
29
Amdahl’s Law 3/3
Example of memory versus processor speedup
A = B op C
Assume memory access takes 4 cycles and a typical operation takes 2 cycles
Which of the following achieves the best increase in performance?
•Increase memory speed by 50%
•Double operation speed
First, calculate how many memory accesses are needed:
1 to get instruction from memory
2 to get B and C from memory
1 to store result (A) back in memory
Then we need a total of 4 memory access operations
Memory access time = 4 (accesses) * 4 (cycles/access) = 16 cycles
Operation time = 1 (operation) * 2 (cycles/operation) = 2 cycles
Total number of cycles = 16 + 2 = 18
Option 1 increase memory speed by 50%
s1 = 1.5 (how?)
f1 = memory access time/ total time
= 16/18 = 0.889
S1 = 1.42
Option 2 double operation speed
s2 = 2
f2 = operation time/total time
= 2/18 = 0.111
S2 = 1.059
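The two options can be cross-checked by counting cycles directly instead of applying the speedup formula. This is a sketch under the assumptions stated on the slide (4 memory accesses at 4 cycles each, 1 operation at 2 cycles); the names are illustrative.

```python
MEMORY_ACCESSES = 4        # instruction fetch, B, C, store A
CYCLES_PER_ACCESS = 4
OPERATIONS = 1
CYCLES_PER_OPERATION = 2


def total_cycles(memory_speedup: float = 1.0, operation_speedup: float = 1.0) -> float:
    """Cycles for A = B op C with the given speedups applied to each part."""
    memory = MEMORY_ACCESSES * CYCLES_PER_ACCESS / memory_speedup
    operation = OPERATIONS * CYCLES_PER_OPERATION / operation_speedup
    return memory + operation


baseline = total_cycles()                                     # 16 + 2 = 18 cycles
option1 = baseline / total_cycles(memory_speedup=1.5)         # memory 50% faster
option2 = baseline / total_cycles(operation_speedup=2.0)      # operation 2x faster

print(f"Baseline: {baseline:.0f} cycles")
print(f"Option 1 (memory speed +50%): overall speedup {option1:.2f}")
print(f"Option 2 (double operation speed): overall speedup {option2:.3f}")
```

Improving memory wins because memory accesses account for 16 of the 18 cycles.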
30
MIPS as a Performance Metric
MIPS is million instructions per second
MIPS = instruction count / (Execution time * 10^6)
Instruction execution rate (instruction/sec)
Faster machines have a higher MIPS rating
Problems with MIPS
•Does not take into account capabilities of instructions
(cannot compare computers with different ISAs)
•Varies between programs on the same computer
(a machine cannot have a single MIPS rating for all programs)
•Can vary inversely with performance
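The last point can be illustrated with two versions of the same program compiled for one machine. All numbers below (clock rate, per-class CPI, instruction counts) are made up for illustration and do not come from the slides.

```python
CLOCK_RATE_HZ = 500e6                      # hypothetical 500 MHz machine
CLASS_CPI = {"A": 1, "B": 2, "C": 3}       # hypothetical cycles per instruction class


def execution_time(counts: dict) -> float:
    """CPU time = sum of (CPI_i * C_i) / clock rate."""
    cycles = sum(CLASS_CPI[name] * n for name, n in counts.items())
    return cycles / CLOCK_RATE_HZ


def mips(counts: dict) -> float:
    """MIPS = instruction count / (execution time * 10^6)."""
    return sum(counts.values()) / (execution_time(counts) * 1e6)


# Hypothetical instruction counts produced by two compilers for the same program.
compiler1 = {"A": 5e9, "B": 1e9, "C": 1e9}
compiler2 = {"A": 10e9, "B": 1e9, "C": 1e9}

for name, counts in (("compiler 1", compiler1), ("compiler 2", compiler2)):
    print(f"{name}: {execution_time(counts):.0f} s, {mips(counts):.0f} MIPS")
```

In this sketch the second version has the higher MIPS rating (400 versus 350) yet takes longer to run (30 s versus 20 s), so MIPS and execution time rank the two versions in opposite order.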