Performance

1

IT401 Computer Organization and Architecture

Prasun Ghosal

Department of Information Technology

Bengal Engineering and Science University, Shibpur

2

Outline

•How to measure, report and summarize performance?

•What are the major factors that determine the performance of a computer?

•Execution time is the only adequate measure of performance

•Benchmarks: what are they, and how are they used to evaluate performance?

3

Why study Performance?

•Hardware performance is often key to the effectiveness of an entire system of hardware and software

•The goal is not just to assess performance but also to understand what affects the performance of a machine

•To improve the performance of software, understand how hardware affects system performance

How well a program uses instructions of the machine?

How well underlying HW implements instructions?

How well memory and I/O systems perform?

4

How to define performance?

Airplanes example

•Passenger capacity

•Cruising range (miles)

•Cruising speed (m.p.h)

•Passenger throughput (passengers * m.p.h)

Which airplane has the best performance?

•Highest cruising speed

•Longest range

•Largest capacity

•Speed: highest cruising speed or highest throughput

Run a program on two different workstations: which is faster?

•User: response time (execution time)

•Computer center manager: throughput

(how many tasks were performed during a time interval)

•Relationship between response time and throughput

5

Performance

Use response time or execution time. To maximize performance, minimize execution time for some task

$$\text{Performance} = \frac{1}{\text{Execution Time}}$$

What does it mean that Performance(X) is greater than Performance(Y)?

$$\text{Performance}(X) > \text{Performance}(Y)$$

$$\frac{1}{\text{Execution Time}(X)} > \frac{1}{\text{Execution Time}(Y)}$$

$$\text{Execution Time}(Y) > \text{Execution Time}(X)$$

X is n times faster than Y:

$$\frac{\text{Performance}(X)}{\text{Performance}(Y)} = \frac{\text{Execution Time}(Y)}{\text{Execution Time}(X)} = n$$

6

Performance Example

Machine A runs a program in 10 seconds and machine B runs the same program in 15 seconds. How much faster is A than B?

$$\frac{\text{Performance}(A)}{\text{Performance}(B)} = \frac{\text{Execution Time}(B)}{\text{Execution Time}(A)} = \frac{15}{10} = 1.5$$

A is 1.5 times faster than B
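The comparison above can be checked with a few lines of Python. This is only a minimal sketch; the function name `relative_performance` is an illustrative choice and does not come from the slides.

```python
def relative_performance(time_x: float, time_y: float) -> float:
    """Return n such that X is n times faster than Y.

    Performance is 1 / execution time, so
    Performance(X) / Performance(Y) = time_y / time_x.
    """
    return time_y / time_x


# Machine A takes 10 s and machine B takes 15 s (values from the example).
n = relative_performance(time_x=10, time_y=15)
print(f"A is {n} times faster than B")
```

Swapping the arguments gives a value below 1, which signals that the first machine is the slower one.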

7

Measuring Performance (1/4)

Time is the measure of computer performance (seconds per program)

•Response time or elapsed time

Total time to complete a task including everything (disk access, memory access, operating system overhead, …)

•CPU execution time (CPU time)

Time the CPU spends computing for this task; it does not include time spent waiting for I/O or running other programs (some computers are timeshared)

•CPU execution time can be divided into

User CPU time: CPU time spent in the program

System CPU time: CPU time spent in operating system performing tasks on behalf of the program

8


Example of user CPU time and System CPU time

Output of Unix time command

90.7u 12.9s 2:39 65%

User CPU time 90.7 sec

System CPU time 12.9 sec

Elapsed time 2:39 = ( 2 minutes and 39 sec) = 159 sec

% of elapsed time that is CPU time = (90.7 + 12.9)/159 = 65%

Then 100 – 65 = 35% of elapsed time was spent doing something else

(waiting for I/O, running other programs, …)
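The arithmetic behind the 65% figure can be reproduced as follows. This is a small sketch that only reuses the numbers from the sample output above; the helper name `cpu_fraction` is made up for illustration.

```python
def cpu_fraction(user_s: float, system_s: float, elapsed_s: float) -> float:
    """Fraction of elapsed time that was CPU time (user + system)."""
    return (user_s + system_s) / elapsed_s


# Values from the sample output: 90.7u 12.9s 2:39
elapsed = 2 * 60 + 39  # 2:39 is 159 seconds
fraction = cpu_fraction(90.7, 12.9, elapsed)

print(f"CPU time: {fraction:.0%} of elapsed time")
print(f"Other:    {1 - fraction:.0%} of elapsed time")
```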

9


Express CPU execution time in terms of another metric that relates to how fast the HW can perform basic functions

•Computers are governed by a clock that runs at a constant rate and determines when events happen in HW

•The length of a clock period is the clock cycle time (measured in nanoseconds (10^-9 sec) or picoseconds (10^-12 sec))

•Clock rate is 1/(clock cycle time) (measured in megahertz (MHz = 10^6 Hz) or gigahertz (GHz = 10^9 Hz))

•1 Hertz is 1 cycle/sec

CPU execution time = CPU clock cycles for a program * clock cycle time

CPU execution time = CPU clock cycles for a program / clock rate

How to improve CPU execution time?
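The two equivalent forms of the CPU execution time equation can be sketched in Python. The cycle count and clock rate below are hypothetical values chosen for illustration, not figures from the slides. The last lines show the two levers the equation exposes: fewer clock cycles or a higher clock rate.

```python
import math


def cpu_time_from_cycle_time(cycles: float, cycle_time_s: float) -> float:
    """CPU execution time = clock cycles * clock cycle time."""
    return cycles * cycle_time_s


def cpu_time_from_clock_rate(cycles: float, clock_rate_hz: float) -> float:
    """CPU execution time = clock cycles / clock rate."""
    return cycles / clock_rate_hz


# Hypothetical program: 2 billion clock cycles on a 500 MHz machine.
cycles = 2_000_000_000
clock_rate_hz = 500e6
cycle_time_s = 1 / clock_rate_hz  # 2 ns per cycle

t_cycle = cpu_time_from_cycle_time(cycles, cycle_time_s)
t_rate = cpu_time_from_clock_rate(cycles, clock_rate_hz)
print(t_cycle, t_rate)  # both forms give about 4 seconds

# Halving the cycle count or doubling the clock rate halves CPU time.
t_fewer_cycles = cpu_time_from_clock_rate(cycles / 2, clock_rate_hz)
t_faster_clock = cpu_time_from_clock_rate(cycles, 2 * clock_rate_hz)
print(t_fewer_cycles, t_faster_clock)
```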

10


Relating to Software

•Express CPU clock cycles in terms of program instructions

•CPU clock cycles = Instructions for a program * Average clock cycles per instruction

•Clock cycles per instruction (average number of cycles each instruction takes to execute) is abbreviated as CPI

•CPI can be used to compare two implementations of the same instruction set architecture (since instruction count for a program will remain the same)

11


CPU clock cycles = Instructions for a program * CPI

CPU time = CPU clock cycles * clock cycle time

CPU time = Instruction count * CPI * clock cycle time

CPU time = Instruction count * CPI/clock rate

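$$\text{Time} = \frac{\text{Seconds}}{\text{Program}} = \frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Clock cycles}}{\text{Instruction}} \times \frac{\text{Seconds}}{\text{Clock cycle}}$$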

Basic Performance Components

•CPU execution time

•Instruction count

•CPI

•Clock cycle time
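As a sketch of how these components combine, the snippet below applies CPU time = Instruction count * CPI / clock rate to two implementations of the same instruction set. The machines and all their numbers (instruction count, CPI, clock rates) are hypothetical and serve only to illustrate the formula.

```python
import math


def cpu_time(instruction_count: float, cpi: float, clock_rate_hz: float) -> float:
    """CPU time = Instruction count * CPI / clock rate."""
    return instruction_count * cpi / clock_rate_hz


# Same program and same ISA, so the instruction count is identical.
instruction_count = 1_000_000

# Hypothetical machine A: 1 GHz clock, CPI of 2.0
time_a = cpu_time(instruction_count, cpi=2.0, clock_rate_hz=1e9)
# Hypothetical machine B: 500 MHz clock, CPI of 1.2
time_b = cpu_time(instruction_count, cpi=1.2, clock_rate_hz=500e6)

ratio = time_b / time_a
print(f"A: {time_a:.4f} s, B: {time_b:.4f} s, A is {ratio:.1f} times faster")
```

The lower CPI of machine B does not make up for its slower clock, which is why neither CPI nor clock rate alone is enough to compare machines.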

12


How to determine values of performance components

•CPU execution time: measurement

•Clock cycle time: published as part of documentation for a machine

•Instruction count:

•Software tools to profile execution, or a simulator of the architecture

•Hardware counters, if available, to measure the number of instructions executed

•CPI: varies by application, as well as among implementations of the same instruction set. Obtained through a detailed simulation or by combining HW counters and simulation

CPI can be calculated if the number of instructions of each type and the clock cycle count of each type are known

13


$$\text{CPU clock cycles} = \sum_{i=1}^{n} (CPI_i \times C_i)$$

C_i: number of instructions of class i executed

CPI_i: average number of cycles per instruction for that instruction class

n: number of instruction classes

Overall program CPI depends on

•Number of cycles for each instruction type

•Frequency of each instruction type in the program execution
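The summation can be illustrated with a short Python sketch. The instruction classes, their counts, and their per-class CPI values below are hypothetical and are not taken from the slides.

```python
# Hypothetical instruction mix:
# class name -> (C_i: instructions executed, CPI_i: cycles per instruction)
mix = {
    "A": (2, 1),
    "B": (1, 2),
    "C": (2, 3),
}

# CPU clock cycles = sum over classes of CPI_i * C_i
total_cycles = sum(count * cpi for count, cpi in mix.values())

# Overall CPI = total clock cycles / total instruction count
total_instructions = sum(count for count, _ in mix.values())
overall_cpi = total_cycles / total_instructions

print(total_cycles, total_instructions, overall_cpi)
```

Changing only the counts (the frequency of each class) changes the overall CPI even though each CPI_i stays fixed.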

14


CPU clock cycles = Instructions for a program * CPI

$$\text{Performance} = \frac{1}{\text{Execution Time}}$$

CPU time = CPU clock cycles * clock cycle time

CPU time = CPU clock cycles for a program / clock rate

CPU time = Instruction count * CPI * clock cycle time

CPU time = Instruction count * CPI/clock rate

$$\text{CPU clock cycles} = \sum_{i=1}^{n} (CPI_i \times C_i)$$

15

Benchmarks (1/3)

Concept of Workload

•Informally, set of programs that the user runs day in and day out

Benchmarks

•Programs specifically chosen to measure performance

•Form a workload that the user hopes will predict the performance of the actual workload

•Best benchmark types are real programs

•Use of benchmarks whose performance depends on small code segments encourages optimizations in the architecture or compiler that target those segments

•A problem: Compilers with special-purpose optimizations targeted at specific benchmarks. Will such optimizations produce good or correct code with a real application?

16

Benchmarks (2/3)

COPYRIGHT 1998 MORGAN KAUFMANN PUBLISHERS, INC. ALL RIGHTS RESERVED

•Matrix 300 in SPEC suite in 1989

•SPEC is System Performance Evaluation Cooperative

•For matrix 300, the enhanced compiler improves performance by a factor of more than 9, although the improvement on the other benchmarks is much smaller.

•SPEC benchmark web site http://www.specbench.org


17

Benchmarks (3/3)

Why are real programs not used to measure performance?

•Small size of benchmark (easier compilation and simulation)

•Compilers might not be available for a new machine

•Numerous published performance results are available for small benchmarks

•Benchmarks are OK for the initial design phase, but a working computer system should be evaluated with a real program

Writing Performance reports

•Reproducibility

•Include everything needed to be able to duplicate the experiment

18

Comparing and Summarizing Performance (1/4)

Selected benchmark

Agreed to use response time or throughput

How to summarize performance of a group of benchmarks?

Program   M/C A   M/C B
P1            1      10
P2         1000     100
Total      1001     110

A is 10 times faster than B for P1

B is 10 times faster than A for P2

What is the relative performance of A & B?

Use Total Execution Time

$$\frac{\text{Performance}(B)}{\text{Performance}(A)} = \frac{\text{Execution Time}(A)}{\text{Execution Time}(B)} = \frac{1001}{110} = 9.1$$

19


•B is 9.1 times faster than A for P1 and P2 together

•This gives one figure as a summary of performance, directly proportional to execution time

•If the workload consists of running P1 and P2 an equal number of times, this statement would predict the relative execution times for the workload on each machine

•The average of execution times that is directly proportional to total execution time is the arithmetic mean (AM)

$$AM = \frac{1}{n} \sum_{i=1}^{n} \text{Time}(i)$$

Time(i): execution time for the ith program

n: total number of programs in the workload

A smaller mean means a smaller average execution time and thus improved performance
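A short sketch of the arithmetic mean applied to the two-program table for machines A and B. It uses only the execution times given in that table; the variable names are illustrative.

```python
import math
from statistics import mean

# Execution times of P1 and P2 on each machine (from the table).
times = {
    "A": [1, 1000],
    "B": [10, 100],
}

# Arithmetic mean of execution times per machine.
am = {machine: mean(t) for machine, t in times.items()}
print(am)

# The ratio of the means equals the ratio of the total execution times.
ratio = am["A"] / am["B"]
print(f"B is {ratio:.1f} times faster than A")
```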

20


•The arithmetic mean is proportional to execution time if the programs in the workload are each run an equal number of times. What happens if that is not the case?

•Assign a weighting factor w(i) to each program to indicate the frequency of the program in the workload

•Weighted arithmetic mean

$$\text{Weighted AM} = \sum_{i=1}^{n} w(i) \times \text{Time}(i)$$

•AM is a special case of weighted AM when all weights are equal

21


Program   M/C A   M/C B   M/C C
P1            1      10      20
P2         1000     100      20

Table shows runtimes of P1 and P2 on three machines A, B, and C

Workload consists of P1 and P2.

P1 is run 10 times as often as P2

Which machine is fastest for this workload, and by how much?
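One possible way to work this exercise is sketched below. It assumes the weights are the relative run frequencies (10 for P1, 1 for P2) and normalizes them by dividing by their sum; that normalization is an assumption of this sketch, as the slides do not state how the weights are scaled.

```python
import math


def weighted_mean(times, weights):
    """Weighted arithmetic mean, with weights normalized to sum to 1."""
    return sum(w * t for w, t in zip(weights, times)) / sum(weights)


# P1 is run 10 times as often as P2.
weights = [10, 1]

# Execution times of P1 and P2 on each machine (from the table).
times = {
    "A": [1, 1000],
    "B": [10, 100],
    "C": [20, 20],
}

wam = {machine: weighted_mean(t, weights) for machine, t in times.items()}
fastest = min(wam, key=wam.get)

for machine, value in wam.items():
    print(f"{machine}: weighted AM = {value:.2f}, "
          f"{value / wam[fastest]:.2f} times the fastest")
print("Fastest machine:", fastest)
```

Under this weighting machine B comes out fastest, narrowly ahead of C and well ahead of A.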

22

SPEC95 Benchmarks

•CPU benchmark

•Created by a set of computer companies in 1989

•SPEC95 (8 integer and 10 floating point programs). Figure 2.6

•SPEC95 web site (http://www.specbench.org/osg/cpu95/news/cpu95descr.html)

•SPEC ratio for xxx.benchmark = xxx.benchmark reference time / xxx.benchmark run time

Normalized measure. Higher results indicate faster performance

•Reference machine is a Sun SPARCstation 10/40

•SPECint95 or SPECfp95 summary measurement is obtained by taking the geometric mean of the SPEC ratios

$$\text{Geometric mean} = \sqrt[n]{\prod_{i=1}^{n} \text{SPECratio}(i)}$$

$\prod_{i=1}^{n} a_i$ is the product a_1 * a_2 * ... * a_n


23

SPEC95 Benchmark results for Pentium and Pentium Pro

•At same clock rate, Pentium Pro is 1.4 to 1.5 times faster

•When clock rate increased by a certain factor, processor performance increases by a lower factor

•Pentium clock rate goes from 100 to 200 MHz, but SPECint95 performance improves by a factor of only 1.7 (Why?)

24

SPEC95 Benchmark results for Pentium and Pentium Pro

•At same clock rate, Pentium Pro is 1.7 to 1.8 times faster

•Clock rate goes from 100 to 200 MHz, but SPECfp95 improves by a factor of only 1.4 (Why?)

•The memory system becomes a bottleneck as processor speed increases; the effect is more evident on the floating-point benchmarks because of their size.

25

Performance Summary Example (1/2)

Program   M/C A   M/C B
P1         1482     139
P2         2266     254
P3         6206     690

Which machine is faster according to total execution time? And by how much?

Total Execution Time (A) = 1482 + 2266 + 6206 = 9954

Total Execution Time (B) = 139 + 254 + 690 = 1083

Machine B is faster by a factor of 9954/1083 = 9.19

26

Performance Summary Example (2/2)

Program   M/C A   M/C B
P1         1482     139
P2         2266     254
P3         6206     690

Which machine is faster by the geometric mean measure?

Remember how SPEC reported performance?

•Normalize in reference to one machine

•Choose A as reference machine

•Obtain Execution time ratios (ET Ratio)

ET Ratio(P1) = ET(A)/ET(B) = 1482/139 = 10.66

ET Ratio (P2) = 2266/254 = 8.92

ET Ratio(P3) = 6206/690 = 8.99

Geometric Mean = (Ratio(P1) * Ratio(P2) * Ratio(P3))^(1/3)

Geometric Mean = 9.49

Machine B is 9.49 times faster than A according to geometric mean measure
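Both summaries for this example can be recomputed with a short Python sketch. It uses only the execution times from the table; `math.prod` requires Python 3.8 or later.

```python
import math

# Execution times of P1, P2, P3 on each machine (from the table).
times_a = [1482, 2266, 6206]
times_b = [139, 254, 690]

# Summary 1: ratio of total execution times.
total_ratio = sum(times_a) / sum(times_b)

# Summary 2: geometric mean of per-program execution time ratios,
# using machine A as the reference.
ratios = [a / b for a, b in zip(times_a, times_b)]
geo_mean = math.prod(ratios) ** (1 / len(ratios))

print(f"Total execution time ratio: {total_ratio:.2f}")
print("ET ratios:", [round(r, 2) for r in ratios])
print(f"Geometric mean of ratios:   {geo_mean:.2f}")
```

The two summaries give different figures for the same data because the geometric mean weights each program's ratio equally, regardless of how long the program runs.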

27

Amdahl’s Law (1/3)

Pitfall

Expecting the improvement of one aspect of a machine to increase performance by an amount proportional to the size of the improvement

Program runs in 100 sec on a machine

Multiply operations responsible for 80 sec of time

How much do we need to improve the speed of multiplication if program is to run 5 times faster?

Execution time after improvement =

(Execution time affected by improvement / Amount of improvement) + Execution time unaffected

For the program to run 5 times faster, the execution time after improvement must be 100/5 = 20 sec:

20 = 80/n + (100 - 80) = 80/n + 20

This requires 80/n = 0, so no n can be found to achieve the requested improvement

Make the common case fast
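The multiply example can be explored numerically. The sketch below evaluates the execution time after improvement for a few improvement factors n; the specific values of n are arbitrary choices for illustration.

```python
def time_after_improvement(affected_s: float, unaffected_s: float, n: float) -> float:
    """Execution time after improving the affected part by a factor of n."""
    return affected_s / n + unaffected_s


ORIGINAL_S = 100   # total run time from the example
MULTIPLY_S = 80    # time spent in multiply operations
OTHER_S = ORIGINAL_S - MULTIPLY_S

speedups = []
for n in (2, 10, 100, 1_000_000):
    t = time_after_improvement(MULTIPLY_S, OTHER_S, n)
    speedups.append(ORIGINAL_S / t)
    print(f"n = {n:>9}: time = {t:.4f} s, overall speedup = {ORIGINAL_S / t:.4f}")
```

The overall speedup approaches 5 as n grows but never reaches it, because the 20 sec that multiplication does not account for remain untouched.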

28

Amdahl’s Law (2/3)

Another form of Amdahl’s Law (to yield Speedup)

Speedup = Performance after improvement / Performance before improvement

Speedup = Execution time before/Execution time after improvement

Assume new hardware added to machine

f = fraction of the original execution time spent in operations that use the new hardware

s = speedup of those operations using new hardware

Execution time with new hardware is Tnew

Execution time without new hardware is Told

Tnew = f * Told/s + (1 - f) * Told

Overall speedup S = Told/Tnew = 1 / (f/s + (1 - f))

Speedup = s / (s - f * (s - 1))

Overall speedup for different values of f and s:

s     f = 0.1     f = 0.5     f = 0.9     f = 0.99
2     1.052632    1.333333    1.818182    1.980198
5     1.086957    1.666667    3.571429    4.807692
10    1.098901    1.818182    5.263158    9.174312
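The speedup values can be regenerated from the formula Speedup = s / (s - f * (s - 1)). A minimal sketch, with the function name `amdahl_speedup` chosen for illustration:

```python
def amdahl_speedup(f: float, s: float) -> float:
    """Overall speedup when a fraction f of the time is sped up by a factor s."""
    return s / (s - f * (s - 1))


fractions = (0.1, 0.5, 0.9, 0.99)

print("s     " + "  ".join(f"f={f:<8}" for f in fractions))
for s in (2, 5, 10):
    row = "  ".join(f"{amdahl_speedup(f, s):<10.6f}" for f in fractions)
    print(f"{s:<6}{row}")
```

Even with s = 10, the overall speedup stays close to 1 when f is small, which is the point of the pitfall.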

29

Amdahl’s Law (3/3)

Example of memory versus processor speedup

A = B op C

Assume memory access takes 4 cycles and a typical operation takes 2 cycles

Which of the following achieves the best increase in performance?

•Increase memory speed by 50%

•Double operation speed

First, calculate how many memory accesses are needed:

1 to get instruction from memory

2 to get B and C from memory

1 to store result (A) back in memory

Then we need a total of 4 memory access operations

Memory access time = 4 (accesses) * 4 (cycles/access) = 16 cycles

Operation time = 1 (operation) * 2 (cycles/operation) = 2 cycles

Total number of cycles = 16 + 2 = 18

Option 1 increase memory speed by 50%

s1 = 1.5 (how?)

f1 = memory access time/ total time

= 16/18 = 0.889

S1 = 1.42

Option 2 double operation speed

s2 = 2

f2 = operation time/total time

= 2/18 = 0.111

S2 = 1.059
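The two options can also be compared by counting cycles directly rather than through f and s. This sketch uses only the cycle counts from the example; it is a cross-check of S1 and S2, not a different method.

```python
MEMORY_ACCESSES = 4      # instruction fetch, B, C, store A
CYCLES_PER_ACCESS = 4
CYCLES_PER_OPERATION = 2

memory_cycles = MEMORY_ACCESSES * CYCLES_PER_ACCESS   # 16
operation_cycles = 1 * CYCLES_PER_OPERATION           # 2
baseline_cycles = memory_cycles + operation_cycles    # 18

# Option 1: memory is 1.5 times as fast, operation time unchanged.
option1_cycles = memory_cycles / 1.5 + operation_cycles
# Option 2: operation is 2 times as fast, memory time unchanged.
option2_cycles = memory_cycles + operation_cycles / 2

speedup1 = baseline_cycles / option1_cycles
speedup2 = baseline_cycles / option2_cycles

print(f"Option 1 (faster memory):    speedup = {speedup1:.3f}")
print(f"Option 2 (faster operation): speedup = {speedup2:.3f}")
```

Improving memory wins because memory access accounts for 16 of the 18 cycles.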

30

MIPS as a Performance Metric

MIPS is million instructions per second

MIPS = instruction count / (Execution time * 10^6)

Instruction execution rate (instruction/sec)

Faster machines have a higher MIPS rating

Problems with MIPS

•Does not take into account capabilities of instructions

(cannot compare computers with different ISAs)

•Varies between programs on the same computer

(a machine cannot have a single MIPS rating for all programs)

•Can vary inversely with performance
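The last problem can be illustrated with a small sketch. The clock rate, instruction counts, and CPI values below are hypothetical; they describe two imagined compiled versions of the same program on one machine.

```python
def cpu_time(instruction_count: float, cpi: float, clock_rate_hz: float) -> float:
    """CPU time = Instruction count * CPI / clock rate."""
    return instruction_count * cpi / clock_rate_hz


def mips(instruction_count: float, execution_time_s: float) -> float:
    """MIPS = instruction count / (execution time * 10^6)."""
    return instruction_count / (execution_time_s * 1e6)


CLOCK_RATE_HZ = 100e6  # hypothetical 100 MHz machine

# Hypothetical version 1: fewer instructions, but more cycles per instruction.
ic1, cpi1 = 5_000_000, 2.0
# Hypothetical version 2: more instructions, each one simpler.
ic2, cpi2 = 12_000_000, 1.25

time1 = cpu_time(ic1, cpi1, CLOCK_RATE_HZ)
time2 = cpu_time(ic2, cpi2, CLOCK_RATE_HZ)
mips1 = mips(ic1, time1)
mips2 = mips(ic2, time2)

print(f"Version 1: {time1:.2f} s, {mips1:.0f} MIPS")
print(f"Version 2: {time2:.2f} s, {mips2:.0f} MIPS")
```

Version 2 has the higher MIPS rating yet takes longer to run, so ranking by MIPS would pick the slower version.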
