Introduction and Performance Analysis

Computer Architecture

Performance

What do you mean by the performance of a computer?

Two important metrics:
• Response Time or Latency – Time taken for completion of a single job. Smaller is better.
• Throughput – Number of jobs done per unit of time. Larger is better.

Does one imply the other?
• Yes. E.g., if latency decreases, throughput will increase.
• No. E.g., in pipelining, latency may have to be increased to increase throughput!
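
The pipelining case can be sketched with a few lines of Python. All numbers here (an 8 ns unpipelined unit, 4 stages, 0.5 ns of register overhead per stage) are hypothetical values chosen only to illustrate the trade-off; they do not come from any real processor.

```python
# Hypothetical numbers, chosen only to illustrate the trade-off.
UNPIPELINED_NS = 8.0      # time for one job on the unpipelined unit
STAGES = 4                # assumed pipeline depth
LATCH_OVERHEAD_NS = 0.5   # assumed extra delay per stage (pipeline registers)


def unpipelined():
    """Return (latency in ns, throughput in jobs per ns) without pipelining."""
    latency = UNPIPELINED_NS
    throughput = 1.0 / latency
    return latency, throughput


def pipelined():
    """Return (latency in ns, steady-state throughput in jobs per ns)."""
    stage_time = UNPIPELINED_NS / STAGES + LATCH_OVERHEAD_NS
    latency = stage_time * STAGES      # a job still passes through every stage
    throughput = 1.0 / stage_time      # one job completes per stage time
    return latency, throughput


base_latency, base_throughput = unpipelined()
pipe_latency, pipe_throughput = pipelined()
print(f"unpipelined: latency {base_latency} ns, throughput {base_throughput} jobs/ns")
print(f"pipelined:   latency {pipe_latency} ns, throughput {pipe_throughput} jobs/ns")
```

Under these assumptions the latency of a single job grows from 8 ns to 10 ns, while the steady-state throughput rises from 0.125 to 0.4 jobs per ns.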

CPU Performance Equation

$$\text{CPU Time} = \text{Clock Cycles Needed} \times \text{Clock Cycle Time}$$

$$\text{CPU Time} = \frac{\text{No. of Instructions} \times \text{Clocks Per Instruction}}{\text{Clock Rate}}$$
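
As a rough check that the two forms of the CPU performance equation agree, here is a minimal Python sketch. The function names and the sample values (1 million instructions, CPI of 2, 1 GHz clock) are illustrative assumptions, not figures from the slides.

```python
def cpu_time_from_cycles(clock_cycles, clock_cycle_time_s):
    """CPU time = clock cycles needed * clock cycle time."""
    return clock_cycles * clock_cycle_time_s


def cpu_time_from_instructions(instruction_count, cpi, clock_rate_hz):
    """CPU time = instruction count * clocks per instruction / clock rate."""
    return instruction_count * cpi / clock_rate_hz


# Illustrative values only.
instruction_count = 1_000_000
cpi = 2.0
clock_rate_hz = 1e9

t_instr = cpu_time_from_instructions(instruction_count, cpi, clock_rate_hz)
t_cycles = cpu_time_from_cycles(instruction_count * cpi, 1.0 / clock_rate_hz)
print(t_instr, t_cycles)  # both are about 0.002 s
```

The two forms are linked by Clock Cycles Needed = No. of Instructions * CPI and Clock Cycle Time = 1 / Clock Rate.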

Is this Response Time or Throughput?

How can we Improve Performance ?

• No. of Instructions can be reduced by:
  – Better ISA
  – Better Compiler
  – Better Algorithm
• Clocks Per Instruction can be reduced by:
  – Better Hardware Design
  – Make the common case faster
• Clock Rate can be increased by:
  – Hardware Design

Numerical Assignment

A computer (3.06 GHz) has the following CPI:

Instruction Type   A   B   C
CPI                1   2   3

An algorithm may be implemented in 2 ways, I1 and I2. For each implementation, the number of instructions used (in millions) is as follows:

Instruction Type   A   B   C
I1                 0   2   2
I2                 2   2   1

1. Which implementation has fewer instructions?
2. What is the average CPI for both implementations? Which implementation is faster?
3. What is the total time taken for executing I1 and I2?
4. What can you say about the MIPS rating?

1. No. of Instructions for I1 = 4 M
   No. of Instructions for I2 = 5 M
   Hence I1 has fewer instructions.

2. Clocks required by I1 = 2*2 + 2*3 = 10 M
   Clocks required by I2 = 2*1 + 2*2 + 1*3 = 9 M
   Average CPI for I1 = 10/4 = 2.5
   Average CPI for I2 = 9/5 = 1.8
   I2 is faster as it requires fewer clock cycles, even though I1 requires fewer instructions.

3. Total Time for I1 = 10 M / 3.06 GHz = 3.27 ms
   Total Time for I2 = 9 M / 3.06 GHz = 2.94 ms
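
The arithmetic for parts 1 to 3 can be reproduced with a short Python sketch. The CPI values, instruction counts, and clock rate are taken from the assignment; the names `CPI`, `MIX_MILLIONS`, and `analyse` are only illustrative.

```python
CLOCK_RATE_HZ = 3.06e9

# Cycles per instruction for each instruction type (from the assignment).
CPI = {"A": 1, "B": 2, "C": 3}

# Instruction counts in millions for each implementation (from the assignment).
MIX_MILLIONS = {
    "I1": {"A": 0, "B": 2, "C": 2},
    "I2": {"A": 2, "B": 2, "C": 1},
}


def analyse(counts_millions):
    """Return (instructions, cycles, average CPI, time in ms) for one mix."""
    instructions = sum(counts_millions.values()) * 1e6
    cycles = sum(CPI[kind] * n for kind, n in counts_millions.items()) * 1e6
    average_cpi = cycles / instructions
    time_ms = cycles / CLOCK_RATE_HZ * 1e3
    return instructions, cycles, average_cpi, time_ms


results = {name: analyse(mix) for name, mix in MIX_MILLIONS.items()}
for name, (instructions, cycles, average_cpi, time_ms) in results.items():
    print(f"{name}: {instructions / 1e6:.0f} M instructions, "
          f"{cycles / 1e6:.0f} M cycles, CPI {average_cpi:.2f}, {time_ms:.2f} ms")
```

The average CPI is a weighted average: each instruction type contributes its CPI in proportion to how often it is executed.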

4. MIPS rating = Million Instructions Per Second. This can be calculated from:
• CPI and clock rate of the machine
  MIPS = (Clock Rate / CPI) * 10^-6
• Total execution time and instruction count
  MIPS = (Instruction Count / Total Execution Time) * 10^-6

MIPS rating for I1 = 1224 MIPS
MIPS rating for I2 = 1700 MIPS

The MIPS rating for I2 is greater than the MIPS rating for I1. This is as expected, since I2 has the shorter execution time.
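
A quick Python sketch confirming that the two ways of computing MIPS agree for this assignment. The function names are illustrative; the inputs (3.06 GHz, 4 M and 5 M instructions, 10 M and 9 M cycles) come from the worked solution.

```python
CLOCK_RATE_HZ = 3.06e9


def mips_from_cpi(clock_rate_hz, cpi):
    """MIPS = (clock rate / CPI) * 10^-6."""
    return clock_rate_hz / cpi * 1e-6


def mips_from_time(instruction_count, execution_time_s):
    """MIPS = (instruction count / execution time) * 10^-6."""
    return instruction_count / execution_time_s * 1e-6


# I1: 4 M instructions, 10 M cycles. I2: 5 M instructions, 9 M cycles.
i1_time_s = 10e6 / CLOCK_RATE_HZ
i2_time_s = 9e6 / CLOCK_RATE_HZ

i1_mips = (mips_from_cpi(CLOCK_RATE_HZ, 10 / 4), mips_from_time(4e6, i1_time_s))
i2_mips = (mips_from_cpi(CLOCK_RATE_HZ, 9 / 5), mips_from_time(5e6, i2_time_s))
print(i1_mips, i2_mips)  # both entries of each pair should match
```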

Probable Conclusions

1. Total Number of instructions is definitely not a good metric.

2. MIPS is a good metric.

Numerical Assignment

A computer (3.06 GHz) has the following CPI:

Instruction Type   A   B   C
CPI                5   2   3

An algorithm may be implemented in 2 ways, I1 and I2. For each implementation, the number of instructions used (in millions) is as follows:

Instruction Type   A   B   C
I1                 0   2   2
I2                 1   2   0

1. Which implementation has fewer instructions?
2. What is the average CPI for both implementations? Which implementation is faster?
3. What is the total time taken for executing I1 and I2?
4. What can you say about the MIPS rating?

1. No. of Instructions for I1 = 4 M
   No. of Instructions for I2 = 3 M
   Hence I1 uses more instructions than I2.

2. Clocks required by I1 = 2*2 + 2*3 = 10 M
   Clocks required by I2 = 1*5 + 2*2 = 9 M
   Average CPI for I1 = 10/4 = 2.5
   Average CPI for I2 = 9/3 = 3
   I2 is faster as it requires fewer clock cycles.

3. Total Time for I1 = 10 M / 3.06 GHz = 3.27 ms
   Total Time for I2 = 9 M / 3.06 GHz = 2.94 ms

4. MIPS rating for I1 = 1224 MIPS
   MIPS rating for I2 = 1020 MIPS

The MIPS rating for I1 is greater than the MIPS rating for I2. This is unexpected, since I2 has the shorter execution time.

Conclusion

MIPS is also not a good metric for overall system performance.
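
The disagreement between MIPS and execution time in this second assignment can be checked with a small Python sketch. The data comes from the assignment; the helper name `time_and_mips` is only illustrative.

```python
CLOCK_RATE_HZ = 3.06e9
CPI = {"A": 5, "B": 2, "C": 3}

# Instruction counts in millions for each implementation (second assignment).
MIX_MILLIONS = {
    "I1": {"A": 0, "B": 2, "C": 2},
    "I2": {"A": 1, "B": 2, "C": 0},
}


def time_and_mips(counts_millions):
    """Return (execution time in seconds, MIPS rating) for one mix."""
    instructions = sum(counts_millions.values()) * 1e6
    cycles = sum(CPI[kind] * n for kind, n in counts_millions.items()) * 1e6
    seconds = cycles / CLOCK_RATE_HZ
    return seconds, instructions / seconds / 1e6


results = {name: time_and_mips(mix) for name, mix in MIX_MILLIONS.items()}
fastest = min(results, key=lambda name: results[name][0])
highest_mips = max(results, key=lambda name: results[name][1])
print(f"shortest execution time: {fastest}, highest MIPS: {highest_mips}")
```

Here I2 finishes first but I1 has the higher MIPS rating, because I2 does the same work with fewer (but slower) instructions.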

Conclusion

Total execution time is always the better metric, as it sums up all factors. It cannot be replaced by considering any one of the following alone:

1. MIPS
2. Total number of instructions
3. Clock rate

Measuring Performance

Now that we know that performance depends on the program, which program(s) should be used to measure performance?

Benchmarks.

Benchmarks

• A set of programs specifically chosen for measuring performance.
• Types of Benchmarks
  – Real Programs
  – Kernel
    • Extract the key feature from a program
  – Component
  – Synthetic
    • Dhrystone – Integer and String Arithmetic
    • Whetstone – Floating Point
  – I/O
  – Parallel

Challenges

1. Vendors may tinker with benchmarks to make them run better on their platform. At times this is permitted.

2. Report a set of data rather than a single performance number.

3. Benchmarks concentrate only on computational power.

Popular Benchmarks

• SPEC – Standard Performance Evaluation Corporation
  – Floating point
  – Integer
  – Web
  – Graphics
• TPC – Transaction Processing Performance Council
  – Web Server
  – Transaction Processing
  – Decision Support Systems
• BAPCo – Business Applications Performance Corporation
  – Popular business applications
• EEMBC – Embedded Microprocessor Benchmark Consortium
  – Embedded Applications

Statistical Summarization of Data

For Response time metric

Arithmetic Mean

For Throughput metric

Harmonic Mean or Geometric Mean.

SPEC uses Geometric Mean
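
A minimal sketch of the three means using Python's standard `statistics` module (`geometric_mean` needs Python 3.8 or later). The execution times and performance ratios below are made-up sample data, not measurements from any benchmark suite.

```python
import statistics

# Hypothetical execution times (seconds) of three benchmark programs.
times_s = [2.0, 4.0, 8.0]

# Response time metric: arithmetic mean of the times.
mean_time = statistics.mean(times_s)

# Throughput metric: harmonic mean of the rates (jobs per second).
rates = [1.0 / t for t in times_s]
mean_rate = statistics.harmonic_mean(rates)

# Hypothetical performance ratios relative to a reference machine.
ratios = [1.0, 2.0, 4.0]
mean_ratio = statistics.geometric_mean(ratios)

print(f"arithmetic mean time: {mean_time:.3f} s")
print(f"harmonic mean rate:   {mean_rate:.4f} jobs/s")
print(f"geometric mean ratio: {mean_ratio:.3f}")
```

With this data the harmonic mean of the rates equals the reciprocal of the arithmetic mean of the times, which is why the harmonic mean is the natural counterpart for a rate-based metric.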

Are Benchmarks enough?

Benchmarks give the overall performance. If one wants to optimize performance, it may be necessary to know the instruction or section of the program where the most time is being spent.

Profilers do this job.

Profiling or Dynamic Program Analysis

• Program behavior is analysed as it is being run.
• Techniques used
  – Instruction Set Simulation
  – Hardware Interrupts
  – OS Hooks
  – Code Instrumentation
• Examples: Intel VTune, gprof
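
To give a feel for what a profiler reports, here is a small sketch using Python's standard `cProfile` and `pstats` modules as a stand-in; it is not VTune or gprof, and the two functions being profiled are toy examples invented for this illustration.

```python
import cProfile
import io
import pstats


def slow_sum(n):
    """Sum of squares 0^2 + ... + (n-1)^2 using an explicit loop."""
    total = 0
    for i in range(n):
        total += i * i
    return total


def fast_sum(n):
    """Same sum using the closed-form formula."""
    return (n - 1) * n * (2 * n - 1) // 6


def workload():
    return slow_sum(200_000), fast_sum(200_000)


# Profile the workload while it runs.
profiler = cProfile.Profile()
profiler.enable()
result = workload()
profiler.disable()

# Collect the report, sorted by cumulative time, into a string.
buffer = io.StringIO()
stats = pstats.Stats(profiler, stream=buffer).sort_stats("cumulative")
stats.print_stats(5)
report = buffer.getvalue()
print(report)
```

The report should show nearly all of the time under `slow_sum`, which is the kind of hot spot a benchmark score alone would not reveal.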

Simulation

• Building the actual system is difficult. Simulation is cost effective.
• Beneficial for learning/improving some aspect of architecture.
• Simulators available are:
  – Keil – Instruction Simulator
  – Little Man Computer Simulator – Simulator of a machine
  – Cacheprof – Cache Simulator

Moore’s Law (1965)

Moore's Law states that the number of transistors on a single chip at the same price will double every 18 to 24 months.

Implication?

As more transistors are added to a chip of the same area, their speed increases. Hence circuits become faster, i.e., the clock rate increases.

Moore’s Law, in combination with other factors such as ILP (Instruction Level Parallelism), was responsible for major improvements for a long time.

Trends in Computing (Intel Processors)

                                 Fastest processor reported in Text, 2003    Current fastest processor, 2008
Intel® Processor name            Pentium 4                                   Intel® Core™ i7-965 Processor Extreme Edition
Processor speed                  3.20 GHz                                    3.20 GHz
Processor primary level cache    12 KB + 8 KB                                4 x 32 KB
Processor secondary cache        512 KB                                      4 x 256 KB Level 2 cache
Processor third level cache      2 MB                                        Unified inclusive 8 MB L3

Observations

Processor Speed or Clock Rate has not changed!!!

Observations

What does the "4" in 4 x 32 KB and 4 x 256 KB stand for?

The Answer

Multi-Core Approach: the additional transistors are being used to pack more cores into a chip, rather than to increase clock speed.

Why?

1. Power Wall

2. Memory Wall

3. No more ILP.

Topics for further Study

• Papers
  – Performance papers
  – Memory Wall
• Software
  – Intel VTune or any other profiling tool
  – Little Man Computer Simulator or any other simulator apart from Keil

Amdahl’s Law

$$\text{Execution time after improvement} = \frac{\text{Execution time affected by improvement}}{\text{Amount of improvement}} + \text{Execution time unaffected by improvement}$$

What does this mean?

Even if we substantially increase the performance of any one component, it may not result in a substantial overall performance improvement.

A new architecture reduces the time taken by memory instructions by 50%. If memory instructions account for 50% of the total time taken, what is the overall improvement in performance?

T_old = 100, T_new = 25 + 50 = 75. Improvement = 25% reduction in execution time.
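
The arithmetic above (25 + 50 = 75) corresponds to an improvement factor of 2 on the affected half of the execution time. A minimal Python sketch of Amdahl's Law under that reading follows; the function name is illustrative.

```python
def time_after_improvement(affected_time, unaffected_time, improvement_factor):
    """Amdahl's Law: affected time / amount of improvement + unaffected time."""
    return affected_time / improvement_factor + unaffected_time


t_old = 100.0
# Memory instructions take 50 of the 100 time units and become 2x faster.
t_new = time_after_improvement(50.0, 50.0, 2.0)

reduction_percent = (t_old - t_new) / t_old * 100
overall_speedup = t_old / t_new

# Even an unlimited improvement of the affected part leaves the other half.
t_limit = time_after_improvement(50.0, 50.0, float("inf"))

print(f"T_new = {t_new}, reduction = {reduction_percent}%, "
      f"speedup = {overall_speedup:.2f}x, best possible T = {t_limit}")
```

The last line of the calculation shows the bound: no matter how much the memory instructions are improved, the total time in this example cannot drop below 50.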

What is better?

a. 20% increase in performance of instructions that account for 90% of the time.

b. 90% increase in performance of instructions that account for 20% of the time.
