1 CS465 Performance Revisited (Chapter 1) Be able to compare performance of simple system...

30
1 CS465 Performance Revisited (Chapter 1) o compare performance of simple system configuratio stand the performance implications of architectural

Transcript of 1 CS465 Performance Revisited (Chapter 1) Be able to compare performance of simple system...

Page 1: 1 CS465 Performance Revisited (Chapter 1) Be able to compare performance of simple system configurations and understand the performance implications of.

1

CS465

PerformanceRevisited

(Chapter 1)

Be able to compare performance of simple system configurations and understand the performance implications of architectural choices

Page 2: 1 CS465 Performance Revisited (Chapter 1) Be able to compare performance of simple system configurations and understand the performance implications of.

2

Performance and Cost: Purchasing vs Design Views

• Our goal is to understand cost & performance implications of architectural choices. Consider 2 views:– Purchasing perspective: given 4 machines, which measure yields

the best decision? • best performance • least cost• best performance / cost

– Design perspective: select the design that yields • best performance• least cost • best performance / cost

• Both require– basis for comparison– metric for evaluation

Page 3: 1 CS465 Performance Revisited (Chapter 1) Be able to compare performance of simple system configurations and understand the performance implications of.

3

• Measure, Report, and Summarize• Make intelligent choices• See through the marketing hype• Key to understanding underlying organizational motivation

Why is some hardware better than others for different programs?

What factors of system performance are hardware related?(e.g., Do we need a new machine, or a new operating

system?)

How does the machine's instruction set affect performance?

Performance

Page 4: 1 CS465 Performance Revisited (Chapter 1) Be able to compare performance of simple system configurations and understand the performance implications of.

4

Performance

• Which of these determines performance?– # of cycles to execute program?– # of instructions in program?– # of cycles per second?– average # of cycles per instruction?– average # of instructions per second?

• Common pitfall: thinking one of the variables is indicative of performance when it really isn’t.

• Performance is determined by execution time

1998 Morgan Kaufmann Publishers

Page 5: 1 CS465 Performance Revisited (Chapter 1) Be able to compare performance of simple system configurations and understand the performance implications of.

5

Two notions of “performance”

°

Which is the best measure of performance?Passenger capacity, range, speed, throughput, travel time

Which has higher performance?

Speed: Concorde is fastest; Range: Douglas DC-8-50 is longest

Airplane Passengers Range (mi) Speed (mph)

Boeing 777 375 4630 610Boeing 747 470 4150 610BAC/Sud Concorde 132 4000 1350Douglas DC-8-50 146 8720 544

Page 6: 1 CS465 Performance Revisited (Chapter 1) Be able to compare performance of simple system configurations and understand the performance implications of.

6

Two notions of “Performance” - 2

• Throughput (passenger-milesperhour): Boeing 747 is highest• Cost of operation vs Cost of operation per passenger-miles per

hour?

Plane

Boeing 747

BAD/Sud Concodre

Speed

610 mph

1350 mph

DC to Paris

6.5 hours

3 hours

Passengers

470

132

Throughput (pmph)

286,700

178,200

CPU:

Time to do the task – execution time, response time, latency

Tasks per day, hour, week, sec, ns.. (Capacity) – throughput, bandwidth

Response time and throughput are often in opposition

Page 7: 1 CS465 Performance Revisited (Chapter 1) Be able to compare performance of simple system configurations and understand the performance implications of.

7

How to Measure Performance

• Two approaches– User perspective: Response time / Execution time– Computer Center Manager perspective: Throughput

based on number of jobs completed

• How quickly is each job completed vs Total Amount of work done

• We focus on execution time for a single job

Page 8: 1 CS465 Performance Revisited (Chapter 1) Be able to compare performance of simple system configurations and understand the performance implications of.

8

Measures

• Performance is inversely proportional to the execution time.

• Performancex = 1/execution timex

– Execution time decreases by 4 implies that performance has increased by 4.

• “x is n times faster than y” means that • Performancex / Performancey = n

Page 9: 1 CS465 Performance Revisited (Chapter 1) Be able to compare performance of simple system configurations and understand the performance implications of.

9

Measuring Execution Time

• Elapsed Time– includes everything (disk and memory accesses,

I/O, etc.)– a useful number, but often not good for CPU

assessment

• CPU time– doesn't include I/O or time spent running other

programs– CPU time = system time + user time

• Our focus: user CPU time – time spent executing the lines of code that are "in"

our program

Page 10: 1 CS465 Performance Revisited (Chapter 1) Be able to compare performance of simple system configurations and understand the performance implications of.

10

Clock Cycles

• Instead of reporting execution time in seconds, we often use cycles

• Clock “ticks” indicate when to start activities (one abstraction):

• cycle time = time between ticks = seconds per cycle• clock rate (frequency) = cycles per second (1 Hz. = 1 cycle/sec)

A 200 Mhz. clock has a cycle time

time

seconds

program

cycles

program

seconds

cycle

1

200 106 109 5 nanoseconds

Morgan Kaufmann Publishers

Page 11: 1 CS465 Performance Revisited (Chapter 1) Be able to compare performance of simple system configurations and understand the performance implications of.

11

Performance can be enhanced by either:

________ the # of required cycles for a program, or

________ the clock cycle time or,

________ the clock rate (inverse of clock cycle time).

How to Improve Performance?seconds

program

cycles

program

seconds

cycle

Page 12: 1 CS465 Performance Revisited (Chapter 1) Be able to compare performance of simple system configurations and understand the performance implications of.

12

• Is it safe to assume that # of cycles = # of instructions?

This assumption is incorrect. Different instructions take different amounts of time.For example: add, lw in MIPS.

Why do some instructions take more time than others? Remember – these are not lines of C code – these are machineinstructions.

time

1st

inst

ruct

ion

2nd

inst

ruct

ion

3rd

inst

ruct

ion

4th

5th

6th ...

How many cycles are required for a program?

Page 13: 1 CS465 Performance Revisited (Chapter 1) Be able to compare performance of simple system configurations and understand the performance implications of.

13

• Multiplication/division takes more time than addition

• Floating point operations take longer than integer ones

• Accessing memory takes more time than accessing registers

• Cycles to execute an instruction can be different for different machines.

• In the same family of computers – cycles to execute an instruction can be different

• Important point: changing the cycle time often changes the number of cycles

required for various instructions (more later)

time

Different numbers of cycles for different instructions

Page 14: 1 CS465 Performance Revisited (Chapter 1) Be able to compare performance of simple system configurations and understand the performance implications of.

14

• Our favorite program runs in 10 seconds on computer A, which has a 400 Mhz. clock. We are trying to help a computer designer build a new machine B, that will run this program in 6 seconds. The designer can use new (or perhaps more expensive) technology to substantially increase the clock rate, but has informed us that this increase will affect the rest of the CPU design, causing machine B to require 1.2 times as many clock cycles as machine A for the same program. What clock rate should we tell the designer to target?"

Don't Panic, can easily work this out from basic principles

Example

Morgan Kaufmann Publishers

Page 15: 1 CS465 Performance Revisited (Chapter 1) Be able to compare performance of simple system configurations and understand the performance implications of.

15

ExampleClock CyclesA = CPU timeA * Clock Rate

= 10 s * 400 * 106 c/s = 4 *109

1.2 * Clock CyclesA 1.2 * 4 *109 cyclesClock RateB = ------------------------- = --------------------------

CPU timeB 6 seconds

Clock RateB = 800 cycles per second = 800 MHz

Execution time from 10 s to 6s Clock from 400 MHz to 800 MHz.

Page 16: 1 CS465 Performance Revisited (Chapter 1) Be able to compare performance of simple system configurations and understand the performance implications of.

16

Factors Affecting Computer Performance

CPU time = Seconds = Instructions x Cycles x Seconds

Program Program Instruction Cycle

CPU time = Seconds = Instructions x Cycles x Seconds

Program Program Instruction Cycle

Instr Count CPI Clock PeriodProgram xCompiler x xInstruction Set x x ?Organization x xTechnology x

Page 17: 1 CS465 Performance Revisited (Chapter 1) Be able to compare performance of simple system configurations and understand the performance implications of.

17

CPI

CPU time = ClockPeriod * CPIi * Ci

ClockPeriod = ClockCycleTime

i = 1

n

CPI = CPI * F where Fi = Ci

i = 1

n

i i

Instruction Count

"instruction frequency"

Invest Resources where time is Spent!

CPI = Clock Cycles to execute program / Instruction Count

= (CPU Time * Clock Rate) / Instruction Count

“Average cycles per instruction”

CPU time = Seconds = Instructions x Cycles x Seconds

Program Program Instruction Cycle

CPU time = Seconds = Instructions x Cycles x Seconds

Program Program Instruction Cycle

CPIi = CPI for instr class iCi = Count of instr class i instructions executed

Page 18: 1 CS465 Performance Revisited (Chapter 1) Be able to compare performance of simple system configurations and understand the performance implications of.

18

CPI Example

• For a program,Machine A has a clock cycle time of 10 ns. and a CPI of 2.0 Machine B has a clock cycle time of 20 ns. and a CPI of 1.2

• For machine ACPU time = IC CPI Clock cycle time

CPU time = IC 2.0 10 ns = 20 IC ns

• For machine BCPU time = IC 1.2 20 ns = 24 IC ns

Page 19: 1 CS465 Performance Revisited (Chapter 1) Be able to compare performance of simple system configurations and understand the performance implications of.

19

A compiler designer is trying to decide between two code sequences for a particular machine. Based on the hardware implementation, there are three different classes of instructions: Class A, Class B, and Class C, and they require one, two, and three cycles (respectively).

The first code sequence has 5 instructions: 2 of A, 1 of B, and 2 of CThe second sequence has 6 instructions: 4 of A, 1 of B, and 1 of C.

Which sequence will be faster? How much?What is the CPI for each sequence?

# of Instructions Example

Page 20: 1 CS465 Performance Revisited (Chapter 1) Be able to compare performance of simple system configurations and understand the performance implications of.

20

• A compiler designer is trying to decide between two code sequences for a particular machine. Based on the hardware implementation, there are three different classes of instructions: Class A, Class B, and Class C, and they require one, two, and three cycles (respectively). The first code sequence has 5 instructions: 2 of A, 1 of B, and 2 of CThe second sequence has 6 instructions: 4 of A, 1 of B, and 1 of C.Which sequence will be faster? How much?What is the CPI for each sequence?

# of Instructions Example

9 6 9/6 2 sequencefor cycles CPU

10 510/5 1 sequencefor cycles CPU

6

9

1)1(4

312114 2 sequencefor CPI

5

10

2)1(2

322112 1 sequencefor CPI

Page 21: 1 CS465 Performance Revisited (Chapter 1) Be able to compare performance of simple system configurations and understand the performance implications of.

21

Example – Instruction Mix / CPI Baseline mixInstruction Frequency Cycles per Cycles for % CPUClass (I ) % instr. - CPI(I) class time

ALU 50 1 0.5 23Load 20 5 1 45Store 10 3 0.3 14

Branch 20 2 0.4 182.2

High speed data cache, reduces load to 2 cyclesInstruction Frequency Cycles per Cycles for % CPU

Class % instr. - CPI(I) class timeALU 50 1 0.5 31Load 20 2 0.4 25Store 10 3 0.3 19

Branch 20 2 0.4 251.6

Branch prediction reduces branch to 1 cycleInstruction Frequency Cycles per Cycles for % CPU

Class % instr. - CPI(I) class timeALU 50 1 0.5 25Load 20 5 1 50Store 10 3 0.3 15

Branch 20 1 0.2 102

Two ALU instructions per cycleInstruction Frequency Cycles per Cycles for % CPUClass % instr. - CPI(I) class timeALU 50 0.5 0.25 13Load 20 5 1 51Store 10 3 0.3 15Branch 20 2 0.4 21

1.95

High speed cache reduces load to 2 cycles

What consumes the most CPU time?

Branch prediction reduces branch to 1 cycle

Two ALU instructions per cycle

Page 22: 1 CS465 Performance Revisited (Chapter 1) Be able to compare performance of simple system configurations and understand the performance implications of.

22

Amdahl's Law

Compute Task = Component that can be parallelized + Component that is serial

Speedup due to enhancement E (impacts parallelizable part): ExTime w/o E Performance with ESpeedup(E) = -------------------- = ------------------------- ExTime with E Performance w/o E

Suppose that enhancement E accelerates a fraction F of the taskby a factor S and the remainder of the task is unaffected then what is the speedup?

ExTime(without E) = ((1-F) + F) X ExTime(without E)

ExTime(with E) = ((1-F) + F/S) X ExTime(without E)

Speedup(with E) = 1 (1-F) + F/S

F F/S

Page 23: 1 CS465 Performance Revisited (Chapter 1) Be able to compare performance of simple system configurations and understand the performance implications of.

23

Summary: Instruction set design (MIPS)• Use general purpose registers with a load-store architecture: YES

• Provide at least 16 general purpose registers plus separate floating-point registers: 31 GPR & 32 FPR

• Support basic addressing modes: displacement (with an address offset size of 12 to 16 bits), immediate (size 8 to 16 bits), and register deferred; : YES: 16 bits for immediate, displacement (disp=0 => register deferred)

• All addressing modes apply to all data transfer instructions : YES • Use fixed instruction encoding if interested in performance and use

variable instruction encoding if interested in code size : Fixed • Support these data sizes and types: 8-bit, 16-bit, 32-bit integers and 32-

bit and 64-bit IEEE 754 floating point numbers: YES• Support these simple instructions, since they will dominate the number

of instructions executed: load, store, add, subtract, move register-register, and, shift, compare equal, compare not equal, branch (with a PC-relative address at least 8-bits long), jump, call, and return: YES

• Aim for a minimalist instruction set: YES

Page 24: 1 CS465 Performance Revisited (Chapter 1) Be able to compare performance of simple system configurations and understand the performance implications of.

24

How to Evaluate Instruction Sets?

Metric we use : Time to execute the program

NOTE: this depends on instructions set, processor organization, and compilation techniques.

CPI

Instruction Count

Cycle Time

Page 25: 1 CS465 Performance Revisited (Chapter 1) Be able to compare performance of simple system configurations and understand the performance implications of.

25

Some Popular Performance Measures

MIPS = Millions of Instructions per second– Easy to understand; faster machines have higher

MIPS

Problems– Different computers have different instruction sets.

How does one compare MIPS across platforms– MIPS based on specific programs– MIPS depends on compilers

MIPS are not a function of the CPU alone

BENCHMARKS – BASIS FOR COMPARISON

Page 26: 1 CS465 Performance Revisited (Chapter 1) Be able to compare performance of simple system configurations and understand the performance implications of.

26

MFLOPS

• Millions of floating point operations per second

• Problems– Different machines have different set of

floating point operations– Programs require a varying mix of floating

point operations.

Page 27: 1 CS465 Performance Revisited (Chapter 1) Be able to compare performance of simple system configurations and understand the performance implications of.

27

How to evaluate?

• Target workload– Depends on user environment– Depends on “current” usage pattern

• Standard workloads (benchmarks)– Should not be narrow – manufacturers design to

benchmark – design appropriate instruction sets!!!• SPEC 2000

– A mix of tasks– CINT2000 – integer– CFP2000 – floating point

• SPECweb 99– Focus on webserver thruput

Page 28: 1 CS465 Performance Revisited (Chapter 1) Be able to compare performance of simple system configurations and understand the performance implications of.

28

SPEC ratings for Pentium 3 and Pentium 4

Fig 4.6 in 3rd Edition

Page 29: 1 CS465 Performance Revisited (Chapter 1) Be able to compare performance of simple system configurations and understand the performance implications of.

29

Relative Performance of 3 Intel processors

Fig 4.8 in 3rd Edition

Page 30: 1 CS465 Performance Revisited (Chapter 1) Be able to compare performance of simple system configurations and understand the performance implications of.

30

Relative Energy Efficiency

Fig 4.9 in 3rd Edition