More Unix Utilities ( find, diff/cmp, at/crontab, tr ) CS465.
1 CS465 Performance Revisited (Chapter 1) Be able to compare performance of simple system...
-
Upload
damon-francis -
Category
Documents
-
view
212 -
download
0
Transcript of 1 CS465 Performance Revisited (Chapter 1) Be able to compare performance of simple system...
1
CS465
PerformanceRevisited
(Chapter 1)
Be able to compare performance of simple system configurations and understand the performance implications of architectural choices
2
Performance and Cost: Purchasing vs Design Views
• Our goal is to understand cost & performance implications of architectural choices. Consider 2 views:– Purchasing perspective: given 4 machines, which measure yields
the best decision? • best performance • least cost• best performance / cost
– Design perspective: select the design that yields • best performance• least cost • best performance / cost
• Both require– basis for comparison– metric for evaluation
3
• Measure, Report, and Summarize• Make intelligent choices• See through the marketing hype• Key to understanding underlying organizational motivation
Why is some hardware better than others for different programs?
What factors of system performance are hardware related?(e.g., Do we need a new machine, or a new operating
system?)
How does the machine's instruction set affect performance?
Performance
4
Performance
• Which of these determines performance?– # of cycles to execute program?– # of instructions in program?– # of cycles per second?– average # of cycles per instruction?– average # of instructions per second?
• Common pitfall: thinking one of the variables is indicative of performance when it really isn’t.
• Performance is determined by execution time
1998 Morgan Kaufmann Publishers
5
Two notions of “performance”
°
Which is the best measure of performance?Passenger capacity, range, speed, throughput, travel time
Which has higher performance?
Speed: Concorde is fastest; Range: Douglas DC-8-50 is longest
Airplane Passengers Range (mi) Speed (mph)
Boeing 777 375 4630 610Boeing 747 470 4150 610BAC/Sud Concorde 132 4000 1350Douglas DC-8-50 146 8720 544
6
Two notions of “Performance” - 2
• Throughput (passenger-milesperhour): Boeing 747 is highest• Cost of operation vs Cost of operation per passenger-miles per
hour?
Plane
Boeing 747
BAD/Sud Concodre
Speed
610 mph
1350 mph
DC to Paris
6.5 hours
3 hours
Passengers
470
132
Throughput (pmph)
286,700
178,200
CPU:
Time to do the task – execution time, response time, latency
Tasks per day, hour, week, sec, ns.. (Capacity) – throughput, bandwidth
Response time and throughput are often in opposition
7
How to Measure Performance
• Two approaches– User perspective: Response time / Execution time– Computer Center Manager perspective: Throughput
based on number of jobs completed
• How quickly is each job completed vs Total Amount of work done
• We focus on execution time for a single job
8
Measures
• Performance is inversely proportional to the execution time.
• Performancex = 1/execution timex
– Execution time decreases by 4 implies that performance has increased by 4.
• “x is n times faster than y” means that • Performancex / Performancey = n
9
Measuring Execution Time
• Elapsed Time– includes everything (disk and memory accesses,
I/O, etc.)– a useful number, but often not good for CPU
assessment
• CPU time– doesn't include I/O or time spent running other
programs– CPU time = system time + user time
• Our focus: user CPU time – time spent executing the lines of code that are "in"
our program
10
Clock Cycles
• Instead of reporting execution time in seconds, we often use cycles
• Clock “ticks” indicate when to start activities (one abstraction):
• cycle time = time between ticks = seconds per cycle• clock rate (frequency) = cycles per second (1 Hz. = 1 cycle/sec)
A 200 Mhz. clock has a cycle time
time
seconds
program
cycles
program
seconds
cycle
1
200 106 109 5 nanoseconds
Morgan Kaufmann Publishers
11
Performance can be enhanced by either:
________ the # of required cycles for a program, or
________ the clock cycle time or,
________ the clock rate (inverse of clock cycle time).
How to Improve Performance?seconds
program
cycles
program
seconds
cycle
12
• Is it safe to assume that # of cycles = # of instructions?
This assumption is incorrect. Different instructions take different amounts of time.For example: add, lw in MIPS.
Why do some instructions take more time than others? Remember – these are not lines of C code – these are machineinstructions.
time
1st
inst
ruct
ion
2nd
inst
ruct
ion
3rd
inst
ruct
ion
4th
5th
6th ...
How many cycles are required for a program?
13
• Multiplication/division takes more time than addition
• Floating point operations take longer than integer ones
• Accessing memory takes more time than accessing registers
• Cycles to execute an instruction can be different for different machines.
• In the same family of computers – cycles to execute an instruction can be different
• Important point: changing the cycle time often changes the number of cycles
required for various instructions (more later)
time
Different numbers of cycles for different instructions
14
• Our favorite program runs in 10 seconds on computer A, which has a 400 Mhz. clock. We are trying to help a computer designer build a new machine B, that will run this program in 6 seconds. The designer can use new (or perhaps more expensive) technology to substantially increase the clock rate, but has informed us that this increase will affect the rest of the CPU design, causing machine B to require 1.2 times as many clock cycles as machine A for the same program. What clock rate should we tell the designer to target?"
Don't Panic, can easily work this out from basic principles
Example
Morgan Kaufmann Publishers
15
ExampleClock CyclesA = CPU timeA * Clock Rate
= 10 s * 400 * 106 c/s = 4 *109
1.2 * Clock CyclesA 1.2 * 4 *109 cyclesClock RateB = ------------------------- = --------------------------
CPU timeB 6 seconds
Clock RateB = 800 cycles per second = 800 MHz
Execution time from 10 s to 6s Clock from 400 MHz to 800 MHz.
16
Factors Affecting Computer Performance
CPU time = Seconds = Instructions x Cycles x Seconds
Program Program Instruction Cycle
CPU time = Seconds = Instructions x Cycles x Seconds
Program Program Instruction Cycle
Instr Count CPI Clock PeriodProgram xCompiler x xInstruction Set x x ?Organization x xTechnology x
17
CPI
CPU time = ClockPeriod * CPIi * Ci
ClockPeriod = ClockCycleTime
i = 1
n
CPI = CPI * F where Fi = Ci
i = 1
n
i i
Instruction Count
"instruction frequency"
Invest Resources where time is Spent!
CPI = Clock Cycles to execute program / Instruction Count
= (CPU Time * Clock Rate) / Instruction Count
“Average cycles per instruction”
CPU time = Seconds = Instructions x Cycles x Seconds
Program Program Instruction Cycle
CPU time = Seconds = Instructions x Cycles x Seconds
Program Program Instruction Cycle
CPIi = CPI for instr class iCi = Count of instr class i instructions executed
18
CPI Example
• For a program,Machine A has a clock cycle time of 10 ns. and a CPI of 2.0 Machine B has a clock cycle time of 20 ns. and a CPI of 1.2
• For machine ACPU time = IC CPI Clock cycle time
CPU time = IC 2.0 10 ns = 20 IC ns
• For machine BCPU time = IC 1.2 20 ns = 24 IC ns
19
A compiler designer is trying to decide between two code sequences for a particular machine. Based on the hardware implementation, there are three different classes of instructions: Class A, Class B, and Class C, and they require one, two, and three cycles (respectively).
The first code sequence has 5 instructions: 2 of A, 1 of B, and 2 of CThe second sequence has 6 instructions: 4 of A, 1 of B, and 1 of C.
Which sequence will be faster? How much?What is the CPI for each sequence?
# of Instructions Example
20
• A compiler designer is trying to decide between two code sequences for a particular machine. Based on the hardware implementation, there are three different classes of instructions: Class A, Class B, and Class C, and they require one, two, and three cycles (respectively). The first code sequence has 5 instructions: 2 of A, 1 of B, and 2 of CThe second sequence has 6 instructions: 4 of A, 1 of B, and 1 of C.Which sequence will be faster? How much?What is the CPI for each sequence?
# of Instructions Example
9 6 9/6 2 sequencefor cycles CPU
10 510/5 1 sequencefor cycles CPU
6
9
1)1(4
312114 2 sequencefor CPI
5
10
2)1(2
322112 1 sequencefor CPI
21
Example – Instruction Mix / CPI Baseline mixInstruction Frequency Cycles per Cycles for % CPUClass (I ) % instr. - CPI(I) class time
ALU 50 1 0.5 23Load 20 5 1 45Store 10 3 0.3 14
Branch 20 2 0.4 182.2
High speed data cache, reduces load to 2 cyclesInstruction Frequency Cycles per Cycles for % CPU
Class % instr. - CPI(I) class timeALU 50 1 0.5 31Load 20 2 0.4 25Store 10 3 0.3 19
Branch 20 2 0.4 251.6
Branch prediction reduces branch to 1 cycleInstruction Frequency Cycles per Cycles for % CPU
Class % instr. - CPI(I) class timeALU 50 1 0.5 25Load 20 5 1 50Store 10 3 0.3 15
Branch 20 1 0.2 102
Two ALU instructions per cycleInstruction Frequency Cycles per Cycles for % CPUClass % instr. - CPI(I) class timeALU 50 0.5 0.25 13Load 20 5 1 51Store 10 3 0.3 15Branch 20 2 0.4 21
1.95
High speed cache reduces load to 2 cycles
What consumes the most CPU time?
Branch prediction reduces branch to 1 cycle
Two ALU instructions per cycle
22
Amdahl's Law
Compute Task = Component that can be parallelized + Component that is serial
Speedup due to enhancement E (impacts parallelizable part): ExTime w/o E Performance with ESpeedup(E) = -------------------- = ------------------------- ExTime with E Performance w/o E
Suppose that enhancement E accelerates a fraction F of the taskby a factor S and the remainder of the task is unaffected then what is the speedup?
ExTime(without E) = ((1-F) + F) X ExTime(without E)
ExTime(with E) = ((1-F) + F/S) X ExTime(without E)
Speedup(with E) = 1 (1-F) + F/S
F F/S
23
Summary: Instruction set design (MIPS)• Use general purpose registers with a load-store architecture: YES
• Provide at least 16 general purpose registers plus separate floating-point registers: 31 GPR & 32 FPR
• Support basic addressing modes: displacement (with an address offset size of 12 to 16 bits), immediate (size 8 to 16 bits), and register deferred; : YES: 16 bits for immediate, displacement (disp=0 => register deferred)
• All addressing modes apply to all data transfer instructions : YES • Use fixed instruction encoding if interested in performance and use
variable instruction encoding if interested in code size : Fixed • Support these data sizes and types: 8-bit, 16-bit, 32-bit integers and 32-
bit and 64-bit IEEE 754 floating point numbers: YES• Support these simple instructions, since they will dominate the number
of instructions executed: load, store, add, subtract, move register-register, and, shift, compare equal, compare not equal, branch (with a PC-relative address at least 8-bits long), jump, call, and return: YES
• Aim for a minimalist instruction set: YES
24
How to Evaluate Instruction Sets?
Metric we use : Time to execute the program
NOTE: this depends on instructions set, processor organization, and compilation techniques.
CPI
Instruction Count
Cycle Time
25
Some Popular Performance Measures
MIPS = Millions of Instructions per second– Easy to understand; faster machines have higher
MIPS
Problems– Different computers have different instruction sets.
How does one compare MIPS across platforms– MIPS based on specific programs– MIPS depends on compilers
MIPS are not a function of the CPU alone
BENCHMARKS – BASIS FOR COMPARISON
26
MFLOPS
• Millions of floating point operations per second
• Problems– Different machines have different set of
floating point operations– Programs require a varying mix of floating
point operations.
27
How to evaluate?
• Target workload– Depends on user environment– Depends on “current” usage pattern
• Standard workloads (benchmarks)– Should not be narrow – manufacturers design to
benchmark – design appropriate instruction sets!!!• SPEC 2000
– A mix of tasks– CINT2000 – integer– CFP2000 – floating point
• SPECweb 99– Focus on webserver thruput
28
SPEC ratings for Pentium 3 and Pentium 4
Fig 4.6 in 3rd Edition
29
Relative Performance of 3 Intel processors
Fig 4.8 in 3rd Edition
30
Relative Energy Efficiency
Fig 4.9 in 3rd Edition