Relative Performance! - University of Hong Kongelec3441/sp16/handouts/L02-perf-4up.pdf2nd Semester,...
Computer Architecture ELEC2401 & ELEC3441
2nd Semester, 2015-16 Dr. Hayden Kwok-Hay So
Department of Electrical and
Electronic Engineering
Computer Performance
2nd sem. '15-16 ENGG3441 - HS 2
How do you measure performance of a computer?
How do you make a computer fast?
Ways to measure Performance
■ Execution time (response time) ≠ throughput
  • Execution time: time to finish a task
  • Throughput: number of tasks finished per unit time
■ We will focus on execution time in this course
Relative Performance
■ Define the performance of a computer as

$$\text{Performance} = \frac{1}{\text{ExecutionTime}}$$

■ Computer B is n times faster than Computer A if:

$$n = \frac{\text{Performance}_B}{\text{Performance}_A} = \frac{\text{ExecutionTime}_A}{\text{ExecutionTime}_B}$$
Quick Check
■ Computer A finishes a task in 5 s; Computer B finishes the same task in 4 s. Which one is faster, and by how much?

$$\frac{\text{Performance}_B}{\text{Performance}_A} = \frac{\text{ExecutionTime}_A}{\text{ExecutionTime}_B} = \frac{5}{4} = 1.25$$

Computer B is 1.25 times faster than Computer A
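The quick check can be reproduced with a minimal Python sketch. The function name `relative_performance` is made up for illustration; it simply applies the ratio of execution times defined on the slide.

```python
def relative_performance(time_a: float, time_b: float) -> float:
    """Return n such that B is n times faster than A (ExecutionTime_A / ExecutionTime_B)."""
    if time_a <= 0 or time_b <= 0:
        raise ValueError("execution times must be positive")
    return time_a / time_b


# Values from the quick check: A takes 5 s, B takes 4 s.
n = relative_performance(5.0, 4.0)
print(f"Computer B is {n:.2f} times faster than Computer A")
```

A result below 1 would mean that B is the slower machine.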
Ways to Measure Execution Time
■ Wall Clock Time (Elapsed Time)
  • The total time a user experiences that a computer takes to finish a task
  • Includes OS overhead, I/O, idle time, time shared with other users
■ CPU Time
  • The time spent on a user task in the CPU
  • User CPU + OS CPU time
  • Does not include I/O, time spent by other users, etc.
■ Focus on CPU Time in this course
    $ time shasum afile
    132ecc0e19eec19d5dc775752efeac280cecebdc  afile

    real    0m20.177s
    user    0m12.835s
    sys     0m1.786s
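In the output above, CPU time is user + sys = 12.835 s + 1.786 s = 14.621 s, while the wall clock time is 20.177 s. The same distinction can be observed from inside a program. The sketch below uses Python's standard `time.perf_counter` (wall clock) and `time.process_time` (user + system CPU time of the current process); the helper names `measure` and `busy_then_idle` are made up for illustration.

```python
import time


def measure(task):
    """Run task() once and return (wall_seconds, cpu_seconds) for this process."""
    wall_start = time.perf_counter()
    cpu_start = time.process_time()
    task()
    cpu = time.process_time() - cpu_start
    wall = time.perf_counter() - wall_start
    return wall, cpu


def busy_then_idle():
    sum(i * i for i in range(200_000))  # CPU-bound work: counts toward both times
    time.sleep(0.2)                     # idle wait: counts toward wall time only


wall, cpu = measure(busy_then_idle)
print(f"wall {wall:.3f} s, cpu {cpu:.3f} s")
```

The sleep stands in for I/O or idle time, so the reported wall time should exceed the CPU time by roughly 0.2 s.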
How can we determine the CPU time needed to execute a program?

$$\text{CPUTime} = \frac{\#\text{ of instructions}}{\text{program}} \times \frac{\#\text{ of cycles}}{\text{instruction}} \times \frac{\text{time}}{\text{cycle}}$$

The Iron Law
CPU Time – Step 1
■ Most modern CPUs are synchronous digital systems
■ The time needed to finish executing a task is the number of cycles needed for that task multiplied by the cycle time.

$$\text{CPUTime} = \text{CycleCount} \times \text{CycleTime} = \frac{\text{CycleCount}}{\text{ClockFrequency}}$$
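As a rough illustration of this relation, the sketch below computes CPU time for one cycle count at several clock frequencies. The workload of 6 billion cycles and the function name `cpu_time` are assumptions chosen for the example, and it assumes the cycle count stays the same on every machine.

```python
def cpu_time(cycle_count: int, clock_hz: float) -> float:
    """CPUTime = CycleCount / ClockFrequency, in seconds."""
    if clock_hz <= 0:
        raise ValueError("clock frequency must be positive")
    return cycle_count / clock_hz


# Hypothetical task needing 6 billion cycles, run at three clock rates.
CYCLES = 6_000_000_000
for name, hz in [("3 GHz", 3e9), ("1 GHz", 1e9), ("50 MHz", 50e6)]:
    print(f"{name}: {cpu_time(CYCLES, hz):.1f} s")
```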
Digital system design review…
Synchronous Sequential Circuits
■ A synchronous sequential circuit contains exactly 1 clock signal
■ All state elements are connected to the same clock signal
  • → the state of the entire circuit is updated at the same time
■ Common form of synchronous sequential circuits:

[Figure: chain of combinational logic (Comb Logic) blocks separated by registers clocked by clk, running from input to output.]
Clock Signal
■ A clock signal is a particularly important signal in a synchronous sequential circuit
  • It controls the action of all DFFs
■ A clock signal toggles between ‘0’ and ‘1’ periodically
■ The frequency of the toggling determines the maximum speed of the circuit
  • E.g.: in the accumulator example earlier, the output S cannot change faster than the clock frequency

[Figure: timing diagram of the accumulator. Input X takes the values x0, x1, x2 on successive clock cycles; output S takes 0, x0, x0 + x1, x0 + x1 + x2; one clock period is marked on clk.]

$$\frac{1}{\text{clock period}} = \text{clock frequency}$$

E.g.: an Intel CPU runs at 3 GHz, mobile phone processors at 1 GHz, the lab FPGA board at 50 MHz
Timing in Synchronous Circuits
■ In a synchronous sequential circuit, signal changes occur only at clock edges
■ All signals are therefore synchronized to change values right after a clock edge
■ In the example circuit, we need to make sure the correct value of y is available BEFORE the next clock edge
  • Avoid glitches

[Figure: example circuit with signals a, b, c, d and output y, with registers clocked by clk.]
Timing in Synchronous Circuits
■ In general, the propagation delay through the combinational logic between any two registers must be shorter than the clock period
■ The longest such path is called the critical path of the circuit
■ The critical path determines the maximum clock speed

[Figure: combinational logic between registers clocked by clk, with signals a, b, x, y; annotations: "From glitch example" and "Stable before clock edge".]
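A minimal sketch of how the critical path bounds the clock frequency is given below. The path names and delays are hypothetical, and the model is deliberately simplified: it ignores register setup time, clock-to-output delay and clock skew, which a real timing analysis would add to each path.

```python
def max_clock_hz(path_delays_ns):
    """Upper bound on clock frequency: the period must cover the longest path."""
    critical_path_ns = max(path_delays_ns)
    return 1e9 / critical_path_ns  # 1 / period, with the period given in ns


# Hypothetical register-to-register combinational delays, in nanoseconds.
delays_ns = {"a -> x": 2.0, "b -> x": 3.5, "b -> y": 5.0}

critical = max(delays_ns, key=delays_ns.get)
f_max = max_clock_hz(delays_ns.values())
print(f"critical path {critical}: f_max = {f_max / 1e6:.0f} MHz")
```

Shortening any path other than the critical one leaves the bound unchanged, which is why the longest path alone sets the maximum clock speed.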
Attention Span
■ The period of time during which you continue to be interested in something
■ Two 50-min lectures + 10-min break
  • Remind me if I forget!!!

[Figure: percentage of people paying attention (0 to 100) versus time (0 to 120 mins).]
CPU Time – Step 1 – Summary
$$\text{CPUTime} = \text{CycleCount} \times \text{CycleTime} = \frac{\text{CycleCount}}{\text{ClockFrequency}}$$

■ To improve performance:
  1. Increase clock frequency
  2. Reduce cycle count
■ Increase clock freq → shorter critical path → less work accomplished in 1 cycle → more cycles needed
  • Engineers need to trade off between the two
How many cycles does it take to finish a program?
CPU Time – Step 2 – Cycles Per Instruction (CPI)
■ Program A has 2000 instructions, and each instruction takes 2 cycles to finish. How many cycles does it take to complete Program A?
■ Program B has 3000 instructions: 2000 of them take 2 cycles and 1000 of them take 1 cycle. How many cycles does the program take to finish?

$$\text{CycleCount} = \text{InstructionCount} \times \text{CyclesPerInstruction}$$
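The two questions above can be worked through with a short sketch. The helper `cycle_count` and its list-of-pairs input format are assumptions made for this example; the instruction counts come from the slide.

```python
def cycle_count(mix):
    """mix: list of (instruction_count, cycles_per_instruction) pairs."""
    return sum(count * cycles for count, cycles in mix)


program_a = [(2000, 2)]             # 2000 instructions at 2 cycles each
program_b = [(2000, 2), (1000, 1)]  # 2000 at 2 cycles, 1000 at 1 cycle

print(cycle_count(program_a))  # 4000
print(cycle_count(program_b))  # 5000
```

Program B executes 50% more instructions than Program A but needs only 25% more cycles, because a third of its instructions are cheaper.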
Average CPI
■ In general, different machine instructions may take different amounts of time to complete.
■ Assuming n classes of instructions, the total number of clock cycles is:

$$\text{ClockCycles} = \sum_{i=1}^{n} \text{CPI}_i \times \text{InstructionCount}_i$$

■ Weighted average CPI:

$$\text{CPI} = \frac{\text{CycleCount}}{\text{InstructionCount}} = \sum_{i=1}^{n} \text{CPI}_i \times \frac{\text{InstructionCount}_i}{\text{InstructionCount}}$$
CPI Example (1)
Class        C1    C2    C3
Cycles        1     4     8
Compiler J  100    50   100

■ The ISA of computer A includes 3 classes of instructions that take different numbers of cycles to complete. A program P is compiled using compiler J, resulting in the instruction mix above.
■ What is the average CPI of the compiled program?
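One way to check the answer is to apply the weighted-average CPI formula directly. The sketch below is illustrative only; the function name `average_cpi` is made up, and the data are the per-class cycles and Compiler J instruction counts from the table.

```python
def average_cpi(cycles_per_class, instr_per_class):
    """Weighted average CPI = total cycles / total instructions."""
    total_cycles = sum(c * n for c, n in zip(cycles_per_class, instr_per_class))
    total_instr = sum(instr_per_class)
    return total_cycles / total_instr


cycles = [1, 4, 8]           # cycles for classes C1, C2, C3
compiler_j = [100, 50, 100]  # instruction counts produced by compiler J

print(average_cpi(cycles, compiler_j))  # 1100 cycles / 250 instructions = 4.4
```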
CPI Example (2)
Class        C1    C2    C3
Cycles        1     4     8
Compiler J  100    50   100
Compiler K  350   100    50

■ A newer compiler K was developed to compile the same program P, resulting in the instruction mix above.
■ What is the average CPI of the compiled program using compiler K?

Ans: 2.3

Which compiler was better…?
CPI Example (3)
Class        C1    C2    C3   #instr  #cycle   CPI
Cycles        1     4     8
Compiler J  100    50   100      250    1100   4.4
Compiler K  350   100    50      500    1150   2.3

■ Observation:
  • Compiler J results in higher CPI
  • Compiler K uses more instructions
■ But most importantly:

Compiler J uses fewer cycles → shorter run time → better
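To quantify the difference, assume both binaries run on the same machine, so the cycle time is identical and cancels out:

$$\frac{\text{CPUTime}_K}{\text{CPUTime}_J} = \frac{1150 \times \text{CycleTime}}{1100 \times \text{CycleTime}} = \frac{1150}{1100} \approx 1.045$$

Under that assumption the code from compiler J runs about 1.045 times faster, even though its average CPI is nearly twice as high. Neither CPI nor instruction count alone decides run time; their product does.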
Number of Instructions
How many instructions are there in the following code?

    a = 0
    b = a + 1
    c = a + b
    b = c + b

If CPI = 1, how many cycles does it take to complete?

# of instr: 4
# of cycles: 4
Number of Instructions
How many instructions are there in the following code?

          i = 0
    loop: a = a + 1
          i = i + 1
          if i < 10 goto loop

If CPI = 1, how many cycles does it take to complete?

# of STATIC instructions: 4
# of DYNAMIC instructions: 1 + 3 * 10 = 31
# of cycles: 31
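The dynamic count can be confirmed by stepping through the loop and counting each instruction as it executes. This is only a sketch: the function name is made up, and the initial value of `a` is an assumption because the slide does not initialize it.

```python
def count_dynamic_instructions(limit: int = 10) -> int:
    """Execute the slide's loop and count every instruction that runs."""
    executed = 0
    a = 0              # assumed initial value; the slide leaves a uninitialized
    i = 0
    executed += 1      # i = 0
    while True:
        a = a + 1
        executed += 1  # loop: a = a + 1
        i = i + 1
        executed += 1  # i = i + 1
        executed += 1  # if i < limit goto loop (the branch itself executes)
        if i >= limit:
            break
    return executed


print(count_dynamic_instructions())  # 1 + 3 * 10 = 31
```

The static count stays at 4 no matter what the loop bound is; only the dynamic count grows with it.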
Number of Instructions
To compute: r = a×b

How many instructions are there in the following code?

    r = 0;
    for (i = b; i > 0; i = i - 1)
        r = r + a;

# of DYNAMIC instructions: ≈ b
# of cycles: ≈ b

    r = a * b;

# of instructions: 1
# of cycles: 1 (?)
Instruction Count & CPI
■ The number of instructions in a program depends on
  • Nature of the application
  • Compiler techniques
  • Types of instructions available in the ISA
■ Average cycles per instruction depends on
  • CPU microarchitecture
  • ISA (CISC vs RISC)
  • The current running state of the CPU
■ Different instructions may have different CPI
  • Average CPI is affected by the instruction mix
Combining All – The Iron Law
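$$\text{CPUTime} = \frac{\#\text{ of instructions}}{\text{program}} \times \frac{\#\text{ of cycles}}{\text{instruction}} \times \frac{\text{time}}{\text{cycle}}$$

Factors affecting each term:
■ # of instructions / program: Algorithm, Language, Compiler, ISA
■ # of cycles / instruction: Language, Compiler, ISA, Microarchitecture
■ time / cycle: ISA, Hardware design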
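The three terms of the Iron Law multiply, so a gain in one term can be cancelled by a loss in another. The sketch below illustrates this with made-up numbers; the function name `iron_law_cpu_time` and both program variants are hypothetical.

```python
def iron_law_cpu_time(instr_count: int, cpi: float, cycle_time_s: float) -> float:
    """CPUTime = (instructions/program) * (cycles/instruction) * (time/cycle)."""
    return instr_count * cpi * cycle_time_s


CYCLE_TIME = 1e-9  # assumed 1 GHz clock, i.e. 1 ns per cycle

# Hypothetical variant 1: more instructions, each one cheap.
simple = iron_law_cpu_time(1_000_000, 2.0, CYCLE_TIME)
# Hypothetical variant 2: half the instructions, each twice as expensive.
complex_ = iron_law_cpu_time(500_000, 4.0, CYCLE_TIME)

print(f"{simple * 1e3:.1f} ms vs {complex_ * 1e3:.1f} ms")
```

Both variants take the same time here, which is why instruction count, CPI and cycle time have to be considered together.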
CISC vs RISC
■ CISC: Complex Instruction Set Computer
■ RISC: Reduced Instruction Set Computer
■ CISC and RISC are two different computer design strategies:
  • CISC: VAX, x86
  • RISC: PA-RISC, SPARC, MIPS, RISC-V, Alpha, ARM
CISC
■ ISA includes complex instructions
  • E.g. VAX has a POLY instruction that evaluates a polynomial in hardware
■ Includes complex addressing modes
  • Mem-reg; mem-mem; indirect; relative; double-indirect; ...
■ Hardware implements complex instructions using multiple clock cycles
  • Microcode
■ One promise of a CISC ISA is that it allows shorter compiled code and makes compilers easier to write.
  • Still relevant today in embedded systems
■ Drawbacks:
  • Less attractive as compiler techniques improve
  • Complex hardware → slow
RISC
■ ISA specifies simple instructions
  • Mostly register-register transfers
  • Simple addressing modes
■ Simpler hardware design
  • Allows hardware optimization
  • Faster hardware overall
  • Allows easy pipelining
■ Simple ISA allows compiler optimization
■ Generated code is generally longer
■ Most (if not all) ISAs introduced after the 1980s are RISC
RISC vs CISC – Iron Law
$$\text{CPUTime} = \frac{\#\text{ of instructions}}{\text{program}} \times \frac{\#\text{ of cycles}}{\text{instruction}} \times \frac{\text{time}}{\text{cycle}}$$

Microarchitecture                   CPI   Cycle Time
CISC                                >1    short
RISC – single cycle, unpipelined    1     long
RISC – pipelined                    1     short
Amdahl’s Law Review
■ Describes the overall speedup of a system due to a speed improvement that applies to a portion of the system.
■ Let P be the portion of the system that can be sped up by a factor of S, where $0 \le P \le 1$
■ Amdahl’s Law states that the overall speedup is:

$$\text{speedup} = \frac{1}{(1 - P) + \frac{P}{S}}$$

■ E.g.: P = 50%, S = 100 → speedup = 1.98x
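A minimal sketch of the formula, reproducing the example with P = 50% and S = 100, is shown below. The function name `amdahl_speedup` is chosen for illustration.

```python
def amdahl_speedup(p: float, s: float) -> float:
    """Overall speedup when a fraction p of the work is sped up by a factor s."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be between 0 and 1")
    if s <= 0:
        raise ValueError("s must be positive")
    return 1.0 / ((1.0 - p) + p / s)


print(f"{amdahl_speedup(0.5, 100):.2f}x")  # 1.98x
```

Even with a 100-fold improvement on half of the work, the untouched half keeps the overall speedup just under 2.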
Amdahl’s Law Example
Class      C1    C2    C3
Cycles      1     4     8
# instr   200   130    60
# cycles  200   520   480

■ Q1: A new implementation of C3 reduces its execution length by half to 4 cycles. How much improvement in performance can be achieved?

$$P = \frac{480}{200 + 520 + 480} = 0.4, \quad S = 2$$

$$\Rightarrow \text{speedup} = \frac{1}{(1 - 0.4) + 0.4 / 2} = 1.25$$
Amdahl’s Law Example
Class      C1    C2    C3
Cycles      1     4     8
# instr   200   130    60
# cycles  200   520   480

■ Q2: Which instruction class, when its cycle count is reduced by half, will result in the most performance improvement?
  • Largest CPI?
  • Most used?
  • Most cycles used?
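Q2 can be explored by applying Amdahl's Law to each class in turn, with P taken as that class's share of the total cycles and S = 2. The sketch below uses the cycle totals from the table; the helper names are made up for this example.

```python
def amdahl_speedup(p: float, s: float) -> float:
    """Overall speedup when a fraction p of the work is sped up by a factor s."""
    return 1.0 / ((1.0 - p) + p / s)


cycles_used = {"C1": 200, "C2": 520, "C3": 480}  # "# cycles" row of the table
total = sum(cycles_used.values())                # 1200 cycles in all

# Halving the cycle count of one class corresponds to S = 2 for that class.
speedups = {cls: amdahl_speedup(c / total, 2) for cls, c in cycles_used.items()}
best = max(speedups, key=speedups.get)

for cls, sp in speedups.items():
    print(f"{cls}: {sp:.3f}x")
print("best:", best)
```

With these numbers the winner is C2, the class that consumes the most cycles. It beats both C1 (the most frequently used class) and C3 (the class with the largest CPI).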
Amdahl’s Law Implications
■ In most applications, only a portion of the computation can be sped up
  • Improved hardware designs
  • Parallelization
■ Amdahl’s Law → max speedup is limited by P
  • If only a small portion of the program can be sped up, then it doesn’t matter how large S is

Can we get to a speedup of 10 with P=0.9?
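One way to approach the question is to take the limit of Amdahl's Law as the improvement factor S grows without bound:

$$\lim_{S \to \infty} \frac{1}{(1 - P) + \frac{P}{S}} = \frac{1}{1 - P} = \frac{1}{1 - 0.9} = 10$$

A speedup of 10 is therefore the upper bound for P = 0.9. It is approached but never reached for any finite S. For example, S = 100 gives

$$\frac{1}{0.1 + 0.9 / 100} = \frac{1}{0.109} \approx 9.17$$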
Benchmark Programs
■ A benchmark suite is a set of programs used to compare processor performance
■ Needs to be representative of a typical workload
■ Kernel vs whole application
  • Recall Amdahl’s Law
■ Avoid over-optimization for a specific benchmark
■ SPEC benchmark
  • Several benchmark suites commonly used in computer architecture research
  • E.g. SPEC CPU2006
SPEC CPU Benchmark
■ Programs used to measure performance
  • Supposedly typical of actual workload
■ Standard Performance Evaluation Corp (SPEC)
  • Develops benchmarks for CPU, I/O, Web, …
■ SPEC CPU2006
  • Elapsed time to execute a selection of programs
  • Negligible I/O, so focuses on CPU performance
  • Normalize relative to reference machine
  • Summarize as geometric mean of performance ratios
  • CINT2006 (integer) and CFP2006 (floating-point)

Geometric mean of the n performance ratios:

$$\sqrt[n]{\prod_{i=1}^{n} \text{Execution time ratio}_i}$$
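The geometric mean can be sketched in a few lines of Python. The ratios below are invented for illustration and are not SPEC results; the function name `geometric_mean` is also an assumption. The product is computed through logarithms, which avoids overflow when there are many ratios.

```python
import math


def geometric_mean(ratios):
    """n-th root of the product of n positive ratios, computed via logarithms."""
    ratios = list(ratios)
    if not ratios or any(r <= 0 for r in ratios):
        raise ValueError("need at least one ratio, all positive")
    return math.exp(sum(math.log(r) for r in ratios) / len(ratios))


# Hypothetical execution time ratios (reference time / measured time).
ratios = [2.0, 8.0, 4.0]
print(f"{geometric_mean(ratios):.2f}")  # cube root of 64 = 4.00
```

Unlike the arithmetic mean, the geometric mean of ratios ranks machines the same way regardless of which machine is chosen as the reference.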
CINT2006 for Intel Core i7 920
Matrix-Matrix Multiplication
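$$
\begin{bmatrix}
a_{0,0} & \cdots & a_{0,N-1} \\
\vdots & \ddots & \vdots \\
a_{N-1,0} & \cdots & a_{N-1,N-1}
\end{bmatrix}
\times
\begin{bmatrix}
b_{0,0} & \cdots & b_{0,N-1} \\
\vdots & \ddots & \vdots \\
b_{N-1,0} & \cdots & b_{N-1,N-1}
\end{bmatrix}
=
\begin{bmatrix}
 & \vdots & \\
\cdots & \sum_{k=0}^{N-1} a_{i,k} b_{k,j} & \cdots \\
 & \vdots &
\end{bmatrix}
$$

    r[i][j] = 0;
    for (k = 0; k < N; k++)
        r[i][j] += a[i][k] * b[k][j];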
Matrix-Matrix Multiplication
■ If all instructions have CPI=1, then time to complete is ~N^3 cycles.
■ What are the factors that will make this run faster/slower?

Each element of the result is:

$$r_{i,j} = \sum_{k=0}^{N-1} a_{i,k} b_{k,j}$$

    for (i = 0; i < N; i++) {
        for (j = 0; j < N; j++) {
            r[i][j] = 0;
            for (k = 0; k < N; k++)
                r[i][j] += a[i][k] * b[k][j];
        }
    }

Total number of instructions: N^3 [×, +, assignment]
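The N^3 estimate can be checked by instrumenting the triple loop. The sketch below is a plain Python translation with a counter added. It counts only the inner multiply-add statements, not loop overhead, and the function name is made up for the example.

```python
def matmul_counted(a, b):
    """Multiply two N x N matrices; also return the number of multiply-adds."""
    n = len(a)
    r = [[0] * n for _ in range(n)]
    multiply_adds = 0
    for i in range(n):
        for j in range(n):
            for k in range(n):
                r[i][j] += a[i][k] * b[k][j]
                multiply_adds += 1
    return r, multiply_adds


a = [[1, 2], [3, 4]]
b = [[5, 6], [7, 8]]
result, ops = matmul_counted(a, b)
print(result, ops)  # [[19, 22], [43, 50]] 8
```

For N = 2 the inner statement runs 2^3 = 8 times. Doubling N multiplies the work by 8, which is why the cycle estimate grows as N^3.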
And in conclusion…
■ The study of computer architecture allows us to construct better computer systems
  • Performance, power
■ Computer architecture is a study that crosses software and hardware
■ We will use RISC-V as the main ISA for class work, but the design principles are applicable to other computer designs
■ The “Iron Law” determines the performance of a CPU
■ ISA, microarchitecture, compilers, and hardware technology all play a role in determining CPU performance
Acknowledgements
■ These slides contain material developed and copyrighted by:
  • Arvind (MIT)
  • Krste Asanovic (MIT/UCB)
  • Joel Emer (Intel/MIT)
  • James Hoe (CMU)
  • John Kubiatowicz (UCB)
  • David Patterson (UCB)
■ MIT material derived from course 6.823
■ UCB material derived from course CS152, CS252