Introduction and Performance Analysis
Uploaded by drsaagrawal
Computer Architecture
Performance
What do we mean by the performance of a computer? Two important metrics:
• Response Time or Latency – Time taken to complete a single job. Smaller is better.
• Throughput – Number of jobs done per unit of time. Larger is better.
Does one imply the other?
• Yes. E.g., if latency decreases, throughput will increase.
• No. E.g., in pipelining, latency may have to be increased to increase throughput!
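The pipelining case can be illustrated with a small sketch. The stage delays and the latch overhead below are made-up values chosen only for illustration; they do not come from the slides, and the pipeline is assumed to be ideal (no stalls).

```python
# Hypothetical 5-stage datapath; all delays are in nanoseconds and are
# illustrative assumptions, not measurements.
STAGE_DELAYS_NS = [2.0, 1.5, 2.5, 2.0, 1.0]
LATCH_OVERHEAD_NS = 0.5  # assumed extra delay of each pipeline register


def unpipelined(stage_delays):
    """Return (latency_ns, jobs_per_ns) when one job runs at a time."""
    latency = sum(stage_delays)
    return latency, 1.0 / latency


def pipelined(stage_delays, latch_overhead):
    """Return (latency_ns, jobs_per_ns) for an ideal pipeline with no stalls.

    The clock period is set by the slowest stage plus the latch overhead,
    and every job passes through all stages at that period.
    """
    period = max(stage_delays) + latch_overhead
    latency = period * len(stage_delays)
    return latency, 1.0 / period


lat_u, thr_u = unpipelined(STAGE_DELAYS_NS)
lat_p, thr_p = pipelined(STAGE_DELAYS_NS, LATCH_OVERHEAD_NS)
print(f"unpipelined: latency {lat_u:.1f} ns, throughput {thr_u:.3f} jobs/ns")
print(f"pipelined:   latency {lat_p:.1f} ns, throughput {thr_p:.3f} jobs/ns")
```

With these assumed numbers the latency of a single job grows from 9 ns to 15 ns, while throughput rises from one job every 9 ns to one job every 3 ns.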
CPU Performance Equation
$$\text{CPU Time} = \text{Clock Cycles Needed} \times \text{Clock Cycle Time}$$
$$\text{CPU Time} = \frac{\text{No. of Instructions} \times \text{Clocks Per Instruction}}{\text{Clock Rate}}$$
Does this measure response time or throughput?
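As a quick check that the two forms of the CPU performance equation agree, here is a minimal sketch. The function names and the sample program (1 million instructions, CPI of 2, 1 GHz clock) are hypothetical and chosen only for illustration.

```python
def cpu_time_from_cycles(clock_cycles, cycle_time_s):
    """CPU time = clock cycles needed * clock cycle time."""
    return clock_cycles * cycle_time_s


def cpu_time_from_cpi(instruction_count, cpi, clock_rate_hz):
    """CPU time = (no. of instructions * clocks per instruction) / clock rate."""
    return instruction_count * cpi / clock_rate_hz


# Hypothetical program: 1 million instructions, CPI of 2, 1 GHz clock.
instructions = 1_000_000
cpi = 2.0
clock_rate_hz = 1e9

t1 = cpu_time_from_cycles(instructions * cpi, 1.0 / clock_rate_hz)
t2 = cpu_time_from_cpi(instructions, cpi, clock_rate_hz)
print(t1, t2)  # both forms give the same time, about 2 ms
```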
How can we Improve Performance ?
• The number of instructions can be reduced by:
  – Better ISA
  – Better compiler
  – Better algorithm
• Clocks Per Instruction can be reduced by:
  – Better hardware design
  – Making the common case faster
• Clock Rate can be increased by:
  – Hardware design
Numerical Assignment
A computer (3.06 GHz) has the following CPI:

| Instruction Type | A | B | C |
|---|---|---|---|
| CPI | 1 | 2 | 3 |

An algorithm may be implemented in two ways, I1 and I2. For each implementation, the number of instructions used (in millions) is as follows:

| Instruction Type | A | B | C |
|---|---|---|---|
| I1 | 0 | 2 | 2 |
| I2 | 2 | 2 | 1 |

1. Which implementation has fewer instructions?
2. What is the average CPI for each implementation? Which implementation is faster?
3. What is the total time taken to execute I1 and I2?
4. What can you say about the MIPS rating?
1. No. of instructions for I1 = 0 + 2 + 2 = 4 M
   No. of instructions for I2 = 2 + 2 + 1 = 5 M
   Hence I1 has fewer instructions.
2. Clocks required by I1 = 2*2 + 2*3 = 10 M
   Clocks required by I2 = 2*1 + 2*2 + 1*3 = 9 M
   Average CPI for I1 = 10/4 = 2.5
   Average CPI for I2 = 9/5 = 1.8
   I2 is faster, as it requires fewer clock cycles, even though I1 uses fewer instructions.
3. Total time for I1 = 10 M / 3.06 GHz = 3.27 ms
   Total time for I2 = 9 M / 3.06 GHz = 2.94 ms
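The arithmetic in this solution can be reproduced with a short script. The dictionary layout and the helper name `summarize` are illustrative choices; the CPI values, instruction counts and clock rate are the ones given in the assignment.

```python
CLOCK_RATE_HZ = 3.06e9
CPI = {"A": 1, "B": 2, "C": 3}          # cycles per instruction, by type
MIX = {                                  # instruction counts, in millions
    "I1": {"A": 0, "B": 2, "C": 2},
    "I2": {"A": 2, "B": 2, "C": 1},
}


def summarize(mix, cpi, clock_rate_hz):
    """Return (instructions_M, cycles_M, average_cpi, time_ms) for one mix."""
    instructions_m = sum(mix.values())
    cycles_m = sum(count * cpi[kind] for kind, count in mix.items())
    time_ms = cycles_m * 1e6 / clock_rate_hz * 1e3
    return instructions_m, cycles_m, cycles_m / instructions_m, time_ms


results = {name: summarize(mix, CPI, CLOCK_RATE_HZ) for name, mix in MIX.items()}
for name, (n, cycles, avg_cpi, t) in results.items():
    print(f"{name}: {n} M instructions, {cycles} M cycles, "
          f"CPI {avg_cpi:.2f}, {t:.2f} ms")
```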
4. MIPS rating = Million Instructions Per Second. This can be calculated from:
• CPI and clock rate of the machine
  MIPS = Clock Rate / (CPI × 10^6)
• Total execution time and instruction count
  MIPS = Instruction Count / (Total Execution Time × 10^6)
MIPS rating for I1 = 3060 / 2.5 = 1224 MIPS
MIPS rating for I2 = 3060 / 1.8 = 1700 MIPS
The MIPS rating for I2 is higher than that for I1. This is as expected, since I2 has the shorter execution time.
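Both MIPS formulas should give the same rating for the same run. The sketch below checks this for I1 and I2; the function names are illustrative, and the instruction and cycle counts are taken from the solution.

```python
CLOCK_RATE_HZ = 3.06e9


def mips_from_cpi(clock_rate_hz, cpi):
    """MIPS = clock rate / (CPI * 10^6)."""
    return clock_rate_hz / (cpi * 1e6)


def mips_from_time(instruction_count, exec_time_s):
    """MIPS = instruction count / (execution time * 10^6)."""
    return instruction_count / (exec_time_s * 1e6)


# (instruction count, clock cycles) for each implementation, from the solution.
cases = {"I1": (4e6, 10e6), "I2": (5e6, 9e6)}
for name, (instructions, cycles) in cases.items():
    cpi = cycles / instructions
    exec_time_s = cycles / CLOCK_RATE_HZ
    a = mips_from_cpi(CLOCK_RATE_HZ, cpi)
    b = mips_from_time(instructions, exec_time_s)
    print(f"{name}: {a:.0f} MIPS (from CPI), {b:.0f} MIPS (from time)")
```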
Probable Conclusions
1. The total number of instructions is definitely not a good metric.
2. MIPS appears to be a good metric.
Numerical Assignment
A computer (3.06 GHz) has the following CPI:

| Instruction Type | A | B | C |
|---|---|---|---|
| CPI | 5 | 2 | 3 |

An algorithm may be implemented in two ways, I1 and I2. For each implementation, the number of instructions used (in millions) is as follows:

| Instruction Type | A | B | C |
|---|---|---|---|
| I1 | 0 | 2 | 2 |
| I2 | 1 | 2 | 0 |

1. Which implementation has fewer instructions?
2. What is the average CPI for each implementation? Which implementation is faster?
3. What is the total time taken to execute I1 and I2?
4. What can you say about the MIPS rating?
1. No. of instructions for I1 = 0 + 2 + 2 = 4 M
   No. of instructions for I2 = 1 + 2 + 0 = 3 M
   Hence I1 uses more instructions than I2.
2. Clocks required by I1 = 2*2 + 2*3 = 10 M
   Clocks required by I2 = 1*5 + 2*2 = 9 M
   Average CPI for I1 = 10/4 = 2.5
   Average CPI for I2 = 9/3 = 3
   I2 is faster, as it requires fewer clock cycles.
3. Total time for I1 = 10 M / 3.06 GHz = 3.27 ms
   Total time for I2 = 9 M / 3.06 GHz = 2.94 ms
4. MIPS rating for I1 = 3060 / 2.5 = 1224 MIPS
   MIPS rating for I2 = 3060 / 3 = 1020 MIPS
The MIPS rating for I1 is higher than that for I2. This is unexpected, since I2 has the shorter execution time.
Conclusion
MIPS is also not a good metric for overall system performance.
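The disagreement between the two metrics in this second assignment can be shown directly by ranking the implementations once by execution time and once by MIPS. The helper name `time_and_mips` is an illustrative choice; the numbers are those of the assignment.

```python
CLOCK_RATE_HZ = 3.06e9
CPI = {"A": 5, "B": 2, "C": 3}
MIX_M = {"I1": {"A": 0, "B": 2, "C": 2}, "I2": {"A": 1, "B": 2, "C": 0}}


def time_and_mips(mix_m, cpi, clock_rate_hz):
    """Return (execution time in seconds, MIPS rating) for one instruction mix."""
    instructions = sum(mix_m.values()) * 1e6
    cycles = sum(n * cpi[kind] for kind, n in mix_m.items()) * 1e6
    exec_time_s = cycles / clock_rate_hz
    return exec_time_s, instructions / (exec_time_s * 1e6)


stats = {name: time_and_mips(mix, CPI, CLOCK_RATE_HZ) for name, mix in MIX_M.items()}
fastest = min(stats, key=lambda name: stats[name][0])
highest_mips = max(stats, key=lambda name: stats[name][1])
print("shortest execution time:", fastest)
print("highest MIPS rating:   ", highest_mips)
```

The two rankings name different implementations: I2 finishes first, yet I1 has the higher MIPS rating.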
Conclusion
Total execution time is always the better metric, as it accounts for all factors. It cannot be replaced by considering any one of the following alone:
1. MIPS
2. Total number of instructions
3. Clock Rate
Measuring Performance
Now that we know that performance depends on the program, which program(s) should be used to measure performance?
Benchmarks.
Benchmarks
• Benchmarks are a set of programs specifically chosen for measuring performance.
• Types of Benchmarks
  – Real Programs
  – Kernel
    • Extracts the key feature from a program
  – Component
  – Synthetic
    • Dhrystone – Integer and String Arithmetic
    • Whetstone – Floating Point
  – I/O
  – Parallel
Challenges
1. Vendors may tune benchmarks to make them run better on their platform. At times this is permitted.
2. Benchmarks give a set of data rather than a single performance number.
3. Benchmarks concentrate only on computational power.
Popular Benchmarks
• SPEC – Standard Performance Evaluation Corporation
  – Floating point
  – Integer
  – Web
  – Graphics
• TPC – Transaction Processing Performance Council
  – Web Server
  – Transaction Processing
  – Decision Support Systems
• BAPCo – Business Applications Performance Corporation
  – Popular business applications
• EEMBC – Embedded Microprocessor Benchmark Consortium
  – Embedded Applications
Statistical Summarization of Data
For the response time metric: Arithmetic Mean
For the throughput metric: Harmonic Mean or Geometric Mean
SPEC uses the Geometric Mean.
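The three means can be compared on a small data set using Python's standard `statistics` module. The measurements below are hypothetical values invented for illustration.

```python
from statistics import geometric_mean, harmonic_mean, mean

# Hypothetical measurements for three benchmark programs.
times_s = [2.0, 4.0, 8.0]                  # response times: arithmetic mean
rates = [1.0 / t for t in times_s]         # jobs per second: harmonic mean
ratios = [1.0, 2.0, 4.0]                   # speed ratios vs. a reference machine

print("arithmetic mean of times:", mean(times_s))
print("harmonic mean of rates:  ", harmonic_mean(rates))
print("geometric mean of ratios:", geometric_mean(ratios))
```

In this example the harmonic mean of the rates is the reciprocal of the arithmetic mean of the times, which is why each mean is paired with its metric.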
Are Benchmarks enough?
Benchmarks give the overall performance. To optimize performance, it may be necessary to know which instruction or section of the program takes the most time.
Profilers do this job.
Profiling or Dynamic Program Analysis
• Program behavior is analysed as it is being run.
• Techniques used
  – Instruction Set Simulation
  – Hardware Interrupts
  – OS Hooks
  – Code Instrumentation
• Examples: Intel VTune, gprof
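To give a feel for what a profiler reports, here is a minimal sketch using `cProfile` and `pstats` from the Python standard library. This is not one of the tools named in the slides, and the functions `slow_sum`, `fast_sum` and `workload` are hypothetical examples.

```python
import cProfile
import io
import pstats


def slow_sum(n):
    """Deliberately slow loop, so it dominates the profile."""
    total = 0
    for i in range(n):
        total += i * i
    return total


def fast_sum(n):
    """Closed-form sum of squares 0^2 + ... + (n-1)^2."""
    return (n - 1) * n * (2 * n - 1) // 6


def workload():
    return slow_sum(200_000), fast_sum(200_000)


profiler = cProfile.Profile()
profiler.enable()
result = workload()
profiler.disable()

# Collect the report in a string and show the five most expensive entries.
stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats(5)
report = stream.getvalue()
print(report)
```

The report lists each function with its call count and time, so the section where most time is spent (here the loop in `slow_sum`) stands out.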
Simulation
• Building the actual system is difficult. Simulation is cost-effective.
• Beneficial for learning about or improving some aspect of the architecture.
• Simulators available are:
  – Keil – Instruction Simulator
  – Little Man Computer Simulator – Simulator of a machine
  – Cacheprof – Cache Simulator
Moore’s Law (1965)
Moore's Law states that the number of transistors on a single chip at the same price will double every 18 to 24 months.
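The effect of the doubling period can be sketched numerically. The starting count of 1 million transistors and the 6-year horizon are hypothetical values chosen for illustration.

```python
def projected_transistors(initial_count, years, doubling_period_months):
    """Transistor count after `years`, doubling every `doubling_period_months`."""
    doublings = years * 12 / doubling_period_months
    return initial_count * 2 ** doublings


# Hypothetical starting point: a chip with 1 million transistors.
start = 1_000_000
for months in (18, 24):
    count = projected_transistors(start, years=6, doubling_period_months=months)
    print(f"doubling every {months} months: {count:,.0f} transistors after 6 years")
```

Over six years, an 18-month doubling period gives four doublings (16 times the starting count), while a 24-month period gives three (8 times).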
Implication?
As more transistors are packed into a chip of the same area, the transistors become smaller and faster, so circuits become faster. In other words, the clock rate increases.
Moore’s Law, in combination with other factors such as ILP (Instruction Level Parallelism), was responsible for major performance improvements for a long time.
Trends in Computing (Intel Processors)
| | Fastest processor reported in text, 2003 | Current fastest processor, 2008 |
|---|---|---|
| Intel® processor name | Pentium 4 | Intel® Core™ i7-965 Processor Extreme Edition |
| Processor speed | 3.20 GHz | 3.20 GHz |
| Processor primary level cache | 12KB + 8KB | 4x32KB |
| Processor secondary cache | 512 KB | 4x256KB Level 2 cache |
| Processor third level cache | 2 MB | Unified inclusive 8MB L3 |
Observations
Processor speed (clock rate) has not changed between 2003 and 2008!
What does the 4 in "4x32KB" and "4x256KB" stand for?
The Answer
Multi-core approach: the additional transistors are being used to pack more cores into a chip, rather than to increase clock speed.
Why?
1. Power Wall
2. Memory Wall
3. No more ILP left to exploit.
Topics for further Study
• Papers
  – Performance papers
  – Memory Wall
• Software
  – Intel VTune or any other profiling tool
  – Little Man Computer Simulator or any other simulator apart from Keil
Amdahl’s Law
$$\text{Execution time after improvement} = \frac{\text{Execution time affected by improvement}}{\text{Amount of improvement}} + \text{Execution time unaffected by improvement}$$
What does this mean?
Even if we substantially increase the performance of any one component, it may not result in a substantial overall performance improvement.
A new architecture reduces the time taken by memory instructions by 50%. If memory instructions account for 50% of the total time taken, what is the overall improvement?
T_old = 100, T_new = 50/2 + 50 = 75. Execution time falls by 25%, i.e. an overall speedup of 100/75 ≈ 1.33.
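Amdahl's Law and this worked example can be written as a small function. The slide's arithmetic (25 + 50 = 75) corresponds to the memory portion taking half as long, so the sketch assumes an improvement factor of 2 for that portion; the function name is an illustrative choice.

```python
def time_after_improvement(affected, unaffected, improvement_factor):
    """Amdahl's Law: affected time / amount of improvement + unaffected time."""
    return affected / improvement_factor + unaffected


t_old = 100.0
# Reading taken from the slide's arithmetic: the memory part (50 of 100 units)
# drops to 25, i.e. it becomes 2x faster. This factor is an assumption.
t_new = time_after_improvement(affected=50.0, unaffected=50.0, improvement_factor=2.0)

print("T_new =", t_new)                              # 75.0
print("time saved =", (t_old - t_new) / t_old)       # 0.25
print("overall speedup =", t_old / t_new)            # about 1.33
```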
Which is better?
a. A 20% increase in the performance of instructions that account for 90% of execution time.
b. A 90% increase in the performance of instructions that account for 20% of execution time.
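The answer depends on how "x% increase in performance" is read, so the sketch below tries two readings, both of which are assumptions: either the affected part runs (1 + x) times faster, or its execution time shrinks by x. Total time is normalized to 1, and the function name is illustrative.

```python
def overall_speedup(fraction_affected, part_speedup):
    """Amdahl's Law for a total time normalized to 1."""
    new_time = fraction_affected / part_speedup + (1.0 - fraction_affected)
    return 1.0 / new_time


options = {"a": (0.90, 0.20), "b": (0.20, 0.90)}  # (time fraction, improvement)

for name, (fraction, gain) in options.items():
    # Reading 1 (assumption): the affected part runs (1 + gain) times faster.
    as_rate = overall_speedup(fraction, 1.0 + gain)
    # Reading 2 (assumption): the affected part's time shrinks by `gain`.
    as_time_cut = overall_speedup(fraction, 1.0 / (1.0 - gain))
    print(f"{name}: speedup {as_rate:.3f} (faster rate), "
          f"{as_time_cut:.3f} (time cut)")
```

Under the first reading option (a) gives the larger overall speedup. Under the second reading both options save the same 18% of total time.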