KT6213 Computer Organization and Architecture, Lecture 1: Introduction
Architecture & Organization (Stallings)
Architecture is those attributes visible to the programmer
◦ Instruction set, number of bits used for data representation, I/O mechanisms, addressing techniques
◦ e.g., Is there a multiply instruction?
Organization is how features are implemented
◦ Control signals, interfaces, memory technology
◦ e.g., Is there a hardware multiply unit, or is multiplication done by repeated addition?
Computer Architecture Is …
the attributes of a [computing] system as seen by the programmer, i.e., the conceptual structure and functional behavior, as distinct from the organization of the data flows and controls the logic design, and the physical implementation.
Amdahl, Blaauw, and Brooks, 1964
Cont.
Computer Architecture is the design of
computers, including their instruction sets, hardware components, and system organization [Patterson].
Thus there are two essential parts of computer architecture:
◦ Instruction-set Architecture (ISA)
◦ Hardware-system Architecture (HSA)
[Figure: computer architecture sits at the interface between applications and hardware, shaped by technology, programming languages, operating systems, history, measurement & evaluation, and parallelism; computer architecture = instruction set design + organization + hardware]
Instruction-set Architecture (ISA)
The instruction set architecture of a computer
includes anything a programmer would need to know to make the computer run correctly. This includes:
◦ (a) The number and types of registers
◦ (b) Instruction set (what operations can be performed?)
◦ (c) Instruction format (how are they specified?)
◦ (d) Addressing mode (how is data obtained? direct vs. indirect)
◦ (e) Exception handling (what happens when something goes wrong?)
The instruction-set architecture includes the specifications that determine how machine-language programs will interact with the computer. That is, in general, two computers with the same ISA will run the same programs. This is the notion of a computer-family architecture.
Hardware-system Architecture (HSA)
The hardware-system architecture deals with
the computer's major hardware subsystems, including the central processing unit (CPU), its storage system, and its input-output system.
The computer hardware design determines the implementation of the various computer components. This includes:
◦ (a) Capabilities and performance of the functional units (e.g., registers, ALUs, shifters)
◦ (b) Methods for connecting the functional units (e.g., data bus)
◦ (c) Control logic for the functional units
Typically, the computer hardware is designed based on the instruction set architecture.
A successful ISA generally has many implementations (a computer-family) which are different in their HSA.
Compatibility is the ability of different computers to run the same programs. ◦ Upward compatibility allows high-performance
members of a family to run the same program as do the low-performance members
◦ Downward compatibility is not always possible, since high-performance family members often have features not available on lower-performance members.
Computer Family
A computer family is a set of implementations that share the same or similar ISA (using a variety of technologies, memory sizes, and speeds). For example, IBM system/360 (1960s), PDP-8 family (1965), PDP-11 family (1965), and IBM system/370 (1970s).
All Intel x86 family share the same basic architecture
The IBM System/370 family share the same basic architecture
This gives code compatibility
◦ At least backwards
Organization differs between different versions
Computer Evolution
Historical Perspective
Early Computing
1946: ENIAC, US Army, 18,000 Vacuum Tubes
1949: UNIVAC I, $250K, 48 systems sold
1954: IBM 701, Core Memory
1957: Moving Head Disk
1958: Transistor, FORTRAN, ALGOL, CDC & DEC
Founded
1964: IBM 360, CDC 6600, DEC PDP-8
1969: UNIX
1970: FLOPPY DISK
1981: IBM PC, 1st Successful Portable (Osborne 1)
1986: Connection Machine, Max Headroom Debut
Underlying Technologies
Generation
Evolutionary
Parallelism
Year  Logic               Storage             Prog. Lang.        O/S
54    Tubes               core (8 ms)
58    Transistor (10 µs)                      FORTRAN
60                                            ALGOL, COBOL       Batch
64    Hybrid (1 µs)       thin film (200 ns)  Lisp, APL, Basic
66    IC (100 ns)                             PL/1, Simula, C
67                                                               Multiprog.
71    LSI (10 ns)         1k DRAM             O.O.               V.M.
73    (8-bit µP)
75    (16-bit µP)         4k DRAM
78    VLSI (10 ns)        16k DRAM                               Networks
80                        64k DRAM
84    (32-bit µP)         256k DRAM           ADA
87    ULSI                1M DRAM
89    GaAs                4M DRAM             C++
92    (64-bit µP)         16M DRAM            Fortran90
What has happened in the 1990s
“Network-Integrated Computing”
◦ Wide-area AND local-area integration of cluster-based computing and high-performance networks
Scalable Technologies for Computing, Networking, and Information Systems
◦ Systems that scale DOWN as well as UP
◦ High-performance workstations
◦ Clusters and distributed systems
◦ Massively parallel I/O and computer servers
◦ National Information Infrastructure
What has been predicted for the Late 1990s and Early 2000s
Technology
◦ Very large dynamic RAM: 64 Mbits and beyond
◦ Large fast static RAM: 1 MB, 10 ns
Complete systems on a chip
◦ 10+ million transistors
Parallelism
◦ Superscalar, superpipeline, vector, multiprocessors
◦ Processor arrays
What has been predicted for the Late 1990s and Early 2000s (cont.)
Low Power
◦ 50% of PCs portable by 1995
◦ Performance per watt
Parallel I/O
◦ Many applications are I/O limited, not computation limited
◦ Computation scaling, but memory and I/O bandwidth not keeping pace
Multimedia
◦ New interface technologies
◦ Video, speech, handwriting, virtual reality, …
Review of Technology Trends and Cost/Performance
Original Food Chain Picture
Big Fishes Eating Little Fishes
1988 Computer Food Chain
[Figure, smallest to largest: PC, Workstation, Minicomputer, Mainframe, Mini-supercomputer, Supercomputer, Massively Parallel Processors]
1998 Computer Food Chain
[Figure: PC, Workstation, Server, Mainframe, and Supercomputer remain; Minicomputer, Mini-supercomputer, and Massively Parallel Processors are being swallowed]
Now who is eating whom?
Why Such Change in 10 Years?
Performance
◦ Technology advances: CMOS VLSI dominates older technologies (TTL, ECL) in cost AND performance
◦ Computer architecture advances improve the low end: RISC, superscalar, RAID, …
Price: lower costs due to …
◦ Simpler development: CMOS VLSI means smaller systems, fewer components
◦ Higher volumes: with CMOS VLSI, the same development cost is spread over 10,000,000 rather than 10,000 units
Function
◦ Rise of networking/local interconnection technology
Moore’s Law
Gordon Moore, cofounder of Intel
Increased density of components on chip
Number of transistors on a chip will double every year
Since the 1970s development has slowed a little
◦ Number of transistors doubles every 18 months
Cost of a chip has remained almost unchanged
Higher packing density means shorter electrical paths, giving higher performance
Smaller size gives increased flexibility
Reduced power and cooling requirements
Fewer interconnections increase reliability
Performance Mismatch
Processor speed increased
Memory capacity increased
Memory speed lags behind processor speed
DRAM and Processor Characteristics
Trends in DRAM Use
Memory Capacity (Single-Chip DRAM)
[Chart: DRAM bits per chip vs. year, 1970–2000, log scale from 1,000 to 1,000,000,000 bits]

Year   Size (Mb)   Cycle time
1980 0.0625 250 ns
1983 0.25 220 ns
1986 1 190 ns
1989 4 165 ns
1992 16 145 ns
1996 64 120 ns
2000 256 100 ns
Throughput: the total amount of work done in a given time
Response time: the time between the start and the completion of an event
Performance milestone
Technology Trends (Summary)
Capacity Speed (latency)
Logic 2x in 3 years 2x in 3 years
DRAM 4x in 3 years 2x in 10 years
Disk 4x in 3 years 2x in 10 years
Growth in CPU Transistor Count
Technology Trends: Microprocessor Capacity
[Chart: transistors per CPU vs. year, 1970–2000, log scale: i4004, i8080, i8086, i80286, i80386, i80486, Pentium, following Moore’s Law]
CMOS improvements:
• Die size: 2X every 3 yrs
• Line width: halve every 7 yrs
Transistor counts: Alpha 21264: 15 million; Pentium Pro: 5.5 million; PowerPC 620: 6.9 million; Alpha 21164: 9.3 million; Sparc Ultra: 5.2 million
Growth in Processor Performance
Performance Trends (Summary)
Workstation performance (measured in SPECmarks) improves roughly 50% per year (2X every 18 months)
Improvement in cost performance estimated at 70% per year
Measurement and Evaluation
[Figure: an iterative loop in which creativity feeds design; cost/performance analysis separates good ideas from mediocre and bad ones; analysis feeds back into design]
Architecture is an iterative process:
• Searching the space of possible designs
• At all levels of computer systems
Computer Engineering Methodology
[Figure, built up over several slides: technology trends drive a cycle: evaluate existing systems for bottlenecks (using benchmarks) → simulate new designs and organizations (using workloads) → implement the next-generation system (subject to implementation complexity) → back to evaluation]
Summary: Price vs. Cost
[Charts: list price broken into component costs, direct costs, gross margin, and average discount for Mini, W/S, and PC, shown as percentages and as multiples of component cost (roughly 4.7, 3.8, and 1.8, falling to 3.5, 2.5, and 1.5)]
System Performance
Measuring and Reporting Performance
• Designing high performance computers is one of the major goals of any computer architect.
• As a result, assessing the performance of computer hardware is at the heart of computer design and greatly affects the demand and market value of the computer.
• However, measuring performance of a computer system is not a straightforward task:
Metrics – How do we describe in a numerical way the performance of a computer?
What tools do we use to find those metrics? How do we summarize the performance?
Measuring and Reporting Performance
What do we mean by saying one computer is faster than another?
◦ A program runs in less time
Response time or execution time
◦ The time until users see the output
Throughput
◦ The total amount of work done in a given time
Performance
“Increasing” and “decreasing” what?
We use the terms “improve performance” and “improve execution time” when we mean increase performance and decrease execution time:
improve performance = increase performance
improve execution time = decrease execution time
What is performance?
…. how fast does this computer run MY program ?
…. is machine A faster than machine B, and if so, how much faster ?
…. one of the three factors driving architecture
◦ …. effective use of new technology: should I use it to enhance the architecture or to improve performance of the existing architecture?
◦ …. can a desired performance improvement be achieved by a given set of implementation or organization changes?
Measuring Performance
Definition of time
Wall clock time
Response time
Elapsed time
◦ A latency to complete a task including disk accesses, memory accesses, I/O activities, operating system overhead
Measure Performance – User CPU Time
• Response time may include disk access, memory access, input/output activities, CPU event and operating system overhead—everything.
• In order to get an accurate measure of performance, we use CPU time instead of using response time.
• CPU time is the time the CPU spends computing a program and does not include time spent waiting for I/O or running other programs.
• CPU time can also be divided into user CPU time (program) and system CPU time (OS).
• Keying in the UNIX command time, we get, for example:
90.7u 12.9s 2:39 65% (user CPU, system CPU, total response time, percentage of elapsed time that is CPU time)
• In our performance measures, we use user CPU time, because of its independence from the OS and other factors.
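As a sanity check on the sample `time` output above, the final 65% figure is simply CPU time divided by elapsed time. A minimal sketch (the variable names are mine, not part of any tool):

```python
# Interpreting the slide's sample UNIX `time` output: 90.7u 12.9s 2:39 65%.
# The final percentage is (user CPU + system CPU) / elapsed time.
user_cpu = 90.7        # user CPU time, seconds
system_cpu = 12.9      # system CPU time, seconds
elapsed = 2 * 60 + 39  # elapsed (response) time, 2:39 -> 159 seconds

cpu_utilization = (user_cpu + system_cpu) / elapsed
print(f"{cpu_utilization:.0%}")  # -> 65%
```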
Performance Metrics
Benchmarking: Performance Measure
What is the computer user interested in?
Reduce the time to run a certain task
◦ Execution time (response time)
◦ The time between the start and the completion of an event
Increase the tasks per week, day, hour, sec, ns …
◦ Throughput
◦ The total amount of work done in a given time
Example
Do the following changes to a computer system increase throughput, reduce response time, or both?
1) Replacing the processor in a computer with a faster version
2) Adding additional processors to a system that uses multiple processors for separate tasks – for example, handling an airline reservation system
Answer
1) Both response time and throughput are improved.
2) Only throughput increases.
Performance Measure
• Time to run the task (ExTime)
– Execution time, response time, latency
• Tasks per day, hour, week, sec, ns … (Performance)
– Throughput, bandwidth
Plane        Speed      DC to Paris   Passengers   Throughput (pmph)
Boeing 747   610 mph    6.5 hours     470          286,700
Concorde     1350 mph   3 hours       132          178,200
Execution Time
In this subject, we will be primarily interested in execution time as a measure of performance.
The relationship between performance and execution time on a computer X (reciprocal) is defined as follows:
Performance_X = 1 / Execution time_X
To maximize performance of an application, we need to minimize its execution time.
Performance Comparison
To compare design alternatives, we use the following equation:
Execution time_Y / Execution time_X = Performance_X / Performance_Y = n
• “X is n times faster than Y” or “the throughput of X is n times higher than Y” means that the execution time is n times less on X than on Y.
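The n-times-faster relation above can be sketched as a one-line helper (the names and example times are illustrative):

```python
def times_faster(exec_time_x: float, exec_time_y: float) -> float:
    """Return n such that X is n times faster than Y (performance = 1/time)."""
    return exec_time_y / exec_time_x

# If Y takes 15 s and X takes 10 s on the same program, X is 1.5 times faster.
print(times_faster(10.0, 15.0))  # -> 1.5
```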
Metrics of Performance
[Figure: what is actually measured at each level: Application (answers per month, operations per second); Programming language; Compiler; ISA (millions of instructions per second: MIPS; millions of FP operations per second: MFLOP/s); Datapath/Control (megabytes per second); Function units (cycles per second, i.e., clock rate); Transistors, wires, pins]
Aspects of CPU Performance
              Inst count   CPI   Clock rate
Program           X
Compiler          X        (X)
Inst. set         X         X
Organization                X        X
Technology                           X

CPU time = Seconds/Program = (Instructions/Program) × (Cycles/Instruction) × (Seconds/Cycle)
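The CPU time product can be sketched directly in code; the program parameters below are hypothetical, chosen only to illustrate the units:

```python
def cpu_time(instructions: float, cpi: float, cycle_time_s: float) -> float:
    """CPU time = (instructions/program) x (cycles/instruction) x (seconds/cycle)."""
    return instructions * cpi * cycle_time_s

# Hypothetical program: 1 million instructions, CPI 2.0, 500 MHz clock (2 ns cycle).
print(round(cpu_time(1_000_000, 2.0, 2e-9), 6))  # -> 0.004 (seconds)
```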
CPU Performance
All computers are constructed using a clock to operate their circuits.
• Typically measured by two basic metrics
• Clock rate – today in MHz and GHz
• Clock cycle time: clock cycle time = 1/clock rate
• E.g., 1 GHz rate corresponds to 1 ns cycle time
Thus the CPU time for a program is given by:
CPU time = CPU clock cycles for a program × Clock cycle time
or,
CPU time = CPU clock cycles for a program / Clock rate
CPU Performance
More typically, we count the number of instructions executed, known as the instruction count (IC).
CPI: the average number of clock cycles per instruction:
CPI = CPU clock cycles for a program / IC
Hence, an alternative method to get the CPU time:
CPU time = IC × CPI × Clock cycle time = IC × CPI / Clock rate
CPU performance is equally dependent upon three characteristics: clock cycle time, CPI, and IC. They are not independent: making one better often makes another worse, because the basic technologies involved in changing each characteristic are interdependent.
Clock cycle time: hardware technology and organization
CPI: organization and instruction set architecture (and depends on the instruction set of the computer and its compiler)
IC: instruction set architecture and compiler technology
CPU Performance
Example: Suppose we have two implementations of the same instruction set architecture. Computer A has a clock cycle time of 10 ns and a CPI of 2.0 for some program, and computer B has a clock cycle time of 20 ns and a CPI of 1.2 for the same program. Which machine is faster for this program?
Answer
Assume the program requires IN instructions to be executed:
CPU clock cycles_A = IN × 2.0
CPU clock cycles_B = IN × 1.2
CPU time_A = IN × 2.0 × 10 ns = 20 × IN ns
CPU time_B = IN × 1.2 × 20 ns = 24 × IN ns
So computer A is faster than computer B.
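The example's arithmetic can be checked mechanically; since the instruction count IN cancels in the comparison, it can be set to 1:

```python
def cpu_time_ns(instr_count: float, cpi: float, cycle_ns: float) -> float:
    # CPU time = instruction count x CPI x clock cycle time
    return instr_count * cpi * cycle_ns

IN = 1.0                      # any instruction count cancels in the ratio
a = cpu_time_ns(IN, 2.0, 10)  # computer A: 20 x IN ns
b = cpu_time_ns(IN, 1.2, 20)  # computer B: 24 x IN ns
print(a, b, b / a)            # -> 20.0 24.0 1.2  (A is 1.2x faster)
```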
CPU Performance
It is often easier to deal with the overall performance on a per-instruction-class basis. The CPU clock cycles can then be expressed as:
CPU clock cycles = Σ_{i=1..n} IC_i × CPI_i
where IC_i is the number of times instruction (class) i is executed and CPI_i is the CPI for instruction (class) i. Hence:
CPU time = (Σ_{i=1..n} IC_i × CPI_i) × Clock cycle time
and the overall CPI can be expressed as:
CPI = CPU clock cycles / Instruction count = Σ_{i=1..n} (IC_i / Instruction count) × CPI_i
Cycles Per Instruction
Example: Suppose we have a machine where we can count the frequency with which instructions are executed, and we know how many cycles each instruction type takes. How do we get the total CPI? How do we get the % time?

Base Machine (Reg / Reg)
Instruction   Freq   CPI   (% Time)
ALU           50%    1     (33%)
Load          20%    2     (27%)
Store         10%    2     (13%)
Branch        20%    2     (27%)
(Total CPI: 1.5)
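The total-CPI and %-time columns can be recomputed from the frequency/CPI pairs in the table:

```python
# Frequencies and CPIs from the base-machine table above.
mix = {"ALU": (0.50, 1), "Load": (0.20, 2), "Store": (0.10, 2), "Branch": (0.20, 2)}

# Total CPI = sum over classes of frequency x CPI.
total_cpi = sum(freq * cpi for freq, cpi in mix.values())
print(round(total_cpi, 2))  # -> 1.5

# % time = each class's cycle contribution divided by the total CPI.
for name, (freq, cpi) in mix.items():
    print(name, f"{freq * cpi / total_cpi:.0%}")  # ALU 33%, Load 27%, Store 13%, Branch 27%
```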
CPU Performance
Example: Suppose we have made the following measurements:
Frequency of FP operations (other than FPSQR) = 25%
Avg. CPI of FP operations = 4.0
Avg. CPI of other instructions = 1.33
Frequency of FPSQR = 2%
CPI of FPSQR = 20
Compare two designs: 1) decrease the CPI of FPSQR to 2; 2) decrease the avg. CPI of all FP operations to 2.5.
Answer
First, find the original CPI (all FP, including FPSQR, is 27% of instructions):
CPI_o = Σ (IC_i / Instruction count) × CPI_i = 27% × 4 + 73% × 1.33 = 2.051
The CPI with design 1:
CPI_1 = CPI_o - 2% × (20 - 2) = 2.051 - 0.36 = 1.691
The CPI with design 2:
CPI_2 = 27% × 2.5 + 73% × 1.33 = 1.646
So design 2 is slightly better.
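A short script confirms the three CPI values (with CPI 1.33 for the other instructions, design 2 works out to about 1.646; either way the lower CPI wins):

```python
# Measurements from the example above.
fp_freq, fp_cpi = 0.27, 4.0          # all FP ops: 25% other FP + 2% FPSQR
other_freq, other_cpi = 0.73, 1.33
fpsqr_freq, fpsqr_cpi = 0.02, 20

cpi_orig = fp_freq * fp_cpi + other_freq * other_cpi
cpi_design1 = cpi_orig - fpsqr_freq * (fpsqr_cpi - 2)  # FPSQR CPI: 20 -> 2
cpi_design2 = fp_freq * 2.5 + other_freq * other_cpi   # all FP CPI -> 2.5

print(round(cpi_orig, 3), round(cpi_design1, 3), round(cpi_design2, 3))
# -> 2.051 1.691 1.646 : design 2 has the lower CPI, so it is slightly better
```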
MIPS: Millions of Instructions per Second
Relative MIPS and Speedup
CPU Performance Measure: how many millions of instructions per second (MIPS) a CPU can execute.
◦ MIPS = million instructions per second
◦ (a) MIPS = instruction count / (execution time × 10^6) = clock rate / (CPI × 10^6)
◦ (b) Example: a program that executes 3 million instructions in 2 seconds has a MIPS rating of 1.5
◦ (c) Advantages: easy to understand, easy to measure
◦ (d) Disadvantages:
   cannot accurately compare machines with different instruction sets (machines with powerful instruction sets are penalized)
   varies from program to program (programs with lots of simple instructions will have higher MIPS ratings)
   can vary inversely with performance
◦ (e) Different types:
   native: as defined in (a) for a given program
   peak: uses an instruction mix that minimizes CPI (may be very unrealistic)
   relative: compared to a reference machine
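The two MIPS formulas in (a) can be written as small helpers and checked against the example in (b); the 1 GHz figure below is my own illustration:

```python
def mips(instruction_count: float, exec_time_s: float) -> float:
    """MIPS = instruction count / (execution time x 10^6)."""
    return instruction_count / (exec_time_s * 1e6)

def mips_from_clock(clock_hz: float, cpi: float) -> float:
    """Equivalent form: MIPS = clock rate / (CPI x 10^6)."""
    return clock_hz / (cpi * 1e6)

print(mips(3_000_000, 2.0))       # -> 1.5, matching the example
print(mips_from_clock(1e9, 2.0))  # a hypothetical 1 GHz machine at CPI 2.0 -> 500.0
```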
MFLOPS: Focus on One Type of Work
Normalized MFLOPS
Some “Misleading” Performance Measures
There are certain computer performance measures which are popular with computer manufacturers and sellers, but which may be misleading.
MIPS (Million Instructions Per Second)
MIPS depends on the instruction set, making it difficult to compare MIPS across different instruction sets.
MIPS varies between programs on the same computer: different programs use different instruction mixes.
Most importantly, MIPS can vary inversely with performance.
Some “Misleading” Performance Measures
MFLOPS: focus on one type of work
MFLOPS (Million Floating-point Operations Per Second) depends on the program, which must be FP-intensive.
MFLOPS depends on the computer as well.
Floating-point operations vary in complexity (e.g., add vs. divide).
Peak performance: performance that the manufacturer guarantees you won't exceed
The difference between peak performance and average performance can be huge.
Peak performance is not useful in predicting observed performance.
Programs for Measuring Performance
• Real applications: text-processing software (Word), compilers (C), and other applications like Photoshop, which have inputs, outputs, and options when users run them.
One major downside: real applications often encounter portability problems arising from dependences on the OS or compiler.
• Modified (or scripted) applications: Modification is to enhance portability or focus on one particular aspect of system performance. Scripts are used to simulate the interaction behaviors.
• Kernels: small, key pieces from real programs. Typically used to evaluate individual features of the machine.
• Toy benchmarks: typically between 10 and 100 lines of code and produce a known result.
• Synthetic benchmarks: artificially created code to match an average execution profile.
Benchmark Suites
• A benchmark suite is a collection of programs (a workload) that tries to explore and capture the strengths and weaknesses of a computer system (real programs, kernels).
• A key advantage of such suites is that the weakness of any one benchmark is lessened by the presence of the other benchmarks.
• Good vs. bad benchmarks
– Improving the product for real programs vs. improving the product for benchmarks to get more sales.
– If benchmarks are inadequate, then sales win!
SPEC Benchmarks
SPEC: System Performance Evaluation Cooperative
The most successful attempt, and widely adopted.
First generation, 1989
◦ 10 programs yielding a single number (“SPECmarks”)
Second generation, 1992
◦ SPECint92 (6 integer programs) and SPECfp92 (14 floating-point programs)
◦ Unlimited compiler flags
Third generation, 1995
◦ New set of programs: SPECint95 (8 integer programs) and SPECfp95 (10 floating point)
◦ Single flag setting for all programs: SPECint_base95, SPECfp_base95
◦ “Benchmarks useful for 3 years”
SPEC Benchmarks
CINT2000 program list:
Program      Language  What Is It
164.gzip     C         Compression
175.vpr      C         FPGA Circuit Placement and Routing
176.gcc      C         C Programming Language Compiler
181.mcf      C         Combinatorial Optimization
186.crafty   C         Game Playing: Chess
197.parser   C         Word Processing
252.eon      C++       Computer Visualization
253.perlbmk  C         PERL Programming Language
254.gap      C         Group Theory, Interpreter
255.vortex   C         Object-oriented Database
256.bzip2    C         Compression
300.twolf    C         Place and Route Simulator
CINT2000 (Integer Component of SPEC CPU2000)
http://www.spec.org/osg/cpu2000/CINT2000/
SPEC Benchmarks
Program       Language    What Is It
168.wupwise   Fortran 77  Physics / Quantum Chromodynamics
171.swim      Fortran 77  Shallow Water Modeling
172.mgrid     Fortran 77  Multi-grid Solver: 3D Potential Field
173.applu     Fortran 77  Parabolic / Elliptic Differential Equations
177.mesa      C           3-D Graphics Library
178.galgel    Fortran 90  Computational Fluid Dynamics
179.art       C           Image Recognition / Neural Networks
183.equake    C           Seismic Wave Propagation Simulation
187.facerec   Fortran 90  Image Processing: Face Recognition
188.ammp      C           Computational Chemistry
189.lucas     Fortran 90  Number Theory / Primality Testing
191.fma3d     Fortran 90  Finite-element Crash Simulation
200.sixtrack  Fortran 77  High Energy Physics Accelerator Design
301.apsi      Fortran 77  Meteorology: Pollutant Distribution
CFP2000 (Floating Point Component of SPEC CPU2000)
http://www.spec.org/osg/cpu2000/CFP2000/
SPEC Benchmarks
Sample Results For CINT2000
Benchmarks     Base Ref Time   Base Run Time   Base Ratio   Peak Ref Time   Peak Run Time   Peak Ratio
164.gzip 1400 277 505* 1400 270 518*
175.vpr 1400 419 334* 1400 417 336*
176.gcc 1100 275 399* 1100 272 405*
181.mcf 1800 621 290* 1800 619 291*
186.crafty 1000 191 522* 1000 191 523*
197.parser 1800 500 360* 1800 499 361*
252.eon 1300 267 486* 1300 267 486*
253.perlbmk 1800 302 596* 1800 302 596*
254.gap 1100 249 442* 1100 248 443*
255.vortex 1900 268 710* 1900 264 719*
256.bzip2 1500 389 386* 1500 375 400*
300.twolf 3000 784 382* 3000 776 387*
SPECint_base2000 438
SPECint2000 442
http://www.spec.org/osg/cpu2000/results/res2000q3/cpu2000-20000718-00168.asc
Intel OR840 (1 GHz Pentium III processor)
More Benchmarks
TPC: Transaction Processing Council
– Measures the ability of a system to handle transactions, which consist of database accesses and updates.
– Many variants depending on transaction complexity
– TPC-A: simple bank-teller transaction style
– TPC-C: complex database query
EDN Embedded Microprocessor Benchmark Consortium (EEMBC, “embassy”)
– 34 kernels in 5 classes
– 16 automotive/industrial; 5 consumer; 3 networking; 4 office automation; 6 telecommunications
How to Summarize Performance
Management would like to have one number. Technical people want more:
1. They want evidence of reproducibility: there should be enough information so that you or someone else can repeat the experiment.
2. There should be consistency when doing the measurements multiple times.
How would you report these results?

                    Computer A   Computer B   Computer C
Program P1 (secs)        1           10           20
Program P2 (secs)     1000          100           20
Total time (secs)     1001          110           40
Comparing and Summarizing Performance
Comparing the performance by looking at individual programs is not fair.
Total execution time: a consistent summary measure.
Arithmetic Mean – provides a simple average
– Time_i: execution time for program i in the workload
– Doesn’t account for weight: all programs are treated equal
Arithmetic mean = (1/n) × Σ_{i=1..n} Time_i
Normalized Time Metrics
Normalized execution time metrics measure performance by normalizing it to a reference machine: Execution time ratio_i
Geometric Mean
The geometric mean is consistent no matter which machine is the reference.
The arithmetic mean should not be used to average normalized execution times.
However, the geometric mean still doesn't form an accurate prediction model (it doesn't predict execution time).
Geometric mean = (Π_{i=1..n} Execution time ratio_i)^(1/n)
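The reference-machine independence of the geometric mean can be demonstrated with the P1/P2 times used earlier (1 s and 1000 s on machine A; 10 s and 100 s on machine B):

```python
import math

def geometric_mean(ratios):
    # (product of the ratios) ** (1/n)
    return math.prod(ratios) ** (1 / len(ratios))

a_times, b_times = [1.0, 1000.0], [10.0, 100.0]

# Normalize B to A, and A to B: either way the geometric means agree (here, 1.0),
# so the choice of reference machine does not change the comparison.
gm_b_over_a = geometric_mean([b / a for a, b in zip(a_times, b_times)])
gm_a_over_b = geometric_mean([a / b for a, b in zip(a_times, b_times)])
print(gm_b_over_a, gm_a_over_b)  # -> 1.0 1.0
```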
Comparing Performance of Two Computers Using SPECRatio
The relationship between the geometric mean and the performance ratio:
The ratio of the geometric means is equal to the geometric mean of the performance ratios, which implies that the choice of the reference computer is irrelevant.
Weighted Variants
What is the proper mixture of programs for the workload?
A weight is a weighting factor assigned to each program to indicate the relative frequency of the program in the workload: % of use
Weighted Arithmetic Mean
– Weight_i: frequency of program i in the workload
– May be better but beware the dominant program time
Weighted arithmetic mean = Σ_{i=1..n} Weight_i × Time_i (where the weights sum to 1)
Example
                              A       B      C    W(1)   W(2)   W(3)
Program P1 (secs)             1      10     20    0.5    0.909  0.999
Program P2 (secs)          1000     100     20    0.5    0.091  0.001
Arithmetic mean           500.5      55     20
Weighted arithmetic mean (1) 500.5   55     20
Weighted arithmetic mean (2) 91.82   18.18  20
Weighted arithmetic mean (3)  2      10.09  20
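The table rows can be reproduced with a small helper. The 0.909/0.091 weights appear to be rounded displays of 10/11 and 1/11 (my assumption, which matches the printed means):

```python
def weighted_mean(times, weights):
    """Weighted arithmetic mean: sum of weight_i x time_i (weights sum to 1)."""
    return sum(w * t for w, t in zip(weights, times))

a_times, b_times = [1, 1000], [10, 100]  # P1, P2 seconds on machines A and B

print(weighted_mean(a_times, [0.5, 0.5]))               # -> 500.5
print(round(weighted_mean(a_times, [10/11, 1/11]), 2))  # -> 91.82
print(round(weighted_mean(b_times, [10/11, 1/11]), 2))  # -> 18.18
```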
Example
                              A       B      C     W(1)
Program P1 (secs)             1      10     20     100/101
Program P2 (secs)          1000     100     20     1/101
Normalized to A:
Program P1                    1      10     20
Program P2                    1      0.1    0.02
Geometric mean                1      1      0.63
Arithmetic mean           500.5      55     20
Weighted arithmetic mean (1) 10.89   10.89  20
Machines A and B have the same performance according to the geometric mean measure, yet this would be true only for a workload in which P1 runs 100 times more often than P2, according to the weighted arithmetic mean measure.
Locality of Reference
Other important fundamental observations come from
the properties of programs.
Principle of locality: Programs tend to reuse data and instructions they have used recently.
There are two different types of locality:
Temporal Locality (locality in time): If an item is referenced, it will tend to be referenced again soon (loops, reuse, etc.)
Spatial Locality (locality in space/location): If an item is referenced, items whose addresses are close to one another tend to be referenced soon (straight-line code, array access, etc.)
We can predict with reasonable accuracy what instructions and data a program will use in the near future based on its accesses in the past.
Quantitative Principles of Computer Design
Now that we know how to define, measure, and summarize performance, we can explore some of the principles and guidelines in the design and analysis of computers.
Make the common case fast
In making a design trade-off, favor the frequent case over the infrequent case.
Improving the frequent event, rather than the rare event, will obviously help performance.
Frequent case is often simpler and can be done faster than the infrequent case.
We have to decide what the frequent case is and how much performance can be improved by making the case faster.
Two equations to evaluate design alternatives
The performance gain that can be obtained by improving some portion of a computer can be calculated using Amdahl's Law.
Amdahl’s Law states that the performance improvement to be gained from using some fast mode of execution is limited by the fraction of the time the faster mode can be used.
Amdahl’s Law defines the speedup that can be gained by using a particular feature.
Amdahl’s Law
The CPU Performance Equation
Essentially all computers are constructed using a clock running at a constant rate. CPU time can then be expressed in terms of the number of clock cycles.
Amdahl's Law
Suppose that enhancement E accelerates a fraction F of the task by a factor S, and the remainder of the task is unaffected
Speedup due to enhancement E:
Speedup(E) = Execution time without enhancement / Execution time with enhancement
• Fraction_enhanced: the fraction of the execution time in the original machine that can be converted to take advantage of the enhancement
• Speedup_enhanced: the improvement gained by the enhanced execution mode
ExTime_new = ExTime_old × ((1 - Fraction_enhanced) + Fraction_enhanced / Speedup_enhanced)

Speedup_overall = ExTime_old / ExTime_new = 1 / ((1 - Fraction_enhanced) + Fraction_enhanced / Speedup_enhanced)
Amdahl's Law
Example: Floating-point (FP) instructions are improved to run faster by a factor of 2, but only 10% of the time is spent executing FP instructions. What is the overall speedup gained by this improvement?
Answer
ExTime_new = ExTime_old × (0.9 + 0.1/2) = 0.95 × ExTime_old
Speedup_overall = 1 / 0.95 = 1.053
Amdahl's Law can serve as a guide to how much an enhancement will improve performance and how to distribute resources to improve cost-performance.
It is particularly useful for comparing the performance, both of the overall system and of the CPU, of two design alternatives.
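Amdahl's Law is a one-line function; a minimal sketch that re-checks the FP example above:

```python
def amdahl_speedup(fraction_enhanced: float, speedup_enhanced: float) -> float:
    """Overall speedup = 1 / ((1 - F) + F / S)."""
    return 1.0 / ((1.0 - fraction_enhanced) + fraction_enhanced / speedup_enhanced)

# FP instructions are 10% of execution time and are made 2x faster.
print(round(amdahl_speedup(0.10, 2.0), 3))  # -> 1.053
```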
Example of Amdahl's Law
Floating-point instructions are improved to run 3X faster, but only 5% of the time is spent running FP instructions:
ExTime_new = ExTime_old × (0.95 + 0.05/3) ≈ 0.967 × ExTime_old
Speedup_overall = 1 / 0.967 ≈ 1.034
Example
Suppose that we want to enhance the
processor used for Web serving. The new processor is 10 times faster on computation in the Web serving application than the original processor. Assuming that the original processor is busy with computation 40% of the time and is waiting for I/O 60% of the time, what is the overall speedup gained by incorporating the enhancement?
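Plugging the numbers into Amdahl's Law (only the 40% computation fraction benefits from the 10x faster processor):

```python
def amdahl_speedup(fraction_enhanced: float, speedup_enhanced: float) -> float:
    # Overall speedup = 1 / ((1 - F) + F / S)
    return 1.0 / ((1.0 - fraction_enhanced) + fraction_enhanced / speedup_enhanced)

# 40% of time is computation (10x faster); the 60% spent waiting for I/O is unchanged.
print(round(amdahl_speedup(0.40, 10.0), 4))  # -> 1.5625
```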
Example
A common transformation required in graphics processors is square root. Implementations of floating-point (FP) square root vary significantly in performance, especially among processors designed for graphics. Suppose FP square root (FPSQR) is responsible for 20% of the execution time of a critical graphics benchmark. One proposal is to enhance the FPSQR hardware and speed up this operation by a factor of 10. The other alternative is just to try to make all FP instructions in the graphics processor run faster by a factor of 1.6; FP instructions are responsible for half of the execution time for the application. The design team believes that they can make all FP instructions run 1.6 times faster with the same effort as required for the fast square root. Compare these two design alternatives.
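Working the comparison with Amdahl's Law shows the two proposals are close, with the all-FP improvement slightly ahead:

```python
def amdahl_speedup(fraction_enhanced: float, speedup_enhanced: float) -> float:
    # Overall speedup = 1 / ((1 - F) + F / S)
    return 1.0 / ((1.0 - fraction_enhanced) + fraction_enhanced / speedup_enhanced)

speedup_fpsqr = amdahl_speedup(0.20, 10.0)  # FPSQR: 20% of time, 10x faster
speedup_all_fp = amdahl_speedup(0.50, 1.6)  # all FP: 50% of time, 1.6x faster
print(round(speedup_fpsqr, 4), round(speedup_all_fp, 4))  # -> 1.2195 1.2308
```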
Amdahl's Law (cont.)
Exercise
Eg: Our program takes 10s to run on computer A, which has 400 MHz clock. We want it to run in 6s. The designer says that the clock rate can be increased, but it will cause the total number of cycles for the program to increase to 1.2 times the previous value. What is the minimum clock rate required to get the desired speedup ?
Solution
Using the formula ExTime = (Instructions/Program) × CPI / Clock rate:
Initial: 10 = (Instructions/Program) × CPI / 400 MHz
Final:   6 = (Instructions/Program) × 1.2 × CPI / CLK_min
⇒ 10/6 = CLK_min / (1.2 × 400 MHz)
⇒ CLK_min = (10/6) × 1.2 × 400 MHz = 800 MHz
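The solution's algebra can be re-checked numerically (units in MHz and seconds):

```python
# ExTime = cycles / clock rate; cycles grow by 1.2x and time must fall from 10 s to 6 s.
old_clock_mhz = 400.0
time_old, time_new = 10.0, 6.0
cycle_growth = 1.2

# From 10 = cycles_old / 400 and 6 = 1.2 * cycles_old / clk_min:
new_clock_mhz = cycle_growth * old_clock_mhz * (time_old / time_new)
print(round(new_clock_mhz))  # -> 800
```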
Exercise: Program runs in 100s. Multiplies = 80% of program. Designer M can improve speedup of multiply operations. Now, I am a user and I need to make MY program 5 times faster. How much speedup of multiply instructions should M achieve to allow me to reach my overall speedup goal ?
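Before solving the exercise outright, it is worth exploring what Amdahl's Law permits here: with 20% of the run time unaffected by the multiply improvement, the overall speedup is bounded above by 1/0.2 = 5, so a 5x goal can only be approached, never reached. A sketch of that limit:

```python
def overall_speedup(multiply_speedup: float) -> float:
    # Multiplies are 80% of the 100 s run; the other 20% is unaffected.
    return 1.0 / (0.2 + 0.8 / multiply_speedup)

for s in (10.0, 100.0, 1000.0):
    print(s, round(overall_speedup(s), 3))
# The speedup approaches, but never reaches, the limit 1 / 0.2 = 5.
```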
Summary, #1
• Designing to Last through Trends
Capacity Speed
Logic 2x in 3 years 2x in 3 years
DRAM 4x in 3 years 2x in 10 years
Disk 4x in 3 years 2x in 10 years
• 6yrs to graduate => 16X CPU speed, DRAM/Disk size
• Time to run the task
– Execution time, response time, latency
• Tasks per day, hour, week, sec, ns, …
– Throughput, bandwidth
• “X is n times faster than Y” means:
ExTime(Y) / ExTime(X) = Performance(X) / Performance(Y) = n
Summary, #2
Amdahl's Law:
Speedup_overall = ExTime_old / ExTime_new = 1 / ((1 - Fraction_enhanced) + Fraction_enhanced / Speedup_enhanced)
CPI Law:
CPU time = Seconds/Program = (Instructions/Program) × (Cycles/Instruction) × (Seconds/Cycle)
Execution time is the REAL measure of computer performance!
Good products are created when you have:
◦ Good benchmarks, good ways to summarize performance
Die cost goes roughly with die area^4
Can the PC industry support engineering/research investment?