Transcript of Introduction to Computer Architecture
cs.hac.ac.il/staff/martin/Architecture/00_arch_slides.pdf
1-1 · Dr. Martin Land · Introduction · Computer Architecture — Hadassah College — Spring 2019

Introduction to Computer Architecture
Computer Architecture
From Wikipedia, the free encyclopedia
In computer engineering, computer architecture is a set of
rules and methods that describe the functionality,
organization, and implementation of computer systems.
Some definitions of architecture define it as describing
the capabilities and programming model of a computer
but not a particular implementation. In other definitions
computer architecture involves instruction set
architecture design, microarchitecture design, logic
design, and implementation.
What is Computer Architecture
Translation:
computer architecture = { rules and methods | describe
Functionality — system capabilities and programming model
Organization — instruction set architecture, microarchitecture
Implementation — logic design
}
What is Computer Architecture
Computer Architecture — What rules and methods?

Performance
  Low run time — fast programs
  Low latency — no waiting between programs and operations
Low energy consumption
  Low electric bills; long battery life; no overheating
Market factors
  Low cost (in relation to realistic demand for devices); reliable manufacture and delivery; profitability
Computing Platform by Application

Workstation applications
  Office, basic number crunching, graphics, gaming
  A few sequential loop-oriented threads
  Typical CPU — Intel x86 (2 to 16 cores)
Mobile applications
  Low-power version of workstation
  Typical CPU — ARM (1 to 4 cores)
Online Transaction Processing (OLTP)
  Banking, order processing, inventory, student information systems
  Thousands of independent SQL transactions with memory latency
  Typical CPU — SPARC (64 to 256 cores)
Supercomputer applications
  Heavy number crunching, data mining
  Thousands of separable sequential loop-oriented threads
  Typical CPU — IBM Power (up to 512 Kcores)
Mainframe + Virtualization + Cloud

Mainframe
  120 CPU cores + 3840 GB RAM + 8 GB/s I/O + reliability
  Replaces 10 to 1000 servers
  Complex partitioning
    Allocate hardware subsystems as needed
    Multiple independent operating systems
Server virtualization
  Software over OS partitions hardware resources
  Multiple guest operating systems over OS
Cloud computing
  Provider sells standard system interface as a service
    Infrastructure as a Service, Platform as a Service, Software as a Service
  Customer sees the system specified in the contract
  Provider handles operations + administration + maintenance (OAM)
Introduction to Performance
Basic Definitions

Performance (ביצועים)
  Processing speed
Performance measures
  Response time (זמן תגובה) — elapsed time from start to finish of a defined task
  Run time (זמן ריצה) — response time for a start-to-finish program task
  Latency (זמן המתנה) — excess response time; depends on context
  Throughput (תפוקה) — number of defined tasks performed per unit time
  Speedup (שיפור)
S = (old run time) / (new run time);  S > 1 ⇒ new run time < old run time
Run Time and Clock Cycles

CPU is timed by a periodic signal called a clock
  Clock cycle (CC) — measured in seconds per cycle
  Clock rate — cycles per second, measured in Hz (Hertz)
  An instruction requires 1 or more clock cycles to process
Higher clock rate ⇒ shorter run time
Fewer clock cycles (at a constant clock rate) ⇒ shorter run time

Run time = (clock cycles to run program) × (seconds per clock cycle)
         = (clock cycles to run program) / (clock cycles per second)
Speedup and Clock Rate

Speedup follows from
  Higher clock rate
  Fewer program clock cycles
    Improvements to code
    Structural improvements in hardware
S = T_old / T_new
  = (program clock cycles_old × seconds per clock cycle_old) / (program clock cycles_new × seconds per clock cycle_new)
  = (program clock cycles_old / clock rate_old) / (program clock cycles_new / clock rate_new)
  = (program clock cycles_old × clock rate_new) / (program clock cycles_new × clock rate_old)
Factors Affecting Run Time

CPU hardware
  Hardware → average clock cycles (CC) required per instruction
Memory (RAM + cache)
  Quantity and organization affect data availability
Internal communication and I/O
  Speed and organization affect data availability
Operating system efficiency
  CPU devotes less time to dense OS code
  OS manages tasks/threads to keep hardware busy
Compiler
  Converts high-level language to machine code
  Optimized code runs faster
Special hardware
  Dedicated processors (graphics, memory management)
Application code
  Efficient algorithms, data structures, parallelization
Examples of Factors Affecting Performance
CPU Hardware Example — Multiple-Core Processors

N-core Symmetric Multiprocessor (SMP)
  N complete CPUs on one chip
  Divide work among N processors
Each CPU has multiple execution units (EU)
  ALU operates on integers
  FPU operates on float / double
  Vector processor operates on long registers
OS assigns threads to each core
  If program threads are separable
  If data structures are not too entangled

[Figure: dual-core processor — CPU 0 and CPU 1 each contain registers, an execution core (ALUs), and a cache; both share main memory and reach the I/O bus through a PCI bridge]
CPU Hardware Example — Vector Processor

Vector processor
  SIMD — Single Instruction Multiple Data
  Performs the same operation on 4, 8, or 16 bytes in parallel
  No carry/borrow between bytes
Example
  64-bit source (SRC) and destination (DEST) registers
  PARALLEL_ADD on 8 pairs of byte operands
    SRC0…7 + DEST0…7 → DEST0…7
    SRC8…15 + DEST8…15 → DEST8…15
    …
    SRC56…63 + DEST56…63 → DEST56…63
[Figure: SRC and DEST registers drawn as eight byte lanes (bits 63–56, 55–48, …, 7–0); corresponding byte lanes of SRC and DEST are added and the sums written back to DEST]
Memory Example — Hybrid Data Structure

Graphic array
  200 vertex points = 25 groups of 8 words
Hybrid data structure for efficient vector processing
  Coordinates and colors stored in separate data structures
  Structures handled in CONCURRENT threads on separate CPUs
Coordinates
  struct { float x[8], y[8], z[8] ; } H_xyz[25] ;
  8-word group loaded and processed as a vector on CPU 0
  Each loop updates 8 x-coordinates, then 8 y's, then 8 z's
Colors
  struct { float r[8], g[8], b[8] ; } H_rgb[25] ;
  8-word group loaded and processed as a vector on CPU 1
  Each loop updates 8 reds, then 8 greens, then 8 blues
Memory Example — Color Data Structure

Addressing in 32-bit processors
  Processor sends a 32-bit aligned address A (multiple of 4)
  Reads a 4-byte word — bytes from addresses A, A+1, A+2, A+3
  Access to an individual byte requires reading the entire dword
24-bit true color
  3 color bytes — red, green, blue
  2^8 = 256 levels per color (0x00 – 0xFF)
  Most 24-bit colors split between dwords
  Access to a pixel color ⇒ 2 memory cycles
32-bit true color
  Pad 24-bit color with a blank byte
  Align color data on 32-bit addresses
  One memory cycle per pixel
[Figure: packed 24-bit pixels — consecutive R G B triples straddle dword boundaries, so successive pixels cost 1, 2, 2, 1 memory cycles; padded 32-bit pixels — each R G B plus a blank byte fills exactly one dword, 1 cycle per pixel]
Compiler Efficiency Example

main()
{
    int i, j;
    for (i = 0; i < 10; i++){
        j = 2 * i;
    }
}

0000  MOV WORD PTR [BP-02],0000  ; i = 0
0005  CMP WORD PTR [BP-02],+0A
0009  JGE 0018                   ; break on i ≥ 10
000B  MOV AX,[BP-02]             ; AX ← i
000E  SHL AX,1                   ; AX ← 2 * AX
0010  MOV [BP-04],AX             ; j ← AX
0013  INC WORD PTR [BP-02]       ; i++
0016  JMP 0005                   ; loop
0018  RET
C code compiled inefficiently for Intel 8086 processor
Page from Intel 8086 Manual
80186/80188 HIGH-INTEGRATION 16-BIT MICROPROCESSORS,COPYRIGHT © INTEL CORPORATION, 1995
Clock Cycles per Instruction
Program Timing for 8086
Instruction 8086 Clock Cycles (CC)
MOV WORD PTR [BP-02],0000 MOV imm to r/m 4/13
start: CMP WORD PTR [BP-02],+0A CMP r/m,imm 3/10
JGE stop Jcc (not taken/taken) 4/13
MOV AX,[BP-02] MOV r/m to reg 2/9
SHL AX,1 Shift reg 2
MOV [BP-04],AX MOV reg to r/m 2/12
INC WORD PTR [BP-02] INC r/m 3/15
JMP start JMP 14
stop: RET RET 16
Program contains
  Loop control instructions
  ALU instructions
  Setup/takedown instructions (run once)
Instruction timings are given in the 8086 manual (in clock cycles)
Program Run Time
N = number of loop iterationsTotal clock cycles = 13 + N × 10 + (N – 1) × (4 + 9 + 2 + 12 + 15 + 14) + 13 + 16
= 66 × N – 14
For N = 11 (stop on i = 10), Total CC = 712
Instruction 8086 Clock Cycles (CC)
MOV WORD PTR [BP-02],0000 13 CC (runs once)
start: CMP WORD PTR [BP-02],+0A 10 CC on each loop
JGE stop 4 CC on all loops but last 13 CC on last
MOV AX,[BP-02] 9 CC on all loops but last
SHL AX,1 2 CC on all loops but last
MOV [BP-04],AX 12 CC on all loops but last
INC WORD PTR [BP-02] 15 CC on all loops but last
JMP start 14 CC on all loops but last
stop: RET 16 CC (runs once)
Example — More Efficient Compilation
Store Variables in Registers — Not Memory

Instruction          8086 Clock Cycles (CC)
MOV SI,0000          4 CC (runs once)
start: CMP SI,+0A    3 CC on each loop
JGE stop             4 CC on all loops but last; 13 CC on last
MOV AX,SI            2 CC on all loops but last
SHL AX,1             2 CC on all loops but last
MOV DI,AX            2 CC on all loops but last
INC SI               3 CC on all loops but last
JMP start            14 CC on all loops but last
stop: RET            16 CC (runs once)

Total clock cycles = 4 + N × 3 + (N – 1) × (4 + 2 + 2 + 2 + 3 + 14) + 13 + 16 = 30 × N + 6
For N = 11 (stop on i = 10), Total CC = 336

S = 712 / 336 ≈ 2.12

Using register variables requires a large number of registers
Example — Even More Efficient Compilation
Rebuild Loop

Instruction          8086 Clock Cycles
MOV SI,0000          MOV imm to reg         4
start: MOV AX,SI     MOV reg to reg         2
SHL AX,1             SHIFT reg              2
MOV DI,AX            MOV reg to reg         2
INC SI               INC reg                3
CMP SI,+0A           CMP reg,imm            3/10
JL start             Jcc (not taken/taken)  4/13
stop: RET            RET                    16

Total clock cycles = 4 + N × (2 + 2 + 2 + 3 + 3) + (N – 1) × 13 + 4 + 16 = 25 × N + 11
For N = 10 (stop on i = 10), Total CC = 261

S = 712 / 261 ≈ 2.73
Measuring Performance
Benchmarks

Definition
  Collection of programs for measurement and comparison of system performance
Requirements
  Standard and scientific
    Consistent results on repeated tests
    Consistent results for anyone repeating the tests
  Test the system in a realistic way
    Reflect statistically representative use of instruction types, data types, loop lengths, and OS and compiler conditions
  Summarize data so comparisons make sense
SPEC Benchmark

Programs for system performance measurement + comparison
  Standard + repeatable
  Tests the system under realistic conditions
  Summary score for easy comparison
  Results posted at http://www.spec.org/
Specific test suites
  Cint — CPU integer instructions
  Cfp — CPU FP instructions
  Performance as file server, web server, mail server, graphics
Updated every few years to reflect realistic conditions
  Based on current statistical distributions of computing tasks
  Current CPU test version — 2017; previous version — 2006
Reports speedup
  Run time compared with a standard machine
How SPEC Works

User runs n programs on the test machine
  Records run-time conditions
  Records each program's run time in seconds
SPEC provides run times on a reference machine
  Sun Fire V490 — 2100 MHz UltraSPARC-IV+ processors
  Powerful symmetric multiprocessing (SMP) server (2006 – 2014)
User calculates the speedup for each program
User calculates the geometric mean of the speedups
T_i^test (i = 1, 2, …, n) — run times on the test machine
T_i^ref (i = 1, 2, …, n) — run times on the reference machine

Speedup for program i:  S_i = T_i^ref / T_i^test

Score for the test machine = geometric mean of the speedups:
S_test = [ ∏_{i=1}^{n} (T_i^ref / T_i^test) ]^{1/n}

S(machine A compared to machine B) = S(machine A on ref) / S(machine B on ref)
Typical Reference Run Times — Cint2017 Programs

Program          Language  KLOC   Application                                                     Ref Run Time
600.perlbench_s  C           362  Perl interpreter                                                1773
602.gcc_s        C         1,304  GNU C compiler                                                  3982
605.mcf_s        C             3  Route planning                                                  4709
620.omnetpp_s    C++         134  Discrete event simulation — computer network                    1630
623.xalancbmk_s  C++         520  XML to HTML conversion via XSLT                                 1413
625.x264_s       C            96  Video compression                                               1770
631.deepsjeng_s  C++          10  Artificial intelligence: alpha-beta tree search (Chess)         1434
641.leela_s      C++          21  Artificial intelligence: Monte Carlo tree search (Go)           1706
648.exchange2_s  Fortran       1  Artificial intelligence: recursive solution generator (Sudoku)  2948
657.xz_s         C            33  General data compression                                        6188
KLOC = 1000 lines of code
Typical SPEC Report — 1

Base = standard configuration; Peak = specialist configuration

SPEC(R) CPU2017 Integer Speed Result — ASUSTeK Computer Inc.
ASUS RS700-E9(Z11PP-D24) Server System (2.70 GHz, Intel Xeon Gold 6150)
CPU2017 License: 9016                 Test date: Dec-2017
Test sponsor: ASUSTeK Computer Inc.   Hardware availability: Jul-2017
Tested by: ASUSTeK Computer Inc.      Software availability: Sep-2017

                  Base    Base      Base    Peak    Peak      Peak
Benchmarks        Thrds   Run Time  Ratio   Thrds   Run Time  Ratio
--------------- ------- --------- ------- ------- --------- -------
600.perlbench_s     72      286     6.22     72      239     7.42
602.gcc_s           72      423     9.42     72      413     9.65
605.mcf_s           72      426    11.1      72      421    11.2
620.omnetpp_s       72      257     6.35     72      248     6.58
623.xalancbmk_s     72      150     9.46     72      140    10.1
625.x264_s          72      150    11.8      72      150    11.8
631.deepsjeng_s     72      280     5.11     72      282     5.08
641.leela_s         72      393     4.34     72      392     4.36
648.exchange2_s     72      220    13.4      72      220    13.4
657.xz_s            72      280    22.1      72      277    22.3

SPECspeed2017_int_base  8.87
SPECspeed2017_int_peak  9.16
Typical SPEC Report — 2

HARDWARE
--------
CPU Name: Intel Xeon Gold 6150
Max MHz.: 3700    Nominal: 2700
Enabled: 36 cores, 2 chips
Orderable: 1, 2 chip(s)
Cache L1: 32 KB I + 32 KB D on chip per core
      L2: 1 MB I+D on chip per core
      L3: 24.75 MB I+D on chip per chip
      Other: None
Memory: 768 GB (24 x 32 GB 2Rx4 PC4-2666V-R)
Storage: 1 x 240 GB SATA SSD
Other: None

SOFTWARE
--------
OS: Red Hat Enterprise Linux Server release 7.3 (x86_64), Kernel 3.10.0-514.el7.x86_64
Compiler: C/C++: Version 18.0.0.128 of Intel C/C++ Compiler; Fortran: Version 18.0.0.128 of Intel Fortran Compiler
Parallel: Yes
Firmware: Version 0601 released Oct-2017
File System: xfs
System State: Run level 3 (multi-user)
Base Pointers: 64-bit
Peak Pointers: 32/64-bit
Other: jemalloc: jemalloc memory allocator library V5.0.1
Some Cint2017 Results

Processor                                  Clock (GHz)  Total Chips  Total Cores  Total Threads  Cint2017 Base  Cint2006 Base  Ratio
Intel Xeon Gold 6146                           3.2           2            24           24             10.1           83.0       8.21
Intel Xeon Gold 6146                           3.2           4            48           48              9.95          85.7       8.61
Intel Xeon Platinum 8153                       2.0           4            64           64              7.00          62.8       8.97
Intel Xeon Bronze 3104                         1.7           2            12           12              4.20          68.5      16.31
Intel Xeon Platinum 8180                       2.5           8           224          224              9.37          81.6       8.71
Intel Core 2 Duo E6850 (auto parallel)         3.0           1             2            2              —             19.9       —
Intel Core 2 Duo E6850 (no auto parallel)      3.0           1             2            1              —             18.7       —
Some Comments on Cint2017 Results

Auto parallel
  High-level Cint code is not threaded for parallel processing
  An auto-parallel compiler creates parallel threads using heuristics
  Provides limited speedup (or even degradation)
  All CPU results in the table use auto parallel except the last
Intel Xeon Gold 6146 with 3.2 GHz clock
  Fastest CPU in the Cint2017 tests
  2 chips (24 threads) slightly faster than 4 chips (48 threads)
    Communication between more threads can slow processing
  4 chips faster on Cint2006 (using different benchmark programs)
Intel Xeon Platinum 8153 with 2.0 GHz clock
  Cint with 64 threads = 7.00
  With a 3.2 GHz clock, expect Cint = 7 × 3.2 GHz / 2.0 GHz = 11.2
  Not much better than Gold 6146 with 24 threads
Core 2 Duo E6850 — old processor not tested on Cint2017
  Cint2006 with 1 thread (no auto parallel) = 18.7
  Cint2006 with 2 threads (auto parallel) = 19.9, a 6% speedup
Representative Cint2006 Results

Sponsor            Processor               Clock (GHz)  Auto Parallel  Total Chips  Total Cores  Total Threads  Base
Hypertechnologies  Intel Core i7-5960X         4.5          Yes             1            8             8        79.7
Supermicro         Intel Core i7-6700K         4.4          Yes             1            4             4        77.4
NEC                Intel Xeon E3-1270          3.6          Yes             1            4             4        74.2
Huawei             Intel Xeon E5-2699          2.2          Yes             2           44            44        74.0
Supermicro         Intel Core i5-6600          3.3          Yes             1            4             4        71.0
Dell               Intel Xeon E5-2699          2.2          Yes             2           44            88        70.5
Intel              Intel Core 2 Duo E6850      3.0          Yes             1            2             2        21.3
Intel              Intel Core 2 Duo E6850      3.0          No              1            2             1        20.2
Dell               Pentium 4 670               3.8          No              1            1             1        11.5
Intel              Intel Pentium M 780         2.3          No              1            1             1        10.7
Actual Sources of Performance Improvement

1978 — clock speed of the 8086 is 4 MHz
2008 — Xeon (clock speed of 4 GHz) is 100,000 times faster
  Clock speedup = 4 GHz / 4 MHz = 1000
  Structural speedup = 100,000 / 1000 = 100
    Reducing waiting time between operations
    Performing operations in parallel
No more clock speedup
  Pentium 4 clock rate (4 GHz) = 4 × Pentium III clock (1 GHz)
  Clock speedup 1 GHz → 4 GHz required structural slowdown
    Pentium 4 at 1 GHz is slower than Pentium III at 1 GHz
  Run a Pentium III at 4 GHz ⇒ melt the CPU
Clock speed → physical limit of about 10 GHz
  A signal takes a full clock cycle to cross a Pentium 4 at the speed of light
Future speedup comes from structural improvements
  More cores
  Better architectures
Instruction Set Architecture

Choosing Ingredients for a Computer Design
Chapter Overview

What is a processor
  von Neumann structure
Stages in processor design
  Instruction set
  Instruction structure
Operands (data)
  Data storage and memory types
Operations
Considerations in instruction set design
Complex instruction sets (CISC)
Implementing instructions in hardware
  Microcode
Von Neumann Architecture

Stored-program digital computer
  • Digital computation in the ALU
  • Programmable via a set of standard instructions
  • Internal storage of data
  • Internal storage of the program
  • Automatic input/output
  • Automatic sequencing of instruction execution by a decoder/controller

[Figure: ALU, input, memory, and output units joined by a data/instruction path; a controller drives each unit over a control path]

Von Neumann architecture
  Data and instructions stored in a single memory unit
Harvard architecture
  Data and instructions stored in separate memory units
Stages in Computer Design

Instruction Set Architecture (ISA)
  1. Look at the universe of problems to be solved
  2. Define atomic operations at the level of the system programmer
     • Small and orthogonal operations (each performs a different task)
     • Can be combined to perform any operation
  3. Specify the instruction set for the machine language
     • Choose a minimum set of basic operations
     • Not too many ways to solve the same problem
Implementation
  1. Design the machine as an implementation of the ISA
  2. Evaluate theoretical performance
  3. Identify performance problem areas
  4. Improve processor efficiency
Instruction Features

Instruction
  Description of an operation performed on operands
Operations
  Actions performed on data
Operands
  Sources — data inputs to the operation
  Destinations — data outputs from the operation
  Specified by
    Addressing mode — location of the data in the machine
    Data type — integer, long, floating point, decimal, string, constant, etc.
Instruction Set Architecture

General
  An instruction is an instance of a data structure
  The machine language is the range of that data structure
    Operation ∈ {legal actions}
    Operand ∈ {legal addressing modes}
  Operands describe sources and destinations
Typical machine instruction
  ADD destination, source_1, source_2
  destination ← source_1 + source_2

[Instruction format: Operation | Operand | Operand | … | Operand]
Instruction Definitions

Operations and operands
  unary — one source operand
  binary — two source operands
  n-ary — n source operands
Address specifier
  Describes the address format
    Addressing mode
    Operation model
Data width
            Intel      Non-Intel
  2 bytes   word       half-word
  4 bytes   dword      word
  8 bytes   quadword   doubleword
Memory Hierarchy

Long-term storage → Main memory (RAM) → Cache → Registers

Registers
  Memory locations inside the CPU
  Fast access to a small amount of information (current data)
  Organized by the CPU
Cache
  Memory located in or near the CPU
  Fast access to important data and instructions from RAM (next few instructions and data)
  Copy of a section of RAM
Main memory (RAM)
  Memory located outside the CPU
  Stores "all" data and instructions of running programs
  Organized by addresses
Long-term storage
  Memory located outside the CPU and RAM
  Stores data and instructions of "all" programs (all files and data)
  Organized by the OS
Register Naming

Registers are part of the CPU design
  Information stored in registers is called the architectural state
  Describes machine status and program status
General-purpose (GP) registers
  Hold data for instructions
  Width of the data is the width of a standard integer in the CPU
  Referenced by names or numbers
    Intel x86: EAX, EBX, ECX, EDX, ESI, EDI, EBP, ESP, EIP
    General: R0, R1, … , R127
Special-purpose registers
  Machine status registers
  Operating system registers
Flat Memory Organization

N-bit address space
  Physical address = A_{N-1} A_{N-2} … A_1 A_0
  Can form 2^N addresses, from 0 to 2^N – 1
  Every byte in RAM has an N-bit address
Processor refers to memory locations by physical RAM addresses
Processor stores memory addresses in N-bit address registers

[Figure: CPU with an N-bit address register pointing into a column of memory locations, one data byte per address from 00000…000 up to 11111…111]
Word Organization in Memory

Word order
  Little endian
    Least significant byte stored at the lower address — the word is stored "little end first"
    Example: the 4-byte word 69 b3 36 7d is stored as
      address:      03 02 01 00
      stored byte:  69 b3 36 7d
  Big endian
    Most significant byte stored at the lower address — the word is stored "big end first"
    Example: the 4-byte word 69 b3 36 7d is stored as
      address:      03 02 01 00
      stored byte:  7d 36 b3 69
Alignment
  Requirement that the address of an s-byte data unit be a multiple of s
  Formally — address A % s = 0
  8086 requires segments to be aligned on 16-byte boundaries
  IA-32 requires pages to be aligned on 4 KB boundaries
Specifying Operands

Immediate
  Constant = IMM = numerical value coded into the instruction
Register operands
  register name = a CPU storage location
  REGS[register name] = data stored in the register
  REGS[R3] = data stored in register R3 = 11223340
Memory operands
  address = a memory storage location
  MEM[address] = data stored in memory
  MEM[11223344] = data stored at address 11223344 = 45
Effective Address (EA) — pointer arithmetic
  REGS[R3] ← &(variable)
  MEM[REGS[R3]+4] = *(&(variable)+4) = *(REGS[R3]+4) = *(11223340+4) = 45

[Figure: register R3 holds the address 11223340; memory location 11223344 holds the value 45]
Structured Operation Models

Defines the basic arithmetic procedure and ALU organization

Stack
  Z = X + Y → push X; push Y; ADD; pop Z
  Push:      Pointer ← Pointer – d;  Stack[Pointer] ← memory/register
  Pop:       memory/register ← Stack[Pointer];  Pointer ← Pointer + d
  Binary Op: Stack[Pointer + d] ← Stack[Pointer + d] Op Stack[Pointer];  Pointer ← Pointer + d
  A stack ALU is used in Java bytecode
Accumulator
  All operations use accumulator A
  Z = X + Y → load X; add Y; store Z
  An accumulator ALU is used in hand calculators
General Register Operation Models

Register-memory model
  Operands can be stored in any REGISTER or MEMORY location
  Z = X + Y → load R1, X
              add R1, R1, Y
              store Z, R1
Register-register model (also called the LOAD-STORE model)
  MEMORY operands must be loaded to a REGISTER
  Z = X + Y → load R1, X
              load R2, Y
              add R1, R1, R2
              store Z, R1
  Easier to implement
  Statistically, most loaded operands are used more than once
Typical Addressing Modes

Mode                  Syntax       Memory Access                          Use
Register              R3           Regs[R3]                               Register data
Immediate             #3           3                                      Constant
Direct (absolute)     (1001)       Mem[1001]                              Static data
Register deferred     (R1)         Mem[Regs[R1]]                          Pointer
Displacement          100(R1)      Mem[100+Regs[R1]]                      Local variable
Indexed               (R1 + R2)    Mem[Regs[R1]+Regs[R2]]                 Array addressing
Memory indirect       @(R3)        Mem[Mem[Regs[R3]]]                     Pointer to pointer
Auto increment        (R2)+        Mem[Regs[R2]]; Regs[R2] ← Regs[R2]+d   Stack access
Auto decrement        -(R2)        Regs[R2] ← Regs[R2]-d; Mem[Regs[R2]]   Stack access
Scaled                100(R2)[R3]  Mem[100+Regs[R2]+Regs[R3]*d]           Indexing arrays
PC-relative           (PC)         Mem[PC+value]                          Data relative to program counter
PC-relative deferred  1001(PC)     Mem[PC+Mem[1001]]                      (instruction address)
Typical Operations

Data transfer
  Load (r ← m), store (m ← r), move (r/m ← r/m), convert data types
Arithmetic/logical (ALU)
  Integer arithmetic (+ – × ÷ compare shift) and logical (AND, OR, NOR, XOR)
Decimal
  Integer arithmetic on decimal numbers
Floating point (FPU)
  Floating point arithmetic (+ – × ÷ sqrt trig exp …)
String
  String move, string compare, string search
Control
  Conditional and unconditional branch, call/return, trap
Operating system
  System calls, virtual memory management instructions
Graphics
  Pixel operations, compression/decompression operations
Classic Computer Organization
Considerations in Classic Computer Design

Expensive memory
  RAM ~ $5000/MB wholesale in 1977
Poor compilers
  Non-optimizing
  Bad error messages
  Fast code written or optimized in assembly language
Semantic gap argument
  Belief among theoreticians in the 1960s and 1970s
  A computer language should imitate natural language
    Large vocabulary
    High redundancy
Implications for Machine Language

Machine language should be high level
  The language defines many instructions
  Each instruction performs a lot of work
  The language defines many addressing modes
Advantages
  Assembly language programming is easier
  Each stored instruction in memory is more powerful
  More power per instruction requires less memory
Classic Machine Design
CISC (Complex Instruction Set Computer)
300+ instruction types
15+ addressing modes
10+ data types
Automated procedure handling
Complex machine implementations
CISC

CISC was the conventional wisdom in the 1960s and 1970s
Mainframes
  Large and expensive computers
  Owned by big businesses and governments
  Manufacturers: IBM, Control Data, Burroughs, Honeywell
  From the 1960s to the 1980s, mainframes were CISC machines
Minicomputers
  Smaller computers for smaller organizations
  Manufacturers: Digital (PDP/VAX), Data General (Eclipse)
  Promoted academic computer science, smaller operating systems (Unix), computer networking
Microcomputers
  Intel designed the 8086 (1978) to work like a tiny VAX
  The PC is the only CISC computer still manufactured
Physical Implementation
[Figure: physical implementation — a system bus connects main memory (address and data ports), the register file, MAR, MDR, PC, IR, the decoder, the status word, and the ALU subsystem (inputs 1 and 2, output 3, with ALU-operation and result-flag control lines)]

PC — program counter; IR — instruction register; MAR — memory address register; MDR — memory data register
Registers

General registers
  R0 … Rn-1
  Register width is the standard integer width in the ISA
PC — program counter
  Holds the address of the next instruction to execute
IR — instruction register
  Holds the binary code of the instruction being executed
MAR — memory address register
  Holds the physical address for RAM access
MDR — memory data register
  Holds data during read/write memory operations
Device Communication

Bus: a vehicle for carrying many passengers
A device WRITES with OE = 1 and READS with IE = 1
The von Neumann controller distributes the OE and IE signals to the devices

[Figure: devices 1, 2, and 3 attached to the system bus, each with write (OE) and read (IE) control inputs; device A writes (OE = 1) while device B reads (IE = 1)]
Atomic Operations — Instruction Fetch

(1) MAR ← PC
(2) READ
(3) IR ← MDR
(4) PC ← PC + length(instruction)
(1) MAR ← PC
[Figure: the datapath diagram from the Physical Implementation slide, illustrating step (1)]
2-27Dr. Martin LandInstruction Set ArchitectureComputer Architecture — Hadassah College — Spring 2019
(2) READ
2-28Dr. Martin LandInstruction Set ArchitectureComputer Architecture — Hadassah College — Spring 2019
(3) IR ← MDR
2-29Dr. Martin LandInstruction Set ArchitectureComputer Architecture — Hadassah College — Spring 2019
(4) PC ← PC + length(instruction)
2-30Dr. Martin LandInstruction Set ArchitectureComputer Architecture — Hadassah College — Spring 2019
Atomic Operations
Instruction: SUB R1, R2, 100(R3)
ALU_IN ← R3
ALU ← 100
ADD
MAR ← OUT
READ
ALU_IN ← MDR
ALU ← R2
SUB
R1 ← OUT
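The microcode sequence above can be traced step by step in Python. A minimal sketch, not the slides' hardware: the ALU input latch names (ALU_IN, ALU_B, OUT) and the sample register/memory values are assumptions for illustration.

```python
# Trace of the microcode for SUB R1, R2, 100(R3): R1 <- R2 - Mem[100 + R3].
R = {"R1": 0, "R2": 50, "R3": 200}
memory = {300: 7}          # word at effective address 100 + R3 = 300

ALU_IN = R["R3"]           # ALU_IN <- R3
ALU_B = 100                # ALU <- 100
OUT = ALU_IN + ALU_B       # ADD: compute effective address
MAR = OUT                  # MAR <- OUT
MDR = memory[MAR]          # READ
ALU_IN = MDR               # ALU_IN <- MDR
ALU_B = R["R2"]            # ALU <- R2
OUT = ALU_B - ALU_IN       # SUB: R2 - memory operand
R["R1"] = OUT              # R1 <- OUT
print(R["R1"])             # 50 - 7 = 43
```

Note the ALU is used twice: once for address arithmetic and once for the subtraction itself, which is why the instruction needs nine microcode lines.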
2-31Dr. Martin LandInstruction Set ArchitectureComputer Architecture — Hadassah College — Spring 2019
SUB R1, R2, 100(R3): ALU_IN ← R3
2-32Dr. Martin LandInstruction Set ArchitectureComputer Architecture — Hadassah College — Spring 2019
SUB R1, R2, 100(R3): ALU ← 100
2-33Dr. Martin LandInstruction Set ArchitectureComputer Architecture — Hadassah College — Spring 2019
SUB R1, R2, 100(R3): ADD
2-34Dr. Martin LandInstruction Set ArchitectureComputer Architecture — Hadassah College — Spring 2019
SUB R1, R2, 100(R3): MAR ← OUT
2-35Dr. Martin LandInstruction Set ArchitectureComputer Architecture — Hadassah College — Spring 2019
SUB R1, R2, 100(R3): READ
2-36Dr. Martin LandInstruction Set ArchitectureComputer Architecture — Hadassah College — Spring 2019
SUB R1, R2, 100(R3): ALU_IN ← MDR
2-37Dr. Martin LandInstruction Set ArchitectureComputer Architecture — Hadassah College — Spring 2019
SUB R1, R2, 100(R3): ALU ← R2
2-38Dr. Martin LandInstruction Set ArchitectureComputer Architecture — Hadassah College — Spring 2019
SUB R1, R2, 100(R3): SUB
2-39Dr. Martin LandInstruction Set ArchitectureComputer Architecture — Hadassah College — Spring 2019
SUB R1, R2, 100(R3): R1 ← OUT
2-40Dr. Martin LandInstruction Set ArchitectureComputer Architecture — Hadassah College — Spring 2019
Decoding Machine Instructions
Machine Language Instruction
SUB R1, R2, 100(R3)
Microcode Instruction Sequence (Microprogram)
ALU_IN ← R3
ALU ← 100
ADD
MAR ← OUT
READ
ALU_IN ← MDR
ALU ← R2
SUB
R1 ← OUT
2-41Dr. Martin LandInstruction Set ArchitectureComputer Architecture — Hadassah College — Spring 2019
Microcode
One line of microprogram = implementation-level atomic operation
Atomic ⇒ operation must complete before servicing an interrupt
Decoder
"Interprets" machine language instruction into microprogram
Decoder ROM stores microprogram for every legal instruction
New instruction ⇒ add microprogram to decoder
Microprogram is sequenced by decoder
State machine for each instruction
Each state provides control signals to every subsystem
Each line of microcode is executed in the correct order
Based on work of Maurice V. Wilkes (1951)
2-42Dr. Martin LandInstruction Set ArchitectureComputer Architecture — Hadassah College — Spring 2019
Clock Cycles Per Instruction
Clock Cycle (CC)
Determined by length of longest microcode operation
One line of microcode finishes before next line begins
Most microcode lines finish in one clock cycle
Memory access may take several clock cycles
Clock Cycles Per Instruction
Machine language instruction implemented as lines of microcode
Clock cycles per instruction = number of microcode lines
Memory accesses may take extra clock cycles
Clock cycles for program = number of microcode lines executed by program

CC(program) = Σ over instruction types i of [ (instructions of type i) × CC(instruction of type i) ]

Instruction type - group of instructions with the same basic microcode structure
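The cycle-count sum can be sketched directly. A small illustration assuming an example instruction mix (the counts below happen to match the worked example later in the deck; the group names are mine):

```python
# Total clock cycles for a program: sum over instruction types of
# (number of instructions of that type) x (cycles per instruction of that type).
counts = {"integer": 5000, "load_store": 4000, "branch": 1000}
cc_per_type = {"integer": 4, "load_store": 8, "branch": 12}

total_cc = sum(counts[t] * cc_per_type[t] for t in counts)
print(total_cc)  # 5000*4 + 4000*8 + 1000*12 = 64000
```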
2-43Dr. Martin LandInstruction Set ArchitectureComputer Architecture — Hadassah College — Spring 2019
CISC Creates Anti‐CISC Revolution 1974 — 1977
Data General introduces Eclipse 32-bit CISC minicomputer
Digital (DEC) introduces VAX 32-bit CISC minicomputer
First serious inexpensive competition to mainframe computers
1977 — 1990
Serious computers became available to small organizations
UNIX developed as minicomputer operating system
TCP/IP developed to support networks of minicomputers
Computer Science emerged as separate academic discipline
Students needed topics for projects, theses, dissertations
1980 — 1990
Research results on minicomputer performance
CISC uses machine resources inefficiently
Most machine instructions are rarely used in programs
CISC machines run slowly to support unnecessary features
3-1Dr. Martin LandQuantitative Performance TheoryComputer Architecture — Hadassah College — Spring 2019
Quantitative Performance
Theory
3-2Dr. Martin LandQuantitative Performance TheoryComputer Architecture — Hadassah College — Spring 2019
Amdahl Equation for Multiprocessors
Symmetric Multiprocessor (SMP)
N equivalent microprocessors
Communication network between processors
Operating system runs on one or more processors
OS assigns tasks to processors by some scheduling system
Amdahl equation for SMP

S = 1 / [ (1 - F_P) + F_P / N ]

F_P = fraction of work that can be enhanced (parallelized)
N = speedup for the part to be enhanced (number of processors)

[Figure: quad-core SMP - four CPUs, each with Architectural State, Execution Core, and Cache, connected to Main Memory and a PCI Bridge to the I/O Bus]
3-3Dr. Martin LandQuantitative Performance TheoryComputer Architecture — Hadassah College — Spring 2019
Example of Amdahl Equation
For multiprocessor system
Typical small Dell file server
N = 8 Xeon processors
F_P = 80% of work can be parallelized

S = 1 / [ (1 - 0.80) + 0.80/8 ] = 1 / (0.20 + 0.10) ≈ 3.33

If number of processors were unlimited

S = 1 / [ (1 - F_P) + F_P/N ]  →  (N → ∞)  →  1 / (1 - F_P) = 1 / (1 - 0.80) = 5

Maximum speedup is 5
Future enhancements require more parallelization F_P
3-4Dr. Martin LandQuantitative Performance TheoryComputer Architecture — Hadassah College — Spring 2019
Basic Performance Measures
Run Time (זמן ריצה)
Elapsed time T from start to finish of a defined program task
Latency (זמן המתנה)
Excess response time - depends on context
Throughput (תפוקה)
Number of defined tasks performed per unit time
Enhancement (שינוי מבנה)
Change to system ⇒ new run time T'
Speedup (שיפור)

S = T / T'    S > 1 ⇒ T' < T

Throughput = 1 / (T + latency between tasks)
3-5Dr. Martin LandQuantitative Performance TheoryComputer Architecture — Hadassah College — Spring 2019
Processor Performance Enhancements
Hardware Enhancements
Clock rate
Instruction implementation
Memory organization
Number of processing elements (CPUs, ALUs, registers)
Software Enhancements
Run time optimizations
Compiler
Operating system
Enhanced Run Time
Run time = sum of partial run times
Enhancement ⇒ partial run times are longer, shorter, or unchanged
S > 1 ⇒ sum of new partial run times < sum of old partial run times
3-6Dr. Martin LandQuantitative Performance TheoryComputer Architecture — Hadassah College — Spring 2019
Run Time Enhancements
[Figure: total run time = partial run time that can be enhanced + partial run time that cannot be enhanced; after the enhancement, the enhanced partial run time shrinks while the unchanged partial run time stays the same, giving a shorter total run time]
3-7Dr. Martin LandQuantitative Performance TheoryComputer Architecture — Hadassah College — Spring 2019
Amdahl Equation
Definitions
T    total run time of a task
T'   total run time of a task after enhancement
te   partial run time that can be enhanced
te'  partial run time that can be enhanced, after enhancement
t0   partial run time that cannot be enhanced
Fe   fraction of run time that can be enhanced = te / T
Se   speedup of portion of run time that can be enhanced = te / te'

S = T / T' = T / (t0 + te') = T / (T - te + te/Se) = 1 / [ (1 - Fe) + Fe/Se ]

Amdahl equation expresses speedup in terms of relative quantities
Actual run times not needed if RELATIVE ENHANCEMENTS are known
[Figure: bar T = t0 + te before enhancement; bar T' = t0 + te' after enhancement]
3-8Dr. Martin LandQuantitative Performance TheoryComputer Architecture — Hadassah College — Spring 2019
Example of Amdahl Equation
Program partial run times
Total Run Time        T    = 400 ms
Float Instructions    tFP  = 300 ms
Integer Instructions  tINT = 100 ms

Enhance partial run time of Float Instructions
Total Run Time        T'   = 300 ms
Float Instructions    tFP' = 200 ms
Integer Instructions  tINT = 100 ms

Speedup from actual run times

S = T / T' = 400 ms / 300 ms = 4/3 ≈ 1.33

Speedup from relative enhancements

Fe = tFP / T = 300 ms / 400 ms = 3/4 = 75%
Se = tFP / tFP' = 300 ms / 200 ms = 3/2 = 1.50

S = 1 / [ (1 - Fe) + Fe/Se ] = 1 / [ (1 - 3/4) + (3/4)/(3/2) ] = 1 / (1/4 + 1/2) = 4/3 ≈ 1.33
3-9Dr. Martin LandQuantitative Performance TheoryComputer Architecture — Hadassah College — Spring 2019
Application of Amdahl Equation
On some CPU
Float (FP) instructions account for 50% of total run time
Square root (FP) accounts for 20% of total run time
Choose between two alternative enhancements
1. Speedup of Se = 2 for all FP instructions
2. Speedup of Se = 10 for square root instruction

Enhancement 1

S1 = 1 / [ (1 - Fe) + Fe/Se ] = 1 / [ (1 - 0.50) + 0.50/2 ] = 1 / (0.50 + 0.25) ≈ 1.33 ⇒ 33% speedup

Enhancement 2

S2 = 1 / [ (1 - Fe) + Fe/Se ] = 1 / [ (1 - 0.20) + 0.20/10 ] = 1 / (0.80 + 0.02) ≈ 1.22 ⇒ 22% speedup
3-10Dr. Martin LandQuantitative Performance TheoryComputer Architecture — Hadassah College — Spring 2019
Generalized Amdahl Equation
New definitions
td   portion of run time that is degraded
td'  portion of run time that is degraded, after degradation
Fd   fraction of run time that is degraded = td / T
Sd   "speedup" of portion of run time that is degraded = td / td'  (Sd < 1 for a degradation)

S = T / T' = T / (t0 + te' + td') = 1 / [ (1 - Fe - Fd) + Fe/Se + Fd/Sd ]

Result of reasonable architectural change
Enhancements to most features
Degradations to some features
Overall enhancement
[Figure: bar T = t0 + te + td before change; bar T' = t0 + te' + td' after change]
3-11Dr. Martin LandQuantitative Performance TheoryComputer Architecture — Hadassah College — Spring 2019
Amdahl's "Law"
To make good architectural improvements
Focus on enhancements that positively affect most features
Ignore degradations that negatively affect few features
Example — simple "RISC" processor
94% of run time is 5 times faster than on a CISC processor
1% of run time is 10 times slower than on a CISC processor
5% of run time is the same as on a CISC processor
This RISC processor is (overall) about 3 times faster than CISC
Even though some operations are slower

S = 1 / [ (1 - Fe - Fd) + Fe/Se + Fd/Sd ]
  = 1 / [ (1 - 0.94 - 0.01) + 0.94/5 + 0.01 × 10 ]
  = 1 / (0.05 + 0.19 + 0.10) ≈ 2.94
3-12Dr. Martin LandQuantitative Performance TheoryComputer Architecture — Hadassah College — Spring 2019
Detailed Analysis of CPU Run Times
Amdahl equation requires relative run time data
Run time data requires measurements on running programs
Measurements on running programs require CPU implementation
CPU analysis predicts run time without building CPU
Assumptions:
Instructions can be grouped together according to resource usage
Example — ADD R1, R2, R3 and SUB R1, R2, R3
All instructions in a group run in same number of clock cycles
Every clock cycle measures same unit of time
Instruction run time = clock cycle time × number of clock cycles
Group run time = instruction run time × instructions in group
Total run time = sum of instruction group run times
3-13Dr. Martin LandQuantitative Performance TheoryComputer Architecture — Hadassah College — Spring 2019
Definitions
T      total run time of program
ti     total run time of instructions in group i
ICi    number of instructions in group i (Instruction Count)
CPIi   number of clock cycles to run 1 instruction in group i (Cycles Per Instruction)
Ni     number of clock cycles to run all instructions in group i
τ      seconds per clock cycle
R      clock rate = clock frequency = clock cycles per second = 1/τ, measured in Hertz (Hz)
IC     total number of instructions in program
N      total number of clock cycles to run program
CPI    average number of clock cycles per instruction for the program
quantity'  new value of quantity after architectural change
3-14Dr. Martin LandQuantitative Performance TheoryComputer Architecture — Hadassah College — Spring 2019
CPU Equation
Clock cycles to run all instructions of type i

Ni = (instructions of type i) × (clock cycles per instruction of type i) = ICi × CPIi

Total clock cycles to run all instructions in program

N = Σ over all groups i of Ni = Σi ICi × CPIi

Average number of clock cycles per instruction for program

CPI = N / IC = (total clock cycles to run program) / (total instructions in program)
    = (1/IC) × Σi Ni = (1/IC) × Σi ICi × CPIi = Σi (ICi/IC) × CPIi

The ratio ICi/IC is the proportion (percent) of instructions in group i, with Σi ICi/IC = 1
CPI is a weighted average
3-15Dr. Martin LandQuantitative Performance TheoryComputer Architecture — Hadassah College — Spring 2019
Example of CPU Equation
Program distribution

Instruction Type i   ICi      CPIi
Integer              5,000    4
Load / Store         4,000    8
Branch               1,000    12

Instruction Type i   ICi / IC             Ni = ICi × CPIi
Integer              5000/10000 = 50%     4 × 5000 = 20,000 cycles
Load / Store         4000/10000 = 40%     8 × 4000 = 32,000 cycles
Branch               1000/10000 = 10%     12 × 1000 = 12,000 cycles

IC = ICint + ICbranch + ICload/store = 5,000 + 1,000 + 4,000 = 10,000 instructions

N = 20,000 + 12,000 + 32,000 = 64,000 cycles

CPI = N / IC = 64,000 cycles / 10,000 instructions = 6.4 cycles per instruction

CPI = Σi (ICi/IC) × CPIi = 4 × 0.50 + 12 × 0.10 + 8 × 0.40 = 6.4 cycles per instruction
3-16Dr. Martin LandQuantitative Performance TheoryComputer Architecture — Hadassah College — Spring 2019
CPU Run Time
Run time of one instruction of type i

CPIi × τ    (clock cycles per instruction of type i × seconds per clock cycle)

Run time for all instructions of type i

ti = ICi × CPIi × τ    (instructions of type i × clock cycles per instruction × seconds per clock cycle)

Total run time for program

T = Σ over all groups i of ti = Σi ICi × CPIi × τ = [ Σi (ICi/IC) × CPIi ] × IC × τ

So

T = CPI × IC × τ    (clock cycles per instruction × number of instructions × seconds per clock cycle)
3-17Dr. Martin LandQuantitative Performance TheoryComputer Architecture — Hadassah College — Spring 2019
CPU Run Time — Example
For a certain CPU
Instructions in a typical program can be grouped as
50% integer ALU instructions that run in 8 clock cycles
10% float ALU instructions that run in 20 clock cycles
20% load instructions that run in 10 clock cycles
10% store instructions that run in 15 clock cycles
10% branch instructions that run in 10 clock cycles
The clock speed is 100 MHz
A typical program runs 1,000,000 instructions
Running 500,000 ALU instructions, 100,000 FP instructions, 200,000 loads, …
The average number of cycles per instruction is

CPI = Σi (ICi/IC) × CPIi = 8 × 0.5 + 20 × 0.1 + 10 × 0.2 + 15 × 0.1 + 10 × 0.1 = 10.5

The typical program runs in

T = CPI × IC × τ = CPI × IC / R = (10.5 cycles/instruction × 10^6 instructions) / (10^8 Hz) = 0.105 seconds
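The run-time calculation follows directly from T = CPI × IC / R. A sketch with the example's mix:

```python
# Run time from the CPU equation: T = CPI x IC x tau = CPI x IC / R.
mix = [(0.5, 8), (0.1, 20), (0.2, 10), (0.1, 15), (0.1, 10)]  # (fraction, CPIi)
cpi = sum(f * c for f, c in mix)       # weighted average CPI
IC = 1_000_000                         # instructions in the program
R = 100e6                              # 100 MHz clock
T = cpi * IC / R
print(cpi, T)  # ~10.5 cycles/instruction, ~0.105 seconds
```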
3-18Dr. Martin LandQuantitative Performance TheoryComputer Architecture — Hadassah College — Spring 2019
C Code to Runtime Example — 1
High level code

int x = 0 , n = 0 , a[5] ;
while ( n < 5 )
{
    x = x + a[n] ;
    n++ ;
}

compile + optimize ⇒ Assembly program

1000 MOV R1, 0           load   1
1002 MOV R2, 2000        load   1    13%
1004 ADD R1, R1, (R2)+   ALUAI  5    29%
1008 CMP R2, 2020        ALU    5    29%
1012 JL  1004            JMP    5    29%
                         IC = 17   100%
3-19Dr. Martin LandQuantitative Performance TheoryComputer Architecture — Hadassah College — Spring 2019
C Code to Runtime Example — 2
Assembly code
ADD R1, R1, (R2)+
interpret to microcode ⇒ Microprogram
ALU_IN, MAR ← R2
ALU ← 4
ADD
R2 ← ALU_OUT
READ
ALU_IN ← MDR
ALU ← R1
ADD
R1 ← ALU_OUT

CPI(ALU-autoinc) = 9
3-20Dr. Martin LandQuantitative Performance TheoryComputer Architecture — Hadassah College — Spring 2019
C Code to Runtime Example — 3
Assembly program

                              type   ICi/IC  CPIi
1000 MOV R1, 0                load
1002 MOV R2, 2000             load   13%     2
1004 ADD R1, R1, (R2)+        ALUAI  29%     9
1008 CMP R2, 2020             ALU    29%     3
1012 JL  1004                 JMP    29%     12
                              IC = 17  100%

Average CPI

CPI = Σi (ICi/IC) × CPIi = 2 × 0.13 + 9 × 0.29 + 3 × 0.29 + 12 × 0.29 ≈ 7.2

Total clock cycles

N = CPI × IC = 7.2 × 17 ≈ 122

Run time with 1 GHz clock rate

T = N × τ = 122 × 10^-9 seconds = 1.22 × 10^-7 seconds = 0.122 microseconds
3-21Dr. Martin LandQuantitative Performance TheoryComputer Architecture — Hadassah College — Spring 2019
Applying CPU Equation
1. Run time before enhancement: calculate T = CPI × IC × τ
2. Characterize the enhancement: CPI', IC', τ'
3. Run time after enhancement: calculate T' = CPI' × IC' × τ'
4. Speedup:

S = T / T' = (CPI × IC × τ) / (CPI' × IC' × τ')
3-22Dr. Martin LandQuantitative Performance TheoryComputer Architecture — Hadassah College — Spring 2019
CPU Equation — Example 2
For a certain CPU
25% of all instructions in programs are float (ICFP / IC = 0.25)
FP group includes ADD, SUB, MULT, DIV, SQRT
Average FP instruction runs in 4 clock cycles (CPIFP = 4)
2% of all instructions in programs are square root (ICSQRT / IC = 0.02)
SQRT (FP) instruction runs in 20 cycles (CPISQRT = 20)
Average CPI for all other instructions in program is 4/3 clock cycles
(ICother / IC = 1 - 0.25 = 0.75, CPIother = 4/3)
Average cycles per instruction

CPI = Σi (ICi/IC) × CPIi = 4 × 0.25 + (4/3) × (1 - 0.25) = 2.00
3-23Dr. Martin LandQuantitative Performance TheoryComputer Architecture — Hadassah College — Spring 2019
Example 2 — 2Two possible enhancements
1. Improve performance of all FP instructionsEnhance average CPIFP = 4 cycles to CPIFP' = 2 cyclesNo change in program ⇒ ICi' = ICi for all instruction typesNo change to clock rate ⇒ τ' = τ
2. Improve performance of SQRT (FP) instructionEnhance CPISQRT = 20 cycles to CPISQRT' = 2 cyclesNo change in program ⇒ ICi' = ICi for all instruction typesNo change to clock rate ⇒ τ' = τ
To evaluate enhancements, must find CPI'
3-24Dr. Martin LandQuantitative Performance TheoryComputer Architecture — Hadassah College — Spring 2019
Example 2 — 3
Enhancement 1
Improve average FP from CPIFP = 4 cycles to CPIFP' = 2 cycles

CPI' = Σi (ICi'/IC') × CPIi'
     = CPIFP' × (ICFP/IC) + CPIother × (ICother/IC)
     = 2 × 0.25 + (4/3) × (1 - 0.25)
     = 1.50

S = T / T' = (CPI × IC × τ) / (CPI' × IC' × τ') = CPI / CPI' = 2.00 / 1.50 ≈ 1.33
3-25Dr. Martin LandQuantitative Performance TheoryComputer Architecture — Hadassah College — Spring 2019
Example 2 — 4
Enhancement 2
Improve square root (FP) from CPISQRT = 20 cycles to CPISQRT' = 2 cycles
Must separate into 3 instruction groups
FP/SQRT = FP group without SQRT = ADD, SUB, MULT, DIV
SQRT
All other instructions
First calculate CPIFP/SQRT from CPIFP, CPISQRT, ICFP / IC, ICSQRT / IC

CPI  = CPIFP/SQRT × (ICFP/SQRT / IC) + CPISQRT  × (ICSQRT / IC) + CPIother × (ICother / IC)
CPI' = CPIFP/SQRT × (ICFP/SQRT / IC) + CPISQRT' × (ICSQRT / IC) + CPIother × (ICother / IC)

ICFP / IC = 25%, ICSQRT / IC = 2%  ⇒  ICFP/SQRT / IC = 25% - 2% = 23%
3-26Dr. Martin LandQuantitative Performance TheoryComputer Architecture — Hadassah College — Spring 2019
Example 2 — 5
Average CPI for the FP group (with SQRT)

CPIFP = NFP / ICFP = ( Σ over k in FP of CPIk × ICk ) / ICFP = Σ over k in FP of CPIk × (ICk / ICFP)

where NFP = total cycles for the group and ICFP = instructions in the group

Average CPI for the FP group without SQRT

CPIFP/SQRT = NFP/SQRT / ICFP/SQRT = Σ over k in FP/SQRT of CPIk × (ICk / ICFP/SQRT)
3-27Dr. Martin LandQuantitative Performance TheoryComputer Architecture — Hadassah College — Spring 2019
Example 2 — 6
Split the FP sum into its FP/SQRT part and its SQRT part

CPIFP = Σ over k in FP of CPIk × (ICk / ICFP)
      = Σ over k in FP/SQRT of CPIk × (ICk / ICFP) + CPISQRT × (ICSQRT / ICFP)
      = (ICFP/SQRT / ICFP) × CPIFP/SQRT + (ICSQRT / ICFP) × CPISQRT

Solve for CPIFP/SQRT

4 = [ (0.25 - 0.02) / 0.25 ] × CPIFP/SQRT + (0.02 / 0.25) × 20  ⇒  CPIFP/SQRT ≈ 2.61
3-28Dr. Martin LandQuantitative Performance TheoryComputer Architecture — Hadassah College — Spring 2019
Example 2 — 7
Speedup for Enhancement 2

CPI' = Σi (ICi'/IC') × CPIi'
     = CPIFP/SQRT × (ICFP/SQRT / IC) + CPISQRT' × (ICSQRT / IC) + CPIother × (ICother / IC)
     = 2.61 × 0.23 + 2 × 0.02 + (4/3) × (1 - 0.25)
     = 1.64

S = T / T' = (CPI × IC × τ) / (CPI' × IC' × τ') = CPI / CPI' = 2.00 / 1.64 ≈ 1.22
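Enhancement 2 can be verified numerically: recover the CPI of the FP-without-SQRT group, then recompute CPI'. A sketch using the example's fractions (variable names are mine):

```python
# Enhancement 2: speed up only SQRT (20 -> 2 cycles) and recompute CPI.
cpi_fp, cpi_sqrt, cpi_other = 4.0, 20.0, 4.0 / 3.0
f_fp, f_sqrt = 0.25, 0.02             # fractions of all instructions
f_other = 1.0 - f_fp                  # 0.75
f_fp_no_sqrt = f_fp - f_sqrt          # 0.23

# CPI_FP = (f_fp_no_sqrt/f_fp)*CPI_FP/SQRT + (f_sqrt/f_fp)*CPI_SQRT, solved:
cpi_fp_no_sqrt = (cpi_fp - (f_sqrt / f_fp) * cpi_sqrt) / (f_fp_no_sqrt / f_fp)

cpi = cpi_fp * f_fp + cpi_other * f_other                            # 2.00
cpi_new = cpi_fp_no_sqrt * f_fp_no_sqrt + 2.0 * f_sqrt + cpi_other * f_other
print(round(cpi_fp_no_sqrt, 2), round(cpi_new, 2), round(cpi / cpi_new, 2))
```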
3-29Dr. Martin LandQuantitative Performance TheoryComputer Architecture — Hadassah College — Spring 2019
Example 2 — 8
Trick — technique to avoid calculating CPIFP/SQRT

CPI' = CPI - (CPI - CPI')
     = CPI - [ Σi (ICi/IC) × CPIi - Σi (ICi'/IC') × CPIi' ]

If ICi' = ICi for all groups, then combine the terms as

CPI' = CPI - Σi (CPIi - CPIi') × (ICi/IC)

Only the changed group contributes, so

CPI' = CPI - (CPISQRT - CPISQRT') × (ICSQRT / IC)
     = 2.00 - (20 - 2) × 0.02
     = 1.64
3-30Dr. Martin LandQuantitative Performance TheoryComputer Architecture — Hadassah College — Spring 2019
Example 2 — 9
5. Speedups
Enhancement to all FP — S = 1.33
Enhancement to square root — S = 1.22
Results identical to analysis by Amdahl equation
Can derive inputs to Amdahl equation from CPU analysis

1. Fe = tFP / T = (CPIFP × ICFP × τ) / (CPI × IC × τ) = (4 × 0.25) / 2.00 = 50%
   Se = tFP / tFP' = (CPIFP × ICFP × τ) / (CPIFP' × ICFP × τ) = CPIFP / CPIFP' = 4 / 2 = 2

2. Fe = tSQRT / T = (CPISQRT × ICSQRT × τ) / (CPI × IC × τ) = (20 × 0.02) / 2.00 = 20%
   Se = tSQRT / tSQRT' = CPISQRT / CPISQRT' = 20 / 2 = 10
3-31Dr. Martin LandQuantitative Performance TheoryComputer Architecture — Hadassah College — Spring 2019
Changing Instruction Mix — 1
Program distribution

Instruction Type i   ICi      ICi / IC              CPIi
Integer              5,000    5000/10000 = 50%      4
Load / Store         4,000    4000/10000 = 40%      8
Branch               1,000    1000/10000 = 10%      12
Total                10,000   10000/10000 = 100%

CPI = Σi (ICi/IC) × CPIi = 4 × 0.50 + 12 × 0.10 + 8 × 0.40 = 6.4 cycles per instruction

New program distribution

Instruction Type i   ICi      ICi / IC              CPIi
Integer              3,000    3000/8000 = 37.5%     4
Load / Store         4,000    4000/8000 = 50%       8
Branch               1,000    1000/8000 = 12.5%     12
Total                8,000    8000/8000 = 100%

CPI = Σi (ICi/IC) × CPIi = 4 × 0.375 + 12 × 0.125 + 8 × 0.50 = 7.0 cycles per instruction
3-32Dr. Martin LandQuantitative Performance TheoryComputer Architecture — Hadassah College — Spring 2019
Changing Instruction Mix — 2
Speedup

S = T / T' = (CPI × IC × τ) / (CPI' × IC' × τ') = (6.4 × 10000 × τ) / (7.0 × 8000 × τ) ≈ 1.14
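When a compiler change alters the mix, both CPI and IC change, and the speedup comes from their product. A sketch using the two distributions above:

```python
# Speedup from a changed instruction mix: compare CPI x IC before and after.
old = [(5000, 4), (4000, 8), (1000, 12)]   # (ICi, CPIi)
new = [(3000, 4), (4000, 8), (1000, 12)]   # 2000 fewer integer instructions

def cpi_ic(groups):
    ic = sum(n for n, _ in groups)
    return sum(n * c for n, c in groups) / ic, ic

cpi1, ic1 = cpi_ic(old)    # 6.4, 10000
cpi2, ic2 = cpi_ic(new)    # 7.0, 8000
S = (cpi1 * ic1) / (cpi2 * ic2)
print(cpi1, cpi2, round(S, 2))  # 6.4 7.0 1.14
```

CPI went up, yet the program got faster: the new mix runs fewer total cycles.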
3-33Dr. Martin LandQuantitative Performance TheoryComputer Architecture — Hadassah College — Spring 2019
The Instructions Per Second Myth
Measures often used to describe computer power
MIPS = million instructions per second
FLOPS = floating point operations per second
Neither gives fair comparison
Example
CPU-1 and CPU-2
Run ALU instructions in 1 cycle and others in 2 cycles
Have clock speed of 1 GHz
CPU-1 compiler produces 50% ALU instructions and 50% other
CPU-2 compiler produces 25% fewer ALU instructions than CPU-1

MIPS = IC / (10^6 × run time) = IC / (10^6 × CPI × IC × τ) = R / (10^6 × CPI)

CPI1 = 1 × 0.50 + 2 × 0.50 = 1.50  ⇒  MIPS1 = 10^9 Hz / (10^6 × 1.50) ≈ 667 million instructions/sec
3-34Dr. Martin LandQuantitative Performance TheoryComputer Architecture — Hadassah College — Spring 2019
MIPS — 2
For CPU-2

IC2,ALU = 0.75 × IC1,ALU = 0.75 × 0.50 × IC1 = 0.375 × IC1
IC2 = IC2,ALU + IC2,other = 0.375 × IC1 + 0.50 × IC1 = 0.875 × IC1

IC2,ALU / IC2 = 0.375 / 0.875 ≈ 0.43      IC2,other / IC2 = 0.50 / 0.875 ≈ 0.57

CPI2 = 1 × 0.43 + 2 × 0.57 = 1.57  ⇒  MIPS2 = 10^9 Hz / (10^6 × 1.57) ≈ 637 million instructions/sec < MIPS1

MIPS2 / MIPS1 = 637 / 667 = 0.955
3-35Dr. Martin LandQuantitative Performance TheoryComputer Architecture — Hadassah College — Spring 2019
MIPS — 3
Run time comparison

S = T1 / T2 = (CPI1 × IC1 × τ1) / (CPI2 × IC2 × τ2) = (1.50 × IC1) / (1.57 × 0.875 × IC1) ≈ 1.09

MIPS is about 5% lower for CPU-2 than CPU-1
CPU-2 is about 9% faster than CPU-1
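The whole MIPS comparison fits in a few lines. A sketch normalizing instruction counts to CPU-1 (slight differences from the slide's 637 come from the slide rounding CPI2 to 1.57 before dividing):

```python
# The MIPS myth in numbers: CPU-2 has lower MIPS yet shorter run time.
R = 1e9                                   # 1 GHz, both CPUs
ic1_alu, ic1_other = 0.50, 0.50           # per CPU-1 instruction
ic2_alu = 0.75 * ic1_alu                  # 25% fewer ALU instructions
ic2 = ic2_alu + ic1_other                 # 0.875 of CPU-1's count

cpi1 = 1 * ic1_alu + 2 * ic1_other                    # 1.50
cpi2 = 1 * (ic2_alu / ic2) + 2 * (ic1_other / ic2)    # ~1.571
mips1, mips2 = R / (1e6 * cpi1), R / (1e6 * cpi2)
speedup = (cpi1 * 1.0) / (cpi2 * ic2)                 # T1 / T2, same clock
print(round(mips1), round(mips2), round(speedup, 2))
```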
3-36Dr. Martin LandQuantitative Performance TheoryComputer Architecture — Hadassah College — Spring 2019
Replacing Instruction Types
Instruction count
IC = IC1 + IC2 + ... + ICn
Examples
Type 1 = ALU
Type 2 = Conditional Branch
New Instruction count
Replace 2 ALU instructions + 1 Branch
DEC CX
CMP CX, 0
JNZ target
with 1 new instruction
LOOP target
IC' = IC1' + IC2' + ... + ICn'
3-37Dr. Martin LandQuantitative Performance TheoryComputer Architecture — Hadassah College — Spring 2019
Example — Replacing Instructions
A certain CPU has no floating point unit (FPU)
Performs FP calculations by EMULATION
Converts FP operations to integer operations
Example
(2.165 × 10^4) × (3.247 × 10^-3) → 2165 × 3247, exp = (4 - 3) + (-3 - 3)
Instruction distribution

Type i   ICi / IC   CPIi
ALU      75%        1
Load     10%        2
Store    5%         2
Branch   10%        2

CPI = Σi (ICi/IC) × CPIi = 1 × 0.75 + 2 × (0.10 + 0.05 + 0.10) = 1.25
3-38Dr. Martin LandQuantitative Performance TheoryComputer Architecture — Hadassah College — Spring 2019
Replacing Instructions — 2
Enhance CPU with FPU
Replace ALU instructions that emulate FP with new FPU instructions
2/3 of old ALU instructions emulate FP instructions
1 new FPU instruction replaces 10 old ALU emulation instructions
New FPU instructions run in 4 clock cycles

ALU group (75%, CPI 1) splits into
ALUint = 1/3 × 75% = 25%
ALUemulation = 2/3 × 75% = 50%

IC'ALU = 0.25 IC
IC'FPU = 1/10 × 0.50 IC = 0.05 IC
IC'load = ICload = 0.10 IC
IC'store = ICstore = 0.05 IC
IC'branch = ICbranch = 0.10 IC

IC' = 0.25 IC + 0.05 IC + 0.10 IC + 0.05 IC + 0.10 IC = 0.55 IC
3-39Dr. Martin LandQuantitative Performance TheoryComputer Architecture — Hadassah College — Spring 2019
Replacing Instructions — 3
New instruction distribution

Type i   ICi' / IC'   CPIi'
ALU      0.25/0.55    1
FPU      0.05/0.55    4
Load     0.10/0.55    2
Store    0.05/0.55    2
Branch   0.10/0.55    2

CPI' = Σi (ICi'/IC') × CPIi'
     = 1 × (0.25/0.55) + 4 × (0.05/0.55) + 2 × (0.10/0.55 + 0.05/0.55 + 0.10/0.55)
     = 0.95/0.55

S = T / T' = (CPI × IC × τ) / (CPI' × IC' × τ') = (1.25 × IC) / ((0.95/0.55) × 0.55 × IC) = 1.25/0.95 ≈ 1.32

Note CPI' = 1.73 > CPI = 1.25, yet the program is faster because IC' < IC
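The trade is easy to check numerically: the replacement raises CPI but cuts the instruction count by nearly half. A sketch using the example's fractions:

```python
# Replacing 10-instruction FP emulation sequences with single FPU instructions.
old_cpi = 1 * 0.75 + 2 * (0.10 + 0.05 + 0.10)          # 1.25, over IC instructions

ic_new = {"alu": 0.25, "fpu": 0.05, "load": 0.10, "store": 0.05, "branch": 0.10}
cpi_tbl = {"alu": 1, "fpu": 4, "load": 2, "store": 2, "branch": 2}
IC_new = sum(ic_new.values())                          # 0.55 of the old IC
new_cpi = sum(ic_new[t] * cpi_tbl[t] for t in ic_new) / IC_new  # 0.95/0.55

S = old_cpi / (new_cpi * IC_new)                       # same clock, normalized IC
print(round(new_cpi, 2), round(S, 2))  # 1.73 1.32
```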
3-40Dr. Martin LandQuantitative Performance TheoryComputer Architecture — Hadassah College — Spring 2019
Load‐Store versus Register‐Memory
CPU-1 is a load-store machine
ALU operands must come from registers
Memory operand
Loaded to register before ALU operation
Stored to memory after ALU operation
Instruction distribution

Type i   ICi / IC   CPIi
ALU      40%        4
Load     25%        5
Store    15%        4
Branch   20%        4

CPI = Σi (ICi/IC) × CPIi = 5 × 0.25 + 4 × 0.75 = 4.25

Possible enhancement
25% of ALU memory operands used in only 1 ALU operation
Can register-memory ALU operations improve performance?
3-41Dr. Martin LandQuantitative Performance TheoryComputer Architecture — Hadassah College — Spring 2019
Load‐Store versus Register‐MemoryCPU-2 is an "ideal" register-memory machine
ALU operands may come from register or memory75% of memory operands
Used in multiple ALU operationsPerfect compiler loads "multiple" memory operands to registers
25% of ALU memory operands Used in only a single ALU operationPerfect compiler never loads "single" memory operands to registers
Convert CPU-1 to CPU-2Split ALU operations into ALUmulti and ALUsingle
Replace ALUsingle with ALUregister-memory
Cancel 1 register load for every ALUsingle
3-42Dr. Martin LandQuantitative Performance TheoryComputer Architecture — Hadassah College — Spring 2019
Detailed Instruction Distribution

CPU-1 distribution
Type i   ICi / IC   CPIi
ALU      40%        4
Load     25%        5
Store    15%        4
Branch   20%        4

Split the ALU and Load groups
ALUmulti     30%
ALUsingle    10%
Loadmulti    25% - 10% = 15%
Loadsingle   10%

Detailed distribution
Type i       ICi / IC   CPIi
ALUmulti     30%        4
ALUsingle    10%        4
Loadmulti    15%        5
Loadsingle   10%        5
Store        15%        4
Branch       20%        4

ALUregister-memory replaces ALUsingle + Loadsingle, with CPI(ALUreg-mem) = CPI(ALU) + CPI(Load) = 4 + 5 = 9
3-43Dr. Martin LandQuantitative Performance TheoryComputer Architecture — Hadassah College — Spring 2019
New Instruction Distribution and Speedup

IC' = IC(ALUreg-mem) + IC(ALUmulti) + IC(Loadmulti) + IC(Store) + IC(Branch)
    = (0.10 + 0.30 + 0.15 + 0.15 + 0.20) × IC = 0.90 × IC

Type i       ICi' / IC'   CPIi'
ALUreg-mem   10/90        9
ALUmulti     30/90        4
Loadmulti    15/90        5
Store        15/90        4
Branch       20/90        4

CPI' = Σi (ICi'/IC') × CPIi' = 4 × (65/90) + 5 × (15/90) + 9 × (10/90) = 425/90 ≈ 4.72

S = T / T' = (CPI × IC × τ) / (CPI' × IC' × τ') = (4.25 × IC) / ((425/90) × 0.90 × IC) = 1 ⇒ No Change in Performance
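The null result falls out of the arithmetic: merging an ALU and a Load into one 9-cycle instruction removes instructions without removing cycles. A sketch comparing total cycles per original instruction:

```python
# Register-memory conversion: fewer instructions, higher CPI, same total cycles.
old = {"alu": (0.40, 4), "load": (0.25, 5), "store": (0.15, 4), "branch": (0.20, 4)}
new = {"alu_rm": (0.10, 9), "alu_multi": (0.30, 4), "load_multi": (0.15, 5),
       "store": (0.15, 4), "branch": (0.20, 4)}

def cycles(groups):
    # Total cycles, normalized per instruction of the ORIGINAL program.
    return sum(frac * cpi for frac, cpi in groups.values())

S = cycles(old) / cycles(new)
print(cycles(old), cycles(new), round(S, 2))  # 4.25 cycles either way, S = 1.0
```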
3-44Dr. Martin LandQuantitative Performance TheoryComputer Architecture — Hadassah College — Spring 2019
Analysis of 8086 Example
8086 program compiled from C source
Instruction Clock Cycles Runs Type
MOV WORD PTR [BP-02],0000 13 1 Store
start: CMP WORD PTR [BP-02],N 10 N ALUimm-mem
JGE stop 4/13 N-1 / 1 Conditional Branch
MOV AX,[BP-02] 9 N-1 Load
SHL AX,1 2 N-1 ALUreg
MOV [BP-04],AX 12 N-1 Store
INC WORD PTR [BP-02] 15 N-1 ALUreg-mem
JMP start JMP 14 N-1 Unconditional Branch
stop: RET RET 16 1 Return
3-45Dr. Martin LandQuantitative Performance TheoryComputer Architecture — Hadassah College — Spring 2019
CPI for Store and ALU
JGE
Runs N-1 times in 4 clock cycles and 1 time in 13 clock cycles

CPI(JGE) = cycles(JGE) / instructions(JGE) = [ 4 × (N-1) + 13 × 1 ] / [ (N-1) + 1 ] = [ 4(N-1) + 13 ] / N

STORE
Runs N-1 times in 12 clock cycles and 1 time in 13 clock cycles

CPI(STORE) = cycles(STORE) / instructions(STORE) = [ 12 × (N-1) + 13 × 1 ] / [ (N-1) + 1 ] = [ 12(N-1) + 13 ] / N
3-46Dr. Martin LandQuantitative Performance TheoryComputer Architecture — Hadassah College — Spring 2019
Instruction Distribution

Type i                ICi / IC         CPIi
ALUimm-mem            N / (7N-3)       10
ALUreg                (N-1) / (7N-3)   2
Return                1 / (7N-3)       16
Unconditional Branch  (N-1) / (7N-3)   14
Conditional Branch    N / (7N-3)       [4(N-1) + 13] / N
ALUreg-mem            (N-1) / (7N-3)   15
Store                 N / (7N-3)       [12(N-1) + 13] / N
Load                  (N-1) / (7N-3)   9

IC = 7N - 3
3-47Dr. Martin LandQuantitative Performance TheoryComputer Architecture — Hadassah College — Spring 2019
Instruction Distribution for Loop Length = 100
Type i                 ICi/IC   CPIi
Store                  14.34%   12.01
Load                   14.20%   9
Conditional Branch     14.34%   4.09
Unconditional Branch   14.20%   14
ALUimm-mem             14.34%   10
ALUreg                 14.20%   2
ALUreg-mem             14.20%   15
Return                 0.14%    16

CPI = Σi CPIi × ICi/IC
    = (9 + 15 + 2 + 14) × 0.1420 + (12.01 + 10 + 4.09) × 0.1434 + 16 × 0.0014 = 9.45

IC = 7N − 3 = 697 for N = 100

T = (CPI × IC) / R = (9.45 cycles/instruction × 697 instructions) / 4 MHz = 1.646 × 10⁻³ sec

Estimated run time for 296 MHz UltraSPARC II = 4.71 × 10⁻⁷ sec

S = (1.646 × 10⁻³) / (4.71 × 10⁻⁷) = 3494
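The same result can be reached without the rounded percentages by summing cycles instruction by instruction from the slide 3-44 table — a sketch, assuming the 4 MHz clock used above:

```python
# Total cycles for the 8086 loop, instruction by instruction (N = 100)
N = 100
cycles = (13                    # MOV (store), runs once
          + 10 * N              # CMP
          + 4 * (N - 1) + 13    # JGE: not taken N-1 times, taken once
          + 9 * (N - 1)         # MOV (load)
          + 2 * (N - 1)         # SHL
          + 12 * (N - 1)        # MOV (store)
          + 15 * (N - 1)        # INC
          + 14 * (N - 1)        # JMP
          + 16)                 # RET
ic = 7 * N - 3                  # 697 instructions
cpi = cycles / ic               # ≈ 9.45
T = cycles / 4e6                # 4 MHz clock -> ≈ 1.646e-3 sec
speedup_vs_ultrasparc = T / 4.71e-7
```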
3-48Dr. Martin LandQuantitative Performance TheoryComputer Architecture — Hadassah College — Spring 2019
Register Variables

Improved program — memory variables replaced with register variables
Instruction Clock Cycles Runs Type
MOV SI,0000 4 1 MOVimm-reg
start: CMP SI,+0A 3 N ALUimm-reg
JGE stop 4/13 N-1 / 1 Conditional Branch
MOV AX,SI 2 N-1 MOVreg-reg
SHL AX,1 2 N-1 ALUreg
MOV DI,AX 2 N-1 MOVreg-reg
INC SI 3 N-1 ALUreg-reg
JMP start 14 N-1 Unconditional Branch
stop: RET 16 1 Return
3-49Dr. Martin LandQuantitative Performance TheoryComputer Architecture — Hadassah College — Spring 2019
New Instruction Distribution
Type i                 ICi/IC          CPIi
MOVimm-reg             1/(7N−3)        4
ALUimm-reg             N/(7N−3)        3
Conditional Branch     N/(7N−3)        [4(N−1) + 13] / N
MOVreg-reg             2(N−1)/(7N−3)   2
ALUreg                 (N−1)/(7N−3)    2
ALUreg-reg             (N−1)/(7N−3)    3
Unconditional Branch   (N−1)/(7N−3)    14
Return                 1/(7N−3)        16

IC = 7N − 3
3-50Dr. Martin LandQuantitative Performance TheoryComputer Architecture — Hadassah College — Spring 2019
New Distribution for Loop Length = 100
Type i                 ICi'/IC'   CPIi'
MOVimm-reg             0.14%      4
ALUimm-reg             14.35%     3
Conditional Branch     14.35%     4.09
MOVreg-reg             28.41%     2
ALUreg                 14.20%     2
ALUreg-reg             14.20%     3
Unconditional Branch   14.20%     14
Return                 0.14%      16

CPI' = Σi CPIi' × ICi'/IC'
     = (4 + 16) × 0.0014 + 2 × 0.2840 + (3 + 2 + 14) × 0.1420 + (3 + 4.09) × 0.1435 = 4.31

IC' = 7N − 3 = 697 for N = 100

T' = (CPI' × IC') / R = (4.31 cycles/instruction × 697 instructions) / 4 MHz = 7.515 × 10⁻⁴ sec

Run time with memory variables = 1.646 × 10⁻³ sec

S = (1.646 × 10⁻³) / (7.515 × 10⁻⁴) = 2.19
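The register-variable version can be totaled the same way as the memory-variable version, again assuming the 4 MHz clock (a sketch of the slide's arithmetic):

```python
# Total cycles for the register-variable 8086 loop (N = 100)
N = 100
cycles = (4                     # MOV SI,0 runs once
          + 3 * N               # CMP
          + 4 * (N - 1) + 13    # JGE: not taken N-1 times, taken once
          + 2 * (N - 1)         # MOV AX,SI
          + 2 * (N - 1)         # SHL
          + 2 * (N - 1)         # MOV DI,AX
          + 3 * (N - 1)         # INC SI
          + 14 * (N - 1)        # JMP
          + 16)                 # RET
ic = 7 * N - 3                  # 697 instructions, same as before
cpi = cycles / ic               # ≈ 4.31
T_new = cycles / 4e6            # ≈ 7.5e-4 sec
speedup = 1.646e-3 / T_new      # vs. the memory-variable version
```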
3-51Dr. Martin LandQuantitative Performance TheoryComputer Architecture — Hadassah College — Spring 2019
Software Content — Instruction Distribution
Type i   ICi   CPIi
ALU      400   2
Branch   200   8
Load     250   4
Store    150   4

IC — Instruction Count
CPI — Cycles Per Instruction
3-52Dr. Martin LandQuantitative Performance TheoryComputer Architecture — Hadassah College — Spring 2019
Software Content — Instruction Distribution
Type i   ICi   CPIi
ALU      400   2
Branch   200   8
Load     250   4
Store    150   4

IC = Σi ICi = 1000
3-53Dr. Martin LandQuantitative Performance TheoryComputer Architecture — Hadassah College — Spring 2019
Software Content — Instruction Distribution
Type i   ICi   ICi/IC   CPIi
ALU      400   40%      2
Branch   200   20%      8
Load     250   25%      4
Store    150   15%      4

IC = 1000
3-54Dr. Martin LandQuantitative Performance TheoryComputer Architecture — Hadassah College — Spring 2019
Software Content — Instruction Distribution
Type i   ICi   ICi/IC   CPIi   CPIi × ICi/IC
ALU      400   40%      2      0.8
Branch   200   20%      8      1.6
Load     250   25%      4      1.0
Store    150   15%      4      0.6

IC = 1000
3-55Dr. Martin LandQuantitative Performance TheoryComputer Architecture — Hadassah College — Spring 2019
Software Content — Instruction Distribution
Type i   ICi   ICi/IC   CPIi   CPIi × ICi/IC
ALU      400   40%      2      0.8
Branch   200   20%      8      1.6
Load     250   25%      4      1.0
Store    150   15%      4      0.6

IC = 1000

CPI = Σi CPIi × ICi/IC = 0.8 + 1.6 + 1.0 + 0.6 = 4.0
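The weighted-average CPI in this running example is a one-line computation (the dictionary layout is illustrative):

```python
# (ICi, CPIi) per instruction type, from the table above
types = {"ALU": (400, 2), "Branch": (200, 8), "Load": (250, 4), "Store": (150, 4)}
IC = sum(ici for ici, _ in types.values())                # 1000
CPI = sum(ici / IC * cpi for ici, cpi in types.values())  # 0.8 + 1.6 + 1.0 + 0.6
```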
3-56Dr. Martin LandQuantitative Performance TheoryComputer Architecture — Hadassah College — Spring 2019
Software Content — Instruction Distribution
Type i   ICi   ICi/IC   CPIi   CPIi × ICi/IC
ALU      400   40%      2      0.8
Branch   200   20%      8      1.6
Load     250   25%      4      1.0
Store    150   15%      4      0.6

IC = 1000, CPI = 4.0

N = CPI × IC = 4.0 × 1000 = 4000 cycles
T = CPI × IC × τ = 4000 τ
3-57Dr. Martin LandQuantitative Performance TheoryComputer Architecture — Hadassah College — Spring 2019
Software Content — Instruction Distribution
Type i   ICi   ICi/IC   CPIi   CPIi × ICi/IC   Ni
ALU      400   40%      2      0.8             800
Branch   200   20%      8      1.6             1600
Load     250   25%      4      1.0             1000
Store    150   15%      4      0.6             600

IC = 1000, CPI = 4.0

Ni = CPIi × ICi
3-58Dr. Martin LandQuantitative Performance TheoryComputer Architecture — Hadassah College — Spring 2019
Software Content — Instruction Distribution
Type i   ICi   ICi/IC   CPIi   CPIi × ICi/IC   Ni
ALU      400   40%      2      0.8             800
Branch   200   20%      8      1.6             1600
Load     250   25%      4      1.0             1000
Store    150   15%      4      0.6             600

IC = 1000, CPI = 4.0

N = CPI × IC = 4.0 × 1000 = 4000
N = Σi Ni = 800 + 1600 + 1000 + 600 = 4000
3-59Dr. Martin LandQuantitative Performance TheoryComputer Architecture — Hadassah College — Spring 2019
Software Content — Instruction Distribution
Type i   ICi   ICi/IC   CPIi   CPIi × ICi/IC   Ni     Fi
ALU      400   40%      2      0.8             800    20%
Branch   200   20%      8      1.6             1600   40%
Load     250   25%      4      1.0             1000   25%
Store    150   15%      4      0.6             600    15%

IC = 1000, CPI = 4.0

Fi = ti / T = (CPIi × ICi × τ) / (CPI × IC × τ) = Ni / N
3-60Dr. Martin LandQuantitative Performance TheoryComputer Architecture — Hadassah College — Spring 2019
Enhancement
Branch CPI reduced from 8 to 4:

Type i   ICi   ICi/IC   CPIi'    CPIi' × ICi/IC   Ni'    Fi'
ALU      400   40%      2        0.8              800    25%
Branch   200   20%      8 → 4    1.6 → 0.8        800    25%
Load     250   25%      4        1.0              1000   31%
Store    150   15%      4        0.6              600    19%

IC = 1000, CPI' = 3.2

Enhancement speedup:
Se = te / te' = (8 × IC × τ) / (4 × IC × τ) = 2

CPU Equation:
S = T / T' = (CPI × IC × τ) / (CPI' × IC' × τ') = CPI / CPI' = 4.0 / 3.2 = 1.25

Amdahl Equation (Fe = 0.4):
S = T / T' = 1 / [(1 − Fe) + Fe / Se] = 1 / [0.6 + 0.4/2] = 1 / 0.8 = 1.25
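Both routes to the speedup can be checked in a few lines of Python (a sketch; the dictionary layout is illustrative):

```python
# (ICi, CPIi) per type, before the enhancement
types = {"ALU": (400, 2), "Branch": (200, 8), "Load": (250, 4), "Store": (150, 4)}
IC = 1000
cpi = sum(ici / IC * c for ici, c in types.values())          # 4.0

# Enhancement: branch CPI halved, 8 -> 4
enhanced = dict(types, Branch=(200, 4))
cpi_new = sum(ici / IC * c for ici, c in enhanced.values())   # 3.2

s_cpu = cpi / cpi_new                  # CPU equation: CPI / CPI'

# Amdahl: branches were Fe = 40% of run time, sped up by Se = 2
Fe, Se = 0.4, 2
s_amdahl = 1 / ((1 - Fe) + Fe / Se)
```

The CPU equation and the Amdahl equation agree, as they must: both are the same ratio T/T' written differently.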
3-61Dr. Martin LandQuantitative Performance TheoryComputer Architecture — Hadassah College — Spring 2019
Instruction Distributions
CPU analysis
  Permits performance analysis of machine design "on drawing board"
  Evaluate proposed design without building CPU implementation

Summary of procedure
  Specify Instruction Set Architecture (ISA)
    Describes machine language for proposed CPU
    Provides human-readable assembly language
    Determines CPIi for each instruction group i
  Count clock cycles to implement a single instruction in ISA
  Write C, C++, Fortran compilers for proposed machine language
  Compile representative programs to machine language
    Can use programs from SPEC CINT and CFP
  Sort instructions into groups to find relative instruction count ICi/IC
  Calculate average CPI and run time T
  Compare run time with reference machine
4-1Dr. Martin LandFrom CISC to RISCComputer Architecture — Hadassah College — Spring 2019
From CISC to RISC
4-2Dr. Martin LandFrom CISC to RISCComputer Architecture — Hadassah College — Spring 2019
CISC Creates the Anti‐CISC Revolution
Digital Equipment Corporation (DEC) introduces VAX (1977)
  Commercially successful 32-bit CISC minicomputer

In 1970s and 1980s CISC minicomputers became cheaper
  Serious computers became available to small organizations
  UNIX developed as minicomputer operating system
  TCP/IP developed to support networks of minicomputers
  Computer Science emerged as separate academic discipline
  Students needed topics for final projects, theses, dissertations

Research results on CISC performance
  Most machine instructions are never used
  CISC implementations give up speed in favor of generality
  CISC machines run slowly to support unnecessary features
4-3Dr. Martin LandFrom CISC to RISCComputer Architecture — Hadassah College — Spring 2019
CISC Limitations
CISC instruction set requires microcode
  Many different instruction types
  Each instruction requires different implementation
Complex operations
  Many instructions require complex decoding and sequencing
Central bus organization
  Atomic microcode operations
  System bus = bottleneck
  Microcode operations — sequential
  Machine instructions — sequential
Machine instruction executes in multiple clock cycles
Memory access
  Operation complexity — non-uniform instruction length
  Instruction fetch — multiple clock cycles to load instruction
[Diagram: CISC CPU organization — Main Memory (Address/Data lines), MAR, MDR, PC, IR, instruction decoder, status word, registers, and ALU subsystem, all connected through a single System Bus under microcode control]
4-4Dr. Martin LandFrom CISC to RISCComputer Architecture — Hadassah College — Spring 2019
RISC "Philosophy"
Technological developments from 1975 to 1990
  Price of RAM — from $5000 / MByte (1975) to $5 / MByte (1990)
  Compilers — powerful and efficient with extensive optimization
  Unix, C, and TCP/IP — practical portable code

Principal research result on CISC performance
  ~ 90% of run time = ~ 10% of VAX ISA
  ~ 90% of VAX instruction set < 10% of run time

Reduced Instruction Set Computer (RISC) — 1984
Apply Amdahl's "Law" to Instruction Set Architecture (ISA)
  Speed up operations accounting for most of run time
  Ignore performance degradation to other instructions
RISC ISA — keep most important instructions from CISC ISA
  Other CISC instructions implemented as multiple RISC instructions
Simple hardware implementation — faster execution
4-5Dr. Martin LandFrom CISC to RISCComputer Architecture — Hadassah College — Spring 2019
RISC Microprocessors
Simpler ISA
  Fewer machine instructions
  All instructions are same length
Simpler hardware design
  Allows lower CPIi and higher clock speed
  No microcode — all instructions implemented in similar way
  No dedicated system bus
  CPU can process several instructions at once
  An instruction completes execution on almost every clock cycle
High level program compiled to RISC
  Larger ICi — more machine instructions than compilation for CISC
  Runs more quickly than same high level program on CISC
All processors today use RISC technology
  Pure RISC (IBM Power, SPARC, MIPS, ARM, …)
  RISC technology for CISC language — Intel x86 (Pentium, Core, Xeon)
  Explicitly parallel RISC (Intel Itanium, IBM mainframes)
4-6Dr. Martin LandFrom CISC to RISCComputer Architecture — Hadassah College — Spring 2019
CISC vs. Pure RISC

                      CISC       RISC
Instruction Types     300        50
Addressing Modes      15         5
Data Types            10         2
Procedure Handling    Automated  Coded
Implementations       Complex    Simple
Memory Organization   Complex    Simple

S = T_CISC / T_RISC
  = (CPI_CISC × IC_CISC × τ_CISC) / (CPI_RISC × IC_RISC × τ_RISC)
  = (CPI_CISC / CPI_RISC) × (IC_CISC / IC_RISC) × (τ_CISC / τ_RISC)
  ≈ 6 × (1/2) × 1 ≈ 3
4-7Dr. Martin LandFrom CISC to RISCComputer Architecture — Hadassah College — Spring 2019
Designing a RISC ISA
4-8Dr. Martin LandFrom CISC to RISCComputer Architecture — Hadassah College — Spring 2019
Considerations for a RISC ISA
Goals
  Simple — no instruction should require more steps than others
  Complete — able to perform any desired computation
  Orthogonal — only one way to encode any given computation

Choices
  Computation model
    Register-register
    Register-memory
  Range and type of operations
  Operands
    Data types
    Data sizes
    Addressing modes
    Displacement sizes
  Branch types
    Conditional
    Unconditional
    Procedural (call/return)
    Branch offset (length of jump)
4-9Dr. Martin LandFrom CISC to RISCComputer Architecture — Hadassah College — Spring 2019
Instruction Types
Representative instruction distribution
  Five programs from SPECint92 benchmark suite
  Compile for x86 instruction set (ISA for Intel 386/486/Pentium)

Instruction          Relative Proportion of Total Run Time
Load                 22%
Conditional branch   20%
Compare              16%
Store                12%
Add                  8%
And                  6%
Sub                  5%
Move reg-reg         4%
Call                 1%
Return               1%
Other                5%
Total                100%

Ref: Hennessy / Patterson, figure 2.11

First 10 instructions account for 95% of run time

Amdahl's "Law"
  Fast implementation of 95%
  Other 5% will not seriously degrade performance

Must include unconditional branch for completeness
4-10Dr. Martin LandFrom CISC to RISCComputer Architecture — Hadassah College — Spring 2019
Addressing Modes Graph
Ref: Hennessy / Patterson, figure 2.6
4-11Dr. Martin LandFrom CISC to RISCComputer Architecture — Hadassah College — Spring 2019
Addressing Modes
Representative instruction distribution
  Three programs from SPEC CINT92 and SPEC CFP95 benchmarks
  Compile for VAX instruction set

Mode               tex   spice   gcc   Example of Mode
register deferred  24    3       11    mem[R1]
immediate          43    17      39    #11223344
displacement       32    55      40    mem[R1 + disp]
memory indirect    1     6       1     mem[mem[R1]]
scaled             0     16      6     mem[R1 + R2 * d + disp]
other              0     3       3
total              100   100     100
total (top 3)      99    75      90

First three addressing modes account for more than 75% of all operand accesses
Ref: Hennessy / Patterson, figure 2.6
4-12Dr. Martin LandFrom CISC to RISCComputer Architecture — Hadassah College — Spring 2019
Instruction Length
Instructions should be of uniform length
  Simplifies instruction DECODING
    No need to calculate instruction length
    Instruction fields are always in same place
  Enables INSTRUCTION FETCH in 1 clock cycle

Practical instruction lengths
  Most RISC machines for servers/workstations use 32-bit instructions
  Special purpose RISC machines use longer instructions
  Itanium and mainframes use 128-bit instructions

ISA defines 32-bit instructions
  No single field can be 32 bits long
  Includes address displacements, immediates, branch length

[Format: op code | operands — 32 bits]
4-13Dr. Martin LandFrom CISC to RISCComputer Architecture — Hadassah College — Spring 2019
Length of Immediate Operand Graph
Ref: Hennessy / Patterson, figure 2.9
4-14Dr. Martin LandFrom CISC to RISCComputer Architecture — Hadassah College — Spring 2019
Length of Immediate Operand
Representative instruction distribution
  Three programs from SPEC CINT92 and SPEC CFP95 benchmarks
  Compile for VAX instruction set

Ref: Hennessy / Patterson, figure 2.9

Immediate size (bits)   tex   spice   gcc
0                       3     1       1
4                       45    13      50
8                       4     35      22
12                      3     15      4
16                      15    14      3
20                      25    10      18
24                      2     12      0
28                      1     0       0
32                      2     0       2
Total                   100   100     100
Total to 16 bits        70    78      80

Allocating 16 bits in a 32-bit instruction for immediate operands covers more than 70% of cases
Example: #1122
4-15Dr. Martin LandFrom CISC to RISCComputer Architecture — Hadassah College — Spring 2019
Displacement Length Graph
Ref: Hennessy / Patterson, figure 2.7
4-16Dr. Martin LandFrom CISC to RISCComputer Architecture — Hadassah College — Spring 2019
Displacement Length
Representative instruction distribution
  Programs from SPEC CINT92 and SPEC CFP95 benchmarks
  Compile for VAX instruction set

Bits in address displacement   int   FP
0                              26    7
1                              1     0
2                              6     6
3                              12    8
4                              16    5
5                              6     10
6                              10    4
7                              6     3
8                              2     5
9                              1     1
10                             1     10
11                             0     4
12                             0     7
13                             1     6
14                             0     4
15                             12    20
Total                          100   100

Ref: Hennessy / Patterson, figure 2.7

Allocating 16 bits for address displacements covers almost all cases
Example: mem[R1 + 1122]
4-17Dr. Martin LandFrom CISC to RISCComputer Architecture — Hadassah College — Spring 2019
Branch Instructions Graph
Ref: Hennessy / Patterson, figure 2.12
4-18Dr. Martin LandFrom CISC to RISCComputer Architecture — Hadassah College — Spring 2019
Branch Instructions
Representative instruction distribution
  Programs from SPEC CINT92 and SPEC CFP95 benchmarks
  Compile for VAX instruction set

                                Integer   FP
Call / Return                   13        10
Unconditional Branch            6         4
Conditional Branch              81        86
Total                           100       100
Total of Conditional and
Unconditional Branch            87        90

Ref: Hennessy / Patterson, figure 2.12

Conditional branch accounts for more than 80% of all branch instructions
Unconditional branch must be included for completeness
Call and return
  Include many steps — saving registers and branching
  Are difficult to implement
4-19Dr. Martin LandFrom CISC to RISCComputer Architecture — Hadassah College — Spring 2019
Branch Offset Graph
Ref: Hennessy / Patterson, figure 2.13
4-20Dr. Martin LandFrom CISC to RISCComputer Architecture — Hadassah College — Spring 2019
Branch Offset
Representative instruction distribution
  Programs from SPEC CINT92 and SPEC CFP95 benchmarks
  Compile for VAX instruction set

Offset bits for branch address   int   FP
0                                0     0
1                                1     0
2                                13    36
3                                26    21
4                                16    11
5                                24    12
6                                6     9
7                                5     6
8                                6     4
9                                2     1
10                               1     0
11                               0     0
12                               0     0
13                               0     0
14                               0     0
15                               0     0
Total                            100   100

Ref: Hennessy / Patterson, figure 2.13

Allocating 16 bits for branch offsets covers almost all cases
Example: PC ← PC + 1122
4-21Dr. Martin LandFrom CISC to RISCComputer Architecture — Hadassah College — Spring 2019
Summary — RISC ISA By the Numbers

Instruction Types
  10 instructions cover 95% of run time
  Choose 30 – 50 most necessary / convenient instructions

Addressing Modes — cover 75% – 90% of run time addressing modes
  Register
  Immediate
  Displacement

Instruction Length
  32-bit instructions

Branch Instructions
  Conditional branch
  Unconditional branch

Length of immediate values — 16 bits for
  Immediate operand — covers 70% – 80% of run time immediates
  Displacement — covers ~100% of run time address displacements
  Branch offset — covers ~100% of run time branch offsets
5-1Dr. Martin LandDLX ArchitectureComputer Architecture — Hadassah College — Spring 2019
DLX Architecture
A Model RISC Processor
5-2Dr. Martin LandDLX ArchitectureComputer Architecture — Hadassah College — Spring 2019
DLX Architecture — General Features
Flat memory model with 32-bit address

Data types
  Integers (32-bit)
  Floating Point
    Single precision (32-bit)
    Double precision (64-bit)

Register-register operation model

32 integer registers (32 bits wide)
  Named R0, R1, ..., R31
  Addressed as 00000 to 11111 in register address space
  Reg[R0] = 0 (constant)
  Other registers identical (no special purpose registers)

32 FP registers (32 bits wide)
  F0, F1, ..., F31
  Satisfy IEEE 754 standard FP format
  Double precision FP stored in register pair (even, odd)

[Diagram: instruction cache and data cache feeding the ALU (integer registers R0 – R31) and FPU (registers F0 – F31)]
5-3Dr. Martin LandDLX ArchitectureComputer Architecture — Hadassah College — Spring 2019
Addressing Modes
Register           ADD R3, R4, R5    Reg[R3] ← Reg[R4] + Reg[R5]
Immediate          ADDI R3, R4, #3   Reg[R3] ← Reg[R4] + 3
Displacement       LW R3, 100(R1)    Reg[R3] ← Mem[100 + Reg[R1]]
Register Deferred  LW R3, 0(R1)      Reg[R3] ← Mem[Reg[R1]]
Absolute           LW R3, 100(R0)    Reg[R3] ← Mem[100]

Three memory addressing modes implemented using displacement:
  Displacement       100(R1)   Mem[100 + Reg[R1]]
  Register Deferred  0(R1)     Mem[0 + Reg[R1]]
  Absolute           100(R0)   Mem[100 + Reg[R0]] = Mem[100], since Reg[R0] = 0
5-4Dr. Martin LandDLX ArchitectureComputer Architecture — Hadassah College — Spring 2019
Data Transfer Instructions
LW R1, 30(R2) Load Word Reg[R1] ←32 Mem[30 + Reg[R2]]
SW 30(R2), R1 Store Word Mem[30 + Reg[R2]] ←32 Reg[R1]
LB R1, 30(R2) Load Byte Reg[R1] ←32 (Mem[30 + Reg[R2]]0)24 ## Mem[30 + Reg[R2]]
SB 30(R2), R1 Store Byte Mem[30 + Reg[R2]] ←8 Reg[R1]24..31
LBU R1, 30(R2) Load Byte Unsigned Reg[R1] ←32 024 ## Mem[30 + Reg[R2]]

LH R1, 30(R2) Load Half Word Reg[R1] ←32 (Mem[30 + Reg[R2]]0)16 ## Mem[30 + Reg[R2]]
LF F1, 30(R2) Load Float Reg[F1] ←32 Mem[30 + Reg[R2]]
SF 30(R2), F1 Store Float Mem[30 + Reg[R2]] ←32 Reg[F1]
MOVF F3, F1 Move Float Reg[F3] ←32 Reg[F1]
MOVD F2, F0 Move Double Reg[F2],Reg[F3] ←64 Reg[F0],Reg[F1]
MOVFP2I R2, F2 FP to INT Reg[R2] ←32 Reg[F2]
MOVI2FP F2, R2 INT to FP Reg[F2] ←32 Reg[R2]
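The difference between LB and LBU is only in how the upper 24 bits are filled. A minimal Python sketch of the two extensions (function names are illustrative; bit 0 is the MSB in the slides' notation, i.e. 0x80 in the loaded byte):

```python
def lb(byte):
    """LB: replicate the byte's sign bit into the upper 24 bits."""
    return 0xFFFFFF00 | byte if byte & 0x80 else byte

def lbu(byte):
    """LBU: zero-extend -- the upper 24 bits are 0."""
    return byte
```

For example, lb(0x80) yields 0xFFFFFF80 while lbu(0x80) leaves the value as 0x00000080.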
5-5Dr. Martin LandDLX ArchitectureComputer Architecture — Hadassah College — Spring 2019
Arithmetic/Logic Instructions

ADD R1, R2, R3    Add                        Reg[R1] ← Reg[R2] + Reg[R3]
ADDI R1, R2, #3   Add Immediate              Reg[R1] ← Reg[R2] + 3
SUB R1, R2, R3    Sub                        Reg[R1] ← Reg[R2] − Reg[R3]
SUBI R1, R2, #3   Sub Immediate              Reg[R1] ← Reg[R2] − 3
MULT R1, R2, R3   Multiply                   Reg[R1] ← Reg[R2] * Reg[R3]
DIV R1, R2, R3    Divide                     Reg[R1] ← Reg[R2] ÷ Reg[R3]
AND R1, R2, R3    And                        Reg[R1] ← Reg[R2] AND Reg[R3]
ANDI R1, R2, #3   And Immediate              Reg[R1] ← Reg[R2] AND 3
OR R1, R2, R3     Or                         Reg[R1] ← Reg[R2] OR Reg[R3]
ORI R1, R2, #3    Or Immediate               Reg[R1] ← Reg[R2] OR 3
XOR R1, R2, R3    Exclusive Or               Reg[R1] ← Reg[R2] XOR Reg[R3]
XORI R1, R2, #3   Exclusive Or Immediate     Reg[R1] ← Reg[R2] XOR 3
LHI R1, #42       Load High                  Reg[R1] ← 42 ## 016
SLT R1, R2, R3    Set Less Than              if Reg[R2] < Reg[R3] then Reg[R1] ← 1 else Reg[R1] ← 0
SGT R1, R2, R3    Set Greater Than           if Reg[R2] > Reg[R3] then Reg[R1] ← 1 else Reg[R1] ← 0
SLE R1, R2, R3    Set Less Than or Equal     if Reg[R2] ≤ Reg[R3] then Reg[R1] ← 1 else Reg[R1] ← 0
SGE R1, R2, R3    Set Greater Than or Equal  if Reg[R2] ≥ Reg[R3] then Reg[R1] ← 1 else Reg[R1] ← 0
SEQ R1, R2, R3    Set Equal                  if Reg[R2] = Reg[R3] then Reg[R1] ← 1 else Reg[R1] ← 0
SNE R1, R2, R3    Set Not Equal              if Reg[R2] ≠ Reg[R3] then Reg[R1] ← 1 else Reg[R1] ← 0
5-6Dr. Martin LandDLX ArchitectureComputer Architecture — Hadassah College — Spring 2019
Floating Point Instructions

ADDF F1, F2, F3   Add Float    Reg[F1] ← Reg[F2] + Reg[F3]
ADDD F0, F2, F4   Add Double   (Reg[F0], Reg[F1]) ←64 (Reg[F2], Reg[F3]) + (Reg[F4], Reg[F5])
SUBF F1, F2, F3    Sub Float
SUBD F0, F2, F4    Sub Double
MULTF F1, F2, F3   Multiply Float
MULTD F0, F2, F4   Multiply Double
DIVF F1, F2, F3    Divide Float
DIVD F0, F2, F4    Divide Double
NOTE: Floating point numbers are represented as single or double precision numbers according to IEEE 754. The ALU functions for FP are not simple binary operations on the bits in the register.
LTF F2, F3   Set Less Than              if Reg[F2] < Reg[F3] then StatFP ←1 1 else StatFP ←1 0
GTF F2, F3   Set Greater Than           if Reg[F2] > Reg[F3] then StatFP ←1 1 else StatFP ←1 0
LEF F2, F3   Set Less Than or Equal     if Reg[F2] ≤ Reg[F3] then StatFP ←1 1 else StatFP ←1 0
GEF F2, F3   Set Greater Than or Equal  if Reg[F2] ≥ Reg[F3] then StatFP ←1 1 else StatFP ←1 0
EQF F2, F3   Set Equal                  if Reg[F2] = Reg[F3] then StatFP ←1 1 else StatFP ←1 0
NEF F2, F3   Set Not Equal              if Reg[F2] ≠ Reg[F3] then StatFP ←1 1 else StatFP ←1 0
LTD, GTD, LED, GED, EQD, NED Double precision comparisons
5-7Dr. Martin LandDLX ArchitectureComputer Architecture — Hadassah College — Spring 2019
Control Instructions
J offset          Jump                    PC ← PC + offset                        (−2²⁵ ≤ offset ≤ 2²⁵ − 1)
JAL offset        Jump and Link           Reg[R31] ← PC; PC ← PC + offset         (−2²⁵ ≤ offset ≤ 2²⁵ − 1)
JR R3             Jump Register           PC ← Reg[R3]
JALR R2, offset   Jump and Link Register  Reg[R2] ← PC; PC ← PC + offset          (−2¹⁵ ≤ offset ≤ 2¹⁵ − 1)
BEQZ R4, offset   Branch equal zero       if Reg[R4] == 0 then PC ← PC + offset   (−2¹⁵ ≤ offset ≤ 2¹⁵ − 1)
BNEZ R4, offset   Branch not equal zero   if Reg[R4] != 0 then PC ← PC + offset   (−2¹⁵ ≤ offset ≤ 2¹⁵ − 1)
TRAP N            Software interrupt      Details not specified in Hennessy and Patterson

Note: Register NPC is updated (NPC ← PC + 4) when a branch instruction is loaded.
Register PC is updated (PC ← NPC or PC ← NPC + offset) at the end of instruction execution.
5-8Dr. Martin LandDLX ArchitectureComputer Architecture — Hadassah College — Spring 2019
Programming in DLX Assembly Language

C source:
  for ( i = 0 ; i < 256 ; i++ )
      a[i] = a[i] + b[i] – c[i] + d[i];

Memory layout:
  a[] = 000 – 3FF
  b[] = 400 – 7FF
  c[] = 800 – BFF
  d[] = C00 – FFF

ADDI R1, R0, #0x400  ; 256 integers = 1024 bytes = 400h bytes
LW   R2, -4(R1)      ; load word from a[] (400 – 4 = 3FC)
LW   R3, 3FC(R1)     ; load word from b[] (400 + 3FC = 7FC)
ADD  R4, R2, R3      ; add
LW   R2, 7FC(R1)     ; load word from c[] (400 + 7FC = BFC)
SUB  R4, R4, R2      ; sub
LW   R2, BFC(R1)     ; load word from d[] (400 + BFC = FFC)
ADD  R4, R4, R2      ; add
SW   -4(R1), R4      ; store sum in a[]
SUBI R1, R1, #4      ; i--
BNEZ R1, -0x28       ; if R1 ≠ 0, jump back 10 instructions
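The loop body the assembly implements can be checked against plain Python. The assembly walks the index from high addresses down (R1 steps 0x400, 0x3FC, ..., 4), which this sketch mirrors; the array contents here are purely illustrative:

```python
# Small stand-in arrays; the real program works on 256 32-bit words each
a = list(range(256))
b, c, d = [1] * 256, [2] * 256, [3] * 256

# Iterate i from 255 down to 0, matching the descending pointer in R1
# (the order does not affect the result for this computation)
for i in range(255, -1, -1):
    a[i] = a[i] + b[i] - c[i] + d[i]
```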
5-9Dr. Martin LandDLX ArchitectureComputer Architecture — Hadassah College — Spring 2019
Implementation
General approach
  No central system bus
  Base hardware organization on assembly line with uniform operations
  Separate memory for instructions and data

High level design
  Instructions move through 5 stages (left to right)
  First two stages identical for all instructions — FETCH and DECODE
  Last three stages operate according to instruction
    EXECUTE (ALU instructions and address calculations)
    MEMORY ACCESS (Load/Store instructions)
    WRITE BACK (register update for Load and ALU instructions)

[Diagram: Instruction Fetch (from Instruction Memory) → Instruction Decode → Execute → Data Access (to/from Data Memory) → Write Back]
5-10Dr. Martin LandDLX ArchitectureComputer Architecture — Hadassah College — Spring 2019
RISC Performance
Compare VAX with MIPS 2000 (RISC CPU) on SPEC 89 results
Same clock rate

S = (CPI_VAX × IC_VAX × τ) / (CPI_MIPS × IC_MIPS × τ) ≈ 6 × (1/2) = 3

where  CPI_VAX / CPI_MIPS ≈ 6   and   IC_MIPS / IC_VAX ≈ 2

Ref: Hennessy-Patterson Figure 2-30
5-11Dr. Martin LandDLX ArchitectureComputer Architecture — Hadassah College — Spring 2019
Instruction Formats
32-bit instructions (bits 0 to 31)

Three instruction formats

J-type
  Jump (unconditional branch) instructions
  Specifies branch offset

R-type
  Register-register ALU instructions
  Specifies destination register (rd) and two source registers (rs1, rs2)

I-type
  All other instructions
  Specifies destination register (rd), immediate, and source register (rs)

Type   Fields (bit positions)
R      opcode (0-5)   rs1 (6-10)   rs2 (11-15)   rd (16-20)   function (21-31)
I      opcode (0-5)   rs (6-10)    rd (11-15)    immediate (16-31)
J      opcode (0-5)   offset (6-31)
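As a sketch, the field boundaries above reduce to shift-and-mask operations. In the slides' numbering bit 0 is the most significant bit, so the 6-bit opcode occupies the top of the word; the field values used in the example below (including the opcode value 0x08) are purely hypothetical:

```python
def encode_i(opcode, rs, rd, imm):
    """Pack I-type fields into a 32-bit word (opcode in the top 6 bits)."""
    return (opcode << 26) | (rs << 21) | (rd << 16) | (imm & 0xFFFF)

def decode_i(word):
    """Split a 32-bit word back into (opcode, rs, rd, immediate)."""
    return (word >> 26) & 0x3F, (word >> 21) & 0x1F, (word >> 16) & 0x1F, word & 0xFFFF
```

An instruction like addi R1, R2, #5 would round-trip through both functions with rs = 2, rd = 1, immediate = 5.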
5-12Dr. Martin LandDLX ArchitectureComputer Architecture — Hadassah College — Spring 2019
J‐Type Instruction Format
Opcode (6 bits) | Offset added to PC (26 bits)

Encodes:
• Jump                PC ← PC + offset
• Jump and link       R31 ← PC; PC ← PC + offset
• Trap and return from exception
  Implementation unspecified in Hennessy and Patterson
  Two possible implementations for Offset field:
  1. Lower 26 bits of physical address of Interrupt Service Routine
  2. Trap number = index to Interrupt Vector Table
5-13Dr. Martin LandDLX ArchitectureComputer Architecture — Hadassah College — Spring 2019
R ‐ Type Instruction
Opcode (6) | rs1 (5) | rs2 (5) | rd (5) | function (11)

Encodes:
• Register-register ALU operations    rd ← rs1 function rs2
Function encodes the ALU operation: Add, Sub, ...
5-14Dr. Martin LandDLX ArchitectureComputer Architecture — Hadassah College — Spring 2019
I ‐ Type Instruction
Opcode (6) | rs (5) | rd (5) | Immediate (16)

Encodes:
• Loads                                   rd ← imm(rs)
• Stores                                  imm(rs) ← rd
• ALU operations with immediate operand   rd ← rs op immediate
• Conditional branch instructions         if rs eq/ne 0 then PC ← PC + imm (rd unused)
• Jump register                           PC ← rs
• Jump and link register                  rd ← PC; PC ← PC + immediate
5-15Dr. Martin LandDLX ArchitectureComputer Architecture — Hadassah College — Spring 2019
Implementation Details
5-16Dr. Martin LandDLX ArchitectureComputer Architecture — Hadassah College — Spring 2019
Execution Stages by Instruction Type

Stage            ALU                        Store                      Load                       Branch
1 Fetch          Fetch instruction          Fetch instruction          Fetch instruction          Fetch instruction
                 from memory                from memory                from memory                from memory
2 Decode         Decode operation           Decode operation           Decode operation           Decode operation
                 and operands               and operands               and operands               and operands
3 Execute        Calculate ALU operation    Calculate memory address   Calculate memory address   Calculate branch condition;
                                                                                                  calculate branch address
4 Memory Access  —                          Store data to memory;      Load data from memory      Update PC
                                            update PC
5 Write Back     Write result to register;  —                          Write loaded data to       —
                 update PC                                             register; update PC
5-17Dr. Martin LandDLX ArchitectureComputer Architecture — Hadassah College — Spring 2019
Temporary Registers for Implementation

IR       Instruction Register   Holds fetched instruction during execution
PC       Program Counter        Memory address of next instruction
NPC      Next Program Counter   Temporary update of PC (points to fall-through instruction)
A, B, I  Operand buffers        Values read from data registers according to instruction
ALUout   ALU output             Result of ALU operation
LMD      Load Memory Data       Data loaded from memory
Cond     Condition flag         Result of test for conditional branch
5-18Dr. Martin LandDLX ArchitectureComputer Architecture — Hadassah College — Spring 2019
Example Type‐I ALU Instruction
Instruction addi R1, R2, #5
Operation Reg[R1] ← Reg[R2] + 5
0-5 6-10 11-15 16-31 addi 00010 00001 0000 0000 0000 0101 Encoding
op rs rd immediate Hardware Stage 1
IR ← Mem[PC] NPC ← PC + 4
Hardware Stage 2
A ← Reg[IR6-10] /* A ← Reg[R2] */ B ← Reg[IR11-15] /* B ← Reg[R1] */ I ← (IR16)16 ## IR16-31
Hardware Stage 3
ALUout ← A + I
Hardware Stage 4
Hardware Stage 5
Reg[IR11-15] ← ALUout /* Reg[R1] ← A + I */ PC ← NPC
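The five stages of this addi can be traced in plain Python. The decoded fields are written directly rather than extracted from IR bits, and the starting register/PC values are illustrative:

```python
# Architectural state (values illustrative)
reg = {i: 0 for i in range(32)}
reg[2] = 37                  # Reg[R2]
PC = 0x100

# Stage 1 -- fetch:      IR <- Mem[PC], NPC <- PC + 4
NPC = PC + 4
rs, rd, imm = 2, 1, 5        # fields of addi R1, R2, #5, pre-decoded for brevity

# Stage 2 -- decode:     A <- Reg[rs], I <- sign-extended immediate
A = reg[rs]
I = imm                      # immediate is positive, so sign extension is a no-op

# Stage 3 -- execute:    ALUout <- A + I
ALUout = A + I

# Stage 4 -- memory:     nothing to do for an ALU instruction

# Stage 5 -- write back: Reg[rd] <- ALUout, PC <- NPC
reg[rd] = ALUout
PC = NPC
```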
5-19Dr. Martin LandDLX ArchitectureComputer Architecture — Hadassah College — Spring 2019
Example Type‐R ALU Instruction
Instruction add R1, R2, R3
Operation Reg[R1] ← Reg[R2] + Reg[R3]
0-5 6-10 11-15 16-20 21-31 R-R 00010 00011 00001 add Encoding
op rs1 rs2 rd funct Hardware Stage 1
IR ← Mem[PC] NPC ← PC + 4
Hardware Stage 2
A ← Reg[IR6-10] /* A ← Reg[R2] */ B ← Reg[IR11-15] /* B ← Reg[R3] */ I ← (IR16)16 ## IR16-31
Hardware Stage 3
ALUout ← A + B
Hardware Stage 4
Hardware Stage 5
Reg[IR16-20] ← ALUout /* Reg[R1] ← A + B */ PC ← NPC
5-20Dr. Martin LandDLX ArchitectureComputer Architecture — Hadassah College — Spring 2019
Example Type‐I Store Instruction
Instruction SW 32(R1), R2
Operation Mem[32+Reg[R1]] ← Reg[R2]
0-5 6-10 11-15 16-31 SW 00001 00010 0000 0000 0010 0000 Encoding
op rs rd immediate Hardware Stage 1
IR ← Mem[PC] NPC ← PC + 4
Hardware Stage 2
A ← Reg[IR6-10] /* A ← Reg[R1] */ B ← Reg[IR11-15] /* B ← Reg[R2] */ I ← (IR16)16 ## IR16-31
Hardware Stage 3
ALUout ← A + I
Hardware Stage 4
Mem[ALUout] ← B /* Mem[A+I] ← Reg[R2] */ PC ← NPC
Hardware Stage 5
5-21Dr. Martin LandDLX ArchitectureComputer Architecture — Hadassah College — Spring 2019
Example Type‐I Load Instruction
Instruction LW R2, 32(R1)
Operation Reg[R2] ← Mem[32+Reg[R1]]
0-5 6-10 11-15 16-31 LW 00001 00010 0000 0000 0010 0000 Encoding
op rs rd immediate Hardware Stage 1
IR ← Mem[PC] NPC ← PC + 4
Hardware Stage 2
A ← Reg[IR6-10] /* A ← Reg[R1] */ B ← Reg[IR11-15] /* B ← Reg[R2] */ I ← (IR16)16 ## IR16-31
Hardware Stage 3
ALUout ← A + I
Hardware Stage 4
LMD ← Mem[ALUout] /* LMD ← Mem[A+I] */
Hardware Stage 5
Reg[IR11-15] ← LMD /* Reg[R2] ← Mem[A+I] */ PC ← NPC
5-22Dr. Martin LandDLX ArchitectureComputer Architecture — Hadassah College — Spring 2019
Example Type‐I Conditional Branch Instruction
Instruction beqz R1, 1024
Operation if (Reg[R1] == 0) PC ← NPC + 1024 else PC ← NPC
0-5 6-10 11-15 16-31 beqz 00001 00000 0000 0100 0000 0000 Encoding
op rs rd immediate Hardware Stage 1
IR ← Mem[PC] NPC ← PC + 4
Hardware Stage 2
A ← Reg[IR6-10] /* A ← Reg[R1] */ B ← Reg[IR11-15] /* B ← Reg[R0] */ I ← (IR16)16 ## IR16-31
Hardware Stage 3
ALUout ← NPC + I if (A == 0) cond = 1 else cond = 0
Hardware Stage 4
if (cond == 1) PC ← ALUout else PC ← NPC
Hardware Stage 5
5-23Dr. Martin LandDLX ArchitectureComputer Architecture — Hadassah College — Spring 2019
DLX Hardware Drawing — Version 1
mux (multiplexer) — chooses 1 output from N inputs
5-24Dr. Martin LandDLX ArchitectureComputer Architecture — Hadassah College — Spring 2019
Type‐I ALU Instruction — 1
[Datapath diagram — stage 1 (fetch) highlighted: IR ← mem[PC], NPC ← PC + 4]
addi r1, r2, #5 regs[r1] ← regs[r2] + 5
5-25Dr. Martin LandDLX ArchitectureComputer Architecture — Hadassah College — Spring 2019
Type‐I ALU Instruction — 2
[Datapath diagram — stages 1–2 highlighted: IR ← mem[PC], NPC ← PC + 4; A ← Reg[IR6-10], B ← Reg[IR11-15], I ← IR16-31]
addi r1, r2, #5 regs[r1] ← regs[r2] + 5
5-26Dr. Martin LandDLX ArchitectureComputer Architecture — Hadassah College — Spring 2019
Type‐I ALU Instruction — 3
[Datapath diagram — stages 1–3 highlighted: fetch and decode as above; ALUout ← A + I]
addi r1, r2, #5 regs[r1] ← regs[r2] + 5
5-27Dr. Martin LandDLX ArchitectureComputer Architecture — Hadassah College — Spring 2019
Type‐I ALU Instruction — 4
[Datapath diagram — stages 1–5 highlighted: ALUout = A + I routed through the write-back mux to Reg[IR11-15]; PC ← NPC]
addi r1, r2, #5 regs[r1] ← regs[r2] + 5
5-28Dr. Martin LandDLX ArchitectureComputer Architecture — Hadassah College — Spring 2019
Type‐R ALU Instruction — 1
[Datapath diagram — stage 1 (fetch) highlighted: IR ← mem[PC], NPC ← PC + 4]
add r1, r2, r3 regs[r1] ← regs[r2] + regs[r3]
5-29Dr. Martin LandDLX ArchitectureComputer Architecture — Hadassah College — Spring 2019
Type‐R ALU Instruction — 2
[Datapath diagram — stages 1–2 highlighted: IR ← mem[PC], NPC ← PC + 4; A ← Reg[IR6-10], B ← Reg[IR11-15], I ← IR16-31]
add r1, r2, r3 regs[r1] ← regs[r2] + regs[r3]
5-30Dr. Martin LandDLX ArchitectureComputer Architecture — Hadassah College — Spring 2019
Type‐R ALU Instruction — 3
[Datapath diagram — stages 1–3 highlighted: fetch and decode as above; ALUout ← A + B]
add r1, r2, r3 regs[r1] ← regs[r2] + regs[r3]
5-31Dr. Martin LandDLX ArchitectureComputer Architecture — Hadassah College — Spring 2019
Type‐R ALU Instruction — 4
[Datapath diagram — stages 1–5 highlighted: ALUout = A + B routed through the write-back mux to Reg[IR16-20]; PC ← NPC]
add r1, r2, r3 regs[r1] ← regs[r2] + regs[r3]
5-32Dr. Martin LandDLX ArchitectureComputer Architecture — Hadassah College — Spring 2019
Type‐I Store Instruction — 1
[Datapath diagram — stage 1 (fetch) highlighted: IR ← mem[PC], NPC ← PC + 4]
sw 32(r1), r2 mem[32+ regs[r1]] ← regs[r2]
5-33Dr. Martin LandDLX ArchitectureComputer Architecture — Hadassah College — Spring 2019
Type‐I Store Instruction — 2
[Datapath diagram — stages 1–2 highlighted: IR ← mem[PC], NPC ← PC + 4; A ← Reg[IR6-10], B ← Reg[IR11-15], I ← IR16-31]
sw 32(r1), r2 mem[32+ regs[r1]] ← regs[r2]
5-34Dr. Martin LandDLX ArchitectureComputer Architecture — Hadassah College — Spring 2019
Type‐I Store Instruction — 3
[Datapath diagram — stages 1–3 highlighted: ALUout ← A + I (memory address); B held as the data to store]
sw 32(r1), r2 mem[32+ regs[r1]] ← regs[r2]
5-35Dr. Martin LandDLX ArchitectureComputer Architecture — Hadassah College — Spring 2019
Type‐I Store Instruction — 4
[Datapath diagram — stages 1–4 highlighted: mem[A + I] ← B; PC ← NPC]
sw 32(r1), r2 mem[32+ regs[r1]] ← regs[r2]
5-36Dr. Martin LandDLX ArchitectureComputer Architecture — Hadassah College — Spring 2019
Type‐I Load Instruction — 1
[Figure: DLXv1 datapath, fetch step — IR ← mem[PC], NPC ← PC + 4]
lw r2, 32(r1) regs[r2] ← mem[32+ regs[r1]]
5-37Dr. Martin LandDLX ArchitectureComputer Architecture — Hadassah College — Spring 2019
Type‐I Load Instruction — 2
[Figure: DLXv1 datapath, decode step — A ← Reg[IR6-10], B ← Reg[IR11-15], I ← IR16-31]
lw r2, 32(r1) regs[r2] ← mem[32+ regs[r1]]
5-38Dr. Martin LandDLX ArchitectureComputer Architecture — Hadassah College — Spring 2019
Type‐I Load Instruction — 3
[Figure: DLXv1 datapath, execute step — ALU ← A + I (effective address); cond; NPC]
lw r2, 32(r1) regs[r2] ← mem[32+ regs[r1]]
5-39Dr. Martin LandDLX ArchitectureComputer Architecture — Hadassah College — Spring 2019
Type‐I Load Instruction — 4
[Figure: DLXv1 datapath, memory step — effective address A + I sent to data memory; LMD ← mem[A + I]; NPC]
lw r2, 32(r1) regs[r2] ← mem[32+ regs[r1]]
5-40Dr. Martin LandDLX ArchitectureComputer Architecture — Hadassah College — Spring 2019
Type‐I Load Instruction — 5
[Figure: DLXv1 datapath, write-back step — Reg[IR11-15] ← mem[A + I]; PC ← NPC]
lw r2, 32(r1) regs[r2] ← mem[32+ regs[r1]]
5-41Dr. Martin LandDLX ArchitectureComputer Architecture — Hadassah College — Spring 2019
Type‐I Branch Instruction — 1
[Figure: DLXv1 datapath, fetch step — IR ← mem[PC], NPC ← PC + 4]
beqz r1, 1024 if (regs[r1] == 0) PC ← NPC + I else PC ← NPC
5-42Dr. Martin LandDLX ArchitectureComputer Architecture — Hadassah College — Spring 2019
Type‐I Branch Instruction — 2
[Figure: DLXv1 datapath, decode step — A ← Reg[IR6-10], I ← IR16-31; NPC]
beqz r1, 1024 if (regs[r1] == 0) PC ← NPC + I else PC ← NPC
5-43Dr. Martin LandDLX ArchitectureComputer Architecture — Hadassah College — Spring 2019
Type‐I Branch Instruction — 3
[Figure: DLXv1 datapath, execute step — ALU ← NPC + I (branch target); cond ← (A == 0)]
beqz r1, 1024 if (regs[r1] == 0) PC ← NPC + I else PC ← NPC
5-44Dr. Martin LandDLX ArchitectureComputer Architecture — Hadassah College — Spring 2019
Type‐I Branch Instruction — 4
[Figure: DLXv1 datapath, PC update — MUX uses cond to choose PC ← NPC (not taken) or PC ← NPC + I (taken)]
beqz r1, 1024 if (regs[r1] == 0) PC ← NPC + I else PC ← NPC
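The branch semantics line above reduces to a conditional choice of the next PC. A minimal sketch; the function name is an illustrative assumption:

```python
def beqz_next_pc(regs, rs, imm, npc):
    """DLX beqz next-PC selection: cond chooses the branch target
    (NPC + I) when regs[rs] == 0, else the fall-through NPC."""
    cond = (regs[rs] == 0)
    return npc + imm if cond else npc

taken = beqz_next_pc({1: 0}, 1, 1024, 4)       # r1 == 0: branch taken
not_taken = beqz_next_pc({1: 3}, 1, 1024, 4)   # r1 != 0: fall through
```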
5-45Dr. Martin LandDLX ArchitectureComputer Architecture — Hadassah College — Spring 2019
Performance
Instruction distribution for version 1, based on compilation of SPEC 92

Type i    IC_i / IC   CPI_i
ALU       40%         4
Load      25%         5
Store     15%         4
Branch    20%         4

CPI = Σ_i CPI_i × (IC_i / IC)
    = 4 × 0.40 + 5 × 0.25 + 4 × 0.15 + 4 × 0.20
    = 4.25
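The weighted-sum CPI can be checked numerically. The mix and per-type CPIs below are the slide's SPEC 92 numbers:

```python
# Instruction mix for DLX version 1: type -> (IC_i / IC, CPI_i)
mix = {
    "ALU":    (0.40, 4),
    "Load":   (0.25, 5),
    "Store":  (0.15, 4),
    "Branch": (0.20, 4),
}

# CPI = sum over types of CPI_i * (IC_i / IC)
cpi = sum(frac * cycles for frac, cycles in mix.values())
```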
6-1Dr. Martin LandSpeeding Up DLXComputer Architecture — Hadassah College — Spring 2019
Speeding Up DLX
6-2Dr. Martin LandSpeeding Up DLXComputer Architecture — Hadassah College — Spring 2019
DLX Execution Stages — Version 1
Clock Cycle 1
I1 enters Instruction Fetch (IF)
Clock Cycle 2
I1 moves to Instruction Decode (ID)
Instruction Fetch (IF) holds state fixed
Clock Cycle 3
I1 moves to Execute (EX)
Instruction Fetch (IF) holds state fixed
Instruction Decode (ID) holds state fixed
Clock Cycle 4
I1 moves to Memory Access (MEM)
Instruction Fetch (IF) holds state fixed
Instruction Decode (ID) holds state fixed
Execute (EX) holds state fixed
Clock Cycle 5
I1 performs Write Back (WB) using instruction (IR) stored in IF stage
PC updated and stages IF, ID, EX, MEM are reset
6-3Dr. Martin LandSpeeding Up DLXComputer Architecture — Hadassah College — Spring 2019
Room for Improvement
DLX based on assembly line
No central system bus
Instructions move from execution stage to execution stage
Assembly line permits pipelining
In each stage, new work begins when old work passes to next stage

[Figure: DLX assembly line across CC1–CC5 — Instruction Fetch (Instruction Memory), Instruction Decode, Execute, Data Access (Data Memory), Write Back]
6-4Dr. Martin LandSpeeding Up DLXComputer Architecture — Hadassah College — Spring 2019
DLX — Version 2
CC 1
I1 enters Instruction Fetch (IF)
CC 2
I1 and its execution state move to Instruction Decode (ID)
I2 enters Instruction Fetch (IF)
CC 3
I1 and its execution state move to Execute (EX)
I2 and its execution state move to Instruction Decode (ID)
I3 enters Instruction Fetch (IF)
CC 4
I1 and its execution state move to Memory Access (MEM)
I2 and its execution state move to Execute (EX)
I3 and its execution state move to Instruction Decode (ID)
I4 enters Instruction Fetch (IF)
CC 5
I1 moves to Write Back (WB)
I2 and its execution state move to Memory Access (MEM)
I3 and its execution state move to Execute (EX)
I4 and its execution state move to Instruction Decode (ID)
I5 enters Instruction Fetch (IF)
6-5Dr. Martin LandSpeeding Up DLXComputer Architecture — Hadassah College — Spring 2019
Ideal Instruction Pipelining — Processor View
In any clock cycle (after CC 4)
5 instructions are being processed at one time
Each instruction in a different stage of execution
clock cycle   IF   ID   EX   MEM  WB
1             I1
2             I2   I1
3             I3   I2   I1
4             I4   I3   I2   I1
5             I5   I4   I3   I2   I1
6             I6   I5   I4   I3   I2
7             I7   I6   I5   I4   I3
8             I8   I7   I6   I5   I4
6-6Dr. Martin LandSpeeding Up DLXComputer Architecture — Hadassah College — Spring 2019
Ideal Instruction Pipelining — Instruction View
      clock cycle
      1   2   3   4   5   6   7   8
I1    IF  ID  EX  MEM WB
I2        IF  ID  EX  MEM WB
I3            IF  ID  EX  MEM WB
I4                IF  ID  EX  MEM WB
I5                    IF  ID  EX  MEM
I6                        IF  ID  EX
I7                            IF  ID
I8                                IF
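The diagonal pattern above is fully determined by one formula. A sketch, assuming the ideal pipeline (no stalls); the function name is illustrative:

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]

def stage_of(i, cc):
    """Stage occupied by instruction Ii (1-based) in clock cycle cc,
    for an ideal 5-stage pipeline: Ii enters IF in cycle i."""
    s = cc - i
    return STAGES[s] if 0 <= s < len(STAGES) else None
```

For example, in CC 5 all five stages are occupied: I1 is in WB, I5 is in IF.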
6-7Dr. Martin LandSpeeding Up DLXComputer Architecture — Hadassah College — Spring 2019
Average CPI for DLX Pipeline
From diagram
I1 finishes after N = 5 clock cycles
I2 finishes after N = 6 clock cycles
I3 finishes after N = 7 clock cycles
Generally
IC instructions are finished after N = IC + 4 clock cycles

CPI = clock cycles / finished instructions = (IC + 4) / IC = 1 + 4/IC → 1   (IC >> 4)

On average
One instruction completes on every clock cycle
CPI is 1 clock cycle per instruction for DLX pipeline
Limitation
Dependencies between instructions cause waiting conditions
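The limit CPI → 1 is easy to verify numerically. A minimal sketch of the formula above:

```python
def pipeline_cpi(ic, fill_cycles=4):
    """CPI = (IC + 4) / IC for the 5-stage pipeline: 4 fill cycles are
    amortized over IC instructions."""
    return (ic + fill_cycles) / ic
```

With one instruction the pipeline looks no better than version 1 (CPI = 5); with a million instructions CPI is effectively 1.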
6-8Dr. Martin LandSpeeding Up DLXComputer Architecture — Hadassah College — Spring 2019
Pipelining — Functional Requirements
Each stage receives a new instruction on every clock cycle
Cannot hold partial results for all instructions
Must pass along all intermediate results for every instruction
Example
IF stage
Loads instruction to IR
Finds NPC for next instruction
Passes IR and NPC (intermediate results) to ID stage
ID stage
Stores received IR and NPC for incoming instruction
Decodes IR to A, B, and I
Passes IR, NPC, A, B, and I to EX stage
Stage buffers
Collection of D flip-flops (edge-triggered latches)
Store intermediate results of each stage at end of clock cycle
6-9Dr. Martin LandSpeeding Up DLXComputer Architecture — Hadassah College — Spring 2019
Review — Synchronous Transfer
D flip-flop (edge-triggered latch)
Input D — output of some digital system
Output Q — changes only on falling CLK edge
Trigger — 1-to-0 CLK transition

[Figure: n-bit register built from D flip-flops with inputs D0 … Dn-1, outputs Q0 … Qn-1, and a common CLK; timing diagram marks clock cycle CC N between edges CLK_{N-1} and CLK_N]

Clock Cycle N
CC N begins on CLK_{N-1}
Input D can change — no effect on latch
CC N ends on CLK_N
Latch samples input D
Stores instantaneous input value
Forwards stored value to output Q
6-10Dr. Martin LandSpeeding Up DLXComputer Architecture — Hadassah College — Spring 2019
Stage Buffers
5 execution stages built from
Combinational logic — output = function (present input)
Asynchronous memory — output = function (present input, past input)
4 stage buffers (edge-triggered latches) and PC built from
Synchronous sequential logic
output = function (present input, past input, external clock)
Store and forward input on falling edge of CLK
Described as data structure using C notation
[Figure: pipeline stage buffers — IF/ID (NPC, IR), ID/EX (NPC, A, B, I, IR), EX/MEM (cond, ALU, B, IR), MEM/WB (ALU, LMD, IR); IF/ID/EX/MEM/WB logic blocks, PC, common CLK]
6-11Dr. Martin LandSpeeding Up DLXComputer Architecture — Hadassah College — Spring 2019
DLX Drawing — version 2
DLXv2
6-12Dr. Martin LandSpeeding Up DLXComputer Architecture — Hadassah College — Spring 2019
Formal Specification of Version 2
Instruction Fetch (IF)
PC ← NPC
New PC for new instruction fetch in every clock cycle
IF/ID.IR ← Mem[PC]
IF/ID.NPC ← { PC + 4    (no branch)
            { ALU_OUT   (branch taken — special case)
Instruction Decode (ID)
ID/EX.NPC ← IF/ID.NPC
ID/EX.A ← Reg[IF/ID.IR6-10]
ID/EX.B ← Reg[IF/ID.IR11-15]
ID/EX.I ← (IR16)^16 ## IF/ID.IR16-31
ID/EX.IR ← IF/ID.IR
Stage Buffers (←)
"See" inputs during clock cycle
Sample and store inputs on falling CLK at end of clock cycle

Type  0-5  6-10  11-15  16-31
R     op   rs1   rs2    rd, function
I     op   rs    rd     immediate
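The decode step ID/EX.I ← (IR16)^16 ## IF/ID.IR16-31 is 16-bit sign extension: the immediate's sign bit (bit 16 of the instruction) is replicated into the upper half. A sketch; the function name is an illustrative assumption:

```python
def sign_extend_imm(ir):
    """Extract bits 16-31 (the low 16 bits) of a 32-bit instruction
    word and sign-extend them, as in ID/EX.I <- (IR16)^16 ## IR16-31."""
    imm = ir & 0xFFFF
    # bit 16 of IR (0x8000 of the immediate field) is the sign bit
    return imm - 0x10000 if imm & 0x8000 else imm
```

An immediate field of 0xFFFF decodes to -1; 0x0005 decodes to 5, regardless of the opcode and register bits above it.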
6-13Dr. Martin LandSpeeding Up DLXComputer Architecture — Hadassah College — Spring 2019
Formal Specification of Version 2
Execute (EX)
EX/MEM.cond ← (ID/EX.A == 0)
EX/MEM.ALU ← { ID/EX.A function ID/EX.B   (R-ALU)
             { ID/EX.A op ID/EX.I         (I-ALU, Memory)
             { ID/EX.NPC + ID/EX.I        (Branch)
EX/MEM.B ← ID/EX.B
EX/MEM.IR ← ID/EX.IR
Memory (MEM)
MEM/WB.ALU ← EX/MEM.ALU_OUT
MEM/WB.LMD ← Mem[EX/MEM.ALU_OUT]          (Load)
Mem[EX/MEM.ALU_OUT] ← EX/MEM.B_OUT        (Store)
MEM/WB.IR ← EX/MEM.IR_OUT
Write Back (WB)
Reg[MEM/WB.IR11-15_OUT] ← { MEM/WB.ALU_OUT   (I-ALU)
                          { MEM/WB.LMD_OUT   (Load)
Reg[MEM/WB.IR16-20_OUT] ← MEM/WB.ALU_OUT     (R-ALU)

Type  0-5  6-10  11-15  16-31
R     op   rs1   rs2    rd, function
I     op   rs    rd     immediate
6-14Dr. Martin LandSpeeding Up DLXComputer Architecture — Hadassah College — Spring 2019
Instruction Transfer Timing
[Figure: DLXv2 stage buffers (IF/ID, ID/EX, EX/MEM, MEM/WB) with instruction IR1 advancing one buffer per clock]

CC 1 begins — CLK 0
Memory ← PC(I1)
IF/ID.IR "sees" Mem[PC(I1)]

CC 2 begins — CLK 1
IF/ID.IR ← Mem[PC(I1)]
Memory ← PC(I2)
ID/EX.IR "sees" Mem[PC(I1)]
IF/ID.IR "sees" Mem[PC(I2)]

CC 3 begins — CLK 2
ID/EX.IR ← Mem[PC(I1)]
IF/ID.IR ← Mem[PC(I2)]
Memory ← PC(I3)
EX/MEM.IR "sees" Mem[PC(I1)]
ID/EX.IR "sees" Mem[PC(I2)]
IF/ID.IR "sees" Mem[PC(I3)]

CC 4 begins — CLK 3
EX/MEM.IR ← Mem[PC(I1)] ...
MEM/WB.IR "sees" Mem[PC(I1)] ...

CC 5 begins — CLK 4
MEM/WB.IR ← Mem[PC(I1)]
Mem[PC(I1)] controls Write Back
DLXv2
6-15Dr. Martin LandSpeeding Up DLXComputer Architecture — Hadassah College — Spring 2019
Simple 5‐Instruction Program for DLX
Number  Address  Instruction
I1      00       ADDI R1, R2, #5
I2      04       ADD R3, R4, R5
I3      08       SW 32(R6), R7
I4      0C       LW R8, 32(R9)
I5      10       AND R10, R12, R13
6-16Dr. Martin LandSpeeding Up DLXComputer Architecture — Hadassah College — Spring 2019
Program Execution Table
IF ID EX MEM WB
CC1
ADDI R1, R2, #5
IF/ID.IR ← Mem[00] IF/ID.NPC ← 04
CC2
ADD R3, R4, R5
IF/ID.IR ← Mem[04] IF/ID.NPC ← 08
ID/EX.NPC ← 04 ID/EX.A ← R2 ID/EX.B ← R1 ID/EX.I ← 5 ID/EX.IR ← ADDI R1, R2, #5
CC3
SW 32(R6), R7
IF/ID.IR ← Mem[08] IF/ID.NPC ← 0C
ID/EX.NPC ← 08 ID/EX.A ← R4 ID/EX.B ← R5 ID/EX.I ← ??? ID/EX.IR ← ADD R3, R4, R5
EX/MEM.cond ← (R2 == 0) EX/MEM.ALU ← R2 + 5 EX/MEM.B ← R1 EX/MEM.IR ← ADDI R1, R2, #5
CC4
LW R8, 32(R9)
IF/ID.IR ← Mem[0C] IF/ID.NPC ← 10
ID/EX.NPC ← 0C ID/EX.A ← R6 ID/EX.B ← R7 ID/EX.I ← 32 ID/EX.IR ← SW 32(R6), R7
EX/MEM.cond ← (R4 == 0) EX/MEM.ALU ← R4 + R5 EX/MEM.B ← R5 EX/MEM.IR ← ADD R3, R4, R5
MEM/WB.ALU ← R2 + 5 MEM/WB.IR ← ADDI R1, R2, #5
CC5
AND R10, R12, R13
IF/ID.IR ← Mem[10] IF/ID.NPC ← 14
ID/EX.NPC ← 10 ID/EX.A ← R9 ID/EX.B ← R8 ID/EX.I ← 32 ID/EX.IR ← LW R8, 32(R9)
EX/MEM.cond ← (R6 == 0) EX/MEM.ALU ← R6 + 32 EX/MEM.B ← R7 EX/MEM.IR ← SW 32(R6), R7
MEM/WB.ALU ← R4 + R5 MEM/WB.IR ← ADD R3, R4, R5
R1 ← R2 + 5
CC6
ID/EX.NPC ← 14 ID/EX.A ← R12 ID/EX.B ← R13 ID/EX.I ← ??? ID/EX.IR ← AND R10, R12, R13
EX/MEM.cond ← (R9 == 0) EX/MEM.ALU ← R9 + 32 EX/MEM.B ← R8 EX/MEM.IR ← LW R8, 32(R9)
Mem[R6 + 32] ← R7 MEM/WB.ALU ← R6 + 32 MEM/WB.IR ← SW 32(R6), R7
R3 ← R4 + R5
CC7
EX/MEM.cond ← (R12 == 0) EX/MEM.ALU ← R12 AND R13 EX/MEM.B ← R13 EX/MEM.IR ← AND R10, R12, R13
MEM/WB.LMD ← Mem[R9 + 32] MEM/WB.ALU ← R9 + 32 MEM/WB.IR ← LW R8, 32(R9)
CC8 MEM/WB.ALU ← R12 AND R13 MEM/WB.IR ← AND R10, R12, R13
R8 ← Mem[R9 + 32]
CC9 R10 ← R12 AND R13
Latch on CLK1 Latch on CLK2
DLXv2
6-17Dr. Martin LandSpeeding Up DLXComputer Architecture — Hadassah College — Spring 2019
First Clock Cycles
After CLK 0
Memory ← PC = 00 ⇒ IF/ID.IR "sees" Mem[00] and IF/ID.NPC "sees" 04 as inputs
After CLK 1
Memory ← PC = 04 ⇒ IF/ID.IR "sees" Mem[04] and IF/ID.NPC "sees" 08 as inputs
IF/ID.IR latches Mem[00] and ID/EX.IR "sees" IF/ID.IR (ADDI R1, R2, #5) as input
Register file "sees" IF/ID.IR, and ID/EX.A, B, I "see" R2, R1, 5 as inputs
IF ID EX
CC1
ADDI R1, R2, #5
IF/ID.IR ← Mem[00] IF/ID.NPC ← 04
CC2
ADD R3, R4, R5
IF/ID.IR ← Mem[04] IF/ID.NPC ← 08
ID/EX.NPC ← 04 ID/EX.A ← R2 ID/EX.B ← R1 ID/EX.I ← 5 ID/EX.IR ← ADDI R1, R2, #5
CC3
SW 32(R6), R7
IF/ID.IR ← Mem[08] IF/ID.NPC ← 0C
ID/EX.NPC ← 08 ID/EX.A ← R4 ID/EX.B ← R5 ID/EX.I ← ??? ID/EX.IR ← ADD R3, R4, R5
EX/MEM.cond ← (R2 == 0) EX/MEM.ALU ← R2 + 5 EX/MEM.B ← R1 EX/MEM.IR ← ADDI R1, R2, #5
CC4
LW R8, 32(R9)
IF/ID.IR ← Mem[0C] IF/ID.NPC ← 10
ID/EX.NPC ← 0C ID/EX.A ← R6 ID/EX.B ← R7 ID/EX.I ← 32 ID/EX.IR ← SW 32(R6), R7
EX/MEM.cond ← (R4 == 0) EX/MEM.ALU ← R4 + R5 EX/MEM.B ← R5 EX/MEM.IR ← ADD R3, R4, R5
DLXv2
6-18Dr. Martin LandSpeeding Up DLXComputer Architecture — Hadassah College — Spring 2019
Processor State Just Before CLK 4
Input and Output Data at Stage Buffers in CC 4
DLXv2
6-19Dr. Martin LandSpeeding Up DLXComputer Architecture — Hadassah College — Spring 2019
Processor State Just After CLK 4
Input and Output Data at Stage Buffers in CC 5
DLXv2
6-20Dr. Martin LandSpeeding Up DLXComputer Architecture — Hadassah College — Spring 2019
New Technology, New Headaches
Analysis of Pipeline Hazards
6-21Dr. Martin LandSpeeding Up DLXComputer Architecture — Hadassah College — Spring 2019
Instruction Dependencies: Definitions
Instruction dependencies
Result of one instruction needed to execute later instruction
Hazard
Processor runs smoothly but provides wrong answers
Pipeline hazard
Several instructions in various stages of execution
Pipeline uses a resource value before update by earlier instruction
Example
PC ← NPC on each clock cycle
Branch instruction requires PC ← NPC + I
Correct evaluation of NPC + I not available on next clock cycle
Hazard Types
Structural Hazard — conflict over access to resource
Data Hazard — instruction result not ready when needed
Control Hazard — branch address not ready when needed
6-22Dr. Martin LandSpeeding Up DLXComputer Architecture — Hadassah College — Spring 2019
Dealing with Hazards
Avoid error
Pause pipeline and wait for resource to be available
Called wait state or pipeline stall
Degrades processor performance
Adds stall clock cycles to instruction execution
Eliminate cause of stall
Improve implementation based on analysis of stalls
Main activity of hardware architects

CPI = (processing clock cycles (ideal) + stalled clock cycles) / completed instructions
    = (N_ideal + N_stall) / IC
    = CPI_ideal + CPI_stall → 1 + CPI_stall   (IC large; CPI_ideal = 1 on DLX)

performance degradation = 1 − CPI_ideal / (CPI_ideal + CPI_stall) = CPI_stall / (CPI_ideal + CPI_stall)
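The two formulas above can be packaged directly. Function names are illustrative assumptions:

```python
def cpi_with_stalls(cpi_stall, cpi_ideal=1.0):
    """CPI = CPI_ideal + CPI_stall."""
    return cpi_ideal + cpi_stall

def degradation(cpi_stall, cpi_ideal=1.0):
    """Fraction of throughput lost: CPI_stall / (CPI_ideal + CPI_stall)."""
    return cpi_stall / (cpi_ideal + cpi_stall)
```

With CPI_stall = 0.40, CPI becomes 1.40 and degradation is about 29%, the numbers that recur in the stall analyses below.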
6-23Dr. Martin LandSpeeding Up DLXComputer Architecture — Hadassah College — Spring 2019
Structural Hazards
Conflict over access to resource
No structural hazards in DLX
Typical structural hazard — unified cache hazard
Instructions and data in same memory device
Cannot access data and fetch instruction on same clock cycle
Instruction fetch waits 1 clock cycle for every data memory access (loads and stores)

[Figure: pipeline with unified instruction-and-data memory — IF and MEM share one memory across CC1–CC5]

No DLX version implemented with unified cache
6-24Dr. Martin LandSpeeding Up DLXComputer Architecture — Hadassah College — Spring 2019
Stall on Cache Hazard
On CC5 Load Word (LW) instruction blocks Instruction Fetch (IF)
No instruction is fetched on CC5
No instruction (NOP) is forwarded to ID on CC6
NOP = bubble = φ forwarded to EX on CC7, etc.

        IF   ID   EX   MEM  WB
CC1     I1
CC2     LW   I1
CC3     I2   LW   I1
CC4     I3   I2   LW   I1
CC5     φ    I3   I2   LW   I1
CC6     I4   φ    I3   I2   LW
CC7          I4   φ    I3   I2
CC8               I4   φ    I3
CC9                    I4   φ
CC10                        I4
No DLX version implemented
with unified cache
6-25Dr. Martin LandSpeeding Up DLXComputer Architecture — Hadassah College — Spring 2019
Effect of Cache Hazard on CPI

CPI_stall = stall cycles / instruction
          = Σ_j (stall cycles / stall of type j) × (stalls of type j / instruction of type j) × (instructions of type j / instructions)
(instruction of type j only causes stall of type j)

For the unified cache hazard, each data memory access (load or store) causes 1 stall cycle:

CPI_stall = (1 stall cycle / data stall) × (1 data stall / data memory access) × (IC_load + IC_store) / IC
          = 1 × 1 × (0.25 loads + 0.15 stores)
          = 0.40 stall cycles / instruction

⇒ CPI = CPI_ideal + CPI_stall = 1 + 0.40 = 1.40
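The unified-cache penalty is a one-line computation over the SPEC 92 mix used earlier:

```python
# Unified-cache hazard: every load or store steals one fetch cycle.
load_frac, store_frac = 0.25, 0.15   # IC_load/IC, IC_store/IC (SPEC 92)
stall_per_access = 1                 # 1 stall cycle per data memory access
cpi_stall = stall_per_access * (load_frac + store_frac)
cpi = 1.0 + cpi_stall                # CPI_ideal = 1 on the DLX pipeline
```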
6-26Dr. Martin LandSpeeding Up DLXComputer Architecture — Hadassah College — Spring 2019
Data Hazards
Instruction result not ready when needed
Operations performed in the wrong order
Classification named for correct order of operations
Read After Write (RAW)
Correct — I2 reads register after I1 writes to it
Hazard — I2 reads register before I1 writes to it
I2 uses incorrect value
Write After Write (WAW)
Correct — I2 writes to register after I1 writes to it
Hazard — I2 writes to register before I1 writes to it
Incorrect value stays in register
Write After Read (WAR)
Correct — I2 writes to register after I1 reads it
Hazard — I2 writes to register before I1 reads it
I1 uses incorrect value
Read After Read (RAR)
No hazard — reads do not affect registers
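The four categories reduce to set intersections between what the earlier instruction writes/reads and what the later one reads/writes. A sketch — real hazard detection also needs the pipeline timing, and the function name is an illustrative assumption:

```python
def classify(first, second):
    """Dependence types between an earlier (first) and later (second)
    instruction; each is (reads, writes) as sets of register names."""
    r1, w1 = first
    r2, w2 = second
    kinds = []
    if w1 & r2:
        kinds.append("RAW")  # second reads what first writes
    if r1 & w2:
        kinds.append("WAR")  # second writes what first reads
    if w1 & w2:
        kinds.append("WAW")  # both write the same register
    return kinds             # RAR (reads only) is not a hazard

# ADD R1,R2,R3 followed by SUB R4,R5,R1: SUB reads ADD's destination
add = ({"R2", "R3"}, {"R1"})
sub = ({"R5", "R1"}, {"R4"})
```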
6-27Dr. Martin LandSpeeding Up DLXComputer Architecture — Hadassah College — Spring 2019
Data Hazards in DLXv2
RAW hazards
DLX registers updated in stage 5
Next instruction may read register in stage 2
Possible hazard to be avoided
WAW hazards cannot occur
DLX writes in uniform order
Memory updated in MEM
Registers updated in WB
All updates performed in order of execution
I2 cannot perform WB or MEM before I1 performs WB or MEM
WAR hazards cannot occur
Loads performed in MEM and register reads in ID
Stores performed in MEM and registers updated in WB
I2 cannot perform WB or MEM before I1 performs ID or MEM

[Figure: DLX pipeline datapath — IF (Instruction Memory), ID, EX, MEM (Data Memory), WB]
6-28Dr. Martin LandSpeeding Up DLXComputer Architecture — Hadassah College — Spring 2019
Register‐Register RAW Dependencies in DLXv2
Program with register-register dependencies
I1 ADD R1,R2,R3   — I1 has R1 as destination
I2 SUB R4,R5,R1
I3 AND R6,R7,R1   — I2 – I4 have R1 as source
I4 OR  R8,R9,R1

        IF   ID   EX   MEM  WB
CC1     ADD
CC2     SUB  ADD
CC3     AND  SUB  ADD
CC4     OR   AND  SUB  ADD
CC5          OR   AND  SUB  ADD
CC6               OR   AND  SUB
CC7                    OR   AND
CC8                         OR

Bad timing (uncorrected execution)
I1 updates R1 in WB during CC5
I2 reads R1 in ID during CC3
I3 reads R1 in ID during CC4
I4 reads R1 in ID during CC5
6-29Dr. Martin LandSpeeding Up DLXComputer Architecture — Hadassah College — Spring 2019
Detailed View of CC5 (Uncorrected) in DLXv2
SUB and AND instructions suffer RAW hazard — read wrong value of R1
OR instruction reads correct value of R1

[Figure: DLXv2 pipeline in CC5 — OR in IF/ID, AND in ID/EX, SUB in EX/MEM, ADD in MEM/WB]

START of CC5:
MEM/WB.ALU sees wrong SUB result
EX/MEM.ALU sees wrong AND result
ID/EX.R1 sees wrong value for OR

END of CC5:
ADD result stored in R1 — R1 stores ADD result
ID/EX.R1 latches correct value for OR
EX/MEM.ALU latches wrong AND result
MEM/WB.ALU latches wrong SUB result
6-30Dr. Martin LandSpeeding Up DLXComputer Architecture — Hadassah College — Spring 2019
Pipeline Stall to Avoid RAW Hazard in DLXv2
Wait states during CC3 and CC4
ID/EX freezes internal state on SUB
IF/ID freezes internal state on AND (cannot enter ID until SUB finishes and moves to EX)
ID performs NOP (no operation) to avoid reading old value of R1
ID/EX passes φ (NOP) to EX
Continuation — no hazard in CC5
WB operation performed at start of clock cycle
Latching of register values in ID performed at end of clock cycle

        IF   ID   EX   MEM  WB
CC1     ADD
CC2     SUB  ADD
CC3     AND  SUB  ADD
CC4     AND  SUB  φ    ADD
CC5     AND  SUB  φ    φ    ADD
CC6     OR   AND  SUB  φ    φ
CC7          OR   AND  SUB  φ
CC8               OR   AND  SUB
CC9                    OR   AND
CC10                        OR

The DLX control system must be able to identify all hazards and insert stall cycles when necessary.
6-31Dr. Martin LandSpeeding Up DLXComputer Architecture — Hadassah College — Spring 2019
Pipeline Stall in Instruction View in DLXv2
Performance degradation too large

CPI_stall = (stall cycles / stall) × (stalls / ALU instruction) × (IC_ALU / IC)
          = 2 stall cycles × 0.5 register dependencies × 0.40
          = 0.40 stall cycles / instruction
⇒ CPI = 1.40 (29% degradation)

Wait states — ID/EX freezes state and passes NOP (no operation) to EX

                Clock Cycle
                1   2   3   4   5   6   7   8
ADD R1,R2,R3    IF  ID  EX  MEM WB
SUB R4,R5,R1        IF  ID  ID  ID  EX  MEM WB
AND R6,R7,R1            IF  IF  IF  ID  EX  MEM
OR  R8,R9,R1                        IF  ID  EX
6-32Dr. Martin LandSpeeding Up DLXComputer Architecture — Hadassah College — Spring 2019
Forwarding or Bypass (DLX Version 3)
ADD writes ALU result to R1 in CC5
SUB needs R1 for ALU operation in CC4
AND needs R1 for ALU operation in CC5
Trick to prevent stall
ADD calculates ALU result in CC3
Allow SUB and AND to read incorrect value in ID
Provide correct value from EX/MEM.ALU and MEM/WB.ALU directly to EX

[Figure: DLX pipeline datapath — IF (Instruction Memory), ID, EX, MEM (Data Memory), WB — with bypass paths into EX]

        IF   ID   EX   MEM  WB
CC1     ADD
CC2     SUB  ADD
CC3     AND  SUB  ADD
CC4     OR   AND  SUB  ADD
CC5          OR   AND  SUB  ADD
CC6               OR   AND  SUB
CC7                    OR   AND
CC8                         OR

DLX Version 3
6-33Dr. Martin LandSpeeding Up DLXComputer Architecture — Hadassah College — Spring 2019
DLX Pipelined Implementation in DLXv3
MUXes in EX choose from NPC, A, B, I, EX/MEM.ALU, MEM/WB.ALU
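The forwarding MUX logic for one ALU input can be sketched as: prefer the newest in-flight result over the (possibly stale) register-file value. The function name and tuple layout are illustrative assumptions:

```python
def ex_operand(reg_name, reg_value, ex_mem=None, mem_wb=None):
    """Forwarding MUX for one ALU input in EX.
    ex_mem / mem_wb: (dest_reg, value) of the instructions one and two
    stages ahead, or None. EX/MEM is newer, so it wins over MEM/WB."""
    if ex_mem and ex_mem[0] == reg_name:
        return ex_mem[1]
    if mem_wb and mem_wb[0] == reg_name:
        return mem_wb[1]
    return reg_value
```

For SUB R4,R5,R1 one cycle behind ADD R1,..., the R1 operand comes from EX/MEM.ALU; two cycles behind, from MEM/WB.ALU.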
6-34Dr. Martin LandSpeeding Up DLXComputer Architecture — Hadassah College — Spring 2019
Forwarding in Instruction View in DLXv3
Processor moves state of ADD instruction from buffer to buffer
SUB needs ALU result in CC4
ADD provides ALU result from EX/MEM.ALU
AND needs ALU result in CC5
ADD provides ALU result from MEM/WB.ALU

                Clock Cycle
                1   2   3   4   5   6
ADD R1,R2,R3    IF  ID  EX  MEM WB
SUB R4,R5,R1        IF  ID  EX  MEM WB
AND R6,R7,R1            IF  ID  EX  MEM
OR  R8,R9,R1                IF  ID  EX

CPI_stall = 0 — no stall cycles for register-register RAW hazard
6-35Dr. Martin LandSpeeding Up DLXComputer Architecture — Hadassah College — Spring 2019
Register‐Load RAW Dependencies in DLXv3
Program with register-load dependencies
I1 LW  R1,32(R2)  — I1 has R1 as destination
I2 SUB R4,R5,R1
I3 AND R6,R7,R1   — I2 – I4 have R1 as source
I4 OR  R8,R9,R1

        IF   ID   EX   MEM  WB
CC1     LW
CC2     SUB  LW
CC3     AND  SUB  LW
CC4     OR   AND  SUB  LW
CC5          OR   AND  SUB  LW
CC6               OR   AND  SUB
CC7                    OR   AND
CC8                         OR

Bad timing (uncorrected execution)
I1 updates R1 in WB during CC5
I2 reads R1 in ID during CC3
I3 reads R1 in ID during CC4
I4 reads R1 in ID during CC5
6-36Dr. Martin LandSpeeding Up DLXComputer Architecture — Hadassah College — Spring 2019
Memory Forwarding or Bypass (Version 4)
LW writes loaded data to R1 in CC5
SUB needs R1 for ALU operation in CC4
AND needs R1 for ALU operation in CC5
Trick to minimize stall
LW obtains loaded data in CC4
Allow SUB to read incorrect value in ID
Stall SUB for 1 clock cycle in ID (load performed later than ALU operation)
Provide correct value from MEM/WB.LMD directly to EX

[Figure: DLX pipeline datapath — IF (Instruction Memory), ID, EX, MEM (Data Memory), WB — with load bypass into EX]

        IF   ID   EX   MEM  WB
CC1     LW
CC2     SUB  LW
CC3     AND  SUB  LW
CC4     AND  SUB  φ    LW
CC5     OR   AND  SUB  φ    LW
CC6          OR   AND  SUB  φ
CC7               OR   AND  SUB
CC8                    OR   AND
CC9                         OR

DLX Version 4
6-37Dr. Martin LandSpeeding Up DLXComputer Architecture — Hadassah College — Spring 2019
DLX Pipelined Implementation in DLXv4
MUXes in EX choose from NPC, A, B, I, EX/MEM.ALU, MEM/WB.ALU, MEM/WB.LMD
6-38Dr. Martin LandSpeeding Up DLXComputer Architecture — Hadassah College — Spring 2019
Forwarding in Instruction View in DLXv4
Loaded data used immediately in ALU operation in about 50% of loads

CPI_stall = (stall cycles / stall) × (stalls / load instruction) × (IC_load / IC)
          = 1 stall cycle × 0.5 (ALU uses loaded data) × 0.25
          = 0.125 stall cycles / instruction
⇒ CPI = 1.125 (11% degradation)

                Clock Cycle
                1   2   3   4   5   6   7
LW  R1,32(R2)   IF  ID  EX  MEM WB
SUB R4,R5,R1        IF  ID  ID  EX  MEM WB
AND R6,R7,R1            IF  IF  ID  EX  MEM
OR  R8,R9,R1                    IF  ID  EX
6-39Dr. Martin LandSpeeding Up DLXComputer Architecture — Hadassah College — Spring 2019
Register‐Store RAW Dependencies in DLXv4
Program with register-store dependency
I1 SUB R1,R5,R4   — I1 has R1 as destination
I2 SW 32(R2),R1   — I2 has R1 as source

        IF   ID   EX   MEM  WB
CC1     SUB
CC2     SW   SUB
CC3          SW   SUB
CC4               SW   SUB
CC5                    SW   SUB
CC6                         SW

Bad timing (uncorrected execution) in DLXv4
I1 updates R1 in WB during CC5
I2 reads R1 in ID during CC3
Trick to prevent stall (Version 5)
SW reads incorrect value in ID
Provide correct value from MEM/WB.ALU directly to data memory
6-40Dr. Martin LandSpeeding Up DLXComputer Architecture — Hadassah College — Spring 2019
DLX Pipelined Implementation — Version 5
New MUX in MEM chooses B or MEM/WB.ALU
6-41Dr. Martin LandSpeeding Up DLXComputer Architecture — Hadassah College — Spring 2019
Compiler Scheduling to Prevent RAW Hazards
C program code
I = I + 123;
J = J - 567;

First pass compilation
LW  R2, I
ADD R2, R2, #123
SW  I, R2
LW  R3, J
SUB R3, R3, #567
SW  J, R3

                  1 2 3 4 5 6 7 8 9 10 11 12
LW  R2, I         F D X M W
ADD R2,R2,#123      F D D X M W
SW  I, R2             F F D X M W
LW  R3, J                 F D X M W
SUB R3,R3,#567              F D D X M  W
SW  J, R3                       F F D  X  M  W

Second pass compilation
LW  R2, I
LW  R3, J
ADD R2, R2, #123
SW  I, R2
SUB R3, R3, #567
SW  J, R3

                  1 2 3 4 5 6 7 8 9 10
LW  R2, I         F D X M W
LW  R3, J           F D X M W
ADD R2,R2,#123        F D X M W
SW  I, R2               F D X M W
SUB R3,R3,#567            F D X M W
SW  J, R3                   F D X  M  W

DLXv5
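The compiler's win can be counted with a simplified stall model: with full forwarding, only a load whose result feeds the very next instruction stalls (one cycle). This is a sketch of that rule, not the full DLX scheduler; the function name and tuple layout are illustrative:

```python
def load_use_stalls(prog):
    """prog: list of (opcode, dest, sources). Count 1-cycle load-use
    stalls: a load immediately followed by a consumer of its result."""
    return sum(
        1
        for prev, cur in zip(prog, prog[1:])
        if prev[0] == "LW" and prev[1] in cur[2]
    )

first_pass = [
    ("LW",  "R2", []),
    ("ADD", "R2", ["R2"]),   # uses R2 right after the load -> stall
    ("SW",  None, ["R2"]),
    ("LW",  "R3", []),
    ("SUB", "R3", ["R3"]),   # uses R3 right after the load -> stall
    ("SW",  None, ["R3"]),
]
# second pass: both loads hoisted to the front, consumers pushed back
second_pass = [first_pass[i] for i in (0, 3, 1, 2, 4, 5)]
```

Reordering removes both load-use pairs, so the scheduled code runs with no stalls, matching the two tables above.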
6-42Dr. Martin LandSpeeding Up DLXComputer Architecture — Hadassah College — Spring 2019
DLX Control Hazard
On each clock cycle
PC ← NPC — new PC for new instruction fetch in every clock cycle
Control hazard
Incorrect address on branch instructions
Stages of branch execution

CLK  Clock Cycle  Latched state                  Action during CC
0    1            Memory ← PC(I1)                IF/ID.IR "sees" instruction and PC(I1)
1    2            IF/ID.IR ← branch              Decode of branch instruction, NPC, I
2    3            ID/EX.NPC,I ← NPC, I           Calculate address NPC+I and cond
3    4            EX/MEM.ALU,cond ← ALU, cond    PC "sees" correct address via MUX using cond to choose NPC or NPC+I
4    5            PC ← branch address            IF/ID.IR "sees" correct instruction
6-43Dr. Martin LandSpeeding Up DLXComputer Architecture — Hadassah College — Spring 2019
Pipeline Flush for Control Hazard in DLXv5
Pipeline flush
Empty and restart pipeline
Simplest solution to implement

                1   2   3   4   5   6   7   8   9
BEQZ R1,IT      IF  ID  EX  MEM WB
Fall-Through        IF  φ   φ   IF  ID  EX  MEM WB
Target (IT)                     IF  ID  EX  MEM WB

Decode branch and flush pipeline — PC "sees" correct address
Fall-Through (NPC) or Target (NPC+I)
Correct instruction is fetched in CC5
6-44Dr. Martin LandSpeeding Up DLXComputer Architecture — Hadassah College — Spring 2019
Performance Degradation for Pipeline Flush
Stalled (wasted) cycles

CPI_stall = (stall cycles / stall) × (stalls / branch instruction) × (IC_branch / IC)
          = 3 stall cycles × 1 stall per branch × 0.20
          = 0.60 stall cycles / instruction
⇒ CPI = 1.60 (38% degradation)

                1   2   3   4   5   6   7   8   9
BEQZ R1,IT      IF  ID  EX  MEM WB
Fall-Through        IF  φ   φ   IF  ID  EX  MEM WB
Target (IT)                     IF  ID  EX  MEM WB

DLXv5
6-45Dr. Martin LandSpeeding Up DLXComputer Architecture — Hadassah College — Spring 2019
Improving Branch Performance — 1
Enhancement 1
Earlier instruction fetch after pipeline flush
Version 5 — PC "sees" correct address in CC4 but fetches in CC5
Version 6a — PC latches correct address when ready, in CC4
Special CLK for pipeline flush recovery

CPI_stall = 2 stall cycles × 0.20 (IC_branch / IC) = 0.40 stall cycles / instruction
⇒ CPI = 1.40 (29% degradation)

                1   2   3   4   …
BEQZ            IF  ID  EX  MEM
Fall-Through        IF  φ   IF
Target                      IF

DLXv6a
6-46Dr. Martin LandSpeeding Up DLXComputer Architecture — Hadassah College — Spring 2019
Improving Branch Performance — 2
Enhancement 2 — dedicated ALU for branch address in ID stage
Version 6b
Branch address available in CC3
PC updates in CC3

CPI_stall = 1 stall cycle × 0.20 (IC_branch / IC) = 0.20 stall cycles / instruction
⇒ CPI = 1.20 (17% degradation)

                1   2   3   …
BEQZ            IF  ID  EX
Fall-Through        IF  IF
Target                  IF

DLXv6b
6-47Dr. Martin LandSpeeding Up DLXComputer Architecture — Hadassah College — Spring 2019
Improving Branch Performance — 3
Enhancement 3
Versions 5 – 6b — flush entire pipeline, restart with correct branch address
Version 6c — flush entire pipeline on branch taken; continue instruction in IF on branch not taken
Branch address and cond ready in ID

                1   2   3   4   5   6   7
BEQZ R1,IT      IF  ID  EX  MEM WB
Fall-Through        IF  ID  EX  MEM WB        branch not taken (cond = 0 ⇒ PC ← NPC)
Target                  IF  ID  EX  MEM WB    branch taken (cond = 1 ⇒ PC ← NPC + I); Fall-Through flushed after IF

DLXv6c
6-48Dr. Martin LandSpeeding Up DLXComputer Architecture — Hadassah College — Spring 2019
DLX Version 6c
6-49Dr. Martin LandSpeeding Up DLXComputer Architecture — Hadassah College — Spring 2019
Version 6c Branch Processing — 1
CC1
BEQZ fetched to IF
PC "sees" PC_F-T = NPC = PC + 4
Points to I_FALL-THROUGH
6-50Dr. Martin LandSpeeding Up DLXComputer Architecture — Hadassah College — Spring 2019
Version 6c Branch Processing — 2
CC2
IF fetches I_FALL-THROUGH
BEQZ advances to ID
Calculates target address NPC + I and cond
PC "sees" NPC = PC_F-T + 4
Points to I_FALL-THROUGH+1
6-51Dr. Martin LandSpeeding Up DLXComputer Architecture — Hadassah College — Spring 2019
Version 6c Branch Processing — 3
CC3
IF fetches I_FALL-THROUGH+1
BEQZ advances to EX
ID/EX latches NPC + I and cond
PC "sees" PC_TARG = PC + I
Points to I_TARG
6-52Dr. Martin LandSpeeding Up DLXComputer Architecture — Hadassah College — Spring 2019
Version 6c Branch Processing — 4
CC3
PC receives special CLK
Latches PC_TARG = PC + I
IF fetches I_TARG
PC "sees" PC_TARG+1 = PC_TARG + 4
Points to I_TARG+1
On CC4
IF/ID.IR latches I_TARG
PC latches PC_TARG+1 = PC_TARG + 4
6-53Dr. Martin LandSpeeding Up DLXComputer Architecture — Hadassah College — Spring 2019
Branch Performance of Version 6c
Method called Predict-Not-Taken
Branch taken — flush entire pipeline
Branch not taken — continue instruction in IF
Better performance on not taken (no pipeline stall)
Ideal method if most branches are not taken
Statistics from SPEC CINT
Not taken 33%
Taken 67%

CPI_stall = (stall cycles / taken branch) × (taken branches / branch) × (IC_branch / IC)
          = 1 stall cycle × 0.67 × 0.20
          = 0.13 stall cycles / instruction
⇒ CPI = 1.13 (12% degradation)
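The predict-not-taken cost model is a three-factor product. A sketch with the slide's SPEC CINT numbers as defaults; the function name is an illustrative assumption:

```python
def predict_not_taken_cpi(branch_frac=0.20, taken_frac=0.67,
                          taken_penalty=1, cpi_ideal=1.0):
    """CPI under predict-not-taken: only taken branches pay the
    flush penalty (stall cycles per taken branch)."""
    return cpi_ideal + taken_penalty * taken_frac * branch_frac
```

If every branch paid the penalty (as in version 6b without the not-taken optimization), the same formula with taken_frac = 1.0 gives CPI = 1.20.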
6-54Dr. Martin LandSpeeding Up DLXComputer Architecture — Hadassah College — Spring 2019
DLXv6c Pipeline
[Figure: DLXv6c pipeline — IF (Instruction Memory), ID, EX (Integer ALU and Floating Point Unit (FPU)), MEM (Data Memory), WB]

Forwarding
ALU result to ALU source
Memory load to ALU source (with 1 CC stall)
ALU result to memory store
Other dependencies
Require stall until Write-Back of intermediate result

DLXv6c
6-55Dr. Martin LandSpeeding Up DLXComputer Architecture — Hadassah College — Spring 2019
DLXv6c Formal Specification (Integer Pipeline) — 1
Instruction Fetch (IF)
PC ← { PC + 4        (cond = 0)
     { ID/EX.NNPC    (cond = 1)
IF/ID.NPC ← { PC + 4        (cond = 0)
            { ID/EX.NNPC    (cond = 1)
IF/ID.IR ← Mem[PC]
Instruction Decode (ID)
ID/EX.A ← Reg[IF/ID.IR6-10]
ID/EX.B ← Reg[IF/ID.IR11-15]
ID/EX.I ← (IR16)^16 ## IF/ID.IR16-31
ID/EX.IR ← IF/ID.IR
ID/EX.NNPC ← IF/ID.NPC + (IR16)^16 ## IF/ID.IR16-31
ID/EX.cond ← (Reg[IF/ID.IR6-10] == 0)
Stage Buffers (←)
Sample and store inputs on falling CLK
"See" new inputs during clock cycle (between falling CLKs)

Type  0-5  6-10  11-15  16-31
R     op   rs1   rs2    rd, function
I     op   rs    rd     immediate
6-56Dr. Martin LandSpeeding Up DLXComputer Architecture — Hadassah College — Spring 2019
DLXv6c Formal Specification (Integer Pipeline) — 2
Execute (EX)
EX/MEM.ALU ← { ID/EX.A function ID/EX.B   (R-ALU)
             { ID/EX.A op ID/EX.I         (I-ALU, Memory)
Forwarding: EX/MEM.ALU_OUT, MEM/WB.ALU_OUT, or MEM/WB.LMD_OUT substituted for A or B
EX/MEM.B ← ID/EX.B
EX/MEM.IR ← ID/EX.IR
Memory (MEM)
MEM/WB.ALU ← EX/MEM.ALU_OUT
MEM/WB.LMD ← Mem[EX/MEM.ALU_OUT]          (Load)
Mem[EX/MEM.ALU_OUT] ← EX/MEM.B_OUT        (Store)
Forwarding: MEM/WB.ALU_OUT substituted for B
MEM/WB.IR ← EX/MEM.IR
Write Back (WB)
Reg[MEM/WB.IR11-15_OUT] ← { MEM/WB.ALU_OUT   (I-ALU)
                          { MEM/WB.LMD_OUT   (Load)
Reg[MEM/WB.IR16-20_OUT] ← MEM/WB.ALU_OUT     (R-ALU)

Type  0-5  6-10  11-15  16-31
R     op   rs1   rs2    rd, function
I     op   rs    rd     immediate
6-57Dr. Martin LandSpeeding Up DLXComputer Architecture — Hadassah College — Spring 2019
Forwarding ALU – ALU
1 2 3 4 5 6 7 8 9
ADD R1, R2, R3 IF ID EX MEM WB
ADD R4, R1, R5 IF ID EX MEM WB
ADD R6, R4, R1 IF ID EX MEM WB
ADD R7, R2, R1 IF ID EX MEM WB
6-58Dr. Martin LandSpeeding Up DLXComputer Architecture — Hadassah College — Spring 2019
Forwarding Load – ALU
1 2 3 4 5 6 7 8 9
LW R1, 8(R2) IF ID EX MEM WB
ADD R3, R1, R2 IF ID ID EX MEM WB
ADD R4, R3, R1 IF IF ID EX MEM WB
1 2 3 4 5 6 7 8
LW R1, 8(R2) IF ID EX MEM WB
ADD R4, R4, R1 IF ID ID EX MEM WB
ADD R4, R4, R3 IF IF ID EX MEM WB
1 2 3 4 5 6 7 8
LW R1, 8(R2) IF ID EX MEM WB
ADD R4, R4, R3 IF ID EX MEM WB
ADD R4, R4, R1 IF ID EX MEM WB
6-59Dr. Martin LandSpeeding Up DLXComputer Architecture — Hadassah College — Spring 2019
Forwarding ALU ‐ Store
1 2 3 4 5 6 7 8 9
ADD R1, R3, R2 IF ID EX MEM WB
SW 8(R2), R1 IF ID EX MEM WB
1 2 3 4 5 6 7 8 9
ADD R1, R3, R2 IF ID EX MEM WB
ADD R4, R5, R6 IF ID EX MEM WB
SW 8(R2), R1 IF ID ID EX MEM WB
SW 10(R4), R1 IF IF ID EX MEM WB
6-60Dr. Martin LandSpeeding Up DLXComputer Architecture — Hadassah College — Spring 2019
ALU ‐ Branch
1 2 3 4 5 6 7 8 9
ADD R1, R3, R2 IF ID EX MEM WB
BEQZ R1, targ IF ID ID ID EX MEM WB
1 2 3 4 5 6 7 8 9
ADD R1, R3, R2 IF ID EX MEM WB
ADD R4, R5, R6 IF ID EX MEM WB
ADD R7, R8, R9 IF ID EX MEM WB
BEQZ R1, targ IF ID EX MEM WB
6-61Dr. Martin LandSpeeding Up DLXComputer Architecture — Hadassah College — Spring 2019
Improvement by Re‐Scheduling in DLXv6c
Re-scheduled loop (cycles 1–15, no stalls):
ADDI R1, R0, #400   F D X M W
SUBI R1, R1, #4     F D X M W
LW R2, 0(R1)        F D X M W   (forward R1)
LW R3, 400(R1)      F D X M W
LW R5, 800(R1)      F D X M W
LW R6, C00(R1)      F D X M W
ADD R4, R2, R3      F D X M W
SUB R4, R4, R5      F D X M W
ADD R4, R4, R6      F D X M W
SW 0(R1), R4        F D X M W   (forward R4)
BNEZ R1, FFD8       F D X M W

Original loop order (cycles 1–20, stalls on dependencies):
ADDI R1, R0, #400   F D X M W
LW R2, -4(R1)       F D X M W   (forward R1)
LW R3, 3FC(R1)      F D X M W
ADD R4, R2, R3      F D D X M W   (forward R3)
LW R2, 7FC(R1)      F F D X M W
SUB R4, R4, R2      F D D X M W   (forward R2)
LW R2, BFC(R1)      F F D X M W
ADD R4, R4, R2      F D D X M W   (forward R2)
SW -4(R1), R4       F F D X M W
SUBI R1, R1, #4     F D X M W
BNEZ R1, -40        F D D D X M W

Source: a[i] = a[i] + b[i] − c[i] + d[i];  a[] = 000 – 3FF, b[] = 400 – 7FF, c[] = 800 – BFF, d[] = C00 – FFF
6-62Dr. Martin LandSpeeding Up DLXComputer Architecture — Hadassah College — Spring 2019
General Branch Prediction
Branch statistics from SPEC CINT:
Branch not taken 33%
Branch taken 67%
Most branch instructions are used to build loops and run more than once
Branch prediction:
Advanced technique
Not implemented in the DLX model
Used in modern RISC processors and in Intel x86 since the Pentium
Branch predictor Records statistics on branch instructions
Source address, target address, taken/not-taken
Predicts branch behavior based on previous behavior
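As one concrete instance of "predict from previous behavior", a 2-bit saturating-counter predictor indexed by the branch source address can be sketched as below; the table size, the hash (address mod table size), and the 2-bit scheme itself are illustrative choices, not details taken from the slide:

```python
# Minimal 2-bit saturating-counter branch predictor (illustrative scheme).
# Counters range over 0..3; values >= 2 predict taken.
class BranchPredictor:
    def __init__(self, entries=1024):
        self.entries = entries
        self.counters = [1] * entries   # start weakly not-taken

    def predict(self, pc):
        return self.counters[pc % self.entries] >= 2   # True = predict taken

    def update(self, pc, taken):
        i = pc % self.entries
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)

bp = BranchPredictor()
for _ in range(4):            # a loop branch taken repeatedly trains the entry
    bp.update(0x400, taken=True)
print(bp.predict(0x400))      # True
```

The two-bit hysteresis means a single loop exit does not immediately flip the prediction for the next run of the loop.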
6-63Dr. Martin LandSpeeding Up DLXComputer Architecture — Hadassah College — Spring 2019
Branch Prediction for DLX Pipeline
1. Branch predictor in IF stage
   Identifies branch instruction according to its source address
   Predicts the branch from branch history
   Taken — predicts the branch target address
   Not-taken — uses the fall-through address
2. Validate branch instruction in ID stage
   Usual calculation: target address; condition flag — taken or not-taken
3. After validation, update the branch predictor
   Target address
   Branch history — taken/not-taken

[Pipeline diagram, CC1–CC5: Instruction Fetch (Instruction Memory: address → instruction) → Instruction Decode → Execute → Data Access (Data Memory: address → data) → Write Back]
6-64Dr. Martin LandSpeeding Up DLXComputer Architecture — Hadassah College — Spring 2019
Branch Prediction Performance
Branch taken — first execution (misprediction), cycles 1–9:
BEQZ R1,IT       IF ID EX MEM WB
Fall-Through        IF φ φ φ φ      (I1, I2, I3, … on the fall-through path, flushed when the branch resolves)
Target (IT)            IF ID EX MEM WB

Branch taken — second execution (correct prediction), cycles 1–9:
BEQZ R1,IT       IF ID EX MEM WB
Target              IF ID EX MEM WB
Target+1               IF ID EX MEM WB
Target+2                  IF ID EX MEM WB
6-65Dr. Martin LandSpeeding Up DLXComputer Architecture — Hadassah College — Spring 2019
Branch Prediction Performance for Simple LoopSimple static loop
CPI_stall-branch = 2 / (N × B)  →  0  (N large)
(2 mispredictions — the first taken branch and the final fall-through — over N iterations of B instructions)
ADDI R1, R0, #N ; N iterations
L1: ALU Block
SUBI R1, R1, #1 ; B lines of code
BNEZ R1, L1
ADDI R1, R0, #N    IF ID EX MEM WB
L1: ALU Block      IF ID EX MEM WB
< B-2 lines of ALU code >
BNEZ R1, L1        IF ID EX MEM WB    (R1 = N-1)
I_fall-through     IF ID φ φ φ        (mispredicted)
L1: ALU Block      IF ID EX MEM WB
< B-2 lines of ALU code >
BNEZ R1, L1        IF ID EX MEM WB    (R1 = N-2)
L1: ALU Block      IF ID EX MEM WB
... < B-2 lines of ALU code >
BNEZ R1, L1        IF ID EX MEM WB    (R1 = 0)
L1: ALU Block      IF ID φ φ φ        (mispredicted)
I_fall-through     IF ID EX MEM WB
6-66Dr. Martin LandSpeeding Up DLXComputer Architecture — Hadassah College — Spring 2019
More Compiler Optimizations — 1
Common sub-expression elimination:
Compiler encounters instructions B = 10*(A/3); C = (A/3)/4;
Calculates (A/3) into a register
Uses the register in later calculations
First-pass compilation:
LW R1,A
ADDI R2,R0,#3
DIV R1,R1,R2
ADDI R2,R0,#10
MULT R1,R1,R2
SW B,R1
LW R1,A
ADDI R2,R0,#3
DIV R1,R1,R2
ADDI R2,R0,#4
DIV R1,R1,R2
SW C,R1

Second-pass compilation:
LW R1,A
ADDI R2,R0,#3
DIV R1,R1,R2
ADDI R2,R0,#10
MULT R3,R1,R2
SW B,R3
ADDI R2,R0,#4
DIV R3,R1,R2
SW C,R3
6-67Dr. Martin LandSpeeding Up DLXComputer Architecture — Hadassah College — Spring 2019
More Compiler Optimizations — 2
Loop unrolling:
Instead of a loop, the compiler replicates the instructions
Eliminates the overhead of testing the loop control variable
Inlining:
Procedure call replaced by the code of the procedure or macro
First-pass compilation:
00 ADDI R2,R0,#0x05
04 ADDI R1,R0,#0x08
08 LW R3,0x1000(R1)
0C JAL 10
10 SW 2000(R1),R3
14 SUBI R1,R1,#0x04
18 BNEZ R1,-0x14
1C ADDI R2,R0,#3
20 ADD R3,R3,R2
24 JR R31

Second-pass compilation:
00 ADDI R2,R0,#0x05
04 LW R3,0x1008(R0)
08 ADD R3,R3,R2
0C SW 2008(R1),R3
10 LW R3,0x1004(R0)
14 ADD R3,R3,R2
18 SW 2004(R1),R3
1C ADDI R2,R0,#3
6-68Dr. Martin LandSpeeding Up DLXComputer Architecture — Hadassah College — Spring 2019
More Hardware Optimizations
Superscaling:
Run 2 or more pipelines in parallel
Instructions without dependencies execute in parallel
Used in most RISC processors and Pentium 1 – 4, Centrino, Core
Dynamic Scheduling:
Processor performs dynamic instruction scheduling
Same result as compiler scheduling
Very efficient when combined with superscaling
Used in IBM mainframes since 1967
Used in Pentium II – 4, Centrino, and Core processors
Register Aliasing:
Tasks require logical registers (R0, R1, … as defined in the ISA)
Physical registers allocated per task from a large register pool
Multiple tasks use the same logical register in parallel
Instruction Predication:
Usual test-and-set instructions (SLT, SGT, SEQ, …) set predication flags
An instruction can be run or cancelled according to a predicate flag
7-1Dr. Martin LandComputer ArithmeticComputer Architecture — Hadassah College — Spring 2019
Computer Arithmetic
7-2Dr. Martin LandComputer ArithmeticComputer Architecture — Hadassah College — Spring 2019
Unsigned Integers
Binary representation:
k = (a_{n-1} a_{n-2} a_{n-3} … a_1 a_0)_2,  a_i ∈ {0, 1}

Decimal value:
(k)_10 = a_{n-1}×2^(n-1) + a_{n-2}×2^(n-2) + a_{n-3}×2^(n-3) + … + a_1×2^1 + a_0×2^0

Minimum:
(k_min)_2 = 00…0  ⇒  (k_min)_10 = 0

Maximum:
(k_max)_2 = 11…1
(k_max)_2 + 1 = 11…1 + 1 = 100…0 = 2^n  ⇒  (k_max)_10 = 2^n − 1

Range: 0 ≤ (k)_10 ≤ 2^n − 1;  k > 2^n − 1 cannot be represented in n bits
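The positional formula above can be checked directly; a small sketch that evaluates an n-bit unsigned word from its bit list (most significant bit first):

```python
# Decimal value of an n-bit unsigned integer, following the positional formula.
def unsigned_value(bits):              # bits = [a_{n-1}, ..., a_1, a_0]
    n = len(bits)
    return sum(a * 2**(n - 1 - i) for i, a in enumerate(bits))

print(unsigned_value([1, 1, 1, 1]))    # 15 = 2^4 - 1, the 4-bit maximum
print(unsigned_value([0, 0, 0, 0]))    # 0, the minimum
```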
7-3Dr. Martin LandComputer ArithmeticComputer Architecture — Hadassah College — Spring 2019
Signed Integers
Binary representation (two's complement):
k = (a_{n-1} a_{n-2} a_{n-3} … a_1 a_0)_2,  a_i ∈ {0, 1}

Non-negative values (a_{n-1} = 0):
0 ≤ (k)_10 ≤ 2^(n-1) − 1
(k)_10 = a_{n-2}×2^(n-2) + a_{n-3}×2^(n-3) + … + a_1×2^1 + a_0×2^0

Negative values (a_{n-1} = 1):
0 > (k)_10 ≥ −2^(n-1)  ⇒  (k)_10 = −[((k)')_10 + 1], where (k)' is the bit complement of k

(k)_2 = 11…1  ⇒  (k)_10 = −[(00…0)_10 + 1] = −(0 + 1) = −1
(k)_2 = 11…0  ⇒  (k)_10 = −[(00…1)_10 + 1] = −(1 + 1) = −2
(k)_2 = 10…0  ⇒  (k)_10 = −[(01…1)_10 + 1] = −[(2^(n-1) − 1) + 1] = −2^(n-1)
7-4Dr. Martin LandComputer ArithmeticComputer Architecture — Hadassah College — Spring 2019
Negative Signed Numbers
k = (a_{n-1} a_{n-2} a_{n-3} … a_1 a_0)_2
(k)' = ((1−a_{n-1}) (1−a_{n-2}) (1−a_{n-3}) … (1−a_1) (1−a_0))_2   (bit complement)

k + (k)' = (a_{n-1} + (1−a_{n-1}))×2^(n-1) + … + (a_1 + (1−a_1))×2^1 + (a_0 + (1−a_0))×2^0
         = 2^(n-1) + 2^(n-2) + … + 2 + 1
         = 11…1 = 2^n − 1

k + [(k)' + 1] = 2^n, which has no representation in n bits (overflow bit, ignored for signed)

⇒ (k)' + 1 = 2^n − k, i.e. (k)' + 1 represents −k
7-5Dr. Martin LandComputer ArithmeticComputer Architecture — Hadassah College — Spring 2019
General Formula for Signed Numbers
k = (a_{n-1} a_{n-2} … a_1 a_0)_2

Using (k)' + 1 = 2^n − k for the negative case:

(k)_10 = { a_{n-2}×2^(n-2) + … + a_1×2^1 + a_0,                        a_{n-1} = 0
         { −[2^(n-1) − (a_{n-2}×2^(n-2) + … + a_1×2^1 + a_0)],         a_{n-1} = 1

In both cases this equals the single formula:

(k)_10 = −a_{n-1}×2^(n-1) + a_{n-2}×2^(n-2) + … + a_1×2^1 + a_0×2^0
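The general formula lends itself to a direct check; a sketch that weights the sign bit negatively, exactly as in the expression above:

```python
# General two's-complement formula:
#   k = -a_{n-1}*2^(n-1) + a_{n-2}*2^(n-2) + ... + a_1*2 + a_0
def signed_value(bits):                # bits = [a_{n-1}, ..., a_1, a_0]
    n = len(bits)
    weights = [-(2**(n - 1))] + [2**(n - 1 - i) for i in range(1, n)]
    return sum(a * w for a, w in zip(bits, weights))

print(signed_value([1, 1, 1, 1]))      # -1
print(signed_value([1, 0, 0, 0]))      # -8 = -2^3
print(signed_value([0, 1, 1, 1]))      # 7 = 2^3 - 1
```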
7-6Dr. Martin LandComputer ArithmeticComputer Architecture — Hadassah College — Spring 2019
Multiplying Unsigned Integers

Long multiplication example:
      0 1 1
    × 1 0 1
    -------
      0 1 1
    0 0 0
  0 1 1
  ---------
  0 1 1 1 1

Algorithm:
Operands:
a = a_{n-1} a_{n-2} a_{n-3} … a_0
b = b_{n-1} b_{n-2} b_{n-3} … b_0

Zero temporary register: P = P_{n-1} P_{n-2} P_{n-3} … P_0 ← 0,  c_out ← 0

n times {
    c_out P ← { P,      if a_0 = 0
              { P + b,  if a_0 = 1
    shift [c_out P_{n-1} P_{n-2} … P_0 a_{n-1} a_{n-2} … a_0] right 1 bit
    to form new [P_{n-1} P_{n-2} … P_0 a_{n-1} a_{n-2} … a_0]
}

Result is found in 2n bits [P_{n-1} P_{n-2} … P_0 a_{n-1} a_{n-2} … a_0]
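The shift-add loop above can be sketched in software; the bit masking below stands in for the fixed-width registers of the hardware datapath:

```python
# Shift-add multiplication of two n-bit unsigned integers: add b into P when
# the low bit of a is 1, then shift the combined (c_out, P, a) register right.
def multiply_unsigned(a, b, n):
    p = 0
    for _ in range(n):
        c_out = 0
        if a & 1:                        # low bit of a selects an add of b
            p += b
            c_out = (p >> n) & 1         # carry out of the n-bit add
            p &= (1 << n) - 1
        # shift right: low bit of P moves into the high bit of a
        a = (a >> 1) | ((p & 1) << (n - 1))
        p = (p >> 1) | (c_out << (n - 1))
    return (p << n) | a                  # 2n-bit product held in [P a]

print(multiply_unsigned(0b011, 0b101, 3))   # 15 (= 01111, as in the example)
```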
7-7Dr. Martin LandComputer ArithmeticComputer Architecture — Hadassah College — Spring 2019
Multiplying Signed Integers

Algorithm:
Operands (non-negative):
a = a_{n-1} a_{n-2} a_{n-3} … a_0
b = b_{n-1} b_{n-2} b_{n-3} … b_0

Zero temporary register: P = P_{n-1} P_{n-2} P_{n-3} … P_0 ← 0

n times {
    P ← { P,      if a_0 = 0
        { P + b,  if a_0 = 1
    shift [P_{n-1} P_{n-1} P_{n-2} … P_0 a_{n-1} a_{n-2} … a_0] right 1 bit (arithmetic shift)
    to form new [P_{n-1} P_{n-2} … P_0 a_{n-1} a_{n-2} … a_0]
}

Result is found in 2n bits [P_{n-1} P_{n-2} … P_0 a_{n-1} a_{n-2} … a_0]

Uses P_{n-1} instead of c_out (the sign bit is shifted in)
7-8Dr. Martin LandComputer ArithmeticComputer Architecture — Hadassah College — Spring 2019
Dividing Unsigned Integers

[Worked example: binary long division]

Algorithm for a / b:
Operands:
a = a_{n-1} a_{n-2} a_{n-3} … a_0
b = b_{n-1} b_{n-2} b_{n-3} … b_0

Zero temporary register: P = P_{n-1} P_{n-2} P_{n-3} … P_0 ← 0

n times {
    shift [P_{n-1} P_{n-2} … P_0 a_{n-1} a_{n-2} … a_0] left 1 bit
    (discard the old P_{n-1}; new a_0 ← 0)
    if P ≥ b { P ← P − b;  a_0 ← 1 }
}

Remainder is in P and quotient is in a
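A software sketch of the restoring-division loop above, with masks playing the role of the fixed-width registers:

```python
# Long division of n-bit unsigned integers: shift [P a] left one bit per step,
# subtract b from P when possible, and record the quotient bit in a_0.
def divide_unsigned(a, b, n):
    p = 0
    for _ in range(n):
        # shift [P a] left: high bit of a moves into the low bit of P
        p = (p << 1) | ((a >> (n - 1)) & 1)
        a = (a << 1) & ((1 << n) - 1)
        if p >= b:
            p -= b
            a |= 1                       # quotient bit
    return a, p                          # quotient in a, remainder in P

print(divide_unsigned(14, 3, 4))         # (4, 2): 14 = 3*4 + 2
```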
7-9Dr. Martin LandComputer ArithmeticComputer Architecture — Hadassah College — Spring 2019
Floating Point Numbers (IEEE-754 Standard)

Fields: s (sign), e (exponent), f (fraction)

Single precision: s — 1 bit, e — 8 bits, f — 23 bits
N = (−1)^s × 1.f × 2^(e−127),  1 ≤ e ≤ 254,  −126 ≤ e − 127 ≤ 127

Double precision: s — 1 bit, e — 11 bits, f — 52 bits
N = (−1)^s × 1.f × 2^(e−1023),  1 ≤ e ≤ 2046,  −1022 ≤ e − 1023 ≤ 1023

Special values:
e          | f        | N
0          | 0        | (−1)^s × 0
0          | not zero | (−1)^s × 0.f × 2^(−126) (single), (−1)^s × 0.f × 2^(−1022) (double) — denormalized
255 (max)  | 0        | (−1)^s × ∞
255 (max)  | not zero | NaN (not a number — overflow/underflow)
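Unpacking the s, e, f fields from the raw 32 bits and rebuilding the value from the formula above makes the encoding concrete; this sketch uses the standard-library `struct` module to get at the bit pattern:

```python
# Unpack IEEE-754 single-precision fields (s, e, f) from the raw 32 bits and
# rebuild the value as (-1)^s * 1.f * 2^(e-127) for a normalized number.
import struct

def fields(x):
    bits = struct.unpack('>I', struct.pack('>f', x))[0]
    s = bits >> 31
    e = (bits >> 23) & 0xFF
    f = bits & 0x7FFFFF
    return s, e, f

s, e, f = fields(-6.25)                  # -6.25 = -1.1001_2 * 2^2
value = (-1)**s * (1 + f / 2**23) * 2.0**(e - 127)
print(s, e - 127, value)                 # 1 2 -6.25
```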
7-10Dr. Martin LandComputer ArithmeticComputer Architecture — Hadassah College — Spring 2019
Floating Point Operations

Addition:
1.010×2^−3 + 1.010×2^−4 = 10100×2^−7 + 01010×2^−7
    1 0 1 0 0
  + 0 1 0 1 0
  = 1 1 1 1 0
11110×2^−7 = 1.1110×2^−3

Multiplication:
1.010×2^−3 × 1.010×2^−4 = 10100×2^−7 × 01010×2^−7
      1 0 1 0 0
    × 0 1 0 1 0
= 1 1 0 0 1 0 0 0
11001000×2^−7×2^−7 = 1.1001000×2^−7 → 1.100×2^−7 (rounded)
7-11Dr. Martin LandComputer ArithmeticComputer Architecture — Hadassah College — Spring 2019
Floating Point Multiplication

x = (−1)^s1 × 1.f1 × 2^(e1−127)
y = (−1)^s2 × 1.f2 × 2^(e2−127)
x × y = (−1)^(s1+s2) × (1.f1 × 1.f2) × 2^((e1−127)+(e2−127)) = (−1)^(s1+s2) × (1.f1 × 1.f2) × 2^((e1+e2−127)−127)

Multiply unsigned numbers:
P_{n-1} P_{n-2} P_{n-3} … P_0 ← 1.f1 × 1.f2

Rounding algorithm:
If P_{n-1} = 0, assemble X_{n-1} X_{n-2} … X_1 X_0 r from P_{n-2} P_{n-3} … P_0 a_{n-1} a_{n-2}
If P_{n-1} = 1, assemble X_{n-1} X_{n-2} … X_1 X_0 r from P_{n-1} P_{n-2} … P_1 P_0 a_{n-1} and e ← e + 1

Result: X_{n-1}.X_{n-2} … X_0 + 2^(−(n−1)) × r, where r is the rounding bit
7-12Dr. Martin LandComputer ArithmeticComputer Architecture — Hadassah College — Spring 2019
Floating Point Addition — 1

x1 = (−1)^s1 × 1.f1 × 2^(e1−127)
x2 = (−1)^s2 × 1.f2 × 2^(e2−127)
X = x1 + x2 = (−1)^s × 1.f × 2^(e−127)

1a. If e2 > e1 then x1 ↔ x2
1b. d = e1 − e2
1c. e = e1
1d. w = 0
2. If s1 ≠ s2 then x2 ← (x2)'' (two's complement) and w ← 1
3a. Construct P from x2:  P_{n-1} . P_{n-2} P_{n-3} … P_0 ← x_{n-1} . x_{n-2} x_{n-3} … x_0
3b. Shift P right by d places, shifting in w from the left:
    P_{n-1} . P_{n-2} … P_{n-d} P_{n-1-d} … P_0 | g r  →  w . w … w P_{n-1} … P_d | P_{d-1} P_{d-2}
    (g and r take the shifted-out bits P_{d-1} and P_{d-2})
4. Add P ← P + 1.f1
7-13Dr. Martin LandComputer ArithmeticComputer Architecture — Hadassah College — Spring 2019
Floating Point Addition — 2

(continuing X = x1 + x2 = (−1)^s × 1.f × 2^(e−127) as above)

5. If s1 ≠ s2 and P_{n-1} = 1 and c_out = 0 then P ← (P)''
6. If s1 = s2 and c_out = 1 then shift right:
   P_{n-1} P_{n-2} P_{n-3} … P_1 P_0 ← 1 P_{n-1} P_{n-2} P_{n-3} … P_1,  r ← P_0,  e ← e + 1
   Otherwise shift P left L times until P_{n-1} = 1:
   P = 0 0 0 … 1 P_s … P_1 P_0  ⇒  P ← 1 P_s P_{s-1} P_{s-2} … P_0 g r 0 … 0
   1.f ← P,  e ← e − L
   L = 0 ⇒ r ← g;  L = 1 ⇒ r is the rounding bit;  L > 1 ⇒ r ← 0
7-14Dr. Martin LandComputer ArithmeticComputer Architecture — Hadassah College — Spring 2019
DLX FP Pipeline
Floating Point Unit (FPU):
ADD/SUB — 4 pipelined stages
MULT — 7 pipelined stages
DIV/SQRT — 24 stages, 15 non-pipelined
[Pipeline diagram: IF (Instruction Memory) → ID → EX { Integer ALU | FPU: ADD/SUB A1 A2 A3 A4 | MULT M1–M7 | DIV/SQRT } → MEM (Data Memory) → WB]
7-15Dr. Martin LandComputer ArithmeticComputer Architecture — Hadassah College — Spring 2019
FP Execution  (F = IF, D = ID, X = EX, M = MEM, W = WB; each instruction issues one cycle after the previous, repeated letters indicate stalls)

Dependency stalls and forwarding (cycles 1–17):
LD F4,0(R2)       F D X M W
ADDD F2,F4,F8     F D D A1 A2 A3 A4 M W
MULTD F0,F2,F6    F F D D D D M1 M2 M3 M4 M5 M6 M7 M W
SD 0(R2),F0       F F F F D X X X X X X M W

Pipelined FP ADD:
LD F4,0(R2)        F D X M W
ADDD F6,F8,F10     F D A1 A2 A3 A4 M W
ADDD F12,F14,F16   F D A1 A2 A3 A4 M W

Pipelined FP MULT:
LD F4,0(R2)       F D X M W
MULTD F2,F4,F8    F D M1 M2 M3 M4 M5 M6 M7 M W
MULTD F0,F2,F6    F D M1 M2 M3 M4 M5 M6 M7 M W
7-16Dr. Martin LandComputer ArithmeticComputer Architecture — Hadassah College — Spring 2019
More Hardware → Fewer Steps → Speedup
Ripple Adder:
Adds order by order
c_out → next order as c_in
n − 1 propagation delays from order to order
[Ripple adder diagram: n full adders (FA) in a chain; each FA takes a, b, c_in and produces s, c_out; inputs A0 B0 … An-1 Bn-1, outputs S0 … Sn-1 and final c_out; c_in of stage 0 = 0]
Look-Ahead Adder
Each stage produces:
s_i = a_i ⊕ b_i ⊕ c_i
c_{i+1} = g_i + p_i·c_i,  where g_i = a_i·b_i (generate) and p_i = a_i + b_i (propagate)

Calculate the n − 1 values for c_in in (large) dedicated hardware:
c_1 = g_0 + p_0·c_0
c_2 = g_1 + p_1·c_1 = g_1 + p_1·g_0 + p_1·p_0·c_0
c_3 = g_2 + p_2·g_1 + p_2·p_1·g_0 + p_2·p_1·p_0·c_0
…
c_k = g_{k-1} + p_{k-1}·g_{k-2} + p_{k-1}·p_{k-2}·g_{k-3} + … + p_{k-1}·p_{k-2}···p_1·g_0 + p_{k-1}·p_{k-2}···p_0·c_0
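The generate/propagate recurrence can be checked numerically; this sketch evaluates c_{i+1} = g_i + p_i·c_i step by step, which yields the same carry values the lookahead hardware computes in parallel via the expanded sums:

```python
# Carry computation from generate g_i = a_i AND b_i and propagate
# p_i = a_i OR b_i, via the recurrence c_{i+1} = g_i OR (p_i AND c_i).
# The lookahead adder evaluates the expanded sum-of-products form of the
# same recurrence in one gate level per carry.
def lookahead_carries(a_bits, b_bits, c0=0):
    g = [a & b for a, b in zip(a_bits, b_bits)]
    p = [a | b for a, b in zip(a_bits, b_bits)]
    carries = [c0]
    for i in range(len(a_bits)):
        carries.append(g[i] | (p[i] & carries[i]))
    return carries                        # [c_0, c_1, ..., c_n]

# 4-bit example (bits listed low-to-high): a = 0111, b = 0001
print(lookahead_carries([1, 1, 1, 0], [1, 0, 0, 0]))   # [0, 1, 1, 1, 0]
```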
8-1Dr. Martin LandMemory and I/O OrganizationComputer Architecture — Hadassah College — Spring 2019
Memory
and I/O Organization
8-2Dr. Martin LandMemory and I/O OrganizationComputer Architecture — Hadassah College — Spring 2019
Principle of Locality
Locality — a small proportion of memory accounts for most of the run time
Rule of thumb — for 90% of the run time, the next instruction/data will come from the 10% of program/data closest to the current instruction
Amdahl's Law — make access to the most local memory as fast as possible
[Figure: memory address space 00000000–FFFFFFFF holding B bytes of instruction and data; current address A with locality window A − 5%·B to A + 5%·B — the percentage of memory accounting for 90% of run time for SPEC 92]
8-3Dr. Martin LandMemory and I/O OrganizationComputer Architecture — Hadassah College — Spring 2019
Memory Hierarchy
Long Term Storage → Main Memory (RAM) → Cache → Register
All Files and Data → Running Programs and Data → Next Few Instructions and Data → Current Data

Register: memory location inside the CPU; fast access to a small amount of information; organized by the CPU
Cache: memory location in or near the CPU; fast access to important data and instructions; a copy of a RAM section
Main Memory: memory location outside the CPU; stores "all" data and instructions of running programs; organized by addresses
Long Term Storage: memory locations outside CPU and RAM; stores data and instructions of "all" programs; organized by the OS
8-4Dr. Martin LandMemory and I/O OrganizationComputer Architecture — Hadassah College — Spring 2019
Memory Hierarchy in RISC Workstation/Server
Level             | Location     | Quantity                    | Access Time     | Technology
General Registers | CPU internal | max < 2 KB, typical 512 B   | 0.1 to 0.5 ns   | CMOS latches
L1 cache          | CPU internal | max < 64 KB, typical 32 KB  | 0.5 to 1.0 ns   | CMOS SRAM
L2 cache          | CPU internal | max < 8 MB, typical 2 MB    | 3 to 10 ns      | CMOS SRAM
L3 cache          | CPU internal | max < 8 MB, typical 0       | 10 to 20 ns     | CMOS DRAM
Main Memory       | external     | ~4 – 64 GB                  | 10 to 50 ns     | CMOS DRAM
SSD               | external     | max ~60 TB, typical 500 GB  | 0.035 to 0.1 ms | Flash
Disk              | external     | max ~8 TB, typical 1 TB     | 4 to 20 ms      | Magnetic
8-5Dr. Martin LandMemory and I/O OrganizationComputer Architecture — Hadassah College — Spring 2019
CPU and Memory Hierarchy
[Diagram: the DLX pipeline registers (IF/ID: IR, NPC; ID/EX: A, B, I, NNPC, IR, cond; EX/MEM: ALUout, B; MEM/WB: LMD, ALUout, IR) and ALU connect to an instruction cache and a data cache (L1); a cache controller links L1 to the L2 cache, the external bus, the I/O controller (chipset), Main Memory (RAM), and Long Term Storage (Disk)]
CPU and Memory Hierarchy — 1
L1 (level 1 cache) holds a copy of a small section of main memory
Most recently accessed addresses (memory locations)
L1 split into physically separate Instruction Cache and Data Cache

CPU accesses the L1 cache directly:
If (address in L1 cache) { access performed in 1 clock cycle }
Else {
    L1 cache accesses the cache controller
    If (address in L2 cache) { controller copies contents to L1 from L2 }
    Else { controller copies the location to L1 from main memory }
}
(miss latency >> 1 clock cycle)
CPU and Memory Hierarchy — 2
CPU accesses disk and I/O devices by memory addressing:
Part of the total address space is reserved for I/O and storage devices
Disk write of k bytes — CPU performs k stores to the same I/O address
Disk read of k bytes — CPU performs k loads from the same I/O address
I/O addresses are not copied to cache (marked NON-CACHEABLE)

Virtual Memory:
Total memory space larger than physical memory
Divided into pages of a specific size
Infrequently used pages moved to a "page file" on disk
Page properties and location specified in page tables
Virtual address divided into:
Page address — points to a page table entry
Offset — points to a byte address in the page
8-8Dr. Martin LandMemory and I/O OrganizationComputer Architecture — Hadassah College — Spring 2019
Memory Organization — 1
n-bit address space:
Physical Address = A_{n-1} A_{n-2} … A_1 A_0
Can form 2^n addresses, from 0 to 2^n − 1
Every byte in RAM has an n-bit address
Processor refers to memory locations by physical RAM addresses
Processor stores memory addresses in n-bit address registers

[Figure: CPU n-bit register holding memory addresses; memory locations 00000…000 through 11111…111, one data byte per address]
8-9Dr. Martin LandMemory and I/O OrganizationComputer Architecture — Hadassah College — Spring 2019
Memory Organization — 2
Copy memory data to cache:
How much data to copy? Which data is in cache? How to find addresses in cache?
Copy DATA BLOCK to cache:
Block = B bytes
Page size (virtual memory) = integer × block size
Sets and slots:
Cache = S sets (0 … S − 1)
Set = W slots
Slot = 1 data block
Copy a memory block to: a deterministic set, any slot in that set
8-10Dr. Martin LandMemory and I/O OrganizationComputer Architecture — Hadassah College — Spring 2019
Memory Organization — 3
Address space has an n-bit physical address:
N = 2^n byte address space
Address space divided (logically) into address BLOCKS (lines)
Block size B = 2^b bytes/block
Blocks in address space = N / B = 2^(n−b)
(n−b)-bit Block Number = Int(Address / B)
b-bit Byte Offset = Address % B = 0, 1, … , B − 1 = 2^b − 1

n-bit Physical Address = [ Block Number (n − b bits) | Byte Offset (b bits) ]
8-11Dr. Martin LandMemory and I/O OrganizationComputer Architecture — Hadassah College — Spring 2019
Memory Organization — 4
Cache organized into S = 2^s sets
Set Index = 0, 1, … , S − 1 = 2^s − 1
An address block must be copied into a specific set in the cache
Set Index (location of block in cache) = Block Number % S

n-bit Physical Address = [ Tag (n − (s + b) bits) | Set Index (s bits) | Byte Offset (b bits) ]; the upper (n − b) bits form the Block Number
8-12Dr. Martin LandMemory and I/O OrganizationComputer Architecture — Hadassah College — Spring 2019
Memory Organization — 5
W-way associative organization:
Each set contains W = 2^w slots
One block copied to one slot
Blocks copied to slots in any convenient order
TAG written near the block content to identify which block is in the cache
[n − (s + b)]-bit tag = Int(Block Number / S)

Total cache size = (S sets/cache) × (W blocks/set) × (B bytes/block) = S × W × B bytes/cache

[Figure: Set 0 … Set S−1, W slots per set; n-bit physical address = Tag | Set Index | Byte Offset]
8-13Dr. Martin LandMemory and I/O OrganizationComputer Architecture — Hadassah College — Spring 2019
Example of Memory Organization — 1
n = 32-bit address
N = 2^32 bytes = 4,294,967,296 bytes = 4 GB
B = 16 = 2^4 bytes per block
b = 4-bit block offset
32-bit Physical Address = [ 28-bit Block Number | 4-bit Byte Offset ]

Block       | Byte Addresses
268,435,455 | 4,294,967,280 … 4,294,967,295
…           | …
2           | 32 … 47
1           | 16 … 31
0           | 0 … 15

N / B = 2^(32−4) blocks = 268,435,456 blocks = 256 Mblocks
Example of Memory Organization — 2
n = 32-bit address, N = 2^32 bytes = 4,294,967,296 bytes = 4 GB
B = 2^4 bytes per block
32-bit Physical Address = [ Tag (20 bits) | Set Index (8 bits) | Byte Offset (4 bits) ], 28-bit Block Number
S = 256 = 2^8 sets in cache
s = 8-bit set index = Block Number % 256
N / B = 268,435,456 blocks = 256 Mblocks

Set | Blocks that can be assigned to the set
0   | 0, 256, 512, 768, 1024, 1280, … , 268,435,200
1   | 1, 257, 513, 769, 1025, 1281, …
2   | 2, 258, 514, 770, 1026, 1282, …
…   | …
255 | 255, 511, 767, 1023, 1279, 1535, … , 268,435,455
8-15Dr. Martin LandMemory and I/O OrganizationComputer Architecture — Hadassah College — Spring 2019
Example of Memory Organization — 3
n = 32-bit address, N = 2^32 bytes = 4,294,967,296 bytes = 4 GB
B = 2^4 bytes per block
S = 256 = 2^8 sets in cache
s = 8-bit set index = Block Number % 256
W = 4 = 2^2 way associative
Tag = 32 − (4 + 8) = 20 bits
32-bit Physical Address = [ Tag (20 bits) | Set Index (8 bits) | Byte Offset (4 bits) ], 28-bit Block Number

Set | Possible Cache Content (Block Numbers) | Tags
0   | 1024, 8192, 0, 256                     | 4, 32, 0, 1
1   | 257, 1281, 513, 1025                   | 1, 5, 2, 4
2   | 514, 258, …                            | 2, 1, …

Address 20509_10 = 0x0000501D = 0000 0000 0000 0000 0101 | 0000 0001 | 1101
Block Number = 1281_10, Tag = 5, Set Index = 1, Byte Offset = 13

Total Cache Size = 256 sets/cache × 4 blocks/set × 16 bytes/block = 16 KB
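The tag / set / offset split above is pure bit arithmetic; a minimal sketch for this example cache (B = 16 bytes per block, so b = 4; S = 256 sets, so s = 8):

```python
# Split a 32-bit physical address into (tag, set index, byte offset) for a
# cache with 2^b-byte blocks and 2^s sets, as in the slide's example.
def decompose(addr, b=4, s=8):
    offset = addr & ((1 << b) - 1)
    block = addr >> b                     # block number
    set_index = block & ((1 << s) - 1)    # block number % S
    tag = block >> s                      # block number / S
    return tag, set_index, offset, block

print(decompose(0x0000501D))   # (5, 1, 13, 1281) — matches the worked example
```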
8-16Dr. Martin LandMemory and I/O OrganizationComputer Architecture — Hadassah College — Spring 2019
Cache Definitions and Policies
Cache hit — CPU finds the block in cache
Cache miss — CPU needs a block not in cache; the cache loads the block on a read miss
Write allocate — cache loads the block on a write miss
No write allocate — write to RAM without loading the block on a write miss
Swapping out a cache block:
Need a new block in a full set — remove the block that is LEAST RECENTLY USED (LRU)
WRITE BACK — update RAM when the block is swapped out
WRITE THROUGH — update RAM on every write to cache
8-17Dr. Martin LandMemory and I/O OrganizationComputer Architecture — Hadassah College — Spring 2019
Cache Performance Issues
L1 cache hit — CPU reads/writes memory in 1 clock cycle
L1 cache miss — CPU stalls while L1 loads the missing block
Miss rate — cache misses per cache access (instruction fetch, load, store); depends on cache size and organization
Miss penalty — number of stall cycles while the cache loads the missing block; depends on cache hardware technology and organization

For a 1-level unified (not split) cache:

CPI_stall = stall cycles / instruction
          = (stalls / instruction type) × (instruction types / instruction)
          = miss penalty × miss rate × (IC + IC_load + IC_store) / IC
Performance of 1-Level Split Cache

Typical values:
Instruction Miss Penalty = Data Miss Penalty = 50 cycles per stall
Instruction Miss Rate = 0.5%
Data Miss Rate = 5.0%
IC_load = 25%, IC_store = 15%

CPI_stall = (instruction miss penalty) × (instruction miss rate) × (1 instruction access / instruction)
          + (data miss penalty) × (data miss rate) × (IC_load + IC_store) / IC
          = 50 × 0.005 × 1 + 50 × 0.05 × 0.40 = 0.25 + 1.00 = 1.25

CPI_stall = 1.25 ⇒ 125% degradation
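The split-cache stall formula above reduces to a two-term sum; a sketch with the slide's typical values:

```python
# Stall CPI for a 1-level split cache: one instruction fetch per instruction,
# plus (IC_load + IC_store)/IC data accesses per instruction.
def split_cache_cpi_stall(i_penalty, i_miss_rate, d_penalty, d_miss_rate,
                          data_access_fraction):
    instruction_stalls = i_penalty * i_miss_rate * 1.0
    data_stalls = d_penalty * d_miss_rate * data_access_fraction
    return instruction_stalls + data_stalls

# Slide values: 50-cycle penalties, 0.5% / 5% miss rates, 25% + 15% data accesses
cpi_stall = split_cache_cpi_stall(50, 0.005, 50, 0.05, 0.25 + 0.15)
print(cpi_stall)   # 1.25
```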
8-19Dr. Martin LandMemory and I/O OrganizationComputer Architecture — Hadassah College — Spring 2019
Two-Level Unified Cache — Definitions

Miss penalties:
P_L1 = Miss Penalty (cycles) from L1 to L2 = stall cycles (L1 miss, L2 hit)
P_L2 = Miss Penalty (cycles) from L2 to Main Memory = stall cycles (L1 miss, L2 miss)
A = (IC_load + IC_store) / IC = Data Access Instructions (Load/Store) / Total Instructions

Miss rates:
M_L1 = Miss Rate at L1 = L1 misses / L1 accesses;  1 − M_L1 = Hit Rate at L1 = L1 hits / L1 accesses
M_L2 = Miss Rate at L2 = L2 misses / L2 accesses;  1 − M_L2 = Hit Rate at L2 = L2 hits / L2 accesses

[Diagram: CPU → unified L1 cache → (penalty P_L1, miss rate M_L1) → unified L2 cache → (penalty P_L2, miss rate M_L2) → Main Memory]
8-20Dr. Martin LandMemory and I/O OrganizationComputer Architecture — Hadassah College — Spring 2019
Performance of Two-Level Unified Cache

CPI_stall = stall cycles / IC = Σ_n (stall cycles / stall of type n) × (stalls of type n / IC)

stall types = { (L1 miss, L2 hit), (L1 miss, L2 miss) }, summed over i = instruction, data

CPU memory access:
L1 hit → no penalty
L1 miss, L2 hit → L2 access, L1 penalty = L2 access time
L1 miss, L2 miss → L2 penalty = main RAM access time
8-21Dr. Martin LandMemory and I/O OrganizationComputer Architecture — Hadassah College — Spring 2019
Stalls in Two-Level Unified Cache

CPI_stall = Σ_{i = data, instr} [ (stall cycles / (L1 miss, L2 hit)) × ((L1 miss, L2 hit)_i / IC)
                                + (stall cycles / (L1 miss, L2 miss)) × ((L1 miss, L2 miss)_i / IC) ]

          = Σ_{i = data, instr} [ (stall cycles / (L1 miss, L2 hit)) × ((L1 miss, L2 hit)_i / L1 access_i) × (L1 access_i / IC)
                                + (stall cycles / (L1 miss, L2 miss)) × ((L1 miss, L2 miss)_i / L1 access_i) × (L1 access_i / IC) ]
8-22Dr. Martin LandMemory and I/O OrganizationComputer Architecture — Hadassah College — Spring 2019
Assume L1 and L2 Are Statistically Independent

(L1 miss, L2 hit)_i / L1 access_i = (L1 miss followed by L2 hit)_i / L1 access_i
    = (L1 miss / L1 access)_i × (L2 access / L1 miss)_i × (L2 hit / L2 access)
    = (L1 miss / L1 access)_i × 1 × (L2 hit / L2 access)
    = M_L1,i × (1 − M_L2)

(L1 miss, L2 miss)_i / L1 access_i = (L1 miss / L1 access)_i × 1 × (L2 miss / L2 access)
    = M_L1,i × M_L2

CPI_stall = Σ_{i = data, instr} [ (stall cycles / (L1 miss, L2 hit)) × M_L1,i × (1 − M_L2) × (L1 access_i / IC)
                                + (stall cycles / (L1 miss, L2 miss)) × M_L1,i × M_L2 × (L1 access_i / IC) ]
8-23Dr. Martin LandMemory and I/O OrganizationComputer Architecture — Hadassah College — Spring 2019
Simplifying

CPI_stall = Σ_{i = data, instr} [ P_L1 × M_L1,i × (1 − M_L2) + (P_L1 + P_L2) × M_L1,i × M_L2 ] × (L1 access_i / IC)

L1 accesses: instruction fetches = IC, data accesses = IC_A, so Σ_i (L1 access_i / IC) = (IC + IC_A) / IC:

CPI_stall = ((IC + IC_A) / IC) × M_L1 × [ P_L1 × (1 − M_L2) + (P_L1 + P_L2) × M_L2 ]
          = ((IC + IC_A) / IC) × M_L1 × [ P_L1 − P_L1 × M_L2 + P_L1 × M_L2 + P_L2 × M_L2 ]
          = ((IC + IC_A) / IC) × M_L1 × [ P_L1 + P_L2 × M_L2 ]
Split L1 Cache with Unified L2 Cache

M_L1 = misses at L1 / accesses at L1
With a split L1, misses at L1 = data misses at L1 + instruction misses at L1:

M_L1 × ((IC + IC_A) / IC) = M_L1^D × (data accesses at L1 / IC) + M_L1^I × (instruction accesses at L1 / IC)
                          = M_L1^D × (IC_A / IC) + M_L1^I

Unified L1:
CPI_stall = ((IC + IC_A) / IC) × M_L1 × [ P_L1 + P_L2 × M_L2 ]

Split L1:
CPI_stall = [ M_L1^I + M_L1^D × (IC_A / IC) ] × [ P_L1 + P_L2 × M_L2 ]
8-25Dr. Martin LandMemory and I/O OrganizationComputer Architecture — Hadassah College — Spring 2019
Second Layer Cache (L2) with Split L1

Definitions:
P_L1 = Miss Penalty (cycles) at L1
P_L2 = Miss Penalty (cycles) at L2
IC_A / IC = Data Access Instructions (Load/Store) / Total Instructions
M_L1^I = Instruction Miss Rate at L1
M_L1^D = Data Miss Rate at L1
M_L2 = Miss Rate at L2

One layer (L1) of split cache:
Miss penalty at L1 (to main memory) P_L1 ~ 50 cycles
CPI_stall(1-level) = [ M_L1^I + M_L1^D × (IC_A / IC) ] × P_L1

Split L1 cache and unified L2 cache:
Miss penalty at L1 (to L2) P_L1 ~ 5 cycles
Miss penalty at L2 (to main memory) P_L2 ~ 45 cycles
Miss rate at L2 ~ 1%
CPI_stall(2-level) = [ M_L1^I + M_L1^D × (IC_A / IC) ] × (P_L1 + P_L2 × M_L2)
P_L1 + P_L2 × M_L2 = 5 + 45 × 0.01 = 5.45

[Diagram: CPU → split L1 (instruction cache + data cache) → (P_L1) → unified L2 cache → (P_L2) → Main Memory]
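The effective L1 miss penalty and the resulting stall CPI follow directly from the formula above; the miss-rate and data-fraction values in the second half of this sketch are illustrative, not taken from the slide:

```python
# Split L1 backed by a unified L2:
#   CPI_stall = [M_L1_I + M_L1_D * (IC_A/IC)] * (P_L1 + P_L2 * M_L2)
p_l1, p_l2, m_l2 = 5, 45, 0.01            # slide values: penalties, L2 miss rate
effective_penalty = p_l1 + p_l2 * m_l2    # stall cycles per L1 miss
print(round(effective_penalty, 2))        # 5.45

m_l1_i, m_l1_d, a = 0.005, 0.05, 0.40     # illustrative L1 miss rates, data fraction
cpi_stall = (m_l1_i + m_l1_d * a) * effective_penalty
print(round(cpi_stall, 3))
```

Compared with the 50-cycle single-level penalty, the L2 cuts the average cost of an L1 miss by roughly a factor of nine.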
8-26Dr. Martin LandMemory and I/O OrganizationComputer Architecture — Hadassah College — Spring 2019
Issues Affecting Miss Rate
Compulsory miss:
A block is not copied to cache until the first access to a byte in the block
The first access to a block always misses in the cache
Compulsory misses are not affected by cache properties
Capacity miss:
The cache is smaller than main memory
Some blocks are removed from the cache to make room for required blocks
Capacity miss rate is lower for a larger cache
Conflict miss:
A block must be copied to a specific set
Some blocks are removed from the set to make room for required blocks — not caused by overall capacity
For example, misses caused by address aliasing
Conflict miss rate is lower when block placement is more flexible
8-27Dr. Martin LandMemory and I/O OrganizationComputer Architecture — Hadassah College — Spring 2019
Cache Miss Statistics — 1
Miss rate components (relative percent; sum = 100% of total miss rate)

Cache size | Associativity | Total miss rate | Compulsory  | Capacity   | Conflict
1 KB       | 1-way         | 0.133           | 0.002  1%   | 0.080 60%  | 0.052 39%
1 KB       | 2-way         | 0.105           | 0.002  2%   | 0.080 76%  | 0.023 22%
1 KB       | 4-way         | 0.095           | 0.002  2%   | 0.080 84%  | 0.013 14%
1 KB       | 8-way         | 0.087           | 0.002  2%   | 0.080 92%  | 0.005  6%
2 KB       | 1-way         | 0.098           | 0.002  2%   | 0.044 45%  | 0.052 53%
2 KB       | 2-way         | 0.076           | 0.002  2%   | 0.044 58%  | 0.030 39%
2 KB       | 4-way         | 0.064           | 0.002  3%   | 0.044 69%  | 0.018 28%
2 KB       | 8-way         | 0.054           | 0.002  4%   | 0.044 82%  | 0.008 14%
4 KB       | 1-way         | 0.072           | 0.002  3%   | 0.031 43%  | 0.039 54%
4 KB       | 2-way         | 0.057           | 0.002  3%   | 0.031 55%  | 0.024 42%
4 KB       | 4-way         | 0.049           | 0.002  4%   | 0.031 64%  | 0.016 32%
4 KB       | 8-way         | 0.039           | 0.002  5%   | 0.031 80%  | 0.006 15%
8 KB       | 1-way         | 0.046           | 0.002  4%   | 0.023 51%  | 0.021 45%
8 KB       | 2-way         | 0.038           | 0.002  5%   | 0.023 61%  | 0.013 34%
8 KB       | 4-way         | 0.035           | 0.002  5%   | 0.023 66%  | 0.010 28%
8 KB       | 8-way         | 0.029           | 0.002  6%   | 0.023 79%  | 0.004 15%
Hennessy and Patterson, figure 5.9
8-28Dr. Martin LandMemory and I/O OrganizationComputer Architecture — Hadassah College — Spring 2019
Cache Miss Statistics — 2
Miss rate components (relative percent); sum = 100% of total miss rate

Cache size  Assoc.  Total miss rate  Compulsory   Capacity     Conflict
16 KB       1-way   0.029            0.002   7%   0.015  52%   0.012  42%
16 KB       2-way   0.022            0.002   9%   0.015  68%   0.005  23%
16 KB       4-way   0.020            0.002  10%   0.015  74%   0.003  17%
16 KB       8-way   0.018            0.002  10%   0.015  80%   0.002   9%
32 KB       1-way   0.020            0.002  10%   0.010  52%   0.008  38%
32 KB       2-way   0.014            0.002  14%   0.010  74%   0.002  12%
32 KB       4-way   0.013            0.002  15%   0.010  79%   0.001   6%
32 KB       8-way   0.013            0.002  15%   0.010  81%   0.001   4%
64 KB       1-way   0.014            0.002  14%   0.007  50%   0.005  36%
64 KB       2-way   0.010            0.002  20%   0.007  70%   0.001  10%
64 KB       4-way   0.009            0.002  21%   0.007  75%   0.000   3%
64 KB       8-way   0.009            0.002  22%   0.007  78%   0.000   0%
128 KB      1-way   0.010            0.002  20%   0.004  40%   0.004  40%
128 KB      2-way   0.007            0.002  29%   0.004  58%   0.001  14%
128 KB      4-way   0.006            0.002  31%   0.004  61%   0.001   8%
128 KB      8-way   0.006            0.002  31%   0.004  62%   0.000   7%
8-29Dr. Martin LandMemory and I/O OrganizationComputer Architecture — Hadassah College — Spring 2019
Total Miss Rate
Total miss rate drops as capacity or associativity increases
[Chart: total miss rate (0 to 0.14) vs. cache size (1 KB to 128 KB) for 1-way, 2-way, 4-way, and 8-way associativity; the curves converge toward the capacity-miss floor as cache size grows]
8-30Dr. Martin LandMemory and I/O OrganizationComputer Architecture — Hadassah College — Spring 2019
Conflict Miss Rate
Conflict miss rate drops as associativity increases
[Chart: conflict miss rate (0 to 0.060) vs. cache size (1 KB to 128 KB) for 1-way, 2-way, 4-way, and 8-way associativity]
8-31Dr. Martin LandMemory and I/O OrganizationComputer Architecture — Hadassah College — Spring 2019
Associativity Trade‐Off
Block can be anywhere in set
  Finding (or missing) a block in a set requires searching every tag in the set
  Larger associativity ⇒ more blocks per set ⇒ longer search time
n-bit physical address:
  | Tag: n − (s + b) bits | Set Index: s bits | Byte Offset: b bits |
  (the upper n − b bits form the block number)
For fixed cache capacity = S × W × B:
  Larger associativity W
  ⇒ fewer sets S ⇒ smaller set index s
  ⇒ larger tag size n − (s + b)
  ⇒ longer tag search
Small advantage beyond 4-way associativity
8-32Dr. Martin LandMemory and I/O OrganizationComputer Architecture — Hadassah College — Spring 2019
Miss Example — Extreme Locality
Program
  int i, a;
  for (i = 0 ; i < 4096 ; i++) {
      a = i;
  }
Compiler assignments
  Register ← i
  Memory ← a
Memory accesses
  4096 write accesses to integer a
  1 compulsory cache miss (write allocate) on i = 0
  4095 cache hits on i > 0
  0 read accesses to a
Miss rate
  miss rate = misses / accesses = 1 / 4096 ≈ 2.44 × 10⁻⁴
8-33Dr. Martin LandMemory and I/O OrganizationComputer Architecture — Hadassah College — Spring 2019
Miss Example — Extreme Non‐Locality
Block size
  16 bytes/block = 4 words/block = 4 array elements
Program
  int i, a[4096];
  for (i = 0 ; i < 4096 ; i++) {
      a[i] = i;
  }
Compiler assignments
  Register ← i
  Memory ← a[]
Memory accesses
  4096 write accesses to integer array a[]
  Compulsory cache miss every 4 array elements
  4096 / 4 = 1024 cache misses (write allocate)
  0 read accesses to a[]
Miss rate
  miss rate = misses / accesses = 1024 / 4096 = 0.25
8-34Dr. Martin LandMemory and I/O OrganizationComputer Architecture — Hadassah College — Spring 2019
Miss Example —Good Locality
Program
  int a[4096], i, j;
  for (i = 0 ; i < 10 ; i++) {
      for (j = 0 ; j < 4096 ; j++) {
          a[j] = i + j;
      }
  }
Compiler assignments
  Register ← i, j
  Memory ← a[]
Cache parameters
  16 KB = 4 Kwords; Capacity = B × S × W with B = 16, S = 256, W = 4
  4K integers = 16 Kbytes = 1024 blocks
Memory accesses
  40960 write accesses to integer array a[]
  i = 0: compulsory cache miss every 4 elements
  i > 0: entire array in cache ⇒ cache hits
Miss rate
  miss rate = misses / accesses = 1024 / 40960 = 0.025
[Table: each of the 256 sets has 4 slots; set 0 holds blocks 0, 256, 512, 768; set 1 holds 1, 257, 513, 769; …; set 255 holds 255, 511, 767, 1023]
8-35Dr. Martin LandMemory and I/O OrganizationComputer Architecture — Hadassah College — Spring 2019
Miss Example — Extreme Non‐Locality
Program
  int a[5120], i, j;
  for (i = 0 ; i < 10 ; i++) {
      for (j = 0 ; j < 5120 ; j++) {
          a[j] = i + j;
      }
  }
Compiler assignments
  Register ← i, j
  Memory ← a[]
Cache parameters
  16 KB = 4 Kwords; Capacity = B × S × W with B = 16, S = 256, W = 4
  5K integers = 1280 blocks
Memory accesses
  51200 write accesses to integer array a[]
  Compulsory cache miss every 4 elements
  LRU: the block needed next has always just been evicted, so it is never in cache
Miss rate
  miss rate = misses / accesses = 12800 / 51200 = 0.25
[Table: set 0 sees 5 blocks (0, 256, 512, 768, 1024) cycling through its 4 slots under LRU, so every block access in every pass i = 0, 1, … misses (bold = miss)]
8-36Dr. Martin LandMemory and I/O OrganizationComputer Architecture — Hadassah College — Spring 2019
Miss Example — Better Locality Using MRU
Program
  int a[5120], i, j;
  for (i = 0 ; i < 10 ; i++) {
      for (j = 0 ; j < 5120 ; j++) {
          a[j] = i + j;
      }
  }
Compiler assignments
  Register ← i, j
  Memory ← a[]
Cache parameters
  16 KB = 4 Kwords; Capacity = B × S × W with B = 16, S = 256, W = 4
  5K integers = 1280 blocks
Memory accesses with MRU
  51200 write accesses to integer array a[]
  i = 0: compulsory cache miss every 4 elements
  i > 0: conflict misses to slot 3 (2 out of 5 accesses), for j = 0…255 and j = 1024…1279
Miss rate
  miss rate = misses / accesses = (1280 + 9 × 512) / 51200 = 0.025 × (1 + 9 × 2/5) = 0.115
[Table: slots 0–2 of set k keep blocks k, 256+k, 512+k; slot 3 alternates between blocks 768+k and 1024+k, and both of those accesses miss on every pass]
8-37Dr. Martin LandMemory and I/O OrganizationComputer Architecture — Hadassah College — Spring 2019
Miss Example — Improved Locality
Program (loop splitting)
  int a[5120], i, j;
  for (i = 0 ; i < 10 ; i++) {
      for (j = 0 ; j < 4096 ; j++) {
          a[j] = i + j;
      }
  }
  for (i = 0 ; i < 10 ; i++) {
      for (j = 4096 ; j < 5120 ; j++) {
          a[j] = i + j;
      }
  }
Compiler assignments
  Register ← i, j
  Memory ← a[]
Cache parameters
  16 KB = 4 Kwords; Capacity = B × S × W with B = 16, S = 256, W = 4
  5K integers = 1280 blocks
Memory accesses
  51200 write accesses to integer array a[]
  In each loop nest, i = 0: compulsory cache miss every 4 elements
  i > 0: entire working set in cache ⇒ cache hits
Miss rate
  miss rate = misses / accesses = (1024 + 256) / (40960 + 10240) = 0.025
[Table: the first nest fills each set k with blocks k, 256+k, 512+k, 768+k; the second nest reuses the same sets with blocks 1024+k]
8-38Dr. Martin LandMemory and I/O OrganizationComputer Architecture — Hadassah College — Spring 2019
Miss Example — Address Aliasing
Program
  int a[512], b[512], c[512], i, j;
  for (i = 0 ; i < 20 ; i++) {
      for (j = 0 ; j < 512 ; j++) {
          c[j] = a[j] + b[j] + c[j];
      }
  }
Compiler assignments
  Register ← i, j
  Memory (address = 0200AS₁S₂B hex)
    a: 02000000 – 020007FF
    b: 02001000 – 020017FF
    c: 02002000 – 020027FF
Cache parameters
  8 KB = 2 Kwords; Capacity = B × S × W with B = 16, S = 256, W = 2
  3 × 512 integers = 6 KB = 384 blocks
Memory accesses
  20 × 3 × 512 = 30720 read accesses to a[], b[], c[]
  20 × 512 = 10240 write accesses to array c[]
  Set assignment: set = (address div 10h) mod 100h = S₁S₂, so a[j], b[j], c[j] always map to the same set
  i = 0: 3 × 512 / 4 = 384 compulsory misses
  i > 0: 2 × 512 / 4 = 256 conflict misses per iteration (blocks of a[] and c[] evict each other)
Miss rate
  miss rate = misses / accesses = (384 + 19 × 256) / (30720 + 10240) = 5248 / 40960 ≈ 0.128
[Table: 2 slots per set; the blocks of a[], b[], c[] compete for sets 0 – 7F]
8-39Dr. Martin LandMemory and I/O OrganizationComputer Architecture — Hadassah College — Spring 2019
Miss Example — Address Aliasing with Larger W
Program
  int a[512], b[512], c[512], i, j;
  for (i = 0 ; i < 20 ; i++) {
      for (j = 0 ; j < 512 ; j++) {
          c[j] = a[j] + b[j] + c[j];
      }
  }
Compiler assignments
  Register ← i, j
  Memory (address = 0200AS₁S₂B hex)
    a: 02000000 – 020007FF
    b: 02001000 – 020017FF
    c: 02002000 – 020027FF
Cache parameters
  8 KB = 2 Kwords; Capacity = B × S × W with B = 16, S = 128, W = 4
  3 × 512 integers = 6 KB = 384 blocks
Memory accesses
  20 × 3 × 512 = 30720 read accesses to a[], b[], c[]
  20 × 512 = 10240 write accesses to array c[]
  Set assignment: set = (address div 10h) mod 80h, so a[j], b[j], c[j] still alias to the same set
  i = 0: 3 × 512 / 4 = 384 compulsory misses
  i > 0: all arrays in cache ⇒ cache hits
Miss rate
  miss rate = misses / accesses = 384 / (30720 + 10240) = 384 / 40960 = 0.009375
[Table: 4 slots per set; the blocks of a[], b[], c[] share sets 0 – 7F without eviction]
8-40Dr. Martin LandMemory and I/O OrganizationComputer Architecture — Hadassah College — Spring 2019
Workstation Layout with PCI
ATA disk controllers
  PATA — parallel ATA
  SATA — serial ATA
[Diagram: CPU ↔ host bridge (switching fabric) ↔ main memory, I/O controllers (long-term storage, user interface, network), system controllers, and an ISA/EISA bridge to the legacy ISA bus]
8-41Dr. Martin LandMemory and I/O OrganizationComputer Architecture — Hadassah College — Spring 2019
PCI Services
Boot services
  BIOS (basic input/output system)
  ROM-based software for initiating system boot
Timers
  System timers, counters, and real-time clocks
Interrupt controllers
  Programmable interrupt control
  IRQ — interrupt requests
  Interrupt messages
Direct Memory Access (DMA)
  Permits devices to access memory without CPU intervention
8-42Dr. Martin LandMemory and I/O OrganizationComputer Architecture — Hadassah College — Spring 2019
BIOS = Basic Input/Output System
Hardware system is started or reset
  CPU performs self-check
  CPU fetches instruction from address FFFF0h
  Address FFFF0h contains a branch instruction
  Target of the branch is firmware code located in the PCI BIOS ROM
ROM = Read-Only Memory
  Usually E²PROM = Electrically Erasable Programmable ROM
BIOS locates keyboard, display, boot device
UEFI (Unified Extensible Firmware Interface) system
  BIOS loads UEFI
  Hardware-oriented operating system that runs above firmware
  Performs system management, including boot of the main OS
Non-UEFI system
  BIOS begins loading the OS from the boot device
8-43Dr. Martin LandMemory and I/O OrganizationComputer Architecture — Hadassah College — Spring 2019
Interrupt Handling
APIC
  Advanced Programmable Interrupt Controller
Local APIC
  Interrupt controller in CPU
  Local interrupt: INTR + int_number from device
  Interrupt messages: structured message
I/O APIC
  Interrupt controller in PCI chipset
  Sends/receives interrupt messages
  Replaces old IRQ system (each device assigned a private IRQ)
  All device interrupts defined as IRQ9
  Interrupt message describes the external event
8-44Dr. Martin LandMemory and I/O OrganizationComputer Architecture — Hadassah College — Spring 2019
Interprocessor Interrupts
CPU can send/receive Interprocessor Interrupt (IPI)
  Used in multiprocessor (MP) systems
  Standard APIC interrupt message syntax
Generating IPI message
  CPU writes to interrupt command register (ICR) in local APIC
  Local APIC issues IPI message on system bus
8-45Dr. Martin LandMemory and I/O OrganizationComputer Architecture — Hadassah College — Spring 2019
DMA — Direct Memory Access
Peripheral device accesses memory directly
  No need for CPU to execute transfer instructions
  Used for large data transfers
CPU works concurrently
  Can preempt DMA for cache update
[Diagram: switching fabric connects the CPU (via bus adaptor), main memory, I/O controllers (long-term storage, user interface, network), and system controllers (timers, interrupts, DMA)]
8-46Dr. Martin LandMemory and I/O OrganizationComputer Architecture — Hadassah College — Spring 2019
DMA Operation
Interrupt mode
  CPU instructions to DMA controller set up the transfer
    Start address
    Number of bytes to transfer
  DMA controller
    Acts as master of the data path
    Transfers data between RAM and peripheral device
    IRQ at end of transfer
  CPU takes back bus control
PCI arbitration
  PCI device
    Requests control of bus
    Requests read/write memory access
  PCI bridge
    Grants bus control
8-47Dr. Martin LandMemory and I/O OrganizationComputer Architecture — Hadassah College — Spring 2019
Cache Coherency and Consistency
Multiple processors share data with main memory
  Each processor copies data blocks to cache
  Differences can develop between caches and/or main memory
Example
  CPU-1 and CPU-2 read X to cache from main memory
  CPU-1 writes to X in cache
  Invalid copies of X in CPU-2 cache and main memory
Consistency
  Copies of data locations are always identical
Coherence
  Reads and writes occur in the correct order
  Easier than consistency
[Diagram: CPU-1 with L1 cache and CPU-2 with L1 cache, each with an L2 cache, sharing main memory]
8-48Dr. Martin LandMemory and I/O OrganizationComputer Architecture — Hadassah College — Spring 2019
Bus Snooping
Write-back cache policy
  CPU updates private cache; no update to main memory
  CPU evicts cache block: updates main memory
Bus snooping
  On CPU write
    CPU always writes destination addresses on bus (short bus cycle)
    CPU writes data to private cache (not on bus)
  Bus devices monitor all addresses written on bus
    See which CPUs are loading a cache block
    See which CPUs are writing to a cache block
Write synchronization
  Bus arbitration prevents writes to multiple copies of a cache block
  Only one CPU places a target address on the memory bus per bus cycle
  Only one CPU can write to the same cache block in one bus cycle
[Diagram: CPU 0, CPU 1, CPU 2 — each with architectural state, execution core, and cache — on a shared bus with main memory and a PCI bridge to the I/O bus]
8-49Dr. Martin LandMemory and I/O OrganizationComputer Architecture — Hadassah College — Spring 2019
Client/Server Model of I/O
Fast CPU is client for slower I/O services
  Buffer stores client requests and forwards at slower server response rate
Latency
  Time between client request and buffer response
Throughput
  Number of services provided per unit time
Bandwidth
  Maximum data transfer rate of I/O channel (including buffer)
Capacity
  Maximum throughput of server through buffer
  Depends on bandwidth and service rate (server speed)
Utilization
  Request rate as a proportion of capacity
[Diagram: client (processor) → request → FIFO buffer (queue) → forward → server (device) → response]
8-50Dr. Martin LandMemory and I/O OrganizationComputer Architecture — Hadassah College — Spring 2019
Buffer Operation
Client requests
  Intermittent high-speed requests (bursty)
  Peak client request rate >> average client request rate
Service responses
  Peak client rate > server rate > average client rate
FIFO buffers requests in order of arrival
  Stores requests arriving at the higher client request rate
  Forwards requests to server at the lower server response rate
  Request forwarding rate = server response rate
8-51Dr. Martin LandMemory and I/O OrganizationComputer Architecture — Hadassah College — Spring 2019
Buffer Latency and Overflow
Minimum latency
  Determined by maximum service response rate
Buffer overflow
  Buffer fill rate = client request rate − service response rate
  Buffer fills continuously for too long ⇒ buffer overflow
Example
  Peak CPU disk read rate = 1 read/cycle = 10⁹ read requests/second
  Disk can provide 10⁷ responses per second = 100 CPU cycles/read
  CPU sees minimum latency of about 100 CPU cycles
  Buffer can hold 1000 requests
  Continuous requests ⇒ overflow in 1000 / (10⁹ − 10⁷) ≈ 10⁻⁶ seconds
8-52Dr. Martin LandMemory and I/O OrganizationComputer Architecture — Hadassah College — Spring 2019
Utilization — Latency Trade‐Off
Utilization
  Higher client request rate
  More services per second ⇒ higher utilization
  Server cannot work faster (service rate is fixed)
  More requests are buffered ⇒ longer queue length (higher buffer level)
  Total latency for one request = server latency + queuing time
  More requests in buffer queue ⇒ longer queuing time
  Higher utilization ⇒ higher total latency
Buffer overflow
  Average request rate > average response rate
  More requests enter the buffer than leave
  Buffer level rises
  After a long time, the buffer overflows
[Chart: latency and buffer level vs. utilization (0 to 0.9); both grow steeply as utilization approaches 1]
8-53Dr. Martin LandMemory and I/O OrganizationComputer Architecture — Hadassah College — Spring 2019
Queuing Theory — 1
Assumptions
Client requests
  Arrive independently (Poisson statistics)
  Have random length (bytes to transfer)
  Average request rate in steady state
Buffer
  Stores requests and forwards in order of arrival (FIFO) at service rate
  Average buffer level (stored requests) in steady state
Server
  Provides services to each request independently (Poisson statistics)
  Average service rate in steady state
8-54Dr. Martin LandMemory and I/O OrganizationComputer Architecture — Hadassah College — Spring 2019
Queuing Theory — 2
Results
  Utilization = Request Rate / Service Rate
  Latency = 1 / (Service Rate − Request Rate)
          = (1 / Service Rate) × 1 / (1 − Utilization)
  Buffer Level = Latency × Request Rate
               = Utilization / (1 − Utilization)
[Chart: latency and buffer level vs. utilization, as on slide 8-52; both diverge as utilization approaches 1]
8-55Dr. Martin LandMemory and I/O OrganizationComputer Architecture — Hadassah College — Spring 2019
Traffic Shaping
High utilization causes congestion
  Higher packet error rate (noise + collisions)
  Buffer overflow
  Re-transmitting lost packets ⇒ more requests ⇒ more collisions
Traffic shaping
  Buffer at client imposes request quotas on the client
  Client request rate = maximum transmission rate on network
  Forward rate = actual transmission rate = optimum network rate
[Diagram: client → FIFO buffer → forward → network → server → response]
[Chart: network throughput vs. offered load]
9-1Dr. Martin LandAdvanced ArchitecturesComputer Architecture — Hadassah College — Spring 2019
Advanced Architectures
9-2Dr. Martin LandAdvanced ArchitecturesComputer Architecture — Hadassah College — Spring 2019
General Outlook
Fundamental performance parameters
  T = CPI × IC × τ
  CPI = CPI_ideal + CPI_stall
      = CPI_ideal + CPI_stall(data dependency) + CPI_stall(cache miss) + CPI_stall(branch penalty)
Technological limitations
  τ ≈ (10 GHz)⁻¹ = 10⁻¹⁰ seconds, reaching physical limit
  IC grows with software complexity
  CPI_ideal = 1 for integer pipeline
Areas for possible improvement
  Instruction and thread level parallelism to achieve CPI_ideal < 1
  Reducing instruction dependency stalls to lower CPI_stall(data dependency)
  Reducing cache latency to lower CPI_stall(cache miss)
  Reducing branch stalls to lower CPI_stall(branch penalty)
9-3Dr. Martin LandAdvanced ArchitecturesComputer Architecture — Hadassah College — Spring 2019
Practical Successes
Instruction Level Parallelism (ILP)
  Provide multiple copies of hardware units in each processor
  Begin executing multiple instructions on the same clock cycle
  Multiple instructions finish on every clock cycle ⇒ CPI < 1
Reducing instruction dependency stalls
  Compiler rescheduling or dynamic rescheduling (out-of-order execution)
Improving floating point performance
  Process FP instructions in parallel
Reducing branch stalls
  Advanced branch prediction
Reducing cache latency
  Processor pre-fetches cache blocks based on address prediction
  Optimization of data structures
Thread Level Parallelism
  Provide multiple complete processor cores
  Divide code into independently executing threads
9-4Dr. Martin LandAdvanced ArchitecturesComputer Architecture — Hadassah College — Spring 2019
Types of Parallelism
Pipelining
  Instruction In+1 begins before In completes
  1 instruction completes on every clock cycle τ
  R = 1/τ instructions complete every second
Superscalar
  M > 1 copies of pipeline in parallel
  M instructions start on the same clock cycle
  M instructions complete on every clock cycle
Superpipelining
  Divide pipeline into smaller stages — less work per stage
  Less work ⇒ shorter clock cycle τ' < τ ⇒ higher clock rate R' > R
  R' > R instructions complete every second
Multiprocessor
  N > 1 program sections running on N processors
  Overall program runs in less time
9-5Dr. Martin LandAdvanced ArchitecturesComputer Architecture — Hadassah College — Spring 2019
DLX Pipeline
Five pipeline stages: IF ID EX MEM WB
New instruction begins on each clock cycle
One instruction completes on each clock cycle

        1    2    3    4    5    6    7    8
I1      IF   ID   EX   MEM  WB
I2           IF   ID   EX   MEM  WB
I3                IF   ID   EX   MEM  WB

CPI_pipeline = (4 + IC) / IC → 1 for large IC
Run-Time = CPI × IC × τ → IC × τ for large IC
9-6Dr. Martin LandAdvanced ArchitecturesComputer Architecture — Hadassah College — Spring 2019
Vector Processors
SIMD execution model
  Single Instruction performed in parallel on Multiple Data
Typical applications for vector operations
  Data compression/decompression
  Audio processing
  Graphics processing
9-7Dr. Martin LandAdvanced ArchitecturesComputer Architecture — Hadassah College — Spring 2019
DLX Vector Pipeline Example
Five pipeline stages: IF ID EX MEM WB

                      1    2    3    4    5    6    7    8
p_LW P1, 400(R1)      IF   ID   EX   MEM  WB
p_LW P2, 800(R1)           IF   ID   EX   MEM  WB
p_ADD P3, P1, P2                IF   ID   ID   EX   MEM  WB

p_LW P1, 400(R1) — load 4 memory words (16 bytes) to register P1
p_LW P2, 800(R1) — load 4 memory words (16 bytes) to register P2
p_ADD P3, P1, P2 — perform 4 word additions in parallel
9-8Dr. Martin LandAdvanced ArchitecturesComputer Architecture — Hadassah College — Spring 2019
Examples of Vector Processing
Intel MMX Technology
  Integer operations on 64-bit registers
  8 bytes, 4 words, or 2 dwords
Intel SSE (SSE2/SSE3/AVX) Technology
  Similar to MMX for floating point operations
PowerPC AltiVec vector processor
  Similar to SSE
  128-bit registers
Compiler support
  Vector instructions part of processor instruction set
  Reasonable compilers support vectorization

                              SSE                  SSE2/SSE3/AVX
Register width                128 bits = 16 bytes  256 bits = 32 bytes
Single-precision FP ops       4                    8
Double-precision FP ops       2                    4
9-9Dr. Martin LandAdvanced ArchitecturesComputer Architecture — Hadassah College — Spring 2019
Intel Vectorization Example
Scalar version — loop runs 100 times, operates on one array element per iteration:

  int i;
  float a[100], red;
  …
  red = 0;
  for (i = 0; i < 100; i++) {
      red += a[i];
  }

Vectorized version — loop runs 25 times, operates on 4 array elements per iteration:

  int i;
  float a[100], red;
  …
  red = 0;
  ASM p_XOR xmm0, xmm0       /* zero 128-bit accumulator */
  for (i = 0; i < 25; i++) {
      ASM p_ADD xmm0, a[4*i] /* four 32-bit additions */
  }
  ASM h_ADD xmm0             /* add four 32-bit FP registers to one */
  ASM p_MOV red, xmm0        /* move FP sum to memory location */
9-10Dr. Martin LandAdvanced ArchitecturesComputer Architecture — Hadassah College — Spring 2019
Cache Refresh Latency
Reads 1024 × 1024 SSE operands (16-byte operands) = 16 MB
  Reads sequentially, without repeated access to the same data
  Pentium 4 has 64-byte block size = 4 × 16-byte operands
  Will miss in L1 on every 4th access

  for (i = 0; i < 1024; i++) {
      for (j = 0; j < 1024; j++) {
          SSE_operation a[i][j];
      }
  }

[Timing diagram: the pipeline fetches and performs 4 operations, then sits idle during each miss penalty while the I/O bus performs the cache update, repeating for every block]
9-11Dr. Martin LandAdvanced ArchitecturesComputer Architecture — Hadassah College — Spring 2019
Cache Prefetch
Reads 1024 × 1024 SSE operands (16-byte values) = 16 MB
On Pentium 4 prefetching:
  Software prefetch loads 128 bytes (2 cache blocks)
  128 bytes = 128/16 = 8 SSE operands
  Prefetch 8 operands forward
  NOP on prefetch of a cache hit

  for (i = 0; i < 1024; i++) {
      for (j = 0; j < 1024; j++) {
          prefetch a[i][j+8];
          SSE_operation a[i][j];
      }
  }

[Timing diagram: cache reads triggered by the prefetches overlap the pipeline's 4 operations, so the pipeline no longer idles]
9-12Dr. Martin LandAdvanced ArchitecturesComputer Architecture — Hadassah College — Spring 2019
P6 Superscalar with Dynamic Rescheduling
Intel P6 architecture for Intel x86 since Pentium II
Fetch/Decode
  Converts IA-32 instructions to 1 – 6 RISC-type micro-ops per CC
Instruction pool — out-of-order dynamic rescheduling
  Holds micro-ops until ready for execution
  Scheduler issues micro-ops to parallel execution in ALU, FPU, Load, Store units
  Finished micro-ops return to instruction pool with execution results
Retirement (in-order write back)
  Finished micro-ops write in original program order
[Diagram: fetch-and-decode from instruction memory feeds the instruction pool; execution units (2 ALUs, 2 FPUs, Load, Store) take micro-ops from the pool; write back retires results to registers and data memory]
9-13Dr. Martin LandAdvanced ArchitecturesComputer Architecture — Hadassah College — Spring 2019
Instruction Scoreboard
Status field assigned to instructions in program listing
  NR — Not Ready — at least one source operand not available
  R — Ready — all source operands available
  X — Executed — instruction executed, destination operand not available
  F — Finished — instruction executed, all destination operand(s) available
Instructions executed according to status fields
  Only instructions marked Ready can be executed
Scheduling policy
  Depends on hardware organization
Update scoreboard after each clock cycle
  Completed instructions marked Finished
  Instructions marked Ready as operands become available
9-14Dr. Martin LandAdvanced ArchitecturesComputer Architecture — Hadassah College — Spring 2019
Scoreboard Example for DLX
Scheduling rules
  Issue only ready instructions
  Choose instructions in ORIGINAL PROGRAM ORDER
  Scoreboard generates inefficient code
Instruction status after each EX clock cycle (ignoring IF, ID, MEM, WB)
NR = Not Ready, R = Ready, X = Executed, F = Executed and destination operand available

Instruction        CC: 1   2   3   4   5   6   7   8   9   10  11  12
LW  R2,[X]             X   F   F   F   F   F   F   F   F   F   F   F
ADD R2,R2,#123         NR  R   F   F   F   F   F   F   F   F   F   F
SW  [X],R2             NR  NR  R   F   F   F   F   F   F   F   F   F
LW  R3,[Y]             R   R   R   R   X   F   F   F   F   F   F   F
SUB R3,R3,#456         NR  NR  NR  NR  NR  R   F   F   F   F   F   F
SW  [Y],R3             NR  NR  NR  NR  NR  NR  R   F   F   F   F   F
LW  R4,[Z]             R   R   R   R   R   R   R   R   X   F   F   F
SUB R4,R4,#789         NR  NR  NR  NR  NR  NR  NR  NR  NR  R   F   F
SW  [Z],R4             NR  NR  NR  NR  NR  NR  NR  NR  NR  NR  R   F
9-15Dr. Martin LandAdvanced ArchitecturesComputer Architecture — Hadassah College — Spring 2019
Scoreboard Example for Dynamic DLX
Scheduling rules
  Issue only ready instructions
  Choose instructions in ORDER THEY BECOME READY
  Scoreboard generates compiler-rescheduled code
Instruction status after each EX clock cycle (ignoring IF, ID, MEM, WB)
NR = Not Ready, R = Ready, X = Executed, F = Executed and destination operand available

Instruction        CC: 1   2   3   4   5   6   7   8   9
LW  R2,[X]             X   F   F   F   F   F   F   F   F
ADD R2,R2,#123         NR  R   R   F   F   F   F   F   F
SW  [X],R2             NR  NR  NR  R   R   R   F   F   F
LW  R3,[Y]             R   X   F   F   F   F   F   F   F
SUB R3,R3,#456         NR  NR  R   R   F   F   F   F   F
SW  [Y],R3             NR  NR  NR  NR  R   R   R   F   F
LW  R4,[Z]             R   R   X   F   F   F   F   F   F
SUB R4,R4,#789         NR  NR  NR  R   R   F   F   F   F
SW  [Z],R4             NR  NR  NR  NR  NR  R   R   R   F
9-16Dr. Martin LandAdvanced ArchitecturesComputer Architecture — Hadassah College — Spring 2019
Scoreboard Example for P6
Scheduling rules
  Issue only ready instructions
  Among ready instructions, maintain program list order
  Only 1 load and 1 store per CC
  Up to 2 ALU and 2 FPU instructions per CC
Execution condition after each clock cycle (ignoring fetch, decode, write-back)
NR = Not Ready, R = Ready, F = Finished

Instruction        Unit    CC1  CC2  CC3  CC4  CC5
LW  R2,[X]         Load    F    F    F    F    F
ADD R2,R2,#123     ALU     R    F    F    F    F
SW  [X],R2         Store   NR   R    F    F    F
LW  R3,[Y]         Load    R    F    F    F    F
SUB R3,R3,#456     ALU     NR   R    F    F    F
SW  [Y],R3         Store   NR   NR   R    F    F
LW  R4,[Z]         Load    R    R    F    F    F
SUB R4,R4,#789     ALU     NR   NR   R    F    F
SW  [Z],R4         Store   NR   NR   NR   R    F
9-17Dr. Martin LandAdvanced ArchitecturesComputer Architecture — Hadassah College — Spring 2019
Program Execution in P6
IA-32 instructions decoded in 2 CC to RISC micro-ops with register renaming:
  ADD [X],123  →  LW R2,[X]; ADD R2,R2,#123; SW [X],R2
  SUB [Y],567  →  LW R3,[Y]; SUB R3,R3,#567; SW [Y],R3
  SUB [Z],789  →  LW R4,[Z]; SUB R4,R4,#789; SW [Z],R4
Dynamic scheduling:
  CC1: Load LW R2,[X]
  CC2: Load LW R3,[Y]; ALU ADD R2,R2,#123
  CC3: Load LW R4,[Z]; ALU SUB R3,R3,#567; Store SW [X],R2
  CC4: ALU SUB R4,R4,#789; Store SW [Y],R3
  CC5: Store SW [Z],R4
9-18Dr. Martin LandAdvanced ArchitecturesComputer Architecture — Hadassah College — Spring 2019
Hardware Utilization
Good program efficiency
  Program executes in minimum number of sequential cycles
Low hardware utilization
  Most execution units idle in most clock cycles
Higher ILP ⇒ higher utilization of execution units
  Higher utilization ⇒ larger pool of independent instructions
Speculation — deep branch prediction
  Many instructions executed before program flow determined
Hardware multithreading
  Instructions from different threads are independent

Unit    CC1        CC2             CC3             CC4             CC5
ALU     IDLE       ADD R2,R2,#123  SUB R3,R3,#567  SUB R4,R4,#789  IDLE
ALU     IDLE       IDLE            IDLE            IDLE            IDLE
FPU     IDLE       IDLE            IDLE            IDLE            IDLE
FPU     IDLE       IDLE            IDLE            IDLE            IDLE
Load    LW R2,[X]  LW R3,[Y]       LW R4,[Z]       IDLE            IDLE
Store   IDLE       IDLE            SW [X],R2       SW [Y],R3       SW [Z],R4
9-19Dr. Martin LandAdvanced ArchitecturesComputer Architecture — Hadassah College — Spring 2019
Deep Superpipeline for DLX
Divide each pipeline stage into 2 smaller stages
  Each new stage does half the work in half the time
  New stage finishes in half the time ⇒ double clock speed

      1    2    3    4    5    6    7    8    9    10   11   12
I1    IF1  IF2  ID1  ID2  EX1  EX2  MEM1 MEM2 WB1  WB2
I2         IF1  IF2  ID1  ID2  EX1  EX2  MEM1 MEM2 WB1  WB2
I3              IF1  IF2  ID1  ID2  EX1  EX2  MEM1 MEM2 WB1  WB2

Double clock speed ⇒ τ_superpipeline = τ_pipeline / 2
CPI_superpipeline = (9 + IC) / IC → 1 for large IC
T_superpipeline = CPI × IC × τ_superpipeline → IC × τ_pipeline / 2 = T_pipeline / 2 for large IC

Problems with deep superpipeline
  Some instructions cannot be effectively split
  Some operations do not scale in time — faster clock ⇒ more stall cycles
    Cache update, branch penalty, page fault, etc.
9-20Dr. Martin LandAdvanced ArchitecturesComputer Architecture — Hadassah College — Spring 2019
Pentium 4 Superpipeline
Pentium III
  10-stage pipeline at clock speeds up to about 1.5 GHz
Pentium 4
  20-stage pipeline at clock speeds up to about 4.0 GHz
Expect
  1.5 GHz processor ~ 50% faster than the same processor at 1.0 GHz
Measurement on SPEC CINT2000
  1.5 GHz Pentium 4 ~ 20% faster than 1.0 GHz Pentium III

S = (CPI_PIII × IC × τ_PIII) / (CPI_P4 × IC × τ_P4)
  = (CPI_PIII / CPI_P4) × (1.5 GHz / 1.0 GHz) = 1.2
⇒ CPI_P4 = (1.5 / 1.2) × CPI_PIII = 1.25 × CPI_PIII
CPI_ideal(P4) + CPI_stall(P4) = 1.25 × [CPI_ideal(PIII) + CPI_stall(PIII)]
CPI_ideal(P4) = CPI_ideal(PIII) = 1 ⇒ CPI_stall(P4) = 1.25 × CPI_stall(PIII) + 0.25
9-21Dr. Martin LandAdvanced ArchitecturesComputer Architecture — Hadassah College — Spring 2019
Hyper‐Threading
Two copies of architectural state — one execution core
  OS sees two sets of registers — looks like two CPUs
OS assigns threads to CPU 0 and CPU 1
  CPU 0 and CPU 1 issue instructions to the shared execution core
No stall in either thread
  CPU 0 and CPU 1 issue instructions on alternate clock cycles
Stall in one thread
  Other CPU issues instructions on each clock cycle until the stall ends
  Both CPUs keep working on most clock cycles
Architectural state: registers, stack pointers, and program counter
Execution core: ALU, FPU, vector processors, memory unit
[Diagram: CPU 0 and CPU 1 each have their own architectural state but share one execution core and cache, connected to main memory and the PCI bridge/I/O bus]
9-22Dr. Martin LandAdvanced ArchitecturesComputer Architecture — Hadassah College — Spring 2019
Expected Improvement from Hyper‐Threading
Without hyper-threading
  CPI_stall = P_S × CPS
  P_S = probability of a stall; CPS = cycles per stall
With hyper-threading
  Stall cycles count only when both threads stall simultaneously:
  CPI'_stall ≈ P_S² × CPS
Speedup
  S_HT = (CPI × IC × τ) / (CPI' × IC × τ)
       = (CPI_ideal + P_S × CPS) / (CPI_ideal + P_S² × CPS)
Take for Pentium 4: CPI_ideal = 1, CPS × P_S = 0.5 with P_S = 0.5
  ⇒ CPS × P_S² = 0.5 / 2 = 0.25
  ⇒ S_HT = (1 + 0.5) / (1 + 0.25) = 1.2
Measured improvement ≈ 1.2
Intel, "Hyper-Threading Technology Architecture and Microarchitecture"
9-23Dr. Martin LandAdvanced ArchitecturesComputer Architecture — Hadassah College — Spring 2019
Intel Nehalem Micro‐Architecture
David Kanter, "Inside Nehalem: Intel's Future Processor and System",http://realworldtech.com/includes/templates/articles.cfm?ArticleID=RWT040208182719&mode=print
9-24Dr. Martin LandAdvanced ArchitecturesComputer Architecture — Hadassah College — Spring 2019
Amdahl’s Equation in Parallel Processing
Definitions
  F = fraction of processing that can be performed independently
  N = number of processing units
  F of the work can be parallelized; 1 − F of the work cannot be parallelized
CPI on N processors
  CPI_N = CPI_1 × [(1 − F) + F/N]
Speedup with N processors
  S_N = (CPI_1 × IC × τ) / (CPI_N × IC × τ) = 1 / [(1 − F) + F/N]
9-25
MP and HT Performance Enhancements

Speed-up (S) for On-Line Transaction Processing (OLTP) workload:

MP without hyper-threading: S = 1.72 (S/CPU = 0.85); S = 2.64 (S/CPU = 0.65)
Hyper-threading without MP: S = 1.22 (S/CPU = 0.60)
9-26
Rise and Fall of Multiprocessor R&D

Ref: Mark D. Hill and Ravi Rajwar, "The Rise and Fall of Multiprocessor Papers in the International Symposium on Computer Architecture (ISCA)", http://pages.cs.wisc.edu/~markhill/mp2001.html
Topics of papers submitted to ISCA, 1973 to 2001
Sorted as percent of total
ISCA — International Symposium on Computer Architecture
Hennessy and Patterson joke that the proper place for multiprocessing in their book is Chapter 11 (the section of US business law on bankruptcy)
9-27
Basic Interprocess Communication Models

Shared memory system
  Interprocess communication — write/read a shared memory location
  Single shared address space
  Sequential coherence enforced by cache snooping
    Bus imposes write/read order
    Cache coherency overhead

Message passing system
  Interprocess communication — send/receive structured messages
    Send / request data
    Provide requested data or status
  Sequential coherence enforced by message content + synchronization
    No snooping, so no snooping overhead
    Message management contributes overhead
9-28
Multiprocessor Shared Memory Multi‐Threading

One or more physical microprocessors
  Architectural state — registers, including stack pointers and program counter
  Execution core — integer ALUs, FPUs, vector processors, memory access
OS assigns a thread to each processor
  Each thread runs independently
  On a long stall (page fault), a CPU can switch threads
[Figure: CPU 0 and CPU 1, each with its own architectural state, execution core, and cache, sharing main memory and the I/O bus (PCI bridge)]
9-29
Multi‐Core Shared Memory Multi‐Threading

Multiprocessor system on one physical chip
  Cheaper than a multi-microprocessor system
  Can be a bottleneck at the memory bus
    Both processors need to update cache simultaneously
    One processor must wait
[Figure: CPU 0 and CPU 1 on one chip, each with its own architectural state, execution core, and L1 cache, sharing an L2 cache, main memory, and the I/O bus (PCI bridge)]
9-30
OpenMP for Shared Memory Systems

Application Program Interface (API) for multiprocessing
  Supports shared memory applications in C/C++ and Fortran
  Provides directives for explicit thread-based parallelization
  Simple programming models on shared memory machines

Fork — Join model
  Master thread (consumer thread)
    Program initiates as a single thread
    Executes sequentially until a parallel construct is encountered
  Fork (producer thread)
    Master thread creates a team of parallel threads
    Program statements in the parallel construct execute in parallel
  Join
    Team threads complete
    Synchronize and terminate
    Master thread continues
  Nesting
    Forks can be defined within parallel sections

Ref: https://computing.llnl.gov/tutorials/openMP/
9-31
General Code Structure

#include <omp.h>

main () {
    int var1, var2, var3;

    /* Serial code */
    ...

    #pragma omp parallel private(var1, var2) shared(var3)
    {
        /* Parallel section executed by all threads */
        ...
        /* All threads join master thread and disband */
    }

    /* Resume serial code */
    ...
}

Variables shared among all threads — one copy accessed by all threads
Variables private to each thread — each thread has a private copy
9-32
"Hello Worlds" Program

#include <omp.h>
#include <stdio.h>

main () {
    int nthreads, tid;

    /* Fork team of threads with private variables */
    #pragma omp parallel private(tid)
    {
        /* Obtain and print thread id */
        tid = omp_get_thread_num();
        printf("Hello World from thread = %d\n", tid);

        /* Only master thread does this */
        if (tid == 0) {
            nthreads = omp_get_num_threads();
            printf("Number of threads = %d\n", nthreads);
        }
    } /* All threads join master thread and terminate */
}
9-33
Parallel For

#pragma omp parallel
{
    #pragma omp for
    for (i = 0; i < 12; i++)
        c[i] = a[i] + b[i];
}

[Figure: master thread forks a team at omp parallel; iterations i = 0–3, i = 4–7, and i = 8–11 run on three threads; join returns to the master thread]
Data decomposition — 12 loop iterations divided among 3 CPU cores
Each core executes 4 loop iterations in parallel
9-34
Sections

#pragma omp parallel shared(a,b,c,d) private(i)
{
    #pragma omp sections
    {
        #pragma omp section
        for (i = 0; i < N; i++)
            c[i] = a[i] + b[i];

        #pragma omp section
        for (i = 0; i < N; i++)
            d[i] = a[i] * b[i];
    } /* end of sections */
} /* end of parallel section */

Functional decomposition — enclosed sections of code divided among the threads in the team
[Figure: master thread forks at omp parallel; one thread computes c[i] = a[i] + b[i] while another computes d[i] = a[i] * b[i]; join returns to the master thread]
9-35
Message Passing Example — Vector Product

Compute Σ (i = 0 to 3) a[i] * b[i] from data pre-distributed to nodes

P0:  load Ra, a
     load Rb, b
     Ra ← Ra * Rb
     send P1, Ra

P1:  load Ra, a
     load Rb, b
     Ra ← Ra * Rb
     recv P0, Rb
     Ra ← Ra + Rb
     send P3, Ra

P2:  load Ra, a
     load Rb, b
     Ra ← Ra * Rb
     send P3, Ra

P3:  load Ra, a
     load Rb, b
     Ra ← Ra * Rb
     recv P2, Rb
     Ra ← Ra + Rb
     recv P1, Rb
     Ra ← Ra + Rb
     store p, Ra

Message overhead — source or destination, time of creation
Sequential consistency guaranteed by message overhead
  P3 distinguishes two reads (receives) from P1 and P2 by source address
  No data hazard
9-36
Scatter and Gather

Scatter — one task's send buffer is distributed, one element per task, to destination buffers
  Send buffer {1, 2, 3, 4} → Task 0 receives 1, Task 1 receives 2, Task 2 receives 3, Task 3 receives 4

Gather — one element from each task's send buffer is collected into one destination buffer
  Task 0 sends A, Task 1 sends B, Task 2 sends C, Task 3 sends D → destination buffer {A, B, C, D}
9-37
Reduce

Reduce — elements from each task's send buffer are combined by an operation into one destination buffer
  Reduce: ADD — Tasks 0–3 send 1, 2, 3, 4 → destination buffer receives 10
9-38
MPI "Hello World"

#include "mpi.h"
#include <stdio.h>
#include <string.h>

main( argc, argv )
int argc;
char **argv;
{
    char message[20];
    int myrank;            /* myrank = this process number */
    MPI_Status status;     /* MPI_Status = error flags */

    MPI_Init( &argc, &argv );
    MPI_Comm_rank( MPI_COMM_WORLD, &myrank );
    /* MPI_COMM_WORLD = list of active MPI processes */

    if (myrank == 0)       /* code for process zero */
    {
        strcpy(message, "Hello, there");
        MPI_Send(message, strlen(message)+1, MPI_CHAR, 1, 99, MPI_COMM_WORLD);
    }
    else                   /* code for process one */
    {
        MPI_Recv(message, 20, MPI_CHAR, 0, 99, MPI_COMM_WORLD, &status);
        printf("received :%s:\n", message);
    }

    MPI_Finalize();
}

Ref: "MPI: A Message-Passing Interface Standard Version 1.3"
10-1Dr. Martin LandReal-Life RISCComputer Architecture — Hadassah College — Spring 2019
Real‐Life RISC
10-2
MIPS Architecture

RISC Instruction Set Architecture (ISA)
  Defines registers + instructions
MIPS cores
  Define device-dependent implementation details
  Pipeline organization, I/O organization, control registers, ...
MIPS32 — 32-bit RISC ISA
  Basis for DLX
MIPS64 — 64-bit RISC ISA
  Binary compatible with MIPS32
Applications
  Typically licensed to OEMs
  Design implemented in embedded systems
  MIPS-based PCs used in China
10-3
MIPS32 ISA — 1

Registers
  32-bit integer registers R0, R1, ..., R31
    Regs[R0] = 0 (read-only)
  32-bit FP registers F0, F1, ..., F31
  Special registers HI, LO
    64-bit result of integer multiply
    Quotient + remainder result of integer divide

Instruction formats (bit positions 31–26, 25–21, 20–16, 15–0; field widths 6, 5, 5, 5, 5, 6):
  R:  opcode | rs | rt | rd | sa | function
  I:  opcode | rs | rt | immediate
  J:  opcode | target
10-4
MIPS32 ISA — 2

Coprocessors
  Logical extensions of the basic MIPS ISA
  Accessed via coprocessor read / write instructions
CP0 — System Control Coprocessor, on the CPU
  Supports the virtual memory system and exception handling
    Translates virtual addresses into physical addresses
    Controls the cache subsystem
    Handles switches between kernel / supervisor / user states
    Manages exceptions / diagnostic control / error recovery
CP1 — interface to the FPU
CP2 — available for device-specific implementations
CP3 — interface to the FPU on MIPS64 and newer MIPS32
10-5
MIPS32 ISA — 3

Some MIPS instructions not in DLX:

  Extract      EXT rt, rs, pos, size   rt ← substr(rs, pos=sa, size=rd)
  Multiply     MADD rs, rt             Multiply and add to HI_LO
               MULT rs, rt             Multiply to HI_LO
               MUL rd, rs, rt          Multiply to GPR
  Cache        PREF                    Prefetch
  Trap         TEQ / TGE / TNE         Trap if equal / greater or equal / not equal
  System       SYSCALL                 System call
  Synchronize  SYNC                    Critical section for shared memory
  Branch       BLTZ / BLEZ             Branch less / less or equal zero
               BGTZ / BGEZ             Branch greater / greater or equal zero
  Shift        SLL / SRA               Shift word left logical / right arithmetic
               ROTR                    Rotate word right
  Test+Set     SLTI rt, rs, imm        Set on less than immediate
  Coprocessor  LWCz rt, imm(reg)       Load word to Coprocessor z (z = 1 or 2)
  Load/Store   SWCz imm(reg), rt       Store word from Coprocessor z (z = 1 or 2)
10-6
MIPS64 ISA

Registers
  64-bit integer registers R0, R1, ..., R31
    Regs[R0] = 0 (read-only)
  FP registers F0, F1, ..., F31
    32-bit on a 32-bit FPU, 64-bit on a 64-bit FPU
  Special registers HI, LO
    128-bit result of integer multiply
    Quotient + remainder result of integer divide

Instruction formats
  32-bit instruction length — binary compatible with MIPS32
  MIPS32/64 instructions act on the lower 32 bits of registers
  MIPS64 doubleword instructions act on the full 64 bits of registers
  Memory address = 64-bit pointer (register) + 16-bit immediate
10-7
ARM Overview

Microprocessor and microcontroller for embedded systems
  Advanced RISC Machine, developed by ARM Limited
  ARM Ltd primarily licenses ISA implementations to developers
Most widely used 32-bit RISC ISA
  Over 50 billion ARM processors used in phones, games, peripherals
  98 percent of mobile phones use ARM
10-8
RISC Architectural Features

Data types
  Byte — 8 bits
  Halfword — 16 bits (in ARMv4 and higher)
  Word — 32 bits
Standard RISC
  Load/store architecture
  Large uniform register file
  Simple addressing modes
  Uniform and fixed-length instruction fields
  Scalar in-order pipeline
Additional ARM architectural features
  Shift + ALU operations
  Auto-increment / auto-decrement addressing modes for loops
  Load and Store Multiple instructions
  Conditional execution of most instructions
    Cancel instructions on certain condition flags
    Replaces control hazards in forward jumps
10-9
ARM Versions

  Architecture Version     Processor Family
  ARMv1                    ARM1
  ARMv2                    ARM2, ARM3
  ARMv3                    ARM6, ARM7
  ARMv4                    ARM7TDMI, StrongARM, ARM8, ARM9TDMI
  ARMv5TE / ARMv5TEJ       ARM9E, ARM10E, XScale
  ARMv6                    ARM11
  ARMv7                    Cortex
  ARMv8                    Cortex

Extension features
  T — Thumb
  D — Debugger
  M — Multiplier (64-bit result)
  I — ICE
  E — DSP enhancement (implies TDMI)
  J — Jazelle (Java)
10-10
Seven Operating Modes

User — normal (non-privileged) program execution mode; no access to protected resources
FIQ — supports high speed data transfer or DMA processes; entered on high priority (fast) interrupt
IRQ — general purpose interrupt handling; entered on low priority (normal) interrupt
Supervisor — protected mode for the operating system; entered on reset and on Software Interrupt
Abort — implements virtual memory and/or memory protection; handles memory access violations
Undef — supports software emulation of hardware coprocessors
System — runs privileged operating system tasks (ARMv4 and above); accesses user mode registers
10-11
Registers

32-bit general purpose registers
  16 architectural registers r0, ..., r15
    r11 — FP = Frame Pointer
    r12 — IP = intra-procedure-call scratch register
    r13 — SP = Stack Pointer, used by push/pop instructions
    r14 — LR = Link Register, used to return from function calls
    r15 — PC = Program Counter
  31 physical registers
    Multiple copies of r8, ..., r14
    Each copy accessible in a specific operating mode
32-bit status registers
  Current Program Status Register (CPSR) visible in all modes
  5 Saved Program Status Registers (SPSR)
    Privileged modes copy the previous CPSR
10-12
Modes and Visible Registers

[Table: registers visible in each mode — User, FIQ, IRQ, Supervisor, Abort, Undef]
All modes see r0–r12, r13 (SP), r14 (LR), r15 (PC), and the CPSR
FIQ, IRQ, Supervisor, Abort, and Undef each see their own banked copies of some of r8–r14 plus their own SPSR
10-13
Instruction Sets

ARM
  32-bit instructions
  Aligned on 32-bit boundaries
    Lowest 2 bits of PC (r15) always 0
Thumb
  16-bit instructions, 1-to-1 mapped to 32-bit instructions
    Shortened versions with restricted options and implicit operands
    Example — add dest, src1, src2 becomes add dest, src
  Aligned on 16-bit boundaries
    Lowest bit of PC (r15) always 0
  Set by T = 1 in CPSR
Jazelle
  Executes Java bytecode directly
    ARM reads 4 8-bit instructions per instruction fetch
  Set by J = 1 in CPSR
10-14
Conditional Execution

Flag set suffix S
  ALU instructions with suffix S set the CPSR flags
    N — Negative, Z — Zero, C — Carry, V — oVerflow
Conditional execution suffix
  Execute an instruction only if a flag combination is true

Example — operate-compare-if-else

Usual execution:
      SUB r3, r3, #1
      CMP r3, #0
      BEQ L1          ; branch on 0
      ADD r0, r1, r2  ; skip if r3 = 0
      B   L2          ; jump to L2
L1:   SUB r0, r1, r2  ; skip if r3 != 0
L2:   ADD r4, r5, r6

Conditional execution:
      SUBS  r3, r3, #1
      ADDNE r0, r1, r2
      SUBEQ r0, r1, r2
      ADD   r4, r5, r6
10-15
Basic Instructions

Transfer
  MOV — move register-to-register or immediate-to-register
  MVN — Move Not
  LDR, STR — load / store
Branch
  B — add sign-extended 24-bit signed immediate to PC (r15)
  BL — branch and store PC+4 in link register (r14)
  Conditional branch — conditional execution of B or BL
ALU
  ADD — Add                  ADC — Add with Carry
  SUB — Subtract             SBC — Subtract with Carry
  RSB — Reverse Subtract     RSC — Reverse Subtract with Carry
  AND — Logical AND          ORR — Logical OR
  EOR — Logical XOR          BIC — Logical Bit Clear
  CMP — Compare              CMN — Compare Negative
  TST — Test                 TEQ — Test Equivalence
  MUL — Multiply             MLA — Multiply and Add
10-16
Shift and Rotate

Legal ALU instruction operands
  32-bit source / destination register contents
  Sign-extended 12-bit immediate
  Shifted operand
    Shifted / rotated 32-bit source register contents
    Number of shifts set by 8-bit immediate
Shifts
  LSL — Logical Shift Left (unsigned)
  LSR — Logical Shift Right
  ASR — Arithmetic Shift Right
Rotates
  ROR — Rotate Right
  RRX — Rotate Right Extended (CF into MSB)
10-17
VFP Extension

No FP operations in the basic instruction set
  Not needed in simple embedded applications
  VFP implements the FP ISA extension as an optional coprocessor
  Since ARM10
Vector Floating Point (VFP)
  Single precision and double precision floating point computation
  ANSI/IEEE Std 754 compliant
Single Instruction Multiple Data (SIMD) FP unit
  One FP operation performed in parallel on a 256-bit vector
    8 single-precision (4-byte) FP numbers
    4 double-precision (8-byte) FP numbers
  Accesses 32 single precision FP registers (32-bit width)
VFPv3
  Operates on 8 double-precision (8-byte) FP numbers (512-bit vector)
10-18
VFP Instructions

Transfer
  Load / store FP values into registers from memory
  Transfer / copy 32-bit values between VFP and ARM GP registers
  Conversions between float, double, unsigned / signed integers
FPU
  Add, subtract, multiply, divide, square root
  Combined multiply-accumulate
  Compare FP values in registers
VFPv3
  Store FP constant in register
10-19
DSP Enhancements

Digital Signal Processing
  Operations on sampled-digitized-encoded analog information
Typical applications
  Audio / video, speech processing, modems, medical instruments
Typical algorithms
  D/A, normalization, correlation, convolution, FFT, encoding / decoding
  Real time control
Practical applications
  GSM-AMR (Adaptive Multi-Rate) speech codec in 3G GSM phones
  Servo motor control (HDD/DVD)
  Audio encode/decode (MP3, AAC, WMA)
  MPEG4 decode
  Voice and handwriting recognition
10-20
DSP Instructions

  Instruction          Operation                 Purpose
  SMLAxy{cond}         16 × 16 + 32 → 32         Signed MAC
  SMLAWy{cond}         32 × 16 + 32 → 32         Signed MAC wide
  SMLALxy{cond}        16 × 16 + 64 → 64         Signed MAC long
  SMULxy{cond}         16 × 16 → 32              Signed multiply
  SMULWy{cond}         16 × 32 → 32              Signed multiply long
  QADD Rd, Rm, Rs      SAT(Rm + Rs)              Saturating add
  QDADD Rd, Rm, Rs     SAT(Rm + SAT(Rs × 2))     Saturating add double
  QSUB Rd, Rm, Rs      SAT(Rm – Rs)              Saturating subtract
  QDSUB Rd, Rm, Rs     SAT(Rm – SAT(Rs × 2))     Saturating subtract double
  CLZ{cond} Rd, Rm     COUNTZ(Rm)                Count leading zeros

16 — halfword, 32 — word, 64 — doubleword
MAC — Multiply-Accumulate (Rd ← R1 × R2 + R3)
Saturating — pin overflow result at max or min; no modulo arithmetic or report of overflow
10-21
Performance Comparisons

[Chart: DSP benchmark performance]
ARM9 — before DSP enhancements
ARM10 — with DSP enhancements
Q15 / Q31 — integer arithmetic techniques used in DSP
10-22
Apple iPhone 5 Hardware

Apple A6 Application Processor
  Dual ARMv7 cores + 3 GPU cores
  Hand-optimized layout
Memory
  Hynix 16 GB Flash
Network processors
  Skyworks GSM / GPRS / EDGE module
  Skyworks CDMA module
  Triquint WCDMA / HSUPA / UMTS
  Qualcomm LTE processor
  Murata WiFi module
Interface controllers
  Apple Power Management IC
  Apple Audio CODEC
  Texas Instruments touch screen controller
  STMicroelectronics 3-axis gyro
  STMicroelectronics 3-axis linear accelerometer

Ref: http://www.chipworks.com/blog/recentteardowns/2012/09/20/2467/
10-23
WinARM

Cross-compiler
  Develop applications for ARM in C / C++ on Windows platforms
  Extensive documentation
Tools
  GNU GCC compiler
  GNU-Utils for compiler/linker
  ARM header files
  Sample applications with source code
Download
  http://www.siwawi.arubi.uni-kl.de/avr_projects/arm_projects
Convert C code to assembly code:
  arm-elf-gcc -S filename.c -o filename.asm
10-24
ARM Compilation 1‐1

C source

main()
{
    int x = 0;
    while (x < 10)
    {
        x++;
    }
}
Assembly source

main:
    mov   ip, sp
    stmfd sp!, {fp, ip, lr, pc}
    sub   fp, ip, #4
    sub   sp, sp, #4
    mov   r3, #0
    str   r3, [fp, #-16]
    b     .L2
.L3:
    ldr   r3, [fp, #-16]
    add   r3, r3, #1
    str   r3, [fp, #-16]
.L2:
    ldr   r3, [fp, #-16]
    cmp   r3, #9
    ble   .L3
    ldmfd sp, {r3, fp, sp, pc}
10-25
ARM Compilation 1‐2

Building the data frame

    mov   ip, sp                  ; ip = sp
    stmfd sp!, {fp, ip, lr, pc}   ; push fp, ip, lr, pc to stack
    sub   fp, ip, #4              ; fp = ip – 4; sp = ip – 16 = fp – 12
    sub   sp, sp, #4              ; sp = fp – 16
    mov   r3, #0
    str   r3, [fp, #-16]          ; x = 0

[Stack diagram after the prologue:
    fp       → saved pc
    fp – 4   → saved lr
    fp – 8   → saved ip
    fp – 12  → saved fp
    fp – 16  → x = 0   ← sp]
10-26
ARM Compilation 1‐3

Executing the loop
b .L2 ; branch to .L2
.L3:
ldr r3, [fp, #-16] ; r3 ← x
add r3, r3, #1 ; r3++
str r3, [fp, #-16] ; x ← r3
.L2:
ldr r3, [fp, #-16] ; r3 ← x
cmp r3, #9 ; compare r3 , 9
ble .L3 ; jump .L3 if r3 ≤ 9
ldmfd sp, {r3, fp, sp, pc} ; restore registers
10-27
ARM Compilation 2‐1

C source

main()
{
    int x, y;
    for (x = 0; x < 10; x++)
    {
        y = x + 4;
    }
}

Assembly source

main:
    mov   ip, sp
    stmfd sp!, {fp, ip, lr, pc}
    sub   fp, ip, #4
    sub   sp, sp, #8              ; 2 integers
    mov   r3, #0
    str   r3, [fp, #-20]
    b     .L2
.L3:
    ldr   r3, [fp, #-20]
    add   r3, r3, #4
    str   r3, [fp, #-16]
    ldr   r3, [fp, #-20]
    add   r3, r3, #1
    str   r3, [fp, #-20]
.L2:
    ldr   r3, [fp, #-20]
    cmp   r3, #9
    ble   .L3
    sub   sp, fp, #12
    ldmfd sp, {fp, sp, pc}

[Stack diagram: saved pc at fp, lr at fp – 4, ip at fp – 8, saved fp at fp – 12, y at fp – 16, x = 0 at fp – 20 ← sp]
10-28
ARM Compilation 2‐2

Executing the loop
b .L2
.L3:
ldr r3, [fp, #-20] ; r3 ← x
add r3, r3, #4 ; r3 ← r3 + 4
str r3, [fp, #-16] ; y ← r3
ldr r3, [fp, #-20] ; r3 ← x
add r3, r3, #1 ; r3++
str r3, [fp, #-20] ; x ← r3
.L2:
ldr r3, [fp, #-20] ; r3 ← x
cmp r3, #9 ; compare r3 , 9
ble .L3 ; jump .L3 if r3 ≤ 9
sub sp, fp, #12 ; sp ← fp – 12
ldmfd sp, {fp, sp, pc} ; restore registers